Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Timeouts during thread starvation on Linux (but not on Windows) #2746

Open
quassnoi opened this issue Jun 25, 2024 · 0 comments
Open

Timeouts during thread starvation on Linux (but not on Windows) #2746

quassnoi opened this issue Jun 25, 2024 · 0 comments

Comments

@quassnoi
Copy link

We are in the process of migrating our .NET project from Framework to Core (and from Windows to Linux) and, during peak hours and thread starvation, we are getting a lot of Redis timeouts.

It's best illustrated with an MRE.

Here's what the script does:

  1. Creates and connects a Redis multiplexer.
  2. Saturates the thread pool with lots of units of work that do nothing except sleeping.
  3. Runs and awaits a single GET from Redis.

Run it by installing dotnet tool install -g dotnet-script and then running dotnet script mre.csx

#nullable enable

#r "nuget: StackExchange.Redis, 2.7.33"

using System.Runtime.CompilerServices;
using System.Threading;
using StackExchange.Redis;
using StackExchange.Redis.Profiling;

const int sleepTimeMilliseconds = 7000;
const int extraSleepingThreads = 10;
const string redisConnectionString = "127.0.0.1:7200";

ThreadPool.GetMinThreads(out var workerThreads, out var completionPortThreads);
WriteLine($"Min worker threads: {workerThreads}, min completion port threads: {completionPortThreads}");

var sleepingThreadCount = workerThreads + extraSleepingThreads;

var connectionMultiplexer = await ConnectionMultiplexer.ConnectAsync(redisConnectionString, options =>
{
    options.ClientName = "redis-timeout-test";
}, Error);
await connectionMultiplexer.GetDatabase(0).StringGetAsync("test");
var profilingSession = new ProfilingSession();
connectionMultiplexer.RegisterProfiler(() => profilingSession);


void PrintStats(string stage, int id, [CallerMemberName] string? caller = null)
{
    WriteLine($"{caller}\t{id}\t{stage}\t{Thread.CurrentThread.ManagedThreadId}");
}

void Sleep(object? state)
{
    var i = (int)state!;
    try
    {
        PrintStats("start", i);
        Thread.Sleep(sleepTimeMilliseconds);
        PrintStats("end", i);
    }
    catch
    {
        PrintStats("exception", i);
        throw;
    }
}

async Task Redis(int i)
{
    try
    {
        PrintStats("start", i);
        await connectionMultiplexer.GetDatabase().StringGetAsync("test");
        PrintStats("end", i);
    }
    catch
    {
        PrintStats("exception", i);
        throw;
    }
}

for (var i = 0; i < sleepingThreadCount; i++)
{
    ThreadPool.QueueUserWorkItem(Sleep, i);
}

await Redis(0);
foreach (var profiledCommand in profilingSession.FinishProfiling())
{
    WriteLine(profiledCommand);
}

Our expectation is that thread starvation should not affect the inner workings of Redis client, and that the request should succeed.

It works on Windows (both in .NET 4.7.2 and .NET 8), but not on Linux.

On Linux, it fails with a timeout:

Sleep   0       start   14
Sleep   3       start   8
Sleep   4       start   11
Sleep   5       start   7
Sleep   1       start   12
Redis   0       start   5
Sleep   6       start   13
Sleep   2       start   10
Sleep   7       start   9
Sleep   8       start   15
Sleep   9       start   29
Sleep   10      start   30
Sleep   11      start   31
Sleep   12      start   5
Sleep   13      start   32
Sleep   14      start   33
Sleep   15      start   34
Sleep   16      start   35
Sleep   17      start   36
Redis   0       exception       37
StackExchange.Redis.RedisTimeoutException: Timeout awaiting response (outbound=0KiB, inbound=0KiB, 5461ms elapsed, timeout is 5000ms), command=GET, next: GET test, inst: 0, qu: 0, qs: 1, aw: False, bw: Inactive, rs: ReadAsync, ws: Idle, in: 5, in-pipe: 0, out-pipe: 0, last-in: 0, cur-in: 0, sync-ops: 0, async-ops: 3, serverEndpoint: 127.0.0.1:7201, conn-sec: 5,49, aoc: 1, mc: 1/1/0, mgr: 10 of 10 available, clientName: redis-timeout-test, PerfCounterHelperkeyHashSlot: 6918, IOCP: (Busy=0,Free=1000,Min=1,Max=1000), WORKER: (Busy=19,Free=32748,Min=12,Max=32767), POOL: (Threads=19,QueuedItems=5,CompletedItems=101,Timers=1), v: 2.7.33.41805 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)
   at Submission#0.<Redis>d__11.MoveNext() in /mnt/c/Users/******/Projects/redis-test/main.csx:line 54
--- End of stack trace from previous location ---
   at Submission#0.<<Initialize>>d__0.MoveNext() in /mnt/c/Users/******/Projects/redis-test/main.csx:line 69
--- End of stack trace from previous location ---
   at Dotnet.Script.Core.ScriptRunner.Execute[TReturn](String dllPath, IEnumerable`1 commandLineArgs) in C:\Users\runneradmin\AppData\Local\Temp\tmp8BAF\Dotnet.Script.Core\ScriptRunner.cs:line 110

The reason for this is that even though the library creates a dedicated thread pool and does all the work inside this pool, some continuations still run on the main thread pool (which kinda defeats the purpose).

The problem seems to be here:

Breakpoint reached: /home/******/.config/JetBrains/Rider2024.1/resharper-host/SourcesCache/37cf93917ec38cc6a96176b4240fe2fe05f294df6d9955a4c023fd9b30aa58/SocketAwaitableEventArgs.cs:142,21 
   at Pipelines.Sockets.Unofficial.SocketAwaitableEventArgs.OnCompleted(SocketAsyncEventArgs e) in /_/src/Pipelines.Sockets.Unofficial/SocketAwaitableEventArgs.cs:line 145
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Net.Sockets.SocketAsyncEventArgs.TransferCompletionCallbackCore(Int32 bytesTransferred, Memory`1 socketAddress, SocketFlags receivedFlags, SocketError socketError)
   at System.Net.Sockets.SocketAsyncEngine.System.Threading.IThreadPoolWorkItem.Execute()
   at System.Threading.ThreadPoolWorkQueue.Dispatch()
   at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()

The breakpoint above shows that there is a unit of work within Pipelines.Sockets.Unofficial that is being run on the main thread pool. This unit of work is being scheduled from here: https://github.com/mgravell/Pipelines.Sockets.Unofficial/blob/6703efc310131037ee55b3136ff0d2d9e0df1e3f/src/Pipelines.Sockets.Unofficial/SocketConnection.Receive.cs#L221

Apparently, on Linux, Socket.ReceiveAsync(SocketAsyncEventArgs), when running asyncronously, always schedules the event callback on the main thread pool. If the main thread pool is backlogged, the continuation can't go through in time, and Redis client throws a timeout error.

This particular problem can be fixed by having the socket process its continuations on the socket thread. It can be done by setting the environment library DOTNET_SYSTEM_NET_SOCKETS_INLINE_COMPLETIONS=1. However, this setting causes more problems with third-party libraries than it solves. Apparently, there is no way in .NET to set it for a particular socket (or I couldn't find it).

Is there any other way to have Pipelines.Sockets.Unofficial do all its work on the dedicated thread pool (and, ultimately, make Redis not time out if the process is thread starved)?

(cross-post from mgravell/Pipelines.Sockets.Unofficial#81 which seems to be where the root issue is)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant