Add logging buffering #5635

Open · evgenyfedorov2 wants to merge 50 commits into main from evgenyfedorov2/log_buffering

Conversation

@evgenyfedorov2 (Contributor) commented Nov 13, 2024:

Related to the #5123 proposal, this PR focuses on logging buffering only. See also the sampling part, #5574.

Microsoft Reviewers: Open in CodeFlow

@evgenyfedorov2 force-pushed the evgenyfedorov2/log_buffering branch 2 times, most recently from e40b7b6 to 9d61d12 on December 12, 2024 16:30
@evgenyfedorov2 force-pushed the evgenyfedorov2/log_buffering branch from 9d61d12 to 2f1a335 on December 12, 2024 17:36
Address PR comments
Add .NET 8 support
Add ExceptionJsonConverter
@evgenyfedorov2 force-pushed the evgenyfedorov2/log_buffering branch from 883334d to 022f00c on December 16, 2024 14:23
@evgenyfedorov2 force-pushed the evgenyfedorov2/log_buffering branch from 022f00c to a371d9c on December 16, 2024 14:51

@dotnet-comment-bot (Collaborator) commented:

‼️ Found issues ‼️

| Project | Coverage Type | Expected | Actual |
| --- | --- | --- | --- |
| Microsoft.AspNetCore.Diagnostics.Middleware | Line | 100 | 96.03 🔻 |
| Microsoft.AspNetCore.Diagnostics.Middleware | Branch | 100 | 98.51 🔻 |
| Microsoft.Extensions.Telemetry | Line | 93 | 86.2 🔻 |
| Microsoft.Extensions.Telemetry | Branch | 93 | 84.59 🔻 |
| Microsoft.Extensions.Diagnostics.Testing | Line | 99 | 85.78 🔻 |
| Microsoft.Extensions.Diagnostics.Testing | Branch | 99 | 79.57 🔻 |

🎉 Good job! The coverage increased 🎉
Update MinCodeCoverage in the project files.

| Project | Expected | Actual |
| --- | --- | --- |
| Microsoft.Extensions.Diagnostics.Probes | 70 | 76 |
| Microsoft.Extensions.Caching.Hybrid | 75 | 86 |
| Microsoft.Extensions.AI.OpenAI | 72 | 77 |

Full code coverage report: https://dev.azure.com/dnceng-public/public/_build/results?buildId=896358&view=codecoverage-tab

A review thread on this snippet from the buffering code:

    {
        _lastFlushTimestamp = _timeProvider.GetUtcNow();

        SerializedLogRecord[] bufferedRecords = _buffer.ToArray();

Member:

Does this hold a global lock while it's serializing? I wonder if this could be a source of latency spikes if there are a few thousand records?

@evgenyfedorov2 (Contributor, Author) replied Mar 20, 2025:

It uses ConcurrentQueue<T>, which is lock-free
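
For context, here is a minimal sketch of the pattern being discussed; it is a simplified stand-in, not the PR's actual GlobalBuffer. Producers enqueue into a ConcurrentQueue<T> and the flush path snapshots it with ToArray(), so no user-visible lock is held while records are collected, though ToArray() does copy the queue contents, which is O(n) for a few thousand records.

    using System.Collections.Concurrent;

    // Simplified stand-in for the buffer under discussion (not the PR's code).
    internal sealed class LockFreeBufferSketch<T>
    {
        private readonly ConcurrentQueue<T> _buffer = new();

        // Producers are never blocked; ConcurrentQueue<T>.Enqueue is lock-free.
        public void TryEnqueue(T record) => _buffer.Enqueue(record);

        // ToArray() takes a point-in-time snapshot copy; concurrent enqueues can
        // proceed, but the copy itself is O(n) in the number of buffered records.
        public T[] Snapshot() => _buffer.ToArray();
    }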

@evgenyfedorov2 (Contributor, Author) replied Mar 20, 2025:

OK, I have now fixed the race condition also mentioned here, but unfortunately had to add a lock, unless there is a way to avoid it. If Flush() is called relatively rarely, the lock almost never blocks, but it is still there and costs a few nanoseconds on each call.

@evgenyfedorov2 force-pushed the evgenyfedorov2/log_buffering branch from 16185d5 to 6c85897 on March 20, 2025 06:27
@evgenyfedorov2 force-pushed the evgenyfedorov2/log_buffering branch from c3e2047 to e863401 on March 20, 2025 08:19
@evgenyfedorov2 force-pushed the evgenyfedorov2/log_buffering branch from e863401 to c824379 on March 20, 2025 08:47

@KalleOlaviNiemitalo commented:

I don't believe lock (_bufferSwapLock) in GlobalBuffer.Flush() entirely fixes the race condition, if TryEnqueue<TState> is running in parallel. Consider the sequence of events:

  1. In TryEnqueue, _activeBuffer.Enqueue(serializedLogRecord) reads _activeBuffer.
  2. Flush does
            ConcurrentQueue<SerializedLogRecord> bufferToFlush;
            lock (_bufferSwapLock)
            {
                bufferToFlush = _activeBuffer;
    
                // Swap to the empty standby buffer
                _activeBuffer = _standbyBuffer;
                _activeBufferSize = 0;
    
                // Prepare the new standby buffer for future Flush() calls
                _standbyBuffer = new ConcurrentQueue<SerializedLogRecord>();
            }
    
            SerializedLogRecord[] bufferedRecords = bufferToFlush.ToArray();
    but bufferToFlush does not yet contain the record that TryEnqueue is about to add there.
  3. In TryEnqueue, _activeBuffer.Enqueue(serializedLogRecord) finishes.
  4. In Flush, bufferToFlush is discarded and the log record is lost.

In addition, there is a race condition on _activeBufferSize, but it is not a problem because it can only add some inaccuracy to the size counting that is already inaccurate for other reasons. (The object size measurement API proposed in dotnet/runtime#24200 might help make it more accurate.)
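
(One possible way to close the window described above is sketched below, for illustration only; it is not necessarily what the PR ends up doing, and the type and member names are hypothetical. Producers take a shared lock around the read-of-_activeBuffer-plus-Enqueue pair, and Flush takes the exclusive lock just for the swap, so every enqueue lands either in the buffer the current Flush drains or in the new active buffer. The standby-buffer reuse from the quoted snippet is omitted for brevity.)

    using System.Collections.Concurrent;
    using System.Threading;

    // Hypothetical sketch of a double-buffer whose swap is guarded by a
    // reader/writer lock; not the PR's actual GlobalBuffer implementation.
    internal sealed class DoubleBufferSketch<T>
    {
        private readonly ReaderWriterLockSlim _swapLock = new();
        private ConcurrentQueue<T> _activeBuffer = new();

        public void TryEnqueue(T record)
        {
            // Shared lock: many producers can enqueue concurrently, but none can
            // sit between "read _activeBuffer" and "Enqueue" while a swap happens.
            _swapLock.EnterReadLock();
            try
            {
                _activeBuffer.Enqueue(record);
            }
            finally
            {
                _swapLock.ExitReadLock();
            }
        }

        public T[] Flush()
        {
            ConcurrentQueue<T> bufferToFlush;

            // Exclusive lock only for the pointer swap, kept as short as possible.
            _swapLock.EnterWriteLock();
            try
            {
                bufferToFlush = _activeBuffer;
                _activeBuffer = new ConcurrentQueue<T>();
            }
            finally
            {
                _swapLock.ExitWriteLock();
            }

            // Safe to drain outside the lock: every enqueue completed under the
            // read lock either before the swap (and is included here) or after it
            // (and went into the new active buffer).
            return bufferToFlush.ToArray();
        }
    }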

@evgenyfedorov2 (Contributor, Author) replied:

(quoting @KalleOlaviNiemitalo's comment above)

So I have pushed a fix, but it is with a lock. I also added a test to confirm: the test was indeed failing with the quoted code, but it passes with the current version (with locks).
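
For reference, here is a hedged sketch of the kind of concurrency test the comment describes, written against the hypothetical DoubleBufferSketch<T> above rather than the PR's actual types: several producers enqueue while a background task flushes continuously, and the final assertion checks that no record was lost.

    using System.Collections.Concurrent;
    using System.Linq;
    using System.Threading;
    using System.Threading.Tasks;
    using Xunit;

    public class BufferConcurrencyTests
    {
        [Fact]
        public async Task Flush_UnderConcurrentEnqueues_DoesNotLoseRecords()
        {
            var buffer = new DoubleBufferSketch<int>();
            const int Producers = 8;
            const int RecordsPerProducer = 10_000;
            var flushed = new ConcurrentBag<int>();

            using var cts = new CancellationTokenSource();

            // Flush continuously while producers are running.
            var flusher = Task.Run(() =>
            {
                while (!cts.IsCancellationRequested)
                {
                    foreach (var record in buffer.Flush())
                    {
                        flushed.Add(record);
                    }
                }
            });

            // Hammer the buffer from several producer threads.
            var producerTasks = Enumerable.Range(0, Producers).Select(p => Task.Run(() =>
            {
                for (var i = 0; i < RecordsPerProducer; i++)
                {
                    buffer.TryEnqueue(p * RecordsPerProducer + i);
                }
            }));

            await Task.WhenAll(producerTasks);
            cts.Cancel();
            await flusher;

            // Pick up anything still buffered after the flusher stopped.
            foreach (var record in buffer.Flush())
            {
                flushed.Add(record);
            }

            // With the racy swap, records can be lost and this count comes up
            // short; with the lock-guarded swap it holds.
            Assert.Equal(Producers * RecordsPerProducer, flushed.Count);
        }
    }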

@evgenyfedorov2 force-pushed the evgenyfedorov2/log_buffering branch from 01501a4 to e16928a on March 26, 2025 15:51
@evgenyfedorov2 force-pushed the evgenyfedorov2/log_buffering branch from e16928a to f6a2dec on March 26, 2025 16:29