Fix writer starvation in MemoryJournal #7887

Aaronontheweb · 2025-10-08T02:11:59Z

Summary

Fixes writer starvation race condition in MemoryJournal by replacing ReaderWriterLockSlim with a simple object lock (Monitor).

Problem

Even after PR #7869 added locks to fix race conditions, tests were still failing with timeouts. The root cause was writer starvation: when query publishers continuously acquired read locks, persistent actors trying to acquire write locks could be delayed indefinitely, causing test timeouts.

Solution

Replaced ReaderWriterLockSlim with a simple lock statement using Monitor synchronization:

Fair FIFO lock ordering prevents writer starvation
Simpler code with unified lock API instead of separate read/write lock methods
Sufficient performance for in-memory test journal where correctness > micro-optimization

Changes

Changed Lock property type from ReaderWriterLockSlim to object in both MemoryJournal and SharedMemoryJournal
Replaced all EnterReadLock/ExitReadLock and EnterWriteLock/ExitWriteLock calls with lock (Lock) { } statements
Simplified Add() and Delete() methods to use single lock acquisition
Updated API verification files for Akka.Persistence (Core, DotNet, Net platforms)

Test Results

✅ All 46 InMemory persistence query tests pass
✅ All 16 API verification tests pass
✅ Fixes timeouts in InMemoryAllEventsSpec and InMemoryEventsByTagSpec

…lim with Monitor Replaced ReaderWriterLockSlim with simple object lock (Monitor) to prevent writer starvation in MemoryJournal. The previous implementation caused flaky test failures when query publishers continuously acquired read locks, preventing persistent actors from acquiring write locks. Changes: - Changed Lock property type from ReaderWriterLockSlim to object in MemoryJournal and SharedMemoryJournal - Replaced all EnterReadLock/ExitReadLock and EnterWriteLock/ExitWriteLock calls with lock statements - Simplified Add() and Delete() methods to use single lock acquisition - Updated API verification files for Akka.Persistence Benefits: - Fair FIFO lock ordering prevents writer starvation - Simpler code with unified lock API - Sufficient performance for in-memory test journal Fixes race condition timeouts in InMemoryAllEventsSpec and InMemoryEventsByTagSpec tests.

…Core.verified.txt

Added comprehensive telemetry to investigate intermittent test failures in InMemoryEventsByTagSpec.ReadJournal_live_query_EventsByTag_should_find_events_from_offset_exclusive. Changes: - Added Stopwatch-based timing instrumentation to track lock acquisition wait times in WriteMessagesAsync and ReplayTaggedMessagesAsync - Added DEBUG logging for all journal message types (ReplayTaggedMessages, ReplayAllEvents, SelectCurrentPersistenceIds) - Logs thread IDs, lock wait times, lock hold times, and event counts - Enabled DEBUG log level in InMemoryEventsByTagSpec test configuration Diagnostic output includes: - "[DIAG] WriteMessagesAsync called on thread X, attempting to acquire lock" - "[DIAG] Lock acquired after Xms on thread Y" - "[DIAG] Lock released after holding for Xms" - "[DIAG] Wrote event for {persistenceId}, seq {seqNr}, total events: {count}" - "[DIAG] ReplayTaggedMessages lock acquired after Xms" - "[DIAG] Found X events matching tag 'Y'" This instrumentation will capture timing data on CI/CD build servers to determine if the test failures are caused by: 1. Lock contention/fairness issues with Monitor 2. Thread pool starvation 3. Async execution context delays 4. RecoveryPermitter bottlenecks All diagnostic logging uses [DIAG] prefix for easy filtering. These changes are temporary for diagnosis and will be reverted once root cause is identified.

Added comprehensive logging to TestActor to diagnose where the actor gets stuck when the test times out: - PreStart: Confirms actor creation - OnRecover: Traces recovery message flow - OnCommand: Confirms message receipt - Before/after Persist call: Identifies if Persist blocks - Persist callback: Confirms persist completion This will identify whether the issue is: 1. Actor not starting 2. Actor stuck in recovery 3. Message not reaching OnCommand 4. Persist call blocking/deadlocking 5. Persist callback never executing All logs use [DIAG-ACTOR] prefix to distinguish from journal logs.

…nuation CRITICAL BUG FIX: Replaced Context.GetLogger() calls with Debug.WriteLine in WriteMessagesAsync and ReplayTaggedMessagesAsync. Root Cause: These methods are called from async continuations on thread pool threads (not actor threads) due to ConfigureAwait(false) in AsyncWriteJournal.ExecuteBatch. Calling Context.GetLogger() from non-actor threads is unsafe and can cause deadlock: 1. Context.GetLogger() accesses actor state/mailbox 2. If actor is waiting for WriteMessagesSuccess but logger tries to interact with actor system... 3. Circular wait/deadlock! This explains BOTH the ReaderWriterLockSlim and Monitor failures: - With ReaderWriterLockSlim: Logging from async thread while holding write lock tried to access actor context, causing contention with read-locked queries - With Monitor: Same issue, different lock mechanism The lock type wasn't the problem - accessing actor Context from thread pool threads was the root cause all along. Using Debug.WriteLine instead ensures diagnostic logging doesn't interfere with actor threading model or cause deadlocks.

Removed ConfigureAwait(false) to ensure async continuations run on the actor's dispatcher thread rather than arbitrary thread pool threads. Rationale: ConfigureAwait(false) causes the continuation after await to run on any available thread pool thread. In the actor model, this can cause issues: 1. Loss of synchronization context - no guarantee about memory visibility of changes made in WriteMessagesAsync 2. Timing/fairness issues under high contention when multiple actors compete for the same lock on different thread pool threads with different priorities 3. Potential for Context access violations if any code in the chain tries to access actor state By removing ConfigureAwait(false), the continuation will attempt to resume on the actor's original dispatcher thread, maintaining proper actor threading semantics and improving determinism under load. This may have been contributing to the intermittent test failures in InMemoryEventsByTagSpec even before diagnostic logging was added.

…Debug

Replaced System.Diagnostics.Debug.WriteLine calls with proper Akka.NET logging using a cached ILoggingAdapter instance. The cached logger is safe to use from thread pool threads after initial lazy initialization on the actor thread. This is a cleaner approach than Debug.WriteLine and ensures diagnostic logs appear in the normal Akka.NET logging infrastructure where they can be controlled via log level configuration.

…aronontheweb/akka.net into claude-wt-InMemoryPersistence3

…tion All async methods in MemoryJournal were doing synchronous work while blocking the calling thread. This could cause thread pool starvation in test environments where many actors are starting simultaneously. Changes: - WriteMessagesAsync: Wrapped in Task.Run - ReadHighestSequenceNrAsync: Wrapped in Task.Run - ReplayMessagesAsync: Wrapped in Task.Run - DeleteMessagesToAsync: Wrapped in Task.Run - ReplayTaggedMessagesAsync: Wrapped in Task.Run - ReplayAllEventsAsync: Wrapped in Task.Run All diagnostic logging preserved for troubleshooting.

Aaronontheweb added akka-persistence akka-persistence-query labels Oct 8, 2025

Aaronontheweb added 4 commits October 8, 2025 09:37

Merge branch 'dev' into claude-wt-InMemoryPersistence3

717aad0

Merge branch 'dev' into claude-wt-InMemoryPersistence3

3420cf8

Delete src/core/Akka.API.Tests/verify/CoreAPISpec.ApprovePersistence.…

efb2113

…Core.verified.txt

Aaronontheweb force-pushed the claude-wt-InMemoryPersistence3 branch from f603d7b to 83b2600 Compare October 9, 2025 18:16

Aaronontheweb added 6 commits October 9, 2025 13:41

Fix ambiguous Debug reference by fully qualifying System.Diagnostics.…

0a749ef

…Debug

Merge branch 'dev' into claude-wt-InMemoryPersistence3

1f7a2d2

Merge branch 'dev' into claude-wt-InMemoryPersistence3

2443a40

Aaronontheweb mentioned this pull request Oct 9, 2025

[DRAFT] Test: Remove ConfigureAwait(false) only - no locking changes #7896

Closed

Aaronontheweb added 7 commits October 9, 2025 16:53

Merge branch 'dev' into claude-wt-InMemoryPersistence3

983d9e7

Merge branch 'dev' into claude-wt-InMemoryPersistence3

007e079

Merge branch 'claude-wt-InMemoryPersistence3' of https://github.com/A…

e0eac39

…aronontheweb/akka.net into claude-wt-InMemoryPersistence3

remove public API changes

653d1b4

Merge branch 'dev' into claude-wt-InMemoryPersistence3

0ccdfe7

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix writer starvation in MemoryJournal #7887

Fix writer starvation in MemoryJournal #7887

Uh oh!

Aaronontheweb commented Oct 8, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Fix writer starvation in MemoryJournal #7887

Are you sure you want to change the base?

Fix writer starvation in MemoryJournal #7887

Uh oh!

Conversation

Aaronontheweb commented Oct 8, 2025

Summary

Problem

Solution

Changes

Test Results

Related

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant