Conversation

@xemul (Contributor) commented Sep 5, 2025

After being bumped, the threshold remains high and doesn't capture more
potential problems. The patch splits the threshold into two values --
check and warn. Once the allocation size hits the check part, it either
warns and increases the warn part, or doesn't warn and checks whether
the warn threshold can be decreased. The latter happens based on
lowres_clock timestamps: the next decrease is allowed only after 10
seconds from the previous one, and the decrease step is page-size.

fixes: #669
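The scheme described above can be sketched roughly as follows. This is a hedged, self-contained illustration rather than the actual Seastar code: the names (`large_alloc_warner`, `check_threshold`, `warn_threshold`, `on_allocation`), the bump-to-size policy, and the use of `std::chrono::steady_clock` in place of `lowres_clock` are all assumptions made for the sketch.

```cpp
// Illustrative sketch of the two-threshold decay scheme (not Seastar's
// actual identifiers or policy details).
#include <cassert>
#include <chrono>
#include <cstddef>

struct large_alloc_warner {
    static constexpr size_t page_size = 4096;                 // assumed
    static constexpr auto decay_period = std::chrono::seconds(10);

    size_t check_threshold;   // the originally configured value
    size_t warn_threshold;    // bumped on each warning, decays back down
    std::chrono::steady_clock::time_point last_decrease{};

    explicit large_alloc_warner(size_t configured)
        : check_threshold(configured), warn_threshold(configured) {}

    // Returns true if a warning should be emitted for this allocation.
    bool on_allocation(size_t size, std::chrono::steady_clock::time_point now) {
        if (size < check_threshold) {
            return false;                  // below the check part: nothing to do
        }
        if (size >= warn_threshold) {
            warn_threshold = size + 1;     // warn and raise the warn part
            return true;
        }
        // Between check and warn: try to decay the warn part, at most once
        // per period, one page at a time, never below the configured base.
        if (now - last_decrease >= decay_period &&
            warn_threshold > check_threshold + page_size) {
            warn_threshold -= page_size;
            last_decrease = now;
        }
        return false;
    }
};
```

The point of the split is that allocations in the band between the check and warn values do not warn, but each one gives the decay logic a chance to run, so the warn threshold drifts back toward the base over time.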

@travisdowns (Contributor)

This is a good idea; we've discussed the same internally.

@xemul xemul requested a review from Copilot September 11, 2025 02:15
Copilot AI left a comment

Pull Request Overview

This PR implements a decay mechanism for the large allocation warning threshold to improve detection of subsequent large allocations. After the threshold is bumped due to a large allocation warning, it now gradually decreases back to the original configured value over time.

  • Adds timer-based threshold decay that reduces the warning threshold by page size every 10 seconds
  • Introduces base threshold tracking to remember the originally configured value
  • Updates related functions to properly handle the new decay mechanism

pages[nr_pages].free = false;
free_span_unaligned(reserved, nr_pages - reserved);
live_cpus[cpu_id].store(true, std::memory_order_relaxed);
large_allocation_warning_decay.set_callback(default_scheduling_group(), [] { internal::decrease_large_allocation_warning_threshold(); });
Copilot AI commented Sep 11, 2025

The lambda callback captures nothing but calls get_cpu_mem() which may access different CPU-local data if the timer fires on a different CPU than where it was initialized. Consider capturing the current CPU's memory context or ensuring the timer runs on the same CPU.

Suggested change
large_allocation_warning_decay.set_callback(default_scheduling_group(), [] { internal::decrease_large_allocation_warning_threshold(); });
large_allocation_warning_decay.set_callback(default_scheduling_group(), [this] { this->decrease_large_allocation_warning_threshold(); });

@xemul (Contributor, Author)

Timers don't migrate between CPUs in seastar


if (threshold > base + cpu_pages::large_allocation_warning_decay_step) {
threshold -= cpu_pages::large_allocation_warning_decay_step;
get_cpu_mem().large_allocation_warning_decay.arm(cpu_pages::large_allocation_warning_decay_period);
Copilot AI commented Sep 11, 2025

Using relative time with arm() instead of absolute time with rearm() may cause timing drift if the callback execution is delayed. Consider using rearm(lowres_clock::now() + large_allocation_warning_decay_period) for consistent intervals.

Suggested change
get_cpu_mem().large_allocation_warning_decay.arm(cpu_pages::large_allocation_warning_decay_period);
get_cpu_mem().large_allocation_warning_decay.rearm(
lowres_clock::now() + cpu_pages::large_allocation_warning_decay_period
);

@xemul (Contributor, Author)

If timer execution is delayed, "now()" will also be delayed.
Also, seastar timers' relative arm() calls are overloads on top of the absolute ones and do exactly the same anyway.
And finally -- we don't care about potential delays; this decay is not about precision timing.

size_t large_allocation_warning_threshold_base = std::numeric_limits<size_t>::max();
static constexpr size_t large_allocation_warning_decay_step = page_size;
static constexpr lowres_clock::duration large_allocation_warning_decay_period = std::chrono::seconds(10);
timer<lowres_clock> large_allocation_warning_decay;
Member

memory.cc is much lower level than the reactor etc.

I'd like to avoid entangling it with the rest of the reactor.

Member

I'd even prefer to avoid having it talk with the clocks, but I don't see a way.

How about this:

if larger than original threshold
   if larger than current threshold
       report
       consider increasing threshold based on history
   else
       consider decreasing threshold based on history

@xemul (Contributor, Author)

If considering without clocks, then what's the ... average (?) expected (?) typical (?) allocations-per-second rate to rely on?

Member

large allocation rate per second is much smaller than one.

@xemul (Contributor, Author)

large allocation rate per second is much smaller than one.

OK, let's take this as an axiom

@xemul xemul force-pushed the br-memory-decay-large-allocation-warning-threshold branch from 7ede025 to 9edb2f9 on September 29, 2025 12:41
@xemul (Contributor, Author) commented Sep 29, 2025

upd:

  • do threshold decreasing based on lowres_clock timestamps

@xemul (Contributor, Author) commented Oct 13, 2025

@avikivity , please re-review

@avikivity (Member)

Sorry about the delay. Maybe I should use a timer.

Even lowres_clock is too entangled. memory.cc depends on nothing now, let's keep it that way.

Decay can be every N large allocations that aren't greater than the current threshold. If N = 20, and large allocations are rare, that throttles large allocations significantly but still allows them to surface.
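The clock-free decay suggested here can be sketched as follows. This is an assumption-laden illustration, not the merged implementation: the names (`counting_decay`, `quiet_hits`) and the bump-to-size policy are invented for the sketch, and N = 20 is taken from the comment above.

```cpp
// Illustrative sketch: decay the warn threshold by one page after every
// N check-threshold hits that don't exceed the current warn threshold.
#include <cassert>
#include <cstddef>

struct counting_decay {
    static constexpr size_t page_size = 4096;   // assumed
    static constexpr unsigned decay_after = 20; // N from the comment

    size_t check_threshold;   // the originally configured value
    size_t warn_threshold;    // bumped on each warning
    unsigned quiet_hits = 0;  // check-threshold hits since the last warning

    explicit counting_decay(size_t configured)
        : check_threshold(configured), warn_threshold(configured) {}

    // Returns true if a warning should be emitted for this allocation.
    bool on_allocation(size_t size) {
        if (size < check_threshold) {
            return false;
        }
        if (size >= warn_threshold) {
            warn_threshold = size + 1;  // warn and raise the warn part
            quiet_hits = 0;
            return true;
        }
        // A large-but-quiet allocation: count it, and decay once every
        // decay_after hits, never going below the configured base.
        if (++quiet_hits >= decay_after) {
            quiet_hits = 0;
            if (warn_threshold > check_threshold + page_size) {
                warn_threshold -= page_size;
            }
        }
        return false;
    }
};
```

Since large allocations themselves are rare (well under one per second, per the discussion above), counting them gives a slow decay without memory.cc taking any dependency on a clock.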

After being bumped, the threshold remains high and doesn't capture more
potential problems. The patch splits the threshold into two values --
check and warn. Once the allocation size hits the check part, it either
warns and increases the warn part, or doesn't warn and checks whether
the warn threshold can be decreased. The latter happens based on
lowres_clock timestamps: the next decrease is allowed only after 10
seconds from the previous one, and the decrease step is page-size.

fixes: scylladb#669

Signed-off-by: Pavel Emelyanov <[email protected]>
@xemul xemul force-pushed the br-memory-decay-large-allocation-warning-threshold branch from 9edb2f9 to 8db864a on October 14, 2025 13:36
@xemul (Contributor, Author) commented Oct 15, 2025

upd:

  • decrease the warn threshold after the check threshold is hit several times

@avikivity avikivity merged commit bd74b3f into scylladb:master Oct 15, 2025
16 checks passed

Development

Successfully merging this pull request may close these issues.

Decay large allocation warning threshold with time
