Implement parallel ARC eviction #16486

allanjude · 2024-08-29T17:29:19Z

Sponsored-by: Expensify, Inc.
Sponsored-by: Klara, Inc.

Motivation and Context

Read and write performance can become limited by the arc_evict process being single threaded.
Additional data cannot be added to the ARC until sufficient existing data is evicted.

On many-core systems with TBs of RAM, a single thread becomes a significant bottleneck.

With the change we see a 25% increase in read and write throughput.

Description

Use a new taskq to run multiple arc_evict() threads at once, each given a fraction of the desired memory to reclaim.

How Has This Been Tested?

Benchmarking with a full ARC to measure the performance difference.

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Performance enhancement (non-breaking change which improves efficiency)
Code cleanup (non-breaking change which makes code smaller or more readable)
Breaking change (fix or feature that would cause existing functionality to change)
Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
Documentation (a change to man pages or other documentation)

Checklist:

My code follows the OpenZFS code style requirements.
I have updated the documentation accordingly.
I have read the contributing document.
I have added tests to cover my changes.
I have run the ZFS Test Suite with this change applied.
All commit messages are properly formatted and contain Signed-off-by.

amotin · 2024-09-05T16:08:35Z

module/zfs/arc.c

+		uint64_t nchunks = ((left - 1) >> MIN_EVICT_PERTASK_SHIFT) + 1;
+		unsigned n = nchunks < num_sublists ? nchunks : num_sublists;
+		uint64_t fullrows = nchunks / n;
+		unsigned lastrowcols = nchunks % n;
+		unsigned k = (lastrowcols ? lastrowcols : n);
+
+		uint64_t bytes_pertask_low =
+		    fullrows << MIN_EVICT_PERTASK_SHIFT;
+		uint64_t bytes_pertask = bytes_pertask_low + (lastrowcols ?
+		    (1 << MIN_EVICT_PERTASK_SHIFT) : 0);


I think you are over-engineering here. I don't think eviction per taskq should really be a multiple of 1 << MIN_EVICT_PERTASK_SHIFT to complicate the logic, merely it should be bigger than one. So you could just use MIN_EVICT_PERTASK_SHIFT to decide number of tasks, and then split the eviction amount equally between them.

And I wonder if it would make sense to scale the number of tasks with the eviction size not linearly but in some logarithmic fashion, so as not to spin up too many threads at once and stress the system more for diminishing returns.

I am quite new to taskqs and so I trust your judgement. Could I ask you to elaborate a bit more on how you'd like this logic to look like? Thank you :)

This has nothing to do with taskqs. It simply makes no sense to create multiple parallel jobs/tasks and wake up multiple threads/CPUs for operations below certain size due to overheads. SPA_MAXBLOCKSHIFT is an absolute minimum there, since it may be impossible to free less than that any way if large blocks are used, but something bigger could be chosen based on some practical tests. So just, as I have told, remove all this unneeded complexity of trying to be multiple to MIN_EVICT_PERTASK_SHIFT, select number of tasks based on total size (divide total size by the minimum size, rounding up, and limit the result by number of threads), and then give each task its fraction of the total size (just divide total size by number of tasks, rounding up).
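The split described in the comment above can be sketched as follows. This is a minimal illustration, not the code from the PR: `MIN_EVICT_SIZE`, `evict_plan_t`, `div_round_up()` and `plan_eviction()` are hypothetical names, and the 16 MiB minimum is only an assumed value for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical minimum amount of work that justifies a separate task. */
#define	MIN_EVICT_SIZE	(1ULL << 24)	/* assumed: 16 MiB, for illustration */

typedef struct evict_plan {
	uint64_t ntasks;	/* number of parallel eviction tasks */
	uint64_t bytes_pertask;	/* bytes each task is asked to evict */
} evict_plan_t;

/* Divide a by b, rounding up; b must be non-zero. */
static uint64_t
div_round_up(uint64_t a, uint64_t b)
{
	return (a / b + (a % b != 0));
}

/*
 * Pick the task count from the total size (rounded up, capped by the
 * number of threads), then give each task an equal share (rounded up).
 */
static evict_plan_t
plan_eviction(uint64_t total, uint64_t max_threads)
{
	evict_plan_t plan;

	assert(total > 0 && max_threads > 0);

	plan.ntasks = div_round_up(total, MIN_EVICT_SIZE);
	if (plan.ntasks > max_threads)
		plan.ntasks = max_threads;
	plan.bytes_pertask = div_round_up(total, plan.ntasks);
	return (plan);
}
```

With this shape, a request smaller than the minimum always maps to a single task, and the per-task amount no longer has to be a multiple of the minimum size.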

module/zfs/arc.c

adamdmoss · 2024-09-12T02:26:07Z

I've been casually testing this out (combined with the parallel_dbuf_evict PR) over the last couple of weeks (most recently, 5b070d1).

I've not been hammering it hard or specifically, just letting it do its thing with my messing-around desktop system.

Hit a probable regression today, though: while mv'ing a meager 8GB of files from one pool to another, all my zfs IO got really high-latency, and an iotop showed that the copy part of the move (this being a mv across pools, so in reality it's a copy-and-remove) was running at a painful few 100KB/sec, and the zfs arc_evict thread was taking a whole core... but just one core.

In time it all cleared up and of course I can't conclusively blame this PR's changes, but I left with two fuzzy observations:

In many years of mucking around with ZFS I've never(?) seemed to get the 'arc_evict is pegging CPU badly' issue until I started testing this PR's changes (though I'm aware the issue occurs in the wild for folks on master/release ZFSes)
arc_evict was only using one core as far as I can tell, so I guess the parallelism which is the point of this PR just wasn't kicking-in for some reason anyway and/or the spinning was happening outside of the parallelized part

0mp · 2024-09-12T11:56:09Z

I have updated the patch with a different logic for picking the default maximum number of ARC eviction threads. The new logic aims to pick the number that is one-eighth of the available CPUs, with a minimum of 2 and a maximum of 16.
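A minimal sketch of that rule, assuming integer division for the "one-eighth" step (the comment does not say how the PR rounds). The function name `evict_threads_one_eighth()` is hypothetical.

```c
#include <assert.h>

/*
 * One-eighth of the CPUs, clamped to the range [2, 16].
 * Assumption: the division truncates.
 */
static unsigned
evict_threads_one_eighth(unsigned ncpus)
{
	unsigned n;

	assert(ncpus > 0);

	n = ncpus / 8;
	if (n < 2)
		n = 2;
	if (n > 16)
		n = 16;
	return (n);
}
```

Under these assumptions every system with up to 23 CPUs gets 2 threads, and the cap of 16 is reached at 128 CPUs.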

amotin · 2024-09-12T13:47:34Z

one-eighth of the available CPUs, with a minimum of 2 and a maximum of 16.

Why would we need two evict threads on a single-core system? In that case I would probably prefer to disable taskqs completely. If that is a way to make it more logarithmic, then I would think about highbit(), though then it will grow pretty slow for very large systems, so that the limit of 16 will never be reached. But I am not exactly sure the faster growth would make sense, since it may cause more lock contentions in memory allocator, etc.

allanjude · 2024-09-16T15:32:27Z

one-eighth of the available CPUs, with a minimum of 2 and a maximum of 16.

Why would we need two evict threads on a single-core system? In that case I would probably prefer to disable taskqs completely. If that is a way to make it more logarithmic, then I would think about highbit(), though then it will grow pretty slow for very large systems, so that the limit of 16 will never be reached. But I am not exactly sure the faster growth would make sense, since it may cause more lock contentions in memory allocator, etc.

Right now, this is only enabled by a separate tunable, to enable multiple threads. So for the single CPU case, we don't expect it to be enabled. But for something like 4-12 core systems, we would want it to use at least 2 threads, and then grow from there, reaching 16 threads at 128 cores.

amotin · 2024-09-16T15:51:42Z

Right now, this is only enabled by a separate tunable, to enable multiple threads. So for the single CPU case, we don't expect it to be enabled.

Now that you mention it, I've noticed it's been disabled by default. I don't like the idea of tuning it manually in production depending on system size. I would prefer to have reasonable automatic defaults.

0mp · 2024-10-23T20:41:47Z

Hey! So, here's what changed in the patch:

Formula

There is now a different formula for automatically scaling the number of evict threads when the parameter is set to 0. The formula is:

MIN(MAX(max_ncpus > 6 ? 2 : 1, ilog2(max_ncpus) + (max_ncpus >> 6)), 16);

It looks like this (the x axis is the CPU count and the y axis is the evict thread count):

Here's also a table:

CPUs	`zfs_arc_evict_threads`	Evict threads count	Using taskq?
1	0	1 (autoscaled)	No
2	0	1 (autoscaled)	No
5	0	1 (autoscaled)	No
6	0	2 (autoscaled)	Yes
1024	0	16 (autoscaled)	Yes
(not using autoscaling, CPU count is irrelevant)	1	1	No
(not using autoscaling, CPU count is irrelevant)	32	32	Yes

Fewer parameters

zfs_arc_evict_threads is now the only parameter exposed to control the evict thread count. The zfs_arc_evict_threads_parallel has been removed in favor of enabling the use of taskqs when there are two or more evict threads.

This approach has been suggested by @tonyhutter in another PR (#16487 (comment)).

Stability improvements

It is no longer possible to modify the actual evict threads count during runtime. Since the evict taskqs are only created during arc_init(), the module saves the actual number of evict threads it is going to use and does not care about changes to zfs_arc_evict_threads from then on. This behavior has been documented in the manual page.
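The behavior described in this comment could look roughly like the sketch below. All names are hypothetical stand-ins (`tunable_evict_threads` for zfs_arc_evict_threads, `evict_threads_live` for the saved value), and the autoscaled count is passed in rather than computed, so the sketch does not depend on any particular formula.

```c
#include <assert.h>

/* Hypothetical stand-in for the zfs_arc_evict_threads parameter. */
static unsigned tunable_evict_threads = 0;	/* 0 means autoscale */

/* Value actually in use; written once at init and not changed afterwards. */
static unsigned evict_threads_live = 0;

/* Called once at init time; autoscaled is the computed default. */
static void
evict_init(unsigned autoscaled)
{
	assert(autoscaled > 0);

	evict_threads_live = tunable_evict_threads != 0 ?
	    tunable_evict_threads : autoscaled;
}

/* The taskq is used only when there are two or more evict threads. */
static int
evict_uses_taskq(void)
{
	return (evict_threads_live > 1);
}
```

Because later code reads only `evict_threads_live`, changing the tunable after initialization has no effect on the running module in this sketch.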

amotin

Thanks for automating it. A few comments on that part, and please take a look at my earlier comments.

man/man4/zfs.4

module/zfs/arc.c

amotin · 2024-10-30T18:02:32Z

module/zfs/arc.c

+		zfs_arc_evict_threads_live = MIN(MAX(max_ncpus > 6 ? 2 : 1,
+		    arc_ilog2(max_ncpus) + (max_ncpus >> 6)), 16);


Do we really need the MAX(max_ncpus > 6 ? 2 : 1 part? ilog2(6) should already be 2.

Good catch! Thanks!

Currently, we get the following thread counts:

CPU count	Resulting thread count
1	1
2	1
3	1
4	2
...	...

So you are right that the MAX(max_ncpus > 6 ? 2 : 1 part is not working as expected currently. If we want to stick to 1 thread for 4 and 5 CPUs then we'd need to use the following formula:

// Version 2a
MIN(max_ncpus < 6 ? 1 : arc_ilog2(max_ncpus) + (max_ncpus >> 6), 16);

If we decide to simplify that further and go for 2 threads on systems with 4 or 5 CPUs, then we can just use:

// Version 2b
MAX(1, MIN(arc_ilog2(max_ncpus) + (max_ncpus >> 6), 16));

Version 2b looks good to me and is certainly easier to reason about.
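To compare the three variants discussed in this thread side by side, here is a small sketch. It assumes that arc_ilog2() is a floor base-2 logarithm, which matches the thread counts listed above; `ilog2_floor()`, `imin()`, `imax()` and the `threads_*()` functions are hypothetical names used only for this illustration.

```c
#include <assert.h>

static int imin(int a, int b) { return (a < b ? a : b); }
static int imax(int a, int b) { return (a > b ? a : b); }

/* Assumed behavior of arc_ilog2(): floor(log2(n)) for n > 0. */
static int
ilog2_floor(unsigned n)
{
	int r = 0;

	assert(n > 0);
	while (n >>= 1)
		r++;
	return (r);
}

/* Original formula from the patch under review. */
static int
threads_v1(unsigned n)
{
	return (imin(imax(n > 6 ? 2 : 1,
	    ilog2_floor(n) + (int)(n >> 6)), 16));
}

/* Version 2a: force 1 thread below 6 CPUs. */
static int
threads_v2a(unsigned n)
{
	return (imin(n < 6 ? 1 : ilog2_floor(n) + (int)(n >> 6), 16));
}

/* Version 2b: plain formula with a floor of 1. */
static int
threads_v2b(unsigned n)
{
	return (imax(1, imin(ilog2_floor(n) + (int)(n >> 6), 16)));
}
```

Under that assumption the original formula and version 2b agree for 4 and 5 CPUs (2 threads), while version 2a keeps those systems at 1 thread.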

2b looks cleaner, but 2 threads out of 4 sounds a bit like overkill to me. Maybe the curve needs more thought. BTW, speaking of cleaner (more readable) code: this is not a performance-critical part and we are not in the 1990s, so there is no point in using bit shifts for division. You can just use / 64 and compilers will do the right thing. And then you would not need the parentheses around it.

Thank you for your feedback! I've cleaned up the formula. Now, we use 1 thread for fewer than 6 CPUs and MIN((highbit64(max_ncpus) - 1) + max_ncpus / 64, 16) for larger systems.
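The revised rule can be sketched in userspace as below. `highbit64_sketch()` is a hypothetical stand-in that assumes highbit64() returns the 1-based index of the highest set bit (0 for an input of 0); `evict_threads_auto()` is likewise a name made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed semantics of highbit64(): 1-based index of the top set bit. */
static int
highbit64_sketch(uint64_t v)
{
	int h = 0;

	while (v != 0) {
		h++;
		v >>= 1;
	}
	return (h);
}

/* 1 thread below 6 CPUs, otherwise floor(log2(n)) + n / 64, capped at 16. */
static uint64_t
evict_threads_auto(uint64_t ncpus)
{
	uint64_t n;

	assert(ncpus > 0);

	if (ncpus < 6)
		return (1);
	n = (uint64_t)(highbit64_sketch(ncpus) - 1) + ncpus / 64;
	return (n < 16 ? n : 16);
}
```

With these assumptions, 6 CPUs give 2 threads, 64 CPUs give 7, 256 CPUs give 12, and the cap of 16 is first reached at 512 CPUs.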

Thanks. Just it seems like you are still creating a taskq with one thread, but never using it. ;)

Good catch. Thank you for pointing that out.

I've fixed that. Now all evict-taskq-related bits are wrapped with a check if the use of the taskq is on.

Thanks. That's fine, but I personally would check for arc_evict_taskq being NULL rather than !live > 1.

module/zfs/arc.c

amotin · 2024-11-06T15:48:39Z

module/zfs/arc.c

@@ -4071,25 +4117,108 @@ arc_evict_state(arc_state_t *state, arc_buf_contents_t type, uint64_t spa,
 		multilist_sublist_unlock(mls);
 	}

+	evict_arg_t *evarg = kmem_alloc(sizeof (*evarg) * num_sublists,
+	    KM_SLEEP);


A sleeping memory allocation in the eviction path is asking for trouble.

I've changed that to NOSLEEP. Now, if we cannot allocate the memory, we just fall back to the regular single evict.

Thanks. That is one way to solve it. Or we could pre-allocate it similar to markers.
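The fallback described in this thread follows a common pattern: attempt a non-blocking allocation and take the single-threaded path if it fails. The sketch below shows that pattern in userspace, with an allocator callback standing in for a non-sleeping kernel allocation. Every name here (`evict_arg_sketch_t`, `evict_mode_t`, `choose_evict_mode()`, `alloc_always_fails()`) is hypothetical and not taken from the PR.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical per-task argument; the real structure is not shown here. */
typedef struct evict_arg_sketch {
	unsigned long long bytes;
} evict_arg_sketch_t;

typedef enum { EVICT_PARALLEL, EVICT_SINGLE } evict_mode_t;

/*
 * Try to allocate one argument slot per sublist without blocking.
 * On failure, report that the caller should evict single-threaded.
 */
static evict_mode_t
choose_evict_mode(size_t num_sublists, void *(*alloc_nosleep)(size_t),
    evict_arg_sketch_t **out)
{
	assert(num_sublists > 0);

	*out = alloc_nosleep(sizeof (**out) * num_sublists);
	return (*out != NULL ? EVICT_PARALLEL : EVICT_SINGLE);
}

/* Test helper that simulates memory pressure. */
static void *
alloc_always_fails(size_t size)
{
	(void) size;
	return (NULL);
}
```

Pre-allocating the array once, as suggested in the last reply, would remove the failure path from eviction entirely at the cost of holding the memory permanently.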

amotin · 2024-11-13T14:08:26Z

module/zfs/arc.c

@@ -7809,6 +7963,8 @@ arc_init(void)
 void
 arc_fini(void)
 {
+	boolean_t useevicttaskq = zfs_arc_evict_threads_live > 1;
+


Extra new line.

amotin · 2024-11-19T23:22:48Z

I am not sure it is right, but it seems GCC does not like it:

  module/zfs/arc.c: In function 'arc_evict_state':
  module/zfs/arc.c:4095:15: error: 'evarg' may be used uninitialized in this function [-Werror=maybe-uninitialized]
    evict_arg_t *evarg;
                 ^~~~~

Read and write performance can become limited by the arc_evict process being single threaded. Additional data cannot be added to the ARC until sufficient existing data is evicted.

On many-core systems with TBs of RAM, a single thread becomes a significant bottleneck.

With the change we see a 25% increase in read and write throughput.

Sponsored-by: Expensify, Inc.
Sponsored-by: Klara, Inc.
Co-authored-by: Allan Jude <[email protected]>
Co-authored-by: Mateusz Piotrowski <[email protected]>
Signed-off-by: Alexander Stetsenko <[email protected]>
Signed-off-by: Allan Jude <[email protected]>
Signed-off-by: Mateusz Piotrowski <[email protected]>

allanjude force-pushed the parallel_arc_evict branch from 4cd510d to f45bf2e Compare August 29, 2024 18:19

rincebrain mentioned this pull request Aug 30, 2024

dmu_objset: replace dnode_hash impl with cityhash4 #16483

Closed

13 tasks

allanjude force-pushed the parallel_arc_evict branch from f45bf2e to 146fe45 Compare September 4, 2024 23:16

amotin reviewed Sep 5, 2024

0mp force-pushed the parallel_arc_evict branch from 146fe45 to e128026 Compare September 12, 2024 10:03

behlendorf added the Status: Code Review Needed Ready for review and testing label Sep 13, 2024

0mp force-pushed the parallel_arc_evict branch 2 times, most recently from b6a65a2 to e99733e Compare October 23, 2024 19:49

amotin reviewed Oct 30, 2024

amotin reviewed Nov 6, 2024

amotin added the Status: Revision Needed Changes are required for the PR to be accepted label Nov 6, 2024

github-actions bot removed the Status: Revision Needed Changes are required for the PR to be accepted label Nov 12, 2024

amotin reviewed Nov 13, 2024

amotin added the Status: Revision Needed Changes are required for the PR to be accepted label Nov 19, 2024

Alexander Stetsenko and others added 9 commits November 21, 2024 11:09

zfs.4: Incorporate Alexander's feedback

9479f7b

- Improve the description of the scaling algorithm in the manual page. Signed-off-by: Mateusz Piotrowski <[email protected]>

arc.c: Use highbit64() instead of arc_ilog2()

7de05de

arc: Use ZMOD_RD for zfs_arc_evict_threads

f083ca2

This parameter cannot be changed during runtime anyway in any meaningful way. Make it explicitly read-only. The manual page does not need to be updated. It already mentions that the thread count cannot be changed during runtime.

arc_evict_state: Remove an unnecessary assert

3736c9f

arc: Use taskq_init_ent() instead of memset

a897669

arc: Simplify arguments to arc_evict_taskq's taskq_create()

c397dbf

arc: Improve the evict thread count auto-scaling formula

b6c7222

- Use a simple division instead of a bit shift for better readability. - Make sure that systems with less than 6 CPUs auto-scale to 1 eviction thread.

arc: Fall back to the regular single evict if kmem_alloc for task queue entries fails

b43c67b

arc: Do not allocate evict taskq if operating on a single evict thread

3218719

alex-stetsenko force-pushed the parallel_arc_evict branch from d899eaf to 3218719 Compare November 21, 2024 09:09

github-actions bot removed the Status: Revision Needed Changes are required for the PR to be accepted label Nov 21, 2024
