
[Issue] FusionCache allocates on reads and allocates a lot on writes #185

Closed
neon-sunset opened this issue Dec 3, 2023 · 9 comments

@neon-sunset

neon-sunset commented Dec 3, 2023

Describe the issue
See the numbers below. There are other issues as well, such as cache-line contention on reads caused by interlocked operations applied to a single memory location, which will ruin scaling on many-core systems, but I wanted to start by bringing attention to the implementation flaws in the first place.

In the outlined in-memory cache scenario, FusionCache comes last and allocates significantly more than all the other options. While FastCache is narrowly focused on a single task (an in-memory cache with per-item expiration and eviction), which allows it to be about as fast as technically possible, MemoryCache and LazyCache have many more features and still allocate far less.

Unrelated but somewhat amusing: OpenJDK improves by 10% at most between 17 and 21, while .NET sees a 20-40% speed-up after just a single LTS version bump.

Benchmark code

using System;
using System.Diagnostics.CodeAnalysis;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Jobs;
using FastCache;
using LazyCache;
using Microsoft.Extensions.Caching.Memory;

namespace ZiggyCreatures.Caching.Fusion.Benchmarks
{
	[RankColumn]
	[MemoryDiagnoser]
	[ShortRunJob(RuntimeMoniker.Net60)]
	[ShortRunJob(RuntimeMoniker.Net80)]
	[Orderer(BenchmarkDotNet.Order.SummaryOrderPolicy.FastestToSlowest)]
	public class HappyPathBenchmark
	{
		const string Key = "test key";
		const string Value = "test value";

		readonly FusionCache Cache = new(new FusionCacheOptions());
		readonly MemoryCache MemoryCache = new(new MemoryCacheOptions());
		readonly CachingService LazyCache = new();

		[GlobalSetup]
		public void Setup()
		{
			Cache.Set(Key, Value);
			MemoryCache.Set(Key, Value);
			LazyCache.Add(Key, Value);
			Cached<string>.Save(Key, Value, TimeSpan.FromDays(1));
		}

		public class Reads : HappyPathBenchmark
		{
			[Benchmark(Baseline = true)]
			public string GetFusionCache()
			{
				var result = Cache.TryGet<string>(Key);
				return result.HasValue ? result.Value : Unreachable();
			}

			[Benchmark]
			public string GetMemoryCache()
			{
				return MemoryCache.TryGetValue<string>(Key, out var value)
					? value : Unreachable();
			}

			[Benchmark]
			public string GetLazyCache()
			{
				return LazyCache.TryGetValue<string>(Key, out var value)
					? value : Unreachable();
			}

			[Benchmark]
			public string GetFastCache()
			{
				return Cached<string>.TryGet(Key, out var value)
					? value : Unreachable();
			}
		}

		public class Writes : HappyPathBenchmark
		{
			[Benchmark(Baseline = true)]
			public void SetFusionCache() => Cache.Set(Key, Value);

			[Benchmark]
			public void SetMemoryCache() => MemoryCache.Set(Key, Value);

			[Benchmark]
			public void SetLazyCache() => LazyCache.Add(Key, Value);

			[Benchmark]
			public void SetFastCache() => Cached<string>.Save(Key, Value, TimeSpan.FromDays(1));
		}

		[DoesNotReturn]
		static string Unreachable() => throw new Exception("Unreachable code");
	}
}

Reads:

BenchmarkDotNet v0.13.10, Windows 11 (10.0.22631.2715/23H2/2023Update/SunValley3)
AMD Ryzen 7 5800X, 1 CPU, 16 logical and 8 physical cores
.NET SDK 8.0.100
  [Host]            : .NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX2
  ShortRun-.NET 6.0 : .NET 6.0.25 (6.0.2523.51912), X64 RyuJIT AVX2
  ShortRun-.NET 8.0 : .NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX2

IterationCount=3  LaunchCount=1  WarmupCount=3  

| Method | Job | Runtime | Mean | Error | Ratio | Rank | Gen0 | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|
| GetFastCache | ShortRun-.NET 6.0 | .NET 6.0 | 12.963 ns | 0.3573 ns | 0.10 | 1 | - | - | 0.00 |
| GetMemoryCache | ShortRun-.NET 6.0 | .NET 6.0 | 88.737 ns | 4.0533 ns | 0.71 | 2 | - | - | 0.00 |
| GetLazyCache | ShortRun-.NET 6.0 | .NET 6.0 | 104.442 ns | 2.1849 ns | 0.84 | 3 | - | - | 0.00 |
| GetFusionCache | ShortRun-.NET 6.0 | .NET 6.0 | 124.161 ns | 8.5320 ns | 1.00 | 4 | 0.0038 | 64 B | 1.00 |
| GetFastCache | ShortRun-.NET 8.0 | .NET 8.0 | 9.770 ns | 1.5593 ns | 0.11 | 1 | - | - | 0.00 |
| GetMemoryCache | ShortRun-.NET 8.0 | .NET 8.0 | 58.887 ns | 1.7557 ns | 0.69 | 2 | - | - | 0.00 |
| GetLazyCache | ShortRun-.NET 8.0 | .NET 8.0 | 73.699 ns | 2.8998 ns | 0.86 | 3 | - | - | 0.00 |
| GetFusionCache | ShortRun-.NET 8.0 | .NET 8.0 | 85.310 ns | 44.8033 ns | 1.00 | 4 | 0.0038 | 64 B | 1.00 |

Writes:

| Method | Job | Runtime | Mean | Error | Ratio | Rank | Gen0 | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|
| SetFastCache | ShortRun-.NET 6.0 | .NET 6.0 | 29.79 ns | 1.440 ns | 0.07 | 1 | 0.0024 | 40 B | 0.04 |
| SetMemoryCache | ShortRun-.NET 6.0 | .NET 6.0 | 194.55 ns | 19.804 ns | 0.48 | 2 | 0.0248 | 416 B | 0.40 |
| SetLazyCache | ShortRun-.NET 6.0 | .NET 6.0 | 309.68 ns | 50.771 ns | 0.76 | 3 | 0.0401 | 672 B | 0.64 |
| SetFusionCache | ShortRun-.NET 6.0 | .NET 6.0 | 409.26 ns | 37.500 ns | 1.00 | 4 | 0.0625 | 1048 B | 1.00 |
| SetFastCache | ShortRun-.NET 8.0 | .NET 8.0 | 23.16 ns | 0.644 ns | 0.07 | 1 | 0.0024 | 40 B | 0.04 |
| SetMemoryCache | ShortRun-.NET 8.0 | .NET 8.0 | 186.80 ns | 7.310 ns | 0.53 | 2 | 0.0248 | 416 B | 0.43 |
| SetLazyCache | ShortRun-.NET 8.0 | .NET 8.0 | 258.38 ns | 25.020 ns | 0.73 | 3 | 0.0353 | 592 B | 0.61 |
| SetFusionCache | ShortRun-.NET 8.0 | .NET 8.0 | 352.62 ns | 17.272 ns | 1.00 | 4 | 0.0577 | 968 B | 1.00 |
@jodydonetti
Collaborator

jodydonetti commented Dec 3, 2023

Hi @neon-sunset and thanks for your report, I appreciate the time it took you.

As you pointed out, other libs do in fact do more than FastCache, and FusionCache in particular supports a lot of features and options, so some overhead is kind of expected, at least up to a certain point.

Having said that, it's important to make sure the perf impact is as small as possible: in the last few months I've been quite busy finishing the new Auto-Recovery feature, which took a lot of time, and it may be that I haven't paid the needed attention to some extra allocations or CPU cycles spent here and there.

I'll try to make sure that the resources needed are as low as possible, and will update you as soon as I have news: keep in mind though that to support all the extra features FusionCache provides (2nd level, fail-safe, auto-recovery, etc.), some extra resources are to be accounted for.

Meanwhile thanks again for your report!

@jodydonetti jodydonetti added the enhancement New feature or request label Dec 3, 2023
@jodydonetti jodydonetti self-assigned this Dec 3, 2023
@neon-sunset
Author

I'm mostly worried about avoidable cost caused by some of the internal design decisions that are not necessary to achieve the desired functionality. This library is fairly popular so it is important to ensure that it does not suffer from common cache implementation issues.

In particular, Microsoft libraries like MemoryCache and FASTER do support many features - the former fully integrates with logging/tracing/ETW, supports flexible eviction behavior and callbacks, can be integrated with a second-level distributed cache and much more, without allocating nearly as much - and that despite its own design also being considered flawed/outdated for certain scenarios, with a memory footprint higher than it could be. See dotnet/extensions#4766

One of the key concepts promoted in .NET in the last few years is the "pay for play" policy - users should only pay for what they use, whenever possible.

Thank you for your work.

@jodydonetti
Collaborator

I'm mostly worried about avoidable cost caused by some of the internal design decisions that are not necessary to achieve the desired functionality. This library is fairly popular so it is important to ensure that it does not suffer from common cache implementation issues.

I agree, and I'll try to see if I can improve it.

See dotnet/extensions#4766

eh, I know about this new discussion: I was part of the old discussion some years ago, so I'm writing up my thoughts on this new one and will publish them later today.

One of the key concepts promoted in .NET in the last few years is "pay for play" policy - users should only pay for what they use whenever possible.

Totally agree with the concept: on a pragmatic note though, with the time and resources available to me (it's OSS work done in my spare time), I've tried to find the right balance between adding useful features that cover various scenarios and common production problems, and performance and resource allocation. It's not always easy, as you can imagine, but I'll strive to make it even better in the future.

Thank you for your work.

Thanks to you for sharing your thoughts, it's good to see participation!

@jodydonetti
Collaborator

jodydonetti commented Dec 4, 2023

Oh, one thing I'd like to add that may have been missing here is that currently FusionCache is based on the 2 main caching abstractions in .NET, IMemoryCache and IDistributedCache.
This is to say that, as of today, the perf baseline will be MemoryCache + a delta, meaning its cost cannot go lower than MemoryCache's.

Having said that, while there's huge value in being able to use any implementation of IDistributedCache, there's less value in being able to support any implementation of IMemoryCache (which, basically, means MemoryCache). Because of this I've been thinking for some time about detaching the hard dependency on IMemoryCache to bypass its limits, and coming up with support for more performant and lightweight implementations. This will be doable because from the start I decided to accept the 2 core interfaces as "inputs", but to never expose them as "outputs": so I can add support without breaking anything.
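The "inputs but never outputs" idea can be sketched in a few lines. This is a hypothetical, heavily simplified illustration (the interface, class, and result type names here are invented for the sketch and are not the actual FusionCache API):

```csharp
using System;

// Stand-in for the IMemoryCache role, reduced to the bare minimum.
public interface IL1Cache
{
    bool TryGet(string key, out object? value);
}

// The cache's own result type: this is what the public API exposes.
public readonly record struct CacheResult<T>(bool HasValue, T? Value);

public sealed class HybridCacheSketch
{
    private readonly IL1Cache _l1;

    // The abstraction is accepted as an "input" (constructor dependency)...
    public HybridCacheSketch(IL1Cache l1) => _l1 = l1;

    // ...but never leaked as an "output": the public API returns the cache's
    // own result type, so the backing store can later be swapped for a faster
    // implementation without a breaking change.
    public CacheResult<T> TryGet<T>(string key) =>
        _l1.TryGet(key, out var value) && value is T typed
            ? new CacheResult<T>(true, typed)
            : new CacheResult<T>(false, default);
}
```

Because no public signature mentions `IL1Cache`, replacing it with a custom lightweight store is purely an internal change.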

As said, I've been experimenting with this approach for some time, but currently have nothing to show yet; will update.

ps: is FastCache usable for heterogeneous data or just homogeneous? Meaning, can I use it for different types of data in the same cache (maybe by just using it with <object>)?

Thanks.

@jodydonetti
Collaborator

jodydonetti commented Feb 17, 2024

Hi @neon-sunset, I just finished a big effort to reduce CPU usage and memory allocations, and I pushed all the changes to the release/v1_0_0 branch.

First of all I'd like to thank you again for opening this issue and pointing out some perf shortcomings FusionCache had: lately I've been very busy completing the last big features (the new Auto-Recovery, OpenTelemetry support, etc.), preparing for an On .NET episode and putting the finishing touches on the big v1.0 release, which culminated in preview1 last week.

Having said that, what you pointed out is important, so I finally took some time to go allocation hunting and optimize whatever I could in the common happy paths: I'm happy to report that I got some really nice results, which I'd like to show (again using your own benchmark code).

Set

Before:

| Method | Mean | Error | StdDev | P95 | Ratio | Rank | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|
| SetFastCache | 17.22 ns | 0.283 ns | 0.265 ns | 17.65 ns | 0.13 | 1 | 40 B | 0.12 |
| SetMemoryCache | 54.79 ns | 0.481 ns | 0.450 ns | 55.55 ns | 0.42 | 2 | 104 B | 0.31 |
| SetLazyCache | 95.75 ns | 1.070 ns | 1.001 ns | 97.07 ns | 0.74 | 3 | 280 B | 0.83 |
| SetFusionCache | 128.94 ns | 1.373 ns | 1.284 ns | 130.83 ns | 1.00 | 4 | 336 B | 1.00 |

After:

| Method | Mean | Error | StdDev | P95 | Ratio | Rank | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|
| SetFastCache | 17.55 ns | 0.204 ns | 0.191 ns | 17.84 ns | 0.16 | 1 | 40 B | 0.25 |
| SetMemoryCache | 54.51 ns | 0.397 ns | 0.372 ns | 55.06 ns | 0.50 | 2 | 104 B | 0.65 |
| SetLazyCache | 94.94 ns | 0.892 ns | 0.834 ns | 96.08 ns | 0.87 | 3 | 280 B | 1.75 |
| SetFusionCache | 109.46 ns | 1.102 ns | 1.030 ns | 111.10 ns | 1.00 | 4 | 160 B | 1.00 |

As you can see, I've been able to drop the CPU usage by about 15%, ending up barely above LazyCache while having more features - and we are still talking about nanoseconds.

The allocations instead dropped by 52%, from 336 B to 160 B: LazyCache now allocates 75% more, again while having fewer features.

Get

Before:

| Method | Mean | Error | StdDev | P95 | Ratio | Rank | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|
| GetFastCache | 6.474 ns | 0.1256 ns | 0.1174 ns | 6.683 ns | 0.15 | 1 | - | 0.00 |
| GetMemoryCache | 26.896 ns | 0.3287 ns | 0.2914 ns | 27.140 ns | 0.62 | 2 | - | 0.00 |
| GetLazyCache | 34.369 ns | 0.4198 ns | 0.3927 ns | 34.895 ns | 0.79 | 3 | - | 0.00 |
| GetFusionCache | 43.267 ns | 0.2805 ns | 0.2190 ns | 43.494 ns | 1.00 | 4 | 48 B | 1.00 |

After:

| Method | Mean | Error | StdDev | P95 | Ratio | Rank | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|
| GetFastCache | 6.511 ns | 0.0934 ns | 0.0874 ns | 6.633 ns | 0.17 | 1 | - | NA |
| GetMemoryCache | 26.503 ns | 0.2110 ns | 0.1974 ns | 26.814 ns | 0.68 | 2 | - | NA |
| GetLazyCache | 33.986 ns | 0.2404 ns | 0.2131 ns | 34.317 ns | 0.87 | 3 | - | NA |
| GetFusionCache | 38.936 ns | 0.2911 ns | 0.2581 ns | 39.367 ns | 1.00 | 4 | - | NA |

Here CPU went down by about 10-15%, which is nice.

Allocations instead became awesome: they dropped to a beautiful zero, so no more allocations (damn closures).

Conclusions

This is not the end of the optimization story, and now that all the main features (and then some) are there and v1.0 is around the corner, I plan on working even more on optimizing cpu usage and allocations.

Of course, as you pointed out yourself, while the other caches are memory (L1) caches with feature sets ranging from barebone (and therefore micro-optimizable) to relatively small, FusionCache is a hybrid cache (transparently L1 or L1+L2) with a lot of features available: resiliency, timeouts, an optional 2nd level with a backplane, extensive logging, full observability (OpenTelemetry) and more.

All of this to say that, while the "pay for play" policy is something I absolutely agree with, it's also fair to say that with all these features - even just the various "is null/enabled" checks here and there for each feature, plus a minimum of extra memory for the metadata needed to support them - it will always be a kind of apples-to-oranges comparison, in a way.

I'm interested in knowing what you think, if you'd like.

Meanwhile, again thanks!

@jodydonetti jodydonetti added this to the v1.0.0 milestone Feb 17, 2024
@neon-sunset
Author

neon-sunset commented Feb 23, 2024

Awesome! The reductions in memory usage are welcome and will help reduce churn during load spikes.

As for your question about whether FastCache is usable within FusionCache - at this point I regret the quite aggressive design decision to make it a global static cache; it definitely does not fit well with FusionCache, which is instanced and isolated in a more, you could say, reasonable way. Maybe in some future version 2.0 this will change, but I just don't have time for it at the moment, unfortunately. The idea of a very thin layer on top of NonBlocking.ConcurrentDictionary worked really well in production, with eviction not being "front-loaded" like in other cache libraries - another thread deals with it - but it needs work. There's another library, JitBit.FastCache, which copied the approach from FastCache.Cached, but it does not do any of the necessary eviction heavy lifting (you have to do some form of parallelization, and you can't be running the thing every 10s or so), does not use NonBlocking, and boxes the cached value in another class - all three of which become big limitations when scaling to >1-10M items. Just letting you know so you don't repeat its mistakes.
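For readers unfamiliar with the design being described, here is a minimal sketch of a thin cache layer where reads check expiration lazily and a background sweep handles removal, so eviction is not front-loaded onto the hot path. This uses the BCL's ConcurrentDictionary instead of NonBlocking, and all names are hypothetical; the real FastCache.Cached internals are more involved:

```csharp
using System;
using System.Collections.Concurrent;

// Each entry carries its own deadline. Reads only compare against the clock;
// actual removal of expired entries is deferred to SweepExpired, which is
// meant to run periodically on a background thread.
public sealed class TinyCache<TKey, TValue> where TKey : notnull
{
    private readonly ConcurrentDictionary<TKey, (TValue Value, long Deadline)> _map = new();

    public void Set(TKey key, TValue value, TimeSpan ttl) =>
        _map[key] = (value, Environment.TickCount64 + (long)ttl.TotalMilliseconds);

    public bool TryGet(TKey key, out TValue value)
    {
        // Lazy expiration: an expired entry is simply treated as a miss.
        if (_map.TryGetValue(key, out var entry) && Environment.TickCount64 < entry.Deadline)
        {
            value = entry.Value;
            return true;
        }
        value = default!;
        return false;
    }

    // Off-the-hot-path cleanup: removes entries whose deadline has passed.
    public void SweepExpired()
    {
        var now = Environment.TickCount64;
        foreach (var pair in _map)
            if (now >= pair.Value.Deadline)
                _map.TryRemove(pair.Key, out _);
    }
}
```

The point of the split is that Set and TryGet stay allocation-free and O(1), while the cost of scanning for expired items is paid by whoever calls SweepExpired, typically a timer or dedicated thread.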

Therefore, if you think it's worth the effort, FusionCache could either get its own FastCache-like memory cache implementation, or wait for a better first-party alternative to MemoryCache to be provided. The third, extra-difficult option - which is what I wanted to do back then for FastCache 2.0 - was writing a Segcache-like implementation modeled after Twitter's take on Segcache (https://github.com/twitter/pelikan/tree/master/src/storage/seg), but version 1.x turned out to be sufficient in terms of both memory and eviction throughput scaling for the tasks I was solving back then, so such a 6-12-month undertaking lost its attractiveness 😅

As for your note regarding the inherent cost of internal and external abstractions, there are a couple of points I wanted to share which may be of interest to you:

  • As of .NET 8, the actual cost will be hidden in microbenchmarks because Dynamic PGO will devirtualize calls to the most frequently used type behind a virtual/interface call or a delegate. This makes estimating performance more difficult, because should the user's code not have such a monomorphic distribution of used instances, the cost of the abstractions will resurface. You may also want to use [SimpleJob(RuntimeMoniker.NativeAot80)] on occasion and see how much its performance differs from the JIT, which can partially show you how the code would perform should Dynamic PGO be less successful.

  • Internal abstractions can be made zero-cost by using struct generic parameters constrained to an interface type instead of accepting the interface type itself - .NET will monomorphize such calls to the exact implementation; e.g. CoreLib uses this technique heavily in performance-sensitive code. However, this is one of the "last resort" optimizations, useful once the other low-hanging fruit has been picked. So if you feel like continuing the performance work, you may want to start with a profiler instead - I'm mostly saying this because I'm too guilty of not doing so and favouring blind microbenchmarks instead 😀
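The struct-generic technique from the second point can be shown with a small self-contained sketch (the type and method names here are invented for illustration; the general pattern is what matters):

```csharp
using System;

public interface ICounterPolicy
{
    int Step { get; }
}

// A struct implementing the interface. When passed as a generic argument,
// the JIT emits a specialized copy of Bump<TPolicy> per struct type, so the
// interface call becomes a direct (and inlinable) call with no boxing.
public readonly struct ByTwo : ICounterPolicy
{
    public int Step => 2;
}

public static class Counter
{
    // Constrained struct generic: monomorphized at JIT time.
    public static int Bump<TPolicy>(int value, TPolicy policy)
        where TPolicy : struct, ICounterPolicy
        => value + policy.Step;

    // Interface-typed parameter: virtual dispatch, and passing a struct
    // through it boxes the struct on every call.
    public static int BumpBoxed(int value, ICounterPolicy policy)
        => value + policy.Step;
}
```

Both `Counter.Bump(40, new ByTwo())` and `Counter.BumpBoxed(40, new ByTwo())` return 42, but only the second pays for boxing plus interface dispatch; the first compiles down to specialized code per policy struct.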

Watching the episode right now :)

@jodydonetti
Collaborator

Thanks for all the info, really damn interesting.

Will definitely try the AOT benchmark: I was already curious and now even more 😬

About the second point: do you maybe have some material about that? A blog post, a docs page, examples, anything really.

Thanks again, and if you like, let me know what you think about the episode.

@jodydonetti
Collaborator

Hi all, I just released preview2 🥳

Please give it a try, thanks!

@jodydonetti
Collaborator

Hi all, I've finally released v1.0 🥳

Feels a little unreal tbh, but yep: v1.0 is out.
