
RAM leakage in blobfuse2 > 2.3.0 #1639

Open
Vegoo89 opened this issue Feb 19, 2025 · 15 comments

@Vegoo89

Vegoo89 commented Feb 19, 2025

Hello,
Related to previously closed issue: #1617

We are using blobfuse2 on AKS via the Blob CSI driver, latest version.

A few days ago we upgraded the node pool to the latest image, which automatically installed blobfuse2 version 2.4.0.

When we ran our performance tests, nodes started to transition to the NotReady state after a very short period of time.

After a debug session, we realized blobfuse2 is not freeing RAM at all. Memory usage just keeps growing until the host becomes unresponsive due to lack of memory.

I tested the following blobfuse2 versions:

  • 2.3.0 -> no issue
  • 2.3.2 -> issue persists
  • 2.4.0 -> issue persists
  • 2.4.1 -> issue persists

We are using Blob CSI via a PV (RBAC UAMI auth) with the mount options below:

mountOptions:
  - '-o allow_other'
  - '--file-cache-timeout-in-seconds=0'
  - '-o attr_timeout=0'
  - '-o entry_timeout=0'
  - '-o negative_timeout=0'
  - '--attr-timeout=0'
  - '--entry-timeout=0'
  - '--cancel-list-on-mount-seconds=10'
  - '--block-cache'
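
For context, these options are set on a static PV handled by blob.csi.azure.com; a trimmed sketch of the manifest is below (names, IDs, and the exact volumeAttributes keys are placeholders, and the real PV carries the full option list above):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-blobfuse-example              # placeholder name
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  mountOptions:
    - '-o allow_other'                   # ...plus the rest of the options listed above
    - '--block-cache'
  csi:
    driver: blob.csi.azure.com
    volumeHandle: example-volume-handle  # placeholder, must be unique per PV
    volumeAttributes:
      resourceGroup: example-rg          # placeholder
      storageAccount: examplestorage     # placeholder
      containerName: example-container   # placeholder
      # UAMI-based auth attributes go here; exact keys depend on the
      # blob CSI driver version, so they are omitted from this sketch.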

We are using block cache because of a file cache limitation (it doesn't clean up folders and the inode limit is reached).

We tried -o direct_io, but performance was very poor and not acceptable under our SLA.

Any suggestions are welcome, thanks!

@vibhansa-msft
Member

How many files do you have in the container that will be listed as part of this mount?

vibhansa-msft added this to the v2-2.5.0 milestone Feb 19, 2025
@Vegoo89
Author

Vegoo89 commented Feb 20, 2025

Hi,
Each node has 3 blobfuse2 processes running that are mapped to different containers.

Per test (around 4000 tasks), we produce ~10 small files per task at a rate of 800 tasks per minute.

One container is used for writing and reading small files, mostly JSONs (reads are performed only after the write is done, so there is no integrity issue).
The second and third hold machine learning models (transformer models) and adapters.
The base model is loaded once and kept in memory by 3 microservices, while the adapters, which are very small, are loaded and switched constantly.

@vibhansa-msft
Member

As you are using block-cache and have not set any upper limit on cache usage, each running blobfuse instance will try to use 80% of the available memory by default. Due to multiple instances, your memory is running low. You need to set a max memory usage limit for block-cache using the --block-cache-pool-size=<mem size in MB> CLI parameter.
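
For example, applied to the mountOptions from the original report, that would look roughly like this (the 2000 MB value is only illustrative, not a recommendation):

mountOptions:
  - '-o allow_other'
  - '--block-cache'
  - '--block-cache-pool-size=2000'   # cap block-cache memory at ~2 GB per blobfuse2 instance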

@Vegoo89
Author

Vegoo89 commented Feb 21, 2025

I thought about it, but in version 2.3.0 the limit was set to 4 GB and the process was constantly using 500-600 MB of memory under load; it never reached the 4 GB limit.
Now, for the same load, memory keeps growing up to 20 GB per process, so I can limit it to some value (no idea what the correct one would be), but I doubt that is the root of the problem.

@vibhansa-msft
Member

With autoconfig we reserve 80% of the memory space in a single instance, assuming all resources are at our disposal. If you are running multiple instances, you will need to manually restrict the memory usage. This might solve the problem. Try it out and let us know if it helps.

@Vegoo89
Author

Vegoo89 commented Feb 24, 2025

We are testing --block-cache-pool-size=2000 and should have results today or tomorrow.

However, it is a bit concerning that this limit is not governed by Kubernetes.

How do I make sure that my node doesn't die due to OOM when the blobfuse2 process is executed outside of the container, directly on the host, so it doesn't respect the spec.resources.limits.memory set by the blob-csi daemonset?
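
To illustrate what I mean, here is a hypothetical, trimmed excerpt of a blob-csi node DaemonSet container spec (image and values are placeholders). Because blobfuse2 is spawned as a process on the host by the driver, a cgroup limit like this does not bound it:

containers:
  - name: blob                        # CSI node plugin container (placeholder name)
    image: <blob-csi-driver-image>    # placeholder
    resources:
      limits:
        memory: 2Gi                   # limits only the containerized driver process,
                                      # not blobfuse2 mounts running on the host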

@vibhansa-msft
Member

By default it asks the OS for the total memory available on the system and then takes 80% of it. So if multiple instances are started in parallel, they may all end up with the same value and thus overrun the total available memory; for example, on a 64 GB node, three instances would each budget roughly 51 GB, far more than the node has. Restricting it manually might be the only option here, I feel.

@Vegoo89
Author

Vegoo89 commented Feb 24, 2025

OK, I get it. The question is why it keeps allocating so much memory and doesn't release any in the newest versions.

In version 2.3.0, memory usage (without setting --block-cache-pool-size) peaked at 600 MB.

With 2.3.2 and above it just keeps growing.

Should I expect significant performance issues if I set --block-cache-pool-size=600 on all PVs?

Was there different memory management for block cache back in 2.3.0 that was removed or reworked later on?

@vibhansa-msft
Member

In either version it should not continuously rise; at a certain point it should stabilize. Some of the memory-related issues we have found recently are due to a Go version upgrade, and we are actively working on those.

@syeleti-msft
Member

> In version 2.3.0, memory usage (without setting --block-cache-pool-size) peaked at 600 MB.
>
> With 2.3.2 and above it just keeps growing.

  1. Yes, you are right. After 2.3.0 we made some changes to our memory management around buffer reuse: in 2.3.0 we did not clear a buffer before reusing it (which could cause data integrity issues), so in versions >2.3.0 we clear the buffer by copying a zero buffer into the existing buffer. This is why you see all the memory being used on the system.

  2. Currently, in the latest release, when you set the memory pool the whole pool is allocated and is only deallocated when blobfuse terminates. This is a significant bottleneck and we are working on improving the memory management in block cache.

Please refrain from using 2.3.0; there are many known data integrity issues. I suggest using the latest release.

@Vegoo89
Author

Vegoo89 commented Feb 25, 2025

> In either version it should not continuously rise; at a certain point it should stabilize. Some of the memory-related issues we have found recently are due to a Go version upgrade, and we are actively working on those.

In our use case it doesn't stabilize on >2.3.0. It grows to ~20 GB per process and causes the node to become unresponsive, so AKS moves it to the NotReady state and removes it from the pool.

> 1. Yes, you are right. After 2.3.0 we made some changes to our memory management around buffer reuse: in 2.3.0 we did not clear a buffer before reusing it (which could cause data integrity issues), so in versions >2.3.0 we clear the buffer by copying a zero buffer into the existing buffer. This is why you see all the memory being used on the system.
> 2. Currently, in the latest release, when you set the memory pool the whole pool is allocated and is only deallocated when blobfuse terminates. This is a significant bottleneck and we are working on improving the memory management in block cache.
>
> Please refrain from using 2.3.0; there are many known data integrity issues. I suggest using the latest release.

Thanks for the explanation, appreciate it. The last thing I would love to understand is how setting --block-cache-pool-size=600 would affect performance.

We are working mostly with small files. Would setting this value higher be required if we operated on blobs bigger than 600 MB? I am trying to understand what would work best for our use case and what impact this setting has on the performance of the node (considering we have a few blobfuse2 processes running there).

@syeleti-msft
Member

> We are working mostly with small files. Would setting this value higher be required if we operated on blobs bigger than 600 MB?

No, operations on bigger blobs don't need a larger memory pool.

Do you see any perf difference when using 600 MB as the pool size for your use case in the latest release?

syeleti-msft self-assigned this Feb 25, 2025
@Vegoo89
Author

Vegoo89 commented Feb 25, 2025

We did a few rounds of testing today. Each performance test takes around 15 minutes and produces around 120k files.

  • 2.3.0 -> no --block-cache-pool-size -> average CPU utilization 58% per node
  • 2.4.1 -> --block-cache-pool-size=2000 -> average CPU utilization 50% per node
  • 2.4.1 -> --block-cache-pool-size=600 -> average CPU utilization 54% per node

We will also perform more tests on bigger nodes in the upcoming days, but the results look good.

A note from my observations on version 2.4.1: after the tests finish, each blobfuse2 process stays at its limit and doesn't release memory. It also allocates around ~23 MB over the limit.

However, the logic behind the default limit of 80% of host memory seems a bit off. I am not a Go expert, but I checked the code and I don't see any guard that prevents OOM on the system in a situation like the one described above.

In a containerized environment this would most likely happen on all nodes at the same time, since traffic is distributed evenly in most common scenarios, and it blows up the entire worker node pool. Cluster recovery in our case takes up to 1 hour.

@syeleti-msft
Member

> However, the logic behind the default limit of 80% of host memory seems a bit off. I am not a Go expert, but I checked the code and I don't see any guard that prevents OOM on the system in a situation like the one described above.

The constraints were placed in the code assuming that only one blobfuse instance would run per VM, but after looking into your scenario, I think we should reconsider some things. Thanks for pointing this out.

@Vegoo89
Author

Vegoo89 commented Feb 26, 2025

Thanks! That would be great since, to the best of my knowledge, it is considered best practice to split data between multiple containers instead of putting everything into a single one, which means running more than one blobfuse process on the node.
