Replies: 7 comments 5 replies
-
I think we are missing
-
Thread on Readers Cache
-
Thread on Pandaproxy (REST & Schema Registry): General:
Schema Registry
-
Thread on vanilla coproc (w/o v8). The issue I am most concerned about is what happens when a user pushes many scripts. To cut back on memory usage there is a cache of ntp-to-context info which is shared across scripts; this would be the main culprit for any memory usage oddities within coproc. A second concern is possibly degraded performance when there are many scripts, since each script has its own run loop. I would start to worry that we are hurting performance in other areas of the system by holding up the reactor. Maybe we could use priorities here to solve this (rough sketch below).
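On the priorities idea, here is a minimal sketch of what that could look like with Seastar scheduling groups. It is illustrative only: the group name and share value are made up, it is not coproc code, and exact header locations vary a little between Seastar versions.

```cpp
// Illustrative sketch: run coproc script loops under their own scheduling
// group with modest shares, so they yield CPU to higher-priority work
// instead of holding up the reactor. The "coproc" name and 100 shares are
// placeholder values, not what Redpanda actually uses.
#include <seastar/core/future.hh>
#include <seastar/core/scheduling.hh>

seastar::future<> run_scripts_low_priority() {
    return seastar::create_scheduling_group("coproc", 100).then(
      [](seastar::scheduling_group sg) {
          return seastar::with_scheduling_group(sg, [] {
              // A per-script run loop would live here; all work scheduled
              // inside this lambda is accounted to the "coproc" group.
              return seastar::make_ready_future<>();
          });
      });
}
```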
-
Thread on archival. There are two parts to the problem: transient and long-term memory. Long-term memory used by archival is mostly manifests; we store a manifest per partition on the leader shard. Transient memory allocations mostly come from uploads, for which we create a buffered output stream. Also, the manifests are linearized when they are parsed, so if a manifest grows big it will eventually cause an OOM. To mitigate this I planned to split large manifests into parts which could be updated/parsed individually (sketched below).
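A purely hypothetical sketch of the split-manifest idea (the type and field names below are invented for illustration and are not archival code): keep the manifest as independently serializable parts keyed by base offset, so an update or a parse only ever touches one bounded part instead of linearizing the whole manifest.

```cpp
// Hypothetical layout only: partition the manifest by base offset so each
// part can be parsed, updated and uploaded on its own, bounding the memory
// needed for any single operation.
#include <cstdint>
#include <map>
#include <vector>

struct segment_meta {            // made-up stand-in for per-segment metadata
    int64_t base_offset;
    int64_t committed_offset;
    uint64_t size_bytes;
};

struct manifest_part {           // one independently (de)serialized unit
    int64_t base_offset;
    std::vector<segment_meta> segments;
};

struct partitioned_manifest {
    // Only the part covering the affected offset range needs to be loaded,
    // re-parsed, or re-uploaded when segments are added or removed.
    std::map<int64_t, manifest_part> parts;
};
```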
-
Maybe we can create replacements for vector and unordered_map that never allocate more than a configured number of contiguous bytes. What do you think? (Rough sketch below.)
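A minimal sketch of the vector half of that idea, under stated assumptions: the name chunked_vector and the 4 KB default chunk size are made up for illustration, not an existing Redpanda or Seastar type. Elements live in fixed-size chunks, so no single allocation exceeds the configured contiguous limit.

```cpp
// Illustrative sketch: a vector-like container whose largest contiguous
// allocation is bounded by max_chunk_bytes. Name and default are hypothetical.
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

template <typename T, std::size_t max_chunk_bytes = 4096>
class chunked_vector {
    static constexpr std::size_t elems_per_chunk = max_chunk_bytes / sizeof(T);
    std::vector<std::unique_ptr<T[]>> _chunks; // each chunk <= max_chunk_bytes
    std::size_t _size = 0;

public:
    void push_back(T value) {
        if (_size == _chunks.size() * elems_per_chunk) {
            _chunks.push_back(std::make_unique<T[]>(elems_per_chunk));
        }
        _chunks[_size / elems_per_chunk][_size % elems_per_chunk] = std::move(value);
        ++_size;
    }
    T& operator[](std::size_t i) {
        return _chunks[i / elems_per_chunk][i % elems_per_chunk];
    }
    std::size_t size() const { return _size; }
};
```

The vector of chunk pointers still grows contiguously, but at 8 bytes per chunk it stays orders of magnitude smaller than the payload it indexes; an unordered_map replacement could similarly spread its buckets across chunks.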
-
We are missing an
-
Hello @vectorizedio/core
Without core dumps or reproducers, the OOM events we have been seeing are pretty tough to diagnose. We're working on improving both of those, but I spent many hours today trying to get a customer workload to OOM, without any luck. These OOMs were once easy to diagnose because the culprits were egregious violators, but now we are searching in the tail of optimization.
So, it would be helpful for everyone to think about the sub-systems you are working on and what kinds of memory usage might accumulate. Total size is important, but in core we also care about cases where large contiguous regions are allocated. By large I don't mean MBs, I mean KBs (for example, there may be plenty of free memory, but not enough contiguous space for your std::vector of a few hundred integers).
Segment index
Every segment in a cluster has an associated index structure consisting of 3 vectors of integers ([]int32, []int32, []int64). A 1 GB segment will have roughly 23,000 entries, which works out to about two ~80 KB allocations and one ~150 KB allocation. These allocations fall squarely into the large category.
The ratio of segment data to index size is approximately 3000:1
One customer has a cluster with approximately 13,000 active segments. They also have around 4 TB of data on disk. This works out to roughly 250 MB per segment on average. Even scaling down to a quarter of the 1 GB case, the 80 KB/150 KB allocations become roughly 20 KB/37 KB. These are still large, and there are about 13,000 of them.
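For reference, a back-of-the-envelope sketch of that arithmetic. It is illustrative only: the exact entry count depends on the index sampling interval, so the output gives the right order of magnitude rather than an exact match for the figures above.

```cpp
// Rough arithmetic for the contiguous allocations behind a segment index:
// three parallel vectors (two of int32, one of int64), one entry per sample.
#include <cstddef>
#include <cstdint>
#include <cstdio>

int main() {
    // ~23,000 entries for a 1 GB segment (approximate; depends on sampling).
    constexpr std::size_t entries = 23'000;
    std::printf("int32 vector: ~%zu KB contiguous\n", entries * sizeof(std::int32_t) / 1024);
    std::printf("int32 vector: ~%zu KB contiguous\n", entries * sizeof(std::int32_t) / 1024);
    std::printf("int64 vector: ~%zu KB contiguous\n", entries * sizeof(std::int64_t) / 1024);
    // A 250 MB average segment scales each of these down to about a quarter.
    return 0;
}
```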
Segment reader
This doesn't appear to hold onto any significant resources. It's an open file descriptor and some metadata.
O(connections) overhead
We've had a report of a workload with 1000 nodes * 5 producers/node, so O(5000) producer connections on a 3 node cluster with 96 cores.
Readers cache
Holds log readers open until they become invalid or they have been inactive for 30 seconds. A log reader has a lot of stuff going on, but probably its most significant allocation is the buffer in its active seastar::input_stream. AFAIK this might be 128 KB? @mmaslankaprv are there scenarios where the total number of cached log readers might become large? It doesn't seem like there are any hard bounds in place.
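As a point of reference, a minimal sketch of how the per-stream buffer memory is set when building an input stream over a file with Seastar. The 128 KB and read-ahead values are placeholders rather than the reader's actual settings, and make_reader_stream is a made-up helper.

```cpp
// Illustrative: each open file input stream holds roughly buffer_size bytes
// (times 1 + read_ahead), so N cached readers pin about N * buffer_size of
// memory. The values below are placeholders, not Redpanda's real settings.
#include <seastar/core/file.hh>
#include <seastar/core/fstream.hh>
#include <seastar/core/iostream.hh>

seastar::input_stream<char> make_reader_stream(seastar::file f) {
    seastar::file_input_stream_options opts;
    opts.buffer_size = 128 * 1024;  // contiguous buffer held per stream
    opts.read_ahead  = 1;           // extra buffers kept in flight
    return seastar::make_file_input_stream(std::move(f), opts);
}
```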
Fetch session cache
@mmaslankaprv anything to think about here?
Chunk cache
This is not hooked up to the reclaimer, but no matter how hard I try I can't get it to use much memory.
Raft
@mmaslankaprv what are the scenarios we need to be concerned about here? Where does data get batched up, and where might it get queued without back pressure being applied?
Foreign memory
Minor optimization, but we probably have a lot of cases: ownership of heap data sent across cores should be held in a foreign pointer. I noticed today the following areas where it looks like there are lots of non-foreign-owned cross-core movements (see the sketch after this list):
Schema registry
@BenPope
@jcsp
HTTP proxy
@BenPope
Coproc
@graphcareful
@VadimPlh
Transactions and idempotence
@rystsov
Archival
@Lazin
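For anyone unfamiliar with the pattern, a minimal sketch of wrapping heap data in seastar::foreign_ptr before handing it to another shard (illustrative only; send_to_shard is a made-up helper, and exact header locations vary a little between Seastar versions):

```cpp
// Illustrative: wrap the allocation in a foreign_ptr before moving it across
// shards, so its eventual deallocation is routed back to the shard that
// allocated it instead of being freed on a foreign shard.
#include <seastar/core/future.hh>
#include <seastar/core/sharded.hh>
#include <seastar/core/smp.hh>
#include <memory>
#include <vector>

seastar::future<> send_to_shard(unsigned shard, std::vector<char> payload) {
    auto fp = seastar::make_foreign(
      std::make_unique<std::vector<char>>(std::move(payload)));
    return seastar::smp::submit_to(shard, [fp = std::move(fp)]() mutable {
        // Use *fp on the target shard; when fp is destroyed there, the
        // foreign_ptr routes destruction back to the owning shard.
        return fp->size();
    }).discard_result();
}
```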
Anything else?