Scaling / Configuration recommendations for compactor #3369
-
Hi, I have some suggestions:
What does your longer-term graph (a few days) of […]
-
Just so others can find this when running into the same issue: for us, the problem was that the default limits were too low, and the compactor blocks per tenant for our biggest tenant just kept growing and growing. Our biggest tenant had ~350 million active series. Essentially, setting […]
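The specific setting is cut off above. Purely as an illustration of the mechanism (the tenant ID and shard counts below are placeholders, not the values from this reply), per-tenant compactor limits in Mimir can be raised for a single large tenant through the runtime overrides file:

```yaml
# runtime.yaml: hypothetical per-tenant overrides for one very large tenant.
# The tenant ID and numbers are placeholders, not values from this thread.
overrides:
  big-tenant:
    # More split groups and merge shards break the tenant's blocks into more,
    # smaller compaction jobs instead of a few huge ones.
    compactor_split_groups: 8
    compactor_split_and_merge_shards: 16
```

With the split-and-merge compactor, a tenant with hundreds of millions of active series generally needs higher shard counts than the defaults so that individual compaction jobs stay small enough to finish.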
-
I need some advice on proper configuration / scaling for the compactor in our environment.
Current setup
Non-default compactor config (note: I was guessing at most of these; a config-file sketch follows the list):
- compaction_concurrency: 4
- compactor_split_groups: 2
- compactor_split_and_merge_shards: 4
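For reference, here is a minimal sketch of where these options live in the config file, assuming a recent Mimir release with the split-and-merge compactor: compaction_concurrency sits under the compactor block, while the split settings are per-tenant limits (check the reference docs for your version):

```yaml
# Hypothetical mimir.yaml fragment; values copied from the list above,
# structure assumed from recent Mimir releases.
compactor:
  # How many compaction jobs one compactor instance runs concurrently.
  compaction_concurrency: 4

limits:
  # Split/merge sharding is a per-tenant limit, so it can also be raised for
  # individual tenants via the runtime overrides file.
  compactor_split_groups: 2
  compactor_split_and_merge_shards: 4
```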
This config seemed to be working okay at the beginning, but now compactor-0 is showing 8.75 hours since the last successful compaction run, and compactor-1 is showing 3.5 hours. That seems like a lot.
The Compactor Resources dashboard shows both instances using lots of CPU and RAM (but not bottlenecking on either), and 1–200 MB/s of disk read/write activity. On gp2 that's pretty close to the max throughput, but the graphs don't suggest it's actually bottlenecked there.
Clearly, there's something I could do here to improve the configuration and/or scaling of my compactors. I would be grateful for a suggested instance size and count based on the environment listed above. I have EKS resources to throw at the problem, but I don't want to do that if there's a config option that will magically fix everything.
Thanks!