MVP: Cost attribution #10269

Open
wants to merge 71 commits into
base: main
Changes from 64 commits
e315ebb
Poc: cost attribution proposal 2
ying-jeanne Oct 24, 2024
f04c28f
refectory
ying-jeanne Dec 17, 2024
2c422d1
add experimental features in about-versioning.md
ying-jeanne Dec 19, 2024
d2eab6b
change const variable to private
ying-jeanne Dec 19, 2024
1f39282
make timer service
ying-jeanne Dec 19, 2024
9b4337d
rename TrackerForUser to Tracker
ying-jeanne Dec 19, 2024
1a523e1
use fine locking
ying-jeanne Dec 19, 2024
f10f787
add comments explain why we use unchecked collector
ying-jeanne Dec 19, 2024
cc0e939
rename deleteUserTracker to deleteTracker
ying-jeanne Dec 19, 2024
c020be0
rename cat in cost attribution package to t or tracker
ying-jeanne Dec 19, 2024
71e4666
avoid get tracker twice
ying-jeanne Dec 19, 2024
9dd101b
refactor inactiveObservationsForUser
ying-jeanne Dec 19, 2024
7d4ea9a
refactor shouldDelete function
ying-jeanne Dec 19, 2024
6754666
rename calabels and calabelmap to labels and index
ying-jeanne Dec 19, 2024
fffc5b3
remove getter and setter of max cardinality and cooldown duration
ying-jeanne Dec 19, 2024
2cf8c3e
rename CompareLabels to hasSameLabels
ying-jeanne Dec 19, 2024
f994034
remove the mapping logic since the slices are ordered
ying-jeanne Dec 19, 2024
b060c09
remove unnecessary tracker nil checking
ying-jeanne Dec 19, 2024
e35a8d9
fix linting
ying-jeanne Dec 19, 2024
5cc0b5d
refactor updateOverflow method
ying-jeanne Dec 19, 2024
389dff0
remove stream in comments
ying-jeanne Dec 19, 2024
116a69e
make observation struct private
ying-jeanne Dec 19, 2024
9c30445
remove unnecessary pointers
ying-jeanne Dec 19, 2024
88ef49e
rename discardSampleMtx to discardedSampleMtx
ying-jeanne Dec 19, 2024
130636a
rename variable observedMtx because I write with feet
ying-jeanne Dec 19, 2024
b701ba7
update test name dum dum
ying-jeanne Dec 19, 2024
dccd9c8
remove test result
ying-jeanne Dec 19, 2024
eebd028
address doc change
ying-jeanne Dec 19, 2024
8386503
remove time checking
ying-jeanne Dec 24, 2024
d8f1e9b
add createIfDoesNotExist parameter
ying-jeanne Dec 24, 2024
b9efb94
add more condition for trigger newTracker
ying-jeanne Dec 24, 2024
a37e6de
remove the label adapter to labels call
ying-jeanne Dec 24, 2024
211b3a2
remove useless function dum dum
ying-jeanne Dec 24, 2024
f697e6f
make hardcoded increment value
ying-jeanne Dec 24, 2024
fe8a1e5
rename + make cooldownuntil a normal int64 and lock with observedMtx
ying-jeanne Dec 24, 2024
8b5836f
use build-in functon dum dum
ying-jeanne Dec 24, 2024
888d8b0
modify the copy of calabels instead of directly the slice
ying-jeanne Dec 24, 2024
b15b487
update mimir-prometheus
ying-jeanne Dec 24, 2024
87209d6
Merge remote-tracking branch 'origin/r322' into final-cost-attribution
ying-jeanne Dec 24, 2024
4706bde
vendor new mimir-prometheus
ying-jeanne Dec 24, 2024
1ab1f00
rename function
ying-jeanne Dec 24, 2024
8111b6c
fix lint
ying-jeanne Dec 24, 2024
17b64a9
add unittest in active series
ying-jeanne Dec 26, 2024
a191044
copy slice instead
ying-jeanne Dec 26, 2024
2bb1845
add test for discarded samples
ying-jeanne Dec 26, 2024
ddd507d
change small map to slice since it is quicker
ying-jeanne Dec 27, 2024
b27e379
remove unused parameter
ying-jeanne Dec 27, 2024
a79fac7
add new parameter
ying-jeanne Dec 27, 2024
37901b7
update config file
ying-jeanne Dec 27, 2024
f7115f4
Update pkg/costattribution/manager.go
ying-jeanne Dec 27, 2024
679f2cc
take config before locking tracker map
ying-jeanne Dec 30, 2024
66accc9
simplify logics
ying-jeanne Dec 30, 2024
f4a4efd
remove useless initialization
ying-jeanne Dec 30, 2024
f90ac0e
change int64 to time.x
ying-jeanne Dec 30, 2024
1ab89c5
change pointer to instance
ying-jeanne Dec 30, 2024
23b32cf
change instance to pointer in map
ying-jeanne Dec 30, 2024
7a60c7d
remove callback
ying-jeanne Dec 30, 2024
0287bf6
use string when create new key in map
ying-jeanne Dec 30, 2024
9c4c2df
move the logic to different place
ying-jeanne Dec 30, 2024
f8f2a49
get cat once out of loop
ying-jeanne Dec 30, 2024
1ad99ad
update tracker per request for received samples
ying-jeanne Dec 30, 2024
fa62ee1
make the lock fanny by dum dum
ying-jeanne Dec 30, 2024
1b0fb00
make ingester work
ying-jeanne Dec 30, 2024
ced8346
fix lock
ying-jeanne Dec 30, 2024
0a7c858
add changelog
ying-jeanne Dec 31, 2024
4336f7f
update changelog
ying-jeanne Dec 31, 2024
67b6cea
update doc with correct metrics name
ying-jeanne Dec 31, 2024
800fe85
remove useless function
ying-jeanne Dec 31, 2024
80e69fb
cast only once
ying-jeanne Dec 31, 2024
a2ffe5a
stop using string
ying-jeanne Jan 2, 2025
f28d672
simplify logics
ying-jeanne Jan 2, 2025
77 changes: 77 additions & 0 deletions cmd/mimir/config-descriptor.json
@@ -4368,6 +4368,50 @@
"fieldType": "int",
"fieldCategory": "experimental"
},
{
"kind": "field",
"name": "cost_attribution_labels",
"required": false,
"desc": "Defines labels for cost attribution. Applies to metrics like cortex_distributor_attributed_received_samples_total. To disable, set to an empty string. For example, 'team,service' produces metrics such as cortex_distributor_attributed_received_samples_total{team='frontend', service='api'}.",
"fieldValue": null,
"fieldDefaultValue": "",
"fieldFlag": "validation.cost-attribution-labels",
"fieldType": "string",
"fieldCategory": "experimental"
},
{
"kind": "field",
"name": "max_cost_attribution_labels_per_user",
"required": false,
"desc": "Maximum number of cost attribution labels allowed per user.",
"fieldValue": null,
"fieldDefaultValue": 2,
"fieldFlag": "validation.max-cost-attribution-labels-per-user",
"fieldType": "int",
"fieldCategory": "experimental"
},
{
"kind": "field",
"name": "max_cost_attribution_cardinality_per_user",
"required": false,
"desc": "Maximum cardinality of cost attribution labels allowed per user.",
"fieldValue": null,
"fieldDefaultValue": 10000,
"fieldFlag": "validation.max-cost-attribution-cardinality-per-user",
"fieldType": "int",
"fieldCategory": "experimental"
},
{
"kind": "field",
"name": "cost_attribution_cooldown",
"required": false,
"desc": "Cooldown period for cost attribution labels. Specifies the duration the cost attribution remains in overflow before attempting a reset. If the cardinality remains above the limit after this period, the system stays in overflow mode and extends the cooldown. Setting this value to 0 disables the cooldown, causing the system to continuously check whether the cardinality has dropped below the limit. A reset occurs when the cardinality falls below the limit.",
"fieldValue": null,
"fieldDefaultValue": 0,
"fieldFlag": "validation.cost-attribution-cooldown",
"fieldType": "duration",
"fieldCategory": "experimental"
},
{
"kind": "field",
"name": "ruler_evaluation_delay_duration",
@@ -19639,6 +19683,39 @@
"fieldFlag": "timeseries-unmarshal-caching-optimization-enabled",
"fieldType": "boolean",
"fieldCategory": "experimental"
},
{
"kind": "field",
"name": "cost_attribution_eviction_interval",
"required": false,
"desc": "Time interval at which inactive cost attributions are evicted from the counter, ensuring they are not included in the cost attribution cardinality per user limit.",
"fieldValue": null,
"fieldDefaultValue": 1200000000000,
"fieldFlag": "cost-attribution.eviction-interval",
"fieldType": "duration",
"fieldCategory": "experimental"
},
{
"kind": "field",
"name": "cost_attribution_registry_path",
"required": false,
"desc": "Defines a custom path for the registry. When specified, Mimir exposes cost attribution metrics through this custom path. If not specified, cost attribution metrics aren't exposed.",
"fieldValue": null,
"fieldDefaultValue": "",
"fieldFlag": "cost-attribution.registry-path",
"fieldType": "string",
"fieldCategory": "experimental"
},
{
"kind": "field",
"name": "cost_attribution_cleanup_interval",
"required": false,
"desc": "Time interval at which the cost attribution cleanup process runs, ensuring inactive cost attribution entries are purged.",
"fieldValue": null,
"fieldDefaultValue": 180000000000,
"fieldFlag": "cost-attribution.cleanup-interval",
"fieldType": "duration",
"fieldCategory": "experimental"
}
],
"fieldValue": null,
14 changes: 14 additions & 0 deletions cmd/mimir/help-all.txt.tmpl
@@ -1283,6 +1283,12 @@ Usage of ./cmd/mimir/mimir:
Expands ${var} or $var in config according to the values of the environment variables.
-config.file value
Configuration file to load.
-cost-attribution.cleanup-interval duration
[experimental] Time interval at which the cost attribution cleanup process runs, ensuring inactive cost attribution entries are purged. (default 3m0s)
-cost-attribution.eviction-interval duration
[experimental] Time interval at which inactive cost attributions are evicted from the counter, ensuring they are not included in the cost attribution cardinality per user limit. (default 20m0s)
-cost-attribution.registry-path string
[experimental] Defines a custom path for the registry. When specified, Mimir exposes cost attribution metrics through this custom path. If not specified, cost attribution metrics aren't exposed.
-debug.block-profile-rate int
Fraction of goroutine blocking events that are reported in the blocking profile. 1 to include every blocking event in the profile, 0 to disable.
-debug.mutex-profile-fraction int
@@ -3317,10 +3323,18 @@ Usage of ./cmd/mimir/mimir:
Enable anonymous usage reporting. (default true)
-usage-stats.installation-mode string
Installation mode. Supported values: custom, helm, jsonnet. (default "custom")
-validation.cost-attribution-cooldown duration
[experimental] Cooldown period for cost attribution labels. Specifies the duration the cost attribution remains in overflow before attempting a reset. If the cardinality remains above the limit after this period, the system stays in overflow mode and extends the cooldown. Setting this value to 0 disables the cooldown, causing the system to continuously check whether the cardinality has dropped below the limit. A reset occurs when the cardinality falls below the limit.
-validation.cost-attribution-labels comma-separated-list-of-strings
[experimental] Defines labels for cost attribution. Applies to metrics like cortex_distributor_attributed_received_samples_total. To disable, set to an empty string. For example, 'team,service' produces metrics such as cortex_distributor_attributed_received_samples_total{team='frontend', service='api'}.
-validation.create-grace-period duration
Controls how far into the future incoming samples and exemplars are accepted compared to the wall clock. Any sample or exemplar will be rejected if its timestamp is greater than '(now + creation_grace_period)'. This configuration is enforced in the distributor and ingester. (default 10m)
-validation.enforce-metadata-metric-name
Enforce every metadata has a metric name. (default true)
-validation.max-cost-attribution-cardinality-per-user int
[experimental] Maximum cardinality of cost attribution labels allowed per user. (default 10000)
-validation.max-cost-attribution-labels-per-user int
[experimental] Maximum number of cost attribution labels allowed per user. (default 2)
-validation.max-label-names-per-info-series int
Maximum number of label names per info series. Has no effect if less than the value of the maximum number of label names per series option (-validation.max-label-names-per-series) (default 80)
-validation.max-label-names-per-series int
13 changes: 13 additions & 0 deletions docs/sources/mimir/configure/about-versioning.md
@@ -46,6 +46,19 @@ Experimental configuration and flags are subject to change.

The following features are currently experimental:

- Cost attribution
  - Configure labels for cost attribution
    - `-validation.cost-attribution-labels`
  - Configure cost attribution limits, such as label cardinality and the maximum number of cost attribution labels
    - `-validation.max-cost-attribution-labels-per-user`
    - `-validation.max-cost-attribution-cardinality-per-user`
  - Configure cooldown periods and eviction intervals for cost attribution
    - `-validation.cost-attribution-cooldown`
    - `-cost-attribution.eviction-interval`
  - Configure the metrics endpoint dedicated to cost attribution
    - `-cost-attribution.registry-path`
  - Configure the cost attribution cleanup process run interval
    - `-cost-attribution.cleanup-interval`
- Alertmanager
  - Enable a set of experimental API endpoints to help support the migration of the Grafana Alertmanager to the Mimir Alertmanager.
    - `-alertmanager.grafana-alertmanager-compatibility-enabled`
44 changes: 44 additions & 0 deletions docs/sources/mimir/configure/configuration-parameters/index.md
@@ -455,6 +455,23 @@ overrides_exporter:
# (experimental) Enables optimized marshaling of timeseries.
# CLI flag: -timeseries-unmarshal-caching-optimization-enabled
[timeseries_unmarshal_caching_optimization_enabled: <boolean> | default = true]

# (experimental) Time interval at which inactive cost attributions are evicted
# from the counter, ensuring they are not included in the cost attribution
# cardinality per user limit.
# CLI flag: -cost-attribution.eviction-interval
[cost_attribution_eviction_interval: <duration> | default = 20m]

# (experimental) Defines a custom path for the registry. When specified, Mimir
# exposes cost attribution metrics through this custom path. If not specified,
# cost attribution metrics aren't exposed.
# CLI flag: -cost-attribution.registry-path
[cost_attribution_registry_path: <string> | default = ""]

# (experimental) Time interval at which the cost attribution cleanup process
# runs, ensuring inactive cost attribution entries are purged.
# CLI flag: -cost-attribution.cleanup-interval
[cost_attribution_cleanup_interval: <duration> | default = 3m]
```

### common
@@ -3569,6 +3586,33 @@ The `limits` block configures default and per-tenant limits imposed by component
# CLI flag: -querier.active-series-results-max-size-bytes
[active_series_results_max_size_bytes: <int> | default = 419430400]

# (experimental) Defines labels for cost attribution. Applies to metrics like
# cortex_distributor_attributed_received_samples_total. To disable, set to an
# empty string. For example, 'team,service' produces metrics such as
# cortex_distributor_attributed_received_samples_total{team='frontend',
# service='api'}.
# CLI flag: -validation.cost-attribution-labels
[cost_attribution_labels: <string> | default = ""]

# (experimental) Maximum number of cost attribution labels allowed per user.
# CLI flag: -validation.max-cost-attribution-labels-per-user
[max_cost_attribution_labels_per_user: <int> | default = 2]

# (experimental) Maximum cardinality of cost attribution labels allowed per
# user.
# CLI flag: -validation.max-cost-attribution-cardinality-per-user
[max_cost_attribution_cardinality_per_user: <int> | default = 10000]

# (experimental) Cooldown period for cost attribution labels. Specifies the
# duration the cost attribution remains in overflow before attempting a reset.
# If the cardinality remains above the limit after this period, the system stays
# in overflow mode and extends the cooldown. Setting this value to 0 disables
# the cooldown, causing the system to continuously check whether the cardinality
# has dropped below the limit. A reset occurs when the cardinality falls below
# the limit.
# CLI flag: -validation.cost-attribution-cooldown
[cost_attribution_cooldown: <duration> | default = 0s]

# Duration to delay the evaluation of rules to ensure the underlying metrics
# have been pushed.
# CLI flag: -ruler.evaluation-delay-duration
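To make the `cost_attribution_labels` and `max_cost_attribution_labels_per_user` descriptions concrete, here is a rough sketch of how a value such as `team,service` could be split, checked against the label limit, and mapped onto a series. Both functions are hypothetical illustrations. Returning an error when the limit is exceeded and using an empty value for a missing label are assumptions of this sketch, not behaviour confirmed by the PR.

```go
package main

import (
	"fmt"
	"strings"
)

// parseCostAttributionLabels splits a comma-separated value such as
// "team,service" and rejects lists longer than maxLabels. Returning an error
// is an assumption of this sketch; the PR may handle the limit differently.
func parseCostAttributionLabels(raw string, maxLabels int) ([]string, error) {
	if strings.TrimSpace(raw) == "" {
		return nil, nil // an empty string disables cost attribution
	}
	parts := strings.Split(raw, ",")
	labels := make([]string, 0, len(parts))
	for _, p := range parts {
		if name := strings.TrimSpace(p); name != "" {
			labels = append(labels, name)
		}
	}
	if len(labels) > maxLabels {
		return nil, fmt.Errorf("%d cost attribution labels configured, limit is %d", len(labels), maxLabels)
	}
	return labels, nil
}

// attributionValues picks the value of each configured label from a series,
// in configuration order. Missing labels yield an empty value in this sketch.
func attributionValues(labels []string, series map[string]string) []string {
	values := make([]string, len(labels))
	for i, name := range labels {
		values[i] = series[name]
	}
	return values
}

func main() {
	labels, err := parseCostAttributionLabels("team,service", 2)
	if err != nil {
		panic(err)
	}
	series := map[string]string{"__name__": "http_requests_total", "team": "frontend", "service": "api"}
	fmt.Println(labels, attributionValues(labels, series)) // [team service] [frontend api]
}
```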
2 changes: 1 addition & 1 deletion go.mod
@@ -285,7 +285,7 @@ require (
)

// Using a fork of Prometheus with Mimir-specific changes.
-replace github.com/prometheus/prometheus => github.com/grafana/mimir-prometheus v0.0.0-20241219104229-b50052711673
+replace github.com/prometheus/prometheus => github.com/grafana/mimir-prometheus v0.0.0-20241224134504-460b7be5bce8
Contributor

You should not update mimir-prometheus here. It should be updated in main, and then you merge main into your branch.

Contributor Author

I'll vendor from main once the update is merged there. This change is only here to make the code compile for now.


// Replace memberlist with our fork which includes some fixes that haven't been
// merged upstream yet:
4 changes: 2 additions & 2 deletions go.sum
@@ -1279,8 +1279,8 @@ github.com/grafana/gomemcache v0.0.0-20241016125027-0a5bcc5aef40 h1:1TeKhyS+pvzO
github.com/grafana/gomemcache v0.0.0-20241016125027-0a5bcc5aef40/go.mod h1:IGRj8oOoxwJbHBYl1+OhS9UjQR0dv6SQOep7HqmtyFU=
github.com/grafana/memberlist v0.3.1-0.20220714140823-09ffed8adbbe h1:yIXAAbLswn7VNWBIvM71O2QsgfgW9fRXZNR0DXe6pDU=
github.com/grafana/memberlist v0.3.1-0.20220714140823-09ffed8adbbe/go.mod h1:MS2lj3INKhZjWNqd3N0m3J+Jxf3DAOnAH9VT3Sh9MUE=
-github.com/grafana/mimir-prometheus v0.0.0-20241219104229-b50052711673 h1:z3nSCBMtEMtD/LAIkwrHsT03n7qgeU+0M6rEMZQbxVI=
-github.com/grafana/mimir-prometheus v0.0.0-20241219104229-b50052711673/go.mod h1:a5LEa2Vy87wOp0Vu6sLmEIR1V59fqH3QosOSiErAr30=
+github.com/grafana/mimir-prometheus v0.0.0-20241224134504-460b7be5bce8 h1:/TwjdoLAxL7URxKJGJUeI539w6LUqcwIcj0WCUxDY/c=
+github.com/grafana/mimir-prometheus v0.0.0-20241224134504-460b7be5bce8/go.mod h1:a5LEa2Vy87wOp0Vu6sLmEIR1V59fqH3QosOSiErAr30=
github.com/grafana/opentracing-contrib-go-stdlib v0.0.0-20230509071955-f410e79da956 h1:em1oddjXL8c1tL0iFdtVtPloq2hRPen2MJQKoAWpxu0=
github.com/grafana/opentracing-contrib-go-stdlib v0.0.0-20230509071955-f410e79da956/go.mod h1:qtI1ogk+2JhVPIXVc6q+NHziSmy2W5GbdQZFUHADCBU=
github.com/grafana/prometheus-alertmanager v0.25.1-0.20240930132144-b5e64e81e8d3 h1:6D2gGAwyQBElSrp3E+9lSr7k8gLuP3Aiy20rweLWeBw=
6 changes: 6 additions & 0 deletions pkg/api/api.go
@@ -20,6 +20,7 @@ import (
"github.com/grafana/dskit/middleware"
"github.com/grafana/dskit/server"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promhttp"

"github.com/grafana/mimir/pkg/alertmanager"
"github.com/grafana/mimir/pkg/alertmanager/alertmanagerpb"
@@ -281,6 +282,11 @@ func (a *API) RegisterDistributor(d *distributor.Distributor, pushConfig distrib
a.RegisterRoute("/distributor/ha_tracker", d.HATracker, false, true, "GET")
}

// RegisterCostAttribution registers a Prometheus HTTP handler for the cost attribution metrics.
func (a *API) RegisterCostAttribution(customRegistryPath string, reg *prometheus.Registry) {
a.RegisterRoute(customRegistryPath, promhttp.HandlerFor(reg, promhttp.HandlerOpts{}), false, false, "GET")
}

// Ingester is defined as an interface to allow for alternative implementations
// of ingesters to be passed into the API.RegisterIngester() method.
type Ingester interface {
2 changes: 1 addition & 1 deletion pkg/blockbuilder/tsdb.go
@@ -50,7 +50,7 @@ type TSDBBuilder struct {
var softErrProcessor = mimir_storage.NewSoftAppendErrorProcessor(
func() {}, func(int64, []mimirpb.LabelAdapter) {}, func(int64, []mimirpb.LabelAdapter) {},
func(int64, []mimirpb.LabelAdapter) {}, func(int64, []mimirpb.LabelAdapter) {}, func(int64, []mimirpb.LabelAdapter) {},
-func() {}, func([]mimirpb.LabelAdapter) {}, func(error, int64, []mimirpb.LabelAdapter) {},
+func([]mimirpb.LabelAdapter) {}, func([]mimirpb.LabelAdapter) {}, func(error, int64, []mimirpb.LabelAdapter) {},
func(error, int64, []mimirpb.LabelAdapter) {}, func(error, int64, []mimirpb.LabelAdapter) {}, func(error, int64, []mimirpb.LabelAdapter) {},
func(error, int64, []mimirpb.LabelAdapter) {}, func(error, int64, []mimirpb.LabelAdapter) {},
)