Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MVP: Cost attribution #10269

Open
wants to merge 50 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 28 commits
Commits
Show all changes
50 commits
Select commit Hold shift + click to select a range
e315ebb
Poc: cost attribution proposal 2
ying-jeanne Oct 24, 2024
f04c28f
refectory
ying-jeanne Dec 17, 2024
2c422d1
add experimental features in about-versioning.md
ying-jeanne Dec 19, 2024
d2eab6b
change const variable to private
ying-jeanne Dec 19, 2024
1f39282
make timer service
ying-jeanne Dec 19, 2024
9b4337d
rename TrackerForUser to Tracker
ying-jeanne Dec 19, 2024
1a523e1
use fine locking
ying-jeanne Dec 19, 2024
f10f787
add comments explain why we use unchecked collector
ying-jeanne Dec 19, 2024
cc0e939
rename deleteUserTracker to deleteTracker
ying-jeanne Dec 19, 2024
c020be0
rename cat in cost attribution package to t or tracker
ying-jeanne Dec 19, 2024
71e4666
avoid get tracker twice
ying-jeanne Dec 19, 2024
9dd101b
refactor inactiveObservationsForUser
ying-jeanne Dec 19, 2024
7d4ea9a
refactor shouldDelete function
ying-jeanne Dec 19, 2024
6754666
rename calabels and calabelmap to labels and index
ying-jeanne Dec 19, 2024
fffc5b3
remove getter and setter of max cardinality and cooldown duration
ying-jeanne Dec 19, 2024
2cf8c3e
rename CompareLabels to hasSameLabels
ying-jeanne Dec 19, 2024
f994034
remove the mapping logic since the slices are ordered
ying-jeanne Dec 19, 2024
b060c09
remove unnecessary tracker nil checking
ying-jeanne Dec 19, 2024
e35a8d9
fix linting
ying-jeanne Dec 19, 2024
5cc0b5d
refactor updateOverflow method
ying-jeanne Dec 19, 2024
389dff0
remove stream in comments
ying-jeanne Dec 19, 2024
116a69e
make observation struct private
ying-jeanne Dec 19, 2024
9c30445
remove unnecessary pointers
ying-jeanne Dec 19, 2024
88ef49e
rename discardSampleMtx to discardedSampleMtx
ying-jeanne Dec 19, 2024
130636a
rename variable observedMtx because I write with feet
ying-jeanne Dec 19, 2024
b701ba7
update test name dum dum
ying-jeanne Dec 19, 2024
dccd9c8
remove test result
ying-jeanne Dec 19, 2024
eebd028
address doc change
ying-jeanne Dec 19, 2024
8386503
remove time checking
ying-jeanne Dec 24, 2024
d8f1e9b
add createIfDoesNotExist parameter
ying-jeanne Dec 24, 2024
b9efb94
add more condition for trigger newTracker
ying-jeanne Dec 24, 2024
a37e6de
remove the label adapter to labels call
ying-jeanne Dec 24, 2024
211b3a2
remove useless function dum dum
ying-jeanne Dec 24, 2024
f697e6f
make hardcoded increment value
ying-jeanne Dec 24, 2024
fe8a1e5
rename + make cooldownuntil a normal int64 and lock with observedMtx
ying-jeanne Dec 24, 2024
8b5836f
use build-in functon dum dum
ying-jeanne Dec 24, 2024
888d8b0
modify the copy of calabels instead of directly the slice
ying-jeanne Dec 24, 2024
b15b487
update mimir-prometheus
ying-jeanne Dec 24, 2024
87209d6
Merge remote-tracking branch 'origin/r322' into final-cost-attribution
ying-jeanne Dec 24, 2024
4706bde
vendor new mimir-prometheus
ying-jeanne Dec 24, 2024
1ab1f00
rename function
ying-jeanne Dec 24, 2024
8111b6c
fix lint
ying-jeanne Dec 24, 2024
17b64a9
add unittest in active series
ying-jeanne Dec 26, 2024
a191044
copy slice instead
ying-jeanne Dec 26, 2024
2bb1845
add test for discarded samples
ying-jeanne Dec 26, 2024
ddd507d
change small map to slice since it is quicker
ying-jeanne Dec 27, 2024
b27e379
remove unused parameter
ying-jeanne Dec 27, 2024
a79fac7
add new parameter
ying-jeanne Dec 27, 2024
37901b7
update config file
ying-jeanne Dec 27, 2024
f7115f4
Update pkg/costattribution/manager.go
ying-jeanne Dec 27, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
66 changes: 66 additions & 0 deletions cmd/mimir/config-descriptor.json
Original file line number Diff line number Diff line change
Expand Up @@ -4368,6 +4368,50 @@
"fieldType": "int",
"fieldCategory": "experimental"
},
{
"kind": "field",
"name": "cost_attribution_labels",
"required": false,
"desc": "Defines labels for cost attribution. Applies to metrics like cortex_distributor_attributed_received_samples_total. To disable, set to an empty string. For example, 'team,service' produces metrics such as cortex_distributor_attributed_received_samples_total{team='frontend', service='api'}.",
"fieldValue": null,
"fieldDefaultValue": "",
"fieldFlag": "validation.cost-attribution-labels",
"fieldType": "string",
"fieldCategory": "experimental"
},
{
"kind": "field",
"name": "max_cost_attribution_labels_per_user",
"required": false,
"desc": "Maximum number of cost attribution labels allowed per user.",
"fieldValue": null,
"fieldDefaultValue": 2,
"fieldFlag": "validation.max-cost-attribution-labels-per-user",
"fieldType": "int",
"fieldCategory": "experimental"
},
{
"kind": "field",
"name": "max_cost_attribution_cardinality_per_user",
"required": false,
"desc": "Maximum cardinality of cost attribution labels allowed per user.",
"fieldValue": null,
"fieldDefaultValue": 10000,
"fieldFlag": "validation.max-cost-attribution-cardinality-per-user",
"fieldType": "int",
"fieldCategory": "experimental"
},
{
"kind": "field",
"name": "cost_attribution_cooldown",
"required": false,
"desc": "Cooldown period for cost attribution labels. Specifies the duration the cost attribution remains in overflow before attempting a reset. If the cardinality remains above the limit after this period, the system stays in overflow mode and extends the cooldown. Setting this value to 0 disables the cooldown, causing the system to continuously check whether the cardinality has dropped below the limit. A reset occurs when the cardinality falls below the limit.",
"fieldValue": null,
"fieldDefaultValue": 0,
"fieldFlag": "validation.cost-attribution-cooldown",
"fieldType": "duration",
"fieldCategory": "experimental"
},
{
"kind": "field",
"name": "ruler_evaluation_delay_duration",
Expand Down Expand Up @@ -19639,6 +19683,28 @@
"fieldFlag": "timeseries-unmarshal-caching-optimization-enabled",
"fieldType": "boolean",
"fieldCategory": "experimental"
},
{
"kind": "field",
"name": "cost_attribution_eviction_interval",
"required": false,
"desc": "Time interval at which inactive cost attributions are evicted from the counter, ensuring they are not included in the cost attribution cardinality per user limit.",
"fieldValue": null,
"fieldDefaultValue": 1200000000000,
"fieldFlag": "cost-attribution.eviction-interval",
"fieldType": "duration",
"fieldCategory": "experimental"
},
{
"kind": "field",
"name": "cost_attribution_registry_path",
"required": false,
"desc": "Defines a custom path for the registry. When specified, Mimir exposes cost attribution metrics through this custom path. If not specified, cost attribution metrics aren't exposed.",
"fieldValue": null,
"fieldDefaultValue": "",
"fieldFlag": "cost-attribution.registry-path",
"fieldType": "string",
"fieldCategory": "experimental"
}
],
"fieldValue": null,
Expand Down
12 changes: 12 additions & 0 deletions cmd/mimir/help-all.txt.tmpl
Original file line number Diff line number Diff line change
Expand Up @@ -1283,6 +1283,10 @@ Usage of ./cmd/mimir/mimir:
Expands ${var} or $var in config according to the values of the environment variables.
-config.file value
Configuration file to load.
-cost-attribution.eviction-interval duration
[experimental] Time interval at which inactive cost attributions are evicted from the counter, ensuring they are not included in the cost attribution cardinality per user limit. (default 20m0s)
-cost-attribution.registry-path string
[experimental] Defines a custom path for the registry. When specified, Mimir exposes cost attribution metrics through this custom path. If not specified, cost attribution metrics aren't exposed.
-debug.block-profile-rate int
Fraction of goroutine blocking events that are reported in the blocking profile. 1 to include every blocking event in the profile, 0 to disable.
-debug.mutex-profile-fraction int
Expand Down Expand Up @@ -3317,10 +3321,18 @@ Usage of ./cmd/mimir/mimir:
Enable anonymous usage reporting. (default true)
-usage-stats.installation-mode string
Installation mode. Supported values: custom, helm, jsonnet. (default "custom")
-validation.cost-attribution-cooldown duration
[experimental] Cooldown period for cost attribution labels. Specifies the duration the cost attribution remains in overflow before attempting a reset. If the cardinality remains above the limit after this period, the system stays in overflow mode and extends the cooldown. Setting this value to 0 disables the cooldown, causing the system to continuously check whether the cardinality has dropped below the limit. A reset occurs when the cardinality falls below the limit.
-validation.cost-attribution-labels comma-separated-list-of-strings
[experimental] Defines labels for cost attribution. Applies to metrics like cortex_distributor_attributed_received_samples_total. To disable, set to an empty string. For example, 'team,service' produces metrics such as cortex_distributor_attributed_received_samples_total{team='frontend', service='api'}.
-validation.create-grace-period duration
Controls how far into the future incoming samples and exemplars are accepted compared to the wall clock. Any sample or exemplar will be rejected if its timestamp is greater than '(now + creation_grace_period)'. This configuration is enforced in the distributor and ingester. (default 10m)
-validation.enforce-metadata-metric-name
Enforce every metadata has a metric name. (default true)
-validation.max-cost-attribution-cardinality-per-user int
[experimental] Maximum cardinality of cost attribution labels allowed per user. (default 10000)
-validation.max-cost-attribution-labels-per-user int
[experimental] Maximum number of cost attribution labels allowed per user. (default 2)
-validation.max-label-names-per-info-series int
Maximum number of label names per info series. Has no effect if less than the value of the maximum number of label names per series option (-validation.max-label-names-per-series) (default 80)
-validation.max-label-names-per-series int
Expand Down
11 changes: 11 additions & 0 deletions docs/sources/mimir/configure/about-versioning.md
Original file line number Diff line number Diff line change
Expand Up @@ -46,6 +46,17 @@ Experimental configuration and flags are subject to change.

The following features are currently experimental:

- Cost attribution
- Configure labels for cost attribution
- `-validation.cost-attribution-labels`
- Configure cost attribution limits, such as label cardinality and the maximum number of cost attribution labels
- `-validation.max-cost-attribution-labels-per-user`
- `-validation.max-cost-attribution-cardinality-per-user`
- Configure cooldown periods and eviction intervals for cost attribution
- `-validation.cost-attribution-cooldown`
- `-cost-attribution.eviction-interval`
- Configure the metrics endpoint dedicated to cost attribution
- `-cost-attribution.registry-path`
- Alertmanager
- Enable a set of experimental API endpoints to help support the migration of the Grafana Alertmanager to the Mimir Alertmanager.
- `-alertmanager.grafana-alertmanager-compatibility-enabled`
Expand Down
39 changes: 39 additions & 0 deletions docs/sources/mimir/configure/configuration-parameters/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -455,6 +455,18 @@ overrides_exporter:
# (experimental) Enables optimized marshaling of timeseries.
# CLI flag: -timeseries-unmarshal-caching-optimization-enabled
[timeseries_unmarshal_caching_optimization_enabled: <boolean> | default = true]

# (experimental) Time interval at which inactive cost attributions are evicted
# from the counter, ensuring they are not included in the cost attribution
# cardinality per user limit.
# CLI flag: -cost-attribution.eviction-interval
[cost_attribution_eviction_interval: <duration> | default = 20m]

# (experimental) Defines a custom path for the registry. When specified, Mimir
# exposes cost attribution metrics through this custom path. If not specified,
# cost attribution metrics aren't exposed.
# CLI flag: -cost-attribution.registry-path
[cost_attribution_registry_path: <string> | default = ""]
```

### common
Expand Down Expand Up @@ -3569,6 +3581,33 @@ The `limits` block configures default and per-tenant limits imposed by component
# CLI flag: -querier.active-series-results-max-size-bytes
[active_series_results_max_size_bytes: <int> | default = 419430400]

# (experimental) Defines labels for cost attribution. Applies to metrics like
# cortex_distributor_attributed_received_samples_total. To disable, set to an
# empty string. For example, 'team,service' produces metrics such as
# cortex_distributor_attributed_received_samples_total{team='frontend',
# service='api'}.
# CLI flag: -validation.cost-attribution-labels
[cost_attribution_labels: <string> | default = ""]

# (experimental) Maximum number of cost attribution labels allowed per user.
# CLI flag: -validation.max-cost-attribution-labels-per-user
[max_cost_attribution_labels_per_user: <int> | default = 2]

# (experimental) Maximum cardinality of cost attribution labels allowed per
# user.
# CLI flag: -validation.max-cost-attribution-cardinality-per-user
[max_cost_attribution_cardinality_per_user: <int> | default = 10000]

# (experimental) Cooldown period for cost attribution labels. Specifies the
# duration the cost attribution remains in overflow before attempting a reset.
# If the cardinality remains above the limit after this period, the system stays
# in overflow mode and extends the cooldown. Setting this value to 0 disables
# the cooldown, causing the system to continuously check whether the cardinality
# has dropped below the limit. A reset occurs when the cardinality falls below
# the limit.
# CLI flag: -validation.cost-attribution-cooldown
[cost_attribution_cooldown: <duration> | default = 0s]

# Duration to delay the evaluation of rules to ensure the underlying metrics
# have been pushed.
# CLI flag: -ruler.evaluation-delay-duration
Expand Down
6 changes: 6 additions & 0 deletions pkg/api/api.go
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,7 @@ import (
"github.com/grafana/dskit/middleware"
"github.com/grafana/dskit/server"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promhttp"

"github.com/grafana/mimir/pkg/alertmanager"
"github.com/grafana/mimir/pkg/alertmanager/alertmanagerpb"
Expand Down Expand Up @@ -280,6 +281,11 @@ func (a *API) RegisterDistributor(d *distributor.Distributor, pushConfig distrib
a.RegisterRoute("/distributor/ha_tracker", d.HATracker, false, true, "GET")
}

// RegisterCostAttribution registers a Prometheus HTTP handler for the cost attribution metrics.
func (a *API) RegisterCostAttribution(customRegistryPath string, reg *prometheus.Registry) {
a.RegisterRoute(customRegistryPath, promhttp.HandlerFor(reg, promhttp.HandlerOpts{}), false, false, "GET")
}

// Ingester is defined as an interface to allow for alternative implementations
// of ingesters to be passed into the API.RegisterIngester() method.
type Ingester interface {
Expand Down
2 changes: 1 addition & 1 deletion pkg/blockbuilder/tsdb.go
Original file line number Diff line number Diff line change
Expand Up @@ -50,7 +50,7 @@ type TSDBBuilder struct {
var softErrProcessor = mimir_storage.NewSoftAppendErrorProcessor(
func() {}, func(int64, []mimirpb.LabelAdapter) {}, func(int64, []mimirpb.LabelAdapter) {},
func(int64, []mimirpb.LabelAdapter) {}, func(int64, []mimirpb.LabelAdapter) {}, func(int64, []mimirpb.LabelAdapter) {},
func() {}, func([]mimirpb.LabelAdapter) {}, func(error, int64, []mimirpb.LabelAdapter) {},
func([]mimirpb.LabelAdapter) {}, func([]mimirpb.LabelAdapter) {}, func(error, int64, []mimirpb.LabelAdapter) {},
func(error, int64, []mimirpb.LabelAdapter) {}, func(error, int64, []mimirpb.LabelAdapter) {}, func(error, int64, []mimirpb.LabelAdapter) {},
func(error, int64, []mimirpb.LabelAdapter) {}, func(error, int64, []mimirpb.LabelAdapter) {},
)
Expand Down
170 changes: 170 additions & 0 deletions pkg/costattribution/manager.go
Original file line number Diff line number Diff line change
@@ -0,0 +1,170 @@
// SPDX-License-Identifier: AGPL-3.0-only

package costattribution

import (
"context"
"sort"
"sync"
"time"

"github.com/go-kit/log"
"github.com/grafana/dskit/services"
"github.com/prometheus/client_golang/prometheus"

"github.com/grafana/mimir/pkg/util/validation"
)

const (
trackerLabel = "tracker"
tenantLabel = "tenant"
defaultTrackerName = "cost-attribution"
missingValue = "__missing__"
overflowValue = "__overflow__"
)

type Manager struct {
services.Service
logger log.Logger
inactiveTimeout time.Duration
limits *validation.Overrides

mtx sync.RWMutex
trackersByUserID map[string]*Tracker
reg *prometheus.Registry
cleanupInterval time.Duration
metricsExportInterval time.Duration
}

func NewManager(cleanupInterval, exportInterval, inactiveTimeout time.Duration, logger log.Logger, limits *validation.Overrides, reg *prometheus.Registry) (*Manager, error) {
m := &Manager{
trackersByUserID: make(map[string]*Tracker),
limits: limits,
mtx: sync.RWMutex{},
inactiveTimeout: inactiveTimeout,
logger: logger,
reg: reg,
cleanupInterval: cleanupInterval,
metricsExportInterval: exportInterval,
}

m.Service = services.NewTimerService(cleanupInterval, nil, m.iteration, nil).WithName("cost attribution manager")
if err := reg.Register(m); err != nil {
return nil, err
}
return m, nil
}

func (m *Manager) iteration(_ context.Context) error {
return m.purgeInactiveAttributionsUntil(time.Now().Add(-m.inactiveTimeout).Unix())
}

func (m *Manager) EnabledForUser(userID string) bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this doesn't need to be exported.

if m == nil {
return false
}
return len(m.limits.CostAttributionLabels(userID)) > 0
}

func (m *Manager) Tracker(userID string) *Tracker {
if !m.EnabledForUser(userID) {
return nil
}

// Check if the tracker already exists, if exists return it. Otherwise lock and create a new tracker.
m.mtx.RLock()
tracker, exists := m.trackersByUserID[userID]
m.mtx.RUnlock()
if exists {
return tracker
}

m.mtx.Lock()
defer m.mtx.Unlock()
if tracker, exists = m.trackersByUserID[userID]; exists {
return tracker
}
tracker = newTracker(userID, m.limits.CostAttributionLabels(userID), m.limits.MaxCostAttributionCardinalityPerUser(userID), m.limits.CostAttributionCooldown(userID), m.logger)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit, but I would collect all the information needed to build a tracker (all the m.limits.* calls before we take m.mtx.Lock, to spend less time locked.

m.trackersByUserID[userID] = tracker
return tracker
}

func (m *Manager) Collect(out chan<- prometheus.Metric) {
m.mtx.RLock()
defer m.mtx.RUnlock()
for _, tracker := range m.trackersByUserID {
tracker.Collect(out)
}
}

func (m *Manager) Describe(chan<- *prometheus.Desc) {
colega marked this conversation as resolved.
Show resolved Hide resolved
// Describe is not implemented because the metrics include dynamic labels. The Manager functions as an unchecked exporter.
// For more details, refer to the documentation: https://pkg.go.dev/github.com/prometheus/client_golang/prometheus#hdr-Custom_Collectors_and_constant_Metrics
}

func (m *Manager) deleteTracker(userID string) {
m.mtx.Lock()
defer m.mtx.Unlock()
delete(m.trackersByUserID, userID)
}

func (m *Manager) updateTracker(userID string) *Tracker {
t := m.Tracker(userID)

if t == nil {
m.deleteTracker(userID)
return nil
}

newTrackedLabels := m.limits.CostAttributionLabels(userID)

// sort the labels to ensure the order is consistent
sort.Slice(newTrackedLabels, func(i, j int) bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it safe to modify the slice returned by m.limits.CostAttributionLabels(userID)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

modify the copy instead, addressed in 888d8b0

return newTrackedLabels[i] < newTrackedLabels[j]
})
ying-jeanne marked this conversation as resolved.
Show resolved Hide resolved

if !t.hasSameLabels(newTrackedLabels) {
m.mtx.Lock()
t = newTracker(userID, newTrackedLabels, m.limits.MaxCostAttributionCardinalityPerUser(userID), m.limits.CostAttributionCooldown(userID), m.logger)
m.trackersByUserID[userID] = t
m.mtx.Unlock()
return t
}

maxCardinality := m.limits.MaxCostAttributionCardinalityPerUser(userID)
if t.maxCardinality != maxCardinality {
t.maxCardinality = maxCardinality
}

cooldown := int64(m.limits.CostAttributionCooldown(userID).Seconds())
if cooldown != t.cooldownDuration {
t.cooldownDuration = cooldown
}
return t
}

func (m *Manager) purgeInactiveAttributionsUntil(deadline int64) error {
m.mtx.RLock()
userIDs := make([]string, 0, len(m.trackersByUserID))
for userID := range m.trackersByUserID {
userIDs = append(userIDs, userID)
}
m.mtx.RUnlock()

for _, userID := range userIDs {
t := m.updateTracker(userID)
if t == nil {
continue
}

invalidKeys := t.inactiveObservations(deadline)
for _, key := range invalidKeys {
t.cleanupTrackerAttribution(key)
}

if t.shouldDelete(deadline) {
m.deleteTracker(userID)
ying-jeanne marked this conversation as resolved.
Show resolved Hide resolved
}
}
return nil
}
Loading
Loading