Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MVP: Cost attribution #10269

Open
wants to merge 50 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 2 commits
Commits
Show all changes
50 commits
Select commit Hold shift + click to select a range
e315ebb
Poc: cost attribution proposal 2
ying-jeanne Oct 24, 2024
f04c28f
refectory
ying-jeanne Dec 17, 2024
2c422d1
add experimental features in about-versioning.md
ying-jeanne Dec 19, 2024
d2eab6b
change const variable to private
ying-jeanne Dec 19, 2024
1f39282
make timer service
ying-jeanne Dec 19, 2024
9b4337d
rename TrackerForUser to Tracker
ying-jeanne Dec 19, 2024
1a523e1
use fine locking
ying-jeanne Dec 19, 2024
f10f787
add comments explain why we use unchecked collector
ying-jeanne Dec 19, 2024
cc0e939
rename deleteUserTracker to deleteTracker
ying-jeanne Dec 19, 2024
c020be0
rename cat in cost attribution package to t or tracker
ying-jeanne Dec 19, 2024
71e4666
avoid get tracker twice
ying-jeanne Dec 19, 2024
9dd101b
refactor inactiveObservationsForUser
ying-jeanne Dec 19, 2024
7d4ea9a
refactor shouldDelete function
ying-jeanne Dec 19, 2024
6754666
rename calabels and calabelmap to labels and index
ying-jeanne Dec 19, 2024
fffc5b3
remove getter and setter of max cardinality and cooldown duration
ying-jeanne Dec 19, 2024
2cf8c3e
rename CompareLabels to hasSameLabels
ying-jeanne Dec 19, 2024
f994034
remove the mapping logic since the slices are ordered
ying-jeanne Dec 19, 2024
b060c09
remove unnecessary tracker nil checking
ying-jeanne Dec 19, 2024
e35a8d9
fix linting
ying-jeanne Dec 19, 2024
5cc0b5d
refactor updateOverflow method
ying-jeanne Dec 19, 2024
389dff0
remove stream in comments
ying-jeanne Dec 19, 2024
116a69e
make observation struct private
ying-jeanne Dec 19, 2024
9c30445
remove unnecessary pointers
ying-jeanne Dec 19, 2024
88ef49e
rename discardSampleMtx to discardedSampleMtx
ying-jeanne Dec 19, 2024
130636a
rename variable observedMtx because I write with feet
ying-jeanne Dec 19, 2024
b701ba7
update test name dum dum
ying-jeanne Dec 19, 2024
dccd9c8
remove test result
ying-jeanne Dec 19, 2024
eebd028
address doc change
ying-jeanne Dec 19, 2024
8386503
remove time checking
ying-jeanne Dec 24, 2024
d8f1e9b
add createIfDoesNotExist parameter
ying-jeanne Dec 24, 2024
b9efb94
add more condition for trigger newTracker
ying-jeanne Dec 24, 2024
a37e6de
remove the label adapter to labels call
ying-jeanne Dec 24, 2024
211b3a2
remove useless function dum dum
ying-jeanne Dec 24, 2024
f697e6f
make hardcoded increment value
ying-jeanne Dec 24, 2024
fe8a1e5
rename + make cooldownuntil a normal int64 and lock with observedMtx
ying-jeanne Dec 24, 2024
8b5836f
use build-in functon dum dum
ying-jeanne Dec 24, 2024
888d8b0
modify the copy of calabels instead of directly the slice
ying-jeanne Dec 24, 2024
b15b487
update mimir-prometheus
ying-jeanne Dec 24, 2024
87209d6
Merge remote-tracking branch 'origin/r322' into final-cost-attribution
ying-jeanne Dec 24, 2024
4706bde
vendor new mimir-prometheus
ying-jeanne Dec 24, 2024
1ab1f00
rename function
ying-jeanne Dec 24, 2024
8111b6c
fix lint
ying-jeanne Dec 24, 2024
17b64a9
add unittest in active series
ying-jeanne Dec 26, 2024
a191044
copy slice instead
ying-jeanne Dec 26, 2024
2bb1845
add test for discarded samples
ying-jeanne Dec 26, 2024
ddd507d
change small map to slice since it is quicker
ying-jeanne Dec 27, 2024
b27e379
remove unused parameter
ying-jeanne Dec 27, 2024
a79fac7
add new parameter
ying-jeanne Dec 27, 2024
37901b7
update config file
ying-jeanne Dec 27, 2024
f7115f4
Update pkg/costattribution/manager.go
ying-jeanne Dec 27, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
66 changes: 66 additions & 0 deletions cmd/mimir/config-descriptor.json
Original file line number Diff line number Diff line change
Expand Up @@ -4368,6 +4368,50 @@
"fieldType": "int",
"fieldCategory": "experimental"
},
{
"kind": "field",
"name": "cost_attribution_labels",
"required": false,
"desc": "Defines labels for cost attribution, applied to metrics like cortex_distributor_attributed_received_samples_total. Set to an empty string to disable. Example: 'team,service' will produce metrics such as cortex_distributor_attributed_received_samples_total{team='frontend', service='api'}.",
"fieldValue": null,
"fieldDefaultValue": "",
"fieldFlag": "validation.cost-attribution-labels",
"fieldType": "string",
"fieldCategory": "experimental"
},
{
"kind": "field",
"name": "max_cost_attribution_labels_per_user",
"required": false,
"desc": "Maximum number of cost attribution labels allowed per user.",
"fieldValue": null,
"fieldDefaultValue": 2,
"fieldFlag": "validation.max-cost-attribution-labels-per-user",
"fieldType": "int",
"fieldCategory": "experimental"
},
{
"kind": "field",
"name": "max_cost_attribution_cardinality_per_user",
"required": false,
"desc": "Maximum cardinality of cost attribution labels allowed per user.",
"fieldValue": null,
"fieldDefaultValue": 10000,
"fieldFlag": "validation.max-cost-attribution-cardinality-per-user",
"fieldType": "int",
"fieldCategory": "experimental"
},
{
"kind": "field",
"name": "cost_attribution_cooldown",
"required": false,
"desc": "Cooldown period for cost attribution labels. Specifies the duration the cost attribution remains in overflow before attempting a reset. If the cardinality remains above the limit after this period, the system will stay in overflow mode and extend the cooldown. Setting this value to 0 disables the cooldown, causing the system to continuously check whether the cardinality has dropped below the limit. A reset will occur once the cardinality falls below the limit.",
"fieldValue": null,
"fieldDefaultValue": 0,
"fieldFlag": "validation.cost-attribution-cooldown",
"fieldType": "duration",
"fieldCategory": "experimental"
},
{
"kind": "field",
"name": "ruler_evaluation_delay_duration",
Expand Down Expand Up @@ -19639,6 +19683,28 @@
"fieldFlag": "timeseries-unmarshal-caching-optimization-enabled",
"fieldType": "boolean",
"fieldCategory": "experimental"
},
{
"kind": "field",
"name": "cost_attribution_eviction_interval",
"required": false,
"desc": "Time interval at which inactive cost attributions are evicted from the counter, ensuring they are not included in the cost attribution cardinality per user limit.",
"fieldValue": null,
"fieldDefaultValue": 1200000000000,
"fieldFlag": "cost-attribution.eviction-interval",
"fieldType": "duration",
"fieldCategory": "experimental"
},
{
"kind": "field",
"name": "cost_attribution_registry_path",
"required": false,
"desc": "Defines a custom path for the registry. When specified, Mimir will expose cost attribution metrics through this custom path, if not specified, cost attribution metrics won't be exposed.",
"fieldValue": null,
"fieldDefaultValue": "",
"fieldFlag": "cost-attribution.registry-path",
"fieldType": "string",
"fieldCategory": "experimental"
}
],
"fieldValue": null,
Expand Down
12 changes: 12 additions & 0 deletions cmd/mimir/help-all.txt.tmpl
Original file line number Diff line number Diff line change
Expand Up @@ -1283,6 +1283,10 @@ Usage of ./cmd/mimir/mimir:
Expands ${var} or $var in config according to the values of the environment variables.
-config.file value
Configuration file to load.
-cost-attribution.eviction-interval duration
[experimental] Time interval at which inactive cost attributions are evicted from the counter, ensuring they are not included in the cost attribution cardinality per user limit. (default 20m0s)
-cost-attribution.registry-path string
[experimental] Defines a custom path for the registry. When specified, Mimir will expose cost attribution metrics through this custom path, if not specified, cost attribution metrics won't be exposed.
-debug.block-profile-rate int
Fraction of goroutine blocking events that are reported in the blocking profile. 1 to include every blocking event in the profile, 0 to disable.
-debug.mutex-profile-fraction int
Expand Down Expand Up @@ -3317,10 +3321,18 @@ Usage of ./cmd/mimir/mimir:
Enable anonymous usage reporting. (default true)
-usage-stats.installation-mode string
Installation mode. Supported values: custom, helm, jsonnet. (default "custom")
-validation.cost-attribution-cooldown duration
[experimental] Cooldown period for cost attribution labels. Specifies the duration the cost attribution remains in overflow before attempting a reset. If the cardinality remains above the limit after this period, the system will stay in overflow mode and extend the cooldown. Setting this value to 0 disables the cooldown, causing the system to continuously check whether the cardinality has dropped below the limit. A reset will occur once the cardinality falls below the limit.
-validation.cost-attribution-labels comma-separated-list-of-strings
[experimental] Defines labels for cost attribution, applied to metrics like cortex_distributor_attributed_received_samples_total. Set to an empty string to disable. Example: 'team,service' will produce metrics such as cortex_distributor_attributed_received_samples_total{team='frontend', service='api'}.
-validation.create-grace-period duration
Controls how far into the future incoming samples and exemplars are accepted compared to the wall clock. Any sample or exemplar will be rejected if its timestamp is greater than '(now + creation_grace_period)'. This configuration is enforced in the distributor and ingester. (default 10m)
-validation.enforce-metadata-metric-name
Enforce every metadata has a metric name. (default true)
-validation.max-cost-attribution-cardinality-per-user int
[experimental] Maximum cardinality of cost attribution labels allowed per user. (default 10000)
-validation.max-cost-attribution-labels-per-user int
[experimental] Maximum number of cost attribution labels allowed per user. (default 2)
-validation.max-label-names-per-info-series int
Maximum number of label names per info series. Has no effect if less than the value of the maximum number of label names per series option (-validation.max-label-names-per-series) (default 80)
-validation.max-label-names-per-series int
Expand Down
39 changes: 39 additions & 0 deletions docs/sources/mimir/configure/configuration-parameters/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -455,6 +455,18 @@ overrides_exporter:
# (experimental) Enables optimized marshaling of timeseries.
# CLI flag: -timeseries-unmarshal-caching-optimization-enabled
[timeseries_unmarshal_caching_optimization_enabled: <boolean> | default = true]

# (experimental) Time interval at which inactive cost attributions are evicted
# from the counter, ensuring they are not included in the cost attribution
# cardinality per user limit.
# CLI flag: -cost-attribution.eviction-interval
[cost_attribution_eviction_interval: <duration> | default = 20m]

# (experimental) Defines a custom path for the registry. When specified, Mimir
# will expose cost attribution metrics through this custom path, if not
ying-jeanne marked this conversation as resolved.
Show resolved Hide resolved
# specified, cost attribution metrics won't be exposed.
ying-jeanne marked this conversation as resolved.
Show resolved Hide resolved
# CLI flag: -cost-attribution.registry-path
[cost_attribution_registry_path: <string> | default = ""]
colega marked this conversation as resolved.
Show resolved Hide resolved
```

### common
Expand Down Expand Up @@ -3569,6 +3581,33 @@ The `limits` block configures default and per-tenant limits imposed by component
# CLI flag: -querier.active-series-results-max-size-bytes
[active_series_results_max_size_bytes: <int> | default = 419430400]

# (experimental) Defines labels for cost attribution, applied to metrics like
ying-jeanne marked this conversation as resolved.
Show resolved Hide resolved
# cortex_distributor_attributed_received_samples_total. Set to an empty string
# to disable. Example: 'team,service' will produce metrics such as
ying-jeanne marked this conversation as resolved.
Show resolved Hide resolved
# cortex_distributor_attributed_received_samples_total{team='frontend',
# service='api'}.
# CLI flag: -validation.cost-attribution-labels
[cost_attribution_labels: <string> | default = ""]

# (experimental) Maximum number of cost attribution labels allowed per user.
# CLI flag: -validation.max-cost-attribution-labels-per-user
[max_cost_attribution_labels_per_user: <int> | default = 2]

# (experimental) Maximum cardinality of cost attribution labels allowed per
# user.
# CLI flag: -validation.max-cost-attribution-cardinality-per-user
[max_cost_attribution_cardinality_per_user: <int> | default = 10000]

# (experimental) Cooldown period for cost attribution labels. Specifies the
# duration the cost attribution remains in overflow before attempting a reset.
# If the cardinality remains above the limit after this period, the system will
# stay in overflow mode and extend the cooldown. Setting this value to 0
ying-jeanne marked this conversation as resolved.
Show resolved Hide resolved
# disables the cooldown, causing the system to continuously check whether the
# cardinality has dropped below the limit. A reset will occur once the
# cardinality falls below the limit.
ying-jeanne marked this conversation as resolved.
Show resolved Hide resolved
# CLI flag: -validation.cost-attribution-cooldown
[cost_attribution_cooldown: <duration> | default = 0s]

# Duration to delay the evaluation of rules to ensure the underlying metrics
# have been pushed.
# CLI flag: -ruler.evaluation-delay-duration
Expand Down
6 changes: 6 additions & 0 deletions pkg/api/api.go
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,7 @@ import (
"github.com/grafana/dskit/middleware"
"github.com/grafana/dskit/server"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promhttp"

"github.com/grafana/mimir/pkg/alertmanager"
"github.com/grafana/mimir/pkg/alertmanager/alertmanagerpb"
Expand Down Expand Up @@ -280,6 +281,11 @@ func (a *API) RegisterDistributor(d *distributor.Distributor, pushConfig distrib
a.RegisterRoute("/distributor/ha_tracker", d.HATracker, false, true, "GET")
}

// RegisterCostAttribution registers a Prometheus HTTP handler for the cost attribution metrics.
func (a *API) RegisterCostAttribution(customRegistryPath string, reg *prometheus.Registry) {
a.RegisterRoute(customRegistryPath, promhttp.HandlerFor(reg, promhttp.HandlerOpts{}), false, false, "GET")
}

// Ingester is defined as an interface to allow for alternative implementations
// of ingesters to be passed into the API.RegisterIngester() method.
type Ingester interface {
Expand Down
2 changes: 1 addition & 1 deletion pkg/blockbuilder/tsdb.go
Original file line number Diff line number Diff line change
Expand Up @@ -50,7 +50,7 @@ type TSDBBuilder struct {
var softErrProcessor = mimir_storage.NewSoftAppendErrorProcessor(
func() {}, func(int64, []mimirpb.LabelAdapter) {}, func(int64, []mimirpb.LabelAdapter) {},
func(int64, []mimirpb.LabelAdapter) {}, func(int64, []mimirpb.LabelAdapter) {}, func(int64, []mimirpb.LabelAdapter) {},
func() {}, func([]mimirpb.LabelAdapter) {}, func(error, int64, []mimirpb.LabelAdapter) {},
func([]mimirpb.LabelAdapter) {}, func([]mimirpb.LabelAdapter) {}, func(error, int64, []mimirpb.LabelAdapter) {},
func(error, int64, []mimirpb.LabelAdapter) {}, func(error, int64, []mimirpb.LabelAdapter) {}, func(error, int64, []mimirpb.LabelAdapter) {},
func(error, int64, []mimirpb.LabelAdapter) {}, func(error, int64, []mimirpb.LabelAdapter) {},
)
Expand Down
172 changes: 172 additions & 0 deletions pkg/costattribution/manager.go
Original file line number Diff line number Diff line change
@@ -0,0 +1,172 @@
// SPDX-License-Identifier: AGPL-3.0-only

package costattribution

import (
"context"
"sort"
"sync"
"time"

"github.com/go-kit/log"
"github.com/grafana/dskit/services"
"github.com/prometheus/client_golang/prometheus"

"github.com/grafana/mimir/pkg/util/validation"
)

const (
TrackerLabel = "tracker"
TenantLabel = "tenant"
colega marked this conversation as resolved.
Show resolved Hide resolved
defaultTrackerName = "cost-attribution"
missingValue = "__missing__"
overflowValue = "__overflow__"
)

type Manager struct {
services.Service
logger log.Logger
inactiveTimeout time.Duration
limits *validation.Overrides

mtx sync.RWMutex
trackersByUserID map[string]*Tracker
reg *prometheus.Registry
cleanupInterval time.Duration
metricsExportInterval time.Duration
}

func NewManager(cleanupInterval, exportInterval, inactiveTimeout time.Duration, logger log.Logger, limits *validation.Overrides, reg *prometheus.Registry) (*Manager, error) {
m := &Manager{
trackersByUserID: make(map[string]*Tracker),
limits: limits,
mtx: sync.RWMutex{},
inactiveTimeout: inactiveTimeout,
logger: logger,
reg: reg,
cleanupInterval: cleanupInterval,
metricsExportInterval: exportInterval,
}

m.Service = services.NewBasicService(nil, m.running, nil).WithName("cost attribution manager")
if err := reg.Register(m); err != nil {
return nil, err
}
return m, nil
}

func (m *Manager) running(ctx context.Context) error {
colega marked this conversation as resolved.
Show resolved Hide resolved
t := time.NewTicker(m.cleanupInterval)
defer t.Stop()

for {
select {
case <-t.C:
if err := m.purgeInactiveAttributionsUntil(time.Now().Add(-m.inactiveTimeout).Unix()); err != nil {
return err
}
case <-ctx.Done():
return nil
}
}
}

func (m *Manager) EnabledForUser(userID string) bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this doesn't need to be exported.

if m == nil {
return false
}
return len(m.limits.CostAttributionLabels(userID)) > 0
}

func (m *Manager) TrackerForUser(userID string) *Tracker {
colega marked this conversation as resolved.
Show resolved Hide resolved
if !m.EnabledForUser(userID) {
return nil
}

m.mtx.Lock()
defer m.mtx.Unlock()

if tracker, exists := m.trackersByUserID[userID]; exists {
colega marked this conversation as resolved.
Show resolved Hide resolved
return tracker
}

tracker := newTracker(userID, m.limits.CostAttributionLabels(userID), m.limits.MaxCostAttributionCardinalityPerUser(userID), m.limits.CostAttributionCooldown(userID), m.logger)
m.trackersByUserID[userID] = tracker
return tracker
}

func (m *Manager) Collect(out chan<- prometheus.Metric) {
m.mtx.RLock()
defer m.mtx.RUnlock()
for _, tracker := range m.trackersByUserID {
tracker.Collect(out)
}
}

func (m *Manager) Describe(chan<- *prometheus.Desc) {
colega marked this conversation as resolved.
Show resolved Hide resolved
}

func (m *Manager) deleteUserTracker(userID string) {
colega marked this conversation as resolved.
Show resolved Hide resolved
m.mtx.Lock()
defer m.mtx.Unlock()
delete(m.trackersByUserID, userID)
}

func (m *Manager) purgeInactiveAttributionsUntil(deadline int64) error {
m.mtx.RLock()
userIDs := make([]string, 0, len(m.trackersByUserID))
for userID := range m.trackersByUserID {
userIDs = append(userIDs, userID)
}
m.mtx.RUnlock()

for _, userID := range userIDs {
if !m.EnabledForUser(userID) {
m.deleteUserTracker(userID)
continue
}

invalidKeys := m.inactiveObservationsForUser(userID, deadline)
ying-jeanne marked this conversation as resolved.
Show resolved Hide resolved
cat := m.TrackerForUser(userID)
colega marked this conversation as resolved.
Show resolved Hide resolved
colega marked this conversation as resolved.
Show resolved Hide resolved
for _, key := range invalidKeys {
cat.cleanupTrackerAttribution(key)
}

if cat != nil && cat.cooldownUntil != nil && cat.cooldownUntil.Load() < deadline {
if len(cat.observed) <= cat.MaxCardinality() {
cat.state = OverflowComplete
m.deleteUserTracker(userID)
} else {
cat.cooldownUntil.Store(deadline + cat.cooldownDuration)
}
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All this logic seems to belong to Tracker. Please create a method there that would do it, and would just tell us whether we need to remove it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed in 7d4ea9a

return nil
}

func (m *Manager) inactiveObservationsForUser(userID string, deadline int64) []string {
cat := m.TrackerForUser(userID)
newTrackedLabels := m.limits.CostAttributionLabels(userID)
sort.Slice(newTrackedLabels, func(i, j int) bool {
return newTrackedLabels[i] < newTrackedLabels[j]
})

if !cat.CompareCALabels(newTrackedLabels) {
m.mtx.Lock()
cat = newTracker(userID, newTrackedLabels, m.limits.MaxCostAttributionCardinalityPerUser(userID), m.limits.CostAttributionCooldown(userID), m.logger)
m.trackersByUserID[userID] = cat
m.mtx.Unlock()
return nil
}
maxCardinality := m.limits.MaxCostAttributionCardinalityPerUser(userID)
if cat.MaxCardinality() != maxCardinality {
cat.UpdateMaxCardinality(maxCardinality)
}

cooldown := int64(m.limits.CostAttributionCooldown(userID).Seconds())
if cooldown != cat.CooldownDuration() {
cat.UpdateCooldownDuration(cooldown)
}

return cat.InactiveObservations(deadline)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method does much more than it says, can we extract a common "updateTracker(userID) *Tracker, make it responsible for updating (or deleting! like in the for loop above), and returning the proper *Tracker for a user?

Then in the for loop above we can just do something like:

t := m.updatedTracker(userID)
if t == nil {
    continue
}
t.cleanupInactiveObservations(deadline)
// etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed in commit 9dd101b

Loading
Loading