Skip to content

Commit

Permalink
Add ignoreMatchingElsewhere option to promql/series
Browse files Browse the repository at this point in the history
  • Loading branch information
prymitive committed Feb 20, 2025
1 parent bdce0a1 commit ed9616f
Show file tree
Hide file tree
Showing 5 changed files with 356 additions and 27 deletions.
1 change: 1 addition & 0 deletions docs/changelog.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,7 @@
- Added `names` option to the `parser` block, which controls how does Prometheus validates
label names.
- Added [promql/impossible](checks/promql/impossible.md) check.
- Added `ignoreMatchingElsewhere` option to [promql/series](checks/promql/series.md) check.

### Fixed

Expand Down
44 changes: 39 additions & 5 deletions docs/checks/promql/series.md
Original file line number Diff line number Diff line change
Expand Up @@ -156,11 +156,12 @@ Syntax:

```js
check "promql/series" {
lookbackRange = "7d"
lookbackStep = "5m"
ignoreMetrics = [ "(.*)", ... ]
ignoreLabelsValue = { "...": [ "...", ... ] }
fallbackTimeout = "5m"
lookbackRange = "7d"
lookbackStep = "5m"
ignoreMetrics = [ "(.*)", ... ]
ignoreMatchingElsewhere = [ "(.*)", ... ]
ignoreLabelsValue = { "...": [ "...", ... ] }
fallbackTimeout = "5m"
}
```

Expand All @@ -178,6 +179,9 @@ check "promql/series" {
- `ignoreMetrics` - list of regexp matchers, if a metric is missing from Prometheus
but the name matches any of provided regexp matchers then pint will only report a
warning, instead of a bug level report.
- `ignoreMatchingElsewhere` - list of [metric selectors](https://prometheus.io/docs/prometheus/latest/querying/basics/#time-series-selectors), when `promql/series` validates if `foo` is present on Prometheus
`promA` but if finds it on another Prometheus (`promB`) and the time series found on `promB`
match any of the selector specified here, then pint won't report `foo` as missing.
- `ignoreLabelsValue` - allows to configure a global list of label **names** for which pint
should **NOT** report problems if there's a query that uses a **value** that does not exist.
This can be also set per rule using `# pint rule/set promql/series ignore/label-value $labelName`
Expand Down Expand Up @@ -247,6 +251,36 @@ check "promql/series" {
You can only use label matchers that would match the selector from the query itself, not from the time series
the query would return. This whole logic applies only to the query, not to the results of it.

Sometimes metrics come from sources that don't guarnatee that metrics are always present, for example
[pushgateway](https://github.com/prometheus/pushgateway) will only have metrics if they were pushed to it.
If such metrics are missing everywhere we can't tell why, but if they are present on other Prometheus servers
and they come from `{job="pushgateway"}` there, then that gives us signal that pushgatway is the source of this
metric and so we can ignore them.
Let's say we have a query using `my_service_errors_total{env="prod"}` and three Prometheus servers configured
(`promA`, `promB`, `promC`). Both `promA` and `promC` do have this metric present with these labels:

- `promA` - `my_service_errors_total{env="prod", job="pushgateway", instance="foo", reason="bad request"}`
- `promC` - `my_service_errors_total{env="prod", job="pushgateway", instance="bar", reason="not found"}`

but `promB` does have any time series for `my_service_errors_total` at all.
We also have this configuration for pint:

```js
check "promql/series" {
ignoreMatchingElsewhere = {
"{job=\"pushgateway\"}"
}
}
```

When pint checks validates the rule using this query it will runs `promql/series`
three time - once for `promA`, once for `promB` and once for `promC`.
When it runs on `promB` it doesn't find `my_service_errors_total{env="prod"}` present, so it checks
`promA` and `promB` to tell you if it's present there. Since `promA` has `my_service_errors_total{env="prod"}`
time series present pint will next check if these time series match any of the selectors from `ignoreMatchingElsewhere`.
`my_service_errors_total{env="prod", job="pushgateway", instance="foo", reason="bad request"}` matches `{job="pushgateway"}`
so pint decided NOT to report that `foo` is missing on `promB`.

### min-age

But default this check will report a problem if a metric was present
Expand Down
82 changes: 60 additions & 22 deletions internal/checks/promql_series.go
Original file line number Diff line number Diff line change
Expand Up @@ -23,15 +23,16 @@ import (
)

type PromqlSeriesSettings struct {
IgnoreLabelsValue map[string][]string `hcl:"ignoreLabelsValue,optional" json:"ignoreLabelsValue,omitempty"`
LookbackRange string `hcl:"lookbackRange,optional" json:"lookbackRange,omitempty"`
LookbackStep string `hcl:"lookbackStep,optional" json:"lookbackStep,omitempty"`
IgnoreMetrics []string `hcl:"ignoreMetrics,optional" json:"ignoreMetrics,omitempty"`
FallbackTimeout string `hcl:"fallbackTimeout,optional" json:"fallbackTimeout,omitempty"`
ignoreMetricsRe []*regexp.Regexp
lookbackRangeDuration time.Duration
lookbackStepDuration time.Duration
fallbackTimeout time.Duration
IgnoreLabelsValue map[string][]string `hcl:"ignoreLabelsValue,optional" json:"ignoreLabelsValue,omitempty"`
LookbackRange string `hcl:"lookbackRange,optional" json:"lookbackRange,omitempty"`
LookbackStep string `hcl:"lookbackStep,optional" json:"lookbackStep,omitempty"`
IgnoreMetrics []string `hcl:"ignoreMetrics,optional" json:"ignoreMetrics,omitempty"`
IgnoreMatchingElsewhere []string `hcl:"ignoreMatchingElsewhere,optional" json:"ignoreMatchingElsewhere,omitempty"`
FallbackTimeout string `hcl:"fallbackTimeout,optional" json:"fallbackTimeout,omitempty"`
ignoreMetricsRe []*regexp.Regexp
lookbackRangeDuration time.Duration
lookbackStepDuration time.Duration
fallbackTimeout time.Duration
}

func (c *PromqlSeriesSettings) Validate() error {
Expand Down Expand Up @@ -76,6 +77,12 @@ func (c *PromqlSeriesSettings) Validate() error {
}
}

for _, selector := range c.IgnoreMatchingElsewhere {
if _, err := promParser.ParseMetricSelector(selector); err != nil {
return fmt.Errorf("%q is not a valid PromQL metric selector: %w", selector, err)
}
}

return nil
}

Expand Down Expand Up @@ -310,13 +317,15 @@ func (c SeriesCheck) Check(ctx context.Context, _ discovery.Path, rule parser.Ru
),
Bug,
)
problems = append(problems, Problem{
Lines: expr.Value.Lines,
Reporter: c.Reporter(),
Text: text,
Details: c.checkOtherServer(ctx, selector.String(), settings.fallbackTimeout),
Severity: severity,
})
if details, shouldReport := c.checkOtherServer(ctx, selector.String(), settings); shouldReport {
problems = append(problems, Problem{
Lines: expr.Value.Lines,
Reporter: c.Reporter(),
Text: text,
Details: details,
Severity: severity,
})
}
slog.Debug("No historical series for base metric", slog.String("check", c.Reporter()), slog.String("selector", (&bareSelector).String()))
continue
}
Expand Down Expand Up @@ -578,7 +587,7 @@ func (c SeriesCheck) Check(ctx context.Context, _ discovery.Path, rule parser.Ru
return problems
}

func (c SeriesCheck) checkOtherServer(ctx context.Context, query string, timeout time.Duration) string {
func (c SeriesCheck) checkOtherServer(ctx context.Context, query string, settings *PromqlSeriesSettings) (string, bool) {
var servers []*promapi.FailoverGroup
if val := ctx.Value(promapi.AllPrometheusServers); val != nil {
for _, s := range val.([]*promapi.FailoverGroup) {
Expand All @@ -590,7 +599,7 @@ func (c SeriesCheck) checkOtherServer(ctx context.Context, query string, timeout
}

if len(servers) == 0 {
return SeriesCheckCommonProblemDetails
return SeriesCheckCommonProblemDetails, true
}

var suffix string
Expand All @@ -602,13 +611,13 @@ func (c SeriesCheck) checkOtherServer(ctx context.Context, query string, timeout
start := time.Now()
var tested, matches, skipped int
for _, prom := range servers {
if time.Since(start) >= timeout {
if time.Since(start) >= settings.fallbackTimeout {
slog.Debug("Time limit reached for checking if metric exists on any other Prometheus server",
slog.String("check", c.Reporter()),
slog.String("selector", query),
)
suffix = fmt.Sprintf("\npint tried to check %d server(s) but stopped after checking %d server(s) due to reaching time limit (%s).\n",
len(servers), tested, output.HumanizeDuration(timeout))
len(servers), tested, output.HumanizeDuration(settings.fallbackTimeout))
break
}

Expand All @@ -632,6 +641,13 @@ func (c SeriesCheck) checkOtherServer(ctx context.Context, query string, timeout
uri := prom.URI()

if series > 0 {
for _, selector := range settings.IgnoreMatchingElsewhere {
m, _ := promParser.ParseMetricSelector(selector)
if c.hasSeriesWithSelector(ctx, prom, query, m) {
return "", false
}
}

matches++
if matches > 10 {
skipped++
Expand All @@ -656,10 +672,32 @@ func (c SeriesCheck) checkOtherServer(ctx context.Context, query string, timeout
buf.WriteString("\nYou might be trying to deploy this rule to the wrong Prometheus server instance.\n")

if matches > 0 {
return buf.String()
return buf.String(), true
}

return SeriesCheckCommonProblemDetails, true
}

func (c SeriesCheck) hasSeriesWithSelector(ctx context.Context, prom *promapi.FailoverGroup, query string, matchers []*labels.Matcher) bool {
qr, err := prom.Query(ctx, query)
if err != nil {
return false
}

return SeriesCheckCommonProblemDetails
for _, s := range qr.Series {
var seriesMatches bool
for _, m := range matchers {
s.Labels.Range(func(l labels.Label) {
if m.Name == l.Name && m.Matches(l.Value) {
seriesMatches = true
}
})
}
if seriesMatches {
return true
}
}
return false
}

func (c SeriesCheck) queryProblem(err error, expr parser.PromQLExpr) Problem {
Expand Down
Loading

0 comments on commit ed9616f

Please sign in to comment.