Commit de8c9e7

Prometheus metrics scraper

This change decouples metrics collection from metrics scraping, and allows custom scrapers. The default scraper remains the osiris one, which uses the auto-injected sidecar container. We also introduce a new scraper using the Prometheus format: it doesn't require the osiris proxy, and instead scrapes metrics from an application-provided Prometheus-compliant endpoint. See deislabs#48 for more details.
1 parent: a197da3


43 files changed (+6272 -151 lines)

Gopkg.lock

+30 lines (generated file; diff not rendered)

README.md

+44
````diff
@@ -187,6 +187,7 @@ The following table lists the supported annotations for Kubernetes `Deployments`
 | `osiris.deislabs.io/enabled` | Enable the zeroscaler component to scrape and analyze metrics from the deployment's pods and scale the deployment to zero when idle. Allowed values: `y`, `yes`, `true`, `on`, `1`. | _no value_ (= disabled) |
 | `osiris.deislabs.io/minReplicas` | The minimum number of replicas to set on the deployment when Osiris will scale up. If you set `2`, Osiris will scale the deployment from `0` to `2` replicas directly. Osiris won't collect metrics from deployments which have more than `minReplicas` replicas - to avoid useless collections of metrics. | `1` |
 | `osiris.deislabs.io/metricsCheckInterval` | The interval in which Osiris would repeatedly track the pod http request metrics. The value is the number of seconds of the interval. Note that this value override the global value defined by the `zeroscaler.metricsCheckInterval` Helm value. | _value of the `zeroscaler.metricsCheckInterval` Helm value_ |
+| `osiris.deislabs.io/metricsCollector` | Configure the collection of metrics for a deployment's pods. The value is a JSON object with at least a `type` string, and an optional `implementation` object. See the *Metrics Scraping* section for more. | `{ "type": "osiris" }` |
 
 #### Pod Annotations
 
@@ -212,6 +213,49 @@ The following table lists the supported annotations for Kubernetes `Services` an
 
 Note that you might see an `osiris.deislabs.io/selector` annotation - this is for internal use only, and you shouldn't try to set/update or delete it.
 
+#### Metrics Scraping Configuration
+
+Scraping the metrics from the pods is done automatically by default, using the Osiris-provided sidecar container. If you don't want to use the auto-injected sidecar container, you can configure a custom metrics scraper using the `osiris.deislabs.io/metricsCollector` annotation on your deployment.
+
+The following scrapers are supported:
+
+**osiris**
+
+This is the default scraper, which doesn't need any configuration.
+
+**prometheus**
+
+The prometheus scraper retrieves metrics about the opened & closed connections from your own Prometheus endpoint. To use it, your application needs to expose an endpoint with metrics in the Prometheus format.
+You can then set the following annotation:
+
+```
+annotations:
+  osiris.deislabs.io/metricsCollector: |
+    {
+      "type": "prometheus",
+      "implementation": {
+        "port": 8080,
+        "path": "/metrics",
+        "openedConnectionsMetricName": "connections",
+        "openedConnectionsMetricLabels": {
+          "type": "opened"
+        },
+        "closedConnectionsMetricName": "connections",
+        "closedConnectionsMetricLabels": {
+          "type": "closed"
+        }
+      }
+    }
+```
+
+The schema of the prometheus implementation configuration is:
+- a mandatory `port` integer
+- an optional `path` string - defaults to `/metrics` if not set
+- a mandatory `openedConnectionsMetricName` string, for the name of the metric that exposes the number of opened connections
+- a mandatory `closedConnectionsMetricName` string, for the name of the metric that exposes the number of closed connections
+- an optional `openedConnectionsMetricLabels` object, for all labels that should match the metric for opened connections
+- an optional `closedConnectionsMetricLabels` object, for all labels that should match the metric for closed connections
+
 ### Demo
 
 Deploy the [example application](example/hello-osiris.yaml) `hello-osiris` :
````
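
For context (not part of this commit): an application targeted by the example annotation above would have to serve a `connections` metric carrying `type="opened"` and `type="closed"` labels on port `8080` at `/metrics`. A minimal sketch of such an endpoint in Go, assuming the `prometheus/client_golang` library (the library choice and counter wiring are illustrative, not prescribed by Osiris):

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// connections matches the metric name and labels referenced by the
// openedConnectionsMetricName / closedConnectionsMetricName settings above.
var connections = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "connections",
		Help: "Connections handled by the application, by lifecycle state.",
	},
	[]string{"type"},
)

func main() {
	prometheus.MustRegister(connections)

	// In real handler code, increment these as connections open and close.
	connections.WithLabelValues("opened").Inc()
	connections.WithLabelValues("closed").Inc()

	// Serve the Prometheus exposition format on the port/path from the
	// annotation, so the zeroscaler can scrape it without the osiris proxy.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}
```

Its `/metrics` output would then contain lines such as `connections{type="opened"} 1`, which the prometheus scraper matches against the configured metric names and labels.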

pkg/deployments/zeroscaler/metrics_collector.go

+46 -117 lines
```diff
@@ -3,14 +3,10 @@ package zeroscaler
 import (
   "context"
   "encoding/json"
-  "fmt"
-  "io/ioutil"
-  "net/http"
   "sync"
   "time"
 
   k8s "github.com/deislabs/osiris/pkg/kubernetes"
-  "github.com/deislabs/osiris/pkg/metrics"
   "github.com/golang/glog"
   corev1 "k8s.io/api/core/v1"
   "k8s.io/apimachinery/pkg/labels"
@@ -19,52 +15,45 @@ import (
   "k8s.io/client-go/tools/cache"
 )
 
-const (
-  proxyContainerName = "osiris-proxy"
-  proxyPortName      = "osiris-metrics"
-)
-
-type metricsCollector struct {
-  kubeClient           kubernetes.Interface
+type metricsCollectorConfig struct {
   deploymentName       string
   deploymentNamespace  string
   selector             labels.Selector
   metricsCheckInterval time.Duration
-  podsInformer         cache.SharedIndexInformer
-  currentAppPods       map[string]*corev1.Pod
-  allAppPodStats       map[string]*podStats
-  appPodsLock          sync.Mutex
-  httpClient           *http.Client
-  cancelFunc           func()
+  scraperConfig        metricsScraperConfig
+}
+
+type metricsCollector struct {
+  config         metricsCollectorConfig
+  scraper        metricsScraper
+  kubeClient     kubernetes.Interface
+  podsInformer   cache.SharedIndexInformer
+  currentAppPods map[string]*corev1.Pod
+  allAppPodStats map[string]*podStats
+  appPodsLock    sync.Mutex
+  cancelFunc     func()
 }
 
 func newMetricsCollector(
   kubeClient kubernetes.Interface,
-  deploymentName string,
-  deploymentNamespace string,
-  selector labels.Selector,
-  metricsCheckInterval time.Duration,
-) *metricsCollector {
+  config metricsCollectorConfig,
+) (*metricsCollector, error) {
+  s, err := newMetricsScraper(config.scraperConfig)
+  if err != nil {
+    return nil, err
+  }
   m := &metricsCollector{
-    kubeClient:           kubeClient,
-    deploymentName:       deploymentName,
-    deploymentNamespace:  deploymentNamespace,
-    selector:             selector,
-    metricsCheckInterval: metricsCheckInterval,
+    config:     config,
+    scraper:    s,
+    kubeClient: kubeClient,
     podsInformer: k8s.PodsIndexInformer(
      kubeClient,
-     deploymentNamespace,
+     config.deploymentNamespace,
      nil,
-     selector,
+     config.selector,
    ),
    currentAppPods: map[string]*corev1.Pod{},
    allAppPodStats: map[string]*podStats{},
-   // A very aggressive timeout. When collecting metrics, we want to do it very
-   // quickly to minimize the possibility that some pods we've checked on have
-   // served requests while we've been checking on OTHER pods.
-   httpClient: &http.Client{
-     Timeout: 2 * time.Second,
-   },
  }
  m.podsInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
    AddFunc: m.syncAppPod,
@@ -73,7 +62,7 @@ func newMetricsCollector(
    },
    DeleteFunc: m.syncDeletedAppPod,
  })
- return m
+ return m, nil
 }
 
 func (m *metricsCollector) run(ctx context.Context) {
@@ -83,14 +72,14 @@ func (m *metricsCollector) run(ctx context.Context) {
    <-ctx.Done()
    glog.Infof(
      "Stopping metrics collection for deployment %s in namespace %s",
-     m.deploymentName,
-     m.deploymentNamespace,
+     m.config.deploymentName,
+     m.config.deploymentNamespace,
    )
  }()
  glog.Infof(
    "Starting metrics collection for deployment %s in namespace %s",
-   m.deploymentName,
-   m.deploymentNamespace,
+   m.config.deploymentName,
+   m.config.deploymentNamespace,
  )
  go m.podsInformer.Run(ctx.Done())
  // When this exits, the cancel func will stop the informer
@@ -123,7 +112,7 @@ func (m *metricsCollector) syncDeletedAppPod(obj interface{}) {
 }
 
 func (m *metricsCollector) collectMetrics(ctx context.Context) {
- ticker := time.NewTicker(m.metricsCheckInterval)
+ ticker := time.NewTicker(m.config.metricsCheckInterval)
  defer ticker.Stop()
  var periodStartTime, periodEndTime *time.Time
  for {
@@ -146,28 +135,19 @@ func (m *metricsCollector) collectMetrics(ctx context.Context) {
      // Get metrics for all of the deployment's CURRENT pods.
      var scrapeWG sync.WaitGroup
      for _, pod := range m.currentAppPods {
-       podMetricsPort, ok := getMetricsPort(pod)
-       if !ok {
-         continue
-       }
-       url := fmt.Sprintf(
-         "http://%s:%d/metrics",
-         pod.Status.PodIP,
-         podMetricsPort,
-       )
        scrapeWG.Add(1)
-       go func(podName string) {
+       go func(pod *corev1.Pod) {
          defer scrapeWG.Done()
          // Get the results
-         pcs, ok := m.scrape(url)
-         if ok {
-           ps := m.allAppPodStats[podName]
+         pcs := m.scraper.Scrap(pod)
+         if pcs != nil {
+           ps := m.allAppPodStats[pod.Name]
            ps.prevStatTime = ps.recentStatTime
            ps.prevStats = ps.recentStats
            ps.recentStatTime = periodEndTime
-           ps.recentStats = &pcs
+           ps.recentStats = pcs
          }
-       }(pod.Name)
+       }(pod)
      }
      // Wait until we're done checking all pods.
      scrapeWG.Wait()
@@ -226,64 +206,11 @@ func (m *metricsCollector) collectMetrics(ctx context.Context) {
  }
 }
 
-func getMetricsPort(pod *corev1.Pod) (int32, bool) {
- for _, c := range pod.Spec.Containers {
-   if c.Name == proxyContainerName && len(c.Ports) > 0 {
-     for _, port := range c.Ports {
-       if port.Name == proxyPortName {
-         return port.ContainerPort, true
-       }
-     }
-   }
- }
- return 0, false
-}
-
-func (m *metricsCollector) scrape(
- target string,
-) (metrics.ProxyConnectionStats, bool) {
- pcs := metrics.ProxyConnectionStats{}
- // Requests made with this client time out after 2 seconds
- resp, err := m.httpClient.Get(target)
- if err != nil {
-   glog.Errorf("Error requesting metrics from %s: %s", target, err)
-   return pcs, false
- }
- defer resp.Body.Close()
- if resp.StatusCode != 200 {
-   glog.Errorf(
-     "Received unexpected HTTP response code %d when requesting metrics "+
-       "from %s",
-     resp.StatusCode,
-     target,
-   )
-   return pcs, false
- }
- bodyBytes, err := ioutil.ReadAll(resp.Body)
- if err != nil {
-   glog.Errorf(
-     "Error reading metrics request response from %s: %s",
-     target,
-     err,
-   )
-   return pcs, false
- }
- if err := json.Unmarshal(bodyBytes, &pcs); err != nil {
-   glog.Errorf(
-     "Error umarshaling metrics request response from %s: %s",
-     target,
-     err,
-   )
-   return pcs, false
- }
- return pcs, true
-}
-
 func (m *metricsCollector) scaleToZero() {
  glog.Infof(
    "Scale to zero starting for deployment %s in namespace %s",
-   m.deploymentName,
-   m.deploymentNamespace,
+   m.config.deploymentName,
+   m.config.deploymentNamespace,
  )
 
  patches := []k8s.PatchOperation{{
@@ -292,23 +219,25 @@ func (m *metricsCollector) scaleToZero() {
    Value: 0,
  }}
  patchesBytes, _ := json.Marshal(patches)
- if _, err := m.kubeClient.AppsV1().Deployments(m.deploymentNamespace).Patch(
-   m.deploymentName,
+ if _, err := m.kubeClient.AppsV1().Deployments(
+   m.config.deploymentNamespace,
+ ).Patch(
+   m.config.deploymentName,
    k8s_types.JSONPatchType,
    patchesBytes,
  ); err != nil {
    glog.Errorf(
      "Error scaling deployment %s in namespace %s to zero: %s",
-     m.deploymentName,
-     m.deploymentNamespace,
+     m.config.deploymentName,
+     m.config.deploymentNamespace,
      err,
    )
    return
  }
 
  glog.Infof(
    "Scaled deployment %s in namespace %s to zero",
-   m.deploymentName,
-   m.deploymentNamespace,
+   m.config.deploymentName,
+   m.config.deploymentNamespace,
  )
 }
```
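
The scraper abstraction itself is defined in files not rendered above. Inferring only from the calls visible in this diff (`newMetricsScraper(config.scraperConfig)` returning `(metricsScraper, error)`, and `m.scraper.Scrap(pod)` feeding `ps.recentStats`), its shape might look roughly like the sketch below; the JSON field names come from the README annotation, but the Go struct fields and the use of `json.RawMessage` are assumptions, not the committed code:

```go
// Sketch only: inferred from this diff, not copied from the commit.
package zeroscaler

import (
	"encoding/json"

	"github.com/deislabs/osiris/pkg/metrics"
	corev1 "k8s.io/api/core/v1"
)

// metricsScraperConfig mirrors the osiris.deislabs.io/metricsCollector
// annotation value: a scraper "type" plus scraper-specific settings.
type metricsScraperConfig struct {
	Type           string          `json:"type"`
	Implementation json.RawMessage `json:"implementation"`
}

// metricsScraper is what newMetricsScraper builds from the config and what
// the collector calls once per pod on every check interval; a nil result
// means the scrape failed and the pod's previous stats are kept.
type metricsScraper interface {
	Scrap(pod *corev1.Pod) *metrics.ProxyConnectionStats
}
```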
