This repository was archived by the owner on Mar 5, 2024. It is now read-only.

Commit a069095

V3 release (#195)

v3 release:

* added changelog
* start work on upgrade doc to cover v2 -> v3
* rename histogram metrics to clarify seconds
* update grafana dashboard to fix plotting of histogram data
* add quay badge to readme

1 parent 4bee061

File tree

7 files changed (+79, −20 lines)


CHANGELOG.md

Lines changed: 35 additions & 0 deletions
@@ -1,5 +1,40 @@
 # Changelog
 
+## v3.0
+6 December 2018
+
+v3 introduces a change to the gRPC API. Servers are compatible with v2.x Agents although **v3 Agents require v3 Servers**. Other breaking changes have been made so it's worth reading through [docs/UPGRADING.md](docs/UPGRADING.md) for more detail on moving from v2 to v3.
+
+Notable changes:
+
+* [#109](https://github.com/uswitch/kiam/pull/109) v3 API
+* [#110](https://github.com/uswitch/kiam/pull/110) Restrict metadata routes. Everything other than credentials **will be blocked by default**
+* [#122](https://github.com/uswitch/kiam/pull/122) Record Server error messages as Events on Pod
+* [#131](https://github.com/uswitch/kiam/pull/131) Replace go-metrics with native Prometheus metrics client
+* [#140](https://github.com/uswitch/kiam/pull/140) Example Grafana dashboard for Prometheus metrics
+* [#163](https://github.com/uswitch/kiam/pull/163) Server manifests use 127.0.0.1 rather than localhost to avoid DNS
+* [#173](https://github.com/uswitch/kiam/pull/173) Metadata Agent uses 301 rather than 308 redirects
+* [#180](https://github.com/uswitch/kiam/pull/180) Fix race condition with xtables.lock
+* [#193](https://github.com/uswitch/kiam/pull/193) Add optional pprof http handler to add monitoring in live clusters
+
+A huge thanks to the following contributors for this release:
+
+* [@Joseph-Irving](https://github.com/Joseph-Irving)
+* [@max-lobur](https://github.com/max-lobur)
+* [@fernandocarletti](https://github.com/fernandocarletti)
+* [@integrii](https://github.com/integrii)
+* [@duncward](https://github.com/duncward)
+* [@stevenjm](https://github.com/stevenjm)
+* [@tasdikrahman](https://github.com/tasdikrahman)
+* [@word](https://github.com/word)
+* [@DewaldV](https://github.com/DewaldV)
+* [@roffe](https://github.com/roffe)
+* [@sambooo](https://github.com/sambooo)
+* [@idiamond-stripe](https://github.com/idiamond-stripe)
+* [@ash2k](https://github.com/ash2k)
+* [@moofish32](https://github.com/moofish32)
+* [@sp-joseluis-ledesma](https://github.com/sp-joseluis-ledesma)
+
 ## v2.8
 1st June 2018
 

README.md

Lines changed: 3 additions & 0 deletions
@@ -1,4 +1,7 @@
 # kiam
+
+[![Docker Repository on Quay](https://quay.io/repository/uswitch/kiam/status "Docker Repository on Quay")](https://quay.io/repository/uswitch/kiam)
+
 kiam runs as an agent on each node in your Kubernetes cluster and allows cluster users to associate IAM roles to Pods.
 
 Docker images are available at [https://quay.io/repository/uswitch/kiam](https://quay.io/repository/uswitch/kiam).

docs/METRICS.md

Lines changed: 2 additions & 2 deletions
@@ -37,7 +37,7 @@ daemonset status from kube-state-metrics & container metrics from cAdvisor if av
 
 #### Metadata Subsystem
 
-- `kiam_metadata_handler_latency_milliseconds` - Bucketed histogram of handler timings. Tagged by handler
+- `kiam_metadata_handler_latency_seconds` - Bucketed histogram of handler timings. Tagged by handler
 - `kiam_metadata_credential_fetch_errors_total` - Number of errors fetching the credentials for a pod
 - `kiam_metadata_credential_encode_errors_total` - Number of errors encoding credentials for a pod
 - `kiam_metadata_find_role_errors_total` - Number of errors finding the role for a pod
@@ -51,7 +51,7 @@ daemonset status from kube-state-metrics & container metrics from cAdvisor if av
 - `kiam_sts_cache_hit_total` - Number of cache hits to the metadata cache
 - `kiam_sts_cache_miss_total` - Number of cache misses to the metadata cache
 - `kiam_sts_issuing_errors_total` - Number of errors issuing credentials
-- `kiam_sts_assumerole_timing_milliseconds` - Bucketed histogram of assumeRole timings
+- `kiam_sts_assumerole_timing_seconds` - Bucketed histogram of assumeRole timings
 - `kiam_sts_assumerole_current` - Number of assume role calls currently executing
 
 #### K8s Subsystem
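Because the histograms are now denominated in seconds, any dashboards or alerts built on the old `*_milliseconds` series need their expressions updated as well. As an illustrative sketch (the 5m window and 0.99 quantile are arbitrary choices, not values from this commit), a latency quantile over the renamed metric could look like:

```promql
histogram_quantile(0.99,
  sum(rate(kiam_metadata_handler_latency_seconds_bucket{handler="credentials"}[5m])) by (le))
```

The result is already in seconds, so no unit conversion is needed on the panel axis.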

docs/UPGRADING.md

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
+# Upgrading
+
+## v2 to v3
+
+Kiam changed significantly between v2.x and v3.0. The breaking changes are:
+
+* The gRPC API was changed. v3 Agent processes can only connect and communicate with v3 Server processes.
+* The Agent metadata proxy HTTP server now blocks access to any path other than those used for obtaining credentials.
+* The Server's handling of TLS has changed to remove the port from the Host. This requires certificates to name `kiam-server` rather than `kiam-server:443`, for example, so any previously issued certificates will likely need re-issuing.
+* The separate agent, server and health commands have been merged into a single `kiam` binary. This means that when upgrading the image reference, the command and arguments used will also need to change.
+* The Server now reports events to Pods, requiring additional RBAC privileges for the service account.
+
+We would suggest upgrading in the following way:
+
+1. Generate new TLS assets. You can use [docs/TLS.md](docs/TLS.md) to create new certificates, or use something like [cert-manager](https://github.com/jetstack/cert-manager) or [Vault](https://vaultproject.io). Given the TLS changes, make sure that your server certificate supports the names:
+   * `kiam-server`
+   * `kiam-server:443`
+   * `127.0.0.1`
+2. Create a new DaemonSet to deploy the v3 Server processes, using the new TLS assets created above. This ensures the new server processes run alongside the old servers. Once the v3 servers are running and passing their health checks you can proceed. **Please note that RBAC policy changes are required for the Server** and are documented in [deploy/server-rbac.yaml](deploy/server-rbac.yaml).
+3. Update the Agent DaemonSet to use the v3 image. Because the command has changed, be careful when making this change: the existing configuration will not work with v3. One option is to ensure your DaemonSet uses an `OnDelete` [update strategy](https://kubernetes.io/docs/tasks/manage-daemon/update-daemon-set/#daemonset-update-strategy): you can deploy new nodes running new agents connecting to new servers while leaving existing nodes as-is.
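Before rolling out new TLS assets, it can be worth checking that a certificate actually carries the three names listed above. The sketch below is illustrative only: it builds a throwaway self-signed certificate in memory with those names and verifies each one via Go's standard `crypto/x509` package; in practice you would parse your real PEM-encoded server certificate instead of calling the hypothetical `newSelfSignedCert` helper.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"net"
	"time"
)

// newSelfSignedCert builds a throwaway certificate carrying the names the
// v3 Server expects. Replace this with parsing your real certificate.
func newSelfSignedCert() *x509.Certificate {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		panic(err)
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "kiam-server"},
		NotBefore:    time.Now(),
		NotAfter:     time.Now().Add(24 * time.Hour),
		// The names required by the v3 upgrade notes.
		DNSNames:    []string{"kiam-server", "kiam-server:443"},
		IPAddresses: []net.IP{net.ParseIP("127.0.0.1")},
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		panic(err)
	}
	cert, err := x509.ParseCertificate(der)
	if err != nil {
		panic(err)
	}
	return cert
}

func main() {
	cert := newSelfSignedCert()
	// Verify each name the v3 Server and Agents will present or dial.
	for _, name := range []string{"kiam-server", "kiam-server:443", "127.0.0.1"} {
		if err := cert.VerifyHostname(name); err != nil {
			fmt.Printf("missing %q: %v\n", name, err)
		} else {
			fmt.Printf("ok: %s\n", name)
		}
	}
}
```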

docs/dashboard-prom.json

Lines changed: 17 additions & 16 deletions
Original file line numberDiff line numberDiff line change
@@ -972,14 +972,14 @@
972972
"min": null,
973973
"mode": "spectrum"
974974
},
975-
"dataFormat": "timeseries",
975+
"dataFormat": "tsbuckets",
976976
"datasource": "$datasource",
977977
"description": "Bucketed histogram of handler timings. Tagged by handler",
978978
"gridPos": {
979979
"h": 5,
980980
"w": 12,
981981
"x": 0,
982-
"y": 13
982+
"y": 24
983983
},
984984
"heatmap": {},
985985
"highlightCards": true,
@@ -990,8 +990,9 @@
990990
"links": [],
991991
"targets": [
992992
{
993-
"expr": "sum(increase(kiam_metadata_handler_latency_milliseconds_bucket{handler=\"credentials\"}[$interval])) by (le)",
994-
"format": "time_series",
993+
"expr": "sum(rate(kiam_metadata_handler_latency_seconds_bucket{handler=\"credentials\"}[$interval])) by (le)",
994+
"format": "heatmap",
995+
"interval": "",
995996
"intervalFactor": 2,
996997
"legendFormat": "{{le}}",
997998
"refId": "A",
@@ -1012,14 +1013,14 @@
10121013
"xBucketSize": null,
10131014
"yAxis": {
10141015
"decimals": null,
1015-
"format": "ms",
1016+
"format": "s",
10161017
"logBase": 1,
10171018
"max": null,
10181019
"min": null,
10191020
"show": true,
10201021
"splitFactor": null
10211022
},
1022-
"yBucketBound": "auto",
1023+
"yBucketBound": "upper",
10231024
"yBucketNumber": null,
10241025
"yBucketSize": null
10251026
},
@@ -1037,14 +1038,14 @@
10371038
"min": null,
10381039
"mode": "spectrum"
10391040
},
1040-
"dataFormat": "timeseries",
1041+
"dataFormat": "tsbuckets",
10411042
"datasource": "$datasource",
10421043
"description": "Bucketed histogram of handler timings. Tagged by handler",
10431044
"gridPos": {
10441045
"h": 5,
10451046
"w": 12,
10461047
"x": 12,
1047-
"y": 13
1048+
"y": 24
10481049
},
10491050
"heatmap": {},
10501051
"highlightCards": true,
@@ -1055,8 +1056,8 @@
10551056
"links": [],
10561057
"targets": [
10571058
{
1058-
"expr": "sum(increase(kiam_metadata_handler_latency_milliseconds_bucket{handler=\"roleName\"}[$interval])) by (le)",
1059-
"format": "time_series",
1059+
"expr": "sum(rate(kiam_metadata_handler_latency_seconds_bucket{handler=\"roleName\"}[$interval])) by (le)",
1060+
"format": "heatmap",
10601061
"interval": "",
10611062
"intervalFactor": 2,
10621063
"legendFormat": "{{le}}",
@@ -1084,7 +1085,7 @@
10841085
"show": true,
10851086
"splitFactor": null
10861087
},
1087-
"yBucketBound": "auto",
1088+
"yBucketBound": "upper",
10881089
"yBucketNumber": null,
10891090
"yBucketSize": null
10901091
},
@@ -1102,14 +1103,14 @@
11021103
"min": null,
11031104
"mode": "spectrum"
11041105
},
1105-
"dataFormat": "timeseries",
1106+
"dataFormat": "tsbuckets",
11061107
"datasource": "$datasource",
11071108
"description": "Bucketed histogram of assumeRole timings",
11081109
"gridPos": {
11091110
"h": 6,
11101111
"w": 24,
11111112
"x": 0,
1112-
"y": 18
1113+
"y": 29
11131114
},
11141115
"heatmap": {},
11151116
"highlightCards": true,
@@ -1120,8 +1121,8 @@
11201121
"links": [],
11211122
"targets": [
11221123
{
1123-
"expr": "sum(increase(kiam_sts_assumerole_timing_milliseconds_bucket[$interval])) by (le)",
1124-
"format": "time_series",
1124+
"expr": "sum(rate(kiam_sts_assumerole_timing_seconds_bucket[$interval])) by (le)",
1125+
"format": "heatmap",
11251126
"intervalFactor": 2,
11261127
"legendFormat": "{{le}}",
11271128
"refId": "A",
@@ -1142,7 +1143,7 @@
11421143
"xBucketSize": null,
11431144
"yAxis": {
11441145
"decimals": null,
1145-
"format": "ms",
1146+
"format": "s",
11461147
"logBase": 1,
11471148
"max": null,
11481149
"min": null,

pkg/aws/metadata/metrics.go

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -9,7 +9,7 @@ var (
99
prometheus.HistogramOpts{
1010
Namespace: "kiam",
1111
Subsystem: "metadata",
12-
Name: "handler_latency_milliseconds",
12+
Name: "handler_latency_seconds",
1313
Help: "Bucketed histogram of handler timings",
1414

1515
// 1ms to 5min

pkg/aws/sts/metrics.go

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -34,7 +34,7 @@ var (
3434
prometheus.HistogramOpts{
3535
Namespace: "kiam",
3636
Subsystem: "sts",
37-
Name: "assumerole_timing_milliseconds",
37+
Name: "assumerole_timing_seconds",
3838
Help: "Bucketed histogram of assumeRole timings",
3939

4040
// 1ms to 5min
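The `// 1ms to 5min` comments above suggest exponentially spaced buckets; the rename only changes the unit the observations are recorded in. As a rough sketch of what seconds-denominated boundaries over that range look like (the start, growth factor, and count below are assumptions for illustration, not the values in this commit; `exponentialBuckets` mirrors the shape of the Prometheus client's `ExponentialBuckets` helper):

```go
package main

import "fmt"

// exponentialBuckets returns `count` boundaries starting at `start`,
// each `factor` times the previous one.
func exponentialBuckets(start, factor float64, count int) []float64 {
	buckets := make([]float64, count)
	next := start
	for i := range buckets {
		buckets[i] = next
		next *= factor
	}
	return buckets
}

func main() {
	// Assumed values: start at 1ms (0.001s) and grow until roughly 5min.
	for _, b := range exponentialBuckets(0.001, 2.5, 15) {
		fmt.Printf("%gs\n", b)
	}
}
```

Expressing the boundaries in seconds keeps the metric consistent with Prometheus naming conventions, which is what the dashboard and docs changes in this commit follow.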
