Diego v2.42.0
Resources
- Download release v2.42.0 from bosh.io.
- Verified with cloudfoundry/cf-deployment @ 8586950a55497c912c2ad109a93383c68ed34e51.
Changes from v2.41.0 to v2.42.0
Significant changes
Per-Instance Proxy
- Envoy proxy binary bumped to 4e2f2022af592ef13afd325988262789db577f3b
App Logging and Metrics
To address the app "noisy neighbor" problem, an app log rate limiting feature is included in this diego-release.
- The configurable app log rate limit is exposed as the diego.executor.max_log_lines_per_second property in diego's rep sub-component.
- The default value of the property is 0, which means no rate limit is enabled.
- The rate limit can be set to the same or a different value on a per-diego-cell instance-group basis.
  - In a deployment with an Isolation Segment, Windows, and standard diego-cell instance group, each instance group can be configured separately with a distinct diego.executor.max_log_lines_per_second value.
  - To set the same rate limit across the entire foundation in the 3 diego-cell instance group deployment scenario described above, the rep diego.executor.max_log_lines_per_second property must be set explicitly in the deployment manifest for each instance group.
- For additional details regarding the implementation of the log rate limiting feature, see the Log Rate Limit Feature Implementation Details section at the bottom of these release notes.
- The stories associated with the feature:
- As a platform operator I would like to be able to enable and set a rate limit for app log-lines/second on the diego-cells and/or isolated diego-cells, so I can be sure no spurious "chatty" app could cause log loss for other apps on the same cell
- As a platform operator I would like to be able to observe a metric revealing the cumulative number of app instances that have exceeded their log-lines-per-second rate limit over the last 5 minutes so I can tune the rate limit I've set on the platform and/or become aware I'll need to work with the offending app developers to decrease the volume of logs being generated by their apps
- As an app developer I would like to see a log entry in my app log-stream when an app instance has exceeded its log-lines-per-second rate limit so I can make appropriate changes to my app to reduce the volume of log generation and/or reach out to the platform administrator to update the log-rate-limit for the platform
- As an app developer, I expect that when an app instance that has exceeded its max-log-lines-per-second rate limit (when set on the platform) is stopped, any logs that have not been written to the CF app log stream are dropped, so my app's logs are not polluted with logs from an instance that is no longer running
Component Logging and Metrics
- cloudfoundry/auction #6: Change cellStates logging to DEBUG
- cloudfoundry/auction #7: Reduce logging of fetched-cell-state to debug
BOSH property changes
rep and rep_windows
- New: diego.executor.max_log_lines_per_second: Maximum log lines allowed per second per app instance (default: 0)
  - The default value of 0 disables rate limiting
  - Minimum recommended value, if set: 100
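As a rough illustration of these property semantics (not Diego's actual code), the sketch below shows how a configured max_log_lines_per_second value could be mapped to a token-bucket limiter using the golang.org/x/time/rate package, with 0 meaning "no limit". The helper name newLogLimiter is made up for the example.

```go
package main

import (
	"fmt"

	"golang.org/x/time/rate"
)

// newLogLimiter is a hypothetical helper that maps the
// diego.executor.max_log_lines_per_second property value to a
// token-bucket limiter. A value of 0 (the default) disables limiting.
func newLogLimiter(maxLogLinesPerSecond int) *rate.Limiter {
	if maxLogLinesPerSecond <= 0 {
		// rate.Inf: every log line is allowed immediately, i.e. no limit.
		return rate.NewLimiter(rate.Inf, 0)
	}
	// Refill tokens at the configured rate and allow a burst of the same
	// size, so short spikes of logging are tolerated.
	return rate.NewLimiter(rate.Limit(maxLogLinesPerSecond), maxLogLinesPerSecond)
}

func main() {
	limiter := newLogLimiter(100)                        // the minimum recommended value
	fmt.Println("first line allowed:", limiter.Allow()) // true while tokens remain
}
```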
Log Rate Limit Feature Implementation Details
The implementation used by Diego for the app log rate limiting feature is the golang rate limiting library.
The library essentially implements the [Token Bucket algorithm](https://en.wikipedia.org/wiki/Token_bucket).
- If an app exceeds its log rate limit, the rate at which the app's logs are pushed into the logging system is limited (slowed down), with the excess held in a buffer up to the buffer limit.
- If/when the buffer limit is hit and the app is still logging, the extra log messages are dropped. Only logs generated by the app instance that is exceeding the rate limit are dropped; other apps colocated on the same cell as the "noisy neighbor" are not affected.
- If the noisy neighbor app recovers to a logging rate below the limit set on the platform before the buffer limit is hit (for instance, after a short burst of extremely high log message generation), all of the messages are still printed out, albeit with a slight delay.
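To make the delay-then-drop behavior above concrete, here is a minimal sketch (not Diego's source) of a rate-limited log stream built on golang.org/x/time/rate: lines queue in a fixed-size buffer while the token bucket paces delivery, and lines arriving once the buffer is full are dropped. The buffer size, type, and function names are assumptions made for the example.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

const bufferSize = 50 // assumed buffer size, not Diego's actual value

// rateLimitedStream paces log lines with a token bucket. Lines that arrive
// while the buffer is full are dropped; everything else is delivered,
// possibly with a delay.
type rateLimitedStream struct {
	limiter *rate.Limiter
	buffer  chan string
}

func newRateLimitedStream(maxLinesPerSecond int) *rateLimitedStream {
	s := &rateLimitedStream{
		limiter: rate.NewLimiter(rate.Limit(maxLinesPerSecond), maxLinesPerSecond),
		buffer:  make(chan string, bufferSize),
	}
	go s.drain()
	return s
}

// Write enqueues a line, or drops it if the buffer is already full.
func (s *rateLimitedStream) Write(line string) {
	select {
	case s.buffer <- line:
	default:
		// Buffer full: only this instance's excess lines are dropped.
	}
}

// drain forwards buffered lines at the configured rate.
func (s *rateLimitedStream) drain() {
	for line := range s.buffer {
		// Wait blocks until a token is available, slowing delivery down
		// to the configured lines-per-second rate.
		_ = s.limiter.Wait(context.Background())
		fmt.Println(line) // stand-in for emitting to the CF app log stream
	}
}

func main() {
	s := newRateLimitedStream(10)
	for i := 0; i < 100; i++ {
		s.Write(fmt.Sprintf("app log line %d", i))
	}
	time.Sleep(2 * time.Second) // give the drain goroutine time to run
}
```

The non-blocking send in Write is what keeps a noisy instance from affecting its neighbors: its excess lines are discarded once the buffer is full rather than queued without bound, while lines that do fit in the buffer are delivered with a delay.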