-
Notifications
You must be signed in to change notification settings - Fork 236
Address goroutine leak with dynamically determined log destinations #1848
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
for e := range eventsCh { | ||
err := dest.Publish([]LogEvent{e}) | ||
if err == ErrOutputStopped { | ||
log.Printf("I! [logagent] Log destination %v has stopped, finalizing %v/%v", l.destNames[dest], src.Group(), src.Stream()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was actually a concurrent map write condition here. This non-sync map is written to in the Run
routine and is potentially read from the runSrcToDest
goroutine.
destination := fileconfig.Destination | ||
if destination == "" { | ||
destination = t.Destination | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This variable is unused
mergeChan := make(chan logs.LogEvent) | ||
|
||
// Merge events from both blocking and non-blocking channel | ||
go func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved this to a named function just to make the pprof output a little nicer. Since this is an unnamed function, the name of the function in the output is a bit obfuscated start.func1+0xcf
:
github.com/aws/amazon-cloudwatch-agent/plugins/outputs/cloudwatchlogs/internal/pusher.(*queue).start.func1+0xcf github.com/aws/amazon-cloudwatch-agent/plugins/outputs/cloudwatchlogs/internal/pusher/queue.go:117
Still findable from the line number, just not super obvious
52f58b5
to
2114e27
Compare
2114e27
to
a2c0747
Compare
func (c *CloudWatchLogs) Write(metrics []telegraf.Metric) error { | ||
for _, m := range metrics { | ||
c.writeMetricAsStructuredLog(m) | ||
} | ||
return nil | ||
// we no longer expect this to be used. We now use the OTel awsemfexporter for sending EMF metrics to CloudWatch Logs | ||
return fmt.Errorf("unexpected call to Write") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To make cwDest thread-safe more easily, I removed this functionality. This function is here to adhere to the Telegraf interface, but it is no longer used. This functionality was for pushing EMF metrics to CloudWatch Logs using this telegraf-based output plugin. However, the agent no longer uses this plugin to push EMF metrics. EMF metrics now go through the OTel awsemfexporter plugin.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're removing this function, should we also remove cwDest.switchToEMF
return client | ||
} | ||
|
||
func (c *CloudWatchLogs) writeMetricAsStructuredLog(m telegraf.Metric) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only used by Write
which is no longer used
return "Configuration for AWS CloudWatchLogs output." | ||
} | ||
|
||
func (c *CloudWatchLogs) getLogEventFromMetric(metric telegraf.Metric) *structuredLogEvent { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only used by Write
which is no longer used
## Amazon REGION | ||
region = "us-east-1" | ||
type structuredLogEvent struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only used by Write
which is no longer used
} | ||
|
||
// Description returns a one-sentence description on the Output | ||
func (c *CloudWatchLogs) Description() string { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved these two CloudWatchLogs functions to the CloudWatchLogs block (above cwDest). Didn't make much sense to define these these functions separately from the rest.
Also, when possible, can you run the integ tests? |
Was having trouble with the log tailer unit test on Windows again... Eventually worked after a couple of retries. Added the tag and kicked off integ tests. |
This PR was marked stale due to lack of activity. |
func (c *CloudWatchLogs) Write(metrics []telegraf.Metric) error { | ||
for _, m := range metrics { | ||
c.writeMetricAsStructuredLog(m) | ||
} | ||
return nil | ||
// we no longer expect this to be used. We now use the OTel awsemfexporter for sending EMF metrics to CloudWatch Logs | ||
return fmt.Errorf("unexpected call to Write") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're removing this function, should we also remove cwDest.switchToEMF
} | ||
|
||
// eventsCh has been closed meaning the src has been stopped | ||
dest.NotifySourceStopped() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this use a defer after the defer src.Stop()
at the start of the function? What if the eventsCh
didn't close, but the runToDest
function exits? Wouldn't that result in an inconsistent refCount
and prevent it from ever closing the pusher?
wg *sync.WaitGroup | ||
} | ||
|
||
var _ (Queue) = (*queue)(nil) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Not necessary because the constructor should already enforce this.
} | ||
|
||
func (s *senderPool) Stop() { | ||
// workerpool is stopped by the plugin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this still stop the underlying sender? I think if this doesn't, then nothing else will.
} | ||
|
||
func (m *mockSender) Stop() { | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Think it'd be worth asserting this is called when the pusher is stopped in one of the tests.
func (s *sender) Stop() { | ||
if s.stopped { | ||
return | ||
} | ||
close(s.stopCh) | ||
s.stopped = true | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not thread-safe, but likely not an issue since the sender should only be stopped from a single thread.
Description of the issue
There are two primary issues causing leaking goroutines to drive up CloudWatch Agent memory usage over time.
Everliving Destinations
The CloudWatch Agent can publish logs from log files to CloudWatch Log Streams determined by the file name. Each log file the agent is reading from creates a "source" object (specifically
tailerSrc
type) with several running goroutines, and each log stream that the CloudWatch Agent is pushing logs to creates a "destination" object (specificallycwDest
type) with several running goroutines. The source objects' goroutines are closed when the associated log file is closed, but the destination objects are never subsequently cleaned up. This causes a memory leak as the goroutines are never closed and keep piling up.Dynamically generated log stream names can be generated when using the
publish_multi_logs
flag. Here's is a sample entry in the collect list:For this config, the agent will periodically look for log files that match the
/tmp/test_logs/publish_multi_logs_*.log
glob pattern. For each one that it finds, it will write the contents of that file totest-log-rotation/rotiatest-test-stream-<logfilename>
loggroup/logstream. So each file it finds will create one new source object and one new destination object.It's important to note that there may be several source objects referencing one destination. For example, a customer could use the following in their collect list to collect logs from different files but push to the same destination:
Duplicate cloudwatch logs clients
Each time a destination object is created, a new cloudwatch logs client is created. Request/response handlers are injected into the client to collect agent health metrics. These handlers have more underlying clients which have their own goroutines and caches. The underlying cache is a way to associate request data with response data, like Payload size and latency, which are eventually forwarded to the agent health metrics extension. The handlers are created by a middleware configurer. Each time a new cloudwatch logs client is created, a new middleware configurer is created, which creates new request/response handlers and spins up more goroutines. These clients have no way to be closed so the goroutines are never closed and keep piling up.
Description of changes
There are two sets of changes to address the two issues noted above.
Reference Counting Destinations
Add reference counting to the destination objects and a notification system for the source objects to tell the destinations it's no longer being used. When the destination object is no longer used, it stops itself and tells the CloudWatchLogs plugin it's no longer usable.
This implementation assumes that nothing other than the source objects are using the destination objects. There is some vestigial code left over from when the CloudWatchLogs telegraf plugin was used to process EMF metrics which could use the destination objects, but that functionality has been moved to the OTel awsemfexporter plugin. So that code path is more or less unreachable. I removed all functionality in the unused code path to make it easier to make cwDest thread-safe.
As mentioned previously, it is important to note that it is possible for multiple source objects to reference one destination. This possibility of sharing means source objects cannot close the destinations directly and a signaling mechanism is necessary.
Using the above JSON as an example, the agent will create two sources for each of the log files and one destination object for the
test-log-rotation/shared-destination-stream
loggroup/logstream. When the fileshared_destination_to_close.txt
is closed, the source object will notify the destination object that the source has closed and it's no longer being referenced. The destination will remain open since it knows theshared_destination.txt
is still using it. Ifshared_destination.txt
is closed, then the destination will then know it's no longer being used and will close itself, terminating the associated goroutines.Single Middleware Configurer
Create one middleware configurer (per CloudWatchLogs instance, but it's a singleton, so there's only ever one configurer). These request/response handlers are all identical, so there's really no need to create new ones, nor is there a need to create a new middleware configurer.
The one side effect is that destination objects will share request/response handlers, which means there is only one cache to support all request/responses. There are no concurrency concerns here as the request/response handlers already support concurrent request/responses.
License
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
Tests
Integration Tests
https://github.com/aws/amazon-cloudwatch-agent/actions/runs/17582601756
Manual Tests
Configure the agent with publish_multi_logs flag. Compare memory usage over time with baseline agent. With the baseline agent, the memory usage when from ~7% to ~13% in 7 hours. With the modified agent, there is no noticeable increase in memory usage after 3 days.
Additionally, tested reading logs from multiple files and writing to the same destination and then closing one of the log files. This tests that when the source log file is closed, the associated destination isn't closed until all sources are closed.
Baseline Agent
Modified Agent
Configure agent with the following configuration:
Use the following python script to generate log files that match the configuration:
Requirements
Before commiting your code, please do the following steps.
make fmt
andmake fmt-sh
make lint
Integration Tests
To run integration tests against this PR, add the
ready for testing
label.