-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reformat procfs scraping to handle errors #1178
base: main
Are you sure you want to change the base?
Conversation
…nown errors Signed-off-by: Caleb Metz <[email protected]>
Signed-off-by: Caleb Metz <[email protected]>
&mut self, | ||
process: Process, | ||
aggr: &mut memory::smaps_rollup::Aggregator, | ||
) -> Result<(), Error> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any codepath that returns an error? I'm not seeing any.
My thought here is that we need to explicitly handle errors when reading process information, ie, avoid the ?
operator. So if we make this process not have a return type, we could avoid any accidental error handling that uses ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm yeah I think I see what you are saying. I chose this implementation specifically to handle ?
without having to explicitly handle errors. Mostly just because the proc_exe(), proc_comm(), and proc_cmdline() all can error if the pid had exited and I thought it was cleaner to just blanket catch any of those failures rather then add messiness catching different failures that are all resulting from the same root cause (the pid exiting)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very open to changing this tho to catch every error explicitly, just thought this may be easier to add other code later in handle_process without having to risk observer crashes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
proc_exe(), proc_comm(), and proc_cmdline() all can error if the pid had exited and I thought it was cleaner to just blanket catch any of those failures rather then add messiness catching different failures that are all resulting from the same root cause (the pid exiting)
What makes these failures different from, eg process.status()
failing?
In my understanding, any and all errors that could be caused by a process exiting during execution of handle_process
should be explicitly caught and handled by ignoring and return
/return Ok(())
.
If there are errors that can occur that are not caused by the process exiting early, then those would be a valid reason to bubble an error out of this function, but I don't think process-exit-errors should escape this function, otherwise what is the point of extracting into handle_process
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yeah I see what you are saying. It is the same type of failure as process.status()
. I will update the code to catch what I expect to be specific "pid-exit" errors. The only exception to that I see is the uptime poll since that is more broad than a specific process
What does this PR do?
Extracts all the per-pid operations into a helper function so that if we cant read something for a given pid it doesnt blow up the whole observer.
Motivation
Processes exiting mid poll loop can cause numerous errors, all of which, if not handled properly, polute the collection of data from active processes.
Related issues
https://datadoghq.atlassian.net/browse/SMPTNG-574
First attempts at this in #1175 and #1142
After some discussion we thought it better to just catch all per-pid related errors rather than keeping up with new errors on individual functions related to collecting per process information.
Validating the change in this notebook