Reformat procfs scraping to handle errors #1178

cmetz100 · 2024-12-19T21:15:43Z

What does this PR do?

Extracts all the per-pid operations into a helper function so that if we can't read something for a given pid, it doesn't blow up the whole observer.
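A minimal sketch of the pattern described above, not lading's actual code: the per-pid reads live in one helper that may use `?`, and the poll loop treats any failure as "skip this pid". The names `Sample`, `read_comm`, `handle_pid`, and `poll_once`, and the `proc_root` parameter (which lets the example run against a fake directory), are hypothetical.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical per-process sample; the real observer collects far more.
#[derive(Debug)]
struct Sample {
    pid: u32,
    comm: String,
}

/// Read `<proc_root>/<pid>/comm`. Fails if the pid's directory is gone.
fn read_comm(proc_root: &Path, pid: u32) -> io::Result<String> {
    let raw = fs::read_to_string(proc_root.join(pid.to_string()).join("comm"))?;
    Ok(raw.trim_end().to_string())
}

/// All per-pid work goes here, so `?` can only abort this one pid.
fn handle_pid(proc_root: &Path, pid: u32) -> io::Result<Sample> {
    let comm = read_comm(proc_root, pid)?;
    Ok(Sample { pid, comm })
}

/// One poll pass: a failing pid is counted and skipped, never fatal.
fn poll_once(proc_root: &Path, pids: &[u32]) -> (Vec<Sample>, usize) {
    let mut samples = Vec::new();
    let mut skipped = 0;
    for &pid in pids {
        match handle_pid(proc_root, pid) {
            Ok(sample) => samples.push(sample),
            Err(_) => skipped += 1,
        }
    }
    (samples, skipped)
}
```

Pointing `poll_once` at `/proc` with a list of pids would skip any pid whose `comm` file has disappeared while still returning samples for the rest.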

Motivation

Processes exiting in the middle of a poll loop can cause numerous errors which, if not handled properly, pollute the data collected from active processes.

Related issues

https://datadoghq.atlassian.net/browse/SMPTNG-574

First attempts at this were in #1175 and #1142.
After some discussion, we thought it better to catch all per-pid errors rather than keep up with new errors from the individual functions that collect per-process information.

Validating the change in this notebook


scottopell · 2025-01-03T19:00:36Z

lading/src/observer/linux/procfs.rs

+        &mut self,
+        process: Process,
+        aggr: &mut memory::smaps_rollup::Aggregator,
+    ) -> Result<(), Error> {


Is there any codepath that returns an error? I'm not seeing any.

My thought here is that we need to explicitly handle errors when reading process information, i.e., avoid the ? operator. So if we make this function not have a return type, we could avoid any accidental error propagation through ?.
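One way to picture this suggestion, as a sketch under assumptions rather than the code in the PR: when the helper returns `()`, a stray `?` fails to compile, so each fallible read has to be handled on the spot. The `Aggregator` struct here is a stand-in for illustration (not `memory::smaps_rollup::Aggregator`), and the `proc_root` and `pid` parameters are hypothetical.

```rust
use std::fs;
use std::path::Path;

/// Stand-in aggregator for illustration only.
#[derive(Default, Debug)]
struct Aggregator {
    comms: Vec<String>,
    cmdline_bytes: usize,
}

/// Returns `()`, so using `?` inside this function is a compile error.
/// Every fallible read must be matched explicitly.
fn handle_process(proc_root: &Path, pid: u32, aggr: &mut Aggregator) {
    let dir = proc_root.join(pid.to_string());

    // A read failure is treated as "the pid went away": skip it quietly.
    let Ok(comm) = fs::read_to_string(dir.join("comm")) else {
        return;
    };
    let Ok(cmdline) = fs::read(dir.join("cmdline")) else {
        return;
    };

    // Record data only after every read succeeded, so a pid that exits
    // halfway through leaves no partial entry behind.
    aggr.comms.push(comm.trim_end().to_string());
    aggr.cmdline_bytes += cmdline.len();
}
```

Updating the aggregator only at the end is a design choice in this sketch: it keeps half-read processes out of the collected data, which is the pollution the PR description is concerned with.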

cmetz100 · 2025-01-03

Hm, yeah, I think I see what you are saying. I chose this implementation specifically so that ? could be used without having to handle each error explicitly. Mostly because proc_exe(), proc_comm(), and proc_cmdline() can all error if the pid has exited, and I thought it was cleaner to blanket-catch any of those failures rather than add messiness catching different failures that all result from the same root cause (the pid exiting).

Very open to changing this, though, to catch every error explicitly. I just thought this might make it easier to add other code to handle_process later without risking observer crashes.

scottopell · 2025-01-03

> proc_exe(), proc_comm(), and proc_cmdline() can all error if the pid has exited, and I thought it was cleaner to blanket-catch any of those failures rather than add messiness catching different failures that all result from the same root cause (the pid exiting)

What makes these failures different from, e.g., process.status() failing?

In my understanding, any and all errors that could be caused by a process exiting during execution of handle_process should be explicitly caught and handled by ignoring them and returning early (return or return Ok(())).

If there are errors that can occur that are not caused by the process exiting early, then those would be a valid reason to bubble an error out of this function, but I don't think process-exit-errors should escape this function, otherwise what is the point of extracting into handle_process?
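The distinction drawn here can be sketched as follows. This is an illustration under assumptions, not the PR's implementation: treating `NotFound` and the Linux `ESRCH` errno as the "process exited" signals is a guess made for the example, since the thread does not enumerate which errors qualify. The names `is_process_exit` and `read_status` and the `proc_root` parameter are hypothetical.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Linux errno for "No such process". Hard-coded to keep the sketch
/// dependency-free; assumed here to be one sign that the pid is gone.
const ESRCH: i32 = 3;

/// Assumption: these two conditions mean the process exited mid-poll.
fn is_process_exit(err: &io::Error) -> bool {
    err.kind() == io::ErrorKind::NotFound || err.raw_os_error() == Some(ESRCH)
}

/// Read `<proc_root>/<pid>/status`.
/// - `Ok(Some(..))`: the read succeeded.
/// - `Ok(None)`: the process exited; the caller should skip it.
/// - `Err(..)`: some other failure, which bubbles up to the caller.
fn read_status(proc_root: &Path, pid: u32) -> Result<Option<String>, io::Error> {
    match fs::read_to_string(proc_root.join(pid.to_string()).join("status")) {
        Ok(contents) => Ok(Some(contents)),
        Err(e) if is_process_exit(&e) => Ok(None),
        Err(e) => Err(e),
    }
}
```

With this shape, a caller can still use `?` on `read_status` without letting process-exit errors escape, because those have already been folded into `Ok(None)`.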

cmetz100 · 2025-01-03

Ah, yeah, I see what you are saying. It is the same type of failure as process.status(). I will update the code to catch what I expect to be specific "pid-exit" errors. The only exception I see is the uptime poll, since that is broader than a specific process.

Extracted per process operations into a function to better handle unk…

394095d

…nown errors Signed-off-by: Caleb Metz <[email protected]>

cmetz100 added the no-changelog label Dec 19, 2024

Make clippy happy

7d85e52

Signed-off-by: Caleb Metz <[email protected]>

cmetz100 marked this pull request as ready for review January 3, 2025 18:48

cmetz100 requested a review from a team as a code owner January 3, 2025 18:48

scottopell reviewed Jan 3, 2025
