Add log tailer for MPS control logs #575

Merged
merged 1 commit into from
Mar 13, 2024

@elezar (Member) commented Mar 6, 2024

This change adds a background tail -n +1 -f command for the control.log associated with an MPS daemon.

Its output is written to stdout so that the MPS control daemon logs are also available in the kubectl logs output.

I0313 19:23:55.081285      69 daemon.go:131] "Starting log tailer" resource="nvidia.com/gpu"
[2024-03-13 19:23:55.063 Control    98] Starting control daemon using socket /mps/nvidia.com/gpu/pipe/control
[2024-03-13 19:23:55.063 Control    98] To connect CUDA applications to this daemon, set env CUDA_MPS_PIPE_DIRECTORY=/mps/nvidia.com/gpu/pipe
...
[2024-03-13 19:24:23.903 Control    98] NEW UI
[2024-03-13 19:24:23.903 Control    98] Cmd:get_default_active_thread_percentage
[2024-03-13 19:24:23.903 Control    98] 10.0
[2024-03-13 19:24:23.903 Control    98] UI closed
I0313 19:24:40.850399      69 main.go:144] Received signal "terminated", shutting down.
I0313 19:24:40.850462      69 main.go:214] Stopping MPS daemons.
[2024-03-13 19:24:40.870 Control    98] Accepting connection...
[2024-03-13 19:24:40.870 Control    98] NEW UI
[2024-03-13 19:24:40.871 Control    98] Cmd:quit
[2024-03-13 19:24:40.871 Control    98] Exit with status 0
I0313 19:24:40.871316      69 daemon.go:145] "Stopped MPS control daemon" resource="nvidia.com/gpu"
I0313 19:24:40.886798      69 daemon.go:148] "Stopped log tailer" resource="nvidia.com/gpu" error="signal: killed"

@elezar elezar requested review from klueska and cdesiniotis March 6, 2024 12:18
@elezar elezar self-assigned this Mar 6, 2024
@elezar elezar requested a review from tariq1890 March 7, 2024 12:23
klog.ErrorS(err, "Failed to stop log tailer")
}

err := d.tail.Wait()
Contributor

Question -- what happens if an error is returned when sending the kill signal to the tail process? Should we skip this Wait() call in that case? I would expect the tail process to eventually get reaped anyways once this container exits.

Contributor

Yes, I was going to ask the same thing.

Member Author

Yes, that probably makes sense. I have updated the implementation to not call Wait in the event of an error (and also switched to SIGTERM).

Contributor

See my comment here which avoids this entirely.


if d.tail != nil {
	klog.InfoS("Stopping log tailer", "resource", d.rm.Resource())
	if err := d.tail.Process.Signal(os.Kill); err != nil {
Contributor

Do we want os.Kill or os.Term?

Member Author

os.Term is probably better as it allows tail to clean things up.

Does it make sense to send os.Term and then os.Kill if that fails?

Contributor

It might be better to create the command with a CommandContext() rather than a Command() and then just call cancel() on that context when we want the process to exit. With that we can track d.tailCancel instead of d.tail and just call d.tailCancel() here unconditionally.

So something like this earlier on:

	ctx, cancel := context.WithCancel(context.Background())
	d.tailCancel = cancel

	tail = exec.CommandContext(ctx, "tail", "-n", "+1", "-f", filepath.Join(logDir, "control.log"))
	tail.Stdout = os.Stdout
	tail.Stderr = os.Stderr

	if err := tail.Start(); err != nil {
		klog.ErrorS(err, "Could not start tail command on control.log; ignoring logs")
	}

Member Author

That does sound cleaner. Let me update.

@elezar elezar force-pushed the add-log-tailer branch 2 times, most recently from e4e624b to 2c3772b Compare March 13, 2024 14:46
@elezar elezar requested review from cdesiniotis and klueska March 13, 2024 14:47
Comment on lines 155 to 156
klog.InfoS("Stopping log tailer", "resource", d.rm.Resource())
d.tailCancel()
Contributor

Thinking about this a bit more (after looking at it), we should probably still track d.tail and call d.tail.Wait() after the cancel(). It is guaranteed to return and we can check / log its error (if there is one).

Contributor

Maybe call it d.tailCmd though, so as to be symmetric with d.tailCancel.

@elezar elezar requested a review from klueska March 13, 2024 19:25
@elezar (Member Author) commented Mar 13, 2024

@klueska a minor simplification since the last review. Also ran some tests and updated the description with the output.

@elezar elezar merged commit 6199a40 into NVIDIA:main Mar 13, 2024
6 checks passed
@elezar elezar deleted the add-log-tailer branch March 13, 2024 19:38