Never-ending tasks stop showing up after a while: Message too big #580

Open
sergioabreu-g opened this issue Aug 23, 2024 · 2 comments
Labels
S-bug Severity: bug

Comments

@sergioabreu-g

What crate(s) in this repo are involved in the problem?

tokio-console, console-subscriber

What is the issue?

My application has some tasks that are always running, so they never get completed. I'm using tokio-console so I can debug my process at any time, but I'm facing a huge issue here. After a while, these tasks stop showing up on tokio-console, and I get the following error in my log file: ERROR Message too big. Start with smaller retention. min_retention=1s.

As far as I know, retention should only affect how long completed tasks are held in memory, at least that's what the documentation says. Also, I'm using a custom retention of 60 seconds, but these tasks keep showing up in tokio-console for much longer (about 1 hour or maybe a bit less), so there doesn't seem to be a correlation.

Am I missing something here? Is this a limitation of the crate or am I doing something wrong?

Thank you!

How can the bug be reproduced?

  • Set up tokio-console in your project.
  • Spawn some never-ending tasks using tokio::spawn.
  • Launch the process and wait for about 1h.
  • Connect to the tokio-console server. The tasks won't show up and an error will be logged.

Logs, error output, etc

ERROR Message too big. Start with smaller retention. min_retention=1s

Versions

│ ├── console-subscriber v0.4.0
│ │ ├── console-api v0.8.0
│ ├── console-subscriber v0.4.0 (*)

tokio-console 0.1.12

Possible solution

No response

Additional context

No response

Would you like to work on fixing this bug?

yes

@sergioabreu-g sergioabreu-g added the S-bug Severity: bug label Aug 23, 2024
@sergioabreu-g
Author

sergioabreu-g commented Aug 23, 2024

Looking at the code, it seems like the console-subscriber starts reducing the retention when the message it has to send to tokio-console is too big. Since retention doesn't affect uncompleted tasks, which in this case are the ones responsible for exceeding the maximum message size, it gets reduced to its minimum but the message with the tasks' information is never sent.
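
To make the failure mode concrete, here is a simplified, std-only model of that behaviour. It is a sketch of my reading, not the actual console-subscriber code: `fit_by_shrinking_retention`, `FitResult`, the byte counts, and the halving step are all assumptions made for illustration.

```rust
use std::time::Duration;

/// Outcome of trying to fit the initial update under the size limit.
#[derive(Debug, PartialEq)]
enum FitResult {
    /// The update fits; carries the retention that ended up in effect.
    Fits { retention: Duration },
    /// Retention reached its floor and the update is still too big.
    TooBig { retention: Duration },
}

/// Hypothetical model: `live_bytes` belongs to tasks that are still running,
/// `closed` holds (age, bytes) for completed tasks kept for `retention`.
fn fit_by_shrinking_retention(
    live_bytes: usize,
    closed: &[(Duration, usize)],
    mut retention: Duration,
    min_retention: Duration,
    max_message_bytes: usize,
) -> FitResult {
    loop {
        // Only completed tasks older than `retention` are dropped;
        // running tasks are always included.
        let closed_bytes: usize = closed
            .iter()
            .filter(|(age, _)| *age <= retention)
            .map(|(_, bytes)| *bytes)
            .sum();
        if live_bytes + closed_bytes <= max_message_bytes {
            return FitResult::Fits { retention };
        }
        // Assumed shrink step: halve the retention, but never go below the floor.
        let halved = retention / 2;
        if halved < min_retention {
            return FitResult::TooBig { retention };
        }
        retention = halved;
    }
}
```

In this model, once `live_bytes` alone exceeds `max_message_bytes`, no amount of shrinking helps and the function always ends in `TooBig`, which matches what I'm seeing with never-ending tasks.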

I don't think forcefully reducing the retention is the right way to reduce message sizes, especially since it isn't communicated to the user at any point, so you might end up with a lower retention than the one you configured without knowing it.

Could message size be reduced in any other way? Splitting it up and making multiple sends sounds like the most straightforward solution, but I'm not an expert on this crate so I have no idea of its limitations and whether that's feasible to implement within the current codebase.

I'd be willing to make the necessary changes but I'd need guidance from some experienced contributor to this crate.

@hds
Collaborator

hds commented Aug 27, 2024

@sergioabreu-g Thanks for your report!

Given the error message you've received, I think that you're hitting the second case where nothing will be sent at all because console-subscriber has already tried to reduce the retention below the minimum (which is the publishing interval).

I actually wonder whether any of this works properly at all, since cleaning up closed items depends on them not being dirty, but if those objects haven't been sent to any client then they're all dirty and so maybe nothing is being cleaned up at all...

Certainly, a mechanism that chunks the initial state so that it can be sent bit by bit to the client would be a more robust and generally better solution.

I'd be more than happy to provide guidance.

We would have to create a better message builder. One that somehow takes the maximum size into account. The current message construction happens here (for the initial state update):

let update = proto::instrument::Update {
    task_update: Some(self.task_update(Include::All)),
    resource_update: Some(self.resource_update(Include::All)),
    async_op_update: Some(self.async_op_update(Include::All)),
    now: Some(self.base_time.to_timestamp(now)),
    new_metadata: Some(proto::RegisterMetadata {
        metadata: (*self.all_metadata).clone(),
    }),
};

It adds everything.

I see 2 possible strategies:

  1. Fill up with objects of a specific type until the limit and send that (including some new flag that this is a partial update). The order in the code would likely make sense: tasks, resources, async ops.
  2. Send all object types up to a certain point in time, then send the next chunk of time, and so on. Again, with some partial update flag.
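
As a rough illustration of strategy 1, a size-aware builder could look something like the sketch below. None of these names exist in console-subscriber today: `Kind`, `Item`, `Chunk`, `chunk_initial_state`, and the `partial` flag are hypothetical, and `encoded_len` stands in for whatever size the real protobuf encoding would report.

```rust
/// Object kinds, declared in the order the current code adds them.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Kind {
    Task,
    Resource,
    AsyncOp,
}

/// Hypothetical stand-in for one encoded object.
#[derive(Debug, Clone, PartialEq)]
struct Item {
    kind: Kind,
    id: u64,
    encoded_len: usize,
}

/// Hypothetical update chunk; `partial` means more chunks will follow.
#[derive(Debug, PartialEq)]
struct Chunk {
    items: Vec<Item>,
    partial: bool,
}

/// Splits the initial state into chunks of at most `max_bytes` each,
/// emitting tasks first, then resources, then async ops.
fn chunk_initial_state(items: &[Item], max_bytes: usize) -> Vec<Chunk> {
    let mut ordered: Vec<Item> = items.to_vec();
    // Stable sort: objects of the same kind keep their relative order.
    ordered.sort_by_key(|item| item.kind as u8);

    let mut chunks = Vec::new();
    let mut current: Vec<Item> = Vec::new();
    let mut used = 0usize;
    for item in ordered {
        // Close the current chunk if this item would push it over the limit.
        // A single item larger than the limit still gets a chunk of its own.
        if !current.is_empty() && used + item.encoded_len > max_bytes {
            chunks.push(Chunk {
                items: std::mem::take(&mut current),
                partial: true,
            });
            used = 0;
        }
        used += item.encoded_len;
        current.push(item);
    }
    // The last chunk completes the state, so it is not marked partial.
    chunks.push(Chunk {
        items: current,
        partial: false,
    });
    chunks
}
```

With a 100-byte limit and objects of 60, 60, 40, and 30 bytes, this yields three chunks, and only the last one has `partial == false`. Strategy 2 would look similar, but would sort by creation time instead of by kind.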

In either case, we would need to modify tokio console to understand that it doesn't yet have a full state and so it should probably not display anything yet.
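
On the client side, that could be as simple as buffering chunks until one arrives without the partial flag. This is only a sketch under the same assumptions: `Update`, `InitialState`, and the `partial` field are hypothetical, and the real tokio-console state handling is more involved than a list of ids.

```rust
/// Hypothetical wire message: a batch of object ids plus a "more to come" flag.
struct Update {
    ids: Vec<u64>,
    partial: bool,
}

/// Hypothetical client-side buffer for the chunked initial state.
#[derive(Default)]
struct InitialState {
    pending: Vec<u64>,
    complete: bool,
}

impl InitialState {
    /// Buffers one chunk and returns true once the full state has arrived.
    fn apply(&mut self, update: Update) -> bool {
        self.pending.extend(update.ids);
        if !update.partial {
            self.complete = true;
        }
        self.complete
    }

    /// What the UI may draw: nothing until the state is complete.
    fn visible(&self) -> &[u64] {
        if self.complete {
            self.pending.as_slice()
        } else {
            &[]
        }
    }
}
```

Until `apply` returns true, `visible` stays empty, so the console would never render a half-populated task list.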

Finally, we'd want to try to detect the pathological case where updates are coming in faster than they can be sent, so the client (tokio console) will never get a full state update. In this case, sending some signal to tokio console that it may never get up to date might provide a better user experience than we currently have.
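
One possible heuristic for that detection, purely as a sketch: track how many objects are still unsent after each publish round, and flag the connection when that backlog has failed to shrink for several rounds in a row. `CatchUpMonitor`, `record`, and the threshold are hypothetical names and choices, not anything that exists in the crate.

```rust
/// Hypothetical detector for a client that never catches up.
struct CatchUpMonitor {
    last_backlog: Option<usize>,
    stalled_rounds: u32,
    threshold: u32,
}

impl CatchUpMonitor {
    fn new(threshold: u32) -> Self {
        CatchUpMonitor {
            last_backlog: None,
            stalled_rounds: 0,
            threshold,
        }
    }

    /// Records how many objects are still unsent after a publish round.
    /// Returns true when the backlog has failed to shrink for `threshold`
    /// consecutive rounds.
    fn record(&mut self, backlog: usize) -> bool {
        let shrinking = match self.last_backlog {
            Some(previous) => backlog < previous,
            None => true, // first observation, nothing to compare against
        };
        if backlog == 0 || shrinking {
            self.stalled_rounds = 0;
        } else {
            self.stalled_rounds += 1;
        }
        self.last_backlog = Some(backlog);
        self.stalled_rounds >= self.threshold
    }
}
```

When `record` returns true, the subscriber could send the "may never get up to date" signal instead of silently continuing to fall behind.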
