RFC: Keep Key Logs #73

Open · wants to merge 2 commits into master
`text/0000-keep-key-logs.md` (new file, 63 additions)
# RFC: Keep Key Logs

- RFC PR:
- Tracking issue:

## Motivation

To address https://github.com/tikv/tikv/issues/10841, we decided to prioritize availability over log completeness and switched the overflow strategy of the logger from `Block` to `Drop`.

However, some logs are valuable for diagnosis in practice. We depend heavily on logs to investigate bugs and abnormalities in the system. If logs are dropped indiscriminately, reasoning about raftstore or transaction issues may become impossible.

Therefore, simply dropping overflowed logs is only a temporary workaround. We hope to keep the logs that are important for diagnosis whenever possible.

## Detailed design

The problem is that we have not defined which logs are important. Whether a log is important for diagnosis is not determined by its log level.

For example, we would rather keep info-level logs in the raftstore than warn-level logs in the coprocessor module. The former are critical for finding out how each region evolves, while the latter may only indicate a retryable error.
> **Member:** Is it possible to set different log levels for different modules?

> **Contributor Author:** Yes, it is. But module-level settings are coarse-grained. This RFC lets you choose which logs can be dropped even within the same module.

> **@tonyxuqqi (Sep 2, 2021):** We used to have this problem too. Error logs are not enough for diagnostics, and info logs are too much. Our solution was to cache recent logs in a buffer and, whenever an error log appears, print the whole buffer (including every log before the error). I guess it may mitigate the problem.

> **@lucifinil (Sep 2, 2021):** A concern here: in many cases we need to troubleshoot resource-intensive scenarios and understand why they happened, so we actually need the logs even more in those scenarios, which leads to a paradox. I agree with Tony's suggestion.
>
> We can create a circular buffer to keep diagnostic information and flush it regularly or before a key point.

> **Contributor Author:** Tony's solution sounds good but may not be well suited to today's TiKV. TiKV is weak at detecting problems autonomously: it is quite common that service degradation or data corruption happens without TiKV knowing about it or printing any related error log immediately, so the "key point" is hard to define.
>
> Anyway, it probably hints at what we can work on in the future.


So, these important logs deserve to have another set of macros:
> **Member:** Like the slow log, maybe we can use a tag to identify whether a log is a key log.

> **Contributor Author:** Yes. As said in the "Changes to `slog` modules" section, these macros are just shortcuts that hide the magic tag, so it is easier for users.


- `key_crit!`
- `key_error!`
- `key_warn!`
- `key_info!`
- `key_debug!`
- `key_trace!`

Logs from these macros should not be dropped, as they carry key information. These macros should **NOT** be abused.

### Reserved space for key logs

We use async logging to reduce the performance impact on the working threads. The order of logs must be preserved, so it is best to keep a single channel to guarantee this property; that is, key logs are pushed to the same channel as normal logs.

But we do not want normal logs to occupy all of the channel's capacity, so we set a threshold for normal logs.

If the number of messages in the channel exceeds half of its capacity, normal logs are refused and dropped. Key logs can still be pushed until the channel is completely full.

In this way, we make sure at least half of the total capacity is always available to key logs.
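
As a rough illustration of this rule (a sketch only; the channel type, `CHANNEL_CAPACITY`, and the `is_key_log` flag are assumptions, not the actual TiKV code), the acceptance check could look like this:

```rust
// Sketch: decide whether a record may enter the bounded async channel.
// Assumes a crossbeam-style bounded channel whose current length can be queried.
use crossbeam_channel::Sender;

const CHANNEL_CAPACITY: usize = 8192; // example capacity, not the real value

fn accept_log<T>(sender: &Sender<T>, is_key_log: bool) -> bool {
    let used = sender.len();
    if is_key_log {
        // Key logs may use the whole channel.
        used < CHANNEL_CAPACITY
    } else {
        // Normal logs are refused once half of the capacity is in use,
        // keeping the other half reserved for key logs.
        used < CHANNEL_CAPACITY / 2
    }
}
```

A normal log that fails this check is dropped on the spot; a key log that fails it falls through to the fallback described in the next section.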

### Overflow fallback

Despite this design, it is still possible that key logs are abused and the channel becomes full.

The principle is that availability should never be sacrificed, so we must not block the sender on overflow, nor use an unbounded channel, which risks OOM. But we still hope to keep the key logs.

The fallback solution is to write the logs to a file directly in the working threads.
> **Member:** "Synchronously" is ambiguous.

> **Contributor Author:** I changed it to "directly in the working threads".


If a key log fails to be sent to the channel, it is formatted on the sender thread and written to the fallback log file. Without disk syncing, these writes should complete with very low latency and reasonable throughput, so the working threads are not badly affected and no key logs are actually dropped.
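
A minimal sketch of such a fallback writer, assuming the record has already been rendered to a string; the type name, locking, and error handling are illustrative, not the actual implementation:

```rust
// Sketch: append formatted key logs to the fallback file without fsync.
use std::fs::{File, OpenOptions};
use std::io::Write;
use std::path::Path;
use std::sync::Mutex;

struct KeyLogFallback {
    file: Mutex<File>,
}

impl KeyLogFallback {
    fn open(path: &Path) -> std::io::Result<Self> {
        let file = OpenOptions::new().create(true).append(true).open(path)?;
        Ok(KeyLogFallback { file: Mutex::new(file) })
    }

    /// Called on the working (sender) thread when the channel is full.
    /// The write goes to the page cache only; no explicit disk sync.
    fn write(&self, formatted_record: &str) {
        let mut file = self.file.lock().unwrap();
        // Ignore I/O errors: availability is preferred over completeness.
        let _ = file.write_all(formatted_record.as_bytes());
        let _ = file.write_all(b"\n");
    }
}
```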

### Changes to `slog` modules

Key logs are distinguished by their tag: `key_info!(...)` is a macro equivalent to `info!(#"key_log", ...)`.
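
As a sketch of such a shortcut (assuming the logger is passed explicitly, as in plain `slog`; TiKV's real macros may route through a global logger instead):

```rust
// Sketch: `key_info!` forwards to slog's `info!` with the `#"key_log"` message
// tag, so callers never spell the tag themselves.
#[macro_export]
macro_rules! key_info {
    ($logger:expr, $($args:tt)+) => {
        ::slog::info!($logger, #"key_log", $($args)+)
    };
}
```

A call like `key_info!(logger, "peer becomes leader"; "region_id" => 1)` would then expand to `info!(logger, #"key_log", "peer becomes leader"; "region_id" => 1)`.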

Add a new method `reserve_for_key_logs(self, fallback_path: &Path)` to `AsyncBuilder`.

If this option is set, the length of the channel should be checked in `<Async as Drain>::log` when the record's tag is not `"key_log"`. If the length exceeds half the capacity, the record should be dropped.
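
For illustration, building the drain with the proposed option might look as follows. `reserve_for_key_logs` is the method proposed by this RFC (it does not exist in `slog-async` today, and returning the builder for chaining is an assumption); the other builder calls are the existing `slog-async` API, and the channel size is an arbitrary example:

```rust
// Sketch of how the proposed builder option could be used.
use slog::{o, Drain};
use std::path::Path;

fn build_root_logger<D>(drain: D, fallback: &Path) -> slog::Logger
where
    D: Drain<Ok = (), Err = slog::Never> + Send + 'static,
{
    let async_drain = slog_async::Async::new(drain)
        .chan_size(8192) // example capacity
        .overflow_strategy(slog_async::OverflowStrategy::Drop)
        .reserve_for_key_logs(fallback) // proposed in this RFC
        .build()
        .fuse();
    slog::Logger::root(async_drain, o!())
}
```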

## Unresolved questions

- Naming of the key log macros
- Channel capacity
- Configuration and filename of the fallback log file