-
Notifications
You must be signed in to change notification settings - Fork 35.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
logging: Use LogFatal instead LogError for fatal errors #30347
Conversation
Keep using LogError for nonfatal errors. Also rename LogWarning to LogCritical to be clear it is only intended to be used for "severe problems that the node admin should address," as described in the documentation.
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers. Code CoverageFor detailed information about the code coverage, see the test coverage report. ReviewsSee the guideline for information on the review process.
If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update. ConflictsReviewers, this pull request conflicts with the following ones:
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first. |
I'm largely in agreement with the sentiment here and think the introduction of the Does LogCritical("pruned datadir may not have more than %d blocks; only checking available blocks\n", ...actually require user intervention or other corrective action? nit: The macros `LogInfo`, `LogDebug`, `LogTrace`, `LogWarning` and `LogError` are available for |
Tentative Concept ACK on adding Fatal if the distinction is considered essential, unsure about Critical replacing Warning. |
@@ -77,6 +77,8 @@ namespace BCLog { | |||
Info, // Default | |||
Warning, | |||
Error, | |||
Critical, | |||
Fatal, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it makes sense for us to have 7 levels of logging -- even 5 is a lot. Renaming Warning to Critical and Error to Fatal for the reasons given seems fine, but the old names shouldn't be kept around in that case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it makes sense for us to have 7 levels of logging -- even 5 is a lot.
Between 6 and 8 levels seems to be standard? The most frequent levels might be our current ones, plus Fatal, i.e.
Fatal, Error, Warning, Info, Debug, Trace
https://www.crowdstrike.com/cybersecurity-101/observability/logging-levels/
https://sematext.com/blog/logging-levels/
https://betterstack.com/community/guides/logging/log-levels-explained/
https://stackoverflow.com/questions/2031163/when-to-use-the-different-log-levels
https://www.sumologic.com/glossary/log-levels/
https://medium.com/@m.merkulov/logging-levels-explained-4dd61815601f
https://techhub.hpe.com/eginfolib/networking/docs/switches/7500/5200-1946a_nmm_cg/content/495508377.htm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The medium post describes five levels with the same names as ours currently: error, warn, info, debug, trace; the other six links all describe syslog's behaviour, which was designed for unix systems in the 1980s. Logging systems we actually use include python's (which has 5 default levels), leveldb (which does not have levels, so we treat it all as debug), and libevent (which has 4 levels). So, no, 6-8 levels is not standard, and for a custom subsystem we should be doing what's best for us, not necessarily what's standard in any case.
The flowchart from the stackoverflow link might be helpful, though: https://stackoverflow.com/a/64806781/84932 For us I think it would look more like:
flowchart TD
start-->|No|dbg
start{{Is this critical information?}}-->|Yes|crit
dbg-->|No|logtrace(Trace + Category)
dbg{{Is this moderate volume?}}-->|Yes|logdebug(Debug + Category)
crit{{Is there a problem?}}-->|No|loginfo(Info)
crit-->|Yes|warnerr
warnerr{{Do we need to abort?}}-->|No|logwarn(Warning/Critical)
warnerr-->|Yes|logfatal(Error/Fatal)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
re: #30347 (comment)
I don't think it makes sense for us to have 7 levels of logging -- even 5 is a lot. Renaming Warning to Critical and Error to Fatal for the reasons given seems fine, but the old names shouldn't be kept around in that case.
I think the reality is we already have 7 levels of logging, but we are using 5 names to describe them. This PR is a small fix to make the names used in the code match the intent of the code.
For example, the documentation says LogError
should only be used to log fatal errors. But currently it is being used to log both fatal errors and nonfatal errors. If you would like to log errors less, or at lower priority, I would enthusiastically agree with that goal. But I don't think those changes should be folded into this PR, which is minimal and not trying to fix problems that are broader and more complicated than simple bad naming.
Your flowchart is a good idealization, but it does not describe the realities of current code, which is using logging for ordinary error reporting, not just the most critical errors. A description of the actual log levels in our code base is:
Fatal
messages indicate severe problems (e.g. data loss) that will trigger shutdownsCritical
messages indicate severe problems (e.g. incorrect system time, internal bug detected) that users should address in some wayError
messages indicate less severe problems (e.g. loss of connectivity) that the users should probably know about, but might be transientWarning
messages indicate unexpected conditions (e.g. user specified-checkblocks
number higher than number of blocks that may be available to check) that the user should know about, and indicate potential problems
For completeness, Info
messages provide information about normal activity, Debug
messages can help users troubleshoot issues, and Trace
is mostly unused but geared more towards developers.
Overall, I agree with the direction you want to take in reducing number of log levels, and I would applaud any effort to reduce amount of log spam and report errors by other means, but I don't think those goals should get in the way of fixing names that are confusing and misleading to developers.
If you want to open a PR that takes a broader approach and also fixes the bad names, I will happily go along with it and close this PR. But I think reasonable to want to take one step at at time here and just fix the bad naming.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it makes sense for us to have 7 levels of logging -- even 5 is a lot.
Meta:
It's not clear why people are 👍 on that comment. Are 5 "a lot", despite what other projects do, or based on our needs? Or is the 👍 about 5 being ok, but not having more than 5?
I've seen drive-by 👍 on other comments recently where the comment made multiple points, and it was similarly not clear: which of the points the 👍 pertains to? All? Some?
If a reviewer has a comment, I think it would be more helpful in general to be explicit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not clear to me that there's any reason to separate the current LogWarning
/ LogError
levels; those existed prior to #28318 and I documented them in that PR as best I understood them. However, there's not much benefit in a log that says "hey this error is about to cause the node to stop" -- you already get that information by seeing "Shutdown: in progress..." immediately following. So it could make sense to merge Warning/Error into:
LogAlert(fmt, params...)
should be used in place ofLogInfo
for (potentially) severe problems that require attention from the node admin (e.g., system time appears to be wrong, unknown soft fork appears to have activated, insufficient storage space).
which would reduce the flowchart to:
flowchart TD
start-->|No|dbg
start{{Is this critical information?}}-->|Yes|crit
dbg-->|No|logtrace(Trace + Category)
dbg{{Is this moderate volume?}}-->|Yes|logdebug(Debug + Category)
crit{{Is there a problem?}}-->|No|loginfo(Info)
crit-->|Yes|logfatal(Alert)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current code mostly hasn't migrated to the multiple severity levels from the original LogPrintf/LogPrint split. It would be nice to spend time working on that, but we're instead constantly bikeshedding the API instead and treating that as a blocker for any other logging changes.
The logging was mostly migrated in the severity level parent PR #25203 before the whole thing was taken off the rails with the overwrought API and rules. (There have also been a number of pulls opened that were a copy or similar version of the commits there.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By overwrought, jonatack means LogError("CHECKSUM ERROR ...")
, while his preferred non-overwrought api from #25203 looks like LogPrintLevel(BCLog::NET, BCLog::Level::Error, "CHECKSUM ERROR ...")
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By overwrought, jonatack means
LogError("CHECKSUM ERROR ...")
, while his preferred non-overwrought api from #25203 looks likeLogPrintLevel(BCLog::NET, BCLog::Level::Error, "CHECKSUM ERROR ...")
.
As you know, I've described what I mean several times, including in my comments in the pull you mention. Here is the most recent one: #29256 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
re: #30347 (comment)
I like your idea to collapse the log levels further and I am curious about ways you think we can make progress towards that goal.
However, there's not much benefit in a log that says "hey this error is about to cause the node to stop" -- you already get that information by seeing "Shutdown: in progress..." immediately following.
If we don't need to have a dedicated level for fatal errors, it suggests a possible alternative to this PR, implemented in #30361. It would solve the issue that was reported and might get closer to the long term outcome you outlined.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated 475b70d -> b27dc67 (pr/logfat.1
-> pr/logfat.2
, compare) with suggested fixes and expanded it to cover Logging::Warning
and Logging::Error
.
re: #30347 (comment)
pruned datadir may not have more than %d blocks;
Nice catch! I updated this PR with your corrections and also updated the #30348 issue description to reflect this.
@@ -77,6 +77,8 @@ namespace BCLog { | |||
Info, // Default | |||
Warning, | |||
Error, | |||
Critical, | |||
Fatal, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
re: #30347 (comment)
I don't think it makes sense for us to have 7 levels of logging -- even 5 is a lot. Renaming Warning to Critical and Error to Fatal for the reasons given seems fine, but the old names shouldn't be kept around in that case.
I think the reality is we already have 7 levels of logging, but we are using 5 names to describe them. This PR is a small fix to make the names used in the code match the intent of the code.
For example, the documentation says LogError
should only be used to log fatal errors. But currently it is being used to log both fatal errors and nonfatal errors. If you would like to log errors less, or at lower priority, I would enthusiastically agree with that goal. But I don't think those changes should be folded into this PR, which is minimal and not trying to fix problems that are broader and more complicated than simple bad naming.
Your flowchart is a good idealization, but it does not describe the realities of current code, which is using logging for ordinary error reporting, not just the most critical errors. A description of the actual log levels in our code base is:
Fatal
messages indicate severe problems (e.g. data loss) that will trigger shutdownsCritical
messages indicate severe problems (e.g. incorrect system time, internal bug detected) that users should address in some wayError
messages indicate less severe problems (e.g. loss of connectivity) that the users should probably know about, but might be transientWarning
messages indicate unexpected conditions (e.g. user specified-checkblocks
number higher than number of blocks that may be available to check) that the user should know about, and indicate potential problems
For completeness, Info
messages provide information about normal activity, Debug
messages can help users troubleshoot issues, and Trace
is mostly unused but geared more towards developers.
Overall, I agree with the direction you want to take in reducing number of log levels, and I would applaud any effort to reduce amount of log spam and report errors by other means, but I don't think those goals should get in the way of fixing names that are confusing and misleading to developers.
If you want to open a PR that takes a broader approach and also fixes the bad names, I will happily go along with it and close this PR. But I think reasonable to want to take one step at at time here and just fix the bad naming.
Concept NACK; introducing more different levels just makes it require more effort to use the logging system, and increases the chances they will be used incorrectly. |
Remove documentation that says LogError should be used for "severe problems that require the node (or a subsystem) to shut down entirely" because: - This is not how `LogError` and `Level::Error` are used currently. Of 129 current uses only 58 cases are fatal according to bitcoin#30348 - "[T]here's not much benefit in a log that says "hey this error is about to cause the node to stop" -- you already get that information by seeing "Shutdown: in progress..." immediately following." according to bitcoin#30347 (comment)
Remove documentation that says LogError should be used for "severe problems that require the node (or a subsystem) to shut down entirely" because: - This is not how `LogError` and `Level::Error` are used currently. Of 129 current uses only 58 cases are fatal according to bitcoin#30348 - "[T]here's not much benefit in a log that says "hey this error is about to cause the node to stop" -- you already get that information by seeing "Shutdown: in progress..." immediately following." according to bitcoin#30347 (comment)
Closing this PR as #30361 is a more direct fix for the issue, assuming we are ok with log levels not making a distinction between fatal errors and other types of errors. It seems like it never might have been anybody's intention to even make this distinction in the first place and it just got written in when new documentation was added. |
Use
LogFatal
insteadLogError
for fatal errors.Keep using
LogError
for nonfatal errors.Also rename
LogWarning
toLogCritical
to be clear it is only intended to be used for "severe problems that the node admin should address," as described in the documentation.This PR is a draft so other ideas and approaches can be discussed in issue #30348