logging: Use LogFatal instead LogError for fatal errors #30347

ryanofsky · 2024-06-27T12:30:06Z

Use LogFatal instead LogError for fatal errors.

Keep using LogError for nonfatal errors.

Also rename LogWarning to LogCritical to be clear it is only intended to be used for "severe problems that the node admin should address," as described in the documentation.

This PR is a draft so other ideas and approaches can be discussed in issue #30348

Keep using LogError for nonfatal errors. Also rename LogWarning to LogCritical to be clear it is only intended to be used for "severe problems that the node admin should address," as described in the documentation.

DrahtBot · 2024-06-27T12:30:08Z

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage

For detailed information about the code coverage, see the test coverage report.

Reviews

See the guideline for information on the review process.

Type	Reviewers
Concept NACK	ajtowns
Concept ACK	jonatack

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

Conflicts

Reviewers, this pull request conflicts with the following ones:

#30361 (doc: Drop description of LogError messages as fatal by ryanofsky)
#30343 (wallet, logging: Replace WalletLogPrintf() with LogInfo() by ryanofsky)
#30342 (kernel, logging: Pass Logger instances to kernel objects by ryanofsky)
#30155 (validation: Make ReplayBlocks interruptible by mzumsande)
#29656 (chainparams: Change nChainTx type to uint64_t by fjahr)
#29641 (scripted-diff: Use LogInfo/LogDebug over LogPrintf/LogPrint by maflcko)
#29256 (Improve new LogDebug/Trace/Info/Warning/Error Macros by ryanofsky)
#27006 (reduce cs_main scope, guard block index 'nFile' under a local mutex by furszy)
#26022 (Add util::ResultPtr class by ryanofsky)
#25722 (refactor: Use util::Result class for wallet loading by ryanofsky)
#25665 (refactor: Add util::Result failure values, multiple error and warning messages by ryanofsky)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

willcl-ark · 2024-06-27T13:23:48Z

I'm largely in agreement with the sentiment here and think the introduction of the LogFatal is nice to distiguish fatal errors.

Does

LogCritical("pruned datadir may not have more than %d blocks; only checking available blocks\n",

...actually require user intervention or other corrective action?

nit: LogWarning remains in L728 after this change:

The macros `LogInfo`, `LogDebug`, `LogTrace`, `LogWarning` and `LogError` are available for

doc/developer-notes.md

src/logging.h

jonatack · 2024-06-27T13:48:01Z

Tentative Concept ACK on adding Fatal if the distinction is considered essential, unsure about Critical replacing Warning.

ajtowns · 2024-06-27T14:23:47Z

src/logging.h

@@ -77,6 +77,8 @@ namespace BCLog {
        Info,      // Default
        Warning,
        Error,
+        Critical,
+        Fatal,


I don't think it makes sense for us to have 7 levels of logging -- even 5 is a lot. Renaming Warning to Critical and Error to Fatal for the reasons given seems fine, but the old names shouldn't be kept around in that case.

I don't think it makes sense for us to have 7 levels of logging -- even 5 is a lot.

Between 6 and 8 levels seems to be standard? The most frequent levels might be our current ones, plus Fatal, i.e.

Fatal, Error, Warning, Info, Debug, Trace

https://www.crowdstrike.com/cybersecurity-101/observability/logging-levels/
https://sematext.com/blog/logging-levels/
https://betterstack.com/community/guides/logging/log-levels-explained/
https://stackoverflow.com/questions/2031163/when-to-use-the-different-log-levels
https://www.sumologic.com/glossary/log-levels/
https://medium.com/@m.merkulov/logging-levels-explained-4dd61815601f
https://techhub.hpe.com/eginfolib/networking/docs/switches/7500/5200-1946a_nmm_cg/content/495508377.htm

The medium post describes five levels with the same names as ours currently: error, warn, info, debug, trace; the other six links all describe syslog's behaviour, which was designed for unix systems in the 1980s. Logging systems we actually use include python's (which has 5 default levels), leveldb (which does not have levels, so we treat it all as debug), and libevent (which has 4 levels). So, no, 6-8 levels is not standard, and for a custom subsystem we should be doing what's best for us, not necessarily what's standard in any case.

The flowchart from the stackoverflow link might be helpful, though: https://stackoverflow.com/a/64806781/84932 For us I think it would look more like:

flowchart TD start-->|No|dbg start{{Is this critical information?}}-->|Yes|crit dbg-->|No|logtrace(Trace + Category) dbg{{Is this moderate volume?}}-->|Yes|logdebug(Debug + Category) crit{{Is there a problem?}}-->|No|loginfo(Info) crit-->|Yes|warnerr warnerr{{Do we need to abort?}}-->|No|logwarn(Warning/Critical) warnerr-->|Yes|logfatal(Error/Fatal)

Loading

re: #30347 (comment)

I don't think it makes sense for us to have 7 levels of logging -- even 5 is a lot. Renaming Warning to Critical and Error to Fatal for the reasons given seems fine, but the old names shouldn't be kept around in that case.

I think the reality is we already have 7 levels of logging, but we are using 5 names to describe them. This PR is a small fix to make the names used in the code match the intent of the code.

For example, the documentation says LogError should only be used to log fatal errors. But currently it is being used to log both fatal errors and nonfatal errors. If you would like to log errors less, or at lower priority, I would enthusiastically agree with that goal. But I don't think those changes should be folded into this PR, which is minimal and not trying to fix problems that are broader and more complicated than simple bad naming.

Your flowchart is a good idealization, but it does not describe the realities of current code, which is using logging for ordinary error reporting, not just the most critical errors. A description of the actual log levels in our code base is:

Fatal messages indicate severe problems (e.g. data loss) that will trigger shutdowns

Critical messages indicate severe problems (e.g. incorrect system time, internal bug detected) that users should address in some way

Error messages indicate less severe problems (e.g. loss of connectivity) that the users should probably know about, but might be transient

Warning messages indicate unexpected conditions (e.g. user specified -checkblocks number higher than number of blocks that may be available to check) that the user should know about, and indicate potential problems

For completeness, Info messages provide information about normal activity, Debug messages can help users troubleshoot issues, and Trace is mostly unused but geared more towards developers.

Overall, I agree with the direction you want to take in reducing number of log levels, and I would applaud any effort to reduce amount of log spam and report errors by other means, but I don't think those goals should get in the way of fixing names that are confusing and misleading to developers.

If you want to open a PR that takes a broader approach and also fixes the bad names, I will happily go along with it and close this PR. But I think reasonable to want to take one step at at time here and just fix the bad naming.

I don't think it makes sense for us to have 7 levels of logging -- even 5 is a lot.

Meta:

It's not clear why people are 👍 on that comment. Are 5 "a lot", despite what other projects do, or based on our needs? Or is the 👍 about 5 being ok, but not having more than 5?

I've seen drive-by 👍 on other comments recently where the comment made multiple points, and it was similarly not clear: which of the points the 👍 pertains to? All? Some?

If a reviewer has a comment, I think it would be more helpful in general to be explicit.

It's not clear to me that there's any reason to separate the current LogWarning / LogError levels; those existed prior to #28318 and I documented them in that PR as best I understood them. However, there's not much benefit in a log that says "hey this error is about to cause the node to stop" -- you already get that information by seeing "Shutdown: in progress..." immediately following. So it could make sense to merge Warning/Error into:

LogAlert(fmt, params...) should be used in place of LogInfo for (potentially) severe problems that require attention from the node admin (e.g., system time appears to be wrong, unknown soft fork appears to have activated, insufficient storage space).

which would reduce the flowchart to:

flowchart TD start-->|No|dbg start{{Is this critical information?}}-->|Yes|crit dbg-->|No|logtrace(Trace + Category) dbg{{Is this moderate volume?}}-->|Yes|logdebug(Debug + Category) crit{{Is there a problem?}}-->|No|loginfo(Info) crit-->|Yes|logfatal(Alert)

Loading

The current code mostly hasn't migrated to the multiple severity levels from the original LogPrintf/LogPrint split. It would be nice to spend time working on that, but we're instead constantly bikeshedding the API instead and treating that as a blocker for any other logging changes.

The logging was mostly migrated in the severity level parent PR #25203 before the whole thing was taken off the rails with the overwrought API and rules. (There have also been a number of pulls opened that were a copy or similar version of the commits there.)

By overwrought, jonatack means LogError("CHECKSUM ERROR ..."), while his preferred non-overwrought api from #25203 looks like LogPrintLevel(BCLog::NET, BCLog::Level::Error, "CHECKSUM ERROR ...").

By overwrought, jonatack means LogError("CHECKSUM ERROR ..."), while his preferred non-overwrought api from #25203 looks like LogPrintLevel(BCLog::NET, BCLog::Level::Error, "CHECKSUM ERROR ...").

As you know, I've described what I mean several times, including in my comments in the pull you mention. Here is the most recent one: #29256 (comment)

re: #30347 (comment)

I like your idea to collapse the log levels further and I am curious about ways you think we can make progress towards that goal.

However, there's not much benefit in a log that says "hey this error is about to cause the node to stop" -- you already get that information by seeing "Shutdown: in progress..." immediately following.

If we don't need to have a dedicated level for fatal errors, it suggests a possible alternative to this PR, implemented in #30361. It would solve the issue that was reported and might get closer to the long term outcome you outlined.

ryanofsky

Updated 475b70d -> b27dc67 (pr/logfat.1 -> pr/logfat.2, compare) with suggested fixes and expanded it to cover Logging::Warning and Logging::Error.

re: #30347 (comment)

pruned datadir may not have more than %d blocks;

Nice catch! I updated this PR with your corrections and also updated the #30348 issue description to reflect this.

doc/developer-notes.md

ryanofsky · 2024-06-28T14:41:25Z

src/logging.h

@@ -77,6 +77,8 @@ namespace BCLog {
        Info,      // Default
        Warning,
        Error,
+        Critical,
+        Fatal,


re: #30347 (comment)

I don't think it makes sense for us to have 7 levels of logging -- even 5 is a lot. Renaming Warning to Critical and Error to Fatal for the reasons given seems fine, but the old names shouldn't be kept around in that case.

I think the reality is we already have 7 levels of logging, but we are using 5 names to describe them. This PR is a small fix to make the names used in the code match the intent of the code.

For example, the documentation says LogError should only be used to log fatal errors. But currently it is being used to log both fatal errors and nonfatal errors. If you would like to log errors less, or at lower priority, I would enthusiastically agree with that goal. But I don't think those changes should be folded into this PR, which is minimal and not trying to fix problems that are broader and more complicated than simple bad naming.

Your flowchart is a good idealization, but it does not describe the realities of current code, which is using logging for ordinary error reporting, not just the most critical errors. A description of the actual log levels in our code base is:

Fatal messages indicate severe problems (e.g. data loss) that will trigger shutdowns

Critical messages indicate severe problems (e.g. incorrect system time, internal bug detected) that users should address in some way

Error messages indicate less severe problems (e.g. loss of connectivity) that the users should probably know about, but might be transient

Warning messages indicate unexpected conditions (e.g. user specified -checkblocks number higher than number of blocks that may be available to check) that the user should know about, and indicate potential problems

For completeness, Info messages provide information about normal activity, Debug messages can help users troubleshoot issues, and Trace is mostly unused but geared more towards developers.

Overall, I agree with the direction you want to take in reducing number of log levels, and I would applaud any effort to reduce amount of log spam and report errors by other means, but I don't think those goals should get in the way of fixing names that are confusing and misleading to developers.

If you want to open a PR that takes a broader approach and also fixes the bad names, I will happily go along with it and close this PR. But I think reasonable to want to take one step at at time here and just fix the bad naming.

src/logging.h

ajtowns · 2024-06-28T17:15:36Z

Concept NACK; introducing more different levels just makes it require more effort to use the logging system, and increases the chances they will be used incorrectly.

Remove documentation that says LogError should be used for "severe problems that require the node (or a subsystem) to shut down entirely" because: - This is not how `LogError` and `Level::Error` are used currently. Of 129 current uses only 58 cases are fatal according to bitcoin#30348 - "[T]here's not much benefit in a log that says "hey this error is about to cause the node to stop" -- you already get that information by seeing "Shutdown: in progress..." immediately following." according to bitcoin#30347 (comment)

ryanofsky · 2024-06-30T16:02:39Z

Closing this PR as #30361 is a more direct fix for the issue, assuming we are ok with log levels not making a distinction between fatal errors and other types of errors. It seems like it never might have been anybody's intention to even make this distinction in the first place and it just got written in when new documentation was added.

logging: Use LogFatal instead LogError for fatal errors

b27dc67

Keep using LogError for nonfatal errors. Also rename LogWarning to LogCritical to be clear it is only intended to be used for "severe problems that the node admin should address," as described in the documentation.

ryanofsky mentioned this pull request Jun 27, 2024

RFC: Misused LogError and LogWarning macros #30348

Open

jonatack reviewed Jun 27, 2024

View reviewed changes

doc/developer-notes.md Show resolved Hide resolved

jonatack reviewed Jun 27, 2024

View reviewed changes

src/logging.h Show resolved Hide resolved

ajtowns reviewed Jun 27, 2024

View reviewed changes

ajtowns mentioned this pull request Jun 27, 2024

Improve new LogDebug/Trace/Info/Warning/Error Macros #29256

Open

ryanofsky force-pushed the pr/logfat branch from 475b70d to b27dc67 Compare June 28, 2024 14:41

ryanofsky commented Jun 28, 2024

View reviewed changes

DrahtBot mentioned this pull request Jun 28, 2024

scripted-diff: Use LogInfo/LogDebug over LogPrintf/LogPrint #29641

Draft

ryanofsky mentioned this pull request Jun 28, 2024

doc: Drop description of LogError messages as fatal #30361

Open

DrahtBot mentioned this pull request Jun 29, 2024

validation: Make ReplayBlocks interruptible #30155

Open

ryanofsky mentioned this pull request Jun 30, 2024

logging: Replace LogError and LogWarning with LogAlert #30364

Open

ryanofsky closed this Jun 30, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

logging: Use LogFatal instead LogError for fatal errors #30347

logging: Use LogFatal instead LogError for fatal errors #30347

ryanofsky commented Jun 27, 2024 •

edited

Loading

DrahtBot commented Jun 27, 2024 •

edited

Loading

willcl-ark commented Jun 27, 2024

jonatack commented Jun 27, 2024

ajtowns Jun 27, 2024

jonatack Jun 27, 2024

ajtowns Jun 27, 2024

ryanofsky Jun 28, 2024

jonatack Jun 28, 2024

ajtowns Jun 28, 2024 •

edited

Loading

jonatack Jun 28, 2024

ajtowns Jun 28, 2024

jonatack Jun 28, 2024

ryanofsky Jun 28, 2024 •

edited

Loading

ryanofsky left a comment •

edited

Loading

ryanofsky Jun 28, 2024

ajtowns commented Jun 28, 2024

ryanofsky commented Jun 30, 2024

logging: Use LogFatal instead LogError for fatal errors #30347

logging: Use LogFatal instead LogError for fatal errors #30347

Conversation

ryanofsky commented Jun 27, 2024 • edited Loading

DrahtBot commented Jun 27, 2024 • edited Loading

Code Coverage

Reviews

Conflicts

willcl-ark commented Jun 27, 2024

jonatack commented Jun 27, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ajtowns Jun 28, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ryanofsky Jun 28, 2024 • edited Loading

Choose a reason for hiding this comment

ryanofsky left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ajtowns commented Jun 28, 2024

ryanofsky commented Jun 30, 2024

ryanofsky commented Jun 27, 2024 •

edited

Loading

DrahtBot commented Jun 27, 2024 •

edited

Loading

ajtowns Jun 28, 2024 •

edited

Loading

ryanofsky Jun 28, 2024 •

edited

Loading

ryanofsky left a comment •

edited

Loading