Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

logging: Use LogFatal instead LogError for fatal errors #30347

Closed
wants to merge 1 commit into from

Conversation

ryanofsky
Copy link
Contributor

@ryanofsky ryanofsky commented Jun 27, 2024

Use LogFatal instead LogError for fatal errors.

Keep using LogError for nonfatal errors.

Also rename LogWarning to LogCritical to be clear it is only intended to be used for "severe problems that the node admin should address," as described in the documentation.

This PR is a draft so other ideas and approaches can be discussed in issue #30348

Keep using LogError for nonfatal errors. Also rename LogWarning to LogCritical
to be clear it is only intended to be used for "severe problems that the
node admin should address," as described in the documentation.
@DrahtBot
Copy link
Contributor

DrahtBot commented Jun 27, 2024

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage

For detailed information about the code coverage, see the test coverage report.

Reviews

See the guideline for information on the review process.

Type Reviewers
Concept NACK ajtowns
Concept ACK jonatack

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

Conflicts

Reviewers, this pull request conflicts with the following ones:

  • #30361 (doc: Drop description of LogError messages as fatal by ryanofsky)
  • #30343 (wallet, logging: Replace WalletLogPrintf() with LogInfo() by ryanofsky)
  • #30342 (kernel, logging: Pass Logger instances to kernel objects by ryanofsky)
  • #30155 (validation: Make ReplayBlocks interruptible by mzumsande)
  • #29656 (chainparams: Change nChainTx type to uint64_t by fjahr)
  • #29641 (scripted-diff: Use LogInfo/LogDebug over LogPrintf/LogPrint by maflcko)
  • #29256 (Improve new LogDebug/Trace/Info/Warning/Error Macros by ryanofsky)
  • #27006 (reduce cs_main scope, guard block index 'nFile' under a local mutex by furszy)
  • #26022 (Add util::ResultPtr class by ryanofsky)
  • #25722 (refactor: Use util::Result class for wallet loading by ryanofsky)
  • #25665 (refactor: Add util::Result failure values, multiple error and warning messages by ryanofsky)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

@willcl-ark
Copy link
Member

I'm largely in agreement with the sentiment here and think the introduction of the LogFatal is nice to distiguish fatal errors.

Does

LogCritical("pruned datadir may not have more than %d blocks; only checking available blocks\n",

...actually require user intervention or other corrective action?

nit: LogWarning remains in L728 after this change:

The macros `LogInfo`, `LogDebug`, `LogTrace`, `LogWarning` and `LogError` are available for

@jonatack
Copy link
Contributor

Tentative Concept ACK on adding Fatal if the distinction is considered essential, unsure about Critical replacing Warning.

@@ -77,6 +77,8 @@ namespace BCLog {
Info, // Default
Warning,
Error,
Critical,
Fatal,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it makes sense for us to have 7 levels of logging -- even 5 is a lot. Renaming Warning to Critical and Error to Fatal for the reasons given seems fine, but the old names shouldn't be kept around in that case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The medium post describes five levels with the same names as ours currently: error, warn, info, debug, trace; the other six links all describe syslog's behaviour, which was designed for unix systems in the 1980s. Logging systems we actually use include python's (which has 5 default levels), leveldb (which does not have levels, so we treat it all as debug), and libevent (which has 4 levels). So, no, 6-8 levels is not standard, and for a custom subsystem we should be doing what's best for us, not necessarily what's standard in any case.

The flowchart from the stackoverflow link might be helpful, though: https://stackoverflow.com/a/64806781/84932 For us I think it would look more like:

flowchart TD
    start-->|No|dbg
    start{{Is this critical information?}}-->|Yes|crit

    dbg-->|No|logtrace(Trace + Category)
    dbg{{Is this moderate volume?}}-->|Yes|logdebug(Debug + Category)

    crit{{Is there a problem?}}-->|No|loginfo(Info)
    crit-->|Yes|warnerr

    warnerr{{Do we need to abort?}}-->|No|logwarn(Warning/Critical)
    warnerr-->|Yes|logfatal(Error/Fatal)
Loading

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re: #30347 (comment)

I don't think it makes sense for us to have 7 levels of logging -- even 5 is a lot. Renaming Warning to Critical and Error to Fatal for the reasons given seems fine, but the old names shouldn't be kept around in that case.

I think the reality is we already have 7 levels of logging, but we are using 5 names to describe them. This PR is a small fix to make the names used in the code match the intent of the code.

For example, the documentation says LogError should only be used to log fatal errors. But currently it is being used to log both fatal errors and nonfatal errors. If you would like to log errors less, or at lower priority, I would enthusiastically agree with that goal. But I don't think those changes should be folded into this PR, which is minimal and not trying to fix problems that are broader and more complicated than simple bad naming.

Your flowchart is a good idealization, but it does not describe the realities of current code, which is using logging for ordinary error reporting, not just the most critical errors. A description of the actual log levels in our code base is:

  • Fatal messages indicate severe problems (e.g. data loss) that will trigger shutdowns
  • Critical messages indicate severe problems (e.g. incorrect system time, internal bug detected) that users should address in some way
  • Error messages indicate less severe problems (e.g. loss of connectivity) that the users should probably know about, but might be transient
  • Warning messages indicate unexpected conditions (e.g. user specified -checkblocks number higher than number of blocks that may be available to check) that the user should know about, and indicate potential problems

For completeness, Info messages provide information about normal activity, Debug messages can help users troubleshoot issues, and Trace is mostly unused but geared more towards developers.

Overall, I agree with the direction you want to take in reducing number of log levels, and I would applaud any effort to reduce amount of log spam and report errors by other means, but I don't think those goals should get in the way of fixing names that are confusing and misleading to developers.

If you want to open a PR that takes a broader approach and also fixes the bad names, I will happily go along with it and close this PR. But I think reasonable to want to take one step at at time here and just fix the bad naming.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it makes sense for us to have 7 levels of logging -- even 5 is a lot.

Meta:

It's not clear why people are 👍 on that comment. Are 5 "a lot", despite what other projects do, or based on our needs? Or is the 👍 about 5 being ok, but not having more than 5?

I've seen drive-by 👍 on other comments recently where the comment made multiple points, and it was similarly not clear: which of the points the 👍 pertains to? All? Some?

If a reviewer has a comment, I think it would be more helpful in general to be explicit.

Copy link
Contributor

@ajtowns ajtowns Jun 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not clear to me that there's any reason to separate the current LogWarning / LogError levels; those existed prior to #28318 and I documented them in that PR as best I understood them. However, there's not much benefit in a log that says "hey this error is about to cause the node to stop" -- you already get that information by seeing "Shutdown: in progress..." immediately following. So it could make sense to merge Warning/Error into:

  • LogAlert(fmt, params...) should be used in place of LogInfo for (potentially) severe problems that require attention from the node admin (e.g., system time appears to be wrong, unknown soft fork appears to have activated, insufficient storage space).

which would reduce the flowchart to:

flowchart TD
    start-->|No|dbg
    start{{Is this critical information?}}-->|Yes|crit

    dbg-->|No|logtrace(Trace + Category)
    dbg{{Is this moderate volume?}}-->|Yes|logdebug(Debug + Category)

    crit{{Is there a problem?}}-->|No|loginfo(Info)
    crit-->|Yes|logfatal(Alert)
Loading

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current code mostly hasn't migrated to the multiple severity levels from the original LogPrintf/LogPrint split. It would be nice to spend time working on that, but we're instead constantly bikeshedding the API instead and treating that as a blocker for any other logging changes.

The logging was mostly migrated in the severity level parent PR #25203 before the whole thing was taken off the rails with the overwrought API and rules. (There have also been a number of pulls opened that were a copy or similar version of the commits there.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By overwrought, jonatack means LogError("CHECKSUM ERROR ..."), while his preferred non-overwrought api from #25203 looks like LogPrintLevel(BCLog::NET, BCLog::Level::Error, "CHECKSUM ERROR ...").

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By overwrought, jonatack means LogError("CHECKSUM ERROR ..."), while his preferred non-overwrought api from #25203 looks like LogPrintLevel(BCLog::NET, BCLog::Level::Error, "CHECKSUM ERROR ...").

As you know, I've described what I mean several times, including in my comments in the pull you mention. Here is the most recent one: #29256 (comment)

Copy link
Contributor Author

@ryanofsky ryanofsky Jun 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re: #30347 (comment)

I like your idea to collapse the log levels further and I am curious about ways you think we can make progress towards that goal.

However, there's not much benefit in a log that says "hey this error is about to cause the node to stop" -- you already get that information by seeing "Shutdown: in progress..." immediately following.

If we don't need to have a dedicated level for fatal errors, it suggests a possible alternative to this PR, implemented in #30361. It would solve the issue that was reported and might get closer to the long term outcome you outlined.

Copy link
Contributor Author

@ryanofsky ryanofsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated 475b70d -> b27dc67 (pr/logfat.1 -> pr/logfat.2, compare) with suggested fixes and expanded it to cover Logging::Warning and Logging::Error.


re: #30347 (comment)

pruned datadir may not have more than %d blocks;

Nice catch! I updated this PR with your corrections and also updated the #30348 issue description to reflect this.

doc/developer-notes.md Show resolved Hide resolved
@@ -77,6 +77,8 @@ namespace BCLog {
Info, // Default
Warning,
Error,
Critical,
Fatal,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re: #30347 (comment)

I don't think it makes sense for us to have 7 levels of logging -- even 5 is a lot. Renaming Warning to Critical and Error to Fatal for the reasons given seems fine, but the old names shouldn't be kept around in that case.

I think the reality is we already have 7 levels of logging, but we are using 5 names to describe them. This PR is a small fix to make the names used in the code match the intent of the code.

For example, the documentation says LogError should only be used to log fatal errors. But currently it is being used to log both fatal errors and nonfatal errors. If you would like to log errors less, or at lower priority, I would enthusiastically agree with that goal. But I don't think those changes should be folded into this PR, which is minimal and not trying to fix problems that are broader and more complicated than simple bad naming.

Your flowchart is a good idealization, but it does not describe the realities of current code, which is using logging for ordinary error reporting, not just the most critical errors. A description of the actual log levels in our code base is:

  • Fatal messages indicate severe problems (e.g. data loss) that will trigger shutdowns
  • Critical messages indicate severe problems (e.g. incorrect system time, internal bug detected) that users should address in some way
  • Error messages indicate less severe problems (e.g. loss of connectivity) that the users should probably know about, but might be transient
  • Warning messages indicate unexpected conditions (e.g. user specified -checkblocks number higher than number of blocks that may be available to check) that the user should know about, and indicate potential problems

For completeness, Info messages provide information about normal activity, Debug messages can help users troubleshoot issues, and Trace is mostly unused but geared more towards developers.

Overall, I agree with the direction you want to take in reducing number of log levels, and I would applaud any effort to reduce amount of log spam and report errors by other means, but I don't think those goals should get in the way of fixing names that are confusing and misleading to developers.

If you want to open a PR that takes a broader approach and also fixes the bad names, I will happily go along with it and close this PR. But I think reasonable to want to take one step at at time here and just fix the bad naming.

src/logging.h Show resolved Hide resolved
@ajtowns
Copy link
Contributor

ajtowns commented Jun 28, 2024

Concept NACK; introducing more different levels just makes it require more effort to use the logging system, and increases the chances they will be used incorrectly.

ryanofsky added a commit to ryanofsky/bitcoin that referenced this pull request Jun 28, 2024
Remove documentation that says LogError should be used for "severe problems
that require the node (or a subsystem) to shut down entirely" because:

- This is not how `LogError` and `Level::Error` are used currently. Of 129
  current uses only 58 cases are fatal according to
  bitcoin#30348

- "[T]here's not much benefit in a log that says "hey this error is about to
  cause the node to stop" -- you already get that information by seeing
  "Shutdown: in progress..." immediately following." according to
  bitcoin#30347 (comment)
ryanofsky added a commit to ryanofsky/bitcoin that referenced this pull request Jun 30, 2024
Remove documentation that says LogError should be used for "severe problems
that require the node (or a subsystem) to shut down entirely" because:

- This is not how `LogError` and `Level::Error` are used currently. Of 129
  current uses only 58 cases are fatal according to
  bitcoin#30348

- "[T]here's not much benefit in a log that says "hey this error is about to
  cause the node to stop" -- you already get that information by seeing
  "Shutdown: in progress..." immediately following." according to
  bitcoin#30347 (comment)
@ryanofsky
Copy link
Contributor Author

Closing this PR as #30361 is a more direct fix for the issue, assuming we are ok with log levels not making a distinction between fatal errors and other types of errors. It seems like it never might have been anybody's intention to even make this distinction in the first place and it just got written in when new documentation was added.

@ryanofsky ryanofsky closed this Jun 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants