CORE-7691 License: enterprise license verification on major version upgrade#23503
Conversation
| throw std::runtime_error{ | ||
| "Looks like you’ve enabled a Redpanda Enterprise feature(s) without " | ||
| "a valid license. You cannot upgrade the cluster with Redpanda " | ||
| "Enterprise features without a valid license. Please downgrade this " | ||
| "broker to the pre-upgrade version and enter an active Redpanda " | ||
| "license key (e.g. rpk cluster license set <key>) before performing " | ||
| "the upgrade. If you don’t have one, please request a new/trial " | ||
| "license at https://redpanda.com/license-request"}; |
There was a problem hiding this comment.
@redpanda-data/documentation This is the error message I would like your review on for this PR. This is what we are going to print in the broker logs when the broker aborts upon detecting that the broker is undergoing a major version upgrade in a non-compliant enterprise license state.
There was a problem hiding this comment.
There was a problem hiding this comment.
Note: Github doesn't update the preview shown here, so make sure to click on the name of the file at the top of this comment and view the up-to-date message.
With the current state of the code, the message looks like this:
ERROR 2024-09-27 10:44:49,704 [shard 0:main] main - application.cc:511 - Failure during startup: std::runtime_error (Looks like you’ve enabled Redpanda Enterprise features ([rbac]) without a valid license. You cannot upgrade the cluster with Redpanda Enterprise features without a valid license. Please downgrade this broker to the pre-upgrade version and enter an active Redpanda license key (e.g. rpk cluster license set <key>) before performing the upgrade. If you don’t have one, please request a new/trial license at https://redpanda.com/license-request)
|
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/55196#01922e65-3eb4-4d19-8747-a2dcdf6c9a61 ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/55196#01922f0a-7388-4883-a11c-0ff0d4b9c632 ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/55321#01923403-42ca-4b33-bc9d-ff9c03ec78be ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/55337#019234b8-e969-4f81-8b0c-b00ab57eecf0 ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/55424#01924283-bbc1-4b3c-b11d-cc16ca8d2326 ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/55538#01924879-59d6-475b-ab65-bb620f192383 |
michael-redpanda
left a comment
There was a problem hiding this comment.
Looks reasonable to me. Just some chores/nits/questions
| prev_version = installer.highest_from_prior_feature_version( | ||
| RedpandaInstaller.HEAD) | ||
| latest_version = installer.head_version() | ||
| self.logger.info( | ||
| f"Testing with versions: {prev_version=} {latest_version=}") |
There was a problem hiding this comment.
chore/discussion: nothing wrong here, but triggered this thought: I think once we branch the release (say v24.3.x), we should update the feature table as the next commit to dev. I think this should be policy in core. Thoughts @dotnwat @mmaslankaprv ?
24378bb to
fef1277
Compare
|
Force-push: address the code review feedback above |
fef1277 to
4e1aeff
Compare
|
Force-push: fix linting error |
|
new failures in https://buildkite.com/redpanda/redpanda/builds/55321#01923403-42ca-4b33-bc9d-ff9c03ec78be: new failures in https://buildkite.com/redpanda/redpanda/builds/55321#01923403-42cc-457c-93b9-87878587cf53: new failures in https://buildkite.com/redpanda/redpanda/builds/55321#01923403-42d0-4afa-a435-cc405671f477: new failures in https://buildkite.com/redpanda/redpanda/builds/55321#01923403-42ce-4161-a84a-c78bb4375659: new failures in https://buildkite.com/redpanda/redpanda/builds/55321#01923405-bd3d-4d6d-ba64-73747df990fa: |
4e1aeff to
866f06c
Compare
|
866f06c to
159defa
Compare
|
Force-push: improved the assertion error message when the test is trying to install a license but there is none available |
oleiman
left a comment
There was a problem hiding this comment.
looking good. several comments, all small stuff or questions
| bool _am_controller_leader{false}; | ||
|
|
||
| // Blocks cluster upgrades until the enterprise license has been verified | ||
| bool _verified_enterprise_license{false}; |
There was a problem hiding this comment.
suggestion: could we just as easily use a semaphore for this? do an abortable sem::wait rather than explicitly sleep in a loop?
There was a problem hiding this comment.
i think in this case we can use a ss::condition_variable it is intended to be used as a wait() notify() mechanism.
159defa to
b0363cd
Compare
|
|
/cdt |
|
/test-release-pipeline |
oleiman
left a comment
There was a problem hiding this comment.
holding off ✅ until CDT is fixed, but code lgtm
dotnwat
left a comment
There was a problem hiding this comment.
code looks really good. just some minor feedback, and clarifications about how the phases of start-up function.
| # Installing a license is required for version upgrades with enterprise features | ||
| if license_required: | ||
| self.redpanda.install_license() |
There was a problem hiding this comment.
is there a reason this is repeated here in redpanda_test.py and e2e_test.py and upgrade_test.py, rather than being somewhere common like the RedpandaService constructor in redpanda.py?
There was a problem hiding this comment.
From the commit message:
Note that we don't install a license in all tests, but only in
those that require one. This is to ensure that during local development
we can conveniently run any test without a license that doesn't require
one.
| }; | ||
|
|
||
| constexpr cluster::cluster_version to_cluster_version(release_version rv) { | ||
| return cluster::cluster_version{static_cast<int64_t>(rv)}; |
There was a problem hiding this comment.
should check bounds, or better (because the space isn't necessarily dense), just have a switch statement for the translation.
There was a problem hiding this comment.
Good point. We only call this function with a valid release_version so it shouldn't be a problem as far as I know, but it makes sense to fail on invalid release_versions here so I'll go with the switch statement.
| if ( | ||
| version < to_cluster_version(release_version::MIN) | ||
| || version > to_cluster_version(release_version::MAX)) { | ||
| // Unknown versions default to being a major version release |
There was a problem hiding this comment.
when does it happen? should it be an error?
There was a problem hiding this comment.
This happens during tests that use some dummy features that don't correspond to real releases:
redpanda/src/v/features/feature_table.cc
Lines 185 to 192 in 7ae3406
In production, we have the guarantee that the active version of the cluster is within the earliest_version and latest_version range of the binary, which should all be present in the release_version enum.
| // Unknown versions default to being a major version release | ||
| return true; | ||
| } | ||
| switch (static_cast<release_version>(version())) { |
There was a problem hiding this comment.
maybe there just needs to be a note on the enum that states that between MIN/MAX the range must be dense for this to not be UB?
There was a problem hiding this comment.
enum class is scoped by definition, so I don't think the cast can result in UB as such. the result is unspecified (I think) if version() doesn't correspond to one of the enum values though, so point taken.
There was a problem hiding this comment.
Added a comment now on the enum to call this out.
| v24_1_1 = 12, | ||
| v24_2_1 = 13, | ||
| v24_3_1 = 14, | ||
| MAX = v24_3_1, |
There was a problem hiding this comment.
maybe leave a comment around here somewhere that this affects feature_table.cc::latest_version. same for earliest version?
There was a problem hiding this comment.
The part about latest_version makes sense because it is defined in terms of MAX. But the earliest_version is defined in terms of a specific release version and not the min/max so I'm not sure what I could add here.
| vlog(clusterlog.info, "Verifying enterprise license..."); | ||
|
|
||
| if (!need_to_verify_enterprise_license()) { | ||
| vlog(clusterlog.info, "Enterprise license verification skipped..."); |
There was a problem hiding this comment.
i might consider making these two debug level
There was a problem hiding this comment.
Yeah that makes sense
| std::ignore = co_await ss::get_units( | ||
| _verified_enterprise_license, 1, _as.local()); |
There was a problem hiding this comment.
isn't this the same as _verified_enterprise_license.wait(_as.local()) and avoids needing to explain why the units are dropped?
There was a problem hiding this comment.
Yes that's nicer. Thanks.
This is not equivalent because wait consumes those units, and we will enter this method multiple times, so we want to reset the semaphore to 1 after we've established that the precondition is met.
| ss::future<> feature_manager::maybe_update_feature_table() { | ||
| // Before doing any feature enablement or active version update, check that | ||
| // the cluster has an enterprise license if they use enterprise features | ||
| if (need_to_verify_enterprise_license()) { |
There was a problem hiding this comment.
do you need to check need_to_verify_enterprise_license here, if you have code elsewhere that already checks it, and sets _verified_enterprise_license?
There was a problem hiding this comment.
It's not required, but it means that we can unblock this fibre (slightly) earlier for non-upgrading nodes. I went with this to make startup more similar to the existing behaviour for non-upgrading nodes, but it is not strictly necessary.
| controller->get_feature_manager() | ||
| .invoke_on( | ||
| cluster::feature_manager::backend_shard, | ||
| [](cluster::feature_manager& fm) { | ||
| return fm.verify_enterprise_license(); | ||
| }) | ||
| .get(); |
There was a problem hiding this comment.
Presumably the proper we want is that start-up happens in phases.
- Phase 1 - Startup to the point that "active version" is set and correct and pause.
- Phase 2 - Check license, and abort start-up if it is not valid
- Phase 3 - Allow the feature manager the opportunity agree on a newer version for the cluster
Is this what is happening? If so, great, but it's kind of hard to "see" that this happening. The phases are spread out, and maybe even some of them are asynchronously running?
I guess the first thing to do is make sure we're on the same page about what is happening. Then we can figure out if there is any changes needed. The minor concern I have is that it will be easy to break this with even simple code refactoring?
There was a problem hiding this comment.
Is this what is happening?
Yes.
I think the complexity comes from:
- The fact that Phase 1 can take separate paths depending on whether this is a restarting node or a node that has just joined the cluster. I've tried to describe this in the comment above.
- In Phase 3, while the application startup itself is sequential, the fibre that performs the potential advancing of a newer version for the cluster runs asynchronously, so we need to signal to that fibre which we do with a mutex.
The minor concern I have is that it will be easy to break this with even simple code refactoring?
Do you have a specific example in mind? The main behaviour we care about is that non-compliant clusters abort and compliant clusters don't abort during an upgrading start, which I've tried to provide reasonable coverage for in the new ducktape tests.
|
/cdt |
fddde2a to
51ddcb3
Compare
A number of ducktape tests require an enterprise license, so introduce this helper to make installing the license straightforward in these tests.
The commits in this PR change Redpanda to require an enterprise license on upgrade, which means that ducktape tests that use enterprise features and are upgrading nodes also require a license. Therefore, this installs a license in all existing tests that require a license. Note that we don't install a license in all tests, but only in those that require one. This is to ensure that during local development we can conveniently run any test without a license that doesn't require one.
This introduces the `release_version` which is going to be useful to define which logical version bumps correspond to a major version release. This allows us to check if the node is trying to perform a major version upgrade. In the next commit we use this to perform license validation only on feature version increases that are major version upgrades.
Pure refactor, no behaviour change. Use the `release_version` enum more widely throughout the codebase as it makes the code more readable. Also moves some translation-unit-local variable from static to anonymous namespace.
* Print the list of enterprise features used in license nag * Improve wording and references based on Docs Team's feedback Includes a small refactor to remove the now trivial `feature_manager::license_required_feature_enabled` helper.
Introduces stict checking of the enterprise license on upgrade when enterprise features are used in the cluster. The license enforcement only applies on major version upgrades for now. Users should be able to bootstrap a cluster with enterprise features even without an enterprise license. When startup is aborted, the following message gets printed: ``` ERROR 2024-10-01 12:46:21,863 [shard 0:main] main - application.cc:511 - Failure during startup: std::runtime_error (A Redpanda Enterprise Edition license is required to use enterprise features: ([rbac]). Enter an active license key (for example, rpk cluster license set <key>). To request a license, see https://redpanda.com/license-request. For more information, see https://docs.redpanda.com/current/get-started/licenses.) ```
Remind us to consider upgrading the earliest_version whenever we update the latest_version. Ideally these two should be at most 1 version apart, but for now, this commit just maintains the status quo.
51ddcb3 to
54bb24f
Compare
|
|
/cdt |
This introduces strict enterprise license enforcement to redpanda, which is enforced on major version logical version upgrades starting with the major version release v24.3.x. When the cluster uses enterprise features without a valid enterprise license, and the Redpanda process detects that it is undergoing a major version upgrade, the process will abort and prompt the user to install a valid enterprise license.
This introduces an enum
release_versionthat associates logical versions with particular releases and provides a way to describe which logical version bumps correspond to major version releases.Ducktape tests, which use enterprise features and perform major version upgrades, also require a license. For this, we modify these tests to install a sample license available in CI through an environment variable. There are plans to automate configuring the license locally for developers using vtools, but that is beyond this PR. For now, a sample license needs to be set in the
REDPANDA_SAMPLE_LICENSEenvironment variable when running these tests locally.Note that
join_nodeprevention has been discussed in the RFC. This is not included in this PR and will be future work we will likely take up.Fixes https://redpandadata.atlassian.net/browse/CORE-7691
Backports Required
Release Notes