Skip to content

CORE-7691 License: enterprise license verification on major version upgrade#23503

Merged
pgellert merged 7 commits intoredpanda-data:devfrom
pgellert:license/strict-enforcement-on-upgrade
Oct 4, 2024
Merged

CORE-7691 License: enterprise license verification on major version upgrade#23503
pgellert merged 7 commits intoredpanda-data:devfrom
pgellert:license/strict-enforcement-on-upgrade

Conversation

@pgellert
Copy link
Copy Markdown
Contributor

This introduces strict enterprise license enforcement to redpanda, which is enforced on major version logical version upgrades starting with the major version release v24.3.x. When the cluster uses enterprise features without a valid enterprise license, and the Redpanda process detects that it is undergoing a major version upgrade, the process will abort and prompt the user to install a valid enterprise license.

This introduces an enum release_version that associates logical versions with particular releases and provides a way to describe which logical version bumps correspond to major version releases.

Ducktape tests, which use enterprise features and perform major version upgrades, also require a license. For this, we modify these tests to install a sample license available in CI through an environment variable. There are plans to automate configuring the license locally for developers using vtools, but that is beyond this PR. For now, a sample license needs to be set in the REDPANDA_SAMPLE_LICENSE environment variable when running these tests locally.

Note that join_node prevention has been discussed in the RFC. This is not included in this PR and will be future work we will likely take up.

Fixes https://redpandadata.atlassian.net/browse/CORE-7691

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.2.x
  • v24.1.x
  • v23.3.x

Release Notes

  • none

@pgellert pgellert requested review from a team, dotnwat and mmaslankaprv September 26, 2024 11:29
@pgellert pgellert self-assigned this Sep 26, 2024
@pgellert pgellert requested review from michael-redpanda and oleiman and removed request for a team September 26, 2024 11:29
Comment thread src/v/cluster/feature_manager.cc Outdated
Comment on lines +322 to +329
throw std::runtime_error{
"Looks like you’ve enabled a Redpanda Enterprise feature(s) without "
"a valid license. You cannot upgrade the cluster with Redpanda "
"Enterprise features without a valid license. Please downgrade this "
"broker to the pre-upgrade version and enter an active Redpanda "
"license key (e.g. rpk cluster license set <key>) before performing "
"the upgrade. If you don’t have one, please request a new/trial "
"license at https://redpanda.com/license-request"};
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@redpanda-data/documentation This is the error message I would like your review on for this PR. This is what we are going to print in the broker logs when the broker aborts upon detecting that the broker is undergoing a major version upgrade in a non-compliant enterprise license state.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: Github doesn't update the preview shown here, so make sure to click on the name of the file at the top of this comment and view the up-to-date message.

With the current state of the code, the message looks like this:

ERROR 2024-09-27 10:44:49,704 [shard 0:main] main - application.cc:511 - Failure during startup: std::runtime_error (Looks like you’ve enabled Redpanda Enterprise features ([rbac]) without a valid license. You cannot upgrade the cluster with Redpanda Enterprise features without a valid license. Please downgrade this broker to the pre-upgrade version and enter an active Redpanda license key (e.g. rpk cluster license set <key>) before performing the upgrade. If you don’t have one, please request a new/trial license at https://redpanda.com/license-request)

Copy link
Copy Markdown
Contributor

@michael-redpanda michael-redpanda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks reasonable to me. Just some chores/nits/questions

Comment thread tests/rptest/services/redpanda.py
Comment thread tests/rptest/services/redpanda.py Outdated
Comment thread src/v/features/feature_table.h Outdated
Comment on lines +50 to +54
prev_version = installer.highest_from_prior_feature_version(
RedpandaInstaller.HEAD)
latest_version = installer.head_version()
self.logger.info(
f"Testing with versions: {prev_version=} {latest_version=}")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

chore/discussion: nothing wrong here, but triggered this thought: I think once we branch the release (say v24.3.x), we should update the feature table as the next commit to dev. I think this should be policy in core. Thoughts @dotnwat @mmaslankaprv ?

Comment thread tests/rptest/tests/license_enforcement_test.py Outdated
Comment thread tests/rptest/tests/license_enforcement_test.py Outdated
Comment thread src/v/redpanda/application.cc
Comment thread tests/rptest/tests/random_node_operations_test.py
Copy link
Copy Markdown
Member

@oleiman oleiman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

few minor comments/questions, but looking good

Comment thread src/v/features/feature_table.h Outdated
Comment thread src/v/features/feature_table.cc
Comment thread src/v/features/feature_table.h Outdated
Comment thread src/v/cluster/feature_manager.cc Outdated
@pgellert pgellert force-pushed the license/strict-enforcement-on-upgrade branch from 24378bb to fef1277 Compare September 27, 2024 11:59
@pgellert
Copy link
Copy Markdown
Contributor Author

Force-push: address the code review feedback above

@pgellert pgellert force-pushed the license/strict-enforcement-on-upgrade branch from fef1277 to 4e1aeff Compare September 27, 2024 13:49
@pgellert
Copy link
Copy Markdown
Contributor Author

Force-push: fix linting error

@vbotbuildovich
Copy link
Copy Markdown
Collaborator

vbotbuildovich commented Sep 27, 2024

new failures in https://buildkite.com/redpanda/redpanda/builds/55321#01923403-42ca-4b33-bc9d-ff9c03ec78be:

"rptest.tests.redpanda_oauth_test.OIDCLicenseTest.test_license_nag.authn_config=.http_authentication.OIDC.BASIC"
"rptest.tests.schema_registry_test.SchemaRegistryLicenseTest.test_license_nag.mode=SchemaIdValidationMode.REDPANDA"

new failures in https://buildkite.com/redpanda/redpanda/builds/55321#01923403-42cc-457c-93b9-87878587cf53:

"rptest.tests.rbac_test.RBACLicenseTest.test_license_nag"
"rptest.tests.redpanda_oauth_test.OIDCLicenseTest.test_license_nag.authn_config=.sasl_mechanisms.OAUTHBEARER.SCRAM"

new failures in https://buildkite.com/redpanda/redpanda/builds/55321#01923403-42d0-4afa-a435-cc405671f477:

"rptest.tests.schema_registry_test.SchemaRegistryLicenseTest.test_license_nag.mode=SchemaIdValidationMode.COMPAT"

new failures in https://buildkite.com/redpanda/redpanda/builds/55321#01923403-42ce-4161-a84a-c78bb4375659:

"rptest.tests.redpanda_kerberos_test.RedpandaKerberosLicenseTest.test_license_nag"

new failures in https://buildkite.com/redpanda/redpanda/builds/55321#01923405-bd3d-4d6d-ba64-73747df990fa:

"rptest.tests.redpanda_oauth_test.OIDCLicenseTest.test_license_nag.authn_config=.http_authentication.OIDC.BASIC"
"rptest.tests.schema_registry_test.SchemaRegistryLicenseTest.test_license_nag.mode=SchemaIdValidationMode.REDPANDA"

@pgellert pgellert force-pushed the license/strict-enforcement-on-upgrade branch from 4e1aeff to 866f06c Compare September 27, 2024 16:48
@pgellert
Copy link
Copy Markdown
Contributor Author

Force-push:

  • Fix the failing ducktape tests: they were failing because I modified the license nag message

@pgellert pgellert force-pushed the license/strict-enforcement-on-upgrade branch from 866f06c to 159defa Compare September 27, 2024 17:23
@pgellert
Copy link
Copy Markdown
Contributor Author

Force-push: improved the assertion error message when the test is trying to install a license but there is none available

Copy link
Copy Markdown
Member

@oleiman oleiman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking good. several comments, all small stuff or questions

Comment thread tests/rptest/tests/random_node_operations_test.py
Comment thread src/v/features/feature_table.h Outdated
Comment thread src/v/features/feature_table.cc
Comment thread src/v/features/feature_table.cc
Comment thread src/v/features/enterprise_features.h
Comment thread tests/rptest/tests/rbac_test.py Outdated
Comment thread src/v/cluster/feature_manager.cc
Comment thread src/v/cluster/feature_manager.h Outdated
bool _am_controller_leader{false};

// Blocks cluster upgrades until the enterprise license has been verified
bool _verified_enterprise_license{false};
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: could we just as easily use a semaphore for this? do an abortable sem::wait rather than explicitly sleep in a loop?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think in this case we can use a ss::condition_variable it is intended to be used as a wait() notify() mechanism.

@pgellert pgellert force-pushed the license/strict-enforcement-on-upgrade branch from 159defa to b0363cd Compare September 30, 2024 09:20
@pgellert
Copy link
Copy Markdown
Contributor Author

Force-push:

  • Wait for precondition with semaphore > polling
  • Search for license nag with more specific text in tests
  • Address other minor feedback

@pgellert pgellert requested a review from oleiman September 30, 2024 09:22
Copy link
Copy Markdown
Contributor

@michael-redpanda michael-redpanda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Comment thread tests/rptest/services/redpanda.py Outdated
@pgellert
Copy link
Copy Markdown
Contributor Author

pgellert commented Oct 1, 2024

/cdt

@pgellert
Copy link
Copy Markdown
Contributor Author

pgellert commented Oct 2, 2024

/test-release-pipeline

Copy link
Copy Markdown
Member

@oleiman oleiman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

holding off ✅ until CDT is fixed, but code lgtm

Copy link
Copy Markdown
Member

@dotnwat dotnwat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code looks really good. just some minor feedback, and clarifications about how the phases of start-up function.

Comment thread tests/rptest/tests/end_to_end.py
Comment on lines +256 to +258
# Installing a license is required for version upgrades with enterprise features
if license_required:
self.redpanda.install_license()
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a reason this is repeated here in redpanda_test.py and e2e_test.py and upgrade_test.py, rather than being somewhere common like the RedpandaService constructor in redpanda.py?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the commit message:

Note that we don't install a license in all tests, but only in
those that require one. This is to ensure that during local development
we can conveniently run any test without a license that doesn't require
one.

Comment thread src/v/features/feature_table.h Outdated
};

constexpr cluster::cluster_version to_cluster_version(release_version rv) {
return cluster::cluster_version{static_cast<int64_t>(rv)};
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should check bounds, or better (because the space isn't necessarily dense), just have a switch statement for the translation.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. We only call this function with a valid release_version so it shouldn't be a problem as far as I know, but it makes sense to fail on invalid release_versions here so I'll go with the switch statement.

if (
version < to_cluster_version(release_version::MIN)
|| version > to_cluster_version(release_version::MAX)) {
// Unknown versions default to being a major version release
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when does it happen? should it be an error?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This happens during tests that use some dummy features that don't correspond to real releases:

static std::array test_extra_schema{
// For testing, a feature that does not auto-activate
feature_spec{
cluster::cluster_version{2001},
"__test_alpha",
feature::test_alpha,
feature_spec::available_policy::explicit_only,
feature_spec::prepare_policy::always},

In production, we have the guarantee that the active version of the cluster is within the earliest_version and latest_version range of the binary, which should all be present in the release_version enum.

// Unknown versions default to being a major version release
return true;
}
switch (static_cast<release_version>(version())) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe there just needs to be a note on the enum that states that between MIN/MAX the range must be dense for this to not be UB?

Copy link
Copy Markdown
Member

@oleiman oleiman Oct 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

enum class is scoped by definition, so I don't think the cast can result in UB as such. the result is unspecified (I think) if version() doesn't correspond to one of the enum values though, so point taken.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment now on the enum to call this out.

Comment thread src/v/features/feature_table.h Outdated
v24_1_1 = 12,
v24_2_1 = 13,
v24_3_1 = 14,
MAX = v24_3_1,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe leave a comment around here somewhere that this affects feature_table.cc::latest_version. same for earliest version?

Copy link
Copy Markdown
Contributor Author

@pgellert pgellert Oct 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The part about latest_version makes sense because it is defined in terms of MAX. But the earliest_version is defined in terms of a specific release version and not the min/max so I'm not sure what I could add here.

Comment thread src/v/cluster/feature_manager.cc Outdated
Comment on lines +303 to +306
vlog(clusterlog.info, "Verifying enterprise license...");

if (!need_to_verify_enterprise_license()) {
vlog(clusterlog.info, "Enterprise license verification skipped...");
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i might consider making these two debug level

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that makes sense

Comment on lines +347 to +349
std::ignore = co_await ss::get_units(
_verified_enterprise_license, 1, _as.local());
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't this the same as _verified_enterprise_license.wait(_as.local()) and avoids needing to explain why the units are dropped?

Copy link
Copy Markdown
Contributor Author

@pgellert pgellert Oct 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that's nicer. Thanks.

This is not equivalent because wait consumes those units, and we will enter this method multiple times, so we want to reset the semaphore to 1 after we've established that the precondition is met.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense

ss::future<> feature_manager::maybe_update_feature_table() {
// Before doing any feature enablement or active version update, check that
// the cluster has an enterprise license if they use enterprise features
if (need_to_verify_enterprise_license()) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you need to check need_to_verify_enterprise_license here, if you have code elsewhere that already checks it, and sets _verified_enterprise_license?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not required, but it means that we can unblock this fibre (slightly) earlier for non-upgrading nodes. I went with this to make startup more similar to the existing behaviour for non-upgrading nodes, but it is not strictly necessary.

Comment on lines +3128 to +3191
controller->get_feature_manager()
.invoke_on(
cluster::feature_manager::backend_shard,
[](cluster::feature_manager& fm) {
return fm.verify_enterprise_license();
})
.get();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Presumably the proper we want is that start-up happens in phases.

  • Phase 1 - Startup to the point that "active version" is set and correct and pause.
  • Phase 2 - Check license, and abort start-up if it is not valid
  • Phase 3 - Allow the feature manager the opportunity agree on a newer version for the cluster

Is this what is happening? If so, great, but it's kind of hard to "see" that this happening. The phases are spread out, and maybe even some of them are asynchronously running?

I guess the first thing to do is make sure we're on the same page about what is happening. Then we can figure out if there is any changes needed. The minor concern I have is that it will be easy to break this with even simple code refactoring?

Copy link
Copy Markdown
Contributor Author

@pgellert pgellert Oct 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this what is happening?

Yes.

I think the complexity comes from:

  1. The fact that Phase 1 can take separate paths depending on whether this is a restarting node or a node that has just joined the cluster. I've tried to describe this in the comment above.
  2. In Phase 3, while the application startup itself is sequential, the fibre that performs the potential advancing of a newer version for the cluster runs asynchronously, so we need to signal to that fibre which we do with a mutex.

The minor concern I have is that it will be easy to break this with even simple code refactoring?

Do you have a specific example in mind? The main behaviour we care about is that non-compliant clusters abort and compliant clusters don't abort during an upgrading start, which I've tried to provide reasonable coverage for in the new ducktape tests.

@pgellert
Copy link
Copy Markdown
Contributor Author

pgellert commented Oct 2, 2024

/cdt

A number of ducktape tests require an enterprise license, so introduce
this helper to make installing the license straightforward in these
tests.
The commits in this PR change Redpanda to require an enterprise license
on upgrade, which means that ducktape tests that use enterprise features
and are upgrading nodes also require a license.

Therefore, this installs a license in all existing tests that require a
license. Note that we don't install a license in all tests, but only in
those that require one. This is to ensure that during local development
we can conveniently run any test without a license that doesn't require
one.
This introduces the `release_version` which is going to be useful to
define which logical version bumps correspond to a major version
release.

This allows us to check if the node is trying to perform a major version
upgrade. In the next commit we use this to perform license validation
only on feature version increases that are major version upgrades.
Pure refactor, no behaviour change.

Use the `release_version` enum more widely throughout the codebase as it
makes the code more readable.

Also moves some translation-unit-local variable from static to
anonymous namespace.
* Print the list of enterprise features used in license nag
* Improve wording and references based on Docs Team's feedback

Includes a small refactor to remove the now trivial
`feature_manager::license_required_feature_enabled` helper.
Introduces stict checking of the enterprise license on upgrade when
enterprise features are used in the cluster.

The license enforcement only applies on major version upgrades for now.
Users should be able to bootstrap a cluster with enterprise features
even without an enterprise license.

When startup is aborted, the following message gets printed:
```
ERROR 2024-10-01 12:46:21,863 [shard 0:main] main - application.cc:511 - Failure during startup: std::runtime_error (A Redpanda Enterprise Edition license is required to use enterprise features: ([rbac]). Enter an active license key (for example, rpk cluster license set <key>). To request a license, see https://redpanda.com/license-request. For more information, see https://docs.redpanda.com/current/get-started/licenses.)
```
Remind us to consider upgrading the earliest_version whenever we update
the latest_version.

Ideally these two should be at most 1 version apart, but for now, this
commit just maintains the status quo.
@pgellert pgellert force-pushed the license/strict-enforcement-on-upgrade branch from 51ddcb3 to 54bb24f Compare October 3, 2024 14:12
@pgellert
Copy link
Copy Markdown
Contributor Author

pgellert commented Oct 3, 2024

Force push 1:

  • Fix a failure case in the newly added ducktape tests (discovered by CDT) where deleting the data directory of a running node lead to a crash during the test.
  • Address @dotnwat's feedback

Force push 2:

  • Rebase to dev to ensure that no new tests on dev fail because they are missing a license

@pgellert
Copy link
Copy Markdown
Contributor Author

pgellert commented Oct 3, 2024

/cdt
rptest/tests/license_enforcement_test.py

Copy link
Copy Markdown
Member

@oleiman oleiman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool, lgtm

Comment thread src/v/features/feature_table.h
@pgellert pgellert merged commit 0c0a3db into redpanda-data:dev Oct 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants