
Upgrading from one module version to another causes node pool recreation #2127

Open
koushikgongireddy opened this issue Oct 3, 2024 · 16 comments
Labels: bug, Stale

koushikgongireddy commented Oct 3, 2024

TL;DR

Upgrading from one module version to another causes the node pools to be recreated.

Expected behavior

When we upgrade GKE module versions, we see breaking changes where the GKE node pools are forced to recreate.

We are currently on the old version 17.0.0 and planning to upgrade to 30.0.0. I see there are changes in the keepers, which causes the random_id to change, and that in turn forces the node pool to be recreated.

17.0.0 - https://github.com/terraform-google-modules/terraform-google-kubernetes-engine/blob/v17.0.0/modules/beta-private-cluster-update-variant/cluster.tf#L313
30.0.0 - https://github.com/terraform-google-modules/terraform-google-kubernetes-engine/blob/v30.0.0/modules/beta-private-cluster-update-variant/cluster.tf#L500

We also tried going from 30.0.0 to 32.0.0 and the same thing happens again, because new keys were added to the keepers.
32.0.0 - https://github.com/terraform-google-modules/terraform-google-kubernetes-engine/blob/v32.0.0/modules/beta-private-cluster-update-variant/cluster.tf#L591

We need help on how to upgrade to newer versions without causing the node pools to be recreated.

Observed behavior

When we run terraform plan after upgrading to 30.0.0, we see that the following resources will be replaced:

    module.gke.module.gke.random_id.name["dev-gusc1-ee-vpc-green"] must be replaced
+/- resource "random_id" "name" {

   module.gke.module.gke.random_id.name["dev-gusc1-ee-vpc-defnp01"] must be replaced
+/- resource "random_id" "name" {

   module.gke.module.gke.google_container_node_pool.pools["dev-gusc1-ee-vpc-green"] must be replaced
+/- resource "google_container_node_pool" "pools" {

   module.gke.module.gke.google_container_node_pool.pools["dev-gusc1-ee-vpc-defnp01"] must be replaced
+/- resource "google_container_node_pool" "pools" {

17.0.0 to 30.0.0

      ~ keepers     = { # forces replacement
          + "boot_disk_kms_key"           = ""
          + "enable_gcfs"                 = ""
          + "enable_gvnic"                = ""
          + "enable_integrity_monitoring" = ""
          + "enable_secure_boot"          = "true"
          + "gpu_partition_size"          = ""
          - "labels"                      = "xxxx-vpc-green,true" -> null
          + "max_pods_per_node"           = ""
          + "min_cpu_platform"            = ""
          + "placement_policy"            = ""
          + "pod_range"                   = ""
          + "spot"                        = ""
          - "tags"                        = "xxxx-vpc-green" -> null
            # (11 unchanged elements hidden)
        }

30.0.0 to 32.0.0

      ~ keepers     = { # forces replacement
          + "enable_confidential_storage" = ""
          + "gpu_driver_version"          = ""
          + "gpu_sharing_strategy"        = ""
          + "max_shared_clients_per_gpu"  = ""
          + "queued_provisioning"         = ""
            # (22 unchanged elements hidden)
        }

Terraform Configuration

module "gke" {
  source = "terraform-google-modules/kubernetes-engine/google//modules/beta-private-cluster-update-variant"

  version = "30.0.0"
  project_id                  = var.project_id
  kubernetes_version          = var.kubernetes_version
  name                        = "${var.environment}-${var.location_acronym}-${var.cluster_name_suffix}"
  region                      = var.location
  zones                       = var.zones
.
.
.
.
.
.
 node_pools = [
    {
      name               = "${var.network_name}-defnp01"
      machine_type       = var.machine_type
      enable_secure_boot = true
      node_locations     = var.node_pools_locations
.
.
.

    },
    {
      name               = "${var.network_name}-green"
      machine_type       = var.machine_type
      enable_secure_boot = true
      node_locations     = var.node_pools_locations
.
.
.
.

    }
]
}

Terraform Version

We are currently on:

terraform {
  required_version = "1.5.5"

  required_providers {
    google = "3.90.1"
  }
}

Planning to move to:

terraform {
  required_version = "1.5.5"

  required_providers {
    google = "5.12.0"
  }
}


Also, the module version we are using is 17.0.0, and we plan to move to 30.0.0 or 32.0.0.

Additional information

No response

koushikgongireddy added the bug label on Oct 3, 2024

koushikgongireddy commented Oct 3, 2024

@apeabody @aaron-lane @bharathkkb Can you please take a look at this issue and let me know how to work around it?

I see a similar issue was previously opened and closed without resolution: #1773

aaron-lane (Contributor) commented:

Sorry, I have not maintained these modules for years.


apeabody commented Oct 3, 2024

Hi @koushikgongireddy

Any change to the keepers will normally result in nodepool replacement. To avoid this, in some cases, it may be possible to edit your remote state with the new keeper values. Here is an example of updating a keeper value in the v24.0 upgrade guide: https://github.com/terraform-google-modules/terraform-google-kubernetes-engine/blob/master/docs/upgrading_to_v24.0.md#update-variant-random-id-keepers-updated

v17 to v33 is a substantial jump, so be sure to review each of the intermediate upgrade guides.
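
For reference, the general shape of that state edit looks something like the following. This is a minimal sketch, not the exact steps from the upgrade guide; the resource addresses and keeper keys come from your own plan output, and the edited file must have its top-level "serial" incremented before it can be pushed.

# Back up the current state, then make a copy to edit
terraform state pull > state.backup.json
cp state.backup.json state.edited.json

# In state.edited.json, locate the random_id.name instances flagged in the plan
# and add the new keeper keys with the values the plan shows
# (for example "enable_secure_boot": "true" and the other new keys as empty strings ""),
# then increment the top-level "serial" field.

# Push the edited state back and verify nothing is replaced any more
terraform state push state.edited.json
terraform plan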

koushikgongireddy (Author) commented:

Thanks for the update @apeabody

Yeah, we reviewed the changes from 17.0.0 to 30.0.0 and they don't affect our workloads too much, so we are good!

So for the keepers, is there any alternative other than updating the state file? If it were a one-off we could do it, but if it comes up from version to version it becomes a tedious task when managing tens of clusters. Currently we have 25+ GKE clusters!

So I want to check whether there is any other solution besides updating the state file.

Thanks
Koushik


apeabody commented Oct 3, 2024

> So for the keepers, is there any alternative other than updating the state file? If it were a one-off we could do it, but if it comes up from version to version it becomes a tedious task when managing tens of clusters. Currently we have 25+ GKE clusters!

Updating the state file should only be required when keepers are modified AND you want to avoid replacing nodepools. While keepers don't change in every major release, there have been a number of changes since v17 was released 3 years ago.

@koushikgongireddy
Copy link
Author

@apeabody - But I see the keepers also change from 30.0.0 to 32.0.0; I added those changes to the description above as well.

We tried moving one dev cluster from 30.0.0 to 32.0.0 and we are again seeing keepers changes, so if this repeats with every upgrade it is a pain point for us, managing 25+ clusters.


apeabody commented Oct 3, 2024

> enable_confidential_storage

Hi @koushikgongireddy - Curious, is there a reason the nodepool can't be recreated, especially on a dev cluster?

Part of the challenge is that these new node pool arguments can force re-creation at the provider level (for example, enable_confidential_storage has ForceNew: true: https://github.com/hashicorp/terraform-provider-google/blob/main/google/services/container/node_config.go#L753).


koushikgongireddy commented Oct 3, 2024

@apeabody - We are fine with recreation on dev clusters, but for PROD clusters it means downtime, because the new node pool is created and the old one is deleted right away.

So updating the state file is an option, but doing it for 20 PROD clusters is difficult for us, so we want to see if there is any other alternative, since we are using the external module provided by GCP.


KRASSUSS commented Oct 4, 2024

Also experiencing this issue. I'm upgrading from module v32.0.0 to v33.0.4.
From the terraform plan I can see the following change on the node pools; it wants to add gcfs_config even when it's set to false:

      + gcfs_config {
          + enabled = false
        }

This is the only change at the node pool level, which makes me think this is what causes the nodes to be recreated?


wyardley commented Oct 4, 2024

@KRASSUSS w/r/t the gcfs_config diff specifically, I think updating to a >= 6.4.0 or 5.44.1 provider version should resolve it, as long as you're also on the latest module version. If you're still seeing that with the latest provider and latest module version, maybe file a separate issue, but I think you should be good.

If it's a provider version where it's showing that diff but is not forcing recreation on a node pool, it should also be safe to apply, but you'll probably see a permadiff until the provider is upgraded.
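
If it helps, pinning the provider for that fix would look something like this (the version numbers are taken from the comment above; the exact constraint is up to you):

terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = ">= 5.44.1, < 6.0.0" # or ">= 6.4.0" if you are ready for the 6.x major
    }
  }
}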


wyardley commented Oct 4, 2024

@koushikgongireddy you could maybe try removing the item from state (after backing up state, of course) and see if reimporting the cluster and nodepools works in your lower envs? Or make sure there aren't any config changes you have to add to the settings (e.g., formerly unsupported values that are now supported). I'm not super familiar with keepers, but it's possible that if you match the existing values in the configuration properly, you won't see a diff?

For example, the labels going from set to null makes me think there are some values that might need to be reflected in your Terraform configs?

> Part of the challenge is that these new node pool arguments can force re-creation at the provider level (for example, enable_confidential_storage has ForceNew: true)

I'm hoping to eventually get more of the items that either don't support in-place updates, or don't work with updates at all, fixed, at least for the default nodepool case. I would imagine that enable_confidential_storage does need to recreate the nodepool if it is actually changing settings, but if the OP doesn't have it enabled, and doesn't want to enable it, maybe they can look at explicitly setting the value to false or null?
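
As a rough sketch of that remove-and-reimport idea (the addresses are taken from the plan output earlier in the thread; the project/location/cluster values below are placeholders, and the import ID format should be verified against your provider's documentation):

# Back up state first
terraform state pull > state.backup.json

# Drop the node pool from state without touching the real infrastructure
terraform state rm 'module.gke.module.gke.google_container_node_pool.pools["dev-gusc1-ee-vpc-green"]'

# Re-import the existing node pool under the same address, using the pool's
# actual name in GCP (with the update-variant modules this includes the random suffix)
terraform import 'module.gke.module.gke.google_container_node_pool.pools["dev-gusc1-ee-vpc-green"]' \
  my-project/us-central1/my-cluster/dev-gusc1-ee-vpc-green-abcd

# The random_id resources and their keepers still live in state, so check the plan
# carefully before applying - the keepers diff may still force a replacement.
terraform plan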


koushikgongireddy commented Oct 4, 2024

@wyardley The issue here is definitely the keepers.

As you can see in the description above, the keepers are the major change in my terraform plan. That changes the random_id, the random_id change changes the node pool name, and if the node pool name changes the node pool has to be recreated.

The issue is not that enable_confidential_storage has ForceNew: true.

If you look at the lines below, the keepers are changing, and that change is ultimately what forces the node pool recreation:

17.0.0 - https://github.com/terraform-google-modules/terraform-google-kubernetes-engine/blob/v17.0.0/modules/beta-private-cluster-update-variant/cluster.tf#L313
30.0.0 - https://github.com/terraform-google-modules/terraform-google-kubernetes-engine/blob/v30.0.0/modules/beta-private-cluster-update-variant/cluster.tf#L500

We tried updating the state file with the new values and it works fine, but we want to know whether that is the only option or if there are alternative ways to avoid the keepers change.


apeabody commented Oct 8, 2024

Only the *-update-variant modules include keepers, specifically for the node pool create-before-destroy behavior: https://github.com/terraform-google-modules/terraform-google-kubernetes-engine/tree/master/modules/beta-public-cluster-update-variant#node-pool-update-variant.

For those not making use of the node pool create-before-destroy behavior, the other modules such as beta-private-cluster do not include the keepers and would be the suggested choice. Otherwise, to avoid node pool replacement with the *-update-variant modules when new keepers are introduced, the state can be modified.
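
For context, the pattern the *-update-variant modules use looks roughly like this. It is a simplified sketch, not the module's actual code: the keeper keys and variables shown are only examples, and most required node pool arguments are omitted.

resource "random_id" "name" {
  byte_length = 2

  # Fingerprint of the node pool settings: when any of these values change,
  # the random_id changes, the pool name changes, and Terraform creates a
  # replacement pool before destroying the old one.
  keepers = {
    machine_type       = var.machine_type
    enable_secure_boot = tostring(var.enable_secure_boot)
    # newer module versions add more keys here, which is what forces the
    # replacement on upgrade unless the state is updated to match
  }
}

resource "google_container_node_pool" "pool" {
  name    = "my-pool-${random_id.name.hex}"
  cluster = google_container_cluster.cluster.name

  lifecycle {
    create_before_destroy = true
  }
}

The non-update-variant modules name the pool directly, so adding new module inputs does not trigger this rename-driven replacement (though provider-level ForceNew arguments can still force one).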

koushikgongireddy (Author) commented:

@apeabody Thanks for the update, we will definitely test with beta-private-cluster and update you!

Also, can you describe the differences between these modules in more detail? What benefits do we get with update-variant that we don't get with private-cluster?

beta-autopilot-private-cluster
beta-private-cluster-update-variant
beta-private-cluster

Thanks
Koushik


apeabody commented Oct 11, 2024

> Also, can you describe the differences between these modules in more detail? What benefits do we get with update-variant that we don't get with private-cluster?
>
> beta-autopilot-private-cluster, beta-private-cluster-update-variant, beta-private-cluster

Hi @koushikgongireddy

All three of these are similar in that they enable beta features and can be configured as private clusters.

The big differences are:
beta-autopilot-private-cluster: creates an Autopilot cluster
beta-private-cluster-update-variant: creates a Standard cluster, with the Node Pool Update Variant behavior
beta-private-cluster: creates a Standard cluster

apeabody reopened this on Oct 11, 2024
github-actions bot commented:

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 7 days

github-actions bot added the Stale label on Dec 10, 2024