
Upgrading from one module version to another causes node pool recreation #2127

Open
koushikgongireddy opened this issue Oct 3, 2024 · 16 comments
Labels: bug, Stale

koushikgongireddy commented Oct 3, 2024

TL;DR

Upgrading from one module version to another causes the node pools to be recreated.

Expected behavior

When we upgrade GKE module versions, we see breaking changes where the GKE node pools are forced to recreate.

We are currently on the old version 17.0.0 and planning to upgrade to 30.0.0. I see there are changes in the keepers, which causes the random_id to change, and that in turn forces the node pool to be recreated.

17.0.0 - https://github.com/terraform-google-modules/terraform-google-kubernetes-engine/blob/v17.0.0/modules/beta-private-cluster-update-variant/cluster.tf#L313
30.0.0 - https://github.com/terraform-google-modules/terraform-google-kubernetes-engine/blob/v30.0.0/modules/beta-private-cluster-update-variant/cluster.tf#L500

We also tried going from 30.0.0 to 32.0.0 and the same thing happens again, because new keys were added to the keepers.
32.0.0 - https://github.com/terraform-google-modules/terraform-google-kubernetes-engine/blob/v32.0.0/modules/beta-private-cluster-update-variant/cluster.tf#L591

We need help on how to upgrade to newer versions without causing the node pools to be recreated.

Observed behavior

When we run terraform plan after upgrading to 30.0.0, we see that the following resources will be replaced:

    module.gke.module.gke.random_id.name["dev-gusc1-ee-vpc-green"] must be replaced
+/- resource "random_id" "name" {

   module.gke.module.gke.random_id.name["dev-gusc1-ee-vpc-defnp01"] must be replaced
+/- resource "random_id" "name" {

   module.gke.module.gke.google_container_node_pool.pools["dev-gusc1-ee-vpc-green"] must be replaced
+/- resource "google_container_node_pool" "pools" {

   module.gke.module.gke.google_container_node_pool.pools["dev-gusc1-ee-vpc-defnp01"] must be replaced
+/- resource "google_container_node_pool" "pools" {

17.0.0 to 30.0.0

      ~ keepers     = { # forces replacement
          + "boot_disk_kms_key"           = ""
          + "enable_gcfs"                 = ""
          + "enable_gvnic"                = ""
          + "enable_integrity_monitoring" = ""
          + "enable_secure_boot"          = "true"
          + "gpu_partition_size"          = ""
          - "labels"                      = "xxxx-vpc-green,true" -> null
          + "max_pods_per_node"           = ""
          + "min_cpu_platform"            = ""
          + "placement_policy"            = ""
          + "pod_range"                   = ""
          + "spot"                        = ""
          - "tags"                        = "xxxx-vpc-green" -> null
            # (11 unchanged elements hidden)
        }

30.0.0 to 32.0.0

      ~ keepers     = { # forces replacement
          + "enable_confidential_storage" = ""
          + "gpu_driver_version"          = ""
          + "gpu_sharing_strategy"        = ""
          + "max_shared_clients_per_gpu"  = ""
          + "queued_provisioning"         = ""
            # (22 unchanged elements hidden)
        }

Terraform Configuration

module "gke" {
  source = "terraform-google-modules/kubernetes-engine/google//modules/beta-private-cluster-update-variant"

  version = "30.0.0"
  project_id                  = var.project_id
  kubernetes_version          = var.kubernetes_version
  name                        = "${var.environment}-${var.location_acronym}-${var.cluster_name_suffix}"
  region                      = var.location
  zones                       = var.zones
.
.
.
.
.
.
 node_pools = [
    {
      name               = "${var.network_name}-defnp01"
      machine_type       = var.machine_type
      enable_secure_boot = true
      node_locations     = var.node_pools_locations
.
.
.

    },
    {
      name               = "${var.network_name}-green"
      machine_type       = var.machine_type
      enable_secure_boot = true
      node_locations     = var.node_pools_locations
.
.
.
.

    }
]
}

Terraform Version

We are currently on:

terraform {
  required_version = "1.5.5"

  required_providers {
    google = "3.90.1"
  }
}

Planning to move to:

terraform {
  required_version = "1.5.5"

  required_providers {
    google = "5.12.0"
  }
}


Also, the module version we are using is 17.0.0, and we plan to move to 30.0.0 or 32.0.0.

Additional information

No response

koushikgongireddy added the bug label on Oct 3, 2024

koushikgongireddy commented Oct 3, 2024

@apeabody @aaron-lane @bharathkkb Can you please take a look at this issue and let me know how to work around it?

I see a similar issue was previously opened and closed without resolution: #1773

aaron-lane (Contributor) commented:

Sorry, I have not maintained these modules for years.


apeabody commented Oct 3, 2024

Hi @koushikgongireddy

Any change to the keepers will normally result in nodepool replacement. To avoid this, in some cases, it may be possible to edit your remote state with the new keeper values. Here is an example of updating a keeper value in the v24.0 upgrade guide: https://github.com/terraform-google-modules/terraform-google-kubernetes-engine/blob/master/docs/upgrading_to_v24.0.md#update-variant-random-id-keepers-updated

v17 to v33 is a substantial jump, so be sure to review each of the intermediate upgrade guides.
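
For reference, the general shape of that state edit looks something like the following. This is a minimal sketch, not the exact steps from the upgrade guide; the resource addresses and keeper keys come from your own plan output, and the edited file must have its top-level "serial" incremented before it can be pushed.

# Back up the current state, then make a copy to edit
terraform state pull > state.backup.json
cp state.backup.json state.edited.json

# In state.edited.json, locate the random_id.name instances flagged in the plan
# and add the new keeper keys with the values the plan shows
# (for example "enable_secure_boot": "true" and the other new keys as empty strings ""),
# then increment the top-level "serial" field.

# Push the edited state back and verify nothing is replaced any more
terraform state push state.edited.json
terraform plan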

koushikgongireddy (Author) commented:

Thanks for the update @apeabody

Yeah, we reviewed the changes from 17.0.0 to 30.0.0 and they don't affect our workloads too much, so we are good!

So for the keepers, is there any alternative other than updating the state file? If it were a one-off we could do it, but if it comes up from version to version it becomes a tedious task when managing tens of clusters. Currently we have 25+ GKE clusters!

So I want to check whether there is any other solution besides updating the state file.

Thanks
Koushik


apeabody commented Oct 3, 2024

> So for the keepers, is there any alternative other than updating the state file? If it were a one-off we could do it, but if it comes up from version to version it becomes a tedious task when managing tens of clusters. Currently we have 25+ GKE clusters!

Updating the state file should only be required when keepers are modified AND you want to avoid replacing nodepools. While keepers don't change in every major release, there have been a number of changes since v17 was released 3 years ago.

@koushikgongireddy
Copy link
Author

@apeabody - But I see the keepers also change from 30.0.0 to 32.0.0; I added those changes to the description above as well.

We tried moving one dev cluster from 30.0.0 to 32.0.0 and we are again seeing keepers changes, so if this repeats with every upgrade it is a pain point for us, managing 25+ clusters.


apeabody commented Oct 3, 2024

> enable_confidential_storage

Hi @koushikgongireddy - Curious, is there a reason the nodepool can't be recreated, especially on a dev cluster?

Part of the challenge is that these new node pool arguments can force re-creation at the provider level (for example, enable_confidential_storage has ForceNew: true: https://github.com/hashicorp/terraform-provider-google/blob/main/google/services/container/node_config.go#L753).


koushikgongireddy commented Oct 3, 2024

@apeabody - We are fine with recreation on dev clusters, but for PROD clusters it means downtime, because the new node pool is created and the old one is deleted right away.

So updating the state file is an option, but doing it for 20 PROD clusters is difficult for us, so we want to see if there is any other alternative, since we are using the external module provided by GCP.


KRASSUSS commented Oct 4, 2024

Also experiencing this issue. I'm upgrading from module v32.0.0 to v33.0.4.
From the terraform plan I can see the following change on the node pools; it wants to add gcfs_config even when it's set to false:

      + gcfs_config {
          + enabled = false
        }

This is the only change at the node pool level, which makes me think this is what causes the nodes to be recreated?


wyardley commented Oct 4, 2024

@KRASSUSS w/r/t the gcfs_config diff specifically, I think updating to a >= 6.4.0 or 5.44.1 provider version should resolve it, as long as you're also on the latest module version. If you're still seeing that with the latest provider and latest module version, maybe file a separate issue, but I think you should be good.

If it's a provider version where it's showing that diff but is not forcing recreation on a node pool, it should also be safe to apply, but you'll probably see a permadiff until the provider is upgraded.
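
If it helps, pinning the provider for that fix would look something like this (the version numbers are taken from the comment above; the exact constraint is up to you):

terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = ">= 5.44.1, < 6.0.0" # or ">= 6.4.0" if you are ready for the 6.x major
    }
  }
}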


wyardley commented Oct 4, 2024

@koushikgongireddy you could maybe try removing the item from state (after backing up state, of course) and see if reimporting the cluster and nodepools works in your lower envs? Or make sure there aren't any config changes you have to add to the settings (e.g., formerly unsupported values that are now supported). I'm not super familiar with keepers, but it's possible that if you match the existing values in the configuration properly, you won't see a diff?

For example, the labels going from set to null makes me think there are some values that might need to be reflected in your Terraform configs?

> Part of the challenge is that these new node pool arguments can force re-creation at the provider level (for example, enable_confidential_storage has ForceNew: true)

I'm hoping to eventually get more of the items that either don't support in-place updates, or don't work with updates at all, fixed, at least for the default nodepool case. I would imagine that enable_confidential_storage does need to recreate the nodepool if it is actually changing settings, but if the OP doesn't have it enabled, and doesn't want to enable it, maybe they can look at explicitly setting the value to false or null?
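
As a rough sketch of that remove-and-reimport idea (the addresses are taken from the plan output earlier in the thread; the project/location/cluster values below are placeholders, and the import ID format should be verified against your provider's documentation):

# Back up state first
terraform state pull > state.backup.json

# Drop the node pool from state without touching the real infrastructure
terraform state rm 'module.gke.module.gke.google_container_node_pool.pools["dev-gusc1-ee-vpc-green"]'

# Re-import the existing node pool under the same address, using the pool's
# actual name in GCP (with the update-variant modules this includes the random suffix)
terraform import 'module.gke.module.gke.google_container_node_pool.pools["dev-gusc1-ee-vpc-green"]' \
  my-project/us-central1/my-cluster/dev-gusc1-ee-vpc-green-abcd

# The random_id resources and their keepers still live in state, so check the plan
# carefully before applying - the keepers diff may still force a replacement.
terraform plan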


koushikgongireddy commented Oct 4, 2024

@wyardley The issue here is definitely the keepers.

As you can see in the description above, the keepers are the major change in my terraform plan. That changes the random_id, the random_id change changes the node pool name, and if the node pool name changes the node pool has to be recreated.

The issue is not that enable_confidential_storage has ForceNew: true.

If you look at the lines below, the keepers are changing, and that change is ultimately what forces the node pool recreation:

17.0.0 - https://github.com/terraform-google-modules/terraform-google-kubernetes-engine/blob/v17.0.0/modules/beta-private-cluster-update-variant/cluster.tf#L313
30.0.0 - https://github.com/terraform-google-modules/terraform-google-kubernetes-engine/blob/v30.0.0/modules/beta-private-cluster-update-variant/cluster.tf#L500

We tried updating the state file with the new values and it works fine, but we want to know whether that is the only option or if there are alternative ways to avoid the keepers change.


apeabody commented Oct 8, 2024

Only the *-update-variant modules include keepers, specifically for the node pool create-before-destroy behavior: https://github.com/terraform-google-modules/terraform-google-kubernetes-engine/tree/master/modules/beta-public-cluster-update-variant#node-pool-update-variant.

For those not making use of the node pool create-before-destroy behavior, the other modules such as beta-private-cluster do not include the keepers and would be the suggested choice. Otherwise, to avoid node pool replacement with the *-update-variant modules when new keepers are introduced, the state can be modified.
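
For context, the pattern the *-update-variant modules use looks roughly like this. It is a simplified sketch, not the module's actual code: the keeper keys and variables shown are only examples, and most required node pool arguments are omitted.

resource "random_id" "name" {
  byte_length = 2

  # Fingerprint of the node pool settings: when any of these values change,
  # the random_id changes, the pool name changes, and Terraform creates a
  # replacement pool before destroying the old one.
  keepers = {
    machine_type       = var.machine_type
    enable_secure_boot = tostring(var.enable_secure_boot)
    # newer module versions add more keys here, which is what forces the
    # replacement on upgrade unless the state is updated to match
  }
}

resource "google_container_node_pool" "pool" {
  name    = "my-pool-${random_id.name.hex}"
  cluster = google_container_cluster.cluster.name

  lifecycle {
    create_before_destroy = true
  }
}

The non-update-variant modules name the pool directly, so adding new module inputs does not trigger this rename-driven replacement (though provider-level ForceNew arguments can still force one).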

koushikgongireddy (Author) commented:

@apeabody Thanks for the update, we will definitely test with beta-private-cluster and update you!

Also, can you describe the differences between these modules in more detail? What benefits do we get with update-variant that we don't get with private-cluster?

beta-autopilot-private-cluster
beta-private-cluster-update-variant
beta-private-cluster

Thanks
Koushik


apeabody commented Oct 11, 2024

> Also, can you describe the differences between these modules in more detail? What benefits do we get with update-variant that we don't get with private-cluster?
>
> beta-autopilot-private-cluster, beta-private-cluster-update-variant, beta-private-cluster

Hi @koushikgongireddy

All three of these are similar in that they enable beta features and can be configured as private clusters.

The big differences are:
beta-autopilot-private-cluster: creates an Autopilot cluster
beta-private-cluster-update-variant: creates a Standard cluster, with the Node Pool Update Variant behavior
beta-private-cluster: creates a Standard cluster

apeabody reopened this on Oct 11, 2024
github-actions bot commented:

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 7 days

github-actions bot added the Stale label on Dec 10, 2024