Recurring error "400 route operation in progress" when applying 3-network-hub-and-spoke #1228
Comments
For reference: my last full 3-networks-hub-and-spoke apply (up through 5-app-infra) used Terraform 1.3.10, to avoid the issue of running Cloud Build with 1.3 when the local default is 1.7.5 (since downgraded to 1.5.7 in Cloud Shell).
I will also retest 3-nhas as soon as I finish the TEF upstream sync for 20240511 main in GoogleCloudPlatform/pbmm-on-gcp-onboarding#387. There are 2 symlinks in nonproduction that need to be reverted in #1107, but they work fine as double symlinks for now.
This error occurs when there are too many simultaneous operations on peering. It is not occurring in our integration tests. Are the environments being deployed in parallel? One workaround is to run Terraform with `-parallelism=1`, but the build will take much longer because resource operations then run one at a time.
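A minimal sketch of how a pipeline could combine the serial-apply workaround with a retry on this specific error. The function names, attempt count, and delay are hypothetical choices for illustration, not part of the foundation code; the marker string is copied from the error message in this issue.

```python
import subprocess
import time

# Message text copied from the error reported in this issue.
TRANSIENT_MARKER = "There is a route operation in progress on the local or peer network"


def is_transient_peering_error(output: str) -> bool:
    """Return True if the Terraform output contains the route-operation 400 error."""
    return TRANSIENT_MARKER in output


def apply_with_retry(run_apply, max_attempts=3, delay_seconds=60, sleep=time.sleep):
    """Call run_apply() until it succeeds or fails with a different error.

    run_apply must return a tuple (exit_code, combined_output).
    Returns the (exit_code, output) of the last attempt.
    """
    for attempt in range(1, max_attempts + 1):
        code, output = run_apply()
        # Stop on success, or on any failure that is not the known transient error.
        if code == 0 or not is_transient_peering_error(output):
            return code, output
        if attempt < max_attempts:
            sleep(delay_seconds)
    return code, output


def terraform_apply(workdir, parallelism=1):
    """Run `terraform apply` in workdir with limited parallelism and capture output."""
    proc = subprocess.run(
        ["terraform", "apply", "-auto-approve", "-input=false",
         f"-parallelism={parallelism}"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return proc.returncode, proc.stdout + proc.stderr
```

A call could look like `apply_with_retry(lambda: terraform_apply("envs/development"))`, where the directory name is an assumption. A retry that reports success does not by itself prove the state is complete, so it is worth checking the stage outputs afterwards.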
@sleighton2022: there is no parallelism here, and the deployment is done manually. It does not happen every time, not even often. I got one of these on 05/30 and another one today.

However, the problem is deeper and nastier. In both cases the error was associated with tfstate corruption. On 05/30 it occurred during the apply for 3-nhas production, and today during the apply for 3-nhas development. Superficially, a retry (tf plan, then apply) seemed to fix the issue both times, and 3-nhas appeared to deploy without error. In reality, the tfstate for the stage where the error occurred (prod on 05/30, dev today) was corrupted and was missing variables that should have been generated by outputs.tf. As a result, when deploying 4-projects these variables are not found and the deployment fails for good.

Example: after today's failed deployment I compared the tfstate files under the key "networks". While prod and nprod contained the same output variables (with different values), quite a few were missing for dev. More precisely, the following were missing, and possibly others: "base_subnets_secondary_ranges": { "base_subnets_self_links": {
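The comparison described above can be scripted. This is a sketch that assumes the state files have been downloaded locally as JSON and that outputs sit under the top-level "outputs" key of the state document; the helper names and file names are hypothetical.

```python
import json


def output_names(state: dict) -> set:
    """Return the names of the root-module outputs recorded in a state document."""
    return set(state.get("outputs", {}))


def missing_outputs(reference_state: dict, candidate_state: dict) -> list:
    """List outputs present in reference_state but absent from candidate_state."""
    return sorted(output_names(reference_state) - output_names(candidate_state))


def load_state(path: str) -> dict:
    """Load a Terraform state file (JSON) from a local path."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

For example, `missing_outputs(load_state("production.tfstate"), load_state("development.tfstate"))` would list the output names that exist for production but not for development. Running such a check after every 3-nhas apply could catch the missing outputs before 4-projects fails on them.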
Interestingly, the same issue has been reported with the project-factory module, apparently not directly related to TEF. The people reporting it think the error points to a race condition: "Cloud DNS and Peering" (Terraform Providers / Google, HashiCorp Discuss).
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 7 days
TL;DR
This happens almost every time when deploying dev, nprod or prod. I have to run plan and apply again, and then everything is fine. But this kind of error will break any pipeline that deploys the spokes automatically.
. . .
module.base_env.module.restricted_shared_vpc[0].module.regular_service_perimeter.google_access_context_manager_service_perimeter.regular_service_perimeter: Creating...
module.base_env.module.restricted_shared_vpc[0].module.regular_service_perimeter.google_access_context_manager_service_perimeter.regular_service_perimeter: Creation complete after 3s [id=accessPolicies/6329355927/servicePerimeters/sp_n_shared_restricted_default_perimeter_e480]
module.base_env.module.restricted_shared_vpc[0].module.regular_service_perimeter.google_access_context_manager_service_perimeter_resource.service_perimeter_resource["115822756025"]: Creating...
module.base_env.module.restricted_shared_vpc[0].module.regular_service_perimeter.google_access_context_manager_service_perimeter_resource.service_perimeter_resource["115822756025"]: Creation complete after 2s [id=accessPolicies/6329355927/servicePerimeters/sp_n_shared_restricted_default_perimeter_e480/projects/115822756025]
module.base_env.module.restricted_shared_vpc[0].google_access_context_manager_service_perimeter.bridge_to_network_hub_perimeter[0]: Creating...
module.base_env.module.restricted_shared_vpc[0].google_access_context_manager_service_perimeter.bridge_to_network_hub_perimeter[0]: Creation complete after 0s [id=accessPolicies/6329355927/servicePerimeters/spb_c_to_n_shared_restricted_bridge_e480]
Error: Error adding network peering: googleapi: Error 400: There is a route operation in progress on the local or peer network. Try again later., badRequest
with module.base_env.module.base_shared_vpc[0].module.peering[0].google_compute_network_peering.peer_network_peering,
on .terraform/modules/base_env.base_shared_vpc.peering/modules/network-peering/main.tf line 50, in resource "google_compute_network_peering" "peer_network_peering":
50: resource "google_compute_network_peering" "peer_network_peering" {
I did not investigate in detail what is going on; it might be a race condition or an unaccounted-for dependency.
Expected behavior
It should deploy smoothly. Why does the second attempt succeed?
Observed behavior
See the TL;DR above.
Error: Error adding network peering: googleapi: Error 400: There is a route operation in progress on the local or peer network. Try again later., badRequest
Terraform Configuration
Terraform Version
Additional information
No response