-
Notifications
You must be signed in to change notification settings - Fork 90
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for Azure Compute Gallery in CPI #710
base: master
Are you sure you want to change the base?
Conversation
- Introduces new methods in `AzureClient` to manage gallery images - Enhances `StemcellManager2` to handle gallery operations, when configured - Creates gallery image definition and version when uploading stemcells - Stores gallery metadata in blob properties during stemcell creation - Supports cross-region replication by updating target regions - Falls back to user images if gallery operations fail - Properly cleans up gallery resources during stemcell deletion - Enables efficient stemcell distribution across regions using Azure's managed gallery service while maintaining backwards compatibility with existing user image workflows.
- Adds a config option for the replica count of compute gallery image versions - Removes the dependency to the storage blob when a compute gallery image exists
f3d2521
to
147ae42
Compare
I have resolved the conflicts and added another commit addressing the feedback provided by @MSSedusch. With this change, the gallery images can now be found directly based on the tags, rather than through the tags of the VHD blob, decoupling gallery images from the VHD blobs. Theoretically, they can be deleted, but I would suggest keeping them as a fallback. |
Just to clarity, this PR covers the proposed solution in #709 FYI @s4heid, as expected, it seems that alternative approach 1 is also attractive for the community as mentioned by @jpalermo as it basically matches the other IaaS providers. Do you think it make sense to support both approaches? |
Correct. I believe both approaches have their benefits and could be valuable for the community. If you're also interested in the alternative approach of #709, I am happy to provide a complete implementation. |
This seems like a great idea. I've never heard of the Compute Gallery before and the CPI does take a bit longer to create the stemcell, but the VM creation savings seem to make it more than worth it. I think this PR stands alone just fine. It would be great for us to explore adding the additional functionality to the CPI to support "light" stemcells stored in a public gallery. We'd need to figure out some things on the publish side, but finally having light stemcells for Azure would be great. One change request for this PR though. The |
Thank you for the positive feedback, @jpalermo. It appears I missed the most crucial part of making this change usable. During my tests, I patched the CPI config directly on the director VM and rsync'ed the CPI changes. In the latest commit I have added the parameters to the job/spec. Good catch! |
cloud_error("Missing the property 'location' in the global configuration") if location.nil? | ||
|
||
metadata = stemcell_properties.dup | ||
stemcell_series = metadata['name']&.delete_prefix('bosh-azure-hyperv-ubuntu-')&.delete_suffix('-go_agent') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I notice a hardcoded ubuntu
here, which feels strange. What's the significance of shortening the stemcell name for use as a "sku" and "compute_gallery_image_definition"? Is the gallery feature not compatible with Windows stemcells?
Conceptually there isn't anything to prevent someone from creating heavy Windows stemcells, which would go through this code path.
We only produce light Windows stemcells. Previously, light stemcells would have gone through a different code path. However, the addition of the !@use_compute_gallery
in cloud.rb implies that all stemcells, light and heavy, will go through this code path now when configured to use compute gallery.
https://github.com/cloudfoundry/bosh-azure-cpi-release/pull/710/files#diff-a4feb37b1f55ca0dede269baecf464669cd9a30bfadf2c14568cf43e8a2771f1R85
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In theory, nothing should prevent a user from uploading a full windows stemcell, but as you said currently there are no full windows stemcells published. When uploading windows light stemcells, they are handled by the light-stemcell-manager which does not interfere with compute gallery.
Since compute gallery images are referenced by resource ID in create_vm
, unlike light stemcells which use publisher/offer/sku, the publisher/offer/sku of compute gallery images can essentially be chosen arbitrarily.
After reconsidering the current choices and discussing with @MSSedusch, we concluded that the current naming in this PR is not ideal. We have come up with the following alternative:
{
"publisher": "bosh",
"offer": "bosh-azure-hyperv-ubuntu-bionic-go_agent",
"sku": "gen1"
}
The value of offer
refers to cloud_properties.name
without any changes. publisher
is set to bosh
, which is more appropriate than azure
. sku
typically indicates a version or generation, and we are proposing gen1
as the default. For newer stemcell-series like ubuntu-noble, the correct hypervgeneration value is needed; otherwise, features like the efi bootloader won't work correctly (see #706). This means the generation should be added to the stemcell's cloud properties. Once added, we can read it from the metadata, replace the static reference, and set the value when creating the image definition and for the regular images.
@ystros if this proposal makes sense to you, I would apply the changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @s4heid for the explanation of the chosen publisher, sku, and offer fields. I'd defer to your expertise on the required values for these fields. Your updated proposal makes more sense as a generalized solution than trying to massage the stemcell name.
When uploading windows light stemcells, they are handled by the light-stemcell-manager which does not interfere with compute gallery.
This used to be the case, but as I mentioned, this PR breaks that flow when compute gallery is configured. I'll open up a new thread on the specific line so that it is more clear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, I didn't like the massage of the stemcell name either. I have changed it as part of 236ff8e.
{ | ||
'location' => location, | ||
'tags' => metadata, | ||
'storage_account_name' => @default_storage_account_name, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ℹ️ The usage of a storage account here confused me initially, but it appears even when using managed disks the CPI will either search for a usable storage account, or create one if it can't find one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I also needed a moment to understand this. CPI will then attempt to copy the stemell blob into that storage account.
@@ -80,7 +82,7 @@ def create_stemcell(image_path, cloud_properties) | |||
'stemcell' => "#{cloud_properties.fetch('name', 'unknown_name')}-#{cloud_properties.fetch('version', 'unknown_version')}" | |||
} | |||
@telemetry_manager.monitor('create_stemcell', extras: extras) do | |||
if has_light_stemcell_property?(cloud_properties) | |||
if has_light_stemcell_property?(cloud_properties) && !@use_compute_gallery |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change (and the corresponding update in delete_stemcell
) breaks uploading light Windows stemcells when the new compute gallery feature is active. Since !@use_compute_gallery
is false in this scenario, uploading a light Windows stemcell will not use the light stemcell manager. Instead, it will call the stemcell manager 2 code, and attempt to create a gallery image out of the light stemcell.
I created a dev release from this branch + deployed a Director to confirm this. Uploading the latest Windows 2019 light stemcell results in a CPI error as it tries to treat the light stemcell as a heavy one, and convert it into a gallery image:
$ bosh upload-stemcell https://bosh-windows-stemcells-production.s3.amazonaws.com/2019/light-bosh-stemcell-2019.83-azure-hyperv-windows2019-go_agent.tgz
Using environment '10.0.8.5' as client 'admin'
Task 6
Task 6 | 22:07:05 | Update stemcell: Downloading remote stemcell (00:00:00)
Task 6 | 22:07:05 | Update stemcell: Extracting stemcell archive (00:00:00)
Task 6 | 22:07:05 | Update stemcell: Verifying stemcell manifest (00:00:00)
Task 6 | 22:07:07 | Update stemcell: Checking if this stemcell already exists (00:00:00)
Task 6 | 22:07:07 | Update stemcell: Uploading stemcell bosh-azure-hyperv-windows2019-go_agent/2019.83 to the cloud (00:00:01)
L Error: Unknown CPI error 'Unknown' with message 'undefined method `split' for nil' in 'create_stemcell' CPI method (CPI request ID: 'cpi-133183')
Snippet from the CPI log:
I, [2025-03-04T22:07:08.851675 #19043 #2440] INFO -- [req_id cpi-133183]: StemcellManager2.create_stemcell(/var/vcap/data/director/tmp/stemcell20250304-19012-7g86nu/image, {"infrastructure"=>"azure", "os_type"=>"windows", "image"=>{"offer"=>"bosh-windows-server-2019", "publisher"=>"pivotal", "sku"=>"2019-sku2", "version"=>"2019.83.004001"}})
I, [2025-03-04T22:07:08.851941 #19043] INFO -- [req_id cpi-133183]: Finished create_stemcell in 0.78 seconds
Rescued Unknown: undefined method `split' for nil. backtrace: /var/vcap/data/packages/bosh_azure_cpi/50e6cd4e8908c1507e2d71d4ad0c162e92c078fa/lib/cloud/azure/stemcell/stemcell_manager2.rb:122:in `_make_semver_compliant'
/var/vcap/data/packages/bosh_azure_cpi/50e6cd4e8908c1507e2d71d4ad0c162e92c078fa/lib/cloud/azure/stemcell/stemcell_manager2.rb:24:in `create_stemcell'
/var/vcap/data/packages/bosh_azure_cpi/50e6cd4e8908c1507e2d71d4ad0c162e92c078fa/lib/cloud/azure/cloud.rb:88:in `block (2 levels) in create_stemcell'
/var/vcap/data/packages/bosh_azure_cpi/50e6cd4e8908c1507e2d71d4ad0c162e92c078fa/lib/cloud/azure/telemetry/telemetry_manager.rb:71:in `monitor'
/var/vcap/data/packages/bosh_azure_cpi/50e6cd4e8908c1507e2d71d4ad0c162e92c078fa/lib/cloud/azure/cloud.rb:84:in `block in create_stemcell'
/var/vcap/data/packages/bosh_azure_cpi/50e6cd4e8908c1507e2d71d4ad0c162e92c078fa/gem_home/ruby/3.3.0/gems/bosh_common-2.0.0/lib/common/thread_formatter.rb:50:in `with_thread_name'
/var/vcap/data/packages/bosh_azure_cpi/50e6cd4e8908c1507e2d71d4ad0c162e92c078fa/lib/cloud/azure/cloud.rb:80:in `create_stemcell'
/var/vcap/data/packages/bosh_azure_cpi/50e6cd4e8908c1507e2d71d4ad0c162e92c078fa/gem_home/ruby/3.3.0/gems/bosh_cpi-2.6.0/lib/bosh/cpi/cli.rb:90:in `public_send'
/var/vcap/data/packages/bosh_azure_cpi/50e6cd4e8908c1507e2d71d4ad0c162e92c078fa/gem_home/ruby/3.3.0/gems/bosh_cpi-2.6.0/lib/bosh/cpi/cli.rb:90:in `run'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did some additional testing today. Was able to upload multiple versions and lines of ubuntu stemcells and deploy them as well as windows2019 light stemcells. |
Summary
This change relates to #709 and introduces support for Azure Compute Gallery enabling centralized management and distribution of stemcell images across regions and accounts while maintaining backwards compatibility with existing user image workflows.
When enabled via cpi config parameter (
azure.compute_gallery_name
), BOSH will:Stemcell Workflow with Compute Gallery
Checklist
Please check each of the boxes below for which you have completed the corresponding task:
Unit Test output:
$ bundle exec rake spec:unit
Rubocop output:
$ bundle exec rake rubocop
These violations look unrelated to this change.
Changelog
compute_gallery_name
in CPI configuse_managed_disks: true
andlocation
to be set.Manual Testing
I deployed this pull request on a bosh-director and enabled the compute gallery via the cpi config. I uploaded the latest stemcell from bosh.io and the cpi created the image definition and image version as shown in the screenshots below.
image definition
image version
Duration of
bosh upload-stemcell
Performance Analysis: Azure Compute Gallery vs Regular Images
Test Methodology
Standard_B2as_v2
v50.0.1
5af7bbe
(with 3 replicas)eastus
Results
VM Creation Times (mm:ss)
x̄
σ
min
max
Statistics
Observations