Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature Request] Steal Time support in ARM #4866

Open
3 tasks done
Manciukic opened this issue Oct 22, 2024 · 10 comments
Open
3 tasks done

[Feature Request] Steal Time support in ARM #4866

Manciukic opened this issue Oct 22, 2024 · 10 comments
Assignees
Labels
Good first issue Indicates a good issue for first-time contributors Priority: Medium Indicates than an issue or pull request should be resolved ahead of issues or pull requests labelled rust Pull requests that update Rust code Status: Parked Indicates that an issues or pull request will be revisited later Type: Enhancement Indicates new feature requests

Comments

@Manciukic
Copy link
Contributor

Feature Request

PV Steal Time reporting on ARM is available in KVM since 5.10 at least but Firecracker doesn't enable it.

Describe the desired solution

We could enable the PVTIME device in Firecracker on ARM to allow for stolen time reporting.

Describe possible alternatives

  • check if rust-vmm has some off-the-shelf implementation we could use

Additional context

I found that this is the KVM patch that introduced it: https://lwn.net/Articles/797954/
And this is how QEMU integrated with it: https://patchew.org/QEMU/20200711101033.47371-1-drjones@redhat.com/20200711101033.47371-4-drjones@redhat.com/

It needs to be enabled via ioctl to the vcpu fd, there's an example in the kernel self tests: https://github.com/amazonlinux/linux/blob/kernel-5.10.215-203.850.amzn2/tools/testing/selftests/kvm/steal_time.c#L172

static void steal_time_init(struct kvm_vm *vm)
{
        struct kvm_device_attr dev = {
                .group = KVM_ARM_VCPU_PVTIME_CTRL,
                .attr = KVM_ARM_VCPU_PVTIME_IPA,
        };
        int i, ret;

        ret = _vcpu_ioctl(vm, 0, KVM_HAS_DEVICE_ATTR, &dev);
        if (ret != 0 && errno == ENXIO) {
                print_skip("steal-time not supported");
                exit(KSFT_SKIP);
        }

        for (i = 0; i < NR_VCPUS; ++i) {
                uint64_t st_ipa;

                vcpu_ioctl(vm, i, KVM_HAS_DEVICE_ATTR, &dev);

                dev.addr = (uint64_t)&st_ipa;

                /* ST_GPA_BASE is identity mapped */
                st_gva[i] = (void *)(ST_GPA_BASE + i * STEAL_TIME_SIZE);
                sync_global_to_guest(vm, st_gva[i]);

                st_ipa = (ulong)st_gva[i] | 1;
                ret = _vcpu_ioctl(vm, i, KVM_SET_DEVICE_ATTR, &dev);
                TEST_ASSERT(ret == -1 && errno == EINVAL, "Bad IPA didn't report EINVAL");

                st_ipa = (ulong)st_gva[i];
                vcpu_ioctl(vm, i, KVM_SET_DEVICE_ATTR, &dev);

                ret = _vcpu_ioctl(vm, i, KVM_SET_DEVICE_ATTR, &dev);
                TEST_ASSERT(ret == -1 && errno == EEXIST, "Set IPA twice without EEXIST");

        }
}

Checks

  • Have you searched the Firecracker Issues database for similar requests?
  • Have you read all the existing relevant Firecracker documentation?
  • Have you read and understood Firecracker's core tenets?
@Manciukic Manciukic added Good first issue Indicates a good issue for first-time contributors Priority: Medium Indicates than an issue or pull request should be resolved ahead of issues or pull requests labelled Type: Enhancement Indicates new feature requests rust Pull requests that update Rust code labels Oct 22, 2024
@roypat
Copy link
Contributor

roypat commented Oct 22, 2024

Describe possible alternatives

* check if `rust-vmm` has some off-the-shelf implementation we could use

we definitely have KVM_{GET,SET}_DEVICE_ATTR ioctls in kvm-ioctls :)

@kalyazin kalyazin added the Status: Parked Indicates that an issues or pull request will be revisited later label Oct 23, 2024
@aerosouund
Copy link

Hello @Manciukic

I'd like to take this up please
While i am aware that my current competencies don't fully match whats needed to solve this, however i am ramping up my knowledge of rust and linux to eventually solve it
And i can see that it is parked, I assume this means you don't need a very quick solution. If thats ok and no one else is working on it please assign it to me :)

@Manciukic
Copy link
Contributor Author

@aerosouund sure, we're not planning on taking this on anytime soon, but it'd be nice to finally land this feature!
One thing to note is that as this is an ARM-only feature, so it will require access to an ARM device, like a raspberry pi, for testing.

Giving you a few leads, this will probably need changes in builder.rs, probably in both build_microvm_for_boot and build_microvm_from_snapshot as I think we may need to keep the pvtime state across snapshot/resume, but we should probably check what happens on x86 and how it works there. We will need to allocate some system_memory from the ResourceAllocator for the struct kvm_steal_time and then call theset_device_attr ioctl on the vcpu fd (one for each vcpu).

@Manciukic
Copy link
Contributor Author

Hey @aerosouund, are you still planning to work on this issue?

@aerosouund
Copy link

Hello @Manciukic
I still do yes. But I haven't had the time lately

If i have been blocking this for longer than possible feel free to assign it to someone else.
But i do intend to go back to this as soon as i get the chance

@DakshinD
Copy link

Hi, we're a group of students at UT Austin looking to contribute to virtualization-related open-source projects for our class. We came across this issue and would like to work on it if that's okay @Manciukic @aerosouund. Before we would proceed, we have some questions about the implementation since this is our first exposure to the Firecracker codebase.

We are referencing the following resources to understand the issue: ARM PVTIME specs (DEN0057A), QEMU patch, and KVM implementation.


Feature Detection

  • To check for steal time support:
    • Should we add KVM_CAP_ARM_PVTIME to DEFAULT_CAPABILITIES in src/vmm/src/arch/aarch64/kvm.rs?
    • We noticed that combine_capabilities and check_capabilities in src/vmm/src/vstate/kvm.rs already handle capability checks. Would additional implementation be required to confirm if steal time is enabled?
    • The QEMU patch added a property (kvm-steal-time) to toggle steal time support. Would something similar be needed in Firecracker?

Registering Steal Time Structures with vCPUs

  • Where (what files) should we use the ResourceAllocator to allocate system memory for the shared region? Would this be done in Vcpu::new, or is there a more appropriate place?
    • we saw that in the QEMU impl, they have a virt_cpu_post_init method where they do this.
  • After allocating memory, our understanding is that we need to use KVM_SET_DEVICE_ATTR with the IPA from the ResourceAllocator for each vCPU.
  • Based on what we've seen, a single PVTIME structure of size 64 * vcpu_count bytes is allocated, with each vCPU getting its own 64-byte subregion. Is this how we should go about allocating and registering the memory and does this align with Firecracker's architecture?

Persisting PVTIME State

  • As suggested by @Manciukic, we understand that changes will be required in builder.rs, specifically in build_microvm_for_boot and build_microvm_from_snapshot, to persist state across snapshot/resume. Are there additional considerations we should keep in mind?

This is what we've determined from our initial research, but we'd appreciate any clarifications or guidance on implementing these parts of the codebase. Especially pointers towards what specific files/folders to look at.

Thank you!

@aerosouund
Copy link

@DakshinD
Happy to leave this to you!

@Manciukic
Copy link
Contributor Author

Replies inline.

Hi, we're a group of students at UT Austin looking to contribute to virtualization-related open-source projects for our class. We came across this issue and would like to work on it if that's okay @Manciukic @aerosouund. Before we would proceed, we have some questions about the implementation since this is our first exposure to the Firecracker codebase.

Sure, fine by me.

We are referencing the following resources to understand the issue: ARM PVTIME specs (DEN0057A), QEMU patch, and KVM implementation.

Feature Detection

* To check for steal time support:
  
  * Should we add `KVM_CAP_ARM_PVTIME` to `DEFAULT_CAPABILITIES` in `src/vmm/src/arch/aarch64/kvm.rs`?

I didn't find the capability you're mentioning. Could you point me to it?
From what I can see, it's possible to use one of the kvm_{has,get,set}_device_attr syscalls to check and configure the steal time device.

  * We noticed that `combine_capabilities` and `check_capabilities` in `src/vmm/src/vstate/kvm.rs` already handle capability checks. Would additional implementation be required to confirm if steal time is enabled?
  * The QEMU patch added a property (`kvm-steal-time`) to toggle steal time support. Would something similar be needed in Firecracker?

I'd just enable it by default for now. The only reason to make it optional is if we see any drawback in enabling it.

Registering Steal Time Structures with vCPUs

* Where (what files) should we use the `ResourceAllocator` to allocate system memory for the shared region? Would this be done in `Vcpu::new`, or is there a more appropriate place?
  
  * we saw that in the [QEMU impl](https://patchew.org/QEMU/20200711101033.47371-1-drjones@redhat.com/20200711101033.47371-4-drjones@redhat.com/), they have a `virt_cpu_post_init` method where they do this.

I didn't check too deeply about this, but my hope was to treat this as a device and initialize it separately from the vcpus. This would require defining a new virtual device for steal time, allocating a region of system memory for it, and then initializing it using KVM syscalls. An example of a device that allocates system memory is the vmgenid device. While an example that uses the kvm_set_device_attr is the GIC interrupt manager but that is configured inside arch_post_create_vcpus.

* After allocating memory, our understanding is that we need to use `KVM_SET_DEVICE_ATTR` with the IPA from the `ResourceAllocator` for each vCPU.

* Based on what we've seen, a single PVTIME structure of size `64 * vcpu_count` bytes is allocated, with each vCPU getting its own 64-byte subregion. Is this how we should go about allocating and registering the memory and does this align with Firecracker's architecture?

I'd try to fit it into the system memory we already allocate (2MiB for ARM), using the ResourceAllocator::allocate_system_memory method to get the address.

Persisting PVTIME State

* As suggested by [@Manciukic](https://github.com/Manciukic), we understand that changes will be required in `builder.rs`, specifically in `build_microvm_for_boot` and `build_microvm_from_snapshot`, to persist state across snapshot/resume. Are there additional considerations we should keep in mind?

I'm guessing the only state to be persisted is the gva of the IPAs.

This is what we've determined from our initial research, but we'd appreciate any clarifications or guidance on implementing these parts of the codebase. Especially pointers towards what specific files/folders to look at.

Thank you!

@DakshinD
Copy link

DakshinD commented Apr 1, 2025

Thanks you for the help @Manciukic! We've refined our implementation plan and would like to confirm a few details before coding if that's alright.

I didn't find the capability you're mentioning. Could you point me to it?

I apologize, I meant to say we can add KVM_CAP_STEAL_TIME to DEFAULT_CAPABILITIES.

From what I can see, it's possible to use one of the kvm_{has,get,set}_device_attr syscalls to check and configure the steal time device.

I referenced the GIC device to see how configuring would work. I see that it first uses kvm_create_device, then kvm_set_device_attr. I want to confirm that it won't be necessary to use kvm_create_device specifically, but on each vcpu, we would use kvm_set_device_attr to register the IPA of its steal time memory block?


I also just wanted to write a quick summary of our planned implementation beforehand to validate with you so that we are on the right direction.

1. Feature Detection

  • KVM_CAP_STEAL_TIME:
    We’ll add this to DEFAULT_CAPABILITIES in src/vmm/src/arch/aarch64/kvm.rs.

2. PVTIME Device Initialization

  • No KVM Device Creation:
    Thank you for pointing out the GIC implementation, it was helpful. We see that GICv2/3, uses kvm_create_device. But we believe this is not necessary for the specific PVTime device we would make, so instead we would:
    • Allocate memory via ResourceAllocator::allocate_system_memory.
    • Create a PvTime struct to manage per-vCPU IPAs:
      pub struct PvTime {
        /// Maps vCPU IDs to guest physical addresses (IPAs) of their 64-bit steal time regions.
        steal_time_regions: HashMap<usize, u64>,
      }
      
      impl PvTime {
          /// Creates a new PVTIME device by allocating system memory for all vCPUs.
          ///
          /// Steps:
          /// 1. Use `ResourceAllocator` to reserve a memory block of size `64 * vcpu_count` bits.
          /// 2. Divide the allocated region into 64-bit subregions, assigning each to a vCPU.
          /// 3. Store the per-vCPU IPAs in a map for later registration.
          pub fn create(
              resource_allocator: &mut ResourceAllocator,
              vcpu_count: usize,
          ) -> Result<Self, Error> { /* ... */ }
      
          /// Registers a vCPU with its pre-allocated steal time region.
          ///
          /// Steps:
          /// 1. Fetch the vCPU's IPA from the `steal_time_regions` map.
          /// 2. Use `KVM_SET_DEVICE_ATTR` to configure the vCPU's KVM state with this IPA.
          pub fn register_vcpu(
              &self,
              vcpu_id: usize,
              vcpu_fd: &VcpuFd,
          ) -> Result<(), Error> { /* ... */ }
      
          /// Restores the PVTIME device state during snapshot resume.
          ///
          /// Some sort of restore function here? will flesh out in later steps
          pub fn restore( ) -> Result<Self, Error> { /* ... */ }
      }

3. vCPU Configuration

  • Q: I believe doing the creation of PvTime and the per-vcpu registration can be done in arch_post_create_vcpus along with setting the GIC up?

4. Snapshot Persistence

  • Implementing the Persist interface for PvTime by saving some state (base IPA of system memory region, and # of vcpus), and restoring later.

We are looking to break this into manageable steps and are looking for your input as to what part to start with as well as how to test this. Are there any other parts we need to take into consideration? Any feedback would be really appreciated. Thanks again for your patience!

@Manciukic
Copy link
Contributor Author

I apologize, I meant to say we can add KVM_CAP_STEAL_TIME to DEFAULT_CAPABILITIES.

It looks like it's architecture-agnostic capability, it will tell you the hypervisor supports steal time, not that it supports the particular pv_time (aarch64). I think just checking kvm_device_has_attr as in the self-test should suffice.

I want to confirm that it won't be necessary to use kvm_create_device specifically, but on each vcpu, we would use kvm_set_device_attr to register the IPA of its steal time memory block?

That's correct, it needs to be called on the vcpu fds.

Q: I believe doing the creation of PvTime and the per-vcpu registration can be done in arch_post_create_vcpus along with setting the GIC up?

Maybe, but you may not have all you need from within that context (no vcpu fds, no resource allocator). You could also look into putting it directly in build_microvm_for_boot in a new setup_pv_time(resource_allocator, vcpu_fds) function.

We are looking to break this into manageable steps and are looking for your input as to what part to start with as well as how to test this. Are there any other parts we need to take into consideration?

Since it doesn't look like a big change, I'd start from getting a prototype working to identify the rough edges and then refine those to build the final implementation. You can probably start from adding a function in build_microvm_for_boot for setup_pv_time to create and initialize the pv_time structure. In the prototype it's not important to implement the correct feature detection or have snapshot support at first.
Once we have that as a draft PR, we can see if we like the proposed implementation, and improve or change that if needed, and then add a better detection/fallback and snapshot support. We will then need to add integration tests to verify everything works correctly (but you can manually test on your own ARM device for the prototype).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Good first issue Indicates a good issue for first-time contributors Priority: Medium Indicates than an issue or pull request should be resolved ahead of issues or pull requests labelled rust Pull requests that update Rust code Status: Parked Indicates that an issues or pull request will be revisited later Type: Enhancement Indicates new feature requests
Projects
None yet
Development

No branches or pull requests

5 participants