Commit fa12722

acpi: add documentation about VMGenID
Extend our current documentation for snapshotting and entropy recommendations with context about VMGenID. Mention the available VMGenID features depending on Linux version and also provide recommendations for entropy on VM clones based on VMGenID availability. Also, add CHANGELOG entry for VMGenID support.

Signed-off-by: Babis Chalios <[email protected]>
1 parent 650ea37 commit fa12722

File tree

3 files changed

+109
-22
lines changed

CHANGELOG.md

+9
Original file line number | Diff line number | Diff line change
@@ -17,6 +17,15 @@ and this project adheres to
1717
without MPTable support. Please see our
1818
[kernel policy documentation](docs/kernel-policy.md) for more information
1919
regarding relevant kernel configurations.
20+
- [#4487](https://github.com/firecracker-microvm/firecracker/pull/4487): Added
21+
support for the Virtual Machine Generation Identifier (VMGenID) device on
22+
x86_64 platforms. VMGenID is a virtual device that allows VMMs to notify
23+
guests when they are resumed from a snapshot. Linux includes VMGenID support
24+
since version 5.18. It uses notifications from the device to reseed its
25+
internal CSPRNG. Please refer to
26+
[snapshot support](docs/snapshotting/snapshot-support.md) and
27+
[random for clones](docs/snapshotting/random-for-clones.md) documentation for
28+
more info on VMGenID.
2029

2130
### Changed
2231

docs/snapshotting/random-for-clones.md

+60-13
Original file line number | Diff line number | Diff line change
@@ -22,17 +22,19 @@ which wraps the [`AWS-LC` cryptographic library][9].
2222

2323
Traditionally, `/dev/random` has been considered a source of “true” randomness,
2424
with the downside that reads block when the pool of entropy gets depleted. On
25-
the other hand, `/dev/urandom` doesn’t block, but provides lower quality
26-
results. It turns out the distinction in output quality is actually very hard to
27-
make. According to [this article][2], for kernel versions prior to 4.8, both
28-
devices draw their output from the same pool, with the exception that
29-
`/dev/random` will block when the system estimates the entropy count has
30-
decreased below a certain threshold. The `/dev/urandom` output is considered
31-
secure for virtually all purposes, with the caveat that using it before the
32-
system gathers sufficient entropy for initialization may indeed produce low
33-
quality random numbers. The `getrandom` syscall helps with this situation; it
34-
uses the `/dev/urandom` source by default, but will block until it gets properly
35-
initialized (the behavior can be altered via configuration flags).
25+
the other hand, `/dev/urandom` doesn’t block, which has led people to believe that it
26+
provides lower quality results.
27+
28+
It turns out the distinction in output quality is actually very hard to make.
29+
According to [this article][2], for kernel versions prior to 4.8, both devices
30+
draw their output from the same pool, with the exception that `/dev/random` will
31+
block when the system estimates the entropy count has decreased below a certain
32+
threshold. The `/dev/urandom` output is considered secure for virtually all
33+
purposes, with the caveat that using it before the system gathers sufficient
34+
entropy for initialization may indeed produce low quality random numbers. The
35+
`getrandom` syscall helps with this situation; it uses the `/dev/urandom` source
36+
by default, but will block until it gets properly initialized (the behavior can
37+
be altered via configuration flags).
3638

3739
Newer kernels (4.8+) have switched to an implementation where `/dev/random`
3840
output comes from a pool called the blocking pool, the output of `/dev/urandom`
@@ -41,6 +43,8 @@ and there’s also an input pool which gathers entropy from various sources
4143
available on the system, and is used to feed into or seed the other two
4244
components. A very detailed description is available [here][3].
4345

46+
### Linux kernels from 4.8 up to and including 5.17
47+
4448
The details of this newer implementation are used to make the recommendations
4549
present in the document. There are in-kernel interfaces used to obtain random
4650
numbers as well, but they are similar to using `/dev/urandom` (or `getrandom`
@@ -99,6 +103,42 @@ not increase the current entropy estimation. There is also an `ioctl` interface
99103
which, given the appropriate privileges, can be used to add data to the input
100104
entropy pool while also increasing the count, or completely empty all pools.
101105

106+
### Linux kernels from 5.18 onwards
107+
108+
Since version 5.18, Linux has support for the
109+
[Virtual Machine Generation Identifier](https://learn.microsoft.com/en-us/windows/win32/hyperv_v2/virtual-machine-generation-identifier).
110+
The purpose of VMGenID is to notify the guest about time shift events, such as
111+
resuming from a snapshot. The device exposes a 16-byte cryptographically random
112+
identifier in guest memory. Firecracker implements VMGenID. When resuming a
113+
microVM from a snapshot, Firecracker writes a new identifier and injects a
114+
notification to the guest. Linux
115+
[uses this value](https://elixir.bootlin.com/linux/v5.18.19/source/drivers/virt/vmgenid.c#L77)
116+
[as new randomness for its CSPRNG](https://elixir.bootlin.com/linux/v5.18.19/source/drivers/char/random.c#L908).
117+
Quoting the random.c implementation of the kernel:
118+
119+
```
120+
/*
121+
* Handle a new unique VM ID, which is unique, not secret, so we
122+
* don't credit it, but we do immediately force a reseed after so
123+
* that it's used by the crng posthaste.
124+
*/
125+
```
126+
127+
As a result, values returned by `getrandom()` and `/dev/(u)random` are distinct
128+
in all VMs started from the same snapshot, **after** the kernel handles the
129+
VMGenID notification. This leaves a race window between resuming vCPUs and Linux
130+
CSPRNG getting successfully re-seeded. In Linux 6.8, we
131+
[extended VMGenID](https://lore.kernel.org/lkml/[email protected]/)
132+
to emit a uevent to user space when it handles the notification. User space can
133+
poll this uevent to know when it is safe to use `getrandom()` et al., avoiding
134+
the race condition.
135+
136+
Please note that Firecracker will always enable VMGenID. In kernels earlier
137+
than 5.18, where there is no VMGenID driver, the device will not have any effect
138+
in the guest.
139+
140+
### User space considerations
141+
102142
Init systems (such as `systemd` used by AL2 and other distros) might save a
103143
random seed file after boot. For `systemd`, the path is
104144
`/var/lib/systemd/random-seed`. Just to be on the safe side, any such file
@@ -121,8 +161,8 @@ alter the read result via bind mounting another file on top of
121161
and should be sufficient for most cases.
122162
- Use `virtio-rng`. When present, the guest kernel uses the device as an
123163
additional source of entropy.
124-
- To be as safe as possible, the direct approach is to do the following (before
125-
customer code is resumed in the clone):
164+
- On kernels before 5.18, to be as safe as possible, the direct approach is to
165+
do the following (before customer code is resumed in the clone):
126166
1. Open one of the special device files (either `/dev/random` or
127167
`/dev/urandom`). Take note that `RNDCLEARPOOL` no longer
128168
[has any effect][7] on the entropy pool.
@@ -133,6 +173,13 @@ alter the read result via bind mounting another file on top of
133173
1. Issue a `RNDRESEEDCRNG` ioctl call ([4.14][5], [5.10][6], (requires
134174
`CAP_SYS_ADMIN`)) that specifically causes the `CSPRNG` to be reseeded from
135175
the input pool.
176+
- On kernels from 5.18 onwards, the CSPRNG will be automatically
177+
reseeded when the guest kernel handles the VMGenID notification. To completely
178+
avoid the race condition, users should follow the same steps as with kernels
179+
\< 5.18.
180+
- On kernels starting from 6.8, users can poll for the VMGenID uevent that the
181+
driver sends when the CSPRNG is reseeded after handling the VMGenID
182+
notification.
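
The uevent approach in the last item can be sketched roughly as follows. This is a minimal illustration, not part of Firecracker: it assumes the VMGenID driver tags its uevent with the `NEW_VMGENID=1` variable (check `drivers/virt/vmgenid.c` for the guest kernel in use), and the function names `is_vmgenid_uevent` and `wait_for_vmgenid` are hypothetical.

```python
import socket

# Value of NETLINK_KOBJECT_UEVENT in <linux/netlink.h>.
NETLINK_KOBJECT_UEVENT = 15
# Multicast group 1 carries the uevents broadcast by the kernel.
KERNEL_UEVENT_GROUP = 1
# Assumption: variable the VMGenID driver attaches to its uevent.
VMGENID_MARKER = b"NEW_VMGENID=1"


def is_vmgenid_uevent(message: bytes) -> bool:
    """Return True if a raw kernel uevent carries the VMGenID marker."""
    # Kernel uevents are NUL-separated: a header, then KEY=VALUE fields.
    return VMGENID_MARKER in message.split(b"\0")


def wait_for_vmgenid(timeout=None) -> bool:
    """Block until a VMGenID uevent is seen.

    Returns False if no uevent at all arrives within `timeout` seconds.
    """
    with socket.socket(
        socket.AF_NETLINK, socket.SOCK_DGRAM, NETLINK_KOBJECT_UEVENT
    ) as sock:
        sock.bind((0, KERNEL_UEVENT_GROUP))
        sock.settimeout(timeout)
        try:
            while True:
                if is_vmgenid_uevent(sock.recv(8192)):
                    return True
        except socket.timeout:
            return False
```

A listener of this kind presumably needs to be running before the snapshot is taken, so that the socket already exists when the clone resumes; a process that starts listening only after resume could miss the event.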
136183

137184
**Annex 1 contains the source code of a C program which implements the previous
138185
three steps.** As soon as the guest kernel version switches to 4.19 (or higher),

docs/snapshotting/snapshot-support.md

+40-9
Original file line number | Diff line number | Diff line change
@@ -146,6 +146,10 @@ The snapshot functionality is still in developer preview due to the following:
146146
- If a [CPU template](../cpu_templates/cpu-templates.md) is not used on x86_64,
147147
overwrites of `MSR_IA32_TSX_CTRL` MSR value will not be preserved after
148148
restoring from a snapshot.
149+
- Resuming from a snapshot that was taken during early stages of the guest
150+
kernel boot might lead to crashes upon snapshot resume. We suggest that users
151+
take snapshots after the guest microVM kernel has booted. Please see
152+
[VMGenID device limitation](#vmgenid-device-limitation).
149153

150154
## Firecracker Snapshotting characteristics
151155

@@ -571,15 +575,32 @@ we also consider microVM A insecure if it resumes execution.
571575

572576
### Reusing snapshotted states securely
573577

574-
We are currently working to add a functionality that will notify guest operating
575-
systems of the snapshot event in order to enable secure reuse of snapshotted
576-
microVM states, guest operating systems, language runtimes, and cryptographic
577-
libraries. In some cases, user applications will need to handle the snapshot
578-
create/restore events in such a way that the uniqueness and randomness
579-
properties are preserved and guaranteed before resuming the workload.
580-
581-
We've started a discussion on how the Linux operating system might securely
582-
handle being snapshotted [here](https://lkml.org/lkml/2020/10/16/629).
578+
[Virtual Machine Generation Identifier](https://learn.microsoft.com/en-us/windows/win32/hyperv_v2/virtual-machine-generation-identifier)
579+
(VMGenID) is a virtual device that allows VM guests to detect when they have
580+
resumed from a snapshot. It works by exposing a cryptographically random
581+
16-byte identifier to the guest. The VMM ensures that the value of the
582+
identifier changes every time a time shift happens in the lifecycle of
583+
the VM, e.g. when it resumes from a snapshot.
584+
585+
Linux supports VMGenID since version 5.18. When Linux detects a change in the
586+
identifier, it uses its value to reseed its internal PRNG. Moreover,
587+
[since version 6.8](https://lkml.org/lkml/2023/5/31/414) the Linux VMGenID driver
588+
also emits a uevent to user space. User space processes can monitor this uevent
589+
to detect snapshot resume events.
590+
591+
Firecracker supports the VMGenID device on x86 platforms. Firecracker will always
592+
enable the device. During snapshot resume, Firecracker will update the 16-byte
593+
generation ID and inject a notification in the guest before resuming its vCPUs.
594+
595+
As a result, guests that run Linux versions >= 5.18 will re-seed their in-kernel PRNG
596+
upon snapshot resume. User space applications can rely on the guest kernel for
597+
randomness. State other than the guest kernel entropy pool, such as unique
598+
identifiers, cached random numbers, cryptographic tokens, etc. **will** still be
599+
replicated across multiple microVMs resumed from the same snapshot. Users need
600+
to implement mechanisms for ensuring de-duplication of such state, where needed.
601+
On guests that run Linux versions >= 6.8, users can make use of the uevent that
602+
the VMGenID driver emits upon resuming from a snapshot, to be notified about
603+
snapshot resume events.
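
One possible shape for such a de-duplication mechanism is sketched below. It is only an illustration under stated assumptions: the class `PerCloneState` and its method names are hypothetical, and the application is assumed to call `on_snapshot_resume()` once it learns (for example via the VMGenID uevent) that the guest kernel has handled the generation change.

```python
import os
import threading


class PerCloneState:
    """Holds values that must not be shared between clones of one snapshot."""

    def __init__(self):
        self._lock = threading.Lock()
        self._instance_id = os.urandom(16)

    @property
    def instance_id(self) -> bytes:
        """Return the identifier currently cached by this process."""
        with self._lock:
            return self._instance_id

    def on_snapshot_resume(self) -> None:
        """Discard the cached identifier and draw a fresh one.

        Meant to run after the guest kernel has reseeded its CSPRNG, so the
        new bytes differ between clones of the same snapshot.
        """
        with self._lock:
            self._instance_id = os.urandom(16)
```

The same pattern would apply to other cached state, such as session tokens or nonces: keep it behind one object and regenerate it in a single resume hook.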
583604

584605
## Vsock device limitation
585606

@@ -605,6 +626,16 @@ section 5.10.6.6 Device Events.
605626
Firecracker handles sending the `reset` event to the vsock driver, thus the
606627
customers are no longer responsible for closing active connections.
607628

629+
## VMGenID device limitation
630+
631+
During snapshot resume, Firecracker updates the 16-byte generation ID of the
632+
VMGenID device and injects an interrupt in the guest before resuming vCPUs. If
633+
the snapshot was taken at the very early stages of the guest kernel boot process,
634+
proper interrupt handling might not be in place yet. As a result, the kernel
635+
might not be able to handle the injected notification and might crash. We
636+
suggest that users take snapshots only after the guest kernel has completed
637+
booting, to avoid this issue.
638+
608639
## Snapshot compatibility across kernel versions
609640

610641
We have a mechanism in place to experiment with snapshot compatibility across
