Pseudorandom numbers are sometimes the same across processes #1816
Comments
I am extremely impressed by this analysis! It seems more exhaustive than I could ever have imagined, and it feels like I learned a lot that I can only hope to recall when applicable. As for this particular problem, focusing on reliably-enough solving (2) seems easiest, but when attempting it I'd hope that … To me it boils down to calling … Maybe this could be combined with a way to obtain handles to uniquely opened files (or symlinks) in …

---

It seems to me that seeding more robustly would solve both. Since adding … Is arranging to call …?

---

Since all global RNG calls are automatically done per thread, I'd think that arranging one call per thread to … So probably it will be easier overall to maintain a thread-local …

---

I've just seen this bug on macOS on a feature branch that doesn't change anything related to it. I think this can be taken as confirmation that it is not specific to Windows (which was already fairly clear, but not previously confirmed).
Current behavior 😯
Background
Fast pseudorandom numbers are used in a few places where they probably need not be cryptographically secure, but where the generated bytes are intended to be different across any kind of concurrent tasks, including across separate processes. Existing uses in gitoxide--but not necessarily all of their callers--are as follows:
1. `gix_utils::backoff::Quadratic::default_with_random` supplies a series of increasing timings that are adjusted with pseudorandom jitter, so concurrent tasks don't all retry at the same time. Its `randomize` helper computes the jitter (gitoxide/gix-utils/src/backoff.rs, line 4 in ffb73b5). `default_with_random` is public, so there may be non-gitoxide crates that use it directly, probably in the same kind of scenarios it is used here. Here, it is used to avoid some cases of the thundering herd problem: when accessing a network, at least in `gix-testtools`, as well as in `gix-lock`, which runs in production. (Per gitpython-developers/gitdb#115 (comment) and #1815, this was recently renamed from `Exponential` to `Quadratic`. But that was just a name change, with no effect on this issue.) A sketch of this kind of jittered backoff follows the list.

2. `gix_fs::capabilities::Capabilities::probe` includes a check for the ability to create symlinks in the `git_dir`, done by its `probe_symlink` helper with a filename produced by the expression at gitoxide/gix-fs/src/capabilities.rs, lines 106 to 107 in ffb73b5. As discussed in #1789, a collision in the filename, even across processes, causes the probe to wrongly report that symlinks are not supported.

3. In addition to its use of backoff with pseudorandom jitter mentioned above, `gix_testtools::spawn_git_daemon` also shuffles the order of ports to try listening on (gitoxide/tests/tools/src/lib.rs, lines 206 to 208 in ffb73b5). Unlike (1) and (2), I think it is not really a problem here if the same orders are sometimes used. But performance is slightly better if that is avoided.
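To make case (1) concrete, here is a minimal sketch of jittered quadratic backoff in the spirit of `default_with_random`. It is an illustration only, not the actual `gix-utils` code; the 750..=1250 permille jitter range and the `fastrand` call are assumptions.

```rust
use std::time::Duration;

// Illustration only: quadratic backoff with multiplicative jitter. If two
// processes happen to share a PRNG seed, they draw the same jitter values
// and still retry in lockstep, which is exactly the problem described here.
fn jittered_quadratic_delays(base: Duration, attempts: u32) -> Vec<Duration> {
    (1..=attempts)
        .map(|attempt| {
            let quadratic = base * (attempt * attempt);
            // Scale by a pseudorandom factor between 75% and 125%.
            let permille = fastrand::u32(750..=1250);
            quadratic * permille / 1000
        })
        .collect()
}
```

With a per-process (or per-thread) seed that actually differs, the jitter de-synchronizes retries as intended.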
The bug
As currently implemented, we occasionally generate the same sequence of pseudorandom numbers across concurrently running processes. This was observed in #1789 to affect (2) when running parallel tests, but in addition to affecting the symlink capability probe, it should be expected occasionally to affect cases (1) and (3). Case (1) is overall the most important.
See "The cause" subsection below on why this happens, and the "Steps to reproduce" section or #1789 for details of the recently observed failure due to (2).
Although I was only able to cite one occurrence of (2), I vaguely recalled having seen similar errors. Since the test case that failed only started asserting that the probe returned `true` a couple of months ago, occasional failures may have been different or absent in the past. For these reasons, I estimate that (2) has occurred on CI about once a month for an extended time, but this estimate is extremely low confidence.

In contrast to (2), I would not expect our existing automated tests ever to surface (1) or (3):

- (1) … (`hyperfine` was able to reproduce it). I would expect this issue to be encountered less often than that.
- (3) … `gix-testtools` (or in non-test code, in the rare case that another project uses `gix-testtools` for something other than tests). But it is hard to observe because it doesn't really cause a problem: if and when the collision occurs, the effect is at most a small slowdown.

The cause
This happens because we use `fastrand` with the per-thread generators it provides, which use seeds produced on most platforms by hashing together monotonic clock time and thread ID, but nothing that inherently varies across concurrently running processes.
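Paraphrased, that per-thread seeding amounts to roughly the following (an illustration, not `fastrand`'s exact code):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::time::Instant;

// Hash the monotonic clock and the current thread's ID into a 64-bit seed.
// Neither input is guaranteed to differ between two processes started at
// nearly the same moment, which is how identical sequences can arise.
fn per_thread_seed() -> u64 {
    let mut hasher = DefaultHasher::new();
    Instant::now().hash(&mut hasher);
    std::thread::current().id().hash(&mut hasher);
    hasher.finish()
}
```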
Thread IDs, as returned by `std::thread::current().id()`, are often small numbers like 1. They are frequently the same between threads of separate, concurrently running processes.

Expected behavior 🤔
To avoid even rare cases of the thundering herd problem when used to generate jitter for backoff--as well as to avoid other glitches such as collisions when probing for the ability to create symlinks--even pseudorandom numbers that do not need to be cryptographically secure should be generated in a way that does not depend on state that is likely to be shared across concurrently running pseudorandom generations.
Possible solutions
That leaves open a wide range of possible solutions. Assuming everything that needs cryptographically secure random numbers is already implemented in a crate that uses CSPRNGs rather than `fastrand`--which seems to be the case--I think the `fastrand` crate could still be used, unless something else turns out to be more convenient.

I don't know if we could continue using the per-thread PRNGs `fastrand` arranges, though. A `fastrand::Rng` can be reseeded, including an `Rng` that `fastrand` provides for a thread. But this would have to be done on each thread whose generator is used. Furthermore, since there could be many threads and higher-entropy sources are probably always at least a little slower than the above approach, each thread's PRNG should not be seeded until it is first used.

At first glance, it feels like it would be enough to do even slightly better with the seed. However, I think this actually requires as unpredictable a seed--or rather, 8 bytes of seed material per thread, to continue using shared thread-local PRNGs--as can practically be obtained without heavy delays.
Why not just mix in the PID?
Suppose we do something like `fastrand` does for its per-thread PRNGs, but also mix in the current process ID. That seems like the simple and obvious thing to do, given that the observed problem is that processes are not distinguished. PIDs are reused, but not concurrently. A sketch of this variant appears at the end of this subsection.

I think this is insufficient for two reasons: the PID is not available on every target, and PIDs can coincide across PID namespaces, so processes in separate containers may still collide.

The second reason is more important because it suggests that seeding with monotonic clock time, thread ID, and PID would not be desirable even if a backup strategy were used for cases where the PID is unavailable.
Another way to describe the second reason is that just mixing in the PID would turn a collision that happens between separate processes into a collision that (still) happens across separate containers. This might, in practice, fix (2), if the containers don't write to the same volume. But it would not be a complete fix for (1).
However, this would be a significant mitigation even in (1). So if all other approaches turn out to involve dependencies that need to be avoided in some cases, then a new default feature could be added for the primary solution, and mixing in the PID might then be considered as a partial fallback solution when that feature is disabled.
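For reference, the PID-mixing variant discussed above would look roughly like this (illustration only):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::time::Instant;

// Same hashing scheme as before, with std::process::id() mixed in. Processes
// in different PID namespaces (e.g. separate containers) can still share all
// three inputs, so collisions remain possible across containers.
fn seed_with_pid() -> u64 {
    let mut hasher = DefaultHasher::new();
    Instant::now().hash(&mut hasher);
    std::thread::current().id().hash(&mut hasher);
    std::process::id().hash(&mut hasher);
    hasher.finish()
}
```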
We can use `getrandom` for the seed

The `getrandom` library obtains random data from the operating system in a way that is intended to be as robust and unpredictable as possible, typically using the best source of randomness that is available in each environment.

There are a number of tradeoffs between robustness and speed. That library navigates them. For example, on some Unix-like systems, it does a blocking read on `/dev/random` to avoid returning data that are insufficiently random because the system entropy pool is not yet adequately seeded, but then uses `/dev/urandom` to avoid further blocking that could be excessive.

For some platforms, it requires explicit build configuration to work. However, that is rare. I think this only applies to WASM targets with `unknown-unknown`. I think the only conditions under which this happens are those where `fastrand` is already using a fixed seed.

To be clear, I am not advocating generating our random numbers with `getrandom`, but only the seed. Unless a reason is found for the uses listed above to use cryptographically secure random numbers, the algorithm being seeded could continue to be a cryptographically insecure but extremely fast algorithm. `fastrand` could continue to be used. If its interface is less convenient when the seeds are computed separately, then some other libraries could be considered (and there are some other Rust libraries that use Wyrand, the algorithm `fastrand` uses).
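A minimal sketch of how this could fit together, assuming the `getrandom` 0.2 API (`getrandom::getrandom`) and the lazily seeded thread-local generator discussed under "Possible solutions"; the helper names are made up:

```rust
use std::cell::RefCell;

thread_local! {
    // Seeded lazily, on first use per thread, so threads that never ask for
    // random numbers never pay for the OS call.
    static SEEDED: RefCell<fastrand::Rng> = RefCell::new({
        let mut seed = [0u8; 8];
        // getrandom asks the OS for seed material; failures are rare and
        // mostly limited to unusual targets, so expect() is used for brevity.
        getrandom::getrandom(&mut seed).expect("no OS entropy source available");
        fastrand::Rng::with_seed(u64::from_ne_bytes(seed))
    });
}

/// Example accessor mirroring the kind of helper gitoxide could expose.
fn jitter_permille() -> u32 {
    SEEDED.with(|rng| rng.borrow_mut().u32(750..=1250))
}
```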
Can we directly read `/dev/urandom` instead on Unix-like systems?

Since we don't need cryptographically secure random numbers, it seems like the `getrandom` crate dependency could be avoided on Unix-like systems by reading seed bytes directly from `/dev/urandom` instead, and that this would on rare occasion even be slightly faster. It seems like that would be okay even if the system entropy pool is not, and was never, adequately seeded.

My guess is that this would be okay, but I am not sure. Besides (or sometimes together with) containers, one way someone might run the same or related programs many times concurrently is when deploying a task to many virtual machines. Normally that would be fine, but it may be that it is done early on after booting the virtual machines, which might be in use only for that task. This is relevant because, as the `getrandom` readme notes, `/dev/urandom` can return predictable data early in the boot process, before the kernel's entropy pool has been initialized.

So I worry that this approach would fix the problem across processes and containers but not fix it across virtual machines, and that this could actually in rare cases cause problems in practice. I think this is not a problem for our use case, because …
But I am not sure. This is assuming we have non-adversarial unpredictability even when we don't have adversarial unpredictability. That assumption may be wrong. Since implementing more strategies is more complicated than implementing fewer strategies, and it will be either hard or infeasible to test in the test suite, it should probably only be done if we are confident it will work properly. I am not confident.
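For reference, the direct-read approach under discussion would be roughly the following Unix-only sketch (the function name is made up):

```rust
use std::fs::File;
use std::io::Read;

// Read 8 bytes of seed material straight from /dev/urandom, bypassing the
// getrandom crate. On Linux this never blocks, which is the appeal, but it
// can return poorly mixed data very early in boot, which is the concern above.
#[cfg(unix)]
fn seed_from_dev_urandom() -> std::io::Result<u64> {
    let mut buf = [0u8; 8];
    File::open("/dev/urandom")?.read_exact(&mut buf)?;
    Ok(u64::from_ne_bytes(buf))
}
```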
Why we shouldn't just mix in some weird thing that happens to vary
We can't rely on things that happen to change, but are not required to change, being different across processes, and such oddities vary by target. Even attempting this would be complicated and would probably involve more code than it would take to reimplement substantial fragments of libraries.

(Secondarily, I would worry about using anything that is not already routinely exposed and that might be in use to seed ASLR, or that might reveal its seed data. This includes the numerical addresses of objects, as the running process sees them, but it may include other information that varies across separate processes in a system-dependent way.)
Is it a bug in `fastrand` that its seeds clash across processes?

Edit: Clarified the title of this subsection.

I don't know if this would be considered a bug in how `fastrand` seeds its per-thread PRNGs by default. But my guess is that this may not be considered a bug, because while that functionality does depend on the `std` feature, it is intentionally pretty minimalistic, whereas even adding in the process ID seems like it would be more extensive and not always available. I'm not sure, though.

Git behavior
The `csprng_bytes` and `git_rand` functions

Outside of operations performed with CSPRNGs in cryptographic libraries, and outside of the test suite, Git uses pseudorandom numbers in a few ways. It obtains them, primarily from cryptographic libraries, through its `csprng_bytes` function and its associated `git_rand` convenience function. `git_rand` makes a 4-byte integer from data from `csprng_bytes`. Both functions are declared in `wrapper.h` and defined in `wrapper.c`. In both functions, the parameter is not a seed, but an option controlling how robust the output is required to be.

When Git probes for the ability to create symlinks, it does not use those facilities.
Quality of pseudorandom numbers
As its name implies, `csprng_bytes` is primarily a cryptographically secure pseudorandom number generator. It uses several strategies. For at least one of those strategies, it is possible to fall back to less secure output when secure output is not available.

Whether this will be done depends on whether the `CSPRNG_BYTES_INSECURE` flag is passed. Currently it is the only recognized flag, so calls to `csprng_bytes`--or `git_rand`, which passes the flag along--pass either `0` or `CSPRNG_BYTES_INSECURE`.

Even when `CSPRNG_BYTES_INSECURE` is passed, such as when choosing delays in `reftable/stack.c`, I believe the pseudorandom data are well distributed and unlikely to cause a thundering herd condition or other clash-related problems.

Separately from that flag, if `git` is compiled without any of the cryptographic libraries it prefers to use for generating pseudorandom numbers, it will attempt to read from `/dev/urandom`. If this were somehow to happen during early boot, then the data could be predictable on some systems, but I think it would not be the same across runs, as discussed above in "Expected behavior."

The `RtlGenRandom` case (`HAVE_RTLGENRANDOM` in `csprng_bytes`) applies to Windows. To the best of my knowledge, it is always available to `git` on Windows, but I have not investigated this.

The symlink probe
This is for comparison to case (2) presented above in the "Current behavior" section.
Git probes to see if it is able to create a symbolic link by creating a regular file in a collision-resistant way, then deleting it and attempting to make a symlink of the same name.
That uses the `xmkstemp` function, defined in `wrapper.c`. `xmkstemp` delegates to `mkstemp` (which is usually part of libc). `xmkstemp` checks for failure and bails out with a useful error message, but it does not customize how the temporary file is created, nor how its name is chosen.

An `mkstemp` implementation usually uses a PRNG, though sometimes not a very high quality one. However, it also performs an operation that is only expected to fail due to a collision, and it retries a number of times with different names when there is a collision.
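To make that retry-on-collision pattern concrete, here is a rough Rust analogue (an illustration, neither Git's nor gitoxide's code; `create_unique` is a made-up name):

```rust
use std::fs::OpenOptions;
use std::io::ErrorKind;
use std::path::{Path, PathBuf};

// Creation with O_CREAT|O_EXCL (create_new) only fails on a name collision or
// a real error, so a clash is detected and simply retried with a fresh name.
fn create_unique(dir: &Path) -> std::io::Result<PathBuf> {
    for _ in 0..100 {
        let candidate = dir.join(format!("probe-{:016x}", fastrand::u64(..)));
        match OpenOptions::new().write(true).create_new(true).open(&candidate) {
            Ok(_) => return Ok(candidate),
            Err(e) if e.kind() == ErrorKind::AlreadyExists => continue, // collision: retry
            Err(e) => return Err(e),
        }
    }
    Err(std::io::Error::new(ErrorKind::AlreadyExists, "too many collisions"))
}
```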
Some other `mkstemp`-related functions that do not delegate to the system `mkstemp` function are defined in `wrapper.c`, in both the upstream Git repository and in Git for Windows. In particular, `git_mkstemp_mode` and `git_mkstemps_mode` accept a mode with which to create the temporary file. These functions are implemented fully in `wrapper.c`. They use `csprng_bytes` to fill in the `X` placeholders in their `pattern` arguments.

I believe it is only on Windows that the `mkstemp` implementation is part of Git itself, where it is defined in `mingw.c`. It delegates to `git_mkstemp_mode`, passing a mode of 0600.

Steps to reproduce 🕹
This is hard to reproduce intentionally, and to my knowledge only (2) has been observed so far. This is as reported in #1789. In that issue, I had thought this was a Windows-specific problem, but I no longer expect that to be so. For completeness, the failing CI output is shown there.
That assertion was introduced in d715e4a (#1652), at gitoxide/gix-worktree-state/tests/state/checkout.rs, line 182 in ffb73b5.
So if the problem occurred before that when running the test suite, on CI or otherwise, it would either have caused a different (possibly less obvious) error, or not caused an error in the tests.
The most important case is (1) and it might be good to try to reproduce it intentionally. Maybe an approach with `hyperfine` could attempt this. I am not sure.