Copying lots of paths fails with "too many root sets" #7359

Closed
lheckemann opened this issue Nov 28, 2022 · 19 comments · Fixed by #12296

@lheckemann
Member

lheckemann commented Nov 28, 2022

Describe the bug

$ nix copy ./*.drv --to ssh-ng://[email protected] --derivation
warning: error: SQLite database '/nix/var/nix/db/db.sqlite' is busy
[1147 copied (17.0 MiB), 5.2 MiB DL] copying 44430 pathsToo many root sets
Aborted (core dumped)

[linus@geruest:~/nixpkgs/master/to-build]$ terminate called after throwing an instance of 'nix::EndOfFile'
  what():  error: unexpected end-of-file

Steps To Reproduce

  1. Get lots of drvs (546, all the NixOS test drvs for a nixpkgs checkout in my case)
  2. Try copying them to a remote machine using nix copy --to ssh-ng://root@$host ./*.drv or similar

Expected behavior

Successful copy, or useful error message

nix-env --version output

nix-env (Nix) 2.12.0pre20221116_561440b

@lheckemann lheckemann added the bug label Nov 28, 2022
@rickynils
Member

I ran into this issue also. Works in Nix 2.9.1 and 2.10.3 but not in 2.11.1 and 2.12.0.

@abbec

abbec commented Feb 1, 2023

This is the callstack of the error:

#0  0x00007f341028abc7 in __pthread_kill_implementation () from /nix/store/9xfad3b5z4y00mzmk2wnn4900q0qmxns-glibc-2.35-224/lib/libc.so.6
#1  0x00007f341023db46 in raise () from /nix/store/9xfad3b5z4y00mzmk2wnn4900q0qmxns-glibc-2.35-224/lib/libc.so.6
#2  0x00007f34102284b5 in abort () from /nix/store/9xfad3b5z4y00mzmk2wnn4900q0qmxns-glibc-2.35-224/lib/libc.so.6
#3  0x00007f34114e0da4 in GC_add_roots_inner () from /nix/store/n68j305pcfac37770hcz09iwz36xbbqf-boehm-gc-8.2.2/lib/libgc.so.1
#4  0x00007f34114f355e in GC_add_roots () from /nix/store/n68j305pcfac37770hcz09iwz36xbbqf-boehm-gc-8.2.2/lib/libgc.so.1
#5  0x00007f34112f3687 in nix::BoehmGCStackAllocator::allocate() () from /nix/store/ksb0p7wj3l5i6m8g7yhzn0593z9x3910-nix-2.14.0pre20230131_dirty/lib/libnixexpr.so
#6  0x00007f3410b95c65 in nix::sinkToSource(std::function<void (nix::Sink&)>, std::function<void ()>)::SinkToSource::read(char*, unsigned long) () from /nix/store/ksb0p7wj3l5i6m8g7yhzn0593z9x3910-nix-2.14.0pre20230131_dirty/lib/libnixutil.so
#7  0x00007f3410b95179 in nix::Source::drainInto(nix::Sink&) () from /nix/store/ksb0p7wj3l5i6m8g7yhzn0593z9x3910-nix-2.14.0pre20230131_dirty/lib/libnixutil.so
#8  0x00007f3410f0751c in std::_Function_handler<void (nix::Sink&), nix::RemoteStore::addMultipleToStore(std::vector<std::pair<nix::ValidPathInfo, std::unique_ptr<nix::Source, std::default_delete<nix::Source> > >, std::allocator<std::pair<nix::ValidPathInfo, std::unique_ptr<nix::Source, std::default_delete<nix::Source> > > > >&, nix::Activity&, nix::RepairFlag, nix::CheckSigsFlag)::{lambda(nix::Sink&)#1}>::_M_invoke(std::_Any_data const&, nix::Sink&) () from /nix/store/ksb0p7wj3l5i6m8g7yhzn0593z9x3910-nix-2.14.0pre20230131_dirty/lib/libnixstore.so
#9  0x00007f3410b96224 in void boost::context::detail::fiber_entry<boost::context::detail::fiber_record<boost::context::fiber, nix::VirtualStackAllocator, boost::coroutines2::detail::pull_coroutine<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >::control_block::control_block<nix::VirtualStackAllocator, nix::sinkToSource(std::function<void (nix::Sink&)>, std::function<void ()>)::SinkToSource::read(char*, unsigned long)::{lambda(boost::coroutines2::detail::push_coroutine<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >&)#1}>(boost::context::preallocated, nix::VirtualStackAllocator&&, nix::sinkToSource(std::function<void (nix::Sink&)>, std::function<void ()>)::SinkToSource::read(char*, unsigned long)::{lambda(boost::coroutines2::detail::push_coroutine<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >&)#1}&&)::{lambda(boost::context::fiber&&)#1}> >(boost::context::detail::transfer_t) () from /nix/store/ksb0p7wj3l5i6m8g7yhzn0593z9x3910-nix-2.14.0pre20230131_dirty/lib/libnixutil.so
#10 0x00007f341108618f in make_fcontext () from /nix/store/ksb0p7wj3l5i6m8g7yhzn0593z9x3910-nix-2.14.0pre20230131_dirty/lib/libboost_context.so.1.79.0
#11 0x0000000000000000 in ?? ()

Seems to have been introduced here: #6612

@abbec

abbec commented Feb 3, 2023

@thufschmitt It seems that doing the work of draining all paths here inside the same lambda is a bit too much for the garbage collector. Not sure what the correct fix is though, batching?
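
One user-side reading of "batching" is to split the path list and run one copy per slice. This is a minimal sketch, not a fix inside Nix: the helper names and the batch size of 500 are hypothetical, and only the `nix copy --derivation --to` invocation comes from the report above. Each invocation still copies the closure of its batch, so whether this stays below the limit that triggers the abort is untested.

```python
import subprocess
from typing import Iterator, List, Sequence


def batches(paths: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` paths."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(paths), size):
        yield list(paths[start:start + size])


def copy_in_batches(paths: Sequence[str], target: str, size: int = 500) -> None:
    """Run one `nix copy` per batch instead of a single large invocation.

    `size` is a hypothetical default; nothing in the thread says which
    value, if any, avoids the crash.
    """
    for batch in batches(paths, size):
        # Same flags as in the original report; check=True stops at the
        # first failing batch by raising CalledProcessError.
        subprocess.run(
            ["nix", "copy", "--derivation", "--to", target, *batch],
            check=True,
        )
```

Paths already present on the target are skipped by later invocations, so overlapping closures between batches should cost little beyond the extra connection setup.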

@thufschmitt
Member

I'm not sure what causes it to crash since that should be properly streaming 🤔

@edolstra since you're the original author of that “low-latency ssh copying”, any idea what might go wrong? I must confess I'm not entirely clear on how it works

@MrFoxPro

MrFoxPro commented Apr 30, 2023

This happens when I'm using deploy-rs (it copies the store via ssh-ng://).

@colemickens
Member

This is something I'm hitting on a nearly daily basis as nixos-unstable moves. I have a number of systems that don't really get cache hits, causing large rebuilds and thus numerous derivations to copy. Unfortunately it's causing enough noise that I'm getting close to writing another nix wrapper that looks for the crash string and just retries the copy, but that's not ideal.

(Thanks to those investigating / looking at fixes!)
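
The wrapper described above could look roughly like the following. It is a sketch under assumptions: the function name, the retry count, and the injectable `run` parameter are hypothetical, and it assumes the "Too many root sets" message arrives on stderr, as the concatenated progress output in the reports suggests.

```python
import subprocess
import sys
from typing import Callable, Sequence

CRASH_MARKER = "Too many root sets"  # message shown in the reports above


def run_with_retry(
    cmd: Sequence[str],
    max_attempts: int = 3,
    run: Callable[..., "subprocess.CompletedProcess"] = subprocess.run,
) -> int:
    """Run `cmd`, retrying only when stderr contains CRASH_MARKER.

    Returns the exit code of the last attempt. Any other failure is
    returned immediately so that real errors are not masked.
    """
    returncode = 1
    for _ in range(max_attempts):
        proc = run(list(cmd), stderr=subprocess.PIPE, text=True)
        stderr_text = proc.stderr or ""
        sys.stderr.write(stderr_text)  # replay what the command printed
        returncode = proc.returncode
        if returncode == 0 or CRASH_MARKER not in stderr_text:
            return returncode
    return returncode
```

A call such as `run_with_retry(["nix", "copy", "--to", target, *paths])` would resume where the previous attempt stopped, since paths that already arrived are not copied again. Capturing stderr delays the progress bar until each attempt ends, which is the price of being able to inspect it.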

@MrFoxPro

MrFoxPro commented Dec 1, 2023

Especially annoying when dealing with slow internet, as it prevents building on the remote.

@siriobalmelli

siriobalmelli commented Feb 26, 2024

Running into this with nixos-anywhere using --build-on-remote:

$ nix run github:nix-community/nixos-anywhere -- --flake .#MACHINE-NAME --build-on-remote root@MACHINE-IP
...
[0 copied (514946.1 MiB)] copying 17178 pathsToo many root sets
/nix/store/1dymvajkvj3kwj2xpjz5ccab49ry6paj-nixos-anywhere-1.0.0/bin/.nixos-anywhere-wrapped: line 196: 93359 Abort trap: 6           NIX_SSHOPTS="-o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -i $ssh_key_dir/nixos-anywhere ${ssh_args[*]}" nix copy "${nix_options[@]}" "${nix_copy_options[@]}" "$@"
$  nix --version
nix (Nix) 2.18.1

@sysedwinistrator
Contributor

Since the OP and other commenters only reported getting the error when using the ssh-ng:// protocol, I'd like to mention that I'm getting the error when copying a lot of derivations from an HTTP cache (Minio S3) to the local store in a GitLab CI job (which is running in a Docker container but uses the host's Nix store via the daemon).

Background:
I've been getting this error since I refactored my pipeline a few months ago to push all derivations to the cache after the initial eval and then pull those derivations from the cache in the build job in order to avoid having to re-evaluate the derivations in the build jobs.
This only occurs on my remote ARM build machine whose store is auto GC'ed due to limited disk space, meaning it sometimes has to refetch all derivations for a NixOS config. The other x86_64 build machine is also the eval machine, so it does not even have to fetch the derivations.

@geoffreygarrett

geoffreygarrett commented Sep 29, 2024

sshOpts = [ "-o" "ProxyCommand=none" ] partially solved it for me

@a-h
Contributor

a-h commented Dec 13, 2024

I ran into this on Nix 2.23.1 today. Copying from a file export into the local store, so no network involved at all:

nix copy --all --offline --no-check-sigs --from file://$PWD/nix-store

The output was:

[0 copied (78520668.1/78501814.0 MiB)] copying 10651 pathsToo many root sets
Aborted (core dumped)

Ignoring the fact it's reporting copying much more data than it expected (that's tracked in #9088), the output is concatenated with the error message.

@a-h
Contributor

a-h commented Dec 13, 2024

The Gerrit change doesn't explain why, but I noticed that the Lix project has refactored this area to remove the sinkToSource references and to remove some patches from the GC: https://gerrit.lix.systems/c/lix/+/1558

@Mic92
Member

Mic92 commented Dec 20, 2024

@a-h were you able to do the same operation with Lix?

@a-h
Contributor

a-h commented Dec 20, 2024

I haven't been able to reproduce the issue on my laptops, maybe because they have a lot more RAM (32GB and 64GB respectively) than the server I was using (8GB).

I tested copying my whole store with nix copy --all --offline --no-check-sigs --to file://$HOME/export-test. The CPUs stayed at 100% while the RAM usage of nix held fairly steady at 2GB.

There were around 60k paths to copy, and that copied everything quite well. It could be that the slowness of the copy operation gives the GC time to act.

I made a repo to test that creates 100k derivations: https://github.com/a-h/nix-7359 but that worked fine on my 32GB RAM laptop.

It could be that it's deeply nested derivations that cause the issue to surface, so a simple map to create derivations might not be enough.

@Mic92
Member

Mic92 commented Dec 20, 2024

Thanks for taking the time to set up a reproducer. If it only happens under RAM pressure, a NixOS VM should work.

@Mic92 Mic92 added this to Nix team Dec 20, 2024
@github-project-automation github-project-automation bot moved this to To triage in Nix team Dec 20, 2024
@Mic92 Mic92 removed this from Nix team Jan 8, 2025
@Mic92 Mic92 self-assigned this Jan 8, 2025
@nixos-discourse

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/2025-01-08-nix-team-meeting-minutes-207/58523/1

@NaN-git

NaN-git commented Jan 12, 2025

I had a look at the issue, although I'm missing a good test case. I wasn't able to reproduce the issue with https://github.com/a-h/nix-7359, although it provides some useful insights. Newer nix versions use more memory than old nix versions because copyPaths() allocates many SinkToSource objects that are not released after a path is copied, but only when all paths are copied.

The current implementations of SourceToSink and SinkToSource are problematic. First, SinkToSource allocates a potentially unbounded amount of memory, which is not released before the destruction of the SinkToSource instance.
Furthermore, passing unbounded chunks of data through a pipeline is probably not the best choice because it can cause buffer bloat and increased pipeline latency.

I wrote a patch that passes data in chunks through SourceToSink and SinkToSource. This bounds the maximum memory usage. Furthermore, the allocated memory is released immediately after it has been read completely. copyPaths is not changed because its memory overhead seems to be acceptable.

It would be nice if someone can test whether the patch fixes the issue.

EDIT: I observe the increased memory usage of current nix versions using the test case, when copying --to ssh-ng://..., but not --to file://.... With my patch the memory usage stays low and the bandwidth usage is much smoother, although it's far away from saturation, which could be caused by transferring a lot of very small files. Also I see some drops, i.e. the behavior could probably be improved further or the target host is the bottleneck.
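
The chunking idea in this comment can be illustrated with a small pull-style reader. This is a sketch of the general technique, not the patch itself: the class only borrows the name SinkToSource for readability, the 64 KiB limit is a hypothetical value, and a Python generator stands in for the coroutine that Nix uses.

```python
from typing import Iterator

CHUNK_SIZE = 64 * 1024  # hypothetical bound, not taken from the patch


def chunked(data: bytes, limit: int = CHUNK_SIZE) -> Iterator[bytes]:
    """Split one large write into pieces of at most `limit` bytes."""
    view = memoryview(data)
    for start in range(0, len(view), limit):
        yield bytes(view[start:start + limit])


class SinkToSource:
    """Pull-style reader over a producer generator (illustrative only).

    The reader holds at most one chunk of `limit` bytes of its own and
    replaces it as soon as it has been read completely. The producer is
    resumed only when the reader needs more data.
    """

    def __init__(self, producer: Iterator[bytes], limit: int = CHUNK_SIZE):
        self._pieces = (
            piece for blob in producer for piece in chunked(blob, limit)
        )
        self._current = b""
        self._pos = 0
        self.peak_buffered = 0  # bookkeeping for the example only

    def read(self, n: int) -> bytes:
        """Return up to `n` bytes, or b"" once the producer is exhausted."""
        if self._pos == len(self._current):
            # Current chunk fully read: drop it and fetch the next one.
            self._current = next(self._pieces, b"")
            self._pos = 0
            self.peak_buffered = max(self.peak_buffered, len(self._current))
        out = self._current[self._pos:self._pos + n]
        self._pos += len(out)
        return out
```

In this sketch a slow reader simply leaves the producer suspended instead of letting a buffer grow, which matches the smoother bandwidth usage described in the EDIT above only in spirit; the real behaviour depends on the Nix implementation.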

@Mic92
Member

Mic92 commented Jan 20, 2025

@a-h Thanks for putting together the reproducer. Would it be possible for you to test the patch and see if you can still reproduce the issue? #12255

edolstra added a commit to DeterminateSystems/nix-src that referenced this issue Jan 20, 2025
This allows RemoteStore::addMultipleToStore() to free the Source
objects early (and in particular the associated sinkToSource()
buffers). This should fix NixOS#7359. For example, memory consumption of

  nix copy --derivation --to ssh-ng://localhost?remote-store=/tmp/nix --derivation --no-check-sigs \
    /nix/store/4p9xmfgnvclqpii8pxqcwcvl9bxqy2xf-nixos-system-...drv

went from 353 MB to 74 MB.
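
The effect of freeing each source early can be pictured with a toy model. This is a sketch under assumptions and not the code of the commit: `Buffer`, `send_keeping_all`, and `send_releasing_early` are hypothetical names, and the immediate release relies on CPython reference counting.

```python
from typing import Callable, List, Tuple


class Buffer:
    """Stand-in for a Source that owns a buffer (hypothetical type)."""

    def __init__(self, payload: bytes):
        self.payload = payload


Item = Tuple[str, Buffer]
Writer = Callable[[str, bytes], None]


def send_keeping_all(items: List[Item], write: Writer) -> None:
    """Borrow the list: every Buffer stays alive until the caller drops it."""
    for name, buf in items:
        write(name, buf.payload)


def send_releasing_early(items: List[Item], write: Writer) -> None:
    """Consume the list and drop each Buffer as soon as it has been sent."""
    items.reverse()  # pop() from the end then walks the original order
    while items:
        name, buf = items.pop()
        write(name, buf.payload)
        del buf  # last reference gone; CPython frees the Buffer here
```

With the second variant, the number of live buffers shrinks as the transfer progresses instead of staying at its maximum until the whole list goes out of scope, which is the kind of difference the 353 MB versus 74 MB figures above point to.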
mergify bot pushed a commit that referenced this issue Jan 20, 2025
This allows RemoteStore::addMultipleToStore() to free the Source
objects early (and in particular the associated sinkToSource()
buffers). This should fix #7359. For example, memory consumption of

  nix copy --derivation --to ssh-ng://localhost?remote-store=/tmp/nix --derivation --no-check-sigs \
    /nix/store/4p9xmfgnvclqpii8pxqcwcvl9bxqy2xf-nixos-system-...drv

went from 353 MB to 74 MB.

(cherry picked from commit cc838e8)
@NaN-git

NaN-git commented Jan 20, 2025

@edolstra Users reported that this issue was worse with slow internet, i.e. the size of allocated buffers increased in this case. Your patch fixes the root cause of the increased memory usage; nevertheless the implementation of SinkToSource is not ideal and also the coroutine yields until the current buffer is drained.
