
@nathanweeks

The dev container uses the root user in the container:

"remoteUser": "root",

When run as root, tar -x preserves the ownership (uid/gid) recorded in the tarball upon extraction. This can result in an error under rootless podman if the dev container's localWorkspaceFolder, i.e.:

"remoteEnv": {
// Workspace path on the host for mounting with docker-outside-of-docker
"LOCAL_WORKSPACE_FOLDER": "${localWorkspaceFolder}"
},

resides on an NFS file system (see Rootless Podman and NFS for more details); e.g.:

/workspaces/modules -> nf-test test modules/nf-core/untar/tests/main.nf.test
...

  Command error:
    kraken2/opts.k2d
    tar: opts.k2d: Cannot change ownership to uid 501, gid 50: Operation not permitted
    kraken2/taxo.k2d
    tar: taxo.k2d: Cannot change ownership to uid 501, gid 50: Operation not permitted
    kraken2/hash.k2d
    tar: hash.k2d: Cannot change ownership to uid 501, gid 50: Operation not permitted
    tar: Exiting with failure status due to previous errors

The fix proposed by this PR is to add the GNU tar --no-same-owner option, which makes the extracted files owned by the user running the tar command (in the preceding scenario, root in the dev container's user namespace, which is mapped to the user running podman on the host).
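A minimal sketch of the behavior (file names are illustrative, not the real kraken2 database): with --no-same-owner, GNU tar chowns extracted files to the invoking user instead of attempting to restore the archived uid/gid, which is the chown that fails under rootless podman on NFS.

```shell
# Build a small tarball and extract it with --no-same-owner.
# Illustrative paths only; the archived uid/gid is whatever the
# creating user had, and --no-same-owner tells tar to ignore it.
demo=$(mktemp -d)
mkdir -p "$demo/kraken2"
echo "k2 data" > "$demo/kraken2/opts.k2d"
tar -czf "$demo/db.tar.gz" -C "$demo" kraken2

mkdir -p "$demo/out"
# Extracted files are owned by the current user; no chown to the
# archived uid/gid is attempted, so no EPERM on NFS under rootless podman.
tar -x --no-same-owner -zf "$demo/db.tar.gz" -C "$demo/out"
ls -l "$demo/out/kraken2/opts.k2d"
```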

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the module conventions in the contribution docs
  • If necessary, include test data in your PR.
  • Remove all TODO statements.
  • Emit the versions.yml file.
  • Follow the naming conventions.
  • Follow the parameters requirements.
  • Follow the input/output options guidelines.
  • Add a resource label
  • Use BioConda and BioContainers if possible to fulfil software requirements.
  • Ensure that the test works with either Docker / Singularity. Conda CI tests can be quite flaky:
    • For modules:
      • nf-core modules test <MODULE> --profile docker
      • nf-core modules test <MODULE> --profile singularity
      • nf-core modules test <MODULE> --profile conda
    • For subworkflows:
      • nf-core subworkflows test <SUBWORKFLOW> --profile docker
      • nf-core subworkflows test <SUBWORKFLOW> --profile singularity
      • nf-core subworkflows test <SUBWORKFLOW> --profile conda

@nathanweeks
Author

Might need to run the CI checks again; it looks like there were some network and/or disk-full errors on the self-hosted runners, e.g.:

https://github.com/nf-core/modules/actions/runs/19082094294/job/55387492031#step:6:1057

ERROR ~ Error executing process > 'TXIMETA_TXIMPORT'
  
  Caused by:
    Failed to pull singularity image
      command: singularity pull  --name depot.galaxyproject.org-singularity-bioconductor-tximeta%3A1.20.1--r43hdfd78af_0.img.pulling.1763111870007 https://depot.galaxyproject.org/singularity/bioconductor-tximeta%3A1.20.1--r43hdfd78af_0 > /dev/null
      status : 143
      hint   : Try and increase singularity.pullTimeout in the config (current is "20m")
      message:
        INFO:    Downloading network image

https://github.com/nf-core/modules/actions/runs/19082094294/job/55387491920#step:6:831

    > Command error:
    >   Unable to find image 'quay.io/biocontainers/bioconductor-tximeta:1.20.1--r43hdfd78af_0' locally
    >   1.20.1--r43hdfd78af_0: Pulling from biocontainers/bioconductor-tximeta
    >   fa7e54f17dc0: Pulling fs layer
    >   4ca545ee6d5d: Pulling fs layer
    >   f76401802415: Pulling fs layer
    >   4ca545ee6d5d: Verifying Checksum
    >   4ca545ee6d5d: Download complete
    >   fa7e54f17dc0: Verifying Checksum
    >   fa7e54f17dc0: Download complete
    >   fa7e54f17dc0: Pull complete
    >   4ca545ee6d5d: Pull complete
    >   docker: write /var/lib/docker/tmp/GetImageBlob4044426464: no space left on device

https://github.com/nf-core/modules/actions/runs/19082094294/job/55387491895#step:6:803

    > Command error:
    >   Unable to find image 'quay.io/biocontainers/rtg-tools:3.12.1--hdfd78af_0' locally
    >   3.12.1--hdfd78af_0: Pulling from biocontainers/rtg-tools
    >   c1a16a04cedd: Already exists
    >   4ca545ee6d5d: Already exists
    >   5c8d8c55d21b: Pulling fs layer
    >   5c8d8c55d21b: Verifying Checksum
    >   5c8d8c55d21b: Download complete
    >   docker: failed to register layer: write /usr/local/lib/libxcb-render.so.0.0.0: no space left on device

@nathanweeks
Author

I suspect that this change is triggering far more tests than normally run, resulting in disk space exhaustion on the runners.

To test this hypothesis, I switched a fork of nf-core/modules to the GitHub-hosted runners, placing conda environments, docker containers, and nextflow/nf-test work directories on the secondary /mnt partition (the technique described in #7016 (comment)). This resulted in substantially more checks passing:

https://github.com/fasrc/modules/actions/runs/19448287416
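A rough sketch of the disk-relief technique, with assumed paths (on a GitHub-hosted runner the large secondary partition is /mnt; here a temp directory stands in for it). Docker's data-root would also be moved via /etc/docker/daemon.json, which needs root and a daemon restart, so that part is not shown.

```shell
# Point space-hungry caches at a larger scratch partition.
# SCRATCH is a stand-in for /mnt on a GitHub-hosted runner.
SCRATCH="${SCRATCH:-/tmp/ci-scratch}"
mkdir -p "$SCRATCH/nf-work" "$SCRATCH/conda-pkgs"

export NXF_WORK="$SCRATCH/nf-work"            # Nextflow work directory
export CONDA_PKGS_DIRS="$SCRATCH/conda-pkgs"  # conda package cache
echo "work dir: $NXF_WORK"
```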

There were still failures, though; some of them have plausible explanations, possibly indicating those tests haven't been run in a while.

For example, CELLRANGERARC_MKFASTQ failed in some of the docker and singularity shards because its output differed from the snapshot (https://github.com/fasrc/modules/actions/runs/19448287416/job/55715338873#step:5:449). However, its results are non-deterministic due to multithreading:

// WARNING !! Cell Ranger ARC mkfastq results are not deterministic, so the number of threads used in the process might affect the results.

Also, METAPHLAN3_MERGEMETAPHLANTABLES appears to be subject to bit rot:
https://github.com/fasrc/modules/actions/runs/19448287416/job/55715337100#step:5:929

>     File "/mnt/runner/nf-test/tests/dd18346802b2188d78a7d615dbbe57ba/work/conda/env-4976d331465f416f-eb1330aa155c9242f2a03346deb44f4d/lib/python3.13/site-packages/metaphlan/metaphlan.py", line 26, in <module>
>       from distutils.version import LooseVersion
>   ModuleNotFoundError: No module named 'distutils'

distutils was removed in Python 3.12 (https://peps.python.org/pep-0632/), and isn't going to be present in the Python 3.13 installed in the conda environment.
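A quick generic check of the version dependence (not part of the module itself): the import MetaPhlAn 3 performs at startup only succeeds on interpreters older than 3.12; packaging.version.Version is the usual replacement.

```python
# distutils was removed from the stdlib in Python 3.12 (PEP 632), so
# MetaPhlAn 3's "from distutils.version import LooseVersion" fails on
# 3.12+ with ModuleNotFoundError.
import sys

try:
    from distutils.version import LooseVersion  # gone in Python >= 3.12
    have_distutils = True
except ModuleNotFoundError:
    have_distutils = False

print(f"Python {sys.version_info.major}.{sys.version_info.minor}: "
      f"distutils {'present' if have_distutils else 'missing'}")
```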

I'm not sure how to proceed with this one; any guidance would be appreciated!

Addresses CI runners running out of space with most shards
@nathanweeks nathanweeks requested review from a team as code owners November 19, 2025 17:15
@nathanweeks
Author

Per nf-core slack, I temporarily increased max_shards from 15 to 30. Will revert after tests have run and before PR is merged.

@nathanweeks
Author

Still running out of space with 30 shards. Bumping max_shards to 60 to see if that's sufficient...?

@famosab
Contributor

famosab commented Nov 20, 2025

Try merging the master branch into your branch, that should lower the number of tests again I think!

@nathanweeks
Author

nathanweeks commented Nov 21, 2025

> Try merging the master branch into your branch, that should lower the number of tests again I think!

@famosab Thanks for the tip! That substantially reduced the number of test failures.

There are still some test failures, none of which seem to be related to the change to the tar command proposed in this PR.

  1. x64 | docker | 21
  2. x64 | docker | 22

These are the remaining no-space-left-on-device errors. I could try temporarily bumping up max_shards further?

  1. x64 | singularity | 13

This job appears to have hit a network issue ("FATAL: While making image from oci registry: error fetching image to cache: while building SIF from layers: conveyor failed to get: error writing layer: unexpected EOF") while building a SIF; perhaps a rerun would fix it.

  1. x64 | conda | 8
  2. x64 | conda | 10
  3. x64 | conda | 12
  4. x64 | conda | 15

These tests fail because the MetaPhlAn 3.0.12 Bioconda package references a Python library that was removed in Python 3.12 (and so is not present in the Python 3.13 currently installed in the environment):

Test Process METAPHLAN3_MERGEMETAPHLANTABLES
...
      File "/home/runner/_work/modules/modules/.nf-test/tests/1b6ae314ec0ac11e9fbfa4b41ccbfacc/work/conda/env-a704cd16283e2a0e-eb1330aa155c9242f2a03346deb44f4d/lib/python3.13/site-packages/metaphlan/metaphlan.py", line 26, in <module>
        from distutils.version import LooseVersion
    ModuleNotFoundError: No module named 'distutils'

This issue was apparently fixed in MetaPhlAn 4.2.0:

biobakery/MetaPhlAn#232 (comment)

  1. x64 | conda | 9
  2. x64 | conda | 11
  3. x64 | conda | 14
  4. x64 | docker | 11

Another MetaPhlAn 3 issue, possibly related to the Python version:

Test Process METAPHLAN3_MERGEMETAPHLANTABLES
...
com.fasterxml.jackson.dataformat.yaml.snakeyaml.error.MarkedYAMLException: while scanning a simple key
 in 'reader', line 3, column 1:
    line
    k^
could not find expected ':'
 in 'reader', line 4, column 1:
    import
    ^

 at [Source: (InputStreamReader); line: 2, column: 23]

  1. x64 | singularity | 9
  2. x64 | singularity | 10
  3. x64 | docker | 12

A "different snapshot" error in kofamscan output that I can reproduce on the master branch in a codespace with, e.g., nf-test test --profile singularity modules/nf-core/kofamscan/tests/main.nf.test

  1. x64 | singularity | 12

Another different-snapshot error, also reproducible in a codespace on the master branch, using nf-test test --profile singularity modules/nf-core/foldmason/msa2lddtreport/tests/main.nf.test

  1. x64 | conda | 13

A different-snapshot error with hamronization/rgi, reproducible in a codespace on the master branch:

nf-test test --profile conda modules/nf-core/hamronization/rgi/tests/main.nf.test

@nathanweeks
Author

PR to attempt to fix the metaphlan3_metaphlan3 and metaphlan3_mergemetaphlantables errors:
#9448
