fasterq-dump overloads memory #903
I also tried running on a local Docker.
For completeness, I did eventually get it to complete with the 4 core and 8GB/core configuration. I expect this will be dependent on the size of the data.
@OOAAHH I was able to run your example without any issue. The SRA file is 14 GB, and unpacked it leads to a 26 GB FASTQ file. Are you sure you are not running out of disk quota?

Some things I see: your example does not provide a scratch space to store the temporary files, so they will be written to a temporary folder in the current directory.

It should further be noted that this particular data was uploaded as an aligned BAM. Dumping out a FASTQ file from a BAM-derived SRA file is mostly useless for scRNA-seq, because any cell barcodes and UMIs will only be in the tags and will not get properly dumped out. I don't know what you plan to do with the data, but for processing as scRNA-seq you are likely better off downloading the BAM (and .bai) directly from the ENA (see ERR4027871).
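As a point of reference, a minimal sketch of supplying an explicit scratch area for the temporary files and a separate output directory (the paths below are placeholders, not taken from this thread):

```bash
# Minimal sketch: keep fasterq-dump's temporary files on a large scratch
# filesystem (-t/--temp) and write the FASTQ output where there is enough
# quota (-O/--outdir). Paths are placeholders; ERR4027871 is the run
# mentioned above.
fasterq-dump ERR4027871 \
    --temp /path/to/scratch \
    --outdir /path/to/fastq \
    --split-files \
    --progress
```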
First of all, thank you for your prompt and detailed response. Your insights have been incredibly helpful and have shed light on several areas I had overlooked in my approach.
Glad to help. Fortunately, the .bai files shouldn't be essential - one can reindex.
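Presumably the reindexing would be done with samtools; a minimal sketch, assuming a coordinate-sorted BAM and a placeholder filename:

```bash
# Rebuild the .bai index for a coordinate-sorted BAM that was downloaded
# without one. The filename is a placeholder; requires samtools on the PATH.
samtools index ERR4027871.bam   # writes ERR4027871.bam.bai next to the BAM
```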
@permia please show the command you use and indicate at least one accession (SRR).
The command I used is correct. fasterq-dump does encounter issues when processing certain random SRA files. Providing examples is not meaningful. |
@permia having multiple examples of failures can be valuable to developers. This thread is about possible memory issues in recent versions of `fasterq-dump`. Note that @OOAAHH did not, in the end, have the same issue; it rather appeared to be about disk space and managing temporary scratch spaces, and it was ultimately resolved in an orthogonal way.
I have installed `sra-tools` v3.0.10 distributed from Bioconda for the linux-64 platform. Running `fasterq-dump` occupies far more RAM than the flags would imply (default 100 MB/core) or than I have ever encountered before using identical commands. In previous versions, I always used 8 cores + 1 GB/core, with `-t` pointing to a local scratch disk and VDB configured with plenty of room for the `ncbi/sra` cache.

Using the above for any SRRs from PRJNA544617 ends with LSF killing my jobs for exceeding memory. I have retried with other configurations, all eventually killed for overallocating memory. I am currently running again with 4 cores + 8 GB/core (32 GB total).

This makes me suspect there is something off in this version, possibly with:

- temporary files being written to `/tmp/` instead of the designated `-t` path
- the `--mem` argument (or not reading the default)

Please let me know if I can provide any additional information.
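For concreteness, an invocation along the lines described above (8 threads, temp files on local scratch, an explicit sort-memory limit) might look roughly like the following; the accession and paths are placeholders rather than the actual command from this report:

```bash
# Illustrative sketch only; the accession and paths are placeholders.
SCRATCH=/local/scratch/$USER   # node-local scratch disk (assumed layout)

fasterq-dump SRRXXXXXXX \
    --threads 8 \
    --mem 1000MB \
    --temp "$SCRATCH" \
    --outdir fastq/ \
    --split-files \
    --progress
```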