Check matched FASTQ files for same number of reads/lines #14

mruffalo · 2022-05-04T14:36:06Z

Single-cell/nucleus datasets typically contain matched FASTQ files, in groups of 2 for RNA-seq and some ATAC-seq assays, and groups of 3 for other ATAC-seq data types. (RNA-seq groups contain 2 additional files, with prefix I, which are not currently used in the analysis.)

Processing these data types requires the FASTQ files in each group to match read for read: for example, the first read in R1 is the barcode + UMI and the first read in R2 is the matched transcript sequence. The pipelines rely on this by iterating over the reads in each file in "zipped" fashion.

This crucially requires the number of reads (and therefore lines) in each of the grouped FASTQ files to match. We should check this during dataset ingest. We already check for valid gzip compression, so the check proposed here should be implemented in the same pass, to avoid wasting CPU and I/O time decompressing each file twice.
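As a rough sketch of the idea (not the existing validator code), the line count can be collected in the same decompression pass that proves the gzip stream is valid. The function names `count_fastq_lines` and `check_group` are hypothetical, and the check that each count is a multiple of 4 assumes the usual 4-lines-per-read FASTQ layout:

```python
import gzip


def count_fastq_lines(path):
    """Decompress the file once and count its lines.

    Reading to the end also exercises the gzip stream, so an invalid
    or truncated archive raises an exception here (OSError/EOFError).
    """
    count = 0
    with gzip.open(path, "rb") as f:
        for _ in f:
            count += 1
    return count


def check_group(paths):
    """Return a list of error strings for one group of matched FASTQ files.

    Hypothetical helper: an empty list means every file in the group
    has the same line count and that count is a multiple of 4.
    """
    counts = {str(p): count_fastq_lines(p) for p in paths}
    errors = []
    for name, n in counts.items():
        # Assumes the standard 4-lines-per-read FASTQ layout.
        if n % 4 != 0:
            errors.append(f"{name}: {n} lines is not a multiple of 4")
    if len(set(counts.values())) > 1:
        detail = ", ".join(f"{name}={n}" for name, n in sorted(counts.items()))
        errors.append(f"line counts differ within group: {detail}")
    return errors
```

Calling `check_group` once per group of matched files keeps the cost at a single decompression per file.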


mruffalo · 2022-05-04T14:37:00Z

Finding grouped FASTQ files is handled in pipelines by https://github.com/hubmapconsortium/fastq-utils, and that should probably be used here too.

jswelling · 2022-05-04T16:12:28Z

Grouped files are identified by the regex at https://github.com/hubmapconsortium/fastq-utils/blob/main/fastq_utils/__init__.py#L15
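For illustration only, grouping by filename might look like the sketch below. The pattern `FASTQ_PATTERN` and the function `group_fastq` are hypothetical and are not the regex or API from fastq-utils; the real check should reuse that library rather than a copy like this:

```python
import re
from collections import defaultdict

# Hypothetical pattern, NOT the one in fastq-utils: matches names such as
# sample_R1.fastq.gz or sample_R2_001.fastq.gz, capturing the read label.
FASTQ_PATTERN = re.compile(
    r"^(?P<prefix>.+)_(?P<read>[IR]\d)(?P<suffix>(_\d+)?\.(fastq|fq)(\.gz)?)$"
)


def group_fastq(names):
    """Group FASTQ filenames that differ only in their R1/R2/R3 label."""
    groups = defaultdict(list)
    for name in names:
        m = FASTQ_PATTERN.match(name)
        if m is None:
            continue  # not a FASTQ file by this naming scheme
        if m.group("read").startswith("I"):
            continue  # index files are not used in the analysis
        groups[(m.group("prefix"), m.group("suffix"))].append(name)
    return {key: sorted(files) for key, files in groups.items()}
```

Each value in the returned dictionary is one group of files whose line counts should match.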

jswelling · 2022-05-04T16:13:56Z

gz_validator.py in this repo is the validator that currently tests whether fastq.gz files can be decompressed.
