Check matched FASTQ files for same number of reads/lines #14

Open
mruffalo opened this issue May 4, 2022 · 3 comments

Comments

@mruffalo
Contributor

mruffalo commented May 4, 2022

Single-cell/nucleus datasets typically contain matched FASTQ files, in groups of two for RNA-seq and some ATAC-seq assays, and in groups of three for other ATAC-seq assays. (RNA-seq datasets contain two more files per group, with prefix I, which are not currently used in the analysis.)

Processing these data types requires the files in each group to match read for read. For example, the first read in R1 is the barcode + UMI and the first read in R2 is the matched transcript sequence, so the reads in each file are consumed with "zipped" iteration.
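
As a rough illustration of the "zipped" iteration described above, here is a minimal Python sketch. The function names `read_fastq` and `paired_reads` are made up for this example and are not taken from any HuBMAP pipeline; it also assumes plain four-line FASTQ records.

```python
import gzip
from itertools import islice
from typing import Iterator, Tuple

Read = Tuple[str, str, str]  # (header, sequence, quality)


def read_fastq(path) -> Iterator[Read]:
    """Yield (header, sequence, quality) for each 4-line record (hypothetical helper)."""
    with gzip.open(path, "rt") as f:
        while True:
            record = list(islice(f, 4))
            if not record:
                return
            if len(record) != 4:
                raise ValueError(f"{path}: truncated FASTQ record")
            header, seq, _plus, qual = (line.rstrip("\n") for line in record)
            yield header, seq, qual


def paired_reads(r1_path, r2_path) -> Iterator[Tuple[Read, Read]]:
    """Pair the i-th read of R1 with the i-th read of R2."""
    # Plain zip() stops at the shorter input without raising, so a read
    # count mismatch between the two files would go unnoticed here.
    yield from zip(read_fastq(r1_path), read_fastq(r2_path))
```

Because `zip` stops silently at the shorter input, a mismatch in read counts would drop or misalign reads rather than fail loudly, which is why an up-front check is useful.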

This requires every file in a group to contain the same number of reads (and therefore lines). We should check this during dataset ingest. We already check for valid gzip compression, so the check proposed here should be implemented in a way that doesn't waste CPU and I/O time decompressing the same file twice.
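
One possible shape for this, sketched in Python under some assumptions: the names `count_lines_gz` and `check_group` are hypothetical, the list of grouped paths is assumed to come from elsewhere, and this is not how the existing validator is written. The idea is to count newlines while fully decompressing each file, so one pass serves both the gzip check and the line count.

```python
import gzip


def count_lines_gz(path, chunk_size=1 << 20) -> int:
    """Decompress the whole file once and count its lines in the same pass.

    Reading to the end also exercises gzip validity: a bad or truncated
    file raises an exception (e.g. OSError or EOFError) from read().
    """
    count = 0
    last = b""
    with gzip.open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            count += chunk.count(b"\n")
            last = chunk[-1:]
    # Count a final line that lacks a trailing newline.
    if last and last != b"\n":
        count += 1
    return count


def check_group(paths):
    """Return a list of error strings for one group of matched FASTQ files."""
    counts = {str(p): count_lines_gz(p) for p in paths}
    errors = []
    for p, n in counts.items():
        # Assumes four lines per FASTQ record.
        if n % 4 != 0:
            errors.append(f"{p}: {n} lines is not a multiple of 4")
    if len(set(counts.values())) > 1:
        errors.append(f"line counts differ within group: {counts}")
    return errors
```

An empty list from `check_group` means every file in the group decompressed cleanly and all line counts agree.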

@mruffalo
Contributor Author

mruffalo commented May 4, 2022

In the pipelines, finding grouped FASTQ files is handled by https://github.com/hubmapconsortium/fastq-utils, and that should probably be used here too.

@jswelling
Collaborator

Grouped files are identified by the regex at https://github.com/hubmapconsortium/fastq-utils/blob/main/fastq_utils/__init__.py#L15

@jswelling
Collaborator

gz_validator.py in this repo is the validator that currently checks whether fastq.gz files can be decompressed.
