@antgonza
Member

No description provided.

@antgonza changed the title from "add subsample_reads" to "[WIP] add subsample_reads" on Jul 10, 2025
Contributor

@AmandaBirmingham left a comment

One requested change to an error message, and a couple of questions.

f'| gzip > {f}')
_, se, rv = system_call(cmd)
if rv != 0 or se:
    raise ValueError(f'Error during mv: {cmd}. {se}')
Contributor

This error message seems misleading since it (still) says "Error during mv"; I think it should be changed to say it is reporting on errors that occur during seqtk.
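A minimal sketch of what the requested change could look like, assuming the step name is passed in so the message always matches the command that failed. The helper `run_step` is hypothetical and uses `subprocess` as a stand-in for the project's `system_call`; it keeps the snippet's behavior of treating any stderr output as a failure.

```python
import subprocess


def run_step(cmd, step):
    """Run a shell command and raise an error that names the failing step.

    `run_step` is a hypothetical stand-in for the project's `system_call`
    plus its error check; it is not the code from this pull request.
    """
    proc = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    # Mirror the reviewed snippet: a non-zero return code OR any stderr
    # output is treated as a failure.
    if proc.returncode != 0 or proc.stderr:
        raise ValueError(f'Error during {step}: {cmd}. {proc.stderr}')
    return proc.stdout
```

With this shape, the subsampling call would report "Error during seqtk" while the rename call would still report "Error during mv".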

for f in files:
    dn = dirname(f)
    bn = basename(f)
    nbn = join(dn, bn.replace('fastq.gz', 'subsampled.gz'))
Contributor

Is it going to cause any problems later that the subsampled reads file doesn't end with 'fastq.gz'? I'm not familiar with the workflow but I know I've seen regexes around that expect this suffix ...

Member Author

I wanted to keep a backup of the original file in the working directory, just in case we need to debug it. This basically moves the original problematic file to a new name containing "subsampled"; then, in the next command, the subsampling step creates a new (smaller) file with the name of the original. In fact, I'm relying on those regexes to ignore the subsampled.gz file. I'll add a comment about this.

Contributor

Ah, that makes sense :) However, I wonder if we could name them something other than "subsampled.gz"? That name makes me think that the file with that name IS the one that was subsampled, rather than being the original one. Could we name it "not_subsampled.gz" or something?

Member Author

Yes, while writing the comment I realized the same thing so I called it "full".
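The rename-then-subsample flow discussed in this thread can be sketched as below. This is an illustration under assumptions, not the merged code: the helper `plan_subsample`, the exact backup suffix `full.gz`, and the seed value are hypothetical, and the function only builds the commands rather than running them. It assumes `seqtk sample` is given the backup file and an absolute number of reads to keep.

```python
import shlex
from os.path import basename, dirname, join


def plan_subsample(fastq_path, max_reads, seed=42):
    """Build the backup path and the two shell commands for one file.

    Hypothetical helper: the 'full.gz' suffix and the seed are assumptions
    made for illustration only.
    """
    dn = dirname(fastq_path)
    bn = basename(fastq_path)
    # The backup deliberately does NOT end in 'fastq.gz', so suffix-based
    # regexes later in the workflow skip it.
    backup = join(dn, bn.replace('fastq.gz', 'full.gz'))
    src, dst = shlex.quote(fastq_path), shlex.quote(backup)
    # Step 1: move the original (too large) file out of the way.
    mv_cmd = f'mv {src} {dst}'
    # Step 2: write the subsampled reads under the original file name.
    sample_cmd = f'seqtk sample -s {seed} {dst} {max_reads} | gzip > {src}'
    return backup, mv_cmd, sample_cmd
```

For example, `plan_subsample('/data/run/sample_R1.fastq.gz', 1000)` yields the backup path `/data/run/sample_R1.full.gz`, so the full-size original stays available for debugging while downstream steps see only the smaller file under the original name.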

self.convert_raw_to_fastq()
self.integrate_results()
self.generate_sequence_counts()
self.subsample_reads()
Contributor

Does this mean that we will now ALWAYS subsample?

Member Author

Yes, but more precisely: we will always check whether subsampling is needed and only run it when necessary.

Contributor

Well, we will always subsample every fastq that has more than the max number of reads, right?

Member Author

Correct.
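The "always check, subsample only when needed" behavior agreed on above could be sketched as follows. This is a rough illustration, assuming reads are counted directly from each gzipped FASTQ file (four lines per record); the actual workflow may instead reuse the counts produced by `generate_sequence_counts`. The names `count_fastq_reads` and `files_to_subsample` are hypothetical.

```python
import gzip


def count_fastq_reads(path):
    """Count reads in a gzipped FASTQ file, assuming 4 lines per record."""
    with gzip.open(path, 'rt') as fh:
        return sum(1 for _ in fh) // 4


def files_to_subsample(paths, max_reads):
    """Return only the files whose read count exceeds max_reads.

    Hypothetical helper: every file is checked, but only those above the
    threshold are selected for subsampling.
    """
    return [p for p in paths if count_fastq_reads(p) > max_reads]
```

Files at or below the threshold are left untouched, which matches the answer that the step always runs but only acts when necessary.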

@antgonza changed the title from "[WIP] add subsample_reads" to "add subsample_reads" on Jul 11, 2025
@antgonza merged commit 2b7832e into qiita-spots:main on Jul 13, 2025
2 checks passed