Thanks for the useful tool. I would like to request a new feature/option to come into play when using fastp for deduplication. It would be useful if the IDs of duplicate reads could be saved as well as the ID of the 'representative' read that each is a duplicate of. This would mimic a useful feature of another tool:
In addition to the de-duplicated FASTA or FASTQ outputs, czid-dedup also outputs a cluster file which makes it possible to identify clusters of duplicate reads. The file lists the representative cluster read ID for each initial read ID, where the representative cluster read ID is the read ID that makes it into the output file. If a read is found to be a duplicate of a previous read, it will be filtered out of the FASTA/FASTQ output and paired with the read ID of the previous duplicate read in the cluster output file. Representative cluster read IDs are paired with themselves. The order of the input files is preserved. The representative read will always be the first read of its type.
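To make the requested output concrete, here is a minimal sketch of the cluster bookkeeping described in the quote above. It is not how fastp or czid-dedup work internally: it assumes duplicates are reads with exactly identical sequences, and the function name and the (representative ID, read ID) column order are hypothetical choices for illustration.

```python
from typing import Dict, Iterable, List, Tuple

Read = Tuple[str, str]  # (read_id, sequence)


def dedup_with_clusters(reads: Iterable[Read]) -> Tuple[List[Read], List[Tuple[str, str]]]:
    """Deduplicate reads by exact sequence and record cluster membership.

    Returns (kept_reads, cluster_rows). Each cluster row is
    (representative_id, read_id); the column order is an assumption.
    """
    first_seen: Dict[str, str] = {}  # sequence -> ID of the first read carrying it
    kept: List[Read] = []
    clusters: List[Tuple[str, str]] = []
    for read_id, seq in reads:
        if seq not in first_seen:
            # First read with this sequence: it becomes the representative
            # and is the only one written to the de-duplicated output.
            first_seen[seq] = read_id
            kept.append((read_id, seq))
        # Every read, representative or duplicate, gets one row, so
        # representatives end up paired with themselves.
        clusters.append((first_seen[seq], read_id))
    return kept, clusters


reads = [("r1", "ACGT"), ("r2", "TTGA"), ("r3", "ACGT")]
kept, clusters = dedup_with_clusters(reads)
for representative_id, read_id in clusters:
    print(f"{representative_id},{read_id}")
```

In this toy input, r3 is dropped from the kept reads and listed under r1, while r1 and r2 are paired with themselves and input order is preserved.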
Thanks
Following up on the point from @charlesfoster and on issue #528: fastp is very useful, and I like the option to do deduplication.
I tried the --failed_out option to see whether the filtered reads it writes include the duplicates. It turns out that it only outputs reads that were filtered for other reasons (too short, too many ambiguous bases, etc.). The output file does not contain the duplicated reads.
For my use case with shotgun metagenomic data, I am not so interested in the number of sequence clusters, but I would be happy if I could see which reads are duplicated. Then I can decide whether that would affect downstream analyses.
Of course, I can retrieve the duplicate reads from the raw data by identifying the reads that are missing from the clean data and extracting them with seqtk.
Or I can run the entire dataset through VSEARCH and generate the clusters of sequences that way.
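The first workaround (finding the reads that are missing from the clean data) could look roughly like the sketch below. It rests on several assumptions: plain 4-line FASTQ records, read IDs taken as the first whitespace-delimited token of each header, and, as observed above, a --failed_out file that holds every removed read except the duplicates. The function names are hypothetical.

```python
from typing import Iterable, List


def fastq_ids(lines: Iterable[str]) -> List[str]:
    """Collect read IDs from 4-line FASTQ records (no wrapped sequences)."""
    records = [line for line in lines if line.strip()]  # ignore blank lines
    ids = []
    for i, line in enumerate(records):
        if i % 4 == 0:  # header line of each record
            ids.append(line.strip()[1:].split()[0])  # drop '@', keep first token
    return ids


def unaccounted_ids(raw_ids: Iterable[str],
                    clean_ids: Iterable[str],
                    failed_ids: Iterable[str]) -> List[str]:
    """IDs present in the raw data but in neither the clean nor the failed output."""
    accounted = set(clean_ids) | set(failed_ids)
    return [rid for rid in raw_ids if rid not in accounted]


# Toy data standing in for the raw input, the clean output and the
# --failed_out file; real files would be read with open() or gzip.open().
raw = ["@r1 lane1", "ACGT", "+", "IIII",
       "@r2 lane1", "ACGT", "+", "IIII",
       "@r3 lane1", "AC", "+", "II"]
clean = ["@r1 lane1", "ACGT", "+", "IIII"]
failed = ["@r3 lane1", "AC", "+", "II"]

duplicates = unaccounted_ids(fastq_ids(raw), fastq_ids(clean), fastq_ids(failed))
print("\n".join(duplicates))
```

Written to a file with one ID per line, the resulting list can be passed to `seqtk subseq raw.fastq ids.txt` to pull the presumed duplicates out of the raw data. This only tells you which reads were removed as duplicates, not which representative each one belongs to.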