A collection of useful sequencing utilities.
FASTQ
BAM
VCF
Install using nim c sc.nim
As I have done in the past, I intend to develop seq-kit
over a period of months or years. In general, this means I'll add a new tool or utility when it becomes apparent that I need it.
However, I am open to PRs, requests, and feedback. Please let me know what you think.
I intend to port some commands over from VCF-kit, but will broaden the applicability to FASTQs, BAMs, and perhaps even other items.
The fq-dedup
command de-duplicates a FASTQ by read ID (e.g. @@D00446:1:140101_HWI-D00446_0001_C8HN4ANXX:8:2210:1238:2018
). Ideally, this should never happen, and I honestly do not know how it happens.
The command uses a Bloom filter to identify duplicates, and has to read through the file twice. If no duplicates are found during the first pass, the command will return 0 and print "No Duplicates Found" to stderr
. If you are
checking a set of FASTQs to remove duplicates, you can use the following bash to handle cases where duplicates are
not found.
(sc fq-dedup myfastq.fq.gz 2> dup.err) | gzip > out.fq.gz
if [[ `head -n 1 dup.err` == "No Duplicates Found" ]]; then
# Handle case where duplicates are not found (probably by moving file)
mv myout.fq.gz out.dedup.fq.gz
fi
fq-dedup can read both .gz
and raw text. It sends the deduplicated FASTQ to stdout.
Be sure to use -d:release
compiled binaries with this command otherwise its really slow.
Once complete, the following is sent to stderr:
total_reads: 2500000
duplicates 1086043
false-positive: 0
false-positive-rate: 0.0
The false-positive values are for diagnostics only based on reads initially labeled as duplicates by the bloom filter that were later found not to be.
Benchmark
2.5M Reads; 1M+ duplicates; 2015 MacBook Pro
0m58.738s
Count the number of reads in a FASTQ
Scenario: You are given an old dusty hard drive packed with sequence data. Your collaborator says "We have some great sequencing data here, if only someone could analyze it." You peek at the filesystem and discover that FASTQs have been renamed, removing crucial information about how they were generated. Your collaborator, however, recalls certain details about which data was sequenced on which sequencer and he has a list of sequencing barcodes and associated samples that you can match on." If only there was a way to determine the barcodes, sequencer, or other essential metadata for each FASTQ...
If this scenerio sounds familiar, fq-meta
™ is for you. It samples the first few sequences of a FASTQ and outputs a summary for every FASTQ including:
- A best guess as to the type of sequencer based on instrument and flowcell information
- A best guess for the quality score format
- The barcode/index used to multiplex the sample which can be useful for identifying samples
- The machine, run, and other metadata about the FASTQ
- Information derived from the filename
- The file location
Data is output in a tab-delimited format so that this tool can be used to assemble a database of FASTQs and their associated information. You can do so like this:
sc fq-meta --header > my_fq_database.txt # Use this to output just the variable names
sc fq-meta *.fq.gz >> my_fq_database.txt
The resulting dataset can be combined with other metadata and filtered to select samples for processing in a pipeline.
You can also parallelize the operation with GNU-parallel.
sc fq-meta --header > my_fq_database.txt # Use this to output just the variable names
parallel -j 8 sc fq-meta ::: find . -name '*.fq.gz' >> my_fq_database.txt
Example output``
machine | sequencer | prob_sequencer | flowcell | flowcell_description | run | lane | sequence_id | index1 | index2 | qual_format | qual_phred | qual_multiple | min_qual | max_qual | n_lines | basename | absolute_path |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
D00446 | HiSeq2000/2500 | high:machine+flowcell | C8HN4ANXX | High Output (8-lane) v4 flow cell | 1 | 8 | GCTCGGTA | Sanger;Illumina 1.8+ | Phred+33 | TRUE | 14 | 14 | 1 | illumina_2000_2500.fq | … | ||
K00100 | HiSeq3000/4000 | high:machine+flowcell | H300JBBXX | (8-lane) v1 flow cell | 33 | 6 | GCCAAT | Sanger;Illumina 1.8+ | Phred+33 | TRUE | 14 | 14 | 1 | illumina_3000_4000.fq | … | ||
D00209 | HiSeq2000/2500 | high:machine+flowcell | CACDKANXX | High Output (8-lane) v4 flow cell | 258 | 6 | CGCAGTT | Sanger;Illumina 1.8+ | Phred+33 | TRUE | 0 | 37 | 1 | illumina_6.fq | … | ||
D00209 | HiSeq2000/2500 | high:machine+flowcell | CACDKANXX | High Output (8-lane) v4 flow cell | 258 | 6 | GAGCAAG | Sanger;Illumina 1.8+ | Phred+33 | TRUE | 0 | 37 | 1 | illumina_7.fq | … |
fq-meta
accepts both gzipped FASTQs (.fq.gz
, .fastq.gz
~ inferred from .gz
extension) and raw text FASTQs.
sc fq-meta --header > fastq_db.tsv # Prints just the header
sc fq-meta sample_1_R1.fq.gz sample_1_R2.fq.gz sample_2_R1.fq.gz sample_2_R2.fq.gz >> fastq_db.tsv
Calculate the insert-size of a bam or a set of bams. Bams are estimated by evaluating up to the 99.5th percentile of read insert-sizes. This gives numbers that are very close to Picard a lot faster.
sc insert-size --header input.bam # One bam
sc insert-size --header *.bam # Multiple bams
Options
--verbose
Output information about progress.--dist
Output the frequency distribution of insert sizes.
Output
median | mean | std_dev | min | percentile_99.5 | max_all | n_reads | n_accept | n_use | sample |
---|---|---|---|---|---|---|---|---|---|
179 | 176.5 | 63.954 | 38 | 358 | 359 | 237 | 101 | 100 | AB1 |
Calculate insert-size metrics on a set of bams.
Convert a VCF to JSON. This is most useful in the context of a web-service, where you can serve variant data for the purposes of visualization, browsing, etc. For an example of what this might look like, see the elegans variation variant browser.
The json
command can be used to parse ANN
columns (effect annotations) by specifying --annotation
. Additionally, FORMAT fields (GT, DP, etc) can be zipped with sample names.
Usage:
sc json [options] vcf region
Arguments:
vcf VCF to convert to JSON
region Region
Options:
-i, --info=INFO comma-delimited INFO fields; Use 'ALL' for everything
-f, --format=FORMAT comma-delimited FORMAT fields; Use 'ALL' for everything
-s, --samples=SAMPLES Set Samples (default: ALL)
-p, --pretty Prettify result
-a, --array Output as a JSON array instead of ind. JSON lines
-z, --zip Zip sample names with FORMAT fields (e.g. {'sample1': 25, 'sample2': 34})
-n, --annotation Parse ANN Fields
--pass Only output variants where FILTER=PASS
--debug Debug
-h, --help Show this help
sc json --format=GT --zip --pretty tests/data/test.vcf.gz
Outputs:
{
"CHROM": "I",
"POS": 41947,
"ID": ".",
"REF": "A",
"ALT": [
"T"
],
"QUAL": 999.0,
"FILTER": [
"PASS"
],
"FORMAT": {
"GT": {
"AB1": [
2,
2
],
"...": [
0,
0
]
}
}
}
You can also specify custom SGT
and TGT
output formats which transform GT
fields from their
typical output.
GT
- Outputs genotypes as [[0, 0], [0, 1], [1, 1], ...SGT
- Outputs genotypes as0/0
,0/1
,1/1
, ...TGT
- Outputs genotypes asT/T
,A/T
,A/A
, ...
nim c --cpu:i386 --os:linux --threads:on --compileOnly sc.nim