-
Notifications
You must be signed in to change notification settings - Fork 196
Long‐read RNA‐seq with pre‐existing pangenome
This tutorial supplements "Transcriptomic analyses" by providing complete steps for long-read RNA-seq (e.g. Kinnex) quantification with a pan-transcriptome made from a .gbz
pangenome.
All example files here are assumed to be in the working directory. Input files provided by the user are called graph.gbz
, annotations.ref.gff3
, and reads.fastq
.
Install vg
following its README. Version 1.63.0 or later is required for long-read mapping. Also install rpvg
following its README.
If you don't want to follow the instructions, you can get `rpvg` for Linux x86\_64 with a shell script.
if ! which rpvg ; then
if [[ "$(uname -sm)" == "Linux x86_64" ]] ; then
wget https://github.com/jonassibbesen/rpvg/releases/download/v1.0/rpvg-v1.0_linux.tar.gz
tar -xvf rpvg-v1.0_linux.tar.gz rpvg-v1.0_linux/bin/rpvg
mv rpvg-v1.0_linux/bin/rpvg "$(dirname "$(which vg)")"
rmdir rpvg-v1.0_linux/bin
rmdir rpvg-v1.0_linux
rm rpvg-v1.0_linux.tar.gz
fi
# Otherwise follow the actual instructions
fi
A spliced pangenome simply adds splice junction edges between exons and thus is reference-based. By contrast, for a pan-transcriptome, splice junctions are projected onto the non-reference haplotypes. Thus, a pan-transcriptome has paths for all haplotype transcripts.
Some considerations for the annotation file:
- Chromosome names in the annotations must match the pangenome’s reference haplotype’s contigs. Check the graph’s chromosome names with
vg paths --list --reference-paths -x graph.gbz
. Some paths in the graph might have modified PanSN (sample#haplotype#contig#offset
) names; it’s OK if the graph has names with offsets and the annotations don’t.. - Choose
--transcript-tag
to be the tag that connects exons to their parent transcript. Make sure it distinguishes between chrX/Y annotations for pseudoautosomal genes. - Annotations must correspond to the correct reference genome. For example, CHM13’s first version lacked a Y chromosome, and graphs built from it often use GRCh38’s Y, so the chrY genes in a CHM13 v2 annotation file may need to be swapped out for GRCh38’s chrY annotations.
The “Construction” section of the "Transcriptomic analyses" page has an alternative more efficient three-step process for pan-transcriptome creation. This process needs a VCF file (which is used for GBWT creation). It cannot be used when starting from a .gbz
pangenome, which already includes a GBWT.
Graph creation requires large amounts of time and memory. With a HPRC v1.1 pangenome (90 haplotypes, 3.4G GBZ), 10k valid transcripts on CHM13, and 16 cores, it takes two hours and 140GB even without projection. With projection it was still running after a week. With pre-made annotations for other haplotypes (without projection), and a total of 13 million transcripts, it took two hours and 230GB.
If you don't want to use the full-size graph, you can use a small test graph included with vg to follow along.
vg gbwt --gbz-format -g graph.gbz -G longread/graph.gfa
cp longread/annotations.gff3 ./annotations.ref.gff3
cp longread/annotations.gff3 ./annotations.sample1.gff3
cp longread/annotations.gff3 ./annotations.sample2.gff3
cp longread/kinnex.fq ./reads.fastq
If annotations are only available for the reference haplotype, they can be projected onto the pangenome’s other haplotypes via the --proj-embed-paths
option:
vg rna --progress --proj-embed-paths --gbz-format graph.gbz \
--transcripts annotations.ref.gff3 --transcript-tag Parent \
--write-info info.tsv --gbwt-bidirectional --write-gbwt transcripts.gbwt > pantran.pg
Make sure the contig names in the first column of your GFF3 match the contig names in your graph. Check your available reference and generic paths with:
vg paths --list --reference-paths --generic-paths -x graph.gbz
In particular, if your target paths are in the graph in PanSN format, as something like CHM13#0#chr1
, but your GFF3 just calls for chr1
, it won't work.
If annotations are available for all haplotypes, such as for HPRC v1.1, simply use multiple --transcripts/-n
options. All annotation files must have the same --transcript-tag
. Disable projection via --use-hap-ref
. This option is significantly more efficient.
vg rna --progress --use-hap-ref --gbz-format graph.gbz \
--transcripts annotations.ref.gff3 --transcript-tag Parent \
-n annotations.sample1.gff3 -n annotations.sample2.gff3 \
`# ...more sample annotations...` \
--write-info info.tsv --gbwt-bidirectional --write-gbwt transcripts.gbwt > pantran.pg
Again, the paths in the GFF3 files need to match the names of the paths in the graph. If your annotations for, say, sample NA12878
are not working, check what the graph calls that sample's paths with:
vg paths --list --sample NA12878 -x graph.gbz
Then make sure the corresponding GFF3 file matches those path names.
vg giraffe
requires a lot of indexes, which it can make for itself. Creating indexes in a step separate from read alignment has the major advantage of breaking up the job into self-contained pieces. To do this, use the vg autoindex
command, which pre-creates indexes. This example uses a read file shipped with vg
to create the indexes.
# Build .gbz from .gbwt and .pg
vg gbwt -x pantran.pg --gbz-format -g pantran.gbz transcripts.gbwt
# Build indexes
vg autoindex --workflow lr-giraffe --prefix pantran --gbz pantran.gbz
# Use preset "-b hifi" for Kinnex
vg giraffe --progress -b hifi --gbz-name pantran.gbz \
--fastq-in reads.fastq --mapq-score-scale 0.01 > giraffe.gam
The resulting alignments may be QC’ed by vg stats -a giraffe.gam
. Note the --mapq-score-scale 0.01
option. Long reads’ scores are penalized by vg giraffe
’s mapping quality logic using this scaling factor, which must be set to bring the alignment scores down to an appropriate range for the mapping quality heuristics. The value used here empirically moves the median MAPQ of Kinnex reads from ~10 to 60.
The GAMs produced by vg giraffe
have metadata incompatible with rpvg. Either use the linked PR, or remove metadata with vg convert --gam-to-gaf
and --gaf-to-gam
.
# Sanitize the GAM
vg convert --gam-to-gaf giraffe.gam pantran.pg > giraffe.gaf
vg convert --gaf-to-gam giraffe.gaf pantran.pg > giraffe.gaf.gam
The rpvg
run requires a .xg
version of the graph, so make one form the .pg
file:
# rpvg requires .xg, not the .pg or .gbz from previous steps
vg convert --xg-out pantran.pg > pantran.xg
This step requires a GBWT with transcripts; use the one from vg rna
. Additionally, as the vg giraffe -b hifi
command did not use quality-adjusted scores, use --score-not-qual
.
# Don't try this part unless you actually installed rpvg
if ! which rpvg ; then exit 0 ; fi
# Actually run rpvg to quantify transcripts
rpvg --long-reads --score-not-qual --single-path \
-a giraffe.gaf.gam -g pantran.xg -p transcripts.gbwt \
--path-info info.tsv -i haplotype-transcripts -o quantification
After following along, you can clean up the extraneous files.
rm info.tsv transcripts.gbwt pantran.{xg,pg,gbz,dist,longread.withzip.min,longread.zipcodes} giraffe.{gaf,gam,gaf.gam} graph.gbz annotations.{ref,sample1,sample2}.gff3 reads.fastq
rm -Rf quantification