Skip to content

Commit

Permalink
Update workflows.md
Browse files Browse the repository at this point in the history
  • Loading branch information
BonsonW committed Mar 6, 2024
1 parent d69423d commit 83ebe06
Showing 1 changed file with 34 additions and 13 deletions.
47 changes: 34 additions & 13 deletions docs/workflows.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,19 +3,9 @@
## Extract and re-basecall reads mapping to a particular genomic region

```sh
# extract region chrX:147911919-147951125
samtools view https://url/to/reads.bam chrX:147911919-147951125 | cut -f1 | sort -u > rid_list.txt

# OR

# extract regions from a BED file
samtools view https://url/to/reads.bam -L /path/to/regions.bed -M | cut -f1 | sort -u > rid_list.txt

# fetch reads corresponding to the extracted region
slow5curl get https://url/to/reads.blow5 --list rid_list.txt -o extracted.blow5

# basecall extracted BLOW5, see https://github.com/Psy-Fer/buttery-eel/ for butter-eel options
buttery-eel -i extracted.blow5 -g /path/to/ont-guppy/bin/ --config dna_r9.4.1_450bps_sup.cfg --device 'cuda:all' -o extracted_sup.fastq
buttery-eel -i extracted.blow5 -g /path/to/ont-guppy/bin/ --config dna_r9.4.1_450bps_sup.cfg --device 'cuda:all' -o extracted_sup.fastq # see https://github.com/Psy-Fer/buttery-eel/ for butter-eel options
```

Note: If the read IDs in the BAM file are not the parent IDs (happens when read splitting is enabled during the initial basecalling step), you can grab the parent read IDs from the FASTQ file as below and use that as the input the to slow5curl get.
Expand All @@ -26,10 +16,41 @@ The above assumes that `parent_read_id` tag is present in all reads including th
```sh
# for split reads, get the parent_read_id tag
grep -F -f rid_list.txt reads.fastq | sed -n -e 's/.*parent\_read\_id=//p' | awk '{print $1}' > tmp.txt

# for non split reads, get the normla read ID
grep -F -f rid_list.txt reads.fastq | grep -v "parent\_read\_id" | awk '{print $1}' | tr -d '@' >> tmp.txt

# remove duplicates
sort -u tmp.txt > parent_rid_list.txt
```

## Generate compressed TSV from BAM file to map genomic regions to reads

Below is a walkthrough on generating a TSV file from a BAM file. This is useful if you only intend on using the TSV to map genomic regions to read IDs, and want to save as much disk space as possible.

```sh
samtools view -h reads.bam -o reads.sam # convert bam to sam
paftools.js sam2paf -p reads.sam > reads.paf # convert sam to paf
cut -f1,6,8,9 reads.paf > reads.tsv # generate tsv from paf
bgzip reads.tsv # compress
tabix tsv.gz -0 -s2 -b3 -e4 # generate index
```

## Generate list of read IDs from region with compressed TSV

Note: The index of the compressed TSV must be present in its directory.

```sh
# get list of read IDs corresponding to a specific region
tabix reads.tsv.gz chrX:43,044,295-43,170,245 | cut -f 1 | sort -u > rid_list.txt

# get list of read IDs corresponding to a list of regions specified in a sorted bed file
tabix reads.tsv.gz -R regions_list.hg38.bed | cut -f 1 | sort -u > rid_list.txt

# get list of read IDs corresponding to a chromosome
tabix reads.tsv.gz chrX | cut -f 1 | sort -u > rid_list.txt
```

Note: Any list of regions used must be sorted, this can be done with:

```sh
bedtools sort -i unsorted_regions_list.hg38.bed > regions_list.hg38.bed
```

0 comments on commit 83ebe06

Please sign in to comment.