Skip to content

Commit b6862de

Browse files
committed
README update
1 parent 33d71f6 commit b6862de

3 files changed

Lines changed: 78 additions & 93 deletions

File tree

README.md

Lines changed: 77 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -4,4 +4,80 @@ Utilities to extra mutate-and-map 2D maps from big library sequencing runs
44
## TODO
55

66
- [ ] output to HDF5 file with [Nseq x Nres x Nres] tensor, instead of .txt file
7-
- [ ] test splitting into chunks.
7+
- [ ] provide M2 example
8+
- [ ] provide MOHCA example
9+
- [ ] convert MATLAB visualization steps into python and provide example notebook.
10+
11+
## Workflow
12+
13+
### Get the counts with `bam_to_m2.py`
14+
First run `bam_to_m2.py`:
15+
16+
```
17+
usage: bam_to_m2.py [-h] [-b [BAM ...]] -s SEQUENCES_FASTA [-mq MAP_QUALITY] [--mutdel_cutoff MUTDEL_CUTOFF] [--trim_cutoff TRIM_CUTOFF] [-o OUT_TAG]
18+
[-n CHUNK_SIZE] [--start_idx START_IDX] [--end_idx END_IDX]
19+
20+
Processes BAM file to create 1D and 2D mut/del profiles
21+
22+
options:
23+
-h, --help show this help message and exit
24+
-b, --bam [BAM ...] BAM-formatted text file, Bowtie2
25+
-s, --sequences_fasta, --fasta SEQUENCES_FASTA
26+
Fasta file with DNA/RNA sequences used for Bowtie2
27+
-mq, --map_quality MAP_QUALITY
28+
minimum Bowtie2 MAPQ to consider read
29+
--mutdel_cutoff MUTDEL_CUTOFF
30+
Filter for maximum number of mut/del in read (default 0 means no filter)
31+
--trim_cutoff TRIM_CUTOFF
32+
Filter for maximum number of missing terminal residues (trim) in read (default 50; 0 means no filter)
33+
-o, --out_tag, --out_file_tag OUT_TAG
34+
Tag for outfile [default: derive from first bamfile]
35+
-n, --chunk_size CHUNK_SIZE
36+
split with this number of sequences per chunk
37+
--start_idx START_IDX
38+
only do the reference sequences from start_idx onwards [default all]
39+
--end_idx END_IDX only do the reference sequences up to end_idx [default all]
40+
41+
Extremely basic processing of CIGAR string in bowtie2 BAM file to produce alignments and then coding.
42+
```
43+
44+
The output for each input bam will be 9 files, with suffixes: `counts_del_del.txt.gz counts_del_mut.txt.gz counts_del.txt.gz counts_mutdel_mutdel.txt.gz counts_mut_del.txt.gz counts_mutdel.txt.gz counts_mut_mut.txt.gz counts_mut.txt.gz coverage.txt.gz`
45+
46+
The files are:
47+
- `counts_del_del.txt.gz`, ... `counts_mutdel_mutdel.txt.gz` are 2D files. These hold matrices in plain text of integer counts. The count how many reads have events -- `del` (deletion), `mut` (mutation), or `mutdel` (either deletion or mutation) -- co-occuring at residue i and j.
48+
- These 2D files take the Nres x Nres matrix for each probed reference sequence and concatenate them. The files have Nres values per row and the number of rows is Nres x Nseq, i.e., Nres rows for each reference sequence.
49+
- `counts_del.txt.gz`,...`counts_mut.txt.gz` are 1D files. They hold number of reads in which each event occurs at position i. The files have Nres values per row, and the number of rows is Nseq, one for each sequence.
50+
- `coverage.txt.gz` hold the number of reads that contributed counts to each position of each sequence. This file has Nres values per row, and the number of rows is Nseq, one for each sequence.
51+
52+
53+
**Tips**:
54+
- Bam file should be sorted!
55+
- If you have a huge file, e.g an Ultima-generated .cram or .bam with 1B reads, you can parallelize by using `--start_idx` and `--end_idx` to select out chunks to run on different processes.
56+
- If you have a bunch of `.bam` files, e.g., in different directories from a pre-split UBR run, you can supply them all with `-b` and the outputs will be in the same directories as the bam file. You can then combine the files with `merge_m2.py` (type `merge_m2.py -h` for help).
57+
- Merging a bunch of M2 files can get memory intensive! If not using `--start_idx`/`--end_idx`, you can provide `--chunk_size 100` to get more manageable files, and then after merging the files chunk-wise across multiple directories, concatenate them with `cat [chunk1] [chunk2]`.
58+
59+
60+
### Visualize the maps
61+
Once you've got the output of `bam_to_m2.py`, you can fire up MATLAB and run:
62+
63+
```
64+
% Read in data
65+
[all_counts_3d, all_counts, all_coverage, tags] = read_m2_output( filedir, data_type );
66+
67+
% derive maps from raw counts.
68+
all_cov2d = get_all_cov2d( all_counts_3d, all_counts, all_coverage [, BLANK_OUT5, BLANK_OUT3] );
69+
70+
% show maps with highest coverage first
71+
coverage = get_coverage( all_coverage );
72+
colorscale = [-5,5];
73+
[~,sortidx] = sort(coverage,'descend');
74+
show_m2_maps( all_cov2d, sortidx, colorscale [, BLANK_OUT5, BLANK_OUT3, coverage, headers, tags] );
75+
```
76+
77+
**Tips**:
78+
- For `data_type`, use `mutdel` to capture both mutations and deletions in mutate-and-map experiments correlating SHAPE or DMS signals to intrinsic mutations and deletions or to other SHAPE/DMS events. Use `del` for MOHCA-MaP data, where it seems that only deletions carry signal.
79+
- `cov2d` maps are `P(i,j)/P(i)P(j) - 1`, i.e. the 2D signal above the background from 1D events.
80+
- An alternative is to use `get_all_m2(...)`. In that case, `colorscale = [0,0.05]` works better in `show_m2_maps`.
81+
82+
83+

bam_to_m2.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -8,7 +8,7 @@
88
import shutil
99

1010
parser = argparse.ArgumentParser(
11-
prog = 'bam_to_2d.py',
11+
prog = 'bam_to_m2.py',
1212
description = 'Processes BAM file to create 1D and 2D mut/del profiles',
1313
epilog = 'Extremely basic processing of CIGAR string in bowtie2 BAM file to produce alignments and then coding.')
1414

matlab/show_m2_maps.asv

Lines changed: 0 additions & 91 deletions
This file was deleted.

0 commit comments

Comments
 (0)