GitHub - medvedevgroup/cnfind

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 29 Commits
bamtools		bamtools
include		include
README		README
add_first_line		add_first_line
annot1_win.R		annot1_win.R
annot2_win.R		annot2_win.R
calc_purity.R		calc_purity.R
cnfind.pl		cnfind.pl
doc_walker.cpp		doc_walker.cpp
drop_first_col		drop_first_col
figures.R		figures.R
gcContent.cpp		gcContent.cpp
gem2masks.cpp		gem2masks.cpp
get_mean_sd.R		get_mean_sd.R
gpl.txt		gpl.txt
item		item
lower_resolution.cpp		lower_resolution.cpp
make_exp.cpp		make_exp.cpp
make_fasta_random.cpp		make_fasta_random.cpp
make_fasta_random_companion.py		make_fasta_random_companion.py
make_gcbins.cpp		make_gcbins.cpp
make_masked_windows.cpp		make_masked_windows.cpp
make_simple_coverage.cpp		make_simple_coverage.cpp
makefile		makefile
maps2bin.cpp		maps2bin.cpp
merge_intervals.py		merge_intervals.py
plot_doc.R		plot_doc.R
project.nt.pl		project.nt.pl
project_kidney.pl		project_kidney.pl
run_project.pl		run_project.pl
segment.R		segment.R
split_gem.cpp		split_gem.cpp

Repository files navigation

Description:
cnfind is a tool that detects Copy Number Abberations (CNAs) using read depth information.
It incorporates genome mappability information to filter the mappings, performs GC-normalization, divides the genome into windows with calculated read depths.
The genome is then segmented using the circular binary segmentation algorithm implemented in the R package DNACopy.
If a normal sample is available, it is used to assign p-values to each segment and give a set of calls.

Running:
A typical example of a run:
cnfind.pl --mode all --work_dir "~/my_output_dir" --ref_folder "~/hg19comp" --bam_files "~/mappings.bam"

If you have a project with multiple samples but a shared normal, use run_project.pl. It is an umbrella script that will generate the commands needed to run all the samples. See the top of the script for further details.

Prerequisites:
ImageMagick needs to be installed for making figures.
R needs to be installed.
You must have the latest hg19 (or hg18) companion package, which are available on cyberstar in /gpfs/cyberstar/pzm11/nobackup/pzm11/hg1?com/ or at OHSU on /home/groups/atlas/pmedvedev/hg1?comp/

Input Files:
The only data needed to run cnfind is one or multiple BAM files.
The BAM file must contain all the mappings for a given read consecutively.

Output:
In the work dir, there will be several files with the results:

all.windows:
This contains a line for every non-overlapping window of the genome. Each line contains several annotations. The most relevant are:
Masked: the number of positions in this windows that were masked, i.e. any mappings to that location are ignored and the position is not included in the calculations of expected coveage. (todo: describe which locations are masked)
Observed: the normalized number of mappings to the unmasked positions of this window. Each mapping is normalized by the number of total mappings for that read. For example, a read with a unique mapping will contribute 1 to the observed count of that location; a read with three mappings will contribute a count of 1/3 to each mapping position.
Expected: The number of reads expected to map to the unmasked locations of this window. This is calculated based on the total number of unmasked positions in the window and the expected coverage at each position (based on the GC content)
rOsEs (lrOsEs) : ratio (log ratio) of Observed to Expected
ObservedFull: the normalized number of mappings to this window, including the masked positions.
ExpectedRaw: the number of reads expected to map to this window, including masked positions.
GC: the GC content of this window (not counting the masked positions (I think?))
ObsN: number of Observed in the normal
rOnEn (lrOnEn) : ratio (log ratio) of Observed in the normal to the Expected in the normal
rOsOn (lrOsOn) : ratio (log ratio) of Observed in sample to Observed in Normal
rRsRn (lrRsRn) : ratio of Observed/Expected in sample to Observed/Expected in normal

all.segments:
This contains the results of the segmentation algorithm (DNACopy).
Every segment corresponds to a line in this file, with the following annotations:
num.mark: number of windows merged into this segment
seg.mean: the average LogRatio of the constituent windows.
call: -1 (loss), 0 (netural), or 1 (gain).
seused: the estimated standard error of the mean log ratio on a segment of this long in the normal
Wald: for internal use
pval: the probability of there being no variation in this segment
Call: -1/0/1, depending on whether pval is lower then the threshold given by the pval parameter to the program, or, if min/maxlogratio is specified, if the seg.mean is outside of the range.
Note that seused/Wald/pval are always NA if a normal was not provided using the normal_dir parameter.

all.calls
This contains the subset of the segments file where the call is non-zero and where adjacent segments with the same call are merged together

all.covplot.png
A plot of the LogRatios and the results of the segmentation

ALGORITHM

Mappability:
Locations that lie in repetitive regions complicate read depth analysis.
Therefore, cnfind masks some locations from its analysis.
A location is masked if the 76-mer starting at tha location is present somewhere else in the genome with up to two mismatches.
Read mappings that start at a masked location are not counted towards the read depth.
We use a set of masks generated by the GEM program (http://sourceforge.net/apps/mediawiki/gemlibrary/index.php?title=The_GEM_library) and
downloaded from http://xfer.curie.fr/get/fdwcA3QfU5a/GEM_mapp_hg19.tar.gz
Masking information bsed on 50-mers or 35-mers can be also used by setting the "masks" parameter to "50mer" or "35mer".
Alternate masks can be specified as described below.

Multiple mapping reads:
Read mappings that come from reads with multiple mapping locations are weighed according to the number of mappings.
For exapmle, if a read maps to three locations of which one is masked, then it contributed 1/3 to the read depth count at the unmasked locations.
It does not contribute anything to the masked location.

GC-correction:

Normalization:
There are several options for normalization.
The default option is "genome wide," under which the average read depth ratios over the genome are 1.
Under the "own_chr" option, every chromosome has an average ratio of 1.
You may also specify a concrete chromomosome to use for normalization.
That way, the average ratios over that chromosome are 1, but the normalization factor used for that chromosome is then applied to all other chromosomes.
Under the "auto" option, a chromosome is picked automatically.
The chromosome where the ratio deviate the least from one.

For every chromosome, cnfind calculates the number
Normalization is the calculation of an expected coverage as a function of GC-content.
For each chromosome, cnfind calculates the root mean squared difference between the read depth at each location and the average read depth.
The chromosome with the smallest difference is chosen

Windowing:
The genome is segmented into non-overlapping windows whose length is controlled via the win_size parameter.
The genome is scanned left to right, and any time the number of unmasked positions in the current window reaches win_size or the number of total positions reaches 1.5*win_size, a new windows is opened.

Segmentation:
The read-depth log-ratios for each window are passed into the widely used segmentation program DNACopy.
DNACopy groups the windows together so that each group represents a segment of the genome with
similar log-ratios within the contained windows and different log-ratios from the neighboring windows.

Calculation of p-values:
This step is only performed if a normal was provided using the normal_dir parameter.
The all.windows file in the normal_dir is analyzed by looking at the distribution of the LogRatio values
The standard error of this distribution is then used as a basis for calculating the probability that
a given segment in the sample has no variation.

It is assumed that coverage of the normal is the same as of the sample.
If it is lower (respectivelly, higher), then the pval given in the all.segments file will be too small (respectivelly, too large).

Calling:
If a normal was provided, then all segments with p-values less than the parameter pval are called as CNAs.
If no normal was provided, then all segments with ratio means outside the range given by minlogratio/maxlogratio are called as CNAs.

CUSTOMIZATION

If you wish to use an alternate reference genome or an alternate set of chromosomes or different mappability information,
you must modify the existing or create your own companion package.
The package must contain the following directories:

1. fasta_files_folder/: Fasta files for reference sequences (e.g. chr1.fa, chr2.fa, ...)
2. allchr.txt: A short text file containing the names of all the reference sequences (chromosomes) of the organism.
3. autosomes.txt: A short text file containing the names of all the DIPLOID reference sequences (chromosomes) of the organism.
4. gem_mappability/out76/: Masking files for each chromosome (i.e. chr1.mask, chr2.mask,...). Each masking file contains a single line of length equal to the length of the chromosome. The line contains a binary string, where "1" indicates that the position is to be masked, and "0" indicates it should be used.

QUESTIONS?
For any questions/issues/bugs, please contact cnfindATcse.psu.edu.
In order for us to help you, please include
1. A long recursive listing (including time stamps and file-sizes) of the cnfind work_dir
2. A long recursive listing of the ref_folder (where the companion package is)
3. A long recursive listing of directory with the read mappings
4. If using multiple BAMs, then contents of bam_files.
5. The first 1000 lines of at least one of the BAM files, including the header.
6. The log.txt file from the directory in which cnfind was invoked.
7. The stdout and stderr output of the cnfind run.