Workflow: Convert CONTROL-freec output to GISTIC2
The presented workflow converts the output generated by CONTROL-freec to GISTIC2 to be able to find high confident somatic copy-number alterations.
I found this resource which was useful to get the correct input format for GISTIC2
https://www.biostars.org/p/426635/
From CONTROL-freec the following two files are needed: PREFIX _ratio.txt and PREFEX sample.cpn
Additionally, CONTROL-freec provides a script to convert output files to BED format (freec2bed.pl; https://github.com/BoevaLab/FREEC/tree/master/scripts/freec2bed.pl)
The ratio file contains information about ratios and predicted copy number alterations for each window. The cpn file contains information about the number of reads per called window.
Furthermore, a file with chromosome name and length is needed (e.g. ref_hg19.genome).
Run the freec2bed.pl script on the ratio file, e.g. freec2bed -f Lx002031.bam_ratio.txt > Lx002031.freec_segments.bed
If cohorts are analyzed, the following lines of code loop through a directory (assumption: all files are stored in the same directory)
INPUTRATIO=/some/path/to/freec/output/files
for file in ls $INPUTRATIO*bam_ratio.txt; do
echo $file
OUTFILE=$(echo $file | sed -e "s/bam_ratio.txt/freec_segments.bed/g")
freec2bed.pl -f $file > $OUTFILE
done
Files are joined and the ratio is transformed applying log2(ratio)-1 (according to https://www.biostars.org/p/426635/)
library(tidyverse)
library(GenomicRanges)
# list of files with pattern matching
files <- list.files(path = "../data/lymphome_WXS/cnvs", pattern="freec_segments.bed", full.names=TRUE)
segData <- NULL
for( i in files ) {
print(i)
tab <- read.table(i)
id <- gsub(pattern = "../data/lymphome_WXS/cnvs.", replacement = "", i)
id <- gsub(pattern = ".freec_segments.bed", replacement = "", id)
tab$Sample <- id
tab$Dummy <- 'NA'
tab <- tab[, c("Sample", "V1", "V2", "V3", "Dummy", "V4")]
# convert ratio to log2(ratio)-1
tab$V4 <- log2(tab$V4) - 1
segData <- rbind(segData, tab)
}
Next, chromosomes X and Y are removed and columns are renamed.
segData <- segData %>%
dplyr::filter( V1 != 'chrX' & V1 != 'chrY') %>%
dplyr::rename( Chromosome = V1, Start = V2, End = V3, log2ratio = V4 )
segData$Chromosome <- gsub(pattern = "chr", replacement = "", segData$Chromosome)
Retrieve coverage information:
files <- list.files(path = "../data/lymphome_WXS/cnvs", pattern="bam_sample.cpn", full.names=TRUE)
cpnData <- NULL
for( i in files ) {
print(i)
tab <- read.table(i)
id <- gsub(pattern = "../data/lymphome_WXS/cnvs.", replacement = "", i)
id <- gsub(pattern = ".bam_sample.cpn", replacement = "", id)
tab$Sample <- id
tab <- tab[, c("Sample", "V1", "V2", "V3", "V4", "V5")]
cpnData <- rbind(cpnData, tab)
}
colnames(cpnData) <- c("Sample", "Chromosome", "Start", "End", "Number", "Range")
Combine coverage and ratios
segData_num <- NULL
for ( i in unique( cpnData$Sample ) ) {
print(i)
cpn.s <- cpnData %>% dplyr::filter( Sample == i ) %>% dplyr::select( -Sample ) %>% GRanges
seg.s <- segData %>% dplyr::filter( Sample == i ) %>% dplyr::select( -Sample ) %>% dplyr::rename(Num_Probes=Dummy) %>% GRanges
# sum features per interval
for ( j in 1:nrow( data.frame(seg.s) ) ) {
Num_Probes <- sum( data.frame( cpn.s[ c(data.frame( findOverlaps(seg.s[j,], cpn.s) )$subjectHits), ] )$Number )
mcols( seg.s[j, ] )$Num_Probes <- Num_Probes
}
seg.s <- data.frame(seg.s)
seg.s$Sample <- i
segData_num <- rbind(segData_num, seg.s)
}
segData_num <- segData_num %>%
dplyr::select( Sample, seqnames, start, end, Num_Probes, log2ratio)
Finally, save all files
write.table(x = segData, file = "gistic.segments.txt", sep = "\t", row.names = F, col.names = F, quote = F)
write.table(x = segData_num, file = "gistic.segments.numProbes.txt", sep = "\t", row.names = F, col.names = F, quote = F)
If necessary, common CNAs can be removed as described below. Here, a panel of common mutations for hg19 was retrieved from a Broad Institute panel of normals (ftp://ftp.broadinstitute.org/pub/GISTIC2.0/hg19_support/)
common.cnvs <- GRanges( read.delim( file = "../data/CNV.hg19.bypos.111213.txt" ) )
# get index of all CNVs without a hit in common.cnv
#subsetByOverlaps( GRanges(cnvdata[4, 1:4]), common.cnvs )
idx <- which( countOverlaps( GRanges(segData[, -1]), common.cnvs ) == 0 )
segData <- segData[idx, ]
rownames(segData) <- NULL
The second line in the loop before can be skipped by adjusting the R code above. Anyhow, it does not hurt.
INPUT_FREEC=gistic.segments.numProbes.txt
REF_IDX=ref_hg19.genome
WORK_DIR=gistic_input
mkdir -p $WORK_DIR
for sample in $(cut -f1 $INPUT_FREEC | uniq); do
echo $sample
grep "$sample\t" $INPUT_FREEC |cut -f2-6|sed 's/^/chr/' > $WORK_DIR/$sample.bed
bedtools sort -i $WORK_DIR/$sample.bed -faidx $REF_IDX > $WORK_DIR/$sample.sorted.bed
bedtools complement -i $WORK_DIR/$sample.sorted.bed -g $REF_IDX > $WORK_DIR/$sample.complement.bed
awk '{print $0, "\tNA\t0"}' $WORK_DIR/$sample.complement.bed > $WORK_DIR/$sample.CN2.bed
cat $WORK_DIR/$sample.sorted.bed $WORK_DIR/$sample.CN2.bed > $WORK_DIR/$sample.whole.bed
bedtools sort -i $WORK_DIR/$sample.whole.bed -faidx $REF_IDX > $WORK_DIR/$sample.whole.sorted.bed
awk -v sample=$sample '{print sample"\t" $0}' $WORK_DIR/$sample.whole.sorted.bed > $WORK_DIR/$sample.whole.sorted.seg
done
cd $WORK_DIR
find . -name "*.whole.sorted.seg" -exec cat {} \; > ../samples.seg
cd ..
The file samples.seg is the final input for GISTIC2. Basically, we are done here
GISTIC2 uses a couple of MATLAB scripts. A docker image and install instructions are provided here
https://hub.docker.com/r/shixiangwang/gistic
Basically, just run docker pull shixiangwang/gistic
to install the image.
To run the image use
sudo docker run -it shixiangwang/gistic /bin/bash
change the working directory and create the input and ouput directory
cd /opt/GISTIC/
mkdir indat
mkdir gistic_out
We have to transfer our samples.seg file to the docker image. Therefor, we have to retrieve the container id for the running image and copy the file (here, the ID is 16d57225d045 but it depends on your local environment).
docker container ls
docker cp samples.seg 16d57225d045:/opt/GISTIC/indat/
Now, we can run GISTIC2
./gistic2 -b gistic_out -seg indat/samples.seg -refgene refgenefiles/hg19.mat
and copy the result files (from outside the docker image)
docker cp 16d57225d045:/opt/GISTIC/gistic_out ./
DONE