Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RNAseq analysis #5

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Binary file added analysis.MS/Thumbs.db
Binary file not shown.
Binary file added analysis.MS/bowtie.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
17,486 changes: 17,486 additions & 0 deletions analysis.MS/countFiles/ERR990557s.count.txt

Large diffs are not rendered by default.

17,486 changes: 17,486 additions & 0 deletions analysis.MS/countFiles/ERR990558s.count.txt

Large diffs are not rendered by default.

17,486 changes: 17,486 additions & 0 deletions analysis.MS/countFiles/ERR990559s.count.txt

Large diffs are not rendered by default.

17,486 changes: 17,486 additions & 0 deletions analysis.MS/countFiles/ERR990560s.count.txt

Large diffs are not rendered by default.

Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
55 changes: 55 additions & 0 deletions analysis.MS/report
Original file line number Diff line number Diff line change
@@ -0,0 +1,55 @@
1.Download manually the 4 datasets
ERR990557
ERR990558
ERR990559
ERR990560

2.Download fastq files with a bash script downloadFastq.sh
To run it $ bash downloadFastq.sh repertoryOfDatasets

3.Unzip manually fastq files
ERR990557.fastq
ERR990558.fastq
ERR990559.fastq
ERR990560.fastq

4.Select the first 8 000 000 reads in each fastq files with a python script selectReads.py
Creation of sample files
ERR990557s.fastq
ERR990558s.fastq
ERR990559s.fastq
ERR990560s.fastq
The principal fonction fonction selectReadForAllFiles takes the list of fastq files and the number of reads wanted in each sample files

5.Align reads datasets to reference genomes with bowtie
Prerequisite : download Drosophila melanogaster index and unzip it
Use default parameters
To run bowtie
$ ./bowtie -S d_melanogaster_fb5_22 fastqFile >outputFile
Creation of sam files
ERR990557s.bowtie.sam
ERR990558s.bowtie.sam
ERR990559s.bowtie.sam
ERR990560s.bowtie.sam

6.Count reads aligning to genome's genes with HTSeq
Prerequisite : download and unzip gtf file of Drosophila melanogaster
To run HTSeq
$ python -m HTSeq.scripts.count samFile gtfFile >outputFile
Creation of 4 files, folder countFiles with theses files given
ERR990557s.count.txt
ERR990558s.count.txt
ERR990559s.count.txt
ERR990560s.count.txt

7.Differential analysis with DESeq2 package on R
Script differentialAnalysis.R
Graphs reporting analysis in folder differentialAnalysisGraphs
file with differentially expressed gene resultOfDifferantialAnalysis.txt







10 changes: 10 additions & 0 deletions analysis.MS/resultOfDifferantialAnalysis.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
baseMean log2FoldChange lfcSE stat pvalue padj
FBgn0003892 273.650858940795 4.22489167801591 0.748709288961954 5.64290004184869 1.67209434203519e-08 3.04878535031083e-05
FBgn0003961 718.730743284043 3.68326480073617 0.722672951426683 5.0967243114119 3.45581026526622e-07 0.000472582053775155
FBgn0031523 148.98239886426 3.2679796419709 0.742188225063946 4.40316826865495 1.06681357069062e-05 0.00680675265212847
FBgn0032049 241.050037802806 5.13852162101445 0.668717526285134 7.68414378124649 1.54023606280711e-14 8.42509126355492e-11
FBgn0032235 73.7391147520904 3.73481514906243 0.754057228824972 4.95295981033468 7.30930754341963e-07 0.000799638245250107
FBgn0032505 107.722750765249 4.08443972447685 0.710725067842802 5.74686318139019 9.09142342257462e-09 2.48650430607416e-05
FBgn0032913 702.593594885596 3.12529718041409 0.71148858967721 4.39261743021345 1.11994102137397e-05 0.00680675265212847
FBgn0033354 143.433506890372 3.41694304832748 0.737368530292016 4.63396918630944 3.5872041293664e-06 0.00327033443127237
FBgn0035343 225.528304796664 3.32562295524925 0.737523134797493 4.50917781197792 6.50793470713763e-06 0.00508548612114897
48 changes: 48 additions & 0 deletions analysis.MS/scripts/differentialAnalysis.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,48 @@
#Rscript for differential analysis

library(DESeq2)

# DESeqDataSet object creation
dir="/home/mariam/Documents/countFiles"
HTSeqCountFiles = c("ERR990557s.count.txt", "ERR990558s.count.txt", "ERR990559s.count.txt", "ERR990560s.count.txt")
stages = c("1", "2","2","2")
samples = c("ERR990557", "ERR990558", "ERR990559", "ERR990560")
samplesTable = data.frame(sampleName=samples, fileName=HTSeqCountFiles, condition=stages)
dds = DESeqDataSetFromHTSeqCount(sampleTable=samplesTable, directory=dir, design= ~ condition)

# normalization of counts

# calculation of sizeFactors
dds = estimateSizeFactors(dds)
sizeFactors(dds)

# bar plots
barplot(colSums(counts(dds)),main='number of reads mapped on genes for each experiment')

# boxplots
boxplot(log2(counts(dds)+1),main='Barplot which represent logfc dispersion for each experiment\nbefore normalization')
boxplot(log2(counts(dds,normalized=TRUE)+1),main='Barplot which represent logfc dispersion for each experiment\nafter normalization')

# variance estimations
dds = estimateDispersions(dds)
plotDispEsts(dds,main='Representation of variance')

# differential analysis
dds = nbinomWaldTest(dds)
res = results(dds)

# diagnostic plots

# MA plot
plotMA(res,main='MA plot for differential analysis')

# Histogram of the p values
hist(res$pvalue,main='distribution of p-values',breaks=20,col="grey")
hist(res$padj,main='distribution of adjusted p-values',breaks=20,col="blue")

# selection of the genes with an adjusted p-value<0.01
ind=which(res$padj<0.01)
res=as.matrix(res)
print(res)
res = res[ind,]
write.table(res, "resultOfDifferentialAnalysis.txt", row.name=T, quote=F, sep='\t')
24 changes: 24 additions & 0 deletions analysis.MS/scripts/downloadFastq.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,24 @@
export repertory=$1

files=$(ls $repertory)
echo $files
cd $repertory
chaine="gz"
see='nothing'

for file in $files;
do
for line in $(<$file)
do
IFS='/' read a b c d e f<<< "$line"
IFS='.' read g h i<<< "$f"
if [ "$i" == "$chaine" ] && [ "$see" != "$line" ]
then
echo $i
wget $line
see=$line

fi

done
done
44 changes: 44 additions & 0 deletions analysis.MS/scripts/selectReads.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,44 @@
# -*- coding: utf-8 -*-
"""
Created on Wed Sep 27 17:40:23 2017

@author: mariam
"""
import re

def seletReads(fileRead,number,fileWrite,name):
count=0
fi=open(fileRead,'r')
fiWrite=open(fileWrite,'w')
line=fi.readline()
while(1):

if line[0]=='+' and len(line)==2:
count+=1
if count==number:
fiWrite.write(line)
line=fi.readline()
fiWrite.write(line)
break;
fiWrite.write(line)
line=fi.readline()
fi.close()
fiWrite.close()
return

def selectReadForAllFiles(filesNames,number):
for i in filesNames:
print i
fileFin=i.split('/')
fileFin=fileFin[len(fileFin)-1]
fileFin=fileFin.split('.')
name=fileFin[0]
fileFin[0]=fileFin[0]+'s'
fileFin='.'.join(fileFin)
seletReads(i,number,fileFin,name)
return

files=['/home/mariam/Documents/datasets/ERR990557.fastq','/home/mariam/Documents/datasets/ERR990558.fastq','/home/mariam/Documents/datasets/ERR990559.fastq','/home/mariam/Documents/datasets/ERR990560.fastq']
selectReadForAllFiles(files,8000000)