Samples were processed with a set of bash commands to convert the 382 raw comma-separated (.csv) files into tab-separated VCF files.
- Using the script: csv_to_vcf.sh
code goes like this:
mkdir AmphoraChallenge
cd AmphoraChallenge
ln -s ~/DataEngineering-GenomicsChallenge-Jun2022/Challenge/Challenge\ Samples/Challenge\ Samples/* .
#This could also be done in R or Python, but plain Bash, sed and AWK handle it without importing the data into R or Python before the clustering or VCF analysis.
for file in *.csv ;
do
#sed syntax can get confusing; several short sed commands are clearer than one long command with every change
sed -i 's/,REF/#CHROM;POS,REF/' $file ;
sed -i 's/,ALT,ALT/,ALT/' $file ;
sed -i 's/;/\t/' $file ;
sed -i 's/,/\t/' $file ;
sed -i 's/,/|/4g' $file ;
sed -i 's/"//' $file ;
sed -i 's/"//' $file ;
sed -i 's/,/\t/' $file ;
sed -i 's/,/\t/2g' $file ;
sed -i 's/ALT,/ALT\t/' $file ;
#We are still missing the ID, QUAL, FILTER, INFO and FORMAT columns; we can add them with AWK
awk 'BEGIN{ FS=OFS="\t" } {$2 = $2 FS (NR==1? "ID" : ".") }1' $file > tmp && mv tmp $file ;
awk 'BEGIN{ FS=OFS="\t" } {$5 = $5 FS (NR==1? "QUAL": ".") }1' $file > tmp && mv tmp $file ;
awk 'BEGIN{ FS=OFS="\t" } {$6 = $6 FS (NR==1? "FILTER" : "PASS") }1' $file > tmp && mv tmp $file ;
awk 'BEGIN{ FS=OFS="\t" } {$7 = $7 FS (NR==1? "INFO" : "AC=1;AN=2") }1' $file > tmp && mv tmp $file ;
awk 'BEGIN{ FS=OFS="\t" } {$8 = $8 FS (NR==1? "FORMAT" : "GT") }1' $file > tmp && mv tmp $file ;
#head is here just to check the structure in the log output
head $file
#The following lines add header metadata to keep the VCF format consistent
sed -i 1i"##" $file
sed -i 1i"##File was a .csv raw data, transformed to a .vcf file" $file
sed -i 1i"##fileformat=VCFv4.1" $file ;
done 1> CSVtoVCF_logout.txt 2> CSVtoVCF_error.txt
for file in *.csv ;
do
mv -- "$file" "${file%.csv}.vcf"
done
- The code is in: VCF_Process.R
To process the data I had to read a lot of online guides 🥴 and watch many videos.
Most of the help came from the following resources and R packages:
- https://grunwaldlab.github.io/Population_Genetics_in_R/index.html
- https://github.com/cmcouto-silva?tab=repositories
- https://www.bioconductor.org/packages/devel/bioc/vignettes/SNPRelate/inst/doc/SNPRelate.html#overview
- data.table
- vcfR
- SNPRelate
- dplyr
- poppr
- ape
- RColorBrewer
- tidyr
- parallel
- SeqArray
- ggplot2
- The script also produces the PNG graphs and the merged_genotype.vcf file with all samples.
For the code to work it must be run in the same folder as the files to be parsed into the large merged_genotype.vcf, together with the PCA-vcfa-data.rda file.
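A minimal sketch of the kind of workflow VCF_Process.R runs (reading a converted VCF with vcfR, converting the merged file to GDS with SNPRelate and running a PCA); the file names and parameters here are illustrative assumptions, not the exact ones in the script:
# Illustrative sketch only -- not the actual VCF_Process.R code
library(vcfR)       # read individual VCF files
library(SNPRelate)  # GDS conversion and PCA
# Quick check of one converted sample (file name is an assumption)
vcf <- read.vcfR("sample_001.vcf", verbose = FALSE)
head(getFIX(vcf))
# Convert the merged multi-sample VCF to GDS and run a PCA
snpgdsVCF2GDS("merged_genotype.vcf", "merged_genotype.gds", method = "biallelic.only")
genofile <- snpgdsOpen("merged_genotype.gds")
pca <- snpgdsPCA(genofile, autosome.only = FALSE)
# First two principal components per sample, ready for plotting
pc_df <- data.frame(sample = pca$sample.id,
                    PC1 = pca$eigenvect[, 1],
                    PC2 = pca$eigenvect[, 2])
plot(pc_df$PC1, pc_df$PC2, xlab = "PC1", ylab = "PC2")
snpgdsClose(genofile)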
- Other ordination/clustering techniques such as UMAP or t-SNE are newer methods that can cluster samples efficiently.
Although that does not mean classical ordinations like PCA, PCoA or NMDS can't work well for some data,
for example when testing on simple dummy data (a rough UMAP sketch follows below).
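A minimal sketch of running UMAP on the PCA scores from the sketch above, assuming the uwot package (an assumption, not part of the original pipeline):
library(uwot)  # UMAP for R (assumed dependency, not in the package list above)
# Embed the first five principal components from the SNPRelate PCA above
umap_coords <- umap(pca$eigenvect[, 1:5], n_neighbors = 15, min_dist = 0.1)
plot(umap_coords, xlab = "UMAP1", ylab = "UMAP2")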
- This could be scaled up on a server with Nextflow, Docker or an API (Flask); a rough sketch of the API idea follows below.
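The note above suggests Flask (Python); as a purely illustrative R equivalent, a plumber endpoint serving the PCA coordinates could look like this (the endpoint name and the pc_df object are assumptions from the earlier sketch):
# plumber.R -- illustrative R sketch; the note above suggests Flask in Python instead
library(plumber)
#* Return the PCA coordinates for one sample (assumes pc_df from the PCA sketch above)
#* @param sample sample identifier
#* @get /pca
function(sample) {
  pc_df[pc_df$sample == sample, ]
}
# Served with: pr("plumber.R") |> pr_run(port = 8000)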
- Between the vcfR and SeqArray/SNPRelate libraries, a benchmark should be run to compare which one is faster or more efficient for large amounts of data and samples (a rough timing sketch follows below).
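A minimal sketch of such a comparison, timing how long each library takes to ingest the merged VCF; the file names and the choice of system.time() as the measurement are assumptions:
# Rough timing sketch (illustrative only)
library(vcfR)
library(SeqArray)
vcf_file <- "merged_genotype.vcf"
# vcfR loads the whole VCF into memory
t_vcfr <- system.time(vcf_obj <- read.vcfR(vcf_file, verbose = FALSE))
# SeqArray converts the VCF into an on-disk GDS file, which SNPRelate can then query
t_gds <- system.time(seqVCF2GDS(vcf_file, "benchmark_genotype.gds", verbose = FALSE))
print(rbind(vcfR = t_vcfr, SeqArray = t_gds))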