This is a small (broken, incomplete) code project to test your overall knowledge of the Python ecosystem, algorithms, complexity and general data handling.
You will be questioned and guided on several aspects during the code interview. Valuable experience includes:
- Being able to work collaboratively using GitHub.
- Using test-driven development to reason about the inputs and outputs of code pipelines and unit tests.
- Identifying bad coding practices.
- Improving the performance, usefulness, cleanliness and generality of the code.
- Last but not least (at all): respecting the importance of deploying (scientific) code.
Say you pick 100 random bioinformatics software tools -- how many will you actually be able to access, install, and run? Our new paper: https://t.co/rllGmGdL1s https://t.co/JzyWc0W44X
-- Ran Blekhman (@blekhman), October 26, 2018
The following three commands assume that you have Miniconda installed on your computer:
git clone https://github.com/brainstorm/telotest && cd telotest
conda create -n telotest python=3
pip install -r requirements.txt
But that should be part of a nice .travis.yml, right?
To run the test suite under tests, use the tox tool, which runs them all:
tox
Once the tests pass, download the input dataset for this code, the 2013 human reference genome (hg38, ~950MB compressed), to the data folder for further testing.
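Because the reference is large, it helps to read it one chromosome at a time rather than loading the whole file. The sketch below is one way to do that with the standard library only. It assumes the download is a gzip-compressed FASTA file; the function names read_fasta and read_fasta_gz are hypothetical and not part of this repository.

```python
import gzip
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from an open FASTA text handle."""
    name = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the record name.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    # Emit the last record in the file.
    if name is not None:
        yield name, "".join(chunks)


def read_fasta_gz(path: str) -> Iterator[Tuple[str, str]]:
    """Stream records from a gzip-compressed FASTA file (assumed format)."""
    with gzip.open(path, "rt") as handle:
        yield from read_fasta(handle)
```

Only one record is held in memory at a time, although a single human chromosome is still a string of up to a few hundred megabytes.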
A telomere is a repetitive region found at each end of a chromosome. Its evolutionary role is to protect chromosome ends from degradation: after each division of our cells, the DNA shrinks a bit from each end, and telomeres take all the damage.
Unfortunately, telomeres are not endless: after a certain number of cell divisions they disappear, and the cell dies. For clinical purposes, it's very important to know how much of the telomere is left on each chromosome.
This project is an attempt to solve this problem computationally from DNA sequencing data. As the first step, it finds the boundaries of telomeric regions in a chromosome sequence. All telomeric regions consist of the motif CCCTAA repeated many times, and the code relies on that fact to find the regions.
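The motif-based idea can be sketched as a search for runs of back-to-back copies of the motif. This is a minimal illustration, not the project's actual implementation: the function name find_motif_runs and the min_repeats threshold are assumptions made for the example, and it only matches perfect repeats, whereas real telomeric sequence may contain mismatches.

```python
import re
from typing import List, Tuple

MOTIF = "CCCTAA"


def find_motif_runs(
    sequence: str, motif: str = MOTIF, min_repeats: int = 3
) -> List[Tuple[int, int]]:
    """Return (start, end) for each run of at least min_repeats tandem motif copies.

    Coordinates are 0-based and end-exclusive. Matching ignores case, so
    lower-case (soft-masked) sequence is handled as well.
    """
    pattern = re.compile(
        "(?:%s){%d,}" % (re.escape(motif), min_repeats), re.IGNORECASE
    )
    # The quantifier is greedy, so each match spans a maximal run of whole motifs.
    return [(m.start(), m.end()) for m in pattern.finditer(sequence)]


if __name__ == "__main__":
    # Toy sequence: N padding, a partial motif, four full repeats, then other DNA.
    toy = "NNNN" + "TAA" + "CCCTAA" * 4 + "ACGTACGT"
    print(find_motif_runs(toy))
```

The run boundaries snap to whole motif copies, so a partial copy at either edge (the TAA in the toy sequence) is left out. On the forward strand the opposite chromosome end would be expected to carry the reverse complement, TTAGGG, which can be passed as the motif argument.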