GitHub - elcharitas/bioinformatics_algorithms: Python code covering some of the algorithms underlying the following fundamental topics in bioinformatics: assembling genomes, comparing DNA and protein sequences, finding regulatory motifs, analyzing genome rearrangements, identifying proteins, and many other topics.

Bioinformatics Algorithms (Part 1)

This course was the first in a two-part series covering some of the algorithms underlying bioinformatics. It covers some of the algorithms underlying the following fundamental topics in bioinformatics:

assembling genomes,
comparing DNA and protein sequences,
finding regulatory motifs,
analyzing genome rearrangements,
identifying proteins,
and many other topics.

About the Course

The sequencing of the human genome fueled a computational revolution in biology. As a result, modern biology produces as many new algorithms as any other fundamental realm of science. Accordingly, the newly formed links between computer science and biology affect the way we teach applied algorithms to computer scientists.

Genome sequencing is just one of hundreds of biological problems that have become inextricable from the computational methods required to solve them. In this course, we will uncover some of the algorithmic ideas that are fundamental to an understanding of modern biology. Computational concepts like dynamic programming and graph theory will help us explore algorithms applied to a wide range of biological topics, from finding regulatory motifs to determining whether the human genome has "fragile" regions. Throughout the process, we will apply bioinformatics algorithms to real genetic data.

Course Syllabus

The course was based on six central questions, with the algorithmic ideas that was used to solve them in parentheses:

Where Does DNA Replication Begin? (Algorithmic Warm-up)
How Do We Sequence Antibiotics? (Brute Force Algorithms)
Which DNA Patterns Act As Cellular Clocks? (Greedy and Randomized Algorithms)
How Do We Assemble Genomes? (Graph Algorithms)
How Do We Compare Biological Sequences? (Dynamic Programming Algorithms)
Are There Fragile Regions in the Human Genome? (Combinatorial Algorithms)

=========================================================

frequent_words_problem.py

This is the first problem in a collection of "code challenges" to accompany Bioinformatics Algorithms: An Active-Learning Approach by Phillip Compeau & Pavel Pevzner.

A k-mer is a string of length k. We define Count(Text, Pattern) as the number of times that a k-mer Pattern appears as a substring of Text. For example,

Count(ACAACTATGCATACTATCGGGAACTATCCT,ACTAT)=3

We note that Count(CGATATATCCATAG, ATA) is equal to 3 (not 2) since we should account for overlapping occurrences of Pattern in Text.

We say that Pattern is a most frequent k-mer in Text if it maximizes Count(Text, Pattern) among all k-mers. For example, "ACTAT" is a most frequent 5-mer in "ACAACTATGCATCACTATCGGGAACTATCCT", and "ATA" is a most frequent 3-mer of "CGATATATCCATAG".

Frequent Words Problem

Find the most frequent k-mers in a string.

Given: A DNA string Text and an integer k.

Return: All most frequent k-mers in Text (in any order).

Sample Dataset

ACGTTGCATGTCGCATGATGCATGAGAGCT
4

Sample Output

CATG GCAT

======================================================

clump_finding_problem.py

Given integers L and t, a string Pattern forms an (L, t)-clump inside a (larger) string Genome if there is an interval of Genome of length L in which Pattern appears at least t times. For example, TGCA forms a (25,3)-clump in the following Genome: gatcagcataagggtcccTGCAaTGCAtgacaagccTGCAgttgttttac

Clump Finding Problem

Find patterns forming clumps in a string.

Given: A string Genome, and integers k, L, and t.

Return: All distinct k-mers forming (L, t)-clumps in Genome.

Sample Dataset

CGGACTCGACAGATGTGAAGAAATGTGAAGACTGAGTGAAGAGAAGAGGAAACACGACACGACATTGCGACATAATGTACGAATGTAATGTGCCTATGGC
5 75 4

Sample Output

CGACA GAAGA AATGT

=========================================================

minimum_skew_problem.py

Define the skew of a DNA string Genome, denoted Skew(Genome), as the difference between the total number of occurrences of G and C in Genome. Let Prefixi (Genome) denote the prefix (i.e., initial substring) of Genome of length i. For example, the values of Skew(Prefixi ("CATGGGCATCGGCCATACGCC")) are:

0 -1 -1 -1 0 1 2 1 1 1 0 1 2 1 0 0 0 0 -1 0 -1 -2

Minimum Skew Problem

Find a position in a genome minimizing the skew.

Given: A DNA string Genome.

Return: All integer(s) i minimizing Skew(Prefixi (Text)) over all values of i (from 0 to |Genome|).

Sample Dataset

CCTATCGGTGGATTAGCATGTCCCTGTACGTTTCGCCGCGAACTAGTTCACACGGCTTGATGGCAAATGGTTTTTCCGGCGACCGTAATCGTCCACCGAG

Sample Output

53 97

=========================================================

pattern_matching_problem.py

Recall from that different occurrences of a substring can overlap with each other. For example, ATA occurs three times in CGATATATCCATAG.

Pattern Matching Problem

Find all occurrences of a pattern in a string.

Given: Strings Pattern and Genome.

Return: All starting positions in Genome where Pattern appears as a substring.

Sample Dataset

ATAT
GATATATGCATATACTT

Sample Output

1 3 9

=========================================================

reverse_complement_problem.py

In DNA strings, symbols 'A' and 'T' are complements of each other, as are 'C' and 'G'. Given a nucleotide p, we denote its complementary nucleotide as p. The reverse complement of a string Pattern = p1…pn is the string Pattern = pn … p1 formed by taking the complement of each nucleotide in Pattern, then reversing the resulting string.

For example, the reverse complement of Pattern = "GTCA" is Pattern = "TGAC".

Reverse Complement Problem

Find the reverse complement of a DNA string.

Given: A DNA string Pattern.

Return: Pattern, the reverse complement of Pattern.

Sample Dataset

AAAACCCGGT

Sample Output

ACCGGGTTTT

=========================================================

approximate_pattern_matching_problem

We say that position i in k-mers p1 … pk and q1 … qk is a mismatch if pi ≠ qi. For example, CGAAT and CGGAC have two mismatches.

Approximate Pattern Matching Problem

Find all approximate occurrences of a pattern in a string.

Given: Strings Pattern and Text along with an integer d.

Return: All starting positions where Pattern appears as a substring of Text with at most d mismatches.

Sample Dataset

ATTCTGGA
CGCCCGAATCCAGAACGCATTCCCATATTTCGGGACCACTGGCCTCCACGGTACGGACGTCAATCAAATGCCTAGCGGCTTGTGGTTTCTCCTACGCTCC
3

Sample Output

6 7 26 27 78

=========================================================

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
README.md		README.md
approximate_match_data.txt		approximate_match_data.txt
approximate_pattern_matching_problem.py		approximate_pattern_matching_problem.py
clump_finding_problem.py		clump_finding_problem.py
clump_finding_problem_text.txt		clump_finding_problem_text.txt
frequent_words_problem.py		frequent_words_problem.py
frequent_words_problem_text.txt		frequent_words_problem_text.txt
minimum_skew_data.txt		minimum_skew_data.txt
minimum_skew_problem.py		minimum_skew_problem.py
pattern_matching_problem.py		pattern_matching_problem.py
pattern_matching_problem_text.txt		pattern_matching_problem_text.txt
reverse_complement_problem.py		reverse_complement_problem.py
reverse_complement_problem_text.txt		reverse_complement_problem_text.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Bioinformatics Algorithms (Part 1)

About the Course

Course Syllabus

frequent_words_problem.py

Frequent Words Problem

Sample Dataset

Sample Output

clump_finding_problem.py

Clump Finding Problem

Sample Dataset

Sample Output

minimum_skew_problem.py

Minimum Skew Problem

Sample Dataset

Sample Output

pattern_matching_problem.py

Pattern Matching Problem

Sample Dataset

Sample Output

reverse_complement_problem.py

Reverse Complement Problem

Sample Dataset

Sample Output

approximate_pattern_matching_problem

Approximate Pattern Matching Problem

Sample Dataset

Sample Output

About

Releases

Packages

elcharitas/bioinformatics_algorithms

Folders and files

Latest commit

History

Repository files navigation

About the Course

Course Syllabus

Frequent Words Problem

Sample Dataset

Sample Output

Clump Finding Problem

Sample Dataset

Sample Output

Minimum Skew Problem

Sample Dataset

Sample Output

Pattern Matching Problem

Sample Dataset

Sample Output

Reverse Complement Problem

Sample Dataset

Sample Output

Approximate Pattern Matching Problem

Sample Dataset

Sample Output

About

Resources

Stars

Watchers

Forks