Skip to content

Commit

Permalink
Test notebooks!
Browse files Browse the repository at this point in the history
  • Loading branch information
balajialg authored May 10, 2024
1 parent 7db90ea commit afdc565
Showing 1 changed file with 298 additions and 0 deletions.
298 changes: 298 additions & 0 deletions test_notebooks/Notebook - Part 2 Phylogenetics.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,298 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "46e9dde0",
"metadata": {},
"source": [
"# Notebook: Phylogenetics"
]
},
{
"cell_type": "markdown",
"id": "1047b7d7",
"metadata": {},
"source": [
"Welcome to Bio 1B Notebook on Phylogenetics! In this notebook we will be looking at concentrations of primary air pollutants in Bay Area neighborhoods.\n",
"\n",
"Throughout this notebook you will learn to:\n",
"\n",
"Aggregate data (mean, standard deviation, variance) based on a common attribute\n",
"Create histograms for multi-dimensional datasets\n",
"Plot time-resolved and neighborhood-resolved boxplots of concentrations\n",
"Historical Context: Acute Exposure Guideline Levels were created in 1984 to measure the danger of exposure to air pollutants. The US Environmental Protection Agency has information about these studies."
]
},
{
"cell_type": "markdown",
"id": "8c19deac",
"metadata": {},
"source": [
"### Table of Contents\n",
"\n",
"1. [Introduction to the Data](#sectiondata)<br>\n",
"\n",
"2. [Calculating Statistics](#sectionstats)<br>\n",
"\n",
"3. [Histograms](#sectionhistograms)<br>\n",
"\n",
"4. [Visualizing with *seaborn*](#sectionseaborn)<br>\n",
" 4a. [Plotting concentrations and neighborhoods](#subsectionplot)<br>"
]
},
{
"cell_type": "markdown",
"id": "3df71d71",
"metadata": {},
"source": [
"### Introduction to Phylogenetics\n",
"\n",
"Phylogenetic is a better way to determine how organisms are related to one another, because it is based on the evolutionary processes that gave rise to different taxa. (A note here, **taxon** simply refers to a group of related organisms, and the plural form is **taxa**. Taxon/taxa is a generic term that can be used to refer to species, genera, families, or any\n",
"other level of biological classification). All living things are related by common ancestry. Therefore, any group of taxa that we could assemble shares an evolutionary history. A **phylogeny** is a hypothesis about the evolutionary history of a group of taxa."
]
},
{
"cell_type": "markdown",
"id": "f4c9b6ab",
"metadata": {},
"source": [
"We can use many different kinds of evidence to help us answer questions about how different organisms evolved. In today’s lab about tarsiers, we will use two types of data:\n",
"\n",
"1) `morphological data`: anatomical characters, gathered by comparing homologous structures such as bones, teeth, and other tissues from different taxa. Homologous structures are structures inherited from a common ancestor.\n",
"\n",
"2) `molecular data`: gathered by sequencing the homologous gene in DNA from\n",
"different taxa. We will be using a gene that codes for one of the subunits of the hemoglobin molecule."
]
},
{
"cell_type": "markdown",
"id": "8d156d2e",
"metadata": {},
"source": [
"Although we know that any group of taxa shares a true evolutionary history, we can never reconstruct the past with total certainty (this is especially true if we are dealing with events that took place millions of years ago). Therefore, although we have very effective methods for determining phylogenies, we can never know the phylogeny of a group of taxa with total certainty. Our determination of the phylogeny of a group of taxa serves as a hypothesis, subject to revision if additional evidence comes to light."
]
},
{
"cell_type": "markdown",
"id": "881486c5",
"metadata": {},
"source": [
"Hypotheses of phylogenetic relationships are represented by **tree** diagrams—the branching patterns reflect a hypothesis of how different taxa are related to one another. The tree diagrams are also called **cladograms**. Speciation produces new branches on the tree of life, while extinction removes branches from it. The cladograms group taxa by common ancestry, from the smallest group of two **sister taxa**, to larger groups, sometimes composed of thousands of species. Biologists attempt to group organisms into **clades**, lineages that contain a common ancestor and all of its descendants. Clades are also called **monophyletic groups**."
]
},
{
"cell_type": "markdown",
"id": "0d0b636f",
"metadata": {},
"source": [
"Smaller monophyletic groups can be subsets of larger monophyletic groups. We refer to smaller groups contained within larger groups as being **nested** within the larger group. For example, as mentioned above, birds are a monophyletic group nested within the larger monophyletic group called reptiles. Reptiles are a monophyletic group nested within a larger monophyletic group called vertebrates, and so on."
]
},
{
"cell_type": "markdown",
"id": "fc60732d",
"metadata": {},
"source": [
"<img src=\"Figure_2.1.png\" alt=\"Data scientist Venn diagram\" width=\"250\" height=\"250\">\n",
"\n",
"<div style=\"text-align:center\">\n",
" <b>Figure 2.1.</b> Example of a phylogenetic tree and challenges in reconstructing evolutionary histories.\n",
"</div>\n"
]
},
{
"cell_type": "markdown",
"id": "d3a56361",
"metadata": {},
"source": [
"<div style=\"display:flex; align-items:center;\">\n",
" <div style=\"flex:1;\">\n",
" <img src=\"Figure2.2.png\" alt=\"Data scientist Venn diagram\" style=\"width:1000px; height:auto;\">\n",
" </div>\n",
" <div style=\"flex:2; padding-left: 70px; padding-top:40px\">\n",
" <p>The two taxa that branch from a common\n",
"ancestor are sister taxa. Ancestors are\n",
" represented on evolutionary trees as <b>nodes</b>.\n",
"Therefore, each node <b>identifies</b> a pair of sister\n",
"taxa, which are each other’s closest relatives.\n",
"Whether other taxa are nested within a sister\n",
"taxon (like smaller boxes fitted inside larger\n",
"boxes) does not change the sister relationship.\n",
"For example, in the cladogram shown in Figure\n",
"2.2, three nodes are numbered and seven extant\n",
"(living) taxa are labeled with letters. The sister\n",
"taxa identified by node 1 are B and C. The sister\n",
"taxa identified by node 2 are A and B+C.</p>\n",
" </div>\n",
"</div>\n"
]
},
{
"cell_type": "markdown",
"id": "05518151",
"metadata": {},
"source": [
"**Figure 2.2** Cladogram showing the evolutionary relationships among\n",
"seven hypothetical taxa (represented by letters A–G). Three nodes are numbered."
]
},
{
"cell_type": "markdown",
"id": "07b21620",
"metadata": {},
"source": [
"When relationships are well-understood, the branching pattern of a tree is assumed to be **bifurcating** (*two* branches diverge from one another each time the tree branches). However, when we are uncertain about the evolutionary relationships between groups, we collapse the branches into a **polytomy**, where more than two branches originate from a single point (or, \"**node**”)."
]
},
{
"cell_type": "markdown",
"id": "5120f5d9",
"metadata": {},
"source": [
"Returning to the topic of phylogenetic groups, if you were to construct a phylogenetic\n",
"group consisting of a node and all of the descendants that branch from that node, you\n",
"would have created a **monophyletic group** (i.e., clade)."
]
},
{
"cell_type": "markdown",
"id": "917f486d",
"metadata": {},
"source": [
"Which types of characteristics are useful for reconstructing evolutionary histories? An\n",
"important step is to identify **homologous** characteristics. These are the **characters** (DNA\n",
"sequences, morphological structures, patterns of development, or behaviors) that are\n",
"shared among taxa because they were inherited from a common ancestor. A **character\n",
"state** is a version of a character that we observe in organisms. For example, if eye color is a\n",
"***character***, then blue, black, brown and hazel might be the ***character states***. With respect\n",
"to DNA, each nucleotide position in a DNA sequence is considered a character; the five\n",
"possible character states are the four types of nucleic acids that make up DNA (A, C, T, G) or\n",
"a gap in the sequence (a missing nucleotide). So, the number of characters in a DNA\n",
"sequence is its particular length (the number of nucleotides in the sequence), but the\n",
"number of character states is always five."
]
},
{
"cell_type": "markdown",
"id": "60b69505",
"metadata": {},
"source": [
"Character states that are shared by taxa but that ***are not*** inherited from a common ancestor\n",
"are called **analogous traits**. Analogous traits are not useful for reconstructing evolutionary\n",
"history, because they do not give any information about shared ancestry. For example,\n",
"forelimbs that are modified into wings in birds and forelimbs that are modified into wings\n",
"in bats are analogous because their most recent common ancestor did not have forelimbs\n",
"modified into wings; the state of having wings does not tell us anything about their shared\n",
"ancestor or the evolutionary relationship between birds and bats, because wings evolved\n",
"independently in the two lineages. If the presence of wings was the only shared trait we\n",
"used to infer patterns of common ancestry, we would make the mistake of suggesting that\n",
"bats are more closely related to birds than bats are to other mammals, which evidence\n",
"from many other characters clearly indicates is not the case. Thus, building cladograms\n",
"with datasets that include many characters helps prevent incorrect inferences caused by\n",
"analogous traits. A major benefit of molecular datasets is that they include many characters\n",
"because each nucleotide is potentially another informative character."
]
},
{
"cell_type": "markdown",
"id": "a9a1d168",
"metadata": {},
"source": [
"For each character you examine, you need to determine the **ancestral** and **derived** character states. How can you do this without already knowing the evolutionary history? You must select an appropriate **outgroup** (a taxon that is related to the taxa of interest) to\n",
"compare with the **ingroup** (the taxa of interest). All of the members of the ingroup should be more closely related to one another than any are to the outgroup. The key assumption here, which is reasonable but not infallible, is that the character states of the outgroup are **ancestral**, and that the character states present in the ingroup (not the outgroup) are considered to be **derived**. Although the character states of the outgroup are treated as ancestral, an outgroup itself is not an ancestor of the ingroup; it could, however, be a sister group, the group most closely related to the entire ingroup."
]
},
{
"cell_type": "markdown",
"id": "be27a5cb",
"metadata": {},
"source": [
"<img src=\"Figure2.3.png\" alt=\"Data scientist Venn diagram\" width=\"250\" height=\"250\">\n",
"\n",
"**Figure 2.3.** The Hispaniolan solenodon (Solenodon paradoxus) is found only on Hispaniola. Morphologically, this\n",
"species closely resembles mammals living at the time dinosaurs inhabited the Earth. During the lab, you will use\n",
"the solenodon as an outgroup to the primates."
]
},
{
"cell_type": "markdown",
"id": "b4634bcb",
"metadata": {},
"source": [
"A derived character state that is shared by at least two taxa because of common ancestry is called a **synapomorphy**. By definition, synapomorphies are inherited from a common ancestor. Therefore, they are homologous characters."
]
},
{
"cell_type": "markdown",
"id": "f455f7fa",
"metadata": {},
"source": [
"To perform a phylogenetic analysis, character states are coded into a **character matrix**, *a table that aligns homologous characters for an outgroup and members of the ingroup*."
]
},
{
"cell_type": "markdown",
"id": "20fad676",
"metadata": {},
"source": [
"<img src=\"Table1.png\" alt=\"Data scientist Venn diagram\" width=\"400\" height=\"400\">\n",
"\n",
"<img src=\"Figure2.4.png\" alt=\"Data scientist Venn diagram\" width=\"400\" height=\"400\">\n",
"\n",
"**Figure 2.4:** (A) Venn Diagram grouping taxa by shared characters. (B) Cladogram depicting the same relationships and evolutionary history. The evolution of each character has been marked onto the tree with hash marks."
]
},
{
"cell_type": "markdown",
"id": "554eab17",
"metadata": {},
"source": [
"Figure 2.4 is an example of how to construct a phylogeny of tetrapods, the four-limbed vertebrates. The outgroup in this example is sharks; they are more distantly-related vertebrates. The absence of hind limbs is coded as “0” because this state is ancestral, and\n",
"the presence of hind limbs is coded as “1” because possessing hind limbs is the derived state.\n",
"\n",
"Notice that all the taxa have the ancestral condition of 0 for the “vertebrae” character. Having vertebrae is the ancestral condition for these organisms, and because **all** of the organisms in the matrix have the same character state, the character of “vertebrae” is *not informative*. It does not tell us anything about how these animals are related to one another.\n",
"\n",
"“External pouch” and “feathers” each appear in one taxon only. These characters are derived, but not shared between multiple taxa (they are each in one taxon only). Therefore, these characters are not informative for grouping taxa.\n",
"\n",
"**In summary, neither ancestral characters nor derived characters that occur in only one lineage are informative for grouping taxa.**\n",
"\n",
"We can put other types of data into a character matrix as well. In Figure 2.5, we have lined up DNA sequences for the same gene for three different species. When we choose genes for\n",
"phylogenetic analyses, we tend to look for genes that are similar enough across taxa to have some parts of the gene that are the same and other parts that vary among taxa. In other words, we choose genes in which unique mutations have accumulated in each taxon\n",
"over the time the taxa have been diverging. Genes for proteins like insulin, cytochromes, glycolysis enzymes, and muscle proteins tend to be very similar across very different taxa, telling us that the states of these genes and the proteins they produce are important to the physiology of organisms.\n",
"\n",
"In Figure 2.5, you can see that characters (loci) 1, 5 and 7 are not informative, because they are identical in all three taxa. The gaps, identified by hyphens, indicate places where there have been mutations. For example, either Species C has had a nucleotide deleted at Locus 2, OR species A and B have both had a T inserted at Locus 2. (A deletion is a type of mutation that eliminates parts of a DNA sequence; an insertion is a type of mutation that\n",
"inserts one or more nucleotides into a DNA sequence).\n",
"\n",
"<img src=\"Figure2.5.png\" alt=\"Data scientist Venn diagram\" width=\"400\" height=\"400\">\n",
"\n",
"**Figure 2.5.** A character matrix for DNA sequences of the same gene collected and sequenced from three different\n",
"species. In this example, there are 7 characters, and five possible character states (A, T, G, C, and gap).\n",
"\n",
"It is important to realize that when you construct an evolutionary tree you are proposing a hypothesis. When biologists build trees, they do not initially know the order in which characters evolved or how many times the character states have changed. One guiding\n",
"principle for finding a plausible evolutionary history is the principle of **parsimony**, which states, in general, that the simplest answer is likely to be the correct one (also known as Occam’s Razor). In the context of phylogenetics, *trees with fewer character state transitions are considered to be more reasonable than trees with a greater number of transitions*. The principle of parsimony asserts that the most parsimonious tree is our preferred hypothesis. Thus, it is reasonable to assume that the tree with the lowest parsimony score (the fewest character state changes) is the best estimate of the actual history of a group of species. But\n",
"remember, the tree is an estimate – the hypothesis with the most supporting evidence. Additional data or a different analysis might change or further support one’s conclusion."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.0"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

0 comments on commit afdc565

Please sign in to comment.