diff --git a/test_notebooks/LectureEIGHTData102Fall2023.ipynb b/test_notebooks/LectureEIGHTData102Fall2023.ipynb
new file mode 100644
index 0000000..a1497f4
--- /dev/null
+++ b/test_notebooks/LectureEIGHTData102Fall2023.ipynb
@@ -0,0 +1,2652 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "3e6fb570",
+ "metadata": {},
+ "source": [
+ "# Probabilistic Programming using PyMC, and Graphical Models\n",
+ "\n",
+ "Today, we shall discuss the following two topics:\n",
+ "\n",
+ "1. Probabilistic Programming using PyMC\n",
+ "2. Graphical Models "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7efab4fa",
+ "metadata": {},
+ "source": [
+ "Practical Bayesian inference is often carried out through specialized probabilistic programming libraries such as PyMC. In these libraries, one specifies the probability model (prior and likelihood) and obtains the posterior distribution in the form of Monte Carlo samples. These samples provide an approximation of the posterior distribution."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "28cdb539",
+ "metadata": {},
+ "source": [
+ "To illustrate probabilistic programming with PyMC, consider the following simple probability model. We have four binary random variables: overweight, smoking, heart disease and cough. Suppose we specify a probability model for these random variables as:\n",
+ "\\begin{align*}\n",
+ " & \\text{overweight} \\sim \\text{Bernoulli}(0.1) \\\\\n",
+ " & \\text{smoking} \\sim \\text{Bernoulli}(0.1) \\\\\n",
+ " & \\text{heart disease} \\mid \\text{overweight} = 1, \\text{smoking} = 1 \\sim \\text{Bernoulli}(0.75) \\\\\n",
+ " & \\text{heart disease} \\mid \\text{overweight} = 1, \\text{smoking} = 0 \\sim \\text{Bernoulli}(0.5) \\\\\n",
+ " & \\text{heart disease} \\mid \\text{overweight} = 0, \\text{smoking} = 1 \\sim \\text{Bernoulli}(0.4) \\\\\n",
+ " & \\text{heart disease} \\mid \\text{overweight} = 0, \\text{smoking} = 0 \\sim \\text{Bernoulli}(0.1) \\\\\n",
+ " & \\text{cough} \\mid \\text{smoking} = 1 \\sim \\text{Bernoulli}(0.6) \\\\\n",
+ " & \\text{cough} \\mid \\text{smoking} = 0 \\sim \\text{Bernoulli}(0.05)\n",
+ "\\end{align*}\n",
+ "The above model starts by specifying the marginal distributions of overweight and smoking. Then it specifies the conditional distribution of heart disease given overweight and smoking. Finally, it specifies the conditional distribution of cough given smoking. Based on this model, we might be interested in several probability questions such as:\n",
+ "1. What is the marginal distribution of heart disease, i.e., $\\mathbb{P}(\\text{heart disease} = 1)$?\n",
+ "2. What is the conditional distribution of overweight given heart disease, i.e., $\\mathbb{P}(\\text{overweight} = 1 \\mid \\text{heart disease} = 1)$?\n",
+ "3. What is the conditional distribution of smoking given cough, i.e., $\\mathbb{P}(\\text{smoking} = 1 \\mid \\text{cough} = 1)$?\n",
+ "\n",
+ "All these probabilities can be calculated exactly using the model specification. One way of doing this is to note that for every binary $b_o, b_s, b_h, b_c \\in \\{0, 1\\}$, we can write\n",
+ "\\begin{align*}\n",
+ " &\\mathbb{P}\\left(\\text{overweight} = b_o, \\text{smoking} = b_s, \\text{heart} = b_h, \\text{cough} = b_c \\right) \\\\\n",
+ " &= \\mathbb{P}\\left(\\text{overweight} = b_o \\right) \\mathbb{P}\\left(\\text{smoking} = b_s \\right) \\mathbb{P} \\left(\\text{heart} = b_h \\mid \\text{overweight} = b_o, \\text{smoking} = b_s \\right) \\mathbb{P} \\left(\\text{cough} = b_c \\mid \\text{smoking} = b_s \\right)\n",
+ "\\end{align*}\n",
+ "and these probabilities can then be read off from the model specification. For example, with $b_o = b_s = b_h = b_c = 1$, we get\n",
+ "\\begin{align*}\n",
+ " &\\mathbb{P}\\left(\\text{overweight} = 1, \\text{smoking} = 1, \\text{heart} = 1, \\text{cough} = 1 \\right) \\\\\n",
+ " &= \\mathbb{P}\\left(\\text{overweight} = 1 \\right) \\mathbb{P}\\left(\\text{smoking} = 1 \\right) \\mathbb{P} \\left(\\text{heart} = 1 \\mid \\text{overweight} = 1, \\text{smoking} = 1 \\right) \\mathbb{P} \\left(\\text{cough} = 1 \\mid \\text{smoking} = 1 \\right) \\\\\n",
+ " &= 0.1 \\times 0.1 \\times 0.75 \\times 0.6 = 0.0045.\n",
+ "\\end{align*}\n",
+ "From the full joint distribution of the four variables, all the probabilities asked for in the questions above can be calculated by summing the joint probabilities over the relevant configurations (and taking ratios for conditional probabilities). \n",
+ "\n",
+ "This approach quickly becomes tedious, especially with more than four random variables: with $k$ binary variables, the joint distribution has $2^k$ entries. Probabilistic programming libraries (such as PyMC) automate the calculation of these probabilities. However, instead of calculating probabilities exactly, they output \"samples\" from which probabilities can be approximated. Here is how this works for this simple health probability model."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "id": "0ffe418b",
+ "metadata": {},
+ "outputs": [
+ {
+ "ename": "ModuleNotFoundError",
+ "evalue": "No module named 'pymc3'",
+ "output_type": "error",
+ "traceback": [
+ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+ "\u001b[0;31mModuleNotFoundError\u001b[0m Traceback (most recent call last)",
+ "Cell \u001b[0;32mIn[1], line 6\u001b[0m\n\u001b[1;32m 4\u001b[0m \u001b[38;5;28;01mimport\u001b[39;00m \u001b[38;5;21;01mmatplotlib\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mpyplot\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m \u001b[38;5;21;01mplt\u001b[39;00m\n\u001b[1;32m 5\u001b[0m \u001b[38;5;28;01mimport\u001b[39;00m \u001b[38;5;21;01mnumpy\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m \u001b[38;5;21;01mnp\u001b[39;00m\n\u001b[0;32m----> 6\u001b[0m \u001b[38;5;28;01mimport\u001b[39;00m \u001b[38;5;21;01mpymc3\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m \u001b[38;5;21;01mpm\u001b[39;00m \u001b[38;5;66;03m#pymc3 is the previous version of pymc. We shall switch to pymc as soon as it gets installed in datahub\u001b[39;00m\n",
+ "\u001b[0;31mModuleNotFoundError\u001b[0m: No module named 'pymc3'"
+ ]
+ }
+ ],
+ "source": [
+ "#Import the necessary libraries:\n",
+ "import arviz as az\n",
+ "import pandas as pd\n",
+ "import matplotlib.pyplot as plt\n",
+ "import numpy as np\n",
+ "import pymc3 as pm #pymc3 is the previous version of pymc. We shall switch to pymc as soon as it gets installed in datahub"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "id": "f02500bd-0e06-45ef-a8b1-32243b2819a2",
+ "metadata": {},
+ "outputs": [
+ {
+ "ename": "ModuleNotFoundError",
+ "evalue": "No module named 'graphviz'",
+ "output_type": "error",
+ "traceback": [
+ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+ "\u001b[0;31mModuleNotFoundError\u001b[0m Traceback (most recent call last)",
+ "Cell \u001b[0;32mIn[4], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01mgraphviz\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m Digraph\n",
+ "\u001b[0;31mModuleNotFoundError\u001b[0m: No module named 'graphviz'"
+ ]
+ }
+ ],
+ "source": [
+ "from graphviz import Digraph"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "id": "d9ab026e",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "Multiprocess sampling (4 chains in 4 jobs)\n",
+ "BinaryGibbsMetropolis: [overweight, smoking, heart, cough]\n"
+ ]
+ },
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n"
+ ],
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ " "
+ ],
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "Sampling 4 chains for 1_000 tune and 5_000 draw iterations (4_000 + 20_000 draws total) took 4 seconds.\n"
+ ]
+ }
+ ],
+ "source": [
+ "# PyMC model with cough observed (fixed) at 1.\n",
+ "health_model_2 = pm.Model()\n",
+ "with health_model_2:\n",
+ "    overweight = pm.Bernoulli('overweight', 0.1)\n",
+ "    smoking = pm.Bernoulli('smoking', 0.1)\n",
+ "    # Probability of 'heart' as a deterministic function of its parents\n",
+ "    p_heart = pm.Deterministic('p_heart', pm.math.switch(overweight,\n",
+ "        pm.math.switch(smoking, 0.75, 0.5),\n",
+ "        pm.math.switch(smoking, 0.4, 0.1)))\n",
+ "\n",
+ "    # 'heart' random variable (not observed, so it is sampled)\n",
+ "    heart = pm.Bernoulli('heart', p_heart)\n",
+ "    # Probability of 'cough' as a deterministic function of 'smoking'\n",
+ "    p_cough = pm.Deterministic('p_cough', pm.math.switch(smoking, 0.6, 0.05))\n",
+ "    # 'cough' random variable. observed = 1 fixes cough at the observed value 1;\n",
+ "    # to fix cough at 0 instead, we would say observed = 0.\n",
+ "    cough = pm.Bernoulli('cough', p_cough, observed = 1)\n",
+ "    # This ends the specification of the model.\n",
+ "    # To obtain samples from PyMC, run the following:\n",
+ "    idata = pm.sample(5000, chains = 4, return_inferencedata = True)\n",
+ "    # This generates 5000*4 = 20000 posterior samples of\n",
+ "    # (overweight, smoking, heart) conditional on cough = 1."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "id": "38d245ee",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\n",
+ "Dimensions: (chain: 4, draw: 5000)\n",
+ "Coordinates:\n",
+ " * chain (chain) int64 0 1 2 3\n",
+ " * draw (draw) int64 0 1 2 3 4 5 6 ... 4994 4995 4996 4997 4998 4999\n",
+ "Data variables:\n",
+ " overweight (chain, draw) int64 0 0 0 0 0 0 0 0 0 0 ... 0 1 1 1 0 0 0 0 0 0\n",
+ " smoking (chain, draw) int64 1 0 0 1 0 1 1 1 1 0 ... 1 1 1 0 1 0 1 0 1 0\n",
+ " heart (chain, draw) int64 0 0 1 0 0 0 1 0 1 0 ... 0 1 0 1 1 0 0 0 0 0\n",
+ " p_heart (chain, draw) float64 0.4 0.1 0.1 0.4 0.1 ... 0.4 0.1 0.4 0.1\n",
+ " p_cough (chain, draw) float64 0.6 0.05 0.05 0.6 ... 0.6 0.05 0.6 0.05\n",
+ "Attributes:\n",
+ " created_at: 2023-09-19T23:13:30.935572\n",
+ " arviz_version: 0.12.1\n",
+ " inference_library: pymc3\n",
+ " inference_library_version: 3.11.2\n",
+ " sampling_time: 4.06363320350647\n",
+ " tuning_steps: 1000\n",
+ "0.56815\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(idata.posterior)\n",
+ "overweight_samples = idata.posterior['overweight'].values.flatten()\n",
+ "smoking_samples = idata.posterior['smoking'].values.flatten()\n",
+ "heart_samples = idata.posterior['heart'].values.flatten()\n",
+ "all_samples = np.column_stack((overweight_samples, smoking_samples, heart_samples))\n",
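+ "#Required conditional probability P(smoking = 1 | cough = 1), estimated by the fraction of samples with smoking = 1 (exact value: 0.06/0.105, about 0.571):\n",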
+ "print(np.sum(smoking_samples)/len(smoking_samples))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3512f57c",
+ "metadata": {},
+ "source": [
+ "## Graphical Models\n",
+ "\n",
+ "It is often convenient to represent Bayesian probability models in the form of **Graphical Models**. Graphical Models are graphical representations of probability models. Each random variable in the model is represented by a circle (sometimes random variables whose values are observed are shaded). A directed edge is put from node $w$ to node $v$ if the distribution of $v$ in the model is described in terms of $w$. \n",
+ "\n",
+ "Graphical models should be constructed sequentially in the same order as the variables appear in the PyMC specification. For example, in this health model, the following steps should be followed in sequence to form the graphical model:\n",
+ "1. The first random variable in the PyMC model specification is \"overweight\". So we first draw a node (circle) for this random variable. At this stage, the graphical model only consists of a single node (overweight). \n",
+ "2. The second variable is \"smoking\" so we draw a node (circle) for smoking. The probability specification for smoking does not involve the first variable overweight so we do not place any edge from overweight to smoking. At this stage, the graphical model only consists of two nodes (overweight and smoking) without any edge between them. \n",
+ "3. The third variable is \"heart disease\" so we draw a node (circle) for heart disease. The probability specification for heart disease clearly uses both overweight and smoking so we place two directed edges: one from overweight to heart disease, and the other from smoking to heart disease. \n",
+ "4. The fourth variable is \"cough\" so we draw a node for cough. The probability specification for cough clearly uses smoking (and not overweight or heart disease) so we place one directed edge from smoking to cough. \n",
+ "\n",
+ "This graphical model can be drawn in Python as follows:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "id": "37505bec",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "image/svg+xml": [
+ "\n",
+ "\n",
+ "\n",
+ "\n",
+ "\n"
+ ],
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "from graphviz import Digraph\n",
+ "def create_graphical_model():\n",
+ "    # Create a new directed graph\n",
+ "    dot = Digraph()\n",
+ "\n",
+ "    # Add nodes\n",
+ "    dot.node('overweight', 'Overweight')\n",
+ "    dot.node('smoking', 'Smoking')\n",
+ "    dot.node('heart_disease', 'Heart Disease')\n",
+ "    dot.node('cough', 'Cough')\n",
+ "\n",
+ "    # Add edges (parent -> child)\n",
+ "    dot.edge('overweight', 'heart_disease')\n",
+ "    dot.edge('smoking', 'heart_disease')\n",
+ "    dot.edge('smoking', 'cough')\n",
+ "\n",
+ "    return dot\n",
+ "\n",
+ "# Display the graph\n",
+ "display(create_graphical_model())"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8757a962",
+ "metadata": {},
+ "source": [
+ "Graphical models are useful because interesting (conditional) independence facts can be simply read off from them. The language of familial relationships is extremely useful when discussing graphical models. For example, in the above graphical model, we can use the following terminology: \n",
+ "1. Overweight and Smoking variables can be seen as \"founders\" of this family of variables. \n",
+ "2. Heart disease is a child born to parents Overweight and Smoking.\n",
+ "3. Cough is a child born to Smoking. \n",
+ "\n",
+ "The conditional independence relationships induced by graphical models exactly match those found in simple Mendelian genetics. For example, in the above graphical model: \n",
+ "1. Overweight and Smoking are marginally independent.\n",
+ "2. Conditional on the parents (overweight and smoking), the children (heart disease and cough) are independent. \n",
+ "3. Without conditioning on parents, heart disease and cough are **not** independent. Knowing something about heart disease will change what we think about cough. \n",
+ "4. Conditional on smoking, the two children heart disease and cough are independent. \n",
+ "5. Cough and Overweight are marginally independent. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "id": "56249a5c",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "image/svg+xml": [
+ "\n",
+ "\n",
+ "\n",
+ "\n",
+ "\n"
+ ],
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "#Another Graphical Model:\n",
+ "from graphviz import Digraph\n",
+ "\n",
+ "# Create a directed graph\n",
+ "dot = Digraph()\n",
+ "\n",
+ "# Add nodes to the graph\n",
+ "for node in ['A', 'B', 'C', 'D', 'E', 'F']:\n",
+ "    dot.node(node)\n",
+ "\n",
+ "# Add directed edges (parent -> child)\n",
+ "edges = [('A', 'C'), ('B', 'C'), ('C', 'E'), ('D', 'E'), ('C', 'F'), ('D', 'F')]\n",
+ "for parent, child in edges:\n",
+ "    dot.edge(parent, child)\n",
+ "\n",
+ "# Display the graph in the Jupyter notebook\n",
+ "display(dot)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "id": "9e0f3b66-34d9-4151-ae38-7e9421ffea32",
+ "metadata": {},
+ "outputs": [
+ {
+ "ename": "ModuleNotFoundError",
+ "evalue": "No module named 'graphviz'",
+ "output_type": "error",
+ "traceback": [
+ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+ "\u001b[0;31mModuleNotFoundError\u001b[0m Traceback (most recent call last)",
+ "Cell \u001b[0;32mIn[2], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01mgraphviz\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m Digraph\n",
+ "\u001b[0;31mModuleNotFoundError\u001b[0m: No module named 'graphviz'"
+ ]
+ }
+ ],
+ "source": [
+ "from graphviz import Digraph"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3e0975b1",
+ "metadata": {},
+ "source": [
+ "Here are some conditional independence relationships induced by this graphical model:\n",
+ "1. A, B, D are marginally independent. \n",
+ "2. E and F are conditionally independent given C and D. \n",
+ "3. C and D are marginally independent. \n",
+ "4. C and D are **not** conditionally independent given E.\n",
+ "5. A and B are **not** conditionally independent given E."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "242340a5",
+ "metadata": {},
+ "source": [
+ "# Bayesian Statistics using PyMC\n",
+ "\n",
+ "Next we revisit the simple examples that we studied in the previous two lectures. We shall use PyMC to answer questions based on these models. We shall also draw the graphical models associated with these models. \n",
+ "\n",
+ "## Microwave Example using PyMC\n",
+ "\n",
+ "In the Microwave Example, there are two parameters: $\\theta_A$ and $\\theta_B$, representing the qualities of the two microwaves (the probability that a review of Microwave A, respectively Microwave B, is positive). The data for Microwave A is given by 3 out of 3 positive reviews: $\\text{pos}_A = 3, n_A = 3$ ($n_A$ is the total number of reviews for A). The data for Microwave B is given by 19 out of 20 positive reviews: $\\text{pos}_B = 19$, $n_B = 20$."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "id": "df91a1ab",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "#Data\n",
+ "pos_A_obs = 3\n",
+ "neg_A_obs = 0\n",
+ "n_A_obs = 3\n",
+ "pos_B_obs = 19\n",
+ "neg_B_obs = 1\n",
+ "n_B_obs = 20"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a2cfc7d7",
+ "metadata": {},
+ "source": [
+ "The prior is given by $\\theta_A, \\theta_B \\overset{\\text{i.i.d.}}{\\sim} \\text{uniform}[0, 1]$ and the likelihood is $\\text{pos}_A \\mid \\theta_A \\sim \\text{Bin}(n_A, \\theta_A)$ and $\\text{pos}_B \\mid \\theta_B \\sim \\text{Bin}(n_B, \\theta_B)$. We can specify this Bayesian model in PyMC as follows:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "id": "9b4c235d",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "Auto-assigning NUTS sampler...\n",
+ "Initializing NUTS using jitter+adapt_diag...\n",
+ "Multiprocess sampling (4 chains in 4 jobs)\n",
+ "NUTS: [theta_B, theta_A]\n"
+ ]
+ },
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n"
+ ],
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "