From aae63ac65d955c3651aa42d51d36c2b603ce4e67 Mon Sep 17 00:00:00 2001 From: Dipa C Date: Tue, 23 Dec 2025 02:12:38 -0500 Subject: [PATCH] Sampling A1 --- .../a1_sampling_and_reproducibility.ipynb | 72 +++++++++++++++---- 1 file changed, 60 insertions(+), 12 deletions(-) diff --git a/02_activities/assignments/a1_sampling_and_reproducibility.ipynb b/02_activities/assignments/a1_sampling_and_reproducibility.ipynb index 264a448a..e2f23768 100644 --- a/02_activities/assignments/a1_sampling_and_reproducibility.ipynb +++ b/02_activities/assignments/a1_sampling_and_reproducibility.ipynb @@ -8,43 +8,78 @@ "# Assignment 1: Sampling and Reproducibility\n", "\n", "The code at the end of this file explores contact tracing data about an outbreak of the flu, and demonstrates the dangers of incomplete and non-random samples. This assignment is modified from [Contact tracing can give a biased sample of COVID-19 cases](https://andrewwhitby.com/2020/11/24/contact-tracing-biased/) by Andrew Whitby.\n", - "\n", - "Examine the code below. Identify all stages at which sampling is occurring in the model. Describe in words the sampling procedure, referencing the functions used, sample size, sampling frame, any underlying distributions involved. \n" + " \n", + "> Examine the code below. Identify all stages at which sampling is occurring in the model. Describe in words the sampling procedure, referencing the functions used, sample size, sampling frame, any underlying distributions involved." 
] }, { "cell_type": "markdown", "id": "4ea73db3", "metadata": {}, - "source": [] + "source": [ + "**Code Block 1 - Infect random subset of people:** \n", + "Function --> \n", + "   infected_indices = np.random.choice(ppl.index, size=int(len(ppl) * ATTACK_RATE), replace=False) \n", + "   ppl.loc[infected_indices, 'infected'] = True \n", + "Sample size = 100 (1000 * 10%) \n", + "Sampling frame = all 1000 people (ppl.index) \n", + "Sampling type = simple random sampling (without replacement) \n", + "Distribution = uniform (every person is equally likely to be chosen) \n", + " \n", + "**Code Block 2 - Primary contact tracing:** \n", + "Function --> \n", + "   ppl.loc[ppl['infected'], 'traced'] = np.random.rand(sum(ppl['infected'])) < TRACE_SUCCESS \n", + "Sample size = about 20 on average (100 * 20%), varying from run to run \n", + "Sampling frame = the 100 infected people \n", + "Sampling type = an independent Bernoulli trial for each infected person (not a fixed-size draw) \n", + "Distribution = uniform(0, 1) draws compared against TRACE_SUCCESS, so the traced count is binomial (n = 100, p = 0.20) \n", + "\n", + " \n", + "**Code Block 3 - Secondary contact tracing:** \n", + "Function --> \n", + "   event_trace_counts = ppl[ppl['traced'] == True]['event'].value_counts() \n", + "   events_traced = event_trace_counts[event_trace_counts >= SECONDARY_TRACE_THRESHOLD].index \n", + "   ppl.loc[ppl['event'].isin(events_traced) & ppl['infected'], 'traced'] = True \n", + "Sample size = anywhere from nil (0) to all infected attendees of the events that meet the threshold \n", + "Sampling frame = infected people who attended an event with at least SECONDARY_TRACE_THRESHOLD (2) primary traces \n", + "Sampling type = convenience (non-random; no new random draw is made) \n", + "Distribution = none of its own; the outcome is conditional on the primary tracing results\n", + " \n" + ] }, { "cell_type": "markdown", "id": "3d9b2ccc", "metadata": {}, "source": [ - "Modify the number of repetitions in the simulation to 10 and 100 (from the original 1000). Run the script multiple times and observe the outputted graphs. Comment on the reproducibility of the results." + "> Modify the number of repetitions in the simulation to 10 and 100 (from the original 1000). Run the script multiple times and observe the outputted graphs. Comment on the reproducibility of the results." 
] }, { "cell_type": "markdown", "id": "4cf5d993", "metadata": {}, - "source": [] + "source": [ + "With 1000 repetitions, the infections and traced results overlap quite a lot, and the plot looks much the same on every re-run of the simulation. This is expected, as the 1000 repetitions are basically averaging the results over the population many times (actually, as many times as the sample population of this simulation). \n", + "The 10 repetitions look very noisy, both between re-runs and in the overlap between infections and traced. Again, this is to be expected because it is a very small sample of the random simulation, so random variation in individual runs can greatly skew the plot at each run. \n", + "Interestingly (for me), the 100 repetitions are not as noisy as I expected. Yes, they are noisier than the 1000 repetitions, but definitely better than the 10 repetitions. This suggests diminishing returns: each tenfold increase in repetitions/n removes a smaller absolute amount of noise. I am not too familiar with stats, but I believe this is consistent with the law of large numbers, where the standard error of an average shrinks in proportion to 1/sqrt(n) as repetitions/n increase, rather than exponentially." + ] }, { "cell_type": "markdown", "id": "32603ce7", "metadata": {}, "source": [ - "Alter the code so that it is reproducible. Describe the changes you made to the code and how they affected the reproducibility of the script. The output needs to produce the same output when run multiple times." + "> Alter the code so that it is reproducible. Describe the changes you made to the code and how they affected the reproducibility of the script. The output needs to produce the same output when run multiple times." ] }, { "cell_type": "markdown", "id": "77613cc3", "metadata": {}, - "source": [] + "source": [ + "I set a random seed (np.random.seed(50)) before the simulation so that all random generators (e.g. 
np.random.choice & np.random.rand) would start from the same state on every run of the script, and carry out their defined \"randomness\" from there (the generator is a deterministic algorithm, so the same seed always yields the same sequence of numbers). This allows the simulation to run the same sequence and produce the same outcomes every time, removing the run-to-run variability that an unseeded generator would introduce (and in turn, improving reproducibility)." + ] }, { "cell_type": "markdown", @@ -56,10 +91,21 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 30, "id": "ab8587a0", "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], "source": [ "# Import necessary libraries\n", "import pandas as pd\n", @@ -80,6 +126,8 @@ "TRACE_SUCCESS = 0.20\n", "SECONDARY_TRACE_THRESHOLD = 2\n", "\n", + "np.random.seed(50) # set random seed to make code reproducible\n", + "\n", "def simulate_event(m):\n", " \"\"\"\n", " Simulates the infection and tracing process for a series of events.\n", @@ -130,7 +178,7 @@ "\n", " return p_wedding_infections, p_wedding_traces\n", "\n", - "# Run the simulation 1000 times\n", + "# Run the simulation 10 / 100 / 1000 times\n", "results = [simulate_event(m) for m in range(1000)]\n", "props_df = pd.DataFrame(results, columns=[\"Infections\", \"Traces\"])\n", "\n", @@ -193,7 +241,7 @@ ], "metadata": { "kernelspec": { - "display_name": "Python 3", + "display_name": "sampling-env", "language": "python", "name": "python3" }, @@ -207,7 +255,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.13.0" + "version": "3.11.13" } }, "nbformat": 4,