Portfolio

Welcome!

👩‍💻 About Me:

I'm a graduate student pursuing an M.S. in Data Science and Public Policy at Georgetown University in Washington, DC. I'm also a public policy professional with experience in program management and communications with the federal government. I'm currently open to opportunities where data science and public interest intersect. See below for examples of my work!


Professional Research (Python)

📁 Massive Data Institute

Summary: I am working with Dr. Rebecca Johnson to develop a first-of-its-kind database of school board video transcripts for public policy analysis of education inequalities at the local level. My responsibilities have included cleaning and compiling 10 years of directory and demographic data from the National Center for Education Statistics and testing regex and Large Language Model (LLM) methods to clean and extract public comments from transcripts. Also with MDI, I collaboratively developed an interactive tool in Python to automate the data collection process and replicate the Excel-based funding allocation formula for the Department of Health and Human Services (HHS) – Low Income Home Energy Assistance Program (LIHEAP). I helped develop and refine a detailed user guide for the tool and trained HHS staff (not versed in Python) on tool usage. See the poster presentations for these projects below:

Techniques: Webscraping (Selenium), Data Wrangling, Pipeline Production, LLM Testing

📁 Beeck Center for Social Impact + Innovation

Summary: Each year, the “State of the State” speech is a governor’s prime opportunity to outline their top priorities to the public. To understand governors’ priorities in 2023, I analyzed all 50 State of the State addresses (or the equivalent annual budget address or inaugural speech) and dove deeper into governors’ priorities by analyzing a year’s worth of press releases for the 13 states that have participated in the Data Labs program with the Beeck Center. I found that governors across the country are focused most on issues related to housing and homelessness, energy policy, and taxes. See the public GitHub repository for this project here. My blog post with accompanying infographics will be published in May on the Beeck Center website.

Techniques: Webscraping (BeautifulSoup and Selenium), TF-IDF, Principal Component Analysis (PCA), Data Visualization
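
As a rough illustration of the TF-IDF technique listed above, here is a minimal pure-Python sketch. It is not the project's code: the `tf_idf` helper, the whitespace tokenizer, and the three toy snippets are all assumptions made for the example, and the weighting is the plain textbook form (term frequency times the log of inverse document frequency).

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    tf  = term count / number of tokens in the document
    idf = log(number of documents / documents containing the term)
    """
    # Naive tokenizer (assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Invented toy snippets, not text from any real speech.
speeches = [
    "housing costs and housing supply",
    "energy policy and taxes",
    "taxes and housing",
]
scores = tf_idf(speeches)
```

A word that appears in every document (here, "and") gets a weight of zero, while words concentrated in one document score highest. Libraries such as scikit-learn apply smoothing and normalization by default, so their numbers will differ from this sketch.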

Academic Research (Python and R):

📘 Data Science III: Just Jargon or Policy Priorities? Text Analysis of Secretary of State Remarks for the Biden and Trump Administrations (Python)

Summary: Public affairs materials point to both policy priorities and how we talk about those policy priorities. This is particularly relevant when comparing the priorities of presidential administrations from different political parties. I scraped 1,973 public remarks from the Office of the Secretary of State websites for the current U.S. Secretary of State Antony Blinken (Biden Administration) and his predecessor, Secretary Michael Pompeo (Trump Administration). I then conducted both exploratory and predictive analysis on these documents to understand how government public affairs materials do or do not reveal key facts about U.S. foreign policy, particularly for changes between administrations of different political parties. I found that these public remarks revealed both diplomatic "business as usual" and key policy priorities under each administration. Read the final report here.

Techniques: Webscraping (BeautifulSoup and Requests), TF-IDF, Principal Component Analysis (PCA), Naive Bayes, K-Nearest Neighbors (KNN), Decision Trees/Random Forest

📘 Data Science II: Real Life Leslie Knopes: Factors Contributing to the Proportion of Women Candidates for Local Office in the United States (Python)

Summary: Much data and research exist about factors influencing the number of women in national politics globally, but there is little understanding of these same elements at the local level. In my project, I examine which social, economic, and political factors are most relevant to predicting higher proportions of women candidates for local office in the United States. I employ a gender guessing package on candidate-level precinct returns from the 2018 elections and combine the resulting data with county-level factors related to demography, economics, election history, and reproductive healthcare. My resulting dataset provides one of the most comprehensive datasets available on women running for local office in the U.S. Read the final report here.

Techniques: Least Absolute Shrinkage and Selection Operator (LASSO) regression, random forests, handling unbalanced datasets

📘 Stats I: School-Aged Violence and Potential Impact on Proportion of Women in Public Office (R)

Summary: Initially, we planned to examine the relationship between online violence against women and women's political participation. However, this is an emerging data area and many data gaps remain. Instead, we pivoted to examine violence in relation to another proposed element contributing to the low numbers of women in elected office: the influence of childhood experiences. For example, based on survey data from 1,600 children ages 6 to 12, researchers Bos et al. concluded that girls report less interest in running for political office than their boy peers (Bos, Angela L., et al.). Another study by Fox and Lawless found that even factors such as participation in school sports influence whether or not a girl says she wants to run for office someday (Fox and Lawless, "Girls Just Wanna Not Run"). Our hypothesis is that the levels of violence against girls in high school will be negatively correlated with the levels of women in politics at both the state and federal level. In our study, we analyzed 2015 state-level data on the percent of high school students experiencing harassment or bullying and dating violence, and state statutes on violence and employment, domestic violence, sexual violence, stalking, and gun ownership, as well as 2015 data on women in Congress and state legislatures. See the full presentation of results here.

Techniques: Linear Regression

📘 Data Science I: The Air We Breathe: Air Quality and Health Outcomes in Kentucky (Python)

Summary: Much research has been conducted on the relationship between air pollution and health outcomes. For our final project, we dived deeper into this relationship in a single state in the United States: Kentucky. With 2019 data from the U.S. Centers for Disease Control and Prevention (CDC) and the U.S. Environmental Protection Agency (EPA), we conducted an exploratory analysis with data visualization and linear regression on overall, respiratory, and mental health outcomes. Due to a very limited sample size (potential political reasons discussed), our results did not show a statistically significant relationship, but we provided recommendations for further research. Read the full report (co-authored with three of my classmates) here.

Techniques: Linear Regression; Overleaf (LaTeX)


Data Visualization Projects:

📊 Parking in DC: A Story of Millions of Red Tickets and Revenue with Unequal Enforcement on Communities of Color (Tableau and R)

Summary: Every driver dreads receiving a parking ticket on their dashboard. However, for those with ample income, this ticket is a minor inconvenience that can easily be settled. In contrast, a ticket of any amount for an individual with low income is burdensome at minimum and may mean the choice between groceries or paying the fine. In DC, a city known for its history and present as a predominantly Black city, parking tickets are plentiful. My dashboard (accessible here at Tableau Public) can be used to explore how the enforcement of parking tickets intersects with issues of race and income in DC in 2019, resulting in a disproportionate burden on low-income communities of color. Read the full report here.

Techniques: Mapping, exploratory data visualization using Tableau

📊 How to Stop Recreating the Wheel: Creating and saving custom ggplot themes for your organization’s brand (R)

Summary: In this blog tutorial, I show how, with ggplot in R, you can create a new, custom theme that integrates your organization’s brand, including color and font. You can also add your organizational logo to a plot using the packages cowplot and magick. By placing this code in a utils.R document in your project folder, you can easily load these custom visualization settings into new scripts. Read the blog post here.

Techniques: custom visualization themes using ggplot


Class Assignments (Python):

(posted with permission from my Data Science professor)

📗 00_python_basics.ipynb

  • python lists
  • numpy arrays
  • basic list comprehension

📗 01_criminal_justice_data.ipynb

  • recoding variables using np.select and np.where
  • aggregation using groupby and agg
  • user-defined function to find matches within a broader pool of data
  • using list comprehension to apply a function iteratively over list elements
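
The last two items above (a user-defined matching function, applied over a list with a comprehension) can be sketched roughly as follows. This is an illustrative guess at the pattern, not the notebook's code: the `find_matches` helper, the case-insensitive substring rule, and the sample strings are all hypothetical.

```python
def find_matches(name, pool):
    """Return every entry in `pool` that contains `name`, ignoring case."""
    needle = name.strip().lower()
    return [entry for entry in pool if needle in entry.lower()]


# Invented sample data for illustration only.
pool = ["Retail Theft", "Theft of Services", "Armed Robbery"]
queries = ["theft", "Robbery", "arson"]

# Apply the function to each query in turn with a list comprehension.
results = [find_matches(q, pool) for q in queries]
# results[0] holds both theft entries, results[1] the robbery entry,
# and results[2] is empty because nothing in the pool mentions arson.
```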

📗 02_guestworker_violations.ipynb

  • pivot from long to wide
  • filter out duplicate data
  • merging
  • targeted regex
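
For the "targeted regex" item, a minimal sketch using Python's standard `re` module is shown below. The pattern, the `extract_amounts` helper, and the sample sentence are assumptions for illustration, not taken from the notebook; the idea is simply to write a narrow pattern for one well-defined target (here, dollar amounts) rather than a catch-all.

```python
import re

# Hypothetical target: dollar amounts such as "$1,250.00" or "$300".
# Group 1 is the whole-dollar part, group 2 the optional cents.
AMOUNT_RE = re.compile(r"\$(\d{1,3}(?:,\d{3})+|\d+)(\.\d{2})?")


def extract_amounts(text):
    """Return every dollar amount found in `text` as a float."""
    return [
        float((m.group(1) + (m.group(2) or "")).replace(",", ""))
        for m in AMOUNT_RE.finditer(text)
    ]


# Invented sample sentence, not real violation data.
sample = "Employer owed $1,250.00 in back wages and a $300 penalty."
amounts = extract_amounts(sample)  # [1250.0, 300.0]
```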

📗 03_doj_press_releases.ipynb

  • tagging and sentiment scoring
  • part of speech tagging
  • named entity recognition
  • sentiment analysis
  • topic modeling
  • estimate a topic model using preprocessed words
  • extend the analysis from unigrams to bigrams

How to reach me:

📬 [email protected]

Linkedin Badge