- Introduction
- Data
- Models and Metrics
- Methodology
- Results
- Conclusions and Recommendations
- Tables
In 2009, a novel strain of influenza virus, commonly referred to as “swine flu”, spread first across the United States and quickly around the world causing a pandemic. From April 2009 to April 2010, the CDC estimates that there were 60.8 million cases, 274,304 hospitalizations, and 12,469 deaths in the United States alone. A vaccine for the H1N1 flu virus became publicly available in October 2009.
Figure 1. A view of a newspaper headline near Times Square in New York, New York, USA, on 27 April 2009. Photo by EPA/BGNES (Left). An H1N1 flu vaccination clinic held in San Francisco in December 2009, Justin Sullivan/Getty Images (Right)
From August 2009 to May 2010, one or more doses of the seasonal flu vaccine were administered to 31.6 million children and 91.6 million adults. During the same time period, one or more doses of the H1N1 vaccine were administered to 29.1 million children and 80.8 million adults.
In 2009, the US National Center for Health Statistics conducted the National 2009 H1N1 Flu Survey by phone. The survey asked respondents whether they had received the H1N1 and seasonal flu vaccines, along with questions about themselves. The questions covered the respondents’ social, economic, and demographic background, their opinions on vaccine effectiveness and risks, and their behaviors towards mitigating transmission. The survey script is sixty pages long.
Below is a description of the survey from the CDC page for National Immunization Surveys:
" The National 2009 H1N1 Flu Survey (NHFS) was sponsored by the National Center for Immunization and Respiratory Diseases (NCIRD) and conducted jointly by NCIRD and the National Center for Health Statistics (NCHS), Centers for Disease Control and Prevention (CDC). The NHFS was a list-assisted random-digit-dialing telephone survey of households, designed to monitor influenza immunization coverage in the 2009-10 season.
The target population for the NHFS was all persons 6 months or older living in the United States at the time of the interview. Data from the NHFS were used to produce timely estimates of vaccination coverage rates for both the monovalent pH1N1 and trivalent seasonal influenza vaccines. "
Data was obtained from DrivenData, a website that hosts data science competitions and crowdsources social challenges from across the world. This particular dataset was from the Flu Shot Learning competition.
The data consists of 36 features made up of binary, ordinal, and categorical responses. The data dictionary from DrivenData is provided in Table 1.
Three models were tested for their ability to predict whether respondents were vaccinated or unvaccinated:
- Logistic Regression
- Neural Network
- XGBoost
The models were compared on four metrics:
- Recall
- Precision
- Accuracy
- ROC_AUC
Recall was chosen as the most important metric for comparing models. The goal was to capture as many unvaccinated respondents as possible, which means minimizing the number of unvaccinated respondents falsely classified as vaccinated.
We first performed exploratory data analysis and null imputation on our dataset to get a clean dataset that can be fed into our models. This dataset was then used to train a number of models with the following goals in mind:
- to find the best model to capture all observations that are unvaccinated based on predictor features.
- to use the best model to find the most predictive features, and confirm the predictive nature of these features by training our best model on this subset and comparing metrics with the initial model trained on the full set of features.
- to recommend areas of focus for the CDC to shorten or rearrange the survey so that a cleaner dataset with fewer missing values can be obtained.
Exploring Nulls: The survey was conducted by phone, and respondents were allowed to refuse to answer any question. These refusals account for a large portion of the nulls in the data. To better understand the distribution of nulls, we first examined which features had the most missing data (Figure 2 (Left)) and discovered that some features were missing up to half of their data. Next, we compared the number of nulls for each feature to the page number of the survey script on which the corresponding question appeared. We found that as the survey progressed, the number of null values increased (Figure 2 (Right)).
Figure 2. Left: A bar chart of the 10 features (of 36) with the most nulls. Right: A scatterplot of the number of null values in a given feature column versus the page number on which the corresponding question appears in the survey. There is a clear increase in the number of nulls as the survey progresses.
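
The null count per feature described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the code used for the analysis: the `rows` sample and the `null_counts` helper are hypothetical, and `None` is assumed to mark a refused or missing answer.

```python
from collections import Counter

# Hypothetical mini-sample of survey rows; None marks a refused or missing answer.
rows = [
    {"h1n1_concern": 2, "health_insurance": None, "employment_industry": None},
    {"h1n1_concern": 1, "health_insurance": 1, "employment_industry": None},
    {"h1n1_concern": None, "health_insurance": None, "employment_industry": "abc"},
    {"h1n1_concern": 3, "health_insurance": 0, "employment_industry": None},
]

def null_counts(records):
    """Return (feature, null_count) pairs sorted from most to fewest nulls."""
    counts = Counter()
    for record in records:
        for feature, value in record.items():
            counts[feature] += value is None  # True counts as 1, False as 0
    return counts.most_common()

ranked = null_counts(rows)
for feature, n_missing in ranked:
    print(f"{feature}: {n_missing} missing ({n_missing / len(rows):.0%})")
```

Plotting these counts against the survey page of each question would give a scatterplot like the one in Figure 2 (Right).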
Imputing Nulls: Nulls were replaced according to the following procedure:
We looked closely at the distribution of values and replaced the missing observations either with the median or proportionally, following the distribution of the feature across the different age groups of respondents. For instance, as seen in Figure 3, the missing values in the feature `household_children` were replaced with values that maintained the same proportion of responses across the different age groups.
Figure 3. Tables showing the distribution of values across different age groups before (Left) and after (Right) null replacement for the feature household_children
A complete summary of null replacements is provided in Table 2.
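
One way to implement the proportional replacement is to sample each missing value from the distribution observed within the respondent's own age group. The sketch below assumes this sampling approach, which approximately preserves the per-group proportions. The helper `impute_by_group` and the `survey` sample are hypothetical, not the original implementation.

```python
import random
from collections import Counter, defaultdict

def impute_by_group(records, feature, group_key, seed=0):
    """Fill missing `feature` values by sampling from the observed
    distribution of that feature within the same `group_key` group."""
    rng = random.Random(seed)
    observed = defaultdict(Counter)  # group -> Counter of observed values
    overall = Counter()              # fallback if a group has no observed values
    for r in records:
        if r[feature] is not None:
            observed[r[group_key]][r[feature]] += 1
            overall[r[feature]] += 1

    filled = []
    for r in records:
        r = dict(r)  # copy so the input rows are left untouched
        if r[feature] is None:
            dist = observed.get(r[group_key]) or overall
            if dist:  # leave the gap if nothing was observed at all
                values, weights = zip(*dist.items())
                r[feature] = rng.choices(values, weights=weights, k=1)[0]
        filled.append(r)
    return filled

# Hypothetical sample: None marks a missing answer.
survey = [
    {"age_group": "18 - 34 Years", "household_children": 0},
    {"age_group": "18 - 34 Years", "household_children": 1},
    {"age_group": "18 - 34 Years", "household_children": 1},
    {"age_group": "18 - 34 Years", "household_children": None},
    {"age_group": "65+ Years", "household_children": 0},
    {"age_group": "65+ Years", "household_children": 0},
    {"age_group": "65+ Years", "household_children": None},
]
imputed = impute_by_group(survey, "household_children", "age_group")
```

Because the draw is random, small groups can drift from the exact observed proportions. Fixing `seed` makes the result reproducible.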
Exploratory Data Analysis
Our target variables were `H1N1_vaccine` and `seasonal_vaccine`, both of which had binary values of 1 for vaccinated and 0 for unvaccinated. Figure 4 illustrates the distribution of values for the two target variables. We found that while the two classes for `seasonal_vaccine` were balanced, there was a large imbalance between the number of observations in the vaccinated and unvaccinated groups for `H1N1_vaccine`.
Additionally, looking at the interaction between the two target variables, we found that receiving the H1N1 vaccine and receiving the seasonal flu vaccine were statistically dependent events. In fact, while the probability that a respondent received the seasonal vaccine was 47%, the probability that a respondent received the seasonal vaccine given that they received the H1N1 vaccine was 82%.
Figure 4. Left: Number of positive and negative classes for each target variable. H1N1 vaccine: 21% vaccinated and 79% unvaccinated; seasonal vaccine: 47% vaccinated and 53% unvaccinated. Right: Bar graph showing that the two target variables were not independent of each other.
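
The comparison between the marginal and the conditional probability can be reproduced directly from the two label columns. The following is a small sketch with hypothetical toy labels (not the survey data); the function name `conditional_probability` is an illustrative choice.

```python
def conditional_probability(pairs):
    """pairs: iterable of (h1n1_vaccine, seasonal_vaccine) 0/1 labels.
    Returns (P(seasonal), P(seasonal | h1n1))."""
    pairs = list(pairs)
    p_seasonal = sum(s for _, s in pairs) / len(pairs)
    seasonal_given_h1n1 = [s for h, s in pairs if h == 1]
    p_conditional = sum(seasonal_given_h1n1) / len(seasonal_given_h1n1)
    return p_seasonal, p_conditional

# Hypothetical toy labels: (H1N1_vaccine, seasonal_vaccine)
pairs = [(1, 1), (1, 1), (1, 0), (0, 1), (0, 0),
         (0, 0), (0, 0), (0, 1), (0, 0), (0, 0)]
p_seasonal, p_seasonal_given_h1n1 = conditional_probability(pairs)
print(f"P(seasonal) = {p_seasonal:.2f}")
print(f"P(seasonal | H1N1) = {p_seasonal_given_h1n1:.2f}")
```

If the two events were independent, the two probabilities would be approximately equal. A large gap, such as the 47% versus 82% reported above, indicates dependence.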
One of the more interesting observations from our data analysis was a statistically significant dependency between respondents' perceived risk of the H1N1 vaccine and whether they chose to receive the H1N1 vaccine (Figure 5).
Figure 5. Bar graph showing the relationship between perceived risk of vaccine and vaccination status. The two tallest bars represent the people who did not receive the vaccine and perceived it as very high or somewhat high risk. We can see that those people are in the majority.
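
One common way to check such a dependency is Pearson's chi-square test on a contingency table of vaccination status by opinion level. The text does not state which test was actually used, so the sketch below is an assumption. The counts are hypothetical, and the statistic is computed by hand to keep the example dependency-free.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a
    contingency table given as a list of rows of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

# Hypothetical counts: rows = unvaccinated / vaccinated,
# columns = the five opinion levels (1 to 5).
observed_counts = [
    [300, 250, 50, 120, 30],
    [20, 40, 5, 80, 60],
]
stat, dof = chi_square_statistic(observed_counts)

# The critical value of the chi-square distribution at alpha = 0.05
# with 4 degrees of freedom is about 9.488.
CRITICAL_VALUE_DOF4 = 9.488
print(f"chi2 = {stat:.1f}, dof = {dof}, dependent: {stat > CRITICAL_VALUE_DOF4}")
```

A statistic above the critical value rejects the hypothesis that vaccination status and opinion level are independent.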
Since the dataset has two labels, it requires a multi-label classification approach. This means that each observation can have multiple labels assigned to it. In our case, these are yes/no for `H1N1_vaccine` and yes/no for `seasonal_vaccine`. There are many ways to approach multi-label problems; we chose to treat ours as two separate binary classification problems, each with its own model. We therefore converted the dataset into two single-label binary datasets and fit a binary classification model to each.
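
The conversion into two single-label datasets can be sketched as follows. The `records` sample and the helper `to_binary_datasets` are hypothetical; the two target column names are taken from the text above.

```python
# Hypothetical rows: features plus both vaccine labels in one record.
records = [
    {"h1n1_concern": 2, "doctor_recc_h1n1": 1, "H1N1_vaccine": 1, "seasonal_vaccine": 1},
    {"h1n1_concern": 0, "doctor_recc_h1n1": 0, "H1N1_vaccine": 0, "seasonal_vaccine": 1},
    {"h1n1_concern": 1, "doctor_recc_h1n1": 0, "H1N1_vaccine": 0, "seasonal_vaccine": 0},
]
TARGETS = ("H1N1_vaccine", "seasonal_vaccine")

def to_binary_datasets(rows, targets=TARGETS):
    """Return {target: (X, y)} with one single-label dataset per target.
    Both datasets share the same feature rows X; only y differs."""
    X = [{k: v for k, v in row.items() if k not in targets} for row in rows]
    return {t: (X, [row[t] for row in rows]) for t in targets}

datasets = to_binary_datasets(records)
X_h1n1, y_h1n1 = datasets["H1N1_vaccine"]
X_seas, y_seas = datasets["seasonal_vaccine"]
```

Each `(X, y)` pair can then be passed to its own binary classifier, so the two models are tuned and evaluated independently.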
Preliminary Models
As a first pass, we ran a grid search using all features to see how basic models would perform. The grid search was set to optimize for accuracy. Tests were run for both seasonal flu and H1N1, using accuracy and AUC as metrics on logistic regression, k-NN, multinomial Naive Bayes, and random forest classifiers, for a total of 16 tests. While the k-NN and random forest classifiers were severely overfit, logistic regression and multinomial Naive Bayes performed comparably. Scores for logistic regression are shown below:
Score | seasonal_vaccine | H1N1_vaccine |
---|---|---|
Accuracy | 0.779 | 0.848 |
AUC | 0.854 | 0.856 |
Next, knowing that H1N1 and seasonal flu vaccination were dependent on each other, we tried using seasonal flu vaccination as a predictor for H1N1 vaccination. While using the given target column for the seasonal flu vaccine yielded marginally better models for H1N1 vaccination, using a predicted value of seasonal flu vaccination as a predictor did not improve performance. AUROC scores comparing the preliminary models with models using the known `seasonal_vaccine` classes are shown below:
Model | Preliminary Model | Using seasonal_vaccine as predictor |
---|---|---|
Logistic Regression | 0.855 | 0.884 |
k-NN | 0.798 | 0.820 |
mNB | 0.789 | 0.803 |
Random Forest | 0.854 | 0.886 |
Final Models : revisiting metrics
For our final models, we chose Logistic Regression, XGBoost and Neural Networks.
To improve our models, we made the following changes:
- Since we were more interested in the people who did not get the vaccine, we made them our positive class. That is, we changed the classes for both target variables so that `1` represented the unvaccinated group, and `0` represented the vaccinated group.
- Since we wanted to correctly classify as many unvaccinated respondents as possible, we changed our optimizing metric from AUROC (area under the ROC curve) and accuracy to recall and precision.
For all our models, we recorded the following metrics:
Metric | Interpretation |
---|---|
Accuracy | The percentage of respondents classified correctly |
AUROC | Measure of overall classification performance |
Recall | The percentage of respondents who did not receive the vaccine who were classified correctly. |
Precision | The percentage of respondents classified as unvaccinated who were, in fact, unvaccinated. |
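
To make the definitions in the table concrete, the sketch below flips the labels so that unvaccinated is the positive class and then computes accuracy, precision, and recall from the confusion counts. The labels and predictions are hypothetical toy values, and the helper names are illustrative. AUROC is omitted because it requires predicted scores rather than hard class labels.

```python
def flip_labels(y):
    """Make the unvaccinated group the positive class (1 = unvaccinated)."""
    return [1 - label for label in y]

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, and recall for binary 0/1 labels (positive = 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Hypothetical original labels (1 = vaccinated) and model predictions
# expressed in the flipped encoding (1 = unvaccinated).
vaccinated = [1, 0, 0, 0, 1, 0, 0, 1]
y_true = flip_labels(vaccinated)        # [0, 1, 1, 1, 0, 1, 1, 0]
y_pred = [0, 1, 1, 0, 1, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In this encoding, a false negative is an unvaccinated respondent predicted as vaccinated, which is exactly the error that optimizing for recall tries to reduce.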
Final Models : Feature Engineering and balancing classes
With the XGBoost models, we also tried feature engineering and created some interaction features to feed into our models. The feature engineering performed is described in Table 3. Additionally, we compared two methods of balancing classes: using a random oversampler and SMOTE.
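
As an illustration of the first balancing method, random oversampling duplicates randomly chosen minority-class rows until the classes are the same size. The function below is a dependency-free sketch of that idea, not the library routine used in the project; `random_oversample` and the toy data are hypothetical. SMOTE differs in that it synthesizes new minority samples by interpolating between neighboring minority points instead of duplicating existing rows.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of each smaller class until every
    class has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        indices = [i for i, lab in enumerate(y) if lab == label]
        for _ in range(target - n):
            i = rng.choice(indices)  # sample with replacement
            X_out.append(X[i])
            y_out.append(label)
    return X_out, y_out

# Hypothetical imbalanced toy data: six rows of class 1, two rows of class 0.
X = [[0], [1], [2], [3], [4], [5], [6], [7]]
y = [1, 1, 1, 1, 1, 1, 0, 0]
X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))
```

Resampling should be applied to the training split only, so that duplicated rows do not leak into the evaluation data.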
We then ran different combinations of XGBoost models, with or without feature engineering and class balancing. A summary of results is presented below:
Optimized for | Features Used | Accuracy | Precision | Recall | Roc_Auc |
---|---|---|---|---|---|
roc_auc | All features, target classes switched (unvacc = 1) | 0.855 | 0.878 | 0.948 | 0.730 |
roc_auc | All features + seasonal flu, target classes switched | 0.871 | 0.894 | 0.949 | 0.766 |
Recall | All features, target classes switched | 0.847 | 0.854 | 0.971 | 0.679 |
Recall | All features + engineered features, target classes switched | 0.000 | 0.858 | 0.966 | 0.688 |
Recall | Same as above + RandomOverSampler | 0.810 | 0.930 | 0.821 | 0.796 |
Recall | Same as above but SMOTE | 0.000 | 0.864 | 0.955 | 0.701 |
Recall | Same as above, but gridsearched specifically with SMOTE balanced data | 0.951 | 0.899 | 0.939 | 0.732 |
Feature engineering and balancing classes did not improve the XGBoost model significantly. Therefore, we compared our three final model types: Logistic Regression, XGBoost and Neural Networks only on the cleaned features and no class balancing.
Final Models : Results
Below is a comparison of our models based on the recorded metrics:
H1N1 Vaccine
Model | Accuracy | AUROC | Precision | Recall |
---|---|---|---|---|
Logistic Regression | 0.803 | 0.789 | 0.928 | 0.813 |
XGBoost | 0.852 | 0.730 | 0.879 | 0.942 |
Neural Network | 0.823 | 0.897 | 0.790 | 0.873 |
Seasonal flu Vaccine
Model | Accuracy | AUROC | Precision | Recall |
---|---|---|---|---|
Logistic Regression | 0.789 | 0.788 | 0.804 | 0.799 |
XGBoost | 0.797 | 0.794 | 0.800 | 0.825 |
Neural Network | 0.800 | 0.864 | 0.800 | 0.834 |
Our second goal was to determine which questions were most useful in predicting if a respondent was unvaccinated. In order to do this, we determined the top 15 most important features using coefficients from Logistic Regression, and feature importances from XGBoost for both H1N1 and seasonal vaccine. Table 4 shows the complete list of top 15 features for seasonal flu and H1N1 vaccine.
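
Selecting the top features from a fitted model amounts to ranking by the magnitude of the coefficient or importance score. The sketch below assumes the scores are already available as a name-to-value mapping. The numbers are hypothetical, and `top_k_features` is an illustrative helper rather than the original code.

```python
def top_k_features(importances, k=15):
    """Rank features by absolute importance and keep the k largest.
    The absolute value matters for logistic regression coefficients,
    which can be negative; tree importances are already non-negative."""
    ranked = sorted(importances.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return [name for name, _ in ranked[:k]]

# Hypothetical logistic regression coefficients.
coefficients = {
    "doctor_recc_h1n1": -1.20,
    "opinion_h1n1_vacc_effective": -0.85,
    "health_insurance": 0.40,
    "household_adults": 0.05,
}
selected = top_k_features(coefficients, k=2)
print(selected)
```

The model can then be retrained on the selected columns only, and its metrics compared with those of the full-feature model.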
For H1N1_vaccine, we compared model performance when using either of these feature sets. Since this dataset has imbalanced classes, we also compared using the top 15 features as is against using them in combination with feature engineering and class balancing. Figure 8 shows a comparison of performance metrics when using the full feature set versus using only the top 15 features. Below is the summary of results:
Features | Accuracy | Precision | Recall | AUC |
---|---|---|---|---|
Logistic Regression feat.engg + SMOTE | 0.817 | 0.899 | 0.865 | 0.753 |
Logistic Regression feat as is + SMOTE | 0.824 | 0.901 | 0.873 | 0.759 |
Logistic Regression feat as is | 0.851 | 0.876 | 0.944 | 0.725 |
XGBoost feat.engg + SMOTE | 0.839 | 0.886 | 0.913 | 0.740 |
XGBoost feat as is + SMOTE | 0.839 | 0.886 | 0.913 | 0.739 |
XGBoost feat. as is | 0.847 | 0.871 | 0.945 | 0.714 |
Figure 8. Radar charts comparing performance metrics for predicting H1N1 vaccination using either only the top 15 features or the full feature set.
Our best model was an XGBoost trained on the 15 most important features from an XGBoost model used without any feature engineering or balancing of classes.
For the seasonal flu vaccine, we used the top 15 features from an XGBoost model and trained an XGBoost model on this subset of features. Figure 9 shows a comparison of performance metrics when using the full feature set versus using only the top 15 features. Below are the results:
Metric | With all features | With top 15 features |
---|---|---|
Accuracy | 0.797 | 0.849 |
Precision | 0.800 | 0.873 |
Recall | 0.825 | 0.956 |
Roc_Auc | 0.794 | 0.717 |
Figure 9. Radar charts comparing performance metrics for predicting seasonal flu vaccination using either only the top 15 features or the full feature set.
Given the performance of our models on the smaller subset of features we were able to conclude that these features are indeed predictive of whether a person is unvaccinated.
A better understanding of how behavioral, socioeconomic, and demographic characteristics are associated with personal vaccination patterns can provide guidance for future public health efforts.
We first aimed to predict which respondents did not receive the vaccine. To do this, we optimized our models for recall, compared multiple optimized models, and were able to make predictions with a recall of 0.94 for H1N1 and 0.83 for seasonal flu.
Our second aim was to identify the features in the dataset that were most predictive of whether a person did not get the vaccine. Using 15 of 35 features, our models predicted whether a person was unvaccinated with a recall of 0.94 for both H1N1 and seasonal flu vaccines. This strongly indicates that the features we selected were indeed predictive of vaccination outcome.
Finally, we aimed to recommend areas of focus to shorten the survey. Our models identified the most predictive features as:
- availability of health insurance
- perception of the vaccine (effectiveness, risks)
- knowledge of H1N1
- kind of dwelling
- doctor’s recommendation
- employment
- marital status
- behaviors
- level of education
- overall health
- age
Table 1. Data dictionary from DrivenData.
Variable | Description | Values |
---|---|---|
h1n1_concern | Level of concern about the H1N1 flu. | 0 = Not at all concerned; 1 = Not very concerned; 2 = Somewhat concerned; 3 = Very concerned. |
h1n1_knowledge | Level of knowledge about H1N1 flu. | 0 = No knowledge; 1 = A little knowledge; 2 = A lot of knowledge. |
behavioral_antiviral_meds | Has taken antiviral medications. | (binary) |
behavioral_avoidance | Has avoided close contact with others with flu-like symptoms. | 0 = No; 1 = Yes. |
behavioral_face_mask | Has bought a face mask. | 0 = No; 1 = Yes. |
behavioral_wash_hands | Has frequently washed hands or used hand sanitizer. | 0 = No; 1 = Yes. |
behavioral_large_gatherings | Has reduced time at large gatherings. | 0 = No; 1 = Yes. |
behavioral_outside_home | Has reduced contact with people outside of own household. | 0 = No; 1 = Yes. |
behavioral_touch_face | Has avoided touching eyes, nose, or mouth. | 0 = No; 1 = Yes. |
doctor_recc_h1n1 | H1N1 flu vaccine was recommended by doctor. | 0 = No; 1 = Yes. |
doctor_recc_seasonal | Seasonal flu vaccine was recommended by doctor. | 0 = No; 1 = Yes. |
chronic_med_condition | Has any of the following chronic medical conditions: asthma or another lung condition, diabetes, a heart condition, a kidney condition, sickle cell anemia or other anemia, a neurological or neuromuscular condition, a liver condition, or a weakened immune system caused by a chronic illness or by medicines taken for a chronic illness. | 0 = No; 1 = Yes. |
child_under_6_months | Has regular close contact with a child under the age of six months. | 0 = No; 1 = Yes. |
health_worker | Is a healthcare worker. | 0 = No; 1 = Yes. |
health_insurance | Has health insurance. | 0 = No; 1 = Yes. |
opinion_h1n1_vacc_effective | Respondent's opinion about H1N1 vaccine effectiveness. | 1 = Not at all effective; 2 = Not very effective; 3 = Don't know; 4 = Somewhat effective; 5 = Very effective. |
opinion_h1n1_risk | Respondent's opinion about risk of getting sick with H1N1 flu without vaccine. | 1 = Very Low; 2 = Somewhat low; 3 = Don't know; 4 = Somewhat high; 5 = Very high. |
opinion_h1n1_sick_from_vacc | Respondent's worry of getting sick from taking H1N1 vaccine. | 1 = Not at all worried; 2 = Not very worried; 3 = Don't know; 4 = Somewhat worried; 5 = Very worried. |
opinion_seas_vacc_effective | Respondent's opinion about seasonal flu vaccine effectiveness. | 1 = Not at all effective; 2 = Not very effective; 3 = Don't know; 4 = Somewhat effective; 5 = Very effective. |
opinion_seas_risk | Respondent's opinion about risk of getting sick with seasonal flu without vaccine. | 1 = Very Low; 2 = Somewhat low; 3 = Don't know; 4 = Somewhat high; 5 = Very high. |
opinion_seas_sick_from_vacc | Respondent's worry of getting sick from taking seasonal flu vaccine. | 1 = Not at all worried; 2 = Not very worried; 3 = Don't know; 4 = Somewhat worried; 5 = Very worried. |
age_group | Age group of respondent. | '65+ Years', '55 - 64 Years', '45 - 54 Years', '18 - 34 Years', '35 - 44 Years' |
education | Self-reported education level. | College Graduate, Some College, 12 Years, < 12 Years |
race | Race of respondent. | 'White', 'Black', 'Hispanic', 'Other or Multiple' |
sex | Sex of respondent. | Male, Female |
income_poverty | Household annual income of respondent with respect to 2008 Census poverty thresholds. | '<= USD 75,000, Above Poverty', '> USD 75,000', 'Below Poverty' |
marital_status | Marital status of respondent. | Married, Not Married |
rent_or_own | Housing situation of respondent. | Rent, Own |
employment_status | Employment status of respondent. | Employed, Not in Labor Force, Unemployed |
hhs_geo_region | Respondent's residence using a 10-region geographic classification defined by the U.S. Dept. of Health and Human Services. | Values are represented as short random character strings. |
census_msa | Respondent's residence within metropolitan statistical areas (MSA) as defined by the U.S. Census. | 'MSA, Not Principle City', 'MSA, Principle City', 'Non-MSA' |
household_adults | Number of other adults in household | top-coded to 3. |
household_children | Number of children in household | top-coded to 3. |
employment_industry | Type of industry respondent is employed in. | Values are represented as short random character strings. |
employment_occupation | Type of occupation of respondent. | Values are represented as short random character strings. |
Table 2. Summary of null replacements.
Feature | Null Count | Replacement Method |
---|---|---|
employment_occupation | 13470 | 'no_response' |
employment_industry | 13330 | 'no_response' |
health_insurance | 12274 | 'no_response' |
income_poverty | 4423 | 'no_response' |
doctor_recc_h1n1 | 2160 | 'no_response' |
doctor_recc_seasonal | 2160 | 'no_response' |
rent_or_own | 2042 | 'no_response' |
employment_status | 1463 | 'no_response' |
marital_status | 1408 | 'no_response' |
education | 1407 | 'no_response' |
chronic_med_condition | 971 | 'no_response' |
child_under_6_months | 820 | by age_group |
health_worker | 804 | 'no_response' |
opinion_seas_sick_from_vacc | 537 | median |
opinion_seas_risk | 514 | median |
opinion_seas_vacc_effective | 462 | median |
opinion_h1n1_sick_from_vacc | 395 | median |
opinion_h1n1_vacc_effective | 391 | median |
opinion_h1n1_risk | 388 | median |
household_adults | 249 | by age_group |
household_children | 249 | by age_group |
behavioral_avoidance | 208 | median |
behavioral_touch_face | 128 | median |
h1n1_knowledge | 116 | median |
h1n1_concern | 92 | median |
behavioral_large_gatherings | 87 | median |
behavioral_outside_home | 82 | median |
behavioral_antiviral_meds | 71 | median |
behavioral_wash_hands | 42 | median |
behavioral_face_mask | 19 | median |
Table 3. Feature engineering performed.
Feature | Description |
---|---|
doctor_recc_h1n1 | Values changed: 'no_response': 0, 0: 1, 1: 2 |
doctor_recc_seasonal | Values changed: 'no_response': 0, 0: 1, 1: 2 |
chronic_med_condition | Values changed: 'no_response': 0, 0: 1, 1: 2 |
age_group | Values changed: '18 - 34 Years': 0, '35 - 44 Years': 1, '45 - 54 Years': 2, '55 - 64 Years': 3, '55 - 64 Years': 4, '65+ Years': 5 |
health_worker | Values changed: 'no_response': 0, 0: 1, 1: 2 |
h1n1_all | Interaction term: h1n1_concern * h1n1_knowledge |
behavioral_all | Interaction term: behavioral_antiviral_meds * behavioral_avoidance * behavioral_face_mask * behavioral_wash_hands * behavioral_large_gatherings * behavioral_outside_home * behavioral_touch_face |
opinion_all | Interaction term: opinion_h1n1_vacc_effective * opinion_h1n1_risk * opinion_h1n1_sick_from_vacc * opinion_seas_vacc_effective * opinion_seas_risk * opinion_seas_sick_from_vacc |
health_worker_by_age | Interaction term: health_worker * age_group |
doctor_recc_all | Interaction term: doctor_recc_h1n1 * doctor_recc_seasonal |
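
The interaction terms listed above are plain products of existing columns. A minimal sketch of how one row could be extended is shown below. The `add_interaction` helper and the example row are hypothetical, and the recoded doctor recommendation values are assumed to follow the 0/1/2 mapping given in the table.

```python
from math import prod

def add_interaction(record, name, features):
    """Return a copy of `record` with `name` set to the product of `features`."""
    out = dict(record)
    out[name] = prod(record[f] for f in features)
    return out

# Hypothetical respondent row after recoding.
row = {
    "h1n1_concern": 3,
    "h1n1_knowledge": 2,
    "doctor_recc_h1n1": 2,      # recoded: 1 -> 2
    "doctor_recc_seasonal": 1,  # recoded: 0 -> 1
}
row = add_interaction(row, "h1n1_all", ["h1n1_concern", "h1n1_knowledge"])
row = add_interaction(row, "doctor_recc_all", ["doctor_recc_h1n1", "doctor_recc_seasonal"])
print(row["h1n1_all"], row["doctor_recc_all"])
```

Note that any factor equal to 0 makes the whole product 0, so products over many binary columns, such as `behavioral_all`, are nonzero only when every behavior is reported.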
Table 4. Most important features for the H1N1 and seasonal flu vaccine models.
H1N1 Vaccine | Seasonal Vaccine |
---|---|
opinion_h1n1_vacc_effective | doctor_recc_seasonal |
health_insurance | opinion_seas_vacc_effective |
opinion_h1n1_risk1 | opinion_seas_risk |
h1n1_knowledge | age_group |
behavioral_large_gatherings | health_insurance |
marital_status | health_worker |
rent_or_own | rent_or_own |
behavioral_touch_face | employment_industry |
opinion_seas_risk | h1n1_knowledge |
opinion_seas_vacc_effective | doctor_recc_h1n1 |
education | employment_occupation |
behavioral_avoidance | opinion_seas_sick_from_vacc |
doctor_recc_h1n1 | chronic_med_condition |
behavioral_outside_home | opinion_h1n1_risk |
household_adults | education |
h1n1_vaccine | h1n1_vaccine |