The aim of this project was to predict cancer mortality rates in "unseen" US counties from training data. The training data comprises county-level features (predictors) covering socio-economic characteristics, among other types of information. OLS, Ridge, Lasso, and random forest models were trained and compared for accuracy. The most accurate model was then evaluated on a final unseen test dataset.
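The comparison described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it assumes scikit-learn, uses synthetic data from `make_regression` in place of the county dataset, picks RMSE on a held-out validation split as the accuracy measure, and uses placeholder hyperparameters (`alpha`, `n_estimators`) that are not taken from the project.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the county data (assumption: the real features and
# target are not reproduced here).
X, y = make_regression(n_samples=300, n_features=10, noise=15.0, random_state=0)

# Hold out a validation split used only for comparing models.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Hyperparameters are illustrative placeholders, not tuned values.
models = {
    "OLS": make_pipeline(StandardScaler(), LinearRegression()),
    "Ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    "Lasso": make_pipeline(StandardScaler(), Lasso(alpha=0.1)),
    "Random forest": RandomForestRegressor(n_estimators=100, random_state=0),
}


def validation_rmse(model):
    """Fit on the training split and return RMSE on the validation split."""
    model.fit(X_train, y_train)
    preds = model.predict(X_val)
    return float(np.sqrt(mean_squared_error(y_val, preds)))


scores = {name: validation_rmse(model) for name, model in models.items()}
best_name = min(scores, key=scores.get)

# Report models from lowest to highest validation error.
for name, rmse in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{name}: RMSE = {rmse:.2f}")
print("Lowest validation RMSE:", best_name)
```

Scaling is applied inside a pipeline for the linear models so that the Ridge and Lasso penalties treat all features on a comparable scale; the random forest is left unscaled because tree splits do not depend on feature scale.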
Data Dictionary for training dataset:
avgAnnCount: Mean number of reported cases of cancer diagnosed annually
avgDeathsPerYear: Mean number of reported deaths due to cancer per year
incidenceRate: Mean cancer diagnoses per 100,000 residents
medianIncome: Median income per county
popEst2015: Population of county
povertyPercent: Percent of populace in poverty
MedianAge: Median age of county residents
MedianAgeMale: Median age of male county residents
MedianAgeFemale: Median age of female county residents
AvgHouseholdSize: Mean household size of county
PercentMarried: Percent of county residents who are married
PctNoHS18_24: Percent of county residents ages 18-24 whose highest education attained is less than high school
PctHS18_24: Percent of county residents ages 18-24 whose highest education attained is a high school diploma
PctSomeCol18_24: Percent of county residents ages 18-24 whose highest education attained is some college
PctBachDeg18_24: Percent of county residents ages 18-24 whose highest education attained is a bachelor's degree
PctHS25_Over: Percent of county residents ages 25 and over whose highest education attained is a high school diploma
PctBachDeg25_Over: Percent of county residents ages 25 and over whose highest education attained is a bachelor's degree
PctEmployed16_Over: Percent of county residents ages 16 and over employed
PctUnemployed16_Over: Percent of county residents ages 16 and over unemployed
PctPrivateCoverage: Percent of county residents with private health coverage
PctPrivateCoverageAlone: Percent of county residents with private health coverage alone (no public assistance)
PctEmpPrivCoverage: Percent of county residents with employee-provided private health coverage
PctPublicCoverage: Percent of county residents with government-provided health coverage
PctPubliceCoverageAlone: Percent of county residents with government-provided health coverage alone
PctWhite: Percent of county residents who identify as White
PctBlack: Percent of county residents who identify as Black
PctAsian: Percent of county residents who identify as Asian
PctOtherRace: Percent of county residents who identify in a category which is not White, Black, or Asian
PctMarriedHouseholds: Percent of married households
BirthRate: Number of live births relative to number of women in county
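
Many of the fields in the data dictionary above are percentages, so one simple sanity check is to confirm that they fall between 0 and 100. The sketch below shows one way to do that with the standard library. The rows in `SAMPLE_CSV` are made-up values for illustration only and are not taken from the real dataset; `find_out_of_range` is a hypothetical helper, and only a subset of the percentage columns is listed.

```python
import csv
import io

# A subset of the columns documented as percentages in the data dictionary.
PERCENT_COLUMNS = ["povertyPercent", "PercentMarried", "PctWhite", "PctBlack"]

# Made-up rows for illustration only; not values from the real dataset.
SAMPLE_CSV = """povertyPercent,PercentMarried,PctWhite,PctBlack
12.5,52.1,80.0,10.0
18.0,47.3,120.0,5.5
"""


def find_out_of_range(rows, columns, low=0.0, high=100.0):
    """Return (row_index, column, value) for each value outside [low, high]."""
    problems = []
    for i, row in enumerate(rows):
        for col in columns:
            value = float(row[col])
            if not low <= value <= high:
                problems.append((i, col, value))
    return problems


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
problems = find_out_of_range(rows, PERCENT_COLUMNS)
print(problems)  # [(1, 'PctWhite', 120.0)]
```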