Supervised Learning

Project: Finding Donors for CharityML

Install

This project requires Python 3.x and the following Python libraries installed:

You will also need software installed to run an iPython Notebook.

Code

Code is provided in the finding_donors.ipynb notebook file. The notebook uses the visuals.py Python file and the census.csv dataset file. Three models were initially selected for analysis: Linear SVC, Decision Tree Classifier, and Gradient Boosting Classifier. Of the three, the Gradient Boosting Classifier was selected as the most promising.

Run

In a terminal or command window, navigate to the top-level project directory finding_donors/ (that contains this README) and run one of the following commands:

ipython notebook finding_donors.ipynb

or

jupyter notebook finding_donors.ipynb

This will open the iPython Notebook software and project file in your browser.

Overview

The notebook applies supervised learning techniques to data collected for the U.S. census to help CharityML (a fictitious charity organization) identify people most likely to donate. The data was analyzed, and a series of transformations and pre-processing steps was applied to put it into a workable format. Three sklearn models, Linear SVC, Decision Tree Classifier, and Gradient Boosting Classifier (GBC), were evaluated to find the best solution. Based on the evaluation results, the GBC was selected as the most promising and was then optimized using sklearn grid search. Finally, feature importance was analyzed, that is, how much each feature contributes to the predictions of the chosen algorithm.
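The workflow described above can be sketched roughly as follows. This is a minimal illustration rather than the notebook's actual code: the synthetic data from make_classification stands in for the preprocessed census.csv, and the function name compare_and_tune, the parameter grid, and the split sizes are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier


def compare_and_tune(random_state=0):
    """Compare three classifiers, then grid-search the gradient boosting model."""
    # Synthetic stand-in for the preprocessed census data (assumption).
    X, y = make_classification(
        n_samples=400, n_features=10, random_state=random_state
    )
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=random_state
    )

    models = {
        "LinearSVC": LinearSVC(max_iter=5000, random_state=random_state),
        "DecisionTree": DecisionTreeClassifier(random_state=random_state),
        "GradientBoosting": GradientBoostingClassifier(random_state=random_state),
    }

    # Score every candidate on the held-out split with F-score (beta = 0.5).
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = fbeta_score(y_test, model.predict(X_test), beta=0.5)

    # Hypothetical grid; the notebook's real grid is not given in this README.
    param_grid = {"n_estimators": [50, 100], "max_depth": [2, 3]}
    search = GridSearchCV(
        GradientBoostingClassifier(random_state=random_state),
        param_grid,
        scoring=make_scorer(fbeta_score, beta=0.5),
        cv=3,
    )
    search.fit(X_train, y_train)

    best = search.best_estimator_
    tuned_score = fbeta_score(y_test, best.predict(X_test), beta=0.5)
    # One importance value per input feature, from the tuned model.
    return scores, search.best_params_, tuned_score, best.feature_importances_
```

Calling compare_and_tune() returns the per-model scores, the best grid parameters, the tuned model's test score, and its feature importances. The numbers it produces on synthetic data say nothing about the results on the census dataset.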

Data

The modified census dataset consists of approximately 32,000 data points, with each data point having 13 features. This dataset is a modified version of the dataset published in the paper "Scaling Up the Accuracy of Naive-Bayes Classifiers: a Decision-Tree Hybrid", by Ron Kohavi. You may find this paper online, with the original dataset hosted on UCI.

Features

  • age: Age
  • workclass: Working Class (Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked)
  • education_level: Level of Education (Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool)
  • education-num: Number of educational years completed
  • marital-status: Marital status (Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse)
  • occupation: Work Occupation (Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces)
  • relationship: Relationship Status (Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried)
  • race: Race (White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black)
  • sex: Sex (Female, Male)
  • capital-gain: Monetary Capital Gains
  • capital-loss: Monetary Capital Losses
  • hours-per-week: Average Hours Per Week Worked
  • native-country: Native Country (United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands)

Target Variable

  • income: Income Class (<=50K, >50K)
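Because most of the features above are categorical and the target is a two-valued label, they have to be turned into numbers before a model can use them. The sketch below shows one common way to do that, one-hot encoding plus a 0/1 target, in plain Python. The helper names encode_income and one_hot and the sample rows are hypothetical, and the notebook may well use a library routine such as pandas.get_dummies instead.

```python
def encode_income(label):
    """Map the income class to 1 for '>50K' and 0 for '<=50K'."""
    label = label.strip()
    if label not in ("<=50K", ">50K"):
        raise ValueError(f"unexpected income label: {label!r}")
    return 1 if label == ">50K" else 0


def one_hot(rows, column):
    """Replace a categorical column with one 0/1 column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every other column unchanged.
        new_row = {key: value for key, value in row.items() if key != column}
        for category in categories:
            new_row[f"{column}_{category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


# Hypothetical sample rows using column names from the feature list.
sample = [
    {"age": 39, "sex": "Male", "income": "<=50K"},
    {"age": 50, "sex": "Female", "income": ">50K"},
]
features = one_hot(sample, "sex")
targets = [encode_income(row["income"]) for row in sample]
```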

Results

The optimized Gradient Boosting Classifier achieves an accuracy of 0.8689 and an F-score (beta = 0.5) of 0.7483 on the test set.
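For reference, the F-score with beta = 0.5 weights precision more heavily than recall. The sketch below computes it from scratch in plain Python to make the formula explicit. The function names and the toy labels are illustrative only, and in the notebook the value would more likely come from sklearn's fbeta_score.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def fbeta(precision, recall, beta=0.5):
    """F-beta = (1 + beta^2) * P * R / (beta^2 * P + R)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Toy labels, not taken from the census data.
p, r = precision_recall([1, 1, 0, 0], [1, 0, 1, 0])
score = fbeta(p, r, beta=0.5)
```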