A machine learning project developed as part of the IBM SkillsBuild Internship.
This project aims to build a binary classification machine learning model that predicts whether an individual's annual salary exceeds $50,000 based on various demographic and professional features.
Key steps in this project include:
- Data loading and cleaning
- Exploratory Data Analysis (EDA)
- Model training and evaluation
- Deployment using Streamlit for interactive usage
The dataset, `Salary_List.csv`, is a cleaned and modified version of the UCI Adult Income dataset. It contains the following features:
| Feature | Description |
|---|---|
| `age` | Age of the individual |
| `workclass` | Type of employer (e.g., Private, Federal-gov) |
| `fnlwgt` | Final weight (sampling weight) |
| `education` | Education level (e.g., Bachelors, HS-grad) |
| `education-num` | Numeric representation of education |
| `marital-status` | Marital status |
| `occupation` | Occupation type |
| `relationship` | Relationship status |
| `race` | Race |
| `sex` | Gender |
| `capital-gain` | Capital gains |
| `capital-loss` | Capital losses |
| `hours-per-week` | Hours worked per week |
| `native-country` | Country of origin |
| `salary` | Target label (`<=50K` or `>50K`) |
- Handled missing values (`?`)
- Removed irrelevant entries (`Without-pay`, `Never-worked`)
- Dropped redundant column (`education`, since `education-num` provides a numeric representation)
- Treated outliers in `age` and `education-num`
- Encoded categorical features using Label Encoding
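
The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code. The inline sample rows, the `preprocess` helper, the `"Unknown"` placeholder for `?`, and the `AGE_RANGE` / `EDU_NUM_RANGE` bounds are all assumptions made for the example. Label encoding is approximated here with pandas category codes, which assign integers in sorted order much like scikit-learn's `LabelEncoder`.

```python
import io

import pandas as pd

# Tiny made-up sample standing in for Salary_List.csv (only a few columns).
SAMPLE_CSV = """age,workclass,education,education-num,occupation,salary
39,Private,Bachelors,13,Adm-clerical,<=50K
50,?,HS-grad,9,?,<=50K
28,Without-pay,HS-grad,9,Farming-fishing,<=50K
45,Federal-gov,Masters,14,Exec-managerial,>50K
90,Private,Doctorate,16,Prof-specialty,>50K
31,Never-worked,Some-college,10,?,<=50K
"""

# Hypothetical outlier bounds; the project does not state the ones it used.
AGE_RANGE = (17, 75)
EDU_NUM_RANGE = (5, 16)


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps listed above to a raw salary DataFrame."""
    df = df.copy()

    # 1. Handle missing values, which the dataset marks with "?".
    #    Assumption: keep the rows and relabel "?" as "Unknown".
    text_cols = df.select_dtypes(exclude="number").columns
    df[text_cols] = df[text_cols].replace("?", "Unknown")

    # 2. Remove irrelevant workclass entries.
    df = df[~df["workclass"].isin(["Without-pay", "Never-worked"])]

    # 3. Drop the redundant column; education-num carries the same information.
    df = df.drop(columns=["education"])

    # 4. Treat outliers. Assumption: drop rows outside the chosen bounds.
    in_range = df["age"].between(*AGE_RANGE) & df["education-num"].between(*EDU_NUM_RANGE)
    df = df[in_range].copy()

    # 5. Label-encode the remaining categorical columns as integer codes.
    for col in df.select_dtypes(exclude="number").columns:
        df[col] = df[col].astype("category").cat.codes

    return df.reset_index(drop=True)


clean = preprocess(pd.read_csv(io.StringIO(SAMPLE_CSV)))
print(clean)
```

To run the same steps on the real data, replace the `io.StringIO(SAMPLE_CSV)` argument with the path to `Salary_List.csv`. If the model is served from `app.py`, the same encoding must be applied to user input there, so in practice the fitted category mappings would need to be saved alongside the model.
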
The repository is organized as follows:

    Employee_Salary_Prediction/
    │
    ├── app.py                    # Streamlit app
    ├── model.joblib              # Saved ML model
    ├── Salary_List.csv           # Dataset
    ├── salary_prediction.ipynb   # Jupyter notebook for development
    ├── requirements.txt          # Required libraries
    └── README.md                 # Project documentation

Clone the repository, install the dependencies, register your ngrok authtoken, and start the app:

    git clone https://github.com/samuelcodes18/Employee_Salary_Prediction.git
    cd Employee_Salary_Prediction
    pip install -r requirements.txt
    ngrok authtoken YOUR_NGROK_AUTHTOKEN
    streamlit run app.py