This project predicts whether an employee earns more than $50K per year using demographic and work-related features. It demonstrates a complete machine learning workflow, from data exploration to deployment via a Streamlit web app.
The goal is to classify employees into two salary classes (>50K or <=50K) based on features such as age, education, occupation, and more. The workflow includes data cleaning, preprocessing, feature engineering, model training, evaluation, and deployment.
- Data cleaning and preprocessing
- Feature selection using Random Forest importance
- Model training (Logistic Regression, Decision Tree, Random Forest, Gradient Boosting)
- Model evaluation and comparison
- Saving the best model pipeline
- Interactive Streamlit app for single and batch predictions
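The feature-selection step listed above can be illustrated with a minimal sketch. It is not the notebook's actual code: the helper name `rank_features`, the integer coding of categorical columns, the hyperparameters, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def rank_features(X: pd.DataFrame, y, top_k: int = 6, seed: int = 0) -> list:
    """Return the top_k column names ranked by Random Forest importance."""
    encoded = X.copy()
    # Integer-code every non-numeric column so the forest can consume it.
    for col in encoded.select_dtypes(exclude="number").columns:
        encoded[col] = encoded[col].astype("category").cat.codes
    forest = RandomForestClassifier(n_estimators=200, random_state=seed)
    forest.fit(encoded, y)
    importances = pd.Series(forest.feature_importances_, index=encoded.columns)
    return importances.sort_values(ascending=False).head(top_k).index.tolist()


# Synthetic stand-in for the dataset (assumption): the label depends only
# on age and hours-per-week, so those two columns should rank highest.
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "age": rng.integers(18, 70, n),
    "hours-per-week": rng.integers(10, 60, n),
    "workclass": rng.choice(["Private", "Self-emp", "Gov"], n),
    "noise": rng.normal(size=n),
})
y = (X["age"] + X["hours-per-week"] > 80).astype(int)

print(rank_features(X, y, top_k=2))
```

On the real dataset, the same ranking would be computed on the cleaned training data, and the top-ranked columns kept as model inputs.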
.
├── Employee_Salary_Prediction.ipynb # Main notebook with ML workflow
├── app.py # Streamlit app for predictions
├── salary_pipeline_streamlit.pkl # Saved ML pipeline for deployment
├── adult_3.csv # Dataset (not included here)
└── README.md # Project documentation
- Clone the repository
  git clone <repo-url>
  cd "Employee Salary Prediction"
- Install dependencies
  It is recommended to use a virtual environment.
  pip install -r requirements.txt
  Or manually install:
  pip install pandas numpy scikit-learn streamlit joblib matplotlib
To run the Streamlit app, execute the following command in your terminal:
streamlit run app.py
This will launch the app in your default web browser. You can enter employee details in the sidebar to get a salary class prediction, or upload a CSV for batch predictions.
To predict salary classes for multiple employees at once:
- Prepare a CSV file with the following columns (matching the training features): age, workclass, education, occupation, hours-per-week, educational-num
- Use the "Batch Prediction" section in the app to upload your CSV.
- Download the results with predicted classes.
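Before uploading, it can help to confirm that a CSV header contains every expected column. The sketch below uses only the standard library; the helper `missing_columns` is hypothetical (it is not part of `app.py`), and the sample row values are illustrative.

```python
import csv
import io

# Column names taken from the list above.
EXPECTED_COLUMNS = [
    "age", "workclass", "education",
    "occupation", "hours-per-week", "educational-num",
]


def missing_columns(csv_text: str) -> list:
    """Return the expected columns that are absent from the CSV header."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # empty input yields an empty header
    present = {name.strip() for name in header}
    return [col for col in EXPECTED_COLUMNS if col not in present]


# Illustrative file content that lacks the educational-num column.
sample = (
    "age,workclass,education,occupation,hours-per-week\n"
    "39,Private,Bachelors,Sales,40\n"
)
print(missing_columns(sample))  # ['educational-num']
```

An empty list means the header has all six columns and the file is ready for the batch upload.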
- Features Used:
  - Only the most important features (selected via Random Forest feature importance): age, workclass, education, occupation, hours-per-week, educational-num
- Preprocessing:
  - Handles missing values, encodes categorical variables, and scales numerical features
- Model Selection:
  - Compared Logistic Regression, Decision Tree, Random Forest, Gradient Boosting
  - Selected the best model (Gradient Boosting) based on accuracy and F1-score
- Deployment:
  - The final pipeline (preprocessing + scaler + model) is saved as salary_pipeline_streamlit.pkl
- Model Performance:
  - The model achieved an accuracy of 85.5% and an F1-score of 0.67.
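The preprocessing and model-comparison steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the notebook's code: the imputation strategies, one-hot encoding, 75/25 split, hyperparameters, helper names (`build_pipeline`, `compare_models`), and the synthetic data are all hypothetical, and the scores it prints are unrelated to the reported results.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

NUMERIC = ["age", "hours-per-week", "educational-num"]
CATEGORICAL = ["workclass", "education", "occupation"]


def build_pipeline(model) -> Pipeline:
    """Bundle imputation, encoding, scaling, and the model in one object."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ])
    preprocess = ColumnTransformer([
        ("num", numeric, NUMERIC),
        ("cat", categorical, CATEGORICAL),
    ])
    return Pipeline([("preprocess", preprocess), ("model", model)])


def compare_models(X, y, seed: int = 0) -> dict:
    """Fit each candidate on a train split; report test accuracy and F1."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y
    )
    candidates = {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Decision Tree": DecisionTreeClassifier(random_state=seed),
        "Random Forest": RandomForestClassifier(random_state=seed),
        "Gradient Boosting": GradientBoostingClassifier(random_state=seed),
    }
    scores = {}
    for name, model in candidates.items():
        pipe = build_pipeline(model).fit(X_train, y_train)
        pred = pipe.predict(X_test)
        scores[name] = {
            "accuracy": accuracy_score(y_test, pred),
            "f1": f1_score(y_test, pred),
        }
    return scores


# Synthetic stand-in for the cleaned dataset (assumption, not real data).
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "age": rng.integers(18, 70, n),
    "hours-per-week": rng.integers(10, 60, n),
    "educational-num": rng.integers(1, 17, n),
    "workclass": rng.choice(["Private", "Self-emp", "Gov"], n),
    "education": rng.choice(["HS-grad", "Bachelors", "Masters"], n),
    "occupation": rng.choice(["Sales", "Tech-support", "Craft-repair"], n),
})
signal = X["age"] + 3 * X["educational-num"] + rng.normal(0, 5, n)
y = (signal > 70).astype(int)

scores = compare_models(X, y)
for name, metrics in scores.items():
    print(f"{name}: accuracy={metrics['accuracy']:.3f}, f1={metrics['f1']:.3f}")
```

Keeping preprocessing and the model inside one `Pipeline` means the object saved for deployment applies the same transformations to new inputs as it did during training.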
The Streamlit app provides:
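- Single Prediction:
  - Enter employee details in the sidebar to get a salary class prediction.
- Batch Prediction:
  - Upload a CSV of employee records and download the results with predicted classes.
- User-Friendly UI:
  - Clean, interactive interface for both individual and batch use cases.
  - Preview of uploaded data and predictions.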
Note: The app requires salary_pipeline_streamlit.pkl to be present in the project directory.
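One way to guard against a missing pipeline file is to check for it before loading. The sketch below assumes the file was written with `joblib.dump`; the helper `load_pipeline` and its error message are hypothetical and may differ from what `app.py` does.

```python
from pathlib import Path

import joblib

# File name from this README; resolved relative to the working directory.
PIPELINE_PATH = Path("salary_pipeline_streamlit.pkl")


def load_pipeline(path: Path = PIPELINE_PATH):
    """Load the saved pipeline, failing early with a clear message."""
    if not path.is_file():
        raise FileNotFoundError(
            f"{path} not found; place the saved pipeline in the project "
            "directory before starting the app."
        )
    return joblib.load(path)
```

The loaded object can then be used for predictions on a DataFrame that has the six training columns.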
This project is licensed under the MIT License.
This project was completed under the IBM internship program taught by Edunet Foundation through AICTE.