Identifying at-risk students before they fail — turning data into intervention.
Every year, universities lose students to preventable academic failure. By the time a student fails their final exam, it's too late to help them. The warning signs were there — declining attendance, missed assignments, poor midterm grades — but no one connected the dots in time.
The cost is real:
- Students lose tuition money and time
- Universities lose retention revenue
- Society loses potential graduates
This project builds an Early Warning System that:
- Identifies at-risk students using a weighted risk score based on attendance, assignments, and exam performance
- Prioritizes intervention with urgency scoring that surfaces the most critical cases first
- Recommends specific actions tailored to each student's unique risk profile
- Tracks accountability with clear ownership for every intervention
The system transforms raw data into actionable insights that counselors, teachers, and administrators can use immediately.
| Metric | Finding |
|---|---|
| Risk Identification | 35.8% of students flagged as high-risk before final exams |
| Attendance Impact | Students below 70% attendance have an 80%+ failure rate |
| Parental Factor | Low parental involvement = 75.7% high-risk rate (vs 0.8% for high involvement) |
| Interventions Generated | 661 specific, actionable recommendations with assigned owners |
Student_Performance_Project/
├── data/
│   ├── raw_student_data.csv          # Original data with realistic flaws
│   ├── cleaned_student_data.csv      # Cleaned data with risk scores
│   ├── student_performance.db        # SQLite database
│   └── reports/
│       ├── executive_summary.txt     # Principal's weekly overview
│       ├── individual_student_reports.txt
│       └── intervention_tracker.csv  # Actionable task list
├── scripts/
│   ├── generate_student_data.py      # Synthetic data generator
│   ├── sql_integration.py            # Database creation & queries
│   └── recommendation_engine.py      # Intervention logic
├── analysis.ipynb                    # Full EDA notebook
└── README.md
Since real student data is protected by FERPA, I created a synthetic dataset that mirrors actual academic patterns:
- Correlated variables: Attendance drives assignment scores, which in turn drive exam grades
- Realistic flaws: Missing values, data entry errors (attendance > 100%), outliers
- Categorical distributions: Parental involvement weighted toward "Medium" (most common)
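A minimal sketch of how a dataset with these properties could be generated with NumPy and pandas. This is not the project's actual `generate_student_data.py`: the function name, the distribution parameters, the `Parental_Involvement` column name, and the flaw rates are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd


def make_synthetic_students(n: int = 500, seed: int = 42) -> pd.DataFrame:
    """Hypothetical generator: attendance -> assignments -> midterm, plus injected flaws."""
    rng = np.random.default_rng(seed)

    # Correlated chain: each variable depends on the previous one plus noise.
    attendance = np.clip(rng.normal(82, 12, n), 0, 100)
    assignments = np.clip(0.7 * attendance + rng.normal(15, 10, n), 0, 100)
    midterm = np.clip(0.6 * assignments + 0.2 * attendance + rng.normal(10, 8, n), 0, 100)

    # Categorical distribution weighted toward "Medium" (probabilities are assumed).
    involvement = rng.choice(["Low", "Medium", "High"], size=n, p=[0.25, 0.50, 0.25])

    df = pd.DataFrame({
        "Attendance_Pct": np.round(attendance, 1),
        "Assignment_Avg": np.round(assignments, 1),
        "Midterm_Grade": np.round(midterm, 1),
        "Parental_Involvement": involvement,
    })

    # Realistic flaw 1: data entry errors that push attendance above 100%.
    bad_rows = rng.choice(n, size=max(1, n // 50), replace=False)
    df.loc[bad_rows, "Attendance_Pct"] = np.round(rng.uniform(101, 120, len(bad_rows)), 1)

    # Realistic flaw 2: missing assignment averages.
    missing_rows = rng.choice(n, size=max(1, n // 20), replace=False)
    df.loc[missing_rows, "Assignment_Avg"] = np.nan

    return df
```

Fixing the seed keeps the generated file reproducible, so the cleaning steps and the reported numbers can be re-run against the same data.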
Every cleaning decision is documented with business justification:
# Cap attendance at 100% — don't delete records, just fix the impossible value
# Why? The student exists, their other data is valid, we preserve sample size
df['Attendance_Pct'] = df['Attendance_Pct'].clip(upper=100)

A transparent, explainable formula that stakeholders can understand:
Risk Score = (100 - Attendance) × 0.35
+ (100 - Assignment_Avg) × 0.35
+ (100 - Midterm_Grade) × 0.30
Why these weights?
- Attendance (35%): Observable, actionable, schools can track it
- Assignments (35%): Shows consistent effort, early warning signal
- Midterm (30%): Direct outcome measure, but happens late in the semester
- Study Hours excluded: Self-reported, unreliable
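The weighted formula above can be sketched in pandas as follows. The weights and the `Attendance_Pct`, `Assignment_Avg`, and `Midterm_Grade` names come from this README; the `add_risk_score` helper and the `Risk_Score` column name are hypothetical and may not match the project's code.

```python
import pandas as pd

# Weights taken from the formula above; they sum to 1.0.
WEIGHTS = {"Attendance_Pct": 0.35, "Assignment_Avg": 0.35, "Midterm_Grade": 0.30}


def add_risk_score(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with a 0-100 Risk_Score column (higher means more at risk)."""
    out = df.copy()
    # Each component is the shortfall from 100, scaled by its weight.
    out["Risk_Score"] = sum((100 - out[col]) * w for col, w in WEIGHTS.items())
    return out


students = pd.DataFrame({
    "Attendance_Pct": [95.0, 60.0],
    "Assignment_Avg": [90.0, 40.0],
    "Midterm_Grade": [85.0, 50.0],
})
scored = add_risk_score(students)
print(scored)
```

Worked by hand, the second student scores 40 × 0.35 + 60 × 0.35 + 50 × 0.30 = 14 + 21 + 15 = 50. The sketch assumes the inputs are already cleaned to the 0-100 range; uncapped attendance above 100% would push a component negative.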
Not toy queries — real questions a principal would ask:
-- Which attendance level predicts failure?
SELECT
    CASE
        WHEN Attendance_Pct >= 90 THEN '90-100% (Excellent)'
        WHEN Attendance_Pct >= 80 THEN '80-89% (Good)'
        WHEN Attendance_Pct >= 70 THEN '70-79% (Concerning)'
        ELSE 'Below 70% (Critical)'
    END AS Attendance_Bucket,
    COUNT(*) AS Student_Count,
    ROUND(100.0 * SUM(CASE WHEN Midterm_Grade < 60 THEN 1 ELSE 0 END) / COUNT(*), 1) AS Failure_Rate
FROM students
GROUP BY Attendance_Bucket;

Each student gets interventions matched to their specific risk factors:
| Risk Factor | Intervention | Owner |
|---|---|---|
| Attendance < 60% | Home visit + daily check-ins | Attendance Officer |
| Assignments < 40% | Intensive 1-on-1 tutoring | Academic Support |
| Low parental involvement | Parent conference | Guidance Counselor |
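One way to encode the table above is as a list of rules, each pairing a condition with an intervention and an owner. This is a sketch under assumptions, not the contents of `recommendation_engine.py`: the `Rule` class, the `recommend` function, and the `Parental_Involvement` field name are hypothetical, while the thresholds, interventions, and owners are copied from the table.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    risk_factor: str
    intervention: str
    owner: str
    applies: Callable[[dict], bool]  # predicate over one student record


# Thresholds and owners mirror the table above.
RULES = [
    Rule("Attendance < 60%", "Home visit + daily check-ins", "Attendance Officer",
         lambda s: s["Attendance_Pct"] < 60),
    Rule("Assignments < 40%", "Intensive 1-on-1 tutoring", "Academic Support",
         lambda s: s["Assignment_Avg"] < 40),
    Rule("Low parental involvement", "Parent conference", "Guidance Counselor",
         lambda s: s["Parental_Involvement"] == "Low"),
]


def recommend(student: dict) -> list[dict]:
    """Return every intervention whose rule matches the student, in table order."""
    return [
        {"risk_factor": r.risk_factor, "intervention": r.intervention, "owner": r.owner}
        for r in RULES
        if r.applies(student)
    ]


example = {"Attendance_Pct": 56.0, "Assignment_Avg": 15.0, "Parental_Involvement": "Low"}
for rec in recommend(example):
    print(f"{rec['intervention']} (Owner: {rec['owner']})")
```

Keeping the rules as data rather than nested `if` statements means a new risk factor is one more list entry, and every recommendation carries its owner with it.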
pip install pandas numpy matplotlib seaborn faker

# 1. Generate synthetic data
python scripts/generate_student_data.py
# 2. Run EDA notebook (or execute cells in analysis.ipynb)
jupyter notebook analysis.ipynb
# 3. Create database and run SQL queries
python scripts/sql_integration.py
# 4. Generate intervention recommendations
python scripts/recommendation_engine.py

cd Student_Performance_Project
python scripts/generate_student_data.py && python scripts/sql_integration.py && python scripts/recommendation_engine.py

TOTAL HIGH-RISK STUDENTS: 179
URGENCY BREAKDOWN:
URGENT (70+): 34 students
HIGH (50-69): 85 students
MODERATE (<50): 60 students
MOST COMMON RISK FACTORS:
- Poor Attendance: 127 students
- Failing Assignments: 124 students
- Low Parental Engagement: 106 students
STUDENT: Julian Conner
Risk Score: 68.6/100
Urgency Score: 89.3/100
IDENTIFIED RISK FACTORS:
1. [CRITICAL] Critical Absenteeism - 56.0%
2. [CRITICAL] Failing Assignments - 15.0
3. [CRITICAL] Failing Exams - 30.8
RECOMMENDED INTERVENTIONS:
1. Attendance Recovery Program (Owner: Attendance Officer) - URGENT
2. Intensive Tutoring (Owner: Academic Support) - URGENT
3. Parent Conference (Owner: Guidance Counselor) - HIGH
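The urgency bands in the sample summary (70+, 50-69, below 50) can be sketched as a small mapping function. The README does not state how the urgency score itself is computed, so this sketch only covers the banding step; the `urgency_band` name is hypothetical.

```python
def urgency_band(urgency_score: float) -> str:
    """Map a 0-100 urgency score to the band labels used in the summary report."""
    if urgency_score >= 70:
        return "URGENT"
    if urgency_score >= 50:
        return "HIGH"
    return "MODERATE"


# The sample student above has an urgency score of 89.3.
print(urgency_band(89.3))
```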
| Category | Skills |
|---|---|
| Data Engineering | Synthetic data generation, data quality assessment, ETL pipelines |
| Data Analysis | Exploratory data analysis, correlation analysis, statistical summary |
| Data Cleaning | Missing value imputation, outlier detection, constraint validation |
| Feature Engineering | Risk score design, categorical encoding, urgency scoring |
| SQL | Schema design, complex queries, aggregations, indexing |
| Visualization | Distribution plots, heatmaps, scatter plots with trend lines |
| Business Logic | Rules-based recommendation engine, intervention prioritization |
| Documentation | Business-aware code comments, stakeholder reporting |
- Machine Learning: Train a classifier to predict risk (Random Forest, XGBoost)
- Dashboard: Build interactive Streamlit/Dash dashboard
- Real-time Updates: Connect to live student information system
- Outcome Tracking: Measure intervention effectiveness over time
- API: RESTful API for integration with other school systems
This project was built as a portfolio piece to demonstrate end-to-end data analytics capabilities. It showcases not just technical skills, but business thinking - understanding that data only matters when it drives action.
The code is intentionally over-documented to show the reasoning behind decisions, which is what separates junior analysts from senior ones.
MIT License - feel free to use this as a template for your own projects.
[Khush Paliwal] [LinkedIn Profile] | [Connectwithkhush@gmail.com]
Built with Python, Pandas, and a genuine desire to help students succeed.



