khush196/Student_Performance_Project


Student Performance Early Warning System

Identifying at-risk students before they fail — turning data into intervention.

Built with: Python, Pandas, SQLite. License: MIT.


The Problem

Every year, universities lose students to preventable academic failure. By the time a student fails their final exam, it's too late to help them. The warning signs were there — declining attendance, missed assignments, poor midterm grades — but no one connected the dots in time.

The cost is real:

  • Students lose tuition money and time
  • Universities lose retention revenue
  • Society loses potential graduates

The Solution

This project builds an Early Warning System that:

  1. Identifies at-risk students using a weighted risk score based on attendance, assignments, and exam performance
  2. Prioritizes intervention with urgency scoring that surfaces the most critical cases first
  3. Recommends specific actions tailored to each student's unique risk profile
  4. Tracks accountability with clear ownership for every intervention

The system transforms raw data into actionable insights that counselors, teachers, and administrators can use immediately.


Key Results

| Metric | Finding |
| --- | --- |
| Risk Identification | 35.8% of students flagged as high-risk before final exams |
| Attendance Impact | Students below 70% attendance have a failure rate above 80% |
| Parental Factor | Low parental involvement: 75.7% high-risk rate (vs. 0.8% for high involvement) |
| Interventions Generated | 661 specific, actionable recommendations with assigned owners |

Project Structure

Student_Performance_Project/
├── data/
│   ├── raw_student_data.csv          # Original data with realistic flaws
│   ├── cleaned_student_data.csv      # Cleaned data with risk scores
│   ├── student_performance.db        # SQLite database
│   └── reports/
│       ├── executive_summary.txt     # Principal's weekly overview
│       ├── individual_student_reports.txt
│       └── intervention_tracker.csv  # Actionable task list
├── scripts/
│   ├── generate_student_data.py      # Synthetic data generator
│   ├── sql_integration.py            # Database creation & queries
│   └── recommendation_engine.py      # Intervention logic
├── analysis.ipynb                    # Full EDA notebook
└── README.md

Technical Highlights

1. Realistic Data Generation

Since real student data is protected by FERPA, I created a synthetic dataset that mirrors actual academic patterns:

  • Correlated variables: Attendance affects assignment scores, which in turn affect exam grades
  • Realistic flaws: Missing values, data entry errors (attendance > 100%), outliers
  • Categorical distributions: Parental involvement weighted toward "Medium" (most common)
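
The three properties above can be illustrated with a minimal sketch. This is not the code in `scripts/generate_student_data.py`; the function name `generate_students`, the `Parental_Involvement` column name, the noise levels, the category probabilities, and the flaw rates are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd


def generate_students(n=500, seed=42):
    """Build a small synthetic cohort with correlated scores and injected flaws."""
    rng = np.random.default_rng(seed)

    # Attendance drives assignments, and both drive the midterm (plus noise).
    attendance = rng.normal(80, 12, n).clip(30, 100)
    assignments = (0.7 * attendance + rng.normal(15, 10, n)).clip(0, 100)
    midterm = (0.5 * assignments + 0.3 * attendance + rng.normal(10, 8, n)).clip(0, 100)

    df = pd.DataFrame({
        "Attendance_Pct": attendance.round(1),
        "Assignment_Avg": assignments.round(1),
        "Midterm_Grade": midterm.round(1),
        # Weighted toward "Medium"; the exact probabilities are assumed.
        "Parental_Involvement": rng.choice(
            ["Low", "Medium", "High"], size=n, p=[0.25, 0.50, 0.25]
        ),
    })

    # Flaw 1: data entry errors that push attendance above 100%.
    bad_rows = rng.choice(n, size=max(1, n // 50), replace=False)
    df.loc[bad_rows, "Attendance_Pct"] += 100

    # Flaw 2: missing assignment averages.
    missing_rows = rng.choice(n, size=max(1, n // 20), replace=False)
    df.loc[missing_rows, "Assignment_Avg"] = np.nan
    return df
```

Fixing the seed keeps the dataset reproducible, so the downstream cleaning and scoring steps can be rerun against identical input.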

2. Business-Aware Data Cleaning

Every cleaning decision is documented with business justification:

# Cap attendance at 100% — don't delete records, just fix the impossible value
# Why? The student exists, their other data is valid, we preserve sample size
df['Attendance_Pct'] = df['Attendance_Pct'].clip(upper=100)
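
One way to keep such decisions auditable is to count how many records each rule touches. The sketch below is an assumption-laden illustration: `clean_scores` is a hypothetical helper, the `Assignment_Avg` column name is borrowed from the risk formula, and median imputation is one plausible choice rather than the project's confirmed method.

```python
import numpy as np
import pandas as pd


def clean_scores(df):
    """Return a cleaned copy plus a report of how many rows each rule changed."""
    out = df.copy()
    report = {}

    # Rule 1: cap impossible attendance values instead of dropping the row.
    report["attendance_capped"] = int((out["Attendance_Pct"] > 100).sum())
    out["Attendance_Pct"] = out["Attendance_Pct"].clip(upper=100)

    # Rule 2 (assumed): fill missing assignment averages with the median,
    # which is less sensitive to outliers than the mean.
    report["assignment_imputed"] = int(out["Assignment_Avg"].isna().sum())
    out["Assignment_Avg"] = out["Assignment_Avg"].fillna(out["Assignment_Avg"].median())
    return out, report


raw = pd.DataFrame({
    "Attendance_Pct": [95.0, 104.0, 60.0],
    "Assignment_Avg": [80.0, np.nan, 40.0],
})
cleaned, report = clean_scores(raw)
print(report)
```

Working on a copy leaves the raw data untouched, so the original flaws stay available for the data quality write-up.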

3. Risk Score Engineering

A transparent, explainable formula that stakeholders can understand:

Risk Score = (100 - Attendance) × 0.35
           + (100 - Assignment_Avg) × 0.35
           + (100 - Midterm_Grade) × 0.30

Why these weights?

  • Attendance (35%): Observable, actionable, schools can track it
  • Assignments (35%): Shows consistent effort, early warning signal
  • Midterm (30%): Direct outcome measure, but happens late in semester
  • Study Hours excluded: Self-reported, unreliable
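
As a minimal sketch, the formula can be applied to a DataFrame as follows. The weights come from the formula above; the `risk_score` function name and the `WEIGHTS` dictionary are illustrative, and the column names assume the cleaned data uses `Attendance_Pct`, `Assignment_Avg`, and `Midterm_Grade`.

```python
import pandas as pd

# Weights taken from the risk score formula; they sum to 1.0,
# so the score stays on a 0-100 scale when the inputs are 0-100.
WEIGHTS = {
    "Attendance_Pct": 0.35,
    "Assignment_Avg": 0.35,
    "Midterm_Grade": 0.30,
}


def risk_score(df):
    """Weighted sum of each metric's shortfall from 100, rounded to one decimal."""
    score = sum((100 - df[col]) * weight for col, weight in WEIGHTS.items())
    return score.round(1)


students = pd.DataFrame({
    "Attendance_Pct": [80.0, 100.0, 0.0],
    "Assignment_Avg": [70.0, 100.0, 0.0],
    "Midterm_Grade": [60.0, 100.0, 0.0],
})
students["Risk_Score"] = risk_score(students)
print(students)
```

For the first row, the shortfalls are 20, 30, and 40, so the score is 20 × 0.35 + 30 × 0.35 + 40 × 0.30 = 29.5. A student with perfect metrics scores 0, and a student with all zeros scores 100.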

4. SQL Queries That Matter

Not toy queries — real questions a principal would ask:

-- Which attendance level predicts failure?
SELECT
    CASE
        WHEN Attendance_Pct >= 90 THEN '90-100% (Excellent)'
        WHEN Attendance_Pct >= 80 THEN '80-89% (Good)'
        WHEN Attendance_Pct >= 70 THEN '70-79% (Concerning)'
        ELSE 'Below 70% (Critical)'
    END AS Attendance_Bucket,
    COUNT(*) AS Student_Count,
    ROUND(100.0 * SUM(CASE WHEN Midterm_Grade < 60 THEN 1 ELSE 0 END) / COUNT(*), 1) AS Failure_Rate
FROM students
GROUP BY Attendance_Bucket
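
To try the query without the project database, it can be run against an in-memory SQLite table from Python. This is a sketch only: the two-column `students` schema and the helper name `failure_rate_by_attendance` are assumptions, and the real table in `student_performance.db` likely has more columns.

```python
import sqlite3

QUERY = """
SELECT
    CASE
        WHEN Attendance_Pct >= 90 THEN '90-100% (Excellent)'
        WHEN Attendance_Pct >= 80 THEN '80-89% (Good)'
        WHEN Attendance_Pct >= 70 THEN '70-79% (Concerning)'
        ELSE 'Below 70% (Critical)'
    END AS Attendance_Bucket,
    COUNT(*) AS Student_Count,
    ROUND(100.0 * SUM(CASE WHEN Midterm_Grade < 60 THEN 1 ELSE 0 END) / COUNT(*), 1) AS Failure_Rate
FROM students
GROUP BY Attendance_Bucket
"""


def failure_rate_by_attendance(rows):
    """Load (attendance, midterm) pairs into a temporary table and run the query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE students (Attendance_Pct REAL, Midterm_Grade REAL)")
        conn.executemany("INSERT INTO students VALUES (?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


sample = [(95, 80), (92, 55), (85, 70), (75, 50), (65, 40), (50, 30)]
for bucket, count, rate in failure_rate_by_attendance(sample):
    print(f"{bucket}: {count} students, {rate}% failing")
```

Note that the query treats a midterm grade below 60 as a failure, so the rate describes midterm outcomes rather than final results.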

5. Personalized Recommendations

Each student gets interventions matched to their specific risk factors:

| Risk Factor | Intervention | Owner |
| --- | --- | --- |
| Attendance < 60% | Home visit + daily check-ins | Attendance Officer |
| Assignments < 40% | Intensive 1-on-1 tutoring | Academic Support |
| Low parental involvement | Parent conference | Guidance Counselor |
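
The three rules in this table can be expressed as a small rules-based function. The sketch below is illustrative and is not the code in `scripts/recommendation_engine.py`; the `Intervention` dataclass, the `recommend` function, and the `"Low"` category label are assumed names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intervention:
    risk_factor: str
    action: str
    owner: str


def recommend(attendance_pct, assignment_avg, parental_involvement):
    """Return one intervention per rule that the student triggers."""
    recommendations = []
    if attendance_pct < 60:
        recommendations.append(Intervention(
            "Attendance < 60%", "Home visit + daily check-ins", "Attendance Officer"))
    if assignment_avg < 40:
        recommendations.append(Intervention(
            "Assignments < 40%", "Intensive 1-on-1 tutoring", "Academic Support"))
    if parental_involvement == "Low":
        recommendations.append(Intervention(
            "Low parental involvement", "Parent conference", "Guidance Counselor"))
    return recommendations


for item in recommend(attendance_pct=56.0, assignment_avg=15.0, parental_involvement="Low"):
    print(f"{item.action} (Owner: {item.owner})")
```

Because each rule is checked independently, a student with several risk factors receives several interventions, each with its own owner.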

Visualizations

Risk Distribution

[Figure: Risk Distribution]

Correlation Analysis

[Figure: Correlation Matrix]

Attendance vs. Performance

[Figure: Attendance Impact]

Parental Involvement Effect

[Figure: Parental Involvement]


How to Run

Prerequisites

pip install pandas numpy matplotlib seaborn faker

Step-by-Step Execution

# 1. Generate synthetic data
python scripts/generate_student_data.py

# 2. Run EDA notebook (or execute cells in analysis.ipynb)
jupyter notebook analysis.ipynb

# 3. Create database and run SQL queries
python scripts/sql_integration.py

# 4. Generate intervention recommendations
python scripts/recommendation_engine.py

Quick Start (All Phases)

cd Student_Performance_Project
python scripts/generate_student_data.py && python scripts/sql_integration.py && python scripts/recommendation_engine.py

Sample Output

Executive Summary (for Principal)

TOTAL HIGH-RISK STUDENTS: 179

URGENCY BREAKDOWN:
  URGENT (70+):   34 students
  HIGH (50-69):   85 students
  MODERATE (<50): 60 students

MOST COMMON RISK FACTORS:
  - Poor Attendance: 127 students
  - Failing Assignments: 124 students
  - Low Parental Engagement: 106 students
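
The urgency bands in this summary can be reproduced with a simple threshold function. This sketch covers only the banding; the urgency score formula itself is not shown in this README, and the function names and the handling of boundary values (70 counts as URGENT, 50 counts as HIGH) are assumptions.

```python
from collections import Counter


def urgency_band(urgency_score):
    """Map a 0-100 urgency score to the band used in the executive summary."""
    if urgency_score >= 70:
        return "URGENT"
    if urgency_score >= 50:
        return "HIGH"
    return "MODERATE"


def urgency_breakdown(scores):
    """Count how many students fall into each band."""
    return Counter(urgency_band(score) for score in scores)


print(urgency_breakdown([89.3, 70, 69.9, 50, 12]))
```

Sorting students by the raw urgency score within each band would then surface the most critical cases first.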

Individual Student Report (for Counselor)

STUDENT: Julian Conner
Risk Score: 68.6/100
Urgency Score: 89.3/100

IDENTIFIED RISK FACTORS:
1. [CRITICAL] Critical Absenteeism - 56.0%
2. [CRITICAL] Failing Assignments - 15.0
3. [CRITICAL] Failing Exams - 30.8

RECOMMENDED INTERVENTIONS:
1. Attendance Recovery Program (Owner: Attendance Officer) - URGENT
2. Intensive Tutoring (Owner: Academic Support) - URGENT
3. Parent Conference (Owner: Guidance Counselor) - HIGH

Skills Demonstrated

| Category | Skills |
| --- | --- |
| Data Engineering | Synthetic data generation, data quality assessment, ETL pipelines |
| Data Analysis | Exploratory data analysis, correlation analysis, statistical summary |
| Data Cleaning | Missing value imputation, outlier detection, constraint validation |
| Feature Engineering | Risk score design, categorical encoding, urgency scoring |
| SQL | Schema design, complex queries, aggregations, indexing |
| Visualization | Distribution plots, heatmaps, scatter plots with trend lines |
| Business Logic | Rules-based recommendation engine, intervention prioritization |
| Documentation | Business-aware code comments, stakeholder reporting |

Future Enhancements

  • Machine Learning: Train a classifier to predict risk (Random Forest, XGBoost)
  • Dashboard: Build interactive Streamlit/Dash dashboard
  • Real-time Updates: Connect to live student information system
  • Outcome Tracking: Measure intervention effectiveness over time
  • API: RESTful API for integration with other school systems

About This Project

This project was built as a portfolio piece to demonstrate end-to-end data analytics capabilities. It showcases not just technical skills but also business thinking: the understanding that data only matters when it drives action.

The code is intentionally over-documented to show the reasoning behind decisions, which is what separates junior analysts from senior ones.


License

MIT License - feel free to use this as a template for your own projects.


Contact

Khush Paliwal | LinkedIn Profile | Connectwithkhush@gmail.com

Built with Python, Pandas, and a genuine desire to help students succeed.
