Context:
Educational institutions face significant financial and reputational losses when students drop out. For a university with 50,000 students, a 5% increase in retention can preserve millions of dollars in tuition revenue annually.
Problem:
The current intervention process is reactive, relying on faculty referrals after students have already failed courses. The university lacks a proactive, data-driven method to identify at-risk students early in the semester.
Objective:
Build an automated Early Warning System (EWS) that predicts the probability of student dropout using historical academic, behavioral, and financial data. The system will categorize students into Low, Medium, and High-Risk groups to enable targeted intervention by counselors.
The solution is built using Python for data processing and modeling, with outputs designed for integration into Power BI dashboards.
Tech Stack:
- Language: Python 3.10+
- Libraries: Pandas, NumPy, Scikit-learn, Seaborn, Matplotlib
- Modeling: Logistic Regression, Random Forest Classifier
- Dashboard: Power BI (Integration Ready)
Folder Structure:
Student-Dropout-Risk-System/
│
├── data/
│ ├── student_dropout_raw.csv # Raw dataset (200,000 records)
│ └── student_dropout_processed.csv # Cleaned & scaled data
│
├── notebooks/
│ └── dropout_modeling.ipynb # EDA and Modeling Walkthrough
│
├── output/
│ ├── student_dropout_scored.csv # Final Scored Data for Power BI
│ ├── confusion_matrix.png # Model Performance Visual
│ └── feature_importance.png # Key Drivers Visual
│
├── src/
│ ├── data_generation.py # Generates realistic synthetic data
│ ├── preprocessing.py # Cleaning pipeline
│ └── modeling.py # Training & Scoring pipeline
│
└── README.md # Project Documentation
Dataset:
The dataset contains 200,000 student records with the following key features:
| Feature | Description |
|---|---|
| Student_ID | Unique identifier |
| Attendance_Percentage | % of classes attended (Key Driver) |
| Average_Test_Score | Average score across all subjects |
| Fee_Payment_Delay_Days | Days tuition payment was delayed |
| Scholarship_Status | Merit, Need-Based, or None |
| Dropout | Target Variable (0 = Retained, 1 = Dropped Out) |
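Since the dataset is synthetic (see `src/data_generation.py`), here is a minimal, illustrative sketch of how records matching the schema above could be simulated. The distributions and the logistic link between attendance, scores, and dropout are assumptions for demonstration, not the project's actual generation logic.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1_000  # small sample; the real dataset has 200,000 rows

# Features drawn from illustrative (assumed) distributions
df = pd.DataFrame({
    "Student_ID": [f"S{i:06d}" for i in range(n)],
    "Attendance_Percentage": rng.uniform(40, 100, n).round(1),
    "Average_Test_Score": rng.normal(70, 12, n).clip(0, 100).round(1),
    "Fee_Payment_Delay_Days": rng.poisson(10, n),
    "Scholarship_Status": rng.choice(["Merit", "Need-Based", "None"], n),
})

# Dropout risk rises as attendance and scores fall (illustrative logit)
logit = (-(df["Attendance_Percentage"] - 70) / 10
         - (df["Average_Test_Score"] - 70) / 15)
df["Dropout"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)
```

Generating the label from a noisy function of the features (rather than independently) is what lets a downstream model find real signal in the synthetic data.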
Model Selection:
We compared Logistic Regression and Random Forest classifiers. Random Forest was selected as the champion model because it captures non-linear relationships (e.g., the interaction between attendance and financial stress) that a linear model misses.
Performance Metrics:
- Accuracy: ~88%
- ROC-AUC Score: ~0.94
- Recall (Dropout): ~85% (Critical for catching at-risk students)
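The comparison above can be sketched as follows. This is a minimal, self-contained stand-in: `make_classification` substitutes for the processed student features (the real pipeline lives in `src/modeling.py`), and the class imbalance weight is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the processed student features (~20% dropout rate assumed)
X, y = make_classification(n_samples=5000, n_features=8,
                           weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for name, model in [
    ("Logistic Regression", LogisticRegression(max_iter=1000)),
    ("Random Forest", RandomForestClassifier(n_estimators=200, random_state=0)),
]:
    model.fit(X_tr, y_tr)
    proba = model.predict_proba(X_te)[:, 1]  # probability of the dropout class
    pred = (proba >= 0.5).astype(int)
    print(f"{name}: acc={accuracy_score(y_te, pred):.3f} "
          f"auc={roc_auc_score(y_te, proba):.3f} "
          f"recall={recall_score(y_te, pred):.3f}")
```

Recall on the dropout class is the metric to watch here: a missed at-risk student (false negative) is far more costly than an unnecessary intervention (false positive).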
Top Predictive Factors:
- Attendance Percentage
- Average Test Score
- Fee Payment Delay
- Previous Failures
- Assignment Completion Rate
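Rankings like the one above come from the fitted Random Forest's `feature_importances_` attribute. A minimal sketch, using synthetic data and the project's feature names as labels (the mapping to columns here is illustrative, not the actual trained model):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Illustrative labels matching the project's top predictors
feature_names = ["Attendance_Percentage", "Average_Test_Score",
                 "Fee_Payment_Delay_Days", "Previous_Failures",
                 "Assignment_Completion_Rate"]

X, y = make_classification(n_samples=2000, n_features=5, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances sum to 1.0; sort to rank drivers
importances = (pd.Series(rf.feature_importances_, index=feature_names)
               .sort_values(ascending=False))
print(importances)
```

A plot of this series (e.g., `importances.plot.barh()`) is what would be saved as `output/feature_importance.png`.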
Risk Categorization:
The model assigns each student a Dropout_Probability between 0 and 1, bucketed as:
- Low Risk (< 0.3): Standard academic support.
- Medium Risk (0.3 to 0.6): Automated email reminders, tutor suggestions.
- High Risk (> 0.6): Immediate counselor intervention required.
Output: output/student_dropout_scored.csv
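The bucketing step can be sketched with `pd.cut`. The scored DataFrame here is simulated, and the output filename is written to the current directory for illustration (the real pipeline writes to `output/`):

```python
import numpy as np
import pandas as pd

# Simulated stand-in for the model's scored output
rng = np.random.default_rng(0)
scored = pd.DataFrame({
    "Student_ID": [f"S{i:06d}" for i in range(1000)],
    "Dropout_Probability": rng.random(1000).round(3),
})

# Bin probabilities into the three risk bands: [0, 0.3], (0.3, 0.6], (0.6, 1.0]
scored["Risk_Category"] = pd.cut(
    scored["Dropout_Probability"],
    bins=[0, 0.3, 0.6, 1.0],
    labels=["Low", "Medium", "High"],
    include_lowest=True,
)

scored.to_csv("student_dropout_scored.csv", index=False)  # illustrative path
```

`include_lowest=True` ensures a probability of exactly 0 still lands in the Low band rather than falling outside the bins.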
Power BI Integration:
To visualize the results in Power BI:
1. Import Data:
   - Open Power BI Desktop -> Get Data -> Text/CSV.
   - Select output/student_dropout_scored.csv.
2. Recommended Visuals:
   - Card: Total High-Risk Students (count where Risk_Category = "High").
   - Pie Chart: Risk Category distribution.
   - Bar Chart: Average Attendance by Risk Category.
   - Table: List of High-Risk Students (filtered) for export to counselors.
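Before importing into Power BI, the same aggregates can be sanity-checked in pandas. The scored data here is simulated; in practice you would read `output/student_dropout_scored.csv`:

```python
import numpy as np
import pandas as pd

# Simulated stand-in for output/student_dropout_scored.csv
rng = np.random.default_rng(1)
scored = pd.DataFrame({
    "Attendance_Percentage": rng.uniform(40, 100, 500).round(1),
    "Risk_Category": rng.choice(["Low", "Medium", "High"], 500,
                                p=[0.6, 0.25, 0.15]),
})

high_risk_count = (scored["Risk_Category"] == "High").sum()  # top card
risk_distribution = scored["Risk_Category"].value_counts()   # pie chart
avg_attendance = (scored.groupby("Risk_Category")["Attendance_Percentage"]
                  .mean())                                   # bar chart
print(high_risk_count, avg_attendance.round(1).to_dict())
```

If these numbers disagree with the dashboard after import, the most common culprit is a type mismatch (e.g., Risk_Category imported as a numeric column).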
Limitations:
- Synthetic Data: The data is simulated based on real-world patterns but may not capture specific institutional nuances.
- Static Snapshot: The model currently predicts from a single point-in-time snapshot. A future enhancement would be time-series forecasting (RNN/LSTM) to track risk trajectory week-over-week.
- Bias: Historical data may encode bias against certain demographics; fairness audits are recommended before full deployment.
Author: Riya Rastogi (GitHub: https://github.com/riyalytics)