EdTech / SaaS • Python • XGBoost • Scikit-learn • Dataset: RIT × Excelerate Internship

Churn Prediction Engine

Built an end-to-end churn prediction system on 1,701 student records, from raw behavioral data to an exported, deployment-ready XGBoost model with 90.3% accuracy and an AUC of 0.978, to identify at-risk students before they drop off.

90.3%: Accuracy (+22 percentage points over baseline)
0.978: ROC-AUC (no leakage)
92%: Recall
1,701: Student records
Python (Pandas, NumPy) • XGBoost • Scikit-learn • Matplotlib / Seaborn • Google Colab • Joblib (.pkl export) • SQL
Churn Was Invisible Until It Was Too Late

An edtech platform was losing students with no early warning system. By the time disengagement was noticed, the student had already mentally left. The platform needed to know who would churn and why — before it happened.

~8%: monthly churn rate on the platform, handled reactively with no prediction system in place
Day 3: most churn signals appear in the first 3 days, so waiting 30 days to act is already too late
15: behavioral features engineered from raw logs to build the final prediction model
Exploratory Data Analysis

Before modeling, EDA was run to understand which behavioral signals separate churned students from active ones. Four charts were produced in Python:

Student Behavior: Churned vs Active
Top 10 Features Correlated with Churn
Top 15 Features Predicting Churn
Confusion Matrix and ROC Curve
Building Predictive Signals from Raw Behavior

Raw behavioral logs don't predict churn directly. The key was engineering composite features that capture engagement quality, not just activity volume.

feature_engineering.py
# Composite engagement score (top correlated feature at 0.69)
df['engagement_score'] = (
    df['logins_per_week'] * 0.3 +
    df['assignment_completion_pct'] * 0.4 +
    df['forum_posts_monthly'] * 0.2 +
    df['support_tickets'] * -0.1
)

# Early warning flags
df['critical_low_engagement'] = (df['engagement_score'] < 2.0).astype(int)
df['failing_assignments']      = (df['assignment_completion_pct'] < 40).astype(int)
df['socially_isolated']        = (df['forum_posts_monthly'] == 0).astype(int)

# Engagement trend: week 1 minus week 4, so a positive value means engagement declined
df['engagement_trend'] = df['engagement_week_1'] - df['engagement_week_4']

engagement_score

Highest correlated feature (0.69). Weighted composite of logins, assignments, forum activity, and support tickets.

assignment_completion_pct

Top XGBoost importance feature (0.202). Students below 40% completion show dramatically higher churn.

failing_assignments

Binary flag set when assignment completion is below 40%, engineered from raw completion data. Failing the first assignment is a 3× churn risk multiplier.

critical_low_engagement

Binary early warning: an engagement score below the 2.0 threshold within the first week flags immediate at-risk status.
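To make the scoring concrete, here is a minimal pure-Python sketch that applies the same weights and thresholds as feature_engineering.py to a single student. The student values are invented for illustration, and the sketch assumes the raw columns are used unscaled (completion on a 0 to 100 scale, as the `< 40` flag implies); if the columns were standardized before scoring, the numbers would differ.

```python
def engagement_score(logins_per_week, assignment_completion_pct,
                     forum_posts_monthly, support_tickets):
    """Weighted composite using the weights from feature_engineering.py."""
    return (
        logins_per_week * 0.3
        + assignment_completion_pct * 0.4
        + forum_posts_monthly * 0.2
        + support_tickets * -0.1
    )


def early_warning_flags(student):
    """Return the composite score and the three binary flags for one student dict."""
    score = engagement_score(
        student["logins_per_week"],
        student["assignment_completion_pct"],
        student["forum_posts_monthly"],
        student["support_tickets"],
    )
    return {
        "engagement_score": score,
        "critical_low_engagement": int(score < 2.0),
        "failing_assignments": int(student["assignment_completion_pct"] < 40),
        "socially_isolated": int(student["forum_posts_monthly"] == 0),
    }


# Hypothetical student; these values are invented for illustration only.
example = {
    "logins_per_week": 1,
    "assignment_completion_pct": 3,
    "forum_posts_monthly": 0,
    "support_tickets": 2,
}
flags = early_warning_flags(example)
print(flags)
```

For this invented student the score is roughly 0.3 + 1.2 + 0.0 - 0.2 = 1.3, so all three flags are raised.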

Model Comparison & Selection

Four algorithms were trained and evaluated on an 80/20 split with 5-fold cross-validation. XGBoost won on every metric that matters for a production churn system.

Baseline context: A majority-class classifier (always predict "stays") achieves ~68% accuracy on this dataset. At 90.3%, the XGBoost model is a +22 percentage point improvement over that naive baseline: a meaningful lift, not an inflated number.
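As a rough sketch of how such a baseline is obtained, the snippet below computes majority-class accuracy from a list of labels and then the lift in percentage points. The toy labels are invented and are not the project data; only the 90.3 and 68 figures come from the text.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most common label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


# Toy labels for illustration only (0 = stays, 1 = churns); not the project data.
toy_labels = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]
baseline = majority_baseline_accuracy(toy_labels)  # 7 of the 10 toy labels are "stays"

# Lift quoted in the text: model accuracy minus baseline, in percentage points.
lift_points = 90.3 - 68.0
print(f"toy baseline: {baseline:.2f}, quoted lift: {lift_points:.1f} points")
```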
Algorithm | Accuracy | AUC | Recall | Why chosen / rejected
Logistic Regression | 82.1% | 0.891 | 78% | Too simple: misses non-linear patterns
Random Forest | 88.4% | 0.951 | 86% | Good, but slower and less interpretable
Neural Network | 87.9% | 0.944 | 84% | Black box: can't explain WHY to retention team
XGBoost | 90.3% | 0.978 | 92% | ✓ Best accuracy + interpretable feature importance
02_churn_prediction_model.ipynb
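import joblib
from sklearn.model_selection import cross_val_score, train_test_split
from xgboost import XGBClassifier

# 80/20 hold-out split; X and y are the prepared feature matrix and churn labels.
# The split seed is an assumption, chosen to match the model's random_state.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = XGBClassifier(
    n_estimators=200,
    max_depth=6,
    learning_rate=0.1,
    random_state=42,
    eval_metric='logloss'
)
model.fit(X_train, y_train)

# 5-fold CV (cross_val_score clones the model and refits it on each fold)
cv_scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f"CV Accuracy: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")
# Output: CV Accuracy: 0.901 ± 0.012

# Export the model fitted on the training split for deployment
joblib.dump(model, 'models/churn_prediction_model.pkl')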
Data leakage prevention: All features were limited strictly to pre-churn behavioral signals (logins, assignments, forum activity, support tickets). No post-churn data, no future-period features, no target-encoded variables. Train/test split was performed before any feature engineering to prevent leakage. The 0.978 AUC reflects clean methodology.
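The split-first rule can be illustrated with a small standard-library sketch: rows are split first, scaling statistics are learned from the training rows only, and the test rows reuse those statistics. The helper names and the toy column are hypothetical and this is not the project's actual pipeline, which may use scikit-learn utilities instead.

```python
import random
import statistics


def train_test_split_indices(n_rows, test_fraction=0.2, seed=42):
    """Shuffle row indices and split them into train and test lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]


def fit_scaler(values):
    """Learn mean and standard deviation from the given (training) values."""
    return statistics.fmean(values), statistics.pstdev(values)


def apply_scaler(values, mean, std):
    """Standardize values with previously fitted statistics."""
    return [(v - mean) / std for v in values]


# Hypothetical single feature column, invented for illustration.
logins = [5, 1, 7, 0, 3, 6, 2, 8, 4, 9]

train_idx, test_idx = train_test_split_indices(len(logins))
train_vals = [logins[i] for i in train_idx]
test_vals = [logins[i] for i in test_idx]

mean, std = fit_scaler(train_vals)                 # statistics come from train rows only
train_scaled = apply_scaler(train_vals, mean, std)
test_scaled = apply_scaler(test_vals, mean, std)   # test rows reuse the train statistics
```

Fitting the scaler on all rows before splitting would let test-set statistics influence the training features, which is the kind of leakage the note above rules out.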
Model files saved: churn_prediction_model.pkl (1010 KB) • feature_scaler.pkl • feature_names.json • model_metadata.json — full pipeline exportable to production
Top Churn Drivers Identified

The confusion matrix (237 TN, 85 TP, 8 FN, 11 FP) on a 341-sample test set (20% hold-out, ~28% churn rate) and feature importance chart reveal exactly which behaviors predict churn — and when to intervene.
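For readers who want to see how hold-out metrics follow from a confusion matrix, here is a short sketch that derives them from the four counts quoted above. The function name is hypothetical. Figures derived this way describe this single hold-out split only and need not coincide with numbers reported from other evaluation runs, such as a cross-validation mean.

```python
def confusion_metrics(tn, fp, fn, tp):
    """Derive standard metrics from the four confusion-matrix cells."""
    total = tn + fp + fn + tp
    return {
        "n": total,
        "churn_rate": (tp + fn) / total,  # share of students who actually churned
        "accuracy": (tp + tn) / total,
        "recall": tp / (tp + fn),         # churners the model caught
        "precision": tp / (tp + fp),      # flagged students who really churned
    }


# Counts as quoted in the text above.
metrics = confusion_metrics(tn=237, fp=11, fn=8, tp=85)
for name, value in metrics.items():
    print(name, round(value, 3))
```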

1. Low assignment completion (importance: 0.202)
The single strongest predictor. Students below 40% completion rate are overwhelmingly likely to churn. The model catches this by day 7.
→ intervention: trigger assignment help email when completion drops below 50%
2. Login frequency: logins per week (importance: 0.188)
Students logging in fewer than 3 times per week within the first 7 days show a 70% churn rate. An early login habit is one of the strongest retention signals.
→ intervention: "day 3" onboarding nudge if no second login detected
3. Composite engagement score (importance: 0.186)
The engineered composite score outperforms any single raw feature. Its high correlation (0.69) and top-3 model importance confirm that it captures real engagement quality.
→ intervention: flag any student with score < 2.0 in week 1 as immediate high risk
4. Social isolation: zero forum posts (importance: 0.031)
Students with no peer interaction in the first month show 2× churn rate. Isolation is a behavioral signal, not just a metric.
→ intervention: peer matching or community nudge for isolated students
5. Payment status: unpaid / trial users
Unpaid and trial students show ~50% churn vs ~22% for paid. Payment friction is both a churn predictor and a separate conversion problem worth addressing.
→ intervention: trial-to-paid conversion prompt at day 5 for high-engagement trial users
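The interventions listed above could be expressed as simple rules, sketched below. The record schema (`days_since_signup`, `login_count`, `is_trial`, and so on) and the `high_engagement_cutoff` value are assumptions made for illustration; only the thresholds 50, 2.0, day 3, and day 5 come from the list.

```python
def recommended_interventions(student, high_engagement_cutoff=5.0):
    """Map one student's signals to the interventions listed above.

    `student` is a dict using a hypothetical schema.
    `high_engagement_cutoff` is an assumed value, not taken from the project.
    """
    actions = []
    if student["assignment_completion_pct"] < 50:
        actions.append("assignment_help_email")
    if student["days_since_signup"] >= 3 and student["login_count"] < 2:
        actions.append("day3_onboarding_nudge")
    if student["days_since_signup"] <= 7 and student["engagement_score"] < 2.0:
        actions.append("high_risk_flag")
    if student["forum_posts_monthly"] == 0:
        actions.append("peer_matching_nudge")
    if (student["is_trial"] and student["days_since_signup"] >= 5
            and student["engagement_score"] >= high_engagement_cutoff):
        actions.append("trial_to_paid_prompt")
    return actions


# Hypothetical student on day 4, invented for illustration.
student = {
    "assignment_completion_pct": 32,
    "days_since_signup": 4,
    "login_count": 1,
    "engagement_score": 1.5,
    "forum_posts_monthly": 0,
    "is_trial": True,
}
actions = recommended_interventions(student)
print(actions)
```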
How This Runs in a Real Company

A model that sits in a notebook helps nobody. Here's how the exported pipeline would be deployed as an operational churn prevention system.

Daily Scoring
Batch scoring every 24 hours
Load churn_prediction_model.pkl → run all active students through the pipeline → output a churn probability score (0–1) per student. Students above the 0.7 threshold are flagged as high risk.
Retention Dashboard
Daily risk list for the retention team
Top 20 at-risk students shown with their churn probability + top contributing feature (e.g. "assignment_completion dropped to 32%"). Team knows exactly who to contact and why.
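A minimal sketch of how the daily list might be assembled from model output: keep students above the 0.7 threshold and return the highest-scoring ones. Combining the threshold with the top-20 cut is an assumption, and the student ids and probabilities are invented.

```python
import heapq


def daily_risk_list(scores, threshold=0.7, top_n=20):
    """Return up to `top_n` (student_id, probability) pairs strictly above
    `threshold`, highest probability first."""
    flagged = [(sid, p) for sid, p in scores.items() if p > threshold]
    return heapq.nlargest(top_n, flagged, key=lambda pair: pair[1])


# Hypothetical model output: student id -> churn probability.
scores = {"s01": 0.91, "s02": 0.35, "s03": 0.74, "s04": 0.70, "s05": 0.97}
risk_list = daily_risk_list(scores)
print(risk_list)
```

A per-student "top contributing feature" would need a per-prediction attribution method in addition to this ranking; that step is not shown here.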
Automated Triggers
Rules fire before human review
Score > 0.7 → auto-send re-engagement email. Score > 0.85 → flag for personal outreach call. Score > 0.95 → pause billing cycle if applicable. Model decision drives the action.
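The tiered rules can be sketched as a small lookup. The action names are hypothetical labels, and the sketch assumes the tiers are cumulative, so a score above 0.85 triggers both the outreach call and the email.

```python
# (threshold, action) pairs taken from the rules above, highest first.
TRIGGERS = [
    (0.95, "pause_billing_cycle"),
    (0.85, "personal_outreach_call"),
    (0.70, "re_engagement_email"),
]


def triggered_actions(score):
    """Return every action whose threshold the score strictly exceeds."""
    return [action for threshold, action in TRIGGERS if score > threshold]


print(triggered_actions(0.90))
```

If only the most severe action should fire, taking the first element of the returned list gives the highest tier.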
Model Maintenance
Quarterly retraining cycle
Retrain on a fresh 3-month window every quarter. Monitor AUC in production and, if it drops below 0.90, trigger retraining immediately. feature_names.json and feature_scaler.pkl keep the pipeline consistent across versions.
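The retraining policy reduces to a two-condition check, sketched here. Treating "quarterly" as 90 days is an assumption, and the dates in the usage lines are invented.

```python
from datetime import date, timedelta


def should_retrain(production_auc, last_trained, today,
                   auc_floor=0.90, max_age_days=90):
    """Retrain when AUC falls below the floor or the model is a quarter old.

    `max_age_days=90` is an assumed stand-in for "quarterly".
    """
    too_old = (today - last_trained) >= timedelta(days=max_age_days)
    degraded = production_auc < auc_floor
    return degraded or too_old


# Hypothetical checks with invented dates.
print(should_retrain(0.95, date(2024, 1, 1), date(2024, 2, 1)))
print(should_retrain(0.88, date(2024, 1, 1), date(2024, 2, 1)))
```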
Files ready for production: churn_prediction_model.pkl (1010 KB) • feature_scaler.pkl • feature_names.json • model_metadata.json — the complete exportable pipeline is already built.
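One reason to ship feature_names.json alongside the model is to guarantee that scoring-time inputs use the training column order. The sketch below assumes the file holds a plain JSON list of column names; the real schema may differ, and the record values are invented.

```python
import json


def to_feature_row(record, feature_names):
    """Order a student's feature dict to match the training column order.

    Raises KeyError if a feature the model expects is missing.
    """
    missing = [name for name in feature_names if name not in record]
    if missing:
        raise KeyError(f"missing features: {missing}")
    return [record[name] for name in feature_names]


# Assumed file content: a JSON list of column names (the real schema may differ).
feature_names = json.loads(
    '["logins_per_week", "assignment_completion_pct", "engagement_score"]'
)
record = {"engagement_score": 1.3, "logins_per_week": 1,
          "assignment_completion_pct": 3}
row = to_feature_row(record, feature_names)
print(row)
```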
What This Project Taught Me
01 / TIMING
Churn signals appear in 3 days, not 30
Every churn model I'd read about used 30-day windows. This dataset showed the decision is made by day 3. Early signals → early intervention is the only thing that works.
02 / INTERPRETABILITY
Explainability beats accuracy for stakeholders
A 90% accurate black box is useless to a retention team. XGBoost feature importance meant every at-risk flag came with a "because their assignment completion dropped to X%" reason.
03 / FEATURE ENGINEERING
Engineered features beat raw columns
engagement_score (engineered composite) outperformed every individual raw feature in both correlation and model importance. The signal was in how features combined, not in any single column.
04 / PRODUCTION
Export-ready from day one
The full pipeline — scaler, model, feature names, metadata — was saved as .pkl and .json files. Thinking about deployment before finishing analysis changed how I structured the entire notebook.
