EdTech / SaaS • Python • XGBoost • Scikit-learn • Dataset: RIT × Excelerate Internship

Churn Prediction Engine

Built an end-to-end churn prediction system on 1,701 student records — from raw behavioral data to a deployed XGBoost model with 90.3% accuracy and AUC of 0.978 — identifying at-risk students before they drop off.

90.3%

Accuracy (+22 percentage points over baseline)

0.978

ROC-AUC (no leakage)

92%

Recall Rate

1,701

Student Records

Python (Pandas, NumPy) • XGBoost • Scikit-learn • Matplotlib / Seaborn • Google Colab • Joblib (.pkl export) • SQL

The Problem

Churn Was Invisible Until It Was Too Late

An edtech platform was losing students with no early warning system. By the time disengagement was noticed, the student had already mentally left. The platform needed to know who would churn and why — before it happened.

~8%

Monthly churn rate on the platform — reactive, no prediction system in place

Day 3

Most churn signals appear in the first 3 days — waiting 30 days to act is already too late

Behavioral features engineered from raw logs to build the final prediction model

Python EDA • Google Colab

Exploratory Data Analysis

Before modeling, EDA was run to understand which behavioral signals separate churned students from active ones. The results were summarized in four Python charts.

Feature Engineering

Building Predictive Signals from Raw Behavior

Raw behavioral logs don't predict churn directly. The key was engineering composite features that capture engagement quality, not just activity volume.

feature_engineering.py

# Composite engagement score (top correlated feature at 0.69)
df['engagement_score'] = (
    df['logins_per_week'] * 0.3 +
    df['assignment_completion_pct'] * 0.4 +
    df['forum_posts_monthly'] * 0.2 +
    df['support_tickets'] * -0.1
)

# Early warning flags
df['critical_low_engagement'] = (df['engagement_score'] < 2.0).astype(int)
df['failing_assignments']      = (df['assignment_completion_pct'] < 40).astype(int)
df['socially_isolated']        = (df['forum_posts_monthly'] == 0).astype(int)

# Engagement trend: week 1 minus week 4, so a positive value means engagement fell
df['engagement_trend'] = df['engagement_week_1'] - df['engagement_week_4']

engagement_score

Highest correlated feature (0.69). Weighted composite of logins, assignments, forum activity, and support tickets.

assignment_completion_pct

Top XGBoost importance feature (0.202). Students below 40% completion show dramatically higher churn.

failing_assignments

Binary flag — failing first assignment is a 3× churn risk multiplier. Engineered from raw completion data.

critical_low_engagement

Binary early warning: engagement score below threshold within first week flags immediate at-risk status.

Model Training • XGBoost

Model Comparison & Selection

Four algorithms were trained and evaluated on an 80/20 split with 5-fold cross-validation. XGBoost won on every metric that matters for a production churn system.

Baseline context: Majority-class classifier (always predict "stays") achieves ~68% accuracy on this dataset. The XGBoost model at 90.3% represents a +22 percentage point improvement over the naive baseline — a meaningful lift, not an inflated number.
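A majority-class baseline like the one described can be reproduced with scikit-learn's DummyClassifier. The sketch below uses a synthetic label vector with a 32% churn share, which is an assumption chosen only to match the ~68% figure; the project's real labels are not used here.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Synthetic labels (assumption): 1 = churned, 0 = stayed, 32% churn share.
y = np.array([1] * 32 + [0] * 68)
X = np.zeros((len(y), 1))  # the dummy model ignores the features

# Always predicts the most frequent class, i.e. "stays".
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)

baseline_acc = accuracy_score(y, baseline.predict(X))
print(f"Majority-class accuracy: {baseline_acc:.2f}")
```

Any real model has to beat this number before its accuracy means anything, which is why the lift is quoted in percentage points over it.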

Algorithm	Accuracy	AUC	Recall	Why chosen / rejected
Logistic Regression	82.1%	0.891	78%	Too simple — misses non-linear patterns
Random Forest	88.4%	0.951	86%	Good, but slower and less interpretable
Neural Network	87.9%	0.944	84%	Black box — can't explain WHY to retention team
XGBoost	90.3%	0.978	92%	✓ Best accuracy + interpretable feature importance
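A comparison like the one in the table can be run with scikit-learn's cross_validate, scoring each candidate on accuracy, ROC-AUC, and recall in one pass. This is a minimal sketch on synthetic data, not the project notebook: the dataset comes from make_classification, the neural network is omitted, and GradientBoostingClassifier stands in for XGBoost so the example depends only on scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in data (assumption): roughly 30% positive class.
X, y = make_classification(
    n_samples=400, n_features=8, weights=[0.7, 0.3], random_state=42
)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
    "Gradient Boosting (stand-in)": GradientBoostingClassifier(random_state=42),
}

metrics = ("accuracy", "roc_auc", "recall")
results = {}
for name, model in candidates.items():
    # 5-fold CV; each metric comes back under the key "test_<metric>".
    scores = cross_validate(model, X, y, cv=5, scoring=list(metrics))
    results[name] = {m: scores[f"test_{m}"].mean() for m in metrics}

for name, row in results.items():
    print(name, {m: round(v, 3) for m, v in row.items()})
```

Scoring all metrics from the same folds keeps the rows comparable, since every model sees identical train and validation splits.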

02_churn_prediction_model.ipynb

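from pathlib import Path

import joblib
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

# X, y: full feature matrix and churn labels
# X_train, y_train: the 80% training portion of the 80/20 split
model = XGBClassifier(
    n_estimators=200,
    max_depth=6,
    learning_rate=0.1,
    random_state=42,
    eval_metric='logloss'
)
model.fit(X_train, y_train)

# 5-fold CV (cross_val_score clones the estimator and refits it on each fold)
cv_scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f"CV Accuracy: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")
# Output: CV Accuracy: 0.901 ± 0.012

# Export for deployment (create the folder first so the dump cannot fail)
Path('models').mkdir(exist_ok=True)
joblib.dump(model, 'models/churn_prediction_model.pkl')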

Data leakage prevention: All features were limited strictly to pre-churn behavioral signals (logins, assignments, forum activity, support tickets). No post-churn data, no future-period features, no target-encoded variables. Train/test split was performed before any feature engineering to prevent leakage. The 0.978 AUC reflects clean methodology.
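The split-first ordering can be sketched as follows. The toy DataFrame and the helper name add_engagement_features are hypothetical; the formulas inside the helper mirror the feature engineering shown earlier. Because every feature is computed row by row, applying the same function to each split separately cannot carry information from test rows into training.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def add_engagement_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add row-wise features; nothing is learned from other rows."""
    out = df.copy()
    out["engagement_score"] = (
        out["logins_per_week"] * 0.3
        + out["assignment_completion_pct"] * 0.4
        + out["forum_posts_monthly"] * 0.2
        + out["support_tickets"] * -0.1
    )
    out["failing_assignments"] = (out["assignment_completion_pct"] < 40).astype(int)
    out["socially_isolated"] = (out["forum_posts_monthly"] == 0).astype(int)
    return out


# Toy records (assumption), 4 churned and 6 retained.
raw = pd.DataFrame({
    "logins_per_week": [5, 1, 3, 0, 4, 2, 6, 1, 2, 5],
    "assignment_completion_pct": [90, 20, 60, 10, 80, 35, 95, 30, 55, 85],
    "forum_posts_monthly": [4, 0, 1, 0, 2, 0, 5, 0, 1, 3],
    "support_tickets": [0, 2, 1, 3, 0, 1, 0, 2, 1, 0],
    "churned": [0, 1, 0, 1, 0, 1, 0, 1, 0, 0],
})

# 1) Split the raw rows first, stratified on the label.
train_raw, test_raw = train_test_split(
    raw, test_size=0.2, stratify=raw["churned"], random_state=42
)

# 2) Engineer features on each split independently.
train = add_engagement_features(train_raw)
test = add_engagement_features(test_raw)
print(train.shape, test.shape)
```

Any step that does learn from data, such as fitting a scaler, would follow the same rule: fit on the training split only, then apply to the test split.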

Model files saved: churn_prediction_model.pkl (1010 KB) • feature_scaler.pkl • feature_names.json • model_metadata.json — full pipeline exportable to production

Key Findings

Top Churn Drivers Identified

The confusion matrix (237 TN, 85 TP, 8 FN, 11 FP) on a 341-sample test set (20% hold-out, ~28% churn rate) and feature importance chart reveal exactly which behaviors predict churn — and when to intervene.
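The standard metrics follow directly from those four counts, and the arithmetic is worth spelling out. On these counts the hold-out accuracy works out to about 94.4%, recall to about 91.4%, and the majority-class share to about 72.7%. The headline 90.3%, 92%, and ~68% figures are therefore presumably taken from a different evaluation run, such as the cross-validated one; that attribution is an assumption, not something the counts themselves show.

```python
# Counts as reported for the 341-sample hold-out set.
tn, tp, fn, fp = 237, 85, 8, 11

total = tn + tp + fn + fp                 # all test samples
accuracy = (tp + tn) / total              # share of correct predictions
recall = tp / (tp + fn)                   # churners that were caught
precision = tp / (tp + fp)                # flagged students who really churned
churn_rate = (tp + fn) / total            # actual churners in the test set
majority_baseline = max(tn + fp, tp + fn) / total  # always predict "stays"

print(f"total={total}")
print(f"accuracy={accuracy:.3f} recall={recall:.3f} precision={precision:.3f}")
print(f"churn_rate={churn_rate:.3f} majority_baseline={majority_baseline:.3f}")
```

For a retention use case recall is the figure to watch, because each of the 8 false negatives is a student who churned without being flagged.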

Low assignment completion (importance: 0.202)

The single strongest predictor. Students below 40% completion rate are overwhelmingly likely to churn. The model catches this by day 7.

→ intervention: trigger assignment help email when completion drops below 50%

Students who log in fewer than 3 times per week during their first 7 days show a 70% churn rate. Early login habit is among the strongest retention signals.

→ intervention: "day 3" onboarding nudge if no second login detected

Composite engagement score (importance: 0.186)

The engineered composite score is the highest-correlated feature (0.69), ahead of every single raw column, and ranks in the top 3 by model importance. Together these suggest it captures real engagement quality.

→ any student with score < 2.0 in week 1 = immediate high-risk flag

Social isolation — zero forum posts (importance: 0.031)

Students with no peer interaction in the first month show 2× churn rate. Isolation is a behavioral signal, not just a metric.

→ intervention: peer matching or community nudge for isolated students

Payment status — unpaid / trial users

Unpaid and trial students show ~50% churn vs ~22% for paid. Payment friction is both a churn predictor and a separate conversion problem worth addressing.

→ trial-to-paid conversion prompt at day 5 for high-engagement trial users
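The interventions listed above amount to a small rule table, which could be sketched as a plain function. All field names, action labels, and the HIGH_ENGAGEMENT cutoff are hypothetical; only the thresholds (50% completion, day 3, score 2.0, zero posts, day 5) come from the findings.

```python
from typing import Dict, List

# Assumption: the findings do not define "high-engagement", so this is a placeholder.
HIGH_ENGAGEMENT = 5.0


def pick_interventions(student: Dict) -> List[str]:
    """Map one student's current signals to the interventions they trigger."""
    actions = []
    if student["assignment_completion_pct"] < 50:
        actions.append("assignment_help_email")
    if student["days_since_signup"] >= 3 and student["login_count"] < 2:
        actions.append("day3_onboarding_nudge")
    if student["engagement_score_week1"] < 2.0:
        actions.append("high_risk_flag")
    if student["forum_posts_monthly"] == 0:
        actions.append("peer_matching_nudge")
    if (
        student["is_trial"]
        and student["days_since_signup"] >= 5
        and student["engagement_score_week1"] >= HIGH_ENGAGEMENT
    ):
        actions.append("trial_to_paid_prompt")
    return actions


at_risk = {
    "assignment_completion_pct": 32,
    "days_since_signup": 4,
    "login_count": 1,
    "engagement_score_week1": 1.4,
    "forum_posts_monthly": 0,
    "is_trial": True,
}
print(pick_interventions(at_risk))
```

Keeping the rules in one function makes the thresholds easy to audit and to adjust once real intervention outcomes are available.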

Production Concept

How This Runs in a Real Company

A model that sits in a notebook helps nobody. Here's how the exported pipeline would be deployed as an operational churn prevention system.

Daily Scoring

Batch scoring every 24 hours

Load churn_prediction_model.pkl → run all active students through the pipeline → output churn probability score (0–1) per student. Students above 0.7 threshold flagged as high risk.
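A daily scoring job along these lines might look like the sketch below. To stay self-contained it trains a small LogisticRegression on synthetic data as a stand-in for the exported XGBoost model, writes it to a temporary folder, and then scores from the file. The function name score_students and the single demo feature are assumptions, and the scaler step of the real pipeline is left out.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

HIGH_RISK_THRESHOLD = 0.7  # threshold named in the scoring description


def score_students(model_path, features: pd.DataFrame) -> pd.DataFrame:
    """Load a pickled classifier and return churn probability plus a risk flag."""
    model = joblib.load(model_path)
    proba = model.predict_proba(features)[:, 1]  # column 1 = probability of class 1 (churn)
    return pd.DataFrame(
        {"churn_probability": proba, "high_risk": proba > HIGH_RISK_THRESHOLD},
        index=features.index,
    )


# Demo with a stand-in model (assumption): low engagement means churn.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({"engagement_score": rng.uniform(0, 10, 200)})
y_demo = (X_demo["engagement_score"] < 3).astype(int)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "churn_prediction_model.pkl"
    joblib.dump(LogisticRegression().fit(X_demo, y_demo), path)
    scored = score_students(path, X_demo.head(5))

print(scored)
```

In a scheduled job the same function would be pointed at the exported model file and at the current day's feature table.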

Retention Dashboard

Daily risk list for the retention team

Top 20 at-risk students shown with their churn probability + top contributing feature (e.g. "assignment_completion dropped to 32%"). Team knows exactly who to contact and why.
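Building that list is mostly a ranking step. The sketch below assumes two inputs that are not shown in the project: a Series of churn probabilities per student and a DataFrame of per-student, per-feature contributions toward churn (for example SHAP-style values). How those contributions are produced is outside this sketch, and build_risk_list is a hypothetical name.

```python
import pandas as pd


def build_risk_list(
    scores: pd.Series, contributions: pd.DataFrame, top_n: int = 20
) -> pd.DataFrame:
    """Return the top_n highest-risk students with their largest contributing feature."""
    top = scores.nlargest(top_n)
    # idxmax(axis=1) picks, per student, the column with the largest contribution.
    top_feature = contributions.loc[top.index].idxmax(axis=1)
    return pd.DataFrame({"churn_probability": top, "top_feature": top_feature})


# Toy inputs (assumption).
scores = pd.Series([0.91, 0.35, 0.78], index=["s1", "s2", "s3"])
contributions = pd.DataFrame(
    {
        "assignment_completion_pct": [0.40, 0.02, 0.05],
        "engagement_score": [0.10, 0.01, 0.30],
    },
    index=["s1", "s2", "s3"],
)

risk_list = build_risk_list(scores, contributions, top_n=2)
print(risk_list)
```

Pairing each probability with one named driver is what lets the retention team open a conversation with a concrete reason.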

Automated Triggers

Rules fire before human review

Score > 0.7 → auto-send re-engagement email. Score > 0.85 → flag for personal outreach call. Score > 0.95 → pause billing cycle if applicable. Model decision drives the action.
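The three thresholds translate into a short tiering function. One assumption is made explicit here: the tiers are treated as cumulative, so a score above 0.95 triggers all three actions. The description does not say whether higher tiers replace or add to lower ones, and the action labels are hypothetical.

```python
from typing import List


def actions_for_score(score: float) -> List[str]:
    """Return every automated action whose threshold the score exceeds."""
    actions = []
    if score > 0.7:
        actions.append("re_engagement_email")
    if score > 0.85:
        actions.append("personal_outreach_call")
    if score > 0.95:
        actions.append("pause_billing_cycle")
    return actions


for s in (0.50, 0.72, 0.90, 0.97):
    print(s, actions_for_score(s))
```

The comparisons are strict, matching the "Score > 0.7" wording, so a score of exactly 0.7 triggers nothing.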

Model Maintenance

Quarterly retraining cycle

Retrain on a fresh 3-month window every quarter. Monitor AUC in production; if it drops below 0.90, trigger retraining immediately. feature_names.json and feature_scaler.pkl keep the pipeline consistent across versions.

Files ready for production: churn_prediction_model.pkl (1010 KB) • feature_scaler.pkl • feature_names.json • model_metadata.json — the complete exportable pipeline is already built.

Learnings

What This Project Taught Me

01 / TIMING

Churn signals appear in 3 days, not 30

Every churn model I'd read about used 30-day windows. This dataset showed the decision is made by day 3. Early signals → early intervention is the only thing that works.

02 / INTERPRETABILITY

Explainability beats accuracy for stakeholders

A 90% accurate black box is useless to a retention team. XGBoost feature importance meant every at-risk flag came with a "because their assignment completion dropped to X%" reason.

03 / FEATURE ENGINEERING

Engineered features beat raw columns

engagement_score (engineered composite) correlated more strongly than every individual raw feature and ranked in the top 3 for model importance. The signal was in how features combined, not in any single column.

04 / PRODUCTION

Export-ready from day one

The full pipeline — scaler, model, feature names, metadata — was saved as .pkl and .json files. Thinking about deployment before finishing analysis changed how I structured the entire notebook.
