A few years ago, I built a churn model for a B2B SaaS product. Logistic regression, binary label, 30-day prediction window. It performed fine. The business used it. I moved on.
What bothered me was a question the model couldn’t answer: how long does a customer actually stay?
A customer who churns at month 2 and one who churns at month 18 are identical in a binary classifier — both labeled “churned.” But they’re completely different business problems. One is an onboarding failure. The other is a pricing issue, a competitive loss, or accumulated dissatisfaction that took a year and a half to complete. The interventions are different. The teams responsible are different. The economics are different.
I eventually built a survival analysis instead. This post is what I learned.
## What survival analysis does differently
Survival analysis treats time-to-event as the outcome. Instead of “will this customer churn?” it asks “when will this customer churn, and what accelerates or delays that?”
The technical framing: you’re modelling a duration variable that, for some customers, you haven’t fully observed yet. A customer who joined six months ago and hasn’t churned has survived at least six months. You don’t know when they’ll leave. Standard classification models either drop these customers or treat them as permanent non-churners — both wrong. Survival analysis handles them correctly as censored observations: they contribute the information they have (at least six months of survival) without overclaiming.
This matters more than it sounds. In most SaaS datasets, the majority of your customers are still active. A model that ignores them or misrepresents them is working with a fundamentally biased sample.
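To make the censoring concrete: the `duration_months` and `churned` columns used throughout this post can be derived from raw signup and churn dates. A minimal sketch, assuming hypothetical `signup_date` / `churn_date` columns where a missing churn date means the account is still active:

```python
import pandas as pd

# Hypothetical raw table: churn_date is NaT for accounts still active.
raw = pd.DataFrame({
    'account_id': [1, 2, 3],
    'signup_date': pd.to_datetime(['2021-01-01', '2022-03-01', '2023-06-01']),
    'churn_date': pd.to_datetime(['2022-07-01', pd.NaT, pd.NaT]),
})
snapshot = pd.Timestamp('2024-01-01')  # when the data was pulled

# Duration runs to the churn date if observed, otherwise to the snapshot.
end = raw['churn_date'].fillna(snapshot)
raw['duration_months'] = (end - raw['signup_date']).dt.days / 30.44
raw['churned'] = raw['churn_date'].notna().astype(int)  # 0 = censored
```

Active accounts get `churned = 0` and a duration equal to their observed tenure so far, which is exactly the "at least this long" information a survival model consumes.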
## Two tools, two questions
### Kaplan-Meier: what does retention actually look like?
The Kaplan-Meier estimator produces a survival curve: at each point in time, what fraction of customers are still subscribed?
```python
from lifelines import KaplanMeierFitter

kmf = KaplanMeierFitter()
kmf.fit(durations=df['duration_months'], event_observed=df['churned'])
kmf.plot_survival_function()
```

The curve steps down at each churn event. The shaded band is the 95% confidence interval — it widens at longer durations where fewer customers remain. Don’t over-interpret the tail.
The useful thing is you can fit separate curves for segments and compare them directly:
```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
for segment in df['segment'].unique():
    mask = df['segment'] == segment
    kmf.fit(
        durations=df.loc[mask, 'duration_months'],
        event_observed=df.loc[mask, 'churned'],
        label=segment,
    )
    kmf.plot_survival_function(ax=ax, ci_show=False)
```

Higher curve = better retention. Wider separation between curves = segment is a strong predictor of churn. The log-rank test tells you whether the difference is statistically significant:
```python
from lifelines.statistics import multivariate_logrank_test

result = multivariate_logrank_test(
    df['duration_months'], df['segment'], df['churned']
)
print(f"p = {result.p_value:.2e}")
```

### Cox Proportional Hazards: what’s driving it?
Kaplan-Meier shows that segments differ. Cox tells you why — which factors increase or decrease churn risk, controlling for everything else simultaneously.
The model:
```
h(t) = h₀(t) · exp(β₁x₁ + β₂x₂ + ... + βₚxₚ)
```

The output is a hazard ratio per covariate. HR = 1.4 means 40% higher churn risk. HR = 0.7 means 30% lower churn risk. HR = 1.0 means no effect.
```python
from lifelines import CoxPHFitter

cph = CoxPHFitter(penalizer=0.01)
cph.fit(cox_df, duration_col='duration_months', event_col='churned')
cph.plot(hazard_ratios=True)  # forest plot — the most useful output
```

The concordance index tells you how well the model ranks customers by risk:
```python
print(f"C-index: {cph.concordance_index_:.3f}")
# 0.5 = random, 0.7+ = useful, 0.8+ = strong
```

## What I found
The dataset was enterprise contract customers for a B2B SaaS product — around 23,000 subscription records across ten product lines, roughly 44% churned.
Retention curve:
| Milestone | Retention |
|---|---|
| 6 months | 96.8% |
| 12 months | 76.3% |
| 24 months | 55.7% |
| 36 months | 36.2% |
Median lifetime: 30 months.
One important caveat for this population: these are contract customers with at least annual terms. Near-zero churn before month 6 is structurally expected — they can’t easily leave mid-contract. The meaningful signal starts at the first renewal window. The 20-point drop between month 6 and month 12 is where annual contracts come up for renewal. That’s not a surprise; it’s the shape of the problem.
The Cox model (C-index 0.78) surfaces the factors that differentiate which accounts make it through renewal:
- Account segment dominated the results. Some segments showed 10–16× higher churn risk than the reference segment, controlling for product, region, and revenue. These differences reflect contractual dynamics — competitive renewal processes, procurement cycles, shorter initial terms — not product quality.
- Region was consistently protective. EMEA and North American accounts churned 30–40% less than the APAC baseline, all else equal.
- Product line mattered significantly. Some products showed 6× higher churn than others. Worth separating in the model rather than lumping all product lines together.
- MRR had a near-zero independent effect once segment and product were controlled for. The apparent revenue effect in simpler models was confounding.
## The assumption check you can’t skip
Cox PH has a key assumption: hazard ratios are constant over time. A segment that has 3× higher churn risk at month 12 must also have 3× higher churn risk at month 24. In practice this often fails — and it failed here.
The test:
```python
from lifelines.statistics import proportional_hazard_test

results = proportional_hazard_test(cph, cox_df, time_transform='rank')
failing = results.summary[results.summary['p'] < 0.05].index.tolist()
```

15 covariates failed. Alarming until you notice the pattern: most were one-hot dummies from the same two or three source variables (account segment, product line, downgrade count). The violation isn’t spread randomly — it’s concentrated in a few original variables.
The fix: stratify on the source variable, not each dummy.
Stratification lets each group have its own baseline hazard curve while still estimating shared hazard ratios for the remaining covariates. You lose the hazard ratio for the stratified variable; you get correct estimates for everything else.
```python
cph_strat = CoxPHFitter(penalizer=0.01)
cph_strat.fit(
    cox_strat_df,
    duration_col='duration_months',
    event_col='churned',
    strata=['account_segment', 'product_line', 'downgrade_count', 'region'],
)
```

What happened when I did this: MRR’s hazard ratio collapsed to 1.00 (p = 0.93). Once segment, product, region, and downgrade behaviour were properly controlled, MRR had no independent effect on churn. The apparent MRR signal in the non-stratified model was pure confounding.
The C-index dropped to 0.505 — barely above chance. This is correct: I stratified away all the predictive variance. Within a fixed stratum, the only remaining covariate (MRR) doesn’t predict churn. The non-stratified model (C = 0.78) is still the right tool for ranking accounts by risk. The stratified model is the right tool for getting assumption-correct estimates of specific effects.
This trade-off — model correctness vs. predictive power — is the main practical tension in survival analysis with real business data. The resolution is usually better features. Behavioral data from the first 60–90 days of a subscription (usage volume, feature adoption, login frequency) tends to satisfy the PH assumption more cleanly than firmographic segment labels, and it’s far more predictive.
## Scoring individual accounts
Beyond population-level curves, Cox PH generates a predicted survival function per customer. This is how the analysis becomes operational.
```python
# Rank all active accounts by churn risk
features = cox_df.drop(columns=['duration_months', 'churned'])
risk = cph.predict_partial_hazard(features.loc[cox_df['churned'] == 0])
risk_percentile = risk.rank(pct=True)
```

Sort descending. The top accounts are your highest renewal risk. Hand the list to whoever runs renewal conversations — with enough lead time to act.
## What the method gets you that classification doesn’t
- Time-aware LTV. Survival curves give you the expected retention trajectory per segment. LTV is the area under the curve, weighted by revenue — not an average tenure calculated from churned accounts only.
- Danger periods. The hazard function shows when in the lifecycle churn risk peaks. For contract customers that’s structurally at renewal, but the height of the spike varies by segment. That variance is where the model earns its keep.
- Evaluation. If you run a retention program, you can compare survival curves before and after with a log-rank test. The same framework that generates the analysis can measure whether an intervention worked.
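The LTV point above reduces to a discrete integral: sum the survival probabilities over the horizon and scale by revenue. A minimal sketch with a hypothetical survival curve and MRR figure:

```python
import numpy as np

# Hypothetical monthly survival probabilities for one segment (months 0..36);
# a stand-in for a fitted Kaplan-Meier curve evaluated on a monthly grid.
months = np.arange(0, 37)
survival = np.exp(-months / 40)
mrr = 500.0  # monthly recurring revenue, hypothetical

# Expected revenue = sum over months of P(still subscribed) * MRR,
# i.e. the discrete area under the survival curve scaled by revenue.
expected_ltv = float(survival.sum() * mrr)
print(f"expected 36-month LTV ≈ ${expected_ltv:,.0f}")
```

Doing this per segment, with each segment’s own fitted curve, is what makes the LTV "time-aware" rather than an average over churned accounts only.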
## Where it falls short
It only sees what billing sees. A customer who has stopped using the product is “active” until the contract formally ends. Usage signals — logins, API calls, feature adoption — are the leading indicators. This model doesn’t have them.
No competing risks. Voluntary cancellation, payment failure, and non-renewal are all treated as the same event. They’re not. Each has a different cause and a different response. Separating them is the right next step.
Point-in-time features are hard. The look-ahead bias problem is real and underappreciated. A feature like “total downgrades” uses lifetime data — including future events for active customers. For a descriptive analysis that’s fine. For a live risk scorer it’s not. Building it correctly requires historical snapshots of customer state, not just current state. Most data warehouses don’t have this by default.
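One way to sketch the snapshot idea: keep an event log rather than current-state counters, and compute each feature by filtering to events visible at the scoring date. The schema below is hypothetical:

```python
import pandas as pd

# Hypothetical event log: one row per downgrade, with the date it happened.
events = pd.DataFrame({
    'account_id': [1, 1, 2, 2, 2],
    'event': 'downgrade',
    'event_date': pd.to_datetime(
        ['2022-03-01', '2023-05-01', '2021-08-01', '2022-09-01', '2023-11-01']),
})
snapshot = pd.Timestamp('2023-01-01')  # score accounts as of this date

# Only count events visible at the snapshot: no future leakage.
visible = events[events['event_date'] <= snapshot]
downgrades_at_snapshot = (
    visible.groupby('account_id').size().rename('downgrade_count')
)
print(downgrades_at_snapshot)
```

The same pattern extends to any cumulative feature; the expensive part in practice is having the event-level history in the warehouse at all.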
## The short version
Survival analysis is the right tool when the question is about duration, not just outcome. For SaaS churn specifically:
- Use Kaplan-Meier to describe retention by segment and identify when churn concentrates.
- Use Cox PH to isolate which factors drive churn controlling for each other. Check the proportional hazards assumption — it will fail for some variables, and that failure is informative.
- Use the partial hazard score to rank active accounts by risk and operationalize the model.
- The biggest improvement available is better features: behavioral data from early in the subscription, built from point-in-time historical snapshots.
lifelines is the right Python library to start with. The documentation is genuinely good.