My Data Portfolio | Task 2 Report

Predicting Customer Lifetime Value (CLV) using PySpark & Survival Analysis

Published on: Task 2 Assignment Date

In this project, I transitioned a local Pandas-based Cox Proportional Hazards model into a distributed PySpark architecture to calculate Customer Lifetime Value (CLV) at scale.

1. The Business Problem

Retaining telecommunications customers is critical. We needed a way to translate statistical survival probabilities into actual financial metrics (Net Present Value).
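One way to picture that translation is to weight each month's expected margin by the probability that the customer is still active, then discount it back to today. The sketch below is a minimal plain-Python illustration, not the project's actual code: the function name `monthly_npv`, the flat `monthly_margin`, the `annual_discount_rate` divided by 12, and the one-based month index are all assumptions made for the example.

```python
def monthly_npv(survival_probs, monthly_margin, annual_discount_rate):
    """Discounted expected margin per month.

    survival_probs[i] is the (assumed) probability that the customer is
    still active in month i + 1, e.g. read off a fitted survival curve.
    """
    monthly_rate = annual_discount_rate / 12  # simple, non-compounded conversion (assumption)
    return [
        s * monthly_margin / (1 + monthly_rate) ** month
        for month, s in enumerate(survival_probs, start=1)
    ]


# Hypothetical inputs: a two-month survival curve and a 12% annual rate.
npv_by_month = monthly_npv([1.0, 0.5], monthly_margin=101.0, annual_discount_rate=0.12)
```

Summing these monthly values over a horizon gives the cumulative NPV that the rest of the report works with.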

2. Technical Implementation (PySpark)

To handle large-scale datasets, I rewrote the Pandas DataFrame logic using PySpark's Window functions and built-in column expressions. Here is a snippet of the core logic:

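from pyspark.sql import Window, functions as F

# Running-total frame: every row from the first contract month up to the current one.
# With no partitionBy, Spark moves all rows to one partition; fine for a small cohort-level table.
w = Window.orderBy("contract_month").rowsBetween(Window.unboundedPreceding, Window.currentRow)

# Cumulative NPV rounded to two decimals, then contract_month shifted up by one.
final_cohort_df = clv_cohort_df.withColumn("cumulative_npv", F.round(F.sum("npv").over(w), 2)) \
    .withColumn("contract_month", F.col("contract_month") + 1)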
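To make the window semantics concrete, here is a small plain-Python sketch of what the running-total frame computes on a handful of rows. It is only an illustration under assumptions: the helper name `cumulative_npv` is hypothetical, the input months are taken to be zero-based (which is what the `+ 1` shift suggests), and Python's `round` can differ from Spark's `F.round` on exact ties.

```python
from itertools import accumulate


def cumulative_npv(rows):
    """rows: list of (contract_month, npv) tuples, months assumed zero-based."""
    # Mirror Window.orderBy("contract_month").
    ordered = sorted(rows, key=lambda row: row[0])
    # Mirror rowsBetween(unboundedPreceding, currentRow): a running sum.
    running_totals = accumulate(npv for _, npv in ordered)
    # Round to two decimals and shift the month index by one.
    return [
        (month + 1, round(total, 2))
        for (month, _), total in zip(ordered, running_totals)
    ]


# Hypothetical values, deliberately out of order to show the sort.
curve = cumulative_npv([(1, 20.0), (0, 10.5), (2, 5.25)])
```

Each output row carries the sum of all NPV values up to and including its own month, which is the quantity plotted in the payback analysis.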

3. Visualizing the Payback Period

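The analysis reports cumulative NPV at 12, 24, and 36 months, which gives the marketing team a ceiling for Customer Acquisition Cost (CAC): if acquisition spend exceeds the cumulative NPV at a given horizon, the customer has not paid back by that point.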
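The payback period itself can be read off the cumulative curve as the first month in which cumulative NPV reaches the acquisition cost. The following is a minimal sketch under assumptions: `payback_month` is a hypothetical helper, the curve is a list of `(contract_month, cumulative_npv)` pairs already sorted by month, and the figures are made up for illustration.

```python
def payback_month(cumulative_curve, cac):
    """Return the first month whose cumulative NPV covers the CAC, or None."""
    for month, cumulative_npv in cumulative_curve:
        if cumulative_npv >= cac:
            return month
    return None  # CAC is not recovered within the horizon covered by the curve


# Hypothetical curve and acquisition costs.
example_curve = [(1, 40.0), (2, 75.0), (3, 105.0)]
recovered_in = payback_month(example_curve, cac=100.0)
never_recovered = payback_month(example_curve, cac=200.0)
```

A `None` result signals that the proposed CAC is above what the customer returns within the modeled horizon, which is the case the cap is meant to prevent.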

Conclusion: Migrating to PySpark allowed us to overcome the single-node memory limitations of the lifelines library, setting the stage for industrial-scale batch predictions.