Bank Customer Churn Demo with Machine Learning and AI as Needed
Customer churn prediction is one of the classic problems in data science. Banks, telecoms, and subscription services all face the same question: which customers are most likely to leave, and why? Solving this helps organizations act early with retention strategies.
When I set out to build this demo, my original idea was to show a two-step approach:

1. Use native Greenplum machine learning capabilities with Apache MADlib.
2. Then layer on AI with more advanced models to show improvements.
But here’s the surprise: the Greenplum-native machine learning results were so strong that there was no need to move on to the AI models at all. For this churn prediction dataset, a simple decision tree trained with Apache MADlib inside Greenplum gave excellent performance right out of the box.
How I Built It
The starting point was gpmlbot, a tool that tries multiple model families automatically, evaluates them against the dataset at hand, and recommends which ones are most likely to succeed. By running live experiments and generating SQL code, gpmlbot let me converge quickly on the right approach for churn prediction in Greenplum. It tested several candidates, including logistic regression, random forest, and decision trees, and identified the decision tree classifier as the best match for this dataset.
I used a Kaggle bank customer churn dataset covering 10,000 sample customers. The data was loaded into Greenplum in under one second with a single SQL command, referenced in my repository.
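The repo's load.sql holds the actual command; as a minimal sketch, a single-statement load of a Kaggle CSV could look like the following (the table name and file path here are assumptions, not the repo's real ones):

-- Illustrative single-statement load of the Kaggle CSV into Greenplum.
-- Table name and file path are placeholders; see load.sql for the real command.
COPY bank_churn
FROM '/data/bank_churn.csv'
WITH (FORMAT csv, HEADER true);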
The way machine learning works in Greenplum is straightforward: a SQL command is executed to train a model, and the result is a model output table that can be applied to predict outcomes on new datasets. I validated the model by testing it against data that was not part of the training set. In this case, the accuracy of the decision tree model was over 99 percent, with 9,986 out of 10,000 predictions correct.
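The real commands live in train_decisiontree.sql and final_validation.sql in the repo; as a rough sketch of what MADlib's decision-tree API looks like, the train-predict-validate cycle could be written like this (all table and column names below are assumptions for illustration):

-- Train a decision tree in-database with MADlib.
SELECT madlib.tree_train(
    'bank_churn_train',     -- training data table
    'churn_tree_model',     -- output model table
    'customer_id',          -- row id column
    'exited',               -- dependent variable (1 = churned)
    'credit_score, age, tenure, balance, num_products, is_active',  -- features
    NULL,                   -- columns to exclude
    'gini',                 -- split criterion
    NULL, NULL,             -- no grouping columns, no weights
    10                      -- maximum tree depth
);

-- Apply the trained model to held-out rows.
SELECT madlib.tree_predict(
    'churn_tree_model',     -- model table produced by tree_train
    'bank_churn_test',      -- data to score
    'churn_predictions',    -- output table of predictions
    'response'              -- return the predicted class label
);

-- Accuracy: the fraction of predictions that match the actual label.
SELECT round(100.0 * sum((p.estimated_exited = t.exited)::int) / count(*), 2)
       AS accuracy_pct
FROM churn_predictions p
JOIN bank_churn_test t USING (customer_id);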
The Final Workflow
Step 1: Create the schema using bankchurn_schema.sql.
Step 2: Load the dataset using load.sql.
Step 3: Train the model using train_decisiontree.sql.
Step 4: Validate the model using final_validation.sql.

The repo is here if you want to give it a try: https://github.com/ivannovick/bankchurn
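To run the whole flow end to end, the four scripts can be executed back to back from a psql session connected to the target Greenplum database (the script names come straight from the repo):

-- Step 1: create the schema
\i bankchurn_schema.sql
-- Step 2: load the Kaggle dataset
\i load.sql
-- Step 3: train the MADlib decision tree
\i train_decisiontree.sql
-- Step 4: validate against held-out data
\i final_validation.sql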
Closing Thoughts
What began as a plan to contrast Greenplum-native ML with more advanced AI techniques ended up proving something different: Greenplum’s in-database machine learning is powerful enough on its own for real-world business problems like churn prediction.
The combination of Apache MADlib’s algorithms, Greenplum’s parallelism, and gpmlbot’s ability to try different models, run rapid experiments, and generate code made solving this problem not only possible but efficient. gpmlbot’s recommendation of the decision tree classifier, and the process of training, validating, and exporting that model, showed how quickly churn prediction can be operationalized inside Greenplum. In the end, the native Greenplum decision tree performed so well that the AI layer wasn’t needed at all, though it remains available for future use cases that demand it.