Churn Modeling For Mobile Telecommunications
N. Scott Cardell, Mikhail Golovnya, Dan Steinberg
Salford Systems
http://www.salford-systems.com
June 2003
 Churn, the loss of a customer to a competitor, is a
problem for any provider of a subscription service or
recurring purchasable product
◦ Costs of customer acquisition and win-back can be high
◦ Best if churn can be prevented by preemptive action or by selecting
customers less likely to churn
 Churn is especially important to mobile phone service
providers given the ease with which a subscriber can
switch services
 The NCR Teradata Center for CRM at Duke University identified
churn prediction as a modeling topic deserving serious study
 A major mobile provider offered data for an international
modeling accuracy and targeted marketing competition
 Data was provided for 100,000 customers with at least 6 months
of service history, stratified into a roughly equal number of
churners and non-churners
 Objective was to predict probability of loss of a customer 30-60
days into the future
 Historical information provided in the form of
◦ Type and price of handset and recency of change/upgrade
◦ Total revenue and recurring charges
◦ Call behavior: statistics describing completed calls, failed calls, voice
and data calls, call forwarding, customer care calls, directory info
◦ Statistics included mean and range for at least 3 months, last 6
months, and lifetime
◦ Demographic and geographical information, including familiar Acxiom
style variables and census-derived neighborhood summaries.
 Competition posed a sharply defined task: churn within
a specific window for existing customers of a minimum
duration
 Challenge was framed to avoid the complications of
censoring that would require survival analysis models
 Each customer history was already summarized
 Data quality was good
 Vast majority of analytical effort could be devoted to
development of an accurate predictive model of a binary
outcome
| Data Set | Measure         | TreeNet Ensemble | Single TreeNet | 2nd Best | Avg. (Std)  |
|----------|-----------------|------------------|----------------|----------|-------------|
| Current  | Top Decile Lift | 2.90             | 2.88           | 2.80     | 2.14 (.536) |
| Current  | Gini            | .409             | .403           | .370     | .269 (.096) |
| Future   | Top Decile Lift | 3.01             | 2.99           | 2.74     | 2.09 (.585) |
| Future   | Gini            | .400             | .403           | .361     | .261 (.098) |
 Single TreeNet model always better than the 2nd-best
entry in the field
 Ensemble of TreeNets slightly better than the single
model on 3 of the 4 measures
 Best entries substantially better than the average entry
 In broad telecommunications markets the added
accuracy and lift of TreeNet models over
alternatives could easily translate into millions of
dollars of revenue per year
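The two competition metrics above can be computed directly from predicted scores. A minimal sketch, not the competition's official scoring code; the function names are ours:

```python
import numpy as np

def top_decile_lift(y_true, scores):
    """Churn rate in the top-scored decile divided by the overall churn rate."""
    y = np.asarray(y_true, dtype=float)
    order = np.argsort(np.asarray(scores))[::-1]  # highest scores first
    n_top = len(y) // 10
    return y[order[:n_top]].mean() / y.mean()

def gini(y_true, scores):
    """Gini coefficient = 2*AUC - 1, via the Mann-Whitney rank statistic."""
    y = np.asarray(y_true)
    ranks = np.argsort(np.argsort(np.asarray(scores))) + 1  # 1-based ranks
    n_pos = int(y.sum())
    n_neg = len(y) - n_pos
    auc = (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
    return 2 * auc - 1
```

With a 10% churn base rate, a perfect ranking yields a top-decile lift of 10 and a Gini of 1; the winning models' lift of roughly 3 means the top decile held about three times the base churn rate.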
 A modest amount of data preprocessing was undertaken
to repair and extend the original data
 Some missing values could be recoded to “0”
 Select non-missing values were recoded to missing
 Experiments with missing value handling were conducted,
including the addition of missing value indicators to the
data
◦ CART imputation
◦ “All missings together” strategies in decision trees
 Missings in a separate node
 Missings go with non-missing high values
 Missings go with non-missing low values
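One of the experiments above, adding missing-value indicators and recoding missings to 0, can be sketched as follows (a simplified stand-in for the actual preprocessing; the helper name is ours):

```python
import numpy as np

def recode_missing(x):
    """Return (filled, flag): NaNs recoded to 0 plus a 0/1 missing indicator.
    Which missings genuinely mean 'no activity' is a per-variable judgment."""
    x = np.asarray(x, dtype=float)
    flag = np.isnan(x).astype(int)       # 1 where the value was missing
    return np.where(flag == 1, 0.0, x), flag
```

Keeping the flag lets the trees split on missingness itself, which is what the indicator experiments tested.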
 TreeNet was key to winning the tournament
◦ Provided considerably greater accuracy and top decile lift than any
other modeling method we tried
 A new technology, different from standard boosting,
developed by Stanford University Professor Jerome Friedman
 Based on the CART® decision tree and thus inherits these
characteristics:
◦ Automatic feature selection
◦ Invariant with respect to order-preserving transforms of
predictors
◦ Immune to outliers
◦ Built-in methods for handling missing values
 Based on optimizing an objective function
◦ e.g: Likelihood function or sum of squared errors
 Objective function expressed in terms of a target
function of the data
◦ The target function is fit as a nonparametric function of
the data
◦ The fit optimizes the objective function
 Large number of small decision trees used to
form the nonparametric estimate
 Current implementation allows:
◦ Binary classification
◦ Multinomial classification
◦ Least-squares regression
◦ Least-absolute-deviation regression
◦ M-regression (Huber loss function)
 Other objective functions are possible
 The dependent variable y is coded (−1, +1)
 The target function F(x) is ½ the log-odds ratio:
F(x) = ½ log[ Pr(y = +1 | x) / Pr(y = −1 | x) ]
 F is initialized to the half log-odds on the full training
data set:
F_0 = ½ log[ (1 + ȳ) / (1 − ȳ) ], where ȳ is the mean of y
◦ Equivalent to fitting the data to a constant
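Assuming the Friedman (1999) two-class setup the slides follow, the constant initialization can be sketched as (function name is ours):

```python
import numpy as np

def initial_f(y):
    """Constant initializer: half the log-odds of the training data,
    with y coded in {-1, +1}. Equivalent to fitting a constant model."""
    ybar = np.mean(y)
    return 0.5 * np.log((1.0 + ybar) / (1.0 - ybar))
```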
 Do not use all training data in any one iteration
◦ Randomly sample from training data (we used a 50%
sample)
 Compute the log-likelihood gradient for each
observation:
◦ G(y_i, x_i) = 2 y_i / (1 + exp(2 y_i F(x_i)))
 Build a K-node tree to predict G(y,x)
◦ K=9 gave the best cross-validated results
◦ Important that trees be much smaller than an
optimal single CART tree
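The gradient and subsampling steps above can be sketched as follows (a minimal illustration with names of our choosing; in the method each sampled gradient is then fit with a 9-node regression tree, which any tree learner can supply):

```python
import numpy as np

def loglik_gradient(y, f):
    """Pseudo-response for the binomial log-likelihood with y in {-1, +1}:
    G_i = 2*y_i / (1 + exp(2*y_i*F(x_i)))."""
    y = np.asarray(y, dtype=float)
    f = np.asarray(f, dtype=float)
    return 2.0 * y / (1.0 + np.exp(2.0 * y * f))

def half_sample(n, rng):
    """50% random subsample of row indices, drawn fresh each iteration."""
    return rng.choice(n, size=n // 2, replace=False)
```

At F = 0 the gradient reduces to y itself, so the first tree simply fits the class labels.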
 Let R_mn denote terminal node n of tree m, and B_mn the
coefficient applied in that node
 Update formula:
F_m(x) = F_{m−1}(x) + Σ_n B_mn · 1(x ∈ R_mn)
 Repeat until T trees grown
 Select the value of m≤T that produces the
best fit to the test data
 Compute Y_mn, a single Newton-Raphson step for B_mn:
Y_mn = Σ_{x_i ∈ R_mn} G_i / Σ_{x_i ∈ R_mn} |G_i| (2 − |G_i|)
 Use only a small fraction p of Y_mn: B_mn = p · Y_mn
 Apply the update formula:
F_m(x) = F_{m−1}(x) + Σ_n B_mn · 1(x ∈ R_mn)
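Reading the update as Friedman's two-class scheme, the per-node Newton step and shrinkage can be sketched as (the names `newton_step` and `shrunken_coef` are ours):

```python
import numpy as np

def newton_step(g_node):
    """Single Newton-Raphson step Y_mn for one terminal node, computed
    from the gradients G_i of the observations falling in that node:
    Y = sum(G) / sum(|G| * (2 - |G|))."""
    g = np.asarray(g_node, dtype=float)
    return g.sum() / (np.abs(g) * (2.0 - np.abs(g))).sum()

def shrunken_coef(g_node, p):
    """Keep only the fraction p (the learning rate): B_mn = p * Y_mn."""
    return p * newton_step(g_node)
```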
 p is called the learning rate; T is the number of trees
grown
 The product pT is the total learning
◦ Holding pT constant, smaller p usually improves model fit to test
data, but can require many trees
 Reducing the learning rate tends to slowly increase the
optimal amount of total learning
 Very low learning rates can require many trees
 Our churn models used values of p from 0.001 to 0.01
 We used total learning pT of between 6 and 30
 All models used to score the competition entries were
built with 9-node trees
 Our final models used the following three combinations:
◦ (p=.001; T=6000; pT=6)
◦ (p=.005; T=2500; pT=12.5)
◦ (p=.01; T=3000; pT=30)
 One entry was a single TreeNet model (p=.01; T=3000;
pT=30)
◦ In this range all models had almost identical results on test data
◦ The scores were highly correlated (r≥.97)
◦ Within this range, a higher pT was the most important factor
◦ For models with pT=6, the smaller the learning rate the better
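Putting the pieces together, here is a toy end-to-end loop in the spirit of the settings above. One-split stumps stand in for the 9-node trees, mean-gradient leaf values replace the Newton step, and subsampling is omitted to keep it deterministic; this is an illustration, not TreeNet:

```python
import numpy as np

def fit_stump(x, g):
    """Fit a one-split regression stump to the gradient g -- a stand-in
    for the 9-node trees used in the actual models."""
    best = None
    for t in np.unique(x)[:-1]:
        lo, hi = g[x <= t].mean(), g[x > t].mean()
        err = ((np.where(x <= t, lo, hi) - g) ** 2).sum()
        if best is None or err < best[0]:
            best = (err, t, lo, hi)
    _, t, lo, hi = best
    return lambda xx: np.where(xx <= t, lo, hi)

def boost(x, y, p, T):
    """Minimal gradient-boosting loop: start F at the half log-odds, then
    repeatedly fit a stump to the pseudo-response and step by p."""
    ybar = y.mean()
    f = np.full(len(y), 0.5 * np.log((1.0 + ybar) / (1.0 - ybar)))
    for _ in range(T):
        g = 2.0 * y / (1.0 + np.exp(2.0 * y * f))  # log-likelihood gradient
        f = f + p * fit_stump(x, g)(x)
    return f  # training-set scores
```

Even with a tiny p, enough trees drive the scores toward the correct sign for each class, which is the "total learning pT" trade-off the slides describe.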
 Friedman, J.H. (1999). Stochastic gradient boosting.
Technical report, Department of Statistics, Stanford University.
 Friedman, J.H. (1999). Greedy function approximation:
A gradient boosting machine. Technical report,
Department of Statistics, Stanford University.
 Salford Systems (2002). TreeNet™ 1.0 Stochastic
Gradient Boosting. San Diego, CA: Salford Systems.
 Steinberg, D., Cardell, N.S., and Golovnya, M. (2003).
Stochastic Gradient Boosting and Restrained Learning.
Salford Systems discussion paper.
