DMA Analytics Challenge 2015 (Winner - First Position)

2015 Analytic Challenge
KARAN SARAO

ANALYTIC SOFTWARE USED
• Data Preparation – SAS
• Model Building – R
• Hardware
– Acer Aspire 5750
– 6 GB RAM

SOLUTION OVERVIEW
Data Preparation
Missing Value Treatment
•Nominal – New Category
•Numeric/Ordinal – Replace with 0 (Value)
New Variable Creation
•Multiple derived Variables
Model Tuning and Stacking
•Training / Blending / Testing split
•Caret function to tune multiple model parameters
•Stacking and testing to optimize sequence
Final Modeling
•2-stage modeling process adopted
•Initial set of optimized models created in Stage 1
•Scores incorporated into final blended model in Stage 2
Scoring
•2-stage scoring process followed
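The missing value treatment listed under Data Preparation can be sketched as follows. This is only an illustration in Python (the deck's own preparation was done in SAS); the function name `treat_missing`, the `"MISSING"` category label, and the choice of which fields count as nominal or numeric are assumptions, not details from the slides.

```python
def treat_missing(record, nominal_fields, numeric_fields, new_category="MISSING"):
    """Return a copy of `record` with missing values filled in.

    Nominal fields get a new category of their own; numeric/ordinal
    fields are replaced with 0. The "MISSING" label is an assumption.
    """
    cleaned = dict(record)
    for field in nominal_fields:
        if cleaned.get(field) in (None, ""):
            cleaned[field] = new_category
    for field in numeric_fields:
        if cleaned.get(field) is None:
            cleaned[field] = 0
    return cleaned


# Hypothetical row; treating HH_INCOME as numeric/ordinal is an assumption.
row = {"TXN_CHANNEL_CD": None, "HH_INCOME": None, "PAYMENT_QTY": 3}
filled = treat_missing(row, ["TXN_CHANNEL_CD"], ["HH_INCOME", "PAYMENT_QTY"])
```

Giving missing nominal values their own category lets a model learn whether "missing" itself carries signal, instead of discarding those rows.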

SOLUTION OVERVIEW – Continued (Model Tuning)
Model Tuning Process
Data Splitting
• Modeling data set, random assignment
‒ 50% of observations: Stage 1 modeling
‒ 30% of observations: Stage 2 modeling
‒ 20% of observations: evaluation
Stage 1 Modeling
• Stage 1 models: Model 1 to Model 5
• Score all 5 models on Stage 2 data, append scores as new variables
Stage 2 Modeling
• Stage 2 models: Model 1 to Model 5
Evaluation Phase
• Run Stage 1 models
• Run Stage 2 models
• Compare performance of all Stage 2 models
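A random 50% / 30% / 20% assignment like the one in the tuning process could look roughly like this. It is a Python sketch, not the original R code; the function name `three_way_split`, the seed value, and the mapping of the three parts to Stage 1, Stage 2, and evaluation (taken from the order in the diagram) are assumptions.

```python
import random


def three_way_split(rows, fractions=(0.5, 0.3, 0.2), seed=2015):
    """Randomly assign rows to three disjoint parts.

    Assumed mapping: first part = Stage 1 modeling, second part =
    Stage 2 modeling, third part = evaluation.
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n = len(shuffled)
    cut1 = round(n * fractions[0])
    cut2 = cut1 + round(n * fractions[1])
    return shuffled[:cut1], shuffled[cut1:cut2], shuffled[cut2:]


stage1, stage2, evaluation = three_way_split(range(1000))
```

Keeping the evaluation part untouched until the end is what makes the comparison of Stage 1 and Stage 2 models a fair one.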

DATA TRANSFORMATIONS
• Mix of linear and non-linear (tree-based) models
‒ Cover each other's weaknesses
‒ Tree-based models are invariant to order-preserving transformations (no need for log/exponent etc.)
• More focus on feature engineering; new variables created as below
‒ SHIP_RATIO=(ORDER_SH_AMT+ORDER_ADDL_SH_AMT)/ORDER_GROSS_AMT (Does shipping cost as a ratio of the initial order have any influence?)
‒ PAYMT_RATIO=(ORDER_SH_AMT+ORDER_ADDL_SH_AMT+ORDER_GROSS_AMT)/PAYMENT_QTY (What is the amount of each payment?)
‒ REV_RATIO=TOTAL_REV_PRIOR_TO_A/TENURE (Revenue ratio per unit tenure)
‒ REV_PER_ORDER=TOTAL_REV_PRIOR_TO_A/TOTAL_ORDERS_PRIOR_TO_A (Revenue per order)
‒ FIRST_ORDER_RATIO=ORDER_GROSS_AMT/ITEM_QTY
‒ FIRST_PAYMENT_RATIO=ORDER_GROSS_AMT/PAYMENT_QTY
‒ ORDER_FREQ=TENURE/TOTAL_ORDERS_PRIOR_TO_A
‒ ORDER_DUE_RATIO=RECENCY/ORDER_FREQ
‒ ORDER_DUE_RATIO_2=(RECENCY-ORDER_FREQ)/ORDER_FREQ
‒ ORDER_DUE_RATIO_3=(RECENCY-ORDER_FREQ)/RECENCY
‒ All divide by zero exceptions set to 0
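The derived variables above, including the divide-by-zero rule, can be expressed compactly. The sketch below is in Python rather than the SAS used for the actual preparation; the helper names `safe_ratio` and `derive_features` are hypothetical, while the column names and formulas are the ones listed on the slide.

```python
def safe_ratio(numerator, denominator):
    """Divide, returning 0 when the denominator is zero (the slide's rule)."""
    return numerator / denominator if denominator else 0.0


def derive_features(row):
    """Return a copy of `row` with the ratio variables from the slide added."""
    out = dict(row)
    shipping = row["ORDER_SH_AMT"] + row["ORDER_ADDL_SH_AMT"]
    gross = row["ORDER_GROSS_AMT"]
    recency = row["RECENCY"]
    out["SHIP_RATIO"] = safe_ratio(shipping, gross)
    out["PAYMT_RATIO"] = safe_ratio(shipping + gross, row["PAYMENT_QTY"])
    out["REV_RATIO"] = safe_ratio(row["TOTAL_REV_PRIOR_TO_A"], row["TENURE"])
    out["REV_PER_ORDER"] = safe_ratio(
        row["TOTAL_REV_PRIOR_TO_A"], row["TOTAL_ORDERS_PRIOR_TO_A"]
    )
    out["FIRST_ORDER_RATIO"] = safe_ratio(gross, row["ITEM_QTY"])
    out["FIRST_PAYMENT_RATIO"] = safe_ratio(gross, row["PAYMENT_QTY"])
    order_freq = safe_ratio(row["TENURE"], row["TOTAL_ORDERS_PRIOR_TO_A"])
    out["ORDER_FREQ"] = order_freq
    out["ORDER_DUE_RATIO"] = safe_ratio(recency, order_freq)
    out["ORDER_DUE_RATIO_2"] = safe_ratio(recency - order_freq, order_freq)
    out["ORDER_DUE_RATIO_3"] = safe_ratio(recency - order_freq, recency)
    return out
```

For a customer with no prior orders, `ORDER_FREQ` and every ratio that divides by it come out as 0 instead of raising an error.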

Stage 1 Models
Multiple models trained on 50% of the data
• Random Forests (randomForest)
• AdaBoost (ada)
• Gradient Boosting Machines (gbm)
• eXtreme Gradient Boosting (xgboost)
• Logistic Regression (variables selected by studying glmnet output)
• Regularized Logistic Regression (glmnet)
Several of the above models have tunable parameters
• Caret package in R used to cycle through various combinations of input parameters using multiple folds
• Problem statement specifies rank-order primacy, hence ROC metric (AUC) maximized
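Because the ROC metric only depends on how the scores rank the observations, it can be computed from ranks alone. The sketch below is a plain Python illustration of that idea (the rank-sum form of AUC, with tied scores sharing an average rank); it is not the routine the deck used, and the function name `roc_auc` is hypothetical.

```python
def roc_auc(labels, scores):
    """AUC via the rank-sum formula; labels are 0/1, ties share a rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Find the run of tied scores starting at position i.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative")
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 0.75
```

Any order-preserving rescaling of the scores leaves this value unchanged, which is why AUC suits a problem that cares about rank order.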

Stage 2 Models
• All 5 models built in Stage 1 used to score both Stage 2 and evaluation data
• 5 score columns added back to the data set (Stage 2 and evaluation)
• 4 models created again on Stage 2 dataset
• Stage 1 and Stage 2 models are scored on evaluation dataset
• ROC (AUC) calculated for the models on evaluation dataset
• Best model identified – xgboost (Stage 2)

Model           Stage 1 AUC         Stage 2 AUC
                (evaluation set)    (evaluation set)
xgboost         0.646               0.647
logit           0.641               0.646
gbm             0.636               0.644
glmnet          0.641               0.642
ada             0.637               0.642
random forest   0.617               NA
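The step of adding Stage 1 score columns back to the data can be pictured as below. This is a Python sketch of the mechanics only: the function name `append_stage1_scores`, the `SCORE_` column prefix, and the toy scoring functions standing in for fitted models are all assumptions, not part of the original R workflow.

```python
def append_stage1_scores(rows, stage1_models):
    """Add one score column per Stage 1 model to each row.

    `rows` is a list of dicts; `stage1_models` maps a model name to a
    callable that takes a row and returns a score. Each model sees only
    the original columns, never the other models' scores.
    """
    augmented = []
    for row in rows:
        new_row = dict(row)
        for name, model in stage1_models.items():
            new_row["SCORE_" + name] = model(row)
        augmented.append(new_row)
    return augmented


# Toy stand-ins for fitted Stage 1 models (hypothetical scoring rules).
toy_models = {
    "gbm": lambda r: 0.1 * r["PAYMENT_QTY"],
    "glmnet": lambda r: 0.5,
}
stage2_rows = append_stage1_scores([{"PAYMENT_QTY": 3}], toy_models)
```

The same call would be applied to both the Stage 2 data and the evaluation data, so that a Stage 2 model trained on the augmented columns can also be scored on the evaluation set.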

Final Model Building
• Data split as 50-50 between Stage 1 modeling and Stage 2 blending
• Xgboost used to blend in Stage 2
• Initial 5 models score the submission dataset and scores merged back to create dataset for sixth model
• Blend model used to generate the final submission score

TOP VARIABLES
Important Variables
TXN_CHANNEL_CD
PAYMENT_QTY
RUSH_ORD_FLAG
SHIP_RATIO
FIRST_ORDER_RATIO
DEMOGRAPHIC_SEGMENT
ORDER_GROSS_AMT
RETAIL/CATALOG_SPENDING_QUINTILE
REV_PER_ORDER
HH_INCOME
PAYMT_RATIO
ETHNICITY
LANGUAGE
• Mix of ready and derived variables
• Ranking of top variables can be difficult to quantify across multiple modeling techniques/blends
• Plain logistic regression with these variables can create a model with comparable performance (~0.64 AUC)

KEYS TO SUCCESS
• Derived Variables
‒ Create as many behavioral/pattern variables as possible
‒ Ratios such as revenue/order, order frequency, shipping cost to total cost etc.
• Cross Validation for controlling overfit
‒ K-fold (maximum possible) validation runs
‒ Tune parameters (control depth and boosting rounds to maximize test ROC)
‒ Use grid search for optimum parameter search or employ Caret package
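The combination of K-fold validation and grid search mentioned above can be outlined in a few lines. This Python sketch only shows the control flow that a tool such as Caret automates; the function names, the parameter names `max_depth` and `nrounds`, and the toy `evaluate` function (which stands in for "fit on the training fold, return ROC on the test fold") are all hypothetical.

```python
import itertools


def k_fold_indices(n_rows, k):
    """Yield (train_idx, test_idx); fold i tests rows i, i+k, i+2k, ..."""
    for fold in range(k):
        test_idx = list(range(fold, n_rows, k))
        held_out = set(test_idx)
        train_idx = [i for i in range(n_rows) if i not in held_out]
        yield train_idx, test_idx


def grid_search(param_grid, evaluate, n_rows, k=5):
    """Return the parameter combination with the best mean fold score."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for combo in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combo))
        fold_scores = [
            evaluate(params, train_idx, test_idx)
            for train_idx, test_idx in k_fold_indices(n_rows, k)
        ]
        mean_score = sum(fold_scores) / len(fold_scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


# Toy evaluate: pretends the score peaks at depth 4 and 200 rounds.
def toy_evaluate(params, train_idx, test_idx):
    return -abs(params["max_depth"] - 4) - abs(params["nrounds"] - 200) / 100


best, score = grid_search(
    {"max_depth": [2, 4, 6], "nrounds": [100, 200]}, toy_evaluate, n_rows=10
)
```

Averaging over folds before picking a winner is what keeps the chosen depth and number of boosting rounds from simply fitting one lucky split.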
