2015 Analytic Challenge
Karan Sarao
TEAM
• Karan Sarao
ANALYTIC SOFTWARE USED
• Data Preparation – SAS
• Model Building – R
• Hardware
‒ Acer Aspire 5750
‒ 6 GB RAM
SOLUTION OVERVIEW
Data Preparation
Missing Value Treatment
• Nominal – new category
• Numeric/Ordinal – replace with the value 0
New Variable Creation
• Multiple derived variables
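The missing-value rules above can be sketched in a few lines. This is a minimal illustration in Python, not the SAS code used for the challenge; the label `MISSING`, the function name `treat_missing`, and the sample values are assumptions made for the example.

```python
from typing import Any, Dict, List, Optional

MISSING_LABEL = "MISSING"  # hypothetical name for the new nominal category


def treat_missing(
    rows: List[Dict[str, Optional[Any]]],
    nominal_cols: List[str],
    numeric_cols: List[str],
) -> List[Dict[str, Any]]:
    """Return copies of the rows with missing values filled in."""
    treated = []
    for row in rows:
        new_row = dict(row)
        for col in nominal_cols:
            # Nominal: a missing value becomes its own category level.
            if new_row.get(col) in (None, ""):
                new_row[col] = MISSING_LABEL
        for col in numeric_cols:
            # Numeric/ordinal: a missing value is replaced with 0.
            if new_row.get(col) is None:
                new_row[col] = 0
        treated.append(new_row)
    return treated


# Made-up sample rows; only the column names come from the deck.
rows = [
    {"TXN_CHANNEL_CD": "WEB", "PAYMENT_QTY": 3},
    {"TXN_CHANNEL_CD": None, "PAYMENT_QTY": None},
]
clean = treat_missing(rows, ["TXN_CHANNEL_CD"], ["PAYMENT_QTY"])
```

Keeping "missing" as a separate level lets tree-based and linear models learn whether the absence of a value is itself predictive.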
Model Tuning and Stacking
Training / Blending / Testing split
caret functions used to tune multiple model parameters
Stacking and testing to optimize the sequence
Final Modeling
2-stage modeling process adopted
Initial set of optimized models created in Stage 1
Scores incorporated into the final blended model in Stage 2
Scoring
2-stage scoring process followed
SOLUTION OVERVIEW – Continued (Model Tuning)
Model Tuning Process
Data Splitting
• Modeling data set – random assignment
‒ 50% of observations (Stage 1 modeling)
‒ 30% of observations (Stage 2 modeling)
‒ 20% of observations (evaluation)
Stage 1 Modeling
• Stage 1 models: Model 1 to Model 5
• Score all 5 models on the Stage 2 data, append scores as new variables
Stage 2 Modeling
• Stage 2 models: Model 1 to Model 5
Evaluation Phase
• Run Stage 1 models
• Run Stage 2 models
• Compare performance of all Stage 2 models
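A random 50% / 30% / 20% assignment like the one in the diagram could look as follows. This is a sketch in Python rather than the tooling used in the deck. The function name `split_50_30_20` and the seed are assumptions, and the mapping of the three shares to Stage 1, Stage 2 and evaluation follows the order in which the slide lists them.

```python
import random
from typing import List, Sequence, Tuple


def split_50_30_20(
    row_ids: Sequence[int], seed: int = 2015
) -> Tuple[List[int], List[int], List[int]]:
    """Randomly assign row ids to Stage 1 (50%), Stage 2 (30%) and evaluation (20%)."""
    ids = list(row_ids)
    random.Random(seed).shuffle(ids)  # seeded so the split is reproducible
    n = len(ids)
    cut1 = n // 2                  # end of the first 50%
    cut2 = cut1 + (3 * n) // 10    # end of the next 30%
    return ids[:cut1], ids[cut1:cut2], ids[cut2:]


stage1_ids, stage2_ids, eval_ids = split_50_30_20(range(1000))
```

Because the three sets are disjoint, the Stage 2 models never see rows that the Stage 1 models were trained on, and the evaluation set stays untouched until the final comparison.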
DATA TRANSFORMATIONS
• Mix of linear and non-linear (tree-based) models
‒ Cover each other's weaknesses
‒ Tree-based models are invariant to order-preserving transformations (no need for log/exponent etc.)
• More focus on feature engineering; new variables created as below
‒ SHIP_RATIO=(ORDER_SH_AMT+ORDER_ADDL_SH_AMT)/ORDER_GROSS_AMT (Does shipping cost as a ratio of the initial order have any influence?)
‒ PAYMT_RATIO=(ORDER_SH_AMT+ORDER_ADDL_SH_AMT+ORDER_GROSS_AMT)/PAYMENT_QTY (What is the amount of each payment?)
‒ REV_RATIO=TOTAL_REV_PRIOR_TO_A/TENURE (Revenue per unit of tenure)
‒ REV_PER_ORDER=TOTAL_REV_PRIOR_TO_A/TOTAL_ORDERS_PRIOR_TO_A (Revenue per order)
‒ FIRST_ORDER_RATIO=ORDER_GROSS_AMT/ITEM_QTY
‒ FIRST_PAYMENT_RATIO=ORDER_GROSS_AMT/PAYMENT_QTY
‒ ORDER_FREQ=TENURE/TOTAL_ORDERS_PRIOR_TO_A
‒ ORDER_DUE_RATIO=RECENCY/ORDER_FREQ
‒ ORDER_DUE_RATIO_2=(RECENCY-ORDER_FREQ)/ORDER_FREQ
‒ ORDER_DUE_RATIO_3=(RECENCY-ORDER_FREQ)/RECENCY
‒ All divide-by-zero results set to 0
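The derived ratios and the divide-by-zero rule listed above can be written out directly. The sketch below is in Python, not the SAS used for data preparation. The helper names `safe_div` and `derive_features` and the sample values are assumptions; the formulas and column names are taken from the list.

```python
from typing import Dict


def safe_div(num: float, den: float) -> float:
    """Divide, returning 0 when the denominator is zero."""
    return num / den if den != 0 else 0.0


def derive_features(r: Dict[str, float]) -> Dict[str, float]:
    """Compute the derived ratio variables for one customer record."""
    shipping = r["ORDER_SH_AMT"] + r["ORDER_ADDL_SH_AMT"]
    order_freq = safe_div(r["TENURE"], r["TOTAL_ORDERS_PRIOR_TO_A"])
    return {
        "SHIP_RATIO": safe_div(shipping, r["ORDER_GROSS_AMT"]),
        "PAYMT_RATIO": safe_div(shipping + r["ORDER_GROSS_AMT"], r["PAYMENT_QTY"]),
        "REV_RATIO": safe_div(r["TOTAL_REV_PRIOR_TO_A"], r["TENURE"]),
        "REV_PER_ORDER": safe_div(r["TOTAL_REV_PRIOR_TO_A"], r["TOTAL_ORDERS_PRIOR_TO_A"]),
        "FIRST_ORDER_RATIO": safe_div(r["ORDER_GROSS_AMT"], r["ITEM_QTY"]),
        "FIRST_PAYMENT_RATIO": safe_div(r["ORDER_GROSS_AMT"], r["PAYMENT_QTY"]),
        "ORDER_FREQ": order_freq,
        "ORDER_DUE_RATIO": safe_div(r["RECENCY"], order_freq),
        "ORDER_DUE_RATIO_2": safe_div(r["RECENCY"] - order_freq, order_freq),
        "ORDER_DUE_RATIO_3": safe_div(r["RECENCY"] - order_freq, r["RECENCY"]),
    }


# Made-up record for illustration only.
example = {
    "ORDER_SH_AMT": 5.0, "ORDER_ADDL_SH_AMT": 1.0, "ORDER_GROSS_AMT": 60.0,
    "PAYMENT_QTY": 3.0, "ITEM_QTY": 2.0, "TOTAL_REV_PRIOR_TO_A": 240.0,
    "TOTAL_ORDERS_PRIOR_TO_A": 4.0, "TENURE": 24.0, "RECENCY": 3.0,
}
features = derive_features(example)

# A customer with no prior orders exercises the divide-by-zero rule.
new_customer = dict(example, TOTAL_ORDERS_PRIOR_TO_A=0.0, TOTAL_REV_PRIOR_TO_A=0.0)
new_features = derive_features(new_customer)
```

Note that ORDER_FREQ feeds the three ORDER_DUE ratios, so a zero ORDER_FREQ sets ORDER_DUE_RATIO and ORDER_DUE_RATIO_2 to 0 as well.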
Stage 1 Models
Multiple models trained on 50% of the data
• Random Forests (randomForest)
• AdaBoost (ada)
• Gradient Boosting Machines (gbm)
• eXtreme Gradient Boosting (xgboost)
• Logistic Regression (variables selected by studying glmnet output)
• Regularized Logistic Regression (glmnet)
Several of the above models have tunable parameters
• caret package in R used to cycle through various combinations of input parameters using multiple folds
• The problem statement gives primacy to rank ordering, hence the ROC metric (AUC) was maximized
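Why a rank-ordering objective leads to ROC/AUC can be seen from its definition: AUC is the probability that a randomly chosen positive case is scored above a randomly chosen negative one. The Python sketch below is an illustration written for this transcript, not the metric code used in the challenge. It uses the direct pairwise definition, which is quadratic in the number of rows and meant only for small examples; the labels and scores are made up.

```python
from typing import Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.1]
base = auc(labels, scores)

# A monotone rescaling keeps the ordering of the scores, so the AUC is unchanged.
rescaled = auc(labels, [s ** 3 for s in scores])
```

Only the ordering of the scores matters, so models whose outputs are on different scales can still be compared on this metric.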
Stage 2 Models
• All 5 models built in Stage 1 used to score both the Stage 2 and evaluation data
• 5 score columns added back to the data set (Stage 2 and evaluation)
• 4 models created again on the Stage 2 dataset
• Stage 1 and Stage 2 models are scored on the evaluation dataset
• ROC (AUC) calculated for the models on the evaluation dataset
• Best model identified – xgboost (Stage 2)
AUC on the evaluation set
Model            Stage 1    Stage 2
xgboost          0.646      0.647
logit            0.641      0.646
gbm              0.636      0.644
glmnet           0.641      0.642
ada              0.637      0.642
random forest    0.617      NA
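The step of adding Stage 1 scores back to the data as new columns can be sketched as below. This is an illustrative Python outline, not the R code behind the results. The two scoring functions are hypothetical stand-ins for fitted Stage 1 models, and the `SCORE_` column prefix is an assumption.

```python
from typing import Callable, Dict, List

Row = Dict[str, float]
Scorer = Callable[[Row], float]


def append_stage1_scores(rows: List[Row], stage1: Dict[str, Scorer]) -> List[Row]:
    """Return copies of the rows with one extra score column per Stage 1 model."""
    out = []
    for row in rows:
        new_row = dict(row)
        for name, scorer in stage1.items():
            # Score on the original features only, so models never see each other's scores.
            new_row["SCORE_" + name] = scorer(row)
        out.append(new_row)
    return out


# Hypothetical stand-ins for fitted Stage 1 models.
stage1_models: Dict[str, Scorer] = {
    "logit": lambda r: 0.1 + 0.5 * r["SHIP_RATIO"],
    "gbm": lambda r: 0.9 if r["PAYMENT_QTY"] > 2 else 0.2,
}

# Made-up Stage 2 rows.
stage2_rows = [
    {"SHIP_RATIO": 0.2, "PAYMENT_QTY": 3.0},
    {"SHIP_RATIO": 0.0, "PAYMENT_QTY": 1.0},
]
stage2_augmented = append_stage1_scores(stage2_rows, stage1_models)
```

The Stage 2 models are then trained on the augmented rows. The same function would be applied to the evaluation rows before either stage is compared.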
Final Model Building
• Data split 50-50 between Stage 1 modeling and Stage 2 blending
• xgboost used to blend in Stage 2
• The initial 5 models score the submission dataset, and the scores are merged back to create the dataset for the sixth model
• Blend model used to generate the final submission score
TOP VARIABLES
Important Variables
TXN_CHANNEL_CD
PAYMENT_QTY
RUSH_ORD_FLAG
SHIP_RATIO
FIRST_ORDER_RATIO
DEMOGRAPHIC_SEGMENT
ORDER_GROSS_AMT
RETAIL/CATALOG_SPENDING_QUINTILE
REV_PER_ORDER
HH_INCOME
PAYMT_RATIO
ETHNICITY
LANGUAGE
• Mix of ready and derived variables
• Ranking of top variables can be difficult to quantify across multiple modeling techniques/blends
• Plain logistic regression with these variables can create a model with comparable performance (~0.64 AUC)
KEYS TO SUCCESS
• Derived variables
‒ Create as many behavioral/pattern variables as possible
‒ Ratios such as revenue/order, order frequency, shipping cost to total cost, etc.
• Cross-validation for controlling overfit
‒ K-fold (maximum possible) validation runs
‒ Tune parameters (control depth and boosting rounds to maximize test ROC)
‒ Use grid search for the optimum parameters, or employ the caret package
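The k-fold grid search mentioned above follows a simple pattern: for every parameter combination, score each held-out fold and keep the combination with the best mean. The Python sketch below shows only that loop; it is not caret and not the tuning code from the challenge. The names `k_fold_indices`, `grid_search` and `toy_evaluate` are assumptions, and `toy_evaluate` is a made-up stand-in for "fit on the training folds, return the AUC on the held-out fold". The parameter names `max_depth` and `nrounds` are used as illustrative labels for tree depth and boosting rounds.

```python
import itertools
import random
from typing import Callable, Dict, List, Sequence, Tuple

Params = Dict[str, float]
Evaluate = Callable[[Params, List[int], List[int]], float]


def k_fold_indices(n_rows: int, k: int, seed: int = 0) -> List[List[int]]:
    """Shuffle the row indices and deal them into k folds of near-equal size."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def grid_search(
    grid: Dict[str, Sequence[float]], n_rows: int, k: int, evaluate: Evaluate
) -> Tuple[Params, float]:
    """Return the parameter combination with the highest mean held-out score."""
    folds = k_fold_indices(n_rows, k)
    names = sorted(grid)
    best_params: Params = {}
    best_score = float("-inf")
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        fold_scores = []
        for i, test_idx in enumerate(folds):
            # Train on every fold except the one held out for testing.
            train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
            fold_scores.append(evaluate(params, train_idx, test_idx))
        mean_score = sum(fold_scores) / k
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


def toy_evaluate(params: Params, train_idx: List[int], test_idx: List[int]) -> float:
    """Made-up score surface that peaks at depth 4 and 200 rounds; ignores the data."""
    return 0.65 - 0.01 * abs(params["max_depth"] - 4) - 0.0001 * abs(params["nrounds"] - 200)


grid = {"max_depth": [2, 4, 6], "nrounds": [100, 200, 400]}
best_params, best_score = grid_search(grid, n_rows=100, k=5, evaluate=toy_evaluate)
```

In a real run, `evaluate` would train a model on `train_idx` and return its AUC on `test_idx`, so the selected depth and number of rounds are the ones that generalize best rather than the ones that fit the training rows best.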

DMA Analytics Challenge 2015 (Winner - First Position)
