Data mining to improve e-mail marketing

Tayko
Smart Marketing using analytics

Business Problem
• Tayko is a software catalog firm that sells games and educational software.
• It wants to market a new collection through e-mail.
• As a member of an industry consortium, it can pull 200,000 e-mail addresses from
the central repository of the consortium.
• To maximize the benefit, Tayko wants to pull the records with a high probability of
response and a high sale value.

Analytics Problem
1. Create a classification model that groups customers into responders or
purchasers (1) and non-responders or non-purchasers (0).
2. Create a prediction model that predicts the sale value for responders (1).
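
One common way to combine the two models when choosing which records to pull is to rank customers by expected spending, that is, purchase probability times predicted spend. The deck does not state how the two outputs are combined, so the sketch below is an assumption; the `Scored` type, the function name and the sample rows are all hypothetical.

```python
from typing import NamedTuple


class Scored(NamedTuple):
    customer_id: str
    p_purchase: float       # output of the classification model
    predicted_spend: float  # output of the prediction model, given a purchase


def rank_by_expected_spend(rows):
    """Sort customers by P(purchase) * predicted spend, highest first."""
    return sorted(rows, key=lambda r: r.p_purchase * r.predicted_spend, reverse=True)


# Made-up illustrative rows, not taken from the Tayko data.
sample = [
    Scored("c1", 0.9, 50.0),   # expected spend 45.0
    Scored("c2", 0.4, 200.0),  # expected spend 80.0
    Scored("c3", 0.7, 100.0),  # expected spend 70.0
]
ranked = rank_by_expected_spend(sample)
top_ids = [r.customer_id for r in ranked]  # c2, c3, c1
```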

Data Collection
• Supervised learning techniques are applied because the desired output is
already defined.
• A sample of 2,000 customers is drawn from the central repository and a test e-mail
campaign is run.
• The two target variables, Purchased and Spending, are recorded for the sample.
• The result showed 1,000 purchasers and 1,000 non-purchasers.

Data partitioning
• The data set is partitioned into:
  • Training set: 60% (1,200 records)
  • Test set: 20% (400 records)
  • Validation set: 20% (400 records)
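
A 60/20/20 split like the one above can be sketched as a shuffle followed by slicing. The deck does not say how records were assigned to each set, so the random shuffle, the seed and the function name are assumptions made for illustration.

```python
import random


def partition(records, fractions=(0.6, 0.2, 0.2), seed=42):
    """Shuffle records and split them into training, test and validation sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = round(len(shuffled) * fractions[0])
    n_test = round(len(shuffled) * fractions[1])
    train_set = shuffled[:n_train]
    test_set = shuffled[n_train:n_train + n_test]
    validation_set = shuffled[n_train + n_test:]  # whatever remains
    return train_set, test_set, validation_set


# Record indices stand in for the 2,000 sampled customers.
train_set, test_set, validation_set = partition(range(2000))
sizes = (len(train_set), len(test_set), len(validation_set))
```

With 2,000 records this yields sets of 1,200, 400 and 400 records, with no record appearing in more than one set.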

Initial Study
What kinds of variables are present?

Finding the variables with strong differentiation power – Nominal Variables
• Catalogs A, T, U and P show a high percentage of customers making a purchase.
• Catalogs O and H show a high percentage of customers not making a purchase.
• However, only Catalogs A and U were used for more than 100 customers.
• Catalog H was used for more than 50 customers, and the rest for fewer than 50 each.
• The distribution of catalogs was not even.
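
The comparison above needs two numbers per catalog: how many customers it covers and what share of them purchased. A minimal sketch of that summary follows; the function name and the toy rows are hypothetical and are not Tayko data.

```python
from collections import defaultdict


def purchase_rate_by_catalog(rows):
    """Return {catalog: (customer_count, purchase_rate)} from (catalog, purchased) pairs."""
    counts = defaultdict(int)
    buyers = defaultdict(int)
    for catalog, purchased in rows:
        counts[catalog] += 1
        buyers[catalog] += purchased  # purchased is 1 or 0
    return {c: (counts[c], buyers[c] / counts[c]) for c in counts}


# Toy rows for illustration only.
rows = [("A", 1), ("A", 1), ("A", 0), ("H", 0), ("H", 0), ("U", 1)]
summary = purchase_rate_by_catalog(rows)
```

Reading the count next to the rate matters here: a catalog with a high purchase rate but very few customers gives weak evidence.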

Other Nominal Variables
• Of the other categorical variables, “Order Online” is the only one that shows some
power to differentiate between purchasers and non-purchasers.

Ordinal Variables
• The number of purchases last year shows a clear trend.
• Customers who made no purchase last year
did not purchase from the new catalog either.
• Customers who made more than 3 purchases last year
also made a purchase this time.

Scale Variables
• Of the 2 scale variables, only “Last update to customer record” shows a significant
difference in mean between purchasers and non-purchasers.

Target Variables
• Purchasers and non-purchasers are equally distributed.
• However, the sales value (the amount spent by a customer) follows a non-normal
distribution.

Classification
Who will make a purchase?

Logistic Regression – Training
Final set of variables
1. Frequency: number of transactions in the last year at the source
catalog
2. Web Order: customer placed at least 1 order via the web
3. Address is Residence: address is a residence
4. Source_a, h or u: source catalog is A, H or U
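
A logistic regression turns these variables into a purchase probability by passing a weighted sum through the sigmoid function. The deck does not report the fitted coefficients, so every number below is a placeholder chosen only to make the sketch runnable; the feature keys are hypothetical names for the variables listed above.

```python
import math


def purchase_probability(features, coefficients, intercept):
    """Logistic model: sigmoid of the intercept plus the weighted sum of features."""
    z = intercept + sum(coefficients[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


# PLACEHOLDER coefficients for illustration only; these are not fitted values.
coefficients = {
    "frequency": 1.0,
    "web_order": 0.5,
    "address_is_res": -0.5,
    "source_a": 0.5,
    "source_h": -1.0,
    "source_u": 0.5,
}
no_signal = dict.fromkeys(coefficients, 0)  # every feature at zero
frequent = dict(no_signal, frequency=2)     # two transactions last year

p_baseline = purchase_probability(no_signal, coefficients, intercept=0.0)
p_frequent = purchase_probability(frequent, coefficients, intercept=0.0)
```

A customer is classified as a purchaser (1) when the probability exceeds a chosen cut-off; 0.5 is a common default, though the deck does not state which cut-off was used.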


Logistic Regression – Testing & Validation
• Test
  • Overall accuracy: 80%
• Validation

Decision Tree – Training
• The CHAID growing method gave the best results.

Decision Tree – Test & Validate
• Test
• Validation

Result
• Logistic regression gives a better result than the decision tree.

Prediction
How much will a purchaser spend?

New Calculated Variables
• High correlation between “last_update_days_ago” and
“1st_update_days_ago”
• New calculated variable DayDiff, which is the difference between the 2
variables
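
The derived variable can be sketched as a simple subtraction. The deck does not say which variable is subtracted from which; the version below assumes first minus last, which gives the number of days between the first and the most recent update. The function name and the example values are hypothetical.

```python
def day_diff(first_update_days_ago, last_update_days_ago):
    """Days between the first and the most recent update to the customer record.

    Assumption: DayDiff = 1st_update_days_ago - last_update_days_ago.
    """
    return first_update_days_ago - last_update_days_ago


# Made-up example: first update 1,200 days ago, last update 300 days ago.
example_diff = day_diff(1200, 300)  # 900
```

Replacing one of two highly correlated inputs with their difference is a common way to keep the information while reducing collinearity in a linear regression.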

Multiple Linear Regression
• Pre-processing
  • Univariate analysis and transformation of the target variable “Spend”
  • Outlier removal, filtering and transformation

Model & Performance
• 4 models are generated, one for each combination of address type and order channel.
• Case 1: Non-residence address & not a web order (R-sqr: 0.569, Adj R-sqr: 0.566)
Spending = -15.733 + 79.11 * No of transactions last year - 47.825 * Catalog D + 30.632 * Catalog U
• Case 2: Non-residence address & web order (R-sqr: 0.62, Adj R-sqr: 0.616)
Spending = -42.285 + 115.976 * No of transactions last year + 45.506 * Catalog U - 247.655 * Catalog H
+ 55.605 * Catalog R
• Case 3: Residence address & not a web order (R-sqr: 0.516, Adj R-sqr: 0.507)
Spending = -26.965 + 69.218 * No of transactions last year + 66.219 * Catalog U - 113.587 * Catalog H
• Case 4: Residence address & web order (R-sqr: 0.612, Adj R-sqr: 0.592)
Spending = -4.616 + 65.114 * No of transactions last year - 111.934 * Catalog H - 81.28 * Catalog R
- 129.754 * Catalog C + 66.242 * Catalog A
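
The four equations above can be applied as one piecewise function that first selects the case and then evaluates the matching linear model. The coefficients are copied from the equations; the function name, the dictionary layout and the assumption that each customer has exactly one source catalog are illustrative choices, not something the deck states. The deck also does not say whether the output is on the original or the transformed Spend scale.

```python
# (address_is_residence, web_order) ->
#     (intercept, coefficient on transactions last year, catalog coefficients)
MODELS = {
    (False, False): (-15.733, 79.11, {"D": -47.825, "U": 30.632}),
    (False, True): (-42.285, 115.976, {"U": 45.506, "H": -247.655, "R": 55.605}),
    (True, False): (-26.965, 69.218, {"U": 66.219, "H": -113.587}),
    (True, True): (-4.616, 65.114,
                   {"H": -111.934, "R": -81.28, "C": -129.754, "A": 66.242}),
}


def predict_spending(is_residence, web_order, transactions_last_year, catalog):
    """Evaluate the case-specific linear model for one customer.

    Assumes a single source catalog per customer; catalogs that do not
    appear in the selected equation contribute nothing.
    """
    intercept, freq_coef, catalog_coefs = MODELS[(is_residence, web_order)]
    return (intercept
            + freq_coef * transactions_last_year
            + catalog_coefs.get(catalog, 0.0))


# Case 1 customer with 2 transactions last year from Catalog U:
# -15.733 + 79.11 * 2 + 30.632
case1_example = predict_spending(False, False, 2, "U")
```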

MAD & MAPE
• Training: MAD 68.89, MAPE 103%
• Test: MAD 104.53, MAPE 109%
• Validation: MAD 104.03, MAPE 101%
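
For reference, the two error measures can be computed as below. This sketch assumes MAD means mean absolute deviation (the average absolute error) and MAPE means mean absolute percentage error; the deck does not spell out either definition, and the sample values are made up.

```python
def mad(actual, predicted):
    """Mean absolute deviation: average of |actual - predicted|."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    Assumes no actual value is zero, since each error is divided by it.
    """
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


# Made-up values for illustration only.
actual = [100.0, 200.0, 50.0]
predicted = [110.0, 150.0, 100.0]

mad_value = mad(actual, predicted)    # (10 + 50 + 50) / 3
mape_value = mape(actual, predicted)  # 100 * (0.10 + 0.25 + 1.00) / 3
```

A MAPE near 100% means the typical error is about as large as the actual amount spent, which is why small purchases with large misses dominate this measure.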

Regression Tree
• Exhaustive CHAID

MAD & MAPE
• Training: MAD 105.37, MAPE 95%
• Test: MAD 121.54, MAPE 103%
• Validation: MAD 121.31, MAPE 113%

Decision
• Both models are very weak at predicting the amount spent.
• The evaluation indicators show high error, with MAPE at roughly 100% for both models.
• One major reason may be the small number of scale variables and the high correlation
between the scale variables that are available.
• Since most variables are nominal, converting the prediction problem into a
classification problem might produce better results, but that was out of scope for the given
problem.

Conclusion
• The classification of customers into purchasers and non-purchasers shows good
results, and the selected logistic regression model is expected to perform well
in a live setting too.
• However, the prediction models perform weakly, and a high degree of error
is expected if they are used in their current state.

