Predictive Analytics and Market Basket Analysis
BUS5PA – Assignment III
SIDDHANTH CHAURASIYA 19139507
INTRODUCTION
-------------------------------------------------------------------------------------------
The purpose of this report is to document the findings from the data mining analysis conducted on a national veterans' organization's donor dataset, and from a market-basket analysis of transactional data from the Health & Beauty Aids Department and the Stationery Department. The objective of the data mining activity is to achieve a better response and hit rate by targeting only those segments of customers who have been flagged as potential donors by the predictive model. Since the request for donation is also accompanied by a small gift, mailing only potentially interested customers for the upcoming campaign would substantially reduce costs for the organization, and that proportion of saved costs could be utilized for other charitable activities.
This analysis was conducted in SAS Miner and R on a database of customers who had donated in the past 12 to 24 months, with a detailed description of the steps, interpretation of the models and model comparisons (amongst models and across SAS & R) discussed in the report.
The second part of the report explores and suggests products that could be bundled or marketed together to enable the organization to maximize its revenues. This analysis was performed in SAS Miner, and relevant proposals have been advised on product bundles and placements based on the findings.
PART A
-------------------------------------------------------------------------------------------
1. Creating the SAS Miner project and pre-processing variables
After selecting an appropriate directory for the project and setting up the diagram and library, we adjust some of the default settings of the data source. Since there are various numeric variables with fewer than 20 levels in the dataset, we set the Class Levels Count Threshold at 2. This ensures only binary variables are treated as nominal variables, and numeric variables with fewer than 20 distinct values continue to be treated as interval variables.
Similarly, we also have one class variable (DemCluster) with over 20 distinct levels, thus we set the levels count threshold at 100. The roles of the variables have been set as follows:
2. Exploration of variables
Exploring the distribution of the variables can unearth unusual patterns and behaviours in the variables. These anomalies can have a substantial effect on the modelling if not rectified.
We use the Explore window to examine the distribution of Median Income Region. The fetch size is kept at max (20,000 records or the number of records in the variable, whichever is less) to ensure all the observations are considered in the exploration.
Figure 2: Changing the default settings of Explore.
We prepare a histogram for the variable Median Income Region to notice any abnormality in its distribution. The distribution at default settings was as follows:
Figure 1: Pre-processing variables.
Figure 3: Distribution of Median Income Region at 10 bins.
The distribution at default settings didn't look suspicious. However, the bins' range was substantial, which might have concealed any abnormality in the distribution. Hence, we change the number of bins to 200, which creates ranges of $1,000.
Figure 4: Distribution of Median Income Region with 200 bins.
Changing the bin limit sheds light on a crucial anomaly in the variable. We observe a disproportionate skew for customers with a median income of $0.
The reason behind this abnormal spike could be the fact that income is confidential information. Not everyone is open or comfortable disclosing their income to companies or organizations. Thus, people who do not report their income might have been assigned an income of $0 by the organization. Using Median Income with such a skewed distribution could lead to the creation of a flawed model.
Thus, to rectify this anomaly, we should replace the $0 income values with missing or NA using the Replacement node. Another alternative could be to replace the $0 income values with the mean of the variable. However, replacing $0 with the mean would require a prior examination, as without a proper business context (the strata of people who didn't record their income, etc.) this replacement could be inappropriate.
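A minimal sketch of what this replacement does, in Python/pandas rather than SAS Miner, on toy data (not the actual donor file):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the donor dataset; only DemMedIncome matters here.
df = pd.DataFrame({"DemMedIncome": [0, 52000, 0, 67350, 48100]})

# Mirror the Replacement node: treat any value below $1 (i.e. the $0 spikes)
# as unreported income and set it to missing rather than keeping the zero.
df.loc[df["DemMedIncome"] < 1, "DemMedIncome"] = np.nan

print(df["DemMedIncome"].isna().sum())  # the two $0 records become missing
```

Downstream nodes can then treat the missing values explicitly (impute, or let the tree handle them) instead of modelling a spurious spike at $0.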
3. Building predictive models
Since we discovered an abnormal peak at $0 for the Median Income Region variable, we replace the $0 value with missing (SAS Miner uses a full stop to denote a missing value) using the Replacement node. To carry out this replacement, we change the Default Limits Method of Interval Variables to None, as we do not want to replace the other interval variables. Furthermore, Replacement Values is set to Missing, as we want to replace the abnormal value with a missing value.
Since we need to change the value of an interval variable (DemMedIncome), we make the following changes in the Replacement Interval Editor:
By changing the limit method of DemMedIncome to User Specified and its Replacement Lower Limit to 1, we set all values that fall below 1 to missing (since $0 is the only value below $1).
Figure 5: Snapshot of result of Replacement node - $0 has been replaced with '.'
After carrying out this step, we divide the dataset in a 50:50 ratio between the Training set and the Validation set. The training set is utilized to build a set of models while the validation set is used to select the best model. By allocating a good proportion of the dataset to training, we reduce the risk of overfitting.
Figure 5: Properties of Replacement node.
Figure 6: Replacement Interval Editor.
To predict the most likely donors, we create three predictive models:
 Autonomous Decision Tree
The first model we create is a Decision Tree which has been split autonomously by the algorithm. We set the Leaf Size property as 25 and the Split Size as 50 to ensure the prevention of very small leaves. The assessment method, which specifies the method to select the best tree, has been set as Average Square Error. This essentially means the tree that generates the smallest average squared error (difference between predicted and actual outcome) will be selected.
Figure 7: Autonomous Decision Tree.
The optimal decision tree results in the creation of 5 leaves, with each end node indicating a classifying rule that could distinguish likely and non-likely lapsing donors based on the target variable.
The biggest advantage of Decision Trees is that they are very easy to understand and interpret, even for people of non-technical background. Furthermore, the model explains how it works, with each leaf node denoting a classification rule.
A few key rules/leaves of the Autonomous Decision Tree are:
Figure 6: Workspace Diagram for Donors analysis.
 Customers with a gift count of more than 2.5 or missing in the last 36 months and with a last gift amount of less than $7.5 are very likely (64%) to donate [Node 6].
 Customers whose median home value is less than $67,350 and who have a gift count of less than 2.5 in the last 36 months are unlikely (37%) to donate [Node 4].
The remaining 3 leaves can be interpreted in a similar manner.
 Interactive Decision Tree
The second Decision Tree has been created interactively, with splits conducted on the basis of the logworth of the variables and domain knowledge (i.e. which splits/variables would be more relevant). Logworth is a metric that measures the importance of a candidate split: it is the negative base-10 logarithm of the p-value of the split's significance test. Essentially, logworth indicates the variable's ability to create homogeneous or pure subgroups. The maximum number of branches has been kept at 3 to allow a three-way interval split. This will facilitate very specific insights and rules. The interval splits of some of the variables are also changed to enable the creation of more precise rules.
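As a rough illustration (not SAS Miner's exact internals, which apply additional adjustments), logworth for a candidate split on a binary target can be sketched as -log10 of the chi-square p-value of the split's contingency table:

```python
import numpy as np
from scipy.stats import chi2_contingency

def logworth(table):
    """Logworth of a candidate split: -log10(p) of the chi-square test
    on the (split branch x target class) contingency table."""
    _, p, _, _ = chi2_contingency(np.asarray(table))
    return -np.log10(p)

# Hypothetical split of 80 customers into two branches (donors, non-donors).
strong_split = [[30, 10],   # left branch: mostly donors
                [10, 30]]   # right branch: mostly non-donors
weak_split   = [[21, 19],
                [19, 21]]   # branches barely differ from each other

print(logworth(strong_split) > logworth(weak_split))  # the purer split scores higher
```

A higher logworth means a smaller p-value, i.e. stronger evidence that the split separates donors from non-donors.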
Figure 8: Interval splits of GiftCnt36 changed to <2.5 & missing, >=2.5 & < 4 and >=4.5.
The assessment method for this Decision Tree is Misclassification rate. The Interactive Decision Tree is more complex compared to the Autonomous Decision Tree, with the tree achieving optimality at 21 leaves.
Figure 9: Interactive Decision Tree.
Some of the key rules obtained from the Interactive Decision Tree are:
 Customers whose promotion count in the past 12 months is less than 17.5 or missing, having a median house value of less than $67,350, a gift count of less than 2.5 or missing in the past 36 months and an average gift card amount of more than or equal to $11.5 in the last 36 months are unlikely (33%) to donate [Node 54].
 Customers with a median income of less than $54,473 or missing, with a promotion count of card in the last 12 months of less than 5.5, a gift count in the last 36 months of more than or equal to 4.5 and a last gift amount of more than $7.5 or missing are very likely (91%) to donate [Node 67].
The remaining 19 leaves can be interpreted in a similar fashion.
 Regression
The final model we create is a Logistic Regression (as the target variable is binary). The advantage of a Regression model is that it can express or quantify the association between the input variables and the target variable. Furthermore, Regression is an excellent tool for estimation. Since we are using Logistic Regression, the model estimates the log odds of a donor as a weighted sum of the attributes (input variables).
As Regression models do not accommodate missing values (unlike Decision Trees) but rather skip those records, we add an Impute node to impute missing interval values with the variable's mean and missing nominal values with the most common nominal value of the variable. Similarly, Regression models work best with a limited but worthy set of variables, thus we add a Variable Clustering node to group together variables that behave similarly, to reduce redundancy. These clusters are represented by the variables that have the least normalized R-square value (since Variable Selection in Variable Clustering has been set as Best Variables). Maximum Eigenvalue is kept at 1; this specifies the largest permissible value of the second eigenvalue in each cluster, facilitating the creation of better clusters.
Figure 10: Changing the properties of Variable Clustering node.
The selection model opted for in the Regression model is Stepwise selection, while the selection criterion is set as Validation Error. With these settings, the model initiates with no variables (i.e. at the intercept). Then at every step we add a variable, and at the same time variables already in the model are verified to check whether they pass the minimum selection criterion threshold (Validation Error in this case). Variables below the threshold limit are removed, and this process continues until the stay significance level is achieved.
The output of the Regression model suggested Status Category Star All Months, gift count of card in the last 36 months and gift amount average of the last 36 months as the most crucial factors, respectively. These variables have an influence on, or correlation with, the target variable.
The output can be interpreted as: a unit change in GiftCntCard36, with all the other variables remaining constant, will lead to a change in the log odds (as it is Logistic Regression) of donation by 0.1156 units. All the other variables can be expressed in a similar manner, based upon their coefficients.
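To put the coefficient on a more intuitive scale, the log-odds change can be exponentiated into an odds ratio (a quick sketch using the 0.1156 estimate reported above):

```python
import math

beta_gift_cnt_card36 = 0.1156  # GiftCntCard36 coefficient reported by the Regression node

# exp(beta) converts a one-unit log-odds change into a multiplicative change in the odds.
odds_ratio = math.exp(beta_gift_cnt_card36)
print(round(odds_ratio, 3))  # 1.123: each extra card gift multiplies the odds of donating by ~1.12
```

In other words, holding the other inputs constant, one additional card gift in the last 36 months raises the odds of donating by roughly 12%.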
4. Evaluating predictive models
The three created predictive models are compared on the basis of a few prominent machine-learning metrics to evaluate which model outperforms the others. We use the Model Comparison node and Excel to compute the metrics.
 Receiver Operating Characteristic (ROC) curves
The ROC curve demonstrates the relation between the True Positive rate (Sensitivity) and the True Negative rate (Specificity) of a model at various diagnostic test levels. Models whose curves are closest to the top-left end (i.e. near the top of Sensitivity) are more accurate predictors, while models whose curves are close to the baseline can be said to be poor predictors. In other words, the more area under the model's curve, the better the model is.
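The "area under the curve" idea has a direct probabilistic reading: AUC equals the probability that a randomly chosen donor is scored above a randomly chosen non-donor. A small sketch with made-up scores:

```python
def auc(pos_scores, neg_scores):
    """AUC as the fraction of (donor, non-donor) pairs ranked correctly;
    ties count as half. Equivalent to the area under the ROC curve."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos_scores
        for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical predicted probabilities for three donors and three non-donors.
donors     = [0.9, 0.8, 0.4]
non_donors = [0.7, 0.3, 0.2]
print(round(auc(donors, non_donors), 3))  # 0.889
```

An AUC of 0.5 corresponds to the diagonal baseline (random ranking); 1.0 corresponds to a perfect separator.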
Figure 11: Output of Regression model.
The models are evaluated on the basis of their performance on the validation set. The Regression model's curve is the closest to the top-left section of the chart (i.e. towards True Positive). The Autonomous Decision Tree stays marginally behind, while the Interactive Decision Tree is the closest to the baseline, indicating it to be a poor predictor.
 Cumulative Lift
Lift is a measure of a model's effectiveness. It essentially compares the result obtained by the model to the result obtained without a model (i.e. randomly). A higher lift value indicates the model is that many times more effective than random selection.
Figure 12: ROC curves of the Predictive models.
Figure 13: Cumulative Lift of the Predictive models.
Based on the cumulative lift accumulated by the models, we can observe that the Autonomous Decision Tree (1.30) produced the highest lift at the 15th percentile depth. This was followed by a cumulative lift of 1.27 achieved by the Regression model and 1.16 by the Interactive Decision Tree at the same depth.
This can be interpreted as: the top 15% of customers selected by the Autonomous Decision Tree is likely to capture 1.30 times more donors than 15% of customers picked at random.
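The cumulative-lift calculation itself is simple to sketch: rank customers by score, take the top depth fraction, and compare its response rate to the overall rate (the scores and outcomes below are toy values, not the report's data):

```python
def cumulative_lift(scores, labels, depth=0.15):
    """Response rate in the top `depth` fraction of score-ranked customers,
    divided by the overall response rate."""
    ranked = [y for _, y in sorted(zip(scores, labels), reverse=True)]
    k = round(len(ranked) * depth)          # size of the top-scoring group
    top_rate = sum(ranked[:k]) / k
    overall_rate = sum(labels) / len(labels)
    return top_rate / overall_rate

# Hypothetical scored list of 20 customers, 5 of whom actually donated (1).
scores = [0.91, 0.87, 0.80, 0.75, 0.70, 0.66, 0.60, 0.55, 0.52, 0.48,
          0.44, 0.40, 0.35, 0.30, 0.27, 0.22, 0.18, 0.15, 0.10, 0.05]
labels = [1, 0, 1, 0, 1, 0, 0, 1, 0, 0,
          0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
print(round(cumulative_lift(scores, labels), 2))  # 2.67
```

Here the top 15% (3 customers) contains 2 donors, a response rate of 2/3 against an overall rate of 1/4, hence a lift of 2.67.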
 Average Square Error (ASE)
ASE is the error arising from the variation between the outcome predicted by the model and the actual outcome. The lower the ASE, the better the model is, as it produces fewer errors.
From the results of the Model Comparison node, we can observe that Regression and the Autonomous Decision Tree perform at an identical level in terms of the ASE generated, with the latter performing only marginally better. The Interactive Decision Tree produced the most difference between predicted and actual outcomes.
 Misclassification rate
Misclassification rate is the error produced when a model incorrectly classifies a responder as a non-responder or vice versa. In our business context, a model would be misclassifying if it predicts a donor to be a non-donor and vice versa. Naturally, we would prefer a model which makes the least amount of such errors.
The Interactive Decision Tree comes out on top on this metric, producing a misclassification rate of only 0.398 in comparison to 0.436 by Regression and 0.428 by the Autonomous Decision Tree.
 Accuracy
Accuracy is the measure that indicates how accurately a model can predict (both positives and negatives) out of the total predictions that the model makes. It is computed by dividing the true predictions (True Positives and True Negatives) by the total number of predictions (i.e. the number of records/observations/customers).
Model           False Negative  True Negative  False Positive  True Positive
Autonomous DT   1460            1804           617             962
Regression      1111            1467           954             1311
Interactive DT  1153            1406           1015            1269
Figure 16: Confusion matrix obtained from Model Comparison (Validation dataset).
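The accuracy and F1 figures in Figure 17 can be reproduced directly from these validation counts; a short sketch for the Regression model:

```python
def accuracy_and_f1(tp, tn, fp, fn):
    """Accuracy and F1 computed from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)           # of predicted donors, how many donated
    recall = tp / (tp + fn)              # of actual donors, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, f1

# Regression counts from Figure 16 (validation dataset).
acc, f1 = accuracy_and_f1(tp=1311, tn=1467, fp=954, fn=1111)
print(round(acc, 3), round(f1, 3))  # 0.574 0.559, matching Figure 17
```

The same function applied to the other two rows of Figure 16 yields the remaining accuracy and F1 entries of Figure 17.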
Figure 14: ASE produced by the Predictive models.
Figure 15: Misclassification rate produced by the Predictive models.
Based on accuracy, the Regression model could predict the donors and non-donors more accurately compared to the other two models (Figure 17).
 F1
F1 is the harmonic average of Precision (True Positives divided by total predicted positives) and Recall (the proportion of actual positives correctly identified). A score of 1 would mean the model is a perfect predictor, while a score of zero indicates a poor predictor.
Regression outperforms the other two models on F1 score as well, indicating it as the best predictor (Figure 17).
Conclusion:
The performance of the models on the basis of the above machine learning metrics can be summarized by the following table:

Models          Accuracy  F1     ROC    Lift  ASE     Misclassification
Autonomous DT   0.571     0.481  0.591  1.30  0.2432  0.428
Interactive DT  0.552     0.539  0.567  1.16  0.250   0.398
Regression      0.574     0.559  0.595  1.27  0.2437  0.436
Figure 17: Summarizing the performance of models based on various metrics.
Figure 18: Visual comparison between the models.
On this evidence, we can conclude the Regression model is the best model in terms of performance, accuracy, effectiveness and error generation.
5. Scoring and predicting potential donors
After careful and thorough examination, we concluded that Regression is the best model in terms of its predictive capabilities, as it is more accurate and produces fewer errors. Thus, we use the Regression model to score (i.e. apply the predictions to) a new dataset of lapsing donors. The scoring is performed through the Score node.
To explore the results, we create a histogram on a new variable created by the scoring, called Predicted target_B=1. This new variable contains the predicted value assigned to the customers based on their probability of donating.
To visually represent the scoring, we create a histogram of customers predicted by the model as potential donors with their attached probability of donating. The results are as follows:
Figure 19: Exploring Predicted Donors.
To enable better insights, we change the number of bins to 20. The highlighted records in the dataset are customers belonging to the selected bar in the histogram (Figure 19). The values on the X axis represent the likelihood of that customer donating in the next campaign.
The average response rate was found to be 5%. As such, customers with a predicted probability of over 0.05 can be considered as candidates for the campaign. However, to maximize the cost-effectiveness of the campaign, we could derive a probability threshold based on past information about the customer lifetime value they generate. Customers with predicted values beyond that threshold could then be targeted to generate an even better response/hit rate and margins.
A rational approach would be to solicit customers who have been assigned a predicted value of 0.55 or more. Another alternative to achieve a significantly better response would be approaching customers who belong in the top 30th percentile based on their predicted values (Figure 20).
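The threshold idea can be sketched as a simple break-even calculation: mail a customer only when the expected donation exceeds the cost of mailing the request and gift. The cost and donation figures below are purely hypothetical, for illustration:

```python
# Hypothetical campaign economics (not from the report's data):
mail_cost = 0.68      # cost to mail the request plus the small gift, in dollars
avg_donation = 15.0   # average donation received from a responder, in dollars

# Break-even rule: mail when p * avg_donation > mail_cost,
# i.e. when the predicted probability exceeds cost / donation.
threshold = mail_cost / avg_donation
print(round(threshold, 3))  # 0.045 - near the 5% average response rate
```

With better lifetime-value figures per segment, `avg_donation` could vary by customer, giving segment-specific thresholds instead of a single cut-off.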
Figure 20: Snapshot of Customers with highest predictive probability of donating.
PART B: PREDICTIVE MODELLING BASED ON R
-------------------------------------------------------------------------------------------
After creating three predictive models in SAS Miner, we build a Decision Tree in R. As is the procedure before performing any data mining activity, we explore the variables at the first step to investigate their distribution.
Exploring and transforming the data
The summary function, as well as the psych package, is used to provide a detailed descriptive summarization of the variables.
We notice that CustomerID is considered a variable by R, while a few variables had missing values in them. A histogram is created for Median Income, which reveals a disproportionate number of zero values in the variable.
To ensure a clean model, we wrangle and transform the variables:
Figure 21: Exploring the variables.
 CustomerID is rejected as it is an ID and not an input for modelling.
 Target_D is rejected as Target_B contains the collapsed data of Target_D, and inclusion of Target_D will lead to leakage.
 Since R considers variables containing $ values as categorical, we transform GiftAvgLast, GiftAvg36, GiftAvgAll, GiftAvgCard36, DemMedHomeValue and DemMedIncome into numeric variables.
Figure 22: Transforming variables into numeric variables.
 The zero values in DemMedIncome are replaced with NA, as those values were customers who did not reveal their incomes.
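The same clean-up, sketched in Python/pandas rather than R for brevity (toy values, standing in for what gsub/as.numeric do to a column such as DemMedIncome):

```python
import numpy as np
import pandas as pd

# Toy column as R initially reads it: currency-formatted text, not numbers.
raw = pd.Series(["$67,350", "$0", "$52,000"])

# Strip the "$" and thousands separators, convert to numeric,
# then mark the $0 values (unreported income) as missing.
income = raw.str.replace("[$,]", "", regex=True).astype(float)
income = income.replace(0, np.nan)

print(income.tolist())  # [67350.0, nan, 52000.0]
```

After this step the column behaves as a proper interval input for the tree, with unreported incomes handled as missing rather than as a spike at zero.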
Building a Decision Tree
After cleaning the data, we divide the data equally into training and validation sets. The Decision Tree is built on the training set using the rpart package. Target_B is selected as the target variable and all the non-rejected variables are selected as the inputs. Since the target variable is binary, we select Class as the method, while the complexity parameter is set at 0.001. The script for the Decision Tree is shown below, and the model is plotted using rpart.plot:
Figure 23: Decision Tree with all non-rejected variables as inputs and cp of 0.001.
The decision tree that is created is very complex, with a lot of leaves. On plotting its ROC curve we notice a huge discrepancy, as the model curves perfectly towards sensitivity. The model shows signs of overfitting as well as leakage.
Figure 24: ROC curve of the Decision Tree.
To rectify this model, we prune the decision tree based on cross-validation error. The complexity parameter at which the lowest error is produced is selected as the complexity parameter for the new model (0.0072).
Figure 25: Cross-Validation plot.
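The cp selection step amounts to reading the minimum of the cross-validation error column of rpart's cp table. A minimal sketch of that lookup (the table values here are hypothetical, not the report's actual printcp output):

```python
# Hypothetical rpart-style cp table: (complexity parameter, cross-validation error).
cp_table = [
    (0.0200, 0.980),
    (0.0100, 0.955),
    (0.0072, 0.910),  # lowest xerror -> chosen cp
    (0.0010, 0.940),  # smaller cp overfits: xerror rises again
]

# Pick the cp whose cross-validation error is smallest, as read off Figure 25.
best_cp, best_xerror = min(cp_table, key=lambda row: row[1])
print(best_cp)  # 0.0072
```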
The other change in the new model is the selection of variables as inputs. For the new model, we select variables that have relevance and predictive capability based on rational judgement. For example, GiftAvg36 is more relevant to the model than GiftAvgAll, thus only the former is used in the new model.
The new Decision Tree model can be seen in the plot below:
Figure 26: Pruned Decision Tree.
Comparing the model to the models created in SAS
The new model is a lot cleaner and results in 5 definite prediction rules. We plot the ROC curve and lift chart for this model (blue line for Training and red line for Validation). The results are as follows:
On comparing the ROC curve and lift chart of the Decision Tree created in R with the three models built in SAS Miner, we can observe that Regression would still comfortably trump all the other models based on accuracy and effectiveness. The ROC curve of the Decision Tree built in R is close to the baseline, indicating it to not be a very good predictor. Similarly, the lift generated is very close to 1, which isn't an ideal value.
Scoring the data
The purpose of any model is to predict, and to conduct this prediction we import the score dataset. As done for the original data source, we transform the variables with $ values in them into numeric variables. The prediction of the model is then applied to the score dataset using the predict function. A snapshot of the results of the scoring is as follows:
Figure 27: ROC curve (Left) and Lift chart (right) of Pruned Decision Tree.
The first column indicates the customer ID, the second column corresponds to the customer being a non-donor, while the third column corresponds to the customer being a donor. The values in the second and third columns indicate the probability of the customer falling in that classification. This can be interpreted as: customer ID 96362 has a 41.96% predicted probability of donating in the next campaign.
Similarly, the predicted value for all customers has been derived. To achieve maximum profitability, the organization should solicit customers who have been flagged by the model with more than a 60% probability of donating.
PART C: MARKET BASKET ANALYSIS AND ASSOCIATION RULES
-------------------------------------------------------------------------------------------
In this section of the report, we attempt to derive meaningful patterns in the purchasing behaviour of customers with reference to a range of products. The primary objective of this market-basket analysis (MBA) is to discover items that are purchased with high confidence and have high lift. These insights can enable the retail store to expand its revenue numbers and achieve higher profitability.
The MBA was conducted in SAS Miner, on a dataset containing information on over 400,000 transactions accumulated over the past 3 months. The properties of the variables have been set to the settings observed in Figure 28, and the type of data source is changed from Raw to Transactions.
Figure 28: Variable Properties for MBA.
After dragging the data source into the diagram, we attach an Association node to it to conduct the analysis. Export Rule by ID is changed to Yes, as we would like to view the rule description table for the analysis. The remaining settings are kept unchanged.
The results of the Association node unearthed several insights of enormous business value, some of which are explained below:
Out of the 36 rules or combinations of products created, the highest achieved lift was found to be 3.60. Lift essentially measures the degree of association between a combination of products. For example, rule A -> B with lift 3 would be interpreted as: a customer is thrice as likely to buy product B if he has already purchased product A, compared to the likelihood of a random customer just buying product B. Lift is derived by dividing confidence by expected confidence.
The highest lift of 3.60 was achieved by the rule Perfume -> Toothbrush. This indicates that a customer who has purchased a Perfume is 3.6 times more likely to buy a Toothbrush compared to a customer chosen at random. Since lift is symmetrical, the rule Toothbrush -> Perfume would have the same lift of 3.60.
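The lift arithmetic can be sketched from raw basket counts; the counts below are hypothetical, chosen only to reproduce a lift of 3.60 like the Perfume -> Toothbrush rule:

```python
# Hypothetical transaction counts (not the report's actual data).
n_total = 10_000       # total baskets
n_perfume = 500        # baskets containing Perfume
n_toothbrush = 1_000   # baskets containing Toothbrush
n_both = 180           # baskets containing both items

confidence = n_both / n_perfume               # P(Toothbrush | Perfume) = 0.36
expected_confidence = n_toothbrush / n_total  # P(Toothbrush) = 0.10
lift = confidence / expected_confidence       # confidence / expected confidence

print(round(lift, 2))  # 3.6 - and the same value results for Toothbrush -> Perfume
```

The symmetry noted above falls out of the algebra: lift equals P(A and B) / (P(A) * P(B)), which is unchanged when A and B swap sides of the rule.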
Lift is a significant metric for MBA as it denotes the relation between combinations of products. A higher lift (> 1) for a rule indicates the right-hand product is more likely to be bought in complement with the left-hand product rather than being bought just in isolation. This insight can help immensely in product placement in the aisles. In the current context, the rule Magazine & Greeting Cards -> Candy Bar has a lift of 2.68, which denotes that the likelihood of a customer buying a candy bar in combination with a Magazine and Greeting Cards is 2.68 times higher than a customer buying just the candy bar.
Based on association rules, we derived 36 rules, with each rule possessing significant value for implementation at the store. However, based on a few metrics, we would recommend the company to incorporate the following changes to facilitate higher revenue generation:
1. Placement of Products on Aisles
Figure 29: Tabular data of all 36 rules.
Since Perfume -> Toothbrush produces the highest lift but has comparatively lower support (customers purchasing both products), these two products should be placed in close proximity to each other to boost their sales. With these products in close vicinity, the purchase of a Perfume will trigger the purchase of a Toothbrush or vice versa, as indicated by their high lift. Similarly, products with high lift but relatively lower support should be placed close by.
Candy Bars -> Greeting Cards has the highest support (4.37%), indicating these two products are often purchased together. Thus, these two products should be placed at a distance from each other, so that customers have to walk through a range of other products in the process of buying Candy Bars and Greeting Cards. Similarly, products that are often purchased together (high support) should be placed at some distance from each other.
2. Bundling, Cross-selling and Up-selling
Figure 30: Link Graph of Products.
The network between products shows which products are linked with each other. We can observe that Pens and Photo Processing are only purchased in combination with Magazines. As such, Pens and Photo Processing should be sold as a bundle with Magazines to improve their sales numbers.
We can also observe the Magazine is the most popular product in the store, and as such it gives an opportunity to create up-selling and cross-selling situations for other products around magazines. Less popular products can be placed near Magazines to grab more attention, as magazines are very popular. Similarly, a higher-priced alternative product can be placed close to the magazines (for example, premium Greeting Cards, Candy Bars and Toothpaste).
3. Specials
Products with high lift (Perfume -> Toothbrush, Magazine & Candy Bar -> Greeting Cards, etc.) should be on sale at different times. Since the purchase of one product is anyway likely to trigger the purchase of the other product in the rule, it is counter-productive to have a sale on both/all the items in the rule. This can save a large proportion of discounting costs for the company while boosting their sales numbers.
Some rules, like Greeting Cards & Candy Bar -> Magazine, clearly concern hot products during festive periods. Having one of the products at a discounted rate during the festive season will tempt the customer to purchase the other two non-discounted products to complete their festive shopping wish-list.

More Related Content

DOCX
Building & Evaluating Predictive model: Supermarket Business Case
PDF
Automation of IT Ticket Automation using NLP and Deep Learning
DOCX
Luis_Ramon_Report.doc
PPT
Excel Datamining Addin Advanced
PDF
Campaign response modeling
PDF
Supervised learning (2)
PDF
Three case studies deploying cluster analysis
PDF
193_report (1)
Building & Evaluating Predictive model: Supermarket Business Case
Automation of IT Ticket Automation using NLP and Deep Learning
Luis_Ramon_Report.doc
Excel Datamining Addin Advanced
Campaign response modeling
Supervised learning (2)
Three case studies deploying cluster analysis
193_report (1)

What's hot (20)

PDF
Final SAS Day 2015 Poster
PPTX
Cluster analysis in prespective to Marketing Research
PDF
Binary Classification Final
PDF
Machine_Learning_Trushita
PDF
Gradient boosting for regression problems with example basics of regression...
PPT
PPTX
Improve Your Regression with CART and RandomForests
PDF
Rank Computation Model for Distribution Product in Fuzzy Multiple Attribute D...
PDF
Detection of credit card fraud
PDF
Using R for customer segmentation
PPTX
CART – Classification & Regression Trees
PDF
CART: Not only Classification and Regression Trees
DOCX
Krupa rm
DOCX
Boosting conversion rates on ecommerce using deep learning algorithms
PDF
Introduction to Random Forest
PDF
Graphical Analysis of Simulated Financial Data Using R
PDF
Machine Learning Decision Tree Algorithms
PDF
Ordinal logistic regression
PDF
Consumption capability analysis for Micro-blog users based on data mining
PDF
Classification and regression trees (cart)
Final SAS Day 2015 Poster
Cluster analysis in prespective to Marketing Research
Binary Classification Final
Machine_Learning_Trushita
Gradient boosting for regression problems with example basics of regression...
Improve Your Regression with CART and RandomForests
Rank Computation Model for Distribution Product in Fuzzy Multiple Attribute D...
Detection of credit card fraud
Using R for customer segmentation
CART – Classification & Regression Trees
CART: Not only Classification and Regression Trees
Krupa rm
Boosting conversion rates on ecommerce using deep learning algorithms
Introduction to Random Forest
Graphical Analysis of Simulated Financial Data Using R
Machine Learning Decision Tree Algorithms
Ordinal logistic regression
Consumption capability analysis for Micro-blog users based on data mining
Classification and regression trees (cart)
Ad

Similar to Predictive Modelling & Market-Basket Analysis. (20)

DOCX
Classification modelling review
PDF
Data Mining Apriori Algorithm Implementation using R
DOCX
Data Analytics Using R - Report
PPTX
Predictive analytics BA4206 Anna University Business Analytics
PDF
Predictive analytics-white-paper
PDF
Data Analysis - Making Big Data Work
PDF
Mevsys Data Mining: Knowledge Discovery.
PDF
bda-unit-5-bda-notes material big da.pdf
PDF
IRJET- Ad-Click Prediction using Prediction Algorithm: Machine Learning Approach
PDF
Bank Customer Segmentation & Insurance Claim Prediction
PDF
Predictive Analytics Modeling
PDF
Machine learning meetup
PPTX
DataAnalyticsIntroduction and its ci.pptx
PDF
JEDM_RR_JF_Final
PDF
Similarly, we also have one class variable (DemCluster) with over 20 distinct levels; hence we set the class levels count threshold to 100. The roles of the variables have been set as follows:
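SAS Miner applies this thresholding rule automatically when the data source is created; as a rough illustrative sketch of the rule (the function and values below are invented, not SAS Miner's implementation):

```python
def assign_role(values, class_levels_threshold=2):
    """Classify a numeric variable as 'nominal' or 'interval' by its
    number of distinct levels, mimicking the class levels count threshold."""
    distinct = len(set(values))
    return "nominal" if distinct <= class_levels_threshold else "interval"

# With the threshold at 2, only binary flags become nominal;
# numeric variables with more distinct values stay interval.
print(assign_role([0, 1, 0, 1]))
print(assign_role([3, 7, 12, 18, 19]))
```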
2. Exploration of variables

Exploring the distribution of the variables can unearth unusual patterns and behaviours in the variables. These anomalies can have a substantial effect on the modelling if not rectified.

We use the Explore window to examine the distribution of Median Income Region. The fetch size is kept at Max (20,000 records, or the number of records in the variable, whichever is less) to ensure all the observations are considered in the exploration.

Figure 1: Pre-processing variables.
Figure 2: Changing the default settings of Explore.

We prepare a histogram for the variable Median Income Region to check for any abnormality in its distribution. The distribution at the default settings was as follows:
Figure 3: Distribution of Median Income Region at 10 bins.

The distribution at the default settings did not look suspicious. However, each bin's range was substantial, which might have concealed an abnormality in the distribution. Hence, we change the number of bins to 200, which creates ranges of $1,000.

Figure 4: Distribution of Median Income Region with 200 bins.

Changing the bin limit sheds light on a crucial anomaly in the variable: we observe a disproportionate spike for customers with a median income of $0. The reason behind this abnormal spike could be the fact that income is confidential information; not everyone is open to or comfortable with disclosing their income to companies or organizations. Thus, people who did not report their income might have been assigned an income of $0 by the organization. Using Median Income with such a skewed distribution could lead to the creation of a flawed model.
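The effect of the bin width can be illustrated with synthetic data (not the donor dataset): with $10,000-wide bins the $0 records are pooled with genuine low incomes, while narrow bins isolate them as a spike.

```python
import random

random.seed(0)
# Toy incomes: a smooth distribution plus a block of unreported incomes coded as 0.
incomes = [random.gauss(50_000, 15_000) for _ in range(950)] + [0] * 50

def histogram(values, bins, lo=0, hi=100_000):
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        if lo <= v < hi:
            counts[int((v - lo) // width)] += 1
    return counts

coarse = histogram(incomes, bins=10)   # $10,000-wide bins: $0 hides among low incomes
fine = histogram(incomes, bins=200)    # $500-wide bins: the $0 spike stands alone
print(coarse[0], fine[0], fine[1])
```

In the fine histogram the first bin holds essentially only the coded zeros, while its neighbours are almost empty, which is exactly the kind of spike the 200-bin view revealed.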
Thus, to rectify this anomaly, we should replace the $0 income values with missing (NA) using the Replacement node. An alternative would be to replace the $0 income values with the mean of the variable. However, replacing $0 with the mean would require prior examination, as without a proper business context (the strata of people who did not record their income, etc.) this replacement could be inappropriate.

3. Building predictive models

Since we discovered an abnormal peak at $0 for the Median Income Region variable, we replace the $0 values with missing (SAS Miner uses a full stop to denote a missing value) using the Replacement node. To carry out this replacement, we change the default limit method for interval variables to None, as we do not want to replace the other interval variables. Furthermore, the Replacement Values property is set to Missing, as we want to replace the abnormal value with a missing value.

Since we need to change the values of an interval variable (DemMedIncome), we make the following changes in the Replacement Interval Editor: by changing the limit method of DemMedIncome to User Specified and its Replacement Lower Limit to 1, we set all of its values that fall below 1 to missing (since $0 is the only value below $1).

Figure 5: Snapshot of the result of the Replacement node ($0 has been replaced with '.').
Figure 5: Properties of the Replacement node.
Figure 6: Replacement Interval Editor.

After carrying out this step, we divide the dataset in a 50:50 ratio between a training set and a validation set. The training set is utilized to build a set of models, while the validation set is used to
select the best model. By allocating a good proportion of the dataset to training, we reduce the risk of overfitting.

Figure 6: Workspace diagram for the Donors analysis.

To predict the most likely donors, we create three predictive models:

 Autonomous Decision Tree

The first model we create is a Decision Tree that has been split autonomously by the algorithm. We set the leaf size to 25 and the split size property to 50 to prevent very small leaves. The assessment method, which specifies how the best tree is selected, has been set to Average Square Error. This essentially means the tree that generates the smallest average squared error (the difference between predicted and actual outcomes) will be selected.

Figure 7: Autonomous Decision Tree.

The optimal decision tree results in the creation of 5 leaves, with each end node indicating a classification rule that can distinguish likely and non-likely lapsing donors based on the target variable. The biggest advantage of Decision Trees is that they are very easy to understand and interpret, even for people from a non-technical background. Furthermore, the model explains how it works, with each leaf node denoting a classification rule. A few key rules/leaves of the Autonomous Decision Tree are:
 Customers with a gift count of more than 2.5 (or missing) in the last 36 months and a last gift amount of less than $7.5 are very likely (64%) to donate [Node 6].
 Customers whose median home value is less than $67,350 and who have a gift count of less than 2.5 in the last 36 months are unlikely (37%) to donate [Node 4].

The remaining 3 leaves can be interpreted in a similar manner.

 Interactive Decision Tree

The second Decision Tree has been created interactively, with splits conducted on the basis of the logworth of the variables and domain knowledge (i.e., which splits/variables would be more relevant). Logworth is a metric that calculates the importance of a variable on the basis of information gain theory. Essentially, logworth indicates a variable's ability to create homogeneous, or pure, subgroups.

The maximum number of branches has been kept at 3 to allow a three-way interval split. This will facilitate very specific insights and rules. The interval splits of some of the variables are also changed to enable the creation of more precise rules.

Figure 8: Interval splits of GiftCnt36 changed to <2.5 & missing, >=2.5 & <4 and >=4.5.

The assessment method for this Decision Tree is the misclassification rate. The Interactive Decision Tree is more complex than the Autonomous Decision Tree, with the tree achieving optimality at 21 leaves.

Figure 9: Interactive Decision Tree.

Some of the key rules obtained from the Interactive Decision Tree are:
 Customers whose promotion count in the past 12 months is less than 17.5 or missing, who have a median house value of less than $67,350, a gift count of less than 2.5 or missing in the past 36 months, and an average card gift amount of more than or equal to $11.5 in the last 36 months are unlikely (33%) to donate [Node 54].
 Customers with a median income of less than $54,473 or missing, a card promotion count in the last 12 months of less than 5.5, a gift count in the last 36 months of more than or equal to 4.5, and a last gift amount of more than $7.5 or missing are very likely (91%) to donate [Node 67].

The remaining 19 leaves can be interpreted in a similar fashion.

 Regression

The final model we create is a Logistic Regression (as the target variable is binary). The advantage of a Regression model is that it can express or quantify the association between the input variables and the target variable. Furthermore, Regression is an excellent tool for estimation. Since we are using Logistic Regression, the model estimates the odds (probability) of a donor as a weighted sum of the attributes (input variables).

As Regression models do not accommodate missing values (unlike Decision Trees) but rather skip those records, we add an Impute node to impute missing interval values with the variable's mean and missing nominal values with the variable's most common nominal value. Similarly, Regression models work best with a limited but worthy set of variables, so we add a Variable Clustering node to group together variables that behave similarly, reducing redundancy. Each cluster is represented by the variable that has the lowest normalized R-square value (since Variable Selection in Variable Clustering has been set to Best Variables). Maximum Eigenvalue is kept at 1; this specifies the largest permissible value of the second eigenvalue in each cluster, facilitating the creation of better clusters.

Figure 10: Changing the properties of the Variable Clustering node.
The selection method opted for in the Regression model is Stepwise selection, while the selection criterion is set to Validation Error. With these settings, the model initiates with no variables (i.e., at the intercept). Then, at every step, a variable is added, and at the same time the variables already in the model are verified to check whether they still pass the selection criterion threshold (Validation Error in this case). Variables below the threshold are removed, and this process continues until the stay significance level is achieved.
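The stepwise procedure can be sketched schematically. In the sketch below the criterion function is a toy stand-in for validation error (the variable names echo the dataset, but the scoring is invented for illustration):

```python
def stepwise_select(candidates, validation_error, tol=1e-9):
    """Greedy stepwise selection: start from the empty model, add candidates
    that reduce validation error, then drop any selected variable whose
    removal also reduces it; stop when no move improves the criterion."""
    selected = []
    best = validation_error(selected)
    improved = True
    while improved:
        improved = False
        # Forward step: try adding each remaining candidate.
        for var in [c for c in candidates if c not in selected]:
            err = validation_error(selected + [var])
            if err < best - tol:
                selected, best, improved = selected + [var], err, True
        # Backward check: variables already in the model must still earn their place.
        for var in list(selected):
            reduced = [v for v in selected if v != var]
            err = validation_error(reduced)
            if err < best - tol:
                selected, best, improved = reduced, err, True
    return selected, best

# Toy criterion (hypothetical): error falls only when truly useful variables enter.
useful = {"GiftCntCard36", "GiftAvg36", "StatusCatStarAll"}
def toy_error(subset):
    return 1.0 - 0.2 * len(useful & set(subset)) + 0.05 * len(set(subset) - useful)

vars_, err = stepwise_select(
    ["GiftCntCard36", "GiftAvg36", "StatusCatStarAll", "DemAge", "PromCnt12"],
    toy_error)
print(sorted(vars_), round(err, 2))
```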
Figure 11: Output of the Regression model.

The output of the Regression model suggested Status Category Star All Months, gift count of card in the last 36 months, and gift amount average over 36 months as the most crucial factors, respectively. These variables have an influence on, or correlation with, the target variable. The output can be interpreted as follows: a unit change in GiftCntCard36, with all the other variables remaining constant, will lead to a change in the log-odds of donation (as it is Logistic Regression) of 0.1156 units. All the other variables can be interpreted in a similar manner, based upon their coefficients.

4. Evaluating predictive models

The three created predictive models are compared on the basis of a few prominent machine-learning metrics to evaluate which model outperforms the others. We use the Model Comparison node and Excel to compute the metrics.

 Receiver Operating Characteristic (ROC) curves

The ROC curve demonstrates the relation between the True Positive rate (Sensitivity) and the True Negative rate (Specificity) of a model at various diagnostic test levels. Models whose curves are closest to the top-left corner (i.e., near the top of Sensitivity) are more accurate predictors, while models whose curves are close to the baseline can be said to be poor predictors. In other words, the more area under a model's curve, the better the model is.
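The area under a ROC curve can be computed directly from predicted scores via the rank (Mann-Whitney) formulation; a small sketch with made-up scores and donor flags:

```python
def auc(scores, labels):
    """AUC = probability that a randomly chosen positive outranks a randomly
    chosen negative (ties count half), the Mann-Whitney formulation."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Made-up predicted probabilities and true donor flags (1 = donor).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1,   1,   0,   1,   0,    0,   1,   0]
print(auc(scores, labels))  # closer to 1.0 means a better ranking of donors
```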
The models are evaluated on the basis of their performance on the validation set. The Regression model's curve is the closest to the top-left section of the chart (i.e., towards True Positive). The Autonomous Decision Tree stays marginally behind, while the Interactive Decision Tree is the closest to the baseline, indicating it to be a poor predictor.

Figure 12: ROC curves of the predictive models.

 Cumulative Lift

Lift is a measure of a model's effectiveness. It essentially compares the result obtained by the model to the result obtained without a model (i.e., randomly). A higher lift value indicates the model is that many times more effective than random selection.

Figure 13: Cumulative Lift of the predictive models.
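Cumulative lift at a given depth can be sketched as the response rate among the top-scored fraction of customers divided by the overall response rate (the scores and flags below are toy numbers, not the report's):

```python
def cumulative_lift(scores, labels, depth=0.15):
    """Response rate in the top `depth` fraction (ranked by score)
    divided by the overall response rate."""
    ranked = sorted(zip(scores, labels), reverse=True)
    k = max(1, int(len(ranked) * depth))
    top_rate = sum(y for _, y in ranked[:k]) / k
    overall_rate = sum(labels) / len(labels)
    return top_rate / overall_rate

# Toy scores/flags: the model concentrates donors near the top of the ranking.
scores = [i / 20 for i in range(20, 0, -1)]
labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0]
print(round(cumulative_lift(scores, labels, depth=0.15), 2))
```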
Based on the cumulative lift accumulated by the models, we can observe that the Autonomous Decision Tree (1.30) produced the highest lift at the 15th depth percentile. This was followed by a cumulative lift of 1.27 achieved by the Regression model and 1.16 by the Interactive Decision Tree, respectively, at the same depth. This can be interpreted as: the top 15% of customers selected by the Autonomous Decision Tree is likely to capture 1.30 times more donors than 15% of customers picked at random.

 Average Square Error (ASE)

ASE is the error arising from the variation between the outcome predicted by the model and the actual outcome. The lower the ASE, the better the model, as it produces fewer errors. From the results of the Model Comparison node, we can observe that Regression and the Autonomous Decision Tree perform at an almost identical level in terms of the ASE generated, with the latter performing only marginally better. The Interactive Decision Tree produced the largest difference between predicted and actual outcomes.

 Misclassification rate

The misclassification rate is the error produced when a model incorrectly classifies a responder as a non-responder or vice versa. In our business context, a model would be misclassifying if it predicts a donor to be a non-donor and vice versa. Naturally, we would prefer a model which makes the least amount of such errors. The Interactive Decision Tree comes out on top on this metric, producing a misclassification rate of only 0.398, in comparison to 0.436 by Regression and 0.428 by the Autonomous Decision Tree.

 Accuracy

Accuracy is the measure that indicates how accurately a model can predict (both positives and negatives) out of the total predictions that the model makes. It is computed by dividing the true predictions (True Positives and True Negatives) by the total number of predictions (i.e., the number of records/observations/customers).

Model            False Negative   True Negative   False Positive   True Positive
Autonomous DT    1460             1804            617              962
Regression       1111             1467            954              1311
Interactive DT   1153             1406            1015             1269

Figure 16: Confusion matrix obtained from Model Comparison (validation dataset).
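Accuracy and F1 follow directly from these confusion-matrix counts; a quick Python check reproduces the figures reported for the three models:

```python
def accuracy_and_f1(tp, tn, fp, fn):
    """Accuracy = correct predictions over all predictions;
    F1 = harmonic mean of precision and recall."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return round(accuracy, 3), round(f1, 3)

# Counts from the Model Comparison confusion matrix (validation set).
print(accuracy_and_f1(tp=962, tn=1804, fp=617, fn=1460))    # Autonomous DT
print(accuracy_and_f1(tp=1311, tn=1467, fp=954, fn=1111))   # Regression
print(accuracy_and_f1(tp=1269, tn=1406, fp=1015, fn=1153))  # Interactive DT
```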
Figure 14: ASE produced by the predictive models.
Figure 15: Misclassification rate produced by the predictive models.
Based on accuracy, the Regression model can predict the donors and non-donors more accurately compared to the other two models (Figure 17).

 F1

F1 is the harmonic average of Precision (true positives divided by total predicted positives) and Recall (the proportion of actual positives correctly identified). A score of 1 would mean the model is a perfect predictor, while a score of zero indicates a poor predictor. Regression outperforms the other two models on F1 score as well, indicating it is the best predictor (Figure 17).

Conclusion: The performance of the models on the basis of the above machine-learning metrics can be summarized in the following table:

Model            Accuracy   F1      ROC     Lift   ASE      Misclassification
Autonomous DT    0.571      0.481   0.591   1.30   0.2432   0.428
Interactive DT   0.552      0.539   0.567   1.16   0.250    0.398
Regression       0.574      0.559   0.595   1.27   0.2437   0.436

Figure 17: Summarizing the performance of models based on various metrics.
Figure 18: Visual comparison between the models.

On this evidence, we can conclude that the Regression model is the best model in terms of performance, accuracy, effectiveness and error generation.

5. Scoring and Predicting Potential Donors

After careful and thorough examination, we concluded that Regression is the best model in terms of its predictive capabilities, as it is more accurate and produces fewer errors. Thus, we use the Regression model to score (i.e., apply its predictions to) a new dataset of lapsing donors. The scoring is performed through the Score node.
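Scoring with a logistic model amounts to applying the fitted equation to each new record. A minimal sketch with hypothetical coefficients (only GiftCntCard36's 0.1156 comes from the report; the intercept and other values are invented for illustration):

```python
from math import exp

def score(record, intercept, coefficients):
    """Predicted probability of donating: the logistic function applied to
    the intercept plus the weighted sum of the inputs."""
    log_odds = intercept + sum(coefficients[name] * value
                               for name, value in record.items())
    return 1 / (1 + exp(-log_odds))

# Hypothetical coefficients; 0.1156 is the coefficient quoted in the report.
coefs = {"GiftCntCard36": 0.1156, "GiftAvg36": -0.02}
p = score({"GiftCntCard36": 3, "GiftAvg36": 12.0},
          intercept=-0.4, coefficients=coefs)
print(round(p, 3))
```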
To explore the results, we create a histogram on a new variable created by the scoring, called Predicted Target_B=1. This new variable contains the predicted value assigned to each customer based on their probability of donating. To visually represent the scoring, we create a histogram of the customers predicted by the model as potential donors, with their attached probability of donating. The results are as follows:

Figure 19: Exploring predicted donors.

To enable better insights, we change the number of bins to 20. The highlighted records in the dataset are customers belonging to the selected bar in the histogram (Figure 19). The values on the X-axis represent the likelihood of a customer donating in the next campaign.

The average response rate was found to be 5%. As such, customers with a predicted probability of over 0.05 can be considered as candidates for the campaign. However, to maximize the cost-effectiveness of the campaign, we could derive a probability threshold based on past information about the customer lifetime value customers generate. Customers with predicted values beyond that threshold could then be targeted to generate an even better response/hit rate and margins. A rational approach would be to solicit customers who have been assigned a predicted value of 0.55 or more. Another alternative to achieve a significantly better response would be approaching customers who belong to the top 30th percentile based on their predicted values (Figure 20).
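A sketch of the top-percentile targeting rule (the customer IDs and probabilities below are made up):

```python
# Hypothetical scored output: (customer ID, predicted probability of donating).
scored = [("C01", 0.72), ("C02", 0.12), ("C03", 0.55), ("C04", 0.31),
          ("C05", 0.64), ("C06", 0.08), ("C07", 0.47), ("C08", 0.59),
          ("C09", 0.22), ("C10", 0.41)]

def top_percentile(records, fraction=0.30):
    """Keep the top `fraction` of customers ranked by predicted probability."""
    ranked = sorted(records, key=lambda r: r[1], reverse=True)
    cutoff = max(1, int(len(ranked) * fraction))
    return ranked[:cutoff]

targets = top_percentile(scored)
print([cid for cid, _ in targets])
```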
Figure 20: Snapshot of customers with the highest predicted probability of donating.

PART B: PREDICTIVE MODELLING BASED ON R
-------------------------------------------------------------------------------------------

After creating three predictive models in SAS Miner, we built a Decision Tree in R. As is the procedure before performing any data mining activity, we explore the variables as the first step to investigate their distribution.

Exploring and transforming the data

The summary function, as well as the psych package, is used to provide a detailed descriptive summarization of the variables. We notice that CustomerID is considered a variable by R, while a few variables had missing values in them. A histogram is created for Median Income, which reveals a disproportionate number of zero values in the variable.

Figure 21: Exploring the variables.

To ensure a clean model, we wrangle and transform the variables:
 CustomerID is rejected, as it is an ID and not an input for modelling.
 Target_D is rejected, as Target_B contains the collapsed data of Target_D, and the inclusion of Target_D would lead to leakage.
 Since R considers variables containing $ values as categorical, we transform GiftAvgLast, GiftAvg36, GiftAvgAll, GiftAvgCard36, DemMedHomeValue and DemMedIncome into numeric variables.
 The zero values in DemMedIncome are replaced with NA, as those values belong to customers who did not reveal their incomes.

Figure 22: Transforming variables into numeric variables.

Building a Decision Tree

After cleaning the data, we divide the data equally into training and validation sets. The Decision Tree is built on the training set using the rpart package. Target_B is selected as the target variable, and all the non-rejected variables are selected as the inputs. Since the target variable is binary, we select Class as the method, while the complexity parameter is set at 0.001. The script for the Decision Tree is shown below, and the model is plotted using rpart.plot:

Figure 23: Decision Tree with all non-rejected variables as inputs and cp of 0.001.
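The $-to-numeric transformation and the zero-to-missing rule above can be sketched in Python for illustration (the report performs these steps in R; the sample values are hypothetical):

```python
def to_numeric(value):
    """Convert a currency string such as '$67,350' to a number."""
    return float(value.replace("$", "").replace(",", ""))

def clean_income(value):
    """Currency string to number, with unreported incomes ($0) set to missing."""
    number = to_numeric(value)
    return None if number == 0 else number

print(to_numeric("$67,350"))
print(clean_income("$0"))
print(clean_income("$54,473"))
```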
The decision tree that is created is very complex, with a lot of leaves. On plotting its ROC curve, we notice a huge discrepancy, as the curve bends perfectly towards sensitivity. The model shows signs of overfitting as well as leakage.

Figure 24: ROC curve of the Decision Tree.

To rectify this model, we prune the decision tree based on the cross-validation error. The complexity parameter at which the lowest error is produced is selected as the complexity parameter for the new model (0.0072).

Figure 25: Cross-validation plot.

The other change in the new model is the selection of variables as inputs. For the new model, we select variables that have relevance and predictive capability based on rational judgement. For example, GiftAvg36 is more relevant to the model than GiftAvgAll, so only the former is used in the new model. The new Decision Tree can be seen in the plot below:
Figure 26: Pruned Decision Tree.

Comparing the model to the models created in SAS

The new model is a lot cleaner and results in 5 definite prediction rules. We plot the ROC curve and lift chart for this model (blue line for Training, red line for Validation).

Figure 27: ROC curve (left) and lift chart (right) of the Pruned Decision Tree.

On comparing the ROC curve and lift chart of the Decision Tree created in R with the three models built in SAS Miner, we can observe that the Regression model would still comfortably trump all the other models based on accuracy and effectiveness. The ROC curve of the Decision Tree built in R is close to the baseline, indicating it is not a very good predictor. Similarly, the lift generated is very close to 1, which is not an ideal value.

Scoring the data

The purpose of any model is to predict, and to conduct this prediction we import the score dataset. As done for the original data source, we transform the variables with $ values in them into numeric variables. The prediction of the model is then applied to the score dataset using the predict function. A snapshot of the results of the scoring is as follows:
The first column indicates the customer ID, the second column indicates the probability of the customer being a non-donor, while the third column indicates the probability of the customer being a donor. This can be interpreted as: customer ID 96362 has a 41.96% predicted probability of donating in the next campaign. Similarly, the predicted values for all customers have been derived. To achieve maximum profitability, the organization should solicit customers who have been flagged by the model with more than a 60% probability of donating.

PART C: MARKET BASKET ANALYSIS AND ASSOCIATION RULES
-------------------------------------------------------------------------------------------

In this section of the report, we attempt to derive meaningful patterns in the purchasing behaviour of customers with reference to a range of products. The primary objective of this market-basket analysis (MBA) is to discover items that are purchased together with high confidence and have high lift. These insights can enable the retail store to expand its revenue and achieve higher profitability. The MBA was conducted in SAS Miner, on a dataset containing information on over 400,000 transactions accumulated over the past 3 months.

The properties of the variables have been set to the settings observed in Figure 28, and the type of the data source is changed from Raw to Transactions.

Figure 28: Variable properties for MBA.

After dragging the data source into the diagram, we attach an Association node to it to conduct the analysis. Export Rule by ID is changed to Yes, as we would like to view the rule description table for the analysis. The remaining settings are kept unchanged.
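The metrics the Association node reports can be reproduced on a toy basket list (the items and baskets below are made up, not the department's actual transactions):

```python
def support(transactions, items):
    """Fraction of baskets containing every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in transactions) / len(transactions)

def confidence(transactions, lhs, rhs):
    """P(rhs in basket | lhs in basket)."""
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)

def lift(transactions, lhs, rhs):
    """Confidence of the rule divided by the expected confidence,
    i.e. the baseline support of the right-hand side."""
    return confidence(transactions, lhs, rhs) / support(transactions, rhs)

baskets = [
    {"perfume", "toothbrush"}, {"perfume", "toothbrush", "magazine"},
    {"magazine"}, {"magazine", "candy bar"}, {"toothbrush"},
    {"magazine", "greeting card", "candy bar"}, {"perfume", "toothbrush"},
    {"candy bar"}, {"magazine", "pens"}, {"greeting card"},
]

print(round(lift(baskets, {"perfume"}, {"toothbrush"}), 2))
```

In this toy data every perfume buyer also bought a toothbrush (confidence 1.0) while only 40% of all baskets contain a toothbrush, so the rule's lift is 1.0 / 0.4 = 2.5.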
The results of the Association node unearthed several insights of enormous business value, some of which are explained below.

Figure 29: Tabular data of all 36 rules.

Out of the 36 rules, or combinations of products, created, the highest lift achieved was found to be 3.60. Lift essentially measures the degree of association between a combination of products. For example, rule A -> B with a lift of 3 would be interpreted as: a customer is three times as likely to buy product B if he has already purchased product A, compared to the likelihood of a random customer just buying product B. Lift is derived by dividing confidence by expected confidence.

The highest lift of 3.60 was achieved by the rule Perfume -> Toothbrush. This indicates that a customer who has purchased a Perfume is 3.6 times more likely to buy a Toothbrush compared to a customer chosen at random. Since lift is symmetrical, the rule Toothbrush -> Perfume has the same lift of 3.60.

Lift is a significant metric for MBA, as it denotes the relation between a combination of products. A high lift (> 1) indicates the right-hand product is more likely to be bought in complement with the left-hand product rather than being bought in isolation. This insight can help immensely with product placement in the aisles. In the current context, the rule Magazine & Greeting Cards -> Candy Bar has a lift of 2.68, which denotes that the likelihood of a customer buying a candy bar in combination with a Magazine and Greeting Cards is 2.68 times higher than a customer buying just the candy bar.

Based on the association analysis, we derived 36 rules, with each rule possessing significant value for implementation at the store. However, based on a few metrics, we would recommend the company incorporate the following changes to facilitate higher revenue generation:

1. Placement of products on aisles

Since Perfume -> Toothbrush produces the highest lift but has comparatively lower support (customers purchasing both products), these two products should be placed in close proximity to each other to boost their sales. With these products in close vicinity,
the purchase of a Perfume will trigger the purchase of a Toothbrush, or vice versa, as indicated by their high lift. Similarly, products with high lift but relatively lower support should be placed close by.

Candy Bars -> Greeting Cards has the highest support (4.37%), indicating these two products are often purchased together. Thus, these two products should be placed at a distance from each other, so that customers have to walk through a range of other products in the process of buying Candy Bars and Greeting Cards. Similarly, products that are often purchased together (high support) should be placed at some distance from each other.

2. Bundling, cross-selling and up-selling

Figure 30: Link graph of products.

The network between products shows which products are linked with each other. We can observe that Pens and Photo Processing are only purchased in combination with Magazines. As such, Pens and Photo Processing should be sold as a bundle with Magazines to improve their sales numbers.

We can also observe that Magazines are the most popular product in the store, which gives an opportunity to create up-selling and cross-selling situations for other products around Magazines. Less popular products can be placed near Magazines to grab more attention, as Magazines are very popular. Similarly, a higher-priced alternative product can be placed close to the Magazines (for example, premium Greeting Cards, Candy Bars and Toothpaste).

3. Specials

Products in high-lift rules (Perfume -> Toothbrush, Magazine & Candy Bar -> Greeting Cards, etc.) should be put on sale at different times. Since the purchase of one product is likely to trigger the purchase of the other product in the rule anyway, it is counter-productive to have a sale on
both/all the items in the rule. This can save a large proportion of discounting costs for the company while boosting its sales numbers.

Rules like Greeting Cards & Candy Bar -> Magazine clearly involve hot products during festive periods. Having one of the products at a discounted rate during the festive season will tempt the customer to purchase the other two non-discounted products to complete their festive shopping wish-list.