Assignment 2: Cluster Analysis and Predictive Modelling
BUS5PA - 19139507
SIDDHANTH CHAURASIYA 19139507
PART A - SEGMENTATION BASED EXPLORATION OF CUSTOMERS
---------------------------------------------------------------------------------------------------------------------------
This section of the report contains the exploration and findings from the segmentation and
clustering analysis conducted on the CHURN_TELECOM dataset using SAS Miner. Three types of
segmentation were carried out on the basis of distinct determinants: Demographics, Customer
Status and Customer Usage.
Demographics based Profiling:
After creating the project, library and diagram, we add the data source and set the roles of all the
variables as Input, except for Churn Flag (Target), the customer and subscription identifiers (ID) and
Subscriber name (Text).
We drag the data source into the diagram and connect it with the Cluster and Segment Profile nodes.
Age, Gender and Customer Value are selected as the variables for the Cluster as well as the Segment
Profile node for this profiling activity. Customer Value contains 68% missing values, so imputing
those missing values with a synthetic value (mean, median, max, etc.) would create a heavily
skewed distribution, which isn’t desirable. Hence, Customer Value isn’t imputed.
Figure 1: Process flow for Demographics based Profiling.
Since the measurement scales of the variables selected as inputs for demographic profiling are
different, we set ‘Internal Standardization’ to ‘Standardization’ in the properties panel of the
Cluster node. All other properties of the Cluster and Segment Profile nodes are kept at their
defaults.
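The reason for standardizing before clustering can be illustrated with a minimal Python sketch. It is not the computation performed inside the Cluster node, whose exact formula is not stated in this report. It assumes a z-score form of standardization, and the function name `standardize` and the toy values are hypothetical.

```python
from statistics import mean, pstdev


def standardize(values):
    """Return z-scores, (x - mean) / standard deviation, for a list of numbers."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant variable carries no information for clustering.
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]


# Hypothetical toy values, not taken from the CHURN_TELECOM dataset.
ages = [22, 35, 48, 61]
customer_values = [1200.0, 300.0, 950.0, 150.0]

z_ages = standardize(ages)
z_values = standardize(customer_values)
```

After this step both variables have mean 0 and standard deviation 1, so a distance-based clustering is no longer dominated by whichever variable happens to have the larger raw scale.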
Figure 2: Cluster and Segment Profile results for Demographical segmentation.
We found a good combination of clusters, with a fair number of observations in each segment (Figure
2), after setting the number of clusters to 4. The four segments could be broadly classified as:
Cluster 1 – Valuable Young Adults.
This segment can be described as a group of males who are just about to start their professional
careers and generate high customer value for the organization. Since this cluster shows a tendency
towards high customer value, the company should ensure retention of this segment.
Cluster 2 – Distressed Damsel.
This cluster can be best described as a segment of young females who account for a relatively
lower Customer Value. The lower customer value may be an indicator that these customers aren’t
satisfied with the services offered and may churn in the future. The company should devise plans,
offers and discounts to reduce the chances of churn in this cluster.
Figure 3: Results from Segment Profile node.
Cluster 3 – Stingy Seniors.
This group is characterized by senior males who generate low value for the telecom company. As
such, customers belonging to this segment may need special attention: their low customer value
generation indicates a high likelihood of churning.
Cluster 4 – Bankable Ladies.
This cluster is characterized by older women who produce high value for the company. The company
should look to maximize the value derived from this segment.
Figure 4: Variable significance for each cluster.
As observed from Figure 4, Gender was the most influential variable for the classification of
Distressed Damsel, Stingy Seniors and Bankable Ladies, while Age had the most significance for
Valuable Young Adults.
Note: The variable Customer Value was only collected for customers who were identified as having a
high probability of churning. It wasn’t collected for customers who had a low probability of
churning. As such, this leads to a distorted cluster analysis. However, since we don’t have
sufficient demographic variables, we still use Customer Value for the clustering.
Customer Status based Profiling:
To conduct Customer Status based segmentation, we opt for variables which highlight the status of
the customer with reference to the services offered by the company. The selected variables are
email queries sent, revenue through GPRS, internet and fix-line, and days since last complaint.
Through StatExplore we found that the distributions of the latter four selected variables were
highly skewed, and thus we normalize them using the Transform Variables node.
Figure 5: Process flow for Customer Status based Profiling.
Setting up 4 clusters led to fairly equal segments. The four clusters could be interpreted as:
Cluster 1 – Super Active
This cluster is characterized by customers who tend to converse back and forth with the company
through emails quite often but haven’t had a complaint regarding the services recently.
Additionally, these customers generate relatively higher revenue through internet, GPRS as well as
fix-line services. As such, the customers from this segment are very important from a profitability
point of view.
Cluster 2 – Curious
Customers from this cluster can be described as being quite curious about the new plans, as evident
from their high number of email queries sent in the past 6 months. They have also lodged a
complaint very recently and produce high revenue through the internet medium for the company.
Thus, they have been aptly named ‘Curious’. This cluster will need special attention from the
organization, as it shows signs of churning.
Figure 6: Results from Segment Profile node.
Cluster 3 – Content
Customers belonging to this segment are rather satisfied with the services and have a laid-back
attitude. These customers don’t generally send in email queries and haven’t made a complaint with
the company recently. The cash inflow generated by these particular customers closely matches the
overall distribution of the customers across the whole dataset.
Cluster 4 – Transitionals
‘Transitionals’ represents a cluster of customers who are transitioning to the modern services
offered by the company. They have made a complaint fairly recently but don’t generally send many
emails to the organization. The revenue they generate through internet is on the lower side, but
they produce high revenue through fix-line and GPRS.
Days since last complaint was a very significant variable for the clusters ‘Super Active’ and
‘Transitionals’, while email queries were a strong determinant for the clusters ‘Curious’ and
‘Content’ (Appendix - Figure 11).
Usage based Profiling:
To conduct usage based profiling, we select variables which highlight the usage pattern: outgoing
national, international, roaming and local calls, change in bill, and revenue through internet and
fix-line. Since these variables were highly skewed, we used the Transform Variables node to
normalize their distributions.
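The effect of such a transformation on a skewed variable can be sketched in Python. This is an illustration only: it assumes a log transform (the report states the variables were converted to log) with an offset of 1 so that zero values remain defined, which may differ from what the Transform Variables node applies. The function names and the toy revenue values are hypothetical.

```python
import math


def skewness(values):
    """Population skewness: third central moment divided by variance ** 1.5."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((x - mu) ** 2 for x in values) / n
    m3 = sum((x - mu) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


def log_transform(values):
    """Apply log(1 + x); the +1 offset is an assumption to handle zeros."""
    return [math.log1p(x) for x in values]


# Hypothetical right-skewed revenue values, not taken from CHURN_TELECOM.
revenue = [0.0, 1.0, 2.0, 3.0, 5.0, 8.0, 200.0]
logged = log_transform(revenue)

skew_before = skewness(revenue)
skew_after = skewness(logged)
```

On this toy input the skewness after the transform is lower than before, because the logarithm compresses the long right tail created by the single very large value.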
Figure 7: Process flow for Usage based Profiling.
Since we converted all the variables to log scale, we set ‘Standardization’ to None. We set the
number of clusters to 4. The results were interpreted as:
Cluster 1 – Cosmopolitan
This cluster is characterized by customers who have a high usage of outgoing international calls.
The rest of their usage, such as national calls, local calls, roaming calls and internet, is the
same as the pattern of customers across the dataset. As such, the company should offer customers
from this cluster plans which are more attractive for international calling, if it wants to retain
them in the long run.
Cluster 2 – Connected
Customers from this cluster tend to have a high usage of outgoing calls at the national level.
Their usage of other services is much the same as the overall usage pattern of the customers.
Churning customers from this segment can be lured back by offering them value-for-money plans for
national calling.
Figure 8: Results from Segment Profile node.
Cluster 3 – Traditionals
This cluster is described as ‘Traditionals’ since their usage pattern stays the same throughout, as
evident from their low percentage change in bills. Their utilization of national and international
calls stays on the lower side, though they use a high amount of internet.
Cluster 4 – Modern
This segment is described as ‘Modern’ as it is characterized by the usage of contemporary
customers: fluctuating bills, low usage of calls (national, local, international and roaming) and
high internet usage.
Figure 9: Variables' influence on each cluster.
Cross-cluster analysis:
After creating the respective clusters based on Demographics, Customer Status and Usage, we conduct
a cross-cluster analysis to explore whether there is any association between these segments that
could potentially be harnessed into something profitable for the company.
We add the Save Data node to the Cluster nodes and export the segment data from all three
categories. Using the VLOOKUP function in Excel, we arrived at the following observations:
                              Demographic
Usage          Valuable Young Adults   Distressed Damsel   Stingy Seniors   Bankable Ladies
Cosmopolitan   28.79%                  23.51%              21.85%           22.92%
Connected      22.06%                  29.47%              23.69%           25.45%
Traditionals   25.63%                  22.29%              27.66%           35.29%
Modern         24.59%                  25.78%              25.23%           21.02%
Cross-cluster analysis: Demographic vs Usage
It was seen that ‘Valuable Young Adults’ and ‘Cosmopolitan’ shared a good association, indicating
that young men tend to use international calling frequently. Similarly, it was observed that women
in later stages of life (‘Bankable Ladies’) had a very ‘Traditional’ usage, i.e. their bills rarely
fluctuated and they used the calling features as much as the overall average. Lastly, ‘Distressed
Damsel’ was closely related to ‘Connected’, which means they are quite active in terms of outgoing
national calls. These insights can be used to gain competitive advantage and improve the offerings
to the respective customers. A lot of business value can be derived through correct interpretation
of these insights and proper action on them.
Cross-clusters highlighted in red represent the groups which have a high chance of churning
(derived using the Churn Flag variable and VLOOKUP). As such, it is imperative that the company
soon offers such customers good discounts and plans, depending upon their usage, to retain them.
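The lookup-and-tabulate step performed in Excel can be sketched in Python as follows. This is a hypothetical reconstruction, not the workbook used for the report: the customer IDs and segment assignments are invented, and because the report does not state how the percentages were normalised, the sketch assumes each cell is a share of its demographic column.

```python
from collections import Counter


def crosstab_column_pct(pairs):
    """Given (row_label, column_label) pairs, return the percentage of each
    column's customers that fall in each row, keyed by (row, column)."""
    cell_counts = Counter(pairs)
    column_totals = Counter(col for _, col in pairs)
    return {
        (row, col): 100.0 * count / column_totals[col]
        for (row, col), count in cell_counts.items()
    }


# Hypothetical exported segment assignments keyed by customer ID.
demographic = {
    "C1": "Valuable Young Adults",
    "C2": "Valuable Young Adults",
    "C3": "Bankable Ladies",
    "C4": "Bankable Ladies",
}
usage = {
    "C1": "Cosmopolitan",
    "C2": "Connected",
    "C3": "Traditionals",
    "C4": "Traditionals",
}

# Equivalent of the VLOOKUP: match each customer's usage segment by ID.
pairs = [(usage[cid], seg) for cid, seg in demographic.items() if cid in usage]
table = crosstab_column_pct(pairs)
```

With this toy input, half of the ‘Valuable Young Adults’ fall in ‘Cosmopolitan’ and all of the ‘Bankable Ladies’ fall in ‘Traditionals’. The same pairing, filtered on the churn flag, would give the churn rate of each cross-cluster.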
Cross-cluster analysis between Customer Status and Usage helped us to discover some hidden
insights.
                         Customer Status
Usage          Super Active   Curious   Content   Transitionals
Cosmopolitan   22.37%         23.67%    26.39%    24.23%
Connected      33.42%         27.43%    22.13%    25.16%
Traditionals   23.92%         23.54%    30.03%    21.37%
Modern         25.41%         29.83%    24.22%    27.55%
Cross-cluster analysis: Customer Status vs Usage
‘Super Active’ customers tend to be involved in a lot of international interaction
(‘Cosmopolitan’), while ‘Curious’ customers had a very modern-like usage. Similarly, customers who
have been categorized as ‘Content’ had a lot in common with ‘Traditional’ usage. As such, the
company should keep these insights in mind and prepare plans to maximize profit from such groups of
customers.
On the other hand, customers belonging to the ‘Super Active’ cluster with ‘Traditional’ usage have
a high probability of churning. Additionally, ‘Transitionals’ with high international calling usage
and national calling usage may leave the company sooner rather than later. Thus, the company needs
to provide offers and discounts accordingly, based on the usage patterns mentioned above, to
retain those customers.
PART B – EXTENDING KNOWLEDGE OF PREDICTIVE ANALYTICS
---------------------------------------------------------------------------------------------------------------------------
Seven reasons for Predictive Analytics
Since the turn of the millennium, and especially in the last decade or so, there has been an
unprecedented generation of data, whether structured or unstructured. In fact, IBM has stated that
such large volumes of data are generated every day that the amount of data doubles every two years.
In this gigantic amount of data lie numerous hidden patterns and trends which, if harnessed in the
right manner, could result in substantial business value for the organization. Predictive
analytics is one such tool that can exploit this data to come up with meaningful and actionable
insights.
Eric Siegel, an accomplished figure in the field of Predictive Analytics, put forward his thoughts
on Predictive Analytics in a white paper, stating precisely why the world needs to embrace it. As
per Eric Siegel, adoption, implementation and application of Predictive Analytics can enable an
organization to achieve the following seven objectives:
• Compete: Gain competitive advantage over rivals.
• Grow: Increase sales, expand the customer base and retain existing customers.
• Enforce: Detect frauds, anomalies and undesirable circumstances.
• Improve: Enhance and refine core product offerings, automate processes and optimize resources.
• Satisfy: Provide tailored solutions and recommendations for customers.
• Learn: Learn from past data (structured as well as unstructured) to provide insights and
foresights about the future.
• Act: Produce actionable recommendations and insights.
Case Study II – Predictive Analytics for Insurers
An insurance company’s operating success chiefly relies on its forecasting capabilities. The
primary distinguisher between the best and the rest of the insurance companies is the accuracy with
which the organization can target potential customers, set the pricing of premiums and detect
fraudulent claims. In the past, many of these tasks were carried out on the basis of guesstimates,
a method which wasn’t really efficient or cost-effective.
Soon, key determinants like age and history became the foundation on which insurance companies
forecasted their operations. Today, however, Predictive Analytics has changed the entire landscape
of how insurance companies conduct their operations.
With the help of Predictive Analytics, insurance companies have been able to improve not only their
core operations (e.g. credit scores, fraud detection) but also the marketing of their products
(based upon buying patterns, i.e. hit ratio, retention ratio) and underwriting (filtering out
customers who do not meet given criteria, thereby saving time and money).
Relating Case Study II to seven reasons for Predictive Analytics
The application of Predictive Analytics is very prevalent in the insurance landscape, and is in
fact considered industry best practice. The business value that can be derived from the use of
Predictive Analytics in the field of insurance is tremendous. After thoroughly analysing the given
insurance case study, we could summarize how the usage of Predictive Analytics by insurance firms
enabled them to achieve the outcomes described by Eric Siegel as:
Compete:
The insurance industry is very competitive, with companies always iterating to stay one step ahead
of their rivals. Predictive Analytics can enable an organization to gather knowledge about its
customers in a more holistic manner, which can create a competitive advantage for the firm.
Similarly, Predictive Analytics can create credit score rating models, adverse selection models and
so on, which will help the organization stay ahead of its rivals.
Grow:
The insurance industry has witnessed snail-paced growth over the past few years. This has led to
organizations exploring options to expand to new horizons and locations. With the help of
Predictive Analytics, insurance companies can predict which customers are likely to respond to
offers and marketing campaigns. Similarly, through Predictive Analytics, they can understand buying
patterns, which can be used to improve marketing’s hit ratio and customer retention ratios.
Enforce:
One of the most significant functions for any insurance company is the detection of fraudulent
claims. With the help of scoring and ranking models, Predictive Analytics can highlight which
claims are suspicious and need more investigation before settlement.
Improve:
Predictive Analytics can immensely aid the operating efficiency and product offering of an
insurance company. Through predictive models, insurers can identify at the initial stage itself
which claims are likely to be settled for a high value in the future. This will allow the company
to run its operations more efficiently and more economically. Additionally, Predictive Analytics
can find out which customers meet the stipulated obligations for the insurance and which customers
do not. This helps in saving the organization’s time, money and resources.
Satisfy:
To maximize customer value, insurers need to pitch the right type of insurance (life insurance,
vehicle insurance and so on) to the customer. By observing the buying patterns of customers,
predictive models can suggest the right fit of insurance individually for each customer. Similarly,
predictive models can assign a risk score to each customer depending upon various determinants
(age, location, history, etc.). These scores then enable the company to set appropriate premium
pricing for the customers.
Learn:
Predictive Analytics uses sophisticated models to find patterns and trends in the dataset. As
such, the usage of predictive models like linear regression, logistic regression, decision trees
and so on can enable insurers to find whether any pattern exists between the variables. This
information can be used for various operational activities.
Act:
The insights and foresights generated by predictive models can add great business value if they
are implemented by the organization. Insurers have been proactively acting on the insights
produced by Predictive Analytics. Fraud detection, customer retention, churn analysis and adverse
selection are some of the models that have been created through predictive modelling and acted
upon by insurance companies.
Comment on seven reasons for Predictive Analytics and its relation with Churn Case Study
The seven reasons for Predictive Analytics stated by Eric Siegel add definite value to a Predictive
Analytics project. The reasons mentioned by ‘Dr. Data’ are comprehensive and describe, at a very
fine level of detail, the benefits that could be derived from a predictive model.
From the above case study about insurance, we could observe and relate a real-life application of
the seven reasons for Predictive Analytics and how it proved advantageous to the industry.
The seven reasons for Predictive Analytics can also be witnessed, in part, in the Churn case study.
The churn analysis enables the telecom company to gain competitive advantage (‘Compete’) over its
rivals, as it could act upon the high-churn customers and retain them (‘Grow’) by offering them
offers and discounts (‘Act’), while competitors who don’t use Predictive Analytics won’t be able to
retain their high-churn customers.
The Decision Tree and Regression models were built using past data (‘Learn’). The Decision Tree
model was then used on the new dataset with the help of the Score node to detect which customers
are on the verge of churning (‘Enforce’).
Even though the model flags customers having a high probability of churn, the case study doesn’t
really follow ‘Improve’, as the model doesn’t enhance the core product offering but just indicates
which customers may be unhappy with the services. Similarly, the case study doesn’t follow
‘Satisfy’, as it cannot suggest tailored solutions to individual customers but can only suggest
which customers should be offered a discount to retain them.
SEMMA
SEMMA (Sample, Explore, Modify, Model and Assess) is a methodology formulated by SAS Institute to
conduct data mining tasks on its software, SAS Enterprise Miner. SEMMA is concerned with the model
development aspects of data mining in SAS Miner, and adherence to it ensures end-to-end coverage of
the core data mining processes, which directly leads to more informed and accurate analysis.
However, due to the lack of concrete approaches to the data mining process flow (other than
CRISP-DM), SEMMA is followed by many analysts to conduct data mining activities. SEMMA stands for:
Sample: Every data mining activity should start with sampling of the dataset into training,
validation and test sets, ensuring there’s enough information to carry out all these tasks.
Explore: In this stage, we investigate and explore the variables to discover information and
patterns that may exist between the variables.
Modify: At this stage, we select appropriate methods to modify, transform and rectify variables
that would be used in the modelling.
Model: After exploration and modification of variables, we apply the modelling technique to the
selected variables.
Assess: In the last phase, we evaluate the accuracy and predictive capabilities of the models.
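The Sample step can be illustrated with a short Python sketch of a random split into training, validation and test sets. This is not the Data Partition node itself. The 40/30/30 proportions, the fixed seed and the function name `partition` are assumptions made for illustration, since the report does not state the settings that were used.

```python
import random


def partition(records, train=0.4, validate=0.3, seed=42):
    """Shuffle the records and split them into training, validation and test
    lists. Whatever remains after the first two shares goes to the test set."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded for a repeatable split
    n_train = int(len(shuffled) * train)
    n_valid = int(len(shuffled) * validate)
    training = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_valid]
    test_set = shuffled[n_train + n_valid:]
    return training, validation, test_set


# Hypothetical customer IDs standing in for rows of the dataset.
customer_ids = list(range(100))
training, validation, test_set = partition(customer_ids)
```

Every record lands in exactly one of the three sets, so a model fitted on the training data can be tuned on the validation data and then judged on data it has never seen.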
Relation to Churn Case Study
SAS proposed that SEMMA is the core process of conducting a data mining activity. As can be
observed from Figure 10, the churn case study closely followed the SEMMA principles. The churn
analysis commences with the Data Partition node (Sample), which enables us to create samples from
the dataset and allocate sufficient data for training, validation and test. This is then followed
by imputation of missing values and replacement of a variable to reduce its number of classes
(Modify). To reduce redundancy, we utilize the Variable Clustering node (Explore) and then run our
Decision Tree and Regression models (Model). Through Model Comparison (Assess), we compare the two
predictive models and find something peculiar. To investigate it further, we use the Multiplot
node (Explore) and detect abnormal variables which affected the predictive capabilities of the
models.
Figure 10: Process flow of Churn Case Study
Using the Metadata node (Modify), we remove these abnormal variables and re-connect the Decision
Tree and Regression models (Model) to it. Then, we again use the Model Comparison node (Assess) to
gauge which model outperforms the other. Finally, we use the Score node (Assess) to apply the best
model to the new dataset and complete the data mining process.
Thus, it could be concluded that all the steps of SEMMA were comprehensively covered by the Churn
case study.
The adherence to SEMMA in the Churn case study can be summarized as:
Steps     Nodes
Sample    Data Partition
Explore   Multiplot, Variable Clustering
Modify    Impute, Replacement, Metadata
Model     Decision Tree, Regression
Assess    Model Comparison, Score
Relating SEMMA with Churn Case Study.
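The Assess step, in which two models are compared to see which outperforms the other, can be sketched in Python. The report does not state which selection statistic the Model Comparison node used, so misclassification rate on held-out data is an assumption here. The labels and predictions below are invented toy values and do not reproduce the results of the case study.

```python
def misclassification_rate(actual, predicted):
    """Share of records whose predicted class differs from the actual class."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    wrong = sum(1 for a, p in zip(actual, predicted) if a != p)
    return wrong / len(actual)


# Hypothetical validation labels (1 = churned) and model predictions.
actual = [1, 0, 0, 1, 0, 1, 0, 0]
tree_pred = [1, 0, 0, 1, 0, 0, 0, 0]
reg_pred = [1, 0, 1, 0, 0, 0, 0, 0]

rates = {
    "Decision Tree": misclassification_rate(actual, tree_pred),
    "Regression": misclassification_rate(actual, reg_pred),
}

# The model with the lowest error rate is the one passed on for scoring.
best_model = min(rates, key=rates.get)
```

In this toy comparison the decision tree misclassifies one record in eight and the regression three in eight, so the decision tree would be selected. Other statistics, such as lift or average squared error, could rank the models differently.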
Importance of SEMMA
Even though SAS insists SEMMA is merely a set of guidelines to be followed for SAS Miner, the
methodology’s application can be extended to data mining tasks as a whole. SEMMA is a very robust
approach that encompasses all the chief criteria required for undertaking or building a complex
predictive model. Adherence to SEMMA ensures ease of process flow, detection of faults and
creation of more accurate models.
The Churn case study benefitted hugely from following the SEMMA methodology. Through Multiplot and
Variable Clustering we could ‘explore’ erroneous and redundant variables, and through Impute,
Replacement and Metadata we could ‘modify’ such variables. Model Comparison enabled us to compare,
contrast and ‘assess’ the two predictive ‘models’: Decision Tree and Regression. With the help of
the Score node, we evaluated and applied the model to a new dataset.
Appendix
Figure 11: Most significant variables for each of the four clusters.


Machine-Learning: Customer Segmentation and Analysis.

  • 1. Assignment 2: ClusterAnalysisand Predictive Modelling BUS5PA - 19139507 SIDDHANTH CHAURASIYA 19139507
  • 2. 1 | P a g e 1 9 1 3 9 5 0 7 PART A - SEGMENTATION BASED EXPLORATION OF CUSTOMERS --------------------------------------------------------------------------------------------------------------------------- Thissectionof the report containsthe explorationandfindingsfromthe segmentationand clusteringanalysisconductedonCHURN_TELECOMdataset,usingSASMiner.Three typesof segmentation were carriedoutonthe basisof distinctdeterminants –Demographics,Customer Statusand CustomerUsages. Demographics basedProfiling: Aftercreatingthe project, library anddiagram,we add the data-source andsetthe rolesof all the variablesasinputexceptforChurnFlag(Target),Customerandsubscriptionidentifier(ID) and Subscribername (Text). We drag the data-source intothe diagramand connectitwithClusterandSegmentprofilesnodes. Age,GenderandCustomerValue are selectedasthe variablesforthe Clusteraswell asthe Segment Profile node forthisprofilingactivity. CustomerValuecontains68% missingvalues,andthus imputingthose missingvalueswithasyntheticvalue (mean,median,max,etc.) wouldcreate avery skeweddistribution;whichisn’tdesirable.Hence,CustomerValueisn’timputed. Figure 1: Process flow for Demographics based Profiling. Since the measurementscalesof the variablesselectedasthe inputfor Demographical profilingare different,we keepthe methodfor‘Internal Standardization’as ‘Standardization’fromthe properties panel of the Clusternode. Restall propertiesof nodesClusterandSegmentProfile are keptat default. Figure 2: Cluster and Segment Profile results for Demographical segmentation. We founda goodcombinationof clusterswithfairamountof observationsineachsegment(Figure 2) aftersettingthe numberof clustersas4. The four segmentscouldbe broadlyclassifiedas:
  • 3. 2 | P a g e 1 9 1 3 9 5 0 7 Cluster 1 – ValuableYoung Adults. Thissegmentcanbe describedasa groupof Maleswhoare justabout start theirprofessional careersand generate highcustomervalue forthe organization. Since thisclustershow the tendency of highcustomervalue,the companyshould ensure retentionof thissegment. Cluster 2 – Distressed Damsel. Thisclustercan be bestexpressedasa segment of juvenileFemaleswhoaccumulateforarelatively lowerCustomerValue. Thissegmentaccountsforlowercustomervalue,whichmaybe anindicator that customersaren’tsatisfiedwiththe servicesofferedandmaychurninthe future.The company shoulddevise plans,offersanddiscountstonegate the chancesof churnof thiscluster. Figure 3: Results from Segment Profile node. Cluster 3 – Stingy Seniors. Thisgroup ischaracterizedbyseniormales whogenerate low valueforthe Telecomcompany. As such,customersbelongingtothis segmentmayneedspecial attentionsastheyhave highlikelihood of churning, asindicatedbytheirlowcustomervalue generation. Cluster 4 – Bankableladies. Thisclusteris classifiedbyelderwomenwhoproduce highvalueforthe company. The company shouldlooktomaximize the value derivedfromthissegment.
  • 4. 3 | P a g e 1 9 1 3 9 5 0 7 Figure 4: Variable significance for each cluster. As observedfromFigure 4, Genderwasthe mostinfluential variable forthe classificationof DistressedDamsel,StingySeniorsandBankable ladieswhileAge hasthe mostsignificance for Valuable YoungAdults. Note:The variable CustomerValuewasonlycollectedforcustomerswho were identifiedashaving highprobabilityof churning.Customervalue wasn’tcollectedforcustomerswhohadlow probability of churning.Assuch,these leadstoa distortedanalysisforcluster.However,since we don’thave sufficientdemographical variables,we stilluse CustomerValue forthe clustering. CustomerStatus basedProfiling: To conduct CustomerStatusbasedsegmentation,we optforvariables whichhighlightwhatthe statusof the customeriswithreference tothe servicesofferedbythe company. Email queriessent, revenue throughGPRS,internet,&fix-lineanddayssince lastcomplain are the variableswhichare selected. ThroughStatExplore we foundoutthe distributionof the latterfourselected variables were highlyskewed, andthus we normalizethemusingTransformvariablesnode. Figure 5: Process flow for Customer Status based Profiling. Settingupof 4 clustersledtoan excellentcreationof fairlyequalsegments. The fourclusterscould be interpreted as: Cluster 1 – Superactive Thisclusterischaracterizedbycustomerswhotendto conversate backand forthwiththe company throughemailsquite oftenbuthaven’treallyhadacomplaintregardingthe servicesrecently. Additionally,these customersgenerate arelativelyhigherrevenue throughinternet,GPRSaswell as fix-line services.As such,the customersfrom these segments are very importantfromprofitability
  • 5. point of view.
Cluster 2 – Curious
Customers from this cluster can be described as being quite curious about the new plans, as evident from their high number of email queries sent in the past 6 months. They have also lodged a complaint very recently and produce high revenue for the company through the internet. Thus, they have been aptly named ‘Curious’. This cluster will need special attention from the organization, as it shows signs of churning.
Figure 6: Results from Segment Profile node.
Cluster 3 – Content
Customers belonging to this segment are rather satisfied with the services and have a laid-back attitude. These customers do not generally send email queries and have not made a complaint to the company recently. The revenue generated by these customers is similar to the overall distribution across the whole dataset.
Cluster 4 – Transitionals
‘Transitionals’ represents a cluster of customers who are transitioning to the modern services offered by the company. They have made a complaint fairly recently but do not generally send many emails to the organization. The revenue they generate through internet is on the lower side, but they produce high revenue through fix-lines and GPRS.
Days since last complaint was a very significant variable for the clusters ‘Super Active’ and ‘Transitionals’, while email queries were a strong determinant for the clusters ‘Curious’ and ‘Content’ (Appendix - Figure 11).
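The effect of normalizing a skewed revenue variable before clustering can be illustrated with a small sketch. The report does not say which transformation was chosen in the Transform Variables node for this profiling, so a log(1 + x) transform is assumed here, and the revenue figures are hypothetical values made up for illustration.

```python
import math
import statistics


def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def log_transform(values):
    """log(1 + x), so that zero revenue stays defined (assumed transform)."""
    return [math.log1p(v) for v in values]


# Hypothetical GPRS revenue per customer: most values small, a few very large.
revenue = [0, 1, 2, 2, 3, 4, 5, 8, 40, 250]

raw_skew = skewness(revenue)
log_skew = skewness(log_transform(revenue))

print(f"skewness before: {raw_skew:.2f}, after log transform: {log_skew:.2f}")
```

The transformed variable is still right-skewed in this toy example, but far less so, which keeps a handful of extreme customers from dominating the distance calculations in the clustering step.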
  • 6. Usage based Profiling:
To conduct usage based profiling, we select variables which highlight usage patterns – outgoing national, international, roaming and local calls, change in bill, and revenue through internet and fix-line. Since these variables were highly skewed, we used the Transform Variables node to normalize their distributions.
Figure 7: Process flow for Customer Status based Profiling.
Since we converted all the variables to log scale, we set ‘Standardization’ to none. We set the number of clusters to 4. The results were interpreted as:
Cluster 1 – Cosmopolitan
This cluster is characterized by customers who have a high usage of outgoing international calls. Their other usage, such as national calls, local calls, roaming calls and internet, matches the pattern of customers across the dataset. As such, the company should offer customers from this cluster plans which are more attractive for international calling if it wants to retain them in the long run.
Cluster 2 – Connected
Customers from this cluster tend to have a high usage of outgoing calls at national level. Their usage of other services is much the same as the overall usage pattern of the customers. Churning customers from this segment can be lured back by offering them value-for-money plans for national calling.
  • 7. Figure 8: Results from Segment Profile node.
Cluster 3 – Traditionals
This cluster is described as ‘Traditionals’ since their usage pattern stays the same throughout, as evident from their low percentage change in bills. Their use of national and international calls stays on the lower side, though they use a high amount of internet.
Cluster 4 – Modern
This segment is described as ‘Modern’ as it is characterized by the usage of contemporary customers – fluctuating bills, low usage of calls (national, local, international and roaming) and high internet usage.
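The usage-based segmentation above (log-transformed inputs, no further standardization, four clusters) can be approximated outside SAS Miner with a short sketch. Plain k-means (Lloyd's algorithm) with a deterministic farthest-point start is used here as a stand-in; it is not claimed to be what the Cluster node does internally. The usage figures and the three columns are hypothetical.

```python
import math


def farthest_point_init(points, k):
    """Pick k starting centroids: the first point, then repeatedly the point
    farthest from the centroids chosen so far (deterministic)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, iterations=50):
    """Plain k-means: assign each point to its nearest centroid, then move
    each centroid to the mean of its members, until nothing changes."""
    centroids = farthest_point_init(points, k)
    labels = []
    for _ in range(iterations):
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                updated.append(centroids[c])  # keep an empty cluster's centroid
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids


# Hypothetical usage per customer: (international minutes, national minutes, internet MB).
usage = [
    (300, 20, 5), (280, 25, 6),    # heavy international callers
    (10, 400, 5), (12, 380, 4),    # heavy national callers
    (5, 10, 900), (6, 12, 950),    # heavy internet users
    (4, 8, 3), (5, 9, 4),          # light users
]

# Log-transform every variable, then cluster without further standardization.
log_usage = [tuple(math.log1p(v) for v in row) for row in usage]
labels, centroids = kmeans(log_usage, k=4)
print(labels)
```

Because all inputs are already on a log scale, their ranges are comparable, which is the reason the report gives for switching standardization off.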
  • 8. Figure 9: Variables' influence on each cluster.
Cross-cluster analysis:
After creating the respective clusters based on Demographics, Customer Status and Usage, we conduct a cross-cluster analysis to explore whether there is any association between these segments which could potentially be turned into something profitable for the company.
We add the Save Data node to the Cluster nodes and export the segment data from all three categories. Using the VLOOKUP function in Excel, we arrived at the following observation:
Usage \ Demographic   Valuable Young Adults   Distressed Damsel   Stingy Seniors   Bankable Ladies
Cosmopolitan          28.79%                  23.51%              21.85%           22.92%
Connected             22.06%                  29.47%              23.69%           25.45%
Traditionals          25.63%                  22.29%              27.66%           35.29%
Modern                24.59%                  25.78%              25.23%           21.02%
Cross-cluster analysis: Demographic vs Usage
It was seen that ‘Valuable Young Adults’ and ‘Cosmopolitan’ shared a good association, indicating that young men tend to use international calling frequently. Similarly, it was observed that women in later stages of life (‘Bankable Ladies’) had a very ‘Traditional’ usage, i.e. their bills rarely fluctuated and they used the calling features as much as the overall average. Lastly, ‘Distressed Damsel’ was closely related to ‘Connected’, which means they are quite active in terms of outgoing national calls. These insights can be used very effectively to gain competitive advantage and improve the offerings to the respective customers. A lot of business value can be derived by interpreting them correctly and acting on them properly.
Cross-clusters highlighted in red represent the groups which have a high chance of churning (derived using the ChurnFlag variable and VLOOKUP). As such, it is imperative that the company soon offers such customers good discounts and plans suited to their usage in order to retain them.
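The VLOOKUP step described above amounts to joining the exported segment labels on the customer identifier and then cross-tabulating them. A minimal sketch follows; the customer IDs and assignments are hypothetical, and expressing each cell as a share of its demographic segment is an assumption, since the report does not state how its percentages were normalized.

```python
from collections import Counter

# Hypothetical exports from two Cluster nodes, keyed by customer ID.
demographic_segment = {
    1: "Valuable Young Adults", 2: "Valuable Young Adults",
    3: "Valuable Young Adults", 4: "Valuable Young Adults",
    5: "Bankable Ladies", 6: "Bankable Ladies",
}
usage_segment = {
    1: "Cosmopolitan", 2: "Cosmopolitan", 3: "Connected",
    4: "Modern", 5: "Traditionals", 6: "Traditionals",
}


def cross_cluster_shares(col_segments, row_segments):
    """Join the two exports on customer ID (the VLOOKUP step) and return, for
    each (row, column) pair, the percentage of the column segment it holds."""
    shared_ids = col_segments.keys() & row_segments.keys()
    pair_counts = Counter((row_segments[i], col_segments[i]) for i in shared_ids)
    col_totals = Counter(col_segments[i] for i in shared_ids)
    return {
        (row, col): 100.0 * count / col_totals[col]
        for (row, col), count in pair_counts.items()
    }


shares = cross_cluster_shares(demographic_segment, usage_segment)
for (usage, demo), pct in sorted(shares.items()):
    print(f"{usage:<14} x {demo:<22} {pct:6.2f}%")
```

In this toy data half of the ‘Valuable Young Adults’ fall into ‘Cosmopolitan’, which is the kind of association the report reads off its table.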
  • 9. Cross-cluster analysis between Customer Status and Usage helped us to discover some hidden insights.
Usage \ Customer Status   Super Active   Curious   Content   Transitionals
Cosmopolitan              22.37%         23.67%    26.39%    24.23%
Connected                 33.42%         27.43%    22.13%    25.16%
Traditionals              23.92%         23.54%    30.03%    21.37%
Modern                    25.41%         29.83%    24.22%    27.55%
Cross-cluster analysis: Customer Status vs Usage
‘Super Active’ customers tend to be involved in a lot of interaction internationally (‘Cosmopolitan’), while ‘Curious’ customers had a very modern-like usage. Similarly, customers who have been categorized as ‘Content’ had a lot in common with ‘Traditional’ usage. As such, the company should keep these insights in mind and prepare plans to maximize profit from such groups of customers.
On the other hand, customers belonging to the ‘Super Active’ cluster with ‘Traditional’ usage have a high probability of churning. Additionally, ‘Transitionals’ with high international and national calling usage may leave the company sooner rather than later. Thus, the company needs to hand out offers and discounts based on the usage patterns mentioned above to retain those customers.
PART B – EXTENDING KNOWLEDGE OF PREDICTIVE ANALYTICS
---------------------------------------------------------------------------------------------------------------------------
Seven reasons for Predictive Analytics
Since the turn of the millennium, and especially in the last decade or so, there has been an unprecedented generation of data, whether structured or unstructured. In fact, IBM has stated that such large volumes of data are generated every day that the amount of data doubles every two years.
In this gigantic amount of data lie numerous hidden patterns and trends which, if harnessed in the right manner, could result in enormous business value for the organization. Predictive analytics is one such tool that can exploit this data to come up with meaningful and actionable insights.
Eric Siegel, an accomplished heavyweight in the field of Predictive Analytics, put forward his thoughts on Predictive Analytics in a white paper, stating precisely why the world needs to embrace it. As per Eric Siegel, the adoption, implementation and application of Predictive Analytics can enable an organization to achieve the following seven objectives:
    • Compete: Gain competitive advantage over rivals.
    • Grow: Increase sales, expand the customer base and retain existing customers.
    • Enforce: Detect frauds, anomalies and undesirable circumstances.
  • 10.
    • Improve: Enhance and refine core product offerings, process automation and resource optimization.
    • Satisfy: Provide tailored solutions and recommendations for customers.
    • Learn: Learn from past data (structured as well as unstructured) to provide insights and foresights about the future.
    • Act: Produce actionable recommendations and insights.
Case Study II – Predictive Analytics for Insurers
An insurance company's operating success chiefly relies on its forecasting capabilities. The primary distinguisher between the best and the rest of the insurance companies is the accuracy with which the organization can target potential customers, set the pricing of the premium and detect fraudulent claims. In the past, many of these tasks were carried out on the basis of guesstimates, a method which was neither efficient nor cost-effective.
Soon, key determinants like age and history became the foundation on which insurance companies forecast their operations. Today, however, Predictive Analytics has changed the entire landscape of how insurance companies conduct their operations.
With the help of Predictive Analytics, insurance companies have been able to improve not only their core operations (e.g. credit scores, fraud detection) but also the marketing of the product (based upon buying patterns, i.e. hit ratio and retention ratio) and underwriting (filtering out customers who do not meet given criteria, thereby saving time and money).
Relating Case Study II to seven reasons for Predictive Analytics
The application of Predictive Analytics is very prevalent in the insurance landscape and is in fact considered industry best practice. The business value that can be derived from the use of Predictive Analytics in the field of insurance is tremendous.
After thoroughly analysing the given insurance case study, we can summarize how the use of Predictive Analytics by insurance firms enabled them to achieve the outcomes described by Eric Siegel as follows:
Compete:
The insurance industry is very competitive, with companies always iterating to stay one step ahead of their rivals. Predictive Analytics can enable an organization to gather knowledge about its customers in a more holistic manner, which can create a competitive advantage for the firm. Similarly, Predictive Analytics can create credit score rating models, adverse selection models and so on, which will help the organization stay ahead of its rivals.
Grow:
The insurance industry has witnessed snail-paced growth over the past few years. This has led organizations to explore options to expand to new horizons and locations. With the help of Predictive Analytics, insurance companies can predict which customers are likely to respond to offers and marketing campaigns. Similarly, through Predictive Analytics, insurers can understand buying patterns, which can be used to improve marketing hit ratios and customer retention ratios.
Enforce:
One of the most significant functions of any insurance company is the detection of fraudulent claims.
  • 11. With the help of scoring and ranking models, Predictive Analytics can highlight which claims are suspicious and need more investigation before settlement.
Improve:
Predictive Analytics can greatly aid the operating efficiency and product offering of an insurance company. Through predictive models, insurers can identify at the initial stage itself which claims are likely to be settled for a high value in the future. This allows the company to run its operations more efficiently and more economically. Additionally, Predictive Analytics can find out which customers meet the stipulated obligations for the insurance and which customers do not. This helps in saving the organization's time, money and resources.
Satisfy:
To maximize customer value, insurers need to pitch the right type of insurance (life insurance, vehicle insurance and so on) to the customer. By observing the buying patterns of the customers, predictive models can suggest the right fit of insurance for each individual customer. Similarly, predictive models can assign a risk score to each customer depending upon various determinants (age, location, history, etc.). These scores then enable the company to set appropriate premium pricing for the customers.
Learn:
Predictive Analytics uses sophisticated models to find patterns and trends in the dataset. As such, the use of predictive models like linear regression, logit regression, decision trees and so on can enable insurers to find out whether any pattern exists between the variables. This information can be used for various operational activities.
Act:
The insights and foresights generated by the predictive models can add great business value if they are implemented by the organization. Insurers have been proactively acting on the insights produced by Predictive Analytics. Fraud detection, customer retention, churn analysis and adverse selection are some of the models that have been created through predictive modelling and acted upon by insurance companies.
Comment on seven reasons for Predictive Analytics and its relation with Churn Case Study
The seven reasons for Predictive Analytics stated by Eric Siegel add definite value to a Predictive Analytics project. The reasons put forward by ‘Dr. Data’ are comprehensive and describe in fine detail the benefits that could be derived from a predictive model.
From the above case study about insurance, we could observe and relate a real-life application of the seven reasons for Predictive Analytics and how it proved advantageous to the industry.
The seven reasons for Predictive Analytics can also be witnessed, in part, in the Churn case study. The churn analysis enables the telecom company to gain competitive advantage (‘Compete’) over its rivals, as it can act upon the high-churn customers and retain them (‘Grow’) by giving them offers and discounts (‘Act’), while competitors who do not use Predictive Analytics will not be able to retain their high-churn customers.
The Decision Tree and Regression models were built using past data (‘Learn’). The Decision Tree
  • 12. model was then used on the new dataset with the help of the Score node to detect which customers are on the verge of churning (‘Enforce’).
Even though the model flags customers having a high probability of churn, the case study does not really follow ‘Improve’, as the model does not enhance the core product offering but just indicates which customers may be unhappy with the services. Similarly, the case study does not follow ‘Satisfy’, as it cannot suggest tailored solutions to individual customers but can only suggest which customers should be offered a discount to retain them.
SEMMA
SEMMA (Sample, Explore, Modify, Model and Assess) is a methodology formulated by SAS Institute for conducting data mining tasks on its software, SAS Enterprise Miner. SEMMA is concerned with the model development aspects of data mining in SAS Miner, and adherence to it ensures end-to-end coverage of the core data mining processes, which directly leads to more informed and accurate analysis.
Because there are few concrete approaches to the data mining process flow (other than CRISP-DM), SEMMA is followed by many analysts to conduct data mining activities.
SEMMA stands for:
Sample: Every data mining activity should start with sampling of the dataset into training, validation and test sets, ensuring there is enough information to carry out all these tasks.
Explore: In this stage, we investigate and explore the variables to discover information and patterns that may exist between the variables.
Modify: At this stage, we select appropriate methods to modify, transform and rectify the variables that will be used in the modelling.
Model: After exploration and modification of the variables, we apply the modelling technique to the selected variables.
Assess: In the last phase, we evaluate the accuracy and predictive capabilities of the models.
Relation to Churn Case Study
SAS proposed that SEMMA is the core process of conducting a data mining activity. As can be observed from Figure 10, the churn case study closely followed the SEMMA principles.
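The Sample stage, which the case study carries out with the Data Partition node, can be sketched as a shuffled three-way split. The 60/20/20 proportions, the seed and the customer IDs below are hypothetical choices for illustration; the report does not state which proportions were used.

```python
import random


def partition(records, train_pct=60, validation_pct=20, seed=42):
    """Shuffle the records and split them into training, validation and test
    sets. Percentages are integers; the test set takes whatever remains."""
    if train_pct + validation_pct >= 100:
        raise ValueError("training and validation must leave room for a test set")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_train = len(shuffled) * train_pct // 100
    n_valid = len(shuffled) * validation_pct // 100
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_valid],
        shuffled[n_train + n_valid:],
    )


customer_ids = list(range(1, 101))  # hypothetical customer identifiers
train_set, validation_set, test_set = partition(customer_ids)
print(len(train_set), len(validation_set), len(test_set))
```

Models are fitted on the training set, tuned and compared on the validation set, and the test set is held back for a final check of the chosen model.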
The Churn analysis commences with the Data Partition node (Sample), which enables us to create a sample from the dataset and allocate sufficient data for training, validation and test.
This is then followed by imputation of missing values and replacement of a variable to reduce its number of classes (Modify). To reduce redundancy, we use the Variable Clustering node (Explore) and then run our Decision Tree and Regression models (Model). Through Model Comparison (Assess), we compare the two predictive models and find something peculiar. To investigate it further, we use the Multiplot node (Explore) and detect abnormal variables which affected the predictive capabilities of the model.
  • 13. Figure 10: Process flow of Churn Case Study
Using Metadata (Modify), we remove these abnormal variables and re-connect the Decision Tree and Regression models (Model) to it. Then, we again use the Model Comparison node (Assess) to gauge which model outperforms the other. Finally, we use the Score node (Assess) to apply the best model to the new dataset and complete the data mining process.
Thus, it can be concluded that all the steps of SEMMA were comprehensively covered by the Churn case study. The adherence to SEMMA in the Churn case study can be summarized as:
Steps     Nodes
Sample    Data Partition
Explore   Multiplot, Variable Clustering
Modify    Impute, Replacement, Metadata
Model     Decision Tree, Regression
Assess    Model Comparison, Score
Relating SEMMA with Churn Case Study.
Importance of SEMMA
Even though SAS insists SEMMA is merely a set of guidelines to be followed for SAS Miner, the methodology's application can be extended to data mining tasks as a whole. SEMMA is a very robust approach that encompasses all the chief criteria required for undertaking or building a complex predictive model. Adherence to SEMMA ensures ease of process flow, detection of faults and creation of more accurate models.
The Churn case study benefitted greatly from following the SEMMA methodology. Through Multiplot and Variable Clustering we could ‘explore’ erroneous and redundant variables, and through Impute, Replacement and Metadata we could ‘modify’ such variables. Model Comparison enabled us to compare, contrast and ‘assess’ the two predictive ‘models’ – Decision Tree and Regression. With the help of the Score node, we evaluated and applied the model to a new dataset.
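The Assess step can be illustrated with a minimal sketch that compares two models on a validation set. Misclassification rate is assumed as the selection criterion, since the report does not name the statistic used by Model Comparison, and the churn flags and predictions below are hypothetical values, not results from the case study.

```python
def misclassification_rate(actual, predicted):
    """Share of records whose predicted class differs from the actual class."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    errors = sum(a != p for a, p in zip(actual, predicted))
    return errors / len(actual)


# Hypothetical validation-set churn flags and the two models' predictions.
actual_churn = [1, 0, 0, 1, 0, 1, 0, 0]
predictions = {
    "Decision Tree": [1, 0, 0, 1, 0, 0, 0, 0],
    "Regression":    [1, 0, 1, 0, 0, 1, 0, 1],
}

rates = {
    name: misclassification_rate(actual_churn, predicted)
    for name, predicted in predictions.items()
}
best_model = min(rates, key=rates.get)  # lowest validation error wins

for name, rate in rates.items():
    print(f"{name}: {rate:.3f}")
print("Selected for scoring:", best_model)
```

Whichever model is selected this way would then be the one applied to new customers, which is the role the Score node plays in the process flow.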
  • 14. Appendix
Figure 11: Most significant variables for each of the four clusters.