Capstone Project - IS 6596
Project Supervisor:
Dr. Rohit Aggarwal
Project Contributors:
Mayank Badjatya - u1085897
Sagar Singh - u1088202
MARKETING ANALYTICS USING
R/PYTHON
Contents
Executive Summary
Book Description
Why Data Science?
Skill sets required for a Data Scientist
7 Steps to effective Predictive Modelling
Marketing Analysis
    Fraud Detection
    Market Segmentation
    Advertising
Lessons Learned
Next Steps
Executive Summary
The objective of this project is to discuss the importance of Machine Learning in different sectors and how
it solves problems in the Marketing Analytics field. We have discussed Market Segmentation,
Advertising, and Fraud Detection in our project. We used different Machine Learning algorithms,
implemented with R and Python libraries, to predict and solve these problems. After building models and
running test data through them, we obtained the following results:
• We trained Decision Tree and Random Forest classifier models that predict with 73% accuracy
whether a person will be a defaulter or not, based on credit history, income, job type, dependents,
etc.
• We segmented social networking profiles based on the likes and dislikes of a person using K-
Means Clustering.
• We built a predictive model on the messages a customer receives that determines whether a
message is spam or not with an accuracy of 97%. We used a Naïve Bayes classifier
for this model.
• We created several other models using different algorithms, but these are beyond the scope of
this report.
Book Description
An Introduction to Statistical Learning provides an accessible overview of the field of statistical learning,
an essential toolset for making sense of the vast and complex data sets that have emerged in fields ranging
from biology to finance to marketing to astrophysics in the past twenty years. This book presents some of
the most important modeling and prediction techniques, along with relevant applications. Topics include
linear regression, classification, resampling methods, shrinkage approaches, tree-based methods, support
vector machines, clustering, and more. Color graphics and real-world examples are used to illustrate the
methods presented. Since the goal of this textbook is to facilitate the use of these statistical learning
techniques by practitioners in science, industry, and other fields, each chapter contains a tutorial on
implementing the analyses and methods presented in R, an extremely popular open source statistical
software platform. An Introduction to Statistical Learning covers many of the same topics as the more
advanced The Elements of Statistical Learning, but at a level accessible to a much broader audience. This
book is targeted at statisticians and non-statisticians alike
who wish to use innovative statistical learning techniques to analyze their data. The text assumes only a
previous course in linear regression and no knowledge of matrix algebra.
Machine Learning with R: This book is intended for anybody hoping to use data for action. Perhaps you
already know a bit about machine learning, but have never used R; or perhaps you know a little about R,
but are new to machine learning. In any case, this book will get you up and running quickly. It would be
helpful to have a bit of familiarity with basic math and programming concepts, but no prior experience is
required. All you need is curiosity.
Machine learning, at its core, is concerned with the algorithms that transform information into actionable
intelligence. This fact makes machine learning well-suited to the present-day era of big data. Without
machine learning, it would be nearly impossible to keep up with the massive stream of information. Given
the growing prominence of R—a cross-platform, zero-cost statistical programming environment—there
has never been a better time to start using machine learning. R offers a powerful but easy-to-learn set of
tools that can assist you with finding data insights. By combining hands-on case studies with the essential
theory that you need to understand how things work under the hood, this book provides all the knowledge
that you will need to start applying machine learning to your own projects.
Marketing Analytics: Data-Driven Techniques with Microsoft Excel: This book helps tech-savvy marketers and data analysts
solve real-world business problems with Excel.
Using data-driven business analytics to understand customers and improve results is a great idea in
theory, but in today's busy offices, marketers and analysts need simple, low-cost ways to process and
make the most of all that data. This expert book offers the perfect solution. Written by data analysis expert
Wayne L. Winston, this practical resource shows you how to tap a simple and cost-effective tool, Microsoft
Excel, to solve specific business problems using powerful analytic techniques—and achieve optimum
results. Practical exercises in each chapter helped us apply and reinforce techniques as we learned. The
book:
• Shows how to perform sophisticated business analyses using the cost-effective and widely available
Microsoft Excel instead of expensive, proprietary analytical tools
• Reveals how to target and retain profitable customers and avoid high-risk customers
• Helps you forecast sales and improve response rates for marketing campaigns
• Explores how to optimize price points for products and services, optimize store layouts, and
improve online advertising
• Covers social media, viral marketing, and how to exploit both effectively.
Why Data Science?
Data Science is a field that can be applied anywhere. Here is a list of fields whose practitioners use data
science as a tool even though they are not from an IT background.
• Politics: We may have heard how statistical wizard Nate Silver predicted the electoral votes for
each state in the 2012 presidential election, showing that raw data crunching of polls is much
more reliable than traditional punditry.
• Healthcare: The role of big data in medicine is to let us build better health profiles and
better predictive models around individual patients so that we can better diagnose and treat
disease. Big data comes into play by aggregating increasing amounts of information across multiple
scales of what constitutes a disease—from DNA, proteins, and metabolites to cells, tissues,
organs, organisms, and ecosystems.
• Automotive Industry: Areas in the automotive industry impacted by Big Data include:
a. Conceptual Design: Real-world data collected from billions of miles driven will undoubtedly
influence safety, aerodynamics, power algorithms and other fundamental elements of the vehicle.
b. Drawing Boards: Efficiency gained in design, production volumes and manufacturing through
Big Data in the auto industry will make it economically feasible to make today’s options
tomorrow’s standard equipment.
c. Procurement: Supply chain management optimized by Big Data will help manufacturers
continue to wring new efficiency from the procurement process.
d. Manufacturing: On the assembly line, data gathered throughout the building process will be
used in predictive analytics to improve manufacturing simulations and watch machine
performance, making the next assembly line even more efficient and flexible.
• Marketing: Big Data is already having a major influence on vehicle marketing. Social sentiment
will play a growing role in manufacturers’ plans to design new vehicles. Customer feedback on
current models also helps marketing experts identify key themes and messages for new
campaigns.
• Finance: Understanding consumer habits, preferences and buying power across market segments
gives manufacturers insights needed to develop more-effective financing programs. But that’s just
the first step. New insights from Big Data analyses of sales and in-field use data will help captive
financing companies develop new services and new revenue streams.
• Services: Like performance, service will benefit as both a contributor and a user of Big Data in the
automotive industry. Information gathered through millions of service events will provide
feedback to designers.
Skill sets required for a Data Scientist
Technical Skills:
Python Coding – Python is the most common coding language required in data science roles,
along with Java, Perl, or C/C++.
Hadoop Platform – Although this isn’t always a requirement, it is heavily preferred in many cases. Having
experience with Hive or Pig is also a strong selling point. Familiarity with cloud tools such as Amazon S3
can also be beneficial.
SQL Database/Coding – Even though NoSQL and Hadoop have become a large component of data science,
it is still expected that a candidate will be able to write and execute complex queries in SQL.
Unstructured data – It is critical that a data scientist be able to work with unstructured data, whether it is
from social media, video feeds or audio.
Non-Technical Skills
Intellectual curiosity – No doubt we have seen this phrase everywhere lately, especially as it relates to
data scientists. Frank Lo describes what it means, and talks about other necessary “soft skills”, in a
guest blog post.
Business acumen – To be a data scientist, we need a solid understanding of the industry we are working
in, and we must know what business problems the company is trying to solve. In terms of data science, being able
to discern which problems are important to solve for the business is critical, in addition to identifying new
ways the business should be leveraging its data.
Communication skills – Companies searching for a strong data scientist are looking for someone who can
clearly and fluently translate their technical findings to a non-technical team, such as the Marketing or
Sales departments. A data scientist must enable the business to make decisions by arming them with
quantified insights, in addition to understanding the needs of their non-technical colleagues to wrangle
the data appropriately.
7 Steps to effective Predictive Modelling
Step 1: Defining the Objective
The first step in any modeling process is defining the objective: we identify the field in which the problem
falls. There are many such fields: Target Marketing, Risk & Fraud Management, Strategy Implementation and
Change Management, Operational Efficiency, Increasing Customer Experience, Managing Marketing
Campaigns, Forecasting Revenue or Loss, Workforce Management, Financial Modeling, Churn Management,
and Social Media Influencers.
Step 2: Gathering the Data
Accurate, actionable, accessible data is the lifeblood of any successful model. So we collect enough data
to make a predictive model on it.
Step 3: Preparing the Data for Modeling
The average modeler spends 70% of his or her time preparing data. In this step, we need to get the data
into the right format for the analysis and for the tool we plan to use:
1. Do initial cleaning up
2. Define Variables and Create Data Dictionary
3. Joining/Appending multiple datasets
4. Validate for correctness
5. Produce Basic Summary Reports
Step 4: Selecting and Transforming the Variables
Determining the best fit is essential to good model performance. The underlying structure of the
independent variables in relation to the dependent variable determines the power and longevity of a
model.
Special consideration is given to the fact that marketing data can have hundreds or even thousands of
variables. We apply methods for identifying the best candidate variables. Programs are introduced that
automatically segment and transform the most powerful variables, to ensure the best fit.
Step 5: Processing and Evaluating the Model
All the preparation work up to this point makes this next step run smoothly. Weights of Evidence and
Information Values are calculated. For our main case study, we used various options within PROC LOGISTIC
to determine the model with the best fit. Validation data are scored, tabulated, and compared using both
SAS® and MS Excel®.
Step 6: Validating the Model
Models should perform well on the development data. Moreover, if the hold-out sample is randomly selected,
the model should score the validation data with similar results. A true test of model
performance is how well it performs on data from a different time period or market area. So, we used three
powerful methods for ensuring model fit: 1) scoring alternate data is the best way to tell if our model will
perform in a real campaign; 2) bootstrapping uses simple resampling techniques to find confidence
intervals around our estimates; 3) key variable analysis calculates important market factors as they are
affected by the model, thus ensuring reasonable results.
Step 7: Implementing and Maintaining the Model
Effective implementation is a combination of business intelligence and well-designed procedures. So, we
score a new data set with the new model. Several auditing procedures are performed, and tracking and model
maintenance are emphasized as best practices.
Figure 1: 7 Steps of Predictive Modelling
Marketing Analysis
Figure 2: Facets of Marketing Analysis
An accurate customer risk assessment helps us acquire the most profitable consumers while
minimizing risk. For business-to-consumer companies, Experian offers consumer credit information,
advanced scoring software, prescreening systems, and application decisioning tools. For companies
looking to acquire business customers, its business reports and public records, portfolio data, and risk
modeling tools allow clients to create comprehensive profiles of business prospects and determine which
businesses are well-capitalized and financially suited for customer acquisition.
Fraud Detection
Fraud is a billion-dollar business, and it is increasing every year. The PwC Global Economic Crime Survey of
2016 found that more than one in three organizations (36%) experienced economic crime.
Traditional methods of data analysis have long been used to detect fraud. They require complex and time-
consuming investigations that span different domains of knowledge such as finance, economics,
business practices, and law.
To learn more about how Machine Learning algorithms solve the fraud detection problem, we took the
credit dataset from Machine Learning with R.
The idea behind our credit model is to identify factors that put an applicant at higher risk of default.
Therefore, we need to obtain data on many past bank loans and whether the loan went into default, as
well as information about the applicant.
The credit dataset includes 1,000 examples of loans, plus a combination of numeric and nominal features
indicating characteristics of the loan and the loan applicant. A class variable indicates whether the loan
went into default. We can see that “job”, “phone”, “checking_balance”, “credit_history”, “purpose”,
“savings_balance”, “employment_duration”, “other_credit”, and “housing” are categorical data, so in
Python we used OneHotEncoder() to convert the categorical data into 0s and 1s. After applying
OneHotEncoder() to all the categorical columns, we got 36 columns.
Figure 3 Conversion of categorical data into 0s and 1s
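As an illustrative sketch of this encoding step (the file name and the exact scikit-learn calls are assumptions; the code we actually ran may differ):

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder

    credit = pd.read_csv("credit.csv")  # hypothetical path to the credit dataset

    categorical_cols = ["job", "phone", "checking_balance", "credit_history",
                        "purpose", "savings_balance", "employment_duration",
                        "other_credit", "housing"]

    # One-hot encode the categorical columns; numeric columns pass through unchanged
    encoder = ColumnTransformer([("onehot", OneHotEncoder(), categorical_cols)],
                                remainder="passthrough")
    X = encoder.fit_transform(credit.drop("default", axis=1))
    y = credit["default"]  # class variable: did the loan go into default?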
We did the initial data exploration and plotted that using matplotlib library.
Figure 4 Exploratory Data Analysis
We used a decision tree to determine whether a person is a defaulter or not, depending on the features.
The core algorithm for building decision trees is called ID3. Decision tree classifiers use a greedy
approach, so an attribute chosen at an early step cannot be used again, even if it would give a better
classification at a later step. Also, a decision tree overfits the training data, which can give poor results for
unseen data. ID3 uses two concepts to determine which feature to split the dataset on.
Information Gain
The information gain is based on the decrease in entropy after a dataset is split on an attribute.
Constructing a decision tree is all about finding the attribute that returns the highest information gain (i.e.,
the most homogeneous branches).
Entropy
A decision tree is built top-down from a root node and involves partitioning the data into subsets that
contain instances with similar values (homogeneous). The ID3 algorithm uses entropy to calculate the
homogeneity of a sample. If the sample is completely homogeneous, the entropy is zero, and if the sample
is equally divided, it has an entropy of one.
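To make these two measures concrete, here is a small illustrative sketch (not taken from our project code) that computes entropy and the information gain of a candidate split:

    import numpy as np

    def entropy(labels):
        # Shannon entropy: H(S) = -sum(p_i * log2(p_i)) over the class proportions
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def information_gain(parent, subsets):
        # IG = H(parent) minus the size-weighted entropies of the child subsets
        n = len(parent)
        children = sum(len(s) / n * entropy(s) for s in subsets)
        return entropy(parent) - children

    # Toy labels: 1 = default, 0 = no default
    parent = np.array([1, 1, 0, 0, 0, 0, 1, 0])
    left, right = parent[:4], parent[4:]           # a candidate split of the dataset
    print(entropy(parent))                         # about 0.954 bits
    print(information_gain(parent, [left, right])) # gain achieved by this split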
After applying the Decision tree model, we got the following classification report.
Figure 5 F1 Score for Decision Tree
The F1 score is a measure of a test's accuracy. It is the harmonic mean of precision and recall,
and it reaches its best value at 1 (perfect precision and recall) and its worst at 0.
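A minimal sketch of how such a report is produced with scikit-learn, reusing the encoded X and y from the sketch above (the split ratio and seed are illustrative):

    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import classification_report

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=42)

    tree = DecisionTreeClassifier(random_state=42)
    tree.fit(X_train, y_train)

    # Per-class precision, recall, and F1, as in Figure 5
    print(classification_report(y_test, tree.predict(X_test)))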
A single decision tree yields a high-variance model, so to overcome this drawback we use bagging.
Bagging is a way to decrease the variance of our prediction by generating additional data for training from
our original dataset, using combinations with repetitions to produce multisets of the same cardinality/size
as our original data.
Random Forest is an ensemble classifier that uses many decision tree models to predict the result. A
different subset of the training data is selected, with replacement, to train each tree. A collection of trees is a
forest, and because the trees are trained on subsets selected at random, the method is called a random
forest. After applying the Random Forest classifier, we got the following result.
Figure 6 F1 Score for Random Forest
We can clearly see the increase in the F1-score.
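The corresponding Random Forest sketch, again with illustrative settings (100 trees, each trained on a bootstrap sample of the training data):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report

    forest = RandomForestClassifier(n_estimators=100, random_state=42)
    forest.fit(X_train, y_train)
    print(classification_report(y_test, forest.predict(X_test)))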
Now, the next step in building the model, as discussed earlier, is to fine-tune it. For this we use the Grid
Search Cross Validation technique. After applying GridSearchCV, we got the following classification
report.
Figure 7 F1 Score after GridSearchCV
From this model we understand that it will correctly predict whether a person will be a
defaulter or not 73% of the time.
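A sketch of this tuning step; the parameter grid below is hypothetical, not the grid we actually searched:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV
    from sklearn.metrics import classification_report

    param_grid = {"n_estimators": [100, 300, 500],
                  "max_depth": [None, 5, 10],
                  "min_samples_leaf": [1, 5, 10]}

    # 5-fold cross-validated grid search, selecting by macro-averaged F1
    search = GridSearchCV(RandomForestClassifier(random_state=42),
                          param_grid, cv=5, scoring="f1_macro")
    search.fit(X_train, y_train)
    print(search.best_params_)
    print(classification_report(y_test, search.predict(X_test)))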
Market Segmentation
One of the most fundamental marketing activities is market segmentation. As companies cannot
connect with all their potential customers, they must divide markets into groups (segments) of consumers,
customers, or clients with similar needs and wants. Firms can then target each of these segments by
positioning themselves in a unique segment (such as Ferrari in the high-end sports car market).
While market researchers often form market segments based on
practical grounds, industry practice, and wisdom, cluster analysis
allows segments to be formed based on data, making them less
dependent on subjectivity.
Cluster analysis is a convenient method for identifying homogeneous
groups of objects called clusters. Objects (or cases, observations) in a
specific cluster share many characteristics, but are very dissimilar to
objects not belonging to that cluster.
Below, we have tried this process from start to finish.
For this analysis, we used a dataset representing a random sample of 30,000 U.S. high school students
who had profiles on a well-known SNS in 2006. To protect the users' anonymity, the SNS will remain
unnamed. However, at the time the data was collected, the SNS was a popular web destination for US
teenagers. Therefore, it is reasonable to assume that the profiles represent a wide cross section of
American adolescents in 2006.
Let's take a quick look at the specifics of the data.
Figure 8 Description of the data set
Figure 9 Min-Max of the Age
Figure 10 Gender and Age anomaly
There is something strange around the gender row. On looking carefully, we noticed the NA values. We
see that 2,724 records (9 percent) have missing gender data.
Besides gender, only age has missing values. A total of 5,086 records (17 percent) have missing ages. Also
concerning is the fact that the minimum and maximum values seem to be unreasonable; it is unlikely that
a 3-year-old or a 106-year-old is attending high school. To ensure that these extreme values don't cause
problems for the analysis, we cleaned them up before moving on.
Figure 11 Box Plot for the age distribution
A more reasonable range of ages for high school students includes those who are at least 13 years old
and not yet 20 years old. Any age value falling outside this range was treated the same as missing data.
An easy solution for handling the missing values is to exclude any record with a missing value. In this case,
we instead created dummy variables for female and unknown gender. We assigned teens$female the value 1 if
gender is equal to F and gender is not NA; otherwise, we assigned the value 0.
Next, we addressed the 5,523 missing age values. Rather than eliminate these records, we used a strategy
known as data imputation, which involves filling in the missing data with a guess as to the true value. Most
people in a graduation cohort were born within a single calendar year, so once we had identified the typical
age for each cohort, we had a reasonable estimate of the age of a student in that graduation year.
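The book performs these steps in R (hence the teens$female notation). A rough Python equivalent with pandas, assuming columns named gender, age, and gradyear, offered as a sketch rather than our exact code:

    import numpy as np
    import pandas as pd

    teens = pd.read_csv("snsdata.csv")  # hypothetical path to the SNS dataset

    # Treat implausible ages (under 13, or 20 and over) as missing
    valid = (teens["age"] >= 13) & (teens["age"] < 20)
    teens["age"] = teens["age"].where(valid)

    # Dummy variables: 1 for female, 0 otherwise; flag unknown gender separately
    teens["female"] = np.where(teens["gender"] == "F", 1, 0)
    teens["no_gender"] = teens["gender"].isna().astype(int)

    # Impute missing ages with the average age of each graduation-year cohort
    teens["age"] = teens["age"].fillna(
        teens.groupby("gradyear")["age"].transform("mean"))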
To cluster the teenagers into marketing segments, we used an implementation of k-means clustering. We
started our cluster analysis by considering only the 36 features that represent the number of times various
interests appeared on the teen SNS profiles.
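A sketch of the clustering step with scikit-learn's KMeans; the column positions of the 36 interest features are assumptions, k = 5 matches the segments discussed below, and the z-score standardization mirrors the book's R workflow:

    from sklearn.cluster import KMeans

    # The 36 interest-count columns; their positions in the frame are assumed here
    interests = teens.iloc[:, 4:40]

    # Standardize to z-scores so high-frequency terms don't dominate the distances
    interests_z = (interests - interests.mean()) / interests.std()

    kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
    teens["cluster"] = kmeans.fit_predict(interests_z)

    # Mean interest level per cluster, used to interpret the segments
    print(teens.groupby("cluster")[interests.columns.tolist()].mean())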
Evaluating clustering results can be somewhat subjective. Ultimately, the success or failure of the model
hinges on whether the clusters are useful for their intended purpose. As the goal of this analysis was to
identify clusters of teenagers with similar interests for marketing purposes, we largely measured our
success in qualitative terms. For other clustering applications, more quantitative measures of success may
be needed. By examining whether the clusters fall above or below the mean level for each interest
category, we can notice patterns that distinguish the clusters from each other. Cluster 3 is substantially
above the mean interest level on all the sports. This suggests that this may be a group of Athletes per The
Breakfast Club stereotype.
Figure 12 Cluster segmentation
Cluster 0 includes the most mentions of "cheerleading," the word "hot," and is above the average level of
football interest. Hence, these are the so-called Princesses. Similarly, we tried to cluster the different
groups, and this is what we found:
Cluster 0 (N = 872), Princesses: cute, hair, shopping, clothes, dance
Cluster 1 (N = 21,308), Basket Cases: ???
Cluster 2 (N = 1,041), Criminals: drunk, deaths, drugs, die, music
Cluster 3 (N = 5,971), Athletes: basketball, soccer, football, volleyball
Cluster 4 (N = 808), Brains: band, marching, music, rock
We now focused our effort on turning these insights into action. We applied the clusters back onto the
full dataset.
We looked at the demographic characteristics of the clusters. The mean age does not vary much by
cluster, which is not too surprising as these teen identities are often determined before high school. On
the other hand, there are some substantial differences in the proportion of females by cluster. This is a
very interesting finding as we didn't use gender data to create the clusters, yet the clusters are still
predictive of gender. Given our success in predicting gender, we also suspected that the clusters are
predictive of the number of friends the users have. This hypothesis seems to be supported by the data.
Our findings support the popular adage that "birds of a feather flock together." By using machine learning
methods to cluster teenagers with others who have similar interests, we were able to develop a typology
of teen identities that was predictive of personal characteristics, such as gender and the number of
friends. These same methods can be applied to other contexts with similar results.
Advertising
Compared to all the marketing techniques, email marketing is the cheapest way of sending a marketing
message to millions of people. Being so cheap, it is the tool of choice for marketing teams with a small
budget trying to sell cheap products. Most of the time, such products do not deliver what they promise.
Unfortunately, with email marketing, we run the risk of being exposed to malware and fraudulent emails.
Worms and viruses often make use of email and spam techniques to propagate. Phishing emails and
Nigerian 419 scams are examples of fraudulent emails which try to harvest either our money or our
personal information including credit card details. So, while email marketing is the tool of choice for most
marketing teams, it does require stringent regulations to ensure that it does not get abused. Below we
tried to build a model which predicts whether a composed message is spam or not.
The dataset included the text of SMS messages along with a label indicating whether the message is
unwanted. Junk messages are labeled spam, while legitimate messages are labeled ham. Since Naive
Bayes has been used successfully for e-mail spam filtering, it seems likely that it could also be applied to
SMS spam. However, relative to e-mail spam, SMS spam poses additional challenges for automated filters.
SMS messages are often limited to 160 characters, reducing the amount of text that can be used to identify
whether a message is junk.
Figure 13 Description of the data set
The first step towards constructing our classifier involves processing the raw data for analysis. SMS
messages are strings of text composed of words, spaces, numbers, and punctuation. Handling this type of
complex data takes a lot of thought and effort. One needs to consider how to remove numbers and
punctuation; handle uninteresting words such as and, but, and or; and how to break apart sentences into
individual words.
Figure 14 Description of length of the Ham messages
Figure 15 Description of length of the Spam messages
Our first order of business was to standardize the messages to use only lowercase characters. To this end,
we used the tolower() function, which returns a lowercase version of text strings. Continuing with our cleanup
process, we also eliminated any punctuation from the text messages. Our next task was to remove filler
words such as to, and, but, and or from our SMS messages. These terms are known as stop words and are
typically removed prior to text mining. This is due to the fact that although they appear very frequently,
they do not provide much useful information for machine learning.
Another common standardization for text data involves reducing words to their root form in a process
called stemming. The stemming process takes words like learned, learning, and learns, and strips the suffix
to transform them into the base form, learn. Removing these words leaves behind the blank spaces that
previously separated the now-missing pieces, so the final step in our text cleanup process was to remove
the additional whitespace.
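The book carries out this cleanup in R with the tm package (tolower(), punctuation and stop-word removal, stemming, whitespace stripping). A rough Python equivalent using NLTK, offered purely as an illustration:

    import re
    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    nltk.download("stopwords", quiet=True)   # one-time fetch of the stop-word list
    stop_words = set(stopwords.words("english"))
    stemmer = PorterStemmer()

    def clean_sms(text):
        text = text.lower()                      # lowercase everything
        text = re.sub(r"[^a-z\s]", " ", text)    # strip punctuation and numbers
        words = [stemmer.stem(w) for w in text.split()
                 if w not in stop_words]         # drop stop words, then stem
        return " ".join(words)                   # joining also removes extra whitespace

    print(clean_sms("WINNER!! You have won a FREE prize, call now!"))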
A word cloud is a way to visually depict the frequency at which words appear in text data. The cloud is
composed of words scattered somewhat randomly around the figure. The resulting word clouds are
shown in the following diagram:
Figure 16 Spam Word cloud
Figure 17 Ham Word cloud
Now that the data are processed to our liking, the final step is to split the messages into individual
components through a process called vectorization. We took the corpus and created a data structure in
which rows indicate documents (SMS messages) and columns indicate terms (words). The final step in the
data preparation process was to transform the sparse matrix into a data structure that can be used to
train a Naive Bayes classifier. The sparse matrix included over 6,500 features; this is a feature for every
word that appears in at least one SMS message. It's unlikely that all of these are useful for classification. To
reduce the number of features, we eliminated any word that appears in fewer than five SMS messages, that
is, in less than about 0.1 percent of the records in the training data.
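A sketch of the vectorization and training steps, assuming the cleaned messages sit in a DataFrame sms with columns text and label (ham/spam); min_df=5 mirrors the five-message cutoff described above:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.metrics import classification_report

    X_train, X_test, y_train, y_test = train_test_split(
        sms["text"], sms["label"], test_size=0.25, random_state=42)

    # Rows = messages (documents), columns = terms; ignore words seen in < 5 messages
    vectorizer = CountVectorizer(min_df=5)
    X_train_dtm = vectorizer.fit_transform(X_train)
    X_test_dtm = vectorizer.transform(X_test)

    nb = MultinomialNB()
    nb.fit(X_train_dtm, y_train)
    print(classification_report(y_test, nb.predict(X_test_dtm)))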
Figure 18 Vectorization
To evaluate the SMS classifier, we need to test its predictions on unseen messages in the test data. The
process of evaluating machine learning algorithms is very similar to the process of evaluating students.
Since algorithms have varying strengths and weaknesses, tests should distinguish among the learners.
Figure 19 Classification report
A confusion matrix is a table that categorizes predictions according to whether they match the actual
value. One of the table's dimensions indicates the possible categories of predicted values, while the other
dimension indicates the same for actual values. Although we have only seen 2 x 2 confusion matrices so
far, a matrix can be created for models that predict any number of class values.
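Continuing the sketch above, the confusion matrix for the SMS classifier (the ham/spam label names are assumptions):

    from sklearn.metrics import confusion_matrix

    # Rows = actual classes, columns = predicted classes
    print(confusion_matrix(y_test, nb.predict(X_test_dtm), labels=["ham", "spam"]))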
Lessons Learned
Lesson 1: Marketing research is fun. We get to work with a wide variety of datasets, dive in and learn all
about the market they are operating in, and relay valuable insights back to stakeholders. We dig up everything
from why consumers make certain purchase decisions to what they're passionate about and what makes
them tick.
Lesson 2: Collaboration is key. While doing this project, we found that however tremendous innovators
might be individually, collaboration is very important.
Lesson 3: Check, re-check, and then check again. Projects move quickly, which means we don't have time
to go back and re-collect data or make corrections to a report. Questionnaires, surveys, and reports must
be checked, checked by a coworker, and checked again.
Next Steps
The next step would be to discover the other facets of Marketing Analysis, like "Upsell and Cross-Sell",
"Recommendation Systems", etc. We can use algorithms like Principal Component Analysis (PCA), QDA, and
LDA to reduce the number of features. Also, we can analyze time series data using the ARIMA
algorithm.

More Related Content

PDF
Data Structures and Algorithms Made Easy in Java ( PDFDrive ).pdf
PPTX
BIG DATA and USE CASES
PPTX
Next generation of data scientist
PPT
PPTX
Introduction to Data Science.pptx
PPTX
Data Mining : Concepts and Techniques
PPT
MACHINE LEARNING LIFE CYCLE
PPTX
Data Wrangling
Data Structures and Algorithms Made Easy in Java ( PDFDrive ).pdf
BIG DATA and USE CASES
Next generation of data scientist
Introduction to Data Science.pptx
Data Mining : Concepts and Techniques
MACHINE LEARNING LIFE CYCLE
Data Wrangling

Similar to Marketing Analytics using R/Python (20)

PPTX
Impact of Data Science
PDF
Unit-I.pdf Data Science unit 1 Introduction of data science
PDF
What are Big Data, Data Science, and Data Analytics
PPTX
Introduction to data science
PPTX
In-Depth Data Analytics
PDF
What is Data Science? Daniel D Gutierrez
PPTX
Data scientist What is inside it?
PDF
Marketing data
PDF
Credit card fraud detection using python machine learning
PPTX
data science and business analytics
PPTX
Mtech First_Year Data Analytics in Industry with power bI
PPTX
intro to data science Clustering and visualization of data science subfields ...
PDF
Continuous Improvement through Data Science From Products to Systems Beyond C...
PPTX
Bdml ecom
PDF
5_Data Analytics, Data Science and Machine Learning
PDF
365 Data Science
PPTX
Data Science.pptx NEW COURICUUMN IN DATA
PDF
Empowering Careers with Advanced Training in Digital Marketing
PDF
IPCS GLOBAL KANNUR DM INSTITUTION KNR.pdf
PDF
Ml in a day v 1.1
 
Impact of Data Science
Unit-I.pdf Data Science unit 1 Introduction of data science
What are Big Data, Data Science, and Data Analytics
Introduction to data science
In-Depth Data Analytics
What is Data Science? Daniel D Gutierrez
Data scientist What is inside it?
Marketing data
Credit card fraud detection using python machine learning
data science and business analytics
Mtech First_Year Data Analytics in Industry with power bI
intro to data science Clustering and visualization of data science subfields ...
Continuous Improvement through Data Science From Products to Systems Beyond C...
Bdml ecom
5_Data Analytics, Data Science and Machine Learning
365 Data Science
Data Science.pptx NEW COURICUUMN IN DATA
Empowering Careers with Advanced Training in Digital Marketing
IPCS GLOBAL KANNUR DM INSTITUTION KNR.pdf
Ml in a day v 1.1
 
Ad

Recently uploaded (20)

PDF
NewMind AI Weekly Chronicles - August'25 Week I
PPTX
sap open course for s4hana steps from ECC to s4
PDF
Encapsulation_ Review paper, used for researhc scholars
PDF
KodekX | Application Modernization Development
PDF
Network Security Unit 5.pdf for BCA BBA.
PDF
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
PDF
Agricultural_Statistics_at_a_Glance_2022_0.pdf
PDF
Building Integrated photovoltaic BIPV_UPV.pdf
PDF
Spectral efficient network and resource selection model in 5G networks
PDF
Diabetes mellitus diagnosis method based random forest with bat algorithm
PPTX
Spectroscopy.pptx food analysis technology
PPTX
Digital-Transformation-Roadmap-for-Companies.pptx
PPT
“AI and Expert System Decision Support & Business Intelligence Systems”
PDF
Advanced methodologies resolving dimensionality complications for autism neur...
PDF
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
PPTX
ACSFv1EN-58255 AWS Academy Cloud Security Foundations.pptx
PDF
Reach Out and Touch Someone: Haptics and Empathic Computing
PDF
cuic standard and advanced reporting.pdf
PDF
Chapter 3 Spatial Domain Image Processing.pdf
PDF
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
NewMind AI Weekly Chronicles - August'25 Week I
sap open course for s4hana steps from ECC to s4
Encapsulation_ Review paper, used for researhc scholars
KodekX | Application Modernization Development
Network Security Unit 5.pdf for BCA BBA.
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
Agricultural_Statistics_at_a_Glance_2022_0.pdf
Building Integrated photovoltaic BIPV_UPV.pdf
Spectral efficient network and resource selection model in 5G networks
Diabetes mellitus diagnosis method based random forest with bat algorithm
Spectroscopy.pptx food analysis technology
Digital-Transformation-Roadmap-for-Companies.pptx
“AI and Expert System Decision Support & Business Intelligence Systems”
Advanced methodologies resolving dimensionality complications for autism neur...
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
ACSFv1EN-58255 AWS Academy Cloud Security Foundations.pptx
Reach Out and Touch Someone: Haptics and Empathic Computing
cuic standard and advanced reporting.pdf
Chapter 3 Spatial Domain Image Processing.pdf
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
Ad

Marketing Analytics using R/Python

  • 1. Capstone Project - IS 6596 Project Supervisor: Dr. Rohit Aggarwal Project Contributors: Mayank Badjatya - u1085897 Sagar Singh - u1088202 MARKETING ANALYTICS USING R/PYTHON
  • 2. 1 Capstone Project – IS 6596 Contents Executive Summary.......................................................................................................................................2 Book Description...........................................................................................................................................3 Why Data Science?........................................................................................................................................5 Skill sets required for a Data Science............................................................................................................6 7 Steps to effective Predictive Modelling.....................................................................................................7 Marketing Analysis........................................................................................................................................9 Fraud Detection ......................................................................................................................................10 Market Segmentation.............................................................................................................................13 Advertising..............................................................................................................................................16 Lessons Learned..........................................................................................................................................19 Next Steps...................................................................................................................................................19
  • 3. 2 Capstone Project – IS 6596 Executive Summary The objective of this project is to discuss the importance of Machine Learning in different sectors and how does it solve the problems in the Marketing Analytics field. We have discussed Marketing Segmentation, Advertisement, and Fraud detection in our project. We used different Machine Learning algorithms and used R and Python library to predict and solve these problems. After making models and running test data on those models we got following results: • We trained a Decision tree and Random Forest classifier model which has 73% accuracy to predict whether a person will be a defaulter or not based on credit history, income, job type, dependents etc. • We segmented the Social networking profiles based on the likes and dislikes of a person using K- Means Clustering. • We made a predictive model on the messages a customer receives and determined whether a message will be a Spam or not a spam with an accuracy of 97%. We used Naïve Bayes classifier for this model. • We created several other models using different algorithms, but these are beyond the scope of this report.
  • 4. 3 Capstone Project – IS 6596 Book Description An Introduction to Statistical Learning provides an accessible overview of the field of statistical learning, an essential toolset for making sense of the vast and complex data sets that have emerged in fields ranging from biology to finance to marketing to astrophysics in the past twenty years. This book presents some of the most important modeling and prediction techniques, along with relevant applications. Topics include linear regression, classification, resampling methods, shrinkage approaches, tree-based methods, support vector machines, clustering, and more. Color graphics and real-world examples are used to illustrate the methods presented. Since the goal of this textbook is to facilitate the use of these statistical learning techniques by practitioners in science, industry, and other fields, each chapter contains a tutorial on implementing the analyses and methods presented in R, an extremely popular open source statistical software platform. An Introduction to Statistical Learning covers many of the same topics, but at a level accessible to a much broader audience. This book is targeted at statisticians and non-statisticians alike who wish to use innovative statistical learning techniques to analyze their data. The text assumes only a previous course in linear regression and no knowledge of matrix algebra. Machine Learning with R: This book is intended for anybody hoping to use data for action. Perhaps you already know a bit about machine learning, but have never used R; or perhaps you know a little about R, but are new to machine learning. In any case, this book will get you up and running quickly. It would be helpful to have a bit of familiarity with basic math and programming concepts, but no prior experience is required. All you need is curiosity. Machine learning, at its core, is concerned with the algorithms that transform information into actionable intelligence. This fact makes machine learning well-suited to the present-day era of big data. Without machine learning, it would be nearly impossible to keep up with the massive stream of information. Given the growing prominence of R—a cross-platform, zero-cost statistical programming environment—there has never been a better time to start using machine learning. R offers a powerful but easy-to-learn set of tools that can assist you with finding data insights. By combining hands-on case studies with the essential
  • 5. 4 Capstone Project – IS 6596 theory that you need to understand how things work under the hood, this book provides all the knowledge that you will need to start applying machine learning to your own projects. Marketing Analytics Data Driven Techniques: This book helps tech-savvy marketers and data analysts solve real-world business problems with Excel. Using data-driven business analytics to understand customers and improve results is a great idea in theory, but in today's busy offices, marketers and analysts need simple, low-cost ways to process and make the most of all that data. This expert book offers the perfect solution. Written by data analysis expert Wayne L. Winston, this practical resource shows you how to tap a simple and cost-effective tool, Microsoft Excel, to solve specific business problems using powerful analytic techniques—and achieve optimum results. Practical exercises in each chapter helped us to apply and reinforce techniques as you learn. Shows you how to perform sophisticated business analyses using the cost-effective and widely available Microsoft Excel instead of expensive, proprietary analytical tools • Reveals how to target and retain profitable customers and avoid high-risk customers • Helps you forecast sales and improve response rates for marketing campaigns • Explores how to optimize price points for products and services, optimize store layouts, and improve online advertising • Covers social media, viral marketing, and how to exploit both effectively.
  • 6. 5 Capstone Project – IS 6596 Why Data Science? Data Science is a field, which can be implemented anywhere. Here is the list of people who uses data science as a tool in their field and are not from IT background. • Politics: We may have heard how statistical wizard Nate Silver predicted the electoral votes for each state in the 2012 presidential election, showing that raw data crunching of polls is much more reliable than traditional punditry. • Healthcare: The role of big data in medicine is one where we can build better health profiles and better predictive models around individual patients so that we can better diagnose and treat disease. Big data comes into play around aggregating increasingly information around multiple scales for what constitutes a disease—from the DNA, proteins, and metabolites to cells, tissues, organs, organisms, and ecosystems. • Automotive Industry: Areas in the automotive industry impacted by Big Data include: a. Conceptual Design: Real-world data collected from billions of miles driven will undoubtedly influence safety, aerodynamics, power algorithms and other fundamental elements of the vehicle. b. Drawing Boards: Efficiency gained in design, production volumes and manufacturing through Big Data in the auto industry will make it economically feasible to make today’s options tomorrow’s standard equipment. c. Procurement: Supply chain management optimized by Big Data will help manufacturers continue to wring new efficiency from the procurement process. d. Manufacturing: On the assembly line, data gathered throughout the building process will be used in predictive analytics to improve manufacturing simulations and watch machine performance, making the next assembly line even more efficient and flexible. • Marketing: Big Data is already having a major influence on vehicle marketing. Social sentiment will play a growing role in manufacturers’ plans to design new vehicles. Customer feedback on current models also helps marketing experts identify key themes and messages for new campaigns. • Finance: Understanding consumer habits, preferences and buying power across market segments gives manufacturers insights needed to develop more-effective financing programs. But that’s just the first step. New insights from Big Data analyses of sales and in-field use data will help captive financing companies develop new services and new revenue streams. • Services: Like performance, service will benefit as both a contributor and a user of Big Data in the automotive industry. Information gathered through millions of service events will provide feedback to designers.
  • 7. 6 Capstone Project – IS 6596 Skill sets required for a Data Science Technical Skills: Python Coding – Python is the most common coding language I typically see required in data science roles, along with Java, Perl, or C/C++. Hadoop Platform – Although this isn’t always a requirement, it is heavily preferred in many cases. Having experience with Hive or Pig is also a strong selling point. Familiarity with cloud tools such as Amazon S3 can also be beneficial. SQL Database/Coding – Even though NoSQL and Hadoop have become a large component of data science, it is still expected that a candidate will be able to write and execute complex queries in SQL. Unstructured data – It is critical that a data scientist be able to work with unstructured data, whether it is from social media, video feeds or audio. Non-Technical Skills Intellectual curiosity – No doubt we have seen this phrase everywhere lately, especially as it relates to data scientists. Frank Lo describes what it means, and talks about other necessary “soft skills” in his guest blog posted a few months ago. Business acumen – To be a data scientist we’ll need a solid understanding of the industry we’re working in, and know what business problems your company is trying to solve. In terms of data science, being able to discern which problems are important to solve for the business is critical, in addition to identifying new ways the business should be leveraging its data. Communication skills – Companies searching for a strong data scientist are looking for someone who can clearly and fluently translate their technical findings to a non-technical team, such as the Marketing or Sales departments. A data scientist must enable the business to make decisions by arming them with quantified insights, in addition to understanding the needs of their non-technical colleagues to wrangle the data appropriately.
  • 8. 7 Capstone Project – IS 6596 7 Steps to effective Predictive Modelling Step 1: Defining the Objective The first step in any modeling process is defining the objective. We see in what field does the problem fall in. There are many fields like Target Marketing, Risk & Fraud Management, Strategy Implementation and Change Management, Operational Efficiency, Increase Customer Experience, Manage Marketing, Campaigns Forecast, Revenue or Loss, Workforce Management, Financial Modeling, Churn Management, and Social Media Influencers Step 2: Gathering the Data Accurate, actionable, accessible data is the lifeblood of any successful model. So we collect enough data to make a predictive model on it. Step 3: Preparing the Data for Modeling The average modeler spends 70% of his or her time preparing data. In this step we need to prepare data into right format for analysis and the tool we may want use. 1. Do initial cleaning up 2. Define Variables and Create Data Dictionary 3. Joining/Appending multiple datasets 4. Validate for correctness 5. Produce Basic Summary Reports Step 4: Selecting and Transforming the Variables Determining the best fit is essential to good model performance. The underlying structure of the independent variables in relation to the dependent variable, determines the power and longevity of a model. Special consideration is given to the fact that marketing data can have hundreds or even thousands of variables. We apply methods for identifying the best candidate variables. Programs are introduced that automatically segment and transform the most powerful variables, to ensure the best fit. Step 5: Processing and Evaluating the Model All the preparation works up to this point makes this next step run smoothly. Weights of Evidence and Information Values are calculated. For our main case study, we used various options within PROC LOGISTIC to determine the model with the best fit. Validation data are scored, tabulated, and compared using both SAS® & MSExcel®. Step 6: Validating the Model Models should perform well on the development data. Plus, if the hold-out sample is randomly selected, the model performance should score the validation data with similar results. A true test of model performance is how well it performs on data from a different time or market area. So, we used three powerful methods for ensuring model fit. 1) Scoring alternate data is the best way to tell if our model will
  • 9. 8 Capstone Project – IS 6596 perform in a real campaign; 2) Bootstrapping uses simple resampling techniques to find confidence intervals around our estimates; 3) Key Variable Analysis calculates important market factors as they are affected by the model, thus ensuring reasonable results. Step 7: Implementing and Maintaining the Model Effective implementation is a combination of business intelligence and well-designed procedures. So, we score a new data set with the new model. Several auditing procedures are done and tracking, and model maintenance are emphasized as best practices. Figure 1 7 Steps of Predictive Model
  • 10. 9 Capstone Project – IS 6596 Marketing Analysis Figure 2 : Facets of Marketing Analysis An accurate customer risk assessment will help us acquire the most profitable consumers while minimizing risk. For business-to-consumer companies, Experian offers consumer credit information, advanced scoring software, prescreening systems, and application decisioning tools. For companies looking to acquire business customers, our business reports and public records, portfolio data and risk modeling tools allow clients to create comprehensive profiles of business prospects. Determine which businesses are well-capitalized and financially suited for customer acquisition.
  • 11. 10 Capstone Project – IS 6596 Fraud Detection Fraud is a billion-dollar business and it is increasing every year. The PwC global economic crime survey of 2016 suggests that more than one in three (36%) of organizations experienced economic crime. Traditional methods of data analysis have long been used to detect fraud. They require complex and time- consuming investigations that deal with different domains of knowledge like financial, economics, business practices and law. To know more about how Machine Learning algorithms, solve Fraud detection problem we took a dataset from the “Machine Learning using R” credit data set. The idea behind our credit model is to identify factors that make an applicant at higher risk of default. Therefore, we need to obtain data on many past bank loans and whether the loan went into default, as well as information about the applicant. We can see that “job”, “phone”, “checking_balance”, “credit_history”, “purpose”,” savings_balance”, “employment_duration”, “other_credit”, “housing” are the categorical data so in Python we use onehotencoder() to convert the categorical data into 0s and 1s. After applying the onehotencoder() on all categorical dataset we got 36 columns. The credit dataset includes 1,000 examples of loans, plus a combination of numeric and nominal features indicating characteristics of the loan and the loan applicant. A class variable indicates whether the loan went into default. Figure 3 Conversion of categorical data into 0s and 1s
  • 12. 11 Capstone Project – IS 6596 We did the initial data exploration and plotted that using matplotlib library. Figure 4 Exploratory Data Analysis We used decision tree to determine whether a person is a defaulter or not depending on the features. The core algorithm for building decision trees called ID3. The Decision tree classifiers uses greedy approach hence an attribute chooses at first step can’t be used anymore which can give better classification if used in later steps. Also, it overfits the training data which can give poor results for unseen data. It uses two concepts to determine on which feature it needs to divide the dataset. Information Gain The information gain is based on the decrease in entropy after a dataset is split on an attribute. Constructing a decision tree is all about finding attribute that returns the highest information gain (i.e., the most homogeneous branches). Entropy A decision tree is built top-down from a root node and involves partitioning the data into subsets that contain instances with similar values (homogenous). ID3 algorithm uses entropy to calculate the homogeneity of a sample. If the sample is completely homogeneous the entropy is zero and if the sample is an equally divided it has entropy of one. After applying the Decision tree model, we got the following classification report.
  • 13. 12 Capstone Project – IS 6596 Figure 5 F1 Score for Decision Tree F1 score is a measure of a test's accuracy. The F1 score is the harmonic average of the precision and recall, where an F1 score reaches its best value at 1 (perfect precision and recall) and worst at 0. Decision tree makes a model which is biased so to overcome this drawback we use Bagging. Bagging is a way to decrease the variance of our prediction by generating additional data for training from our original dataset using combinations with repetitions to produce multisets of the same cardinality/size as our original data. Random Forests is an ensemble classifier which uses many decision tree models to predict the result. A different subset of training data is selected, with replacement to train each tree. A collection of trees is a forest, and the trees are being trained on subsets which are being selected at random, hence random forests. After applying Random Forest classifier, we got the following result. Figure 6 F1 Score for Random Forest We can clearly see the increase in the F1-score. Now the next step in building model as discussed earlier is to fine tune the model. For this we use Grid Search Cross Validation technique. After applying the GridSearchCV we got the following classification report. Figure 7 F1 Score after GridSearchCV From this model we understand that the model will predict 73% of the time whether a person will be a defaulter or not.
  • 14. 13 Capstone Project – IS 6596 Market Segmentation One of the most fundamental marketing activities is in market segmentation. As companies cannot connect with all their potential customers, they must divide markets into groups (segments) of consumers, customers, or clients with similar needs and wants. Firms can then target each of these segments by positioning themselves in a unique segment (such as Ferrari in the high-end sports car market). While market researchers often form market, segments based on practical grounds, industry practice and wisdom, cluster analysis allows segments to be formed that are based on data that are less dependent on subjectivity. Cluster analysis is a convenient method for identifying homogeneous groups of objects called clusters. Objects (or cases, observations) in a specific cluster share many characteristics, but are very dissimilar to objects not belonging to that cluster. Below we have tried try this process from start to finish. For this analysis, we used a dataset representing a random sample of 30,000 U.S. high school students who had profiles on a well-known SNS in 2006. To protect the users' anonymity, the SNS will remain unnamed. However, at the time the data was collected, the SNS was a popular web destination for US teenagers. Therefore, it is reasonable to assume that the profiles represent a wide cross section of American adolescents in 2006. Let's take a quick look at the specifics of the data. Figure 8 Description of the data set
Figure 9 Min-Max of the Age
Figure 10 Gender and Age anomaly

There is something strange in the gender row. On looking carefully, we noticed the NA values: 2,724 records (9 percent) have missing gender data. Besides gender, only age has missing values; a total of 5,086 records (17 percent) have missing ages. Also concerning is the fact that the minimum and maximum age values seem unreasonable: it is unlikely that a 3-year-old or a 106-year-old is attending high school. To ensure that these extreme values didn't cause problems for the analysis, we cleaned them up before moving on.

Figure 11 Box Plot for the age distribution

A more reasonable range of ages for high school students includes those who are at least 13 years old and not yet 20 years old; any age value falling outside this range we treated the same as missing data. An easy way to handle missing values is to exclude every record with a missing value, but that would discard a large share of the data. Instead, for gender, we created dummy variables for female and unknown gender: teens$female is assigned the value 1 if gender is equal to F and is not NA, and 0 otherwise. Rather than eliminating the 5,523 missing age values, we used a strategy known as data imputation, which involves filling in the missing data with a best guess as to the true value. Because most people in a graduation cohort were born within a single calendar year, once we identified the typical age for each cohort we had a reasonable estimate of the age of any student in that graduation year, as sketched below.
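Our own code for this step was in R (as the teens$female notation suggests); a rough pandas equivalent might look like the following. The file name snsdata.csv and the column names gender, age, and gradyear are assumptions.

```python
# Rough pandas equivalent of the cleanup and imputation steps above.
# File and column names are assumptions, not the report's actual code.
import pandas as pd

teens = pd.read_csv("snsdata.csv")  # assumed file name

# Treat ages outside the plausible high-school range [13, 20) as missing
teens["age"] = teens["age"].where(teens["age"].between(13, 20, inclusive="left"))

# Dummy-code gender instead of dropping records with missing values
teens["female"] = ((teens["gender"] == "F") & teens["gender"].notna()).astype(int)
teens["no_gender"] = teens["gender"].isna().astype(int)

# Impute missing ages with the typical (mean) age of each graduation cohort
cohort_age = teens.groupby("gradyear")["age"].transform("mean")
teens["age"] = teens["age"].fillna(cohort_age)
```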
To cluster the teenagers into marketing segments, we used an implementation of k-means clustering. We started our cluster analysis by considering only the 36 features that represent the number of times various interests appeared on the teens' SNS profiles. Evaluating clustering results can be somewhat subjective; ultimately, the success or failure of the model hinges on whether the clusters are useful for their intended purpose. As the goal of this analysis was to identify clusters of teenagers with similar interests for marketing purposes, we largely measured our success in qualitative terms. For other clustering applications, more quantitative measures of success may be needed.

By examining whether the clusters fall above or below the mean level for each interest category, we can notice patterns that distinguish the clusters from each other. Cluster 3 is substantially above the mean interest level on all the sports, suggesting that this may be a group of Athletes, per The Breakfast Club stereotype. Cluster 0 includes the most mentions of "cheerleading" and the word "hot," and is above the average level of football interest; hence, these are the so-called Princesses. Clustering the remaining groups in the same way, this is what we found:

• Cluster 0 (N = 872), Princesses: cute, hair, shopping, clothes, dance
• Cluster 1 (N = 21,308), Basket Cases: ???
• Cluster 2 (N = 1,041), Criminals: drunk, deaths, drugs, die, music
• Cluster 3 (N = 5,971), Athletes: basketball, soccer, football, volleyball
• Cluster 4 (N = 808), Brains: band, marching, music, rock

Figure 12 Cluster segmentation

We then focused our effort on turning these insights into action. We applied the cluster assignments back onto the full dataset and looked at the demographic characteristics of the clusters. The mean age does not vary much by cluster, which is not too surprising as these teen identities are often determined before high school. On the other hand, there are some substantial differences in the proportion of females by cluster. This is a very interesting finding: we didn't use gender data to create the clusters, yet the clusters are still predictive of gender.
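For reference, a minimal scikit-learn sketch of the k-means step above, continuing from the teens DataFrame in the previous sketch. The column positions of the 36 interest features and the z-score scaling are assumptions; k = 5 matches the five clusters reported.

```python
# Sketch of the k-means segmentation step. Column positions of the
# 36 interest-count features are assumed.
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

interests = teens.iloc[:, 4:40]                 # assumed positions of the 36 interest counts
z = StandardScaler().fit_transform(interests)   # z-score so no single interest dominates

km = KMeans(n_clusters=5, n_init=10, random_state=42)
teens["cluster"] = km.fit_predict(z)

# Compare each cluster's mean interest level against the overall mean
profile = teens.groupby("cluster")[list(interests.columns)].mean()
print(profile)
```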
Given our success in predicting gender, we also suspected that the clusters are predictive of the number of friends the users have, and this hypothesis is supported by the data. Our findings support the popular adage that "birds of a feather flock together": by using machine learning methods to cluster teenagers with others who have similar interests, we were able to develop a typology of teen identities that was predictive of personal characteristics, such as gender and the number of friends. These same methods can be applied to other contexts with similar results.

Advertising

Compared to all other marketing techniques, email marketing is the cheapest way of sending a marketing message to millions of people. Being so cheap, it is the tool of choice for marketing teams with a small budget trying to sell cheap products. Most of the time, such products do not deliver what they promise. Unfortunately, with email marketing we also run the risk of being exposed to malware and fraudulent emails: worms and viruses often use email and spam techniques to propagate, while phishing emails and Nigerian 419 scams try to harvest either our money or our personal information, including credit card details. So, while email marketing is the tool of choice for most marketing teams, it requires stringent regulation to ensure that it does not get abused.

Below we built a model that predicts whether a composed message is spam or not. The dataset included the text of SMS messages along with a label indicating whether the message is unwanted: junk messages are labeled spam, while legitimate messages are labeled ham. Since Naive Bayes has been used successfully for e-mail spam filtering, it seemed likely that it could also be applied to SMS spam. However, relative to e-mail spam, SMS spam poses additional challenges for automated filters: SMS messages are often limited to 160 characters, reducing the amount of text that can be used to identify whether a message is junk.

Figure 13 Description of the data set

The first step toward constructing our classifier involves processing the raw data for analysis. SMS messages are strings of text composed of words, spaces, numbers, and punctuation. Handling this type of complex data takes a lot of thought and effort: one needs to consider how to remove numbers and punctuation; how to handle uninteresting words such as and, but, and or; and how to break sentences apart into individual words.
Figure 14 Description of length of the Ham messages
Figure 15 Description of length of the Spam messages

Our first order of business was to standardize the messages to use only lowercase characters; for this we used the tolower() function, which returns a lowercase version of a text string. Continuing with our cleanup, we also eliminated all punctuation from the text messages. Our next task was to remove filler words such as to, and, but, and or from the SMS messages. These terms are known as stop words and are typically removed prior to text mining because, although they appear very frequently, they do not provide much useful information for machine learning. Another common standardization for text data is reducing words to their root form in a process called stemming: the stemming process takes words like learned, learning, and learns and strips the suffix to transform them into the base form, learn. These removals leave behind the blank spaces that previously separated the now-missing pieces, so the final step in our text cleanup was to strip the additional whitespace.
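The cleanup above used R-style functions such as tolower(); a rough Python equivalent using NLTK could look like this. The helper name clean_sms and the sample message are our own, and the NLTK stopword corpus must be downloaded once before use.

```python
# Rough Python equivalent of the text cleanup steps described above.
# Requires: nltk.download("stopwords") on first run.
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def clean_sms(text: str) -> str:
    text = text.lower()                       # standardize to lowercase
    text = re.sub(r"[^a-z\s]", " ", text)     # drop numbers and punctuation
    tokens = [stemmer.stem(w) for w in text.split() if w not in stop_words]
    return " ".join(tokens)                   # split() also collapses extra whitespace

print(clean_sms("WINNER!! You have won a prize. Call now!"))
```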
A word cloud is a way to visually depict the frequency at which words appear in text data; the cloud is composed of words scattered somewhat randomly around the figure. The resulting word clouds are shown in the following diagrams:

Figure 16 Spam Word cloud
Figure 17 Ham Word cloud

Now that the data were processed to our liking, the next step was to split the messages into individual components through a process called vectorization: we took the corpus and created a data structure in which rows represent documents (SMS messages) and columns represent terms (words). The final step in data preparation was to transform this sparse matrix into a structure that can be used to train a Naive Bayes classifier. The sparse matrix included over 6,500 features, one for every word that appears in at least one SMS message, and it is unlikely that all of these are useful for classification. To reduce the number of features, we eliminated any word that appears in fewer than five SMS messages, that is, in less than about 0.1 percent of the records in the training data.

Figure 18 Vectorization

To evaluate the SMS classifier, we tested its predictions on unseen messages in the test data. The process of evaluating machine learning algorithms is much like the process of evaluating students: since algorithms have varying strengths and weaknesses, tests should distinguish among the learners.

Figure 19 Classification report
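A hedged scikit-learn sketch of the vectorization and evaluation just described: the sms DataFrame with text and label columns is an assumption, and the clean_sms helper is reused from the previous sketch. The min_df=5 setting mirrors the "at least five messages" cutoff above.

```python
# Sketch of vectorization + Naive Bayes. Assumes an sms DataFrame with
# 'text' and 'label' columns, and the clean_sms helper from the prior sketch.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix

X_train, X_test, y_train, y_test = train_test_split(
    sms["text"].map(clean_sms), sms["label"], test_size=0.25, random_state=42
)

vec = CountVectorizer(min_df=5)            # drop words seen in fewer than 5 messages
X_train_dtm = vec.fit_transform(X_train)   # rows = messages, columns = terms
X_test_dtm = vec.transform(X_test)

nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)
pred = nb.predict(X_test_dtm)
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))
```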
A confusion matrix is a table that categorizes predictions according to whether they match the actual values. One of the table's dimensions indicates the possible categories of predicted values, while the other dimension indicates the same for actual values. Although we have only seen 2 x 2 confusion matrices so far, a matrix can be created for models that predict any number of class values.

Lessons Learned

Lesson 1: Marketing research is fun. We get to work with a wide variety of datasets, dive in and learn all about the market they're operating in, and relay valuable insights back to stakeholders. We dig up everything from why consumers make certain purchase decisions to what they're passionate about and what makes them tick.

Lesson 2: Collaboration is key. While doing this project we found that even tremendous individual innovators produce better work when they collaborate.

Lesson 3: Check, re-check, and then check again. Projects move quickly, which means we don't have time to go back, re-collect data, or make corrections to a report. Questionnaires, surveys, and reports must be checked, checked again by a coworker, and then checked once more.

Next Steps

The next step would be to explore other facets of marketing analysis, such as upsell and cross-sell or recommendation systems. We could use techniques like Principal Component Analysis (PCA) or LDA to reduce the number of features, try QDA as an alternative classifier, and analyze time-series data using ARIMA models.