Business Analytics Using ‘R’
Types of Analytics
Descriptive and Diagnostic Analytics
Function:
describe the main features of organizational data
Common tools:
sampling, mean, mode, median, standard deviation, range, variance, stem-and-leaf diagram, histogram, interquartile range, quartiles, and frequency distributions
Displaying results:
graphics/charts, tables, and summary statistics such as single numbers
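All of these tools are one-liners in base R; a minimal sketch on a hypothetical sales vector:

```r
# Descriptive statistics in base R (the data are made up for illustration)
sales <- c(12, 15, 9, 22, 18, 15, 30, 11)   # hypothetical monthly sales

mean(sales)       # arithmetic mean
median(sales)     # median
sd(sales)         # standard deviation
var(sales)        # variance
range(sales)      # minimum and maximum
quantile(sales)   # quartiles (0%, 25%, 50%, 75%, 100%)
IQR(sales)        # interquartile range
table(sales)      # frequency distribution (base R has no statistical mode();
                  # the most frequent value can be read off this table)
stem(sales)       # stem-and-leaf diagram
hist(sales)       # histogram
```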
Predictive Analytics
Function:
draw conclusions and predict future behavior
Common tools:
cluster analysis, association analysis, multiple regression, logistic regression, decision tree methods, neural networks, text mining, and forecasting tools (such as time series and causal relationships)
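As one illustration, multiple regression is available out of the box via base R's lm(); the data frame below is made up for the sketch:

```r
# Multiple regression sketch with base R's lm(); the data are hypothetical
df <- data.frame(
  revenue  = c(10, 14, 13, 19, 21, 24),
  ad_spend = c(1, 2, 2, 3, 4, 5),
  price    = c(9, 9, 8, 8, 7, 7)
)

fit <- lm(revenue ~ ad_spend + price, data = df)
summary(fit)   # coefficients, R-squared, etc.

# Forecast revenue for a new setting of the predictors
predict(fit, newdata = data.frame(ad_spend = 6, price = 6))
```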
Prescriptive Analytics
Function:
make decisions based on data
Common models:
linear programming
sensitivity analysis
integer programming
goal programming
nonlinear programming
simulation modeling
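A linear-programming sketch using the lpSolve package (an assumption: the package is installed via install.packages("lpSolve")); the product-mix numbers are hypothetical:

```r
# Prescriptive sketch: maximize profit 25*x1 + 20*x2 subject to
# two resource constraints (all numbers hypothetical)
library(lpSolve)

obj <- c(25, 20)                 # profit per unit of products 1 and 2
con <- matrix(c(20, 12,          # raw material used per unit
                 4,  4),         # machine minutes used per unit
              nrow = 2, byrow = TRUE)
dir <- c("<=", "<=")
rhs <- c(1800, 480)              # available raw material and minutes

sol <- lp("max", obj, con, dir, rhs)
sol$solution   # optimal production quantities
sol$objval     # optimal profit
```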
Predictive Analytics
Major Techniques
Classification
Clustering
Association Mining
Regression
Classification
Classification Defined
The classification problem may be formalized using a posteriori probabilities:
P(C|X) = the probability that the sample tuple X = <x1, …, xk> belongs to class C.
E.g., P(class = N | outlook = sunny, windy = true, …)
Idea: assign to sample X the class label C for which P(C|X) is maximal.
Classification
Classification:
predicts categorical class labels
classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses that model to classify new data
Typical Applications
credit approval
target marketing
medical diagnosis
treatment effectiveness analysis
Classification—A Two-Step Process
Model construction: describing a set of predetermined classes
Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
The set of tuples used for model construction is the training set
The model is represented as classification rules, decision trees, or mathematical formulae
Model usage: classifying future or unknown objects
Estimate the accuracy of the model
The known label of each test sample is compared with the model's prediction
The accuracy rate is the percentage of test set samples correctly classified by the model
The test set must be independent of the training set; otherwise the accuracy estimate is optimistic (over-fitting)
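The two steps map directly onto R code: fit a model on a training set, then score an independent test set. A minimal sketch with the rpart package and the built-in iris data:

```r
# Step 1: model construction on a training set;
# Step 2: model usage and accuracy estimation on an independent test set
library(rpart)

set.seed(42)
idx   <- sample(nrow(iris), 0.7 * nrow(iris))  # 70/30 train/test split
train <- iris[idx, ]
test  <- iris[-idx, ]                          # held out, never seen in training

model <- rpart(Species ~ ., data = train, method = "class")

pred <- predict(model, test, type = "class")
mean(pred == test$Species)                     # accuracy rate on the test set
```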
Classification Process (1): Model Construction
Training Data:
NAME RANK YEARS TENURED
Mike Assistant Prof 3 no
Mary Assistant Prof 7 yes
Bill Professor 2 yes
Jim Associate Prof 7 yes
Dave Assistant Prof 6 no
Anne Associate Prof 3 no
A classification algorithm takes the training data and produces a classifier (model), here:
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
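The induced rule can be written as an ordinary R predicate (a hypothetical rendering for illustration, not the output of any particular algorithm):

```r
# The classifier (model) expressed as an R function
tenured <- function(rank, years) rank == "Professor" | years > 6

tenured("Assistant Prof", 7)   # TRUE  (Mary)
tenured("Assistant Prof", 6)   # FALSE (Dave: 6 is not > 6)
```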
Classification Process (2): Use the Model in Prediction
The classifier is first run against test data to estimate accuracy, then applied to unseen data.
Testing Data:
NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
Unseen Data: (Jeff, Professor, 4)
Tenured? The rule predicts ‘yes’, since rank = ‘professor’.
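Scoring the unseen tuple with the same hypothetical rule function from the previous sketch:

```r
tenured("Professor", 4)   # TRUE -> predict tenured = 'yes' for Jeff
```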
Issues (1): Data Preparation
Data cleaning
Preprocess data in order to reduce noise and handle
missing values
Relevance analysis (feature selection)
Remove the irrelevant or redundant attributes
Data transformation
Generalize and/or normalize data
Issues (2): Evaluating Classification Methods
Measurement
Accuracy (AC): the proportion of all predictions that are correct
AC = (TP + TN) / (TP + TN + FP + FN)
Sensitivity, or true positive (TP) rate: the proportion of positive cases that are correctly identified
Sensitivity = TP / (TP + FN)
Specificity, or true negative (TN) rate: the proportion of actual negative cases that are correctly identified
Specificity = TN / (TN + FP)
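All three measures fall out of a 2x2 confusion matrix; a base-R sketch with made-up labels:

```r
# Accuracy, sensitivity, specificity from a confusion matrix (hypothetical labels)
actual    <- factor(c(1, 1, 0, 0, 1, 0, 1, 0), levels = c(0, 1))
predicted <- factor(c(1, 0, 0, 0, 1, 1, 1, 0), levels = c(0, 1))

cm <- table(predicted, actual)      # rows = predicted, columns = actual
TP <- cm["1", "1"]; TN <- cm["0", "0"]
FP <- cm["1", "0"]; FN <- cm["0", "1"]

(TP + TN) / sum(cm)   # accuracy: 0.75
TP / (TP + FN)        # sensitivity (true positive rate): 0.75
TN / (TN + FP)        # specificity (true negative rate): 0.75
```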
Measurement Contd…
Decision Tree Induction
Classification by Decision Tree Induction
Decision tree
A flow-chart-like tree structure
Internal nodes denote tests on attributes
Branches represent outcomes of the tests
Leaf nodes represent class labels or class distributions
Decision tree generation consists of two phases
Tree construction: at the start, all training examples are at the root; the examples are then partitioned recursively based on selected attributes
Tree pruning: identify and remove branches that reflect noise or outliers
Use of the decision tree: classify an unknown sample by testing its attribute values against the decision tree
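Both phases are visible in rpart: rpart() grows the tree, and prune() cuts it back using the cross-validated complexity table. A sketch on the built-in iris data:

```r
# Phase 1: tree construction; Phase 2: pruning to the lowest
# cross-validated error
library(rpart)

fit <- rpart(Species ~ ., data = iris, method = "class")
printcp(fit)   # complexity-parameter table with cross-validated error (xerror)

best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)
pruned
```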
Training Dataset
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
This follows an example from Quinlan’s ID3.
Output: A Decision Tree for “buys_computer”

age?
├─ <=30 → student?
│    ├─ no → no
│    └─ yes → yes
├─ 31…40 → yes
└─ >40 → credit rating?
     ├─ excellent → no
     └─ fair → yes
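The same example can be reproduced in R by typing in the 14 training tuples and fitting an rpart tree, with the splitting thresholds relaxed so the tiny dataset can split; a sketch:

```r
# Fitting a decision tree over the 14 buys_computer tuples from the slide
library(rpart)

buys <- data.frame(
  age           = c("<=30","<=30","31..40",">40",">40",">40","31..40",
                    "<=30","<=30",">40","<=30","31..40","31..40",">40"),
  income        = c("high","high","high","medium","low","low","low",
                    "medium","low","medium","medium","medium","high","medium"),
  student       = c("no","no","no","no","yes","yes","yes",
                    "no","yes","yes","yes","no","yes","no"),
  credit_rating = c("fair","excellent","fair","fair","fair","excellent",
                    "excellent","fair","fair","fair","excellent","excellent",
                    "fair","excellent"),
  buys_computer = c("no","no","yes","yes","yes","no","yes",
                    "no","yes","yes","yes","yes","yes","no"),
  stringsAsFactors = TRUE          # rpart expects factors for categorical splits
)

# minsplit/cp relaxed; the defaults would refuse to split only 14 rows
fit <- rpart(buys_computer ~ ., data = buys, method = "class",
             control = rpart.control(minsplit = 2, cp = 0))
fit
```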
Extracting Classification Rules from Trees
Represent the knowledge in the form of IF-THEN rules
One rule is created for each path from the root to a leaf
Each attribute-value pair along a path forms a conjunction
The leaf node holds the class prediction
Rules are easier for humans to understand
Example
IF age = “<=30” AND student = “no” THEN buys_computer = “no”
IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
IF age = “31…40” THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “no”
IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “yes”
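If the rpart.plot package is available (an assumption: install.packages("rpart.plot")), the path-per-leaf rules can be printed straight from a fitted tree, e.g. the fit object from the previous sketch:

```r
library(rpart.plot)
rpart.rules(fit)   # one IF-THEN style rule per root-to-leaf path
```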
Bayesian Classification
Bayesian Theorem
Given training data D, the posterior probability of a hypothesis h, P(h|D), follows from Bayes’ theorem:

P(h|D) = P(D|h) · P(h) / P(D)

The MAP (maximum a posteriori) hypothesis maximizes this posterior; since P(D) is the same for every hypothesis, it drops out of the argmax:

h_MAP = argmax_{h ∈ H} P(h|D) = argmax_{h ∈ H} P(D|h) · P(h)

Practical difficulty: requires initial knowledge of many probabilities, and carries significant computational cost
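A small worked example of the theorem, with hypothetical numbers, just to make the arithmetic concrete:

```r
# Bayes' theorem with hypothetical numbers:
# P(h) = 0.01, P(D|h) = 0.9, P(D|not h) = 0.1
p_h   <- 0.01
p_D_h <- 0.9
p_D   <- p_D_h * p_h + 0.1 * (1 - p_h)  # law of total probability: 0.108
p_D_h * p_h / p_D                       # posterior P(h|D) ~ 0.083
```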
Bayesian Classification
The classification problem may be formalized using a posteriori probabilities:
P(C|X) = the probability that the sample tuple X = <x1, …, xk> belongs to class C.
Idea: assign to sample X the class label C for which P(C|X) is maximal.
Naïve Bayesian Classification
Naïve assumption: attribute independence
P(x1, …, xk|C) = P(x1|C) · … · P(xk|C)
If the i-th attribute is categorical:
P(xi|C) is estimated as the relative frequency of samples having value xi for the i-th attribute within class C
If the i-th attribute is continuous:
P(xi|C) is estimated via a Gaussian density function
Computationally easy in both cases
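A ready-made implementation lives in the e1071 package (an assumption: install.packages("e1071")); it estimates exactly these per-class relative frequencies for categorical attributes and Gaussian densities for continuous ones:

```r
# Naive Bayes sketch with e1071 on the built-in iris data
library(e1071)

nb <- naiveBayes(Species ~ ., data = iris)
predict(nb, iris[c(1, 51, 101), ])                 # predicted class labels
predict(nb, iris[c(1, 51, 101), ], type = "raw")   # posterior probabilities
```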
Play-tennis example
Outlook Temperature Humidity Windy Class
sunny hot high false N
sunny hot high true N
overcast hot high false P
rain mild high false P
rain cool normal false P
rain cool normal true N
overcast cool normal true P
sunny mild high false N
sunny cool normal false P
rain mild normal false P
sunny mild normal true P
overcast mild high true P
overcast hot normal false P
rain mild high true N
outlook
P(sunny|p) = 2/9 P(sunny|n) = 3/5
P(overcast|p) = 4/9 P(overcast|n) = 0
P(rain|p) = 3/9 P(rain|n) = 2/5
temperature
P(hot|p) = 2/9 P(hot|n) = 2/5
P(mild|p) = 4/9 P(mild|n) = 2/5
P(cool|p) = 3/9 P(cool|n) = 1/5
humidity
P(high|p) = 3/9 P(high|n) = 4/5
P(normal|p) = 6/9 P(normal|n) = 1/5
windy
P(true|p) = 3/9 P(true|n) = 3/5
P(false|p) = 6/9 P(false|n) = 2/5
P(p) = 9/14
P(n) = 5/14
Play-tennis example: classifying X
An unseen sample X = <rain, hot, high, false>
P(X|p) · P(p) = P(rain|p) · P(hot|p) · P(high|p) · P(false|p) · P(p) = 3/9 · 2/9 · 3/9 · 6/9 · 9/14 = 0.010582
P(X|n) · P(n) = P(rain|n) · P(hot|n) · P(high|n) · P(false|n) · P(n) = 2/5 · 2/5 · 4/5 · 2/5 · 5/14 = 0.018286
Since 0.018286 > 0.010582, sample X is classified as class n (don’t play)
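The arithmetic checks out in R:

```r
# Verifying the two unnormalized posteriors by hand
(3/9) * (2/9) * (3/9) * (6/9) * (9/14)   # P(X|p) * P(p) = 0.010582
(2/5) * (2/5) * (4/5) * (2/5) * (5/14)   # P(X|n) * P(n) = 0.018286
```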
Other Classification Methods
k-nearest neighbor classifier
case-based reasoning
genetic algorithms
rough set approach
fuzzy set approaches
support vector machines (SVM)
logistic regression
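Two of these take only a few lines in R: k-nearest neighbor via the class package (shipped with R's recommended packages) and logistic regression via base glm(). A sketch on iris:

```r
# k-nearest neighbor classifier (class package)
library(class)
set.seed(1)
idx  <- sample(nrow(iris), 100)
pred <- knn(train = iris[idx, 1:4], test = iris[-idx, 1:4],
            cl = iris$Species[idx], k = 3)
mean(pred == iris$Species[-idx])   # k-NN accuracy on the held-out rows

# Logistic regression for a binary outcome (base R)
bin <- droplevels(subset(iris, Species != "setosa"))
glm(Species ~ Sepal.Length + Sepal.Width, data = bin, family = binomial)
```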