Bayes’ theorem and logistic regression
 Bayes’ theorem gives the relationship between the probabilities of A and B, P(A) and P(B), and the conditional probabilities of A given B and B given A, P(A|B) and P(B|A). In its most common form (a numeric sketch follows this list):
P(A|B) = P(B|A) · P(A) / P(B)
 In the Bayesian interpretation, probability measures a degree of belief. Bayes’ theorem links the belief in a proposition before and after accounting for evidence.
 For proposition A and evidence B:
 P(A), the prior, is the initial degree of belief in A
 P(A|B), the posterior, is the degree of belief having accounted for B
 The quotient P(B|A)/P(B) represents the support B provides for A
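To make the update concrete, here is a minimal Python sketch of the theorem; the numbers (a condition with 2% prevalence, a test that is 90% sensitive with a 5% false-positive rate) are made up purely for illustration:

p_a = 0.02                # P(A), the prior (made-up prevalence)
p_b_given_a = 0.90        # P(B|A), probability of the evidence if A holds
p_b_given_not_a = 0.05    # P(B|not A), the false-positive rate (made-up)

# P(B): total probability of the evidence
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# P(A|B): the posterior, i.e. the prior rescaled by the support P(B|A)/P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 3))    # 0.269

Even after positive evidence, the posterior stays well below 1 because the prior P(A) is small; this is exactly the belief update the theorem formalizes.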
 The probability model for a classifier is a conditional model p(C|F1, …, Fn) over a dependent class variable C with a small number of outcomes or classes, conditioned on several feature variables F1 through Fn.
 Problem: a large number of features, or features that can take a large number of values, makes the probability tables infeasible.
 Using Bayes’ theorem:
p(C|F1, …, Fn) = p(C) · p(F1, …, Fn|C) / p(F1, …, Fn)
 In plain English:
posterior = (prior × likelihood) / evidence
 Since the denominator does not depend on C and the values of the features Fi are given, the denominator is effectively constant. The numerator is equivalent to the joint probability model p(C, F1, …, Fn).
 Using the chain rule, i.e., repeated application of the definition of conditional probability:
p(C, F1, …, Fn) = p(C) · p(F1|C) · p(F2|C, F1) · … · p(Fn|C, F1, …, Fn-1)
 Role of the naïve condition: assume that each feature Fi is conditionally independent of every other feature Fj, j ≠ i, given the category C.
 The joint model can then be represented as
p(C, F1, …, Fn) = p(C) · p(F1|C) · p(F2|C) · … · p(Fn|C)
 Under this model, the conditional distribution over the class variable is
p(C|F1, …, Fn) = (1/Z) · p(C) · p(F1|C) · … · p(Fn|C)
 where Z = p(F1, …, Fn) is a scaling factor that depends only on the features
 The naïve Bayes classifier combines this model with a decision rule.
 The most common rule is to pick the hypothesis that is most probable, known as the maximum a posteriori (MAP) rule (a runnable sketch follows this list):
classify(f1, …, fn) = argmax over c of p(C = c) · p(F1 = f1|C = c) · … · p(Fn = fn|C = c)
 The probability of a document d being in class c is computed as
P(c|d) ∝ P(c) · P(F1|c) · … · P(Fn|c)
 P(Fk|c) is the conditional probability of term Fk occurring in a document of class c; it measures how much evidence the term contributes that c is the correct class.
 P(c) is the prior probability of a document occurring in class c.
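A minimal sketch of a multinomial naïve Bayes text classifier built on this model: it picks the class maximizing P(c) · P(F1|c) · … · P(Fn|c), computed in log space. The tiny training corpus and the add-one (Laplace) smoothing of the term probabilities are assumptions made for illustration, not details given above:

import math
from collections import Counter, defaultdict

# toy training corpus, assumed for illustration
train = [("chinese beijing chinese", "cn"),
         ("chinese chinese shanghai", "cn"),
         ("chinese macao", "cn"),
         ("tokyo japan chinese", "jp")]

class_docs = Counter(c for _, c in train)      # document counts per class, for P(c)
term_counts = defaultdict(Counter)             # term counts per class, for P(Fk|c)
for text, c in train:
    term_counts[c].update(text.split())

vocab = {t for text, _ in train for t in text.split()}
n_docs = sum(class_docs.values())

def classify(text):
    # MAP decision rule: argmax over c of log P(c) + sum of log P(Fk|c)
    best_class, best_logp = None, -math.inf
    for c in class_docs:
        logp = math.log(class_docs[c] / n_docs)     # log prior
        total = sum(term_counts[c].values())
        for t in text.split():
            # add-one smoothing keeps unseen terms from zeroing the product
            logp += math.log((term_counts[c][t] + 1) / (total + len(vocab)))
        if logp > best_logp:
            best_class, best_logp = c, logp
    return best_class

print(classify("chinese chinese chinese tokyo japan"))   # -> cn

Working in log space avoids numerical underflow when many small per-term probabilities are multiplied together.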
 Logistic regression is a statistical classification model
 It predicts a binary response, i.e., the outcome of a categorical dependent variable with two levels, from one or more predictor variables
 Logistic regression measures the relationship between a categorical dependent variable and one or more independent variables
 Applications:
 medicine and the social sciences, e.g., the Trauma and Injury Severity Score (TRISS), used to predict mortality in injured patients
 predicting whether a patient has diabetes based on observed characteristics such as age, gender, and BMI
 predicting whether a person will vote for Congress or the BJP based on age, income, gender, race, and state of residence
 Classification
 Binomial (binary) logistic regression deals with variables in which the observed outcome has two possible types, e.g., dead or alive
 The outcome is coded as 0 or 1
 Straightforward interpretation
 Multinomial logistic regression deals with situations where there are three or more outcome categories
 Logistic regression is used for predicting binary outcomes rather than continuous ones
 It takes the natural logarithm of the odds, the logit transformation: logit(p) = ln(p / (1 − p)) (a fitting sketch follows this list)
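A minimal sketch of binary logistic regression fit by gradient descent on the log loss; the toy data, learning rate, and iteration count are assumptions for illustration:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                     # toy features, assumed
y = (X[:, 0] + 2 * X[:, 1] > 0).astype(float)     # binary outcome coded 0/1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# the model assumes the log-odds ln(p / (1 - p)) is linear in the features:
# logit(p) = w . x + b, equivalently p = sigmoid(w . x + b)
w, b = np.zeros(2), 0.0
for _ in range(1000):                             # gradient descent on log loss
    p = sigmoid(X @ w + b)                        # predicted P(y = 1 | x)
    w -= 0.5 * (X.T @ (p - y)) / len(y)
    b -= 0.5 * np.mean(p - y)

p = sigmoid(X @ w + b)
print(f"training accuracy: {np.mean((p >= 0.5) == (y == 1)):.2f}")

For three or more outcome categories (the multinomial case), the sigmoid is replaced by a softmax over one weight vector per class.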
 Feature selection selects a subset of the terms occurring in the training set and uses only this subset as features in text classification
 It serves two main purposes:
 First, it makes training and applying the classifier more efficient by decreasing the size of the effective vocabulary
 Second, it often increases accuracy by eliminating noise features
 A noise feature is one which, when added to the document representation, increases the classification error on new data.
 Feature selection replaces the complex classifier (using all features) with a simpler one (using a subset of the features)
 Mutual information measures how much information the presence or absence of a term contributes to making the correct classification decision for class c (a sketch follows this list)
 χ² feature selection tests the independence of two events
 Frequency-based feature selection selects the terms that are most common in the class
 frequency can be defined either as document frequency – the number of documents in class c that contain the term t
 or as collection frequency – the number of tokens of t that occur in documents in c
 document frequency -> Bernoulli model
 collection frequency -> multinomial model
 Feature selection for multiple classifiers selects a single set of features instead of a different set for each classifier
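A minimal sketch of mutual-information feature selection from a 2×2 table of per-term document counts; the counts below are hypothetical, chosen for illustration, and a document-frequency alternative (the Bernoulli-model variant above) is shown at the end:

import math

def mutual_information(n11, n10, n01, n00):
    # n11: docs in class c containing term t    n10: docs outside c containing t
    # n01: docs in c not containing t           n00: docs outside c not containing t
    n = n11 + n10 + n01 + n00
    mi = 0.0
    for n_cell, n_term, n_class in [(n11, n11 + n10, n11 + n01),
                                    (n10, n11 + n10, n10 + n00),
                                    (n01, n01 + n00, n11 + n01),
                                    (n00, n01 + n00, n10 + n00)]:
        if n_cell > 0:   # empty cells contribute nothing
            mi += (n_cell / n) * math.log2(n * n_cell / (n_term * n_class))
    return mi

# hypothetical counts (n11, n10, n01, n00) for two terms and one class
counts = {"export": (49, 27652, 141, 774106),   # concentrated in the class
          "the":    (189, 799978, 1, 1780)}     # in nearly every doc, near-zero MI

ranked = sorted(counts, key=lambda t: mutual_information(*counts[t]), reverse=True)
print(ranked[:1])    # -> ['export']

# frequency-based alternative: rank by document frequency within the class (n11)
ranked_by_df = sorted(counts, key=lambda t: counts[t][0], reverse=True)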
Editor's Notes
  • The maximum a posteriori (MAP) probability estimate is the mode of the posterior distribution. The posterior probability of a random event or an uncertain proposition is the conditional probability assigned after the relevant evidence is taken into account; ‘a posteriori’ means taking into account the evidence relevant to the particular case being examined.