Machine Learning with
Python
Conditional Probability
In the case of dependent events, the probability of an event B, given that A has happened, is known
as Conditional Probability or Posterior Probability and is denoted as:
P(B|A)
P(A AND B) = P(A) * P(B|A)
or, equivalently,
P(B|A) = P(A AND B) / P(A)
Conditional Probability
Probability that a randomly selected person uses an iPhone:
P(iPhone) = 5/10 = 0.5
What is the probability that a randomly selected person uses an iPhone given that the
person uses a Mac laptop?
There are 4 people who use both a Mac and an iPhone, so P(iPhone AND Mac) = 4/10 = 0.4,
and the probability of a random person using a Mac is P(Mac) = 6/10 = 0.6.
So the probability that a person uses an iPhone given that the person uses a
Mac is
P(iPhone|Mac) = 0.4/0.6 ≈ 0.667
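The same calculation can be written as a short Python sketch, using the counts from the example above:

# A minimal sketch: estimating P(iPhone | Mac) from the counts in the example
# (10 people, 6 Mac users, 4 who use both a Mac and an iPhone).
total = 10
mac_users = 6
both = 4

p_mac = mac_users / total            # P(Mac) = 0.6
p_iphone_and_mac = both / total      # P(iPhone AND Mac) = 0.4
p_iphone_given_mac = p_iphone_and_mac / p_mac

print(round(p_iphone_given_mac, 3))  # 0.667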
Bayes Theorem
P(h|D) = [P(D|h) * P(h)] / P(D)
P(h) = Prior Probability of the hypothesis h
P(D|h) = Likelihood of the data D given h
P(h|D) = Posterior Probability of h given D
Bayes Theorem
P(D) = P(D|h)*P(h) + P(D|~h)*P(~h)
0.8% of the people in the U.S. have diabetes. There is a simple blood test we can do
that will help us determine whether someone has it. The test is a binary one—it
comes back either POS or NEG. When the disease is present the test returns a correct
POS result 98% of the time; it returns a correct NEG result 97% of the time in cases
when the disease is not present.
Suppose a patient takes the test for diabetes and the result comes back positive.
Which is more likely: that the patient has diabetes, or that the patient does not have diabetes?
Bayes Theorem
P(disease) = 0.008
P(~disease) = 0.992
P(POS|disease) = 0.98
P(NEG|disease) = 0.02
P(NEG|~disease)=0.97
P(POS|~disease) = 0.03
P(disease|POS) = ??
As per Bayes Theorem:
P(disease|POS) = [P(POS|disease) * P(disease)] / P(POS)
P(POS) = [P(POS|disease) * P(disease)] + [P(POS|~disease) * P(~disease)]
P(disease|POS) = 0.98*0.008 / (0.98*0.008 + 0.03*0.992) ≈ 0.21
P(~disease|POS) = 0.03*0.992 / (0.98*0.008 + 0.03*0.992) ≈ 0.79
So even with a positive test, the person has only about a 21% chance of actually having the disease.
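A minimal Python sketch of the same calculation:

# Computing P(disease | POS) with Bayes theorem, using the numbers above.
p_disease = 0.008
p_no_disease = 0.992
p_pos_given_disease = 0.98
p_pos_given_no_disease = 0.03

# Total probability of a positive test result
p_pos = p_pos_given_disease * p_disease + p_pos_given_no_disease * p_no_disease

p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 2))      # ~0.21
print(round(1 - p_disease_given_pos, 2))  # ~0.79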
Bayes Theorem as Classifier
Using the naïve Bayes method, which model would you recommend to a person whose main
interest is health, whose current exercise level is moderate, who is moderately motivated, and
who is comfortable with technological devices?
Bayes Theorem as Classifier
We will have to calculate:
P(i100 | health, moderateExercise, moderateMotivation, techComfortable)
AND
P(i500 | health, moderateExercise, moderateMotivation, techComfortable)
We will pick the model with the highest probability
P(i100 | health, moderateExercise, moderateMotivation, techComfortable) is proportional to the product:
P(health|i100) * P(moderateExercise|i100) * P(moderateMotivated|i100) * P(techComfortable|i100) * P(i100)
(the denominator P(evidence) is the same for both models, so it can be ignored when comparing)
P(health|i100) = 1/6
P(moderateExercise|i100) = 1/6
P(moderateMotivated|i100) = 5/6
P(techComfortable|i100) = 2/6
P(i100) = 6 / 15
So P(i100 | evidence) ∝ .167 * .167 * .833 * .333 * .4 ≈ .00309
Bayes Theorem as Classifier
Now we compute
P(i500 | health, moderateExercise, moderateMotivation, techComfortable)
P(health|i500) = 4/9
P(moderateExercise|i500) = 3/9
P(moderateMotivated|i500) = 3/9
P(techComfortable|i500) = 6/9
P(i500) = 9 / 15
P(i500 | evidence) ∝ .444 * .333 * .333 * .667 * .6 ≈ .01975
Since .01975 > .00309, the output will be i500.
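The same comparison can be written as a short Python sketch. The priors and conditional probabilities are the hand counts from the slides; the underlying training table itself is not reproduced here:

# Comparing the two models by their unnormalized naive Bayes scores.
from functools import reduce

models = {
    "i100": {"prior": 6 / 15,
             "likelihoods": [1 / 6, 1 / 6, 5 / 6, 2 / 6]},  # health, modExercise, modMotivated, techComfortable
    "i500": {"prior": 9 / 15,
             "likelihoods": [4 / 9, 3 / 9, 3 / 9, 6 / 9]},
}

scores = {}
for name, m in models.items():
    # Unnormalized posterior: product of the likelihoods times the prior
    scores[name] = reduce(lambda a, b: a * b, m["likelihoods"], m["prior"])

print(scores)                       # {'i100': ~0.0031, 'i500': ~0.0198}
print(max(scores, key=scores.get))  # i500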
NB Algorithm
Given the data D, we are learning the hypothesis h that is most probable. H is the set of all
hypotheses. Using Bayes theorem:
hMAP = argmax over h in H of P(h|D) = argmax over h in H of P(D|h) * P(h)
MAP – Maximum a Posteriori
Bayes Theorem - Example
There are 2 machines – Mach1 and Mach2
Mach1 produced 30 toys
Mach2 produced 20 toys
Out of all the toys produced, 1% are defective.
Of the defective toys, 50% came from Mach1 and 50% came from Mach2.
What is the probability that a toy produced by Mach2 is defective?
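One possible worked solution (the slide leaves the numbers to the reader), assuming the question asks for P(defective|Mach2):
P(Mach2) = 20/50 = 0.4, P(defective) = 0.01, P(Mach2|defective) = 0.5
P(defective|Mach2) = P(Mach2|defective) * P(defective) / P(Mach2) = 0.5 * 0.01 / 0.4 = 0.0125
So about 1.25% of the toys produced by Mach2 are defective.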
Naïve Bayes in case of Numerical Attributes
Naïve Bayes works well for categorical data.
In case of a numerical (continuous) attribute, we can do one of two things:
 Make categories
So Age can be categorized as <18, 18-22, 23-30, 31-40, >40
 Gaussian Distribution
Naïve Bayes assumes that the attribute follows a Normal Distribution. We can use the probability
density function to compute the probability of the numerical attribute, as in the sketch below.
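A minimal sketch of the Gaussian density approach; the mean, standard deviation, and Age value below are illustrative, not taken from a real dataset:

# Likelihood of a numerical attribute under a Normal distribution.
import math

def gaussian_pdf(x, mean, std):
    # Probability density of x under Normal(mean, std)
    exponent = math.exp(-((x - mean) ** 2) / (2 * std ** 2))
    return exponent / (math.sqrt(2 * math.pi) * std)

# e.g. likelihood of Age = 35 for a class whose Age values have mean 30 and std 5
print(round(gaussian_pdf(35, mean=30, std=5), 4))  # ~0.0484

# scikit-learn provides the same behaviour out of the box:
# from sklearn.naive_bayes import GaussianNB
# model = GaussianNB().fit(X_train, y_train)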
Pros and Cons of NB
Pros
 Fast and easy to implement. Very good performance for multi-class prediction
 If the attributes are truly independent, it performs better compared to other models like Logistic
Regression
 Performs better in case of categorical variables
Cons
 If a categorical variable has a category in the test data set that was not observed in the training
data set, the model will assign it a zero probability and will be unable to make a prediction.
 In real life, it is almost impossible to get a set of predictors which are completely
independent.
kNN Algorithm
K – Nearest Neighbors:
We have a set of training data with labels. When we are given a new piece of data, we compare it
to every piece of existing data. We then take the most “similar” pieces of data (the nearest
neighbors) and look at their labels. We look at the top “k” most “similar” pieces of data from the
training set. Lastly, we take a majority vote from these k pieces and assign the majority class to
our new piece of data.
How to calculate “similarity” between 2 pieces of data?
We can represent each piece of data as a point and calculate the distances between those points.
The points which are close to each other can be considered “similar”. A minimal sketch of the algorithm is shown below.
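A minimal sketch of kNN classification with Euclidean distance; the tiny dataset and the choice of k are made up for illustration:

# k-nearest-neighbors classification by majority vote.
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_points, train_labels, query, k=3):
    # Sort training points by distance from the query point
    neighbors = sorted(
        zip(train_points, train_labels),
        key=lambda pair: euclidean(pair[0], query),
    )
    # Majority vote among the k nearest neighbors
    top_k_labels = [label for _, label in neighbors[:k]]
    return Counter(top_k_labels).most_common(1)[0][0]

points = [(1, 1), (2, 1), (8, 9), (9, 8)]
labels = ["A", "A", "B", "B"]
print(knn_predict(points, labels, query=(2, 2), k=3))  # "A"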
Eager Vs Lazy Learning
 Decision Trees, Regression, Neural Networks, SVMs, Bayes nets: all of these can be described
as eager learners. In these models, we fit a function that best fits our training data; when we
have new inputs, the input’s features are fed into the function, which produces an output.
Once a function has been computed, the data could be lost to no detriment of the model’s
performance (until we obtain more data and want to recompute our function). For eager
learners, we take the time to learn from the data first and sacrifice local sensitivity to obtain
quick, global scale estimates on new data points.
 KNN is a lazy learner. Lazy learners do not compute a function to fit the training data before
new data is received. Instead, new instances are compared to the training data itself to make
a classification or regression judgment. Essentially, the data itself is the function to which new
instances are fit. While the memory requirements are much larger than for eager learners (storing
all training data versus just a function) and judgments take longer to compute, lazy learners
have advantages in local-scale estimation and easy integration of additional training data.
Example
User        Bahubali 2   Half-Girlfriend
Avnish      5            5
Pawan       2            5
Nitin       5            1
Aishwarya   5            4
We have some users who have given ratings to 2 movies.
We will calculate the distance between these users based on
the ratings they gave to the movies.
Since there are 2 movies, each user can be represented as a
point in 2 dimensions.
Distances
Manhattan Distance:
This is the easiest distance to calculate.
Let's say Aishwarya is (x1, y1) and Pawan is (x2, y2).
Manhattan Distance is calculated as: |x1 – x2| + |y1 – y2|
So the distance between Aishwarya and Pawan is |5 – 2| + |4 – 5| = 4
the distance between Aishwarya and Avnish is |5 – 5| + |4 – 5| = 1
the distance between Aishwarya and Nitin is |5 – 5| + |4 – 1| = 3
Aishwarya is most similar to Avnish. So if we know the best-rated movies of Avnish, we can
recommend them to Aishwarya. A minimal code sketch follows.
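A minimal Python sketch of the same Manhattan-distance calculation, using the ratings from the table above:

# Manhattan distance between Aishwarya and the other users.
ratings = {
    "Avnish":    (5, 5),
    "Pawan":     (2, 5),
    "Nitin":     (5, 1),
    "Aishwarya": (5, 4),
}

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

for user in ("Pawan", "Avnish", "Nitin"):
    print(user, manhattan(ratings["Aishwarya"], ratings[user]))
# Pawan 4, Avnish 1, Nitin 3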
Distances
Euclidean Distance or As-the-Crow-Flies Distance:
We can also use the Euclidean distance between 2 points:
√((x1 – x2)² + (y1 – y2)²)
So the distance between Aishwarya and Pawan is 3.16
the distance between Aishwarya and Avnish is 1
the distance between Aishwarya and Nitin is 3
Minkowski Distance Metric:
Generalization: d(x, y) = ( Σ |xk – yk|^r )^(1/r)
When
• r = 1: The formula is Manhattan Distance
• r = 2: The formula is Euclidean Distance
• r = ∞: Supremum Distance
Manhattan Distance and Euclidean Distance work best when there are no missing values. A generalized sketch is shown below.
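A minimal Python sketch of the Minkowski generalization (r = 1 reproduces the Manhattan result, r = 2 the Euclidean result):

# Minkowski distance; r=1 gives Manhattan, r=2 gives Euclidean.
def minkowski(a, b, r=2):
    return sum(abs(x - y) ** r for x, y in zip(a, b)) ** (1 / r)

aishwarya, pawan = (5, 4), (2, 5)
print(minkowski(aishwarya, pawan, r=1))            # 4.0  (Manhattan)
print(round(minkowski(aishwarya, pawan, r=2), 2))  # 3.16 (Euclidean)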
N - Dimensions
 The previous example can be expanded to more than 2 dimensions, where we can have more
people giving ratings to many movies
 In this case, the distance between 2 people will be computed based on the movies they both
reviewed, e.g. the distance between Gaurav and Somendra is computed from the tables below
Gaurav Somendra Paras Ram Tomar Atul Sumit Divakar
Bahubali-2 3.5 2 5 3 5 3
Half-Girlfriend 2 3.5 1 4 4 4.5 2
Sarkar3 4 1 4.5 1 4
Kaabil 4.5 3 4 5 3 5
Raees 5 2 5 3 5 5 4
Sultan 1.5 3.5 1 4.5 4.5 4 2.5
Dangal 2.5 4 4 4 5 3
Piku 2 3 2 1 4
Movie             Gaurav   Somendra   Difference
Bahubali-2        3.5      2          1.5
Half-Girlfriend   2        3.5        1.5
Sarkar3           4        –          –
Kaabil            4.5      –          –
Raees             5        2          3
Sultan            1.5      3.5        2
Dangal            2.5      –          –
Piku              2        3          1
Manhattan Distance (over the movies both rated) = 9
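A minimal Python sketch of the same computation; it restricts the distance to movies that both users rated, using the ratings listed in the difference table above:

# Manhattan distance over only the movies both users have rated.
gaurav   = {"Bahubali-2": 3.5, "Half-Girlfriend": 2, "Sarkar3": 4, "Kaabil": 4.5,
            "Raees": 5, "Sultan": 1.5, "Dangal": 2.5, "Piku": 2}
somendra = {"Bahubali-2": 2, "Half-Girlfriend": 3.5, "Raees": 2, "Sultan": 3.5, "Piku": 3}

def manhattan_shared(r1, r2):
    shared = set(r1) & set(r2)                      # movies rated by both users
    return sum(abs(r1[m] - r2[m]) for m in shared)

print(manhattan_shared(gaurav, somendra))  # 9.0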
Pearson Correlation Coefficient
 Somendra avoids extremes. He rates between 2 and 4; he doesn't give 1 or 5
 Atul seems to like every movie. He rates between 4 and 5
 Tomar either gives 1 or 4
So it is difficult to compare Tomar to Atul because they are using different scales (this is called grade-
inflation). To solve this, we use the Pearson Correlation Coefficient.
The Pearson Correlation Coefficient is a measure of correlation between two variables. It ranges
between -1 and 1 inclusive. 1 indicates perfect agreement; -1 indicates perfect disagreement.
 Formula: r = Σ(xi – x̄)(yi – ȳ) / ( √Σ(xi – x̄)² * √Σ(yi – ȳ)² )
Gaurav Somendra Paras Ram Tomar Atul Sumit Divakar
Bahubali-2 3.5 2 5 3 5 3
Half-Girlfriend 2 3.5 1 4 4 4.5 2
Sarkar3 4 1 4.5 1 4
Kaabil 4.5 3 4 5 3 5
Raees 5 2 5 3 5 5 4
Sultan 1.5 3.5 1 4.5 4.5 4 2.5
Dangal 2.5 4 4 4 5 3
Piku 2 3 2 1 4
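A minimal Python sketch of the Pearson coefficient for Gaurav and Somendra, using only the five movies both rated (taken from the difference table above):

# Pearson correlation between two users over their shared ratings.
import math

gaurav   = {"Bahubali-2": 3.5, "Half-Girlfriend": 2, "Raees": 5, "Sultan": 1.5, "Piku": 2}
somendra = {"Bahubali-2": 2, "Half-Girlfriend": 3.5, "Raees": 2, "Sultan": 3.5, "Piku": 3}

def pearson(r1, r2):
    shared = sorted(set(r1) & set(r2))
    x = [r1[m] for m in shared]
    y = [r2[m] for m in shared]
    n = len(shared)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    den = (math.sqrt(sum((a - mean_x) ** 2 for a in x)) *
           math.sqrt(sum((b - mean_y) ** 2 for b in y)))
    return num / den if den else 0.0

print(round(pearson(gaurav, somendra), 2))  # ~ -0.90 (they largely disagree)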
Cosine Similarity
x and y are vectors. cos(x, y) gives the similarity between these 2 vectors:
cos(x, y) = (x · y) / (||x|| * ||y||)
x · y is the dot product
||x|| is the length (norm) of the vector, equal to √(Σ xi²)
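A minimal Python sketch of cosine similarity; the two rating vectors are made up for illustration:

# Cosine similarity between two rating vectors.
import math

def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

print(round(cosine_similarity([5, 0, 4, 0, 3], [4, 0, 5, 3, 0]), 3))  # 0.8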
Which Similarity measure to use?
 If the data contains grade-inflation, use Pearson coefficient
 If the data is dense (almost all attributes have non-zero values) and the magnitude of
the attribute value is important, use distance measures like Manhattan or Euclidean
 If the data is sparse, use Cosine Similarity
Recommendation Systems
Some examples:
 Facebook suggests a list of “People you may know”
 YouTube suggests relevant videos
 Amazon suggests relevant books
 Netflix suggests movies you may like
A recommendation system seeks to predict the ‘rating’ or ‘preference’ that a user would give to an
item.
 In today’s digital age, businesses often have hundreds of thousands of items to offer their
customers
 Recommendation Systems can hugely benefit such businesses
Filtering methods
 Collaborative Filtering: The process of using other users’ inputs to make predictions or
recommendations is called collaborative filtering. It requires a lot of data, and hence lots of
computing power, to make accurate recommendations. An example is the user-movie ratings table below.
 Content Filtering: The process of using the data itself (the contents of the data) to make predictions
or recommendations is called content filtering. It requires limited data but can be limited in
scope. Because you are using the contents of the data, you end up recommending things
similar to what the user has already liked, so recommendations are often not very
surprising or insightful.
Netflix uses a hybrid system – both Collaborative and Content Filtering
Body Guard Men in Black Inception Terminator
Kapil 5 4 5 4
Mradul 4 2 5
Saurabh 5 4 4
Atul 4 2
Recommendation Systems
Ways to Recommend
 By using User Ratings for Products: Collaborative Filtering
 By using Product Attributes: Content Based Filtering
 By using Purchase Patterns: Association Rules Learning
The recommendation engine takes as inputs what users bought, what they surfed, what they liked and
rated, and what they clicked, and produces outputs such as “If you liked that, you will love this”,
“People who bought that also buy this”, and “Top picks for you”.
Tasks performed by a Recommendation Engine
1. Filter relevant products
2. Predict whether a user will buy a product
3. Predict what rating the user would give to a product
4. Rank products based on their relevance to the user
Relevant products:
 “Similar” to the ones the user “liked” (e.g. if a user liked a machine learning book, show him other books on
machine learning)
 “Liked” by “similar” users (e.g. if 2 users liked the same 3 movies, show a different movie liked by user1 to user2)
 “Purchased” along with the ones the user “liked” (e.g. if somebody bought a camera, show him tripods)
Collaborative Filtering
Cons:
1. Scalability: if the number of users is very high, say 1 million, every time you want to make a
recommendation for someone you need to calculate one million distances (comparing that
person to the 999,999 other people). If we are making multiple recommendations per second,
the number of calculations gets extreme.
2. Sparsity: Most recommendation systems have many users and many products, but the
average user rates only a small fraction of the total products. For example, Amazon carries millions
of books, but the average user rates just a handful of them. Because of this, the algorithm
may not find any nearest neighbors.