Introduction to
Machine
Learning
Sanghamitra Deb
Staff Data Scientist
Chegg Inc
Outline
• Introduction
• Types of Learning
• Machine Learning Models
• Definition
• Optimization
• Regularization
• Metrics
• Feature Engineering
• Real World Examples
What is
Machine
Learning?
• “Learning is any process by which a system
improves performance from experience.” -
Herbert Simon
• Machine learning is training computers to
effectively achieve a performance criterion
using examples or historical data.
Why?
Machine Learning is used when
§ Human expertise is unavailable (space expeditions).
§ Human expertise is not explicable (speech translation).
§ Information needs to be personalized (education, medicine).
§ Domains with huge amounts of data.
Applications
• Education --- Developing Learning Paths
for Students
• Healthcare --- Personalized Medicine
• Retail --- Product recommendations
• Web --- Search
• Manufacturing --- robotics, control
• Finance – fraud detection, asset
management
• HR --- people analytics
• Medical --- drug discovery, automated
diagnosis
• ………..
Types of Learning
Supervised Unsupervised
§ Examples or training data is available
o Human annotations, user interactions
§ Data contains features correlated with the
desired outcome
§ A model is learned from the examples
§ Goal of the model is to predict future
behavior
§ Direct examples are not available
§ Data contains correlated features, but the
outcome may not be defined.
§ It is possible to create clusters
correlated with the learning objective
based on patterns in the data
Learning Objective ---
Types of Learning
Supervised Unsupervised
§ Regression
§ Linear
§ Decision Trees
§ Classification
§ Logistic Regression
§ Naïve Bayes
§ SVM
§ Decision Trees – RF, GBDT
§ Clustering --- kmeans
§ Similarity based results
§ Transfer Learning
Models ---
Linear Regression
Linear regression was developed in the field of statistics and is studied as a model for understanding the relationship
between input and output numerical variables, but has been borrowed by machine learning. It is both a statistical
algorithm and a machine learning algorithm.
Linear Regression | simple linear regression | Ordinary Least Squares | multiple linear regression
• The dependent variable Y has a linear relationship to the independent variable X
• For each value of X, the probability distribution of Y has the same standard deviation σ.
• For any given value of X, the Y values are independent, as indicated by a random pattern on the residual plot.
• The Y values are roughly normally distributed (i.e., symmetric and unimodal).
Example: Sales Prediction → a company’s advertising spend on radio, TV, and newspapers.
Cost Function
Mean Squared Errors
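The cost formula itself is an image on the original slide; for reference, the mean squared error for a line y = m·x + b over N points is

$$\mathrm{MSE}(m, b) = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - (m x_i + b) \right)^2$$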
To minimize MSE we use Gradient Descent to calculate the gradient of our cost
function. Gradient Descent is an algorithm used to minimize some function by
iteratively moving in the direction of steepest descent as defined by the negative
of the gradient. In linear regression, gradient descent is used to update the
parameters or weights of the linear model.
https://ml-cheatsheet.readthedocs.io/en/latest/gradient_descent.html
Learning Rate : The size of these steps is called the learning rate. With a high learning rate we can cover more ground
each step, but we risk overshooting the lowest point since the slope of the hill is constantly changing.
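A minimal Python sketch of one such update for the single-feature case (a toy example, not code from the deck; the data and learning rate are illustrative):

import numpy as np

def gradient_descent_step(x, y, m, b, learning_rate):
    """One gradient descent update for y ≈ m*x + b under the MSE cost."""
    n = len(x)
    y_pred = m * x + b
    dm = (-2.0 / n) * np.sum(x * (y - y_pred))  # ∂MSE/∂m
    db = (-2.0 / n) * np.sum(y - y_pred)        # ∂MSE/∂b
    # Step in the direction of steepest descent (negative gradient)
    return m - learning_rate * dm, b - learning_rate * db

# Toy usage: recover y = 2x + 1 from noisy data
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2 * x + 1 + rng.normal(0, 0.5, 100)
m, b = 0.0, 0.0
for _ in range(2000):
    m, b = gradient_descent_step(x, y, m, b, learning_rate=0.01)
print(m, b)  # close to 2 and 1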
Cost Function
https://ml-cheatsheet.readthedocs.io/en/latest/gradient_descent.html
Linear Regression using scikit-learn
https://stackabuse.com/linear-regression-in-python-with-scikit-learn/
Data set : https://drive.google.com/file/d/1oakZCv7g3mlmCSdv9J8kdSaqO5_6dIOw/view
The resulting root mean squared error is 4.64, which is less than 10% of the mean of the target variable.
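A sketch of the steps the linked tutorial walks through, assuming the dataset has an Hours feature and a Scores target as in that tutorial:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

data = pd.read_csv("student_scores.csv")  # file and column names assumed from the tutorial
X = data[["Hours"]]
y = data["Scores"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(rmse, rmse / y.mean())  # RMSE, and RMSE as a fraction of the mean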
Applications
• Trendline --- A trend line represents a trend, the long-term movement in time series data after other
components have been accounted for. It tells whether a particular data set (say GDP, oil prices or stock prices) has
increased or decreased over a period of time.
• Epidemiology --- Early evidence relating tobacco smoking to mortality and morbidity came from observational
studies employing regression analysis. In order to reduce spurious correlations when analyzing observational data,
researchers usually include several variables in their regression models in addition to the variable of primary interest.
• Finance --- The capital asset pricing model uses linear regression as well as the concept of beta for analyzing and
quantifying the systematic risk of an investment. This comes directly from the beta coefficient of the linear regression
model that relates the return on the investment to the return on all risky assets.
• Economics --- Linear regression is the predominant empirical tool in economics. For example, it is used to
predict consumption spending,[20] fixed investment spending, inventory investment, purchases of a
country's exports,[21] spending on imports,[21] the demand to hold liquid assets,[22] labor demand,[23] and labor
supply.[23]
https://en.wikipedia.org/wiki/Linear_regression
Classification
http://cs229.stanford.edu/notes2020spring/cs229-notes1.pdf
Values of Y, i.e., the response, can take discrete values
o Binary Classification – the response can belong to two
classes (0, 1)
• Rating – thumbs up, thumbs down.
o Multi-class classification – there can be n classes →
[1, …, n]
• Movie/restaurant ratings can range over
[1, 2, 3, 4, 5]
o Multi-class, multi-label classification
• Example: the concept space in Probability has many
concepts (classes) – [Probability, Bayes
Theorem, Discrete PDF – Binomial, Poisson,
Continuous PDF – normal, exponential, …]
• A question will typically belong to multiple
classes.
Logistic Regression -- Binary Classification
Logistic Function or Sigmoid function
https://sebastianraschka.com/faq/docs/logistic-why-sigmoid.html -- discussion on the logit function
Interpretation: Output is the probability that y=1 given x.
Output > 0.5 is interpreted as y = 1 and output < 0.5 as y = 0.
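A minimal scikit-learn sketch on toy data (illustrative, not from the deck):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy binary data: the class is mostly 1 once the feature exceeds 5
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, (200, 1))
y = (X[:, 0] + rng.normal(0, 1, 200) > 5).astype(int)

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[7.0]]))  # [P(y=0|x), P(y=1|x)]
print(clf.predict([[7.0]]))        # 1, since P(y=1|x) > 0.5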
Cost
Function –
Cross
Entropy
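The formula itself is an image on the slide; for reference, the binary cross-entropy cost over m examples, with hθ(x) the sigmoid output, is

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right]$$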
Multinomial Logistic / Softmax
Sigmoid gets replaced by the softmax function.
This applies when there are k classes, labeled 1, …, k.
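A small sketch of the softmax function itself (illustrative):

import numpy as np

def softmax(z):
    """Map k raw scores to a probability distribution over k classes."""
    z = z - np.max(z)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # sums to 1; the largest score gets the largest probability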
Regularization --- Linear
/Logistic
• Penalty against complexity
o so the model does not pick up
"peculiarities" or "noise,"
or imagine a pattern
where there is none.
• Helps the model generalize to
new data sets.
• Adds bias to a model that
suffers from high variance,
i.e., overfitting
L2 regularization
L1 regularization
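A sketch of how these penalties are requested in scikit-learn (parameter values are illustrative):

from sklearn.linear_model import Lasso, LogisticRegression, Ridge

ridge = Ridge(alpha=1.0)  # L2: shrinks weights toward zero
lasso = Lasso(alpha=0.1)  # L1: tends to drive some weights exactly to zero

# For logistic regression the penalty is chosen by name; C is the inverse strength
logit_l2 = LogisticRegression(penalty="l2", C=1.0)
logit_l1 = LogisticRegression(penalty="l1", C=1.0, solver="liblinear")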
Data --- Train,
Test and
Validation
Train – 90%, Test – 5%, Validation – 5% --- the
percentages can vary depending on the total size of
the dataset.
• Train --- data used to train the model
• Validation --- data that the model has not seen
but is used for parameter tuning, i.e., the model is
optimized based on performance on this set.
• Test --- model has not seen this data, this data is
not used in any part of the computation. Final
performance metrics are reported on this data.
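One way to produce such a split with scikit-learn (a sketch, assuming a feature matrix X and labels y):

from sklearn.model_selection import train_test_split

# Hold out 10% first, then split the holdout evenly into validation and test (90/5/5)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.10, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_hold, y_hold, test_size=0.50, random_state=42)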
Bias – Variance Tradeoff
BIAS — When we model in an overly simple, naive way, for example
fitting a single linear equation to a genuinely complex process,
the model underfits and misses important insights and
relationships between variables.
VARIANCE — On the other hand, when we fit an overly complicated
model to simple data, the model overfits: every noise point and
outlier is treated as a valid data point and modeled accordingly.
Decision Tree
Advantages ---
• Results are interpretable
• Works for both numerical and categorical data
• Does not require feature transformations (e.g.,
normalization, scaling)
• Robust to multicollinearity – correlated features.
Single decision trees are rarely used in practice ---
• Unstable --- small changes in the data can lead to large
structural changes in the decision tree
• Prone to overfitting
• Easily becomes complex
Ensemble
Techniques ---
Random Forest
Many trees are better than one!
A random forest trains N slightly
different decision trees and merges
their predictions to get more accurate
and stable results.
Regularization
• Limit tree depth.
• Pruning
• Penalize selection of new
features over features that
have similar gain
• Set stricter stopping criterion
on when to split a node
further (e.g. min gain, number
of samples etc.)
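A sketch of how these knobs map onto scikit-learn's RandomForestClassifier (values are illustrative):

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=200,            # N slightly different trees
    max_depth=8,                 # limit tree depth
    min_samples_split=20,        # stricter stopping criterion for splitting a node
    min_impurity_decrease=1e-4,  # minimum gain required to split further
    random_state=0,
)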
Feature Importance
A feature’s importance score measures
the contribution of the feature to the model. It
is based on how much the feature reduces class
impurity across the tree’s splits.
Impurity-based importance is biased towards
features with more categories, so it is important
to check how importance correlates with accuracy.
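Continuing the sketch above, the impurity-based scores are exposed as feature_importances_ after fitting (X_train, y_train, and feature_names are assumed from a prior step):

rf.fit(X_train, y_train)
for name, score in zip(feature_names, rf.feature_importances_):
    print(name, round(score, 3))  # scores sum to 1 across features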
Unsupervised Learning
There is no training data …
Clustering
• Hierarchical clustering
• K-means clustering
• K-NN (k nearest neighbors; strictly a supervised method, listed here as another distance-based technique)
Clustering is a technique that
finds groups (clusters) in the
data that have similar patterns.
K-means
Clustering
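A minimal scikit-learn sketch (X and the choice of k = 5 clusters are illustrative):

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])      # cluster id assigned to each observation
print(kmeans.cluster_centers_)  # centroid of each cluster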
Similarity based
Recommendations
• Text data – news articles,
study material, books, …
• For every piece of content
compute the distance to
every other piece of content
in the cluster --- save the top
n in a database
• When a user views any
content, surface the top-n
(typically 5) other similar
items
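A toy sketch of this pipeline using TF-IDF vectors and cosine similarity (the documents and top_n are illustrative):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["intro to probability", "bayes theorem explained", "cooking pasta at home"]
tfidf = TfidfVectorizer().fit_transform(docs)
sims = cosine_similarity(tfidf)  # pairwise similarity matrix

top_n = 2
for i, row in enumerate(sims):
    row[i] = -1  # exclude the item itself
    neighbors = np.argsort(row)[::-1][:top_n]  # indices of the most similar items
    print(docs[i], "->", [docs[j] for j in neighbors])  # these would be saved to a database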
Applications
• Recommendations based on Text
Similarity
• Customer Segmentation
• Content Categorization
• As a pre-analysis for supervised
learning
Performance Metrics
Is the model good enough?
Regression
R² Error: This metric compares the current model
with a constant baseline and tells us how much
better the model is. The constant baseline is a
horizontal line drawn at the mean of the data.
Adjusted R²: Adjusted R² carries the same meaning as R² but
improves on it. Plain R² suffers from the problem that its
score rises as more terms are added even when the model is
not actually improving, which can mislead the researcher.
Adjusted R² is always lower than R² because it adjusts for the
growing number of predictors and only rises when there is a
real improvement.
Classification
True Positives --- Number of observations where the model
correctly predicts the positive class.
False Positives --- Number of observations where the model
incorrectly predicts the positive class.
False Negatives --- Number of observations where the model
incorrectly predicts the negative class.
True Negatives --- Number of observations where the model
correctly predicts the negative class.
https://en.wikipedia.org/wiki/Precision_and_recall
Classification
https://en.wikipedia.org/wiki/Precision_and_recall
Precision: TP/(TP+FP) --- what percentage of predicted
positives are actually positive?
Recall: TP/(TP+FN) --- what percentage of actual positives
does the model capture?
Accuracy: (TP+TN)/(TP+FP+TN+FN) --- what percentage of all
predictions are correct?
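A sketch of these definitions on toy labels, checked against scikit-learn:

from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fp, fn, tn)                                # 3 1 1 3
print("precision", precision_score(y_true, y_pred))  # TP / (TP + FP) = 0.75
print("recall   ", recall_score(y_true, y_pred))     # TP / (TP + FN) = 0.75
print("accuracy ", accuracy_score(y_true, y_pred))   # (TP + TN) / total = 0.75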
Thresholding --- Coverage
In binary classification, choosing randomly gives a probability of 0.5 of belonging to a class.
By thresholding the predicted probability (e.g., predicting positive only above 0.7 and
negative only below 0.3, leaving the middle unclassified), it is possible to improve the
percentage of correct results at the cost of coverage.
Confusion
Matrix
ROC & AUC
ROC – Receiver Operating Characteristics
An ROC curve (receiver operating characteristic curve) is a graph
showing the performance of a classification model at all
classification thresholds.
AUC – Area Under the Curve.
• AUC is scale-invariant. It measures how well predictions
are ranked, rather than their absolute values.
• AUC is classification-threshold-invariant. It measures the
quality of the model's predictions irrespective of what
classification threshold is chosen.
https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc
• TPR = TP/(TP+FN)
• FPR = FP/(FP+TN)
(The diagonal line, where TPR = FPR, corresponds to a random classifier with AUC = 0.5.)
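A sketch of computing the curve and AUC with scikit-learn (assumes a fitted classifier clf with predict_proba and a held-out X_test, y_test):

from sklearn.metrics import roc_auc_score, roc_curve

scores = clf.predict_proba(X_test)[:, 1]          # probability of the positive class
fpr, tpr, thresholds = roc_curve(y_test, scores)  # one (FPR, TPR) point per threshold
print(roc_auc_score(y_test, scores))              # 0.5 = random, 1.0 = perfect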
Feature Engineering
Combining math and intuition
Imputation of missing values
https://towardsdatascience.com/feature-engineering-for-machine-learning-3a5e293a5114#1c08
Drop rows with missing values
Cons: Reduces training data. With multiple features, values may be missing in only a subset of them, so whole rows of otherwise usable data get discarded.
Replace with a reasonable value – the median is common.
Cons: Rests on assumptions about why values are missing, which may not be correct.
Categorical Imputation --- replace with the most common value.
Cons: Same assumption issue as above.
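A pandas sketch of these options (the toy frame and column names are illustrative):

import pandas as pd

df = pd.DataFrame({"age": [25, None, 40, 31], "city": ["SF", "NY", None, "NY"]})

df_dropped = df.dropna()                              # option 1: drop rows with missing values
df["age"] = df["age"].fillna(df["age"].median())      # option 2: numeric, fill with the median
df["city"] = df["city"].fillna(df["city"].mode()[0])  # option 3: categorical, most common value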
Handling
Outliers –
drop, cap
Binning
The trade-off
between performance and overfitting is the key
point of the binning process.
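A sketch of fixed-width binning with pandas (bin edges and labels are illustrative):

import pandas as pd

ages = pd.Series([5, 17, 25, 42, 67, 80])
bins = pd.cut(ages, bins=[0, 18, 35, 60, 120],
              labels=["child", "young", "adult", "senior"])
print(bins.tolist())  # each age replaced by its coarser bucket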
Log Transform
• It helps to handle skewed data; after transformation, the
distribution becomes closer to normal.
• In most cases the order of magnitude varies across the range
of the data. For instance, the difference between ages 15 and 20
is not the same as the difference between ages 65 and 70: in
years they are identical, but in every other respect a 5-year gap
at a young age carries a much larger relative difference. Data
like this often comes from a multiplicative process, and a log
transform normalizes such magnitude differences.
• It also reduces the effect of outliers, thanks to the same
normalization of magnitude differences, making the model more robust.
One-hot encoding
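One-hot encoding turns a categorical column into one 0/1 indicator column per category; a pandas sketch with illustrative data:

import pandas as pd

df = pd.DataFrame({"genre": ["mystery", "romance", "mystery", "scifi"]})
print(pd.get_dummies(df, columns=["genre"]))  # genre_mystery, genre_romance, genre_scifi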
Scaling -- all features have the same range
After scaling, the continuous features share the same range.
Scaling is not mandatory for many algorithms, but it is often
still worth applying. Algorithms based on distance
calculations, such as k-NN or k-Means, do need scaled
continuous features as model input.
Normalization --- normalization (or min-max normalization) scales all values into a
fixed range between 0 and 1.
Standardization --- standardization rescales each feature to zero mean and unit variance (z-scores).
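A sketch of both transforms in scikit-learn (toy data):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [5.0], [10.0]])
print(MinMaxScaler().fit_transform(X))    # min-max normalization into [0, 1]
print(StandardScaler().fit_transform(X))  # standardization: zero mean, unit variance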
Real World Application
There is no training data …
Use Case --- Predict whether a
user will buy a book
Online Book Store
Generating Features
Book Features:
• Tags --- genre, subject, …
• Level --- Beginner, Intermediate, Advanced
• Popularity score ---
1. exponential decay on clicks
2. time-bound scores, such as number of views in the
last 7 days, last 14 days
• Price
• Length of the book
• …
User Features (derived from user interactions on the site):
• Tags --- genre, subject, …
• Level --- Beginner, Intermediate, Advanced
• View score ---
1. exponential decay on clicks on books
2. time-bound --- number of views in past 7/14 days
• Price category
• Time of day --- categorical
• …
Book-User Features:
• Number of views in past 14/30 days
• Already bought
• Number of views from the same author
• …
Response Variable & Modeling
• If the site is not super active you might not have enough data on purchases
• Multi-stage model
• Stage 1: response variable views --- i.e., will the user view/click on this
book in the next 3 days?
• Stage 2: response variable purchase --- will the user purchase this book
in the next 3 days? The probability of view from Stage 1 becomes a feature in the Stage
2 model.
Modeling: We have a mix of categorical and numerical features --- Random Forest
Thank You
@sangha_deb
sangha123@gmail.com
Appendix
K-Nearest Neighbors
K-Nearest Neighbors is a classification
algorithm that leverages observations
close to the target point to decide which
class it belongs to.
There are two parts of the algorithm: first, how to measure
“close”; second, how many close observations (K) we need.
https://github.com/spotify/annoy
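A minimal scikit-learn sketch (assumes X_train, y_train, X_test from a prior split; k = 5 and the default Euclidean distance are illustrative choices):

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print(knn.predict(X_test[:3]))  # majority class among each point's 5 nearest neighbors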
KNN - algo
Derivative of the sigmoid & cost function
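The derivation itself is on the slide image; the key identity, for reference:

$$\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad \sigma'(z) = \sigma(z)\,(1 - \sigma(z))$$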
Metrics related to Ranking
• MRR
• MAP
• NDCG
Gradient Boosted Trees
Loss Functions
• Cross-Entropy
• Hinge
• Huber
• Kullback-Leibler
• MAE (L1)
• MSE (L2)
Optimization Techniques
• Gradient Descent
oStochastic
oMini Batch
• Adagrad
• RMSprop
• Adam
• Cross-Entropy
• Others
Decision Tree Based Regression
Pros:
Decision trees can handle both categorical and numerical data.
Cons:
Does not handle feature interaction very well.