Introduction to Machine Learning for Statisticians
ganesh.vigneswara@gmail.com, ganesh@ganeshniyer.com
Dr Ganesh Neelakanta Iyer
Industry Expert, Academician, Researcher, YouTuber, Kathakali Artist
http://ganeshniyer.com, https://www.linkedin.com/in/ganeshniyer/
About Me
• Master's & PhD from National University of Singapore (NUS)
• Several years in Industry/Academia
• Architect, Manager, Technology Evangelist, Professor
• Talks/workshops in USA, Europe, Australia, Asia
• Cloud Computing, Game Theory, Machine Learning,
DevOps, SRE
• Kathakali Artist, Composer, Speaker, Traveler, YouTuber
GANESHNIYER http://ganeshniyer.com
https://bit.ly/MLPlaylistGanesh
Agenda
Introduction
• Artificial Intelligence
• AI vs ML
Machine Learning
• Introduction
• Types of ML
• Applications
• ML Algorithms
ML vs Statistics
ML resources
• Courses
• Data Sets
• Projects
DISCLAIMER
• I am NOT an expert in Machine Learning. I intend to share
some knowledge I have to help you kick-start your interest
• I have been informed that the audience is new to this area, so
the session is a GENTLE introduction to ML and what it means
for statisticians
• For everyone who was forced to be here today, please enjoy the
Dilbert cartoons and pictures of countries I have been to
nCorona
https://interestingengineering.com/china-uses-drones-and-ai-robots-to-fight-the-coronavirus-outbreak
https://www.dailymail.co.uk/news/article-7948181/Chinese-hospitals-start-use-AI-powered-robots-treat-coronavirus-patients.html
https://techresider.com/robots/the-penaut-robot-which-takes-food-to-patients-isolated-by-the-coronavirus/
https://www.archyde.com/body-heat-detector-drones-china-makes-massive-use-of-technologies-to-contain-the-coronavirus/
http://engnews24h.com/corona-virus-drones-with-speakers-are-patrolling-in-china/
BlueDot, an AI company, issued its first alert on December 31st.
This was ahead of the US Centers for Disease Control and
Prevention, which made its own determination on January 6th.
https://www.forbes.com/sites/tomtaulli/2020/02/02/coronavirus-can-ai-artificial-intelligence-make-a-difference/#41dd3f555817
How?
nCorona - AI
• “We are currently using natural language processing (NLP) and
machine learning (ML) to process vast amounts of unstructured
text data, currently in 65 languages, to track outbreaks of over
100 different diseases, every 15 minutes around the clock,” said
Kamran Khan, founder of BlueDot
• “If we did this work manually, we would probably need over a
hundred people to do it well. These data analytics enable health
experts to focus their time and energy on how to respond to
infectious disease risks, rather than spending their time and
energy gathering and organizing information.”
https://www.forbes.com/sites/tomtaulli/2020/02/02/coronavirus-can-ai-artificial-intelligence-make-a-difference/#41dd3f555817
What is AI?
Artificial Intelligence
• “The study of the modelling of human mental functions by
computer programs.” — Collins Dictionary
https://medium.com/life-of-a-technologist/what-would-the-managers-manage-in-the-age-of-ai-6a00c26df257
Artificial Intelligence
• AI is composed of two words: Artificial and Intelligence
• Anything that is not natural and is created by humans is artificial
• Intelligence means the ability to understand, reason, plan, etc.
• So any code, technology or algorithm that enables a machine to mimic,
develop or demonstrate human cognition or behavior is AI
McDonald’s + Dynamic Yield
• McDonald’s thinks AI can help it sell more fast food to customers
• The company has announced that it is acquiring Dynamic Yield, an Israeli company
that uses AI to customise experiences
• McDonald's would use AI to tweak the menu options on the displays in the outlets,
based on factors such as the time of day, the weather outside and how busy the
restaurant is at the time
• If it is warm outside, the menu could offer more options for cold drinks such as
shakes, and perhaps more warm tea options if it is cold outside
• The system will also make recommendations in real-time for additional items that a
customer might want to order, based on what they had already ordered
https://www.news18.com/news/tech/a-burger-french-fries-and-some-artificial-intelligence-with-your-next-mcdonalds-order-2078213.html
Artificial Intelligence vs Machine Learning
AI vs ML
http://godigitalcrazy.com/artificial-intelligence-machine-learning-data-analytics/
What is ML?
Machine Learning
• Machine learning is the field of study that gives computers
the ability to learn without being explicitly programmed.
• In simple terms, machine learning means making
predictions based on data
https://towardsdatascience.com/machine-learning-65dbd95f1603
A quick history: from intuition to machine learning
• Early: Intuition. Using experience and judgement to predict outcomes
• 1900s: Manual analysis. Manual calculations to predict outcomes
• 1970s: Statistical programming languages. Writing code to construct statistical models
• 1990s: Visual statistical software. Drag and drop workflows with menu-driven commands to set up and run statistical analysis
• Now: Automated machine learning. The software knows how to analyse your data and does it for you
Slide credit: Edit
Why Machine Learning is Hard
You See Your ML Algorithm Sees
Why Machine Learning Is Hard, Redux
What is a “2”?
Why is machine learning hard?
Learning to identify an ‘apple’?
          Apple    Apple corporation    Peach
Colour    Red      White                Red
Type      Fruit    Logo                 Fruit
Shape     Oval     Cut oval             Round
Slide credit: Edit
So much for a cat.
Principle of machine learning
Slide credit: Edit
Samples from Daily Life
Google ML
Google Translate
Google Voice search
Google Photos
Gmail smart reply
Google Maps
Example 101
Example
• Suppose we want to build a system that tells us the
expected weight of a person based on their height
• First, we collect the data
• Each point on the graph
represents one data point
https://towardsdatascience.com/cousins-of-artificial-intelligence-dda4edc27b55
• To start with, we draw a
simple straight line to predict weight
from height
• A simple line could be W = H - 100
• Where
– W = weight in kg
– H = height in cm
• This line helps us make
predictions
• Our main goal is to reduce the
distance between the estimated
value and the actual value, i.e. the
error
• To achieve this, we draw the
straight line that fits the points as
closely as possible (in general, no
single line passes through every point)
• We want to make the errors as
small as possible
• Reducing the error between actual
and estimated values improves the
performance of the model, and
collecting more data points
generally improves it further
• When we then feed in new data (the
height of a person), the model can
estimate the weight of that person
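The height and weight example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data points are invented, and ordinary least squares is used as one common way to find a line with small squared error (the slides do not say which fitting method they have in mind). It compares the rule of thumb W = H - 100 with a fitted line.

```python
# Hypothetical (height in cm, weight in kg) pairs, invented for illustration only.
data = [(150, 52), (160, 58), (165, 63), (170, 68),
        (175, 72), (180, 79), (185, 84), (190, 88)]


def mean_squared_error(predict, points):
    """Average squared gap between predicted and actual weight."""
    return sum((predict(h) - w) ** 2 for h, w in points) / len(points)


def fit_line(points):
    """Ordinary least squares with one input: returns (slope, intercept)."""
    n = len(points)
    mean_h = sum(h for h, _ in points) / n
    mean_w = sum(w for _, w in points) / n
    s_hw = sum((h - mean_h) * (w - mean_w) for h, w in points)
    s_hh = sum((h - mean_h) ** 2 for h, _ in points)
    slope = s_hw / s_hh
    return slope, mean_w - slope * mean_h


def rule_of_thumb(height_cm):
    """The simple starting line from the slide: W = H - 100."""
    return height_cm - 100


slope, intercept = fit_line(data)


def fitted(height_cm):
    """Prediction from the least-squares line."""
    return slope * height_cm + intercept


baseline_error = mean_squared_error(rule_of_thumb, data)
fitted_error = mean_squared_error(fitted, data)

print(f"W = H - 100      mean squared error: {baseline_error:.3f}")
print(f"fitted line      mean squared error: {fitted_error:.3f}")
print(f"predicted weight at 172 cm: {fitted(172):.1f} kg")
```

On the data it was fitted to, the least-squares line can never have a larger mean squared error than the rule of thumb, which is what "reducing the error" means in this setting.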
Steps in ML
Types of Data
Data
• Numerical
– Discrete
– Continuous
– Interval
– Ratio
• Categorical
– Nominal
– Ordinal
• Time series
• Text
How do you get data?
Resources: Datasets
• UCI Repository: http://www.ics.uci.edu/~mlearn/MLRepository.html
• UCI KDD Archive: http://kdd.ics.uci.edu/summary.data.application.html
• Kaggle https://www.kaggle.com/
• India Govt ISRO Data Sets https://bhuvan.nrsc.gov.in/bhuvan_links.php
• NIST https://data.world/nist
• Statlib: http://lib.stat.cmu.edu/
• Delve: http://www.cs.utoronto.ca/~delve/
Generate your own set
Dimensionality Reduction
• It is now easy and convenient to collect data
– An experiment
• Data is not collected only for data mining
• Data accumulates at an unprecedented speed
• Data preprocessing is an important part of effective machine
learning and data mining
• Dimensionality reduction is an effective approach to
downsizing data
Document Classification
Sources of documents:
• Internet: web pages, emails
• Digital libraries: ACM Portal, PubMed, IEEE Xplore
■ Task: To classify unlabeled
documents into categories
■ Challenge: thousands of terms
■ Solution: to apply
dimensionality reduction
Document-term matrix (rows are documents D1..DM, columns are terms T1..TN, C is the category):
       T1    T2    ...   TN    C
D1     12    0     ...   6     Sports
D2     3     10    ...   28    Travel
...    ...   ...   ...   ...   ...
DM     0     11    ...   16    Jobs
Dimensionality Reduction
• Feature Selection: selecting the most relevant attributes
• Feature Extraction: combining attributes into a new, reduced set of
features
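The difference between the two approaches can be sketched on a tiny document-term matrix. Everything here is an assumption made for illustration: the counts are invented (loosely echoing the matrix on the Document Classification slide), "most relevant" is approximated by highest variance, and feature extraction is shown as a projection onto the first principal component, found by power iteration. Real projects would normally use a library rather than this hand-rolled version.

```python
# Hypothetical term counts: 4 documents (rows) x 4 terms (columns).
docs = [
    [12, 0, 6, 1],
    [3, 10, 28, 0],
    [0, 11, 16, 2],
    [9, 1, 5, 1],
]


def column_variances(rows):
    """Sample variance of each column (term)."""
    n = len(rows)
    variances = []
    for col in zip(*rows):
        mean = sum(col) / n
        variances.append(sum((v - mean) ** 2 for v in col) / (n - 1))
    return variances


def select_top_k(rows, k):
    """Feature selection: keep the k original columns with the highest variance."""
    variances = column_variances(rows)
    keep = sorted(range(len(variances)), key=lambda j: variances[j], reverse=True)[:k]
    keep.sort()
    return keep, [[row[j] for j in keep] for row in rows]


def first_principal_component(rows, iterations=200):
    """Feature extraction: one new feature that is a weighted mix of all columns."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    vec = [1.0] * d  # starting guess for the power iteration
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * vec[b] for b in range(d)) for a in range(d)]
        norm = sum(v * v for v in nxt) ** 0.5
        vec = [v / norm for v in nxt]
    scores = [sum(r[j] * vec[j] for j in range(d)) for r in centered]
    return vec, scores


kept_columns, selected = select_top_k(docs, k=2)
weights, scores = first_principal_component(docs)

print("feature selection keeps columns:", kept_columns)
print("feature extraction weights:", [round(w, 3) for w in weights])
print("one extracted feature per document:", [round(s, 2) for s in scores])
```

Selection returns a subset of the original columns, so the remaining features keep their meaning as terms. Extraction returns a new feature built from all columns, which is more compact but harder to interpret.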
Types of Machine Learning
https://www.clariba.com/machine-learning-for-business
Types of ML Algorithms
Classification vs Regression
https://medium.com/@ali_88273/regression-vs-classification-87c224350d69
Classification
• A classification problem is one where the output variable is a category,
such as “red” or “blue”, or “disease” or “no disease”
• A classification model attempts to draw a conclusion from
observed values
• Given one or more inputs, a classification model tries to predict the
value of one or more outcomes
https://developers.google.com/machine-learning/guides/text-classification/
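As a small illustration of predicting a category from inputs, here is a k-nearest-neighbours sketch in Python. The slides do not name a specific classifier, so the choice of k-nearest neighbours is an assumption, and the patient records (age, resting heart rate) and labels are invented purely for the example.

```python
import math
from collections import Counter

# Hypothetical training examples: (age in years, resting heart rate) -> label.
training = [
    ((25, 62), "no disease"),
    ((31, 70), "no disease"),
    ((38, 66), "no disease"),
    ((52, 88), "disease"),
    ((60, 92), "disease"),
    ((67, 85), "disease"),
]


def classify(point, examples, k=3):
    """Label a new point by majority vote among its k closest training examples."""
    by_distance = sorted(examples, key=lambda ex: math.dist(point, ex[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


print(classify((28, 65), training))
print(classify((63, 90), training))
```

The output is always one of the labels seen in training, never a number in between, which is the defining difference from regression.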
Regression
• A regression problem is when the output variable is a real or
continuous value, such as “salary” or “weight”
• Many different models can be used, the simplest is the linear
regression
• It tries to fit the data with the hyperplane (a straight line when there is
one input) that passes as close as possible to the points
Examples
• Regression vs Classification
– Predicting the age of a person
– Predicting the nationality of a person
– Predicting whether the stock price of a company will increase tomorrow
– Predicting the gender of a person from their handwriting style
– Predicting house price based on area
– Predicting whether the monsoon will be normal next year
– Predicting the number of copies of a music album that will be sold next month
Evaluation Metrics
• Accuracy
• Confusion Matrix
• Precision
• Recall / Sensitivity
• Specificity
• F1 Score
• Gain and Lift charts
• Root Mean Squared Error
• Root Mean Squared Logarithmic Error
• R-squared
• Cross-validation
• Gini coefficient
https://www.analyticsvidhya.com/blog/2019/08/11-important-model-evaluation-error-metrics/
https://medium.com/thalus-ai/performance-metrics-for-classification-problems-in-machine-learning-part-i-b085d432082b
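Several of the metrics listed above follow directly from the four cells of a binary confusion matrix. The sketch below computes them with the standard textbook definitions, plus root mean squared error for a regression output. The label and value lists are invented for illustration, and the functions assume every denominator is non-zero (for example, at least one positive prediction).

```python
import math


def confusion_matrix(actual, predicted, positive=1):
    """Return (tp, fp, tn, fn) for a binary problem."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        else:
            if a == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def classification_metrics(actual, predicted, positive=1):
    """Standard metrics derived from the confusion matrix (no zero-division guard)."""
    tp, fp, tn, fn = confusion_matrix(actual, predicted, positive)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # also called sensitivity
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }


def rmse(actual, predicted):
    """Root mean squared error for numeric predictions."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Hypothetical labels (1 = positive class) and model predictions.
actual_labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted_labels = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = classification_metrics(actual_labels, predicted_labels)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")

# Hypothetical numeric targets and predictions.
print(f"rmse: {rmse([3, 5, 7], [2, 5, 9]):.3f}")
```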
Statistics vs ML
https://qph.fs.quoracdn.net/main-qimg-220b49a6aa9c221f5d44877ad1f6dfd7
https://www.unitedglobalgrp.com/wp-content/uploads/2018/05/machineLearning2-830x829.png
• The major difference
between machine learning
and statistics is their purpose
• Machine learning models are
designed to make the most
accurate predictions possible
• Statistical models are
designed for inference about
the relationships between
variables
https://www.analyticsvidhya.com/blog/2015/12/hilarious-jokes-videos-statistics-data-science/
https://towardsdatascience.com/the-actual-difference-between-statistics-and-machine-learning-64b49f07ea3
ML is built upon Statistics
• Machine learning involves data, and data has to be described
using a statistical framework
• Machine learning draws upon a large number of other fields
of mathematics and computer science, for example:
– ML theory from fields like mathematics & statistics
– ML algorithms from fields like optimization, matrix algebra,
calculus
– ML implementations from computer science & engineering
concepts (e.g. kernel tricks, feature hashing)
Both machine learning and statistics have the
same objective
Statistics        Machine Learning
Estimation        Learning
Classifier        Hypothesis
Data Point        Example / Instance
Regression        Supervised Learning
Classification    Supervised Learning
Covariate         Feature
Response          Label
https://www.kdnuggets.com/2016/11/machine-learning-vs-statistics.html
Methodological differences between machine
learning and statistics
• ML professional: “The model is 85% accurate in predicting
Y, given a, b and c.”
• Statistician: “The model is 85% accurate in predicting Y,
given a, b and c; and I am 90% certain that you will obtain
the same result.”
https://www.kdnuggets.com/2016/11/machine-learning-vs-statistics.html
How is statistics used in Machine Learning?
• Do you have outliers?
• Is your data independent or correlated?
• Is your data sample identically distributed?
• Is the metric you have used to evaluate your model the
best one?
• How confident are you about the produced results?
• How can you construct a confidence interval for your
results?
https://www.quora.com/How-statistics-is-used-in-Machine-Learning
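One way to answer the confidence interval question for a reported accuracy is the normal approximation to a binomial proportion. The sketch below assumes a hypothetical test set of 200 independent examples, 170 of them classified correctly (85% accuracy), and a 90% confidence level. These numbers are invented to echo the "85% accurate" statement earlier, and other interval methods (Wilson, bootstrap) may be preferable for small samples or accuracies near 0 or 1.

```python
import math
from statistics import NormalDist


def accuracy_confidence_interval(correct, total, confidence=0.90):
    """Normal-approximation (Wald) interval for a proportion such as accuracy."""
    p = correct / total
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided critical value
    half_width = z * math.sqrt(p * (1 - p) / total)
    return max(0.0, p - half_width), min(1.0, p + half_width)


# Hypothetical evaluation: 170 correct predictions out of 200 test examples.
low, high = accuracy_confidence_interval(correct=170, total=200)
print(f"accuracy 0.850, 90% interval: ({low:.3f}, {high:.3f})")
```

The interval narrows as the test set grows, which is one concrete sense in which a statistician attaches certainty to a reported accuracy.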
7 WAYS DATA SCIENTISTS
USE STATISTICS
1. Design and interpret experiments to inform
product decisions
Observation: Advertisement variant A has a 5% higher click-through rate than
variant B.
Let's say you're a national retailer and you're trying to test the effect of a new
marketing campaign. Data Scientists can help you decide which stores you
should assign to the experimental group to get a good balance between the
experimental and control groups, what sample size you should assign to the
experimental group to get clear results, and how to run the study spending as
little money as possible.
Statistics Used: Experimental Design, Frequentist Statistics (Hypothesis
Tests and Confidence Intervals)
https://www.quora.com/How-do-data-scientists-use-statistics
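Whether a 5% higher click-through rate is a real effect depends on how much data sits behind it. A pooled two-proportion z-test is one standard frequentist check. The click and view counts below are invented (6.30% vs 6.00%, a 5% relative difference, with 10,000 views per variant) and serve only to illustrate the calculation.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(clicks_a, views_a, clicks_b, views_b):
    """Pooled two-sided z-test for a difference between two click-through rates."""
    rate_a = clicks_a / views_a
    rate_b = clicks_b / views_b
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (rate_a - rate_b) / std_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts: variant A 630/10,000 clicks, variant B 600/10,000 clicks.
z, p_value = two_proportion_z_test(clicks_a=630, views_a=10_000,
                                   clicks_b=600, views_b=10_000)
print(f"z = {z:.2f}, two-sided p-value = {p_value:.2f}")
```

With these made-up counts the p-value is well above 0.05, so the observed lift could plausibly be noise. That is why sample size is part of the experimental design question.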
2. Build models that predict signal, not noise
Observation: Sales in December increased by 5%.
Data Scientists can tell you potential reasons why sales have increased by
5%. Data scientists can help you understand what drives sales, what sales
could look like next month, and potential trends to pay attention to.
Statistics Used: Regression, Classification, Time Series Analysis, Causal
Analysis
https://www.quora.com/How-do-data-scientists-use-statistics
3. Turn big data into the big picture
Observation: Some customers only buy healthy food, while others only buy
when there's a sale.
Data Scientists can help you label each customer, group them with similar
customers, and understand their buying habits. This allows you to see how
business developments can affect certain groups of the population, instead of
looking at everyone as a whole or looking at everyone individually.
Statistics Used: Clustering, Dimensionality Reduction, Latent Variable
Analysis
https://www.quora.com/How-do-data-scientists-use-statistics
4. Understand user engagement, retention,
conversion, and leads
Observation: A lot of people are signing up for our site and never coming
back.
Why do your customers buy items from your site? How do you keep your
clients coming back? Why are users dropping out of your funnel? When will
they come out next? What kinds of emails from your company are most
successfully engaging users? What are some leading indicators of
engagement, activity, or success? What are some good sales leads?
Statistics Used: Regression, Causal Effects Analysis, Latent Variable
analysis, Survey Design
https://www.quora.com/How-do-data-scientists-use-statistics
5. Give your users what they want
Given a matrix of users (customers, clients, users), and their interactions
(clicks, purchases, ratings) with your company's items (ads, goods, movies),
can you suggest what items your users will want next?
Statistics Used: Predictive Modeling, Latent Variable Analysis, Dimensionality
Reduction, Collaborative Filtering, Clustering
https://www.quora.com/How-do-data-scientists-use-statistics
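A minimal user-based collaborative filtering sketch shows the idea of suggesting items from a user-item matrix. The users, items and ratings are invented, and cosine similarity with a similarity-weighted average is just one simple scheme among many. It is not presented as the method any particular company uses.

```python
import math

# Hypothetical ratings on a 1-5 scale; a missing key means "not rated yet".
ratings = {
    "ann": {"movie_a": 5, "movie_b": 4, "movie_c": 1},
    "bob": {"movie_a": 4, "movie_b": 5},
    "cat": {"movie_c": 5, "movie_d": 4},
    "dan": {"movie_a": 5, "movie_d": 5},
}


def cosine_similarity(a, b):
    """Cosine similarity between two users, treating unrated items as 0."""
    shared = set(a) & set(b)
    dot = sum(a[item] * b[item] for item in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def recommend(user, all_ratings):
    """Predict ratings for unseen items as a similarity-weighted average."""
    seen = all_ratings[user]
    weighted, weights = {}, {}
    for other, other_ratings in all_ratings.items():
        if other == user:
            continue
        similarity = cosine_similarity(seen, other_ratings)
        if similarity <= 0:
            continue  # users with nothing in common contribute no evidence
        for item, rating in other_ratings.items():
            if item in seen:
                continue
            weighted[item] = weighted.get(item, 0.0) + similarity * rating
            weights[item] = weights.get(item, 0.0) + similarity
    predictions = {item: weighted[item] / weights[item] for item in weighted}
    return sorted(predictions.items(), key=lambda pair: pair[1], reverse=True)


suggestions = recommend("bob", ratings)
print(suggestions)
```

Here the top suggestion for "bob" comes from the users whose past ratings most resemble his, which is the core intuition behind collaborative filtering.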
6. Estimate intelligently
Observation: We have a banner with 100 impressions and 0 clicks.
Is 0% a good estimate of the click-through-rate?
Data Scientists can incorporate data, global data, and prior knowledge to get
a desirable estimate, tell you the properties of that estimate, and summarize
what the estimate means.
Statistics Used: Bayesian Data Analysis
https://www.quora.com/How-do-data-scientists-use-statistics
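The banner question can be made concrete with a Beta-Binomial model, a common textbook choice for click-through rates. The prior below, Beta(1, 49), encodes a hypothetical belief that similar banners get about a 2% click-through rate. That prior is an assumption made for illustration and is not taken from the slides.

```python
def posterior_click_rate(clicks, impressions, prior_clicks=1.0, prior_misses=49.0):
    """Posterior mean of the click-through rate under a Beta prior.

    prior_clicks and prior_misses are the Beta(alpha, beta) parameters; the
    defaults are a hypothetical prior centred on 2%.
    """
    alpha = prior_clicks + clicks
    beta = prior_misses + (impressions - clicks)
    return alpha / (alpha + beta)


raw_estimate = 0 / 100
bayes_estimate = posterior_click_rate(clicks=0, impressions=100)
print(f"raw estimate:       {raw_estimate:.4f}")
print(f"posterior estimate: {bayes_estimate:.4f}")
```

The posterior estimate is small but not zero: the 100 impressions pull the prior 2% down towards the observed 0% without collapsing onto it.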
7. Tell the story with the data
The Data Scientist's role in the company is to serve as the ambassador
between the data and the company. Communication is key, and the Data
Scientist must be able to explain their insights in a way that the company can get
on board with, without sacrificing the fidelity of the data.
The Data Scientist does not simply summarize the numbers, but explains why
the numbers are important and what actionable insights one can get from them.
The Data Scientist is the storyteller of the company, communicating the
meaning of the data and why it is important to the company.
Statistics Used: Presenting and Communicating Data, Data Visualization
https://www.quora.com/How-do-data-scientists-use-statistics
Resources for you to start…
Fun ML projects for beginners
• Machine Learning Gladiator
• Play Money Ball
• Predict Stock Prices
• Teach a Neural Network to Read Handwriting
• Investigate Enron
• Write ML Algorithms from Scratch
• Mine Social Media Sentiment
• Improve Health Care
https://elitedatascience.com/machine-learning-projects-for-beginners
Interesting ML projects to start trying
• Beginner Level
– Iris Data
– Loan Prediction Data
– Bigmart Sales Data
– Boston Housing Data
– Time Series Analysis Data
– Wine Quality Data
– Turkiye Student Evaluation Data
– Heights and Weights Data
• Intermediate Level
– Black Friday Data
– Human Activity Recognition Data
– Siam Competition Data
– Trip History Data
– Million Song Data
– Census Income Data
– Movie Lens Data
– Twitter Classification Data
• Advanced Level
– Identify your Digits
– Urban Sound Classification
– Vox Celebrity Data
– ImageNet Data
– Chicago Crime Data
– Age Detection of Indian Actors Data
– Recommendation Engine Data
– VisualQA Data
https://www.analyticsvidhya.com/blog/2018/05/24-ultimate-data-science-projects-to-boost-your-knowledge-and-skills/
Dr Ganesh Neelakanta Iyer
ganesh@ganeshniyer.com
ganesh.vigneswara@gmail.com
http://ganeshniyer.com
https://www.linkedin.com/in/ganeshniyer/
https://bit.ly/MLPlaylistGanesh

More Related Content

PDF
Machine Learning Introduction
PDF
"Introduction to Machine Learning and its Applications" at sapthgiri engineer...
PDF
Machine Learning and its Applications
PDF
AI Orange Belt - Session 2
PDF
Barga DIDC'14 Invited Talk
PDF
AI Orange Belt - Session 1
PDF
Machine Learning: Applications, Process and Techniques
PDF
Machine learning and_buzzwords
Machine Learning Introduction
"Introduction to Machine Learning and its Applications" at sapthgiri engineer...
Machine Learning and its Applications
AI Orange Belt - Session 2
Barga DIDC'14 Invited Talk
AI Orange Belt - Session 1
Machine Learning: Applications, Process and Techniques
Machine learning and_buzzwords

What's hot (20)

PPTX
Application of machine learning in industrial applications
PPTX
AI Orange Belt - Session 3
PDF
Introduction To Machine Learning | Edureka
PDF
Barga Data Science lecture 2
PPTX
Product School - AI Funding / Trends & Product Management
PDF
Towards a Reactive Game Engine
PDF
Sentiment Analysis | Machine Learning Algorithms | Data Science Tutorial | Ed...
PDF
DL Classe 0 - You can do it
PPTX
Sippin: A Mobile Application Case Study presented at Techfest Louisville
PPTX
What Is Data Science? | Introduction to Data Science | Data Science For Begin...
PDF
AI, Machine Learning and Deep Learning - The Overview
PPTX
Artificial Intelligence for Automated Decision Support Project
PPTX
Artificial intelligence: Simulation of Intelligence
PDF
Barga Galvanize Sept 2015
PPTX
AI Orange Belt - Session 4
PDF
Barga Data Science lecture 4
PDF
Barga Data Science lecture 1
PDF
Guide to end end machine learning projects
PDF
Healthcare + AI: Use cases & Challenges
PPTX
Mathematics, Machine Learning and ML Engineering
Application of machine learning in industrial applications
AI Orange Belt - Session 3
Introduction To Machine Learning | Edureka
Barga Data Science lecture 2
Product School - AI Funding / Trends & Product Management
Towards a Reactive Game Engine
Sentiment Analysis | Machine Learning Algorithms | Data Science Tutorial | Ed...
DL Classe 0 - You can do it
Sippin: A Mobile Application Case Study presented at Techfest Louisville
What Is Data Science? | Introduction to Data Science | Data Science For Begin...
AI, Machine Learning and Deep Learning - The Overview
Artificial Intelligence for Automated Decision Support Project
Artificial intelligence: Simulation of Intelligence
Barga Galvanize Sept 2015
AI Orange Belt - Session 4
Barga Data Science lecture 4
Barga Data Science lecture 1
Guide to end end machine learning projects
Healthcare + AI: Use cases & Challenges
Mathematics, Machine Learning and ML Engineering
Ad

Similar to Machine Learning for Statisticians - Introduction (20)

PPTX
MACHINE LEARNING PPT.pptx for the machine learning studnets
PPTX
Ml - A shallow dive
PPT
intro to ML by the way m toh phasee movie Punjabi
PDF
Introduction to Machine Learning
PPTX
Unit - 1 - Introduction of the machine learning
PDF
Case study on machine learning
PDF
Introduction to machine learning
PPTX
Machine learning ppt.
PDF
Machine Learning: Artificial Intelligence isn't just a Science Fiction topic
PPTX
Machine learning
PDF
ML.pdf
PDF
Artificial Intelligence For Data Science In Theory And Practice Mohamed Allog...
PDF
Efficient Learning Machines Theories Concepts And Applications For Engineers ...
PPTX
Introduction to Machine Learning.pptx
PPTX
machine learning introduction notes foRr
PPTX
ppt on introduction to Machine learning tools
PDF
Intro to machine learning
PDF
DSCI 552 machine learning for data science
PDF
The Ultimate Guide to Machine Learning (ML)
PDF
Machine Learning: Inteligencia Artificial no es sólo un tema de Ciencia Ficci...
MACHINE LEARNING PPT.pptx for the machine learning studnets
Ml - A shallow dive
intro to ML by the way m toh phasee movie Punjabi
Introduction to Machine Learning
Unit - 1 - Introduction of the machine learning
Case study on machine learning
Introduction to machine learning
Machine learning ppt.
Machine Learning: Artificial Intelligence isn't just a Science Fiction topic
Machine learning
ML.pdf
Artificial Intelligence For Data Science In Theory And Practice Mohamed Allog...
Efficient Learning Machines Theories Concepts And Applications For Engineers ...
Introduction to Machine Learning.pptx
machine learning introduction notes foRr
ppt on introduction to Machine learning tools
Intro to machine learning
DSCI 552 machine learning for data science
The Ultimate Guide to Machine Learning (ML)
Machine Learning: Inteligencia Artificial no es sólo un tema de Ciencia Ficci...
Ad

More from Dr Ganesh Iyer (20)

PDF
SRE Demystified - 16 - NALSD - Non-Abstract Large System Design
PDF
SRE Demystified - 14 - SRE Practices overview
PDF
SRE Demystified - 13 - Docs that matter -2
PDF
SRE Demystified - 12 - Docs that matter -1
PDF
SRE Demystified - 01 - SLO SLI and SLA
PDF
SRE Demystified - 11 - Release management-2
PDF
SRE Demystified - 10 - Release management-1
PDF
SRE Demystified - 09 - Simplicity
PDF
SRE Demystified - 07 - Practical Alerting
PDF
SRE Demystified - 06 - Distributed Monitoring
PDF
SRE Demystified - 05 - Toil Elimination
PDF
SRE Demystified - 04 - Engagement Model
PDF
SRE Demystified - 03 - Choosing SLIs and SLOs
PDF
Making Decisions - A Game Theoretic approach
PDF
Cloud and Industry4.0
PDF
Game Theory and Engineering Applications
PDF
How to become a successful entrepreneur
PDF
Dockers and kubernetes
PDF
Containerization Principles Overview for app development and deployment
PDF
Game Theory and Engineering Applications
SRE Demystified - 16 - NALSD - Non-Abstract Large System Design
SRE Demystified - 14 - SRE Practices overview
SRE Demystified - 13 - Docs that matter -2
SRE Demystified - 12 - Docs that matter -1
SRE Demystified - 01 - SLO SLI and SLA
SRE Demystified - 11 - Release management-2
SRE Demystified - 10 - Release management-1
SRE Demystified - 09 - Simplicity
SRE Demystified - 07 - Practical Alerting
SRE Demystified - 06 - Distributed Monitoring
SRE Demystified - 05 - Toil Elimination
SRE Demystified - 04 - Engagement Model
SRE Demystified - 03 - Choosing SLIs and SLOs
Making Decisions - A Game Theoretic approach
Cloud and Industry4.0
Game Theory and Engineering Applications
How to become a successful entrepreneur
Dockers and kubernetes
Containerization Principles Overview for app development and deployment
Game Theory and Engineering Applications

Recently uploaded (20)

PDF
Machine learning based COVID-19 study performance prediction
PPTX
breach-and-attack-simulation-cybersecurity-india-chennai-defenderrabbit-2025....
PDF
CIFDAQ's Market Insight: SEC Turns Pro Crypto
PDF
Bridging biosciences and deep learning for revolutionary discoveries: a compr...
PDF
Network Security Unit 5.pdf for BCA BBA.
PPTX
Cloud computing and distributed systems.
PDF
Reach Out and Touch Someone: Haptics and Empathic Computing
PDF
Modernizing your data center with Dell and AMD
PPTX
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
PDF
The Rise and Fall of 3GPP – Time for a Sabbatical?
PDF
cuic standard and advanced reporting.pdf
PPT
“AI and Expert System Decision Support & Business Intelligence Systems”
PDF
[발표본] 너의 과제는 클라우드에 있어_KTDS_김동현_20250524.pdf
PDF
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
PDF
GDG Cloud Iasi [PUBLIC] Florian Blaga - Unveiling the Evolution of Cybersecur...
PDF
NewMind AI Weekly Chronicles - August'25 Week I
PPTX
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
PPT
Teaching material agriculture food technology
PDF
Chapter 3 Spatial Domain Image Processing.pdf
PDF
GamePlan Trading System Review: Professional Trader's Honest Take
Machine learning based COVID-19 study performance prediction
breach-and-attack-simulation-cybersecurity-india-chennai-defenderrabbit-2025....
CIFDAQ's Market Insight: SEC Turns Pro Crypto
Bridging biosciences and deep learning for revolutionary discoveries: a compr...
Network Security Unit 5.pdf for BCA BBA.
Cloud computing and distributed systems.
Reach Out and Touch Someone: Haptics and Empathic Computing
Modernizing your data center with Dell and AMD
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
The Rise and Fall of 3GPP – Time for a Sabbatical?
cuic standard and advanced reporting.pdf
“AI and Expert System Decision Support & Business Intelligence Systems”
[발표본] 너의 과제는 클라우드에 있어_KTDS_김동현_20250524.pdf
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
GDG Cloud Iasi [PUBLIC] Florian Blaga - Unveiling the Evolution of Cybersecur...
NewMind AI Weekly Chronicles - August'25 Week I
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
Teaching material agriculture food technology
Chapter 3 Spatial Domain Image Processing.pdf
GamePlan Trading System Review: Professional Trader's Honest Take

Machine Learning for Statisticians - Introduction

  • 1. Introduction to Machine Learning for Statisticians ganesh.vigneswara@gmail.com, ganesh@ganeshniyer.com Dr Ganesh Neelakanta Iyer Industry Expert, Academician, Researcher, YouTuber, Kathakali Artist http://ganeshniyer.com, https://www.linkedin.com/in/ganeshniyer/
  • 2. About Me • Masters & PhD from National University of Singapore (NUS) • Several years in Industry/Academia • Architect, Manager, Technology Evangelist, Professor • Talks/workshops in USA, Europe, Australia, Asia • Cloud Computing, Game Theory, Machine Learning, DevOps, SRE • Kathakali Artist, Composer, Speaker, Traveler, YouTuber GANESHNIYER http://ganeshniyer.com https://bit.ly/MLPlaylistGanesh
  • 3. Agenda Introduction • Artificial Intelligence • AI vs ML Machine Learning • Introduction • Types of ML • Applications • ML Algorithms ML vs Statistics ML resources • Courses • Data Sets • Projects
  • 4. DISCLAIMER • I am NOT an expert in Machine Learning. I intend to share some knowledge I have to help you kick-start your interest • I have been informed that audience are new to this area. So the session is a GENTLE introduction to ML and what it means for statisticians • For all guys who are forced to be here today, please enjoy Dilbert cartoons and pictures of countries I have been
  • 5. Dr Ganesh Neelakanta Iyer 5 nCorona
  • 8. 8 BlueDot – an AI company made its first alert on December 31st. This was ahead of the US Centers for Disease Control and Prevention, which made its own determination on January 6th. https://www.forbes.com/sites/tomtaulli/2020/02/02/coronavirus-can-ai-artificial-intelligence-make-a-difference/#41dd3f555817
  • 10. nCorona - AI • “We are currently using natural language processing (NLP) and machine learning (ML) to process vast amounts of unstructured text data, currently in 65 languages, to track outbreaks of over 100 different diseases, every 15 minutes around the clock,” said Kamran Khan, founder of BlueDot • “If we did this work manually, we would probably need over a hundred people to do it well. These data analytics enable health experts to focus their time and energy on how to respond to infectious disease risks, rather than spending their time and energy gathering and organizing information.” 10 https://www.forbes.com/sites/tomtaulli/2020/02/02/coronavirus-can-ai-artificial-intelligence-make-a-difference/#41dd3f555817
  • 12. Dr Ganesh Neelakanta Iyer Artificial Intelligence • “The study of the modelling of human mental functions by computer programs.” — Collins Dictionary 12https://medium.com/life-of-a-technologist/what-would-the-managers-manage-in-the- age-of-ai-6a00c26df257
  • 13. Dr Ganesh Neelakanta Iyer Artificial Intelligence • AI is composed of 2 words Artificial and Intelligence • Anything which is not natural and created by humans is artificial • Intelligence means ability to understand, reason, plan etc. • So any code, tech or algorithm that enable machine to mimic, develop or demonstrate the human cognition or behavior is AI 13
  • 16. McDonald’s + Dynamic Yield • McDonald’s thinks AI can help it sell more fast food to customers • The company has announced that it is acquiring Dynamic Yield, an Israeli company that uses AI to customise experiences • McDonald's would use AI to tweak the menu options on the displays in the outlets, based on factors such as the time of day, the weather outside and how busy the restaurant is at the time • If it is warm outside, the menu could offer more options for cold drinks such as shakes, and perhaps more warm tea options if it is cold outside • The system will also make recommendations in real-time for additional items that a customer might want to order, based on what they had already ordered https://www.news18.com/news/tech/a-burger-french-fries-and-some-artificial-intelligence-with-your-next-mcdonalds-order-2078213.html
  • 17. Artificial Intelligence vs Machine Learning
  • 20. Machine Learning • Machine learning is the field of study that gives computers the ability to learn without being explicitly programmed. • In simple terms, machine learning means making predictions based on data
  • 21. Machine Learning https://towardsdatascience.com/machine-learning-65dbd95f1603
  • 22. A quick history: from intuition to machine learning • Early: Intuition (using experience and judgement to predict outcomes) • 1900s: Manual analysis (manual calculations to predict outcomes) • 1970s: Statistical programming languages (writing code to construct statistical models) • 1990s: Visual statistical software (drag and drop workflows with menu driven commands to set up and run statistical analysis) • Now: Automated machine learning (the software knows how to analyse your data and does it for you) Slide credit: Edit
  • 23. Why Machine Learning is Hard: what you see vs. what your ML algorithm sees
  • 24. Why Machine Learning Is Hard, Redux: What is a “2”?
  • 25. Why is machine learning hard? Learning to identify an ‘apple’ • Apple: colour red, type fruit, shape oval • Apple corporation: colour white, type logo, shape cut oval • Peach: colour red, type fruit, shape round Slide credit: Edit
  • 26. So much for a cat. Principle of machine learning Slide credit: Edit
  • 29. Google ML
  • 30. Google Translate
  • 31. Google Voice search
  • 32. Google Photos
  • 33. Gmail smart reply
  • 34. Google Maps
  • 36. Example 101
  • 37. Example • Suppose we want to create a system that tells us the expected weight of a person based on their height • First, we collect the data • Each point on the graph represents a data point https://towardsdatascience.com/cousins-of-artificial-intelligence-dda4edc27b55
  • 38. Example • To start with, we draw a simple line to predict weight based on height • A simple line could be W = H - 100 • Where – W = weight in kg – H = height in cm https://towardsdatascience.com/cousins-of-artificial-intelligence-dda4edc27b55
  • 39. Example • This line can help us make predictions • Our main goal is to reduce the distance between the estimated value and the actual value, i.e. the error • To achieve this, we draw the straight line that fits the points as closely as possible; with real data it will rarely pass through every point https://towardsdatascience.com/cousins-of-artificial-intelligence-dda4edc27b55
  • 40. Example • Our main goal is to make the errors as small as possible • Decreasing the error between actual and estimated values improves the performance of the model, and the more data points we collect, the better our model will generally become • So when we feed in new data (the height of a person), the model can predict the weight of that person https://towardsdatascience.com/cousins-of-artificial-intelligence-dda4edc27b55
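The height and weight example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the five data points are made up, the function names `fit_line` and `mean_squared_error` are hypothetical, and ordinary least squares is used as one common way to "minimize the error" (the slides do not name a specific method). It compares the hand-drawn line W = H - 100 with the fitted line.

```python
from statistics import mean


def fit_line(heights, weights):
    """Ordinary least squares fit of W = slope * H + intercept."""
    h_bar, w_bar = mean(heights), mean(weights)
    sxx = sum((h - h_bar) ** 2 for h in heights)
    sxy = sum((h - h_bar) * (w - w_bar) for h, w in zip(heights, weights))
    slope = sxy / sxx
    intercept = w_bar - slope * h_bar
    return slope, intercept


def mean_squared_error(heights, weights, slope, intercept):
    """Average squared distance between actual and estimated weights."""
    return mean((w - (slope * h + intercept)) ** 2
                for h, w in zip(heights, weights))


# Hypothetical data, invented for illustration: heights in cm, weights in kg.
heights = [150, 160, 170, 180, 190]
weights = [52, 58, 69, 77, 94]

# Error of the hand-drawn starting line W = H - 100.
guess_mse = mean_squared_error(heights, weights, 1.0, -100.0)

# Error of the least squares line fitted to the data.
slope, intercept = fit_line(heights, weights)
fit_mse = mean_squared_error(heights, weights, slope, intercept)

# Predict the weight of a new person who is 175 cm tall.
predicted_weight = slope * 175 + intercept
print(f"W = {slope:.2f} * H + {intercept:.2f}")
print(f"MSE of guess: {guess_mse:.2f}, MSE of fit: {fit_mse:.2f}")
```

On this toy data the fitted line has a slightly lower mean squared error than the initial guess, which is the sense in which the model "learns" from the points.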
  • 44. Types of Data • Numerical: Discrete, Continuous, Interval, Ratio • Categorical: Nominal, Ordinal • Time series • Text
  • 45. How do you get data?
  • 46. Resources: Datasets • UCI Repository: http://www.ics.uci.edu/~mlearn/MLRepository.html • UCI KDD Archive: http://kdd.ics.uci.edu/summary.data.application.html • Kaggle https://www.kaggle.com/ • India Govt ISRO Data Sets https://bhuvan.nrsc.gov.in/bhuvan_links.php • NIST https://data.world/nist • Statlib: http://lib.stat.cmu.edu/ • Delve: http://www.cs.utoronto.ca/~delve/
  • 47. Generate your own set
  • 48. Generate your own set
  • 50. Dimensionality Reduction • It is so easy and convenient to collect data – An experiment • Data is not collected only for data mining • Data accumulates at an unprecedented speed • Data preprocessing is an important part of effective machine learning and data mining • Dimensionality reduction is an effective approach to downsizing data
  • 51. Document Classification • Sources: Internet, ACM Portal, PubMed, IEEE Xplore, Digital Libraries, Web Pages, Emails ■ Task: To classify unlabeled documents into categories ■ Challenge: thousands of terms ■ Solution: to apply dimensionality reduction • Term-document matrix (rows: documents D1 … DM; columns: terms T1, T2, …, TN and class C): D1 = 12, 0, …, 6 (Sports); D2 = 3, 10, …, 28 (Travel); …; DM = 0, 11, …, 16 (Jobs)
  • 52. Dimensionality Reduction • Feature Selection: selecting the most relevant attributes • Feature Extraction: combining attributes into a new reduced set of features
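The difference between feature selection and feature extraction can be illustrated with a small sketch. Everything here is an assumption made for illustration: the helper names are hypothetical, the selection rule (keep the columns with the highest variance) and the extraction rule (average groups of columns) are deliberately simplistic stand-ins, and the tiny matrix reuses a few counts in the style of a term-document matrix plus an invented constant column. Real projects typically rely on more careful methods.

```python
from statistics import pvariance


def select_top_k_by_variance(rows, k):
    """Feature selection: keep the k original columns with the largest variance."""
    columns = list(zip(*rows))
    variances = [pvariance(col) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda j: variances[j], reverse=True)
    keep = sorted(ranked[:k])  # preserve the original column order
    return keep, [[row[j] for j in keep] for row in rows]


def extract_by_averaging(rows, groups):
    """Feature extraction: build new features by averaging groups of columns."""
    return [[sum(row[j] for j in group) / len(group) for group in groups]
            for row in rows]


# Hypothetical document-term counts: 3 documents, 4 terms.
# The last column is constant, so it carries no information.
rows = [
    [12, 0, 6, 1],
    [3, 10, 28, 1],
    [0, 11, 16, 1],
]

kept_columns, selected = select_top_k_by_variance(rows, k=2)
extracted = extract_by_averaging(rows, groups=[[0, 1], [2, 3]])

print("kept columns:", kept_columns)
print("selected features:", selected)
print("extracted features:", extracted)
```

Selection returns a subset of the original columns unchanged, while extraction returns new columns that did not exist in the input.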
  • 53. Types of Machine Learning
  • 56. Types of ML Algorithms
  • 57. Classification vs Regression https://medium.com/@ali_88273/regression-vs-classification-87c224350d69
  • 58. Classification • A classification problem is when the output variable is a category, such as “red” or “blue”, or “disease” or “no disease” • A classification model attempts to draw some conclusion from observed values • Given one or more inputs, a classification model will try to predict the value of one or more outcomes https://developers.google.com/machine-learning/guides/text-classification/
  • 60. Regression • A regression problem is when the output variable is a real or continuous value, such as “salary” or “weight” • Many different models can be used; the simplest is linear regression • It tries to fit the data with the hyperplane that lies closest to the points
  • 61. Examples • Regression vs Classification – Predicting the age of a person – Predicting the nationality of a person – Predicting whether the stock price of a company will increase tomorrow – Predicting the gender of a person from his/her handwriting style – Predicting house price based on area – Predicting whether the monsoon will be normal next year – Predicting the number of copies of a music album that will be sold next month
  • 64. Evaluation Metrics • Accuracy • Confusion Matrix • Precision • Recall / Sensitivity • Specificity • F1 Score • Gain and Lift charts • Root Mean Squared Error • Root Mean Squared Logarithmic Error • R-squared • Cross-validation • Gini coefficient https://www.analyticsvidhya.com/blog/2019/08/11-important-model-evaluation-error-metrics/ https://medium.com/thalus-ai/performance-metrics-for-classification-problems-in-machine-learning-part-i-b085d432082b
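Several of the metrics listed above follow directly from the four cells of a binary confusion matrix. The sketch below is a minimal illustration using only the standard definitions; the function names and the label vectors are hypothetical, and returning 0.0 when a denominator is zero is a simplifying assumption.

```python
from math import sqrt


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) for a binary classification problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def safe_div(num, den):
    """Divide, returning 0.0 when the denominator is zero (an assumption)."""
    return num / den if den else 0.0


def classification_metrics(y_true, y_pred, positive=1):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)  # also called sensitivity
    return {
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": safe_div(tn, tn + fp),
        "f1": safe_div(2 * precision * recall, precision + recall),
    }


def rmse(actual, predicted):
    """Root mean squared error, a regression metric."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Hypothetical labels: 1 = "disease", 0 = "no disease".
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)

# Hypothetical regression targets and predictions.
error = rmse([3, 5, 7], [2, 5, 9])

print(metrics)
print(f"RMSE: {error:.3f}")
```

Accuracy, precision, recall, specificity and F1 apply to classification, while RMSE applies to regression, which mirrors the split between the two problem types.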
  • 67. Statistics vs ML • The major difference between machine learning and statistics is their purpose • Machine learning models are designed to make the most accurate predictions possible • Statistical models are designed for inference about the relationships between variables https://www.analyticsvidhya.com/blog/2015/12/hilarious-jokes-videos-statistics-data-science/
  • 69. ML is built upon Statistics • Machine learning involves data, and data has to be described using a statistical framework • Machine learning also draws upon a large number of other fields of mathematics and computer science, for example: • ML theory from fields like mathematics & statistics • ML algorithms from fields like optimization, matrix algebra, calculus • ML implementations from computer science & engineering concepts (e.g. kernel tricks, feature hashing)
  • 70. Both machine learning and statistics have the same objective • Statistics term → Machine learning term • Estimation → Learning • Classifier → Hypothesis • Data point → Example/Instance • Regression → Supervised learning • Classification → Supervised learning • Covariate → Feature • Response → Label https://www.kdnuggets.com/2016/11/machine-learning-vs-statistics.html
  • 71. Methodological differences between machine learning and statistics • ML professional: “The model is 85% accurate in predicting Y, given a, b and c.” • Statistician: “The model is 85% accurate in predicting Y, given a, b and c; and I am 90% certain that you will obtain the same result.” https://www.kdnuggets.com/2016/11/machine-learning-vs-statistics.html
  • 72. How is statistics used in Machine Learning? • Do you have outliers? • Is your data independent or correlated? • Is your data sample identically distributed? • Is the metric you have used to evaluate your model the best one? • How confident are you about the produced results? • How can you construct a confidence interval for your results? https://www.quora.com/How-statistics-is-used-in-Machine-Learning
  • 73. 7 WAYS DATA SCIENTISTS USE STATISTICS
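As one possible answer to the confidence interval question, the sketch below builds a normal-approximation (Wald) interval around a model's measured accuracy. The function name, the test set size of 200 and the 90% confidence level are assumptions chosen for illustration; the approximation itself is only reasonable when the test set is fairly large and the accuracy is not too close to 0 or 1.

```python
from math import sqrt
from statistics import NormalDist


def accuracy_confidence_interval(correct, total, confidence=0.90):
    """Normal-approximation (Wald) interval for a proportion such as accuracy."""
    p = correct / total
    # Two-sided critical value, e.g. about 1.645 for 90% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * sqrt(p * (1 - p) / total)
    return max(0.0, p - margin), min(1.0, p + margin)


# Hypothetical evaluation: 170 correct predictions on 200 held-out examples,
# i.e. 85% measured accuracy.
lower, upper = accuracy_confidence_interval(correct=170, total=200, confidence=0.90)
print(f"90% interval for accuracy: [{lower:.3f}, {upper:.3f}]")
```

Reporting the interval alongside the point estimate is a small step from "the model is 85% accurate" towards a statement about how much that number could move on new data.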
  • 74. 1. Design and interpret experiments to inform product decisions Observation: Advertisement variant A has a 5% higher click-through rate than variant B. Let's say you're a national retailer and you're trying to test the effect of a new marketing campaign. Data Scientists can help you decide which stores you should assign to the experimental group to get a good balance between the experimental and control groups, what sample size you should assign to the experimental group to get clear results, and how to run the study spending as little money as possible. Statistics Used: Experimental Design, Frequentist Statistics (Hypothesis Tests and Confidence Intervals) https://www.quora.com/How-do-data-scientists-use-statistics
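A hypothesis test shows whether an observation like "variant A has a 5% higher click-through rate" is more than noise. The sketch below uses a two-proportion z-test, which is one standard frequentist choice; the function name and the click and view counts are hypothetical, and the 5% difference is interpreted as a relative lift (10.5% versus 10.0%).

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(clicks_a, views_a, clicks_b, views_b):
    """Two-sided z-test for the difference between two click-through rates."""
    rate_a = clicks_a / views_a
    rate_b = clicks_b / views_b
    # Pooled rate under the null hypothesis that both variants are equal.
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (rate_a - rate_b) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical experiment: each variant is shown 10,000 times.
z, p_value = two_proportion_z_test(clicks_a=1050, views_a=10_000,
                                   clicks_b=1000, views_b=10_000)
print(f"z = {z:.2f}, p-value = {p_value:.3f}")
```

With these invented counts the p-value is well above the conventional 0.05 threshold, so the apparent lift would not be treated as statistically significant; a larger sample would be needed to confirm a difference of this size.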
  • 75. 2. Build models that predict signal, not noise Observation: Sales in December increased by 5%. Data Scientists can tell you potential reasons why sales have increased by 5%. Data scientists can help you understand what drives sales, what sales could look like next month, and potential trends to pay attention to. Statistics Used: Regression, Classification, Time Series Analysis, Causal Analysis https://www.quora.com/How-do-data-scientists-use-statistics
  • 76. 3. Turn big data into the big picture Observation: Some customers only buy healthy food, while others only buy when there's a sale. Data Scientists can help you label each customer, group them with similar customers, and understand their buying habits. This allows you to see how business developments can affect certain groups of the population, instead of looking at everyone as a whole or looking at everyone individually. Statistics Used: Clustering, Dimensionality Reduction, Latent Variable Analysis https://www.quora.com/How-do-data-scientists-use-statistics
  • 77. 4. Understand user engagement, retention, conversion, and leads Observation: A lot of people are signing up for our site and never coming back. Why do your customers buy items from your site? How do you keep your clients coming back? Why are users dropping out of your funnel? When will they come out next? What kinds of emails from your company are most successfully engaging users? What are some leading indicators of engagement, activity, or success? What are some good sales leads? Statistics Used: Regression, Causal Effects Analysis, Latent Variable analysis, Survey Design https://www.quora.com/How-do-data-scientists-use-statistics
  • 78. 5. Give your users what they want Given a matrix of users (customers, clients, users), and their interactions (clicks, purchases, ratings) with your company's items (ads, goods, movies), can you suggest what items your users will want next? Statistics Used: Predictive Modeling, Latent Variable Analysis, Dimensionality Reduction, Collaborative Filtering, Clustering https://www.quora.com/How-do-data-scientists-use-statistics
  • 79. 6. Estimate intelligently Observation: We have a banner with 100 impressions and 0 clicks. Is 0% a good estimate of the click-through-rate? Data Scientists can incorporate data, global data, and prior knowledge to get a desirable estimate, tell you the properties of that estimate, and summarize what the estimate means. Statistics Used: Bayesian Data Analysis https://www.quora.com/How-do-data-scientists-use-statistics
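The banner question can be made concrete with a Beta-Binomial model, a common textbook approach to this kind of estimate. The sketch is an illustration only: the function name is hypothetical, and both priors are assumptions. Beta(1, 1) is a uniform prior, and Beta(1, 99) encodes an invented belief that banners like this one average about a 1% click-through rate.

```python
def posterior_click_rate(clicks, impressions, prior_alpha, prior_beta):
    """Posterior mean of the click-through rate under a Beta prior.

    With a Beta(prior_alpha, prior_beta) prior and binomial data, the posterior
    is Beta(prior_alpha + clicks, prior_beta + impressions - clicks).
    """
    alpha = prior_alpha + clicks
    beta = prior_beta + (impressions - clicks)
    return alpha / (alpha + beta)


clicks, impressions = 0, 100

# Naive estimate: observed clicks divided by impressions.
raw_estimate = clicks / impressions

# Uniform prior Beta(1, 1): no prior knowledge about the rate.
uniform_prior_estimate = posterior_click_rate(clicks, impressions, 1, 1)

# Hypothetical informed prior Beta(1, 99): prior mean of 1%.
informed_prior_estimate = posterior_click_rate(clicks, impressions, 1, 99)

print(f"raw: {raw_estimate:.4f}")
print(f"uniform prior: {uniform_prior_estimate:.4f}")
print(f"informed prior: {informed_prior_estimate:.4f}")
```

Both posterior estimates are small but strictly above zero, which reflects the intuition that 100 impressions without a click do not prove the banner can never be clicked.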
  • 80. 7. Tell the story with the data The Data Scientist's role in the company is to serve as the ambassador between the data and the company. Communication is key, and the Data Scientist must be able to explain their insights in a way that the company can get aboard, without sacrificing the fidelity of the data. The Data Scientist does not simply summarize the numbers, but explains why the numbers are important and what actionable insights one can get from these. The Data Scientist is the storyteller of the company, communicating the meaning of the data and why it is important to the company. Statistics Used: Presenting and Communicating Data, Data Visualization https://www.quora.com/How-do-data-scientists-use-statistics
  • 81. Resources for you to start…
  • 82. Fun ML projects for beginners • Machine Learning Gladiator • Play Money Ball • Predict Stock Prices • Teach a Neural Network to Read Handwriting • Investigate Enron • Write ML Algorithms from Scratch • Mine Social Media Sentiment • Improve Health Care https://elitedatascience.com/machine-learning-projects-for-beginners
  • 84. Interesting ML projects to start trying • Beginner Level – Iris Data – Loan Prediction Data – Bigmart Sales Data – Boston Housing Data – Time Series Analysis Data – Wine Quality Data – Turkiye Student Evaluation Data – Heights and Weights Data • Intermediate Level – Black Friday Data – Human Activity Recognition Data – Siam Competition Data – Trip History Data – Million Song Data – Census Income Data – Movie Lens Data – Twitter Classification Data • Advanced Level – Identify your Digits – Urban Sound Classification – Vox Celebrity Data – ImageNet Data – Chicago Crime Data – Age Detection of Indian Actors Data – Recommendation Engine Data – VisualQA Data https://www.analyticsvidhya.com/blog/2018/05/24-ultimate-data-science-projects-to-boost-your-knowledge-and-skills/
  • 86. Dr Ganesh Neelakanta Iyer ganesh@ganeshniyer.com ganesh.vigneswara@gmail.com http://ganeshniyer.com https://www.linkedin.com/in/ganeshniyer/ https://bit.ly/MLPlaylistGanesh