H2O.ai

Machine Intelligence
Data Science for Non-Data Scientists
Erin LeDell Ph.D.
Silicon Valley Big Data Science
August 2015
H2O.ai
H2O Company
• Team: 35. Founded in 2012, Mountain View, CA
• Stanford Math & Systems Engineers
H2O Software
• Open Source Software
• Ease of Use via Web Interface
• R, Python, Scala, Spark & Hadoop Interfaces
• Distributed Algorithms Scale to Big Data
Scientific Advisory Council
Dr. Trevor Hastie
• John A. Overdeck Professor of Mathematics, Stanford University
• PhD in Statistics, Stanford University
• Co-author, The Elements of Statistical Learning: Data Mining, Inference, and Prediction
• Co-author with John Chambers, Statistical Models in S
• Co-author, Generalized Additive Models
• 108,404 citations (via Google Scholar)
Dr. Rob Tibshirani
• Professor of Statistics and Health Research and Policy, Stanford University
• PhD in Statistics, Stanford University
• COPSS Presidents’ Award recipient
• Co-author, The Elements of Statistical Learning: Data Mining, Inference, and Prediction
• Author, Regression Shrinkage and Selection via the Lasso
• Co-author, An Introduction to the Bootstrap
Dr. Stephen Boyd
• Professor of Electrical Engineering and Computer Science, Stanford University
• PhD in Electrical Engineering and Computer Science, UC Berkeley
• Co-author, Convex Optimization
• Co-author, Linear Matrix Inequalities in System and Control Theory
• Co-author, Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers
What is Data Science?
Problem Formulation
• Identify an outcome of interest and the type of task: classification / regression / clustering
• Identify the potential predictor variables
• Identify the independent sampling units
Collect & Process Data
• Conduct research experiment (e.g. Clinical Trial)
• Collect examples / randomly sample the population
• Transform, clean, impute, filter, aggregate data
• Prepare the data for machine learning (X, Y)
Machine Learning
• Modeling using a machine learning algorithm (training)
• Model evaluation and comparison
• Sensitivity & Cost Analysis
Insights & Action
• Translate results into action items
• Feed results into research pipeline
Source: marketingdistillery.com
What is Machine Learning?
What it is:
✤ “Field of study that gives computers the ability to learn without being explicitly programmed.” (Samuel, 1959)
✤ “Machine learning and statistics are closely related fields. The ideas of machine learning, from methodological principles to theoretical tools, have had a long pre-history in statistics.” (Jordan, 2014)
✤ M.I. Jordan also suggested the term data science as a placeholder name for the overall field.
What it’s not:
Unlike rules-based systems, which require a human expert to hard-code domain knowledge directly into the system, a machine learning algorithm learns how to make decisions from the data alone.
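The contrast between a hard-coded rule and a learned one can be sketched in a few lines of Python. This is only an illustration under invented assumptions: the data, the single "message length" feature, and the function names are hypothetical and do not come from the talk.

```python
# Hypothetical toy data: (message_length, is_spam) pairs invented for illustration.
examples = [(12, 0), (15, 0), (22, 0), (48, 1), (55, 1), (61, 1)]

def rule_based(length):
    """Rules-based system: a human expert hard-codes the cutoff."""
    return 1 if length > 100 else 0

def learn_threshold(data):
    """Learned system: pick the cutoff that misclassifies the fewest training examples."""
    best_cut, best_errors = None, len(data) + 1
    for cut in sorted(x for x, _ in data):
        errors = sum((1 if x > cut else 0) != y for x, y in data)
        if errors < best_errors:
            best_cut, best_errors = cut, errors
    return best_cut

threshold = learn_threshold(examples)

def learned(length):
    """Classify using the cutoff found from the data."""
    return 1 if length > threshold else 0
```

Here the learned cutoff adapts if the examples change, while the hard-coded rule stays wherever the expert put it.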
Machine Learning Overview
Regression
• Predict a real-valued response (viral load, weight)
• Gaussian, Gamma, Poisson and Tweedie
• MSE and R^2
Classification
• Multi-class or Binary classification
• Ranking
• Accuracy and AUC
Clustering
• Unsupervised learning (no training labels)
• Partition the data / identify clusters
• AIC and BIC
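Three of the metrics named on this slide (MSE, R^2 and accuracy) can be written out directly from their standard definitions. The sketch below is plain Python with illustrative function names; it is not taken from H2O or any other library.

```python
def mse(y_true, y_pred):
    """Mean squared error: average squared difference between truth and prediction."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """R^2: 1 minus (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def accuracy(labels, predictions):
    """Fraction of class labels predicted exactly right."""
    return sum(l == p for l, p in zip(labels, predictions)) / len(labels)
```

A model that always predicts the mean of the response gets an R^2 of 0, which is why R^2 is often read as "improvement over the mean."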
Machine Learning Workflow
Source: NLTK
Example of a supervised machine learning workflow.
ML Model Performance
Test & Train
• Partition the original data (randomly) into a training set and a test set (e.g. 70/30).
• Train a model using the “training set” and evaluate performance on the “test set” or “validation set.”
K-fold Cross-validation
• Train & test K models as shown.
• Average the model performance over the K test sets.
• Report cross-validated metrics.
Performance Metrics
• Regression: R^2, MSE, RMSE
• Classification: Accuracy, F1, H-measure
• Ranking (Binary Outcome): AUC, Partial AUC
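A minimal sketch of the two evaluation schemes, assuming plain Python lists and a caller-supplied `fit` and `score` function (both hypothetical names). The "model" in the usage lines is just the training mean, chosen only to keep the example short. The fold helper does not shuffle, so data should be shuffled beforehand if its order is meaningful.

```python
import random

def train_test_split(rows, test_fraction=0.3, seed=0):
    """Randomly partition rows into (train, test), e.g. 70/30."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(round(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]

def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) for each of k contiguous folds over range(n)."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size

def cross_validate(xs, ys, fit, score, k=5):
    """Train and test k models, then average the score over the k test sets."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores.append(score(model, [xs[i] for i in test_idx], [ys[i] for i in test_idx]))
    return sum(scores) / len(scores)

# Usage with a deliberately trivial model: predict the training mean.
def fit_mean(xs, ys):
    return sum(ys) / len(ys)

def mse_of_mean(model, xs, ys):
    return sum((y - model) ** 2 for y in ys) / len(ys)

xs = list(range(10))
ys = [2.0] * 10
cv_mse = cross_validate(xs, ys, fit_mean, mse_of_mean, k=5)
```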
What is Deep Learning?
What it is:
✤ “A branch of machine learning based on a set of algorithms that attempt to model high-level abstractions in data by using model architectures, composed of multiple non-linear transformations.” (Wikipedia, 2015)
✤ Deep neural networks have more than one hidden layer in their architecture. That’s what’s “deep.”
✤ Very useful for complex input data such as images, video, audio.
What it’s not:
Deep learning architectures, specifically artificial neural networks (ANNs), have been around since 1980, so they are not new. However, breakthroughs in training techniques led to their recent resurgence (mid-2000s). Combined with modern computing power, they are quite effective.
Deep Learning Architecture
Example of a deep neural net architecture.
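To make "more than one hidden layer" concrete, here is a small forward pass through a network with two hidden layers. It is a sketch only: the weights are hand-picked hypothetical values (a real network would learn them during training), and ReLU and sigmoid are just one common choice of non-linear transformations.

```python
import math

def dense(inputs, weights, biases):
    """One fully connected layer: weighted sum of the inputs plus a bias per unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def relu(values):
    """Non-linear transformation applied to the hidden layers."""
    return [max(0.0, v) for v in values]

def sigmoid(v):
    """Squash the output into (0, 1) so it can be read as a probability."""
    return 1.0 / (1.0 + math.exp(-v))

def forward(x, layers):
    """layers is a list of (weights, biases); the last entry is the output layer."""
    activations = x
    for weights, biases in layers[:-1]:
        activations = relu(dense(activations, weights, biases))
    weights, biases = layers[-1]
    return sigmoid(dense(activations, weights, biases)[0])

# Hypothetical hand-picked weights, for illustration only.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),   # hidden layer 1: 2 inputs -> 2 units
    ([[1.0, 1.0], [-1.0, 1.0]], [0.0, 0.0]),   # hidden layer 2: 2 units -> 2 units
    ([[1.0, -1.0]], [0.0]),                    # output layer: 2 units -> 1 output
]
output = forward([1.0, 1.0], layers)
```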
What is Ensemble Learning?
What it is:
✤ “Ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms.” (Wikipedia, 2015)
✤ Random Forests and Gradient Boosting Machines (GBM) are both ensembles of decision trees.
✤ Stacking, or Super Learning, is a technique for combining various learners into a single, powerful learner using a second-level metalearning algorithm.
What it’s not:
Ensembles typically achieve superior model performance over single methods. However, this comes at a price: computation time.
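The simplest way to combine several learners is a majority vote, sketched below. This is deliberately simpler than the stacking approach mentioned on the slide (there is no second-level metalearner), and the three hand-written rules and their cutoffs are hypothetical stand-ins for trained models.

```python
from collections import Counter

def majority_vote(classifiers, x):
    """Ask every classifier for a label and return the most common answer."""
    votes = [clf(x) for clf in classifiers]
    return Counter(votes).most_common(1)[0][0]

# Three hypothetical weak rules on a (height_cm, weight_kg) tuple.
def rule_height(x):
    return "adult" if x[0] > 150 else "child"

def rule_weight(x):
    return "adult" if x[1] > 45 else "child"

def rule_ratio(x):
    return "adult" if x[1] / x[0] > 0.3 else "child"

ensemble = [rule_height, rule_weight, rule_ratio]
prediction = majority_vote(ensemble, (160, 40))
```

The computation cost noted on the slide is visible even here: every prediction runs all three classifiers instead of one.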
Where to learn more?
• H2O Online Training (free): http://learn.h2o.ai
• H2O Slidedecks: http://www.slideshare.net/0xdata
• H2O Video Presentations: https://www.youtube.com/user/0xdata
• H2O Community Events & Meetups: http://h2o.ai/events
• Machine Learning & Data Science courses: http://coursebuffet.com
Customers, Community, Evangelists
November 9, 10, 11
Computer History Museum
H2OWORLD.H2O.AI
20% off registration using code: h2ocommunity
Questions?
@ledell on Twitter, GitHub
erin@h2o.ai
http://www.stat.berkeley.edu/~ledell
Data Science for Non-Data Scientists
aka How the Business Views Data Science
Chen Huang
August 20, 2015
Intro to Data Science for Non-Data Scientists
Agenda
•  Introduction
•  Data Science Primer
•  Working with Data Scientists
•  Decoding the Data Science Lingo
•  Q&A
Introduction
•  Who am I?
•  Why am I giving this talk?
Who am I?
•  Data Strategist
•  Career in Business Intelligence, Analytics, and Big Data
•  Various roles: Consultant, Developer, Business and Data Analyst, Product Manager, Functional and Technical Trainer, Client Services
•  Worked in various industries: health care, pharmaceuticals, communications and high tech, consumer products, automotive, finance, government contracting
August, 2015 – San Francisco, CA
Why am I giving this talk?
July, 2011 – Beijing, China
Data Science Primer
•  What can Data Science do for the Business?
•  Applications of Data Science
•  Data-Driven Decisions
•  What does a Data Scientist do?
•  Data Science Skills
What can Data Science do for the Business?
A: Data science! Extracting useful information and knowledge from large volumes of data in order to improve business decision-making, or providing the business with insights to make data-driven decisions.
Data / Business
What can Data do?
Image: http://www.slideshare.net/andrewgardner5811/big-data-and-the-art-of-data-science
Applications of Data Science
Image: http://www.slideshare.net/andrewgardner5811/big-data-and-the-art-of-data-science
Data-Driven Decisions
•  Practice of basing decisions on data, rather than purely
on intuition
•  There is evidence that data-driven decision making and
big data technologies substantially improve business
performance
The Art and Science of Data Science
•  Discover unknowns in data
•  Obtain predictive, actionable insights
•  Communicate business data stories
•  Build confidence in decision making
•  Create valuable Data Products that have business impact
http://www.slideshare.net/datasciencelondon/big-data-sorry-data-science-what-does-a-data-scientist-do
What does a Data Scientist do?
•  Data curiosity. Explore data. Discover unknowns
•  Understand data relationships
•  Understand the business, has domain knowledge
•  Can tell relevant stories with data
•  Holistic view of the business
•  Knows machine learning, statistics, probability
•  Can hack and code
•  Define and test a hypothesis, run experiments
•  Asks good questions
http://www.slideshare.net/andrewgardner5811/big-data-and-the-art-of-data-science
Data Science Skills
Image: http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
Image: http://www.slideshare.net/galvanizeHQ/how-to-become-a-data-scientist-by-ryan-orban-vp-of-operations-and-expansion-galvanize
Image: http://www.slideshare.net/galvanizeHQ/how-to-become-a-data-scientist-by-ryan-orban-vp-of-operations-and-expansion-galvanize
Working with Data Scientists
•  Collaboration
•  Data Science Cycle
•  Organizational Models for Data Science Teams
Working with Data Scientists
Data Science
Business
Data Engineering
Data Science Cycle
Image: https://en.wikipedia.org/wiki/Data_science
Organizational Models for Data Science Teams
Image: http://www.slideshare.net/emcacademics/building-data-science-teams-31057129
Decoding the Data Science Lingo
Machine Learning
Data Science Definition:
•  A subfield of computer science and artificial intelligence (AI) that focuses on the design of systems that can learn from and make decisions and predictions based on data.
•  Machine learning enables computers to act and make data-driven decisions rather than being explicitly programmed to carry out a certain task.
•  Machine Learning programs are also designed to learn and improve over time when exposed to new data.
Business Application:
•  Everything!
Definition: http://blog.aylien.com/post/121281850733/10-machine-learning-terms-explained-in-simple
Unsupervised Learning
Data Science Definition:
•  Where a program, given a dataset, can automatically find patterns and relationships within the dataset.
•  The business decides how deep the grouping goes or how many categories there are.
•  Clustering or grouping of like data.
•  Examples: k-means clustering, hierarchical clustering
Business Application:
•  Customer segmentation
•  Understanding users and behaviors
•  Classifying unknown and pre-defined images into categories
Definition: http://blog.aylien.com/post/121281850733/10-machine-learning-terms-explained-in-simple
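As a rough sketch of k-means clustering, one of the examples named above, the code below runs Lloyd's algorithm on a handful of 2-D points. The points, the starting centroids, and the choice of two clusters are all invented for illustration; in practice the number of clusters is the decision the business has to make.

```python
def k_means(points, centroids, iterations=10):
    """Lloyd's algorithm on 2-D points, starting from the given centroids."""
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical customers described by (visits per month, spend per visit).
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

No labels are supplied anywhere: the two customer segments emerge from the data alone.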
Supervised Learning
Data Science Definition:
•  Where a program is “trained” on a pre-defined dataset.
•  Based on its training data, the program can make accurate decisions when given new data.
Business Application:
•  Classifying Twitter sentiments
•  Recommender systems
Definition: http://blog.aylien.com/post/121281850733/10-machine-learning-terms-explained-in-simple
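A nearest-neighbor classifier is about the shortest possible illustration of "trained on a pre-defined dataset, then applied to new data." The sketch below assumes each tweet has already been reduced to two invented features (counts of positive and negative words); the training rows and labels are hypothetical.

```python
def nearest_neighbor_predict(training, x):
    """training is a list of (features, label); return the label of the closest example."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    _, label = min(training, key=lambda pair: sq_dist(pair[0], x))
    return label

# Hypothetical labeled data: (positive word count, negative word count) -> sentiment.
training = [
    ((3, 0), "positive"),
    ((2, 1), "positive"),
    ((0, 3), "negative"),
    ((1, 2), "negative"),
]

new_tweet = (4, 1)
sentiment = nearest_neighbor_predict(training, new_tweet)
```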
Score
Data Science Definition:
•  A number of ways to evaluate how well the model assigns the correct class value to the test instances.
Business Application:
•  Confidence gauge
Definition: https://mlcorner.wordpress.com/tag/scoring/
Score Cont.
•  True Positive (TP): the instance is positive and it is classified as positive
•  False Negative (FN): the instance is positive but it is classified as negative
•  True Negative (TN): the instance is negative and it is classified as negative
•  False Positive (FP): the instance is negative but it is classified as positive
•  Classification problems:
•  Precision = proportion of instances classified as positive that are actually positive = TP/(TP+FP)
•  Accuracy = proportion of correctly classified instances = (TP+TN)/(TP+TN+FP+FN)
•  Recall or Sensitivity = proportion of actual positives that are correctly classified = TP/(TP+FN)
•  Specificity = classifier’s ability to identify negative results = TN/(TN+FP)
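The four formulas above can be checked on a small worked example. The labels below are made up (1 = positive, 0 = negative), and the sketch assumes every denominator is non-zero, which a production version would need to guard against.

```python
def confusion_counts(actual, predicted):
    """Count TP, FN, TN, FP for binary labels where 1 is the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == 1 and p == 1 for a, p in pairs)
    fn = sum(a == 1 and p == 0 for a, p in pairs)
    tn = sum(a == 0 and p == 0 for a, p in pairs)
    fp = sum(a == 0 and p == 1 for a, p in pairs)
    return tp, fn, tn, fp

def classification_scores(actual, predicted):
    """Apply the slide's formulas; assumes no denominator is zero."""
    tp, fn, tn, fp = confusion_counts(actual, predicted)
    return {
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Hypothetical test set: 4 actual positives, 6 actual negatives.
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
scores = classification_scores(actual, predicted)
```

With TP=3, FN=1, TN=5 and FP=1, precision and recall are both 3/4, accuracy is 8/10 and specificity is 5/6.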
Classification
Data Science Definition:
•  Sub-category of Supervised Learning
•  Classification is the process of taking some sort of input and assigning a label to it. The predictions are discrete: categories, or of a “yes or no” nature.
•  Examples: Logistic Regression, Random Forest
Business Application:
•  What customers should a company target with its marketing campaigns?
•  Is this Nigerian prince committing fraud? (Spam classification)
•  Is this actually Barack Obama’s Facebook profile and review on Amazon? (Fraud detection)
Definition: http://blog.aylien.com/post/121281850733/10-machine-learning-terms-explained-in-simple
Regression
Data Science Definition:
•  Sub-category of Supervised Learning
•  Regression is a type of algorithm that predicts continuous values.
Business Application:
•  How much would a user spend on a mobile game like CandyCrush?
•  How much would someone spend on healthcare out of pocket?
•  How many attendees will come to this event based on past registration?
Definition: http://blog.aylien.com/post/121281850733/10-machine-learning-terms-explained-in-simple
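The attendance question can be sketched as a one-variable least-squares fit. The registration and attendance figures below are invented for illustration, and ordinary least squares is only one of many regression techniques.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical past events: registrations and the attendees who actually came.
registrations = [100, 200, 300, 400]
attendees = [60, 110, 160, 210]

slope, intercept = fit_line(registrations, attendees)
expected_attendees = slope * 500 + intercept  # prediction for 500 registrations
```

On this made-up data the fitted line is attendees = 0.5 × registrations + 10, so 500 registrations predicts 260 attendees.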
Decision Trees
Data Science Definition:
•  Using a tree-like graph or model of decisions and their possible consequences.
Business Application:
•  Medical Testing (e.g. health incidences, etc.)
•  Genealogy breakdowns (e.g. eye color, blood type, etc.)
Definition: http://blog.aylien.com/post/121281850733/10-machine-learning-terms-explained-in-simple
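A decision tree can be represented as nested nodes, each asking one question, with a prediction made by walking from the root to a leaf. The tree below is hand-written and entirely hypothetical (features, cutoffs and labels are invented, not clinical guidance); a real decision tree algorithm would learn the splits from data.

```python
# Hypothetical hand-written tree: internal nodes test a feature, leaves hold a label.
tree = {
    "feature": "age", "threshold": 50,
    "left": {"label": "low risk"},
    "right": {
        "feature": "blood_pressure", "threshold": 140,
        "left": {"label": "medium risk"},
        "right": {"label": "high risk"},
    },
}

def predict(node, record):
    """Follow the left branch when value <= threshold, else the right, until a leaf."""
    while "label" not in node:
        branch = "left" if record[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["label"]

patient = {"age": 60, "blood_pressure": 160}
risk = predict(tree, patient)
```

Because each prediction is just a path of yes/no questions, the reasoning behind it is easy to explain to a non-technical audience.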
Deep Learning
Data Science Definition:
•  A category of machine learning algorithms that often use Artificial Neural Networks to generate models.
Business Application:
•  Image classification
•  Language processing
•  Audio processing
•  Outlier and fraud detection
Definition: http://blog.aylien.com/post/121281850733/10-machine-learning-terms-explained-in-simple
Questions?

Intro to Data Science for Non-Data Scientists

  • 1. H2O.ai
 Machine Intelligence Data Science for Non-Data Scientists Erin LeDell Ph.D. Silicon Valley Big Data Science August 2015
  • 2. H2O.ai
 Machine Intelligence H2O.ai H2O Company H2O Software • Team: 35. Founded in 2012, Mountain View, CA • Stanford Math & Systems Engineers • Open Source Software
 • Ease of Use via Web Interface • R, Python, Scala, Spark & Hadoop Interfaces • Distributed Algorithms Scale to Big Data
  • 3. H2O.ai
 Machine Intelligence Scientific Advisory Council Dr. Trevor Hastie Dr. Rob Tibshirani Dr. Stephen Boyd • John A. Overdeck Professor of Mathematics, Stanford University • PhD in Statistics, Stanford University • Co-author, The Elements of Statistical Learning: Prediction, Inference and Data Mining • Co-author with John Chambers, Statistical Models in S • Co-author, Generalized Additive Models • 108,404 citations (via Google Scholar) • Professor of Statistics and Health Research and Policy, Stanford University • PhD in Statistics, Stanford University • COPPS Presidents’ Award recipient • Co-author, The Elements of Statistical Learning: Prediction, Inference and Data Mining • Author, Regression Shrinkage and Selection via the Lasso • Co-author, An Introduction to the Bootstrap • Professor of Electrical Engineering and Computer Science, Stanford University • PhD in Electrical Engineering and Computer Science, UC Berkeley • Co-author, Convex Optimization • Co-author, Linear Matrix Inequalities in System and Control Theory • Co-author, Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers
  • 4. H2O.ai
 Machine Intelligence What is Data Science? Problem Formulation • Identify an outcome of interest and the type of task: classification / regression / clustering • Identify the potential predictor variables • Identify the independent sampling units • Conduct research experiment (e.g. Clinical Trial) • Collect examples / randomly sample the population • Transform, clean, impute, filter, aggregate data • Prepare the data for machine learning — X, Y • Modeling using a machine learning algorithm (training) • Model evaluation and comparison • Sensitivity & Cost Analysis • Translate results into action items • Feed results into research pipeline Collect & Process Data Machine Learning Insights & Action
  • 5. H2O.ai
 Machine Intelligence Source: marketingdistillery.com
  • 6. H2O.ai
 Machine Intelligence What is Machine Learning? What it is: ✤ “Field of study that gives computers the ability to learn without being explicitly programmed.” (Samuel, 1959) ✤ “Machine learning and statistics are closely related fields. The ideas of machine learning, from methodological principles to theoretical tools, have had a long pre-history in statistics.” (Jordan, 2014) ✤ M.I. Jordan also suggested the term data science as a placeholder to call the overall field. Unlike rules-based systems which require a human expert to hard-code domain knowledge directly into the system, a machine learning algorithm learns how to make decisions from the data alone. What it’s not:
  • 7. H2O.ai
 Machine Intelligence Classification Clustering Machine Learning Overview • Predict a real-valued response (viral load, weight) • Gaussian, Gamma, Poisson and Tweedie • MSE and R^2 • Multi-class or Binary classification • Ranking • Accuracy and AUC • Unsupervised learning (no training labels) • Partition the data / identify clusters • AIC and BIC Regression
  • 8. H2O.ai
 Machine Intelligence Machine Learning Workflow Source: NLTK Example of a supervised machine learning workflow.
  • 9. H2O.ai
 Machine Intelligence ML Model Performance Test & Train • Partition the original data (randomly) into a training set and a test set. (e.g. 70/30) • Train a model using the “training set” and evaluate performance on the “test set” or “validation set.” • Train & test K models as shown. • Average the model performance over the K test sets. • Report cross- validated metrics. • Regression: R^2, MSE, RMSE • Classification: Accuracy, F1, H-measure • Ranking (Binary Outcome): AUC, Partial AUC K-fold Cross-validation Performance Metrics
  • 10. H2O.ai
 Machine Intelligence What is Deep Learning? What it is: ✤ “A branch of machine learning based on a set of algorithms that attempt to model high-level abstractions in data by using model architectures, composed of multiple non-linear transformations.” (Wikipedia, 2015) ✤ Deep neural networks have more than one hidden layer in their architecture. That’s what’s “deep.” ✤ Very useful for complex input data such as images, video, audio. Deep learning architectures, specifically artificial neural networks (ANNs) have been around since 1980, so they are not new. However, there were breakthroughs in training techniques that lead to their recent resurgence (mid 2000’s). Combined with modern computing power, they are quite effective. What it’s not:
  • 11. H2O.ai
 Machine Intelligence Deep Learning Architecture Example of a deep neural net architecture.
  • 12. H2O.ai
 Machine Intelligence What is Ensemble Learning? What it is: ✤ “Ensemble methods use multiple learning algorithms to obtain better predictive performance that could be obtained from any of the constituent learning algorithms.” (Wikipedia, 2015) ✤ Random Forests and Gradient Boosting Machines (GBM) are both ensembles of decision trees. ✤ Stacking, or Super Learning, is technique for combining various learners into a single, powerful learner using a second-level metalearning algorithm. Ensembles typically achieve superior model performance over singular methods. However, this comes at a price — computation time. What it’s not:
  • 13. H2O.ai
 Machine Intelligence Where to learn more? • H2O Online Training (free): http://learn.h2o.ai • H2O Slidedecks: http://www.slideshare.net/0xdata • H2O Video Presentations: https://www.youtube.com/user/0xdata • H2O Community Events & Meetups: http://h2o.ai/events • Machine Learning & Data Science courses: http://coursebuffet.com
  • 14. Customers ! Community ! Evangelists November 9, 10, 11 Computer History Museum H 2 O W O R L D . H 2 O . A I ! 20% off registration using code: h2ocommunity !
  • 15. H2O.ai
 Machine Intelligence Questions? @ledell on Twitter, GitHub erin@h2o.ai http://www.stat.berkeley.edu/~ledell
  • 16. Data Science for Non-Data Scientists 
 
 aka. How the Business Views Data Science Chen Huang August 20, 2015
  • 18. Agenda •  Introduction •  Data Science Primer •  Working with Data Scientists •  Decoding the Data Science Lingo •  Q&A
  • 19. Introduction •  Who am I? •  Why am I giving this talk?
  • 20. Who am I? •  Data Strategist •  Career in Business Intelligence, Analytics, and Big Data •  Various roles •  Consultant •  Developer •  Business and Data Analyst •  Product Manager •  Functional and Technical Trainer •  Client Services •  Worked in various industries •  Health care, pharmaceutics, communications and high tech, consumer products, automotive, finance, government contracting August, 2015 – San Francisco, CA
  • 21. Why am I giving this talk? July, 2011 – Beijing, China
 • 22. Data Science Primer
   •  What can Data Science do for the Business?
   •  Applications of Data Science
   •  Data-Driven Decisions
   •  What does a Data Scientist do?
   •  Data Science Skills
 • 23. What can Data Science do for the Business?
   A: Data science! Extracting useful information and knowledge from large volumes of data in order to improve business decision-making, or providing the business with insights to make data-driven decisions.
   (Diagram labels: Data, Business)
 • 24. What can Data do?
   Image: http://www.slideshare.net/andrewgardner5811/big-data-and-the-art-of-data-science
 • 25. Applications of Data Science
   Image: http://www.slideshare.net/andrewgardner5811/big-data-and-the-art-of-data-science
 • 26. Data-Driven Decisions
   •  Practice of basing decisions on data, rather than purely on intuition
   •  There is evidence that data-driven decision making and big data technologies substantially improve business performance
 • 27. The Art and Science of Data Science
   •  Discover unknowns in data
   •  Obtain predictive, actionable insights
   •  Communicate business data stories
   •  Build confidence in decision making
   •  Create valuable Data Products that have business impact
   http://www.slideshare.net/datasciencelondon/big-data-sorry-data-science-what-does-a-data-scientist-do
 • 29. What does a Data Scientist do?
   •  Data curiosity. Explore data. Discover unknowns
   •  Understand data relationships
   •  Understand the business, has domain knowledge
   •  Can tell relevant stories with data
   •  Holistic view of the business
   •  Knows machine learning, statistics, probability
   •  Can hack and code
   •  Define and test a hypothesis, run experiments
   •  Asks good questions
   http://www.slideshare.net/andrewgardner5811/big-data-and-the-art-of-data-science
 • 30. Data Science Skills
   Image: http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
 • 33. Working with Data Scientists
   •  Collaboration
   •  Data Science Cycle
   •  Organizational Models for Data Science Teams
 • 35. Working with Data Scientists
   (Diagram labels: Data Science, Business, Data Engineering)
 • 36. Data Science Cycle
   Image: https://en.wikipedia.org/wiki/Data_science
 • 37. Organizational Models for Data Science Teams
   Image: http://www.slideshare.net/emcacademics/building-data-science-teams-31057129
  • 38. Decoding the Data Science Lingo
 • 39. Machine Learning
   Data Science Definition:
   •  A subfield of computer science and artificial intelligence (AI) that focuses on the design of systems that can learn from and make decisions and predictions based on data.
   •  Machine learning enables computers to act and make data-driven decisions rather than being explicitly programmed to carry out a certain task.
   •  Machine Learning programs are also designed to learn and improve over time when exposed to new data.
   Business Application:
   •  Everything!
   Definition: http://blog.aylien.com/post/121281850733/10-machine-learning-terms-explained-in-simple
 • 40. Unsupervised Learning
   Data Science Definition:
   •  Where a program, given a dataset, can automatically find patterns and relationships within the dataset.
   •  The business decides how deep the grouping goes, or how many categories there are.
   •  Clustering or grouping of like data.
   •  Examples: k-means clustering, hierarchical clustering
   Business Application:
   •  Customer segmentation
   •  Understanding users and behaviors
   •  Classifying unknown and pre-defined images into categories
   Definition: http://blog.aylien.com/post/121281850733/10-machine-learning-terms-explained-in-simple
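To make the k-means idea above concrete, here is a minimal sketch in plain Python. It groups made-up monthly customer spend figures into three segments; the data, the function names, and the deterministic starting centers are all assumptions for illustration, not anything from the talk.

```python
def assign(values, centers):
    """Put each value into the cluster of its nearest center."""
    clusters = [[] for _ in centers]
    for v in values:
        nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
        clusters[nearest].append(v)
    return clusters


def k_means_1d(values, k, iterations=20):
    """Group 1-D numbers into k clusters by repeating assign-then-average."""
    ordered = sorted(values)
    # Assumption: spread the starting centers evenly across the sorted data
    # so the result is reproducible (real k-means usually starts randomly).
    centers = [ordered[i * (len(ordered) - 1) // max(k - 1, 1)] for i in range(k)]
    for _ in range(iterations):
        clusters = assign(values, centers)
        # Move each center to the mean of its cluster; keep it if the cluster is empty.
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, assign(values, centers)


# Hypothetical monthly spend per customer; no labels are given to the program.
spend = [12, 15, 14, 90, 95, 88, 300, 310]
centers, segments = k_means_1d(spend, k=3)
```

Note that the business still chooses `k`: the program finds the groups, but how many segments are useful is a business decision.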
 • 41. Supervised Learning
   Data Science Definition:
   •  Where a program is “trained” on a pre-defined dataset.
   •  Based on its training data, the program can make accurate decisions when given new data.
   Business Application:
   •  Classifying Twitter sentiments
   •  Recommender systems
   Definition: http://blog.aylien.com/post/121281850733/10-machine-learning-terms-explained-in-simple
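As a rough illustration of "trained on a pre-defined dataset", the sketch below uses a nearest-neighbor rule: a new tweet gets the label of the most similar labeled tweet. The features (counts of positive and negative words) and all the numbers are invented for this example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def predict_nearest(training, new_point):
    """Return the label of the training example closest to new_point."""
    _, label = min(training, key=lambda row: squared_distance(row[0], new_point))
    return label


# Hypothetical training set: (positive_word_count, negative_word_count) -> label.
training = [
    ((3, 0), "positive"),
    ((2, 1), "positive"),
    ((0, 2), "negative"),
    ((1, 4), "negative"),
]

new_tweet_a = predict_nearest(training, (2, 0))
new_tweet_b = predict_nearest(training, (0, 3))
```

The key difference from unsupervised learning is that every training example carries a known label, which is what the program learns to reproduce.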
 • 42. Score
   Data Science Definition:
   •  A number of ways to evaluate how well the model assigns the correct class value to the test instances.
   Business Application:
   •  Confidence gauge
   Definition: https://mlcorner.wordpress.com/tag/scoring/
 • 43. Score Cont.
   •  True Positive (TP): the instance is positive and it is classified as positive
   •  False Negative (FN): the instance is positive but it is classified as negative
   •  True Negative (TN): the instance is negative and it is classified as negative
   •  False Positive (FP): the instance is negative but it is classified as positive
   •  Classification problems:
      •  Precision = proportion of instances classified as positive that are actually positive = TP/(TP+FP)
      •  Accuracy = proportion of correctly classified instances = (TP+TN)/(TP+TN+FP+FN)
      •  Recall or Sensitivity = proportion of actual positives that are correctly classified = TP/(TP+FN)
      •  Specificity = classifier’s ability to identify negative results = TN/(TN+FP)
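The four formulas above can be checked with a few lines of Python. This is a minimal sketch: the function names and the example labels are made up, and it assumes every denominator is non-zero.

```python
def confusion_counts(actual, predicted, positive="yes"):
    """Count TP, FP, TN, FN by comparing actual and predicted labels."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        else:
            if a == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def classification_scores(tp, fp, tn, fn):
    """Apply the slide's formulas (assumes no denominator is zero)."""
    return {
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical test set of six instances.
actual = ["yes", "yes", "yes", "no", "no", "no"]
predicted = ["yes", "yes", "no", "yes", "no", "no"]
counts = confusion_counts(actual, predicted)
scores = classification_scores(*counts)
```

With these labels there are 2 true positives, 1 false positive, 2 true negatives and 1 false negative, so all four scores come out to 2/3.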
 • 44. Classification
   Data Science Definition:
   •  Sub-category of Supervised Learning
   •  Classification is the process of taking some sort of input and assigning a label to it. The predictions are discrete: categories, or of a “yes or no” nature.
   •  Examples: Logistic Regression, Random Forest
   Business Application:
   •  What customers should a company target with its marketing campaigns?
   •  Is this Nigerian prince committing fraud? (Spam classification)
   •  Is this actually Barack Obama’s Facebook profile and review on Amazon? (Fraud detection)
   Definition: http://blog.aylien.com/post/121281850733/10-machine-learning-terms-explained-in-simple
 • 45. Regression
   Data Science Definition:
   •  Sub-category of Supervised Learning
   •  Regression is a type of algorithm that predicts continuous values.
   Business Application:
   •  How much would a user spend on a mobile game like CandyCrush?
   •  How much would someone spend on healthcare out of pocket?
   •  How many attendees will come to this event based on past registration?
   Definition: http://blog.aylien.com/post/121281850733/10-machine-learning-terms-explained-in-simple
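The attendee question above can be sketched with the simplest regression there is, a straight line fitted by ordinary least squares. The registration and attendance numbers below are invented, and a single-predictor line is an assumption chosen for brevity.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical past events: registrations and how many people showed up.
registrations = [100, 200, 300, 400]
attendees = [62, 118, 185, 235]

slope, intercept = fit_line(registrations, attendees)
# Predicted attendance for an event with 500 registrations (a continuous value).
expected_attendees = slope * 500 + intercept
```

Unlike classification, the output is a number on a continuous scale (here roughly 296 attendees), not a category.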
 • 46. Decision Trees
   Data Science Definition:
   •  Uses a tree-like graph or model of decisions and their possible consequences.
   Business Application:
   •  Medical Testing (e.g. health incidences, etc.)
   •  Genealogy breakdowns (e.g. eye color, blood type, etc.)
   Definition: http://blog.aylien.com/post/121281850733/10-machine-learning-terms-explained-in-simple
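A decision tree can be pictured as a chain of questions that ends in an answer. The sketch below hard-codes a tiny tree and walks it for one record; the questions, thresholds and outcomes are entirely hypothetical (not medical guidance), and a real tree would be learned from data rather than written by hand.

```python
# Each inner node asks "is this field >= threshold?"; each leaf holds an answer.
tree = {
    "question": "temperature_c",
    "threshold": 38.0,
    "below": {"answer": "no test needed"},
    "at_or_above": {
        "question": "days_of_symptoms",
        "threshold": 3,
        "below": {"answer": "retest tomorrow"},
        "at_or_above": {"answer": "order lab test"},
    },
}


def decide(node, record):
    """Follow the branches from the root until a leaf (an answer) is reached."""
    while "answer" not in node:
        value = record[node["question"]]
        node = node["at_or_above"] if value >= node["threshold"] else node["below"]
    return node["answer"]


outcome = decide(tree, {"temperature_c": 39.1, "days_of_symptoms": 4})
```

Because every prediction is just a path through the tree, the reasoning behind it is easy to show to a business audience.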
 • 47. Deep Learning
   Data Science Definition:
   •  A category of machine learning algorithms that often use Artificial Neural Networks to generate models.
   Business Application:
   •  Image classification
   •  Language processing
   •  Audio processing
   •  Outlier and fraud detection
   Definition: http://blog.aylien.com/post/121281850733/10-machine-learning-terms-explained-in-simple
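To give a feel for what an artificial neural network computes, here is a toy sketch with two hidden neurons and one output neuron. The weights are picked by hand so that the network outputs 1 when exactly one input is 1; this is only an assumption-laden illustration, since real deep learning models learn their weights from data and use many more layers and neurons.

```python
def step(x):
    """Simplest possible activation: fire (1) if the input is positive, else 0."""
    return 1 if x > 0 else 0


def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through the activation."""
    return step(sum(i * w for i, w in zip(inputs, weights)) + bias)


def tiny_network(x1, x2):
    """Two hidden neurons feed one output neuron (hand-set weights)."""
    h_or = neuron([x1, x2], [1.0, 1.0], -0.5)   # fires if at least one input is 1
    h_and = neuron([x1, x2], [1.0, 1.0], -1.5)  # fires only if both inputs are 1
    return neuron([h_or, h_and], [1.0, -1.0], -0.5)  # "or" but not "and"


results = {(a, b): tiny_network(a, b) for a in (0, 1) for b in (0, 1)}
```

No single neuron of this kind can produce that "exactly one" pattern on its own; stacking layers is what makes it possible, which is the basic intuition behind going "deep".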