Building Data Scientists
Machine Learning Mastery in Python
Mitch Sanders
Jan 10th 2018
Internal Use - Confidential
the Context
Industry Trends for 2018 – How what we’re doing fits into the future
Trend #1: Data Science continues to develop specialties - this means the mythical ‘full stack’ data scientist will disappear
• Data Scientist
• Data Engineer
• Algorithm Coder
• Data Storyteller
Trend #2: Non-Data Scientists will perform more fairly sophisticated analytics alongside data scientists
• Data Scientist
• Algorithm Coder
• Data Science Citizens
• Advanced Analytics
• Programmers
• Statisticians
• Business Analyst
• Coders
the Course
Machine Learning Mastery
- Understand Your Data
- Create Accurate Models
- Work Projects End-To-End
• 16 weeks – May-Oct., 2017
• 20+ class hours – 20% homework, 80% live coding
• 17 notebooks – Python code templates
• 4 Prerequisites – Coding, statistics, algorithms, thirst to learn
• 1 Textbook – Machine Learning Mastery With Python by Dr. Jason Brownlee
• 1 Teacher – Mitch Sanders w/ Assistant – Uday Waghmare
• 14 Students – global: software engineers, adv. analysts, statisticians
• Platform – Jupyter, Python 2.7, Anaconda
• Code Repository – GitHub
• NPS Survey – Survey Monkey, LTR = 90
• Awarded – “On the Spot”
the Content
Key concepts and flow of the 17 notebooks (#1 to #17)
Stages: Prepare & Explore, Model, Improve Accuracy & Finalize
Python ML Ecosystem
• SciPy
• Scikit-learn
Crash Courses
• NumPy
• Matplotlib
• Pandas
Load Libraries & Data
Descriptive Statistics
• Attribute Data Types
• Class Distribution
• Correlation Analysis
• Skew of Univariates
Pre Processing
• Rescale
• Standardize
• Normalize
• Binarize
Feature Selection
• Tree & Univariate
• Recursive Feature Elimination (RFE)
• Principal Component Analysis (PCA)
• Feature Importance
Resampling
• Split into Train/Test
• K-fold Cross Validation
• Leave One Out
• Repeated Random
Evaluation Metrics
• For Classification
• For Regression
Spot Check Classification Algorithms
• Linear: Logistic Regression, Linear Discriminant Analysis (LDA)
• Non-linear: K-Nearest Neighbor (KNN), Naïve Bayes, Classification & Regression Trees (CART), Support Vector Machines (SVM)
Compare Algorithms
Spot Check Regression Algorithms
• Linear: LR, LASSO, ElasticNet (EN)
• Non-linear: CART, SVR, KNN
Automate w/ Pipelines
• Preparation Pipelines
• Feature Extraction Pipelines
• Modeling Pipelines
Ensembles - Performance Improvements
• Boosting: AdaBoost, Gradient Boosting (GBM)
• Bagging: Random Forest, Extra Trees
• Voting
Algorithm Parameter Tuning
• Parameters
• Grid Search
• Random Search
Finalize Model
• Predict on Validation Data
• Create Standalone on Entire Data
• Save Model for Production
Visualization
• Univariate Plots
• Multivariate Plots
Case Studies #1 & #2
Reference Material
the Course Syllabus
Python Ecosystem for Machine Learning
• Python
• SciPy
• Scikit-learn
• Python Ecosystem Installation
• Summary
Crash Course in Python and SciPy
• Python Crash Course
• NumPy Crash Course
• Matplotlib Crash Course
• Pandas Crash Course
• Summary
How To Load Machine Learning Data
• Considerations When Loading CSV Data
• Pima Indians Dataset
• Load CSV Files with the Python Standard Library
• Load CSV Files with NumPy
• Load CSV Files with Pandas
• Summary
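As a rough illustration of the "Load CSV Files with the Python Standard Library" topic, the sketch below parses a small in-memory CSV with the `csv` module. The column names and values in `SAMPLE` are made up for the example and are not taken from the course data, the helper name `load_csv` is hypothetical, and the code assumes Python 3 (the course platform was Python 2.7).

```python
import csv
import io

# Hypothetical three-row sample with made-up values. A real file would be
# opened with open(path, newline="") instead of io.StringIO.
SAMPLE = "preg,plas,class\n6,148,1\n1,85,0\n8,183,1\n"


def load_csv(handle):
    """Return (header, rows), converting every data value to float."""
    reader = csv.reader(handle)
    header = next(reader)  # first line holds the column names
    rows = [[float(value) for value in row] for row in reader if row]
    return header, rows


header, rows = load_csv(io.StringIO(SAMPLE))
print(header)
print(len(rows), "rows x", len(header), "columns")
```

Checking the row and column counts right after loading is a cheap way to catch a wrong delimiter or an unexpected header line before any modeling starts.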
Understand Your Data With Visualization
• Univariate Plots
• Multivariate Plots
• Summary
Prepare Your Data For Machine Learning
• Need For Data Pre-processing
• Data Transforms
• Rescale Data
• Standardize Data
• Normalize Data
• Binarize Data (Make Binary)
• Summary
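The four transforms listed in this section can be sketched in plain Python. This is only an illustration of the arithmetic, assuming Python 3. The function names are hypothetical, and in practice scikit-learn's `MinMaxScaler`, `StandardScaler`, `Normalizer` and `Binarizer` classes cover the same ground.

```python
import math


def rescale(column, low=0.0, high=1.0):
    """Min-max rescale a column of numbers into the range [low, high]."""
    smallest, largest = min(column), max(column)
    span = largest - smallest
    if span == 0:
        return [low for _ in column]  # constant column: nothing to spread
    return [low + (x - smallest) * (high - low) / span for x in column]


def standardize(column):
    """Shift a column to mean 0 and scale it to unit (population) std dev."""
    mean = sum(column) / len(column)
    variance = sum((x - mean) ** 2 for x in column) / len(column)
    std = math.sqrt(variance)
    if std == 0:
        return [0.0 for _ in column]
    return [(x - mean) / std for x in column]


def normalize(row):
    """Scale one row (one observation) to unit Euclidean (L2) length."""
    length = math.sqrt(sum(x * x for x in row))
    return [x / length for x in row] if length else list(row)


def binarize(column, threshold=0.0):
    """Map values above the threshold to 1 and all other values to 0."""
    return [1 if x > threshold else 0 for x in column]


values = [2.0, 4.0, 6.0, 8.0]
print(rescale(values))
print(standardize(values))
print(normalize([3.0, 4.0]))
print(binarize(values, threshold=5.0))
```

Note that rescaling and standardizing work down a column (one attribute), while normalizing works across a row (one observation).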
Feature Selection For Machine Learning
• Feature Selection
• Univariate Selection
• Recursive Feature Elimination
• Principal Component Analysis
• Feature Importance
• Summary
Evaluate the Performance of Machine Learning Algorithms with Resampling
• Evaluate Machine Learning Algorithms
• Split into Train and Test Sets
• K-fold Cross-Validation
• Leave One Out Cross-Validation
• Repeated Random Test-Train Splits
• What Techniques to Use When
• Summary
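To make the k-fold idea concrete, here is a minimal sketch of how row indices can be divided into folds, assuming Python 3 and no shuffling. The generator name `k_fold_indices` is hypothetical. Leave-one-out cross-validation is the special case where `k` equals the number of rows.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, unshuffled folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds take one additional row each.
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(10, 3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Every row lands in exactly one test fold, so each observation is used for evaluation once and for training in the remaining folds.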
Machine Learning Algorithm Performance Metrics
• Algorithm Evaluation Metrics
• Classification Metrics
• Regression Metrics
• Summary
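A small sketch of one classification metric and two regression metrics, written out by hand to show what the numbers mean. It assumes Python 3, the function names are hypothetical, and the labels and predictions are invented for the example.

```python
def accuracy(actual, predicted):
    """Fraction of predictions that exactly match the actual class labels."""
    matches = sum(1 for a, p in zip(actual, predicted) if a == p)
    return matches / len(actual)


def mean_absolute_error(actual, predicted):
    """Average size of the errors, ignoring their direction."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_squared_error(actual, predicted):
    """Average squared error, which penalizes large misses more heavily."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Invented example data: four class labels and three numeric targets.
class_actual, class_predicted = [1, 0, 1, 1], [1, 0, 0, 1]
reg_actual, reg_predicted = [3.0, 5.0, 2.0], [2.5, 5.0, 4.0]

print(accuracy(class_actual, class_predicted))
print(mean_absolute_error(reg_actual, reg_predicted))
print(mean_squared_error(reg_actual, reg_predicted))
```

Accuracy suits classification problems with reasonably balanced classes, while the two error measures apply to regression problems.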
Spot-Check Classification Algorithms
• Algorithm Spot-Checking
• Algorithms Overview
• Linear Machine Learning Algorithms
• Nonlinear Machine Learning Algorithms
• Summary
Spot-Check Regression Algorithms
• Algorithms Overview
• Linear Machine Learning Algorithms
• Nonlinear Machine Learning Algorithms
• Summary
Compare Machine Learning Algorithms
• Choose The Best Machine Learning Model
• Compare Machine Learning Algorithms Consistently
• Summary
Automate Machine Learning Workflows with Pipelines
• Automating Machine Learning Workflows
• Data Preparation and Modeling Pipeline
• Feature Extraction and Modeling Pipeline
• Summary
Improve Performance with Ensembles
• Combine Models Into Ensemble Predictions
• Bagging Algorithms
• Boosting Algorithms
• Voting Ensemble
• Summary
data science student questions - 1
“So you do Data Science work. What does that really involve? And how is that different from programming, statistical work or data engineering?”
“I want to learn Data Science. Between R, Python and SAS, where should I start and what are the pros and cons of each?”
“What are OOP (object-oriented programming) and structured programming, and what’s the difference between them?”
“What are the main differences between Python 2.7 and Python 3.x? And why do so many developers stay with Python 2.7?”
"What is the difference between supervised learning and unsupervised learning?"
"How might the graphs for a univariate analysis differ from those for a bivariate analysis? Can you graph multivariate data?"
"How do you explain machine learning to an 8-year-old child?"
"What is gradient descent?"
"What is multicollinearity and how can you overcome it?"
data science student questions - 2
"What is the curse of dimensionality?"
"What do you understand by a hypothesis in the context of Machine Learning?"
"What's the difference between a Test Set and a Validation Set?"
"What is cross-validation and what is it used for?"
"What's the difference between a Classification and Regression Tree (CART) algorithm and a Random Forest? And when is one better than the other?"
"What are the basic assumptions to be made for linear regression?"
"Can you explain in simple language what an eigenvalue and an eigenvector are?"
"Do gradient descent methods always converge to the same point?"
"What's the difference between continuous, ordinal and categorical variables?"
"What is K-means? How can you select K for K-means?"
data science student questions - 3
"Why is naive Bayes so ‘naive’?"
"OLS is to linear regression as maximum likelihood is to logistic regression. Explain the statement."
"What do you understand by the bias-variance trade-off?"
"Do you suggest that treating a categorical variable as a continuous variable would result in a better predictive model?"
"When does regularization become necessary in Machine Learning?"
"Explain a model and its dimensions to an 8-year-old."
"How do you determine and deal with correlated features in your data set, and how do you reduce the dimensionality of the data?"
"During analysis, how do you treat missing values?"
"What is Regularization and what kind of problems does regularization solve?"
Extras
the Data Scientist Roles
Roles Defined by 3 different Data Science Authors
Data Scientist Core Skills | How To Build A Successful Data Science Team | The seven people you need on your Big Data team | Descriptions
Capture | Data Engineer | Handyman | Expert in Dell EDW, D3, BO, Hana/BMS, other RDBMS, and ETL work
Capture | Data Engineer | Open Source Guru (plus Data Modeler) | Hadoop stack, Cloudera, Linux, data structures and network
Analyze | Machine Learning Expert | Data Modeler (plus all aspects of Data Engineer and Business Analyst) | SQL, RDBMS, Teradata, Dell infrastructure
Analyze | Machine Learning Expert | Deep Diver | Machine Learning, R, Python, SQL, ETL work, algorithm modeling, statistics
Present | Business Analyst | Story Teller | PowerPoint, Design, Tableau, understands customers' business and technical language, artistic eye
Present | Business Analyst | Snoop (plus Handyman skills) | Enthusiastic, deeply creative, super savvy in Dell environments, finds contacts and not hesitant to do work-arounds
Present | Business Analyst | Privacy Wonk | Meticulous about Dell policy, socially aware, foresees roadblocks



Editor's Notes

  • #3: https://www.datasciencecentral.com/profiles/blogs/6-predictions-about-data-science-machine-learning-and-ai-for-2018