So your boss wants you to learn
data science
Susan Ibach
susan@aigaming.com
@HockeyGeekGirl
Data Science has become a buzzword
I THINK WE
NEED TO DO
DATA SCIENCE
YOUR DATA
SCIENCE
When your boss walks up to you and says
we need to do data science, where do
you start?
PLATFORM TO USE
DATA SCIENCE
BIG DATA
AI
ML
What is a data scientist?
• Advanced math skills
• Subject Matter Expertise
• Data Engineering skills
Follow the 7 Steps to data science success
Step 1: Identify your problem and
data to define the problem
What insights might help solve/define the problem?
An airline wants to prevent flight delays
Different insights require different tools
SELECT COUNT(*) FROM
FLIGHTS WHERE
ACTUAL_ARR_TIME >
SCHED_ARR_TIME
SELECT COUNT(*) FROM
FLIGHTS WHERE
ACTUAL_ARR_TIME >
SCHED_ARR_TIME
AND YEAR(DEP_DATE)
BETWEEN 1997 AND 2017
Data science tools include
Data Mining
gain insights from data
• Those who bought this also bought
• Keyword extraction
Machine Learning
make predictions
• Who will need hospitalization from the flu?
• How many copies of this book will I sell?
Deep Learning
For complex data processed in layers
• Is there a bird in this photo?
• Will this person get cancer?
Do we need Artificial Intelligence?
•AI is when a computer completes a task that
normally requires human intelligence
• Answering questions from a customer
• Recognizing the content of a photo
• Understanding human speech
•We use data science to analyze and recognize
patterns and responses so we can do AI
Step 2: Collect data
What data would help you determine:
Which flights are most likely to be delayed next
week?
Where do I get all that data?
• Relational databases
• BLOB storage
• NoSQL databases
• Data warehouses
• Flat files
• Open source data
• Sensors
BIG DATA
When does data become “big data”?
High Volume High Velocity
High Variety
Step 3: Prepare data
Your data will need clean-up/prep
Flight #   Dep Date      Sched Dep Time   Dep Airport   Dep Delay
041        15-dec-2016   09:20            YYZ           253:26
386        15-dec-2016   15:20            YYZ
415        15-dec-2016   19:15            YYZ           0:02
415        15-dec-2016   19:15            YYZ           0:02

Date         Airport   Wind        Precipitation   Precipitation Type
15/12/2016   Pearson   NNE 5 MPH   150 mm          Snow
15/12/2016   Dulles    SW 18 MPH   7 mm            Rain
15/12/2016   Reagan    SW 18 MPH   7 mm            Rain
Missing Values
Duplicate rows
Different data formats
Decomposition
Outliers
Scaling
What tools might you use for data prep?
Start with what you already know
• Excel, SQL
Write your own code
• Python Pandas library, R
Third party products
• Experian, Paxata, Alteryx, SAP Lumira, Teradata Data Lab,
Knowledge Works, Datameer
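As a concrete illustration, the clean-up tasks above (duplicate rows, missing values, inconsistent date formats) can be sketched with pandas. The column names and the fill-with-zero choice are illustrative assumptions based on the sample flight table, not a prescribed recipe:

```python
import pandas as pd

# Toy flight data with the problems shown in the sample table:
# a missing delay value, a duplicate row, and a non-standard date format.
flights = pd.DataFrame({
    "Flight": ["041", "386", "415", "415"],
    "DepDate": ["15-dec-2016", "15-dec-2016", "15-dec-2016", "15-dec-2016"],
    "DepAirport": ["YYZ", "YYZ", "YYZ", "YYZ"],
    "DepDelayMin": [253.4, None, 0.03, 0.03],
})

# Duplicate rows: keep only the first occurrence.
flights = flights.drop_duplicates()

# Missing values: here we assume no recorded delay means no delay.
flights["DepDelayMin"] = flights["DepDelayMin"].fillna(0)

# Different data formats: parse the text dates into real datetimes.
flights["DepDate"] = pd.to_datetime(flights["DepDate"], format="%d-%b-%Y")

print(flights)
```

The same steps scale up, but with big data they typically run on a cluster (see Appendix A) rather than on a single machine.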
If you have Big Data
•Preparing and pulling together your data will require
a LOT of storage and processing power
Step 4: Identify the data that
influences outcomes
Which fields (“features”) might help us
predict if a flight will be late (the “label”)?

Flight #   Dep Date      Sched Dep Time   Dep Airport   Dep Delay
041        15-dec-2016   09:20            YYZ           253:26
386        15-dec-2016   15:20            YYZ
415        15-dec-2016   19:15            YYZ           0:02
415        15-dec-2016   19:15            YYZ           0:02

Date         Airport   Wind        Precipitation   Precipitation Type
15/12/2016   Pearson   NNE 5 MPH   15 cm           Snow
15/12/2016   Dulles    SW 18 MPH   7 mm            Rain
15/12/2016   Reagan    SW 18 MPH   7 mm            Rain
Are there any fields we can
decompose to get more
information?
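One answer: the departure date can be decomposed into features like month and day of week, which may correlate with delays. A minimal pandas sketch (the column names are assumptions for illustration):

```python
import pandas as pd

flights = pd.DataFrame({"DepDate": ["15-dec-2016", "16-dec-2016"]})
dep = pd.to_datetime(flights["DepDate"], format="%d-%b-%Y")

# Decompose one raw field into several candidate features.
flights["DepMonth"] = dep.dt.month          # winter months may mean weather delays
flights["DepDayOfWeek"] = dep.dt.dayofweek  # 0 = Monday ... 6 = Sunday

print(flights)
```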
Which fields (“features”) help us predict if a
picture contains a dog or cat (the “label”)?
• Pixel1Color, Pixel2Color, Pixel3Color,….Pixel9036Color
Break out the deep learning
• GPUs
• Storage
Layers: Pixel → Edge → Shape → Cat
Step 5: Pick the right algorithm
What are you trying to predict?
Prediction                   Algorithm           Example
Predict continuous values    Regression          Predict what time a flight will land
Predict what category        Classification      Predict if a flight will be late or on time
something falls into
Detect unusual data points   Anomaly detection   Predict if a credit card transaction is
                                                 fraudulent; predict if a runner cheated
                                                 on a marathon
Supervised vs Unsupervised

Type           Definition                           Example
Supervised     You have existing data with known    When I try to predict if a flight next week
               inputs and known outputs to help     will be late, I know which flights have been
               make predictions                     late in the past
Unsupervised   You have input data but no known     When I try to predict if a runner cheated on
               outcomes in your data                a marathon, I don’t have a history of runners
                                                    who cheated in the past
Step 6: Train your model
Once you have data and your algorithm
you can train and create your predictive
model
Python
R
scikit-learn (based on NumPy, SciPy, and matplotlib)
Azure Machine Learning Service
Cognitive Toolkit/Tensorflow (deep learning)
There are lots of tools to choose from
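To make the training step concrete, here is a minimal scikit-learn sketch that fits a classifier to predict late vs. on-time. The toy feature matrix (scheduled departure hour, forecast snowfall in cm) and its labels are invented for illustration, not the airline's real data:

```python
from sklearn.linear_model import LogisticRegression

# Toy training data: [scheduled departure hour, forecast snowfall in cm]
X = [[9, 15], [15, 0], [19, 0], [8, 10], [17, 12], [7, 0]]
# Known outcomes from past flights: 1 = late, 0 = on time
y = [1, 0, 0, 1, 1, 0]

# Classification algorithm (Step 5) trained on past data (Step 6).
model = LogisticRegression()
model.fit(X, y)

# Ask about next week's 9 AM flight with 12 cm of snow forecast.
print(model.predict([[9, 12]]))
```

Swapping in another algorithm from the table in Step 5 is usually a one-line change, which is part of scikit-learn's appeal.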
Step 7: Test your model
You need to know the accuracy of your
model!

Flights with known outcomes:
Flt #406, Air Canada, April 1, 2016, 3:15 PM, YYZ-YVR, Late: No
Flt #351, West Jet, April 12, 2016, 8:01 AM, YOW-YYZ, Late: No
Flt #141, Delta, Sep 25, 2016, 1:45 PM, HND-SEA, Late: Yes

Strip the labels and feed the inputs to the predictive/trained model:
Flt #406, Air Canada, April 1, 2016, 3:15 PM, YYZ-YVR
Flt #351, West Jet, April 12, 2016, 8:01 AM, YOW-YYZ
Flt #141, Delta, Sep 25, 2016, 1:45 PM, HND-SEA

Compare the model’s predictions to the known outcomes:
Flt #406, Air Canada, April 1, 2016, 3:15 PM, YYZ-YVR, Late: Yes (wrong)
Flt #351, West Jet, April 12, 2016, 8:01 AM, YOW-YYZ, Late: No (correct)
Flt #141, Delta, Sep 25, 2016, 1:45 PM, HND-SEA, Late: Yes (correct)

2 of 3 correct = 66.6% accuracy
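The arithmetic behind that 66.6% can be reproduced with scikit-learn's accuracy_score; the labels below mirror the three test flights on this slide:

```python
from sklearn.metrics import accuracy_score

# Known outcomes for the three held-back test flights.
actual    = ["No", "No", "Yes"]   # Flt #406, Flt #351, Flt #141
# What the trained model predicted for the same flights.
predicted = ["Yes", "No", "Yes"]  # it got Flt #406 wrong

print(accuracy_score(actual, predicted))
```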
What do I do if my accuracy is lousy?
Go back to step 1
For additional information
•Appendix A
What is Hadoop anyway?
•Appendix B
What cloud tools exist to help with data science?
•Appendix C Lexicon
THANK YOU
QUESTIONS?
Susan Ibach
susan@aigaming.com
@HockeyGeekGirl
Appendix A –
What is Hadoop anyway?
It’s a tool for analyzing Big Data
Hadoop is an open-source framework
•Based on Java
•Distributed processing of large datasets across
clusters of computers
•Distributed storage and computation across clusters
of computers
•Scales from single server to thousands of machines
Hadoop components
• Hadoop Common – java libraries used by Hadoop to abstract the filesystem and
OS
• Hadoop YARN – framework for job scheduling and managing cluster resources
• HDFS – distributed File system for access to application data (distributed storage)
• Based on Google File System (GFS)
• Hadoop can run on other distributed file systems (e.g. local FS, HFTP, S3) but usually uses HDFS
• A file in HDFS is split into blocks which are stored on DataNodes; the NameNode maps blocks to
DataNodes
• MapReduce – the programming model for parallel processing of large data sets
(distributed computation)
• Map data into key/value pairs (tuples)
• Reduce data tuples into smaller pairs of tuples
• Input/output stored in file system
• JobTracker and TaskTrackers schedule tasks, monitor them, and re-execute failed tasks
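The map and reduce stages above can be illustrated without a cluster; this plain-Python word count mimics what Hadoop does at scale (map to key/value tuples, shuffle by key, reduce each group):

```python
from itertools import groupby
from operator import itemgetter

lines = ["the quick brown fox", "the lazy dog", "the fox"]

# Map: emit a (word, 1) tuple for every word on every line.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group the tuples by key (Hadoop does this between the stages).
mapped.sort(key=itemgetter(0))

# Reduce: collapse each group of tuples into a single (word, count) pair.
counts = {word: sum(count for _, count in group)
          for word, group in groupby(mapped, key=itemgetter(0))}

print(counts)
```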
Hadoop components
• Hive – similar to SQL, hides complexity of Map Reduce
programming, generates a MapReduce job
• Pig - (Pig latin) – High level data flow language for
parallel computation & ETL
• Hbase - Scalable distributed non-relational database that
supports structured data storage for large tables (billions
of rows X millions of columns)
• Spark - compute engine for Hadoop data used for ETL,
machine learning, stream /real-time processing and
graph computation (gradually replacing MapReduce
because it is faster for iterative algorithms)
How does Hadoop work?
• User submits a job to Hadoop
• Location of input and output files
• Java classes containing map and reduce functions
• Job configuration parameters
•Hadoop submits the job to the JobTracker, which
distributes the job to the worker nodes, schedules
tasks and monitors them
•TaskTrackers execute the tasks and the output is
stored in output files on the file system
Why is it popular?
• Allows users to quickly write and test distributed systems
• It automatically distributes data and work across the
machines and utilizes the parallelism of CPU cores
• Does not rely on hardware for fault tolerance and high
availability
• Servers can be added or removed dynamically
• It’s open source and runs on many platforms since it
is Java-based
Appendix B –
What cloud tools exist to help with
data science?
Microsoft Azure

Cortana, Bot Framework | Interact with it | Type messages, talk, send images or video and get answers
Power BI | See it | Visualize data with heat maps, graphs and charts
Stream Analytics | Stream it | Monitor data as it arrives and act on it in real time
Azure Machine Learning, Microsoft R Server | Learn it | Analyze past data to find patterns you can use to predict outcomes for new data
SQL Data Warehouse, SQL DB, Document DB, Blob storage | Relate it | Store related data together using the best data store for the job
Data Lake | Store it | A data store that can handle data of any size, shape or speed
Event Hubs | Collect it | Collect data from sources such as IoT sensors that send large amounts of data over small amounts of time
Data Factory | Move it | Move data from one place to another, transforming it as it moves
Data Catalog | Document it | Document all your data sources
Cognitive Services | Use it | Pre-trained models available for use
HD Insight & Azure Databricks | Scale it | Create clusters for Hadoop or Spark (Databricks for Spark)
Google Cloud Platform

Cloud Dataprep | Prepare it | Prepare your data for analysis
BigQuery ML, BigQuery GIS | Train it | Train machine learning models
BigQuery, GCP data lake | Store it | Data warehouse
Cloud Dataproc | Scale it | Spin up clusters for Hadoop and Spark
Cloud Pub/Sub, Cloud Dataflow | Stream it | Ingest and process events in real time
Cloud Storage | Store it | A data store that can handle data of any size, shape or speed
Prepackaged AI solutions | Use it | Pre-trained models available for use
IBM

Analytics Engine | Scale it | Build and deploy clusters for Hadoop and Spark
InfoSphere Information Server on Cloud | Access it | Extract, transform & load data, plus data standardization
Streaming Analytics | Stream it | Monitor data as it arrives and act on it in real time
IBM Watson | Train it or Use it | Train your own models or leverage pre-trained models for features such as speech to text, natural language processing, and image analysis
Watson IoT Platform | Collect it | Connect devices and analyze the associated data
Deep Learning | Analyze it | Design and deploy deep learning models using neural networks
IBM Data Refinery | Prepare it | Data preparation tool
AWS

Data lakes, Redshift | Store it | Store your data
Lake Formation | Move it | Get data into your data lake
Streaming Analytics | Stream it | Monitor data as it arrives and act on it in real time
Amazon Kinesis, IoT Core | Collect it | Collect, process and analyze real-time data, including data from IoT devices
Glue | Document it | Create a catalog of your data that is searchable and queryable by users
Athena | Analyze it | Analyze your data
EMR, Deep Learning AMIs | Scale it | Scale using Hadoop and Spark
QuickSight | See it | Visualizations and dashboards
Application Services | Use it | Pre-trained models ready for use
Deep Learning AMIs, SageMaker | Train it | Tools to help you build and train models
Appendix C Lexicon
Buzzwords and Tools
Amazon Redshift – Data warehouse infrastructure
Ambari – web-based tool for managing Apache Hadoop clusters: provision, manage and monitor your Hadoop clusters
Avro – a data serialization system (like XML or JSON)
Apache Hadoop – distributed storage and processing of big data. Splits files into large blocks and distributes them across nodes in a cluster. It then transfers the
packaged code to the nodes to process the data in parallel, for faster processing
Apache Flink – open source stream processing framework to help you move data from your sensors and applications to your data stores and applications
Apache Storm – open source realtime computation system. Storm does for realtime processing what Hadoop does for batch processing
Azure DataBricks – platform for managing and deploying Spark at scale
Azure Data Lake Analytics – allows you to write queries against data in a wide variety of data stores
Azure Notebooks – basically Jupyter notebooks on Azure, supporting Python, F# and R
Azure SQL Data Warehouse – Data warehouse infrastructure
Caffe – Deep learning framework
Cassandra –NoSQL Database
Cognitive Toolkit (CNTK) – Microsoft’s Deep learning toolkit (competes with Google Tensorflow) for training machine learning models. Provides APIs you call with
Python
CouchDB – NoSQL Database
Chukwa – Data collection system for managing large distributed systems
H2O – Open source deep learning platform (competes with Tensorflow and Cognitive Toolkit)
Hadoop Distributed File System (HDFS) – the distributed file system used by Hadoop; great for horizontal scalability (does not support insert, update & delete)
Hadoop MapReduce – programming model used to process data, provides horizontal scalability
Hadoop YARN – platform for managing resources and scheduling in Hadoop clusters
HD Insight –Microsoft Azure service used to spin up Hadoop clusters to help analyze big data with Hadoop, Spark, Hbase, R-Server, Storm, etc..
Hive – Data warehouse infrastructure
Hbase – Scalable distributed non-relational database that supports structured data storage for large tables (billions of rows X millions of
columns)
Jupyter Notebooks – web applications that allow you to create shareable interactive documents containing text, equations, code, and data
visualizations. Very useful for data scientists to explore and manipulate data sets and to share results. You can use them for data cleaning
and transformation, machine learning, and data visualization; they support Python, R, Julia, and Scala. You can use Jupyter notebooks on a
Spark cluster.
Kafka – distributed publisher subscriber messaging system. Used in the extraction step of ETL for high volume high velocity data flow
MapReduce – a two stage algorithm for processing large datasets. Data is split across a Hadoop cluster, the map function breaks data in key
value pairs (e.g. individual words in a text file), the Reduce function combines the mapped data (e.g. total counts of each word). MapReduce
functions can be written in Java, Python, C# or Pig
MATLAB – tools for machine learning – build models
MongoDB – NoSQL Database
MySQL – relational database
Scikit-learn – tools for data mining and data analysis built on Python (NumPy, SciPy and matplotlib)
Spark – compute engine for Hadoop data used for ETL, machine learning, stream processing and graph computation (starting to replace
MapReduce because Spark is faster)
Sqoop – Used for transferring data between structured databases and Hadoop
Tensorflow – Google’s deep learning toolkit. An open source software library for training machine learning models, allows you to deploy
computation across one or more CPUs or GPUS with a single API. Tensorflow provides APIs you call from Python
Torch – computing framework for Machine learning algorithms that puts GPUs first (good for deep learning)
TPU – Tensor Processing Unit. Custom-built ASIC designed for high performance when running models rather than training them.
Second-generation Google TPUs are available through Google Compute Engine
Tez – data flow programming framework built on YARN, runs projects like Hive and Pig, starting to replace MapReduce as the execution
engine on Hadoop because it can process data in a single job instead of multiple jobs
ZooKeeper – high performance coordination service for distributed applications
Programming languages and libraries

Scala – libraries and tools for performing data analysis
Python –
Pandas – for exploring data and data preparation (e.g. missing values, joins, string manipulation)
NumPy – fundamental package for scientific computing with Python
SciPy – numerical routines for numerical integration and optimization
matplotlib – for graphing, charting and visualizing data sets or query results
Keras – deep learning library for building your own neural networks
R – language for statistics (linear and nonlinear modelling, classification, clustering) and graphics
Julia – numerical computing language that supports parallel execution
Mahout – scalable machine learning and data mining library
Pig (Pig Latin) – high-level data flow language for parallel computation & ETL
HiveQL – similar to SQL; hides the complexity of MapReduce programming by generating a MapReduce job
U-SQL – query language used by Azure Data Lake to query across data sources
  • 1. So your boss wants you to learn data science Susan Ibach susan@aigaming.com @HockeyGeekGirl
  • 2. Data Science has become a buzzword I THINK WE NEED TO DO DATA SCIENCE YOUR DATA SCIENCE
  • 3. When your boss walks up to you and says we need to do data science, where do you start? PLATFORM TO USE DATA SCIENCE BIG DATA AI ML
  • 4. What is a data scientist? Advanced math skills Subject Matter Expertise Data Engineering skills
  • 5. Follow the 7 Steps to data science success 1 2 3 4 5 6 7
  • 6. Step 1: Identify your problem and data to define the problem 1
  • 7. What insights might help solve/define the problem? An airline wants to prevent flight delays
  • 8. Different insights require different tools SELECT COUNT(*) FROM FLIGHTS WHERE ACTUAL_ARR_TIME > SCHED_ARR_TIME SELECT COUNT(*) FROM FLIGHTS WHERE ACTUAL_ARR_TIME > SCHED_ARR_TIME BETWEEN 1997 and 2017
  • 9. Data science tools include Data Mining gain insights from data • Those who bought this also bought • Keyword extraction Machine Learning make predictions • Who will need hospitalization from the flu? • How many copies of this book will I sell? Deep Learning For complex data processed in layers • Is there a bird in this photo? • Will this person get cancer?
  • 10. Do we need Artificial Intelligence? •AI is when a computer completes a task that normally requires human intelligence • Answering questions from a customer • Recognizing the content of a photo • Understanding human speech •We use data science to analyze and recognize patterns and responses so we can do AI
  • 11. Step 2: Collect data 1 2
  • 12. Which flights are most likely to be delayed next week? What data would help you determine:
  • 13. Relational databases BLOB storage NoSQL databases Data warehouses Flat Files Open source data Sensors Where do I get all that data?
  • 14. BIG DATA When does data become “big data”? High Volume High Velocity High Variety
  • 15. Step 3: Prepare data 1 2 3
  • 16. Your data will need clean-up/prep Flight # Dep Date Sched Dep Time Dep Airport Dep Delay 041 15-dec-2016 09:20 YYZ 253:26 386 15-dec-2016 15:20 YYZ 415 15-dec-2016 19:15 YYZ 0:02 415 15-dec-2016 19:15 YYZ 0:02 Date Airport Wind Precipitation Precipitation Type 15/12/2016 Pearson NNE 5 MPH 150 mm Snow 15/12/2016 Dulles SW 18 MPH 7 mm Rain 15/12/2016 Reagan SW 18 MPH 7 mm Rain Missing Values Duplicate rows Different data formats Decomposition Outliers Scaling
  • 17. Start with what you already know • Excel, SQL Write your own Code • Python Pandas library, R Third party products Experian, Paxata, Alteryx, SAP Lumira, Teradata Data Lab, Knowledge Works, Datameer What tools might you use for data prep?
  • 18. If you have Big Data •Preparing and pulling together your data will require a LOT of storage and processing power
  • 19. Step 4: Identify the data that influences outcomes 1 2 3 4
  • 20. Which fields “features” might helps us predict if a flight will be late “label”? Flight # Dep Date Sched Dep Time Dep Airport Dep Delay 041 15-dec-2016 09:20 YYZ 253:26 386 15-dec-2016 15:20 YYZ 415 15-dec-2016 19:15 YYZ 0:02 415 15-dec-2016 19:15 YYZ 0:02 Date Airport Wind Precipitation Precipitation Type 15/12/2016 Pearson NNE 5 MPH 15 cm Snow 15/12/2016 Dulles SW 18 MPH 7 mm Rain 15/12/2016 Reagan SW 18 MPH 7 mm Rain Are there any fields we can decompose to get more information?
  • 21. Which fields “features” help us predict if a picture contains a dog or cat “label”? • Pixel1Color, Pixel2Color, Pixel3Color,….Pixel9036Color
  • 22. Break out the deep learning GPUs Storage Pixel Edge Shape Cat
  • 23. Step 5: Pick the right algorithm 1 2 3 4 5
  • 24. What are you trying to predict? Prediction Algorithm Example Predict continuous values Regression Predict what time a flight will land Predict what category something falls into Classification Predict if a flight will be late or on time Detect unusual data points Anomaly detection Predict if a credit card transaction is fraudulent Predict if a runner cheated on a marathon
  • 25. Supervised vs Unsupervised Type Definiton Example Supervised You have existing data with known inputs and known outputs to help make predictions When I try to predict if a flight next week will be late, I know what flights have been late in the past Unsupervised You have input data but no known outcomes in your data When I try to predict if a runner cheated on a marathon, I don’t have a history of runners who cheated in the past.
  • 26. Step 6: Train your model 1 2 3 4 5 6
  • 27. Once you have data and your algorithm you can train and create your predictive model
  • 28. Python R scikit-learn (based on NumPy, SciPy, and matplotlib) Azure Machine Learning Service Cognitive Toolkit/Tensorflow (deep learning) There are lots of tools to choose from
  • 29. Step 7: Test your model 1 2 3 4 5 6 7
  • 30. You need to know the accuracy of your model! Predictive/Trained Model Flt #406 Air Canada April 1, 2016 3:15 PM YYZ-YVR Late: No Flt #351 West Jet April 12, 2016 8:01 AM YOW-YYZ Late: No Flt #141 Delta Sep 25, 2016 1:45 PM HND-SEA Late: Yes Flt #406, Air Canada, April 1, 2016, 3:15 PM, YYZ-YVR Flt #351, West Jet, April 12, 2016, 8:01 AM, YOW-YYZ Flt #141, Delta, Sep 25, 2016, 1:45 PM HND-SEA Flt #406, Air Canada, April 1, 2016, 3:15 PM, YYZ-YVR, Late: Yes Flt #351, West Jet, April 12, 2016, 8:01 AM, YOW-YYZ, Late: No Flt #141, Delta, Sep 25, 2016, 1:45 PM HND-SEA, Late: Yes 66.6% accuracy
  • 31. What do I do if my accuracy is lousy? Go back to step 1
  • 32. For additional information •Appendix A What is Hadoop anyway? •Appendix B What cloud tools exist to help with data science? •Appendix C Lexicon
  • 34. Appendix A – What is Hadoop anyway? It’s a tool for analyzing Big Data
  • 35. Hadoop is an OS framework •Based on java •Distributed processing of large datasets across clusters of computers •Distributed storage and computation across clusters of computers •Scales from single server to thousands of machines
  • 36. Hadoop components • Hadoop Common – java libraries used by Hadoop to abstract the filesystem and OS • Hadoop YARN – framework for job scheduling and managing cluster resources • HDFS – distributed File system for access to application data (distributed storage) • Based on Google File System (GFS) • Hadoop can run on any distributed file system (FS, HFTP, FS, S3, FS) but usually HDFS • File in HDFS is split into blocks which are stored in DataNodes. Name nodes map blocks to datanodes • MapReduce – the query language for parallel processing of large data sets (distributed computation) • Map data into key/value pairs (tuples) • Reduce data tuples into smaller pairs of tuples • Input/output stored in file system • Job tracker and TaskTracker schedule, monitor tasks and re-execute failed tasks
  • 37. Hadoop components • Hive – similar to SQL, hides complexity of Map Reduce programming, generates a MapReduce job • Pig - (Pig latin) – High level data flow language for parallel computation & ETL • Hbase - Scalable distributed non-relational database that supports structured data storage for large tables (billions of rows X millions of columns) • Spark - compute engine for Hadoop data used for ETL, machine learning, stream /real-time processing and graph computation (gradually replacing MapReduce because it is faster for iterative algorithms)
  • 38. How does Hadoop work • User submits a job to Hadoop • Location of input and output files • Java classes containing map and reduce functions • Job configuration parameters •Hadoop submits the job to JobTracker which distributes the job to the slaves, schedules tasks and monitors them •Task trackers execute the task and output is stored in output files on the file system
  • 39. Why is it popular? • Allows user to quickly write and test distributed systems • It automatically distributes data and work across the machines and utilizes the parallelism of CPU cores • Does not rely on hardware for fault tolerance and high availability • Servers can be added or removed dynamically • It’s Open source and compatible on many platforms since it is Java based
  • 40. Appendix B – What cloud tools exist to help with data science?
  • 41. Cortana, Bot Framework Interact with it Type messages, talk, send images, or video and get answers Power BI See it Visualize data with heat maps, graphs and charts Stream Analytics Stream it Monitor data as it arrives and act on it in real time Azure Machine Learning, Microsoft R Server Learn it Analyze past data to learn by finding patterns you can use to predict outcomes for new data SQL Data Warehouse, SQL DB, Document DB, Blob storage Relate it Store related data together using the best data store for the job Data Lake Store it A data store that can handle data of any size, shape or speed Event Hubs Collect it Collect data from sources such as IoT sensors that send large amounts of data over small amounts of time Data Factory Move it Move data from one place to another, transform it as you move it Data Catalog Document it Document all your data sources Cognitive Services Use it Pre-trained models available for use HD Insight & Azure Data Bricks Scale it Create clusters for Hadoop or Spark (DataBricks for Spark) Microsoft Azure
  • 42. Google Cloud Platform
  • Cloud Dataprep – Prepare it: prepare your data for analysis
  • BigQuery ML, BigQuery GIS – Train it: train machine learning models
  • BigQuery, GCP data lake – Store it: data warehouse
  • Cloud Dataproc – Scale it: spin up clusters for Hadoop and Spark
  • Cloud Pub/Sub, Cloud Dataflow – Stream it: ingest events in real time
  • Cloud Storage – Store it: a data store that can handle data of any size, shape, or speed
  • Prepackaged AI solutions – Use it: pre-trained models available for use
  • 43. IBM
  • Analytics Engine – Scale it: build and deploy clusters for Hadoop and Spark
  • InfoSphere Information Server on Cloud – Access it: extract, transform & load data, plus data standardization
  • Streaming Analytics – Stream it: monitor data as it arrives and act on it in real time
  • IBM Watson – Train it or use it: train your own models or leverage pre-trained models for features such as speech to text, natural language processing, and image analysis
  • Watson IoT Platform – Collect it: connect devices and analyze the associated data
  • Deep Learning – Analyze it: design and deploy deep learning models using neural networks
  • IBM Data Refinery – Prepare it: data preparation tool
  • 44. AWS
  • Data lakes, Redshift – Store it: store your data
  • Lake Formation – Move it: get data into your data lake
  • Streaming analytics – Stream it: monitor data as it arrives and act on it in real time
  • Amazon Kinesis, IoT Core – Collect it: collect, process, and analyze real-time data, including data from IoT devices
  • Glue – Document it: create a catalog of your data that is searchable and queryable by users
  • Athena – Analyze it: analyze your data
  • EMR, Deep Learning AMIs – Scale it: scale using Hadoop and Spark
  • QuickSight – See it: visualizations and dashboards
  • Application Services – Use it: pre-trained models ready for use
  • Deep Learning AMIs, SageMaker – Train it: tools to help you build and train models
  • 46. Amazon Redshift – data warehouse infrastructure
  • Ambari – web-based tool for managing Apache Hadoop clusters: provision, manage, and monitor your Hadoop clusters
  • Avro – a data serialization system (like XML or JSON)
  • Apache Hadoop – distributed storage and processing of big data. Splits files into large blocks and distributes them across nodes in a cluster, then transfers the packaged code to those nodes to process the data in parallel, for faster processing
  • Apache Flink – open source stream processing framework to help you move data from your sensors and applications to your data stores and applications
  • Apache Storm – open source real-time computation system. Storm does for real-time processing what Hadoop does for batch processing
  • Azure Databricks – platform for managing and deploying Spark at scale
  • Azure Data Lake Analytics – allows you to write queries against data in a wide variety of data stores
  • Azure Notebooks – essentially Jupyter notebooks on Azure, supporting Python, F#, and R
  • Azure SQL Data Warehouse – data warehouse infrastructure
  • Caffe – deep learning framework
  • Cassandra – NoSQL database
  • Cognitive Toolkit (CNTK) – Microsoft's deep learning toolkit for training machine learning models (competes with Google's TensorFlow). Provides APIs you call with Python
  • CouchDB – NoSQL database
  • Chukwa – data collection system for managing large distributed systems
  • H2O – open source deep learning platform (competes with TensorFlow and Cognitive Toolkit)
  • Hadoop Distributed File System (HDFS) – the distributed file system used by Hadoop; great for horizontal scalability (does not support insert, update & delete)
  • Hadoop MapReduce – programming model used to process data; provides horizontal scalability
  • Hadoop YARN – platform for managing resources and scheduling in Hadoop clusters
  • HD Insight – Microsoft Azure service used to spin up Hadoop clusters to help analyze big data with Hadoop, Spark, HBase, R Server, Storm, etc.
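Avro has its own binary format and schema files (via the avro library), but the core idea of any serialization system is the same: turn an in-memory record into bytes and back. A rough stdlib-only illustration using JSON, with an invented flight record:

```python
import json

# A hypothetical flight record to serialize (field names are invented)
record = {"flight": "041", "dep_airport": "YYZ", "dep_delay_minutes": 253}

# Serialize: turn the in-memory record into bytes that can be stored or sent
encoded = json.dumps(record).encode("utf-8")

# Deserialize: reconstruct the record from the bytes
decoded = json.loads(encoded.decode("utf-8"))
print(decoded["dep_airport"])  # → YYZ
```

Avro adds a schema on top of this, so readers and writers agree on field names and types, and the binary encoding stays compact.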
  • 47. Hive – data warehouse infrastructure
  • HBase – scalable, distributed non-relational database that supports structured data storage for large tables (billions of rows X millions of columns)
  • Jupyter Notebooks – web applications that let you create shareable, interactive documents containing text, equations, code, and data visualizations. Very useful for data scientists to explore and manipulate data sets and to share results. You can use them for data cleaning and transformation, machine learning, and data visualization; they support Python, R, Julia, and Scala. You can use Jupyter notebooks on a Spark cluster
  • Kafka – distributed publisher-subscriber messaging system. Used in the extraction step of ETL for high-volume, high-velocity data flows
  • MapReduce – a two-stage algorithm for processing large data sets. Data is split across a Hadoop cluster; the map function breaks the data into key-value pairs (e.g. individual words in a text file), and the reduce function combines the mapped data (e.g. total counts of each word). MapReduce functions can be written in Java, Python, C#, or Pig
  • MATLAB – tools for machine learning and building models
  • MongoDB – NoSQL database
  • MySQL – relational database
  • Scikit-learn – tools for data mining and data analysis built on Python (NumPy, SciPy, and matplotlib)
  • Spark – compute engine for Hadoop data used for ETL, machine learning, stream processing, and graph computation (starting to replace MapReduce because Spark is faster)
  • Sqoop – used for transferring data between structured databases and Hadoop
  • TensorFlow – Google's deep learning toolkit. An open source software library for training machine learning models; lets you deploy computation across one or more CPUs or GPUs with a single API. TensorFlow provides APIs you call from Python
  • Torch – computing framework for machine learning algorithms that puts GPUs first (good for deep learning)
  • TPU – Tensor Processing Unit. Custom-built ASIC designed for high performance when running models rather than training them; second-generation Google TPUs are available through Google Compute Engine
  • Tez – data flow programming framework built on YARN that runs projects like Hive and Pig; starting to replace MapReduce as the execution engine on Hadoop because it can process data in a single job instead of multiple jobs
  • ZooKeeper – high-performance coordination service for distributed applications
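Kafka's publisher-subscriber model is worth seeing in miniature. Real Kafka clients come from libraries like kafka-python; the toy in-process sketch below (all class and topic names invented) just shows the pattern: publishers send to a named topic without knowing who is listening, and every subscriber gets its own copy.

```python
from queue import Queue

class Broker:
    """Toy in-process message broker: each subscriber gets its own queue per topic."""
    def __init__(self):
        self.topics = {}  # topic name -> list of subscriber queues

    def subscribe(self, topic):
        # Give this subscriber its own queue for the topic
        q = Queue()
        self.topics.setdefault(topic, []).append(q)
        return q

    def publish(self, topic, message):
        # Deliver the message to every subscriber of the topic
        for q in self.topics.get(topic, []):
            q.put(message)

broker = Broker()
alerts = broker.subscribe("flight-delays")
audit = broker.subscribe("flight-delays")
broker.publish("flight-delays", {"flight": "041", "delay_min": 253})
print(alerts.get()["flight"])  # → 041
```

Kafka adds the parts that matter at scale: topics are partitioned and replicated across brokers, and messages are persisted so consumers can replay them.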
  • 48. Programming languages and libraries
  • Scala – libraries and tools for performing data analysis
  • Python – general-purpose language with a rich data science ecosystem:
  • Pandas – for exploring and preparing data (e.g. missing values, joins, string manipulation)
  • NumPy – fundamental package for scientific computing with Python
  • SciPy – numerical routines for numerical integration and optimization
  • matplotlib – for graphing, charting, and visualizing data sets or query results
  • Keras – deep learning library for building your own neural networks
  • R – language for statistics (linear and nonlinear modelling, classification, clustering) and graphics
  • Julia – numerical computing language with performance close to C that supports parallel execution
  • Mahout – scalable machine learning and data mining library
  • Pig (Pig Latin) – high-level data flow language for parallel computation & ETL
  • HiveQL – similar to SQL; hides the complexity of MapReduce programming and generates a MapReduce job
  • U-SQL – data language used by Azure Data Lake to query across data sources
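The kind of data preparation pandas is used for, such as filling in missing values, can be sketched without any libraries. Below is a stdlib-only version of mean imputation on invented departure-delay data; in pandas the same step would typically be a one-liner along the lines of `fillna` with the column mean.

```python
from statistics import mean

# Invented departure-delay values in minutes; None marks a missing reading
delays = [12, None, 35, None, 7]

# Compute the mean of the values that are present...
observed = [d for d in delays if d is not None]
fill = mean(observed)  # (12 + 35 + 7) / 3 = 18

# ...and impute it wherever a value is missing
cleaned = [d if d is not None else fill for d in delays]
print(cleaned)  # → [12, 18, 35, 18, 7]
```

Mean imputation is only one of several strategies; depending on the data you might instead drop the rows, carry the last value forward, or flag the gaps for a subject matter expert.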

Editor's Notes

  • #8: What insights might help solve/define the problem? How many flights were late last year? How much money did we spend on flight delays? What are the most common causes of late flights? Is the number of late flights increasing or decreasing year over year? Which flights are most likely to be delayed next week?
  • #9: SQL Queries are used to extract data from relational databases Data Warehouses are used to aggregate historical data to see trends Dashboards are used to provide visualizations of important data
  • #10: Data mining to gain insights from data Those who bought this also bought Keyword extraction Machine Learning to make predictions by using algorithms to parse and learn from historical data Predict if this credit card was stolen based on the most recent transactions Deep learning to analyze data with a lot of different features Is there a bird in this photo? Will this person get cancer?
  • #13: Weather forecast Crew schedules Maintenance history Passenger information Airport information
  • #15: Big data is what we call data that is so big and complex that traditional data processing is inadequate (e.g. internet search, financial, genomics) High volume (amount of data) High variety (range of data types and sources) High velocity (speed of data in or out)
  • #17: Missing values Duplicate rows Different data formats Outliers Decomposition Aggregation Scaling
  • #19: You will need parallel processing and distributed storage Hadoop gives you distributed storage and processing across one or more servers You set up a cluster and run Hadoop on your cluster to abstract the hardware Numerous tools run on top of Hadoop to access the data and perform the processing (Hive, Spark, MapReduce, Pig)
  • #21: Flight Number Scheduled Departure time Flight distance Day of week Month Year
  • #23: Sometimes even a subject matter expert cannot identify the features You can use deep learning with neural networks to identify the significant features The more processing power the better! Cheap storage and GPUs enabled breakthroughs in deep learning