So your boss wants you to learn
data science
Susan Ibach
susan@aigaming.com
@HockeyGeekGirl
Data Science has become a buzzword
I THINK WE
NEED TO DO
DATA SCIENCE
YOUR DATA
SCIENCE
When your boss walks up to you and says
we need to do data science, where do
you start?
PLATFORM TO USE
DATA SCIENCE
BIG DATA
AI
ML
What is a data scientist?
• Advanced math skills
• Subject Matter Expertise
• Data Engineering skills
Follow the 7 Steps to data science success
Step 1: Identify your problem and
data to define the problem
What insights might help solve/define the problem?
An airline wants to prevent flight delays
Different insights require different tools
SELECT COUNT(*) FROM
FLIGHTS WHERE
ACTUAL_ARR_TIME >
SCHED_ARR_TIME
SELECT COUNT(*) FROM
FLIGHTS WHERE
ACTUAL_ARR_TIME >
SCHED_ARR_TIME
AND YEAR(DEP_DATE)
BETWEEN 1997 AND 2017
Data science tools include
Data Mining
gain insights from data
• Those who bought this also bought
• Keyword extraction
Machine Learning
make predictions
• Who will need hospitalization from the flu?
• How many copies of this book will I sell?
Deep Learning
For complex data processed in layers
• Is there a bird in this photo?
• Will this person get cancer?
Do we need Artificial Intelligence?
•AI is when a computer completes a task that
normally requires human intelligence
• Answering questions from a customer
• Recognizing the content of a photo
• Understanding human speech
•We use data science to analyze and recognize
patterns and responses so we can do AI
Step 2: Collect data
What data would help you determine:
Which flights are most likely to be delayed next
week?
Where do I get all that data?
• Relational databases
• BLOB storage
• NoSQL databases
• Data warehouses
• Flat files
• Open source data
• Sensors
BIG DATA
When does data become “big data”?
High Volume High Velocity
High Variety
Step 3: Prepare data
Your data will need clean-up/prep
Flight #   Dep Date      Sched Dep Time   Dep Airport   Dep Delay
041        15-dec-2016   09:20            YYZ           253:26
386        15-dec-2016   15:20            YYZ
415        15-dec-2016   19:15            YYZ           0:02
415        15-dec-2016   19:15            YYZ           0:02

Date         Airport   Wind        Precipitation   Precipitation Type
15/12/2016   Pearson   NNE 5 MPH   150 mm          Snow
15/12/2016   Dulles    SW 18 MPH   7 mm            Rain
15/12/2016   Reagan    SW 18 MPH   7 mm            Rain
Missing Values
Duplicate rows
Different data formats
Decomposition
Outliers
Scaling
What tools might you use for data prep?
Start with what you already know
• Excel, SQL
Write your own code
• Python Pandas library, R
Third party products
• Experian, Paxata, Alteryx, SAP Lumira, Teradata Data Lab,
Knowledge Works, Datameer
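As a concrete illustration, the clean-up tasks above (duplicate rows, missing values, inconsistent date formats) can be sketched with pandas. The column names and the fill-with-zero choice are illustrative assumptions based on the sample flight table, not a prescribed recipe:

```python
import pandas as pd

# Toy flight data with the problems shown in the sample table:
# a missing delay value, a duplicate row, and a non-standard date format.
flights = pd.DataFrame({
    "Flight": ["041", "386", "415", "415"],
    "DepDate": ["15-dec-2016", "15-dec-2016", "15-dec-2016", "15-dec-2016"],
    "DepAirport": ["YYZ", "YYZ", "YYZ", "YYZ"],
    "DepDelayMin": [253.4, None, 0.03, 0.03],
})

# Duplicate rows: keep only the first occurrence.
flights = flights.drop_duplicates()

# Missing values: here we assume no recorded delay means no delay.
flights["DepDelayMin"] = flights["DepDelayMin"].fillna(0)

# Different data formats: parse the text dates into real datetimes.
flights["DepDate"] = pd.to_datetime(flights["DepDate"], format="%d-%b-%Y")

print(flights)
```

The same steps scale up, but with big data they typically run on a cluster (see Appendix A) rather than on a single machine.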
If you have Big Data
•Preparing and pulling together your data will require
a LOT of storage and processing power
Step 4: Identify the data that
influences outcomes
Which fields (“features”) might help us
predict if a flight will be late (the “label”)?

Flight #   Dep Date      Sched Dep Time   Dep Airport   Dep Delay
041        15-dec-2016   09:20            YYZ           253:26
386        15-dec-2016   15:20            YYZ
415        15-dec-2016   19:15            YYZ           0:02
415        15-dec-2016   19:15            YYZ           0:02

Date         Airport   Wind        Precipitation   Precipitation Type
15/12/2016   Pearson   NNE 5 MPH   15 cm           Snow
15/12/2016   Dulles    SW 18 MPH   7 mm            Rain
15/12/2016   Reagan    SW 18 MPH   7 mm            Rain
Are there any fields we can
decompose to get more
information?
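One answer: the departure date can be decomposed into features like month and day of week, which may correlate with delays. A minimal pandas sketch (the column names are assumptions for illustration):

```python
import pandas as pd

flights = pd.DataFrame({"DepDate": ["15-dec-2016", "16-dec-2016"]})
dep = pd.to_datetime(flights["DepDate"], format="%d-%b-%Y")

# Decompose one raw field into several candidate features.
flights["DepMonth"] = dep.dt.month          # winter months may mean weather delays
flights["DepDayOfWeek"] = dep.dt.dayofweek  # 0 = Monday ... 6 = Sunday

print(flights)
```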
Which fields (“features”) help us predict if a
picture contains a dog or cat (the “label”)?
• Pixel1Color, Pixel2Color, Pixel3Color,….Pixel9036Color
Break out the deep learning
• GPUs
• Storage
Layers: Pixel → Edge → Shape → Cat
Step 5: Pick the right algorithm
What are you trying to predict?
Prediction                   Algorithm           Example
Predict continuous values    Regression          Predict what time a flight will land
Predict what category        Classification      Predict if a flight will be late or on time
something falls into
Detect unusual data points   Anomaly detection   Predict if a credit card transaction is
                                                 fraudulent; predict if a runner cheated
                                                 on a marathon
Supervised vs Unsupervised

Type           Definition                           Example
Supervised     You have existing data with known    When I try to predict if a flight next week
               inputs and known outputs to help     will be late, I know which flights have been
               make predictions                     late in the past
Unsupervised   You have input data but no known     When I try to predict if a runner cheated on
               outcomes in your data                a marathon, I don’t have a history of runners
                                                    who cheated in the past
Step 6: Train your model
Once you have data and your algorithm
you can train and create your predictive
model
Python
R
scikit-learn (based on NumPy, SciPy, and matplotlib)
Azure Machine Learning Service
Cognitive Toolkit/Tensorflow (deep learning)
There are lots of tools to choose from
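To make the training step concrete, here is a minimal scikit-learn sketch that fits a classifier to predict late vs. on-time. The toy feature matrix (scheduled departure hour, forecast snowfall in cm) and its labels are invented for illustration, not the airline's real data:

```python
from sklearn.linear_model import LogisticRegression

# Toy training data: [scheduled departure hour, forecast snowfall in cm]
X = [[9, 15], [15, 0], [19, 0], [8, 10], [17, 12], [7, 0]]
# Known outcomes from past flights: 1 = late, 0 = on time
y = [1, 0, 0, 1, 1, 0]

# Classification algorithm (Step 5) trained on past data (Step 6).
model = LogisticRegression()
model.fit(X, y)

# Ask about next week's 9 AM flight with 12 cm of snow forecast.
print(model.predict([[9, 12]]))
```

Swapping in another algorithm from the table in Step 5 is usually a one-line change, which is part of scikit-learn's appeal.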
Step 7: Test your model
You need to know the accuracy of your
model!

Flights with known outcomes:
Flt #406, Air Canada, April 1, 2016, 3:15 PM, YYZ-YVR, Late: No
Flt #351, West Jet, April 12, 2016, 8:01 AM, YOW-YYZ, Late: No
Flt #141, Delta, Sep 25, 2016, 1:45 PM, HND-SEA, Late: Yes

Strip the labels and feed the inputs to the predictive/trained model:
Flt #406, Air Canada, April 1, 2016, 3:15 PM, YYZ-YVR
Flt #351, West Jet, April 12, 2016, 8:01 AM, YOW-YYZ
Flt #141, Delta, Sep 25, 2016, 1:45 PM, HND-SEA

Compare the model’s predictions to the known outcomes:
Flt #406, Air Canada, April 1, 2016, 3:15 PM, YYZ-YVR, Late: Yes (wrong)
Flt #351, West Jet, April 12, 2016, 8:01 AM, YOW-YYZ, Late: No (correct)
Flt #141, Delta, Sep 25, 2016, 1:45 PM, HND-SEA, Late: Yes (correct)

2 of 3 correct = 66.6% accuracy
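The arithmetic behind that 66.6% can be reproduced with scikit-learn's accuracy_score; the labels below mirror the three test flights on this slide:

```python
from sklearn.metrics import accuracy_score

# Known outcomes for the three held-back test flights.
actual    = ["No", "No", "Yes"]   # Flt #406, Flt #351, Flt #141
# What the trained model predicted for the same flights.
predicted = ["Yes", "No", "Yes"]  # it got Flt #406 wrong

print(accuracy_score(actual, predicted))
```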
What do I do if my accuracy is lousy?
Go back to step 1
For additional information
•Appendix A
What is Hadoop anyway?
•Appendix B
What cloud tools exist to help with data science?
•Appendix C Lexicon
THANK YOU
QUESTIONS?
Susan Ibach
susan@aigaming.com
@HockeyGeekGirl
Appendix A –
What is Hadoop anyway?
It’s a tool for analyzing Big Data
Hadoop is an open-source framework
•Based on Java
•Distributed processing of large datasets across
clusters of computers
•Distributed storage and computation across clusters
of computers
•Scales from single server to thousands of machines
Hadoop components
• Hadoop Common – java libraries used by Hadoop to abstract the filesystem and
OS
• Hadoop YARN – framework for job scheduling and managing cluster resources
• HDFS – distributed File system for access to application data (distributed storage)
• Based on Google File System (GFS)
• Hadoop can run on other distributed file systems (e.g. local FS, HFTP, S3) but usually uses HDFS
• A file in HDFS is split into blocks which are stored on DataNodes; the NameNode maps blocks to
DataNodes
• MapReduce – the programming model for parallel processing of large data sets
(distributed computation)
• Map data into key/value pairs (tuples)
• Reduce data tuples into smaller pairs of tuples
• Input/output stored in file system
• JobTracker and TaskTrackers schedule tasks, monitor them, and re-execute failed tasks
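The map and reduce stages above can be illustrated without a cluster; this plain-Python word count mimics what Hadoop does at scale (map to key/value tuples, shuffle by key, reduce each group):

```python
from itertools import groupby
from operator import itemgetter

lines = ["the quick brown fox", "the lazy dog", "the fox"]

# Map: emit a (word, 1) tuple for every word on every line.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group the tuples by key (Hadoop does this between the stages).
mapped.sort(key=itemgetter(0))

# Reduce: collapse each group of tuples into a single (word, count) pair.
counts = {word: sum(count for _, count in group)
          for word, group in groupby(mapped, key=itemgetter(0))}

print(counts)
```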
Hadoop components
• Hive – similar to SQL, hides complexity of Map Reduce
programming, generates a MapReduce job
• Pig - (Pig latin) – High level data flow language for
parallel computation & ETL
• Hbase - Scalable distributed non-relational database that
supports structured data storage for large tables (billions
of rows X millions of columns)
• Spark - compute engine for Hadoop data used for ETL,
machine learning, stream /real-time processing and
graph computation (gradually replacing MapReduce
because it is faster for iterative algorithms)
How does Hadoop work?
• User submits a job to Hadoop
• Location of input and output files
• Java classes containing map and reduce functions
• Job configuration parameters
•Hadoop submits the job to the JobTracker, which
distributes the job to the worker nodes, schedules
tasks and monitors them
•TaskTrackers execute the tasks and the output is
stored in output files on the file system
Why is it popular?
• Allows users to quickly write and test distributed systems
• It automatically distributes data and work across the
machines and utilizes the parallelism of CPU cores
• Does not rely on hardware for fault tolerance and high
availability
• Servers can be added or removed dynamically
• It’s open source and runs on many platforms since it
is Java-based
Appendix B –
What cloud tools exist to help with
data science?
Microsoft Azure

Cortana, Bot Framework | Interact with it | Type messages, talk, send images or video and get answers
Power BI | See it | Visualize data with heat maps, graphs and charts
Stream Analytics | Stream it | Monitor data as it arrives and act on it in real time
Azure Machine Learning, Microsoft R Server | Learn it | Analyze past data to find patterns you can use to predict outcomes for new data
SQL Data Warehouse, SQL DB, Document DB, Blob storage | Relate it | Store related data together using the best data store for the job
Data Lake | Store it | A data store that can handle data of any size, shape or speed
Event Hubs | Collect it | Collect data from sources such as IoT sensors that send large amounts of data over small amounts of time
Data Factory | Move it | Move data from one place to another, transforming it as it moves
Data Catalog | Document it | Document all your data sources
Cognitive Services | Use it | Pre-trained models available for use
HD Insight & Azure Databricks | Scale it | Create clusters for Hadoop or Spark (Databricks for Spark)
Google Cloud Platform

Cloud Dataprep | Prepare it | Prepare your data for analysis
BigQuery ML, BigQuery GIS | Train it | Train machine learning models
BigQuery, GCP data lake | Store it | Data warehouse
Cloud Dataproc | Scale it | Spin up clusters for Hadoop and Spark
Cloud Pub/Sub, Cloud Dataflow | Stream it | Ingest and process events in real time
Cloud Storage | Store it | A data store that can handle data of any size, shape or speed
Prepackaged AI solutions | Use it | Pre-trained models available for use
IBM

Analytics Engine | Scale it | Build and deploy clusters for Hadoop and Spark
InfoSphere Information Server on Cloud | Access it | Extract, transform & load data, plus data standardization
Streaming Analytics | Stream it | Monitor data as it arrives and act on it in real time
IBM Watson | Train it or Use it | Train your own models or leverage pre-trained models for features such as speech to text, natural language processing, and image analysis
Watson IoT Platform | Collect it | Connect devices and analyze the associated data
Deep Learning | Analyze it | Design and deploy deep learning models using neural networks
IBM Data Refinery | Prepare it | Data preparation tool
AWS

Data lakes, Redshift | Store it | Store your data
Lake Formation | Move it | Get data into your data lake
Streaming Analytics | Stream it | Monitor data as it arrives and act on it in real time
Amazon Kinesis, IoT Core | Collect it | Collect, process and analyze real-time data, including data from IoT devices
Glue | Document it | Create a catalog of your data that is searchable and queryable by users
Athena | Analyze it | Analyze your data
EMR, Deep Learning AMIs | Scale it | Scale using Hadoop and Spark
QuickSight | See it | Visualizations and dashboards
Application Services | Use it | Pre-trained models ready for use
Deep Learning AMIs, SageMaker | Train it | Tools to help you build and train models
Appendix C Lexicon
Buzzwords and Tools
Amazon Redshift – Data warehouse infrastructure
Ambari – web-based tool for managing Apache Hadoop clusters: provision, manage and monitor your Hadoop clusters
Avro – a data serialization system (like XML or JSON)
Apache Hadoop – distributed storage and processing of big data. Splits files into large blocks and distributes them across nodes in a cluster. It then transfers the
packaged code to the nodes to process the data in parallel, for faster processing
Apache Flink – open source stream processing framework to help you move data from your sensors and applications to your data stores and applications
Apache Storm – open source realtime computation system. Storm does for realtime processing what Hadoop does for batch processing
Azure DataBricks – platform for managing and deploying Spark at scale
Azure Data Lake Analytics – allows you to write queries against data in a wide variety of data stores
Azure Notebooks – basically Jupyter notebooks on Azure, supporting Python, F# and R
Azure SQL Data Warehouse – Data warehouse infrastructure
Caffe – Deep learning framework
Cassandra –NoSQL Database
Cognitive Toolkit (CNTK) – Microsoft’s Deep learning toolkit (competes with Google Tensorflow) for training machine learning models. Provides APIs you call with
Python
CouchDB – NoSQL Database
Chukwa – Data collection system for managing large distributed systems
H2O – Open source deep learning platform (competes with Tensorflow and Cognitive Toolkit)
Hadoop Distributed File System (HDFS) – the distributed file system used by Hadoop; great for horizontal scalability (does not support insert, update & delete)
Hadoop MapReduce – programming model used to process data, provides horizontal scalability
Hadoop YARN – platform for managing resources and scheduling in Hadoop clusters
HD Insight –Microsoft Azure service used to spin up Hadoop clusters to help analyze big data with Hadoop, Spark, Hbase, R-Server, Storm, etc..
Hive – Data warehouse infrastructure
Hbase – Scalable distributed non-relational database that supports structured data storage for large tables (billions of rows X millions of
columns)
Jupyter Notebooks – web applications that allow you to create shareable interactive documents containing text, equations, code, and data
visualizations. Very useful for data scientists to explore and manipulate data sets and to share results. You can use them for data cleaning
and transformation, machine learning, and data visualization; they support Python, R, Julia, and Scala. You can use Jupyter notebooks on a
Spark cluster.
Kafka – distributed publisher subscriber messaging system. Used in the extraction step of ETL for high volume high velocity data flow
MapReduce – a two stage algorithm for processing large datasets. Data is split across a Hadoop cluster, the map function breaks data in key
value pairs (e.g. individual words in a text file), the Reduce function combines the mapped data (e.g. total counts of each word). MapReduce
functions can be written in Java, Python, C# or Pig
MATLAB – tools for machine learning – build models
MongoDB – NoSQL Database
MySQL – relational database
Scikit-learn – tools for data mining and data analysis built on Python (NumPy, SciPy and matplotlib)
Spark – compute engine for Hadoop data used for ETL, machine learning, stream processing and graph computation (starting to replace
MapReduce because Spark is faster)
Sqoop – Used for transferring data between structured databases and Hadoop
Tensorflow – Google’s deep learning toolkit. An open source software library for training machine learning models, allows you to deploy
computation across one or more CPUs or GPUS with a single API. Tensorflow provides APIs you call from Python
Torch – computing framework for Machine learning algorithms that puts GPUs first (good for deep learning)
TPU – Tensor Processing Unit. Custom-built ASIC designed for high performance when running models rather than training them.
Second-generation Google TPUs are available through Google Compute Engine
Tez – data flow programming framework built on YARN, runs projects like Hive and Pig, starting to replace MapReduce as the execution
engine on Hadoop because it can process data in a single job instead of multiple jobs
ZooKeeper – high performance coordination service for distributed applications
Programming languages and libraries

Scala – libraries and tools for performing data analysis
Python –
Pandas – for exploring data and data preparation (e.g. missing values, joins, string manipulation)
NumPy – fundamental package for scientific computing with Python
SciPy – numerical routines for numerical integration and optimization
matplotlib – for graphing, charting and visualizing data sets or query results
Keras – deep learning library for building your own neural networks
R – language for statistics (linear and nonlinear modelling, classification, clustering) and graphics
Julia – numerical computing language that supports parallel execution
Mahout – scalable machine learning and data mining library
Pig (Pig Latin) – high-level data flow language for parallel computation & ETL
HiveQL – similar to SQL; hides the complexity of MapReduce programming by generating a MapReduce job
U-SQL – query language used by Azure Data Lake to query across data sources
  • 1. So your boss wants you to learn data science Susan Ibach susan@aigaming.com @HockeyGeekGirl
  • 2. Data Science has become a buzzword I THINK WE NEED TO DO DATA SCIENCE YOUR DATA SCIENCE
  • 3. When your boss walks up to you and says we need to do data science, where do you start? PLATFORM TO USE DATA SCIENCE BIG DATA AI ML
  • 4. What is a data scientist? Advanced math skills Subject Matter Expertise Data Engineering skills
  • 5. Follow the 7 Steps to data science success 1 2 3 4 5 6 7
  • 6. Step 1: Identify your problem and data to define the problem 1
  • 7. What insights might help solve/define the problem? An airline wants to prevent flight delays
  • 8. Different insights require different tools SELECT COUNT(*) FROM FLIGHTS WHERE ACTUAL_ARR_TIME > SCHED_ARR_TIME SELECT COUNT(*) FROM FLIGHTS WHERE ACTUAL_ARR_TIME > SCHED_ARR_TIME BETWEEN 1997 and 2017
  • 9. Data science tools include Data Mining gain insights from data • Those who bought this also bought • Keyword extraction Machine Learning make predictions • Who will need hospitalization from the flu? • How many copies of this book will I sell? Deep Learning For complex data processed in layers • Is there a bird in this photo? • Will this person get cancer?
  • 10. Do we need Artificial Intelligence? •AI is when a computer completes a task that normally requires human intelligence • Answering questions from a customer • Recognizing the content of a photo • Understanding human speech •We use data science to analyze and recognize patterns and responses so we can do AI
  • 11. Step 2: Collect data 1 2
  • 12. Which flights are most likely to be delayed next week? What data would help you determine:
  • 13. Relational databases BLOB storage NoSQL databases Data warehouses Flat Files Open source data Sensors Where do I get all that data?
  • 14. BIG DATA When does data become “big data”? High Volume High Velocity High Variety
  • 15. Step 3: Prepare data 1 2 3
  • 16. Your data will need clean-up/prep Flight # Dep Date Sched Dep Time Dep Airport Dep Delay 041 15-dec-2016 09:20 YYZ 253:26 386 15-dec-2016 15:20 YYZ 415 15-dec-2016 19:15 YYZ 0:02 415 15-dec-2016 19:15 YYZ 0:02 Date Airport Wind Precipitation Precipitation Type 15/12/2016 Pearson NNE 5 MPH 150 mm Snow 15/12/2016 Dulles SW 18 MPH 7 mm Rain 15/12/2016 Reagan SW 18 MPH 7 mm Rain Missing Values Duplicate rows Different data formats Decomposition Outliers Scaling
  • 17. Start with what you already know • Excel, SQL Write your own Code • Python Pandas library, R Third party products Experian, Paxata, Alteryx, SAP Lumira, Teradata Data Lab, Knowledge Works, Datameer What tools might you use for data prep?
  • 18. If you have Big Data •Preparing and pulling together your data will require a LOT of storage and processing power
  • 19. Step 4: Identify the data that influences outcomes 1 2 3 4
  • 20. Which fields “features” might helps us predict if a flight will be late “label”? Flight # Dep Date Sched Dep Time Dep Airport Dep Delay 041 15-dec-2016 09:20 YYZ 253:26 386 15-dec-2016 15:20 YYZ 415 15-dec-2016 19:15 YYZ 0:02 415 15-dec-2016 19:15 YYZ 0:02 Date Airport Wind Precipitation Precipitation Type 15/12/2016 Pearson NNE 5 MPH 15 cm Snow 15/12/2016 Dulles SW 18 MPH 7 mm Rain 15/12/2016 Reagan SW 18 MPH 7 mm Rain Are there any fields we can decompose to get more information?
  • 21. Which fields “features” help us predict if a picture contains a dog or cat “label”? • Pixel1Color, Pixel2Color, Pixel3Color,….Pixel9036Color
  • 22. Break out the deep learning GPUs Storage Pixel Edge Shape Cat
  • 23. Step 5: Pick the right algorithm 1 2 3 4 5
  • 24. What are you trying to predict? Prediction Algorithm Example Predict continuous values Regression Predict what time a flight will land Predict what category something falls into Classification Predict if a flight will be late or on time Detect unusual data points Anomaly detection Predict if a credit card transaction is fraudulent Predict if a runner cheated on a marathon
  • 25. Supervised vs Unsupervised Type Definiton Example Supervised You have existing data with known inputs and known outputs to help make predictions When I try to predict if a flight next week will be late, I know what flights have been late in the past Unsupervised You have input data but no known outcomes in your data When I try to predict if a runner cheated on a marathon, I don’t have a history of runners who cheated in the past.
  • 26. Step 6: Train your model 1 2 3 4 5 6
  • 27. Once you have data and your algorithm you can train and create your predictive model
  • 28. Python R scikit-learn (based on NumPy, SciPy, and matplotlib) Azure Machine Learning Service Cognitive Toolkit/Tensorflow (deep learning) There are lots of tools to choose from
  • 29. Step 7: Test your model 1 2 3 4 5 6 7
  • 30. You need to know the accuracy of your model! Predictive/Trained Model Flt #406 Air Canada April 1, 2016 3:15 PM YYZ-YVR Late: No Flt #351 West Jet April 12, 2016 8:01 AM YOW-YYZ Late: No Flt #141 Delta Sep 25, 2016 1:45 PM HND-SEA Late: Yes Flt #406, Air Canada, April 1, 2016, 3:15 PM, YYZ-YVR Flt #351, West Jet, April 12, 2016, 8:01 AM, YOW-YYZ Flt #141, Delta, Sep 25, 2016, 1:45 PM HND-SEA Flt #406, Air Canada, April 1, 2016, 3:15 PM, YYZ-YVR, Late: Yes Flt #351, West Jet, April 12, 2016, 8:01 AM, YOW-YYZ, Late: No Flt #141, Delta, Sep 25, 2016, 1:45 PM HND-SEA, Late: Yes 66.6% accuracy
  • 31. What do I do if my accuracy is lousy? Go back to step 1
  • 32. For additional information •Appendix A What is Hadoop anyway? •Appendix B What cloud tools exist to help with data science? •Appendix C Lexicon
  • 34. Appendix A – What is Hadoop anyway? It’s a tool for analyzing Big Data
  • 35. Hadoop is an OS framework •Based on java •Distributed processing of large datasets across clusters of computers •Distributed storage and computation across clusters of computers •Scales from single server to thousands of machines
  • 36. Hadoop components • Hadoop Common – java libraries used by Hadoop to abstract the filesystem and OS • Hadoop YARN – framework for job scheduling and managing cluster resources • HDFS – distributed File system for access to application data (distributed storage) • Based on Google File System (GFS) • Hadoop can run on any distributed file system (FS, HFTP, FS, S3, FS) but usually HDFS • File in HDFS is split into blocks which are stored in DataNodes. Name nodes map blocks to datanodes • MapReduce – the query language for parallel processing of large data sets (distributed computation) • Map data into key/value pairs (tuples) • Reduce data tuples into smaller pairs of tuples • Input/output stored in file system • Job tracker and TaskTracker schedule, monitor tasks and re-execute failed tasks
  • 37. Hadoop components • Hive – similar to SQL, hides complexity of Map Reduce programming, generates a MapReduce job • Pig - (Pig latin) – High level data flow language for parallel computation & ETL • Hbase - Scalable distributed non-relational database that supports structured data storage for large tables (billions of rows X millions of columns) • Spark - compute engine for Hadoop data used for ETL, machine learning, stream /real-time processing and graph computation (gradually replacing MapReduce because it is faster for iterative algorithms)
  • 38. How does Hadoop work • User submits a job to Hadoop • Location of input and output files • Java classes containing map and reduce functions • Job configuration parameters •Hadoop submits the job to JobTracker which distributes the job to the slaves, schedules tasks and monitors them •Task trackers execute the task and output is stored in output files on the file system
  • 39. Why is it popular? • Allows user to quickly write and test distributed systems • It automatically distributes data and work across the machines and utilizes the parallelism of CPU cores • Does not rely on hardware for fault tolerance and high availability • Servers can be added or removed dynamically • It’s Open source and compatible on many platforms since it is Java based
  • 40. Appendix B – What cloud tools exist to help with data science?
  • 41. Cortana, Bot Framework Interact with it Type messages, talk, send images, or video and get answers Power BI See it Visualize data with heat maps, graphs and charts Stream Analytics Stream it Monitor data as it arrives and act on it in real time Azure Machine Learning, Microsoft R Server Learn it Analyze past data to learn by finding patterns you can use to predict outcomes for new data SQL Data Warehouse, SQL DB, Document DB, Blob storage Relate it Store related data together using the best data store for the job Data Lake Store it A data store that can handle data of any size, shape or speed Event Hubs Collect it Collect data from sources such as IoT sensors that send large amounts of data over small amounts of time Data Factory Move it Move data from one place to another, transform it as you move it Data Catalog Document it Document all your data sources Cognitive Services Use it Pre-trained models available for use HD Insight & Azure Data Bricks Scale it Create clusters for Hadoop or Spark (DataBricks for Spark) Microsoft Azure
  • 42. Google Cloud Platform
  • Cloud Dataprep – Prepare it: prepare your data for analysis
  • BigQuery ML, BigQuery GIS – Train it: train machine learning models
  • BigQuery, GCP data lake – Store it: data warehouse
  • Cloud Dataproc – Scale it: spin up clusters for Hadoop and Spark
  • Cloud Pub/Sub, Cloud Dataflow – Stream it: ingest events in real time
  • Cloud Storage – Store it: a data store that can handle data of any size, shape, or speed
  • Prepackaged AI solutions – Use it: pre-trained models available for use
  • 43. IBM
  • Analytics Engine – Scale it: build and deploy clusters for Hadoop and Spark
  • InfoSphere Information Server on Cloud – Access it: extract, transform & load data, plus data standardization
  • Streaming Analytics – Stream it: monitor data as it arrives and act on it in real time
  • IBM Watson – Train it or use it: train your own models or leverage pre-trained models for features such as speech to text, natural language processing, and image analysis
  • Watson IoT Platform – Collect it: connect devices and analyze the associated data
  • Deep Learning – Analyze it: design and deploy deep learning models using neural networks
  • IBM Data Refinery – Prepare it: data preparation tool
  • 44. AWS
  • Data lakes, Redshift – Store it: store your data
  • Lake Formation – Move it: get data into your data lake
  • Streaming analytics – Stream it: monitor data as it arrives and act on it in real time
  • Amazon Kinesis, IoT Core – Collect it: collect, process, and analyze real-time data, including data from IoT devices
  • Glue – Document it: create a catalog of your data that is searchable and queryable by users
  • Athena – Analyze it: analyze your data
  • EMR, Deep Learning AMIs – Scale it: scale using Hadoop and Spark
  • QuickSight – See it: visualizations and dashboards
  • Application Services – Use it: pre-trained models ready for use
  • Deep Learning AMIs, SageMaker – Train it: tools to help you build and train models
  • 46. Amazon Redshift – data warehouse infrastructure
  • Ambari – web-based tool for managing Apache Hadoop clusters: provision, manage, and monitor your Hadoop clusters
  • Avro – a data serialization system (like XML or JSON)
  • Apache Hadoop – distributed storage and processing of big data. Splits files into large blocks and distributes them across nodes in a cluster, then transfers the packaged code to those nodes to process the data in parallel, for faster processing
  • Apache Flink – open source stream processing framework to help you move data from your sensors and applications to your data stores and applications
  • Apache Storm – open source real-time computation system. Storm does for real-time processing what Hadoop does for batch processing
  • Azure Databricks – platform for managing and deploying Spark at scale
  • Azure Data Lake Analytics – allows you to write queries against data in a wide variety of data stores
  • Azure Notebooks – essentially Jupyter notebooks on Azure, supporting Python, F#, and R
  • Azure SQL Data Warehouse – data warehouse infrastructure
  • Caffe – deep learning framework
  • Cassandra – NoSQL database
  • Cognitive Toolkit (CNTK) – Microsoft's deep learning toolkit for training machine learning models (competes with Google's TensorFlow). Provides APIs you call with Python
  • CouchDB – NoSQL database
  • Chukwa – data collection system for managing large distributed systems
  • H2O – open source deep learning platform (competes with TensorFlow and Cognitive Toolkit)
  • Hadoop Distributed File System (HDFS) – the distributed file system used by Hadoop; great for horizontal scalability (does not support insert, update & delete)
  • Hadoop MapReduce – programming model used to process data; provides horizontal scalability
  • Hadoop YARN – platform for managing resources and scheduling in Hadoop clusters
  • HD Insight – Microsoft Azure service used to spin up Hadoop clusters to help analyze big data with Hadoop, Spark, HBase, R Server, Storm, etc.
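Avro has its own binary format and schema files (via the avro library), but the core idea of any serialization system is the same: turn an in-memory record into bytes and back. A rough stdlib-only illustration using JSON, with an invented flight record:

```python
import json

# A hypothetical flight record to serialize (field names are invented)
record = {"flight": "041", "dep_airport": "YYZ", "dep_delay_minutes": 253}

# Serialize: turn the in-memory record into bytes that can be stored or sent
encoded = json.dumps(record).encode("utf-8")

# Deserialize: reconstruct the record from the bytes
decoded = json.loads(encoded.decode("utf-8"))
print(decoded["dep_airport"])  # → YYZ
```

Avro adds a schema on top of this, so readers and writers agree on field names and types, and the binary encoding stays compact.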
  • 47. Hive – data warehouse infrastructure
  • HBase – scalable, distributed non-relational database that supports structured data storage for large tables (billions of rows X millions of columns)
  • Jupyter Notebooks – web applications that let you create shareable, interactive documents containing text, equations, code, and data visualizations. Very useful for data scientists to explore and manipulate data sets and to share results. You can use them for data cleaning and transformation, machine learning, and data visualization; they support Python, R, Julia, and Scala. You can use Jupyter notebooks on a Spark cluster
  • Kafka – distributed publisher-subscriber messaging system. Used in the extraction step of ETL for high-volume, high-velocity data flows
  • MapReduce – a two-stage algorithm for processing large data sets. Data is split across a Hadoop cluster; the map function breaks the data into key-value pairs (e.g. individual words in a text file), and the reduce function combines the mapped data (e.g. total counts of each word). MapReduce functions can be written in Java, Python, C#, or Pig
  • MATLAB – tools for machine learning and building models
  • MongoDB – NoSQL database
  • MySQL – relational database
  • Scikit-learn – tools for data mining and data analysis built on Python (NumPy, SciPy, and matplotlib)
  • Spark – compute engine for Hadoop data used for ETL, machine learning, stream processing, and graph computation (starting to replace MapReduce because Spark is faster)
  • Sqoop – used for transferring data between structured databases and Hadoop
  • TensorFlow – Google's deep learning toolkit. An open source software library for training machine learning models; lets you deploy computation across one or more CPUs or GPUs with a single API. TensorFlow provides APIs you call from Python
  • Torch – computing framework for machine learning algorithms that puts GPUs first (good for deep learning)
  • TPU – Tensor Processing Unit. Custom-built ASIC designed for high performance when running models rather than training them; second-generation Google TPUs are available through Google Compute Engine
  • Tez – data flow programming framework built on YARN that runs projects like Hive and Pig; starting to replace MapReduce as the execution engine on Hadoop because it can process data in a single job instead of multiple jobs
  • ZooKeeper – high-performance coordination service for distributed applications
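Kafka's publisher-subscriber model is worth seeing in miniature. Real Kafka clients come from libraries like kafka-python; the toy in-process sketch below (all class and topic names invented) just shows the pattern: publishers send to a named topic without knowing who is listening, and every subscriber gets its own copy.

```python
from queue import Queue

class Broker:
    """Toy in-process message broker: each subscriber gets its own queue per topic."""
    def __init__(self):
        self.topics = {}  # topic name -> list of subscriber queues

    def subscribe(self, topic):
        # Give this subscriber its own queue for the topic
        q = Queue()
        self.topics.setdefault(topic, []).append(q)
        return q

    def publish(self, topic, message):
        # Deliver the message to every subscriber of the topic
        for q in self.topics.get(topic, []):
            q.put(message)

broker = Broker()
alerts = broker.subscribe("flight-delays")
audit = broker.subscribe("flight-delays")
broker.publish("flight-delays", {"flight": "041", "delay_min": 253})
print(alerts.get()["flight"])  # → 041
```

Kafka adds the parts that matter at scale: topics are partitioned and replicated across brokers, and messages are persisted so consumers can replay them.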
  • 48. Programming languages and libraries
  • Scala – libraries and tools for performing data analysis
  • Python – general-purpose language with a rich data science ecosystem:
  • Pandas – for exploring and preparing data (e.g. missing values, joins, string manipulation)
  • NumPy – fundamental package for scientific computing with Python
  • SciPy – numerical routines for numerical integration and optimization
  • matplotlib – for graphing, charting, and visualizing data sets or query results
  • Keras – deep learning library for building your own neural networks
  • R – language for statistics (linear and nonlinear modelling, classification, clustering) and graphics
  • Julia – numerical computing language with performance close to C that supports parallel execution
  • Mahout – scalable machine learning and data mining library
  • Pig (Pig Latin) – high-level data flow language for parallel computation & ETL
  • HiveQL – similar to SQL; hides the complexity of MapReduce programming and generates a MapReduce job
  • U-SQL – data language used by Azure Data Lake to query across data sources
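The kind of data preparation pandas is used for, such as filling in missing values, can be sketched without any libraries. Below is a stdlib-only version of mean imputation on invented departure-delay data; in pandas the same step would typically be a one-liner along the lines of `fillna` with the column mean.

```python
from statistics import mean

# Invented departure-delay values in minutes; None marks a missing reading
delays = [12, None, 35, None, 7]

# Compute the mean of the values that are present...
observed = [d for d in delays if d is not None]
fill = mean(observed)  # (12 + 35 + 7) / 3 = 18

# ...and impute it wherever a value is missing
cleaned = [d if d is not None else fill for d in delays]
print(cleaned)  # → [12, 18, 35, 18, 7]
```

Mean imputation is only one of several strategies; depending on the data you might instead drop the rows, carry the last value forward, or flag the gaps for a subject matter expert.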

Editor's Notes

  • #8: What insights might help solve/define the problem? How many flights were late last year? How much money did we spend on flight delays? What are the most common causes of late flights? Is the number of late flights increasing or decreasing year over year? Which flights are most likely to be delayed next week?
  • #9: SQL Queries are used to extract data from relational databases Data Warehouses are used to aggregate historical data to see trends Dashboards are used to provide visualizations of important data
  • #10: Data mining to gain insights from data Those who bought this also bought Keyword extraction Machine Learning to make predictions by using algorithms to parse and learn from historical data Predict if this credit card was stolen based on the most recent transactions Deep learning to analyze data with a lot of different features Is there a bird in this photo? Will this person get cancer?
  • #13: Weather forecast Crew schedules Maintenance history Passenger information Airport information
  • #15: Big data is what we call data that is so big and complex that traditional data processing is inadequate (e.g. internet search, financial, genomics) High volume (amount of data) High variety (range of data types and sources) High velocity (speed of data in or out)
  • #17: Missing values Duplicate rows Different data formats Outliers Decomposition Aggregation Scaling
  • #19: You will need parallel processing and distributed storage Hadoop gives you distributed storage and processing across one or more servers You set up a cluster and run Hadoop on your cluster to abstract the hardware Numerous tools run on top of Hadoop to access the data and perform the processing (Hive, Spark, MapReduce, Pig)
  • #21: Flight Number Scheduled Departure time Flight distance Day of week Month Year
  • #23: Sometimes even a subject matter expert cannot identify the features You can use deep learning with neural networks to identify the significant features The more processing power the better! Cheap storage and GPUs enabled breakthroughs in deep learning