Data Operations Problems Created by Deep
Learning, and How to Fix Them!
Jim Scott
@kingmesal
2 © 2018 MapR Technologies, Inc. // MapR Confidential
Public Service Announcement
You may see Artificial Intelligence, Machine
Learning and Deep Learning used interchangeably
within this presentation. Please feel free to
mentally substitute the phrase of your choice if
it is more applicable to you.
I’m not trying to split hairs on terminology.
Thanks for understanding!
Terminology
[Diagram labels: Data Science, Artificial Intelligence, Machine Learning, Deep Learning]
Data Science
The use of algorithms to extract knowledge and insights from data in various forms.
Some Subfields: Statistics, Artificial Intelligence (AI), Computational Math
Artificial Intelligence (AI)
The simulation of intelligent human behavior for problem-solving and decision-making.
Some Subfields: Robotics, Natural Language Processing (NLP), Machine Learning
Machine Learning (ML)
The process by which machines are taught to make calculated suggestions and/or
predictions by examining large amounts of input data.
Some Subfields: Logistic Regression, Deep Learning, Reinforcement Learning
90% of the effort in successful
machine learning isn’t in the
training or model development…
It’s the logistics!
The Importance of Data Logistics
“Machine learning offers a fantastically powerful toolkit for
building complex systems quickly. … it is remarkably easy
to incur massive ongoing maintenance costs at the
system level when applying machine learning.”
Why?
Just getting the training data is hard:
● Which data? How to make it accessible? Multiple sources!
● New kinds of observations force restarts
● Requires a ton of domain knowledge
The myth of a single model:
● You cannot train just one
● You will have dozens of models, likely hundreds or more
● Handoff to new versions is tricky
● You need real runtime experience to be sure which model is better
Problem 1
Lack of support for the
Artificial Intelligence
Software Development Lifecycle
(AI-SDLC)
Building a Machine Learning Solution
1. Identify the data sources
2. Identify the tools
3. Write some code
4. Train a model
5. Test the model
6. Analyze the output of the tests
7. Repeat steps 3 through 6 until happy-ish
a. Maybe swap out a tool if you cannot achieve happiness
8. Figure out how to get this solution into production
Choosing the Best Machine Learning Tool
Most successful groups keep several “favorite” machine learning tools on hand:
● No single tool is the best in every situation
The most important tool is a platform that supports logistics well:
● Everything does not need to be done at the application level
● Lots of what matters can be handled at the platform level
A good design for the logistics can make a big difference
Problem 2
The deep learning
workload is only one
type of workload
Workloads
Massaging your data
● This will normally include cleansing, normalizing, and even optimizing data formats
for downstream consumption by GPU-based workloads
● Distributed compute is often the best approach for this type of activity due to the
volume of data and variety of data types
● Be sure to keep your original data, in case of mistakes
Separate your training and holdback data sets
● This split is based on the massaged data
Analyze model outputs to determine the quality of your models
● Especially valuable over time to know that the models are moving the right way
● Great candidate for a distributed workload
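One way to separate training and holdback data reproducibly is to hash a stable record identifier, so a record lands in the same set on every run. The sketch below is only an illustration: the field name `id`, the function names, and the 20% holdback fraction are assumptions, not something the talk prescribes.

```python
import hashlib


def assign_split(record_id: str, holdback_fraction: float = 0.2) -> str:
    """Deterministically assign a record to 'train' or 'holdback' by hashing its id."""
    digest = hashlib.sha256(record_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "holdback" if bucket < holdback_fraction else "train"


def split_records(records, holdback_fraction: float = 0.2):
    """Split already-massaged records (dicts with a hypothetical 'id' field)."""
    train, holdback = [], []
    for rec in records:
        if assign_split(str(rec["id"]), holdback_fraction) == "holdback":
            holdback.append(rec)
        else:
            train.append(rec)
    return train, holdback


if __name__ == "__main__":
    records = [{"id": i, "value": i * 0.5} for i in range(1000)]
    train, holdback = split_records(records)
    print(len(train), len(holdback))
```

Because the assignment depends only on the identifier, re-running data preparation does not move records between the two sets.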
Problem 3
Putting machine learning into
production is not quite the same
as other enterprise software
Gotchas with Making it to Production
● Ops-oriented people will not necessarily “get it” regarding modeling subtleties
● Data scientists will not necessarily “get it” regarding operational realities
● Therefore, modelers have to deliver self-contained models
● And, ops has to provide pre-wired structure
Handling Real-time
Stream instead of database as the shared “truth”
Image © 2016 Ted Dunning & Ellen Friedman from Chap 6 of O’Reilly book Streaming Architecture used with permission
Real-time can be started on
your schedule; that is the key
Problem 4
Running machine learning
models in more than
one location is tough
Streaming Isolates Services
With MapR, Geo-Distributed Data Appears Local
With MapR, Geo-Distributed Data Appears Local
Global Data Center
Regional Data Center
Features of Good Streaming
Persistent
● Messages stick around for other consumers
● Consumers don’t affect producers
● Consumer doesn’t have to be online when message arrives
Performant
● You should NEVER need to worry if a stream can keep up
Pervasive
● It is there whenever you need it, no need to deploy anything
● How much work is it to create a new file? Why harder for a stream?
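The "persistent" properties above can be illustrated with a toy in-memory log in which every consumer keeps its own read offset. This is a teaching sketch only, not the API of any real streaming system; the class name `PersistentStream` and its methods are hypothetical.

```python
class PersistentStream:
    """Toy append-only log: messages persist and each consumer tracks its own offset."""

    def __init__(self):
        self._log = []      # messages are never removed when read
        self._offsets = {}  # consumer name -> next index to read

    def publish(self, message):
        """Producers only append; they never wait for or know about consumers."""
        self._log.append(message)
        return len(self._log) - 1

    def poll(self, consumer, max_messages=10):
        """Return unread messages for this consumer and advance only its offset."""
        start = self._offsets.get(consumer, 0)
        batch = self._log[start:start + max_messages]
        self._offsets[consumer] = start + len(batch)
        return batch


if __name__ == "__main__":
    stream = PersistentStream()
    stream.publish("a")
    stream.publish("b")
    print(stream.poll("model-1"))        # reads both messages
    print(stream.poll("late-consumer"))  # a consumer that was offline still sees both
```

A consumer that subscribes later starts from the beginning of the log, which is what lets new models be added without touching the producer.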
Improving Machine Learning Logistics
Stream-first architecture is a powerful approach with surprisingly widespread
advantages
● Innovative technologies are emerging for streaming data
Microservices approach provides flexibility
● Streaming supports microservices (if done right)
Containers remove surprises
● Predictable environment for running models
Problem 5
Data dependencies cost more
than code dependencies,
a lot more!
Top Reason to Use a Streaming Architecture
Data dependencies cost more than code dependencies!
● Code dependencies are easy to track, because tracking them is a well-known and
well-practiced discipline
● The data may be unstable
Undeclared consumers can wreak havoc on your models
● Downstream users may create a data dependency on the data from your model
● Updates to your model may break their system, if they made an assumption about
the function of your model
● Who do you think will suffer?
First Look with Streams
Then Rendezvous
Faster Throughput Through Failure
Suppose we have one model that can handle 10,000 t/s @ 2ms
● But this isn’t the most accurate model. Not bad, but not the best.
And our champion model can handle 1,000 t/s @ 10ms
● Then imagine a burst of 2,000 t/s for several minutes
Champion can only evaluate half of all requests
● Should skip to keep up
● Fast model will cover for champion
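The arithmetic behind this slide can be written out. The sketch below assumes, as a simplification, that each model either scores a request or skips it, and that the fast model scores every request the champion skips whenever it has the capacity. The function name `plan_burst` and its return keys are hypothetical.

```python
def plan_burst(arrival_rate, champion_capacity, fast_capacity):
    """Estimate which share of requests each model answers during a burst.

    All rates are in transactions per second (t/s).
    """
    if arrival_rate <= 0:
        raise ValueError("arrival_rate must be positive")
    # Share of requests each model can keep up with on its own.
    champion_share = min(1.0, champion_capacity / arrival_rate)
    fast_share = min(1.0, fast_capacity / arrival_rate)
    answered = max(champion_share, fast_share)
    return {
        "champion": champion_share,
        # Requests the champion skipped but the fast model still scored.
        "fast_fallback": answered - champion_share,
        "unanswered": 1.0 - answered,
    }


if __name__ == "__main__":
    # Numbers from the slide: champion 1,000 t/s, fast model 10,000 t/s, burst of 2,000 t/s.
    print(plan_burst(2000, 1000, 10000))
```

With the slide's numbers, the champion answers half of the requests and the fast model covers the other half, so nothing goes unanswered.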
Rendezvous – Mainly for Making Decisions
Decisioning models
● Looking for a “right answer”
● Simpler than reinforcement learning
Examples
● Fraud detection
● Predictive analytics / market prediction
● Churn prediction (as in telecommunications)
● Yield optimization
● Deep learning in form of speech or image recognition, in some cases
Some Key Points
● Note that all models see identical inputs
● All models run in production setting
● All models send scores to same stream
● The rendezvous server decides which scores to ignore
● Roll forward, roll back, correlated comparison are all now trivial
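A minimal sketch of the selection step, assuming each score carries the name of the model that produced it and its latency. The `Score` fields, the preference list, and the deadline parameter are illustrative assumptions; the talk does not specify the rendezvous server's actual interface.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Score:
    request_id: str
    model: str
    value: float
    latency_ms: float


def rendezvous(scores: List[Score], preference: List[str],
               deadline_ms: float) -> Optional[Score]:
    """Pick the score from the most preferred model that answered before the deadline.

    Every other score is simply ignored; it remains in the scores stream for comparison.
    """
    on_time = {s.model: s for s in scores if s.latency_ms <= deadline_ms}
    for model in preference:
        if model in on_time:
            return on_time[model]
    return None  # no model answered in time


if __name__ == "__main__":
    scores = [
        Score("r1", "fast", 0.70, 2.0),
        Score("r1", "champion", 0.90, 10.0),
    ]
    print(rendezvous(scores, ["champion", "fast"], deadline_ms=15.0))
    print(rendezvous(scores, ["champion", "fast"], deadline_ms=5.0))
```

Under this design, rolling forward or back amounts to changing the `preference` list; no model has to be redeployed.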
Problem 6
Wash,
Rinse,
Repeat!
Concerns About Repeatability
Are you performing all of these steps in your AI-SDLC manually?
● Consider a workflow tool
○ e.g., Airflow, Kubeflow, Argo
Is all of your data in static files or will there be real-time data?
● Prepare for real-time in development to be ready for production
Version everything!
● I’m sorry, this isn’t a job for Git!
● Includes source data: static and real-time
● Also includes models and their output
● Ensures sanity checks
● Long-term performance analytics
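One possible way to tie source data, a model, and its output together is a manifest of content hashes. This is only a sketch of the idea, not the versioning mechanism the talk has in mind (a platform may provide snapshots instead); the function names and manifest keys are assumptions.

```python
import hashlib
import json


def content_version(data: bytes) -> str:
    """Derive a version identifier from the content itself."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(source_data: bytes, model_bytes: bytes, model_output: bytes) -> dict:
    """Record which data, model, and output belong together."""
    return {
        "source_data": content_version(source_data),
        "model": content_version(model_bytes),
        "model_output": content_version(model_output),
    }


if __name__ == "__main__":
    manifest = build_manifest(b"raw rows", b"model weights", b"scores")
    print(json.dumps(manifest, indent=2, sort_keys=True))
```

Any change to the input data produces a different identifier, so a later sanity check can tell whether a model is being compared against the data it was actually trained on.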
Quality & Reproducibility of Input Data is Important!
Recording raw-ish data is really a big deal
● Data as seen by a model is worth gold
● Data reconstructed later often has time-machine leaks
● Databases were made for updates, streams are safer
Raw data is useful for non-ML cases as well (think flexibility)
Decoy model records training data as seen by models under development &
evaluation
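A decoy can be sketched as a "model" that produces no score and only records each request exactly as it arrived. The class name, the JSON-lines format, and the file-like sink are assumptions made for illustration.

```python
import io
import json


class DecoyModel:
    """Looks like a model to the input stream but only archives what it sees."""

    def __init__(self, sink):
        self._sink = sink  # any file-like object opened for text writing

    def score(self, request: dict) -> None:
        # Record the request as seen at scoring time, one JSON document per line.
        self._sink.write(json.dumps(request, sort_keys=True) + "\n")
        return None  # a decoy never contributes a score


if __name__ == "__main__":
    archive = io.StringIO()
    decoy = DecoyModel(archive)
    decoy.score({"id": 1, "amount": 42.0})
    print(archive.getvalue(), end="")
```

Because the record is written at scoring time, the archived data cannot contain values that only became known later, which is the time-machine leak the slide warns about.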
A Quick Review
The Proxy Talks to the Outside World
The Input Stream Feeds All Models Identically
The Scores Stream Contains All Results
The Rendezvous Picks A Result
Results Return Via A Stream and Return Address
Problem 7
In the real world
conditions may
(will) change!
Not Such Bad Ideas
Keep models running “in the wings”
● Do not wait until conditions change to start building the next model
● Keep new short-history models ready to roll, some graybeards as well
Hot hand-off
● With rendezvous, stop ignoring the new best model
Deploy a canary server
● Keep an old model active as a reference
● If it was 90% correct, difference with any better model should be small
● Score distribution should be roughly constant
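The canary comparison can be sketched as a disagreement rate between the canary and a newer model scored on identical inputs. The 0.5 decision threshold and the function name are assumptions. For yes/no decisions, two models can only disagree when at least one of them is wrong, so a 90%-correct canary and a better model should disagree on at most about 20% of requests.

```python
def disagreement_rate(canary_scores, candidate_scores, threshold=0.5):
    """Fraction of requests where the two models reach different yes/no decisions."""
    if len(canary_scores) != len(candidate_scores):
        raise ValueError("both models must score the same requests")
    if not canary_scores:
        return 0.0
    differing = sum(
        (a >= threshold) != (b >= threshold)
        for a, b in zip(canary_scores, candidate_scores)
    )
    return differing / len(canary_scores)


if __name__ == "__main__":
    canary = [0.1, 0.9, 0.6, 0.4]
    candidate = [0.2, 0.8, 0.4, 0.3]
    print(disagreement_rate(canary, candidate))
```

A sudden jump in this rate suggests that either the new model or the incoming data has changed in a way worth investigating.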
Prepare for Scaling Up
Model Variety
● Multiple rendezvous frameworks for different tasks
Throughput
● Fast default models
● Partition input stream to allow parallel model evaluation
● Input batching
Extreme Volumes
● Cannibalize fancy models to run more fast/simple models
● Speed before beauty
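Partitioning and input batching can be sketched with a stable hash of a message key. The key field `id`, the partition count, and the helper names are assumptions for illustration; a real streaming platform would handle partition assignment itself.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition with a hash that is stable across processes."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_stream(messages, num_partitions, key_field="id"):
    """Group messages so each partition can feed its own model instance."""
    partitions = [[] for _ in range(num_partitions)]
    for msg in messages:
        partitions[partition_for(str(msg[key_field]), num_partitions)].append(msg)
    return partitions


def batches(items, size):
    """Yield fixed-size batches so a model can score several inputs per call."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


if __name__ == "__main__":
    messages = [{"id": i} for i in range(10)]
    for index, part in enumerate(partition_stream(messages, 3)):
        print(index, list(batches(part, 2)))
```

Messages with the same key always land in the same partition, so per-key ordering is preserved while partitions are evaluated in parallel.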
Making Improvements
1. Data + the right question + domain knowledge all matter!
2. Prioritize – put serious effort into infrastructure
a. DataOps requires more than just data science
3. Persist – use streams to keep data around
4. Measure – everything, and record it
5. Analyze – understand and see what is happening
6. Containerize – make deployment predictable, repeatable and easy
Problems 8, 9 and 10
Copying data from your
streaming system,
data lake,
and edge systems to your
machine learning environment
PLEASE, PLEASE, PLEASE…
...tell me you are not copying
all your data between these systems
Traditional Storage Vendor Solution
[Diagram: Edge, Core, and Cloud. Stages: Ingest, Unified Data Lake, Data Prep, Training + Testing, Production Training Cluster, Deployment. Hardware: storage appliances, servers, servers w/ GPU. Data moves between stages through repeated copy steps.]
Does NOT support real-time workflows
Doesn’t support distributed data preparation workloads
Hadoop Based Solutions
[Diagram: Edge, Core, and Cloud. Stages: Ingest, Unified Data Lake, Data Prep, Training + Testing, Production Training Cluster, Deployment. Components: Kafka (in-motion data), HDFS cluster, servers, servers w/ GPU. Data moves between stages through repeated copy steps.]
Minimum of seven non-homogeneous environments to administer and secure
Full data copies without versioning, lineage control or multi-master support
Where does the master copy of the data live?
MapR Solution
[Diagram: Data Fabric with a Global Namespace spanning Edge, Core, and Cloud. Stages: Data Prep, Training + Testing, Deployment.]
One homogeneous environment to manage and secure
Supports real-time processing with data protection, lineage, and versioning
Runs directly on DGX servers to create a unified DGX cluster
MapR AI + RAPIDS
[Diagram: Typical Training and Evaluation Workflow. Data Management: Document DB, Events, Structured Data, Unstructured Data. RAPIDS on Apache Arrow GPU Memory: cuDF (Analytics), cuML (Machine Learning), cuGRAPH (Graph Analytics). Workflow steps: Data Preparation, Training Data Set, Model Training, Evaluate/Visualize. Production Deployment: Inference, Events, Applications.]
How Data is Accessed
Advantages of the MapR Data Fabric
● Linear Scalability
● Architected for performance, scale,
and reliability
● Distributed metadata in the fabric
How Data is Stored
How Data is Accessed
● Distributed location support
● Multi-master Replication
● Location awareness
How Data is Distributed
● Capability to serve as a system of record
● Data security and governance within the
fabric
● Mixed Data access from multiple
protocols
● Distributed Multi-tenancy
● Global Namespace
● Integrated data streaming for AI
Architecture Matters
On-premise or Cloud Infrastructure
• Combines Distributed Compute, AI, HPC, and general purpose workloads
• MapR provides complementary data logistics to better manage and deploy deep learning
across the entire ecosystem
• Enables deployment agility with data management extending from on-premise,
to cross-cloud, to the edge
MapR Advantages for AI
Simplified administration and security models
● One and done - no need for a different model in each location
● GDPR “compliant”!
Scales linearly with customer needs
● No reason to create a bunch of separate clusters
Sustainability - All data, files, database and event streaming
● Both at-rest and in-motion
An enabling and flexible architecture
● Only way to bring distributed data and GPUs together
● Easy to meet customers’ needs
● Supports both Kubernetes and containers
Low cost of entry and linear cost of scaling
On-Premise, Cloud or Both
Same platform and architecture in all locations:
● On-premise works the same as the cloud
● Second cloud works the same as a first cloud
● Data mirroring between locations is built-in
● Real-time event management and lineage is built-in
○ Scale out streaming applications without rearchitecting them
● Kubernetes is a simple way to inject MapR storage and GPU support into a
container
○ Leverage resources anywhere with Global Namespace
○ Application portability across all locations, no rework required
Simplifying Model Development and Deployment
Complex data pipelines, large data volumes serving GPUs
● Mixed workloads - distributed data prep plus real-time
Simultaneous data and model versioning
● Data at-rest and in-motion
Model output lands in a stream
● Creates pluggable model flow
Works across on-premise and cloud infrastructures, simultaneously
The Key is Data Logistics
“90+% of Machine Learning Success is Data Logistics”
https://mapr.com/ebook/machine-learning-logistics
Need Help Solving Your Data Logistics Problems?
● Over 35 FREE on-demand training courses for AI and analytic development, data
engineering and administration
● Certification tracks for developers, administrators, and data scientists
● Expanded support portal and knowledge base
● Containerized clusters for free download, plus solution templates and code examples
for hands-on experience
https://mapr.com/training/
Thank you!

Big Data LDN 2018: DATA OPERATIONS PROBLEMS CREATED BY DEEP LEARNING, AND HOW TO FIX THEM!

  • 1. Data Operations Problems Created by Deep Learning, and How to Fix Them! Jim Scott @kingmesal
  • 2. 2 © 2018 MapR Technologies, Inc. // MapR Confidential Public Service Announcement You may see Artificial Intelligence, Machine Learning and Deep Learning used interchangeably within this presentation please feel free to mentally substitute the phrase of your choice if it is more applicable to you. I’m not trying to split hairs on terminology. Thanks for understanding!
  • 3. 3 © 2018 MapR Technologies, Inc. // MapR Confidential Terminology Data Science Artificial Intelligence Machine Learning Deep Learning Data Science Artificial Intelligence (AI) Machine Learning (ML) The use of algorithms to extract knowledge and insights from data in various forms in order to obtain insights. Some Subfields: Statistics, Artificial Intelligence (AI), Computational Math The simulation of intelligent human behavior for problem-solving and decision-making. Some Subfields: Robotics, Natural Language Processing (NLP), Machine Learning. The process by which machines are taught to make calculated suggestions and/or predictions by examining large amounts of input data. Some Subfields: Logistic Regression, Deep Learning, Reinforcement Learning
  • 4. 4 © 2018 MapR Technologies, Inc. // MapR Confidential 90% of the effort in successful machine learning isn’t in the training or model development… It’s the logistics!
  • 5. 5 © 2018 MapR Technologies, Inc. // MapR Confidential “Machine learning offers a fantastically powerful toolkit for building complex systems quickly.… it is remarkably easy to incur massive ongoing maintenance costs at the system level when applying machine learning.” The Importance of Data Logistics
  • 6. 6 © 2018 MapR Technologies, Inc. // MapR Confidential Why? Just getting the training data is hard: ● Which data? How to make it accessible? Multiple sources! ● New kinds of observations force restarts ● Requires a ton of domain knowledge The myth of a single model: ● You cannot train just one ● You will have dozens of models, likely hundreds or more ● Handoff to new versions is tricky ● You have to get runtime to be sure about which is better
  • 7. 7 © 2018 MapR Technologies, Inc. // MapR Confidential Problem 1 Lack of support for the Artificial Intelligence Software Development Lifecycle (AI-SDLC)
  • 8. 8 © 2018 MapR Technologies, Inc. // MapR Confidential Building a Machine Learning Solution 1. Identify the data sources 2. Identify the tools 3. Write some code 4. Train a model 5. Test the model 6. Analyze the output of the tests 7. Repeat steps 3 through 6 until happy-ish a. Maybe swap out a tool if your cannot achieve happiness 8. Figure out how to get this solution into production
  • 9. 9 © 2018 MapR Technologies, Inc. // MapR Confidential Choosing the Best Machine Learning Tool Most successful groups keep several “favorite” machine learning tools on hand: ● No single tool is the best in every situation The most important tool is a platform that supports logistics well: ● Everything does not need to be done at the application level ● Lots of what matters can be handled at the platform level A good design for the logistics can make a big difference
  • 10. 10 © 2018 MapR Technologies, Inc. // MapR Confidential Problem 2 The deep learning workload is only one type of workload
  • 11. 11 © 2018 MapR Technologies, Inc. // MapR Confidential Massaging your data ● This will normally include cleansing, normalizing, and even optimizing data formats for downstream consumption for GPU based workloads ● Distributed compute is often the best approach for this type of activity due to the volume of data and variety of data types ● Be sure to keep your original data, in case of mistakes Separate your training and holdback data sets ● This is based off of the massaged data Analyze model outputs to determine the quality of your models ● Especially valuable over time to know that the models are moving the right way ● Great candidate for a distributed workload Workloads
  • 12. 12 © 2018 MapR Technologies, Inc. // MapR Confidential Problem 3 Putting machine learning into production is not quite the same as other enterprise software
  • 13. 13 © 2018 MapR Technologies, Inc. // MapR Confidential Gotchas with Making it to Production ● Ops-oriented people will not necessarily “get it” regarding modeling subtleties ● Data scientists will not necessarily “get it” regarding operational realities ● Therefore, modelers have to deliver self-contained models ● And, ops has to provide pre-wired structure
  • 14. 14 © 2018 MapR Technologies, Inc. // MapR Confidential Handling Real-time Stream instead of database as the shared “truth” Image © 2016 Ted Dunning & Ellen Friedman from Chap 6 of O’Reilly book Streaming Architecture used with permission
  • 15. 15 © 2018 MapR Technologies, Inc. // MapR Confidential Real-time can be started on your schedule, that is the key
  • 16. 16 © 2018 MapR Technologies, Inc. // MapR Confidential Problem 4 Running machine learning models in more than one location is tough
  • 17. 17 © 2018 MapR Technologies, Inc. // MapR Confidential Streaming Isolates Services
  • 18. 18 © 2018 MapR Technologies, Inc. // MapR Confidential With MapR, Geo-Distributed Data Appears Local
  • 19. 19 © 2018 MapR Technologies, Inc. // MapR Confidential With MapR, Geo-Distributed Data Appears Local Global Data Center Regional Data Center
  • 20. 20 © 2018 MapR Technologies, Inc. // MapR Confidential Features of Good Streaming Persistent ● Messages stick around for other consumers ● Consumers don’t affect producers ● Consumer doesn’t have to be online when message arrives Performant ● You should NEVER need to worry if a stream can keep up Pervasive ● It is there whenever you need it, no need to deploy anything ● How much work is it to create a new file? Why harder for a stream?
  • 21. 21 © 2018 MapR Technologies, Inc. // MapR Confidential Improving Machine Learning Logistics Stream first architecture is a powerful approach with surprisingly widespread advantages ● Innovative technologies emerging to for streaming data Microservices approach provides flexibility ● Streaming supports microservices (if done right) Containers remove surprises ● Predictable environment for running models
  • 22. 22 © 2018 MapR Technologies, Inc. // MapR Confidential Problem 5 Data dependencies cost more than code dependencies, a lot more!
  • 23. 23 © 2018 MapR Technologies, Inc. // MapR Confidential Data dependencies cost more than code dependencies! ● Code dependencies are easy to track, because it is a well known and a well practiced discipline ● The data may be unstable Undeclared consumers can wreak havoc on your models ● Downstream users may create a data dependency on the data from your model ● Updates to your model may break their system, if they made an assumption on the function of your model ● Who do you think will suffer? Top Reason to Use a Streaming Architecture
  • 24. First Look with Streams
  • 25. Then Rendezvous
  • 26. Faster Throughput Through Failure. Suppose we have one model that can handle 10,000 t/s @ 2 ms ● But this isn’t the most accurate model. Not bad, but not the best. And our champion model can handle 1,000 t/s @ 10 ms ● Then imagine a burst of 2,000 t/s for several minutes. The champion can only evaluate half of all requests ● It should skip requests to keep up ● The fast model will cover for the champion
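The arithmetic behind this slide is small enough to write down. A minimal sketch, using the slide's figures; the function and variable names are made up for illustration:

```python
def champion_share(arrival_tps: float, champion_tps: float) -> float:
    """Fraction of requests the champion can score if it skips the rest."""
    return min(1.0, champion_tps / arrival_tps)


burst_tps = 2_000      # burst rate from the slide
champion_tps = 1_000   # champion model capacity
fast_tps = 10_000      # fast (less accurate) model capacity

share = champion_share(burst_tps, champion_tps)
fallback_share = 1.0 - share               # requests answered by the fast model
fast_model_keeps_up = fast_tps >= burst_tps

print(f"champion answers {share:.0%}, fast model covers {fallback_share:.0%}")
```

During the burst the champion answers half of the requests and the fast model, which has capacity to spare, answers the other half. Outside the burst, `champion_share` returns 1.0 and the champion answers everything.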
  • 27. Rendezvous – Mainly for Making Decisions. Decisioning models ● Looking for a “right answer” ● Simpler than reinforcement learning. Examples ● Fraud detection ● Predictive analytics / market prediction ● Churn prediction (as in telecommunications) ● Yield optimization ● Deep learning in the form of speech or image recognition, in some cases
  • 28. Some Key Points ● All models see identical inputs ● All models run in a production setting ● All models send scores to the same stream ● The rendezvous server decides which scores to ignore ● Roll forward, roll back, and correlated comparison are all now trivial
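One way to picture the rendezvous decision is as a preference list applied to whatever scores have arrived for a request. The sketch below is an assumption about how such a selection could look, not the implementation described in the talk; `rendezvous`, the model names, and the score values are all hypothetical.

```python
def rendezvous(scores: dict, preference: list):
    """Return (model, score) for the most-preferred model that answered.

    `scores` holds the results that reached the scores stream in time for
    one request; results from every other model are simply ignored.
    """
    for model in preference:
        if model in scores:
            return model, scores[model]
    raise LookupError("no model produced a score in time")


preference = ["champion", "fast"]

# Normal case: both models answered, so the champion's score is used.
normal = rendezvous({"fast": 0.70, "champion": 0.90}, preference)

# Burst case: the champion skipped this request, so the fast model covers.
burst = rendezvous({"fast": 0.70}, preference)

# Rolling forward to a new model is only a change to the preference list.
rolled = rendezvous({"fast": 0.70, "champion": 0.90, "challenger": 0.95},
                    ["challenger", "champion", "fast"])
```

Because every model keeps running and keeps writing to the same stream, rolling back is the reverse change to the list, with no redeployment.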
  • 29. Problem 6: Wash, Rinse, Repeat!
  • 30. Concerns About Repeatability. Are you performing all of these steps in your AI-SDLC manually? ● Consider a workflow tool ○ e.g. Airflow, Kubeflow, Argo, etc. Is all of your data in static files, or will there be real-time data? ● Prepare for real-time in development to be ready for production. Version everything! ● I’m sorry, this isn’t a job for Git! ● Includes source data: static and real-time ● Also includes models and their output ● Ensures sanity checks ● Long-term performance analytics
  • 31. Quality & Reproducibility of Input Data is Important! Recording raw-ish data is really a big deal ● Data as seen by a model is worth gold ● Data reconstructed later often has time-machine leaks ● Databases were made for updates; streams are safer. Raw data is useful for non-ML cases as well (think flexibility). A decoy model records training data as seen by models under development & evaluation
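A decoy can be as simple as a "model" that sits on the input stream next to the real models, writes down exactly what it was given, and never returns a score. The sketch below assumes JSON-serializable requests and uses an in-memory buffer in place of a stream or file; the `DecoyModel` class and the field names are illustrative only.

```python
import io
import json


class DecoyModel:
    """Looks like a model to the pipeline but only archives its inputs."""

    def __init__(self, sink):
        self.sink = sink  # any writable text stream (file, buffer, ...)

    def score(self, request: dict):
        # Record the request exactly as the real models see it.
        self.sink.write(json.dumps(request, sort_keys=True) + "\n")
        return None  # a decoy never contributes a score


archive = io.StringIO()
decoy = DecoyModel(archive)
decoy.score({"id": 1, "amount": 42.5, "country": "DE"})
decoy.score({"id": 2, "amount": 7.0, "country": "US"})

recorded = [json.loads(line) for line in archive.getvalue().splitlines()]
```

Training on `recorded` rather than on data rebuilt later from a database is what avoids the "time-machine leaks" the slide warns about.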
  • 32. A Quick Review
  • 33. The Proxy Talks to the Outside World
  • 34. The Input Stream Feeds All Models Identically
  • 35. The Scores Stream Contains All Results
  • 36. The Rendezvous Picks A Result
  • 37. Results Return Via A Stream and Return Address
  • 38. Problem 7: In the real world, conditions may (will) change!
  • 39. Not Such Bad Ideas. Keep models running “in the wings” ● Do not wait until conditions change to start building the next model ● Keep new short-history models ready to roll, some graybeards as well. Hot hand-off ● With rendezvous, stop ignoring the new best model. Deploy a canary server ● Keep an old model active as a reference ● If it was 90% correct, the difference from any better model should be small ● Score distribution should be roughly constant
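The canary idea can be checked with very little code: score the same requests with the old reference model and the current model, then count how often their decisions differ. This is a hedged sketch; the 0.5 decision threshold, the function name, and the sample scores are assumptions chosen for illustration.

```python
def disagreement_rate(canary_scores, new_scores, threshold=0.5):
    """Share of requests where the two models land on opposite sides
    of the decision threshold."""
    pairs = list(zip(canary_scores, new_scores))
    if not pairs:
        raise ValueError("need at least one pair of scores")
    differing = sum((a >= threshold) != (b >= threshold) for a, b in pairs)
    return differing / len(pairs)


canary = [0.1, 0.9, 0.4, 0.8]   # old model kept running as a reference
current = [0.2, 0.8, 0.6, 0.9]  # model currently preferred by the rendezvous

rate = disagreement_rate(canary, current)
print(f"models disagree on {rate:.0%} of requests")
```

A rate that stays small and steady matches the slide's expectation. A sudden jump suggests that the inputs or one of the models has changed and is worth investigating.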
  • 40. Prepare for Scaling Up. Model Variety ● Multiple rendezvous frameworks for different tasks. Throughput ● Fast default models ● Partition input stream to allow parallel model evaluation ● Input batching. Extreme Volumes ● Cannibalize fancy models to run more fast/simple models ● Speed before beauty
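Partitioning and batching, the two throughput items above, can be sketched as follows. The partitioning scheme (a CRC32 hash of a request key) and all names are assumptions made for the example; real streaming systems have their own partitioners.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Stable hash partitioning: the same key always maps to the same
    partition, so one model instance can own each partition."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def batches(items, size):
    """Yield consecutive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


requests = [f"account-{i}" for i in range(10)]
partitions = {p: [] for p in range(4)}
for key in requests:
    partitions[partition_for(key, 4)].append(key)

# Each partition can now be scored in parallel, a batch at a time.
first_partition_batches = list(batches(partitions[0], 2))
```

`zlib.crc32` is used instead of Python's built-in `hash` because the built-in is randomized per process for strings, which would send the same key to different partitions after a restart.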
  • 41. Making Improvements 1. Data + the right question + domain knowledge matters! 2. Prioritize – put serious effort into infrastructure a. DataOps requires more than just data science 3. Persist – use streams to keep data around 4. Measure – everything, and record it 5. Analyze – understand and see what is happening 6. Containerize – make deployment predictable, repeatable and easy
  • 42. Problems 8, 9 and 10: Copying data from your streaming system, data lake, and edge systems to your machine learning environment
  • 43. PLEASE, PLEASE, PLEASE… tell me you are not copying all your data between these systems
  • 44. Traditional Storage Vendor Solution (diagram labels: Edge, Core, Cloud; Ingest; Storage Appliance; Unified Data Lake; Data Prep; Training + Testing; Training Cluster; Production Deployment; Servers; Servers w/ GPU; repeated Copy steps between stages) ● Does NOT support real-time workflows ● Doesn’t support distributed data preparation workloads
  • 45. Hadoop Based Solutions (diagram labels: Edge, Core, Cloud; Ingest; Kafka in-motion; HDFS Cluster; Unified Data Lake; Data Prep; Training + Testing; Training Cluster; Production Deployment; Servers; Servers w/ GPU; repeated Copy steps between stages) ● Minimum of seven non-homogeneous environments to administer and secure ● Full data copies without versioning, lineage control or multi-master support ● Where does the master copy of the data live?
  • 46. MapR Solution (diagram labels: Data Fabric; Global Namespace; Edge, Core, Cloud; Data Prep; Training + Testing; Deployment) ● One homogeneous environment to manage and secure ● Supports real-time processing with data protection, lineage, and versioning ● Runs directly on DGX servers to create a unified DGX cluster
  • 47. MapR AI + RAPIDS: Typical Training and Evaluation Workflow (diagram labels: Data Management: Document DB, Events, Structured Data, Unstructured Data; RAPIDS: cuDF Analytics, cuML Machine Learning, cuGRAPH Graph Analytics, Apache Arrow, GPU Memory; Data Preparation; Training Data Set; Model Training; Evaluate/Visualize; Inference; Production Deployment; Applications)
  • 48. Advantages of the MapR Data Fabric (column headings: How Data is Stored, How Data is Distributed, How Data is Accessed) ● Linear Scalability ● Architected for performance, scale, and reliability ● Distributed metadata in the fabric ● Distributed location support ● Multi-master Replication ● Location awareness ● Capability to serve as a system of record ● Data security and governance within the fabric ● Mixed Data access from multiple protocols ● Distributed Multi-tenancy ● Global Namespace ● Integrated data streaming for AI
  • 49. Architecture Matters. On-premise or Cloud Infrastructure ● Combines Distributed Compute, AI, HPC, and general purpose workloads ● MapR provides complementary data logistics to better manage and deploy deep learning across the entire ecosystem ● Enables deployment agility with data management extending from on-premise, to cross-cloud, to the edge
  • 50. MapR Advantages for AI. Simplified administration and security models ● One and done: no need for a different model in each location ● GDPR “compliant”! Scales linearly with customer needs ● No reason to create a bunch of separate clusters. Sustainability: all data, files, database and event streaming ● Both at-rest and in-motion. An enabling and flexible architecture ● Only way to bring distributed data and GPUs together ● Easy to meet customers’ needs ● Supports both Kubernetes and containers. Low cost of entry and linear cost of scaling
  • 51. On-Premise, Cloud or Both. Same platform and architecture in all locations: ● On-premise works the same as the cloud ● A second cloud works the same as the first cloud ● Data mirroring between locations is built-in ● Real-time event management and lineage is built-in ○ Scale out streaming applications without rearchitecting them ● Kubernetes is a simple way to inject MapR storage and GPU support into a container ○ Leverage resources anywhere with Global Namespace ○ Application portability across all locations, no rework required
  • 52. Simplifying Model Development and Deployment. Complex data pipelines, large data volumes serving GPUs ● Mixed workloads: distributed data prep plus real-time. Simultaneous data and model versioning ● Data at-rest and in-motion. Model output lands in a stream ● Creates pluggable model flow. Works across on-premise and cloud infrastructures, simultaneously
  • 53. The Key is Data Logistics: “90+% of Machine Learning Success is Data Logistics” https://mapr.com/ebook/machine-learning-logistics
  • 54. Need Help Solving Your Data Logistics Problems? ● Over 35 FREE on-demand training courses for AI and analytic development, data engineering and administration ● Certification tracks for developers, administrators, and data scientists ● Expanded support portal and knowledge base ● Containerized clusters for free download, plus solution templates and code examples for hands-on experience https://mapr.com/training/