Big Data LDN 2018: DATA OPERATIONS PROBLEMS CREATED BY DEEP LEARNING, AND HOW TO FIX THEM!

Data Operations Problems Created by Deep
Learning, and How to Fix Them!
Jim Scott
@kingmesal

© 2018 MapR Technologies, Inc. // MapR Confidential
Public Service Announcement
You may see Artificial Intelligence, Machine
Learning, and Deep Learning used interchangeably
within this presentation. Please feel free to
mentally substitute the phrase of your choice if
it is more applicable to you.
I’m not trying to split hairs on terminology.
Thanks for understanding!

Terminology
[Diagram: Data Science > Artificial Intelligence > Machine Learning > Deep Learning, each a subfield of the one before]
Data Science
The use of algorithms to extract knowledge and insights from data in various forms.
Some Subfields: Statistics, Artificial Intelligence (AI), Computational Math
Artificial Intelligence (AI)
The simulation of intelligent human behavior for problem-solving and decision-making.
Some Subfields: Robotics, Natural Language Processing (NLP), Machine Learning
Machine Learning (ML)
The process by which machines are taught to make calculated suggestions and/or predictions by examining large amounts of input data.
Some Subfields: Logistic Regression, Deep Learning, Reinforcement Learning

90% of the effort in successful
machine learning isn’t in the
training or model development…
It’s the logistics!

The Importance of Data Logistics
“Machine learning offers a fantastically powerful toolkit for
building complex systems quickly. … it is remarkably easy
to incur massive ongoing maintenance costs at the
system level when applying machine learning.”

Why?
Just getting the training data is hard:
● Which data? How to make it accessible? Multiple sources!
● New kinds of observations force restarts
● Requires a ton of domain knowledge
The myth of a single model:
● You cannot train just one
● You will have dozens of models, likely hundreds or more
● Handoff to new versions is tricky
● You need production runtime to be sure which model is better

Problem 1
Lack of support for the
Artificial Intelligence
Software Development Lifecycle
(AI-SDLC)

Building a Machine Learning Solution
1. Identify the data sources
2. Identify the tools
3. Write some code
4. Train a model
5. Test the model
6. Analyze the output of the tests
7. Repeat steps 3 through 6 until happy-ish
a. Maybe swap out a tool if you cannot achieve happiness
8. Figure out how to get this solution into production

Choosing the Best Machine Learning Tool
Most successful groups keep several “favorite” machine learning tools on hand:
● No single tool is the best in every situation
The most important tool is a platform that supports logistics well:
● Not everything needs to be done at the application level
● Lots of what matters can be handled at the platform level
A good design for the logistics can make a big difference

Problem 2
The deep learning
workload is only one
type of workload

Workloads
Massaging your data
● This will normally include cleansing, normalizing, and even optimizing data formats for downstream consumption by GPU-based workloads
● Distributed compute is often the best approach for this type of activity due to the volume of data and variety of data types
● Be sure to keep your original data, in case of mistakes
Separate your training and holdback data sets
● This is based on the massaged data
Analyze model outputs to determine the quality of your models
● Especially valuable over time to know that the models are moving the right way
● Great candidate for a distributed workload
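Separating training and holdback data can be sketched as below. This is a minimal illustration, not the speaker's method: it assumes every record has a stable identifier, and the function name `assign_split` and the 20% holdback fraction are hypothetical choices. Hashing the identifier keeps each record in the same set on every rerun, which helps when the massaging step has to be repeated.

```python
import hashlib

def assign_split(record_id: str, holdback_fraction: float = 0.2) -> str:
    """Deterministically assign a record to 'train' or 'holdback'."""
    digest = hashlib.sha256(record_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "holdback" if bucket < holdback_fraction else "train"

# Hypothetical record ids standing in for the massaged data set.
records = [f"record-{i}" for i in range(1000)]
splits = {rid: assign_split(rid) for rid in records}
holdback_count = sum(1 for s in splits.values() if s == "holdback")
```

Because the assignment depends only on the identifier, new records can be added later without reshuffling the existing holdback set.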

Problem 3
Putting machine learning into
production is not quite the same
as other enterprise software

Gotchas with Making it to Production
● Ops-oriented people will not necessarily “get it” regarding modeling subtleties
● Data scientists will not necessarily “get it” regarding operational realities
● Therefore, modelers have to deliver self-contained models
● And, ops has to provide pre-wired structure

Handling Real-time
Stream instead of database as the shared “truth”
Image © 2016 Ted Dunning & Ellen Friedman from Chap 6 of O’Reilly book Streaming Architecture used with permission

Real-time can be started on
your schedule; that is the key

Problem 4
Running machine learning
models in more than
one location is tough

Streaming Isolates Services

With MapR, Geo-Distributed Data Appears Local
Global Data Center
Regional Data Center

Features of Good Streaming
Persistent
● Messages stick around for other consumers
● Consumers don’t affect producers
● Consumer doesn’t have to be online when message arrives
Performant
● You should NEVER need to worry if a stream can keep up
Pervasive
● It is there whenever you need it, no need to deploy anything
● How much work is it to create a new file? Why harder for a stream?
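The "persistent" properties above can be illustrated with a toy in-memory log. This is only a sketch under stated assumptions: the `Stream` class and its method names are hypothetical and do not correspond to any real streaming API. It shows that producers only append, that each consumer keeps its own read position, and that a consumer that was offline still sees every message.

```python
class Stream:
    """Append-only in-memory log; each consumer tracks its own offset."""

    def __init__(self):
        self._messages = []
        self._offsets = {}

    def produce(self, message):
        # Producers only append; they never wait on consumers.
        self._messages.append(message)

    def consume(self, consumer_name, max_messages=10):
        # Reading advances only this consumer's offset; messages stay put.
        start = self._offsets.get(consumer_name, 0)
        batch = self._messages[start:start + max_messages]
        self._offsets[consumer_name] = start + len(batch)
        return batch

stream = Stream()
for i in range(3):
    stream.produce(f"event-{i}")

early = stream.consume("early-consumer")
stream.produce("event-3")
late = stream.consume("late-consumer")    # was offline, still sees everything
more = stream.consume("early-consumer")   # picks up where it left off
```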

Improving Machine Learning Logistics
Stream-first architecture is a powerful approach with surprisingly widespread
advantages
● Innovative technologies are emerging for streaming data
Microservices approach provides flexibility
● Streaming supports microservices (if done right)
Containers remove surprises
● Predictable environment for running models

Problem 5
Data dependencies cost more
than code dependencies,
a lot more!

Top Reason to Use a Streaming Architecture
Data dependencies cost more than code dependencies!
● Code dependencies are easy to track, because tracking them is a well-known and well-practiced discipline
● The data may be unstable
Undeclared consumers can wreak havoc on your models
● Downstream users may create a data dependency on the data from your model
● Updates to your model may break their system, if they made an assumption about the behavior of your model
● Who do you think will suffer?

First Look with Streams

Then Rendezvous

Faster Throughput Through Failure
Suppose we have one model that can handle 10,000 t/s @ 2ms
● But this isn’t the most accurate model. Not bad, but not the best.
And our champion model can handle 1,000 t/s @ 10ms
● Then imagine a burst of 2,000 t/s for several minutes
Champion can only evaluate half of all requests
● Should skip to keep up
● Fast model will cover for champion

Rendezvous – Mainly for Making Decisions
Decisioning models
● Looking for a “right answer”
● Simpler than reinforcement learning
Examples
● Fraud detection
● Predictive analytics / market prediction
● Churn prediction (as in telecommunications)
● Yield optimization
● Deep learning in form of speech or image recognition, in some cases

Some Key Points
● Note that all models see identical inputs
● All models run in production setting
● All models send scores to same stream
● The rendezvous server decides which scores to ignore
● Roll forward, roll back, correlated comparison are all now trivial
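The key points above can be sketched as a small selection function. This is a minimal illustration, not the implementation described in the talk: the `Score` fields, the preference list, and the deadline parameter are all assumptions made for the example. Every model's score arrives on the same stream, and the rendezvous step returns the most preferred score that arrived in time and ignores the rest.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Score:
    request_id: str
    model: str
    value: float
    latency_ms: float

def rendezvous(scores: List[Score], preference: List[str],
               deadline_ms: float) -> Optional[Score]:
    """Pick one score per request: the most preferred model that met the deadline."""
    on_time = {s.model: s for s in scores if s.latency_ms <= deadline_ms}
    for model in preference:
        if model in on_time:
            return on_time[model]
    return None  # nothing arrived in time

# All models saw the same request and wrote to the same scores stream.
scores = [
    Score("req-1", "fast", 0.71, 2.0),
    Score("req-1", "champion", 0.83, 10.0),
    Score("req-1", "challenger", 0.80, 9.0),
]

chosen = rendezvous(scores, ["champion", "fast"], deadline_ms=15.0)
late = rendezvous(scores, ["champion", "fast"], deadline_ms=5.0)
promoted = rendezvous(scores, ["challenger", "champion", "fast"], deadline_ms=15.0)
```

In this sketch, rolling forward or back is only a change to the preference list; the models themselves keep running and scoring the same inputs.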

Problem 6
Wash,
Rinse,
Repeat!

Concerns About Repeatability
Are you performing all of these steps in your AI-SDLC manually?
● Consider a workflow tool
○ e.g. Airflow, Kubeflow, Argo
Is all of your data in static files or will there be real-time data?
● Prepare for real-time in development to be ready for production
Version everything!
● I’m sorry, this isn’t a job for Git!
● Includes source data: static and real-time
● Also includes models and their output
● Ensures sanity checks
● Long-term performance analytics
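One way to picture "version everything" is a manifest that ties a model to the exact data it was trained on. This is a hedged sketch only: the function names, manifest fields, and the use of content hashes are assumptions for illustration, and a real platform would more likely rely on snapshots or stream offsets than on hashing whole data sets.

```python
import hashlib
import json

def content_hash(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def build_manifest(training_data: bytes, model_bytes: bytes, params: dict) -> dict:
    """Record which data, model, and parameters belong together."""
    manifest = {
        "data_sha256": content_hash(training_data),
        "model_sha256": content_hash(model_bytes),
        "params": params,
    }
    # Hash the manifest itself so a run can be referred to by one short id.
    canonical = json.dumps(manifest, sort_keys=True).encode("utf-8")
    manifest["run_id"] = content_hash(canonical)[:12]
    return manifest

# Hypothetical inputs: two runs that differ only in one training value.
m1 = build_manifest(b"a,b\n1,2\n", b"model-v1", {"lr": 0.01})
m2 = build_manifest(b"a,b\n1,3\n", b"model-v1", {"lr": 0.01})
```

Any change to the data, the model, or the parameters produces a different `run_id`, which makes later sanity checks and long-term comparisons traceable.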

Quality & Reproducibility of Input Data is Important!
Recording raw-ish data is really a big deal
● Data as seen by a model is worth gold
● Data reconstructed later often has time-machine leaks
● Databases were made for updates, streams are safer
Raw data is useful for non-ML cases as well (think flexibility)
Decoy model records training data as seen by models under development &
evaluation
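The decoy idea can be sketched as a component that sits next to the real models and archives every input exactly as delivered. This is an illustration under assumptions: the `DecoyModel` class, its `score` method, and the list used as an archive are hypothetical stand-ins, with the list taking the place of an append-only stream or file.

```python
import json

class DecoyModel:
    """Looks like a model to the pipeline, but only archives what it is shown."""

    def __init__(self, archive):
        self.archive = archive  # any append-only sink; a list stands in here

    def score(self, request):
        # Record the input exactly as delivered, before any later updates.
        self.archive.append(json.dumps(request, sort_keys=True))
        return None  # a decoy never contributes a score

archive = []
decoy = DecoyModel(archive)

request = {"id": "req-1", "amount": 42.0}
decoy.score(request)
request["amount"] = 99.0  # a later update must not change the archived copy

replayed = [json.loads(line) for line in archive]
```

Training from `replayed` uses the data as the models saw it at scoring time, which avoids the time-machine leaks that come from reconstructing inputs out of a database that has since been updated.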

A Quick Review

The Proxy Talks to the Outside World

The Input Stream Feeds All Models Identically

The Scores Stream Contains All Results

The Rendezvous Picks A Result

Results Return Via A Stream and Return Address

Problem 7
In the real world
conditions may
(will) change!

Not Such Bad Ideas
Keep models running “in the wings”
● Do not wait until conditions change to start building the next model
● Keep new short-history models ready to roll, some graybeards as well
Hot hand-off
● With rendezvous, stop ignoring the new best model
Deploy a canary server
● Keep an old model active as a reference
● If it was 90% correct, difference with any better model should be small
● Score distribution should be roughly constant
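The canary checks above can be sketched with two simple measurements. This is a minimal illustration, not a prescribed method: the function names, the 0.5 decision threshold, and the sample scores are all made up for the example. The first function measures how often a candidate disagrees with the canary; the second measures how far the canary's own mean score drifts between two time windows.

```python
def disagreement_rate(canary_scores, candidate_scores, threshold=0.5):
    """Fraction of requests where the two models land on different sides of the threshold."""
    pairs = list(zip(canary_scores, candidate_scores))
    differing = sum((a >= threshold) != (b >= threshold) for a, b in pairs)
    return differing / len(pairs)

def mean_shift(reference_scores, recent_scores):
    """Absolute change in the mean score between two windows."""
    def mean(xs):
        return sum(xs) / len(xs)
    return abs(mean(recent_scores) - mean(reference_scores))

# Made-up scores for the same ten requests.
canary = [0.9, 0.2, 0.7, 0.1, 0.8, 0.3, 0.6, 0.4, 0.95, 0.05]
candidate = [0.8, 0.1, 0.6, 0.2, 0.9, 0.6, 0.7, 0.3, 0.90, 0.10]

rate = disagreement_rate(canary, candidate)
shift = mean_shift(canary[:5], canary[5:])
```

A disagreement rate that is much larger than expected, or a sudden jump in `shift`, suggests that either the new model or the incoming data has changed in a way worth investigating.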

Prepare for Scaling Up
Model Variety
● Multiple rendezvous frameworks for different tasks
Throughput
● Fast default models
● Partition input stream to allow parallel model evaluation
● Input batching
Extreme Volumes
● Cannibalize fancy models to run more fast/simple models
● Speed before beauty

Making Improvements
1. Data + the right question + domain knowledge matter!
2. Prioritize – put serious effort into infrastructure
a. DataOps requires more than just data science
3. Persist – use streams to keep data around
4. Measure – everything, and record it
5. Analyze – understand and see what is happening
6. Containerize – make deployment predictable, repeatable and easy

Problems 8, 9 and 10
Copying data from your
streaming system,
data lake,
and edge systems to your
machine learning environment

PLEASE, PLEASE, PLEASE…
...tell me you are not copying
all your data between these systems

Traditional Storage Vendor Solution
[Diagram labels: Edge, Core, Cloud; Ingest, Unified Data Lake, Data Prep, Training + Testing, Production Training Cluster, Deployment; Storage Appliance, Servers, Servers w/ GPU; “Copy” between stages]
● Does NOT support real-time workflows
● Doesn’t support distributed data preparation workloads

Hadoop Based Solutions
[Diagram labels: Edge, Core, Cloud; Ingest, Unified Data Lake, Data Prep, Training + Testing, Production Training Cluster, Deployment; HDFS Cluster, Servers, Servers w/ GPU; Kafka (in-motion); “Copy” between stages]
● Minimum of seven non-homogeneous environments to administer and secure
● Full data copies without versioning, lineage control or multi-master support
● Where does the master copy of the data live?

MapR Solution
[Diagram labels: Data Fabric, Global Namespace; Core, Cloud, Edge; Data Prep, Training + Testing, Deployment]
● One homogeneous environment to manage and secure
● Supports real-time processing with data protection, lineage, and versioning
● Runs directly on DGX servers to create a unified DGX cluster

MapR AI + RAPIDS
[Diagram: Typical Training and Evaluation Workflow. Labels: Data Management (Document DB, Events, Structured Data, Unstructured Data); RAPIDS (cuDF Analytics, cuML Machine Learning, cuGRAPH Graph Analytics, Apache Arrow GPU Memory); Data Preparation, Training Data Set, Model Training, Evaluate/Visualize; Production Deployment (Inference, Events, Applications)]

Advantages of the MapR Data Fabric
● Linear Scalability
● Architected for performance, scale,
and reliability
● Distributed metadata in the fabric
How Data is Stored
How Data is Accessed
● Distributed location support
● Multi-master Replication
● Location awareness
How Data is Distributed
● Capability to serve as a system of record
● Data security and governance within the
fabric
● Mixed Data access from multiple
protocols
● Distributed Multi-tenancy
● Global Namespace
● Integrated data streaming for AI

Architecture Matters
On-premise or Cloud Infrastructure
● Combines Distributed Compute, AI, HPC, and general purpose workloads
● MapR provides complementary data logistics to better manage and deploy deep learning across the entire ecosystem
● Enables deployment agility with data management extending from on-premise, to cross-cloud, to the edge

MapR Advantages for AI
Simplified administration and security models
● One and done - no need for a different model in each location
● GDPR “compliant”!
Scales linearly with customer needs
● No reason to create a bunch of separate clusters
Sustainability - All data, files, database and event streaming
● Both at-rest and in-motion
An enabling and flexible architecture
● Only way to bring distributed data and GPUs together
● Easy to meet customers’ needs
● Supports both Kubernetes and containers
Low cost of entry and linear cost of scaling

On-Premise, Cloud or Both
Same platform and architecture in all locations:
● On-premise works the same as the cloud
● A second cloud works the same as the first cloud
● Data mirroring between locations is built-in
● Real-time event management and lineage is built-in
○ Scale out streaming applications without rearchitecting them
● Kubernetes is a simple way to inject MapR storage and GPU support into a container
○ Leverage resources anywhere with Global Namespace
○ Application portability across all locations, no rework required

Simplifying Model Development and Deployment
Complex data pipelines, large data volumes serving GPUs
● Mixed workloads - distributed data prep plus real-time
Simultaneous data and model versioning
● Data at-rest and in-motion
Model output lands in a stream
● Creates pluggable model flow
Works across on-premise and cloud infrastructures, simultaneously

The Key is Data Logistics
“90+% of Machine Learning
Success is Data Logistics”
https://mapr.com/ebook/machine-learning-logistics

Need Help Solving Your Data Logistics Problems?
● Over 35 FREE on-demand training courses for AI and analytic development, data engineering and administration
● Certification tracks for developers, administrators, and data scientists
● Expanded support portal and knowledge base
● Containerized clusters for free download, plus solution templates and code examples, for hands-on experience
https://mapr.com/training/
