MapR and Machine Learning Primer
© 2017 MapR Technologies, MapR Confidential
Mathieu Dumoulin
Data Engineer PS APAC – Tokyo, Japan
Wednesday, May 10, 2017

Today's goals
• Machine Learning
• Enterprise Machine Learning
• Challenges of Enterprise ML
• MapR Unique Features for ML
• H2O and MapR

Machine Learning
Machine learning is a type of artificial intelligence (AI) that gives computers the ability to learn without being explicitly programmed.
ML allows a computer to make predictions on new data, usually based on patterns learned from historical data.

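As a rough illustration of "predictions based on historical data", the sketch below fits a straight line to a few past observations and uses it to predict the next one. The sales figures and the function names are invented for this example; real projects would normally use an ML library rather than hand-written least squares.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return slope, y_bar - slope * x_bar

# Hypothetical historical data: month number -> units sold.
months = [1, 2, 3, 4, 5]
units = [110, 120, 130, 140, 150]

# "Learning": the parameters come from the data, not from hand-written rules.
slope, intercept = fit_line(months, units)

def predict(month):
    """Predict units sold for a month that is not in the historical data."""
    return slope * month + intercept
```

Calling `predict(6)` extrapolates the learned trend to a month that has not happened yet.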
Questions Data Science Can Answer
1. Is this A or B? (Classification)
2. Is this weird? (Anomaly Detection)
3. How much, or how many? (Regression)
4. How is this organized? (Clustering)
5. What should I do next? (Reinforcement Learning)
6. What's similar? (Recommendation)

ML in the Enterprise...
... isn't so easy after all.

Enterprise ML: Business Value Outputs
A growing number of ML use cases at successful companies:
• Anomaly Detection
• Customer 360
• Fraud Detection
• Log Security Analysis
• Recommender Engines
• Sensor Data Analysis (IoT)
• Personalized Offers
• Ad Tech

Machine Learning Tools

What Most ML Tools Give You
A common rule of thumb is that the modeling task is about 10% of the total effort of an ML project.
The choice of tool matters (to the data scientist), but any top-tier ML tool or library can eventually get good results (if the data allows it at all).

Enterprise ML Projects: More than Just Modeling

Business Value is in Production
All the business value results from a sufficiently accurate model running in production.
What it means:
Deploying a weaker model to production sooner is MUCH better than working endlessly toward an excellent model.
(But you can make Google money if you get a world-class model into production.)

Data Cleaning and Feature Engineering
80% of the work!

Workflow View of Machine Learning
[Figure: workflow diagram with eight numbered steps]

Enterprise ML Challenges
• Data comes from many sources and may be very large
• Data isn't always labeled!
• Data needs ETL and cleaning
• Finding the best algorithm and parameters can use a lot of CPU
• Real-time data? Production data from many sources?
• The model needs to run on a server somewhere
• The predictions are used by another system...

The Open Source Solution (I'm not joking!)
Ref: http://advancedspark.com/ , https://github.com/fluxcapacitor/pipeline
Separate clusters!

What Data Scientists and ML Engineers Want
• "I know where the data is and how to access it."
• "My work is made easier in ALL PHASES of the ML project, not just modeling."
• "Let me use my favorite tools at all scales (MB, GB, TB, PB)."

An Ideal Platform For Enterprise ML
An ideal platform for ML:
• Scales with you and your data
• Gives you the freedom to use any tool
– Open source data science tools: Jupyter, Zeppelin, Spark, H2O, TensorFlow, ...
– Legacy/local tools: NLP tools, scikit-learn, R
• Lets data be versioned and kept reliably
• Combines storage, database, compute and streams
• Supports both model building and model deployment
• Supports security when needed

Our Humble Proposal
MapR is the best platform for enterprise ML on the market today.

MapR Converged Data Platform
[Figure: platform architecture diagram. Labels recovered from the diagram:]
• Data Processing: Open Source Engines & Tools, Commercial Engines & Applications, Search & Others, Cloud & Managed Services, Custom Apps
• Utility-Grade Platform Services: MapR-FS (Enterprise Storage), MapR-DB (Database), MapR Streams (Event Streaming)
• Platform properties: Global Namespace, High Availability, Data Protection, Self-healing, Unified Security, Real-time, Multi-tenancy
• Unified Management and Monitoring

The MapR Stack: Converged + Open

MapR Supports All Tools Out of the Box
• NFS mount and POSIX file system
– Small-scale Python or R data exploration on the real data
– Keep the raw data; ETL work is easily reused
• Supports the standard big data ecosystem (Spark)
• The NFS mount can ingest data from any enterprise system that can output files
– Even if they don't support Hadoop!
• Much faster than HDFS
– Serve production models directly from MapR

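The practical consequence of an NFS mount with POSIX semantics is that ordinary file APIs work on cluster data, with no Hadoop client involved. The sketch below reads a CSV file using only the Python standard library. The path, cluster name, and column name are hypothetical placeholders chosen for this example, not values from the talk.

```python
import csv
from pathlib import Path

def column_mean(csv_path, column):
    """Average one numeric column of a CSV file using only standard file I/O."""
    total, count = 0.0, 0
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            total += float(row[column])
            count += 1
    return total / count if count else float("nan")

# Hypothetical location on an NFS-mounted cluster; any readable path works.
RAW_DATA = Path("/mapr/my.cluster.com/projects/sensors/raw/readings.csv")

if RAW_DATA.exists():
    print(column_mean(RAW_DATA, "temperature"))
```

The same function runs unchanged against a local file, which is the point: exploration code written on a laptop does not need to be rewritten to reach the real data.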
MapR Supports Deep Learning
• Volumes and Topologies
– GPU-enabled nodes for distributed deep learning on the same cluster
• Don't waste resources
• Keep data locality
• Avoid unnecessary data movement
• Avoid multiple copies of data (which is the real one?)
• POSIX file system
– Use any DL framework on the cluster data
MapR has production experience with CaffeOnSpark (Samsung Micro-Electronics) and has a new TensorFlow QSS.

Clever Uses for Volumes and Snapshots
• Volumes and Snapshots
– Experimental reproducibility
– Create models on real production data
– Easy to compare models on the same data
– Easy to evaluate a model across time on different snapshots
– Snapshots of models: a built-in time machine
• Volume Quotas
– Support multiple projects and teams on the same cluster
– Share storage resources efficiently

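Evaluating one model across time on different snapshots could look roughly like the sketch below. It assumes that each snapshot is visible as a read-only directory under `.snapshot/<snapshot name>` at the volume root; that layout, the snapshot names, the CSV columns (`feature`, `label`), and the toy model are all assumptions made for illustration.

```python
import csv
from pathlib import Path

def snapshot_file(volume_root, snapshot_name, relative_path):
    """Path of a file as it existed in one snapshot (layout is an assumption)."""
    return Path(volume_root) / ".snapshot" / snapshot_name / relative_path

def accuracy(model, csv_path):
    """Fraction of rows where model(feature) equals the recorded label."""
    hits = total = 0
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            hits += model(float(row["feature"])) == row["label"]
            total += 1
    return hits / total if total else 0.0

def evaluate_across_snapshots(model, volume_root, snapshot_names, relative_path):
    """Score the same model on the same file in each snapshot."""
    return {
        name: accuracy(model, snapshot_file(volume_root, name, relative_path))
        for name in snapshot_names
    }

# A toy threshold "model", purely for illustration.
def threshold_model(x):
    return "high" if x >= 0.5 else "low"
```

Because every snapshot is an immutable view of the data, rerunning the evaluation later gives the same per-snapshot scores, which is what makes the comparison reproducible.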
Support the ML Workflow, Not Just Modeling
Remember that > 90% of the work in enterprise ML is realizing the workflow. This is where MapR shines!
• Operational capabilities (MapR-DB, MapR Client)
– Serve production models directly from MapR
• Snapshots and Mirrors
– Do A/B testing with almost no coding
– Promote the mirror to go back to the previous state
• Just update the path in the production system: no redeployment!
• MapR Streams for real-time predictions
– Zero-configuration Kafka: it just works!
– Kafka REST Proxy for maximum interoperability
– Supports microservices and stateful containers

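One way to read "just update the path, no redeployment" is that the serving code never hard-codes a model location; it follows a small pointer that operations can change. The sketch below is a minimal version of that idea under stated assumptions: the JSON pointer file, the `model_path` key, and the linear-model file format (`bias`, `coefficients`) are all invented for this example.

```python
import json
from pathlib import Path

class PathConfiguredModel:
    """Serve whichever model file a small config file currently points to."""

    def __init__(self, config_path):
        self.config_path = Path(config_path)
        self._loaded_from = None
        self._weights = None

    def _refresh(self):
        # Re-read the pointer on every call; reload the model only if it changed.
        model_path = json.loads(self.config_path.read_text())["model_path"]
        if model_path != self._loaded_from:
            self._weights = json.loads(Path(model_path).read_text())
            self._loaded_from = model_path

    def predict(self, features):
        self._refresh()
        w = self._weights
        return w["bias"] + sum(c * x for c, x in zip(w["coefficients"], features))
```

Rolling back is then the same operation as rolling forward: point the config at the previous model path (for example, one inside an earlier snapshot) and the running service picks it up on the next request.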
MapR 💖 Enterprise Machine Learning
• Features that work together to support all phases of ML
• Supports your existing tools and code as well as state-of-the-art large-scale frameworks
• Easier to manage, more robust and more secure
• MapR is made for the enterprise and great for ML

MapR Converged Application Blueprint
• Microservices connected by real-time streams
– Ideal to serve predictions from ML models
• Next-generation large-scale architecture
• Working example: https://www.mapr.com/appblueprint/overview

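The shape of a prediction microservice connected by streams can be sketched as a loop that reads requests from one topic and writes scores to another. To keep the example self-contained, in-memory `queue.Queue` objects stand in for the stream topics; a real deployment would use a Kafka-API client instead, which is not shown here. The message fields and the toy model are assumptions for illustration.

```python
import queue

def prediction_service(model, requests, predictions):
    """Drain the request 'topic', score each message, publish the result."""
    while True:
        try:
            message = requests.get_nowait()
        except queue.Empty:
            break
        predictions.put({"id": message["id"], "score": model(message["features"])})

# Toy model: average of the features. A real service would load a trained model.
def toy_model(features):
    return sum(features) / len(features)

requests, predictions = queue.Queue(), queue.Queue()
requests.put({"id": "a", "features": [1.0, 3.0]})
requests.put({"id": "b", "features": [2.0, 2.0, 5.0]})
prediction_service(toy_model, requests, predictions)
```

The service only knows about its input and output topics, so the system producing requests and the system consuming predictions can change independently of the model.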
Q&A
ENGAGE WITH US
@mapr
mdumoulin@mapr.com

Editor's Notes

  • #5:
    1. Will this tire fail in the next 1,000 miles: yes or no? Which brings in more customers: a $5 coupon or a 25% discount?
    2. If you have a car with pressure gauges, you might want to know: is this pressure gauge reading normal? If you're monitoring the internet, you'd want to know: is this message from the internet typical?
    3. What will the temperature be next Tuesday? What will my fourth-quarter sales be?
    4. Which viewers like the same types of movies? Which printer models fail the same way?
    5. Reinforcement learning was inspired by how the brains of rats and humans respond to punishment and rewards. These algorithms learn from outcomes and decide on the next action. Typically, reinforcement learning is a good fit for automated systems that have to make lots of small decisions without human guidance.
    6. What did similar people or customers buy, watch, or listen to? What other movies will you like if you like movie A?
  • #9 to #12 (the same note is attached to each of these slides):
    Data ingestion is a non-trivial task for the enterprise
    – The best systems combine data from multiple sources
    – Adding more data is a highly specialized task
    Data governance for ML
    – Dataset versions
    – Test data versions
    – Model versions
    – Test results versions
    Model deployment is a non-trivial integration task with another external enterprise system
    – May need to be scalable, HA and fault-tolerant
    What about after deployment?
    – A/B testing
    – Understanding performance
    – Dealing with data drift
  • #13:
    1. Get training data (example: images).
    2. Get labels for the training data (example: what each image is about, the image labels).
    3. Transform the data into numbers (machine learning algorithms can't deal with raw data, only vectors of numbers). This is heavily iterative work to find the best set of features.
    4. Try many different algorithms, and tune their parameters for best performance. This is heavily iterative work to find the best algorithm and parameter values.
    5. The best algorithm, trained on your data, with its parameters tuned for best performance, is your predictive model.
    6. Get new data.
    7. Transform it to match the same format as your training feature vectors.
    8. The model will output a predicted label for the new data.
    This is already a lot of work, but it glosses over the HUGE amount of work required to get business value in an enterprise setting.
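The workflow in the note above can be compressed into a toy end-to-end sketch. Everything here is invented for illustration: short text messages replace the images mentioned in the note, the three hand-picked features are arbitrary, and a hand-written k-nearest-neighbours classifier stands in for "many different algorithms".

```python
from collections import Counter

# Steps 1-2: labeled training data (toy messages, invented for this example).
train = [("win cash now!!!", "spam"), ("free prize!!! call 555", "spam"),
         ("claim $$$ today!!", "spam"), ("lunch at noon?", "ham"),
         ("see you at the meeting", "ham"), ("notes from today attached", "ham")]
validation = [("free cash!!!", "spam"), ("meeting moved to noon", "ham")]

# Step 3: turn raw text into a vector of numbers.
def featurize(text):
    return [text.count("!"), sum(ch.isdigit() for ch in text), text.count("$")]

# Step 4: one algorithm family (k-nearest neighbours) with a tunable parameter k.
def knn_predict(k, examples, text):
    x = featurize(text)
    def dist(item):
        return sum((a - b) ** 2 for a, b in zip(featurize(item[0]), x))
    nearest = sorted(examples, key=dist)[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(k):
    return sum(knn_predict(k, train, t) == y for t, y in validation) / len(validation)

# Step 5: the best-scoring parameter value defines the model.
best_k = max((1, 3), key=accuracy)

# Steps 6-8: new data goes through the same transformation, then the model.
def model(text):
    return knn_predict(best_k, train, text)
```

Note that `model` reuses `featurize`, so new data is transformed exactly like the training data (step 7); getting that wrong is a common source of silent errors once a model leaves the notebook.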
  • #14: Here is a small sample of the issues faced in putting ML to work in an enterprise.
  • #15: This is for real. Chris Fregly is the Pipeline.io guy, and he's building enterprise ML systems with this set of tools. This is the set of tools required to provide a true 100% open source, end-to-end story for enterprise ML. How does MapR simplify this picture? Which tools would remain useful if we ran on MapR?
  • #17: Can support OSS tools like R, scikit-learn, Theano and TensorFlow first to avoid more expensive licences. NoSQL, Kafka, Spark, etc.
  • #24: Indeed, ML tools are only really good at modeling. They typically provide limited support for feature engineering. In addition, tools typically only support testing models on a single dataset, with no support for comparing production models to experimental models, comparing models across different versions of a dataset, etc. Such capabilities need to be custom made and are typically implemented by data scientists as low-quality ad hoc code. Most of the work lies in the ETL to get the data in the first place, in the data cleaning and feature engineering not supported by the tools, and in the work required to deploy a model to production. MapR really shines on that 90% of the work, and supports all ML tools just as well as (indeed, better than) any competing platform.
  • #25: Legacy tools: R, Python, Bash, SPSS, Hive/Pig. State of the art: Apache projects like Drill, Impala, Spark, Zeppelin, Mesos, Flink, ...