AI and Spark - IBM Community AI Day
About
@MLnick on Twitter & GitHub
Principal Engineer, IBM
CODAIT - Center for Open-Source Data
& AI Technologies
Machine Learning & AI
Apache Spark committer & PMC
Author of Machine Learning with Spark
Various conferences & meetups
Center for Open Source Data
and AI Technologies
CODAIT
codait.org
DBG / Oct 4, 2018 / © 2018 IBM Corporation
CODAIT aims to make AI solutions
dramatically easier to create, deploy,
and manage in the enterprise
Relaunch of the Spark Technology
Center (STC) to reflect expanded
mission
Improving Enterprise AI Lifecycle in Open Source
Lifecycle stages: Gather Data, Analyze Data, Machine Learning, Deep Learning, Deploy Model, Maintain Model
Tools shown: Python Data Science Stack, Fabric for Deep Learning (FfDL), MLeap + PFA, Scikit-Learn, Pandas, Apache Spark, Jupyter, Model Asset eXchange, Keras + TensorFlow
The Machine Learning
Workflow
Applying Machine Learning: Perception
In reality the workflow spans teams …
… and tools
Spark provides a unified platform
Machine Learning
Pipelines
What is a “model”?
Pipelines in Spark ML
Example – Text Classifier Pipeline
Example – Text Classifier PipelineModel
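The slide images for the pipeline examples are not part of this transcript, so the following is only a toy sketch of the idea in plain Python, not Spark's API. It assumes the usual Spark ML distinction: a transformer has transform(), an estimator has fit() and returns a fitted transformer, and fitting a Pipeline yields a PipelineModel. All class names, the nearest-centroid classifier, and the sample data are hypothetical stand-ins chosen for illustration.

```python
class Tokenizer:
    """Transformer: adds a 'tokens' field by lowercasing and splitting 'text'."""
    def transform(self, rows):
        return [{**r, "tokens": r["text"].lower().split()} for r in rows]


class CountVectorizer:
    """Estimator: learns a vocabulary and returns a fitted vectorizer model."""
    def fit(self, rows):
        vocab = sorted({tok for r in rows for tok in r["tokens"]})
        return CountVectorizerModel({tok: i for i, tok in enumerate(vocab)})


class CountVectorizerModel:
    """Transformer: adds a 'features' term-count vector using the vocabulary."""
    def __init__(self, index):
        self.index = index

    def transform(self, rows):
        out = []
        for r in rows:
            vec = [0.0] * len(self.index)
            for tok in r["tokens"]:
                if tok in self.index:  # tokens unseen during fit are ignored
                    vec[self.index[tok]] += 1.0
            out.append({**r, "features": vec})
        return out


class CentroidClassifier:
    """Estimator (toy stand-in for a real classifier): mean vector per label."""
    def fit(self, rows):
        sums, counts = {}, {}
        for r in rows:
            acc = sums.setdefault(r["label"], [0.0] * len(r["features"]))
            for i, v in enumerate(r["features"]):
                acc[i] += v
            counts[r["label"]] = counts.get(r["label"], 0) + 1
        return CentroidModel(
            {lab: [v / counts[lab] for v in acc] for lab, acc in sums.items()}
        )


class CentroidModel:
    """Transformer: adds a 'prediction' field with the nearest centroid's label."""
    def __init__(self, centroids):
        self.centroids = centroids

    def transform(self, rows):
        def dist(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b))

        def predict(features):
            return min(self.centroids,
                       key=lambda lab: dist(features, self.centroids[lab]))

        return [{**r, "prediction": predict(r["features"])} for r in rows]


class Pipeline:
    """Estimator made of stages; fit() returns a PipelineModel."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            # Estimators are fitted on the data produced so far;
            # plain transformers are reused unchanged.
            model = stage.fit(rows) if hasattr(stage, "fit") else stage
            rows = model.transform(rows)
            fitted.append(model)
        return PipelineModel(fitted)


class PipelineModel:
    """Transformer: applies every fitted stage in order."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


train = [
    {"text": "spark makes pipelines easy", "label": "spark"},
    {"text": "spark runs on clusters", "label": "spark"},
    {"text": "neural networks need gpus", "label": "dl"},
    {"text": "deep neural networks learn features", "label": "dl"},
]
model = Pipeline([Tokenizer(), CountVectorizer(), CentroidClassifier()]).fit(train)
scored = model.transform([{"text": "spark clusters"}, {"text": "deep networks"}])
predictions = [r["prediction"] for r in scored]
print(predictions)
```

The point of the sketch is that the fitted object carries the whole chain, feature extraction included, so the "model" that gets deployed is the PipelineModel rather than the classifier alone.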
Spark ML Components
Source: http://spark.apache.org/docs/latest
Deep Learning
DBG / June 6, 2018 / © 2018 IBM Corporation
Deep Learning Overview
• Original theory from 1940s; computer models
originated around 1960s; fell out of favor in
1980s/90s
• Recent resurgence due to
• Bigger (and better) data; standard datasets (e.g.
ImageNet)
• Better hardware (GPUs)
• Improvements to algorithms, architectures and
optimization
• Leading to new state-of-the-art results in
computer vision (images and video);
speech/text; language translation and more
Source: Wikipedia
Modern Neural Networks
• Deep (multi-layer) networks
• Computer vision
• Convolutional neural networks (CNNs)
• Image classification, object detection, segmentation
• Sequences and time-series
• Recurrent neural networks (RNNs)
• Machine translation, text generation
• Embeddings
• Text, categorical features
• Deep learning frameworks
• Flexibility, computation graphs, auto-differentiation,
GPUs
Source: Stanford CS231n
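As a small illustration of the operation behind CNNs, here is a minimal sketch of a 2-D convolution (strictly, a cross-correlation with no padding and stride 1) in plain Python. The function name, the image, and the hand-picked kernel are assumptions made for this example; in a real network the kernel weights are learned and the work is done by a deep learning framework.

```python
def conv2d(image, kernel):
    """Valid (no padding, stride 1) 2-D cross-correlation over nested lists."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# A 4x4 image with a vertical edge between the second and third columns.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# A hand-picked horizontal-difference kernel; a CNN would learn such weights.
kernel = [[-1, 1]]

feature_map = conv2d(image, kernel)
print(feature_map)
```

The resulting feature map is non-zero only where the pixel value changes from left to right, which is the sense in which a convolutional filter acts as a local feature detector.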
Deep Learning Frameworks
* Logos trademarks of their respective projects
Computation Graphs
Source: Google AI Blog
*MnasNet Network
*Inception V3
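The graphs pictured on this slide are large, but the mechanism can be sketched on a tiny one. The following is a minimal illustration, in plain Python, of a computation graph with reverse-mode automatic differentiation; the Node class and backward function are hypothetical names for this sketch and do not correspond to any framework's API.

```python
class Node:
    """One value in a computation graph, linked to the nodes it was built from."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # pairs of (parent node, local derivative)
        self.grad = 0.0

    def __add__(self, other):
        return Node(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Node(self.value * other.value,
                    ((self, other.value), (other, self.value)))


def backward(output):
    """Reverse-mode differentiation: walk the graph from the output to the inputs."""
    order, seen = [], set()

    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)  # parents are appended before their children

    visit(output)
    output.grad = 1.0
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += local * node.grad  # chain rule


# y = w * x + b, with x = 3, w = 2, b = 1
x, w, b = Node(3.0), Node(2.0), Node(1.0)
y = w * x + b
backward(y)
print(y.value, w.grad, x.grad, b.grad)
```

Building the graph first and differentiating it afterwards is what lets a framework compute gradients for arbitrary architectures without hand-written derivative code.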
DL Frameworks on Spark
Major Frameworks
• Deeplearning4J
• BigDL
• Deep Learning Pipelines
• TensorFlowOnSpark
• Microsoft Machine Learning on Spark
(MMLSpark)
Deeplearning4J
• Distributed GPU support for all major deep
learning architectures
• CPU / Distributed CPU / Single GPU options exist
• Supports Convolutional Nets, LSTMs / RNNs,
Feedforward Nets, Word2Vec, custom layers
• Supported by startup Skymind.io
• Backed by its own linear algebra library –
ND4J
• APIs in Scala, Java, Python
• Newer Scala API, Keras-like
• Keras import / export for Python API
• Production serving is through proprietary
layer
• DataVec for ETL
BigDL
• Distributed CPU with Intel MKL
• No GPU support
• Most DL models – CNN, RNN
• Backed by Intel
• Natively integrated with Spark
• Scala, Python API
• Support for Spark ML pipelines
• Uses private internal Spark components for
distributed training
• Load Keras, Caffe, Torch models
• New Keras-style API
Deep Learning Pipelines
• Created by Databricks
• Focus on scoring models (TensorFlow / Keras) and
basic transfer learning
• No support for training the DL model
• Focus on image data & use cases
• Natively integrated with Spark
• Scala, Python API
• Support for Spark ML pipelines
• Support for scoring models as a SQL UDF
• Largely dormant currently
TensorFlowOnSpark
• Created by Yahoo
• Scale out TF on Spark clusters
• Use Spark executors to launch TF processes
• Supports distributed training through TF parameter
servers
• RDMA / InfiniBand improvement to TF to speed up
distributed training
• Good support for TensorBoard
• Good integration with Spark
• But only Python API
• Some support for Spark ML pipelines
• Relatively inactive recently
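The parameter-server pattern mentioned above can be sketched without TensorFlow or Spark. The following toy example assumes a synchronous scheme in which every worker pulls the current weight, computes a gradient on its own data shard, and the server applies the averaged update. The class and function names, the single-weight linear model, and the learning rate are all assumptions for illustration; this is not how TensorFlowOnSpark itself is implemented.

```python
class ParameterServer:
    """Holds the shared weight and applies averaged gradients from workers."""

    def __init__(self, weight=0.0, lr=0.05):
        self.weight = weight
        self.lr = lr

    def pull(self):
        return self.weight

    def push(self, grads):
        self.weight -= self.lr * sum(grads) / len(grads)


def worker_gradient(weight, shard):
    """Gradient of mean squared error for y = weight * x on one data shard."""
    return sum(2 * (weight * x - y) * x for x, y in shard) / len(shard)


# Data generated from y = 3x, split into two shards (one per worker).
data = [(x, 3.0 * x) for x in [1.0, 2.0, 3.0, 4.0]]
shards = [data[:2], data[2:]]

server = ParameterServer()
for step in range(200):
    w = server.pull()  # every worker reads the current weight
    # In a real cluster each shard's gradient is computed on a different executor.
    grads = [worker_gradient(w, shard) for shard in shards]
    server.push(grads)  # the server applies the averaged update

print(round(server.weight, 3))
```

In this sketch the loop runs the workers one after another; the distributed version differs in that the pull and push steps cross the network, which is where interconnect speed starts to matter.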
MMLSpark
• Created by Microsoft
• Supports training using CNTK, including distributed training
• Image, text data
• Good integration with Spark
• Scala, Python, R API
• Support for Spark ML pipelines
• Varied deployment options
• Relatively active, seems quite well supported
Other Frameworks
• H2O.ai / Deep Water
• Apache MXNet Spark integration
• TensorFrames
• CaffeOnSpark
• scalable-deep-learning on GitHub
• MLlib – MLPClassifier only
• SparkNet (abandoned)
Integration Challenges
• Moving data from Spark to DL framework (and
back)
• Serialization overhead – especially Python
• Managing DL computation graphs from Spark
executors means fault tolerance is difficult to
achieve
• GPU awareness
• Optimize and standardize data exchange -
SPARK-24579
• Apache Arrow
• Barrier Execution Mode - SPARK-24374
• Accelerator-aware scheduling - SPARK-24615
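To make the serialization-overhead point concrete, here is a rough sketch using only the Python standard library. It compares pickling rows one at a time with packing each column into a single contiguous buffer. The dataset is made up, and the array module is only a stand-in for a columnar format: Apache Arrow defines a standardized, language-independent columnar layout, which this sketch does not attempt to reproduce.

```python
import pickle
from array import array

# Hypothetical dataset: 10,000 rows with a float feature and an integer id.
rows = [(i * 0.5, i) for i in range(10_000)]

# Row-at-a-time exchange: each row is pickled on its own.
row_bytes = sum(len(pickle.dumps(row)) for row in rows)

# Columnar batch exchange: one contiguous buffer per column.
features = array("d", (f for f, _ in rows))  # 8-byte floats
ids = array("q", (i for _, i in rows))       # 8-byte signed integers
column_bytes = len(features.tobytes()) + len(ids.tobytes())

# Round trip: rebuild the rows from the two column buffers.
restored = list(zip(array("d", features.tobytes()), array("q", ids.tobytes())))

print(row_bytes, column_bytes, restored == rows)
```

Byte counts are only part of the cost. Per-row serialization also pays a function-call and object-construction price for every record, which is why batching data into columns tends to help most on the Python side.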
Thank you!
codait.org
twitter.com/MLnick
github.com/MLnick
developer.ibm.com
FfDL
Sign up for IBM Cloud and try Watson Studio!
https://ibm.biz/BdYhXz
https://datascience.ibm.com/
MAX
Brought to you by community.ibm.com/icpfordata
Catch the replay at ibmaicommunity.bemyapp.com
