AI at Scale
Adi Polak | Microsoft
@adipolak
Start from the beginning
• Massive datasets
• Intelligent product
Overview
• Apache Spark
• Databricks for Production
• Build machine learning models with Pipelines
• Cognitive Services and Apache Spark
• Data Engineer vs Data Scientist
• MMLSpark
What is Apache Spark?
Motivation
Apache Spark architecture
Apache Spark submit job
Framework libraries - Software Structure
Python, R & .NET for Apache Spark
Diagram: PySpark, embedded service, Spark Scala process, PySpark protocol, Spark worker
• Libraries: SQL, Structured Streaming, MLlib, GraphFrame, TensorFrames
• SparkSession / DataFrame / Dataset APIs
• Data Source Connectors
• Catalyst Optimization & Tungsten Execution
• Spark Core (RDD APIs)
• SQL API and more APIs
Why is Apache Spark fast and efficient?
• Catalyst optimizer
• In-memory computing
• Tungsten
CATALYST
Fundamentals of Catalyst Optimizer
Tree: SUB(Attribute(x), SUB(some_func(1), some_func(2)))
After applying rules: SUB(Attribute(x), some_func(-1))
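The tree rewrite on this slide can be sketched in plain Python. This is only an illustration of a bottom-up rewrite rule, not Catalyst's actual Scala implementation. The names `Literal`, `Attribute`, `Sub`, and `fold_constants` are hypothetical, and the slide's `some_func(n)` nodes are assumed here to behave like literal values.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Literal:
    value: int


@dataclass(frozen=True)
class Attribute:
    name: str


@dataclass(frozen=True)
class Sub:
    left: "Expr"
    right: "Expr"


Expr = Union[Literal, Attribute, Sub]


def fold_constants(expr: Expr) -> Expr:
    """Rewrite bottom-up: a Sub of two literals becomes a single literal."""
    if isinstance(expr, Sub):
        left = fold_constants(expr.left)
        right = fold_constants(expr.right)
        if isinstance(left, Literal) and isinstance(right, Literal):
            return Literal(left.value - right.value)
        return Sub(left, right)
    # Attributes and literals are leaves and stay unchanged.
    return expr


# x - (1 - 2) is rewritten to x - (-1); the attribute is left untouched.
tree = Sub(Attribute("x"), Sub(Literal(1), Literal(2)))
optimized = fold_constants(tree)
```

The rule only looks at the shape of the tree, so it can run before any data is read.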
Spark SQL Execution Plan
Logical optimization –> Optimization rules
• Constant folding
• Predicate pushdown
• Projection pruning
• …
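Predicate pushdown and projection pruning can be illustrated without Spark. The sketch below is a toy model under stated assumptions: `scan`, `cells`, the sample `table`, and the `is_nl` predicate are all hypothetical, and counting materialized values is only a rough stand-in for I/O cost.

```python
def scan(table, columns=None, predicate=None):
    """Read a table; optionally filter rows and keep only some columns at the source."""
    rows = []
    for row in table:
        if predicate is None or predicate(row):
            rows.append(row if columns is None else {c: row[c] for c in columns})
    return rows


def cells(rows):
    """Number of values materialized, a rough stand-in for I/O cost."""
    return sum(len(row) for row in rows)


# Hypothetical data, for illustration only.
table = [
    {"id": 1, "country": "NL", "amount": 10, "note": "a"},
    {"id": 2, "country": "FR", "amount": 20, "note": "b"},
    {"id": 3, "country": "NL", "amount": 30, "note": "c"},
]


def is_nl(row):
    return row["country"] == "NL"


# Unoptimized plan: read everything, then filter, then project.
full = scan(table)
naive = [{"amount": r["amount"]} for r in full if is_nl(r)]

# Optimized plan: the filter and the column list are pushed into the scan.
pushed = scan(table, columns=["amount"], predicate=is_nl)
```

Both plans return the same rows, but the pushed-down scan hands far fewer values to the rest of the plan.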
Catalyst
Frontend Backend
Apache Spark
Connectors
Spark Connector for SQL Server
Spark connector for Cosmos DB
Many more connectors…
How to run Apache Spark on Azure?
• HDInsight
• Azure Databricks
• AKS
Azure Databricks
What is Azure Databricks
Workspace and Notebooks
Build for Production
• Databricks limits and Azure limits
• Workspace isolation
• VNet
• Security and Key Vault
• Enable Log Analytics for monitoring
Databricks limits and Azure limits
Key Databricks limits (per workspace):
• 150 concurrent jobs
• 150 max notebooks
• 1000 max hourly job submissions
Key Azure limits:
• Storage accounts per region per subscription: 250
• VMs per subscription per region: 25,000
• Resource groups per subscription: 980
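One way to use these numbers is a quick capacity check before deciding how many workspaces a team needs. The sketch below is an assumption-laden illustration: the limit values are copied from the slide and may have changed since the talk, while `exceeded_limits`, `workspaces_needed`, and the planned workload are hypothetical.

```python
# Limits copied from the slide; they may have changed since the talk.
DATABRICKS_LIMITS_PER_WORKSPACE = {
    "concurrent_jobs": 150,
    "notebooks": 150,
    "hourly_job_submissions": 1000,
}


def exceeded_limits(planned: dict, limits: dict) -> list:
    """Return the names of planned values that go over their limit."""
    return sorted(
        name for name, value in planned.items()
        if name in limits and value > limits[name]
    )


def workspaces_needed(planned: dict, limits: dict) -> int:
    """Smallest number of workspaces so that an even split fits every limit."""
    needed = 1
    for name, value in planned.items():
        if name in limits:
            needed = max(needed, -(-value // limits[name]))  # ceiling division
    return needed


# Hypothetical workload, for illustration only.
planned_workload = {
    "concurrent_jobs": 400,
    "notebooks": 120,
    "hourly_job_submissions": 900,
}
over = exceeded_limits(planned_workload, DATABRICKS_LIMITS_PER_WORKSPACE)
count = workspaces_needed(planned_workload, DATABRICKS_LIMITS_PER_WORKSPACE)
```

Here only the concurrent-job limit is exceeded, which is the kind of result that motivates splitting work across isolated workspaces.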
Workspace isolation
Cluster modes and their characteristics
Batch vs. Interactive workloads
Data Scientists and Analysts: High Concurrency mode
Data Engineers
Troubleshooting with cluster logs
• Azure Databricks provides three kinds of logging of cluster-related activity:
• Cluster event logs, which capture cluster lifecycle events such as creation, termination, and configuration edits.
• Apache Spark driver and worker logs, which you can use for debugging.
• Cluster init-script logs, which are valuable for debugging init scripts.
Cluster logs
Cluster
Metric
Report
Log Analytics Workspace
• Leverage Azure Log Analytics: connect it to the workspace with init scripts and visualize the data using Kibana or other tools.
What about machine learning?
ML Process / Basic Life Cycle
1. Gather data
2. Feature extraction, cleaning, and normalization
3. Select algorithm
4. Evaluate model
5. Data visualization
Repeat!
How? ML Pipelines!
What are pipelines?
Tools: Spark Streaming, Spark ML/Other, Spark SQL, MLflow
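The idea behind a pipeline can be sketched in plain Python: a chain of stages, each fitted on the output of the previous one, then replayed in the same order on new data. This loosely mirrors the estimator and transformer stages of Spark ML's `Pipeline`, but it is not the Spark API. The classes `MeanImputer`, `MinMaxScaler`, and `Pipeline` below, and the sample values, are hypothetical.

```python
from statistics import mean


class MeanImputer:
    """Estimator-like stage: learns the mean, then fills missing values."""

    def fit(self, rows):
        known = [r for r in rows if r is not None]
        self.mean_ = mean(known)
        return self

    def transform(self, rows):
        return [self.mean_ if r is None else r for r in rows]


class MinMaxScaler:
    """Estimator-like stage: learns min and max, then rescales to [0, 1]."""

    def fit(self, rows):
        self.min_, self.max_ = min(rows), max(rows)
        return self

    def transform(self, rows):
        span = (self.max_ - self.min_) or 1.0
        return [(r - self.min_) / span for r in rows]


class Pipeline:
    """Runs stages in order; each stage is fitted on the previous stage's output."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        for stage in self.stages:
            rows = stage.fit(rows).transform(rows)
        return self

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


# Hypothetical training and scoring data, for illustration only.
training = [2.0, None, 4.0, 6.0]
pipeline = Pipeline([MeanImputer(), MinMaxScaler()]).fit(training)
scored = pipeline.transform([None, 6.0, 2.0])
```

Because the fitted pipeline carries the learned mean, minimum, and maximum, new data gets exactly the same preparation as the training data.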
Demo
But in real life:
Accuracy < 0.5
AUC ☹
Results are far from being good
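For reference, the two metrics named on this slide can be computed by hand. The sketch below is a minimal pure-Python version with hypothetical labels and scores; `accuracy` and `auc` are illustrative helpers, and `auc` uses the pairwise definition of ROC AUC (the chance that a random positive is scored above a random negative).

```python
def accuracy(labels, predictions):
    """Fraction of predictions that equal the true label."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def auc(labels, scores):
    """Probability that a random positive is scored above a random negative.

    Ties count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and model scores, for illustration only.
labels = [1, 0, 1, 0, 1, 0]
scores = [0.2, 0.7, 0.6, 0.4, 0.3, 0.8]
predictions = [1 if s >= 0.5 else 0 for s in scores]

model_accuracy = accuracy(labels, predictions)
model_auc = auc(labels, scores)
```

On balanced binary data like this, both values landing below 0.5 means the model does worse than random guessing.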
Building machine learning models
from scratch is hard!
We need a Data Scientist
Our Data scientist needs tools
Stand on the shoulders of giants
Azure Cognitive Services
Cognitive Services capabilities
Infuse your apps, websites, and bots with human-like intelligence
A variety of real-world applications
• Vision
• Speech
• Language
• Knowledge
• Search
Example intent: PlayCall
Let’s Combine them
Apache Spark + Cognitive Services
Data Engineer vs Data Scientist
Data Engineer:
• Advanced programming
• Distributed systems
• Data pipeline
• Basic analytics
Data Scientist:
• Advanced math/statistics
• ML/AI
• Advanced analytics
• Basic programming
Both should understand:
• Big Data constraints
MMLSpark: Microsoft Machine Learning for Apache Spark
Want to learn more?
• aka.ms/mlflow_and_azure_ml
• aka.ms/azure_databricks
• aka.ms/twitter_sentiment_analysis
Thank you! Dank u! Merci!
