Disrupting Big Data with
Apache Spark in the Cloud
Ali Ghodsi
Co-Founder and CEO
The Dawn of Advanced Analytics
Watson, Siri/assistants, self-driving cars
Not just sci-fi, important applications for businesses
Analytics Transforming Industries
Predictive Analytics
Anomaly Detection
Predict Product Revenue
Customer Assessment
Targeted Advertising
Fraud Detection
Risk Assessment
Equipment Failure
Data-Driven Real-time Analytics Applications
Today’s Data Reality
HADOOP
DATA LAKES
DATA HUBS
CLOUD STORAGE
DATA WAREHOUSES
Siloed, Fast-Growing Size, Cost
The Analytics Gap
Industrial, Media, Pharma
HADOOP
DATA LAKES
DATA HUBS
CLOUD STORAGE
DATA WAREHOUSES
Siloed, Fast-Growing Size, Cost
Real-time Data-Driven Analytics Applications
Why is there a gap?
Real-time Data-Driven Analytics Applications
1. Manage data infrastructure
• Create, tune, monitor compute clusters.
• Securely access silos of disparate data sources.
• Enforce proper data governance.
2. Empower teams to be productive
• Securely share big data clusters among analysts.
• Interactively explore data and prototype ideas.
• Debug, troubleshoot, version-control big data applications.
3. Establish production-ready applications
• Set up robust data pipelines for ETL/ELT.
• Productionize real-time applications with high availability (HA) and fault tolerance (FT).
• Build, serve, maintain advanced machine learning models.
Siloed, Fast-Growing Size, Cost
Databricks Cloud-Hosted Platform
1. Just-in-Time Data Platform (Agile)
• Separate compute & storage
• Integrate existing data stores
• Efficient cache on first access
2. Integrated Workspace (Democratize Big Data)
• Interactive notebooks, dashboards, reports
• Real-time exploration, machine learning, graph use cases
3. Automated Apache Spark Management (Production-Ready)
• Workflow scheduler for ML, streaming, SQL, ETL
• High availability, fault-tolerant, performance-optimized
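To make the "efficient cache on first access" bullet from this slide concrete, here is a minimal Python sketch of the general idea: compute reads from a separate store the first time a path is requested and serves later reads from a local cache. This is an illustration under stated assumptions, not the Databricks implementation; the class `FirstAccessCache`, the counter `remote_reads`, and the path `s3://bucket/events.csv` are all hypothetical names introduced for the example.

```python
from typing import Callable, Dict


class FirstAccessCache:
    """Caches objects from a remote store the first time they are read.

    Hypothetical illustration only; not how any specific product works.
    """

    def __init__(self, fetch: Callable[[str], bytes]) -> None:
        self._fetch = fetch                 # slow path: read from external storage
        self._cache: Dict[str, bytes] = {}  # local copies keyed by path
        self.remote_reads = 0               # counts how often the slow path runs

    def read(self, path: str) -> bytes:
        if path not in self._cache:         # first access: go to the remote store
            self._cache[path] = self._fetch(path)
            self.remote_reads += 1
        return self._cache[path]            # later accesses: served from the cache


# A plain dict stands in for storage that lives apart from the compute cluster.
remote_store = {"s3://bucket/events.csv": b"id,amount\n1,9.99\n"}
cache = FirstAccessCache(remote_store.__getitem__)

first = cache.read("s3://bucket/events.csv")   # triggers one remote read
second = cache.read("s3://bucket/events.csv")  # served locally
```

Because storage is reached only through the `fetch` callable, the same cache could sit in front of any existing data store, which is roughly what "separate compute & storage" and "integrate existing data stores" suggest.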
Databricks Just-in-Time Data Platform
HADOOP / DATA LAKES, DATA WAREHOUSES, YOUR STORAGE, CLOUD STORAGE
INTEGRATED WORKSPACE
• DASHBOARDS: reports
• NOTEBOOKS: github, viz, collaboration
• BI TOOLS
JUST-IN-TIME PROCESSING, POWERED BY APACHE SPARK
• CLUSTERS: auto-scaled, resilient, multi-tenant
• DATA INTEGRATION: secure and fast data source integrations
• INTERFACES: REST APIs & BI tools
DATABRICKS SERVICES + YOUR CUSTOM SPARK APPS
PRODUCTION JOBS
DATA LAKE
DATA HUB
The Challenge of Securing Analytics
End-to-end security is a challenge for enterprises
• Secure file management
• Secure table management
• Secure cluster management
• Secure job workflows
• Secure dashboard, report, and notebook management
Today there are piecemeal solutions, but no comprehensive solution
Databricks Enterprise Security (DBES)
Holistic end-to-end security for Data Analytics
Files, Tables, Clusters, Workflows, Notebooks, Dashboards, Reports
DBES provides
• Role-based access control
• Auditing and governance
• Integrated identity-management
• Encryption on-disk and on-the-wire
The First End-to-End Security Solution for Apache Spark
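As a rough illustration of how role-based access control and auditing fit together, the following Python sketch checks a user's roles against a permission table and records every decision. It is a generic pattern under stated assumptions, not the DBES design or API; the role names, the `ROLE_PERMISSIONS` table, the `AccessController` class, and the users `dana` and `lee` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

# Hypothetical mapping: role -> permitted (resource type, action) pairs.
ROLE_PERMISSIONS: Dict[str, Set[Tuple[str, str]]] = {
    "analyst": {("table", "read"), ("notebook", "read"), ("notebook", "edit")},
    "admin": {("table", "read"), ("table", "write"), ("cluster", "manage")},
}


@dataclass
class AccessController:
    user_roles: Dict[str, Set[str]]
    audit_log: List[Tuple[str, str, str, bool]] = field(default_factory=list)

    def is_allowed(self, user: str, resource_type: str, action: str) -> bool:
        roles = self.user_roles.get(user, set())
        # Access is granted if any of the user's roles carries the permission.
        allowed = any(
            (resource_type, action) in ROLE_PERMISSIONS.get(role, set())
            for role in roles
        )
        # Record every decision, granted or denied, for later auditing.
        self.audit_log.append((user, resource_type, action, allowed))
        return allowed


acl = AccessController(user_roles={"dana": {"analyst"}, "lee": {"admin"}})
can_read = acl.is_allowed("dana", "table", "read")        # analyst may read tables
can_manage = acl.is_allowed("dana", "cluster", "manage")  # analyst may not manage clusters
admin_manage = acl.is_allowed("lee", "cluster", "manage") # admin may manage clusters
```

Keeping the check and the audit entry in one method means a denied request is logged just like a granted one, which is the property governance reviews usually depend on.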
Enterprise use-cases
Preventing credit card fraud
Predicting energy demand from massive weather data
Predicting player churn and network outages
Extracting an author graph with natural language processing
Generating tailored programs based on big data
Thank you.
Try Apache Spark with Databricks
http://databricks.com/try
Try the latest version of Apache Spark and a preview of Spark 2.0



Editor's Notes

  • #2: I HAVE AN ANNOUNCEMENT TODAY BUT FIRST, WANT TO TALK ABOUT DATABRICKS
  • #3: THIS IS NOT JUST SCI-FI
  • #7: Establish production-quality infrastructure
  • #8: 1: Global publisher Elsevier – teams in the US and EU perform natural language processing on all their content to experiment with new product ideas. 2: Energy analytics company DNV GL – Databricks sped up analytics of IoT data from weather and grid sensors by 100x. 3: Financial service provider LendUp – Databricks enabled them to update their machine learning models daily instead of weekly.
  • #9: OPEN SOURCE