Not Your Father’s Database:
How to Use Apache Spark Properly in Your Big Data Architecture
Spark Summit East 2016
About Me
2005 Mobile Web & Voice Search
2012 Reporting & Analytics
2014 Solutions Engineering
Is this your Spark infrastructure?
This system talks like a SQL database…
But the performance is very different…
[Diagram labels: SQL, HDFS]
Just in Time Data Warehouse w/ Spark
[Diagram labels: HDFS, and more…]
Today’s Goal
Know when to use other data stores besides file systems.
Good: General Purpose Processing
Types of Data Sets to Store in File Systems:
• Archival Data
• Unstructured Data
• Social Media and other web datasets
• Backup copies of data stores
Types of workloads
• Batch Workloads
• Ad Hoc Analysis
– Best Practice: Use in-memory caching
• Multi-step Pipelines
• Iterative Workloads
Benefits:
• Inexpensive Storage
• Incredibly flexible processing
• Speed and Scale
Bad: Random Access
sqlContext.sql(
  "select * from my_large_table where id = '2I34823'")
Will this command run in Spark?
Yes, but it’s not very efficient: Spark may have to go through all your files to find your row.
Solution: If you frequently access your data at random, use a database.
• For traditional SQL databases, create an index on your key column.
• Key-value NoSQL stores retrieve the value of a key efficiently out of the box.
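As a rough illustration of the two options above, the following sketch uses Python's built-in sqlite3 module as a stand-in for a traditional SQL database and a plain dict as a stand-in for a key-value store. The table, column, and index names are hypothetical and chosen only to mirror the query on the slide.

```python
import sqlite3

# Hypothetical table and index names, chosen to mirror the slide's query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_large_table (id TEXT, payload TEXT)")
conn.executemany(
    "INSERT INTO my_large_table VALUES (?, ?)",
    ((f"key{i}", f"row {i}") for i in range(10_000)),
)

def plan(sql, params=()):
    """Return the planner's description of how it would run `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]  # last column is the readable detail

query = "SELECT * FROM my_large_table WHERE id = ?"

plan_before = plan(query, ("key1234",))  # no index yet: a full table scan
conn.execute("CREATE INDEX idx_my_large_table_id ON my_large_table (id)")
plan_after = plan(query, ("key1234",))   # index present: a direct search

# A key-value store behaves much like a hash map: look up by key, no scan.
kv_store = {f"key{i}": f"row {i}" for i in range(10_000)}
value = kv_store["key1234"]

print(plan_before)
print(plan_after)
print(value)
```

The exact wording of the plan text varies between SQLite versions, but the first plan should describe a scan of the table and the second a search using the index.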
Bad: Frequent Inserts
sqlContext.sql("insert into table myTable " +
  "select fields from my2ndTable")
Each insert creates a new file:
• Inserts are reasonably fast.
• But querying will be slow…
Solution:
• Option 1: Use a database to support the inserts.
• Option 2: Routinely compact your Spark SQL table files.
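The idea behind Option 2 can be sketched without Spark. The example below is only an illustration under simple assumptions: a hypothetical table directory holds one small CSV part file per insert, and compaction rewrites those rows into fewer, larger files. The file naming and the `compact` helper are made up for this sketch, and it loads all rows into memory, which a real job would avoid.

```python
import csv
import tempfile
from pathlib import Path

def compact(table_dir: Path, rows_per_file: int) -> None:
    """Rewrite many small CSV part files as fewer, larger ones."""
    parts = sorted(table_dir.glob("part-*.csv"))
    rows = []
    for part in parts:
        with part.open(newline="") as f:
            rows.extend(csv.reader(f))
    # Remove the small files only after every row has been read.
    for part in parts:
        part.unlink()
    for n, start in enumerate(range(0, len(rows), rows_per_file)):
        out = table_dir / f"part-{n:05d}.csv"
        with out.open("w", newline="") as f:
            csv.writer(f).writerows(rows[start:start + rows_per_file])

# Simulate 100 small inserts, each producing its own file.
table_dir = Path(tempfile.mkdtemp())
for i in range(100):
    with (table_dir / f"part-{i:05d}.csv").open("w", newline="") as f:
        csv.writer(f).writerow([i, f"value-{i}"])

files_before = len(list(table_dir.glob("part-*.csv")))
compact(table_dir, rows_per_file=50)
files_after = len(list(table_dir.glob("part-*.csv")))
print(files_before, "->", files_after)
```

In Spark itself, a comparable step would typically read the table and write it back with fewer partitions, for example by calling `coalesce` or `repartition` on the DataFrame before the write.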
Good: Data Transformation/ETL
Use Spark to slice and dice your data files any way you need.
File storage is cheap: it is not an "anti-pattern" to store duplicate copies of your data.
Bad: Frequent/Incremental Updates
Update statements — not supported yet.
Why not?
• Random Access: Locate the row(s) in the files.
• Delete & Insert: Delete the old row and insert a new one.
• Update: File formats aren’t optimized for updating rows.
Solution: Many databases support efficient update operations.
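The three steps above can be made concrete with a small sketch. It assumes a hypothetical CSV file as the "table" and uses sqlite3 as a stand-in for a database with efficient updates; the helper name and the data are invented for illustration. Changing one row in the file means reading and rewriting all of it, while the database touches only the matching row.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path

def update_in_file(path: Path, key: str, new_value: str) -> int:
    """Update one row in a CSV file; return how many rows were rewritten."""
    with path.open(newline="") as f:
        rows = list(csv.reader(f))  # locate: every row has to be read
    # Delete & insert: swap in the new value for the matching key.
    rows = [[k, new_value if k == key else v] for k, v in rows]
    with path.open("w", newline="") as f:
        csv.writer(f).writerows(rows)  # the whole file is written again
    return len(rows)

path = Path(tempfile.mkdtemp()) / "table.csv"
with path.open("w", newline="") as f:
    csv.writer(f).writerows([[f"k{i}", "old"] for i in range(1000)])

rows_rewritten = update_in_file(path, "k42", "new")

# A database can find the row through its key index and change only that row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (k TEXT PRIMARY KEY, v TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)", [(f"k{i}", "old") for i in range(1000)]
)
cursor = conn.execute("UPDATE t SET v = ? WHERE k = ?", ("new", "k42"))
rows_touched = cursor.rowcount

print(rows_rewritten, rows_touched)
```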
Use Case: Up-to-date, live views of your SQL tables.
[Diagram: Database Snapshot + Incremental SQL Query]
Tip: Use ClusterBy for fast joins, or Bucketing with Spark 2.0.
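One possible reading of the diagram is that a live view is built by overlaying the rows returned by an incremental query on top of an older database snapshot. The sketch below shows that merge in plain Python. The `id` and `updated_at` fields and the `live_view` helper are assumptions made for this example, not details taken from the slides.

```python
def live_view(snapshot, incremental):
    """Overlay incremental rows on a snapshot, keyed by id; newer rows win."""
    merged = {row["id"]: row for row in snapshot}
    for row in incremental:
        current = merged.get(row["id"])
        if current is None or row["updated_at"] >= current["updated_at"]:
            merged[row["id"]] = row
    return sorted(merged.values(), key=lambda r: r["id"])

# Hypothetical data: a snapshot plus rows changed since it was taken.
snapshot = [
    {"id": 1, "name": "alice", "updated_at": 10},
    {"id": 2, "name": "bob", "updated_at": 10},
]
incremental = [
    {"id": 2, "name": "bobby", "updated_at": 12},
    {"id": 3, "name": "carol", "updated_at": 11},
]

view = live_view(snapshot, incremental)
for row in view:
    print(row)
```

In Spark the same overlay would usually be expressed as a join or union on the key column, which is where the ClusterBy and bucketing tip applies.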
Good: Connecting BI Tools
Tip: Cache your tables for optimal performance.
[Diagram label: HDFS]
Bad: External Reporting w/ load
Too many concurrent requests will overload Spark.
Solution: Write out to a DB to handle load.
[Diagram labels: HDFS, DB]
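A minimal sketch of this pattern, assuming the heavy aggregation runs once as a batch step and the reporting layer only ever reads the precomputed result. Plain Python stands in for the batch job and sqlite3 for the reporting database; the event data, table name, and `total_for` helper are hypothetical.

```python
import sqlite3
from collections import Counter

# Hypothetical event data standing in for the input of a batch job.
events = [("us", 3), ("de", 1), ("us", 2), ("fr", 5), ("de", 4)]

# Batch step: aggregate once, outside the request path.
totals = Counter()
for country, amount in events:
    totals[country] += amount

# Write the small aggregated result to a database built for many lookups.
reporting_db = sqlite3.connect(":memory:")
reporting_db.execute(
    "CREATE TABLE report (country TEXT PRIMARY KEY, total INTEGER)"
)
reporting_db.executemany(
    "INSERT INTO report VALUES (?, ?)", sorted(totals.items())
)
reporting_db.commit()

def total_for(country):
    """Serve one report request from the database, not from the batch job."""
    row = reporting_db.execute(
        "SELECT total FROM report WHERE country = ?", (country,)
    ).fetchone()
    return row[0] if row else None

print(total_for("us"))
```

With Spark, the write step would typically go through the DataFrame writer's JDBC support instead of sqlite3.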
Good: Machine Learning & Data Science
Use MLlib, GraphX and Spark packages for machine
learning and data science.
Benefits:
• Built-in distributed algorithms.
• In-memory capabilities for iterative workloads.
• Data cleansing, featurization, training, testing, etc.
Bad: Searching Content w/ load
sqlContext.sql("select * from mytable " +
  "where name like '%xyz%'")
Spark will go through each row to find results.
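The same row-by-row behaviour can be observed in a small database, which may help explain why a substring search is expensive wherever it runs. This sketch uses sqlite3 only as an illustration, with a hypothetical table and index: even with an index on `name`, a pattern that starts with a wildcard cannot use it, while an exact match can.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (name TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?)",
    # Every 100th row contains the substring "xyz".
    [(f"name{i}xyz" if i % 100 == 0 else f"name{i}", "...") for i in range(1000)],
)
conn.execute("CREATE INDEX idx_mytable_name ON mytable (name)")

def plan(sql):
    """Return the planner's description of how it would run `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[-1] for row in rows]

substring_plan = plan("SELECT * FROM mytable WHERE name LIKE '%xyz%'")
exact_plan = plan("SELECT * FROM mytable WHERE name = 'name100xyz'")
matches = conn.execute(
    "SELECT count(*) FROM mytable WHERE name LIKE '%xyz%'"
).fetchone()[0]

print(substring_plan)
print(exact_plan)
print(matches)
```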
Thank you
