Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data Architecture

Spark Summit East 2016

About Me
• 2005 Mobile Web & Voice Search
• 2012 Reporting & Analytics
• 2014 Solutions Engineering

Is this your Spark infrastructure?
This system talks like a SQL database…
But the performance is very different…
(Diagram: SQL, HDFS)

Just in Time Data Warehouse w/ Spark
(Diagram: HDFS and more…)

Today’s Goal
Know when to use other data stores besides file systems

Good: General Purpose Processing
Types of Data Sets to Store in File Systems:
• Archival Data
• Unstructured Data
• Social Media and other web datasets
• Backup copies of data stores

Types of workloads
• Batch Workloads
• Ad Hoc Analysis
– Best Practice: Use in-memory caching
• Multi-step Pipelines
• Iterative Workloads

Benefits:
• Inexpensive Storage
• Incredibly flexible processing
• Speed and Scale

Bad: Random Access
sqlContext.sql(
  "select * from my_large_table where id=2134823")
Will this command run in Spark?
Yes, but it’s not very efficient: Spark may have to go through all your files to find your row.

Bad: Random Access
Solution: If you frequently randomly access your
data, use a database.
• For traditional SQL databases, create an index on your key column.
• Key-value NoSQL stores retrieve the value of a key efficiently out of the box.
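
The index advice above can be tried out with a minimal sketch using SQLite from the Python standard library. The table layout, the index name `idx_my_large_table_id`, and the helper `query_plan` are assumptions made for illustration; they are not from the talk. The sketch asks the database for its query plan before and after the index is created.

```python
import sqlite3

# Hypothetical table and columns, loosely mirroring the query on the slide.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_large_table (id INTEGER, payload TEXT)")
conn.executemany(
    "INSERT INTO my_large_table VALUES (?, ?)",
    ((i, f"row-{i}") for i in range(10_000)),
)

def query_plan(connection, sql):
    """Return the plan detail strings SQLite reports for a statement."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[-1] for row in rows]  # the last column holds the detail text

lookup = "SELECT * FROM my_large_table WHERE id = 1234"

# Without an index, the lookup has to scan the table.
plan_before = query_plan(conn, lookup)

# Index the key column, as the slide recommends.
conn.execute("CREATE INDEX idx_my_large_table_id ON my_large_table (id)")

# With the index, the lookup becomes an index search.
plan_after = query_plan(conn, lookup)

row = conn.execute(lookup).fetchone()
```

Comparing `plan_before` with `plan_after` shows the database switching from a scan to an index search for the same statement, which is the behavior a file-based table cannot offer for single-row lookups.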

Bad: Frequent Inserts
sqlContext.sql("insert into TABLE myTable select fields from my2ndTable")
Each insert creates a new file:
• Inserts are reasonably fast.
• But querying will be slow…

Bad: Frequent Inserts
Solution:
• Option 1: Use a database to support the inserts.
• Option 2: Routinely compact your Spark SQL table files.
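
To make the idea behind Option 2 concrete, here is a rough, Spark-free sketch in plain Python. It is not Spark's API: the helper `compact`, the `part-*.txt` and `compacted-*.txt` file names, and the use of line-oriented text files as stand-ins for table files are all assumptions chosen so the example runs anywhere.

```python
import tempfile
from pathlib import Path

def compact(table_dir, files_per_output=100):
    """Merge many small part files into fewer, larger ones.

    Hypothetical helper: line-oriented text files stand in for table files.
    """
    table_dir = Path(table_dir)
    small_files = sorted(table_dir.glob("part-*.txt"))
    for batch_no, start in enumerate(range(0, len(small_files), files_per_output)):
        batch = small_files[start:start + files_per_output]
        target = table_dir / f"compacted-{batch_no:05d}.txt"
        with target.open("w") as out:
            for part in batch:
                out.write(part.read_text())
        # Remove the small files only after the merged file is fully written.
        for part in batch:
            part.unlink()
    return sorted(table_dir.glob("compacted-*.txt"))

# Simulate 250 single-row inserts, each of which produced its own file.
table_dir = Path(tempfile.mkdtemp())
for i in range(250):
    (table_dir / f"part-{i:05d}.txt").write_text(f"{i},value-{i}\n")

compacted = compact(table_dir, files_per_output=100)
total_rows = sum(len(f.read_text().splitlines()) for f in compacted)
leftover_parts = list(table_dir.glob("part-*.txt"))
```

The same rows end up in far fewer files, so a later query has fewer files to open. A real Spark job would read and rewrite the table with its own readers and writers rather than concatenating text.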

Good: Data Transformation/ETL
Use Spark to slice and dice your data files any way you like.
File storage is cheap, so storing duplicate copies of your data is not an “anti-pattern”.

Bad: Frequent/Incremental Updates
Update statements are not supported yet.
Why not?
• Random Access: Locate the row(s) in the files.
• Delete & Insert: Delete the old row and insert a new one.
• Update: File formats aren’t optimized for updating rows.
Solution: Many databases support efficient update operations.

Bad: Frequent/Incremental Updates
Use Case: Up-to-date, live views of your SQL tables.
Tip: Use ClusterBy for fast joins or Bucketing with 2.0.
(Diagram: Database Snapshot + Incremental SQL Query)
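
One possible reading of the snapshot-plus-incremental diagram is that rows fetched by an incremental query are overlaid on an older snapshot, with the newest row per key winning. The sketch below illustrates that reading in plain Python. The function `live_view`, the `id` and `status` fields, and the sample rows are hypothetical, and the slide does not say how the two inputs are actually combined.

```python
def live_view(snapshot, incremental):
    """Overlay incremental rows on a snapshot, keyed by id.

    Both inputs are lists of dict rows; a later row for the same id wins.
    """
    merged = {row["id"]: row for row in snapshot}
    for row in incremental:
        merged[row["id"]] = row  # replaces an existing row or adds a new one
    return sorted(merged.values(), key=lambda r: r["id"])

# Hypothetical data: a snapshot taken earlier, plus rows changed since then.
snapshot = [
    {"id": 1, "status": "open"},
    {"id": 2, "status": "open"},
]
incremental = [
    {"id": 2, "status": "closed"},  # updated row
    {"id": 3, "status": "open"},    # newly inserted row
]

view = live_view(snapshot, incremental)
```

In Spark the equivalent step would be a join or union keyed on the same column, which is presumably where the clustering and bucketing tip comes in.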

Good: Connecting BI Tools
Tip: Cache your tables for optimal performance.

Bad: External Reporting w/ load
Too many concurrent requests will overload Spark.
Solution: Write out to a DB to handle load.
(Diagram: HDFS, DB)

Good: Machine Learning & Data Science
Use MLlib, GraphX and Spark packages for machine
learning and data science.
Benefits:
• Built-in distributed algorithms.
• In-memory capabilities for iterative workloads.
• Data cleansing, featurization, training, testing, etc.

Bad: Searching Content w/ load
sqlContext.sql("select * from mytable where name like '%xyz%'")
Spark will go through each row to find results.
