Mladen Kovacevic, Senior Solutions Architect
Cloudera Inc.
Storage Engine Considerations for your Apache Spark Applications
#EUdev10
Outline
• Motivation – store your data – where exactly?
• Storage Capabilities:
– HDFS
– HBase
– Kudu
– Solr
• Asking the right questions
• Decide on the right storage solution
Motivation
• Spark, SparkStreaming, SparkSQL – great for processing, but they need a place to store content
• Integration with a variety of storage systems
• Ingest and consumption requirements depend on the use case!
Design patterns
Choosing the right storage for the use case
[Figure: diagram labeled with the years 2006, 2007, 2016 and 2008]
HDFS
• Distributed file system – cheap, scalable storage
• Immutable – “record” changes are painful
• Columnar file formats – ideal for analytics
• SQL overlays (SparkSQL, Hive Metastore, more) to define schema
Highlights
Very high throughput, painful random IO, batch oriented, coding overhead (i.e. dealing with the small-files problem), stores any file type
HDFS design pattern
df.write.parquet("/data/person_table")
• Small files accumulate
• External processes or additional application logic are needed to manage these files
• Partition management
• Manage metadata carefully (depends on ecosystem)
• Considerations: changing dimensions (fast/slow)
• Late-arriving data
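One way to handle the small-files item above is a periodic compaction job that rewrites a partition into fewer, larger files. The sketch below only computes how many output files to aim for. The 128 MB target and the names `SmallFiles` and `compactedFileCount` are assumptions for illustration and do not come from the talk.

```scala
// Hypothetical helper: how many files should a compaction job write?
object SmallFiles {
  // Assumed target size per output file (128 MB); tune for your cluster.
  val DefaultTargetBytes: Long = 128L * 1024 * 1024

  /** Number of output files needed to hold totalBytes at roughly targetBytes each (at least 1). */
  def compactedFileCount(totalBytes: Long, targetBytes: Long = DefaultTargetBytes): Int = {
    require(totalBytes >= 0, "totalBytes must be non-negative")
    require(targetBytes > 0, "targetBytes must be positive")
    // Ceiling division without floating point.
    val files = (totalBytes + targetBytes - 1) / targetBytes
    math.max(1L, files).toInt
  }
}
```

In a Spark job the result would typically be passed to `df.coalesce(n)` or `df.repartition(n)` before the partition is written back with `write.parquet(...)`.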
HBase
• NoSQL engine, manages files on HDFS
• Key-value, distributed storage engine
• No data types – just binary fields
• Thousands to millions of columns
• Store entity data (profiles of people, devices, accounts)
Highlights
Very fast random IO, low throughput, NRT oriented,
challenging BI, no strict data types
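Because HBase stores only byte arrays, the application decides how each value is encoded and must decode it the same way on read. A minimal sketch using only `java.nio` follows. The `Codec` object is hypothetical; HBase ships its own `Bytes` utility class, which is deliberately not used here so the example has no dependencies.

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets

// Hypothetical codec: the store sees only Array[Byte]; the types live in the application.
object Codec {
  // Encode a Long as 8 big-endian bytes (ByteBuffer's default byte order).
  def encodeLong(v: Long): Array[Byte] =
    ByteBuffer.allocate(java.lang.Long.BYTES).putLong(v).array()

  // Decode the first 8 bytes back into a Long.
  def decodeLong(b: Array[Byte]): Long =
    ByteBuffer.wrap(b).getLong()

  // Strings are stored as UTF-8 bytes.
  def encodeString(s: String): Array[Byte] =
    s.getBytes(StandardCharsets.UTF_8)

  def decodeString(b: Array[Byte]): String =
    new String(b, StandardCharsets.UTF_8)
}
```

Nothing in the stored bytes records which codec was used, which is one reason BI tools that expect typed columns find HBase challenging.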
HBase design pattern
• HBase Connection anywhere in a Spark/SparkStreaming app
• SparkSQL/DataFrames, Bulk Load
• Primary storage for ingestion, or complementary storage for preserving state
• NoSQL store vs. structured
• Near-real-time
CDH hbase-spark: https://github.com/cloudera/hbase/tree/cdh5-1.2.0_5.13.0/hbase-spark
CDH HBase and Spark docs: http://archive.cloudera.com/cdh5/cdh/5/hbase-1.2.0-cdh5.13.0/book.html#spark
Upstream hbase-spark (watch for updates in HBase 2.x release): https://github.com/apache/hbase/tree/master/hbase-spark/
Upstream HBase and Spark book: http://hbase.apache.org/book.html#spark
Analytic Gap
Kudu
• Storage system for tables of structured data
• Bring-your-own-SQL (SparkSQL, Impala), NoSQL-like API, integration with Spark, MapReduce, and more
• Columnar, key partitioning by range and/or hash
• Limited number of columns (strongly typed)
Highlights
Fast random IO, fast throughput, NRT oriented, terrific for
BI, structured data
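To illustrate how hash and range partitioning combine, the sketch below maps a row key to a (hash bucket, range partition) pair. It is a toy model under assumed names (`Partitioning`, `locate`, a `deviceId`/`day` key). It uses `String.hashCode`, not Kudu's actual hash function, and says nothing about how Kudu lays out tablets internally.

```scala
// Toy model of hash + range partitioning; not Kudu's real implementation.
object Partitioning {
  /**
   * @param deviceId    hashed key column (hypothetical)
   * @param day         range key column, e.g. days since epoch (hypothetical)
   * @param hashBuckets number of hash buckets
   * @param rangeBounds sorted, exclusive upper bounds that split the range key
   * @return (hash bucket, range partition index)
   */
  def locate(deviceId: String, day: Int, hashBuckets: Int, rangeBounds: Seq[Int]): (Int, Int) = {
    require(hashBuckets > 0, "hashBuckets must be positive")
    // floorMod keeps the bucket non-negative even for negative hash codes.
    val bucket = Math.floorMod(deviceId.hashCode, hashBuckets)
    // A row falls in the first range whose upper bound is greater than its key.
    val range = rangeBounds.count(_ <= day)
    (bucket, range)
  }
}
```

The general idea is that hashing spreads concurrent writes across buckets, while the range component keeps rows with nearby key values together so that scans over a key interval can skip unrelated partitions.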
Kudu design pattern
df.write.options(kuduOptions).mode("append").kudu
OR
kuduContext.insertRows(df, tableName)
• DataFrame is a perfect match for Kudu (structured)
• Data available immediately to SQL engines (Impala, SparkSQL)
• Ideal case is append with moderate updates
Kudu Integration with Spark: http://kudu.apache.org/docs/developing.html#_kudu_integration_with_spark
Up and running with Apache Spark on Apache Kudu: https://blog.cloudera.com/blog/2017/02/up-and-running-with-apache-spark-on-apache-kudu/
Analytic Gap Filled
Solr
• Distributed index enabling search capabilities (Lucene)
• Typed, REST API based, search index query processing
• Search interface, faceting, integration with HBase, storing content (typically) in HDFS
Highlights
High random IO, low throughput, multi-faceted use cases, NRT oriented, terrific for BI with the right tools (non-SQL), loose schema, data types
Solr design pattern
• Prepare a Solr document and add it to SolrCloud directly, OR
• Write to HBase and leverage the Lily HBase Indexer service to update Solr
• Store the complete record in HBase, and only the fields indexed for search in Solr
• NRT availability (short soft commits)
Questions we ask (1)
• How many voters have cast their ballots by city thus far in the election, by the second?
– streaming data into ‘voter’ table, aggregate query, immediate data availability: Kudu
• How many people watched last night's game compared to the night before?
– daily batch, aggregate query: HDFS Parquet
Questions we ask (2)
• What version is my device running and how many dropped packets do I have?
– streaming entity profile data, metrics may change per release, many updates, specific device, NRT: HBase
• Which tweets talk about the housing market, in the 21-30 age group?
– streaming, keyword search, facet filtering: Solr
Use case questionnaire
• Consumption interface: SQL (JDBC/ODBC) vs. API
• Near-real-time requirement for consumers
• Ingestion rate (can we keep up?)
• Entity vs. Events (time-based)
• Append-only vs. moderate updates vs. many updates
• Distinct values in dataset
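As a rough illustration, the questionnaire can be encoded as a small decision function. The sketch below reproduces only the four example answers from the "Questions we ask" slides. The type names, the choice of four boolean criteria, and the order of the rules are all assumptions, and a real decision would weigh more criteria than these.

```scala
// Hypothetical encoding of the use-case questionnaire; a simplification, not a rule book.
object StorageChoice {
  sealed trait Store
  case object HdfsParquet extends Store
  case object HBase extends Store
  case object Kudu extends Store
  case object Solr extends Store

  final case class UseCase(
    needsKeywordSearch: Boolean, // keyword search or facet filtering
    nearRealTime: Boolean,       // consumers need data shortly after ingest
    manyUpdates: Boolean,        // entity profiles that are updated frequently
    aggregateQueries: Boolean    // SQL-style aggregates over many rows
  )

  // Rules are checked top to bottom; batch-oriented Parquet on HDFS is the fallback.
  def choose(u: UseCase): Store =
    if (u.needsKeywordSearch) Solr
    else if (u.nearRealTime && u.manyUpdates) HBase
    else if (u.nearRealTime && u.aggregateQueries) Kudu
    else HdfsParquet
}
```

For example, the per-second voter count (streaming, aggregate query, immediate availability) maps to `UseCase(false, true, false, true)`, which this sketch resolves to `Kudu`.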
Storage considerations (1)
Criteria
SQL interface
API interface
Near-real-time ingestion
Append-only + available for query
Appends with moderate updates
Mostly updates
Storage considerations (2)
Criteria
Entity based data
Event based data (time-series)
High distinct values
Many and unknown attributes
Binary data (images, PDFs, etc.)
Analytics
Wrap-up
• Review the entire use case end-to-end early
• Understand storage capabilities
• Ask the right questions (upstream/consumers)
• Consider security, architecture and development costs
• Decide on the right storage solution
