SPARK SUMMIT
EUROPE 2016
SPARK AND COUCHBASE
AUGMENTING THE OPERATIONAL DATABASE WITH SPARK
Michael Nitschinger
Couchbase
Spark Summit EU talk by Michael Nitschinger
WHY SPARK AND COUCHBASE
Overview & Use-Cases
Use Cases
Operations Analytics
CB
• Recommendations
• Predictive analytics
• Fraud detection
• Catalog
• Personalization
• Mobile applications
Use Case: Operationalize Analytics / ML
[Diagram labels: Hadoop, ML Model, Data Warehouse, Training Data, CB, Model, Online Data, Serving, Predictions]
Adapted from: Databricks – Not Your Father’s Database https://www.brighttalk.com/webcast/12891/196891
Use Case: Data Integration
RDBMS, S3, HDFS, ES
NoSQL
Standalone Deployment
Side-By-Side Deployment
ACCESS PATTERNS
From Spark to Couchbase and Back Again
• Key-Value: Fetch/Store by Document ID
• N1QL Query: Fetch by Criteria (“SQL”)
• Map-Reduce Views: Materialized Indexes (Aggregation)
• Streaming: Mutation Streams for Processing
• Full Text: Search on Freeform Text
Couchbase Data Partitioning
Data Locality
• RDD Location Hints based on the Cluster Map
• Not available for N1QL or Views
– Round robin - can’t give location hints
– Back end is scatter gather with 1 node responding
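The idea behind location hints can be sketched in a few lines of Python. This is an illustration only, not the connector's code: the partition count of 1024, the exact CRC32 bit manipulation, the round-robin cluster map, and the node names are all assumptions made for the example.

```python
import zlib
from collections import defaultdict

NUM_VBUCKETS = 1024  # assumed partition count for this sketch


def vbucket_for(key: str, num_vbuckets: int = NUM_VBUCKETS) -> int:
    """Map a document ID to a partition with a CRC32-based hash (assumed scheme)."""
    crc = zlib.crc32(key.encode("utf-8")) & 0xFFFFFFFF
    return ((crc >> 16) & 0x7FFF) % num_vbuckets


def location_hints(keys, cluster_map):
    """Group document IDs by the node that owns their partition.

    cluster_map is a list: index = partition id, value = owning node.
    """
    by_node = defaultdict(list)
    for key in keys:
        by_node[cluster_map[vbucket_for(key)]].append(key)
    return dict(by_node)


# Hypothetical three-node cluster; partitions are spread round-robin here.
nodes = ["node-a", "node-b", "node-c"]
cluster_map = [nodes[vb % len(nodes)] for vb in range(NUM_VBUCKETS)]

keys = ["airline_10", "airline_11", "airport_1254"]
hints = location_hints(keys, cluster_map)
for node, ids in hints.items():
    print(node, ids)
```

Each group of IDs could then become one RDD partition whose preferred location is the owning node, which is what makes key-value fetches local when Spark and Couchbase are deployed side by side.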
N1QL Query
• N1QL is a SQL service with JSON extensions
• Uses Couchbase’s Global Secondary Indexes
• Can run on any node within the cluster
• Nodes with differing services can be added and removed as needed on the fly
Couchbase Query Architecture
[Diagram labels: Data Service (Projector & Router; Bucket #1, Bucket #2); DCP Stream; Index Service (Supervisor; Index maintenance & Scan coordinator; Index #1, Index #2, Index #3, Index #4, ...; ForestDB Storage Engine); Query Service (Query Processor, cbq-engine)]
Spark SQL Sources
TableScan
Scan all of the data and return it.
PrunedScan
Scan only the columns that the query requires.
PrunedFilteredScan
Scan only the required columns and push the query's filters down, so that an index can match only the data relevant to the query at hand.
Predicate Conversion
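A minimal Python sketch of predicate conversion follows. It is not the connector's implementation: the `EqualTo` and `GreaterThan` classes are stand-ins modeled on Spark's data source filters, the helper names are hypothetical, and the string quoting rule is a simplifying assumption. It reproduces the generated query that appears in the connector log output in this deck.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class EqualTo:
    attribute: str
    value: Any


@dataclass(frozen=True)
class GreaterThan:
    attribute: str
    value: Any


def literal(value) -> str:
    """Render a Python value as a query literal (simplified quoting, an assumption)."""
    if isinstance(value, bool):
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float)):
        return str(value)
    return "'" + str(value).replace("'", "''") + "'"


def to_n1ql(f) -> str:
    """Convert one pushed-down filter into a WHERE-clause fragment."""
    if isinstance(f, EqualTo):
        return f"`{f.attribute}` = {literal(f.value)}"
    if isinstance(f, GreaterThan):
        return f"`{f.attribute}` > {literal(f.value)}"
    raise ValueError(f"unsupported filter: {f!r}")


def build_query(bucket, columns, filters) -> str:
    """Combine pruned columns and converted filters into one query string."""
    cols = ",".join(f"`{c}`" for c in columns) or f"`{bucket}`.*"
    query = f"SELECT {cols} FROM `{bucket}`"
    if filters:
        query += " WHERE " + " AND ".join(to_n1ql(f) for f in filters)
    return query


query = build_query("travel-sample", ["name", "callsign"], [EqualTo("type", "airline")])
print(query)
```

Filters the source cannot convert would normally be left for Spark to evaluate after the scan; this sketch simply raises an error for them.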
Schema Inference
N1QLRelation:28 - Inferring schema from bucket travel-sample with query 'SELECT META(`travel-sample`).id as `META_ID`, `travel-sample`.* FROM `travel-sample` WHERE `type` = 'airline' LIMIT 1000'
N1QLRelation:28 - Executing generated query: 'SELECT `name`,`callsign` FROM `travel-sample` WHERE `type` = 'airline''
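The log lines above show the connector sampling up to 1000 documents before it runs the real query. How a schema might be derived from such a sample can be sketched as follows. The type names, the conflict rule (widen long to double, otherwise fall back to string), and the sample documents are all assumptions for illustration, not the connector's actual behavior.

```python
import json


def infer_type(value) -> str:
    """Map a JSON value to a coarse column type."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    if isinstance(value, dict):
        return "struct"
    return "null"


def merge(a: str, b: str) -> str:
    """Reconcile two observed types for the same field."""
    if a == b or b == "null":
        return a
    if a == "null":
        return b
    if {a, b} == {"long", "double"}:
        return "double"
    return "string"  # assumed fallback when types conflict


def infer_schema(sample_docs) -> dict:
    """Union the fields of all sampled documents into one schema."""
    schema = {}
    for doc in sample_docs:
        for field, value in doc.items():
            t = infer_type(value)
            schema[field] = merge(schema[field], t) if field in schema else t
    return schema


# Hypothetical sampled rows, shaped like the result of the inference query.
sample = [json.loads(s) for s in (
    '{"META_ID": "airline_1", "name": "Example Air", "callsign": "EXAMPLE", "id": 1}',
    '{"META_ID": "airline_2", "name": "Sample Jet", "callsign": null, "id": 2.0}',
)]
schema = infer_schema(sample)
print(schema)
```

Because the schema comes from a sample, fields that appear only in documents outside the sample would be missed, which is why narrowing the sample with a filter such as `type` = 'airline' matters.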
DCP and Spark Streaming
[Diagram labels: Replica, Indexing, …]
Structured Streaming Source
Adapted from https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
DCP Stream Unbounded Table
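A rough Python sketch of treating a mutation stream as an unbounded, append-only table is shown below. The `DcpEvent` shape, the single global sequence number, and the helper names are assumptions for illustration; real DCP events carry more metadata and are ordered per partition.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DcpEvent:
    seqno: int                # assumed global ordering for this sketch
    key: str
    content: Optional[dict]   # None marks a deletion


def append_batch(table: list, events) -> list:
    """Append every event as a new row; existing rows are never updated."""
    for e in events:
        table.append({
            "seqno": e.seqno,
            "key": e.key,
            "deleted": e.content is None,
            "content": e.content,
        })
    return table


def current_state(table: list) -> dict:
    """Fold the appended rows into the latest document per key."""
    state = {}
    for row in sorted(table, key=lambda r: r["seqno"]):
        if row["deleted"]:
            state.pop(row["key"], None)
        else:
            state[row["key"]] = row["content"]
    return state


table = []
# First micro-batch: two documents are created.
append_batch(table, [DcpEvent(1, "user::1", {"name": "a"}),
                     DcpEvent(2, "user::2", {"name": "b"})])
# Second micro-batch: one update and one deletion arrive.
append_batch(table, [DcpEvent(3, "user::1", {"name": "a2"}),
                     DcpEvent(4, "user::2", None)])

state = current_state(table)
print(len(table), state)
```

The table only ever grows, while a query over it (here `current_state`) can be re-evaluated after each batch, which is the model that Structured Streaming applies to its sources.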
(Un)Structured Streaming?
Structured Streaming Source
Structured Streaming Sink
Couchbase Spark Connector 1.2.1
• Spark 1.6.x support, including Datasets
• DCP Flow Control
• Enhanced Java APIs
Couchbase Spark Connector 2.0.0
• Spark 2.0.x Support
• Enhanced DCP Client
• Experimental Structured Streaming
Resources
• Spark Packages https://spark-packages.org/package/couchbase/couchbase-spark-connector
• Docs http://docs.couchbase.com
• Source https://github.com/couchbase/couchbase-spark-connector
• Bugs https://issues.couchbase.com/browse/SPARKC
THANK YOU.
Michael Nitschinger
@daschl
michael.nitschinger@couchbase.com
