Phoenix Hadoop Users Group
Couchbase Data Pipeline
Agenda
• Couchbase Technology
• What is Couchbase
• Couchbase and Hadoop Ecosystem
• Architecture (Node/SDK/Cluster)
• Couchbase at PayPal
• Couchbase Deployment
• Use Case Overview
• Kafka Connector Demo
What is Couchbase?
Couchbase Server: high availability cache, key-value store, document database
Couchbase Lite: embedded database
Couchbase Sync Gateway: sync management
Data management for a broad range of use cases
Couchbase Tenets
Flexible data model
Consistent performance at scale
High availability
Easy, affordable scalability
24x365
Couchbase and Hadoop Ecosystem
Couchbase View
Overlap: NoSQL or Hadoop?
Complement: NoSQL and Hadoop.
Couchbase View
Columns: Couchbase, Spark, Hadoop (Hive), running from OPERATIONAL to ANALYTICAL.
Use cases
• Couchbase: Operational; Web / Mobile
• Spark: Analytics; Machine Learning
• Hadoop (Hive): Analytics; Machine Learning
Processing mode
• Couchbase: Online; Ad Hoc (New!)
• Spark: Streaming; Ad Hoc; Batch
• Hadoop (Hive): Batch; Ad Hoc
Low latency =
• Couchbase: < 1ms ops
• Spark: Seconds
• Hadoop (Hive): Minutes
Users are typically
• Couchbase: Millions of customers
• Spark: 100’s of analysts
• Hadoop (Hive): 100’s of analysts
Big data =
• Couchbase: 10s of Terabytes
• Spark: Petabytes(?)
• Hadoop (Hive): Petabytes
Lambda Architecture
[Diagram. Numbered stages: 1 DATA (New Data Stream), 2 BATCH, 3 SPEED, 4 SERVE, 5 QUERY (Analysis). Batch Layer: All Data, Precompute Views (Map Reduce), Batch Recompute. Speed Layer: Process Stream, Incremental Views, Real-Time Increment. Serving Layer.]
Couchbase and Hadoop
[Diagram. Batch Layer: All Data, Precompute Views (Map Reduce), Batch Recompute, Batch Views. Speed Layer: New Data Stream, Process Stream, Incremental Views, Real-Time Increment, Real-Time Data, Real-Time Views. Serving Layer: Partial Aggregates, Merge, Merged View. Connector shown: Couchbase Hadoop Connector (Sqoop).]
Couchbase and Hadoop
[Diagram. Batch Layer, Speed Layer, and Serving Layer as on the Lambda Architecture slide, with Batch Views and Real-Time Views merged into a Merged View, and the Couchbase Hadoop Connector (Sqoop) shown. Annotated roles: Stream / Data Ingestion; Store Incremental Data / Stream processing; Serving merged results / responses.]
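The "Merge" step in the diagram can be illustrated with a minimal sketch. It assumes both views are simple per-key counts; the function name `merge_views` and the sample data are hypothetical and are not part of Couchbase, Hadoop, or the connector.

```python
from collections import Counter


def merge_views(batch_view, realtime_view):
    """Combine a precomputed batch view with the incremental real-time view.

    Assumption: both views map a key to a count, so merging is addition.
    """
    merged = Counter(batch_view)
    merged.update(realtime_view)  # Counter.update adds counts per key
    return dict(merged)


# Batch view recomputed from all data; real-time view covers events since then.
batch = {"page_a": 100, "page_b": 40}
realtime = {"page_a": 3, "page_c": 1}
merged = merge_views(batch, realtime)
```

In a deployment along the lines of the slide, the serving layer would answer queries from the merged result, so readers see recent increments without waiting for the next batch recompute.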
Couchbase Connectors
[Diagram: App, ETL, BI, and Visualization tools connect through xDBC to the CB Nodes.]
Integrations, partnerships
[Diagram labels: Complex Event Processing, Real Time Repository, Perpetual Store, Analytical DB, Business Intelligence, Monitoring, Chat/Voice System, Batch Track, Real-Time Track, Dashboard.]
Architecture
Couchbase Node
Couchbase Server Node
Single-node type means easier administration and scaling
• Single installation
• Two major components/processes: data manager and cluster manager
• Data manager:
  • C/C++
  • Layer consolidation of caching and persistence
• Cluster manager:
  • Erlang/OTP
  • Administration UIs
  • Out-of-band for data requests
Couchbase Read Operation
[Diagram: the application server issues GET DOC 1 and the managed cache returns DOC 1. Node components shown: managed cache, disk queue, disk, replication queue.]
• Reads out of cache are extremely fast
• No other process/system to communicate with
• Data connection is a binary protocol over TCP
Couchbase Write Operation
[Diagram: the application server writes DOC 1 to the managed cache; copies of DOC 1 appear on the replication queue and the disk queue. Node components shown: managed cache, disk queue, disk, replication queue.]
• Writes are async by default
• Application gets an acknowledgement once the write is successfully in RAM, and can trade off waiting for replication or persistence per write
• Replication to 1, 2 or 3 other nodes
• Replication is RAM-based so extremely fast
• Off-node replication is the primary level of HA
• Disk is written to as fast as possible, with no waiting
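The write path above can be sketched as a toy model. This is an illustration under assumptions, not the Couchbase API: `ToyNode`, `replicate_to`, and `persist_to` are hypothetical names, and the queues are drained synchronously only to show what a caller would be waiting for.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNode:
    """Hypothetical in-memory model of one node's write path."""
    ram: dict = field(default_factory=dict)
    disk: dict = field(default_factory=dict)
    disk_queue: list = field(default_factory=list)
    replication_queue: list = field(default_factory=list)
    replicas: list = field(default_factory=list)  # other ToyNode objects

    def write(self, key, value, replicate_to=0, persist_to=0):
        if replicate_to > len(self.replicas):
            raise ValueError("not enough replica nodes")
        # The write lands in RAM and is queued; by default nothing else is awaited.
        self.ram[key] = value
        self.disk_queue.append(key)
        self.replication_queue.append(key)
        # Optional per-write trade-off: wait for replication and/or persistence.
        if replicate_to:
            self.drain_replication()
        if persist_to:
            self.drain_disk()
        return "ack"

    def drain_replication(self):
        # RAM-to-RAM copy to every replica node.
        for key in self.replication_queue:
            for replica in self.replicas:
                replica.ram[key] = self.ram[key]
        self.replication_queue.clear()

    def drain_disk(self):
        for key in self.disk_queue:
            self.disk[key] = self.ram[key]
        self.disk_queue.clear()
```

A default `write()` returns as soon as the value is in RAM; passing `replicate_to=1` or `persist_to=1` models the caller choosing to wait longer in exchange for stronger durability on that one write.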
Couchbase Cache Ejection
[Diagram: DOC 1 through DOC 5 are shown in the managed cache and on disk. Node components shown: managed cache, disk queue, disk, replication queue.]
• Layer consolidation means read-through and write-through cache
• Couchbase automatically removes from RAM data that has already been persisted
Couchbase Cache Miss
[Diagram: the application server issues GET DOC 1 while DOC 2 through DOC 5 are in the managed cache; DOC 1 is read from disk and returned. Node components shown: managed cache, disk queue, disk, replication queue.]
• Layer consolidation means a single interface for the app to talk to and get its data back as fast as possible
• Separation of cache and disk allows for fastest access out of RAM while pulling data from disk in parallel
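Ejection and cache misses can be sketched together in a small toy cache. This is a simplified model under assumptions: `ToyManagedCache` is a hypothetical class, ejection is oldest-first, and only entries already persisted are eligible, which mirrors the slide's statement but not the server's actual eviction policy.

```python
class ToyManagedCache:
    """Hypothetical read-through cache over a dict standing in for disk."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.ram = {}        # dicts keep insertion order: oldest entries first
        self.disk = {}
        self.dirty = set()   # keys in RAM that are not yet persisted

    def set(self, key, value):
        self.ram.pop(key, None)
        self.ram[key] = value
        self.dirty.add(key)
        self._eject_if_needed()

    def flush(self):
        # Persist everything that is waiting, then free RAM if over capacity.
        for key in self.dirty:
            self.disk[key] = self.ram[key]
        self.dirty.clear()
        self._eject_if_needed()

    def get(self, key):
        if key in self.ram:
            return self.ram[key], "hit"
        value = self.disk[key]   # cache miss: fetch from disk (KeyError if unknown)
        self.ram[key] = value
        self._eject_if_needed()
        return value, "miss"

    def _eject_if_needed(self):
        # Only persisted (clean) entries may be removed from RAM.
        for key in list(self.ram):
            if len(self.ram) <= self.capacity:
                break
            if key not in self.dirty:
                del self.ram[key]
```

The application only ever calls `get` and `set`; whether the value came from RAM or disk is invisible to it apart from latency, which is the point of the consolidated layer.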
Architecture
Couchbase SDK
• Documents are integral to the SDKs.
• All SDKs support JSON format
• In addition: serialized objects, unquoted strings, binary pass-through
• A Document contains:
Property   Description
ID         The bucket-unique identifier
Content    The value that is stored
Expiry     An expiration time
CAS        Check-and-Set identifier
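The four document properties, and how a CAS value guards against lost updates, can be sketched as follows. This is a toy model: `Document`, `ToyBucket`, and `CasMismatch` are hypothetical names for illustration and do not reproduce any SDK's classes or signatures.

```python
import itertools
from dataclasses import dataclass
from typing import Any


@dataclass
class Document:
    id: str          # bucket-unique identifier
    content: Any     # the stored value
    expiry: int = 0  # expiration time (0 meaning "never" is an assumption here)
    cas: int = 0     # check-and-set identifier, changes on every mutation


class CasMismatch(Exception):
    """Raised when the caller's CAS no longer matches the stored document."""


class ToyBucket:
    def __init__(self):
        self._docs = {}
        self._cas_counter = itertools.count(1)

    def upsert(self, doc_id, content, expiry=0):
        doc = Document(doc_id, content, expiry, next(self._cas_counter))
        self._docs[doc_id] = doc
        return doc

    def get(self, doc_id):
        d = self._docs[doc_id]
        return Document(d.id, d.content, d.expiry, d.cas)  # return a copy

    def replace(self, doc_id, content, cas):
        current = self._docs[doc_id]
        if current.cas != cas:
            raise CasMismatch(doc_id)
        return self.upsert(doc_id, content, current.expiry)
```

Two clients that read the same document both hold the same CAS; the first `replace` succeeds and changes the CAS, so the second fails and must re-read before retrying.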
Couchbase SDK
What does it mean to be a Couchbase SDK?
Cluster
Function: Hold on to cluster information such as topology.
API: Reference Cluster Management, openBucket(), info(), disconnect()
Bucket
Function: Manage connections to the bucket within the cluster for different services. Provide a core layer where IO can be managed and optimized. Provide a way to manage buckets.
API: insertDesignDocument(), flush(), listDesignDocuments()
CRUD
Function: Give the application developer a concurrent API for basic (k-v) or document management.
API: get(), insert(), upsert(), remove()
N1QL Query
Function: Allow for querying, execution of other directives such as defining indexes and checking on index state.
API: abucket.NewN1QLQuery("SELECT * FROM default LIMIT 5").Consistency(gocouchbase.RequestPlus);
View Query
Function: Allow for view querying, building of queries and reasonable error handling from the cluster.
API: abucket.NewViewQuery().Limit().Stale()
Couchbase SDK
Official SDKs
• Java
• .NET
• Node.js
• Python
• PHP
• C / C++
• Go
• Ruby
For each of these we have
• Full Document support
• Interoperability
• Common yet idiomatic programming model
Others: Erlang, Perl, TCL, Clojure, Scala
JDBC and ODBC
Architecture
Couchbase Cluster: Node and SDK Interaction
Basic Operation
[Diagram: Couchbase Server 1, 2, and 3 each hold ACTIVE shards and REPLICA shards; shards 1 through 9 appear once in the ACTIVE row and once in the REPLICA row across the three servers.]
Application has single logical connection to cluster (client object)
• Data is automatically sharded, resulting in even document data distribution across the cluster
• Each vBucket is replicated 1, 2 or 3 times (“peer-to-peer” replication)
• Docs are automatically hashed by the client to a shard
• Cluster map provides the location of the server a shard is on
• Every read/write/update/delete goes to the same node for a given key
• Strongly consistent data access (“read your own writes”)
• A single Couchbase node can achieve 100k’s of ops/sec, so no need to scale reads
Auto sharding – Bucket and vBuckets
[Diagram: data buckets map to virtual buckets (vB) 1 ... 1024.]
• A bucket is a logical, unique key space
• Multiple buckets can exist within a single cluster of nodes
• Each bucket has active and replica data sets (1, 2 or 3 extra copies)
• Each data set has 1024 virtual buckets (vBuckets)
• Each vBucket contains 1/1024th of the data set
• vBuckets do not have a fixed physical server location
• The mapping between the vBuckets and physical servers is called the cluster map
• Document IDs (keys) always get hashed to the same vBucket
• Couchbase SDKs look up the vBucket -> server mapping
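The two-step lookup described above (key to vBucket, then vBucket to server) can be sketched like this. It is an illustration under assumptions: the slides only say that keys are hashed, so the CRC32-based bit selection here is an assumed scheme, and `build_cluster_map` uses a hypothetical round-robin layout rather than the cluster manager's real placement.

```python
import zlib

NUM_VBUCKETS = 1024


def vbucket_for_key(key: str) -> int:
    """Hash a document ID to a vBucket number in [0, 1024).

    Assumption: CRC32 with the upper bits folded into the vBucket range.
    """
    crc = zlib.crc32(key.encode("utf-8"))
    return ((crc >> 16) & 0x7FFF) % NUM_VBUCKETS


def build_cluster_map(servers):
    """Hypothetical cluster map: vBucket index -> server, assigned round-robin."""
    return [servers[vb % len(servers)] for vb in range(NUM_VBUCKETS)]


def server_for_key(key: str, cluster_map) -> str:
    """What a client does per operation: hash locally, then consult the map."""
    return cluster_map[vbucket_for_key(key)]


cluster_map = build_cluster_map(["server1", "server2", "server3"])
owner = server_for_key("user::1234", cluster_map)
```

Because the key-to-vBucket step depends only on the key, adding servers changes the map but never the vBucket a key belongs to, which is why clients only need an updated cluster map after a topology change.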
Cluster Map
Cluster Map – 2 nodes added
Rebalance
[Diagram: Couchbase Server 4 and Couchbase Server 5 join Servers 1 through 3; ACTIVE and REPLICA shards are redistributed across all five servers while the application continues READ/WRITE/UPDATE traffic.]
Application has single logical connection to cluster (client object)
• Multiple nodes added or removed at once
• One-click operation
• Incremental movement of active and replica vBuckets and data
• Client library updated via cluster map
• Fully online operation, no downtime or loss of performance
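One way to picture a rebalance is as computing a new cluster map that evens out vBucket ownership while moving as few vBuckets as possible. The sketch below is an assumed, simplified planner (active vBuckets only, hypothetical function name `rebalance`); it is not the algorithm the cluster manager uses.

```python
from collections import Counter


def rebalance(cluster_map, servers):
    """Return a new vBucket -> server list with near-equal shares per server.

    vBuckets stay where they are unless their server is over quota or gone.
    """
    base, extra = divmod(len(cluster_map), len(servers))
    quota = {s: base + (1 if i < extra else 0) for i, s in enumerate(servers)}

    new_map = list(cluster_map)
    kept = Counter()
    movers = []
    for vb, server in enumerate(cluster_map):
        if server in quota and kept[server] < quota[server]:
            kept[server] += 1          # stays on its current server
        else:
            movers.append(vb)          # server is over quota or was removed

    # Hand each moving vBucket to a server that still has room.
    open_slots = [s for s in servers for _ in range(quota[s] - kept[s])]
    for vb, server in zip(movers, open_slots):
        new_map[vb] = server
    return new_map
```

Starting from 1024 vBuckets on three servers and adding two more, this plan moves only the vBuckets needed to fill the new servers; in the real system each move is an incremental data transfer and clients pick up the new map as it changes.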
Fail Over Node
[Diagram: five Couchbase Servers, each with ACTIVE and REPLICA shards; after a node fails, replica copies of its shards on the surviving servers take over as active.]
Application has single logical connection to cluster (client object)
• When a node goes down, some requests will fail
• Failover is either automatic or manual
• Client library is automatically updated via cluster map
• Replicas are not recreated, to preserve stability
• Best practice is to replace the node and rebalance
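The effect of a failover on the cluster map can be sketched as promoting replicas without creating new ones. This is a toy model assuming one replica per vBucket; `fail_over` and the list-based maps are hypothetical and only illustrate the bullets above.

```python
def fail_over(active_map, replica_map, failed):
    """Promote replicas for vBuckets whose active copy was on the failed server.

    Replicas are not recreated: a vBucket that loses a copy is left with
    replica None until the node is replaced and the cluster is rebalanced.
    """
    new_active, new_replica = [], []
    for active, replica in zip(active_map, replica_map):
        if active == failed:
            if replica is None or replica == failed:
                raise RuntimeError("no surviving copy of this vBucket")
            new_active.append(replica)   # replica takes over as active
            new_replica.append(None)
        elif replica == failed:
            new_active.append(active)
            new_replica.append(None)     # replica lost, not recreated
        else:
            new_active.append(active)
            new_replica.append(replica)
    return new_active, new_replica
```

After this step every vBucket is served again, but some have no replica, which is why the slide recommends replacing the node and rebalancing.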
Couchbase at PayPal
Kafka Integration
Footprint Overview
• Seven use cases (more going live at a later date)
• Each cluster has 10 to 20 nodes
• Three data center locations per use case
Global Cookie Service
• Three clusters (two handle traffic, one for DR)
• Bi-directional replication
• Billions of documents
• TB of data (maximum of 10 over time)
Challenge
• Data analytics
Couchbase at PayPal
Couchbase Solution
• Couchbase Server deployed to capture and serve global cookies
• Integrates with Hadoop to pass data for additional offline analytics via Kafka
Results
• Consistent low latency
  • SLA 10ms application
  • SLA 1ms Couchbase
• High availability enabled by distributed cache and data center replication
• Kafka integration for analytics within the Hadoop cluster
© 2015 PayPal Inc. All rights reserved. Confidential and proprietary.
Requirements for Database
Data volume / Scalability
• Online system; >1B documents
• 4-10k size; 5-10TB total storage
• Linearly scalable
Availability
• Multi data center – DR
• Availability requirement of 99.99%
Data Structure
• Flexible & schema-less; document based
Performance
• 50% read / 50% write
• Low latency < 10 msec (5)
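A quick back-of-the-envelope check relates these requirements to each other. It assumes "4-10k size" means 4 to 10 KB per document, uses decimal units, ignores replicas and metadata, and treats 1B as the document count even though the slide says more than 1B; the helper names are hypothetical.

```python
DOCS = 1_000_000_000          # ">1B documents", taken at the lower bound
BYTES_PER_KB = 1_000
BYTES_PER_TB = 1_000_000_000_000


def storage_tb(docs, kb_per_doc):
    """Raw data size in TB for a given document count and average size."""
    return docs * kb_per_doc * BYTES_PER_KB / BYTES_PER_TB


def downtime_minutes_per_year(availability_percent):
    """Downtime budget implied by an availability target."""
    minutes_per_year = 365 * 24 * 60
    return minutes_per_year * (100 - availability_percent) / 100


low_tb = storage_tb(DOCS, 4)
high_tb = storage_tb(DOCS, 10)
budget = downtime_minutes_per_year(99.99)
print(f"raw data: {low_tb:.0f} to {high_tb:.0f} TB")
print(f"downtime budget: {budget:.1f} minutes per year")
```

Under these assumptions the raw data comes to roughly 4 to 10 TB, in line with the stated 5-10TB, and 99.99% availability leaves a budget of under an hour of downtime per year.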
Couchbase TAP
• Snapshot entire database
• Export future mutations
• TAP observes data changes in the memcached server
Kafka
• A high-throughput distributed messaging system
Couchbase Kafka Adapter
• Based on Couchbase TAP and the Kafka producer
Kafka Producer
• Fast
• Scalable
• Durable
• Distributed
https://github.com/paypal/couchbasekafka
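The adapter's shape (read a snapshot plus ongoing mutations, forward each one to a producer) can be sketched without either real system. Everything here is a stand-in: `tap_stream`, `ListProducer`, `run_adapter`, and the topic name are hypothetical, and the code does not use the TAP protocol, a Kafka client library, or the PayPal adapter's code.

```python
import json
from typing import Any, Dict, Iterable, Iterator, Tuple

Mutation = Tuple[str, Any]


def tap_stream(snapshot: Dict[str, Any],
               future_mutations: Iterable[Mutation]) -> Iterator[Mutation]:
    """Stand-in for a TAP stream: existing documents first, then new changes."""
    for key, value in snapshot.items():
        yield key, value
    yield from future_mutations


class ListProducer:
    """Stand-in for a Kafka producer; it just records what would be sent."""

    def __init__(self):
        self.sent = []

    def send(self, topic: str, key: bytes, value: bytes) -> None:
        self.sent.append((topic, key, value))


def run_adapter(stream: Iterable[Mutation], producer, topic: str) -> int:
    """Forward every mutation as one message keyed by document ID."""
    count = 0
    for key, value in stream:
        producer.send(topic, key.encode("utf-8"), json.dumps(value).encode("utf-8"))
        count += 1
    return count


producer = ListProducer()
stream = tap_stream({"cookie::1": {"seen": 1}}, [("cookie::2", {"seen": 1})])
forwarded = run_adapter(stream, producer, "cookie-mutations")
```

Keying each message by document ID is a design choice made for this sketch; it would keep all changes to one document in order within a single partition if a real producer partitioned by key.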
Stream data out of database
[Diagram with steps [1] through [7]. Labels: TAP Stream; Couchbase Kafka Adapter {TAPClient + Kafka Producer}; Camus, MR Jobs.]
Deployment Model
[Diagram: three Cookie App instances with Write and Read paths to Couchbase clusters linked by XDCR. Labels: Active, Passive, Bi-directional, Uni-directional.]
Demo
Connector … http://blog.couchbase.com/introducing-the-couchbase-kafka-connector
Bits … https://github.com/couchbase/couchbase-kafka-connector
Thank You

Editor's Notes

  • #5: Couchbase is an open source NoSQL data management platform. It specializes in operational data management to run the full range of online scenarios: web, mobile, and Internet of Things. It is available in community and enterprise editions. It is built upon an integrated object-managed cache, which is both memory centric and distributed. That’s the core. All three of these together pretty well cover the web, mobile, and IoT use cases Couchbase was designed for.
    KEY POINT: COUCHBASE PROVIDES A SET OF MULTI-PURPOSE, CORE CAPABILITIES THAT SUPPORT A BROAD RANGE OF APPLICATIONS AND USE CASES, ALL IN A SINGLE DATA MANAGEMENT PLATFORM.
    Couchbase provides a set of technology capabilities to support a broad range of applications and use cases:
    High Availability Cache: Couchbase provides an integrated managed object cache, so you can start out using Couchbase as a high availability cache on top of your existing relational database. For example, you can use Couchbase as a session store in front of your relational database, if your relational DB is struggling to keep up with the load required for online interactive applications.
    Key-Value Store: Many customers start with Couchbase as a cache and then broaden their usage to other capabilities, like using Couchbase as a key-value store for things like profile management.
    Document Database: From there, you can grow into using Couchbase as a document database, where you can do more with capabilities like indexing and Cross Data Center Replication.
    Embedded Database: Couchbase also provides an embedded database called Couchbase Lite. It’s a purpose-built database for the device, so you can build applications that are always available and always work, whether offline or online.
    Sync Management: Finally, as part of our solution for mobile applications, we provide Couchbase Sync Gateway, which automatically synchronizes data on the device with Couchbase Server in the cloud so your developer doesn’t have to write code to manage the complex sync process.
    Starting with cache and then expanding to other capabilities is often a good way to learn the technology and get comfortable with Couchbase for a wider set of use cases.
    Couchbase Open Source Projects (Apache 2.0 Public License), in Community Editions and Enterprise Editions:
    NoSQL Database (Couchbase Server): Document Database + Key / Value Store + Distributed Cache; Cross Data Center Replication
    Mobile Database (Couchbase Lite): iOS, Android, Java, and .NET
    Mobile Synchronization (Couchbase Sync Gateway)
  • #6: Couchbase Server is a NoSQL document database for interactive web applications. It has a flexible data model, is easily scalable, provides consistent high performance and is “always-on,” meaning it can serve application data 24 hours a day, 7 days a week. KEY POINTS: COUCHBASE DELIVERS ALL THE CAPABILITIES NEEDED TO MEET TODAY’S REQUIREMENTS FOR PERFORMANCE, SCALABILITY, AVAILABILITY, AND DATA MODEL FLEXIBILITY. THESE TRANSLATE INTO MAJOR BENEFITS FOR YOUR BUSINESS. Couchbase was purpose-built to solve today’s requirements for enterprise-class, mission-critical, web and mobile applications. Specifically, Couchbase delivers the following capabilities: Fast performance at scale: sub-millisecond latency to enable highly responsive applications, for millions or even hundreds of millions of users. Easy, affordable scalability: Couchbase is a distributed database that scales out on commodity hardware with push-button simplicity. We make it very easy to add or remove capacity on demand with no system downtime, on premises, in the cloud, wherever you want. High availability: Couchbase automatically replicates your data across your servers, clusters, and data centers, so it’s always available, 24x7. And Couchbase doesn’t require any downtime to maintain. Flexible data model: Couchbase gives you complete flexibility to handle any kind of data, and to change your data model on the fly to accommodate new data attributes or new data types. It’s the kind of flexibility that developers love, because it gets rid of the rigid schemas that slow them down, so developers can build applications faster and more easily. All this adds up to powerful benefits for your enterprise: faster development and time to market; better business agility; improved customer experience; increased loyalty and revenue; lower IT costs and increased efficiency.
  • #8: Not an either-or decision… I’m going to argue why.
  • #9: Latency: everyone says “real time,” but what do we mean? For an operational system, this means extremely fast (in-memory) reads and extremely fast (log append) writes. Couchbase completes millions of ops/second (these are gets/sets) at latencies of under 1 ms; compare the LinkedIn figures from Jerry Franz’s session. Tuned to LinkedIn’s specific workload: 75% writes (sets + incr) / 25% reads; 13-byte values and 25-byte keys on average; 2.5 billion items (+ 1 replica); 600 GB of RAM / 3 TB of disk in use on average; average store latency ~0.4 ms; 99th-percentile store latency ~2.5 ms; average get latency ~0.8 ms; 99th-percentile get latency ~8 ms.
  • #10: Users and consumers of information increasingly demand always-on, low-latency access to their data. Businesses also need a framework for understanding what’s happening in real time while addressing polyglot persistence in managing data. The Lambda Architecture is a conceptual framework for generic data processing; the term was coined by Nathan Marz and evolved out of work at Twitter. In a way, the architecture is an extended event-sourced system, but it aims to accommodate streaming data at large scale. 1. All data entering the system is dispatched to both the batch layer and the speed layer for processing. 2. The batch layer has two functions: (i) managing the master dataset (an immutable, append-only set of raw data), and (ii) pre-computing the batch views. 3. The serving layer indexes the batch views so that they can be queried in a low-latency, ad hoc way. 4. The speed layer compensates for the high latency of updates to the serving layer and deals with recent data only. 5. Any incoming query can be answered by merging results from batch views and real-time views.
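The five steps above can be sketched in a few lines. This is only a toy illustration under stated assumptions, not Couchbase or Hadoop code: the function names (`precompute_batch_view`, `update_realtime_view`, `merge_views`) and the page-view events are hypothetical, and each "view" is simply a counter per page.

```python
from collections import Counter

def precompute_batch_view(master_dataset):
    """Batch layer: recompute the view from the full, append-only dataset."""
    return Counter(event["page"] for event in master_dataset)

def update_realtime_view(realtime_view, event):
    """Speed layer: incrementally fold one recent event into the view."""
    realtime_view[event["page"]] += 1
    return realtime_view

def merge_views(batch_view, realtime_view):
    """Serving layer: answer a query by merging batch and real-time results."""
    return batch_view + realtime_view

# Hypothetical data: events already covered by the last batch run...
master_dataset = [{"page": "/home"}, {"page": "/home"}, {"page": "/cart"}]
# ...and events that arrived after it, seen only by the speed layer so far.
recent_events = [{"page": "/cart"}, {"page": "/checkout"}]

batch_view = precompute_batch_view(master_dataset)
realtime_view = Counter()
for event in recent_events:
    update_realtime_view(realtime_view, event)

merged = merge_views(batch_view, realtime_view)
print(merged["/home"], merged["/cart"], merged["/checkout"])  # 2 2 1
```

Once the next batch run absorbs the recent events, the real-time view for that interval can be discarded, which is why the speed layer only ever deals with recent data.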
  • #11: - Improve the animation for data handling – refresh existing sales deck and deep dives.
  • #12: Couchbase serves as the high-performance, low-latency, scalable data store that supports personalized interactions. As a real-time operational database, Couchbase may be generating real-time data to feed into both the batch and real-time layers. In some use cases, Couchbase is used to perform real-time processing and analytics (M/R views). Some customers are using Couchbase as the data store for stream processing. Email from Michael on streaming data from CB to Spark via Kafka: “Well, Kafka is one way, but it also uses DCP under the covers. I actually need to make some changes to the DCP implementation in the Java client, and my plan is to have DCP support in dp2 (a month later or so). So once we are GA, there will be a 100% way to stream data directly into a DStream (Spark Streaming). And of course you can easily implement simple polling of, let’s say, a view, and grab the full docs that match, for example, a time interval.”
  • #13: Couchbase Server is engineered with a unique, memory-centric architecture for increased scalability and performance. As data is written to an in-memory cache for low-latency read/write access, it’s persisted and replicated for durability and availability. The memory-centric architecture removes unnecessary disk I/O from the read and write paths; for example, the index is updated in memory, based on writes to the cache, when it is queried. Database Change Protocol (DCP) streams writes from the in-memory cache to in-memory queues within the indexing and replication components to increase performance. The same protocol we use to replicate changes between remote clusters is now being used to replicate mutations outside of Couchbase. Database Change Protocol: since Couchbase Server 3.x, the internal de facto standard for handling changes within a bucket. Clients: intra-cluster replication, indexing, XDCR. A mutation is an event that is raised on a creation, update, or delete. Each mutation that occurs in a vBucket has a sequence number. Important: DCP is not yet officially exposed, but it is used to implement the connectors provided by Couchbase.
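The point about per-vBucket sequence numbers can be illustrated with a toy model. The note says DCP is not officially exposed, so nothing below is the DCP API: `Mutation`, `VBucketLog`, and `IndexConsumer` are hypothetical names, and the sketch only shows why a consumer that remembers the last sequence number it saw can resume a stream without replaying old changes.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Mutation:
    """Hypothetical change event: a create, update, or delete of one key."""
    vbucket: int
    seqno: int
    key: str
    value: Optional[str] = None
    deleted: bool = False

class VBucketLog:
    """Toy vBucket that gives every mutation the next sequence number."""
    def __init__(self, vbucket):
        self.vbucket = vbucket
        self.mutations = []

    def record(self, key, value=None, deleted=False):
        seqno = len(self.mutations) + 1
        mutation = Mutation(self.vbucket, seqno, key, value, deleted)
        self.mutations.append(mutation)
        return mutation

    def stream_from(self, last_seqno):
        """Return only the mutations newer than last_seqno."""
        return [m for m in self.mutations if m.seqno > last_seqno]

class IndexConsumer:
    """Toy consumer (an indexer, say) that tracks its position per vBucket."""
    def __init__(self):
        self.last_seqno = {}   # vBucket id -> last sequence number applied
        self.index = {}        # key -> current value

    def catch_up(self, log):
        start = self.last_seqno.get(log.vbucket, 0)
        for m in log.stream_from(start):
            if m.deleted:
                self.index.pop(m.key, None)
            else:
                self.index[m.key] = m.value
            self.last_seqno[log.vbucket] = m.seqno

log = VBucketLog(vbucket=3)
consumer = IndexConsumer()
log.record("user::1", "v1")
log.record("user::1", "v2")
consumer.catch_up(log)           # applies seqno 1 and 2
log.record("user::1", deleted=True)
consumer.catch_up(log)           # resumes at seqno 3 only
```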
  • #15: The data generated by users is published to Apache Kafka. Next, it’s pulled into Apache Storm for real time analysis and processing as well as into Hadoop. Finally, Storm writes the data to Couchbase Server for real-time access by LivePerson agents while the data in Hadoop is eventually accessed via HP Vertica and MicroStrategy for offline business intelligence and analysis.
  • #17: Each Couchbase node is exactly the same. All nodes are broken down into two components: a data manager (on the left) and a cluster manager (on the right). It’s important to realize that these are separate processes within the system, specifically designed so that a node can continue serving its data even in the face of cluster problems like network disruption. The data manager is written in C and C++ and is responsible for the object caching layer, the persistence layer, and the query engine. It is based on memcached and so provides a number of benefits: -The very low lock contention of memcached allows for extremely high throughput and low latencies, both to a small set of documents (or just one) and across millions of documents. -Being compatible with the memcached protocol means we are not only a drop-in replacement but also inherit support for automatic item expiration (TTL) and atomic increment. -We’ve increased the maximum object size to 20 MB, but still recommend keeping objects much smaller. -Support for binary objects as well as native support for JSON documents. -All of the metadata for the documents and their keys is kept in RAM at all times. While this does add a bit of overhead per item, it also allows for extremely fast “miss” speeds, which are critical to the operation of some applications: we don’t have to scan a disk to know when we don’t have some data. The cluster manager is based on Erlang/OTP, which was developed by Ericsson to deal with managing hundreds or even thousands of distributed telco switches. This component is responsible for configuration, administration, process monitoring, statistics gathering, and the UI and REST interface. Note that there is no data manipulation done through this interface.
  • #20: Now, as you fill up memory (click), some data that has already been written to disk will be ejected from RAM to make room for new data. (click) Couchbase supports holding much more data than you have RAM available. It’s important to size the RAM capacity appropriately for your working set: the portion of data your application is working with at any given point in time and needs very low latency, high throughput access to. In some applications this is the entire data set, in others it is much smaller. As RAM fills up, we use a “not recently used” algorithm to determine the best data to be ejected from cache.
  • #21: Should a read now come in for one of those documents that has been ejected (click), it is copied back from disk into RAM and sent back to the application. The document then remains in RAM as long as there is space and it is being accessed.
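The eject-and-fetch behavior described in the last two notes can be sketched as a toy cache. This is an assumption-laden illustration, not Couchbase’s implementation: `WorkingSetCache` is a hypothetical class, the "not recently used" policy is modeled with a single referenced bit per document, and writes are persisted synchronously for simplicity.

```python
class WorkingSetCache:
    """Toy model: RAM holds at most `capacity` documents, disk holds all."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.ram = {}         # key -> value for documents resident in memory
        self.disk = {}        # key -> value for every persisted document
        self.referenced = {}  # key -> True if touched since the last sweep

    def set(self, key, value):
        self.disk[key] = value      # persist first, so ejecting is always safe
        self._load(key, value)

    def get(self, key):
        if key in self.ram:
            self.referenced[key] = True
            return self.ram[key]
        value = self.disk[key]      # miss in RAM: copy back from disk
        self._load(key, value)
        return value

    def _load(self, key, value):
        if key not in self.ram and len(self.ram) >= self.capacity:
            self._eject_one()
        self.ram[key] = value
        self.referenced[key] = True

    def _eject_one(self):
        # Prefer a document that has not been used since the last sweep.
        victim = next((k for k in self.ram if not self.referenced[k]), None)
        if victim is None:
            # Everything was recently used: clear all bits, eject the oldest.
            for k in self.ram:
                self.referenced[k] = False
            victim = next(iter(self.ram))
        del self.ram[victim]
        del self.referenced[victim]

cache = WorkingSetCache(capacity=2)
cache.set("a", 1)
cache.set("b", 2)
cache.set("c", 3)        # RAM is full, so "a" is ejected (still on disk)
value = cache.get("a")   # read of an ejected document: fetched back into RAM
```

Sizing `capacity` to the working set matters here: if the application keeps cycling through more documents than fit in RAM, every read turns into a disk fetch.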
  • #29: The application makes a call for a key called NYC MQ1. We run the key through the CRC32 function, and the result of that hash function points to vBucket 3, which in turn points to Couchbase server number 1.
  • #30: We now run a different key through the hash and come up with a different vBucket, vBucket 4, which points to server 3.
  • #31: We now run a different key through the hash and come up with a different vBucket, vBucket 4, which points to server 3.
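The key-to-server lookup in these notes can be sketched as follows. This is a simplified illustration, not the SDK’s code: the vBucket count of 1024, the server names, the round-robin `VBUCKET_MAP`, and the plain `crc32 % NUM_VBUCKETS` step are all assumptions made for the example, so the vBucket numbers it yields will not match the ones on the slides.

```python
import zlib

NUM_VBUCKETS = 1024                          # assumed vBucket count
SERVERS = ["server1", "server2", "server3"]  # hypothetical node names

# vBucket map: index is the vBucket id, value is the server that owns it.
# A round-robin layout is assumed here; in a real cluster the map is
# published by the cluster manager and changes during rebalance.
VBUCKET_MAP = [SERVERS[vb % len(SERVERS)] for vb in range(NUM_VBUCKETS)]

def vbucket_for_key(key: str) -> int:
    """Hash the key with CRC32 and reduce it to a vBucket id."""
    return zlib.crc32(key.encode("utf-8")) % NUM_VBUCKETS

def server_for_key(key: str) -> str:
    """Two-step lookup: key -> vBucket -> server."""
    return VBUCKET_MAP[vbucket_for_key(key)]

for key in ("NYC MQ1", "another-key"):
    print(key, "-> vBucket", vbucket_for_key(key), "->", server_for_key(key))
```

Because the client hashes to a vBucket rather than directly to a server, moving data between nodes only requires updating the map; the hash of each key never changes.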
  • #37: Built-in cache with persistence. Performance and scalability. Always on: constant uptime even during maintenance cycles. The solution for Cookie had to be transparent to existing PayPal applications, and it had to be a highly available, low-latency service. Data volume: more than a billion documents, each approximately 10 KB in size, for about 10 TB of data. Always available, with multi-data-center replication. Availability requirement of four nines (99.99%).
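The sizing and availability figures in this note can be checked with simple arithmetic. The sketch assumes decimal units (10 KB = 10,000 bytes) and exactly one billion documents; it counts raw document data only, not replicas or metadata.

```python
DOCUMENTS = 1_000_000_000      # "more than a billion documents"
DOC_SIZE_BYTES = 10_000        # "approx. 10K", decimal kilobytes assumed

total_tb = DOCUMENTS * DOC_SIZE_BYTES / 1e12

MINUTES_PER_YEAR = 365 * 24 * 60
AVAILABILITY = 0.9999          # four nines

allowed_downtime_minutes = MINUTES_PER_YEAR * (1 - AVAILABILITY)

print(f"{total_tb:.0f} TB of raw data")
print(f"{allowed_downtime_minutes:.1f} minutes of downtime per year")
```

Four nines therefore leaves under an hour of downtime per year, which is why maintenance has to happen without taking the service offline.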
  • #40: Cookieservice connects to the local Couchbase cluster. On a write failure, it writes to the secondary cluster. We read our own writes for immediate consistency. Uni-directional XDCR to a backup cluster covers our DR needs.
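The write path in this note can be sketched with stand-in objects. None of this is PayPal’s code or the Couchbase SDK: `InMemoryCluster`, `ClusterUnavailable`, and `write_then_read` are hypothetical names, and reading back from whichever cluster accepted the write is one assumed way to get read-your-own-writes behavior.

```python
class ClusterUnavailable(Exception):
    """Raised by the stand-in cluster when it cannot accept a write."""

class InMemoryCluster:
    """Hypothetical stand-in for a connection to one Couchbase cluster."""
    def __init__(self, name, available=True):
        self.name = name
        self.available = available
        self.docs = {}

    def upsert(self, key, value):
        if not self.available:
            raise ClusterUnavailable(self.name)
        self.docs[key] = value

    def get(self, key):
        return self.docs[key]

def write_cookie(local, secondary, key, value):
    """Write to the local cluster; on failure, write to the secondary.

    Returns the cluster that accepted the write.
    """
    try:
        local.upsert(key, value)
        return local
    except ClusterUnavailable:
        secondary.upsert(key, value)
        return secondary

def write_then_read(local, secondary, key, value):
    """Read back from the cluster that took the write (read-your-own-write)."""
    target = write_cookie(local, secondary, key, value)
    return target.name, target.get(key)

local = InMemoryCluster("local", available=False)   # simulate a write failure
secondary = InMemoryCluster("secondary")
result = write_then_read(local, secondary, "cookie::42", {"session": "abc"})
```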