Couchbase Data Pipeline

Phoenix Hadoop Users Group

Agenda
• Couchbase Technology
• What is Couchbase
• Couchbase and Hadoop Ecosystem
• Architecture (Node/SDK/Cluster)
• Couchbase at PayPal
• Couchbase Deployment
• Use Case Overview
• Kafka Connector Demo

Data management for a broad range of use cases
• Couchbase Server: high availability cache, key-value store, document database
• Couchbase Lite: embedded database
• Couchbase Sync Gateway: sync management

Couchbase Tenets
Flexible data model
Consistent performance at scale
High availability
Easy, affordable scalability
24x365

Couchbase and Hadoop Ecosystem

Couchbase View
NoSQL or Hadoop? NoSQL and Hadoop.
[Diagram: NoSQL and Hadoop shown twice, first as overlapping, then as complementary.]

Couchbase View
                     Couchbase                  Spark                        Hadoop (Hive)
Use cases            Operational, Web / Mobile  Analytics, Machine Learning  Analytics, Machine Learning
Processing mode      Online, Ad Hoc (New!)      Streaming, Ad Hoc, Batch     Batch, Ad Hoc
Low latency =        < 1ms ops                  Seconds                      Minutes
Users are typically  Millions of customers      100s of analysts             100s of analysts
Big data =           10s of Terabytes           Petabytes(?)                 Petabytes
Spectrum: OPERATIONAL (Couchbase) to ANALYTICAL (Hadoop)

Lambda Architecture
1. DATA: new data stream
2. BATCH: Batch Layer (all data; precompute views with Map Reduce; batch recompute)
3. SPEED: Speed Layer (process stream; incremental views; real-time increment)
4. SERVE: Serving Layer
5. QUERY: analysis

Couchbase and Hadoop
[Diagram: Lambda architecture joined by the Couchbase Hadoop Connector (Sqoop).
• New Data Stream
• Batch Layer: All Data, Precompute Views (Map Reduce), Batch Recompute, Batch Views
• Speed Layer: Process Stream, Incremental Views, Real-Time Increment, Real-Time Data, Real-Time Views
• Serving Layer: Partial Aggregates, Merge, Merged View]
Couchbase and Hadoop
[Diagram: Lambda architecture (Batch, Speed and Serving Layers) with the Couchbase Hadoop Connector (Sqoop), annotated with three roles:
• Stream / data ingestion
• Store incremental data / stream processing
• Serving merged results / responses]
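
The merge step in the serving layer can be sketched as follows. This is a minimal illustration, assuming the partial aggregates are simple counts; the function name, keys and numbers are hypothetical and not taken from the slides or from any Couchbase or Hadoop API.

```python
from collections import Counter


def merge_views(batch_view, realtime_view):
    """Combine a precomputed batch view with real-time increments.

    Both inputs map a key to a partial aggregate (here, a count).
    """
    merged = Counter(batch_view)
    merged.update(realtime_view)  # adds counts key by key
    return dict(merged)


# Hypothetical partial aggregates: the batch layer holds counts up to the
# last batch recompute, the speed layer holds increments since then.
batch_view = {"page:/home": 1000, "page:/cart": 250}
realtime_view = {"page:/home": 12, "page:/checkout": 3}

merged_view = merge_views(batch_view, realtime_view)
print(merged_view)
```

The batch view is replaced wholesale on each recompute, so the real-time view only ever needs to cover the window since the last batch run.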

Couchbase Connectors
[Diagram: App, ETL, BI and Visualization tools each connect through xDBC to the CB Nodes.]
Integrations, partnerships

[Diagram labels: Complex Event Processing, Real Time Repository, Perpetual Store, Analytical DB, Business Intelligence, Monitoring, Chat/Voice System, Batch Track, Real-Time Track, Dashboard.]

Couchbase Server Node
Single-node type means easier administration and scaling
• Single installation
• Two major components/processes: data manager and cluster manager
• Data manager:
  • C/C++
  • Layer consolidation of caching and persistence
• Cluster manager:
  • Erlang/OTP
  • Administration UIs
  • Out-of-band for data requests

Couchbase Read Operation
[Diagram: the application server issues GET DOC 1; the node holds a managed cache, a disk queue, a disk and a replication queue, and DOC 1 is returned from the managed cache.]
• Reads out of cache are extremely fast
• No other process/system to communicate with
• Data connection is a TCP-based binary protocol

Couchbase Write Operation
[Diagram: the application server writes DOC 1 into the managed cache; the node also holds a replication queue, a disk queue and a disk, each receiving a copy of DOC 1.]
• Writes are asynchronous by default
• The application gets an acknowledgement once the write is in RAM, and can choose per write to wait for replication or persistence
• Replication to 1, 2 or 3 other nodes
• Replication is RAM-based, so it is extremely fast
• Off-node replication is the primary level of HA
• Disk is written to as fast as possible, with no waiting
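
The per-write trade-off can be sketched with a toy model of one node. This is only an illustration, assuming a single replica and explicit queue draining in place of the background work a real server does; the class, method and parameter names are hypothetical and are not a Couchbase API.

```python
from collections import deque


class Node:
    """Toy model of one node: managed cache plus two outbound queues."""

    def __init__(self):
        self.ram = {}                     # managed cache
        self.disk = {}                    # persisted documents
        self.replica = {}                 # stand-in for one replica node's RAM
        self.disk_queue = deque()         # pending disk writes
        self.replication_queue = deque()  # pending copies to the replica

    def write(self, key, value, wait_replicate=False, wait_persist=False):
        # By default the write is acknowledged as soon as it is in RAM.
        self.ram[key] = value
        self.disk_queue.append(key)
        self.replication_queue.append(key)
        # Per-write trade-off: optionally block until the queues are drained.
        if wait_replicate:
            self.drain_replication_queue()
        if wait_persist:
            self.drain_disk_queue()
        return "ack"

    def drain_replication_queue(self):
        while self.replication_queue:
            key = self.replication_queue.popleft()
            self.replica[key] = self.ram[key]

    def drain_disk_queue(self):
        while self.disk_queue:
            key = self.disk_queue.popleft()
            self.disk[key] = self.ram[key]


node = Node()
node.write("doc1", {"n": 1})  # default: acknowledged from RAM only
in_ram_only = "doc1" in node.ram and "doc1" not in node.disk

node.write("doc2", {"n": 2}, wait_persist=True)  # wait for disk
persisted = "doc2" in node.disk
```

In this model, waiting for persistence on one write also flushes earlier queued writes, because the disk queue is drained in order.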

Couchbase Cache Ejection
[Diagram: the application server has written DOC 1 to DOC 5; the documents appear in the managed cache and on disk, alongside the disk queue and replication queue.]
• Layer consolidation means a read-through and write-through cache
• Couchbase automatically ejects from RAM data that has already been persisted

Couchbase Cache Miss
[Diagram: the application server issues GET DOC 1; DOC 1 is not in the managed cache, so it is fetched from disk and returned.]
• Layer consolidation means a single interface for the app to talk to and get its data back as fast as possible
• Separation of cache and disk allows the fastest access out of RAM while pulling data from disk in parallel
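
Ejection and the cache-miss path can be sketched together in one toy cache. This is a simplified illustration, assuming a fixed document capacity and oldest-first ejection (the server's actual ejection policy is not described in the slides); all names are hypothetical.

```python
class ManagedCache:
    """Toy read-through / write-through cache in front of a disk store."""

    def __init__(self, capacity):
        self.capacity = capacity  # max documents kept in RAM (assumption)
        self.ram = {}             # dicts keep insertion order
        self.disk = {}
        self.dirty = set()        # keys written but not yet persisted

    def set(self, key, value):
        self.ram.pop(key, None)
        self.ram[key] = value
        self.dirty.add(key)
        self._eject_if_needed()

    def persist(self):
        # Flush every dirty document to disk; they become eligible for ejection.
        for key in self.dirty:
            self.disk[key] = self.ram[key]
        self.dirty.clear()
        self._eject_if_needed()

    def get(self, key):
        if key in self.ram:
            return self.ram[key], "hit"
        value = self.disk[key]  # cache miss: read through to disk
        self.ram[key] = value
        self._eject_if_needed()
        return value, "miss"

    def _eject_if_needed(self):
        # Only documents that are already persisted may leave RAM.
        for key in list(self.ram):
            if len(self.ram) <= self.capacity:
                break
            if key not in self.dirty:
                del self.ram[key]


cache = ManagedCache(capacity=2)
for i in range(1, 6):
    cache.set(f"doc{i}", {"n": i})
cache.persist()

resident = sorted(cache.ram)           # documents still in RAM
first = cache.get("doc1")              # ejected earlier, so served from disk
second = cache.get("doc1")             # now back in RAM
```

The application calls `get` the same way in both cases, which is the point of the single interface: whether the document came from RAM or disk is invisible to the caller.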

Couchbase SDK
• Documents are integral to the SDKs
• All SDKs support the JSON format
• In addition: serialized objects, unquoted strings, binary pass-through
• A Document contains:

Property  Description
ID        The bucket-unique identifier
Content   The value that is stored
Expiry    An expiration time
CAS       Check-and-Set identifier
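
The role of the CAS value can be sketched with an in-memory stand-in for a bucket. This is a minimal illustration, assuming the CAS is a counter that changes on every write; the `Bucket` class and `CasMismatch` error here are hypothetical and are not the SDK's own types.

```python
import itertools
from dataclasses import dataclass


@dataclass
class Document:
    id: str           # the bucket-unique identifier
    content: object   # the value that is stored
    expiry: int = 0   # an expiration time (0 = never, by assumption)
    cas: int = 0      # check-and-set identifier


class CasMismatch(Exception):
    """Raised when a document changed since the caller read it."""


class Bucket:
    def __init__(self):
        self._docs = {}
        self._cas_counter = itertools.count(1)

    def upsert(self, doc_id, content, expiry=0):
        doc = Document(doc_id, content, expiry, next(self._cas_counter))
        self._docs[doc_id] = doc
        return doc

    def get(self, doc_id):
        return self._docs[doc_id]

    def replace(self, doc_id, content, cas):
        current = self._docs[doc_id]
        if current.cas != cas:
            raise CasMismatch(doc_id)
        return self.upsert(doc_id, content, current.expiry)


bucket = Bucket()
bucket.upsert("user::1", {"visits": 1})

reader_a = bucket.get("user::1")
reader_b = bucket.get("user::1")

# Reader A updates first and succeeds.
updated = bucket.replace("user::1", {"visits": 2}, reader_a.cas)

# Reader B still holds the old CAS, so its update is rejected.
try:
    bucket.replace("user::1", {"visits": 99}, reader_b.cas)
    stale_write_rejected = False
except CasMismatch:
    stale_write_rejected = True
```

A rejected writer would normally re-read the document, reapply its change and retry with the fresh CAS.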

Couchbase SDK
What does it mean to be a Couchbase SDK?

Cluster
  Function: Hold on to cluster information such as topology.
  API (Reference Cluster Management): openBucket(), info(), disconnect()

Bucket
  Function: Manage connections to the bucket within the cluster for different services. Provide a core layer where IO can be managed and optimized. Provide a way to manage buckets.
  API: insertDesignDocument(), flush(), listDesignDocuments()

CRUD
  Function: Give the application developer a concurrent API for basic (k-v) or document management.
  API: get(), insert(), upsert(), remove()

View Query
  Function: Allow for view querying, building of queries and reasonable error handling from the cluster.
  API: abucket.NewViewQuery().Limit().Stale()

N1QL Query
  Function: Allow for querying, execution of other directives such as defining indexes and checking on index state.
  API: abucket.NewN1QLQuery("SELECT * FROM default LIMIT 5").Consistency(gocouchbase.RequestPlus);

Couchbase SDK
• Official SDKs: Java, .NET, Node.js, Python, PHP, C / C++, Go, Ruby
• For each of these we have:
  • Full Document support
  • Interoperability
  • Common yet idiomatic programming model
• Others: Erlang, Perl, TCL, Clojure, Scala
• JDBC and ODBC

Architecture
Couchbase Cluster: Node and SDK Interaction

Basic Operation
[Diagram: Couchbase Server 1, 2 and 3 each hold a set of ACTIVE shards and a set of REPLICA shards; shards 1 to 9 are spread across the three servers.]
Application has single logical connection to cluster (client object)
• Data is automatically sharded, resulting in even document data distribution across the cluster
• Each vBucket is replicated 1, 2 or 3 times (“peer-to-peer” replication)
• Docs are automatically hashed by the client to a shard
• The cluster map records which server each shard is on
• Every read/write/update/delete goes to the same node for a given key
• Strongly consistent data access (“read your own writes”)
• A single Couchbase node can achieve 100,000s of ops/sec, so there is no need to scale reads

Auto sharding – Bucket and vBuckets
[Diagram: data buckets, each divided into virtual buckets (vB) 1 to 1024.]
• A bucket is a logical, unique key space
• Multiple buckets can exist within a single cluster of nodes
• Each bucket has active and replica data sets (1, 2 or 3 extra copies)
• Each data set has 1024 virtual buckets (vBuckets)
• Each vBucket contains 1/1024th of the data set
• vBuckets do not have a fixed physical server location
• The mapping between vBuckets and physical servers is called the cluster map
• Document IDs (keys) always get hashed to the same vBucket
• Couchbase SDKs look up the vBucket -> server mapping
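
The two-step lookup (key to vBucket, then vBucket to server) can be sketched as follows. This is an illustration only: it assumes a CRC32-based hash, and the exact hash a given SDK uses may differ; the server names and the round-robin cluster map are hypothetical.

```python
import zlib

NUM_VBUCKETS = 1024


def vbucket_for_key(key: str) -> int:
    """Hash a document ID to a vBucket index (CRC32 is an assumption here)."""
    crc = zlib.crc32(key.encode("utf-8")) & 0xFFFFFFFF
    return ((crc >> 16) & 0x7FFF) % NUM_VBUCKETS


# Hypothetical cluster map: vBucket index -> server holding the active copy.
servers = ["server1", "server2", "server3"]
cluster_map = [servers[vb % len(servers)] for vb in range(NUM_VBUCKETS)]


def server_for_key(key: str) -> str:
    """Client-side routing: no central lookup, just hash plus cluster map."""
    return cluster_map[vbucket_for_key(key)]


print(vbucket_for_key("user::1"), server_for_key("user::1"))
```

Because the key-to-vBucket step never changes, moving data between servers only requires publishing a new cluster map; the keys themselves are never rehashed.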

Cluster Map – 2 nodes added

Rebalance
[Diagram: Couchbase Server 4 and 5 join the cluster; active and replica shards are redistributed onto the new servers while the application continues to READ/WRITE/UPDATE.]
Application has single logical connection to cluster (client object)
• Multiple nodes added or removed at once
• One-click operation
• Incremental movement of active and replica vBuckets and data
• Client library updated via cluster map
• Fully online operation, no downtime or loss of performance

Fail Over Node
[Diagram: Couchbase Servers holding active and replica shards; after a node fails, replica shards on the remaining servers take over for the lost active shards.]
Application has single logical connection to cluster (client object)
• When a node goes down, some requests will fail
• Failover is either automatic or manual
• Client library is automatically updated via cluster map
• Replicas are not recreated, to preserve stability
• Best practice is to replace the node and rebalance
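
The effect of failover on the cluster map can be sketched as a pure function. This is a simplified model under stated assumptions: the map layout, node names and `fail_over` function are hypothetical and do not reflect the cluster manager's actual data structures.

```python
def fail_over(cluster_map, failed_node):
    """Promote replicas for every vBucket whose active copy was on failed_node.

    cluster_map maps a vBucket id to {"active": node, "replicas": [nodes]}.
    Returns a new map; lost replicas are not recreated here, that is left
    to a later rebalance.
    """
    new_map = {}
    for vb, entry in cluster_map.items():
        active = entry["active"]
        replicas = [n for n in entry["replicas"] if n != failed_node]
        if active == failed_node:
            if not replicas:
                raise RuntimeError(f"vBucket {vb} has no surviving replica")
            # Promote the first surviving replica to active.
            active, replicas = replicas[0], replicas[1:]
        new_map[vb] = {"active": active, "replicas": replicas}
    return new_map


# Hypothetical three-node cluster with one replica per vBucket.
cluster_map = {
    0: {"active": "node1", "replicas": ["node2"]},
    1: {"active": "node2", "replicas": ["node3"]},
    2: {"active": "node3", "replicas": ["node1"]},
}

after = fail_over(cluster_map, "node2")
print(after)
```

After the failover, vBuckets 0 and 1 in this example have no replica left, which is why replacing the node and rebalancing is the recommended follow-up.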

Couchbase at PayPal
Kafka Integration

Couchbase at PayPal
Footprint Overview
• Seven use cases (more going live at a later date)
• Each cluster has 10 to 20 nodes
• Three data center locations per use case
Global Cookie Service
• Three clusters (two handle traffic, one for DR)
• Bi-directional replication
• Billions of documents
• Terabytes of data (maximum of 10 TB over time)
Challenge
• Data analytics

Couchbase at PayPal
Couchbase Solution
• Couchbase Server deployed to capture and serve global cookies
• Integrates with Hadoop to pass data for additional offline analytics via Kafka
Results
• Consistent low latency
• SLA 10ms application
• SLA 1ms Couchbase
• High availability enabled by distributed cache and data center replication
• Kafka integration for analytics within Hadoop cluster

© 2015 PayPal Inc. All rights reserved. Confidential and proprietary.
Requirements for Database
Data volume / Scalability
• Online system; > 1B documents
• Document size 4-10 KB; 5-10 TB total storage
• Linearly scalable
Availability
• Multi data center – DR
• Availability requirement of 99.99%
Data Structure
• Flexible and schemaless; document based
Performance
• 50% read / 50% write
• Low latency < 10 msec (5)

Couchbase TAP
• Snapshot entire database
• Export future mutations
• TAP observes data changes in the memcached server
• Kafka: a high-throughput distributed messaging system
Couchbase Kafka Adapter
• Based on Couchbase TAP and Kafka Producer
Kafka Producer
• Fast
• Scalable
• Durable
• Distributed
https://github.com/paypal/couchbasekafka

Stream data out of database
[Diagram, steps 1 to 7: a TAP stream from Couchbase feeds the Couchbase Kafka Adapter (TAPClient + Kafka Producer); Camus and MR jobs process the data downstream.]
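
The adapter's core loop (read a change stream, publish each change as a message) can be sketched with in-memory stand-ins. This is not the adapter's actual code: `InMemoryProducer`, `tap_stream`, `run_adapter` and the topic name are hypothetical, and a real deployment would use a TAP client and a Kafka producer in their place.

```python
import json


class InMemoryProducer:
    """Stand-in for a Kafka producer; records messages instead of sending."""

    def __init__(self):
        self.sent = []

    def send(self, topic, key, value):
        self.sent.append((topic, key, value))


def tap_stream(snapshot, mutations):
    """Yield every existing document first, then later mutations."""
    for key, doc in snapshot.items():
        yield key, doc
    for key, doc in mutations:
        yield key, doc


def run_adapter(stream, producer, topic):
    """Forward each (key, document) pair as one JSON-encoded message."""
    count = 0
    for key, doc in stream:
        payload = json.dumps(doc).encode("utf-8")
        producer.send(topic, key.encode("utf-8"), payload)
        count += 1
    return count


snapshot = {"cookie::1": {"visits": 3}, "cookie::2": {"visits": 7}}
mutations = [("cookie::1", {"visits": 4})]

producer = InMemoryProducer()
forwarded = run_adapter(tap_stream(snapshot, mutations), producer, "cookies")
```

Keying each message by document ID is a design choice in this sketch: with a keyed topic, all versions of one document would land in the same partition and keep their order.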

Deployment Model
[Diagram: Cookie App instances read and write against Couchbase clusters linked by XDCR; labels: Active, Passive, Bi-directional, Uni-directional.]

Demo
Connector … http://blog.couchbase.com/introducing-the-couchbase-kafka-connector
Bits … https://github.com/couchbase/couchbase-kafka-connector
