SFHUG Kudu Talk

© Cloudera, Inc. All rights reserved.
Todd Lipcon on behalf of the Kudu team
Kudu: Resolving Transactional
and Analytic Trade-offs in
Hadoop

The conference for and by Data Scientists, from startup to enterprise
wrangleconf.com
Public registration is now open!
Who: Featuring data scientists from Salesforce,
Uber, Pinterest, and more
When: Thursday, October 22, 2015
Where: Broadway Studios, San Francisco

Kudu
Storage for Fast Analytics on Fast Data
• New updating column store for
Hadoop
• Apache-licensed open source
• Beta now available
Columnar Store
Kudu

Motivation and Goals
Why build Kudu?

Motivating Questions
• Are there user problems that we can’t address because of gaps in Hadoop
ecosystem storage technologies?
• Are we positioned to take advantage of advancements in the hardware
landscape?

Current Storage Landscape in Hadoop
HDFS excels at:
• Efficiently scanning large amounts
of data
• Accumulating data with high
throughput
HBase excels at:
• Efficiently finding and writing
individual rows
• Making data mutable
Gaps exist when these properties
are needed simultaneously

Kudu Design Goals
• High throughput for big scans (columnar storage and replication)
Goal: Within 2x of Parquet
• Low latency for short accesses (primary key indexes and quorum replication)
Goal: 1ms read/write on SSD
• Database-like semantics (initially single-row ACID)
• Relational data model
• SQL query
• “NoSQL” style scan/insert/update (Java client)

Changing Hardware landscape
• Spinning disk -> solid state storage
• NAND flash: Up to 450k read 250k write iops, about 2GB/sec read and
1.5GB/sec write throughput, at a price of less than $3/GB and dropping
• 3D XPoint memory (1000x faster than NAND, cheaper than RAM)
• RAM is cheaper and more abundant:
• 64->128->256GB over last few years
• Takeaway 1: The next bottleneck is CPU, and current storage systems weren’t
designed with CPU efficiency in mind.
• Takeaway 2: Column stores are feasible for random access

Kudu Usage
• Table has a SQL-like schema
• Finite number of columns (unlike HBase/Cassandra)
• Types: BOOL, INT8, INT16, INT32, INT64, FLOAT, DOUBLE, STRING, BINARY,
TIMESTAMP
• Some subset of columns makes up a possibly-composite primary key
• Fast ALTER TABLE
• Java and C++ “NoSQL” style APIs
• Insert(), Update(), Delete(), Scan()
• Integrations with MapReduce, Spark, and Impala
• more to come!
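
The usage model above (fixed schema, composite primary key, Insert/Update/Delete/Scan) can be illustrated with a small in-memory stand-in. This is only a sketch of the data model, not the Kudu client API: the class `ToyTable` and its method names are hypothetical.

```python
from typing import Dict, Iterator, Optional, Tuple

Row = Dict[str, object]


class ToyTable:
    """Hypothetical in-memory table: fixed columns, composite primary key."""

    def __init__(self, columns: Tuple[str, ...], key_columns: Tuple[str, ...]):
        if not set(key_columns) <= set(columns):
            raise ValueError("key columns must be part of the schema")
        self.columns = columns
        self.key_columns = key_columns
        self._rows: Dict[tuple, Row] = {}

    def _key(self, row: Row) -> tuple:
        return tuple(row[c] for c in self.key_columns)

    def insert(self, row: Row) -> None:
        # Finite set of columns: every row must match the schema exactly.
        if set(row) != set(self.columns):
            raise ValueError("row does not match the schema")
        key = self._key(row)
        if key in self._rows:
            raise KeyError(f"duplicate primary key {key}")
        self._rows[key] = dict(row)

    def update(self, key: tuple, **changes: object) -> None:
        # Only non-key columns that exist in the schema may change.
        if any(c in self.key_columns or c not in self.columns for c in changes):
            raise ValueError("can only update non-key columns in the schema")
        self._rows[key].update(changes)

    def delete(self, key: tuple) -> None:
        del self._rows[key]

    def scan(self, columns: Optional[Tuple[str, ...]] = None) -> Iterator[Row]:
        # Rows come back in primary-key order, projected to the requested columns.
        cols = columns or self.columns
        for key in sorted(self._rows):
            yield {c: self._rows[key][c] for c in cols}


metrics = ToyTable(
    columns=("host", "metric", "timestamp", "value"),
    key_columns=("host", "metric", "timestamp"),
)
metrics.insert({"host": "h1", "metric": "cpu", "timestamp": 2, "value": 0.5})
metrics.insert({"host": "h1", "metric": "cpu", "timestamp": 1, "value": 0.4})
metrics.update(("h1", "cpu", 1), value=0.9)
metrics.delete(("h1", "cpu", 2))
remaining = list(metrics.scan(columns=("timestamp", "value")))
```

Projecting only the columns a scan needs is the part a column store makes cheap; the dictionary here just mimics the behavior.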

Use cases and architectures

Kudu Use Cases
Kudu is best for use cases requiring a simultaneous combination of
sequential and random reads and writes
● Time Series
○ Examples: Stream market data; fraud detection & prevention; risk monitoring
○ Workload: Insert, updates, scans, lookups
● Machine Data Analytics
○ Examples: Network threat detection
○ Workload: Inserts, scans, lookups
● Online Reporting
○ Examples: ODS
○ Workload: Inserts, updates, scans, lookups

Real-Time Analytics in Hadoop Today
Fraud Detection in the Real World = Storage Complexity
Considerations:
● How do I handle failure during this process?
● How often do I reorganize incoming streaming data into a format appropriate for reporting?
● When reporting, how do I see data that has not yet been reorganized?
● How do I ensure that important jobs aren’t interrupted by maintenance?
Diagram labels:
● Incoming Data (Messaging System)
● Stores: HBase, Parquet File
● Partitions: New Partition, Most Recent Partition, Historic Data
● Decision: Have we accumulated enough data?
● Reorganize HBase file into Parquet
• Wait for running operations to complete
• Define new Impala partition referencing the newly written Parquet file
● Reporting Request
● Impala on HDFS

Real-Time Analytics in Hadoop with Kudu
Improvements:
● One system to operate
● No cron jobs or background processes
● Handle late arrivals or data corrections with ease
● New data available immediately for analytics or operations
Diagram labels:
● Incoming Data (Messaging System)
● Storage in Kudu (Historical and Real-time Data)
● Reporting Request

How it works

Tables and Tablets
• Table is horizontally partitioned into tablets
• Range or hash partitioning
• PRIMARY KEY (host, metric, timestamp) DISTRIBUTE BY
HASH(timestamp) INTO 100 BUCKETS
• Each tablet has N replicas (3 or 5), with Raft consensus
• Allow read from any replica, plus leader-driven writes with low MTTR
• Tablet servers host tablets
• Store data on local disks (no HDFS)
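
A minimal sketch of how the hash partitioning above assigns rows to tablets, assuming one tablet per bucket. The slides do not say which hash function Kudu uses, so CRC32 here is a stand-in, and `bucket_for` is a hypothetical helper.

```python
import zlib

NUM_BUCKETS = 100  # matches "INTO 100 BUCKETS" in the example above


def bucket_for(timestamp: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Map the hashed column value to a bucket in [0, num_buckets).

    CRC32 is an assumption for illustration; it is not Kudu's hash function.
    """
    encoded = timestamp.to_bytes(8, "big", signed=True)
    return zlib.crc32(encoded) % num_buckets


# The same key always lands in the same bucket, so writes and lookups for
# that key go to a single tablet.
first = bucket_for(1445500800)
second = bucket_for(1445500800)
buckets_used = {bucket_for(t) for t in range(1000)}
```

Hashing on the timestamp spreads consecutive inserts over many tablets, which avoids a single hot tablet for time-ordered writes.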

Metadata
• Replicated master*
• Acts as a tablet directory (“META” table)
• Acts as a catalog (table schemas, etc)
• Acts as a load balancer (tracks TS liveness, re-replicates under-replicated
tablets)
• Caches all metadata in RAM for high performance
• 80-node load test, GetTableLocations RPC perf:
• 99th percentile: 68us, 99.99th percentile: 657us
• <2% peak CPU usage
• Client configured with master addresses
• Asks master for tablet locations as needed and caches them
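
The client behavior in the last bullet (ask the master as needed, then cache) can be sketched as follows. `TabletLocationCache` and `fake_master` are hypothetical names; the real client and the GetTableLocations RPC are not modeled here.

```python
from typing import Callable, Dict, List


class TabletLocationCache:
    """Hypothetical client-side cache of tablet -> tablet server addresses."""

    def __init__(self, ask_master: Callable[[str], List[str]]):
        self._ask_master = ask_master
        self._locations: Dict[str, List[str]] = {}

    def locate(self, tablet_id: str) -> List[str]:
        # Only the first lookup for a tablet reaches the master.
        if tablet_id not in self._locations:
            self._locations[tablet_id] = self._ask_master(tablet_id)
        return self._locations[tablet_id]

    def invalidate(self, tablet_id: str) -> None:
        # Drop a stale entry, for example after a replica has moved.
        self._locations.pop(tablet_id, None)


master_calls: List[str] = []


def fake_master(tablet_id: str) -> List[str]:
    """Stand-in for the master; records each request it receives."""
    master_calls.append(tablet_id)
    return ["ts-a", "ts-b", "ts-c"]


cache = TabletLocationCache(fake_master)
cache.locate("tablet-1")
cache.locate("tablet-1")  # served from the cache, no second master call
```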

Raft consensus
Diagram: a Client and three tablet servers, each with its own WAL. TS A hosts Tablet 1 (LEADER); TS B and TS C host Tablet 1 (FOLLOWER).
1a. Client->Leader: Write() RPC
2a. Leader->Followers: UpdateConsensus() RPC
2b. Leader writes local WAL
3. Follower: write WAL
4. Follower->Leader: success
5. Leader has achieved majority
6. Leader->Client: Success!
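
The write path above can be sketched as a majority-acknowledgement check. This is deliberately not a Raft implementation (no terms, elections, or log matching); `Replica` and `replicate_write` are hypothetical names used only to show when the leader may report success.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Replica:
    name: str
    alive: bool = True
    wal: List[str] = field(default_factory=list)

    def append(self, op: str) -> bool:
        """Append op to this replica's WAL; a dead replica never acknowledges."""
        if not self.alive:
            return False
        self.wal.append(op)
        return True


def replicate_write(leader: Replica, followers: List[Replica], op: str) -> bool:
    """Return True once a majority of replicas have the op in their WAL."""
    if not leader.append(op):          # step 2b: leader writes local WAL
        return False
    acks = 1
    for follower in followers:         # steps 2a, 3, 4: send, write WAL, reply
        if follower.append(op):
            acks += 1
    majority = (1 + len(followers)) // 2 + 1
    return acks >= majority            # steps 5, 6: majority reached, reply to client


ts_a = Replica("TS A")
ts_b = Replica("TS B")
ts_c = Replica("TS C", alive=False)   # one follower is down
ok = replicate_write(ts_a, [ts_b, ts_c], "INSERT row-1")
```

With three replicas, the leader plus one follower is already a majority, so a single slow or failed follower does not block the write.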

Fault tolerance
• Transient FOLLOWER failure:
• Leader can still achieve majority
• Restart follower TS within 5 min and it will rejoin transparently
• Transient LEADER failure:
• Followers expect to hear a heartbeat from their leader every 1.5 seconds
• 3 missed heartbeats: leader election!
• New LEADER is elected from remaining nodes within a few seconds
• Restart within 5 min and it rejoins as a FOLLOWER
• N replicas handle (N-1)/2 failures
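
The numbers on this slide work out as follows. The constants come from the bullets above; the function names are hypothetical.

```python
HEARTBEAT_INTERVAL_S = 1.5             # followers expect a heartbeat this often
MISSED_HEARTBEATS_BEFORE_ELECTION = 3  # missed heartbeats that trigger an election


def tolerated_failures(num_replicas: int) -> int:
    """N replicas handle (N-1)/2 failures, rounded down."""
    return (num_replicas - 1) // 2


def election_trigger_delay_s() -> float:
    """Time without heartbeats before followers start a leader election."""
    return HEARTBEAT_INTERVAL_S * MISSED_HEARTBEATS_BEFORE_ELECTION


three_replicas = tolerated_failures(3)  # 1 failure
five_replicas = tolerated_failures(5)   # 2 failures
delay = election_trigger_delay_s()      # 4.5 seconds
```

An even replica count adds no tolerance: 4 replicas still handle only 1 failure, which is why the replica counts mentioned earlier are 3 or 5.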

Fault tolerance (2)
• Permanent failure:
• Leader notices that a follower has been dead for 5 minutes
• Evicts that follower
• Master selects a new replica
• Leader copies the data over to the new one, which joins as a new FOLLOWER

Tablet design
• Inserts buffered in an in-memory store (like HBase’s memstore)
• Flushed to disk
• Columnar layout, similar to Apache Parquet
• Updates use MVCC (updates tagged with timestamp, not in-place)
• Allow “SELECT AS OF <timestamp>” queries and consistent cross-tablet scans
• Near-optimal read path for “current time” scans
• No per row branches, fast vectorized decoding and predicate evaluation
• Performance worsens based on number of recent updates
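
A minimal sketch of the MVCC idea above: each update is stored with its timestamp instead of overwriting the old value, so a reader can ask for the value "as of" any time. `VersionedCell` is a hypothetical in-memory model and says nothing about Kudu's on-disk layout.

```python
import bisect
from typing import List, Optional


class VersionedCell:
    """One column value of one row, kept as a timestamp-ordered version list."""

    def __init__(self) -> None:
        self._timestamps: List[int] = []
        self._values: List[object] = []

    def write(self, timestamp: int, value: object) -> None:
        # Updates are tagged with a timestamp; earlier versions stay readable.
        i = bisect.bisect_right(self._timestamps, timestamp)
        self._timestamps.insert(i, timestamp)
        self._values.insert(i, value)

    def read_as_of(self, timestamp: int) -> Optional[object]:
        # Newest version whose timestamp is <= the requested snapshot time.
        i = bisect.bisect_right(self._timestamps, timestamp)
        return self._values[i - 1] if i else None


price = VersionedCell()
price.write(10, 100.0)
price.write(20, 105.0)
before_any_write = price.read_as_of(5)   # None
old_snapshot = price.read_as_of(15)      # 100.0
current = price.read_as_of(25)           # 105.0
```

Reading every tablet at the same snapshot timestamp is what makes a cross-tablet scan consistent. The longer the version list for a row, the more work a read does, which matches the last bullet above.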

LSM vs Kudu
• LSM – Log Structured Merge (Cassandra, HBase, etc)
• Inserts and updates all go to an in-memory map (MemStore) and later flush to
on-disk files (HFile/SSTable)
• Reads perform an on-the-fly merge of all on-disk HFiles
• Kudu
• Shares some traits (memstores, compactions)
• More complex.
• Slower writes in exchange for faster reads (especially scans)
• During tonight’s break-out sessions, I can go into excruciating detail
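
The LSM read path described above (an on-the-fly merge of all on-disk files) can be sketched with sorted runs and a heap merge. The `(key, sequence, value)` tuple layout and the name `merged_read` are assumptions for illustration, not the HBase or Cassandra file format.

```python
import heapq
from typing import Iterator, List, Tuple

Entry = Tuple[str, int, str]  # (key, sequence number, value)


def merged_read(runs: List[List[Entry]]) -> Iterator[Tuple[str, str]]:
    """Merge key-sorted runs at read time; the highest sequence number wins.

    Assumes each run holds at most one entry per key and is sorted by key.
    """
    merged = heapq.merge(*runs, key=lambda e: (e[0], -e[1]))
    last_key = None
    for key, _sequence, value in merged:
        if key != last_key:  # first entry seen for a key is its newest version
            yield key, value
            last_key = key


older_file = [("a", 1, "old"), ("c", 1, "c1")]
newer_file = [("a", 2, "new"), ("b", 2, "b2")]
result = list(merged_read([older_file, newer_file]))
```

Every read pays for this merge across all files that might hold the key, which is the scan cost Kudu trades slower writes to avoid.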

Kudu trade-offs
• Random updates will be slower
• HBase model allows random updates without incurring a disk seek
• Kudu requires a key lookup before update, bloom lookup before insert, may
incur seeks
• Single-row reads may be slower
• Columnar design is optimized for scans
• Especially slow at reading a row that has had many recent updates (e.g YCSB
“zipfian”)
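
The "bloom lookup before insert" step can be sketched as below: the filter answers "definitely not present" cheaply, and only a "maybe present" answer falls through to the real key lookup. `BloomFilter` and `insert_row` are hypothetical, and the SHA-256 based hashing is an assumption chosen for simplicity.

```python
import hashlib
from typing import Dict, Iterator


class BloomFilter:
    """Small Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self._bits = 0  # Python int used as a bit set

    def _positions(self, key: str) -> Iterator[int]:
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self._bits |= 1 << pos

    def might_contain(self, key: str) -> bool:
        return all((self._bits >> pos) & 1 for pos in self._positions(key))


def insert_row(store: Dict[str, str], bloom: BloomFilter, key: str, value: str) -> bool:
    """Insert only if the key is absent; returns False on a duplicate key."""
    # The expensive lookup (a possible disk seek) runs only if the filter says "maybe".
    if bloom.might_contain(key) and key in store:
        return False
    store[key] = value
    bloom.add(key)
    return True


store: Dict[str, str] = {}
bloom = BloomFilter()
first_insert = insert_row(store, bloom, "row-1", "v1")
duplicate_insert = insert_row(store, bloom, "row-1", "v2")
```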

Benchmarks

TPC-H (Analytics benchmark)
• 75 tablet servers (TS) + 1 master cluster
• 12 (spinning) disks each, enough RAM to fit dataset
• Using Kudu 0.5.0, Impala 2.2 with Kudu support, CDH 5.4
• TPC-H Scale Factor 100 (100GB)
• Example query:
• SELECT n_name, sum(l_extendedprice * (1 - l_discount)) as revenue FROM customer,
orders, lineitem, supplier, nation, region WHERE c_custkey = o_custkey AND
l_orderkey = o_orderkey AND l_suppkey = s_suppkey AND c_nationkey = s_nationkey
AND s_nationkey = n_nationkey AND n_regionkey = r_regionkey AND r_name = 'ASIA'
AND o_orderdate >= date '1994-01-01' AND o_orderdate < '1995-01-01' GROUP BY
n_name ORDER BY revenue desc;

- Kudu outperforms Parquet by 31% (geometric mean) for RAM-resident data
- Parquet likely to outperform Kudu for HDD-resident (larger IO requests)

What about Apache Phoenix?
• 10 node cluster (9 worker, 1 master)
• HBase 1.0, Phoenix 4.3
• TPC-H LINEITEM table only (6B rows)
Chart: Time (sec), log scale
          Load    TPCH Q1   COUNT(*)   COUNT(*) WHERE…   single-row lookup
Phoenix   2152    219       76         131               0.04
Kudu      1918    13.2      1.7        0.7               0.15
Parquet   155     9.3       1.4        1.5               1.37

What about NoSQL-style random access? (YCSB)
• YCSB 0.5.0-snapshot
• 10 node cluster (9 worker, 1 master)
• HBase 1.0
• 100M rows, 10M ops

But don’t trust me (a vendor)…

About Xiaomi
Mobile Internet Company Founded in 2010
Smartphones Software
E-commerce
MIUI
Cloud Services
App Store/Game
Payment/Finance
…
Smart Home
Smart Devices

Big Data Analytics Pipeline
Before Kudu
• Long pipeline
High latency (1 hour ~ 1 day), data conversion pains
• No ordering
Log arrival (storage) order is not exactly the logical order,
e.g. reading 2-3 days of logs to get the data for 1 day

Big Data Analysis Pipeline
Simplified With Kudu
• ETL Pipeline (0~10s latency)
Apps that need to prevent backpressure or require ETL
• Direct Pipeline (no latency)
Apps that don’t require ETL and have no backpressure issues
OLAP scan
Side table lookup
Result store

Use Case 1
Mobile service monitoring and tracing tool
Gather important RPC tracing events from mobile app and backend service.
Service monitoring & troubleshooting tool.
Requirements
• High write throughput
>5 Billion records/day and growing
• Query latest data and quick response
Identify and resolve issues quickly
• Can search for individual records
Easy for troubleshooting

Use Case 1: Benchmark
Environment
• 71 Node cluster
• Hardware
CPU: E5-2620 2.1GHz * 24 core
Memory: 64GB
Network: 1Gb
Disk: 12 HDD
• Software
Hadoop 2.6 / Impala 2.1 / Kudu
Data
• 1 day of server side tracing data
~2.6 Billion rows
~270 bytes/row
17 columns, 5 key columns

Use Case 1: Benchmark Results
Bulk load using Impala (INSERT INTO):
          Total Time(s)   Throughput(Total)   Throughput(per node)
Kudu      961.1           2.8M records/s      39.5k records/s
Parquet   114.6           23.5M records/s     331k records/s
Query latency:
          Q1    Q2    Q3    Q4    Q5    Q6
Kudu      1.4   2.0   2.3   3.1   1.3   0.9
Parquet   1.3   2.8   4.0   5.7   7.5   16.7
* HDFS Parquet file replication = 3, Kudu table replication = 3
* Each query was run 5 times and the results averaged
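
As a cross-check of the bulk-load figures, the per-node throughput is the total throughput divided by the 71 nodes in the cluster. The inputs below are the rounded values from the slide, so the results agree only approximately; the function name is hypothetical.

```python
NODES = 71  # cluster size from the benchmark environment


def per_node_throughput(total_records_per_s: float, nodes: int = NODES) -> float:
    """Spread the cluster-wide load throughput evenly over the nodes."""
    return total_records_per_s / nodes


kudu_per_node = per_node_throughput(2.8e6)      # about 39.4k records/s
parquet_per_node = per_node_throughput(23.5e6)  # about 331k records/s

# Bulk load took 961.1 s into Kudu versus 114.6 s into Parquet.
load_time_ratio = 961.1 / 114.6                 # about 8.4x longer for Kudu
```

The bulk-load gap is the write-side cost of the design; the query numbers show what that cost buys on the read side.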

Use Case 1: Result Analysis
• Lazy materialization
Ideal for search style query
Q6 returns only a few records (of a single user) with all columns
• Scan range pruning using primary index
Predicates on primary key
Q5 only scans 1 hour of data
• Future work
Primary index: speed-up order by and distinct
Hash Partitioning: speed-up count(distinct), no need for global shuffle/merge
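
Scan range pruning can be sketched with a binary search over keys kept in primary-key order: a predicate on the key turns into a start and end position, and rows outside that range are never touched. The hour-based keys and the name `prune_scan` are assumptions for illustration.

```python
import bisect
from typing import List


def prune_scan(sorted_keys: List[int], lower: int, upper: int) -> List[int]:
    """Return keys in [lower, upper) using binary search, not a full scan."""
    start = bisect.bisect_left(sorted_keys, lower)
    end = bisect.bisect_left(sorted_keys, upper)
    return sorted_keys[start:end]


# One key per minute over a day, expressed as minutes since midnight.
minutes_of_day = list(range(24 * 60))

# A predicate covering one hour (10:00 to 11:00) touches 60 of 1440 keys.
one_hour = prune_scan(minutes_of_day, 10 * 60, 11 * 60)
```

This is the effect described for Q5: the predicate on the primary key limits the scan to 1 hour of the day's data.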

Use Case 2
OLAP PaaS for ecosystem cloud
• Provide big data service for smart hardware startups (Xiaomi’s ecosystem members)
• OLAP database with some OLTP features
• Manage, ingest, and query your data and serve results in one place
Backend/Mobile App/Smart Device/IoT …

What Kudu is not

Kudu is…
• NOT a SQL database
• “BYO SQL”
• NOT a filesystem
• data must have tabular structure
• NOT a replacement for HBase or HDFS
• Cloudera continues to invest in those systems
• Many use cases where they’re still more appropriate
• NOT an in-memory database
• Very fast for memory-sized workloads, but can operate on larger data too!

Getting started

Getting started as a user
• http://getkudu.io
• kudu-user@googlegroups.com
• Quickstart VM
• Easiest way to get started
• Impala and Kudu in an easy-to-install VM
• CSD and Parcels
• For installation on a Cloudera Manager-managed cluster

Getting started as a developer
• http://github.com/cloudera/kudu
• All commits go here first
• Public gerrit: http://gerrit.cloudera.org
• All code reviews happen here
• Public JIRA: http://issues.cloudera.org
• Includes bugs going back to 2013. Come see our dirty laundry!
• kudu-dev@googlegroups.com
• Apache 2.0 license open source
• Contributions are welcome and encouraged!

http://getkudu.io/
@getkudu
