1© Cloudera, Inc. All rights reserved.
Apache Kudu Webinar Series
Technical Deep Dive
David Alves | Apache Kudu PMC | Cloudera
Kudu Webinar Series
Part 1: Lambda Architectures – Simplified by Apache Kudu
A look into the potential trouble involved with a lambda architecture, and how Apache Kudu can
dramatically simplify real-time analytics.
Part 2: Extending the Capabilities of Operational and Analytical Databases
An examination of how Apache Kudu expands the set of use cases that Cloudera’s Operational and
Analytical databases can handle.
Part 3: Data-in-Motion: Unlock the Value of Real-Time Data
Forrester will discuss their research into real-time data pipelines and analytics, and Cloudera will
discuss how to make it a reality.
Part 4: Technical Deep-Dive into Apache Kudu
An in-depth examination of the technical architecture and design of Apache Kudu, straight from a Kudu
PMC Member.
Updateable Analytic Storage
Simple real-time analytics and updates with Apache Kudu
Kudu: Storage for fast analytics on fast data
• Simplified architecture for building real-time analytic
applications
• Designed for next-generation hardware for faster analytic
performance across frameworks
• Native Hadoop storage engine
Flexibility for the right tools for the right use
case in one platform
• Only analytic database for big data with Kudu + Impala
• Simple real-time applications with Kudu + Spark
Use cases
• Time series data
• Machine data analytics
• Online reporting
[Diagram: Cloudera platform stack]
INTEGRATE: Sqoop (structured); Kafka, Flume (unstructured)
PROCESS, ANALYZE, SERVE: Batch (Spark, Hive, Pig, MapReduce); Stream (Spark); SQL (Impala); Search (Solr); Other (Kite)
UNIFIED SERVICES: Resource management (YARN); Security (Sentry, RecordService)
STORE: NoSQL (HBase); Filesystem (HDFS); Relational (Kudu); Other (Object Store)
Filling the Analytic Gap
[Chart: storage systems plotted by Pace of Data against Pace of Analysis; axis labels: Unchanging, Append-Only, Fast Changing, Frequent Updates, Real-Time]
• HDFS: fast scans, analytics and processing of stored data
• HBase: fast on-line updates & data serving
• Arbitrary Storage (Active Archive)
• Kudu fills the gap: fast analytics on fast-changing or frequently-updated data
Analytic Gap: modern analytic applications often require complex data flow & difficult integration work to move data between HBase & HDFS
Better Together
Kudu Benefits from Integration with the Apache Ecosystem
Spark – Stream Processing for Kudu
• Open standard for real-time stream processing
• Effective for automating decision processes and machine
learning
• Use Cases include: Time Series Data & Machine Data
Analytics
Impala – High-Performance BI & SQL for Kudu
• Open standard for interactive SQL queries
• Powers analytic database workloads with flexibility, scale, and
open architecture
• Use Cases include: Time Series Data & Online Reporting
Apache Kudu Community
Why Kudu?
Use Cases and Motivation
Why Kudu?
A simultaneous combination of sequential and random reads and writes
Time Series Data
Can you insert time series data in real time? How long does it take to prepare it for analysis? Can you get results and act fast enough to change outcomes?
Machine Data Analytics
Can you handle large volumes of machine-generated data? Do you have the tools to identify problems or threats? Can your system do machine learning?
Online Reporting
How fast can you add data to your data store? Are you trading off the ability to do broad analytics for the ability to make updates? Are you retaining only part of your data?
Next generation hardware
Solid-state Storage
Cheaper and faster every year. Persistent memory (3D XPoint™). Kudu can take advantage of SSD and NVM using Intel’s NVM Library.
Cheaper, Bigger Memory
RAM is cheaper and bigger every day. Kudu runs smoothly with huge RAM. Written in C++ to avoid GC issues.
Efficiency on Modern CPUs
Modern CPUs are adding cores and SIMD width, not GHz. Kudu takes advantage of SIMD instructions and concurrent data structures.
Apache Kudu: Scalable and fast tabular storage
Scalable
• Tested up to 275 nodes (~3PB cluster)
• Designed to scale to 1000s of nodes and tens of PBs
Fast
• Millions of read/write operations per second across cluster
• Multiple GB/second read throughput per node
Tabular
• Represents data in structured tables like a relational database
• Individual record-level access to 100+ billion row tables
Deep Dive:
Replication And Fault Tolerance
Metadata
• Replicated master
• Acts as a tablet directory
• Acts as a catalog (which tables exist, etc)
• Acts as a load balancer (tracks TS liveness, re-replicates under-replicated
tablets)
• Caches all metadata in RAM for high performance
• Client configured with master addresses
• Asks master for tablet locations as needed and caches them
Client: Hey Master! Where is the row for ‘tlipcon’ in table “T”?
Master: It’s part of tablet 2, which is on servers {Z,Y,X}. BTW, here’s info on other tablets you might care about: T1, T2, T3, …
Client (to the tablet server): UPDATE tlipcon SET col=foo
Client Meta Cache: T1: …, T2: …, T3: …
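The lookup-then-cache exchange can be sketched in a few lines of Python. This is a minimal illustration, not Kudu's client API: the `Master` and `Client` classes, the `get_tablet_locations` method, and the start keys are all hypothetical, and the real client caches per-table key ranges with expiry.

```python
from bisect import bisect_right


class Master:
    """Hypothetical stand-in for the master's tablet directory."""

    def __init__(self, tablets):
        # Each entry: (start_key, tablet_id, servers); "" is the lowest start key.
        self.tablets = sorted(tablets)
        self.lookups = 0

    def get_tablet_locations(self):
        # Return every tablet, mirroring "here's info on other tablets".
        self.lookups += 1
        return list(self.tablets)


class Client:
    def __init__(self, master):
        self.master = master
        self.meta_cache = []  # filled lazily on the first lookup

    def find_tablet(self, key):
        if not self.meta_cache:
            self.meta_cache = self.master.get_tablet_locations()
        starts = [start for start, _, _ in self.meta_cache]
        # The owning tablet is the last one whose start key is <= key.
        _, tablet_id, servers = self.meta_cache[bisect_right(starts, key) - 1]
        return tablet_id, servers


master = Master([("", "T1", ["X", "Y", "Z"]),
                 ("m", "T2", ["Z", "Y", "X"]),
                 ("u", "T3", ["W", "X", "Y"])])
client = Client(master)
first = client.find_tablet("tlipcon")   # asks the master, then caches
second = client.find_tablet("alice")    # answered from the meta cache
```

Only the first call reaches the master; later lookups are served locally, which is why the master stays off the hot path for reads and writes.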
Raft consensus
Replicas of Tablet 1, each with its own WAL: TS A (LEADER), TS B (FOLLOWER), TS C (FOLLOWER)
1a. Client->Leader: Write() RPC
2a. Leader->Followers: UpdateConsensus() RPC
2b. Leader writes local WAL
3. Follower: write WAL (on each follower)
4. Follower->Leader: success
5. Leader has achieved majority
6. Leader->Client: Success!
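The majority rule in steps 2 to 6 can be sketched as follows. This is a simplified model under stated assumptions: the `Replica` class and `replicate_write` function are hypothetical names, replication is done sequentially rather than by parallel RPCs, and Raft terms, log indexes, and retries are left out.

```python
class Replica:
    """Hypothetical tablet replica with an in-memory write-ahead log."""

    def __init__(self, name, alive=True):
        self.name = name
        self.alive = alive
        self.wal = []

    def append(self, op):
        """Append op to the local WAL; return True on success."""
        if not self.alive:
            return False
        self.wal.append(op)
        return True


def replicate_write(leader, followers, op):
    """Return True once a majority of all replicas has logged the op."""
    acks = 1 if leader.append(op) else 0          # step 2b
    for follower in followers:                    # steps 2a, 3, 4
        if follower.append(op):
            acks += 1
    majority = (1 + len(followers)) // 2 + 1      # step 5
    return acks >= majority                       # step 6


leader = Replica("TS A")
ok_with_one_down = replicate_write(
    leader, [Replica("TS B"), Replica("TS C", alive=False)], "UPDATE ...")
ok_with_two_down = replicate_write(
    leader, [Replica("TS B", alive=False), Replica("TS C", alive=False)], "UPDATE ...")
```

With three replicas, the write succeeds when one follower is down and fails when both are, since the leader alone is not a majority.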
Fault tolerance
• Transient FOLLOWER failure:
• Leader can still achieve majority
• Restart follower TS within 5 min and it will rejoin transparently
• Transient LEADER failure:
• Followers expect to hear a heartbeat from their leader every 1.5 seconds
• 3 missed heartbeats: leader election!
• New LEADER is elected from remaining nodes within a few seconds
• Restart within 5 min and it rejoins as a FOLLOWER
• N replicas tolerate (N-1)/2 failures, rounded down (for example, 3 replicas tolerate 1 failure and 5 tolerate 2)
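The failure bound follows from the majority requirement; a short worked example in Python, where the function names are illustrative only:

```python
def majority(n_replicas):
    """Smallest number of replicas that forms a Raft majority."""
    return n_replicas // 2 + 1


def tolerated_failures(n_replicas):
    """Replicas that can fail while a majority remains available."""
    return n_replicas - majority(n_replicas)


for n in (3, 5, 7):
    print(n, majority(n), tolerated_failures(n))
```

Note that an even replica count adds no tolerance: 4 replicas still tolerate only 1 failure, the same as 3.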
Fault tolerance (2)
• Permanent failure (Tablet Copy):
• Leader notices that a follower has been dead for 5 minutes
• Evicts that follower
• Master selects a new replica (in PRE_VOTER state)
• Leader copies the data over to the new one, which joins as a new FOLLOWER
• New replica is assigned VOTER state
• Cluster change_config lets you add tablet servers to, or remove them from, a tablet’s configuration (only supported from the CLI tool)
Deep Dive:
Columnar Storage
Columnar Storage
• Improved Scan performance
• Predicates (e.g. WHERE time >= 2016-05-08T00:00:00) can be evaluated without reading
unnecessary data from other columns
• Efficient encodings can dramatically improve compression ratios, which reduces effective IO load
• Typed, homogeneous data plays well to modern processor strengths (vectorization, pipelining)
• At the cost of random access performance
• Single row access requires a number of seeks proportional to the number of columns
• BUT, random access is becoming cheaper (Cheap RAM, SSDs, NVRAM)
Row Storage
Columns: tweet_id, user_name, created_at, text
{23059873, newsycbot, 1442865158, Visual exp…}
{22309487, RideImpala, 1442828307, Introducing …}
…
Scans have to read all the data, no encodings
Columnar storage
tweet_id: {25059873, 22309487, 23059861, 23010982}
user_name: {newsycbot, RideImpala, fastly, llvmorg}
created_at: {1442865158, 1442828307, 1442865156, 1442865155}
text: {Visual exp…, Introducing .., Missing July…, LLVM 3.7….}
Columnar storage
tweet_id (1GB): {25059873, 22309487, 23059861, 23010982}
user_name (2GB): {newsycbot, RideImpala, fastly, llvmorg}
created_at (1GB): {1442865158, 1442828307, 1442865156, 1442865155}
text (200GB): {Visual exp…, Introducing .., Missing July…, LLVM 3.7….}
SELECT COUNT(*) FROM tweets WHERE user_name = 'newsycbot';
Only read 1 column (user_name)
…AND the door is open to big IO gains with compression and encoding
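To make the difference concrete, here is a small Python sketch that runs the same count against a row layout and a column layout and tallies how many values each scan has to touch. It is an in-memory illustration only; the function names are hypothetical and nothing here reflects Kudu's on-disk format.

```python
columns = {
    "tweet_id": [25059873, 22309487, 23059861, 23010982],
    "user_name": ["newsycbot", "RideImpala", "fastly", "llvmorg"],
    "created_at": [1442865158, 1442828307, 1442865156, 1442865155],
    "text": ["Visual exp…", "Introducing ..", "Missing July…", "LLVM 3.7…."],
}
# Row layout: one tuple per tweet, all four fields stored together.
rows = list(zip(*columns.values()))


def count_row_store(rows, user):
    """Count matches; every field of every row is read."""
    matches = touched = 0
    for row in rows:
        touched += len(row)
        if row[1] == user:
            matches += 1
    return matches, touched


def count_column_store(columns, user):
    """Count matches; only the user_name column is read."""
    column = columns["user_name"]
    return sum(1 for value in column if value == user), len(column)


row_result = count_row_store(rows, "newsycbot")
col_result = count_column_store(columns, "newsycbot")
```

Both scans find the same single match, but the row scan touches 16 values and the column scan touches 4.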
Columnar Storage – Other Encodings/Compression
Available Encodings:
• Dictionary (Strings, Binary)
• Bitshuffle (Numeric)
• RLE (Numeric, Bool)
Available Compression:
• Snappy
• LZ4
• ZLib
Kudu 1.3 ships with “good” defaults for most cases
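As an illustration of why these encodings help on columnar data, here is a minimal Python sketch of run-length and dictionary encoding. It shows the general techniques only; the function names are made up for this example and the output is not Kudu's actual block format.

```python
def rle_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return [tuple(run) for run in runs]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original list."""
    return [value for value, length in runs for _ in range(length)]


def dict_encode(values):
    """Replace each value with a small integer code into a dictionary."""
    dictionary = {}
    codes = [dictionary.setdefault(value, len(dictionary)) for value in values]
    return list(dictionary), codes


flags = [True, True, True, False, False, True]
names = ["newsycbot", "fastly", "newsycbot", "newsycbot"]
encoded_flags = rle_encode(flags)
encoded_names = dict_encode(names)
```

Both encodings pay off when a column holds few distinct values or long runs, which is common once values of one type are stored together.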
Deep Dive:
Write and Read Paths
LSM vs Kudu
• LSM – Log Structured Merge (Cassandra, HBase, etc)
• Inserts and updates all go to an in-memory map (MemStore) and later flush to
on-disk files (HFile/SSTable)
• Reads perform an on-the-fly merge of all on-disk HFiles
• Kudu
• Shares some traits (memstores, compactions)
• More complex.
• Slower writes in exchange for faster reads (especially scans)
LSM Insert Path
INSERT into MemStore:
Row=r1 col=c1 val=“blah”
Row=r1 col=c2 val=“1”
MemStore flushes to HFile 1:
Row=r1 col=c1 val=“blah”
Row=r1 col=c2 val=“1”
LSM Insert Path
INSERT into MemStore:
Row=r2 col=c1 val=“blah2”
Row=r2 col=c2 val=“2”
MemStore flushes to HFile 2:
Row=r2 col=c1 val=“blah2”
Row=r2 col=c2 val=“2”
HFile 1 (from the earlier flush):
Row=r1 col=c1 val=“blah”
Row=r1 col=c2 val=“1”
LSM Update path
UPDATE goes to MemStore:
Row=r2 col=c1 val=“newval”
HFile 1:
Row=r1 col=c1 val=“blah”
Row=r1 col=c2 val=“2”
HFile 2:
Row=r2 col=c1 val=“v2”
Row=r2 col=c2 val=“5”
Note: all updates are “fully decoupled” from reads. Random-write workload is transformed to fully sequential!
LSM Read path
MemStore:
Row=r2 col=c1 val=“newval”
HFile 1:
Row=r1 col=c1 val=“blah”
Row=r1 col=c2 val=“2”
HFile 2:
Row=r2 col=c1 val=“v2”
Row=r2 col=c2 val=“5”
Merge based on string row keys:
R1: c1=blah c2=2
R2: c1=newval c2=5
….
• CPU intensive! Must always read rowkeys
• Any given row may exist across multiple HFiles: must always merge!
• The more HFiles to merge, the slower it reads
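The merge on read can be sketched as a k-way merge over sorted stores, with the newest version of each cell winning. This is a simplified model under stated assumptions: `lsm_scan` is a hypothetical name, stores are plain sorted lists, and tombstones and timestamps are omitted.

```python
import heapq


def lsm_scan(stores):
    """Merge sorted stores (oldest first) into {row_key: {col: val}}."""
    # Tag each cell with the negated age of its store, so that for an equal
    # (row, col) the cell from the newest store sorts first.
    tagged = (
        [(row, col, -age, val) for row, col, val in store]
        for age, store in enumerate(stores)
    )
    result = {}
    for row, col, _, val in heapq.merge(*tagged):
        # String comparisons on row keys happen inside heapq.merge.
        # setdefault keeps the first (newest) value seen for each cell.
        result.setdefault(row, {}).setdefault(col, val)
    return result


hfile1 = [("r1", "c1", "blah"), ("r1", "c2", "2")]
hfile2 = [("r2", "c1", "v2"), ("r2", "c2", "5")]
memstore = [("r2", "c1", "newval")]
merged = lsm_scan([hfile1, hfile2, memstore])
```

Every scan repeats this merge, and its cost grows with the number of stores, which is the overhead the slide calls out.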
Kudu storage – Inserts and Flushes
INSERT(“todd”, “$1000”, “engineer”) goes to the MemRowSet
MemRowSet flushes to DiskRowSet 1 (columns: name, pay, role)
Kudu storage – Inserts and Flushes
INSERT(“doug”, “$1B”, “Hadoop man”) goes to the MemRowSet
MemRowSet flushes to DiskRowSet 2 (columns: name, pay, role)
DiskRowSet 1 (columns: name, pay, role) remains from the earlier flush
Kudu storage - Updates
MemRowSet
DiskRowSet 1: base data (name, pay, role) + Delta MS
DiskRowSet 2: base data (name, pay, role) + Delta MS
Each DiskRowSet has its own DeltaMemStore to accumulate updates
Kudu storage - Updates
UPDATE set pay=“$1M” WHERE name=“todd”
1. Is the row in DiskRowSet 2? (check bloom filters) Bloom says: no!
2. Is the row in DiskRowSet 1? (check bloom filters) Bloom says: maybe!
3. Search key column to find offset: rowid = 150
4. Record the update in the Delta MS of DiskRowSet 1: 150: pay=$1M
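The update steps can be sketched in Python as below. This is an illustrative model, not Kudu's implementation: the `BloomFilter` and `DiskRowSet` classes, the filter sizing, and the SHA-256 based hashing are assumptions made for the example.

```python
import hashlib
from bisect import bisect_left


class BloomFilter:
    """Tiny Bloom filter; answers "no" or "maybe", never a false "no"."""

    def __init__(self, n_bits=1024, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = 0

    def _positions(self, key):
        for i in range(self.n_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, key):
        for position in self._positions(key):
            self.bits |= 1 << position

    def might_contain(self, key):
        return all((self.bits >> p) & 1 for p in self._positions(key))


class DiskRowSet:
    def __init__(self, keys):
        self.keys = sorted(keys)        # key column of the base data
        self.bloom = BloomFilter()
        for key in self.keys:
            self.bloom.add(key)
        self.delta_mem_store = {}       # rowid -> {column: new value}

    def find_rowid(self, key):
        if not self.bloom.might_contain(key):
            return None                 # "Bloom says: no!"
        i = bisect_left(self.keys, key)  # "maybe": search the key column
        return i if i < len(self.keys) and self.keys[i] == key else None


def update(rowsets, key, column, value):
    """Record the update in the delta store of the rowset holding the key."""
    for rowset in rowsets:
        rowid = rowset.find_rowid(key)
        if rowid is not None:
            rowset.delta_mem_store.setdefault(rowid, {})[column] = value
            return rowid
    return None


drs1 = DiskRowSet(["alice", "todd"])
drs2 = DiskRowSet(["doug"])
rowid = update([drs2, drs1], "todd", "pay", "$1M")
```

The base data is never rewritten by the update; only a small delta keyed by row offset is recorded.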
Kudu storage – Read path
MemRowSet
DiskRowSet 1: base data (name, pay, role) + Delta MS (150: pay=$1M)
DiskRowSet 2: base data (name, pay, role) + Delta MS
Read rows in DiskRowSet 2. Then, read rows in DiskRowSet 1.
• Any row is only in exactly one DiskRowSet: no need to merge cross-DRS!
• Updates are merged based on ordinal offset within DRS: array indexing, no string compares
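A minimal sketch of this read path, assuming base data held as per-column lists and deltas keyed by row offset. The function names are hypothetical, and real scans work on column blocks rather than materializing one row at a time.

```python
def scan_rowset(base_columns, deltas):
    """Yield rows of one DiskRowSet with its deltas applied by offset."""
    n_rows = len(next(iter(base_columns.values())))
    for rowid in range(n_rows):
        row = {col: values[rowid] for col, values in base_columns.items()}
        # Delta lookup is by ordinal offset: no key comparison is needed.
        row.update(deltas.get(rowid, {}))
        yield row


def scan_tablet(rowsets):
    """Read each rowset in turn; rows never need merging across rowsets."""
    for base_columns, deltas in rowsets:
        yield from scan_rowset(base_columns, deltas)


base = {
    "name": ["alice", "todd"],
    "pay": ["$500", "$1000"],
    "role": ["analyst", "engineer"],
}
deltas = {1: {"pay": "$1M"}}
rows = list(scan_tablet([(base, deltas)]))
```

Compared with the LSM merge, no row keys are compared during the scan, which is the source of the faster reads claimed earlier.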
Kudu storage – Delta flushes
MemRowSet
DiskRowSet 1: base data (name, pay, role) + Delta MS
DiskRowSet 2: base data (name, pay, role) + Delta MS
Delta MS (0: pay=foo) flushes to a REDO DeltaFile
A REDO delta indicates how to transform between the ‘base data’ (columnar) and a later version
Kudu storage – Major delta compaction
DiskRowSet (pre-compaction): base data (name, pay, role) + Delta MS + REDO DeltaFile, REDO DeltaFile, REDO DeltaFile
Many deltas accumulate: lots of delta application work on reads
DiskRowSet (post-compaction): base data (name, pay, role) + Delta MS + unmerged REDO deltas + UNDO deltas
• Merge updates for columns with high update percentage
• If a column has few updates, it doesn’t need to be re-written: those deltas are maintained in a new DeltaFile
Kudu storage – RowSet Compactions
Before compaction:
DRS 1 (32MB): [PK=alice], [PK=joe], [PK=linda], [PK=zach]
DRS 2 (32MB): [PK=bob], [PK=jon], [PK=mary], [PK=zeke]
DRS 3 (32MB): [PK=carl], [PK=julie], [PK=omar], [PK=zoe]
Writes for “chris” have to perform bloom lookups on all 3 RS
After compaction:
DRS 4 (32MB): [alice, bob, carl, joe]
DRS 5 (32MB): [jon, julie, linda, mary]
DRS 6 (32MB): [omar, zach, zeke, zoe]
Reorganize rows to avoid rowsets with overlapping key ranges
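The effect of the compaction can be reproduced with a short Python sketch that merges the three sorted rowsets and re-splits them into non-overlapping ranges. The function names are hypothetical, and a fixed row count per output stands in for the 32MB size target.

```python
import heapq


def compact(rowsets, rows_per_output):
    """Merge sorted rowsets and re-split them into non-overlapping ranges."""
    merged = list(heapq.merge(*rowsets))
    return [merged[i:i + rows_per_output]
            for i in range(0, len(merged), rows_per_output)]


def rowsets_to_check(rowsets, key):
    """Rowsets whose [min, max] key range could contain the key."""
    return [rs for rs in rowsets if rs[0] <= key <= rs[-1]]


before = [
    ["alice", "joe", "linda", "zach"],
    ["bob", "jon", "mary", "zeke"],
    ["carl", "julie", "omar", "zoe"],
]
after = compact(before, rows_per_output=4)
checks_before = len(rowsets_to_check(before, "chris"))
checks_after = len(rowsets_to_check(after, "chris"))
```

Before compaction a write for "chris" falls inside the key range of all three rowsets; afterwards it falls inside only one.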
Kudu Storage - Compactions
• Main Idea: Always be compacting!
• Compactions run continuously to prevent IO storms
• “Budgeted” RS compactions: What is the best way to spend X MBs of IO?
• Physical/Logical decoupling: different replicas run compactions at different times
Deep Dive
Partitioning
Partitioning
• Kudu has flexible policies for distributing data among partitions
• Hash partitioning is built in, and can be combined with range partitioning
• Keys are ordered within a partition
• Key order matters
• Partitioning key order dictates distribution
• Primary key order affects how much data is read for scans
Primary Key selection
• Example - Time series data:
• “time” - timestamp
• “series” - {region, server, metric}
Ordered by (series, time):
(us-east.appserver01.loadavg, 2016-05-09T15:14:00Z)
(us-east.appserver01.loadavg, 2016-05-09T15:15:00Z)
(us-west.dbserver03.rss, 2016-05-09T15:14:30Z)
(us-west.dbserver03.rss, 2016-05-09T15:14:30Z)
Ordered by (time, series):
(2016-05-09T15:14:00Z, us-east.appserver01.loadavg)
(2016-05-09T15:14:30Z, us-west.dbserver03.rss)
(2016-05-09T15:15:00Z, us-east.appserver01.loadavg)
(2016-05-09T15:14:30Z, us-west.dbserver03.rss)
SELECT * WHERE series = 'us-east.appserver01.loadavg';
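The effect of key order on this query can be shown with a small Python sketch. The sample points are hypothetical (two series, three timestamps each), and `scan_cost` is an illustrative helper, not a Kudu API: it reports how many rows match and how many rows lie between the first and last match in key order.

```python
SERIES = ["us-east.appserver01.loadavg", "us-west.dbserver03.rss"]
TIMES = ["2016-05-09T15:14:00Z", "2016-05-09T15:14:30Z", "2016-05-09T15:15:00Z"]

# The same six points, sorted under each candidate primary key order.
by_series_time = sorted((s, t) for s in SERIES for t in TIMES)
by_time_series = sorted((t, s) for s in SERIES for t in TIMES)


def scan_cost(sorted_keys, series_position, wanted):
    """Return (rows matched, rows between the first and last match)."""
    hits = [i for i, key in enumerate(sorted_keys)
            if key[series_position] == wanted]
    return len(hits), hits[-1] - hits[0] + 1


wanted = "us-east.appserver01.loadavg"
cost_series_first = scan_cost(by_series_time, 0, wanted)
cost_time_first = scan_cost(by_time_series, 1, wanted)
```

With (series, time) the three matching rows are contiguous; with (time, series) they are interleaved with the other series, so the scan has to cover more rows to return the same result.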
Partitioning — By Time Range (inserts)
All inserts go to the latest partition
Big scans (across large time intervals) can be parallelized across many partitions
Partitioning — By Series Range
Inserts are spread among all partitions
Scans are over a single partition
Partitioning — By Series Range
Partitions can become unbalanced,
resulting in hot spotting
Partitioning — By Series Hash (inserts)
Inserts are spread among all partitions
Scans are over a single partition
Partitioning — By Series Hash
Partitions grow over time, eventually becoming too big for a single server
Partitioning — By Series Hash + Time Range
Inserts are spread among all partitions
in the latest time range
Big scans (across large time intervals)
can be parallelized across partitions
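Combining the two schemes can be sketched as a function that maps a row to a (hash bucket, time range) pair. Everything here is an assumption for illustration: the bucket count, the range boundaries, and the use of CRC32 as the hash are not what Kudu itself uses.

```python
import zlib
from bisect import bisect_right

N_BUCKETS = 4  # hypothetical number of hash buckets on "series"
# Hypothetical range split points on "time"; ISO-8601 strings sort by time.
RANGE_BOUNDS = ["2016-05-01T00:00:00Z", "2016-06-01T00:00:00Z"]


def partition_for(series, time):
    """Map a row to its (hash bucket, time range index) partition."""
    bucket = zlib.crc32(series.encode("utf-8")) % N_BUCKETS
    time_range = bisect_right(RANGE_BOUNDS, time)
    return bucket, time_range


may = partition_for("us-east.appserver01.loadavg", "2016-05-09T15:14:00Z")
june = partition_for("us-east.appserver01.loadavg", "2016-06-09T15:14:00Z")
```

A given series always lands in the same bucket, while new time ranges add fresh partitions, so no single partition grows without bound.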
Next Steps
Get Started with Kudu & Cloudera
• www.cloudera.com/downloads
• https://blog.cloudera.com/?s=kudu
Start Contributing to Kudu
• http://kudu.apache.org/
Thank you
David Alves – Apache Kudu PMC
@dribeiroalves

Apache Kudu: Technical Deep Dive



  • 1. 1© Cloudera, Inc. All rights reserved. Apache Kudu Webinar Series Technical Deep Dive David Alves| Apache Kudu PMC | Cloudera
  • 2. 2© Cloudera, Inc. All rights reserved. Kudu Webinar Series Part 1: Lambda Architectures – Simplified by Apache Kudu A look into the potential trouble involved with a lambda architecture, and how Apache Kudu can dramatically simplify real-time analytics. Part 2: Extending the Capabilities of Operational and Analytical Databases An examination of how Apache Kudu expands the set of use cases that Cloudera’s Operational and Analytical databases can handle. Part 3: Data-in-Motion: Unlock the Value of Real-Time Data Forrester will discuss their research into real-time data pipelines and analytics, and Cloudera will discuss how to make it a reality. Part 4: Technical Deep-Dive into Apache Kudu An in-depth examination of the technical architecture and design of Apache Kudu, straight from a Kudu PMC Member.
  • 3. 3© Cloudera, Inc. All rights reserved. Updateable Analytic Storage Simple real-time analytics and updates with Apache Kudu Kudu: Storage for fast analytics on fast data • Simplified architecture for building real-time analytic applications • Designed for next-generation hardware for faster analytic performance across frameworks • Native Hadoop storage engine Flexibility for the right tools for the right use case in one platform • Only analytic database for big data with Kudu + Impala • Simple real-time applications with Kudu + Spark Use cases • Time series data • Machine data analytics • Online reporting STRUCTURED Sqoop UNSTRUCTURED Kafka, Flume PROCESS, ANALYZE, SERVE UNIFIED SERVICES RESOURCE MANAGEMENT YARN SECURITY Sentry, RecordService STORE INTEGRATE BATCH Spark, Hive, Pig MapReduce STREAM Spark SQL Impala SEARCH Solr OTHER Kite NoSQL HBase OTHER Object Store FILESYSTEM HDFS RELATIONAL Kudu
  • 4. 4© Cloudera, Inc. All rights reserved. HDFS Fast Scans, Analytics and Processing of Stored Data Fast On-Line Updates & Data Serving Arbitrary Storage (Active Archive) Fast Analytics (on fast-changing or frequently-updated data) Filling the Analytic Gap Unchanging Fast Changing Frequent Updates HBase Append-Only Real-Time Kudu Kudu fills the Gap Modern analytic applications often require complex data flow & difficult integration work to move data between HBase & HDFS Analytic Gap Pace of Analysis PaceofData
  • 5. 5© Cloudera, Inc. All rights reserved. Better Together Kudu Benefits from Integration with the Apache Ecosystem Spark – Stream Processing for Kudu • Open standard for real-time stream processing • Effective for automating decision processes and machine learning • Use Cases include: Time Series Data & Machine Data Analytics Impala – High-Performance BI & SQL for Kudu • Open standard for interactive SQL queries • Powers analytic database workloads with flexibility, scale, and open architecture • Use Cases include: Time Series Data & Online Reporting
  • 6. 6© Cloudera, Inc. All rights reserved. Apache Kudu Community
  • 7. 7© Cloudera, Inc. All rights reserved. Why Kudu? Use Cases and Motivation
  • 8. 8© Cloudera, Inc. All rights reserved. Why Kudu? A simultaneous combination of sequential and random reads and writes Can you insert time series data in real time? How long does it take to prepare it for analysis? Can you get results and act fast enough to change outcomes? Can you handle large volumes of machine-generated data? Do you have the tools to identify problems or threats? Can your system do machine learning? How fast can you add data to your data store? Are you trading off the ability to do broad analytics for the ability to make updates? Are you retaining only part of your data? Time Series Data Machine Data Analytics Online Reporting
  • 9. 9© Cloudera, Inc. All rights reserved. Next generation hardware Cheaper and faster every year. Persistent memory (3D XPoint™) Kudu can take advantage of SSD and NVM using Intel’s NVM Library. RAM is cheaper and bigger every day. Kudu runs smoothly with huge RAM. Written in C++ to avoid GC issues. Modern CPUs are adding cores and SIMD width, not GHz. Kudu takes advantage of SIMD instructions and concurrent data structures. Solid-state Storage Cheaper, Bigger Memory Efficiency on Modern CPUs
  • 10. 10© Cloudera, Inc. All rights reserved. Apache Kudu: Scalable and fast tabular storage Scalable • Tested up to 275 nodes (~3PB cluster) • Designed to scale to 1000s of nodes and tens of PBs Fast • Millions of read/write operations per second across cluster • Multiple GB/second read throughput per node Tabular • Represents data in structured tables like a relational database • Individual record-level access to 100+ billion row tables
  • 11. 11© Cloudera, Inc. All rights reserved. Deep Dive: Replication And Fault Tolerance
  • 12. 12© Cloudera, Inc. All rights reserved. Metadata • Replicated master • Acts as a tablet directory • Acts as a catalog (which tables exist, etc) • Acts as a load balancer (tracks TS liveness, re-replicates under-replicated tablets) • Caches all metadata in RAM for high performance • Client configured with master addresses • Asks master for tablet locations as needed and caches them
  • 13. Client: “Hey Master! Where is the row for ‘tlipcon’ in table “T”?” Master: “It’s part of tablet 2, which is on servers {Z,Y,X}. BTW, here’s info on other tablets you might care about: T1, T2, T3, …” Client then sends: UPDATE tlipcon SET col=foo. Meta Cache: T1: … T2: … T3: …
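The exchange on this slide can be sketched as a client that asks the master only on a cache miss. This is a minimal illustration, not the Kudu client API: `FakeMaster`, `CachingClient`, the key ranges, and the server lists for T1 and T3 are all hypothetical.

```python
class FakeMaster:
    """Hypothetical stand-in for the master's tablet directory."""

    def __init__(self, tablets):
        # Each entry: (start_key, end_key, tablet_id, servers); end_key None = unbounded.
        self.tablets = tablets
        self.lookups = 0

    def locate(self, key):
        self.lookups += 1
        for entry in self.tablets:
            start, end, _, _ = entry
            if start <= key and (end is None or key < end):
                return entry
        raise KeyError(key)


class CachingClient:
    """Hypothetical client that remembers tablet locations it has seen."""

    def __init__(self, master):
        self.master = master
        self.meta_cache = []  # tablet entries learned so far

    def find_tablet(self, key):
        # Serve from the local meta cache when a cached range covers the key.
        for start, end, tablet_id, servers in self.meta_cache:
            if start <= key and (end is None or key < end):
                return tablet_id, servers
        # Cache miss: ask the master and remember the answer.
        entry = self.master.locate(key)
        self.meta_cache.append(entry)
        return entry[2], entry[3]


master = FakeMaster([
    ("", "m", "T1", ["A", "B", "C"]),   # assumed key ranges and servers
    ("m", "u", "T2", ["Z", "Y", "X"]),
    ("u", None, "T3", ["A", "Y", "C"]),
])
client = CachingClient(master)
first = client.find_tablet("tlipcon")   # cache miss: asks the master
second = client.find_tablet("todd")     # same tablet: served from the cache
```

After both calls the master has been consulted once, which is the point of the meta cache: steady-state writes go straight to tablet servers.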
  • 14. Raft consensus. TS A: Tablet 1 (LEADER), WAL. TS B: Tablet 1 (FOLLOWER), WAL. TS C: Tablet 1 (FOLLOWER), WAL. 1a. Client->Leader: Write() RPC 2a. Leader->Followers: UpdateConsensus() RPC 2b. Leader writes local WAL 3. Follower: write WAL (on each follower) 4. Follower->Leader: success 5. Leader has achieved majority 6. Leader->Client: Success!
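The numbered steps on this slide can be sketched as a function that reports success once a majority of replicas have logged the entry. It is a simplified, sequential model under stated assumptions: WALs are plain lists, `follower_up` is a hypothetical flag for whether each follower responds, and real Raft details (terms, log matching, parallel RPCs, retries) are omitted.

```python
def replicate(entry, leader_wal, follower_wals, follower_up):
    """Return True once a majority of replicas have logged the entry."""
    leader_wal.append(entry)            # step 2b: leader writes its local WAL
    acks = 1                            # the leader counts toward the majority
    for wal, up in zip(follower_wals, follower_up):
        if up:                          # steps 2a-4: follower logs, then acks
            wal.append(entry)
            acks += 1
    replicas = 1 + len(follower_wals)
    majority = replicas // 2 + 1        # step 5: majority check
    return acks >= majority             # step 6: only then reply to the client


leader, f1, f2 = [], [], []
ok = replicate("UPDATE 1", leader, [f1, f2], [True, False])        # one follower down
stalled = replicate("UPDATE 2", leader, [f1, f2], [False, False])  # both followers down
```

With three replicas, one unreachable follower does not block the write, while losing both followers leaves the entry logged on the leader but unacknowledged.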
  • 15. Fault tolerance • Transient FOLLOWER failure: • Leader can still achieve majority • Restart follower TS within 5 min and it will rejoin transparently • Transient LEADER failure: • Followers expect to hear a heartbeat from their leader every 1.5 seconds • 3 missed heartbeats: leader election! • New LEADER is elected from remaining nodes within a few seconds • Restart within 5 min and it rejoins as a FOLLOWER • N replicas handle (N-1)/2 failures
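The last bullet follows from the majority rule: a write needs floor(N/2)+1 replicas, so the remainder may fail. A small sketch of that arithmetic, with hypothetical function names, using integer division for (N-1)/2:

```python
def majority(replicas):
    """Smallest number of replicas that forms a majority."""
    return replicas // 2 + 1


def tolerated_failures(replicas):
    """Replicas that can fail while a majority is still reachable."""
    return (replicas - 1) // 2


table = {n: (majority(n), tolerated_failures(n)) for n in (3, 4, 5, 7)}
```

An even replica count adds no extra tolerance over the next lower odd count, which is why replication factors of 3 or 5 are the usual choices.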
  • 16. Fault tolerance (2) • Permanent failure (Tablet Copy): • Leader notices that a follower has been dead for 5 minutes • Evicts that follower • Master selects a new replica (in PRE_VOTER state) • Leader copies the data over to the new one, which joins as a new FOLLOWER • New replica is assigned VOTER state • Cluster change_config lets you add/remove tablet servers to a tablet’s configuration (only supported from CLI tool)
  • 17. Deep Dive: Columnar Storage
  • 18. Columnar Storage • Improved scan performance • Predicates (e.g. WHERE time >= 2016-05-08T00:00:00) can be evaluated without reading unnecessary data from other columns • Efficient encodings can dramatically improve compression ratios, which reduces effective IO load • Typed, homogeneous data plays well to modern processor strengths (vectorization, pipelining) • At the cost of random access performance • Single row access requires a number of seeks proportional to the number of columns • BUT, random access is becoming cheaper (cheap RAM, SSDs, NVRAM)
  • 19. Row Storage. Columns: Tweet_id, user_name, created_at, text. Rows: {23059873, newsycbot, 1442865158, Visual exp…} {22309487, RideImpala, 1442828307, Introducing …} … Scans have to read all the data, no encodings
  • 20. Columnar storage. Tweet_id: {25059873, 22309487, 23059861, 23010982} User_name: {newsycbot, RideImpala, fastly, llvmorg} Created_at: {1442865158, 1442828307, 1442865156, 1442865155} text: {Visual exp…, Introducing .., Missing July…, LLVM 3.7….}
  • 21. Columnar storage. Tweet_id (1GB): {25059873, 22309487, 23059861, 23010982} User_name (2GB): {newsycbot, RideImpala, fastly, llvmorg} Created_at (1GB): {1442865158, 1442828307, 1442865156, 1442865155} text (200GB): {Visual exp…, Introducing .., Missing July…, LLVM 3.7….} SELECT COUNT(*) FROM tweets WHERE user_name = ‘newsycbot’; Only read 1 column …AND door open to big IO gains with compression and encoding
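The difference between the two layouts for this query can be sketched by counting how many values each scan touches. The sample values come from the slide; the in-memory dict and tuples are an assumption for illustration and say nothing about Kudu's on-disk format.

```python
columns = {
    "tweet_id": [25059873, 22309487, 23059861, 23010982],
    "user_name": ["newsycbot", "RideImpala", "fastly", "llvmorg"],
    "created_at": [1442865158, 1442828307, 1442865156, 1442865155],
    "text": ["Visual exp…", "Introducing ..", "Missing July…", "LLVM 3.7…."],
}
# Row layout: the same data stored one full record at a time.
rows = list(zip(*columns.values()))


def count_row_store(rows, user):
    """COUNT(*) over a row layout: every field of every row is read."""
    matches, values_read = 0, 0
    for row in rows:
        values_read += len(row)
        if row[1] == user:  # user_name is the second field
            matches += 1
    return matches, values_read


def count_column_store(columns, user):
    """COUNT(*) over a column layout: only user_name is read."""
    column = columns["user_name"]
    return sum(1 for value in column if value == user), len(column)


row_result = count_row_store(rows, "newsycbot")
col_result = count_column_store(columns, "newsycbot")
```

Both scans find the same single match, but the row scan reads all 16 values while the column scan reads 4. On the slide's sizes that is the difference between reading the 2GB user_name column and reading all four columns.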
  • 22. Columnar Storage – Other Encodings/Compression. Available Encodings: • Dictionary (Strings, Binary) • Bitshuffle (Numeric) • RLE (Numeric, Bool) Available Compression: • Snappy • LZ4 • ZLib Kudu 1.3 ships with “good” defaults for most cases
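Two of the encodings named on this slide are easy to illustrate in a few lines. The functions below are textbook versions of dictionary and run-length encoding, written only to show the idea; they are not Kudu's implementation or file format.

```python
def dictionary_encode(values):
    """Replace each distinct value with a small integer code."""
    dictionary, index, codes = [], {}, []
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


def rle_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return [tuple(run) for run in runs]


names, name_codes = dictionary_encode(["fastly", "llvmorg", "fastly", "fastly"])
flag_runs = rle_encode([True, True, True, False, False, True])
```

Dictionary encoding pays off when a string column has few distinct values, and RLE pays off when equal values sit next to each other, which sorted columnar data tends to produce.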
  • 23. Deep Dive: Write and Read Paths
  • 24. LSM vs Kudu • LSM – Log Structured Merge (Cassandra, HBase, etc) • Inserts and updates all go to an in-memory map (MemStore) and later flush to on-disk files (HFile/SSTable) • Reads perform an on-the-fly merge of all on-disk HFiles • Kudu • Shares some traits (memstores, compactions) • More complex. • Slower writes in exchange for faster reads (especially scans)
  • 25. LSM Insert Path MemStore INSERT Row=r1 col=c1 val=“blah” Row=r1 col=c2 val=“1” HFile 1 Row=r1 col=c1 val=“blah” Row=r1 col=c2 val=“1” flush
  • 26. LSM Insert Path MemStore INSERT Row=r2 col=c1 val=“blah2” Row=r2 col=c2 val=“2” HFile 2 Row=r2 col=c1 val=“blah2” Row=r2 col=c2 val=“2” flush HFile 1 Row=r1 col=c1 val=“blah” Row=r1 col=c2 val=“1”
  • 27. LSM Update path MemStore UPDATE HFile 1 Row=r1 col=c1 val=“blah” Row=r1 col=c2 val=“2” HFile 2 Row=r2 col=c1 val=“v2” Row=r2 col=c2 val=“5” Row=r2 col=c1 val=“newval” Note: all updates are “fully decoupled” from reads. Random-write workload is transformed to fully sequential!
  • 28. LSM Read path MemStore HFile 1 Row=r1 col=c1 val=“blah” Row=r1 col=c2 val=“2” HFile 2 Row=r2 col=c1 val=“v2” Row=r2 col=c2 val=“5” Row=r2 col=c1 val=“newval” Merge based on string row keys R1: c1=blah c2=2 R2: c1=newval c2=5 …. CPU intensive! Must always read rowkeys Any given row may exist across multiple HFiles: must always merge! The more HFiles to merge, the slower it reads
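The read-time merge on this slide can be sketched as a newest-wins merge across every HFile plus the MemStore. This is a simplification under stated assumptions: each store is a plain dict keyed by (row, column), stores are passed oldest first, and a real LSM engine would do a sorted k-way merge with timestamps rather than rebuild a dict.

```python
def lsm_read(hfiles, memstore):
    """Merge all stores by row key; later (newer) stores overwrite older cells."""
    merged = {}
    for store in hfiles + [memstore]:      # oldest to newest
        for (row, col), val in store.items():
            merged.setdefault(row, {})[col] = val
    return dict(sorted(merged.items()))    # rows come back in row-key order


hfile1 = {("r1", "c1"): "blah", ("r1", "c2"): "2"}
hfile2 = {("r2", "c1"): "v2", ("r2", "c2"): "5"}
memstore = {("r2", "c1"): "newval"}        # the un-flushed update from the slide

result = lsm_read([hfile1, hfile2], memstore)
```

Every read has to visit every store and compare row keys, so the cost grows with the number of HFiles, which is the slide's complaint.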
  • 29. Kudu storage – Inserts and Flushes MemRowSet INSERT(“todd”, “$1000”, “engineer”) name pay role DiskRowSet 1 flush
  • 30. Kudu storage – Inserts and Flushes MemRowSet name pay role DiskRowSet 1 name pay role DiskRowSet 2 INSERT(“doug”, “$1B”, “Hadoop man”) flush
  • 31. Kudu storage - Updates MemRowSet name pay role DiskRowSet 1 name pay role DiskRowSet 2 Delta MS Delta MS Each DiskRowSet has its own DeltaMemStore to accumulate updates base data base data
  • 32. Kudu storage - Updates MemRowSet name pay role DiskRowSet 1 name pay role DiskRowSet 2 Delta MS Delta MS UPDATE set pay=“$1M” WHERE name=“todd” Is the row in DiskRowSet 2? (check bloom filters) Is the row in DiskRowSet 1? (check bloom filters) Bloom says: no! Bloom says: maybe! Search key column to find offset: rowid = 150 150: col 1=$1M base data
  • 33. Kudu storage – Read path MemRowSet name pay role DiskRowSet 1 name pay role DiskRowSet 2 Delta MS Delta MS 150: pay=$1M Read rows in DiskRowSet 2 Then, read rows in DiskRowSet 1 Any row is only in exactly one DiskRowSet – no need to merge cross-DRS! Updates are merged based on ordinal offset within DRS: array indexing, no string compares base data base data
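The update and read paths on the last two slides can be sketched together. This is a toy model under loud assumptions: the `DiskRowSet` class here is hypothetical, a Python set stands in for the bloom filter (so it never gives false positives, unlike a real one), base data is a list of dicts instead of columnar files, and the row offset is 0 rather than the slide's 150 because each toy rowset holds one row.

```python
from bisect import bisect_left


class DiskRowSet:
    """Toy rowset: immutable base data plus a delta store keyed by row offset."""

    def __init__(self, rows):
        self.base = sorted(rows, key=lambda r: r["name"])
        self.keys = [r["name"] for r in self.base]
        self.maybe_keys = set(self.keys)   # stand-in for a bloom filter
        self.delta_ms = {}                 # row offset -> {column: new value}

    def update(self, key, column, value):
        if key not in self.maybe_keys:     # "Bloom says: no!"
            return False
        rowid = bisect_left(self.keys, key)  # search key column for the offset
        if rowid == len(self.keys) or self.keys[rowid] != key:
            return False                   # would be a bloom false positive
        self.delta_ms.setdefault(rowid, {})[column] = value
        return True

    def scan(self):
        # Deltas are applied by ordinal offset: indexing, no key comparisons.
        for rowid, row in enumerate(self.base):
            yield {**row, **self.delta_ms.get(rowid, {})}


def update(rowsets, key, column, value):
    """Try each rowset; a row lives in exactly one, so stop at the first hit."""
    return any(rs.update(key, column, value) for rs in rowsets)


drs1 = DiskRowSet([{"name": "todd", "pay": "$1000", "role": "engineer"}])
drs2 = DiskRowSet([{"name": "doug", "pay": "$1B", "role": "Hadoop man"}])
updated = update([drs2, drs1], "todd", "pay", "$1M")
todd_now = list(drs1.scan())
```

The base data is never rewritten by the update; the new value lives in the delta store until a later flush or compaction, and scans overlay it by position.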
  • 34. Kudu storage – Delta flushes MemRowSet name pay role DiskRowSet 1 name pay role DiskRowSet 2 Delta MS Delta MS 0: pay=foo REDO DeltaFile Flush A REDO delta indicates how to transform between the ‘base data’ (columnar) and a later version base data base data
  • 35. Kudu storage – Major delta compaction name pay role DiskRowSet (pre-compaction) Delta MS REDO DeltaFile REDO DeltaFile REDO DeltaFile Many deltas accumulate: lots of delta application work on reads name pay role DiskRowSet (post-compaction) Delta MS Unmerged REDO deltas UNDO deltas If a column has few updates, doesn’t need to be re-written: those deltas maintained in new DeltaFile Merge updates for columns with high update percentage base data
  • 36. Kudu storage – RowSet Compactions DRS 1 (32MB) [PK=alice], [PK=joe], [PK=linda], [PK=zach] DRS 2 (32MB) [PK=bob], [PK=jon], [PK=mary], [PK=zeke] DRS 3 (32MB) [PK=carl], [PK=julie], [PK=omar], [PK=zoe] DRS 4 (32MB) [alice, bob, carl, joe] DRS 5 (32MB) [jon, julie, linda, mary] DRS 6 (32MB) [omar, zach, zeke, zoe] Reorganize rows to avoid rowsets with overlapping key ranges Writes for “chris” have to perform bloom lookups on all 3 RS
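The compaction on this slide amounts to a merge of sorted rowsets followed by re-splitting into non-overlapping key ranges. A minimal sketch using the slide's keys follows; the function names and the fixed rows-per-rowset split are assumptions, and real rowsets are sized in bytes.

```python
from heapq import merge


def compact(rowsets, rows_per_rowset):
    """Merge sorted rowsets and re-split them into non-overlapping ranges."""
    merged = list(merge(*rowsets))   # each input is already sorted by primary key
    return [merged[i:i + rows_per_rowset]
            for i in range(0, len(merged), rows_per_rowset)]


def rowsets_to_check(rowsets, key):
    """Rowsets whose [min, max] key range could contain the key."""
    return [rs for rs in rowsets if rs[0] <= key <= rs[-1]]


before = [
    ["alice", "joe", "linda", "zach"],
    ["bob", "jon", "mary", "zeke"],
    ["carl", "julie", "omar", "zoe"],
]
after = compact(before, rows_per_rowset=4)

checks_before = len(rowsets_to_check(before, "chris"))
checks_after = len(rowsets_to_check(after, "chris"))
```

Before compaction a write for "chris" falls inside all three key ranges; afterwards it falls inside one, so only one bloom lookup is needed.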
  • 37. Kudu Storage - Compactions • Main Idea: Always be compacting! • Compactions run continuously to prevent IO storms • “Budgeted” RS compactions: What is the best way to spend X MBs IO? • Physical/Logical decoupling: different replicas run compactions at different times
  • 38. Deep Dive: Partitioning
  • 39. Partitioning • Kudu has flexible policies for distributing data among partitions • Hash partitioning is built in, and can be combined with range partitioning • Keys are ordered within a partition • Key order matters • Partitioning key order dictates distribution • Primary key order affects how much data is read for scans
  • 40. Primary Key selection • Example - Time series data: • “time” - timestamp • “series” - {region, server, metric} (series, time): (us-east.appserver01.loadavg, 2016-05-09T15:14:00Z) (us-east.appserver01.loadavg, 2016-05-09T15:15:00Z) (us-west.dbserver03.rss, 2016-05-09T15:14:30Z) (us-west.dbserver03.rss, 2016-05-09T15:14:30Z) (time, series): (2016-05-09T15:14:00Z, us-east.appserver01.loadavg) (2016-05-09T15:14:30Z, us-west.dbserver03.rss) (2016-05-09T15:15:00Z, us-east.appserver01.loadavg) (2016-05-09T15:14:30Z, us-west.dbserver03.rss) SELECT * WHERE series = ‘us-east.appserver01.loadavg’;
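Why key order matters for this query can be sketched with sorted lists standing in for a tablet's primary-key order. With (series, time) the matching rows are contiguous and can be found by binary search; with (time, series) every row must be examined. The helper names are hypothetical, and three distinct samples from the slide are used because primary keys must be unique.

```python
from bisect import bisect_left

samples = [
    ("us-east.appserver01.loadavg", "2016-05-09T15:14:00Z"),
    ("us-east.appserver01.loadavg", "2016-05-09T15:15:00Z"),
    ("us-west.dbserver03.rss", "2016-05-09T15:14:30Z"),
]
by_series_time = sorted(samples)
by_time_series = sorted((t, s) for s, t in samples)


def scan_series_first(index, series):
    """Key (series, time): matching rows form one contiguous range."""
    lo = bisect_left(index, (series,))
    hi = bisect_left(index, (series + "\x00",))  # smallest key after this series
    return index[lo:hi], hi - lo                 # rows, rows examined


def scan_time_first(index, series):
    """Key (time, series): the series is scattered, so scan everything."""
    rows = [(t, s) for t, s in index if s == series]
    return rows, len(index)                      # rows, rows examined


wanted = "us-east.appserver01.loadavg"
rows_a, examined_a = scan_series_first(by_series_time, wanted)
rows_b, examined_b = scan_time_first(by_time_series, wanted)
```

Both scans return the same two samples, but the series-first key examines only those two rows while the time-first key examines the whole table.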
  • 41. Partitioning — By Time Range (inserts) All inserts go to the latest partition. Big scans (across large time intervals) can be parallelized across many partitions
  • 42. Partitioning — By Series Range Inserts are spread among all partitions. Scans are over a single partition
  • 43. Partitioning — By Series Range Partitions can become unbalanced, resulting in hot spotting
  • 44. Partitioning — By Series Hash (inserts) Inserts are spread among all partitions. Scans are over a single partition
  • 45. Partitioning — By Series Hash Partitions grow over time, eventually becoming too big for a single server
  • 46. Partitioning — By Series Hash + Time Range Inserts are spread among all partitions in the latest time range Big scans (across large time intervals) can be parallelized across partitions
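The combined scheme can be sketched as a function that maps a row to a (hash bucket, time range) pair. Everything here is an assumption for illustration: `zlib.crc32` stands in for whatever hash Kudu uses internally, and the bucket count and range bounds are made up.

```python
import zlib
from bisect import bisect_right


def partition_for(series, timestamp, hash_buckets, range_bounds):
    """Return (hash bucket, time-range index) for one row."""
    bucket = zlib.crc32(series.encode("utf-8")) % hash_buckets
    # ISO-8601 UTC timestamps sort lexicographically, so string bisect works.
    time_range = bisect_right(range_bounds, timestamp)
    return bucket, time_range


bounds = ["2016-05-01", "2016-06-01"]   # three ranges: before, May, June onward
p1 = partition_for("us-east.appserver01.loadavg", "2016-05-09T15:14:00Z", 4, bounds)
p2 = partition_for("us-east.appserver01.loadavg", "2016-05-09T15:15:00Z", 4, bounds)
p3 = partition_for("us-east.appserver01.loadavg", "2016-06-02T00:00:00Z", 4, bounds)
```

Rows for one series in the same time range land in one partition, so a per-series scan stays narrow, while different series spread across the hash buckets of the latest range and new ranges keep any single partition from growing without bound.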
  • 47. Next Steps. Get Started with Kudu & Cloudera: • www.cloudera.com/downloads • https://blog.cloudera.com/?s=kudu Start Contributing to Kudu: http://kudu.apache.org/
  • 48. Thank you David Alves – Apache Kudu PMC @dribeiroalves