Pulsar Virtual Summit North America 2021
Apache BookKeeper State Store:
A Durable Key-Value Store
Prashant Kumar
Principal Software Engineer @ Splunk
● Principal Software Developer at Splunk.
● Ex-Yahoo, Verizon Media.
● Previously a key member of and contributor to Sherpa, the geographically
replicated, multi-tenant key-value store at Yahoo.
Agenda
I. Introduction to Apache BookKeeper Statestore.
II. Why yet another KV store?
III. How does it fit into the Pulsar ecosystem?
IV. Intended use cases.
V. Brief architecture.
VI. Current state and production readiness.
VII. Product roadmap and future work.
Apache Bookkeeper Statestore
● It’s a Key-Value store.
● It’s durable.
● It’s locally replicated.
● It’s eventually consistent.
● It’s fault tolerant.
● It’s cloud native, with a Kubernetes (k8s) based deployment.
Argh! Yet another KV store?
An integral part of the Apache Pulsar ecosystem
● Uses the same ZooKeeper deployment that Pulsar uses.
● Uses the same BookKeeper deployment that Pulsar uses.
● Uses the same infrastructure for metrics, dashboards, etc. as
BookKeeper.
● Part of the bookkeeper/stream code base.
● Existing client-side integration in the Apache Pulsar Functions
service.
Primary use cases
● Store and access function state and checkpoints
● A secondary metadata store for Apache Pulsar, away from
ZooKeeper
● Various other KV store use cases
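The first use case, function state, can be illustrated with a minimal word-count sketch. It assumes the counter calls `incr_counter` and `get_counter` exposed by the Pulsar Python Functions context. The `InMemoryContext` class is a hypothetical stand-in so the example runs without a cluster; it is not part of Pulsar or BookKeeper.

```python
from collections import defaultdict


class InMemoryContext:
    """Hypothetical stand-in for the Pulsar Functions context (local runs only)."""

    def __init__(self):
        self._counters = defaultdict(int)

    def incr_counter(self, key, amount):
        # In a deployed function this update would be persisted by the state store.
        self._counters[key] += amount

    def get_counter(self, key):
        return self._counters[key]


class WordCountFunction:
    """Counts words across messages; all state lives behind the context object.

    Assumption: in a real deployment this class would extend pulsar.Function.
    """

    def process(self, input, context):
        for word in input.split():
            context.incr_counter(word, 1)
        return None


ctx = InMemoryContext()
fn = WordCountFunction()
for message in ["hello state store", "hello pulsar"]:
    fn.process(message, ctx)
print(ctx.get_counter("hello"))
```

Because the function only talks to the context, the counts survive function restarts when the context is backed by a durable store, which is the property this use case depends on.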
Data model
High-level serving architecture (bird's-eye view)
High-level datastore architecture
Benchmarking
● Benchmarking with YCSB
● Setup
○ YCSB Thread count = 40
○ # k8s pods = 3
○ cpuRequest = 8
○ cpuLimit = 16
○ memoryRequest = 24Gi
○ memoryLimit = 24Gi
● Read results
○ Throughput = 22557.7 ops/s
○ Average latency = 1.699 ms
○ 99th percentile latency = 5.323 ms
● Write results
○ Throughput = 15256.16 ops/s
○ Average latency = 8.820 ms
○ 99th percentile latency = 27.071 ms
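As a rough sanity check on these numbers, Little's law (average requests in flight = throughput × average latency) gives the client concurrency each result implies. This is a sketch over the reported figures only; the function name is illustrative, and the slides do not describe how YCSB clients were distributed.

```python
def implied_concurrency(throughput_ops_per_s, avg_latency_ms):
    """Little's law: average requests in flight = throughput * average latency."""
    return throughput_ops_per_s * (avg_latency_ms / 1000.0)


# Figures copied from the benchmark slide.
read_in_flight = implied_concurrency(22557.7, 1.699)
write_in_flight = implied_concurrency(15256.16, 8.820)

print(f"read:  {read_in_flight:.1f} requests in flight")
print(f"write: {write_in_flight:.1f} requests in flight")
```

The read figure comes out close to the stated 40 YCSB threads. The write figure comes out higher than 40; the slides do not say whether the write run used a different client configuration or latency measurement, so treat this as a consistency check rather than a finding.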
Production readiness
● It has been in production for the last few
months
● It’s a k8s based deployment
● Sustained production traffic
○ Read throughput: 240 ops/s
○ Write throughput: 90 ops/s
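To put the production traffic in context, the short calculation below compares it with the YCSB benchmark throughput reported on the Benchmarking slide. It assumes the benchmark and production clusters are comparable in size, which the slides do not confirm.

```python
# Throughput in ops/s, copied from the benchmarking and production slides.
benchmark_ops = {"read": 22557.7, "write": 15256.16}
production_ops = {"read": 240.0, "write": 90.0}

# Ratio of benchmarked capacity to sustained production load.
headroom = {op: benchmark_ops[op] / production_ops[op] for op in benchmark_ops}

for op, ratio in headroom.items():
    print(f"{op}: benchmark is about {ratio:.0f}x production traffic")
```

Under that assumption, the reported production load is a small fraction of what the benchmark sustained.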
Product roadmap
● Contribute internal changes back to open source
● Storage hardening
○ Clean up leftover data (litter), both actively and reactively
○ Clear obsolete transaction logs
● Operability
○ Improved, more granular monitoring and alerting
● Availability
○ Implement replicas
○ Serve read traffic from a replica
○ Promote a replica to primary when the primary fails
● Scaling and load balancing
○ Splitting a shard
○ Moving shards across the cluster
References
● Statestore Repo:
https://github.com/apache/bookkeeper/tree/master/stream
● Distributedlog Repo:
https://github.com/apache/bookkeeper/tree/master/stream/distributedlog
● Pulsar Function - Statestore integration:
https://github.com/apache/pulsar/blob/master/pulsar-functions/worker/src/main/java/org/apache/pulsar/functions/worker/PulsarWorkerService.java#L420

