Apache Pulsar as a Dual Streaming / Batch Processor
Joe Olson
Senior Manager, Big Data Analytics
Apache Road Show Chicago - May 2019
Agenda
United and the Airline Industry
How Publish – Subscribe Compute Model Presents Opportunity
Apache Pulsar & Apache BookKeeper
Use Case: FAA’s Real Time SWIM Feed
About United Airlines
 1,348 aircraft (779 mainline, 569 regional) with 250+ on order (supply chain)
 158M passengers in 2018 (public facing web site, mobile app, time / geospatial based inventory, loyalty program, surveys, ancillary sales)
 4,900 daily departures (scheduling, operations, weather, route planning)
 355 airports served, in 48 countries (baggage claim, check-ins)
 88,000 employees worldwide (scheduling, pay)
 Constantly in motion! Future (and past) always changing.
 A data scientist’s / data engineer’s dream.
Source: https://hub.united.com/corporate-fact-sheet/
Business Goals
 Improve Customer Experience
- How can we reduce friction when booking a reservation or maneuvering through an airport?
- How can we deliver a consistent message across all channels (mobile app, web site, social media, etc.)?
 Improve Employee Experience
- How can we keep employees better informed of the current situation so they can relay it to the customers?
- What are we learning from our surveys about what the customer base says is / isn’t working?
 Revenue Generation
- What personalized offers can we make to our customers?
- Are our offers competitive with the rest of the industry?
 Improve Operational Reliability
- How can we better prepare for weather or other operational interruptions?
- How can we manage the fleet better and ensure spare parts are where they need to be?
Industry Ideas – Customer Experience
Apache Pulsar – Key Points
 “Apache Pulsar is an open-source distributed pub-sub messaging system originally created at Yahoo and now part of the Apache Software Foundation”
- Designed for low publish latency (< 5 ms) at scale with strong durability guarantees.
- Persistent message storage based on Apache BookKeeper.
- Tiered storage provides an opportunity for batch and stream processing on the same platform.
- Built from the ground up as a multi-tenant system: isolation, quotas, etc.
- Geo-replication designed in – across data centers or geographic regions.
- Pulsar has run in production at Yahoo scale for over 3 years, with millions of messages per second across millions of topics. Can scale to hundreds of nodes.
- Easily deploy lightweight compute logic without a separate stream processing engine.
- REST Admin API for provisioning, administration, tools, and monitoring. Deploy on bare metal or Kubernetes.
Apache Pulsar – Multi Tenancy
 Pulsar was designed from the ground up to be a multi-tenant system. In Pulsar, tenants are the highest administrative unit within a Pulsar instance.
 Capacity is allocated to a tenant.
 A namespace is the administrative unit within a tenant. The configuration policies set on a namespace apply to all the topics created in that namespace.
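The tenant and namespace hierarchy shows up directly in fully qualified Pulsar topic names, which take the form `persistent://tenant/namespace/topic`. The sketch below uses plain Python with no Pulsar client and only splits such a name into its parts. The `TopicName` helper and the `united/ops/swim-tbfm` name are illustrative assumptions, not taken from the talk.

```python
from typing import NamedTuple


class TopicName(NamedTuple):
    domain: str      # "persistent" or "non-persistent"
    tenant: str
    namespace: str
    topic: str


def parse_topic(name: str) -> TopicName:
    """Split a fully qualified Pulsar topic name into its parts."""
    domain, sep, rest = name.partition("://")
    if not sep or domain not in ("persistent", "non-persistent"):
        raise ValueError(f"not a fully qualified topic name: {name!r}")
    parts = rest.split("/", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected tenant/namespace/topic in {name!r}")
    return TopicName(domain, *parts)


# Hypothetical topic: tenant "united", namespace "ops", topic "swim-tbfm".
example = parse_topic("persistent://united/ops/swim-tbfm")
```

Because policies are set per namespace, every topic under the hypothetical `united/ops` namespace would share the same configuration.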
Apache Pulsar – Subscription Models
 In exclusive mode, only a single consumer is allowed to attach to the subscription.
 In shared or round robin mode, multiple consumers can attach to the same subscription. Messages are delivered in a round robin distribution across consumers, and any given message is delivered to only one consumer. Ordering is not guaranteed.
 In failover mode, multiple consumers can attach to the same subscription. The first consumer will initially be the only one receiving messages. This consumer is called the master consumer.
 When the master consumer disconnects, all (non-acked and subsequent) messages will be delivered to the next consumer in line.
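The three modes can be compared with a small simulation. This is a simplified model in plain Python, not the Pulsar client API: the `dispatch` function is hypothetical, and it ignores acknowledgements, redelivery, and the failover handover when the master consumer disconnects.

```python
from itertools import cycle


def dispatch(mode, consumers, messages):
    """Return {consumer: [messages]} for a single subscription.

    A simplified model: no acks, redelivery, or consumer failure.
    """
    received = {c: [] for c in consumers}
    if mode == "exclusive":
        if len(consumers) != 1:
            raise ValueError("exclusive allows exactly one consumer")
        received[consumers[0]] = list(messages)
    elif mode == "shared":
        # Round robin: each message goes to exactly one consumer.
        for consumer, msg in zip(cycle(consumers), messages):
            received[consumer].append(msg)
    elif mode == "failover":
        # Only the first (master) consumer receives while it is connected.
        received[consumers[0]] = list(messages)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return received


shared = dispatch("shared", ["c1", "c2"], [1, 2, 3, 4, 5])
failover = dispatch("failover", ["c1", "c2"], [1, 2, 3])
```

In the shared case each consumer sees only part of the topic, which is why ordering across the subscription is not guaranteed.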
Apache Pulsar – Reference Architecture
 One or more brokers handle and load balance incoming messages from producers, and dispatch messages to consumers
- Topic lookup + data transfer
- Messages are dispatched out of a managed ledger cache or, if under load, from persistent storage (BookKeeper)
- Coordination with the local and global meta stores (ZooKeeper)
 A BookKeeper cluster consisting of one or more bookies handles persistent storage of messages
 Local ZooKeeper handles coordination tasks within a cluster, and a global cluster handles instance-wide coordination (geo-replication)
Apache BookKeeper - Key Points
 Apache BookKeeper is a scalable, fault tolerant, low latency log storage service that delivers durability and consistency guarantees and provides access to both historic and real time data
- The atomic unit is an entry.
- A ledger is a bounded sequence of entries; a stream is an unbounded sequence of ledgers.
- Individual servers storing ledgers are called bookies.
- Entries are written to ledgers sequentially and at most once (append-only).
- Each bookie handles fragments of ledgers as part of an ensemble (striping).
[Figure: a stream of ledgers, each made up of entries]
Apache BookKeeper – Reference Architecture
 Two APIs:
- Ledger API – allows direct interaction with ledgers, giving you the most flexibility in working with bookies.
- Log stream API – allows you to interact with streams without dealing with lower level ledgers.
 Bookies advertise themselves to the ZooKeeper metadata cluster.
Apache BookKeeper – Storage Requirements
 Clients should be able to write and read streams of entries with very low latency (under 5 milliseconds), even when providing strong durability
 Data storage should be durable, consistent, and fault tolerant
 The system should enable clients to stream or tail ledgers to propagate data as they’re written
 The system should be able to store and provide access to both historic and real-time data
Apache BookKeeper – Durability
 Example: bookies 1-5 are the ensemble for the ledger.
 Entries are striped across the bookies.
 Write quorum in this case is 3 (each entry is written to 3 bookies).
 A write is considered successful when the ack quorum (in this case 2 bookies) acknowledges the write (fsync).
 A wide variety of options for writing to bookies in the case of system degradation.
 Maximize bandwidth by scaling out bookies.
 Improve latency by tuning the ack quorum.
 Replication supports durability.
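The numbers in this example (ensemble of 5, write quorum of 3, ack quorum of 2) can be worked through in a few lines. The sketch assumes round-robin striping, where entry `i` starts at bookie `i mod 5`. The `write_set` and `write_succeeds` helpers and the bookie names are hypothetical, not BookKeeper API calls.

```python
def write_set(entry_id, ensemble, write_quorum):
    """Bookies that store a given entry under round-robin striping."""
    n = len(ensemble)
    return [ensemble[(entry_id + i) % n] for i in range(write_quorum)]


def write_succeeds(acks, ack_quorum):
    """A write completes once ack_quorum distinct bookies have acknowledged it."""
    return len(set(acks)) >= ack_quorum


ensemble = ["bookie1", "bookie2", "bookie3", "bookie4", "bookie5"]
placement = {e: write_set(e, ensemble, write_quorum=3) for e in range(5)}

# Entry 3 wraps around the ensemble: bookie4, bookie5, bookie1.
entry3 = placement[3]
# Two of the three target bookies have acked: enough for an ack quorum of 2.
ok = write_succeeds(["bookie4", "bookie5"], ack_quorum=2)
```

Lowering the ack quorum relative to the write quorum lets a write complete without waiting for the slowest bookie, which is the latency lever mentioned in the slide.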
Apache BookKeeper – Consistency & Availability
 Consistency for log reads:
- An entry successfully written is immediately readable.
- An entry read once is always readable.
- All entries written previously are also readable.
- The order of records is identical across all readers.
- Consistency is accomplished via LastAddConfirmed (LAC), a variation on two-phase commit.
 Availability:
- A write can be performed as long as there are enough bookies to satisfy the ack quorum.
- A read can be served by any bookie that holds a replica of the entry.
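The role of LastAddConfirmed can be sketched as a pointer that only advances over a gap-free prefix of acknowledged entries, with readers never looking past it. This is a deliberately simplified model; the `LedgerWriter` class is hypothetical and the real BookKeeper protocol has more moving parts.

```python
class LedgerWriter:
    """Tracks LastAddConfirmed (LAC) for one ledger, in a simplified form."""

    def __init__(self):
        self.next_entry_id = 0
        self.acked = set()
        self.lac = -1  # no entry confirmed yet

    def add(self):
        """Assign the next sequential entry id to a new write."""
        entry_id = self.next_entry_id
        self.next_entry_id += 1
        return entry_id

    def on_ack_quorum(self, entry_id):
        """Record that an entry reached its ack quorum, then advance LAC."""
        self.acked.add(entry_id)
        # LAC only moves forward over a gap-free prefix of acked entries.
        while self.lac + 1 in self.acked:
            self.lac += 1
            self.acked.discard(self.lac)

    def readable(self, entry_id):
        """Readers may only see entries at or below LAC."""
        return 0 <= entry_id <= self.lac


w = LedgerWriter()
ids = [w.add() for _ in range(3)]   # entries 0, 1, 2
w.on_ack_quorum(1)                  # entry 1 is acked before entry 0
lac_with_gap = w.lac                # LAC cannot move past the gap at entry 0
w.on_ack_quorum(0)                  # gap closed
lac_after = w.lac
```

Holding readers at the LAC is what makes "an entry read once is always readable" and "all entries written previously are also readable" hold together in this model.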
Apache BookKeeper – I/O Isolation
 Three separate I/O paths implemented:
- Write (low latency)
- Tailing read (low latency)
- Catch up read (high latency)
Apache BookKeeper – Data Distribution
 Storage capacity for a single log stream is constrained by the capacity of the cluster, never by a single host.
 No stream rebalancing when capacity is added. New bookies are discovered and become available for writing.
 Replica repair after a detected failure is efficient because it can run concurrently from multiple hosts.
 All due to segmenting the streams.
Apache Pulsar – Tiered Storage
 Infinite stream – most recent data is stored on the broker, the rest in bookies, as the capacity of the cluster allows
- Write
- Tailing read
- Catch-up read
[Diagram labels: Broker; Bookies; Infinite Stream]
Apache Pulsar – Tiered Storage
 Infinite stream
- Offloader: move segments off the Pulsar cluster and onto commodity storage.
- Can be triggered on time, size, or demand.
 Access
- The broker knows how to read offloaded data back; alternatively, a reader can bypass the bookies and read segments directly.
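The offload idea can be illustrated with a toy model in which sealed segments move from bookies to cheaper storage once a size threshold is exceeded, while readers keep addressing segments the same way. The `TieredLog` class and its segment-count threshold are assumptions made for illustration; they are not Pulsar's offloader configuration.

```python
class TieredLog:
    """Toy model: sealed segments move from bookies to cheaper storage."""

    def __init__(self, offload_threshold):
        self.offload_threshold = offload_threshold  # max segments kept on bookies
        self.bookies = {}      # segment id -> list of messages
        self.offloaded = {}    # segment id -> list of messages
        self.next_segment = 0

    def seal_segment(self, messages):
        """Store a sealed segment, then offload the oldest if over threshold."""
        seg = self.next_segment
        self.bookies[seg] = list(messages)
        self.next_segment += 1
        while len(self.bookies) > self.offload_threshold:
            oldest = min(self.bookies)
            self.offloaded[oldest] = self.bookies.pop(oldest)
        return seg

    def read(self, seg):
        """Readers see one log; the tier holding the segment is reported too."""
        if seg in self.bookies:
            return "bookies", self.bookies[seg]
        return "offloaded", self.offloaded[seg]


log = TieredLog(offload_threshold=2)
for batch in (["a", "b"], ["c"], ["d", "e"]):
    log.seal_segment(batch)
```

Only sealed segments are moved, so the write path and tailing reads are unaffected in this model; older data simply becomes a catch-up read from a different tier.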
Apache Pulsar – Bringing It All Together
[Diagram labels: Producer; Subscriber; Segment Reader; Unbounded stream; Bounded stream]
Apache Pulsar – Bringing It All Together
[Diagram labels: Producer; Subscriber; Segment Reader; Unbounded stream; Batch Processing; Stream Processing]
Use Case – Improve Operational Reliability
 SWIM (System Wide Information Management)
- Real time FAA message feed describing the current and future state of the nation’s managed airspace: traffic, weather, airport operations, etc.
- Publishers (such as airlines) push their operational information to an endpoint.
- Subscribers (such as airlines) consume through a common published message interface.
 Airline needs:
- Connect the information in this feed with their existing operational systems.
• Maintain current state on assets.
- Real time and historical analytics on this feed – traditional and predictive (ML / AI).
SWIM Overview
[Figure labels: Phase of operation; FAA Topic]
Sample SWIM Enroute TBFM Messages
{"carrier": "UAL",
"flight number": 376,
"origin": "EWR",
"destination": "LAX",
"flight date": "2019-Mar-19"}
"Flight Plan": [{
"event_source": "TMA.ZOB.FAA.GOV",
"event_time": "2019-03-29T16:23:22.659Z",
"event_id": "422",
"tma_id": "C00926",
"Aircraft Id": "UAL376",
"Origin Airport": "EWR",
"Destination Airport": "LAX",
"Flight Plan": "ACTIVE",
"Aircraft Status": "TRACKED",
"Aircraft Type": "B752/L",
"Engine Type": "JET",
"Beacon Code": "2334",
"Flight Plan Speed": "483.0",
"Assigned Requested Altitude": "28000",
"Track Datasource": "ZNY",
"Coordination Fix": "KEWR",
"Coordination Time": "2019-03-29T16:14:00Z",
"Estimated Departure Clearance Status": "FAA",
"Flight Plan Field 10A": "KEWR..COATE.Q436.RAAKK.Q438.RUBYY..MKG..BAE.J36.DUTYS..KG78K..JORDY..OBH..GLL..DBL..CHESZ.Q88.HAKMN.ANJLL4.KLAX/2148",
"TMA Converted Route": "KEWR/0000 COATE/0000 LAAYK/0000 YYOST/0000 DGRAF/0000 KG78K/0000 JORDY/0000 OBH/0000 GLL/0000 DBL/0000 KLAX/0000"}]
• Sample TBFM messages. This specific flight generated 800 such messages.
"Station Time of Arrival": [{
"event_source": "TMA.ZLA.FAA.GOV",
"event_time": "2019-03-29T20:38:28.148Z",
"event_id": "4664550",
"tma_id": "L03502",
"Meter Fix Name": "CRCUS",
"ETA Outer Meter Arc": "2019-03-29T21:42:45Z",
"ETA Meter Fix": "2019-03-29T21:46:35Z",
"ETA at Display Point": "2019-03-29T21:42:55Z",
"ETA at Scheduling Fix": "2019-03-29T21:42:55Z",
"ETA at Runway": "2019-03-29T21:57:23Z"}],
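As an illustration of the kind of real time analytics such messages allow, the sketch below parses a few fields copied from the sample "Station Time of Arrival" message and computes how long before the estimated runway arrival the event was issued. The helper names are hypothetical, and the field subset is chosen only for brevity.

```python
import json
from datetime import datetime


def parse_utc(ts: str) -> datetime:
    """Parse an ISO 8601 timestamp with a trailing 'Z' (UTC)."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def seconds_to_runway(sta: dict) -> float:
    """Seconds between the event time and the estimated runway arrival."""
    delta = parse_utc(sta["ETA at Runway"]) - parse_utc(sta["event_time"])
    return delta.total_seconds()


# Fields copied from the sample "Station Time of Arrival" message.
raw = """{
  "event_source": "TMA.ZLA.FAA.GOV",
  "event_time": "2019-03-29T20:38:28.148Z",
  "Meter Fix Name": "CRCUS",
  "ETA at Runway": "2019-03-29T21:57:23Z"
}"""
remaining = seconds_to_runway(json.loads(raw))  # about 4734.85 s, roughly 79 minutes
```

With around 800 such messages per flight, the same calculation applied to each new message would keep an arrival estimate current for a given aircraft.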
Architecture - Current State: Point to Point
[Diagram: FAA systems and airline systems, each comprising Scheduling, Flight Plans, Weather, Airport Operations, and Airspace Operations, connected point to point]
Architecture - Target State: Pub / Sub
[Diagram: the same FAA systems and airline systems, now connected through topics by producers and subscribers]
Architecture - Target State Considerations
[Diagram: airline systems (Scheduling, Flight Plans, Weather, Airport Operations, Airspace Operations) reached from a producer / subscriber through file, JDBC, and API connectors]
 Connectivity to the operational systems is mostly through file, JDBC, and API interfaces.
 Most of these are not designed for streaming interfaces (yet).
 How do we connect a topic to systems that are not designed to work with streams?
Architecture - Target State Considerations
[Diagram: airline systems (Scheduling, Flight Plans, Weather, Airport Operations, Airspace Operations) reached from a producer / subscriber through file, JDBC, and API connectors]
 What if there were both batch and streaming interfaces?
 Use the batch interface until more sophisticated streaming interfaces come online.
 An API written around the segment reader can help to close the last mile.
 Treat as batch when needed, treat as stream when needed.
Segment
Reader API
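One way to picture "treat as batch when needed, treat as stream when needed" is a topic that exposes both a bounded view over sealed segments and a live tail. The sketch below is a toy model in plain Python; the `SegmentedTopic` class and its methods are hypothetical and are not the Pulsar or BookKeeper API.

```python
import queue


class SegmentedTopic:
    """Toy topic: sealed segments are bounded, the open tail is unbounded."""

    def __init__(self, segment_size):
        self.segment_size = segment_size
        self.sealed = []            # list of immutable, full segments
        self.open = []              # messages in the current open segment
        self.tail = queue.Queue()   # live feed for a streaming subscriber

    def publish(self, msg):
        """Append a message, sealing the open segment when it fills up."""
        self.open.append(msg)
        self.tail.put(msg)
        if len(self.open) == self.segment_size:
            self.sealed.append(tuple(self.open))
            self.open = []

    def read_segments(self):
        """Batch view: a bounded snapshot of everything sealed so far."""
        return [msg for segment in self.sealed for msg in segment]

    def poll(self):
        """Stream view: next live message, or None if nothing is waiting."""
        try:
            return self.tail.get_nowait()
        except queue.Empty:
            return None


topic = SegmentedTopic(segment_size=2)
for m in ["m1", "m2", "m3"]:
    topic.publish(m)
batch = topic.read_segments()   # what a file or JDBC connector could load
first_live = topic.poll()       # what a streaming subscriber sees
```

A file or JDBC connector could periodically load the batch view into an existing operational system, while newer consumers read the tail, so both kinds of system work from the same data.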
Apache Communities
Apache Pulsar
 Twitter: @apache_pulsar
 WeChat: ApachePulsar
 Mailing Lists
- dev@pulsar.apache.org
- user@pulsar.apache.org
 Slack
- https://apache-pulsar.slack.com
 Localization
- http://crowdin.com/project/apache-pulsar
 GitHub
- https://github.com/apache/pulsar
Apache BookKeeper
 Twitter: @asfbookkeeper
 Mailing Lists
- dev@bookkeeper.apache.org
- user@bookkeeper.apache.org
- issues@bookkeeper.apache.org
 Slack
- http://apachebookkeeper.slack.com/
 GitHub
- https://github.com/apache/bookkeeper
Thank You!
We’re hiring!
- Data Engineers
- Data Scientists
