1
© OCTO 2015
Event Driven Architecture
bluck
2
© OCTO 2015
The first problem was how to transport data between systems.
The second part of this problem was the need to do richer
analytical data processing with very low latency.
3
© OCTO 2015
The pipeline for log data was
scalable
but lossy and
could only deliver data with high latency.
The pipeline between Oracle instances was
fast,
exact, and
real-time,
but not available to any other systems.
4
© OCTO 2015
The pipeline of Oracle data for Hadoop was
periodic CSV dumps—
high throughput,
but batch.
The pipeline of data to our search system was
low latency,
but unscalable
and tied directly to the database.
The messaging systems were
low latency
but unreliable
and unscalable.
5
© OCTO 2015
6
© OCTO 2015
As we added data centers geographically distributed around the world, we had to
build out geographic replication for each of these data flows.
The data was always unreliable:
our reports were untrustworthy,
derived indexes and stores were questionable, and
everyone spent a lot of time battling data quality issues of all kinds.
At the same time, we weren't just shipping data from place to place; we also
wanted to do things with it.
Hadoop had given us a platform for batch processing, data archival, and ad hoc
processing, and this had been enormously successful, but we lacked an
analogous platform for low-latency processing.
7
© OCTO 2015
Stream Data Platform
8
© OCTO 2015
Stream Data Platform
9
© OCTO 2015
Your database stores the current state of your data. But the current state is
always caused by some actions that took place in the past. The actions are the
events.
Much of what people refer to when they talk about "big data" is really the act of
capturing these events that previously weren't recorded anywhere and putting
them to use for analysis, optimization, and decision making.
Event streams are an obvious fit for log data and for things like "orders", "sales",
"clicks" or "trades" that are inherently event-like (see the sketch below).
The Rise of Events and Event Streams
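To make the state-vs-events distinction concrete, here is a minimal Java sketch (the account example and all names are ours, not from the slides): the current state a database row would hold is just a fold over the events that produced it.

```java
import java.util.List;

// Hypothetical event type; a database row would only keep the resulting balance.
record AccountEvent(long accountId, String action, long amount) {}

public class StateVsEvents {
    public static void main(String[] args) {
        // The actions that took place in the past — the events:
        List<AccountEvent> events = List.of(
                new AccountEvent(42, "DEPOSIT", 100),
                new AccountEvent(42, "WITHDRAW", 30));

        // The current state is derived by replaying the events:
        long balance = events.stream()
                .mapToLong(e -> e.action().equals("DEPOSIT") ? e.amount() : -e.amount())
                .sum();
        System.out.println("balance = " + balance); // 70
    }
}
```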
10
© OCTO 2015
Data in databases can also be thought of as an event stream. There are two ways to create a
backup or standby copy of a database:
to dump out the contents, or
to take a "diff" of what has changed.
Change capture: if we take our diffs more and more frequently, what we are left with is a
continuous sequence of single-row changes (sketched after this slide).
By publishing the database changes into the stream data platform you add this to the other
set of event streams. You can use these streams to synchronize other systems like
a Hadoop cluster,
a replica database, or
a search index, or
you can feed these changes into applications
or stream processors to directly compute new things off the changes.
Databases Are Event Streams
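A minimal sketch of what one captured row change might look like (the field names are hypothetical; real change-capture tools also carry transaction and offset metadata):

```java
import java.util.Map;

// Hypothetical shape of a single-row change event; before/after row images
// are a common way to represent an UPDATE in change capture.
record RowChange(String table, String op,        // "INSERT" | "UPDATE" | "DELETE"
                 Map<String, Object> before,     // row image before the change
                 Map<String, Object> after) {}   // row image after the change

public class ChangeCaptureSketch {
    public static void main(String[] args) {
        RowChange change = new RowChange(
                "users", "UPDATE",
                Map.of("id", 7, "email", "old@example.com"),
                Map.of("id", 7, "email", "new@example.com"));
        // Publishing a continuous sequence of such records turns the database
        // into one more event stream on the platform.
        System.out.println(change);
    }
}
```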
11
© OCTO 2015
A stream data platform has two primary uses:
Data Integration: The stream data platform captures streams of events or data changes and
feeds these to other data systems such as relational databases, key-value stores, Hadoop, or
the data warehouse.
Stream processing: It enables continuous, real-time processing and transformation of these
streams and makes the results available system-wide.
The stream data platform is a central hub for data streams.
It also acts as a buffer between these systems—the publisher of data doesn't need to be
concerned with the various systems that will eventually consume and load the data. This
means consumers of data can come and go, fully decoupled from the source (see the
producer sketch below).
What Is a Stream Data Platform For?
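As a sketch of that decoupling with Kafka's Java producer (the broker address, topic name, and payload here are made up for illustration): the publisher writes to a named stream and knows nothing about who will consume it.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class PageViewPublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // assumed broker address
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The producer only knows the stream ("page-views"); whether Hadoop,
            // a search index, or a cache consumes it is invisible from here.
            producer.send(new ProducerRecord<>("page-views", "user-42",
                    "{\"page\": \"/pricing\", \"ts\": 1430000000}"));
        }
    }
}
```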
12
© OCTO 2015
Hadoop wants to be able to maintain a full copy of all the data in your organization and act
as a "data lake" or "enterprise data hub".
Directly integrating each data source with HDFS is a hugely time-consuming proposition,
and the end result only makes that data available to Hadoop.
This type of data capture isn't suitable for real-time processing or syncing other real-time
applications.
This same pipeline can run in reverse: Hadoop and the data warehouse environment can
publish out results that need to flow into appropriate systems for serving in customer-facing
applications.
What Is a Stream Data Platform For? Zoom
Hadoop
13
© OCTO 2015
The stream processing use case plays off the data integration use case.
The results of the stream processing are just a new, derived stream.
Stream processing is both a way to develop applications that need low-latency
transformations and a direct part of the data integration usage itself:
integrating systems often requires some munging of data streams in between
(see the consume-transform-produce sketch below).
What Is a Stream Data Platform For? Zoom
ETL
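A consume-transform-produce sketch using the plain Kafka Java clients (topic names, group id, and the trivial "enrichment" are placeholders): the output is itself just a new, derived stream that anything else can subscribe to.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class DerivedStream {
    public static void main(String[] args) {
        Properties cons = new Properties();
        cons.put("bootstrap.servers", "broker1:9092");
        cons.put("group.id", "click-enricher");
        cons.put("key.deserializer",
                 "org.apache.kafka.common.serialization.StringDeserializer");
        cons.put("value.deserializer",
                 "org.apache.kafka.common.serialization.StringDeserializer");

        Properties prod = new Properties();
        prod.put("bootstrap.servers", "broker1:9092");
        prod.put("key.serializer",
                 "org.apache.kafka.common.serialization.StringSerializer");
        prod.put("value.serializer",
                 "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaConsumer<String, String> in = new KafkaConsumer<>(cons);
             KafkaProducer<String, String> out = new KafkaProducer<>(prod)) {
            in.subscribe(List.of("clicks"));
            while (true) {
                for (ConsumerRecord<String, String> rec : in.poll(Duration.ofMillis(500))) {
                    String enriched = rec.value().toUpperCase(); // stand-in transformation
                    // The result is published as a new stream, "clicks-enriched".
                    out.send(new ProducerRecord<>("clicks-enriched", rec.key(), enriched));
                }
            }
        }
    }
}
```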
14
© OCTO 2015
A stream data platform is similar to an enterprise messaging system—it receives
messages and distributes them to interested subscribers. There are three
important differences:
Messaging systems are typically run in one-off deployments for different
applications. The purpose of the stream data platform is very much to be a central data
hub.
Messaging systems do a poor job of supporting integration with batch systems, such
as a data warehouse or a Hadoop cluster, as they have limited data storage
capacity.
Messaging systems do not provide semantics that are easily compatible with rich
stream processing.
How Does a Stream Data Platform Relate To Existing Things?
15
© OCTO 2015
In other words, a stream data platform is a messaging system whose role
has been rethought at company-wide scale.
How Does a Stream Data Platform Relate To Existing Things?
16
© OCTO 2015
A stream data platform is a true platform that any other system can choose to
tap into and many applications can build around.
By making data available in a uniform format in a single place with a common
stream abstraction, many of the routine data clean-up tasks can be avoided
entirely.
Data Integration Tools
17
© OCTO 2015
The advantage of a stream data platform is that transformation is fundamentally
decoupled from the stream itself.
This code can live in applications or stream processing tasks, allowing teams to
iterate at their own pace without a central bottleneck for application
development.
Enterprise Service Buses
18
© OCTO 2015
Databases have long had similar log-based mechanisms, such as Oracle GoldenGate.
However these mechanisms are limited to database changes only and are not a
general purpose event capture platform.
Change Capture Systems
19
© OCTO 2015
A stream data platform doesn't replace your data warehouse; in fact, quite the
opposite: it feeds it data.
Data Warehouses and Hadoop
20
© OCTO 2015
They attempt to add richer processing semantics to subscribers and can make
implementing data transformation easier.
Stream Processing Systems
21
© OCTO 2015
Everything from user activity to database changes to administrative actions like
restarting a process is captured in real-time streams that are subscribed to and
processed in real time.
What Does This Look Like In Practice?
22
© OCTO 2015
Part of the promise of this approach to data management is having a central repository with
the full set of data streams your organization generates. This works best when data is all in
the same place, simplifying the system architecture:
fewer integration points for data consumers,
fewer things to operate,
lower incremental cost for adding new applications, and
easier reasoning about data flow.
But there are several reasons you may end up with multiple clusters:
To keep activity local to a datacenter
For security reasons
For SLA control.
Recommendations: Limit The Number of Clusters
23
© OCTO 2015
Apache Kafka does not enforce any particular data format.
If each individual or application chooses a representation of their own preference—say
some use JSON, others XML, and others CSV—the result is that any system or process
which uses multiple data streams has to munge and understand each of these.
Local optimization—choosing your favorite format for data you produce—leads to huge
global sub-optimization since now each system needs to write N adaptors, one for each
format it wants to ingest.
Imagine how useless the Unix toolchain would be if each tool invented its own format: you
would have to translate between formats every time you wanted to pipe one command to
another.
Recommendations: Pick A Single Data Format
24
© OCTO 2015
Connecting all systems directly would look
something like this
Whereas having this central stream data platform
looks something like this
Recommendations: Pick A Single Data Format
25
© OCTO 2015
We think Avro is the best choice for a number of reasons:
1. It has a direct mapping to and from JSON.
2. It has a very compact format. The bulk of JSON, repeating every field name with every single
record, is what makes JSON inefficient for high-volume usage.
3. It is very fast.
4. It has great bindings for a wide variety of programming languages so you can generate Java
objects that make working with event data easier, but it does not require code generation so
tools can be written generically for any data stream.
5. It has a rich, extensible schema language defined in pure JSON.
6. It has the best notion of compatibility for evolving your data over time.
Recommendations: Use Avro as Your Data Format
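For illustration, a small hypothetical Avro schema (pure JSON, per point 5 above); the record name and fields are ours:

```json
{
  "type": "record",
  "name": "PageView",
  "namespace": "com.example.events",
  "fields": [
    {"name": "user_id",  "type": "long"},
    {"name": "page",     "type": "string"},
    {"name": "referrer", "type": ["null", "string"], "default": null},
    {"name": "ts",       "type": "long", "doc": "epoch millis"}
  ]
}
```

The nullable field with a default is what enables compatible evolution (point 6): consumers reading old records get the default, and older consumers can ignore the new field.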
26
© OCTO 2015
Isn't the modern world of big data all about unstructured data, dumped in whatever form is
convenient, and parsed later when it is queried?
One of the primary advantages of this type of architecture where data is modeled as
streams is that applications are decoupled. Applications produce a stream of events
capturing what occurred without knowledge of which things subscribe to these streams.
The Need For Schemas
27
© OCTO 2015
Whenever you see a common activity across multiple systems try to use a common schema
for this activity.
An example of this that is common to all businesses is application errors (a sketch of such a shared schema follows this slide).
Share Event Schemas
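A sketch of what a shared application-error schema might look like; the fields are our guess at a reasonable common denominator, not a prescription:

```json
{
  "type": "record",
  "name": "ApplicationError",
  "namespace": "com.example.events",
  "fields": [
    {"name": "service",    "type": "string"},
    {"name": "host",       "type": "string"},
    {"name": "level",      "type": {"type": "enum", "name": "Severity",
                                    "symbols": ["WARN", "ERROR", "FATAL"]}},
    {"name": "message",    "type": "string"},
    {"name": "stacktrace", "type": ["null", "string"], "default": null},
    {"name": "ts",         "type": "long"}
  ]
}
```

With every system publishing errors in this one shape, a single monitoring consumer can cover the whole organization.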
28
© OCTO 2015
MODELING SPECIFIC DATA TYPES IN KAFKA