Every ad.
Every sales channel.
Every screen.
One platform.
Building Distributed Data Streaming System
Ashish Tadose
Lead Software Engineer
Big Data Analytics - PubMatic
Agenda
• What is stream processing
• Streaming architecture
• Scalable Data Ingestion
• Real-time stream processing system
What is Stream Processing?
In simple words, Streaming is…
Batch & Streaming processing
• Batch processing: Data Generator -> Ingestion -> Distributed File System -> Processing -> Data Store
• Stream data processing: Data Generator -> Ingestion -> Message Queue -> Processing -> Data Store
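A minimal Python sketch of the difference between the two paths, assuming events are plain strings and the job is a per-key count. The names batch_count and stream_count are hypothetical; a list stands in for the distributed file system and an iterator stands in for the message queue.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List


def batch_count(stored_events: List[str]) -> Dict[str, int]:
    """Batch path: the full data set is already stored, so count it in one pass."""
    return dict(Counter(stored_events))


def stream_count(queue: Iterable[str]) -> Iterator[Dict[str, int]]:
    """Stream path: update the result as each event arrives and emit a snapshot."""
    counts: Counter = Counter()
    for event in queue:
        counts[event] += 1
        # A result is available after every event, not only at the end.
        yield dict(counts)


events = ["click", "view", "click"]
print(batch_count(events))
for snapshot in stream_count(iter(events)):
    print(snapshot)
```

Both paths converge on the same final counts; the stream path simply exposes intermediate results with lower latency.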
Batch & Streaming processing
• Data Generator -> Ingestion, feeding two paths:
  • Stream data processing: Message Queue -> Processing -> Data Store
  • Batch processing: Distributed File System -> Processing -> Data Store
Batch & Streaming processing
• Data Generator -> Ingestion, feeding two paths:
  • Stream data processing: Message Queue -> Processing -> Data Store
  • Batch processing: Distributed File System -> Processing -> Data Store
Lambda Architecture: Velocity & Volume
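A rough Python sketch of the Lambda idea as it is commonly described: a batch layer recomputes a complete view from all stored data (volume), a speed layer keeps an incremental view of recent events (velocity), and queries merge the two. The class LambdaCounter and its methods are hypothetical names for illustration, and the single-threaded reset of the speed view is a simplifying assumption.

```python
from collections import Counter
from typing import List


class LambdaCounter:
    """Toy per-key counter with a batch view and a speed view."""

    def __init__(self) -> None:
        self.master_log: List[str] = []   # stand-in for the distributed file system
        self.batch_view: Counter = Counter()
        self.speed_view: Counter = Counter()

    def ingest(self, key: str) -> None:
        # Every event goes to both paths: stored for batch, counted for speed.
        self.master_log.append(key)
        self.speed_view[key] += 1

    def run_batch(self) -> None:
        # Recompute the batch view from the full log, then drop the speed
        # view because the batch view now covers those events.
        self.batch_view = Counter(self.master_log)
        self.speed_view.clear()

    def query(self, key: str) -> int:
        # Serving step: merge the (possibly stale) batch view with recent events.
        return self.batch_view[key] + self.speed_view[key]


counter = LambdaCounter()
for key in ["a", "b", "a"]:
    counter.ingest(key)
counter.run_batch()
counter.ingest("a")
print(counter.query("a"))
```

A real deployment would run the batch recomputation on a schedule and expire only the speed-layer data that the finished batch run actually covers.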
Streaming Ingestion Technologies
Ingestion Ecosystem
• Sources
  • Machine data
  • External stream & syslogs
• Data Collection
  • Flume
  • Kafka
  • Kinesis
  • Confluent
Flume
• Easier to set up
• Rich set of built-in tools
• No inherent support for data replication
• Nodes work in isolation
• Memory channel vs file channel
Kinesis
Kafka
• http://kafka.apache.org/
• Originated at LinkedIn, open sourced in early 2011
• Implemented in Scala, some Java
• 9 core committers, plus ~20 contributors
Why is Kafka so fast?
• Fast writes:
  • While Kafka persists all data to disk, essentially all writes go to the OS page cache, i.e. RAM.
• Fast reads:
  • Very efficient to transfer data from the page cache to a network socket
  • Linux: sendfile() system call
• Combination of the two = fast Kafka!
• Example (operations): on a Kafka cluster where the consumers are mostly caught up, you will see no read activity on the disks, as the brokers will be serving data entirely from cache.
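The two points above can be illustrated with a small Python sketch. This is not Kafka's code: append_records and serve_log are hypothetical helpers, the newline-delimited log format is an assumption, and a local socket pair stands in for a consumer connection. Python's socket.sendfile() uses the sendfile() system call where the platform supports it and falls back to ordinary send() otherwise.

```python
import os
import socket
import tempfile
from typing import Iterable


def append_records(path: str, records: Iterable[bytes]) -> None:
    """Append records to a log file. Without an explicit fsync, the write
    lands in the OS page cache and the kernel flushes it to disk later."""
    with open(path, "ab") as log:
        for record in records:
            log.write(record + b"\n")


def serve_log(path: str, sock: socket.socket, offset: int = 0) -> int:
    """Send the log from `offset` to a socket. With sendfile(), the kernel
    copies from the page cache to the socket without passing the bytes
    through this process's user-space buffers."""
    with open(path, "rb") as log:
        return sock.sendfile(log, offset=offset)


with tempfile.TemporaryDirectory() as tmp:
    segment = os.path.join(tmp, "segment.log")
    append_records(segment, [b"m1", b"m2"])
    producer_side, consumer_side = socket.socketpair()
    sent = serve_log(segment, producer_side)
    producer_side.close()
    received = b""
    while True:
        chunk = consumer_side.recv(4096)
        if not chunk:
            break
        received += chunk
    consumer_side.close()
    print(sent, received)
```

Because the recently written data is normally still in the page cache, a read that closely follows the write can be served without touching the disk, which is the behaviour the operations example describes.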
Source: http://kafka.apache.org/documentation.html#persistence
Flafka – Flume meets Kafka
Confluent - Centralized Ingestion with Kafka Pipeline
Stream Processing
Real-Time Stream Processing
• Processing system
  • Apache Storm
  • Apache Samza
  • Apache Spark (Streaming)
  • Project Apex - DataTorrent
• Storage
  • Hive, HDFS
  • HBase
  • MySQL
  • Custom
• Access
  • Depends on the data storage
  • Scalable query interface - Kafka
Streaming Design Patterns
• Micro batching
• Unpredictable incoming data
• Creating multiple streams
• Out of sequence events
• Stream joins
• Top N metrics
• External Lookup
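As an illustration of two of the patterns listed above, here is a minimal Python sketch that combines micro batching with a Top N metric. The function names micro_batches and top_n_per_batch are hypothetical, and batching by a fixed event count is a simplifying assumption; stream processors often cut micro batches by time interval instead.

```python
from collections import Counter
from itertools import islice
from typing import Iterable, Iterator, List, Tuple


def micro_batches(stream: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Group an unbounded stream into small, fixed-size batches."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def top_n_per_batch(
    stream: Iterable[str], batch_size: int, n: int
) -> Iterator[List[Tuple[str, int]]]:
    """Emit the N most frequent keys seen in each micro batch."""
    for batch in micro_batches(stream, batch_size):
        yield Counter(batch).most_common(n)


events = ["a", "b", "a", "c", "c", "c"]
for top in top_n_per_batch(events, batch_size=3, n=1):
    print(top)
```

Each batch is processed independently here; a running Top N across batches would need to carry the counter state forward between batches.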
Thank You