Introduction to Real-time
data processing
Yogi Devendra
(yogidevendra@apache.org)
Agenda
● What is big data?
● Data at rest Vs Data in motion
● Batch processing Vs Real-time data
processing (streaming)
● Examples
● When to use: Batch? Real-time?
● Current trends
Image ref [4]
Big data
Definition : big data
Big data is high-volume, high-velocity and/or
high-variety information assets that demand
cost-effective, innovative forms of information
processing that enable enhanced insight,
decision making, and process automation. [1]
Exploding sizes of datasets
● Google
○ >100 PB of data every day [3]
● Large Hadron Collider :
○ 150M sensors delivering data 40M times per sec;
nearly 600M collisions per sec
○ >500 exabytes per day if all sensor data were recorded [2]
○ Only 0.0001% of the data is actually analysed
Questions
Image ref [16]
Data at rest Vs Data in motion
● At rest :
○ Dataset is fixed
○ a.k.a. bounded [15]
● In motion :
○ Data keeps arriving continuously
○ a.k.a. unbounded
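A minimal sketch of the bounded/unbounded distinction, assuming a Python list stands in for data at rest and an endless generator stands in for data in motion. The names `bounded_dataset` and `unbounded_source` are hypothetical and chosen only for this illustration.

```python
import itertools
import random
from typing import Iterator, List


def bounded_dataset() -> List[int]:
    """Data at rest: a fixed dataset whose size is known up front."""
    return [12, 7, 30, 5]


def unbounded_source(seed: int = 0) -> Iterator[int]:
    """Data in motion: a generator that never ends on its own."""
    rng = random.Random(seed)
    while True:
        yield rng.randint(1, 100)


# A bounded dataset ends, so a total over all of it is well defined.
at_rest = bounded_dataset()
total_at_rest = sum(at_rest)

# An unbounded source has no end, so we can only ever look at a prefix of it.
first_five = list(itertools.islice(unbounded_source(), 5))
```

Calling `sum()` directly on `unbounded_source()` would never return, which is the practical meaning of "unbounded".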
Data at rest Vs Data in motion (continued)
● Big data generally has velocity
○ data arrives continuously
● The difference lies in when you analyze
your data [5]
○ after the event occurs ⇒ at rest
○ as the event occurs ⇒ in motion
Examples
● Data at rest
○ Finding stats about a group of people in a closed room
○ Analyzing last month's sales data to make
strategic decisions
● Data in motion
○ Finding stats about a group of runners in a marathon
○ e-commerce order processing
Questions
Image ref [16]
Batch processing
● Problem statement :
○ Process the entire dataset
○ Give the answer for X at the end.
Batch processing : Use-cases
● Sales summary for the previous
month [5]
● Model training for spam email detection
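The sales-summary use-case can be sketched as a batch job: read the whole dataset, then report one answer at the end. The record layout `(product, amount)` and the function name `batch_sales_summary` are assumptions made for this example, not part of the original slides.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Sale = Tuple[str, float]  # (product, amount): hypothetical record layout


def batch_sales_summary(sales: Iterable[Sale]) -> Dict[str, float]:
    """Process the entire dataset and return the answer only at the end."""
    totals: Dict[str, float] = defaultdict(float)
    for product, amount in sales:
        totals[product] += amount
    return dict(totals)


# Last month's sales are complete (data at rest) before the job starts.
last_month = [("tea", 10.0), ("coffee", 25.0), ("tea", 5.0)]
summary = batch_sales_summary(last_month)
```

Nothing is reported while the loop runs; the caller sees a result only after every record has been read.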
Batch processing : Characteristics
● Access to the entire data
● Splits are decided at launch time
● Capable of complex analysis (e.g.
model training) [6]
● Optimized for throughput (data processed
per sec)
● Example frameworks : MapReduce,
Apache Spark [6]
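A rough, single-process sketch of the split/map/reduce idea behind these characteristics. It is plain Python, not the Hadoop MapReduce or Spark API; the helper names `split`, `map_phase` and `reduce_phase` are hypothetical.

```python
from collections import Counter
from functools import reduce
from typing import List


def split(records: List[str], num_splits: int) -> List[List[str]]:
    """Partition the full dataset up front, before any processing starts."""
    return [records[i::num_splits] for i in range(num_splits)]


def map_phase(chunk: List[str]) -> Counter:
    """Count words within one split, independently of the other splits."""
    return Counter(word for line in chunk for word in line.split())


def reduce_phase(left: Counter, right: Counter) -> Counter:
    """Merge two partial counts into one."""
    return left + right


records = ["spam ham", "ham ham", "spam eggs"]
partials = [map_phase(chunk) for chunk in split(records, num_splits=2)]
word_counts = reduce(reduce_phase, partials, Counter())
```

Because the whole dataset is available at launch, the splits can be fixed once and each split processed in bulk, which favors throughput over per-record latency.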
Questions
Image ref [16]
Real-time data processing
● a.k.a. Stream processing
● Problem statement :
○ Process the incoming stream of data
○ to give the answer for X at this
moment.
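As a minimal sketch of this problem statement, the generator below emits an up-to-date answer after every incoming record instead of once at the end. The function name `running_order_total` and the sample order amounts are assumptions for illustration.

```python
from typing import Iterable, Iterator


def running_order_total(orders: Iterable[float]) -> Iterator[float]:
    """Yield the answer 'at this moment' after each incoming record."""
    total = 0.0
    for amount in orders:
        total += amount
        yield total


# A short list stands in for the stream here; any iterable works,
# including one that never ends.
answers = list(running_order_total([20.0, 5.0, 12.5]))
```

Each yielded value is valid as soon as its record arrives, so a consumer never has to wait for the stream to finish.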
Stream processing : Use-cases
● e-commerce order processing
● Credit card fraud detection
● Labelling a given email as spam or non-spam
Image ref [7]
Stream processing : Characteristics
● Results for X are based on the
current data
● Computes a function on one record or
a small window of records [6]
● Optimized for latency (avg. time
taken per record)
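Computing a function over a small window can be sketched with a fixed-size buffer. This assumes a count-based sliding window; real stream processors also offer time-based windows. The name `windowed_mean` is hypothetical.

```python
from collections import deque
from typing import Deque, Iterable, Iterator


def windowed_mean(values: Iterable[float], window_size: int) -> Iterator[float]:
    """Yield the mean of the most recent `window_size` records per arrival."""
    window: Deque[float] = deque(maxlen=window_size)
    for value in values:
        window.append(value)  # the oldest record drops out automatically
        yield sum(window) / len(window)


means = list(windowed_mean([10.0, 20.0, 30.0, 40.0], window_size=2))
```

Only the window is kept in memory, so the work per record stays small and bounded, which is what keeps latency low.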
Stream processing : Characteristics (continued)
● Computation needs to complete in near
real-time
● Computes something relatively simple, e.g.
using a pre-defined model to label a record
● Example frameworks : Apache Apex,
Apache Storm
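A small sketch of "using a pre-defined model to label a record". The keyword set below is a hypothetical stand-in for a model trained earlier in batch mode; it is not a real spam classifier, and it does not use the Apache Apex or Apache Storm APIs.

```python
from typing import Callable, Iterable, Iterator, Tuple

# Hypothetical stand-in for a model that was trained offline (in batch).
SPAM_WORDS = frozenset({"lottery", "winner", "free"})


def predefined_model(email: str) -> str:
    """Label one record using only the fixed, already-trained model."""
    words = set(email.lower().split())
    return "spam" if words & SPAM_WORDS else "non-spam"


def label_stream(
    emails: Iterable[str], model: Callable[[str], str]
) -> Iterator[Tuple[str, str]]:
    """Apply the model to each incoming record independently."""
    for email in emails:
        yield email, model(email)


labelled = list(
    label_stream(["You are a lottery winner", "Meeting at noon"], predefined_model)
)
```

The expensive part (training) happens elsewhere; the streaming side only performs a cheap per-record lookup.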
Questions
Image ref [16]
Batch Vs Streaming
pani puri ⇒ Streaming
image ref [9]
wada ⇒ batch
image ref [8]
Questions
Image ref [16]
Micro-batch
● Create batches of small size
● Process each micro-batch separately
● Example frameworks : Spark Streaming
pani puri ⇒ micro-batch
image ref [10]
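The micro-batch idea can be sketched as follows. This is a simplification that groups records by count; Spark Streaming forms its micro-batches by time interval, so the count-based grouping and the name `micro_batches` are assumptions for illustration only.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def micro_batches(stream: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Cut an incoming stream into small batches of at most `batch_size`."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # the (finite) example stream is exhausted
        yield batch


# Each micro-batch is processed separately, here with a simple sum.
batch_sums = [sum(batch) for batch in micro_batches(range(1, 8), batch_size=3)]
```

Smaller batches give fresher answers at the cost of more per-batch overhead; larger batches trade latency for throughput.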
When to use : Batch Vs Streaming?
● Depends on the use-case
○ Some are suitable for batch
○ Some are suitable for streaming
○ Some can be solved by either one
○ Some might need a combination of the two.
When to use : Batch Vs Real-time? (continued)
● Answers for the current snapshot ⇒ Real-time
○ Answers at the end ⇒ Open (either works)
● Complex calculations, multiple iterations
over the entire data ⇒ Batch
○ Simple computations ⇒ Open
● Low latency requirements (< 1s) ⇒ Real-time
When to use : Batch Vs Real-time? (continued)
● Each record can be processed
independently ⇒ Open
○ Independent processing not possible ⇒
Batch
● Depends on the use-case
○ Some use-cases can be solved by either one
○ Some others might need a combination of the two.
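The rules of thumb on these slides can be written down as a small helper. This is only a sketch of the heuristics as stated, not a definitive selection procedure; the `UseCase` fields and the function `suggest_mode` are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass
class UseCase:
    """Hypothetical description of a workload, mirroring the slide criteria."""
    needs_current_snapshot: bool
    needs_multiple_passes: bool
    latency_budget_seconds: float
    records_independent: bool


def suggest_mode(use_case: UseCase) -> str:
    """Return 'real-time', 'batch', 'combination' or 'open' (either works)."""
    wants_real_time = (
        use_case.needs_current_snapshot or use_case.latency_budget_seconds < 1.0
    )
    wants_batch = use_case.needs_multiple_passes or not use_case.records_independent
    if wants_real_time and wants_batch:
        return "combination"
    if wants_real_time:
        return "real-time"
    if wants_batch:
        return "batch"
    return "open"


fraud_check = UseCase(
    needs_current_snapshot=True,
    needs_multiple_passes=False,
    latency_budget_seconds=0.2,
    records_independent=True,
)
model_training = UseCase(
    needs_current_snapshot=False,
    needs_multiple_passes=True,
    latency_budget_seconds=3600.0,
    records_independent=False,
)
```

Here `suggest_mode(fraud_check)` returns `"real-time"` and `suggest_mode(model_training)` returns `"batch"`; a use-case that triggers both sides falls into `"combination"`.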
Questions
Image ref [16]
Can one replace the other?
● Batch processing is designed for ‘data at
rest’. ‘Data in motion’ becomes stale if it is
processed in batch mode.
● Real-time processing is designed for ‘data
in motion’, but it can be used for ‘data at
rest’ as well (in many cases).
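To illustrate the second point, the sketch below feeds a bounded dataset through a streaming-style computation and compares the last streamed answer with the batch answer. The functions `batch_max` and `streaming_max` are hypothetical examples; this shows one simple case, not a general proof.

```python
from typing import Iterable, Iterator, List, Optional


def batch_max(dataset: List[int]) -> int:
    """Batch style: needs the whole dataset, answers once at the end."""
    return max(dataset)


def streaming_max(stream: Iterable[int]) -> Iterator[int]:
    """Streaming style: keeps a running answer, one record at a time."""
    current: Optional[int] = None
    for value in stream:
        if current is None or value > current:
            current = value
        yield current


# 'Data at rest' replayed as if it were a stream.
data_at_rest = [3, 9, 4, 11, 2]
stream_answers = list(streaming_max(data_at_rest))
```

The final streamed answer matches the batch result, while the reverse does not hold: `batch_max` cannot return anything until its input ends.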
Questions
Image ref [16]
Quiz : is this Batch or Real-time?
● Queue for a roller coaster ride (image ref [11])
● Queue at the petrol pump (image ref [12])
Quiz : is this Batch or Real-time?
● Selecting a relevant ad to show for the
requested page (image ref [13])
● Courier dispatch from city A to B (image ref [14])
Questions
Image ref [16]
Current trends
● Many problems are difficult to split into
MapReduce form, which motivates alternative
paradigms for expressing user intent.
● More and more use-cases demand
faster insight into data (near real-time).
● ‘Data in motion’ is common.
● ‘Real-time data processing’ is gaining
traction.
Questions
Image ref [16]
References
1. Big Data | Gartner IT Glossary http://www.gartner.com/it-glossary/big-data/
2. Big Data | Wikipedia https://en.wikipedia.org/wiki/Big_data
3. Data size estimates | Follow the data https://followthedata.wordpress.com/2014/06/24/data-size-estimates/
4. Data Never Sleeps 2.0 | Domo https://www.domo.com/blog/2014/04/data-never-sleeps-2-0/
5. Data in motion vs. data at rest | Internap http://www.internap.com/2013/06/20/data-in-motion-vs-data-at-rest/
6. Difference between batch processing and stream processing | Quora https://www.quora.com/What-are-the-differences-between-batch-processing-and-stream-processing-systems/answer/Sean-Owen?srid=O9ht
7. How FAST is Credit Card Fraud Detection | FICO http://www.fico.com/en/latest-thinking/infographic/how-fast-is-credit-card-fraud-detection
8. CULINARY TERMS | panjakhada http://panjakhada.com/the-basics/
9. Crispy Chaat ... | grabhouse http://grabhouse.com/urbancocktail/11-crispy-chaat-joints-food-lovers-hyderabad/
10. Paani puri stall | cityshor http://www.cityshor.com/pune/food/street-food/camp/murali-paani-puri-stall/
11. Great Inventions: The Roller Coaster | findingdulcinea http://www.findingdulcinea.com/features/science/innovations/great-inventions/the-roller-coaster.html
12. RIL petrol pump network | economictimes http://articles.economictimes.indiatimes.com/2015-05-24/news/62583419_1_petrol-and-diesel-fuel-retailing-ril
13. Publishers | Propellerads https://propellerads.com/publishers/
14. Michael Bishop Couriers | Google plus https://plus.google.com/110684176517668223067
15. The world beyond batch: Streaming 101 http://radar.oreilly.com/2015/08/the-world-beyond-batch-streaming-101.html
16. How to Answer the Question http://www.clipartpanda.com/clipart_images/how-to-answer-the-question-46954146
17. Thank You http://www.planwallpaper.com/thank-you
