Realtime Distributed Analysis of Datastreams

Philipp Nolte – University of Passau – January 2014

1
Learn
Why we need fancy Big Data frameworks.
What the lambda architecture looks like.
How Twitter used to do real-time analytics.
Why Twitter created Storm.
How Storm works.
2
Limits

Imagine a traditional web analytics software:
Every page view increments the URL's database row.

3
First Aid

Queue your writes and write in batches.
Shard your data: Partition horizontally.
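Both first-aid measures can be sketched in a few lines. This is a minimal illustration, not a production design: the class name `BatchedCounterWriter`, the in-memory `Counter` objects standing in for database servers, and the CRC32-based shard choice are all assumptions made for the example.

```python
import zlib
from collections import Counter


class BatchedCounterWriter:
    """Queues page-view increments and writes them to shards in batches."""

    def __init__(self, num_shards, batch_size):
        self.num_shards = num_shards
        self.batch_size = batch_size
        self.pending = Counter()      # queued increments, not yet written
        self.pending_events = 0
        # Assumption: one Counter per shard stands in for one database server.
        self.shards = [Counter() for _ in range(num_shards)]
        self.flushes = 0

    def shard_for(self, url):
        # Stable hash, so the same URL always lands on the same shard.
        return zlib.crc32(url.encode("utf-8")) % self.num_shards

    def record_view(self, url):
        self.pending[url] += 1
        self.pending_events += 1
        if self.pending_events >= self.batch_size:
            self.flush()

    def flush(self):
        # One write per distinct URL instead of one write per page view.
        for url, count in self.pending.items():
            self.shards[self.shard_for(url)][url] += count
        self.pending.clear()
        self.pending_events = 0
        self.flushes += 1

    def total_views(self, url):
        return self.shards[self.shard_for(url)][url]


writer = BatchedCounterWriter(num_shards=4, batch_size=3)
for url in ["/a", "/b", "/a", "/a", "/b"]:
    writer.record_view(url)
writer.flush()  # write out the increments that are still queued
```

Note that the queued increments live only in memory here, so a crash before `flush()` loses them.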

4
Chronic Issues
Fault-tolerance is hard.
Applications become more and more complex.
You have to do all the work.

5
New Tools
Large scale computation systems such as Hadoop.
Scalable databases such as Cassandra and Riak.
Easy to use frameworks such as Storm and Dempsy.

6
Lambda Architecture
Theoretical, abstract architecture for working with big data.

[Diagram: three layers: Speed Layer, Serving Layer, Batch Layer]
7
Goal

Compute arbitrary functions on arbitrary data.
query = function ( all data )
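Read literally, the equation says that every query is a plain function over the complete dataset. A small sketch of that idea, where the record layout and the function names `pageviews` and `unique_visitors` are hypothetical:

```python
# Hypothetical master dataset: one record per page view.
all_data = [
    {"url": "/home", "user": "alice"},
    {"url": "/home", "user": "bob"},
    {"url": "/about", "user": "alice"},
    {"url": "/home", "user": "alice"},
]


def pageviews(data, url):
    """query = function(all data): count every view of the given URL."""
    return sum(1 for record in data if record["url"] == url)


def unique_visitors(data, url):
    """A different function over the same data, with no schema change."""
    return len({record["user"] for record in data if record["url"] == url})
```

Scanning all data on every query is correct but slow, which is the gap the three layers are meant to close.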

8
Properties
Robust and fault-tolerant.
Low latency reads and updates.
Scalable.
Minimal maintenance.

9
Batch Layer

Stores the immutable master dataset.
Precomputes arbitrary batch views.
Home of batch processing and MapReduce systems such as Hadoop.
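The batch layer's two jobs, keeping an append-only master dataset and recomputing views from it, might look like this in miniature. It is an in-memory sketch of the map and reduce steps, not Hadoop code; `append_event`, `map_phase` and `reduce_phase` are names invented for the example.

```python
from collections import defaultdict

# Assumption: a Python list stands in for the distributed master dataset.
# Records are only ever appended, never updated or deleted.
master_dataset = []


def append_event(url, hour):
    master_dataset.append((url, hour))  # tuples cannot be modified later


def map_phase(events):
    # Emit one (key, 1) pair per page view.
    for url, hour in events:
        yield (url, hour), 1


def reduce_phase(pairs):
    # Sum the values of all pairs that share a key.
    view = defaultdict(int)
    for key, value in pairs:
        view[key] += value
    return dict(view)


def compute_batch_view(events):
    # The view is always recomputed from scratch over all events.
    return reduce_phase(map_phase(events))


append_event("/home", 9)
append_event("/home", 9)
append_event("/about", 10)
batch_view = compute_batch_view(master_dataset)
```

Because the view is derived from raw events, a bug in the view logic can be fixed by correcting the function and recomputing.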

10

Serving Layer

Read-only random-access to batch views.
Updated by batch layer.
Indexes batch views.
Home of real-time query systems such as Cloudera Impala for Hadoop.
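A toy serving layer shows the read-only contract: the batch layer swaps in a complete new view, and clients can only look values up. The class `ServingLayer` and its methods are hypothetical and unrelated to the Impala API.

```python
from types import MappingProxyType


class ServingLayer:
    """Indexes a batch view by key and exposes it read-only."""

    def __init__(self):
        self._view = MappingProxyType({})

    def load_batch_view(self, batch_view):
        # Called by the batch layer: replace the whole view in one step.
        self._view = MappingProxyType(dict(batch_view))

    def get(self, key, default=0):
        # Random-access read through the dictionary index.
        return self._view.get(key, default)

    @property
    def view(self):
        return self._view


serving = ServingLayer()
serving.load_batch_view({"/home": 2, "/about": 1})

try:
    serving.view["/home"] = 99   # clients cannot write to a batch view
    write_rejected = False
except TypeError:
    write_rejected = True
```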
11
Speed Layer

Compensates for high-latency batch views.
Fast, incremental algorithms.
More complex because of random writes.
Home of Apache HBase or Storm.
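An incremental algorithm updates the existing result for each new event instead of recomputing it from all data. A minimal sketch with an in-memory `Counter` as the realtime view; the class name `SpeedLayer` is an assumption of the example:

```python
from collections import Counter


class SpeedLayer:
    """Maintains a realtime view with one in-place update per event."""

    def __init__(self):
        self.realtime_view = Counter()

    def on_event(self, url):
        # Random write: mutate the stored count instead of recomputing it.
        self.realtime_view[url] += 1


recent_events = ["/home", "/about", "/home"]
speed = SpeedLayer()
for url in recent_events:
    speed.on_event(url)
```

The in-place mutation is what makes this layer harder to get right: a lost or repeated update silently corrupts the count, whereas a batch view can simply be recomputed.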

12
Lambda Architecture
[Diagram: Data flows into the Batch Layer and the Speed Layer; the Serving Layer exposes Batch Views, the Speed Layer exposes Realtime Views.]
13

Query
[Diagram: along the Time axis, the Available Data is covered by the Batch View followed by the Realtime View.]
Discard the Realtime View as soon as it is represented in the Batch View.
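Answering a query then means merging both views and dropping realtime entries once a batch run has absorbed them. A small sketch, assuming hour-sized time buckets and the hypothetical helpers `query` and `discard_absorbed`:

```python
def query(url, batch_view, realtime_view):
    """Merge the batch view with the not-yet-absorbed realtime counts."""
    recent = sum(
        count for (u, _hour), count in realtime_view.items() if u == url
    )
    return batch_view.get(url, 0) + recent


def discard_absorbed(realtime_view, batch_covered_until):
    """Keep only realtime buckets the batch view does not cover yet."""
    return {
        (u, hour): count
        for (u, hour), count in realtime_view.items()
        if hour >= batch_covered_until
    }


# Before the next batch run: the batch view covers all hours before 4.
batch_view = {"/home": 8}
realtime_view = {("/home", 4): 2, ("/home", 5): 3}
before = query("/home", batch_view, realtime_view)

# After a batch run that covers all hours before 5.
batch_view = {"/home": 10}
realtime_view = discard_absorbed(realtime_view, batch_covered_until=5)
after = query("/home", batch_view, realtime_view)
```

Both queries return the same total; skipping the discard step would count hour 4 twice.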

14
Twitter’s Early Days
[Diagram: Tweets flow through a chain of Queues and Workers; other labels: Map, URLs, Hadoop, Cassandra.]
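The queue-and-worker pattern from the diagram can be imitated with threads and in-process queues. This is a toy: the two stages (extracting URLs from tweets, then counting them) are assumed from the diagram labels, and the names `extract_worker`, `count_worker` and `run_pipeline` are invented here.

```python
import queue
import re
import threading
from collections import Counter

URL_PATTERN = re.compile(r"https?://\S+")
STOP = object()  # sentinel that tells a worker to shut down


def extract_worker(tweets_q, urls_q):
    # Stage 1: pull tweets off a queue, push each contained URL onward.
    while True:
        tweet = tweets_q.get()
        if tweet is STOP:
            urls_q.put(STOP)  # forward the shutdown signal downstream
            break
        for url in URL_PATTERN.findall(tweet):
            urls_q.put(url)


def count_worker(urls_q, counts):
    # Stage 2: pull URLs off the second queue and count them.
    while True:
        url = urls_q.get()
        if url is STOP:
            break
        counts[url] += 1


def run_pipeline(tweets):
    tweets_q, urls_q = queue.Queue(), queue.Queue()
    counts = Counter()
    workers = [
        threading.Thread(target=extract_worker, args=(tweets_q, urls_q)),
        threading.Thread(target=count_worker, args=(urls_q, counts)),
    ]
    for worker in workers:
        worker.start()
    for tweet in tweets:
        tweets_q.put(tweet)
    tweets_q.put(STOP)
    for worker in workers:
        worker.join()
    return counts


url_counts = run_pipeline([
    "read http://a.example now",
    "again http://a.example and http://b.example",
    "no link here",
])
```

Even in this toy, wiring the queues, shutting workers down and deciding what happens when a worker dies are all left to the application.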
15
Storm
Guaranteed message processing without message brokers.
Horizontal scalability.
Fault-tolerance.
High level of abstraction.
Just works.
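Guaranteed processing can be pictured as an acknowledge-or-replay loop: a message stays pending until a handler confirms it, and a failure queues it for another attempt. The sketch below only illustrates that idea and is not Storm's implementation or API; `ReplayingSource` and `process_all` are hypothetical names.

```python
class ReplayingSource:
    """Keeps every message until it is acknowledged; replays failures."""

    def __init__(self, messages):
        self.pending = dict(enumerate(messages))  # id -> unacked message
        self.to_send = list(self.pending)         # ids waiting to be sent

    def next_message(self):
        if not self.to_send:
            return None
        msg_id = self.to_send.pop(0)
        return msg_id, self.pending[msg_id]

    def ack(self, msg_id):
        del self.pending[msg_id]      # fully processed, safe to forget

    def fail(self, msg_id):
        self.to_send.append(msg_id)   # schedule the message for replay


def process_all(source, handler):
    while True:
        item = source.next_message()
        if item is None:
            break
        msg_id, message = item
        try:
            handler(message)
        except Exception:
            source.fail(msg_id)
        else:
            source.ack(msg_id)


seen = []
attempts = {}


def flaky_handler(message):
    attempts[message] = attempts.get(message, 0) + 1
    if message == "b" and attempts[message] == 1:
        raise RuntimeError("simulated worker crash")
    seen.append(message)


source = ReplayingSource(["a", "b", "c"])
process_all(source, flaky_handler)
```

In this sketch a message can be attempted more than once, so handlers should tolerate repeats.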
16
Storm Topologies
[Diagram: Spouts emit Streams that flow through a graph of ⚡️Bolts.]
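A topology is a graph in which spouts produce streams and bolts transform them. The shape can be imitated with chained Python generators. This is a single-process analogy, not Storm code, and the word-count example with `sentence_spout`, `split_bolt` and `count_bolt` is an assumption chosen for illustration.

```python
from collections import Counter


def sentence_spout():
    # Spout: the source of a stream (here, a fixed list of sentences).
    for sentence in ["storm is fast", "storm is fault tolerant"]:
        yield sentence


def split_bolt(stream):
    # Bolt: consumes one stream and emits a new one (sentences -> words).
    for sentence in stream:
        for word in sentence.split():
            yield word


def count_bolt(stream):
    # Bolt: keeps running counts and emits (word, count) updates.
    counts = Counter()
    for word in stream:
        counts[word] += 1
        yield word, counts[word]


def run_topology():
    # Wire spout -> split bolt -> count bolt and keep the latest counts.
    latest = {}
    for word, count in count_bolt(split_bolt(sentence_spout())):
        latest[word] = count
    return latest


word_counts = run_topology()
```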

17
Parallel Tasks
[Diagram: each Spout and ⚡️Bolt runs as several parallel Tasks (T); Streams connect the Tasks.]
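When a bolt runs as several tasks, each tuple has to be routed to one of them. One common choice is to hash a field so that equal values always reach the same task. The sketch below shows that routing in plain Python; `CountTask`, `task_index` and `run_parallel_count` are hypothetical names and the CRC32 hash is an assumption.

```python
import zlib
from collections import Counter


class CountTask:
    """One parallel instance of a counting bolt with its own state."""

    def __init__(self):
        self.counts = Counter()

    def execute(self, word):
        self.counts[word] += 1


def task_index(word, num_tasks):
    # Stable hash of the field: equal words always map to the same task.
    return zlib.crc32(word.encode("utf-8")) % num_tasks


def run_parallel_count(words, num_tasks):
    tasks = [CountTask() for _ in range(num_tasks)]
    for word in words:
        tasks[task_index(word, num_tasks)].execute(word)
    return tasks


words = ["storm", "bolt", "storm", "spout", "bolt", "storm"]
tasks = run_parallel_count(words, num_tasks=3)

# Because routing is by word, per-task counts can be added up safely.
merged = Counter()
for task in tasks:
    merged.update(task.counts)
```

Routing by field keeps each word's count in a single task, so no two tasks ever hold partial counts for the same word.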

18

Demo

Storm in action

19
Know
Why we need fancy Big Data frameworks.
What the lambda architecture looks like.
How Twitter used to do real-time analytics.
Why Twitter created Storm.
How Storm works.
20
The End.

Questions?

21
