A real-time architecture using
Hadoop and Storm.
Speaker

Nathan Bijnens
@nathan_gs

A real-time architecture using Hadoop & Storm. #JaxLondon

2
Our Vision

Volume
Big Data


3
Big Data

Velocity

4
Our Vision

Volume

Variety

5
Computing Trends
Past                                          Current
Computation (CPUs) Expensive                  Computation Cheap (Many Core Computers)
Disk Storage Expensive                        Disk Storage Cheap (Cheap Commodity Disks)
DRAM Expensive                                DRAM / SSD Getting Cheap
Coordination Easy (Latches Don’t Often Hit)   Coordination Hard (Latches Stall a Lot, etc)

Source: Immutability Changes Everything - Pat Helland, RICON2012


6
Credits
Nathan Marz
Ex-Backtype & Twitter
Startup in stealth mode
Storm
Cascalog
ElephantDB
manning.com/marz


7
A Data System


8
Data is more than Information

Not all information is equal.
Some information is derived from other pieces of
information.


9
Data is more than Information

Eventually you will reach the most
‘raw’ form of information.
This is the information you hold true, simply because it
exists.
Let’s call this ‘data’, very similar to ‘event’.


10
Events - Before

Events used to manipulate
the master data.


11
Events - After

Today, events are the master
data.


12
Data System

Let’s store everything.


13
Events

Data is Immutable


14
Events

Data is Time Based


15
Capturing change traditionally

Person  Location          Person  Location
Nathan  Antwerp           Nathan  Ghent
Geert   Dendermonde       Geert   Dendermonde
John    Ghent             John    Ghent

16
Capturing change

Person  Location     Timestamp         Person  Location     Time
Nathan  Antwerp      2005-01-01        Nathan  Antwerp      2005-01-01
Geert   Dendermonde  2011-10-08        Geert   Dendermonde  2011-10-08
John    Ghent        2010-05-02        John    Ghent        2010-05-02
                                       Nathan  Ghent        2013-02-03

17
Query

The data you query is often
transformed, aggregated, ...
Rarely used in its original form.

18
Query

Query = function(all data)

19
Number of people living in each city.

Person  Location     Time              Location     Count
Nathan  Antwerp      2005-01-01        Ghent        2
Geert   Dendermonde  2011-10-08        Dendermonde  1
John    Ghent        2010-05-02
Nathan  Ghent        2013-02-03
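
The query on this slide can be sketched as a pure function over the full event list. This is a minimal illustration, not the speakers' code: the names `EVENTS` and `people_per_city`, and the optional `as_of` parameter, are assumptions made for the example. It replays every event in time order, keeps the latest location per person, and counts people per city.

```python
from collections import Counter
from datetime import date

# The immutable, time-stamped events from the table on this slide.
EVENTS = [
    ("Nathan", "Antwerp", date(2005, 1, 1)),
    ("Geert", "Dendermonde", date(2011, 10, 8)),
    ("John", "Ghent", date(2010, 5, 2)),
    ("Nathan", "Ghent", date(2013, 2, 3)),
]


def people_per_city(events, as_of=None):
    """Query = function(all data): fold every event into a per-city count.

    `as_of` is a hypothetical extra parameter: because events are never
    overwritten, the same function can also answer the question for a
    moment in the past.
    """
    latest = {}
    for person, city, when in sorted(events, key=lambda e: e[2]):
        if as_of is None or when <= as_of:
            latest[person] = city  # a later event supersedes an earlier one
    return dict(Counter(latest.values()))


print(people_per_city(EVENTS))
print(people_per_city(EVENTS, as_of=date(2012, 1, 1)))
```

The first call reproduces the Location/Count table above (Ghent 2, Dendermonde 1). The second call shows why keeping the timestamped events matters: the state before Nathan moved can still be regenerated.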


20
Query

All Data

Query


22
Query: Precompute

All Data

Precomputed
View

Query


23
Layered Architecture

Batch Layer

Speed Layer

Serving Layer


24
Layered Architecture

Query

Cassandra

Incoming Data
Hadoop

ElephantDB

25
Batch Layer


26
Batch Layer

Incoming Data
Hadoop

ElephantDB

27
Batch Layer

Unrestrained computation.


28
Batch Layer

No need to De-Normalize.


29
Batch Layer

Horizontally scalable.

30
Batch Layer

High Latency.
Let’s pretend temporarily that update latency
doesn’t matter.


31
Batch Layer

Functional computation, based on
immutable inputs, is idempotent.


32
Batch Layer

Stores master copy of data
set...
append only.


33
Batch Layer


34
Batch: View generation

View
#1

Master Dataset

MapReduce

View
#2

View
#3


35
MapReduce

MAP

1. Take a large data set and divide it into subsets.
2. Perform the same function on all subsets: DoWork(), DoWork(), DoWork(), …

REDUCE

3. Combine the output from all subsets.

Output
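
The three steps can be sketched in a few lines of single-process Python. This is only an illustration of the idea, not Hadoop's API: `split`, `do_work` and `combine` are hypothetical names, and a real framework would run each `do_work` call on a different machine.

```python
from collections import defaultdict


def split(records, n_subsets):
    """Step 1: divide a large data set into subsets."""
    return [records[i::n_subsets] for i in range(n_subsets)]


def do_work(subset):
    """Step 2 (MAP): the same function runs on every subset.

    Here it emits a (city, 1) pair for each resident it sees.
    """
    return [(city, 1) for _person, city in subset]


def combine(mapped_subsets):
    """Step 3 (REDUCE): group the pairs by key and sum the values."""
    grouped = defaultdict(list)
    for pairs in mapped_subsets:
        for key, value in pairs:
            grouped[key].append(value)
    return {key: sum(values) for key, values in grouped.items()}


RESIDENTS = [("Nathan", "Ghent"), ("Geert", "Dendermonde"), ("John", "Ghent")]
output = combine(do_work(subset) for subset in split(RESIDENTS, 2))
print(output)
```

Because `do_work` looks only at its own subset, the number of subsets can grow with the data without changing the result.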


36
Serialization & Schema

Catch errors as quickly as they happen.
Validation on write vs on read.


37
Serialization & Schema

CSV is actually a serialization language that is
just poorly defined.


38
Serialization & Schema
Use a format with a schema:
- Thrift
- Avro
- Protocol Buffers

Added bonus: it’s faster & uses less space.
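
Validation on write can be illustrated without any of these libraries. The sketch below uses a plain Python dataclass as a stand-in for a Thrift, Avro or Protocol Buffers schema; `LocationEvent` and `parse_row` are hypothetical names, and the checks are assumptions chosen for the example. A malformed record is rejected before it reaches the master dataset instead of being discovered at read time.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)  # frozen: an event cannot be modified once created
class LocationEvent:
    person: str
    location: str
    timestamp: date

    def __post_init__(self):
        # Validation on write: refuse structurally invalid records up front.
        for field_name in ("person", "location"):
            value = getattr(self, field_name)
            if not isinstance(value, str) or not value:
                raise ValueError(f"{field_name} must be a non-empty string")
        if not isinstance(self.timestamp, date):
            raise ValueError("timestamp must be a date")


def parse_row(row):
    """Turn a loosely typed CSV-style row into a validated event."""
    person, location, raw_timestamp = row  # wrong field count raises ValueError
    return LocationEvent(person, location, date.fromisoformat(raw_timestamp))


master_dataset = []
master_dataset.append(parse_row(["Nathan", "Ghent", "2013-02-03"]))

try:
    master_dataset.append(parse_row(["Geert", "Dendermonde", "08/10/2011"]))
except ValueError as error:
    print("rejected on write:", error)
```

With plain CSV the second row would be stored as-is, and the bad date would only surface when a job tries to read it.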


39
Batch View Database

Read only database.
No random writes required.


40
Batch View Database

Every iteration produces the
Views from scratch.


41
Batch View Database
ElephantDB
Splout
Voldemort
…


42
Batch Layer
We are not done yet…

Timeline (Time axis, ending at Now):
Data absorbed into Batch Views, then data not yet absorbed.
Just a few hours of data.

44
Speed Layer


45
Overview
Cassandra

Incoming Data
Hadoop

ElephantDB

46
Speed Layer

Stream processing.


47
Speed Layer

Continuous computation.


48
Speed Layer

Transactional.


49
Speed Layer

Storing a limited window of
data.
Compensating for the last few hours of data.


50
Speed Layer

All the complexity is isolated in the
Speed layer.
If anything goes wrong, it’s auto-corrected.


51
CAP
You have a choice between:
Availability
- Queries are eventually consistent.

Consistency
- Queries are consistent.

52
Eventual accuracy

Some algorithms are hard to
implement in real time. For those
cases we could estimate the results.
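
Unique counts are a typical case: an exact count needs every distinct item in memory, while an estimate needs only a small, fixed-size summary. The sketch below uses a K-Minimum-Values estimator as one possible technique. It is an assumption made for illustration and is not named in the talk; the class name `KMinValues` and the default `k=512` are hypothetical.

```python
import hashlib
import heapq


class KMinValues:
    """Estimate the number of distinct items from the k smallest hash values."""

    def __init__(self, k=512):
        self.k = k
        self._heap = []        # negated hashes, so the largest kept hash is on top
        self._members = set()  # hashes currently kept, to ignore duplicates

    @staticmethod
    def _hash(item):
        # Map an item to a pseudo-uniform float in [0, 1).
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def add(self, item):
        h = self._hash(item)
        if h in self._members:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # Replace the largest kept hash with the new, smaller one.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def estimate(self):
        if len(self._heap) < self.k:
            return len(self._heap)  # fewer than k distinct items: count is exact
        kth_smallest = -self._heap[0]
        return int((self.k - 1) / kth_smallest)


sketch = KMinValues()
for visit in range(100_000):
    sketch.add(f"user-{visit % 20_000}")  # 20,000 distinct users, each seen 5 times
print(sketch.estimate())
```

The summary never holds more than `k` values, so it fits an incremental, streaming update. The batch layer can later replace the estimate with the exact figure.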


53
Speed Layer

Incoming Data

Real Time View 1

Real Time View 2

54
Storm
Message passing.
Distributed processing.
Horizontally scalable.
Incremental algorithms.
Fast.
Data in motion.


55
Storm

Nimbus

Zookeeper

Worker Node: Supervisor, Executors
Worker Node: Supervisor, Executors
Worker Node: Supervisor, Executors

56
Storm
Tuple

Stream


57
Storm
Spout

Bolt


58
Storm
Grouping
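
How tuples, spouts, bolts and groupings fit together can be sketched in plain Python. This is a single-process simulation under stated assumptions and does not use Storm's real API: `sentence_spout`, `split_bolt`, `CountBolt`, `fields_grouping` and `run_topology` are hypothetical names. The grouping function routes tuples with the same field value to the same bolt task, which is what makes a per-key running count possible.

```python
import zlib
from collections import Counter


def sentence_spout():
    """Spout: a source of tuples (here a small, replayable list)."""
    for sentence in ["storm is fast", "hadoop is batch", "storm is streaming"]:
        yield (sentence,)


def split_bolt(tup):
    """Bolt: consumes one tuple and emits zero or more new tuples."""
    (sentence,) = tup
    for word in sentence.split():
        yield (word,)


class CountBolt:
    """Bolt with state: an incremental count, updated one tuple at a time."""

    def __init__(self):
        self.counts = Counter()

    def execute(self, tup):
        (word,) = tup
        self.counts[word] += 1


def fields_grouping(tup, n_tasks):
    """Grouping: equal field values always go to the same task."""
    return zlib.crc32(tup[0].encode("utf-8")) % n_tasks


def run_topology(n_count_tasks=2):
    tasks = [CountBolt() for _ in range(n_count_tasks)]
    for sentence_tuple in sentence_spout():
        for word_tuple in split_bolt(sentence_tuple):
            tasks[fields_grouping(word_tuple, n_count_tasks)].execute(word_tuple)
    return tasks


for index, task in enumerate(run_topology()):
    print(index, dict(task.counts))
```

In a real cluster each task would run on a different executor, and the stream would be unbounded rather than a three-item list.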


59
Data Ingestion
Kafka
Flume
Scribe
*MQ
Kestrel


60
Speed Layer Views
The views are stored in a read/write database:
- Cassandra
- HBase
- Redis
- MySQL
- Elasticsearch
- …

Much more complex than a read only view.

61
Serving Layer


62
Overview

Query

Cassandra

Incoming Data
Hadoop

ElephantDB

63
Serving Layer

Random reads


64
Serving Layer

This layer queries the Batch & Real
Time views and merges them.

65
Serving Layer

Batch Views

Real Time Views

Merge
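
For a simple counter the merge is an addition per key. The sketch below assumes, for illustration only, that both views map a key to a count and that the real-time view covers only events the batch view has not absorbed yet; `merge_views` and the sample numbers are hypothetical.

```python
def merge_views(batch_view, realtime_view):
    """Serving layer: combine the precomputed batch view with recent deltas."""
    merged = dict(batch_view)  # copy, so neither view is modified
    for key, recent_count in realtime_view.items():
        merged[key] = merged.get(key, 0) + recent_count
    return merged


# Page views absorbed by the last batch run (hypothetical numbers).
batch_view = {"/home": 1200, "/pricing": 300}
# Page views from the last few hours, kept by the speed layer.
realtime_view = {"/home": 35, "/blog": 4}

print(merge_views(batch_view, realtime_view))
```

Once the next batch run has absorbed those hours, the matching real-time entries can be dropped and the merged answer stays the same.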


66
Serving Layer

How to query an Average?
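
One common answer is to avoid storing the average itself: each view keeps a sum and a count, and the division happens at query time. The sketch below is an illustration with hypothetical numbers and a hypothetical function name, `merge_average`.

```python
def merge_average(batch, realtime):
    """Each argument is a (sum, count) pair; returns the combined average."""
    total = batch[0] + realtime[0]
    count = batch[1] + realtime[1]
    return total / count if count else None


batch = (1200.0, 10)   # sum and count absorbed by the batch view
realtime = (90.0, 2)   # sum and count from the last few hours

print(merge_average(batch, realtime))

# Averaging the two averages would weight 2 recent values like 10 old ones.
naive = (batch[0] / batch[1] + realtime[0] / realtime[1]) / 2
print(naive)
```

Sums and counts can be merged by addition, while two averages cannot. Not every aggregate decomposes this neatly.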


67
Overview


68
Overview

Query

Cassandra

Incoming Data
Hadoop

ElephantDB

69
Lambda Architecture


70
Lambda Architecture

Can discard any view, batch and real
time, and just recreate everything from
the master data.


71
Lambda Architecture

Mistakes are corrected via recomputation.
Write bad data? Remove the data & recompute.
Bug in view generation? Just recompute the view.
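
Recomputation can be sketched as follows. The example is an illustration under stated assumptions: `build_view` is a hypothetical view function, and the "Atlantis" record is an invented piece of bad data. Because the view is a pure function of the master dataset, fixing the input and running the function again is the whole repair procedure.

```python
from collections import Counter


def build_view(master_dataset):
    """Recreate the 'people per city' view from scratch."""
    latest = {}
    for person, city, day in sorted(master_dataset, key=lambda e: e[2]):
        latest[person] = city  # ISO dates sort correctly as strings
    return dict(Counter(latest.values()))


master = [
    ("Nathan", "Antwerp", "2005-01-01"),
    ("John", "Ghent", "2010-05-02"),
    ("Geert", "Dendermonde", "2011-10-08"),
    ("Nathan", "Ghent", "2013-02-03"),
    ("Geert", "Atlantis", "2014-01-01"),  # hypothetical bad record
]

broken_view = build_view(master)

# Write bad data? Remove the data & recompute.
cleaned = [event for event in master if event[1] != "Atlantis"]
fixed_view = build_view(cleaned)

print(broken_view)
print(fixed_view)
```

Running `build_view` twice on the same input gives the same view, so a failed or repeated run does no harm.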


72
Lambda Architecture

Data storage is highly optimized.


73
Lambda Architecture

Immutability changes everything.


74
Questions?

@nathan_gs & #BigDataCon13


75
DataCrunchers
We enable companies in envisioning, defining and
implementing a data strategy.
A one-stop-shop for all your Big Data needs.
The first Big Data Consultancy agency in Belgium.


76
Thank you

@nathan_gs


77

A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

Editor's Notes

  • #4: How much data do you have? 44 times as much data in the next decade; 15 ZB in 2015. Data silos (ERP, CRM, …). Customers: Trimble (3 TB in their database system), Truvo (changing an index takes 24 hours). Traditional systems cannot handle this volume. Turn 12 terabytes of Tweets created each day into improved product sentiment analysis. Convert 350 billion annual meter readings to better predict power consumption.
  • #5: Real time. Time-sensitive decision making: fraud detection, energy allocation, marketing campaigns, market transactions. Solution: real-time solutions in combination with batch (Hadoop); NoSQL systems.
  • #6: Structured and unstructured: 80% is unstructured data. A key drawback of using traditional relational database systems is that they're not good at handling variable data. A flexible data model: Word, email, photo, text, video, APIs, … What are your needs regarding variety? The end result: bringing structure into unstructured data. Monitor hundreds of live video feeds from surveillance cameras to target points of interest. Exploit the 80% data growth in images, video and documents to improve customer satisfaction.
  • #7: We can afford to keep immutable copies of lots of data. We NEED immutability to coordinate with fewer challenges. Semaphores & locks are the things to avoid: instruction opportunities lost waiting for a semaphore increase with more cores…
  • #10: The number of followers on Twitter = all follows & unfollows combined. Account balance.
  • #11: Data = event. In an ever-changing world we found a ‘safe haven’ for data. Everything we do generates events: pay with credit card, commit to Git, click on a webpage, tweet.
  • #14: It is easier to store all data in a cost-effective way. Compare to the DWH world.
  • #15: Immutability greatly restricts the range of errors that can cause data loss or data corruption. E.g. only CR, no more CRUD. Information might of course change. Fault tolerance: data loss (human error, hardware failure), data corruption. Parallel with functional programming.
  • #16: Allows state regeneration. E.g. what was my bank balance on 1 May 2005?
  • #20: Queries as pure functions that take all data as input is the most general formulation. Different functions may look at different portions and aggregate information in different ways.
  • #24: Too slow; might be petabyte scale. Impala/Drill: why not?
  • #29: The batch layer can calculate anything (given enough time).
  • #30: The batch layer stores the data normalized, but in the views it generates, data is often, if not always, denormalized.
  • #31: Not vertically
  • #33: It’s OK to croak and restart
  • #34: Is something really immutable when its name can change?
  • #35: Doesn’t have to be Hadoop. What matters here is a distributed file system combined with a processing framework. Spark,
  • #37: Source: PolybasePass2012.pptx; http://whyjava.wordpress.com/2011/08/04/how-i-explained-mapreduce-to-my-wife/
  • #38: http://www.quora.com/Apache-Hadoop/What-is-the-advantage-of-writing-custom-input-format-and-writable-versus-the-TextInputFormat-and-Text-writable/answer/Eric-Sammer?srid=PU&st=ns Value of schemas: structural integrity; guarantees on what can and can’t be stored; prevents corruption. Otherwise you’ll detect corruption issues at read time.
  • #39: http://www.quora.com/Apache-Hadoop/What-is-the-advantage-of-writing-custom-input-format-and-writable-versus-the-TextInputFormat-and-Text-writable/answer/Eric-Sammer?srid=PU&st=ns
  • #43: But this can be solved, e.g. by using ES to generate your views in advance.
  • #50: In some circumstances.
  • #52: All the complexity of *dealing* with the CAP theorem (like read repair) is isolated in the realtime layer.
  • #53: Consistency (all nodes see the same data at the same time). Availability (a guarantee that every request receives a response about whether it was successful or failed). Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system). http://codahale.com/you-cant-sacrifice-partition-tolerance/ HBase vs Cassandra.
  • #54: E.g. unique counts, ML.
  • #57: Nimbus: manages the cluster. Worker Node: Supervisor: manages workers; restarts them if needed. Executor: physical JVM process; executes tasks (those are spread evenly across the workers). Tasks: each in its own thread; is the actual Bolt or Spout; processes the stream.
  • #58: Tuple: named list of values; dynamically typed. Stream: sequence of Tuples.
  • #59: Spout: source of Streams; sometimes replayable. Bolt: Stream transformations; at least 1 input stream; 0 - * output streams.
  • #65: The serving layer needs to be able to answer any query in a short amount of time.
  • #68: AVG = sum / count: pre-aggregate the sum and the count, but not everything is possible.
  • #71: Lambda first named by Alonzo Church, he needed a letter for functional abstraction in theory of computation in the 1930s.
  • #72: High tolerance for human & system errors.
  • #73: http://www.quora.com/Apache-Hadoop/What-is-the-advantage-of-writing-custom-input-format-and-writable-versus-the-TextInputFormat-and-Text-writable/answer/Eric-Sammer?srid=PU&st=ns
  • #74: Data storage layer optimized independently from query resolution layer
  • #75: If you remember one thing from this presentation, it is: immutability.