big DATA
mob SCALE
JAX London 2013 - Darach Ennis - @darachennis
Big Events, Mob Scale - Darach Ennis (Push Technology)
small FAST
DATA guy
Big Data

“The techniques and technologies for such data-intensive science are so different that it is
worth distinguishing data-intensive science from
computational science as a new, fourth paradigm”

- Jim Gray

The Fourth Paradigm: Data-Intensive Scientific Discovery. - Microsoft 2009
DATA intensive
science SCALE
Compute Sympathy
A Wall Street Second
A Swiss Second
Small Data? <= 128 bytes
HTTP GET/POST - A typical RESTful performance
[Chart: Req/Sec, Bw/Sec (MB), Avg Latency (ms), Max Latency (ms) and Stdev (ms) plotted against Concurrent Connections (1, 2, 4, ... 1024) on log-scale axes; data labels range from 3,907 to 15,787]
Small Data? <= 1K
HTTP GET/POST - A typical RESTful performance
[Chart: Req/Sec, Bw/Sec (MB), Avg Latency (ms), Max Latency (ms) and Stdev (ms) plotted against Concurrent Connections (1, 2, 4, ... 1024) on log-scale axes; data labels range from 690 to 2,916]
Big Events - 1 Billion Sources
Ballpark number of boxes if each box can handle 2500 events/second
[Chart: number of boxes (log scale, 1 to 1,000,000) for 1 million, 10 million, 100 million and 1 billion sources, each emitting one event per day (1/dy), hour (1/hr), minute (1/mn) or second (1/sc); the largest data label is 400,000]
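The ballpark above can be sketched as simple arithmetic. This is a minimal sketch, assuming the number of boxes is the aggregate event rate divided by the per-box capacity, rounded up; the function name and the period table are illustrative and do not come from the slides. It reproduces labels such as 400,000 (1 billion sources at 1/sc) and 112 (1 billion sources at 1/hr); other labels on the original chart may rest on assumptions not stated on the slide.

```python
# Per-box capacity taken from the slide: 2500 events/second.
CAPACITY_PER_BOX = 2500

# Seconds per emission period, keyed by the slide's own abbreviations.
PERIOD_SECONDS = {"1/dy": 86_400, "1/hr": 3_600, "1/mn": 60, "1/sc": 1}


def boxes_needed(sources: int, period: str, capacity: int = CAPACITY_PER_BOX) -> int:
    """Ballpark box count for `sources` emitters, each sending one event per `period`.

    Assumption: boxes = ceil(sources / period_seconds / capacity).
    Integer arithmetic is used so the rounding is exact.
    """
    seconds = PERIOD_SECONDS[period]
    return -(-sources // (seconds * capacity))  # ceiling division


if __name__ == "__main__":
    for sources in (10**6, 10**7, 10**8, 10**9):
        row = {p: boxes_needed(sources, p) for p in PERIOD_SECONDS}
        print(f"{sources:>13,} sources: {row}")
```

Even at one event per day, a billion sources needs more than one box under this assumption, which is the point of the slide: source count, not payload size, drives the scale.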
Data
Sympathy?
5 V's
5 V’s via [V-PEC-T]
• Business Factors
  • ‘Veracity’ - The What
  • ‘Value’ - The Why
• Technical Domain (Policies, Events, Content)
  • Volume, Velocity, Variety
Source: Ashwani Roy, Charles Cai - QCON London 2013 - http://bit.ly/1f2Pdf9
Incremental

The needs of the individual event or query
outweigh the needs of the aggregate events
or queries in flight in the system
Batch

The needs of the system outweigh the needs
of individual events and queries running in
flight or active within the system
“Computing arbitrary functions on an arbitrary
dataset in real time is a daunting problem.”

- Nathan Marz
Lambda Architecture
“Twitter Scale”
5000 msgs/second inbound
<1K “Small data”
“Firehose” outbound - but
that's just a broadcast
problem (easy)
Lambda: http://bit.ly/Hs53Ur
[Diagram: Lambda Architecture overview. Labels: "New Data" sources (Apps, Web, Data, MQ); Batch, Serving and Speed layers; store types (Time Series, Docs, K/V, Rel); batch and realtime Views; consuming Apps]
Lambda: A
All new data is sent to both the batch
layer and the speed layer. In the
batch layer, new data is appended to
the master dataset. In the speed
layer, the new data is consumed to
do incremental updates of the
realtime views.
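The fan-out described here can be sketched as follows. This is a minimal, single-process illustration, not the architecture's reference implementation: the event shape (key, amount), the class names and the `ingest` function are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Event = Tuple[str, int]  # hypothetical event shape: (key, amount)


@dataclass
class BatchLayer:
    master: List[Event] = field(default_factory=list)

    def append(self, event: Event) -> None:
        # New data is only ever appended to the master dataset.
        self.master.append(event)


@dataclass
class SpeedLayer:
    realtime_view: Dict[str, int] = field(default_factory=dict)

    def update(self, event: Event) -> None:
        key, amount = event
        # Incremental update: only the affected key is touched.
        self.realtime_view[key] = self.realtime_view.get(key, 0) + amount


def ingest(event: Event, batch: BatchLayer, speed: SpeedLayer) -> None:
    """Send every new event to both the batch layer and the speed layer."""
    batch.append(event)
    speed.update(event)


batch, speed = BatchLayer(), SpeedLayer()
for ev in [("clicks", 1), ("clicks", 2), ("views", 5)]:
    ingest(ev, batch, speed)
```

In a real deployment the two writes would typically go through a message queue rather than direct calls; the direct calls here only keep the sketch self-contained.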
Lambda: B
The master dataset is an immutable,
append-only set of data. The master
dataset only contains the rawest
information that is not derived from
any other information you have.
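One way to picture an immutable, append-only master dataset is sketched below. It is an illustration under assumptions: the `Fact` fields and the `MasterDataset` class are hypothetical names, and an in-memory list stands in for whatever durable store a real system would use.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass(frozen=True)
class Fact:
    """A raw, underived observation. frozen=True makes instances immutable."""
    timestamp: int
    source: str
    value: float


class MasterDataset:
    """Append-only collection: facts can be added and read, never changed or removed."""

    def __init__(self) -> None:
        self._facts: List[Fact] = []

    def append(self, fact: Fact) -> None:
        self._facts.append(fact)

    def __iter__(self) -> Iterator[Fact]:
        # Iterate over a snapshot so readers never observe a list being mutated.
        return iter(tuple(self._facts))

    def __len__(self) -> int:
        return len(self._facts)


dataset = MasterDataset()
dataset.append(Fact(timestamp=1, source="sensor-a", value=20.5))
dataset.append(Fact(timestamp=2, source="sensor-a", value=21.0))
```

Because the class exposes no update or delete method, a correction is recorded as a new fact rather than as an edit of an old one.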
Lambda: Master data set
• From B: “rawest … not derived”
• In many environments it may be preferable to
  normalise data for later ease of retrieval (e.g.
  Dremel, strongly typed nested records) to support
  scalable ad hoc query.
• Derivation allows other forms of efficient retrieval, e.g.
  using SAX - Symbolic Aggregate Approximation,
  PAA - Piecewise Aggregate Approximation, etc.
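As an illustration of the kind of derivation mentioned in the last bullet, here is a minimal sketch of Piecewise Aggregate Approximation: the series is cut into equal-width segments and each segment is replaced by its mean. The function name and the restriction to series lengths divisible by the segment count are simplifications made for this example.

```python
from typing import List, Sequence


def paa(series: Sequence[float], segments: int) -> List[float]:
    """Reduce `series` to `segments` values, each the mean of one equal-width slice.

    Simplification for this sketch: len(series) must be divisible by `segments`.
    """
    n = len(series)
    if segments <= 0 or n == 0 or n % segments != 0:
        raise ValueError("this sketch needs a non-empty series divisible by segments")
    width = n // segments
    return [sum(series[i:i + width]) / width for i in range(0, n, width)]


readings = [1.0, 3.0, 2.0, 4.0, 10.0, 12.0, 7.0, 9.0]
reduced = paa(readings, 4)  # eight points become four segment means
```

The reduced series keeps the overall shape of the original in a fraction of the space, which is what makes it attractive as a derived form for retrieval.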
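Lambda: http://bit.ly/Hs53Ur
[Diagram: Lambda Architecture overview with a "?" placed among the store types (Time Series, Docs, K/V, Rel). Other labels: "New Data" sources (Apps, Web, Data, MQ); Batch, Serving and Speed layers; batch and realtime Views; consuming Apps]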
SAX & PAA

Piecewise Aggregate
Approximation

Symbolic Aggregate
Approximation

1sc -> 1mn -> 1hr -> 1dy -> 1wk -> 1mh -> 1yr
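A minimal sketch of Symbolic Aggregate Approximation follows, assuming the common formulation: z-normalise the series, reduce it with PAA, then map each segment mean to a letter using breakpoints that split the standard normal distribution into equiprobable regions. The function name, the default four-letter alphabet and the use of the population standard deviation are choices made for this example.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev
from typing import Sequence


def sax(series: Sequence[float], segments: int, alphabet: str = "abcd") -> str:
    """Turn a numeric series into a short word over `alphabet`.

    Simplification for this sketch: len(series) must be divisible by `segments`.
    """
    if segments <= 0 or len(series) % segments != 0:
        raise ValueError("this sketch needs len(series) divisible by segments")
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        raise ValueError("a constant series cannot be z-normalised")
    z = [(x - mu) / sigma for x in series]

    # PAA step: mean of each equal-width slice of the normalised series.
    width = len(z) // segments
    means = [sum(z[i:i + width]) / width for i in range(0, len(z), width)]

    # Breakpoints cut the standard normal into len(alphabet) equiprobable regions.
    k = len(alphabet)
    breakpoints = [NormalDist().inv_cdf(i / k) for i in range(1, k)]
    return "".join(alphabet[bisect_right(breakpoints, m)] for m in means)


word = sax([1.0, 3.0, 2.0, 4.0, 10.0, 12.0, 7.0, 9.0], segments=4)
```

Applying the same reduction at successively coarser widths is one way to read the roll-up chain on this slide, from seconds up to years.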
Lambda: C
The batch layer precomputes query
functions from scratch. The results of the
batch layer are called batch views. The
batch layer runs in a while(true) loop and
continuously recomputes the batch views
from scratch. The strength of the batch
layer is its ability to compute arbitrary
functions on arbitrary data. This gives it
the power to support any application.
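The recompute-from-scratch loop can be sketched like this. It is a minimal sketch under assumptions: events are (key, amount) pairs, the query function is a per-key sum, and a bounded `iterations` argument stands in for the unbounded while(true) loop so that the example terminates.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, Tuple

Event = Tuple[str, int]  # hypothetical event shape: (key, amount)


def recompute_batch_view(master: Iterable[Event]) -> Dict[str, int]:
    """Build the batch view from the whole master dataset, ignoring any previous view."""
    view: Counter = Counter()
    for key, amount in master:
        view[key] += amount
    return dict(view)


def batch_loop(
    read_master: Callable[[], Iterable[Event]],
    publish: Callable[[Dict[str, int]], None],
    iterations: int,
) -> None:
    """Repeatedly recompute and publish the batch view.

    `iterations` replaces while(true) purely to keep the sketch finite.
    """
    for _ in range(iterations):
        publish(recompute_batch_view(read_master()))


master = [("clicks", 1), ("clicks", 2), ("views", 5)]
published = []
batch_loop(lambda: master, published.append, iterations=2)
```

Because each pass starts from the raw data, a bug in the query function can be fixed and the next pass produces corrected views without any migration step.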
Lambda: D
The serving layer indexes the batch views
produced by the batch layer and makes it
possible to get particular values out of a
batch view very quickly. The serving layer
is a scalable database that swaps in new
batch views as they’re made available.
Because of the latency of the batch layer,
the results available from the serving layer
are always out of date by a few hours.
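The swap-in behaviour of the serving layer can be sketched as below. This is an in-memory illustration, not a scalable database: the `ServingLayer` class and its methods are hypothetical names, and a dictionary stands in for the indexed batch view.

```python
import threading
from typing import Dict, Mapping


class ServingLayer:
    """Holds the current batch view and replaces it wholesale when a new one arrives."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._view: Dict[str, int] = {}

    def swap_in(self, batch_view: Mapping[str, int]) -> None:
        # Copy outside the lock, then replace the reference in one step,
        # so readers see either the old view or the new one, never a mix.
        snapshot = dict(batch_view)
        with self._lock:
            self._view = snapshot

    def get(self, key: str, default: int = 0) -> int:
        with self._lock:
            view = self._view
        return view.get(key, default)


serving = ServingLayer()
serving.swap_in({"clicks": 3})
serving.swap_in({"clicks": 10, "views": 5})
```

The view is read-only between swaps, which is why the serving layer can stay simple: it needs fast random reads but no random writes.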
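Lambda: http://bit.ly/Hs53Ur
[Diagram: Lambda Architecture overview with a "?" marker. Labels: "New Data" sources (Apps, Web, Data, MQ); Batch, Serving and Speed layers; store types (Time Series, Docs, K/V, Rel); batch and realtime Views; consuming Apps]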
Think ‘Statistical Compression’
Lambda: E
The speed layer compensates for the high latency of updates
to the serving layer. It uses fast incremental algorithms and
read/write databases to produce realtime views that are
always up to date. The speed layer only deals with recent
data, because any data older than that has been absorbed
into the batch layer and accounted for in the serving layer.
The speed layer is significantly more complex than the
batch and serving layers, but that complexity is
compensated by the fact that the realtime views can be
continuously discarded as data makes its way through
the batch and serving layers. So, the potential negative
impact of that complexity is greatly limited.
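A minimal sketch of a speed layer that updates incrementally and discards what the batch layer has absorbed. The class and method names are hypothetical, events are assumed to be (timestamp, key, amount) triples, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Tuple


class SpeedLayer:
    """Realtime view over recent events only."""

    def __init__(self) -> None:
        self._events: Deque[Tuple[int, str, int]] = deque()  # oldest first
        self._view: Dict[str, int] = defaultdict(int)

    def update(self, timestamp: int, key: str, amount: int) -> None:
        # Incremental: adjust one counter instead of recomputing the view.
        self._events.append((timestamp, key, amount))
        self._view[key] += amount

    def discard_through(self, batch_horizon: int) -> None:
        """Drop every event with timestamp <= batch_horizon, i.e. already in a batch view."""
        while self._events and self._events[0][0] <= batch_horizon:
            _, key, amount = self._events.popleft()
            self._view[key] -= amount
            if self._view[key] == 0:
                del self._view[key]

    def view(self) -> Dict[str, int]:
        return dict(self._view)


speed = SpeedLayer()
speed.update(1, "clicks", 1)
speed.update(2, "clicks", 2)
speed.update(3, "views", 5)
before = speed.view()
speed.discard_through(2)  # the batch layer has caught up to timestamp 2
after = speed.view()
```

The discard step is what bounds the damage from any error in the incremental logic: whatever the speed layer got wrong is thrown away once the batch view covers the same period.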
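Lambda: http://bit.ly/Hs53Ur
[Diagram: Lambda Architecture overview with a "?" placed beside the Speed layer. Other labels: "New Data" sources (Apps, Web, Data, MQ); Batch and Serving layers; store types (Time Series, Docs, K/V, Rel); batch and realtime Views; consuming Apps]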
Use a DSP + CEP/ESP or ‘Scalable CEP’
• Storm/S4 + Esper/…
• Embed a CEP/ESP within a Distributed
  Stream Processing engine
• Use Drill for large scale ad hoc query
  [leverage nested records]
• Already have middleware? Have well
  defined queries? Roll your own minimal
  EEP (or use mine!)
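To give a feel for how small a hand-rolled event processor can be, here is a sketch of a count-based tumbling window that emits one aggregate per full window. It is not the speaker's library and does not mirror any particular CEP engine's API; the class name, constructor arguments and `enqueue` method are invented for this example.

```python
from typing import Callable, List, Sequence


class TumblingCountWindow:
    """Collects `size` events, emits one aggregate, then starts an empty window."""

    def __init__(
        self,
        size: int,
        aggregate: Callable[[Sequence[float]], float],
        on_emit: Callable[[float], None],
    ) -> None:
        if size <= 0:
            raise ValueError("size must be positive")
        self._size = size
        self._aggregate = aggregate
        self._on_emit = on_emit
        self._buffer: List[float] = []

    def enqueue(self, value: float) -> None:
        self._buffer.append(value)
        if len(self._buffer) == self._size:
            # Window is full: emit the aggregate and tumble to a new window.
            self._on_emit(self._aggregate(self._buffer))
            self._buffer = []


results: List[float] = []
window = TumblingCountWindow(size=3, aggregate=sum, on_emit=results.append)
for value in range(1, 8):
    window.enqueue(value)
```

After seven events the window has emitted two sums and holds one pending event. Sliding and time-based windows follow the same pattern with a different eviction rule.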
Lambda: F
Queries are resolved by getting results from both
the batch and realtime views and merging them
together.
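For an additive query such as a count, the merge can be as small as the sketch below. It assumes that both views are dictionaries keyed the same way and that the realtime view covers only events not yet reflected in the batch view; the function name is illustrative.

```python
from typing import Mapping


def query(key: str, batch_view: Mapping[str, int], realtime_view: Mapping[str, int]) -> int:
    """Resolve a count by combining the batch result with the realtime remainder.

    Assumption: the two views cover disjoint sets of events, so their
    counts can simply be added.
    """
    return batch_view.get(key, 0) + realtime_view.get(key, 0)


batch_view = {"clicks": 1_000, "views": 5_000}  # hours old
realtime_view = {"clicks": 3, "signups": 1}     # since the last batch run
total_clicks = query("clicks", batch_view, realtime_view)
```

Queries that are not additive, such as distinct counts or medians, need a merge function of their own, and that is where much of the difficulty lies.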
Millwheel: http://bit.ly/1gWqNIC
Google’s “Zeitgeist pipeline”
[Diagram labels: Web Query, Window Counter, Model, Out of Trend?, Alerts, Monitor, Stats, Queries]
Lambda: Batch View
• Precomputed queries are central to Complex
  Event Processing / Event Stream Processing
  architectures.
• Unfortunately, though, most DBMSs still offer
  only synchronous blocking RPC access to
  underlying data when asynchronous guaranteed
  delivery would be preferable for view
  construction leveraging CEP/ESP techniques.
Lambda: Merging …
• Possibly one of the most difficult aspects of near
  real-time and historical data integration is
  combining flows sensibly.
• For example, is the order of interleaving across
  merge sources applied in a known
  deterministically recomputable order? If not, how
  can results be recomputed subsequently? Will
  data converge?
[cf: http://cs.brown.edu/research/aurora/hwang.icde05.ha.pdf]
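One way to get a known, recomputable interleaving is to merge on an explicit total order rather than on arrival order. The sketch below assumes each source yields (timestamp, source_id, payload) tuples already sorted by timestamp and that source ids are unique per source; the function name and tuple layout are assumptions for the example, not something taken from the cited paper.

```python
import heapq
from typing import Iterable, Iterator, Tuple

Record = Tuple[int, str, str]  # hypothetical shape: (timestamp, source_id, payload)


def deterministic_merge(*sources: Iterable[Record]) -> Iterator[Record]:
    """Interleave sorted sources by (timestamp, source_id).

    Ties on timestamp are broken by source_id, so the output order does not
    depend on arrival timing or on the order the sources are passed in.
    """
    return heapq.merge(*sources, key=lambda record: (record[0], record[1]))


historical = [(1, "batch", "a"), (3, "batch", "c")]
realtime = [(1, "speed", "b"), (2, "speed", "x")]

merged = list(deterministic_merge(historical, realtime))
merged_swapped = list(deterministic_merge(realtime, historical))
```

Both calls yield the same sequence, so a later recomputation over the same inputs reproduces the original result. Late or out-of-order arrivals within a source would still need separate handling.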
Lambda: A start …
[Diagram: Lambda Architecture overview. Labels: "New Data" sources (Apps, Web, Data, MQ); Batch, Serving and Speed layers; store types (Time Series, Docs, K/V, Rel); batch and realtime Views; consuming Apps]
mob DATA
Not a Jedi
… yet …
Thanks.
Questions?
@darachennis
