Building Scalable Aggregation Systems
Boulder/Denver Big Data Meetup
April 15, 2015
Jared Winick
jaredwinick @ koverse.com
@jaredwinick
Outline
• The Value of Aggregations
• Abstractions
• Systems
• Demo
• References/Additional Information
Aggregation provides a means of
turning billions of pieces of raw data
into condensed, human-consumable
information.
Aggregation of Aggregations
Time Series
Set Size/Cardinality
Top-K
Quantiles
Density/Heatmap
[Figure: example aggregation panels labeled Time Series, Top-K, Sum, Cardinality, and Quantiles, including a "16.3k Unique Users" counter. Graphics: G1, G2]
Abstractions
1 + 2 + 3 + 4 = 10
Concept from (P1)
(1 + 2) + (3 + 4) = 3 + 7 = 10
We can parallelize integer addition
Associative + Commutative
Operations
• Associative: 1 + (2 + 3) = (1 + 2) + 3
• Commutative: 1 + 2 = 2 + 1
• Allows us to parallelize our reduce (for
instance locally in combiners)
• Applies to many operations, not just
integer addition.
• Spoiler: Key to incremental aggregations
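The grouping idea above can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the helper names `chunked` and `grouped_sum` and the chunk size are assumptions made for the example. Each chunk stands in for the work a single mapper or combiner could do on its own.

```python
from functools import reduce
from operator import add


def chunked(values, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [values[i:i + size] for i in range(0, len(values), size)]


def grouped_sum(values, chunk_size=2):
    """Reduce each chunk independently, then reduce the partial results."""
    # Each chunk could be reduced on a different worker (or in a combiner).
    partials = [reduce(add, chunk, 0) for chunk in chunked(values, chunk_size)]
    # Associativity guarantees the regrouped total equals the sequential one.
    return reduce(add, partials, 0)


numbers = [1, 2, 3, 4]
sequential = reduce(add, numbers, 0)   # ((1 + 2) + 3) + 4
grouped = grouped_sum(numbers)         # (1 + 2) + (3 + 4)

# Incremental update: fold a later batch into the existing total
# without revisiting the original inputs.
updated = add(grouped, reduce(add, [5, 6], 0))
```

The last line hints at why these properties matter for incremental aggregation: a stored result can absorb new data with one more `add`.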
({a, b} + {b, c}) + ({a, c} + {a}) = {a, b, c} + {a, c} = {a, b, c}
We can also parallelize the “addition” of other types, like Sets, because Set Union is associative and commutative.
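The same regrouping can be checked for sets. The sketch below uses the four sets from the slide; the function name `union` is only a label chosen for this example.

```python
from functools import reduce


def union(left, right):
    """The set "addition": union of two sets."""
    return left | right


sets = [{"a", "b"}, {"b", "c"}, {"a", "c"}, {"a"}]

# Two workers each combine half of the input.
left_partial = union(sets[0], sets[1])
right_partial = union(sets[2], sets[3])

# A final step combines the partial results.
combined = union(left_partial, right_partial)

# Reducing everything in order gives the same answer.
sequential = reduce(union, sets, set())
```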
Monoid Interface
• Abstract Algebra provides a formal foundation for
what we can casually observe.
• Don’t be thrown off by the name, just think of it as
another trait/interface.
• Monoids provide a critical abstraction to treat
aggregations of different types in the same way
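As a rough sketch of what such an interface might look like, the Python below defines a monoid as a type with an identity value and an associative combine operation. The method names `zero()` and `plus()` follow the names used later in these slides; the class names and the `aggregate` helper are assumptions made for illustration and are not Algebird's API.

```python
from abc import ABC, abstractmethod
from functools import reduce
from typing import Generic, Iterable, TypeVar

T = TypeVar("T")


class Monoid(ABC, Generic[T]):
    """A type with an identity value and an associative combine operation."""

    @abstractmethod
    def zero(self) -> T:
        """Identity element: plus(zero(), x) == x."""

    @abstractmethod
    def plus(self, left: T, right: T) -> T:
        """Associative combination of two values."""

    def sum_all(self, values: Iterable[T]) -> T:
        """Fold any number of values, starting from zero()."""
        return reduce(self.plus, values, self.zero())


class IntAddition(Monoid[int]):
    def zero(self) -> int:
        return 0

    def plus(self, left: int, right: int) -> int:
        return left + right


class SetUnion(Monoid[frozenset]):
    def zero(self) -> frozenset:
        return frozenset()

    def plus(self, left: frozenset, right: frozenset) -> frozenset:
        return left | right


def aggregate(monoid, values):
    """Generic aggregation code that works for any Monoid instance."""
    return monoid.sum_all(values)


total = aggregate(IntAddition(), [1, 2, 3, 4])
members = aggregate(SetUnion(), [frozenset("ab"), frozenset("bc"), frozenset("a")])
```

The point of the abstraction is the `aggregate` function: it never needs to know whether it is summing integers or merging sets.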
Monoid Examples
Many Monoid Implementations
Already Exist
• https://github.com/twitter/algebird/
• Long, String, Set, Seq, Map, etc…
• HyperLogLog – Cardinality Estimates
• QTree – Quantile Estimates
• SpaceSaver/HeavyHitters – Approx Top-K
• Also easy to add your own with libraries
like stream-lib [C3]
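To give a feel for what adding your own aggregation can involve, here is a small hand-written example for averages. It does not use Algebird or stream-lib; the `Averaged` type and the `prepare` helper are hypothetical names for this sketch. Averaging two means directly is not associative in general, so the sketch carries a (count, total) pair and derives the mean only when it is read.

```python
from functools import reduce
from typing import NamedTuple


class Averaged(NamedTuple):
    """Running average carried as a (count, total) pair."""
    count: int
    total: float

    @property
    def mean(self) -> float:
        return self.total / self.count if self.count else 0.0


def zero() -> Averaged:
    return Averaged(0, 0.0)


def plus(left: Averaged, right: Averaged) -> Averaged:
    # Component-wise addition is associative and commutative.
    return Averaged(left.count + right.count, left.total + right.total)


def prepare(value: float) -> Averaged:
    """Lift a single raw value into the aggregatable type."""
    return Averaged(1, float(value))


all_at_once = reduce(plus, map(prepare, [1, 2, 3, 4]), zero())

# The same inputs split across two workers, then merged.
first_half = reduce(plus, map(prepare, [1, 2]), zero())
second_half = reduce(plus, map(prepare, [3, 4]), zero())
merged = plus(first_half, second_half)
```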
Serialization
• One additional trait we need our
“aggregatable” types to have is that we
can serialize/deserialize them.
(1 + 2) + (3 + 4) = 3 + 7 = 10
1) zero()
2) plus()
3) plus()
4) serialize()
5) zero()
6) deserialize()
7) plus()
8) deserialize()
9) plus()
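The numbered calls on this slide can be traced with a short sketch. JSON is used as the wire format purely as an assumption for the example, and the function names `partial_aggregate` and `final_aggregate` are hypothetical; any serialization that round-trips the value would work the same way.

```python
import json


def zero():
    return 0


def plus(left, right):
    return left + right


def serialize(value) -> bytes:
    """Encode a partial aggregate so it can cross a process boundary."""
    return json.dumps(value).encode("utf-8")


def deserialize(payload: bytes):
    return json.loads(payload.decode("utf-8"))


def partial_aggregate(values) -> bytes:
    """Worker side: zero(), one plus() per input, then serialize()."""
    accumulator = zero()
    for value in values:
        accumulator = plus(accumulator, value)
    return serialize(accumulator)


def final_aggregate(payloads):
    """Reducer side: zero(), then deserialize() and plus() per partial."""
    accumulator = zero()
    for payload in payloads:
        accumulator = plus(accumulator, deserialize(payload))
    return accumulator


payloads = [partial_aggregate([1, 2]), partial_aggregate([3, 4])]
total = final_aggregate(payloads)
```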
These abstractions enable a
small library of reusable code to
aggregate data in many parts of
your system.
Systems
Requirements and Tradeoffs
Query Latency: milliseconds | seconds | minutes
Toward milliseconds:
• Results are pre-computed
• Requires compute and storage resources
• Supported queries must be known in advance
Toward minutes:
• Results are computed at query time
• No resources used except for executed queries
• Ad-hoc queries
Requirements and Tradeoffs
Number of Users: large | many | few
Toward large:
• Resources required per query must be small
• Requires scalable query handling/storage
Toward few:
• Queries can be expensive
Requirements and Tradeoffs
Freshness of Results: seconds | minutes | hours
Toward seconds:
• May require streaming platform in addition to batch
• Smaller, more frequent updates mean more work
Toward hours:
• Single batch platform
• Less frequent computation
Requirements and Tradeoffs
Amount of Data: billions | millions | thousands
Toward billions:
• Requires parallelized computation and storage
Toward thousands:
• Single server is sufficient
SQL on Hadoop
• Impala, Hive, SparkSQL
Query Latency: milliseconds | seconds | minutes
# of Users: large | many | few
Freshness: seconds | minutes | hours
Data Size: billions | millions | thousands
Batch Jobs
• Spark, Hadoop MapReduce
Query Latency: milliseconds | seconds | minutes
# of Users: large | many | few
Freshness: seconds | minutes | hours
Data Size: billions | millions | thousands
Dependent on where you put the job’s output
Online Incremental Systems
• Twitter’s Summingbird [PA1, C4], Google’s Mesa [PA2],
Koverse’s Aggregation Framework
Query Latency: milliseconds | seconds | minutes
# of Users: large | many | few
Freshness: seconds | minutes | hours
Data Size: billions | millions | thousands
[Scale markers: S = Summingbird, M = Mesa, K = Koverse]
Online Incremental Systems:
Common Components
• Aggregations are computed/reduced
incrementally via associative operations
• Results are mostly pre-computed, so
queries are inexpensive
• Aggregations, keyed by dimensions, are
stored in low latency, scalable key-value
store
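A toy version of these three components might look like the following. It is a sketch under stated assumptions: the in-memory `AggregateStore` stands in for a real key-value store, the record fields `page` and `hour` are made-up dimensions, and none of this mirrors the actual APIs of Summingbird, Mesa, or Koverse.

```python
class AggregateStore:
    """In-memory stand-in for a low-latency key-value store (hypothetical)."""

    def __init__(self, zero, plus):
        self._zero = zero
        self._plus = plus
        self._data = {}

    def merge(self, key, partial):
        # Fold a new partial aggregate into whatever is already stored.
        current = self._data.get(key, self._zero())
        self._data[key] = self._plus(current, partial)

    def get(self, key):
        # Queries are a single lookup because results are pre-computed.
        return self._data.get(key, self._zero())


def aggregate_batch(records):
    """Reduce one batch of raw records to per-key partial counts."""
    partials = {}
    for record in records:
        key = (record["page"], record["hour"])  # the aggregation dimensions
        partials[key] = partials.get(key, 0) + 1
    return partials


def ingest(store, records):
    for key, partial in aggregate_batch(records).items():
        store.merge(key, partial)


store = AggregateStore(zero=lambda: 0, plus=lambda a, b: a + b)

batch_1 = [
    {"page": "/home", "hour": 9},
    {"page": "/home", "hour": 9},
    {"page": "/docs", "hour": 9},
]
batch_2 = [
    {"page": "/home", "hour": 9},
    {"page": "/docs", "hour": 10},
]

ingest(store, batch_1)
ingest(store, batch_2)  # later data updates stored results incrementally
home_at_9 = store.get(("/home", 9))
```

Because the combine step is associative, a batch can be reduced on its own first and then merged into the stored value without rereading earlier records.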
Summingbird
[Architecture diagram. Components: Summingbird Program, Data, HDFS, Queues, Hadoop Job, Storm Topology, Batch KV store, Online KV store, Client Library, Client. Reduce steps are marked at four points.]
Mesa
[Architecture diagram. Components: Data (batches), Colossus, Compaction Worker, Query Server, Client. Versioned deltas: Singletons (61, 62, …, 91, 92), Cumulatives (61-70, 61-80, 61-90), Base (0-60). Reduce steps are marked at two points.]
Koverse
[Architecture diagram. Components: Data, Koverse Server, Hadoop Job, Apache Accumulo (Records, Aggregates), Min/Maj Compaction Iterator, Scan Iterator, Client. Reduce steps are marked at four points.]
Demo
References
Presentations
P1. Algebra for Analytics - https://speakerdeck.com/johnynek/algebra-for-analytics
Code
C1. Algebird - https://github.com/twitter/algebird
C2. Simmer - https://github.com/avibryant/simmer
C3. stream-lib - https://github.com/addthis/stream-lib
C4. Summingbird - https://github.com/twitter/summingbird
Papers
PA1. Summingbird: A Framework for Integrating Batch and Online MapReduce Computations - http://www.vldb.org/pvldb/vol7/p1441-boykin.pdf
PA2. Mesa: Geo-Replicated, Near Real-Time, Scalable Data Warehousing - http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/42851.pdf
PA3. Monoidify! Monoids as a Design Principle for Efficient MapReduce Algorithms - http://arxiv.org/abs/1304.7544
Video
V1. Intro To Summingbird - https://engineering.twitter.com/university/videos/introduction-to-summingbird
Graphics
G1. Histogram Graphic - http://www.statmethods.net/graphs/density.html
G2. Heatmap Graphic - https://www.mapbox.com/blog/twitter-map-every-tweet/
G3. The Matrix Background - http://wall.alphacoders.com/by_sub_category.php?id=198802
