Aggregators:
modeling data queries functionally
Oscar Boykin, Twitter
@posco
Or:
Aggregators:
composable aggregation for scalding,
spark, summingbird, and plain scala
@Twitter
How to compute size of a list in Map/Reduce?
2 3 5 7 11 13 17
map(x => 1)
1 1 1 1 1 1 1
reduce {(x, y) => x+y}
[Figure: tree reduction of the ones; partial sums 2, 2, 2, then 3 and 4, total 7]
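The two steps on this slide can be tried on a plain Scala list. This is only a minimal in-memory sketch; the object name `SizeByMapReduce` is made up for illustration.

```scala
object SizeByMapReduce {
  val xs = List(2, 3, 5, 7, 11, 13, 17)

  // map: replace every element with 1
  val ones = xs.map(x => 1)

  // reduce: add the ones pairwise until a single number is left
  val size = ones.reduce { (x, y) => x + y }
}
```

Because addition can be regrouped freely, the reduce step can be split across machines and the partial sums combined afterwards.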
Associative functions:
f(a,f(b,c)) == f(f(a,b),c)
also called “semigroups”
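A quick way to get a feel for the law is to check it on a sample triple. The helper `isAssociativeOn` below is hypothetical and only tests one triple, so it can refute associativity but never prove it.

```scala
object AssociativityCheck {
  // Hypothetical helper: checks f(a, f(b, c)) == f(f(a, b), c) for one triple.
  def isAssociativeOn[T](f: (T, T) => T)(a: T, b: T, c: T): Boolean =
    f(a, f(b, c)) == f(f(a, b), c)

  // Addition and max can be regrouped freely.
  val plusOk = isAssociativeOn[Int](_ + _)(2, 3, 5)
  val maxOk = isAssociativeOn[Int](_ max _)(2, 3, 5)

  // Subtraction cannot: 2 - (3 - 5) is 4, but (2 - 3) - 5 is -6.
  val minusOk = isAssociativeOn[Int](_ - _)(2, 3, 5)
}
```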
we want
map+semigroup in one
abstraction!
Getting the average
2 3 5 7 11 13 17
map(x => (1,x))
(1,2) (1,3) (1,5) (1,7) (1,11) (1,13) (1,17)
reduce(Semigroup.plus)
[Figure: tree reduction of the pairs; partial sums (2,5) (2,12) (2,24), then (4,17) and (3,41), total (7,58)]
map { case (c, s) => s/c.toDouble }
(7,58) => 8.285...
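The same three steps can be written against a plain Scala list. In this sketch the pairwise tuple addition is spelled out by hand rather than taken from `Semigroup.plus`, and the object name is invented for the example.

```scala
object AverageByMapReduceMap {
  val xs = List(2, 3, 5, 7, 11, 13, 17)

  // map: pair every value with a count of 1
  val pairs = xs.map(x => (1, x))

  // reduce: add counts and sums component-wise
  val (count, sum) = pairs.reduce { (a, b) => (a._1 + b._1, a._2 + b._2) }

  // map: turn the final (count, sum) pair into the average
  val average = sum / count.toDouble
}
```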
We really want
map+semigroup+map
in one abstraction!
trait Aggregator[In, Middle, Out] {
  def prepare(i: In): Middle
  def semigroup: Semigroup[Middle]
  def present(m: Middle): Out
}
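As a rough illustration, the average can be packaged into this shape. The `Semigroup` and `Aggregator` traits below are simplified stand-ins written for this sketch, not the Algebird definitions, and the `apply` method is a hypothetical convenience for running all three stages on a non-empty collection.

```scala
object AggregatorSketch {
  // Simplified stand-in for a semigroup: one associative plus.
  trait Semigroup[T] { def plus(a: T, b: T): T }

  // Same three members as the trait on the slide.
  trait Aggregator[In, Middle, Out] {
    def prepare(i: In): Middle
    def semigroup: Semigroup[Middle]
    def present(m: Middle): Out

    // Hypothetical convenience: map, reduce, map over a non-empty collection.
    def apply(items: Iterable[In]): Out =
      present(items.iterator.map(i => prepare(i)).reduce((a, b) => semigroup.plus(a, b)))
  }

  val average: Aggregator[Int, (Int, Int), Double] =
    new Aggregator[Int, (Int, Int), Double] {
      def prepare(i: Int): (Int, Int) = (1, i)
      val semigroup: Semigroup[(Int, Int)] = new Semigroup[(Int, Int)] {
        def plus(a: (Int, Int), b: (Int, Int)): (Int, Int) = (a._1 + b._1, a._2 + b._2)
      }
      def present(m: (Int, Int)): Double = m._2 / m._1.toDouble
    }

  val result = average(List(2, 3, 5, 7, 11, 13, 17))
}
```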
https://github.com/twitter/algebird
How do we use this?
Not such a new idea. Scalding had a
mapReduceMap function in the first
release:
But why should we be excited?
map (prepare)
reduce (semigroup)
map (present)
“Does not compose”
is the new
“is a piece of crap”
paraphrasing Dan Rosen @mergeconflict
Aggregators Compose
Aggregator != 💩
composePrepare
Function + Aggregator = Aggregator
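One way composePrepare could be written is sketched below. The traits are simplified stand-ins defined only for this example, so the signature may differ from the one in Algebird; `intSum` and `totalLength` are invented names.

```scala
object ComposePrepareSketch {
  trait Semigroup[T] { def plus(a: T, b: T): T }

  trait Aggregator[In, Middle, Out] { self =>
    def prepare(i: In): Middle
    def semigroup: Semigroup[Middle]
    def present(m: Middle): Out
    def apply(items: Iterable[In]): Out =
      present(items.iterator.map(i => prepare(i)).reduce((a, b) => semigroup.plus(a, b)))

    // Function + Aggregator = Aggregator: run fn before prepare.
    def composePrepare[In2](fn: In2 => In): Aggregator[In2, Middle, Out] =
      new Aggregator[In2, Middle, Out] {
        def prepare(i: In2): Middle = self.prepare(fn(i))
        def semigroup: Semigroup[Middle] = self.semigroup
        def present(m: Middle): Out = self.present(m)
      }
  }

  val intSum: Aggregator[Int, Int, Int] = new Aggregator[Int, Int, Int] {
    def prepare(i: Int): Int = i
    val semigroup: Semigroup[Int] = new Semigroup[Int] {
      def plus(a: Int, b: Int): Int = a + b
    }
    def present(m: Int): Int = m
  }

  // Total length of a list of words, built by reusing the Int sum.
  val totalLength: Aggregator[String, Int, Int] = intSum.composePrepare[String](_.length)
  val result = totalLength(List("map", "reduce", "map"))
}
```

Only the prepare step changes; the semigroup and present step are reused untouched.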
andThenPresent
Aggregator + Function = Aggregator
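andThenPresent is the mirror image: a function applied after present. The sketch below again uses simplified stand-in traits rather than the Algebird ones, and `countAndSum` is an invented example aggregator.

```scala
object AndThenPresentSketch {
  trait Semigroup[T] { def plus(a: T, b: T): T }

  trait Aggregator[In, Middle, Out] { self =>
    def prepare(i: In): Middle
    def semigroup: Semigroup[Middle]
    def present(m: Middle): Out
    def apply(items: Iterable[In]): Out =
      present(items.iterator.map(i => prepare(i)).reduce((a, b) => semigroup.plus(a, b)))

    // Aggregator + Function = Aggregator: run fn after present.
    def andThenPresent[Out2](fn: Out => Out2): Aggregator[In, Middle, Out2] =
      new Aggregator[In, Middle, Out2] {
        def prepare(i: In): Middle = self.prepare(i)
        def semigroup: Semigroup[Middle] = self.semigroup
        def present(m: Middle): Out2 = fn(self.present(m))
      }
  }

  // Presents the raw (count, sum) pair.
  val countAndSum: Aggregator[Int, (Int, Int), (Int, Int)] =
    new Aggregator[Int, (Int, Int), (Int, Int)] {
      def prepare(i: Int): (Int, Int) = (1, i)
      val semigroup: Semigroup[(Int, Int)] = new Semigroup[(Int, Int)] {
        def plus(a: (Int, Int), b: (Int, Int)): (Int, Int) = (a._1 + b._1, a._2 + b._2)
      }
      def present(m: (Int, Int)): (Int, Int) = m
    }

  // The average is the same aggregator with one more function on the end.
  val average: Aggregator[Int, (Int, Int), Double] =
    countAndSum.andThenPresent { case (c, s) => s / c.toDouble }

  val result = average(List(2, 3, 5, 7, 11, 13, 17))
}
```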
Aggregator 1, Aggregator 2 => Joined Aggregator
Aggregator * Aggregator = Aggregator
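A joined aggregator can be sketched by pairing up each of the three stages. As before, the traits are simplified stand-ins for illustration, and `fromReduce`, `minAgg` and `maxAgg` are hypothetical names; the real library signature may be more general.

```scala
object JoinSketch {
  trait Semigroup[T] { def plus(a: T, b: T): T }

  trait Aggregator[In, Middle, Out] { self =>
    def prepare(i: In): Middle
    def semigroup: Semigroup[Middle]
    def present(m: Middle): Out
    def apply(items: Iterable[In]): Out =
      present(items.iterator.map(i => prepare(i)).reduce((a, b) => semigroup.plus(a, b)))

    // Aggregator * Aggregator = Aggregator: run both side by side in one pass.
    def join[M2, O2](that: Aggregator[In, M2, O2]): Aggregator[In, (Middle, M2), (Out, O2)] =
      new Aggregator[In, (Middle, M2), (Out, O2)] {
        def prepare(i: In): (Middle, M2) = (self.prepare(i), that.prepare(i))
        val semigroup: Semigroup[(Middle, M2)] = new Semigroup[(Middle, M2)] {
          def plus(a: (Middle, M2), b: (Middle, M2)): (Middle, M2) =
            (self.semigroup.plus(a._1, b._1), that.semigroup.plus(a._2, b._2))
        }
        def present(m: (Middle, M2)): (Out, O2) = (self.present(m._1), that.present(m._2))
      }
  }

  // Hypothetical helper: an aggregator whose three stages all work on T.
  def fromReduce[T](f: (T, T) => T): Aggregator[T, T, T] = new Aggregator[T, T, T] {
    def prepare(i: T): T = i
    val semigroup: Semigroup[T] = new Semigroup[T] { def plus(a: T, b: T): T = f(a, b) }
    def present(m: T): T = m
  }

  val minAgg = fromReduce[Int](_ min _)
  val maxAgg = fromReduce[Int](_ max _)

  // Minimum and maximum computed together over a single traversal.
  val minAndMax = minAgg.join(maxAgg)
  val result = minAndMax(List(2, 3, 5, 7, 11, 13, 17))
}
```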
Aggregators are Applicative Functors
Functor: has a map method
def map(t: A[T])(fn: T => U): A[U]
Applicative: has a join method:
def join(t: A[T], u: A[U]): A[(T, U)]
Monad: has a flatMap method:
def flatMap(t: A[T])(fn: T => A[U]): A[U]
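To see what map and join buy you together, here is a sketch that builds the average out of two smaller aggregators. It assumes map plays the role of andThenPresent, and it uses simplified stand-in traits and an invented helper `addingInts`, not the Algebird API.

```scala
object ApplicativeSketch {
  trait Semigroup[T] { def plus(a: T, b: T): T }

  trait Aggregator[In, Middle, Out] { self =>
    def prepare(i: In): Middle
    def semigroup: Semigroup[Middle]
    def present(m: Middle): Out
    def apply(items: Iterable[In]): Out =
      present(items.iterator.map(i => prepare(i)).reduce((a, b) => semigroup.plus(a, b)))

    // Functor: transform the output.
    def map[O2](fn: Out => O2): Aggregator[In, Middle, O2] =
      new Aggregator[In, Middle, O2] {
        def prepare(i: In): Middle = self.prepare(i)
        def semigroup: Semigroup[Middle] = self.semigroup
        def present(m: Middle): O2 = fn(self.present(m))
      }

    // Applicative: pair two aggregators over the same input.
    def join[M2, O2](that: Aggregator[In, M2, O2]): Aggregator[In, (Middle, M2), (Out, O2)] =
      new Aggregator[In, (Middle, M2), (Out, O2)] {
        def prepare(i: In): (Middle, M2) = (self.prepare(i), that.prepare(i))
        val semigroup: Semigroup[(Middle, M2)] = new Semigroup[(Middle, M2)] {
          def plus(a: (Middle, M2), b: (Middle, M2)): (Middle, M2) =
            (self.semigroup.plus(a._1, b._1), that.semigroup.plus(a._2, b._2))
        }
        def present(m: (Middle, M2)): (Out, O2) = (self.present(m._1), that.present(m._2))
      }
  }

  // Hypothetical helper: prepare with prep, then add Ints.
  def addingInts(prep: Int => Int): Aggregator[Int, Int, Int] = new Aggregator[Int, Int, Int] {
    def prepare(i: Int): Int = prep(i)
    val semigroup: Semigroup[Int] = new Semigroup[Int] { def plus(a: Int, b: Int): Int = a + b }
    def present(m: Int): Int = m
  }

  val sum = addingInts(x => x)
  val count = addingInts(_ => 1)

  // join the two, then map the pair of results into one number
  val average = sum.join(count).map { case (s, c) => s / c.toDouble }
  val result = average(List(2, 3, 5, 7, 11, 13, 17))
}
```

Nothing here needs flatMap: the shape of the computation is fixed before any data is seen, which is what makes a single pass possible.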
Let’s go to the REPL
http://bit.ly/AggregatingWithAlice
https://gist.github.com/johnynek/814fc1e77aad1d295bb7
Aggregators “just work” with scala collections
Aggregators are built in to Scalding
Aggregators are easy to use with Spark
Algebird with spark:
https://github.com/twitter/algebird/pull/397
Key Points
1) Aggregators encapsulate very general query
logic independent of how it is executed (in
memory, scalding, spark, you name it)
2) Aggregators compose so you can define parts
you use, and easily glue them together
3) Algebird has many advanced, well-tested
Aggregators: TopK, HyperLogLog,
CountMinSketch, Mean, Stddev, …
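Point 1 can be illustrated without any cluster: the same aggregator value is handed to two different runners. Everything below is a simplified stand-in written for this sketch; `runInMemory` and `runPartitioned` are hypothetical, with the partitioned runner only imitating how a distributed engine might reduce chunks separately before combining them.

```scala
object ExecutionIndependenceSketch {
  trait Semigroup[T] { def plus(a: T, b: T): T }

  trait Aggregator[In, Middle, Out] {
    def prepare(i: In): Middle
    def semigroup: Semigroup[Middle]
    def present(m: Middle): Out
  }

  // Combine a non-empty sequence of intermediate values with the semigroup.
  def combine[I, M, O](agg: Aggregator[I, M, O], ms: Seq[M]): M =
    ms.reduce((a, b) => agg.semigroup.plus(a, b))

  // Runner 1: one sequential pass over everything.
  def runInMemory[I, M, O](agg: Aggregator[I, M, O], items: Seq[I]): O =
    agg.present(combine(agg, items.map(i => agg.prepare(i))))

  // Runner 2: reduce each chunk on its own, then combine the partial results.
  def runPartitioned[I, M, O](agg: Aggregator[I, M, O], items: Seq[I], chunkSize: Int): O = {
    val partials = items
      .grouped(chunkSize)
      .map(chunk => combine(agg, chunk.map(i => agg.prepare(i))))
      .toList
    agg.present(combine(agg, partials))
  }

  val average: Aggregator[Int, (Int, Int), Double] =
    new Aggregator[Int, (Int, Int), Double] {
      def prepare(i: Int): (Int, Int) = (1, i)
      val semigroup: Semigroup[(Int, Int)] = new Semigroup[(Int, Int)] {
        def plus(a: (Int, Int), b: (Int, Int)): (Int, Int) = (a._1 + b._1, a._2 + b._2)
      }
      def present(m: (Int, Int)): Double = m._2 / m._1.toDouble
    }

  val data = List(2, 3, 5, 7, 11, 13, 17)
  val local = runInMemory(average, data)
  val partitioned = runPartitioned(average, data, 3)
}
```

Both runners agree because the semigroup is associative, so the grouping of the partial results does not affect the total.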
Oscar Boykin @posco / oscar@twitter.com
Algebird has these aggregators and more:
https://github.com/twitter/algebird

Aggregators: Data Day Texas, 2015