MapReduce, The
Aggregation Framework &
 The Hadoop Connector
Bryan Reinero
Engineer, 10gen
Data Warehousing
• We're storing our data in MongoDB




Data Warehousing
• We're storing our data in MongoDB
  • We need reporting, grouping,
    common aggregations, etc.




Data Warehousing
• We're storing our data in MongoDB
  • We need reporting, grouping,
    common aggregations, etc.
  • What are we using for this?




Data Warehousing
• SQL for reporting and analytics
  • Infrastructure complications
    •   Additional maintenance
    •   Data duplication
    •   ETL processes
    •   Real time?




Data Warehousing
MapReduce
MapReduce

    Data Set  →  worker threads call map()  →  workers call reduce()  →  Output
Our Example Data
{
    _id: 375,
    title: "The Great Gatsby",
    ISBN: "9781857150193",
    available: true,
    pages: 218,
    chapters: 9,
    subjects: [
      "Long Island",
      "New York",
      "1920s"
    ],
    language: "English"
}
MapReduce
   db.books.mapReduce(
     map, reduce, {finalize: finalize, out: { inline : 1} } )

 function map() {
     var key = this.language;
     emit( key, { totalPages : this.pages, numBooks : 1 } );
 }
MapReduce
   db.books.mapReduce(
     map, reduce, {finalize: finalize, out: { inline : 1} } )

 function reduce(key, values) {
     var result = { numBooks : 0, totalPages : 0 };
     values.forEach(function (value) {
         result.numBooks += value.numBooks;
         result.totalPages += value.totalPages;
     });
     return result;
 }
MapReduce
   db.books.mapReduce(
     map, reduce, {finalize: finalize, out: { inline : 1} } )

 function finalize( key, value ) {
     if ( value.numBooks != 0 )
         return value.totalPages / value.numBooks;
 }
MapReduce
   db.books.mapReduce(
     map, reduce, {finalize: finalize, out: { inline : 1} } )
 "results" : [
 function finalize( key, value ) {
          {
             "_id" : "English",
 if ( value.numBooks != 0 )
             "value" : 653
     return value.totalPages / value.numBooks;
          },
 }        {
             "_id" : "Russian",
             "value" : 1440
          }
      ]
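The whole run can be sketched in plain JavaScript. This is an illustrative simulation only (the `runMapReduce` helper is hypothetical); MongoDB executes map, reduce, and finalize server-side against the collection. The three sample books match the ones used later in this deck.

```javascript
const books = [
  { title: "The Great Gatsby", pages: 218,  language: "English" },
  { title: "War and Peace",    pages: 1440, language: "Russian" },
  { title: "Atlas Shrugged",   pages: 1088, language: "English" },
];

function runMapReduce(docs, map, reduce, finalize) {
  const groups = new Map();
  for (const doc of docs) {
    // map phase: each document may emit (key, value) pairs
    map.call(doc, (key, value) => {
      if (!groups.has(key)) groups.set(key, []);
      groups.get(key).push(value);
    });
  }
  // reduce phase folds each key's values; finalize post-processes the result
  return [...groups].map(([key, values]) => ({
    _id: key,
    value: finalize(key, reduce(key, values)),
  }));
}

const results = runMapReduce(
  books,
  function (emit) {
    emit(this.language, { totalPages: this.pages, numBooks: 1 });
  },
  function (key, values) {
    const result = { numBooks: 0, totalPages: 0 };
    values.forEach(v => {
      result.numBooks += v.numBooks;
      result.totalPages += v.totalPages;
    });
    return result;
  },
  function (key, value) {
    return value.numBooks !== 0 ? value.totalPages / value.numBooks : 0;
  }
);
// results -> [ { _id: "English", value: 653 }, { _id: "Russian", value: 1440 } ]
```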
MapReduce

Three Node Replica Set


    Primary              Secondary   Secondary




                    MapReduce Jobs
MapReduce

Three Node Replica Set

     Primary                Secondary              Secondary
     tags : { workload :    tags : { workload :    tags : { workload :
       "prod" }               "prod" }               "analysis" },
                                                   priority : 0

            CRUD operations                     MapReduce Jobs
MapReduce in MongoDB
• Implemented with JavaScript
  – Single-threaded
  – Difficult to debug

• Concurrency
  – Appearance of parallelism
  – Write locks
• Versatile, powerful




MapReduce
• Versatile, powerful
  • Intended for complex data
    analysis




MapReduce
• Versatile, powerful
  • Intended for complex data
    analysis
  • Overkill for simple aggregations



MapReduce
Aggregation Framework
• Declared in JSON, executes in C++




Aggregation Framework
• Declared in JSON, executes in C++
  • Flexible, functional, and simple




Aggregation Framework
• Declared in JSON, executes in C++
  • Flexible, functional, and simple
  • Plays nice with sharding




Aggregation Framework
Pipeline
Pipeline
    Piping command line operations


 ps ax | grep mongod | head -1
Pipeline
     Piping aggregation operations


  $match | $group | $sort

Stream of documents            Result document
Pipeline Operators
• $match
• $project
• $group
• $unwind
• $sort
• $limit
• $skip
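The operators above are supplied to the server as an ordered array of stage documents; each stage's output stream feeds the next. A hedged sketch (the collection name and stage contents are illustrative):

```javascript
// A pipeline is just an ordered array of stage documents.
const pipeline = [
  { $match: { language: "English" } },                       // filter the stream
  { $group: { _id: "$language", numTitles: { $sum: 1 } } },  // aggregate per key
  { $sort:  { numTitles: -1 } },                             // order the results
];
// In the mongo shell: db.books.aggregate(pipeline)
```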
$match
• Filter documents
• Uses existing query syntax
• No geospatial operations or $where
{ $match : { language : "Russian" } }
{
    title: "The Great Gatsby",
    pages: 218,
    language: "English"
}

{
    title: "War and Peace",
    pages: 1440,
    language: "Russian"
}

{
    title: "Atlas Shrugged",
    pages: 1088,
    language: "English"
}
{ $match : { pages : { $gt : 1000 },
             language : "Russian" } }
 {
     title: "The Great Gatsby",
     pages: 218,
     language: "English"
 }

 {
     title: "War and Peace",
     pages: 1440,
     language: "Russian"
 }

 {
     title: "Atlas Shrugged",
     pages: 1088,
     language: "English"
 }
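Conceptually, $match behaves like a filter over the incoming document stream. A plain-JavaScript sketch of `{ $match: { pages: { $gt: 1000 }, language: "Russian" } }` (illustrative only; the server evaluates the query itself):

```javascript
const docs = [
  { title: "The Great Gatsby", pages: 218,  language: "English" },
  { title: "War and Peace",    pages: 1440, language: "Russian" },
  { title: "Atlas Shrugged",   pages: 1088, language: "English" },
];

// Only documents satisfying every predicate pass through.
const matched = docs.filter(d => d.pages > 1000 && d.language === "Russian");
// matched -> [ { title: "War and Peace", pages: 1440, language: "Russian" } ]
```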
$project
• Reshape documents
• Include, exclude or rename fields
• Inject computed fields
• Create sub-document fields
Selecting and Excluding
    Fields
      $project: { _id: 0, title: 1, language: 1 }
{
    _id: 375,
    title: "Great Gatsby",
    ISBN: "9781857150193",
    available: true,
    pages: 218,
    subjects: [
      "Long Island",
      "New York",
      "1920s"
    ],
    language: "English"
}

        →

{
    title: "Great Gatsby",
    language: "English"
}
Renaming and Computing
Fields
{ $project: {
  avgChapterLength: {
    $divide: ["$pages",
          "$chapters"]
  },
  lang: "$language"
}}
Renaming and Computing Fields

{ $project: {
  avgChapterLength: {           // new field
    $divide: [ "$pages",        // operation; "$pages" is the dividend
               "$chapters" ]    // "$chapters" is the divisor
  },
  lang: "$language"
}}
Renaming and Computing
    Fields

{
    _id: 375,
    title: "Great Gatsby",
    ISBN: "9781857150193",
    available: true,
    pages: 218,
    subjects: [
      "Long Island",
      "New York",
      "1920s"
    ],
    language: "English"
}

        →

{
    _id: 375,
    avgChapterLength: 24.2222,
    lang: "English"
}
Creating Sub-Document
    Fields
$project: { title: 1,
   stats: { pages: "$pages", language: "$language" }
}
{
    _id: 375,
    title: "Great Gatsby",
    ISBN: "9781857150193",
    available: true,
    pages: 218,
    subjects: [
      "Long Island",
      "New York",
      "1920s"
    ],
    language: "English"
}
Creating Sub-Document
    Fields
$project: { title: 1,
   stats: { pages: "$pages", language: "$language" }
}

{
    _id: 375,
    title: "Great Gatsby",
    ISBN: "9781857150193",
    available: true,
    pages: 218,
    subjects: [
      "Long Island",
      "New York",
      "1920s"
    ],
    language: "English"
}

        →

{
    _id: 375,
    title: "Great Gatsby",
    stats: {
      pages: 218,
      language: "English"
    }
}
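$project reshapes each document independently of the others. A plain-JavaScript sketch of the sub-document projection above (illustrative only; the server does this natively):

```javascript
const book = {
  _id: 375, title: "Great Gatsby", ISBN: "9781857150193",
  available: true, pages: 218, chapters: 9, language: "English",
};

const projected = {
  _id: book._id,          // _id is carried along unless explicitly excluded
  title: book.title,      // title: 1
  stats: {                // new sub-document field
    pages: book.pages,    // pages: "$pages"
    language: book.language, // language: "$language"
  },
};
// projected -> { _id: 375, title: "Great Gatsby",
//                stats: { pages: 218, language: "English" } }
```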
$group
• Group documents by an ID
   – Field reference, object, constant

• Other output fields are computed
   – $max, $min, $avg, $sum
   – $addToSet, $push
   – $first, $last

• Processes all data in memory
Calculating an Average

$group: { _id: "$language",
  avgPages: { $avg: "$pages" }
}
{
    title: "The Great Gatsby",
    pages: 218,
    language: "English"
}

{
    title: "War and Peace",
    pages: 1440,
    language: "Russian"
}

{
    title: "Atlas Shrugged",
    pages: 1088,
    language: "English"
}

        →

{
    _id: "Russian",
    avgPages: 1440
}

{
    _id: "English",
    avgPages: 653
}
Collecting Distinct Values

$group: { _id: "$language",
  titles: { $addToSet: "$title" }
}
{
    title: "The Great Gatsby",
    pages: 218,
    language: "English"
}

{
    title: "War and Peace",
    pages: 1440,
    language: "Russian"
}

{
    title: "Atlas Shrugged",
    pages: 1088,
    language: "English"
}

        →

{
    _id: "Russian",
    titles: [ "War and Peace" ]
}

{
    _id: "English",
    titles: [
      "Atlas Shrugged",
      "The Great Gatsby"
    ]
}
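$group folds the whole stream into one document per distinct _id value. A plain-JavaScript sketch of the $avg and $addToSet accumulators (illustrative only; the server evaluates accumulators itself):

```javascript
const docs = [
  { title: "The Great Gatsby", pages: 218,  language: "English" },
  { title: "War and Peace",    pages: 1440, language: "Russian" },
  { title: "Atlas Shrugged",   pages: 1088, language: "English" },
];

const groups = {};
for (const d of docs) {
  let g = groups[d.language];
  if (!g) g = groups[d.language] = { totalPages: 0, count: 0, titles: [] };
  g.totalPages += d.pages;                                  // feeds $avg
  g.count += 1;
  if (!g.titles.includes(d.title)) g.titles.push(d.title);  // $addToSet
}

const result = Object.keys(groups).map(lang => ({
  _id: lang,
  avgPages: groups[lang].totalPages / groups[lang].count,   // $avg
  titles: groups[lang].titles,
}));
// English: avgPages 653; Russian: avgPages 1440
```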
$unwind
• Operate on an array field
• Yield new documents for each array element
   – Array replaced by element value
   – Missing/empty fields → no output
   – Non-array fields → error

• Pipe to $group to aggregate array values
$unwind
               { $unwind: "$subjects" }

{
  title: "The Great Gatsby",
  ISBN: "9781857150193",
  subjects: [
     "Long Island",
     "New York",
     "1920s"
  ]
}

        →

{ title: "The Great Gatsby", ISBN: "9781857150193", subjects: "Long Island" }
{ title: "The Great Gatsby", ISBN: "9781857150193", subjects: "New York" }
{ title: "The Great Gatsby", ISBN: "9781857150193", subjects: "1920s" }
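In plain JavaScript, $unwind is a map over the array field that clones the parent document once per element (an illustrative sketch, not the server's implementation):

```javascript
const book = {
  title: "The Great Gatsby",
  ISBN: "9781857150193",
  subjects: ["Long Island", "New York", "1920s"],
};

// One output document per element; the array field becomes a scalar.
// A missing or empty array yields no documents, matching the rules above.
const unwound = (book.subjects || []).map(s => ({ ...book, subjects: s }));
// unwound.length -> 3; unwound[0].subjects -> "Long Island"
```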
$sort, $limit, $skip
• Sort documents by one or more fields
   – Same order syntax as cursors
   – Waits for earlier pipeline operator to return
   – In-memory unless early and indexed

• Limit and skip follow cursor behavior
$sort, $limit, $skip

                     { $sort: { title: 1 }}

 { title: "The Great Gatsby" }       { title: "Animal Farm" }
 { title: "Brave New World" }        { title: "Brave New World" }
 { title: "Grapes of Wrath" }        { title: "Fathers and Sons" }
 { title: "Animal Farm" }            { title: "Grapes of Wrath" }
 { title: "Lord of the Flies" }      { title: "Invisible Man" }
 { title: "Fathers and Sons" }       { title: "Lord of the Flies" }
 { title: "Invisible Man" }          { title: "The Great Gatsby" }
Sort All the Documents in the
Pipeline
         { $sort: { title: 1 }}
         { $skip: 2 }

                        { title: "Animal Farm" }
                        { title: "Brave New World" }
                        { title: "Fathers and Sons" }
                        { title: "Grapes of Wrath" }
                        { title: "Invisible Man" }
                        { title: "Lord of the Flies" }
                        { title: "The Great Gatsby" }
Sort All the Documents in the
Pipeline

         { $skip: 4 }
         { $limit: 2 }

                        { title: "Animal Farm" }
                        { title: "Brave New World" }
                        { title: "Fathers and Sons" }
                        { title: "Grapes of Wrath" }
                        { title: "Invisible Man" }
                        { title: "Lord of the Flies" }
                        { title: "The Great Gatsby" }
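The three cursor-like stages compose left to right. A plain-JavaScript sketch combining the stages from the last few slides (illustrative; the server applies these to the document stream):

```javascript
const titles = [
  "The Great Gatsby", "Brave New World", "Grapes of Wrath", "Animal Farm",
  "Lord of the Flies", "Fathers and Sons", "Invisible Man",
].map(title => ({ title }));

const page = titles
  .slice()                                       // don't mutate the input
  .sort((a, b) => (a.title < b.title ? -1 : 1))  // { $sort: { title: 1 } }
  .slice(4)                                      // { $skip: 4 }
  .slice(0, 2);                                  // { $limit: 2 }
// page -> [ { title: "Invisible Man" }, { title: "Lord of the Flies" } ]
```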
Usage and Limitations
Usage
• collection.aggregate() method
   – Mongo shell
   – Most drivers

• aggregate database command
Collection Method

  db.books.aggregate([
    { $project: { language: 1 }},
    { $group: { _id: "$language", numTitles: { $sum: 1 }}}
  ])

Database Command

  db.runCommand({
    aggregate: "books",
    pipeline: [
      { $project: { language: 1 }},
      { $group: { _id: "$language", numTitles: { $sum: 1 }}}
    ]
  })



  {
      result: [
        { _id: "Russian", numTitles: 1 },
        { _id: "English", numTitles: 2 }
      ],
      ok: 1
  }
Limitations
• Result limited by BSON document size
   – Final command result
   – Intermediate shard results

• Pipeline operator memory limits
• Some BSON types unsupported
   – Binary, Code, deprecated types
Sharding

An early $match that filters on the shard key lets mongos target only
the shards holding matching chunks; each targeted shard then runs the
pipeline locally:

    $match: { /* filter by shard key */ }

     Shard A       Shard B
      $match        $match
      $project      $project     Shard C (not targeted)
      $group        $group

                mongos

                client
When every shard participates, each shard runs the pipeline on its own
data and mongos combines the partial results with a final $group:

     Shard A       Shard B
      $match        $match
      $project      $project     Shard C
      $group        $group

                mongos
                $group

                client
Nice, but…
• Limited parallelism
• No access to analytics libraries
• Separation of concerns
• Need to integrate with existing tool chains
Hadoop Connector
Scaling MongoDB

    Client Application  →  MongoDB
                            (single instance or replica set,
                             scaling out to a sharded cluster)
The Mechanism of Sharding
                              Complete Data Set

  Define shard key on title




Animal Farm Brave New World Fathers & Sons Invisible Man   Lord of the Flies
The Mechanism of Sharding
                Chunk                             Chunk

  Define shard key on title




Animal Farm Brave New World Fathers & Sons Invisible Man   Lord of the Flies
The Mechanism of Sharding
     Chunk                Chunk          Chunk              Chunk

  Define shard key on title




Animal Farm Brave New World Fathers & Sons Invisible Man   Lord of the Flies
Chunk                Chunk          Chunk              Chunk

  Define shard key on title




Animal Farm Brave New World Fathers & Sons Invisible Man   Lord of the Flies

    Shard 1             Shard 2         Shard 3               Shard 4
Data Growth




Shard 1   Shard 2   Shard 3   Shard 4
Load Balancing




Shard 1   Shard 2   Shard 3   Shard 4
Processing Big Data

 Map      Map      Map      Map
Reduce   Reduce   Reduce   Reduce
Processing Big Data
• Need to break data into smaller pieces
• Process data across multiple nodes




Hadoop      Hadoop        Hadoop      Hadoop       Hadoop


         Hadoop      Hadoop        Hadoop      Hadoop       Hadoop
Input Splits on Unsharded
Systems

When the source is a single instance or replica set, the connector
computes input splits over the total dataset; each split is read by a
Hadoop worker, which runs Map and Reduce:

       Total Dataset
    (single instance or
       replica set)
            │
   input splits → Hadoop workers (Map, Reduce)
Hadoop Connector in Java
final Configuration conf = new Configuration();

MongoConfigUtil.setInputURI( conf, "mongodb://localhost/test.in" );
MongoConfigUtil.setOutputURI( conf, "mongodb://localhost/test.out" );

final Job job = new Job( conf, "word count" );

job.setMapperClass( TokenizerMapper.class );
job.setReducerClass( IntSumReducer.class );
…
MongoDB-Hadoop Quickstart
https://github.com/mongodb/mongo-hadoop
MongoDB-Hadoop Quickstart
https://github.com/mongodb/mongo-hadoop

$ ./sbt package
MongoDB-Hadoop Quickstart
https://github.com/mongodb/mongo-hadoop

$ ./sbt package

$ cp mongo-hadoop-core_1.0.3-SNAPSHOT.jar \
    ../hadoop/1.0.1/libexec/lib/
MongoDB-Hadoop Quickstart
https://github.com/mongodb/mongo-hadoop

$ ./sbt package

$ cp mongo-hadoop-core_1.0.3-SNAPSHOT.jar \
    ../hadoop/1.0.1/libexec/lib/

$ cp wordcount.jar ../hadoop/1.0.1/libexec/lib/
MongoDB-Hadoop Quickstart
https://github.com/mongodb/mongo-hadoop

$ ./sbt package

$ cp mongo-hadoop-core_1.0.3-SNAPSHOT.jar \
    ../hadoop/1.0.1/libexec/lib/

$ cp wordcount.jar ../hadoop/1.0.1/libexec/lib/

ROCK AND ROLL!
$ bin/hadoop com.xgen.WordCount
Thank You
Bryan Reinero
Engineer, 10gen
