Aggregation Framework

#mongodbdays

Aggregation Framework
Emily Stolfo
Ruby Engineer/Evangelist, 10gen
@EmStolfo

Tuesday, January 29, 2013

Agenda
• State of Aggregation
• Pipeline
• Usage and Limitations
• Optimization
• Sharding
• (Expressions)
• Looking Ahead


State of Aggregation


• We're storing our data in MongoDB
• We need to do ad-hoc reporting, grouping, common aggregations, etc.
• What are we using for this?


Data Warehousing


• SQL for reporting and analytics
• Infrastructure complications
– Additional maintenance
– Data duplication
– ETL processes
– Real time?


MapReduce


• Extremely versatile, powerful
• Intended for complex data analysis
• Overkill for simple aggregation tasks, such as
– Averages
– Summation
– Grouping


MapReduce in MongoDB
• Implemented with JavaScript
– Single-threaded
– Difficult to debug

• Concurrency
– Appearance of parallelism
– Write locks


Aggregation Framework
• Declared in JSON, executes in C++
• Flexible, functional, and simple
– Operation pipeline
– Computational expressions

• Works well with sharding


Enabling Developers
• Doing more within MongoDB, faster
• Refactoring MapReduce and groupings
– Replace pages of JavaScript
– Longer aggregation pipelines

• Quick aggregations from the shell


Pipeline


• Process a stream of documents
– Original input is a collection
– Final output is a result document

• Series of operators
– Filter or transform data
– Input/output chain

ps ax | grep mongod | head -n 1
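
The Unix pipe above is the mental model for the pipeline. As a rough sketch in plain JavaScript (not MongoDB's implementation; `runStages`, `keepMongod`, and `firstOnly` are hypothetical names, and the process list is made up), each stage takes the previous stage's output as its input:

```javascript
// Hypothetical helper: feed each stage's output into the next stage,
// the same way a Unix pipe chains commands.
function runStages(input, stages) {
  return stages.reduce((current, stage) => stage(current), input);
}

// Illustrative stand-ins for `grep mongod` and `head -n 1`.
const keepMongod = (lines) => lines.filter((line) => line.includes("mongod"));
const firstOnly = (lines) => lines.slice(0, 1);

// Made-up process list standing in for the output of `ps ax`.
const processLines = [
  "1 bash",
  "2 mongod --dbpath /data",
  "3 mongod --port 27018",
];

const pipedResult = runStages(processLines, [keepMongod, firstOnly]);
console.log(pipedResult); // [ '2 mongod --dbpath /data' ]
```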


Pipeline Operators

• $match
• $project
• $group
• $unwind
• $sort
• $limit
• $skip


Example book data
{
  _id: 375,
  title: "The Great Gatsby",
  ISBN: "9781857150193",
  available: true,
  pages: 218,
  chapters: 9,
  subjects: [
    "Long Island",
    "New York",
    "1920s"
  ],
  language: "English"
}


$match
• Filter documents
• Uses existing query syntax
• (No geospatial operations or $where)


Matching Field Values
Input documents:
{
  title: "The Great Gatsby",
  pages: 218,
  language: "English"
}
{
  title: "War and Peace",
  pages: 1440,
  language: "Russian"
}
{
  title: "Atlas Shrugged",
  pages: 1088,
  language: "English"
}

Pipeline stage:
{ $match: {
  language: "Russian"
}}

Output:
{
  title: "War and Peace",
  pages: 1440,
  language: "Russian"
}


Matching with Query Operators
Input documents:
{
  title: "The Great Gatsby",
  pages: 218,
  language: "English"
}
{
  title: "War and Peace",
  pages: 1440,
  language: "Russian"
}
{
  title: "Atlas Shrugged",
  pages: 1088,
  language: "English"
}

Pipeline stage:
{ $match: {
  pages: { $gt: 1000 }
}}

Output:
{
  title: "War and Peace",
  pages: 1440,
  language: "Russian"
}
{
  title: "Atlas Shrugged",
  pages: 1088,
  language: "English"
}
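
To make the filtering in the two $match slides concrete, here is a minimal sketch in plain JavaScript. It is an illustration only, not how MongoDB evaluates queries: the `matches` helper is hypothetical and handles just equality and `$gt`, whereas the real $match accepts the existing query syntax.

```javascript
const books = [
  { title: "The Great Gatsby", pages: 218, language: "English" },
  { title: "War and Peace", pages: 1440, language: "Russian" },
  { title: "Atlas Shrugged", pages: 1088, language: "English" },
];

// Hypothetical, simplified matcher: supports only equality and { $gt: n }.
function matches(doc, query) {
  return Object.entries(query).every(([field, condition]) => {
    if (condition !== null && typeof condition === "object" && "$gt" in condition) {
      return doc[field] > condition.$gt;
    }
    return doc[field] === condition;
  });
}

// { $match: { language: "Russian" } }
const russianBooks = books.filter((doc) => matches(doc, { language: "Russian" }));
// { $match: { pages: { $gt: 1000 } } }
const longBooks = books.filter((doc) => matches(doc, { pages: { $gt: 1000 } }));

console.log(russianBooks.map((b) => b.title)); // [ 'War and Peace' ]
console.log(longBooks.map((b) => b.title));    // [ 'War and Peace', 'Atlas Shrugged' ]
```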


$project
• Reshape documents
• Include, exclude or rename fields
• Inject computed fields
• Create sub-document fields


Including and Excluding Fields
Input document:
{
  _id: 375,
  title: "Great Gatsby",
  ISBN: "9781857150193",
  available: true,
  pages: 218,
  subjects: [
    "Long Island",
    "New York",
    "1920s"
  ],
  language: "English"
}

Pipeline stage:
{ $project: {
  _id: 0,
  title: 1,
  language: 1
}}

Output:
{
  title: "Great Gatsby",
  language: "English"
}


Renaming and Computing Fields
Input document:
{
  _id: 375,
  title: "Great Gatsby",
  ISBN: "9781857150193",
  available: true,
  pages: 218,
  chapters: 9,
  subjects: [
    "Long Island",
    "New York",
    "1920s"
  ],
  language: "English"
}

Pipeline stage:
{ $project: {
  avgChapterLength: {
    $divide: ["$pages", "$chapters"]
  },
  lang: "$language"
}}

Output:
{
  _id: 375,
  avgChapterLength: 24.2222,
  lang: "English"
}


Creating Sub-Document Fields
Input document:
{
  _id: 375,
  title: "Great Gatsby",
  ISBN: "9781857150193",
  available: true,
  pages: 218,
  subjects: [
    "Long Island",
    "New York",
    "1920s"
  ],
  language: "English"
}

Pipeline stage:
{ $project: {
  title: 1,
  stats: {
    pages: "$pages",
    language: "$language"
  }
}}

Output:
{
  _id: 375,
  title: "Great Gatsby",
  stats: {
    pages: 218,
    language: "English"
  }
}

$group
• Group documents by an ID
– Field reference, object, constant

• Other output fields are computed
– $max, $min, $avg, $sum
– $addToSet, $push
– $first, $last

• Processes all data in memory


Calculating an Average
Input documents:
{
  title: "The Great Gatsby",
  pages: 218,
  language: "English"
}
{
  title: "War and Peace",
  pages: 1440,
  language: "Russian"
}
{
  title: "Atlas Shrugged",
  pages: 1088,
  language: "English"
}

Pipeline stage:
{ $group: {
  _id: "$language",
  avgPages: { $avg: "$pages" }
}}

Output:
{
  _id: "Russian",
  avgPages: 1440
}
{
  _id: "English",
  avgPages: 653
}


Summing Fields and Counting
Input documents:
{
  title: "The Great Gatsby",
  pages: 218,
  language: "English"
}
{
  title: "War and Peace",
  pages: 1440,
  language: "Russian"
}
{
  title: "Atlas Shrugged",
  pages: 1088,
  language: "English"
}

Pipeline stage:
{ $group: {
  _id: "$language",
  numTitles: { $sum: 1 },
  sumPages: { $sum: "$pages" }
}}

Output:
{
  _id: "Russian",
  numTitles: 1,
  sumPages: 1440
}
{
  _id: "English",
  numTitles: 2,
  sumPages: 1306
}


Collecting Distinct Values
Input documents:
{
  title: "The Great Gatsby",
  pages: 218,
  language: "English"
}
{
  title: "War and Peace",
  pages: 1440,
  language: "Russian"
}
{
  title: "Atlas Shrugged",
  pages: 1088,
  language: "English"
}

Pipeline stage:
{ $group: {
  _id: "$language",
  titles: { $addToSet: "$title" }
}}

Output:
{
  _id: "Russian",
  titles: [ "War and Peace" ]
}
{
  _id: "English",
  titles: [
    "Atlas Shrugged",
    "The Great Gatsby"
  ]
}
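
The $group examples above all group by language and differ only in the accumulator. As a minimal sketch of what "processes all data in memory" means, the plain-JavaScript function below keeps one accumulator object per group key. `groupByLanguage` is a hypothetical name and this is not MongoDB's implementation. The order of groups and of `$addToSet` elements in real output is not guaranteed to match.

```javascript
const library = [
  { title: "The Great Gatsby", pages: 218, language: "English" },
  { title: "War and Peace", pages: 1440, language: "Russian" },
  { title: "Atlas Shrugged", pages: 1088, language: "English" },
];

// Hypothetical helper: one in-memory accumulator per distinct language.
function groupByLanguage(docs) {
  const groups = new Map();
  for (const doc of docs) {
    if (!groups.has(doc.language)) {
      groups.set(doc.language, { numTitles: 0, sumPages: 0, titles: new Set() });
    }
    const group = groups.get(doc.language);
    group.numTitles += 1;        // { $sum: 1 }
    group.sumPages += doc.pages; // { $sum: "$pages" }
    group.titles.add(doc.title); // { $addToSet: "$title" }
  }
  return [...groups.entries()].map(([language, group]) => ({
    _id: language,
    numTitles: group.numTitles,
    sumPages: group.sumPages,
    avgPages: group.sumPages / group.numTitles, // { $avg: "$pages" }
    titles: [...group.titles],
  }));
}

const grouped = groupByLanguage(library);
console.log(grouped);
```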


$unwind
• Applied to an array field
• Yield new documents for each array element
– Array replaced by element value
– Missing/empty fields → no output
– Non-array fields → error

• Pipe to $group to aggregate array values


Yielding Multiple Documents from One
Input document:
{
  ISBN: "9781857150193",
  subjects: [
    "Long Island",
    "New York",
    "1920s"
  ]
}

Pipeline stage:
{ $unwind: "$subjects" }

Output:
{
  ISBN: "9781857150193",
  subjects: "Long Island"
}
{
  ISBN: "9781857150193",
  subjects: "New York"
}
{
  ISBN: "9781857150193",
  subjects: "1920s"
}
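
The $unwind rules listed earlier, and the advice to pipe into $group, can be sketched in plain JavaScript as follows. This is only an illustration under stated assumptions: `unwind` is a hypothetical helper, and the second and third documents are made up to show counting and the empty-array case.

```javascript
// Hypothetical helper following the rules on the $unwind slide.
function unwind(docs, field) {
  const output = [];
  for (const doc of docs) {
    const value = doc[field];
    if (value === undefined) continue; // missing field: no output
    if (!Array.isArray(value)) {
      throw new Error(`"${field}" is not an array`); // non-array field: error
    }
    for (const element of value) {
      // One new document per element; the array is replaced by the element.
      // An empty array never enters this loop, so it yields no output.
      output.push({ ...doc, [field]: element });
    }
  }
  return output;
}

const shelf = [
  { ISBN: "9781857150193", subjects: ["Long Island", "New York", "1920s"] },
  { ISBN: "hypothetical-2", subjects: ["New York"] }, // made-up document
  { ISBN: "hypothetical-3", subjects: [] },           // made-up document
];

const unwound = unwind(shelf, "subjects");

// Rough equivalent of piping to { $group: { _id: "$subjects", count: { $sum: 1 } } }
const subjectCounts = {};
for (const doc of unwound) {
  subjectCounts[doc.subjects] = (subjectCounts[doc.subjects] || 0) + 1;
}

console.log(unwound.length); // 4
console.log(subjectCounts);  // { 'Long Island': 1, 'New York': 2, '1920s': 1 }
```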


$sort, $limit, $skip
• Sort documents by one or more fields
– Same order syntax as cursors
– Waits for earlier pipeline operator to return
– In-memory unless early and indexed

• Limit and skip follow cursor behavior


Sort All the Documents in the Pipeline

Input documents:
{ title: "The Great Gatsby" }
{ title: "Brave New World" }
{ title: "Grapes of Wrath" }
{ title: "Animal Farm" }
{ title: "Lord of the Flies" }
{ title: "Fathers and Sons" }
{ title: "Invisible Man" }
{ title: "Fahrenheit 451" }

Pipeline stage:
{ $sort: { title: 1 }}

Output:
{ title: "Animal Farm" }
{ title: "Brave New World" }
{ title: "Fahrenheit 451" }
{ title: "Fathers and Sons" }
{ title: "Grapes of Wrath" }
{ title: "Invisible Man" }
{ title: "Lord of the Flies" }
{ title: "The Great Gatsby" }


Limit Documents Through the Pipeline

Input documents:
{ title: "The Great Gatsby" }
{ title: "Brave New World" }
{ title: "Grapes of Wrath" }
{ title: "Animal Farm" }
{ title: "Lord of the Flies" }
{ title: "Fathers and Sons" }
{ title: "Invisible Man" }
{ title: "Fahrenheit 451" }

Pipeline stage:
{ $limit: 5 }

Output:
{ title: "The Great Gatsby" }
{ title: "Brave New World" }
{ title: "Grapes of Wrath" }
{ title: "Animal Farm" }
{ title: "Lord of the Flies" }


Skip Over Documents in the Pipeline

Input documents:
{ title: "The Great Gatsby" }
{ title: "Brave New World" }
{ title: "Grapes of Wrath" }
{ title: "Animal Farm" }
{ title: "Lord of the Flies" }
{ title: "Fathers and Sons" }
{ title: "Invisible Man" }
{ title: "Fahrenheit 451" }

Pipeline stage:
{ $skip: 5 }

Output:
{ title: "Fathers and Sons" }
{ title: "Invisible Man" }
{ title: "Fahrenheit 451" }
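
The three stages above behave like familiar array operations. The sketch below reproduces them in plain JavaScript on the same eight titles. It illustrates the semantics only (sorting here uses plain string comparison, and `byTitle` is a hypothetical name), then combines them the way a paged report might.

```javascript
const titles = [
  "The Great Gatsby", "Brave New World", "Grapes of Wrath", "Animal Farm",
  "Lord of the Flies", "Fathers and Sons", "Invisible Man", "Fahrenheit 451",
].map((title) => ({ title }));

// Ascending comparison on the title field, like { $sort: { title: 1 } }.
const byTitle = (a, b) => (a.title < b.title ? -1 : a.title > b.title ? 1 : 0);

const sorted = [...titles].sort(byTitle); // $sort must see every input document first
const limited = titles.slice(0, 5);       // { $limit: 5 }
const skipped = titles.slice(5);          // { $skip: 5 }

// $sort, then $skip 3, then $limit 3: the second "page" of three titles.
const secondPage = sorted.slice(3).slice(0, 3);

console.log(sorted.map((d) => d.title));
console.log(secondPage.map((d) => d.title));
// [ 'Fathers and Sons', 'Grapes of Wrath', 'Invisible Man' ]
```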


Usage and Limitations


Usage
• collection.aggregate() method
– Mongo shell
– Most drivers

• aggregate database command


Collection
db.books.aggregate([
  { $project: { language: 1 }},
  { $group: { _id: "$language", numTitles: { $sum: 1 }}}
])

{
  result: [
    { _id: "Russian", numTitles: 1 },
    { _id: "English", numTitles: 2 }
  ],
  ok: 1
}


Database Command
db.runCommand({
  aggregate: "books",
  pipeline: [
    { $project: { language: 1 }},
    { $group: { _id: "$language", numTitles: { $sum: 1 }}}
  ]
})

{
  result: [
    { _id: "Russian", numTitles: 1 },
    { _id: "English", numTitles: 2 }
  ],
  ok: 1
}


Limitations
• Result limited by BSON document size
– Final command result
– Intermediate shard results

• Pipeline operator memory limits
• Some BSON types unsupported
– Binary, Code, deprecated types


Sharding


• Split the pipeline at first $group or $sort
– Shards execute pipeline up to that point
– mongos merges results and continues

• Early $match on the shard key may exclude shards
• CPU and memory implications for mongos


Sharding
[
  { $match: { /* filter by shard key */ }},
  { $project: { /* select fields */ }},
  { $group: { /* group by some field */ }},
  { $sort: { /* sort by some field */ }},
  { $project: { /* reshape result */ }}
]
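
The splitting rule from the previous slide can be sketched as a small function over the pipeline array. This is an illustration, not mongos code: `splitAtFirstGroupOrSort` is a hypothetical name, the field names are made up, and the sketch does not model any partial work the shards may do for the splitting stage itself.

```javascript
// Hypothetical helper: split a pipeline at the first $group or $sort stage.
function splitAtFirstGroupOrSort(pipeline) {
  const index = pipeline.findIndex((stage) => "$group" in stage || "$sort" in stage);
  if (index === -1) {
    // No $group or $sort: in this sketch the whole pipeline goes to the shards.
    return { shardPart: pipeline, mongosPart: [] };
  }
  return {
    shardPart: pipeline.slice(0, index), // executed on each shard
    mongosPart: pipeline.slice(index),   // merged and continued on mongos
  };
}

const shardedPipeline = [
  { $match: { language: "English" } }, // assumes language is the shard key
  { $project: { language: 1, pages: 1 } },
  { $group: { _id: "$language", sumPages: { $sum: "$pages" } } },
  { $sort: { sumPages: -1 } },
  { $project: { sumPages: 1 } },
];

const { shardPart, mongosPart } = splitAtFirstGroupOrSort(shardedPipeline);
console.log(shardPart.map((stage) => Object.keys(stage)[0]));  // [ '$match', '$project' ]
console.log(mongosPart.map((stage) => Object.keys(stage)[0])); // [ '$group', '$sort', '$project' ]
```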


Aggregation in a sharded cluster


Expressions


• Return computed values
• Used with $project and $group
• Reference fields using $ (e.g. "$x")
• Expressions may be nested


Boolean Operators
• Input array of one or more values
– $and, $or
– Short-circuit logic

• Invert values with $not
• Evaluation of non-boolean types
– null, undefined, zero ▶ false
– Non-zero, strings, dates, objects ▶ true

{ $and: [true, false] } ▶ false
{ $or: ["foo", 0] } ▶ true
{ $not: null } ▶ true
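
The truthiness rules above differ slightly from JavaScript's own (in JavaScript an empty string is falsy, while the slide lists strings as true). A minimal sketch of the slide's rules, with hypothetical helper names `toBool`, `and`, `or`, and `not`:

```javascript
// Truthiness as listed on the slide: null, undefined, and zero are false;
// other values (non-zero numbers, strings, dates, objects) are true.
function toBool(value) {
  return !(value === false || value === null || value === undefined || value === 0);
}

const and = (values) => values.every(toBool); // every() stops at the first false
const or = (values) => values.some(toBool);   // some() stops at the first true
const not = (value) => !toBool(value);

console.log(and([true, false])); // false
console.log(or(["foo", 0]));     // true
console.log(not(null));          // true
```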


Comparison Operators
• Compare numbers, strings, and dates
• Input array with two operands
– $cmp, $eq, $ne
– $gt, $gte, $lt, $lte

{ $cmp: [3, 4] } ▶ -1
{ $eq: ["foo", "bar"] } ▶ false
{ $ne: ["foo", "bar"] } ▶ true
{ $gt: [9, 7] } ▶ true


Arithmetic Operators
• Input array of one or more numbers
– $add, $multiply

• Input array of two operands
– $subtract, $divide, $mod

{ $add: [1, 2, 3] } ▶ 6
{ $multiply: [2, 2, 2] } ▶ 8
{ $subtract: [10, 7] } ▶ 3
{ $divide: [10, 2] } ▶ 5
{ $mod: [8, 3] } ▶ 2


String Operators
• $strcasecmp case-insensitive comparison
– $cmp is case-sensitive

• $toLower and $toUpper case change
• $substr for sub-string extraction
• Not encoding aware (assumes ASCII alphabet)

{ $strcasecmp: ["foo", "bar"] } ▶ 1
{ $substr: ["foo", 1, 2] } ▶ "oo"
{ $toUpper: "foo" } ▶ "FOO"
{ $toLower: "BAR" } ▶ "bar"
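
For readers more used to JavaScript string methods, the sketch below mirrors the string operator examples. The helper names are hypothetical and the code assumes ASCII input, in line with the slide's note that the operators are not encoding aware. `$substr` takes a zero-based start index and a length, which is why the example yields "oo".

```javascript
// Case-insensitive comparison returning -1, 0, or 1, like $strcasecmp.
function strcasecmp(a, b) {
  const left = a.toUpperCase();
  const right = b.toUpperCase();
  return left < right ? -1 : left > right ? 1 : 0;
}

// Zero-based start index plus a length, like $substr.
const substr = (text, start, length) => text.slice(start, start + length);

console.log(strcasecmp("foo", "bar")); // 1
console.log(strcasecmp("FOO", "foo")); // 0
console.log(substr("foo", 1, 2));      // oo
console.log("foo".toUpperCase());      // FOO
console.log("BAR".toLowerCase());      // bar
```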


Date Operators
• Extract values from date objects
– $dayOfYear, $dayOfMonth, $dayOfWeek
– $year, $month, $week
– $hour, $minute, $second

{ $year: ISODate("2012-10-24T00:00:00.000Z") } ▶ 2012
{ $month: ISODate("2012-10-24T00:00:00.000Z") } ▶ 10
{ $dayOfMonth: ISODate("2012-10-24T00:00:00.000Z") } ▶ 24
{ $dayOfWeek: ISODate("2012-10-24T00:00:00.000Z") } ▶ 4
{ $dayOfYear: ISODate("2012-10-24T00:00:00.000Z") } ▶ 298
{ $week: ISODate("2012-10-24T00:00:00.000Z") } ▶ 43
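
The date operator values can be cross-checked with JavaScript's UTC date methods. The sketch below is a plain-JavaScript calculation, not MongoDB code. It assumes the conventions that `$dayOfWeek` counts from 1 (Sunday) and that `$week` uses Sunday-based weeks, with days before the year's first Sunday falling in week 0.

```javascript
const sampleDate = new Date("2012-10-24T00:00:00.000Z");
const MS_PER_DAY = 24 * 60 * 60 * 1000;

const year = sampleDate.getUTCFullYear();
const month = sampleDate.getUTCMonth() + 1;  // JavaScript months are zero-based
const dayOfMonth = sampleDate.getUTCDate();
const dayOfWeek = sampleDate.getUTCDay() + 1; // 1 = Sunday ... 7 = Saturday

// Days elapsed since January 1 of the same year, counted from 1.
const startOfYear = Date.UTC(year, 0, 1);
const dayOfYear = Math.floor((sampleDate.getTime() - startOfYear) / MS_PER_DAY) + 1;

// Sunday-based week number: week 1 starts on the first Sunday of the year.
const week = Math.floor((dayOfYear - 1 + 7 - sampleDate.getUTCDay()) / 7);

console.log({ year, month, dayOfMonth, dayOfWeek, dayOfYear, week });
```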


Conditional Operators
• $cond ternary operator
• $ifNull

{ $cond: [{ $eq: [1, 2] }, "same", "different"] } ▶ "different"

{ $ifNull: ["foo", "bar"] } ▶ "foo"
{ $ifNull: [null, "bar"] } ▶ "bar"
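
Both conditional operators map onto familiar JavaScript constructs. A minimal sketch, with hypothetical helper names `cond` and `ifNull`; unlike a real ternary, these helper functions evaluate both branches before choosing:

```javascript
// $cond: [test, valueIfTrue, valueIfFalse]
const cond = (test, valueIfTrue, valueIfFalse) => (test ? valueIfTrue : valueIfFalse);

// $ifNull: [value, replacement]; this sketch treats undefined like null.
const ifNull = (value, replacement) =>
  value === null || value === undefined ? replacement : value;

console.log(cond(1 === 2, "same", "different")); // different
console.log(ifNull("foo", "bar"));               // foo
console.log(ifNull(null, "bar"));                // bar
```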


Looking Ahead


Framework Use Cases
• Basic aggregation queries
• Ad-hoc reporting
• Real-time analytics
• Visualizing time series data


Extending the Framework
• Adding new pipeline operators, expressions
• $out and $tee for output control
– https://jira.mongodb.org/browse/SERVER-3253


Future Enhancements
• Automatically move $match earlier if possible
• Pipeline explain facility
• Memory usage improvements
– Grouping input sorted by _id
– Sorting with limited output


#mongodbdays

Thank You
Emily Stolfo
Ruby Engineer/Evangelist, 10gen
@EmStolfo

