MapReduce, The
Aggregation Framework &
 The Hadoop Connector
Bryan Reinero
Engineer, 10gen
Data Warehousing
• We're storing our data in MongoDB




Data Warehousing
• We're storing our data in MongoDB
  • We need reporting, grouping,
    common aggregations, etc.




Data Warehousing
• We're storing our data in MongoDB
  • We need reporting, grouping,
    common aggregations, etc.
  • What are we using for this?




Data Warehousing
• SQL for reporting and analytics
  • Infrastructure complications
    •   Additional maintenance
    •   Data duplication
    •   ETL processes
    •   Real time?




Data Warehousing
MapReduce
MapReduce

    Data Set  →  worker threads call map()  →  workers call reduce()  →  Output
Our Example Data
{
    _id: 375,
    title: "The Great Gatsby",
    ISBN: "9781857150193",
    available: true,
    pages: 218,
    chapters: 9,
    subjects: [
      "Long Island",
      "New York",
      "1920s"
    ],
    language: "English"
}
MapReduce
   db.books.mapReduce(
     map, reduce, {finalize: finalize, out: { inline : 1} } )

 function map() {
     var key = this.language;
     emit( key, { totalPages : this.pages, numBooks : 1 } );
 }
MapReduce
   db.books.mapReduce(
     map, reduce, {finalize: finalize, out: { inline : 1} } )

 function reduce(key, values) {
     var result = { numBooks : 0, totalPages : 0 };
     values.forEach(function (value) {
         result.numBooks += value.numBooks;
         result.totalPages += value.totalPages;
     });
     return result;
 }
MapReduce
   db.books.mapReduce(
     map, reduce, {finalize: finalize, out: { inline : 1} } )

 function finalize( key, value ) {
     if ( value.numBooks != 0 )
         return value.totalPages / value.numBooks;
 }
MapReduce
   db.books.mapReduce(
     map, reduce, {finalize: finalize, out: { inline : 1} } )
 "results" : [
 function finalize( key, value ) {
          {
             "_id" : "English",
 if ( value.numBooks != 0 )
             "value" : 653
     return value.totalPages / value.numBooks;
          },
 }        {
             "_id" : "Russian",
             "value" : 1440
          }
      ]
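The whole run can be sketched in plain JavaScript. This is an illustrative simulation only (the `runMapReduce` helper is hypothetical); MongoDB executes map, reduce, and finalize server-side against the collection. The three sample books match the ones used later in this deck.

```javascript
const books = [
  { title: "The Great Gatsby", pages: 218,  language: "English" },
  { title: "War and Peace",    pages: 1440, language: "Russian" },
  { title: "Atlas Shrugged",   pages: 1088, language: "English" },
];

function runMapReduce(docs, map, reduce, finalize) {
  const groups = new Map();
  for (const doc of docs) {
    // map phase: each document may emit (key, value) pairs
    map.call(doc, (key, value) => {
      if (!groups.has(key)) groups.set(key, []);
      groups.get(key).push(value);
    });
  }
  // reduce phase folds each key's values; finalize post-processes the result
  return [...groups].map(([key, values]) => ({
    _id: key,
    value: finalize(key, reduce(key, values)),
  }));
}

const results = runMapReduce(
  books,
  function (emit) {
    emit(this.language, { totalPages: this.pages, numBooks: 1 });
  },
  function (key, values) {
    const result = { numBooks: 0, totalPages: 0 };
    values.forEach(v => {
      result.numBooks += v.numBooks;
      result.totalPages += v.totalPages;
    });
    return result;
  },
  function (key, value) {
    return value.numBooks !== 0 ? value.totalPages / value.numBooks : 0;
  }
);
// results -> [ { _id: "English", value: 653 }, { _id: "Russian", value: 1440 } ]
```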
MapReduce

Three Node Replica Set


    Primary              Secondary   Secondary




                    MapReduce Jobs
MapReduce

Three Node Replica Set

     Primary                Secondary              Secondary
     tags : { workload :    tags : { workload :    tags : { workload :
       "prod" }               "prod" }               "analysis" },
                                                   priority : 0

            CRUD operations                     MapReduce Jobs
MapReduce in MongoDB
• Implemented with JavaScript
  – Single-threaded
  – Difficult to debug

• Concurrency
  – Appearance of parallelism
  – Write locks
• Versatile, powerful




MapReduce
• Versatile, powerful
  • Intended for complex data
    analysis




MapReduce
• Versatile, powerful
  • Intended for complex data
    analysis
  • Overkill for simple aggregations



MapReduce
Aggregation Framework
• Declared in JSON, executes in C++




Aggregation Framework
• Declared in JSON, executes in C++
  • Flexible, functional, and simple




Aggregation Framework
• Declared in JSON, executes in C++
  • Flexible, functional, and simple
  • Plays nice with sharding




Aggregation Framework
Pipeline
Pipeline
    Piping command line operations


 ps ax | grep mongod | head -1
Pipeline
     Piping aggregation operations


  $match | $group | $sort

Stream of documents            Result document
Pipeline Operators
• $match
• $project
• $group
• $unwind
• $sort
• $limit
• $skip
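The operators above are supplied to the server as an ordered array of stage documents; each stage's output stream feeds the next. A hedged sketch (the collection name and stage contents are illustrative):

```javascript
// A pipeline is just an ordered array of stage documents.
const pipeline = [
  { $match: { language: "English" } },                       // filter the stream
  { $group: { _id: "$language", numTitles: { $sum: 1 } } },  // aggregate per key
  { $sort:  { numTitles: -1 } },                             // order the results
];
// In the mongo shell: db.books.aggregate(pipeline)
```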
$match
• Filter documents
• Uses existing query syntax
• No geospatial operations or $where
{ $match : { language : "Russian" } }
{
    title: "The Great Gatsby",
    pages: 218,
    language: "English"
}

{
    title: "War and Peace",
    pages: 1440,
    language: "Russian"
}

{
    title: "Atlas Shrugged",
    pages: 1088,
    language: "English"
}
{ $match : { pages : { $gt : 1000 },
             language : "Russian" } }
 {
     title: "The Great Gatsby",
     pages: 218,
     language: "English"
 }

 {
     title: "War and Peace",
     pages: 1440,
     language: "Russian"
 }

 {
     title: "Atlas Shrugged",
     pages: 1088,
     language: "English"
 }
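Conceptually, $match behaves like a filter over the incoming document stream. A plain-JavaScript sketch of `{ $match: { pages: { $gt: 1000 }, language: "Russian" } }` (illustrative only; the server evaluates the query itself):

```javascript
const docs = [
  { title: "The Great Gatsby", pages: 218,  language: "English" },
  { title: "War and Peace",    pages: 1440, language: "Russian" },
  { title: "Atlas Shrugged",   pages: 1088, language: "English" },
];

// Only documents satisfying every predicate pass through.
const matched = docs.filter(d => d.pages > 1000 && d.language === "Russian");
// matched -> [ { title: "War and Peace", pages: 1440, language: "Russian" } ]
```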
$project
• Reshape documents
• Include, exclude or rename fields
• Inject computed fields
• Create sub-document fields
Selecting and Excluding
    Fields
      $project: { _id: 0, title: 1, language: 1 }
{
    _id: 375,
    title: "Great Gatsby",
    ISBN: "9781857150193",
    available: true,
    pages: 218,
    subjects: [
      "Long Island",
      "New York",
      "1920s"
    ],
    language: "English"
}

        →

{
    title: "Great Gatsby",
    language: "English"
}
Renaming and Computing
Fields
{ $project: {
  avgChapterLength: {
    $divide: ["$pages",
          "$chapters"]
  },
  lang: "$language"
}}
Renaming and Computing Fields

{ $project: {
  avgChapterLength: {           // new field
    $divide: [ "$pages",        // operation; "$pages" is the dividend
               "$chapters" ]    // "$chapters" is the divisor
  },
  lang: "$language"
}}
Renaming and Computing
    Fields

{
    _id: 375,
    title: "Great Gatsby",
    ISBN: "9781857150193",
    available: true,
    pages: 218,
    subjects: [
      "Long Island",
      "New York",
      "1920s"
    ],
    language: "English"
}

        →

{
    _id: 375,
    avgChapterLength: 24.2222,
    lang: "English"
}
Creating Sub-Document
    Fields
$project: { title: 1,
   stats: { pages: "$pages", language: "$language" }
}
{
    _id: 375,
    title: "Great Gatsby",
    ISBN: "9781857150193",
    available: true,
    pages: 218,
    subjects: [
      "Long Island",
      "New York",
      "1920s"
    ],
    language: "English"
}
Creating Sub-Document
    Fields
$project: { title: 1,
   stats: { pages: "$pages", language: "$language" }
}

{
    _id: 375,
    title: "Great Gatsby",
    ISBN: "9781857150193",
    available: true,
    pages: 218,
    subjects: [
      "Long Island",
      "New York",
      "1920s"
    ],
    language: "English"
}

        →

{
    _id: 375,
    title: "Great Gatsby",
    stats: {
      pages: 218,
      language: "English"
    }
}
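$project reshapes each document independently of the others. A plain-JavaScript sketch of the sub-document projection above (illustrative only; the server does this natively):

```javascript
const book = {
  _id: 375, title: "Great Gatsby", ISBN: "9781857150193",
  available: true, pages: 218, chapters: 9, language: "English",
};

const projected = {
  _id: book._id,          // _id is carried along unless explicitly excluded
  title: book.title,      // title: 1
  stats: {                // new sub-document field
    pages: book.pages,    // pages: "$pages"
    language: book.language, // language: "$language"
  },
};
// projected -> { _id: 375, title: "Great Gatsby",
//                stats: { pages: 218, language: "English" } }
```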
$group
• Group documents by an ID
   – Field reference, object, constant

• Other output fields are computed
   – $max, $min, $avg, $sum
   – $addToSet, $push
   – $first, $last

• Processes all data in memory
Calculating an Average

$group: { _id: "$language",
  avgPages: { $avg: "$pages" }
}
{
    title: "The Great Gatsby",
    pages: 218,
    language: "English"
}

{
    title: "War and Peace",
    pages: 1440,
    language: "Russian"
}

{
    title: "Atlas Shrugged",
    pages: 1088,
    language: "English"
}

        →

{
    _id: "Russian",
    avgPages: 1440
}

{
    _id: "English",
    avgPages: 653
}
Collecting Distinct Values

$group: { _id: "$language",
  titles: { $addToSet: "$title" }
}
{
    title: "The Great Gatsby",
    pages: 218,
    language: "English"
}

{
    title: "War and Peace",
    pages: 1440,
    language: "Russian"
}

{
    title: "Atlas Shrugged",
    pages: 1088,
    language: "English"
}

        →

{
    _id: "Russian",
    titles: [ "War and Peace" ]
}

{
    _id: "English",
    titles: [
      "Atlas Shrugged",
      "The Great Gatsby"
    ]
}
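$group folds the whole stream into one document per distinct _id value. A plain-JavaScript sketch of the $avg and $addToSet accumulators (illustrative only; the server evaluates accumulators itself):

```javascript
const docs = [
  { title: "The Great Gatsby", pages: 218,  language: "English" },
  { title: "War and Peace",    pages: 1440, language: "Russian" },
  { title: "Atlas Shrugged",   pages: 1088, language: "English" },
];

const groups = {};
for (const d of docs) {
  let g = groups[d.language];
  if (!g) g = groups[d.language] = { totalPages: 0, count: 0, titles: [] };
  g.totalPages += d.pages;                                  // feeds $avg
  g.count += 1;
  if (!g.titles.includes(d.title)) g.titles.push(d.title);  // $addToSet
}

const result = Object.keys(groups).map(lang => ({
  _id: lang,
  avgPages: groups[lang].totalPages / groups[lang].count,   // $avg
  titles: groups[lang].titles,
}));
// English: avgPages 653; Russian: avgPages 1440
```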
$unwind
• Operate on an array field
• Yield new documents for each array element
   – Array replaced by element value
   – Missing/empty fields → no output
   – Non-array fields → error

• Pipe to $group to aggregate array values
$unwind
               { $unwind: "$subjects" }

{
  title: "The Great Gatsby",
  ISBN: "9781857150193",
  subjects: [
     "Long Island",
     "New York",
     "1920s"
  ]
}

        →

{ title: "The Great Gatsby", ISBN: "9781857150193", subjects: "Long Island" }
{ title: "The Great Gatsby", ISBN: "9781857150193", subjects: "New York" }
{ title: "The Great Gatsby", ISBN: "9781857150193", subjects: "1920s" }
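In plain JavaScript, $unwind is a map over the array field that clones the parent document once per element (an illustrative sketch, not the server's implementation):

```javascript
const book = {
  title: "The Great Gatsby",
  ISBN: "9781857150193",
  subjects: ["Long Island", "New York", "1920s"],
};

// One output document per element; the array field becomes a scalar.
// A missing or empty array yields no documents, matching the rules above.
const unwound = (book.subjects || []).map(s => ({ ...book, subjects: s }));
// unwound.length -> 3; unwound[0].subjects -> "Long Island"
```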
$sort, $limit, $skip
• Sort documents by one or more fields
   – Same order syntax as cursors
   – Waits for earlier pipeline operator to return
   – In-memory unless early and indexed

• Limit and skip follow cursor behavior
$sort, $limit, $skip

                     { $sort: { title: 1 }}

 { title: "The Great Gatsby" }       { title: "Animal Farm" }
 { title: "Brave New World" }        { title: "Brave New World" }
 { title: "Grapes of Wrath" }        { title: "Fathers and Sons" }
 { title: "Animal Farm" }            { title: "Grapes of Wrath" }
 { title: "Lord of the Flies" }      { title: "Invisible Man" }
 { title: "Fathers and Sons" }       { title: "Lord of the Flies" }
 { title: "Invisible Man" }          { title: "The Great Gatsby" }
Sort All the Documents in the
Pipeline
         { $sort: { title: 1 }}
         { $skip: 2 }

                        { title: "Animal Farm" }
                        { title: "Brave New World" }
                        { title: "Fathers and Sons" }
                        { title: "Grapes of Wrath" }
                        { title: "Invisible Man" }
                        { title: "Lord of the Flies" }
                        { title: "The Great Gatsby" }
Sort All the Documents in the
Pipeline

         { $skip: 4 }
         { $limit: 2 }

                        { title: "Animal Farm" }
                        { title: "Brave New World" }
                        { title: "Fathers and Sons" }
                        { title: "Grapes of Wrath" }
                        { title: "Invisible Man" }
                        { title: "Lord of the Flies" }
                        { title: "The Great Gatsby" }
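The three cursor-like stages compose left to right. A plain-JavaScript sketch combining the stages from the last few slides (illustrative; the server applies these to the document stream):

```javascript
const titles = [
  "The Great Gatsby", "Brave New World", "Grapes of Wrath", "Animal Farm",
  "Lord of the Flies", "Fathers and Sons", "Invisible Man",
].map(title => ({ title }));

const page = titles
  .slice()                                       // don't mutate the input
  .sort((a, b) => (a.title < b.title ? -1 : 1))  // { $sort: { title: 1 } }
  .slice(4)                                      // { $skip: 4 }
  .slice(0, 2);                                  // { $limit: 2 }
// page -> [ { title: "Invisible Man" }, { title: "Lord of the Flies" } ]
```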
Usage and Limitations
Usage
• collection.aggregate() method
   – Mongo shell
   – Most drivers

• aggregate database command
Collection Method

  db.books.aggregate([
    { $project: { language: 1 }},
    { $group: { _id: "$language", numTitles: { $sum: 1 }}}
  ])

Database Command

  db.runCommand({
    aggregate: "books",
    pipeline: [
      { $project: { language: 1 }},
      { $group: { _id: "$language", numTitles: { $sum: 1 }}}
    ]
  })



  {
      result: [
        { _id: "Russian", numTitles: 1 },
        { _id: "English", numTitles: 2 }
      ],
      ok: 1
  }
Limitations
• Result limited by BSON document size
   – Final command result
   – Intermediate shard results

• Pipeline operator memory limits
• Some BSON types unsupported
   – Binary, Code, deprecated types
Sharding

An early $match that filters on the shard key lets mongos target only
the shards holding matching chunks; each targeted shard then runs the
pipeline locally:

    $match: { /* filter by shard key */ }

     Shard A       Shard B
      $match        $match
      $project      $project     Shard C (not targeted)
      $group        $group

                mongos

                client
When every shard participates, each shard runs the pipeline on its own
data and mongos combines the partial results with a final $group:

     Shard A       Shard B
      $match        $match
      $project      $project     Shard C
      $group        $group

                mongos
                $group

                client
Nice, but…
• Limited parallelism
• No access to analytics libraries
• Separation of concerns
• Need to integrate with existing tool chains
Hadoop Connector
Scaling MongoDB

    Client Application  →  MongoDB
                            (single instance or replica set,
                             scaling out to a sharded cluster)
The Mechanism of Sharding
                              Complete Data Set

  Define shard key on title




Animal Farm Brave New World Fathers & Sons Invisible Man   Lord of the Flies
The Mechanism of Sharding
                Chunk                             Chunk

  Define shard key on title




Animal Farm Brave New World Fathers & Sons Invisible Man   Lord of the Flies
The Mechanism of Sharding
     Chunk                Chunk          Chunk              Chunk

  Define shard key on title




Animal Farm Brave New World Fathers & Sons Invisible Man   Lord of the Flies
Chunk                Chunk          Chunk              Chunk

  Define shard key on title




Animal Farm Brave New World Fathers & Sons Invisible Man   Lord of the Flies

    Shard 1             Shard 2         Shard 3               Shard 4
Data Growth




Shard 1   Shard 2   Shard 3   Shard 4
Load Balancing




Shard 1   Shard 2   Shard 3   Shard 4
Processing Big Data

 Map      Map      Map      Map
Reduce   Reduce   Reduce   Reduce
Processing Big Data
• Need to break data into smaller pieces
• Process data across multiple nodes




Hadoop      Hadoop        Hadoop      Hadoop       Hadoop


         Hadoop      Hadoop        Hadoop      Hadoop       Hadoop
Input Splits on Unsharded
Systems

When the source is a single instance or replica set, the connector
computes input splits over the total dataset; each split is read by a
Hadoop worker, which runs Map and Reduce:

       Total Dataset
    (single instance or
       replica set)
            │
   input splits → Hadoop workers (Map, Reduce)
Hadoop Connector in Java
final Configuration conf = new Configuration();

MongoConfigUtil.setInputURI( conf, "mongodb://localhost/test.in" );
MongoConfigUtil.setOutputURI( conf, "mongodb://localhost/test.out" );

final Job job = new Job( conf, "word count" );

job.setMapperClass( TokenizerMapper.class );
job.setReducerClass( IntSumReducer.class );
…
MongoDB-Hadoop Quickstart
https://github.com/mongodb/mongo-hadoop
MongoDB-Hadoop Quickstart
https://github.com/mongodb/mongo-hadoop

$ ./sbt package
MongoDB-Hadoop Quickstart
https://github.com/mongodb/mongo-hadoop

$ ./sbt package

$ cp mongo-hadoop-core_1.0.3-SNAPSHOT.jar \
    ../hadoop/1.0.1/libexec/lib/
MongoDB-Hadoop Quickstart
https://github.com/mongodb/mongo-hadoop

$ ./sbt package

$ cp mongo-hadoop-core_1.0.3-SNAPSHOT.jar \
    ../hadoop/1.0.1/libexec/lib/

$ cp wordcount.jar ../hadoop/1.0.1/libexec/lib/
MongoDB-Hadoop Quickstart
https://github.com/mongodb/mongo-hadoop

$ ./sbt package

$ cp mongo-hadoop-core_1.0.3-SNAPSHOT.jar \
    ../hadoop/1.0.1/libexec/lib/

$ cp wordcount.jar ../hadoop/1.0.1/libexec/lib/

ROCK AND ROLL!
$ bin/hadoop com.xgen.WordCount
Thank You
Bryan Reinero
Engineer, 10gen
