Hadoop + MongoDB
{ Name: 'Bryan Reinero',
Title: 'Developer Advocate',
Twitter: '@blimpyacht',
Email: 'bryan@mongodb.com' }
2
Hadoop
A framework for distributed processing of large data sets
• Terabyte and petabyte datasets
• Data warehousing
• Advanced analytics
• Not a database
• No indexes
• Batch processing
3
Data Management
4
Data Management
Hadoop
• Fault tolerance
• Batch processing
• Coarse-grained operations
• Unstructured data
MongoDB
• High availability
• Mutable data
• Fine-grained operations
• Flexible schemas
5
Data Management
Hadoop
• Offline processing
• Analytics
• Data warehousing
MongoDB
• Online operations
• Application
• Operational
6
Typical Implementations
Application Server
7
MongoDB as an Operational Store
Application Server
8
Use Cases
• Behavioral analytics
• Segmentation
• Fraud detection
• Prediction
• Pricing analytics
• Sales analytics
What does it do?
10
Processing Sensor Data
{
  "_id" : ObjectId("556172a53004b760dde8a488"),
  "deviceId" : 556172530004,
  "value" : 6205,
  "timestamp" : ISODate("2015-06-02T02:03:17.906Z"),
  "loc" : [
    -174.95596353219008,
    40.654427078258834
  ]
}
Average Sensor Value By
• Device
• Time Interval
• Location Bucket
14
MapReduce
map() {
  emit( { key: ObjectId(…), value: 6205 } );
  emit( { key: bucketByLoc( loc ), value: 6205 } );
  emit( { key: bucketByDate( timestamp ), value: value } );
}
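A minimal sketch of the map step in plain JavaScript, runnable in Node.js outside of MongoDB or Hadoop. The slides do not show the bodies of bucketByLoc and bucketByDate, so the versions below are assumptions: the location bucket is a 10-degree grid cell and the date bucket uses the UTC month, day, and hour. Keying the per-device pair on deviceId is also an assumption, and the keys produced here are not meant to match the slide labels exactly.

```javascript
// Hypothetical helper: bucket a [longitude, latitude] pair into a 10-degree grid cell.
function bucketByLoc(loc) {
  const lon = Math.floor(loc[0] / 10) * 10;
  const lat = Math.floor(loc[1] / 10) * 10;
  return "lon" + lon + "_lat" + lat;
}

// Hypothetical helper: bucket a timestamp by UTC month, day, and hour.
function bucketByDate(timestamp) {
  const pad = (n) => String(n).padStart(2, "0");
  return "m" + pad(timestamp.getUTCMonth() + 1) +
         "_d" + pad(timestamp.getUTCDate()) +
         "_h" + pad(timestamp.getUTCHours());
}

// Map step: one sensor reading in, three key/value pairs out.
function map(doc) {
  return [
    { key: doc.deviceId, value: doc.value },
    { key: bucketByLoc(doc.loc), value: doc.value },
    { key: bucketByDate(doc.timestamp), value: doc.value },
  ];
}

// The sample document from the slides, with the timestamp as a JavaScript Date.
const reading = {
  deviceId: 556172530004,
  value: 6205,
  timestamp: new Date("2015-06-02T02:03:17.906Z"),
  loc: [-174.95596353219008, 40.654427078258834],
};

const pairs = map(reading);
console.log(pairs);
```

Each reading is emitted once per grouping dimension, which is what lets a single pass over the data feed all three averages.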
15
MapReduce
{key: ObjectId(…),
value: 6205 }
16
MapReduce
{key: zone_a,
value: 6205}
17
MapReduce
{ key: m06_d01_h02,
value: 6205}
18
MapReduce
• key: zone_a, value: 6025
• key: zone_a, value: 4904
• key: zone_a, value: 6338
• key: m06_d01_h02, value: 6205
• key: m06_d01_h02, value: 4904
• key: m06_d01_h02, value: 6338
• key: m06_d01_h02, value: 6721
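Between map and reduce, emitted pairs are collected by key so that each reduce call sees every value for one key. The sketch below imitates that grouping in plain JavaScript; groupByKey is a hypothetical name, and the sample pairs reuse values from the slides in an arbitrary interleaved order.

```javascript
// Group emitted pairs by key, as the shuffle phase does between map and reduce.
function groupByKey(pairs) {
  const groups = new Map();
  for (const { key, value } of pairs) {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  }
  return groups;
}

// Sample pairs for two keys, interleaved as they might arrive from several mappers.
const emitted = [
  { key: "m06_d01_h02", value: 6205 },
  { key: "zone_a", value: 4904 },
  { key: "m06_d01_h02", value: 4904 },
  { key: "m06_d01_h02", value: 6338 },
  { key: "zone_a", value: 6338 },
  { key: "m06_d01_h02", value: 6721 },
];

const grouped = groupByKey(emitted);
console.log(grouped.get("m06_d01_h02")); // [ 6205, 4904, 6338, 6721 ]
console.log(grouped.get("zone_a"));      // [ 4904, 6338 ]
```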
26
MapReduce
function reduce ( key, values ) {
  // Start both accumulators at zero
  var result = { count: 0, sum: 0 };
  values.forEach( function( v ){
    result.sum += v.value;  // accumulate rather than overwrite
    result.count++;
  });
  return result;
}
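A hedged sketch of how the averaging could be completed. MongoDB's mapReduce may call reduce more than once for the same key, feeding earlier results back in, so this version assumes each mapped value already has the { count, sum } shape that reduce returns. The finalize function and the sample values are illustrative, and the code runs in plain Node.js rather than inside MongoDB.

```javascript
// Reduce: combine partial { count, sum } values for one key.
// Input and output share one shape, so partial results can be reduced again.
function reduce(key, values) {
  const result = { count: 0, sum: 0 };
  values.forEach(function (v) {
    result.count += v.count;
    result.sum += v.sum;
  });
  return result;
}

// Finalize: turn the accumulated totals into an average.
function finalize(key, reduced) {
  return reduced.sum / reduced.count;
}

// Each mapped reading contributes a count of 1 and its own value as the sum.
const values = [6205, 4904, 6338, 6721].map((v) => ({ count: 1, sum: v }));

// Reducing in two passes gives the same totals as reducing in one pass.
const partial = reduce("m06_d01_h02", values.slice(0, 2));
const total = reduce("m06_d01_h02", [partial, ...values.slice(2)]);

console.log(total);                          // { count: 4, sum: 24168 }
console.log(finalize("m06_d01_h02", total)); // 6042
```

Carrying the count alongside the sum is the design choice that matters: an average of averages would be wrong whenever the partial groups differ in size.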
28
HDFS
YARN
MapReduce
Pig Hive
Spark
29
HDFS and YARN
• Hadoop Distributed File System (HDFS)
– Distributed file system that stores data on commodity machines in a Hadoop cluster
• Yet Another Resource Negotiator (YARN)
– Resource management platform responsible for managing and scheduling compute resources in a Hadoop cluster
30
Hadoop Distributed File System (HDFS)
[Diagram: a client performs reads and writes against four DataNodes, which replicate data among themselves; metadata operations go to the NameNode.]
31
Yet Another Resource Negotiator
[Diagram: Client, Resource Manager, two Node Managers on compute nodes, an Application Master, and three Containers.]
Using The Connector
33
What You’re Gonna Need
• A reducer class
– extends org.apache.hadoop.mapreduce.Reducer
• A mapper class
– extends org.apache.hadoop.mapreduce.Mapper
• Hadoop Connector jar
– https://github.com/mongodb/mongo-hadoop
34
MapReduce Configuration
• MongoDB input
– mongo.job.input.format = com.mongodb.hadoop.MongoInputFormat
– mongo.input.uri = mongodb://mydb:27017/db1.collection1
• MongoDB output
– mongo.job.output.format = com.mongodb.hadoop.MongoOutputFormat
– mongo.output.uri = mongodb://mydb:27017/db1.collection2
• BSON input/output
– mongo.job.input.format = com.mongodb.hadoop.BSONFileInputFormat
– mapred.input.dir = hdfs:///tmp/database.bson
– mongo.job.output.format = com.mongodb.hadoop.BSONFileOutputFormat
– mapred.output.dir = hdfs:///tmp/output.bson
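The input and output URIs above name a collection as database.collection after the host and port. A small sketch of pulling those parts out of such a URI follows; parseMongoUri is a hypothetical helper written for this illustration, not part of the connector, and it handles only the simple single-host form shown on the slide.

```javascript
// Hypothetical helper: split a simple mongodb:// URI into its parts.
// Handles only the single-host form host:port/database.collection.
function parseMongoUri(uri) {
  const match = /^mongodb:\/\/([^:\/]+):(\d+)\/([^.\/]+)\.(.+)$/.exec(uri);
  if (match === null) {
    throw new Error("Unsupported URI form: " + uri);
  }
  return {
    host: match[1],
    port: Number(match[2]),
    database: match[3],
    collection: match[4],
  };
}

// The input URI from the slide.
const input = parseMongoUri("mongodb://mydb:27017/db1.collection1");
console.log(input);
```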
35
Yet Another Resource Negotiator
[Diagram: the YARN layout from the earlier slide (Client, Resource Manager, Node Managers, Application Master, Containers), now with the job submission command and the connector jar.]
bin/hadoop jar MyJob.jar
MongoDB_Hadoop_Connector.jar
36
Cluster
[Diagram: a client, two mongos routers, and shards A, B, C, and D.]
37
39
extends MongoSplitter class
List<InputSplit> calculateSplits()
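As a rough illustration of what calculateSplits() has to produce, the sketch below turns a sorted list of boundary keys into half-open key ranges, one per input split. It is plain JavaScript rather than the connector's Java, and the function body, the boundary values, and the range representation are all assumptions made for the example; the connector's own splitter classes are not reproduced here.

```javascript
// Illustrative only: build half-open [min, max) ranges from sorted boundary keys.
// A null bound means "unbounded" on that side.
function calculateSplits(boundaries) {
  const splits = [];
  let min = null;
  for (const boundary of boundaries) {
    splits.push({ min: min, max: boundary });
    min = boundary;
  }
  // The final range runs from the last boundary to the end of the key space.
  splits.push({ min: min, max: null });
  return splits;
}

// Hypothetical boundary keys, for example chunk boundaries on a shard key.
const splits = calculateSplits([1000, 2000, 3000]);
console.log(splits.length); // 4
```

Because each range starts where the previous one ends, every key falls into exactly one split and the splits can be processed in parallel without overlap.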
40
Pig
• High-level platform for creating MapReduce programs
• Pig Latin abstracts Java into an easier-to-use notation
• Executed as a series of MapReduce applications
• Supports user-defined functions (UDFs)
41
samples = LOAD 'mongodb://127.0.0.1:27017/sensor.logs'
    USING com.mongodb.hadoop.pig.MongoLoader('deviceId:int,value:double');
grouped = GROUP samples BY deviceId;
sample_stats = FOREACH grouped {
    mean = AVG(samples.value);
    GENERATE group AS deviceId, mean AS mean;
}
STORE sample_stats INTO 'mongodb://127.0.0.1:27017/sensor.stats'
    USING com.mongodb.hadoop.pig.MongoStorage;
42
Hive
• Data warehouse infrastructure built on top of Hadoop
• Provides data summarization, query, and analysis
• HiveQL is a SQL-like query language
• Supports user-defined functions (UDFs)
43
Spark
An engine for processing Hadoop data. It can perform MapReduce in addition to streaming, interactive queries, and machine learning.
• Powerful built-in transformations and actions
– Transformations: map, reduceByKey, union, distinct, sample, intersection, and more
– Actions: foreach, count, collect, take, and many more
44
Data Flows
Hadoop
Connector
BSON Files
MapReduce & HDFS
Thanks!
{ name: 'Bryan Reinero',
title: 'Developer Advocate',
twitter: '@blimpyacht',
code: 'github.com/breinero',
email: 'bryan@mongodb.com' }

Webinar: MongoDB + Hadoop
