What’s the Scoop with Hadoop?
How the connector works and how to work it
{ Name: ‘Bryan Reinero’,
Title: ‘Developer Advocate’,
Twitter: ‘@blimpyacht’,
Email: ‘bryan@mongdb.com’ }
Hadoop
A framework for distributed processing of large data sets
• Terabyte and petabyte datasets
• Data warehousing
• Advanced analytics
• Not a database
• No indexes
• Batch processing
Data Management
Hadoop
Fault tolerance
Batch processing
Coarse-grained operations
Unstructured Data
MongoDB
High availability
Mutable data
Fine-grained operations
Flexible Schemas
Data Management
Hadoop
Offline Processing
Analytics
Data Warehousing
MongoDB
Online Operations
Application
Operational
Typical Implementations
Application Server
MongoDB as an Operational Store
Application Server
Use Cases
• Behavioral analytics
• Segmentation
• Fraud detection
• Prediction
• Pricing analytics
• Sales analytics
What does it do?
Processing Sensor Data
{
    "_id" : ObjectId("556172a53004b760dde8a488"),
    "deviceId" : 556172530004,
    "value" : 6205,
    "timestamp" : ISODate("2015-06-02T02:03:17.906Z"),
    "loc" : [
        -174.95596353219008,
        40.654427078258834
    ]
}
Average Sensor Value By
• Device
• Time Interval
• Location Bucket
Processing Sensor Data
{
"_id" : ObjectId("556172a53004b760dde8a488"),
"d_id" : ObjectId("556172a53004b760dde8a443"),
"v" : 6205,
"timestamp" : ISODate("3129-12-13T02:03:17.906Z"),
"loc" : [
-174.95596353219008,
40.654427078258834
]
}
LIVE PSEUDO CODE DEMO!!!
MapReduce
map() {
    // one emit per grouping: device, location bucket, time interval
    emit( { key: ObjectId(…), value: 6205 } );
    emit( { key: "Chelsea", value: 6205 } );
    emit( { key: "m06_d01_h02", value: 6205 } );
}
Each emit produces one key/value pair:
{ key: ObjectId(…), value: 6205 }
{ key: "Chelsea", value: 6205 }
{ key: "m06_d01_h02", value: 6205 }
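The map pseudocode above can be fleshed out as a small, self-contained JavaScript sketch. The helper names (timeBucket, locationBucket, mapSample) and the bucket formats are assumptions made for illustration: the slides do not say how the time key or a named location bucket such as Chelsea is derived, so the sketch uses the UTC month, day, and hour, and a one-degree grid cell.

```javascript
// Illustrative only: helper names and bucket formats are assumptions,
// not part of the MongoDB Hadoop connector.

// Zero-pad a number to two digits.
function pad2(n) {
  return String(n).padStart(2, "0");
}

// Build a time-interval key from the UTC month, day, and hour.
function timeBucket(date) {
  return "m" + pad2(date.getUTCMonth() + 1) +
         "_d" + pad2(date.getUTCDate()) +
         "_h" + pad2(date.getUTCHours());
}

// Bucket a [longitude, latitude] pair into a one-degree grid cell.
function locationBucket(loc) {
  return "lon" + Math.floor(loc[0]) + "_lat" + Math.floor(loc[1]);
}

// Emit one { key, value } pair per grouping dimension.
function mapSample(doc) {
  return [
    { key: String(doc.deviceId), value: doc.value },
    { key: locationBucket(doc.loc), value: doc.value },
    { key: timeBucket(doc.timestamp), value: doc.value }
  ];
}

// The sample document from the "Processing Sensor Data" slide.
const sample = {
  deviceId: 556172530004,
  value: 6205,
  timestamp: new Date("2015-06-02T02:03:17.906Z"),
  loc: [-174.95596353219008, 40.654427078258834]
};

console.log(mapSample(sample));
```

Under this convention the sample timestamp maps to m06_d02_h02, which differs from the m06_d01_h02 key on the slide; the slide's exact bucketing rule is not stated, so treat the format here as a placeholder.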
MapReduce
Values collected per key:
• key: Chelsea, values: 6205, 4904, 6338
• key: m06_d01_h02, values: 6205, 4904, 6338, 6721
MapReduce
function reduce ( key, values ) {
    // accumulate a running sum and count for this key
    var result = { count: 0, sum: 0 };
    values.forEach( function( v ){
        result.sum += v.value;
        result.count++;
    });
    return result;
}
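To see the whole flow end to end, here is a minimal in-memory sketch in plain JavaScript that groups emitted pairs by key, applies a reduce function, and derives the average. groupByKey and average are hypothetical helpers, not MongoDB or Hadoop APIs; the input values are the m06_d01_h02 values shown on the slides.

```javascript
// In-memory sketch of the group and reduce steps; not a Hadoop or MongoDB API.

// Group emitted { key, value } pairs into a Map of key -> array of pairs.
function groupByKey(pairs) {
  const groups = new Map();
  for (const pair of pairs) {
    if (!groups.has(pair.key)) groups.set(pair.key, []);
    groups.get(pair.key).push(pair);
  }
  return groups;
}

// Sum and count the values that share a key.
function reduce(key, values) {
  const result = { count: 0, sum: 0 };
  values.forEach(function (v) {
    result.sum += v.value;
    result.count++;
  });
  return result;
}

// Derive the mean from a reduced { count, sum } result.
function average(result) {
  return result.sum / result.count;
}

const emitted = [
  { key: "m06_d01_h02", value: 6205 },
  { key: "m06_d01_h02", value: 4904 },
  { key: "m06_d01_h02", value: 6338 },
  { key: "m06_d01_h02", value: 6721 }
];

const averages = {};
for (const [key, values] of groupByKey(emitted)) {
  averages[key] = average(reduce(key, values));
}

console.log(averages); // { m06_d01_h02: 6042 }
```

Keeping the sum and count separate until the end is a deliberate choice in this sketch: averages cannot be combined directly, but sums and counts can.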
[Diagram: HDFS, YARN, MapReduce, Pig, Hive, Spark]
HDFS and YARN
• Hadoop Distributed File System (HDFS)
– Distributed file system that stores data on commodity machines in a Hadoop cluster
• Yet Another Resource Negotiator (YARN)
– Resource management platform responsible for managing and scheduling compute resources in a Hadoop cluster
Hadoop Distributed File System (HDFS)
[Diagram: a client sends metadata operations to the NameNode and reads and writes data on four DataNodes; the DataNodes replicate data among themselves]
Yet Another Resource Negotiator
[Diagram: a client, the Resource Manager, and compute nodes running Node Managers, an Application Master, and containers]
Using The Connector
What You’re Gonna Need
• A mapper class that extends org.apache.hadoop.mapreduce.Mapper
• A reducer class that extends org.apache.hadoop.mapreduce.Reducer
MapReduce Configuration
• MongoDB input
– mongo.job.input.format = com.mongodb.hadoop.MongoInputFormat
– mongo.input.uri = mongodb://mydb:27017/db1.collection1
• MongoDB output
– mongo.job.output.format = com.mongodb.hadoop.MongoOutputFormat
– mongo.output.uri = mongodb://mydb:27017/db1.collection2
• BSON input/output
– mongo.job.input.format = com.mongodb.hadoop.BSONFileInputFormat
– mapred.input.dir = hdfs:///tmp/database.bson
– mongo.job.output.format = com.mongodb.hadoop.BSONFileOutputFormat
– mapred.output.dir = hdfs:///tmp/output.bson
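One way to supply settings like these is as -D property=value generic options on the hadoop jar command line, which works when the job's driver parses generic options (for example through Hadoop's ToolRunner). The sketch below only assembles those argument strings in JavaScript; toHadoopArgs is a hypothetical helper, and the host, database, and collection names are the placeholders from the slide.

```javascript
// Hypothetical helper: turns a property map into Hadoop "-D key=value" arguments.
function toHadoopArgs(properties) {
  const args = [];
  for (const [name, value] of Object.entries(properties)) {
    args.push("-D", name + "=" + value);
  }
  return args;
}

// Property names and values copied from the slide's MongoDB input/output example.
const mongoJobProperties = {
  "mongo.job.input.format": "com.mongodb.hadoop.MongoInputFormat",
  "mongo.input.uri": "mongodb://mydb:27017/db1.collection1",
  "mongo.job.output.format": "com.mongodb.hadoop.MongoOutputFormat",
  "mongo.output.uri": "mongodb://mydb:27017/db1.collection2"
};

// Print the arguments as they would appear on a command line.
console.log(toHadoopArgs(mongoJobProperties).join(" "));
```

Whether a given job picks these options up depends on how its driver is written, so check that before relying on command-line configuration.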
Yet Another Resource Negotiator
[Diagram: a client, the Resource Manager, and compute nodes running Node Managers, an Application Master, and containers]
bin/hadoop jar MyJob.jar
MongoDB_Hadoop_Connector.jar
extends MongoSplitter class
List<InputSplit> calculateSplits()
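Conceptually, a splitter divides the input into key ranges that separate map tasks can read independently. The sketch below illustrates that idea only; it is not the connector's MongoSplitter implementation, and rangesFromBoundaries and the numeric split points are hypothetical.

```javascript
// Conceptual sketch only; not the connector's MongoSplitter implementation.
// Given sorted boundary keys, produce half-open ranges [min, max) that cover
// the whole key space. A null bound means "unbounded" on that side.
function rangesFromBoundaries(boundaries) {
  const ranges = [];
  let lower = null;
  for (const boundary of boundaries) {
    ranges.push({ min: lower, max: boundary });
    lower = boundary;
  }
  ranges.push({ min: lower, max: null });
  return ranges;
}

// Hypothetical split points on a numeric key: three boundaries give four ranges.
const splits = rangesFromBoundaries([1000, 2000, 3000]);
console.log(splits);
```

Each range would then back one input split, so the number of boundaries determines how many map tasks can run in parallel.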
Cluster
[Diagram: a client, two mongos routers, and shards A, B, C, and D]
Pig
• High-level platform for creating MapReduce programs
• Pig Latin abstracts Java into easier-to-use notation
• Executed as a series of MapReduce applications
• Supports user-defined functions (UDFs)
samples = LOAD 'mongodb://127.0.0.1:27017/sensor.logs'
    USING com.mongodb.hadoop.pig.MongoLoader('deviceId:int,value:double');
grouped = GROUP samples BY deviceId;
sample_stats = FOREACH grouped {
    mean = AVG(samples.value);
    GENERATE group AS deviceId, mean AS mean;
}
STORE sample_stats INTO 'mongodb://127.0.0.1:27017/sensor.stats'
    USING com.mongodb.hadoop.pig.MongoStorage;
Hive
• Data warehouse infrastructure built on top of Hadoop
• Provides data summarization, query, and analysis
• HiveQL is a SQL-like query language
• Support for user-defined functions (UDFs)
Spark
An engine for processing Hadoop data. Can perform MapReduce in addition to streaming, interactive queries, and machine learning.
• Powerful built-in transformations and actions
– Transformations: map, reduceByKey, union, distinct, sample, intersection, and more
– Actions: foreach, count, collect, take, and many more
Data Flows
[Diagram: Hadoop Connector, BSON files, MapReduce & HDFS]
Thanks!
{ name: ‘Bryan Reinero’,
title: ‘Developer Advocate’,
twitter: ‘@blimpyacht’,
code: ‘github.com/breinero’
email: ‘bryan@mongdb.com’ }
