MapReduce
michel.bruley@teradata.com

Extract from various presentations: Sudarshan, Chungnam, Teradata Aster, …

April 2012

www.decideo.fr/bruley
What is MapReduce?
Restricted parallel programming model meant for large
clusters
– User implements Map() and Reduce() functions
Parallel computing framework
– Libraries take care of EVERYTHING else
• Parallelization
• Fault Tolerance
• Data Distribution
• Load Balancing
Useful model for many practical tasks
Map and Reduce
The idea of Map and Reduce is 40+ years old
– Present in all Functional Programming Languages.
– See, e.g., APL, Lisp and ML
Alternate names for Map: Apply-All
Higher Order Functions
– take function definitions as arguments, or
– return a function as output
Map and Reduce are higher-order functions.
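As a minimal sketch of the higher-order idea (Python is used here only for illustration; the helper names `square`, `add` and `make_scaler` are made up for this example), map takes a function and applies it to every element, while reduce takes a two-argument function and folds a sequence into a single value:

```python
from functools import reduce


def square(x):
    return x * x


def add(acc, x):
    return acc + x


def make_scaler(factor):
    # A higher-order function that returns a function as output.
    def scale(x):
        return factor * x
    return scale


numbers = [1, 2, 3, 4]

# map takes a function definition as an argument (Apply-All).
squares = list(map(square, numbers))

# reduce folds the sequence into one value using a two-argument function.
total = reduce(add, squares, 0)

# The function returned by make_scaler can itself be passed to map.
doubled = list(map(make_scaler(2), numbers))

print(squares, total, doubled)
```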

Map and Reduce Functions
Functions borrowed from functional programming
languages (e.g. Lisp)
Map()
– Process a key/value pair to generate intermediate
key/value pairs
Reduce()
– Merge all intermediate values associated with the same
key
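The contract of the two functions can be sketched with a tiny single-process driver. This is an illustrative assumption, not the API of any real framework: `run_mapreduce`, `map_readings`, `reduce_max` and the sample temperature data are hypothetical names and values invented for the example.

```python
from collections import defaultdict


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Hypothetical in-memory driver: group map output by key, then reduce."""
    groups = defaultdict(list)
    for key, value in inputs:
        # Map(): one input key/value pair -> zero or more intermediate pairs.
        for ikey, ivalue in map_fn(key, value):
            groups[ikey].append(ivalue)
    # Reduce(): merge all intermediate values that share the same key.
    return {ikey: reduce_fn(ikey, values) for ikey, values in groups.items()}


def map_readings(station, line):
    # Made-up input format: "city:temperature" tokens separated by spaces.
    for token in line.split():
        city, temp = token.split(":")
        yield city, int(temp)


def reduce_max(city, temps):
    return max(temps)


inputs = [("station-a", "paris:12 lyon:15"), ("station-b", "paris:17 lyon:9")]
result = run_mapreduce(inputs, map_readings, reduce_max)
print(result)
```

The user writes only `map_readings` and `reduce_max`; the grouping by key is the part a real framework would handle across machines.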

Example: Counting Words
Map()
– Input <filename, file text>
– Parses file and emits <word, count> pairs
• e.g. <"hello", 1>
Reduce()
– Sums all values for the same key and emits <word, TotalCount>
• e.g. <"hello", (3 5 2 7)> => <"hello", 17>
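The word-count example can be sketched in a single process as follows. The function names, the tokenizing regular expression and the two sample files are assumptions made for illustration; a real framework would run the map and reduce calls on different machines.

```python
import re
from collections import defaultdict


def map_word_count(filename, text):
    # Emit <word, 1> for every word found in the file text.
    for word in re.findall(r"[a-z']+", text.lower()):
        yield word, 1


def reduce_word_count(word, counts):
    # Sum all values for the same key and emit <word, TotalCount>.
    return word, sum(counts)


def word_count(files):
    # Group intermediate values by word, standing in for the framework.
    intermediate = defaultdict(list)
    for filename, text in files.items():
        for word, count in map_word_count(filename, text):
            intermediate[word].append(count)
    return dict(reduce_word_count(w, c) for w, c in intermediate.items())


files = {"a.txt": "Hello world hello", "b.txt": "hello MapReduce"}
counts = word_count(files)
print(counts)
```

Calling `reduce_word_count("hello", [3, 5, 2, 7])` reproduces the slide's example and yields `("hello", 17)`.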

Execution on Clusters
1. Input files split (M splits)
2. Assign Master & Workers
3. Map tasks
4. Writing intermediate data to disk (R regions)
5. Intermediate data read & sort
6. Reduce tasks
7. Return
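The seven steps above can be imitated in one process. This is only a sketch under stated assumptions: plain loops stand in for the master and workers, in-memory lists stand in for the on-disk regions, and the CRC32-modulo-R partitioning is an assumed choice. All names (`run_job`, `split_input`, `partition`) are hypothetical.

```python
import zlib
from itertools import groupby
from operator import itemgetter


def split_input(records, m):
    # Step 1: cut the input into M splits.
    return [records[i::m] for i in range(m)]


def partition(key, r):
    # Assumed partitioning function: stable hash of the key, modulo R.
    return zlib.crc32(key.encode("utf-8")) % r


def run_job(records, map_fn, reduce_fn, m=3, r=2):
    splits = split_input(records, m)
    # Steps 2-4: each loop iteration plays the role of one map task;
    # its output is appended to one of R regions (lists instead of disk).
    regions = [[] for _ in range(r)]
    for split in splits:
        for record in split:
            for key, value in map_fn(record):
                regions[partition(key, r)].append((key, value))
    # Steps 5-6: each reduce task reads its region, sorts it by key,
    # and reduces every group of values that share a key.
    outputs = []
    for region in regions:
        region.sort(key=itemgetter(0))
        out = {}
        for key, pairs in groupby(region, key=itemgetter(0)):
            out[key] = reduce_fn(key, [value for _, value in pairs])
        outputs.append(out)
    # Step 7: return one output per reduce task.
    return outputs


def map_words(line):
    return [(word, 1) for word in line.split()]


def reduce_sum(key, values):
    return sum(values)


outputs = run_job(["a b a", "b c", "a"], map_words, reduce_sum)
merged = {}
for out in outputs:
    merged.update(out)
print(outputs, merged)
```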

Map/Reduce Cluster Implementation
[Figure: input files (split 0 to split 4) feed M map tasks, which write intermediate files; R reduce tasks read them and write the output files (Output 0, Output 1)]
– Several map or reduce tasks can run on a single computer
– Each intermediate file is divided into R partitions, by partitioning function
– Each reduce task corresponds to one partition
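A minimal sketch of the partitioning described here, assuming a hash-modulo-R partitioning function (CRC32 is an assumption chosen for this example because it is stable across runs; the helper names are hypothetical). Each map task divides its own output into R partitions, and reduce task j gathers partition j from every map task:

```python
import zlib


def partition_for(key, r):
    # Assumed partitioning function: stable hash of the key, modulo R.
    return zlib.crc32(key.encode("utf-8")) % r


def partition_map_output(pairs, r):
    # Divide one map task's intermediate pairs into R partitions.
    parts = [[] for _ in range(r)]
    for key, value in pairs:
        parts[partition_for(key, r)].append((key, value))
    return parts


def gather_for_reducer(all_map_outputs, j):
    # Reduce task j reads partition j from every map task's output.
    return [pair for parts in all_map_outputs for pair in parts[j]]


R = 2
map_task_0 = [("hello", 1), ("world", 1)]
map_task_1 = [("hello", 1), ("reduce", 1)]
all_map_outputs = [partition_map_output(p, R) for p in (map_task_0, map_task_1)]
reducer_inputs = [gather_for_reducer(all_map_outputs, j) for j in range(R)]
print(reducer_inputs)
```

Because the partition depends only on the key, every value for a given key reaches the same reduce task, whichever map task produced it.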
Map Reduce vs. Parallel Databases
Map Reduce is widely used for parallel processing
– Google, Yahoo, and hundreds of other companies
– Example uses: compute PageRank, build keyword indices,
do data analysis of web click logs, …
Database people say:
– But parallel databases have been doing this for decades
Map Reduce people say:
– We operate at scales of thousands of machines
– We handle failures seamlessly
– We allow procedural code in map and reduce and allow
data of any type
Typical MapReduce Cluster

Map Reduce Implementations
Google
– Not available outside Google
Hadoop
– An open-source implementation in Java
– Uses HDFS for stable storage
– Download: http://lucene.apache.org/hadoop/
Teradata Aster
– Cluster-optimized SQL Database that also implements
MapReduce
• IITB alumnus among founders
And several related systems, such as Cassandra (a distributed data store developed at Facebook), etc.
MapReduce v. Hadoop
                        MapReduce    Hadoop
Org                     Google       Yahoo/Apache
Impl                    C++          Java
Distributed File Sys    GFS          HDFS
Data Base               Bigtable     HBase
Distributed lock mgr    Chubby       ZooKeeper

Solutions Stack for Teradata Aster
Aster Data Ecosystem
– Data Integration / ETL
– Business Intelligence Tools
– Query Tools
– Analytics Specialists
– Systems Management
– Security
Aster Data nCluster
Aster Data Platform Infrastructure
– Operating System
– Servers
– Cloud Infrastructure
– Storage
Teradata Aster Platform Infrastructure
For physical infrastructure (non-cloud) deployments
– Aster Data Analytic Platform (nCluster): Aster Data nCluster packaged software
– Operating System: Certified Linux operating system
– Server Hardware: Certified commodity (x86) server hardware with internal storage

Teradata Aster Infrastructure
For cloud deployments
– Aster Data Analytic Platform (nCluster): Aster Data nCluster packaged software
– Operating System: Linux operating system
– Compute Instance (CC, xLarge): Compute instance from cloud provider (e.g. Amazon Web Services EC2)
– Storage (EBS, Ephemeral): Storage connected to cloud computing capacity
Teradata Aster Architecture for Analytics
Your Analytics & Advanced Reporting Applications
• Support for in-database processing of custom applications written in a broad variety of languages
• Integration with third-party packaged software via ODBC/JDBC or in-database integration
Aster Data nCluster
Analytic Functions and Frameworks
• Rich libraries of MapReduce analytics from Aster Data and partners
• Visual development environment: develop in hours
Unified Interface (SQL, SQL-MapReduce)
• Standard SQL interface
• MapReduce processing integrated with SQL via SQL-MapReduce interface
Analytics Processing Engines (SQL, MapReduce) and Massively Parallel Data Stores
• Optimized SQL engine
• Fully-integrated in-database MapReduce
• Hybrid row/column DBMS
• Linear, incremental scalability
• Commodity hardware
Teradata Aster Ecosystem
Partner | Product | Product release | Platform for Certification
MicroStrategy | Intelligence Server | 9.2.1 32-bit | Windows 7, Enterprise Edition SP1, 32-bit, 64-bit
SAP | Business Objects | XI 3.1 | Windows 2008, 32-bit
Informatica | PowerCenter | 9.0.1 | Client: Windows 2003/2008 Server 32 bit. Server: Windows 2003/2008 Server 32 bit and 64 bit
IBM | Cognos | 10.1FP1 | n/a
Tableau | Tableau Server | 6 | Windows (SS: TBU)
Microsoft | SSLS, SSAS, SSFS, SSIS | SQL Server 2008 | .NET Framework 2.0; Windows Server, 2008 64-bit; Windows 2003, 32-bit

*Oracle BIEE certification currently in process

