MapReduce and Parallel Database Systems: Complementary or Competitive Technology?
Daniel Abadi, Yale University
October 12th, 2010

Brief History of MapReduce
  • Pre-2004: used at Google for many data processing apps, including Web indexing
  • 2004: paper in academic conference, not written in traditional academic style
  • 2004-2006: implemented in Nutch
  • 2006-2008: split off into Hadoop; significant usage at Yahoo; buzz increases
 
 
Controversy
  • Vast majority of the outrage was about the comparison of the systems
  • BUT:
      • The line between MapReduce and Hadoop (which comes with HDFS) was blurring
      • Hadoop can be used as an alternative to traditional DW implementations built using DBMS software
 
SIGMOD 2009 Paper
  • Benchmarked Hadoop vs. 2 parallel database systems
  • Compared across a variety of dimensions, including performance and ease of use
  • Measured differences in load and query time for some common data processing tasks
  • Used a Web analytics benchmark whose goal was to be representative of tasks that:
      • Both should excel at
      • Hadoop should excel at
      • Databases should excel at

Hardware Setup
  • 100-node cluster
  • Each node:
      • 2.4 GHz Core 2 Duo processors
      • 4 GB RAM
      • 2 × 250 GB SATA HDs (74 MB/sec sequential I/O)
  • Dual GigE switches, each with 50 nodes
      • 128 Gbit/sec fabric
  • Connected by a 64 Gbit/sec ring

Join Task

UDF Task
  • Calculate PageRank over a set of HTML documents
  • Performed via a UDF
  • DBMS clearly doesn’t scale

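
To make the shape of such a task concrete, here is a minimal sketch in Python. It is not the benchmark code: it only counts inlinks per URL, a common building block of PageRank, and the names `map_links`, `reduce_counts`, `inlink_counts`, and the `HREF` pattern are all hypothetical. The point is that the link extraction is arbitrary user code, which maps naturally onto a map step, while a DBMS has to run it as a UDF.

```python
import re
from collections import defaultdict

# Hypothetical pattern for link targets; real HTML needs a proper parser.
HREF = re.compile(r'href="([^"]+)"')


def map_links(doc_url, html):
    """Map step: emit (target_url, 1) for every link found in one document."""
    for target in HREF.findall(html):
        yield target, 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted counts for each target URL."""
    totals = defaultdict(int)
    for url, count in pairs:
        totals[url] += count
    return dict(totals)


def inlink_counts(docs):
    """Run the map step over every document, then reduce the emitted pairs."""
    pairs = (pair for url, html in docs.items() for pair in map_links(url, html))
    return reduce_counts(pairs)


# Made-up toy corpus, for illustration only.
docs = {
    "a.html": '<a href="b.html">b</a> <a href="c.html">c</a>',
    "b.html": '<a href="c.html">c</a>',
    "c.html": "no links here",
}
print(inlink_counts(docs))  # {'b.html': 1, 'c.html': 2}
```
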
Benchmark Conclusions
  • Hadoop has many advantages
      • Load time much faster
      • Significantly easier to install and use
      • Better parallelization of UDFs
  • Hadoop is consistently less efficient for structured, relational data
      • Reasons both fundamental and non-fundamental
      • Needs better support for compression and direct operation on compressed data
      • Needs better support for indexing
      • Needs better support for co-partitioning of datasets

Overall Conclusion
  • MapReduce/Hadoop and parallel databases are clearly complementary
  • Use MapReduce if you want to do:
      • ETL
      • Unstructured data processing
      • Deep analysis that is hard to express in SQL
  • Use parallel databases for:
      • Traditional data warehousing / data marts
      • Structured data processing expressible in SQL
  • Cloudera agrees!
 
 
 
 
 
 
We’re all in agreement, right?
But Wait!
  • Hadoop can do everything a parallel database can do
      • Hadoop has (something resembling) a SQL interface (Hive)
  • Many of Hadoop’s performance deficiencies are not fundamental
      • Result of initial design for unstructured data
      • Over 20 research papers in the last two years on improving Hadoop performance for DBMS workloads
  • Hadoop is free and open source
      • (Oracle, IBM/Netezza, Microsoft, Teradata, Vertica, Greenplum, and Aster Data are all proprietary)

People are using Hadoop as a DW
  • Facebook has a 12 PB data warehouse in Hadoop/Hive
      • Adding 10 TB per day
  • Yahoo’s warehouse is the same order of magnitude
      • Recently switched to Hadoop

Fault Tolerance and Cluster Heterogeneity Results
  • Database systems restart the entire query upon a single node failure, and do not adapt if a node is running slowly

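
A toy cost model can illustrate why the restart policy matters. The sketch below is an assumption made for illustration, not a result from the talk: it treats a query as a number of equal-cost tasks, counts work in task units, and assumes exactly one failure. The function names are hypothetical.

```python
def work_with_query_restart(num_tasks, failed_task):
    """Task units spent when one failure forces the whole query to restart.

    Tasks 0..failed_task (inclusive) are attempted and thrown away,
    then every task is run again from the beginning.
    """
    wasted = failed_task + 1
    return wasted + num_tasks


def work_with_task_retry(num_tasks):
    """Task units spent when only the single failed task is rescheduled."""
    return num_tasks + 1


# A 100-task query that loses a node while running its 90th task (index 89).
print(work_with_query_restart(100, 89))  # 190
print(work_with_task_retry(100))         # 101
```

Under these assumptions, the later the failure occurs, the more work a full restart discards, while task-level rescheduling costs one extra task regardless of when the failure happens.
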
So …
  • Hadoop can do everything that parallel databases can do, but:
      • Has better fault tolerance
      • Adjusts better to runtime performance fluctuations
      • Is more open / cheaper
      • Has at least as good scalability (if not better)
  • If only we could fix those performance problems on structured data
  • HadoopDB!

HadoopDB
  • Use Hadoop to coordinate execution of multiple independent (typically single-node, open source) database systems
  • Flexible query interface (accepts both SQL and MapReduce)
  • Open source (built using open source components)

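
The coordination idea can be sketched in a few lines of Python. This is not HadoopDB code: in-memory SQLite databases stand in for the independent single-node database systems, a plain function stands in for the Hadoop job, and the `visits` table, its columns, and all function names are hypothetical. The sketch pushes a SQL aggregate down to each node (the map side) and merges the partial results (the reduce side).

```python
import sqlite3
from collections import defaultdict


def make_node(rows):
    """Create one independent single-node database holding one partition."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE visits (url TEXT, revenue REAL)")
    conn.executemany("INSERT INTO visits VALUES (?, ?)", rows)
    return conn


def partition(rows, num_nodes):
    """Spread rows round-robin across the nodes."""
    parts = [[] for _ in range(num_nodes)]
    for i, row in enumerate(rows):
        parts[i % num_nodes].append(row)
    return parts


def revenue_per_url(nodes):
    """Push the aggregate down to every node, then merge the partial sums."""
    local_sql = "SELECT url, SUM(revenue) FROM visits GROUP BY url"
    totals = defaultdict(float)
    for conn in nodes:                      # map side: each node runs SQL locally
        for url, partial in conn.execute(local_sql):
            totals[url] += partial          # reduce side: combine partial sums
    return dict(totals)


# Made-up rows, for illustration only.
rows = [("a", 1.0), ("b", 2.0), ("a", 3.0), ("c", 4.0), ("b", 5.0)]
nodes = [make_node(part) for part in partition(rows, 2)]
print(revenue_per_url(nodes))
```

Each node does the filtering and grouping with its own query engine and indexes, so the coordinator only moves small partial aggregates rather than raw rows.
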
HadoopDB Architecture

TPC-H Benchmark Results

Fault Tolerance and Cluster Heterogeneity Results

HadoopDB: Current Status
  • Initial open source release over a year ago
  • A bunch of new code since then, but not yet put up online
      • This new code is available by request
  • Expect the next release to be in mid-2011
  • Money available for people who want to help with development (e-mail justin.borgman@yale.edu)

Invisible Loading
  • Data starts in HDFS
  • Data is immediately available for processing (immediate gratification paradigm)
  • Each MapReduce job causes data movement from HDFS to database systems
  • Data is incrementally loaded, sorted, and indexed
  • Query performance improves “invisibly”

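
A minimal sketch of the incremental-loading idea, under stated assumptions: a Python list stands in for a file in HDFS, an in-memory SQLite database stands in for the node-local database, and the class name, table layout, and `chunk_size` knob are all hypothetical. Every job answers its query immediately and, as a side effect, migrates one more chunk of raw rows into the indexed database. Sorting is omitted here.

```python
import sqlite3


class InvisibleLoader:
    def __init__(self, raw_rows, chunk_size):
        self.raw_rows = raw_rows      # stand-in for a file in HDFS
        self.chunk_size = chunk_size  # rows migrated per job (assumed knob)
        self.loaded = 0               # rows already moved into the database
        self.db = sqlite3.connect(":memory:")
        self.db.execute("CREATE TABLE t (k TEXT, v INTEGER)")
        self.db.execute("CREATE INDEX t_k ON t (k)")

    def sum_for_key(self, key):
        """Run one job: return (answer, number of raw rows still scanned)."""
        # Side effect: migrate the next chunk from raw storage into the database.
        chunk = self.raw_rows[self.loaded:self.loaded + self.chunk_size]
        self.db.executemany("INSERT INTO t VALUES (?, ?)", chunk)
        self.loaded += len(chunk)
        # Loaded portion: indexed lookup in the database.
        (db_sum,) = self.db.execute(
            "SELECT COALESCE(SUM(v), 0) FROM t WHERE k = ?", (key,)
        ).fetchone()
        # Remaining portion: full scan of the raw rows, as plain MapReduce would.
        remaining = self.raw_rows[self.loaded:]
        raw_sum = sum(v for k, v in remaining if k == key)
        return db_sum + raw_sum, len(remaining)


# Made-up rows, for illustration only.
rows = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5), ("b", 6)]
loader = InvisibleLoader(rows, chunk_size=2)
for _ in range(3):
    print(loader.sum_for_key("a"))  # (9, 4), then (9, 2), then (9, 0)
```

The answer is the same on every run, but the number of raw rows scanned shrinks with each job, which is the sense in which performance improves without an explicit load step.
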
Conclusions
  • MapReduce and parallel databases are definitely complementary
  • MapReduce and parallel databases are definitely competitive
  • HadoopDB is awesome


Daniel Abadi HadoopWorld 2010
