HadoopDB: An open source hybrid of MapReduce
           and DBMS technologies

                Azza Abouzeid, Kamil Bajda-Pawlikowski
                   Daniel J. Abadi, Avi Silberschatz

                                 Yale University
                         http://hadoopdb.sourceforge.net


                               October 2, 2009


HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for
Analytical Workloads. Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel J. Abadi, Avi
Silberschatz, Alex Rasin. In Proceedings of VLDB, 2009.
Introduction Candidates Differences HadoopDB Evaluation Conclusion
                                                         Motivation


Major Trends



         1   Data explosion:
                    Automation of business processes, proliferation of digital
                    devices.
                    eBay has a 6.5 PB warehouse, Yahoo! Everest has 10 PB.
         2   Analysis over raw data




                      Yale University, HadoopWorld 2009   HadoopDB               2/24
      Bottom line
      Analyzing massive structured data on 1000s of shared-nothing
      nodes.






Sales Record Example


      Consider a large data set of sales log records, each consisting of
      sales information including:
         1   a date of sale
         2   a price
      We would like to take the log records and generate a report
      showing the total sales for each year.
      Question:
      How do we generate this report efficiently and cheaply over massive
      data contained in a shared-nothing cluster of 1000s of machines?






MapReduce (Hadoop)


      MapReduce is a programming model which specifies:
             A map function that processes a key/value pair to generate a
             set of intermediate key/value pairs,
             A reduce function that merges all intermediate values
             associated with the same intermediate key.
      Hadoop
             is a MapReduce implementation for processing large data sets
             over 1000s of nodes.
             Maps (and Reduces) run independently of each other over
             blocks of data distributed across a cluster.





Sales Record Example using Hadoop



      Query: Calculate total sales for each year.

      We write a MapReduce program:
             Map: Takes log records and extracts a key-value pair of year
             and sale price in dollars. Outputs the key-value pairs.
             Shuffle: Hadoop automatically partitions the key-value pairs
             by year to the nodes executing the Reduce function
             Reduce: Simply sums up all the dollar values for a year.
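
The three phases above can be sketched in a single process as follows. This is a minimal illustration, not Hadoop code: the record format (`YYYY-MM-DD,price`), the sample records, and the function names are assumptions made for the example, and the shuffle is imitated with in-memory buckets.

```python
from collections import defaultdict

# Hypothetical log format "YYYY-MM-DD,price"; the slides do not specify one.
LOG_RECORDS = [
    "2007-03-14,19.99",
    "2008-01-02,5.00",
    "2008-11-30,45.50",
    "2009-06-21,12.25",
    "2009-07-04,7.75",
]


def map_record(record):
    """Map: extract a (year, price) pair from one log record."""
    date, price = record.split(",")
    return int(date[:4]), float(price)


def shuffle(pairs, num_reducers):
    """Shuffle: route every pair to the reducer chosen by its key (the year)."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for year, price in pairs:
        buckets[year % num_reducers][year].append(price)
    return buckets


def reduce_year(year, prices):
    """Reduce: sum all prices that share the same year."""
    return year, sum(prices)


def total_sales_per_year(records, num_reducers=2):
    """Run map, shuffle and reduce in sequence and collect the totals."""
    pairs = [map_record(r) for r in records]
    totals = {}
    for bucket in shuffle(pairs, num_reducers):
        for year, prices in bucket.items():
            key, total = reduce_year(year, prices)
            totals[key] = total
    return totals


if __name__ == "__main__":
    for year, total in sorted(total_sales_per_year(LOG_RECORDS).items()):
        print(year, total)
```

Because all pairs for a given year land in the same bucket, each reducer can sum its years without talking to any other reducer.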






Relational Databases

      If the data is stored in a relational database system, the sales
      record example can be expressed in SQL as:

      SELECT YEAR(date) AS year, SUM(price)
      FROM sales
      GROUP BY year

      The execution plan is:

      projection(year,price) → hash aggregation(year,price) .

      Question:
      How do we process this efficiently if the data is very large?




Parallel Databases

      Parallel Databases are like single-node databases except:
             Data is partitioned across nodes
             Individual relational operations can be executed in parallel
      SELECT YEAR(date) AS year, SUM(price)
      FROM sales GROUP BY year

      Execution plan for the query:
      projection(year,price) → partial hash aggregation(year,price) →
      partitioning(year) → final aggregation(year,price) .

      Note that the execution plan resembles the map and reduce phases
      of Hadoop.
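
The four plan steps can be traced with a small simulation. This is a sketch under stated assumptions: the partitions, the sample rows, and the function names are hypothetical, and plain Python lists stand in for nodes and for the network exchange between them.

```python
from collections import defaultdict

# Hypothetical data: each inner list is the slice of `sales` held by one node.
PARTITIONS = [
    [("2008-01-02", 5.0), ("2009-06-21", 12.25)],
    [("2008-11-30", 45.5), ("2009-07-04", 7.75)],
    [("2007-03-14", 19.99)],
]


def project(rows):
    """projection(year, price): keep only the year and the price."""
    return [(int(date[:4]), price) for date, price in rows]


def hash_aggregate(pairs):
    """Hash aggregation: sum prices per year in a hash table."""
    sums = defaultdict(float)
    for year, price in pairs:
        sums[year] += price
    return dict(sums)


def repartition(partials, num_nodes):
    """partitioning(year): send each partial sum to the node that owns the year."""
    inboxes = [[] for _ in range(num_nodes)]
    for partial in partials:
        for year, subtotal in partial.items():
            inboxes[year % num_nodes].append((year, subtotal))
    return inboxes


def run_plan(partitions):
    """projection -> partial aggregation -> partitioning -> final aggregation."""
    partials = [hash_aggregate(project(rows)) for rows in partitions]
    result = {}
    for inbox in repartition(partials, len(partitions)):
        result.update(hash_aggregate(inbox))  # final aggregation on each node
    return result


if __name__ == "__main__":
    print(sorted(run_plan(PARTITIONS).items()))
```

The partial aggregation plays the role of a map-side combine, and the partitioning step followed by the final aggregation corresponds to the shuffle and reduce phases.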



Differences between Parallel Databases and Hadoop






To summarize




At Yale, we looked beyond the differences ...
and we discovered ...
 ... that they complete each other

      Basic design idea
      Multiple, independent, single-node databases coordinated by Hadoop.

 http://i214.photobucket.com/albums/cc19/brittanybutton/elephants.jpg


Hadoop Basics






Architecture






SQL-MR-SQL




      SELECT YEAR(saleDate), SUM(revenue) FROM sales GROUP BY YEAR(saleDate);
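
One reading of the SQL-MR-SQL idea for this query, consistent with the basic design of single-node databases coordinated by Hadoop, is that each node's database runs the aggregation locally and a reduce step merges the partial sums. The sketch below illustrates that reading only. It is not HadoopDB code: in-memory `sqlite3` databases stand in for the node databases, `strftime` replaces `YEAR()` because SQLite has no such function, and the sample rows and function names are assumptions.

```python
import sqlite3

# Local query pushed into each node's database (SQLite dialect, assumed).
LOCAL_SQL = """
    SELECT CAST(strftime('%Y', saleDate) AS INTEGER), SUM(revenue)
    FROM sales
    GROUP BY strftime('%Y', saleDate)
"""


def make_node(rows):
    """Create one stand-in single-node database holding a slice of `sales`."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (saleDate TEXT, revenue REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn


def map_phase(conn):
    """Map: run the SQL inside the node's database, emit (year, partial sum)."""
    return conn.execute(LOCAL_SQL).fetchall()


def reduce_phase(partials):
    """Reduce: merge the partial sums that arrive for each year."""
    totals = {}
    for year, subtotal in partials:
        totals[year] = totals.get(year, 0.0) + subtotal
    return totals


def run_query(nodes):
    partials = [pair for conn in nodes for pair in map_phase(conn)]
    return reduce_phase(partials)


if __name__ == "__main__":
    nodes = [
        make_node([("2008-01-02", 5.0), ("2009-06-21", 12.25)]),
        make_node([("2008-11-30", 45.5), ("2009-07-04", 7.75),
                   ("2007-03-14", 19.99)]),
    ]
    print(sorted(run_query(nodes).items()))
```

Pushing the GROUP BY into the databases means only one row per year and node crosses the network, instead of every sales record.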



Evaluating HadoopDB


      Compare HadoopDB to
         1   Hadoop
         2   Parallel databases (Vertica, DBMS-X)
      Features:
         1   Performance:
                    We expected HadoopDB to approach the performance of
                    parallel databases
         2   Scalability:
                    We expected HadoopDB to scale as well as Hadoop
      We ran the Pavlo et al. SIGMOD’09 benchmark on Amazon EC2
      clusters of 10, 50, 100 nodes.





Load



      [Figure: load times in seconds for Vertica, DB-X, HadoopDB and Hadoop
      on 10, 50 and 100 nodes. Left panel: Random Unstructured Data
      (535MB/node). Right panel: Structured data (20GB/node), in thousands
      of seconds.]





Performance: Grep Task


      [Figure: Grep task running time in seconds for Vertica, DB-X, HadoopDB
      and Hadoop on 10, 50 and 100 nodes.]

         1   Full table scan, highly selective filter
         2   Random data, no room for indexing
         3   Hadoop overhead outweighs query processing time in
             single-node databases

      SELECT * FROM grep WHERE field LIKE '%xyz%';






Performance: Join Task

      [Figure: Join task running time in seconds for Vertica, DB-X, HadoopDB
      and Hadoop on 10, 50 and 100 nodes.]

         1   No full table scan due to clustered indexing
         2   Hash partitioning and efficient join algorithm

   SELECT sourceIP, AVG(pageRank), SUM(adRevenue)
   FROM rankings, uservisits
   WHERE pageURL=destURL
   AND visitDate BETWEEN '2000-01-15' AND '2000-01-22'
   GROUP BY sourceIP
   ORDER BY SUM(adRevenue) DESC LIMIT 1;




Performance: Bottom Line




         1   Unstructured data
                    HadoopDB’s performance matches Hadoop
         2   Structured data
                    HadoopDB’s performance is close to parallel databases






Scalability: Setup




         1   Simple aggregation task - full table scan
         2   Data replicated across 10 nodes
         3   Fault-tolerance: Kill a node halfway
         4   Fluctuation-tolerance: Slow down a node for the entire
             experiment
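
Why a single slowed-down node matters can be seen in a toy model that compares a fixed up-front split of the work with runtime scheduling of small chunks. This is an illustrative sketch only, not the experiment from the talk: the work units, node speeds, chunk size, and function names are all assumptions.

```python
import heapq


def static_makespan(total_work, speeds):
    """Equal shares assigned up front: the query ends when the slowest node does."""
    share = total_work / len(speeds)
    return max(share / speed for speed in speeds)


def chunked_makespan(total_work, speeds, chunk):
    """Runtime scheduling: whichever node is free first pulls the next chunk."""
    # Heap of (time at which the node becomes free, node index).
    free_at = [(0.0, node) for node in range(len(speeds))]
    heapq.heapify(free_at)
    remaining = total_work
    finish = 0.0
    while remaining > 0:
        now, node = heapq.heappop(free_at)
        work = min(chunk, remaining)
        remaining -= work
        done = now + work / speeds[node]
        finish = max(finish, done)
        heapq.heappush(free_at, (done, node))
    return finish


if __name__ == "__main__":
    healthy = [1.0, 1.0, 1.0, 1.0]
    one_slow = [1.0, 1.0, 1.0, 0.5]  # last node runs at half speed throughout
    baseline = static_makespan(120, healthy)
    for name, time in [
        ("static split", static_makespan(120, one_slow)),
        ("chunked", chunked_makespan(120, one_slow, chunk=1)),
    ]:
        slowdown = 100 * (time - baseline) / baseline
        print(f"{name}: {time} time units, {slowdown:.1f}% slowdown")
```

With these assumed numbers the static split takes 60 time units against a healthy baseline of 30, while the chunked schedule takes 35, because the fast nodes absorb most of the slow node's share.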






Scalability: Results



      [Figure: percentage slowdown of Vertica, HadoopDB and Hadoop in the
      fault-tolerance and fluctuation-tolerance experiments.]

         1   HadoopDB and Hadoop take advantage of runtime scheduling by
             splitting data into chunks or blocks
         2   Parallel databases restart entire query on node failure or
             wait for the slowest node






To summarize


      HadoopDB ...
         1   is a hybrid of DBMS and MapReduce
         2   scales better than commercial parallel databases
         3   is as fault-tolerant as Hadoop
         4   approaches the performance of parallel databases
         5   is free and open-source

                          http://hadoopdb.sourceforge.net






Future work

      Engineering work:
         1   Full SQL support in SMS
         2   Data compression
         3   Integration with other open source databases
         4   Full automation of the loading and replication process
         5   Out-of-the-box deployment
         6   We’re hiring!
      Research work:
             Incremental loading and on-the-fly repartitioning
             Dynamically adjusting fault-tolerance levels based on failure
             rate


Thank You ...




     We welcome all thoughts on how to raise HadoopDB ...
                http://www.jpbutler.com/thailand/images/elephant-8-days-old.jpg

More Related Content

PPTX
Intro to Hadoop
PPTX
Hive vs Hbase, a Friendly Competition
PDF
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012
PDF
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
PDF
Integration of HIve and HBase
PPTX
Understanding hdfs
PDF
Distributed Data Analysis with Hadoop and R - OSCON 2011
PPT
Where does hadoop come handy
Intro to Hadoop
Hive vs Hbase, a Friendly Competition
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Integration of HIve and HBase
Understanding hdfs
Distributed Data Analysis with Hadoop and R - OSCON 2011
Where does hadoop come handy

What's hot (20)

PPTX
Hadoop Presentation
PDF
Using Hadoop and Hive to Optimize Travel Search , WindyCityDB 2010
PPT
Hive @ Hadoop day seattle_2010
PPTX
WaterlooHiveTalk
PDF
SQL on Hadoop: Defining the New Generation of Analytic SQL Databases
PPT
Hadoop a Natural Choice for Data Intensive Log Processing
PPSX
ODP
Hadoop demo ppt
PPTX
Big data processing with apache spark part1
PDF
An Introduction to the World of Hadoop
PDF
Distributed Data Analysis with Hadoop and R - Strangeloop 2011
KEY
Flexible In-Situ Indexing for Hadoop via Elephant Twin
PDF
HugeTable:Application-Oriented Structure Data Storage System
PPTX
Overview of Big data, Hadoop and Microsoft BI - version1
PDF
Hadoop Ecosystem | Big Data Analytics Tools | Hadoop Tutorial | Edureka
PPT
Introduction to Apache hadoop
KEY
Hadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at Twitter
PDF
Extending the Data Warehouse with Hadoop - Hadoop world 2011
PPTX
Big Data Concepts
PDF
Extending the EDW with Hadoop - Chicago Data Summit 2011
Hadoop Presentation
Using Hadoop and Hive to Optimize Travel Search , WindyCityDB 2010
Hive @ Hadoop day seattle_2010
WaterlooHiveTalk
SQL on Hadoop: Defining the New Generation of Analytic SQL Databases
Hadoop a Natural Choice for Data Intensive Log Processing
Hadoop demo ppt
Big data processing with apache spark part1
An Introduction to the World of Hadoop
Distributed Data Analysis with Hadoop and R - Strangeloop 2011
Flexible In-Situ Indexing for Hadoop via Elephant Twin
HugeTable:Application-Oriented Structure Data Storage System
Overview of Big data, Hadoop and Microsoft BI - version1
Hadoop Ecosystem | Big Data Analytics Tools | Hadoop Tutorial | Edureka
Introduction to Apache hadoop
Hadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at Twitter
Extending the Data Warehouse with Hadoop - Hadoop world 2011
Big Data Concepts
Extending the EDW with Hadoop - Chicago Data Summit 2011
Ad

Similar to Hw09 Hadoop Db (20)

PDF
Improving MySQL performance with Hadoop
PPTX
Hadoop and Big Data: Revealed
PPTX
Big data ppt
PDF
Hadoop Ecosystem
PDF
Big data and hadoop
PDF
What is Apache Hadoop and its ecosystem?
ODP
Hadoop @ Sara & BiG Grid
DOCX
Hadoop online training by certified trainer
PPTX
Apache hadoop introduction and architecture
PDF
The Forrester Wave Enterprise Hadoop Solutions Q1 2012
PPTX
Hybrid Data Warehouse Hadoop Implementations
PPTX
Hadoop: An Industry Perspective
PPTX
Big-Data Hadoop Tutorials - MindScripts Technologies, Pune
DOCX
Hadoop Tutorial for Beginners
PPTX
Hadoop and mysql by Chris Schneider
PDF
Hadoop - A Very Short Introduction
PPT
Big Data Analytics 2014
PDF
Keynote from ApacheCon NA 2011
PPTX
מצגת כנס מנתחי מערכות
DOCX
Hadoop online training
Improving MySQL performance with Hadoop
Hadoop and Big Data: Revealed
Big data ppt
Hadoop Ecosystem
Big data and hadoop
What is Apache Hadoop and its ecosystem?
Hadoop @ Sara & BiG Grid
Hadoop online training by certified trainer
Apache hadoop introduction and architecture
The Forrester Wave Enterprise Hadoop Solutions Q1 2012
Hybrid Data Warehouse Hadoop Implementations
Hadoop: An Industry Perspective
Big-Data Hadoop Tutorials - MindScripts Technologies, Pune
Hadoop Tutorial for Beginners
Hadoop and mysql by Chris Schneider
Hadoop - A Very Short Introduction
Big Data Analytics 2014
Keynote from ApacheCon NA 2011
מצגת כנס מנתחי מערכות
Hadoop online training
Ad

More from Cloudera, Inc. (20)

PPTX
Partner Briefing_January 25 (FINAL).pptx
PPTX
Cloudera Data Impact Awards 2021 - Finalists
PPTX
2020 Cloudera Data Impact Awards Finalists
PPTX
Edc event vienna presentation 1 oct 2019
PPTX
Machine Learning with Limited Labeled Data 4/3/19
PPTX
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
PPTX
Introducing Cloudera DataFlow (CDF) 2.13.19
PPTX
Introducing Cloudera Data Science Workbench for HDP 2.12.19
PPTX
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
PPTX
Leveraging the cloud for analytics and machine learning 1.29.19
PPTX
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
PPTX
Leveraging the Cloud for Big Data Analytics 12.11.18
PPTX
Modern Data Warehouse Fundamentals Part 3
PPTX
Modern Data Warehouse Fundamentals Part 2
PPTX
Modern Data Warehouse Fundamentals Part 1
PPTX
Extending Cloudera SDX beyond the Platform
PPTX
Federated Learning: ML with Privacy on the Edge 11.15.18
PPTX
Analyst Webinar: Doing a 180 on Customer 360
PPTX
Build a modern platform for anti-money laundering 9.19.18
PPTX
Introducing the data science sandbox as a service 8.30.18
Partner Briefing_January 25 (FINAL).pptx
Cloudera Data Impact Awards 2021 - Finalists
2020 Cloudera Data Impact Awards Finalists
Edc event vienna presentation 1 oct 2019
Machine Learning with Limited Labeled Data 4/3/19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Leveraging the cloud for analytics and machine learning 1.29.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Leveraging the Cloud for Big Data Analytics 12.11.18
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 1
Extending Cloudera SDX beyond the Platform
Federated Learning: ML with Privacy on the Edge 11.15.18
Analyst Webinar: Doing a 180 on Customer 360
Build a modern platform for anti-money laundering 9.19.18
Introducing the data science sandbox as a service 8.30.18

Recently uploaded (20)

PDF
STATICS OF THE RIGID BODIES Hibbelers.pdf
PDF
01-Introduction-to-Information-Management.pdf
PDF
Mark Klimek Lecture Notes_240423 revision books _173037.pdf
PDF
Basic Mud Logging Guide for educational purpose
PPTX
Cell Structure & Organelles in detailed.
PDF
The Lost Whites of Pakistan by Jahanzaib Mughal.pdf
PDF
TR - Agricultural Crops Production NC III.pdf
PPTX
Week 4 Term 3 Study Techniques revisited.pptx
PPTX
Pharma ospi slides which help in ospi learning
PPTX
Introduction_to_Human_Anatomy_and_Physiology_for_B.Pharm.pptx
PDF
RMMM.pdf make it easy to upload and study
PPTX
Renaissance Architecture: A Journey from Faith to Humanism
PDF
VCE English Exam - Section C Student Revision Booklet
PDF
Pre independence Education in Inndia.pdf
PDF
O5-L3 Freight Transport Ops (International) V1.pdf
PPTX
Cell Types and Its function , kingdom of life
PDF
Classroom Observation Tools for Teachers
PDF
Microbial disease of the cardiovascular and lymphatic systems
PPTX
PPH.pptx obstetrics and gynecology in nursing
PDF
Supply Chain Operations Speaking Notes -ICLT Program
STATICS OF THE RIGID BODIES Hibbelers.pdf
01-Introduction-to-Information-Management.pdf
Mark Klimek Lecture Notes_240423 revision books _173037.pdf
Basic Mud Logging Guide for educational purpose
Cell Structure & Organelles in detailed.
The Lost Whites of Pakistan by Jahanzaib Mughal.pdf
TR - Agricultural Crops Production NC III.pdf
Week 4 Term 3 Study Techniques revisited.pptx
Pharma ospi slides which help in ospi learning
Introduction_to_Human_Anatomy_and_Physiology_for_B.Pharm.pptx
RMMM.pdf make it easy to upload and study
Renaissance Architecture: A Journey from Faith to Humanism
VCE English Exam - Section C Student Revision Booklet
Pre independence Education in Inndia.pdf
O5-L3 Freight Transport Ops (International) V1.pdf
Cell Types and Its function , kingdom of life
Classroom Observation Tools for Teachers
Microbial disease of the cardiovascular and lymphatic systems
PPH.pptx obstetrics and gynecology in nursing
Supply Chain Operations Speaking Notes -ICLT Program

Hw09 Hadoop Db

  • 1. HadoopDB: An open source hybrid of MapReduce and DBMS technologies Azza Abouzeid, Kamil Bajda-Pawlikowski Daniel J. Abadi, Avi Silberschatz Yale University http://hadoopdb.sourceforge.net October 2, 2009 HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads. Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel J. Abadi, Avi Silberschatz, Alex Rasin. In Proceedings of VLDB, 2009.
  • 2. Introduction Candidates Differences HadoopDB Evaluation Conclusion Motivation Major Trends 1 Data explosion: Automation of business processes, proliferation of digital devices. eBay has a 6.5 PB warehouse, Yahoo! Everest has 10 PB. 2 Analysis over raw data Yale University, HadoopWorld 2009 HadoopDB 2/24
  • 3. Introduction Candidates Differences HadoopDB Evaluation Conclusion Motivation Major Trends 1 Data explosion: Automation of business processes, proliferation of digital devices. eBay has a 6.5 PB warehouse, Yahoo! Everest has 10 PB. 2 Analysis over raw data Bottom line Analyzing massive structured data on 1000s of shared-nothing nodes. Yale University, HadoopWorld 2009 HadoopDB 2/24
  • 4. Introduction Candidates Differences HadoopDB Evaluation Conclusion Motivation Sales Record Example Consider a large data set of sales log records, each consisting of sales information including: 1 a date of sale 2 a price We would like to take the log records and generate a report showing the total sales for each year. Question: How do we generate this report efficiently and cheaply over massive data contained in a shared-nothing cluster of 1000s of machines? Yale University, HadoopWorld 2009 HadoopDB 3/24
  • 5. Introduction Candidates Differences HadoopDB Evaluation Conclusion MapReduce Parallel Databases MapReduce (Hadoop) MapReduce is a programming model which specifies: A map function that processes a key/value pair to generate a set of intermediate key/value pairs, A reduce function that merges all intermediate values associated with the same intermediate key. Hadoop is a MapReduce implementation for processing large data sets over 1000s of nodes. Maps (and Reduces) run independently of each other over blocks of data distributed across a cluster. Yale University, HadoopWorld 2009 HadoopDB 4/24
  • 6. Introduction Candidates Differences HadoopDB Evaluation Conclusion MapReduce Parallel Databases Sales Record Example using Hadoop Query: Calculate total sales for each year. We write a MapReduce program: Map: Takes log records and extracts a key-value pair of year and sale price in dollars. Outputs the key-value pairs. Shuffle: Hadoop automatically partitions the key-value pairs by year to the nodes executing the Reduce function Reduce: Simply sums up all the dollar values for a year. Yale University, HadoopWorld 2009 HadoopDB 5/24
  • 7. Introduction Candidates Differences HadoopDB Evaluation Conclusion MapReduce Parallel Databases Relational Databases Suppose that the data is stored in a relational database system, the sales record example could be expressed in SQL as: SELECT YEAR(date) AS year, SUM(price) FROM sales GROUP BY year The execution plan is: projection(year,price) → hash aggregation(year,price) . Question: How do we process this efficiently if the data is very large? Yale University, HadoopWorld 2009 HadoopDB 6/24
  • 8. Introduction Candidates Differences HadoopDB Evaluation Conclusion MapReduce Parallel Databases Parallel Databases Parallel Databases are like single-node databases except: Data is partitioned across nodes Individual relational operations can be executed in parallel xxx SELECT YEAR(date) AS year, SUM(price) FROM sales GROUP BY year Execution plan for the query: projection(year,price) → partial hash aggregation(year,price) → partitioning(year) → final aggregation(year,price) . Note that the execution plan resembles the map and reduce phases of Hadoop. Yale University, HadoopWorld 2009 HadoopDB 7/24
  • 9. Introduction Candidates Differences HadoopDB Evaluation Conclusion Differences between Parallel Databases and Hadoop Yale University, HadoopWorld 2009 HadoopDB 8/24
  • 10. Introduction Candidates Differences HadoopDB Evaluation Conclusion Differences between Parallel Databases and Hadoop Yale University, HadoopWorld 2009 HadoopDB 8/24
  • 11. Introduction Candidates Differences HadoopDB Evaluation Conclusion To summarize Yale University, HadoopWorld 2009 HadoopDB 9/24
  • 12. At Yale, we looked beyond the differences ...
  • 13. At Yale, we looked beyond the differences ...
  • 14. and we discovered ... Basic design idea Multiple, independent, single node databases coordinated by ... that they complete each other Hadoop. http://i214.photobucket.com/albums/cc19/brittanybutton/elephants.jpg
  • 15. Introduction Candidates Differences HadoopDB Evaluation Conclusion Background Architecture SMS Hadoop Basics Yale University, HadoopWorld 2009 HadoopDB 12/24
  • 16. Architecture
  • 17. SQL-MR-SQL: SELECT YEAR(saleDate), SUM(revenue) FROM sales GROUP BY YEAR(saleDate);
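As a rough illustration of the basic design idea (independent single-node databases coordinated by a MapReduce layer), the query on this slide can be split into a per-node SQL step and a merge step. Everything here is an assumption made for the sketch: `sqlite3` stands in for the single-node databases, plain Python functions (`map_phase`, `reduce_phase`) stand in for Hadoop, SQLite's `strftime('%Y', ...)` replaces `YEAR(...)`, and the sample rows are invented. It is not HadoopDB's actual planner output.

```python
import sqlite3
from collections import defaultdict

# Per-node SQL: the aggregation is pushed into each single-node database.
NODE_QUERY = (
    "SELECT strftime('%Y', saleDate), SUM(revenue) "
    "FROM sales GROUP BY strftime('%Y', saleDate)"
)

def make_node(rows):
    """Create one in-memory database holding this node's partition of sales."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (saleDate TEXT, revenue REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn

def map_phase(conn):
    """Map step: run the SQL locally and emit (year, partial_sum) pairs."""
    return conn.execute(NODE_QUERY).fetchall()

def reduce_phase(pairs):
    """Reduce step: sum the partial results for each year across nodes."""
    totals = defaultdict(float)
    for year, partial_sum in pairs:
        totals[year] += partial_sum
    return dict(totals)

# Hypothetical data spread over two nodes.
nodes = [
    make_node([("2008-03-01", 10.0), ("2009-01-15", 5.0)]),
    make_node([("2008-07-04", 2.5), ("2009-06-30", 7.5), ("2009-12-31", 1.0)]),
]
partials = [pair for conn in nodes for pair in map_phase(conn)]
result = reduce_phase(partials)
```

Pushing the `GROUP BY` into each node's database means only one row per year per node crosses the network, rather than every sales record.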
  • 18. Evaluating HadoopDB: Compare HadoopDB to (1) Hadoop and (2) parallel databases (Vertica, DBMS-X). Features: (1) Performance: we expected HadoopDB to approach the performance of parallel databases. (2) Scalability: we expected HadoopDB to scale as well as Hadoop. We ran the Pavlo et al. SIGMOD'09 benchmark on Amazon EC2 clusters of 10, 50, and 100 nodes.
  • 19. Load: [Figure: load times in seconds for Vertica, DB-X, HadoopDB, and Hadoop on 10, 50, and 100 nodes. Two panels: random unstructured data (535MB/node) and structured data (20GB/node).]
  • 20. Performance: Grep Task. [Figure: query time in seconds for Vertica, DB-X, HadoopDB, and Hadoop on 10, 50, and 100 nodes.] (1) Full table scan, highly selective filter. (2) Random data, no room for indexing. (3) Hadoop overhead outweighs query processing time in single-node databases. SELECT * FROM grep WHERE field LIKE '%xyz%';
  • 21. Performance: Join Task. [Figure: query time in seconds for Vertica, DB-X, HadoopDB, and Hadoop on 10, 50, and 100 nodes.] (1) No full table scan due to clustered indexing. (2) Hash partitioning and efficient join algorithm. SELECT sourceIP, AVG(pageRank), SUM(adRevenue) FROM rankings, uservisits WHERE pageURL = destURL AND visitDate BETWEEN '2000-1-15' AND '2000-1-22' GROUP BY sourceIP ORDER BY SUM(adRevenue) DESC LIMIT 1;
  • 22. Performance: Bottom Line. (1) Unstructured data: HadoopDB's performance matches Hadoop's. (2) Structured data: HadoopDB's performance is close to that of parallel databases.
  • 23. Scalability: Setup. (1) Simple aggregation task with a full table scan. (2) Data replicated across 10 nodes. (3) Fault-tolerance: kill a node halfway. (4) Fluctuation-tolerance: slow down a node for the entire experiment.
  • 24. Scalability: Results. [Figure: percentage slowdown for Vertica, HadoopDB, and Hadoop in the fault-tolerance and fluctuation-tolerance experiments.] (1) HadoopDB and Hadoop take advantage of runtime scheduling by splitting data into chunks or blocks. (2) Parallel databases restart the entire query on node failure or wait for the slowest node.
  • 25. To summarize, HadoopDB: (1) is a hybrid of DBMS and MapReduce; (2) scales better than commercial parallel databases; (3) is as fault-tolerant as Hadoop; (4) approaches the performance of parallel databases; (5) is free and open-source. http://hadoopdb.sourceforge.net
  • 26. Future work. Engineering work: (1) full SQL support in SMS; (2) data compression; (3) integration with other open source databases; (4) full automation of the loading and replication process; (5) out-of-the-box deployment; (6) we're hiring! Research work: incremental loading and on-the-fly repartitioning; dynamically adjusting fault-tolerance levels based on failure rate.
  • 27. Thank You ... We welcome all thoughts on how to raise HadoopDB ... http://www.jpbutler.com/thailand/images/elephant-8-days-old.jpg