Daniel Abadi HadoopWorld 2010

MapReduce and Parallel Database Systems: Complementary or Competitive Technology? Daniel Abadi Yale University October 12th, 2010

Brief History of MapReduce Pre-2004: used at Google for many data processing apps, including Web indexing 2004: paper in academic conference not written in traditional academic style 2004-2006: Implemented in Nutch 2006-2008: Split off into Hadoop; significant usage at Yahoo; buzz increases

Controversy Vast majority of the outrage was about the comparison of the systems BUT: The line between MapReduce and Hadoop (which comes with HDFS) was blurring Hadoop can be used as an alternative to traditional DW implementations built using DBMS software

SIGMOD 2009 Paper Benchmarked Hadoop vs. 2 parallel database systems Compared across a variety of dimensions including performance and ease of use Measured differences in load and query time for some common data processing tasks Used Web analytics benchmark whose goal was to be representative of tasks that: both should excel at; Hadoop should excel at; databases should excel at

Hardware Setup 100 node cluster Each node 2.4 GHz Core 2 Duo Processors 4 GB RAM 2 x 250 GB SATA HDs (74 MB/sec sequential I/O) Dual GigE switches, each with 50 nodes 128 Gbit/sec fabric Connected by a 64 Gbit/sec ring

UDF Task DBMS clearly doesn’t scale Calculate PageRank over a set of HTML documents Performed via a UDF

Benchmark Conclusions Hadoop has many advantages Load time much faster Significantly easier to install, use Better parallelization of UDFs Hadoop is consistently less efficient for structured, relational data Reasons both fundamental and non-fundamental Needs better support for compression and direct operation on compressed data Needs better support for indexing Needs better support for co-partitioning of datasets
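The co-partitioning point above can be illustrated with a small sketch. It is not taken from the benchmark or from Hadoop itself: the `users` and `visits` tables, the CRC32 hash, and every function name are assumptions made for illustration. The idea is that when two datasets are partitioned with the same function on the join key, each partition pair can be joined locally without moving rows between nodes.

```python
import zlib

def part_of(key, n_parts):
    """Deterministic hash partitioner (CRC32 is an arbitrary choice here)."""
    return zlib.crc32(str(key).encode()) % n_parts

def co_partition(table, n_parts):
    """Split rows into n_parts lists using the join key in column 0."""
    parts = [[] for _ in range(n_parts)]
    for row in table:
        parts[part_of(row[0], n_parts)].append(row)
    return parts

def local_join(left, right):
    """Hash join of two partitions that live on the same (hypothetical) node."""
    index = {}
    for row in right:
        index.setdefault(row[0], []).append(row)
    return [l + r[1:] for l in left for r in index.get(l[0], [])]

def co_partitioned_join(left_table, right_table, n_parts=4):
    """Join partition i of one table only with partition i of the other."""
    left_parts = co_partition(left_table, n_parts)
    right_parts = co_partition(right_table, n_parts)
    out = []
    for left, right in zip(left_parts, right_parts):
        out.extend(local_join(left, right))
    return out

users = [(1, "ann"), (2, "bob")]
visits = [(1, "a.html"), (2, "b.html"), (1, "c.html")]
joined = sorted(co_partitioned_join(users, visits))
```

Because matching keys always land in the same partition number, no partition needs rows from any other partition to complete its share of the join.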

Overall Conclusion MapReduce/Hadoop and parallel databases are clearly complementary Use MapReduce if you want to do: ETL Unstructured data processing Deep analysis that is hard to express in SQL Use parallel databases for: Traditional data warehousing / data marts Structured data processing expressible in SQL Cloudera agrees!

We’re all in agreement, right?

But Wait! Hadoop can do everything a parallel database can do Hadoop has (something resembling) a SQL interface (Hive) Many of Hadoop’s performance deficiencies not fundamental Result of initial design for unstructured data Over 20 research papers in the last two years on improving Hadoop performance for DBMS workloads Hadoop is free and open source (Oracle, IBM/Netezza, Microsoft, Teradata, Vertica, Greenplum, and Aster Data are all proprietary)

People are using Hadoop as a DW Facebook has 12PB data warehouse in Hadoop/Hive Adding 10TB per day Yahoo’s warehouse is the same order of magnitude Recently switched to Hadoop

Fault Tolerance and Cluster Heterogeneity Results Database systems restart entire query upon a single node failure, and do not adapt if a node is running slowly
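A toy model of the difference described above, under strong simplifying assumptions: tasks run one after another rather than in parallel, each listed task fails exactly once, and "work" is counted as the number of task executions. The function names are hypothetical and neither function reflects the internals of a real system.

```python
def run_with_task_retry(tasks, fails_once):
    """MapReduce-style recovery: only the failed task is re-executed."""
    pending = set(fails_once)
    executions = 0
    results = []
    for t in tasks:
        while True:
            executions += 1
            if t in pending:
                pending.discard(t)   # the task fails once, then is retried
                continue
            results.append(t * t)    # stand-in for the task's real work
            break
    return results, executions

def run_with_query_restart(tasks, fails_once):
    """Restart-style recovery: any failure discards all work done so far."""
    pending = set(fails_once)
    executions = 0
    while True:
        results = []
        failed = False
        for t in tasks:
            executions += 1
            if t in pending:
                pending.discard(t)
                failed = True
                break                # abandon this attempt entirely
            results.append(t * t)
        if not failed:
            return results, executions

tasks = list(range(10))
retry_results, retry_cost = run_with_task_retry(tasks, {7})
restart_results, restart_cost = run_with_query_restart(tasks, {7})
```

Both strategies return the same answer; in this model the restart strategy repeats all work completed before the failure, so the gap widens as queries get longer or failures become more frequent.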

So … Hadoop can do everything that parallel databases can do, but: Has better fault tolerance Adjusts better to runtime performance fluctuations Is more open / cheaper Has at least as good scalability (if not better) If only we could fix those performance problems on structured data HadoopDB!

HadoopDB Use Hadoop to coordinate execution of multiple independent (typically single node, open source) database systems Flexible query interface (accepts both SQL and MapReduce) Open source (built using open source components)
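A minimal sketch of the architecture described on this slide, not HadoopDB's actual code. In-memory `sqlite3` databases stand in for the independent single-node database systems, and plain Python functions stand in for Hadoop's map and reduce tasks. The `visits` table, its columns, the round-robin partitioning, and all function names are assumptions made for illustration.

```python
import sqlite3
from collections import defaultdict

def make_node(rows):
    """Create one independent single-node database holding one partition."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE visits (url TEXT, revenue REAL)")
    conn.executemany("INSERT INTO visits VALUES (?, ?)", rows)
    return conn

def map_task(node):
    """'Map' side: push the SQL down to the local database."""
    sql = "SELECT url, SUM(revenue) FROM visits GROUP BY url"
    return node.execute(sql).fetchall()

def reduce_task(partials):
    """'Reduce' side: merge the partial aggregates from every node."""
    totals = defaultdict(float)
    for url, partial_sum in partials:
        totals[url] += partial_sum
    return dict(totals)

def run_query(rows, n_nodes=3):
    """Coordinator: partition round-robin, run map tasks, then reduce."""
    nodes = [make_node(rows[i::n_nodes]) for i in range(n_nodes)]
    partials = [pair for node in nodes for pair in map_task(node)]
    return reduce_task(partials)

rows = [("a", 1.0), ("b", 2.0), ("a", 3.0), ("c", 4.0), ("b", 5.0)]
totals = run_query(rows)
```

Each database does the filtering and grouping it is good at, while the coordinating layer only schedules the per-node work and combines the partial results.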

Fault Tolerance and Cluster Heterogeneity Results

HadoopDB: Current Status Initial open source release over a year ago A bunch of new code since then, but not yet put up online This new code is available by request Expect the next release to be in mid-2011 Money available for people who want to help with development (e-mail justin.borgman@yale.edu)

Invisible Loading Data starts in HDFS Data is immediately available for processing (immediate gratification paradigm) Each MapReduce job causes data movement from HDFS to database systems Data is incrementally loaded, sorted, and indexed Query performance improves “invisibly”
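A rough sketch of the invisible loading idea, under stated assumptions: a Python list stands in for a file in HDFS, an in-memory `sqlite3` database stands in for the local database system, and each job migrates one fixed-size chunk as a side effect. The class name, table layout, and chunk policy are hypothetical, and the sorting step mentioned on the slide is omitted.

```python
import sqlite3

class InvisibleLoader:
    def __init__(self, raw_records, chunk_size):
        self.raw = raw_records     # stand-in for data sitting in HDFS
        self.chunk_size = chunk_size
        self.loaded = 0            # records [0, loaded) are already in the DB
        self.db = sqlite3.connect(":memory:")
        self.db.execute("CREATE TABLE t (k INTEGER, v INTEGER)")
        self.db.execute("CREATE INDEX idx_k ON t (k)")

    def run_job(self, key):
        """Sum v where k == key; load one more chunk as a side effect."""
        # Already-loaded part: answered by the indexed database.
        (db_sum,) = self.db.execute(
            "SELECT COALESCE(SUM(v), 0) FROM t WHERE k = ?", (key,)
        ).fetchone()
        # Not-yet-loaded part: scanned directly from the raw data.
        tail = self.raw[self.loaded:]
        scan_sum = sum(v for k, v in tail if k == key)
        # Side effect of the job: migrate the next chunk into the database.
        batch = tail[:self.chunk_size]
        self.db.executemany("INSERT INTO t VALUES (?, ?)", batch)
        self.loaded += len(batch)
        return db_sum + scan_sum, len(tail)

raw = [(i % 3, i) for i in range(9)]
loader = InvisibleLoader(raw, chunk_size=4)
history = [loader.run_job(0) for _ in range(4)]
```

Every job returns the same answer, but the number of raw records it has to scan shrinks from job to job, which is the sense in which performance improves without an explicit load step.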

Conclusions MapReduce and parallel databases are definitely complementary MapReduce and parallel databases are definitely competitive HadoopDB is awesome
