Hw09 Hadoop Db

HadoopDB: An open source hybrid of MapReduce
and DBMS technologies

Azza Abouzeid, Kamil Bajda-Pawlikowski
Daniel J. Abadi, Avi Silberschatz

Yale University
http://hadoopdb.sourceforge.net

October 2, 2009

HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for
Analytical Workloads. Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel J. Abadi, Avi
Silberschatz, Alex Rasin. In Proceedings of VLDB, 2009.

Introduction Candidates Differences HadoopDB Evaluation Conclusion
Yale University, HadoopWorld 2009

Motivation

Major Trends

1 Data explosion:
Automation of business processes, proliferation of digital
devices.
eBay has a 6.5 PB warehouse, Yahoo! Everest has 10 PB.
2 Analysis over raw data

Bottom line
Analyzing massive structured data on 1000s of shared-nothing
nodes.


Motivation

Sales Record Example

Consider a large data set of sales log records, each consisting of
sales information including:
1 a date of sale
2 a price
We would like to take the log records and generate a report
showing the total sales for each year.
Question:
How do we generate this report efficiently and cheaply over massive
data contained in a shared-nothing cluster of 1000s of machines?



MapReduce (Hadoop)

MapReduce is a programming model which specifies:
A map function that processes a key/value pair to generate a
set of intermediate key/value pairs,
A reduce function that merges all intermediate values
associated with the same intermediate key.
Hadoop
is a MapReduce implementation for processing large data sets
over 1000s of nodes.
Maps (and Reduces) run independently of each other over
blocks of data distributed across a cluster.



Sales Record Example using Hadoop

Query: Calculate total sales for each year.

We write a MapReduce program:
Map: Takes log records and extracts a key-value pair of year
and sale price in dollars. Outputs the key-value pairs.
Shuffle: Hadoop automatically partitions the key-value pairs
by year to the nodes executing the Reduce function.
Reduce: Simply sums up all the dollar values for a year.
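
The three steps above can be sketched in plain Python. This is an in-memory illustration of the programming model, not Hadoop code: the "YYYY-MM-DD,price" record format and all function names are assumptions made for the example.

```python
from collections import defaultdict

# Assumed (hypothetical) log record format: "YYYY-MM-DD,price"
def map_record(line):
    """Map: extract a (year, price) key-value pair from one log record."""
    date, price = line.split(",")
    return int(date[:4]), float(price)

def shuffle(pairs):
    """Shuffle: group prices by year, as the framework does between Map and Reduce."""
    groups = defaultdict(list)
    for year, price in pairs:
        groups[year].append(price)
    return groups

def reduce_year(year, prices):
    """Reduce: sum all dollar values for one year."""
    return year, sum(prices)

def total_sales_by_year(lines):
    groups = shuffle(map_record(line) for line in lines)
    return dict(reduce_year(year, prices) for year, prices in groups.items())

records = ["2008-03-01,10.0", "2008-07-15,5.5", "2009-01-20,7.25"]
print(total_sales_by_year(records))  # {2008: 15.5, 2009: 7.25}
```

In a real cluster the Map calls run on different nodes and the shuffle moves data over the network; here everything runs in one process so the data flow is easy to follow.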



Relational Databases

Suppose that the data is stored in a relational database system.
The sales record example could then be expressed in SQL as:

SELECT YEAR(date) AS year, SUM(price)
FROM sales
GROUP BY year

The execution plan is:

projection(year,price) → hash aggregation(year,price).

Question:
How do we process this efficiently if the data is very large?



Parallel Databases

Parallel Databases are like single-node databases except:
Data is partitioned across nodes
Individual relational operations can be executed in parallel
SELECT YEAR(date) AS year, SUM(price)
FROM sales GROUP BY year

Execution plan for the query:
projection(year,price) → partial hash aggregation(year,price) →
partitioning(year) → final aggregation(year,price).

Note that the execution plan resembles the map and reduce phases
of Hadoop.
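
A minimal Python sketch of that plan may make the resemblance concrete. It assumes each node holds a list of (date, price) rows and that years are routed to nodes by `year % num_nodes`; both choices, and all names, are illustrative rather than taken from any particular parallel database.

```python
from collections import defaultdict

def partial_aggregate(rows):
    """Projection plus partial hash aggregation over one node's partition."""
    sums = defaultdict(float)
    for date, price in rows:
        sums[int(date[:4])] += price
    return sums

def repartition(partials, num_nodes):
    """partitioning(year): send each partial sum to the node that owns its year."""
    buckets = [defaultdict(list) for _ in range(num_nodes)]
    for partial in partials:
        for year, subtotal in partial.items():
            buckets[year % num_nodes][year].append(subtotal)
    return buckets

def final_aggregate(bucket):
    """Final aggregation: combine the partial sums received for each year."""
    return {year: sum(subtotals) for year, subtotals in bucket.items()}

def parallel_total_sales(partitions):
    partials = [partial_aggregate(rows) for rows in partitions]
    result = {}
    for bucket in repartition(partials, len(partitions)):
        result.update(final_aggregate(bucket))
    return result

partitions = [
    [("2008-03-01", 10.0), ("2009-01-20", 7.25)],
    [("2008-07-15", 5.5), ("2009-06-30", 2.75)],
]
print(parallel_total_sales(partitions))  # {2008: 15.5, 2009: 10.0}
```

The partial aggregation plays the role of Map (with a combiner), the repartitioning plays the role of the shuffle, and the final aggregation plays the role of Reduce.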



Differences between Parallel Databases and Hadoop



To summarize


At Yale, we looked beyond the differences ...

and we discovered ...

... that they complete each other.

Basic design idea
Multiple, independent, single-node databases coordinated by
Hadoop.

http://i214.photobucket.com/albums/cc19/brittanybutton/elephants.jpg


Hadoop Basics



Architecture



SQL-MR-SQL

SELECT YEAR(saleDate), SUM(revenue) FROM sales GROUP BY YEAR(saleDate);
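
One way to picture the SQL-MR-SQL idea for this query is sketched below: each node's database answers the query over its own partition, and a Reduce step merges the per-node partial sums. This is an illustration under stated assumptions, not HadoopDB's implementation: in-memory SQLite databases stand in for the single-node databases, `strftime('%Y', ...)` replaces `YEAR()` (which SQLite lacks), and the helper names are hypothetical.

```python
import sqlite3
from collections import defaultdict

def make_node(rows):
    """Create one in-memory database holding a partition of the sales table."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE sales (saleDate TEXT, revenue REAL)")
    db.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return db

# SQL pushed down to every node; strftime stands in for YEAR().
LOCAL_SQL = """
    SELECT CAST(strftime('%Y', saleDate) AS INTEGER) AS year, SUM(revenue)
    FROM sales GROUP BY year
"""

def map_phase(node):
    """Map: run the pushed-down SQL against one node's local database."""
    return node.execute(LOCAL_SQL).fetchall()

def reduce_phase(partials):
    """Reduce: merge the per-node partial sums for each year."""
    totals = defaultdict(float)
    for year, subtotal in partials:
        totals[year] += subtotal
    return dict(totals)

nodes = [
    make_node([("2008-03-01", 10.0), ("2009-01-20", 7.25)]),
    make_node([("2008-07-15", 5.5), ("2009-06-30", 2.75)]),
]
partials = [pair for node in nodes for pair in map_phase(node)]
print(reduce_phase(partials))  # {2008: 15.5, 2009: 10.0}
```

Pushing the aggregation into the databases means only one row per year per node has to cross the network, while the merging of those rows is left to MapReduce.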



Evaluating HadoopDB

Compare HadoopDB to
1 Hadoop
2 Parallel databases (Vertica, DBMS-X)
Features:
1 Performance:

We expected HadoopDB to approach the performance of
parallel databases
2 Scalability:
We expected HadoopDB to scale as well as Hadoop
We ran the Pavlo et al. SIGMOD’09 benchmark on Amazon EC2
clusters of 10, 50, and 100 nodes.



Load

[Figure: two bar charts of load time in seconds for Vertica, DB-X,
HadoopDB, and Hadoop on 10, 50, and 100 nodes. Panels: Random
Unstructured Data (535MB/node); Structured data (20GB/node), with
the time axis in thousands of seconds.]



Performance: Grep Task

[Figure: bar chart of Grep task time in seconds for Vertica, DB-X,
HadoopDB, and Hadoop on 10, 50, and 100 nodes.]

1 Full table scan, highly selective filter
2 Random data, no room for indexing
3 Hadoop overhead outweighs query processing time in
single-node databases

SELECT * FROM grep WHERE field LIKE '%xyz%';



Performance: Join Task

[Figure: bar chart of Join task time in seconds for Vertica, DB-X,
HadoopDB, and Hadoop on 10, 50, and 100 nodes.]

1 No full table scan due to clustered indexing
2 Hash partitioning and efficient join algorithm

SELECT sourceIP, AVG(pageRank), SUM(adRevenue)
FROM rankings, uservisits
WHERE pageURL=destURL
AND visitDate BETWEEN '2000-01-15' AND '2000-01-22'
GROUP BY sourceIP
ORDER BY SUM(adRevenue) DESC LIMIT 1;



Performance: Bottom Line

1 Unstructured data
HadoopDB’s performance matches Hadoop
2 Structured data
HadoopDB’s performance is close to parallel databases



Scalability: Setup

1 Simple aggregation task - full table scan
2 Data replicated across 10 nodes
3 Fault-tolerance: Kill a node halfway
4 Fluctuation-tolerance: Slow down a node for the entire
experiment



Scalability: Results

[Figure: bar chart of percentage slowdown (0% to 200%) for Vertica,
HadoopDB, and Hadoop in the fault-tolerance and
fluctuation-tolerance experiments.]

1 HadoopDB and Hadoop take advantage of runtime
scheduling by splitting data into chunks or blocks
2 Parallel databases restart entire query on node failure or
wait for the slowest node
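
The effect of a slow node under the two strategies can be illustrated with a toy simulation. It is a sketch under simplifying assumptions (equal-sized tasks, no scheduling overhead, no speculative execution), with made-up speeds and work units; it does not reproduce the measured results.

```python
import heapq

def static_makespan(total_work, speeds):
    """Each node gets an equal fixed share; the query waits for the slowest node."""
    share = total_work / len(speeds)
    return max(share / speed for speed in speeds)

def dynamic_makespan(total_work, speeds, num_tasks):
    """Work is split into equal tasks; the node that frees up first takes the next."""
    task = total_work / num_tasks
    # Heap of (time at which the node becomes free, node index).
    free_at = [(0.0, i) for i in range(len(speeds))]
    heapq.heapify(free_at)
    finish = 0.0
    for _ in range(num_tasks):
        now, node = heapq.heappop(free_at)
        done = now + task / speeds[node]
        finish = max(finish, done)
        heapq.heappush(free_at, (done, node))
    return finish

speeds = [1.0, 1.0, 1.0, 0.25]  # hypothetical: one node runs at quarter speed
print(static_makespan(400, speeds))                 # 400.0
print(dynamic_makespan(400, speeds, num_tasks=40))  # 130.0
```

With static partitioning the slow node dictates the finish time, whereas with small tasks assigned at runtime the faster nodes absorb most of the work the slow node cannot get to.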



To summarize

HadoopDB ...
1 is a hybrid of DBMS and MapReduce
2 scales better than commercial parallel databases
3 is as fault-tolerant as Hadoop
4 approaches the performance of parallel databases
5 is free and open-source

http://hadoopdb.sourceforge.net


Future work

Engineering work:
1 Full SQL support in SMS
2 Data compression
3 Integration with other open source databases
4 Full automation of the loading and replication process
5 Out-of-the box deployment
6 We’re hiring!
Research work:
Incremental loading and on-the-fly repartitioning
Dynamically adjusting fault-tolerance levels based on failure
rate


Thank You ...

We welcome all thoughts on how to raise HadoopDB ...
http://www.jpbutler.com/thailand/images/elephant-8-days-old.jpg
