Improving MySQL performance with Hadoop

Improving MySQL Performance with
Hadoop
Sagar Jauhari, Manish Kumar

India
May 03 – May 04, 2012

San Francisco
September 30 – October 4, 2012


Program Agenda

● Introduction
● Inside Hadoop!
● Integration with MySQL
● Facebook's usage of MySQL & Hadoop
● Twitter's usage of MySQL &Hadoop


Introduction
MySQL
● 12 million product installations
● 65,000 downloads each day
● Part of the rapidly growing open source LAMP stack
● MySQL Commercial Editions Available


Introduction
Hadoop
● Highly scalable Distributed Framework
○ Yahoo! has a 4000 node cluster!
● Extremely powerful in terms of computation
○ Sorts a TB of random integers in 62 seconds!


Introduction
Hadoop is ..
● A scalable system for data storage and processing.
● Fault tolerant
● Parallelizes data processing across many nodes
● Leverages its distributed file system (HDFS)* to
cheaply and reliably replicate chunks of data.


Introduction
Who uses Hadoop?
● Yahoo:
■ Ad Systems and Web Search.
● Facebook:
■ Reporting/analytics and machine learning.
● Twitter:
■ Data warehousing, data analysis.
● Netflix:
■ Movie recommendation algorithm uses Hive ( which uses
Hadoop, HDFS & MapReduce underneath)


Introduction
MySQL Vs Hadoop
MySQL Hadoop

Data Capacity TB+ (may require sharding) PB+

Data per query GB? PB+

Read/Write Random read/write Sequential scans, Append - only

Query Language SQL Java MapReduce, scripting
languages, Hive QL

Transaction Yes No

Indexes Yes No

Latence Sub-second (hopefully) Minutes to hours

Data structure Structured Structured or unstructured
Courtesy: Leveraging Hadoop to Augment MySQL Deployments, Sarah Sproehnle, Cloudera, 2010


Inside Hadoop

A shallow Deep Dive


Inside Hadoop
HDFS
● A distributed, scalable, Name Node

and portable file system
written in Java
● Each node in a Hadoop HDFS

instance typically has a
single name-node; a
cluster of data-nodes form
the HDFS cluster.
Map / Reduce Workers


Inside Hadoop
HDFS
● Uses the TCP/IP layer for Name Node

communication
● Stores large files across
multiple machines HDFS

● Single name node stores
metadata in-memory.

Map / Reduce Workers


Inside Hadoop
HDFS


Inside Hadoop
Map Reduce
● Design Goals
○ Scalability
○ Cost Efficiency
● Implementation
○ User Jobs are executed as 'map' and 'reduce' functions
○ Work distribution and fault tolerance are managed

Input Map Shuffle and sort Reduce Output


Inside Hadoop
Map Reduce
● Map
○ Map Reduce job splits input data into independent chunks
○ Each chunk is processed by the map task in a parallel
manner
○ Generic key-value computation



Inside Hadoop
Map Reduce
● Reduce
○ Data from data nodes is merge sorted so that the key-value
pairs for a given key are contiguous
○ The merged data is read sequentially and the values are
passed to the reduce method with an iterator reading the
input file until the next key value is encountered



Inside Hadoop
Map Reduce

Word
Word Count
Hadoop
Map Hadoop 2
Reduce
MySQL
MySQL 1
Hive
Map Hive 1
Sqoop
Reduce Sqoop 1
Pig
Map
Pig 1
Hadoop


Inside Hadoop
How does hadoop use Map-Reduce
● Framework consists of a single master JobTracker
and one slave TaskTracker per cluster-node.
● Master
○ Schedules the jobs' component tasks on the slaves
○ Monitors the jobs
○ Re-executes the failed tasks
● Slave
○ Executes the tasks as directed by the master.


Inside Hadoop
Why Map Reduce ?
● Language support
○ Java, PHP, Hive, Pig, Python, Wukong (Ruby), Rhipe (R) .
● Scales Horizontally
● Programmer is isolated from individual failed tasks
○ Tasks are restarted on another node


Inside Hadoop
Map Reduce Limitations
● Not a good fit for problems that exhibit task-driven
parallelism.
● Requires a particular form of input - a set of (key,
pair) pairs.
● A lot of MapReduce applications end up sharing data
one way or another.


Integration with MySQL

Leveraging Hadoop to
Improve MySQL
performance



● The benefits of MySQL to developers is the speed,
reliability, data integrity and scalability it provides.
● It can successfully process large amounts of data (in
petabytes).
● But for applications that require a massive parallel
processing we may need the benefits of a parallel
processing system, such as hadoop.



Image Source: Leveraging Hadoop to Augment MySQL Deployments, Sarah Sproehnle, Cloudera, 2010


Problem Statement
Word Count Problem
● In a large set of
documents, find the
number of occurrences
of each word.


Word count problem

Word
Word Count
Hadoop
Map Hadoop 2
Reduce
MySQL
MySQL 1
Hive
Map Hive 1
Sqoop
Reduce Sqoop 1
Pig
Map
Pig 1
Hadoop


Mapping

Key and Value represent a row of data:
Map
key is the byte office, value in a line.
(key,
value)
Intermediate Output
foreach <word1>, 1
(word in <word2>, 1
the <word3>, 1
value)

output
(word,1)


Reducing
Hadoop aggregates the keys
Reduce and calls reduce for each
(key, list) unique key:
sum <word1>, (1,1,1,1,1,1…1)
the list <word2>, (1,1,1)
Output <word3>, (1,1,1,1,1,1) .
(key,
Final result:
sum)
<word1>, 45823
<word2>, 1204
<word3>, 2693



Demo


Video


Facebook's usage of MySQL & Hadoop

● Facebook collects TB of data everyday from around 800 million
users.
● MySQL handles pretty much every user interaction: likes,
shares, status updates, alerts, requests, etc.
● Hadoop/Hive Warehouse
– 4800 cores, 2 PetaBytes (July 2009)
– 4800 cores, 12 PetaBytes (Sept 2009)
● Hadoop Archival Store
– 200 TB


Hive
● Data warehouse system for Hadoop.
● Facilitates easy data summarization.
● Hive translates HiveQL to MapReduce code.
● Querying
○ Provides a mechanism to project structure onto this data
○ Allows querying the data using a SQL-like language called HiveQL



Image Source: Leveraging Hadoop to Augment MySQL Deployments, Sarah Sproehnle, Cloudera, 2010


Hive Vs SQL

RDBMS HIVE

SQL-92 standard (maybe) Subset of SQL-92 plus Hive-
Language
specific extension
INSERT, UPDATE and INSERT but not UPDATE or
Update Capabilities
DELETE DELETE

Yes No
Transactions

Sub-Second Minutes or more
Latency

Any number of indexes, No indexes, data is always
Indexes
very scanned (in parallel)
important for performance
TBs PBs
Data size
Data per query GBs
Image Source: Leveraging Hadoop to Augment MySQL Deployments, Sarah Sproehnle, Cloudera, 2010 PBs


Hadoop Implementation
At Twitter
● > 12 terabytes of new data per day!
● Most stored data is LZ0 compressed
● Uses Scribe to write logs to Hadoop
○ Scribe: a log collection framework created and open-
sourced by Facebook.
● Hadoop used for data warehousing, data analysis.


References

● Leveraging Hadoop to Augment MySQL Deployments - Sarah
Sproehnle, Cloudera
● http://engineering.twitter.com/2010/04/hadoop-at-twitter.html
● http://semanticvoid.com
● http://michael-noll.com
● http://hadoop.apache.org/


Legal Disclaimer

● All other products, company names, brand names,
trademarks and logos are the property of their
respective owners.


Thank You


Improving MySQL performance with Hadoop

More Related Content

What's hot (20)

Similar to Improving MySQL performance with Hadoop (20)

Recently uploaded (20)

Improving MySQL performance with Hadoop