Hive Percona 2009

Data Warehousing & Analytics on Hadoop
Ashish Thusoo, Prasad Chakka
Facebook Data Team

Why Another Data Warehousing System?
- Data, data and more data
  - 200GB per day in March 2008
  - 2+TB (compressed) raw data per day today

Let's try Hadoop…
- Pros
  - Superior in availability/scalability/manageability
  - Efficiency not that great, but we can throw more hardware at it
  - Partial availability/resilience/scale more important than ACID
- Cons: Programmability and Metadata
  - Map-reduce is hard to program (users know sql/bash/python)
  - Need to publish data in well known schemas
- Solution: HIVE

What is HIVE?
- A system for managing and querying structured data built on top of Hadoop
  - Map-Reduce for execution
  - HDFS for storage
  - Metadata on raw files
- Key Building Principles:
  - SQL as a familiar data warehousing tool
  - Extensibility – Types, Functions, Formats, Scripts
  - Scalability and Performance

Simplifying Hadoop
  hive> select key, count(1) from kv1 where key > 100 group by key;
vs.
  $ cat > /tmp/reducer.sh
  uniq -c | awk '{print $2"\t"$1}'
  $ cat > /tmp/map.sh
  awk -F '\001' '{if($1 > 100) print $1}'
  $ bin/hadoop jar contrib/hadoop-0.19.2-dev-streaming.jar -input /user/hive/warehouse/kv1 -mapper map.sh -file /tmp/reducer.sh -file /tmp/map.sh -reducer reducer.sh -output /tmp/largekey -numReduceTasks 1
  $ bin/hadoop dfs -cat /tmp/largekey/part*
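
For readers more at home in Python than awk, the two streaming scripts on this slide can be sketched as plain functions. This is an illustration only, not part of the original deck: the names `map_lines` and `reduce_keys` and the sample rows are hypothetical, and it assumes integer keys stored as the first `\001`-delimited field, which is what `map.sh` splits on.

```python
from itertools import groupby

def map_lines(lines):
    """Mimic map.sh: emit the first \\001-delimited field when it is > 100."""
    for line in lines:
        key = line.rstrip("\n").split("\x01")[0]
        if int(key) > 100:  # assumption: keys are integers
            yield key

def reduce_keys(sorted_keys):
    """Mimic reducer.sh (uniq -c | awk): emit 'key<TAB>count' per run of equal keys."""
    for key, run in groupby(sorted_keys):
        yield f"{key}\t{sum(1 for _ in run)}"

# Hypothetical sample rows in the shape of the kv1 table (key \001 value).
rows = ["238\x01val_238", "86\x01val_86", "311\x01val_311", "238\x01val_238"]

# Hadoop sorts map output by key before the reducer sees it; sorted() stands in for that.
output = list(reduce_keys(sorted(map_lines(rows))))
```

The Hive query expresses the same filter and count in one line, which is the point the slide is making.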

Looks like this ..
[Diagram: a cluster of nodes, each with local disks, connected by 1 Gigabit and 4-8 Gigabit links. Node = DataNode + Map-Reduce]

Data Warehousing at Facebook Today
[Diagram components: Web Servers, Scribe Servers, Filers, Hive on Hadoop Cluster, Oracle RAC, Federated MySQL]

Hive/Hadoop Usage @ Facebook
- Types of Applications:
  - Reporting
    - Eg: Daily/Weekly aggregations of impression/click counts
    - Complex measures of user engagement
  - Ad hoc Analysis
    - Eg: how many group admins broken down by state/country
  - Data Mining (Assembling training data)
    - Eg: User Engagement as a function of user attributes
  - Spam Detection
    - Anomalous patterns for Site Integrity
    - Application API usage patterns
  - Ad Optimization
  - Too many to count ..

Hadoop Usage @ Facebook
- Data statistics:
  - Total Data: ~1.7PB
  - Cluster Capacity: ~2.4PB
  - Net Data added/day: ~15TB
    - 6TB of uncompressed source logs
    - 4TB of uncompressed dimension data reloaded daily
  - Compression Factor: ~5x (gzip, more with bzip)
- Usage statistics:
  - 3200 jobs/day with 800K tasks (map-reduce tasks)/day
  - 55TB of compressed data scanned daily
  - 15TB of compressed output data written to hdfs
  - 80 MM compute minutes/day

HIVE: Components
[Diagram components: HDFS; Hive CLI (DDL, Queries, Browsing); Map Reduce; MetaStore; Thrift API; SerDe (Thrift, Jute, JSON..); Execution; Parser; Planner; Web UI; Optimizer; DB]

Data Model
[Diagram: the clicks table, with its files in HDFS and its metadata in the MetaStore]
- Tables: /hive/clicks
- Logical Partitioning: /hive/clicks/ds=2008-03-25
- Hash Partitioning: /hive/clicks/ds=2008-03-25/0 …
- Metastore DB: Data Location, Bucketing Info, Partitioning Cols
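
The path layout on this slide can be sketched in a few lines of Python. This is a hedged illustration, not Hive code: `partition_path` and `bucket_for` are hypothetical helpers, and `zlib.crc32` is only a stand-in for whatever hash Hive applies to the bucketing column.

```python
import zlib

def bucket_for(value, num_buckets):
    """Pick a bucket by hashing the bucketing column.

    Assumption: crc32 is used here only because it is deterministic;
    it is not claimed to be the hash function Hive uses.
    """
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets

def partition_path(table_root, partition_col, partition_value, bucket=None):
    """Build the HDFS directory for one partition, plus an optional bucket suffix."""
    path = f"{table_root}/{partition_col}={partition_value}"
    return path if bucket is None else f"{path}/{bucket}"

# Reproduces the example paths shown on the slide.
partition_dir = partition_path("/hive/clicks", "ds", "2008-03-25")
bucket_file = partition_path("/hive/clicks", "ds", "2008-03-25", bucket=0)
```

Because the partition value is encoded in the directory name, a query restricted to one `ds` value only needs to read the files under that directory.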

Hive Query Language
- SQL
  - Subqueries in from clause
  - Equi-joins
  - Multi-table Insert
  - Multi-group-by
- Sampling
- Complex object types
- Extensibility
  - Pluggable Map-reduce scripts
  - Pluggable User Defined Functions
  - Pluggable User Defined Types
  - Pluggable Data Formats

Map Reduce Example
                  Machine 1                           Machine 2
  Input:          <k1, v1> <k2, v2> <k3, v3>          <k4, v4> <k5, v5> <k6, v6>
  Local Map:      <nk1, nv1> <nk2, nv2> <nk3, nv3>    <nk2, nv4> <nk2, nv5> <nk1, nv6>
  Global Shuffle: <nk1, nv1> <nk3, nv3> <nk1, nv6>    <nk2, nv4> <nk2, nv5> <nk2, nv2>
  Local Sort:     <nk1, nv1> <nk1, nv6> <nk3, nv3>    <nk2, nv4> <nk2, nv5> <nk2, nv2>
  Local Reduce:   <nk1, 2> <nk3, 1>                   <nk2, 3>
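
The four stages on this slide (local map, global shuffle, local sort, local reduce) can be simulated in memory. The sketch below is illustrative only: the function names are hypothetical, the map output is hard-coded to match the slide, and `zlib.crc32` stands in for the partitioner, so which machine receives which key may differ from the picture.

```python
import zlib
from itertools import groupby

# The slide's map output, keyed by input key: k1 -> <nk1, nv1>, k4 -> <nk2, nv4>, ...
SLIDE_MAPPING = {
    "k1": ("nk1", "nv1"), "k2": ("nk2", "nv2"), "k3": ("nk3", "nv3"),
    "k4": ("nk2", "nv4"), "k5": ("nk2", "nv5"), "k6": ("nk1", "nv6"),
}

def local_map(records):
    """Each machine maps its own records without talking to other machines."""
    return [SLIDE_MAPPING[key] for key, _value in records]

def global_shuffle(mapped_per_machine, num_machines):
    """Route every pair to a machine chosen from its key, so equal keys meet."""
    shuffled = [[] for _ in range(num_machines)]
    for pairs in mapped_per_machine:
        for key, value in pairs:
            target = zlib.crc32(key.encode("utf-8")) % num_machines  # assumed partitioner
            shuffled[target].append((key, value))
    return shuffled

def local_sort(pairs):
    """Sort by key so each key's values are adjacent."""
    return sorted(pairs, key=lambda kv: kv[0])

def local_reduce(sorted_pairs):
    """Count the values seen for each key, as in the slide's final step."""
    return [(key, sum(1 for _ in group))
            for key, group in groupby(sorted_pairs, key=lambda kv: kv[0])]

machines = [
    [("k1", "v1"), ("k2", "v2"), ("k3", "v3")],
    [("k4", "v4"), ("k5", "v5"), ("k6", "v6")],
]
mapped = [local_map(records) for records in machines]
shuffled = global_shuffle(mapped, len(machines))
reduced = [local_reduce(local_sort(pairs)) for pairs in shuffled]
counts = dict(pair for machine in reduced for pair in machine)
```

Only the shuffle moves data between machines; the other three stages are purely local, which is why the slide labels them that way.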

Hive QL – Join
  INSERT INTO TABLE pv_users
  SELECT pv.pageid, u.age
  FROM page_view pv JOIN user u ON (pv.userid = u.userid);

Hive QL – Join in Map Reduce
page_view:
  pageid  userid  time
  1       111     9:08:01
  2       111     9:08:13
  1       222     9:08:14
user:
  userid  age  gender
  111     25   female
  222     32   male
Map output from page_view:
  key  value
  111  <1, 1>
  111  <1, 2>
  222  <1, 1>
Map output from user:
  key  value
  111  <2, 25>
  222  <2, 32>
Shuffle Sort, input to first reducer:
  key  value
  111  <1, 1>
  111  <1, 2>
  111  <2, 25>
Shuffle Sort, input to second reducer:
  key  value
  222  <1, 1>
  222  <2, 32>
Reduce output from first reducer (pv_users):
  pageid  age
  1       25
  2       25
Reduce output from second reducer (pv_users):
  pageid  age
  1       32
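
The join plan on this slide can be traced with a small in-memory simulation. This is a minimal sketch, not Hive's implementation: the function names are hypothetical, and the tag values 1 and 2 simply follow the slide, where the first element of each value marks the source table (1 for page_view, 2 for user).

```python
from collections import defaultdict

# Rows copied from the slide.
page_view = [(1, 111, "9:08:01"), (2, 111, "9:08:13"), (1, 222, "9:08:14")]
user = [(111, 25, "female"), (222, 32, "male")]

def map_phase():
    """Emit (join key, <table tag, projected column>) for every input row."""
    for pageid, userid, _time in page_view:
        yield userid, (1, pageid)
    for userid, age, _gender in user:
        yield userid, (2, age)

def shuffle_sort(pairs):
    """Group values by key and sort them, so tag-1 rows precede tag-2 rows."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sorted(values) for key, values in sorted(groups.items())}

def reduce_phase(groups):
    """For each userid, pair every page_view row with every user row."""
    for _userid, values in groups.items():
        pageids = [v for tag, v in values if tag == 1]
        ages = [v for tag, v in values if tag == 2]
        for pageid in pageids:
            for age in ages:
                yield pageid, age

groups = shuffle_sort(map_phase())
pv_users = list(reduce_phase(groups))
```

Tagging each value with its source table is what lets a single reducer tell the two sides of the join apart after the shuffle has mixed them together.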

Hive QL – Group By
  SELECT pageid, age, count(1)
  FROM pv_users
  GROUP BY pageid, age;

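Hive QL – Group By in Map Reduce
pv_users, first split:
  pageid  age
  1       25
  2       25
pv_users, second split:
  pageid  age
  1       32
  2       25
Map output from first split:
  key     value
  <1,25>  1
  <2,25>  1
Map output from second split:
  key     value
  <1,32>  1
  <2,25>  1
Shuffle Sort, input to first reducer:
  key     value
  <1,25>  1
  <1,32>  1
Shuffle Sort, input to second reducer:
  key     value
  <2,25>  1
  <2,25>  1
Reduce output from first reducer:
  pageid  age  count
  1       25   1
  1       32   1
Reduce output from second reducer:
  pageid  age  count
  2       25   2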
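
The group-by plan on this slide follows the same map, shuffle, reduce shape and can be sketched the same way. The code is an illustration under simple assumptions (everything fits in memory, one dictionary stands in for all reducers); the function names are hypothetical.

```python
from collections import defaultdict

# The two input splits of pv_users shown on the slide, as (pageid, age) rows.
splits = [[(1, 25), (2, 25)], [(1, 32), (2, 25)]]

def map_phase(rows):
    """Emit the grouping columns as the key and 1 as the value."""
    return [((pageid, age), 1) for pageid, age in rows]

def shuffle_sort(mapped_splits):
    """Bring all values for the same (pageid, age) key together, in key order."""
    groups = defaultdict(list)
    for pairs in mapped_splits:
        for key, value in pairs:
            groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(groups):
    """Sum the 1s for each key to get count(1)."""
    return [(pageid, age, sum(ones)) for (pageid, age), ones in groups]

result = reduce_phase(shuffle_sort(map_phase(rows) for rows in splits))
```

The rows in `result` match the two reducer outputs on the slide, concatenated.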

Group by Optimizations
- Map side partial aggregations
  - Hash Based aggregates
  - Serialized key/values in hash tables
- Optimizations being Worked On:
  - Exploit pre-sorted data for distinct counts
  - Partial aggregations and Combiners
  - Be smarter about how to avoid multiple stages
  - Exploit table/column statistics for deciding strategy
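
Map-side partial aggregation, the first item on this slide, can be illustrated with an in-memory hash table per mapper. This is a sketch of the general idea only: it uses a Python `Counter` rather than the serialized key/values the slide mentions, and the function names and sample rows are hypothetical.

```python
from collections import Counter

def map_with_partial_aggregation(rows):
    """Aggregate inside the mapper, emitting one pair per distinct key
    instead of one pair per input row."""
    partial = Counter()
    for pageid, age in rows:
        partial[(pageid, age)] += 1
    return list(partial.items())

def reduce_partials(all_partials):
    """Reducers add up partial counts rather than counting individual rows."""
    total = Counter()
    for partials in all_partials:
        for key, count in partials:
            total[key] += count
    return dict(total)

# Hypothetical splits with repeated keys, to show the saving.
splits = [[(2, 25), (2, 25), (1, 25)], [(2, 25), (1, 32)]]
partials = [map_with_partial_aggregation(rows) for rows in splits]
totals = reduce_partials(partials)
```

In this example the first mapper ships two pairs instead of three. The saving grows with the number of repeated keys per split, since less data crosses the network during the shuffle.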

Inserts into Files, Tables and Local Files
  FROM pv_users
  INSERT INTO TABLE pv_gender_sum
    SELECT pv_users.gender, count_distinct(pv_users.userid)
    GROUP BY(pv_users.gender)
  INSERT INTO DIRECTORY '/user/facebook/tmp/pv_age_sum.dir'
    SELECT pv_users.age, count_distinct(pv_users.userid)
    GROUP BY(pv_users.age)
  INSERT INTO LOCAL DIRECTORY '/home/me/pv_age_sum.dir'
    FIELDS TERMINATED BY ',' LINES TERMINATED BY \013
    SELECT pv_users.age, count_distinct(pv_users.userid)
    GROUP BY(pv_users.age);
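
The multi-insert statement on this slide names its source once and feeds several aggregations from it. The sketch below shows that idea as a single pass over the rows in Python. It is an analogy under stated assumptions, not a description of how Hive plans the query: `multi_insert` and the sample rows are hypothetical, and the output destinations are reduced to two dictionaries.

```python
from collections import defaultdict

# Hypothetical pv_users rows carrying the columns the query refers to.
pv_users = [
    {"userid": 111, "age": 25, "gender": "female"},
    {"userid": 111, "age": 25, "gender": "female"},
    {"userid": 222, "age": 32, "gender": "male"},
    {"userid": 333, "age": 25, "gender": "male"},
]

def multi_insert(rows):
    """Compute distinct user counts by gender and by age in one scan."""
    users_by_gender = defaultdict(set)
    users_by_age = defaultdict(set)
    for row in rows:  # the source is read once for both aggregations
        users_by_gender[row["gender"]].add(row["userid"])
        users_by_age[row["age"]].add(row["userid"])
    gender_sum = {gender: len(users) for gender, users in users_by_gender.items()}
    age_sum = {age: len(users) for age, users in users_by_age.items()}
    return gender_sum, age_sum

pv_gender_sum, pv_age_sum = multi_insert(pv_users)
```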

Extensibility - Custom Map/Reduce Scripts
  FROM (
    FROM pv_users
    MAP (pv_users.userid, pv_users.date)
    USING 'map_script' AS (dt, uid)
    CLUSTER BY (dt)) map
  INSERT INTO TABLE pv_users_reduced
    REDUCE (map.dt, map.uid)
    USING 'reduce_script' AS (date, count);
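
The slide does not show what `map_script` and `reduce_script` contain. As a hedged example of what such scripts might look like in Python, the sketch below assumes rows arrive as tab-separated lines, that the map step just reorders the columns, and that the reduce step counts rows per date. All of this logic is hypothetical; real scripts would read `sys.stdin` and print to stdout instead of working on lists.

```python
from itertools import groupby

def map_script(lines):
    """Turn 'userid<TAB>date' input rows into 'dt<TAB>uid' output rows."""
    for line in lines:
        userid, date = line.rstrip("\n").split("\t")
        yield f"{date}\t{userid}"

def reduce_script(lines):
    """Emit 'date<TAB>count' per dt.

    Relies on rows with the same dt arriving next to each other,
    which is what CLUSTER BY (dt) in the query provides.
    """
    rows = (line.rstrip("\n").split("\t") for line in lines)
    for dt, group in groupby(rows, key=lambda row: row[0]):
        yield f"{dt}\t{sum(1 for _ in group)}"

# Hypothetical input rows: userid and date, tab-separated.
source = ["111\t2008-03-25", "222\t2008-03-26", "333\t2008-03-25"]
clustered = sorted(map_script(source))  # sorted() stands in for CLUSTER BY (dt)
result = list(reduce_script(clustered))
```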

Open Source Community
- 21 contributors and growing
  - 6 contributors within Facebook
- Contributors from:
  - Academia
  - Other web companies
  - Etc..
- 7 committers
  - 1 external to Facebook and looking to add more here

Future Work
- Statistics and cost-based optimization
- Integration with BI tools (through JDBC/ODBC)
- Performance improvements
- More SQL constructs & UDFs
- Indexing
- Schema Evolution
- Advanced operators
  - Cubes/Frequent Item Sets/Window Functions
- Hive Roadmap
  - http://wiki.apache.org/hadoop/Hive/Roadmap

Information
- Available as a sub project in Hadoop
  - http://wiki.apache.org/hadoop/Hive (wiki)
  - http://hadoop.apache.org/hive (home page)
  - http://svn.apache.org/repos/asf/hadoop/hive (SVN repo)
  - ##hive (IRC)
  - Works with hadoop-0.17, 0.18, 0.19
- Release 0.3 is coming in the next few weeks
- Mailing Lists:
  - hive-{user,dev,commits}@hadoop.apache.org
