Hive - Data Warehousing & Analytics on Hadoop
Wednesday, June 10, 2009, Santa Clara Marriott
Namit Jain, Zheng Shao
Facebook
Agenda
  • Introduction
  • Facebook Usage
  • Hive Progress and Roadmap
  • Open Source Community
Introduction
Why Another Data Warehousing System?
  • Data, data and more data
      • ~1TB per day in March 2008
      • ~10TB per day today
 
Let's try Hadoop…
  • Pros
      • Superior in availability/scalability/manageability
      • Efficiency not that great, but throw more hardware
      • Partial availability/resilience/scale more important than ACID
  • Cons: Programmability and Metadata
      • Map-reduce hard to program (users know SQL/bash/Python)
      • Need to publish data in well-known schemas
  • Solution: HIVE
Let's try Hadoop… (continued)
    RDBMS> select key, count(1) from kv1 where key > 100 group by key;
  vs.
    $ cat > /tmp/reducer.sh
    uniq -c | awk '{print $2"\t"$1}'
    $ cat > /tmp/map.sh
    awk -F '\001' '{if($1 > 100) print $1}'
    $ bin/hadoop jar contrib/hadoop-0.19.2-dev-streaming.jar \
        -input /user/hive/warehouse/kv1 \
        -mapper map.sh -file /tmp/reducer.sh -file /tmp/map.sh \
        -reducer reducer.sh -output /tmp/largekey -numReduceTasks 1
    $ bin/hadoop dfs -cat /tmp/largekey/part*
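To make the comparison concrete, here is a minimal Python sketch of what the two streaming scripts compute together. It is only an illustration: the function names and sample rows are hypothetical, and it runs locally rather than on Hadoop. The map step keeps the first `\001`-separated field when it is above 100, and the reduce step counts occurrences per key, as `uniq -c` does over the sorted keys that the shuffle delivers.

```python
from collections import Counter

def map_phase(lines, threshold=100, sep="\x01"):
    """Mimic map.sh: emit the first field when it is numerically above the threshold."""
    for line in lines:
        key = line.rstrip("\n").split(sep)[0]
        try:
            if float(key) > threshold:
                yield key
        except ValueError:
            # Skip rows whose first field is not numeric (a simplification).
            continue

def reduce_phase(keys):
    """Mimic reducer.sh: return (key, count) pairs ordered by key."""
    return sorted(Counter(keys).items())

# Hypothetical sample of kv1 rows: key, \001 separator, value.
rows = ["86\x01val_86", "238\x01val_238", "311\x01val_311", "238\x01val_238"]
result = reduce_phase(map_phase(rows))
print(result)
```

Even in this toy form, the filter and the aggregation have to be written and wired together by hand, which is the programmability cost the slide points at.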
What is HIVE?
  • A system for managing and querying structured data built on top of Hadoop
      • Map-Reduce for execution
      • HDFS for storage
      • Metadata on raw files
  • Key Building Principles:
      • SQL as a familiar data warehousing tool
      • Extensibility: Types, Functions, Formats, Scripts
      • Scalability and Performance
Simplifying Hadoop
    RDBMS> select key, count(1) from kv1 where key > 100 group by key;
  vs.
    hive> select key, count(1) from kv1 where key > 100 group by key;
Facebook Usage
Data Warehousing at Facebook Today
  [Diagram components: Web Servers, Scribe Servers, Filers, Hive on Hadoop Cluster, Oracle RAC, Federated MySQL]
Hive/Hadoop Usage @ Facebook
  • Types of Applications:
      • Reporting
          • E.g.: Daily/weekly aggregations of impression/click counts
                SELECT pageid, count(1) AS imps FROM imp_table WHERE date = '2009-05-01' GROUP BY pageid;
          • Complex measures of user engagement
      • Ad hoc Analysis
          • E.g.: how many group admins, broken down by state/country
      • Data Mining (Assembling training data)
          • E.g.: User engagement as a function of user attributes
      • Spam Detection
          • Anomalous patterns for Site Integrity
          • Application API usage patterns
      • Ad Optimization
Hadoop Usage @ Facebook
  • Cluster Capacity:
      • 600 nodes
      • ~2.4PB (80% used)
  • Data statistics:
      • Source logs/day: 6TB
      • Dimension data/day: 4TB
      • Compression factor: ~5x (gzip)
  • Usage statistics:
      • 3200 jobs/day with 800K map-reduce tasks/day
      • 55TB of compressed data scanned daily
      • 15TB of compressed output data written to HDFS
      • 150 active users within Facebook
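A few derived figures follow directly from these numbers. The short calculation below is a back-of-the-envelope sketch using only the values on the slide; the variable names are illustrative.

```python
jobs_per_day = 3200
tasks_per_day = 800_000
raw_capacity_pb = 2.4
used_fraction = 0.80
source_logs_tb = 6
dimension_data_tb = 4

# Average number of map-reduce tasks launched per job.
tasks_per_job = tasks_per_day / jobs_per_day
# Space currently occupied on the cluster.
used_pb = raw_capacity_pb * used_fraction
# New data arriving each day (source logs plus dimension data).
daily_input_tb = source_logs_tb + dimension_data_tb

print(f"average tasks per job: {tasks_per_job:.0f}")   # 250
print(f"space in use: ~{used_pb:.2f} PB")              # ~1.92 PB
print(f"new data per day: {daily_input_tb} TB")        # 10 TB
```

The 10TB daily total agrees with the "~10TB per day today" figure quoted earlier in the talk.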
Hive Progress and Roadmap
    CREATE TABLE clicks(key STRING, value STRING)
    PARTITIONED BY (ds STRING)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.TestSerDe'
    WITH SERDEPROPERTIES ('testserde.default.serialization.format'='\003')
    LOCATION '/hive/clicks';
Data Model
  [Diagram: the table "clicks" mapped to HDFS paths and MetaStore entries]
  • Tables: /hive/clicks
  • Logical Partitioning: /hive/clicks/ds=2008-03-25
  • Hash Partitioning: /hive/clicks/ds=2008-03-25/0 …
  • MetaStore (Metastore DB): Data Location, Bucketing Info, Partitioning Cols
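The directory layout above can be sketched in a few lines of Python. This is an illustration under assumptions, not Hive's implementation: the helper names are hypothetical, and the pruning step simply shows why a filter on the partition column `ds` lets a query skip whole directories.

```python
def partition_path(table_root: str, ds: str) -> str:
    """One directory per value of the partition column ds."""
    return f"{table_root}/ds={ds}"

def bucket_path(table_root: str, ds: str, bucket: int) -> str:
    """Inside a partition, one file per hash bucket (0, 1, ...)."""
    return f"{partition_path(table_root, ds)}/{bucket}"

def prune(partitions, wanted_ds: str):
    """Keep only the partition directories that match a ds filter."""
    return [p for p in partitions if p.rsplit("ds=", 1)[-1] == wanted_ds]

all_partitions = [
    partition_path("/hive/clicks", d)
    for d in ("2008-03-24", "2008-03-25", "2008-03-26")
]
to_scan = prune(all_partitions, "2008-03-25")
first_bucket = bucket_path("/hive/clicks", "2008-03-25", 0)

print(to_scan)        # ['/hive/clicks/ds=2008-03-25']
print(first_bucket)   # /hive/clicks/ds=2008-03-25/0
```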
HIVE: Components
  [Diagram components: Hive CLI (DDL, Queries, Browsing), Web UI, Thrift API, MetaStore (DB), Parser, Planner, Optimizer, Execution, SerDe (Thrift, CSV, JSON..), Map Reduce, HDFS]
Hive Query Language
  • SQL
      • Subqueries in FROM clause
      • Equi-joins
      • Multi-table Insert
      • Multi-group-by
  • Sampling
        SELECT s.key, count(1)
        FROM clicks TABLESAMPLE (BUCKET 1 OUT OF 32) s
        WHERE s.ds = '2009-04-22'
        GROUP BY s.key
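Bucket sampling can be pictured as keeping only the rows whose key hashes into one of N buckets. The sketch below is a simplified model: `zlib.crc32` is a stand-in chosen because it is stable across runs (Hive's actual hash function is not reproduced here), and the function names and rows are hypothetical.

```python
import zlib

def bucket_of(key: str, num_buckets: int) -> int:
    """Stable stand-in hash; not the hash function Hive itself uses."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets

def tablesample(rows, bucket: int, out_of: int):
    """Keep rows whose key falls into the given 1-based bucket number."""
    return [row for row in rows if bucket_of(row[0], out_of) == bucket - 1]

# Hypothetical (key, value) rows.
rows = [(f"key{i}", i) for i in range(3200)]
sample = tablesample(rows, 1, 32)
print(len(sample), "of", len(rows), "rows sampled")
```

When the table is already stored in 32 bucket files, the same selection amounts to reading a single file, so the sample costs roughly 1/32 of a full scan.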
    FROM pv_users
    INSERT INTO TABLE pv_gender_sum
      SELECT gender, count(DISTINCT userid)
      GROUP BY gender
    INSERT INTO DIRECTORY '/user/facebook/tmp/pv_age_sum.dir'
      SELECT age, count(DISTINCT userid)
      GROUP BY age
    INSERT INTO LOCAL DIRECTORY '/home/me/pv_age_sum.dir'
      SELECT age, count(DISTINCT userid)
      GROUP BY age;
Hive Query Language (continued)
  • Extensibility
      • Pluggable Map-reduce scripts
      • Pluggable User Defined Functions
      • Pluggable User Defined Types
          • Complex object types: List of Maps
      • Pluggable Data Formats
          • Apache Log Format
Pluggable Map-Reduce Scripts
    FROM (
      FROM pv_users
      MAP pv_users.userid, pv_users.date
      USING 'map_script'
      AS dt, uid
      CLUSTER BY dt) map
    INSERT INTO TABLE pv_users_reduced
      REDUCE map.dt, map.uid
      USING 'reduce_script'
      AS date, count;
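The slide does not show the scripts themselves. As a hedged sketch, `map_script` and `reduce_script` could look like the Python functions below, assuming Hive hands rows to the script as tab-separated lines and reads tab-separated lines back; the function names and sample rows are hypothetical, and the reducer here simply counts rows per date.

```python
from itertools import groupby

def map_lines(lines):
    """map_script sketch: read 'userid<TAB>date' rows, emit 'dt<TAB>uid'."""
    for line in lines:
        userid, date = line.rstrip("\n").split("\t")
        yield f"{date}\t{userid}"

def reduce_lines(lines):
    """reduce_script sketch: rows arrive clustered by dt; emit 'date<TAB>count'."""
    parsed = (line.rstrip("\n").split("\t") for line in lines)
    for dt, group in groupby(parsed, key=lambda cols: cols[0]):
        yield f"{dt}\t{sum(1 for _ in group)}"

# Local dry run; sorted() stands in for CLUSTER BY dt.
rows = ["111\t2009-06-09", "222\t2009-06-10", "333\t2009-06-09"]
mapped = sorted(map_lines(rows))
reduced = list(reduce_lines(mapped))
for out in reduced:
    print(out)
```

In a real script the same functions would be fed `sys.stdin` and their output written to standard output.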
Map Reduce Example
  [Diagram: data flow across Machine 1 and Machine 2; "|" separates the two machines' data]
  Input:           <k1, v1> <k2, v2> <k3, v3> | <k4, v4> <k5, v5> <k6, v6>
  Local Map:       <nk1, nv1> <nk2, nv2> <nk3, nv3> | <nk2, nv4> <nk2, nv5> <nk1, nv6>
  Global Shuffle:  <nk2, nv4> <nk2, nv5> <nk2, nv2> | <nk1, nv1> <nk3, nv3> <nk1, nv6>
  Local Sort:      <nk1, nv1> <nk1, nv6> <nk3, nv3> | <nk2, nv4> <nk2, nv5> <nk2, nv2>
  Local Reduce:    <nk2, 3> | <nk1, 2> <nk3, 1>
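The same flow can be simulated in a few lines of Python. This is a toy model, not Hadoop code: the partition function is hard-coded (an assumption) so that keys are grouped the way the example groups them, and the reduce step just counts values per key.

```python
from collections import defaultdict

# Map output on each of the two machines, as in the example.
machine_outputs = [
    [("nk1", "nv1"), ("nk2", "nv2"), ("nk3", "nv3")],
    [("nk2", "nv4"), ("nk2", "nv5"), ("nk1", "nv6")],
]

def shuffle(map_outputs, num_reducers, partition):
    """Global shuffle: route every pair to the reducer chosen by its key."""
    reducers = [[] for _ in range(num_reducers)]
    for output in map_outputs:
        for key, value in output:
            reducers[partition(key)].append((key, value))
    return reducers

def reduce_counts(pairs):
    """Local sort followed by a reduce that counts values per key."""
    counts = defaultdict(int)
    for key, _value in sorted(pairs):
        counts[key] += 1
    return dict(counts)

def partition(key):
    # Hypothetical partitioner: nk2 goes to reducer 0, everything else to reducer 1.
    return 0 if key == "nk2" else 1

reducers = shuffle(machine_outputs, 2, partition)
results = [reduce_counts(pairs) for pairs in reducers]
print(results)  # [{'nk2': 3}, {'nk1': 2, 'nk3': 1}]
```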
Hive QL – Join
    INSERT INTO TABLE pv_users
    SELECT pv.pageid, u.age
    FROM page_view pv JOIN user u ON (pv.userid = u.userid);
Hive QL – Join in Map Reduce
  page_view:
    pageid  userid  time
    1       111     9:08:01
    2       111     9:08:13
    1       222     9:08:14
  user:
    userid  age  gender
    111     25   female
    222     32   male
  Map output from page_view (key, value):
    111  <1, 1>
    111  <1, 2>
    222  <1, 1>
  Map output from user (key, value):
    111  <2, 25>
    222  <2, 32>
  After Shuffle Sort, first reducer (key, value):
    111  <1, 1>
    111  <1, 2>
    111  <2, 25>
  After Shuffle Sort, second reducer (key, value):
    222  <1, 1>
    222  <2, 32>
  Reduce output, pv_users:
    pageid  age
    1       25
    2       25
    1       32
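A reduce-side join of this kind can be sketched as follows. The sketch is illustrative only: the function names are hypothetical and everything runs in one process. Each map output value carries a tag (1 for page_view, 2 for user) so that the reducer can tell which table a value came from before pairing them up.

```python
from collections import defaultdict

page_view = [(1, 111, "9:08:01"), (2, 111, "9:08:13"), (1, 222, "9:08:14")]
user = [(111, 25, "female"), (222, 32, "male")]

def tag_rows(page_view_rows, user_rows):
    """Map phase: key every row by userid and tag the value with its source table."""
    for pageid, userid, _time in page_view_rows:
        yield userid, (1, pageid)
    for userid, age, _gender in user_rows:
        yield userid, (2, age)

def reduce_join(tagged):
    """Shuffle by key, then pair each pageid with each age for the same userid."""
    groups = defaultdict(list)
    for key, value in tagged:
        groups[key].append(value)
    joined = []
    for key in sorted(groups):
        pageids = [v for tag, v in groups[key] if tag == 1]
        ages = [v for tag, v in groups[key] if tag == 2]
        joined.extend((pageid, age) for pageid in pageids for age in ages)
    return joined

pv_users = reduce_join(tag_rows(page_view, user))
print(pv_users)  # [(1, 25), (2, 25), (1, 32)]
```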
Join Optimizations
  • Map Joins
      • User-specified small tables stored in hash tables on the mapper, backed by jdbm
      • No reducer needed
            INSERT INTO TABLE pv_users
            SELECT /*+ MAPJOIN(pv) */ pv.pageid, u.age
            FROM page_view pv JOIN user u ON (pv.userid = u.userid);
  • Future
      • Exploit table/column statistics for deciding strategy
Hive QL – Map Join
  page_view:
    pageid  userid  time
    1       111     9:08:01
    2       111     9:08:13
    1       222     9:08:14
  Hash table (key, value):
    111  <1,2>
    222  <2>
  user:
    userid  age  gender
    111     25   female
    222     32   male
  pv_users:
    pageid  age
    1       25
    2       25
    1       32
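The idea behind a map join can be sketched as a plain hash join: the table named in the MAPJOIN hint is loaded into an in-memory hash table, and the other table is streamed past it, so no shuffle or reduce step is needed. The code below is a simplified model with hypothetical function names; it uses a Python dict where the talk mentions a jdbm-backed table.

```python
from collections import defaultdict

page_view = [(1, 111, "9:08:01"), (2, 111, "9:08:13"), (1, 222, "9:08:14")]
user = [(111, 25, "female"), (222, 32, "male")]

def build_hash_table(page_view_rows):
    """Load the hinted table into memory, keyed by the join column."""
    table = defaultdict(list)
    for pageid, userid, _time in page_view_rows:
        table[userid].append(pageid)
    return table

def map_join(user_rows, pv_by_userid):
    """Stream the other table and probe the hash table row by row."""
    for userid, age, _gender in user_rows:
        for pageid in pv_by_userid.get(userid, []):
            yield pageid, age

pv_by_userid = build_hash_table(page_view)
pv_users = list(map_join(user, pv_by_userid))
print(pv_users)  # [(1, 25), (2, 25), (1, 32)]
```

The trade-off is memory: the approach only works while the hashed table fits on each mapper, which is why the slide describes it as applying to user-specified small tables.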
Hive QL – Group By
    SELECT pageid, age, count(1)
    FROM pv_users
    GROUP BY pageid, age;
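Hive QL – Group By in Map Reduce
  pv_users, first mapper input:
    pageid  age
    1       25
    1       25
  pv_users, second mapper input:
    pageid  age
    2       32
    1       25
  Map output, first mapper (key, value):
    <1,25>  2
  Map output, second mapper (key, value):
    <1,25>  1
    <2,32>  1
  After Shuffle Sort, first reducer (key, value):
    <1,25>  2
    <1,25>  1
  After Shuffle Sort, second reducer (key, value):
    <2,32>  1
  Reduce output:
    pageid  age  count
    1       25   3
    2       32   1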
Group by Optimizations
  • Map-side partial aggregations
      • Hash-based aggregates
      • Serialized key/values in hash tables
      • 90% speed improvement on query
            SELECT count(1) FROM t;
  • Load balancing for data skew
  • Optimizations being worked on:
      • Exploit pre-sorted data for distinct counts
      • Exploit table/column statistics for deciding strategy
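Map-side partial aggregation can be illustrated with a small Python sketch. It is a toy model with hypothetical function names: each mapper collapses its own rows into per-key counts in a hash table, and the reducer only merges those partial counts, so fewer records cross the shuffle.

```python
from collections import Counter

# (pageid, age) rows as seen by two mappers.
mapper_inputs = [
    [(1, 25), (1, 25)],
    [(2, 32), (1, 25)],
]

def map_side_aggregate(rows):
    """Hash-based partial aggregate: one count per distinct key on this mapper."""
    return Counter(rows)

def reduce_merge(partials):
    """Reducer: add up the partial counts that share a key."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)

partials = [map_side_aggregate(rows) for rows in mapper_inputs]
result = reduce_merge(partials)
shuffled_records = sum(len(p) for p in partials)

print(result)            # {(1, 25): 3, (2, 32): 1}
print(shuffled_records)  # 3 records shuffled instead of 4 input rows
```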
Columnar Storage
    CREATE TABLE columnTable (key STRING, value STRING)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.ColumnarSerDe'
    STORED AS RCFILE;
  • Saved 25% of space compared with SequenceFile
      • Based on one of the largest tables (30 columns) inside Facebook
      • Both are compressed with GzipCodec
  • Speed improvements in progress
      • Need to propagate column-selection information to FileFormat
  * Contribution from Yongqiang He (outside Facebook)
Speed Improvements over Time
  • Query A: SELECT count(1) FROM t;
  • Query B: SELECT concat(concat(concat(a,b),c),d) FROM t;
  • Query C: SELECT * FROM t;
  • Time measured is map-side time only (to avoid unstable shuffling time at the reducer side). It includes time for decompression and compression (both using GzipCodec).
  Date       SVN Revision  Major Changes                Query A  Query B  Query C
  2/22/2009  746906        Before Lazy Deserialization  83 sec   98 sec   183 sec
  2/23/2009  747293        Lazy Deserialization         40 sec   66 sec   185 sec
  3/6/2009   751166        Map-side Aggregation         22 sec   67 sec   182 sec
  4/29/2009  770074        Object Reuse                 21 sec   49 sec   130 sec
  6/3/2009   781633        Map-side Join *              21 sec   48 sec   132 sec
  * No performance benchmarks for Map-side Join yet.
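One way to read the table is to express each change as old time divided by new time, minus one. That definition is an assumption made here for illustration. Under that reading, Query A's drop from 83 to 40 seconds is a gain of roughly 108%, and Query C's drop from 183 to 130 seconds is a gain of roughly 41%. The sketch below applies the formula to every row relative to the first one.

```python
# (change, Query A, Query B, Query C) map-side times in seconds, from the table.
timings = [
    ("Before Lazy Deserialization", 83, 98, 183),
    ("Lazy Deserialization", 40, 66, 185),
    ("Map-side Aggregation", 22, 67, 182),
    ("Object Reuse", 21, 49, 130),
    ("Map-side Join", 21, 48, 132),
]

def improvement(old_sec: float, new_sec: float) -> float:
    """Speed improvement as a fraction: 1.0 means twice as fast."""
    return old_sec / new_sec - 1.0

baseline = timings[0]
for label, *times in timings[1:]:
    gains = [improvement(old, new) for old, new in zip(baseline[1:], times)]
    print(label, [f"{gain:+.0%}" for gain in gains])
```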
Overcoming Java Overhead
  • Reuse objects
      • Use Writable instead of Java primitives
      • Reuse objects across all rows
      • *40% speed improvement on Query C
  • Lazy deserialization
      • Only deserialize the column when asked
      • Very helpful for complex types (map/list/struct)
      • *108% speed improvement on Query A
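Lazy deserialization can be illustrated with a small Python class. This is a conceptual sketch, not Hive's LazySerDe: the class name, the `\001` separator default, and the conversion counter are assumptions made for the example. A column is converted to a typed value only when a query asks for it, and the result is cached.

```python
class LazyRow:
    """Holds the raw fields of a row and converts a column only on first access."""

    def __init__(self, raw: str, types, sep: str = "\x01"):
        self._raw_fields = raw.split(sep)  # cheap slicing, no type conversion yet
        self._types = types
        self._cache = {}
        self.conversions = 0               # instrumentation for the example

    def get(self, index: int):
        if index not in self._cache:
            self._cache[index] = self._types[index](self._raw_fields[index])
            self.conversions += 1
        return self._cache[index]

row = LazyRow("42\x01hello\x013.5", [int, str, float])
before = row.conversions   # nothing converted yet
value = row.get(0)         # converts only column 0
row.get(0)                 # served from the cache
after = row.conversions

print(before, value, after)  # 0 42 1
```

A query such as `SELECT count(1) FROM t` never calls `get`, so in this model it pays no conversion cost at all, which is consistent with Query A benefiting the most.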
Generic UDF and UDAF
  • Let UDF and UDAF accept complex-type parameters
  • Integrate UDF and UDAF with Writables
        // intWritable is a field that is reused across rows instead of being allocated per call
        public IntWritable evaluate(IntWritable a, IntWritable b) {
          intWritable.set((int)(a.get() + b.get()));
          return intWritable;
        }
HQL Optimizations
  • Predicate Pushdown
  • Merging n-way join
  • Column Pruning
Open Source Community
Open Source Community
  • 21 contributors and growing
      • 6 contributors within Facebook
  • Contributors from:
      • Academia
      • Other web companies
      • Etc.
  • 7 committers
      • 1 external to Facebook and looking to add more here
  • 50 JIRAs fixed in last month
  • 218 JIRAs still open
  • 125 mails in last month on hive-user@
  • 600 mails in last month on hive-dev@
  • Various companies/universities
      • Adknowledge, Admob
      • Berkeley, Chinese Academy of Science
  • Demonstration in VLDB'2009
Deployment Options
  • EC2
      http://wiki.apache.org/hadoop/Hive/HiveAws/HivingS3nRemotely
  • Cloudera Virtual Machine
      http://www.cloudera.com/hadoop-training-hive-tutorial
  • Your own cluster
      http://wiki.apache.org/hadoop/Hive/GettingStarted
  • Hive can directly consume data on Hadoop
        CREATE EXTERNAL TABLE mytable (key STRING, value STRING)
        LOCATION '/user/abc/mytable';
Future Work
  • Benchmark & Performance
  • Integration with BI tools (through JDBC/ODBC)
  • Indexing
  • More on Hive Roadmap
      http://wiki.apache.org/hadoop/Hive/Roadmap
  • Machine Learning Integration
  • Real-time Streaming
Information
  • Available as a sub-project in Hadoop
      • http://wiki.apache.org/hadoop/Hive (wiki)
      • http://hadoop.apache.org/hive (home page)
      • http://svn.apache.org/repos/asf/hadoop/hive (SVN repo)
      • ##hive (IRC)
      • Works with hadoop-0.17, 0.18, 0.19
  • Release 0.3 is out and more are coming
  • Mailing Lists:
      • hive-{user,dev,commits}@hadoop.apache.org
Contributors
  Aaron Newton
  Ashish Thusoo
  David Phillips
  Dhruba Borthakur
  Edward Capriolo
  Eric Hwang
  Hao Liu
  He Yongqiang
  Jeff Hammerbacher
  Johan Oskarsson
  Josh Ferguson
  Joydeep Sen Sarma
  Kim P.
  Michi Mutsuzaki
  Min Zhou
  Namit Jain
  Neil Conway
  Pete Wyckoff
  Prasad Chakka
  Raghotham Murthy
  Richard Lee
  Shyam Sundar Sarkar
  Suresh Antony
  Venky Iyer
  Zheng Shao
Questions


Editor's Notes

  • #6: What is this? This is a huge amount of data. Along with the fast growth of active users on Facebook, the size of our data is exploding. In the last 12 months, the amount of data increased by 500%. These data are very valuable. They can be used to understand user behavior, measure the impact of a new product, and make data-based decisions. Traditionally, people store data in data warehouse solutions on top of Oracle and MySQL. In recent years, we are also seeing new proprietary solutions like AsterData and Netezza. However, these solutions either do not scale to the amount of data that we have, or they are too inflexible to satisfy our data analysis requirements. In order to provide the capability to analyze the huge amount of data that we have, we started the Hive project. Hive is based on Hadoop but does much more than Hadoop. We will show the details in the following slides.