HIVE: Data Warehousing & Analytics on Hadoop
Facebook Data Team

Why Another Data Warehousing System?
  • Problem: Data, data and more data
      • 200GB per day in March 2008
      • 2+TB (compressed) raw data per day today
  • The Hadoop Experiment
      • Availability and scalability far superior to commercial DBs
      • Efficiency not that great, but we can throw more hardware at it
      • Partial availability/resilience/scale more important than ACID
  • Problem: Programmability and Metadata
      • Map-reduce is hard to program (users know SQL/bash/Python)
      • Need to publish data in well-known schemas
  • Solution: HIVE

What is HIVE?
  • A system for querying and managing structured data built on top of Hadoop
      • Uses Map-Reduce for execution
      • Uses HDFS for storage, but works with any system that implements the Hadoop FS API
  • Key Building Principles:
      • Structured data with rich data types (structs, lists and maps)
      • Directly query data from different formats (text/binary) and file formats (Flat/Sequence)
      • SQL as a familiar programming tool and for standard analytics
      • Allow embedded scripts for extensibility and for non-standard applications
      • Rich metadata to allow data discovery and for optimization

Data Warehousing at Facebook Today
  [Diagram: Web Servers, Scribe Servers, Filers, Hive on Hadoop Cluster, Oracle RAC, Federated MySQL]

Hive/Hadoop Usage @ Facebook
  • Types of Applications:
      • Summarization
          • E.g. daily/weekly aggregations of impression/click counts
          • Complex measures of user engagement
      • Ad hoc Analysis
          • E.g. how many group admins, broken down by state/country
      • Data Mining (assembling training data)
          • E.g. user engagement as a function of user attributes
      • Spam Detection
          • Anomalous patterns in UGC
          • Application API usage patterns
      • Ad Optimization
      • Too many to count ...

Hadoop Usage @ Facebook
  • Data statistics:
      • Total data: 180TB (mostly compressed)
      • Net data added/day: 2+TB (compressed)
          • 6TB of uncompressed source logs
          • 4TB of uncompressed dimension data reloaded daily
  • Usage statistics:
      • 3200 jobs/day with 800K map-reduce tasks/day
      • 55TB of compressed data scanned daily
      • 15TB of compressed output data written to HDFS
      • 80 MM compute minutes/day

HIVE: Components
  [Diagram: Hive CLI (DDL, Queries, Browsing); Mgmt. Web UI; Hive QL (Parser, Planner, Execution); SerDe (Thrift, Jute, JSON..); MetaStore; Thrift API; Map Reduce; HDFS]

Data Model
  [Diagram: MetaStore (Schema Library, Partitioning Cols, Bucketing Info; table "clicks" with #Buckets=32) mapped to HDFS paths]
  • Tables: /hive/clicks
  • Logical Partitioning: /hive/clicks/ds=2008-03-25
  • Hash Partitioning: /hive/clicks/ds=2008-03-25/0 …

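The path layout on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `bucket_path` is hypothetical, and CRC32 stands in for whatever hash function Hive actually applies to the bucketing column, which the slide does not specify.

```python
import zlib

def bucket_path(table_root, partition_col, partition_value, bucket_key, num_buckets=32):
    """Build <table>/<partition col>=<value>/<bucket number> for one row."""
    # Assumption: CRC32 is a stand-in hash; Hive's real hash is not given on the slide.
    bucket = zlib.crc32(str(bucket_key).encode()) % num_buckets
    return f"{table_root}/{partition_col}={partition_value}/{bucket}"

# A row of the "clicks" table for ds=2008-03-25, bucketed on a made-up user id.
example_path = bucket_path("/hive/clicks", "ds", "2008-03-25", bucket_key=111)
```

The logical partition (`ds=...`) is chosen by the column value itself, while the bucket directory is chosen by hashing, so rows with the same bucketing key always land in the same numbered directory.
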
Dealing with Structured Data
  • Type system
      • Primitive types
      • Recursively build up using Composition/Maps/Lists
  • ObjectInspector interface for user-defined types
      • To recursively list schema
      • To recursively access fields within a row object
  • Generic (De)Serialization Interface (SerDe)
  • Serialization families implement interface
      • Thrift DDL based SerDe
      • Delimited text based SerDe
      • You can write your own SerDe (XML, JSON …)

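To make the SerDe idea concrete, here is a rough Python sketch of a delimited-text serializer/deserializer. It is not Hive's Java SerDe interface; the class name `DelimitedTextSerDe` and its methods are hypothetical and only show the two directions a SerDe has to cover.

```python
class DelimitedTextSerDe:
    """Hypothetical delimited-text SerDe: turns a line into a typed row and back."""

    def __init__(self, columns, delimiter="\t"):
        self.columns = columns        # list of (column name, type constructor)
        self.delimiter = delimiter

    def deserialize(self, line):
        # Split the raw line and cast each field to its declared column type.
        fields = line.rstrip("\n").split(self.delimiter)
        return {name: cast(raw) for (name, cast), raw in zip(self.columns, fields)}

    def serialize(self, row):
        # Write the columns back out in schema order.
        return self.delimiter.join(str(row[name]) for name, _ in self.columns)

serde = DelimitedTextSerDe([("pageid", int), ("userid", int), ("time", str)], delimiter=",")
row = serde.deserialize("1,111,9:08:01\n")
line = serde.serialize(row)
```

Because the query engine only talks to the two methods, swapping the storage format means swapping the SerDe, not the query.
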
MetaStore
  • Stores Table/Partition properties:
      • Table schema and SerDe library
      • Table location on HDFS
      • Logical partitioning keys and types
      • Partition level metadata
      • Other information
  • Thrift API
      • Current clients in PHP (Web Interface), Python (interface to Hive), Java (Query Engine and CLI)
  • Metadata stored in any SQL backend
  • Future
      • Statistics
      • Schema Evolution

Hive Query Language
  • Basic SQL
      • From clause subquery
      • ANSI JOIN (equi-join only)
      • Multi-table Insert
      • Multi group-by
      • Sampling
      • Objects traversal
  • Extensibility
      • Pluggable Map-reduce scripts using TRANSFORM

Running Custom Map/Reduce Scripts
    FROM (
      FROM pv_users
      SELECT TRANSFORM(pv_users.userid, pv_users.date)
      USING 'map_script'
      AS (dt, uid)
      CLUSTER BY (dt)
    ) map
    INSERT INTO TABLE pv_users_reduced
    SELECT TRANSFORM(map.dt, map.uid)
    USING 'reduce_script'
    AS (date, count);

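The slide does not show what `map_script` and `reduce_script` contain. The Python sketch below is one plausible pair, assuming rows are exchanged as tab-separated text lines and that the goal is a row count per date; both the logic and the function names are illustrative assumptions, not the scripts used at Facebook.

```python
from itertools import groupby

def map_script(lines):
    """Turn (userid, date) rows into (dt, uid) rows, tab-separated."""
    for line in lines:
        userid, date = line.rstrip("\n").split("\t")
        yield f"{date}\t{userid}"

def reduce_script(lines):
    """Count rows per dt; relies on all rows for one dt arriving together."""
    rows = (line.rstrip("\n").split("\t") for line in lines)
    for dt, group in groupby(rows, key=lambda fields: fields[0]):
        yield f"{dt}\t{sum(1 for _ in group)}"

# As standalone scripts, each function would be fed sys.stdin and its output printed.
# Here sorted() stands in for the grouping that CLUSTER BY (dt) provides.
mapped = sorted(map_script(["111\t2008-03-25\n", "222\t2008-03-25\n", "111\t2008-03-26\n"]))
reduced = list(reduce_script(mapped))
```

The reduce side only works because `CLUSTER BY (dt)` delivers all rows with the same `dt` to the same reducer in sorted order; without it, `groupby` would see the same date in several separate runs.
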
(Simplified) Map Reduce Review
  Stage            Machine 1                           Machine 2
  Input            <k1, v1> <k2, v2> <k3, v3>          <k4, v4> <k5, v5> <k6, v6>
  Local Map        <nk1, nv1> <nk2, nv2> <nk3, nv3>    <nk2, nv4> <nk2, nv5> <nk1, nv6>
  Global Shuffle   <nk2, nv4> <nk2, nv5> <nk2, nv2>    <nk1, nv1> <nk3, nv3> <nk1, nv6>
  Local Sort       <nk2, nv4> <nk2, nv5> <nk2, nv2>    <nk1, nv1> <nk1, nv6> <nk3, nv3>
  Local Reduce     <nk2, 3>                            <nk1, 2> <nk3, 1>

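The four stages on this slide can be imitated in a single Python process. The sketch below is a toy model under stated assumptions: `run_mapreduce` is a hypothetical helper, CRC32 is an arbitrary partitioner, and each input record is written as a pair whose second element is the new key the mapper should emit.

```python
import zlib
from collections import defaultdict

def run_mapreduce(splits, mapper, reducer, num_reducers=2):
    """Simulate local map, global shuffle, local sort and local reduce."""
    # Local map: every record of every split goes through the mapper.
    mapped = [kv for split in splits for record in split for kv in mapper(record)]
    # Global shuffle: all pairs sharing a key are routed to the same reducer.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in mapped:
        partitions[zlib.crc32(key.encode()) % num_reducers][key].append(value)
    # Local sort, then local reduce: each reducer walks its keys in order.
    result = {}
    for partition in partitions:
        for key in sorted(partition):
            result[key] = reducer(key, partition[key])
    return result

splits = [
    [("k1", "nk1"), ("k2", "nk2"), ("k3", "nk3")],   # machine 1
    [("k4", "nk2"), ("k5", "nk2"), ("k6", "nk1")],   # machine 2
]
counts = run_mapreduce(
    splits,
    mapper=lambda record: [(record[1], 1)],          # emit <new key, 1>
    reducer=lambda key, values: sum(values),         # count values per key
)
```

With a counting reducer this reproduces the slide's final column: three values for `nk2`, two for `nk1` and one for `nk3`.
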
Hive QL – Join
  SQL:
    INSERT INTO TABLE pv_users
    SELECT pv.pageid, u.age
    FROM page_view pv JOIN user u ON (pv.userid = u.userid);

  page_view
    pageid  userid  time
    1       111     9:08:01
    2       111     9:08:13
    1       222     9:08:14

  user
    userid  age  gender
    111     25   female
    222     32   male

  pv_users (= page_view X user)
    pageid  age
    1       25
    2       25
    1       32

Hive QL – Join in Map Reduce
  Map input: page_view
    pageid  userid  time
    1       111     9:08:01
    2       111     9:08:13
    1       222     9:08:14

  Map output from page_view
    key  value
    111  <1, 1>
    111  <1, 2>
    222  <1, 1>

  Map input: user
    userid  age  gender
    111     25   female
    222     32   male

  Map output from user
    key  value
    111  <2, 25>
    222  <2, 32>

  Shuffle / Sort
    Reducer 1          Reducer 2
    key  value         key  value
    111  <1, 1>        222  <1, 1>
    111  <1, 2>        222  <2, 32>
    111  <2, 25>

  Reduce output: pv_users
    Reducer 1          Reducer 2
    pageid  age        pageid  age
    1       25         1       32
    2       25

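The tables on this slide suggest that each map output value is a pair of a table tag (1 for page_view, 2 for user) and the column carried forward. Under that reading, a reduce-side join can be sketched in Python as follows; the function names are hypothetical and a dictionary stands in for the shuffle.

```python
from collections import defaultdict
from itertools import product

page_view = [(1, 111, "9:08:01"), (2, 111, "9:08:13"), (1, 222, "9:08:14")]
user = [(111, 25, "female"), (222, 32, "male")]

def map_page_view(row):
    pageid, userid, _time = row
    return userid, (1, pageid)       # tag 1 marks a page_view row

def map_user(row):
    userid, age, _gender = row
    return userid, (2, age)          # tag 2 marks a user row

def reduce_join(values):
    # Separate the two sides by tag, then pair every pageid with every age.
    pageids = [value for tag, value in values if tag == 1]
    ages = [value for tag, value in values if tag == 2]
    return list(product(pageids, ages))

def join():
    groups = defaultdict(list)       # stands in for shuffle: group by join key
    for key, value in [map_page_view(r) for r in page_view] + [map_user(r) for r in user]:
        groups[key].append(value)
    pv_users = []
    for key in sorted(groups):       # stands in for sort
        pv_users.extend(reduce_join(sorted(groups[key])))
    return pv_users

pv_users = join()
```

The tag is what lets a single reducer tell which table a value came from once rows of both tables have been mixed under the same key.
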
Joins
  • Outer Joins
    INSERT INTO TABLE pv_users
    SELECT pv.*, u.gender, u.age
    FROM page_view pv FULL OUTER JOIN user u ON (pv.userid = u.userid)
    WHERE pv.date = '2008-03-03';

Join To Map Reduce
  • Only equality joins with conjunctions supported
  • Future
      • Pruning of values sent from map to reduce on the basis of projections
      • Make Cartesian product more memory efficient
      • Map side joins
          • Hash joins if one of the tables is very small
          • Exploit pre-sorted data by doing map-side merge join

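The map-side hash join listed under future work can be pictured with a short Python sketch. It assumes the small table fits in memory on every mapper; `map_side_hash_join` is a hypothetical name and this is not Hive's implementation.

```python
def map_side_hash_join(big_rows, small_rows, big_key, small_key):
    """Join inside the mapper: no shuffle, sort or reduce stage is needed."""
    # Build phase: load the small table into an in-memory hash table.
    table = {}
    for row in small_rows:
        table.setdefault(row[small_key], []).append(row)
    # Probe phase: stream the big table and look each row up by its join key.
    for row in big_rows:
        for match in table.get(row[big_key], []):
            yield {**row, **match}

page_view = [
    {"pageid": 1, "userid": 111, "time": "9:08:01"},
    {"pageid": 2, "userid": 111, "time": "9:08:13"},
    {"pageid": 1, "userid": 222, "time": "9:08:14"},
]
user = [
    {"userid": 111, "age": 25, "gender": "female"},
    {"userid": 222, "age": 32, "gender": "male"},
]
joined = [(r["pageid"], r["age"]) for r in map_side_hash_join(page_view, user, "userid", "userid")]
```

The trade-off is memory for network traffic: the big table is never shuffled, but every mapper has to hold a full copy of the small one.
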
Hive Optimizations – Merge Sequential Map Reduce Jobs
  SQL:
    FROM (a join b on a.key = b.key) join c on a.key = c.key
    SELECT …

  [Diagram: A and B feed a Map Reduce job producing AB; AB and C feed a second Map Reduce job producing ABC]

  A                B                C
    key  av          key  bv          key  cv
    1    111         1    222         1    333

  AB
    key  av   bv
    1    111  222

  ABC
    key  av   bv   cv
    1    111  222  333

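Since both joins in this query use the same key, the two jobs can in principle share one shuffle. The Python sketch below illustrates that idea only; `single_job_join` is a hypothetical helper, and a dictionary stands in for the shuffle.

```python
from collections import defaultdict
from itertools import product

def single_job_join(tables):
    """Inner-join any number of (key, value) tables on the same key in one pass."""
    # One shuffle: collect the values of every table under the shared key,
    # keeping one list per table so the reducer can tell them apart.
    groups = defaultdict(lambda: [[] for _ in tables])
    for tag, rows in enumerate(tables):
        for key, value in rows:
            groups[key][tag].append(value)
    # One reduce: emit every combination of one value per table.
    for key in sorted(groups):
        for combo in product(*groups[key]):
            yield (key, *combo)

a = [(1, 111)]
b = [(1, 222)]
c = [(1, 333)]
abc = list(single_job_join([a, b, c]))
```

A key that is missing from any one table produces an empty product and therefore no output row, which matches inner-join semantics.
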
Hive QL – Group By
    SELECT pageid, age, count(1)
    FROM pv_users
    GROUP BY pageid, age;

  pv_users
    pageid  age
    1       25
    2       25
    1       32
    2       25

  Result
    pageid  age  count
    1       25   1
    2       25   2
    1       32   1

Hive QL – Group By in Map Reduce
  Map input: pv_users (two splits)
    Split 1            Split 2
    pageid  age        pageid  age
    1       25         1       32
    2       25         2       25

  Map output
    Split 1            Split 2
    key     value      key     value
    <1,25>  1          <1,32>  1
    <2,25>  1          <2,25>  1

  Shuffle / Sort
    Reducer 1          Reducer 2
    key     value      key     value
    <1,25>  1          <2,25>  1
    <1,32>  1          <2,25>  1

  Reduce output
    Reducer 1               Reducer 2
    pageid  age  count      pageid  age  count
    1       25   1          2       25   2
    1       32   1

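The same flow can be written as a small Python sketch, assuming the mapper emits the composite key (pageid, age) with the value 1. The name `group_by_count` is hypothetical, and the dictionary again stands in for the shuffle.

```python
from collections import defaultdict

def group_by_count(splits):
    """Simulate SELECT pageid, age, count(1) ... GROUP BY pageid, age."""
    # Map: emit <(pageid, age), 1> for every input row.
    mapped = [((pageid, age), 1) for split in splits for pageid, age in split]
    # Shuffle and sort: bring equal keys together.
    groups = defaultdict(list)
    for key, one in mapped:
        groups[key].append(one)
    # Reduce: sum the ones for each key.
    return [(pageid, age, sum(ones)) for (pageid, age), ones in sorted(groups.items())]

splits = [[(1, 25), (2, 25)], [(1, 32), (2, 25)]]
grouped = group_by_count(splits)
```
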
Hive QL – Group By with Distinct
    SELECT pageid, COUNT(DISTINCT userid)
    FROM page_view
    GROUP BY pageid

  page_view
    pageid  userid  time
    1       111     9:08:01
    2       111     9:08:13
    1       222     9:08:14
    2       111     9:08:20

  Result
    pageid  count_distinct_userid
    1       2
    2       1

Hive QL – Group By with Distinct in Map Reduce
  Map input: page_view (two splits)
    Split 1                      Split 2
    pageid  userid  time         pageid  userid  time
    1       111     9:08:01      1       222     9:08:14
    2       111     9:08:13      2       111     9:08:20

  Shuffle and Sort (key = <pageid, userid>, no value)
    Reducer 1                    Reducer 2
    <1,111>                      <1,222>
    <2,111>
    <2,111>

  Reduce output (partial counts)
    Reducer 1                    Reducer 2
    pageid  count                pageid  count
    1       1                    1       1
    2       1

  Map Reduce (second job, final counts)
    pageid  count
    1       2
    2       1

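The two-job plan on this slide can be approximated in Python. In this sketch a set replaces the sort-based removal of duplicate keys, and routing by `userid % num_reducers` is an arbitrary partitioner chosen for the example; `count_distinct` is a hypothetical name.

```python
from collections import defaultdict

def count_distinct(splits, num_reducers=2):
    """Simulate SELECT pageid, COUNT(DISTINCT userid) ... GROUP BY pageid."""
    # Job 1 shuffle: route each <pageid, userid> key to a reducer. Identical
    # keys always reach the same reducer, where the set drops the duplicates.
    partitions = [set() for _ in range(num_reducers)]
    for split in splits:
        for pageid, userid, _time in split:
            partitions[userid % num_reducers].add((pageid, userid))
    # Job 1 reduce: every reducer emits a partial distinct count per page.
    partials = []
    for keys in partitions:
        counts = defaultdict(int)
        for pageid, _userid in keys:
            counts[pageid] += 1
        partials.extend(counts.items())
    # Job 2: add up the partial counts for each page.
    totals = defaultdict(int)
    for pageid, partial in partials:
        totals[pageid] += partial
    return dict(totals)

splits = [
    [(1, 111, "9:08:01"), (2, 111, "9:08:13")],
    [(1, 222, "9:08:14"), (2, 111, "9:08:20")],
]
distinct_counts = count_distinct(splits)
```

The second job is needed because one page's users may be spread over several reducers in the first job, so no single reducer sees the full count.
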
Group by Future optimizations
  • Map side partial aggregations
      • Hash based aggregates
      • Exploit pre-sorted data for distinct counts
  • Partial aggregations in Combiner
  • Be smarter about how to avoid multiple stages
  • Exploit table/column statistics for deciding strategy

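Map-side partial aggregation with a hash table, the first item on this list, might look roughly like the Python sketch below. The function names are hypothetical and the example only shows the principle of shrinking map output before the shuffle.

```python
from collections import Counter

def map_with_partial_aggregation(split):
    """Aggregate inside the mapper with a hash table before emitting anything."""
    partial = Counter()
    for pageid, age in split:
        partial[(pageid, age)] += 1
    return list(partial.items())      # one pair per distinct key, not one per row

def reduce_sum(all_partials):
    """Add up the partial counts coming from all mappers."""
    total = Counter()
    for partials in all_partials:
        for key, count in partials:
            total[key] += count
    return dict(total)

first = map_with_partial_aggregation([(1, 25), (2, 25), (2, 25)])
second = map_with_partial_aggregation([(1, 32), (2, 25)])
totals = reduce_sum([first, second])
```

Here the first mapper ships two pairs instead of three; on real data with few distinct keys per split, the saving in shuffled data can be much larger.
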
Inserts into Files, Tables and Local Files
    FROM pv_users
    INSERT INTO TABLE pv_gender_sum
      SELECT pv_users.gender, count_distinct(pv_users.userid)
      GROUP BY (pv_users.gender)
    INSERT INTO DIRECTORY '/user/facebook/tmp/pv_age_sum.dir'
      SELECT pv_users.age, count_distinct(pv_users.userid)
      GROUP BY (pv_users.age)
    INSERT INTO LOCAL DIRECTORY '/home/me/pv_age_sum.dir'
      FIELDS TERMINATED BY ',' LINES TERMINATED BY \013
      SELECT pv_users.age, count_distinct(pv_users.userid)
      GROUP BY (pv_users.age);

Future Work
  • Cost-based optimization
  • Multiple interfaces (JDBC…)
  • Performance comparisons with similar work (PIG)
  • SQL compliance (order by, nested queries…)
  • Integration with BI tools
  • Data compression
      • Columnar storage schemes
      • Exploit lazy/functional Hive field retrieval interfaces
  • Better data locality
      • Co-locate hash partitions on same rack
      • Exploit high intra-rack bandwidth for merge joins

Hive Performance
  • Full table aggregate (not grouped)
  • Input data size: 1,407,867,660 (32 files)
  • Count in mapper and 2 map-reduce jobs for sum
  • Time taken: 30 seconds
  • Test cluster: 10 nodes

    from (
      from test t
      select transform (t.userid) using 'myCount' as (cnt)
    ) mout
    select sum(mout.cnt);

Hadoop Challenges @ Facebook
  • QOS/Isolation:
      • Big jobs can hog the cluster
      • JobTracker memory as limited resource
      • Limit memory impact of runaway tasks
      • Fair Scheduler (Matei)
  • Protection
      • What if a software bug corrupts the NameNode transaction log/image?
      • HDFS Snapshots (Dhruba)
  • Data Archival
      • Not all data is hot and needs colocation with compute
      • HDFS Symlinks (Dhruba)
  • Performance
      • Really hard to understand what the bottlenecks are

Conclusion
  • Available as a contrib project in Hadoop
      • http://svn.apache.org/repos/asf/hadoop/core/
      • Check out src/contrib/hive from trunk (works against 0.19 onwards)
  • Latest distributions (including for hadoop-0.17) at: http://mirror.facebook.com/facebook/hive/
  • People:
      • Suresh Anthony
      • Zheng Shao
      • Prasad Chakka
      • Pete Wyckoff
      • Namit Jain
      • Raghu Murthy
      • Joydeep Sen Sarma
      • Ashish Thusoo

