Data Warehousing & Analytics on Hadoop
Ashish Thusoo, Prasad Chakka
Facebook Data Team

Why Another Data Warehousing System?
  Data, data and more data:
  - 200 GB per day in March 2008
  - 2+ TB (compressed) of raw data per day today

Let's try Hadoop…
  Pros:
  - Superior availability, scalability and manageability
  - Efficiency is not that great, but we can throw more hardware at it
  - Partial availability, resilience and scale matter more than ACID
  Cons: programmability and metadata
  - Map-reduce is hard to program (users know SQL/bash/Python)
  - Need to publish data in well-known schemas
  Solution: HIVE

What is HIVE?
  A system for managing and querying structured data, built on top of Hadoop:
  - Map-Reduce for execution
  - HDFS for storage
  - Metadata on raw files
  Key building principles:
  - SQL as a familiar data warehousing tool
  - Extensibility: types, functions, formats, scripts
  - Scalability and performance

Simplifying Hadoop

    hive> select key, count(1) from kv1 where key > 100 group by key;

  vs.

    $ cat > /tmp/reducer.sh
    uniq -c | awk '{print $2"\t"$1}'
    $ cat > /tmp/map.sh
    awk -F '\001' '{if($1 > 100) print $1}'
    $ bin/hadoop jar contrib/hadoop-0.19.2-dev-streaming.jar \
        -input /user/hive/warehouse/kv1 \
        -mapper map.sh -file /tmp/reducer.sh -file /tmp/map.sh \
        -reducer reducer.sh -output /tmp/largekey -numReduceTasks 1
    $ bin/hadoop dfs -cat /tmp/largekey/part*

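For readers who want to check what both versions compute, here is a minimal Python sketch of the same logic: keep rows whose key exceeds 100, then count rows per key. The function name and the sample rows are hypothetical stand-ins for the kv1 table, not part of the talk.

```python
from collections import Counter


def count_large_keys(rows, threshold=100):
    """Count rows per key, keeping only keys greater than threshold.

    Mirrors: select key, count(1) from kv1 where key > 100 group by key
    """
    counts = Counter(key for key, _value in rows if key > threshold)
    return sorted(counts.items())


# Hypothetical sample rows standing in for the kv1 table.
kv1 = [(86, "val_86"), (238, "val_238"), (311, "val_311"),
       (238, "val_238"), (27, "val_27")]

print(count_large_keys(kv1))  # [(238, 2), (311, 1)]
```

The streaming version needs two scripts and a job submission to express the same filter and count, which is the point the slide makes.
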
Looks like this ..
  [Diagram: a cluster of nodes, each with its own disks, connected by links labeled 1 Gigabit and 4-8 Gigabit. Node = DataNode + Map-Reduce]

Data Warehousing at Facebook Today
  [Diagram with the components: Web Servers, Scribe Servers, Filers, Hive on Hadoop Cluster, Oracle RAC, Federated MySQL]

Hive/Hadoop Usage @ Facebook
  Types of applications:
  - Reporting
    - E.g. daily/weekly aggregations of impression/click counts
    - Complex measures of user engagement
  - Ad hoc analysis
    - E.g. how many group admins, broken down by state/country
  - Data mining (assembling training data)
    - E.g. user engagement as a function of user attributes
  - Spam detection
    - Anomalous patterns for site integrity
    - Application API usage patterns
  - Ad optimization
  - Too many to count ..

Hadoop Usage @ Facebook
  Data statistics:
  - Total data: ~1.7 PB
  - Cluster capacity: ~2.4 PB
  - Net data added per day: ~15 TB
    - 6 TB of uncompressed source logs
    - 4 TB of uncompressed dimension data reloaded daily
  - Compression factor: ~5x (gzip, more with bzip)
  Usage statistics:
  - 3200 jobs/day with 800K map-reduce tasks/day
  - 55 TB of compressed data scanned daily
  - 15 TB of compressed output data written to HDFS
  - 80 MM compute minutes/day

In Pictures
 
HIVE Internals!!
HIVE: Components
  [Architecture diagram. Labeled boxes: Hive CLI (DDL, Queries, Browsing), Web UI, Thrift API, Parser, Planner, Optimizer, Execution, SerDe (Thrift, Jute, JSON..), MetaStore, DB, Map Reduce, HDFS]

Data Model
  [Diagram: the clicks table as laid out in HDFS and described in the MetaStore]
  HDFS:
    /hive/clicks                      (table)
    /hive/clicks/ds=2008-03-25        (logical partitioning)
    /hive/clicks/ds=2008-03-25/0 …    (hash partitioning)
  MetaStore (Metastore DB), per table:
  - Data location
  - Bucketing info
  - Partitioning cols

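The directory layout on this slide can be sketched in a few lines of Python. This is an illustration only: the helper names, the `/hive` root default and the CRC32-based bucket choice are assumptions made for the example, and Hive's own hash function may well differ.

```python
import zlib


def bucket_for(key, num_buckets):
    """Pick a bucket by hashing the key (hash partitioning).

    Assumption: a stable CRC32 hash modulo the bucket count; this is not
    necessarily the hash function Hive uses.
    """
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


def data_path(table, partition_col, partition_val, bucket=None, root="/hive"):
    """Build the HDFS directory for a partition, or the file for one bucket."""
    path = f"{root}/{table}/{partition_col}={partition_val}"
    if bucket is not None:
        path = f"{path}/{bucket}"
    return path


print(data_path("clicks", "ds", "2008-03-25"))
print(data_path("clicks", "ds", "2008-03-25", bucket=0))
```

Because the partition value is part of the path, a query restricted to one `ds` value only has to read that directory.
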
Hive Query Language
  SQL:
  - Subqueries in FROM clause
  - Equi-joins
  - Multi-table insert
  - Multi-group-by
  - Sampling
  - Complex object types
  Extensibility:
  - Pluggable map-reduce scripts
  - Pluggable user-defined functions
  - Pluggable user-defined types
  - Pluggable data formats

Map Reduce Example
  [Diagram: key/value pairs flowing through each stage, spread over Machine 1 and Machine 2]
  Input:                 <k1, v1> <k2, v2> <k3, v3> <k4, v4> <k5, v5> <k6, v6>
  After Local Map:       <nk1, nv1> <nk2, nv2> <nk3, nv3> <nk2, nv4> <nk2, nv5> <nk1, nv6>
  After Global Shuffle:  <nk2, nv4> <nk2, nv5> <nk2, nv2> <nk1, nv1> <nk3, nv3> <nk1, nv6>
  After Local Sort:      <nk1, nv1> <nk1, nv6> <nk3, nv3> <nk2, nv4> <nk2, nv5> <nk2, nv2>
  After Local Reduce:    <nk2, 3> <nk1, 2> <nk3, 1>

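The four stages on this slide (local map, global shuffle, local sort, local reduce) can be simulated in a single process. The sketch below is an illustration under stated assumptions: `run_map_reduce`, the CRC32 partitioner and the key mapping table are hypothetical names chosen for the example, and the reduce function simply counts values, which matches the <nk2, 3> <nk1, 2> <nk3, 1> output shown.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    # Assumption: a stable hash decides which reducer receives a key.
    return zlib.crc32(repr(key).encode("utf-8")) % num_reducers


def run_map_reduce(splits, map_fn, reduce_fn, num_reducers=2):
    partitions = [[] for _ in range(num_reducers)]
    for split in splits:                          # local map, one split per machine
        for record in split:
            for key, value in map_fn(record):
                # global shuffle: all pairs with the same key reach one reducer
                partitions[partition(key, num_reducers)].append((key, value))
    results = {}
    for part in partitions:
        part.sort(key=lambda pair: pair[0])       # local sort by key
        grouped = defaultdict(list)
        for key, value in part:
            grouped[key].append(value)
        for key, values in grouped.items():       # local reduce, once per key
            results[key] = reduce_fn(key, values)
    return results


# The slide's example: six input pairs on two machines, mapped to new keys.
new_key = {"k1": "nk1", "k2": "nk2", "k3": "nk3",
           "k4": "nk2", "k5": "nk2", "k6": "nk1"}
splits = [[("k1", "v1"), ("k2", "v2"), ("k3", "v3")],
          [("k4", "v4"), ("k5", "v5"), ("k6", "v6")]]

counts = run_map_reduce(
    splits,
    map_fn=lambda rec: [(new_key[rec[0]], "n" + rec[1])],
    reduce_fn=lambda key, values: len(values),
)
print(counts)
```

The result holds the counts nk1: 2, nk2: 3 and nk3: 1 regardless of how many reducers are used, since the shuffle guarantees that each key is reduced in exactly one place.
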
Hive QL – Join

    INSERT INTO TABLE pv_users
    SELECT pv.pageid, u.age
    FROM page_view pv JOIN user u ON (pv.userid = u.userid);

Hive QL – Join in Map Reduce

  Input tables
  page_view                        user
  pageid  userid  time             userid  age  gender
  1       111     9:08:01          111     25   female
  2       111     9:08:13          222     32   male
  1       222     9:08:14

  Map output
  from page_view                   from user
  key  value                       key  value
  111  <1, 1>                      111  <2, 25>
  111  <1, 2>                      222  <2, 32>
  222  <1, 1>

  After Shuffle / Sort (reducer input)
  key  value                       key  value
  111  <1, 1>                      222  <1, 1>
  111  <1, 2>                      222  <2, 32>
  111  <2, 25>

  Reduce output (pv_users)
  pageid  age                      pageid  age
  1       25                       1       32
  2       25

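The join on this slide is a reduce-side join: each mapper emits the join key (userid) plus a value tagged with its source table, and the reducer pairs up the two tags for each key. A small Python simulation with the slide's data, assuming tag 1 marks page_view rows and tag 2 marks user rows as the <1, …> and <2, …> values suggest (the function names are hypothetical):

```python
from collections import defaultdict

page_view = [(1, 111, "9:08:01"), (2, 111, "9:08:13"), (1, 222, "9:08:14")]
user = [(111, 25, "female"), (222, 32, "male")]


def map_page_view(row):
    pageid, userid, _time = row
    return userid, (1, pageid)      # tag 1: value came from page_view


def map_user(row):
    userid, age, _gender = row
    return userid, (2, age)         # tag 2: value came from user


def reduce_join(values):
    """Combine every page_view value with every user value for one userid."""
    pageids = [v for tag, v in values if tag == 1]
    ages = [v for tag, v in values if tag == 2]
    return [(pageid, age) for pageid in pageids for age in ages]


def join():
    groups = defaultdict(list)      # shuffle: group tagged values by userid
    for row in page_view:
        key, value = map_page_view(row)
        groups[key].append(value)
    for row in user:
        key, value = map_user(row)
        groups[key].append(value)
    pv_users = []
    for userid in sorted(groups):   # sort, then one reduce call per key
        pv_users.extend(reduce_join(sorted(groups[userid])))
    return pv_users


print(join())  # [(1, 25), (2, 25), (1, 32)]
```
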
Hive QL – Group By

    SELECT pageid, age, count(1)
    FROM pv_users
    GROUP BY pageid, age;

Hive QL – Group By in Map Reduce

  Input (pv_users, two splits)
  pageid  age                      pageid  age
  1       25                       1       32
  2       25                       2       25

  Map output
  key     value                    key     value
  <1,25>  1                        <1,32>  1
  <2,25>  1                        <2,25>  1

  After Shuffle / Sort (reducer input)
  key     value                    key     value
  <1,25>  1                        <2,25>  1
  <1,32>  1                        <2,25>  1

  Reduce output
  pageid  age  count               pageid  age  count
  1       25   1                   2       25   2
  1       32   1

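The same flow in a few lines of Python, using the slide's two input splits. The mapper emits <(pageid, age), 1> for every row and the reducer sums the ones for each key; the function name is hypothetical.

```python
from collections import defaultdict

# pv_users as two input splits, as on the slide.
pv_users_splits = [[(1, 25), (2, 25)], [(1, 32), (2, 25)]]


def group_by_count(splits):
    shuffled = defaultdict(list)
    for split in splits:
        for pageid, age in split:
            shuffled[(pageid, age)].append(1)     # map: emit <(pageid, age), 1>
    # sort by key, then reduce: sum the ones collected for each key
    return [(pageid, age, sum(ones))
            for (pageid, age), ones in sorted(shuffled.items())]


print(group_by_count(pv_users_splits))  # [(1, 25, 1), (1, 32, 1), (2, 25, 2)]
```
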
Group by Optimizations
  - Map-side partial aggregations
    - Hash-based aggregates
    - Serialized key/values in hash tables
  Optimizations being worked on:
  - Exploit pre-sorted data for distinct counts
  - Partial aggregations and combiners
  - Be smarter about how to avoid multiple stages
  - Exploit table/column statistics for deciding strategy

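Map-side partial aggregation means the mapper keeps a hash table of running counts and ships one partial count per key instead of one record per row. The sketch below illustrates the idea only; the flush rule (`max_entries`) and both function names are assumptions for the example, not a description of Hive's implementation.

```python
from collections import defaultdict


def map_with_partial_aggregation(keys, max_entries=100_000):
    """Yield (key, partial_count) pairs from a hash table of running counts.

    Assumption: the table is flushed whenever it grows past max_entries,
    so memory stays bounded even with many distinct keys.
    """
    table = {}
    for key in keys:
        table[key] = table.get(key, 0) + 1
        if len(table) > max_entries:
            yield from table.items()
            table.clear()
    yield from table.items()


def reduce_counts(partials):
    """Sum the partial counts that arrive for each key."""
    totals = defaultdict(int)
    for key, count in partials:
        totals[key] += count
    return dict(totals)


split = [(2, 25), (2, 25), (1, 25)]
partials = list(map_with_partial_aggregation(split))
print(partials)                 # [((2, 25), 2), ((1, 25), 1)]
print(reduce_counts(partials))  # {(2, 25): 2, (1, 25): 1}
```

Flushing early never changes the final answer, because the reducer adds up whatever partial counts it receives; it only affects how much data crosses the shuffle.
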
Inserts into Files, Tables and Local Files

    FROM pv_users
    INSERT INTO TABLE pv_gender_sum
      SELECT pv_users.gender, count_distinct(pv_users.userid)
      GROUP BY(pv_users.gender)
    INSERT INTO DIRECTORY '/user/facebook/tmp/pv_age_sum.dir'
      SELECT pv_users.age, count_distinct(pv_users.userid)
      GROUP BY(pv_users.age)
    INSERT INTO LOCAL DIRECTORY '/home/me/pv_age_sum.dir'
      FIELDS TERMINATED BY ',' LINES TERMINATED BY \013
      SELECT pv_users.age, count_distinct(pv_users.userid)
      GROUP BY(pv_users.age);

Extensibility - Custom Map/Reduce Scripts

    FROM (
      FROM pv_users
      MAP (pv_users.userid, pv_users.date)
      USING 'map_script'
      AS (dt, uid)
      CLUSTER BY (dt)) map
    INSERT INTO TABLE pv_users_reduced
      REDUCE (map.dt, map.uid)
      USING 'reduce_script'
      AS (date, count);

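The slide does not show what 'map_script' and 'reduce_script' contain. As a hedged illustration, here is what such scripts might look like in Python, given that Hive passes rows to a script as tab-separated lines on standard input and reads tab-separated lines back. The behavior chosen here is an assumption: the map step swaps (userid, date) into (dt, uid), and the reduce step counts rows per date, relying on CLUSTER BY to deliver all rows for one dt together.

```python
from itertools import groupby


def map_lines(lines):
    """Hypothetical map_script body: 'userid<TAB>date' -> 'dt<TAB>uid'."""
    for line in lines:
        userid, date = line.rstrip("\n").split("\t")
        yield f"{date}\t{userid}"


def reduce_lines(lines):
    """Hypothetical reduce_script body: count rows for each dt.

    Assumes lines arrive clustered by dt, so equal dates are adjacent.
    """
    rows = (line.rstrip("\n").split("\t") for line in lines)
    for dt, group in groupby(rows, key=lambda row: row[0]):
        yield f"{dt}\t{sum(1 for _ in group)}"


# In a real script the input would be sys.stdin and each result would be printed.
sample = ["111\t2008-03-25\n", "222\t2008-03-25\n", "111\t2008-03-26\n"]
mapped = list(map_lines(sample))
reduced = list(reduce_lines(mapped))
print(mapped)
print(reduced)
```

With the three sample rows, the reduce step yields one line for 2008-03-25 with a count of 2 and one line for 2008-03-26 with a count of 1.
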
Open Source Community
  - 21 contributors and growing
    - 6 contributors within Facebook
  - Contributors from:
    - Academia
    - Other web companies
    - Etc..
  - 7 committers
    - 1 external to Facebook, and looking to add more here

Future Work
  - Statistics and cost-based optimization
  - Integration with BI tools (through JDBC/ODBC)
  - Performance improvements
  - More SQL constructs & UDFs
  - Indexing
  - Schema evolution
  - Advanced operators
    - Cubes / frequent item sets / window functions
  Hive roadmap: http://wiki.apache.org/hadoop/Hive/Roadmap

Information
  - Available as a sub-project in Hadoop
    - http://wiki.apache.org/hadoop/Hive (wiki)
    - http://hadoop.apache.org/hive (home page)
    - http://svn.apache.org/repos/asf/hadoop/hive (SVN repo)
    - ##hive (IRC)
  - Works with hadoop-0.17, 0.18, 0.19
  - Release 0.3 is coming in the next few weeks
  - Mailing lists: hive-{user,dev,commits}@hadoop.apache.org

Hive Percona 2009
