  Petabyte Scale Data Warehouse System on Hadoop Ning Zhang Data Infrastructure
Overview Motivations Hive Introduction Hadoop & Hive Deployment and Usage at Facebook Technical Details Hands-on Session
Facebook is a Set of Web Services …
… at Large Scale The social graph is large 500 million monthly active users 250 million daily active users 160 million active objects (groups/events/pages) 130 friend connections per user on average 60 object (groups/events/pages) connections per user on average Activities on the social graph People spend 500 billion minutes per month on FB The average user creates 70 pieces of content each month 25 billion pieces of content are shared each month Millions of search queries per day Facebook is still growing fast New users, features, services …
Under the Hood Data flow from users’ perspective Clients (browser/phone/3rd-party apps) → Web Services → Users Another big topic on the Web Services Getting user feedback from data The developers want to know how a new app/feature is received by the users (A/B test) The advertisers want to know how their ads perform (dashboard/reports/insights) Based on historical data, how to construct a model and predict the future (machine learning) Need data analytics! Data warehouse: ETL, data processing, BI … Closing the loop: decision-making based on analyzing the data (users’ feedback)
Data-driven Business/R&D/Science … DSS is not new but the Web gives it new elements. “In 2009, more data will be generated by individuals than the entire history of mankind through 2008.” -- Andreas Weigend, Harvard Business Review “The center of the universe has shifted from e-business to me-business.” -- same as above “Invariably, simple models and a lot of data trump more elaborate models based on less data.” -- Alon Halevy, Peter Norvig and Fernando Pereira, The Unreasonable Effectiveness of Data
Biggest Challenge at Facebook – growth! Data, data and more data 200 GB/day in March 2008 → 12+ TB/day at the end of 2009 About 8x increase per year Queries, queries and more queries More than 200 unique users query the data warehouse every day 7500+ queries on the production cluster per day, a mixture of ad-hoc queries and ETL/reporting queries Fast, faster and real-time Users expect faster response time on fresher data Data that used to be available for query the next day is now available in minutes
Why not Existing Data Warehousing Systems? Cost of analysis and storage on proprietary systems does not support the trend toward more data Cost based on data size (15 PB costs a lot!) Expensive hardware and support Limited scalability does not support the trend toward more data Products designed decades ago (not suitable for a petabyte DW) ETL is a big bottleneck Long product development & release cycles User requirements change frequently (agile programming practice) Closed and proprietary systems
Let’s try Hadoop… Pros Superior availability/scalability/manageability Efficiency not that great, but can throw more hardware at it Partial availability/resilience/scale more important than ACID Cons: Programmability and Metadata Map-reduce is hard to program (users know SQL/bash/python) Need to publish data in well-known schemas Solution: HIVE
What is HIVE? A system for managing and querying structured data built on top of Hadoop Map-Reduce for execution HDFS for storage Metadata in an RDBMS Key Building Principles: SQL as a familiar data warehousing tool Extensibility – Types, Functions, Formats, Scripts Scalability and Performance Interoperability
Why SQL on Hadoop?
hive> select key, count(1) from kv1 where key > 100 group by key;
vs.
$ cat > /tmp/reducer.sh
uniq -c | awk '{print $2"\t"$1}'
$ cat > /tmp/map.sh
awk -F '\001' '{if($1 > 100) print $1}'
$ bin/hadoop jar contrib/hadoop-0.19.2-dev-streaming.jar -input /user/hive/warehouse/kv1 -mapper map.sh -file /tmp/reducer.sh -file /tmp/map.sh -reducer reducer.sh -output /tmp/largekey -numReduceTasks 1
$ bin/hadoop dfs -cat /tmp/largekey/part*
Hive Architecture
Hive: Familiar Schema Concepts

  Name                                    HDFS Directory
  Table pvs                               /wh/pvs
  Partition ds = 20090801, ctry = US      /wh/pvs/ds=20090801/ctry=US
  Bucket user into 32 buckets             /wh/pvs/ds=20090801/ctry=US/part-00000
    (HDFS file for user hash 0)
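As a rough sketch of the DDL behind this layout (assuming the pvs table and uhash column used in later slides are the ones being bucketed; standard Hive clause syntax):

```sql
-- Each (ds, ctry) value pair becomes a subdirectory under /wh/pvs, and rows
-- are hashed on uhash into 32 bucket files (part-00000 ... part-00031).
CREATE TABLE pvs (uhash INT, pageid INT)
PARTITIONED BY (ds STRING, ctry STRING)
CLUSTERED BY (uhash) INTO 32 BUCKETS;
```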
Column Data Types Primitive Types integer types, float, string, date, boolean Nest-able Collections array<any-type> map<primitive-type, any-type> User-defined types structures with attributes which can be of any-type
Hive Query Language DDL {create/alter/drop} {table/view/partition} create table as select DML Insert overwrite QL Sub-queries in from clause Equi-joins (including Outer joins) Multi-table Insert/dynamic partition insert Sampling Lateral Views Interfaces JDBC/ODBC/Thrift
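A small illustrative sketch of two features listed above, multi-table insert and dynamic-partition insert (the target table names pv_us and pv_byday are hypothetical):

```sql
-- One scan of pvs feeds two targets; in the second insert, the ds partition
-- value is taken from the data itself (dynamic partitioning must be enabled,
-- e.g. hive.exec.dynamic.partition.mode=nonstrict).
FROM pvs
INSERT OVERWRITE TABLE pv_us SELECT uhash, pageid WHERE ctry = 'US'
INSERT OVERWRITE TABLE pv_byday PARTITION (ds) SELECT uhash, pageid, ds;
```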
Hive: Making Optimizations Transparent  Joins: Joins try to reduce the number of map/reduce jobs needed. Memory efficient joins by streaming largest tables. Map Joins User specified small tables stored in hash tables on the mapper No reducer needed Aggregations: Map side partial aggregations Hash-based aggregates Serialized key/values in hash tables 90% speed improvement on Query SELECT count(1) FROM t; Load balancing for data skew
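A map join can also be requested explicitly with a hint; a sketch using the page_view and user tables from the join example later in the deck:

```sql
-- The hinted table u is loaded into an in-memory hash table on every mapper,
-- so the join completes map-side with no reduce phase.
SELECT /*+ MAPJOIN(u) */ pv.pageid, u.age_bkt
FROM page_view pv JOIN user u ON (pv.uhash = u.uhash);
```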
Hive: Making Optimizations Transparent Storage: Column oriented data formats Column and Partition pruning to reduce scanned data Lazy de-serialization of data Plan Execution Parallel Execution of Parts of the Plan
Optimizations Column Pruning Also pushed down to scan in columnar storage (RCFILE) Predicate Pushdown Not pushed below Non-deterministic functions (eg. rand()) Partition Pruning Sample Pruning Handle small files Merge while writing CombinedHiveInputFormat while reading Small Jobs SELECT * with partition predicates in the client Local mode execution
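For example, partition pruning and predicate pushdown together mean that a query like the following (column names borrowed from the pvs example) touches only one partition directory:

```sql
-- Only the ds=20090801/ctry=US directory is scanned; other partitions are
-- pruned at compile time, and the pageid predicate is pushed to the scan.
SELECT pageid FROM pvs
WHERE ds = '20090801' AND ctry = 'US' AND pageid > 100;
```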
Hive: Open & Extensible Different on-disk storage (file) formats Text File, Sequence File, … Different serialization formats and data types LazySimpleSerDe, ThriftSerDe … User-provided map/reduce scripts In any language, use stdin/stdout to transfer data … User-defined Functions Substr, Trim, From_unixtime … User-defined Aggregation Functions Sum, Average … User-defined Table Functions Explode …
MapReduce Scripts Examples add file page_url_to_id.py; add file my_python_session_cutter.py; FROM (SELECT  TRANSFORM(uhash, page_url, unix_time) USING 'page_url_to_id.py' AS (uhash, page_id, unix_time) FROM mylog DISTRIBUTE BY uhash SORT BY uhash, unix_time) mylog2 SELECT  TRANSFORM(uhash, page_id, unix_time) USING 'my_python_session_cutter.py' AS (uhash, session_info) ;
Usage in Facebook
Hive & Hadoop Usage @ Facebook Types of Applications: Reporting  Eg: Daily/Weekly aggregations of impression/click counts Measures of user engagement  Microstrategy reports Ad hoc Analysis Eg: how many group admins broken down by state/country Machine Learning (Assembling training data) Ad Optimization Eg: User Engagement as a function of user attributes Many others
Hadoop & Hive Cluster @ Facebook Hadoop/Hive cluster 13600 cores Raw Storage capacity ~ 17PB 8 cores + 12 TB per node 32 GB RAM per node Two level network topology 1 Gbit/sec from node to rack switch 4 Gbit/sec to top level rack switch 2 clusters One for adhoc users One for strict SLA jobs
Data Flow Architecture at Facebook Web Servers Scribe MidTier Filers Production Hive-Hadoop Cluster Oracle RAC Federated MySQL Scribe-Hadoop Cluster Adhoc Hive-Hadoop Cluster Hive replication
Scribe-HDFS: 101 Scribed Scribed Scribed Scribed Scribed <category, msgs> HDFS Data Node HDFS Data Node HDFS Data Node Append to  /staging/<category>/<file> Scribe-HDFS
Scribe-HDFS: Near real time Hadoop Clusters collocated with the web servers Network is the biggest bottleneck Typical cluster has about 50 nodes. Stats: 50TB/day of raw data logged 99% of the time data is available within 20 seconds
Warehousing at Facebook Instrumentation (PHP/Python etc.) Automatic ETL Continuously copy data to Hive tables Metadata Discovery (CoHive) Query (Hive) Workflow specification and execution (Chronos) Reporting tools Monitoring and alerting
More Real-World Use Cases Bizo: We use Hive for reporting and ad hoc queries. Chitika: … for data mining and analysis … CNET: … for data mining, log analysis and ad hoc queries Digg: … data mining, log analysis, R&D, reporting/analytics Grooveshark: … user analytics, dataset cleaning, machine learning R&D. Hi5: … analytics, machine learning, social graph analysis. HubSpot: … to serve near real-time web analytics. Last.fm: … for various ad hoc queries. Trending Topics: … for log data normalization and building sample data sets for trend detection R&D. VideoEgg: … analyze all the usage data
Technical Details
Data Model External Tables Point to existing data directories in HDFS Can create tables and partitions – partition columns just become annotations to external directories Example: create an external table with partitions (partition columns are declared in PARTITIONED BY, not in the column list):
CREATE EXTERNAL TABLE pvs (uhash INT, pageid INT)
PARTITIONED BY (ds STRING, ctry STRING)
STORED AS TEXTFILE
LOCATION '/path/to/existing/table';
Example: add a partition to the external table:
ALTER TABLE pvs ADD PARTITION (ds='20090801', ctry='US')
LOCATION '/path/to/existing/partition';
Hive QL – Join in Map Reduce

page_view:                        user:
  pageid  uhash  time               uhash  age_bkt  gender
  1       111    9:08:01            111    B3       female
  2       111    9:08:13            222    B4       male
  1       222    9:08:14

Map output:
  key  value          key  value
  111  <1, 1>         111  <2, B3>
  111  <1, 2>         222  <2, B4>
  222  <1, 1>

Shuffle & Sort:
  key  value          key  value
  111  <1, 1>         222  <1, 1>
  111  <1, 2>         222  <2, B4>
  111  <2, B3>

Reduce output (pv_users):
  pageid  age_bkt      pageid  age_bkt
  1       B3           1       B4
  2       B3
Join Optimizations Joins try to reduce the number of map/reduce jobs needed. Memory efficient joins by streaming largest tables. Map Joins User specified small tables stored in hash tables on the mapper No reducer needed
Hive QL – Group By SELECT pageid, age_bkt, count(1) FROM pv_users GROUP BY pageid, age_bkt;
Hive QL – Group By in Map Reduce

pv_users input (two splits):
  pageid  age_bkt        pageid  age_bkt
  1       B3             2       B4
  1       B3             1       B3

Map output (partial aggregates):
  key     value          key     value
  <1,B3>  2              <1,B3>  1
                         <2,B4>  1

Shuffle & Sort:
  key     value          key     value
  <1,B3>  2              <2,B4>  1
  <1,B3>  1

Reduce output:
  pageid  age_bkt  count      pageid  age_bkt  count
  1       B3       3          2       B4       1
Group by Optimizations Map side partial aggregations Hash-based aggregates Serialized key/values in hash tables 90% speed improvement on Query SELECT count(1) FROM t; Load balancing for data skew
Hive Extensibility Features
Hive is an open system Different on-disk storage (file) formats Text File, Sequence File, … Different serialization formats and data types LazySimpleSerDe, ThriftSerDe … User-provided map/reduce scripts In any language, use stdin/stdout to transfer data … User-defined Functions Substr, Trim, From_unixtime … User-defined Aggregation Functions Sum, Average … User-defined Table Functions Explode …
Storage Format Example CREATE TABLE mylog ( uhash BIGINT, page_url STRING, unix_time INT) STORED AS TEXTFILE ; LOAD DATA INPATH '/user/myname/log.txt' INTO TABLE mylog;
Existing File Formats

                                TEXTFILE    SEQUENCEFILE  RCFILE
  Data type                     text only   text/binary   text/binary
  Internal storage order        row-based   row-based     column-based
  Compression                   file-based  block-based   block-based
  Splitable*                    YES         YES           YES
  Splitable* after compression  NO          YES           YES

* Splitable: capable of splitting the file so that a single huge file can be processed by multiple mappers in parallel.
Serialization Formats SerDe is short for serialization/deserialization. It controls the format of a row. Serialized format: Delimited format (tab, comma, ctrl-a …) Thrift Protocols Deserialized (in-memory) format: Java Integer/String/ArrayList/HashMap Hadoop Writable classes User-defined Java Classes (Thrift)
SerDe Examples CREATE TABLE mylog ( uhash  BIGINT, page_url  STRING, unix_time INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' ; CREATE table mylog_rc ( uhash  BIGINT, page_url  STRING, unix_time INT) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe' STORED AS RCFILE;
Existing SerDes

                       LazySimpleSerDe   LazyBinarySerDe (HIVE-640)   BinarySortableSerDe
  serialized format    delimited         proprietary binary           proprietary binary sortable*
  deserialized format  LazyObjects*      LazyBinaryObjects*           Writable

                       ThriftSerDe (HIVE-706)          RegexSerDe         ColumnarSerDe
  serialized format    depends on the Thrift protocol  regex formatted    proprietary column-based
  deserialized format  user-defined classes,           ArrayList<String>  LazyObjects*
                       Java primitive objects

* LazyObjects: deserialize the columns only when accessed.
* BinarySortable: binary format preserving the sort order.
UDF Example

add jar build/ql/test/test-udfs.jar;
CREATE TEMPORARY FUNCTION testlength AS 'org.apache.hadoop.hive.ql.udf.UDFTestLength';
SELECT testlength(page_url) FROM mylog;
DROP TEMPORARY FUNCTION testlength;

UDFTestLength.java:
package org.apache.hadoop.hive.ql.udf;
import org.apache.hadoop.hive.ql.exec.UDF;

public class UDFTestLength extends UDF {
  public Integer evaluate(String s) {
    if (s == null) {
      return null;
    }
    return s.length();
  }
}
UDAF Example

SELECT page_url, count(1) FROM mylog GROUP BY page_url;

public class UDAFCount extends UDAF {
  public static class Evaluator implements UDAFEvaluator {
    private int mCount;
    public void init() { mCount = 0; }
    public boolean iterate(Object o) { if (o != null) mCount++; return true; }
    public Integer terminatePartial() { return mCount; }
    public boolean merge(Integer o) { mCount += o; return true; }
    public Integer terminate() { return mCount; }
  }
}
Comparison of UDF/UDAF vs. M/R scripts

                     UDF/UDAF            M/R scripts
  language           Java                any language
  data format        in-memory objects   serialized streams
  1/1 input/output   supported via UDF   supported
  n/1 input/output   supported via UDAF  supported
  1/n input/output   supported via UDTF  supported
  speed              faster              slower
Hive Interoperability
Interoperability: Interfaces JDBC Enables integration with JDBC-based SQL clients ODBC Enables integration with Microstrategy Thrift Enables writing cross-language clients Main form of integration with the PHP-based Web UI
Interoperability: Microstrategy Beta integration with version 8/9 Free form SQL support Periodically pre-compute the cube
Future Use sort properties to optimize queries IN, EXISTS and correlated sub-queries Statistics Indexes More join optimizations Better techniques for handling skew for a given key


Editor's Notes

  • #2: Polls: How many of you are working or have worked on DW/BI in your organization? How many of you are satisfied with your current solution? How many of you have been using open source solutions in your organization?
  • #4: List of apps, news feed, ads/notifications Dynamic web site What it boils down to is a set of web services, not a big deal