Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC.
Copyright for all other & referenced work is retained by their respective owners.




Essentials of Hive
Mastering Hadoop MapReduce for Data Analysis


Shashank Tiwari
blog: shanky.org | twitter: @tshanky
st@treasuryofideas.com




What is Hive?

• A data warehouse system for Hadoop


• Facilitates data summarization and ad-hoc queries


• Allows SQL-like querying with HiveQL by imposing table metadata on data
  stored in HDFS


• Custom mappers and reducers can also be plugged in




Supported Platforms

• Linux/Unix and Mac OS X


• Does not work on Cygwin




Required Software

• Java 1.6.x


• Hadoop 0.17.x to 0.20.x




Download

• Source: http://hive.apache.org/releases.html


• Version:


   • hive-0.7.0


• Both binary and source distributions available




Install

• Extract: tar zxvf hive-0.7.0-bin.tar.gz


• Move and Create Symbolic Link: ln -s hive-0.7.0-bin hive


• Set environment variable HIVE_HOME to point to the hive directory


• Add $HIVE_HOME/bin to your PATH environment variable




Build From Source

• $ svn co http://svn.apache.org/repos/asf/hive/trunk hive


• $ cd hive


• $ ant clean package


• The binary distribution is in build/dist




Hive Needs Hadoop

• Hive runs on top of Hadoop, so a working Hadoop installation is required


  • Add Hadoop distribution to your path or set HADOOP_HOME


  • Start Hadoop daemons


    • bin/start-all.sh




Configure Hive

• Create /tmp in HDFS and set appropriate permissions


  • bin/hadoop fs -mkdir /tmp


  • bin/hadoop fs -chmod g+w /tmp


• Create /user/hive/warehouse and set appropriate permissions


  • bin/hadoop fs -mkdir /user/hive/warehouse


  • bin/hadoop fs -chmod g+w /user/hive/warehouse




Default Hive Configuration

• Default configuration: conf/hive-default.xml


• Override default configuration by redefining properties in:


   • conf/hive-site.xml


• Set HIVE_CONF_DIR to point Hive at a different configuration directory


• Hive configuration is an overlay on top of the Hadoop configuration
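
The overlay idea can be pictured as a chain of lookups, most specific file first. The following is a minimal Python sketch, not Hive's actual loader: the three dictionaries stand in for the parsed XML files, and the property values are illustrative assumptions rather than settings read from a real installation.

```python
from collections import ChainMap

# Stand-ins for parsed configuration files; the values are illustrative only.
hadoop_conf = {"fs.default.name": "hdfs://localhost:9000"}
hive_default = {
    "hive.metastore.warehouse.dir": "/user/hive/warehouse",
    "hive.exec.scratchdir": "/tmp/hive",
}
hive_site = {"hive.metastore.warehouse.dir": "/data/warehouse"}

# Lookups try hive-site.xml first, then hive-default.xml, then Hadoop's settings.
effective = ChainMap(hive_site, hive_default, hadoop_conf)

print(effective["hive.metastore.warehouse.dir"])  # overridden in hive-site.xml
print(effective["hive.exec.scratchdir"])          # falls back to the Hive default
print(effective["fs.default.name"])               # inherited from Hadoop
```

Only the properties that differ from the defaults need to appear in conf/hive-site.xml; everything else is picked up from the layers underneath.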




Hive Configuration Manipulation

• Edit: conf/hive-site.xml


• Use the SET command in the Hive CLI


• Pass parameters to Hive


   • bin/hive -hiveconf prop1=val1 -hiveconf prop2=val2


   • Set the HIVE_OPTS environment variable to "-hiveconf prop1=val1 -hiveconf prop2=val2"
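
When the overrides come from a script rather than being typed by hand, the -hiveconf arguments can be generated. A small Python sketch, assuming the placeholder property names prop1 and prop2 from the slide; the helper name hiveconf_args is hypothetical.

```python
import shlex

def hiveconf_args(props):
    """Return the -hiveconf arguments for a dict of property overrides."""
    args = []
    for name, value in sorted(props.items()):
        args += ["-hiveconf", f"{name}={value}"]
    return args

overrides = {"prop1": "val1", "prop2": "val2"}

# Argument list in the form expected by subprocess.run().
command = ["bin/hive"] + hiveconf_args(overrides)
print(command)

# The same overrides as a single string for the HIVE_OPTS environment variable.
hive_opts = " ".join(shlex.quote(arg) for arg in hiveconf_args(overrides))
print(hive_opts)
```

Passing a list avoids shell quoting problems, while the HIVE_OPTS string is quoted with shlex.quote in case a value contains spaces.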




Hive by Example -- Getting Started

• Start the CLI: bin/hive


• Basic DDL statements


   • List the existing tables


      • SHOW TABLES;




Create Table

• CREATE TABLE books (isbn INT, title STRING);


• DESCRIBE books;


  • isbn	    int	


  • title	   string


• CREATE TABLE users (id INT, name STRING) PARTITIONED BY (vcol
  STRING);


  • What is PARTITIONED BY (vcol STRING)?




Logical Table Partitions

• A Hive table can be logically partitioned by a virtual column


• The virtual column is derived from the partition in which the data is stored


• A table can have multiple partitions


• Each partition is uniquely identified by a virtual column value
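
Hive keeps each partition in its own subdirectory, named column=value, under the table's warehouse directory, which is how a column that is not stored in the data files can still be derived. A Python sketch of that mapping, assuming the default warehouse path used elsewhere in this deck; the helper names and the partition value 2011 are hypothetical.

```python
import posixpath

WAREHOUSE = "/user/hive/warehouse"  # default warehouse location (assumed)

def partition_dir(table, **partition_values):
    """Return the HDFS directory that holds one partition of a table."""
    parts = [f"{col}={val}" for col, val in partition_values.items()]
    return posixpath.join(WAREHOUSE, table, *parts)

def virtual_columns(path, table):
    """Recover the virtual column values from a partition directory."""
    relative = posixpath.relpath(path, posixpath.join(WAREHOUSE, table))
    return dict(piece.split("=", 1) for piece in relative.split("/"))

path = partition_dir("users", vcol="2011")
print(path)
print(virtual_columns(path, "users"))
```

A query that filters on the virtual column only needs to read the matching directories, which is the main practical benefit of partitioning.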




Alter Table

• ALTER TABLE books ADD COLUMNS (author STRING, category STRING);


• Change Column Property


  • ALTER TABLE table_name CHANGE [COLUMN]
    old_column_name new_column_name column_type
    [COMMENT column_comment] [FIRST|AFTER column_name]




Alter Table Column Property

• ALTER TABLE books CHANGE author author ARRAY<STRING> COMMENT
  "multi-valued";


  • Both the old and the new column name must be specified, even when the name stays the same


  • The data type changes from STRING to ARRAY<STRING>




Data Types Supported

• Primitives: INT, STRING, etc.


• Complex types: maps, arrays, structs




Rename Table

• ALTER TABLE books RENAME TO published_contents;


• DESCRIBE published_contents;


• DESCRIBE books; (execution error: after the rename, no table named books exists)




Drop Tables

• DROP TABLE published_contents;


• DROP TABLE users;




GroupLens Example -- Getting the Data Set

• Movie ratings -- 1 million records


• Available in tar.gz format: million-ml-data.tar__0.gz


• Extract: tar zxvf million-ml-data.tar__0.gz






Loading Rating Data

• Format of data in ratings.dat:


   • UserID::MovieID::Rating::Timestamp


• Replace the delimiter '::' with '#'


   • For example, in vi: :%s/::/#/g


• Save the result as ratings.dat.hash_delimited
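
For a file with a million rows, a small script may be more convenient than an interactive editor. The following Python sketch performs the same substitution; the function name to_hash_delimited is hypothetical, and the demonstration writes two sample rows in the ratings.dat format to a temporary directory instead of touching the real data set.

```python
import os
import tempfile

def to_hash_delimited(src_path, dst_path):
    """Copy src_path to dst_path, replacing every '::' with '#'."""
    # latin-1 maps every byte to a character, so decoding cannot fail.
    with open(src_path, encoding="latin-1") as src, \
         open(dst_path, "w", encoding="latin-1") as dst:
        for line in src:
            dst.write(line.replace("::", "#"))

# Demonstration on a throwaway file with two sample rows.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "ratings.dat")
dst = src + ".hash_delimited"
with open(src, "w", encoding="latin-1") as f:
    f.write("1::1193::5::978300760\n1::661::3::978302109\n")

to_hash_delimited(src, dst)
with open(dst, encoding="latin-1") as f:
    converted = f.read()
print(converted)
```

The approach assumes that '#' never occurs inside a field; that holds for the numeric ratings file but is worth checking before applying the same conversion to files with free-text columns.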




Creating Metadata and Loading the File

• hive> CREATE TABLE ratings( userid INT, movieid INT, rating INT, tstamp
  STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '#' STORED
  AS TEXTFILE;


• LOAD DATA LOCAL INPATH '<path/to/flat/file>' OVERWRITE INTO TABLE
  <table_name>;




File Load Properties

• No validation at load time: it is the developer's responsibility to make
  sure the file matches the table schema


• Data can be on the local filesystem (LOAD DATA LOCAL INPATH) or on HDFS
  (LOAD DATA INPATH)


• Local files are copied into the Hive HDFS namespace; files already on HDFS
  are moved there


• If OVERWRITE is not specified, the data is appended
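
Because the load itself checks nothing, a quick pre-flight check of the file can save a confusing debugging session later (mismatched fields tend to surface as NULLs at query time rather than as load errors). A Python sketch, assuming the four-column ratings layout (INT, INT, INT, STRING) and the '#' delimiter used in this deck; check_ratings_line is a hypothetical helper and the sample lines are made up.

```python
def check_ratings_line(line, delimiter="#"):
    """Return None if the line fits (INT, INT, INT, STRING), else a reason."""
    fields = line.rstrip("\n").split(delimiter)
    if len(fields) != 4:
        return f"expected 4 fields, found {len(fields)}"
    for name, value in zip(("userid", "movieid", "rating"), fields):
        try:
            int(value)
        except ValueError:
            return f"{name} is not an integer: {value!r}"
    return None

sample = [
    "1#1193#5#978300760\n",    # well formed
    "1::661::3::978302109\n",  # delimiter was never converted
    "2#x#4#978300000\n",       # non-numeric movieid
]
checked = [(number, check_ratings_line(line)) for number, line in enumerate(sample, 1)]
problems = [(number, reason) for number, reason in checked if reason]
print(problems)
```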




Rating Data Load

• hive> LOAD DATA LOCAL INPATH '/path/to/ratings.dat.hash_delimited'
      > OVERWRITE INTO TABLE ratings;




A SQL Style Query

• SELECT COUNT(*) FROM ratings;




Loading Movies and Users Data

• Now load the movies and users data in the same way as the ratings data.


  • Details on the console...


• CREATE TABLE users_2(userid INT, gender STRING, age INT, occupation
  STRING, zipcode STRING) ROW FORMAT DELIMITED FIELDS TERMINATED
  BY '#' STORED AS TEXTFILE;


• ADD FILE /Users/tshanky/workspace/hadoop_workspace/hive_workspace/occupation_mapper.py;


• INSERT OVERWRITE TABLE users_2 SELECT TRANSFORM (userid, gender,
  age, occupation, zipcode) USING 'python occupation_mapper.py' AS (userid,
  gender, age, occupation_str, zipcode) FROM users;
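
The deck does not show occupation_mapper.py itself, so the following is only a guess at its shape, not the author's script. TRANSFORM feeds each row to the script as a tab-separated line on standard input and reads tab-separated lines back from standard output. The occupation table below is an illustrative subset and should be checked against the data set's README; the names OCCUPATIONS, transform, and run are hypothetical.

```python
import io

# Illustrative subset of occupation codes; verify against the data set's README.
OCCUPATIONS = {"0": "other", "10": "K-12 student", "12": "programmer"}

def transform(line):
    """Map one tab-separated input row to a tab-separated output row."""
    userid, gender, age, occupation, zipcode = line.rstrip("\n").split("\t")
    occupation_str = OCCUPATIONS.get(occupation, "unknown")
    return "\t".join([userid, gender, age, occupation_str, zipcode])

def run(stdin, stdout):
    """Transform every row read from stdin and write it to stdout."""
    for line in stdin:
        stdout.write(transform(line) + "\n")

# In-memory streams stand in for the pipes Hive would attach to the script.
demo_in = io.StringIO("1\tF\t1\t10\t48067\n2\tM\t56\t99\t70072\n")
demo_out = io.StringIO()
run(demo_in, demo_out)
print(demo_out.getvalue(), end="")
```

In the file handed to Hive, the demonstration lines at the bottom would be replaced by a call such as run(sys.stdin, sys.stdout), with sys imported at the top.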




Good Old SQL

• SELECT * FROM movies LIMIT 5;


• SELECT * FROM ratings WHERE movieid = 1;


• SELECT COUNT(*) FROM ratings WHERE movieid < 10;


• SELECT COUNT(*) FROM ratings WHERE movieid = 1 and rating = 5;


• SELECT title FROM movies WHERE title RLIKE '^Toy';




More Than Good Old SQL

• SELECT `.+(id)` FROM ratings WHERE movieid = 1;


  • regular-expression match on column names: selects every column whose
    name ends in "id" (userid, movieid)


• SELECT ratings.rating, COUNT(ratings.rating) FROM ratings WHERE movieid
  = 1 GROUP BY ratings.rating; (group by)


• SELECT * FROM movies ORDER BY movieid DESC;


• DISTRIBUTE BY & SORT BY (CLUSTER BY is shorthand for both on the same
  columns) -- rows are sorted within each reducer's partition, not globally
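
CLUSTER BY can be read as sending rows to reducers by a key and then sorting within each reducer, whereas ORDER BY produces a single total order. The Python sketch below emulates that difference on a few made-up (movieid, rating) rows. The modulo on movieid is a stand-in for Hive's hash partitioning, and the reducer count is an assumption.

```python
from collections import defaultdict

# (movieid, rating) rows; the values are made up for illustration.
rows = [(3, 4), (1, 5), (2, 3), (1, 2), (4, 1), (2, 5)]
NUM_REDUCERS = 2

def distribute_and_sort(rows, num_reducers):
    """Emulate DISTRIBUTE BY movieid SORT BY movieid."""
    buckets = defaultdict(list)
    for row in rows:
        # Rows with the same movieid always land on the same reducer.
        buckets[row[0] % num_reducers].append(row)
    # Each reducer sorts only its own rows.
    return {r: sorted(b, key=lambda row: row[0]) for r, b in buckets.items()}

per_reducer = distribute_and_sort(rows, NUM_REDUCERS)
print(per_reducer)

# ORDER BY, by contrast, yields one total order across all rows.
total_order = sorted(rows, key=lambda row: row[0])
print(total_order)
```

Each reducer's output is sorted, but concatenating the outputs does not give a globally sorted result. That is the price paid for letting several reducers work in parallel.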




JOIN(s) in HiveQL

• equality joins, outer joins, left semi-joins


   • SELECT ratings.userid, ratings.rating, ratings.tstamp, movies.title FROM
     ratings JOIN movies ON (ratings.movieid = movies.movieid) LIMIT 5;


• More than 2 tables:


   • SELECT ratings.userid, ratings.rating, ratings.tstamp, movies.title,
     users.gender FROM ratings JOIN movies ON (ratings.movieid =
     movies.movieid) JOIN users ON (ratings.userid = users.userid) LIMIT 5;
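
The three join flavours named above differ only in what happens to rows without a match and in which columns come back. A Python sketch on a handful of sample rows; the function names are hypothetical, and a plain dictionary lookup stands in for whatever join strategy Hive actually chooses.

```python
# ratings rows: (userid, movieid, rating, tstamp); movies: movieid -> title.
ratings = [
    (1, 1193, 5, "978300760"),
    (2, 1, 4, "978300000"),
    (3, 9999, 2, "978300001"),  # no matching movie
]
movies = {1: "Toy Story (1995)", 1193: "One Flew Over the Cuckoo's Nest (1975)"}

def inner_join(ratings, movies):
    """ratings JOIN movies ON (ratings.movieid = movies.movieid)."""
    return [(u, r, t, movies[m]) for u, m, r, t in ratings if m in movies]

def left_outer_join(ratings, movies):
    """LEFT OUTER JOIN: unmatched ratings are kept with a NULL (None) title."""
    return [(u, r, t, movies.get(m)) for u, m, r, t in ratings]

def left_semi_join(ratings, movies):
    """LEFT SEMI JOIN: only ratings columns, only rows that have a match."""
    return [row for row in ratings if row[1] in movies]

print(inner_join(ratings, movies))
print(left_outer_join(ratings, movies))
print(left_semi_join(ratings, movies))
```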




Explain Plan: MapReduce Under the Hood

• EXPLAIN SELECT COUNT(*) FROM ratings WHERE movieid = 1 and rating =
  5;
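
The plan printed by EXPLAIN lists more operators than this, but conceptually the query boils down to one map phase and one reduce phase. A rough Python emulation, assuming two hypothetical input splits of made-up (userid, movieid, rating) rows:

```python
from functools import reduce

# (userid, movieid, rating) rows spread over two hypothetical input splits.
splits = [
    [(1, 1, 5), (2, 1, 4), (3, 2, 5)],
    [(4, 1, 5), (5, 3, 1)],
]

def map_phase(split):
    """Apply the WHERE clause and emit a partial count for one split."""
    return sum(1 for _, movieid, rating in split if movieid == 1 and rating == 5)

def reduce_phase(partial_counts):
    """Add up the partial counts from all mappers to get COUNT(*)."""
    return reduce(lambda a, b: a + b, partial_counts, 0)

partials = [map_phase(split) for split in splits]
total = reduce_phase(partials)
print(partials, total)
```

Filtering and partial counting happen where the data lives, so only one small number per mapper has to travel to the reducer.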




Questions?




• blog: shanky.org | twitter: @tshanky


• st@treasuryofideas.com
