SESSION 2017-2018
B.TECH (CSE) YEAR: III SEMESTER: VI
INTRODUCTION TO HIVE
(CSE6005)
MODULE 2 (L6)
Presented By
Vivek Kumar
Dept of Computer Engineering & Applications
GLA University India
Agenda
Topic: Introduction to Hive
Learning Objectives:
1. To study the Hive architecture
2. To study the Hive file formats
3. To study the Hive Query Language
Learning Outcomes:
a) To understand the Hive architecture.
b) To create databases and tables and execute data manipulation language statements on them.
c) To differentiate between static and dynamic partitions.
d) To differentiate between managed and external tables.
Agenda
 What is Hive?
 Hive Architecture
 Hive Data Types
 Primitive Data Types
 Collection Data Types
 Hive File Format
 Text File
 Sequential File
 RCFile (Record Columnar File)
Agenda …
 Hive Query Language
 DDL (Data Definition Language) Statements
 DML (Data Manipulation Language) Statements
 Database
 Tables
 Partitions
 Buckets
 Aggregation
 Group BY and Having
 SerDe
Case Study: Retail
 Major Indian retailers, including FutureGroup, Reliance Industries, Tata Group and Aditya Birla Group, use Hive.
 One of these retail groups, let's call it BigX, wanted its last 5 years of semi-structured data analyzed for trends and patterns.
 Let us see how we can solve their problem
using Hadoop.
Case Study: Retail cont..
About BigX
 BigX is a chain of hypermarkets in India. It currently operates 220+ stores across 85 cities and towns in India and employs 35,000+ people. Its annual revenue for the year 2011 was USD 1 billion. It offers a wide range of products including fashion and apparel, food products, books, furniture, electronics, health care, general merchandise and entertainment.
Case Study: Retail cont..
Problem Scenario
1. One of BigX's log datasets that needed to be analyzed was approximately 12 TB in overall size and held 5 years of vital information in semi-structured form.
Case Study: Retail cont..
2. Traditional business intelligence (BI) tools work well up to a certain scale, usually several hundred gigabytes. But when the data is on the order of terabytes or petabytes, these frameworks become inefficient. BI tools also work best when the data fits a known, pre-defined schema. The BigX dataset was mostly logs, which did not conform to any specific schema.
Case Study: Retail cont..
3. It took around 12+ hours to move the data into their business intelligence systems bi-weekly. BigX wanted to reduce this time drastically.
4. Querying such a large dataset was taking too long.
Case Study: Retail cont..
Solution
 This is where Hadoop shines in all its glory as a solution. Since the size of the logs dataset is 12 TB, at such a large scale the problem is two-fold:
 Problem 1: Moving the logs dataset to HDFS
periodically
 Problem 2: Performing the analysis on this
HDFS dataset
Case Study: Retail cont..
Solution of Problem 1
 Since the logs are unstructured in this case, Sqoop was of little or no use. So Flume was used to move the log data periodically into HDFS.
Case Study: Retail cont..
Solution of Problem2
 Hive is a data warehouse infrastructure built on top of
Hadoop for providing data summarization, query and
analysis. It provides an SQL-like language called
HiveQL and converts the query into MapReduce tasks.
Hive in this Case Study
 Hive uses “Schema on Read” unlike a
traditional database which uses “Schema on
Write”.
 While reading log files, the simplest
recommended approach during Hive table
creation is to use a RegexSerDe.
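As a minimal sketch of that approach (the table name, columns, and regular expression here are hypothetical; each capture group in input.regex maps to one column, in order):
CREATE EXTERNAL TABLE IF NOT EXISTS bigx_logs (
  ip      STRING,   -- first capture group
  ts      STRING,   -- second capture group
  request STRING)   -- third capture group
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "(\\S+) \\[([^\\]]+)\\] \"([^\"]*)\"")
STORED AS TEXTFILE
LOCATION '/flume/bigx/logs';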
 By default, Hive metadata is stored in an embedded Derby database, which allows only one user to issue queries at a time. This is not ideal for production purposes. Hence, Hive was configured to use a standalone metastore database (such as MySQL) instead.
Conclusion- Case Study: Retail
 Using the Hadoop system, log transfer time
was reduced to ~3 hours bi-weekly and
querying time also was significantly improved.
 Thanks to Vijay, Big Data Lead at 8KMiles, who holds an M.Tech in Information Retrieval from IIIT-B, for this case study.
 https://yourstory.com/2012/04/hive-for-retail-
analysis/
What is Hive?
 Hive is a data warehousing tool built on top of Hadoop. It is used to query structured data.
 Facebook created Hive to manage their ever-growing volumes of data. Hive makes use of the following:
1. HDFS for storage
2. MapReduce for execution
3. An RDBMS for metadata storage
What is Hive?
 Apache Hive is a popular SQL interface for
batch processing on Hadoop.
 Hadoop was built to organize and store
massive amounts of data.
 Hive gives another way to access data inside the cluster in an easy, quick way.
 Hive provides a query language
called HiveQL that closely resembles the
common Structured Query Language (SQL)
standard.
 Hive was one of the earliest projects to bring higher-level languages to Apache Hadoop.
 Hive gives analysts and data scientists the ability to access data without being experts in Java.
 Hive gives structure to data on HDFS, making it a data warehousing platform.
 This interface to Hadoop
 not only accelerates the time required to produce
results from data analysis,
 it significantly broadens who can use Hadoop and
MapReduce.
 Let us take a moment to thank the Facebook team, because
 Hive was developed by the Facebook Data team and, after being used internally,
 it was contributed to the Apache Software Foundation.
 Currently Hive is freely available as an open-source project.
What Hive is not?
 Hive is not a relational database; it uses a database to store metadata, but the data that Hive processes is stored in HDFS.
 Hive is not designed for online transaction processing (OLTP).
 Hive is not suited for real-time queries or row-level updates; it is best used for batch jobs over large sets of immutable data, such as web logs.
Typical Use-Case of Hive
 Hive takes large amounts of unstructured data and places it into a structured view.
 Hive supports use cases such as ad-hoc queries, summarization, and data analysis.
 HiveQL can also be extended with custom scalar functions (UDFs), aggregations (UDAFs), and table functions (UDTFs).
 It converts SQL queries into MapReduce jobs.
Features of Hive
1. It is similar to SQL.
2. HQL is easy to code.
3. Hive supports rich data types such as structs,
lists, and maps.
4. Hive supports SQL filters, group-by and order-
by clauses.
Prerequisites of Hive in Hadoop
 The prerequisites for setting up Hive and
running queries are
1. A stable build of Hadoop
2. Java 1.6 installed on the machine
3. Basic Java programming skills
4. Basic SQL knowledge
 Start all the services of Hadoop using the command $ start-all.sh.
 Once all services are running, use $ hive to start Hive.
Hive Integration and Workflow
 Hourly log data can be stored directly into HDFS.
 Data cleaning is then performed on the log files.
 Finally, Hive tables can be created to query the log files.
(Diagram: Hourly Log → Log Compression → Hadoop HDFS → Hive Table 1 / Hive Table 2)
Hive Architecture
(Diagram: clients — the Command-Line Interface, the Hive Web Interface, and the Hive Server (Thrift) — talk to the Driver (Query Compiler, Executor) and the Metastore; Hive submits work to Hadoop's JobTracker/TaskTracker and reads/writes HDFS.)
Hive Architecture
The various parts are as follows:
 Hive Command-Line Interface (Hive CLI): The most commonly used interface to interact with Hive.
 Hive Web Interface: A simple graphical user interface to interact with Hive and execute queries.
 Hive Server: This is an optional server. This can be
used to submit Hive Jobs from a remote client.
 JDBC / ODBC: Jobs can be submitted from a JDBC
Client. One can write a Java code to connect to Hive
and submit jobs on it.
Hive Architecture
 Driver: Hive queries are sent to the driver for
compilation, optimization and execution.
 Metastore: Hive table definitions and mappings to the data are stored in a Metastore. A Metastore consists of the following:
 - Metastore service: Offers an interface to Hive.
 - Database: Stores data definitions, mappings to the data, and so on.
 The metadata stored in the metastore includes IDs of databases, tables and indexes, the time of creation of a table, the input format used for a table, the output format used for a table, etc. The metastore is updated whenever a table is created or deleted from Hive. There are three kinds of metastore.
Hive Architecture
 1. Embedded Metastore: This metastore is mainly used for unit tests. Here, only one process is allowed to connect to the metastore at a time. This is the default metastore for Hive: an Apache Derby database. In this mode, both the database and the metastore service run embedded in the main Hive Server process.
 2. Local Metastore: Metadata can be stored in any RDBMS, such as MySQL. A local metastore allows multiple connections at a time. In this mode, the Hive metastore service still runs in the main Hive Server process, but the metastore database runs in a separate process, which can be on a separate host.
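A hedged sketch of how a local MySQL-backed metastore is typically configured in hive-site.xml (the host, database name, and credentials below are placeholders):
<!-- hive-site.xml: point the metastore at an external MySQL database.
     dbhost, metastore, hiveuser, and hivepass are placeholder values. -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://dbhost/metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hivepass</value>
</property>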
Hive Architecture
 3. Remote Metastore: Here, the Hive driver and the metastore service run in different JVMs (which can run on different machines as well). This way the database can be fire-walled from the Hive user, and database credentials are completely isolated from the users of Hive.
Hive Data Units
Hive Data Model Contd.
 Tables
- Analogous to relational tables
- Each table has a corresponding directory in
HDFS
- Data serialized and stored as files within that
directory
- Hive has default serialization built in which
supports compression and lazy deserialization
- Users can specify custom serialization –
deserialization schemes (SerDe’s)
Hive Data Model Contd.
 Partitions
- Each table can be broken into partitions
- Partitions determine distribution of data within
subdirectories
Example:
CREATE TABLE Sales (sale_id INT, amount FLOAT)
PARTITIONED BY (country STRING, year INT, month INT);
Each partition is then split out into its own folder, e.g.:
Sales/country=US/year=2012/month=12
Hierarchy of Hive Partitions
(Diagram: a partition directory containing multiple data files.)
Partition
 The general definition of partitioning is horizontally dividing the data into a number of equal, manageable slices.
 Every partition is stored as a directory within the data warehouse table.
 The partitioning concept is common in data warehousing, and two types of partitions are available:
i) SQL Partition
ii) Hive Partition
Hive Partition
 A Hive partition does the same work as a SQL partition.
 The main difference between them is that a SQL partition is supported only on a single column of a table, whereas a Hive partition can be defined on multiple columns of a table.
Hive Data Model Contd.
 Buckets
- Data in each partition divided into buckets
- Based on a hash function of the column
- H(column) mod NumBuckets = bucket
number
- Each bucket is stored as a file in partition
directory
Hive Data Types
Numeric Data Types
TINYINT    1-byte signed integer
SMALLINT   2-byte signed integer
INT        4-byte signed integer
BIGINT     8-byte signed integer
FLOAT      4-byte single-precision floating-point number
DOUBLE     8-byte double-precision floating-point number
String Types
STRING
VARCHAR    Only available starting with Hive 0.12.0
CHAR       Only available starting with Hive 0.13.0
Strings can be expressed in either single quotes (') or double quotes (").
Miscellaneous Types
BOOLEAN
BINARY     Only available starting with Hive 0.8.0
Hive Data Types cont..
Collection Data Types
STRUCT   Similar to a C struct. Fields are accessed using dot notation.
         E.g.: struct('John', 'Doe')
MAP      A collection of key-value pairs. Fields are accessed using [] notation.
         E.g.: map('first', 'John', 'last', 'Doe')
ARRAY    Ordered sequence of elements of the same type. Fields are accessed using an array index.
         E.g.: array('John', 'Doe')
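A short sketch tying the three collection types together (the table, columns, and delimiters below are hypothetical):
CREATE TABLE IF NOT EXISTS EMPLOYEE (
  name    STRING,
  skills  ARRAY<STRING>,                 -- accessed as skills[0]
  phones  MAP<STRING, STRING>,           -- accessed as phones['home']
  address STRUCT<city:STRING, pin:INT>)  -- accessed as address.city
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '#'
MAP KEYS TERMINATED BY ':';
SELECT skills[0], phones['home'], address.city FROM EMPLOYEE;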
Hive File Format
 Text File: The default file format is text file.
 Sequential File: Sequential files are flat files
that store binary key-value pairs.
 RCFile (Record Columnar File): RCFile stores the data in a column-oriented manner, which ensures that aggregation operations are not expensive.
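The file format is chosen per table with a STORED AS clause; a minimal sketch (table names are hypothetical, and TEXTFILE, being the default, may be omitted):
CREATE TABLE logs_text (line STRING) STORED AS TEXTFILE;
CREATE TABLE logs_seq (k STRING, v STRING) STORED AS SEQUENCEFILE;
CREATE TABLE logs_rc (k STRING, v STRING) STORED AS RCFILE;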
Hive Query Language (HQL)
 Works on Databases, Tables, Partitions, Buckets
(Clusters)
 Create and manage tables and partitions.
 Support various Relational, Arithmetic, and Logical
Operators.
 Evaluate functions.
 Download the contents of a table to a local directory, or write the results of queries to an HDFS directory.
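A hedged sketch of the last bullet, reusing the STUDENT table from the examples below (both output paths are hypothetical):
-- Export a table to the local filesystem:
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/student_dump'
SELECT * FROM STUDENT;
-- Write a query result to an HDFS directory:
INSERT OVERWRITE DIRECTORY '/user/hive/output/high_gpa'
SELECT * FROM STUDENT WHERE gpa > 3.5;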
Database
 To create a database named “STUDENTS”
with comments and database properties.
CREATE DATABASE IF NOT EXISTS STUDENTS
COMMENT 'STUDENT Details'
WITH DBPROPERTIES ('creator' = 'JOHN');
Database
 To describe a database
DESCRIBE DATABASE STUDENTS;
 To show Databases
SHOW DATABASES;
 To drop a database.
DROP DATABASE STUDENTS;
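Note that DROP DATABASE fails if the database still contains tables; adding CASCADE drops the contained tables first (use with care):
DROP DATABASE IF EXISTS STUDENTS CASCADE;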
Tables
 There are two types of tables in Hive:
Managed table
External table
 The difference between the two is what happens when you drop a table:
 if it is a managed table, Hive deletes both the data and the metadata;
 if it is an external table, Hive deletes only the metadata.
 Use the EXTERNAL keyword to create an external table.
Tables
To create a managed table named 'STUDENT'.
CREATE TABLE IF NOT EXISTS STUDENT (rollno INT, name STRING, gpa FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
Tables
To create an external table named 'EXT_STUDENT'.
CREATE EXTERNAL TABLE IF NOT EXISTS EXT_STUDENT (rollno INT, name STRING, gpa FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/STUDENT_INFO';
Tables
To load data into the table from a file named student.tsv.
LOAD DATA LOCAL INPATH '/root/hivedemos/student.tsv' OVERWRITE INTO TABLE EXT_STUDENT;
To retrieve the student details from the EXT_STUDENT table.
SELECT * FROM EXT_STUDENT;
Table ALTER Operations
 ALTER TABLE mytablename RENAME TO mt;
 ALTER TABLE mytable ADD COLUMNS (mycol STRING);
 ALTER TABLE name RENAME TO new_name;
 ALTER TABLE name DROP [COLUMN] column_name;
 ALTER TABLE name CHANGE column_name new_name new_type;
 ALTER TABLE name REPLACE COLUMNS (col_spec[, col_spec ...]);
Partitions
 Partitions split the larger dataset into more meaningful chunks.
 Hive provides two kinds of partitions: Static Partition and Dynamic
Partition.
• To create a static partition based on the “gpa” column.
CREATE TABLE IF NOT EXISTS STATIC_PART_STUDENT (rollno INT, name STRING)
PARTITIONED BY (gpa FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
To load data into the partitioned table from another table.
INSERT OVERWRITE TABLE STATIC_PART_STUDENT PARTITION (gpa = 4.0)
SELECT rollno, name FROM EXT_STUDENT WHERE gpa = 4.0;
Partitions
• To create a dynamically partitioned table on the “gpa” column.
CREATE TABLE IF NOT EXISTS DYNAMIC_PART_STUDENT (rollno INT, name STRING)
PARTITIONED BY (gpa FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
To load data into the dynamic partition table from another table.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
Note: Dynamic-partition strict mode requires at least one static partition column; setting hive.exec.dynamic.partition.mode to nonstrict (as above) turns this off.
INSERT OVERWRITE TABLE DYNAMIC_PART_STUDENT PARTITION (gpa)
SELECT rollno, name, gpa FROM EXT_STUDENT;
Buckets
 Tables or partitions are sub-divided
into buckets, to provide extra structure to the
data that may be used for more efficient
querying. Bucketing works based on the value of a hash function of some column of a table.
 We can add partitions to a table by altering the
table. Let us assume we have a table
called employee with fields such as Id, Name,
Salary, Designation, Dept, and yoj.
Buckets
• To create a bucketed table having 3 buckets.
CREATE TABLE IF NOT EXISTS STUDENT_BUCKET (rollno INT, name STRING, grade FLOAT)
CLUSTERED BY (grade) INTO 3 BUCKETS;
To load data into the bucketed table.
FROM STUDENT
INSERT OVERWRITE TABLE STUDENT_BUCKET
SELECT rollno, name, grade;
To display the contents of the first bucket.
SELECT DISTINCT grade FROM STUDENT_BUCKET
TABLESAMPLE (BUCKET 1 OUT OF 3 ON grade);
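One caveat, stated as an assumption about pre-2.0 Hive releases: the INSERT above fills the declared bucket count only if bucketing enforcement is switched on first.
SET hive.enforce.bucketing = true; -- one output file per bucket (always on from Hive 2.0)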
Aggregations
 Hive supports aggregation functions like avg,
count, etc.
 To write the average and count aggregation
function.
SELECT avg(gpa) FROM STUDENT;
SELECT count(*) FROM STUDENT;
Group by and Having
To use GROUP BY and HAVING:
SELECT rollno, name,gpa
FROM STUDENT
GROUP BY rollno,name,gpa
HAVING gpa > 4.0;
SerDe
 SerDe stands for Serializer/Deserializer.
 It contains the logic to convert unstructured data into records.
 SerDes are implemented in Java.
 Serializers are used at the time of writing (INSERT statements).
 Deserializers are used at query time (SELECT statements).
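A custom SerDe is attached at table-creation time via the ROW FORMAT SERDE clause; a sketch using the JSON SerDe that ships with HCatalog (table and columns are hypothetical):
CREATE TABLE events_json (user_id STRING, action STRING)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE;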
Fill in the blanks
 The metastore consists of ______________
and a ______________.
 The most commonly used interface to interact
with Hive is ______________.
 The default metastore for Hive is
______________.
 Metastore contains ______________ of Hive
tables.
 ______________ is responsible for
compilation, optimization, and execution of
Hive queries.