SESSION 2017-2018
B.TECH (CSE) YEAR: III SEMESTER: VI
INTRODUCTION TO HIVE
(CSE6005)
MODULE 2 (L6)
Presented By
Vivek Kumar
Dept of Computer Engineering & Applications
GLA University India
Agenda
Topic: Introduction to Hive
Learning Objectives:
1. To study the Hive architecture
2. To study the Hive file formats
3. To study the Hive Query Language
Learning Outcomes:
a) To understand the Hive architecture.
b) To create databases and tables and execute data manipulation language statements on them.
c) To differentiate between static and dynamic partitions.
d) To differentiate between managed and external tables.
Agenda
 What is Hive?
 Hive Architecture
 Hive Data Types
 Primitive Data Types
 Collection Data Types
 Hive File Format
 Text File
 Sequential File
 RCFile (Record Columnar File)
Agenda …
 Hive Query Language
 DDL (Data Definition Language) Statements
 DML (Data Manipulation Language) Statements
 Database
 Tables
 Partitions
 Buckets
 Aggregation
 Group BY and Having
 SerDe
Case Study: Retail
 Major Indian retailers, including FutureGroup, Reliance Industries, Tata Group and Aditya Birla Group, use Hive.
 One of these retail groups, let's call it BigX, wanted its last 5 years of semi-structured data analyzed for trends and patterns.
 Let us see how we can solve their problem
using Hadoop.
Case Study: Retail cont..
About BigX
 BigX is a chain of hypermarkets in India. It currently operates 220+ stores across 85 cities and towns in India and employs 35,000+ people. Its annual revenue for the year 2011 was USD 1 billion. It offers a wide range of products including fashion and apparel, food products, books, furniture, electronics, health care, general merchandise and entertainment.
Case Study: Retail cont..
Problem Scenario
1. One of BigX's log datasets that needed to be analyzed was approximately 12 TB in overall size and held 5 years of vital information in semi-structured form.
Case Study: Retail cont..
2. Traditional business intelligence (BI) tools work well up to a certain scale, usually several hundred gigabytes. But when the data is on the order of terabytes or petabytes, these frameworks become inefficient. BI tools also work best when the data fits a known, pre-defined schema. The BigX dataset was mostly logs, which did not conform to any specific schema.
Case Study: Retail cont..
3. It took around 12+ hours to move the data into their business intelligence systems bi-weekly. BigX wanted to reduce this time drastically.
4. Querying such a large dataset was taking too long.
Case Study: Retail cont..
Solution
 This is where Hadoop shines in all its glory as a solution. Since the size of the logs dataset is 12 TB, at such a large scale the problem is two-fold:
 Problem 1: Moving the logs dataset to HDFS
periodically
 Problem 2: Performing the analysis on this
HDFS dataset
Case Study: Retail cont..
Solution of Problem 1
 Since the logs are unstructured in this case, Sqoop was of little or no use. So Flume was used to move the log data periodically into HDFS.
Case Study: Retail cont..
Solution of Problem2
 Hive is a data warehouse infrastructure built on top of
Hadoop for providing data summarization, query and
analysis. It provides an SQL-like language called
HiveQL and converts the query into MapReduce tasks.
Hive in this Case Study
 Hive uses “Schema on Read” unlike a
traditional database which uses “Schema on
Write”.
 While reading log files, the simplest
recommended approach during Hive table
creation is to use a RegexSerDe.
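As a minimal sketch of that approach (the table name, columns, and regular expression here are hypothetical; each capture group in input.regex maps to one column, in order):
CREATE EXTERNAL TABLE IF NOT EXISTS bigx_logs (
  ip      STRING,   -- first capture group
  ts      STRING,   -- second capture group
  request STRING)   -- third capture group
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "(\\S+) \\[([^\\]]+)\\] \"([^\"]*)\"")
STORED AS TEXTFILE
LOCATION '/flume/bigx/logs';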
 By default, Hive metadata is stored in an embedded Derby database, which allows only one user to issue queries at a time. This is not ideal for production purposes. Hence, Hive was configured to use a standalone metastore database (such as MySQL) instead.
Conclusion- Case Study: Retail
 Using the Hadoop system, log transfer time
was reduced to ~3 hours bi-weekly and
querying time also was significantly improved.
 Thanks to Vijay, Big Data Lead at 8KMiles, who holds an M.Tech in Information Retrieval from IIIT-B, for this case study.
 https://yourstory.com/2012/04/hive-for-retail-
analysis/
What is Hive?
 Hive is a data warehousing tool built on top of Hadoop. It is used to query structured data.
 Facebook created Hive to manage their ever-growing volumes of data. Hive makes use of the following:
1. HDFS for storage
2. MapReduce for execution
3. An RDBMS for metadata storage
What is Hive?
 Apache Hive is a popular SQL interface for
batch processing on Hadoop.
 Hadoop was built to organize and store
massive amounts of data.
 Hive gives another way to access data inside the cluster in an easy, quick way.
 Hive provides a query language
called HiveQL that closely resembles the
common Structured Query Language (SQL)
standard.
 Hive was one of the earliest projects to bring higher-level languages to Apache Hadoop.
 Hive gives analysts and data scientists the ability to access data without being experts in Java.
 Hive gives structure to data on HDFS, making it a data warehousing platform.
 This interface to Hadoop
 not only accelerates the time required to produce
results from data analysis,
 it significantly broadens who can use Hadoop and
MapReduce.
 Let us take a moment to thank the Facebook team, because
 Hive was developed by the Facebook Data team and, after being used internally,
 it was contributed to the Apache Software Foundation.
 Currently Hive is freely available as an open-source project.
What Hive is not?
 Hive is not a relational database; it uses a database to store metadata, but the data that Hive processes is stored in HDFS.
 Hive is not designed for online transaction processing (OLTP).
 Hive is not suited for real-time queries or row-level updates; it is best used for batch jobs over large sets of immutable data, such as web logs.
Typical Use-Case of Hive
 Hive takes large amounts of unstructured data and places it into a structured view.
 Hive supports use cases such as ad-hoc queries, summarization, and data analysis.
 HiveQL can also be extended with custom scalar functions (UDFs), aggregations (UDAFs), and table functions (UDTFs).
 It converts SQL queries into MapReduce jobs.
Features of Hive
1. It is similar to SQL.
2. HQL is easy to code.
3. Hive supports rich data types such as structs,
lists, and maps.
4. Hive supports SQL filters, group-by and order-
by clauses.
Prerequisites of Hive in Hadoop
 The prerequisites for setting up Hive and
running queries are
1. A stable build of Hadoop
2. Java 1.6 installed on the machine
3. Basic Java programming skills
4. Basic SQL knowledge
 Start all the services of Hadoop using the command $ start-all.sh.
 Once all services are running, use $ hive to start Hive.
Hive Integration and Workflow
 Hourly log data can be stored directly into HDFS.
 Data cleaning is then performed on the log files.
 Finally, Hive tables can be created to query the log files.
(Diagram: Hourly Log → Log Compression → Hadoop HDFS → Hive Table 1 / Hive Table 2)
Hive Architecture
(Diagram: clients — the Command-Line Interface, the Hive Web Interface, and the Hive Server (Thrift) — talk to the Driver (Query Compiler, Executor) and the Metastore; Hive submits work to Hadoop's JobTracker/TaskTracker and reads/writes HDFS.)
Hive Architecture
The various parts are as follows:
 Hive Command-Line Interface (Hive CLI): The most commonly used interface to interact with Hive.
 Hive Web Interface: A simple graphical user interface to interact with Hive and execute queries.
 Hive Server: This is an optional server. This can be
used to submit Hive Jobs from a remote client.
 JDBC / ODBC: Jobs can be submitted from a JDBC
Client. One can write a Java code to connect to Hive
and submit jobs on it.
Hive Architecture
 Driver: Hive queries are sent to the driver for
compilation, optimization and execution.
 Metastore: Hive table definitions and mappings to the data are stored in a Metastore. A Metastore consists of the following:
 - Metastore service: Offers an interface to Hive.
 - Database: Stores data definitions, mappings to the data, and so on.
 The metadata stored in the metastore includes IDs of databases, tables and indexes, the time of creation of a table, the input format used for a table, the output format used for a table, etc. The metastore is updated whenever a table is created or deleted from Hive. There are three kinds of metastore.
Hive Architecture
 1. Embedded Metastore: This metastore is mainly used for unit tests. Here, only one process is allowed to connect to the metastore at a time. This is the default metastore for Hive: an Apache Derby database. In this mode, both the database and the metastore service run embedded in the main Hive Server process.
 2. Local Metastore: Metadata can be stored in any RDBMS, such as MySQL. A local metastore allows multiple connections at a time. In this mode, the Hive metastore service still runs in the main Hive Server process, but the metastore database runs in a separate process, which can be on a separate host.
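A hedged sketch of how a local MySQL-backed metastore is typically configured in hive-site.xml (the host, database name, and credentials below are placeholders):
<!-- hive-site.xml: point the metastore at an external MySQL database.
     dbhost, metastore, hiveuser, and hivepass are placeholder values. -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://dbhost/metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hivepass</value>
</property>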
Hive Architecture
 3. Remote Metastore: Here, the Hive driver and the metastore service run in different JVMs (which can run on different machines as well). This way the database can be fire-walled from the Hive user, and database credentials are completely isolated from the users of Hive.
Hive Data Units
Hive Data Model Contd.
 Tables
- Analogous to relational tables
- Each table has a corresponding directory in
HDFS
- Data serialized and stored as files within that
directory
- Hive has default serialization built in which
supports compression and lazy deserialization
- Users can specify custom serialization –
deserialization schemes (SerDe’s)
Hive Data Model Contd.
 Partitions
- Each table can be broken into partitions
- Partitions determine distribution of data within
subdirectories
Example:
CREATE TABLE Sales (sale_id INT, amount FLOAT)
PARTITIONED BY (country STRING, year INT, month INT);
Each partition is then split out into its own folder, e.g.:
Sales/country=US/year=2012/month=12
Hierarchy of Hive Partitions
(Diagram: a partition directory containing multiple data files.)
Partition
 The general definition of partitioning is horizontally dividing the data into a number of equal, manageable slices.
 Every partition is stored as a directory within the data warehouse table.
 The partitioning concept is common in data warehousing, and two types of partitions are available:
i) SQL Partition
ii) Hive Partition
Hive Partition
 A Hive partition does the same work as a SQL partition.
 The main difference between them is that a SQL partition is supported only on a single column of a table, whereas a Hive partition can be defined on multiple columns of a table.
Hive Data Model Contd.
 Buckets
- Data in each partition divided into buckets
- Based on a hash function of the column
- H(column) mod NumBuckets = bucket
number
- Each bucket is stored as a file in partition
directory
Hive Data Types
Numeric Data Types
TINYINT    1-byte signed integer
SMALLINT   2-byte signed integer
INT        4-byte signed integer
BIGINT     8-byte signed integer
FLOAT      4-byte single-precision floating-point number
DOUBLE     8-byte double-precision floating-point number
String Types
STRING
VARCHAR    Only available starting with Hive 0.12.0
CHAR       Only available starting with Hive 0.13.0
Strings can be expressed in either single quotes (') or double quotes (").
Miscellaneous Types
BOOLEAN
BINARY     Only available starting with Hive 0.8.0
Hive Data Types cont..
Collection Data Types
STRUCT   Similar to a C struct. Fields are accessed using dot notation.
         E.g.: struct('John', 'Doe')
MAP      A collection of key-value pairs. Fields are accessed using [] notation.
         E.g.: map('first', 'John', 'last', 'Doe')
ARRAY    Ordered sequence of elements of the same type. Fields are accessed using an array index.
         E.g.: array('John', 'Doe')
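A short sketch tying the three collection types together (the table, columns, and delimiters below are hypothetical):
CREATE TABLE IF NOT EXISTS EMPLOYEE (
  name    STRING,
  skills  ARRAY<STRING>,                 -- accessed as skills[0]
  phones  MAP<STRING, STRING>,           -- accessed as phones['home']
  address STRUCT<city:STRING, pin:INT>)  -- accessed as address.city
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '#'
MAP KEYS TERMINATED BY ':';
SELECT skills[0], phones['home'], address.city FROM EMPLOYEE;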
Hive File Format
 Text File: The default file format is text file.
 Sequential File: Sequential files are flat files
that store binary key-value pairs.
 RCFile (Record Columnar File): RCFile stores the data in a column-oriented manner, which ensures that aggregation operations are not expensive.
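The file format is chosen per table with a STORED AS clause; a minimal sketch (table names are hypothetical, and TEXTFILE, being the default, may be omitted):
CREATE TABLE logs_text (line STRING) STORED AS TEXTFILE;
CREATE TABLE logs_seq (k STRING, v STRING) STORED AS SEQUENCEFILE;
CREATE TABLE logs_rc (k STRING, v STRING) STORED AS RCFILE;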
Hive Query Language (HQL)
 Works on Databases, Tables, Partitions, Buckets
(Clusters)
 Create and manage tables and partitions.
 Support various Relational, Arithmetic, and Logical
Operators.
 Evaluate functions.
 Download the contents of a table to a local directory, or write the results of queries to an HDFS directory.
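A hedged sketch of the last bullet, reusing the STUDENT table from the examples below (both output paths are hypothetical):
-- Export a table to the local filesystem:
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/student_dump'
SELECT * FROM STUDENT;
-- Write a query result to an HDFS directory:
INSERT OVERWRITE DIRECTORY '/user/hive/output/high_gpa'
SELECT * FROM STUDENT WHERE gpa > 3.5;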
Database
 To create a database named “STUDENTS”
with comments and database properties.
CREATE DATABASE IF NOT EXISTS STUDENTS
COMMENT 'STUDENT Details'
WITH DBPROPERTIES ('creator' = 'JOHN');
Database
 To describe a database
DESCRIBE DATABASE STUDENTS;
 To show Databases
SHOW DATABASES;
 To drop a database.
DROP DATABASE STUDENTS;
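Note that DROP DATABASE fails if the database still contains tables; adding CASCADE drops the contained tables first (use with care):
DROP DATABASE IF EXISTS STUDENTS CASCADE;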
Tables
 There are two types of tables in Hive:
Managed table
External table
 The difference between the two is what happens when you drop a table:
 if it is a managed table, Hive deletes both the data and the metadata;
 if it is an external table, Hive deletes only the metadata.
 Use the EXTERNAL keyword to create an external table.
Tables
To create a managed table named 'STUDENT'.
CREATE TABLE IF NOT EXISTS STUDENT (rollno INT, name STRING, gpa FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
Tables
To create an external table named 'EXT_STUDENT'.
CREATE EXTERNAL TABLE IF NOT EXISTS EXT_STUDENT (rollno INT, name STRING, gpa FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/STUDENT_INFO';
Tables
To load data into the table from a file named student.tsv.
LOAD DATA LOCAL INPATH '/root/hivedemos/student.tsv' OVERWRITE INTO TABLE EXT_STUDENT;
To retrieve the student details from the EXT_STUDENT table.
SELECT * FROM EXT_STUDENT;
Table ALTER Operations
 ALTER TABLE mytablename RENAME TO mt;
 ALTER TABLE mytable ADD COLUMNS (mycol STRING);
 ALTER TABLE name RENAME TO new_name;
 ALTER TABLE name DROP [COLUMN] column_name;
 ALTER TABLE name CHANGE column_name new_name new_type;
 ALTER TABLE name REPLACE COLUMNS (col_spec[, col_spec ...]);
Partitions
 Partitions split the larger dataset into more meaningful chunks.
 Hive provides two kinds of partitions: Static Partition and Dynamic
Partition.
• To create a static partition based on the “gpa” column.
CREATE TABLE IF NOT EXISTS STATIC_PART_STUDENT (rollno INT, name STRING)
PARTITIONED BY (gpa FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
To load data into the partitioned table from another table.
INSERT OVERWRITE TABLE STATIC_PART_STUDENT PARTITION (gpa = 4.0)
SELECT rollno, name FROM EXT_STUDENT WHERE gpa = 4.0;
Partitions
• To create a dynamically partitioned table on the “gpa” column.
CREATE TABLE IF NOT EXISTS DYNAMIC_PART_STUDENT (rollno INT, name STRING)
PARTITIONED BY (gpa FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
To load data into the dynamic partition table from another table.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
Note: Dynamic-partition strict mode requires at least one static partition column; setting hive.exec.dynamic.partition.mode to nonstrict (as above) turns this off.
INSERT OVERWRITE TABLE DYNAMIC_PART_STUDENT PARTITION (gpa)
SELECT rollno, name, gpa FROM EXT_STUDENT;
Buckets
 Tables or partitions are sub-divided
into buckets, to provide extra structure to the
data that may be used for more efficient
querying. Bucketing works based on the value of a hash function of some column of a table.
 We can add partitions to a table by altering the
table. Let us assume we have a table
called employee with fields such as Id, Name,
Salary, Designation, Dept, and yoj.
Buckets
• To create a bucketed table having 3 buckets.
CREATE TABLE IF NOT EXISTS STUDENT_BUCKET (rollno INT, name STRING, grade FLOAT)
CLUSTERED BY (grade) INTO 3 BUCKETS;
To load data into the bucketed table.
FROM STUDENT
INSERT OVERWRITE TABLE STUDENT_BUCKET
SELECT rollno, name, grade;
To display the contents of the first bucket.
SELECT DISTINCT grade FROM STUDENT_BUCKET
TABLESAMPLE (BUCKET 1 OUT OF 3 ON grade);
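One caveat, stated as an assumption about pre-2.0 Hive releases: the INSERT above fills the declared bucket count only if bucketing enforcement is switched on first.
SET hive.enforce.bucketing = true; -- one output file per bucket (always on from Hive 2.0)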
Aggregations
 Hive supports aggregation functions like avg,
count, etc.
 To write the average and count aggregation
function.
SELECT avg(gpa) FROM STUDENT;
SELECT count(*) FROM STUDENT;
Group by and Having
To use GROUP BY and HAVING:
SELECT rollno, name,gpa
FROM STUDENT
GROUP BY rollno,name,gpa
HAVING gpa > 4.0;
SerDe
 SerDe stands for Serializer/Deserializer.
 It contains the logic to convert unstructured data into records.
 SerDes are implemented in Java.
 Serializers are used at the time of writing (INSERT statements).
 Deserializers are used at query time (SELECT statements).
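A custom SerDe is attached at table-creation time via the ROW FORMAT SERDE clause; a sketch using the JSON SerDe that ships with HCatalog (table and columns are hypothetical):
CREATE TABLE events_json (user_id STRING, action STRING)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE;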
Fill in the blanks
 The metastore consists of ______________
and a ______________.
 The most commonly used interface to interact
with Hive is ______________.
 The default metastore for Hive is
______________.
 Metastore contains ______________ of Hive
tables.
 ______________ is responsible for
compilation, optimization, and execution of
Hive queries.