Apache Hive
Sheetal Sharma
Intern At IBM Innovation Centre
Apache Hive
● Apache Hive is a tool built on top of Hadoop
for analyzing large, unstructured data sets
using a SQL-like syntax, thus making Hadoop
accessible to legions of existing BI and
corporate analytics researchers.
● Hive is fundamentally an operational data
store that's also suitable for analyzing large,
relatively static data sets where query time is
not important.
Apache Hive
● Hive makes an excellent addition to an existing data
warehouse, but it is not a replacement. Instead,
using Hive to augment a data warehouse is a great
way to leverage existing investments while keeping
up with the data deluge.
● The Hive data store brings together vast amounts
of unstructured data -- such as log files,
customer tweets, email messages, geo-data,
and CRM interactions -- and stores them in an
unstructured format on cheap commodity
hardware.
Apache Hive
● Hive allows analysts to project a database-like
structure on this data, to resemble traditional
tables, columns, and rows, and to write SQL-like
queries over it.
● This means that different schemas may be
projected over the same data sets, depending
on the nature of the query, allowing the user to
ask questions that weren't envisioned when
the data was gathered.
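The idea of projecting different schemas over the same stored data can be sketched in a few lines of Python. This is only a conceptual illustration, not Hive code: the log records, the `project` helper, and both schemas are hypothetical names invented for the example.

```python
# Hypothetical raw records, stored once as tab-separated text.
RAW_LINES = [
    "2014-05-01\talice\t/home\t200",
    "2014-05-01\tbob\t/cart\t500",
    "2014-05-02\talice\t/cart\t200",
]

def project(lines, schema):
    """Apply a schema at read time.

    schema is a list of (column name, field index, converter) tuples.
    """
    rows = []
    for line in lines:
        fields = line.split("\t")
        rows.append({name: conv(fields[idx]) for name, idx, conv in schema})
    return rows

# Two different schemas projected over the same stored data.
TRAFFIC_SCHEMA = [("day", 0, str), ("path", 2, str)]
ERROR_SCHEMA = [("user", 1, str), ("status", 3, int)]

traffic = project(RAW_LINES, TRAFFIC_SCHEMA)
errors = [r for r in project(RAW_LINES, ERROR_SCHEMA) if r["status"] >= 500]
```

The stored lines never change; only the schema chosen at query time does, which is what lets a later question use fields nobody planned for.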
Apache Hive
● Hive queries traditionally had high latency,
and even small queries could take some time
to run because they were transformed into
map-reduce jobs and submitted to the cluster
to be run in batch mode.
● Long-running queries were inconvenient and
troublesome to run in a multi-user
environment, where a single job could
dominate the cluster.
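To see why even a small query carried batch overhead, it helps to picture how a grouped count (roughly `SELECT status, COUNT(*) ... GROUP BY status`) breaks into map, shuffle, and reduce steps. The Python below is a single-process sketch of that shape only; it does not reflect Hive's actual query planner, and the function names and sample rows are hypothetical.

```python
from collections import defaultdict

def map_phase(rows):
    """Emit one (key, 1) pair per input row."""
    for row in rows:
        yield row["status"], 1

def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}

rows = [{"status": 200}, {"status": 500}, {"status": 200}]
counts = reduce_phase(shuffle(map_phase(rows)))
```

On a real cluster each phase is scheduled as separate tasks across machines, so the fixed cost of launching and coordinating them applies no matter how small the input is.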
Apache Hive
multi-user environment
Apache Hive
● Although HiveQL, the query language, is based on SQL-92,
it differs from SQL in some important ways because it
runs on top of Hadoop.
● For instance, DDL (Data Definition Language)
commands need to account for the fact that tables
exist in a multi-user file system that supports multiple
storage formats.
● Nevertheless, SQL users will find the HiveQL
language familiar and should not have any problems
adapting to it.
Hive platform architecture
● From the top down, Hive looks much like any other
relational database.
● Users write SQL queries and submit them for
processing, using either a command line tool that
interacts directly with the database engine or by
using third-party tools that communicate with the
database via JDBC or ODBC.
● By using the JDBC and ODBC drivers, available for
Mac and Windows, data workers can connect their
favorite SQL client to Hive to browse, query, and
create tables.
Working with Hive
● HiveQL was designed to ease the transition from SQL
and to get data analysts up and running on Hadoop right
away.
● Most BI and SQL developer tools can connect to Hive as
easily as to any other database. Using the ODBC
connector, users can import data and use tools like
PowerPivot for Excel to explore and analyze data,
making big data accessible across the organization.
Differences in HiveQL and standard SQL
Hive 0.13 was designed to perform full-table scans
across petabyte-scale data sets using the YARN and Tez
infrastructure, so some features normally found in a
relational database aren't available to the Hive user.
These include transactions, cursors, prepared
statements, row-level updates and deletes, and the
ability to cancel a running query.
The absence of these features won't significantly
affect data analysis, but it might affect your ability to use
existing SQL queries on a Hive cluster.
Differences in HiveQL and standard SQL
In a traditional database environment, the database
engine controls all reads and writes to the database. In
Hive, the database tables are stored as files in the
Hadoop Distributed File System (HDFS), where other
applications could have modified them.
Although this openness can be a good thing, it means that
Hive can never be certain whether the data being read
matches the schema.
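One practical consequence is that schema checks happen when data is read rather than when it is written. The sketch below models a tolerant reader that yields a null (`None`) for any field that is missing or cannot be converted, which is similar in spirit to how Hive's text handling surfaces malformed fields as NULL. The `read_row` helper and the sample schema are hypothetical, and this is an assumption-laden illustration rather than Hive's implementation.

```python
def read_row(line, schema):
    """Parse one tab-separated line against a schema of (name, converter) pairs.

    Fields that are missing or fail conversion become None instead of
    raising an error, because the file may have been written by any tool.
    """
    fields = line.split("\t")
    row = {}
    for i, (name, conv) in enumerate(schema):
        try:
            row[name] = conv(fields[i])
        except (IndexError, ValueError):
            row[name] = None
    return row

SCHEMA = [("user", str), ("status", int)]

good = read_row("alice\t200", SCHEMA)
bad = read_row("bob\tn/a", SCHEMA)    # status is not an integer
short = read_row("carol", SCHEMA)     # status field is missing
```

Queries over such data should therefore expect nulls where another application has written something the schema did not anticipate.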
Aspects of Data Storage
File formats and Compression
● Tuning Hive queries can involve making the underlying
map-reduce jobs run more efficiently by optimizing the
number, type, and size of the files backing the database
tables.
● Hive's default storage format is text, which has the
advantage of being usable by other tools.
● The disadvantage, however, is that queries over raw
text files can't be easily optimized.
Hive can read and write several file formats and decompress
many of them on the fly. Storage requirements and query
efficiency can differ dramatically among these file formats, as can
be seen in the figure below (courtesy of Hortonworks).
File formats are an active area of research in the Hadoop community.
Efficient file formats both reduce storage costs and increase query
efficiency.
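A rough way to see why format matters is to compare how many bytes a single-column query must scan under a row-oriented text layout versus a column-oriented one (ORC is one columnar format that Hive supports). The Python below is a toy model with hypothetical data and helper names; real formats add compression, indexes, and encodings that this sketch ignores.

```python
ROWS = [
    ("alice", "/home", 200),
    ("bob", "/cart", 500),
    ("carol", "/home", 200),
]
COLUMNS = ("user", "path", "status")

# Row-oriented text: a query must scan every full line.
row_file = "\n".join("\t".join(str(v) for v in r) for r in ROWS)

# Column-oriented: the values of each column are stored contiguously.
col_files = {
    name: "\n".join(str(r[i]) for r in ROWS)
    for i, name in enumerate(COLUMNS)
}

def bytes_scanned_row_layout():
    """Bytes read to answer any query over the text layout."""
    return len(row_file.encode("utf-8"))

def bytes_scanned_column_layout(column):
    """Bytes read to answer a query touching only one column."""
    return len(col_files[column].encode("utf-8"))
```

Here a query that needs only `status` reads a small fraction of the bytes under the columnar layout, and the gap widens as tables gain more columns.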
For Example
● Let's say you want to run a query that's not
part of the built-in SQL. Without a UDF
(user-defined function), you would have to
dump a temporary table to disk, run a second
tool (such as Pig or Java) for your custom
query, and possibly produce a third table in
HDFS that would be analyzed by Hive.
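Conceptually, a UDF is custom per-row logic that the query engine calls inline, so no intermediate table or second tool is needed. The Python below models that idea only; it is not Hive's UDF API (real Hive UDFs are typically written as Java classes), and `email_domain` and `select_with_udf` are hypothetical names.

```python
def email_domain(email):
    """Hypothetical UDF: return the lower-cased domain of an email address."""
    return email.split("@")[-1].lower()

def select_with_udf(rows, column, udf):
    """Model of 'SELECT udf(column) FROM rows': apply the function row by row."""
    return [udf(row[column]) for row in rows]

rows = [{"email": "Alice@Example.COM"}, {"email": "bob@hive.test"}]
domains = select_with_udf(rows, "email", email_domain)
```

Because the function runs inside the query itself, the custom step stays in a single pass over the data.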
Hive Query Performance
Hive 0.13 is the final piece in the Stinger initiative, a
community effort to improve the performance of Hive. The
most significant feature of 0.13 is the ability to run queries on
the new Tez execution framework.
● Query times dropped by half when run on Tez.
● On queries that could be cached, times dropped another 30
percent.
● On larger data sets, the speedup was even more dramatic.
● It becomes possible to execute petabyte-scale queries to refine
and cleanse data for later incorporation into data warehouse
analytics.
Hive Query Performance
● Hadoop and Hive could also be used in the reverse scenario:
to off-load data summaries that would otherwise need to be
stored in the data warehouse at much greater cost.
● Organizations or departments without a data warehouse can
start with Hive to get a feel for the value of data analytics.
● It does make a great, low-cost, large-scale operational data
store with a fair set of analytics tools.
● Hive offers near-linear scalability in query processing and an
order of magnitude better price/performance ratio than
traditional enterprise data warehouses.
Apache Hive At a Glance
Thank You!
