The Apache Hadoop HIVE
Omoyayi Ibrahim Omodamilola
Student No.: 20174831
PhD Biomedical Engineering
Outline
• Big Data
• History of Database (NoSQL vs SQL)
• New SQL database
• SQL
• NoSQL
• Factors Affecting the Selection of a database
• Hadoop Hive
– Functions of Hive on Hadoop
– Hive vs Java vs Pig
• Hadoop Distributed File System
• Hive Architecture
• Work Flow of Hive
• List of Reference
Apache Hadoop Hive
INTRODUCTION
• Hive was started at Facebook in 2007 in response to the company's
rapid data growth.
• Facebook's existing ETL system began to fail within a few years as
more people joined the site.
• In August 2008, Facebook decided to move to a more scalable
open-source Hadoop environment: Hive.
• Facebook, Netflix, and Amazon support Hive's SQL dialect,
now known as HiveQL.
SQL (left) vs NoSQL (right)
Source: Google Images
NEW STRUCTURED QUERY LANGUAGE
NewSQL
• Relational + NoSQL
• designed for Web-scale applications
• provide many of the traditional SQL
operations
Class of modern relational database management systems that seek to
provide the same scalable performance of NoSQL systems for online
transaction processing (OLTP) read-write workloads while still
maintaining the ACID guarantees of a traditional database system.
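As a rough illustration of the atomicity part of ACID mentioned above, here is a minimal sketch using Python's built-in sqlite3 module. SQLite is not a NewSQL system; it is used only because it ships with Python. The `accounts` table, the names, and the amounts are hypothetical.

```python
import sqlite3

# Hypothetical two-account ledger; names and amounts are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one transaction; undo both steps on failure."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False

ok = transfer(conn, "alice", "bob", 30)       # both updates are applied
failed = transfer(conn, "alice", "bob", 500)  # the debit violates CHECK, so the credit is undone too
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)
```

The second transfer leaves no partial effect: either both updates happen or neither does.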
RELATIONAL DATABASES SQL
• Structured Query Language (SQL)
• Consists of two or more tables with columns and
rows
• The relationships between tables, together with the field
types, are called a schema
• SQL is a programming language used by database
architects to design relational databases (such as MySQL,
Sybase, Oracle, or IBM DB2).
• These databases are well understood and widely
supported
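The idea of tables, field types, and relationships forming a schema can be sketched with Python's built-in sqlite3 module. The `patients` and `visits` tables and their contents are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The schema: table names, columns, field types, and the relationship between tables.
conn.executescript("""
CREATE TABLE patients (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE visits (
    id         INTEGER PRIMARY KEY,
    patient_id INTEGER NOT NULL REFERENCES patients(id),
    reason     TEXT
);
""")
conn.execute("INSERT INTO patients VALUES (1, 'Ada')")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?)",
    [(1, 1, "checkup"), (2, 1, "follow-up")],
)

# A join follows the relationship declared in the schema.
rows = conn.execute("""
    SELECT p.name, v.reason
    FROM patients AS p
    JOIN visits AS v ON v.patient_id = p.id
    ORDER BY v.id
""").fetchall()
print(rows)
```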
Popular SQL databases and RDBMS’s
• MySQL: the most popular open-source database
• Oracle: an object-relational DBMS written in the C++ language.
• IBM DB2: a family of database server products from IBM that are
built to handle advanced “big data” analytics.
• Sybase: a relational model database server product for
businesses, primarily used on Unix and Linux
• MS SQL Server: a Microsoft-developed RDBMS for enterprise-level
databases that supports both SQL and NoSQL architectures.
• Microsoft Azure: a cloud computing platform that supports any
operating system, and lets you store, compute, and scale data
• MariaDB: an enhanced, drop-in replacement for MySQL.
• PostgreSQL: an enterprise-level, object-relational DBMS that supports
procedural languages such as Perl and Python.
NOSQL DATABASES
• Easy to access
• Greater flexibility
• Document-oriented data
• Massive amounts of data
• Unclear data requirements
• Data includes: sensor data, social sharing, personal
settings, photos, location-based information, online
activity, usage metrics, etc.
Source: UpWork
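To make "document-oriented data" concrete, here is a minimal sketch that uses plain Python dictionaries in place of a real document store. The `documents` list, its fields, and the `find` helper are hypothetical and do not correspond to any particular database's API.

```python
import json

# Two documents in the same hypothetical "collection"; they need not share fields.
documents = [
    {"_id": 1, "type": "sensor", "reading": 21.5, "unit": "C"},
    {"_id": 2, "type": "photo", "tags": ["beach", "sunset"],
     "location": {"lat": 35.2, "lon": 33.4}},
]

def find(collection, **criteria):
    """Return documents whose top-level fields match all the given criteria."""
    return [
        doc for doc in collection
        if all(doc.get(key) == value for key, value in criteria.items())
    ]

photos = find(documents, type="photo")
serialized = json.dumps(photos[0])  # documents are commonly stored and exchanged as JSON
print(serialized)
```

Because no schema is fixed in advance, a new kind of record can be added without altering the existing ones.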
POPULAR NOSQL DATABASES
• MongoDB: the most popular NoSQL system
• Apache CouchDB: a true DB for the web that uses the
JSON data exchange format to store its documents
• HBase: another Apache project, developed as part of
Hadoop; an open-source, non-relational “column
store”
• Oracle NoSQL: Oracle’s entry into the NoSQL category.
• Apache Cassandra: born at Facebook, it handles
massive amounts of structured data. Users include
Instagram, Comcast, Apple, and Spotify.
• Riak: has fault-tolerant replication and automatic
data distribution built in for excellent performance.
Apache Hadoop Hive
SQL
Pros
• Relational databases work with structured data.
• They support ACID (Atomicity, Consistency,
Isolation, Durability) transactional consistency.
• They come with built-in data integrity and a
large ecosystem.
• Relationships in this system have constraints.
• There is limitless indexing.
• Strong SQL.
Cons
• Relational databases do not scale out
horizontally very well (concurrency and data
size), only vertically.
• Data is normalized, meaning lots of joins, which
affects speed.
• They have problems working with
semi-structured data.
NoSQL
Pros
• They scale out horizontally and work with
unstructured and semi-structured data.
• Some support ACID transactional
consistency.
• Schema-free or schema-on-read options.
• High availability of language training, setup,
and development costs.
• Databases are open source and so “free”.
• Numerous commercial products available.
Cons
• Data is denormalized, requiring mass updates
(e.g. a product name change).
• Weaker or eventual consistency instead of
ACID.
• Does not have built-in data integrity (must
be done in code).
• Limited support.
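The "mass updates" drawback can be sketched in a few lines of Python. The `orders` list and the `rename_product` helper are hypothetical; they only show why a product name change touches many records when the name is copied into each one.

```python
# Orders that each carry their own copy of the product name.
orders = [
    {"order_id": 1, "product_id": "p1", "product_name": "Widget"},
    {"order_id": 2, "product_id": "p1", "product_name": "Widget"},
    {"order_id": 3, "product_id": "p2", "product_name": "Gadget"},
]

def rename_product(orders, product_id, new_name):
    """Rewrite every order that embeds the product; return how many were touched."""
    touched = 0
    for order in orders:
        if order["product_id"] == product_id:
            order["product_name"] = new_name
            touched += 1
    return touched

updated = rename_product(orders, "p1", "Widget Pro")
print(f"{updated} order documents rewritten for one product rename")
```

In a normalized relational design the name would live in a single product row, so the same rename would be one update followed by joins at query time.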
Hadoop
• Facebook, Google, Yahoo, Amazon, and Microsoft
• Exponential growth of data
• Doug Cutting developed an open-source version of the
MapReduce system, called Hadoop
• Hadoop is a software ecosystem that allows for
massively parallel computing
• A large data job that might take 20 hours of
processing time on a relational database may take
only 3 minutes with Hadoop
• HiveQL (HQL) looks like traditional SQL
Hadoop clusters on Client computers
Hive is not
• A relational database
• A design for OnLine Transaction Processing
(OLTP)
• A language for real-time queries and row-level
updates
FUNCTIONS OF HIVE ON HADOOP
• Data warehouse system built on top of Hadoop
• Takes advantage of Hadoop's processing power
• Facilitates data summarization, ad-hoc queries, and
analysis of large datasets stored in Hadoop
• Provides a SQL interface (known as HiveQL, or HQL),
which is familiar to most programmers
• Saves time compared with writing Hadoop MapReduce programs by hand
• Provides a mechanism to project structure onto
Hadoop datasets
• Loads data quickly and allows flexibility, at the cost of query
time
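Projecting structure onto stored data, with fast loads paid for at query time, is often described as schema-on-read. The sketch below imitates the idea in plain Python and does not use Hive itself. The tab-separated `raw` text, the `schema` list, and the `read_with_schema` helper are assumptions made for illustration.

```python
import csv
import io

# Raw file contents as they might sit in storage; nothing was validated at load time.
raw = "1\tada\t36.6\n2\tgrace\t37.1\n3\tlinus\tn/a\n"

# The schema is declared separately and applied only when the data is read.
schema = [("id", int), ("name", str), ("temperature", float)]

def read_with_schema(text, schema):
    """Yield one dict per row, casting each field; unparseable values become None."""
    for fields in csv.reader(io.StringIO(text), delimiter="\t"):
        row = {}
        for (column, cast), value in zip(schema, fields):
            try:
                row[column] = cast(value)
            except ValueError:
                row[column] = None
        yield row

rows = list(read_with_schema(raw, schema))
temps = [r["temperature"] for r in rows if r["temperature"] is not None]
average = sum(temps) / len(temps)
print(rows[2], average)
```

Loading is cheap because the file is stored untouched. The parsing and type checks run on every read, which is where the query-time cost comes from.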
Apache frameworks
• Sqoop: used to import and export data
between HDFS and an RDBMS.
• Pig: a procedural language platform used
to develop scripts for MapReduce operations.
• Hive: a platform used to develop SQL-type
scripts that perform MapReduce operations.
Hive vs Java and Pig
Java
• Word Count MapReduce
example: list the words in a
document and the number of
occurrences of each
• Java takes 63 lines of code
to write this; Hive takes only 7
easy lines.
Pig
• High-level programming
language
• Good for ETL
• Powerful transformation
capabilities
• Often used in combination with
Hive.
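The word-count job used in this comparison can be sketched in plain Python to show the map, shuffle, and reduce steps. This is neither the 63-line Java program nor the 7-line Hive script from the slide. The sample `lines` are made up, and the HiveQL string at the end is an unexecuted sketch that assumes a table `docs` with a string column `line`.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group the emitted values by key (the word)."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups

def reduce_phase(groups):
    """Reduce: sum the grouped counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["Hive runs on Hadoop", "Hadoop stores data", "Hive queries data"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts)

# For comparison only (not executed here): a HiveQL sketch of the same job.
# The table `docs` and its column `line` are assumptions.
HIVEQL_WORD_COUNT = r"""
SELECT word, COUNT(*) AS occurrences
FROM (SELECT explode(split(line, '\\s+')) AS word FROM docs) w
GROUP BY word;
"""
```

The SQL-style version states what result is wanted and leaves the map and reduce plumbing to the engine, which is why it is so much shorter.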
Hive Architecture
HIVE DIRECTORY STRUCTURE
• Lib directory
– $HIVE_HOME/lib
– Location of the Hive JAR files
– Contains the actual Java code that implements the Hive
functionality
• Bin directory
– $HIVE_HOME/bin
– Location of Hive scripts/services
• Conf directory
– $HIVE_HOME/conf
– Location of configuration files
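A small sketch of how a script might locate the three directories listed above from the HIVE_HOME environment variable. The `hive_directories` helper, the `/opt/hive` fallback, and the `/usr/local/hive` example path are all assumptions, not part of Hive.

```python
import os
from pathlib import Path

def hive_directories(hive_home=None):
    """Return the lib, bin, and conf paths under a Hive installation root."""
    # The "/opt/hive" fallback is an assumption used when HIVE_HOME is not set.
    root = Path(hive_home or os.environ.get("HIVE_HOME", "/opt/hive"))
    return {name: root / name for name in ("lib", "bin", "conf")}

dirs = hive_directories("/usr/local/hive")  # hypothetical install location
for name, path in dirs.items():
    print(f"{name}: {path} (exists: {path.is_dir()})")
```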
Summary & Conclusion
• Hive is a data warehouse infrastructure tool to process
structured data in Hadoop.
• It resides on top of Hadoop to summarize Big Data, and
makes querying and analyzing easy.
• Hive was initially developed by Facebook; the
Apache Software Foundation later took it up and
developed it further as open source under the
name Apache Hive.
• It is used by different companies. For example,
Amazon uses it in Amazon Elastic MapReduce.
REFERENCES
• http://www.dataversity.net/review-pros-cons-different-databases-relational-versus-non-relational/
• https://segment.com/blog/choosing-a-database-for-analytics/
• https://www.upwork.com/hiring/data/sql-vs-nosql-databases-whats-the-difference/
DON’T THANK ME THANK HIVE
