8 ESSENTIAL CONCEPTS OF BIG DATA AND HADOOP
A QUICK REFERENCE GUIDE
What is Big Data and Hadoop?
Big Data refers to sets of data so large that they cannot be analyzed with traditional tools; the term also covers the large-scale processing architectures built to handle such data.
Hadoop is an open-source software framework developed by Apache to support distributed processing of data. Hadoop was initially written in Java™, but many other languages can now be used to script Hadoop jobs. Hadoop serves as the core platform for structuring Big Data and helps in performing data analytics.
Table of Contents 
Chapter: 1 Important Definitions 
Chapter: 2 MapReduce 
Chapter: 3 HDFS 
Chapter: 4 Pig vs. SQL 
Chapter: 5 HBase components
Chapter: 6 Cloudera 
Chapter: 7 ZooKeeper and Sqoop
Chapter: 8 Hadoop Ecosystem
Chapter: 1 — Important Definitions

Big Data: Data sets whose size makes it difficult for commonly used data-capturing software tools to interpret, manage, and process them within a reasonable time frame.

Hadoop: An open-source framework built on the Java environment. It assists in the processing of large data sets in a distributed computing environment.

VMware Player: A free software package offered by VMware, Inc., used to create and manage virtual machines.

Hadoop Architecture: Hadoop follows a master-slave architecture, with the NameNode as the master and the DataNodes as the slaves.

HDFS: The Hadoop Distributed File System is a distributed file system that shares some features with other distributed file systems. It is used for storing and retrieving unstructured data.

MapReduce: A core component of Hadoop, responsible for processing jobs in distributed mode.

Apache Hadoop: One of the primary technologies in the field of Big Data.

Ubuntu Server: Ubuntu is a leading open-source platform for scale-out. It helps to utilize infrastructure at its optimum level, whether users want to deploy a cloud, a web farm, or a Hadoop cluster.

Pig: Apache Pig is a platform for analyzing large datasets. It includes a high-level language for expressing data analysis programs and is one of the components of the Hadoop ecosystem.

Hive: An open-source data warehousing system used to analyze large datasets stored in Hadoop files. Its three key functions are data summarization, query, and analysis.

SQL: A query language used to interact with relational databases.

Metastore: The component that stores the system catalog and metadata about tables, columns, partitions, etc. It is stored in a traditional RDBMS format.

Driver: The component that manages the lifecycle of a HiveQL statement.

Query compiler: One of the driver components. It is responsible for compiling the Hive script and checking it for errors.
Query optimizer: One of the driver components. It optimizes Hive scripts for faster execution and consists of a chain of transformations.

Execution engine: Executes the tasks produced by the compiler in proper dependency order.

Hive Server: The main component responsible for providing an interface to the user. It also maintains connectivity with the other modules.

Client components: Used by developers to work with Hive. They include the Command Line Interface (CLI), the web UI, and the JDBC/ODBC driver.

Apache HBase: A distributed, column-oriented database built on top of HDFS (the Hadoop Distributed File System). HBase can scale horizontally to thousands of commodity servers and petabytes of data by indexing the storage.

ZooKeeper: Used for performing region assignment. ZooKeeper is a centralized service for maintaining configuration information, naming, distributed synchronization, and group services.

Cloudera: A commercial distribution for deploying Hadoop in an enterprise setup.

Sqoop: A tool that extracts data from non-Hadoop sources and formats it so that the data can later be used by Hadoop.

Chapter: 2 — MapReduce
The MapReduce component of Hadoop is responsible for processing jobs in distributed mode. The features of MapReduce are as follows:

Distributed data processing
The first feature of the MapReduce component is that it performs distributed data processing using the MapReduce programming paradigm.

User-defined map phase
The second feature of MapReduce is that you can define your own map phase, which is a parallel, share-nothing processing of the input.

Aggregation of output
The third feature of MapReduce is aggregation of the output of the map phase in a user-defined reduce phase that runs after the map process.
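The map, shuffle, and reduce phases of the MapReduce paradigm described in Chapter 2 can be illustrated with a minimal, single-process Python sketch. This is not the Hadoop API; the function names and the word-count task are illustrative only.

```python
from collections import defaultdict

def map_phase(document):
    # User-defined map: emit a (word, 1) pair for every word in the input.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(mapped_pairs):
    # Group all intermediate values by key, as the framework does
    # between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # User-defined reduce: aggregate all values emitted for one key.
    return (key, sum(values))

def word_count(documents):
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

counts = word_count(["big data", "big hadoop data"])
# counts == {'big': 2, 'data': 2, 'hadoop': 1}
```

In real Hadoop the map tasks run in parallel on different splits of the input, and the shuffle moves data across the network; only the map and reduce functions are supplied by the user.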
Chapter: 3 — HDFS
HDFS is used for storing and retrieving unstructured data. The features of Hadoop HDFS are as follows:

Provides access to data blocks
HDFS provides high-throughput access to data blocks. When unstructured data is uploaded to HDFS, it is converted into data blocks of fixed size. The data is chunked into blocks so that it is compatible with commodity hardware's storage.

Helps to manage the file system
HDFS provides a limited interface for managing the file system to allow it to scale. This feature ensures that you can scale the resources in the Hadoop cluster up or down.

Creates multiple replicas of data blocks
The third feature of HDFS is that it creates multiple replicas of each data block (three by default) and distributes them across DataNodes in the cluster, so that the data remains available even if a node fails.
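The chunking and replication behavior described above can be sketched in a few lines of Python. The block size, node names, and round-robin placement here are illustrative assumptions; real HDFS defaults to 128 MB blocks and three replicas, and uses rack-aware placement rather than round-robin.

```python
def split_into_blocks(data: bytes, block_size: int):
    # Chunk a file into fixed-size blocks, as HDFS does on upload;
    # the final block may be smaller than block_size.
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks: int, datanodes: list, replication: int = 3):
    # Assign each block to `replication` distinct DataNodes.
    placement = {}
    for b in range(num_blocks):
        placement[b] = [datanodes[(b + r) % len(datanodes)]
                        for r in range(replication)]
    return placement

blocks = split_into_blocks(b"x" * 300, block_size=128)  # 3 blocks: 128 + 128 + 44 bytes
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
```

The key idea is that no single DataNode holds the whole file: losing one node loses at most one replica of any block.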
Chapter: 4 — Pig vs. SQL
The table below summarizes the differences between Pig and SQL:

Difference Pig SQL
Definition Apache Pig is a platform for analyzing large datasets. It includes a
high-level language (Pig Latin) for expressing data analysis programs
as a sequence of transformations.
It is a declarative query language
used to interact with relational
databases.
Example customer = LOAD '/data/customer.dat' AS
(c_id, name, city);
sales = LOAD '/data/sales.dat' AS
(s_id, c_id, date, amount);
custBLR = FILTER customer BY city == 'Bangalore';
joined = JOIN custBLR BY c_id, sales BY c_id;
grouped = GROUP joined BY custBLR::c_id;
summed = FOREACH grouped GENERATE group,
SUM(joined.sales::amount);
spenders = FILTER summed BY $1 > 100000;
sorted = ORDER spenders BY $1 DESC;
DUMP sorted;
SELECT c.c_id, SUM(s.amount) AS CTotal
FROM customers c JOIN
sales s ON c.c_id = s.c_id
WHERE c.city = 'Bangalore'
GROUP BY c.c_id
HAVING SUM(s.amount) > 100000
ORDER BY CTotal DESC;
Chapter: 5 — HBase components
Introduction
Apache HBase is a distributed, column-oriented database built on top of HDFS (the Hadoop Distributed File System). HBase can scale horizontally to thousands of commodity servers and petabytes of data by indexing the storage. HBase supports random, real-time CRUD operations and offers linear and modular scalability. It provides an easy-to-use Java API for programmatic access.
HBase is integrated with the MapReduce framework in Hadoop. It is an open-source framework modeled after Google's Bigtable, and it is a type of NoSQL database.
HBase Components
The components of HBase are the HBase Master and multiple RegionServers.

[Diagram: the HBase Master coordinates the RegionServers, each of which holds a MemStore, HFiles, and a WAL; ZooKeeper tracks the region znodes /hbase/region1, /hbase/region2, and so on.]
An explanation of the components of HBase is given below:

HBase Master
It is responsible for managing the schema that is stored in the Hadoop Distributed File System (HDFS).

Multiple RegionServers
RegionServers act as availability servers, each maintaining a part of the complete data stored in HDFS according to the user's requirements. They do this using the HFile and WAL (Write-Ahead Log) services. The RegionServers always stay in sync with the HBase Master; it is ZooKeeper's responsibility to ensure that they do.
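The way HBase addresses a cell by row key, column (family:qualifier), and timestamp can be illustrated with a toy in-memory table in Python. This is not the HBase client API; the class and its methods are hypothetical stand-ins that sketch the data model only.

```python
import time
from collections import defaultdict

class ToyHBaseTable:
    # Cells keyed by (row, column); each cell keeps a list of
    # (timestamp, value) versions, mimicking HBase's
    # (row key, column, timestamp) -> value model.
    def __init__(self):
        self.cells = defaultdict(list)

    def put(self, row, column, value, ts=None):
        ts = ts if ts is not None else time.time()
        self.cells[(row, column)].append((ts, value))

    def get(self, row, column):
        versions = self.cells.get((row, column))
        if not versions:
            return None
        # A read returns the value with the newest timestamp.
        return max(versions)[1]

t = ToyHBaseTable()
t.put("row1", "info:name", "Alice", ts=1)
t.put("row1", "info:name", "Bob", ts=2)  # newer version shadows the older one
```

Unlike a relational row, nothing forces every row to have the same columns, which is what makes the model column-oriented and sparse.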
Chapter: 6 — Cloudera
Cloudera is a commercial distribution for deploying Hadoop in an enterprise setup. The salient features of Cloudera are as follows:

It uses a 100% open-source distribution of Apache Hadoop and related projects such as Apache Pig, Apache Hive, Apache HBase, and Apache Sqoop.

It provides its own user-friendly Cloudera Manager for system management, Cloudera Navigator for data management, dedicated technical support, and more.
Chapter: 7 — ZooKeeper and Sqoop
ZooKeeper is an open-source, high-performance coordination service for distributed applications. It offers services such as naming, locks and synchronization, configuration management, and group services.

ZooKeeper Data Model
ZooKeeper has a hierarchical namespace. Each node in the namespace is called a znode. The tree diagram given here represents the namespace: '/' is the root, and App1 and App2 reside under the root. The path to access db is /App1/db; such a path is called a hierarchical path.
/
  /App1
    /App1/db
    /App1/conf
  /App2
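Resolving a hierarchical path such as /App1/db can be sketched with a toy znode tree in Python. This is not the ZooKeeper client API; the dictionary layout and the exists helper are illustrative assumptions.

```python
# Toy znode tree: each znode maps child names to subtrees.
namespace = {
    "App1": {"db": {}, "conf": {}},
    "App2": {},
}

def exists(path: str, tree: dict) -> bool:
    # Walk the tree one path component at a time, starting at the root '/'.
    node = tree
    for part in path.strip("/").split("/"):
        if part not in node:
            return False
        node = node[part]
    return True

print(exists("/App1/db", namespace))    # True
print(exists("/App1/logs", namespace))  # False
```

The real ZooKeeper API exposes the same idea through calls that take such hierarchical paths, with each znode also carrying data and watches.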
Sqoop is an Apache Hadoop ecosystem project that imports data from relational databases such as MySQL, MS SQL Server, and Oracle into HDFS, and exports data from HDFS back to them. The following are the reasons for using Sqoop:

SQL servers are deployed worldwide and are a primary way of accepting data from users; nightly processing has been done on SQL servers for years.

It is essential to have a mechanism for moving data from a traditional SQL database to Hadoop HDFS.

Transferring the data with ad-hoc automated scripts is inefficient and time-consuming.

Enterprises have reporting, data visualization, and other applications built on traditional databases, but handling very large data requires the Hadoop ecosystem.

Sqoop also satisfies the need to bring processed data from Hadoop HDFS back to applications such as database engines or web services.
Chapter: 8 — Hadoop Ecosystem
The Hadoop ecosystem consists of the following components. The base of them all is the Hadoop Distributed File System (HDFS). Above it sits YARN with MapReduce v2, the framework component used for distributed processing in a Hadoop cluster.

The next component is Flume, which is used for collecting logs across a cluster. Sqoop is used for data exchange between relational databases and Hadoop HDFS.

The ZooKeeper component is used for coordinating the nodes in a cluster. The next component is Oozie, which is used for creating, executing, and modifying the workflows of MapReduce jobs. The Pig component is used for scripting MapReduce applications.

The next component is Mahout, which is used for scalable machine learning. R Connectors are used for generating statistics from the nodes in a cluster. Hive is used for interacting with Hadoop through SQL-like queries. The next component is HBase, which provides random, real-time access to slices of large datasets.

The last component is Ambari, which is used for provisioning, managing, and monitoring Hadoop clusters.
