Introduction to Hadoop
What is BigData?
The term "BigData" describes collections of data so large and complex that they are difficult to capture, search, store, process and analyze using a traditional database management system.
The data comes from everywhere, for example:
Social media sites
Traffic and satellite feeds
The digital world
Software logs
Business data
And many more
• BigData includes both structured and unstructured data.
• BigData is difficult to work with using most relational database management systems.
• BigData is more than simply a matter of size; it is an opportunity to find insights in new and emerging types of data and content, and to make your business more agile.
• Why it is so important:
1. More data can lead to more accurate analyses.
2. More accurate analyses lead to better decision making.
3. Better decisions mean greater operational efficiency, cost reductions and reduced risk.
What is Hadoop?
"Apache Hadoop is an open source software framework used to process large data sets across a distributed cluster of commodity (inexpensive, widely available) hardware using simple programming models."
• Hadoop processes the data in parallel on a large cluster.
Google created its own distributed computing framework and published papers about it. Hadoop was developed on the basis of those papers.
Core Hadoop consists of two components:
- The Hadoop Distributed File System (HDFS), for storage
- MapReduce, for processing
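To make the MapReduce idea concrete, here is a minimal sketch of the programming model in plain Python. It does not use Hadoop or any Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration, and everything runs in a single process rather than on a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over a list of text lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big cluster"]))
```

In a real cluster the map calls would run in parallel on the nodes that hold the data, and the framework would move the grouped pairs to the reducers; the sketch only shows the shape of the computation.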
Why Hadoop?
• Economical (cost effective)
• Flexible
• Scalable
• Solves BigData problems
• Reliable
• Smart
How Hadoop works
[Diagram: a client submits a program and data to the master node, which coordinates three slave nodes.]
Master node: HDFS NameNode and MapReduce JobTracker
Each slave node: HDFS DataNode and MapReduce TaskTracker
STEPS:
Step 1: The data is broken into blocks of 64 MB or 128 MB, and the blocks are distributed to different nodes.
Step 2: Once all the blocks are in place, the Hadoop framework passes the program to each node.
Step 3: The JobTracker then schedules the program on the individual nodes.
Step 4: Once all the nodes are done, the output is returned.
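The block arithmetic in Step 1 can be sketched as follows. This is an illustrative sketch only, not HDFS code: the helper names (split_into_blocks, place_blocks) and the node names are hypothetical, and the round-robin placement is a simplifying assumption. Real HDFS also replicates each block and takes rack layout into account, which is left out here.

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size=128 * MB):
    """Return the block sizes for a file; the last block may be smaller."""
    if file_size <= 0:
        return []
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


def place_blocks(num_blocks, nodes):
    """Assign block indexes to nodes round-robin (a simplification)."""
    placement = {node: [] for node in nodes}
    for index in range(num_blocks):
        placement[nodes[index % len(nodes)]].append(index)
    return placement


if __name__ == "__main__":
    blocks = split_into_blocks(300 * MB)  # two full blocks plus a 44 MB tail
    print([size // MB for size in blocks])
    print(place_blocks(len(blocks), ["slave1", "slave2", "slave3"]))
```

With a 128 MB block size, a 300 MB file becomes three blocks (128 MB, 128 MB and 44 MB), so up to three nodes can work on it at the same time.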
History
Hadoop was inspired by Google's MapReduce, a software framework in which an application is broken down into numerous small parts. Any of those parts (also called fragments or blocks) can be run on any node in the cluster.
Doug Cutting, Hadoop's creator, named the framework after his child's stuffed toy elephant.
In 2002, Doug Cutting created an open source web crawler project (Nutch). Google published its GFS paper in 2003 and its MapReduce paper in 2004. In 2006, Doug Cutting developed the open source MapReduce and HDFS project. In 2008, Yahoo ran a 4,000-node Hadoop cluster and Hadoop won the terabyte sort benchmark. In 2009, Facebook launched SQL support for Hadoop.
Hadoop Eco-System
Pig
Apache Pig is a platform for analyzing large data sets. It consists of a high-level language (Pig Latin) for expressing data analysis programs. Introduced by Yahoo.
Hive
Apache Hive is data warehouse software used for querying and managing large data sets on a distributed cluster. Introduced by Facebook.
HBase
Apache HBase is a distributed, column-oriented database that runs on top of HDFS.
Sqoop
The name Sqoop is a combination of "SQL" and "Hadoop".
Sqoop is an import and export utility: a data transfer tool that brings data into Hadoop from relational systems and puts data back into an RDBMS for analysis with BI tools.
ZooKeeper
Apache ZooKeeper is a coordination service for distributed systems; it is fast and scalable.
Oozie
Oozie is a workflow engine that runs on a server; it provides job scheduling within a Hadoop cluster.
Flume
Flume is a service that lets you ingest data (typically file data) into HDFS. It is defined as a distributed, reliable and available service for moving large amounts of data as it is produced.
Ganesh L. Sanap
connectoganesh@gmail.com
