HADOOP AND ITS
ECOSYSTEM
BY: SUNERA PATHAN
CONTENTS
• History of Hadoop
• What Is Hadoop
• Hadoop Architecture
• Hadoop Services
• Hadoop Ecosystem
HDFS, MapReduce, HBase, Hive, Pig, ZooKeeper,
Flume, Sqoop, Oozie
• Advantage of Hadoop
• Disadvantage of Hadoop
• Use of Hadoop
• References
• Conclusion
History of Hadoop
• Hadoop was created by Doug Cutting, who had earlier created Apache Lucene
(a text search library). Hadoop originated in Apache Nutch (an open-source
search engine) and began as a subproject of Apache Lucene. Apache Nutch
itself was started in 2002 as a working crawler and search system.
• In January 2008, Hadoop was made its own top-level project at Apache,
confirming its success. By this time, Hadoop was being used by many
companies, such as Yahoo! and Facebook.
• In April 2008, Hadoop broke a world record to become the fastest system
to sort a terabyte of data.
• In a test run by Yahoo, the reported times to process 1 TB of data
(1024 columns) were:
Oracle – 3½ days
Teradata – 4½ days
Netezza – 2 hours 50 minutes
Hadoop – 3.4 minutes
WHAT IS HADOOP
• Hadoop is an Apache product. It is a distributed system and a
framework for big data.
• Apache Hadoop is an open-source software framework for storage and
large-scale processing of data-sets on clusters of commodity hardware.
• Some of the characteristics:
• Open source
• Distributed processing
• Distributed storage
• Reliable
• Economical
• Flexible
Hadoop Framework Modules
The base Apache Hadoop framework is composed of the
following modules:
• Hadoop Common: contains libraries and utilities needed
by other Hadoop modules
• Hadoop Distributed File System (HDFS): a distributed
file system that stores data on commodity machines,
providing very high aggregate bandwidth across the cluster
• Hadoop YARN: a resource-management platform
responsible for managing computing resources in clusters
and using them for scheduling of users' applications
• Hadoop MapReduce: an implementation of
the MapReduce programming model for large-scale data
processing
Framework Architecture
Hadoop Services
• Storage
1. HDFS (Hadoop Distributed File System)
a) Horizontal scalability with no fixed upper limit
(no limit on the maximum number of slave nodes)
b) Block size = 64 MB (old version),
128 MB (new version)
• Process
1. MapReduce (old model)
2. Spark (new model)
Hadoop Architecture
Hadoop consists of the Hadoop Common package, which
provides file system and OS level abstractions, a
MapReduce engine and the Hadoop Distributed File
System (HDFS). The Hadoop Common package contains the
necessary Java Archive (JAR) files and scripts needed to start
Hadoop.
Working Of Ecosystem
HDFS
• Hadoop Distributed File System (HDFS) is designed to
reliably store very large files across machines in a large
cluster. It is inspired by the Google File System.
• Large data files are split into blocks
• Blocks are managed by different nodes in the cluster
• Each block is replicated on multiple nodes
• The NameNode stores metadata about files and
blocks
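The block and replica behaviour described above can be sketched in a few lines of Python. This is only an illustration, not how HDFS is implemented: the function names, the node names, the replication factor of 3, and the round-robin placement are all assumptions made for the example (real HDFS placement is rack-aware). The returned placement dictionary plays the role of the NameNode's metadata.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB block size (the newer default)
REPLICATION = 3                 # assumed replication factor for this sketch


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the byte length of each block that a file of file_size bytes occupies."""
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block holds only the leftover bytes
    return sizes


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin.

    Round-robin is a simplification chosen for this sketch.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + i) % len(nodes)] for i in range(replication)
        ]
    return placement


if __name__ == "__main__":
    sizes = split_into_blocks(300 * 1024 * 1024)   # a 300 MB file
    nodes = ["node1", "node2", "node3", "node4"]   # hypothetical node names
    for block_id, holders in place_replicas(len(sizes), nodes).items():
        print(block_id, sizes[block_id], holders)
```

With these assumptions, a 300 MB file occupies two full 128 MB blocks plus one 44 MB block, and losing any single node still leaves two copies of every block.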
MAPREDUCE
• The Mapper:-
1. Each block is processed in isolation by a map task, called a
mapper
2. The map task runs on the node where the block is stored
• The Reducer:-
1. Consolidates the results from different mappers
2. Produces the final output
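A word count is the usual way to illustrate the mapper and reducer roles. The sketch below runs in a single Python process and only imitates the model: the function names and the explicit shuffle step are choices made for this example, and a real Hadoop job would distribute the map tasks to the nodes holding each block.

```python
from collections import defaultdict


def mapper(block):
    """Map task: emit a (word, 1) pair for every word in one block of text."""
    return [(word.lower(), 1) for word in block.split()]


def shuffle(mapped_outputs):
    """Group the values emitted by all mappers by key."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reducer(key, values):
    """Reduce task: consolidate all counts emitted for one word."""
    return key, sum(values)


def word_count(blocks):
    mapped = [mapper(block) for block in blocks]  # each block handled in isolation
    grouped = shuffle(mapped)
    return dict(reducer(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    blocks = ["hadoop stores data", "hadoop processes data"]
    print(word_count(blocks))
```

Because each mapper sees only its own block, the map phase can run in parallel; only the grouped intermediate pairs have to move between nodes.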
HBASE
• Hadoop database for random read/write access
• Features of HBase:-
1. A type of NoSQL database
2. Strongly consistent reads and writes
3. Automatic sharding
4. Automatic RegionServer failover
5. Hadoop/HDFS integration
6. HBase supports massively parallelized processing via
MapReduce, with HBase usable as both source and sink.
7. HBase supports an easy-to-use Java API for programmatic
access.
8. HBase also supports Thrift and REST for non-Java front-ends.
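Sharding by row-key range can be pictured with a toy lookup table. The class below is not the HBase API; RegionTable, its methods, the split keys, and the server names are all hypothetical. It also uses fixed split points, whereas HBase splits regions automatically as they grow.

```python
import bisect


class RegionTable:
    """Toy model of row-key range sharding (not the HBase API)."""

    def __init__(self, split_keys, servers):
        # split_keys are sorted boundaries; N boundaries produce N + 1 regions
        if len(servers) != len(split_keys) + 1:
            raise ValueError("need exactly one server per region")
        self.split_keys = list(split_keys)
        self.servers = list(servers)
        self.rows = [dict() for _ in servers]

    def region_for(self, row_key):
        # bisect_right sends a key equal to a boundary to the region above it
        return bisect.bisect_right(self.split_keys, row_key)

    def server_for(self, row_key):
        return self.servers[self.region_for(row_key)]

    def put(self, row_key, value):
        self.rows[self.region_for(row_key)][row_key] = value

    def get(self, row_key):
        return self.rows[self.region_for(row_key)].get(row_key)


if __name__ == "__main__":
    table = RegionTable(["g", "p"], ["rs1", "rs2", "rs3"])  # hypothetical servers
    table.put("apple", 1)
    table.put("grape", 2)
    table.put("zebra", 3)
    for key in ("apple", "grape", "zebra"):
        print(key, table.server_for(key), table.get(key))
```

Keeping rows sorted by key and routing by range is what makes random reads and writes cheap: a lookup touches only the one region that owns the key.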
HIVE
• SQL-like queries and tables on large datasets
• Features of Hive:-
1. An SQL-like interface to Hadoop.
2. Data warehouse infrastructure built on top of Hadoop
3. Provides data summarization, query, and analysis
4. Query execution via MapReduce
5. The Hive interpreter converts each query into MapReduce jobs.
6. Open source project.
7. Originally developed by Facebook
8. Also used by Netflix, CNET, Digg, eHarmony, etc.
PIG
• Data flow language and compiler
• Features of Pig:-
1. A scripting platform for processing and analyzing large
data sets
2. Apache Pig allows users to write complex MapReduce programs
using a simple scripting language.
3. High-level language: Pig Latin
4. Pig Latin is a data flow language.
5. Pig translates Pig Latin scripts into MapReduce jobs that execute
within Hadoop.
6. Open source project
7. Originally developed at Yahoo
ZOOKEEPER
• Coordination service for distributed applications
• Features of ZooKeeper:-
1. Because coordinating distributed systems is a zoo.
2. ZooKeeper is a centralized service for maintaining
configuration information, naming, providing distributed
synchronization, and providing group services.
FLUME
• Configurable streaming data collection
• Apache Flume is a distributed, reliable, and available service for efficiently
collecting, aggregating, and moving large amounts of streaming data into
the Hadoop Distributed File System (HDFS).
SQOOP
• Integration of databases and data warehouses with Hadoop
• Features of Sqoop:-
1. Command-line interface for transferring data between relational
databases and Hadoop
2. Supports incremental imports
3. Imports are used to populate tables in Hadoop
4. Exports are used to put data from Hadoop into a relational database such as
SQL Server
[Figure: Sqoop transfers data between Hadoop and an RDBMS]
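The idea behind an incremental import can be sketched without Sqoop itself: remember the highest value of a check column seen so far, and on the next run fetch only rows above it. In the example below an in-memory SQLite database stands in for the relational source; the function name, the orders table, and its columns are hypothetical, and this is an illustration of the idea rather than Sqoop's implementation.

```python
import sqlite3


def incremental_import(conn, table, check_column, last_value):
    """Fetch rows whose check_column exceeds last_value.

    Returns the new rows and the updated high-water mark. The table and
    column names are interpolated into the SQL, so they must be trusted.
    """
    cursor = conn.execute(
        f"SELECT * FROM {table} WHERE {check_column} > ? ORDER BY {check_column}",
        (last_value,),
    )
    columns = [description[0] for description in cursor.description]
    rows = cursor.fetchall()
    if rows:
        last_value = rows[-1][columns.index(check_column)]
    return rows, last_value


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # stands in for the relational source
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
    conn.executemany("INSERT INTO orders (item) VALUES (?)", [("disk",), ("cpu",)])

    rows, mark = incremental_import(conn, "orders", "id", 0)
    print(rows, mark)  # first run picks up both existing rows

    conn.execute("INSERT INTO orders (item) VALUES ('ram')")
    rows, mark = incremental_import(conn, "orders", "id", mark)
    print(rows, mark)  # second run picks up only the newly added row
```

Tracking a single high-water mark works for append-only tables; rows that are updated in place would need a last-modified timestamp column instead.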
OOZIE
• Used to design and schedule workflows
• Oozie is a workflow scheduler in which workflows are
expressed as Directed Acyclic Graphs. Oozie runs in a
Java servlet container (Tomcat) and makes use of a
database to store all the running workflow instances, their
states and variables, along with the workflow definitions, to
manage Hadoop jobs (MapReduce, Sqoop, Pig and
Hive). The workflows in Oozie are executed based on data
and time dependencies.
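Running a workflow expressed as a Directed Acyclic Graph comes down to executing each action only after everything it depends on has finished. The sketch below uses Python's standard graphlib module (Python 3.9 or later). The action names and the run_workflow helper are hypothetical, and the data and time triggers that Oozie supports are not modelled here.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each action maps to the set of actions it depends on.
workflow = {
    "sqoop_import": set(),
    "pig_clean": {"sqoop_import"},
    "hive_report": {"pig_clean"},
    "mapreduce_index": {"pig_clean"},
    "sqoop_export": {"hive_report", "mapreduce_index"},
}


def run_workflow(dag, run_action):
    """Run every action after all of its dependencies; return the run order."""
    order = []
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    while sorter.is_active():
        for action in sorted(sorter.get_ready()):  # sorted for a stable order
            run_action(action)
            order.append(action)
            sorter.done(action)
    return order


if __name__ == "__main__":
    print(run_workflow(workflow, lambda name: None))
```

Actions returned together by get_ready() have no dependency on each other, so a real scheduler could launch them in parallel; here they run one after another for simplicity.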
Hadoop Advantages
• Practically unlimited data storage
1. Server scaling modes
a) Vertical scaling
b) Horizontal scaling
• High-speed processing system
• Processes all varieties of data
1. Structured
2. Unstructured
3. Semi-structured
Disadvantage of Hadoop
• If the data volume is small, Hadoop performs poorly
• Limits on Hadoop data storage
There is obviously a practical limit. In principle, though,
HDFS block IDs are Java longs, so there can be at most 2^63 of them,
and with a block size of 64 MB the maximum capacity is
512 yottabytes.
• Hadoop should be used only for batch processing
1. Batch process:- a background process
with which the user cannot interact
• Hadoop is not used for OLTP
– OLTP process:- interactive with users
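The storage figure quoted above can be checked with a few lines of arithmetic. The sketch assumes binary units, so 64 MB is 2^26 bytes and one yottabyte is taken as 2^80 bytes; the constant and function names are chosen for this example.

```python
MAX_BLOCK_IDS = 2 ** 63    # block IDs are signed 64-bit Java longs
BLOCK_SIZE = 64 * 2 ** 20  # 64 MB block size, in bytes
YOTTABYTE = 2 ** 80        # binary yottabyte, in bytes (assumed unit)


def max_capacity_yottabytes(block_size=BLOCK_SIZE):
    """Upper bound on addressable storage, in binary yottabytes."""
    return MAX_BLOCK_IDS * block_size // YOTTABYTE


print(max_capacity_yottabytes())               # 512
print(max_capacity_yottabytes(128 * 2 ** 20))  # 1024
```

That is 2^63 × 2^26 = 2^89 bytes, or 2^9 = 512 binary yottabytes. Doubling the block size to 128 MB doubles the bound.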
Conclusion
Hadoop is a scalable, fault-tolerant distributed system for storing
and processing huge amounts of data with great speed
and ease of maintenance.
References
• http://training.cloudera.com/essentials.pdf
• http://en.wikipedia.org/wiki/Apache_Hadoop
• http://practicalanalytics.wordpress.com/2011/11/06/explaining-hadoop-to-management-whats-the-big-data-deal/
• https://developer.yahoo.com/hadoop/tutorial/module1.html
• http://hadoop.apache.org/
• http://wiki.apache.org/hadoop/FrontPage
