BIRLA VISHVAKARMA
MAHAVIDHYALAYA
TOPIC : DATABASE MANAGEMENT SYSTEM
SUBMITTED TO : Mrs. Bijal Dalwadi
SUBMITTED BY : Nikita Sure (140080116025)
Jimmy Chopda (140080116013)
Vruti Tankaria (140080116057)
Meshwa Patel (140080116035)
PART 1
What is Hadoop?
History of Hadoop.
Why Hadoop?
Where is Hadoop used?
What is Hadoop?
Apache Hadoop is an open-source
software framework written in Java for
distributed storage and distributed
processing of very large data sets on
computer clusters built from commodity
hardware.
Hadoop was created by Doug
Cutting and Mike Cafarella in 2005.
Cutting, who was working at Yahoo!
at that time, named it after his
son’s toy elephant.
It was originally developed to
support distribution for the Nutch
search engine project.
Its latest release, version 2.7.1,
came out on July 6, 2015.
Doug Cutting
Hadoop - Why ?
The complexity of modern analytics needs is
outstripping the available computing power
of legacy systems. With its distributed
processing, Hadoop can handle large
volumes of structured and unstructured data
more efficiently than the traditional
enterprise data warehouse. Because Hadoop
is open source and can run on commodity
hardware, the initial cost savings are
dramatic and continue to grow as your
organizational data grows.
 Smart meters are deployed in homes worldwide to help consumers and utility
companies manage the use of water, electricity, and gas better. Historically,
meter readers would walk from house to house recording meter readouts and
reporting them to the utility company for billing purposes. Because of the labor
costs, many utilities switched from monthly readings to quarterly. This delayed
revenue and made it impossible to analyze residential usage in any detail.
Consider a fictional company called CostCutter Utilities that serves 10 million
households. Once a quarter, they gathered 10 million readings to produce
utility bills. With government regulation and the price of oil skyrocketing,
CostCutter started deploying smart meters so they could get hourly readings of
electricity usage. They now collect 21.6 billion sensor readings per quarter
from the smart meters. Analysis of the meter data over months and years can
be correlated with energy saving campaigns, weather patterns, and local
events, providing savings insights both for consumers and CostCutter Utilities.
When consumers are offered a billing plan that has cheaper electricity from 8
p.m. to 5 a.m., they demand five-minute intervals in their smart meter reports
so they can identify high-use activity in their homes. At five-minute intervals,
the smart meters are collecting more than 100 billion meter readings every 90
days, and CostCutter Utilities now has a big data problem. Their data volume
exceeds their ability to process it with existing software and hardware. So
CostCutter Utilities turns to Hadoop to handle the incoming meter readings.
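The reading volumes quoted above can be sanity-checked with a quick back-of-the-envelope calculation (a small sketch, assuming a 90-day quarter):

```python
# Back-of-the-envelope check of CostCutter's meter-reading volumes.
households = 10_000_000
days_per_quarter = 90

# Hourly readings: 24 readings per household per day.
hourly = households * 24 * days_per_quarter
print(hourly)        # 21,600,000,000 -> the 21.6 billion figure

# Five-minute readings: 12 readings per household per hour.
five_minute = households * 12 * 24 * days_per_quarter
print(five_minute)   # 259,200,000,000 -> well over 100 billion
```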
The base Apache Hadoop framework is composed
of the following modules:
1. Hadoop Common – contains libraries and
utilities needed by other Hadoop modules
2. Hadoop Distributed File System (HDFS) – a
distributed file-system that stores data on
commodity machines
3. Hadoop YARN – a resource-management
platform responsible for managing computing
resources in clusters and using them for
scheduling of users' applications
4. Hadoop MapReduce – a programming model for
large-scale data processing
PART 2
 MapReduce.
 HDFS.
 Introduction to Hadoop architecture.
What is MapReduce?
 MapReduce is a processing technique and
a programming model for distributed
computing based on Java.
 The MapReduce algorithm contains two
important tasks, namely Map and Reduce.
 Map takes a set of data and converts it
into another set of data, where individual
elements are broken down into tuples.
 Reduce takes the output of a Map as its
input and combines those tuples into a
smaller set of tuples.
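The Map and Reduce phases described above can be sketched with a toy word count in plain Python. This is a conceptual illustration of the model only, not Hadoop's actual Java API:

```python
from collections import defaultdict

# Toy word-count sketch of the MapReduce model (not Hadoop's Java API).

def map_phase(document):
    # Map: break the input into (key, value) tuples -- here, (word, 1).
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    # Shuffle: group values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Reduce: combine each group into a smaller set of tuples.
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(map_phase("the quick fox saw the dog"))
print(counts)   # {'the': 2, 'quick': 1, 'fox': 1, 'saw': 1, 'dog': 1}
```

In real Hadoop the map and reduce functions run in parallel on many nodes, with the framework handling the shuffle between them.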
What is HDFS?
 HDFS is a Java-based file system that provides
scalable and reliable data storage, and it was
designed to span large clusters of commodity
servers. HDFS has demonstrated production
scalability of up to 200 PB of storage and a single
cluster of 4500 servers, supporting close to a billion
files and blocks.
 Hadoop can work directly with any mountable
distributed file system, but the most common file
system used by Hadoop is the HDFS. It is a fault-
tolerant distributed file system that is designed for
commonly available hardware. It is well-suited for
large data sets due to its high throughput access to
application data.
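How HDFS spreads a large file over commodity servers can be sketched with a little arithmetic, assuming the common defaults of a 128 MB block size and a replication factor of 3:

```python
import math

# Rough sketch of how HDFS lays out a file, assuming the common
# defaults of a 128 MB block size and a replication factor of 3.
BLOCK_SIZE_MB = 128
REPLICATION = 3

def hdfs_footprint(file_size_mb):
    # The file is split into fixed-size blocks...
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    # ...and every block is stored on REPLICATION different DataNodes.
    stored_mb = file_size_mb * REPLICATION
    return blocks, stored_mb

blocks, stored = hdfs_footprint(1000)   # a 1 GB file
print(blocks, stored)                   # 8 blocks, 3000 MB of raw storage
```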
HADOOP ARCHITECTURE
 "Hadoop employs a master/slave
architecture for both distributed storage
and distributed computation." In
distributed storage, the NameNode is the
master and the DataNodes are the slaves.
In distributed computation, the
JobTracker is the master and the
TaskTrackers are the slaves.
MASTERS
1. NAMENODE
 The NameNode is the HEART of an HDFS file
system. It keeps the directory tree of all files in
the file system, and tracks where across the
cluster the file data is kept. It does not store the
data of these files itself.
 When the NameNode goes down, the file system
goes offline.
2. JOBTRACKER
 The JobTracker is the service within Hadoop that
farms out MapReduce tasks to specific nodes in
the cluster, ideally the nodes that have the data,
or at least are in the same rack.
 The JobTracker is a single point of failure
for the Hadoop MapReduce service. If it goes
down, all running jobs are halted.
SLAVES
1. DATANODE
 A DataNode stores data in the Hadoop
Distributed File System. A functional file
system has more than one DataNode, with data
replicated across them.
2. TASKTRACKER
 A TaskTracker is a node in the cluster that
accepts tasks - Map, Reduce, and Shuffle
operations - from the JobTracker.
PART 3
 How the Hadoop architecture works.
 Reading files from HDFS.
 Writing files to HDFS.
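The read and write paths follow the master/slave split described earlier: a client asks the NameNode (master) where the blocks of a file live, then moves the data itself to or from the DataNodes (slaves). A toy Python simulation of that flow, with made-up node names and no real Hadoop APIs:

```python
import itertools

# Toy simulation of the HDFS write and read paths (no real Hadoop APIs).
# The NameNode (master) holds only metadata; DataNodes (slaves) hold bytes.
namenode = {}                       # filename -> [(block_id, [node, ...])]
datanodes = {"dn1": {}, "dn2": {}, "dn3": {}}
node_cycle = itertools.cycle(datanodes)

def write_file(filename, blocks, replication=2):
    # Write path: the client streams each block to `replication` DataNodes,
    # and the NameNode records which nodes hold which block.
    entries = []
    for block_id, data in enumerate(blocks):
        targets = [next(node_cycle) for _ in range(replication)]
        for node in targets:
            datanodes[node][(filename, block_id)] = data
        entries.append((block_id, targets))
    namenode[filename] = entries

def read_file(filename):
    # Read path: ask the NameNode for block locations, then fetch each
    # block from one of the DataNodes that holds it.
    return "".join(datanodes[nodes[0]][(filename, block_id)]
                   for block_id, nodes in namenode[filename])

write_file("report.txt", ["hello ", "world"])
print(read_file("report.txt"))   # hello world
```

Note how the NameNode never touches file contents, which is why its metadata fits in memory even for very large clusters, and why losing it takes the whole file system offline.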
PART 4
 Advantages of Hadoop.
 Disadvantages of Hadoop.
 Where is it used?
 Subprojects of Hadoop.
Hadoop-Related Subprojects
 Pig
◦ High-level language for data analysis
 HBase
◦ Table storage for semi-structured data
 Zookeeper
◦ Coordinating distributed applications
 Hive
◦ SQL-like query language and metastore
 Mahout
◦ Machine learning
Etc….
Hadoop advantages
1. Scalable
Hadoop is a highly scalable storage platform, because it can
store and distribute very large data sets across hundreds of
inexpensive servers that operate in parallel.
2. Cost-effective
Hadoop is designed as a scale-out architecture that can
affordably store all of a company’s data for later use. The cost
savings are staggering: instead of costing thousands to tens of
thousands of pounds per terabyte, Hadoop offers computing
and storage capabilities for hundreds of pounds per terabyte.
3. Flexible
 Hadoop can be used for a wide variety of purposes, such as
log processing, recommendation systems, data
warehousing, market campaign analysis and fraud
detection.
4. Fast
 Hadoop’s unique storage method is based on a distributed
file system that basically ‘maps’ data wherever it is located
on a cluster. The tools for data processing are often on the
same servers where the data is located, resulting in much
faster data processing. If you’re dealing with large volumes
of unstructured data, Hadoop is able to efficiently process
terabytes of data in just minutes, and petabytes in hours.
5. Resilient to failure
 A key advantage of using Hadoop is its fault tolerance.
When data is sent to an individual node, that data is also
replicated to other nodes in the cluster, which means that
in the event of failure, there is another copy available for
use.
Problems
 Coding is tedious
 Want to change your data? In a database,
a single SQL UPDATE does it. Hadoop does
not support in-place updates.
 Hadoop stores data in files, and does not
index them. If you want to find something,
you have to run a MapReduce job going
through all the data.
 Hadoop shines where the data is too
big for a traditional database!
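The indexing point above can be made concrete with a small sketch: finding one record in Hadoop-style flat files means scanning everything (which is what a MapReduce job does), while a database index turns the same search into a single lookup. The record format here is invented for illustration:

```python
# Sketch of why unindexed storage hurts: flat files require a full scan,
# while a pre-built index answers the same query with one lookup.

records = [f"user{i},city{i % 50}" for i in range(100_000)]

# Hadoop-style: scan every record (what a MapReduce job would do).
scan_hits = [r for r in records if r.startswith("user99999,")]

# Database-style: build an index once, then answer queries instantly.
index = {r.split(",")[0]: r for r in records}
lookup_hit = index["user99999"]

print(scan_hits[0] == lookup_hit)   # True
```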
What do we want?
 Guaranteed data processing
 Fault-tolerance
 No intermediate message brokers!
 Higher level abstraction than message
passing
 “Just works” !!
Who uses Hadoop?
Hadoop is in use at many organizations that
handle big data:
 Amazon/A9
 Facebook
 Google
 IBM
 New York Times
 PowerSet
 Yahoo!
ANY QUESTIONS?