Big Data and Hadoop
 What is Big Data
 How the 3Vs define Big Data
 Hadoop and its ecosystem
 HDFS
 MapReduce and YARN
 Career in Big Data and Hadoop
What is BIG DATA ???
o Order details for a store
o All orders across 100s of stores
o A person’s stock portfolio
o All stock transactions for a stock exchange
 It’s data that is created very fast and is too big to be processed on a single machine. This data comes from various sources in various formats.
How 3Vs define Big Data ???
1. Volume
 Volume is the size of the data, which determines the value and potential of the data under consideration. The name ‘Big Data’ itself contains a term related to size, hence this characteristic.
2. Variety
 Data today comes in all types of formats: structured, numeric data in traditional databases; unstructured text documents, email, stock ticker data and financial transactions; and semi-structured data too.
3. Velocity
 Velocity is the speed at which data is generated and processed to meet the demands and challenges that lie ahead on the path of growth and development.
 SUMMARY
 Veracity (added well after the 3Vs, as the next big wave of innovation)
 The quality of the data being captured can vary greatly. Accuracy of analysis depends on the veracity of the source data.
What is HADOOP ???
“Hadoop” was the name of a yellow toy elephant owned by the son of one of its inventors.
Hadoop is an open-source software framework for storing and processing big data in a distributed fashion on large clusters of commodity hardware. Essentially, it accomplishes two tasks: massive data storage and faster processing.
•Open-source software. Open source software differs from commercial software due to the broad and open network of developers that create and manage the programs.
•Framework. In this case, it means everything you need to develop and run your software applications is provided – programs, tool sets, connections, etc.
•Distributed. Data is divided and stored across multiple computers, and computations can be run in parallel across multiple connected machines.
•Massive storage. The Hadoop framework can store huge amounts of data by breaking the data into blocks and storing it on clusters of lower-cost commodity hardware.
•Faster processing. How? Hadoop processes large amounts of data in parallel across clusters of tightly connected low-cost computers for quick results.
 Low cost. The open-source framework is free and uses commodity hardware to store large quantities of data.
 Computing power. Its distributed computing model can quickly process very large volumes of data.
 Scalability. You can easily grow your system simply by adding more nodes, with little administration.
 Storage flexibility. Unlike traditional relational databases, you don’t have to pre-process data before storing it. You can store as much data as you want.
 Inherent data protection. Data and application processing are protected against hardware failure.
 Self-healing capabilities. If a node goes down, jobs are automatically redirected to other nodes so the distributed computation does not fail, and multiple copies of all data are stored automatically.
 What’s in Hadoop ???
 HDFS – the Java-based distributed file system that can store all kinds of data
without prior organization.
 MapReduce – a software programming model for processing large sets of
data in parallel.
 YARN – a resource management framework for scheduling and handling
resource requests from distributed applications.
 Hadoop Ecosystem
 Basically, HDFS and MapReduce are the two core components of the Hadoop Ecosystem and are at the heart of the Hadoop framework.
 Some of the other Apache projects built around the Hadoop framework are also part of the Hadoop Ecosystem.
HDFS (Hadoop Distributed File System)
o HDFS enables Hadoop to store huge files. It’s a scalable file system
that distributes and stores data across all machines in a Hadoop cluster.
 Scale-Out Architecture - Add servers to increase capacity
 High Availability - Serve mission-critical workflows and applications
 Fault Tolerance - Automatically and seamlessly recover from failures
 Load Balancing - Place data intelligently for maximum efficiency and utilization
 Tunable Replication - Multiple copies of each file provide data protection and
computational performance
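As a concrete illustration, here is a minimal sketch of storing a file in HDFS through the Java FileSystem API. The NameNode URI and both paths are hypothetical, and a running cluster with the Hadoop client libraries on the classpath is assumed.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPutExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; normally picked up from core-site.xml.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        // Copy a local file into the cluster; HDFS splits it into blocks
        // and spreads them across DataNodes behind the scenes.
        fs.copyFromLocalFile(new Path("/tmp/orders.txt"),
                             new Path("/data/orders.txt"));
        fs.close();
    }
}
```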
 NameNode and DataNode
[Figure: a 150 MB text file split into HDFS blocks of 64 MB, 64 MB, and 22 MB]
 When a file (say a 150 MB text file) is uploaded to HDFS, it is split into blocks, and each block is stored on a node in the Hadoop cluster. With a 64 MB block size, 150 MB yields three blocks: 64 MB + 64 MB + 22 MB.
 NameNode - It runs on a master node that tracks and directs the storage of the cluster. The NameNode, handled by a separate machine, knows which blocks make up the original 150 MB file and where they are stored. The information stored here is called metadata.
 DataNode - A piece of software called the DataNode runs on each slave node; slave nodes make up the majority of the machines in a cluster. The NameNode places the data blocks onto these DataNodes.
[Figure: a cluster with one NameNode and several DataNodes (DN)]
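To see the NameNode’s metadata at work, the hedged sketch below asks HDFS where the blocks of a file live. The file path is hypothetical (it reuses the file from the earlier sketch).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/data/orders.txt"));

        // The NameNode answers this query from its metadata: for each block,
        // which DataNodes hold a replica.
        BlockLocation[] blocks =
            fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                block.getOffset(), block.getLength(),
                String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```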
 HOW HDFS WORKS ???
[Figure: a NameNode and three DataNodes]
Which of these is a problem if it occurs?
o Network failure between the nodes
o Disk failure on a DataNode
o Not all DataNodes are used
o Differing block sizes across DataNodes
o Disk failure on the NameNode
 We may lose some DataNodes and hence lose some amount of data, say 64 MB out of the 150 MB text file.
 We may also have a hardware problem on the NameNode and lose it too.
 HOW HDFS WORKS continued….???
Solution to the problem (DataNode lost)
o Replication Factor (RF) - The number of copies of a file is called the replication factor of that file. This information is stored by the NameNode.
 Hadoop replicates each block of a file three times as it stores the file in HDFS (RF = 3 by default).
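The replication factor can be checked, and changed per file, through the same FileSystem API. A minimal sketch, again assuming the hypothetical file from earlier:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/orders.txt");

        // Read the current replication factor from the file's metadata.
        short rf = fs.getFileStatus(file).getReplication();
        System.out.println("Current RF: " + rf);

        // Ask the NameNode to keep an extra copy of every block of this file.
        fs.setReplication(file, (short) 4);
        fs.close();
    }
}
```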
 HOW HDFS WORKS continued….???
Solution to the problem (NameNode lost)
• Earlier, for a long time, when the NameNode (and the metadata stored inside it) was lost, the entire cluster became inaccessible; now there are two techniques by which we can keep the metadata safe.
 NFS (Network File System) - The metadata is stored not only on the NameNode’s own hard drive but also on NFS, a method of mounting a remote disk. That way, if the NameNode and its metadata are lost, we still have a copy of the metadata elsewhere on the network.
 Even more effective, nowadays two NameNodes are configured:
 NameNode (Active) - works under normal conditions
 NameNode (Standby) - takes over if the active NameNode fails
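For reference, the active/standby setup is driven by configuration. The sketch below shows the relevant hdfs-site.xml properties set programmatically; the nameservice name and hosts are hypothetical placeholders, not a working cluster config.

```java
import org.apache.hadoop.conf.Configuration;

public class HaConfigSketch {
    public static Configuration haConf() {
        Configuration conf = new Configuration();
        // One logical nameservice backed by two NameNodes (hypothetical names).
        conf.set("fs.defaultFS", "hdfs://mycluster");
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "namenode1:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "namenode2:8020");
        // Client-side logic that fails over from the active to the standby.
        conf.set("dfs.client.failover.proxy.provider.mycluster",
            "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
        return conf;
    }
}
```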
MapReduce
 MapReduce is a programming model and an associated implementation for processing
and generating large data sets with a parallel, distributed algorithm on a cluster.
Scale-out Architecture - Add servers to increase processing power
Security & Authentication - Works with HDFS security to make sure that only approved users can
operate against the data in the system
Resource Manager - Employs data locality and server resources to determine optimal computing
operations
Optimized Scheduling - Completes jobs according to prioritization
Flexibility - Procedures can be written in virtually any programming language
Resiliency & High Availability - Multiple job and task trackers ensure that jobs fail independently
and restart automatically
 Why MapReduce ???
 Processing data serially, i.e. from top to bottom, could take a long time.
 Historically we would probably use an associative array or hash table, but these can lead to serious problems.
 As the hash size grows, heap pressure becomes more of an issue.
Say we are processing 1 TB of data; what issues may occur? (See the sketch after this list.)
o It won’t work.
o We may run out of memory.
o Data processing may take a long time.
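To make the problem concrete, here is a hedged sketch of the naive single-machine approach: one hash table holding every distinct key in memory. On terabytes of varied input, the table alone can outgrow the heap.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.Map;

public class NaiveWordCount {
    public static void main(String[] args) throws Exception {
        // Every distinct word becomes a heap-resident entry; with 1 TB of
        // varied input this map alone can exhaust memory (OutOfMemoryError),
        // and the single serial pass over the data is slow besides.
        Map<String, Long> counts = new HashMap<>();
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = in.readLine()) != null) {
                for (String word : line.split("\\s+")) {
                    if (!word.isEmpty()) {
                        counts.merge(word, 1L, Long::sum);
                    }
                }
            }
        }
        counts.forEach((w, c) -> System.out.println(w + "\t" + c));
    }
}
```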
 how MapReduce works ???
Solution to the problem: MapReduce divides workloads up into multiple tasks that can be executed in parallel.
 MapReduce applications typically implement the Mapper and Reducer interfaces to provide the map and reduce methods. These form the core of the job.
 Mappers and Reducers
Mappers
 Mappers are the individual tasks that transform input records into intermediate records.
 These are just small programs that deal with a relatively small amount of data and work in parallel.
 The outputs obtained are called intermediate records.
 A Mapper maps input key/value pairs to a set of intermediate key/value pairs.
 Once mapping is done, a phase of MapReduce called shuffle and sort takes place on the intermediate data.
 Shuffle is the movement of intermediate records from mappers to reducers.
 Sort means that reducers organize these records in sorted order.
Reducers
 A Reducer reduces a set of intermediate values which share a key to a smaller set of values.
 It works on one set of records at a time: it gets a key and the list of all its values, and then writes the final result. (A condensed WordCount sketch follows this section.)
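The canonical WordCount example from the Hadoop documentation condenses all of this into a few lines. Here is a lightly trimmed version (package declaration and job setup omitted):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: turns each input line into intermediate (word, 1) records.
class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);   // emit an intermediate key/value pair
        }
    }
}

// Reducer: after shuffle and sort, receives one word with all its counts
// and writes the final total.
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
```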
YARN (Yet Another Resource Negotiator)
 YARN is the architectural centre of Hadoop that allows multiple data processing engines, such as interactive SQL, real-time streaming, data science and batch processing, to handle data stored in a single platform, unlocking an entirely new approach to analytics.
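A standard driver ties the pieces together and submits the job; on a YARN cluster the same submission API works while YARN schedules the map and reduce tasks across the nodes. This sketch assumes the Mapper and Reducer classes from the previous example, and the input and output paths are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // optional local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/data/orders.txt"));
        FileOutputFormat.setOutputPath(job, new Path("/data/wordcount-out"));
        // Blocks until the job finishes; YARN manages the cluster's resources.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```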
 Career in Big Data and Hadoop