Big Data Analysis Using
Hadoop Cluster
By:
Syed Furqan Haider Shah #176
Introduction
What is
BIG DATA
The term big data describes a massive volume of both
structured and unstructured data that is so large that it
is difficult to process using traditional database and
software techniques.
BIG DATA(contd.)
• Big data consists of a heterogeneous mixture of structured and
unstructured data.
• Big data refers to datasets whose size is beyond the ability of
typical database software tools to capture, store, manage, process
and analyze.
Challenges
• The volume of recorded data keeps growing, and it grows
very fast.
• As the data grows, processing such a large data set and
extracting meaningful information from it becomes a
tedious task.
• If the data is generated in various formats, processing it
poses new challenges.
Challenges(contd.)
• One issue is that big data is often stored in NoSQL systems, which
typically have no standard data definition language.
• Also, web-scale data is heterogeneous rather than uniform. For
analysis of big data, database integration and cleaning are much
harder than in traditional data mining.
Solution
• Parallel computing programming
• Distributed storage: an efficient computing platform does not rely
on centralized data storage; instead, the data is distributed across
large-scale storage.
• Restricting access to the data
HADOOP
HADOOP
Hadoop is a framework that operates on a distributed
file system. In this architecture, all the Data Nodes
work in parallel, but the work within a single Data Node
is still done sequentially.
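The "parallel across nodes, sequential within a node" idea can be illustrated with a toy sketch. This is not Hadoop code: threads stand in for Data Nodes, and the names `process_partition` and `process_in_parallel` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def process_partition(records):
    """Stand-in for one Data Node: scan its local records one at a time."""
    total = 0
    for value in records:  # sequential work inside a single node
        total += value
    return total


def process_in_parallel(partitions):
    """Run one worker per partition, then combine the per-node results."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partial_sums = list(pool.map(process_partition, partitions))
    return partial_sums, sum(partial_sums)


# Assumed example data: three partitions held by three different nodes.
partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
partial_sums, total = process_in_parallel(partitions)
print(partial_sums, total)
```

Each worker sees only its own partition, so adding partitions adds parallelism without changing the per-node logic.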
HADOOP Architecture
•Hadoop is an open-source Apache Software Foundation project:
a software platform for scalable, distributed computing.
•The Apache Hadoop software library is a framework that allows
for the distributed processing of large data sets across
clusters of computers using simple programming models.
HADOOP Architecture(contd.)
•Hadoop provides fast and reliable analysis of both
structured and unstructured data.
•It is designed to scale up from single servers to thousands of
machines, each offering local computation and storage.
•Hadoop uses the MapReduce programming model to mine
data.
• A MapReduce job splits the input dataset into independent subsets,
which are processed in parallel by map tasks.
• Map() procedure that performs filtering and sorting
• Reduce() procedure that performs a summary operation
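As a rough, in-memory illustration of the Map() and Reduce() roles described above, the sketch below counts words. It runs in a single process and does not use Hadoop; the function names (`map_words`, `shuffle`, `reduce_counts`, `run_job`) are hypothetical.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in the line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key in sorted key order, as happens between the phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_counts(key, values):
    """Reduce step: summarize all values seen for one key."""
    return key, sum(values)


def run_job(lines):
    """Apply map to every line, group by key, then reduce each group."""
    pairs = [pair for line in lines for pair in map_words(line)]
    return dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())


result = run_job(["big data", "Big clusters", "data data"])
print(result)
```

In a real cluster the map calls would run on different machines and the grouping step would move data over the network; here everything is a plain list so the data flow stays visible.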
METHODOLOGY
Methodology
Hadoop’s library is designed to deliver a highly available service on
top of a cluster of computers. A Hadoop cluster as a whole can be seen
as consisting of:
1. Core Hadoop
2. Hadoop Ecosystem
Relationship between Core Hadoop and Hadoop
Ecosystem
Core Hadoop consists of:
• HDFS
• MapReduce
Since the project began, a lot of other software has grown up
around it. This is called the Hadoop Ecosystem.
HDFS (Hadoop Distributed File System)
• An HDFS instance may consist of a large number of server machines,
each storing a part of the file system data.
• Detection of faults and quick automatic recovery from them is a core
architectural objective of HDFS.
• Applications that run on HDFS need streaming access to their datasets.
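The first two points can be sketched as a toy model: a file is cut into blocks, each block is copied to more than one node, and the data survives the loss of a node. This is an illustration only, not the HDFS placement policy; the block size, node names and round-robin placement are assumptions made for readability.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes in round-robin order.

    Assumes replication <= len(nodes); otherwise replicas would repeat a node.
    """
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


def readable_blocks(placement, failed_node):
    """Return the blocks that still have at least one replica after a failure."""
    return [
        block_id
        for block_id, holders in placement.items()
        if any(node != failed_node for node in holders)
    ]


# Hypothetical tiny file and cluster, chosen so the result is easy to follow.
blocks = split_into_blocks(b"abcdefghij", block_size=4)
placement = place_blocks(len(blocks), ["node1", "node2", "node3"], replication=2)
survivors = readable_blocks(placement, failed_node="node2")
print(placement, survivors)
```

With two replicas per block, losing any single node still leaves every block readable, which is the property that makes automatic recovery possible.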
MapReduce
MapReduce is the basic logic flow of task execution. It
consists mainly of Mappers and Reducers.
Mappers:
Mappers extract the required raw information from the whole
dataset. For example, in one case the Mapper extracts the date of sale,
product name, selling price and cost price of various products.
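A minimal sketch of such a Mapper is shown below. The record layout (order id, date, product, selling price, cost price, store) and the name `sales_mapper` are assumptions made for illustration; the slides do not specify the input format.

```python
# Hypothetical raw input: order id, date of sale, product,
# selling price, cost price, store.
RAW_SALES = """1001,2015-03-01,pen,12.0,7.5,store-a
1002,2015-03-01,notebook,40.0,28.0,store-b
1003,2015-03-02,pen,12.0,7.5,store-a"""


def sales_mapper(line):
    """Keep only the fields of interest and emit them keyed by product."""
    _order_id, date, product, selling, cost, _store = line.split(",")
    record = {
        "date": date,
        "selling_price": float(selling),
        "cost_price": float(cost),
    }
    return product, record


mapped = [sales_mapper(line) for line in RAW_SALES.splitlines()]
for key, value in mapped:
    print(key, value)
```

The Mapper discards the order id and store, so later stages only receive the fields they need, keyed by product.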
MapReduce(contd.)
•Reducers:
The Mapper output is then sorted by key and passed to the
Reducers. Reducers do the actual processing on the reduced
data provided by the Mappers and accomplish the final
task, yielding the desired output.
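The sort-then-reduce step can be sketched as follows, assuming Mapper output of the form (product, (selling price, cost price)) and assuming the final task is total profit per product. Both the data and the name `profit_reducer` are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical Mapper output: (product, (selling_price, cost_price)) pairs.
mapped = [
    ("pen", (12.0, 7.5)),
    ("notebook", (40.0, 28.0)),
    ("pen", (12.0, 7.5)),
]


def profit_reducer(product, price_pairs):
    """Summarize all records for one product into its total profit."""
    return product, sum(selling - cost for selling, cost in price_pairs)


def sort_and_reduce(pairs):
    """Sort by key, group equal keys together, and reduce each group."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [
        profit_reducer(key, [value for _, value in group])
        for key, group in groupby(ordered, key=itemgetter(0))
    ]


profits = sort_and_reduce(mapped)
print(profits)
```

Sorting first matters: `groupby` only merges adjacent equal keys, which mirrors why the framework sorts Mapper output before handing it to the Reducers.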