SlideShare a Scribd company logo
THE SOLUTION FOR BIG DATA
NAME:SIVAKOTI TARAKA SATYA PHANINDRA
ROLL NO:15K81D5824
COURSE: CSE M.TECH/SEM-1
CONTENT:
Data – Trends in storing data.
BigData – Problems in IT industry
Why BigData ?
Introduction to HADOOP
HDFS (Hadoop Distributed File System)
 MapReduce
Prominent users of Hadoop.
Conclusion
Data – Trends in storing data
What is data--- Any real world symbol (character, numeric,
special character) or a of group of them is said to be data it
may be of the visual or audio or scriptural , images, etc​.,
File system
Databases
Cloud (internet)
BIG DATA:
What is big data—In IT, it is a collection of data sets so
large and complex data that it becomes difficult to process
using on-hand database management tools or traditional
data processing applications.
 As of 2016, limits on the size of data sets that are
feasible to process in reasonable time were on the order
of Exabyte of data.​(KBs MBs GBs TBs PB
ZB )
THE SOLUTION FOR BIG DATA
BIGDATA and problems with it.
 Daily about 0.8 Petabytes of updates are being made
into FACEBOOK including 50 millions photos.​
 Daily, YOUTUBE is loaded with videos that can be watched for one year
continuously​
 Limitations are encountered due to large data sets in many areas, including
meteorology, genomics, complex physics simulations, and biological and
environmental research.
 Also affect Internet search, finance and business informatics.
 The challenges include in capture, retrieval, storage, search, sharing, analysis,
and visualization.​
Why BIG DATA ?
Unstructured DATA growth !
THEN WHAT COULD BE THE SOLUTION
FOR BIGDATA ?
Hadoop’s Developers:
 2005: Doug Cutting and Michael J. Cafarella developed Hadoop to
support distribution for the Nutch search engine project.
 The project was funded by Yahoo.
 2006: Yahoo gave the project to Apache Software Foundation.
Doug Cutting
What is Hadoop?
 It is a open source software written in java
 Hadoop software library is a framework that allows for the
distributed processing of large data sets across clusters of
computers using simple programming models.
 It is designed to scale up from single servers to thousands of
machines, each offering local computation and storage.
• Apache top level project,
open-source
implementation of
frameworks for reliable,
scalable, distributed
computing and data
storage.
• It is a flexible and highly-
available architecture for
large scale computation and
data processing on a
network of commodity
hardware.
THE SOLUTION FOR BIG DATA
The project includes these modules:
Hadoop Common
Hadoop Distributed File System(HDFS)
Hadoop MapReduce
1.Hadoop Commons
 It provides access to the filesystems supported by Hadoop.
 The Hadoop Common package contains the necessary JAR
files and scripts needed to start Hadoop.
 The package also provides source code, documentation,
and a contribution section which includes projects from
the Hadoop Community (Avro, Cassandra, Chukwa, Hbase,
Hive, Mahout, Pig, ZooKeeper)
2. Hadoop Distributed File System
(HDFS):
 Hadoop uses HDFS, a distributed file system based on GFS (Google
File System), as its shared filesystem.
 HDFS architecture divides files into large chunks (~64MB)
distributed across data servers (this is configurable).
 It has a namenode and datanodes
What does a HDFS contain
 HDFS consists of a global namenodes or namespaces and they are
federated.
 The datanodes are used as common storage for blocks by all the
Namenodes.
 Each datanode registers with all the Namenodes in the cluster.
 Datanodes send periodic heartbeats and block reports and handles
commands from the Namenodes
Structure of Hadoop system:
Master Node :
Name Node
Secondary Name Node
Job Tracker
Slaves :
Data Node
Task Tracker
MASTER NODE:
 Master node
 Keeps track of namespace and metadata about items
 Keeps track of MapReduce jobs in the system
 Hadoop currently configured with centurion064 as the master
node
 Hadoop is locally installed in each system.
 Installed location is in /localtmp/hadoop/hadoop-0.15.3
SLAVE NODES:
 Slave nodes
 Manage blocks of data sent from master node
 In common, these are the chunkservers
 Currently centurion060, centurion064 are the two slave nodes being
used.
 Slave nodes store their data in /localtmp/hadoop/hadoop-dfs (this is
automatically created by the DFS)
 Once you use the DFS, relative paths are from /usr/{your usr id}
THE SOLUTION FOR BIG DATA
Advantages and Limitations of HDFS :
 Reduce traffic on job scheduling.
 File access can be achieved through the native Java or
language of the users' choice (C++, Java, Python, PHP, Ruby,
Erlang, Perl, Haskell, C#, Cocoa, Smalltalk, and OCaml),
 It cannot be directly mounted by an existing operating
system.
 It should be provided with UNIX or LUNIX system.
3.Hadoop MAPREDUCE SYSTEM:
 The Hadoop MapReduce framework harnesses a cluster of
machines and executes user defined MapReduce jobs across
the nodes in the cluster.
 A MapReduce computation has two phases
 a map phase and
 a reduce phase.
MAP AND REDUCE METHODS USAGE…
Map function
Reduce function
Run this program as a
MapReduce job
WORD COUNT OVER A GIVEN SET OF STRINGS
We 1
love 1
India 1
We 1
Play 1
Tennis 1
Love 1
India 1
We 2
Tennis 1
Play 1
Map Reduce
MAPREDUCE IN WITH NO REDUCE TASKS
MAPREDUCE WITH TWO REDUCE TASKS - AUTOMATIC
PARALLEL EXECUTION IN MAPREDUCE
Shuffle and sort in MapReduce with
multiple reduce tasks
THE SOLUTION FOR BIG DATA
THE SOLUTION FOR BIG DATA
THE SOLUTION FOR BIG DATA
Prominent users of HADOOP
 Amazon – 100 nodes
 Facebook – two clusters of 8000 and 3000 nodes
 Adobe – 80 node system
 EBay – 532 node cluster
 yahoo – cluster of about 4500 nodes
 IIIT Hyderabad – 30 node cluster
Trending :Hadoop Job’s
Salaries Tend in Hadoop:
Achievements :
 2008 - Hadoop Wins Terabyte Sort Benchmark (sorted 1 terabyte of data in 209
seconds, compared to previous record of 297 seconds)
 2009 - Avro and Chukwa became new members of Hadoop Framework family
 2010 - Hadoop's Hbase, Hive and Pig subprojects completed, adding more
computational power to Hadoop framework
 2011 - ZooKeeper Completed
 March 2011 - Apache Hadoop takes top prize at Media Guardian Innovation
Award
 2013 - Hadoop 1.1.2 and Hadoop 2.0.3 alpha.
- Ambari, Cassandra, Mahout have been added
Conclusion:
It reduce traffic on capture, storage, search, sharing, analysis, and
visualization.
A huge amount of data could be stored and large computations
could be done in a single compound with full safety and security
at cheap cost.
BIGDATA and BIGDATA-SOLUTIONS is one of the burning issues in
the present IT industry so, work on those will surely make you
more useful to that.
THE SOLUTION FOR BIG DATA
THE SOLUTION FOR BIG DATA

More Related Content

PPTX
Big Data and Hadoop - An Introduction
PPTX
Hadoop technology
PPTX
Introduction to Hadoop Technology
PPSX
PPTX
Hadoop info
PDF
Introduction to Hadoop
PPT
An Introduction to Hadoop
PPTX
HADOOP TECHNOLOGY ppt
Big Data and Hadoop - An Introduction
Hadoop technology
Introduction to Hadoop Technology
Hadoop info
Introduction to Hadoop
An Introduction to Hadoop
HADOOP TECHNOLOGY ppt

What's hot (20)

DOCX
Hadoop Seminar Report
PPT
Hadoop Technologies
PPTX
Big Data and Hadoop
PPTX
Introduction to Big Data & Hadoop Architecture - Module 1
PDF
Hadoop installation, Configuration, and Mapreduce program
PPTX
Apache Hadoop
PDF
EclipseCon Keynote: Apache Hadoop - An Introduction
PPTX
Introduction to Apache Hadoop Ecosystem
PPTX
PPT on Hadoop
ODP
Hadoop seminar
PDF
Introduction to Big Data & Hadoop
PDF
Large Scale Math with Hadoop MapReduce
PPTX
Hadoop basics
PDF
Introduction to Hadoop part1
PDF
What are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
PDF
Seminar_Report_hadoop
PDF
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
PPTX
Big Data and Hadoop Introduction
PPTX
Apache hadoop introduction and architecture
Hadoop Seminar Report
Hadoop Technologies
Big Data and Hadoop
Introduction to Big Data & Hadoop Architecture - Module 1
Hadoop installation, Configuration, and Mapreduce program
Apache Hadoop
EclipseCon Keynote: Apache Hadoop - An Introduction
Introduction to Apache Hadoop Ecosystem
PPT on Hadoop
Hadoop seminar
Introduction to Big Data & Hadoop
Large Scale Math with Hadoop MapReduce
Hadoop basics
Introduction to Hadoop part1
What are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
Seminar_Report_hadoop
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Big Data and Hadoop Introduction
Apache hadoop introduction and architecture
Ad

Viewers also liked (20)

PDF
What Can I Do with My Degree?
PPTX
літературно мистецький жовтень2015
PDF
литтерра альманах 2015
PDF
金融监管框架的改革国际经验和中国的选择
PDF
Bai 1 lam quen voi sql 2008
PDF
SUMMER STYLE SOURCE FINAL_v7
PPTX
מוצרי שייט 2016
PPTX
Pollution control
PPTX
информация по тренингу
PPTX
літературно мистецький червень2015
PPTX
E learning
PDF
Altinet_Education_Brochure
PPTX
Лiтературно-мистецький календар
PPTX
літературно мистецький липень2015
PPT
How to conduct yourself at work
PPTX
Evaluation questions 1
DOCX
Brynn Kardash Resume May 2016
PPTX
【京都勉強会】Android入門編1月31日
PPTX
Evaluation questions 1
What Can I Do with My Degree?
літературно мистецький жовтень2015
литтерра альманах 2015
金融监管框架的改革国际经验和中国的选择
Bai 1 lam quen voi sql 2008
SUMMER STYLE SOURCE FINAL_v7
מוצרי שייט 2016
Pollution control
информация по тренингу
літературно мистецький червень2015
E learning
Altinet_Education_Brochure
Лiтературно-мистецький календар
літературно мистецький липень2015
How to conduct yourself at work
Evaluation questions 1
Brynn Kardash Resume May 2016
【京都勉強会】Android入門編1月31日
Evaluation questions 1
Ad

Similar to THE SOLUTION FOR BIG DATA (20)

PPTX
Big data Analytics Hadoop
PDF
Unit IV.pdf
PPTX
Hadoop and BigData - July 2016
PPT
hadoop
PPT
hadoop
PPTX
Bigdata and hadoop
PPTX
BIG DATA: Apache Hadoop
PPTX
Seminar ppt
DOCX
project report on hadoop
PPTX
Hadoop Big Data A big picture
PPTX
Introduction to Hadoop and Hadoop component
PPTX
PPTX
Distributed Systems Hadoop.pptx
PPTX
Overview of big data & hadoop v1
PPTX
PPTX
PPTX
Cppt Hadoop
PPTX
Big data
PPTX
Overview of Big data, Hadoop and Microsoft BI - version1
Big data Analytics Hadoop
Unit IV.pdf
Hadoop and BigData - July 2016
hadoop
hadoop
Bigdata and hadoop
BIG DATA: Apache Hadoop
Seminar ppt
project report on hadoop
Hadoop Big Data A big picture
Introduction to Hadoop and Hadoop component
Distributed Systems Hadoop.pptx
Overview of big data & hadoop v1
Cppt Hadoop
Big data
Overview of Big data, Hadoop and Microsoft BI - version1

Recently uploaded (20)

PPTX
Impressionism_PostImpressionism_Presentation.pptx
PDF
Why Top Brands Trust Enuncia Global for Language Solutions.pdf
PPTX
Primary and secondary sources, and history
PPTX
Tablets And Capsule Preformulation Of Paracetamol
PPTX
AcademyNaturalLanguageProcessing-EN-ILT-M02-Introduction.pptx
PPTX
worship songs, in any order, compilation
DOC
学位双硕士UTAS毕业证,墨尔本理工学院毕业证留学硕士毕业证
PPTX
Understanding-Communication-Berlos-S-M-C-R-Model.pptx
PPTX
INTERNATIONAL LABOUR ORAGNISATION PPT ON SOCIAL SCIENCE
PPTX
fundraisepro pitch deck elegant and modern
PPTX
Tour Presentation Educational Activity.pptx
PDF
Instagram's Product Secrets Unveiled with this PPT
PPTX
Project and change Managment: short video sequences for IBA
PPTX
Role and Responsibilities of Bangladesh Coast Guard Base, Mongla Challenges
PPTX
_ISO_Presentation_ISO 9001 and 45001.pptx
DOCX
"Project Management: Ultimate Guide to Tools, Techniques, and Strategies (2025)"
PPTX
Introduction to Effective Communication.pptx
PDF
Nykaa-Strategy-Case-Fixing-Retention-UX-and-D2C-Engagement (1).pdf
PPTX
nose tajweed for the arabic alphabets for the responsive
PPTX
Relationship Management Presentation In Banking.pptx
Impressionism_PostImpressionism_Presentation.pptx
Why Top Brands Trust Enuncia Global for Language Solutions.pdf
Primary and secondary sources, and history
Tablets And Capsule Preformulation Of Paracetamol
AcademyNaturalLanguageProcessing-EN-ILT-M02-Introduction.pptx
worship songs, in any order, compilation
学位双硕士UTAS毕业证,墨尔本理工学院毕业证留学硕士毕业证
Understanding-Communication-Berlos-S-M-C-R-Model.pptx
INTERNATIONAL LABOUR ORAGNISATION PPT ON SOCIAL SCIENCE
fundraisepro pitch deck elegant and modern
Tour Presentation Educational Activity.pptx
Instagram's Product Secrets Unveiled with this PPT
Project and change Managment: short video sequences for IBA
Role and Responsibilities of Bangladesh Coast Guard Base, Mongla Challenges
_ISO_Presentation_ISO 9001 and 45001.pptx
"Project Management: Ultimate Guide to Tools, Techniques, and Strategies (2025)"
Introduction to Effective Communication.pptx
Nykaa-Strategy-Case-Fixing-Retention-UX-and-D2C-Engagement (1).pdf
nose tajweed for the arabic alphabets for the responsive
Relationship Management Presentation In Banking.pptx

THE SOLUTION FOR BIG DATA

  • 1. THE SOLUTION FOR BIG DATA NAME:SIVAKOTI TARAKA SATYA PHANINDRA ROLL NO:15K81D5824 COURSE: CSE M.TECH/SEM-1
  • 2. CONTENT: Data – Trends in storing data. BigData – Problems in IT industry Why BigData ? Introduction to HADOOP HDFS (Hadoop Distributed File System)  MapReduce Prominent users of Hadoop. Conclusion
  • 3. Data – Trends in storing data What is data--- Any real world symbol (character, numeric, special character) or a of group of them is said to be data it may be of the visual or audio or scriptural , images, etc​., File system Databases Cloud (internet)
  • 4. BIG DATA: What is big data—In IT, it is a collection of data sets so large and complex data that it becomes difficult to process using on-hand database management tools or traditional data processing applications.  As of 2016, limits on the size of data sets that are feasible to process in reasonable time were on the order of Exabyte of data.​(KBs MBs GBs TBs PB ZB )
  • 6. BIGDATA and problems with it.  Daily about 0.8 Petabytes of updates are being made into FACEBOOK including 50 millions photos.​  Daily, YOUTUBE is loaded with videos that can be watched for one year continuously​  Limitations are encountered due to large data sets in many areas, including meteorology, genomics, complex physics simulations, and biological and environmental research.  Also affect Internet search, finance and business informatics.  The challenges include in capture, retrieval, storage, search, sharing, analysis, and visualization.​
  • 9. THEN WHAT COULD BE THE SOLUTION FOR BIGDATA ?
  • 10. Hadoop’s Developers:  2005: Doug Cutting and Michael J. Cafarella developed Hadoop to support distribution for the Nutch search engine project.  The project was funded by Yahoo.  2006: Yahoo gave the project to Apache Software Foundation. Doug Cutting
  • 11. What is Hadoop?  It is a open source software written in java  Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.  It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
  • 12. • Apache top level project, open-source implementation of frameworks for reliable, scalable, distributed computing and data storage. • It is a flexible and highly- available architecture for large scale computation and data processing on a network of commodity hardware.
  • 14. The project includes these modules: Hadoop Common Hadoop Distributed File System(HDFS) Hadoop MapReduce
  • 15. 1.Hadoop Commons  It provides access to the filesystems supported by Hadoop.  The Hadoop Common package contains the necessary JAR files and scripts needed to start Hadoop.  The package also provides source code, documentation, and a contribution section which includes projects from the Hadoop Community (Avro, Cassandra, Chukwa, Hbase, Hive, Mahout, Pig, ZooKeeper)
  • 16. 2. Hadoop Distributed File System (HDFS):  Hadoop uses HDFS, a distributed file system based on GFS (Google File System), as its shared filesystem.  HDFS architecture divides files into large chunks (~64MB) distributed across data servers (this is configurable).  It has a namenode and datanodes
  • 17. What does a HDFS contain  HDFS consists of a global namenodes or namespaces and they are federated.  The datanodes are used as common storage for blocks by all the Namenodes.  Each datanode registers with all the Namenodes in the cluster.  Datanodes send periodic heartbeats and block reports and handles commands from the Namenodes
  • 18. Structure of Hadoop system: Master Node : Name Node Secondary Name Node Job Tracker Slaves : Data Node Task Tracker
  • 19. MASTER NODE:  Master node  Keeps track of namespace and metadata about items  Keeps track of MapReduce jobs in the system  Hadoop currently configured with centurion064 as the master node  Hadoop is locally installed in each system.  Installed location is in /localtmp/hadoop/hadoop-0.15.3
  • 20. SLAVE NODES:  Slave nodes  Manage blocks of data sent from master node  In common, these are the chunkservers  Currently centurion060, centurion064 are the two slave nodes being used.  Slave nodes store their data in /localtmp/hadoop/hadoop-dfs (this is automatically created by the DFS)  Once you use the DFS, relative paths are from /usr/{your usr id}
  • 22. Advantages and Limitations of HDFS :  Reduce traffic on job scheduling.  File access can be achieved through the native Java or language of the users' choice (C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, Smalltalk, and OCaml),  It cannot be directly mounted by an existing operating system.  It should be provided with UNIX or LUNIX system.
  • 23. 3.Hadoop MAPREDUCE SYSTEM:  The Hadoop MapReduce framework harnesses a cluster of machines and executes user defined MapReduce jobs across the nodes in the cluster.  A MapReduce computation has two phases  a map phase and  a reduce phase.
  • 24. MAP AND REDUCE METHODS USAGE… Map function Reduce function Run this program as a MapReduce job
  • 25. WORD COUNT OVER A GIVEN SET OF STRINGS We 1 love 1 India 1 We 1 Play 1 Tennis 1 Love 1 India 1 We 2 Tennis 1 Play 1 Map Reduce
  • 26. MAPREDUCE IN WITH NO REDUCE TASKS
  • 27. MAPREDUCE WITH TWO REDUCE TASKS - AUTOMATIC PARALLEL EXECUTION IN MAPREDUCE
  • 28. Shuffle and sort in MapReduce with multiple reduce tasks
  • 32. Prominent users of HADOOP  Amazon – 100 nodes  Facebook – two clusters of 8000 and 3000 nodes  Adobe – 80 node system  EBay – 532 node cluster  yahoo – cluster of about 4500 nodes  IIIT Hyderabad – 30 node cluster
  • 34. Salaries Tend in Hadoop:
  • 35. Achievements :  2008 - Hadoop Wins Terabyte Sort Benchmark (sorted 1 terabyte of data in 209 seconds, compared to previous record of 297 seconds)  2009 - Avro and Chukwa became new members of Hadoop Framework family  2010 - Hadoop's Hbase, Hive and Pig subprojects completed, adding more computational power to Hadoop framework  2011 - ZooKeeper Completed  March 2011 - Apache Hadoop takes top prize at Media Guardian Innovation Award  2013 - Hadoop 1.1.2 and Hadoop 2.0.3 alpha. - Ambari, Cassandra, Mahout have been added
  • 36. Conclusion: It reduce traffic on capture, storage, search, sharing, analysis, and visualization. A huge amount of data could be stored and large computations could be done in a single compound with full safety and security at cheap cost. BIGDATA and BIGDATA-SOLUTIONS is one of the burning issues in the present IT industry so, work on those will surely make you more useful to that.