Mayuri Agarwal
Data Management
Hadoop
Big Data: What does it mean?
Velocity:
Big data is often time-sensitive and must be used as it streams into the enterprise in order to maximize its value to the business.
Batch, near-time, real-time, streams
Volume:
Big data comes in one size: large. Enterprises are awash with data, easily amassing terabytes and even petabytes of information.
TB, records, transactions, tables, files
Variety:
Big data extends beyond structured data to include semi-structured and unstructured data of all varieties: text, audio, video, click streams, log files and more.
Structured, unstructured, semi-structured
Veracity:
Quality and provenance of received data.
Good, undefined, bad, inconsistency, incompleteness, ambiguity
Value
[Chart: Worldwide data. 90% created in the last 2 years; 10% since the beginning of time.]
What is Hadoop?
A software project that enables the distributed processing of large data sets across clusters of commodity servers
Works with structured and unstructured data
Open source software + commodity hardware = IT cost reduction
Designed to scale up from a single server to thousands of machines
Very high degree of fault tolerance: the software itself detects and handles failures at the application layer
The origin of the name Hadoop
The name Hadoop is not an acronym; it’s a made-up name. The project’s creator, Doug Cutting, explains how the name came about:
“The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria. Kids are good at generating such. Googol is a kid’s term.”
Hadoop Sub-projects
• HDFS
• MapReduce
HDFS: Hadoop Distributed File System
• Distributed, scalable, and portable file system
• A Hadoop cluster typically has a single NameNode; a set of DataNodes forms the HDFS cluster
• Asynchronous replication
• Data is divided into blocks of 64 MB (the Hadoop 1.x default) or 128 MB (the default since Hadoop 2.x); each block is replicated 3 times by default
• The NameNode holds the file system metadata
• Files are broken up into blocks and spread over the DataNodes
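The block and replication figures above lend themselves to a quick back-of-the-envelope calculation. The sketch below is illustrative only: the function name `hdfs_block_layout` and its parameters are hypothetical and not part of any Hadoop API. It assumes the last block of a file occupies only its actual size, so raw storage is the file size times the replication factor.

```python
def hdfs_block_layout(file_size_bytes, block_size_bytes=128 * 1024 * 1024, replication=3):
    """Estimate how a file is split and replicated (hypothetical helper, not a Hadoop API).

    Returns (number of blocks, total block replicas, raw bytes stored).
    """
    if file_size_bytes < 0 or block_size_bytes <= 0 or replication <= 0:
        raise ValueError("sizes must be non-negative and block size/replication positive")
    # Integer ceiling division: a partial final block still counts as one block.
    num_blocks = -(-file_size_bytes // block_size_bytes)
    total_replicas = num_blocks * replication
    # The final block is stored at its real size, so raw usage scales with file size.
    raw_bytes = file_size_bytes * replication
    return num_blocks, total_replicas, raw_bytes


one_gib = 1024 ** 3
print(hdfs_block_layout(one_gib))                     # 128 MB blocks
print(hdfs_block_layout(one_gib, 64 * 1024 * 1024))   # 64 MB blocks
```

With the defaults, a 1 GiB file yields 8 blocks and 24 block replicas across the DataNodes; halving the block size doubles both counts, which in turn doubles the metadata the NameNode has to track for that file.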
HDFS- Read & Write
MapReduce
Software framework for distributed computation
Input | Map() | Copy/Sort | Reduce() | Output
The JobTracker schedules and manages jobs.
A TaskTracker on each cluster node executes the individual map() and reduce() tasks.
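The Input | Map() | Copy/Sort | Reduce() | Output pipeline can be imitated on a single machine. The following is a minimal, in-memory sketch of a word count in plain Python. It does not use the Hadoop API, and the function names (`map_phase`, `shuffle_sort`, `reduce_phase`, `word_count`) are made up for illustration; a real job would distribute each phase across the cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_sort(pairs):
    """Copy/Sort: group all values by key and return the groups in key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over an iterable of text lines."""
    mapped = (pair for line in lines for pair in map_phase(line))
    return [reduce_phase(key, values) for key, values in shuffle_sort(mapped)]


print(word_count(["big data big", "data hadoop"]))
```

Because the map step looks at one line at a time and the reduce step looks at one key at a time, both can run in parallel on different nodes; only the copy/sort step needs to move data between them.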
Example : MapReduce
Master – Slave Model
Hadoop Ecosystem
HBase
• HBase is an open source, non-relational, distributed database
• A key-value store
• A value is identified by its key
• Both key and value are byte arrays
• Values are stored in key order
• Accessing data by key is therefore very fast
• Users create tables in HBase
• An HBase table has no fixed column schema
• Very good for sparse data
• Takes a lot of disk space
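To see why storing byte-array keys in sorted order makes key lookups and range scans cheap, consider the toy store below. It is only a sketch of the idea in plain Python: the class `SortedByteStore` and its methods are hypothetical and are not the HBase client API.

```python
import bisect


class SortedByteStore:
    """Toy key-value store that keeps byte-string keys in sorted order (illustrative only)."""

    def __init__(self):
        self._keys = []     # sorted list of keys
        self._values = {}   # key -> value

    def put(self, key: bytes, value: bytes) -> None:
        # Insert new keys at their sorted position; overwrite existing ones in place.
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, key: bytes):
        # Point lookup by key; returns None if the key is absent.
        return self._values.get(key)

    def scan(self, start: bytes, stop: bytes):
        # Range scan over [start, stop): binary search finds both ends of the slice.
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]


store = SortedByteStore()
for k, v in [(b"row3", b"c"), (b"row1", b"a"), (b"user1", b"x"), (b"row2", b"b")]:
    store.put(k, v)
print(store.scan(b"row1", b"row3"))
```

Keys compare byte by byte, so `b"row10"` sorts before `b"row2"`. Row keys therefore need to be designed with that ordering in mind, for example by zero-padding numbers.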
HBase Architecture
• Master: coordinates the region servers
• Region server: serves data for reads and writes
• ZooKeeper: manages the HBase cluster
• Low-latency, random access to data
Hive
• A system for managing and querying structured data, built on Hadoop
• SQL-like query language called HiveQL (HQL)
• Main purpose is analysis and ad hoc querying
• DDL operations on databases, tables and partitions
• Not for: small data sets, low-latency queries, OLTP
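Partitions matter for ad hoc querying because Hive stores each partition of a table separately, so a query that filters on the partition column can skip the other partitions entirely. The sketch below imitates that behaviour in plain Python. It is not HiveQL or a Hive API; the function `scan_with_pruning` and the sample data are invented for illustration.

```python
def scan_with_pruning(partitions, wanted, predicate):
    """Scan only the wanted partitions (hypothetical helper).

    partitions: dict mapping a partition value to a list of row dicts
    wanted:     set of partition values the query asks for
    predicate:  row filter applied inside the selected partitions
    Returns (matching rows, number of rows actually read).
    """
    rows_read = 0
    matches = []
    for part_value, rows in partitions.items():
        if part_value not in wanted:
            continue  # partition pruned: none of its rows are read
        for row in rows:
            rows_read += 1
            if predicate(row):
                matches.append(row)
    return matches, rows_read


clicks = {
    "2024-01-01": [{"user": "a", "clicks": 3}, {"user": "b", "clicks": 0}],
    "2024-01-02": [{"user": "a", "clicks": 5}],
}
print(scan_with_pruning(clicks, {"2024-01-02"}, lambda r: r["clicks"] > 0))
```

Only one of the three stored rows is read here. On a real table with years of daily partitions, the same filter decides how much data the underlying job has to process.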
Hadoop-Hive Architecture
HBase-Hive configuration
HBase as ETL data sink
HBase as Data Source
Low Latency warehouse
Hive and MySQL Database Structure
Hadoop Limitations
• Hadoop is not a high-speed SQL database.
• It is not a particularly simple technology.
• Hadoop is not easy to connect to legacy systems.
• Hadoop is not a replacement for traditional data warehouses; it is an adjunct to them.
• DBAs will need to learn new skills before they can adopt Hadoop tools.
• The architecture around the data (the way you store, de-normalize, ingest and extract data) is different in Hadoop.
• Linux and Java skills are critical for making a Hadoop environment a reality.
Hadoop’s Capability
• Hadoop is a powerful environment that can transform your understanding of data.
• Hadoop can store vast amounts of data.
• Hadoop can run queries on huge data sets.
• You can archive data on Hadoop and still query it.
• Hadoop lets you ingest data at high speed, then analyze and report on it in near real time.
• Hadoop greatly reduces the delay between data arriving and being available for analysis.
Hadoop: a hot skill to acquire on the IT job circuit
• The market for data technologies, such as databases, is a multi-billion dollar industry.
• Many start-ups are working on technology extensions to Hadoop to make it both analytical and transactional. That would be big.
• Major companies have a big data strategy and want to build their businesses on top of it.
• Google, whose MapReduce and file system papers inspired Hadoop, has already moved on, suggesting that within a decade either the Hadoop framework will have to be developed beyond all recognition or something newer could be on the way to supplant it.
• Major internet companies such as Twitter, LinkedIn and Facebook use some form of Hadoop.
Hadoop
mayuri.enggheads@gmail.com

