www.edureka.in/hadoop-admin
How It Works…
• Live classes
• Class recordings
• Module-wise quizzes and practical assignments
• 24x7 on-demand technical support
• Deployment of different clusters
• Online certification exam
• Lifetime access to the Learning Management System
Course Topics
• Week 1
– Understanding Big Data
– Hadoop Components
– Introduction to Hadoop 2.0
• Week 2
– Hadoop 2.0
– Hadoop Configuration
– Hadoop Cluster Architecture
• Week 3
– Different Hadoop Server Roles
– Data processing flow
– Cluster Network Configuration
• Week 4
– Job Scheduling
– Fair Scheduler
– Monitoring a Hadoop Cluster
• Week 5
– Securing your Hadoop Cluster
– Kerberos and HDFS Federation
– Backup and Recovery
• Week 6
– Oozie and Hive Administration
– HBase Architecture
– HBase Administration
Topics for Today
• What is Big Data?
• Limitations of the existing solutions
• Solving the problem with Hadoop
• Introduction to Hadoop
• Hadoop Eco-System
• Hadoop Core Components
• MapReduce software framework
• Hadoop Cluster Administrator: Roles and Responsibilities
• Introduction to Hadoop 2.0
What Is Big Data?
• Lots of data (terabytes or petabytes).
• Systems and enterprises generate huge amounts of data, from terabytes to even petabytes of information.
An airline jet collects 10 terabytes of sensor data for every 30 minutes of flying time.
NYSE generates about one terabyte of new trade data per day, which is used for stock trading analytics to determine trends for optimal trades.
IBM’s Definition
• IBM’s definition – Big Data Characteristics
http://www-01.ibm.com/software/data/bigdata/
Characteristics Of Big Data
• Volume: 12 terabytes of tweets created each day
• Velocity: Scrutinizes 5 million trade events created each day to identify potential fraud
• Variety: Sensor data, audio, video, click streams, log files and more
Data Volume Is Growing Exponentially
• Estimated global data volume:
– 2011: 1.8 ZB
– 2015: 7.9 ZB
• The world's information doubles every two years
• Over the next 10 years:
– The number of servers worldwide will grow by 10x
– The amount of information managed by enterprise data centers will grow by 50x
– The number of “files” enterprise data centers handle will grow by 75x
Source: http://www.emc.com/leadership/programs/digital-universe.htm, which was based on the 2011 IDC Digital Universe Study
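As a rough check on the figures above, the two volume estimates can be turned into an implied doubling time. The sketch below assumes smooth exponential growth between 2011 and 2015; the function name is illustrative and not part of the course material.

```python
import math

def doubling_time_years(v_start, v_end, years):
    """Doubling time implied by growth from v_start to v_end over `years`,
    assuming the growth is exponential (an assumption, not a slide claim)."""
    growth_factor = v_end / v_start
    return years * math.log(2) / math.log(growth_factor)

# Volume estimates quoted on the slide, in zettabytes.
implied = doubling_time_years(1.8, 7.9, 2015 - 2011)
print(f"Implied doubling time: {implied:.2f} years")
```

Under that assumption the result comes out a little under two years, which is consistent with the "doubles every two years" statement.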
What Big Companies Have To Say…
“Analyzing Big Data sets will become a key basis for competition.”
“Leaders in every sector will have to grapple with the implications of Big Data.”
“Big Data analytics are rapidly emerging as the preferred solution to business and technology trends that are disrupting.”
“Enterprises should not delay implementation of Big Data Analytics.”
“Use Hadoop to gain a competitive advantage over more risk-averse enterprises.”
“Prioritize Big Data projects that might benefit from Hadoop.”
McKinsey, Gartner, Forrester Research
Some Of the Hadoop Users
Hadoop Users – In Detail
http://wiki.apache.org/hadoop/PoweredBy
What Is Hadoop?
• Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.
• It is an open-source data management framework with scale-out storage and distributed processing.
Hadoop Key Characteristics
Hadoop features:
• Reliable
• Economical
• Scalable
• Flexible
Hadoop History
Timeline, 2002 to 2009, in chronological order:
• Doug Cutting & Mike Cafarella started working on Nutch
• Google publishes GFS & MapReduce papers
• Doug Cutting adds DFS & MapReduce support to Nutch
• Yahoo! hires Cutting, Hadoop spins out of Nutch
• NY Times converts 4 TB of image archives over 100 EC2 instances
• Fastest sort of a TB, 3.5 mins over 910 nodes
• Facebook launches Hive: SQL support for Hadoop
• [Company logo] founded
• Doug Cutting joins Cloudera
• Fastest sort of a TB, 62 secs over 1,460 nodes
• Sorted a PB in 16.25 hours over 3,658 nodes
• Hadoop Summit 2009, 750 attendees
Hadoop 1.x Eco-System
• Apache Oozie (Workflow)
• Pig Latin (Data Analysis)
• Mahout (Machine Learning)
• Hive (DW System)
• MapReduce Framework
• HBase
• HDFS (Hadoop Distributed File System)
• Flume and Sqoop (Import Or Export): Flume for unstructured or semi-structured data, Sqoop for structured data
Hadoop 1.x Core Components
Hadoop is a system for large-scale data processing. It has two main components:
• HDFS – Hadoop Distributed File System (Storage)
– Distributed across “nodes”
– Natively redundant
– Self-healing, high-bandwidth clustered storage
– NameNode tracks locations
• MapReduce (Processing)
– Splits a task across processors “near” the data and assembles the results
– JobTracker manages the TaskTrackers
• Additional administration tools:
– Filesystem utilities
– Job scheduling and monitoring
– Web UI
Hadoop 1.x Core Components (Contd.)
[Diagram: the Admin Node runs the Name Node (HDFS cluster) and the Job Tracker (MapReduce engine); each of four worker nodes runs a Data Node and a Task Tracker.]
Name Node and Data Nodes
• NameNode:
– Master of the system
– Maintains and manages the metadata for the blocks stored on the DataNodes
• DataNodes:
– Slaves deployed on each machine that provide the actual storage
– Responsible for serving read and write requests from clients
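The split between metadata (NameNode) and storage (DataNodes) can be pictured with a small toy model. This is only a sketch under stated assumptions: the class name, the round-robin placement, and the node names are hypothetical, and real HDFS placement is rack-aware rather than round-robin. The replication factor of 3 is used here as an illustrative setting.

```python
from collections import defaultdict

class ToyNameNode:
    """Holds metadata only: which blocks make up a file and where replicas live."""

    def __init__(self, datanodes, replication=3):
        self.datanodes = list(datanodes)
        self.replication = replication
        self.file_blocks = defaultdict(list)   # file path -> [block ids]
        self.block_locations = {}              # block id -> [DataNode names]
        self._next_block = 0

    def allocate_block(self, path):
        """Record a new block for `path` and pick DataNodes for its replicas.
        Round-robin placement is a simplification for illustration."""
        block_id = f"blk_{self._next_block}"
        n = len(self.datanodes)
        start = self._next_block % n
        count = min(self.replication, n)
        targets = [self.datanodes[(start + i) % n] for i in range(count)]
        self._next_block += 1
        self.file_blocks[path].append(block_id)
        self.block_locations[block_id] = targets
        return block_id, targets

    def locate(self, path):
        """Answer a client lookup with block locations; the bytes themselves
        would be read from the DataNodes, not from the NameNode."""
        return [(b, self.block_locations[b]) for b in self.file_blocks[path]]

nn = ToyNameNode(["dn1", "dn2", "dn3", "dn4"])
nn.allocate_block("/logs/app.log")
nn.allocate_block("/logs/app.log")
for block_id, nodes in nn.locate("/logs/app.log"):
    print(block_id, "->", nodes)
```

The point of the sketch is that the NameNode never stores file contents; it only answers "which blocks, on which DataNodes".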
Secondary Name Node
• Secondary NameNode:
– Not a hot standby for the NameNode
– Connects to the NameNode every hour*
– Housekeeping and backup of NameNode metadata
– Saved metadata can be used to rebuild a failed NameNode
[Diagram: the NameNode, marked as a single point of failure, sends its metadata to the Secondary NameNode, which says: “You give me metadata every hour, I will make it secure.”]
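The housekeeping step can be sketched as a checkpoint: the last saved namespace image is merged with the edits logged since then. The data structures and operation names below are hypothetical stand-ins for illustration, not the actual on-disk formats.

```python
def apply_edits(fsimage, edits):
    """Return a new namespace snapshot with the logged operations applied in order."""
    namespace = dict(fsimage)
    for op, path, value in edits:
        if op == "create":
            namespace[path] = value
        elif op == "delete":
            namespace.pop(path, None)
    return namespace

# Hypothetical NameNode state: the last checkpoint plus the edits made since.
fsimage = {"/data/a.txt": ["blk_1"]}
edit_log = [
    ("create", "/data/b.txt", ["blk_2", "blk_3"]),
    ("delete", "/data/a.txt", None),
]

# Checkpoint: merge the image with the edits, after which the log can start fresh.
new_fsimage = apply_edits(fsimage, edit_log)
edit_log = []
print(new_fsimage)
```

Because the merged image is only as fresh as the last checkpoint, a NameNode rebuilt from it can be missing the most recent edits, which is one reason the Secondary NameNode is not a hot standby.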
What Is MapReduce?
• MapReduce is a programming model
• It is neither platform- nor language-specific
• Record-oriented data processing (key and value)
• Tasks are distributed across multiple nodes
• Where possible, each node processes data stored on that node
• Consists of two phases
– Map
– Reduce
[Diagram: Key and Value pairs in MapReduce]
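The key/value model and the two phases can be illustrated with a single-process word count. This is a minimal sketch that runs everything in one Python process; the function names are illustrative, and the grouping step stands in for the shuffle that a real cluster performs between the two phases.

```python
from collections import defaultdict

def map_phase(record):
    """Emit a (key, value) pair for every word in one input record."""
    for word in record.split():
        yield word.lower(), 1

def reduce_phase(key, values):
    """Combine all values that share a key into a single result."""
    return key, sum(values)

def run_job(records):
    # Group map output by key; on a cluster this happens between map and reduce.
    grouped = defaultdict(list)
    for record in records:
        for key, value in map_phase(record):
            grouped[key].append(value)
    return dict(reduce_phase(k, v) for k, v in sorted(grouped.items()))

print(run_job(["big data", "Big clusters", "data data"]))
```

Each call to `map_phase` looks at one record in isolation, which is what allows the work to be spread across nodes that each hold a part of the data.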
What Is MapReduce? (Contd.)
The process can be considered similar to a Unix pipeline:
cat /my/log | grep '\.html' | sort | uniq -c > /my/outfile
Here grep plays the MAP role, sort the SORT role, and uniq -c the REDUCE role.
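The same three stages can be written out as plain functions to make the analogy concrete. This is a sketch only: the sample log lines are made up, and the substring test is a simplification of what grep does.

```python
from itertools import groupby

def map_step(lines):
    """MAP: keep only the lines of interest (the grep stage)."""
    return [line for line in lines if ".html" in line]

def sort_step(lines):
    """SORT: bring identical lines next to each other (the sort stage)."""
    return sorted(lines)

def reduce_step(sorted_lines):
    """REDUCE: count each run of identical lines (the uniq -c stage)."""
    return [(len(list(group)), line) for line, group in groupby(sorted_lines)]

# Hypothetical log contents, one requested path per line.
log = ["/index.html", "/logo.png", "/about.html", "/index.html"]
for count, line in reduce_step(sort_step(map_step(log))):
    print(count, line)
```

The reduce step relies on the sort having grouped equal lines together, just as `uniq -c` only counts adjacent duplicates.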
Hadoop 1.x – In Summary
[Diagram: a Client talks to HDFS (Name Node, Secondary Name Node, and Data Nodes holding the data blocks) and to MapReduce (Job Tracker, plus a Task Tracker running Map and Reduce tasks alongside each Data Node).]
Poll Questions
Hadoop Cluster Administrator: Roles and Responsibilities
• Deploying the cluster
• Performance and availability of the cluster
• Job scheduling and management
• Upgrades
• Backup and recovery
• Monitoring the cluster
• Troubleshooting
Hadoop 1.0 Vs. Hadoop 2.0
Property            | Hadoop 1.x                | Hadoop 2.x
NameNodes           | 1                         | Many
High Availability   | Not present               | Highly Available
Processing Control  | JobTracker, Task Tracker  | Resource Manager, Node Manager, App Master
MRv1 Vs. MRv2
[Diagram: in Hadoop 1.0, MapReduce (data processing, coordinated by the Job Tracker) sits directly on HDFS (data storage). In Hadoop 2.0, MapReduce and other data processing frameworks run on YARN (cluster resource management, with a Scheduler and an Applications Manager (AsM)), which sits on HDFS (data storage).]
Hadoop 1.0:
• Problems with resource utilization
• Slots only for Map and Reduce
Hadoop 2.0:
• Provides a cluster-level Resource Manager
• Application-level resource management (handled by a per-application App Master; the Node Manager works per node)
• Provides containers, rather than fixed slots, for jobs other than Map and Reduce
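The utilization problem with fixed slots can be shown with simple arithmetic. The sketch below is a deliberate simplification: it counts resources as whole units, whereas YARN containers are sized by resources such as memory, and the slot and container counts are made-up numbers.

```python
def mrv1_running(map_tasks, reduce_tasks, map_slots, reduce_slots):
    """MRv1-style: each slot type only accepts its own task type."""
    return min(map_tasks, map_slots) + min(reduce_tasks, reduce_slots)

def yarn_running(tasks, containers):
    """YARN-style: any pending task can use any free container."""
    return min(tasks, containers)

# A map-heavy moment on one node: four map tasks waiting, no reduce tasks yet.
pending_maps, pending_reduces = 4, 0
print("MRv1:", mrv1_running(pending_maps, pending_reduces, map_slots=2, reduce_slots=2))
print("YARN:", yarn_running(pending_maps + pending_reduces, containers=4))
```

With the same total capacity, the fixed split leaves the two reduce slots idle while map tasks queue, which is the kind of waste the slide refers to.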
Hadoop 2.0 - Architecture
[Diagram: a Client talks to HDFS and YARN.
HDFS: an Active NameNode and a Standby NameNode share edit logs. All namespace edits are logged to shared NFS storage with a single writer (fencing); the Standby NameNode reads the edit logs and applies them to its own namespace. The diagram also shows a Secondary Name Node and four Data Nodes.
YARN: a Resource Manager and four Node Managers, each hosting Containers and an App Master.]
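The shared edit log idea can be sketched in a few lines: only the active NameNode may append edits, and the standby replays whatever it has not yet applied. All class and function names here are hypothetical, and the in-memory list merely stands in for the shared storage.

```python
class HANameNode:
    """Toy NameNode that keeps a namespace and a count of applied shared edits."""

    def __init__(self, name):
        self.name = name
        self.namespace = set()
        self.applied = 0

    def apply(self, edit):
        op, path = edit
        if op == "mkdir":
            self.namespace.add(path)
        elif op == "delete":
            self.namespace.discard(path)
        self.applied += 1

class SharedEditLog:
    """Stands in for the shared storage; only the current writer may append."""

    def __init__(self, writer):
        self.writer = writer
        self.edits = []

    def append(self, node, edit):
        if node is not self.writer:
            raise PermissionError(f"{node.name} is fenced off from the edit log")
        self.edits.append(edit)

def write(active, log, edit):
    """The active node logs the edit first, then applies it to its own namespace."""
    log.append(active, edit)
    active.apply(edit)

def catch_up(standby, log):
    """The standby reads the edits it has not seen yet and applies them."""
    for edit in log.edits[standby.applied:]:
        standby.apply(edit)

active, standby = HANameNode("nn1"), HANameNode("nn2")
log = SharedEditLog(writer=active)
write(active, log, ("mkdir", "/user"))
write(active, log, ("mkdir", "/tmp"))
catch_up(standby, log)
print(standby.namespace == active.namespace)
```

The single-writer check is the toy equivalent of fencing: a node that is not the designated writer cannot add edits, so the two namespaces cannot diverge through the log.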
Assignments
• Attempt the following assignments using the documents present in the LMS:
– Single Node Apache Hadoop 1.0 Installation on Ubuntu
– Execute Linux Basic Commands
– Execute HDFS Hands On
– Cloudera CDH3 and CDH4 Quick VM installation on your local machine
Thank You
See You in Class Next Week

Hadoop Administration pdf
