Hadoop Cluster Management



                 Dheeraj Kapur
                 Principal Engineer, Yahoo!
                 dheerajk@yahoo-inc.com
What it is…
§  Workflow-based system for cluster management.
§  Completely modular and distributed design.
§  Has its own JMX-based library (can also be used to monitor other
    services on the cluster).
§  Fully controllable from the WebUI.
§  Has a command-line utility for ad hoc administration.




What it does…
§    Manages clusters.
§    Handles break-fix.
§    Upgrades the OS seamlessly.
§    Maintains consistency and efficiency of clusters.
§    Provides a proactive self-healing model.
§    Handles user management.




Manage Clusters
§    It has a well-defined workflow to manage clusters.
§    Requires minimal or no human intervention.
§    Keeps up the efficiency of the cluster.
§    Keeps track of missing/bad blocks on the system.
§    Well-defined WebUI and command-line utility.
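
As a rough illustration of tracking missing/bad blocks, the sketch below parses a JSON metrics payload and flags a cluster whose block counts exceed a threshold. The payload, the bean name, the attribute names (`MissingBlocks`, `CorruptBlocks`) and the helper functions are all assumptions made for this example; the slides do not describe the system's JMX library or its data format.

```python
import json

# Hypothetical sample payload; bean and attribute names are assumptions
# for illustration and are not taken from the slides.
SAMPLE_PAYLOAD = """
{"beans": [
  {"name": "Hadoop:service=NameNode,name=FSNamesystem",
   "MissingBlocks": 3, "CorruptBlocks": 1},
  {"name": "java.lang:type=Memory", "HeapMemoryUsage": {"used": 1024}}
]}
"""


def block_health(payload: str) -> dict:
    """Return missing/corrupt block counts from the first FSNamesystem bean."""
    beans = json.loads(payload).get("beans", [])
    for bean in beans:
        if bean.get("name", "").endswith("name=FSNamesystem"):
            return {
                "missing": bean.get("MissingBlocks", 0),
                "corrupt": bean.get("CorruptBlocks", 0),
            }
    # No matching bean found: report zero rather than guessing.
    return {"missing": 0, "corrupt": 0}


def needs_attention(health: dict, threshold: int = 0) -> bool:
    """True when either count is above the (assumed) alert threshold."""
    return health["missing"] > threshold or health["corrupt"] > threshold


health = block_health(SAMPLE_PAYLOAD)
print(health, needs_attention(health))
```

A workflow step could run a check like this on a schedule and open a break-fix task when `needs_attention` returns true.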




System Overview




Workflow




Workflow contd…




Command Line Utility




Web Interface




Web Interface contd…




Fixing bad/poorly performing nodes
These errors can lead to SLA misses or job failures.
§  Takes care of nodes blacklisted by the JobTracker (JT).
§  Handles errors such as high load average and wrong network speed.
§  Parses system logs at a set frequency (through workflows) and looks for
    patterns.
§  Visits each node multiple times a day and checks its health.
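
The checks listed above can be sketched roughly as follows. This is a minimal illustration and not the system's actual implementation: the log patterns, the load threshold of twice the core count, and the function names are all assumptions.

```python
import re

# Hypothetical log patterns; a real deployment would maintain its own list.
PATTERNS = {
    "disk_error": re.compile(r"I/O error", re.IGNORECASE),
    "oom": re.compile(r"Out of memory", re.IGNORECASE),
    "link_down": re.compile(r"link is down", re.IGNORECASE),
}


def scan_log(lines):
    """Count how many log lines match each known pattern."""
    counts = {name: 0 for name in PATTERNS}
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


def node_issues(load_avg, cores, link_mbps, expected_mbps, log_lines):
    """Return the issue labels found for one node on one health-check visit."""
    issues = []
    # Assumed rule of thumb: load above 2x the core count counts as high.
    if load_avg > 2 * cores:
        issues.append("high_load")
    # A link negotiated below the expected speed is treated as a fault.
    if link_mbps < expected_mbps:
        issues.append("wrong_network_speed")
    issues.extend(name for name, n in scan_log(log_lines).items() if n > 0)
    return issues


sample_log = [
    "kernel: Buffer I/O error on device sda1",
    "eth0: Link is Down",
]
print(node_issues(40.0, 8, 100, 1000, sample_log))
```

Running a function like this on every visit to a node gives each workflow run a small list of labels that can be used to decide whether the node should enter the break-fix cycle.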




Upgrade OS
§  Upgrades and rolls back the OS seamlessly.
§  Upgrades run on heavily used production clusters.




Consistency & efficiency of clusters
§  Keeps track of cluster MapReduce (MR) capacity.
§  Proactively fixes sick nodes that could cause issues.




Introducing the proactive self-healing system

Let me set the ground for it.
§  Wounded hosts (Set A): hosts that have issues but are still in service
    (with degraded services), which can cause SLA misses and job
    execution issues (as we have seen in the past).
§  Fractured hosts (Set B): hosts already in the break-fix cycle and
    being fixed.
§  All grid hosts (Set X): all hosts in the grid, including the healthy
    ones.
§  Sets A and B are subsets of Set X.
§  To find wounded hosts, we have to scan the entire infrastructure once a
    day.
§  Calculate the symmetric difference between Sets A and B to get the
    wounded hosts that actually need service.
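
The set arithmetic can be sketched with Python sets. The host names and the function below are hypothetical. One caveat: a symmetric difference (`A ^ B`) also contains hosts that appear only in Set B, which are already being fixed. This sketch therefore assumes the intended result is the wounded hosts not yet in the break-fix cycle, which is the plain difference `A - B`, and prints both for comparison.

```python
def hosts_needing_service(wounded, fractured, all_hosts):
    """Return wounded hosts that are not already in the break-fix cycle."""
    # Both sets must be subsets of the full grid (Set X).
    if not (wounded <= all_hosts and fractured <= all_hosts):
        raise ValueError("Set A and Set B must be subsets of Set X")
    return wounded - fractured


# Hypothetical host names, for illustration only.
all_hosts = {"h1", "h2", "h3", "h4", "h5", "h6"}   # Set X
wounded = {"h2", "h3", "h4"}                        # Set A, from the daily scan
fractured = {"h4", "h5"}                            # Set B, already being fixed

print(sorted(hosts_needing_service(wounded, fractured, all_hosts)))
# The symmetric difference also includes h5, which is only in Set B.
print(sorted(wounded ^ fractured))
```

If every fractured host is also detected as wounded by the daily scan (B is a subset of A), the two results coincide.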




Proactive self-healing contd….
[Figure: All Grid Hosts (Set X), containing Set A and Set B]

Proactive self-healing contd….




User Management
§  We have one of the most complex and secure environments.
§  User access management is a complex task because of the
    number of users, the security constraints, and the complexity of
    provisioning access.
§  Provisioning a single request requires changes in multiple places.
§  A well-defined workflow-based system achieves 100% automation.
§  A great help during system audits and compliance reviews.




Q&A




Thank You



