| 1
MULTI-TENANT HADOOP
THE CHALLENGE OF
MAINTAINING HIGH SLAS
Edouard ROUSSEAUX
EDF-DTEO-DSIT-ITO-DATACENTER
Big Data Tech Lead
| 2
SUMMARY
Multi-tenant Hadoop - The challenge of maintaining high SLAS | april 2018
1. EDF CONTEXT
PRESENTATION
2. BIG DATA STRATEGY OF EDF-DSIT (IT PRODUCTION DEPARTMENT)
HISTORY
ARCHITECTURE CHOICES
BIG DATA SERVICE OFFER
3. CHALLENGES TO TAKE UP
CURRENT STATE
DIAGNOSTIC
FOCUS
4. ACTION PLAN
TECHNICAL POINTS
ORGANIZATIONAL POINTS
| 3
EDF CONTEXT
A WORLD LEADER IN LOW-CARBON ENERGY, THE EDF GROUP BRINGS TOGETHER ALL THE BUSINESSES OF
ELECTRICITY PRODUCTION, TRADING AND NETWORKS.
EDF-DSIT PROVIDES IT SERVICES TO SUPPORT
THE GROUP IN ITS DIGITAL TRANSFORMATION
| 4
BIG DATA STRATEGY OF EDF-DSIT
A SHARED DATALAKE
ARCHITECTURE CHOICES
Shared Platform for all Group businesses:
- Centralization of data
- Economic efficiency
- Sharing / cross-analyzing data between businesses
- Simplify operations
A specific platform per type of use case, in
order to guarantee SLAs, performance and flexibility:
- Development / Pre-production
- Production (mainly used as a backend for production
applications)
- Backup / Disaster Recovery
- Analytics (soon)
These architectural choices have a very strong
impact on the performance of our infrastructures and
applications
HISTORY OF BIG DATA AT EDF
We started in 2012 with a first cluster (Hadoop v1)
- Trade Direction wanted to start cross-analyzing data
- 4 recycled hosts
… exponential growth …
Today:
- 3 physical environments (a 4th one soon)
- We are using Hortonworks (HDP, HDF)
- 200 hosts
- HDFS ≈ 1.4 PB (usable space)
- YARN ≈ 14.6 TB of RAM / 4600 vCores
- HBASE ≈ 8.2 TB of RAM
- Biggest HBASE table ≈ 90TB (8k Regions)
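The last figure gives a feel for region sizing. As a rough sketch, assuming decimal units (1 TB = 1000 GB, which the slide does not specify) and evenly sized regions, the average region size of the biggest table can be derived from the two numbers quoted above. The function name is illustrative only.

```python
# Figures quoted on the slide: biggest HBase table of about 90 TB in about 8k regions.
# Assumption: decimal units (1 TB = 1000 GB) and regions of roughly equal size.
TABLE_SIZE_TB = 90
REGION_COUNT = 8_000


def average_region_size_gb(table_size_tb: float, region_count: int) -> float:
    """Return the mean region size in GB for a table of the given size."""
    if region_count <= 0:
        raise ValueError("region_count must be positive")
    return table_size_tb * 1000 / region_count


avg_gb = average_region_size_gb(TABLE_SIZE_TB, REGION_COUNT)
print(f"Average region size: {avg_gb:.2f} GB")
```

Under these assumptions the average works out to a little over 11 GB per region.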
| 5
BIG DATA STRATEGY OF EDF-DSIT
BIG DATA SERVICE OFFER
AN OFFER THAT FOUND ITS AUDIENCE
− 5 Business Directions / 50 production applications
− 10k+ YARN jobs per day (in production)
− 250 HBase tables / 25k HBase regions
− 150+ Hive DBs
− 500+ users
− 24/7 applications
− 1 HA application
− All kinds of application types:
• Batch (ELT)
• Streaming / Real-time
• OLAP
• OLTP
• Big volumes / Small volumes
• Critical / Non-critical applications
« BIG DATA » SERVICE OFFER
Business-oriented offer:
- A price catalog
- Very simple units of work: TB and vCores
- Global SLA on the shared services available on our clusters
(HDFS, HBASE, KAFKA, …)
- Organization, processes, …
« Self-service » consumption by the businesses
The design of this service offer also has a strong impact on
the performance of infrastructures and applications
OUR BIG DATA INFRASTRUCTURE HAS BECOME ESSENTIAL TO OUR BUSINESSES
| 6
CHALLENGES TO TAKE UP
AN OFFER « VICTIM » OF ITS OWN SUCCESS
SITUATION LAST SUMMER
Many critical use cases for businesses
Difficulty maintaining the expected level of service:
- Instability / Unavailability
- Difficulty communicating
[Figure: Availability of services in 2017 (%), by month from June to November, for Hive, HBase, HDFS and YARN. Y-axis from 95 to 101; plotted values range from 96 to 100.]
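To put the availability percentages in the chart above into perspective, they can be converted into hours of downtime. This is a minimal sketch, assuming a 30-day month and availability measured over wall-clock time; neither assumption is stated on the slide, and the function name is illustrative.

```python
def monthly_downtime_hours(availability_pct: float, days_in_month: int = 30) -> float:
    """Convert a monthly availability percentage into hours of downtime.

    Assumes availability is measured against total wall-clock time.
    """
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    total_hours = days_in_month * 24
    return (100.0 - availability_pct) / 100.0 * total_hours


# 96 is the lowest value plotted in the chart; 99.5 and 100 also appear in it.
for pct in (100.0, 99.5, 96.0):
    print(f"{pct:6.2f} % -> {monthly_downtime_hours(pct):5.1f} h of downtime")
```

Under these assumptions, a month at 96 % corresponds to more than a full day of unavailability, while 99.5 % corresponds to a few hours.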
| 7
CHALLENGES TO TAKE UP
DIAGNOSTIC
TECHNICAL ISSUES
Improve the way we operate our shared Big Data Infrastructures:
- Insufficient Metrics
- Monitoring didn’t evolve as fast as application types
- Complex diagnostics
- Sizing forecasts were not accurate
Some applications were developed with anti-patterns
Lack of rigor when putting applications in production
- Insufficient scale-out and performance tests
- No code review
Internal billing based on storage and CPU (insufficient):
- Does not accurately reflect cluster usage
- Does not encourage a virtuous use of clusters
NEED FOR FURTHER TECHNICAL &
PRODUCTION SUPPORT
| 8
CHALLENGES TO TAKE UP
DIAGNOSTIC
ORGANIZATIONAL ISSUES
« Self-Service » Offer:
- Partial control of what runs on infrastructure
- No global view of cluster usage
SLA not accurate enough:
- No concept of degraded service
- Service quality perceived differently by business and operations teams
Improvement of our production skills
- Shift in skill type required
Shared infrastructure:
- Governance challenge (change management, ...)
- Business use cases impact each other
- Accurate resource allocation is essential (according to needs)
Capacity Planning:
- Not sufficiently detailed
- Not easy to quantify the potential impact of business use cases
(Restricted vision / Sizing)
NEED MORE COORDINATION BETWEEN TEAMS
| 9
CHALLENGES TO TAKE UP
IN HINDSIGHT
EVOLVING CLUSTER USAGE
Until summer 2017: only « ELT » usage
New types of business use cases:
- Intensive use of HBASE
- Huge volumes
- Critical use cases
Lack of anticipation:
- Impact of those new use cases not assessed
- More than 1000 regions per Region Server
- HDFS saturation
- HBASE collapsed
[Figure: Availability of services in 2017 (%), by month from June to November, for Hive, HBase, HDFS and YARN. Y-axis from 95 to 101; plotted values range from 96 to 100.]
COMPELLING EVENT FOR CHANGE
| 10
ACTION PLAN
TECHNICAL POINTS
VENDOR SUPPORT
Expertise
- Diagnostic aid
- Improving the operational aspects of the infrastructure
- Metrics
- Tools
- Sizing / Scaling
- Improvement of application ops
Internal skill improvement
Communication
- Facilitate internal communication between teams
- Best practices
- Shared Diagnostics
- Facilitate dialogue with management
- Investment decisions
- Technical evolution decisions
RESOURCE ISOLATION / SHARING
Fine-grained management of YARN queues:
- Queues per Business Direction
- Sub-queues per application or use case?
- Overbooking? Pre-emption?
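The queue layout listed above can be illustrated with a short sketch. It assumes the YARN Capacity Scheduler (the slide does not name the scheduler in use) and generates the `capacity` and `maximum-capacity` properties for one queue per Business Direction. The queue names and shares are hypothetical, and setting `maximum-capacity` above `capacity` is only one possible reading of "overbooking": it lets a queue borrow idle resources from the others. Pre-emption is not covered here.

```python
from typing import Dict


def capacity_scheduler_properties(
    shares: Dict[str, int], max_capacity: int = 100
) -> Dict[str, str]:
    """Build Capacity Scheduler properties for queues directly under root.

    shares maps a queue name to its guaranteed share in percent.
    max_capacity is the ceiling a queue may reach by borrowing idle resources.
    """
    if sum(shares.values()) != 100:
        raise ValueError("guaranteed capacities under root must sum to 100")
    props = {"yarn.scheduler.capacity.root.queues": ",".join(shares)}
    for queue, share in shares.items():
        prefix = f"yarn.scheduler.capacity.root.{queue}"
        props[f"{prefix}.capacity"] = str(share)
        props[f"{prefix}.maximum-capacity"] = str(max_capacity)
    return props


# Hypothetical Business Directions and shares, for illustration only.
HYPOTHETICAL_SHARES = {"direction_a": 50, "direction_b": 30, "direction_c": 20}

for name, value in capacity_scheduler_properties(HYPOTHETICAL_SHARES, 80).items():
    print(f"{name}={value}")
```

Sub-queues per application would follow the same pattern one level down, with each parent queue's children again summing to 100.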
Resource isolation:
- Node labels
- Region Server groups
- Containerization of Elasticsearch/Solr
Resource protection mechanisms:
- HDFS Quotas
- HBASE Quotas
- Kafka Quotas
- ZooKeeper Observers
Evaluate multi-tenancy to match business requirements and
enterprise strategy.
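Of the protection mechanisms listed above, HDFS quotas are the simplest to illustrate. The sketch below only builds the `hdfs dfsadmin` command lines and does not run them. It assumes a per-tenant directory (the path shown is hypothetical) and a replication factor of 3. Because the HDFS space quota counts every replica of a block, the logical allowance is multiplied by the replication factor.

```python
from typing import List


def hdfs_quota_commands(
    directory: str, logical_tb: int, max_names: int, replication: int = 3
) -> List[str]:
    """Return the dfsadmin commands that set a name quota and a space quota.

    The space quota is expressed in raw terabytes, since each replica
    counts against it.
    """
    if logical_tb <= 0 or max_names <= 0 or replication <= 0:
        raise ValueError("all quota parameters must be positive")
    raw_tb = logical_tb * replication
    return [
        # Name quota: caps the number of files and directories (limits small files).
        f"hdfs dfsadmin -setQuota {max_names} {directory}",
        # Space quota: caps the raw bytes consumed, replicas included.
        f"hdfs dfsadmin -setSpaceQuota {raw_tb}t {directory}",
    ]


# Hypothetical tenant directory and limits, for illustration only.
for command in hdfs_quota_commands("/apps/tenant_a", logical_tb=10, max_names=1_000_000):
    print(command)
```

The name quota is also a natural lever against small files, which the billing changes later in the deck address from the pricing side.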
| 11
ACTION PLAN
CONFIGURATION & ARCHITECTURE
| 12
ACTION PLAN
SERVICES AVAILABILITY
| 13
ACTION PLAN
MONITORING – HBASE TABLE
| 14
ACTION PLAN
MONITORING – HBASE REGION SERVER
| 15
ACTION PLAN
MONITORING – KAFKA
| 16
ACTION PLAN
MONITORING – HBASE REPLICATION (EVOLUTION OF BAD ROWS)
| 17
ACTION PLAN
ORGANIZATION
COMMUNICATION
Build and run enterprise communities
- Templates / Git, …
- Feedback sharing
Improve cross-team communication
Share development guides
Create shared metrics
- Dashboards
- Health Check
- Application Status
| 18
ACTION PLAN
ORGANIZATION
MANAGEMENT
Reorganization of Application Management :
- Centralized team
- Global vision of jobs
- Centralization of application logs
- Tools
CAPACITY PLANNING
Set up an « industrial » capacity planning process
- Definition of the right metrics to predict
- Tools
- Regular meetings to review the plan
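One way to turn a metric into a prediction is a simple linear extrapolation. The sketch below is not the method described in the talk: it assumes daily samples of used HDFS space and fits a least-squares line to estimate how many days remain before a capacity threshold is reached. The sample values are made up; only the 1.4 PB of usable space comes from the deck.

```python
from typing import List, Optional


def days_until_full(used_tb_by_day: List[float], capacity_tb: float) -> Optional[float]:
    """Estimate the days left before usage reaches capacity.

    Fits a least-squares line through one sample per day and extrapolates.
    Returns None when usage is flat or decreasing.
    """
    n = len(used_tb_by_day)
    if n < 2:
        raise ValueError("need at least two daily samples")
    mean_x = (n - 1) / 2
    mean_y = sum(used_tb_by_day) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(used_tb_by_day))
    variance = sum((x - mean_x) ** 2 for x in range(n))
    slope = covariance / variance
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    day_full = (capacity_tb - intercept) / slope
    # Count from the most recent sample, never below zero.
    return max(0.0, day_full - (n - 1))


# Made-up daily usage samples in TB; 1400 TB is the usable HDFS space quoted earlier.
samples = [1000.0, 1004.0, 1009.0, 1013.0, 1018.0]
print(days_until_full(samples, capacity_tb=1400.0))
```

A straight line is a crude model given the exponential growth mentioned earlier, so a forecast like this is best treated as an optimistic upper bound to discuss in the regular review.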
SERVICE OFFER
Modification of Billing to reflect actual usage
- Memory and CPU Usage for HBASE
- Number of regions
- Markup for small files
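The billing change above can be sketched as a comparison between the original model (storage and vCores only) and a usage-based one that adds the three new items. All unit prices, field names and the sample tenant are hypothetical; the deck gives no figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TenantUsage:
    storage_tb: float
    vcores: int
    hbase_ram_gb: float
    hbase_regions: int
    small_files: int


@dataclass(frozen=True)
class Rates:
    # Hypothetical monthly unit prices, for illustration only.
    per_storage_tb: float = 20.0
    per_vcore: float = 5.0
    per_hbase_ram_gb: float = 1.0
    per_hbase_region: float = 0.5
    per_million_small_files: float = 10.0


def legacy_charge(usage: TenantUsage, rates: Rates = Rates()) -> float:
    """Original model: storage and CPU only."""
    return usage.storage_tb * rates.per_storage_tb + usage.vcores * rates.per_vcore


def usage_based_charge(usage: TenantUsage, rates: Rates = Rates()) -> float:
    """Revised model: adds HBase memory, region count and a small-file markup."""
    return (
        legacy_charge(usage, rates)
        + usage.hbase_ram_gb * rates.per_hbase_ram_gb
        + usage.hbase_regions * rates.per_hbase_region
        + usage.small_files / 1_000_000 * rates.per_million_small_files
    )


tenant = TenantUsage(storage_tb=10, vcores=8, hbase_ram_gb=64,
                     hbase_regions=500, small_files=2_000_000)
print(legacy_charge(tenant), usage_based_charge(tenant))
```

With these made-up rates, an HBase-heavy tenant pays more than twice as much under the revised model, which is the incentive effect the slide is after.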
MANAGE EXPECTATIONS & ANTICIPATE
REQUIREMENTS
| 19
ACTION PLAN
POSITIVE EFFECTS
[Figure: Availability of services (%), by month from June to February, for Hive, HBase, HDFS and YARN. Y-axis from 95 to 101; plotted values range from 96 to 100, and the values added for December to February are all 100 except for a single 99.8.]
| 20
THINGS TO TAKE AWAY
Service offer & multi-tenancy must be managed globally
Communication & coordination at all levels are essential
Get help!
| 21
THANK YOU!
QUESTIONS

