PRODUCTIONIZING HADOOP
New Lessons Learned
Eric Sammer
General Announcements

• All lines are muted
• Ask questions any time using the “questions”
  pane on your GoToWebinar panel
• Recording of this webinar will be available on
  demand at www.cloudera.com
The Universe of Operations


System Operations       Architecture and App Ops
• Server and network    • Data architecture
• Operating system      • Data integration
• Identity and access   • Data quality monitoring
• Resource management   • Resource management
• Maintenance           • Pipeline maintenance
• Cluster monitoring    • Governance
• Backup and DR
Scope for Today

• A focus on common stumbling blocks
   • Workload-oriented planning and identification
   • Network architecture
   • Host management
   • Configuration management
   • Identity, Access, and Authorization
   • Cluster and resource sharing
• Time for questions
Proper Planning

• Develop an understanding of your use cases
   • What you (will) do defines what you need
   • Analogy: OLTP RDBMS versus OLAP
• Prototype if necessary
Understanding Cluster Usage
…by use case

[Figure: cluster usage by use case. Labeled workloads: Data Mining / IR, ETL, Report Generation, Analytics.]

• Network utilization is a function of job size, its profile, and the number of concurrent jobs
Network Architecture

• Your current architecture is probably fine
   • Typical: traditional L2 tree (fine for North/South)
   • Emerging: L3 spine/leaf (optimized for East/West)
• Minimize oversubscription (normal: 1:1.2)
• Deep port buffers (with fair allocation for shared memory)

• Do not collocate low-latency apps with MR
• Monitor, monitor, monitor
   • Bandwidth, buffer, packet count, and size deciles
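
The oversubscription bullet above can be made concrete with a little arithmetic. The sketch below is illustrative only: the switch layout (48 x 1GbE host ports, 4 x 10GbE uplinks) is a hypothetical example, not a configuration taken from the slides, and the function name is made up for this note.

```python
def oversubscription_ratio(host_ports: int, host_port_gbps: float,
                           uplink_ports: int, uplink_port_gbps: float) -> float:
    """Return host-facing bandwidth divided by uplink bandwidth for one switch."""
    downstream = host_ports * host_port_gbps
    upstream = uplink_ports * uplink_port_gbps
    if upstream <= 0:
        raise ValueError("uplink capacity must be positive")
    return downstream / upstream


# Hypothetical top-of-rack switch: 48 x 1GbE host ports, 4 x 10GbE uplinks.
ratio = oversubscription_ratio(48, 1.0, 4, 10.0)
print(f"oversubscription: {ratio:.2f}:1")
```

With these assumed numbers the hosts can offer 48 Gb/s against 40 Gb/s of uplink, a ratio of 1.2. A ratio of 1.0 would mean no oversubscription at all.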
Host Configuration

• OS version and patches
• Java 6 (HotSpot VM)
• PAM limits (nofile, nproc)
• Naming (nsswitch.conf, resolv.conf, hosts, gethostname())

• OS filesystem selection and tuning
• Time service
• Users, groups, and identity management
• Machines should not be unique snowflakes
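
Two items on this slide, PAM limits and naming, are easy to spot-check from a script run on each host. The following is a minimal sketch using only the Python standard library on a Unix-like system. The minimum values are hypothetical placeholders, not recommendations from the talk, and the limits reported are those of the process running the script, which only reflect PAM settings if it was started from a session where they were applied.

```python
import resource
import socket

# Hypothetical minimums for illustration; pick values suited to your cluster.
MIN_NOFILE = 32768
MIN_NPROC = 32768


def meets(limit: int, minimum: int) -> bool:
    """True if a soft limit is unlimited or at least the given minimum."""
    return limit == resource.RLIM_INFINITY or limit >= minimum


def check_limits() -> dict:
    """Compare this process's soft nofile/nproc limits with the minimums."""
    nofile_soft, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
    nproc_soft, _ = resource.getrlimit(resource.RLIMIT_NPROC)
    return {
        "nofile": (nofile_soft, meets(nofile_soft, MIN_NOFILE)),
        "nproc": (nproc_soft, meets(nproc_soft, MIN_NPROC)),
    }


def check_naming() -> dict:
    """Resolve the local hostname forward, then reverse, and report both."""
    hostname = socket.gethostname()
    try:
        address = socket.gethostbyname(hostname)
        reverse_name = socket.gethostbyaddr(address)[0]
    except OSError:  # covers both forward and reverse lookup failures
        address, reverse_name = None, None
    return {
        "hostname": hostname,
        "address": address,
        "reverse_name": reverse_name,
        # Loopback answers usually point at a hosts-file entry worth reviewing.
        "loopback": address is not None and address.startswith("127."),
    }
```

Calling `check_limits()` and `check_naming()` on every machine and comparing the results is one way to catch hosts that have drifted into being unique snowflakes.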
Configuration Management

• Puppet/Chef/<your favorite> for OS config
   • Package installation
   • Identity and authorization wiring
• Cloudera Manager for platform management
   • Deployment and configuration
   • Service lifecycle
   • Platform-specific service monitoring and diagnostics
   • Activity monitoring
• Complementary systems
   • Differentiating factors: centralized
     coordination, service awareness, orchestration
Identity, Access, and Authorization

• MapReduce is a code execution engine
• Identity management and access control are hard
  (in distributed systems like Hadoop)
• Hadoop uses the OS (or Kerberos) for identity
   • Lots of entry points
   • Comparatively low level
• Access control is a function of each service
   • HDFS: Unix-style octal permissions on objects
   • MapReduce: ACLs on job queues
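
To illustrate what "Unix-style octal permissions" means in practice, here is a simplified model of the owner/group/other check. It is a sketch of the general Unix scheme, not HDFS's actual implementation, and the user, group, and mode values are hypothetical.

```python
READ, WRITE, EXECUTE = 4, 2, 1


def allowed(mode: int, owner: str, group: str,
            user: str, user_groups: set, wanted: int) -> bool:
    """Unix-style check: use the owner bits if the user owns the object,
    otherwise the group bits if the user is in the object's group,
    otherwise the 'other' bits."""
    if user == owner:
        bits = (mode >> 6) & 0o7
    elif group in user_groups:
        bits = (mode >> 3) & 0o7
    else:
        bits = mode & 0o7
    return (bits & wanted) == wanted


# Hypothetical object owned by user "etl", group "analysts", mode 750.
mode = 0o750
print(allowed(mode, "etl", "analysts", "etl", {"etl"}, WRITE))         # owner may write
print(allowed(mode, "etl", "analysts", "alice", {"analysts"}, READ))   # group may read
print(allowed(mode, "etl", "analysts", "alice", {"analysts"}, WRITE))  # group may not write
print(allowed(mode, "etl", "analysts", "bob", {"web"}, READ))          # others have no access
```

The check is only as trustworthy as the identity it is given, which is why the slide stresses where Hadoop gets identity from (the OS or Kerberos).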
Resource Sharing

• One cluster, many groups
• Pros
   • Benefit from aggregate resources
   • Greater utilization
   • Reduced CapEx/OpEx
Resource Sharing
• Three dimensions of sharing a cluster
    • Collocation of services (e.g. MapReduce and HBase)
    • Collocation of groups of users
    • Collocation of workload profiles (ETL, analytics)
• In an ideal world, collocate all and enforce policy
    • Not currently possible
• Problems
    • System utilization varies wildly
    • Fair distribution of shared resources
    • Increased access control complexity
    • SLA of most sensitive group applies to all
    • …but nothing new
Resource Sharing
• Reasons to collocate groups / applications:
   • Similar system utilization profiles
   • Time-based utilization (e.g. daily ETL and office-hours
      analytics)
   • Maintain similar SLAs
   • Extensive data sharing
   • When it’s trivially easy with current control mechanisms
• Reasons to segregate groups / applications:
   • Compliance, regulation, or where security is paramount
   • Wildly dissimilar utilization profiles (notably HBase and
      MapReduce)
• A significant area of interest for Cloudera
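
The time-based utilization argument can be checked with a rough calculation before deciding to collocate two groups. The sketch below sums hypothetical hourly utilization profiles, expressed as a percentage of total cluster capacity, and reports the combined peak. The profile numbers are invented for illustration.

```python
def combined_peak(*profiles):
    """Return the highest hourly total across workloads sharing one cluster."""
    hours = len(profiles[0])
    if any(len(p) != hours for p in profiles):
        raise ValueError("all profiles must cover the same number of hours")
    return max(sum(hour) for hour in zip(*profiles))


# Hypothetical utilization per hour of day, as a percentage of cluster capacity.
etl = [70] * 6 + [10] * 18                       # nightly ETL, 00:00 to 06:00
office_analytics = [5] * 9 + [60] * 9 + [5] * 6  # analysts, 09:00 to 18:00
night_analytics = [60] * 9 + [5] * 15            # same load, but run overnight

print(combined_peak(etl, office_analytics))  # peaks do not overlap
print(combined_peak(etl, night_analytics))   # peaks collide
```

With these assumed profiles the first pairing peaks at 75% of capacity, while the second peaks at 130%, which is more than the cluster has. Real profiles would come from monitoring data rather than guesses.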
Now What?

• There’s a lot (more) to think about
• We can help
   • Education
   • Services
   • Software
   • Support
• Strata + Hadoop World 2012
• Look for upcoming webinars
Questions?
Type them in the “Questions” panel.

Congratulations to the winners
of the book drawing!
• Vani Mahobia
• Ken Gayler
• Richard Zhang
• Anand Rajan
• Erica Muxlow
Questions?
Type them in the “Questions” panel.



To learn more about Hadoop Operations: A Guide for
Developers and Administrators, or about the
spotted cavy, go to www.oreilly.com
THANK YOU!
Eric Sammer, Principal Solutions Architect
@esammer
For more information: www.cloudera.com
Sales: (888)789-1488
@cloudera
Hardware Planning

• CPU
• Disk capacity and configuration
• Spindle count
• Memory (amount and configuration)
• NIC configuration
• Hadoop’s hardware preferences tend to be
 controversial until the architecture is understood
Baseline Hardware

• Disk
   • SATA II 7200RPM (SAS controller)
   • JBOD (OS on RAID 1)
   • Option 1: 12x3.5” LFF 3TB
   • Option 2: 24x2.5” SFF 1TB
   • Option: MDL/NL SAS drives
• CPU: 2x 2.2GHz 6-core, 20MB cache
• Memory: 48GB+ DDR3-1600 ECC
• Network: 1GbE vs. 10GbE
   • Is there new info here?
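
A quick way to compare the two disk options above is to work out spindle count, raw capacity, and a rough usable figure per node. This is a back-of-the-envelope sketch. The replication factor of 3 is the HDFS default, while the 25% reserve for temporary and intermediate data is a hypothetical assumption, not a number from the talk.

```python
HDFS_REPLICATION = 3     # HDFS default replication factor
NON_HDFS_RESERVE = 0.25  # hypothetical share held back for temp/intermediate data


def raw_tb(drives: int, tb_per_drive: float) -> float:
    """Raw disk capacity of one node in TB."""
    return drives * tb_per_drive


def usable_tb(drives: int, tb_per_drive: float,
              replication: int = HDFS_REPLICATION,
              reserve: float = NON_HDFS_RESERVE) -> float:
    """Rough unique-data capacity per node after the reserve and replication."""
    return raw_tb(drives, tb_per_drive) * (1 - reserve) / replication


options = {
    "Option 1 (12 x 3TB LFF)": (12, 3.0),
    "Option 2 (24 x 1TB SFF)": (24, 1.0),
}
for name, (drives, size) in options.items():
    print(f"{name}: {drives} spindles, {raw_tb(drives, size):.0f} TB raw, "
          f"{usable_tb(drives, size):.1f} TB usable per node")
```

Under these assumptions Option 1 offers more capacity per node (36 TB raw) while Option 2 offers twice the spindles (24 versus 12), which is the trade-off between the capacity and spindle-count bullets on the Hardware Planning slide.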

Editor's Notes

  • #2: INTERNAL NOTES – DELETE BEFORE POSTING! Set expectation that this is targeted to a relatively beginner audience? What’s new? What are the NEW lessons learned? An example war story to start it off would help the audience get into it. Scope? Core Hadoop (MR & HDFS) vs. the entire CDH stack (Hive, ZK, HBase, etc.) and how do they co-locate deployment-wise, i.e. do I need separate HW to run other components? (MapR depositioning): Mention: HA, performance, DR, data integrity, federation, MR2,
  • #3: SCRIPT for Zoo/Moderator (go through this as quickly as you can). Before we get started I’d like to let you know that all lines are muted. Ask questions any time by typing them into the QUESTIONS pane on your GoToWebinar panel. This webinar is being recorded and will be available later at cloudera.com. Let me pass you to Eric Sammer, who is a Principal Solutions Architect at Cloudera and author of the recently published book “Hadoop Operations” by O’Reilly Media.
  • #10: Do I need a dedicated rack/network for Hadoop? Or can I run other apps and services on the same rack/network?
  • #12: Why not use Puppet/Chef for Hadoop config as well? Why is CM better? If I use Puppet/Chef for ALL my config mgmt (systems & apps), why use a point solution like CM for Hadoop?
  • #18: SCRIPT Zoo/moderator (speak fast): Thank you Eric. Let’s now move quickly into the Q&A portion of this webinar. Please type your questions into the QUESTIONS PANEL and we’ll get to as many questions as we have time for. While Eric is reviewing the questions I’d like to congratulate the winners of the book drawing. If you see your name listed here, your book will be mailed to you by the last week of October. It’s being printed now, so when you receive it it’ll be “hot off the press”. MOVE TO NEXT SLIDE – get winners’ names off the screen
  • #19: SCRIPT Zoo/moderator (speak fast): Eric, are you ready to answer some questions? MOVE TO THANK YOU SLIDE WHILE CLOSING