Productionizing Hadoop - New Lessons Learned

PRODUCTIONIZING HADOOP
New Lessons Learned
Eric Sammer

General Announcements

• All lines are muted
• Ask questions any time using the “questions” pane on your GoToWebinar panel
• Recording of this webinar will be available on demand at www.cloudera.com

The Universe of Operations

System Operations
• Server and network
• Operating system
• Identity and access
• Resource management
• Maintenance
• Cluster monitoring
• Backup and DR

Architecture and App Ops
• Data architecture
• Data integration
• Data quality monitoring
• Resource management
• Pipeline maintenance
• Governance

Scope for Today

• A focus on common stumbling blocks
• Workload-oriented planning and identification
• Network architecture
• Host management
• Configuration management
• Identity, Access, and Authorization
• Cluster and resource sharing
• Time for questions

Proper Planning

• Develop an understanding of your use cases
• What you (will) do defines what you need
• Analogy: OLTP RDBMS versus OLAP
• Prototype if necessary

…by use case

• Data Mining / IR
• ETL
• Report Generation
• Analytics

Network utilization is a function of job size, its profile, and the number of concurrent jobs

Network Architecture

• Your current architecture is probably fine
• Typical: traditional L2 tree (fine for North/South)
• Emerging: L3 spine/leaf (optimized for East/West)
• Minimize oversubscription (normal: 1:1.2)
• Deep port buffers (with fair allocation for shared memory)

• Do not collocate low-latency apps with MR
• Monitor, monitor, monitor
• Bandwidth, buffer, packet count, and size deciles
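
As a rough illustration of the oversubscription and monitoring bullets above, the sketch below computes a switch's oversubscription ratio (host-facing bandwidth divided by uplink bandwidth) and the deciles of a packet-size sample. The port counts, link speeds, and sample values are made-up assumptions, not figures from the talk.

```python
import statistics


def oversubscription_ratio(host_ports, host_gbps, uplink_ports, uplink_gbps):
    """Return host-facing bandwidth divided by uplink bandwidth."""
    downstream = host_ports * host_gbps
    upstream = uplink_ports * uplink_gbps
    return downstream / upstream


def size_deciles(packet_sizes):
    """Return the nine cut points that split the sample into ten groups."""
    return statistics.quantiles(packet_sizes, n=10)


# Hypothetical top-of-rack switch: 48 x 1GbE host ports, 4 x 10GbE uplinks.
ratio = oversubscription_ratio(48, 1, 4, 10)
print(f"oversubscription {ratio:.1f}:1")

# Hypothetical packet sizes in bytes, e.g. sampled from a monitoring tool.
sizes = [64, 64, 128, 256, 512, 1024, 1500, 1500, 1500, 1500, 1500]
print(size_deciles(sizes))
```

Tracking the ratio per switch tier, alongside the size deciles over time, gives a baseline to compare against when jobs start to contend for the network.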

Host Configuration

• OS version and patches
• Java 6 (HotSpot VM)

• PAM limits (nofile, nproc)

• Naming (nsswitch.conf, resolv.conf, hosts, gethostname())

• OS filesystem selection and tuning
• Time service
• Users, groups, and identity management
• Machines should not be unique snowflakes
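
One way to keep hosts from drifting into snowflakes is to run the same small check on every machine. The sketch below, using only the Python standard library, inspects the soft nofile and nproc limits of the current process and checks that the hostname resolves forward and back. The minimum of 32768 is an assumed placeholder, not a value recommended in the talk; it should be run as the user that starts the Hadoop daemons so the limits reflect that user's PAM configuration.

```python
import resource
import socket


def check_limits(min_nofile=32768, min_nproc=32768):
    """Compare this process's soft limits against assumed minimums."""
    results = {}
    for name, res, minimum in (
        ("nofile", resource.RLIMIT_NOFILE, min_nofile),
        ("nproc", resource.RLIMIT_NPROC, min_nproc),
    ):
        soft, _hard = resource.getrlimit(res)
        ok = soft == resource.RLIM_INFINITY or soft >= minimum
        results[name] = (soft, ok)
    return results


def check_naming():
    """Resolve the hostname to an address, then the address back to a name."""
    hostname = socket.gethostname()
    try:
        address = socket.gethostbyname(hostname)
        reverse_name = socket.gethostbyaddr(address)[0]
    except OSError as exc:
        return hostname, None, f"lookup failed: {exc}"
    return hostname, address, reverse_name


if __name__ == "__main__":
    for name, (soft, ok) in check_limits().items():
        print(f"{name}: soft={soft} {'ok' if ok else 'too low'}")
    print("hostname, address, reverse:", check_naming())
```

A mismatch between the hostname and the reverse name is worth investigating, since the hosts file, resolv.conf, and nsswitch.conf all influence the result.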

Configuration Management

• Puppet/Chef/<your favorite> for OS config
• Package installation
• Identity and authorization wiring
• Cloudera Manager for platform management
• Deployment and configuration
• Service lifecycle
• Platform-specific service monitoring and diagnostics
• Activity monitoring
• Complementary systems
• Differentiating factors: centralized coordination, service awareness, orchestration

Identity, Access, and Authorization

• MapReduce is a code execution engine
• Identity management and access control are hard (in distributed systems like Hadoop)
• Hadoop uses the OS (or Kerberos) for identity
• Lots of entry points
• Comparatively low level
• Access control is a function of each service
• HDFS: Unix-style octal permissions on objects
• MapReduce: ACLs on job queues

Resource Sharing

• One cluster, many groups
• Pros
• Benefit from aggregate resources
• Greater utilization
• Reduced CapEx/OpEx

Resource Sharing
• Three dimensions of sharing a cluster
• Collocation of services (e.g. MapReduce and HBase)
• Collocation of groups of users
• Collocation of workload profiles (ETL, analytics)
• In an ideal world, collocate all and enforce policy
• Not currently possible
• Problems
• System utilization varies wildly
• Fair distribution of shared resources
• Increased access control complexity
• SLA of most sensitive group applies to all
• …but nothing new

Resource Sharing
• Reasons to collocate groups / applications:
• Similar system utilization profiles
• Time-based utilization (e.g. daily ETL and office-hours analytics)
• Maintain similar SLAs
• Extensive data sharing
• When it’s trivially easy with current control mechanisms
• Reasons to segregate groups / applications:
• Compliance, regulation, or where security is paramount
• Wildly dissimilar utilization profiles (notably HBase and MapReduce)
• A significant area of interest for Cloudera
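
The time-based utilization argument can be sketched with a toy calculation: if two groups peak at different hours, their combined peak stays below cluster capacity. The hourly profiles below are invented for illustration (fractions of total cluster capacity) and are not measurements from any real cluster.

```python
def combined_peak(profile_a, profile_b):
    """Return the highest combined hourly utilization of two workloads."""
    return max(a + b for a, b in zip(profile_a, profile_b))


# Hypothetical profiles: nightly ETL and office-hours analytics.
etl = [0.8 if hour < 6 else 0.1 for hour in range(24)]
analytics = [0.7 if 9 <= hour < 17 else 0.05 for hour in range(24)]

peak = combined_peak(etl, analytics)
print(f"combined peak: {peak:.2f}")
print("fits on one cluster" if peak <= 1.0 else "over capacity at peak")
```

Two workloads that both peak during office hours would push the combined figure past 1.0, which is the "wildly dissimilar" or contending case that argues for segregation.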

Now What?

• There’s a lot (more) to think about
• We can help
• Education
• Services
• Software
• Support
• Strata + Hadoop World 2012
• Look for upcoming webinars

Questions?
Type them in the “Questions” panel.

Congratulations to the winners
of the book drawing!
• Vani Mahobia
• Ken Gayler
• Richard Zhang
• Anand Rajan
• Erica Muxlow

To learn more about Hadoop Operations: A Guide for Developers and Administrators, or about the spotted cavy, go to www.oreilly.com

THANK YOU!
Eric Sammer, Principal Solutions Architect
@esammer
For more information: www.cloudera.com
Sales: (888)789-1488
@cloudera

Hardware Planning

• CPU
• Disk capacity and configuration
• Spindle count
• Memory (amount and configuration)
• NIC configuration
• Hadoop’s hardware preferences tend to be controversial until the architecture is understood

Baseline Hardware

• Disk
• SATA II 7200RPM (SAS controller)
• JBOD (OS on RAID 1)
• Option 1: 12x 3.5” LFF 3TB
• Option 2: 24x 2.5” SFF 1TB
• Option: MDL/NL SAS drives
• CPU: 2x 2.2GHz 6-core, 20MB cache
• Memory: 48GB+ DDR3-1600 ECC
• Network: 1GbE vs. 10GbE
• Is there new info here?
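
For a sense of scale, the two disk options can be compared with a back-of-the-envelope capacity estimate per node. The sketch assumes HDFS's default replication factor of 3 and reserves 25% of raw space for non-HDFS use such as intermediate job data; the 25% figure is an illustrative assumption, not a number from the slides.

```python
def usable_tb(disks, tb_per_disk, replication=3, reserved_fraction=0.25):
    """Estimate unique data (TB) one node can hold after replication."""
    raw = disks * tb_per_disk
    return raw * (1 - reserved_fraction) / replication


# Disk options from the slide above.
options = {
    "Option 1 (12 x 3TB LFF)": (12, 3),
    "Option 2 (24 x 1TB SFF)": (24, 1),
}

for label, (disks, size_tb) in options.items():
    raw = disks * size_tb
    print(f"{label}: raw {raw} TB, usable ~{usable_tb(disks, size_tb):.1f} TB")
```

Option 1 offers more capacity per node while Option 2 offers twice the spindle count, which is the trade-off between storage-heavy and I/O-heavy workloads.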
