© 2015 Concurrent, Inc. All rights reserved.
ACHIEVING OPERATIONAL
READINESS ON HADOOP
9 BEST PRACTICES FOR THE ENTERPRISE
BY SUPREET OBEROI
VP of Field Engineering, Concurrent
Getting started with Hadoop is a little like buying a car 100 years ago.
Back then, only hobbyists and status seekers drove
their own vehicles, which were temperamental, to put
it mildly. To operate a car you pretty much had to be a
mechanic, because the only thing you could count on
was a breakdown. The dashboard showed little, if any,
actionable information. Only after that changed, and you
could reliably get from here to there without a wrench,
did automobiles really take off.
I draw the analogy to Hadoop because nearly every day
I hear from enterprise IT teams in industries like retail,
finance, health care, and insurance who thought Hadoop
was a Camry, and are learning that it’s more like whatever
predated the Model T. These teams were led to believe
they could replicate the business successes of Twitter,
LinkedIn, and Netflix simply by taking a shiny new cluster
for a spin, but now they’re struggling to deploy Hadoop
applications with the standards of quality, reliability, and
manageability that they have come to expect.
In short, Hadoop, out of the box, is not operationally ready.
In this paper, I’ll share 9 best practices for how IT
organizations can achieve operational readiness on
Hadoop. Of course, there are not yet formal certifications
or commonly accepted standards for overcoming the
many challenges. But there is an emerging consensus
around how Hadoop applications are best built, deployed,
and managed.
1. Build culture and tools supporting collaboration between developers, operators, & other Hadoop team members
Operators are not trained (and often can’t or don’t want to be trained) to read
Java stack traces and debug code. Likewise, asking a developer to address
performance problems is a little like asking a passenger to get out of your car and look
under the hood (assuming your passenger is not a mechanic).
Therefore, at least for the foreseeable future, optimizing performance on Hadoop is
a team proposition. In the best case, an operator who detects a problem can easily
collaborate with the developers, data scientists, and even business managers who have
stakes in the application running smoothly. They should all see operational readiness as
a shared responsibility, and be armed with tools that show them the complete picture
when remediation is required.
2. Connect execution problems with application context
At newer companies, especially those in the San Francisco Bay Area, I find that one
person typically handles all the work — the data science, the development, and the
deployment to production — around a big data application. If there are problems when
the app runs, that same person can usually fix them. After all, she wrote it.
For big, traditional enterprises, it’s a different story. The operations team running fraud
and risk detection apps on Hadoop might live in Phoenix, while the team that developed
them sits ten time zones away in India. In some cases, the operations team today is
completely different from the one that first deployed an application.
Therefore, when a Hadoop job fails or takes too long to execute, operators should be
able to quickly link problems not only to the application that caused them, but also to
the relevant data flow logic inside the application. It’s also great if the operator can
immediately see detailed mapper/reducer stats tied to the problem. That way, they can
more quickly understand if the problem is with the code, the data or the hardware.
[Figure: When a performance problem arises, operators should be able to investigate app logic and cluster usage.]
3. Monitor the fleet, not the vehicle
To an operator running hundreds or thousands of applications on a Hadoop cluster,
all of them look the same until there’s a problem. So you need tools that let you
look at performance over groups of applications. Ideally, you should be able to
segment performance tracking by application type, department, team, and
data-sensitivity level.
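As one way to picture fleet-level monitoring, here is a minimal Python sketch that rolls up per-run metrics by any grouping attribute. The `AppRun` record, its field names, and the sample applications are all assumptions made for illustration; they are not taken from Hadoop or from any particular monitoring product.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class AppRun:
    """One finished application run; every field name here is hypothetical."""
    app: str
    team: str
    department: str
    runtime_s: float
    failed: bool


def summarize_by(runs, key):
    """Group runs by the given attribute and report per-group health."""
    groups = defaultdict(list)
    for run in runs:
        groups[getattr(run, key)].append(run)
    return {
        name: {
            "runs": len(members),
            "mean_runtime_s": mean(r.runtime_s for r in members),
            "failure_rate": sum(r.failed for r in members) / len(members),
        }
        for name, members in groups.items()
    }


# Made-up sample data, purely for illustration.
runs = [
    AppRun("fraud-scoring", "risk", "finance", 1200.0, False),
    AppRun("fraud-features", "risk", "finance", 1800.0, True),
    AppRun("recs-nightly", "growth", "retail", 600.0, False),
]
by_team = summarize_by(runs, "team")
by_department = summarize_by(runs, "department")
```

The same function serves every segmentation (team, department, or a data-sensitivity tag if the record carried one), which is the point of watching the fleet rather than each vehicle.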
4. Define and enforce service level agreements (yes, even on Hadoop)
Monitoring a fleet still means knowing when an individual vehicle performs poorly.
Similarly, operators need to set SLA bounds on performance and define alerts and
escalation paths for when those bounds are violated.
Every business is unique, so there’s no set list of performance metrics to monitor. But it
certainly means looking at more than what you can see in log files.
SLA bounds should incorporate both raw metadata, such as job status, and
business-level events, such as sensitive data access. Successful practitioners of operational
readiness also set up metrics that help predict future SLA violations, so they can
proactively address and avoid them.
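To make the idea of an SLA bound concrete, the sketch below checks one run against a runtime limit, its job status, and a business-level event, and raises an early warning before the limit is actually crossed. The `Sla` class, the 80% warning threshold, the alert wording, and the `"SUCCEEDED"` status string are assumptions for illustration, not a prescribed policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sla:
    """Hypothetical SLA for one application."""
    app: str
    max_runtime_s: float
    warn_fraction: float = 0.8  # assumed policy: warn at 80% of the bound


def check_sla(sla, status, runtime_s, touched_sensitive_data=False):
    """Return a list of alert strings for one run; an empty list means healthy."""
    alerts = []
    if status != "SUCCEEDED":
        alerts.append(f"{sla.app}: job status {status}")
    if runtime_s > sla.max_runtime_s:
        alerts.append(f"{sla.app}: runtime SLA violated")
    elif runtime_s > sla.warn_fraction * sla.max_runtime_s:
        # Early warning: the run is trending toward a violation.
        alerts.append(f"{sla.app}: runtime approaching SLA bound")
    if touched_sensitive_data:
        alerts.append(f"{sla.app}: sensitive data accessed")
    return alerts


sla = Sla("fraud-scoring", max_runtime_s=3600.0)
alerts = check_sla(sla, "SUCCEEDED", 3000.0)
```

In practice each alert would feed an escalation path; here they are plain strings so the logic stays visible.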
5. Understand inter-app dependencies
At technology companies blessed with enough capital to run a dedicated Hadoop
cluster for every use case, applications run more or less independently. That’s not the
case, however, at larger, more traditional enterprises, which tend to run their clusters
as a shared service across lines of business. As a result, each application has at least
a few “roommates” in the cluster, some of which can be noisy, disruptive, or otherwise
detrimental to its own performance.
In other words, to understand what’s behind the errant behavior of one Hadoop application,
you first have to understand what others were doing on the cluster when it
ran. Did a rogue app hog resources, causing others to perform poorly? Was the poor
performance of one application actually due to its dependency on data from some other
application, upstream, that failed to operate as expected?
Provide your operations team with as much cluster-related context as you can. For
example, just by tracking cluster usage by application, you’ll more quickly understand
when an SLA violation is really about a rogue app, rather than a problem in the one that
triggered the alert.
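Tracking cluster usage by application can be sketched in a few lines. The example below assumes usage samples arrive as (application, containers in use) pairs collected during the window in which an alert fired; the sample format, the 50% threshold, and the application names are hypothetical.

```python
def usage_share(samples):
    """samples: iterable of (app, containers_in_use) pairs from one time window."""
    totals = {}
    for app, containers in samples:
        totals[app] = totals.get(app, 0) + containers
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {app: used / grand_total for app, used in totals.items()}


def rogue_apps(samples, alerting_app, threshold=0.5):
    """Apps other than the alerting one that held more than `threshold` of usage."""
    shares = usage_share(samples)
    return sorted(
        app for app, share in shares.items()
        if app != alerting_app and share > threshold
    )


# Made-up samples taken while "fraud-scoring" was missing its SLA.
samples = [
    ("recs-nightly", 600),
    ("fraud-scoring", 50),
    ("recs-nightly", 650),
    ("etl-daily", 100),
]
suspects = rogue_apps(samples, alerting_app="fraud-scoring")
```

Here the alert fired on `fraud-scoring`, but the usage shares point at a roommate instead.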
6. Ration your cluster
To optimize cluster usage and ROI, operators must ration resources on the cluster and
enforce the limits.
For example, an operator can budget 10,000 mappers for the execution of a particular
application. Then, the onus is on the application to do two things: comply with the
budget restriction, and then demonstrate that compliance. Lacking such proof, rationing
rules should prevent the application from being deployed on the cluster. After all, the
application is not trustworthy.
Establishing and enforcing the rules for rationing cluster resources is vital for achieving a
meaningful state of operational readiness and meeting SLA contracts. You may have to
handle unusual edge cases. For example, is it acceptable for a recommender engine to
meet its SLA contract in terms of spitting out recommendations while totally consuming a
700-node cluster for the duration of its execution? (I saw this happen in real life!)
[Figure: Tracking applications that consume more than 10,000 mappers]
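The mapper budget described above can be expressed as a simple deployment gate. This is a sketch under stated assumptions: `observed_mappers` stands for peak mapper counts recorded in earlier test or staging runs, and the `admit` function and its messages are hypothetical rather than part of any Hadoop scheduler.

```python
def within_budget(observed_mappers, budget):
    """True only if there is evidence and every observed run stayed within budget."""
    return bool(observed_mappers) and max(observed_mappers) <= budget


def admit(app, observed_mappers, budget=10_000):
    """Deployment gate: no proof of compliance means no deployment."""
    if not observed_mappers:
        return (False, f"{app}: no evidence of mapper usage; rejected")
    if not within_budget(observed_mappers, budget):
        peak = max(observed_mappers)
        return (False, f"{app}: peak {peak} mappers exceeds budget {budget}")
    return (True, f"{app}: admitted")


# Hypothetical applications and run histories.
decision_recs = admit("recs-nightly", [8200, 12500])
decision_fraud = admit("fraud-scoring", [4000, 9500])
decision_new = admit("new-app", [])
```

Note that an application with no run history is rejected rather than waved through, which mirrors the rule that the burden of proof sits with the application.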
7. Trace data access at the operational level
Good Hadoop management isn’t only about rationing compute resources; it also means
regulating access to sensitive data. This is especially true in industries with heightened
privacy concerns, such as financial services, health care, insurance, and even, these days,
social media.
For example, a data scientist may develop a vastly improved new model for reducing
lending risk, but unless the enterprise can prove that the application does not use any
private data, it cannot deploy the application to production.
Solving for data lineage and governance in an unstructured environment like Hadoop is
no easy task. Traditional techniques that rely on manually maintaining a metadata dictionary
quickly lead to stale repositories. In addition, there is no proof that the model
deployed in production is using the fields described in the metadata repository.
What is required is visibility and enforcement at the operational level on the use of data
fields. If you can track if and when a data field is accessed by an app, you have the
evidence you need to make that case.
[Figure: Tracking the lineage of data fields at an operational level]
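A rough sketch of field-level access tracking follows. It assumes the runtime can report each field an application reads; the `FieldAccessLog` class, the field names, and the application names are all hypothetical and stand in for whatever operational telemetry is actually available.

```python
class FieldAccessLog:
    """Records which fields each app reads; a stand-in for operational telemetry."""

    def __init__(self, sensitive_fields):
        self.sensitive_fields = frozenset(sensitive_fields)
        self.accessed = {}  # app name -> set of field names read

    def record(self, app, field):
        self.accessed.setdefault(app, set()).add(field)

    def sensitive_fields_used(self, app):
        return sorted(self.accessed.get(app, set()) & self.sensitive_fields)

    def may_deploy(self, app):
        # An app with no recorded accesses has not proven anything.
        return app in self.accessed and not self.sensitive_fields_used(app)


log = FieldAccessLog(sensitive_fields={"ssn", "date_of_birth"})
log.record("lending-risk-v2", "income")
log.record("lending-risk-v2", "ssn")
log.record("churn-model", "tenure_months")
```

Because the log is built from what the application actually read, it cannot drift out of date the way a hand-maintained dictionary can.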
8. Record data misfires
Compliance folks at large enterprises also want proof that a Hadoop application
processed every record in a dataset, and they look for documentation when it fails
to do so. Failures can result from format changes in upstream data sets or plain old
data corruption. Keeping track of all records that the application failed to process is
particularly vital in regulated industries.
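One simple way to keep that record is to trap failures instead of dropping them, so that processed and failed records always add up to the input. The sketch below is illustrative only: the comma-separated record format, the `parse_amount` function, and the shape of a misfire entry are assumptions.

```python
def process_all(records, parse):
    """Apply `parse` to every record, trapping failures instead of dropping them."""
    processed, misfires = [], []
    for index, raw in enumerate(records):
        try:
            processed.append(parse(raw))
        except (ValueError, KeyError) as exc:
            # Keep the raw record and the reason so the failure can be audited.
            misfires.append({"index": index, "raw": raw, "reason": repr(exc)})
    return processed, misfires


def parse_amount(raw):
    """Hypothetical parser for 'account,amount' lines."""
    account, amount = raw.split(",")
    return account, float(amount)


# One good record, one corrupt value, one record with a changed format.
records = ["a1,10.5", "a2,not-a-number", "a3"]
processed, misfires = process_all(records, parse_amount)
```

The invariant worth reporting to compliance is that `len(processed) + len(misfires)` equals `len(records)`: every record is accounted for, one way or the other.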
9. Tune your engine before you replace it
With new compute fabrics emerging all the time, teams are sometimes too quick to junk
their old ones in pursuit of better performance. However, it’s very often the case that
you can achieve equal or greater performance gains just by optimizing code and data
flows on your existing fabrics. That way you can avoid expensive infrastructure upgrades
unless they’re truly necessary.
Operational readiness on Hadoop: You can get there from here
When Henry Ford launched the Model T, it was rugged and robust, the first
car fit for a broader market. The car was not only affordable to buy; it was
also practical to own and operate.
Perhaps Hadoop will get there, too. Until then, there’s a lot you can do to
avoid getting left on the side of the big data road.
If you’re looking for help with operational readiness on Hadoop, or are
curious about the charts and displays I’ve shown here, get in touch with me
at sales@concurrentinc.com or visit concurrentinc.com.
ABOUT SUPREET OBEROI
Supreet is the Vice President of Field
Engineering at Concurrent. Prior to that,
he was Director of Big Data application
infrastructure for American Express, where
he led the development of use cases for
fraud, operational risk, marketing and
privacy on Big Data platforms. He holds
multiple patents in data engineering and
has held leadership positions at Real-Time
Innovations, Oracle, and Microsoft.
@supreet_online
www.concurrentinc.com
sales@concurrentinc.com
