whitepaper_9bestpractices

© 2015 Concurrent, Inc. All rights reserved.
ACHIEVING OPERATIONAL
READINESS ON HADOOP
9 BEST PRACTICES FOR THE ENTERPRISE
BY SUPREET OBEROI
VP of Field Engineering, Concurrent

Getting started with Hadoop is a little like buying a car 100 years ago.
Back then, only hobbyists and status seekers drove
their own vehicles, which were temperamental, to put
it mildly. To operate a car you pretty much had to be a
mechanic, because the only thing you could count on
was a breakdown. The dashboard showed little, if any,
actionable information. Only after that changed, and you
could reliably get from here to there without a wrench,
did automobiles really take off.
I draw the analogy to Hadoop because nearly every day
I hear from enterprise IT teams in industries like retail,
finance, health care, and insurance who thought Hadoop
was a Camry, and are learning that it’s more like whatever
predated the Model T. These teams were led to believe
they could replicate the business successes of Twitter,
LinkedIn, and Netflix simply by taking a shiny new cluster
for a spin, but now they’re struggling to deploy Hadoop
applications with the standards of quality, reliability, and
manageability that they have come to expect.
In short, Hadoop, out of the box, is not operationally ready.
In this paper, I’ll share 9 best practices for how IT
organizations can achieve operational readiness on
Hadoop. Of course, there are not yet formal certifications
or commonly accepted standards for overcoming the
many challenges. But there is an emerging consensus
around how Hadoop applications are best built, deployed,
and managed.

1
Build culture and tools supporting
collaboration between developers,
operators, & other Hadoop team members
Operators are not trained (and often can’t or don’t want to be trained) to read
Java stack traces and debug code. Likewise, asking a developer to address
performance problems is a little like asking a passenger to get out of your car and look
under the hood (assuming your passenger is not a mechanic).
Therefore, at least for the foreseeable future, optimizing performance on Hadoop is
a team proposition. In the best case, an operator who detects a problem can easily
collaborate with the developers, data scientists, and even business managers who have
stakes in the application running smoothly. They should all see operational readiness as
a shared responsibility, and be armed with tools that show them the complete picture
when remediation is required.

2
Connect execution problems with
application context
At newer companies, especially those in the San Francisco Bay Area, I find that one
person typically handles all the work — the data science, the development, and the
deployment to production — around a big data application. If there are problems when
the app runs, that same person can usually fix them. After all, she wrote it.
For big, traditional enterprises, it’s a different story. The operations team running fraud
and risk detection apps on Hadoop might live in Phoenix, while the team that developed
them sits ten time zones away in India. In some cases, the operations team today is
completely different from the one that first deployed an application.
Therefore, when a Hadoop job fails or takes too long to execute, operators should be
able to quickly link problems not only to the application that caused them, but also to
the relevant data flow logic inside the application. It’s also great if the operator can
immediately see detailed mapper/reducer stats tied to the problem. That way, they can
more quickly understand if the problem is with the code, the data or the hardware.
When a performance
problem arises, operators
should be able to investigate
app logic and cluster usage.

3
Monitor the fleet, not the vehicle
To an operator running hundreds or thousands of applications on a Hadoop cluster,
all of them look the same, until there’s a problem. So you need tools that let you
look at performance over groups of applications. Ideally, you should be able to
segment performance tracking by application types, departments, teams and
data-sensitivity levels.

4
Define and enforce service level
agreements. Yes, even on Hadoop
Monitoring a fleet still means knowing when an individual vehicle performs poorly.
Similarly, operators need to set SLA bounds on performance and define alerts and
escalation paths for when those bounds are violated.
Every business is unique, so there’s no set list of performance metrics to monitor. But it
certainly means looking at more than what you can see in log files.
SLA bounds should incorporate both raw metadata, such as job status, and
business-level events, such as sensitive data access. Successful practitioners of operational
readiness also set up metrics that help predict future SLA violations, so they can
proactively address and avoid them.
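As a rough illustration of SLA bounds with an early-warning tier, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `SlaBound` class, the `check_sla` function, the application name, and the idea of warning at 80% of the runtime limit are hypothetical and do not come from any particular Hadoop tool.

```python
from dataclasses import dataclass


@dataclass
class SlaBound:
    """Hypothetical SLA definition for one application."""
    app: str
    max_runtime_s: float         # hard limit: exceeding it is a violation
    warn_fraction: float = 0.8   # assumed early-warning threshold (share of the limit)


def check_sla(bound: SlaBound, runtime_s: float) -> str:
    """Classify a finished run as 'ok', 'warning', or 'violation'."""
    if runtime_s > bound.max_runtime_s:
        return "violation"
    if runtime_s > bound.warn_fraction * bound.max_runtime_s:
        return "warning"
    return "ok"


# Illustrative runtimes in seconds for a made-up application.
bound = SlaBound(app="fraud-scoring", max_runtime_s=3600)
statuses = [check_sla(bound, t) for t in (1200, 3000, 4000)]
```

The "warning" status is one simple way to surface runs that are drifting toward a violation before the bound is actually broken; a real deployment would route each status to its own alert and escalation path.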

5
Understand inter-app dependencies
At technology companies blessed with enough capital to run a dedicated Hadoop
cluster for every use case, applications run more or less independently. That’s not the
case, however, at larger, more traditional enterprises, which tend to run their clusters
as a shared service across lines of business. As a result, each application has at least
a few “roommates” in the cluster, some of which can be noisy, disruptive, or otherwise
detrimental to its own performance.
To understand what’s behind the errant behavior of one Hadoop application, in other
words, you have to first understand what others were doing on the cluster when it
ran. Did a rogue app hog resources, causing others to perform poorly? Was the poor
performance of one application actually due to its dependency on data from some other
application, upstream, that failed to operate as expected?
Provide your operations team with as much cluster-related context as you can. For
example, just by tracking cluster usage by application, you’ll more quickly understand
when an SLA violation is really about a rogue app, rather than a problem in the one that
triggered the alert.
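Tracking cluster usage by application can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input format (pairs of application name and slot-seconds consumed in some time window), the function names, and the 50% threshold for flagging a possible rogue app are all hypothetical.

```python
from collections import defaultdict


def slot_share_by_app(task_records):
    """Return each application's share of total slot-seconds in a time window.

    task_records: iterable of (app_name, slot_seconds) pairs (assumed format).
    """
    totals = defaultdict(float)
    for app, slot_seconds in task_records:
        totals[app] += slot_seconds
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {app: used / grand_total for app, used in totals.items()}


def rogue_apps(shares, max_share=0.5):
    """List applications whose share of the cluster exceeds the assumed threshold."""
    return sorted(app for app, share in shares.items() if share > max_share)


# Made-up usage records for one time window.
records = [("recommender", 7000), ("fraud-scoring", 1000), ("reporting", 2000)]
shares = slot_share_by_app(records)
suspects = rogue_apps(shares)
```

With a view like this, an operator investigating an SLA alert on one application can first check whether a different application dominated the cluster during the same window.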

6
Ration Your Cluster
To optimize cluster usage and ROI, operators must ration resources on the cluster and
enforce the limits.
For example, an operator can budget 10,000 mappers for the execution of a particular
application. Then, the onus is on the application to do two things: comply with the
budget restriction, and then demonstrate that compliance. Lacking such proof, rationing
rules should prevent the application from being deployed on the cluster. After all, an
application that cannot demonstrate compliance is not trustworthy.
Establishing and enforcing the rules for rationing cluster resources is vital for achieving a
meaningful state of operational readiness and meeting SLA contracts. You may have to
handle unusual edge cases. For example, is it acceptable for a recommender engine to
meet its SLA contract in terms of spitting out recommendations while totally consuming a
700-node cluster for the duration of its execution? (I saw this happen in real life!)
Tracking applications that consume more than 10,000 mappers
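The mapper budget described in this section could be enforced with a deployment gate along these lines. This is a hypothetical Python sketch, not a feature of Hadoop or of any scheduler: the `admit` function, its arguments, and the idea that an application declares its measured mapper count are assumptions made for illustration.

```python
def admit(app_name, declared_mappers, budget=10_000):
    """Decide whether an application may be deployed on the cluster.

    declared_mappers is the mapper count the application has demonstrated
    (for example, from a test run); None means no evidence was supplied.
    Returns a (allowed, reason) pair.
    """
    if declared_mappers is None:
        # No proof of compliance: treat the application as untrusted.
        return (False, f"{app_name}: no evidence of mapper usage supplied")
    if declared_mappers > budget:
        return (False, f"{app_name}: {declared_mappers} mappers exceeds budget of {budget}")
    return (True, f"{app_name}: within budget")


# Made-up applications checked against the 10,000-mapper budget.
decisions = {
    "recommender": admit("recommender", 12_000),
    "nightly-etl": admit("nightly-etl", 8_000),
    "unknown-app": admit("unknown-app", None),
}
```

The gate deliberately rejects an application that supplies no evidence, mirroring the rule that compliance must be demonstrated and not merely assumed.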

7
Trace data access at the operational level
Good Hadoop management isn’t only about rationing compute resources; it also means
regulating access to sensitive data. This is especially true in industries with heightened
privacy concerns, such as financial services, health care, insurance, and, these days,
even social media.
For example, a data scientist may develop a vastly improved new model for reducing
lending risk, but unless the enterprise can prove that the application does not use any
private data, it cannot deploy the application to production.
Solving for data lineage and governance in an unstructured environment like Hadoop is
no easy task. Traditional techniques for manually maintaining a metadata dictionary quickly
lead to stale repositories. In addition, there is no proof that the model that is
deployed in production is using the fields described in the metadata repository.
What is required is visibility and enforcement at the operational level on the use of data
fields. If you can track if and when a data field is accessed by an app, you have the
evidence you need to make your case.
Tracking the lineage of data fields at an operational level
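Once field-level access is recorded per application, checking it against a list of restricted fields is straightforward. The Python sketch below assumes such a record already exists; the field names, application names, and the `forbidden_fields_used` helper are hypothetical and only illustrate the check.

```python
def forbidden_fields_used(fields_read, restricted):
    """Return the restricted fields that an application actually read, sorted."""
    return sorted(set(fields_read) & set(restricted))


# Assumed inputs: a set of restricted field names and a per-application
# record of the fields each application accessed at run time.
RESTRICTED = {"ssn", "date_of_birth"}
access_log = {
    "lending-risk-v2": ["income", "loan_amount", "ssn"],
    "churn-model": ["tenure", "plan"],
}

report = {
    app: forbidden_fields_used(fields, RESTRICTED)
    for app, fields in access_log.items()
}
```

An empty list for an application is the kind of evidence a compliance team can use before approving a production deployment; a non-empty list names exactly which fields need review.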

8
Record data misfires
Compliance folks at large enterprises also want proof that a Hadoop application
processed every record in a dataset, and they look for documentation when it fails
to do so. Failures can result from format changes in upstream data sets or plain old
data corruption. Keeping track of all records that the application failed to process is
particularly vital in regulated industries.
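One simple way to keep track of unprocessed records is to divert every failure to a side list together with its reason, so that processed plus rejected always equals the input count. The Python sketch below assumes JSON-lines input and two required fields, `id` and `amount`; the format, field names, and function name are illustrative assumptions only.

```python
import json


def process_records(lines, required_fields=("id", "amount")):
    """Split input lines into processed records and documented rejects."""
    processed, rejected = [], []
    for line_no, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
            if not isinstance(record, dict):
                raise ValueError("record is not a JSON object")
            missing = [f for f in required_fields if f not in record]
            if missing:
                raise ValueError(f"missing fields: {missing}")
            processed.append(record)
        except ValueError as err:  # json.JSONDecodeError is a ValueError subclass
            # Keep the raw line and the reason so the failure can be audited.
            rejected.append({"line": line_no, "raw": line, "reason": str(err)})
    return processed, rejected


# Made-up input: one good record, one with a missing field, one corrupt line.
lines = ['{"id": 1, "amount": 10.5}', '{"id": 2}', "not json"]
processed, rejected = process_records(lines)
```

Because nothing is silently dropped, the rejected list doubles as the documentation that compliance teams ask for when a run does not cover every record.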

9
Tune your engine before you replace it
With new compute fabrics emerging all the time, teams are sometimes too quick to junk
their old ones in pursuit of better performance. However, it’s very often the case that
you can achieve equal or greater performance gains just by optimizing code and data
flows on your existing fabrics. That way you can avoid expensive infrastructure upgrades
unless they’re truly necessary.

Operational readiness on Hadoop:
You can get there from here
When Henry Ford launched the Model T, it was rugged and robust, the first
car fit for a broader market. The car was not only affordable to buy; it was
also practical to own and operate.
Perhaps Hadoop will get there, too. Until then, there’s a lot you can do to
avoid getting left on the side of the big data road.
If you’re looking for help with operational readiness on Hadoop, or are
curious about the charts and displays I’ve shown here, get in touch with me
at sales@concurrentinc.com or visit concurrentinc.com.
ABOUT SUPREET OBEROI
Supreet is the Vice President of Field
Engineering at Concurrent. Prior to that,
he was Director of Big Data application
infrastructure for American Express, where
he led the development of use cases for
fraud, operational risk, marketing and
privacy on Big Data platforms. He holds
multiple patents in data engineering and
has held leadership positions at Real-Time
Innovations, Oracle, and Microsoft.
@supreet_online
www.concurrentinc.com
sales@concurrentinc.com
