CITO Research
Advancing the craft of technology leadership
SPONSORED BY
Hadoop: Data Storage Locker or Agile Analytics Platform? It’s Up to You.
CONTENTS
Introduction
How Hadoop Becomes a Data Storage Locker
Transforming Hadoop into an Agile Analytics Platform
Investment in Data Transformation for Hadoop Delivers 10x Productivity Gains
Conclusion
Introduction
Why is Hadoop so enticing to businesses? As an open source repository, Hadoop is a cutting-edge and disruptive technology. It has the capacity to handle quantities of data that traditional repositories simply cannot, and its storage is drastically cheaper than traditional data warehouses. These factors contribute to extremely high expectations from business users. Companies expect a return on their investment from Hadoop in the range of three to four dollars for every dollar invested.
Yet, reality is starkly different. At present, Hadoop users are achieving a return of 55 cents per dollar invested. This is a tenuous situation for businesses, as the flow of big data will not slow down in order for them to learn how to better utilize Hadoop. The problem is only going to be compounded in the next decade: the amount of big data is doubling every two years and is estimated to grow from 4.4 ZB in 2014 to 44 ZB in 2020. That’s as many pieces of data as there are stars in the universe.1
Despite this opportunity, a Bain and Company study found that 66% of surveyed companies believe they do not have the right technology to capitalize on data.2 Consequently, even companies that recognize how powerful data could be do not have the knowledge, expertise, or tools to make that vision a reality.
Hadoop lowers the barrier to storing data, but it doesn’t necessarily lower the barriers to
creating value from data. This CITO Research paper will describe what’s required to turn
Hadoop into a productive platform for agile analytics.
How Hadoop Becomes a Data Storage Locker
Hadoop’s economics are transformational. The cost per gigabyte of data makes Hadoop an
attractive data storage solution for many different applications and types of data.
On its own, Hadoop can’t parse the meaning of the data it is collecting. And while Hadoop
can amass multi-structured data, it is not designed to transform it or help companies decide
whether or not that data is useful.
The result is that frequently, Hadoop gathers data that lies fallow.
Sometimes this amassing of data is tellingly referred to as “Hadumping.”
1 http://www.emc.com/leadership/digital-universe/2014iview/executive-summary.htm
2 http://www.bain.com/Images/BAIN%20_BRIEF_The_value_of_Big_Data.pdf
Store Now, Understand Later
The cross-that-bridge-when-we-come-to-it strategy of landing data in Hadoop and figuring out what to do with it later has led to Hadoop functioning more like a data storage locker than an agile analytics platform.
One of Hadoop’s strengths—the fact that it doesn’t need a predefined schema to load data into it—also feeds into its weakness. For meaning to be applied to data when it is read (referred to as schema-on-read), users need to understand the data and then add some context to transform raw data into insights.
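To make schema-on-read concrete, here is a minimal sketch in PySpark, a common way to query data stored on Hadoop clusters (the file path, column names, and schema are assumptions for illustration): the raw file carries no structure of its own, and meaning is supplied only at the moment of reading.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# The raw files were landed in HDFS as-is, with no schema declared at write time.
# Meaning is applied only now, when an analyst decides how to interpret each field.
schema = StructType([
    StructField("user_id", StringType()),
    StructField("event", StringType()),
    StructField("price", DoubleType()),  # does "price" include tax? Someone must decide.
])

orders = spark.read.csv("hdfs:///landing/orders/*.csv", schema=schema)
orders.groupBy("event").count().show()
```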
Data has the potential to be valuable, but companies need tools to explore and extract that value. In truth, this is not a new problem: Even in the world of traditional data warehousing with structured data repositories and rigid, top-down governance, 90% of business data went unused. Obviously, with ever more data on hand, and Hadoop to store it, more data is going unused than ever before.
Variety: The V You Need to Worry About
Big data is often framed in terms of the three Vs: volume, variety, and velocity. CITO Research
believes that variety is the most problematic of the three Vs. Here’s why.
For schema on read to work, that is, to apply understanding to data stored in Hadoop, someone has to understand what the data means. For each dataset, this application of meaning must happen again. It doesn’t matter how large the dataset is; what’s really both a problem and an opportunity is how many datasets are coming in. The meaning of each of these datasets must be specified. In addition to evaluating the meaning of each dataset coming in, determining the relationships between those datasets is often the critical breakthrough point to uncovering business value in the data.
In other words, the challenge is not volume (once you know what the data means, you can read it) but variety (figuring out what the data means in the first place).
A VP of Data at a marketing data provider echoed this sentiment. “There are a great variety of sources and all sizes and shapes and flavors of the data, and we have to understand them up front. We can’t process them and then decide whether they are relevant. A lot of pre-analysis happens with the data before we even accept it for modeling,” she said.
Datasets are coming in at high velocity, but knowing what to do with them requires dealing first with what those data sources mean. The variety of data is problematic because the question of meaning must be answered each time a new type of data appears. If data is stored first and understood later, the pile of data to deal with at some future point only gets larger.
Expertise Is Scarce
To date, self-service access to Hadoop has been more dream than reality. The current framework for assigning meaning to data in Hadoop requires analysts to rely on development experts for their workflows. This reliance on Hadoop experts bogs down processes, creates bottlenecks, and makes it difficult for people who can supply the needed business context for the data—in other words, who have native understanding of the data from a business perspective—to directly explore and interact with the data.

The need to transform data into usable forms is so acute that the fastest growing category of specialist is now the data engineer, not the data scientist. As of September 2014, LinkedIn had nearly 21,000 postings for jobs with “data engineer” in the title, compared to just over 11,000 for jobs with “data scientist” in the title.
Data Preparation Is Time-Consuming
Companies are forced to devote far too much of their time to preparing data, often repeating steps without business context. These tasks include the type of wrangling, munging, and hand-coding exercises that devour time, whether that involves joins of disparate datasets or just getting all the data into the same format.
Here is a sampling of a few common data preparation problems:
• Problems stemming from business logic. For example, “price” might include taxes and shipping in some data sources but not in others.
• Missing values and outliers. When missing values or outliers (such as latitude and longitude in the middle of the ocean) show up, what should the person working with the data do? Should the rest of the data for those records be included in the models or should data records with missing values be omitted entirely? The answer can be highly specific to the use case for the data.
• Derived values. The data may contain answers, but it may take work to get at those answers. Consider the task of figuring out how long a user spent on a website, which requires sessionizing data from weblogs to ascertain the activities of a particular user. The definition of a session is a derived value calculated by defining the starting time and ending time. Finding a business definition of a session is an inexact science. It requires some experimentation and observation of user behaviors to determine the session length appropriate to analyze for the business questions being asked. This iterative process, when executed by a non-business expert, involves a lot of extra trial and error (see the sketch after this list).
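As a minimal illustration of the derived-values problem, here is a sessionization sketch in Python with pandas. The 30-minute gap threshold, the column names, and the sample rows are all assumptions; picking the threshold is exactly the kind of business judgment described above.

```python
import pandas as pd

# Hypothetical weblog: one row per page view, with a user ID and a timestamp.
logs = pd.DataFrame({
    "user_id": ["a", "a", "a", "b", "b"],
    "timestamp": pd.to_datetime([
        "2014-09-01 10:00", "2014-09-01 10:05", "2014-09-01 11:30",
        "2014-09-01 10:00", "2014-09-01 10:10",
    ]),
}).sort_values(["user_id", "timestamp"])

# A new session starts whenever the gap since the user's previous event exceeds
# the threshold. The 30-minute value is a guess that would need validating
# against real user behavior.
THRESHOLD = pd.Timedelta(minutes=30)
gap = logs.groupby("user_id")["timestamp"].diff()
logs["session_id"] = (gap.isna() | (gap > THRESHOLD)).cumsum()

# The derived value: time spent per session, last event minus first event.
durations = logs.groupby("session_id")["timestamp"].agg(lambda s: s.max() - s.min())
print(durations)
```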
Data preparation is notoriously time consuming; data scientists say that these types of activities consume some 50 to 80% of their time.3

3 http://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html?_r=0
Transforming Hadoop into an Agile Analytics Platform
When companies consider what kinds of platforms to adopt to get the most out of their data
and their Hadoop implementation, they should focus on the following factors to achieve
long-term success.
Agile versus Waterfall
What does it mean for analytics to be agile? It means that you need a workflow that is iterative and dynamic and allows users to discover insights naturally (see Figure 1).
Consider the waterfall methodology, in which the expected output was a report. The report was designed to answer a question or a group of questions, defined in advance. The output of the analysis was often a static KPI or single chart for storytelling. Because the questions are fixed in advance, much data is thrown out immediately to drive toward a single answer. This simplifies data management, but has the downside of removing data from the analysis that might highlight an unexpected insight that resides in that dataset.
Agile analytics gives you the ability to explore, try new things, and then change your mind. It’s an exploratory test-and-learn approach in which questions lead to more questions and then eventually to discovered answers. The nature of agile analytics is iterative. The following scenarios demonstrate the need for agility.
Same dataset, different stakeholders. Different stakeholders frequently use the same data in different ways. Consider logs of usage data from a personal fitness bracelet. Product development wants to correlate log data with support tickets. They wonder which features cause users to contact support and whether the product design could be tweaked to make it more intuitive.

Marketing looks at the same logs from a completely different angle. The marketing department might be more interested in the correlation between application usage trends, customer demographics, and engagement on the product forums.

Each group consults the product usage logs, but uses them in entirely different ways. Further, their use of the data will evolve over time across groups.
Retransform the data to find new signals. Suppose product development found out that female customers use the bracelet’s sleep monitor feature more than men do. They would like to design a product geared for women and want the data transformed with finer grained temporal information. What days of the week do women use the feature most? Workdays? Weekends? This information is in the raw logs, but it was transformed away during the first iteration. Agility demands the ability to go back to raw data and retransform it to support additional use cases.
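A sketch of what that second pass might look like, assuming the raw logs were kept in Hadoop and can be re-read (the file paths, the feature flag, and the gender attribute joined in from customer profiles are all hypothetical):

```python
import pandas as pd

# Re-read the raw logs rather than the first iteration's aggregated output,
# which discarded the timestamps needed for day-of-week analysis.
logs = pd.read_parquet("hdfs:///raw/bracelet_logs/")          # assumed layout
profiles = pd.read_parquet("hdfs:///raw/customer_profiles/")  # assumed gender column

sleep = logs[logs["feature"] == "sleep_monitor"].merge(profiles, on="user_id")
sleep["day_of_week"] = sleep["timestamp"].dt.day_name()

# The finer-grained temporal view: sleep-monitor use by gender and day of week.
usage = sleep.groupby(["gender", "day_of_week"]).size().unstack(fill_value=0)
print(usage)
```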
These scenarios cover just one data source and only two lines of business. If you multiply those needs by the variety of sources of big data and the number of stakeholders who want to make data-driven decisions, it’s easy to see why data transformation work dominates the time of data engineers and data scientists.
Figure 1. Agile analytics is an iterative and flexible process that supports changing business needs and requirements. The cycle, centered on the data analyst, comprises five stages:
• DISCOVERY: selecting the dataset and evaluating the signal relative to the analysis at hand
• ASSESSMENT: profiling data and identifying outliers to determine fit for analysis
• SHAPING: identifying data format and distribution and structuring data for analysis
• ENRICHMENT: cleaning the data, joining it with other datasets, and aggregating data at the right level for the use case
• REUSE: sharing the script with others so that it can be reused as a canonical model
Leveraging Business Context
Human interaction is vital to the process of preparing data for analysis. Users need to identify features in the data that are interesting to them and provide feedback on the representation and meaning of the data they’re attempting to use. Otherwise, the data is never tied to organizational change or corrective actions.

A leader in the design software industry recently said in a CITO Research interview, “Everybody at some level needs to be a data analyst. Anybody who’s working at the company who is trying to improve internal processes, external processes, customer behavior, has to have willingness to ask questions and work from the context of data. I’ve been advocating that each product team have at least one or two people who are the mavens of their team’s data.” In other words, business experts supply business context, imbuing the data with domain knowledge. Even if more people are hired and tasked with data transformation, if they don’t have the domain knowledge of their data, their work on behalf of business stakeholders will be less efficient than a method that empowers domain experts to transform data themselves.
Democratizing Data Access
Companies looking to get the most out of Hadoop must overcome the shortage of Hadoop experts. They need to tap into the broader community of analysts that already exists in their organizations and implement platforms that bridge the divide between the business user and the data. These platforms will not replace human participation in the data extraction process. Quite the contrary, in fact.

The more employees who can use big data, the more powerful it can become. Users in marketing and operations may ask very different questions that together yield new and powerful insights.
Adopting new technology can be an intimidating process for employees at all levels of an organization. Users are accustomed to the way existing tools work and frequently develop their own workarounds to address limitations. But with Hadoop, without new tools specifically designed for the vast quantities of data from varying sources, users see only a limited picture of the insights their data could be providing.
The best add-on platforms for Hadoop allow for a democratization of data that ignites data-driven decision-making across an entire organization. Data is no longer strictly the province of the data scientist; business users can find value from interacting with data themselves. Users can start with smaller datasets, and once they see that the data transformation platform gives them the results they need, their trust builds and they can move to larger datasets. This becomes a virtuous feedback cycle; giving more people useful access to data drives demand for more data. Suddenly, data becomes integral to the company and insights come from unexpected places.
Using Machine Learning
The only way to provide skill-neutral access to data is to adopt platforms on top of Hadoop that use an interactive interface, rather than code, to simplify the complexity of the data investigation process. These easy-to-use interfaces allow all users to see how the data will be transformed, and make data easier to read and understand. Point-and-click tools make even the trickiest data as simple to manipulate as perusing a spreadsheet.
Beneath that intuitive exterior, however, these revolutionary data transformation tools are using a predictive interaction approach that leverages machine learning to anticipate users’ needs and speed them through the process of understanding and manipulating their data. By anticipating what users need, the platform allows people to scale their abilities rapidly. As one user stated, “It’s the symbiosis between the machine and human that really makes the analyst feel superhuman” in handling big data velocity and variety.
Predictive interaction interposes machine learning techniques between human users and the data that they see. As a user browses the data, his or her behavior effectively teaches the machine what to find in a given dataset. The machine builds its knowledge with every piece of feedback. Users guide the process, but the machine does the detailed work, meaning experts are no longer spending 80% of their time preparing the data. They’re able to get up to their elbows in the data much faster, regardless of the amount of data they’re looking to analyze.
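As a heavily simplified sketch of that feedback loop (this is not Trifacta’s implementation, and the transform names are hypothetical), the engine below ranks candidate transformations and adjusts the ranking after every accept or reject decision:

```python
from collections import defaultdict

class SuggestionEngine:
    """Toy model of predictive interaction: rank candidate transforms
    and learn from each accept/reject decision."""

    def __init__(self, candidates):
        self.scores = defaultdict(float, {c: 0.0 for c in candidates})

    def suggest(self):
        # Offer the transform the model currently believes is most useful.
        return max(self.scores, key=self.scores.get)

    def feedback(self, transform, accepted, lr=0.5):
        # Every piece of feedback nudges the model's belief up or down.
        self.scores[transform] += lr if accepted else -lr

engine = SuggestionEngine(["split_column", "fill_missing", "parse_dates"])
for _ in range(3):
    proposal = engine.suggest()
    accepted = proposal == "parse_dates"   # stand-in for a real user's choice
    engine.feedback(proposal, accepted)
print(engine.suggest())  # converges toward what the user keeps accepting
```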
It’s About Repeatable Speed and Scale
Hadoop requires tools that are fit for schema on read. Companies often do not budget for these add-ons. Without the add-ons, they cannot get the cohesive data manipulation that drives true analytic capacity from the data they store in Hadoop. Once Hadoop adopters recognize their need, many opt for makeshift solutions. They either try writing scripts by hand to transform data (not a scalable solution) or try to retrofit existing tools that were designed for traditional data warehouses and traditional monolithic data structures.
Another approach is to rely heavily on services. While the initial investment in Hadoop may run in the tens of thousands of dollars, it may take a couple of million dollars in services to make it useful. Opting to use a data transformation platform not only saves money initially but also gives the organization a repeatable approach to agile analytics. Repeatability is key. As the VP of Data from a data distribution company told CITO Research, “We are in the business of data transformation and data management, so of course everything we do has to be repeatable.”
Investment in Data Transformation for Hadoop Delivers 10x Productivity Gains
Businesses that have used data transformation tools have been astonished at how going from a dozen people using data to a few hundred has impacted the decision-making process. The returns could be even greater if thousands within a business were using data.
CITO Research has found that data transformation platforms like Trifacta offer productivity gains of a factor of 10 for data scientists and data engineers as well as business users. With tools like Trifacta, no coding is required. Users:
• Have an interface that guides them in transforming data
• Get immediate visual feedback
• Can detect any problems with the output
• Obtain a sharable and repeatable history of steps taken to transform raw data into analysis-ready data
Perhaps most importantly, Trifacta provides greater visibility into datasets, ensuring wider
use of big data.
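To see why a sharable, repeatable history of steps matters, consider this minimal sketch (the step names are hypothetical and this is not Trifacta’s recipe format): the transformation is captured as data, so the same steps can be reviewed, shared, and replayed unchanged on next month’s files.

```python
import pandas as pd

# A transformation history captured as data rather than buried in ad hoc scripts.
recipe = [
    ("rename", {"columns": {"usr": "user_id"}}),
    ("dropna", {"subset": ["user_id"]}),
    ("astype", {"dtype": {"price": "float64"}}),
]

def apply_recipe(df, steps):
    # Each step is an ordinary DataFrame method applied in order.
    for method, kwargs in steps:
        df = getattr(df, method)(**kwargs)
    return df

raw = pd.DataFrame({"usr": ["a", None], "price": ["1.5", "2.0"]})
clean = apply_recipe(raw, recipe)
print(clean.dtypes)  # the same recipe can be re-run on any new raw file
```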
Trifacta is a prime example of how effective tooling can be for non-expert users. Using a Google-like “auto-complete” approach, users guide Trifacta through a predictive interaction process in which a portion of their big data is visualized so users can confirm that transformations are correct. The tool can predictively highlight information the user will find relevant. The visualization aspect of a tool is crucial, but so too is the ease with which it incorporates user feedback and learns from these inputs.
With data transformation tools in place, the job satisfaction and productivity of not only analysts, but also data scientists and data engineers will increase. They will no longer have to spend the bulk of their time wrangling data for others’ use. Instead, with data transformation tools like Trifacta, data scientists and data engineers will be able to collaborate more efficiently with the business by sharing the same toolset. Business analysts can participate in transforming data in Hadoop, collaborate more efficiently with data scientists and experts, and render the data accessible via end-user BI tools like Tableau and QlikView.
Figure 2. With data transformation tools such as Trifacta, Hadoop becomes the basis for agile analytics, actively used by all lines of business. Self-service data sources feed a collect, process, organize, and learn pipeline whose results serve ops, sales, marketing, and product teams.
Conclusion
Unquestionably, Hadoop has the potential to change the way companies work with data. Hadoop allows businesses to store, collect, and process the ever-increasing quantities of big data that are being generated every second: data from customers, from business partners, and from devices, whether in tabular, textual, or other machine-generated formats. Yet, before jumping off the diving board into the Hadoop deep end, companies should recognize that the advantages of Hadoop are only fully realized with additional tooling.
Hadoop is not a data transformation product. It is a data repository. It expects schema on read. Someone must supply that schema in order for the data to be usable. By enabling business users to provide that schema interactively, without having to understand even so much as that term, Trifacta brings about data transformation and frees the data in Hadoop for wide business use. CITO Research believes that without such tools, Hadoop will remain a Hadumping ground, where more and more data is stored and gathers dust. Trifacta, relying on machine learning and user input, liberates Hadoop data. Regardless of the origin of the data, and regardless of its format, Trifacta makes the process of standardizing data stored in Hadoop and readying it for use far more efficient.
When it comes to data preparation and transformation, Trifacta is a prime example of how effective new platforms can be in accelerating Hadoop’s time to value. Data preparation has been a huge barrier for early adopters of Hadoop, meaning the time for actual analytics and using data to inform decisions has been limited. Trifacta’s ability to balance machine learning and human input for productive data transformation can change that.
This paper was created by CITO Research and sponsored by Trifacta.

Learn more about Trifacta

CITO Research
CITO Research is a source of news, analysis, research, and knowledge for CIOs, CTOs, and other IT and business professionals. CITO Research engages in a dialogue with its audience to capture technology trends that are harvested, analyzed, and communicated in a sophisticated way to help practitioners solve difficult business problems.

Visit us at http://www.citoresearch.com