Leaving Data on
the Table
Data Scientists Reveal Obstacles
to Big Data Analytics
Paradigm4 Data Scientist Survey 2
While Big Data enjoys widespread media coverage, not enough attention has been paid to what
practitioners think — data scientists who manage and analyze massive volumes of data.
We wanted to know, so Paradigm4 teamed up with Innovation Enterprise to ask over 100 data scientists
for their help separating Big Data hype from reality. What we learned is that data scientists face multiple
challenges achieving their company’s analytical aspirations. The upshot is that businesses are leaving data
— and money — on the table.
Their experiences should help inform businesses on what to look for as they investigate options to expand
their analytics infrastructure.
For insight on the issues and obstacles facing data scientists, read on.
We asked data scientists questions such as:
What obstacles prevent them from gaining insights into their data?
How many use Hadoop, and which limitations have they encountered
when attempting to use Hadoop for complex analytics?
What data types and sources would they like to leverage more effectively?
Will they adopt complex analytics solutions (defined below), and how quickly?
This survey uses the terms “complex analytics” and “basic analytics,” for which respondents were given these definitions:
“Complex analytics” means math functions like covariance, clustering, machine learning, principal components
analysis and graph operations.
“Basic analytics” means business intelligence reporting such as sums, counts and aggregates.
This distinction is important because basic analytics are “embarrassingly parallel” whereas complex analytics
are not. Here’s what we mean. “Embarrassingly parallel” (sometimes referred to as “data parallel”) refers to problems that
can be separated into multiple independent sub-problems that can run in parallel and do not require access to all the data
at once. This is the divide-and-conquer approach used by MapReduce/Hadoop. In contrast, “non-embarrassingly parallel”
problems require using and sharing all the data at once and communicating intermediate results among processes.
Matrix multiplication on matrices too large to fit on one server is an example of a non-embarrassingly parallel function.
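The difference between embarrassingly parallel and non-embarrassingly parallel work described in this section can be illustrated with a small sketch. It is not taken from the survey: the function names (chunked, parallel_sum, blocked_matmul) and the tiny in-memory lists are assumptions chosen for illustration, and the thread pool stands in for separate servers. A sum needs only one number back from each chunk, while each block of a matrix product needs input blocks that would live on several different workers.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(values, n_chunks):
    """Split a list into at most n_chunks contiguous pieces."""
    size = max(1, -(-len(values) // n_chunks))  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def parallel_sum(values, n_chunks=4):
    """Embarrassingly parallel: each worker sums its own chunk and never
    looks at another chunk; only one partial result per chunk is combined."""
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        partials = list(pool.map(sum, chunked(values, n_chunks)))
    return sum(partials)


def blocked_matmul(a, b, block=2):
    """Not embarrassingly parallel: every output block needs a whole row of
    blocks from `a` and a whole column of blocks from `b`. The second return
    value records which input blocks each output block had to read."""
    n, inner, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    inputs_needed = {}
    for i0 in range(0, n, block):
        for j0 in range(0, m, block):
            needed = set()
            for k0 in range(0, inner, block):
                needed.add(("A", i0, k0))
                needed.add(("B", k0, j0))
                for i in range(i0, min(i0 + block, n)):
                    for j in range(j0, min(j0 + block, m)):
                        for k in range(k0, min(k0 + block, inner)):
                            c[i][j] += a[i][k] * b[k][j]
            inputs_needed[(i0, j0)] = needed
    return c, inputs_needed


if __name__ == "__main__":
    print(parallel_sum(list(range(1, 101))))
    identity = [[1 if r == col else 0 for col in range(4)] for r in range(4)]
    matrix = [[r * 4 + col for col in range(4)] for r in range(4)]
    product, needed = blocked_matmul(matrix, identity)
    print(sorted(needed[(0, 0)]))
```

In the sketch, the output block at (0, 0) reads two blocks of each input matrix. If those blocks sat on different servers, they would have to be shipped between machines, which is the communication cost the text refers to.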
The Big Takeaways
1. We’ve all heard how hard it is to analyze massive and rapidly growing data volumes. But
data scientists say variety presents a bigger challenge. They are at times leaving data out
of their analyses as they wrestle with how to integrate and analyze more types of data such
as time-stamped sensor, location, image and behavioral data as well as network data.
2. Data scientists are turning to large-scale complex analytics both for unbiased data-driven
exploration and to wrest more value from their data.
3. For complex analytics, data scientists are forced to move large volumes of data
from existing data stores to dedicated mathematical and statistical computing
software. This time-consuming and coding-intensive step adds no analytical value
and impedes productivity.
4. While Hadoop has garnered widespread media coverage, 76 percent of data
scientists who have used Hadoop or Spark have encountered serious limitations. Hadoop is
well suited for embarrassingly parallel problems but falls short for large-scale complex analytics.
5. Incorporating diverse data types into analytical workflows is a major pain point
for data scientists using traditional relational database software.
6. For data scientists, Big Data means Big Stress: 39 percent say it has made their job
more stressful.
Data Variety Is Proving to Be More Important Than Volume
The overwhelming volume of corporate and organizational data continues to generate headlines, but it’s the
diverse types of data that pose a bigger challenge. Nearly three-quarters of data scientists (71 percent)
said their analytics had become more difficult because of the variety of data, not just the volume.
My Analytics Are Becoming More Difficult Because of the Variety and Types of Data Sources (Not Just the Volume)
71% TRUE
29% FALSE
What Is The Biggest Problem You Face In Gaining Insights From Your Big Data?
40% I struggle with managing new types and sources of data
36% I know how to get the answer but it takes too long (my data is too big to move to a math/analytics software package)
24% I don’t know what questions to ask of my data
18% I know what I want to ask but don’t know how to get the answers
17% I know how to get the answer but my analysis runs out of memory
Which types of data do you anticipate using in the next year?
66% Time-series
66% Business transaction
55% Geospatial / Location
46% Graph (network)
35% Clickstream
25% Health records
17% Sensor
13% Image
7% Genomic
What It Means:
The ability to effectively use diverse data sources is proving to
be a competitive differentiator in many industries.
The trend toward hyper-personalization and precision targeting illustrates this well.
Recommendations, search results and ads are becoming ever more relevant and micro-targeted
as they tap more and diverse data like social networks, current location, and browsing and
purchasing history. Personalized insurance offerings are augmenting sensor data about driver
behavior to incorporate contextual data like time-of-day and road congestion. Precision medicine
providers are gaining a more refined understanding of what works for whom by integrating
molecular data with clinical, behavioral, electronic health records and environmental data. But
the ability to use diverse data types poses a serious challenge. (For more on this topic, see “Big
Data at Work: Dispelling the Myths, Uncovering the Opportunities,” by Thomas Davenport,
Chapter 1: “Why Big Data is Important to you and your Organization.”)
Data Scientists Are Turning to Complex Analytics to Analyze Their Big Data
When will your company begin to use complex analytics on your Big Data?
59% We use it now
15% We plan to use it in the next year
16% In the next 2 years
1% In the next 3 years
4% More than 3 years down the road
4% No plans to use complex analytics
The point is not to be dazzled by the volume of data,
but rather to analyze it — to convert it into insights,
innovations, and business value.
— Thomas Davenport, “Big Data at Work: Dispelling
the Myths, Uncovering the Opportunities,” page 2.
What It Means:
The “low hanging fruit” of Big Data has been exploited.
Many new analytical uses require significantly more powerful algorithms and computational
approaches than what’s possible in Hadoop or relational databases. Data scientists increasingly
need to leverage all data sources in novel ways, using tools and analytical infrastructures suitable
for the task. As we have already seen in this survey, organizations are moving from simple SQL
aggregates and summary statistics to next-generation analytics such as machine learning,
clustering, correlation, and principal components analysis on moderately sized data sets. The
move from simple to complex analytics on Big Data presages an emerging need for analytics
that scale beyond single-server memory limits and handle sparsity, missing values and mixed
sampling frequencies appropriately. These complex analytics methods can also provide data
scientists with unsupervised and assumption-free approaches, letting all the data speak for itself.
Moving Big Data Poses Difficult Challenges to Data Scientists
Data scientists face another growing challenge: conventional analytic workflows require them to move data
to mathematical and statistical computing software. This workflow made sense with small or sampled data
but is either woefully inefficient or breaks with even moderately large data volumes.
78% of data scientists utilize software capable of complex analytics in addition to their data
management software
36% of data scientists say it takes too long to get insights from their data because it is too
big to move to their analytics software
What It Means:
The size and diversity of today’s data sets pose a significant hurdle
to doing more sophisticated analytics because so much time is lost
moving data from files or from a database to analysis tools.
This forces data scientists to make compromises, analyzing samples instead of the whole
data set, leaving data and money on the table. Data scientists risk missing rare events, weak
signals or important anomalies when restricted to working with samples or computing on
subsets independently. (For more on this topic, see “Scaling Big Data Mining Infrastructure:
The Twitter Experience,” by Twitter Engineering Manager Dmitriy Ryaboy and University of
Maryland Associate Professor Jimmy Lin.) What’s needed are tools capable of conducting
complex analytics over massive data volumes efficiently, without sampling and without
moving the data.
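The risk of missing rare events when working from a sample can be made concrete with a short calculation. The sketch below is an illustration added here, not part of the survey. It assumes a uniform random sample drawn without replacement, and the function name and the row counts in the usage lines are hypothetical.

```python
def prob_sample_misses_all(total_rows, rare_rows, sample_size):
    """Probability that a uniform random sample (without replacement)
    contains none of the rare rows.

    This is the hypergeometric probability of zero hits:
    C(N - k, n) / C(N, n) = product over i < k of (N - n - i) / (N - i).
    """
    if not 0 <= rare_rows <= total_rows or not 0 <= sample_size <= total_rows:
        raise ValueError("counts must satisfy 0 <= rare, sample <= total")
    probability = 1.0
    for i in range(rare_rows):
        # Clamp at zero: once the sample is larger than the number of
        # ordinary rows, it must contain at least one rare row.
        probability *= max(0, total_rows - sample_size - i) / (total_rows - i)
    return probability


if __name__ == "__main__":
    # Hypothetical numbers: 5 anomalous rows in 1,000,000, analyzed via a 1% sample.
    print(prob_sample_misses_all(1_000_000, 5, 10_000))
    # The same data set analyzed in full cannot miss them.
    print(prob_sample_misses_all(1_000_000, 5, 1_000_000))
```

Under these assumed numbers, a 1 percent sample misses every one of the five anomalous rows about 95 percent of the time, which is the kind of loss the paragraph describes.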
Hadoop Only Takes You So Far
While the Hadoop software platform garners significant media attention, Hadoop is not a viable solution
for many use cases, especially those that require complex analytics. Fewer than half of data scientists
surveyed (48 percent) have used Hadoop or Spark, and of those, 76 percent cited significant limitations
to its use.
From the 76% reporting problems, what are the limitations of Hadoop / Spark?
39% It takes too much effort to program
37% It’s too slow for interactive, ad-hoc queries
30% It’s too slow for real-time analytics
22% It’s not well-suited for my analytics (not embarrassingly parallel)
35% of data scientists who tried Hadoop or Spark have stopped using it
What It Means:
Hadoop was unrealistically hyped as a universal and
disruptive Big Data solution.
But even Hadoop vendors have recognized the limitations. They are adding SQL functionality to
their products to accommodate data scientists’ preference for a higher-level query language instead
of programming languages like Java and to address the limitations of MapReduce. (E.g., Cloudera
has abandoned MapReduce and is offering Impala to provide SQL on HDFS.) A growing number of
complex analytics use cases are proving to be unworkable in Hadoop. First-wave Hadoop adopters
like Google, Facebook and LinkedIn required a small army of developers to program and maintain
Hadoop. But many organizations either don’t have the required staff or face complex analytics
challenges that can’t be readily solved with Hadoop. This presents a real challenge for the Hadoop
infrastructure, which has to address these shortcomings or risk being replaced.
Existing relational database management systems are
inadequate for analyzing the variety of data sources
Given the growing diversification of data types and sources coupled with the limitations of existing relational
databases, it’s no surprise that many data scientists are frustrated leveraging these data sources in their
analytical workflows.
I am finding it harder to fit my data into relational database tables
49% TRUE
51% FALSE
What It Means:
Relational databases were built for storing and querying densely
populated transactional data such as business purchases and
customer information.
By comparison, temporal, spatial and network data may be quite sparse (containing
large amounts of missing values), have mixed sampling frequencies and a natural order.
Relational databases require predefined access patterns for each line of inquiry, an obvious
non-starter for data scientists doing ad hoc data exploration.
Big Data Means Big Stress for Data Scientists
There’s another side of the Big Data story: 39 percent of data scientists say their job has become more
stressful with the growth of Big Data. That’s nearly four times the number who say it’s made their job
less stressful.
39% of data scientists say the growth of Big Data has made their job more stressful in the last year
24% say they don’t know which questions to ask of their Big Data
Quotes from data scientists:
“My biggest problem is linking various data sources.”
“The data is just too big.”
“The biggest problem is putting multiple sources of data together.”
What It Means:
Driven in part by media hype, organizations have developed
inflated expectations around the value they’ll get out of Big Data.
Fulfilling those expectations falls on the data scientist. But outdated software approaches
better suited to traditional transactional data than to today’s diverse data sources and rapidly
growing volumes often make it impossible to fulfill these expectations. It’s a recipe for
stress. Deriving business value from organizational data starts with ad hoc analysis. Tools and
workflows need to enable data scientists to conduct analysis quickly and efficiently, making
data scientists more productive and lowering stress levels as a result.
So What?
Data scientists play a pivotal role helping organizations unlock the potential of their Big Data. But
current software tools fall short in some areas, as indicated in the survey. Hype has exceeded reality
and data scientists are forced to compromise, sometimes leaving data on the table. Choosing the
right software solution is key, but don’t expect to get there by browsing vendors’ websites. The fact
that so many data scientists identified shortcomings in their infrastructure suggests that the only way
to tell which solution is best suited to your organization is to do a pilot project using your data and
your use cases.
About the Survey
The Paradigm4 Data Scientist Survey was fielded by Innovation Enterprise, an independent research
firm, from March 27 to April 23, 2014. The responses were generated from a survey of 111 data
scientists in the U.S.
About Paradigm4
Paradigm4 is the creator of SciDB, a computational database management system used to solve
large-scale, complex analytics challenges on Big (and Diverse) Data. Led by industry visionaries
and veterans Michael Stonebraker, Marilyn Matz, Paul Brown and Bryan Lewis, Paradigm4 enables
data-obsessed organizations in life sciences, e-commerce, finance, and manufacturing to answer
harder questions faster.
For more information, visit www.paradigm4.com
