Amazing consistency: Largest Dataset Analyzed / Data Mined – Poll Results and Trends

Gregory Piatetsky-Shapiro

Part-time philosopher, Retired, Data Scientist, KDD and KDnuggets Founder, was LinkedIn Top Voice on Data Science & Analytics. Currently helping Ukrainian refugees in MA.

Published Oct 31, 2018

The latest KDnuggets Poll asked:

What was the largest dataset you analyzed / data mined?

This poll received 1108 votes, about 10% fewer than in 2016, but still a large enough sample. The results again show surprising stability, fitting a pattern that had already emerged in 2012: a majority of data scientists and analysts work with data in the Gigabyte range, and a small but notable segment works with web-scale data of over 100 Petabytes.

Note that the poll asks about the largest dataset ever analyzed, so a typical dataset is expected to be significantly smaller.

Highlights:

Gigabytes still rule: The majority of answers (56% in 2018, 57% in 2016, 56% in 2015, 54% in 2014, 53% in 2013) are in the Gigabyte range. In every year since 2012, the overall median response has been between 11 and 100 GB (which comfortably fits on one laptop).
Consistency: The shape of the curve is almost the same each year. In 2018 there were fewer responses in the under-10MB range and more in the 1-10GB range, but not significantly so.
Petabyte Big Data Scientists still stand apart: There is a small but significant gap, with almost no answers in the 1-10 PB range, which separates analysts who work with Terabyte-size commercial data warehouses from those who work with 100+ petabyte web-scale data stores. See, for example, a recent story on Uber's current 100PB data warehouse.
Academic researchers on par with Government, Industry: The estimated median for academic researchers is 90 GB, on par with Government (60 GB) and Industry analysts (50 GB). The estimated median answer increased a little for all segments in 2018.
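Because the poll collects answers in size ranges rather than exact sizes, a median has to be read off the binned counts. Here is a minimal sketch of how the median bin can be found. The bin labels and counts below are made up for illustration and are NOT the actual poll data, and the function name `median_bin` is hypothetical.

```python
from itertools import accumulate


def median_bin(labels, counts):
    """Return the label of the bin that contains the median response.

    `labels` must be ordered from smallest to largest size range, and
    `counts` holds the number of responses in each bin.
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("counts must contain at least one response")
    half = total / 2
    # Walk the cumulative counts until at least half of responses are covered.
    for label, running in zip(labels, accumulate(counts)):
        if running >= half:
            return label


# Hypothetical bins and counts, for illustration only (not poll results).
labels = ["<1 GB", "1-10 GB", "11-100 GB", "101 GB-1 TB", ">1 TB"]
counts = [20, 25, 30, 15, 10]

print(median_bin(labels, counts))  # prints "11-100 GB"
```

A point estimate inside the bin, such as the per-segment medians quoted above, would need an extra interpolation step that this sketch does not attempt.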

Fig. 1: KDnuggets Poll: Largest Dataset Analyzed, 2014-2018

2018 data is shown as a column, to stand apart from lines for previous years.

This poll also asked about employment type, and the breakdown was:

Company or Self-Employed, 62% (was also 62% in 2016)
Student, 17% (was 20% in 2016)
Academia/University, 13% (was 10% in 2016)
Government/non-profit, 4.8% (was 5.1% in 2016)
Other, 3.2% (was 2.4% in 2016)
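To get a feel for the sample size behind each segment, the shares above can be turned into approximate vote counts. This sketch assumes that all 1108 voters answered the employment question, which the article does not state. Because the published percentages are rounded, the counts are only approximate.

```python
TOTAL_VOTES = 1108  # total votes reported for the 2018 poll

# Employment shares for 2018, in percent, as listed above.
employment_share = {
    "Company or Self-Employed": 62,
    "Student": 17,
    "Academia/University": 13,
    "Government/non-profit": 4.8,
    "Other": 3.2,
}


def approx_votes(shares, total):
    """Convert percentage shares into approximate (rounded) vote counts."""
    return {name: round(total * pct / 100) for name, pct in shares.items()}


votes = approx_votes(employment_share, TOTAL_VOTES)
for name, n in votes.items():
    print(f"{name}: ~{n} votes")
```

Under that assumption, the Government/non-profit and Other segments each rest on only a few dozen responses, so their estimated medians are less reliable than those of the larger segments.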

Fig. 2: KDnuggets Poll: Largest Dataset 2018, by Employment. Red line shows the estimated median

Circle size corresponds to the number of responses.

Regional trends show slightly more voters from Latin America, the Middle East, and Australia, and slightly fewer from the US. The numbers were:

Europe, 34.9% (was 35.1%)
US/Canada, 34.4% (was 36.9% in 2016)
Asia, 15.6% (was 17%)
Latin America, 6.9% (was 5.6%)
Africa/Middle East, 4.9% (was 3.2%)
Australia/NZ, 3.2% (was 2.3%)
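The regional shifts can be checked by computing the change in percentage points for each region. This sketch assumes that every "was" figure above refers to the 2016 poll, which the list states explicitly only for US/Canada.

```python
# (2018 share, previous share) in percent, as listed above.
# Assumption: every previous share is from the 2016 poll.
regions = {
    "Europe": (34.9, 35.1),
    "US/Canada": (34.4, 36.9),
    "Asia": (15.6, 17.0),
    "Latin America": (6.9, 5.6),
    "Africa/Middle East": (4.9, 3.2),
    "Australia/NZ": (3.2, 2.3),
}

# Change in percentage points, rounded to one decimal to hide float noise.
change = {name: round(now - before, 1) for name, (now, before) in regions.items()}

# Print regions from largest gain to largest loss.
for name, delta in sorted(change.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {delta:+.1f} pp")
```

Africa/Middle East, Latin America, and Australia/NZ gained share, while US/Canada shows the largest drop.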

Finally, we examine the largest dataset analyzed by both employment and region for the 3 largest regions.

Read the rest on KDnuggets:

Amazing consistency: Largest Dataset Analyzed / Data Mined – Poll Results and Trends

https://www.kdnuggets.com/2018/10/poll-results-largest-dataset-analyzed.html

Gene Ferruzza

SVP Decision Sciences

Interesting survey, particularly the international input. I'm curious if you think this is a measure of available data for analysis, or the typical size of an assembled dataset? In most cases the former is constantly growing. Thanks for conducting this survey!

📊 Alastair Muir, PhD, BSc, BEd, MBB

Data Science Consultant | @alastairmuir.bsky.social | Risk Analysis and Optimization | Causal Inference

It would be interesting to include the maximum number of records and variables rather than just large file sizes. I find problems become more complex with large numbers of data points.
