Amazing consistency: Largest Dataset Analyzed / Data Mined – Poll Results and Trends

The latest KDnuggets Poll asked:

What was the largest dataset you analyzed / data mined?

This poll received 1108 votes, about 10% fewer than in 2016, but still a large enough sample. The results again show surprising stability, fitting a pattern that had already emerged by 2012: a majority of data scientists and analysts work with data in the Gigabyte range, while a small but notable segment works with web-scale data of over 100 Petabytes.

Note that the poll asks about the largest dataset ever analyzed, so a typical dataset is expected to be significantly smaller.

Highlights:

  • Gigabytes still rule: The majority of answers (56% in 2018, 57% in 2016, 56% in 2015, 54% in 2014, 53% in 2013) are in the Gigabyte range. In every year since 2012, the overall median response has been between 11 and 100 GB (which comfortably fits on one laptop).
  • Consistency: The shape of the curve is almost the same each year. In 2018 there were fewer responses in the under-10 MB range and more in the 1-10 GB range, but not significantly so.
  • Petabyte Big Data Scientists still stand apart: There is a small but significant gap, with almost no answers in the 1-10 PB range, which separates analysts who work with Terabyte-size commercial data warehouses from those who work with 100+ Petabyte web-scale data stores. See, for example, a recent story on Uber's current 100 PB data warehouse.
  • Academic researchers on par with Government, Industry: The estimated median for academic researchers is 90 GB, on par with Government (60 GB) and Industry analysts (50 GB). The estimated median answer increased a little for all segments in 2018.

Fig. 1: KDnuggets Poll: Largest Dataset Analyzed, 2014-2018

The 2018 data is shown as columns, to stand apart from the lines for previous years.

This poll also asked about employment type. The breakdown was:

  • Company or Self-Employed, 62% (was also 62% in 2016)
  • Student, 17% (was 20% in 2016)
  • Academia/University, 13% (was 10% in 2016)
  • Government/non-profit, 4.8% (was 5.1% in 2016)
  • Other, 3.2% (was 2.4% in 2016)
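To get a feel for how many people stand behind each percentage, the shares above can be converted back into approximate response counts. This is only a rough sketch: it assumes every one of the 1108 voters answered the employment question, and the published percentages are rounded, so the counts are estimates rather than the poll's actual tallies. The names `TOTAL_VOTES`, `shares_2018`, and `approx_counts` are illustrative, not part of the original analysis.

```python
# Approximate response counts per employment type, reconstructed from the
# rounded percentages reported in the article (assumption: all 1108 voters
# answered the employment question).
TOTAL_VOTES = 1108

shares_2018 = {
    "Company or Self-Employed": 62.0,
    "Student": 17.0,
    "Academia/University": 13.0,
    "Government/non-profit": 4.8,
    "Other": 3.2,
}


def approx_counts(shares, total):
    """Convert percentage shares into rounded response counts."""
    return {group: round(total * pct / 100) for group, pct in shares.items()}


counts = approx_counts(shares_2018, TOTAL_VOTES)

for group, n in counts.items():
    print(f"{group}: ~{n} responses")

# Because the inputs are rounded, the counts need not sum exactly to 1108.
print("Sum of estimated counts:", sum(counts.values()))
```

The smallest segments (Government/non-profit and Other) come out at only a few dozen responses each, so medians estimated for those groups rest on much less data than the Company or Self-Employed figure.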

Fig. 2: KDnuggets Poll: Largest Dataset 2018, by Employment. Red line shows the estimated median

Circle size corresponds to the number of responses.

Regional trends show slightly more voters from Latin America, the Middle East, and Australia, and slightly fewer from the US. The numbers were:

  • Europe, 34.9% (was 35.1%)
  • US/Canada, 34.4% (was 36.9% in 2016)
  • Asia, 15.6% (was 17%)
  • Latin America, 6.9% (was 5.6%)
  • Africa/Middle East, 4.9% (was 3.2%)
  • Australia/NZ, 3.2% (was 2.3%)
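The regional shifts are easier to compare as percentage-point changes. The sketch below uses only the figures listed above and assumes that every "was" value refers to the 2016 poll (the list states the year explicitly only for US/Canada). The names `regions` and `changes` are illustrative.

```python
# Percentage-point change in regional share of voters, 2018 vs. the previous
# poll (assumed to be 2016 for every region). Values are (2018 %, previous %).
regions = {
    "Europe": (34.9, 35.1),
    "US/Canada": (34.4, 36.9),
    "Asia": (15.6, 17.0),
    "Latin America": (6.9, 5.6),
    "Africa/Middle East": (4.9, 3.2),
    "Australia/NZ": (3.2, 2.3),
}

# Round to one decimal place to match the precision of the published shares.
changes = {
    region: round(now - before, 1) for region, (now, before) in regions.items()
}

# List regions from the largest gain to the largest loss.
for region, delta in sorted(changes.items(), key=lambda item: item[1], reverse=True):
    print(f"{region}: {delta:+.1f} percentage points")
```

On these numbers the three smaller regions each gained between roughly one and two percentage points, while US/Canada shows the largest decline.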

Finally, we examine the largest dataset analyzed by both employment and region for the 3 largest regions.

Read the rest on KDnuggets:

https://www.kdnuggets.com/2018/10/poll-results-largest-dataset-analyzed.html

Gene Ferruzza

SVP Decision Sciences

Interesting survey, particularly the international input.  I'm curious if you think this is a measure of available data for analysis, or the typical size of an assembled dataset?  In most cases the former is constantly growing.  Thanks for conducting this survey!

📊 Alastair Muir, PhD, BSc, BEd, MBB

Data Science Consultant | @alastairmuir.bsky.social | Risk Analysis and Optimization | Causal Inference

It would be interesting to include the maximum number of records and variables rather than just file sizes. I find problems become more complex with large numbers of data points.
