What Does ‘Big Data’ Mean
   and Who Will Win?


   Michael Stonebraker
The Meaning of Big Data - 3 V’s

• Big Volume
   — With simple (SQL) analytics

   — With complex (non-SQL) analytics



• Big Velocity
   — Drink from the fire hose



• Big Variety
   — Large number of diverse data sources to integrate




Big Volume - Little Analytics

• Well addressed by data warehouse crowd

• Who are pretty good at SQL analytics on
  — Hundreds of nodes

  — Petabytes of data



• I know of a dozen or so multi-petabyte data
  warehouses in production from multiple vendors on
  more than 100 nodes of iron….


Big Data - Big Analytics

• Complex math operations (machine learning, clustering,
  trend detection, ….)
   — In your market, the world of the “quants”

   — Mostly specified as linear algebra on array data



• A dozen or so common ‘inner loops’
   — Matrix multiply

   — QR decomposition

   — SVD (singular value decomposition)

   — Linear regression




Big Data - Big Analytics
                     An Example


• Consider closing price on all trading days for the last
  5 years for two stocks A and B

• What is the covariance between the two time
  series?
      cov(A, B) = (1/N) * sum_i (A_i - mean(A)) * (B_i - mean(B))
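
A minimal sketch of the formula above in plain Python. The function name `covariance` and the sample prices are made up for illustration; the (1/N) population form is used, as on the slide.

```python
def covariance(a, b):
    """Population covariance: (1/N) * sum_i (a_i - mean(a)) * (b_i - mean(b))."""
    if len(a) != len(b) or not a:
        raise ValueError("series must be non-empty and of equal length")
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n


# Hypothetical closing prices for two stocks over five trading days.
prices_a = [10.0, 11.0, 12.0, 11.0, 13.0]
prices_b = [20.0, 21.0, 23.0, 22.0, 25.0]
print(covariance(prices_a, prices_b))
```

For two stocks this is a single pass over the data; the cost only becomes interesting once every pair of stocks is involved.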




Now Make It Interesting …

• Do this for all pairs of 4000 stocks
   — The data is the following 4000 x 1000 matrix


Stock     t1     t2    t3    t4     t5   t6   t7   ….   t1000

S1
S2
…
S4000

     Hourly data?     All securities?

Array Answer

• Ignoring the (1/N) and subtracting off the
  means ….

           Stock * Stock^T

• Try this in SQL with some relational simulation
  of the stock array!!!!
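
The array formulation can be sketched in plain Python on a tiny stand-in for the 4000 x 1000 matrix. The helper names `center_rows` and `covariance_matrix` and the 3 x 4 data are assumptions for illustration; a real system would use an optimized linear algebra library rather than nested Python loops.

```python
def center_rows(matrix):
    """Subtract each row's mean from every entry in that row."""
    centered = []
    for row in matrix:
        mean = sum(row) / len(row)
        centered.append([x - mean for x in row])
    return centered


def covariance_matrix(matrix):
    """All-pairs covariance: (1/N) * C * C^T, where C is the row-centered matrix."""
    c = center_rows(matrix)
    n = len(matrix[0])
    return [
        [sum(x * y for x, y in zip(row_i, row_j)) / n for row_j in c]
        for row_i in c
    ]


# Hypothetical data: one row per stock, one column per time step.
stock = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [4.0, 3.0, 2.0, 1.0],
]
cov = covariance_matrix(stock)
for row in cov:
    print(row)
```

Entry `cov[i][j]` is the covariance between stock i and stock j, so the whole all-pairs problem reduces to one matrix multiply.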




Solution Options
                    R, SAS, Matlab, et al


• Weak or non-existent data management
   —   Do the correlation only for companies with revenue > $1B ?


• File system storage

• R doesn’t scale and is not a parallel system
   — Revolution does a bit better




Solution Options
                    RDBMS alone


• SQL simulator (MADlib) is slooooow
   — And only does some of the required operations



• Coding operations as UDFs still requires you to
  simulate arrays on top of tables --- sloooow
   — And current UDF model not powerful enough to
     support iteration
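
To make "simulating arrays on top of tables" concrete, here is a small sketch using Python's built-in `sqlite3` module. The table `price` and its columns are hypothetical, and this is not how MADlib or any particular product does it; it only shows the shape of the query once each array cell becomes one row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One row per array cell: the relational simulation of a stock x time matrix.
conn.execute("CREATE TABLE price (stock TEXT, t INTEGER, closing REAL)")
rows = [
    ("S1", 1, 1.0), ("S1", 2, 2.0), ("S1", 3, 3.0), ("S1", 4, 4.0),
    ("S2", 1, 2.0), ("S2", 2, 4.0), ("S2", 3, 6.0), ("S2", 4, 8.0),
]
conn.executemany("INSERT INTO price VALUES (?, ?, ?)", rows)

# All-pairs covariance needs a self-join on the time index plus a
# per-stock mean, looked up once for each side of the pair.
query = """
WITH stock_mean AS (
    SELECT stock, AVG(closing) AS mean_closing FROM price GROUP BY stock
)
SELECT a.stock, b.stock,
       SUM((a.closing - ma.mean_closing) * (b.closing - mb.mean_closing))
           / COUNT(*) AS cov
FROM price AS a
JOIN price AS b ON a.t = b.t
JOIN stock_mean AS ma ON ma.stock = a.stock
JOIN stock_mean AS mb ON mb.stock = b.stock
GROUP BY a.stock, b.stock
ORDER BY a.stock, b.stock
"""
result = {(s1, s2): cov for s1, s2, cov in conn.execute(query)}
print(result)
```

With 4000 stocks and 1000 time steps, the self-join pairs up 4000 x 4000 x 1000 (16 billion) cells before aggregating, which is the cost the slide is pointing at.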




Solution Options
                       R + RDBMS


• Have to extract and transform the data from RDBMS
  table to math package data format (e.g. data frames)
• ‘move the world’ nightmare
• Need to learn 2 systems
• And R still doesn’t scale and is not a parallel system




Solution Options


• New Array DBMS designed with this market in mind




An Example Array Engine DB
                       Paradigm4/SciDB

• All-in-one: data management with massively scalable
  advanced analytics

• Data is updated; not overwritten
   —   Supports reproducibility for research and compliance
   —   Time-series data
   —   Scenario testing

• Supports uncertain data, provenance

• Open source

• Runs in cloud or private grid of commodity HW

Solution Options: Hadoop


• Simple analytics (Hive queries)
   — 100 times slower than a parallel DBMS

• Complex analytics (Mahout or roll-your-own)
   — 100 times slower than ScaLAPACK

• Parallel programming
   — Parallel grep (great)

   — Everything else (awful)

• Hadoop lacks
   — Stateful computations

   — Point-to-point communication




Solution Options: Hadoop


• Lots and lots and lots of people are piloting Hadoop
• Many will hit a scalability wall when they get to
  production
   — Unless they are doing parallel grep




• My prediction: the bloom will come off the rose




Big Velocity

• Sensor tagging everything of material
  significance, and reporting state in real
  time, sends volumes through the roof
  —   Including shopping carts (customers)
  —   And retail items
  —   Marathon runners
  —   Library books
  —   Broken equipment

• Breaks all your infrastructure

• And it will just get worse

Two Different Solutions

• Big pattern - little state (electronic trading)
   — Find me a ‘strawberry’ followed within
     100 msec by a ‘banana’

• Complex event processing (CEP) is focused
  on this problem
   — Patterns in a firehose
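
A "big pattern, little state" query can be sketched in a few lines of Python. The function `find_followed_by`, the event format (timestamp in milliseconds, symbol), and the matching semantics (every first-event still inside the window pairs with each later second-event, and the two symbols are distinct) are assumptions for illustration; real CEP engines offer much richer pattern languages.

```python
from collections import deque


def find_followed_by(events, first, second, window_ms):
    """Yield (t_first, t_second) for each `second` seen within window_ms of a `first`.

    `events` is an iterable of (timestamp_ms, symbol) pairs in time order.
    """
    pending = deque()  # timestamps of `first` events still inside the window
    for ts, symbol in events:
        # Expire `first` events that are too old to match anything.
        while pending and ts - pending[0] > window_ms:
            pending.popleft()
        if symbol == second:
            for t_first in pending:
                yield (t_first, ts)
        elif symbol == first:
            pending.append(ts)


# Hypothetical event stream.
events = [
    (0, "strawberry"), (50, "banana"), (120, "apple"),
    (200, "strawberry"), (350, "banana"),
    (400, "strawberry"), (500, "banana"),
]
matches = list(find_followed_by(events, "strawberry", "banana", window_ms=100))
print(matches)  # [(0, 50), (400, 500)]
```

The only state kept is the handful of timestamps inside the 100 msec window, which is why this class of problem is "little state" no matter how fast the firehose runs.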




Two Different Solutions

• Big state - little pattern
   — For every customer location in a store,
     decide whether to offer a real-time
     coupon for an item at eye level

• Looks like high performance OLTP
   — Want to update a database at very high
     speed

• Looks exactly like ad placement on the web

My Suspicion

• You have 10 Big state - little pattern
  problems for every one Big pattern - little
  state problem




New OLTP

• You need to ingest a fire
  hose in real-time

• You need to perform high
  volume OLTP

• You often need real-time
  analytics




Solution Choices

• Old SQL
   — The elephants


• No SQL
   — 75 or so vendors giving up both SQL and ACID


• New SQL
   — Retain SQL and ACID but go fast with a new
     architecture




Why Not Use Old SQL?

• Sloooow
   — By a couple orders of magnitude



• Because of
   — Disk

   — Heavy-weight transactions

   — Multi-threading



• See “Through the OLTP Looking Glass”
   — VLDB 2007




No SQL

• Give up SQL
   — Interesting to note that
     Cassandra and Mongo are
     moving to (yup) SQL

• Give up ACID
   — If you need ACID, this is a
     decision to tear your hair out
     by doing it in user code
   — Can you guarantee you won’t
     need ACID tomorrow?


VoltDB: an example of New SQL

• A main memory SQL engine

• Open source

• Shared nothing, Linux, TCP/IP on jelly beans

• Light-weight transactions
   — Run-to-completion with no locking


• Single-threaded
   — Multi-core by splitting main memory


• About 100x faster than a traditional RDBMS on TPC-C
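
The idea of single-threaded, run-to-completion execution over split main memory can be illustrated with a toy in Python. This is not VoltDB's implementation or API: the `Partition` and `Store` classes, the `deposit` transaction, and hash-based routing are all assumptions made for the sketch, and it handles only single-partition transactions with no durability or rollback.

```python
import queue
import threading


class Partition:
    """One slice of main memory, owned by exactly one worker thread."""

    def __init__(self):
        self.data = {}
        self.inbox = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        # Transactions run one at a time, start to finish, so `data`
        # is never touched by two threads and needs no locks of its own.
        while True:
            txn, reply = self.inbox.get()
            try:
                reply.put((True, txn(self.data)))
            except Exception as exc:  # report failure to the caller
                reply.put((False, exc))


class Store:
    """Routes each single-partition transaction to the partition owning its key."""

    def __init__(self, n_partitions):
        self.partitions = [Partition() for _ in range(n_partitions)]

    def execute(self, key, txn):
        partition = self.partitions[hash(key) % len(self.partitions)]
        reply = queue.Queue(maxsize=1)
        partition.inbox.put((txn, reply))
        ok, value = reply.get()
        if not ok:
            raise value
        return value


def deposit(account, amount):
    """Build a transaction that adds `amount` to `account` and returns the balance."""
    def txn(data):
        data[account] = data.get(account, 0) + amount
        return data[account]
    return txn


store = Store(n_partitions=4)
for _ in range(100):
    store.execute("alice", deposit("alice", 1))
balance = store.execute("alice", deposit("alice", 0))
print(balance)
```

Each partition uses one core's worth of work, so more cores are exploited by adding partitions rather than by adding threads that contend for the same data.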


Big Variety – Traditional Solution

• Construct a global schema
• Have a programmer understand each data source –
  and map local objects to the global schema
• Using some scripting language
   — Sometimes this is very expensive…..

   — I.e. ask the customer

• And then you have the problem of cleaning the
  data….
   — More scripts



• Works for 20 (or so) data sources

Big Variety

• Typical enterprise has 5000 operational systems
   — Only a few get into the data warehouse

   — What about the rest?



• And what about all the rest of your data?
   — Spreadsheets

   — Access databases

   — Web pages



• And public data from the web?


The World of Data Integration

   [Figure: diagram with regions labeled “the rest of your data”, “enterprise data warehouse”, and “text”]




Summary

• The rest of your data (public and private)
   — Is a treasure trove of incredibly valuable
     information

   —   Largely untapped




Data Tamer

• Integrate the rest of your data

• Has to
   — Be scalable to 1000s of sites

   — Deal with incomplete, conflicting, and incorrect data

   — Be incremental

      • Task is never done




Data Tamer in a Nutshell

• Apply machine learning and statistics to perform
  automatic:
   — Discovery of structure

   — Entity resolution

   — Transformation



• With a human assist if necessary
  — WYSIWYG tool (Wrangler)
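
As a rough illustration of what entity resolution involves, here is a small Python sketch that groups near-duplicate names with the standard library's `difflib.SequenceMatcher`. It is not Data Tamer's method, which the slides describe only as machine learning and statistics; the greedy clustering, the `normalize` step, and the 0.8 threshold are assumptions chosen to keep the example short.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase and drop punctuation so trivial variants compare equal."""
    kept = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return kept.strip()


def similarity(a, b):
    """Similarity in [0, 1] between two names after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def resolve_entities(names, threshold=0.8):
    """Greedy clustering: join the first cluster whose first member is similar enough."""
    clusters = []
    for name in names:
        for cluster in clusters:
            if similarity(name, cluster[0]) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters


# Hypothetical records from different data sources.
names = ["Acme Corp.", "Globex Inc", "ACME Corp"]
clusters = resolve_entities(names)
print(clusters)
```

Borderline scores near the threshold are where a human assist would plausibly come in: the tool decides the clear cases and asks about the rest.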




Data Tamer

• MIT research project

• Looking for more integration problems
   — Wanna partner?




Takeaway

• One size does not fit all

• Plan on (say) 6 DBMS architectures
   — Use the right tool for the job



• Elephants are not competitive
   — At anything

   — Have a bad ‘innovator’s dilemma’ problem





