What Does ‘Big Data’ Mean and Who Will Win?

Michael Stonebraker

The Meaning of Big Data - 3 V’s

• Big Volume
— With simple (SQL) analytics

— With complex (non-SQL) analytics

• Big Velocity
— Drink from the fire hose

• Big Variety
— Large number of diverse data sources to integrate

Big Volume - Little Analytics

• Well addressed by data warehouse crowd

• Who are pretty good at SQL analytics on
— Hundreds of nodes

— Petabytes of data

• I know of a dozen or so multi-petabyte data
warehouses in production from multiple vendors on
more than 100 nodes of iron….

Big Data - Big Analytics

• Complex math operations (machine learning, clustering,
trend detection, ….)
— In your market, the world of the “quants”

— Mostly specified as linear algebra on array data

• A dozen or so common ‘inner loops’
— Matrix multiply

— QR decomposition

— Singular value decomposition (SVD)

— Linear regression

Big Data - Big Analytics
An Example

• Consider closing price on all trading days for the last
5 years for two stocks A and B

• What is the covariance between the two time series?
cov(A, B) = (1/N) * sum_i [(A_i - mean(A)) * (B_i - mean(B))]
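
As a minimal sketch, the covariance formula above can be written directly in Python. The price lists are made-up illustration values, not real market data, and the function name `covariance` is an assumption of this example.

```python
from statistics import fmean

def covariance(a, b):
    """Population covariance: (1/N) * sum((a_i - mean(a)) * (b_i - mean(b)))."""
    if len(a) != len(b) or not a:
        raise ValueError("series must be non-empty and of equal length")
    mean_a, mean_b = fmean(a), fmean(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)

# Made-up closing prices for two stocks over four trading days.
prices_a = [10.0, 11.0, 12.0, 13.0]
prices_b = [20.0, 22.0, 21.0, 25.0]
print(covariance(prices_a, prices_b))
```

Note that this divides by N, as on the slide, rather than by N - 1 as a sample covariance would.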

Now Make It Interesting …

• Do this for all pairs of 4000 stocks
— The data is the following 4000 x 1000 matrix

Stock t1 t2 t3 t4 t5 t6 t7 …. t1000

S1
S2
…
S4000

Hourly data? All securities?

Array Answer

• Ignoring the (1/N) and subtracting off the
means ….

Stock * Stock^T

• Try this in SQL with some relational simulation
of the stock array!!!!
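
The array formulation can be sketched in plain Python as follows: center each row, multiply the matrix by its transpose, and scale by 1/N. The tiny 3 x 4 matrix is made up for illustration. A real 4000 x 1000 workload would go to an array engine or a linear algebra library rather than nested Python loops, so treat this only as a statement of the computation.

```python
def center_rows(matrix):
    """Subtract each row's mean from every element of that row."""
    centered = []
    for row in matrix:
        mean = sum(row) / len(row)
        centered.append([x - mean for x in row])
    return centered

def covariance_matrix(stock):
    """Return (1/N) * C * C^T, where C is the row-centered stock matrix."""
    c = center_rows(stock)
    n = len(stock[0])  # number of time points per stock
    return [[sum(x * y for x, y in zip(ri, rj)) / n for rj in c] for ri in c]

# Made-up data: 3 stocks (rows) by 4 time points (columns).
stock = [
    [10.0, 11.0, 12.0, 13.0],
    [20.0, 22.0, 21.0, 25.0],
    [5.0, 4.0, 6.0, 5.0],
]
cov = covariance_matrix(stock)
for row in cov:
    print(row)
```

Entry `cov[i][j]` is the covariance between stock i and stock j, so the result is symmetric and its diagonal holds each stock's variance.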

Solution Options
R, SAS, MATLAB, et al

• Weak or non-existent data management
— Do the correlation only for companies with revenue > $1B ?

• File system storage

• R doesn’t scale and is not a parallel system
— Revolution does a bit better

Solution Options
RDBMS alone

• SQL simulator (MADlib) is slooooow
— And only does some of the required operations

• Coding operations as UDFs still requires you to
simulate arrays on top of tables --- sloooow
— And current UDF model not powerful enough to
support iteration
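
To make "arrays on top of tables" concrete, here is one possible relational simulation, sketched with Python's built-in `sqlite3` module. The table name `price`, its columns, and the data are assumptions made for this example. Each matrix cell becomes one row, and the pairwise covariance becomes a self-join on the time index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE price (stock TEXT, t INTEGER, close REAL, PRIMARY KEY (stock, t))"
)
# Made-up data: one row per (stock, time point) cell of the array.
rows = [("A", t, v) for t, v in enumerate([10.0, 11.0, 12.0, 13.0])]
rows += [("B", t, v) for t, v in enumerate([20.0, 22.0, 21.0, 25.0])]
conn.executemany("INSERT INTO price VALUES (?, ?, ?)", rows)

query = """
WITH centered AS (
    SELECT p.stock, p.t, p.close - m.mean_close AS dev
    FROM price AS p
    JOIN (SELECT stock, AVG(close) AS mean_close
          FROM price GROUP BY stock) AS m
      ON m.stock = p.stock
)
SELECT a.stock, b.stock, SUM(a.dev * b.dev) / COUNT(*) AS cov
FROM centered AS a
JOIN centered AS b ON a.t = b.t
WHERE a.stock <= b.stock
GROUP BY a.stock, b.stock
ORDER BY a.stock, b.stock
"""
result = {(s1, s2): c for s1, s2, c in conn.execute(query)}
for pair, value in result.items():
    print(pair, value)
```

The self-join touches one row per pair of stocks per time point, which hints at why the slide describes this route as slow once the array has thousands of rows.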

Solution Options
R + RDBMS

• Have to extract and transform the data from RDBMS
table to math package data format (e.g. data frames)
• ‘move the world’ nightmare
• Need to learn 2 systems
• And R still doesn’t scale and is not a parallel system

Solution Options

• New Array DBMS designed with this market in mind

An Example Array Engine DB
Paradigm4/SciDB

• All-in-one: data management with massively scalable
advanced analytics

• Data is updated; not overwritten
— Supports reproducibility for research and compliance
— Time-series data
— Scenario testing

• Supports uncertain data, provenance

• Open source

• Runs in cloud or private grid of commodity HW

Solution Options: Hadoop

• Simple analytics (Hive queries)
— 100 times slower than a parallel DBMS

• Complex analytics (Mahout or roll-your-own)
— 100 times slower than ScaLAPACK

• Parallel programming
— Parallel grep (great)

— Everything else (awful)

• Hadoop lacks
— Stateful computations

— Point-to-point communication

Solution Options: Hadoop

• Lots and lots and lots of people are piloting Hadoop
• Many will hit a scalability wall when they get to
production
— Unless they are doing parallel grep

• My prediction: the bloom will come off the rose

Big Velocity

• Sensor tagging everything of material
significance, and reporting state in real
time, sends volumes through the roof
— Including shopping carts (customers)
— And retail items
— Marathon runners
— Library books
— Broken equipment

• Breaks all your infrastructure

• And it will just get worse

Two Different Solutions

• Big pattern - little state (electronic trading)
— Find me a ‘strawberry’ followed within
100 msec by a ‘banana’

• Complex event processing (CEP) is focused
on this problem
— Patterns in a firehose
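
The "strawberry followed by banana" pattern can be sketched as a small streaming matcher. This is an illustrative toy, not how any particular CEP product works. The function name `find_followed_by` and the event format `(timestamp_ms, symbol)` are assumptions of this example. The only state kept is a short queue of recent first-symbol timestamps, which is the "little state" part.

```python
from collections import deque

def find_followed_by(events, first, second, window_ms):
    """Yield (t_first, t_second) when `second` arrives within window_ms of `first`.

    `events` is an iterable of (timestamp_ms, symbol) pairs in timestamp order.
    """
    pending = deque()  # timestamps of `first` events still inside the window
    for ts, symbol in events:
        # Expire `first` events that are too old to match anything.
        while pending and ts - pending[0] > window_ms:
            pending.popleft()
        if symbol == second:
            for t_first in pending:
                yield (t_first, ts)
        if symbol == first:
            pending.append(ts)

# Made-up event stream.
stream = [
    (0, "strawberry"),
    (40, "apple"),
    (90, "banana"),
    (300, "strawberry"),
    (450, "banana"),
]
matches = list(find_followed_by(stream, "strawberry", "banana", 100))
print(matches)
```

In this sketch a single first event may pair with several later second events. Whether a match should consume the first event is a design choice that the slide leaves open.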

Two Different Solutions

• Big state - little pattern
— For every customer location in a store,
decide whether to offer a real-time
coupon for an item at eye level

• Looks like high performance OLTP
— Want to update a database at very high
speed

• Looks exactly like ad placement on the web

My Suspicion

• You have 10 Big state - little pattern
problems for every one Big pattern - little
state problem

New OLTP

• You need to ingest a fire
hose in real-time

• You need to perform high
volume OLTP

• You often need real-time
analytics

Solution Choices

• Old SQL
— The elephants

• No SQL
— 75 or so vendors giving up both SQL and ACID

• New SQL
— Retain SQL and ACID but go fast with a new
architecture

Why Not Use Old SQL?

• Sloooow
— By a couple orders of magnitude

• Because of
— Disk

— Heavy-weight transactions

— Multi-threading

• See “Through the OLTP Looking Glass”
— VLDB 2007

No SQL

• Give up SQL
— Interesting to note that Cassandra and Mongo are
moving to (yup) SQL

• Give up ACID
— If you need ACID, this is a decision to tear your
hair out by doing it in user code
— Can you guarantee you won’t
need ACID tomorrow?

VoltDB: an example of New SQL

• A main memory SQL engine

• Open source

• Shared nothing, Linux, TCP/IP on jelly beans

• Light-weight transactions
— Run-to-completion with no locking

• Single-threaded
— Multi-core by splitting main memory

• About 100x RDBMS on TPC-C
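
The combination of single-threaded execution, run-to-completion transactions, and memory split across cores can be illustrated with a toy in-process sketch. This is not VoltDB's API or implementation. The classes `Partition` and `Store` and the `deposit` helper are hypothetical names invented for this example. Python threads also do not give true multi-core parallelism, so the sketch shows only the routing and serial-execution idea.

```python
import queue
import threading

class Partition:
    """One slice of memory, owned by exactly one worker thread."""

    def __init__(self):
        self.data = {}
        self.inbox = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            txn, reply = self.inbox.get()
            # The transaction runs to completion before the next one is
            # dequeued, so self.data is never touched by two threads at once.
            reply.put(txn(self.data))

class Store:
    """Hypothetical front end that routes each transaction to one partition."""

    def __init__(self, n_partitions):
        self.partitions = [Partition() for _ in range(n_partitions)]

    def execute(self, key, txn):
        partition = self.partitions[hash(key) % len(self.partitions)]
        reply = queue.Queue(maxsize=1)
        partition.inbox.put((txn, reply))
        return reply.get()

def deposit(account, amount):
    """Build a read-modify-write transaction for one account."""
    def txn(data):
        data[account] = data.get(account, 0) + amount
        return data[account]
    return txn

def client(store, account, n):
    for _ in range(n):
        store.execute(account, deposit(account, 5))

store = Store(n_partitions=4)
clients = [threading.Thread(target=client, args=(store, "alice", 25)) for _ in range(4)]
for c in clients:
    c.start()
for c in clients:
    c.join()
balance = store.execute("alice", lambda data: data["alice"])
print(balance)
```

Four concurrent clients update the same account without any lock in the transaction code, because all transactions for a given key are serialized through one partition's queue. The sketch omits error handling, durability, and multi-partition transactions.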

Big Variety – Traditional Solution

• Construct a global schema
• Have a programmer understand each data source –
and map local objects to the global schema
• Using some scripting language
— Sometimes this is very expensive…..

— I.e. ask the customer

• And then you have the problem of cleaning the
data….
— More scripts

• Works for 20 (or so) data sources
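
A minimal sketch of the traditional approach, assuming two invented sources (`crm` and `billing`) and an invented three-field global schema: a programmer writes one mapping per source by hand, and a script applies it.

```python
# Hypothetical global schema for this example.
GLOBAL_FIELDS = ("customer_name", "city", "revenue_usd")

# One hand-written mapping per source: global field -> local field name.
SOURCE_MAPPINGS = {
    "crm": {"customer_name": "name", "city": "town", "revenue_usd": "rev"},
    "billing": {"customer_name": "cust", "city": "city", "revenue_usd": "annual_revenue"},
}

def to_global(source, record):
    """Translate one local record into the global schema (missing values become None)."""
    mapping = SOURCE_MAPPINGS[source]
    return {field: record.get(mapping[field]) for field in GLOBAL_FIELDS}

crm_row = {"name": "Acme", "town": "Boston", "rev": 1000}
billing_row = {"cust": "Acme", "city": "Boston"}
print(to_global("crm", crm_row))
print(to_global("billing", billing_row))
```

Every new source needs another hand-written entry in `SOURCE_MAPPINGS`, plus cleaning scripts, which is consistent with the slide's point that the approach stops scaling after a few dozen sources.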

Big Variety

• Typical enterprise has 5000 operational systems
— Only a few get into the data warehouse

— What about the rest?

• And what about all the rest of your data?
— Spreadsheets

— Access databases

— Web pages

• And public data from the web?

The World of Data Integration

[Figure: diagram with regions labeled "data warehouse", "enterprise text", and "the rest of your data"]

Summary

• The rest of your data (public and private)
— Is a treasure trove of incredibly valuable information

— Largely untapped

Data Tamer

• Integrate the rest of your data

• Has to
— Be scalable to 1000s of sites

— Deal with incomplete, conflicting, and incorrect data

— Be incremental

• Task is never done

Data Tamer in a Nutshell

• Apply machine learning and statistics to perform
automatic:
— Discovery of structure

— Entity resolution

— Transformation

• With a human assist if necessary
— WYSIWYG tool (Wrangler)
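
As a rough illustration of entity resolution with a human assist, the sketch below clusters names by string similarity and sets aside borderline cases for review. It is not Data Tamer's algorithm: plain `difflib` similarity stands in for the machine learning the slide mentions, and the function names and both thresholds are assumptions of this example.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def resolve(records, auto=0.9, review=0.7):
    """Greedily cluster records; return (clusters, review_queue).

    A record joins the best-matching cluster if its score is at least `auto`.
    Scores between `review` and `auto` are queued for a human decision.
    Anything lower starts a new cluster.
    """
    clusters = []      # each cluster is a list; its first element is the representative
    review_queue = []  # (record, candidate representative, score)
    for record in records:
        best_score, best_cluster = 0.0, None
        for cluster in clusters:
            score = similarity(record, cluster[0])
            if score > best_score:
                best_score, best_cluster = score, cluster
        if best_cluster is not None and best_score >= auto:
            best_cluster.append(record)
        elif best_cluster is not None and best_score >= review:
            review_queue.append((record, best_cluster[0], best_score))
        else:
            clusters.append([record])
    return clusters, review_queue

# Made-up company names.
names = ["Acme Corp", "ACME Corp.", "Globex", "Acme Corporation"]
clusters, review_queue = resolve(names)
print(clusters)
print(review_queue)
```

Here the two near-identical spellings merge automatically, while "Acme Corporation" is close enough to be suspicious but not close enough to merge, so it goes to the human reviewer.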

Data Tamer

• MIT research project

• Looking for more integration problems
— Wanna partner?

Take away

• One size does not fit all

• Plan on (say) 6 DBMS architectures
— Use the right tool for the job

• Elephants are not competitive
— At anything

— Have a bad ‘innovator’s dilemma’ problem
