BI Isn’t Big Data, Big Data Isn’t BI
September, 2015
Mark Madsen
www.ThirdNature.net
@markmadsen
© Third Nature Inc.
Summary
Common uses and commodity technology
lead to
Novel practices
lead to
Different data and different technology needs
lead to
New architectures
lead to
Common uses and commodity technology
Our ideas about
information and
how it’s used are
outdated.
How We Think of Users
Our design point is the 
passive consumer of 
information.
Proof: methodology
▪ IT role is requirements, 
design, build, deploy, 
administer
▪ User role is run reports
Self‐serve BI is not like 
picking the right doughnut 
from a box.
How We Want Users to 
Think of Us
How We Think of Users vs. What Users Really Think
We think of BI as publishing, an old metaphor.
Publishing has value, but may not be actionable.
Planning data strategy means understanding the 
context of data use so we can build infrastructure
[Diagram: Monitor → Analyze exceptions → Analyze causes → Decide → Act; exit branches: No problem, No idea, Do nothing]
We need to focus on what people do with information
as the primary task, not on the data or the technology.
General model for organizational use of data
[Diagram: Monitor → Analyze exceptions → Analyze causes → Decide → Act; exit branches: No problem, No idea, Do nothing]
Act within the process
Usually real-time to daily
Origin of BI and data warehouse concepts
The general concept of a 
separate architecture for BI 
has been around longer, but 
this paper by Devlin and 
Murphy is the first formal 
data warehouse architecture 
and definition published.
“An architecture for a business and information system”, B. A. Devlin, P. T. Murphy, IBM Systems Journal, Vol. 27, No. 1 (1988)
Origins: in 1988 there was only big hair.
▪ No real commercial email, public internet barely started
▪ Storage state of the art: 100MB, cost $10,000/GB
▪ Oracle Applications v1 GL released; SAP goes public, 
enters US market
▪ Unix is mostly run by long‐haired freaks
▪ Mobile was this
This is the context: scarcity of data, of system resources, of automated 
systems outside core financials, of money to pay for storage.
General model for organizational use of data
[Diagram: the Monitor → Analyze exceptions → Analyze causes → Decide → Act flow with an added "Collect new data" step; exit branches: No problem, No idea, Do nothing]
Act on the process
Usually days/longer timeframe
You need to be able to support both paths
[Diagram: the Monitor → Analyze exceptions → Analyze causes → Decide → Act flow with an added "Collect new data" step]
Act on the process
Act within the process
Conventional BI, addition of EDM
Causal analysis, “data science”
The usage models for conventional BI
[Diagram: the Monitor → Analyze exceptions → Analyze causes → Decide → Act flow with an added "Collect new data" step; exit branches: No problem, No idea, Do nothing]
Act on the process
Usually days/longer timeframe
Act within the process
Usually real-time to daily
This is what we’ve been
doing with BI so far: static
reporting, dashboards,
ad-hoc query, OLAP
The usage models for analytics and “big data” 
[Diagram: the Monitor → Analyze exceptions → Analyze causes → Decide → Act flow with an added "Collect new data" step; exit branches: No problem, No idea, Do nothing]
Act on the process
Usually days/longer timeframe
Act within the process
Usually real-time to daily
Analytics and big data are
focused on new use
cases: deeper analysis,
causes, prediction,
optimizing decisions
This isn’t ad-hoc,
reporting, or OLAP.
When you first give people access to information 
that was unavailable…
OH GOD
I can see into forever
After a while it becomes the new normal
As practices evolve based on new capabilities…
A new level of complexity develops on top of the older, now better understood processes, leading to new data and analysis needs.
I never said the
“E” in EDW meant
“everything”…
What do you
mean, “Just
doughnuts?”
The data warehouse vs business agility
All the data
Common, typed, tabular data
The bottleneck is you
It’s going to get a lot worse
Not E
E
Conclusion: any methodology built on the premise that you 
must know and model all the data first is untenable 
Old market says: There’s nothing wrong with what 
you have, just keep buying new products from us
The emerging big data market has an answer…
The data lake
The data lake after a little while
TANSTAAFL
When replacing the old 
with the new (or ignoring 
the new over the old) you 
always make tradeoffs, 
and usually you won’t see 
them for a long time.
Technologies are not 
perfect replacements for 
one another. Often not 
better, only different.
“Big data is unprecedented.”
‐ Anyone involved with big data in even the 
most barely perceptible way
We’ve been here before
Source: Bill Schmarzo, EMC
“Big” is well supported by databases now
Source: Noumenal, Inc.
Orders of magnitude: 20 years ago TB, today PB
Shifts in data availability by orders of magnitude 
necessitate new means of managing and using it.
Analytics embiggens the data volume problem
Many of the processing problems are O(n²) or worse, so even moderate data volumes can be a problem for DB‐based platforms
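To make the quadratic growth concrete, the sketch below counts the comparisons an all-pairs computation (for example, a naive similarity matrix) needs as the row count grows. The row counts are illustrative assumptions, not figures from the talk.

```python
def pairwise_comparisons(n: int) -> int:
    """Number of distinct pairs among n items: n * (n - 1) / 2, which is O(n^2)."""
    return n * (n - 1) // 2


# Illustrative row counts (assumptions): each step is 10x more rows.
for rows in (1_000, 10_000, 100_000):
    print(f"{rows:>7} rows -> {pairwise_comparisons(rows):>13,} comparisons")
```

Ten times the rows costs roughly a hundred times the work, which is why a data set that is easy to store can still be hard to analyze.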
Much of the big data value comes from analytics
BI is a retrieval problem, not a computational problem.
Five basic things you can do with analytics
▪ Prediction – what is most likely to happen?
▪ Estimation – what’s the future value of a variable?
▪ Description – what relationships exist in the data?
▪ Simulation – what could happen?
▪ Prescription – what should you do?
Most people do not need special technology
[Chart: number of people (y-axis) against bigness of data (x-axis)]
The distribution of data size is about normal, yet these guys set the tone of the market today.
Analytics: This is really raw data under storage
[Chart: number of jobs (y-axis) against bigness of data (x-axis)]
Microsoft study of 174,000 analytic jobs in their cluster: median size ???
Working data for analytics is most often not big
[Chart: number of jobs (y-axis) against smallness of data (x-axis); labeled value: 14 GB]
An (overly) Simple Division of the Problem Space
[2×2 grid: computation (little to lots) against data volume (little to lots)]
Big analytics, little data
Specialized computing,
modeling problems:
supercomputing, GPUs
Big analytics, big data
Complex math over large
data volumes requires
shared nothing architectures
Little analytics, little data
The entry point; SAS, SMP
databases, even OLAP
cubes can work
Little analytics, big data
The BI/DW space, for the
most part, with work done
in databases
What makes data “big”?
Very large amounts
Hierarchical structures
Nested structures
Linked structures
Encoded values
Non‐standard (for a 
database) types
Deep structure
Human authored text
“big” is better off being defined as “complex” or “hard to manage”
Categorizing the measurement data we collect
The convenient data is the 
transactional data.
▪ Goes in the DW and is used, even 
if it isn’t the right measurement.
The inconvenient data is 
observational data.
▪ It’s not neat, clean, or designed 
into most systems of operation.
The difficult and misleading data 
is declarative data.
▪ What people say and what they 
do require ground truth.
We need an architecture that 
supports all three categories.
Transactions vs “big data”
The classic example of “structured data”
Transaction data includes:
▪ quantification details (date, value, count)
▪ reference data for explanation (product, 
customer, account)
▪ Lots of meaningful information
Reference data is usually shared across the 
organization, hence its importance. There 
are two parts:
▪ identifier to uniquely identify the subject
▪ descriptive attributes with common or 
standardized value domains
Transaction details
Reference data
Today it’s different data: observations, not transactions
Sensor data doesn’t fit well with current methods of collection and
storage, or with the technology to process and analyze it.
Big data as a type of data: Transactions vs. Events
Transactions:
▪ Each one is valuable
▪ Mutable
▪ The elements of a transaction can be aggregated easily
▪ A set of transactions does not usually have important ordering 
or dependency
Events:
▪ A single event often has no value, e.g. what is the value of one 
click in a series? Some events are extremely valuable, but this 
is only detectable within the context of other events.
▪ Elements of events are often not easily aggregated
▪ A set of events usually has a natural order and dependencies
▪ Immutable
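The contrast above can be sketched in a few lines of Python. Transactions aggregate the same way in any order, while a click stream only means something once the events are put back in sequence. The records, field names, and page labels here are hypothetical.

```python
from collections import defaultdict

# Hypothetical transactions: each row is meaningful by itself; order is irrelevant.
transactions = [
    {"customer": "c1", "amount": 20.0},
    {"customer": "c2", "amount": 5.0},
    {"customer": "c1", "amount": 12.5},
]

# Hypothetical click events: (session, sequence number, page).
events = [
    ("s1", 2, "product"),
    ("s1", 1, "home"),
    ("s1", 3, "checkout"),
    ("s2", 1, "home"),
]


def total_by_customer(rows):
    """Aggregate transactions; the result is the same for any row order."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["customer"]] += row["amount"]
    return dict(totals)


def paths_by_session(rows):
    """Rebuild each session's ordered path; a single click means little alone."""
    sessions = defaultdict(list)
    for session, _seq, page in sorted(rows, key=lambda r: (r[0], r[1])):
        sessions[session].append(page)
    return dict(sessions)


print(total_by_customer(transactions))
print(paths_by_session(events))
```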
Example “big data”: Web tracking data
USER_ID 301212631165031
SESSION_ID 590387153892659
VISIT_DATE 1/10/2010 0:00
SESSION_START_DATE 1:41:44 AM
PAGE_VIEW_DATE 1/10/2010 9:59
DESTINATION_URL https://www.phisherking.com/gifts/store/LogonForm?mmc=link‐src‐email‐_‐m100109‐_‐44IOJ1‐_‐shop&langId=‐1&storeId=1055&URL=BECGiftListItemDisplay
REFERRAL_NAME Google.com
REFERRAL_URL http://www.google.com/search?sourceid=navclient&aq=0h&oq=Italian&ie=UTF8&rlz=1T4ACGW_enUS386US387&q=italian+rose&fu=0&ifi=1&dtd=204&xpc=1KoLqh374s
PAGE_ID PROD_24259_CARD
REL_PRODUCTS PROD_24654_CARD, PROD_3648_FLOWERS
SITE_LOCATION_NAME VALENTINE'S DAY MICROSITE
SITE_LOCATION_ID SHOP‐BY‐HOLIDAY VALENTINES DAY
IP_ADDRESS 67.189.110.179
BROWSER_OS_NAME MOZILLA/4.0 (COMPATIBLE; MSIE 7.0; AOL 9.0; WINDOWS NT 5.1; TRIDENT/4.0; GTB6; .NET CLR 1.1.4322)
Web tracking data has a nested structure
USER_ID 301212631165031
SESSION_ID 590387153892659
VISIT_DATE 1/10/2010 0:00
SESSION_START_DATE 1:41:44 AM
PAGE_VIEW_DATE 1/10/2010 9:59
DESTINATION_URL https://www.phisherking.com/gifts/store/LogonForm?mmc=link‐src‐email‐_‐m100109‐_‐44IOJ1‐_‐shop&langId=‐1&storeId=1055&URL=BECGiftListItemDisplay
REFERRAL_NAME Direct
REFERRAL_URL ‐
PAGE_ID PROD_24259_CARD
REL_PRODUCTS PROD_24654_CARD, PROD_3648_FLOWERS
SITE_LOCATION_NAME VALENTINE'S DAY MICROSITE
SITE_LOCATION_ID SHOP‐BY‐HOLIDAY VALENTINES DAY
IP_ADDRESS 67.189.110.179
BROWSER_OS_NAME MOZILLA/4.0 (COMPATIBLE; MSIE 7.0; AOL 9.0; WINDOWS NT 5.1; TRIDENT/4.0; GTB6; .NET CLR 1.1.4322)
“unstructured” data
embedded in the
logged message:
complex strings
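As a rough illustration of the structure hidden inside those strings, the sketch below unpacks a referral URL and a delimited field using Python's standard library. The URL is the referral URL from the sample record, written with a plain google.com host and rejoined across line breaks; that normalization is an assumption.

```python
from urllib.parse import parse_qs, urlsplit

# Referral URL from the sample record, with a plain host and rejoined lines
# (an assumption about the original logged string).
referral_url = (
    "http://www.google.com/search?sourceid=navclient&aq=0h&oq=Italian"
    "&ie=UTF8&rlz=1T4ACGW_enUS386US387&q=italian+rose&fu=0&ifi=1&dtd=204"
    "&xpc=1KoLqh374s"
)

parts = urlsplit(referral_url)
params = parse_qs(parts.query)  # each key maps to a list of values

search_terms = params["q"][0]  # '+' in a query string decodes to a space
print(parts.netloc, "->", search_terms)

# REL_PRODUCTS is a delimited list packed into a single field.
rel_products = [p.strip() for p in "PROD_24654_CARD, PROD_3648_FLOWERS".split(",")]
print(rel_products)
```

One logged field yields a host, a search phrase, and a set of tracking parameters, each of which may need its own reference data.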
The missing ingredient from most big data
The creation, flow and use of data is different for 
transactions and machine‐generated events
[Diagram labels, transaction flow: Data entry, Store, Extract, Cleanse, Load, Use; MDM]
This runs at human speed
[Diagram labels, event flow: Program, Generate, Capture, Store, Cleanse, Use]
This runs at machine speed, with higher latency feedback cycles
We collect large volumes of text, a rare practice 
ten years ago. Today we can turn text into data.
Categories,
taxonomies
Topics, genres,
relationships,
abstracts
Sentiment, tone,
opinion
Words & counts,
keywords, tags
Entities
people, places,
things, events, IDs
You can store this data in an RDBMS, but…
Example data: Twitter Message API Payload
Looks like:
This is really just a record format
much like a DB row.
Datetime, userID, name, location,
description, message, message
metadata, etc.
But it’s in JSON or XML.
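A minimal sketch of the "record in JSON" point: a nested payload can be flattened into one row of columns. The payload and its field names are hypothetical and are not the real Twitter API schema.

```python
import json

# Hypothetical payload: field names are illustrative, not the real Twitter API schema.
payload = """
{
  "created_at": "2013-01-10T09:59:00Z",
  "user": {"id": 42, "name": "example_user", "location": "Portland"},
  "text": "Hello #bigdata",
  "metadata": {"lang": "en"}
}
"""


def flatten(record, prefix=""):
    """Flatten nested objects into a single row of dotted column names."""
    row = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            row.update(flatten(value, prefix=f"{name}."))
        else:
            row[name] = value
    return row


row = flatten(json.loads(payload))
print(sorted(row))
```

Flattening works for simple nesting; repeated or variable-length elements are where a fixed relational row starts to strain.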
A tweet has lots of fields, but one important one
The payload is free text but has other elements:
@markmadsen Check out: From #MongoDB to #Cassandra: Why The Atlas Platform Is Migrating http://owl.li/cvxFK
[Callouts: ‘To’ username, hashtag, hashtag, URL]
From these things you likely want to generate or link to reference data.
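Pulling those elements out of the free text might look like the sketch below. The patterns are deliberately simplified assumptions, and the shortened link is swapped for a placeholder URL.

```python
import re

# Tweet text from the slide; the shortened link is replaced by a placeholder URL.
tweet = (
    "@markmadsen Check out: From #MongoDB to #Cassandra: "
    "Why The Atlas Platform Is Migrating http://example.com/cvxFK"
)

# Simplified patterns (assumptions); real tweet parsing has more edge cases.
MENTION = re.compile(r"@(\w+)")
HASHTAG = re.compile(r"#(\w+)")
URL = re.compile(r"https?://\S+")

entities = {
    "mentions": MENTION.findall(tweet),
    "hashtags": HASHTAG.findall(tweet),
    "urls": URL.findall(tweet),
}
print(entities)
```

Each extracted value is a candidate key into reference data: a user, a topic, or a linked document.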
Internal payload elements form a new graph
The @elements point to 
other records and create a 
deeply linked structure.
You have to assemble the linked structure to see what’s really there, which means repeatedly scanning some or all of the data.
The derived pattern is interesting data, sometimes more interesting than the individual messages.
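One way to picture assembling the linked structure is to scan every message and count who mentions whom. The messages and user names below are hypothetical, and the mention pattern is a simplification.

```python
import re
from collections import defaultdict

# Hypothetical messages as (author, text) pairs.
messages = [
    ("ann", "@bob have you seen this?"),
    ("bob", "@ann yes, looping in @cat"),
    ("cat", "@ann @bob thanks both"),
    ("ann", "@bob one more thing"),
]

MENTION = re.compile(r"@(\w+)")


def mention_graph(msgs):
    """Count directed author -> mentioned-user edges across all messages."""
    edges = defaultdict(int)
    for author, text in msgs:
        for target in MENTION.findall(text):
            edges[(author, target)] += 1
    return dict(edges)


graph = mention_graph(messages)
for (src, dst), weight in sorted(graph.items()):
    print(f"{src} -> {dst}: {weight}")
```

The edge list is a new data set derived from the messages, and it has to be rebuilt or updated whenever new messages arrive.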
There are many patterns in the data
Follower / following networks are easy – they are explicit 
and independent of the events.
Community detection requires looking at patterns of @ 
communication in addition to follow relationships.
What do you do with these after discovery?
Follower network Conversational communities
More data: patterns emerge from lots of event data
Patterns emerge from 
the underlying structure 
of the entire dataset.
The patterns are more 
interesting than sums 
and counts of the events.
Web paths: clicks in a 
session as network node 
traversal.
Email: traffic analysis 
producing a network
The event stream is a source for analysis, generating
another set of data that is the source for different analysis.
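The web path case can be sketched by treating each session as a walk over pages and counting the page-to-page transitions. The sessions and page names are hypothetical.

```python
from collections import Counter

# Hypothetical sessions: each is an ordered list of pages visited.
sessions = {
    "s1": ["home", "search", "product", "cart"],
    "s2": ["home", "product", "cart"],
    "s3": ["home", "search", "product"],
}


def transition_counts(paths):
    """Count page-to-page transitions: the edges of the navigation network."""
    counts = Counter()
    for pages in paths.values():
        counts.update(zip(pages, pages[1:]))
    return counts


edges = transition_counts(sessions)
print(edges.most_common(3))
```

The transition counts are themselves a derived data set that can feed further analysis, such as finding common paths or drop-off points.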
Big changes for data warehousing workloads
The results of analytic processing can, and often do, feed back into the system from which they originate.
Much of the data is being 
read, written and 
processed in real time.
Our design point was not 
changing tables and 
ephemeral patterns.
Unstructured is Not Really Unstructured
Unstructured data isn’t 
really unstructured: 
language has structure.   
Text can contain traditional 
structured data elements. 
The problem is that the 
content is unmodeled.
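A small sketch of structured elements sitting inside free text: once you decide what to look for, which is the modeling step, extraction is straightforward. The note and the patterns are hypothetical.

```python
import re

# Hypothetical free-text note containing conventional structured elements.
note = "Order 10452 shipped on 2015-09-14 to Denver; refund of $25.00 issued."

# Simplified patterns (assumptions); choosing them is the modeling work.
patterns = {
    "order_id": re.compile(r"\bOrder (\d+)\b"),
    "date": re.compile(r"\b(\d{4}-\d{2}-\d{2})\b"),
    "amount": re.compile(r"\$(\d+\.\d{2})"),
}

record = {}
for field, pattern in patterns.items():
    match = pattern.search(note)
    record[field] = match.group(1) if match else None

print(record)
```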
THE BIG CHANGE ISN’T
TECHNOLOGY, IT’S ARCHITECTURE
There are really three workloads to consider, not two
1. Operational: OLTP systems
2. Analytic: OLAP systems
3. Processing: Computational systems
Unit of focus:
1. Transaction
2. Query
3. Computation
Different problems require different platforms
Workloads
Attribute       OLTP                BI             Analytics
Access          Read‐Write          Read‐only      Read‐mostly
Predictability  Predictable         Unpredictable  Fixed path
Selectivity     High                Low            Low
Retrieval       Low                 Low            High
Latency         Milliseconds        < seconds      msecs to days
Concurrency     Huge                Moderate       1 to huge
Model           3NF, nested object  Dim, denorm    BWT
Task size       Small               Large          Small to huge
These do exactly the same thing:
One is a set of technologies. One is an architecture.
An idea promoted by big data vendors
Data
Warehouse
Reality: Hadoop disaggregates the database
One of the key things Hadoop does is to separate the 
storage, execution and API layers of a database. This 
allows for processing flexibility, but it does not permit 
one to build a reliable, high performance database 
across the layers.
Hadoop distributed filesystem (HDFS)
General-purpose data engines
Abstraction layers
Storage management
A more specific look at layers and engines
You can program to any layer you choose. Some projects already build on top of multiple others.
[Diagram: stack of layers, bottom to top]
▪ Base storage: Hadoop distributed filesystem (HDFS)
▪ Storage mgmt: storage (filetypes in HDFS, Hbase, etc)
▪ Engine: MapReduce, Tez, Spark, Giraph, Impala, Drill, Presto, Hbase
▪ Abstraction layer / API: Cascading, Crunch, Pig, Hive, SparkSQL, Phoenix, Kylin (SQL, MDX), native APIs
An important Hadoop + cloud computing benefit
Scalability is free – if your task requires 10 units of work, you can decide when you want results:
▪ 1 server, 10 units of time
▪ 10 servers, 1 unit of time
Cost is the same. Not true of the conventional IT model
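The arithmetic behind "cost is the same" can be written out as below. The hourly rate is a hypothetical number, and the sketch assumes pay-per-use pricing and work that parallelizes perfectly.

```python
def job_cost(servers: int, hours_each: float, rate_per_server_hour: float) -> float:
    """Pay-per-use cost: servers x hours x hourly rate."""
    return servers * hours_each * rate_per_server_hour


RATE = 0.50  # hypothetical price per server-hour

# The same 10 units of work, assuming it parallelizes perfectly.
slow = job_cost(servers=1, hours_each=10, rate_per_server_hour=RATE)
fast = job_cost(servers=10, hours_each=1, rate_per_server_hour=RATE)

print(slow, fast)
```

In a conventional model you would have to buy and keep the ten servers, so the faster option costs far more than the slower one.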
Hadoop: a summary of the magic
1. Provides both storage and complex processing as part 
of the same platform
2. Makes parallel programming more accessible
3. Schemaless (just files) therefore flexible
4. Inexpensive, reliable scale‐out
5. Potential for fast, scalable ingest
6. Cheaper than a database (for non‐database work)
The bad stuff:
▪ Not great for mutable data
▪ Mostly file‐based sequential processing, or you store data 
many times in different datastores (locality is important)
▪ Minimal data management (today)
The geography has been redefined
The box we created:
• not any data, rigidly typed data
• not any form, tabular rows and 
columns of typed data
• not any latency, persist what the 
DB can keep up with
• not any process, only queries
The digital world was diminished 
to only what’s inside the box until 
we forgot the box was there.
Layered data architecture
The DW assumed a single flat 
model of data, DB in the center. 
New technology enables new 
ways to organize data:
▪ Raw – straight from the source
▪ Enhanced – cleaned, standardized
▪ Integrated – modeled, 
augmented, ~semi‐persistent
▪ Derived – analytic output, 
pattern based sets, ephemeral
Implies a new technology architecture and new data modeling approaches.
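The four layers can be pictured as a chain of small transformations in which the raw data is never modified. The records, reference data, and function names below are hypothetical; this is a sketch of the idea, not a prescribed implementation.

```python
# Hypothetical raw records, kept exactly as they arrived from the source.
raw = [
    {"cust": " C1 ", "amount": "20.00"},
    {"cust": "c2", "amount": "5.50"},
    {"cust": "C1", "amount": "12.50"},
]

# Hypothetical reference data used at the integrated layer.
customers = {"C1": "Acme", "C2": "Globex"}


def enhance(rows):
    """Enhanced layer: clean and standardize values without changing their meaning."""
    return [
        {"cust": r["cust"].strip().upper(), "amount": float(r["amount"])}
        for r in rows
    ]


def integrate(rows, reference):
    """Integrated layer: attach modeled reference data."""
    return [{**r, "name": reference.get(r["cust"], "unknown")} for r in rows]


def derive(rows):
    """Derived layer: analytic output, recomputed when needed."""
    totals = {}
    for r in rows:
        totals[r["name"]] = totals.get(r["name"], 0.0) + r["amount"]
    return totals


enhanced = enhance(raw)
integrated = integrate(enhanced, customers)
derived = derive(integrated)
print(derived)
```

Because each layer is built from the one below it, a change in cleansing rules or models means recomputing upward rather than reloading from the source.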
Decouple the Data Architecture
The core of the data warehouse isn’t the 
database, it’s the data architecture that the 
database and tools implement.
We need a data architecture that is not limiting:
▪ Deals with change more easily and at scale
▪ Does not enforce requirements and models up front
▪ Does not limit the format or structure of data
▪ Assumes the range of data latencies in and out, from 
streaming to one‐time bulk
Deconstructing the data warehouse
There are three 
things happening 
in a DW:
▪ Data acquisition
▪ Data management
▪ Data delivery
Isolate them from 
one another.
Data
Warehouse
Decouple the data architecture by stage
[Diagram: stages Collect, Integrate / Manage, Use; inputs: Transactions, Observations, Declarations]
In reality, you are building three systems, not one. Treat them that way.
Food supply chain: an analogy for data
Multiple contexts of use, differing quality levels
Data infrastructure is a platform
▪ Any data – structures, forms
▪ Any latency – in motion, at rest
▪ Any process – query, algorithm, transformation
▪ Any access – SQL, API, queue, file movement
The evolution of DW is to a data platform, which means 
separating application from infrastructure.
[Diagram: Application layer: deliver and use (BI, data extracts, analytics, applications) through multiple access methods. Infrastructure layer: process and analyze, store and manage (raw data, enhanced data, derived data), fed by multiple ingest methods.]
The new model also encompasses data at rest and data in motion
The platform has to do more than serve queries; it has to be read-write.
Away from “one throat to choke”, back to best of breed
“The extremely specialized 
nature of mass production 
raises the costs of product 
change and therefore slows 
down innovation.”
‐ Abernathy, 1978
Tight coupling leads to slow 
changes.
In a rapidly evolving market, componentized architectures, modularity and loose coupling are favored over monolithic stacks, single‐vendor architectures and tight coupling.
Staff and skills are a problem in a build market
@BigDataBorat: Give man Hadoop
cluster he gain insight for a day. Teach
man build Hadoop cluster he soon
leave for better job #bigdata
Technology Adoption
Some people can’t resist 
getting the next new thing 
because it’s new and new is 
always better.
Many IT organizations are like 
this, promoting a solution and 
hunting for the problem that 
matches it.
Better to ask “What is the 
problem for which this 
technology is the answer?”
Four core capabilities big data technologies add
1. Unlimited scale of storage, processing
▪ Agility, faster turnaround for new data requests (but not a
replacement for BI)
▪ Fewer staff to accomplish same goals
2. New data accessibility
▪ More data retained for longer period
▪ Access to data unused due to cost or processing limits
▪ Any digital information becomes usable data
3. Scalable realtime processing
▪ Brings ability to monitor and act on data as events occur
4. Arbitrary analytics
▪ Faster analysis
▪ Deeper analysis
▪ More broadly accessible analytics
As a technology moves from emerging to commodity the 
nature of acquiring, using and managing it changes
Phase     Innovation        Maturation         Saturation
Choices   Generate options  Constrain choices  Standardize / minimize choice
Approach  Innovation        Adaptation         Acquisition
Practice  Novel practice    Good practice      Best practice
Goal      Maximize value    Optimize           Minimize costs
[Chart annotations: Agile & open source* methods; 6 Sigma & process methods]
Today: repeating the experience of the 80s & 90s
This is the turbulent
phase of the market
as it goes through
rapid development,
then product and
service changes.
The Internet combined with commodity computing is forcing a new
business and IT structural evolution, already underway.
[Chart phases: Innovation, Maturation, Saturation]
How we develop best practices: survival bias
We don’t need best practices, we need worst failures.
Welcome to the big data revolution, more of an evolution
Be pragmatic, not dogmatic
CC Image Attributions
Thanks to the people who supplied the creative commons licensed images used in this presentation:
acorn_blue.jpg ‐ http://www.flickr.com/photos/rogersmith/314324893/
wheat_field.jpg ‐ http://www.flickr.com/photos/ecstaticist/1120119742/
Phone dump ‐ Richard Barnes
ponies in field.jpg ‐ http://www.flickr.com/photos/bulle_de/352732514/
straw men.jpg ‐ http://www.flickr.com/photos/robinellis/6034919721/
text composition ‐ http://flickr.com/photos/candiedwomanire/60224567/
girl on cell tokyo .jpg ‐ http://flickr.com/photos/8024992@N06/986538717/
hamadan people mosaic.jpg ‐ http://flickr.com/photos/hamed/225868856/
twitter_network_bw.jpg ‐ http://www.flickr.com/photos/dr/2048034334/
klein_bottle_red.jpg ‐ http://flickr.com/photos/sveinhal/2081201200/
donuts_4_views.jpg ‐ http://www.flickr.com/photos/le_hibou/76718773/
subway dc metro  ‐ http://flickr.com/photos/musaeum/509899161/
About the Presenter
Mark Madsen is president of Third 
Nature, a consulting and advisory firm 
focused on analytics, business 
intelligence and data management. 
Mark is an award‐winning author, 
architect and CTO. Over the past ten 
years Mark received awards for his work 
from the American Productivity & 
Quality Center, TDWI, and the 
Smithsonian Institute. He is an 
international speaker, a contributor to 
Forbes, and a member of the O’Reilly Strata 
program committee. For more 
information or to contact Mark, follow 
@markmadsen on Twitter or visit  
http://ThirdNature.net 
About Third Nature
Third Nature is a consulting and advisory firm focused on new and
emerging technology and practices in information strategy, analytics,
business intelligence and data management. If your question is related to
data, analytics, information strategy and technology infrastructure then
you’re at the right place.
Our goal is to help organizations solve problems using data. We offer
education, consulting and research services to support business and IT
organizations as well as technology vendors.
We fill the gap between what the industry analyst firms cover and what IT
needs. We specialize in strategy and architecture, so we look at emerging
technologies and markets, evaluating how technologies are applied to
solve problems rather than evaluating product features.

More Related Content

PDF
Everything has changed except us
PDF
Big Data and Bad Analogies
PDF
Everything Has Changed Except Us: Modernizing the Data Warehouse
PDF
Disruptive Innovation: how do you use these theories to manage your IT?
PDF
Briefing room: An alternative for streaming data collection
PDF
Data Architecture: OMG It’s Made of People
PDF
Solve User Problems: Data Architecture for Humans
PDF
Assumptions about Data and Analysis: Briefing room webcast slides
Everything has changed except us
Big Data and Bad Analogies
Everything Has Changed Except Us: Modernizing the Data Warehouse
Disruptive Innovation: how do you use these theories to manage your IT?
Briefing room: An alternative for streaming data collection
Data Architecture: OMG It’s Made of People
Solve User Problems: Data Architecture for Humans
Assumptions about Data and Analysis: Briefing room webcast slides

What's hot (20)

PDF
Architecting a Platform for Enterprise Use - Strata London 2018
PDF
How to understand trends in the data & software market
PPTX
Innovation med big data – chr. hansens erfaringer
PDF
2013: Trends from the Trenches
PDF
2015 CDC Workshop on ScienceDMZ
PDF
The Black Box: Interpretability, Reproducibility, and Data Management
PDF
Architecting a Data Platform For Enterprise Use (Strata NY 2018)
PDF
BioIT World 2016 - HPC Trends from the Trenches
PPTX
BioIT Trends - 2014 Internet2 Technology Exchange
PPTX
Taming Big Science Data Growth with Converged Infrastructure
PDF
Operationalizing Machine Learning in the Enterprise
PDF
Pay no attention to the man behind the curtain - the unseen work behind data ...
PDF
Facilitating Collaborative Life Science Research in Commercial & Enterprise E...
PPTX
Cloud Security for Life Science R&D
PDF
Building Data Science Teams
 
PPTX
Chattanooga Hadoop Meetup - Hadoop 101 - November 2014
PDF
Lean approach to IT development
PDF
Bio-IT & Cloud Sobriety: 2013 Beyond The Genome Meeting
PDF
5 Factors Impacting Your Big Data Project's Performance
PDF
2014 BioIT World - Trends from the trenches - Annual presentation
Architecting a Platform for Enterprise Use - Strata London 2018
How to understand trends in the data & software market
Innovation med big data – chr. hansens erfaringer
2013: Trends from the Trenches
2015 CDC Workshop on ScienceDMZ
The Black Box: Interpretability, Reproducibility, and Data Management
Architecting a Data Platform For Enterprise Use (Strata NY 2018)
BioIT World 2016 - HPC Trends from the Trenches
BioIT Trends - 2014 Internet2 Technology Exchange
Taming Big Science Data Growth with Converged Infrastructure
Operationalizing Machine Learning in the Enterprise
Pay no attention to the man behind the curtain - the unseen work behind data ...
Facilitating Collaborative Life Science Research in Commercial & Enterprise E...
Cloud Security for Life Science R&D
Building Data Science Teams
 
Chattanooga Hadoop Meetup - Hadoop 101 - November 2014
Lean approach to IT development
Bio-IT & Cloud Sobriety: 2013 Beyond The Genome Meeting
5 Factors Impacting Your Big Data Project's Performance
2014 BioIT World - Trends from the trenches - Annual presentation

Viewers also liked (20)

PPSX
The Analytics Data Store: Information Supply Framework
PDF
Briefing Room analyst comments - streaming analytics
PDF
Crossing the chasm with a high performance dynamically scalable open source p...
PDF
A Pragmatic Approach to Analyzing Customers
PDF
Determine the Right Analytic Database: A Survey of New Data Technologies
PPTX
VCCP Kin Production Director Chris Chaundler - Sustaining brand conversation
PDF
Blueprint for integrating big data analytics and bi
PDF
Text visualization - by Jeff Clark
PDF
Integrating BI - Data Warehouse and Big Data
PDF
Next generation big data bi
PDF
The State of Open Source BI Adoption
PDF
Malaysia Big Data Analytics Initiative: 2015 Imperatives
PPTX
Design Principles for a Modern Data Warehouse
PDF
Bi on Big Data - Strata 2016 in London
PDF
On the edge: analytics for the modern enterprise (analyst comments)
PDF
BI congres 2014-5: from BI to big data - Jan Aertsen - Pentaho
PDF
How big data is transforming BI
PDF
What is bi analytics and big data
PDF
Building the Enterprise Data Lake: A look at architecture
PPTX
Big Data and BI Best Practices
The Analytics Data Store: Information Supply Framework
Briefing Room analyst comments - streaming analytics
Crossing the chasm with a high performance dynamically scalable open source p...
A Pragmatic Approach to Analyzing Customers
Determine the Right Analytic Database: A Survey of New Data Technologies
VCCP Kin Production Director Chris Chaundler - Sustaining brand conversation
Blueprint for integrating big data analytics and bi
Text visualization - by Jeff Clark
Integrating BI - Data Warehouse and Big Data
Next generation big data bi
The State of Open Source BI Adoption
Malaysia Big Data Analytics Initiative: 2015 Imperatives
Design Principles for a Modern Data Warehouse
Bi on Big Data - Strata 2016 in London
On the edge: analytics for the modern enterprise (analyst comments)
BI congres 2014-5: from BI to big data - Jan Aertsen - Pentaho
How big data is transforming BI
What is bi analytics and big data
Building the Enterprise Data Lake: A look at architecture
Big Data and BI Best Practices

Similar to Bi isn't big data and big data isn't BI (updated) (20)

PDF
How to succeed at data without even trying!
PDF
The Role of Data Wrangling in Driving Hadoop Adoption
PDF
Data and data scientists are not equal to money david hoyle
PDF
Putting data science in your business a first utility feedback
PDF
Building a Data Platform Strata SF 2019
PPTX
Review on the Ted Talk- What do we do with all this big data?
PDF
Implementing Data Mesh WP LTIMindtree White Paper
PDF
Introduction to Big Data
PDF
How to Prepare for a Career in Data Science
PDF
What makes an effective data team?
PDF
A strategy for security data analytics - SIRACon 2016
PDF
Demystifying ML/AI
PPTX
Horse meat or beef? (3) D Murphy, National Grid, 21/3/13
PDF
SDD2017 - 03 Abed Ajraou - putting data science in your business a first uti...
PDF
Wake up and smell the data
PDF
Big dataplatform operationalstrategy
PPTX
Making big data work
PPTX
Semantech Inc. - Mastering Enterprise Big Data - Intro
PDF
Landing a career in data science
PDF
Course 8 : How to start your big data project by Eric Rodriguez
How to succeed at data without even trying!
The Role of Data Wrangling in Driving Hadoop Adoption
Data and data scientists are not equal to money david hoyle
Putting data science in your business a first utility feedback
Building a Data Platform Strata SF 2019
Review on the Ted Talk- What do we do with all this big data?
Implementing Data Mesh WP LTIMindtree White Paper
Introduction to Big Data
How to Prepare for a Career in Data Science
What makes an effective data team?
A strategy for security data analytics - SIRACon 2016
Demystifying ML/AI
Horse meat or beef? (3) D Murphy, National Grid, 21/3/13
SDD2017 - 03 Abed Ajraou - putting data science in your business a first uti...
Wake up and smell the data
Big dataplatform operationalstrategy
Making big data work
Semantech Inc. - Mastering Enterprise Big Data - Intro
Landing a career in data science
Course 8 : How to start your big data project by Eric Rodriguez

BI isn't big data and big data isn't BI (updated)

  • 2. © Third Nature Inc. Summary: Common uses and commodity technology lead to novel practices, which lead to different data and different technology needs, which lead to new architectures, which lead to common uses and commodity technology.
  • 3. Our ideas about information and how it’s used are outdated.
  • 4. How We Think of Users: Our design point is the passive consumer of information. Proof: methodology. ▪ IT role is requirements, design, build, deploy, administer ▪ User role is run reports. Self-serve BI is not like picking the right doughnut from a box.
  • 5. How We Think of Users: Our design point is the passive consumer of information. Proof: methodology. ▪ IT role is requirements, design, build, deploy, administer ▪ User role is run reports. Self-serve BI is not like picking the right doughnut from a box. How We Want Users to Think of Us.
  • 6. How We Think of Users. What Users Really Think.
  • 7. We think of BI as publishing, an old metaphor. Publishing has value, but may not be actionable.
  • 8. Planning data strategy means understanding the context of data use so we can build infrastructure. [Diagram labels: Monitor, Analyze Exceptions, Analyze Causes, Decide, Act; exits: No problem, No idea, Do nothing] We need to focus on what people do with information as the primary task, not on the data or the technology.
  • 9. General model for organizational use of data. [Diagram labels: Monitor, Analyze Exceptions, Analyze Causes, Decide, Act; exits: No problem, No idea, Do nothing] Act within the process: usually real-time to daily.
  • 10. Origin of BI and data warehouse concepts: The general concept of a separate architecture for BI has been around longer, but this paper by Devlin and Murphy is the first formal data warehouse architecture and definition published. “An architecture for a business and information system”, B. A. Devlin, P. T. Murphy, IBM Systems Journal, Vol. 27, No. 1 (1988).
  • 11. Origins: in 1988 there was only big hair. ▪ No real commercial email, public internet barely started ▪ Storage state of the art: 100 MB, cost $10,000/GB ▪ Oracle Applications v1 GL released; SAP goes public, enters US market ▪ Unix is mostly run by long-haired freaks ▪ Mobile was this (image). This is the context: scarcity of data, of system resources, of automated systems outside core financials, of money to pay for storage.
  • 12. General model for organizational use of data. [Diagram labels: Collect new data; Monitor, Analyze Exceptions, Analyze Causes, Decide, Act; exits: No problem, No idea, Do nothing] Act on the process: usually days/longer timeframe.
  • 13. You need to be able to support both paths. [Diagram labels: Collect new data; Monitor, Analyze Exceptions, Analyze Causes, Decide, Act; Act on the process; Act within the process; Conventional BI, addition of EDM; Causal analysis, “data science”]
  • 14. The usage models for conventional BI. [Diagram labels: Collect new data; Monitor, Analyze Exceptions, Analyze Causes, Decide, Act; exits: No problem, No idea, Do nothing] Act on the process: usually days/longer timeframe. Act within the process: usually real-time to daily. This is what we’ve been doing with BI so far: static reporting, dashboards, ad-hoc query, OLAP.
  • 15. The usage models for analytics and “big data”. [Diagram labels: Collect new data; Monitor, Analyze Exceptions, Analyze Causes, Decide, Act; exits: No problem, No idea, Do nothing] Act on the process: usually days/longer timeframe. Act within the process: usually real-time to daily. Analytics and big data are focused on new use cases: deeper analysis, causes, prediction, optimizing decisions. This isn’t ad-hoc, reporting, or OLAP.
  • 16. When you first give people access to information that was unavailable… “OH GOD I can see into forever”
  • 17. After a while it becomes the new normal.
  • 18. As practices evolve based on new capabilities… A new level of complexity develops over top of the older, now better understood processes, leading to new data and analysis needs.
  • 19. I never said the “E” in EDW meant “everything”… What do you mean, “Just doughnuts?”
  • 20. The data warehouse vs business agility. [Diagram labels: All the data; Common, typed, tabular data] The bottleneck is you.
  • 21. It’s going to get a lot worse. [Diagram labels: Not E; E] Conclusion: any methodology built on the premise that you must know and model all the data first is untenable.
  • 22. Old market says: There’s nothing wrong with what you have, just keep buying new products from us.
  • 23. The emerging big data market has an answer…
  • 24. The data lake
  • 25. The data lake after a little while
  • 26. TANSTAAFL: When replacing the old with the new (or ignoring the new over the old) you always make tradeoffs, and usually you won’t see them for a long time. Technologies are not perfect replacements for one another. Often not better, only different.
  • 27. “Big data is unprecedented.” - Anyone involved with big data in even the most barely perceptible way
  • 28. We’ve been here before. Source: Bill Schmarzo, EMC
  • 29. “Big” is well supported by databases now. Source: Noumenal, Inc.
  • 30. Orders of magnitude: 20 years ago TB, today PB. Shifts in data availability by orders of magnitude necessitate new means of managing and using it.
  • 31. Analytics embiggens the data volume problem: Many of the processing problems are O(n²) or worse, so moderate data can be a problem for DB-based platforms.
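To make the scaling point concrete, here is a small sketch (not from the slides) that counts the work in a naive all-pairs comparison, a common O(n²) pattern in similarity or matching jobs. The function name and the row counts are illustrative assumptions.

```python
def pairwise_comparisons(n: int) -> int:
    """Number of unordered pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Growing the data 10x grows the comparison count roughly 100x.
for rows in (1_000, 10_000, 100_000):
    print(f"{rows:>7} rows -> {pairwise_comparisons(rows):>13,} comparisons")
```

The data volume stays "moderate" in every case here; it is the quadratic growth in work that becomes the problem.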
  • 32. Much of the big data value comes from analytics. BI is a retrieval problem, not a computational problem. Five basic things you can do with analytics: ▪ Prediction – what is most likely to happen? ▪ Estimation – what’s the future value of a variable? ▪ Description – what relationships exist in the data? ▪ Simulation – what could happen? ▪ Prescription – what should you do?
  • 33. Most people do not need special technology. [Chart: number of people vs. bigness of data] The distribution of data size is about normal, yet these guys set the tone of the market today.
  • 34. Analytics: This is really raw data under storage. [Chart: number of jobs vs. bigness of data] Microsoft study of 174,000 analytic jobs in their cluster: median size ???
  • 35. Working data for analytics most often not big. [Chart: number of jobs vs. smallness of data; label: 14 GB]
  • 36. An (overly) Simple Division of the Problem Space
      | Computation | Data volume: Little | Data volume: Lots |
      | Lots | Big analytics, little data. Specialized computing, modeling problems: supercomputing, GPUs | Big analytics, big data. Complex math over large data volumes requires shared nothing architectures |
      | Little | Little analytics, little data. The entry point; SAS, SMP databases, even OLAP cubes can work | Little analytics, big data. The BI/DW space, for the most part, with work done in databases |
  • 37. What makes data “big”? Very large amounts; hierarchical structures; nested structures; linked structures; encoded values; non-standard (for a database) types; deep structure; human authored text. “Big” is better off being defined as “complex” or “hard to manage”.
  • 38. Categorizing the measurement data we collect. The convenient data is the transactional data. ▪ Goes in the DW and is used, even if it isn’t the right measurement. The inconvenient data is observational data. ▪ It’s not neat, clean, or designed into most systems of operation. The difficult and misleading data is declarative data. ▪ What people say and what they do require ground truth. We need an architecture that supports all three categories.
  • 39. Transactions vs “big data”. The classic example of “structured data”. Transaction data includes: ▪ quantification details (date, value, count) ▪ reference data for explanation (product, customer, account) ▪ lots of meaningful information. Reference data is usually shared across the organization, hence its importance. There are two parts: ▪ identifier to uniquely identify the subject ▪ descriptive attributes with common or standardized value domains. [Diagram labels: Transaction details; Reference data]
  • 40. Today it’s different data: observations, not transactions. Sensor data doesn’t fit well with current methods of collection and storage, or with the technology to process and analyze it.
  • 41. Big data as a type of data: Transactions vs. Events. Transactions: ▪ Each one is valuable ▪ Mutable ▪ The elements of a transaction can be aggregated easily ▪ A set of transactions does not usually have important ordering or dependency. Events: ▪ A single event often has no value, e.g. what is the value of one click in a series? Some events are extremely valuable, but this is only detectable within the context of other events. ▪ Elements of events are often not easily aggregated ▪ A set of events usually has a natural order and dependencies ▪ Immutable
  • 42. Example “big data”: Web tracking data
      USER_ID             301212631165031
      SESSION_ID          590387153892659
      VISIT_DATE          1/10/2010 0:00
      SESSION_START_DATE  1:41:44 AM
      PAGE_VIEW_DATE      1/10/2010 9:59
      DESTINATION_URL     https://www.phisherking.com/gifts/store/LogonForm?mmc=link-src-email-_-m100109-_-44IOJ1-_-shop&langId=-1&storeId=1055&URL=BECGiftListItemDisplay
      REFERRAL_NAME       Google.com
      REFERRAL_URL        http://www.google.com/search?sourceid=navclient&aq=0h&oq=Italian&ie=UTF8&rlz=1T4ACGW_enUS386US387&q=italian+rose&fu=0&ifi=1&dtd=204&xpc=1KoLqh374s
      PAGE_ID             PROD_24259_CARD
      REL_PRODUCTS        PROD_24654_CARD, PROD_3648_FLOWERS
      SITE_LOCATION_NAME  VALENTINE'S DAY MICROSITE
      SITE_LOCATION_ID    SHOP-BY-HOLIDAY VALENTINES DAY
      IP_ADDRESS          67.189.110.179
      BROWSER_OS_NAME     MOZILLA/4.0 (COMPATIBLE; MSIE 7.0; AOL 9.0; WINDOWS NT 5.1; TRIDENT/4.0; GTB6; .NET CLR 1.1.4322)
  • 43. Web tracking data has a nested structure
      USER_ID             301212631165031
      SESSION_ID          590387153892659
      VISIT_DATE          1/10/2010 0:00
      SESSION_START_DATE  1:41:44 AM
      PAGE_VIEW_DATE      1/10/2010 9:59
      DESTINATION_URL     https://www.phisherking.com/gifts/store/LogonForm?mmc=link-src-email-_-m100109-_-44IOJ1-_-shop&langId=-1&storeId=1055&URL=BECGiftListItemDisplay
      REFERRAL_NAME       Direct
      REFERRAL_URL        -
      PAGE_ID             PROD_24259_CARD
      REL_PRODUCTS        PROD_24654_CARD, PROD_3648_FLOWERS
      SITE_LOCATION_NAME  VALENTINE'S DAY MICROSITE
      SITE_LOCATION_ID    SHOP-BY-HOLIDAY VALENTINES DAY
      IP_ADDRESS          67.189.110.179
      BROWSER_OS_NAME     MOZILLA/4.0 (COMPATIBLE; MSIE 7.0; AOL 9.0; WINDOWS NT 5.1; TRIDENT/4.0; GTB6; .NET CLR 1.1.4322)
      “Unstructured” data embedded in the logged message: complex strings.
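One way to see the nesting is to unpack a single field. The sketch below parses the DESTINATION_URL value from the slide with Python's standard `urllib.parse`. The URL is written here with page-extraction residue and line-wrap spaces removed, which is an assumption about the original value; the function name `unpack_url` and the idea that the `mmc` code is delimited by `-_-` are also assumptions made for illustration.

```python
from urllib.parse import parse_qs, urlsplit

# DESTINATION_URL from the slide (cleaned of extraction residue; assumed form).
url = (
    "https://www.phisherking.com/gifts/store/LogonForm"
    "?mmc=link-src-email-_-m100109-_-44IOJ1-_-shop"
    "&langId=-1&storeId=1055&URL=BECGiftListItemDisplay"
)


def unpack_url(raw: str) -> dict:
    """Split one logged URL string into host, path and query parameters."""
    parts = urlsplit(raw)
    # parse_qs returns a list per key; keep the first value for each.
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return {"host": parts.netloc, "path": parts.path, "params": params}


record = unpack_url(url)
# Assumption: the campaign code is itself a delimited record.
campaign = record["params"]["mmc"].split("-_-")
print(record["params"])
print(campaign)
```

A single "string" column therefore holds at least two further levels of structure, which is the kind of data that does not sit comfortably in a flat, typed table.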
  • 44. The missing ingredient from most big data
  • 45. The creation, flow and use of data is different for transactions and machine-generated events. [Diagram labels: Data entry, Extract, Cleanse, Load, Use, Store, Transactions, MDM; Generate, Store, Use, Use, Cleanse, Program, Capture] This runs at human speed. This runs at machine speed, with higher latency feedback cycles.
  • 47. You can store this data in an RDBMS, but…
  • 48. Example data: Twitter message API payload. It looks like this: really just a record format, much like a DB row. Datetime, userID, name, location, description, message, message metadata, etc. But it’s in JSON or XML.
  • 49. A tweet has lots of fields, but one important one. Example: “@markmadsen Check out: From #MongoDB to #Cassandra: Why The Atlas Platform Is Migrating http://owl.li/cvxFK”. The payload is free text but has other elements: ‘To’ username, hashtag, hashtag, URL. From these things you likely want to generate or link to reference data.
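As a minimal sketch of pulling those elements out of the free-text payload, the example below uses regular expressions on the tweet from the slide (with the link written without page-extraction residue, which is an assumption). The patterns are deliberately simple and are not Twitter's own entity rules; `extract_entities` is a hypothetical helper name.

```python
import re

tweet = (
    "@markmadsen Check out: From #MongoDB to #Cassandra: "
    "Why The Atlas Platform Is Migrating http://owl.li/cvxFK"
)

# Simplified patterns for illustration only.
MENTION = re.compile(r"@(\w+)")
HASHTAG = re.compile(r"#(\w+)")
URL = re.compile(r"https?://\S+")


def extract_entities(text: str) -> dict:
    """Return the 'to' usernames, hashtags and URLs found in a message."""
    return {
        "mentions": MENTION.findall(text),
        "hashtags": HASHTAG.findall(text),
        "urls": URL.findall(text),
    }


entities = extract_entities(tweet)
print(entities)
```

Each extracted value is a candidate key into reference data: a user record, a topic, or a linked document.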
  • 50. Internal payload elements form a new graph. The @elements point to other records and create a deeply linked structure. You have to assemble the linked structure to see what’s really there, which means repeatedly scanning some or all of the data. The derived pattern is interesting data, sometimes more than the individual messages.
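A small sketch of assembling that linked structure follows. The messages and usernames are made up for illustration, and `mention_graph` is a hypothetical helper; the point is only that the graph exists nowhere in the individual records and has to be derived by scanning them.

```python
import re
from collections import defaultdict

# Hypothetical (author, text) pairs; not data from the presentation.
messages = [
    ("alice", "@bob have you seen this?"),
    ("bob", "@alice yes, cc @carol"),
    ("carol", "@alice thanks"),
]


def mention_graph(msgs):
    """Build a directed graph: author -> set of users they mention."""
    graph = defaultdict(set)
    for author, text in msgs:
        for target in re.findall(r"@(\w+)", text):
            graph[author].add(target)
    return dict(graph)


graph = mention_graph(messages)
print(graph)
```

Every new batch of messages can change the graph, so the derived structure has to be rebuilt or incrementally maintained rather than loaded once.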
  • 51. There are many patterns in the data. Follower / following networks are easy: they are explicit and independent of the events. Community detection requires looking at patterns of @ communication in addition to follow relationships. What do you do with these after discovery? [Image labels: Follower network; Conversational communities]
  • 52. More data: patterns emerge from lots of event data. Patterns emerge from the underlying structure of the entire dataset. The patterns are more interesting than sums and counts of the events. Web paths: clicks in a session as network node traversal. Email: traffic analysis producing a network. The event stream is a source for analysis, generating another set of data that is the source for different analysis.
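The "clicks in a session as network node traversal" idea can be sketched as follows. The session data and page names are hypothetical, and `path_edges` is an illustrative helper: it turns ordered clicks into page-to-page transitions and counts them, producing a second dataset derived from the event stream.

```python
from collections import Counter

# Hypothetical click sequences per session, in event order.
sessions = {
    "s1": ["home", "search", "product", "cart"],
    "s2": ["home", "product", "cart"],
    "s3": ["home", "search", "product"],
}


def path_edges(session_clicks):
    """Count page-to-page transitions across all sessions."""
    edges = Counter()
    for pages in session_clicks.values():
        # Consecutive clicks form one directed edge in the path network.
        edges.update(zip(pages, pages[1:]))
    return edges


edges = path_edges(sessions)
for (src, dst), count in edges.most_common():
    print(f"{src} -> {dst}: {count}")
```

Note that the result depends on event order within a session, which is why events cannot be treated like independently aggregable transactions.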
  • 53. Big changes for data warehousing workloads. The results of analytic processing can, and often do, feed back into the system from which they originate. Much of the data is being read, written and processed in real time. Our design point was not changing tables and ephemeral patterns.
  • 55. THE BIG CHANGE ISN’T TECHNOLOGY, IT’S ARCHITECTURE
  • 56. There are really three workloads to consider, not two: 1. Operational: OLTP systems 2. Analytic: OLAP systems 3. Processing: Computational systems. Unit of focus: 1. Transaction 2. Query 3. Computation. Different problems require different platforms.
  • 57. Workloads
      | | OLTP | BI | Analytics |
      | Access | Read-Write | Read-only | Read-mostly |
      | Predictability | Predictable | Unpredictable | Fixed path |
      | Selectivity | High | Low | Low |
      | Retrieval | Low | Low | High |
      | Latency | Milliseconds | < seconds | msecs to days |
      | Concurrency | Huge | Moderate | 1 to huge |
      | Model | 3NF, nested object | Dim, denorm | BWT |
      | Task size | Small | Large | Small to huge |
  • 58. An idea promoted by big data vendors: these do exactly the same thing. One is a set of technologies. One is an architecture. [Image label: Data Warehouse]
  • 59. Reality: Hadoop disaggregates the database. One of the key things Hadoop does is to separate the storage, execution and API layers of a database. This allows for processing flexibility, but it does not permit one to build a reliable, high performance database across the layers. [Diagram labels: Hadoop distributed filesystem (HDFS); Storage management; General-purpose data engines; Abstraction layers]
  • 60. A more specific look at layers and engines. You can program to any layer you choose. Some projects already build on top of multiple others. [Diagram labels. Layers: Base storage; Storage mgmt; Engine; Abstraction layer / API; Language/API (SQL, MDX). Components: Hadoop distributed filesystem (HDFS); Storage (file types in HDFS, HBase, etc.); MapReduce, Tez, Cascading, Spark, Giraph, Impala, Drill, Presto, HBase; Crunch, Pig, Hive, SparkSQL, Phoenix, Kylin, native APIs]
  • 61. An important Hadoop + cloud computing benefit: Scalability is free – if your task requires 10 units of work, you can decide when you want results: 1 server for 10 units of time, or 10 servers for 1 unit of time. Cost is the same. Not true of the conventional IT model.
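The arithmetic behind "cost is the same" can be written out as a short sketch. It assumes perfectly parallel work, no coordination overhead, and pay-per-use pricing; the function name and the unit price are illustrative assumptions, not figures from the presentation.

```python
def job_cost(work_units: float, servers: int, price_per_server_unit: float = 1.0):
    """Return (elapsed time, total cost) for a perfectly parallel job.

    Assumes work divides evenly across servers with no overhead and that
    servers are billed only for the time they are used.
    """
    elapsed = work_units / servers
    cost = servers * elapsed * price_per_server_unit
    return elapsed, cost


print(job_cost(10, servers=1))   # slow, one server
print(job_cost(10, servers=10))  # fast, ten servers, same total cost
```

Under a conventional model the ten servers are bought up front and paid for whether or not they are busy, so the equivalence no longer holds.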
  • 62. Hadoop: a summary of the magic. 1. Provides both storage and complex processing as part of the same platform 2. Makes parallel programming more accessible 3. Schemaless (just files) therefore flexible 4. Inexpensive, reliable scale-out 5. Potential for fast, scalable ingest 6. Cheaper than a database (for non-database work). The bad stuff: ▪ Not great for mutable data ▪ Mostly file-based sequential processing, or you store data many times in different datastores (locality is important) ▪ Minimal data management (today)
  • 63. The geography has been redefined. The box we created: • not any data, rigidly typed data • not any form, tabular rows and columns of typed data • not any latency, persist what the DB can keep up with • not any process, only queries. The digital world was diminished to only what’s inside the box until we forgot the box was there.
  • 64. Layered data architecture. The DW assumed a single flat model of data, DB in the center. New technology enables new ways to organize data: ▪ Raw – straight from the source ▪ Enhanced – cleaned, standardized ▪ Integrated – modeled, augmented, ~semi-persistent ▪ Derived – analytic output, pattern based sets, ephemeral. Implies a new technology architecture and data modeling approaches.
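The four layers can be illustrated with a toy pipeline. Everything below (field names, the reference lookup, the aggregation) is a hypothetical example chosen to show how each layer is derived from the one before it; it is not a description of any particular product or of the author's implementation.

```python
# Raw: kept exactly as it arrived from the source.
raw = [
    {"user": " Alice ", "amount": "10.50"},
    {"user": "bob", "amount": "4"},
    {"user": "alice", "amount": "2.25"},
]


def enhance(record: dict) -> dict:
    """Enhanced: cleaned and standardized, but still one record per event."""
    return {"user": record["user"].strip().lower(), "amount": float(record["amount"])}


enhanced = [enhance(r) for r in raw]

# Integrated: augmented with shared reference data (hypothetical lookup).
reference = {"alice": "EMEA", "bob": "APAC"}
integrated = [{**r, "region": reference.get(r["user"], "unknown")} for r in enhanced]

# Derived: analytic output, cheap to recompute and therefore ephemeral.
derived = {}
for r in integrated:
    derived[r["region"]] = derived.get(r["region"], 0.0) + r["amount"]

print(derived)
```

Because the raw layer is retained, the later layers can be rebuilt when cleaning rules or models change, instead of requiring the model to be fixed before any data is loaded.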
  • 65. Decouple the Data Architecture. The core of the data warehouse isn’t the database, it’s the data architecture that the database and tools implement. We need a data architecture that is not limiting: ▪ Deals with change more easily and at scale ▪ Does not enforce requirements and models up front ▪ Does not limit the format or structure of data ▪ Assumes the range of data latencies in and out, from streaming to one-time bulk
  • 66. Deconstructing the data warehouse. There are three things happening in a DW: ▪ Data acquisition ▪ Data management ▪ Data delivery. Isolate them from one another. [Image label: Data Warehouse]
  • 67. Decouple the data architecture by stage. [Diagram labels: Collect (Transactions, Observations, Declarations); Integrate; Manage; Use] In reality, you are building three systems, not one. Treat them that way.
  • 68. Food supply chain: an analogy for data. Multiple contexts of use, differing quality levels.
  • 69. Data infrastructure is a platform: ▪ Any data – structures, forms ▪ Any latency – in motion, at rest ▪ Any process – query, algorithm, transformation ▪ Any access – SQL, API, queue, file movement
  • 70. The evolution of DW is to a data platform, which means separating application from infrastructure. [Diagram labels: Application layer: deliver and use; Infrastructure layer: process and analyze, store and manage; Raw data; Enhanced data; Derived data; Multiple ingest methods; Multiple access methods; BI, data extracts, analytics, applications] The new model also encompasses data at rest and data in motion. The platform has to do more than serve queries; it has to be read-write.
  • 71. Away from “one throat to choke”, back to best of breed. “The extremely specialized nature of mass production raises the costs of product change and therefore slows down innovation.” - Abernathy, 1978. Tight coupling leads to slow changes. In a rapidly evolving market, componentized architectures, modularity and loose coupling are favorable over monolithic stacks, single-vendor architectures and tight coupling.
  • 72. Staff and skills are a problem in a build market. @BigDataBorat: “Give man Hadoop cluster he gain insight for a day. Teach man build Hadoop cluster he soon leave for better job #bigdata”
  • 73. Technology Adoption. Some people can’t resist getting the next new thing because it’s new and new is always better. Many IT organizations are like this, promoting a solution and hunting for the problem that matches it. Better to ask “What is the problem for which this technology is the answer?”
  • 74. Four core capabilities big data technologies add. 1. Unlimited scale of storage, processing ▪ Agility, faster turnaround for new data requests (but not a replacement for BI) ▪ Fewer staff to accomplish same goals 2. New data accessibility ▪ More data retained for longer period ▪ Access to data unused due to cost or processing limits ▪ Any digital information becomes usable data 3. Scalable realtime processing ▪ Brings ability to monitor and act on data as events occur 4. Arbitrary analytics ▪ Faster analysis ▪ Deeper analysis ▪ More broadly accessible analytics
  • 75. As a technology moves from emerging to commodity the nature of acquiring, using and managing it changes. [Diagram, by phase. Innovation: generate options, innovation, novel practice, maximize value. Maturation: constrain choices, adaptation, good practice, optimize. Saturation: standardize / minimize choice, acquisition, best practice, minimize costs. Method labels: Agile & open source* methods; 6 Sigma & process methods]
  • 76. Today: repeating the experience of the 80s & 90s. This is the turbulent phase of the market as it goes through rapid development, then product and service changes. The Internet combined with commodity computing is forcing a new business and IT structural evolution, already underway. [Diagram labels: Innovation, Maturation, Saturation]
  • 77. How we develop best practices: survival bias. We don’t need best practices, we need worst failures.
  • 78. Welcome to the big data revolution, more of an evolution. Be pragmatic, not dogmatic.
  • 79. CC Image Attributions. Thanks to the people who supplied the Creative Commons licensed images used in this presentation:
      acorn_blue.jpg - http://www.flickr.com/photos/rogersmith/314324893/
      wheat_field.jpg - http://www.flickr.com/photos/ecstaticist/1120119742/
      Phone dump - Richard Barnes
      ponies in field.jpg - http://www.flickr.com/photos/bulle_de/352732514/
      straw men.jpg - http://www.flickr.com/photos/robinellis/6034919721/
      text composition - http://flickr.com/photos/candiedwomanire/60224567/
      girl on cell tokyo .jpg - http://flickr.com/photos/8024992@N06/986538717/
      hamadan people mosaic.jpg - http://flickr.com/photos/hamed/225868856/
      twitter_network_bw.jpg - http://www.flickr.com/photos/dr/2048034334/
      klein_bottle_red.jpg - http://flickr.com/photos/sveinhal/2081201200/
      donuts_4_views.jpg - http://www.flickr.com/photos/le_hibou/76718773/
      subway dc metro - http://flickr.com/photos/musaeum/509899161/
  • 81. About Third Nature. Third Nature is a consulting and advisory firm focused on new and emerging technology and practices in information strategy, analytics, business intelligence and data management. If your question is related to data, analytics, information strategy and technology infrastructure then you’re at the right place. Our goal is to help organizations solve problems using data. We offer education, consulting and research services to support business and IT organizations as well as technology vendors. We fill the gap between what the industry analyst firms cover and what IT needs. We specialize in strategy and architecture, so we look at emerging technologies and markets, evaluating how technologies are applied to solve problems rather than evaluating product features.