BI on Big Data
Strata San Jose, March 2018
Shant Hovsepian, Arcadia Data
Mark Madsen, Think Big Analytics
Users hate BI. And have for a long time
The old problem was access, the new problem is analysis
You keep using that word. I
do not think it means what
you think it means.
What do you mean by “analytics”?
User-focused criteria: context and point of use
Information use is diverse and
varies based on context:
▪ Get a quick answer
▪ Solve a one-off problem
▪ Analyze causes
▪ Do experiments
▪ Make repetitive decisions
▪ Use data in routine processes
▪ Make complex decisions
▪ Choose a course of action
▪ Convince others to take
action
One size doesn’t fit all.
There are two parts to “analytics”
The mathy stuff
The query & reporting stuff
Analysis: the verb that’s ignored
Analysis is something people do across this spectrum
Good paper on the topic of analysis: Enterprise Data Analysis and Visualization: An Interview Study
http://vis.stanford.edu/files/2012-EnterpriseAnalysisInterviews-VAST.pdf
Old market says: There’s nothing wrong with what
you have, just keep buying new products from us
The big data market has an answer…
The data lake: just dump the data in!
Combine with self-service tools: we’ll figure it out later!
The primary view of BI, self service is publishing data
An architectural history of BI tools
First there were files and reporting programs.
Application files feed through a data processing pipeline
to generate an output file. The file is used by a report
formatter for print/screen.
Every report is a program written by a developer.
An architectural history of BI tools
We had the concept of cubes before we had RDBMSs.
First commercial product (Express) in 1970.
Provided interactive response time but had problems:
rigidly defined schema, inflexible definition, data size
limits, slow cube build times, resulting in cube explosions.
An architectural history of BI tools
Then we had databases and tools with embedded SQL,
and query-by-example templating soon after.
These had better scalability and flexibility than OLAP
models. They decoupled storage from access from the
rendering of the interface.
SQL
An architectural history of BI tools
With more regular schema models, in particular
dimensional models that didn’t contain cyclic join paths,
it was possible to automate SQL generation via semantic
mapping layers. Query via business terms made BI usable
by non-technical people.
SQL
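A minimal sketch of the semantic mapping idea described above, in Python. The layer contents, table names and column names (sales_fact, store_dim, date_dim) are invented for illustration and do not come from any particular BI product. Because each dimension table has exactly one join path to the fact table, the generator never has to pick between alternative paths.

```python
# Hypothetical semantic layer: business terms mapped onto a star schema.
# Every table and column name here is an assumption made for illustration.
SEMANTIC_LAYER = {
    "fact_table": "sales_fact",
    "measures": {"Revenue": "SUM(sales_fact.amount)"},
    "dimensions": {
        "Region": ("store_dim", "store_dim.region"),
        "Year": ("date_dim", "date_dim.year"),
    },
    # One join path per dimension table: no cycles, so no ambiguity.
    "joins": {
        "store_dim": "sales_fact.store_id = store_dim.store_id",
        "date_dim": "sales_fact.date_id = date_dim.date_id",
    },
}


def generate_sql(measure, dimensions, layer=SEMANTIC_LAYER):
    """Translate a request phrased in business terms into a SQL string."""
    select_cols = []
    join_clauses = []
    for name in dimensions:
        table, column = layer["dimensions"][name]
        select_cols.append(column)
        clause = f"JOIN {table} ON {layer['joins'][table]}"
        if clause not in join_clauses:
            join_clauses.append(clause)
    measure_expr = f"{layer['measures'][measure]} AS {measure.lower()}"
    sql = f"SELECT {', '.join(select_cols + [measure_expr])} FROM {layer['fact_table']}"
    if join_clauses:
        sql += " " + " ".join(join_clauses)
    if select_cols:
        sql += " GROUP BY " + ", ".join(select_cols)
    return sql


print(generate_sql("Revenue", ["Region", "Year"]))
```

The user asks for "Revenue by Region and Year" and never sees the joins. The cost is that someone has to model and maintain the mapping before any question can be asked.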
Response time is a key driver for analysis
Many approaches today are a step backward. Unless you
resolve this task performance gap, real analysis work is a
challenge and will remain the domain of Excel and tools
that extract subsets of data into memory.
Days
Hours
Minutes
Seconds
Instantaneous
come back tomorrow
go to lunch
take a break
get some coffee
check email/FB
take a sip of coffee
immerse yourself in work
Flow is possible only in the “less
than 3 second” range
BI server architecture shifts
The SQL-generating server model of BI scales
extremely well but has poor user response time.
Solution 1: pre-cache
query results or prebuild
datasets on the BI server
(i.e. the old OLAP model)
Well-known problems
with this.
Solution 2: Shove all the
data into a BI server
repository. Avoids subset
problems. Adds potential
scaling problems.
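A toy illustration of Solution 1 and one of its well-known problems, using Python's built-in sqlite3 module as a stand-in for the back-end database. The CachingBIServer class is hypothetical and not modeled on any real BI server; it only shows that a cached result is fast but goes stale until the next refresh.

```python
import sqlite3


class CachingBIServer:
    """Hypothetical BI server that keeps query results in memory."""

    def __init__(self, conn):
        self.conn = conn
        self.cache = {}

    def query(self, sql):
        # Serve from the cache when possible, otherwise ask the database.
        if sql not in self.cache:
            self.cache[sql] = self.conn.execute(sql).fetchall()
        return self.cache[sql]

    def refresh(self):
        # Batch-style invalidation: drop everything, rebuild on demand.
        self.cache.clear()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [("east", 10), ("west", 20)])

server = CachingBIServer(conn)
q = "SELECT SUM(amount) FROM sales"

first = server.query(q)                               # goes to the database
conn.execute("INSERT INTO sales VALUES ('east', 5)")  # new data arrives
stale = server.query(q)                               # cache hit, misses the new row
server.refresh()
fresh = server.query(q)                               # goes to the database again

print(first, stale, fresh)
```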
IT reality today is multiple data stores, not one place
Independent, purpose-built databases and processing systems for
different types of data and query / computing workloads are the new
norm for information delivery. No single place for data or access.
BI, dashboards,
analytics, apps
1 MargeInovera $150,000 Statistician
2 AnitaBath $120,000 Sewerinspector
3 IvanAwfulitch $160,000 Dermatologist
4 NadiaGeddit $36,000 DBA
Query
processing
Databases Documents Flat Files Objects Streams ERP SaaS Applications
Source Environments
Data
processing
Stream
processing
There is always a third way
The previous choices were driven by client-server
thinking. We have a distributed (cloud) environment.
Possibilities:
Don’t force all the compute
into the DB or server.
Don’t force all the compute
to the client.
Data on demand, bring it to
the analysis from where it is,
and/or execute the analysis
local to where the data is.
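The two possibilities can be contrasted with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for a remote data store; the events table and both function names are assumptions made for illustration. One function brings the rows to the analysis, the other executes the analysis where the data lives and returns only the summary.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, bytes INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("a", 100), ("b", 250), ("a", 50), ("c", 75)],
)


def totals_client_side(conn):
    """Bring the data to the analysis: fetch every row, aggregate locally."""
    rows = conn.execute("SELECT user, bytes FROM events").fetchall()
    totals = {}
    for user, nbytes in rows:
        totals[user] = totals.get(user, 0) + nbytes
    return totals, len(rows)


def totals_pushed_down(conn):
    """Run the analysis where the data is: only the summary is returned."""
    rows = conn.execute(
        "SELECT user, SUM(bytes) FROM events GROUP BY user"
    ).fetchall()
    return dict(rows), len(rows)


local_totals, rows_moved_local = totals_client_side(conn)
remote_totals, rows_moved_remote = totals_pushed_down(conn)
print(local_totals == remote_totals, rows_moved_local, rows_moved_remote)
```

Both paths give the same answer. What differs is how many rows cross the wire and which side does the work, and a distributed environment lets you choose per query.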
Connectors to data sources the tool can
communicate with.
Semantic layer tools: what we’re used to
Client requests and responses in
native format
BI server:
Semantic layer
Query generation
Connectors
SQL and NoSQL
connectors
Note: most BI servers
still can’t talk to more
than one DB at the same
time
Map-based tools: what we’re back to
Internal or external
API-accessible sources.
Connector based data
sources the tool can
communicate with.
Query from client / server
Queries sources directly
Any integration done locally
No semantic layer translating
SQL and NoSQL
connectors
XML, JSON, text,
binary return
formats
Note: most products have minimal to
no local join ability
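What "any integration done locally" means in practice can be sketched as a client-side hash join. The two sources, their formats and the field names below are hypothetical: one returns JSON from an API, the other returns rows from a SQL connector, and the tool has to combine them itself.

```python
import json

# Hypothetical results from two independent sources.
crm_json = '[{"customer_id": 1, "name": "Acme"}, {"customer_id": 2, "name": "Globex"}]'
orders_rows = [(1, 120.0), (1, 80.0), (3, 40.0)]  # (customer_id, amount)


def local_hash_join(customers, orders):
    """Inner join done in the tool, since no server sits between the sources."""
    # Build a lookup on the smaller side, then probe it with the other side.
    index = {c["customer_id"]: c["name"] for c in customers}
    joined = []
    for customer_id, amount in orders:
        if customer_id in index:
            joined.append((index[customer_id], amount))
    return joined


result = local_hash_join(json.loads(crm_json), orders_rows)
print(result)
```

Without a semantic layer, keys, types and unmatched rows (customer 3 here) are the client's problem, and everything has to fit in local memory.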
With analysis, the BI approach has to be inverted
The process used today for data warehousing:
1. Model
2. Collect
3. Analyze
The new process is:
1. Collect
2. Analyze
3. Model
This is a shift from
planned design to
adaptive design for data
management, and a
multi-skill team.
Analysis and BI: Discoverability and Repeatability
Discover
Explore
Analyze
Model
Consume
Promote
Focus on repeatability
Application cycle time
Focus on discoverability
Analyst cycle time
80% of data use 80% of analysis
Just enough modeling: an analogy for analytic data
Multiple contexts of use, differing quality levels
You need to keep the original because just like baking,
you can’t unmake dough once it’s mixed.
A core problem with one global schema is change
Big data answer? Schema on read
Prof. Dr. Jens Albrecht
Schema on read is an answer, sometimes
Schema-on-read is really only good for the developer who doesn’t
know what to do with the data. There is a price to pay with schema
on read, but you usually don’t see it at the beginning.
SoR problem: metadata is what you wished your data looked like.
Reality is not requirements = code
Reality is the data, not the metadata
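A small sketch of schema on read and the price that shows up later. The raw records and the schema are invented for illustration: the data is stored exactly as it arrived, and the schema is applied only when someone reads it.

```python
import json

# Raw records kept as they arrived (hypothetical data).
RAW_LINES = [
    '{"id": 1, "amount": "19.99", "ts": "2018-03-06"}',
    '{"id": 2, "amt": 5}',               # field was renamed upstream
    '{"id": "3", "amount": "n/a"}',      # value is not a number
]

# The schema the reader wishes the data had: field name -> type cast.
SCHEMA = {"id": int, "amount": float}


def read_with_schema(lines, schema):
    """Apply the schema at read time and split records into good and rejected."""
    good, rejected = [], []
    for line in lines:
        record = json.loads(line)
        try:
            good.append({field: cast(record[field]) for field, cast in schema.items()})
        except (KeyError, ValueError, TypeError):
            rejected.append(record)
    return good, rejected


good, rejected = read_with_schema(RAW_LINES, SCHEMA)
print(len(good), len(rejected))
```

Writing was free. Two of the three records fail at read time, and every reader has to rediscover and handle that on their own.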
How did we get to this
point with BI & big data?
There’s a difference
between having no past
and actively rejecting it.
Move Data to Separate BI Server
Move Data to BI Server
BI & Visualization Server
Pros
✓ Least Costly
✓ Use existing BI tools
Cons
✘ Shallow insights – summary data
✘ Requires IT/DBA: new views & data
movement
✘ Separate security models
✘ Not real-time: batch data updates
✘ Heaviest burden on network
ODBC to SQL in Big Data, SQL on Big Data
Pros
✓ Can get detailed data
✓ Performance leverages the architecture
Cons
✘ Lower user concurrency
✘ Cannot access unstructured data (requires
schema)
✘ Cost - Manage security in multiple tools,
separate administration for metadata
(Impala, Vertica, Hawq, Presto)
BI & Visualization Server
(SparkSQL, Hive, Athena)
BI & Visualization Server
(R)OLAP on Hadoop
Pros
✓ Use existing BI tools
✓ Higher user concurrency
Cons
✘ Lacks ad-hoc freedom - Requires IT/DBA for
new views
✘ Not real-time: batch data updates
✘ Cannot access unstructured data (requires
schema)
✘ Cost – Multiple tools and data duplication
✘ Increased administration – Separate security
models, administration
OLAP Middleware
Native BI
Pros
✓ Greatest user concurrency
✓ Linear scalability
✓ Agility for analysts (drill to detail)
✓ Supports complex data sources
✓ Real time
✓ Lowest TCO: simplified architecture
Cons
✘ Newer technology and approach
✘ Requires some Hadoop skills to set up and
maintain
BI runs on Hadoop
Hadoop: it disaggregates the database
One of the key things Hadoop does is to separate the
storage, execution and API layers of a database. This
allows for processing flexibility, but it does not permit
one to build a reliable, high performance database
across the layers. You trade these for write flexibility.
Hadoop distributed filesystem (HDFS)
General-purpose data engines
Abstraction layers
Storage management
A more specific look at layers and engines
Base storage
SQL, MDX
Kylin
Storage mgmt
Engine
Abstraction
layer / API
You can program to any layer you
choose. Some projects build on top of
multiple others.
Language/API Engine
Hadoop distributed filesystem (HDFS)
MapReduce Tez
Cascading
Spark
Storage (filetypes in HDFS, Hbase, etc)
Crunch
Pig
Hive
SparkSQL
NativeAPI
Giraph
Hive
Crunch
Pig
Impala
Drill
Presto
NativeAPI
NativeAPI
Hive
Pig
NativeAPI
Hbase
Phoenix
Four models for SQL on Hadoop
1. Parse and compile SQL into MapReduce jobs
2. Put a SQL interpreter on a generic execution engine
3. Run a native SQL engine in the cluster
4. Run a SQL interpreter on a non-generic engine (this will
limit SQL functionality based on the underlying engine).
Hadoop distributed filesystem (HDFS)
MapReduce Tez Spark
Storage (filetypes in HDFS, Hbase, etc)
Hive
SparkSQL
Hive
Impala
Drill
Presto
Hive
Druid
Hbase
Phoenix
1 2 3 2 4
Metanautix
JethroDB
What’s under the hood matters when querying
Source: Randy Bias
The core problem of old BI was scalability. This is solved.
New uses require new platforms for different workloads.
Source: Noumenal, Inc.
Note: this was 8 years
ago. Largest MPP
RDBMS I know is 99PB
Standard DBs
MPP DBs
Platforms
The shifting BI paradigm
The tool market is shifting,
driven by new architectures
that are enabled by new
technologies.
Front-end tools are evolving
away from BI-as-publishing,
which changes their design,
increases the burden on back
end databases and creates new
interaction challenges.
You need to evaluate tools
based on more usage scenarios
and interactive capabilities,
less on report-building /
dashboard features.
One standard tool is not the
norm, it’s the exception.
TANSTAAFL
When replacing the old
with the new (or ignoring
the new over the old) you
always make tradeoffs,
and usually you won’t see
them for a long time.
Technologies are not
perfect replacements for
one another. Often not
better, only different.
The right tool is the one that people will actually use,
not the one you want them to use
“But we already have an enterprise standard”
About the Presenter
Mark Madsen is the global head of
architecture at Teradata. Prior to that
he was president of Third Nature, a
research and consulting firm focused
on analytics, data integration and data
management. Mark is an award-
winning author, architect and CTO
whose work has been featured in
numerous industry publications. Over
the past ten years Mark received
awards for his work from the American
Productivity & Quality Center, TDWI,
and the Smithsonian Institute. He is an
international speaker, chairs several
conferences, and is on the O’Reilly
Strata program committee. For more
information or to contact Mark, follow
@markmadsen on Twitter or visit
http://ThirdNature.net
About the Presenter
Shant Hovsepian is a cofounder
and CTO of Arcadia Data, where he
is responsible for the company’s
long-term innovation and technical
direction. Previously, Shant was a
member of the engineering team
at Teradata, which he joined
through the acquisition of Aster
Data. Shant interned at Google,
where he worked on optimizing the
AdWords database, and was a
graduate student in computer
science at UCLA. He is the coauthor
of publications in the areas of
modular database design and high-
performance storage systems.

