Using Data Platforms that are Fit-For-Purpose
Presented by: William McKnight
“#1 Global Influencer in Data Warehousing” Onalytica
President, McKnight Consulting Group
A 2-time Inc. 5000 Company
@williammcknight
www.mcknightcg.com
(214) 514-1444
© All rights reserved. Matillion 2021
A vendor perspective from Matillion
Modern Data Storage Evolutions
Speaker: Paul Lacey, Sr. Director Product Marketing, Matillion
History of Data Warehousing
• 1960: DBMS
• 1970: RDBMS
• 1980: SQL
• 1988: Data Warehouse
• 1992: Published: Building the Data Warehouse
• 1996: Published: The Data Warehouse Toolkit
• 2005: Hadoop, Big Data
• 2013: Cloud Data Warehouse
• 2015: Cloud ETL
• 2017: Lakehouse
History of Data Warehousing
[Timeline figure covering 2005 to 2019. Recoverable labels: Hadoop / ‘Big Data’ (2005), Cloud ETL, Spectrum. The entries attached to the other year markers (2011, 2013, 2014, 2015, 2017, 2019) did not survive extraction.]
Architectural Paradigms
[Chart labels: Effort to Reward, Innovation]
• 2005: Original Big Data Stack
• 2013: Pipeline 2.0
• 2015: Hybrid Storage
• 2017: Lakehouse
The Original Big Data Stack
Pipeline 2.0
Hybrid Storage
ML & BI Separated and Siloed
                 Data Engineering    Data Science
Wrangling        ETL                 Data Prep
Storage          Data Warehouse      Data Lake
Processing       Scala               Pandas
Orchestration    Airflow             Jupyter Notebooks
Visualization    Tableau             Matplotlib
The Lakehouse Approach
1. Combine data science and data engineering workflows
Lakehouse – best of both worlds
• Data Warehouse and Data Lake
• Streaming Analytics, BI, Data Science, Machine Learning
• Structured, Semi-Structured and Unstructured Data
Lakehouse
Familiar Interfaces, One Dataset
[Diagram labels: Matillion ELT Logic and Orchestration; SQL, ACID, PySpark, Spark; Analysts, Data Scientists, Data Engineers, Integrators, Business Users; Data Eng, Data Science]
Accessible Integration Brings Unified Analytics
More Info: matillion.com
Matillion #TeamGreen
Thank You!
Performance
• Performance is a critical point of interest when selecting an analytics platform.
• To measure data warehouse performance, we use similarly priced configurations across competing data warehouses.
• Usually, when people say they care about performance, what they ultimately care about is price/performance.
• Creating fair tests can overwhelm many shops, and the effort involved is usually underestimated.
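To make the price/performance metric concrete, here is a minimal sketch. It assumes one common convention (cost to complete the workload once, that is, elapsed hours times the hourly price); the platform names, timings, and prices are hypothetical and not taken from the presentation.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkResult:
    """Outcome of one workload run on one platform (all values hypothetical)."""
    platform: str
    elapsed_seconds: float   # wall-clock time to finish the whole workload
    hourly_price_usd: float  # price of the configuration under test


def price_performance(result: BenchmarkResult) -> float:
    """Cost in USD to complete the workload once; lower is better."""
    return result.elapsed_seconds / 3600.0 * result.hourly_price_usd


results = [
    BenchmarkResult("platform_a", elapsed_seconds=1800, hourly_price_usd=32.0),
    BenchmarkResult("platform_b", elapsed_seconds=1200, hourly_price_usd=60.0),
]

# Rank by cost per workload run rather than by raw speed.
ranked = sorted(results, key=price_performance)
for r in ranked:
    print(f"{r.platform}: ${price_performance(r):.2f} per workload run")
```

In this made-up case the slower platform ranks first, because its lower hourly price outweighs its longer runtime.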
The Perils of Performance Alone
• A modern workload is less often a fixed set of queries and more often an interactive, variable number of queries.
• A lack of certain key features and functions in the chosen platform leads to increased time spent working around the gaps.
• Some data warehouse platforms have features that appear beneficial and desirable but carry hidden downsides.
Enterprise Analytic Platforms
Data is fit-for-purpose (FFP) when it is…
• In a leverageable platform
• In a platform appropriate for its profile and usage
• Supported by strong non-functionals (availability, performance, scalability, stability, durability, security)
• Captured at the most granular level
• At a data quality standard (as defined by Data Governance)
Product Setup
Cost Predictability and Transparency
• The cost profile options for cloud databases are straightforward if you accept the defaults for simple workloads or proof-of-concept (POC) environments.
• Low initial entry costs and inadequately scoped environments can artificially lower expectations of the true cost of moving to a cloud data warehouse.
• On some platforms, you pay for compute resources as a function of time, but the hourly rate also depends on which enterprise features you need.
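The gap between a POC bill and a production bill under time-based pricing can be sketched as follows. The tier names, rates, unit counts, and hours are hypothetical placeholders, not real vendor prices.

```python
# Hypothetical hourly rates per compute unit, by feature tier (not real prices).
HOURLY_RATE_USD = {"standard": 2.0, "enterprise": 3.0, "business_critical": 4.0}


def monthly_compute_cost(tier: str, compute_units: int, hours_running: float) -> float:
    """Compute cost = rate for the chosen tier x compute units x hours running."""
    return HOURLY_RATE_USD[tier] * compute_units * hours_running


# A POC on the default tier: 4 units for 40 hours in the month.
poc = monthly_compute_cost("standard", compute_units=4, hours_running=40)

# Production on a higher tier: 16 units running around the clock (~730 h/month).
prod = monthly_compute_cost("enterprise", compute_units=16, hours_running=730)

print(f"POC: ${poc:,.0f}  production: ${prod:,.0f}")
```

The point of the sketch is only that all three factors (tier, size, hours) change between a POC and production, so POC cost is a poor predictor of the eventual bill.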
Cost Consciousness and Licensing Structure
• Look for cost optimizations such as not paying when the system is idle, compression to save storage costs, and moving or isolating workloads to avoid contention.
• Look for the ability to operate directly on compact open file formats such as Parquet and ORC.
• Costs can also spin out of control if you have to pay a separate license for each deployment option or each machine learning algorithm.
Easy Administration
• Overall cost, time, and storage and compute resources are all affected by how simple the platform is to configure and use.
• The platform should have embraced a self-sufficiency model for its customers and be well into the process of automating repetitive tasks.
• Easy administration starts with a simple setup process that asks for basic information and offers helpful guidance for selecting the storage and node configurations.
Optimizer Robustness
• The data warehouse should be designed for complex decision support and machine learning activity in a multi-user, mixed-workload, highly concurrent environment.
• Check on conditional parallelism and on what causes variations in the parallelism deployed.
• Check on dynamic and controllable prioritization of resources for queries.
Dedicated Compute
• The dedicated compute category represents the heart of the analytics stack: the data warehouse itself.
• A modern cloud data warehouse must have an architecture that separates compute and storage.
• The power to scale compute and storage independently of one another has transitioned from an industry trend to an industry standard.
Dedicated Storage
• The dedicated storage category represents storage of the enterprise data.
• Formerly, this data was tightly coupled to the data warehouse itself, but modern cloud architecture allows the data to be stored separately (and priced separately).
Data Integration
• The data integration category represents the movement of enterprise data from source to the target data warehouse through conventional ETL (extract-transform-load) and ELT (extract-load-transform) methods.
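The difference between the two methods is where the transform step runs. The sketch below is a toy illustration that uses an in-memory SQLite database as a stand-in for the target warehouse; the table names and sample rows are hypothetical.

```python
import sqlite3

# Hypothetical rows extracted from a source system (untrimmed names, text amounts).
raw_rows = [("  Alice ", "42.50"), ("BOB", "17.00")]


def etl(conn: sqlite3.Connection) -> None:
    """ETL: transform in the integration layer, then load the cleaned rows."""
    conn.execute("CREATE TABLE sales_etl (customer TEXT, amount REAL)")
    cleaned = [(name.strip().lower(), float(amount)) for name, amount in raw_rows]
    conn.executemany("INSERT INTO sales_etl VALUES (?, ?)", cleaned)


def elt(conn: sqlite3.Connection) -> None:
    """ELT: load the raw rows as-is, then transform with SQL inside the target."""
    conn.execute("CREATE TABLE sales_raw (customer TEXT, amount TEXT)")
    conn.executemany("INSERT INTO sales_raw VALUES (?, ?)", raw_rows)
    conn.execute(
        "CREATE TABLE sales_elt AS "
        "SELECT lower(trim(customer)) AS customer, "
        "CAST(amount AS REAL) AS amount FROM sales_raw"
    )


conn = sqlite3.connect(":memory:")  # SQLite stands in for the data warehouse
etl(conn)
elt(conn)
etl_rows = conn.execute("SELECT * FROM sales_etl ORDER BY customer").fetchall()
elt_rows = conn.execute("SELECT * FROM sales_elt ORDER BY customer").fetchall()
print(etl_rows == elt_rows)
```

Both paths end with the same cleaned table; ELT additionally keeps the raw data in the target and uses the warehouse's own compute for the transformation.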
Data Access
• Azure Synapse and Google BigQuery have a “serverless” pricing model that lets users run queries and pay only for the data they scan, not an hourly rate for compute.
• Redshift has the Spectrum service to scan data in S3 without loading it into the data warehouse; however, you pay for the data scanned, and you also need a running Redshift cluster at an additional charge.
• For Snowflake, you pay for the compute but not for data scanned.
• For all these scenarios (except Snowflake), we assumed 500 TB scanned per month for the Medium-tier enterprise and 2,000 TB scanned per month for Large organizations.
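A rough sketch of how the two scan-based models differ is shown below. The scan volumes are the ones stated above; the per-TB price, the cluster hourly price, and the hours per month are placeholder assumptions, not vendor quotes.

```python
# Monthly scan volumes assumed in the slide, in TB.
SCANNED_TB_PER_MONTH = {"medium": 500, "large": 2000}

# Placeholder prices for illustration only; check current vendor price lists.
USD_PER_TB_SCANNED = 5.0
CLUSTER_USD_PER_HOUR = 4.0
HOURS_PER_MONTH = 730.0


def serverless_cost(tb_scanned: float) -> float:
    """Serverless model: pay only for the data scanned."""
    return tb_scanned * USD_PER_TB_SCANNED


def scan_plus_cluster_cost(tb_scanned: float) -> float:
    """Scan charges plus a cluster that must stay running all month."""
    return serverless_cost(tb_scanned) + CLUSTER_USD_PER_HOUR * HOURS_PER_MONTH


for tier, tb in SCANNED_TB_PER_MONTH.items():
    print(f"{tier}: scan only ${serverless_cost(tb):,.0f}, "
          f"scan + cluster ${scan_plus_cluster_cost(tb):,.0f}")
```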
Data Lake
• The data lake category represents the use of a data lake that is separate from the data warehouse. This is common in many modern data-driven organizations as a way to store and analyze massive sets of “colder” data that don’t necessarily belong in the data warehouse.
Sample Breakout (AWS)
• Dedicated Compute: 43%
• Storage: 0%
• Data Integration: 14%
• Streaming: 4%
• Spark Analytics: 3%
• Data Exploration: 6%
• Data Lake: 20%
• BI: 5%
• Machine Learning: 5%
• Identity Management: 0%
• Data Catalog: 0%
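As a quick sanity check, the percentages above sum to 100. The sketch below applies them to a total annual spend; the total is a hypothetical figure chosen only to show the arithmetic.

```python
from typing import Dict

# Percentages from the sample AWS breakout above.
BREAKOUT_PCT: Dict[str, int] = {
    "Dedicated Compute": 43,
    "Storage": 0,
    "Data Integration": 14,
    "Streaming": 4,
    "Spark Analytics": 3,
    "Data Exploration": 6,
    "Data Lake": 20,
    "BI": 5,
    "Machine Learning": 5,
    "Identity Management": 0,
    "Data Catalog": 0,
}


def allocate(total_usd: float) -> Dict[str, float]:
    """Split a total spend across categories according to the breakout."""
    return {name: total_usd * pct / 100 for name, pct in BREAKOUT_PCT.items()}


total = 1_000_000  # hypothetical annual platform spend in USD
for name, usd in sorted(allocate(total).items(), key=lambda kv: -kv[1]):
    print(f"{name:<20} ${usd:>10,.0f}")
```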
Product Utilization
Concurrency Scaling
• If the database has concurrency limitations, designing around them is difficult at best and limits effective data usage.
• If the data warehouse automatically scales up to overcome concurrency limitations, this may be costly if the data warehouse charges by compute node.
• If the data warehouse charges per user, costs will also increase as the data warehouse is put to more use in the company.
• Look for a data warehouse that provides linear scaling in overall query workload performance as concurrent users are added.
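One way to check for linear scaling is to compare measured throughput at each concurrency level with what perfectly linear scaling from the smallest level would give. The sketch below assumes throughput is measured in queries per hour; the measurements are invented for illustration.

```python
from typing import Dict


def scaling_efficiency(throughput_by_users: Dict[int, float]) -> Dict[int, float]:
    """Ratio of observed throughput to ideal linear throughput (1.0 = linear).

    throughput_by_users maps a number of concurrent users to the queries per
    hour completed at that level. The smallest level is used as the baseline.
    """
    base_users = min(throughput_by_users)
    per_user = throughput_by_users[base_users] / base_users
    return {
        users: qph / (per_user * users)
        for users, qph in sorted(throughput_by_users.items())
    }


# Hypothetical measurements: concurrent users -> queries per hour.
observed = {1: 100.0, 5: 480.0, 10: 900.0, 20: 1200.0}
efficiency = scaling_efficiency(observed)
for users, eff in efficiency.items():
    print(f"{users:>3} users: {eff:.0%} of linear")
```

An efficiency that falls well below 1.0 as users are added suggests a concurrency limit in the platform or the configuration under test.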
Resource Elasticity
• A data warehouse needs to be able to scale up and down and take advantage of the elastic compute and storage capabilities of the cloud, public or private, without disruption or delay.
• The more the customer needs to be involved in resource determination and provisioning, the less elastic, and less modern, the solution is.
• One thing to watch for in elastic scaling is that the amount of money spent stays under the customer’s control.
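A simple way to picture spend control during elastic scaling is a budget guard that caps how many extra nodes an autoscaler may add. This is a hypothetical policy sketched for illustration, not a feature of any particular platform; every parameter name and value is an assumption.

```python
def nodes_to_provision(
    requested_nodes: int,
    usd_per_node_hour: float,
    spent_usd: float,
    monthly_budget_usd: float,
    hours_remaining: float,
) -> int:
    """Grant as many requested nodes as the remaining budget can fund.

    Each granted node is assumed to run for the rest of the month.
    """
    if hours_remaining <= 0 or usd_per_node_hour <= 0:
        return 0
    remaining_budget = max(monthly_budget_usd - spent_usd, 0.0)
    affordable = int(remaining_budget // (usd_per_node_hour * hours_remaining))
    return min(requested_nodes, affordable)


# The autoscaler asks for 8 more nodes; only 5 fit in the remaining budget.
granted = nodes_to_provision(
    requested_nodes=8,
    usd_per_node_hour=2.0,
    spent_usd=9_000.0,
    monthly_budget_usd=10_000.0,
    hours_remaining=100.0,
)
print(granted)
```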
Machine Learning
• Today, data warehouse query languages need to be extended to include machine learning, or firms may find the required programming too challenging to keep pace.
• Data warehouses today need to weave machine learning into their data processing workflows.
• Vendors must extend SQL to include machine learning functions and algorithms, expanding the capabilities of the tools and users that rely on it.
• If your database does not include machine learning, there are many extra things to be concerned with.
• Other components will be needed to complete the toolbox and get the job done.
• Ideally, security for machine learning will be the same as database security.
• The data warehouse also needs to run machine learning at scale, on full data sets rather than samples.
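To illustrate what calling machine learning from SQL can look like, the toy sketch below registers a scoring function in an in-memory SQLite database and applies it to every row in a query. SQLite is only a stand-in for a data warehouse, and the model coefficients, function name, and table are hypothetical.

```python
import math
import sqlite3

# Hypothetical coefficients of an already-trained logistic regression model.
INTERCEPT, W_VISITS, W_SPEND = -3.0, 0.4, 0.01


def predict_score(visits: float, spend: float) -> float:
    """Logistic regression score in the range (0, 1)."""
    z = INTERCEPT + W_VISITS * visits + W_SPEND * spend
    return 1.0 / (1.0 + math.exp(-z))


conn = sqlite3.connect(":memory:")
# Expose the Python function to SQL under the name predict_score (2 arguments).
conn.create_function("predict_score", 2, predict_score)

conn.execute("CREATE TABLE customers (id INTEGER, visits REAL, spend REAL)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, 2, 50), (2, 10, 300)],
)

# Scoring runs where the data lives, over all rows rather than a sample.
scored = conn.execute(
    "SELECT id, predict_score(visits, spend) FROM customers ORDER BY id"
).fetchall()
for customer_id, score in scored:
    print(customer_id, round(score, 3))
```

Because the function is invoked through SQL, the same query permissions that protect the table also govern who can score it, which is the spirit of the security point above.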
Data Storage Format Alternatives
• Cloud object storage is relatively inexpensive, making data storage at high scale affordable.
• On-premises, specialized private cloud storage options such as Pure Storage FlashBlade tend to offer similar flexibility in the types of data stored.
• To take full advantage of the elasticity of the cloud without driving up costs, data warehouse compute and storage need to scale separately.
• To take full advantage of the many data formats available, such as Apache ORC, Apache Parquet, JSON, and Apache Avro, modern data warehouses need to be able to analyze that data without moving or altering it.
• A unified analytics warehouse that supports these formats lets you query them directly, without having to expand hierarchical, complex data types into a standard tabular structure for analysis.
• You should also be able to import data directly from these formats.
• The ability to join data across the various internal and external data formats provides the highest level of analytic flexibility.
Hadoop Sequence File and Parquet File
Graph Databases
[Figure: graph diagram with two vertices labeled “Bridge vertex”]
• Subject: John R Peterson; Predicate: Knows; Object: Frank T Smith
• Subject: Triple #1; Predicate: Confidence Percent; Object: 70
• Subject: Triple #1; Predicate: Provenance; Object: Mary L Jones
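The three statements above can be modelled as triples in which the second and third use the first triple itself as their subject. The sketch below is a minimal in-memory representation for illustration; the class and function names are hypothetical and not tied to any graph database product.

```python
from typing import Any, Dict, List, NamedTuple


class Triple(NamedTuple):
    """A subject-predicate-object statement; the subject may be another Triple."""
    subject: Any
    predicate: str
    obj: Any


# Triple #1 from the slide.
triple_1 = Triple("John R Peterson", "Knows", "Frank T Smith")

# Statements about Triple #1 (its confidence and provenance).
store: List[Triple] = [
    triple_1,
    Triple(triple_1, "Confidence Percent", 70),
    Triple(triple_1, "Provenance", "Mary L Jones"),
]


def statements_about(triples: List[Triple], subject: Any) -> Dict[str, Any]:
    """Return predicate -> object for every triple with the given subject."""
    return {t.predicate: t.obj for t in triples if t.subject == subject}


print(statements_about(store, "John R Peterson"))
print(statements_about(store, triple_1))
```

Treating a triple as a subject is what lets metadata such as confidence and provenance attach to a relationship rather than to a person.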
[Figure labels: Usage Understanding by the Builders; Data Cultivation; Data Warehouse; Data Lake]
Balance of Analytics
[Figure: repeated panels labeled Analytic Applications, DW, and Data Lake]
Design Your Test
• What are you benchmarking?
– Query performance
– Load performance
– Query performance with concurrency
– Ease of use
• Competition
• Queries, Schema, Data
• Scale
• Cost
• Query Cut-Off
• Number of runs/cache
• Number of nodes
• Tuning allowed
• Vendor Involvement
• Any free third party, SaaS, or on-demand software (e.g., Apigee or SQL Server)
• Any not-free third party, SaaS, or on-demand software
• Instance type of nodes
• Measure Price/Performance!
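Several items in this checklist (number of runs, cache effects, query cut-off) can be pinned down in a small timing harness. The sketch below is one possible shape, with hypothetical names; note that it records the cut-off value for a slow query but does not abort the query, which a real harness would need to do through the database driver.

```python
import statistics
import time
from typing import Callable, List


def time_query(run_query: Callable[[], None], runs: int, cutoff_seconds: float) -> float:
    """Median wall-clock seconds over several runs, each capped at the cut-off.

    All runs are kept, including the first (cold-cache) one; drop it before
    taking the median if only warm-cache numbers are wanted.
    """
    timings: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query()
        elapsed = time.perf_counter() - start
        # A query slower than the cut-off is scored at the cut-off value.
        timings.append(min(elapsed, cutoff_seconds))
    return statistics.median(timings)


def fake_query() -> None:
    """Stand-in for submitting a query to the platform under test."""
    time.sleep(0.02)


median_seconds = time_query(fake_query, runs=3, cutoff_seconds=5.0)
print(f"median: {median_seconds:.3f} s")
```

Fixing the number of runs and the cut-off before testing starts keeps the comparison the same for every platform in the competition.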
