Kelly Stirman
VP Strategy
@kstirman
Options for Data Prep - A Survey of the Current Market
Why data prep?
Analytics on modern data is incredibly hard
Unprecedented complexity
The demands for data are growing rapidly
Increasing demands
Reporting
New products
Forecasting
Threat detection
BI
Machine Learning
Segmenting
Fraud prevention
Your analysts are hungry for data
SQL
Today you engineer data flows and reshaping
Data Staging
• Custom ETL
• Fragile transforms
• Slow moving
Data Warehouse
• $$$
• High overhead
• Proprietary lock-in
Cubes, BI Extracts & Aggregation Tables
• Data sprawl
• Governance issues
• Slow to update
SQL
Data Prep
Source: Forrester
Old v. new approaches
Data Integration v. Data Prep
| | Data Integration | Data Prep |
| Primary user | IT | Business Analyst |
| User works from | Metadata | Data samples |
| Prioritizes | Governance, security | Ease of use, time to insight |
| Sample vendors | Informatica, IBM, SAS, SQL tools | Alteryx, Trifacta, Paxata |
Data Integration is the standard
• For 25+ years, Data Integration has been an essential tool for IT
• Pros
    • Mature, robust
    • Deep integrations to enterprise standards
    • Security and governance controls
    • Server-based: scalable, centralized
• Cons
    • IT users only
    • Assumes minimal data quality
    • Mature for enterprise sources
    • Less mature for cloud, 3rd party apps, Hadoop, NoSQL
    • Complex, expensive
Data Prep prioritizes speed, ease of use
• Newer entrants, architected for modern resources
• Pros
    • User experience works for both IT and business
    • Data-centric model vs. metadata-centric model
    • Support for Hadoop, NoSQL, cloud, machine learning
    • Can leverage Hadoop and/or cloud for processing, storage
    • Faster time to value
• Cons
    • Less mature tech stack
    • Small vendors, limited ecosystem of integrations and skills
    • Security integrations less comprehensive
    • Assumes governance, authority, lineage handled elsewhere
    • Still need IT on board and coordinating process
Gartner 2016 Forrester 2016 Bloor 2017
Analyst coverage (see references)
Open source alternatives
• RDBMS
    • Pros: SQL based; mature; ecosystem
    • Cons: non-relational sources; scalability; ease of use
• Apache Hive
    • Pros: scalability; SQL based; Hadoop integrations
    • Cons: latency; ease of use; integrations
• Apache Spark
    • Pros: scalability; performance; Python/R integration; ML
    • Cons: ease of use; integrations; maturity
• Python Pandas
    • Pros: performance; pervasive skills; ecosystem; flexibility
    • Cons: scale out is complex; ease of use
• R dplyr
    • Pros: performance; pervasive skills; ecosystem; flexibility
    • Cons: scale out is complex; ease of use
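To make the RDBMS option concrete, here is a minimal sketch of SQL-based data prep using Python's built-in sqlite3 module. The table names, column names, and sample rows are all hypothetical and chosen only for illustration; a real pipeline would read from an actual staging table rather than an in-memory list.

```python
import sqlite3

# Hypothetical raw rows, as they might land in a staging table.
RAW_ORDERS = [
    ("  Alice ", "us", "20.50"),
    ("BOB", "US", "5.00"),
    ("alice", "Us", "10.25"),
    ("Carol", "de", None),  # missing amount
]


def prepare(conn):
    """Load raw rows, then clean and aggregate them with plain SQL."""
    conn.execute(
        "CREATE TABLE raw_orders (customer TEXT, country TEXT, amount TEXT)"
    )
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", RAW_ORDERS)

    # Cleaning step: trim and lowercase names, uppercase country codes,
    # cast amounts to numbers, and drop rows that have no amount.
    conn.execute(
        """
        CREATE TABLE clean_orders AS
        SELECT lower(trim(customer)) AS customer,
               upper(trim(country))  AS country,
               CAST(amount AS REAL)  AS amount
        FROM raw_orders
        WHERE amount IS NOT NULL
        """
    )

    # Reshaping step: one row per customer and country.
    return conn.execute(
        """
        SELECT customer, country, SUM(amount) AS total
        FROM clean_orders
        GROUP BY customer, country
        ORDER BY customer
        """
    ).fetchall()


totals = prepare(sqlite3.connect(":memory:"))
print(totals)
```

The sketch reflects the trade-off listed above: the SQL is familiar and mature, but every cleaning rule has to be written by hand, which is the ease-of-use gap the data prep tools target.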
Screenshot commentary
How to decide?
| Category | Good Fit | Primary User | Model | Scalability |
| ETL Tools | Static, predictable integrations between enterprise tech | IT | Data Pipeline, metadata-based | Single server |
| BI Tools | “Last Mile” data prep | Business | Embedded | Desktop |
| Trifacta, Paxata | Scalable, collaborative data prep for business users | Business | Spreadsheet, sample-based | Hadoop cluster |
| Custom Scripts | Maximum flexibility | IT | Data Pipeline, metadata-based | Single server |
| Alteryx, Datawatch | Building BI extracts, easier to use than ETL | IT | Data Pipeline, metadata-based | Desktop (single server optional) |
| SAS Data Loader | IT users | IT | Data Pipeline, metadata-based | Single server |
| Tamr | Human-aided ML for data cleansing | Business | Spreadsheet, sample-based | Single server |
Important questions to ask
• Usability – knowing data is more important than knowing tech
• Collaboration – essential feature for business users
• Data sources – ODBC for NoSQL, cloud, Hadoop not enough
• License model – will influence how you adopt the tool
• Governance – solving problems or creating new ones?
• Complexity – how many moving parts for your end-to-end analytical process
• Vendor viability – crowded market of small players
• Ecosystem – no technology is an island
Market predictions
• BI tools build integrated capabilities
    • But customers want one solution for all tools
• ETL vendors try to become “business friendly”
    • Legacy technology stack is an impediment, not an enabler
• Hadoop vendors acquire emerging data prep players
    • What about data outside of Hadoop?
• Opportunity for new approach
    • Truly self service for the business (no IT required)
    • Works with all data sources (relational, cloud, NoSQL, Hadoop)
    • Works with all analytical tools (BI, SQL, R, Python, Spark)
    • Integrates all layers of the analytical stack
References
• Gartner (Market Guide): https://www.gartner.com/doc/3418832/market-guide-selfservice-data-preparation
• Forrester (Wave): https://www.forrester.com/report/The+Forrester+Wave+Data+Preparation+Tools+Q1+2017/-/E-RES128464
• Forrester (Vendor Landscape): https://www.forrester.com/report/Vendor+Landscape+Data+Preparation+Tools/-/E-RES128561
• Bloor Research: http://www.bloorresearch.com/technology/data-preparation-self-service/
• Informatica Demo: https://youtu.be/UBsUrJjggwc
• Alteryx Demo: https://youtu.be/LwO6VL1ScXk?t=1m25s
• SAS (data prep) Demo: https://youtu.be/9e_uxQBUPsQ?t=2m34s
• Trifacta Demo: https://www.youtube.com/watch?v=4VpW6oJ3cQI
• Paxata Demo: https://youtu.be/TR1smNYB4ks?t=18m6s
• Datawatch Demo: https://youtu.be/6hc_cafMsCs?t=2m22s
• Tableau (data prep) Demo: https://youtu.be/vlwfD9VyJME?t=20m49s
• Tamr Demo: https://youtu.be/PI_EqvIX45o
Kelly Stirman
VP Strategy
@kstirman
Want to try a new approach?
Contact me about the Dremio Beta Program
kelly@dremio.com



Editor's Notes

  • #5: BI assumes a single relational database, but…
      • Data in non-relational technologies
      • Data fragmented across many systems
      • Massive scale and velocity
  • #6: Data is the business, and…
      • Era of impatient smartphone natives
      • Rise of self-service BI
      • Accelerating time to market
    Because of the complexity of modern data and increasing demands for data, IT gets crushed in the middle:
      • Slow or non-responsive IT
      • “Shadow Analytics”
      • Data governance risk
      • Elusive data engineers
      • Immature software
      • Competing strategic initiatives
  • #7: Here’s the problem everyone is trying to solve today. You have consumers of data with their favorite tools. BI products like Tableau, PowerBI, Qlik, as well as data science tools like Python, R, Spark, and SQL. Then you have all your data, in a mix of relational, NoSQL, Hadoop, and cloud like S3. So how are you going to get the data to the people asking for it?
  • #8: Here’s how everyone tries to solve it: First you move the data out of the operational systems into a staging area, that might be Hadoop, or one of the cloud file systems like S3 or Azure Blob Store. You write a bunch of ETL scripts to move the data. These are expensive to write and maintain, and they’re fragile – when the sources change, the scripts have to change too.
  • #9: Then you move the data into a data warehouse. This could be Redshift, Teradata, Vertica, or other products. These are all proprietary, and they take DBA experts to make them work. And to move the data here you write another set of scripts. But what we see with many customers is that the performance here isn’t sufficient for their needs, and so …
  • #10: You build cubes and aggregation tables to get the performance your users are asking for. And to do this you build another set of scripts. In the end you’re left with something like this picture. You may have more layers, the technologies may be different, but you’re probably living with something like this. And nobody likes this: it’s expensive, the data movement is slow, it’s hard to change. But worst of all, you’re left with a dynamic where every time a consumer of the data wants a new piece of data:
      • They open a ticket with IT
      • IT begins an engineering project to build another set of pipelines, over several weeks or months
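
The fragility of hand-written ETL scripts mentioned in these notes ("when the sources change, the scripts have to change too") can be illustrated with a small sketch. Everything here is hypothetical: the CSV exports, the column names, and the function names are made up for illustration and do not come from any particular product.

```python
import csv
import io

# Hypothetical export from a source system.
OLD_EXPORT = "id,customer,amount\n1,alice,20.5\n2,bob,5.0\n"
# The source team later adds a column; nothing downstream was told.
NEW_EXPORT = "id,customer,region,amount\n1,alice,US,20.5\n2,bob,US,5.0\n"


def transform_by_position(csv_text):
    """Fragile: assumes the amount is always the third column."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    return [float(row[2]) for row in rows[1:]]  # skip the header row


def transform_by_name(csv_text):
    """Less fragile: looks the column up by its header name."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [float(row["amount"]) for row in reader]


def positional_breaks(csv_text):
    """Return True if the positional transform fails on this export."""
    try:
        transform_by_position(csv_text)
        return False
    except ValueError:
        return True


print(transform_by_position(OLD_EXPORT))
print(positional_breaks(NEW_EXPORT))
print(transform_by_name(NEW_EXPORT))
```

Looking columns up by name survives an added column, but it would still break if the source renamed `amount`. Each additional layer (warehouse load, cubes, extracts) carries its own copy of this kind of assumption, which is why a change at the source tends to turn into a ticket and an engineering project.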