Analysis of Data Science Software, 2020
Goal of the Analysis
The goal was to identify the closest competitors, functionally speaking,
in the data science software industry.
Our hypothesis was that text analysis and novel nearest neighbor
algorithms could distill text-based reports into a useful summary
visualization of the products in the space.
Step 1: Text Analysis and Data Transformation
Two related reports covering the range of product capabilities across
four use cases were used for source data.
The text of the source reports was converted to numeric scores, and a
matrix was populated with quantitative values ranging from 1 to 5.
Scope Adjustment
The source reports did not provide a full breakdown of the sub
dimensions within the four use cases. As a result, many fields in the
matrix had missing values.
The estimated completion time for a full analysis on all four use cases
exceeded cost benefit metrics. The goal was narrowed to focus on the
first use case, which had four sub dimensions: access, preparation,
exploration, and automation.
Step 2: Imputing Values and Adjusting Imputed Values
A value of 3, the midpoint of the 1 to 5 scale, was set as the estimate
for all missing values.
Scores for each sub dimension were averaged into a total for the use
case. The totals in the cqtinf model score table came close to the
totals given in the source reports for the use case.
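The imputation and averaging described above can be sketched as follows. This is a minimal illustration, not the cqtinf model itself: the product names and scores are hypothetical, and the function names `impute` and `use_case_total` are introduced here only for the example.

```python
from statistics import mean

SUB_DIMENSIONS = ("access", "preparation", "exploration", "automation")
DEFAULT_SCORE = 3  # estimate used for every missing value


def impute(row, default=DEFAULT_SCORE):
    """Return the four sub dimension scores, filling missing ones with the default."""
    return {
        dim: row[dim] if row.get(dim) is not None else default
        for dim in SUB_DIMENSIONS
    }


def use_case_total(row):
    """Average the four (imputed) sub dimension scores into one use case score."""
    filled = impute(row)
    return mean(filled[dim] for dim in SUB_DIMENSIONS)


# Hypothetical scores, not taken from the source reports.
scores = {
    "Product A": {"access": 4, "preparation": 5, "exploration": None, "automation": 4},
    "Product B": {"access": None, "preparation": 2, "exploration": 3},
}

totals = {name: use_case_total(row) for name, row in scores.items()}
print(totals)
```

Comparing such totals against the totals published in the source reports is what would reveal how far the imputed values drift from the originals.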
Minor adjustments to sub dimension values brought 12 of the 16
product scores into very close alignment with the source scores,
without raising concerns about overfitting.
Alteryx (AYX) was chosen as the fixed variable for further analysis.
Step 3: Analysis Model Outputs
The cqtinf model provided two outputs:
1] A short list of AYX's closest competitors, based on the number of
times a competitor is within range, where frequency represents
closeness.
2] An input for a complex topological / nearest neighbor data analysis,
based on actual distance measures of competitors.

Dataiku   Datawatch   TIBCO   SAS VDMML   KNIME
4         3.3         3.3     3           3
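One possible reading of the frequency output is sketched below. The source does not define "within range", so the tolerance of 1.0, the per-dimension comparison, and all scores here are assumptions made for illustration. The fractional values in the table above suggest the actual cqtinf measure is not a plain count like this one.

```python
def within_range_count(anchor, other, tolerance=1.0):
    """Count the sub dimensions where `other` scores within `tolerance` of `anchor`."""
    return sum(1 for dim in anchor if abs(anchor[dim] - other[dim]) <= tolerance)


def shortlist(scores, anchor_name, tolerance=1.0, size=5):
    """Rank competitors by how often they fall within range of the anchor product."""
    anchor = scores[anchor_name]
    counts = {
        name: within_range_count(anchor, row, tolerance)
        for name, row in scores.items()
        if name != anchor_name
    }
    # Highest frequency first; ties broken alphabetically for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:size]


# Hypothetical scores, not taken from the source reports.
scores = {
    "AYX": {"access": 4, "preparation": 5, "exploration": 4, "automation": 3},
    "P1": {"access": 4, "preparation": 4, "exploration": 4, "automation": 3},
    "P2": {"access": 2, "preparation": 5, "exploration": 3, "automation": 1},
    "P3": {"access": 1, "preparation": 1, "exploration": 1, "automation": 1},
}

print(shortlist(scores, "AYX", size=2))
```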
Step 4: Nearest Neighbor Conversion
To perform this nearest neighbor analysis, the matrix score values had
to be transformed into [x, y] grid coordinates which could be plotted on
a graph. cqtinf heuristics provided the conversion.
Once the modeling was completed, the full set of data science and
machine learning (DSML) software products could be positioned on a
grid for summary visualization.
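The cqtinf conversion heuristics are not described in the source. As a stand-in, the sketch below shows one simple way four scores could be mapped to [x, y] coordinates: each sub dimension is given an assumed anchor point on the grid, and a product is placed at the score-weighted average of those anchors. The anchor positions and the name `to_grid` are hypothetical.

```python
# Assumed anchor positions, one per sub dimension (not from the source).
AXIS_ANCHORS = {
    "access": (0.0, 1.0),
    "preparation": (1.0, 0.0),
    "exploration": (0.0, -1.0),
    "automation": (-1.0, 0.0),
}


def to_grid(row):
    """Place a product at the score-weighted average of the four anchor points."""
    total = sum(row[dim] for dim in AXIS_ANCHORS)  # positive, since scores are 1 to 5
    x = sum(row[dim] * AXIS_ANCHORS[dim][0] for dim in AXIS_ANCHORS) / total
    y = sum(row[dim] * AXIS_ANCHORS[dim][1] for dim in AXIS_ANCHORS) / total
    return (x, y)


# Hypothetical product that is strong in access and preparation.
example = {"access": 5, "preparation": 5, "exploration": 1, "automation": 1}
print(to_grid(example))
```

With this mapping, a product with equal scores on all four sub dimensions lands at the origin, while a product that is strong in two adjacent sub dimensions lands between their anchors.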
Step 5: Selection of Graphic Style
Four dimensions had to be represented. A layout was designed from
scratch to support a simple representation in which product nodes
could straddle two dimensions without any crisscrossing of
relationships.
Comparing Two Model Outputs
The resulting topological data analysis (TDA) map varied slightly
from the simpler frequency table.
Dataiku, #1 in the frequency table, fell just outside the map inclusion
criteria. Expanding the cqtinf model’s ‘top 5’ constraint from 5 to 6
would result in Dataiku being on the map.
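The effect of the 'top 5' constraint can be illustrated with a small nearest neighbor sketch. It assumes plain Euclidean distance on the grid and uses hypothetical coordinates; the actual cqtinf distance measure is not specified in the source.

```python
import math


def nearest(points, anchor_name, k=5):
    """Return the k products closest to the anchor by Euclidean distance."""
    anchor = points[anchor_name]
    distances = [
        (math.dist(anchor, xy), name)
        for name, xy in points.items()
        if name != anchor_name
    ]
    return [name for _, name in sorted(distances)[:k]]


# Hypothetical grid coordinates, not taken from the source map.
points = {
    "AYX": (0, 0),
    "P1": (1, 0),
    "P2": (0, 2),
    "P3": (3, 0),
    "P4": (0, 4),
    "P5": (5, 0),
    "P6": (0, 6),
    "P7": (7, 0),
}

print(nearest(points, "AYX", k=5))
print(nearest(points, "AYX", k=6))
```

Raising `k` from 5 to 6 admits the sixth-closest product, which is the kind of change that would bring a product sitting just outside the cutoff onto the map.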
According to the map, RapidMiner appears to be within shortlist
distance of AYX, which is inconsistent with other arguments.
The cqtinf node positioning heuristic delivers maps quickly, and in
theory these visualizations are explanatory. Spending more time on
additional calculations might repair this ‘problem.’ Because the model
is transparent, however, analysts can explain strengths or weaknesses
in both the underlying data and the positioning algorithm. Even if some
outputs of the model are not perfect, they are still useful.
