Wes Holler, Chief Architect
Algebraix Data
JUST-IN-TIME ANALYTICS
AND THE NEED FOR AUTONOMOUS
DATABASE ADMINISTRATION
Data Algebra
• A novel application of Set Theory to Data Processing
• Applicable to many data models including SQL
Just-in-Time Analytics needs Autonomous Data Management
JIT Analytics and the Life of a
Modern Analyst
• Statistics & ML
• Short RTT
• Big Data systems
• Model business questions
This shouldn’t
require intimate
knowledge of how
underlying systems
work.
Spark for JIT Analytics: The Good
• Unified API
• Schema-on-read and Heterogeneous Data
Sources
• Declarative Languages/APIs and Catalyst
• Elastic Compute
Spark for JIT Analytics: The Bad
• Challenges for interactivity, efficiency, and
scalability
• Cost of creating and maintaining “glue
code”
• Data scientists and engineers are doing
DBA work
Database Management Responsibilities
• Capacity planning
• Configuration
• Performance tuning
• A billion other things
#manual
We will focus on the
performance and
tuning aspects
Improving and Maintaining
Performance
• Indexes
• Materialized views
• Pre-aggregate data
• Lots of configuration
Performance Tuning Strategies in
Spark
• Segment, cache, and checkpoint
• Configure cluster parameters
• spark.sql.shuffle.partitions
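The strategies above map onto a handful of PySpark calls. The following is a minimal sketch, not taken from the talk: the 128 MiB-per-partition target is a commonly quoted rule of thumb assumed here, and `suggest_shuffle_partitions`, `tune`, and the `/tmp/checkpoints` path are hypothetical. `spark.conf.set`, `DataFrame.cache`, `SparkContext.setCheckpointDir`, and `DataFrame.checkpoint` are existing PySpark calls.

```python
import math

# Assumed rule-of-thumb target size per shuffle partition (not from the talk).
TARGET_PARTITION_BYTES = 128 * 1024 * 1024


def suggest_shuffle_partitions(shuffle_bytes, target=TARGET_PARTITION_BYTES, minimum=1):
    """Pick a partition count so each shuffle partition is roughly `target` bytes."""
    return max(minimum, math.ceil(shuffle_bytes / target))


def tune(spark, df, shuffle_bytes, checkpoint_dir="/tmp/checkpoints"):
    """Apply the tuning steps to a SparkSession `spark` and a DataFrame `df`."""
    # Configure a cluster parameter: the number of partitions used by shuffles.
    spark.conf.set("spark.sql.shuffle.partitions",
                   str(suggest_shuffle_partitions(shuffle_bytes)))
    # Cache: keep the DataFrame around for repeated interactive queries.
    df = df.cache()
    # Checkpoint: persist to reliable storage and truncate the lineage.
    spark.sparkContext.setCheckpointDir(checkpoint_dir)
    return df.checkpoint()
```

Even in this small form, the right value of `shuffle_bytes` has to be estimated per query and per dataset, which is the manual step the talk argues against.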
What is the Problem with Manual
Tuning?
• Varies with the data (skew and scale), queries, and
hardware
• Often done through trial and error
• Problems are exacerbated in the JIT analytics case
• Shared resources
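To see why a single setting cannot fit all data, consider how skew interacts with the partition count. This is a toy model, not Spark itself: keys are routed by `key % n` as a stand-in for hash partitioning, and both key distributions are made up for illustration.

```python
from collections import Counter


def partition_sizes(keys, num_partitions):
    """Count rows per partition when each key is routed by key % num_partitions."""
    counts = Counter(key % num_partitions for key in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


def imbalance(keys, num_partitions):
    """Ratio of the largest partition to the mean size (1.0 means perfectly even)."""
    sizes = partition_sizes(keys, num_partitions)
    return max(sizes) / (sum(sizes) / num_partitions)


uniform = list(range(1000))              # every key appears once
skewed = [7] * 900 + list(range(100))    # one hot key dominates

for n in (4, 8, 16):
    print(n, imbalance(uniform, n), imbalance(skewed, n))
```

For the skewed input, adding partitions does not split the hot key, so the imbalance ratio grows as `n` grows, while the uniform input stays close to even.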
It is hard and time-consuming.
A Motivating Example for
Autonomous Data Management
Data Algebra
SQL-DA
Entity Store
Optimizer
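Q := \mathrm{filter}_{\dots}(\mathrm{filter}_{\dots}(A \nabla B))
A \nabla B = \{\, \{0 \mapsto \alpha, \dots, 3 \mapsto 42.0\} : 3,\ \{0 \mapsto \beta\} : 1,\ \dots \,\}
Columns of A \nabla B:
Position   Name   Type    Source
0          a      int     A
…          …      …       A
3          b      float   B
4          b2     int     B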
[Diagram sequence, labels only: queries over tables bar and baz pass through the stages To SQL-DA, Analyze, Optimize, and To SQL, alongside the Entity Store; one frame adds a Materialize View step.]
Complex Query Expressions
are Turned Into Look-ups
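One way to picture this: once a query result has been materialized, an equivalent later query becomes a dictionary look-up on a canonical form of the expression. The sketch below is a toy illustration under stated assumptions, not the Algebraix implementation. `ViewStore`, `canonical`, the tuple-based expression encoding, and the predicate `"b > 10"` are all hypothetical.

```python
def canonical(expr):
    """Normalize an expression tree so equivalent forms compare equal.

    Assumed encoding: ("table", name), ("filter", predicate, child),
    or ("join", left, right). Joins are treated as commutative here,
    so their operands are put in a fixed order.
    """
    op = expr[0]
    if op == "table":
        return expr
    if op == "filter":
        return ("filter", expr[1], canonical(expr[2]))
    if op == "join":
        left, right = sorted((canonical(expr[1]), canonical(expr[2])), key=repr)
        return ("join", left, right)
    raise ValueError(f"unknown operator: {op}")


class ViewStore:
    """Maps canonical expressions to previously computed results."""

    def __init__(self):
        self._views = {}
        self.computations = 0

    def evaluate(self, expr, compute):
        key = canonical(expr)
        if key not in self._views:       # miss: run the query and materialize it
            self._views[key] = compute(expr)
            self.computations += 1
        return self._views[key]          # hit: a plain look-up


def expensive(expr):
    # Stand-in for actually executing the query.
    return f"rows for {canonical(expr)}"


store = ViewStore()
q1 = ("filter", "b > 10", ("join", ("table", "bar"), ("table", "baz")))
q2 = ("filter", "b > 10", ("join", ("table", "baz"), ("table", "bar")))
first = store.evaluate(q1, expensive)
second = store.evaluate(q2, expensive)   # equivalent query, answered from the store
```

The two queries differ only in join order, so the second call does no computation. A real optimizer would recognize far more equivalences than this one commutativity rule.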
Benefits of Autonomous Data
Management
• Reduce query time
• Reduce computation resources required
• Allow the analyst to focus on problem
solving, not data management
Algebraix Inside:
An Implementation of ADM
The PySpark API (DataFrames and SQL) is shimmed.
Before:
from pyspark import *
from pyspark.sql import SQLContext
conf = SparkConf()
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
# Read one name per line; wrap each line in a tuple so a schema can be applied
names = sc.textFile("people.txt").map(lambda line: (line,))
namesDF = sqlContext.createDataFrame(names, ["name"])
namesDF.registerTempTable("names")
sqlContext.sql("""
SELECT * FROM names
""").show()
After:
# Only the import changes; the rest of the program is identical
from aqaspark import *
conf = SparkConf()
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
names = sc.textFile("people.txt").map(lambda line: (line,))
namesDF = sqlContext.createDataFrame(names, ["name"])
namesDF.registerTempTable("names")
sqlContext.sql("""
SELECT * FROM names
""").show()
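How an API can be shimmed so that only the import changes can be sketched with a delegating wrapper. This illustrates the general technique only and makes no claim about how `aqaspark` works internally. `ShimmedSQLContext`, `use_view`, and the `names_view` table are hypothetical, and `FakeSQLContext` stands in for a real `SQLContext` so the example runs without Spark.

```python
class ShimmedSQLContext:
    """Wraps a SQL context, intercepting sql() and delegating everything else."""

    def __init__(self, inner, rewrite):
        self._inner = inner
        self._rewrite = rewrite   # callable: query text -> possibly rewritten query text

    def sql(self, query):
        # Interception point: the shim can analyze or rewrite the query here.
        return self._inner.sql(self._rewrite(query))

    def __getattr__(self, name):
        # Called only for attributes not defined on the shim; pass them through.
        return getattr(self._inner, name)


class FakeSQLContext:
    """Stand-in for pyspark.sql.SQLContext, for illustration only."""

    def sql(self, query):
        return f"executed: {query}"

    def createDataFrame(self, rows):
        return list(rows)


def use_view(query):
    # Hypothetical rewrite: point the query at a materialized view.
    return query.replace("FROM names", "FROM names_view")


ctx = ShimmedSQLContext(FakeSQLContext(), use_view)
result = ctx.sql("SELECT * FROM names")
rows = ctx.createDataFrame([("Ann",), ("Bob",)])
```

Because unknown attributes fall through to the wrapped object, calling code keeps using the familiar method names, while `sql()` gives the shim one place to apply optimizations.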
Wrap Up
Autonomous Data
Management makes Spark
great for SQL analytics.
Thank You.
@wes_holler
wholler@algebraixdata.com
www.algebraixdata.com
tstraub@algebraixdata.com
