Review on
Load Distribution of Analytical Query Workloads
For Database Cluster Architectures
Research Paper
by
Thomas Phan
Yahoo!, Inc.,
Sunnyvale, CA, USA
Wen-Syan Li
IBM Almaden Research Center,
San Jose, CA, USA
Introduction
Database Cluster
- A group of databases
Ex: a company may have two identical database systems, one as a production
system for online transactions and one as a hot standby, plus some
additional database systems that hold a subset of the base tables for various
day-to-day applications. These systems have dedicated missions during the
day, but they may be left idle or under-utilized in the evening, when
batch workloads are processed. In such clusters, query workloads can be
distributed across the different servers for better performance.
Introduction Cont.
• Online query workloads - touch a relatively small region of the data on
disk. Such queries are said to be selective.
• Analytical query workloads - touch large volumes of data on disk.
• Online Analytical Processing (OLAP)
OLAP is an approach to answering multi-dimensional analytical (MDA)
queries swiftly. OLAP is part of the broader category of business
intelligence, which also encompasses relational databases, report writing
and data mining.
MQTs (defined below) are required in OLAP applications because the query
workloads tend to have complex structures and syntax.
• Materialized query tables (MQT)
An MQT is a table whose definition is based on the result of a query and
whose data consists of pre-computed results taken from
one or more tables.
Idea of the Research
Framework for coordinating and optimizing execution of OLAP query
workloads across a cluster of database servers with shared-nothing
architecture.
For a database cluster, this optimization is achieved when
the maximum completion time of the workloads across all
database servers is minimized.
Completion time at each database server =
MQT and index building time + query workload execution time
Expected outcome: a server-MQT-query mapping under which the
workload execution time is minimized
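The objective can be written compactly as a min-max problem. The symbols are notation introduced here for illustration and are not taken from the paper: S is the set of servers, B_s is the MQT and index building time on server s, and E_s is the query workload execution time on server s.

```latex
\[
  T_s = B_s + E_s ,
  \qquad
  \min_{\text{mapping}} \; \max_{s \in S} \; T_s
\]
```

Minimizing the largest T_s, rather than the sum, reflects that the workload is finished only when the slowest server finishes.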
Current Scenario
Load is distributed by dividing the workload into multiple sub-workloads
and assigning each sub-workload to a database server in some greedy manner,
such as round-robin distribution.
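A minimal sketch of round-robin distribution, to show why it can leave servers unbalanced. The function names and the per-query costs are hypothetical, chosen only for illustration.

```python
def round_robin(queries, num_servers):
    """Assign query i to server i mod num_servers, ignoring cost and MQT placement."""
    assignment = [[] for _ in range(num_servers)]
    for i, query in enumerate(queries):
        assignment[i % num_servers].append(query)
    return assignment


def makespan(assignment, cost):
    """Completion time of the busiest server."""
    return max(sum(cost[q] for q in server) for server in assignment)


# Hypothetical execution costs: two expensive queries, two cheap ones.
cost = {"q1": 10, "q2": 1, "q3": 10, "q4": 1}
plan = round_robin(list(cost), 2)
print(plan, makespan(plan, cost))
```

With these costs both expensive queries land on the same server, so one server carries 20 units of work and the other carries 2, although a split of 11 and 11 exists.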
Problems Identified
• Queries routed to a database server may not be collocated with the
MQTs they need
• Some MQTs may not fit on a database server that has limited disk space
• Some sub-workloads may be more expensive to execute than others, so
some servers may be idle while other servers are still busy
• Some servers may be more powerful than others
Proposed Methodology
Proposed Methodology Cont.
Key component - Scheduler
(operates on the queries and the MQTs produced by an
existing MQT Advisor product)
The scheduler distributes the workload's queries and MQTs according to
query-to-server and MQT-to-server mappings.
Distribution problem
Number of different distribution combinations = [formula not preserved in the source]
• Even for small parameter values, the solution space grows exponentially,
making an exhaustive search infeasible
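The slide's own formula is not reproduced here. As a rough illustration of the growth, the sketch below counts mappings under a simplifying assumption that is not taken from the paper: each query and each MQT is placed on exactly one server. The function name is hypothetical.

```python
def num_mappings(num_servers, num_queries, num_mqts):
    """Count placements when every query and every MQT goes to exactly one server.

    Assumption for illustration only: no replication of MQTs across servers.
    """
    return num_servers ** (num_queries + num_mqts)


# 4 servers, 10 queries, 5 MQTs
print(num_mappings(4, 10, 5))  # 4**15 = 1073741824
```

Allowing an MQT to be materialized on several servers at once would make the space larger still.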
Approach
To search the solution space, a Genetic Algorithm (GA)
search heuristic is used, which finds a near-optimal mapping of
• queries-to-servers
• MQTs-to-servers and also
• query-to-MQTs requirement mapping
Other options that could have been used
• Tabu search
• Simulated annealing
• Steepest-ascent hill climbing
• The justification for using GA as the search algorithm
instead of these other methods is not sufficient.
Genetic Algorithm Design
Chromosome - a specific mapping of queries-to-servers and MQTs-to-servers,
along with its associated workload execution time
Best organism - the mapping with the lowest workload execution time
Previous work indicates that using two separate sets of chromosomes for a
multi-objective optimization has a hard time converging.
To overcome this, a single unified chromosome is used that represents
collocations of queries and MQTs on the same server.
Genetic Algorithm Crossover
GA Evaluation Function
• Initialize the execution times for all servers in the chromosome.
• Check which MQTs have been materialized and on which servers, and
add the materialization times to the respective servers.
• Loop over all collocations (the key loop).
• If a query needs a particular MQT, the query's execution time with the MQT is
added to the server's execution time when the query and MQT are collocated. If they
are not collocated, the query's execution time without the MQT is
added instead.
• The function returns the maximum execution time among the servers.
• The chromosome is then ranked by this value, and the search moves on to the next generation.
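The evaluation steps above can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: each query is assumed to need at most one MQT, and the parameter names and dictionary layout are hypothetical.

```python
def evaluate(query_server, mqt_servers, build_time, needs,
             time_with, time_without, num_servers):
    """Return the maximum completion time across servers (lower is better).

    query_server: query -> server index it runs on
    mqt_servers:  MQT -> set of server indices where it is materialized
    build_time:   MQT -> materialization time
    needs:        query -> required MQT, or None (assumed: at most one per query)
    time_with / time_without: query -> execution time with / without its MQT
    """
    load = [0.0] * num_servers

    # Charge each server for every MQT materialized on it.
    for mqt, servers in mqt_servers.items():
        for s in servers:
            load[s] += build_time[mqt]

    # Charge each query at the fast rate only if its MQT is collocated.
    for query, s in query_server.items():
        mqt = needs.get(query)
        if mqt is not None and s in mqt_servers.get(mqt, ()):
            load[s] += time_with[query]
        else:
            load[s] += time_without[query]

    return max(load)


# Hypothetical workload: both queries need m1, but m1 exists only on server 0.
score = evaluate(
    query_server={"q1": 0, "q2": 1},
    mqt_servers={"m1": {0}},
    build_time={"m1": 5},
    needs={"q1": "m1", "q2": "m1"},
    time_with={"q1": 2, "q2": 2},
    time_without={"q1": 20, "q2": 20},
    num_servers=2,
)
print(score)
```

In this example server 0 costs 5 + 2 = 7, while server 1 runs q2 without its MQT and costs 20, so the mapping is scored by the slower server.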
Mapping Optimization
The developed Genetic Algorithm is compared with an exhaustive search and
three standard greedy algorithms.
The paper shows that GA is a better solution for finding a near-optimal
mapping.
Narrowing the research down to OLAP query workloads and the use of
MQTs is not clearly justified.
There are also some problems in the method followed to find the better
solution.
Literature Review
1. Need to justify the single-server approach vs. the database cluster approach
• Only describes the use of QCC (Query Cost Calibrator) in federated systems
and not in database clusters
2. Not enough evidence to establish the relevance of GA to the optimization
problem
e.g. Evolutionary Optimization of File Assignment for a Large-Scale
Video-on-Demand System by Jun Guo
3. Reasons for the choice of scheduler algorithms for the experiment are not clearly
stated
4. Purpose of using OLAP is not identified
• OLAP is used because MQTs are required in OLAP
5. No review of how the greedy algorithms were constructed
e.g. Valuated matroids: a new look at the greedy algorithm by Andreas W.M.
Dress, Walter Wenzel
What's new on this topic?
• Query Performance Prediction for Analytical Workloads by Jennie Duggan
Uses a combination of isolated and concurrent query execution samples, as
well as new query workload features and metrics that can capture how
different query classes behave for various levels of resource availability and
contention.
• Dynamic Prioritization of Database Queries by Sivaramakrishnan
Narayanan, Florian Waas
Presents a mechanism that continuously determines and re-computes the ideal
target velocity of concurrent database processes based on their run-time
statistics to achieve prioritization. In this scheme, every process
autonomously adjusts its resource
consumption using basic control theory principles.
Sources and data collection
• 5 types of scheduler algorithms
GA
Exhaustive
Greedy1-MQTs-then-queries
Greedy2-queries-then-MQTs
Greedy3-no-MQTs
• 2 data sources
workload produced by the standard TPC-H benchmark generator
synthetic benchmark based on the queries and MQTs from the TPC
benchmarks
Tests were implemented in standard C++ and run on an off-the-shelf desktop
computer running Red Hat Linux.
The GA was run for 100 generations with a population size of 100 chromosomes.
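To make the GA parameters concrete, here is a minimal GA loop with 100 generations and a population of 100. It is a sketch on a simplified problem and not the paper's C++ implementation: MQTs are left out, a chromosome is just a list of server indices (one per query), and the selection, crossover and mutation operators are assumptions chosen for brevity.

```python
import random


def fitness(chrom, cost, num_servers):
    """Makespan of a query-to-server assignment (lower is better)."""
    load = [0] * num_servers
    for query, server in enumerate(chrom):
        load[server] += cost[query]
    return max(load)


def run_ga(cost, num_servers, generations=100, pop_size=100,
           mutation_rate=0.05, seed=0):
    """Truncation selection + single-point crossover + per-gene mutation.

    Requires at least 2 queries and pop_size >= 4.
    """
    rng = random.Random(seed)
    n = len(cost)
    pop = [[rng.randrange(num_servers) for _ in range(n)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda c: fitness(c, cost, num_servers))
        survivors = pop[: pop_size // 2]      # best half survives unchanged
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, n)         # single-point crossover
            child = a[:cut] + b[cut:]
            for i in range(n):                # random per-gene mutation
                if rng.random() < mutation_rate:
                    child[i] = rng.randrange(num_servers)
            children.append(child)
        pop = survivors + children
    return min(pop, key=lambda c: fitness(c, cost, num_servers))


# Hypothetical per-query costs, two servers.
cost = [10, 1, 10, 1, 5, 5]
best = run_ga(cost, 2)
print(best, fitness(best, cost, 2))
```

Because the best half of each generation is carried over unchanged, the best makespan never gets worse from one generation to the next.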
[Figure: workload running time with increasing number of queries; exhaustive search and GA]
Comparison against exhaustive search
shows that GA gives results very close
to those of exhaustive search, which suggests
that its mappings are near-optimal.
[Figure: workload running time with increasing number of servers; GA, exhaustive search, Greedy1, Greedy2, Greedy3]
[Figure: workload running time with increasing number of queries]
Issues
• Infinite disk space is assumed,
but in practice disk space is limited.
• Tested on homogeneous servers only;
behavior on heterogeneous servers is not shown.
• Scalability was tested only up to a limited number of queries;
behavior for larger workloads is unknown.
• The suitability of GA is emphasized from the very beginning,
but the results show that Greedy2 also performs well.
Conclusion
- Problem identification is successful
- Conclusions appear to have been drawn before the results were obtained
- Further improvement of the algorithms is needed
Thank You !


Editor's Notes

  • #4: You can think of an MQT as a kind of materialized view. Both views and MQTs are defined on the basis of a query. The query on which a view is based is run whenever the view is referenced; however, an MQT actually stores the query results as data, and you can work with the data that is in the MQT instead of the data that is in the underlying tables.