Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/1
Outline
• Introduction
• Background
• Distributed Database Design
• Database Integration
• Semantic Data Control
• Distributed Query Processing
➡ Overview
➡ Query decomposition and localization
➡ Distributed query optimization
• Multidatabase Query Processing
• Distributed Transaction Management
• Data Replication
• Parallel Database Systems
• Distributed Object DBMS
• Peer-to-Peer Data Management
• Web Data Management
• Current Issues
Query Processing in a DDBMS
high-level user query → query processor → low-level data manipulation commands for D-DBMS
Query Processing Components
• Query language that is used
➡ SQL: “intergalactic dataspeak”
• Query execution methodology
➡ The steps that one goes through in executing high-level (declarative) user queries
• Query optimization
➡ How do we determine the “best” execution plan?
• We assume a homogeneous D-DBMS
Selecting Alternatives
SELECT ENAME
FROM EMP, ASG
WHERE EMP.ENO = ASG.ENO
AND RESP = "Manager"
Strategy 1
Π_ENAME(σ_{RESP="Manager" ∧ EMP.ENO=ASG.ENO}(EMP × ASG))
Strategy 2
Π_ENAME(EMP ⋈_ENO (σ_{RESP="Manager"}(ASG)))
Strategy 2 avoids the Cartesian product, so it may be “better”
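The two strategies can be compared on a toy instance. The sketch below is only an illustration: the tuples in EMP and ASG are hypothetical (the slides give no data), and the function names are made up for this example. It shows that both strategies return the same names while producing very different intermediate result sizes.

```python
from itertools import product

# Hypothetical sample data; not taken from the slides.
EMP = [("E1", "J. Doe"), ("E2", "M. Smith"), ("E3", "A. Lee"), ("E4", "J. Miller")]  # (ENO, ENAME)
ASG = [("E1", "Manager"), ("E2", "Analyst"), ("E3", "Engineer"), ("E4", "Manager")]  # (ENO, RESP)


def strategy_1(emp, asg):
    """Cartesian product first, then selection, then projection on ENAME."""
    cartesian = list(product(emp, asg))
    selected = [(e, a) for e, a in cartesian if e[0] == a[0] and a[1] == "Manager"]
    return {e[1] for e, _ in selected}, len(cartesian)


def strategy_2(emp, asg):
    """Selection on ASG first, then join on ENO, then projection on ENAME."""
    managers = [a for a in asg if a[1] == "Manager"]
    manager_enos = {a[0] for a in managers}
    joined = [e for e in emp if e[0] in manager_enos]
    return {e[1] for e in joined}, len(managers)


names_1, intermediate_1 = strategy_1(EMP, ASG)
names_2, intermediate_2 = strategy_2(EMP, ASG)
print(names_1 == names_2, intermediate_1, intermediate_2)  # True 16 2
```

On this instance the Cartesian product materializes 16 pairs, whereas the selection in Strategy 2 leaves only 2 tuples to join.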
What is the Problem?
• Fragmentation and allocation
➡ Site 1: ASG1 = σ_{ENO≤"E3"}(ASG)
➡ Site 2: ASG2 = σ_{ENO>"E3"}(ASG)
➡ Site 3: EMP1 = σ_{ENO≤"E3"}(EMP)
➡ Site 4: EMP2 = σ_{ENO>"E3"}(EMP)
➡ Site 5: Result
• Strategy 1 (select and join at the data sites)
➡ Site 1: ASG'1 = σ_{RESP="Manager"}(ASG1), sent to Site 3
➡ Site 2: ASG'2 = σ_{RESP="Manager"}(ASG2), sent to Site 4
➡ Site 3: EMP'1 = EMP1 ⋈_ENO ASG'1, sent to Site 5
➡ Site 4: EMP'2 = EMP2 ⋈_ENO ASG'2, sent to Site 5
➡ Site 5: result = EMP'1 ∪ EMP'2
• Strategy 2 (ship all fragments to the result site)
➡ Sites 1 to 4 send ASG1, ASG2, EMP1, EMP2 to Site 5
➡ Site 5: result = (EMP1 ∪ EMP2) ⋈_ENO σ_{RESP="Manager"}(ASG1 ∪ ASG2)
Cost of Alternatives
• Assume
➡ size(EMP) = 400, size(ASG) = 1000
➡ tuple access cost = 1 unit; tuple transfer cost = 10 units
➡ 20 ASG tuples satisfy RESP = "Manager", 10 in each ASG fragment (implied by the (10+10) terms below)
• Strategy 1
➡ produce ASG': (10+10) × tuple access cost = 20
➡ transfer ASG' to the sites of EMP: (10+10) × tuple transfer cost = 200
➡ produce EMP': (10+10) × tuple access cost × 2 = 40
➡ transfer EMP' to result site: (10+10) × tuple transfer cost = 200
Total Cost = 460
• Strategy 2
➡ transfer EMP to site 5: 400 × tuple transfer cost = 4,000
➡ transfer ASG to site 5: 1000 × tuple transfer cost = 10,000
➡ produce ASG': 1000 × tuple access cost = 1,000
➡ join EMP and ASG': 400 × 20 × tuple access cost = 8,000
Total Cost = 23,000
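The arithmetic on this slide can be checked with a short script. This is a sketch under the slide's own figures; the constant and function names are illustrative, and the count of 10 matching ASG tuples per fragment is an assumption read off the (10+10) terms.

```python
TUPLE_ACCESS = 1       # cost units per tuple accessed (from the slide)
TUPLE_TRANSFER = 10    # cost units per tuple transferred (from the slide)
EMP_SIZE = 400         # size(EMP) from the slide
ASG_SIZE = 1000        # size(ASG) from the slide
MANAGERS_PER_FRAGMENT = 10  # assumption implied by the (10+10) terms


def cost_strategy_1():
    """Select and join at the data sites, then ship the small results."""
    managers = 2 * MANAGERS_PER_FRAGMENT
    produce_asg = managers * TUPLE_ACCESS
    ship_asg = managers * TUPLE_TRANSFER
    produce_emp = managers * TUPLE_ACCESS * 2
    ship_emp = managers * TUPLE_TRANSFER
    return produce_asg + ship_asg + produce_emp + ship_emp


def cost_strategy_2():
    """Ship both relations to the result site and evaluate the query there."""
    managers = 2 * MANAGERS_PER_FRAGMENT
    ship_emp = EMP_SIZE * TUPLE_TRANSFER
    ship_asg = ASG_SIZE * TUPLE_TRANSFER
    produce_asg = ASG_SIZE * TUPLE_ACCESS
    join = EMP_SIZE * managers * TUPLE_ACCESS
    return ship_emp + ship_asg + produce_asg + join


print(cost_strategy_1(), cost_strategy_2())  # 460 23000
```

Under these assumptions Strategy 2 is about 50 times more expensive, almost entirely because of tuple transfers and the unindexed join at the result site.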
Query Optimization Objectives
• Minimize a cost function
I/O cost + CPU cost + communication cost
These components may carry different weights in different distributed environments
• Wide area networks
➡ communication cost may dominate or vary widely
✦ low bandwidth
✦ low speed
✦ high protocol overhead
• Local area networks
➡ communication cost is less dominant
➡ total cost function should be considered
• Can also maximize throughput
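A weighted form of the cost function makes the WAN/LAN distinction concrete. The sketch below is an assumption-laden illustration: the weights, the two candidate plans, and their component costs are hypothetical numbers chosen only to show that the preferred plan can change with the environment.

```python
from dataclasses import dataclass


@dataclass
class PlanCost:
    io: float    # I/O cost component
    cpu: float   # CPU cost component
    comm: float  # communication cost component


def total_cost(plan, w_io=1.0, w_cpu=1.0, w_comm=1.0):
    """Weighted sum: w_io * I/O + w_cpu * CPU + w_comm * communication."""
    return w_io * plan.io + w_cpu * plan.cpu + w_comm * plan.comm


def best_plan(plans, **weights):
    """Return the name of the plan with the lowest weighted total cost."""
    return min(plans, key=lambda name: total_cost(plans[name], **weights))


# Hypothetical candidate plans; the numbers are illustrative only.
plans = {
    "ship_whole_relations": PlanCost(io=100, cpu=50, comm=400),
    "filter_then_ship": PlanCost(io=300, cpu=80, comm=40),
}

print(best_plan(plans, w_comm=10.0))  # WAN-like weighting
print(best_plan(plans, w_comm=0.1))   # LAN-like weighting
```

With a heavy communication weight the plan that ships less data wins; with a light one the plan that does less local work wins, which is why the total cost function matters on a LAN.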
Complexity of Relational Operations
•Assume
➡ relations of cardinality n
➡ sequential scan
Operation                                          Complexity
Select, Project (without duplicate elimination)    O(n)
Project (with duplicate elimination), Group        O(n × log n)
Join, Semi-join, Division, Set Operators           O(n × log n)
Cartesian Product                                  O(n²)
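One way to see where the O(n × log n) bound for joins comes from is a sort-merge join: both inputs are sorted and then merged in one pass. The following is a minimal sketch, not the algorithm of any particular system; the function name and tuple layout are assumptions. Note that the bound covers the sorting and merging work; with many duplicate join keys the output itself can be larger.

```python
def sort_merge_join(left, right, key_left=0, key_right=0):
    """Equi-join two lists of tuples by sorting both and merging."""
    left = sorted(left, key=lambda t: t[key_left])      # O(n log n)
    right = sorted(right, key=lambda t: t[key_right])   # O(n log n)
    result, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key_left], right[j][key_right]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right tuples sharing this key.
            j_end = j
            while j_end < len(right) and right[j_end][key_right] == lk:
                j_end += 1
            # Pair every left tuple with this key against that run.
            while i < len(left) and left[i][key_left] == lk:
                for r in right[j:j_end]:
                    result.append(left[i] + r)
                i += 1
            j = j_end
    return result


emp = [("E2", "Smith"), ("E1", "Doe")]                                 # hypothetical (ENO, ENAME)
asg = [("E1", "Manager"), ("E3", "Analyst"), ("E1", "Engineer")]       # hypothetical (ENO, RESP)
joined = sort_merge_join(emp, asg)
print(joined)
```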
Query Optimization Issues – Types of Optimizers
• Exhaustive search
➡ Cost-based
➡ Optimal
➡ Combinatorial complexity in the number of relations
• Heuristics
➡ Not optimal
➡ Regroup common sub-expressions
➡ Perform selection, projection first
➡ Replace a join by a series of semijoins
➡ Reorder operations to reduce intermediate relation size
➡ Optimize individual operations
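The "combinatorial complexity" of exhaustive search can be made tangible by counting join orders. The sketch below uses the standard counts for n relations when cross products are allowed: n! left-deep orders and (2(n−1))!/(n−1)! bushy trees. These formulas are not on the slide, and the function names are illustrative.

```python
from math import factorial


def left_deep_orders(n):
    """Number of left-deep join orders over n relations: n!."""
    return factorial(n)


def bushy_trees(n):
    """Number of bushy join trees over n relations: (2(n-1))! / (n-1)!."""
    return factorial(2 * (n - 1)) // factorial(n - 1)


for n in (2, 4, 6, 8, 10):
    print(n, left_deep_orders(n), bushy_trees(n))
```

Already at 10 relations there are several million left-deep orders and billions of bushy trees, which is why heuristics that prune or avoid the search are attractive despite not guaranteeing an optimal plan.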
Query Optimization Issues – Optimization Granularity
• Single query at a time
➡ Cannot use common intermediate results
• Multiple queries at a time
➡ Efficient if many similar queries
➡ Decision space is much larger
Query Optimization Issues – Optimization Timing
•Static
➡ Compilation ➔ optimize prior to the execution
➡ Difficult to estimate the size of the intermediate results ⇒ error propagation
➡ Can amortize over many executions
➡ R*
•Dynamic
➡ Run time optimization
➡ Exact information on the intermediate relation sizes
➡ Have to reoptimize for multiple executions
➡ Distributed INGRES
•Hybrid
➡ Compile using a static algorithm
➡ If the error in estimated sizes > threshold, reoptimize at run time
➡ Mermaid
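The hybrid rule can be sketched as a check performed after each step of a statically compiled plan. This is a simplified illustration only: the relative-error measure, the default threshold of 0.5, and the step format are assumptions, and a real system would re-plan the remaining steps rather than just record the decision.

```python
def needs_reoptimization(estimated, actual, threshold=0.5):
    """True if the relative error of the size estimate exceeds the threshold."""
    if estimated <= 0:
        return actual > 0
    return abs(actual - estimated) / estimated > threshold


def run_hybrid(plan_steps, threshold=0.5):
    """plan_steps: list of (step name, estimated size, actual size).

    Returns, per step, whether execution would continue with the static
    plan or trigger run-time reoptimization of the remaining steps.
    """
    decisions = []
    for name, estimated, actual in plan_steps:
        if needs_reoptimization(estimated, actual, threshold):
            decisions.append((name, "reoptimize"))
        else:
            decisions.append((name, "continue"))
    return decisions


# Hypothetical estimates and observed sizes.
steps = [("select ASG", 100, 110), ("join EMP", 200, 900)]
print(run_hybrid(steps))
```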
Query Optimization Issues – Statistics
•Relation
➡ Cardinality
➡ Size of a tuple
➡ Fraction of tuples participating in a join with another relation
•Attribute
➡ Cardinality of domain
➡ Actual number of distinct values
•Common assumptions
➡ Independence between different attribute values
➡ Uniform distribution of attribute values within their domain
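Under the two common assumptions, result sizes can be estimated from the statistics listed above. The sketch below uses the textbook-style rules: an equality predicate on an attribute with d distinct values has selectivity 1/d (uniformity), and selectivities of predicates on different attributes multiply (independence). The numbers of distinct values are hypothetical.

```python
from math import prod


def equality_selectivity(distinct_values):
    """Selectivity of attribute = constant, assuming a uniform distribution."""
    return 1.0 / distinct_values


def estimate_cardinality(relation_cardinality, *selectivities):
    """Estimated result size, assuming the predicates are independent."""
    return relation_cardinality * prod(selectivities)


# Hypothetical statistics: ASG has 1000 tuples, RESP has 50 distinct values,
# and a second attribute has 4 distinct values.
sel_resp = equality_selectivity(50)
sel_other = equality_selectivity(4)
single = estimate_cardinality(1000, sel_resp)
combined = estimate_cardinality(1000, sel_resp, sel_other)
print(round(single), round(combined))
```

When attribute values are skewed or correlated, these estimates can be far off, and the errors compound through a multi-operator plan.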
Query Optimization Issues – Decision Sites
•Centralized
➡ Single site determines the “best” schedule
➡ Simple
➡ Need knowledge about the entire distributed database
•Distributed
➡ Cooperation among sites to determine the schedule
➡ Need only local information
➡ Cost of cooperation
•Hybrid
➡ One site determines the global schedule
➡ Each site optimizes the local subqueries
Query Optimization Issues – Network Topology
• Wide area networks (WAN) – point-to-point
➡ Characteristics
✦ Low bandwidth
✦ Low speed
✦ High protocol overhead
➡ Communication cost will dominate; ignore all other cost factors
➡ Global schedule to minimize communication cost
➡ Local schedules according to centralized query optimization
• Local area networks (LAN)
➡ Communication cost not that dominant
➡ Total cost function should be considered
➡ Broadcasting can be exploited (joins)
➡ Special algorithms exist for star networks
Distributed Query Processing Methodology
• Control site
➡ Input: Calculus Query on Distributed Relations
➡ Query Decomposition (uses GLOBAL SCHEMA)
✦ Output: Algebraic Query on Distributed Relations
➡ Data Localization (uses FRAGMENT SCHEMA)
✦ Output: Fragment Query
➡ Global Optimization (uses STATS ON FRAGMENTS)
✦ Output: Optimized Fragment Query with Communication Operations
• Local sites
➡ Local Optimization (uses LOCAL SCHEMAS)
✦ Output: Optimized Local Queries
