Parallel and Distributed Databases
By:- Manjeet Singh
Barr Code :-2140131
Group :- 220
Introduction
⚫ What is a centralized database?
- All the data is maintained at a single site, and the processing of
individual transactions is assumed to be essentially sequential.
PARALLEL DBMSs
WHY DO WE NEED THEM?
• More and more data!
We have databases that hold large amounts of
data, on the order of 10^12 bytes:
1,000,000,000,000 bytes!
• Faster and faster access!
We have data applications that need to process
data at very high speeds:
10,000s of transactions per second!
SINGLE-PROCESSOR DBMSs AREN'T UP TO THE JOB!
Why Parallel Access To Data?
Serial: 1 terabyte at 10 MB/s takes about 1.2 days to scan.
Parallel: 1 terabyte with 1,000-way parallelism, at 10 MB/s each, takes under 2 minutes to scan.
Parallelism: divide a big problem into many smaller ones to be solved in parallel.
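The scan times above follow from simple arithmetic. A minimal sketch, assuming 1 terabyte means 10^12 bytes, 10 MB/s means 10^7 bytes per second, and the data is spread evenly over the parallel units (the function name `scan_seconds` is illustrative, not from the slides):

```python
def scan_seconds(total_bytes: float, bytes_per_second: float, units: int = 1) -> float:
    """Time to scan total_bytes when the work is split evenly over `units` readers."""
    return total_bytes / (bytes_per_second * units)


TERABYTE = 10**12          # assumed: decimal terabyte
RATE = 10 * 10**6          # assumed: 10 MB/s per reader

serial = scan_seconds(TERABYTE, RATE)                # one reader
parallel = scan_seconds(TERABYTE, RATE, units=1000)  # 1,000-way parallelism

print(f"serial:   {serial / 86400:.2f} days")
print(f"parallel: {parallel / 60:.2f} minutes")
```

Under these assumptions the serial scan takes 100,000 seconds and the 1,000-way parallel scan takes 100 seconds.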
Parallel DB
⚫ A parallel database system seeks to improve performance through
parallelization of various operations, such as loading data, building
indexes, and evaluating queries, by using multiple CPUs and disks in
parallel.
⚫ Motivation for Parallel DB
⚫ Parallel machines are becoming quite common and affordable.
⚫ Prices of microprocessors, memory, and disks have dropped sharply.
⚫ Databases are growing increasingly large.
⚫ Large volumes of transaction data are collected and stored for later
analysis.
⚫ Multimedia objects such as images are increasingly stored in databases.
PARALLEL DBMSs
BENEFITS OF A PARALLEL DBMS
⚫ Improves throughput.
INTER-QUERY PARALLELISM
It is possible to process a number of transactions in
parallel with each other.
⚫ Improves response time.
INTRA-QUERY PARALLELISM
It is possible to process 'sub-tasks' of a transaction in
parallel with each other.
PARALLEL DBMSs
HOW TO MEASURE THE BENEFITS
⚫ Speed-Up
– Adding more resources results in proportionally less running time for a
fixed amount of data.
– 10 seconds to scan a DB of 10,000 records using 1 CPU
– 1 second to scan a DB of 10,000 records using 10 CPUs
⚫ Scale-Up
– If resources are increased in proportion to an increase in data/problem
size, the overall time should remain constant.
– 1 second to scan a DB of 1,000 records using 1 CPU
– 1 second to scan a DB of 10,000 records using 10 CPUs
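The two measures can be written as ratios of elapsed times. The sketch below applies them to the scan examples; the function names and the `is_linear` helper are illustrative assumptions, not part of the original slides:

```python
def speed_up(time_small_system: float, time_large_system: float) -> float:
    """Same data on both systems: elapsed time on the small system / large system."""
    return time_small_system / time_large_system


def scale_up(time_small_problem: float, time_large_problem: float) -> float:
    """Data and resources grow together: small-problem time / large-problem time."""
    return time_small_problem / time_large_problem


def is_linear(ratio: float, ideal: float, tolerance: float = 1e-9) -> bool:
    """True when the measured ratio matches the ideal value."""
    return abs(ratio - ideal) <= tolerance


# Speed-up example: 10,000 records, 1 CPU (10 s) versus 10 CPUs (1 s).
s = speed_up(10.0, 1.0)
print(s, is_linear(s, ideal=10))   # ideal speed-up equals the factor of added CPUs

# Scale-up example: 1,000 records on 1 CPU (1 s) versus 10,000 records on 10 CPUs (1 s).
c = scale_up(1.0, 1.0)
print(c, is_linear(c, ideal=1))    # ideal scale-up is 1: time stays constant
```

A speed-up below the resource factor, or a scale-up below 1, indicates sub-linear behavior.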
Architectures for Parallel Databases
⚫ The basic idea behind a parallel DB is to carry out evaluation steps in
parallel whenever possible.
⚫ There are many opportunities for parallelism in an RDBMS.
⚫ Three main architectures have been proposed for building parallel DBMSs:
1. Shared Memory
2. Shared Disk
3. Shared Nothing
Shared Memory
⚫ Advantages:
1. It is close to a conventional machine and easy to program.
2. Communication overhead is low.
3. OS services are leveraged to utilize the additional CPUs.
⚫ Disadvantages:
1. It leads to a bottleneck problem as more CPUs contend for the shared memory.
2. It is expensive to build.
3. It is less sensitive to partitioning.
Shared Disk
⚫ Advantages:
1. Almost the same as shared memory.
⚫ Disadvantages:
1. More interference.
2. Increased network bandwidth requirements.
3. Shared disk is less sensitive to partitioning.
Shared Nothing
⚫ Advantages:
1. It provides linear scale-up and linear speed-up.
2. Shared nothing benefits from "good" partitioning.
3. It is cheap to build.
⚫ Disadvantages:
1. It is hard to program.
2. Adding new nodes requires reorganizing the data.
PARALLEL DBMSs
SPEED-UP
[Figure: transactions/second plotted against number of CPUs, with a linear speed-up (ideal) line and a sub-linear speed-up curve. Labeled points: 1000/sec at 5 CPUs, 2000/sec at 10 CPUs, 1600/sec at 16 CPUs.]
Source: 1. Parallel DB / D.S. Jagli, 2/12/2013
PARALLEL DBMSs
SCALE-UP
[Figure: transactions/second plotted against number of CPUs and database size, with a linear scale-up (ideal) line and a sub-linear scale-up curve. Labels: 5 CPUs with a 1 GB database, 10 CPUs with a 2 GB database; rates shown: 1000/sec and 900/sec.]
PARALLEL QUERY EVALUATION
A relational query execution plan is a graph (tree) of
relational algebra operators; on this basis, operators can
execute in parallel.
Different Types of DBMS Parallelism
⚫ Parallel evaluation of a relational query in a DBMS with a shared-nothing
architecture
1. Inter-query parallelism
⚫ Multiple queries run on different sites.
2. Intra-query parallelism
⚫ A single query is executed in parallel on different sites.
a) Intra-operator parallelism
⚫ All machines work together to compute a given operation (scan, sort,
join).
b) Inter-operator parallelism
⚫ Each operator may run concurrently on a different site (exploits
pipelining).
⚫ In addition to evaluating different operators in parallel, we can
evaluate each operator in a query plan in parallel.
Data Partitioning
⚫ Types of Partitioning
1. Horizontal Partitioning: the tuples of a relation are divided among
many disks such that each tuple resides on one disk.
⚫ It exploits the I/O bandwidth of the disks by reading and writing
them in parallel.
⚫ It reduces the time required to retrieve relations from disk by
partitioning the relations across multiple disks.
   1. Range Partitioning
   2. Hash Partitioning
   3. Round Robin Partitioning
2. Vertical Partitioning
1. Range Partitioning
⚫ Tuples are sorted (conceptually), and n ranges are chosen for
the sort key values so that each range contains roughly the
same number of tuples.
⚫ Tuples in range i are assigned to processor i.
⚫ E.g.:
⚫ sailor_id 1-10 assigned to disk 1
⚫ sailor_id 11-20 assigned to disk 2
⚫ sailor_id 21-30 assigned to disk 3
⚫ Range partitioning can lead to data skew; that is, partitions with widely
varying numbers of tuples across disks.
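Range partitioning amounts to a lookup of the key against a sorted list of split points. A minimal sketch, assuming non-overlapping ranges 1-10, 11-20, and 21-30 for the sailor_id example; `range_partition` and `SPLIT_POINTS` are hypothetical names chosen for illustration:

```python
from bisect import bisect_left
from collections import Counter


def range_partition(key, split_points):
    """Return the 0-based partition for key.

    split_points[i] is the largest key stored in partition i; keys above
    the last split point go to the final partition.
    """
    return bisect_left(split_points, key)


# Assumed split points for the sailor_id example: 1-10, 11-20, 21-30.
SPLIT_POINTS = [10, 20]

# Map each sailor_id to a 1-based disk number.
disks = {sid: range_partition(sid, SPLIT_POINTS) + 1 for sid in range(1, 31)}

# Skewed input: most keys fall in the first range, so disk 1 holds most tuples.
skewed_ids = [1, 2, 3, 4, 5, 6, 7, 8, 25]
skew = Counter(range_partition(sid, SPLIT_POINTS) + 1 for sid in skewed_ids)
print(skew)
```

The skewed input illustrates the data-skew problem: the split points are fixed, so an uneven key distribution produces uneven partitions.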
2. Hash Partitioning
⚫ A hash function is applied to selected fields of a tuple to determine its
processor.
⚫ Hash partitioning has the additional virtue that it keeps data evenly
distributed even if the data grows and shrinks over time.
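A minimal sketch of hash partitioning, assuming the partitioning field is a single key and using CRC-32 from the standard library as a stand-in hash function (a real DBMS would use its own hash; `hash_partition` is a hypothetical name):

```python
import zlib
from collections import Counter


def hash_partition(key, n):
    """Map a key to one of n processors using a deterministic hash."""
    # zlib.crc32 is stable across runs, unlike Python's built-in hash() for strings.
    return zlib.crc32(str(key).encode("utf-8")) % n


N_PROCESSORS = 3
sailor_ids = list(range(1, 31))

assignment = {sid: hash_partition(sid, N_PROCESSORS) for sid in sailor_ids}
counts = Counter(assignment.values())
print(counts)  # number of tuples per processor
```

Because the processor depends only on the key value, an exact-match query on the partitioning field can be sent to a single processor.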
3. Round Robin Partitioning
⚫ If there are n processors, the i-th tuple is assigned to processor i mod n in
round-robin partitioning.
⚫ Round-robin partitioning is suitable for efficiently evaluating queries that
access the entire relation.
⚫ If only a subset of the tuples (e.g., those that satisfy the selection
condition age = 20) is required, hash partitioning and range partitioning
are better than round-robin partitioning.
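The i mod n rule can be sketched in a few lines. The tuple contents below are made up for illustration, and `round_robin_partition` is a hypothetical name:

```python
def round_robin_partition(tuples, n):
    """Assign the i-th tuple to processor i mod n."""
    partitions = [[] for _ in range(n)]
    for i, t in enumerate(tuples):
        partitions[i % n].append(t)
    return partitions


# Ten illustrative tuples spread over three processors.
tuples = [f"t{i}" for i in range(10)]
parts = round_robin_partition(tuples, 3)
sizes = [len(p) for p in parts]
print(sizes)
```

Partition sizes differ by at most one, which is why the scheme spreads load well for full scans. The placement ignores tuple values, so a selection such as age = 20 still has to visit every partition.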
[Figure: the key ranges A...E, F...J, K...N, O...S, T...Z distributed across disks under each of the three schemes.]
Range: good for equijoins, exact-match queries, and range queries
Hash: good for equijoins and exact-match queries
Round Robin: good to spread load
Parallelizing Sequential Operator Evaluation Code
1. An elegant software architecture for parallel DBMSs enables us to
readily parallelize existing code for sequentially evaluating a
relational operator.
2. The basic idea is to use parallel data streams.
3. Streams are merged as needed to provide the inputs for a relational
operator.
4. The output of an operator is split as needed to parallelize subsequent
processing.
5. A parallel evaluation plan consists of a dataflow network of
relational, merge, and split operators.
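The split/merge idea can be sketched as a tiny dataflow: a split operator fans the input out into n streams, an unchanged sequential operator runs on each stream, and a merge operator combines the outputs. This is an illustrative sketch only. The relation, the `age = 20` selection, and all function names are assumptions, and a thread pool stands in for separate processors:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import chain


def split(stream, n, key):
    """Split operator: route each tuple to one of n output streams by key."""
    outputs = [[] for _ in range(n)]
    for t in stream:
        outputs[key(t) % n].append(t)
    return outputs


def merge(streams):
    """Merge operator: combine several input streams into one."""
    return list(chain.from_iterable(streams))


def select_age_20(stream):
    """Existing sequential operator, reused unchanged: selection on age = 20."""
    return [t for t in stream if t["age"] == 20]


def parallel_plan(relation, n):
    """Dataflow network: split -> n copies of the sequential operator -> merge."""
    streams = split(relation, n, key=lambda t: t["sid"])
    with ThreadPoolExecutor(max_workers=n) as pool:
        results = list(pool.map(select_age_20, streams))
    return merge(results)


# Made-up relation: 20 sailors with ages cycling through 18..22.
sailors = [{"sid": i, "age": 18 + i % 5} for i in range(1, 21)]
result = parallel_plan(sailors, 4)
print(sorted(t["sid"] for t in result))
```

The sequential operator needs no knowledge of the parallelism; only the split and merge operators around it change when the degree of parallelism changes.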
PARALLELIZING INDIVIDUAL OPERATIONS
⚫ How can various operations be implemented in parallel in a
shared-nothing architecture?
⚫ Techniques
1. Bulk loading and scanning
2. Sorting
3. Joins
1. Bulk Loading and Scanning
⚫ Scanning a relation: if the relation is partitioned across several disks,
pages can be read in parallel while scanning it, and the retrieved tuples
can then be merged.
⚫ Bulk loading: if a relation has associated indexes, any sorting of data
entries required for building the indexes during bulk loading can also
be done in parallel.
2. Parallel Sorting
⚫ Parallel sorting steps:
1. First redistribute all tuples in the relation using range partitioning.
2. Each processor then sorts the tuples assigned to it.
3. The entire sorted relation can be retrieved by visiting the processors in
an order corresponding to the ranges assigned to them.
⚫ Problem: data skew
⚫ Solution: "sample" the data at the outset to determine good
range partition points.
⚫ A particularly important application of parallel sorting is sorting the data
entries in tree-structured indexes.
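The three sorting steps, together with sampling to pick the range partition points, can be sketched as follows. This is an illustration under stated assumptions. The sample size, the fixed seed, and all function names are hypothetical, and a thread pool stands in for the separate processors of a shared-nothing system:

```python
import random
from bisect import bisect_left
from concurrent.futures import ThreadPoolExecutor


def choose_split_points(keys, n, sample_size=100, seed=0):
    """Sample the keys and pick n-1 split points that divide the sample evenly."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    return [sample[(i * len(sample)) // n] for i in range(1, n)]


def parallel_sort(keys, n):
    if not keys:
        return []
    split_points = choose_split_points(keys, n)

    # Step 1: redistribute all tuples using range partitioning.
    partitions = [[] for _ in range(n)]
    for k in keys:
        partitions[bisect_left(split_points, k)].append(k)

    # Step 2: each processor sorts the tuples assigned to it.
    with ThreadPoolExecutor(max_workers=n) as pool:
        sorted_parts = list(pool.map(sorted, partitions))

    # Step 3: visit the processors in range order to read the sorted relation.
    return [k for part in sorted_parts for k in part]


data_rng = random.Random(42)
data = [data_rng.randint(0, 10_000) for _ in range(1_000)]
result = parallel_sort(data, 4)
print(result[:5], result[-5:])
```

Because the split points come from a sample of the actual data, the partitions stay roughly balanced even when the key distribution is uneven, which is the point of the sampling step.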
TWO-PHASE COMMIT (2PC) - commit
TWO-PHASE COMMIT (2PC) - ABORT
