UNIT-1
Distributed Data Processing
• Distributed data processing allows multiple computers to be used
anywhere.
• Distributed data processing allows work to be spread among multiple
geographically separate sites, where local computers handle local
processing needs.
• In distributed processing, a database's logical processing is shared
among two or more physically independent sites that are connected
through a network.
• It dramatically reduces workstation costs, improves user interfaces and
desktop power, and increases the ability to share data across
multiple servers.
• We define a distributed system as one in which hardware or software
components located at networked computers communicate and
coordinate their actions only by passing messages.
Advantages :
1. Availability
2. Resource Sharing
3. Incremental Growth
4. Increased User Involvement and Control
5. End-user Productivity
6. Distance and location independence
7. Privacy and security
Disadvantages :
1. More difficult testing and failure diagnosis
2. More components and dependence on communication means more points of failure
3. Incompatibility of components
4. Incompatibility of data
5. More complex management and control
6. Difficulty in control of corporate information resources
7. Suboptimal procurement
8. Duplication of effort
Distributed System Architecture
• Database System Architecture
• Database architecture focuses on the design, development, implementation and
maintenance of computer programs that store and organize information for businesses,
agencies and institutions.
• Centralized Architecture
• The centralized database system consists of a single processor together with its associated
data storage devices and other peripherals.
• Data can be accessed from multiple sites over a computer network, while
the database is maintained at the central site.
• General-purpose computer system: One to a few CPUs and a number of device controllers
that are connected through a common bus that provides access to shared memory.
• Single-user system (e.g., personal computer or workstation) : Desk-top unit, single user,
usually has only one CPU and one or two hard disks; the OS may support only one user.
• Multi-user system: More disks, more memory, multiple CPUs and a multi-user OS. Serve a
large number of users who are connected to the system via terminals. Often called server
systems.
Advantages
1. The data integrity is maximized as the whole database is stored
at a single physical location. This means that it is easier to
coordinate the data and it is as accurate and consistent as
possible.
2. The data redundancy is minimal in the centralized database.
Disadvantages:
1. If the server fails, the whole system fails.
2. Searching and accessing data takes more time.
3. Bottlenecks can appear when traffic spikes.
• Client-Server Architecture
• The functionality is divided into two classes: Server and Client.
• A client is defined as a requester of services and a server is defined as the provider
of services. Client/server computing is a very efficient and safe means of sharing
database information in a multi-user environment.
• This provides two-level architecture which makes it easier to manage the
complexity of modern DBMSs and the complexity of distribution.
• Client: User machine that provides user interface capabilities and local processing.
• Server: System containing both hardware and software. It provides services to the
client machines such as file access, printing, archiving, or database access.
• Server functions: Mainly data management, including query processing,
optimization, transaction management, etc.
• Client functions: Might also include some data management functions, not just
the user interface.
• Client-server architecture may be either two-tier or three-tier.
• In a two-tier architecture, the server performs database functions
and the clients perform the user interface functions. Either the
server or the clients may perform business functions.
• In a three-tier architecture, the clients perform the user interface
functions, a database server performs the database functions, and
separate computers, called application servers, perform the
business functions and act as interface between clients and
database server.
• Client-server systems can be classified as follows:
1. Multiple client - single server 2. Multiple client - multiple server
1. Multiple Clients - Single Server
Problems with this type of architecture are as follows:
1. Single point of failure, i.e., server failure
2. Bottleneck at the server
3. Difficult to scale the database
2. Multiple Client - Multiple Server :
• Two alternative management strategies are possible: either each client
manages its own connection to the appropriate server, or each client
knows only its "home server", which then communicates with other
servers as required.
• It consists of clients running client software, a set of servers that provide
all database functionality, and a reliable communication infrastructure.
• When a server receives a query that requires access to data at other
servers, it generates the appropriate sub-queries to be executed by the other
servers and puts the results together to compute the answer to the original query.
Advantages of client-server architectures
1. Horizontal and vertical scaling of resources
2. Better price/performance on client machines
3. Ability to use familiar tools on client machines
4. Client access to remote data
Disadvantages of client-server architectures
1. Maintenance costs are higher.
2. Security problems grow as the number of users and processing
sites increases.
3. Complexity increases.
Server System Architecture
• Server systems can be broadly categorized into two kinds:
1. Transaction servers which are widely used in relational
database systems.
2. Data servers, used in object-oriented database systems.
Transaction server
A transaction server is a specialized type of server that manages
the operations of software-based transactions or transaction
processing. It manages application and database transactions on a
network or the Internet, within a distributed computing environment.
• A transaction server allows you to break down transactions into
components that perform discrete functions.
• A component notifies the transaction server whether it has completed
successfully or not.
• If all the components in a transaction complete successfully, the
transaction is committed. If not, the transaction is rolled back.
• Transaction servers are also called query server systems or SQL server
systems.
• Requests are specified in SQL and communicated to the server through a
Remote Procedure Call (RPC) mechanism.
• Open Database Connectivity (ODBC) is an application program
interface standard from Microsoft for connecting to a server, sending
SQL requests, and receiving results.
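As a hedged illustration (not part of the original slides), the following Python snippet shows a client sending an SQL request to a server over ODBC using the pyodbc library; the DSN, credentials, table, and column names are placeholder assumptions.

```python
# Illustrative only: the DSN "SalesDB", the credentials, and the "orders"
# table are hypothetical; any ODBC data source would work the same way.
import pyodbc

conn = pyodbc.connect("DSN=SalesDB;UID=app_user;PWD=secret")
cur = conn.cursor()

# The SQL request travels to the server; only the results come back.
cur.execute("SELECT id, amount FROM orders WHERE amount > ?", 100)
for row in cur.fetchall():
    print(row.id, row.amount)

conn.commit()   # the server commits (or rolls back) the transaction
conn.close()
```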
• Data server:
A data server is used in LANs, where there is a very high-speed connection
between the clients and the server.
Issues are as follows:
• Page-shipping versus Item-shipping
• Locking
• Data caching
• Lock caching
Parallel System
• A system is a parallel system if multiple processors have direct access
to shared memory that forms a common address space.
• Two main performance measures are:
a. Throughput: The number of tasks that can be completed in a
given time interval.
b. Response time: The amount of time it takes to complete a
single task from the time it is submitted.
• Speed-up and Scale-up:
Speedup: A fixed-size problem executing on a small system is given to a
system that is N times larger; ideally, it then runs N times faster.
Scaleup: Increase the size of both the problem and the system; the
N-times larger system is used to perform an N-times larger job in roughly
the same time (see the sketch below).
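As a minimal sketch (not from the slides), the Python snippet below computes speed-up and scale-up from elapsed times; the timing values are illustrative assumptions.

```python
def speedup(t_small: float, t_large: float) -> float:
    """Speed-up = time for a fixed-size problem on the small system
    divided by its time on the N-times larger system."""
    return t_small / t_large

def scaleup(t_small: float, t_large: float) -> float:
    """Scale-up = time for the small problem on the small system divided by
    the time for the N-times larger problem on the N-times larger system.
    A value close to 1 means linear scale-up."""
    return t_small / t_large

# Hypothetical measurements:
print(speedup(100.0, 12.5))   # 8.0  -> ideal speed-up on an 8x system
print(scaleup(100.0, 110.0))  # ~0.91 -> slightly sub-linear scale-up
```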
• Factors Limiting Speedup and Scaleup
1. Startup costs: Cost of starting up multiple processes may
dominate computation time, if the degree of parallelism is
high.
2. Interference: Processes accessing shared resources (e.g., system
bus, disks, or locks) compete with each other, thus
spending time waiting on other processes, rather
than performing useful work.
3. Skew: Increasing the degree of parallelism increases the variance
in service times of parallelly executing tasks. Overall
execution time determined by slowest of
parallelly executing tasks.
Distributed System
• A distributed system is one in which components located at
networked computers communicate and co-ordinate their actions
only by passing messages.
• Usually, distributed systems are asynchronous, i.e., they do not
use a common clock and do not impose any bounds on relative
processor speeds or message transfer times.
Local and global transactions are differentiated as follows:
a) A local transaction accesses data in the single site at which the
transaction was initiated.
b) A global transaction either accesses data in a site different from the
one at which the transaction was initiated or accesses data in
several different sites.
Trade-offs in distributed systems :
a. Sharing Data: Users at one site are able to access data residing at
other sites.
b. Autonomy: Each site is able to retain a degree of control over data
stored locally.
c. Higher System Availability through Redundancy : Data can be
replicated at remote sites, and system can function even if
a site fails.
Parallel Database
• Parallel database systems consist of multiple processors and multiple disks
connected by a fast interconnection network.
• A parallel database improves data-processing performance by using
multiple resources, such as CPUs and disks, simultaneously.
Goals of Parallel Databases
• Improve performance: The performance of the system can be
improved by connecting multiple CPUs and disks in parallel.
• Improve availability of data: Data can be copied to multiple locations
to improve the availability of data.
• Improve reliability: Reliability of system is improved with
completeness, accuracy and availability of data.
• Provide distributed access to data: Companies having many branches
in multiple cities can access data with the help of a parallel database
system.
• Advantages of Parallel Databases:
a) Performance improvement
b) High availability: The same data is stored at multiple locations.
c) Increased reliability: Even if a data site fails, execution can continue because
other copies of the data are available.
I/O Parallelism
• I/O parallelism refers to reducing the time required to retrieve
relations from disk by partitioning the relations on multiple disks.
• The most common form of data partitioning in a parallel database
environment is horizontal partitioning.
• In horizontal partitioning, the tuples of a relation are divided among many
disks so that each tuple resides on one disk. Several
partitioning strategies have been proposed.
• Three basic data-partitioning strategies are
• Round Robin,
• Hash Partitioning
• Range Partitioning.
Partitioning techniques are as follows :
• Round-robin: This method scans the relation in any order and sends the i-th
tuple to disk D(i mod n). It ensures an even distribution of tuples
across disks; each disk has approximately the same number of tuples as
the others.
• Hash partitioning: Choose one or more attributes as the partitioning
attributes. A hash function is chosen whose range is {0, 1, ..., n - 1}. Each
tuple of the original relation is hashed on the partitioning attributes. If the
hash function returns i, then the tuple is placed on disk Di. Hash partitioning
is also useful for sequential scans of the entire relation.
• Range partitioning: Choose an attribute as the partitioning attribute. A
partitioning vector [v0, v1, ..., vn-2] is chosen. Let v be the partitioning-attribute
value of a tuple. Tuples such that vi <= v < vi+1 go to disk i + 1; tuples
with v < v0 go to disk 0, and tuples with v >= vn-2 go to disk n - 1.
(The three strategies are sketched in code below.)
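As a minimal sketch (not from the slides), the following Python functions illustrate the three partitioning strategies for n disks; the tuples, the key function, and the partitioning vector are illustrative assumptions.

```python
def round_robin(tuples, n):
    """Send the i-th tuple to disk (i mod n)."""
    disks = [[] for _ in range(n)]
    for i, t in enumerate(tuples):
        disks[i % n].append(t)
    return disks

def hash_partition(tuples, n, key):
    """Hash the partitioning attribute into the range {0, ..., n-1}."""
    disks = [[] for _ in range(n)]
    for t in tuples:
        disks[hash(key(t)) % n].append(t)
    return disks

def range_partition(tuples, vector, key):
    """vector = [v0, v1, ..., v(n-2)]: v < v0 -> disk 0,
    vi <= v < vi+1 -> disk i+1, v >= v(n-2) -> disk n-1."""
    disks = [[] for _ in range(len(vector) + 1)]
    for t in tuples:
        v, i = key(t), 0
        while i < len(vector) and v >= vector[i]:
            i += 1
        disks[i].append(t)
    return disks

rows = [(i, i * 10) for i in range(8)]          # hypothetical relation
print(round_robin(rows, 3))
print(range_partition(rows, [20, 50], key=lambda t: t[1]))
```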
Handling of Skew
• Data skew primarily refers to a non-uniform distribution of values in a dataset.
• The direct impact of data skew on the parallel execution of complex
database queries is poor load balancing, which leads to high response
times.
Skew is of two types: attribute-value skew and partition skew.
1. Attribute-value skew : Some values appear in the partitioning
attributes of many tuples. All the tuples with the same value for
the partitioning attribute end up in the same partition, resulting
in skew.
2. Partition skew: There may be load imbalance in the partitioning
even when there is no attribute skew.
Interquery Parallelism
• In interquery parallelism, different queries or transactions execute in
parallel with one another.
• The response times of individual transactions are not faster than they
would be if the transactions were run in isolation.
• Thus, the primary use of interquery parallelism is to scale up a
transaction processing system to support a larger number of
transactions per second.
Intraquery Parallelism
• Intraquery parallelism refers to the execution of a single query in
parallel on multiple processors and disks. Using intraquery
parallelism is essential for speeding up long-running queries.
Two complementary forms of intraquery parallelism:
1. Intraoperation parallelism : Parallelize the execution of each
individual operation in the query.
2. Interoperation parallelism : Execute the different operations in a
query expression in parallel.
Distributed Database Concepts
• A logically interrelated collection of shared data physically distributed over a
computer network.
A DDBMS must include the following components:
1. Computer workstations
2. Network hardware and software
3. Communications media
4. Transaction processor: Software component found in each computer that
receives and processes the application's data requests.
5. Data processor or data manager: Software component residing on each
computer that stores and retrieves data located at the site. It may be a
centralized DBMS
Data is stored using following methods.
1. Replication
2. Fragmentation
3. Hybrid
4. Allocation
• Allocation of fragments depends on how the database will be
used. The first step is to design the database schema and then design the
application programs.
• The information required from applications is as follows:
1. Transaction execution frequency
2. Site in which transaction is executed
3. Condition for transaction performed
A distributed database system maintains local schemas at the individual sites and a
global schema for all sites. Data storage is the responsibility of the data sites.
• There are five big reasons for using a distributed database system:
1. Many organizations are distributed in nature.
2. Multiple databases can be accessed transparently.
3. Database can be expanded incrementally - As needs arise, additional
computers can be connected to the distributed database system.
4. Reliability and availability are increased - Distributed database can
replicate data among several sites. So even if one site fails,
redundancy in data will lead to increased availability and reliability
of the data as a whole.
5. Performance will increase - Query processing can be performed at
multiple sites and as such distributed database systems can mimic
parallel database systems in a high-performance network.
Homogeneous and Heterogeneous Databases
Homogeneous DDBMS
a) All sites use the same DBMS product.
b) Homogeneous systems are much easier to design and manage.
c) The approach provides incremental growth and allows increased
performance.
d) Homogeneous DBs can communicate directly with each other.
Heterogeneous DDBMS
a) Sites may run different DBMS products, with possibly different
underlying data models.
b) This occurs when sites have implemented their own databases first,
and integration is considered later.
c) Translations are required to allow for different hardware and/or
different DBMS products
d) Typical solution is to use gateways
e) Heterogeneous DBs communicate through gateway interfaces
f) No DDBMS currently provides full support for fully
heterogeneous DDBMSs.
Distributed Data Storage
A relation r is to be stored in the database. Two methods are used
for storing relations in a distributed database.
a) Replication: The system maintains several identical replicas of the
relation and stores each replica at a different site. The alternative to
replication is to store only one copy of relation r.
b) Fragmentation: The system partitions the relation into several
fragments, and stores each fragment at a different site.
Data Replication
• If relation r is replicated, a copy of relation r is stored in two or more
sites.
• Using data replication, each logical data item of a database has
several physical copies, each of them located on a different machine,
also referred to as site or node.
• In full replication the entire database is replicated and in partial
replication some selected part is replicated to some of the sites.
Data Fragmentation
• Fragmentation is a design technique to divide a single relation or class
of a database into two or more partitions such that the
combination of the partitions provides the original database without
any loss of information.
The main reasons for fragmenting relations are to:
1. Increase locality of reference of the queries submitted to database,
2. Improve reliability and availability of data and performance of the
system,
3. Balance storage capacities and minimize communication costs
among sites.
Horizontal Fragmentation
• The horizontal fragmentation of a relation R is the subdivision of its
tuples into subsets called fragments
Each fragment Ti of table T contains a subset of the rows. Each tuple of
T is assigned to one or more fragments. Horizontal fragmentation is
lossless. It is defined using the selection operation.
There are two versions of horizontal partitioning:
a. Primary horizontal fragmentation of a relation is achieved through
the use of predicates defined on that relation which restricts the
tuples of the relation.
b. Derived horizontal fragmentation is realized by using predicates that
are defined on other relations.
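As a minimal sketch (not from the slides), the snippet below illustrates primary horizontal fragmentation using selection predicates; the relation and the predicates are illustrative assumptions.

```python
# Hypothetical employee relation, fragmented by branch (the predicate).
employees = [
    {"id": 1, "name": "Asha",  "branch": "Chennai"},
    {"id": 2, "name": "Ravi",  "branch": "Mumbai"},
    {"id": 3, "name": "Meena", "branch": "Chennai"},
]

# Each fragment is a selection sigma_p(R) for one predicate p.
frag_chennai = [t for t in employees if t["branch"] == "Chennai"]
frag_mumbai  = [t for t in employees if t["branch"] == "Mumbai"]

# Lossless: the union of the fragments reconstructs the original relation.
rebuilt = sorted(frag_chennai + frag_mumbai, key=lambda t: t["id"])
assert rebuilt == employees
```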
Vertical Fragmentation
• Some of the columns of a relation are projected into a base relation at
one of the sites, and other columns are projected into a base relation at
another site.
Two types of heuristics for vertical fragmentation exist :
1. Grouping: Assign each attribute to one fragment and, at each
step, join some of the fragments until some criterion is satisfied. It
uses a bottom-up approach.
2. Splitting : starts with a relation and decides on beneficial
partitioning based on the access behavior of applications to the
attributes. It uses top-down approach.
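As a minimal sketch (not from the slides), the snippet below illustrates vertical fragmentation: column groups are projected to different sites, and the key is kept in every fragment so that a join rebuilds the relation; the relation and column groups are illustrative assumptions.

```python
employees = [
    {"id": 1, "name": "Asha", "salary": 50000},
    {"id": 2, "name": "Ravi", "salary": 60000},
]

# pi_{id,name} at site 1 and pi_{id,salary} at site 2 (key kept in both).
frag_site1 = [{"id": t["id"], "name": t["name"]}     for t in employees]
frag_site2 = [{"id": t["id"], "salary": t["salary"]} for t in employees]

# Reconstruction: natural join on the shared key "id".
rebuilt = [{**a, **b} for a in frag_site1 for b in frag_site2
           if a["id"] == b["id"]]
assert rebuilt == employees
```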
Distributed Transparency
• Distribution transparency allows a
distributed database to be treated as a single logical database.
The following levels of distribution transparency are
recognized:
1. Fragmentation transparency
2. Location transparency
3. Replication transparency
• Fragmentation transparency is the highest level of transparency. The
end user or programmer does not need to know that a database is
partitioned.
Location Transparency
Location transparency exists when the end user or programmer must
specify the database fragment name but does not need to specify
where those fragments are located.
Replication Transparency
• The fact that there might be more than one copy of a table stored in the system
should be hidden from the user. This is replication
transparency, which enables the user to query any table as if there
were only one copy of it.
Distributed Transactions
• A distributed transaction is composed of several sub-transactions,
each running on a different site. Each database manager can independently decide to
abort its sub-transaction.
• When a transaction is submitted, the transaction manager at that site
breaks it up into one or more sub-transactions that execute at different
sites, submits them to the transaction managers at those sites, and
coordinates their activity.
Why are distributed transactions hard?
1. Atomic : Different parts of a transaction may be at different sites. How
do we ensure all or none committed?
2. Consistent : Failure may affect only part of transaction
3. Isolated : Commitment must occur "simultaneously" at all sites
4. Durable : Not much different when other problems solved. It also makes
"delayed commit" difficult.
For each such transaction, the coordinator is responsible for :
1. Starting the execution of the transaction.
2. Breaking the transaction into a number of sub-transactions and
distributing these sub-transactions to the appropriate sites for
execution.
3. Coordinating the termination of the transaction, which may result in
the transaction being committed at all sites or aborted at all
sites.
System Failure Modes
The basic failure types are:
1. Failure of a site
2. Loss of messages
3. Failure of a communication link
4. Network partition
Scenario (the classic two-army problem over an unreliable channel):
• Blue (1) sends to Blue (2) "let's attack tomorrow at dawn"; later,
• Blue (2) sends confirmation to Blue (1) "splendid idea, see you at dawn"; but,
• Blue (1) realizes that Blue (2) does not know if the message arrived,
• So, Blue (1) sends to Blue (2) "message arrived, battle set"; then, Blue (2) realizes that Blue (1) does
not know if the message arrived, and so on.
• The two blue armies can never be sure because of the unreliable communication.
• No certain agreement can be reached using this method.
• Transaction : Sequence of actions treated as an atomic action to preserve consistency (e.g. access to a database).
• Commit a transaction: Unconditional guarantee that the transaction will complete successfully (even in the
presence of failures).
• Abort a transaction: Unconditional guarantee to back out of a transaction, i.e., that all the effects of the
transaction have been removed. Events that may cause aborting a transaction are deadlocks, timeouts,
protection violations, etc.
• Mechanisms that facilitate backing out of an aborting transaction are Write-ahead-log protocol and
shadow pages.
• Commit protocols ensure that all the sites either commit or abort a transaction unanimously, even in the
presence of multiple and repetitive failures.
Two-Phase Commit Protocol
• The two-phase commit protocol is a distributed algorithm which lets all sites in a
distributed system agree on whether to commit a transaction. The protocol results in either all
nodes committing the transaction or all aborting it, even in the case of site failures and
message losses.
The two phases of the algorithm are broken into:
1. The COMMIT-REQUEST phase, where the COORDINATOR
attempts to prepare all the COHORTS, and
2. The COMMIT phase, where the COORDINATOR completes the
transactions at all COHORTS.
• Basic algorithm
• During phase 1, initially the coordinator sends a query to commit message to all
cohorts. Then it waits for all cohorts to report back with the agreement message.
• The cohorts, if the transaction was successful, write an entry to the undo
log and an entry to the redo log.
• Then the cohorts reply with an agree message, or an abort if the
transaction failed at a cohort node.
• During phase 2, if the coordinator receives an agree message from all cohorts, then it
writes a commit record into its log and sends a commit message to all the cohorts.
• If all agreement messages do not come back the coordinator sends an abort message.
• Next the coordinator waits for the acknowledgement from the cohorts.
• When acks are received from all cohorts the coordinator writes a complete record to
its log.
• Note that, the coordinator will wait forever for all the acknowledgements to come
back.
• If the cohort receives a commit message, it releases all the locks and resources held
during the transaction and sends an acknowledgement to the coordinator.
• If the message is abort, then the cohort undoes the transaction with the undo log and
releases the resources and locks held during the transaction. Then it sends an
acknowledgement.
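As a simplified, single-process sketch (not the slides' algorithm verbatim), the Python code below mimics the two-phase commit message flow; the Cohort class and its vote logic are illustrative assumptions, and a real implementation would add persistent logging, message passing, and timeouts.

```python
class Cohort:
    """Stand-in for a participant site; `ok` decides its vote."""
    def __init__(self, ok=True):
        self.ok = ok
    def prepare(self):          # write undo/redo log entries, then vote
        return self.ok
    def commit(self):           # release locks and resources, then ACK
        pass
    def abort(self):            # undo the work using the undo log
        pass

def two_phase_commit(cohorts, log):
    # Phase 1 (COMMIT-REQUEST): ask every cohort to prepare, collect votes.
    votes = [c.prepare() for c in cohorts]
    # Phase 2 (COMMIT): commit only if every cohort agreed, else abort.
    if all(votes):
        log.append("commit")
        for c in cohorts:
            c.commit()
    else:
        log.append("abort")
        for c in cohorts:
            c.abort()
    log.append("complete")      # written once all acknowledgements are in

log = []
two_phase_commit([Cohort(), Cohort(ok=True)], log)
print(log)                      # ['commit', 'complete']
```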
Three-Phase Commit Protocol
• The three-phase commit (3PC) protocol is non-blocking for site failures, except in the event of
failure of all sites.
• Communication failures can result in different sites reaching different decisions,
thereby violating the atomicity of global transactions.
• It introduces a third phase, called pre-commit, between voting and the global decision.
• On receiving all votes from the participants, the coordinator sends a global pre-commit message.
A participant that receives the global pre-commit knows that all other participants have voted
commit and that, in time, the participant itself will definitely commit.
Basic 3PC protocol :
Phase 1 :
• The coordinator sends VOTE_REQ to all participants.
• When a participant receives VOTE_REQ, it responds with YES or NO, depending on
its vote. If a participant votes NO, it decides abort and stops.
Phase 2 :
• The coordinator collects all votes. If any vote was NO, then the coordinator decides abort,
sends ABORT to all participants that voted YES, and stops. Otherwise, the coordinator
sends PRE_COMMIT messages to all participants.
• A participant that votes YES waits for a PRE_COMMIT or ABORT message from the
coordinator. If it receives a PRE_COMMIT, then it responds with an ACK message.
Phase 3 :
• The coordinator collects the ACKs. When they have all been received, it decides commit,
sends COMMITs to all participants, and stops.
• A participant waits for a COMMIT from the coordinator. When it receives that message, it
decides commit and stops.
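As a simplified sketch (not from the slides), the Python code below mimics the basic 3PC exchange with the extra PRE_COMMIT round between voting and the global decision; the Participant class is an illustrative assumption, and real 3PC also needs timeout actions at both coordinator and participants.

```python
class Participant:
    """Stand-in for a participant site; `yes` is its vote."""
    def __init__(self, yes=True):
        self.yes = yes
    def vote(self):        # Phase 1: reply YES or NO to VOTE_REQ
        return self.yes
    def pre_commit(self):  # Phase 2: learn everyone voted YES, reply ACK
        pass
    def commit(self):      # Phase 3: decide commit
        pass
    def abort(self):
        pass

def three_phase_commit(participants):
    # Phase 1: collect votes; any NO leads to a global abort.
    if not all(p.vote() for p in participants):
        for p in participants:
            p.abort()
        return "abort"
    # Phase 2: send PRE_COMMIT and collect ACKs.
    for p in participants:
        p.pre_commit()
    # Phase 3: send COMMIT once all ACKs are in.
    for p in participants:
        p.commit()
    return "commit"

print(three_phase_commit([Participant(), Participant()]))  # commit
```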
Concurrency Control and its Problem
In distributed database systems, the database is typically used by
many users. These systems usually allow multiple transactions to run
concurrently, i.e., at the same time.
Concurrency control is the activity of coordinating concurrent
accesses to a database in a multiuser database management system
(DBMS).
Main Objectives of Distributed Concurrency Control
1. It must allow recovery from site and communication failures.
2. It must support parallel execution of transactions.
3. The storage mechanism and computational method should be modest, to
minimize overhead.
4. It should keep communication delays low.
5. It should place few constraints on the structure of the atomic actions of transactions.
Concurrency Control Anomalies
• Concurrency control is the co-ordination of simultaneous transaction execution in a multiprocessing
database system.
• Lack of concurrency control can create data integrity and consistency
problems, such as:
1. Lost updates
2. Uncommitted data
3. Inconsistent retrievals
• Lost Update Problem (W-W Conflict)
• The problem occurs when two different database transactions perform
read/write operations on the same database items in an interleaved
manner (i.e., concurrent execution), which makes the values of the items
incorrect and hence leaves the database inconsistent.
• Consider the following schedule, where two transactions TX and TY are
performed on the same account A, whose balance is $300.
• At time t1, transaction TX reads the value of account A, i.e., $300 (only
read).
• At time t2, transaction TX deducts $50 from account A that becomes $250
(only deducted and not updated/write).
• Alternately, at time t3, transaction TY reads the value of account A that
will be $300 only because TX didn't update the value yet.
• At time t4, transaction TY adds $100 to account A that becomes $400
(only added but not updated/write).
• At time t6, transaction TX writes the value of account A that will be
updated as $250 only, as TY didn't update the value yet.
• Similarly, at time t7, transaction TY writes the values of account A, so it
will write as done at time t4 that will be $400. It means the value written
by TX is lost, i.e., $250 is lost.
• Hence the data becomes incorrect, and the database is left inconsistent.
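As a minimal sketch (not from the slides), the Python snippet below reproduces this lost-update schedule with plain variables standing in for the two transactions' local copies; the variable names are illustrative assumptions.

```python
account_A = 300              # shared data item

tx_local = account_A - 50    # t1/t2: TX reads 300 and computes 250 locally
ty_local = account_A + 100   # t3/t4: TY still reads 300 and computes 400

account_A = tx_local         # t6: TX writes 250
account_A = ty_local         # t7: TY overwrites with 400 -> TX's update is lost

print(account_A)             # 400, instead of the correct 350
```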
Unrepeatable Read Problem (W-R Conflict)
• It is also known as the inconsistent retrievals problem; it occurs when, within
a single transaction, two different values are read for the same database
item.
• For example:
Consider two transactions, TX and TY, performing read/write
operations on account A, which has an available balance of $300. The
schedule is as follows:
• At time t1, transaction TX reads the value from account A, i.e., $300.
• At time t2, transaction TY reads the value from account A, i.e., $300.
• At time t3, transaction TY updates the value of account A by adding
$100 to the available balance, and then it becomes $400.
• At time t4, transaction TY writes the updated value, i.e., $400.
• After that, at time t5, transaction TX reads the available value of
account A, and that will be read as $400.
• It means that within the same transaction TX, two different
values of account A are read: $300 initially and, after the update made by
transaction TY, $400. This is an unrepeatable read, and the situation is
therefore known as the unrepeatable read problem.
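As a minimal sketch (not from the slides), the Python snippet below reproduces the unrepeatable-read schedule with plain variables; the variable names are illustrative assumptions.

```python
account_A = 300

read1 = account_A            # t1: TX reads 300
account_A = account_A + 100  # t2-t4: TY reads, adds $100, and writes 400
read2 = account_A            # t5: TX reads again and now sees 400

print(read1, read2)          # 300 400 -> same item, two different values
```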

More Related Content

PPTX
Database System Architectures
PPTX
Chapter 20
PPTX
Module 3 - DBMS System Architecture Principles
PPTX
Sayed database system_architecture
PPTX
Sayed database system_architecture
PPTX
Light sayed database_system_architecture
PPTX
Light sayed database_system_architecture
Database System Architectures
Chapter 20
Module 3 - DBMS System Architecture Principles
Sayed database system_architecture
Sayed database system_architecture
Light sayed database_system_architecture
Light sayed database_system_architecture

Similar to Advanced Topics on Database - Unit-1 AU17 (20)

PDF
Lecture Notes Unit3 chapter20 - Database System Architectures
PPTX
Adbms 26 architectures for a distributed system
PDF
Database system architecture
PPT
VNSISPL_DBMS_Concepts_ch20
PPT
Client Server Architecture1
PPT
Lecture 5 Database management system.ppt
PPTX
Design Issues of Distributed System (1).pptx
PPTX
Disadvantages Distributed System.pptx
PPTX
Database architecture
PPTX
Distributed database
PPTX
PPT
Elements of Systems Design.ppt
PPTX
unit 4-1.pptx
DOCX
Client server computing_keypoint_and_questions
DOCX
Distributed computing
PPTX
Distributed Data Base.pptx
PPTX
Client server architecture
PPT
Lecture Notes Unit3 chapter20 - Database System Architectures
Adbms 26 architectures for a distributed system
Database system architecture
VNSISPL_DBMS_Concepts_ch20
Client Server Architecture1
Lecture 5 Database management system.ppt
Design Issues of Distributed System (1).pptx
Disadvantages Distributed System.pptx
Database architecture
Distributed database
Elements of Systems Design.ppt
unit 4-1.pptx
Client server computing_keypoint_and_questions
Distributed computing
Distributed Data Base.pptx
Client server architecture
Ad

More from LOGANATHANK24 (6)

PPTX
CS3591- Computer Networks Unit-1 Introduction and Application Layer
PPTX
CS3591- Computer Networks Unit-2 Transport layer
PPTX
Advanced Topics on Database - Unit-4 AU17
PPTX
Advanced Topics on Database - Unit-2 AU17
PPTX
Advanced Topics on Database - Unit-3 AU17
PPSX
CS3391 - Object Oriented Programming AU.Reg-2021
CS3591- Computer Networks Unit-1 Introduction and Application Layer
CS3591- Computer Networks Unit-2 Transport layer
Advanced Topics on Database - Unit-4 AU17
Advanced Topics on Database - Unit-2 AU17
Advanced Topics on Database - Unit-3 AU17
CS3391 - Object Oriented Programming AU.Reg-2021
Ad

Recently uploaded (20)

PPTX
Introduction_to_Human_Anatomy_and_Physiology_for_B.Pharm.pptx
PDF
The Lost Whites of Pakistan by Jahanzaib Mughal.pdf
PDF
Abdominal Access Techniques with Prof. Dr. R K Mishra
PDF
ANTIBIOTICS.pptx.pdf………………… xxxxxxxxxxxxx
PDF
Microbial disease of the cardiovascular and lymphatic systems
PPTX
Introduction to Child Health Nursing – Unit I | Child Health Nursing I | B.Sc...
PPTX
master seminar digital applications in india
PDF
Saundersa Comprehensive Review for the NCLEX-RN Examination.pdf
PDF
2.FourierTransform-ShortQuestionswithAnswers.pdf
PPTX
Pharma ospi slides which help in ospi learning
PPTX
PPH.pptx obstetrics and gynecology in nursing
PPTX
Renaissance Architecture: A Journey from Faith to Humanism
PDF
Pre independence Education in Inndia.pdf
PDF
O5-L3 Freight Transport Ops (International) V1.pdf
PDF
FourierSeries-QuestionsWithAnswers(Part-A).pdf
PDF
01-Introduction-to-Information-Management.pdf
PDF
Anesthesia in Laparoscopic Surgery in India
PDF
O7-L3 Supply Chain Operations - ICLT Program
PPTX
GDM (1) (1).pptx small presentation for students
PPTX
school management -TNTEU- B.Ed., Semester II Unit 1.pptx
Introduction_to_Human_Anatomy_and_Physiology_for_B.Pharm.pptx
The Lost Whites of Pakistan by Jahanzaib Mughal.pdf
Abdominal Access Techniques with Prof. Dr. R K Mishra
ANTIBIOTICS.pptx.pdf………………… xxxxxxxxxxxxx
Microbial disease of the cardiovascular and lymphatic systems
Introduction to Child Health Nursing – Unit I | Child Health Nursing I | B.Sc...
master seminar digital applications in india
Saundersa Comprehensive Review for the NCLEX-RN Examination.pdf
2.FourierTransform-ShortQuestionswithAnswers.pdf
Pharma ospi slides which help in ospi learning
PPH.pptx obstetrics and gynecology in nursing
Renaissance Architecture: A Journey from Faith to Humanism
Pre independence Education in Inndia.pdf
O5-L3 Freight Transport Ops (International) V1.pdf
FourierSeries-QuestionsWithAnswers(Part-A).pdf
01-Introduction-to-Information-Management.pdf
Anesthesia in Laparoscopic Surgery in India
O7-L3 Supply Chain Operations - ICLT Program
GDM (1) (1).pptx small presentation for students
school management -TNTEU- B.Ed., Semester II Unit 1.pptx

Advanced Topics on Database - Unit-1 AU17

  • 2. Distributed Data Processing • Distributed data processing allows multiple computers to be used anywhere. • Distributed data processing allows multiple computers to be working among multiple geographically separate sites where local computers handle local processing needs. • Distributed processing is a database's logical processing is shared among two or more physically independent sites that are connected through a network.
  • 4. • It dramatically reduced workstation costs and improved user interfaces and desktop power. It increases ability to share data across multiple servers. • We define a distributed system as one in which hardware or software components located at networked computers communicate and coordinate their actions only by-passing messages.
  • 5. Advantages : 1. Availability 2. Resource Sharing 3. Incremental Growth 4. Increased User Involvement and Control 5. End-user Productivity 6. Distance and location independence 7. Privacy and security Disadvantages : 1. More difficulty test and failure diagnosis 2. More components and dependence on communication means more points of failure 3. Incompatibility of components 4. Incompatibility of data 5. More complex management and control 6. Difficulty in control of corporate information resources 7. Suboptimal procurement 8. Duplication of effort
  • 6. Distributed System Architecture • Database System Architecture • Database architecture focuses on the design, development, implementation and maintenance of computer programs that store and organize information for businesses, agencies and institutions. • Centralized Architecture • The centralized database system consists of a single processor together with its associated data storage devices and other peripherals. • Data can be accessed from the multiple sites with the use of a computer network while the database is maintained at the central site. • General-purpose computer system: One to a few CPUs and a number of device controllers that are connected through a common bus that provides access to shared memory. • Single-user system (e.g., personal computer or workstation) : Desk-top unit, single user, usually has only one CPU and one or two hard disks; the OS may support only one user. • Multi-user system: More disks, more memory, multiple CPUs and a multi-user OS. Serve a large number of users who are connected to the system via terminals. Often called server systems.
  • 7. Advantages 1. The data integrity is maximized as the whole database is stored at a single physical location. This means that it is easier to coordinate the data and it is as accurate and consistent as possible. 2. The data redundancy is minimal in the centralized database. Disadvantages: 1. If server fails, whole system fails. 2. It takes more time to search and access 3. Bottlenecks can appear when the traffic spikes.
  • 8. • Client-Server Architecture • The functionality is divided into two classes: Server and Client. • A client is defined as a requester of services and a server is defined as the provider of services. Client/server computing is a very efficient and safe means of sharing database information in a multi-user environment • This provides two-level architecture which makes it easier to manage the complexity of modern DBMSs and the complexity of distribution. • Client: User machine that provides user interface capabilities and local processing. • Server: System containing both hardware and software. It provides services to the client machines such as file access, printing, archiving, or database access. • Server functions: Mainly data management, including query processing, optimization, transaction management, etc. • Client functions: Might also include some data management functions not just user interface.
  • 10. • Client-server architecture may be either two-tier or three-tier. • In a two-tier architecture, the server performs database functions and the clients perform the user interface functions. Either the server or the clients may perform business functions. • In a three-tier architecture, the clients perform the user interface functions, a database server performs the database functions, and separate computers, called application servers, perform the business functions and act as interface between clients and database server. • There is various classification of client server system. 1. Multiple client - single server 2. Multiple client - multiple server
  • 11. • Multiple clients - single server
  • 13. Problems with this type of architecture is as follows : 1. Single point of failure i.e. server failure 2. Bottleneck at server 3. Difficult to scale a database
  • 14. 2. Multiple Client - Multiple Server : • Two alternative management strategies are possible: Either each client manages its own connection to the appropriate server or each client knows of only its” home server” which then communicates with other servers as required • It consists of clients running client software, a set of servers which provide all database functionalities and a reliable communication infrastructure. • When a server receives a, query that requires access to data at other servers, it generates appropriate sub-query to be executed by other server and put the results together to compute answer to the original query
  • 16. Advantages of client-server architectures 1. Horizontal and vertical scaling of resources 2. Better price/performance on client machines 3. Ability to use familiar tools on client machines 4. Client access to remote data Disadvantages of client-server architectures 1. Maintenance cost is more. 2. It suffers from security problem as number of user and processing sites increases. 3. Complexity increases
  • 17. Server System Architecture • Server systems can be broadly categorized into two kinds: 1. Transaction servers which are widely used in relational database systems. 2. Data servers, used in object-oriented database systems. Transaction server A transaction server is a specialized type of server that manages the operations of software based transactions or transaction processing. It manages application and database transactions on a network or Internet, within a distributed computing environment.
  • 18. • Transaction server allows you to break down transactions into components that perform discrete functions. • A component notifies transaction server whether it has completed successfully or not. • If all the components in a transaction complete successfully, the transaction is committed. If not, the transaction is rolled back. • Transaction server is also called query server systems or SQL server systems. • Requests specified in SQL, and communicated to the server through a Remote Procedure Call (RPC) mechanism. • Open Database Connectivity (ODBC) is an application program interface standard from Microsoft for connecting to a server, sending SQL requests, and receiving results.
  • 19. • Data server: Data server is used in LANs, where there is a very high-speed connection between the clients and the server. Issues are as follows • Page-shipping versus Item-shipping • Locking • Data caching • Lock caching
  • 20. Parallel System • A system is said to be a parallel system in which multiple processor have direct access to shared memory which forms a common address space. • Two main performance measures are: a. Throughput: The number of tasks that can be completed in a given time interval. b. Response time: The amount of time it takes to complete a single task from the time it is submitted.
  • 21. • Speed-up and Scale-up: Speedup: A fixed-sized problem executing on a small system is given to a system which is N -times larger. Scaleup: Increase the size of both the problem and the system. The N -times larger system used to perform N -times larger job.
  • 22. • Factors Limiting Speedup and Scaleup 1. Startup costs: Cost of starting up multiple processes may dominate computation time, if the degree of parallelism is high. 2. Interference: Processes accessing shared resources (e.g., system bus, disks, or locks) compete with each other, thus spending time waiting on other processes, rather than performing useful work. 3. Skew: Increasing the degree of parallelism increases the variance in service times of parallelly executing tasks. Overall execution time determined by slowest of parallelly executing tasks.
  • 23. Distributed System • A distributed system is one in which components located at networked computers communicate and co-ordinate their actions only by-passing messages. • Usually, distributed systems are asynchronous, i.e., they do not use a common clock and do not impose any bounds on relative processor speeds or message transfer times. Differentiate between local and global transactions are as follows a) A local transaction accesses data in the single site at which the transaction was initiated. b) A global transaction either accesses data in a site different from the one at which the transaction was initiated or accesses data in several different sites.
  • 24. Trade-offs in distributed systems : a. Sharing Data : Users at one site able to access the data residing at some other sites. b. Autonomy: Each site is able to retain a degree of control over data stored locally. c. Higher System Availability through Redundancy : Data can be replicated at remote sites, and system can function even if a site fails.
  • 25. Parallel Database • Parallel database systems consist of multiple processors and multiple disks connected by a fast interconnection network. • Parallel database improves the performance of processing of data using multiple resources simultaneously. Multiple resources like multiple CPU, Disks can be used simultaneously.
  • 26. Goals of Parallel Databases • Improve performance: The performance of the system can be improved by connecting multiple CPU and disks in parallel. • Improve availability of data: Data can be copied to multiple locations to improve the availability of data. • Improve reliability: Reliability of system is improved with completeness, accuracy and availability of data. • Provide distributed access of data: Companies having many branches in multiple cities can access data with the help of parallel database system
  • 27. • Advantages of Parallel Databases: a) Performance improvement b) High availability : Same data is stored at multiple locations c) Increases reliability : Even if data site fails execution can continue as other copy of data are available.
  • 28. I/O Parallelism • I/O parallelism refers to reducing the time required to retrieve relations from disk by partitioning the relations on multiple disks. • The most common form of data partitioning in a parallel database environment is horizontal partitioning. • In horizontal partitioning, so that each tuple residthe tuples of a relation are divided among many disks,es on one disk. Several partitioning strategies have been proposed. • Three basic data-partitioning strategies are • Round Robin, • Hash Partitioning • Range Partitioning.
  • 29. Partitioning techniques are as follows : • Round-robin: In this method, scans the relation in any order and sends the ith tuple to disk number Di mod n. It ensures an even distribution of tuples across disks and each disk has approximately the same number of tuples as the others. • Hash partitioning: It choose one or more attributes as the partitioning attributes. A hash function is chosen whose range is |0, 1,..., n — 1|. Each tuple of the original relation is hashed on the partitioning attributes. If the hash function returns i, then the tuple is placed on disk Di . Hash partitioning is also useful for sequential scans of the entire relation. • Range partitioning: Choose an attribute as the partitioning attribute. A partitioning vector [v 0 , v 1-,-- , v n-2 ] is chosen. Let v be the partitioning attribute value of a tuple. Tuples such that v i<+vi+1 go to disk I + 1. Tuples with v < v 0 go to disk 0 and tuples with v >vn-2 go to disk n — 1.
  • 30. Handling of Skew • Data skew primarily refers to a non uniform distribution in a dataset. • The direct impact of data skew on parallel execution of complex database queries is a poor load balancing leading to high response time. Skew are of two types : Attribute-value skew and Partition skew. 1. Attribute-value skew : Some values appear in the partitioning attributes of many tuples. All the tuples with the same value for the partitioning attribute end up in the same partition, resulting in skew. 2. Partition skew: refers to the fact that there may be load imbalance in the partitioning, even when there is no attribute skew.
  • 31. Interquery Parallelism • In interquery parallelism, different queries or transaction execute in parallel with one another. • The response times of individual transactions are not faster than they would be if the transactions were run in isolation. • Thus, the primary use of interquery parallelism is to scale up a transaction processing system to support a more significant number of transactions per second.
  • 32. Intraquery Parallelism • Intraquery parallelism defines the execution of a single query in parallel on multiple processors and disks. Using intraquery parallelism is essential for speeding up long-running queries. Two complementary forms of intraquery parallelism: 1. Intraoperation parallelism : Parallelize the execution of each individual operation in the query. 2. Interoperation parallelism : Execute the different operations in a query expression in parallel.
  • 33. Distributed Database Concepts • A logically interrelated collection of shared data physically distributed over a computer network. DDBMS components must include the following components 1. Computer workstations 2. Network hardware and software 3. Communications media 4. Transaction Processor: Software component found in each computer that receives and processes the application's requests data 5. Data processor or data manager: Software component residing on each computer that stores and retrieves data located at the site. It may be a centralized DBMS
  • 34. Data is stored using following methods. 1. Replication 2. Fragmentation 3. Hybrid 4. Allocation • Allocation of fragments is depends on the how the database will be used. First thing is to design database schema and then design application program. • Information required from application is as follows : 1. Transaction execution frequency 2. Site in which transaction is executed 3. Condition for transaction performed
  • 36. The distributed database systems contain local and global schema for all sites. Data storage is responsibility of data sites. Data is stored using following methods. 1. Replication 2. Fragmentation 3. Hybrid 4. Allocation Allocation of fragments is depends on the how the database will be used.
  • 37. Information required from application is as follows : 1. Transaction execution frequency 2. Site in which transaction is executed 3. Condition for transaction performed
  • 38. • There are five big reasons for using a distributed database system: 1. Many organizations are distributed in nature. 2. Multiple databases can be accessed transparently. 3. Database can be expanded incrementally - As needs arise, additional computers can be connected to the distributed database system. 4. Reliability and availability are increased - Distributed database can replicate data among several sites. So even if one site fails, redundancy in data will lead to increased availability and reliability of the data as a whole. 5. Performance will increase - Query processing can be performed at multiple sites and as such distributed database systems can mimic parallel database systems in a high-performance network.
  • 39. Homogenous and Heterogeneous Databases. a)Homogeneous DDBMS b) All sites use same DBMS product c) It is much easier to design and manage d) The approach provides incremental growth and allows increased performance. e) Homogeneous systems are much easier to design and manage f) Homogeneous DBs can communicate directly with each other
  • 40. Heterogeneous DDBMS a) Sites may run different DBMS products, with possibly different underlying data models. b) This occurs when sites have implemented their own databases first, and integration is considered later. c) Translations are required to allow for different hardware and/or different DBMS products d) Typical solution is to use gateways e) Heterogeneous DBs communicate through gateway interfaces f) No DDBMS currently provides full support for heterogeneous or fully heterogeneous DDBMSs
  • 41. Distributed Data Storage A Relation (r) is stored in the database. Two methods are used for storing relations on the distributed database. a) Replication: The system maintains several identical replicas of the relation and stores each replica at a different site. The alternative to replication is to store only one copy of relation r. b) Fragmentation: The system partitions the relation into several fragments, and stores each fragment at a different site.
  • 42. Data Replication • If relation r is replicated, a copy of relation r is stored in two or more sites. • Using data replication, each logical data item of a database has several physical copies, each of them located on a different machine, also referred to as site or node. • In full replication the entire database is replicated and in partial replication some selected part is replicated to some of the sites.
  • 44. Data Fragmentation • Fragmentation is a design technique to divide a single relation or class of a database into two or more partitions such that the combination of the partitions provides the original database without any loss of information. The main reasons of fragmentation of the relations are to 1. Increase locality of reference of the queries submitted to database, 2. Improve reliability and availability of data and performance of the system, 3. Balance storage capacities and minimize communication costs among sites.
  • 45. Horizontal Fragmentation • The horizontal fragmentation of a relation R is the subdivision of its tuples into subsets called fragments Each fragment, Ti of table T contains a subset of the rows. Each tuple of T is assigned to one or more fragments. Horizontal fragmentation is lossless. It is defined as selection operation.
  • 46. There are two versions of horizontal partitioning: a. Primary horizontal fragmentation of a relation is achieved through the use of predicates defined on that relation which restricts the tuples of the relation. b. Derived horizontal fragmentation is realized by using predicates that are defined on other relations.
  • 47. Vertical Fragmentation • Some of the columns of a relation are projected into a base relation at one of the sites, and other columns are projected into a base relation at another site.
  • 48. Two types of heuristics for vertical fragmentation exist : 1. Grouping : assign each attribute to one fragment, and at each step, join some of the fragments until some criteria is satisfied. It uses bottom-up approach. 2. Splitting : starts with a relation and decides on beneficial partitioning based on the access behavior of applications to the attributes. It uses top-down approach.
  • 49. Distributed Transparency • Distributed Transparency Distribution transparency, which allows a distributed database to be treated as a single logical database. Following are different levels of distribution transparency are recognized: 1. Fragmentation transparency 2. Location transparency 3. Replication transparency
  • 50. • Fragmentation transparency is the highest level of transparency. The end user or programmer does not need to know that a database is partitioned. Location Transparency Location transparency exists when the end user or programmer must specify the database fragment name but does not need to specify where those fragments are located.
  • 51. Replication Transparency • There might be more than one copy of a table stored in the system should be hidden from the user. This provides for replication transparency, which enables the user to query any table as if there were only one copy of it.
  • 52. Distributed Transactions • A distributed transaction is composed of several sub-transactions, each running on a different site. Each database manager can decide to abort.
  • 53. • When a transaction is submitted the transaction manager at that site breaks it up into one or more sub transactions that execute at different sites, submits them to the transaction manager at those sites, and coordinates their activity. Why are distributed transactions hard? 1. Atomic : Different parts of a transaction may be at different sites. How do we ensure all or none committed? 2. Consistent : Failure may affect only part of transaction 3. Isolated : Commitment must occur "simultaneously" at all sites 4. Durable : Not much different when other problems solved. It also makes "delayed commit" difficult.
  • 54. For each such transaction, the coordinator is responsible for : 1. Starting the execution of the transaction. 2. Breaking the transaction into a number of sub-transactions and distributing these sub-transactions to the appropriate sites for execution. 3. Coordinating the termination of the transaction, which may result in the transaction being committed at all sites or aborted at all sites.
  • 55. System Failure Modes The basic failure types are: 1. Failure of a site 2. Loss of messages 3. Failure of a communication link 4. Network partition
  • 56. Scenario: • Blue (1) sends to Blue (2) “lets attack tomorrow at dawn” later, • Blue (2) sends confirmation to Blue (1) “splendid idea, see you at dawn” but, • Blue (1) realizes that Blue (2) does not know if the message arrived, • So, Blue (1) sends to Blue (2) “message arrived, battle set” then, Blue (2) realizes that Blue(1) does not know if the message arrived etc. • The two blue armies can never be sure because of the unreliable communication. • No certain agreement can be reached using this method. • Transaction : Sequence of actions treated as an atomic action to preserve consistency (e.g. access to a database). • Commit a transaction: Unconditional guarantee that the transaction will complete successfully (even in the presence of failures). • Abort a transaction : Unconditional guarantee to back out of a transaction, i e., that all the effects of the transaction have been removed. Events that may cause aborting a transaction are deadlocks, timeouts, protection violation etc. • Mechanisms that facilitate backing out of an aborting transaction are Write-ahead-log protocol and shadow pages. • Commit protocol ensure that all the sites either commit or abort transaction unanimously, even in the presence of multiple and repetitive failures.
  • 57. Two-Phase Commit Protocol • The two-phase commit protocol is a distributed algorithm that lets all sites in a distributed system agree on whether to commit a transaction. The protocol results in either all nodes committing the transaction or all nodes aborting it, even in the case of site failures and message losses.
  • 58. The two phases of the algorithm are: 1. The COMMIT-REQUEST phase, where the COORDINATOR attempts to prepare all the COHORTS, and 2. The COMMIT phase, where the COORDINATOR completes the transaction at all COHORTS.
  • 59. • Basic algorithm • During phase 1, the coordinator initially sends a query-to-commit message to all cohorts and then waits for every cohort to report back with an agreement message. • The cohorts, if the transaction was successful, write an entry to the undo log and an entry to the redo log. • Then the cohorts reply with an agree message, or with an abort message if the transaction failed at a cohort node. • During phase 2, if the coordinator receives an agree message from all cohorts, it writes a commit record into its log and sends a commit message to all the cohorts. • If not all agreement messages come back, the coordinator sends an abort message. • Next, the coordinator waits for the acknowledgements from the cohorts. • When acks are received from all cohorts, the coordinator writes a complete record to its log. • Note that the coordinator will wait indefinitely for all the acknowledgements to come back. • If a cohort receives a commit message, it releases all the locks and resources held during the transaction and sends an acknowledgement to the coordinator. • If the message is abort, the cohort undoes the transaction with the undo log, releases the resources and locks held during the transaction, and then sends an acknowledgement.
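A minimal, single-process sketch of this message flow is given below. The Coordinator and Cohort classes and their method names (vote, do_commit, do_abort) are illustrative only, not a real DBMS API; timeouts, message loss, and crash recovery are deliberately omitted, so it only traces the happy-path order of the two phases.

```python
# Minimal sketch of two-phase commit; names are illustrative, not a real API.

class Cohort:
    def __init__(self, name, will_succeed=True):
        self.name = name
        self.will_succeed = will_succeed
        self.log = []                       # stands in for the undo/redo log

    def vote(self):
        # Phase 1: on the query-to-commit, write undo/redo entries and vote.
        if self.will_succeed:
            self.log.append("undo/redo written")
            return "AGREE"
        return "ABORT"

    def do_commit(self):
        # Phase 2: release locks/resources and acknowledge.
        self.log.append("commit")
        return "ACK"

    def do_abort(self):
        # Phase 2: undo using the undo log, release resources, acknowledge.
        self.log.append("abort (undone)")
        return "ACK"


class Coordinator:
    def __init__(self, cohorts):
        self.cohorts = cohorts
        self.log = []

    def run(self):
        # Phase 1 (COMMIT-REQUEST): gather a vote from every cohort.
        votes = [c.vote() for c in self.cohorts]
        decision = "COMMIT" if all(v == "AGREE" for v in votes) else "ABORT"
        self.log.append(decision)

        # Phase 2 (COMMIT): broadcast the decision and wait for all ACKs.
        acks = [c.do_commit() if decision == "COMMIT" else c.do_abort()
                for c in self.cohorts]
        if all(a == "ACK" for a in acks):
            self.log.append("complete")
        return decision


if __name__ == "__main__":
    # All cohorts succeed -> global COMMIT.
    print(Coordinator([Cohort("site1"), Cohort("site2")]).run())
    # One cohort fails -> global ABORT at every site.
    print(Coordinator([Cohort("site1"), Cohort("site2", will_succeed=False)]).run())
```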
  • 60. Three-Phase Commit Protocol • The Three-Phase Commit (3PC) protocol is non-blocking for site failures, except in the event of failure of all sites. • Communication failures can still result in different sites reaching different decisions, thereby violating the atomicity of global transactions. • It introduces a third phase, called pre-commit, between voting and the global decision. • On receiving all votes from the participants, the coordinator sends a global pre-commit message. A participant that receives the global pre-commit knows that all other participants have voted commit and that, in time, the participant itself will definitely commit.
  • 61. Basic 3PC protocol: Phase 1: • The coordinator sends VOTE_REQ to all participants. • When a participant receives VOTE_REQ, it responds with YES or NO, depending on its vote. If a participant votes NO, it decides abort and stops. Phase 2: • The coordinator collects all votes. If any vote was NO, the coordinator decides abort, sends ABORT to all participants that voted YES, and stops. Otherwise, the coordinator sends PRE_COMMIT messages to all participants. • A participant that voted YES waits for a PRE_COMMIT or ABORT message from the coordinator. If it receives a PRE_COMMIT, it responds with an ACK message. Phase 3: • The coordinator collects the ACKs. When they have all been received, it decides commit, sends COMMIT to all participants, and stops. • A participant waits for a COMMIT from the coordinator. When it receives that message, it decides commit and stops.
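The following is a minimal sketch of these three coordinator-side phases. The participant objects are stubs built from plain callables (vote, pre_commit, abort, commit are hypothetical names), and all timeouts and failure handling are omitted, so this only traces the normal message order, not the full non-blocking behaviour.

```python
# Minimal sketch of the 3PC coordinator's three phases; participants are stubs.

def three_phase_commit(participants):
    # Phase 1: voting - send VOTE_REQ and collect YES/NO from every participant.
    votes = {name: p["vote"]() for name, p in participants.items()}
    if any(v == "NO" for v in votes.values()):
        for name, v in votes.items():
            if v == "YES":
                participants[name]["abort"]()   # ABORT only to the YES voters
        return "ABORT"

    # Phase 2: pre-commit - every participant learns that all votes were YES.
    acks = {name: p["pre_commit"]() for name, p in participants.items()}

    # Phase 3: global commit once every ACK has been received.
    if all(a == "ACK" for a in acks.values()):
        for p in participants.values():
            p["commit"]()
        return "COMMIT"


if __name__ == "__main__":
    def make_participant(vote):
        return {
            "vote": lambda: vote,
            "pre_commit": lambda: "ACK",
            "abort": lambda: None,
            "commit": lambda: None,
        }

    print(three_phase_commit({"site1": make_participant("YES"),
                              "site2": make_participant("YES")}))   # COMMIT
    print(three_phase_commit({"site1": make_participant("YES"),
                              "site2": make_participant("NO")}))    # ABORT
```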
  • 62. Concurrency Control and its Problems • In distributed database systems, the database is typically used by many users. These systems usually allow multiple transactions to run concurrently, i.e. at the same time. Concurrency control is the activity of coordinating concurrent accesses to a database in a multiuser database management system (DBMS).
  • 63. Main Objectives of Distributed Concurrency Control 1. It must be resilient to site and communication failures. 2. It must permit parallel execution of transactions. 3. Its storage mechanism and computational method should be modest, to minimize overhead. 4. It should introduce little communication delay. 5. It should place few constraints on the structure of the atomic actions of transactions.
  • 64. Concurrency Control Anomalies • Concurrency control is the coordination of simultaneous transaction execution in a multiprocessing database system. • Lack of concurrency control can create data integrity and consistency problems, such as:
  • 65. 1. Lost updates 2. Uncommitted data 3. Inconsistent retrievals
  • 66. • Lost Update Problem (W-W Conflict) • The problem occurs when two different database transactions perform read/write operations on the same database items in an interleaved manner (i.e., concurrent execution), making the values of those items incorrect and hence the database inconsistent. • Consider the following schedule, where two transactions TX and TY operate on the same account A, whose balance is $300.
  • 67. • At time t1, transaction TX reads the value of account A, i.e., $300 (read only). • At time t2, transaction TX deducts $50 from account A, giving $250 (deducted locally, not yet written). • Alternately, at time t3, transaction TY reads the value of account A, which is still $300 because TX has not written its update yet. • At time t4, transaction TY adds $100 to account A, giving $400 (added locally, not yet written). • At time t6, transaction TX writes its value of account A, so A is updated to $250, as TY has not written its value yet. • Similarly, at time t7, transaction TY writes its value of account A, i.e., the $400 computed at time t4. The value written by TX is lost, i.e., the $250 update is overwritten. • Hence the data becomes incorrect, and the database is left inconsistent.
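The schedule above can be replayed in a few lines of code. The snippet below is a single-threaded simulation of that interleaving, assuming the same account A with an initial balance of $300; each transaction keeps its result in a local variable between its read and its write, exactly as in the timeline.

```python
# Replay of the lost-update schedule (t1..t7) for account A starting at $300.

account_a = 300

tx_local = account_a        # t1: TX reads A -> 300
ty_local = account_a        # t3: TY reads A -> 300 (TX has not written yet)

tx_local -= 50              # t2: TX deducts $50 locally -> 250
ty_local += 100             # t4: TY adds $100 locally -> 400

account_a = tx_local        # t6: TX writes A -> 250
account_a = ty_local        # t7: TY writes A -> 400, overwriting TX's update

print(account_a)            # 400, though 350 was expected after both transactions
```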
  • 68. Unrepeatable Read Problem (W-R Conflict) • It is also known as the Inconsistent Retrievals Problem, which occurs when, within a single transaction, two different values are read for the same database item. • For example, consider two transactions, TX and TY, performing read/write operations on account A, which has an available balance of $300. The schedule is shown below:
  • 70. • At time t1, transaction TX reads the value from account A, i.e., $300. • At time t2, transaction TY reads the value from account A, i.e., $300. • At time t3, transaction TY updates the value of account A by adding $100 to the available balance, which then becomes $400. • At time t4, transaction TY writes the updated value, i.e., $400. • After that, at time t5, transaction TX reads the available value of account A, which is now read as $400. • This means that within the same transaction TX, two different values of account A are read: $300 initially, and $400 after the update made by transaction TY. This is an unrepeatable read and is therefore known as the unrepeatable read problem.
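As with the lost update, this interleaving can be replayed directly. The snippet below is a single-threaded simulation assuming account A starts at $300; TX reads the same item twice and observes two different values because TY writes an update in between.

```python
# Replay of the unrepeatable-read schedule (t1..t5) for account A starting at $300.

account_a = 300

first_read = account_a          # t1: TX reads A -> 300

ty_local = account_a + 100      # t2-t3: TY reads A and adds $100 locally
account_a = ty_local            # t4: TY writes A -> 400

second_read = account_a         # t5: TX re-reads A -> 400

print(first_read, second_read)  # 300 400 -- the read is not repeatable
```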