ENGINEERING COLLEGES
2016 – 17 Odd Semester
IMPORTANT QUESTIONS & ANSWERS
Common to CSE & IT
SUBJECT CODE: CS6302
SUBJECT NAME: Database Management Systems
Regulation: 2013; Semester: 03; Year: II
ANNA UNIVERSITY, CHENNAI-25
SYLLABUS
REGULATION 2013
CS6302 DATABASE MANAGEMENT SYSTEMS L T P C 3 0 0 3
UNIT I - INTRODUCTION TO DBMS 10
File Systems Organization - Sequential, Pointer, Indexed, Direct - Purpose of
Database System- Database System Terminologies-Database characteristics- Data
models – Types of data models – Components of DBMS- Relational Algebra.
LOGICAL DATABASE DESIGN: Relational DBMS - Codd's Rule - Entity-
Relationship model - Extended ER Normalization – Functional Dependencies,
Anomaly- 1NF to 5NF- Domain Key Normal Form – Denormalization.
UNIT II - SQL & QUERY OPTIMIZATION 8
SQL Standards - Data types - Database Objects- DDL-DML-DCL-TCL-Embedded
SQL-Static Vs Dynamic SQL - QUERY OPTIMIZATION: Query Processing and
Optimization - Heuristics and Cost Estimates in Query Optimization.
UNIT III - TRANSACTION PROCESSING AND CONCURRENCY CONTROL 8
Introduction - Properties of Transaction - Serializability - Concurrency Control -
Locking Mechanisms - Two Phase Commit Protocol - Deadlock.
UNIT IV - TRENDS IN DATABASE TECHNOLOGY 10
Overview of Physical Storage Media – Magnetic Disks – RAID – Tertiary storage –
File Organization – Organization of Records in Files – Indexing and Hashing –
Ordered Indices – B+ tree Index Files – B tree Index Files – Static Hashing –
Dynamic Hashing - Introduction to Distributed Databases- Client server technology-
Multidimensional and Parallel databases- Spatial and multimedia databases- Mobile
and web databases- Data Warehouse-Mining- Data marts.
UNIT V - ADVANCED TOPICS 9
DATABASE SECURITY: Data Classification-Threats and risks – Database access
Control – Types of Privileges –Cryptography- Statistical Databases- Distributed
Databases-Architecture-Transaction Processing-Data Warehousing and Mining-
Classification-Association rules-Clustering-Information Retrieval- Relevance ranking-
Crawling and Indexing the Web- Object Oriented Databases-XML Databases.
TOTAL: 45 PERIODS
TEXT BOOK:
1. Ramez Elmasri and Shamkant B. Navathe, "Fundamentals of Database Systems", Fifth Edition, Pearson Education, 2008.
REFERENCES:
1. Abraham Silberschatz, Henry F. Korth and S. Sudharshan, "Database System Concepts", Sixth Edition, Tata McGraw Hill, 2011.
2. C.J. Date, A. Kannan and S. Swamynathan, "An Introduction to Database Systems", Eighth Edition, Pearson Education, 2006.
3. Atul Kahate, "Introduction to Database Management Systems", Pearson Education, New Delhi, 2006.
4. Alexis Leon and Mathews Leon, "Database Management Systems", Vikas Publishing House Private Limited, New Delhi, 2003.
5. Raghu Ramakrishnan, "Database Management Systems", Fourth Edition, Tata McGraw Hill, 2010.
6. G.K. Gupta, "Database Management Systems", Tata McGraw Hill, 2011.
7. Rob Cornell, "Database Systems Design and Implementation", Cengage Learning, 2011.
TABLE OF CONTENTS

a. Aim, Objectives and Outcomes of the Subject
b. Detailed Lesson Plan

UNIT I – Introduction to DBMS
c. Part A
d. Part B
1. Characteristics of Database
2. DBMS and Database Users
3. Database System Structure
4. Data Models
5. DBMS Architecture
6. ER Model
7. Functional Dependencies
e. Part C
8. Normalization
9. Relational Algebra

UNIT II – SQL and Query Optimization
f. Part A
g. Part B
10. SQL Fundamentals
11. SQL Commands
12. SQL Query Structure
13. Query Processing
14. Query Optimization
h. Part C
15. Cost-Based Optimization
16. SQL Queries Exercises

UNIT III – Transaction Processing and Concurrency Control
i. Part A
j. Part B
17. Transaction Concept and States
18. Serializability
19. Commit Protocols
20. Concurrency Control
21. Lock Conversion
k. Part C
22. Deadlock
23. Deadlock Detection and Recovery

UNIT IV – Trends in Database Technology
l. Part A
m. Part B
24. RAID Technology
25. Indexing
26. Hashing
27. B Trees
28. Spatial and Multimedia Databases
n. Part C
29. Mobile and Web Databases
30. Distributed Databases

UNIT V – Advanced Topics
o. Part A
p. Part B
31. Distributed Transactions
32. Database Security
33. K-Means Algorithm
34. Classification and Clustering
35. Object-Oriented Databases
36. Access Control
37. Threats and Risks
38. Data Warehousing and Data Mining
39. Distributed Architectures
q. Part C
40. XML Databases
41. Cryptography and Statistical Databases
r. Industrial and Practical Connectivity of the Subject
s. Question Bank
AIM, OBJECTIVES & OUTCOMES OF THE SUBJECT
AIM:
To expose the students to the fundamental and advanced concepts of database systems.
OBJECTIVES:
- To expose the students to the fundamentals of Database Management Systems.
- To make the students understand the relational model and database design using normalization.
- To familiarize the students with ER diagrams.
- To expose the students to SQL and enable them to write SQL queries.
- To make the students understand the fundamentals of transaction processing and query processing.
- To familiarize the students with the different types of databases.
- To make the students understand the security issues in databases.
OUTCOMES:
At the end of the course, the student should be able to:
- Design databases for applications.
- Use the relational model and ER diagrams.
- Apply concurrency control and recovery mechanisms to practical problems.
- Design the query processor and transaction processor.
- Apply security concepts to databases.
DETAILED LESSON PLAN
Text Book:
TB1. Ramez Elmasri and Shamkant B. Navathe, "Fundamentals of Database Systems", Fifth Edition, Pearson Education, 2008.
Reference Books:
RB1. Abraham Silberschatz, Henry F. Korth and S. Sudharshan, "Database System Concepts", Sixth Edition, Tata McGraw Hill, 2011.
RB2. C.J. Date, A. Kannan and S. Swamynathan, "An Introduction to Database Systems", Eighth Edition, Pearson Education, 2006.
RB3. Alexis Leon and Mathews Leon, "Database Management Systems", Vikas Publishing House Private Limited, New Delhi, 2003.
RB4. Raghu Ramakrishnan, "Database Management Systems", Fourth Edition, Tata McGraw Hill, 2010.
Sl.No | Unit | Topic / Portions to be Covered | Hours Required / Planned | Cumulative Hrs | Books Referred

UNIT I – INTRODUCTION TO DBMS
1 | I | File Systems Organization – Sequential, Pointer, Indexed, Direct | 1 | 1 | TB1, RB1
2 | I | Purpose of Database System – Database System Terminologies | 1 | 2 | TB1, RB1
3 | I | Database characteristics – Data models – Types of data models | 1 | 3 | TB1, RB1
4 | I | Components of DBMS – Relational Algebra | 2 | 5 | TB1, RB1
5 | I | LOGICAL DATABASE DESIGN: Entity-Relationship model | 1 | 6 | TB1, RB1, RB2
6 | I | Extended ER | 1 | 7 | TB1, RB1, RB2
7 | I | Functional Dependencies, Anomaly – 1NF to 5NF | 2 | 9 | TB1, RB1, RB2
8 | I | Domain Key Normal Form – Denormalization | 1 | 10 | TB1, RB1

UNIT II – SQL AND QUERY OPTIMIZATION
9 | II | SQL Standards – Data types | 2 | 12 | TB1, RB1
10 | II | Database Objects – DDL, DML, DCL, TCL | 2 | 14 | TB1, RB1, RB3
11 | II | Embedded SQL – Static vs Dynamic SQL | 2 | 16 | TB1, RB1
12 | II | QUERY OPTIMIZATION: Query Processing and Optimization | 1 | 17 | TB1, RB1
13 | II | Heuristics and Cost Estimates in Query Optimization | 1 | 18 | TB1, RB1

UNIT III – TRANSACTION PROCESSING AND CONCURRENCY CONTROL
14 | III | Introduction | 1 | 19 | TB1, RB1
15 | III | Properties of Transaction | 1 | 20 | TB1, RB1
16 | III | Serializability | 1 | 21 | TB1, RB1
17 | III | Concurrency Control | 1 | 22 | TB1, RB1
18 | III | Locking Mechanisms | 2 | 24 | TB1, RB1
19 | III | Two Phase Commit Protocol | 1 | 25 | TB1, RB1
20 | III | Deadlock | 1 | 26 | TB1, RB1

UNIT IV – TRENDS IN DATABASE TECHNOLOGY
21 | IV | RAID – Tertiary storage – File Organization – Organization of Records in Files | 1 | 27 | TB1, RB1
22 | IV | Indexing and Hashing – Ordered Indices | 1 | 28 | TB1, RB1
23 | IV | B+ tree Index Files – B tree Index Files | 1 | 29 | TB1, RB1
24 | IV | Static Hashing – Dynamic Hashing | 1 | 30 | TB1, RB1, RB2
25 | IV | Introduction to Distributed Databases – Client server technology | 1 | 31 | TB1, RB1
26 | IV | Multidimensional and Parallel databases – Spatial and multimedia databases | 2 | 33 | TB1, RB1, RB4
27 | IV | Mobile and web databases – Data Warehouse – Mining – Data marts | 1 | 34 | TB1, RB1, RB4

UNIT V – ADVANCED TOPICS
28 | V | DATABASE SECURITY: Data Classification – Threats and risks | 1 | 37 | TB1, RB1
29 | V | Database access Control – Types of Privileges | 1 | 38 | TB1, RB1
30 | V | Cryptography – Statistical Databases | 1 | 39 | TB1, RB1
31 | V | Distributed Databases – Architecture | 1 | 40 | TB1, RB1
32 | V | Transaction Processing – Data Warehousing and Mining – Classification | 2 | 42 | TB1, RB1
33 | V | Association rules – Clustering – Information Retrieval | 1 | 43 | TB1, RB1
34 | V | Relevance ranking – Crawling and Indexing the Web | 1 | 44 | TB1, RB1
35 | V | Object Oriented Databases – XML Databases | 1 | 45 | TB1, RB1
UNIT I
INTRODUCTION TO DBMS
File Systems Organization - Sequential, Pointer, Indexed and Direct - Purpose of
Database System - Database System Terminologies-Database characteristics- Data
models – Types of data models – Components of DBMS - Relational Algebra.
LOGICAL DATABASE DESIGN: Relational DBMS - Codd's Rule – Entity -
Relationship model - Extended ER Normalization – Functional Dependencies,
Anomaly- 1NF to 5NF- Domain Key Normal Form – Denormalization.
PART - A
1. Write the characteristics that distinguish the database approach from the file-based approach. Apr / May 2015
(or) List any two advantages of database systems.
Database characteristics (or) advantages of a database system:
Controlled Data redundancy and Consistency
Self Describing Nature (of Database System)
Data isolation or Abstraction
Integrity
Atomicity
Concurrent Access or sharing of Data
Security
Support for multiple views of Data
2. Define Functional Dependency. Apr / May 2015
A functional dependency is a constraint between two sets of attributes of a database. Let R be the relational schema R = {A1, A2, ..., An}. A functional dependency, denoted X → Y, means that X functionally determines Y in R iff whenever two tuples of r(R) agree on their X value, they must necessarily agree on their Y value.
For any two tuples t1 and t2 in r, if t1[X] = t2[X] then t1[Y] = t2[Y]; i.e., the values of the Y component of a tuple in r depend on, and are determined by, the values of the X component.
3. What are the disadvantages of file processing system? May / June 2016
Data redundancy and inconsistency
Structure of file is embedded in programs.
Difficult to enforce standards
Difficulty in sharing and concurrent access.
4. Explain Entity Relationship model. May / June 2016
ER model maps real world on to conceptual schema. It consists of a collection
of basic objects called entities and of relationships among these objects. It represents
the overall logical structure of a database. It is a semantic model.
5. Define DBMS.
What is the purpose of Database Management System? Nov /Dec 2014
A DBMS is a collection of programs that enables users to create and maintain a database. It is a general-purpose software system that facilitates the processes of defining, constructing and manipulating databases for various applications.
Defining involves specifying data types, structures and constraints for the data to be
stored in the database.
Constructing is the process of storing the database itself on some storage medium
that is controlled by DBMS.
Manipulating includes functions such as querying the database to retrieve specific
data, updating database to reflect changes and generating reports from the data. Eg.
Oracle, Ms-access, Sybase, Informix, Foxpro
6. Why is 4NF more desirable than BCNF? Nov/Dec 2014
A relation must already be in 3NF to take it to BCNF, and it must be in 3NF and BCNF to reach 4NF.
• In fourth normal form there are no multi-valued dependencies in the tables, but in BCNF there can still be multi-valued dependencies in the tables.
7. State the anomalies of 1NF. Nov/ Dec 2015
Redundancies in 1NF relations lead to a variety of data anomalies. Data anomalies are divided into three general categories: insertion, deletion and update anomalies.
8. Is it possible for several attributes to have the same domain? Illustrate your
answer with suitable examples. Nov/ Dec 2015
Yes. A domain is the set of values that an attribute of a tuple can take; it is similar to a datatype. Therefore multiple attributes can have the same domain.
Eg. Name varchar2(25) and Place varchar2(25)
9. List five responsibilities of DB manager or DBA. May/June 2007
Database Administrator: (any five)
- Administers the primary resource (the database) and secondary resources (the DBMS and related software).
- Schema definition: the DBA creates the original database schema by executing a set of data definition statements in the DDL.
- Storage structure and access method definition.
- Schema and physical organization modification.
- Granting of authorization for data access.
- Coordinating, monitoring and acquiring hardware and software resources.
- Monitoring security problems.
- Routine maintenance: backing up the database periodically, checking disk space availability, monitoring jobs and ensuring performance.
10. Define Data independence.
Differentiate physical and logical data independence. Nov / Dec 2003
Data Independence:
The capacity to change the schema at one level of database system without having to
change the schema at the next higher level.
Types:
Logical Data independence:
The capacity to change the conceptual schema without having to change
the External schema.
Physical Data Independence:
The capacity to change the physical schema without having to change
the Conceptual schema.
11. Define Data model. List the various types of data models.
Data model is a collection of concepts that can be used to describe the
structure of a database (ie., Data, Relationships, Datatypes and constraints)
Categories of Data model:
High level or Conceptual: Close to users

Low level or Physical: how data is stored in computer

Representational or Implementational : Concepts understood by users and not
too far from the way data is organized Eg. Network, Hierarchical Model.


12. What do you mean by a View?
A view is a virtual table based on the result set of an SQL statement. Its definition is stored as a database object, but its rows are generated from the defining query rather than stored as base data. A view contains rows and columns, just like a real table, and represents an external schema.
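A minimal SQL sketch of a view (the student table and its columns are assumed purely for illustration):

    CREATE VIEW cse_students AS
    SELECT regno, name, cgpa
    FROM student
    WHERE branch = 'CSE';

    -- The view is queried like a table; its rows come from the query above.
    SELECT name FROM cse_students WHERE cgpa > 8;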
13. Give the various levels of abstraction in DBMS architecture.
The three-schema architecture. There are three levels of abstraction in the DBMS architecture:
- Internal or physical level
- Conceptual level
- External or view level
14. Differentiate super key, candidate key, primary key and foreign key.
May / June 2009
Superkey: a superkey S of R is a set of attributes with the property that no two distinct tuples t1 and t2 in r(R) have t1[S] = t2[S]. Any set of attributes that can identify every row of a table uniquely is a superkey.
Candidate Key: a candidate key K is a minimal superkey: removing any attribute Ai from K leaves a set K - {Ai} that is no longer a superkey. Eg. {SSN, Ename} is a superkey and {SSN} is a candidate key.
Primary Key: the candidate key chosen to identify each row of a table in a unique manner.
Foreign Key: a field (or collection of fields) in one table that refers to the primary key of another table. The foreign key is defined in the referencing (child) table and refers to the primary key of the referenced (parent) table.
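A hedged SQL sketch of these key notions (table and column names are assumed for illustration):

    CREATE TABLE department (
        dno   INT PRIMARY KEY,        -- primary key: the chosen candidate key
        dname VARCHAR(30) UNIQUE      -- another candidate key of the relation
    );

    CREATE TABLE employee (
        ssn   CHAR(9) PRIMARY KEY,    -- {ssn} is a candidate key; {ssn, ename} is only a superkey
        ename VARCHAR(30),
        dno   INT REFERENCES department(dno)  -- foreign key to the parent table
    );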
15. Define normalization and denormalization. What is the need for normalization? April / May 2004
Normalization is the process of analyzing the given relational schemas, based on their FDs and primary keys, to achieve the desirable properties of minimizing redundancy and minimizing insertion, deletion and update anomalies.
The process of storing the join of higher normal form relations as a base relation in a lower normal form is known as denormalization.
INTRODUCTION TO DATABASE SYSTEMS
Database System Terminologies:
a) Data:
Data are known facts that can be recorded and that have implicit meaning.
b) Database:
A database is a collection of related data with some inherent meaning. It is designed, built and populated with data for a specific purpose, and represents some aspect of the real world.
c) Database Management System:
A DBMS is a collection of programs that enables users to create and maintain a database. It is a general-purpose software system that facilitates the processes of defining, constructing and manipulating databases for various applications.
Defining involves specifying data types, structures and constraints for the data to be
stored in the database
Constructing is the process of storing the database itself on some storage medium
that is controlled by DBMS.
Manipulating includes functions such as querying the database to retrieve specific
data, updating database to reflect changes and generating reports from the data.
Eg. Oracle, Ms-access, Sybase, Informix, Foxpro
d) Database System: the database and the DBMS together are known as a database system.
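A brief hedged SQL sketch of the three activities above (the student table and its columns are assumed purely for illustration):

    -- Defining: specify data types, structure and constraints.
    CREATE TABLE student (
        regno INT PRIMARY KEY,
        name  VARCHAR(30),
        cgpa  DECIMAL(3,1)
    );

    -- Constructing: store the data itself on the storage medium.
    INSERT INTO student VALUES (101, 'Anitha', 8.7);

    -- Manipulating: query, update and report.
    SELECT name FROM student WHERE cgpa > 8;
    UPDATE student SET cgpa = 9.0 WHERE regno = 101;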
[Fig 1.1 Database System Environment: users and programmers submit application programs and queries; the DBMS software processes the queries and programs and accesses both the stored database and its stored definition (metadata).]
Applications of Database Systems:
- Banking – customer information, accounts, loans and transactions
- Airlines – reservation and schedule information
- Universities – student, course, registration and grades
- Credit card transactions – purchases on cards and reports
- Telecommunications – records of calls, reports, balances, etc.
- Finance – holdings, sales, stocks and bond purchases
- Sales – customers, products, purchases
- Manufacturing – inventory items, orders, purchases
- Human resources – employee details, salary, tax, etc.
1A. CHARACTERISTICS OF DATABASE SYSTEM
1A. Explain in detail about database system and its characteristics. Also
compare Database systems with file processing system. Nov / Dec 2015
Database System vs File System (or) Characteristics of the Database approach (or) Disadvantages of the File system:
Introduction:
File system:
A file system is a method for storing and organizing computer files and the data they contain, to make it easy to find and access them. In traditional file processing, each user defines and implements the files needed for a specific application as part of programming that application. The file processing system is supported by the operating system; it stores permanent records in various files, and different application programs are needed to extract records from, and add records to, the appropriate files.
Database System: Define DB, DBS and DBMS.
Database characteristics (or) purpose of database system:
Controlled Data redundancy and Consistency
Self Describing Nature (of Database System)
Data isolation or Abstraction
Integrity,
Atomicity
Concurrent Access or sharing of Data
Security
Support for multiple views of Data
Enforcing standards.
(i) Data redundancy and Consistency:
In a file system, each user maintains separate files and programs to manipulate them, because each requires some data not available from other users' files. This redundancy in defining and storing data results in:
- wasted storage space,
- redundant efforts to maintain common updates,
- higher storage and access costs, and
- inconsistency of data, i.e., the various copies of the same data may not agree.
In the database approach, a single repository of data is maintained; it is defined once and then accessed by various users. Thus redundancy is controlled and the data remains consistent.
(ii) Self Describing Nature of Database System
In File System, the structure of the data file is embedded in the access
programs.
A database system contains not only the database itself but also a complete
definition or description of database structure and constraints. This definition is
stored in System catalog which contains information such as structure of each file,
type and storage format of each data item and various constraints on the data.
Information stored in the catalog is called Meta-Data.
What is the purpose of Meta data? April / May 2004
Metadata is "data that provides information about other data". Two types of
metadata exist: structural metadata and descriptive metadata. Structural metadata is
data about the containers of data. Descriptive metadata uses individual instances of
application data or the data content.
A main purpose of metadata is to facilitate the discovery of relevant information, often classified as resource discovery. Metadata also helps
organize electronic resources, provide digital identification, and helps support
archiving and preservation of the resource.
DBMS is not written for specific applications, hence it must refer to catalog to
know structure of file etc., and hence it can work equally well with any number of
database applications.
(iii) Data Isolation or Abstraction
Conventional File processing systems do not allow data to be retrieved in
convenient and efficient manner. More responsive data retrieval systems are required
for general use.
The structure of the data file is embedded in the access programs. So any
changes to the structure of a file may require changing all programs that access this
file. Because data are scattered in various files and files may be in different formats,
writing new application programs to retrieve appropriate data is difficult.
But the DBMS access programs do not require such changes in most cases.
The structure of data files is stored in DBMS catalog separately from the access
programs. This property is known as program data independence.
Operations are separately specified and can be changed without affecting the
interface. User application programs can operate on data by invoking these
operations regardless of how they are implemented. This property is known as
program operation independence. Program data independence and program operation
independence are together known as data independence.
(iv) Enforcing Integrity Constraints:
The data values must satisfy certain types of consistency constraints.
In File System, developers enforce constraints by adding appropriate code in
application program. When new constraints are added, it is difficult to change the
programs to enforce them.
In data base system, DBMS provide capabilities for defining and enforcing
constraints. The constraints are maintained in system catalog. Therefore application
programs work independently with addition or modification of constraints. Hence
integrity problems are avoided.
(v) Atomicity:
A computer system is subject to failure. If a failure occurs, the data must be restored to the consistent state that existed prior to the failure. Transactions must be atomic: each must happen in its entirety or not at all.
It is difficult to ensure atomicity in a file processing system.
In DB approach, the DBMS ensures atomicity using the Transaction manager
inbuilt in it. DBMS supports online transaction processing and recovery techniques
to maintain atomicity.
(vi) Concurrent Access or sharing of Data:
When multiple users update the data simultaneously, it may result in
inconsistent data. The system must maintain supervision which is difficult because
data may be accessed by many different application programs that may have not
been coordinated previously.
The DBMS includes concurrency-control software to ensure that several programs/users trying to update the same data do so in a controlled manner, so that the result of the updates is correct.
(vii) Security:
Not every user of the database system should be able to access all of the data. But since application programs are added to the system in an ad hoc manner, enforcing such security constraints is difficult in a file system.
The DBMS provides a security and authorization subsystem, which the DBA uses to create accounts and to specify account restrictions.
(viii) Support for multiple views of Data:
The database approach supports multiple views of data. A database has many users, each of whom may require a different view of the database. A view may be a subset of the database, or virtual data retrieved from the database that is not explicitly stored. The DBMS provides multiple views of the data.
In file systems, different application programs are to be written for different
views of data.
(ix) Enforcing Standards:
Since the DBMS is a central system, standards can easily be enforced: at company, department, national or international level. Standardized data is very helpful during migration or interchange of data. A file system consists of independent applications, so standards cannot easily be enforced across them.
1B. DBMS AND DATABASE USERS
1 B. Write about the pros and cons of DBMS. Write a note on various
categories of Database users.
Explain the role and functions of DBA (6) Apr / May 2008
Advantages and Disadvantages of DBMS:
Advantages of DBMS:
- Controlled redundancy
- Restricting unauthorized access
- Persistent storage for program objects and data structures
- Multiple user interfaces
- Representation of complex relationships among data
- Enforcement of integrity constraints and security
- Backup and recovery.
Disadvantages of DBMS:
Overhead cost of using a DBMS, due to:
- high initial investment in hardware and training,
- the generality that a DBMS provides for defining and processing data,
- overhead for security, concurrency control, recovery and integrity functions.
Database Users:
Database Administrator:
- Administers the primary resource (the database) and secondary resources (the DBMS and related software).
- Schema definition: the DBA creates the original database schema by executing a set of data definition statements in the DDL.
- Storage structure and access method definition.
- Schema and physical organization modification.
- Granting of authorization for data access.
- Coordinating, monitoring and acquiring hardware and software resources.
- Monitoring security problems.
- Routine maintenance: backing up the database periodically, checking disk space availability, monitoring jobs and ensuring performance.
Database designers: identify the data to be stored and choose appropriate structures to represent and store the data.
End Users: people who access the database for querying, updating and generating reports.
- Casual end users: write application programs and choose tools (e.g. 4GLs) to develop interfaces, using special programming languages.
- Parametric or naive end users: unsophisticated users who use previously written application programs; they constantly query and update the database using standard types of queries and updates known as canned transactions.
- Sophisticated end users: interact without writing programs; they form requests in a database query language and submit them to the query processor, which breaks DML statements down into instructions that the storage manager understands. OLAP and data mining tools are used.
- Specialized users: sophisticated users who write specialized database applications, e.g. CAD and expert systems.
- Stand-alone end users: maintain personal databases using ready-made packages that provide easy-to-use menu- and graphics-based interfaces.
System Analysts: determine the requirements of end users and develop specifications.
Application Programmers: implement those specifications as programs.
DATABASE SYSTEM STRUCTURE
Illustrate with neat diagram the database system architecture and its
components. (16) Nov / Dec 2015
Briefly Explain Database system architecture. (16) May / June 2016
Define Database, DBMS and DBS.
A database system is partitioned into modules:
(i) Storage Manager (ii) Query Processor
(i) Storage Manager:
A database requires a large amount of storage space (disk). The database system structures the data so as to minimize the need to move data between disk and main memory. The storage manager:
- provides an interface between the low-level data stored in the database and the application programs and queries,
- interacts with the file manager,
- translates DML statements into file-system (low-level) commands, and
- stores, retrieves and updates data in the database.
Components of the storage manager:
- Authorization and integrity manager: tests for the satisfaction of integrity constraints and checks the authority of users to access data.
- Transaction manager: ensures that the database remains in a consistent state despite system failures, and that concurrent transaction executions proceed without conflict.
Transaction management: a transaction is a collection of operations that performs a single logical function or unit of work in a database application.
Properties of a transaction: atomicity, consistency, isolation and durability.
The transaction-management component ensures atomicity and durability.
Failure recovery: detects system failures and restores the database to the state that existed prior to the occurrence of the failure.
The concurrency-control manager controls the interaction among concurrent transactions to ensure consistency.
- File manager: manages the allocation of space on disk and the data structures used to represent the stored information.
- Buffer manager: fetches data from disk into main memory and decides what to cache in main memory.
Data structures maintained by the storage manager:
- Data files: store the database itself.
- Data dictionary: stores metadata about the structure of the database.
- Indices: provide fast access to data items that hold particular values.
(ii) Query Processor:
Simplifies and facilitates access to data, hides the physical details of implementation, and provides quick processing of updates and queries.
Components of the Query Processor:
- DDL interpreter: interprets DDL statements and records the definitions in the data dictionary.
- DML compiler: translates the DML statements in a query into a query evaluation plan consisting of low-level instructions that the query evaluation engine can understand and evaluate. A query can be translated into a number of evaluation plans; the DML compiler also performs query optimization, which picks the lowest-cost evaluation plan among the alternatives.
- Query evaluation engine: executes the low-level instructions generated by the DML compiler.
[Fig 1.2 Database System Structure]
3 A. DATA MODELS
3A. Explain the different types of data models. Nov / Dec 2014
Definitions:
a) Data Model:
A data model is a collection of concepts that can be used to describe the structure of a database (i.e., data, relationships, data types and constraints). A data model represents the logical structure of a database and determines how data can be stored, organized and manipulated.
b) Schema:
The complete definition and description of the database is known as the database schema. Each object in the schema is known as a schema construct. The schema is also known as the intension.
c) Database State:
The data in the database at a particular moment in time is called a database state or snapshot, also known as the extension of the database schema.
The DBMS stores the description of the schema constructs and constraints (metadata) in the DBMS catalog.
Categories of Data model:
- High level or Conceptual model: close to users.
- Low level or Physical model: how data is stored in the computer.
- Representational or Implementational model: concepts understood by users yet not too far from the way data is organized, e.g. the Network and Hierarchical models.
Types of Data model:
1. Record-based models: Relational model, Network model, Hierarchical model
2. Entity-Relationship model
3. Object-Oriented model
Record-Based Models:
A record-based data model is used to specify the overall logical structure of the database. Each record type defines a fixed number of fields, each having a fixed length.
Relational Model:
In the relational model, data is organized in two-dimensional tables called relations.
Network Model:
In the network model, the entities are organized in a graph, in which some entities can be accessed through several paths.
Hierarchical Model:
The hierarchical data model organizes data in a tree structure.
Entity Relationship Model:
The ER model maps the real world onto a conceptual schema. It consists of a collection of basic objects called entities and of relationships among these objects. It represents the overall logical structure of a database. It is a semantic model.
Object Oriented Model:
The object-oriented model is a logical organization of real-world objects (entities), the constraints on them, and the relationships among the objects.
3 B. DBMS ARCHITECTURE
3B. Explain in detail about the levels of abstraction in DBMS. June 16
(or) Explain the DBMS architecture or three-schema architecture.
The three-schema architecture helps achieve the database characteristics described earlier. It separates the user applications from the physical database. Schemas can be defined at three levels:
Internal Level:
- Has an internal schema, which describes the physical storage structure of the database.
- Describes how the data are actually stored.
- Uses a physical data model.
- Describes the complete details of data storage and the access paths for the database.
Conceptual Level:
- Has a conceptual schema, which describes the structure of the whole database.
- Describes what data are stored and what relationships exist among the data.
- Uses a high-level or implementational data model.
- Hides the details of physical storage structures and describes data types, relationships, operations and constraints.
External or View Level:
- Includes a number of external schemas or views.
- Each external schema describes the part of the database relevant to a particular user group and hides the rest.
- Uses a high-level or implementational data model.
Each user refers only to his or her own schema. The DBMS must transform a request specified on an external schema into a request against the conceptual schema, and then into a request on the internal schema for processing over the stored database. The process of transforming requests and results between levels is called mapping.
[Fig 1.3 DBMS architecture]
Data Independence:
The capacity to change the schema at one level of database system
without having to change the schema at the next higher level.
Logical Data independence:
The capacity to change the conceptual schema without having to change
the External schema.
Physical Data Independence:
The capacity to change the physical schema without having to change
the Conceptual schema.
Database Application Architectures:
Client: the machine on which remote database users work.
Server: the machine on which the database system runs.
Two-tier Architecture:
The application is partitioned into a component that resides at the client machine, which invokes database-system functionality at the server machine through query-language statements. E.g., APIs such as ODBC and JDBC are used for the interaction.
Three-tier Architecture:
The client merely acts as a front end and does not contain any direct database calls. The client communicates with an application server, usually through a forms interface, and the application server (which holds the business logic) communicates with the database system.
[Fig 1.4 Two-Tier and Three-Tier Architecture: in two-tier, clients talk to the database server over the network; in three-tier, clients talk to an application server, which in turn talks to the database system.]
4 A. ER MODEL – ENTITY –RELATIONSHIP MODEL
4A. Explain the concepts of ER model in detail.
ER model maps real world on to conceptual schema. It consists of a collection
of basic objects called entities and of relationships among these objects. It represents
the overall logical structure of a database. It is a semantic model.
Definitions of Key terms:
1. Entity :
It is an object with physical or conceptual existence. It is a thing with an
independent existence. An entity has a set of properties, and the values for the
properties may uniquely identify an entity. Eg. Company, job, Table.
2. Attributes:
Particular properties that describe an entity, e.g. an Employee may be described by empno, name, salary, etc. Each entity has its own value for each attribute.
(i) Simple vs Composite Attributes:
Attributes that are not divisible are called simple or atomic attributes, e.g. Age. Composite attributes can be divided into smaller subparts that represent more basic attributes with independent meanings, e.g. Address: door number, street, city, state, pin.
(ii) Single-valued vs Multi-valued Attributes:
An attribute that has a single value for a particular entity is called a single-valued attribute, e.g. Age. An attribute that can have several values for the same entity is a multi-valued attribute, e.g. the degrees of a person.
(iii) Stored vs Derived Attributes:
An attribute whose value can be derived from other attributes is a derived attribute, e.g. Age can be derived from DOB and the current date; DOB is the corresponding stored attribute.
(iv) Complex Attributes:
Attributes that are both multi-valued and composite, e.g. Address → permanent and residential.
(v) Key Attribute:
An entity type may have an attribute whose values are distinct for each individual entity in the collection. Such an attribute is a key attribute, and its values can be used to identify each entity uniquely, e.g. the register number attribute of a Student entity.
(vi) Null Attribute:
An attribute takes a null value when an entity does not have a value for it (the value is unknown, missing or not applicable).
3. Value: each entity has a value for each of its attributes.
4. Entity Type:
Defines a collection of entities that have the same attributes. It is described by its name and its attributes, and describes the schema (intension).
5. Entity Set:
The collection of all entities of a particular entity type in the database at any point in time is called the entity set (extension).
6. Domain:
For each attribute of an entity type there is a permitted set of values, called the domain or value set of that attribute, e.g. Age (16-25). Formally, an attribute A of entity type E with value set V is a function A: E → P(V).
7. Weak Entity vs Strong Entity:
An entity type which has no key of its own is a weak entity type. Entities of a weak entity type are identified by being related to specific entities of another entity type, called the identifying or owner entity type. An entity type which has a key of its own is a strong entity type.
8. Relationship: a connection among entities.
9. Relationship type (R): defines a set of associations among entity types.
10. Relationship set:
The set or collection of relationship instances ri among the entities of the entity types involved in the relationship type. R is a mathematical relation on E1, E2, ..., En, i.e., a subset of the Cartesian product E1 × E2 × ... × En. Each entity type E is said to participate in the relationship type R, and each individual entity e is said to participate in a relationship instance ri.
11. Degree of a relationship: the number of participating entity types.
Binary: e.g. WORKS_FOR (Employee – works for – Company)
Ternary: e.g. SUPPLY (Supplier – supply – Customer – Product)
12. Role names:
The role that a participating entity of an entity type plays in a relationship instance; it helps explain what the relationship means.
The same entity type can participate in a relationship more than once, in different roles (a recursive relationship); the role names distinguish the meaning of each participation.
13. Relationship attribute or descriptive attribute:
An attribute of the relationship itself, e.g. join date on the WORKS_FOR relationship between Employee and Company.
14. Constraints:
Cardinalities: the number of relationship instances that an entity can participate in.
For a binary relationship between A and B:
1. One to one: an entity in A is associated with at most one entity in B, and an entity in B is associated with at most one entity in A. E.g. Student – has – ID card.
2. One to many: an entity in A is associated with many entities in B, and an entity in B is associated with at most one entity in A. E.g. Classroom – accommodates – Students.
3. Many to one: an entity in A is associated with at most one entity in B, and an entity in B is associated with many entities in A. E.g. Workers – work for – Department.
4. Many to many: an entity in A is associated with many entities in B, and an entity in B is associated with many entities in A.
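A hedged SQL sketch of how these cardinalities are commonly realized with keys (all table and column names are assumed for illustration):

    -- One to many: each student is accommodated by at most one classroom,
    -- so the foreign key sits on the "many" side.
    CREATE TABLE classroom (
        room_no INT PRIMARY KEY
    );
    CREATE TABLE student (
        regno   INT PRIMARY KEY,
        room_no INT REFERENCES classroom(room_no)
    );

    -- One to one: a UNIQUE foreign key restricts each student to one ID card.
    CREATE TABLE id_card (
        card_no INT PRIMARY KEY,
        regno   INT UNIQUE REFERENCES student(regno)
    );

    -- Many to many: a separate relationship table with a composite key.
    CREATE TABLE enrolls (
        regno     INT REFERENCES student(regno),
        course_id INT,
        PRIMARY KEY (regno, course_id)
    );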
15. Participation constraint:
Specifies whether the existence of an entity depends on its being related to
another entity via relationship type.
Total: the participation of an entity set E in R is total if every entity in E participates in at least one relationship instance in R. E.g., a weak entity type must have total participation in its identifying relationship. A weak entity type has a partial key, a set of attributes that uniquely identifies the weak entities related to the same owner entity.
Partial: if only some entities in E participate in R, the participation of E in R is said to be partial.
4B. ER DIAGRAM - NOTATIONS AND GUIDELINES
4B Explain the notations and design guidelines in ER diagram with an
example. Also explain Extended ER features.
ER Diagram:
An ER diagram depicts the full representation of the conceptual model and expresses the overall structure of a database graphically. ER diagrams are simple and clear.
Components (notation legend):
- Entity type or set: rectangle
- Weak entity type or set: double rectangle
- Relationship type or set: diamond labelled R
- Identifying relationship: double diamond labelled R
- Generalization or specialization: triangle labelled IS A
- Attribute: oval, e.g. Name
- Key attribute: oval with the name underlined, e.g. Regno
- Multivalued attribute: double oval
- Composite attribute: oval connected to component ovals
- Derived attribute: dashed oval
- One to one: 1 – R – 1 on the connecting lines
- One to many: 1 – R – m
- Many to one: m – R – 1
- Many to many: m – R – n
- Total participation of an entity set E in R: double line between R and E
- Cardinality limits: a range such as 1..h on the line between R and E
Naming Conventions:
Design Guideline 1 – Proper Naming of Schema Constructs
Entity type – singular names (Nouns) in upper case letters
Relationship type – Verb in Upper case letters.
Attribute – Capitalized
Role name – lowercase letters.
Design Guideline 2 - Remove the Redundancy
It is important to have least possible redundancy when we design the
conceptual schema of a database.
If some redundancy is desired at the storage level or at the user level then it
may be introduced later.
Design Guideline 3:
A concept may first be modelled as an attribute and then refined into a relationship when it is determined that the attribute is a reference to another entity type.
Design Guideline 4:
An attribute that exists in several entity types may be promoted to its own independent entity type.
Design Guideline 5:
An inverse refinement to the previous case may also be applied.
Design Phases:
The ER model gives much flexibility in designing a database schema to model a given enterprise. A high-level data model serves the database designer by providing a conceptual framework in which to specify, in a systematic fashion, the data requirements of the database users and how the database will be structured to fulfill these requirements.
The initial phase of database design is to characterize fully the data needs of the prospective database users. Next, the designer chooses a data model and translates these requirements into a conceptual schema of the database, which provides a detailed view. A fully developed conceptual schema will also indicate the functional requirements of the enterprise: in a specification of functional requirements, users describe the kinds of operations that will be performed on the data.
The process of moving from an abstract data model to the implementation of the database proceeds in two final design phases. In the logical design phase, the designer maps the high-level conceptual schema onto the implementation data model of the database system. The resulting system-specific database schema is then used in the physical design phase, in which the physical features of the database are specified.
Extended ER features:
Generalization: a bottom-up process in which several lower-level entity sets that share common features are synthesized into a higher-level entity set (superclass). The commonality is expressed by the generalization relationship between the lower-level and higher-level entity sets.
Constraints: disjoint and overlapping.
Disjoint: an entity belongs to no more than one lower-level entity set, e.g. Account → Savings or Checking.
Overlapping: the same entity may belong to more than one lower-level entity set within a single generalization.
Total generalization or specialization: each higher-level entity must belong to a lower-level entity set.
Partial generalization or specialization: some higher-level entities may not belong to any lower-level entity set.
Specialization: an entity set may include subgroupings of entities that are distinct in some way from other entities in the set. The top-down process of designating such subgroupings within an entity set is called specialization.
INTRODUCTION TO RELATIONAL MODEL
The relational model was introduced by Ted Codd. It defines a database as a collection of relations. A relation is a table of values, and each row in the table represents a collection of related data values. Formally, a relation is a subset of the Cartesian product of domains: R ⊆ D1 × D2 × ... × Dn.
Table – relation; row – tuple; column header – attribute.
- A relational schema R(A1, A2, ..., An) is made up of a relation name R and a list of attributes A1, ..., An.
- Each attribute Ai is the role played by some domain D in R; D is called the domain of Ai, denoted dom(Ai). The domain of an attribute is a set of atomic values.
- R is the name of the table or relation.
- The degree of a relation is the number of attributes (n) of its relational schema.
- A relation state r of the relational schema R(A1, ..., An), denoted r(R), is a set of n-tuples r = {t1, t2, ..., tm}, where r(R) ⊆ dom(A1) × dom(A2) × ... × dom(An).
- Each n-tuple t is an ordered list of n values t = <v1, v2, ..., vn>, where each vi, 1 ≤ i ≤ n, is an element of dom(Ai) or null.
Cardinality: the number of values in a domain D, i.e., |D|.
Characteristics of relations:
1) Ordering of tuples in a relation:
A relation is defined as a set of tuples, so the tuples in a relation do not have any particular order (even though an n-tuple itself is an ordered list of n values). A tuple variable is a variable that stands for a tuple.
2) Ordering of values within a tuple:
Under an alternative definition, a relational schema R = {A1, A2, ..., An} is a set of attributes, and a relation state r(R) is a finite set of mappings r = {t1, t2, ..., tm}, where each ti is a mapping from R to D, D being the union of the domains of the attributes, with t[Ai] ∈ dom(Ai). Under the basic definition, however, the attributes and the values within a tuple are ordered.
3) Values in tuples:
Each value in a tuple is atomic; composite and multivalued attribute values are not allowed.
4) Interpretation of a relation:
A relation can be interpreted as a predicate (as in predicate calculus); the tuples are interpreted as the values that satisfy the predicate.
Note:
- Relational schema of degree n: R(A1, A2, ..., An)
- Relation: r = {t1, t2, ..., tm}
- Tuple in r(R): t = <v1, v2, ..., vn>
- t[Ai] or t.Ai = the value vi in t for attribute Ai
Relational Database schema : Logical design of a database
Database instance: Snapshot of the data in the database at given instance of time.
Relational schema: It corresponds to the programming language notion of type
definition.
Relational instance:
It corresponds to programming language notion of a value of a variable. The
value of a given variable may change with time; similarly the contents of relation
instance may change with time, as a relation is updated.
Schema diagram:
A schema diagram depicts a database schema along with its primary key and foreign key dependencies. Referential integrity constraints are displayed by drawing a directed arc from each foreign key to the relation it references (or to the primary key of the referenced relation).
Department (parent table): Dno | Dname | Location | DSsn – Dno is the primary key.
Employee (child table): Ssn | Name | Address | Dno – Dno is the foreign key referencing Department(Dno).
Relational Constraints:
Restrictions on data that can be specified on a relational database schema in
the form of constraints.
1) Domain Constraints:
The value of each attribute A must be an atomic value from the domain dom(A).
2) Key Constraints:
No two tuples can have the same combination of values for all their attributes.
Super key: a subset of attributes SK of R such that for any two distinct tuples t1 and t2 in r(R), t1[SK] ≠ t2[SK]; i.e., no two distinct tuples have the same values on all attributes in SK. Any such set SK is called a superkey of R.
Candidate key: a minimal superkey, i.e., removing any attribute A from K leaves a set of attributes K' that is not a superkey of R.
Primary key: the candidate key whose values are used to identify tuples uniquely in a relation.
3) Entity Integrity Constraint:
The primary key value cannot be null.
4) Referential Integrity Constraint:
A referential integrity constraint is specified between two relations and is used to maintain consistency among the tuples of the two relations. It states that a tuple in one relation that refers to another relation must refer to an existing tuple in that relation. The referencing attribute of the referencing relation is known as the foreign key.
Conditions:
- The foreign key FK of relation R1 must have the same domain as the primary key attribute PK of relation R2; FK is said to reference R2, and t1[FK] = t2[PK].
- R1 is called the referencing relation and R2 the referenced relation; R1 may be known as the child table and R2 as the parent table.
- The value of FK in a tuple t1 of R1 either occurs as a value of PK in some tuple t2 of R2 or is null. A foreign key can refer to its own relation.
- The child table can be created only after the creation of the parent table.
- The foreign key can take only values assigned to the primary key attribute of the parent table.
- A tuple cannot be deleted from a parent table while it is referenced by a child table; first delete the referencing (child) tuples, then the parent tuple.
Note: The ON DELETE CASCADE option can be used to delete the child rows automatically while deleting the parent. It is not possible to delete or drop the parent table while it is being referenced by a child table.
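A minimal SQL sketch of these rules (table and column names are assumed for illustration):

    -- Parent table; entity integrity: dno, the primary key, cannot be null.
    CREATE TABLE department (
        dno   INT PRIMARY KEY,
        dname VARCHAR(30)
    );

    -- Child table; dno must match an existing department(dno) or be null.
    CREATE TABLE employee (
        ssn  CHAR(9) PRIMARY KEY,
        name VARCHAR(30),
        dno  INT,
        FOREIGN KEY (dno) REFERENCES department(dno)
            ON DELETE CASCADE  -- deleting a department also deletes its employees
    );

    -- This insert fails if no department with dno = 50 exists:
    -- INSERT INTO employee VALUES ('123456789', 'Kumar', 50);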
Mapping ER and EER Schemas into the Relational Model – steps of the algorithm:
STEP 1: Map entity types – regular entity types map directly to tables.
STEP 2: Map weak entity types – draw the identifier from the owner entity type into the weak entity type and map it directly to a table.
STEPS 3-5: Map relationship types:
- 1:1 – choose one of the relations (the one with total participation, if it exists) and insert into it the primary key of the other relation as a foreign key.
- 1:N – the many side receives the primary key of the one side as a foreign key; no new relation is needed.
- M:N – set up a separate relation for the relationship, with the primary keys of both participating relations.
STEP 6: Map multivalued attributes – set up a new relation for each multivalued attribute, using the primary key of the entity plus the value of the multivalued attribute.
STEP 7: Map higher-order relationships (ternary, 4-way, etc.) – each higher-order relationship becomes a separate relation.
STEP 8: Map generalization hierarchies and set/subset relationships – use the primary key of the superclass plus the attributes of the subclass.
STEP 9: Map union types – introduce a surrogate key.
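As a hedged illustration of steps 5 and 6 on the usual COMPANY example (the employee and project tables are assumed to exist; all names are for illustration only):

    -- M:N (step 5): the WORKS_ON relationship becomes its own table holding
    -- the primary keys of both participants plus the descriptive attribute hours.
    CREATE TABLE works_on (
        essn    CHAR(9) REFERENCES employee(ssn),
        pnumber INT     REFERENCES project(pnumber),
        hours   DECIMAL(4,1),
        PRIMARY KEY (essn, pnumber)
    );

    -- Multivalued attribute (step 6): an employee's degrees move to a separate
    -- relation keyed by the owner's primary key plus the value itself.
    CREATE TABLE emp_degree (
        essn   CHAR(9) REFERENCES employee(ssn),
        degree VARCHAR(20),
        PRIMARY KEY (essn, degree)
    );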
Relational Schema - Company Schema
LOGICAL DATABASE DESIGN
Relational Database Design:
Each relational database schema consists of a number of relational schemas, and each relational schema consists of a number of attributes. Attributes are grouped together to form a relational schema, typically by mapping a conceptual schema in the ER model into a relational schema. We also need a formal measure of why one grouping of attributes into a relational schema is better than another.
There are two levels at which the goodness of relational schemas can be judged:
- Logical or conceptual level: how users interpret the relations and the meanings of the data (applies to views and base relations).
- Implementation and storage level: how the tuples in the relations are stored and updated.
Approaches:
(1) Bottom-up (design by synthesis): start from the basic relationships among individual attributes and build up the relations. A large number of binary attribute relationships must be identified.
(2) Top-down (design by analysis): start from a number of groupings of attributes into relations obtained from the conceptual design and mapping activities, then analyze and refine them.
Informal Design Guidelines for relational schemas:
Semantics of attributes
Reducing the redundant values in tuples.
Reducing the null values in tuples.
Disallowing the generation of spurious tuples.
Semantics of the attributes: The semantics specifies how to interpret the
attribute values stored in the tuple.ie., how attribute values in a tuple are related to
one another.
Guideline : Clear , Easy to Explain and single ET and RT. Design a relational
schema so that it is easy to explain its meaning. Do not combine attributes from
multiple entity types and relationship types.
(2) Redundant information:
Redundancy wastes storage space and leads to update anomalies (insertion, deletion or modification anomalies).
Guideline: design schemas that have no anomalies.
(3) Reducing Null Values:
If many of the attributes do not apply to all tuples, then many null values exist. Disadvantages of null values:
Storage space is wasted.
The meaning of the attribute becomes hard to understand.
Join operations are hard to specify.
Aggregate functions are hard to account for.
When do we use null values?
When the attribute does not apply to the tuple.
When the attribute value is unknown.
When the value is known but not recorded.
Guideline: avoid placing in a relation attributes whose values will frequently be null.
(4) Spurious Tuple:
Additional tuples that appear in the result of a natural join but represent wrong, invalid information are called spurious tuples.
Guideline: design relational schemas so that they can be joined with equality conditions on attributes that are either primary keys or foreign keys; this guarantees that no spurious tuples are generated.
FUNCTIONAL DEPENDENCIES
Explain in detail about functional dependencies.
Functional Dependencies:
A functional dependency is a constraint between two sets of attributes from
the database.
Let R be the relational schema R = {A1, A2, …, An}. A functional dependency, denoted by X → Y, between two sets of attributes X and Y that are subsets of R specifies a constraint on the possible tuples that can form a relation state r of R.
X → Y reads: X functionally determines Y, (or) there is a functional dependency from X to Y, (or) Y is functionally dependent on X.
In X → Y, X is the L.H.S of the F.D. and Y is the R.H.S of the F.D.
Definition 1:
X functionally determines Y in a relational schema R iff whenever two tuples of r(R) agree on their X value, they must necessarily agree on their Y value. An F.D. is a property of the semantics or meaning of attributes.
Definition 2:
For any two tuples t1 and t2 in r, if t1[X] = t2[X], we must have t1[Y] = t2[Y]; i.e., the values of the Y component of a tuple in r depend on, and are determined by, the values of the X component.
Eg. SSN → ENAME, i.e., SSN uniquely determines ENAME.
PNUMBER → {PNAME, PLOCATION}
{SSN, PNUMBER} → HOURS
Relation states r(R) that satisfy the functional dependency constraints are called legal extensions or legal relation states.
Diagrammatic Notation:
An arrow is drawn from the L.H.S to the R.H.S of an f.d.:
SSN → ENAME
DNO → {DNAME, MGRSSN}
Inference rules:
Let F be the set of f.d.s specified on R. Numerous other functional dependencies that hold whenever F holds can be inferred from it.
The set of all such dependencies inferred from F is called the closure of F, denoted by F+.
Given F:
SSN → {ENAME, BDATE, ADDRESS, DNO}
DNO → {DNAME, MGRSSN}
Inferred f.d.s (in F+):
SSN → {DNAME, MGRSSN}
SSN → ENAME
An F.D. X → Y is inferred from a set of dependencies F specified on R if X → Y holds in every relation state r(R) that satisfies F:
F ⊨ X → Y, i.e., X → Y is inferred from F.
IR1 (Reflexive): If X ⊇ Y, then X → Y
IR2 (Augmentation): {X → Y} ⊨ XZ → YZ
IR3 (Transitive): {X → Y, Y → Z} ⊨ X → Z
IR4 (Decomposition): {X → YZ} ⊨ X → Y
IR5 (Union): {X → Y, X → Z} ⊨ X → YZ
IR6 (Pseudotransitive): {X → Y, WY → Z} ⊨ WX → Z
Closure of X under F:
For each set of attributes X that appears as the L.H.S. of an f.d. in F, determine the set of all attributes dependent on X. Thus for each such X we determine the set X+ of attributes that are functionally determined by X based on F.
X+ is called the closure of X under F.
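For example, using the set F given above, F = {SSN → {ENAME, BDATE, ADDRESS, DNO}, DNO → {DNAME, MGRSSN}}:
{SSN}+ = {SSN, ENAME, BDATE, ADDRESS, DNO, DNAME, MGRSSN}
{DNO}+ = {DNO, DNAME, MGRSSN}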
Equivalence sets of f.d.s
A set of F.D.s E is covered by a set of f.d.s F (or F covers E) if every F.D. in E can be inferred from F (i.e., every F.D. in E is in F+).
Two sets of F.D.s E and F are said to be equivalent if E+ = F+, i.e., every F.D. in E can be inferred from F and vice versa (E covers F and F covers E).
Minimal set of functional dependency:
A set of functional dependencies F is said to be minimal if:
Every dependency in F has a single attribute on its R.H.S.
We cannot replace any f.d. X → A in F with an f.d. Y → A, where Y is a proper subset of X, and still have a set of functional dependencies equivalent to F.
We cannot remove any dependency from F and still have a set of dependencies equivalent to F.
A minimal set is a standard or canonical form with no redundancies.
Minimal Cover:
A minimal cover of a set of f.d.s F is a minimal set of dependencies Fmin that is equivalent to F.
Trivial and Nontrivial Functional Dependency
A functional dependency X → Y is trivial if Y ⊆ X; otherwise it is non-trivial.
Part - C
6. NORMALIZATION
What are Normal forms? Explain the types of Normal forms with an
example. (16) Nov / Dec 2014
State the need for Normalization of a database and explain the various normal forms (1st, 2nd, 3rd, BCNF, 4th, 5th and Domain-Key) with suitable examples. (16) Apr / May 2015
Normal Forms:
Normalization was proposed by Codd (1972).
It takes the relational schema through a series of tests to certify whether it satisfies a certain normal form.
It proceeds in a top-down fashion – design by analysis.
Codd defined 1NF, 2NF, 3NF and a stronger definition of 3NF known as BCNF (Boyce-Codd Normal Form).
All these normal forms are based on the f.d.s among the attributes of a relation.
Normalization:
It is the process of analyzing the given relational schemas based on their FDs and primary keys to achieve the desirable properties of minimizing redundancy and minimizing the insertion, deletion and update anomalies.
Relational schemas that do not satisfy the normal form tests are decomposed into smaller relational schemas that meet the tests and possess the desirable properties.
Normalization procedure provides:
Framework for analyzing relational schema based on keys and f.ds
A series of normal form tests is carried out on each relational schema so that the database is normalized to the desired degree.
There are some predefined rules set for normalization; these rules are called normal forms. The normal form of a relation is the highest normal form condition that it meets, and hence indicates the degree to which it has been normalized.
Normalisation through decomposition must meet:
1) Lossless join or non-additive join property: the spurious tuples problem does not occur with respect to the schemas after decomposition.
2) Dependency preservation property: each f.d. is represented in some individual relation after decomposition.
The process of storing the join of higher normal form relations as a base relation in
lower normal form is known as denormalization.
First Normal Form:
Statement of First normal form:
The domain of an attribute must include only atomic (simple, indivisible) values, and the value of an attribute in a tuple must be a single value from the domain of that attribute.
1NF disallows a set of values or a tuple of values as an attribute value, i.e., it disallows a relation within a relation, or relations as attributes (thus it disallows composite and multivalued attributes).
Eg. The DEPARTMENT relational schema, whose primary key is DNUMBER. Consider the DLOCATIONS attribute: each department can have a number of locations.
DEPARTMENT(DNAME, DNUMBER, DMGRSSN, DLOCATIONS) (NOT IN 1NF)
1) The domain of DLOCATIONS contains atomic values, but some tuples can
have a set of these values. Therefore DLOCATIONS is not functionally dependent
on DNUMBER.
DNAME DNUMBER DMGRSSN DLOCATIONS
Research 5 334 Bellaire
Research 5 334 Alaska
Research 5 334 Newyork
2) The domain of DLOCATIONS contains sets of values and hence is non-atomic, but DNUMBER → DLOCATIONS exists.
DNAME DNUMBER DMGRSSN DLOCATIONS
Research 5 334 {Bellaire,Alaska,Newyork}
Normalisation to achieve 1NF:
Three ways to achieve first normal form
Remove the attribute DLOCATIONS and place it in a separate relation along
with the primary key DNUMBER of the department.
DEPARTMENT(DNAME, DNUMBER, DMGRSSN) and DEPT_LOCATIONS(DNUMBER, DLOCATIONS)
Expand the key so that there will be a separate tuple for each location of
department. But it introduces redundancy in relation.
If maximum number of values is known for DLOCATIONS eg 3., Replace
DLOCATIONS by 3 atomic attributes LOC1, LOC2, LOC3. It introduces
more null values in the relation.
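A minimal SQL sketch of the first option, moving DLOCATIONS into a separate relation (the data types are assumptions for illustration):
CREATE TABLE DEPARTMENT (dnumber number(4) PRIMARY KEY,
    dname varchar2(25), dmgrssn number(9));
CREATE TABLE DEPT_LOCATIONS (dnumber number(4) REFERENCES DEPARTMENT(dnumber),
    dlocation varchar2(25),
    PRIMARY KEY (dnumber, dlocation)); -- one row per department location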
The 1 NF also disallows multivalued attributes that are themselves composite.
These are called nested relations.
Test for First Normal Form:
Relation should not have non-atomic attributes or nested relation.
Remedy
Form new relations for each non-atomic attribute or nested relation.
Second Normal Form:
2NF is based on Full Functional Dependency.
Full Functional Dependency:
An FD X → Y is a full functional dependency if removal of any attribute A from X means that the dependency no longer holds, i.e., for any attribute A ∈ X, (X − {A}) does not functionally determine Y. Eg. {SSN, PNUMBER} → HOURS
Partial Functional dependency:
An FD X → Y is a partial dependency if some attribute A ∈ X can be removed from X and the dependency still holds, i.e., for some A ∈ X, (X − {A}) → Y. Eg. {SSN, PNUMBER} → ENAME
Statement of Second normal form:
A relational schema R is in 2NF if every nonprime attribute A is fully
functionally dependent on the primary key of R.
If a relational schema R is not in 2NF, it can be normalized into a number of 2NF relations in which non-prime attributes are associated only with the part of the primary key on which they are fully functionally dependent. Eg.,
(SSN, PNUMBER, HOURS, ENAME, PNAME, PLOCATION) (NOT IN 2NF)
FD1: {SSN, PNUMBER} → HOURS
FD2: SSN → ENAME
FD3: PNUMBER → {PNAME, PLOCATION}
ENAME, PNAME and PLOCATION violate 2NF: FD2 and FD3 are partial dependencies on the primary key.
Normalisation (Decomposition) to achieve 2NF:
(SSN, ENAME) with FD2: SSN → ENAME
(SSN, PNUMBER, HOURS) with FD1: {SSN, PNUMBER} → HOURS
(PNUMBER, PNAME, PLOCATION) with FD3: PNUMBER → {PNAME, PLOCATION}
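A minimal SQL sketch of this decomposition (the table names EMP1, EMP2, EMP3 and the data types are assumptions for illustration):
CREATE TABLE EMP1 (ssn number(9) PRIMARY KEY, ename varchar2(25)); -- FD2
CREATE TABLE EMP3 (pnumber number(4) PRIMARY KEY,
    pname varchar2(25), plocation varchar2(25)); -- FD3
CREATE TABLE EMP2 (ssn number(9) REFERENCES EMP1(ssn),
    pnumber number(4) REFERENCES EMP3(pnumber),
    hours number(3), PRIMARY KEY (ssn, pnumber)); -- FD1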
Test for Second Normal Form
For relations where primary key contains multiple attributes, non-key or non-
prime attribute should not be functionally dependent on a part of primary key.
Remedy:
Decompose and set up a new relation for each partial key with its dependent attribute(s). Make sure to keep a relation with the original primary key and any attributes that are fully functionally dependent on it.
Third Normal Form:
It is based on Transitive dependency.
Transitive Dependency:
A functional dependency X → Y in R is a transitive dependency if there is a set of attributes Z (that is neither a candidate key nor a subset of any key of R) such that both X → Z and Z → Y hold.
Statement of Third Normal Form:
According to Codd's definition, a relational schema R is in 3NF if and only if it satisfies 2NF and every nonprime attribute is non-transitively dependent on the primary key.
General Definition of 3NF:
A relational schema R is in 3NF if, whenever a nontrivial functional dependency X → A holds in R, either a) X is a superkey of R, or b) A is a prime attribute of R.
This definition states that a table is in 3NF if and only if, for each of its functional dependencies X → A, at least one of the following conditions holds:
X contains A (that is, X → A is a trivial functional dependency), or
X is a superkey, or
A is a prime attribute (i.e., A is contained within a candidate key).
Eg., EMPLOYEE(ENAME, SSN, BDATE, ADDRESS, DNO, DNAME, DMGRSSN) is decomposed into
(ENAME, SSN, BDATE, ADDRESS, DNO) and (DNO, DNAME, DMGRSSN).
SSN → DMGRSSN is transitive through DNO.
Test for Third Normal Form:
A relation should not have a non-key attribute functionally determined by another non-key attribute, i.e., there should be no transitive dependency of a non-key attribute on the primary key.
Remedy:
Decompose and set up a relation that includes the non-key attribute(s) that functionally determine other non-key attribute(s).
Boyce-Codd Normal Form (BCNF):
BCNF is stricter than 3NF: every relation in BCNF is also in 3NF. A relation R is in BCNF if, whenever a non-trivial dependency X → A holds in R, X is a superkey of R. The condition of 3NF that allows A to be prime is absent from BCNF.
A relation is in BCNF if and only if every determinant is a candidate key.
Identify all candidate keys in the relations
Identify all functional dependencies in the relations.
If there are functional dependencies in the relation, where their determinants are not
candidate keys for the relation, remove the functional dependencies by placing them
in a new relation along with a copy of their determinant.
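As a brief illustration (an assumed example, not drawn from the relations above): in TEACH(Student, Course, Instructor) with FDs {Student, Course} → Instructor and Instructor → Course, the determinant Instructor is not a candidate key, so TEACH is not in BCNF; placing the offending dependency in a new relation gives R1(Instructor, Course) and R2(Student, Instructor).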
Differentiate 3 NF and BCNF:
The difference between 3NF and BCNF is that, for a functional dependency A → B, 3NF allows this dependency in a relation if B is a primary-key attribute and A is not a candidate key, whereas BCNF insists that, for this dependency to remain in a relation, A must be a candidate key.
Note:
A non-prime attribute of R is an attribute that does not belong to any
candidate key of R.
MULTIVALUED DEPENDENCY:
A multivalued dependency is a full constraint between two sets of attributes in a relation. Therefore, a multivalued dependency is also referred to as a tuple-generating dependency.
Let R be a relation schema and let α ⊆ R and β ⊆ R. The multivalued dependency α ->> β holds on R if, in any legal relation r(R), for all pairs of tuples t1 and t2 in r such that t1[α] = t2[α], there exist tuples t3 and t4 in r such that
t1[α] = t2[α] = t3[α] = t4[α]
t3[β] = t1[β]
t4[β] = t2[β]
t4[R − β] = t1[R − β]
t3[R − β] = t2[R − β]
A multivalued dependency on R, X ->>Y, says that if two tuples of R agree on
all the attributes of X, then their components in Y may be swapped, and the
result will be two tuples that are also in the relation.
FOURTH NORMAL FORM:
Under fourth normal form, a record type should not contain two or more
independent multi-valued facts about an entity. Consider employees, skills, and
languages, where an employee may have several skills and several languages.
Under fourth normal form, these two relationships should not be represented
in a single record such as
-------------------------------
| EMPLOYEE | SKILL | LANGUAGE |
===============================
Instead, they should be represented in the two records
-------------------- -----------------------
| EMPLOYEE | SKILL | | EMPLOYEE | LANGUAGE |
==================== =======================
FIFTH NORMAL FORM
Fifth normal form (5NF), also known as project-join normal form (PJNF)
is a level of database normalization designed to reduce redundancy in relational
databases recording multi-valued facts by isolating semantically related multiple
relationships. A table is said to be in the 5NF if and only if every non-trivial join
dependency in it is implied by the candidate keys.
A join dependency *{A, B, … Z} on R is implied by the candidate key(s) of R if and
only if each of A, B, …, Z is a superkey for R.
Eg.
-----------------------------
| AGENT | COMPANY | PRODUCT |
|------- +---------+ ---------|
| Smith | Ford | car |
| Smith | GM | truck |
-----------------------------
------------------- --------------------- -----------------
| AGENT | COMPANY | | COMPANY | PRODUCT | | AGENT | PRODUCT |
|------- +---------| |--------- +--------- | |------- +--------- |
| Smith | Ford | | Ford | car | | Smith | car |
| Smith | GM | | Ford | truck | | Smith | truck |
| Jones | Ford | | GM | car | | Jones | car |
------------------- ------------------- -------------------
A record type is in fifth normal form when its information content cannot
be reconstructed from several smaller record types
RELATIONAL ALGEBRA
Explain in detail about relational algebra operators.
Relational algebra is a procedural query language, which consists of basic set
of relational model operations. These operations enable the user to specify basic
retrieval requests. It takes one or more relations as input and result of retrieval is a
relation.
Relational Algebra operations are divided into two groups
Set operations
Fundamental operations (Relation based operations)
Relational Algebra Notations:
Set Operators (Binary): Union (UNION) ∪ ; Intersection (INTERSECTION) ∩ ; Cartesian product X ; Set Difference (MINUS) −
Relation Based Operators: Selection (SELECT) σ ; Projection (PROJECT) π ; Renaming (RENAME) ρ ; Assignment ←
Binary Relation Based Operators: Join (JOIN) ⋈ ; Left outer join (LEFT OUTER JOIN) ⟕ ; Right outer join (RIGHT OUTER JOIN) ⟖ ; Full outer join (FULL OUTER JOIN) ⟗ ; Division (DIVIDE) %
Other Operators: Group or Aggregate (AGGREGATE) ℱ
1. UNARY RELATIONAL ALGEBRA OPERATIONS:
SELECT Operation: (σ)
The SELECT operation is used to select a set of tuples from a relation that satisfy the selection condition. The sigma σ represents the select operator. It is a unary operator because it operates on one relation.
σ<selection condition>(R)
The selection condition is a Boolean expression on the attributes of relation R, represented as
<attribute name> <comparison operator> <constant value> or
<attribute name> <comparison operator> <attribute name>
where the comparison operator is one of {=, ≠, <, ≤, >, ≥}, and the Boolean connectives AND, OR and NOT can also be used.
The selection condition is applied to each tuple in R. All the selected tuples appear in the result of the SELECT operation.
The selection condition cannot involve more than one tuple.
The degree of the relation resulting from SELECT is the same as that of R.
The number of tuples in the resultant relation is always less than or equal to the number of tuples in R, i.e., |σc(R)| ≤ |R|.
The number of tuples selected by a selection condition is referred to as the selectivity of the condition.
A cascade of select operations can be combined into a single select using a conjunctive AND condition.
σ<cond1>(σ<cond2>(σ<cond3>(R))) = σ<cond1> AND <cond2> AND <cond3>(R)
SELECT operation is commutative:
σ<cond1>(σ<cond2>(R)) = σ<cond2>(σ<cond1>(R))
Eg.,
σDNO=4(EMPLOYEE)
σSALARY>30000(EMPLOYEE)
σ(DNO=4 AND SALARY>25000) OR (DNO=5 AND SALARY>30000)(EMPLOYEE)
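In SQL, these selections correspond to WHERE clauses; for instance, the first example above can be written as:
SELECT * FROM EMPLOYEE WHERE DNO = 4;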
PROJECT Operation: (π)
The PROJECT operation is used to select certain columns from the table and discard the other columns. The pi π represents the project operator. It is a unary operator because it operates on one relation.
π<attribute list>(R)
attribute list is a list of attributes from the attributes of relation R.
The result of the PROJECT operation has only the attributes specified in <attribute list>, in the same order as they appear in the list.
The degree of the relation resulting from PROJECT equals the number of attributes in <attribute list>.
If the attribute list contains only non-key attributes of R, duplicate tuples are likely to occur; the PROJECT operation removes any duplicates, so the result is a set of tuples. This is known as duplicate elimination.
The number of tuples in the resultant relation is always less than or equal to the number of tuples in R.
π<list1>(π<list2>(R)) = π<list1>(R), provided <list2> contains the attributes of <list1>.
PROJECT operation is not commutative.
Eg. LNAME, FNAME, SALARY(EMPLOYEE)
(iii) RENAME Operation: (ρ)
The rename operator is used to rename either the relation name or the attribute names or both. The ρ (rho) operator denotes the rename operation. It is a unary operator.
(i) ρS(B1, B2, …, Bn)(R) renames both the relation and its attributes
(ii) ρS(R) renames the relation only
(iii) ρ(B1, B2, …, Bn)(R) renames the attributes only
The attributes are renamed in the same order as in R.
Eg. ρFN, LN, SAL(πFNAME, LNAME, SALARY(σDNO=5(EMPLOYEE)))
2. RELATIONAL ALGEBRA – SET OPERATORS:
Set-theoretic operations are binary operations that are applied to two sets of tuples.
Union Compatibility:
The two relations on which these operations are applied must have the same type of tuples; this condition is called union compatibility.
Two relations R(A1, …, An) and S(B1, …, Bn) are said to be union compatible if they have the same degree n (the same number of attributes) and if dom(Ai) = dom(Bi), i.e., each corresponding pair of attributes has the same domain.
Set operators:
Union (∪), Intersection (∩) and Set Difference (−) are defined on union-compatible relations.
Cartesian Product (×) does not require union compatibility.
Union: The result of the operation, denoted by R ∪ S, is a relation that includes all tuples that are either in R, or in S, or in both R and S. Duplicate tuples are eliminated.
Intersection: The result of the operation, denoted by R ∩ S, is a relation that includes all tuples that are in both R and S.
Both union and intersection are commutative, can be treated as n-ary operations applicable to any number of relations, and are associative:
R ∪ S = S ∪ R and R ∩ S = S ∩ R
R ∪ (S ∪ T) = (R ∪ S) ∪ T ; R ∩ (S ∩ T) = (R ∩ S) ∩ T
(iii) Set Difference: The result of this operation, denoted by R − S, is a relation that includes all tuples that are in R but not in S.
The minus operation is not commutative: R − S ≠ S − R
STUDENT
FN LN
Suresh Rao
Bala Ganesh
Ravi Sankar
Mohan Varma
Sushil Sharma
Rakesh Agarwal

INSTRUCTOR
FN LN
Suresh Rao
Bala Ganesh
Sushil Sharma
Kishore Das
Ashwin Kumar

STUDENT ∪ INSTRUCTOR
FN LN
Suresh Rao
Bala Ganesh
Ravi Sankar
Mohan Varma
Sushil Sharma
Rakesh Agarwal
Kishore Das
Ashwin Kumar

STUDENT ∩ INSTRUCTOR
FN LN
Suresh Rao
Bala Ganesh
Sushil Sharma
INSTRUCTOR − STUDENT
FN LN
Kishore Das
Ashwin Kumar

STUDENT − INSTRUCTOR
FN LN
Ravi Sankar
Mohan Varma
Rakesh Agarwal
(iv) Cartesian Product (or Cross Product) Operation:
It is denoted by X. It is a binary set operation, but the relations on which it is
applied need not be union compatible. This operation is used to combine tuples from
two relations in a combinatorial fashion.
The Cartesian product creates tuples with combined attributes of two
relations.
R(A1, A2, …, An) × S(B1, B2, …, Bm) is a relation Q with degree n + m: Q(A1, …, An, B1, …, Bm).
The resultant relation Q has one tuple for each combination of tuples, one from R and one from S. If R has nR tuples and S has nS tuples, then R × S will have nR * nS tuples.
Applied by itself, the operation can produce meaningless tuples. It is useful when followed by a selection that matches values of attributes coming from the component relations.
FEMALE_EMPS ← σSEX='F'(EMPLOYEE)
EMPNAMES ← πFNAME, LNAME, SSN(FEMALE_EMPS)
EMP_DEPENDENTS ← EMPNAMES × DEPENDENT
3. BINARY RELATIONAL ALGEBRA OPERATIONS
JOIN Operation:
The JOIN operation, denoted by ⋈, is used to combine related tuples from two relations into single tuples. It allows us to process relationships among relations.
R ⋈<join condition> S
The join condition is of the form <condition> AND <condition> AND … AND <condition>, where each condition is of the form Ai θ Bj; Ai is an attribute of R, Bj is an attribute of S, and Ai and Bj must have the same domain. θ is one of the comparison operators {=, <, ≤, >, ≥, ≠}. A join with such a general join condition is called a theta join.
The result of the join is a relation Q with n + m attributes Q(A1, …, An, B1, …, Bm). Q has one combined tuple, made of one tuple from R and one tuple from S, whenever the combination satisfies the join condition.
The join condition specified on attributes from the two relations R and S is evaluated for each combination of tuples. Each tuple combination for which the condition evaluates to true is included in the result.
Equi Join: A join that involves join conditions with equality comparisons only, i.e., only the = operator appears in the join condition, is known as an equi join.
DEPT_MGR ← DEPARTMENT ⋈MGRSSN=SSN EMPLOYEE
Natural Join: denoted by *. An equi join in which the two join attributes have the same name in both relations. The attribute involved in the join condition is known as the join attribute.
DEPT_LOCS ← DEPARTMENT * DEPT_LOCATIONS
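In SQL, the equi join above corresponds to a join condition in the WHERE clause:
SELECT * FROM DEPARTMENT, EMPLOYEE WHERE MGRSSN = SSN;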
Outer Join: In an ordinary JOIN, tuples without a matching tuple, and tuples with null values for the join attribute, are eliminated from the result. Outer joins can be used to keep all the tuples of R, or of S, or of both relations in the result of the join, even if they do not have matching tuples in the other relation. An outer join preserves the tuples whether or not they match on the join condition.
Left Outer Join: keeps every tuple of the left relation R in the result of R ⟕ S even if no matching tuple is found in S; the attributes of S in the join result are filled with null values. It is denoted by ⟕.
To find the names of all employees, along with the names of the departments they manage, we could use:
T1 ← EMPLOYEE ⟕SSN=MGRSSN DEPARTMENT
RESULT ← πFNAME, MINIT, LNAME, DNAME(T1)
Right Outer Join: keeps every tuple of the right relation S in the result of R ⟖ S even if no matching tuple is found in R; the attributes of R in the join result are filled with null values. It is denoted by ⟖.
Full Outer Join: keeps all tuples of both R and S in the result of R ⟗ S, even if no matching tuples are found, filling in null values as needed. It is denoted by ⟗.
The main difference between Cartesian product and join:
In a join, only the combinations of tuples satisfying the join condition appear in the result, whereas in a Cartesian product all combinations of tuples are included in the result.
(ii) Divide operator (%)
The division operation is applied to two relations, R(Z) % S(X), where X ⊆ Z. Let Y = Z − X be the set of attributes of R that are not in S. The result of division is a relation T(Y); for a tuple t to appear in the result T, the values in t must appear in R in combination with every tuple in S.
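As an illustration (the relation names are assumptions for the sketch): if SSN_PNOS(SSN, PNO) records which employee works on which project and SMITH_PNOS(PNO) holds the projects that 'Smith' works on, then SSN_PNOS % SMITH_PNOS yields the SSNs of the employees who work on all the projects that 'Smith' works on.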
4. Aggregate Functions and Grouping
The relational algebra allows us to specify mathematical aggregate functions on collections of values from the database. Common functions applied to collections of numeric values include SUM, AVERAGE, MAXIMUM, and MINIMUM. The COUNT function is used for counting tuples or values.
<grouping attributes> ℱ <function list> (R)
where <grouping attributes> is a list of attributes of the relation specified in R, and
<function list> is a list of (<function> <attribute>) pairs. In each such pair,
<function> is one of the allowed functions— such as SUM, AVERAGE,
MAXIMUM, MINIMUM, COUNT—and <attribute> is an attribute of the relation
specified by R.
The resulting relation has the grouping attributes plus one attribute for each
element in the function list.
1. ℱ MAXIMUM Salary (EMPLOYEE) retrieves the maximum salary value from the Employee relation:
MAXIMUM_Salary
55000
2. Retrieve each department number, the number of employees in the department, and their average salary:
DNO ℱ COUNT SSN, AVERAGE Salary (EMPLOYEE)
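In SQL, query 2 corresponds to a GROUP BY query:
SELECT DNO, COUNT(SSN), AVG(SALARY) FROM EMPLOYEE GROUP BY DNO;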
-o0o-o0o-o0o-
UNIT II
SQL & QUERY OPTIMIZATION
SQL Standards - Data types - Database Objects- DDL-DML-DCL-TCL-Embedded
SQL - Static Vs Dynamic SQL.
QUERY OPTIMIZATION: Query Processing and Optimization - Heuristics and
Cost Estimates in Query Optimization.
PART - A
1. Name the categories of SQL commands. May / June 2016
Categories of SQL commands:
Data Definition Language (DDL) Commands –create, alter, truncate and drop
Data Manipulation Language (DML) Commands – select, insert, delete, update
Transaction Control Language (TCL) Commands – commit, rollback, save point
Data Control Language (DCL) commands – grant and revoke
2. Explain Query Optimization. List the types. May / June 2016
State the need for Query Optimization. Apr / May 2015
Query Optimization:
Query Optimization is the overall process of choosing the most efficient way of
executing a SQL statement. It is the process of choosing a suitable execution strategy
for processing a query among the possible query execution plans. It is the function of
DBMS.
Types of Query Optimization:
Heuristics or rule based Query Optimization - the optimizer chooses execution plans based on heuristically ranked operations.
Cost based Query Optimization - the optimizer examines the possible plans and chooses the execution plan with the lowest estimated cost, based on the usage of resources.
Need for Query Optimization:
To speed up long running queries.
To avoid data locking and corruption.
To reduce the amount of data a query has to process.
3. Differentiate static and dynamic SQL. Nov / Dec 2007, Nov/ Dec 2014,
Nov / Dec 2015
What is the difference between Static and Dynamic SQL? Apr / May 2015
Static SQL:
Static SQL is SQL statements in an application that do not change at runtime.
Static SQL statement is embedded within an application program - form of SQL
statements cannot be changed without making changes to the program
Static SQL statements in a source program must be processed before the program is
compiled and executed.
Dynamic SQL:
Dynamic SQL is SQL statements in an application that change at runtime.
Dynamic SQL statements are constructed at runtime - the SQL statements are prepared and processed within a program while the program is running.
The form of the SQL statements can be changed during execution.
4. Why does SQL allow duplicate tuples in a table or in a query result?
Nov / Dec 2015
As duplicate removal is expensive, SQL allows duplicates: a table or a query result is a multiset (bag) rather than a set.
To remove duplicates, we use the distinct keyword.
In this respect typical SQL engines deviate from the strict relational model, in which relations are sets.
5. Give a brief description on DCL commands. Nov / Dec 2014
A data control language (DCL) is a component of SQL language that is used for
Authorization. DCL is used to control access to data stored in a database.
Examples of DCL commands include:
GRANT to allow specified users to perform specified tasks.
REVOKE to cancel previously granted or denied permissions
6. What is Embedded SQL? Nov / Dec 2003
Embedded SQL is a method of combining the computing power of a
programming language and the database manipulation capabilities of SQL. Embedded
SQL statements are SQL statements written inline with the program source code of the
host language.
7. How will you create a view in SQL? Nov / Dec 2003
A view is a virtual table. A view can be created from one or many tables which
depends on the written SQL query to create a view.
CREATE VIEW view_name AS SELECT column_name(s) FROM table_name WHERE condition;
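Eg. (the view, table and column names are assumptions for illustration):
CREATE VIEW rich_emp AS SELECT ename, salary FROM emp WHERE salary > 50000;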
8. Name the different types of joins supported in SQL. Nov / Dec 2005
Simple Join - Equi Join & Non-Equi Join
Self Join
Outer Join - Left Outer Join & Right Outer Join
9. Define Query Language. Give the classification of query language.
May/June 2007
Query Language: It is a construct used to infer information from the database. It can define the structure of the data, modify data in the database and specify constraints. It is classified as follows:
Data Definition Language
Data Manipulation Language
Transaction Control Language
View definition
Embedded SQL
Dynamic SQL and
Query by example
10. Consider the following relation:
EMP (Eno, Name, Date_of_birth, Sex, Date_of_join, Basic_pay, Dept)
Develop an SQL query that will find and display the dept and average basic pay
in each dept. May / June 2009.
SELECT DEPT, AVG(BASIC_PAY) FROM EMP GROUP BY DEPT;
11. List few data types and database objects supported by SQL.
Datatypes: CHAR, VARCHAR, BINARY, BOOLEAN, BLOB and CLOB
Database Objects: Tables, Views, Constraints, Sequences, Indexes, Triggers
12. Give the reasons why null values are introduced into the database.
SQL uses Null values to indicate absence of information about the value of
an attribute. Null values might be introduced into database because:
The actual value is unknown.
The actual value does not exist.
13. Write down the various cost components of a query execution.
Access cost to secondary storage
Storage cost, Computation cost
Memory usage cost and Communication cost
14. Define Triggers. What is the need for triggers?
A trigger is a stored procedure / named database object that automatically
executes when an event occurs in the database server. DML triggers execute when a
user tries to modify data through a data manipulation language (DML) event.
There are two types of triggers: row-level and statement-level triggers.
Need for Triggers:
To catch errors.
To run scheduled tasks.
To audit the changes in the database.
To implement integrity constraints.
PART – B
1. SQL FUNDAMENTALS
1 A. Explain about SQL fundamentals. (8) May / June 2016
B. Embedded SQL. (8) Nov / Dec 2014
Write about Embedded / static and dynamic SQL.
a) SQL FUNDAMENTALS
SQL - Structured Query Language
SQL has evolved from IBM's Sequel (Structured English QUEry Language).
Advantages of SQL:
SQL is a standard relational-database language.
SQL is a comprehensive database language: it has statements for data definition, query and update; hence it is both a DDL and a DML.
It has facilities for defining views on the database, for specifying security and authorization, for defining integrity constraints and for specifying transaction controls.
It also has rules for embedding SQL statements into general-purpose programming languages such as C or Pascal.
Parts of SQL:
The SQL language has several parts:
Data Definition Language – defines relational schemas, deletes relations and modifies relational schemas.
Interactive Data Manipulation Language – based on both relational algebra and relational calculus; includes commands to insert, update, select and delete tuples in the database.
View Definition – defines views.
Transaction Control – specifies the beginning and end of transactions.
Embedded SQL and Dynamic SQL – SQL statements can be embedded in general-purpose programming languages.
Integrity – SQL DDL commands specify integrity constraints that the data stored in the database must satisfy; updates that violate these constraints are disallowed.
Authorization – specifies access rights to relations and views.
Basic Structure of SQL:
The basic structure of an SQL expression consists of three clauses: select, from and where.
The SELECT clause corresponds to the project operator of relational algebra; it lists the attributes desired in the result of a query.
The FROM clause corresponds to the Cartesian product operation of relational algebra; it lists the relations used in the evaluation of the expression.
The WHERE clause corresponds to the selection predicate of relational algebra; it consists of a predicate involving attributes of the relations that appear in the FROM clause.
A SQL query is of the form:
SELECT A1, A2, A3, …, An
FROM r1, r2, r3, …, rm
WHERE P
or
SELECT <attribute list> FROM <table list> WHERE <condition>;
where:
<attribute list> is a list of attribute names whose values are to be retrieved by the query.
<table list> is a list of the relation names required to process the query.
<condition> is a conditional (Boolean) expression that identifies the
tuples to be retrieved by the query.
B. EMBEDDED SQL
Need for Embedded SQL:
Not all queries can be expressed in SQL alone.
Because SQL does not have variables and control-of-flow statements, general processing of tables is difficult.
To generate reports and to perform data auditing, SQL requires the use of a programming language.
Embedded SQL is a method of combining the computing power of a programming
language and the database manipulation capabilities of SQL.
Embedded SQL statements are SQL statements written inline with the
program source code of the host language.
Embedded SQL statements are processed by a special SQL precompiler: the embedded SQL statements are parsed by an embedded SQL pre-processor and replaced by host-language calls to a code library. The output from the pre-processor is then compiled by the host compiler. This allows programmers to embed SQL statements in programs written in any number of languages, such as C/C++, COBOL, Fortran and Java.
The SQL standard defines embeddings of SQL in a variety of programming languages such as C, Java and COBOL. A language in which SQL queries are embedded is referred to as a host language, and the SQL structures permitted in the host language comprise embedded SQL.
The EXEC SQL statement is used to identify an embedded SQL request to the pre-processor:
EXEC SQL <embedded SQL statement> END_EXEC
EXEC SQL <query or PL/SQL block> END_EXEC
The EXEC SQL … END_EXEC statement causes the query to be evaluated.
The fetch statement causes the values of one tuple of the query result to be placed in host-language variables:
EXEC SQL fetch c into :cn, :cc END_EXEC
Repeated calls to fetch get successive tuples of the query result.
The close statement causes the database system to delete the temporary relation that holds the result of the query:
EXEC SQL close c END_EXEC
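For completeness, a sketch of how the cursor c used above might be declared and opened (the relation and column names are assumptions):
EXEC SQL declare c cursor for
    select cust_name, cust_city from customer END_EXEC
EXEC SQL open c END_EXEC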
DYNAMIC SQL
Static SQL statements in a program do not change each time the program is run. Dynamic SQL, in contrast, allows programs to construct and submit SQL queries at run time:
The program constructs an SQL statement in a buffer, just as it does for the EXECUTE IMMEDIATE statement. Instead of host variables, a question mark (?) can be substituted for a constant.
The program passes the SQL statement to the DBMS with a PREPARE statement, which requests that the DBMS parse, validate, and optimize the statement and generate an execution plan for it.
The program can then use the EXECUTE statement repeatedly, supplying different parameter values each time the dynamic statement is executed.
EXEC SQL execute dynprog using :account;
The dynamic SQL program contains a ?, which is a placeholder for a value that is supplied when the SQL program is executed.
Difference between Static and Dynamic SQL:
1. Static (Embedded) SQL: how the database will be accessed is predetermined (hard coded) in the embedded SQL statement. Dynamic (Interactive) SQL: how the database will be accessed is determined at run time.
2. Static SQL is faster and more efficient; dynamic SQL is less fast and efficient.
3. Static SQL statements are compiled at compile time; dynamic SQL statements are compiled at run time.
4. In static SQL, parsing, validation, optimization and generation of the application plan are done at compile time; in dynamic SQL they are done at run time.
5. Static SQL is generally used where the data is distributed uniformly; dynamic SQL where the data is distributed non-uniformly.
6. In static SQL the EXECUTE IMMEDIATE, EXECUTE and PREPARE statements are not used; in dynamic SQL they are used.
7. Static SQL is less flexible; dynamic SQL is more flexible.
2. SQL COMMANDS
2A. Explain the following with examples: Nov / Dec 2014
(i) DDL (4)
(ii) DML (4)
Explain about Data Definition Language. (8) May / June 2016
B. Write about Integrity constraints in SQL. (8)
2A. SQL COMMANDS: (16)
Categories of SQL commands:
Data Definition Language (DDL) Commands
Data Manipulation Language (DML) Commands
Transaction Control Language (TCL) Commands
Data Control Language (DCL) commands
SQL – DATA DEFINITION LANGUAGE: (DDL)
SQL uses the terms table, row, and column for relation, tuple, and attribute,
respectively. The SQL2 commands for data definition are CREATE, ALTER,
TRUNCATE and DROP;
The SQL DDL allows specification of not only a set of relations, but also information about each relation, including:
The schema for each relation
The domain of values associated with each attribute
The integrity constraints
The set of indices to be maintained for each relation
The security and authorization information for each relation
The physical storage structure of each relation on disk
Schema and Catalog Concepts in SQL2
An SQL schema is identified by a schema name, and includes an
authorization identifier to indicate the user or account who owns the schema, as well
as descriptors for each element in the schema.
Schema elements include the tables, constraints, views, domains, and other
constructs (such as authorization grants) that describe the schema.
A schema is created via the CREATE SCHEMA statement, which can include definitions of all the schema elements.
Alternatively, the schema can be assigned a name and authorization identifier,
and the elements can be defined later.
The CREATE TABLE Command and SQL2 Data Types and Constraints
The CREATE TABLE command is used to specify a new relation by giving it
a name and specifying its attributes and constraints. The attributes are specified first,
and each attribute is given a name, a data type to specify its domain of values, and
any attribute constraints such as NOT NULL.
create table r (A1 D1, A2 D2, …, An Dn, (integrity constraint 1), (integrity constraint 2), …, (integrity constraint k))
Where r is the name of the relation, each Ai is the name of an attribute in the
schema of a relation r and Di is the domain type of values in the domain
CREATE TABLE STUDENT (reg number(4), name varchar2(25) );
CREATE TABLE STUDENT (reg number(4) primary key , name
varchar2(25) );
Domain types in SQL
The SQL standard supports a variety of built-in domain types including:
char(n) – fixed length character string with length n.
varchar(n) – variable length character string with maximum length n
int - integer
smallint - small integer , subset of integer
number(n) - a number with n digits
number(p,d) - a fixed point number with p digits and d of the p digits are to the right
of the decimal point.
real - floating point numbers
float(n) - a floating point number with precision of at least n digits.
date - a calendar date containing day-month –four digit year
time - the time of the day in hours, minutes and seconds.
timestamp - A combination of date and time.
SQL DDL DATA DEFINITION LANGUAGE
DDL commands are used to create an object, alter the structure of an object
and also drop the object created.
CREATE Command:
This command is used to create a table or an object.
Syntax: create table <tablename>(<column name1> datatype, <column name2 >
datatype,....);
ALTER Command:
This command is used to add a field or modify the definition of field or
column.
Syntax:
alter table <tablename> add (<column name1> datatype, <column name2>
datatype, .....);
alter table <tablename> modify (<column name1> datatype, <column name2>
datatype, ...);
TRUNCATE Command:
This command is used to delete all the rows of a table but not the structure of
the table.
Syntax: truncate table <tablename>;
DROP Command:
This command is used to delete the entire table along with the structure.
Syntax: drop table <tablename>;
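Eg. a short illustrative sequence on an assumed STUDENT table:
create table student (reg number(4), name varchar2(25));
alter table student add (dob date);
truncate table student; -- removes all rows, keeps the structure
drop table student; -- removes rows and structure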
DML - DATA MANIPULATION LANGUAGE
DML commands are used to insert, view, update and delete the values of an
object.
INSERT Command:
This command is used to insert a set of data values into the tables as defined
in the create table command.
Syntax:
insert into <tablename>values (value1,value2,.....,valuen);
insert into <table name> values (&columnname1, &columnname2, .....,
&columnname n);
SELECT Command:
This command is used to view particular data records or columns.
Syntax:
select <column name1, …> from <tablename>;
select * from <tablename>; - to view all records
select distinct <columnname> from <tablename>;
select * from <tablename> order by <columnname>; - default: ascending order
select * from <tablename> order by <columnname> desc;
- records are sorted in descending order w.r.t. the column name
select * from <tablename> where <condition>;
UPDATE Command:
This command is used to update and change the data values of the table.
Syntax:
update <tablename> set <column>=value where <condition>;
DELETE Command:
This command is used to delete a particular record or all records of the table.
Syntax:
delete from <tablename> where <condition>;
delete from <tablename>; -- to delete all the records or rows of a table
-- (similar to the truncate command)
SQL - TCL TRANSACTION CONTROL LANGUAGE
SAVEPOINT Command:
This command is used to save and store the transaction done till a point.
Syntax: savepoint <savepoint_id>;
ROLL BACK Command:
This command is used to undo the transaction up to a save point or commit.
Syntax: rollback;
rollback to <savepoint_id>;
COMMIT Command:
This command is used to save and end the transaction
Syntax:commit;
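Eg. a small illustrative transaction (the STUDENT table is assumed):
insert into student values (101, 'Arun');
savepoint s1;
update student set name = 'Arjun' where reg = 101;
rollback to s1; -- undoes the update, keeps the insert
commit; -- makes the insert permanent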
SQL DCL DATA CONTROL LANGUAGES
GRANT Command:
This command is used to grant privileges on tables to other users.
Syntax: grant <privileges> on <tablename> to <username>;
REVOKE Command:
This command is used to take back the privileges on tables from users.
Syntax: revoke <privileges> on <tablename> from <username>;
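Eg. (the user and table names are assumptions):
grant select, insert on student to scott;
revoke insert on student from scott;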
2B. INTEGRITY CONSTRAINTS
Integrity Constraints: Integrity constraints ensure that changes made to the
database by authorized users do not result in loss of consistency. Integrity constraints
guard against damage to the database.
DDL Statements --Constraints are specified as a part of DDL statements (Mainly
Create command) in the column definition.
DOMAIN INTEGRITY CONSTRAINTS:
Not Null Constraint: When a column is defined as not null, then the
column becomes a mandatory column.
Syntax: Column name datatype size not null;
Column name datatype Constraint <constraint-name> not null;
>CREATE TABLE STUDENT (reg number(4) primary key , name
varchar2(25) not null);
Check Constraint: Check constraints must be specified as the logical expression
that evaluates either TRUE or FALSE.
Syntax: Column name datatype (size) check (logical expression);
Column name datatype Constraint <con-name> check (logical Expression);
>CREATE TABLE STUDENT (reg number(4) primary key, name varchar2(25), mark number(3) check (mark <= 100));
Unique Constraint: The purpose of the unique key is to ensure that information in
the column is distinct. Null values are allowed.
Syntax: Columnname datatype (size) unique
Columnname datatype (size) constraint <cons-name> unique
>CREATE TABLE STUDENT (reg number(4) primary key , name
varchar2(25) unique );
ENTITY INTEGRITY CONSTRAINTS:
Primary Key: A Primary key is a one or more column in a table used to uniquely
identify each row in the table. Primary key cannot have null values.
Syntax: Column name datatype (size) primary key
Column name datatype (size) constraint <cons-name> primary key
>CREATE TABLE STUDENT (reg number(4) primary key , name
varchar2(25) );
REFERENTIAL INTEGRITY CONSTRAINTS:
Foreign Key: A Foreign key is a column in a (referencing) table that references
a (Primary key) field of one other table (Parent Table)
Syntax: Column name datatype (size) references
<referenced table (or) parent table> (<referenced field name>)
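Eg. a sketch referencing the STUDENT table above (the MARKS child table is an assumption):
CREATE TABLE MARKS (reg number(4) references STUDENT(reg), mark number(3));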
SQL QUERY STRUCTURE
Describe the six clauses in the syntax of an SQL query and show what type of constructs can be specified in each of the six clauses. Which of the six clauses are required and which are optional? (16) Nov / Dec 2015
The SQL data-manipulation language (DML) provides the ability to query
information, and insert, delete and update tuples.
There are six clauses that can be used in an SQL statement. These six clauses
are SELECT, FROM, WHERE, GROUP BY, HAVING, and ORDER BY. Clauses
must be coded in a specific sequence.
1. SELECT column name(s) or *
2. FROM tables or views
3. WHERE conditions or predicates are met
4. GROUP BY grouping attribute(s)
5. HAVING a common condition as a group
6. ORDER BY a sorting method
SELECT <attribute list>
FROM <table list>
[WHERE <condition>]
[GROUP BY <grouping attributes> ]
[HAVING <group condition>]
[ORDER BY <attribute list>]
SELECT and FROM are required; the rest of these clauses are optional.
A typical SQL query has the form:
select A1, A2, ..., An
from r1, r2, ..., rm
where P ;
– Ai represents an attribute, ri a relation, and P is a predicate.
This query is equivalent to the relational algebra expression:
ΠA1, A2, ..., An(σP (r1 × r2 × ... × rm))
The result of an SQL query is a relation.
SELECT CLAUSE:
The select clause corresponds to the projection operation of relational algebra. It is used to list the attributes desired in the result of a query.
An asterisk in the select clause denotes "all attributes":
select ∗ from emp;
SQL allows duplicates in relations as well as in query results. To force the elimination of duplicates, insert the keyword distinct after select.
Find the names of all cities in the employee relation, and remove duplicates:
select distinct city from emp;
The keyword all specifies that duplicates are not to be removed:
select all city from emp;
The select clause can contain arithmetic expressions involving the operators +, −, ∗ and /, operating on constants or attributes of tuples:
select salary+da from emp;
FROM CLAUSE:
The from clause corresponds to the Cartesian product operation of the
relational algebra. It lists the relations to be scanned in the evaluation of the
expression.
Find the Cartesian product emp × dept:
select ∗ from emp, dept;
WHERE CLAUSE:
The where clause corresponds to the selection predicate of the relational
algebra. It consists of a predicate involving attributes of the relations that appear in
the from clause
SQL uses the logical connectives and, or, and not. It allows the use of arithmetic
expressions as operands to the comparison operators.
select loan-number from loan where amount between 90000 and 100000;
Find the maximum salary, the minimum salary, and the average salary among employees who work for the 'Research' department:
SELECT MAX(SALARY), MIN(SALARY), AVG(SALARY)
FROM EMPLOYEE, DEPARTMENT
WHERE DNO = DNUMBER AND DNAME = 'Research';
GROUP BY CLAUSE:
It is used to apply the aggregate functions to subgroups of tuples in a relation.
Each subgroup of tuples consists of the set of tuples that have the same value for the grouping attribute(s).
The function is applied to each subgroup independently.
SQL has a GROUP BY clause for specifying the grouping attributes, which must also appear in the SELECT clause.
For each department, retrieve the department number, the number of employees in the department, and their average salary:
SELECT DNO, COUNT(*), AVG(SALARY) FROM EMPLOYEE GROUP BY DNO;
ORDER BY CLAUSE:
ORDER BY specifies an order for displaying the result of a query. The default is ascending order; to specify descending order, desc should be used.
Select ssn, name from emp where dno=2 order by sal desc;
HAVING CLAUSE:
It is used to retrieve the values of the aggregate functions for only those groups that satisfy certain conditions. The HAVING clause specifies a selection condition on groups:
SELECT DNO, COUNT(*), AVG(SALARY)
FROM EMPLOYEE GROUP BY DNO HAVING COUNT(*) > 50;
4. QUERY PROCESSING
4. Briefly explain about Query processing. (16) May / June 2016
Query Processing
Query processing is a 3- step process that transforms a high-level query (of
relational calculus/SQL) into an equivalent and more efficient lower-level query (of
relational algebra).
Parsing and translation
Optimization
Evaluation
Parsing and translation – Query Compiler
Translate the query into its internal form; this is then translated into relational algebra. The parser checks the syntax and verifies the relations.
A query expressed in a high-level query language such as SQL must first be scanned, parsed, and validated.
The scanner identifies the language tokens – such as SQL keywords, attribute names, and relation names – in the text of the query.
The parser checks the query syntax to determine whether it is formulated
according to the syntax rules (rules of grammar) of the query language
The query must also be validated, by checking that all attribute and relation names
are valid and semantically meaningful names in the schema of the particular database
being queried.
Fig. Steps in processing a high-level query (the optimizer consults statistics about the data stored in the DB).
Translating SQL Queries into Relational Algebra:
An SQL Query is first translated into an equivalent relational algebra
expression. SQL queries are decomposed into query blocks and then it is translated
into relational algebra operators and then optimized.
Heuristic Relational algebra optimization can group operations together for
execution. This is called pipelining or stream based processing.
Internal representation of the SQL Query:
An internal representation of the query is created, usually as a tree data structure
called a query tree. It is also possible to represent the query using a graph data
structure called a query graph.
Query Tree: A query tree is a tree data structure that corresponds to a relational
algebra expression.
Input relations of the query are represented as leaf nodes of the tree, and relational
algebra operations are represented as internal nodes.
A query tree represents a specific order of operations for executing a query. An
execution of the query tree consists of executing an internal node operation
whenever its operands are available and then replacing that internal node by the
relation that results from executing the operation.
Query Graph: Relations in the query are represented by nodes. Selection and join
conditions are represented by graph edges. There is a single graph to each query. A
graph data structure corresponds to a relational calculus expression. The Query
graph does not indicate an order on which operations are performed.
2. Query Optimization – Query Optimizer
Generate an optimal evaluation plan (with lowest cost) for the query.
An evaluation plan or execution strategy defines exactly what algorithm is used for
each operation, and how the execution of the operations is coordinated.
A query typically has many possible execution strategies, and the process of
choosing a suitable one for processing a query is known as query optimization.
The DBMS must devise an execution strategy for retrieving the result of the query
from the database. The query optimizer module has the task of producing an
execution plan, and the code generator generates the code to execute that plan.
3. Query evaluation – Query Command processor
The query-execution engine takes a query-evaluation plan, executes that plan,
and returns the answers to the query.
The runtime database processor has the task of running the query code, whether in
compiled or interpreted mode, to produce the query result. If a runtime error results,
an error message is generated by the runtime database processor.
QUERY OPTIMIZATION
Explain in detail about Query Optimization. (16)
Explain the cost estimation of Query optimization. (16) Nov / Dec 2014
Discuss about the Join order optimization and heuristic optimization. (16)
Apr/May 2015
Query Optimization:
Query optimization is the overall process of choosing the most efficient way of executing a SQL statement. It is the process of choosing a suitable execution strategy for processing a query among the possible query execution plans. It is a function of the DBMS.
Need for Query Optimization:
To speed up long running queries.
To avoid data locking and corruption.
To reduce the amount of data a query has to process.
Types of Query Optimization:
Heuristics or Rule based Query Optimization - the optimizer chooses execution plans based on heuristically ranked operations.
Cost based Query Optimization - the optimizer estimates the cost of candidate execution plans and chooses the one with the lowest estimated cost, based on the usage of resources.
Heuristic Optimization
Process for heuristics optimization:
Step 1. The parser of a high-level query generates an initial internal
representation
Many query trees can be drawn for the same query. The query parser will generate a standard initial query tree corresponding to an SQL query without doing any optimization.
SQL Query Q0 and its query trees:
[Figures (a)–(d): the SQL query Q0, its initial query tree, and the successive transformed trees appear as diagrams in the original.]
Step 2. Apply heuristic rules to optimize the internal representation.
It is the job of the heuristic query optimizer to transform this initial query tree
into a final query tree that is efficient to execute. The optimizer includes rules for
equivalence among relational algebra expressions that can be applied to the initial tree.
Heuristic query optimization rules use these equivalences to transform the initial tree into the final optimized query tree.
Heuristic optimization transforms the query-tree by using a set of rules that
typically improve execution performance:
Perform selection early (reduces the number of tuples)
Perform projection early (reduces the number of attributes)
Perform most restrictive selection and join operations (i.e. with smallest result
size) before other similar operations.
Steps in converting a query tree during heuristic optimization:
Represent the initial query tree for the SQL query.
Move SELECT operations down the query tree.
Apply the most restrictive SELECT operations first.
Replace Cartesian product and select with join.
Move PROJECT operations down the tree.
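As a small illustration of these steps (the relations R and S, attribute A, and the conditions are assumed for illustration only):
Initial tree: π A (σ R.id = S.id AND R.x = 5 (R × S))
After the transformations: π A ((σ R.x = 5 (R)) ⋈ R.id = S.id S)
Moving σ R.x = 5 down the tree shrinks R before the join, and replacing the Cartesian product plus its join condition with a join avoids materializing R × S.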
Step 3. A query execution plan is generated to execute groups of operations
based on the access paths available on the files involved in the query.
The main heuristic is to apply first the operations that reduce the size of
intermediate results.
E.g., Apply SELECT and PROJECT operations before applying the JOIN or other
binary operations
An execution plan has three components:
A query tree
A strategy selected for each non-leaf node
An ordering of evaluation of the non-leaf nodes
Compiled queries:
The optimization is done at compile time and the resulting execution strategy
code is stored and executed directly at run time.
Interpreted queries:
The optimization and execution of code is done at run time. A full scale
optimization may slow down the response time.
Part - C
Cost based Optimization:
The cost difference between evaluation plans for a query can be enormous. The optimizer estimates and compares the costs of executing a query using different execution strategies and chooses the strategy with the lowest cost estimate. Cost functions are used in query optimization.
The goal of cost based optimization in Oracle is to minimize the elapsed time to process the entire query.
The optimizer calculates this cost based on the estimated usage of resources such as I/O, CPU time and memory needed.
Cost components for Query execution:
Access cost to secondary storage
Storage cost
Computation cost
Memory usage cost
Communication cost
Catalog information used in cost functions:
Number of records (r)
The average record size (R)
Number of blocks (b)
Blocking factor (bfr)
Number of levels of indices (x)
Selectivity (sl) – the fraction of records satisfying a condition on the attribute.
Selection cardinality (s), where s = sl * r.
Steps in cost-based query optimization:
Generate logically equivalent expressions using equivalence rules.
Annotate resultant expressions to get alternative query plans.
Choose the cheapest plan based on estimated cost.
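A small worked example using these catalog values (the numbers are assumed for illustration): suppose a relation has r = 10,000 records stored in b = 500 blocks, with an index of x = 2 levels on a key attribute A. For an equality selection on A, sl = 1/r, so the selection cardinality is s = sl * r = 1. A linear scan costs about b = 500 block accesses (about b/2 = 250 on average before the matching record is found), whereas using the index costs about x + 1 = 3 block accesses, so a cost-based optimizer would choose the index.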
Join Ordering:
Join ordering is a central cost-based optimization: the performance of a query plan is determined largely by the order in which the tables are joined.
First, all ways to access each relation in the query are computed. For each
relation, the optimizer records the cheapest way to scan the relation.
The optimizer then considers combining each pair of relations for which a join condition exists. For each pair, the optimizer considers the available join algorithms implemented by the DBMS and preserves the cheapest way to join each pair of relations.
Then all three-relation query plans are computed, by joining each two-relation
plan produced by the previous phase with the remaining relations in the
query.
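A minimal Python sketch of this bottom-up search over left-deep plans (the relation names, scan costs, and cost function are invented for illustration; a real optimizer uses catalog statistics and considers several join algorithms):

from itertools import combinations

# Assumed, illustrative scan costs for three base relations.
scan_cost = {"R": 100, "S": 400, "T": 50}

def join_cost(left_cost, right_cost):
    # Toy cost model: pay for both inputs plus a joining overhead.
    return left_cost + right_cost + 0.01 * left_cost * right_cost

def best_left_deep_plan(relations):
    # best[set of relations] = (cost, plan); start with single-relation scans.
    best = {frozenset([r]): (scan_cost[r], r) for r in relations}
    for size in range(2, len(relations) + 1):
        for subset in combinations(relations, size):
            key = frozenset(subset)
            for r in subset:  # r is the base relation joined last (left-deep)
                rest = key - {r}
                cost = join_cost(best[rest][0], scan_cost[r])
                plan = f"({best[rest][1]} JOIN {r})"
                if key not in best or cost < best[key][0]:
                    best[key] = (cost, plan)
    return best[frozenset(relations)]

print(best_left_deep_plan(["R", "S", "T"]))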
The SQL Server query optimizer needs to take two important decisions regarding joins: the selection of a join order and the choice of a join algorithm. The selection of a join algorithm is the choice between a nested loops join, merge join or hash join operator.
Join tree:
A join tree is a binary tree with
Join operators as inner nodes
Relations as leaf nodes
Commonly used classes of join trees:
Left-deep tree
Right-deep tree
Zigzag tree
Bushy tree
A left-deep tree is a binary tree where the right child of each non-leaf node is always a base relation. The optimizer chooses the left-deep tree with the lowest cost estimate. Left-deep trees are well suited to pipelining.
Semantic Query Optimization:
Uses constraints specified on the database schema in order to modify one query
into another query that is more efficient to execute.
SQL QUERIES
a) Consider a student registration database comprising the following table schemas. (16) Apr/May 2015
Student File
Student Number Student Name Address Telephone
Course File
Course Number Description Hours Professor Number
Professor File
Professor Number Name Office
Registration File
Student Number Course Number Date
Consider a suitable sample of tuples/records for the above mentioned tables and write DML statements (SQL) to answer the queries listed below.
Which courses does a specific professor teach?
What courses are taught by two specific professors?
Who teaches a specific course and where is his/her office?
For a specific student number, in which courses is the student
registered and what is his/her name?
Who are the professors for a specific student?
Who are the students registered in a specific course?
Solution:
Which courses does a specific professor teach?
select CourseNumber from Course where ProfessorNumber = (select ProfessorNumber from Professor where Name = 'Dr.Lakshmi');
What courses are taught by two specific professors?
Select CourseNumber, ProfessorNumber from Course where ProfessorNumber in (1, 2);
Who teaches a specific course and where is his/her office?
Select P.ProfessorNumber, Name, Office from Professor P, Course C where P.ProfessorNumber = C.ProfessorNumber and CourseNumber = 1;
For a specific student number, in which courses is the student registered and what is his/her name?
select StudentName, CourseNumber from Student S, Registration R where S.StudentNumber = R.StudentNumber and S.StudentNumber = 1;
Who are the professors for a specific student?
Select Name from Course C, Professor P, Registration R where StudentNumber = 1 and R.CourseNumber = C.CourseNumber and C.ProfessorNumber = P.ProfessorNumber;
Who are the students registered in a specific course?
Select StudentNumber from Registration where CourseNumber = '2001';
(b). Assume the following table: Nov / Dec 2015
Degree (degcode,name,subject)
Candidate(seatno,degcode,name,semester,month,year,result)
Marks(seatno,degcode,semester,month,year,papcode,marks)
Degcode – degree code; name – name of the degree (M.Sc, M.Com); subject – subject of the course, e.g., Physics; papcode – paper code, e.g., A1.
Solve the following queries using SQL: (16)
Write a SELECT statement to display all the degree codes which are there in the candidate table but not present in the degree table, in the order of degcode. (4)
Select degcode from Candidate minus Select degcode from Degree order by degcode;
Write a SELECT statement to display the name of all the candidates who have got less than 40 marks in exactly 2 subjects. (4)
select name from Candidate where seatno in (select seatno from Marks where marks < 40 group by seatno having count(*) = 2);
Write a SELECT statement to display the name, subject and number of candidates for all degrees in which there are less than 5 candidates. (4)
Select d.name, d.subject, count(c.seatno) from Degree d, Candidate c where d.degcode = c.degcode group by d.degcode, d.name, d.subject having count(c.seatno) < 5;
Write a SELECT statement to display the names of all the candidates who have got the highest total marks in M.Sc (Maths). (4)
Select name from Candidate where degcode in (select degcode from Degree where name = 'M.Sc' and subject = 'Maths') and seatno in (select seatno from Marks group by seatno having sum(marks) = (select max(sum(marks)) from Marks group by seatno));
-o0o-o0o-o0o-
UNIT III
TRANSACTION PROCESSING AND CONCURRENCY CONTROL
Introduction-Properties of Transaction– Serializability– Concurrency Control –
Locking Mechanisms– Two Phase Commit Protocol–Dead lock
PART-A
1. What do you mean by transaction? Define the properties of transaction.
Nov/Dec 2014, April/May 2015, May/June 2016, Nov/Dec 2010, April/May 2010
A transaction is a unit of program execution that accesses and possibly updates
various data items.
Atomicity: A transaction is either performed in its entirety or not performed at all.
Consistency: A transaction is consistent if its complete execution takes the database from one consistent state to another.
Isolation: Each transaction is unaware of other transactions executing concurrently in the system.
Durability: After the successful completion of a transaction, the changes it has made to the database persist even if there are system failures.
2. What is serializability? How is it tested? Nov/Dec 2014,
May/June 2014(2008)
Serializability is the process of managing the execution of a set of transactions in such a way that their concurrent execution produces the same end result as if they were run serially. Serializability is tested by using a directed graph, called a precedence graph, constructed from the schedule. This graph consists of a pair G = (V, E), where V is a set of vertices representing all the transactions and E is a set of edges of the form Ti → Tj.
3. Define DDL, DML, DCL and TCL. April/May 2015
DDL – Data Definition Language: It is used to create, alter and delete database objects. The commands used are create, alter and drop.
DML – Data Manipulation Language: It lets users insert, modify and delete the data in the database. The commands used are insert, update and delete.
DCL – Data Control Language: It consists of commands that control user access to the database objects. The commands used are Grant and Revoke.
TCL – Transaction Control Language: It is a subset of SQL used to control transactional processing in a database. The commands used are Commit, Rollback and Savepoint. A transaction is a logical unit of work that comprises one or more SQL statements, usually a group of Data Manipulation Language (DML) statements.
4. Differentiate strict two phase locking protocol and rigorous two phase
locking protocol.
April/May 2016, May/June 2013 (2008), Nov/Dec 2013 (2008), May/June 2012
Two phase locking protocol requires that each transaction issue lock and
unlock requests in two phases.
Growing phase: In this phase transaction may obtain locks but may not release any
lock.
Shrinking phase: In this phase, a transaction may release locks but may not obtain
any new locks.
Strict Two-phase Locking:
- A transaction must hold all its exclusive locks till it commits/aborts.
- It guarantees cascadeless recoverability.
- It does not guarantee a deadlock-free schedule.
Rigorous Two-phase Locking:
- It is even stricter: all locks (shared and exclusive) are held till commit/abort.
- It is used in dynamic environments where data access patterns are not known beforehand.
- It also does not guarantee a deadlock-free schedule.
5. What is meant by concurrency control?
Nov/Dec 2015, April/May 2015(2008)
List the two commonly used Concurrency Control techniques.
Nov/Dec 2011(2008)
Concurrency control is the process of managing simultaneous operations on the database without having them interfere with one another. It prevents interference when two or more users are accessing a database simultaneously and at least one is updating data.
Need for concurrency:
Improved throughput and resource utilization
Reduced waiting time
Two techniques:
Lock-based protocols
Timestamp-based protocols
6. Give an example for two phase commit protocol . Nov/Dec 2015
The two phase commit protocol is a distributed algorithm which lets all sites
in a distributed system agree to commit a transaction. The protocol results in either
all nodes committing the transaction or aborting, even in the case of site failures and
message losses.
Example:
Transfer money from bank A to bank B – debit A, credit B, tell the client "OK". We want either both operations to happen or neither.
7. Define deadlock. May/June 2014(2008)
In a database, a deadlock is a situation in which two or more transactions are waiting for one another to give up locks. In other words, there exists a set of waiting transactions {T0, T1, ..., Tn}, such that T0 is waiting for a data item that T1 holds, ..., Tn-1 is waiting for a data item that Tn holds, and Tn is waiting for a data item that T0 holds. None of these transactions can proceed with its normal execution.
8. What are the disadvantages of not controlling concurrency? Nov/Dec 2014
The Lost Update Problem: This problem occurs when two transactions that access the same database items have their operations interleaved in a way that makes the value of some database item incorrect.
The Temporary Update (or Dirty Read) Problem: This problem occurs when one transaction updates a database item and then the transaction fails for some reason.
The Incorrect Summary Problem: If one transaction is calculating an aggregate summary function on a number of records while other transactions
are updating some of these records, the aggregate function may calculate
some values before they are updated and others after they are updated.
9. Write the use of save points. April/May 2015(2008)
A savepoint is a way of implementing sub transactions (also known as nested
transactions) within a relational database management system. It indicates a point
within a transaction that can be "rolled back to" without affecting any work done in
the transaction.
Multiple savepoints can exist within a single transaction. Savepoints are
useful for implementing complex error recovery in database applications. If an error
occurs in the midst of a multiple-statement transaction, the application may be able
to recover from the error without having to abort the entire transaction.
A savepoint can be declared by issuing a SAVEPOINT name statement.
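A minimal sketch using Python's built-in sqlite3 module (the table and savepoint names are invented for illustration):

import sqlite3

# isolation_level=None gives manual transaction control.
con = sqlite3.connect(":memory:", isolation_level=None)
con.execute("CREATE TABLE account(id INTEGER, balance INTEGER)")

con.execute("BEGIN")
con.execute("INSERT INTO account VALUES (1, 100)")
con.execute("SAVEPOINT sp1")              # declare a savepoint
con.execute("INSERT INTO account VALUES (2, 200)")
con.execute("ROLLBACK TO sp1")            # undo only the work after sp1
con.execute("COMMIT")                     # row (1, 100) survives, (2, 200) does not

print(con.execute("SELECT * FROM account").fetchall())  # [(1, 100)]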
10. State the write-ahead log rule. Why is it necessary? Nov/Dec 2012
Write-ahead logging (WAL) is a family of techniques for providing atomicity
and durability (two of the ACID properties) in database systems. In a system using
WAL, all modifications are written to a log before they are applied. Usually both
redo and undo information is stored in the log which guarantees that no data
modifications are written to disk before the associated log record is written to disk.
This maintains the ACID properties for a transaction.
11. Brief about cascading rollback. Nov/Dec 2013
A cascading rollback occurs in database systems when a transaction (T1) causes a failure and a rollback must be performed. Other transactions dependent on T1's actions must also be rolled back due to T1's failure, thus causing a cascading effect: one transaction's failure causes many to fail.
12. What are two pitfalls (problems) of lock-based protocols? April/May 2011
Deadlock: Two or more transactions are waiting for one another to give up locks.
Starvation: A transaction may be waiting for an X-lock on an item, while a sequence of other transactions request and are granted an S-lock on the same item.
PART-B
1. TRANSACTION CONCEPT & STATES
Explain the ACID properties of transaction. May/June 2014
Write short notes on transaction concept. Nov/Dec 2014
i. Transaction Concept:
A transaction is a logical unit of database processing that includes one or
more database access operations—these can include insertion, deletion,
modification, or retrieval operations. The database operations that form a transaction
can either be embedded within an application program or they can be specified
interactively via a high-level query language such as SQL.
One way of specifying the transaction boundaries is by specifying explicit begin
transaction and end transaction statements in an application program;
A single application program may contain more than one transaction if it contains
several transaction boundaries.
If the database operations in a transaction do not update the database but only
retrieve data, the transaction is called a read-only transaction.
The basic database access operations that a transaction can include are as follows:
• read_item(X): Reads a database item named X into a program variable.
• write_item(X): Writes the value of program variable X into the database item
named X.
Executing a read_item(X) command includes the following steps:
Find the address of the disk block that contains item X.
Copy that disk block into a buffer in main memory (if that disk block is not
already in some main memory buffer).
Copy item X from the buffer to the program variable named X.
Executing a write_item(X) command includes the following steps:
Find the address of the disk block that contains item X.
Copy that disk block into a buffer in main memory (if that disk block is not
already in some main memory buffer).
Copy item X from the program variable named X into its correct location in the
buffer.
Store the updated block from the buffer back to disk (either immediately or at
some later point in time).
A transaction includes read_item and write_item operations to access and update the
database.
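A rough Python sketch of these two operations (the disk, buffer pool, and block-addressing scheme here are simplified stand-ins, not a real DBMS buffer manager):

disk = {"blk7": {"X": 40}}          # block address -> items stored on that block
buffer_pool = {}                    # main-memory copies of disk blocks
block_of = {"X": "blk7"}            # item name -> containing block address
program_vars = {}

def read_item(name):
    blk = block_of[name]                          # 1. find the block address
    if blk not in buffer_pool:                    # 2. copy block into a buffer
        buffer_pool[blk] = dict(disk[blk])
    program_vars[name] = buffer_pool[blk][name]   # 3. copy item to the variable

def write_item(name, flush=True):
    blk = block_of[name]                          # 1. find the block address
    if blk not in buffer_pool:                    # 2. copy block into a buffer
        buffer_pool[blk] = dict(disk[blk])
    buffer_pool[blk][name] = program_vars[name]   # 3. update the item in the buffer
    if flush:                                     # 4. write back, now or later
        disk[blk] = dict(buffer_pool[blk])

read_item("X"); program_vars["X"] += 10; write_item("X")
print(disk["blk7"])   # {'X': 50}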
Transaction States:
Operations involved:
BEGIN_TRANSACTION: This marks the beginning of transaction
execution.
READ or WRITE: These specify read or write operations on the database
items that are executed as part of a transaction.
END_TRANSACTION: This specifies that READ and WRITE transaction
operations have ended and marks the end of transaction execution. However,
at this point it may be necessary to check whether the changes introduced by
the transaction can be permanently applied to the database (committed) or
whether the transaction has to be aborted.
COMMIT_TRANSACTION: This signals a successful end of the transaction
so that any changes (updates) executed by the transaction can be safely
committed to the database and will not be undone.
ROLLBACK (or ABORT): This signals that the transaction has ended
unsuccessfully, so that any changes or effects that the transaction may have
applied to the database must be undone
The figure shows a state transition diagram that describes how a transaction moves
through its execution states.
Active: A transaction goes into an active state immediately after it starts
execution, where it can issue READ and WRITE operations.
Partially Committed: When the transaction ends, it moves to the partially
committed state.
Committed: At this point, some recovery protocols need to ensure that a
system failure will not result in an inability to record the changes of the
transaction permanently. Once this check is successful, the transaction is said
to have reached its commit point and enters the committed state. Once a
transaction is committed, it has concluded its execution successfully and all
its changes must be recorded permanently in the database.
Failed: However, a transaction can go to the failed state if one of the checks
fails or if the transaction is aborted during its active state. The transaction may
then have to be rolled back to undo the effect of its WRITE operations on the
database.
Terminated: The terminated state corresponds to the transaction leaving the
system.
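The same state diagram can be written down as a transition table; a minimal Python sketch (the event names paraphrase the operations described above):

# state -> {event: next state}, following the diagram described above
TRANSITIONS = {
    "active":              {"end": "partially committed", "abort": "failed"},
    "partially committed": {"commit": "committed", "abort": "failed"},
    "committed":           {"terminate": "terminated"},
    "failed":              {"rollback done": "terminated"},
}

def step(state, event):
    try:
        return TRANSITIONS[state][event]
    except KeyError:
        raise ValueError(f"illegal transition: {event} in state {state}")

s = "active"
for e in ["end", "commit", "terminate"]:
    s = step(s, e)
print(s)  # terminated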
Transaction Example:
Let T be the transaction to transfer Rs.500 from A to B
Ti: read(A)
A = A - 500
write(A)
read(B)
B = B + 500
write(B)
Transaction properties:
Transactions should possess several properties. These are often called the ACID
properties, and they should be enforced by the concurrency control and recovery
methods of the DBMS. The following are the ACID properties:
Atomicity: A transaction is an atomic unit of processing; it is either performed
in its entirety or not performed at all.
Consistency preservation: A transaction is consistency preserving if its complete execution takes the database from one consistent state to another.
Isolation: A transaction should appear as though it is being executed in
isolation from other transactions. That is, the execution of a transaction
should not be interfered with by any other transactions executing
concurrently.
Durability or permanency: The changes applied to the database by a
committed transaction must persist in the database. These changes must not
be lost because of any failure.
Atomicity:
A = 1500 and B = 500 before the execution of transaction T1. Suppose that during the execution of transaction T1, a failure occurs that prevents T1 from completing successfully. If the error occurs after write(A) and before write(B), the values of accounts A and B will be A = 1000 and B = 500, and the sum A + B is no longer preserved. Due to the failure, the state of the system no longer reflects a real state of the world; i.e., it is in an inconsistent state. If the atomicity property is present, either all actions of the transaction are reflected in the database or none are.
The database system keeps track of the old values of any data on which a transaction performs a write; if the transaction does not complete its execution, the database system restores the old values to make it appear as though the transaction never executed. This is handled by the transaction recovery subsystem.
Consistency:
The preservation of consistency is generally considered to be the responsibility of
the programmers who write the database programs or of the DBMS module that
enforces integrity constraints.
A database state is a collection of all the stored data items (values) in the database
at a given point in time. A consistent state of the database satisfies the constraints
specified in the schema as well as any other constraints that should hold on the
database. A database program should be written in a way that guarantees that, if the
database is in a consistent state before executing the transaction, it will be in a
consistent state after the complete execution of the transaction, assuming that no
interference with other transactions occurs.
For example, the requirement in the previous example is that the sum of A and B be unchanged by the execution of the transaction.
Isolation:
Isolation is enforced by the concurrency control subsystem of the DBMS. If every
transaction does not make its updates visible to other transactions until it is
committed, one form of isolation is enforced that solves the temporary update
problem and eliminates cascading rollbacks. There have been attempts to define the
level of isolation of a transaction.
For example, in an intermediate state in transferring funds from A to B, a second concurrently running transaction that reads A and B and computes A + B will observe an inconsistent state. In addition, if it performs updates on A and B based on inconsistent values, the database may be left in an inconsistent state.
Durability:
Once the execution of the transaction completes successfully and the user who initiated the transaction has been notified that the transfer of funds has taken place, no system failure should result in the loss of data corresponding to this transfer of funds.
This property guarantees that once a transaction completes, all the updates that it
carried out on database persist even if there are failures after execution.
2. SERIALIZABILITY
Discuss view serializability and conflict serializability. Nov/Dec 2015
If T1 and T2 are submitted at the same time (and if interleaving of operations is not permitted), then there are two possible outcomes:
1. Execute all the operations of T1 (in sequence) followed by all the operations of T2 (in sequence).
2. Execute all the operations of T2 (in sequence) followed by all the operations of T1 (in sequence).
We can use serial execution as a measure of correctness, and concurrent execution for improving resource utilization.
Serializability:
If interleaving is allowed, there are many orders in which the system can execute the operations. Serializability is the process of managing the execution of a set of transactions in such a way that their concurrent execution produces the same end result as if they were run serially.
Serializable Schedule:
A schedule S of n transactions is serializable if it is equivalent to some serial schedule of the same n transactions.
Two schedules are result equivalent if they produce the same final state of the database. For example, with x = 100:
Schedule A: read(x); x = x + 10; write(x)
Schedule B: read(x); x = x * 1.1; write(x)
Both leave x = 110, so the two schedules are result equivalent when x = 100 (but not for other initial values of x).
Types of serializability:
Conflict Serializability:
Let us consider a schedule S in which there are two consecutive instructions Ii and Ij of transactions Ti and Tj respectively (i ≠ j). If Ii and Ij refer to different data items, then we can swap Ii and Ij without affecting the results of any instruction in the schedule. However, if Ii and Ij refer to the same data item Q, then the order of the two steps may matter. There are four cases to consider:
Ii = read(Q), Ij = read(Q). The order of Ii and Ij does not matter, since the same value of Q is read by Ti and Tj regardless of their order.
Ii = read(Q), Ij = write(Q). If Ii comes before Ij, then Ti does not read the value of Q that is written by Tj in instruction Ij. If Ij comes before Ii, then Ti reads the value of Q that is written by Tj. Thus the order of Ii and Ij matters.
Ii=write(Q), Ij=read(Q). The order of Ii and Ij matters, reason is same as previous
case.
Ii=write(Q), Ij=write(Q). Since both instructions are write operations, the order of
these instructions does not affect either Ti or Tj. However, the value obtained by the
next read(Q) instruction of S is affected, since the result of only the latter of the two
write instructions is preserved in the database.
Thus, only in the case where both Ii and Ij are read instructions does the order of execution not matter. We say that Ii and Ij conflict if they are operations by different transactions on the same data item and at least one of these instructions is a write operation.
Consider the following schedule 3
T1: read(A); write(A)
T2: read(A); write(A)
T1: read(B); write(B)
T2: read(B); write(B)
The write(A) of T1 conflicts with the read(A) of T2. However, the write(A) of T2 does not conflict with the read(B) of T1, hence we can swap these instructions to generate a new schedule 5, shown below.
T1: read(A); write(A)
T2: read(A)
T1: read(B)
T2: write(A)
T1: write(B)
T2: read(B); write(B)
Swapping can be continued for non-conflicting instructions:
Swap the read(B) instruction of T1 with the read(A) instruction of T2.
Swap the write(B) instruction of T1 with the write(A) instruction of T2.
Swap the write(B) instruction of T1 with the read(A) instruction of T2.
The final result of these swaps is the serial schedule shown below.
T1: read(A); write(A); read(B); write(B)
T2: read(A); write(A); read(B); write(B)
Schedule 6: A serial schedule that is equivalent to schedule 3
If a schedule S can be transformed into a schedule S' by a series of swaps of non-conflicting instructions, we say that S and S' are conflict equivalent. We say that a schedule S is conflict serializable if it is conflict equivalent to a serial schedule. Thus schedule 3 is conflict serializable, since it is conflict equivalent to the serial schedule 1.
Consider schedule 7, which consists of two transactions T3 and T4. This schedule is not conflict serializable, since it is not equivalent to either the serial schedule <T3, T4> or the serial schedule <T4, T3>.
T3: read(Q)
T4: write(Q)
T3: write(Q)
Schedule 7
View Serializability:
Consider two schedules S and S' in which the same set of transactions participates. The schedules S and S' are said to be view equivalent if three conditions are met:
1. For each data item Q, if transaction Ti reads the initial value of Q in schedule S, then transaction Ti must, in schedule S', also read the initial value of Q.
2. For each data item Q, if transaction Ti executes read(Q) in schedule S and that value was produced by a write(Q) operation executed by transaction Tj, then the read(Q) operation of transaction Ti must, in schedule S', also read the value of Q that was produced by the same write(Q) operation of transaction Tj.
3. For each data item Q, the transaction that performs the final write(Q) operation in schedule S must perform the final write(Q) operation in schedule S'.
A schedule is view serializable if it is view equivalent to a serial schedule.
Consider schedule 8 below, which is view equivalent to the serial schedule <T3, T4, T5>, since its one read(Q) instruction reads the initial value of Q and T5 performs the final write of Q.
Every conflict serializable schedule is also view serializable but there are view
serializable schedules that are not conflict serializable.
T3: read(Q)
T4: write(Q)
T3: write(Q)
T5: write(Q)
Schedule 8
In schedule 8, the transactions T4 and T5 perform write(Q) operations without having performed a read(Q) operation. Writes of this sort are called blind writes. A view serializable schedule with blind writes is not conflict serializable.
Testing of serializability:
Testing of serializability is done by using a directed graph, called a precedence graph, constructed from the schedule. This graph consists of a pair G = (V, E), where V is a set of vertices and E is a set of edges. The set of vertices consists of all transactions in the schedule. The set of edges consists of all edges Ti → Tj for which one of three conditions holds:
Ti executes write(Q) before Tj executes read(Q).
Ti executes read(Q) before Tj executes write(Q).
Ti executes write(Q) before Tj executes write(Q).
The precedence graph for schedule 1 is shown in figure (a). It contains a single edge T1 → T2, since all the instructions of T1 are executed before the first instruction of T2 is executed. Figure (b) shows the precedence graph for schedule 2.
T1 → T2   (a) Precedence graph for schedule 1
T2 → T1   (b) Precedence graph for schedule 2
Consider the following schedule 9
T1: read(A)
T1: A = A - 50
T2: read(A)
T2: temp = A * 0.1
T2: A = A - temp
T2: write(A)
T2: read(B)
T1: write(A)
T1: read(B)
T1: B = B + 50
T1: write(B)
T2: B = B + temp
T2: write(B)
Schedule 9
Precedence graph for schedule 9: T1 → T2 and T2 → T1 (a cycle)
Test for conflict serializability:
To test conflict serializability, construct a precedence graph for the given schedule. If the graph contains a cycle, the schedule is not conflict serializable. If the graph contains no cycle, then the schedule is conflict serializable.
Schedules 1 and 2 are conflict serializable, as their precedence graphs do not contain any cycle, while schedule 9 is not conflict serializable, as its precedence graph contains a cycle.
Topological sorting:
If the graph is acyclic, a serial schedule can be found using the topological sort below:
1. Initialize the serial schedule as empty.
2. Find a transaction Ti such that there are no arcs entering Ti; Ti is the next transaction in the serial schedule.
3. Remove Ti and all edges leaving Ti. If the remaining set is non-empty, return to step 2; otherwise the serial schedule is complete.
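Both the cycle test and the topological sort above are easy to mechanize; a small Python sketch (the schedule encoding as (transaction, operation, item) triples is invented for illustration):

def precedence_graph(schedule):
    # schedule: list of (txn, op, item), op in {"r", "w"}, in execution order
    edges = set()
    for i, (ti, opi, qi) in enumerate(schedule):
        for tj, opj, qj in schedule[i + 1:]:
            if ti != tj and qi == qj and "w" in (opi, opj):
                edges.add((ti, tj))          # conflicting operations: edge ti -> tj
    return edges

def serial_order(txns, edges):
    # Topological sort; returns None if the graph has a cycle.
    order, remaining, edges = [], set(txns), set(edges)
    while remaining:
        free = [t for t in remaining if not any(e[1] == t for e in edges)]
        if not free:
            return None                      # cycle => not conflict serializable
        t = free[0]
        order.append(t)
        remaining.discard(t)
        edges = {e for e in edges if e[0] != t}
    return order

# Schedule 9 from above, reduced to its read/write operations:
s9 = [("T1", "r", "A"), ("T2", "r", "A"), ("T2", "w", "A"), ("T2", "r", "B"),
      ("T1", "w", "A"), ("T1", "r", "B"), ("T1", "w", "B"), ("T2", "w", "B")]
g = precedence_graph(s9)
print(g)                                # contains both (T1, T2) and (T2, T1): a cycle
print(serial_order({"T1", "T2"}, g))    # None -> not conflict serializable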
3. COMMIT PROTOCOLS
Explain about the two-phase commit and three-phase commit protocols. April/May 2015
Commit Protocol:
To ensure atomicity, all the sites in which a transaction T executed must agree on the
final outcome of the execution. T must commit at all sites or it must abort at all sites.
To ensure this property the transaction co-ordinator of T must execute a commit
protocol.
Two phase Commit Protocol (2PC):
Consider a transaction T initiated at site Si, where the transaction coordinator is Ci.
When T completes its execution—that is, when all the sites at which T has executed
inform Ci that T has completed—Ci starts the 2PC protocol.
Phase 1.
Ci adds the record <prepare T> to the log, and forces the log onto stable storage. It then sends a prepare T message to all sites at which T executed.
On receiving such a message, the transaction manager at that site determines whether it is willing to commit its portion of T.
If the answer is no, it adds a record <no T> to the log, and then responds by sending an abort T message to Ci.
If the answer is yes, it adds a record <ready T> to the log, and forces the log (with all the log records corresponding to T) onto stable storage. The transaction manager then replies with a ready T message to Ci.
Phase 2.
When Ci receives responses to the prepare T message from all the sites, or when a prespecified interval of time has elapsed since the prepare T message was sent out, Ci can determine whether the transaction T can be committed or aborted.
Transaction T can be committed if Ci received a ready T message from all the participating sites. Otherwise, transaction T must be aborted.
Depending on the verdict, either a <commit T> record or an <abort T> record is added to the log and the log is forced onto stable storage. At this point, the fate of the transaction has been sealed.
Following this point, the coordinator sends either a commit T or an abort T message to all participating sites. When a site receives that message, it records the message in the log.
A site at which T executed can unconditionally abort T at any time before it
sends the message ready T to the coordinator. Once the message is sent, the
transaction is said to be in the ready state at the site. The ready T message is, in
effect, a promise by a site to follow the coordinator's order to commit T or to
abort T. To make such a promise, the needed information must first be stored in
stable storage. Otherwise, if the site crashes after sending ready T, it may be
unable to make good on its promise. Further, locks acquired by the transaction
must continue to be held till the transaction completes.
In some implementations of the 2PC protocol, a site sends an acknowledge T message to the coordinator at the end of the second phase of the protocol. When the coordinator receives the acknowledge T message from all the sites, it adds the record <complete T> to the log.
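A condensed Python sketch of the coordinator's two phases (the participant objects, message names, and in-memory logs are simplified stand-ins for the real protocol machinery):

class Participant:
    def __init__(self, name, will_commit=True):
        self.name, self.will_commit, self.log = name, will_commit, []

    def prepare(self, t):
        # Phase 1: force <ready T> (or <no T>) to stable storage, then vote.
        self.log.append(("ready" if self.will_commit else "no", t))
        return "ready" if self.will_commit else "abort"

    def decide(self, t, verdict):
        self.log.append((verdict, t))        # record the commit/abort decision

def two_phase_commit(t, participants, coordinator_log):
    coordinator_log.append(("prepare", t))   # force <prepare T> to the log
    votes = [p.prepare(t) for p in participants]
    verdict = "commit" if all(v == "ready" for v in votes) else "abort"
    coordinator_log.append((verdict, t))     # decision point: fate is sealed
    for p in participants:                   # Phase 2: broadcast the verdict
        p.decide(t, verdict)
    return verdict

log = []
sites = [Participant("S1"), Participant("S2", will_commit=False)]
print(two_phase_commit("T", sites, log))     # abort (S2 voted no)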
Handling of Failures:
1. Failure of a participating site:
If the coordinator Ci detects that a site has failed, it takes these actions:
If the site fails before responding with a ready T message to Ci, the coordinator assumes that it responded with an abort T message.
If the site fails after the coordinator has received the ready T message from the site, the coordinator executes the rest of the commit protocol in the normal fashion, ignoring the failure of the site.
When a participating site Sk recovers from a failure, it must examine its log to determine the fate of those transactions that were in the midst of execution when the failure occurred.
Let T be one such transaction. We consider each of the possible cases:
The log contains a <commit T> record. In this case, the site executes redo(T).
The log contains an <abort T> record. In this case, the site executes undo(T).
The log contains a <ready T> record. In this case, the site must consult Ci to determine the fate of T.
  If Ci is up, it notifies Sk regarding whether T committed or aborted. In the former case, it executes redo(T); in the latter case, it executes undo(T).
  If Ci is down, Sk must try to find the fate of T from other sites. It does so by sending a querystatus T message to all the sites in the system. On receiving such a message, a site must consult its log to determine whether T has executed there, and if T has, whether T committed or aborted. It then notifies Sk about this outcome. If no site has the appropriate information (that is, whether T committed or aborted), then Sk can neither abort nor commit T.
The log contains no control records (abort, commit, ready) concerning T. Thus, we know that Sk failed before responding to the prepare T message from Ci. Since the failure of Sk precludes the sending of such a response, by our algorithm Ci must abort T. Hence, Sk must execute undo(T).
2. Failure of the coordinator:
If the coordinator fails in the midst of the execution of the commit protocol for transaction T, then the participating sites must decide the fate of T:
If an active site contains a <commit T> record in its log, then T must be committed.
If an active site contains an <abort T> record in its log, then T must be aborted.
If some active site does not contain a <ready T> record in its log, then the failed coordinator Ci cannot have decided to commit T, because a site that does not have a <ready T> record in its log cannot have sent a ready T message to Ci. The coordinator may, however, have decided to abort T. Rather than wait for Ci to recover, it is preferable to abort T.
If none of the preceding cases holds, then all active sites must have a <ready T> record in their logs, but no additional control records (such as <abort T> or <commit T>).
3. Network partition:
When a network partitions, two possibilities exist:
The coordinator and all its participants remain in one partition. In this case, the
failure has no effect on the commit protocol.
The coordinator and its participants belong to several partitions. From the
viewpoint of the sites in one of the partitions, it appears that the sites in other
partitions have failed. Sites that are not in the partition containing the coordinator
simply execute the protocol to deal with failure of the coordinator. The coordinator
and the sites that are in the same partition as the coordinator follow the usual commit
protocol, assuming that the sites in the other partitions have failed.
Disadvantage:
Thus, the major disadvantage of the 2PC protocol is that coordinator failure may
result in blocking, where a decision either to commit or to abort T may have to be
postponed until Ci recovers.
Three-Phase Commit Protocols:
The three-phase commit (3PC) protocol is an extension of the two-phase commit
protocol that avoids the blocking problem under certain assumptions. In particular, it
is assumed that no network partition occurs, and not more than k sites fail, where k is
some predetermined number.
Third Phase:
The three-phase protocol introduces a third phase called the pre-commit. The aim of this is to remove the uncertainty period for participants that have voted to commit and are waiting for the global abort or commit message from the coordinator. When receiving a pre-commit message, participants know that all others have voted to commit. If a pre-commit message has not been received, the participant will abort and release any blocked resources.
Under these assumptions, the protocol avoids blocking by introducing an extra third
phase where multiple sites are involved in the decision to commit. Instead of directly
noting the commit decision in its persistent storage, the coordinator first ensures that
at least k other sites know that it intended to commit the transaction.
Failure of the Coordinator:
If the coordinator fails, the remaining sites first select a new coordinator. This new
coordinator checks the status of the protocol from the remaining sites; if the
coordinator had decided to commit, at least one of the other k sites that it informed
will be up and will ensure that the commit decision is respected. The new
coordinator restarts the third phase of the protocol if some site knew that the old
coordinator intended to commit the transaction. Otherwise the new coordinator
aborts the transaction.
Advantages & Disadvantages:
While the 3PC protocol has the desirable property of not blocking unless k sites fail,
it has the drawback that a partitioning of the network will appear to be the same as
more than k sites failing, which would lead to blocking. The protocol also has to be
carefully implemented to ensure that network partitioning (or more than k sites
failing) does not result in inconsistencies, where a transaction is committed in one
partition, and aborted in another. Because of its overhead, the 3PC protocol is not
widely used.
4. CONCURRENCY CONTROL
Describe the two-phase locking protocol with examples. May/June 2014
Explain about locking protocols. May/June 2016
What is concurrency control? How is it implemented in DBMS? Illustrate with a suitable example. Nov/Dec 2015
What is concurrency? Explain it in terms of locking mechanisms and the two-phase commit protocol. Nov/Dec 2014
The system must control the interaction among the concurrent transactions; this
control is achieved through one of a variety of mechanisms called concurrency-
control schemes. The concurrency-control schemes are all based on the
serializability property.
Different types of protocols/ schemes are used to control concurrent execution of
transactions.
Lock-Based Protocols
One way to ensure serializability is to require that data items be accessed in a
mutually exclusive manner; that is, while one transaction is accessing a data item, no
other transaction can modify that data item. The most common method used to
implement this requirement is to allow a transaction to access a data item only if it is
currently holding a lock on that item.
Locks
There are two modes in which a data item may be locked:
i) Shared mode lock: If a transaction Ti has obtained a shared-mode lock on item Q, then Ti can read, but cannot write, Q. It is denoted by S.
ii) Exclusive mode lock: If a transaction Ti has obtained an exclusive-mode lock on item Q, then Ti can both read and write Q. It is denoted by X.
A transaction requests a shared lock on data item Q by executing the lock –S(Q)
instruction. Similarly, a transaction requests an exclusive lock through the lock-X(Q)
instruction. A transaction can unlock a data item Q by the unlock (Q) instruction.
Given a set of lock modes, we can define a compatibility function on them as follows. Let A and B represent arbitrary lock modes. Such a function can be represented conveniently by a matrix, shown below. An element comp(A, B) of the matrix has the value true if and only if mode A is compatible with mode B.
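The compatibility matrix for the two modes, reconstructed here from the description that follows (the original figure is not reproduced):
comp      S        X
S         true     false
X         false    false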
Shared mode is compatible with shared mode, but not with exclusive mode. At any time, several shared-mode locks can be held simultaneously (by different transactions) on a particular data item. A subsequent exclusive-mode lock request has to wait until the currently held shared-mode locks are released.
To access a data item, transaction Ti must first lock that item. If the data item is
already locked by another transaction in an incompatible mode, the concurrency
control manager will not grant the lock until all incompatible locks held by other
transactions have been released. Thus, Ti is made to wait until all incompatible locks
held by other transactions have been released.
T1: lock-X(B);
read(B);
B := B − 50;
write(B);
unlock(B);
lock-X(A);
read(A);
A := A + 50;
write(A);
unlock(A).
Transaction T1.
T2: lock-S(A);
read(A);
unlock(A);
lock-S(B);
read(B);
unlock(B);
display(A + B).
Transaction T2
Transaction Ti may unlock a data item that it had locked at some earlier point, but a transaction must hold a lock on a data item as long as it accesses that item.
Consider the simplified banking system. Let A and B be two accounts that are accessed by transactions T1 and T2. Transaction T1 transfers $50 from account B to account A. Transaction T2 displays the total amount of money in accounts A and B—that is, the sum A + B.
Schedule 1
Suppose that the values of accounts A and B are $100 and $200, respectively. If
these two transactions are executed serially, either in the order T1, T2 or the order
T2, T1, then transaction T2 will display the value $300.
If, however, these transactions are executed concurrently transaction T2 displays
$250, which is incorrect. The reason for this mistake is that the transaction T1
unlocked data item B too early, as a result of which T2 saw an inconsistent state. The
lock must be granted in the interval of time between the lock-request operation and
the following action of the transaction.
Suppose now that unlocking is delayed to the end of the transaction. Transaction T3
corresponds to T1 with unlocking delayed. Transaction T4 corresponds to T2 with
unlocking delayed. You should verify that the sequence of reads and writes in
schedule 1, which lead to an incorrect total of $250 being displayed, is no longer
possible with T3 and T4.
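T3's code is not reproduced above; assuming it is exactly T1 with the two unlock instructions moved to the end, it reads:
T3: lock-X(B);
read(B);
B := B − 50;
write(B);
lock-X(A);
read(A);
A := A + 50;
write(A);
unlock(B);
unlock(A).
Transaction T3.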
T4: lock-S(A);
read(A);
lock-S(B);
read(B);
display(A + B);
unlock(A);
unlock(B).
Transaction T4.
Other schedules are possible. T4 will not print out an inconsistent result in any of
them. Unfortunately, locking can lead to an undesirable situation. Consider the
partial schedule for T3 and T4.
Schedule 2
Since T3 is holding an exclusive-mode lock on B and T4 is requesting a shared-mode
lock on B, T4 is waiting for T3 to unlock B. Similarly, since T4 is holding a shared-
mode lock on A and T3 is requesting an exclusive-mode lock on A, T3 is waiting for
T4 to unlock A. Thus, we have arrived at a state where neither of these transactions
can ever proceed with its normal execution. This situation is called deadlock.
We shall require that each transaction in the system follow a set of rules, called a
locking protocol, indicating when a transaction may lock and unlock each of the
data items. Locking protocols restrict the number of possible schedules. The set of
all such schedules is a proper subset of all possible serializable schedules.
Granting of Locks:
Starvation of transactions can be avoided by granting locks in the following manner:
When a transaction Ti requests a lock on a data item Q in a particular mode M, the
concurrency-control manager grants the lock provided that
1. There is no other transaction holding a lock on Q in a mode that conflicts with M.
2. There is no other transaction that is waiting for a lock on Q and that made its lock request before Ti.
Thus, a lock request will never get blocked by a lock request that is made later.
The Two-Phase Locking Protocol:
One protocol that ensures serializability is the two-phase locking protocol. This protocol requires that each transaction issue lock and unlock requests in two phases:
Growing phase. A transaction may obtain locks, but may not release any lock.
Shrinking phase. A transaction may release locks, but may not obtain any
new locks.
Initially, a transaction is in the growing phase. The transaction acquires locks as
needed. Once the transaction releases a lock, it enters the shrinking phase, and
it can issue no more lock requests.
Example:
For example, transactions T3 and T4 are two phase. On the other hand, transactions
T1 and T2 are not two phase. Note that the unlock instructions do not need to appear
at the end of the transaction. For example, in the case of transaction T3, we could
move the unlock(B) instruction to just after the lock-X(A) instruction, and still retain
the two-phase locking property.
Lock point:
The two-phase locking protocol ensures conflict serializability. The point in the
schedule where the transaction has obtained its final lock (the end of its growing
phase) is called the lock point of the transaction.
Now, transactions can be ordered according to their lock points—this ordering is, in
fact, a serializability ordering for the transactions.
Disadvantages:
Two-phase locking does not ensure freedom from deadlock. Observe that transactions T3 and T4 are two phase, but, in schedule 2, they are deadlocked.
Cascading rollback may occur under two-phase locking. As an illustration, consider a partial schedule in which each transaction observes the two-phase locking protocol, but the failure of T5 after the read(A) step of T7 leads to cascading rollback of T6 and T7.
Partial schedule under two-phase locking (shown as a figure in the original)
Strict two-phase locking protocol:
Cascading rollbacks can be avoided by a modification of two-phase locking called the strict two-phase locking protocol. This protocol requires not only that locking be two phase, but also that all exclusive-mode locks taken by a transaction be held until that transaction commits. This requirement ensures that any data written by an uncommitted transaction are locked in exclusive mode until the transaction commits, preventing any other transaction from reading the data.
Rigorous two-phase locking protocol:
Another variant of two-phase locking is the rigorous two-phase locking protocol, which requires that all locks be held until the transaction commits. We can easily verify that, with rigorous two-phase locking, transactions can be serialized in the order in which they commit. Most database systems implement either strict or rigorous two-phase locking.
Lock Conversions:
Consider the following two transactions, for which we have shown only some
of the significant read and write operations:
T8: read(a1);
read(a2);
. . .
read(an);
write(a1).
T9: read(a1);
read(a2);
display(a1 + a2).
If we employ the two-phase locking protocol, then T8 must lock a1 in exclusive
mode. Therefore, any concurrent execution of both transactions amounts to a serial
execution. Notice, however, that T8 needs an exclusive lock on a1 only at the end of
its execution, when it writes a1. Thus, if T8 could initially lock a1 in shared mode,
and then could later change the lock to exclusive mode, we could get more
concurrency, since T8 and T9 could access a1 and a2 simultaneously.
This observation leads us to a refinement of the basic two-phase locking protocol, in
which lock conversions are allowed.
We denote conversion from shared to exclusive mode by upgrade, and from exclusive to shared mode by downgrade. Lock conversion cannot be allowed arbitrarily: upgrading can take place only in the growing phase, whereas downgrading can take place only in the shrinking phase. Strict two-phase locking and rigorous two-phase locking (with lock conversions) are used extensively in commercial database systems.
A simple but widely used scheme automatically generates the appropriate lock and
unlock instructions for a transaction, on the basis of read and write requests from the
transaction:
• When a transaction Ti issues a read(Q) operation, the system issues a lock-S(Q) instruction followed by the read(Q) instruction.
• When Ti issues a write(Q) operation, the system checks to see whether Ti already holds a shared lock on Q. If it does, then the system issues an upgrade(Q) instruction, followed by the write(Q) instruction. Otherwise, the system issues a lock-X(Q) instruction, followed by the write(Q) instruction.
All locks obtained by a transaction are unlocked after that transaction commits
or aborts.
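The following is a minimal single-threaded Python sketch of this automatic lock-generation scheme under strict two-phase locking. The LockManager here is a hypothetical stand-in that merely records lock modes and never blocks; it exists only to show where lock-S, lock-X, upgrade, and unlock would be issued.

    class LockManager:
        def __init__(self):
            self.db = {'Q': 0}              # the "database": item -> value
            self.locks = {}                 # item -> (tid, mode); sketch only, no blocking

        def acquire(self, tid, item, mode):
            self.locks[item] = (tid, mode)  # a real manager would queue conflicting requests

        def release(self, tid, item):
            self.locks.pop(item, None)

    class Transaction:
        def __init__(self, tid, lm):
            self.tid, self.lm, self.held = tid, lm, {}

        def read(self, q):                  # read(Q): issue lock-S(Q), then read
            if q not in self.held:
                self.lm.acquire(self.tid, q, 'S')
                self.held[q] = 'S'
            return self.lm.db[q]

        def write(self, q, value):          # write(Q): upgrade(Q) if share-locked, else lock-X(Q)
            if q not in self.held or self.held[q] == 'S':
                self.lm.acquire(self.tid, q, 'X')
            self.held[q] = 'X'
            self.lm.db[q] = value

        def commit(self):                   # strict 2PL: exclusive locks held until commit
            for q in list(self.held):
                self.lm.release(self.tid, q)
            self.held.clear()

    lm = LockManager()
    t1 = Transaction('T1', lm)
    t1.write('Q', t1.read('Q') + 1)         # lock-S(Q), read(Q), upgrade(Q), write(Q)
    t1.commit()                             # all locks released only now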
Part - C
5. DEADLOCKS

Write short notes on deadlock. Nov/Dec 2014
Deadlock occurs when each transaction T in a set of two or more transactions is waiting for some item that is locked by some other transaction T′ in the set.
Hence, each transaction in the set is on a waiting queue, waiting for one of the other
transactions in the set to release the lock on an item.
A simple example is shown in Figure (a) & (b), where the two transactions T'1 and
T'2 are deadlocked in a partial schedule; T'1 is on the waiting queue for X, which is
locked by T'2, while T'2 is on the waiting queue for Y, which is locked by T'1.
Meanwhile, neither T'1 nor T'2 nor any other transaction can access items X and Y.
There are two principal methods for dealing with deadlock.
1. Deadlock prevention:
There are two approaches for deadlock prevention:
1. One approach ensures that no cyclic waits can occur by ordering the requests for locks or requiring all locks to be acquired together. This approach requires that each transaction lock all data items before it begins execution: either all data items are locked in one step or none are locked.
Disadvantages:
It is hard to predict before the transaction begins, what data items need to be
locked.
Data item utilization may be very low, since many of the data items may be
locked but unused for a long time.
2. The second approach for deadlock prevention is to use preemption and transaction rollbacks. With preemption, when a transaction T2 requests a lock that transaction T1 holds, the lock granted to T1 may be preempted by rolling back T1 and granting the lock to T2.
To control preemption, a unique timestamp is assigned to each transaction. The system uses timestamps to decide whether a transaction should wait or roll back.
Two different deadlock prevention schemes using timestamp are:
1. Wait-die:
The wait-die scheme is a non-preemptive technique. In this scheme, when transaction Ti requests a data item held by Tj, Ti is allowed to wait only if it has a timestamp smaller than that of Tj (i.e., Ti is older than Tj). Otherwise, Ti is rolled back (dies).
2. Wound-wait:
The wound-wait scheme is a preemptive technique. In this scheme, when transaction Ti requests a data item held by Tj, Ti is allowed to wait only if it has a timestamp greater than that of Tj (i.e., Ti is younger than Tj). Otherwise, Tj is rolled back (wounded by Ti).
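The two rules are easy to state in code. Below is a small illustrative Python sketch (function and variable names are made up); a smaller timestamp means an older transaction:

    def wait_die(ts_requester, ts_holder):
        # Non-preemptive: an older requester waits; a younger requester dies.
        return 'wait' if ts_requester < ts_holder else 'roll back requester'

    def wound_wait(ts_requester, ts_holder):
        # Preemptive: an older requester wounds (rolls back) the holder;
        # a younger requester waits.
        return 'roll back holder' if ts_requester < ts_holder else 'wait'

    # Ti (timestamp 5) requests an item held by Tj (timestamp 10):
    print(wait_die(5, 10))     # 'wait'              (Ti is older, so it waits)
    print(wound_wait(5, 10))   # 'roll back holder'  (older Ti wounds younger Tj)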
Timeout-based schemes:
This approach is based on lock timeouts. A transaction that has requested a lock waits for at most a specified amount of time. If the lock has not been granted within that time, the transaction is said to time out; it rolls itself back and restarts. Thus, if there was a deadlock, one or more transactions will time out and roll back, allowing the others to proceed.
Advantages:
This scheme is easy to implement.

It works well if transactions are short and if long waits are likely to be due to
deadlocks.
Disadvantages:
It is hard to decide how long a transaction should wait: too long a wait results in unnecessary delay, while too short a wait results in needless rollbacks.

Starvation is also possible with this scheme.
Deadlock detection and recovery:
Deadlock detection:
Deadlock can be described precisely in terms of a directed graph called a wait-for graph. The graph consists of a pair G = (V, E), where V is a set of vertices and E is a set of edges. The set of vertices consists of all transactions in the system. Each element in the set E of edges is an ordered pair Ti → Tj. If Ti → Tj is in E, then transaction Ti is waiting for transaction Tj to release a data item that it needs.
A deadlock exists in the system if and only if the wait-for-graph contains a
cycle.
In figure (a) below, T1 is waiting for T2 and T3, T3 is waiting for T2, and T2 is waiting for T4; there is no cycle. If transaction T4 now waits for T3 to release a data item, the added edge T4 → T3 forms the cycle T2 → T4 → T3 → T2. Thus, as shown in figure (b), T2, T4 and T3 are deadlocked.
[Figure: (a) a wait-for graph without a cycle; (b) a wait-for graph with the cycle T2 → T4 → T3 → T2]
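Detection is therefore just cycle detection on the wait-for graph. A minimal Python sketch (representing the graph as a dictionary from each transaction to the transactions it waits for) reproduces the example above:

    def has_cycle(wait_for):
        visited, on_stack = set(), set()

        def dfs(t):                          # depth-first search for a back edge
            visited.add(t)
            on_stack.add(t)
            for u in wait_for.get(t, ()):
                if u in on_stack or (u not in visited and dfs(u)):
                    return True
            on_stack.discard(t)
            return False

        return any(t not in visited and dfs(t) for t in list(wait_for))

    g = {'T1': ['T2', 'T3'], 'T3': ['T2'], 'T2': ['T4']}   # figure (a)
    print(has_cycle(g))                      # False: no deadlock
    g['T4'] = ['T3']                         # figure (b): adds T4 -> T3
    print(has_cycle(g))                      # True: T2 -> T4 -> T3 -> T2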
Recovery from deadlock:
When a deadlock detection algorithm determines that a deadlock exists, the system must recover from the deadlock. The most common solution is to roll back one or more transactions to break the deadlock. Three actions need to be taken.
1. Selection of a victim: We should roll back those transactions that will incur the minimum cost. Many factors may determine the cost of a rollback, including
How long the transaction has computed
How many data items the transaction has used
How many more items the transaction needs
How many transactions will be involved in rollback.
2. Rollback: The simplest solution is total rollback, i.e., abort the transaction and then restart it. However, it is more effective to roll back the transaction only as far as necessary to break the deadlock. Partial rollback requires the system to maintain additional information about the state of all running transactions; specifically, the sequence of lock requests and grants has to be recorded, and it has to be decided which locks need to be released. The selected transaction must be rolled back to the point where it first obtained the lock, undoing all of its actions after that point.
3. Starvation: It is possible that the same transaction is always picked as a victim. As a result, that transaction never completes its designated task; this is starvation. It has to be ensured that a transaction can be picked as a victim only a finite number of times. The most common solution is to include the number of rollbacks in the cost factor.
-o0o-o0o-o0o-
UNIT IV
TRENDS IN DATABASE TECHNOLOGY
Overview of Physical Storage Media – Magnetic Disks – RAID – Tertiary storage –
File Organization – Organization of Records in Files – Indexing and Hashing –
Ordered Indices – B+ tree Index Files – B tree Index Files – Static Hashing –
Dynamic Hashing - Introduction to Distributed Databases- Client server technology-
Multidimensional and Parallel databases- Spatial and multimedia databases- Mobile
and web databases- Data Warehouse-Mining- Data marts.
1. What is the use of RAID? What are the factors to be taken into account
when choosing a RAID level?
A variety of disk-organization techniques, collectively called redundant arrays
of independent disks are used to improve the performance and reliability.
Monetary cost of extra disk storage requirements.
Performance requirements in terms of number of I/O operations.
Performance when a disk has failed.
Performance during rebuild.
2. What is a primary index or ordered indices?
A primary index is an index whose search key also defines the sequential
order of the file. If the index is created on the primary key of the table then it is
called as Primary Indexing. Since these primary keys are unique to each record
and it has 1:1 relation between the records, it is much easier to fetch the record
using it.
3. What is B-Tree?
A B-tree eliminates the redundant storage of search-key values. It allows search-key values to appear only once.
4. What is a B+-Tree index?
A B+-tree index takes the form of a balanced tree in which every path from the root of the tree to a leaf of the tree is of the same length.
5. Differentiate static and dynamic hashing. Nov/Dec 2014 , April/May 2015,
Nov/Dec 15
STATIC HASHING | DYNAMIC HASHING
Number of buckets is fixed. | Number of buckets is not fixed.
As the file grows, performance decreases. | Performance does not degrade as the file grows.
Space overhead is more. | Space overhead is minimal.
No bucket address table is used. | A bucket address table is used.
Open hashing and closed hashing are forms of it. | Extendible hashing and linear hashing are forms of it.
Implementation is not complex. | Implementation is complex.
It is a less attractive technique. | It is a highly attractive technique.
The system accesses the bucket directly. | The bucket address table is accessed before the bucket.
Overflow chaining is used. | Overflow chaining is not used.
6. What is data mining? Nov/Dec 2014
Data mining (sometimes called data or knowledge discovery) is the
process of analyzing data from different perspectives and summarizing it into
useful information - information that can be used to increase revenue, cut costs, or both. Data mining is the process of finding correlations or patterns
among dozens of fields in large relational databases.
7. Define Data warehouse. Nov/Dec 2014
A data warehouse is a subject-oriented, integrated, time-variant and
non-volatile collection of data in support of management's decision making
process. Subject-Oriented: A data warehouse can be used to analyze a particular
subject area.
8. What factors are used to evaluate both ordered indexing and hashing techniques?
Access types
Access time
Insertion time
Deletion time
Space overhead
9. Define rotational latency time.
The time spent in waiting for the sector to be accessed to appear under the
head is called the rotational latency time.
10. Give an example of a join that is not a simple equi-join for which partitioned parallelism can be used. Nov/Dec 2015
A join whose predicate is a conjunction of an equality condition with additional comparisons, for example a join of r and s with the predicate r.A = s.B and r.C < s.D, is not a simple equi-join, yet partitioned parallelism can still be used by partitioning r on A and s on B and evaluating the extra condition locally at each processor.
11. Write about the four types (star, snowflake, galaxy and fact constellation) of data warehouse schemas. April/May 2015
The star schema consists of a fact table with a single table for each dimension.
The snowflake schema is a variation on the star schema in which the dimension tables are organized into a hierarchy by normalizing them. Some installations normalize data warehouses up to the third normal form so that
they can access the data warehouse to the finest level of detail. A fact constellation (also called a galaxy schema) is a set of fact tables that share some dimension tables; for example, sales and shipping fact tables might share a dimension table called product. Fact constellations limit the possible queries for the warehouse.
12. What is meant by garbage collection? May/June 2016
Garbage collection is the process of destroying objects that are no longer referenced and freeing the resources those objects used. In Java there is a background process that performs garbage collection. In an object database, garbage collection determines whether objects that are no longer referenced by the database are destroyed automatically (tracking references typically requires bi-directional object relationships). This keeps external programs from having to track the use of object pointers.
13. Define software and hardware RAID systems. May/June 2016
A hardware-based system manages the RAID subsystem independently of the host and presents to the host only a single disk per RAID array.
Software RAID implements the various RAID levels in the kernel disk (block device) code. Software RAID performance can rival that of hardware RAID, but the performance of a software-based array depends on the server's CPU performance and load.
PART – B
1. RAID TECHNOLOGY
What is RAID? Briefly explain different levels of RAID. Discuss the
factors to be considered in choosing a RAID level.
May/June 2016, Nov/Dec 2014, Apr/May 2015, Nov/Dec 2015
Definition-Redundant Array of Inexpensive (Independent) Disks
A major advance in secondary storage technology is represented by the
development of RAID, which originally stood for Redundant Arrays of
Inexpensive Disks. Lately, the "I" in RAID is said to stand for Independent. The
RAID idea received a very positive endorsement by industry and has been developed
into an elaborate set of alternative RAID architectures (RAID levels 0 through 6).
Features of RAID (Data Striping, Mirroring, Block/ Bit Level Striping)
The natural solution is a large array of small independent disks acting as a
single higher-performance logical disk. A concept called data striping is used,
which utilizes parallelism to improve disk performance. Data striping distributes data
transparently over multiple disks to make them appear as a single large, fast disk.
Figure shows a file distributed or striped over four disks. Striping improves overall
I/O performance by allowing multiple I/Os to be serviced in parallel, thus providing
high overall transfer rates. Data striping also accomplishes load balancing among
disks. Moreover, by storing redundant information on disks using parity or some
other error correction code, reliability can be improved.
One technique for introducing redundancy is called mirroring or shadowing.
Data is written redundantly to two identical physical disks that are treated as one
logical disk. When data is read, it can be retrieved from the disk with shorter
queuing, seek, and rotational delays.
Disk striping may be applied at a finer granularity by breaking up a byte of data into bits and spreading the bits to different disks. Thus, bit-level data striping consists of splitting a byte of data and writing bit j of each byte to the jth disk. With 8-bit bytes, eight physical disks may be considered as one logical disk with an eightfold increase in the data transfer rate. Each disk participates in each I/O request, and the total amount of data read per request is eight times as much.
The granularity of data interleaving can be higher than a bit; for example,
blocks of a file can be striped across disks, giving rise to block-level striping. With
block-level striping, multiple independent requests that access single blocks (small
requests) can be serviced in parallel by separate disks, thus decreasing the queuing
time of I/O requests. Requests that access multiple blocks (large requests) can be
parallelized, thus reducing their response time.
Levels-RAID 0 to RAID 6
Factors
o What type of data will be stored on the RAID volume?
o What applications will be accessing or running on the RAID volume?
o Is performance, redundancy, or a combination of both important to you?
RAID 0
RAID 0 consists of striping, without mirroring or parity. The capacity of a RAID 0 volume is the sum of the capacities of the disks in the set, the same as with a spanned volume.
There is no added redundancy for handling disk failures, just as with a
spanned volume. Thus, failure of one disk causes the loss of the entire RAID 0
volume, with reduced possibilities of data recovery when compared to a broken
spanned volume.
Striping distributes the contents of files roughly equally among all disks in
the set, which makes concurrent read or write operations on the multiple disks
almost inevitable and results in performance improvements.
The concurrent operations make the throughput of most read and write
operations equal to the throughput of one disk multiplied by the number of disks.
Increased throughput is the big benefit of RAID 0 versus spanned volume.
RAID 1
RAID 1 consists of data mirroring, without parity or striping. Data is written
identically to two (or more) drives, thereby producing a "mirrored set" of drives.
Thus, any read request can be serviced by any drive in the set. If a request is
broadcast to every drive in the set, it can be serviced by the drive that accesses the
data first (depending on its seek time and rotational latency), improving
performance. Sustained read throughput, if the controller or software is optimized for
it, approaches the sum of throughputs of every drive in the set, just as for RAID 0.
Actual read throughput of most RAID 1 implementations is slower than the
fastest drive. Write throughput is always slower because every drive must be
updated, and the slowest drive limits the write performance. The array continues to
operate as long as at least one drive is functioning.
RAID 2
RAID 2 consists of bit-level striping with dedicated Hamming-code parity.
All disk spindle rotation is synchronized and data is striped such that each sequential
bit is on a different drive. Hamming-code parity is calculated across corresponding
bits and stored on at least one parity drive. This level is of historical significance
only; although it was used on some early machines (for example,
the Thinking Machines CM-2), as of 2014 it is not used by any of the commercially
available systems.
RAID 3
RAID 3 consists of byte-level striping with dedicated parity. All disk
spindle rotation is synchronized and data is striped such that each sequential byte is
on a different drive. Parity is calculated across corresponding bytes and stored on a
dedicated parity drive. Although implementations exist, RAID 3 is not commonly
used in practice.
RAID 4
RAID 4 consists of block-level striping with dedicated parity. This level
was previously used by NetApp, but has now been largely replaced by a proprietary
implementation of RAID 4 with two parity disks, called RAID-DP. The main
advantage of RAID 4 over RAID 2 and 3 is I/O parallelism: in RAID 2 and 3, a
single read/write I/O operation requires reading the whole group of data drives,
while in RAID 4 one I/O read/write operation does not have to spread across all data
drives. As a result, more I/O operations can be executed in parallel, improving the
performance of small transfers.
RAID 5
RAID 5 consists of block-level striping with distributed parity. Unlike
RAID 4, parity information is distributed among the drives, requiring all drives but
one to be present to operate. Upon failure of a single drive, subsequent reads can be
calculated from the distributed parity such that no data is lost. RAID 5 requires at
least three disks. RAID 5 is seriously affected by the general trends regarding array
rebuild time and the chance of drive failure during rebuild. Rebuilding an array
requires reading all data from all disks, opening a chance for a second drive failure and the loss of the entire array. In August 2012, Dell posted an advisory against the use
of RAID 5 in any configuration on Dell EqualLogic arrays and RAID 50 with "Class
2 7200 RPM drives of 1 TB and higher capacity" for business-critical data.
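As an aside, the parity used by RAID 4/5/6-style schemes is bytewise XOR, so any single missing block can be rebuilt by XOR-ing the surviving blocks. A tiny Python sketch with made-up block contents:

    def xor_blocks(blocks):
        # Bytewise XOR of equal-sized blocks; the XOR of all data blocks is the parity.
        out = bytearray(len(blocks[0]))
        for b in blocks:
            for i, byte in enumerate(b):
                out[i] ^= byte
        return bytes(out)

    d1, d2, d3 = b'\x0f\x0f', b'\x33\x33', b'\x55\x55'
    parity = xor_blocks([d1, d2, d3])

    # The disk holding d2 fails: rebuild it from the surviving blocks plus parity.
    rebuilt = xor_blocks([d1, d3, parity])
    assert rebuilt == d2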
RAID 6
RAID 6 consists of block-level striping with double distributed parity.
Double parity provides fault tolerance up to two failed drives. This makes larger
RAID groups more practical, especially for high-availability systems, as large-
capacity drives take longer to restore. RAID 6 requires a minimum of four disks. As
with RAID 5, a single drive failure results in reduced performance of the entire array
until the failed drive has been replaced. With a RAID 6 array, using drives from
multiple sources and manufacturers, it is possible to mitigate most of the problems
associated with RAID 5. The larger the drive capacities and the larger the array size,
the more important it becomes to choose RAID 6 instead of RAID 5.
2. INDEXING AND HASHING
2A. Explain the various indexing and hashing schemes used in database
environment. Nov/Dec 2015
Single Level Index
o Primary Index
o Clustering Index
o Secondary Index
Multilevel Index
Hashing
Static Hashing
Dynamic Hashing
Indexing
Index Structure for files
An index is a set of one or more keys, each pointing to rows in a table. An
index allows more efficient access to rows in a table by creating a direct path to the
data through the pointers.
To create an index, the command is
CREATE INDEX <index name> ON <table name> (<column name>);
Indexes are access structures which are used to speed up the retrieval of
records in response to certain search conditions.
The index structures provide secondary access paths which provide alternate
ways of accessing the records without affecting the physical placement of records on
disk.
They enable efficient access to records based on the indexing fields that are
used to construct the index.
Any field can be used to create an index, and multiple indexes on different fields can be constructed on the same file.
A variety of indexes are possible; each of them uses a particular data structure
to speed up the search.
To find a record or records in the file based on a certain selection criterion on an indexing field, one has to initially access the index, which points to one or more blocks in the file where the required records are located.
Different types of Indexes
Apart from the primary file organisation such as unordered, ordered or hashed organisation, there are additional access structures called indexes which are used to speed up the retrieval of records in response to certain search conditions.
The efficient access to records is based on the indexing fields that are used to
construct the index. To find a record or records in the file based on a certain
selection criterion on an indexing field, one has to initially access the index, which
points to one or more blocks in the file where the required records are located. The
most prevalent types of indexes are based on
ordered files (single level Indices)
Tree data structures (Multilevel indexes)
Types of single level ordered Indexes:
For a file with a given record structure consisting of several fields, the index
access structure is usually defined on a single field of a file, called an indexing field.
The index stores each value of the index field along with a list of pointers to all disk
blocks that contain records with that field value. The values in the index are ordered so that binary search can be performed on the index. Also, the index file is much smaller than the data file, so searching the index using a binary search is efficient.
Single-level ordered indexes are of three types: a primary index (a non-dense index), a clustering index, and a secondary index.
A primary index is specified in ordering key field of an ordered file of
records. An ordering key field is used to physically order the file records on disk and
every record has a unique value for that field.
If the ordering field is not a key field then the clustering index is used.
Secondary index can be specified on any non-ordering field of a file.
Primary Indexes:
A primary index is an ordered file whose records are of fixed length with two
fields. The first field is of the same data type as the ordering key field called the
primary key of the data file.
The second field is a pointer to a disk block. There is one index entry in the
index file for each block in the data file. The total number of entries in the index is
the same as the number of disk blocks in the ordered file. The first record in each
block of the data file is called the anchor record of the block or block anchor.
Indexes can also be characterised as dense or sparse. A dense index has an index entry for every search key value (i.e., for every record). A sparse index has index entries for only some of the search values. Therefore, a primary index is a non-dense (sparse) index.
The major problem with a primary index is insertion and deletion of records.
If we insert a new record in the data file at its correct position, we have to not
only move records to make space for new record but also change index
entries, since moving records will change the anchor records of some blocks.

Record deletion is handled using deletion markers.
[Figure: Primary Index — a primary index on the ordering key field Name. Each index entry holds a block anchor value (Aaron, Acosta, Adams, ..., Wong) and a block pointer to the corresponding block of the data file, whose records carry fields such as SSN, Bdate, Sal and Post.]
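To make the lookup path concrete, here is a small Python sketch of a search through a sparse primary index (names and block layout are illustrative, echoing the figure): binary-search the index for the last block anchor not exceeding the key, then scan only that one block.

    import bisect

    index = [('Aaron', 0), ('Acosta', 1), ('Adams', 2)]      # (block anchor, block number)
    blocks = [['Aaron', 'Abbot'], ['Acosta', 'Adamik'], ['Adams', 'Adler']]

    def lookup(key):
        anchors = [a for a, _ in index]
        pos = bisect.bisect_right(anchors, key) - 1          # last anchor <= key
        if pos < 0:
            return None                                      # key precedes the first anchor
        block_no = index[pos][1]
        return key if key in blocks[block_no] else None      # scan a single block

    print(lookup('Adamik'))    # found in block 1 without scanning the whole file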
Clustering Index:
If the records of a file are physically ordered on a non-key field which does not have a distinct value for each record, that field is called the clustering field, and a clustering index can be created on it to speed up the retrieval of records.
A clustering index is also an ordered file with two fields. The first field is of the same type as the clustering field of the data file, and the second field is a pointer to a block in the data file. There is one entry in the clustering index for each distinct value of
the clustering field. It is an example of a non-dense index, because it has an entry for every distinct value of the indexing field rather than for every record in the file.
There is a problem with insertion and deletion of records. For insertion, it is
common to reserve a whole block for each value of clustering field. All records with
that value are placed in that block. This makes insertion straightforward.
[Figure: Clustering index on the non-key field Dno — index entries (Dno 1, 2, 3, ..., 10) hold block pointers, and a separate block cluster (chained by block pointers) is kept for each group of records that share the same value for the clustering field.]
Secondary Indexes:
[Figure: Secondary Index — a dense secondary index on a non-ordering key field SSN; there is one index entry per record (SSN values 1 to 10 in order), each with a pointer to the block of the unordered data file containing that record.]
A secondary index is also an ordered file with two fields. The first field is of the same type as some non-ordering field of the data file that is an indexing field. The second field is either a block pointer or a record pointer.
Because the records of the data file are not physically ordered by values of the secondary key field, we cannot use block anchors. So an index entry is created for each record in the data file rather than for each block. The pointer in the index is either to the block in which the record is stored or to the record itself. Hence the secondary index is dense.
A secondary index usually needs more storage space and longer search time than a primary index because of its larger number of entries.
Multilevel Indexes:
[Figure: A two-level primary index — the second (top) level indexes the first-level index blocks, whose entries (key values 2, 8, 15, 24, 35, ...) in turn point to the data blocks.]
A multilevel index considers the index file as the first level of a multilevel index; this first level is an ordered file with a distinct value for each key field entry. Hence we can
create a primary index for the first level. This index to the first level is called the
second level of the multilevel index.
The blocking factor bfri for the second level and all subsequent levels is the same as that for the first-level index because all index entries are the same size.
We require a second level only if the first level needs more than one block of disk storage. We can repeat the preceding process until all the entries of some index level t fit in a single block. The block at the tth level is called the top index level.
2B. HASHING
Definition of hashing
Hashing for disk files is called external hashing. To suit the characteristics of
disk storage, the target address space is made of buckets, each of which holds
multiple records. A bucket is either one disk block or a cluster of contiguous blocks.
The hashing function maps a key into a relative bucket number, rather than assign an
absolute block address to the bucket.
The collision problem is less severe with buckets, because as many records as
will fit in a bucket can hash to the same bucket without causing problems. A
variation of chaining can be used in which a pointer is maintained in each bucket to a
linked list of overflow records for the bucket. The pointers in the linked list should
be record pointers, which include both a block address and a relative record
position within the block.
Hashing provides the fastest possible access for retrieving an arbitrary record
given the value of its hash field. Although most good hash functions do not maintain
records in order of hash field values, some functions—called order preserving—do.
A simple example of an order preserving hash function is to take the leftmost three
digits of an invoice number field as the hash address and keep the records sorted by
invoice number within each bucket.
The hashing scheme described is called static hashing because a fixed
number of buckets M is allocated. This can be a serious drawback for dynamic files.
Suppose that we allocate M buckets for the address space and let m be the maximum
number of records that can fit in one bucket; then at most (m * M) records will fit in
the allocated space. If the number of records turns out to be substantially fewer than
(m * M), we are left with a lot of unused space.
When using external hashing, searching for a record given a value of some
field other than the hash field is as expensive as in the case of an unordered file.
Record deletion can be implemented by removing the record from its bucket. If the
bucket has an overflow chain, we can move one of the overflow records into the
bucket to replace the deleted record. If the record to be deleted is already in
overflow, we simply remove it from the linked list.
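A minimal Python sketch of static external hashing follows (the bucket count M, bucket capacity m, and keys are all illustrative): records hash to one of M fixed buckets, and once a bucket is full further records go on its overflow chain.

    M, m = 4, 2                              # M fixed buckets, each holding m records
    buckets = [[] for _ in range(M)]
    overflow = [[] for _ in range(M)]        # one overflow chain per bucket

    def insert(key):
        b = key % M                          # hash function: relative bucket number
        target = buckets[b] if len(buckets[b]) < m else overflow[b]
        target.append(key)

    def search(key):
        b = key % M
        return key in buckets[b] or key in overflow[b]

    for k in (3, 7, 11, 15, 4):              # 3, 7, 11 and 15 all hash to bucket 3
        insert(k)
    print(search(11), search(5))             # True False (11 sits on the overflow chain)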
A major drawback of the static hashing scheme just discussed is that the hash
address space is fixed. Hence, it is difficult to expand or shrink the file dynamically.
The first scheme—extendible hashing—stores an access structure in addition to the
file, and hence is somewhat similar to indexing. The main difference is that the
access structure is based on the values that result after application of the hash
function to the search field. In indexing, the access structure is based on the values of
the search field itself. The second technique, called linear hashing, does not require
additional access structures.
These hashing schemes take advantage of the fact that the result of applying a
hashing function is a nonnegative integer and hence can be represented as a binary
number. The access structure is built on the binary representation of the hashing
function result, which is a string of bits. This is the hash value of a record. Records
are distributed among buckets based on the values of the leading bits in their hash
values.
Extendible Hashing
In extendible hashing, a type of directory—an array of 2^d bucket addresses—is maintained, where d is called the global depth of the directory. The integer value corresponding to the first (high-order) d bits of a hash value is used as an index to the array to determine a directory entry, and the address in that entry determines the bucket in which the corresponding records are stored.
However, there does not have to be a distinct bucket for each of the 2^d directory locations. Several directory locations with the same first d′ bits (where d′ is the local depth of the bucket) for their hash values may contain the same bucket address if all the records that hash to these locations fit in a single bucket.
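The directory lookup itself is simple; the following Python fragment sketches only that step (bucket splitting and directory doubling are omitted, and the directory contents, depths, and key are illustrative):

    d = 2                                        # global depth: directory has 2**d entries
    bucket_A, bucket_B, bucket_C = [], [], []
    directory = [bucket_A, bucket_A,             # entries 00 and 01 share bucket A (local depth 1)
                 bucket_B, bucket_C]             # entries 10 and 11 (local depth 2)

    def bucket_for(key, hash_bits=8):
        h = hash(key) & ((1 << hash_bits) - 1)   # hash value, viewed as a bit string
        prefix = h >> (hash_bits - d)            # first (high-order) d bits
        return directory[prefix]

    bucket_for('invoice-42').append('invoice-42')    # the record lands in exactly one bucket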
3. Write short notes about B Trees. May/June2016, Nov/Dec 2014
B-trees and B+-trees are special cases of the well-known tree data structure. A
tree is formed of nodes. Each node in the tree, except for a special node called the
root, has one parent node and several—zero or more—child nodes. The root node
has no parent. A node that does not have any child nodes is called a leaf node; a
nonleaf node is called an internal node.
The level of a node is always one more than the level of its parent, with the
level of the root node being zero. A subtree of a node consists of that node and all
its descendant nodes—its child nodes, the child nodes of its child nodes, and so on.
A precise recursive definition of a subtree is that it consists of a node n and the
subtrees of all the child nodes of n.
B-trees
The B-tree has additional constraints that ensure that the tree is always balanced and that the space wasted by deletion, if any, never becomes excessive. A B-tree of order p, when used as an access structure on a key field to search for records in a data file, can be defined as follows (a small structural sketch follows the definition):
1. Each internal node in the B-tree is of the form
<P1, <K1, Pr1>, P2, <K2, Pr2>, ..., <Kq-1, Prq-1>, Pq>
where q <= p. Each Pi is a tree pointer—a pointer to another node in the B-tree. Each Pri is a data pointer—a pointer to the record whose search key field value is equal to Ki (or to the data file block containing that record).
2. Within each node, K1 < K2 < ... < Kq-1.
3. For all search key field values X in the subtree pointed at by Pi (the ith subtree), we have:
Ki-1 < X < Ki for 1 < i < q; X < K1 for i = 1; and Kq-1 < X for i = q.
4. Each node has at most p tree pointers.
5. Each node, except the root and leaf nodes, has at least ⌈p/2⌉ tree pointers. The root node has at least two tree pointers unless it is the only node in the tree.
6. A node with q tree pointers, q <= p, has q - 1 search key field values (and hence has q - 1 data pointers).
7. All leaf nodes are at the same level. Leaf nodes have the same structure as internal nodes except that all of their tree pointers Pi are null.
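A small structural sketch of such a node in Python (illustrative and search-only; insertion with node splitting is omitted):

    class BTreeNode:
        def __init__(self, p):
            self.p = p                  # order: at most p tree pointers per node
            self.keys = []              # K1 < K2 < ... < Kq-1
            self.data = []              # one data pointer Pri for each key Ki
            self.children = []          # tree pointers P1 ... Pq (empty at a leaf)

        def search(self, key):
            # Descend using the subtree property Ki-1 < X < Ki.
            i = 0
            while i < len(self.keys) and key > self.keys[i]:
                i += 1
            if i < len(self.keys) and key == self.keys[i]:
                return self.data[i]                  # found: return the data pointer
            if not self.children:                    # at a leaf all Pi are null
                return None
            return self.children[i].search(key)     # recurse into subtree Pi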
4. Write short notes on the following. Nov/Dec 2015
Spatial and Multimedia databases (8)
Mobile and Web databases (8)
A. Spatial Databases:
Spatial databases provide concepts for databases that keep track of objects in a
multi-dimensional space. For example, cartographic databases that store maps
include two-dimensional spatial descriptions of their objects—from countries and
states to rivers, cities, roads, seas, and so on.
These databases are used in many applications, such as environmental,
emergency, and battle management. Other databases, such as meteorological
databases for weather information, are three-dimensional, since temperatures and
other meteorological information are related to three-dimensional spatial points. In
general, a spatial database stores objects that have spatial characteristics that describe
them. The spatial relationships among the objects are important, and they are often
needed when querying the database.
The following categories illustrate three typical types of spatial queries (a small sketch follows the list):
Range query: Finds the objects of a particular type that are within a given spatial
area or within a particular distance from a given location. (For example, finds all
hospitals within the Dallas city area, or finds all ambulances within five miles of
an accident location.)
Nearest neighbor query: Finds an object of a particular type that is closest to a
given location. (For example, finds the police car that is closest to a particular
location.)
Spatial joins or overlays: Typically joins the objects of two types based on some
spatial condition, such as the objects intersecting or overlapping spatially or
being within a certain distance of one another.
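Here is the promised toy Python sketch of the first two query types over 2-D points (the coordinates are made up and no spatial index is used; a real system would rely on a structure such as an R-tree):

    from math import dist

    hospitals = {'H1': (2.0, 3.0), 'H2': (8.0, 1.0), 'H3': (4.5, 4.5)}

    def range_query(objects, center, radius):
        # All objects within a given distance of a location.
        return [name for name, p in objects.items() if dist(p, center) <= radius]

    def nearest_neighbor(objects, location):
        # The single object closest to a location.
        return min(objects, key=lambda name: dist(objects[name], location))

    accident = (5.0, 5.0)
    print(range_query(hospitals, accident, 3.0))    # ['H3']
    print(nearest_neighbor(hospitals, accident))    # 'H3'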
Multimedia Databases:
To provide database functions such as indexing and consistency, it is desirable to store multimedia data in a database rather than outside the database in a file system.
The database must handle large object representation.
Similarity-based retrieval must be provided by special index structures.
The database must provide guaranteed steady retrieval rates for continuous-media data.
Multimedia Data Formats
Multimedia data are stored and transmitted in compressed form.
JPEG and GIF are the most widely used formats for image data.
The MPEG standards for video data use commonalities among a sequence of frames to achieve a greater degree of compression.
MPEG-1 offers quality comparable to VHS video tape; it stores a minute of 30-frame-per-second video and audio in approximately 12.5 MB.
MPEG-2 is designed for digital broadcast systems and digital video disks, with negligible loss of video quality; it compresses 1 minute of audio-video to approximately 17 MB.
Several alternatives exist for audio encoding: MPEG-1 Layer 3 (MP3), RealAudio, Windows Media format, etc.
Continuous-Media Data
Most important types are video and audio data.
These are characterized by high data volumes and real-time information-delivery requirements:
Data must be delivered sufficiently fast that there are no gaps in the audio or video.
Data must be delivered at a rate that does not cause overflow of system buffers.
Synchronization among distinct data streams must be maintained; for example, video of a person speaking must show lips moving synchronously with the audio.
Video Servers
Video-on-demand systems deliver video from central video servers, across a network, to terminals, and must guarantee end-to-end delivery rates.
Current video-on-demand servers are based on file systems; existing database systems do not meet real-time response requirements.
Multimedia data are stored on several disks (in a RAID configuration), or on tertiary storage for less frequently accessed data.
Head-end terminals, used to view multimedia data, are PCs or TVs attached to a small, inexpensive computer called a set-top box.
Similarity-Based Retrieval
Examples of similarity-based retrieval:
Pictorial data: Two pictures or images that are slightly different as represented in the database may be considered the same by a user (e.g., identify similar designs for registering a new trademark).
Audio data: Speech-based user interfaces allow the user to give a command or identify a data item by speaking (e.g., test user input against stored commands).
Handwritten data: Identify a handwritten data item or command stored in the database.
Part - C
B. MOBILE DATABASE:
The mobile computing environment will provide database applications with
useful aspects of wireless technology. The mobile computing platform allows users
to establish communication with other users and to manage their work while they are
mobile. This feature is especially useful to geographically dispersed organizations.
Mobile Computing Architecture
Fig: Infrastructure based Architecture
It is a distributed architecture where a number of computers, generally referred to as Fixed Hosts (FH) and Base Stations (BS), are interconnected through a high-speed wired network. Fixed hosts are general-purpose computers that are not equipped to manage mobile units but can be configured to do so. Base stations are equipped with wireless interfaces and can communicate with mobile units to support data access.
Mobile Units (MU) (or hosts) and base stations communicate through wireless channels having bandwidths significantly lower than those of a wired network. A downlink channel is used for sending data from a BS to an MU, and an uplink channel is used for sending data from an MU to its BS. Mobile units are battery-
powered portable computers that move freely in a geographic mobility domain, an
area that is restricted by the limited bandwidth of wireless communication channels.
To manage the mobility of units, the entire geographic mobility domain is divided into smaller domains called cells. The mobile discipline requires that the movement of mobile units be unrestricted within the geographic mobility domain (intercell movement), while information access contiguity during movement guarantees that the movement of a mobile unit across cell boundaries has no effect on the data retrieval process.
Characteristics of Mobile Environments
In mobile database environments, data generally changes very rapidly. Users
are mobile and randomly enter and exit from cells. The average duration of a user's
stay in the cell is referred to as residence latency (RL), a parameter that is computed
(and continually adjusted) by observing user residence times in cells. Thus each cell
has an RL value.
In order to conserve energy and extend battery life, clients can slip into a
doze mode, where they are not actively listening on the channel and they can expend
significantly less energy than they do in active mode. Clients can be woken up from the
doze mode when the server needs to communicate with the client.
Types of Data in Mobile Applications
In vertical applications users access data within a specific cell, and access is denied
to users outside of that cell. For example, users can obtain information on the
location of doctors or emergency centers within a cell or parking availability data at
an airport cell.
In horizontal applications, users cooperate on accomplishing a task, and they
can handle data distributed throughout the system. The horizontal application market
is massive; two types of applications most mentioned are mail-enabled applications
and information services to mobile users.
Data may be classified into three categories:
Private data: A single user owns this data and manages it. No other user
may access it.
Public data: This data can be used by anyone who can read it. Only one
source updates it. Examples include weather bulletins or stock prices.
Shared data: This data is accessed both in read and write modes by groups
of users. Examples include inventory data for products in a company.
Data Management Issues
Data distribution and replication: Data is unevenly distributed among the base
stations and mobile units. The consistency constraints compound the problem of
cache management. Caches attempt to provide the most frequently accessed and
updated data to mobile units that process their own transactions and may be
disconnected over long periods.
Transaction models: Issues of fault tolerance and correctness of transactions are
aggravated in the mobile environment. A mobile transaction is executed
sequentially through several base stations and possibly on multiple data sets
depending upon the movement of the mobile unit. Central coordination of
transaction execution is lacking, particularly in scenario above. Hence, traditional
ACID properties of transactions may need to be modified and new transaction
models must be defined.
Query processing: Awareness of where the data is located is important and
affects the cost/benefit analysis of query processing. The query response needs to
be returned to mobile units that may be in transit or may cross cell boundaries yet
must receive complete and correct query results.
Recovery and fault tolerance: The mobile database environment must deal with
site, media, transaction, and communication failures. Site failure at an MU is
frequently due to limited battery power. If an MU has a voluntary shutdown, it
should not be treated as a failure. Transaction failures are more frequent during
handoff when an MU crosses cells. MU failure causes a network partitioning and
affects routing algorithms.
Mobile database design: The global name resolution problem for handling queries
is compounded because of mobility and frequent shutdown. Mobile database design
must consider many issues of metadata management—for example, the constant
updating of location information.
WEB DATABASE
A web database is accessed and managed via the Internet. In simplest terms, it is a database application designed to be managed and accessed through the Internet. Web database applications enable site operators to manage the collection of data and the presentation of analytical results online.
1. Basics
A database is a general software application that revolutionized businesses in the 1990s and into the 21st century. It enables companies to collect large amounts of data on many customers, analyze it and turn it into useful information. A web database provides these functions via the Internet.
2. Web Database Software
Web database software applications are widespread, as companies sell
applications to aspiring web developers. Benefits of top applications
include the ability to set up data collection forms, polls, feedback
forms and other collection tools and present the results of data analysis
in real time.
3. Business Functions
Businesses use web databases in various capacities depending on the operation.
Common uses include customized database generation, presentation of information
to customers or visitors, sorting of data, report generation, and importing and exporting of data.
DISTRIBUTED DATABASES
Explain about distributed databases and their characteristics, functions, advantages and disadvantages. May/June 2016
This section discusses distributed databases (DDBs), distributed database management systems (DDBMSs), and how the client-server architecture is used as a platform for database application development. Distributed databases bring the advantages of distributed computing to the database management domain.

A distributed computing system consists of a number of processing
elements, not necessarily homogeneous, that are interconnected by a computer
network, and that cooperate in performing certain assigned tasks.

As a general goal, distributed computing systems partition a big, unmanageable problem into smaller pieces and solve it efficiently in a coordinated manner.

The economic viability of this approach stems from two reasons: more
computing power is harnessed to solve a complex task, and each autonomous
processing element can be managed independently to develop its own
applications.

DDB technology resulted from a merger of two technologies: database
technology, and network and data communication technology. Computer
networks allow distributed processing of data.

Traditional databases, on the other hand, focus on providing centralized,
controlled access to data. Distributed databases allow an integration of
information and its processing by applications that may themselves be
centralized or distributed.
Distributed Database Concepts
A distributed database (DDB) is a collection of multiple logically interrelated databases distributed over a computer network, and a distributed database management system (DDBMS) is a software system that manages a distributed database while making the distribution transparent to the user.
Distributed databases are different from Internet Web files. Web pages are
basically a very large collection of files stored on different nodes in a network—the
Internet—with interrelationships among the files represented via hyperlinks.
Differences between DDB and Multiprocessor Systems
Distributed databases differ from multiprocessor systems that use shared storage (primary memory or disk). For a database to be called distributed, the following minimum conditions should be satisfied:
■ Connection of database nodes over a computer network.
There are multiple computers, called sites or nodes. These sites must be connected
by an underlying communication network to transmit data and commands among
sites.
■ Logical interrelation of the connected databases.
It is essential that the information in the databases be logically related.
■ Absence of homogeneity constraint among connected nodes.
It is not necessary that all nodes be identical in terms of data, hardware, and software.
Transparency
The concept of transparency extends the general idea of hiding
implementation details from end users. A highly transparent system offers a lot of
flexibility to the end user/application developer since it requires little or no
awareness of underlying details on their part.
In the case of a traditional centralized database, transparency simply pertains to
logical and physical data independence for application developers. However, in a
DDB scenario, the data and software are distributed over multiple sites connected by
a computer network, so additional types of transparencies are introduced.
Data organization transparency (also known as distribution or network transparency).
This refers to freedom for the user from the operational details of the network and
the placement of the data in the distributed system. It may be divided into location
transparency and naming transparency.
Location transparency refers to the fact that the command used to perform a
task is independent of the location of the data and the location of the node
where the command was issued.

Naming transparency implies that once a name is associated with an object,
the named objects can be accessed unambiguously without additional
specification as to where the data is located.

Replication transparency. Copies of the same data objects may be stored at multiple sites for better availability, performance, and reliability. Replication transparency makes the user unaware of the existence of these copies.

Fragmentation transparency. Two types of fragmentation are possible: horizontal fragmentation distributes a relation (table) into subrelations that are subsets of its tuples, while vertical fragmentation distributes a relation into subrelations that each contain a subset of the relation's columns. Fragmentation transparency makes the user unaware of the existence of fragments.
Autonomy
Autonomy determines the extent to which individual nodes or DBs in a
connected DDB can operate independently.
A high degree of autonomy is desirable for increased flexibility and
customized maintenance of an individual node. Autonomy can be
applied to design, communication, and execution.
Design autonomy refers to independence of data model usage and
transaction management techniques among nodes.
Communication autonomy determines the extent to which each node
can decide on sharing of information with other nodes.
Execution autonomy refers to independence of users to act as they
please.
Reliability and Availability
o Reliability and availability are two of the most common potential advantages
cited for distributed databases.
o Reliability is broadly defined as the probability that a system is running (not
down) at a certain time point, whereas availability is the probability that the
system is continuously available during a time interval. We can directly relate
reliability and availability of the database to the faults, errors, and failures
associated with it.
o A failure can be described as a deviation of a system‗s behavior from that
which is specified in order to ensure correct execution of operations.
o Errors constitute that subset of system states that causes the failure. Fault
is the cause of an error.
Advantages of Distributed Databases
Organizations resort to distributed database management for various reasons. Some
important advantages are listed
1. Improved ease and flexibility of application development.
2. Increased reliability and availability.
3. Improved performance.
4. Easier expansion.
Additional Functions of Distributed Databases
Keeping track of data distribution. The ability to keep track of the data
distribution, fragmentation, and replication by expanding the DDBMS catalog.
Distributed query processing. The ability to access remote sites and transmit
queries and data among the various sites via a communication network.
Distributed transaction management. The ability to devise execution strategies
for queries and transactions that access data from more than one site and to
synchronize the access to distributed data and maintain the integrity of the overall
database.
Replicated data management. The ability to decide which copy of a replicated
data item to access and to maintain the consistency of copies of a replicated data
item.
Distributed database recovery. The ability to recover from individual site
crashes and from new types of failures, such as the failure of communication links.
Security. Distributed transactions must be executed with the proper management
of the security of the data and the authorization/access privileges of users.
Distributed directory (catalog) management. A directory contains information
(metadata) about data in the database. The directory may be global for the entire
DDB, or local for each site. The placement and distribution of the directory are
design and policy issues.
Types of Distributed Database Systems
The term distributed database management system can describe various
systems that differ from one another in many respects. The main thing that all such
systems have in common is the fact that data and software are distributed over
multiple sites connected by some form of communication network. In this section we
discuss a number of types of DDBMSs and the criteria and factors that make some of
these systems different.
-o0o-o0o-o0o-
UNIT V
ADVANCED TOPICS
DATABASE SECURITY: Data Classification-Threats and risks – Database access
Control – Types of Privileges – Cryptography- Statistical Databases- Distributed
Databases-Architecture-Transaction Processing-Data Warehousing and Mining-
Classification-Association rules-Clustering-Information Retrieval- Relevance
ranking-Crawling and Indexing the Web- Object Oriented Databases-XML
Databases.
PART – A
1. What is Crawling and Indexing the web? (Nov/Dec 2014)
Spiders are used to crawl the web and collect pages (a toy sketch of the crawl loop follows the list).
– A page is downloaded and its outward links are found.
– Each outward link is then downloaded.
– Exceptions:
Links from CGI interfaces
Robot Exclusion Standard
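A toy Python sketch of that loop, where the link graph and the fetch_links helper are made-up stand-ins for real page downloads (a real spider would also honour robots.txt, the Robot Exclusion Standard):

    from collections import deque

    web = {'a.html': ['b.html', 'c.html'],    # a made-up link graph
           'b.html': ['c.html'],
           'c.html': []}

    def fetch_links(url):                     # stand-in for download + link extraction
        return web.get(url, [])

    def crawl(seed):
        seen, queue = {seed}, deque([seed])
        while queue:
            url = queue.popleft()             # a page is "downloaded" ...
            for link in fetch_links(url):     # ... and its outward links are found
                if link not in seen:          # each new outward link is then fetched
                    seen.add(link)
                    queue.append(link)
        return seen

    print(sorted(crawl('a.html')))            # ['a.html', 'b.html', 'c.html']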
2. What is Relevance Ranking? (Nov/Dec 2014)
Relevancy ranking is the process of sorting the document results so that
those documents which are most likely to be relevant to your query are shown at the
top.
3. Define Threats and risks. (Apr/May 2015)
Threats
A threat is an agent that may want to or definitely can result in harm to the
target organization. Threats include organized crime, spyware, malware, adware
companies, and disgruntled internal employees who start attacking their employer.
Worms and viruses also characterize a threat as they could possibly cause harm in your organization even without a human directing them to do so, by infecting machines and causing damage automatically. Threats are usually referred to as "attackers".
Risk
Risk is where threat and vulnerability overlap. That is, we get a risk when our
systems have a vulnerability that a given threat can attack.
4. What is Association rule mining? (Apr/May 2015)
A consequent is an item that is found in combination with the antecedent.
Association rules are created by analyzing data for frequent if/then patterns and
using the criteria support and confidence to identify the most important relationships.
Support is an indication of how frequently the items appear in the database.
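A small Python sketch of computing support and confidence for a rule X => Y over a list of transactions (the grocery items are made up):

    transactions = [
        {'milk', 'bread'}, {'milk', 'bread', 'butter'},
        {'bread'}, {'milk', 'butter'},
    ]

    def support(itemset):
        # Fraction of transactions containing every item in the itemset.
        n = sum(1 for t in transactions if itemset <= t)
        return n / len(transactions)

    def confidence(antecedent, consequent):
        # Of the transactions containing the antecedent, the fraction
        # that also contain the consequent.
        return support(antecedent | consequent) / support(antecedent)

    print(support({'milk', 'bread'}))          # 0.5  (2 of 4 transactions)
    print(confidence({'milk'}, {'bread'}))     # 0.666... (2 of the 3 milk baskets)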
5. List the types of privileges used in database access control. (Nov/Dec 2015)
System privileges - This allows the user to CREATE, ALTER, or DROP
database objects.
Object privileges - This allows the user to EXECUTE, SELECT, INSERT,
UPDATE, or DELETE data from database objects to which the privileges apply.
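The distinction can be illustrated with a short Oracle-style SQL sketch; the account and table names are hypothetical:

GRANT CREATE TABLE TO alice;              -- system privilege
GRANT SELECT, UPDATE ON employee TO bob;  -- object privileges
REVOKE UPDATE ON employee FROM bob;       -- privileges can later be withdrawn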
6. Can we have more than one constructor in a class? If yes, explain the need for
such a situation. (Nov/Dec 2015)
Constructors are most often the only public interface of a class for creating an object of that class. Classes may have many different fields (instance variables), and the user may choose to pass zero or all of those values while creating an object of the class. Sometimes it is mandatory by design to pass all those arguments; in other cases, the design of the class may be flexible enough to allow the user of the class to choose how many parameters need to be passed, and to provide sensible default/initialization values for the rest. This is the reason a class often has more than one constructor.
7. Explain data classification. (Apr/May 2016)
Data classification is the process of organizing data into categories for its
most effective and efficient use. A well-planned data classification system makes
essential data easy to find and retrieve. This can be of particular importance for risk
management, legal discovery, and compliance.
8. What are the advantages of data warehouse? (Apr/May 2016)
Enhanced Business Intelligence
Increased Query and System Performance
Business Intelligence from Multiple Sources
Timely Access to Data
Enhanced Data Quality and Consistency
Historical Intelligence
High Return on Investment
9. In how many ways can we describe the knowledge discovered?
Association rules
Classification hierarchy
Sequential patterns
Patterns within finite series
Categorization & segmentation.
10. What is the functionality of data warehouse?
Roll up
Drill down
Pivot
Slice & dice
Sorting
Selection
Derived attributes
11. What are the phases in knowledge discovery process?
Data selection
Data cleansing
Enrichment
Data transformation (or) encoding
Data mining
Reporting & display the discovered information.
12. Write the applications of data mining.
Advertising
Store location
Targeted mailing
Segmentation of customer
Design of catalogs
Store layout.
13. What are tree structure diagrams?
A tree structure diagram is the schema for a hierarchical database; such a diagram consists of two basic components:
Boxes, which correspond to record types
Lines, which correspond to links
14. What are the functions of distributed database?
Keeping track of data
Distributed query processing
Distributed transaction management
Replicated data management
Distributed d/b recovery
Security
15. Define i) data mining ii) data warehousing.
Data mining: Data mining refers to the discovery of new information in terms of patterns or rules from vast amounts of data.
Data warehousing: W.H. Inmon characterized a data warehouse as a subject-oriented, integrated, non-volatile, time-variant collection of data in support of management's decisions. It provides access to data for complex analysis, knowledge discovery and decision making.
Part B
1. DISTRIBUTED TRANSACTIONS
1. Write short notes on distributed transactions. Nov/Dec 2014
System Structure
Each site has its own local transaction manager, whose function is to ensure
the ACID properties of those transactions that execute at that site. The various
transaction managers cooperate to execute global transactions. To understand how
such a manager can be implemented, consider an abstract model of a transaction
system, in which each site contains two subsystems:
The transaction manager manages the execution of those transactions (or
subtransactions) that access data stored in a local site. Note that each such
transaction may be either a local transaction (that is, a transaction that executes at
only that site) or part of a global transaction (that is, a transaction that executes
at several sites).
The transaction coordinator coordinates the execution of the various transactions
(both local and global) initiated at that site.
The structure of a transaction manager is similar in many respects to the
structure of a centralized system. Each transaction manager is responsible for
• Maintaining a log for recovery purposes
• Participating in an appropriate concurrency-control scheme to coordinate the
concurrent execution of the transactions executing at that site
As we shall see, we need to modify both the recovery and concurrency
schemes to accommodate the distribution of transactions. The transaction coordinator
subsystem is not needed in the centralized environment, since a transaction accesses
data at only a single site. A transaction coordinator, as its name implies, is
responsible for coordinating the execution of all the transactions
initiated at that site. For each such transaction, the coordinator is responsible for
• Starting the execution of the transaction
• Breaking the transaction into a number of subtransactions and distributing these subtransactions to the appropriate sites for execution
• Coordinating the termination of the transaction, which may result in the transaction being committed at all sites or aborted at all sites
System Failure Modes
A distributed system may suffer from the same types of failure that a
centralized system does (for example, software errors, hardware errors, or disk
crashes). There are, however, additional types of failure with which we need to deal
in a distributed environment.
The basic failure types are
Failure of a site
Loss of messages
Failure of a communication link
Network partition
The loss or corruption of messages is always a possibility in a distributed system.
The system uses transmission-control protocols, such as TCP/IP, to handle
such errors. Information about such protocols may be found in standard textbooks on
networking.
However, if two sites A and B are not directly connected, messages from one
to the other must be routed through a sequence of communication links. If a
communication link fails, messages that would have been transmitted across the link
must be rerouted. In some cases, it is possible to find another route through the
network, so that the messages are able to reach their destination. In other cases, a
failure may result in there being no connection between some pairs of sites. A
system is partitioned if it has been split into two (or more) subsystems, called
partitions, that lack any connection between them. Note that, under this definition, a
subsystem may consist of a single node.
2. A. DATABASE SECURITY
2A. Explain types of database security and database security issues.
May 2016
Database security refers to protection from malicious access. Absolute
protection of the database from malicious abuse is not possible, but the cost to the
perpetrator can be made high enough to deter most if not all attempts to access the
database without proper authority.
To protect the database, we must take security measures at several levels:
Database system. Some database-system users may be authorized to access
only a limited portion of the database. Other users may be allowed to issue
queries, but may be forbidden to modify the data. It is the responsibility of the
database system to ensure that these authorization restrictions are not violated.
Operating system. No matter how secure the database system is, weakness in
operating-system security may serve as a means of unauthorized access to the
database.
Network. Since almost all database systems allow remote access through
terminals or networks, software-level security within the network software is
as important as physical security, both on the Internet and in private networks.
Physical. Sites with computer systems must be physically secured against
armed or surreptitious entry by intruders.
Human. Users must be authorized carefully to reduce the chance of any user
giving access to an intruder in exchange for a bribe or other favors.
Issues
Security at all these levels must be maintained if database security is to be
ensured.
A weakness at a low level of security (physical or human) allows
circumvention of strict high-level (database) security measures.
In the remainder of this section, we shall address security at the database-
system level.
Security at the physical and human levels, although important, is beyond the
scope of this text.
Security within the operating system is implemented at several levels, ranging
from passwords for access to the system to the isolation of concurrent
processes running within the system.
The file system also provides some degree of protection.
2.B. K-MEANS ALGORITHM.
2B. Explain about K-means algorithm. Apr / May 2015
K-Means clustering intends to partition n objects into k clusters in which each object belongs to the cluster with the nearest mean. This method produces exactly k different clusters of greatest possible distinction. The best number of clusters k leading to the greatest separation (distance) is not known a priori and must be computed from the data. The objective of K-Means clustering is to minimize the total intra-cluster variance, i.e., the squared error function
J = Σ_{j=1..k} Σ_{x ∈ Cj} ||x − μj||²
where Cj denotes the j-th cluster and μj its centroid (mean).
Algorithm
1. Cluster the data into k groups, where k is predefined.
2. Select k points at random as cluster centers.
3. Assign objects to their closest cluster center according to the Euclidean distance function.
4. Calculate the centroid or mean of all objects in each cluster.
5. Repeat steps 3 and 4 until the same points are assigned to each cluster in consecutive rounds.
K-Means is a relatively efficient method. However, we need to specify the number of clusters in advance, the final results are sensitive to initialization, and the algorithm often terminates at a local optimum. Unfortunately, there is no global theoretical method to find the optimal number of clusters. A practical approach is to compare the outcomes of multiple runs with different k and choose the best one based on a predefined criterion. In general, a large k probably decreases the error but increases the risk of overfitting.
Example:
Suppose we want to group the visitors to a website using just their age (a one-
dimensional space) as follows:
15,15,16,19,19,20,20,21,22,28,35,40,41,42,43,44,60,61,65
Initial clusters:
Centroid (C1) = 16 [16]
Centroid (C2) = 22 [22]
Iteration 1:
C1 = 15.33 [15,15,16]
C2 = 36.25 [19,19,20,20,21,22,28,35,40,41,42,43,44,60,61,65]
Iteration 2:
C1 = 18.56 [15,15,16,19,19,20,20,21,22]
C2 = 45.90 [28,35,40,41,42,43,44,60,61,65]
Iteration 3:
C1 = 19.50 [15,15,16,19,19,20,20,21,22,28]
C2 = 47.89 [35,40,41,42,43,44,60,61,65]
Iteration 4:
C1 = 19.50 [15,15,16,19,19,20,20,21,22,28]
C2 = 47.89 [35,40,41,42,43,44,60,61,65]
No change between iterations 3 and 4 has been noted, so the algorithm terminates. By using clustering, two groups have been identified: ages 15-28 and 35-65. The initial choice of centroids can affect the output clusters, so the algorithm is often run multiple times with different starting conditions in order to get a fair view of what the clusters should be.
2C. Write notes on classification and clustering. Nov / Dec 2014
Classification
Classification is a data mining function that assigns items in a collection to
target categories or classes. The goal of classification is to accurately predict the
target class for each case in the data. For example, a classification model could be
used to identify loan applicants as low, medium, or high credit risks.
A classification task begins with a data set in which the class assignments are
known. For example, a classification model that predicts credit risk could be
developed based on observed data for many loan applicants over a period of time.
Classification consists of predicting a certain outcome based on a given input.
In order to predict the outcome, the algorithm processes a training set containing a
set of attributes and the respective outcome, usually called goal or prediction
attribute. The algorithm tries to discover relationships between the attributes that
would make it possible to predict the outcome. Next the algorithm is given a data set
not seen before, called prediction set, which contains the same set of attributes,
except for the prediction attribute, which is not yet known. The algorithm analyzes the input and produces a prediction. The prediction accuracy defines how "good" the algorithm is.
Classification Algorithms:
Decision trees
Rule-based induction
Neural networks
Memory (case)-based reasoning
Genetic algorithms
Bayesian networks
Clustering
Clustering is "the process of organizing objects into groups whose members are similar in some way".
A cluster is therefore a collection of objects which are "similar" to one another and "dissimilar" to the objects belonging to other clusters.
Clustering – a data mining technique
Usage:
o Statistical Data Analysis
o Machine Learning
o Data Mining
o Pattern Recognition
o Image Analysis
o Bioinformatics
Types of Clustering
o Hierarchical – finding new clusters using previously found ones
o Partitional – finding all clusters at once
o Self-Organizing Maps
o Hybrids (incremental)
Major existing clustering methods
o Distance-based
o Hierarchical
o Partitioning
o Probabilistic
3. OBJECT ORIENTED DATABASE
3. Suppose an Object oriented database had an object A, which references object B, which in turn references object C. Assume all objects are on disk initially. Suppose a program first dereferences A, then dereferences B by following the reference from A, and then finally dereferences C. Show the objects that are represented in memory after each dereference, along with their state. Nov / Dec 2015
An object-oriented database is a database that subscribes to a model with
information represented by objects.
Objects, in an object-oriented database, reference the ability to develop a
product, then define and name it. The object can then be referenced, or called
later, as a unit without having to go into its complexities.
Object-oriented database management systems (OODBMSs) combine
database capabilities with object-oriented programming language capabilities.
OODBMSs allow object-oriented programmers to develop products, store them as objects, and replicate or modify existing objects to make new objects
within the OODBMS. Because the database is integrated with the
programming language, the programmer can maintain consistency within one
environment, in that both the OODBMS and the programming language will
use the same model of representation.
• The complex object data model is a non-1NF data model. It allows the following extensions:
Sets of atomic values
Tuple-valued attributes
Sets of tuples (nested relations)
General set and tuple constructors
Object identity
An object is defined by a triple (OID, type constructor, state) where OID is the
unique object identifier, type constructor is its type (such as atom, tuple, set, list,
array, bag, etc.) and state is its actual value.
• Example: (i1, atom, 'John') (i2, atom, 30)
Object-oriented features: complex objects, object identity, encapsulation, classes, inheritance, overriding, overloading, late binding, computational completeness, extensibility.
Database features: persistence, performance, concurrency, reliability, declarative queries.
OID:
Every object has a unique immutable object ID.
Objects are referenced by ID (note that pointers are bad in a persistent disk-based representation!).
Complex objects can be created, and an object can be "shared" across complex objects.
In contrast to relational systems, either object IDs or keys can be compared for equality.
Persistence:
Persistence by name and Persistence by reachability
ODMG Standard:
ODL (Object Definition Language), OQL (Object Query Language)
Object Referencing and dereferencing:
Suppose, for example, that object A references object B using a physical OID.
Now object B is deleted and a new object C is created at the storage location at
which object B was originally stored. To be able to trap that the object referenced by
A no longer exists, the physical OIDs of objects B and C must differ in the value of
their unique fields.
Working with physical OIDs is very simple: to dereference a physical OID (e.g., traverse the inter-object reference from A to B), the database system simply decodes the storage location of the referenced object, which is part of the physical OID. Difficulties arise only when an object migrates to a new storage location. Migrations are, for instance, necessary if objects grow as a result of update operations.
operations. If an object migrates, a forward which contains the new storage location
of the object is established at the place at which the object was originally stored. If
an object migrates several times, this forward is updated so that it always contains
the right storage location and an object can be read with at most two "hops." Two example physical OIDs are shown in Figure 1. The figure also shows that the object referenced by the first OID is still stored at its original location, whereas the object referenced by the second OID was migrated to another page so that a forward for that object had to be established.
With logical OIDs, a mapping structure translates an OID into the object's current storage location. If an object is migrated, the object's entry in the mapping structure is updated in a similar way as forwards are updated when objects migrate and physical OIDs are used. Three different kinds of mapping structures are used in practice: (1) B-trees, (2) hash tables, and (3) direct mapping tables.
Object Database Advantages over RDBMS
Objects don't require assembly and disassembly, saving coding time and execution time to assemble or disassemble objects.
Reduced paging
Easier navigation
Better concurrency control – a hierarchy of objects may be locked.
Data model is based on the real world.
Works well for distributed architectures.
Less code required when applications are object oriented.
Object Database Disadvantages compared to RDBMS
Lower efficiency when data is simple and relationships are simple.
Relational tables are simpler.
Late binding may slow access speed.
More user tools exist for RDBMS.
Standards for RDBMS are more stable.
Support for RDBMS is more certain and change is less likely to be required.
4. ACCESS CONTROL, THREATS AND RISKS
4. Write short notes on the following:
A. Access Control (8) Apr / May 2015, Nov / Dec 2014, Nov/Dec 2015
B. Threats and Risks (8)
4A. Discretionary Access Control
The typical method of enforcing discretionary access control in a database system is based on the granting and revoking of privileges at two levels:
The account level: at this level, the DBA specifies the particular privileges that each account holds independently of the relations in the database.
The relation level (or table level): at this level, the DBA can control the privilege to access each individual relation or view in the database.
Types of discretionary privileges:
The privileges at the account level apply to the capabilities provided to the account itself and can include:
o the CREATE SCHEMA or CREATE TABLE privilege, to create a schema or base relation;
o the CREATE VIEW privilege;
o the ALTER privilege, to apply schema changes such as adding or removing attributes from relations;
o the DROP privilege, to delete relations or views;
o the MODIFY privilege, to insert, delete, or update tuples;
o the SELECT privilege, to retrieve information from the database by using a SELECT query.
The second level of privileges applies to the relation level; this includes base relations and virtual (view) relations.
The granting and revoking of privileges generally follow an authorization model for discretionary privileges known as the access matrix model, where:
o the rows of a matrix M represent subjects (users, accounts, programs);
o the columns represent objects (relations, records, columns, views, operations);
o each position M(i,j) in the matrix represents the types of privileges (read, write, update) that subject i holds on object j.
Specifying Privileges Using Views
The mechanism of views is an important discretionary authorization mechanism in its own right. For example:
o If the owner A of a relation R wants another account B to be able to retrieve only some fields of R, then A can create a view V of R that includes only those attributes and then grant SELECT on V to B.
o The same applies to limiting B to retrieving only certain tuples of R; a view V' can be created by defining the view by means of a query that selects only those tuples from R that A wants to allow B to access.
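A minimal SQL sketch of this idea; the relation, view, and account names are hypothetical:

CREATE VIEW emp_public AS
    SELECT name, dept_no FROM employee WHERE dept_no = 5;
GRANT SELECT ON emp_public TO b;
-- B can now query emp_public but not the underlying employee relation.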
Revoking Privileges
In some cases it is desirable to grant a privilege to a user temporarily. For example, the owner of a relation may want to grant the SELECT privilege to a user for a specific task and then revoke that privilege once the task is completed. Hence, a mechanism for revoking privileges is needed. In SQL, a REVOKE command is included for the purpose of canceling privileges.
Propagation of Privileges using the GRANT OPTION
Whenever the owner A of a relation R grants a privilege on R to another account B, the privilege can be given to B with or without the GRANT OPTION.
o If the GRANT OPTION is given, this means that B can also grant that privilege on R to other accounts.
o Suppose that B is given the GRANT OPTION by A and that B then grants the privilege on R to a third account C, also with GRANT OPTION. In this way, privileges on R can propagate to other accounts without the knowledge of the owner of R.
o If the owner account A now revokes the privilege granted to B, all the privileges that B propagated based on that privilege should automatically be revoked by the system.
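A brief SQL sketch of this propagation; the account names are hypothetical:

GRANT SELECT ON employee TO b WITH GRANT OPTION;  -- executed by owner A
GRANT SELECT ON employee TO c;                    -- executed by B
REVOKE SELECT ON employee FROM b CASCADE;         -- A revokes; C's privilege
                                                  -- is withdrawn as well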
Specifying Limits on Propagation of Privileges
Techniques to limit the propagation of privileges have been developed, although they have not yet been implemented in most DBMSs and are not a part of SQL.
o Limiting horizontal propagation to an integer number i means that an account B given the GRANT OPTION can grant the privilege to at most i other accounts.
o Vertical propagation is more complicated; it limits the depth of the granting of privileges.
Mandatory Access Control
In many applications, an additional security policy is needed that classifies data and users based on security classes. This approach, known as mandatory access control, would typically be combined with the discretionary access control mechanisms.
Typical security classes are top secret (TS), secret (S), confidential (C), and unclassified (U), where TS is the highest level and U the lowest: TS ≥ S ≥ C ≥ U.
The commonly used model for multilevel security, known as the Bell-LaPadula model, classifies each subject (user, account, program) and object (relation, tuple, column, view, operation) into one of the security classifications TS, S, C, or U. We refer to the clearance (classification) of a subject S as class(S) and to the classification of an object O as class(O).
Two restrictions are enforced on data access based on the subject/object classifications:
o Simple security property: a subject S is not allowed read access to an object O unless class(S) ≥ class(O).
o Star property (or * property): a subject S is not allowed to write an object O unless class(S) ≤ class(O).
To incorporate multilevel security notions into the relational database model, it is common to consider attribute values and tuples as data objects. Hence, each attribute A is associated with a classification attribute C in the schema, and each attribute value in a tuple is associated with a corresponding security classification. In addition, in some models, a tuple classification attribute TC is added to the relation attributes to provide a classification for each tuple as a whole. Hence, a multilevel relation schema R with n attributes would be represented as
R(A1, C1, A2, C2, ..., An, Cn, TC)
where each Ci represents the classification attribute associated with attribute Ai.
The value of the TC attribute in each tuple t – which is the highest of all attribute classification values within t – provides a general classification for the tuple itself, whereas each Ci provides a finer security classification for each attribute value within the tuple.
The apparent key of a multilevel relation is the set of attributes that would have formed the primary key in a regular (single-level) relation.
A multilevel relation will appear to contain different data to subjects (users) with different clearance levels.
o In some cases, it is possible to store a single tuple in the relation at a higher classification level and produce the corresponding tuples at a lower-level classification through a process known as filtering.
o In other cases, it is necessary to store two or more tuples at different classification levels with the same value for the apparent key.
This leads to the concept of polyinstantiation, where several tuples can have the same apparent key value but have different attribute values for users at different classification levels.
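As a rough sketch of the representation R(A1, C1, ..., An, Cn, TC) described above, such a multilevel relation could be declared as an ordinary table in which every attribute is paired with its classification attribute; all names here are hypothetical:

CREATE TABLE employee_ml (
    name    VARCHAR(30),  c_name   CHAR(2),  -- classification: 'U','C','S','TS'
    salary  DECIMAL(9,2), c_salary CHAR(2),
    job     VARCHAR(20),  c_job    CHAR(2),
    tc      CHAR(2)       -- tuple classification: highest of the Ci values
);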
Role-based access control (RBAC)
Role-based access control (RBAC) emerged rapidly in the 1990s as a proven
technology for managing and enforcing security in large-scale enterprisewide
systems.
Its basic notion is that permissions are associated with roles, and users are
assigned to appropriate roles.
Roles can be created using the CREATE ROLE and DESTROY ROLE
commands.
The GRANT and REVOKE commands discussed under DAC can then be
used to assign and revoke privileges from roles.
RBAC appears to be a viable alternative to traditional discretionary and
mandatory access controls; it ensures that only authorized users are given
access to certain data or resources.
Many DBMSs have allowed the concept of roles, where privileges can be
assigned to roles.
Role hierarchy in RBAC is a natural way of organizing roles to reflect the organization's lines of authority and responsibility.
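A minimal sketch of these commands in SQL; the role and account names are hypothetical:

CREATE ROLE payroll_clerk;
GRANT SELECT, UPDATE ON employee TO payroll_clerk;  -- permissions attach to the role
GRANT payroll_clerk TO alice;                       -- users are assigned to roles
REVOKE payroll_clerk FROM alice;                    -- membership can later be revoked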
4B. THREATS AND RISKS
4B. Explain about threats and risks.
Threat:
With the increase in usage of databases, the frequency of attacks against those
databases has also increased. Database attacks are an increasing trend these days.
One reason is the increase in access to data stored in databases. When data is accessed by many people, the chances of data theft increase. In the past, database attacks were prevalent but fewer in number, as hackers hacked the network more to show it was possible to hack and not to sell proprietary information.
Another reason for database attacks is to gain money selling sensitive information,
which includes credit card numbers, Social Security Numbers, etc.
Types of Threats to database security
1. Privilege abuse:
When database users are provided with privileges that exceed their day-to-day job requirements, these privileges may be abused intentionally or unintentionally.
Take, for instance, a database administrator in a financial institution. What will happen if he turns off audit trails or creates bogus accounts? He will be able to transfer money from one account to another, thereby abusing the excessive privilege intentionally.
Having seen how privilege can be abused intentionally, let us see how
privilege can be abused unintentionally. A company is providing a "work from home" option to its employees, and an employee takes a backup of sensitive data to work on from his home. This not only violates the security policies of the organization, but also may result in a data security breach if the system at home is compromised.
2. Operating System vulnerabilities:
Vulnerabilities in underlying operating systems like Windows, UNIX, Linux,
etc., and the services that are related to the databases could lead to unauthorized
access. This may lead to a Denial of Service (DoS) attack. This could be prevented
by updating the operating system related security patches as and when they become
available.
3. Database rootkits:
A database rootkit is a program or a procedure that is hidden inside the
database and that provides administrator-level privileges to gain access to the data in
the database. These rootkits may even turn off alerts triggered by Intrusion
Prevention Systems (IPS). It is possible to install a rootkit only after compromising
the underlying operating system. This can be avoided by periodic audit trails; otherwise the presence of the database rootkit may go undetected.
Risks:
Database security concerns the use of a broad range of information security
controls to protect databases (potentially including the data, the database applications
or stored functions, the database systems, the database servers and the associated
network links) against compromises of their confidentiality, integrity and
availability. It involves various types or categories of controls, such as technical,
procedural/administrative and physical. Database security is a specialist topic within
the broader realms of computer security, information security and risk management.
Security risks to database systems include, for example:
Unauthorized or unintended activity or misuse by authorized database users, database administrators, or network/systems managers, or by unauthorized users or hackers (e.g., inappropriate access to sensitive data, metadata or functions within databases, or inappropriate changes to the database programs, structures or security configurations);
Malware infections causing incidents such as unauthorized access, leakage or disclosure of personal or proprietary data, deletion of or damage to the data or programs, interruption or denial of authorized access to the database, attacks on other systems and the unanticipated failure of database services;
Overloads, performance constraints and capacity issues resulting in the inability of authorized users to use databases as intended;
Physical damage to database servers caused by computer room fires or floods, overheating, lightning, accidental liquid spills, static discharge, electronic breakdowns/equipment failures and obsolescence;
Design flaws and programming bugs in databases and the associated programs and systems, creating various security vulnerabilities (e.g., unauthorized privilege escalation), data loss/corruption, performance degradation, etc.;
Data corruption and/or loss caused by the entry of invalid data or commands, mistakes in database or system administration processes, sabotage/criminal damage, etc.
5. DATA WAREHOUSING AND DATA MINING
5. Explain in detail about Data Warehousing and Data Mining.
Apr / May 2015
A data warehouse is a repository (or archive) of information gathered from
multiple sources, stored under a unified schema, at a single site. Once gathered, the
data are stored for a long time, permitting access to historical data. Thus, data
warehouses provide the user a single consolidated interface to data, making decision-
support queries easier to write. Moreover, by accessing information for decision
support from a data warehouse, the decision maker ensures that online transaction-
processing systems are not affected by the decision-support workload.
Components of a Data Warehouse
When and how to gather data
What schema to use
Data cleansing
How to propagate updates
What data to summarize
Data Warehouse Architecture
Warehouse Schemas
Data warehouses typically have schemas that are designed for data analysis,
using tools such as OLAP tools. Thus, the data are usually multidimensional data,
with dimension attributes and measure attributes. Tables containing
multidimensional data are called fact tables and are usually very large.
To minimize storage requirements, dimension attributes are usually short
identifiers that are foreign keys into other tables called dimension tables. For
instance, fact table sales would have attributes item-id, store-id, customer-id, and
date, and measure attributes number and price. The attribute store-id is a foreign key
into a dimension table store, which has other attributes such as store location (city,
state, country). The item-id attribute of the sales table would be a foreign key into a
dimension table item-info, which would contain information such as the name of the
item, the category to which the item belongs, and other item details such as color and
size. The customer-id attribute would be a foreign key into a customer table
containing attributes such as name and address of the customer. We can also view
the date attribute as a foreign key into a date-info table giving the month, quarter,
and year of each date.
The resultant schema appears in Figure. Such a schema, with a fact table,
multiple dimension tables, and foreign keys from the fact table to the dimension
tables, is called a star schema. More complex data warehouse designs may have
multiple levels of dimension tables; for instance, the item-info table may have an
attribute manufacturer-id that is a foreign key into another table giving details of the
manufacturer.
Such schemas are called snowflake schemas. Complex data warehouse
designs may also have more than one fact table.
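A rough DDL sketch of the star schema described above, using the sales example's names; the column types are assumptions, and 'date' and 'number' are renamed to avoid reserved words in some systems:

CREATE TABLE store (
    store_id INT PRIMARY KEY,
    city     VARCHAR(30),
    state    VARCHAR(30),
    country  VARCHAR(30)
);
CREATE TABLE item_info (
    item_id   INT PRIMARY KEY,
    item_name VARCHAR(40),
    category  VARCHAR(30),
    color     VARCHAR(20),
    item_size VARCHAR(10)
);
CREATE TABLE sales (                                -- the fact table
    item_id     INT REFERENCES item_info(item_id),  -- dimension attributes are
    store_id    INT REFERENCES store(store_id),     -- foreign keys into the
    customer_id INT,                                -- dimension tables
    sale_date   DATE,                               -- 'date' in the text
    quantity    INT,                                -- 'number' in the text
    price       DECIMAL(9,2)                        -- measure attribute
);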
Data Mining
Generally, data mining (sometimes called data or knowledge discovery) is the
process of analyzing data from different perspectives and summarizing it into useful information – information that can be used to increase revenue, cut costs, or both.
Data mining software is one of a number of analytical tools for analyzing data. It
allows users to analyze data from many different dimensions or angles, categorize it,
and summarize the relationships identified. Technically, data mining is the process
of finding correlations or patterns among dozens of fields in large relational
databases.
Continuous Innovation
Although data mining is a relatively new term, the technology is not.
Companies have used powerful computers to sift through volumes of supermarket
scanner data and analyze market research reports for years. However, continuous
innovations in computer processing power, disk storage, and statistical software are
dramatically increasing the accuracy of analysis while driving down the cost.
Data
Data are any facts, numbers, or text that can be processed by a computer.
Today, organizations are accumulating vast and growing amounts of data in different
formats and different databases. This includes:
operational or transactional data, such as sales, cost, inventory, payroll, and accounting data;
nonoperational data, such as industry sales, forecast data, and macroeconomic data;
meta data – data about the data itself, such as logical database design or data dictionary definitions.
Information
The patterns, associations, or relationships among all this data can provide
information. For example, analysis of retail point of sale transaction data can yield
information on which products are selling and when.
Knowledge
Information can be converted into knowledge about historical patterns and
future trends. For example, summary information on retail supermarket sales can be
analyzed in light of promotional efforts to provide knowledge of consumer buying
behavior. Thus, a manufacturer or retailer could determine which items are most
susceptible to promotional efforts.
How does data mining work?
While large-scale information technology has been evolving separate
transaction and analytical systems, data mining provides the link between the two.
Data mining software analyzes relationships and patterns in stored transaction data
based on open-ended user queries. Several types of analytical software are available:
statistical, machine learning, and neural networks. Generally, any of four types of
relationships are sought:
Classes: Stored data is used to locate data in predetermined groups. For example, a restaurant chain could mine customer purchase data to determine when customers visit and what they typically order. This information could be used to increase traffic by having daily specials.
Clusters: Data items are grouped according to logical relationships or consumer preferences. For example, data can be mined to identify market segments or consumer affinities.
Associations: Data can be mined to identify associations. The beer-diaper example is an example of associative mining.
Sequential patterns: Data is mined to anticipate behavior patterns and trends. For example, an outdoor equipment retailer could predict the likelihood of a backpack being purchased based on a consumer's purchase of sleeping bags and hiking shoes.
Data mining consists of five major elements:
Extract, transform, and load transaction data onto the data warehouse system.
Store and manage the data in a multidimensional database system.
Provide data access to business analysts and information technology professionals.
Analyze the data by application software.
Present the data in a useful format, such as a graph or table.
Different levels of analysis are available:
Artificial neural networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.
Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of natural evolution.
Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi
Square Automatic Interaction Detection (CHAID). CART and CHAID are decision tree techniques used for classification of a dataset. They provide a set of rules that you can apply to a new (unclassified) dataset to predict which records will have a given outcome. CART segments a dataset by creating 2-way splits while CHAID segments using chi-square tests to create multi-way splits. CART typically requires less data preparation than CHAID.
Nearest neighbor method: A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k ≥ 1). Sometimes called the k-nearest neighbor technique.
Rule induction: The extraction of useful if-then rules from data based on statistical significance.
Data visualization: The visual interpretation of complex relationships in multidimensional data. Graphics tools are used to illustrate data relationships.
6A. DISTRIBUTED ARCHITECTURE
6A. Explain in detail about Parallel versus Distributed Architectures
Parallel versus Distributed Architectures
Shared memory (tightly coupled) architecture. Multiple processors share secondary (disk) storage and also share primary memory.
Shared disk (loosely coupled) architecture. Multiple processors share secondary (disk) storage, but each has its own primary memory.
o These architectures enable processors to communicate without the overhead of exchanging messages over a network. Database management systems developed using the above types of architectures are termed parallel database management systems rather than DDBMSs, since they utilize parallel processor technology.
Another type of multiprocessor architecture is called shared nothing architecture. In this architecture, every processor has its own primary and secondary (disk) memory, no common memory exists, and the processors communicate over a high-speed interconnection network (bus or switch).
Although the shared nothing architecture resembles a distributed database computing environment, major differences exist in the mode of operation.
In shared nothing multiprocessor systems, there is symmetry and homogeneity of nodes; this is not true of the distributed database environment, where heterogeneity of hardware and operating system at each node is very common.
Shared nothing architecture is also considered as an environment for parallel databases.
General Architecture of Pure Distributed Databases
This section describes the logical and component architectural models of a DDB. In the generic schema architecture of a DDB, the enterprise is presented with a consistent, unified view showing the logical structure of the underlying data across all nodes.
This view is represented by the global conceptual schema (GCS), which provides network transparency.
To accommodate potential heterogeneity in the DDB, each node is shown as having its own local internal schema (LIS) based on physical organization details at that particular site. The logical organization of data at each site is specified by the local conceptual schema (LCS).
The global query compiler references the global conceptual schema from the global system catalog to verify and impose defined constraints.
The global query optimizer references both global and local conceptual schemas and generates optimized local queries from global queries. It evaluates all candidate strategies using a cost function that estimates cost based on response time.
Figure: Some different database system architectures. (a) Shared nothing architecture. (b) A networked architecture with a centralized database at one of the sites. (c) A truly distributed database architecture.
Part - C
6B. Explain in detail about XML Databases.
Although HTML is widely used for formatting and structuring Web documents, it is not suitable for specifying structured data that is extracted from databases.
A new language – namely XML (eXtensible Markup Language) – has emerged as the standard for structuring and exchanging data over the Web. XML can be used to provide more information about the structure and meaning of the data in Web pages rather than just specifying how the Web pages are formatted for display on the screen.
The formatting aspects are specified separately – for example, by using a formatting language such as XSL (eXtensible Stylesheet Language).
Structured, Semi Structured and Unstructured Data.
Information stored in databases is known as structured data because it is represented in a strict format. The DBMS then checks to ensure that all data follows the structures and constraints specified in the schema.
In some applications, data is collected in an ad hoc manner before it is known how it will be stored and managed. This data may have a certain structure, but not all the information collected will have identical structure. This type of data is known as semi-structured data.
– In semi-structured data, the schema information is mixed in with the data values, since each data object can have different attributes that are not known in advance. Hence, this type of data is sometimes referred to as self-describing data.
– A third category is known as unstructured data, because there is very limited indication of the type of data. A typical example would be a
text document that contains information embedded within it. Web pages in HTML that contain some data are considered unstructured data.
Semi-structured data may be displayed as a directed graph, as shown.
The labels or tags on the directed edges represent the schema names – the names of attributes, object types (or entity types or classes), and relationships.
The internal nodes represent individual objects or composite attributes.
The leaf nodes represent actual data values of simple (atomic) attributes.
Figure: Representing semi-structured data as a graph.
XML Hierarchical (Tree) Data Model
The basic object in XML is the XML document. There are two main structuring concepts that are used to construct an XML document: elements and attributes. Attributes in XML provide additional information that describes elements.
As in HTML, elements are identified in a document by their start tag and end tag. The tag names are enclosed between angled brackets <…>, and end tags are further identified by a slash </…>. Complex elements are constructed from other elements hierarchically, whereas simple elements contain data values.
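A small illustrative XML document (the element names are hypothetical, not the text's own example): company and employee are complex elements, name and salary are simple elements containing data values, and id is an attribute of employee.

<company>
  <employee id="e1">
    <name>John Smith</name>
    <salary>30000</salary>
  </employee>
</company>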
It is straightforward to see the correspondence between the XML textual representation and the tree structure. In the tree representation, internal nodes represent complex elements, whereas leaf nodes represent simple elements. That is why the XML model is called a tree model or a hierarchical model.
It is possible to characterize three main types of XML documents:
1. Data-centric XML documents:
These documents have many small data items that follow
a specific structure, and hence may be extracted from a structured database. They
are formatted as XML documents in order to exchange them or display them over
the Web.
2. Document-centric XML documents:
These are documents with large amounts of text, such as news articles or books. There are few or no structured data elements in these documents.
3. Hybrid XML documents:
These documents may have parts that contain structured data and other parts that are predominantly textual or unstructured.
XML Documents, DTD, and XML Schema
Well-Formed
– It must start with an XML declaration to indicate the version of XML being used, as well as any other relevant attributes.
– It must follow the syntactic guidelines of the tree model. This means that there should be a single root element, and every element must include a matching pair of start tag and end tag within the start and end tags of the parent element.
– A well-formed XML document is syntactically correct. This allows it to be processed by generic processors that traverse the document and create an internal tree representation.
o DOM (Document Object Model) – allows programs to manipulate the resulting tree representation corresponding to a well-formed XML document. The whole document must be parsed beforehand when using DOM.
o SAX – allows processing of XML documents on the fly by notifying the processing program whenever a start or end tag is encountered.
Valid
A stronger criterion is for an XML document to be valid. In this case, the document must be well-formed, and in addition the element names used in the start and end tag pairs must follow the structure specified in a separate XML DTD (Document Type Definition) file or XML schema file.
Limitations of XML DTD
First, the data types in DTD are not very general.
Second, DTD has its own special syntax and thus requires specialized processors. It would be advantageous to specify XML schema documents using the syntax rules of XML itself so that the same processors used for XML documents can process XML schema descriptions.
Third, all DTD elements are always forced to follow the specified ordering of the document, so unordered elements are not permitted.
XML SCHEMA
Schema Descriptions and XML Namespaces:
It is necessary to identify the specific set of XML schema language elements (tags) by a file stored at a Web site location. The second line in our example specifies the file used in this example, which is: "http://www.w3.org/2001/XMLSchema". Each such definition is called an XML namespace. The file name is assigned to the variable xsd using the attribute xmlns (XML namespace), and this variable is used as a prefix to all XML schema tags.
Annotations, documentation, and language used:
The xsd:annotation and xsd:documentation elements are used for providing comments and other descriptions in the XML document. The attribute xml:lang of the xsd:documentation element specifies the language being used, e.g., "en".
Elements and types:
We specify the root element of our XML schema. In XML schema, the
name attribute of the xsd:element tag specifies the element name, which is called
company for the root element in our example. The structure of the company root
element is an xsd:complexType.
First-level elements in the company database:
These elements are named employee, department, and project, and each
is specified in an xsd:element tag. If a tag has only attributes and no further sub-
elements or data within it, it can be ended with the slash symbol (/>) and is termed an empty element.
Specifying element type and minimum and maximum occurrences:
If we specify a type attribute in an xsd:element, this means that the
structure of the element will be described separately, typically using the
xsd:complexType element. The minOccurs and maxOccurs tags are used for
specifying lower and upper bounds on the number of occurrences of an element. The
default is exactly one occurrence.
Specifying Keys:
For specifying primary keys, the tag xsd:key is used.
For specifying foreign keys, the tag xsd:keyref is used. When specifying a
foreign key, the attribute refer of the xsd:keyref tag specifies the referenced primary
key whereas the tags xsd:selector and xsd:field specify the referencing element type
and foreign key.
Specifying the structures of complex elements via complex types:
Complex elements in our example are Department, Employee, Project,
and Dependent, which use the tag xsd:complexType. We specify each of these as a
sequence of subelements corresponding to the database attributes of each entity type
by using the xsd:sequence and xsd:element tags of XML schema. Each element is
given a name and type via the attributes name and type of xsd:element.
We can also specify minOccurs and maxOccurs attributes if we need to
change the default of exactly one occurrence. For (optional) database attributes
where null is allowed, we need to specify minOccurs = 0, whereas for multivalued
database attributes we need to specify maxOccurs = "unbounded" on the
corresponding element.
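A hedged fragment in the spirit of the company example referenced above; the element and type names are assumptions, since the text's own schema file is not reproduced here:

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="company" type="companyType"/>
  <xsd:complexType name="companyType">
    <xsd:sequence>
      <xsd:element name="employee" type="employeeType"
                   minOccurs="0" maxOccurs="unbounded"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="employeeType">
    <xsd:sequence>
      <xsd:element name="employeeName" type="xsd:string"/>
      <xsd:element name="employeeSalary" type="xsd:decimal" minOccurs="0"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>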
Composite (compound) attributes:
Composite attributes from ER Schema are also specified as complex types
in the XML schema, as illustrated by the Address, Name, Worker, and WorkesOn
complex types. These could have been directly embedded within their parent
elements.
Approaches to Storing XML Documents
Using a DBMS to store the documents as text:
We can use a relational or object DBMS to store whole XML documents as text fields within the DBMS records or objects. This approach can be used if the DBMS has a special module for document processing, and would work for storing schemaless and document-centric XML documents.
Using a DBMS to store the document contents as data elements:
This approach would work for storing a collection of documents that follow a specific XML DTD or XML schema. Since all the documents have the same structure, we can design a relational (or object) database to store the leaf-level data elements within the XML documents.
Designing a specialized system for storing native XML data:
A new type of database system based on the hierarchical (tree) model would be designed and implemented. The system would include specialized indexing and querying techniques, and would work for all types of XML documents.
Creating or publishing customized XML documents from pre-existing relational databases:
Because there are enormous amounts of data already stored in relational databases, parts of these data may need to be formatted as documents for exchanging or displaying over the Web.
Extracting XML Documents from Relational Databases.
Suppose that an application needs to extract XML documents for student,
course, and grade information from the university database. The data needed for
these documents is contained in the database attributes of the entity types course,
section, and student as shown below (part of the main ER), and the relationships s-s
and c-s between them.
7. CRYPTOGRAPHY AND STATISTICAL DATABASE
7. Explain in detail about Cryptography and Statistical database. May 2016
Encryption
Data Encryption Standard (DES) substitutes characters and rearranges their order on the basis of an encryption key, which is provided to authorized users via a secure mechanism. The scheme is no more secure than the key transmission mechanism, since the key has to be shared.
Advanced Encryption Standard (AES) is a newer standard replacing DES; it is based on the Rijndael algorithm, but also depends on shared secret keys.
Public-key encryption is based on each user having two keys:
o public key – a publicly published key used to encrypt data, but which cannot be used to decrypt data;
o private key – a key known only to the individual user, used to decrypt data. It need not be transmitted to the site doing the encryption.
The encryption scheme is such that it is impossible or extremely hard to decrypt data given only the public key.
The RSA public-key encryption scheme is based on the hardness of factoring a very large number (hundreds of digits) into its prime components.
Authentication (challenge-response systems)
Password-based authentication is widely used, but is susceptible to sniffing on a network.
Challenge-response systems avoid transmission of passwords:
o The DB sends a (randomly generated) challenge string to the user.
o The user encrypts the string and returns the result.
o The DB verifies the user's identity by decrypting the result.
A public-key encryption system can also be used, with the DB sending a message encrypted using the user's public key, and the user decrypting and sending the message back.
Digital signatures are used to verify the authenticity of data. The private key is used to sign data, and the signed data is made public. Anyone can read the data with the public key, but cannot generate signed data without the private key.
o Digital signatures also help ensure non-repudiation: the sender cannot later claim to have not created the data.
Digital Certificates
Digital certificates are used to verify the authenticity of public keys.
Problem: when you communicate with a web site, how do you know if you are talking with the genuine web site or an imposter?
o Solution: use the public key of the web site.
o Problem: how to verify whether the public key itself is genuine?
Solution:
o Every client (e.g., a browser) has the public keys of a few root-level certification authorities.
o A site can get its name/URL and public key signed by a certification authority; the signed document is called a certificate.
o A client can use the public key of the certification authority to verify the certificate.
o Multiple levels of certification authorities can exist. Each certification authority presents its own public-key certificate signed by a higher-level authority, and uses its private key to sign the certificates of other web sites/authorities.
Statistical database
Statistical databases are used mainly to produce statistics about various
populations.
The database may contain confidential data about individuals, which should
be protected from user access. However, users are permitted to retrieve
statistical information about the populations, such as averages, sums,
counts, maximums, minimums, and standard deviations. For example, consider a
PERSON relation with the attributes Name, Ssn, Income, Address, City, State,
Zip, Sex, and Last_degree.


A population is a set of tuples of a relation (table) that satisfy some selection
condition. Hence, each selection condition on the PERSON relation will
specify a particular population of PERSON tuples. For example, the condition
Sex = 'M' specifies the male population; the condition ((Sex = 'F') AND
(Last_degree = 'M.S.' OR Last_degree = 'Ph.D.')) specifies the female
population that has an M.S. or Ph.D. degree as their highest degree; and the
condition City = 'Houston' specifies the population that lives in Houston.

Statistical queries involve applying statistical functions to a population of
tuples.

For example, we may want to retrieve the number of individuals in a
population or the average income in the population. However, statistical users
are not allowed to retrieve individual data, such as the income of a specific
person.


Statistical database security techniques must prohibit the retrieval of
individual data. This can be achieved by prohibiting queries that retrieve
attribute values and by allowing only queries that involve statistical aggregate
functions such as COUNT, SUM, MIN, MAX, AVERAGE, and STANDARD
DEVIATION. Such queries are sometimes called statistical queries.



It is the responsibility of a database management system to ensure the
confidentiality of information about individuals, while still providing useful
statistical summaries of data about those individuals to users. Provision of
privacy protection for users in a statistical database is paramount; its
violation is illustrated in the following example.

In some cases it is possible to infer the values of individual tuples from a
sequence of statistical queries. This is particularly true when the conditions
result in a population consisting of a small number of tuples. As an
illustration, consider the following statistical queries:
Q1: SELECT COUNT (*) FROM PERSON WHERE <condition>;
Q2: SELECT AVG (Income) FROM PERSON WHERE <condition>;
Now suppose that we are interested in finding the income of Jane Smith, and we
know that she has a Ph.D. degree and that she lives in the city of Bellaire,
Texas. We issue the statistical query Q1 with the following condition:
(Last_degree = 'Ph.D.' AND Sex = 'F' AND City = 'Bellaire' AND State = 'Texas')
If the result of Q1 is 1, the condition identifies a population of exactly one
tuple (Jane Smith herself), so issuing Q2 with the same condition returns her
individual income rather than a true statistical average, violating her
privacy. One common countermeasure is to reject statistical queries whose
population size falls below a minimum threshold.
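The attack can be reproduced end to end with Python's standard sqlite3 module.
In this sketch the table contents and the attribute subset are invented purely
for illustration:

import sqlite3

# In-memory table with made-up data for the PERSON relation.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE person
               (name TEXT, sex TEXT, city TEXT, state TEXT,
                last_degree TEXT, income INTEGER)""")
con.executemany("INSERT INTO person VALUES (?, ?, ?, ?, ?, ?)", [
    ("Jane Smith", "F", "Bellaire", "Texas", "Ph.D.", 95000),
    ("Bob Jones",  "M", "Houston",  "Texas", "M.S.",  70000),
    ("Ann Lee",    "F", "Houston",  "Texas", "Ph.D.", 88000),
])

cond = ("last_degree = 'Ph.D.' AND sex = 'F' "
        "AND city = 'Bellaire' AND state = 'Texas'")

# Q1: the COUNT reveals that the population has exactly one member...
(count,) = con.execute(f"SELECT COUNT(*) FROM person WHERE {cond}").fetchone()

# Q2: ...so the "statistical" average is one individual's income.
(avg,) = con.execute(f"SELECT AVG(income) FROM person WHERE {cond}").fetchone()
print(count, avg)   # 1 95000.0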
INDUSTRIAL AND PRACTICAL CONNECTIVITY OF THE SUBJECT
Databases are used to support internal operations of organizations and to underpin
online interactions with customers and suppliers.
They also hold administrative information and more specialized data, such as
engineering data or economic models.
Examples of database applications include computerized library systems, flight
reservation systems, computerized parts inventory systems, and many content
management systems that store websites as collections of webpages in a database.
PRACTICAL ORIENTATION:
Practical lab work with ORACLE will help the students apply the DBMS
concepts learned in theory.
Further, hands-on exposure to database server administration, configuration,
and architecture in real time will help the students understand the
application of DBMS in industry.
B.E / B.TECH DEGREE EXAMINATION, NOVEMBER/DECEMBER 2014
Third Semester
Computer Science and Engineering
CS6302 – DATABASE MANAGEMENT SYSTEMS
(Common to Information Technology)
(Regulation 2013)
Time: Three Hours                                    Maximum: 100 marks
Answer ALL Questions
PART A – (10 x 2 = 20 Marks)
1. Why is 4NF more desirable than BCNF? [Pg No.7]
2. What is the purpose of a Data Base Management System? [Pg No.5]
3. Differentiate between Dynamic and Static SQL. [Pg No.66]
4. Give a brief description of DCL commands. [Pg No.66]
5. Define the properties of a Transaction. [Pg No.96]
6. What is Serializability? How is it tested? [Pg No.96]
7. Differentiate between static and dynamic hashing. [Pg No.135]
8. Define Data Mining and Data Warehousing. [Pg No.135 & 136]
9. What is Crawling and Indexing the Web? [Pg No.166]
10. What is Relevance Ranking? [Pg No.166]
PART B – (5 X 16 = 80 marks)
11. (a) Write short notes on: (16)
(i) Data model and its types. [Pg No.21]
(ii) E-R Diagram for a Banking System. [Pg No.26]
Or
(b) What are Normal Forms? Explain the types of normal forms with an example.
[Pg No.46] (16)
12. (a) Explain the following with examples:
(i) DDL [Pg No.74] (4)
(ii) DML [Pg No.74] (4)
(iii) Embedded SQL [Pg No.69] (8)
Or
(b) Give a detailed description of Query Processing and Optimization.
Explain the cost estimation of Query Optimization. [Pg No.83] (16)
13. (a) What is Concurrency? Explain it in terms of locking mechanisms and
the Two-Phase Commit Protocol. [Pg No.119] (16)
Or
(b) Write short notes on:
(i) Transaction concept [Pg No.101] (8)
(ii) Deadlock [Pg No.129] (8)
14. (a) (i) Explain RAID technology in detail. [Pg No.137] (8)
(ii) Write short notes on Spatial and Mobile Databases. [Pg No.153] (8)
Or
(b) Explain in detail about (i) B+ tree index (ii) B tree index files. [Pg No.152] (16)
15. (a) (i) Write short notes on Distributed Transactions. [Pg No.170] (8)
(ii) Explain discretionary access control based on granting and revoking
privileges. [Pg No.181] (8)
Or
(b) Write short notes on: (16)
(i) Classification [Pg No.176]
(ii) Clustering [Pg No.176]
B.E / B.TECH DEGREE EXAMINATION, APRIL/ MAY 2015
Third Semester
Computer Science and Engineering
CS6302 – DATABASE MANAGEMENT SYSTEMS
(Common to Information Technology)
(Regulation 2013)
Time: Three Hours Maximum : 100 marks
Answer ALL Questions
PART A – (10 x 2 = 20 Marks)
1. Write the characteristics that distinguish the Database approach from the
file-based approach. [Pg No.5]
2. Define Functional Dependency. [Pg No.5]
3. State the need for Query Optimization. [Pg No.65]
4. What is the difference between Static and Dynamic SQL? [Pg No.66]
5. Write the ACID properties of a Transaction. [Pg No.96]
6. Define: DDL, DML, DCL and TCL. [Pg No.74]
7. How does dynamic hashing differ from static hashing? [Pg No.135]
8. Write about the four types (Star, Snowflake, Galaxy and Fact Constellation)
of data warehouse schema. [Pg No.136]
9. Define Threats and Risks. [Pg No.166]
10. What is Association rule mining? [Pg No.167]
PART B – (5 X 16 = 80 marks)
11. (a) Draw an E-R Diagram for the Restaurant Menu Ordering System, which will
facilitate food item ordering and services within a restaurant. The entire
restaurant scenario is detailed as follows. The customer is able to view the
food items menu, call the waiter, place orders and obtain the final bill
through their wireless tablet PC. Waiters, through their wireless tablet PCs,
are able to initialize a table for customers, control the table functions to
assist customers, and send orders to the food preparation staff (chefs).
Chefs, with their touch-display interfaces to the system, are able to view
orders sent to the kitchen by waiters. During preparation, they are able to
let the waiter know the status of each item, and can send notifications when
items are completed. The system should have full accountability and logging
facilities, and should support supervisor actions to account for exceptional
circumstances, such as a meal being refunded or walked out on.
(16) [Pg No.26]
Or
(b) State the need for Normalization of a Database and explain the various
normal forms (1st, 2nd, 3rd, BCNF, 4th, 5th and Domain Key) with suitable
examples. (16) [Pg No.46]
12. (a) Consider a student registration database comprising the table schemas
given below.
Student File: (Student Number, Student Name, Address, Telephone)
Course File: (Course Number, Description, Hours, Professor Number)
Professor File: (Professor Number, Name, Office)
Registration File: (Student Number, Course Number, Date)
Consider a suitable sample of tuples/records for the above-mentioned tables and
write DML statements (SQL) to answer the queries listed below.
(16) [Pg No.93]
(i) Which courses does a specific professor teach?
(ii) What courses are taught by two specific professors?
(iii) Who teaches a specific course and where is his/her office?
(iv) For a specific student number, in which courses is the student registered
and what is his/her name?
(v) Who are the professors for a specific student?
(vi) Who are the students registered in a specific course?
Or
(b) Discuss the join order optimization and heuristic optimization.
[Pg No.86] (16)
13. (a) Explain the Two-Phase Commit and Three-Phase Commit protocols.
[Pg No.114] (16)
Or
(b) Consider the following schedules. The actions are listed in the order they
are scheduled, and prefixed with the transaction name. Determine whether each
schedule is serializable. [Pg No.107] (16)
S1: T1:R(X), T2:R(X), T1:W(Y), T2:W(Y), T1:R(Y), T2:R(Y)
S2: T3:W(X), T1:R(X), T1:W(Y), T2:R(Z), T2:W(Z), T3:R(Z)
14. (a) With suitable diagrams, discuss the RAID levels (Level 0, Level 1,
Level 0+1, Level 3, Level 4 and Level 5). [Pg No.137] (16)
Or
(b) Explain the Architectural components of a Data warehouse and write about Data
Marts. [Pg No.189] (16)
15. (a) Neatly write the K-means algorithm and show the intermediate results
in clustering the below given points into clusters using the K-means
algorithm. (16)
P1: (0,0), P2: (1,10), P3: (2,20), P4: (1,15), P5: (1000,2000),
P6: (1500,1500), P7: (1000,1250). [Pg No.174]
Or
(b) Discuss the access control mechanisms and cryptography methods used to
secure the database. [Pg No.181] (16)
B.E / B.TECH DEGREE EXAMINATION, NOVEMBER/DECEMBER 2015
Third Semester
Computer Science and Engineering
CS6302 – DATABASE MANAGEMENT SYSTEMS
(Common to Information Technology)
(Regulation 2013)
Time: Three Hours Maximum : 100 marks
Answer ALL Questions
PART A – (10 x 2 = 20 Marks)
1. State the anomalies of 1NF. [Pg No.7]
2. Is it possible for several attributes to have the same domain? Illustrate
your answer with suitable examples. [Pg No.7]
3. Differentiate static and dynamic SQL. [Pg No.66]
4. Why does SQL allow duplicate tuples in a table or in a query result?
[Pg No.66]
5. What is meant by concurrency control? [Pg No.98]
6. Give an example of the Two-Phase Commit protocol. [Pg No.99]
7. Differentiate static and dynamic hashing. [Pg No.135]
8. Give an example of a join that is not a simple equi-join for which
partitioned parallelism can be used. [Pg No.136]
9. List the types of privileges used in database access control. [Pg No.167]
10. Can we have more than one constructor in a class? If yes, explain the need
for such a situation. [Pg No.167]
PART B – (5 X 16 = 80 marks)
11. (a) (i) With the help of a neat block diagram, explain the basic
architecture of a Database Management System. [Pg No.18] (8)
(ii) What are the advantages of having a centralised control of data?
Illustrate your answer with a suitable example. [Pg No.12] (8)
Or
(b) A car rental company maintains a database for all vehicles in its current
fleet. For all vehicles, it includes the vehicle identification number,
license number, manufacturer, model, date of purchase and color. Special data
are included for certain types of vehicles:
Trucks: cargo capacity
Sports cars: horsepower, renter age requirement
Vans: number of passengers
Off-road vehicles: ground clearance, drive train (four- or two-wheel drive)
Construct an ER model for the car rental company database. [Pg No.26]
(16)
12. (a) Describe the six clauses in the syntax of an SQL query and show what
type of constructs can be specified in each of the six clauses. Which of
the six clauses are required and which are optional? [Pg No.80]
(16)
Or
(b) Assume the following tables: [Pg No.95]
Degree (degcode, name, subject)
Candidate (seatno, degcode, name, semester, month, year, result)
Marks (seatno, degcode, semester, month, year, papcode, marks)
where degcode is the degree code, name is the name of the degree (e.g., M.Sc.,
M.Com.), subject is the subject of the course (e.g., Physics), and papcode is
the paper code (e.g., A1).
Solve the following queries using SQL: (16)
(i) Write a SELECT statement to display all the degree codes which are in the
Candidate table but not present in the Degree table, in the order of
degcode. (4)
(ii) Write a SELECT statement to display the names of all the candidates who
have got less than 40 marks in exactly 2 subjects. (4)
(iii) Write a SELECT statement to display the name, subject and number of
candidates for all degrees in which there are fewer than 5 candidates. (4)
(iv) Write a SELECT statement to display the names of all the candidates who
have got the highest total marks in M.Sc. (Maths). (4)
13. (a) (i) What is concurrency control? How is it implemented in a DBMS?
Illustrate with a suitable example. [Pg No.119] (8)
(ii) Discuss view serializability and conflict serializability. (8)
[Pg No.107]
Or
(b) What is deadlock? How does it occur? How can transactions be written to
(i) avoid deadlock [Pg No.129] (8)
(ii) guarantee correct execution? [Pg No.119] (8)
Illustrate with a suitable example.
14. (a) (i) What is RAID? List the different levels in RAID technology and explain
its features. [Pg No.137] (8)
(ii) Illustrate indexing and hashing techniques with suitable examples.(8)
[Pg No.142]
Or
(b) Write short notes on: (8 + 8)
(i) Spatial and multimedia databases [Pg No.153]
(ii) Mobile and web databases [Pg No.156]
15. (a) (i) Describe the GRANT function and explain how it relates to
security. What types of privileges may be granted? How could rights be
revoked? [Pg No.181] (8)
(ii) Write short notes on Data Warehousing. [Pg No.189] (8)
Or
(b) Suppose an object-oriented database has an object A, which references
object B, which in turn references object C. Assume all objects are on disk
initially. Suppose a program first dereferences A, then dereferences B by
following the reference from A, and then finally dereferences C. Show the
objects that are represented in memory after each dereference, along with
their state. (16) [Pg No.178]
B.E / B.TECH DEGREE EXAMINATION, MAY/JUNE 2016
Third Semester
Computer Science and Engineering
CS6302 – DATABASE MANAGEMENT SYSTEMS
(Common to Information Technology)
(Regulation 2013)
Time: Three Hours Maximum : 100 marks
Answer ALL Questions
PART A – (10 x 2 = 20 Marks)
1. What are the disadvantages of the file processing system? [Pg No.6]
2. Explain the entity relationship model. [Pg No.6]
3. Name the categories of SQL commands. [Pg No.65]
4. Explain query optimization. [Pg No.65]
5. What are the properties of a transaction? [Pg No.96]
6. Differentiate strict two-phase locking protocol and rigorous two-phase
locking protocol. [Pg No.97]
7. What is meant by garbage collection? [Pg No.137]
8. Define software and hardware RAID systems. [Pg No.137]
9. Explain data classification. [Pg No.168]
10. What are the advantages of a data warehouse? [Pg No.168]
PART B – (5 X 16 = 80 marks)
11.(a) Briefly explain about Data base system Architecture. [Pg No.18] (16)
Or
(b) Briefly explain about views of data. [Pg No.24] (16)
12.(a) (i) Explain about SQL fundamentals. [Pg No.69] (8)
(ii) Explain about Data Definition Language. [Pg No.74] (8)
Or
(b) Briefly explain about Query processing. [Pg No.83] (16)
13.(a) Briefly explain about Two phase commit. [Pg No.114] (16)
Or
(b) Explain about locking protocols. [Pg No.119] (16)
14. (a) Briefly explain RAID and RAID levels. [Pg No.137] (16)
Or
(b) Briefly explain about B+ tree index file with example. [Pg No.152] (16)
15. (a) Explain about Distributed Databases and their characteristics, functions and
advantages and disadvantages. [Pg No.160] (16)
Or
(b) Explain the types of database security and database security issues.
[Pg No.172] (16)
-o0o-o0o-o0o-
B.E / B.TECH DEGREE EXAMINATION, NOVEMBER/DECEMBER 2016
Third Semester
Computer Science and Engineering
CS6302 – DATABASE MANAGEMENT SYSTEMS
(Common to Information Technology)
(Regulation 2013)
Time: Three Hours Maximum : 100 marks
Answer ALL Questions
PART A – (10 x 2 = 20 Marks)
1. Differentiate file processing system with database management system.
2. What is a weak entity? Give Example.
3. What is data definition language? Give Example.
4. Differentiate between static and dynamic SQL.
5. What is "serializability"?
6. List the four conditions for deadlock.
7. List out the mechanisms to avoid collision during hashing.
8. What are the advantages of B Tree over B+ Tree?
9. Define a distributed database management system.
10. How does the concept of an object in the object-oriented model differ from
the concept of an entity in the entity-relationship model?
PART B – (5 x 13 = 65 marks)
11. a) i) Explain select, project and Cartesian product operations in relational
algebra with an example. (6)
ii) Construct an E-R diagram for a car insurance company whose
customers own one or more cars each. Each car has associated with it
zero to any number of recorded accidents. Each insurance policy covers
one or more cars, and has one or more premium payments associated with
it. Each payment is for a particular period of time, and has an associated
due date, and the date when the payment was received. (7)
or
b) Explain first normal form, second normal form, third normal form
and BCNF with an example. (13)
12. a) Let relations r1(A, B, C) and r2(C, D, E) have the following
properties: r1 has 20,000 tuples, r2 has 45,000 tuples, 25 tuples of r1 fit on
one block, and 30 tuples of r2 fit on one block. Estimate the number of block
transfers and seeks required, using each of the following join strategies for
r1 ⋈ r2 (natural join): (13)
i. Nested-loop join
ii. Block nested-loop join
iii. Merge join
iv. Hash join
Or
b) i) Explain query optimization with an example. (8)
ii) What is embedded SQL? Give example. (5)
13. a) i) Consider the following two transactions:
T1: read(A);
    read(B);
    if A = 0 then B := B + 1;
    write(B).
T2: read(B);
    read(A);
    if B = 0 then A := A + 1;
    write(A).
Add lock and unlock instructions to transactions T1 and T2, so that they observe
the two-phase locking protocol. Can the execution of these transactions result
in a deadlock? (8)
ii) Consider the following extension to the tree-locking protocol, which allows
both shared and exclusive locks:
A transaction can be either a read-only transaction, in which case it can
request only shared locks, or an update transaction, in which case it can
request only exclusive locks.
Each transaction must follow the rules of the tree protocol. Read-only
transactions may lock any data item first, whereas update transactions must
lock the root first.
Show that the protocol ensures serializability and deadlock freedom. (5)
or
13.b) i) Illustrate two phase locking protocol with an example. (6)
ii) Outline deadlock handling mechanisms. (7)
14. a) i) Explain the architecture of a distributed database system. (7)
ii) Explain the concept of RAID. (6)
or
b) i) Describe benefits and drawbacks of a source-driven architecture for
gathering of data at a data warehouse, as compared to a destination driven
architecture. (7)
ii) Explain the concept of spatial database. (6)
15. a) Suppose that you have been hired as a consultant to choose a database
system for your client's application. For each of the following applications,
state what type of database system (relational, persistent programming
language-based OODB, or object-relational; do not specify a commercial
product) you would recommend. Justify your recommendation. (13)
i. A computer-aided design system for a manufacturer of airplanes.
ii. A system to track contributions made to candidates for public office.
iii. An information system to support the making of movies.
Or
(b) Discuss Apriori algorithm for mining association rules with an example. (13)
Part – C (1 x 15 = 15 Marks)
16. a) Give the DTD or XML Schema for an XML representation of the
following nested-relational schema:
Emp = (ename, ChildrenSet setof (Children), SkillsSet setof (Skills))
Children = (name, Birthday)
Birthday = (day, month, year)
Skills = (type, ExamsSet setof (Exams))
Exams = (year, city)
Or
b) Consider the following bitmap technique for tracking free space in a file.
For each block in the file, two bits are maintained in the bitmap. If the
block is between 0 and 30 percent full, the bits are 00; between 30 and 60
percent, 01; between 60 and 90 percent, 10; and above 90 percent, 11. Such
bitmaps can be kept in memory even for quite large files.
(i) Describe how to keep the bitmap up to date on record insertions and
deletions.
(ii) Outline the benefit of the bitmap technique over free lists in searching
for free space and in updating free space information.
B.E / B.TECH DEGREE EXAMINATION MAY / JUNE 2017
Third Semester
Computer Science and Engineering
CS6302 – DATABASE MANAGEMENT SYSTEMS
(Common to Information Technology)
(Regulation 2013)
Time: Three Hours Maximum : 100 marks
Answer ALL Questions
Part A (10 x 2 = 20 Marks)
1. What are the desirable properties of decomposition?
2. Distinguish between key and super key.
3. What is a query execution plan?
4. Which cost components are used most often as the basis for a cost function?
5. What is serializable schedule?
6. What type of locking is needed for insert and delete operations?
7. Define replication transparency.
8. State the function of data marts.
9. Define Support and confidence.
10. Distinguish between threats and risks.
Part B (5 x 13 = 65 Marks)
11. a) Discuss the correspondence between the ER model constructs and the
relational model constructs. Show how each ER model construct can be mapped to
the relational model. Discuss the options for mapping EER model constructs.
Or
b) i) Explain the overall architecture of a database system in detail. (8)
ii) List the operations of relational algebra and the purpose of each with
example. (5)
12. a) What is meant by semantic query optimization? How does it differ from
other query optimization techniques? Give an example.
Or
b) Justify the need for embedded SQL. Consider the relation Student (Regno,
mark, name, grade). Write an embedded dynamic SQL program in C to
retrieve all the student records whose mark is more than 90.
13. a) Discuss the violations caused by each of the following: dirty read,
non-repeatable read and phantom reads, with suitable examples.
Or
b) Explain why timestamp-based concurrency control allows schedules that are not
recoverable. Describe how it can be modified through buffering to disallow such
schedules.
14. a) i) Compare and contrast distributed database systems and centralized
database systems. (8)
ii) Describe the mobile database recovery schemes. (5)
Or
b) Explain what a RAID system is. How does it improve performance and
reliability? Discuss levels 3 and 4 of RAID. (3 + 4 + 6)
15. a) i) What are the basic crawling operations? Explain the processing steps
involved in the crawling procedure with an example. (8)
ii) Explain the process of querying XML data with an example. (5)
Or
b) Describe the various components of a data warehouse and explain the
different data models used to store data, with examples.
Part – C (1 x 15 = 15 Marks)
16. a) Consider the relation schema given below. Design and draw an ER diagram
that captures the information of this schema. (5)
Employee (empno, name, office, age)
Books(Isbn, title, authors, publisher)
Loan(empno, Isbn, date)
Write the following queries in relational algebra and SQL.
(i) Find the names of employees who have borrowed a book published by
McGraw Hill. (5)
(ii) Find the names of employees who have borrowed all books published by
McGraw Hill. (5)
Or
b) Trace the results of using the Apriori algorithm on the grocery store
example with support threshold s = 33.34% and confidence threshold c = 60%.
Show the candidate and frequent itemsets for each database scan.
Enumerate all the final frequent itemsets. Also indicate the association rules
that are generated, highlight the strong ones, and sort them by confidence.
Transaction ID   Items
T1               HotDogs, Buns, Ketchup
T2               HotDogs, Buns
T3               HotDogs, Coke, Chips
T4               Chips, Coke
T5               Chips, Ketchup
T6               HotDogs, Coke, Chips
More Related Content

DOC
Cs 2255 database_management_systems
PDF
Semester V-converted.pdf
PDF
Annauniversity mca
PDF
BIG DATA ANALYTICS EASSY notes for sybsc it student
PDF
1_Prelim-Module-IM101 ADVANCE DATABASE SYSTEM
PPTX
Module-1.pptx
Cs 2255 database_management_systems
Semester V-converted.pdf
Annauniversity mca
BIG DATA ANALYTICS EASSY notes for sybsc it student
1_Prelim-Module-IM101 ADVANCE DATABASE SYSTEM
Module-1.pptx

Similar to CS6302-SCAD-MSM-by www.LearnEngineering.in.pdf (20)

PDF
CS3270 – Database Systems Course Outline
PDF
DATABASE MANAGEMENT SYSTEMS university course materials useful for students ...
PDF
DATABASE MANAGEMENT SYSTEMS.pdf
PPTX
database manangement system introduction
DOCX
CS3492 - Database Management System Syallabus - Regulation 2021 for CSE.docx
PDF
Database systems Handbook by Muhammad sharif dba.pdf
PDF
Database systems Handbook by Muhammad Sharif dba.pdf
PDF
Database system Handbook.pdf
PDF
Database system Handbook.pdf
PDF
Database systems Handbook.pdf
PDF
Muhammad Sharif handbook dbms Database systems Handbook.pdf
PDF
Database systems Handbook database systems muhammad sharif.pdf
PDF
Database system Handbook.pdf
PDF
Database systems Handbook.pdf
PDF
Muhammad Sharif Database systems Handbook.pdf
PDF
Database systems Handbook.pdf
PDF
Muhammad Sharif Database systems Handbook.pdf
PDF
6. ME Syllabus-converted.pdf
PDF
Book by Muhammad Sharif full dbms Database systems handbook.pdf
PDF
Book by Muhammad Sharif full dbms Database systems handbook.pdf
CS3270 – Database Systems Course Outline
DATABASE MANAGEMENT SYSTEMS university course materials useful for students ...
DATABASE MANAGEMENT SYSTEMS.pdf
database manangement system introduction
CS3492 - Database Management System Syallabus - Regulation 2021 for CSE.docx
Database systems Handbook by Muhammad sharif dba.pdf
Database systems Handbook by Muhammad Sharif dba.pdf
Database system Handbook.pdf
Database system Handbook.pdf
Database systems Handbook.pdf
Muhammad Sharif handbook dbms Database systems Handbook.pdf
Database systems Handbook database systems muhammad sharif.pdf
Database system Handbook.pdf
Database systems Handbook.pdf
Muhammad Sharif Database systems Handbook.pdf
Database systems Handbook.pdf
Muhammad Sharif Database systems Handbook.pdf
6. ME Syllabus-converted.pdf
Book by Muhammad Sharif full dbms Database systems handbook.pdf
Book by Muhammad Sharif full dbms Database systems handbook.pdf
Ad

Recently uploaded (20)

PDF
PRIZ Academy - 9 Windows Thinking Where to Invest Today to Win Tomorrow.pdf
PDF
BMEC211 - INTRODUCTION TO MECHATRONICS-1.pdf
PPTX
web development for engineering and engineering
PDF
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
PPTX
Sustainable Sites - Green Building Construction
PPTX
IOT PPTs Week 10 Lecture Material.pptx of NPTEL Smart Cities contd
PPTX
M Tech Sem 1 Civil Engineering Environmental Sciences.pptx
PPTX
CARTOGRAPHY AND GEOINFORMATION VISUALIZATION chapter1 NPTE (2).pptx
PPTX
Internet of Things (IOT) - A guide to understanding
PPTX
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
PDF
TFEC-4-2020-Design-Guide-for-Timber-Roof-Trusses.pdf
PDF
Embodied AI: Ushering in the Next Era of Intelligent Systems
PPTX
Recipes for Real Time Voice AI WebRTC, SLMs and Open Source Software.pptx
PPTX
Geodesy 1.pptx...............................................
PPTX
UNIT 4 Total Quality Management .pptx
PPTX
UNIT-1 - COAL BASED THERMAL POWER PLANTS
PPT
Mechanical Engineering MATERIALS Selection
PDF
The CXO Playbook 2025 – Future-Ready Strategies for C-Suite Leaders Cerebrai...
PPT
Project quality management in manufacturing
PPTX
FINAL REVIEW FOR COPD DIANOSIS FOR PULMONARY DISEASE.pptx
PRIZ Academy - 9 Windows Thinking Where to Invest Today to Win Tomorrow.pdf
BMEC211 - INTRODUCTION TO MECHATRONICS-1.pdf
web development for engineering and engineering
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
Sustainable Sites - Green Building Construction
IOT PPTs Week 10 Lecture Material.pptx of NPTEL Smart Cities contd
M Tech Sem 1 Civil Engineering Environmental Sciences.pptx
CARTOGRAPHY AND GEOINFORMATION VISUALIZATION chapter1 NPTE (2).pptx
Internet of Things (IOT) - A guide to understanding
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
TFEC-4-2020-Design-Guide-for-Timber-Roof-Trusses.pdf
Embodied AI: Ushering in the Next Era of Intelligent Systems
Recipes for Real Time Voice AI WebRTC, SLMs and Open Source Software.pptx
Geodesy 1.pptx...............................................
UNIT 4 Total Quality Management .pptx
UNIT-1 - COAL BASED THERMAL POWER PLANTS
Mechanical Engineering MATERIALS Selection
The CXO Playbook 2025 – Future-Ready Strategies for C-Suite Leaders Cerebrai...
Project quality management in manufacturing
FINAL REVIEW FOR COPD DIANOSIS FOR PULMONARY DISEASE.pptx
Ad

CS6302-SCAD-MSM-by www.LearnEngineering.in.pdf

  • 1. ENGINEERING COLLEGES 2016 – 17 Odd Semester IMPORTANT QUESTIONS & ANSWERS Common to CSE & IT SUBJECT CODE: CS6302 SUBJECT NAME: Data Base Management Systems Regulation: 2013 Semester and Year: 03 and II Visit & Downloaded from : www.LearnEngineering.in Visit & Downloaded from : www.LearnEngineering.in
  • 2. ANNA UNIVERSITY, CHENNAI-25 SYLLABUS REGULATION 2013 CS6302 DATABASE MANAGEMENT SYSTEMS L T P C 3 0 0 3 UNIT I - INTRODUCTION TO DBMS 10 File Systems Organization - Sequential, Pointer, Indexed, Direct - Purpose of Database System- Database System Terminologies-Database characteristics- Data models – Types of data models – Components of DBMS- Relational Algebra. LOGICAL DATABASE DESIGN: Relational DBMS - Codd's Rule - Entity- Relationship model - Extended ER Normalization – Functional Dependencies, Anomaly- 1NF to 5NF- Domain Key Normal Form – Denormalization. UNIT II - SQL & QUERY OPTIMIZATION 8 SQL Standards - Data types - Database Objects- DDL-DML-DCL-TCL-Embedded SQL-Static Vs Dynamic SQL - QUERY OPTIMIZATION: Query Processing and Optimization - Heuristics and Cost Estimates in Query Optimization. UNIT III - TRANSACTION PROCESSING AND CONCURRENCY CONTROL 8 Introduction-Properties of Transaction- Serializability- Concurrency Control – Locking Mechanisms- Two Phase Commit Protocol-Dead lock. UNIT IV - TRENDS IN DATABASE TECHNOLOGY 10 Overview of Physical Storage Media – Magnetic Disks – RAID – Tertiary storage – File Organization – Organization of Records in Files – Indexing and Hashing – Ordered Indices – B+ tree Index Files – B tree Index Files – Static Hashing – Dynamic Hashing - Introduction to Distributed Databases- Client server technology- Multidimensional and Parallel databases- Spatial and multimedia databases- Mobile and web databases- Data Warehouse-Mining- Data marts. Visit & Downloaded from : www.LearnEngineering.in Visit & Downloaded from : www.LearnEngineering.in
  • 3. UNIT V - ADVANCED TOPICS 9 DATABASE SECURITY: Data Classification-Threats and risks – Database access Control – Types of Privileges –Cryptography- Statistical Databases- Distributed Databases-Architecture-Transaction Processing-Data Warehousing and Mining- Classification-Association rules-Clustering-Information Retrieval- Relevance ranking- Crawling and Indexing the Web- Object Oriented Databases-XML Databases. TOTAL: 45 PERIODS TEXT BOOK: 1. Ramez Elmasri and Shamkant B. Navathe, ―Fundamentals of Database Systems‖, Fifth Edition, Pearson Education, 2008. REFERENCES: Abraham Silberschatz, Henry F. Korth and S. Sudharshan, ―Database System Concepts‖, Sixth Edition, Tata McGraw Hill, 2011. C.J.Date, A.Kannan and S.Swamynathan, ―An Introduction to Database Systems‖, Eighth Edition, Pearson Education, 2006. 3. Atul Kahate, ―Introduction to Database Management Systems‖, Pearson Education, New Delhi, 2006. 4. Alexis Leon and Mathews Leon, ―Database Management Systems‖, Vikas Publishing House Private Limited, New Delhi, 2003. 5. Raghu Ramakrishnan, ―Database Management Systems‖, Fourth Edition, Tata McGraw Hill, 2010. G.K.Gupta,‖Database Management Systems‖, Tata McGraw Hill, 2011. Rob Cornell, ―Database Systems Design and Implementation‖, Cengage Learning 2011 Visit & Downloaded from : www.LearnEngineering.in Visit & Downloaded from : www.LearnEngineering.in
  • 4. TABLE OF CONTENTS S.No Description Page No. a. Aim ,Objectives and Outcomes of the subject 1 b. Detailed Lesson Plan 2 UNIT I - Introduction to DBMS c. Part - A 5 d. Part - B 12 1 Characteristics of Database 12 2. DBMS and Database Users 16 3. Database System Structure 18 4. Data Models 21 5. DBMS Architecture 24 6. ER Model 26 7. Functional dependencies 43 e Part - C 46 8. Normalization 46 9. Relational Algebra 55 UNIT II – SQL and Query Optimization f. Part - A 65 g. Part - B 69 10. SQL fundamentals 69 11. SQL Commands 74 12. SQL Query structure 80 13. Query Processing 83 14. Query Optimization 86 h. Part - C 93 15. Cost based optimization 91 16` SQL Queries Exercises 93 UNIT III – Transaction Processing and Concurrency Control i. Part - A 96 Visit & Downloaded from : www.LearnEngineering.in Visit & Downloaded from : www.LearnEngineering.in
  • 5. S.No Description Page No. j. Part - B 101 17. Transaction Concept and States 101 18. Serializability 107 19. Commit Protocols 114 20. Concurrency Control 119 21. Lock Conversion 127 k. Part - C 129 22. Deadlock 129 23. Deadlock Detection and Recovery 131 UNIT IV – Trends in Database Technology l. Part - A 134 m. Part - B 137 24. RAID Technology 137 25. Indexing 142 26. Hashing 149 27. B Trees 152 28. Spatial and Multimedia Database 153 n. Part - C 156 29. Mobile and Web Database 156 30. Distributed Databases 160 UNIT V – Advanced Topics o. Part - A 166 p. Part - B 170 31 Distributed Transactions 170 32 Database Security 172 33 K-Means Algorithm 174 34 Classification and clustering 176 35 Object Oriented Database 178 37 Access Control 181 38 Threats and risks 186 39 Datawarehousing and Datamining 189 Visit & Downloaded from : www.LearnEngineering.in Visit & Downloaded from : www.LearnEngineering.in
  • 6. S.No Description Page No. 35 Distributed Architecures 195 q. Part - C 197 36. XML Databases 197 37. Cryptography and statistical Database 204 r. Industrial and Practical Connectivity of the Subject 208 51. Question Bank 209 Visit & Downloaded from : www.LearnEngineering.in Visit & Downloaded from : www.LearnEngineering.in
  • 7. AIM, OBJECTIVE & OUTCOMES OF THE SUBJECT AIM : To expose the students to the fundamental and advanced concepts of Data base systems. OBJECTIVES: To expose the students to the fundamentals of Database Management Systems. To make the students understand the relational model and Database design using Normalization. To familiarize the students with ER diagrams. To expose the students to SQL and make the students to write SQL queries. To make the students to understand the fundamentals of Transaction Processing and Query Processing. To familiarize the students with the different types of databases. To make the students understand the Security Issues in Databases. OUTCOMES: At the end of the course, the student should be able to: Design Databases for applications. Use the Relational model, ER diagrams. Apply concurrency control &recovery mechanisms for practical problems. Design the Query Processor and Transaction Processor. Apply security concepts to databases. 1 Visit & Downloaded from : www.LearnEngineering.in Visit & Downloaded from : www.LearnEngineering.in
  • 8. DETAILED LESSON PLAN Text Book: TB1. Ramez Elmasri and Shamkant B. Navathe, ―Fundamentals of Database Systems‖, Fifth Edition, Pearson Education, 2008. Reference Book: RB1. Abraham Silberschatz, Henry F. Korth and S. Sudharshan, ―Database System Concepts‖, Sixth Edition, Tata McGraw Hill, 2011. RB2. C.J.Date, A.Kannan and S.Swamynathan, ―An Introduction to Database Systems‖, Eighth Edition, Pearson Education, 2006. RB3. Alexis Leon and Mathews Leon, ―Database Management Systems‖, Vikas Publishing House Private Limited, New Delhi, 2003. RB4. Raghu Ramakrishnan, ―Database Management Systems‖, Fourth Edition, Tata McGraw Hill, 2010. Sl. Unit Topic / Portions to be Covered Hours Cumu Books Required lative No No Referred / Planned Hrs UNIT I – INTRODUCTION TO DBMS 1 I File Systems Organization– Sequential, 1 1 TB1, Pointer, Indexed, Direct RB1 2 I Purpose of Database System- Database 1 2 TB1,RB1 System Terminologies 3 I Database characteristics- Data models – 1 3 TB1, RB1 Types of data models 4 I Components of DBMS- Relational Algebra. 2 5 TB1, RB1 LOGICAL DATABASE DESIGN: 5 I Entity-Relationship model 1 6 TB1, RB1,RB2 6 I Extended ER 1 7 TB1, RB1,RB2 7 I Functional Dependencies, Anomaly-1NF to 2 9 TB1, 5NF RB1,RB2 8 I Domain Key Normal Form – 1 10 TB1, Denormalization RB1 2 Visit & Downloaded from : www.LearnEngineering.in Visit & Downloaded from : www.LearnEngineering.in
  • 9. UNIT II – SQL AND QUERY OPTIMIZATION 9 II SQL Standards – Data types 2 12 TB1, RB1 10 II Database Objects- DDL-DML-DCL-TCL 2 14 TB1, RB1,RB3 11 II Embedded SQL-Static Vs Dynamic SQL 2 16 TB1, RB1 12 II QUERY OPTIMIZATION: Query Processing 1 17 TB1, and Optimization RB1 13 II Heuristics and Cost Estimates in Query 1 18 TB1, Optimization. RB1 UNIT III – TRANSACTION PROCESSING AND CONCURRENCY CONTROL 14 III Introduction 1 19 TB1, RB1 15 III Properties of Transaction 1 20 TB1, RB1 16 III Serializability 1 21 TB1, RB1 17 III Concurrency Control 1 22 TB1, RB1 18 III Locking Mechanisms 2 24 TB1, RB1 19 III Two Phase Commit Protocol 1 25 TB1, RB1 20 III Dead lock 1 26 TB1, RB1 UNIT IV – TRENDS IN DATABASE TECHNOLOGY 21 IV RAID – Tertiary storage File Organization 1 27 TB1, – Organization of Records in Files RB1 22 IV Indexing and Hashing –Ordered Indices 1 28 TB1, RB1 23 IV B+ tree Index Files – B tree Index Files 1 29 TB1, RB1 24 IV Static Hashing – Dynamic Hashing 1 30 TB1, RB1,RB2 3 Visit & Downloaded from : www.LearnEngineering.in Visit & Downloaded from : www.LearnEngineering.in
  • 10. Sl. Unit Topic / Portions to be Covered Hours Cumu Books Required lative No No Referred / Planned Hrs 25 IV Introduction to Distributed Databases- 1 31 TB1, Client server technology- RB1 26 IV Multidimensional and Parallel databases- 2 33 TB1, Spatial and multimedia databases RB1,RB4 27 IV Mobile and web databases- Data 1 34 TB1, Warehouse-Mining- Data marts. RB1,RB4 UNIT V ADVANCED TOPICS 28 V DATABASE SECURITY: Data 1 37 TB1, Classification-Threats and risks RB1 29 V Database access Control – Types 1 38 TB1, of Privileges RB1 30 V Cryptography- Statistical Databases 1 39 TB1, RB1 31 V Distributed Databases-Architecture 1 40 TB1, RB1 32 V Transaction Processing-Data Warehousing 2 42 TB1, and Mining-Classification RB1 33 V Association rules-Clustering- 1 43 TB1, Information Retrieval RB1 34 V Relevance ranking-Crawling and Indexing 1 44 TB1, the Web RB1 35 V Object Oriented Databases- 1 45 TB1, XML Databases. RB1 4 Visit & Downloaded from : www.LearnEngineering.in Visit & Downloaded from : www.LearnEngineering.in
  • 11. UNIT I INTRODUCTION TO DBMS File Systems Organization - Sequential, Pointer, Indexed and Direct - Purpose of Database System - Database System Terminologies-Database characteristics- Data models – Types of data models – Components of DBMS - Relational Algebra. LOGICAL DATABASE DESIGN: Relational DBMS - Codd's Rule – Entity - Relationship model - Extended ER Normalization – Functional Dependencies, Anomaly- 1NF to 5NF- Domain Key Normal Form – Denormalization. PART - A 1. Write the characteristics that distinguish the Database approach with the file based approach. Apr / May 2015 (or) List any two advantages of database systems. Database characteristics (or) Advantages of Database system: Controlled Data redundancy and Consistency Self Describing Nature (of Database System) Data isolation or Abstraction Integrity Atomicity Concurrent Access or sharing of Data Security Support for multiple views of Data 2. Define Functional dependency. Apr / May 2015 A functional dependency is a constraint between two sets of attributes from the database.Let R be the relational schema R={A1,A2,…An}. A functional dependency denoted by X  Y X functionally determines Y in a relational schema R iff whenever two tuples of r(R) agree on their X value , they must necessarily agree on their Y value. 5 Visit & Downloaded from : www.LearnEngineering.in Visit & Downloaded from : www.LearnEngineering.in
  • 12. For any two tuples t1 and t2 in r if t1[x]=t2[x], we must have t1[y]=t2[y], ie., values of Y component of a tuple in r depend on and determined by the values of X component. 3. What are the disadvantages of file processing system? May / June 2016 Data redundancy and inconsistency Structure of file is embedded in programs. Difficult to enforce standards Difficulty in sharing and concurrent access. 4. Explain Entity Relationship model. May / June 2016 ER model maps real world on to conceptual schema. It consists of a collection of basic objects called entities and of relationships among these objects. It represents the overall logical structure of a database. It is a semantic model. 5. Define DBMS. What is the purpose of Database Management System? Nov /Dec 2014 DBMS is a collection of programs that enables to create and maintain database. It is a general purpose Software system that facilitates the process of defining, constructing and manipulating data bases for various applications. Defining involves specifying data types, structures and constraints for the data to be stored in the database. Constructing is the process of storing the database itself on some storage medium that is controlled by DBMS. Manipulating includes functions such as querying the database to retrieve specific data, updating database to reflect changes and generating reports from the data. Eg. Oracle, Ms-access, Sybase, Informix, Foxpro 6 Visit & Downloaded from : www.LearnEngineering.in Visit & Downloaded from : www.LearnEngineering.in
  • 13. 6. Why 4NF in Normal Form is more desirable than BCNF? Nov/Dec 2014 Database must be already achieved to 3NF to take it to BCNF, but database must be in 3NF and BCNF, to reach 4NF. • In fourth normal form, there are no multi-valued dependencies of the tables, but in BCNF, there can be multi-valued dependency data in the tables. 7. State the anomalies of 1NF. Nov/ Dec 2015 Redundancies in INF relations lead to a variety of data anomalies. Data anomalies are divided into three general categories: insertion, deletion and update anomalies. 8. Is it possible for several attributes to have the same domain? Illustrate your answer with suitable examples. Nov/ Dec 2015 Yes. Domain is a set of values that an attribute of a tuple can take. Domain is similar to datatype. Therefore multiple attributes can have same domain. Eg. Name varchar2(25) and Place varchar2(25) 9. List five responsibilities of DB manager or DBA. May/June 2007 Data base Administrator: (Any 5) Administer Primary (DB) and secondary resources (DBMS + software)  Schema Definition: DBA creates original Database schema by extracting a set of data definition statements in DDL.  Storage structure and access method definition  Schema and physical organization modification  Granting of Authorization for Data access  Coordinate, monitor and acquire hardware and software resources  Monitor Security problems  Routine Maintenance: Backing up the database periodically Check the disk space availability Monitor jobs and ensuring performance 7 Visit & Downloaded from : www.LearnEngineering.in Visit & Downloaded from : www.LearnEngineering.in
  • 14. 10. Define Data independence. Differentiate physical and logical data independence. Nov / Dec 2003 Data Independence: The capacity to change the schema at one level of database system without having to change the schema at the next higher level. Types: Logical Data independence: The capacity to change the conceptual schema without having to change the External schema. Physical Data Independence: The capacity to change the physical schema without having to change the Conceptual schema. 11. Define Data model. List the various types of data models. Data model is a collection of concepts that can be used to describe the structure of a database (ie., Data, Relationships, Datatypes and constraints) Categories of Data model: High level or Conceptual: Close to users  Low level or Physical: how data is stored in computer  Representational or Implementational : Concepts understood by users and not too far from the way data is organized Eg. Network, Hierarchical Model.   What do you mean by View? A view is a subset of a database that is generated from a query and stored as a permanent object. A view is a virtual table based on the result-set of an SQL statement. A view contains rows and columns, just like a real table. A view represents an external schema. 8 Visit & Downloaded from : www.LearnEngineering.in Visit & Downloaded from : www.LearnEngineering.in
  • 15. 13. Give the various levels of abstraction in DBMS architecture. Three schema architecture. There are three levels of abstraction in DBMs architecture. Internal or physical level  Conceptual level  External or view level    Differentiate super key, candidate key, primary key and foreign key. May / June 2009 Superkey: A superkey s of R is a set of attributes with the property that no two tuples t1 and t2 in r(R) will have t1[s] = t2[s]. Different set of attributes, which are able to identify any row in the database, is known as super key. Candidate Key: A candidate key is a minimal superkey. The removal of any attribute from K will not cause K to be a super key any more. K={A1,…Am} K-Ai is not a key. Eg{SSN,Ename} is a superkey and {ssn} is a candidate key. Primary Key: Primary key could be any key, which is able to identify a specific row in database in a unique manner. Primary key could be any key, which is able to identify a specific row in database in a unique manner. Foreign Key: A foreign key is a field (or collection of fields) in one table that uniquely identifies a row of another table. The foreign key is defined in a second table, but it refers to the primary key in the first table 15. Define Normalisation and Denormalization? What is the need for normalisation? April / May 2004 Normalization is a process of analyzing the given relational schema based on their FDs and primary keys to achieve desirable properties of minimizing redundancy and minimizing the insertion and Updation anamolies. The process of storing the join of higher normal form relations as a base relation in lower normal form is known as denormalization. 9 Visit & Downloaded from : www.LearnEngineering.in Visit & Downloaded from : www.LearnEngineering.in
  • 16. INTRODUCTION TO DATABASE SYSTEMS Database System Terminologies: a) Data: Data are Known facts that can be recorded and that have implicit meaning. b) Database: Database is a collection of related data with some inherent meaning. It is designed, built and populated with data for specific purpose. It represents some aspects of real world. c) Database Management System: It is a collection of programs that enables to create and maintain database. It is a general purpose Software system that facilitates the process of defining, constructing and manipulating data bases for various applications. Defining involves specifying data types, structures and constraints for the data to be stored in the database Constructing is the process of storing the database itself on some storage medium that is controlled by DBMS. Manipulating includes functions such as querying the database to retrieve specific data, updating database to reflect changes and generating reports from the data. Eg. Oracle, Ms-access, Sybase, Informix, Foxpro d) Database System: Database and DBMS together known as Database system. 10 Visit & Downloaded from : www.LearnEngineering.in Visit & Downloaded from : www.LearnEngineering.in
  • 17. User / programmers DataBaseSystem Application programs / Queries Software to process Queries / Programs Software to access Stored Data DBMS Stored Database Stored Definition (Metadata) Database Fig 1.1 Database System Environment Applications of Database System: Banking – Customer information, accounts, loans and transactions  Airlines – Reservation and schedule information  Universities – Student, course, registration and grades  Credit and Transactions – purchases on cards and report  Telecommunications – Records of calls, report, balances etc.,  Finance – Holdings, sales, stocks and bond purchase.  Sales – customer, product, purchase.  Manufacturing – Inventory items, orders, purchase.  Human Resources-Employee details, salary, tax etc., 11 Visit & Downloaded from : www.LearnEngineering.in Visit & Downloaded from : www.LearnEngineering.in
  • 18. 1A. CHARACTERISTICS OF DATABASE SYSTEM 1A. Explain in detail about database system and its characteristics. Also compare Database systems with file processing system. Nov / Dec 2015 Database System Vs File System: (or) Characteristics of Database approach (or) Disdvantages of File system: Introduction: File system: A file system is a method for storing and organizing computer files and the data they contain to make it easy to find and access them. In traditional file processing, user defines and implements files needed for specific application as a part of programming application. This file processing system is supported by operating system. It stores permanent records in various files and it needs different application programs to extract records from and add records to appropriate files. Database System: Define DB, DBS and DBMS. Database characteristics (or) purpose of database system: Data redundancy and Consistency Self Describing Nature (of Database System) Data isolation or Abstraction Integrity, Atomicity Concurrent Access or sharing of Data Security Support for multiple views of Data Enforcing standards. 12 Visit & Downloaded from : www.LearnEngineering.in Visit & Downloaded from : www.LearnEngineering.in
  • 19. (i) Data redundancy and Consistency: In file System, each user maintains separate files and programs to manipulate these files because each requires some data not available from other user‗s files. This redundancy in defining and storage of data results in wasted storage space,  redundant efforts to maintain common update,  higher storage and access cost and  Leads to inconsistency of data (ie.,) various copies of same data may not agree.  In Database approach, a single repository of data is maintained that is maintained that is defined once and then accessed by various users. Thus redundancy is controlled and it is consistent. (ii) Self Describing Nature of Database System In File System, the structure of the data file is embedded in the access programs. A database system contains not only the database itself but also a complete definition or description of database structure and constraints. This definition is stored in System catalog which contains information such as structure of each file, type and storage format of each data item and various constraints on the data. Information stored in the catalog is called Meta-Data. What is the purpose of Meta data? April / May 2004 Metadata is "data that provides information about other data". Two types of metadata exist: structural metadata and descriptive metadata. Structural metadata is data about the containers of data. Descriptive metadata uses individual instances of application data or the data content. A main purpose of metadata is to facilitate in the discovery of relevant information, more often classified as resource discovery. Metadata also helps 13 Visit & Downloaded from : www.LearnEngineering.in Visit & Downloaded from : www.LearnEngineering.in
  • 20. organize electronic resources, provide digital identification, and helps support archiving and preservation of the resource. DBMS is not written for specific applications, hence it must refer to catalog to know structure of file etc., and hence it can work equally well with any number of database applications. (iii) Data Isolation or Abstraction Conventional File processing systems do not allow data to be retrieved in convenient and efficient manner. More responsive data retrieval systems are required for general use. The structure of the data file is embedded in the access programs. So any changes to the structure of a file may require changing all programs that access this file. Because data are scattered in various files and files may be in different formats, writing new application programs to retrieve appropriate data is difficult. But the DBMS access programs do not require such changes in most cases. The structure of data files is stored in DBMS catalog separately from the access programs. This property is known as program data independence. Operations are separately specified and can be changed without affecting the interface. User application programs can operate on data by invoking these operations regardless of how they are implemented. This property is known as program operation independence. Program data independence and program operation independence are together known as data independence. (iv) Enforcing Integrity Constraints: The data values must satisfy certain types of consistency constraints. In File System, developers enforce constraints by adding appropriate code in application program. When new constraints are added, it is difficult to change the programs to enforce them. In data base system, DBMS provide capabilities for defining and enforcing constraints. The constraints are maintained in system catalog. Therefore application 14 Visit & Downloaded from : www.LearnEngineering.in Visit & Downloaded from : www.LearnEngineering.in
  • 21. programs work independently with addition or modification of constraints. Hence integrity problems are avoided. (v) Atomicity: A Computer system is subjected to failure. If failure occurs, the data has to be restored to the consistent state that existed prior to failure. The transactions must be atomic – it must happen in entirety or not at all. It is difficult to ensure atomicity in File processing System. In DB approach, the DBMS ensures atomicity using the Transaction manager inbuilt in it. DBMS supports online transaction processing and recovery techniques to maintain atomicity. (vi) Concurrent Access or sharing of Data: When multiple users update the data simultaneously, it may result in inconsistent data. The system must maintain supervision which is difficult because data may be accessed by many different application programs that may have not been coordinated previously. The database (DBMS) include concurrency control software to ensure that several programs /users trying to update the same data do so in controlled manner, so that the result of update is correct. (vii) Security: Every user of the database system should not be able to access the data. But since the application programs are added to the system in an adhoc manner, enforcing such security constraints is difficult in file system. DBMS provide security and authorization subsystem, which the DBA uses to create accounts and to specify account restrictions. (viii) Support for multiple views of Data: Database approach support multiple views of data. A database has many users each of whom may require a different view of the database. View may be a subset of 15 Visit & Downloaded from : www.LearnEngineering.in Visit & Downloaded from : www.LearnEngineering.in
database, or virtual data derived from the database that is not explicitly stored. A DBMS provides multiple views of the data; in file systems, a separate application program must be written for each view of the data.

(ix) Enforcing Standards:
Since the DBMS is a central system, standards can easily be enforced, whether at the company, department, national or international level. Standardized data is very helpful during migration or interchange of data. A file system is a set of independent applications, so standards cannot easily be enforced across them.

1B. DBMS AND DATABASE USERS
1B. Write about the pros and cons of DBMS. Write a note on the various categories of database users. Explain the role and functions of the DBA. (6) Apr / May 2008

Advantages and Disadvantages of DBMS:
Advantages of DBMS:
- Controlled redundancy
- Restricting unauthorized access
- Persistent storage for program objects and data structures
- Multiple user interfaces
- Representation of complex relationships among data
- Enforcement of integrity constraints and security
- Backup and recovery

Disadvantages of DBMS: overhead cost of using a DBMS, due to
- High initial investment in hardware and training
- The generality that a DBMS provides for defining and processing data
- Overhead for security, concurrency control, recovery and integrity functions

Database Users:
Database Administrator (DBA): administers the primary resource (the database) and the secondary resources (the DBMS and related software).
- Schema definition: the DBA creates the original database schema by executing a set of data definition statements in the DDL
- Storage structure and access method definition
- Schema and physical organization modification
- Granting of authorization for data access
- Coordinating, monitoring and acquiring hardware and software resources
- Monitoring security problems
- Routine maintenance: backing up the database periodically, checking disk space availability, monitoring jobs and ensuring performance

Database Designers: identify the data and choose the structures to represent and store the data.

End Users: people who access the database for querying, updating and generating reports.
- Casual end users: write application programs, choose tools (4GLs) to develop interfaces, and use special programming languages.
- Parametric or naive end users: unsophisticated users who use previously written application programs; they constantly query and update the database using standard types of queries and updates, known as canned transactions.
- Sophisticated end users: interact without writing programs; they form requests in a database query language and submit each query to the query
processor, which breaks the DML statements down into instructions that the storage manager understands. OLAP and data mining tools are used.
- Specialised users: sophisticated users who write specialized database applications, e.g. CAD and expert systems.
- Stand-alone end users: maintain personal databases by using ready-made packages that provide easy-to-use menu- and graphics-based interfaces.

System Analysts: determine the requirements of end users and develop specifications.
Application Programmers: implement the specifications as programs.

DATABASE SYSTEM STRUCTURE
Illustrate with a neat diagram the database system architecture and its components. (16) Nov / Dec 2015
Briefly explain the database system architecture. (16) May / June 2016
Define Database, DBMS and DBS.

A database system is partitioned into modules:
(i) Storage Manager
(ii) Query Processor

(i) Storage Manager:
A database requires a large amount of storage space (disk). The database system structures the data so as to minimize the need to move data between disk and main memory. The storage manager provides an interface between the low-level data stored in the database and the application programs and queries submitted to the system. It:
- Interacts with the file manager
- Translates DML statements into low-level file system commands
- Stores, retrieves and updates the database
Components of the storage manager:
- Authorization and integrity manager: tests for the satisfaction of integrity constraints and checks the authority of users to access data.
- Transaction manager: ensures that the database remains in a consistent state despite system failures, and that concurrent transaction executions proceed without conflict.
  Transaction management: a transaction is a collection of operations that performs a single logical function, or unit of work, in a database application. Properties of a transaction: Atomicity, Consistency, Isolation and Durability. The transaction management component ensures atomicity and durability.
  Failure recovery: detects system failures and restores the database to the state that existed prior to the occurrence of the failure.
  The concurrency control manager controls the interaction among concurrent transactions to ensure consistency.
- File manager: manages the allocation of space on disk and the data structures used to represent the information.
- Buffer manager: fetches data from disk into main memory and decides what to cache in main memory.
- Data structures maintained by the storage manager:
  - Data files: store the database itself
  - Data dictionary: metadata about the structure of the database
  - Indices: provide fast access to data items that hold particular values

(ii) Query Processor:
Simplifies and facilitates access to data, hides the physical details of implementation, and provides quick processing of updates and queries.
Components of the query processor:
- DDL interpreter: interprets DDL statements and records the definitions in the data dictionary.

Fig. 1.2 Database System Structure
- DML compiler: translates the DML statements in a query into a query evaluation plan consisting of low-level instructions that the query evaluation engine can understand and evaluate. A query can be translated into a number of evaluation plans; the DML compiler also performs query optimization, picking the lowest-cost evaluation plan among the alternatives.
- Query evaluation engine: executes the low-level instructions generated by the DML compiler.

3A. DATA MODELS
3A. Explain the different types of data models. Nov / Dec 2014

Definitions:
a) Data Model: a data model is a collection of concepts that can be used to describe the structure of a database (i.e., the data, relationships, data types and constraints). A data model represents the logical structure of a database and determines how data can be stored, organized and manipulated.
b) Schema: the complete definition and description of the database is known as the database schema. Each object in the schema is known as a schema construct. The schema is also known as the intension.
c) Database State: the data in the database at a particular moment in time is called a database state or snapshot. It is also known as the extension of the database schema.
The DBMS stores the description of the schema constructs and constraints (the metadata) in the DBMS catalog.
Categories of data models:
- High-level or conceptual models: close to the way users perceive data.
- Low-level or physical models: describe how data is stored in the computer.
- Representational or implementational models: use concepts understood by users yet not too far from the way data is organized, e.g. the network and hierarchical models.

Types of data models:
1. Record-based models: relational model, network model, hierarchical model
2. Entity-Relationship model
3. Object-oriented model

Record-Based Models:
A record-based data model is used to specify the overall logical structure of the database. Each record type defines a fixed number of fields, each of a fixed length.

Relational Model:
In the relational model, data is organized in two-dimensional tables called relations.
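The table-per-relation idea of the relational model is exactly what SQL DDL expresses. A minimal sketch (the STUDENT table and its columns are illustrative, not part of the syllabus example):

    CREATE TABLE STUDENT (
        Regno CHAR(10) PRIMARY KEY,  -- key attribute
        Name  VARCHAR(50),           -- simple attribute
        Age   INT                    -- atomic value drawn from the Age domain
    );

Each row inserted into STUDENT is a tuple, each column is an attribute, and the declared column types play the role of domains.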
Network Model:
In the network model, the entities are organized in a graph, in which some entities can be accessed through several paths.

Hierarchical Model:
The hierarchical data model organizes data in a tree structure.

Entity-Relationship Model:
The ER model maps the real world onto a conceptual schema. It consists of a collection of basic objects called entities and of relationships among these objects. It represents the overall logical structure of a database. It is a semantic model.

Object-Oriented Model:
The object-oriented model is a logical organization of real-world objects (entities), the constraints on them, and the relationships among the objects.
3B. DBMS ARCHITECTURE
3B. Explain in detail the levels of abstraction in a DBMS. June 16 (or) Explain the DBMS architecture or three-schema architecture.

Three-Schema Architecture:
- Helps achieve the database characteristics.
- Separates the user applications from the physical database.

Schemas can be defined at three levels:
- Internal level: has an internal schema, which describes the physical storage structure of the database.
  - How the data are actually stored
  - Uses a physical data model
  - Describes the complete details of data storage and the access paths for the database
- Conceptual level: has a conceptual schema, which describes the structure of the whole database.
  - What data are stored and what relationships exist among the data
  - Uses a high-level or implementational data model
  - Hides the details of the physical storage structures and describes data types, relationships, operations and constraints
- External or view level: includes a number of external schemas or views.
  - Each external schema describes one part of the database and hides the rest
  - Uses a high-level or implementational data model
Each user refers only to its own schema. The DBMS must transform a request specified on an external schema into a request against the conceptual schema, and then into a request on the internal schema, for processing over the stored database. The processes of transforming requests and results between levels are called mappings.

Fig 1.3 DBMS Architecture

Data Independence: the capacity to change the schema at one level of a database system without having to change the schema at the next higher level.
- Logical data independence: the capacity to change the conceptual schema without having to change the external schemas.
- Physical data independence: the capacity to change the physical (internal) schema without having to change the conceptual schema.

Database Application Architectures:
Client: the machine on which remote database users work.
Server: the machine on which the database system runs.
Two-Tier Architecture: the application is partitioned into a component that resides at the client machine, which invokes database system functionality at the server machine through query language statements. APIs such as ODBC and JDBC are used for the interaction.

Three-Tier Architecture: the client merely acts as a front end and does not contain any direct database calls. The client communicates with an application server, usually through a forms interface, and the application server (which holds the business logic) communicates with the database system.

Fig 1.4 Two-Tier and Three-Tier Architecture

4A. ER MODEL - ENTITY-RELATIONSHIP MODEL
4A. Explain the concepts of the ER model in detail.

The ER model maps the real world onto a conceptual schema. It consists of a collection of basic objects called entities and of relationships among these objects. It represents the overall logical structure of a database. It is a semantic model.
Definitions of key terms:
1. Entity: an object with a physical or conceptual existence, i.e., a thing with an independent existence. An entity has a set of properties, and the values of some properties may uniquely identify it. E.g. a company, a job, a table.
2. Attributes: the particular properties that describe an entity, e.g. for Employee: Empno, Name, Salary, etc. Each entity has its own value for each attribute.

Types of attributes:
(i) Simple vs. composite: attributes that are not divisible are called simple or atomic attributes, e.g. Age. Composite attributes can be divided into smaller subparts that represent more basic attributes with independent meanings, e.g. Address (door number, street, city, state, pin).
(ii) Single-valued vs. multivalued: an attribute that has a single value for a particular entity is a single-valued attribute, e.g. Age. An attribute that has several values for the same entity is a multivalued attribute, e.g. the degrees of a person.
(iii) Stored vs. derived: an attribute that can be derived from another attribute is a derived attribute, e.g. Age can be derived from DOB and the current date; DOB is the stored attribute.
(iv) Complex: an attribute that is both multivalued and composite, e.g. Address (permanent and residential).
(v) Key attribute: an attribute of an entity type whose values are distinct for each individual entity in the collection; its values can be used to identify each entity uniquely, e.g. the Register number attribute of a Student entity.
(vi) Null attribute: an attribute takes a null value when an entity does not have a value for it (unknown, missing, or no value).

3. Value: each entity has a value for each of its attributes.
4. Entity type: defines a collection of entities that have the same attributes. It is described by its name and its attributes, and it describes the schema (intension).
5. Entity set: the collection of all entities of a particular entity type in the database at any point in time (extension).
6. Domain: for each attribute of an entity type there is a permitted set of values, called the domain or value set of that attribute, e.g. Age (16-25). Formally, an attribute A of entity type E with value set V is a function A: E -> P(V).
7. Weak entity vs. strong entity: an entity type that has no key of its own is a weak entity type; entities belonging to a weak entity type are identified by being related to specific entities of another entity type, called the identifying or owner entity type. An entity type that has a key of its own is a strong entity type.
8. Relationship: a connection among entities.
9. Relationship type (R): defines a set of associations among entity types.
10. Relationship set: the set or collection of relationship instances ri among the entities of the entity types involved in the relationship type. R is a mathematical relation on E1, E2, ..., En, i.e., a subset of the Cartesian product E1 x E2 x ... x En. Each entity type E is said to participate in the relationship type R, and each individual entity e is said to participate in a relationship instance ri.
11. Degree of a relationship: the number of participating entity types.
- Binary, e.g. WORKS_FOR (Employee - works for - Company)
- Ternary, e.g. SUPPLY (Supplier - supply - Customer, Product)
12. Role names: the role that a participating entity of an entity type plays in a relationship instance; it helps to explain what the relationship means.
The same entity can participate in a relationship type in different roles (a recursive relationship). The role names distinguish the meaning of each participation when the same entity type participates more than once in different roles.
13. Relationship attribute or descriptive attribute: an attribute of a relationship, e.g. JoinDate on Employee - works for - Company.
14. Constraints - cardinalities: the cardinality is the number of relationship instances that an entity can participate in. For a binary relationship between A and B:
1. One to one: an entity in A is associated with at most one entity in B, and an entity in B is associated with at most one entity in A. E.g. Student - has - IDCard.
2. One to many: an entity in A is associated with many entities in B, while an entity in B is associated with at most one entity in A. E.g. Classroom - accommodates - Student.
3. Many to one: an entity in A is associated with at most one entity in B, while an entity in B is associated with many entities in A. E.g. Worker - works for - Department.
4. Many to many: an entity in A is associated with many entities in B, and an entity in B is associated with many entities in A.
15. Participation constraint: specifies whether the existence of an entity depends on its being related to another entity via the relationship type.
- Total: the participation of an entity set E in R is total if every entity in E participates in at least one relationship in R. E.g. a weak entity type must have total participation in its identifying relationship. A weak entity type has a partial key, a set of attributes that uniquely identifies the weak entities related to a given owner entity.
- Partial: if only some entities in E participate in R, the participation of E in R is said to be partial.

4B. ER DIAGRAM - NOTATIONS AND GUIDELINES
4B. Explain the notations and design guidelines in ER diagrams with an example. Also explain extended ER features.

ER Diagram:
An ER diagram depicts the full representation of the conceptual model and expresses the overall structure of a database graphically. ER diagrams are simple and clear.

Components (the notations are graphical symbols):
- Entity type or set: rectangle
- Weak entity type or set: double rectangle
- Relationship type or set: diamond labelled R
- Identifying relationship: double diamond labelled R
- Generalization or specialization: triangle labelled IS A
- Attribute: oval, e.g. Name
- Key attribute: oval with the name underlined, e.g. Regno
- Multivalued attribute: double oval
- Composite attribute: oval connected to component ovals
- Derived attribute: dashed oval
- Cardinality ratios on a relationship R: written as 1:1, 1:m, m:1 or m:n on the lines connecting the entity sets to R
- Total participation of an entity set in R: double line between E and R
- Cardinality limits: written as l..h on the line between E and R
Naming Conventions:
Design Guideline 1 - Proper naming of schema constructs:
- Entity type: singular noun in uppercase letters
- Relationship type: verb in uppercase letters
- Attribute: capitalized
- Role name: lowercase letters

Design Guideline 2 - Remove redundancy:
It is important to have the least possible redundancy when we design the conceptual schema of a database. If some redundancy is desired at the storage level or at the user level, it may be introduced later.
Design Guideline 3: a concept may first be modelled as an attribute and then refined into a relationship when it is determined that the attribute is a reference to another entity type.
Design Guideline 4: an attribute that exists in several entity types may be refined into its own independent entity type.
Design Guideline 5: the inverse refinement to the previous case may also be applied.

Design Phases:
The ER model gives us much flexibility in designing a database schema to model a given enterprise. A high-level data model serves the database designer by providing a conceptual framework in which to specify, in a systematic fashion, what the data requirements of the database users are and how the database will be structured to fulfill these requirements. The initial phase of database design is to characterize fully the needs of the prospective database users. Next, the designer chooses a data model and translates these requirements into a conceptual schema of the database, which provides a detailed view. A fully developed conceptual schema will also indicate the functional requirements of the enterprise: in a specification of functional requirements, users describe the kinds of operations that will be performed on the data. The process of moving from an abstract data model to the implementation of the database then proceeds in two final design phases. In the logical design phase, the designer maps the high-level conceptual schema onto the implementation data model of the database system. The designer uses the resulting system-specific database schema in the physical design phase, in which the physical features of the database are specified.
Extended ER Features:
Generalization: the refinement from initial entity subgroupings into successively higher levels (bottom-up). A number of entity sets share common features, and the commonality is expressed by the generalization that exists between the lower-level entity sets and the higher-level entity set (superclass).

Constraints: disjoint and overlapping.
- Disjoint: an entity belongs to no more than one lower-level entity set, e.g. Account -> Savings or Checking.
- Overlapping: the same entity may belong to more than one lower-level entity set within a single generalization.
- Total generalization or specialization: each higher-level entity must belong to a lower-level entity set.
- Partial generalization or specialization: some higher-level entities may not belong to any lower-level entity set.

Specialization: an entity set may include subgroupings of entities that are distinct in some way from the other entities in the set. The process of designating such subgroupings within an entity set (top-down) is called specialization.

INTRODUCTION TO RELATIONAL MODEL
The relational model was introduced by Ted Codd. It defines a database as a collection of relations. A relation is a table of values, and each row in the table represents a collection of related data values.
- Relation R is a subset of D1 x D2 x ... x Dn
- Table = relation; row = tuple; column header = attribute
- A relational schema R(A1, A2, ..., An) is made up of a relation name R and a list of attributes A1, ..., An
- Each attribute Ai is the name of a role played by some domain D in R; D is called the domain of Ai, denoted dom(Ai). The domain of an attribute is a set of atomic values
- R is the name of the table or relation
- The degree of a relation is the number of attributes (n) of its relational schema.
- A relation state r of the relational schema R(A1, ..., An), denoted r(R), is a set of n-tuples r = {t1, t2, ..., tm}, where r(R) is a subset of dom(A1) x dom(A2) x ... x dom(An).
- Each n-tuple t is an ordered list of n values t = <v1, v2, ..., vn>, where each vi is an element of dom(Ai), for 1 <= i <= n.
- Cardinality: the number of values in a domain D, i.e., |D|.

Characteristics of relations:
1) Ordering of tuples in a relation: a relation is a set of tuples, so the tuples in a relation do not have any particular order. (A tuple variable is a variable that stands for a tuple.)
2) Ordering of values within a tuple: under the definition above, an n-tuple is an ordered list of n values, so the values within a tuple (and the attributes in the schema) are ordered. In an alternative definition, a relational schema R = {A1, A2, ..., An} is a set of attributes, and a relation state r(R) is a finite set of mappings r = {t1, t2, ..., tm}, where each ti is a mapping from R to D (D being the union of the attribute domains) with t[Ai] in dom(Ai); under this definition the ordering of values within a tuple does not matter.
3) Values in tuples: each value in a tuple is atomic; composite and multivalued attribute values are not allowed.
4) Interpretation of a relation: a relation can be interpreted in predicate calculus; the tuples are interpreted as the values that satisfy the predicate.

Note: for a relational schema R(A1, A2, ..., An) of degree n with relation r = {t1, t2, ..., tm} and a tuple t = <v1, v2, ..., vn> in r(R), the notation t[Ai] or t.Ai denotes the value vi in t for attribute Ai.
Relational database schema: the logical design of a database.
Database instance: a snapshot of the data in the database at a given instant of time.
A relational schema corresponds to the programming-language notion of a type definition; a relational instance corresponds to the programming-language notion of the value of a variable. The value of a given variable may change with time; similarly, the contents of a relation instance may change with time as the relation is updated.

Schema diagram: a schema diagram can depict a database schema along with its primary key and foreign key dependencies. Such diagrams display referential integrity constraints by drawing a directed arc from each foreign key to the relation it references (or to the primary key of the referenced relation). For example:

DEPARTMENT (parent table): Dno, Dname, Location, DSsn - Dno is the primary key
EMPLOYEE (child table): Ssn, Name, Address, Dno - Dno is the foreign key referencing DEPARTMENT.Dno

Relational constraints: restrictions on the data that can be specified on a relational database schema in the form of constraints.
1) Domain constraints: the value of each attribute A must be an atomic value from the domain dom(A).
2) Key constraints: no two tuples can have the same combination of values for all their attributes.
- Super key: a subset of attributes SK of R such that for any two distinct tuples t1, t2 in r(R), t1[SK] != t2[SK]; that is, no two distinct tuples have the same values for all the attributes in SK.
- Candidate key: a minimal super key, i.e., removing any attribute A from K leaves a set of attributes K' that is no longer a super key of R.
- Primary key: a candidate key whose values are used to identify tuples uniquely in the relation.
3) Entity integrity constraint: a primary key value cannot be null.
4) Referential integrity constraint: specified on two relations and used to maintain consistency among the tuples of the two relations. It states that a tuple in one relation that refers to another relation must refer to an existing tuple in that relation. The referencing attribute of the referencing relation is known as the foreign key.
Conditions:
- The foreign key FK of R1 must have the same domain as the primary key attribute PK of R2; FK is said to reference the relation R2.
- A tuple t1 of R1 references a tuple t2 of R2 if t1[FK] = t2[PK].
- R1 is called the referencing relation and R2 the referenced relation; R1 may be known as the child table and R2 as the parent table.
- The value of FK in a tuple t1 of R1 either occurs as a value of PK in some tuple of R2 or is null. A foreign key can also refer to its own relation.
- A child table can be created only after the creation of its parent table.
- The foreign key can take only values that appear in the primary key attribute of the parent table.
- A tuple that is referred to by a child table cannot be deleted from the parent table; otherwise it can be deleted. (First delete the referencing tuples, i.e., the child records, and then the parent record.)

Note: the ON DELETE CASCADE option can be used to delete the child rows automatically while deleting the parent. It is not possible to delete or drop the parent table while it is being referenced by a child table.

Mapping ER and EER Schemas into the Relational Model - steps of the algorithm:
STEP 1: Map entity types - regular entities map directly to tables.
STEP 2: Map weak entity types - draw the identifier from the owner entity type into the weak entity type and map directly to a table.
STEPS 3-5: Map relationship types:
- 1:1 - choose one of the relations (the one with total participation, if it exists) and insert into it the primary key of the other relation as a foreign key.
- 1:N - the many side takes a foreign key to the one side; no new relation.
- M:N - set up a separate relation for the relationship, with the primary key drawn from each participating relation.
STEP 6: Map multivalued attributes - set up a new relation for each multivalued attribute, using the owner's primary key plus the value of the multivalued attribute.
STEP 7: Map higher-order relationships (ternary, 4-way, etc.) - each higher-order relationship becomes a separate relation.
STEP 8: Map generalization hierarchies and set-subset relationships - use the primary key of the superclass plus the attributes of the subclass.
STEP 9: Map union types - form a surrogate key.
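The referential integrity rules above map directly onto SQL DDL. A minimal sketch, assuming illustrative DEPARTMENT (parent) and EMPLOYEE (child) tables:

    CREATE TABLE DEPARTMENT (
        Dno      INT PRIMARY KEY,     -- primary key of the parent table
        Dname    VARCHAR(30),
        Location VARCHAR(30)
    );

    CREATE TABLE EMPLOYEE (
        Ssn  CHAR(9) PRIMARY KEY,
        Name VARCHAR(50),
        Dno  INT,                     -- foreign key; may be null
        FOREIGN KEY (Dno) REFERENCES DEPARTMENT (Dno)
            ON DELETE CASCADE         -- deleting a parent row removes its child rows
    );

The child table is created after the parent, EMPLOYEE.Dno can take only values present in DEPARTMENT.Dno (or null), and ON DELETE CASCADE implements the forced-delete behaviour described in the note above.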
Relational Schema - Company Schema

LOGICAL DATABASE DESIGN
Relational Database Design: each relational database schema consists of a number of relational schemas, and each relational schema consists of a number of attributes. Attributes are grouped together to form a relational schema, e.g. by mapping a conceptual schema in the ER model into a relational schema. We need a formal measure of why one grouping of attributes into a relational schema is better than another.

There are two levels at which the goodness of relational schemas can be judged:
- Logical or conceptual level: how users interpret the schemas - whether they can clearly understand the relations and the meanings of the data (views and relations).
- Implementation and storage level: how the tuples in the relations are stored.

Approaches:
(1) Bottom-up (design by synthesis): consider the basic relationships among attributes and build up the relations; a large number of binary attribute relationships need to be found.
(2) Top-down (design by analysis):
start from a number of groupings of attributes into relations obtained from the conceptual design and mapping activities, and analyze them.

Informal design guidelines for relational schemas:
1. Clear semantics of attributes
2. Reducing the redundant values in tuples
3. Reducing the null values in tuples
4. Disallowing the generation of spurious tuples

(1) Semantics of the attributes: the semantics specify how to interpret the attribute values stored in a tuple, i.e., how the attribute values in a tuple are related to one another.
Guideline: design each relational schema so that it is easy to explain its meaning; keep to a single entity type or relationship type per relation, and do not combine attributes from multiple entity types and relationship types.

(2) Redundant information: redundancy leads to update anomalies (insertion, deletion and modification anomalies) and wastes storage space.
Guideline: design the schemas so that no anomalies arise.

(3) Reducing null values: if many of the attributes do not apply to all tuples, many null values will exist. Disadvantages of null values:
- Storage space is wasted
- The meaning of the attribute becomes hard to understand
- Complications in specifying join operations
- Complications in accounting for aggregate functions
Null values are used when an attribute does not apply to the tuple, when the attribute value is unknown, or when the value is known but not yet recorded.
Guideline: avoid placing in a relation attributes whose values will frequently be null.
(4) Spurious tuples: additional tuples in a natural join that represent wrong information and are not valid are called spurious tuples.
Guideline: design relational schemas so that they can be joined with equality conditions on attributes that are either primary keys or foreign keys; this guarantees that no spurious tuples are generated.

FUNCTIONAL DEPENDENCIES
Explain in detail about functional dependencies.

Functional Dependencies:
A functional dependency is a constraint between two sets of attributes of the database. Let R be the relational schema R = {A1, A2, ..., An}. A functional dependency, denoted by X -> Y, between two sets of attributes X and Y that are subsets of R specifies a constraint on the possible tuples that can form a relation state r of R.
X -> Y is read: X functionally determines Y, or there is a functional dependency from X to Y, or Y is functionally dependent on X. X is the left-hand side (L.H.S.) of the FD and Y is the right-hand side (R.H.S.).

Definition 1: X functionally determines Y in a relational schema R if and only if, whenever two tuples of r(R) agree on their X value, they must necessarily agree on their Y value. An FD is a property of the semantics or meaning of the attributes.
Definition 2: for any two tuples t1 and t2 in r, if t1[X] = t2[X], we must have t1[Y] = t2[Y]; i.e., the values of the Y component of a tuple in r depend on, and are determined by, the values of the X component.
Examples:
SSN -> ENAME, i.e., SSN uniquely determines ENAME.
PNUMBER -> {PNAME, PLOCATION}
{SSN, PNUMBER} -> HOURS

Relation states r(R) that satisfy the functional dependency constraints are called legal extensions or legal relation states.

Diagrammatic notation: an arrow is drawn from the L.H.S. to the R.H.S. of an FD, e.g.
SSN -> ENAME
DNO -> {DNAME, MGRSSN}

Inference rules: let F be the set of FDs specified on R. Numerous other functional dependencies also hold in all relation states that satisfy F. The set of all such dependencies inferred from F is called the closure of F, denoted F+. For example, from
F = {SSN -> {ENAME, BDATE, ADDRESS, DNO}, DNO -> {DNAME, MGRSSN}}
we can infer FDs in F+ such as SSN -> {DNAME, MGRSSN} and SSN -> ENAME.
An FD X' -> Y' is inferred from a set of dependencies F specified on R if X' -> Y' holds in every relation state r(R) that satisfies F; we write F |= X' -> Y'.

Armstrong's inference rules:
IR1 (Reflexivity): if X contains Y, then X -> Y
IR2 (Augmentation): {X -> Y} |= XZ -> YZ
IR3 (Transitivity): {X -> Y, Y -> Z} |= X -> Z
IR4 (Decomposition): {X -> YZ} |= X -> Y
IR5 (Union): {X -> Y, X -> Z} |= X -> YZ
IR6 (Pseudotransitivity): {X -> Y, WY -> Z} |= WX -> Z

Closure of X under F: for each set of attributes X appearing in the L.H.S. of an FD in F, determine the set of all attributes functionally dependent on X. This set X+, determined from F, is called the closure of X under F.

Equivalence of sets of FDs: a set of FDs E is covered by a set of FDs F (F covers E) if every FD in E can be inferred from F (i.e., is in F+). Two sets of FDs E and F are equivalent if E+ = F+, i.e., every FD in E can be inferred from F and vice versa (E covers F and F covers E).

Minimal set of functional dependencies: a set of FDs F is minimal if
- every dependency in F has a single attribute on its R.H.S.;
- we cannot replace any FD X -> A in F with an FD Y -> A, where Y is a proper subset of X, and still have a set of dependencies equivalent to F;
- we cannot remove any dependency from F and still have a set of dependencies equivalent to F.
It is a standard or canonical form with no redundancies.

Minimal cover: a minimal cover of a set of FDs F is a minimal set of dependencies Fmin that is equivalent to F.

Trivial and nontrivial functional dependencies: a functional dependency X -> Y is trivial if Y is a subset of X; otherwise it is nontrivial.
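A short worked example of a closure computation (the dependencies chosen here are illustrative): let F = {SSN -> ENAME, SSN -> DNO, DNO -> {DNAME, MGRSSN}}. To compute {SSN}+, start with {SSN}; SSN -> ENAME and SSN -> DNO add ENAME and DNO; DNO -> {DNAME, MGRSSN} then adds DNAME and MGRSSN. So {SSN}+ = {SSN, ENAME, DNO, DNAME, MGRSSN}, which also shows, by IR3, that SSN -> DNAME and SSN -> MGRSSN are in F+.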
Part - C
6. NORMALIZATION
What are normal forms? Explain the types of normal forms with an example. (16) Nov / Dec 2014
State the need for normalization of a database and explain the various normal forms (1NF, 2NF, 3NF, BCNF, 4NF, 5NF and Domain-Key) with suitable examples. (16) Apr / May 2015

Normal Forms:
- Normalization was proposed by Codd (1972).
- It takes a relational schema through a series of tests to certify whether it satisfies a certain normal form.
- It proceeds in a top-down fashion (design by analysis).
- Codd defined first, second and third normal forms, and a stronger definition of 3NF known as BCNF (Boyce-Codd Normal Form).
- All these normal forms are based on the functional dependencies among the attributes of a relation.

Normalization: the process of analyzing the given relational schemas, based on their FDs and primary keys, to achieve the desirable properties of minimizing redundancy and minimizing insertion, deletion and update anomalies. Relational schemas that do not satisfy the normal form tests are decomposed into smaller relational schemas that meet the tests and possess the desirable properties.

The normalization procedure provides:
- a framework for analyzing relational schemas based on their keys and FDs;
- a series of normal form tests carried out on each relational schema, so that the database is normalized to the desired degree.
There are predefined rules for normalization, called normal forms. The normal form of a relation is the highest normal form condition that it meets, and hence indicates the degree to which it has been normalized.

Normalization through decomposition must meet two properties:
1) Lossless join (non-additive join) property: the spurious tuple problem does not occur with respect to the schemas after decomposition.
2) Dependency preservation property: each FD is represented in some individual relation after decomposition.

The process of storing the join of higher normal form relations as a base relation in a lower normal form is known as denormalization.

First Normal Form:
Statement of first normal form: the domain of an attribute must include only atomic (simple, indivisible) values, and the value of any attribute in a tuple must be a single value from the domain of that attribute. 1NF disallows a set of values or a tuple of values as an attribute value, i.e., it disallows a relation within a relation, or relations as attribute values (and therefore disallows composite and multivalued attributes).

E.g. the DEPARTMENT relational schema, whose primary key is DNUMBER, considering the DLOCATIONS attribute: each department can have a number of locations.

DEPARTMENT (DNAME, DNUMBER, DMGRSSN, DLOCATIONS) - NOT IN 1NF
1) If the domain of DLOCATIONS contains atomic values but some tuples need a set of these values, a tuple is required for each location; DLOCATIONS is then not functionally dependent on DNUMBER:

DNAME     DNUMBER  DMGRSSN  DLOCATIONS
Research  5        334      Bellaire
Research  5        334      Alaska
Research  5        334      Newyork

2) If the domain of DLOCATIONS holds sets of values, it is non-atomic, although DNUMBER -> DLOCATIONS then exists:

DNAME     DNUMBER  DMGRSSN  DLOCATIONS
Research  5        334      {Bellaire, Alaska, Newyork}

Normalization to achieve 1NF - three ways:
1. Remove the attribute DLOCATIONS and place it in a separate relation along with the primary key DNUMBER of the department:
   DEPARTMENT (DNAME, DNUMBER, DMGRSSN)
   DEPT_LOCATIONS (DNUMBER, DLOCATIONS)
2. Expand the key so that there is a separate tuple for each location of a department; this introduces redundancy in the relation.
3. If the maximum number of values for DLOCATIONS is known, e.g. 3, replace DLOCATIONS by three atomic attributes LOC1, LOC2, LOC3; this introduces more null values in the relation.

1NF also disallows multivalued attributes that are themselves composite; these are called nested relations.
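The first option corresponds to the following SQL DDL (a sketch; the column types are assumed):

    CREATE TABLE DEPARTMENT (
        DNUMBER INT PRIMARY KEY,
        DNAME   VARCHAR(30),
        DMGRSSN CHAR(9)
    );

    -- one row per (department, location) pair; DLOCATION joins the key
    CREATE TABLE DEPT_LOCATIONS (
        DNUMBER   INT REFERENCES DEPARTMENT (DNUMBER),
        DLOCATION VARCHAR(30),
        PRIMARY KEY (DNUMBER, DLOCATION)
    );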
Test for first normal form: a relation should not have non-atomic attributes or nested relations.
Remedy: form a new relation for each non-atomic attribute or nested relation.

Second Normal Form:
2NF is based on full functional dependency.
Full functional dependency: an FD X -> Y is a full functional dependency if the removal of any attribute A from X means the dependency no longer holds, i.e., for any A in X, (X - {A}) does not functionally determine Y. E.g. {SSN, PNUMBER} -> HOURS.
Partial functional dependency: an FD X -> Y is a partial dependency if some attribute A in X can be removed from X and the dependency still holds, i.e., (X - {A}) -> Y. E.g. {SSN, PNUMBER} -> ENAME.

Statement of second normal form: a relational schema R is in 2NF if every nonprime attribute A is fully functionally dependent on the primary key of R. If a relational schema R is not in 2NF, it can be normalized into a number of 2NF relations in which each nonprime attribute is associated only with the part of the primary key on which it is fully functionally dependent.

E.g. the relation (SSN, PNUMBER, HOURS, ENAME, PNAME, PLOCATION) is NOT IN 2NF, with
FD1: {SSN, PNUMBER} -> HOURS
FD2: SSN -> ENAME
FD3: PNUMBER -> {PNAME, PLOCATION}
ENAME, PNAME and PLOCATION violate 2NF: FD2 and FD3 are partial dependencies on the primary key.

Normalization (decomposition) to achieve 2NF:
FD1: (SSN, PNUMBER, HOURS)
FD2: (SSN, ENAME)
FD3: (PNUMBER, PNAME, PLOCATION)

Test for second normal form: for relations where the primary key contains multiple attributes, no non-key (nonprime) attribute should be functionally dependent on only a part of the primary key.
Remedy: decompose; set up a new relation for each partial key together with its dependent attributes, and make sure to keep a relation with the original primary key and any attributes that are fully functionally dependent on it. A sketch of the decomposition in SQL follows.
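In SQL, the 2NF decomposition above could be declared as follows (a sketch; the table names and column types are assumed):

    CREATE TABLE EMP (
        SSN   CHAR(9) PRIMARY KEY,
        ENAME VARCHAR(50)            -- fully dependent on SSN alone (FD2)
    );

    CREATE TABLE PROJECT (
        PNUMBER   INT PRIMARY KEY,
        PNAME     VARCHAR(30),       -- fully dependent on PNUMBER alone (FD3)
        PLOCATION VARCHAR(30)
    );

    CREATE TABLE WORKS_ON (
        SSN     CHAR(9) REFERENCES EMP (SSN),
        PNUMBER INT REFERENCES PROJECT (PNUMBER),
        HOURS   DECIMAL(4,1),        -- fully dependent on the whole key (FD1)
        PRIMARY KEY (SSN, PNUMBER)
    );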
Third Normal Form:
3NF is based on transitive dependency.
Transitive dependency: a functional dependency X -> Y in R is a transitive dependency if there is a set of attributes Z (that is neither a candidate key nor a subset of any key of R) such that both X -> Z and Z -> Y hold.

Statement of third normal form (Codd's definition): a relational schema R is in 3NF if and only if it satisfies 2NF and every nonprime attribute is non-transitively dependent on the primary key.

General definition of 3NF: a relational schema R is in 3NF if, whenever a nontrivial functional dependency X -> A holds in R, either
a) X is a superkey of R, or
b) A is a prime attribute of R.
Equivalently, a table is in 3NF if and only if, for each of its functional dependencies X -> A, at least one of the following conditions holds:
- X contains A (i.e., X -> A is a trivial functional dependency), or
- X is a superkey, or
- A is a prime attribute (i.e., A is contained within a candidate key).

E.g. EMPLOYEE (ENAME, SSN, BDATE, ADDRESS, DNO, DNAME, DMGRSSN) is not in 3NF: SSN -> DMGRSSN is transitive through DNO. It decomposes into
(ENAME, SSN, BDATE, ADDRESS, DNO) and (DNO, DNAME, DMGRSSN).

Test for third normal form: a relation should not have a non-key attribute functionally determined by another non-key attribute, i.e., there should be no transitive dependency of a non-key attribute on the primary key.
Remedy: decompose; set up a relation that includes the non-key attributes that functionally determine the other non-key attributes.
Boyce-Codd Normal Form (BCNF):
BCNF is stricter than 3NF: every relation in BCNF is also in 3NF. A relation R is in BCNF if, whenever a nontrivial dependency X -> A holds in R, X is a superkey of R. The condition of 3NF that allows A to be prime is absent from BCNF. Equivalently, a relation is in BCNF if and only if every determinant is a candidate key. To normalize:
- Identify all candidate keys of the relation.
- Identify all functional dependencies in the relation.
- If there are functional dependencies in the relation whose determinants are not candidate keys, remove those functional dependencies by placing them in a new relation along with a copy of their determinant.

Difference between 3NF and BCNF: for a functional dependency A -> B, 3NF allows this dependency to remain in a relation if B is a prime attribute and A is not a candidate key, whereas BCNF insists that for this dependency to remain in a relation, A must be a candidate key.
Note: a nonprime attribute of R is an attribute that does not belong to any candidate key of R.

MULTIVALUED DEPENDENCY:
A multivalued dependency is a full constraint between two sets of attributes in a relation; it is therefore also referred to as a tuple-generating dependency. Let R be a relation schema and let alpha and beta be subsets of R. The multivalued dependency alpha ->> beta holds on R if, in any legal relation r(R), for all pairs of tuples t1 and t2 in r such that t1[alpha] = t2[alpha], there exist tuples t3 and t4 in r such that
t1[alpha] = t2[alpha] = t3[alpha] = t4[alpha]
t3[beta] = t1[beta]
t4[beta] = t2[beta]
t4[R - beta] = t1[R - beta]
t3[R - beta] = t2[R - beta]

A multivalued dependency X ->> Y on R says that if two tuples of R agree on all the attributes of X, then their components in Y may be swapped, and the two resulting tuples are also in the relation.

FOURTH NORMAL FORM:
Under fourth normal form, a record type should not contain two or more independent multivalued facts about an entity. Consider employees, skills and languages, where an employee may have several skills and several languages. Under fourth normal form, these two relationships should not be represented in a single record such as

| EMPLOYEE | SKILL | LANGUAGE |

Instead, they should be represented in the two records

| EMPLOYEE | SKILL |        | EMPLOYEE | LANGUAGE |

FIFTH NORMAL FORM:
Fifth normal form (5NF), also known as project-join normal form (PJNF), is a level of database normalization designed to reduce redundancy in relational databases recording multivalued facts, by isolating semantically related multiple
relationships. A table is said to be in 5NF if and only if every nontrivial join dependency in it is implied by the candidate keys. A join dependency *{A, B, ..., Z} on R is implied by the candidate key(s) of R if and only if each of A, B, ..., Z is a superkey for R.

E.g. the relation

| AGENT | COMPANY | PRODUCT |
|-------|---------|---------|
| Smith | Ford    | car     |
| Smith | GM      | truck   |

decomposes into

| AGENT | COMPANY |   | COMPANY | PRODUCT |   | AGENT | PRODUCT |
|-------|---------|   |---------|---------|   |-------|---------|
| Smith | Ford    |   | Ford    | car     |   | Smith | car     |
| Smith | GM      |   | Ford    | truck   |   | Smith | truck   |
| Jones | Ford    |   | GM      | car     |   | Jones | car     |

A record type is in fifth normal form when its information content cannot be reconstructed from several smaller record types.
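The lossless reconstruction that 5NF relies on can be checked in SQL by joining the three projections (the table and column names here are illustrative):

    SELECT ac.AGENT, cp.COMPANY, cp.PRODUCT
    FROM   AGENT_COMPANY   ac
    JOIN   COMPANY_PRODUCT cp ON cp.COMPANY = ac.COMPANY
    JOIN   AGENT_PRODUCT   ap ON ap.AGENT   = ac.AGENT
                             AND ap.PRODUCT = cp.PRODUCT;

If the join dependency holds, this query returns exactly the rows of the original AGENT/COMPANY/PRODUCT table and nothing more.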
RELATIONAL ALGEBRA
Explain in detail about relational algebra operators.

Relational algebra is a procedural query language consisting of a basic set of relational model operations. These operations enable the user to specify basic retrieval requests. Each operation takes one or more relations as input, and the result of a retrieval is itself a relation.

Relational algebra operations are divided into two groups: set operations and fundamental (relation-based) operations.

Relational algebra notations:
- Set operators (binary): Union (UNION), Intersection (INTERSECTION), Cartesian product (X), Set difference (MINUS)
- Unary relation-based operators: Selection (SELECT, sigma), Projection (PROJECT, pi), Renaming (RENAME, rho), Assignment (<-)
- Binary relation-based operators: Join (JOIN), Left outer join (LEFT OUTER JOIN), Group or aggregate (AGGREGATE), Right outer join (RIGHT OUTER JOIN),
Full outer join (FULL OUTER JOIN), Division (DIVIDE, %)

1. UNARY RELATIONAL ALGEBRA OPERATIONS:

SELECT Operation (sigma):
The SELECT operation is used to select the set of tuples from a relation that satisfy a selection condition. The sigma symbol represents the select operator. It is a unary operator because it operates on one relation: sigma<selection condition>(R). The selection condition is a Boolean expression on the attributes of relation R, built from clauses of the form
<attribute name> <comparison operator> <constant value>, or
<attribute name> <comparison operator> <attribute name>,
where the comparison operator is one of {=, <, <=, >, >=, !=}, and the Boolean connectives AND, OR and NOT can also be used.
- The selection condition is applied independently to each tuple in R; all the selected tuples appear in the result of the SELECT operation.
- A selection condition cannot involve more than one tuple.
- The degree of the relation resulting from SELECT is the same as that of R.
- The number of tuples in the resulting relation is always less than or equal to the number of tuples in R, i.e., |sigma c(R)| <= |R|.
- The fraction of tuples selected by a selection condition is referred to as the selectivity of the condition.
- A cascade of SELECT operations can be combined into a single SELECT using a conjunctive AND condition:
sigma<cond1>(sigma<cond2>(sigma<cond3>(R))) = sigma<cond1> AND <cond2> AND <cond3>(R)
- The SELECT operation is commutative: sigma<cond1>(sigma<cond2>(R)) = sigma<cond2>(sigma<cond1>(R))

E.g.
sigma DNO=4(EMPLOYEE)
sigma SALARY>30000(EMPLOYEE)
sigma (DNO=4 AND SALARY>25000) OR (DNO=5 AND SALARY>30000)(EMPLOYEE)

PROJECT Operation (pi):
The PROJECT operation is used to select certain columns from the table and discard the other columns. The pi symbol represents the project operator. It is a unary operator because it operates on one relation: pi<attribute list>(R), where the attribute list is a list of attributes of relation R.
- The result of the PROJECT operation has only the attributes specified in <attribute list>, in the same order as they appear in the list.
- The degree of the resulting relation equals the number of attributes in the attribute list.
- If the attribute list contains only non-key attributes, duplicate tuples may arise; the PROJECT operation removes any duplicates, so the result is a set of tuples. This is known as duplicate elimination.
- The number of tuples in the resulting relation is always less than or equal to the number of tuples in R.
- pi<list1>(pi<list2>(R)) = pi<list1>(R), provided <list2> contains the attributes in <list1>.
- The PROJECT operation is not commutative.
  • 65. Eg. LNAME, FNAME, SALARY(EMPLOYEE) (iii) RENAME Operation: () The rename operator is used to rename either the relation name or the attribute names or both. The  (rho ) operator denotes the rename operation. It is a Unary operator. S(B1,B2,…Bn) (R) or (ii) S(R) or (iii) (B1,B2,..Bn)(R) Renames both relation and its attributes Renames the relation only Renames the attributes only The attributes are renamed as the same order in R. FN,LN,SAL (FNAME, LNAME, SALARY(DNO= 5(EMPLOYEE))) 2. RELATIONAL ALGEBRA – SET OPERATORS: Set theoretic operations are binary operations that are applied in two sets of tuples. Union Compatibility: The two relations on which these operations are applied must have the same type of tuples. This condition is called as union compatibility. Two relations R(A1,,..An) and S(B1, ..Bn) are said to be union compatible if they have same degree n (same number of attributes )and if dom(Ai)=dom(Bi) ie., each corresponding pair of attributes have the same domain. 58 Visit & Downloaded from : www.LearnEngineering.in Visit & Downloaded from : www.LearnEngineering.in
Set operators (defined on union-compatible relations):
- Union (U)
- Intersection (n)
- Set difference (-)
- Cartesian product (x) (does not require union compatibility)

(i) Union: the result of the operation, denoted R U S, is a relation that includes all tuples that are either in R, or in S, or in both R and S. Duplicate tuples are eliminated.
(ii) Intersection: the result of the operation, denoted R n S, is a relation that includes all tuples that are in both R and S.
Both union and intersection are commutative operations, can be treated as n-ary operations applicable to any number of relations, and are associative:
R U S = S U R and R n S = S n R
R U (S U T) = (R U S) U T and R n (S n T) = (R n S) n T
(iii) Set difference: the result of the operation, denoted R - S, is a relation that includes all tuples that are in R but not in S. The minus operation is not commutative: R - S != S - R.

Example relations:

STUDENT
FN      LN
Suresh  Rao
Bala    Ganesh
Ravi    Sankar
Mohan   Varma
Sushil  Sharma
Rakesh  Agarwal

INSTRUCTOR
FN      LN
Suresh  Rao
Bala    Ganesh
Sushil  Sharma
Kishore Das
Ashwin  Kumar

STUDENT U INSTRUCTOR
FN      LN
Suresh  Rao
Bala    Ganesh
Ravi    Sankar
Mohan   Varma
Sushil  Sharma
Rakesh  Agarwal
Kishore Das
Ashwin  Kumar

STUDENT n INSTRUCTOR
FN      LN
Suresh  Rao
Bala    Ganesh
Sushil  Sharma
INSTRUCTOR - STUDENT        STUDENT - INSTRUCTOR
FN      LN                  FN      LN
Kishore Das                 Ravi    Sankar
Ashwin  Kumar               Mohan   Varma
                            Rakesh  Agarwal

(iv) Cartesian Product (or Cross Product) Operation:
Denoted by X, it is a binary set operation, but the relations on which it is applied need not be union compatible. This operation is used to combine tuples from two relations in a combinatorial fashion. The Cartesian product creates tuples with the combined attributes of the two relations: R(A1, A2, ..., An) X S(B1, B2, ..., Bm) is a relation Q of degree n + m, with attributes Q(A1, ..., An, B1, ..., Bm). The resulting relation Q has one tuple for each combination of tuples, one from R and one from S. If R has nR tuples and S has nS tuples, then R X S will have nR * nS tuples. Applied on its own, the operation produces meaningless tuples; it is useful when followed by a selection that matches the values of attributes coming from the two relations. E.g.:
FEMALE_EMPS <- sigma SEX='F'(EMPLOYEE)
EMPNAMES <- pi FNAME, LNAME, SSN(FEMALE_EMPS)
EMP_DEPENDENTS <- EMPNAMES X DEPENDENT
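In SQL the set operators apply to union-compatible SELECT results. A sketch (note that INTERSECT and EXCEPT are standard SQL but are not supported by every engine):

    SELECT FN, LN FROM STUDENT
    UNION                         -- duplicates eliminated, as in R U S
    SELECT FN, LN FROM INSTRUCTOR;

    SELECT FN, LN FROM STUDENT
    INTERSECT                     -- tuples in both relations
    SELECT FN, LN FROM INSTRUCTOR;

    SELECT FN, LN FROM STUDENT
    EXCEPT                        -- set difference STUDENT - INSTRUCTOR
    SELECT FN, LN FROM INSTRUCTOR;

    -- Cartesian product: every EMPNAMES tuple paired with every DEPENDENT tuple
    SELECT * FROM EMPNAMES CROSS JOIN DEPENDENT;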
3. BINARY RELATIONAL ALGEBRA OPERATIONS

JOIN Operation:
The join operation, denoted by the bowtie symbol, is used to combine related tuples from two relations into single tuples. It allows us to process relationships among relations:
R JOIN<join condition> S
The join condition is of the form <condition> AND <condition> AND ... AND <condition>, where each condition is of the form Ai theta Bj; Ai is an attribute of R, Bj is an attribute of S, Ai and Bj must have the same domain, and theta is one of the comparison operators {=, <, <=, >, >=, !=}. A join with such a general join condition is called a theta join. The result of the join is a relation Q with n + m attributes Q(A1, ..., An, B1, ..., Bm); Q has one combined tuple for each pair of tuples, one from R and one from S, whenever the combination satisfies the join condition. The join condition specified on attributes from the two relations R and S is evaluated for each combination of tuples, and each tuple combination for which the condition evaluates to true is included in the result.

Equi join: a join that involves join conditions with equality comparisons only, i.e., only the = operator appears in the join condition. E.g.:
DEPT_MGR <- DEPARTMENT JOIN MGRSSN=SSN EMPLOYEE

Natural join (denoted by *): an equi join in which the two join attributes have the same name in both relations. The attribute involved in the join condition is known as the join attribute. E.g.:
DEPT_LOCS <- DEPARTMENT * DEPT_LOCATIONS
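The same joins in SQL, sketched over the assumed COMPANY tables:

    -- Theta/equi join with an explicit join condition
    SELECT D.DNAME, E.FNAME, E.LNAME
    FROM   DEPARTMENT D
    JOIN   EMPLOYEE   E ON D.MGRSSN = E.SSN;

    -- Natural join: pairs rows on the identically named attribute(s)
    SELECT * FROM DEPARTMENT NATURAL JOIN DEPT_LOCATIONS;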
Outer Join:
In an ordinary JOIN, tuples without a matching tuple, and tuples with null values for the join attributes, are eliminated from the result. Outer joins can be used to keep all the tuples of R, or of S, or of both relations in the result of the join, even if they do not have matching tuples in the other relation; they preserve all tuples whether or not they match on the join condition.

Left outer join: keeps every tuple of the left relation R in the result of R LEFT OUTER JOIN S; if no matching tuple is found in S, the attributes of S in the join result are filled with null values. E.g., to find the names of all employees, along with the names of the departments they manage (if any):
T1 <- EMPLOYEE LEFT OUTER JOIN SSN=MGRSSN DEPARTMENT
RESULT <- pi FNAME,MINIT,LNAME,DNAME(T1)

Right outer join: keeps every tuple of the right relation S in the result of R RIGHT OUTER JOIN S; if no matching tuple is found in R, the attributes of R in the join result are filled with null values.

Full outer join: keeps all tuples of both relations R and S in the result of R FULL OUTER JOIN S, even if no matching tuples are found, filling them with null values as needed.

The main difference between the Cartesian product and the join: in a join, only the combinations of tuples satisfying the join condition appear in the result, whereas in the Cartesian product all combinations of tuples are included.
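The T1/RESULT example above can be written in SQL with an outer join (a sketch; FULL OUTER JOIN, in particular, is not available in every engine):

    SELECT E.FNAME, E.MINIT, E.LNAME, D.DNAME
    FROM   EMPLOYEE E
    LEFT OUTER JOIN DEPARTMENT D ON E.SSN = D.MGRSSN;
    -- employees who manage no department appear with DNAME = NULL;
    -- RIGHT OUTER JOIN and FULL OUTER JOIN work analogously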
Divide Operator (%):
The division operator is applied to two relations as R(Z) % S(X), where X ⊆ Z. Let Y = Z - X be the set of attributes of R that are not attributes of S. The result of the division is a relation T(Y); for a tuple t to appear in the result T, the values in t must appear in R in combination with every tuple of S.
4. Aggregate Functions and Grouping
Relational algebra allows mathematical aggregate functions to be specified on collections of values from the database. Common functions applied to collections of numeric values include SUM, AVERAGE, MAXIMUM, and MINIMUM. The COUNT function is used for counting tuples or values. The general form is:
<grouping attributes> ℱ <function list> (R)
where <grouping attributes> is a list of attributes of the relation specified by R, and <function list> is a list of (<function> <attribute>) pairs. In each such pair, <function> is one of the allowed functions (SUM, AVERAGE, MAXIMUM, MINIMUM, COUNT) and <attribute> is an attribute of the relation specified by R. The resulting relation has the grouping attributes plus one attribute for each element in the function list.
1. ℱ MAXIMUM Salary (EMPLOYEE) retrieves the maximum salary value from the EMPLOYEE relation:
MAXIMUM_Salary
55000
2. Retrieve each department number, the number of employees in the department, and their average salary:
DNO ℱ COUNT SSN, AVERAGE Salary (EMPLOYEE)
-o0o-o0o-o0o-
UNIT II
SQL & QUERY OPTIMIZATION
SQL Standards - Data types - Database Objects - DDL - DML - DCL - TCL - Embedded SQL - Static Vs Dynamic SQL. QUERY OPTIMIZATION: Query Processing and Optimization - Heuristics and Cost Estimates in Query Optimization.
PART - A
1. Name the categories of SQL commands. May / June 2016
Categories of SQL commands:
Data Definition Language (DDL) commands – create, alter, truncate and drop
Data Manipulation Language (DML) commands – select, insert, delete, update
Transaction Control Language (TCL) commands – commit, rollback, savepoint
Data Control Language (DCL) commands – grant and revoke
2. Explain Query Optimization. List the types. May / June 2016
State the need for Query Optimization. Apr / May 2015
Query Optimization:
Query Optimization is the overall process of choosing the most efficient way of executing a SQL statement: the process of choosing a suitable execution strategy for processing a query from among the possible query execution plans. It is a function of the DBMS.
Types of Query Optimization:
Heuristic or rule-based Query Optimization – the optimizer chooses execution plans based on heuristically ranked operations.
Cost-based Query Optimization – the optimizer examines the candidate plans and chooses the execution plan with the lowest estimated cost, based on the usage of resources.
Need for Query Optimization:
To speed up long-running queries.
To avoid data locking and corruption.
To reduce the amount of data a query has to process.
3. Differentiate static and dynamic SQL. Nov / Dec 2007, Nov / Dec 2014, Nov / Dec 2015
What is the difference between Static and Dynamic SQL? Apr / May 2015
Static SQL:
Static SQL consists of SQL statements in an application that do not change at runtime. A static SQL statement is embedded within an application program, so the form of the SQL statements cannot be changed without making changes to the program. Static SQL statements in a source program must be processed (precompiled) before the program is compiled and executed.
Dynamic SQL:
Dynamic SQL consists of SQL statements in an application that change at runtime. Dynamic SQL statements are constructed at runtime; the SQL statements are prepared and processed within the program while it is running, so the form of the SQL statements can be changed during execution.
4. Why does SQL allow duplicate tuples in a table or in a query result? Nov / Dec 2015
As duplicate removal is expensive, SQL allows duplicates: tables and query results are treated as multisets (bags) rather than sets, a point on which typical SQL engines deviate from the pure relational model.
To remove duplicates, we use the DISTINCT keyword.
5. Give a brief description on DCL commands. Nov / Dec 2014
Data Control Language (DCL) is the component of SQL used for authorization; DCL is used to control access to data stored in a database. Examples of DCL commands include:
GRANT – to allow specified users to perform specified tasks.
REVOKE – to cancel previously granted or denied permissions.
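For instance, a minimal GRANT/REVOKE pair (the student table and user ramesh are illustrative assumptions):
GRANT SELECT, INSERT ON student TO ramesh;
REVOKE INSERT ON student FROM ramesh;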
6. What is Embedded SQL? Nov / Dec 2003
Embedded SQL is a method of combining the computing power of a programming language and the database manipulation capabilities of SQL. Embedded SQL statements are SQL statements written inline with the program source code of the host language.
7. How will you create a view in SQL? Nov / Dec 2003
A view is a virtual table. A view can be created from one or many tables, depending on the SQL query written to create it:
CREATE VIEW view_name AS SELECT column_name(s) FROM table_name WHERE condition;
8. Name the different types of joins supported in SQL. Nov / Dec 2005
Simple join – equi join and non-equi join
Self join
Outer join – left outer join and right outer join
9. Define Query Language. Give the classification of query language. May / June 2007
Query Language: It is a construct used to infer information from the database. It can define the structure of the data, modify data in the database and specify constraints. It is classified as follows:
Data Definition Language
Data Manipulation Language
Transaction Control Language
View definition
Embedded SQL and Dynamic SQL
Query by example
10. Consider the following relation: EMP (Eno, Name, Date_of_birth, Sex, Date_of_join, Basic_pay, Dept). Develop an SQL query that will find and display the dept and average basic pay in each dept. May / June 2009
SELECT DEPT, AVG(BASIC_PAY) FROM EMP GROUP BY DEPT;
11. List a few data types and database objects supported by SQL.
Data types: CHAR, VARCHAR, BINARY, BOOLEAN, BLOB and CLOB
Database objects: tables, views, constraints, sequences, indexes, triggers
12. Give the reasons why null values are introduced into the database.
SQL uses null values to indicate the absence of information about the value of an attribute. Null values might be introduced into the database because:
The actual value is unknown.
The actual value does not exist.
13. Write down the various cost components of a query execution.
Access cost to secondary storage
Storage cost
Computation cost
Memory usage cost
Communication cost
14. Define Triggers. What is the need for triggers?
A trigger is a stored procedure / named database object that automatically executes when an event occurs in the database server. DML triggers execute when a user tries to modify data through a data manipulation language (DML) event. There are two types of triggers: row-level and statement-level triggers.
Need for Triggers:
To catch errors.
To run scheduled tasks.
To audit changes in the database.
To implement integrity constraints.
PART – B
1. SQL FUNDAMENTALS
1A. Explain about SQL fundamentals. (8) May / June 2016
1B. Embedded SQL. (8) Nov / Dec 2014
Write about embedded / static and dynamic SQL.
a) SQL FUNDAMENTALS
SQL – Structured Query Language. SQL has evolved from IBM's Sequel (Structured English QUEry Language).
Advantages of SQL:
SQL is a standard relational-database language.
SQL is a comprehensive database language; it has statements for data definition, query and update. Hence it is both a DDL and a DML.
It has facilities for defining views on the database, specifying security and authorization, defining integrity constraints, and specifying transaction controls.
It also has rules for embedding SQL statements into a general-purpose programming language such as C or Pascal.
Parts of SQL:
The SQL language has several parts:
Data Definition Language – defines relation schemas, deletes relations and modifies relation schemas.
Interactive Data Manipulation Language – based on both relational algebra and relational calculus; it includes commands to insert, update, select and delete tuples in the database.
View Definition – defines views.
Transaction Control – specifies the beginning and end of transactions.
Embedded SQL and Dynamic SQL – SQL statements can be embedded in general-purpose programming languages.
Integrity – SQL DDL commands specify integrity constraints that the data stored in the database must satisfy; updates that violate these constraints are disallowed.
Authorization – specifies access rights to relations and views.
Basic Structure of SQL:
The basic structure of an SQL expression consists of three clauses: SELECT, FROM and WHERE.
The SELECT clause corresponds to the projection operator of relational algebra; it lists the attributes desired in the result of a query.
The FROM clause corresponds to the Cartesian product operation of relational algebra; it lists the relations used in the evaluation of the expression.
The WHERE clause corresponds to the selection predicate of relational algebra; it consists of a predicate involving attributes of the relations that appear in the FROM clause.
A SQL query is of the form:
SELECT A1, A2, A3, ..., An FROM r1, r2, r3, ..., rm WHERE P
or
SELECT <attribute list> FROM <table list> WHERE <condition>;
where:
<attribute list> is a list of attribute names whose values are to be retrieved by the query.
<table list> is a list of the relation names required to process the query.
<condition> is a conditional (Boolean) expression that identifies the tuples to be retrieved by the query.
B. EMBEDDED SQL
Need for Embedded SQL:
Not all processing needs can be expressed in SQL alone: because SQL has no variables or control-of-flow statements, general processing of tables is difficult, and generating reports or auditing data requires the use of a programming language.
Embedded SQL is a method of combining the computing power of a programming language and the database manipulation capabilities of SQL. Embedded SQL statements are SQL statements written inline with the program source code of the host language.
Embedded SQL statements are processed by a special SQL precompiler: the embedded SQL statements are parsed by an embedded SQL preprocessor and replaced by host-language calls to a code library, and the output from the preprocessor is then compiled by the host compiler. This allows programmers to embed SQL statements in programs written in any number of languages, such as C/C++, COBOL, Fortran and Java.
The SQL standard defines embeddings of SQL in a variety of programming languages such as C, Java and COBOL. A language into which SQL queries are embedded is referred to as a host language, and the SQL structures permitted in the host language comprise embedded SQL.
The EXEC SQL statement is used to identify an embedded SQL request to the preprocessor:
EXEC SQL <embedded SQL statement> END_EXEC
EXEC SQL <query or PL/SQL block> END_EXEC
The open statement causes the query associated with a declared cursor to be evaluated:
EXEC SQL open c END_EXEC
The fetch statement causes the values of one tuple of the query result to be placed in host-language variables:
EXEC SQL fetch c into :cn, :cc END_EXEC
Repeated calls to fetch get successive tuples of the query result.
The close statement causes the database system to delete the temporary relation that holds the result of the query:
EXEC SQL close c END_EXEC
DYNAMIC SQL
The SQL statements in a static program do not change each time the program is run; dynamic SQL instead allows programs to construct and submit SQL queries at run time.
The program constructs an SQL statement in a buffer, just as it does for the EXECUTE IMMEDIATE statement. Instead of host variables, a question mark (?) can be substituted for a constant.
The program passes the SQL statement to the DBMS with a PREPARE statement, which requests that the DBMS parse, validate and optimize the statement and generate an execution plan for it.
The program can then use the EXECUTE statement repeatedly, supplying different parameter values each time the dynamic statement is executed:
EXEC SQL execute dynprog using :account;
The dynamic SQL program contains a ?, which is a placeholder for a value that is supplied when the SQL program is executed.
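Putting these steps together, a minimal sketch of the prepare/execute flow in a C host program (the account table and its column names are illustrative assumptions; dynprog and :account follow the statement above):
char *sqlprog = "UPDATE account SET balance = balance * 1.05 WHERE account_number = ?";
EXEC SQL prepare dynprog from :sqlprog;
char account[10] = "A-101";
EXEC SQL execute dynprog using :account;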
Difference between Static and Dynamic SQL:
1. In static SQL, how the database will be accessed is predetermined (hard coded) in the embedded SQL statement; in dynamic SQL, how the database will be accessed is determined at run time.
2. Static SQL is faster and more efficient; dynamic SQL is less fast and efficient.
3. In static SQL, the SQL statements are compiled at compile time; in dynamic SQL, they are compiled at run time.
4. In static SQL, parsing, validation, optimization and generation of the application plan are done at compile time; in dynamic SQL, these are done at run time.
5. Static SQL is generally used for situations where data is distributed uniformly; dynamic SQL is generally used where data is distributed non-uniformly.
6. The EXECUTE IMMEDIATE, EXECUTE and PREPARE statements are not used in static SQL; they are used in dynamic SQL.
7. Static SQL is less flexible; dynamic SQL is more flexible.
2. SQL COMMANDS
2A. Explain the following with examples: Nov / Dec 2014
(i) DDL (4)
(ii) DML (4)
Explain about Data Definition Language. (8) May / June 2016
2B. Write about integrity constraints in SQL. (8)
2A. SQL COMMANDS: (16)
Categories of SQL commands:
Data Definition Language (DDL) commands
Data Manipulation Language (DML) commands
Transaction Control Language (TCL) commands
Data Control Language (DCL) commands
SQL – DATA DEFINITION LANGUAGE (DDL):
SQL uses the terms table, row, and column for relation, tuple, and attribute, respectively. The SQL2 commands for data definition are CREATE, ALTER, TRUNCATE and DROP.
The SQL DDL allows specification of not only a set of relations, but also information about each relation, including:
The schema for each relation
The domain of values associated with each attribute
The integrity constraints
The set of indices to be maintained for each relation
The security and authorization information for each relation
The physical storage structure of each relation on disk
Schema and Catalog Concepts in SQL2:
An SQL schema is identified by a schema name, and includes an authorization identifier to indicate the user or account who owns the schema, as well as descriptors for each element in the schema. Schema elements include the tables, constraints, views, domains, and other constructs (such as authorization grants) that describe the schema. A schema is created via the CREATE SCHEMA statement, which can include the definitions of all the schema elements. Alternatively, the schema can be assigned a name and authorization identifier, and the elements can be defined later.
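For example, a schema and its owner can be declared in one statement (the schema and owner names here are illustrative):
CREATE SCHEMA COMPANY AUTHORIZATION JSMITH;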
The CREATE TABLE Command and SQL2 Data Types and Constraints:
The CREATE TABLE command is used to specify a new relation by giving it a name and specifying its attributes and constraints. The attributes are specified first, and each attribute is given a name, a data type to specify its domain of values, and any attribute constraints such as NOT NULL:
create table r (A1 D1, A2 D2, ..., An Dn, (integrity constraint 1), (integrity constraint 2), ..., (integrity constraint k));
where r is the name of the relation, each Ai is the name of an attribute in the schema of relation r, and Di is the type of the values in the domain of Ai.
CREATE TABLE STUDENT (reg number(4), name varchar2(25));
CREATE TABLE STUDENT (reg number(4) primary key, name varchar2(25));
Domain types in SQL:
The SQL standard supports a variety of built-in domain types, including:
char(n) – fixed-length character string with length n
varchar(n) – variable-length character string with maximum length n
int – integer
smallint – small integer, a subset of integer
number(n) – a number with n digits
number(p, d) – a fixed-point number with p digits, d of which are to the right of the decimal point
real – floating-point numbers
float(n) – a floating-point number with precision of at least n digits
date – a calendar date containing day, month and (four-digit) year
time – the time of day in hours, minutes and seconds
timestamp – a combination of date and time
SQL DDL – DATA DEFINITION LANGUAGE:
DDL commands are used to create an object, alter the structure of an object and drop the object created.
CREATE Command: This command is used to create a table or other object.
Syntax: create table <tablename> (<column name1> datatype, <column name2> datatype, ...);
ALTER Command: This command is used to add a field or to modify the definition of a field or column.
Syntax: alter table <tablename> add (<column name1> datatype, <column name2> datatype, ...);
alter table <tablename> modify (<column name1> datatype, <column name2> datatype, ...);
TRUNCATE Command: This command is used to delete all the rows of a table but not the structure of the table.
Syntax: truncate table <tablename>;
DROP Command: This command is used to delete the entire table along with its structure.
Syntax: drop table <tablename>;
DML – DATA MANIPULATION LANGUAGE:
DML commands are used to insert, view, update and delete the values of an object.
INSERT Command: This command is used to insert a set of data values into a table as defined in the create table command.
Syntax: insert into <tablename> values (value1, value2, ..., valuen);
insert into <tablename> values (&columnname1, &columnname2, ..., &columnname n);
SELECT Command: This command is used to view particular data records or columns.
Syntax: select <column name1, ...> from <tablename>;
select * from <tablename>; – to view all records
select distinct <columnname> from <tablename>;
select * from <tablename> order by <columnname>; – by default, records are sorted in ascending order
select * from <tablename> order by <columnname> desc; – records are sorted in descending order with respect to the column
select * from <tablename> where <condition>;
UPDATE Command: This command is used to update and change the data values of the table.
Syntax: update <tablename> set <column> = value where <condition>;
DELETE Command: This command is used to delete a particular record or all records of the table.
Syntax: delete from <tablename> where <condition>;
delete from <tablename>; – without a WHERE clause this deletes all the rows of the table, similar to the truncate command
SQL – TCL TRANSACTION CONTROL LANGUAGE:
SAVEPOINT Command: This command is used to save and store the transaction done till a point.
Syntax: savepoint <savepoint_id>;
ROLLBACK Command: This command is used to undo the transaction up to a savepoint or commit.
Syntax: rollback;
rollback to <savepoint_id>;
COMMIT Command: This command is used to save and end the transaction.
Syntax: commit;
SQL DCL – DATA CONTROL LANGUAGE:
GRANT Command: This command is used to grant privileges on tables to other users.
Syntax: grant <privileges> on <tablename> to <username>;
REVOKE Command: This command is used to take back privileges on tables from users.
Syntax: revoke <privileges> on <tablename> from <username>;
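A short worked sequence illustrating the four command categories together (a sketch; the student table and user user1 are hypothetical):
create table student (reg number(4) primary key, name varchar2(25));
alter table student add (email varchar2(40));
insert into student values (101, 'Anand', 'anand@mail.com');
update student set name = 'Anand K' where reg = 101;
savepoint s1;
delete from student where reg = 101;
rollback to s1; -- undoes the delete, keeps the earlier changes
commit;
grant select on student to user1;
revoke select on student from user1;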
2B. INTEGRITY CONSTRAINTS
Integrity Constraints: Integrity constraints ensure that changes made to the database by authorized users do not result in a loss of consistency; they guard against damage to the database. Constraints are specified as part of DDL statements (mainly the create command) in the column definition.
DOMAIN INTEGRITY CONSTRAINTS:
Not Null Constraint: When a column is defined as not null, the column becomes a mandatory column.
Syntax: Column name datatype (size) not null;
Column name datatype constraint <constraint-name> not null;
CREATE TABLE STUDENT (reg number(4) primary key, name varchar2(25) not null);
Check Constraint: A check constraint is specified as a logical expression that evaluates to either TRUE or FALSE.
Syntax: Column name datatype (size) check (logical expression);
Column name datatype constraint <constraint-name> check (logical expression);
CREATE TABLE STUDENT (reg number(4) primary key, name varchar2(25), mark number(3) check (mark <= 100));
Unique Constraint: The purpose of the unique key is to ensure that the information in the column is distinct. Null values are allowed.
Syntax: Column name datatype (size) unique;
Column name datatype (size) constraint <constraint-name> unique;
CREATE TABLE STUDENT (reg number(4) primary key, name varchar2(25) unique);
ENTITY INTEGRITY CONSTRAINTS:
Primary Key: A primary key is one or more columns in a table used to uniquely identify each row in the table. A primary key cannot have null values.
Syntax: Column name datatype (size) primary key;
Column name datatype (size) constraint <constraint-name> primary key;
CREATE TABLE STUDENT (reg number(4) primary key, name varchar2(25));
REFERENTIAL INTEGRITY CONSTRAINTS:
Foreign Key: A foreign key is a column in a (referencing) table that references a (primary key) column of another table (the parent table).
Syntax: Column name datatype (size) references <referenced (parent) table> (<referenced field name>);
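A minimal foreign key sketch, assuming a hypothetical DEPT parent table alongside the STUDENT table used above:
CREATE TABLE DEPT (dno number(2) primary key, dname varchar2(20));
CREATE TABLE STUDENT (reg number(4) primary key, name varchar2(25), dno number(2) references DEPT(dno));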
3. SQL QUERY STRUCTURE
Describe the six clauses in the syntax of an SQL query and show what type of constructs can be specified in each of the six clauses. Which of the six clauses are required and which are optional? (16) Nov / Dec 2015
The SQL data-manipulation language (DML) provides the ability to query information and to insert, delete and update tuples. There are six clauses that can be used in an SQL statement: SELECT, FROM, WHERE, GROUP BY, HAVING, and ORDER BY. The clauses must be coded in a specific sequence:
1. SELECT column name(s)
2. FROM tables or views
3. WHERE conditions or predicates are met
4. GROUP BY grouping attribute(s)
5. HAVING a common condition as a group
6. ORDER BY a sorting method
SELECT <attribute list>
FROM <table list>
[WHERE <condition>]
[GROUP BY <grouping attributes>]
[HAVING <group condition>]
[ORDER BY <attribute list>]
SELECT and FROM are required; the rest of the clauses are optional.
A typical SQL query has the form:
select A1, A2, ..., An
from r1, r2, ..., rm
where P;
Here the Ai represent attributes, the ri represent relations, and P is a predicate. This query is equivalent to the relational algebra expression:
ΠA1, A2, ..., An (σP (r1 × r2 × ... × rm))
The result of an SQL query is a relation.
SELECT CLAUSE:
The select clause corresponds to the projection operation of relational algebra. It is used to list the attributes desired in the result of a query.
An asterisk in the select clause denotes "all attributes":
select * from emp;
SQL allows duplicates in relations as well as in query results. To force the elimination of duplicates, insert the keyword distinct after select.
Find the names of all cities in the employee relation, and remove duplicates:
select distinct city from emp;
The keyword all specifies that duplicates are not to be removed:
select all city from emp;
The select clause can contain arithmetic expressions involving the operators +, -, *, and /, operating on constants or attributes of tuples:
select salary + da from emp;
FROM CLAUSE:
The from clause corresponds to the Cartesian product operation of relational algebra. It lists the relations to be scanned in the evaluation of the expression.
Find the Cartesian product emp X dept:
select * from emp, dept;
WHERE CLAUSE:
The where clause corresponds to the selection predicate of relational algebra. It consists of a predicate involving attributes of the relations that appear in the from clause. SQL uses the logical connectives and, or, and not, and allows the use of arithmetic expressions as operands to the comparison operators.
select loan-number from loan where amount between 90000 and 100000;
Find the maximum salary, the minimum salary, and the average salary among employees who work for the 'Research' department (the join condition DNO = DNUMBER links the two tables):
SELECT MAX(SALARY), MIN(SALARY), AVG(SALARY)
FROM EMPLOYEE, DEPARTMENT
WHERE DNO = DNUMBER AND DNAME = 'Research';
GROUP BY CLAUSE:
It is used to apply aggregate functions to subgroups of tuples in a relation. Each subgroup consists of the set of tuples that have the same value for the grouping attribute(s), and the function is applied to each subgroup independently. SQL has a GROUP BY clause for specifying the grouping attributes, which must also appear in the SELECT clause.
For each department, retrieve the department number, the number of employees in the department, and their average salary:
SELECT DNO, COUNT(*), AVG(SALARY)
FROM EMPLOYEE
GROUP BY DNO;
ORDER BY CLAUSE:
ORDER BY specifies an order for displaying the result of a query: ascending by default; to specify descending order, desc should be used.
select ssn, name from emp where dno = 2 order by sal desc;
HAVING CLAUSE:
It is used to retrieve the values of the aggregate functions for only those groups that satisfy certain conditions: the HAVING clause specifies a selection condition on groups.
SELECT DNO, COUNT(*), AVG(SALARY)
FROM EMPLOYEE
GROUP BY DNO
HAVING COUNT(*) > 50;
4. QUERY PROCESSING
Briefly explain about query processing. (16) May / June 2016
Query Processing:
Query processing is a 3-step process that transforms a high-level query (of relational calculus / SQL) into an equivalent and more efficient lower-level query (of relational algebra):
1. Parsing and translation
2. Optimization
3. Evaluation
1. Parsing and Translation – Query Compiler:
Translate the query into its internal form; this is then translated into relational algebra. The parser checks syntax and verifies relations.
A query expressed in a high-level query language such as SQL must first be scanned, parsed, and validated. The scanner identifies the language tokens, such as SQL keywords, attribute names, and relation names, in the text of the query.
The parser checks the query syntax to determine whether the query is formulated according to the syntax rules (rules of grammar) of the query language. The query must also be validated by checking that all attribute and relation names are valid and semantically meaningful names in the schema of the particular database being queried.
(Figure: steps in processing a high-level query, from the query compiler through the optimizer to the runtime processor, using statistics about the data.)
Translating SQL Queries into Relational Algebra:
An SQL query is first translated into an equivalent relational algebra expression: SQL queries are decomposed into query blocks, which are then translated into relational algebra operators and optimized. Heuristic relational algebra optimization can group operations together for execution; this is called pipelining or stream-based processing.
Internal representation of the SQL query:
An internal representation of the query is created, usually as a tree data structure called a query tree. It is also possible to represent the query using a graph data structure called a query graph.
Query Tree: A query tree is a tree data structure that corresponds to a relational algebra expression. Input relations of the query are represented as leaf nodes of the tree, and relational algebra operations are represented as internal nodes. A query tree represents a specific order of operations for executing a query. An execution of the query tree consists of executing an internal node operation whenever its operands are available, and then replacing that internal node by the relation that results from executing the operation.
Query Graph: Relations in the query are represented by nodes, and selection and join conditions are represented by graph edges. There is a single graph corresponding to each query. A query graph corresponds to a relational calculus expression, and it does not indicate an order in which operations are to be performed.
2. Query Optimization – Query Optimizer:
Generate an optimal evaluation plan (with lowest cost) for the query. An evaluation plan or execution strategy defines exactly what algorithm is used for each operation and how the execution of the operations is coordinated. A query typically has many possible execution strategies, and the process of choosing a suitable one for processing a query is known as query optimization. The DBMS must devise an execution strategy for retrieving the result of the query from the database: the query optimizer module has the task of producing an execution plan, and the code generator generates the code to execute that plan.
3. Query Evaluation – Query Command Processor:
The query-execution engine takes a query-evaluation plan, executes that plan, and returns the answers to the query. The runtime database processor has the task of running the query code, whether in compiled or interpreted mode, to produce the query result. If a runtime error occurs, an error message is generated by the runtime database processor.
5. QUERY OPTIMIZATION
Explain in detail about query optimization. (16)
Explain the cost estimation of query optimization. (16) Nov / Dec 2014
Discuss about join order optimization and heuristic optimization. (16) Apr / May 2015
Query Optimization:
Query Optimization is the overall process of choosing the most efficient way of executing a SQL statement: the process of choosing a suitable execution strategy for processing a query from among the possible query execution plans. It is a function of the DBMS.
Need for Query Optimization:
To speed up long-running queries.
To avoid data locking and corruption.
To reduce the amount of data a query has to process.
Types of Query Optimization:
Heuristic or rule-based Query Optimization – the optimizer chooses execution plans based on heuristically ranked operations.
Cost-based Query Optimization – the optimizer examines the candidate plans and chooses the execution plan with the lowest estimated cost, based on the usage of resources.
Heuristic Optimization
Process for heuristic optimization:
Step 1. The parser of a high-level query generates an initial internal representation.
Many query trees can be drawn for the same query. The query parser generates a standard initial query tree corresponding to an SQL query without doing any optimization.
Query Tree: A query tree is a tree data structure that corresponds to a relational algebra expression: input relations of the query are represented as leaf nodes, and relational algebra operations as internal nodes. A query tree represents a specific order of operations for executing a query; executing it consists of executing an internal node operation whenever its operands are available and replacing that internal node by the resulting relation.
Query Graph: Relations in the query are represented by nodes, and selection and join conditions by graph edges; there is a single graph corresponding to each query. A query graph corresponds to a relational calculus expression and does not indicate an order in which operations are performed.
SQL Query Q0:
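(The original slide shows Q0 as an image. A representative query of this kind, in the style of the Elmasri–Navathe running example and assuming EMPLOYEE, WORKS_ON and PROJECT tables, would be:)
SELECT E.LNAME
FROM EMPLOYEE E, WORKS_ON W, PROJECT P
WHERE P.PNAME = 'Aquarius' AND P.PNUMBER = W.PNO
AND W.ESSN = E.SSN AND E.BDATE > '31-DEC-1957';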
(Figures: the initial query tree for Q0 and its step-by-step transformations (a)–(d) during heuristic optimization; shown as images in the original slides.)
Step 2. Apply heuristic rules to optimize the internal representation.
It is the job of the heuristic query optimizer to transform this initial query tree into a final query tree that is efficient to execute. The optimizer includes rules for equivalence among relational algebra expressions that can be applied to the initial tree; heuristic query optimization uses these equivalences to transform the initial tree into the final, optimized query tree.
Heuristic optimization transforms the query tree by using a set of rules that typically improve execution performance:
Perform selection early (reduces the number of tuples).
Perform projection early (reduces the number of attributes).
Perform the most restrictive selection and join operations (i.e., those with the smallest result size) before other similar operations.
Steps in converting a query tree during heuristic optimization (a sketch of the resulting transformation follows this answer):
1. Represent the initial query tree for the SQL query.
2. Move SELECT operations down the query tree.
3. Apply the most restrictive SELECT operations first.
4. Replace Cartesian product and select with join.
5. Move PROJECT operations down the tree.
Step 3. A query execution plan is generated to execute groups of operations, based on the access paths available on the files involved in the query.
The main heuristic is to apply first the operations that reduce the size of intermediate results, e.g., apply SELECT and PROJECT operations before applying JOIN or other binary operations.
An execution plan has three components:
A query tree
A strategy selected for each non-leaf node
An ordering of evaluation of non-leaf nodes
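Applied to the representative query Q0 above (same assumptions about the EMPLOYEE, WORKS_ON and PROJECT tables), these steps would, for instance, transform the initial expression
π LNAME (σ PNAME='Aquarius' AND PNUMBER=PNO AND ESSN=SSN AND BDATE>'31-DEC-1957' ((PROJECT × WORKS_ON) × EMPLOYEE))
into the optimized form
π LNAME ((σ PNAME='Aquarius' (PROJECT) ⋈ PNUMBER=PNO WORKS_ON) ⋈ ESSN=SSN (σ BDATE>'31-DEC-1957' (EMPLOYEE)))
in which the selections have been pushed down the tree and the Cartesian products replaced by joins.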
Compiled queries: the optimization is done at compile time, and the resulting execution strategy code is stored and executed directly at run time.
Interpreted queries: the optimization and execution of code are done at run time; a full-scale optimization may slow down the response time.
Part - C
Cost-based Optimization:
The cost difference between evaluation plans for a query can be enormous, so the optimizer estimates and compares the costs of executing a query using different execution strategies and chooses the strategy with the lowest cost estimate. Cost functions are used in query optimization. The goal of cost-based optimization (in Oracle, for example) is to minimize the elapsed time to process the entire query. The optimizer calculates this cost based on the estimated usage of resources such as I/O, CPU time and memory needed.
Cost components of query execution:
Access cost to secondary storage
Storage cost
Computation cost
Memory usage cost
Communication cost
Catalog information used in cost functions:
Number of records (r)
The average record size (R)
Number of blocks (b)
Blocking factor (bfr)
Number of levels of indices (x)
Selectivity (sl) – the fraction of records satisfying a condition on the attribute.
Selection cardinality (s), where s = sl * r. For example, if a relation has r = 10,000 records and a condition has selectivity sl = 0.01, the estimated selection cardinality is s = 0.01 * 10,000 = 100 records.
Steps in cost-based query optimization:
1. Generate logically equivalent expressions using equivalence rules.
2. Annotate the resultant expressions to get alternative query plans.
3. Choose the cheapest plan based on estimated cost.
Join Ordering:
Join ordering is the first cost-based optimization: the performance of a query plan is determined largely by the order in which the tables are joined.
First, all ways to access each relation in the query are computed, and for each relation the optimizer records the cheapest way to scan it. The optimizer then considers combining each pair of relations for which a join condition exists. For each pair, the optimizer considers the available join algorithms implemented by the DBMS and preserves the cheapest way to join each pair of relations. Then all three-relation query plans are computed by joining each two-relation plan produced by the previous phase with the remaining relations in the query.
The SQL Server query optimizer needs to take two important decisions regarding joins: the selection of a join order and the choice of a join algorithm. The selection of a join algorithm is the choice between a nested loops join, a merge join or a hash join operator.
Join tree:
A join tree is a binary tree with join operators as inner nodes and relations as leaf nodes.
Commonly used classes of join trees:
Left-deep tree
Right-deep tree
Zigzag tree
Bushy tree
A left-deep tree is a binary tree in which the right child of each non-leaf node is always a base relation. The optimizer chooses the left-deep tree with the lowest cost estimate; left-deep trees are used for pipelining.
Semantic Query Optimization: Uses constraints specified on the database schema in order to modify one query into another query that is more efficient to execute.
6. SQL QUERIES
a) Consider a student registration database comprising of the below given table schema. (16) Apr / May 2015
Student File: Student Number, Student Name, Address, Telephone
Course File: Course Number, Description, Hours, Professor Number
Professor File: Professor Number, Name, Office
Registration File: Student Number, Course Number, Date
Consider a suitable sample of tuples/records for the above mentioned tables and write DML statements (SQL) to answer the queries listed below.
(i) Which courses does a specific professor teach?
(ii) What courses are taught by two specific professors?
(iii) Who teaches a specific course and where is his/her office?
(iv) For a specific student number, in which courses is the student registered and what is his/her name?
(v) Who are the professors for a specific student?
(vi) Who are the students registered in a specific course?
Solution:
(i) Which courses does a specific professor teach?
select CourseNumber from Course
where ProfessorNumber = (select ProfessorNumber from Professor where Name = 'Dr. Lakshmi');
(ii) What courses are taught by two specific professors?
select CourseNumber, ProfessorNumber from Course
where ProfessorNumber in (1, 2);
(iii) Who teaches a specific course and where is his/her office?
select P.ProfessorNumber, P.Name, P.Office
from Professor P, Course C
where P.ProfessorNumber = C.ProfessorNumber and C.CourseNumber = 1;
(iv) For a specific student number, in which courses is the student registered and what is his/her name?
select S.StudentName, R.CourseNumber
from Student S, Registration R
where S.StudentNumber = R.StudentNumber and S.StudentNumber = 1;
(v) Who are the professors for a specific student?
select P.Name
from Course C, Professor P, Registration R
where R.StudentNumber = 1 and R.CourseNumber = C.CourseNumber and C.ProfessorNumber = P.ProfessorNumber;
(vi) Who are the students registered in a specific course?
select StudentNumber from Registration where CourseNumber = 2001;
(b) Assume the following tables: Nov / Dec 2015
Degree (degcode, name, subject)
Candidate (seatno, degcode, name, semester, month, year, result)
Marks (seatno, degcode, semester, month, year, papcode, marks)
degcode – degree code; name – name of the degree (e.g., M.Sc, M.Com); subject – subject of the course, e.g., Phy; papcode – paper code, e.g., A1.
Solve the following queries using SQL: (16)
(i) Write a SELECT statement to display all the degree codes which are there in the Candidate table but not present in the Degree table, in the order of degcode. (4)
select degcode from Candidate
minus
select degcode from Degree
order by degcode;
(ii) Write a SELECT statement to display the name of all the candidates who have got less than 40 marks in exactly 2 subjects. (4)
select seatno, name from Candidate
where seatno in
(select seatno from Marks where marks < 40
group by seatno having count(*) = 2);
(iii) Write a SELECT statement to display the name, subject and number of candidates for all degrees in which there are less than 5 candidates. (4)
select D.name, D.subject, count(C.seatno)
from Degree D, Candidate C
where D.degcode = C.degcode
group by D.degcode, D.name, D.subject
having count(C.seatno) < 5;
(iv) Write a SELECT statement to display the names of all the candidates who have got the highest total marks in M.Sc (Maths). (4)
select name from Candidate
where degcode in (select degcode from Degree where name = 'M.Sc' and subject = 'Maths')
and seatno in
(select seatno from Marks
group by seatno
having sum(marks) =
(select max(sum(marks)) from Marks
where degcode in (select degcode from Degree where name = 'M.Sc' and subject = 'Maths')
group by seatno));
-o0o-o0o-o0o-
UNIT III
TRANSACTION PROCESSING AND CONCURRENCY CONTROL
Introduction - Properties of Transaction – Serializability – Concurrency Control – Locking Mechanisms – Two Phase Commit Protocol – Deadlock
PART-A
1. What do you mean by transaction? Define the properties of transaction. Nov / Dec 2014, April / May 2015, May / June 2016, Nov / Dec 2010, April / May 2010
A transaction is a unit of program execution that accesses and possibly updates various data items.
Atomicity: A transaction is either performed in its entirety or not performed at all.
Consistency: A transaction is consistent if its complete execution takes the database from one consistent state to another.
Isolation: Each transaction is unaware of other transactions executing concurrently in the system.
Durability: After the successful completion of a transaction, the changes it has made to the database persist even if there are system failures.
2. What is serializability? How is it tested? Nov / Dec 2014, May / June 2014 (2008)
Serializability is the process of managing the execution of a set of transactions in such a way that their concurrent execution produces the same end result as if they were run serially.
Serializability is tested by using a directed graph called a precedence graph, constructed from the schedule. The graph is a pair G = (V, E), where V is the set of vertices representing all the transactions and E is a set of edges Ti → Tj, one for each pair of conflicting operations in which an operation of Ti precedes an operation of Tj; the schedule is serializable if and only if this graph contains no cycle.
3. Define DDL, DML, DCL and TCL. April / May 2015
DDL – Data Definition Language: It is used to create, alter and delete database objects. The commands used are create, alter and drop.
DML – Data Manipulation Language: It lets users insert, modify and delete the data in the database. The commands used are insert, update and delete.
DCL – Data Control Language: It consists of commands that control user access to the database objects. The commands used are grant and revoke.
TCL – Transaction Control Language: It is a subset of SQL used to control transactional processing in a database; the commands used are commit, rollback and savepoint. A transaction is a logical unit of work that comprises one or more SQL statements, usually a group of Data Manipulation Language (DML) statements.
4. Differentiate strict two phase locking protocol and rigorous two phase locking protocol. April / May 2016, May / June 2013 (2008), Nov / Dec 2013 (2008), May / June 2012
The two phase locking protocol requires that each transaction issue lock and unlock requests in two phases:
Growing phase: in this phase a transaction may obtain locks but may not release any lock.
Shrinking phase: in this phase a transaction may release locks but may not obtain any new locks.
Strict Two Phase Locking: a transaction must hold all its exclusive locks till it commits/aborts.
Rigorous Two Phase Locking: it is even stricter; all locks (shared and exclusive) are held till commit/abort.
Strict two phase locking guarantees cascadeless recoverability. Rigorous two phase locking is used in dynamic environments where data access patterns are not known beforehand. Neither protocol by itself guarantees a deadlock-free schedule.
5. What is meant by concurrency control? Nov / Dec 2015, April / May 2015 (2008)
List the two commonly used concurrency control techniques. Nov / Dec 2011 (2008)
Concurrency control is the process of managing simultaneous operations on the database without having them interfere with one another. It prevents interference when two or more users are accessing the database simultaneously and at least one is updating data.
Need for concurrency:
Improved throughput and resource utilization
Reduced waiting time
Two techniques:
Lock-based protocols
Timestamp-based protocols
6. Give an example for the two phase commit protocol. Nov / Dec 2015
The two phase commit protocol is a distributed algorithm which lets all sites in a distributed system agree to commit a transaction. The protocol results in either all nodes committing the transaction or all aborting, even in the case of site failures and message losses.
Example: Transfer money from bank A to bank B: debit A, credit B, and tell the client "OK". We want both banks to do it, or neither to do it.
7. Define deadlock. May / June 2014 (2008)
In a database, a deadlock is a situation in which two or more transactions are waiting for one another to give up locks. In other words, there exists a set of waiting transactions {T0, T1, ..., TN} such that T0 is waiting for a data item that T1 holds, ..., TN-1 is waiting for TN, and TN is waiting for a data item that T0 holds. None of the transactions can ever proceed with its normal execution.
8. What are the disadvantages of not controlling concurrency? Nov / Dec 2014
The Lost Update Problem: This problem occurs when two transactions that access the same database items have their operations interleaved in a way that makes the value of some database item incorrect.
The Temporary Update (or Dirty Read) Problem: This problem occurs when one transaction updates a database item and then fails for some reason, while another transaction reads the updated value before it is changed back to its original value.
The Incorrect Summary Problem: If one transaction is calculating an aggregate summary function on a number of records while other transactions
are updating some of these records, the aggregate function may calculate some values before they are updated and others after they are updated.
9. Write the use of savepoints. April / May 2015 (2008)
A savepoint is a way of implementing subtransactions (also known as nested transactions) within a relational database management system. It indicates a point within a transaction that can be "rolled back to" without affecting any work done in the transaction before that point. Multiple savepoints can exist within a single transaction.
Savepoints are useful for implementing complex error recovery in database applications: if an error occurs in the midst of a multiple-statement transaction, the application may be able to recover from the error without having to abort the entire transaction. A savepoint can be declared by issuing a SAVEPOINT name statement.
10. State the write-ahead log rule. Why is it necessary? Nov / Dec 2012
Write-ahead logging (WAL) is a family of techniques for providing atomicity and durability (two of the ACID properties) in database systems. In a system using WAL, all modifications are written to a log before they are applied. Usually both redo and undo information is stored in the log, which guarantees that no data modifications are written to disk before the associated log record is written to disk. This maintains the ACID properties for a transaction.
11. Brief about cascading rollback. Nov / Dec 2013
A cascading rollback occurs in database systems when a transaction (T1) causes a failure and a rollback must be performed. Other transactions dependent on T1's actions must also be rolled back due to T1's failure, thus causing a cascading effect: one transaction's failure causes many to fail.
12. What are two pitfalls (problems) of lock-based protocols? April / May 2011
Deadlock: Two or more transactions are waiting for one another to give up locks.
Starvation: A transaction may be waiting for an X-lock on an item while a sequence of other transactions request and are granted an S-lock on the same item.
PART-B
1. TRANSACTION CONCEPT & STATES
Explain the ACID properties of transaction. May / June 2014
Write short notes on transaction concept. Nov / Dec 2014
i. Transaction Concept:
A transaction is a logical unit of database processing that includes one or more database access operations; these can include insertion, deletion, modification, or retrieval operations. The database operations that form a transaction can either be embedded within an application program or specified interactively via a high-level query language such as SQL. One way of specifying the transaction boundaries is by specifying explicit begin transaction and end transaction statements in an application program; a single application program may contain more than one transaction if it contains several transaction boundaries. If the database operations in a transaction do not update the database but only retrieve data, the transaction is called a read-only transaction.
The basic database access operations that a transaction can include are as follows:
read_item(X): Reads a database item named X into a program variable.
write_item(X): Writes the value of program variable X into the database item named X.
Executing a read_item(X) command includes the following steps:
1. Find the address of the disk block that contains item X.
2. Copy that disk block into a buffer in main memory (if that disk block is not already in some main memory buffer).
3. Copy item X from the buffer to the program variable named X.
Executing a write_item(X) command includes the following steps:
1. Find the address of the disk block that contains item X.
2. Copy that disk block into a buffer in main memory (if that disk block is not already in some main memory buffer).
3. Copy item X from the program variable named X into its correct location in the buffer.
4. Store the updated block from the buffer back to disk (either immediately or at some later point in time).
A transaction includes read_item and write_item operations to access and update the database.
Transaction States:
Operations involved:
BEGIN_TRANSACTION: This marks the beginning of transaction execution.
READ or WRITE: These specify read or write operations on the database items that are executed as part of a transaction.
END_TRANSACTION: This specifies that READ and WRITE transaction operations have ended and marks the end of transaction execution. However, at this point it may be necessary to check whether the changes introduced by the transaction can be permanently applied to the database (committed) or whether the transaction has to be aborted.
COMMIT_TRANSACTION: This signals a successful end of the transaction, so that any changes (updates) executed by the transaction can be safely committed to the database and will not be undone.
ROLLBACK (or ABORT): This signals that the transaction has ended unsuccessfully, so that any changes or effects that the transaction may have applied to the database must be undone.
(Figure: state transition diagram describing how a transaction moves through its execution states.)
Active: A transaction goes into an active state immediately after it starts execution, where it can issue READ and WRITE operations.
Partially Committed: When the transaction ends, it moves to the partially committed state.
Committed: At this point, some recovery protocols need to ensure that a system failure will not result in an inability to record the changes of the transaction permanently. Once this check is successful, the transaction is said to have reached its commit point and enters the committed state. Once a transaction is committed, it has concluded its execution successfully and all its changes must be recorded permanently in the database.
Failed: A transaction goes to the failed state if one of the checks fails or if the transaction is aborted during its active state. The transaction may then have to be rolled back to undo the effect of its WRITE operations on the database.
Terminated: The terminated state corresponds to the transaction leaving the system.
Transaction Example: Let T be a transaction that transfers Rs. 500 from account A to account B:
T: read(A);
A = A - 500;
write(A);
read(B);
B = B + 500;
write(B)
Transaction Properties:
Transactions should possess several properties, often called the ACID properties; they should be enforced by the concurrency control and recovery methods of the DBMS. The ACID properties are:
Atomicity: A transaction is an atomic unit of processing; it is either performed in its entirety or not performed at all.
Consistency preservation: A transaction is consistency preserving if its complete execution takes the database from one consistent state to another.
Isolation: A transaction should appear as though it is being executed in isolation from other transactions; that is, the execution of a transaction should not be interfered with by any other transactions executing concurrently.
Durability or permanency: The changes applied to the database by a committed transaction must persist in the database; these changes must not be lost because of any failure.
Atomicity: Suppose A = 1500 and B = 500 before the execution of transaction T, and that during the execution of T a failure occurs that prevents T from executing successfully. If the error occurs after write(A) but before write(B), the values of the accounts will be A = 1000 and B = 500, and the sum A + B is no longer preserved: because of the failure, the state of the system no longer reflects a real state of the world, i.e., the database is in an inconsistent state. If the atomicity property is enforced, either all actions of the transaction are reflected in the database or none are. The database system keeps track of the old values of any data on which a transaction performs a write, and if the transaction does not complete its execution, the database system restores the old values to make it appear as though the transaction never executed. This is handled by the transaction recovery subsystem.
Consistency: The preservation of consistency is generally considered to be the responsibility of the programmers who write the database programs or of the DBMS module that enforces integrity constraints. A database state is the collection of all the stored data items (values) in the database at a given point in time. A consistent state of the database satisfies the constraints specified in the schema as well as any other constraints that should hold on the database. A database program should be written in a way that guarantees that, if the database is in a consistent state before executing the transaction, it will be in a consistent state after the complete execution of the transaction, assuming that no interference with other transactions occurs.
For example, in the previous example, consistency requires that the sum of A and B be unchanged by the execution of the transaction.
Isolation: Isolation is enforced by the concurrency control subsystem of the DBMS. If every transaction does not make its updates visible to other transactions until it is committed, one form of isolation is enforced that solves the temporary update problem and eliminates cascading rollbacks. There have been attempts to define the level of isolation of a transaction. For example, if a second concurrently running transaction reads A and B at an intermediate state of the transfer of funds from A to B and computes A + B, it will observe an inconsistent state. In addition, if it performs updates on A and B based on those inconsistent values, the database may be left in an inconsistent state.
Durability: Once the execution of the transaction completes successfully and the user who initiated the transaction has been notified that the transfer of funds has taken place, no system failure may result in the loss of data corresponding to this transfer. This property guarantees that once a transaction completes, all the updates that it carried out on the database persist, even if there are failures after execution.
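The commit/rollback behaviour behind atomicity and durability can be seen from application code; a minimal sketch using Python's sqlite3 module (the table and account names are illustrative):

import sqlite3

conn = sqlite3.connect("bank.db")
conn.execute("CREATE TABLE IF NOT EXISTS account (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT OR IGNORE INTO account VALUES ('A', 1500), ('B', 500)")
conn.commit()

try:
    # Both updates form one transaction: either both persist or neither does.
    conn.execute("UPDATE account SET balance = balance - 500 WHERE name = 'A'")
    conn.execute("UPDATE account SET balance = balance + 500 WHERE name = 'B'")
    conn.commit()      # commit point: the changes become durable
except Exception:
    conn.rollback()    # undo any partial effects, preserving the sum A + B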
2. SERIALIZABILITY
Discuss View Serializability and Conflict Serializability. Nov/Dec 2015
If T1 and T2 are submitted at the same time, and if interleaving of operations is not permitted, then there are two possible outcomes:
1. Execute all the operations of T1 (in sequence) followed by all the operations of T2 (in sequence).
2. Execute all the operations of T2 (in sequence) followed by all the operations of T1 (in sequence).
Serial execution is used as the measure of correctness, while concurrent execution is used to improve resource utilization.
Serializability: If interleaving is allowed, there will be many orders in which the system can execute the operations. Serializability is the process of managing the execution of a set of transactions in such a way that their concurrent execution produces the same end result as if they were run serially.
Serializable Schedule: A schedule S of n transactions is serializable if it is equivalent to some serial schedule of the same n transactions.
Two schedules are result equivalent if they produce the same final state of the database. Result equivalence can be coincidental: the two schedules below produce the same final value of x only when x = 100 initially (100 + 10 = 100 * 1.1 = 110).
S1: read(x); x := x + 10; write(x)
S2: read(x); x := x * 1.1; write(x)
Types of serializability:
Conflict Serializability: Let us consider a schedule S in which there are two consecutive instructions Ii and Ij of transactions Ti and Tj respectively (i ≠ j). If Ii and Ij refer to different data items, then we can swap Ii and Ij without affecting the results of any instruction in the schedule. However, if Ii and Ij refer to the same data item Q, then the order of the two steps may matter. There are four cases to consider:
1. Ii = read(Q), Ij = read(Q). The order of Ii and Ij does not matter, since the same value of Q is read by Ti and Tj regardless of the order.
2. Ii = read(Q), Ij = write(Q). If Ii comes before Ij, then Ti does not read the value of Q that is written by Tj in instruction Ij. If Ij comes before Ii, then Ti reads the value of Q that is written by Tj. Thus, the order of Ii and Ij matters.
3. Ii = write(Q), Ij = read(Q). The order of Ii and Ij matters, for the same reason as in the previous case.
4. Ii = write(Q), Ij = write(Q). Since both instructions are write operations, the order of these instructions does not affect either Ti or Tj. However, the value obtained by the next read(Q) instruction of S is affected, since only the result of the latter of the two write instructions is preserved in the database.
Thus, only in the case where both Ii and Ij are read instructions does the order of execution not matter. We say that Ii and Ij conflict if they are operations by different transactions on the same data item and at least one of these instructions is a write operation. Consider the following schedule 3:
T1              T2
read(A)
write(A)
                read(A)
                write(A)
read(B)
write(B)
                read(B)
                write(B)
The write(A) of T1 conflicts with the read(A) of T2. However, the write(A) of T2 does not conflict with the read(B) of T1, hence we can swap these instructions to generate a new schedule 5, as shown below:
T1              T2
read(A)
write(A)
                read(A)
read(B)
                write(A)
write(B)
                read(B)
                write(B)
Swapping can be continued for non-conflicting instructions:
Swap the read(B) instruction of T1 with the read(A) instruction of T2.
Swap the write(B) instruction of T1 with the write(A) instruction of T2.
Swap the write(B) instruction of T1 with the read(A) instruction of T2.
The final result of these swaps is the serial schedule shown below.
T1              T2
read(A)
write(A)
read(B)
write(B)
                read(A)
                write(A)
                read(B)
                write(B)
Schedule 6: a serial schedule that is equivalent to schedule 3.
If a schedule S can be transformed into a schedule S′ by a series of swaps of non-conflicting instructions, we say that S and S′ are conflict equivalent. We say that a schedule S is conflict serializable if it is conflict equivalent to a serial schedule. Thus, schedule 3 is conflict serializable, since it is conflict equivalent to the serial schedule 1. Consider schedule 7, which consists of the two transactions T3 and T4. This schedule is not conflict serializable, since it is equivalent to neither the serial schedule <T3,T4> nor the serial schedule <T4,T3>.
T3              T4
read(Q)
                write(Q)
write(Q)
Schedule 7
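The four cases above reduce to a single test; a minimal Python sketch (the operation representation is illustrative):

# An operation is (transaction_id, action, data_item), e.g. ("T1", "write", "Q").
def conflicts(op_i, op_j):
    ti, act_i, item_i = op_i
    tj, act_j, item_j = op_j
    # Conflict: different transactions, same data item, at least one write.
    return ti != tj and item_i == item_j and "write" in (act_i, act_j)

print(conflicts(("T1", "write", "A"), ("T2", "read", "A")))   # True
print(conflicts(("T2", "write", "A"), ("T1", "read", "B")))   # False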
View Serializability: Consider two schedules S and S′ in which the same set of transactions participates. The schedules S and S′ are said to be view equivalent if three conditions are met:
1. For each data item Q, if transaction Ti reads the initial value of Q in schedule S, then transaction Ti must, in schedule S′, also read the initial value of Q.
2. For each data item Q, if transaction Ti executes read(Q) in schedule S and that value was produced by a write(Q) operation executed by transaction Tj, then the read(Q) operation of transaction Ti must, in schedule S′, also read the value of Q that was produced by the same write(Q) operation of transaction Tj.
3. For each data item Q, the transaction (if any) that performs the final write(Q) operation in schedule S must perform the final write(Q) operation in schedule S′.
A schedule is view serializable if it is view equivalent to a serial schedule. Consider schedule 8 below, which is view equivalent to the serial schedule <T3,T4,T5>, since its one read(Q) instruction reads the initial value of Q and T5 performs the final write of Q. Every conflict serializable schedule is also view serializable, but there are view serializable schedules that are not conflict serializable.
T3              T4              T5
read(Q)
                write(Q)
write(Q)
                                write(Q)
Schedule 8
In schedule 8, the transactions T4 and T5 perform write(Q) operations without having performed a read(Q) operation. Writes of this sort are called blind writes. Every view serializable schedule that is not conflict serializable contains blind writes.
Testing of serializability: Testing of serializability is done by using a directed graph, called a precedence graph, constructed from the schedule. This graph consists of a pair G = (V,E), where V is a set of vertices and E is a set of edges. The set of vertices consists of all the transactions in the schedule. The set of edges consists of all edges Ti → Tj for which one of three conditions holds:
1. Ti executes write(Q) before Tj executes read(Q).
2. Ti executes read(Q) before Tj executes write(Q).
3. Ti executes write(Q) before Tj executes write(Q).
The precedence graph for schedule 1 contains the single edge T1 → T2, since all the instructions of T1 are executed before the first instruction of T2 is executed. The precedence graph for schedule 2 contains the single edge T2 → T1. Consider the following schedule 9:
T1                      T2
read(A)
A := A - 50
                        read(A)
                        temp := A * 0.1
                        A := A - temp
                        write(A)
                        read(B)
write(A)
read(B)
B := B + 50
write(B)
                        B := B + temp
                        write(B)
Schedule 9
The precedence graph for schedule 9 contains both the edge T1 → T2 and the edge T2 → T1, forming a cycle.
Test for conflict serializability: To test for conflict serializability, construct the precedence graph for the given schedule. If the graph contains a cycle, the schedule is not conflict serializable; if the graph contains no cycle, the schedule is conflict serializable. Schedules 1 and 2 are conflict serializable, as their precedence graphs do not contain any cycle, while schedule 9 is not conflict serializable, as its precedence graph contains a cycle.
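A minimal Python sketch of this test, which also implements the topological sorting described next (it returns an equivalent serial order when one exists; the operation format is illustrative):

from collections import defaultdict

def serial_order(schedule):
    # schedule: list of (txn, action, item) tuples in execution order.
    edges, txns = defaultdict(set), set()
    for i, (ti, ai, qi) in enumerate(schedule):
        txns.add(ti)
        for tj, aj, qj in schedule[i + 1:]:
            # Edge ti -> tj for each pair of conflicting operations.
            if ti != tj and qi == qj and "write" in (ai, aj):
                edges[ti].add(tj)
    indeg = {t: 0 for t in txns}
    for t in edges:
        for u in edges[t]:
            indeg[u] += 1
    order, ready = [], [t for t in txns if indeg[t] == 0]
    while ready:
        t = ready.pop()          # a transaction with no incoming arcs
        order.append(t)
        for u in edges[t]:
            indeg[u] -= 1
            if indeg[u] == 0:
                ready.append(u)
    # If a cycle exists, some transactions never reach in-degree 0.
    return order if len(order) == len(txns) else None   # None: not conflict serializable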
Topological sorting: If the graph is acyclic, a serial schedule can be found using the topological sorting procedure below (implemented in the sketch above):
1. Initialize the serial schedule as empty.
2. Find a transaction Ti such that there are no arcs entering Ti; Ti is the next transaction in the serial schedule.
3. Remove Ti and all edges leaving Ti. If the remaining set is non-empty, return to step 2; otherwise the serial schedule is complete.
3. COMMIT PROTOCOLS
Explain about the two phase commit and three phase commit protocols. April/May 2015
Commit Protocol: To ensure atomicity, all the sites at which a transaction T executed must agree on the final outcome of the execution: T must commit at all sites, or it must abort at all sites. To ensure this property, the transaction coordinator of T must execute a commit protocol.
Two Phase Commit Protocol (2PC): Consider a transaction T initiated at site Si, where the transaction coordinator is Ci. When T completes its execution (that is, when all the sites at which T has executed inform Ci that T has completed), Ci starts the 2PC protocol.
Phase 1. Ci adds the record <prepare T> to the log, and forces the log onto stable storage. It then sends a prepare T message to all sites at which T executed.
On receiving such a message, the transaction manager at that site determines whether it is willing to commit its portion of T.
If the answer is no, it adds a record <no T> to the log, and then responds by sending an abort T message to Ci.
If the answer is yes, it adds a record <ready T> to the log, and forces the log (with all the log records corresponding to T) onto stable storage. The transaction manager then replies with a ready T message to Ci.
Phase 2. When Ci receives responses to the prepare T message from all the sites, or when a prespecified interval of time has elapsed since the prepare T message was sent out, Ci can determine whether the transaction T can be committed or aborted. Transaction T can be committed if Ci received a ready T message from all the participating sites; otherwise, transaction T must be aborted. Depending on the verdict, either a <commit T> record or an <abort T> record is added to the log, and the log is forced onto stable storage. At this point, the fate of the transaction has been sealed. Following this point, the coordinator sends either a commit T or an abort T message to all participating sites. When a site receives that message, it records the message in its log.
A site at which T executed can unconditionally abort T at any time before it sends the message ready T to the coordinator. Once the message is sent, the transaction is said to be in the ready state at the site. The ready T message is, in effect, a promise by the site to follow the coordinator's order to commit T or to abort T. To make such a promise, the needed information must first be stored in stable storage; otherwise, if the site crashes after sending ready T, it may be unable to make good on its promise. Further, locks acquired by the transaction must continue to be held until the transaction completes.
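A minimal sketch of the coordinator's side of the two phases described above (the message transport, logging, and timeout handling are abstracted behind illustrative helper names, not a real library API):

def two_phase_commit(coordinator_log, sites, send, collect_votes):
    # Phase 1: force <prepare T> to stable storage, then ask every site to vote.
    coordinator_log.force_write("<prepare T>")
    for site in sites:
        send(site, "prepare T")
    votes = collect_votes(sites, timeout=5.0)   # ready/abort votes; a timeout counts as abort

    # Phase 2: commit only if every participating site voted ready.
    decision = "commit" if all(v == "ready" for v in votes) else "abort"
    coordinator_log.force_write(f"<{decision} T>")   # the fate of T is now sealed
    for site in sites:
        send(site, f"{decision} T")
    return decision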
In some implementations of the 2PC protocol, a site sends an acknowledge T message to the coordinator at the end of the second phase of the protocol. When the coordinator receives the acknowledge T message from all the sites, it adds the record <complete T> to the log.
Handling of Failures:
1. Failure of a participating site: If the coordinator Ci detects that a site has failed, it takes these actions:
If the site fails before responding with a ready T message to Ci, the coordinator assumes that it responded with an abort T message.
If the site fails after the coordinator has received the ready T message from the site, the coordinator executes the rest of the commit protocol in the normal fashion, ignoring the failure of the site.
When a participating site Sk recovers from a failure, it must examine its log to determine the fate of those transactions that were in the midst of execution when the failure occurred. Let T be one such transaction. We consider each of the possible cases:
The log contains a <commit T> record. In this case, the site executes redo(T).
The log contains an <abort T> record. In this case, the site executes undo(T).
The log contains a <ready T> record. In this case, the site must consult Ci to determine the fate of T. If Ci is up, it notifies Sk whether T committed or aborted; in the former case, Sk executes redo(T), and in the latter case, it executes undo(T). If Ci is down, Sk must try to find the fate of T from other sites. It does so by sending a query-status T message to all the sites in the system. On receiving such a message, a site must consult its
log to determine whether T has executed there and, if so, whether T committed or aborted. It then notifies Sk about this outcome. If no site has the appropriate information (that is, whether T committed or aborted), then Sk can neither abort nor commit T.
The log contains no control records (abort, commit, ready) concerning T. In this case, we know that Sk failed before responding to the prepare T message from Ci. Since the failure of Sk precludes the sending of such a response, by our algorithm Ci must abort T. Hence, Sk must execute undo(T).
2. Failure of the coordinator: If the coordinator fails in the midst of the execution of the commit protocol for transaction T, then the participating sites must decide the fate of T.
If an active site contains a <commit T> record in its log, then T must be committed.
If an active site contains an <abort T> record in its log, then T must be aborted.
If some active site does not contain a <ready T> record in its log, then the failed coordinator Ci cannot have decided to commit T, because a site that does not have a <ready T> record in its log cannot have sent a ready T message to Ci. However, the coordinator may have decided to abort T. Rather than wait for Ci to recover, it is preferable to abort T.
If none of the preceding cases holds, then all active sites must have a <ready T> record in their logs, but no additional control records (such as <abort T> or <commit T>). In this case, the active sites must wait for Ci to recover before the fate of T can be decided.
3. Network partition: When a network partitions, two possibilities exist:
1. The coordinator and all its participants remain in one partition. In this case, the failure has no effect on the commit protocol.
2. The coordinator and its participants belong to several partitions. From the viewpoint of the sites in one of the partitions, it appears that the sites in the other partitions have failed. Sites that are not in the partition containing the coordinator simply execute the protocol to deal with failure of the coordinator. The coordinator and the sites that are in the same partition as the coordinator follow the usual commit protocol, assuming that the sites in the other partitions have failed.
Disadvantage: The major disadvantage of the 2PC protocol is that coordinator failure may result in blocking, where a decision either to commit or to abort T may have to be postponed until Ci recovers.
Three-Phase Commit Protocol (3PC): The three-phase commit (3PC) protocol is an extension of the two-phase commit protocol that avoids the blocking problem under certain assumptions. In particular, it is assumed that no network partition occurs and that not more than k sites fail, where k is some predetermined number.
Third Phase: The three-phase protocol introduces a third phase called pre-commit. Its aim is to remove the uncertainty period for participants that have voted to commit and are waiting for the global abort or commit message from the coordinator. On receiving a pre-commit message, participants know that all the others have voted to commit. If a pre-commit message has not been received, the participant will abort and release any blocked resources.
Under these assumptions, the protocol avoids blocking by introducing an extra third phase in which multiple sites are involved in the decision to commit. Instead of directly noting the commit decision in its persistent storage, the coordinator first ensures that at least k other sites know that it intends to commit the transaction.
Failure of the coordinator: If the coordinator fails, the remaining sites first select a new coordinator. This new coordinator checks the status of the protocol with the remaining sites; if the old coordinator had decided to commit, at least one of the k sites that it informed will be up and will ensure that the commit decision is respected. The new coordinator restarts the third phase of the protocol if some site knew that the old coordinator intended to commit the transaction; otherwise the new coordinator aborts the transaction.
Advantages and Disadvantages: While the 3PC protocol has the desirable property of not blocking unless k sites fail, it has the drawback that a partitioning of the network will appear to be the same as more than k sites failing, which would lead to blocking. The protocol also has to be carefully implemented to ensure that network partitioning (or more than k sites failing) does not result in inconsistencies, where a transaction is committed in one partition and aborted in another. Because of its overhead, the 3PC protocol is not widely used.
4. CONCURRENCY CONTROL
Describe the two phase locking protocol with examples. May/June 2014
Explain about locking protocols. May/June 2016
What is concurrency control? How is it implemented in DBMS?
Illustrate with a suitable example. Nov/Dec 2015
What is concurrency? Explain it in terms of locking mechanism and two phase commit protocol. Nov/Dec 2014
The system must control the interaction among concurrent transactions; this control is achieved through one of a variety of mechanisms called concurrency-control schemes. The concurrency-control schemes are all based on the serializability property. Different types of protocols/schemes are used to control the concurrent execution of transactions.
Lock-Based Protocols: One way to ensure serializability is to require that data items be accessed in a mutually exclusive manner; that is, while one transaction is accessing a data item, no other transaction can modify that data item. The most common method used to implement this requirement is to allow a transaction to access a data item only if it is currently holding a lock on that item.
Locks: There are two modes in which a data item may be locked:
i) Shared mode: If a transaction Ti has obtained a shared-mode lock on item Q, then Ti can read, but cannot write, Q. It is denoted by S.
ii) Exclusive mode: If a transaction Ti has obtained an exclusive-mode lock on item Q, then Ti can both read and write Q. It is denoted by X.
A transaction requests a shared lock on data item Q by executing the lock-S(Q) instruction. Similarly, a transaction requests an exclusive lock through the lock-X(Q) instruction. A transaction can unlock a data item Q by the unlock(Q) instruction.
Given a set of lock modes, we can define a compatibility function on them as follows. Let A and B represent arbitrary lock modes. Such a function can be represented conveniently by a matrix:
           S       X
    S      true    false
    X      false   false
An element comp(A, B) of the matrix has the value true if and only if mode A is compatible with mode B. Shared mode is compatible with shared mode, but not with exclusive mode. At any time, several shared-mode locks can be held simultaneously (by different transactions) on a particular data item. A subsequent exclusive-mode lock request has to wait until the currently held shared-mode locks are released.
To access a data item, transaction Ti must first lock that item. If the data item is already locked by another transaction in an incompatible mode, the concurrency-control manager will not grant the lock, and Ti is made to wait until all incompatible locks held by other transactions have been released.
T1: lock-X(B);
    read(B);
    B := B - 50;
    write(B);
    unlock(B);
    lock-X(A);
    read(A);
    A := A + 50;
    write(A);
    unlock(A).
Transaction T1.
T2: lock-S(A);
    read(A);
    unlock(A);
    lock-S(B);
    read(B);
    unlock(B);
    display(A + B).
Transaction T2.
Transaction Ti may unlock a data item that it had locked at some earlier point, but it must hold a lock on a data item as long as it accesses that item. Consider the simplified banking system: let A and B be two accounts that are accessed by transactions T1 and T2. Transaction T1 transfers $50 from account B to account A, and transaction T2 displays the total amount of money in accounts A and B, that is, the sum A + B.
Schedule 1: Suppose that the values of accounts A and B are $100 and $200, respectively. If these two transactions are executed serially, either in the order T1, T2 or the order T2, T1, then transaction T2 will display the value $300. If, however, these transactions are executed concurrently as in schedule 1, transaction T2 displays $250, which is incorrect. The reason for this mistake is that transaction T1 unlocked data item B too early, as a result of which T2 saw an inconsistent state. (The lock must be granted in the interval of time between the lock-request operation and the following action of the transaction.)
Suppose now that unlocking is delayed to the end of the transaction. Transaction T3 corresponds to T1 with unlocking delayed, and transaction T4 corresponds to T2 with unlocking delayed. You should verify that the sequence of reads and writes in schedule 1, which led to an incorrect total of $250 being displayed, is no longer possible with T3 and T4.
T4: lock-S(A);
    read(A);
    lock-S(B);
    read(B);
    display(A + B);
    unlock(A);
    unlock(B).
Transaction T4.
Other schedules are possible; T4 will not print out an inconsistent result in any of them. Unfortunately, locking can lead to an undesirable situation. Consider the partial schedule (schedule 2) for T3 and T4. Since T3 is holding an exclusive-mode lock on B and T4 is requesting a shared-mode lock on B, T4 is waiting for T3 to unlock B. Similarly, since T4 is holding a shared-mode lock on A and T3 is requesting an exclusive-mode lock on A, T3 is waiting for T4 to unlock A. Thus, we have arrived at a state where neither of these transactions can ever proceed with its normal execution. This situation is called deadlock.
We require that each transaction in the system follow a set of rules, called a locking protocol, indicating when a transaction may lock and unlock each of the data items. Locking protocols restrict the number of possible schedules; the set of all such schedules is a proper subset of all possible serializable schedules.
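A minimal sketch of a lock manager that enforces the S/X compatibility matrix above and grants waiting requests in first-come order, anticipating the granting rule described next (the class and method names are illustrative):

from collections import deque

class LockManager:
    def __init__(self):
        self.holders = {}   # item -> list of (txn, mode) currently holding the lock
        self.waiting = {}   # item -> deque of (txn, mode) queued requests

    def compatible(self, item, mode):
        held = self.holders.get(item, [])
        # S is compatible only with S; X is compatible with nothing.
        return all(mode == "S" and m == "S" for _, m in held)

    def lock(self, txn, item, mode):
        q = self.waiting.setdefault(item, deque())
        # Grant only if compatible AND no earlier request is still waiting.
        if not q and self.compatible(item, mode):
            self.holders.setdefault(item, []).append((txn, mode))
            return True
        q.append((txn, mode))
        return False            # caller must block until granted

    def unlock(self, txn, item):
        self.holders[item] = [(t, m) for t, m in self.holders[item] if t != txn]
        q = self.waiting.get(item, deque())
        while q and self.compatible(item, q[0][1]):
            t, m = q.popleft()  # wake waiters strictly in request order
            self.holders.setdefault(item, []).append((t, m))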
Granting of Locks: Starvation of transactions can be avoided by granting locks in the following manner. When a transaction Ti requests a lock on a data item Q in a particular mode M, the concurrency-control manager grants the lock provided that:
1. There is no other transaction holding a lock on Q in a mode that conflicts with M.
2. There is no other transaction that is waiting for a lock on Q and that made its lock request before Ti.
Thus, a lock request will never get blocked by a lock request that is made later (this is the first-come granting rule implemented in the sketch above).
The Two-Phase Locking Protocol: One protocol that ensures serializability is the two-phase locking protocol. This protocol requires that each transaction issue lock and unlock requests in two phases:
1. Growing phase: A transaction may obtain locks, but may not release any lock.
2. Shrinking phase: A transaction may release locks, but may not obtain any new locks.
Initially, a transaction is in the growing phase and acquires locks as needed. Once the transaction releases a lock, it enters the shrinking phase and can issue no more lock requests.
Example: Transactions T3 and T4 are two phase. On the other hand, transactions T1 and T2 are not two phase. Note that the unlock instructions do not need to appear at the end of the transaction. For example, in the case of transaction T3, we could
move the unlock(B) instruction to just after the lock-X(A) instruction and still retain the two-phase locking property.
Lock point: The two-phase locking protocol ensures conflict serializability. The point in the schedule where the transaction obtains its final lock (the end of its growing phase) is called the lock point of the transaction. Transactions can be ordered according to their lock points; this ordering is, in fact, a serializability ordering for the transactions.
Disadvantages:
1. Two-phase locking does not ensure freedom from deadlock. Observe that transactions T3 and T4 are two phase, but in schedule 2 they are deadlocked.
2. Cascading rollback may occur under two-phase locking. As an illustration, consider a partial schedule in which each transaction observes the two-phase locking protocol, but the failure of T5 after the read(A) step of T7 leads to cascading rollback of T6 and T7.
Partial schedule under two-phase locking.
Strict two-phase locking protocol: Cascading rollbacks can be avoided by a modification of two-phase locking called the strict two-phase locking protocol. This protocol requires not only that locking be two phase, but also that all exclusive-mode locks taken by a transaction be held until that transaction commits. This requirement ensures that any data written by an uncommitted transaction are locked in exclusive mode until the transaction commits, preventing any other transaction from reading the data.
Rigorous two-phase locking protocol: Another variant of two-phase locking is the rigorous two-phase locking protocol, which requires that all locks be held until the transaction commits. With rigorous two-phase locking, transactions can be serialized in the order in which they commit. Most database systems implement either strict or rigorous two-phase locking.
Lock Conversions: Consider the following two transactions, for which we have shown only some of the significant read and write operations:
T8: read(a1);
    read(a2);
    . . .
    read(an);
    write(a1).
T9: read(a1);
    read(a2);
    display(a1 + a2).
If we employ the two-phase locking protocol, then T8 must lock a1 in exclusive mode. Therefore, any concurrent execution of both transactions amounts to a serial execution. Notice, however, that T8 needs an exclusive lock on a1 only at the end of its execution, when it writes a1. Thus, if T8 could initially lock a1 in shared mode, and then later change the lock to exclusive mode, we could get more concurrency, since T8 and T9 could access a1 and a2 simultaneously.
This observation leads us to a refinement of the basic two-phase locking protocol in which lock conversions are allowed. We denote conversion from shared to exclusive mode by upgrade, and from exclusive to shared mode by downgrade. Lock conversion cannot be allowed arbitrarily: upgrading can take place only in the growing phase, whereas downgrading can take place only in the shrinking phase. Strict two-phase locking and rigorous two-phase locking (with lock conversions) are used extensively in commercial database systems.
A simple but widely used scheme automatically generates the appropriate lock and unlock instructions for a transaction on the basis of its read and write requests:
1. When a transaction Ti issues a read(Q) operation, the system issues a lock-S(Q) instruction followed by the read(Q) instruction.
2. When Ti issues a write(Q) operation, the system checks to see whether Ti already holds a shared lock on Q. If it does, the system issues an upgrade(Q) instruction, followed by the write(Q) instruction. Otherwise, the system issues a lock-X(Q) instruction, followed by the write(Q) instruction.
All locks obtained by a transaction are unlocked after that transaction commits or aborts.
Part - C
5. DEADLOCKS
Write short notes on deadlock. Nov/Dec 2014
Deadlock occurs when each transaction T in a set of two or more transactions is waiting for some item that is locked by some other transaction T' in the set. Hence, each transaction in the set is on a waiting queue, waiting for one of the other transactions in the set to release the lock on an item. A simple example: two transactions T'1 and T'2 are deadlocked in a partial schedule when T'1 is on the waiting queue for X, which is locked by T'2, while T'2 is on the waiting queue for Y, which is locked by T'1. Meanwhile, neither T'1 nor T'2 nor any other transaction can access items X and Y.
There are two principal methods for dealing with deadlock: deadlock prevention, and deadlock detection and recovery.
1. Deadlock prevention: There are two approaches to deadlock prevention.
The first approach ensures that no cyclic waits can occur by ordering the requests for locks, or by requiring all locks to be acquired together. This approach requires each transaction to lock all the data items it needs before it begins execution: either all items are locked in one step, or none are locked.
Disadvantages: It is hard to predict, before the transaction begins, what data items need to be locked. Data-item utilization may be very low, since many of the data items may be locked but unused for a long time.
The second approach to deadlock prevention is to use preemption and transaction rollbacks. With preemption, when a transaction T2 requests a lock that transaction T1 holds, the lock granted to T1 may be preempted by rolling back T1 and granting the lock to T2. To control preemption, a unique timestamp is assigned to each transaction, and the system uses these timestamps to decide whether a transaction should wait or roll back. Two deadlock prevention schemes using timestamps are:
1. Wait-die: The wait-die scheme is a non-preemptive technique. When transaction Ti requests a data item held by Tj, Ti is allowed to wait only if it has a timestamp smaller than that of Tj (i.e., Ti is older than Tj). Otherwise, Ti is rolled back (dies).
2. Wound-wait: The wound-wait scheme is a preemptive technique. When transaction Ti requests a data item held by Tj, Ti is allowed to wait only if it has a timestamp greater than that of Tj (i.e., Ti is younger than Tj). Otherwise, Tj is rolled back (wounded).
Timeout-based schemes: This approach is based on lock timeouts. A transaction that has requested a lock waits for at most a specified amount of time. If the lock has not been granted within that time, the transaction is said to time out; it rolls itself back and restarts. Thus, if there was a deadlock, one or more transactions will time out and roll back, allowing the others to proceed.
Advantages: This scheme is easy to implement, and it works well if transactions are short and if long waits are likely to be due to deadlocks.
Disadvantages: It is hard to decide how long a transaction should wait: too long a wait results in unnecessary delays, and too short a wait results in unnecessary rollbacks. Starvation is also possible with this scheme.
2. Deadlock detection and recovery:
Deadlock detection: Deadlock can be described precisely in terms of a directed graph called a wait-for graph. The graph consists of a pair G = (V,E), where V is a set of vertices and E is a set of edges. The set of vertices consists of all the transactions in the system. Each element of the set E of edges is an ordered pair Ti → Tj. If Ti → Tj is in E, then transaction Ti is waiting for transaction Tj to release a data item that it needs.
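A deadlock corresponds to a cycle in this graph, as stated next; a minimal Python detection sketch that returns one cycle (its members are the candidate victims) or None (the graph encoding is illustrative):

def find_deadlock(wait_for):
    # wait_for: dict mapping each transaction to the set of transactions
    # it waits on, e.g. {"T1": {"T2"}, "T2": {"T4"}, "T4": {"T3"}, "T3": {"T2"}}.
    visited, on_path = set(), []

    def dfs(t):
        on_path.append(t)
        for u in wait_for.get(t, ()):
            if u in on_path:                 # back edge: a cycle exists
                return on_path[on_path.index(u):]
            if u not in visited and (cycle := dfs(u)):
                return cycle
        on_path.pop()
        visited.add(t)
        return None

    for t in list(wait_for):
        if t not in visited and (cycle := dfs(t)):
            return cycle                     # the deadlocked transactions
    return None

print(find_deadlock({"T1": {"T2"}, "T2": {"T4"}, "T4": {"T3"}, "T3": {"T2"}}))
# e.g. ['T2', 'T4', 'T3'] (the rotation of the cycle may vary)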
A deadlock exists in the system if and only if the wait-for graph contains a cycle. For example, suppose T1 is waiting for T2 and T3, and T3 is waiting for T2: there is no cycle, and no deadlock. If a transaction T4 is added such that T2 waits for T4 and T4 waits for T3, the cycle T2 → T4 → T3 → T2 is formed, and T2, T4 and T3 are deadlocked.
Recovery from deadlock: When a deadlock detection algorithm determines that a deadlock exists, the system must recover from the deadlock. The most common solution is to roll back one or more transactions to break the deadlock. Three actions need to be taken:
1. Selection of a victim: We should roll back the transactions that will incur the minimum cost. Many factors may determine the cost of a rollback, including:
how long the transaction has computed;
how many data items the transaction has used;
how many more data items the transaction needs;
how many transactions will be involved in the rollback.
2. Rollback: The simplest solution is total rollback: abort the transaction and then restart it. However, it is more effective to roll back the transaction only as far as necessary to break the deadlock. Partial rollback requires the system to maintain additional information about the state of all the running transactions; specifically, the sequence of lock requests and grants has to be recorded, and it has to be decided which locks need to be released. The selected transaction must be rolled back to the point where it first obtained one of those locks, undoing all its actions after that point.
3. Starvation: It is possible that the same transaction is always picked as a victim. As a result, this transaction never completes its designated task, so there is starvation. It must be ensured that a transaction can be picked as a victim only a finite number of times. The most common solution is to include the number of rollbacks in the cost factor.
-o0o-o0o-o0o-
UNIT IV
TRENDS IN DATABASE TECHNOLOGY
Overview of Physical Storage Media – Magnetic Disks – RAID – Tertiary storage – File Organization – Organization of Records in Files – Indexing and Hashing – Ordered Indices – B+ tree Index Files – B tree Index Files – Static Hashing – Dynamic Hashing - Introduction to Distributed Databases - Client server technology - Multidimensional and Parallel databases - Spatial and multimedia databases - Mobile and web databases - Data Warehouse - Mining - Data marts.
Part - A
1. What is the use of RAID? What are the factors to be taken into account when choosing a RAID level?
A variety of disk-organization techniques, collectively called redundant arrays of independent disks (RAID), are used to improve performance and reliability. The factors to be taken into account when choosing a RAID level are:
monetary cost of the extra disk-storage requirements;
performance requirements in terms of the number of I/O operations;
performance when a disk has failed;
performance during rebuild.
2. What is a primary index or ordered index?
A primary index is an index whose search key also defines the sequential order of the file. If the index is created on the primary key of the table, it is called primary indexing. Since primary keys are unique and stand in a 1:1 relation to the records, fetching a record through a primary index is efficient.
3. What is a B-tree?
A B-tree eliminates the redundant storage of search-key values: it allows search-key values to appear only once.
4. What is a B+-tree index?
A B+-tree index takes the form of a balanced tree in which every path from the root of the tree to a leaf of the tree is of the same length.
5. Differentiate static and dynamic hashing. Nov/Dec 2014, April/May 2015, Nov/Dec 2015
STATIC HASHING                                      DYNAMIC HASHING
The number of buckets is fixed.                     The number of buckets is not fixed.
Performance degrades as the file grows.             Performance does not degrade as the file grows.
Space overhead is more.                             Space overhead is minimal.
No bucket address table is used.                    A bucket address table is used.
Open hashing and closed hashing are forms of it.    Extendible hashing and linear hashing are forms of it.
Implementation is simple.                           Implementation is complex.
It is a less attractive technique.                  It is a highly attractive technique.
The system directly accesses the bucket.            The bucket address table is accessed before accessing the bucket.
Overflow chaining is used.                          Overflow chaining is not used.
6. What is data mining? Nov/Dec 2014
Data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information - information that can be used to increase revenue, cut costs, or both. Data mining is the process of finding correlations or patterns among dozens of fields in large relational databases.
7. Define data warehouse. Nov/Dec 2014
A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision-making process. Subject-oriented: a data warehouse can be used to analyze a particular subject area.
8. What are the techniques to be evaluated for both ordered indexing and hashing?
Access types, access time, insertion time, deletion time, and space overhead.
9. Define rotational latency time.
The time spent waiting for the sector to be accessed to appear under the read-write head is called the rotational latency time.
10. Give an example of a join that is not a simple equi-join for which partitioned parallelism can be used. Nov/Dec 2015
11. Write about the four types (star, snowflake, galaxy and fact constellation) of data warehouse schemas. April/May 2015
The star schema consists of a fact table with a single table for each dimension. The snowflake schema is a variation on the star schema in which the dimensional tables from a star schema are organized into a hierarchy by normalizing them. Some installations normalize data warehouses up to the third normal form so that
they can access the data warehouse to the finest level of detail. A fact constellation is a set of fact tables that share some dimension tables (for example, a shared dimension table called product). Fact constellations limit the possible queries for the warehouse.
12. What is meant by garbage collection? May/June 2016
Garbage collection is the process of destroying objects that are no longer referenced and freeing the resources those objects used. In Java, a background process performs garbage collection. In a database context, garbage collection determines whether the database destroys objects that are no longer referenced by the database; this keeps external programs from having to track the use of object pointers.
13. Define software and hardware RAID systems. May/June 2016
A hardware-based system manages the RAID subsystem independently from the host and presents to the host only a single disk per RAID array. Software RAID implements the various RAID levels in the kernel disk (block device) code. Software RAID performance can rival hardware RAID, but the performance of a software-based array depends on the server CPU performance and load.
PART - B
1. RAID TECHNOLOGY
What is RAID? Briefly explain different levels of RAID. Discuss the factors to be considered in choosing a RAID level. May/June 2016, Nov/Dec 2014, Apr/May 2015, Nov/Dec 2015
Definition: Redundant Array of Inexpensive (Independent) Disks.
A major advance in secondary storage technology is represented by the development of RAID, which originally stood for Redundant Arrays of
Inexpensive Disks. Lately, the "I" in RAID has come to stand for Independent. The RAID idea received a very positive endorsement from industry and has been developed into an elaborate set of alternative RAID architectures (RAID levels 0 through 6).
Features of RAID (data striping, mirroring, block/bit-level striping):
The natural solution is a large array of small independent disks acting as a single higher-performance logical disk. A concept called data striping is used, which utilizes parallelism to improve disk performance. Data striping distributes data transparently over multiple disks to make them appear as a single large, fast disk (for example, a file striped over four disks). Striping improves overall I/O performance by allowing multiple I/Os to be serviced in parallel, thus providing high overall transfer rates. Data striping also accomplishes load balancing among disks. Moreover, by storing redundant information on disks using parity or some other error-correction code, reliability can be improved.
One technique for introducing redundancy is called mirroring or shadowing. Data is written redundantly to two identical physical disks that are treated as one logical disk. When data is read, it can be retrieved from the disk with the shorter queuing, seek, and rotational delays.
Disk striping may be applied at a finer granularity by breaking up a byte of data into bits and spreading the bits to different disks. Thus, bit-level data striping consists of splitting a byte of data and writing bit j to disk j. With 8-bit bytes, eight physical disks may be considered as one logical disk with an eightfold increase in the data transfer rate. Each disk participates in each I/O request, and the total amount of data read per request is eight times as much.
The granularity of data interleaving can be higher than a bit; for example, blocks of a file can be striped across disks, giving rise to block-level striping. With block-level striping, multiple independent requests that access single blocks (small requests) can be serviced in parallel by separate disks, thus decreasing the queuing
time of I/O requests. Requests that access multiple blocks (large requests) can be parallelized, thus reducing their response time.
Levels: RAID 0 to RAID 6.
Factors to consider in choosing a RAID level:
o What type of data will be stored on the RAID volume?
o What applications will be accessing or running on the RAID volume?
o Is performance, redundancy, or a combination of both important to you?
RAID 0: RAID 0 consists of striping, without mirroring or parity. The capacity of a RAID 0 volume is the sum of the capacities of the disks in the set, the same as with a spanned volume.
There is no added redundancy for handling disk failures, just as with a spanned volume. Thus, failure of one disk causes the loss of the entire RAID 0 volume, with reduced possibilities of data recovery when compared to a broken spanned volume. Striping distributes the contents of files roughly equally among all disks in the set, which makes concurrent read or write operations on the multiple disks almost inevitable and results in performance improvements. The concurrent operations make the throughput of most read and write operations equal to the throughput of one disk multiplied by the number of disks. Increased throughput is the big benefit of RAID 0 versus a spanned volume.
RAID 1: RAID 1 consists of data mirroring, without parity or striping. Data is written identically to two (or more) drives, thereby producing a "mirrored set" of drives. Thus, any read request can be serviced by any drive in the set. If a request is broadcast to every drive in the set, it can be serviced by the drive that accesses the data first (depending on its seek time and rotational latency), improving performance. Sustained read throughput, if the controller or software is optimized for it, approaches the sum of the throughputs of every drive in the set, just as for RAID 0, though the actual read throughput of most RAID 1 implementations is slower than that of the fastest drive. Write throughput is always slower, because every drive must be updated and the slowest drive limits the write performance. The array continues to operate as long as at least one drive is functioning.
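The block placement behind striping and mirroring is simple arithmetic; a minimal sketch of the RAID 0 mapping and the RAID 1 placement for a hypothetical array:

def raid0_place(block, num_disks):
    # Block-level striping: logical block i lives on disk i mod n,
    # at stripe (row) i div n within that disk.
    return block % num_disks, block // num_disks

def raid1_place(block, num_disks):
    # Mirroring: every logical block is written identically to every disk.
    return [(d, block) for d in range(num_disks)]

print(raid0_place(9, 4))    # (1, 2): disk 1, stripe 2
print(raid1_place(9, 2))    # [(0, 9), (1, 9)]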
RAID 2: RAID 2 consists of bit-level striping with dedicated Hamming-code parity. All disk spindle rotation is synchronized, and data is striped such that each sequential bit is on a different drive. Hamming-code parity is calculated across corresponding bits and stored on at least one parity drive. This level is of historical significance only; although it was used on some early machines (for example, the Thinking Machines CM-2), as of 2014 it is not used by any commercially available system.
RAID 3: RAID 3 consists of byte-level striping with dedicated parity. All disk spindle rotation is synchronized, and data is striped such that each sequential byte is on a different drive. Parity is calculated across corresponding bytes and stored on a dedicated parity drive. Although implementations exist, RAID 3 is not commonly used in practice.
RAID 4: RAID 4 consists of block-level striping with dedicated parity. This level was previously used by NetApp, but has now been largely replaced by a proprietary implementation of RAID 4 with two parity disks, called RAID-DP. The main advantage of RAID 4 over RAID 2 and 3 is I/O parallelism: in RAID 2 and 3, a single read/write I/O operation requires reading the whole group of data drives, while in RAID 4 one I/O read/write operation does not have to spread across all data drives. As a result, more I/O operations can be executed in parallel, improving the performance of small transfers.
RAID 5: RAID 5 consists of block-level striping with distributed parity. Unlike RAID 4, parity information is distributed among the drives, requiring all drives but one to be present to operate. Upon failure of a single drive, subsequent reads can be calculated from the distributed parity such that no data is lost. RAID 5 requires at least three disks. RAID 5 is seriously affected by the general trends regarding array rebuild time and the chance of drive failure during rebuild: rebuilding an array requires reading all data from all disks, opening a chance for a second drive failure and the loss of the entire array. In August 2012, Dell posted an advisory against the use of RAID 5 in any configuration on Dell EqualLogic arrays, and of RAID 50 with "Class 2 7200 RPM drives of 1 TB and higher capacity", for business-critical data.
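The parity used by RAID 4/5 is the bytewise XOR of the data blocks in a stripe, which is what makes single-drive reconstruction possible; a minimal sketch:

from functools import reduce

def parity(blocks):
    # Bytewise XOR across the data blocks of one stripe.
    return bytes(reduce(lambda x, y: x ^ y, col) for col in zip(*blocks))

stripe = [b"\x0f\xa0", b"\x33\x11", b"\x55\xff"]
p = parity(stripe)

# A lost block is recovered as the XOR of the surviving blocks and the parity.
recovered = parity([stripe[0], stripe[2], p])
assert recovered == stripe[1]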
RAID 6: RAID 6 consists of block-level striping with double distributed parity. Double parity provides fault tolerance for up to two failed drives. This makes larger RAID groups more practical, especially for high-availability systems, as large-capacity drives take longer to restore. RAID 6 requires a minimum of four disks. As with RAID 5, a single drive failure results in reduced performance of the entire array until the failed drive has been replaced. With a RAID 6 array, using drives from multiple sources and manufacturers, it is possible to mitigate most of the problems associated with RAID 5. The larger the drive capacities and the larger the array size, the more important it becomes to choose RAID 6 instead of RAID 5.
2. INDEXING AND HASHING
2A. Explain the various indexing and hashing schemes used in a database environment. Nov/Dec 2015
Single-level index
o Primary index
o Clustering index
o Secondary index
Multilevel index
Hashing
o Static hashing
o Dynamic hashing
Indexing (index structures for files):
An index is a set of one or more keys, each pointing to rows in a table. An index allows more efficient access to rows in a table by creating a direct path to the data through pointers.
To create an index, the command is:
CREATE INDEX <index name> ON <table name> (<column name>);
Indexes are access structures that are used to speed up the retrieval of records in response to certain search conditions. Index structures provide secondary access paths, which offer alternative ways of accessing the records without affecting the physical placement of records on disk. They enable efficient access to records based on the indexing fields that are used to construct the index. Any field can be used to create an index, and multiple indexes on different fields can be constructed on the same file. A variety of indexes are possible; each of them uses a particular data structure to speed up the search. To find a record or records in the file based on a certain selection criterion on an indexing field, one initially accesses the index, which points to one or more blocks in the file where the required records are located.
Different types of indexes: Apart from the primary file organization (unordered, ordered or hashed), there are additional access structures called indexes, which are used to speed up the retrieval of records in response to certain search conditions. The most prevalent types of indexes are based on:
ordered files (single-level indexes);
tree data structures (multilevel indexes).
Types of single-level ordered indexes:
For a file with a given record structure consisting of several fields, the index access structure is usually defined on a single field of the file, called an indexing field. The index stores each value of the indexing field along with a list of pointers to all disk blocks that contain records with that field value. The values in the index are ordered, so that a binary search can be performed on the index. Since the index file is much smaller than the data file, searching the index using a binary search is efficient.
Single-level ordered indexes may be primary, clustering, or secondary (non-dense) indexes. A primary index is specified on the ordering key field of an ordered file of records; an ordering key field is used to physically order the file records on disk, and every record has a unique value for that field. If the ordering field is not a key field, a clustering index is used instead. A secondary index can be specified on any non-ordering field of a file.
Primary Indexes: A primary index is an ordered file whose records are of fixed length, with two fields. The first field is of the same data type as the ordering key field, called the primary key, of the data file; the second field is a pointer to a disk block. There is one index entry in the index file for each block in the data file, so the total number of entries in the index is the same as the number of disk blocks in the ordered data file. The first record in each block of the data file is called the anchor record of the block, or block anchor.
Indexes can also be characterized as dense or sparse. A dense index has an index entry for every search-key value (i.e., for every record). A sparse index has index entries for only some of the search values. A primary index is therefore a non-dense (sparse) index. The major problem with a primary index is the insertion and deletion of records.
If we insert a new record in the data file at its correct position, we have to not only move records to make space for the new record but also change some index entries, since moving records will change the anchor records of some blocks. Record deletion is handled using deletion markers.
(Figure: Primary index — one index entry per data-file block, each entry holding the block-anchor key value (e.g., Aaron, Adams, ..., Wong on the Name field) and a block pointer.)
  • 153. the clustering field. It is an example for non dense index, because it has an entry for every distinct value of the indexing field rather than for every record in the file. There is a problem with insertion and deletion of records. For insertion, it is common to reserve a whole block for each value of clustering field. All records with that value are placed in that block. This makes insertion straightforward. Index File Data File Dno SSN Bdate Sal post Clustering Field Dno Block pointer 1 1  1 1 2  Block pointer .  .  Null 2 .  2 .  …. 10  Block pointer Clustering index with a Null 3 Separate block cluster for 3 Each group of records that share 3 same value for clustering field.  Block pointer 3 Clustering index Null Secondary Indexes: 146 Visit & Downloaded from : www.LearnEngineering.in Visit & Downloaded from : www.LearnEngineering.in
(Figure: Secondary index on a non-ordering field — one index entry per record, with the entries ordered on the indexing field while the data-file records are not.)
A secondary index is also an ordered file with two fields. The first field is of the same type as some non-ordering field of the data file, called the indexing field; the second field is either a block pointer or a record pointer. Because the records of the data file are not physically ordered by the values of the secondary key field, we cannot use block anchors. An index entry is therefore created for each record in the data file, rather than for each block. The pointer in the index is either to the block in which the record is stored or to the record itself. Hence the secondary index is dense.
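The lookup-cost difference between a sparse primary index and a dense secondary index can be seen in a few lines; a minimal sketch (in-memory sorted lists stand in for index blocks):

from bisect import bisect_right

def sparse_lookup(block_anchors, key):
    # Primary (sparse) index: binary-search the block anchors, then
    # only the one block whose anchor covers the key needs to be scanned.
    i = bisect_right([k for k, _ in block_anchors], key) - 1
    return block_anchors[i][1] if i >= 0 else None   # pointer to the block

def dense_lookup(entries, key):
    # Secondary (dense) index: one ordered entry per record.
    i = bisect_right([k for k, _ in entries], key) - 1
    if i >= 0 and entries[i][0] == key:
        return entries[i][1]                         # pointer to the record
    return None

anchors = [(1, "block0"), (5, "block1"), (9, "block2")]
print(sparse_lookup(anchors, 7))    # block1: key 7 lies in the range [5, 9)
dense = [(1, "r1"), (2, "r2"), (5, "r5"), (7, "r7")]
print(dense_lookup(dense, 7))       # r7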
A secondary index usually needs more storage space and a longer search time than a primary index because of its larger number of entries.

Multilevel Indexes:

Figure: A two-level primary index, with a second (top) level built over the entries of the first-level index.

A multilevel index treats the index file as the first level of a multilevel index: it is an ordered file with a distinct value for each key field entry. Hence we can
create a primary index for the first level. This index to the first level is called the second level of the multilevel index. The blocking factor bfri for the second level, and for all subsequent levels, is the same as for the first-level index, because all index entries are the same size. We require a second level only if the first level needs more than one block of disk storage, and we can repeat this process until all the entries of some index level t fit in a single block. The block at the t-th level is called the top index level.

2B. HASHING

Hashing for disk files is called external hashing. To suit the characteristics of disk storage, the target address space is made of buckets, each of which holds multiple records. A bucket is either one disk block or a cluster of contiguous blocks. The hashing function maps a key into a relative bucket number, rather than assigning an absolute block address to the bucket.

The collision problem is less severe with buckets, because as many records as will fit in a bucket can hash to the same bucket without causing problems. A variation of chaining can be used, in which a pointer is maintained in each bucket to a linked list of overflow records for the bucket. The pointers in the linked list should be record pointers, which include both a block address and a relative record position within the block.

Hashing provides the fastest possible access for retrieving an arbitrary record given the value of its hash field. Although most good hash functions do not maintain records in order of hash field values, some functions, called order preserving, do. A simple example of an order-preserving hash function is to take the leftmost three digits of an invoice number field as the hash address and keep the records sorted by invoice number within each bucket.
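As a rough illustration, here is a minimal Python sketch of static external hashing with overflow chaining. The bucket count, capacity, and hash function are illustrative assumptions; a real DBMS would work with disk blocks and record pointers rather than in-memory lists.

# Minimal sketch of static hashing with overflow chaining.
# M buckets are allocated once (static); each bucket holds up to
# CAPACITY records, and extra records go to an overflow chain.

M = 4          # fixed number of buckets (assumed)
CAPACITY = 2   # records per bucket (assumed)

buckets = [[] for _ in range(M)]
overflow = [[] for _ in range(M)]   # one overflow chain per bucket

def bucket_no(key):
    # The hash function maps a key to a relative bucket number.
    return key % M

def insert(key, record):
    b = bucket_no(key)
    if len(buckets[b]) < CAPACITY:
        buckets[b].append((key, record))
    else:
        overflow[b].append((key, record))   # bucket full: chain it

def lookup(key):
    b = bucket_no(key)
    for k, rec in buckets[b] + overflow[b]:
        if k == key:
            return rec
    return None

insert(12, "invoice-12"); insert(16, "invoice-16"); insert(20, "invoice-20")
print(lookup(20))   # found via the overflow chain of bucket 0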
The hashing scheme described so far is called static hashing because a fixed number of buckets M is allocated. This can be a serious drawback for dynamic files. Suppose we allocate M buckets for the address space and let m be the maximum number of records that can fit in one bucket; then at most (m * M) records will fit in the allocated space. If the number of records turns out to be substantially fewer than (m * M), we are left with a lot of unused space. When using external hashing, searching for a record given a value of some field other than the hash field is as expensive as in the case of an unordered file.

Record deletion can be implemented by removing the record from its bucket. If the bucket has an overflow chain, we can move one of the overflow records into the bucket to replace the deleted record; if the record to be deleted is already in overflow, we simply remove it from the linked list.

A major drawback of static hashing is that the hash address space is fixed, which makes it difficult to expand or shrink the file dynamically. Two schemes address this. The first, extendible hashing, stores an access structure in addition to the file and hence is somewhat similar to indexing; the main difference is that the access structure is based on the values that result after applying the hash function to the search field, whereas in indexing the access structure is based on the values of the search field itself. The second technique, called linear hashing, does not require additional access structures.

These hashing schemes take advantage of the fact that the result of applying a hashing function is a nonnegative integer and hence can be represented as a binary number. The access structure is built on the binary representation of the hashing function result, which is a string of bits called the hash value of a record. Records are distributed among buckets based on the values of the leading bits in their hash values.
Extendible Hashing

In extendible hashing, a type of directory (an array of 2^d bucket addresses) is maintained, where d is called the global depth of the directory. The integer value corresponding to the first (high-order) d bits of a hash value is used as an index into the array to determine a directory entry, and the address in that entry determines the bucket in which the corresponding records are stored. However, there does not have to be a distinct bucket for each of the 2^d directory locations: several directory locations with the same first d' bits of their hash values may contain the same bucket address if all the records that hash to these locations fit in a single bucket.
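A minimal Python sketch of the directory lookup described above; the toy 8-bit hash, the global depth, and the bucket layout are illustrative assumptions.

# Minimal sketch of extendible-hashing directory lookup.
# With global depth d, the directory has 2**d entries; the first
# (high-order) d bits of a record's hash value index the directory.

BITS = 8   # width of the toy hash value (assumed)

def hash_bits(key):
    # Toy deterministic hash: fold the bytes of the key into 8 bits.
    return sum(key.encode()) % 256

def directory_index(key, global_depth):
    h = hash_bits(key)
    return h >> (BITS - global_depth)   # the leading d bits, as an integer

# A directory of 2**d entries; several entries may share one bucket.
d = 2
bucket_a, bucket_b = [], []
directory = [bucket_a, bucket_a, bucket_b, bucket_b]   # 2**2 = 4 entries

def insert(key, record):
    directory[directory_index(key, d)].append((key, record))

insert("emp-17", {"name": "Aaron"})
print(directory_index("emp-17", d))   # which of the 4 entries was used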
3. Write short notes about B Trees. May/June 2016, Nov/Dec 2014

B-trees and B+-trees are special cases of the well-known tree data structure. A tree is formed of nodes. Each node in the tree, except for a special node called the root, has one parent node and several (zero or more) child nodes; the root node has no parent. A node that does not have any child nodes is called a leaf node; a non-leaf node is called an internal node. The level of a node is always one more than the level of its parent, with the level of the root node being zero. A subtree of a node consists of that node and all its descendant nodes: its child nodes, the child nodes of its child nodes, and so on. A precise recursive definition of a subtree is that it consists of a node n and the subtrees of all the child nodes of n.

B-trees

The B-tree has additional constraints that ensure that the tree is always balanced and that the space wasted by deletion, if any, never becomes excessive. A B-tree of order p, when used as an access structure on a key field to search for records in a data file, can be defined as follows:

1. Each internal node in the B-tree is of the form
   <P1, <K1, Pr1>, P2, <K2, Pr2>, ..., <Kq-1, Prq-1>, Pq>, where q <= p.
   Each Pi is a tree pointer, that is, a pointer to another node in the B-tree. Each Pri is a data pointer, that is, a pointer to the record whose search key field value is equal to Ki (or to the data file block containing that record).
2. Within each node, K1 < K2 < ... < Kq-1.
3. For all search key field values X in the subtree pointed at by Pi (the i-th subtree), we have: Ki-1 < X < Ki for 1 < i < q; X < K1 for i = 1; and Kq-1 < X for i = q.
4. Each node has at most p tree pointers.
5. Each node, except the root and leaf nodes, has at least ceil(p/2) tree pointers. The root node has at least two tree pointers unless it is the only node in the tree.
6. A node with q tree pointers, q <= p, has q - 1 search key field values (and hence q - 1 data pointers).
7. All leaf nodes are at the same level. Leaf nodes have the same structure as internal nodes except that all of their tree pointers Pi are null.
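The search procedure follows directly from the node definition above. Here is a minimal Python sketch of B-tree search; the node representation (parallel lists of keys, data pointers, and children) and the sample tree are illustrative assumptions.

# Minimal sketch of B-tree search following the node definition above.
# Each node holds sorted keys K1..Kq-1, matching data pointers Pr1..Prq-1,
# and child tree pointers P1..Pq (all None for leaf nodes).

class BTreeNode:
    def __init__(self, keys, data_ptrs, children=None):
        self.keys = keys                  # [K1, ..., Kq-1], sorted
        self.data_ptrs = data_ptrs        # [Pr1, ..., Prq-1]
        self.children = children or [None] * (len(keys) + 1)   # [P1, ..., Pq]

def btree_search(node, key):
    """Return the data pointer for `key`, or None if absent."""
    while node is not None:
        i = 0
        while i < len(node.keys) and key > node.keys[i]:
            i += 1
        if i < len(node.keys) and key == node.keys[i]:
            return node.data_ptrs[i]      # found Ki == key
        node = node.children[i]           # descend into the i-th subtree
    return None

# Tiny example tree of order 3 (hypothetical keys).
leaf1 = BTreeNode([2, 5], ["rec2", "rec5"])
leaf2 = BTreeNode([12], ["rec12"])
root = BTreeNode([8], ["rec8"], [leaf1, leaf2])
print(btree_search(root, 5))    # -> "rec5"
print(btree_search(root, 9))    # -> None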
4. Write short notes on the following. Nov/Dec 2015
Spatial and Multimedia databases (8)
Mobile and Web databases (8)

A. Spatial Databases:

Spatial databases provide concepts for databases that keep track of objects in a multi-dimensional space. For example, cartographic databases that store maps include two-dimensional spatial descriptions of their objects, from countries and states to rivers, cities, roads, and seas. These databases are used in many applications, such as environmental, emergency, and battle management. Other databases, such as meteorological databases for weather information, are three-dimensional, since temperatures and other meteorological information are related to three-dimensional spatial points.

In general, a spatial database stores objects that have spatial characteristics that describe them. The spatial relationships among the objects are important, and they are often needed when querying the database. The following categories illustrate three typical types of spatial queries:

Range query: Finds the objects of a particular type that are within a given spatial area or within a particular distance from a given location. (For example, find all hospitals within the Dallas city area, or find all ambulances within five miles of an accident location.)
Nearest neighbor query: Finds an object of a particular type that is closest to a given location. (For example, find the police car that is closest to a particular location.)

Spatial joins or overlays: Typically joins the objects of two types based on some spatial condition, such as the objects intersecting or overlapping spatially or being within a certain distance of one another.
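As a rough illustration of the first two query types, here is a minimal Python sketch over a toy set of two-dimensional points. The data, distance measure, and function names are illustrative assumptions; a real spatial database would use a spatial index such as an R-tree rather than a linear scan.

import math

# Toy spatial data: object name -> (x, y) location (hypothetical).
objects = {
    "hospital-A": (2.0, 3.0),
    "hospital-B": (8.0, 1.0),
    "ambulance-7": (2.5, 3.5),
}

def distance(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def range_query(location, radius):
    """Range query: all objects within `radius` of `location`."""
    return [name for name, pos in objects.items()
            if distance(pos, location) <= radius]

def nearest_neighbor(location):
    """Nearest neighbor query: the single closest object."""
    return min(objects, key=lambda name: distance(objects[name], location))

accident = (3.0, 3.0)
print(range_query(accident, 5.0))    # objects within 5 units
print(nearest_neighbor(accident))    # closest object overall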
Multimedia Databases:

To provide database functions such as indexing and consistency, it is desirable to store multimedia data in a database rather than outside the database in a file system. The database must handle large object representation; similarity-based retrieval must be provided by special index structures; and guaranteed steady retrieval rates must be provided for continuous-media data.

Multimedia Data Formats
Multimedia data are stored and transmitted in compressed form. JPEG and GIF are the most widely used formats for image data. The MPEG standards for video data use commonalities among a sequence of frames to achieve a greater degree of compression. MPEG-1 offers quality comparable to VHS video tape and stores a minute of 30-frame-per-second video and audio in approximately 12.5 MB; MPEG-2, designed for digital broadcast systems and digital video disks, has negligible loss of video quality and compresses one minute of audio-video to approximately 17 MB. Several alternatives exist for audio encoding: MPEG-1 Layer 3 (MP3), RealAudio, Windows Media format, and so on.

Continuous-Media Data
The most important types are video and audio data, characterized by high data volumes and real-time information-delivery requirements: data must be delivered sufficiently fast that there are no gaps in the audio or video; data must be delivered at a rate that does not cause overflow of system buffers; and synchronization among distinct data streams must be maintained (the video of a person speaking must show lips moving synchronously with the audio).

Video Servers
Video-on-demand systems deliver video from central video servers, across a network, to terminals, and must guarantee end-to-end delivery rates. Current video-on-demand servers are based on file systems; existing database systems do not meet real-time response requirements. Multimedia data are stored on several disks (in a RAID configuration), or on tertiary storage for less frequently accessed data. Head-end terminals, used to view multimedia data, are PCs or TVs attached to a small, inexpensive computer called a set-top box.

Similarity-Based Retrieval
Examples of similarity-based retrieval:
Pictorial data: Two pictures or images that are slightly different as represented in the database may be considered the same by a user (e.g., identify similar designs when registering a new trademark).
Audio data: Speech-based user interfaces allow the user to give a command or identify a data item by speaking (e.g., test user input against stored commands).
Handwritten data: Identify a handwritten data item or command stored in the database.
B. MOBILE DATABASE:

The mobile computing environment provides database applications with useful aspects of wireless technology. The mobile computing platform allows users to establish communication with other users and to manage their work while they are mobile. This feature is especially useful to geographically dispersed organizations.

Mobile Computing Architecture

Figure: Infrastructure-based architecture.

It is a distributed architecture in which a number of computers, generally referred to as Fixed Hosts (FH) and Base Stations (BS), are interconnected through a high-speed wired network. Fixed hosts are general-purpose computers that are not equipped to manage mobile units but can be configured to do so. Base stations are equipped with wireless interfaces and can communicate with mobile units to support data access.

Mobile Units (MU) (or hosts) and base stations communicate through wireless channels having bandwidths significantly lower than those of a wired network. A downlink channel is used for sending data from a BS to an MU, and an uplink channel is used for sending data from an MU to its BS. Mobile units are battery-
powered portable computers that move freely in a geographic mobility domain, an area that is restricted by the limited bandwidth of wireless communication channels. To manage the mobility of units, the entire geographic mobility domain is divided into smaller domains called cells. Mobility requires that the movement of mobile units be unrestricted within the geographic mobility domain (intercell movement), while information access contiguity during movement guarantees that the movement of a mobile unit across cell boundaries has no effect on the data retrieval process.

Characteristics of Mobile Environments

In mobile database environments, data generally changes very rapidly. Users are mobile and randomly enter and exit from cells. The average duration of a user's stay in a cell is referred to as residence latency (RL), a parameter that is computed (and continually adjusted) by observing user residence times in cells; thus each cell has an RL value. In order to conserve energy and extend battery life, clients can slip into a doze mode, in which they are not actively listening on the channel and expend significantly less energy than in active mode. Clients can be woken up from doze mode when the server needs to communicate with them.

Types of Data in Mobile Applications

In vertical applications, users access data within a specific cell, and access is denied to users outside of that cell. For example, users can obtain information on the location of doctors or emergency centers within a cell, or parking availability data at an airport cell. In horizontal applications, users cooperate on accomplishing a task and can handle data distributed throughout the system. The horizontal application market is massive; the two types of applications mentioned most are mail-enabled applications and information services for mobile users.
Data may be classified into three categories:

Private data: A single user owns this data and manages it; no other user may access it.
Public data: This data can be read by anyone, but only one source updates it. Examples include weather bulletins and stock prices.
Shared data: This data is accessed in both read and write modes by groups of users. Examples include inventory data for products in a company.

Data Management Issues

Data distribution and replication: Data is unevenly distributed among the base stations and mobile units. Consistency constraints compound the problem of cache management: caches attempt to provide the most frequently accessed and updated data to mobile units that process their own transactions and may be disconnected over long periods.

Transaction models: Issues of fault tolerance and correctness of transactions are aggravated in the mobile environment. A mobile transaction is executed sequentially through several base stations, and possibly on multiple data sets, depending upon the movement of the mobile unit. Central coordination of transaction execution is lacking, so the traditional ACID properties of transactions may need to be modified and new transaction models must be defined.

Query processing: Awareness of where the data is located is important and affects the cost/benefit analysis of query processing. The query response needs to be returned to mobile units that may be in transit or may cross cell boundaries, yet must receive complete and correct query results.

Recovery and fault tolerance: The mobile database environment must deal with site, media, transaction, and communication failures. Site failure at an MU is
frequently due to limited battery power. If an MU has a voluntary shutdown, it should not be treated as a failure. Transaction failures are more frequent during handoff, when an MU crosses cells. MU failure causes a network partitioning and affects routing algorithms.

Mobile database design: The global name resolution problem for handling queries is compounded because of mobility and frequent shutdown. Mobile database design must consider many issues of metadata management, for example the constant updating of location information.

WEB DATABASE

A web database is accessed and managed via the Internet; in simplest terms, it is a database application designed to be managed and accessed through the Internet. Web database applications enable site operators to manage the collection of data and the presentation of analytical results online.

Basics
A database is a general software application that revolutionized businesses in the 1990s and into the 21st century. It enables companies to collect large amounts of data on customers, analyze it, and turn it into useful information. A web database provides these functions via the Internet.

Web Database Software
Web database software applications are widespread, as companies sell applications to aspiring web developers. Benefits of the top applications include the ability to set up data collection forms, polls, feedback forms, and other collection tools, and to present the results of data analysis in real time.

Business Functions
Businesses use web databases in various capacities depending on the operation. Common uses include customized database generation, presentation of information
to customers or visitors, sorting of data, report generation, and importing and exporting of data.

DISTRIBUTED DATABASES

Explain about distributed databases and their characteristics, functions, advantages and disadvantages. May / June 2016

This topic covers distributed databases (DDBs), distributed database management systems (DDBMSs), and how the client-server architecture is used as a platform for database application development. Distributed databases bring the advantages of distributed computing to the database management domain.

A distributed computing system consists of a number of processing elements, not necessarily homogeneous, that are interconnected by a computer network and that cooperate in performing certain assigned tasks. As a general goal, distributed computing systems partition a big, unmanageable problem into smaller pieces and solve it efficiently in a coordinated manner. The economic viability of this approach stems from two reasons: more computing power is harnessed to solve a complex task, and each autonomous processing element can be managed independently to develop its own applications.

DDB technology resulted from a merger of two technologies: database technology, and network and data communication technology. Computer networks allow distributed processing of data. Traditional databases, on the other hand, focus on providing centralized, controlled access to data. Distributed databases allow an integration of information and its processing by applications that may themselves be centralized or distributed.
Distributed Database Concepts

A distributed database (DDB) is a collection of multiple logically interrelated databases distributed over a computer network, and a distributed database management system (DDBMS) is a software system that manages a distributed database while making the distribution transparent to the user. Distributed databases are different from Internet Web files: Web pages are basically a very large collection of files stored on different nodes of a network (the Internet), with interrelationships among the files represented via hyperlinks.

Differences between DDBs and Multiprocessor Systems

Distributed databases differ from multiprocessor systems that use shared storage (primary memory or disk). For a database to be called distributed, the following minimum conditions should be satisfied:

■ Connection of database nodes over a computer network. There are multiple computers, called sites or nodes. These sites must be connected by an underlying communication network to transmit data and commands among sites.
■ Logical interrelation of the connected databases. It is essential that the information in the databases be logically related.
■ Absence of a homogeneity constraint among connected nodes. It is not necessary that all nodes be identical in terms of data, hardware, and software.

Transparency

The concept of transparency extends the general idea of hiding implementation details from end users. A highly transparent system offers a lot of flexibility to the end user/application developer, since it requires little or no awareness of underlying details on their part. In the case of a traditional centralized database, transparency simply pertains to logical and physical data independence for application developers. However, in a
DDB scenario, the data and software are distributed over multiple sites connected by a computer network, so additional types of transparency are introduced.

Data organization transparency (also known as distribution or network transparency): This refers to freedom for the user from the operational details of the network and the placement of the data in the distributed system. It may be divided into location transparency and naming transparency. Location transparency refers to the fact that the command used to perform a task is independent of the location of the data and the location of the node where the command was issued. Naming transparency implies that once a name is associated with an object, the named object can be accessed unambiguously without additional specification as to where the data is located.

Replication transparency: Copies of the same data objects may be stored at multiple sites for better availability, performance, and reliability. Replication transparency makes the user unaware of the existence of these copies.

Fragmentation transparency: Two types of fragmentation are possible. Horizontal fragmentation distributes a relation (table) into subrelations that are subsets of its tuples, while vertical fragmentation distributes a relation into subrelations defined by subsets of its columns. Fragmentation transparency makes the user unaware of the existence of fragments.

Autonomy

Autonomy determines the extent to which individual nodes or DBs in a connected DDB can operate independently. A high degree of autonomy is desirable for increased flexibility and customized maintenance of an individual node. Autonomy can be applied to design, communication, and execution. Design autonomy refers to independence of data model usage and transaction management techniques among nodes. Communication autonomy determines the extent to which each node can decide on sharing of information with other nodes.
Execution autonomy refers to the independence of users to act as they please.

Reliability and Availability

Reliability and availability are two of the most common potential advantages cited for distributed databases. Reliability is broadly defined as the probability that a system is running (not down) at a certain time point, whereas availability is the probability that the system is continuously available during a time interval. We can directly relate the reliability and availability of the database to the faults, errors, and failures associated with it. A failure can be described as a deviation of a system's behavior from that which is specified in order to ensure correct execution of operations. Errors constitute that subset of system states that causes the failure. A fault is the cause of an error.

Advantages of Distributed Databases

Organizations resort to distributed database management for various reasons. Some important advantages are:

1. Improved ease and flexibility of application development.
2. Increased reliability and availability.
3. Improved performance.
4. Easier expansion.

Additional Functions of Distributed Databases

Keeping track of data distribution: the ability to keep track of the data distribution, fragmentation, and replication by expanding the DDBMS catalog.
Distributed query processing: the ability to access remote sites and transmit queries and data among the various sites via a communication network.
Distributed transaction management: the ability to devise execution strategies for queries and transactions that access data from more than one site, and to synchronize the access to distributed data and maintain the integrity of the overall database.
Replicated data management: the ability to decide which copy of a replicated data item to access and to maintain the consistency of copies of a replicated data item.
Distributed database recovery: the ability to recover from individual site crashes and from new types of failures, such as the failure of communication links.
Security: distributed transactions must be executed with the proper management of the security of the data and the authorization/access privileges of users.
Distributed directory (catalog) management: a directory contains information (metadata) about data in the database. The directory may be global for the entire DDB, or local for each site. The placement and distribution of the directory are design and policy issues.

Types of Distributed Database Systems

The term distributed database management system can describe various systems that differ from one another in many respects. The main thing that all such systems have in common is the fact that data and software are distributed over multiple sites connected by some form of communication network. In this section we
discuss a number of types of DDBMSs and the criteria and factors that make some of these systems different.

-o0o-o0o-o0o-
UNIT V ADVANCED TOPICS

DATABASE SECURITY: Data Classification - Threats and risks - Database access Control - Types of Privileges - Cryptography - Statistical Databases - Distributed Databases - Architecture - Transaction Processing - Data Warehousing and Mining - Classification - Association rules - Clustering - Information Retrieval - Relevance ranking - Crawling and Indexing the Web - Object Oriented Databases - XML Databases.

PART – A

1. What is Crawling and Indexing the web? (Nov/Dec 2014)
Spiders are used to crawl the web and collect pages. A page is downloaded and its outward links are found; each outward link is then downloaded in turn. Exceptions: links from CGI interfaces, and pages excluded under the Robot Exclusion Standard.

2. What is Relevance Ranking? (Nov/Dec 2014)
Relevance ranking is the process of sorting the document results so that the documents most likely to be relevant to your query are shown at the top.

3. Define Threats and risks. (Apr/May 2015)
Threats: A threat is an agent that may want to, or definitely can, cause harm to the target organization. Threats include organized crime, spyware, malware, adware companies, and disgruntled internal employees who start attacking their employer.
Worms and viruses also constitute a threat, as they can cause harm to the organization even without a human directing them, by infecting machines and causing damage automatically. Threats are usually referred to as "attackers".

Risk: Risk is where threat and vulnerability overlap. That is, we get a risk when our systems have a vulnerability that a given threat can attack.

4. What is Association rule mining? (Apr/May 2015)
Association rules are if/then patterns of the form antecedent implies consequent, where the consequent is an item that is found in combination with the antecedent. Association rules are created by analyzing data for frequent if/then patterns and using the criteria of support and confidence to identify the most important relationships. Support is an indication of how frequently the items appear in the database.

5. List the types of privileges used in database access control. (Nov/Dec 2015)
System privileges: allow the user to CREATE, ALTER, or DROP database objects.
Object privileges: allow the user to EXECUTE, SELECT, INSERT, UPDATE, or DELETE data from the database objects to which the privileges apply.

6. Can we have more than one constructor in a class? If yes, explain the need for such a situation. (Nov/Dec 2015)
Constructors are most often the only public interface of a class for creating objects of that class. Classes may have many different fields (instance variables), and the user may choose to pass none, some, or all of those values while creating an object of the class. Sometimes the design makes it mandatory to pass all those arguments, but in other cases the design of the class may be flexible enough to let the user choose how many parameters to pass, with sensible default/initialization values provided for the rest. This is the reason a class often has more than one constructor.
7. Explain data classification. (Apr/May 2016)
Data classification is the process of organizing data into categories for its most effective and efficient use. A well-planned data classification system makes essential data easy to find and retrieve. This can be of particular importance for risk management, legal discovery, and compliance.

8. What are the advantages of a data warehouse? (Apr/May 2016)
Enhanced business intelligence
Increased query and system performance
Business intelligence from multiple sources
Timely access to data
Enhanced data quality and consistency
Historical intelligence
High return on investment

9. In how many ways can we describe the knowledge discovered?
Association rules
Classification hierarchies
Sequential patterns
Patterns within finite series
Categorization and segmentation

10. What is the functionality of a data warehouse?
Roll-up
Drill-down
Pivot
Slice and dice
Sorting
Selection
Derived attributes
11. What are the phases in the knowledge discovery process?
Data selection
Data cleansing
Enrichment
Data transformation (or encoding)
Data mining
Reporting and displaying the discovered information

12. Write the applications of data mining.
Advertising
Store location
Targeted mailing
Segmentation of customers
Design of catalogs
Store layout

13. What are tree structure diagrams?
A tree structure diagram is the schema for a hierarchical database. Such a diagram consists of two basic components:
Boxes, which correspond to record types
Lines, which correspond to links

14. What are the functions of a distributed database?
Keeping track of data
Distributed query processing
Distributed transaction management
Replicated data management
Distributed database recovery
Security
15. Define i) data mining ii) data warehousing.
Data mining: Data mining refers to the mining or discovery of new information, in terms of patterns or rules, from vast amounts of data.
Data warehousing: W. H. Inmon characterized a data warehouse as a subject-oriented, integrated, non-volatile, time-variant collection of data in support of management's decisions. It provides access to data for complex analysis, knowledge discovery, and decision making.

Part B

1. DISTRIBUTED TRANSACTIONS

1. Write short notes on distributed transactions. Nov/Dec 2014

System Structure

Each site has its own local transaction manager, whose function is to ensure the ACID properties of those transactions that execute at that site. The various transaction managers cooperate to execute global transactions. To understand how such a manager can be implemented, consider an abstract model of a transaction system, in which each site contains two subsystems:

The transaction manager manages the execution of those transactions (or subtransactions) that access data stored in a local site. Note that each such transaction may be either a local transaction (a transaction that executes only at that site) or part of a global transaction (a transaction that executes at several sites).

The transaction coordinator coordinates the execution of the various transactions (both local and global) initiated at that site.

The structure of a transaction manager is similar in many respects to the structure of a centralized system. Each transaction manager is responsible for:
Maintaining a log for recovery purposes
Participating in an appropriate concurrency-control scheme to coordinate the concurrent execution of the transactions executing at that site

As we shall see, we need to modify both the recovery and concurrency schemes to accommodate the distribution of transactions.

The transaction coordinator subsystem is not needed in the centralized environment, since a transaction accesses data at only a single site. A transaction coordinator, as its name implies, is responsible for coordinating the execution of all the transactions initiated at that site. For each such transaction, the coordinator is responsible for:

Starting the execution of the transaction
Breaking the transaction into a number of subtransactions and distributing these subtransactions to the appropriate sites for execution
Coordinating the termination of the transaction, which may result in the transaction being committed at all sites or aborted at all sites
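To make the coordinator's termination step concrete, here is a minimal Python sketch of a vote-then-decide termination in the spirit of the two-phase commit protocol. The Site class, its method names, and the health flag are illustrative assumptions; a real system would add logging, timeouts, and recovery.

# Minimal sketch of a coordinator terminating a global transaction:
# commit at all sites or abort at all sites.

class Site:
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def prepare(self, subtxn):
        # Phase 1: the site votes "ready" only if it can commit locally.
        return self.healthy

    def commit(self, subtxn):
        print(f"{self.name}: committed {subtxn}")

    def abort(self, subtxn):
        print(f"{self.name}: aborted {subtxn}")

def coordinate(sites, subtxn):
    votes = [site.prepare(subtxn) for site in sites]   # Phase 1: collect votes
    if all(votes):
        for site in sites:                             # Phase 2: commit everywhere
            site.commit(subtxn)
    else:
        for site in sites:                             # Phase 2: abort everywhere
            site.abort(subtxn)

coordinate([Site("site-A"), Site("site-B", healthy=False)], "T1")
# site-B votes no, so the coordinator aborts T1 at all sites.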
System Failure Modes

A distributed system may suffer from the same types of failure that a centralized system does (for example, software errors, hardware errors, or disk crashes). There are, however, additional types of failure with which we need to deal in a distributed environment. The basic failure types are:

Failure of a site
Loss of messages
Failure of a communication link
Network partition

The loss or corruption of messages is always a possibility in a distributed system. The system uses transmission-control protocols, such as TCP/IP, to handle such errors; information about such protocols may be found in standard textbooks on networking.

If two sites A and B are not directly connected, messages from one to the other must be routed through a sequence of communication links. If a communication link fails, messages that would have been transmitted across the link must be rerouted. In some cases, it is possible to find another route through the network, so that the messages are able to reach their destination. In other cases, a failure may result in there being no connection between some pairs of sites. A system is partitioned if it has been split into two (or more) subsystems, called partitions, that lack any connection between them. Note that, under this definition, a subsystem may consist of a single node.

2A. DATABASE SECURITY

2A. Explain types of database security and database security issues. May 2016

Database security refers to protection from malicious access. Absolute protection of the database from malicious abuse is not possible, but the cost to the perpetrator can be made high enough to deter most, if not all, attempts to access the database without proper authority. To protect the database, we must take security measures at several levels:
Database system: Some database-system users may be authorized to access only a limited portion of the database, while other users may be allowed to issue queries but forbidden to modify the data. It is the responsibility of the database system to ensure that these authorization restrictions are not violated.

Operating system: No matter how secure the database system is, weakness in operating-system security may serve as a means of unauthorized access to the database.

Network: Since almost all database systems allow remote access through terminals or networks, software-level security within the network software is as important as physical security, both on the Internet and in private networks.

Physical: Sites with computer systems must be physically secured against armed or surreptitious entry by intruders.

Human: Users must be authorized carefully to reduce the chance of any user giving access to an intruder in exchange for a bribe or other favors.

Issues

Security at all these levels must be maintained if database security is to be ensured: a weakness at a low level of security (physical or human) allows circumvention of strict high-level (database) security measures. In the remainder of this section, we address security at the database-system level; security at the physical and human levels, although important, is beyond the scope of this text. Security within the operating system is implemented at several levels, ranging from passwords for access to the system to the isolation of concurrent processes running within the system. The file system also provides some degree of protection.

2B. K-MEANS ALGORITHM

2B. Explain about the K-means algorithm. Apr / May 2015
K-means clustering intends to partition n objects into k clusters in which each object belongs to the cluster with the nearest mean. This method produces exactly k different clusters of the greatest possible distinction. The best number of clusters k, leading to the greatest separation (distance), is not known a priori and must be computed from the data. The objective of K-means clustering is to minimize the total intra-cluster variance, that is, the squared error function

J = sum over clusters j = 1..k of sum over points xi in cluster Cj of ||xi - mj||^2

where mj is the mean (centroid) of cluster Cj.

Algorithm
1. Cluster the data into k groups, where k is predefined.
2. Select k points at random as cluster centers.
3. Assign objects to their closest cluster center according to the Euclidean distance function.
4. Calculate the centroid or mean of all objects in each cluster.
5. Repeat steps 3 and 4 until the same points are assigned to each cluster in consecutive rounds.
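A minimal one-dimensional Python sketch of these steps, run on the data and initial centroids (16 and 22) from the worked example that follows; the function name and representation are illustrative assumptions, and tie-breaking on equidistant points may differ from the hand-worked iterations.

# Minimal 1-D K-means sketch: assign points to the nearest centroid,
# recompute centroids, and stop when nothing changes (steps 3-5).

def kmeans_1d(points, centroids):
    while True:
        # Step 3: assign each point to its closest cluster center.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda j: abs(p - centroids[j]))
            clusters[nearest].append(p)
        # Step 4: recompute each centroid as the mean of its cluster.
        new_centroids = [sum(c) / len(c) if c else centroids[j]
                         for j, c in enumerate(clusters)]
        # Step 5: stop when the assignments (hence centroids) are stable.
        if new_centroids == centroids:
            return centroids, clusters
        centroids = new_centroids

ages = [15, 15, 16, 19, 19, 20, 20, 21, 22, 28,
        35, 40, 41, 42, 43, 44, 60, 61, 65]
centroids, clusters = kmeans_1d(ages, [16.0, 22.0])
print(centroids)   # converges to roughly [19.5, 47.89], as in the example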
K-means is a relatively efficient method. However, we need to specify the number of clusters in advance, and the final results are sensitive to initialization and often terminate at a local optimum. Unfortunately, there is no global theoretical method for finding the optimal number of clusters. A practical approach is to compare the outcomes of multiple runs with different k and choose the best one based on a predefined criterion. In general, a large k probably decreases the error but increases the risk of overfitting.

Example: Suppose we want to group the visitors to a website using just their age (a one-dimensional space):
15, 15, 16, 19, 19, 20, 20, 21, 22, 28, 35, 40, 41, 42, 43, 44, 60, 61, 65

Initial clusters:
Centroid (C1) = 16 [16]
Centroid (C2) = 22 [22]

Iteration 1:
C1 = 15.33 [15, 15, 16]
C2 = 36.25 [19, 19, 20, 20, 21, 22, 28, 35, 40, 41, 42, 43, 44, 60, 61, 65]

Iteration 2:
C1 = 18.56 [15, 15, 16, 19, 19, 20, 20, 21, 22]
C2 = 45.90 [28, 35, 40, 41, 42, 43, 44, 60, 61, 65]
Iteration 3:
C1 = 19.50 [15, 15, 16, 19, 19, 20, 20, 21, 22, 28]
C2 = 47.89 [35, 40, 41, 42, 43, 44, 60, 61, 65]

Iteration 4:
C1 = 19.50 [15, 15, 16, 19, 19, 20, 20, 21, 22, 28]
C2 = 47.89 [35, 40, 41, 42, 43, 44, 60, 61, 65]

No change occurs between iterations 3 and 4, so the algorithm stops. By using clustering, two groups have been identified: ages 15-28 and ages 35-65. The initial choice of centroids can affect the output clusters, so the algorithm is often run multiple times with different starting conditions in order to get a fair view of what the clusters should be.

2C. Write notes on classification and clustering. Nov / Dec 2014

Classification

Classification is a data mining function that assigns items in a collection to target categories or classes. The goal of classification is to accurately predict the target class for each case in the data. For example, a classification model could be used to identify loan applicants as low, medium, or high credit risks. A classification task begins with a data set in which the class assignments are known; for example, a classification model that predicts credit risk could be developed based on observed data for many loan applicants over a period of time.

Classification consists of predicting a certain outcome based on a given input. In order to predict the outcome, the algorithm processes a training set containing a set of attributes and the respective outcome, usually called the goal or prediction attribute. The algorithm tries to discover relationships between the attributes that would make it possible to predict the outcome. Next, the algorithm is given a data set it has not seen before, called the prediction set, which contains the same set of attributes except for the prediction attribute, which is not yet known. The algorithm analyses the input and produces a prediction. The prediction accuracy defines how "good" the algorithm is.
Classification Algorithms:
Decision trees
Rule-based induction
Neural networks
Memory (case) based reasoning
Genetic algorithms
Bayesian networks

Clustering

Clustering is the process of organizing objects into groups whose members are similar in some way. A cluster is therefore a collection of objects which are "similar" to one another and "dissimilar" to the objects belonging to other clusters. Clustering is a data mining technique used in:
Statistical data analysis
Machine learning
Data mining
Pattern recognition
Image analysis
Bioinformatics

Types of Clustering
Hierarchical: finding new clusters using previously found ones
Partitional: finding all clusters at once
Self-organizing maps
Hybrids (incremental)
Major existing clustering methods:
Distance-based
Hierarchical
Partitioning
Probabilistic

3. OBJECT ORIENTED DATABASES

Suppose an object-oriented database had an object A, which references object B, which in turn references object C. Assume all objects are on disk initially. Suppose a program first dereferences A, then dereferences B by following the reference from A, and then finally dereferences C. Show the objects that are represented in memory after each dereference, along with their state. Nov / Dec 2015

An object-oriented database is a database that subscribes to a model in which information is represented by objects. In an object-oriented database, an object can be developed, defined, and named, and then referenced (or called later) as a unit without having to go into its internal complexities. Object-oriented database management systems (OODBMSs) combine database capabilities with object-oriented programming language capabilities.
OODBMSs allow object-oriented programmers to develop the product, store objects, and replicate or modify existing objects to make new objects within the OODBMS. Because the database is integrated with the programming language, the programmer can maintain consistency within one environment: both the OODBMS and the programming language use the same model of representation.

The complex object data model is a non-1NF data model. It allows the following extensions:
Sets of atomic values
Tuple-valued attributes
Sets of tuples (nested relations)
General set and tuple constructors
Object identity

An object is defined by a triple (OID, type constructor, state), where OID is the unique object identifier, the type constructor is its type (such as atom, tuple, set, list, array, bag, etc.), and the state is its actual value. Example: (i1, atom, 'John'), (i2, atom, 30).

Object-oriented features: complex objects, object identity, encapsulation, classes, inheritance, overriding, overloading, late binding, computational completeness, extensibility.
Database features: persistence, performance, concurrency, reliability, declarative queries.

OID: Every object has a unique, immutable object ID. Objects are referenced by ID (note that raw pointers are problematic in a persistent disk-based representation). Complex objects can be created, and an object can be "shared" across complex objects. In contrast to the relational model, either object IDs or keys can be compared for equality.

Persistence: persistence by name and persistence by reachability.
ODMG Standard: ODL, OQL.

Object Referencing and Dereferencing:

Suppose, for example, that object A references object B using a physical OID. Now object B is deleted and a new object C is created at the storage location at which object B was originally stored. To be able to trap that the object referenced by A no longer exists, the physical OIDs of objects B and C must differ in the value of their unique fields. Working with physical OIDs is very simple: to dereference a physical OID (e.g., to traverse the inter-object reference from A to B), the database system simply decodes the storage location of the referenced object, which is part of the physical OID.

Migrations are necessary, for instance, if objects grow as a result of update operations. If an object migrates, a forward which contains the new storage location of the object is established at the place at which the object was originally stored. If an object migrates several times, this forward is updated so that it always contains the right storage location, and an object can be read with at most two "hops".

Figure: Two example physical OIDs; the object referenced by the first OID is still stored at its original location, whereas the object referenced by the second OID has migrated to another page, so a forward had to be established for it.
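To address the question above directly, here is a minimal Python sketch that simulates dereferencing A, then B, then C: each dereference faults the object in from "disk" on first access, so after the three steps A, B, and C are all resident in memory with their loaded state. The object states and the cache layout are illustrative assumptions.

# Minimal sketch: objects live "on disk" and are faulted into an
# in-memory cache the first time they are dereferenced.

disk = {   # OID -> stored state (hypothetical objects A -> B -> C)
    "A": {"value": 1, "ref": "B"},
    "B": {"value": 2, "ref": "C"},
    "C": {"value": 3, "ref": None},
}
memory = {}   # the in-memory object cache, initially empty

def dereference(oid):
    if oid not in memory:                  # object fault: load from disk
        memory[oid] = dict(disk[oid])
    print(f"after dereferencing {oid}: in memory = {sorted(memory)}")
    return memory[oid]

a = dereference("A")        # in memory: A (with its state)
b = dereference(a["ref"])   # in memory: A, B
c = dereference(b["ref"])   # in memory: A, B, C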
If an object is migrated, the object's entry in the mapping structure is updated, in a similar way as forwards are updated when objects migrate and physical OIDs are used. Three different kinds of mapping structures are used in practice: (1) B-trees, (2) hash tables, and (3) direct mapping tables.

Object Database Advantages over RDBMS
Objects don't require assembly and disassembly, saving coding time and execution time.
Reduced paging
Easier navigation
Better concurrency control: a hierarchy of objects may be locked.
The data model is based on the real world.
Works well for distributed architectures.
Less code is required when applications are object oriented.

Object Database Disadvantages compared to RDBMS
Lower efficiency when data and relationships are simple.
Relational tables are simpler.
Late binding may slow access speed.
More user tools exist for RDBMSs.
Standards for RDBMSs are more stable.
Support for RDBMSs is more certain, and change is less likely to be required.

4. ACCESS CONTROL, THREATS AND RISKS

Write short notes on the following. Apr / May 2015, Nov / Dec 2014, Nov/Dec 2015
Access Control (8)
Threats and Risks (8)

4A. Discretionary Access Control
The typical method of enforcing discretionary access control in a database system is based on granting and revoking privileges at two levels:

The account level: At this level, the DBA specifies the particular privileges that each account holds, independently of the relations in the database.
The relation level (or table level): At this level, the DBA can control the privilege to access each individual relation or view in the database.

Types of discretionary privileges: The privileges at the account level apply to the capabilities provided to the account itself and can include:
the CREATE SCHEMA or CREATE TABLE privilege, to create a schema or base relation;
the CREATE VIEW privilege;
the ALTER privilege, to apply schema changes such as adding or removing attributes from relations;
the DROP privilege, to delete relations or views;
the MODIFY privilege, to insert, delete, or update tuples; and
the SELECT privilege, to retrieve information from the database by using a SELECT query.

The second level of privileges applies to the relation level; this includes base relations and virtual (view) relations.

The granting and revoking of privileges generally follow an authorization model for discretionary privileges known as the access matrix model, in which:
the rows of a matrix M represent subjects (users, accounts, programs);
the columns represent objects (relations, records, columns, views, operations); and
each position M(i, j) in the matrix represents the types of privileges (read, write, update) that subject i holds on object j.
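A minimal Python sketch of the access matrix check just described, with GRANT-style and REVOKE-style updates; the subjects, objects, and privilege names are illustrative assumptions.

# Minimal sketch of the access matrix model: M[subject][object] is the
# set of privileges that the subject holds on the object.

M = {
    "userA": {"EMPLOYEE": {"read", "update"}, "DEPT": {"read"}},
    "userB": {"EMPLOYEE": {"read"}},
}

def authorized(subject, obj, privilege):
    """Check position M(i, j) for the requested privilege."""
    return privilege in M.get(subject, {}).get(obj, set())

def grant(subject, obj, privilege):
    """GRANT-style update: add a privilege to M(i, j)."""
    M.setdefault(subject, {}).setdefault(obj, set()).add(privilege)

def revoke(subject, obj, privilege):
    """REVOKE-style update: remove a privilege from M(i, j)."""
    M.get(subject, {}).get(obj, set()).discard(privilege)

print(authorized("userB", "EMPLOYEE", "update"))   # False
grant("userB", "EMPLOYEE", "update")
print(authorized("userB", "EMPLOYEE", "update"))   # True
revoke("userB", "EMPLOYEE", "update")
print(authorized("userB", "EMPLOYEE", "update"))   # False again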
Specifying Privileges Using Views

The mechanism of views is an important discretionary authorization mechanism in its own right. For example, if the owner A of a relation R wants another account B to be able to retrieve only some fields of R, then A can create a view V of R that includes only those attributes and then grant SELECT on V to B. The same applies to limiting B to retrieving only certain tuples of R: a view V' can be created by defining it by means of a query that selects only those tuples from R that A wants to allow B to access.

Revoking Privileges

In some cases it is desirable to grant a privilege to a user temporarily. For example, the owner of a relation may want to grant the SELECT privilege to a user for a specific task and then revoke that privilege once the task is completed. Hence, a mechanism for revoking privileges is needed; in SQL, a REVOKE command is included for the purpose of canceling privileges.

Propagation of Privileges Using the GRANT OPTION

Whenever the owner A of a relation R grants a privilege on R to another account B, the privilege can be given to B with or without the GRANT OPTION. If the GRANT OPTION is given, B can also grant that privilege on R to other accounts. Suppose that B is given the GRANT OPTION by A and that B then grants the privilege on R to a third account C, also with the GRANT OPTION; in this way, privileges on R can propagate to other accounts without the knowledge of the owner of R. If the owner account A now revokes the privilege granted to B, all the privileges that B propagated based on that privilege should automatically be revoked by the system.
Specifying Limits on Propagation of Privileges

Techniques to limit the propagation of privileges have been developed, although they have not yet been implemented in most DBMSs and are not a part of SQL. Limiting horizontal propagation to an integer number i means that an account B given the GRANT OPTION can grant the privilege to at most i other accounts. Vertical propagation is more complicated; it limits the depth of the granting of privileges.

Mandatory Access Control

In many applications, an additional security policy is needed that classifies data and users based on security classes. This approach, known as mandatory access control, would typically be combined with the discretionary access control mechanisms. Typical security classes are top secret (TS), secret (S), confidential (C), and unclassified (U), where TS is the highest level and U the lowest: TS >= S >= C >= U.

The commonly used model for multilevel security, known as the Bell-LaPadula model, classifies each subject (user, account, program) and object (relation, tuple, column, view, operation) into one of the security classifications TS, S, C, or U. We refer to the clearance (classification) of a subject S as class(S) and to the classification of an object O as class(O). Two restrictions are enforced on data access based on the subject/object classifications:

Simple security property: A subject S is not allowed read access to an object O unless class(S) >= class(O).
Star property (or * property): A subject S is not allowed to write an object O unless class(S) <= class(O).
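A minimal Python sketch of these two checks; the ordering follows TS >= S >= C >= U from the text, and everything else is an illustrative assumption.

# Minimal sketch of the Bell-LaPadula access checks.
# Higher number = higher security classification: TS >= S >= C >= U.

LEVEL = {"U": 0, "C": 1, "S": 2, "TS": 3}

def can_read(subject_class, object_class):
    # Simple security property: read only if class(S) >= class(O).
    return LEVEL[subject_class] >= LEVEL[object_class]

def can_write(subject_class, object_class):
    # Star property: write only if class(S) <= class(O),
    # which prevents "writing down" to a lower classification.
    return LEVEL[subject_class] <= LEVEL[object_class]

print(can_read("S", "C"))    # True: a Secret subject may read Confidential data
print(can_write("S", "C"))   # False: writing down could leak information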
To incorporate multilevel security notions into the relational database model, it is common to consider attribute values and tuples as data objects. Hence, each attribute A is associated with a classification attribute C in the schema, and each attribute value in a tuple is associated with a corresponding security classification. In addition, in some models, a tuple classification attribute TC is added to the relation attributes to provide a classification for each tuple as a whole. Hence, a multilevel relation schema R with n attributes would be represented as

R(A1, C1, A2, C2, ..., An, Cn, TC)

where each Ci represents the classification attribute associated with attribute Ai. The value of the TC attribute in each tuple t, which is the highest of all attribute classification values within t, provides a general classification for the tuple itself, whereas each Ci provides a finer security classification for each attribute value within the tuple.

The apparent key of a multilevel relation is the set of attributes that would have formed the primary key in a regular (single-level) relation. A multilevel relation will appear to contain different data to subjects (users) with different clearance levels. In some cases, it is possible to store a single tuple in the relation at a higher classification level and produce the corresponding tuples at a lower classification level through a process known as filtering. In other cases, it is necessary to store two or more tuples at different classification levels with the same value for the apparent key. This leads to the concept of polyinstantiation, where several tuples can have the same apparent key value but different attribute values for users at different classification levels.
Role-Based Access Control (RBAC)
Role-based access control (RBAC) emerged rapidly in the 1990s as a proven technology for managing and enforcing security in large-scale enterprise-wide systems. Its basic notion is that permissions are associated with roles, and users are assigned to appropriate roles. Roles can be created using the CREATE ROLE and DESTROY ROLE commands. The GRANT and REVOKE commands discussed under DAC can then be used to assign and revoke privileges from roles. RBAC appears to be a viable alternative to traditional discretionary and mandatory access controls; it ensures that only authorized users are given access to certain data or resources. Many DBMSs support the concept of roles, where privileges can be assigned to roles. A role hierarchy in RBAC is a natural way of organizing roles to reflect the organization's lines of authority and responsibility.
4B. THREATS AND RISKS
4B. Explain about threats and risks.
Threat:
With the increase in the usage of databases, the frequency of attacks against those databases has also increased. Database attacks are an increasing trend these days. One reason is the increase in access to data stored in databases: when data is accessed by many people, the chances of data theft increase. In the past, database attacks did occur, but they were fewer in number, since hackers attacked networks mostly to show that it was possible to break in, not to sell proprietary information. Another motive for database attacks is to make money by selling sensitive information, which includes credit card numbers, Social Security Numbers, etc.
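A short sketch of these role commands in SQL; the role and user names are invented for illustration, and many DBMSs spell the removal command DROP ROLE rather than DESTROY ROLE:

CREATE ROLE accounts_clerk;
GRANT SELECT, INSERT ON ACCOUNTS TO accounts_clerk;  -- privileges go to the role
GRANT accounts_clerk TO alice;                       -- users are assigned the role
REVOKE accounts_clerk FROM alice;                    -- membership can be revoked
DROP ROLE accounts_clerk;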
Types of Threats to Database Security
1. Privilege abuse: When database users are provided with privileges that exceed their day-to-day job requirements, these privileges may be abused intentionally or unintentionally. Take, for instance, a database administrator in a financial institution. What will happen if he turns off audit trails or creates bogus accounts? He will be able to transfer money from one account to another, thereby abusing the excessive privilege intentionally. Privilege can also be abused unintentionally: a company provides a "work from home" option to its employees, and an employee takes a backup of sensitive data to work on from home. This not only violates the security policies of the organization, but may also result in a data security breach if the system at home is compromised.
2. Operating system vulnerabilities: Vulnerabilities in underlying operating systems like Windows, UNIX, Linux, etc., and in the services that are related to the databases, could lead to unauthorized access. This may lead to a Denial of Service (DoS) attack. This can be prevented by applying operating system security patches as and when they become available.
3. Database rootkits: A database rootkit is a program or a procedure that is hidden inside the database and that provides administrator-level privileges to gain access to the data in the database. These rootkits may even turn off alerts triggered by Intrusion Prevention Systems (IPS). It is possible to install a rootkit only after compromising the underlying operating system. Rootkits can be detected through periodic audits; otherwise the presence of a database rootkit may go undetected.
Risks:
Database security concerns the use of a broad range of information security controls to protect databases (potentially including the data, the database applications or stored functions, the database systems, the database servers and the associated network links) against compromises of their confidentiality, integrity and availability. It involves various types or categories of controls, such as technical, procedural/administrative and physical. Database security is a specialist topic within the broader realms of computer security, information security and risk management. Security risks to database systems include, for example:
- Unauthorized or unintended activity or misuse by authorized database users, database administrators, or network/systems managers, or by unauthorized users or hackers (e.g. inappropriate access to sensitive data, metadata or functions within databases, or inappropriate changes to the database programs, structures or security configurations);
- Malware infections causing incidents such as unauthorized access, leakage or disclosure of personal or proprietary data, deletion of or damage to the data or programs, interruption or denial of authorized access to the database, attacks on other systems, and the unanticipated failure of database services;
- Overloads, performance constraints and capacity issues resulting in the inability of authorized users to use databases as intended;
- Physical damage to database servers caused by computer room fires or floods, overheating, lightning, accidental liquid spills, static discharge, electronic breakdowns/equipment failures and obsolescence;
- Design flaws and programming bugs in databases and the associated programs and systems, creating various security vulnerabilities (e.g. unauthorized privilege escalation), data loss/corruption, performance degradation, etc.;
- Data corruption and/or loss caused by the entry of invalid data or commands, mistakes in database or system administration processes, sabotage/criminal damage, etc.
5. DATA WAREHOUSING AND DATA MINING
5. Explain in detail about Data Warehousing and Data Mining. Apr / May 2015
A data warehouse is a repository (or archive) of information gathered from multiple sources, stored under a unified schema, at a single site. Once gathered, the data are stored for a long time, permitting access to historical data. Thus, data warehouses provide the user a single consolidated interface to data, making decision-support queries easier to write. Moreover, by accessing information for decision support from a data warehouse, the decision maker ensures that online transaction-processing systems are not affected by the decision-support workload.
Components of a Data Warehouse
- When and how to gather data
- What schema to use
- Data cleansing
- How to propagate updates
- What data to summarize
Data Warehouse Architecture
Warehouse Schemas
Data warehouses typically have schemas that are designed for data analysis, using tools such as OLAP tools. Thus, the data are usually multidimensional, with dimension attributes and measure attributes. Tables containing multidimensional data are called fact tables and are usually very large. To minimize storage requirements, dimension attributes are usually short identifiers that are foreign keys into other tables called dimension tables. For instance, a fact table sales would have dimension attributes item-id, store-id, customer-id, and date, and measure attributes number and price. The attribute store-id is a foreign key into a dimension table store, which has other attributes such as store location (city, state, country). The item-id attribute of the sales table would be a foreign key into a dimension table item-info, which would contain information such as the name of the item, the category to which the item belongs, and other item details such as color and size. The customer-id attribute would be a foreign key into a customer table containing attributes such as the name and address of the customer. We can also view the date attribute as a foreign key into a date-info table giving the month, quarter, and year of each date. Such a schema, with a fact table, multiple dimension tables, and foreign keys from the fact table to the dimension tables, is called a star schema. More complex data warehouse designs may have multiple levels of dimension tables; for instance, the item-info table may have an attribute manufacturer-id that is a foreign key into another table giving details of the manufacturer. Such schemas are called snowflake schemas. Complex data warehouse designs may also have more than one fact table.
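A concrete sketch of this star schema in SQL DDL follows; the column types are assumptions, and a few names from the text are adapted (underscores for hyphens, date_val and number_sold in place of the reserved words date and number):

CREATE TABLE item_info (
    item_id   INT PRIMARY KEY,
    name      VARCHAR(40),
    category  VARCHAR(20),
    color     VARCHAR(15),
    item_size VARCHAR(10)
);
CREATE TABLE store (
    store_id  INT PRIMARY KEY,
    city      VARCHAR(30),
    state     VARCHAR(30),
    country   VARCHAR(30)
);
CREATE TABLE customer (
    customer_id INT PRIMARY KEY,
    name        VARCHAR(40),
    address     VARCHAR(80)
);
CREATE TABLE date_info (
    date_val DATE PRIMARY KEY,
    month    INT,
    quarter  INT,
    year     INT
);
-- Fact table: the dimension attributes are foreign keys into the
-- dimension tables; number_sold and price are the measure attributes.
CREATE TABLE sales (
    item_id     INT  REFERENCES item_info(item_id),
    store_id    INT  REFERENCES store(store_id),
    customer_id INT  REFERENCES customer(customer_id),
    date_val    DATE REFERENCES date_info(date_val),
    number_sold INT,
    price       DECIMAL(10,2)
);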
Data Mining
Generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information: information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases.
Continuous Innovation
Although data mining is a relatively new term, the technology is not. Companies have used powerful computers to sift through volumes of supermarket scanner data and analyze market research reports for years. However, continuous innovations in computer processing power, disk storage, and statistical software are dramatically increasing the accuracy of analysis while driving down the cost.
Data
Data are any facts, numbers, or text that can be processed by a computer. Today, organizations are accumulating vast and growing amounts of data in different formats and different databases. This includes:
- operational or transactional data, such as sales, cost, inventory, payroll, and accounting data
- nonoperational data, such as industry sales, forecast data, and macroeconomic data
- metadata (data about the data itself), such as logical database design or data dictionary definitions
Information
The patterns, associations, or relationships among all this data can provide information. For example, analysis of retail point-of-sale transaction data can yield information on which products are selling and when.
Knowledge
Information can be converted into knowledge about historical patterns and future trends. For example, summary information on retail supermarket sales can be analyzed in light of promotional efforts to provide knowledge of consumer buying behavior. Thus, a manufacturer or retailer could determine which items are most susceptible to promotional efforts.
How does data mining work?
While large-scale information technology has been evolving separate transaction and analytical systems, data mining provides the link between the two. Data mining software analyzes relationships and patterns in stored transaction data based on open-ended user queries. Several types of analytical software are available: statistical, machine learning, and neural networks. Generally, any of four types of relationships are sought:
- Classes: Stored data is used to locate data in predetermined groups. For example, a restaurant chain could mine customer purchase data to determine when customers visit and what they typically order. This information could be used to increase traffic by having daily specials.
- Clusters: Data items are grouped according to logical relationships or consumer preferences. For example, data can be mined to identify market segments or consumer affinities.
- Associations: Data can be mined to identify associations. The beer-diaper example is an example of associative mining.
- Sequential patterns: Data is mined to anticipate behavior patterns and trends. For example, an outdoor equipment retailer could predict the likelihood of a backpack being purchased based on a consumer's purchase of sleeping bags and hiking shoes.
Data mining consists of five major elements:
- Extract, transform, and load transaction data onto the data warehouse system.
- Store and manage the data in a multidimensional database system.
- Provide data access to business analysts and information technology professionals.
- Analyze the data with application software.
- Present the data in a useful format, such as a graph or table.
Different levels of analysis are available:
- Artificial neural networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.
- Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of natural evolution.
- Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi-Square Automatic Interaction Detection (CHAID). CART and CHAID are decision tree techniques used for classification of a dataset. They provide a set of rules that can be applied to a new (unclassified) dataset to predict which records will have a given outcome. CART segments a dataset by creating two-way splits, while CHAID segments using chi-square tests to create multi-way splits. CART typically requires less data preparation than CHAID.
- Nearest neighbor method: A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k ≥ 1). Sometimes called the k-nearest neighbor technique.
- Rule induction: The extraction of useful if-then rules from data based on statistical significance.
- Data visualization: The visual interpretation of complex relationships in multidimensional data. Graphics tools are used to illustrate data relationships.
6A. DISTRIBUTED ARCHITECTURE
6A. Explain in detail about Parallel versus Distributed Architectures.
Parallel versus Distributed Architectures
- Shared memory (tightly coupled) architecture: Multiple processors share secondary (disk) storage and also share primary memory.
- Shared disk (loosely coupled) architecture: Multiple processors share secondary (disk) storage, but each has its own primary memory.
These architectures enable processors to communicate without the overhead of exchanging messages over a network. Database management systems developed using the above types of architectures are termed parallel database management systems rather than DDBMSs, since they utilize parallel processor technology.
Another type of multiprocessor architecture is called shared nothing architecture. In this architecture, every processor has its own primary and secondary (disk) memory, no common memory exists, and the processors communicate over a high-speed interconnection network (bus or switch).
- Although the shared nothing architecture resembles a distributed database computing environment, major differences exist in the mode of operation.
- In shared nothing multiprocessor systems, there is symmetry and homogeneity of nodes; this is not true of the distributed database environment, where heterogeneity of hardware and operating system at each node is very common.
- Shared nothing architecture is also considered an environment for parallel databases.
General Architecture of Pure Distributed Databases
In the generic schema architecture of a DDB, the enterprise is presented with a consistent, unified view showing the logical structure of the underlying data across all nodes.
This view is represented by the global conceptual schema (GCS), which provides network transparency.
- To accommodate potential heterogeneity in the DDB, each node is shown as having its own local internal schema (LIS) based on physical organization details at that particular site. The logical organization of data at each site is specified by the local conceptual schema (LCS).
- The global query compiler references the global conceptual schema from the global system catalog to verify and impose defined constraints.
- The global query optimizer references both global and local conceptual schemas and generates optimized local queries from global queries. It evaluates all candidate strategies using a cost function that estimates cost based on response time.
Some different database system architectures: (a) shared nothing architecture; (b) a networked architecture with a centralized database at one of the sites; (c) a truly distributed database architecture.
Part - C
6B. XML DATABASES
6B. Explain in detail about XML Databases.
Although HTML is widely used for formatting and structuring Web documents, it is not suitable for specifying structured data that is extracted from databases. A new language, XML (eXtensible Markup Language), has emerged as the standard for structuring and exchanging data over the Web. XML can be used to provide more information about the structure and meaning of the data in Web pages rather than just specifying how the Web pages are formatted for display on the screen. The formatting aspects are specified separately, for example by using a formatting language such as XSL (eXtensible Stylesheet Language).
Structured, Semi-Structured and Unstructured Data
Information stored in databases is known as structured data because it is represented in a strict format. The DBMS then checks to ensure that all data follows the structures and constraints specified in the schema.
- In some applications, data is collected in an ad-hoc manner before it is known how it will be stored and managed. This data may have a certain structure, but not all the information collected will have an identical structure. This type of data is known as semi-structured data. In semi-structured data, the schema information is mixed in with the data values, since each data object can have different attributes that are not known in advance. Hence, this type of data is sometimes referred to as self-describing data.
- A third category is known as unstructured data, because there is very limited indication of the type of data. A typical example would be a
text document that contains information embedded within it. Web pages in HTML that contain some data are considered unstructured data.
Semi-structured data may be displayed as a directed graph.
- The labels or tags on the directed edges represent the schema names: the names of attributes, object types (or entity types or classes), and relationships.
- The internal nodes represent individual objects or composite attributes.
- The leaf nodes represent actual data values of simple (atomic) attributes.
Representing semi-structured data as a graph.
XML Hierarchical (Tree) Data Model
The basic object in XML is the XML document. There are two main structuring concepts that are used to construct an XML document: elements and attributes. Attributes in XML provide additional information that describes elements. As in HTML, elements are identified in a document by their start tag and end tag. The tag names are enclosed between angled brackets <…>, and end tags are further identified by a slash (</…>). Complex elements are constructed from other elements hierarchically, whereas simple elements contain data values.
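A minimal, invented XML document illustrating these concepts (the element names and values here are hypothetical):

<?xml version="1.0"?>
<company>                        <!-- root element -->
  <employee id="123">            <!-- complex element with an attribute -->
    <name>Jane Smith</name>      <!-- simple elements contain data values -->
    <salary>55000</salary>
  </employee>
</company>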
It is straightforward to see the correspondence between the XML textual representation and the tree structure. In the tree representation, internal nodes represent complex elements, whereas leaf nodes represent simple elements. That is why the XML model is called a tree model or a hierarchical model.
It is possible to characterize three main types of XML documents:
1. Data-centric XML documents: These documents have many small data items that follow a specific structure, and hence may be extracted from a structured database. They are formatted as XML documents in order to exchange them or display them over the Web.
2. Document-centric XML documents: These are documents with large amounts of text, such as news articles or books. There are few or no structured data elements in these documents.
3. Hybrid XML documents: These documents may have parts that contain structured data and other parts that are predominantly textual or unstructured.
XML Documents, DTD, and XML Schema
Well-Formed
- It must start with an XML declaration to indicate the version of XML being used, as well as any other relevant attributes.
- It must follow the syntactic guidelines of the tree model. This means that there should be a single root element, and every element must include a matching pair of start and end tags within the start and end tags of the parent element.
- A well-formed XML document is syntactically correct. This allows it to be processed by generic processors that traverse the document and create an internal tree representation.
- DOM (Document Object Model): allows programs to manipulate the resulting tree representation corresponding to a well-formed XML document. The whole document must be parsed beforehand when using DOM.
- SAX: allows processing of XML documents on the fly by notifying the processing program whenever a start or end tag is encountered.
Valid
A stronger criterion is for an XML document to be valid. In this case, the document must be well-formed, and in addition the element names used in the start and end tag pairs must follow the structure specified in a separate XML DTD (Document Type Definition) file or XML schema file.
Limitations of XML DTD
- First, the data types in DTD are not very general.
- Second, DTD has its own special syntax and so requires specialized processors. It would be advantageous to specify XML schema documents using the syntax rules of XML itself, so that the same processors used for XML documents can process XML schema descriptions.
- Third, all DTD elements are always forced to follow the ordering specified in the document, so unordered elements are not permitted.
XML SCHEMA
- Schema descriptions and XML namespaces: It is necessary to identify the specific set of XML schema language elements (tags) by a file stored at a Web site location. The second line in our example specifies the file used in this example, which is "http://www.w3.org/2001/XMLSchema". Each such definition is called an XML namespace. The file name is assigned to the variable xsd using the attribute xmlns (XML namespace), and this variable is used as a prefix to all XML schema tags.
- Annotations, documentation, and language used: The xsd:annotation and xsd:documentation elements are used for providing comments and other descriptions in the XML document. The attribute xml:lang of the xsd:documentation element specifies the language being used, e.g. "en".
- Elements and types: We specify the root element of our XML schema. In XML schema, the name attribute of the xsd:element tag specifies the element name, which is called company for the root element in our example. The structure of the company root element is an xsd:complexType.
- First-level elements in the company database: These elements are named employee, department, and project, and each is specified in an xsd:element tag. If a tag has only attributes and no further sub-
elements or data within it, it can be ended with the slash symbol (/>) and is termed an empty element.
- Specifying element type and minimum and maximum occurrences: If we specify a type attribute in an xsd:element, this means that the structure of the element will be described separately, typically using the xsd:complexType element. The minOccurs and maxOccurs attributes are used for specifying lower and upper bounds on the number of occurrences of an element. The default is exactly one occurrence.
- Specifying keys: For specifying primary keys, the tag xsd:key is used. For specifying foreign keys, the tag xsd:keyref is used. When specifying a foreign key, the attribute refer of the xsd:keyref tag specifies the referenced primary key, whereas the tags xsd:selector and xsd:field specify the referencing element type and foreign key.
- Specifying the structures of complex elements via complex types: Complex elements in our example are Department, Employee, Project, and Dependent, which use the tag xsd:complexType. We specify each of these as a sequence of subelements corresponding to the database attributes of each entity type by using the xsd:sequence and xsd:element tags of XML schema. Each element is given a name and type via the attributes name and type of xsd:element. We can also specify minOccurs and maxOccurs attributes if we need to change the default of exactly one occurrence. For (optional) database attributes where null is allowed, we need to specify minOccurs = 0, whereas for multivalued database attributes we need to specify maxOccurs = "unbounded" on the corresponding element.
- Composite (compound) attributes: Composite attributes from an ER schema are also specified as complex types in the XML schema, as illustrated by the Address, Name, Worker, and WorksOn complex types. These could have been directly embedded within their parent elements.
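A hypothetical XML schema fragment illustrating occurrence bounds and the key/keyref tags (the element and field names here are invented, not taken from the example above):

<xsd:element name="employee" type="Employee" minOccurs="0" maxOccurs="unbounded"/>

<xsd:key name="employeeKey">                    <!-- primary key of employee -->
  <xsd:selector xpath="employee"/>
  <xsd:field xpath="employeeSsn"/>
</xsd:key>
<xsd:keyref name="worksForRef" refer="departmentKey">  <!-- foreign key -->
  <xsd:selector xpath="employee"/>
  <xsd:field xpath="departmentNumber"/>
</xsd:keyref>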
Approaches to Storing XML Documents
- Using a DBMS to store the documents as text: We can use a relational or object DBMS to store whole XML documents as text fields within the DBMS records or objects. This approach can be used if the DBMS has a special module for document processing, and would work for storing schemaless and document-centric XML documents.
- Using a DBMS to store the document contents as data elements: This approach would work for storing a collection of documents that follow a specific XML DTD or XML schema. Since all the documents have the same structure, we can design a relational (or object) database to store the leaf-level data elements within the XML documents.
- Designing a specialized system for storing native XML data: A new type of database system based on the hierarchical (tree) model would be designed and implemented. The system would include specialized indexing and querying techniques, and would work for all types of XML documents.
- Creating or publishing customized XML documents from pre-existing relational databases: Because there are enormous amounts of data already stored in relational databases, parts of these data may need to be formatted as documents for exchanging or displaying over the Web.
Extracting XML Documents from Relational Databases
Suppose that an application needs to extract XML documents for student, course, and grade information from the university database. The data needed for these documents is contained in the database attributes of the entity types course, section, and student, as shown below (part of the main ER diagram), and the relationships s-s and c-s between them.
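A minimal sketch of the first approach, assuming a hypothetical table that stores each document as a character LOB (many DBMSs also provide a native XML column type instead):

CREATE TABLE xml_documents (
    doc_id  INT PRIMARY KEY,
    content CLOB               -- the whole XML document stored as text
);

INSERT INTO xml_documents (doc_id, content)
VALUES (1, '<company><employee id="123">...</employee></company>');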
CRYPTOGRAPHY AND STATISTICAL DATABASE
Explain in detail about Cryptography and Statistical databases. May 2016
Encryption
- The Data Encryption Standard (DES) substitutes characters and rearranges their order on the basis of an encryption key which is provided to authorized users via a secure mechanism. The scheme is no more secure than the key transmission mechanism, since the key has to be shared.
- The Advanced Encryption Standard (AES) is a new standard replacing DES; it is based on the Rijndael algorithm, but is also dependent on shared secret keys.
- Public-key encryption is based on each user having two keys:
  - public key: a publicly published key used to encrypt data, but which cannot be used to decrypt data;
  - private key: a key known only to the individual user, used to decrypt data. It need not be transmitted to the site doing the encryption.
- The encryption scheme is such that it is impossible or extremely hard to decrypt data given only the public key.
- The RSA public-key encryption scheme is based on the hardness of factoring a very large number (hundreds of digits) into its prime components.
Authentication (Challenge-Response Systems)
- Password-based authentication is widely used, but is susceptible to sniffing on a network.
- Challenge-response systems avoid the transmission of passwords:
  - The DB sends a (randomly generated) challenge string to the user.
  - The user encrypts the string and returns the result.
  - The DB verifies the user's identity by decrypting the result.
A public-key encryption system can also be used: the DB sends a message encrypted using the user's public key, and the user decrypts the message and sends it back.
Digital signatures are used to verify the authenticity of data.
- The private key is used to sign data, and the signed data is made public. Anyone can read the data with the public key, but cannot generate signed data without the private key.
- Digital signatures also help ensure non-repudiation: the sender cannot later claim not to have created the data.
Digital Certificates
Digital certificates are used to verify the authenticity of public keys.
- Problem: when you communicate with a web site, how do you know whether you are talking with the genuine web site or an imposter? Solution: use the public key of the web site.
- Problem: how do you verify whether the public key itself is genuine? Solution:
  - Every client (e.g. browser) has the public keys of a few root-level certification authorities.
  - A site can get its name/URL and public key signed by a certification authority; the signed document is called a certificate.
  - The client can use the public key of the certification authority to verify the certificate.
  - Multiple levels of certification authorities can exist. Each certification authority presents its own public-key certificate signed by a higher-level authority, and uses its private key to sign the certificates of other web sites/authorities.
Statistical Databases
Statistical databases are used mainly to produce statistics about various populations.
The database may contain confidential data about individuals, which should be protected from user access. However, users are permitted to retrieve statistical information about the populations, such as averages, sums, counts, maximums, minimums, and standard deviations. Consider, for example, a PERSON relation with attributes such as Income, Address, City, State, Zip, Sex, and Last_degree.
- A population is a set of tuples of a relation (table) that satisfy some selection condition. Hence, each selection condition on the PERSON relation will specify a particular population of PERSON tuples. For example, the condition Sex = 'M' specifies the male population; the condition ((Sex = 'F') AND (Last_degree = 'M.S.' OR Last_degree = 'Ph.D.')) specifies the female population that has an M.S. or Ph.D. degree as its highest degree; and the condition City = 'Houston' specifies the population that lives in Houston.
- Statistical queries involve applying statistical functions to a population of tuples. For example, we may want to retrieve the number of individuals in a population or the average income in the population. However, statistical users are not allowed to retrieve individual data, such as the income of a specific person.
- Statistical database security techniques must prohibit the retrieval of individual data. This can be achieved by prohibiting queries that retrieve attribute values and by allowing only queries that involve statistical aggregate functions such as COUNT, SUM, MIN, MAX, AVERAGE, and STANDARD DEVIATION. Such queries are sometimes called statistical queries.
- It is the responsibility of a database management system to ensure the confidentiality of information about individuals, while still providing useful statistical summaries of data about those individuals to users. Provision of privacy protection of users in a statistical database is paramount; its violation is illustrated in the following example. In some cases it is possible to infer the values of individual tuples from a sequence of statistical queries. This is particularly true when the conditions result in a
population consisting of a small number of tuples. As an illustration, consider the following statistical queries:
Q1: SELECT COUNT (*) FROM PERSON WHERE <condition>;
Q2: SELECT AVG (Income) FROM PERSON WHERE <condition>;
(Figure: the PERSON relation schema for illustrating statistical database security.)
Now suppose that we are interested in finding the income of Jane Smith, and we know that she has a Ph.D. degree and that she lives in the city of Bellaire, Texas. We issue the statistical query Q1 with the following condition:
(Last_degree='Ph.D.' AND Sex='F' AND City='Bellaire' AND State='Texas')
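The attack then completes as sketched below, assuming Q1 returns a count of 1, so that the population contains only Jane Smith's tuple:

-- Q1 returned 1: the condition identifies a single individual, so the
-- "average" income of this population is exactly Jane Smith's income.
SELECT AVG (Income) FROM PERSON
WHERE Last_degree = 'Ph.D.' AND Sex = 'F'
  AND City = 'Bellaire' AND State = 'Texas';

This is why statistical databases typically refuse to answer queries whose qualifying population falls below some minimum size.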
INDUSTRIAL AND PRACTICAL CONNECTIVITY OF THE SUBJECT
Databases are used to support the internal operations of organizations and to underpin online interactions with customers and suppliers. Databases are used to hold administrative information and more specialized data, such as engineering data or economic models. Examples of database applications include computerized library systems, flight reservation systems, computerized parts inventory systems, and many content management systems that store websites as collections of webpages in a database.
PRACTICAL ORIENTATION:
Practical lab work with ORACLE will help students apply the DBMS concepts learned in theory. Exposure to real-time database server administration, configuration and architecture will further help students understand the application of DBMS in industry.
B.E / B.TECH DEGREE EXAMINATION, NOVEMBER/DECEMBER 2014
Third Semester
Computer Science and Engineering
CS6302 – DATABASE MANAGEMENT SYSTEMS
(Common to Information Technology)
(Regulation 2013)
Time: Three Hours Maximum: 100 marks
Answer ALL Questions
PART A – (10 x 2 = 20 Marks)
1. Why is 4NF more desirable as a normal form than BCNF? [Pg No.7]
2. What is the purpose of a Database Management System? [Pg No.5]
3. Differentiate between Dynamic and Static SQL. [Pg No.66]
4. Give a brief description of DCL commands. [Pg No.66]
5. Define the properties of a Transaction. [Pg No.96]
6. What is Serializability? How is it tested? [Pg No.96]
7. Differentiate between static and dynamic hashing. [Pg No.135]
8. Define Data Mining and Data Warehousing. [Pg No.135&136]
9. What is Crawling and Indexing the web? [Pg No.166]
10. What is Relevance Ranking? [Pg No.166]
PART B – (5 X 16 = 80 marks)
11. (a) Write short notes on: (16)
(i) Data model and its types. [Pg No.21]
(ii) E-R Diagram for a Banking System. [Pg No.26]
Or
(b) What are Normal Forms? Explain the types of normal form with an example. [Pg No.46] (16)
12. (a) Explain the following with examples:
(i) DDL [Pg No.74] (4)
(ii) DML [Pg No.74] (4)
(iii) Embedded SQL [Pg No.69] (8)
Or
(b) Give a detailed description of Query Processing and Optimization. Explain the cost estimation of Query Optimization. [Pg No.83] (16)
13. (a) What is Concurrency? Explain it in terms of locking mechanisms and the Two Phase Commit Protocol. [Pg No.119] (16)
Or
(b) Write short notes on:
(i) Transaction concept [Pg No.101] (8)
(ii) Deadlock [Pg No.129] (8)
14. (a) (i) Explain in detail RAID technology. [Pg No.137] (8)
(ii) Write short notes on Spatial and Mobile Databases. [Pg No.153] (8)
Or
(b) Explain in detail about (i) B+ tree index files (ii) B tree index files. [Pg No.152] (16)
15. (a) (i) Write short notes on Distributed transactions. [Pg No.170] (8)
(ii) Explain discretionary access control based on granting and revoking privileges. [Pg No.181] (8)
Or
(b) Write short notes on: (16)
(i) Classification [Pg No.176]
(ii) Clustering [Pg No.176]
B.E / B.TECH DEGREE EXAMINATION, APRIL/MAY 2015
Third Semester
Computer Science and Engineering
CS6302 – DATABASE MANAGEMENT SYSTEMS
(Common to Information Technology)
(Regulation 2013)
Time: Three Hours Maximum: 100 marks
Answer ALL Questions
PART A – (10 x 2 = 20 Marks)
1. Write the characteristics that distinguish the database approach from the file-based approach. [Pg No.5]
2. Define Functional Dependency. [Pg No.5]
3. State the need for Query Optimization. [Pg No.65]
4. What is the difference between Static and Dynamic SQL? [Pg No.66]
5. Write the ACID properties of a Transaction. [Pg No.96]
6. Define: DDL, DML, DCL and TCL. [Pg No.74]
7. How does dynamic hashing differ from static hashing? [Pg No.135]
8. Write about the four types (Star, Snowflake, Galaxy and Fact Constellation) of data warehouse schema. [Pg No.136]
9. Define Threats and Risks. [Pg No.166]
10. What is Association rule mining? [Pg No.167]
PART B – (5 X 16 = 80 marks)
11. (a) Draw an E-R Diagram for a restaurant Menu Ordering System, which will facilitate food item ordering and services within a restaurant. The entire restaurant scenario is detailed as follows. Customers, through their wireless tablet PCs, are able to view the food items menu, call the waiter, place orders and obtain the final bill. Waiters, with their touch-display interfaces to the system, are able to initialize a table for customers, control the table functions to assist customers, and send orders to the food preparation staff (chefs). Chefs are able to view the orders sent to the kitchen by waiters. During preparation, they are able to let the waiter know the status of each item, and can send notifications when items are completed. The
system should have full accountability and logging facilities, and should support supervisor actions to account for exceptional circumstances, such as a meal being refunded or walked out on. (16) [Pg No.26]
Or
(b) State the need for normalization of a database and explain the various normal forms (1st, 2nd, 3rd, BCNF, 4th, 5th and Domain Key) with suitable examples. (16) [Pg No.46]
12. (a) Consider a student registration database comprising the table schemas given below.
Student File: Student Number, Student Name, Address, Telephone
Course File: Course Number, Description, Hours, Professor Number
Professor File: Professor Number, Name, Office
Registration File: Student Number, Course Number, Date
Consider a suitable sample of tuples/records for the above mentioned tables and write DML statements (SQL) to answer the queries listed below. (16) [Pg No.93]
(i) Which courses does a specific professor teach?
(ii) What courses are taught by two specific professors?
(iii) Who teaches a specific course and where is his/her office?
(iv) For a specific student number, in which courses is the student registered and what is his/her name?
(v) Who are the professors for a specific student?
(vi) Who are the students registered in a specific course?
Or
(b) Discuss join order optimization and heuristic optimization. [Pg No.86] (16)
13. (a) Explain the Two-Phase Commit and Three-Phase Commit protocols. [Pg No.114] (16)
Or
(b) Consider the following schedules. The actions are listed in the order they are scheduled, and prefixed with the transaction name. [Pg No.107] (16)
S1: T1:R(X), T2:R(X), T1:W(Y), T2:W(Y), T1:R(Y), T2:R(Y)
S2: T3:W(X), T1:R(X), T1:W(Y), T2:R(Z), T2:W(Z), T3:R(Z)
14. (a) With suitable diagrams, discuss the RAID levels (Level 0, Level 1, Level 0+1, Level 3, Level 4 and Level 5). [Pg No.137] (16)
Or
(b) Explain the architectural components of a Data Warehouse and write about Data Marts. [Pg No.189] (16)
15. (a) Neatly write the K-means algorithm and show the intermediate results in clustering the below given points into clusters using the K-means algorithm. (16) [Pg No.174]
P1: (0,0), P2: (1,10), P3: (2,20), P4: (1,15), P5: (1000,2000), P6: (1500,1500), P7: (1000,1250)
Or
(b) Discuss the access control mechanisms and cryptography methods used to secure the database. [Pg No.181] (16)
B.E / B.TECH DEGREE EXAMINATION, NOVEMBER/DECEMBER 2015
Third Semester
Computer Science and Engineering
CS6302 – DATABASE MANAGEMENT SYSTEMS
(Common to Information Technology)
(Regulation 2013)
Time: Three Hours Maximum: 100 marks
Answer ALL Questions
PART A – (10 x 2 = 20 Marks)
1. State the anomalies of 1NF. [Pg No.7]
2. Is it possible for several attributes to have the same domain? Illustrate your answer with suitable examples. [Pg No.7]
3. Differentiate static and dynamic SQL. [Pg No.66]
4. Why does SQL allow duplicate tuples in a table or in a query result? [Pg No.66]
5. What is meant by concurrency control? [Pg No.98]
6. Give an example of the Two Phase Commit protocol. [Pg No.99]
7. Differentiate static and dynamic hashing. [Pg No.135]
8. Give an example of a join that is not a simple equi-join for which partitioned parallelism can be used. [Pg No.136]
9. List the types of privileges used in database access control. [Pg No.167]
10. Can we have more than one constructor in a class? If yes, explain the need for such a situation. [Pg No.167]
PART B – (5 X 16 = 80 marks)
11. (a) (i) With the help of a neat block diagram, explain the basic architecture of a database management system. [Pg No.18] (8)
(ii) What are the advantages of having centralised control of data? Illustrate your answer with a suitable example. [Pg No.12] (8)
Or
(b) A car rental company maintains a database for all vehicles in its current fleet. For all vehicles, it includes the vehicle identification number, license
number, manufacturer, model, date of purchase and color. Special data are included for certain types of vehicles:
Trucks: cargo capacity
Sports cars: horsepower, renter age requirement
Vans: number of passengers
Off-road vehicles: ground clearance, drive train (four- or two-wheel drive)
Construct an ER model for the car rental company database. [Pg No.26] (16)
12. (a) Describe the six clauses in the syntax of an SQL query, and show what type of constructs can be specified in each of the six clauses. Which of the six clauses are required and which are optional? [Pg No.80] (16)
Or
(b) Assume the following tables: [Pg No.95]
Degree (degcode, name, subject)
Candidate (seatno, degcode, name, semester, month, year, result)
Marks (seatno, degcode, semester, month, year, papcode, marks)
degcode – degree code; name – name of the degree (M.Sc., M.Com.); subject – subject of the course, e.g. Phy; papcode – paper code, e.g. A1.
Solve the following queries using SQL: (16)
(i) Write a SELECT statement to display all the degree codes which are there in the candidate table but not present in the degree table, in the order of degcode. (4)
(ii) Write a SELECT statement to display the names of all the candidates who have got less than 40 marks in exactly 2 subjects. (4)
(iii) Write a SELECT statement to display the name, subject and number of candidates for all degrees in which there are less than 5 candidates. (4)
(iv) Write a SELECT statement to display the names of all the candidates who have got the highest total marks in M.Sc. (Maths). (4)
13. (a) (i) What is concurrency control? How is it implemented in a DBMS? Illustrate
with a suitable example. [Pg No.119] (8)
(ii) Discuss view serializability and conflict serializability. [Pg No.107] (8)
Or
(b) What is deadlock? How does it occur? How can transactions be written to
(i) avoid deadlock [Pg No.129] (8)
(ii) guarantee correct execution [Pg No.119] (8)
Illustrate with a suitable example.
14. (a) (i) What is RAID? List the different levels in RAID technology and explain its features. [Pg No.137] (8)
(ii) Illustrate indexing and hashing techniques with suitable examples. [Pg No.142] (8)
Or
(b) Write short notes on: (8+8)
(i) Spatial and multimedia databases [Pg No.153]
(ii) Mobile and web databases [Pg No.156]
15. (a) (i) Describe the GRANT function and explain how it relates to security. What types of privileges may be granted? How could rights be revoked? [Pg No.181] (8)
(ii) Write short notes on Data Warehousing. [Pg No.189] (8)
Or
(b) Suppose an object-oriented database has an object A, which references object B, which in turn references object C. Assume all objects are on disk initially. Suppose a program first dereferences A, then dereferences B by following the reference from A, and then finally dereferences C. Show the objects that are represented in memory after each dereference, along with their state. (16) [Pg No.178]
B.E / B.TECH DEGREE EXAMINATION, MAY/JUNE 2016
Third Semester
Computer Science and Engineering
CS6302 – DATABASE MANAGEMENT SYSTEMS
(Common to Information Technology)
(Regulation 2013)
Time: Three Hours Maximum: 100 marks
Answer ALL Questions
PART A – (10 x 2 = 20 Marks)
1. What are the disadvantages of the file processing system? [Pg No.6]
2. Explain the entity-relationship model. [Pg No.6]
3. Name the categories of SQL commands. [Pg No.65]
4. Explain query optimization. [Pg No.65]
5. What are the properties of a transaction? [Pg No.96]
6. Differentiate the strict two-phase locking protocol and the rigorous two-phase locking protocol. [Pg No.97]
7. What is meant by garbage collection? [Pg No.137]
8. Define software and hardware RAID systems. [Pg No.137]
9. Explain data classification. [Pg No.168]
10. What are the advantages of a data warehouse? [Pg No.168]
PART B – (5 X 16 = 80 marks)
11. (a) Briefly explain about database system architecture. [Pg No.18] (16)
Or
(b) Briefly explain about views of data. [Pg No.24] (16)
12. (a) (i) Explain about SQL fundamentals. [Pg No.69] (8)
(ii) Explain about the Data Definition Language. [Pg No.74] (8)
Or
(b) Briefly explain about query processing. [Pg No.83] (16)
13. (a) Briefly explain about Two Phase Commit. [Pg No.114] (16)
Or
(b) Explain about locking protocols. [Pg No.119] (16)
14. (a) Briefly explain RAID and RAID levels. [Pg No.137] (16)
Or
(b) Briefly explain about B+ tree index files with an example. [Pg No.152] (16)
15. (a) Explain about Distributed Databases and their characteristics, functions, advantages and disadvantages. [Pg No.160] (16)
Or
(b) Explain the types of database security and database security issues. [Pg No.172] (16)
-o0o-o0o-o0o-
B.E / B.TECH DEGREE EXAMINATION, NOVEMBER/DECEMBER 2016
Third Semester
Computer Science and Engineering
CS6302 – DATABASE MANAGEMENT SYSTEMS
(Common to Information Technology)
(Regulation 2013)
Time: Three Hours Maximum: 100 marks
Answer ALL Questions
PART A – (10 x 2 = 20 Marks)
1. Differentiate the file processing system from the database management system.
2. What is a weak entity? Give an example.
3. What is a data definition language? Give an example.
4. Differentiate between static and dynamic SQL.
5. What is "serializability"?
6. List the four conditions for deadlock.
7. List out the mechanisms to avoid collision during hashing.
8. What are the advantages of a B tree over a B+ tree?
9. Define a distributed database management system.
10. How does the concept of an object in the object-oriented model differ from the concept of an entity in the entity-relationship model?
PART B – (5 x 13 = 65 marks)
11. a) i) Explain the select, project and Cartesian product operations in relational algebra with an example. (6)
ii) Construct an E-R diagram for a car insurance company whose customers own one or more cars each. Each car has associated with it zero to any number of recorded accidents. Each insurance policy covers one or more cars, and has one or more premium payments associated with it. Each payment is for a particular period of time, and has an associated due date, and the date when the payment was received. (7)
Or
b) Explain first normal form, second normal form, third normal form and BCNF with an example. (13)
12. a) Let relations r1(A, B, C) and r2(C, D, E) have the following properties: r1 has 20,000 tuples, r2 has 45,000 tuples, 25 tuples of r1 fit on one block, and 30 tuples of r2 fit on one block. Estimate the number of block
transfers and seeks required, using each of the following join strategies for r1 ⋈ r2: (13)
i. Nested-loop join
ii. Block nested-loop join
iii. Merge join
iv. Hash join
Or
b) i) Explain query optimization with an example. (8)
ii) What is embedded SQL? Give an example. (5)
13. a) i) Consider the following two transactions:
T1: read(A); read(B); if A = 0 then B := B + 1; write(B).
T2: read(B); read(A); if B = 0 then A := A + 1; write(A).
Add lock and unlock instructions to transactions T1 and T2 so that they observe the two-phase locking protocol. Can the execution of these transactions result in a deadlock? (8)
ii) Consider the following extension to the tree-locking protocol, which allows both shared and exclusive locks: a transaction can be either a read-only transaction, in which case it can request only shared locks, or an update transaction, in which case it can request only exclusive locks. Each transaction must follow the rules of the tree protocol. Read-only transactions may lock any data item first, whereas update transactions must lock the root first. Show that the protocol ensures serializability and freedom from deadlock. (5)
Or
13. b) i) Illustrate the two-phase locking protocol with an example. (6)
ii) Outline deadlock handling mechanisms. (7)
14. a) i) Explain the architecture of a distributed database system. (7)
ii) Explain the concept of RAID. (6)
Or
b) i) Describe the benefits and drawbacks of a source-driven architecture for gathering data at a data warehouse, as compared to a destination-driven architecture. (7)
ii) Explain the concept of spatial databases. (6)
15. a) Suppose that you have been hired as a consultant to choose a database system for your client's application. For each of the following applications, state what type of database system (relational, persistent programming language-based OODB, or object-relational; do not specify a commercial product) you would recommend. Justify your recommendation. (13)
i. A computer-aided design system for a manufacturer of airplanes.
ii. A system to track contributions made to candidates for public office.
iii. An information system to support the making of movies.
Or
b) Discuss the Apriori algorithm for mining association rules with an example. (13)
Part – C (1 x 15 = 15 Marks)
16. a) Give the DTD or XML Schema for an XML representation of the following nested-relational schema:
Emp = (ename, ChildrenSet setof(Children), SkillsSet setof(Skills))
Children = (name, Birthday)
Birthday = (day, month, year)
Skills = (type, ExamsSet setof(Exams))
Exams = (year, city)
Or
b) Consider the following bitmap technique for tracking free space in a file. For each block in the file, two bits are maintained in the bitmap. If the block is between 0 and 30 percent full the bits are 00, between 30 and 60 percent the bits are 01, between 60 and 90 percent the bits are 10, and above 90 percent the bits are 11. Such bitmaps can be kept in memory even for quite large files.
  • 231. B.E / B.TECH DEGREE EXAMINATION MAY / JUNE 2017 Third Semester Computer Science and Engineering CS6302 – DATABASE MANAGEMENT SYSTEMS (Common to Information Technology) (Regulation 2013) Time: Three Hours Maximum : 100 marks Answer all question Part A (10 x 2 = 20 Mark) 1. What are the desirable properties of decomposition? 2. Distinguish between key and super key. 3. What is a query execution plan? 4. Which cost component are used most often as the basis for cost 5. What is serializable schedule? 6. What type of locking needed for insert and delete operations? 7. Define replication transparency. 8. State the function of data marts. 9. Define Support and confidence. 10. Distinguish between threats and risks. Part B (5 x 13 = 65 Marks) 11. a) Discuss the correspondence between ER model construct and relational model constructs. Show how each ER model can be mapped to the relational model. Discuss the option for mapping EER model construct. Or b) i) Explain the overall architecture of data base system in detail . (8) ii) List the operations of relational algebra and the purpose of each with example. (5) 12. a) What is meant by semantic query optimization? How does it differ from other query optimization technique? Give example. Or b) Justify the need of embedded SQL.( Consider the relation student Reg no, mark, name, grade). Write embedded dynamic SQL program in C language to retrieve all the students records whose mark is more than 90. Visit & Downloaded from : www.LearnEngineering.in Visit & Downloaded from : www.LearnEngineering.in
13. a) Discuss the violations caused by each of the following: dirty read, non-repeatable read and phantoms, with suitable examples.
Or
b) Explain why timestamp-based concurrency control allows schedules that are not recoverable. Describe how it can be modified through buffering to disallow such schedules.
14. a) i) Compare and contrast distributed databases and centralized database systems. (8)
ii) Describe the mobile database recovery schemes. (5)
Or
b) Explain what a RAID system is. How does it improve performance and reliability? Discuss level 3 and level 4 of RAID. (3 + 4 + 6)
15. a) i) What are the basic crawling operations? Explain the processing steps involved in the crawling procedure with an example. (8)
ii) Explain the process of querying XML data with an example. (5)
Or
b) Describe the various components of a data warehouse and explain the different data models used to store the data, with examples.
Part – C (1 x 15 = 15 Marks)
16. a) Consider the relation schemas given below. Design and draw an ER diagram that captures the information of this schema. (5)
Employee (empno, name, office, age)
Books (Isbn, title, authors, publisher)
Loan (empno, Isbn, date)
Write the following queries in relational algebra and SQL:
(i) Find the names of employees who have borrowed a book published by McGraw-Hill. (5)
(ii) Find the names of employees who have borrowed all books published by McGraw-Hill. (5)
Or
b) Trace the results of using the Apriori algorithm on the grocery store example with support threshold s = 33.34% and confidence threshold c = 60%. Show the candidate and frequent itemsets for each database scan.
Enumerate all the final frequent itemsets. Also indicate the association rules that are generated and highlight the strong ones; sort them by confidence.
Transaction ID    Items
T1                HotDogs, Buns, Ketchup
T2                HotDogs, Buns
T3                HotDogs, Coke, Chips
T4                Chips, Coke
T5                Chips, Ketchup
T6                HotDogs, Coke, Chips