Assisting Migration and Evolution of Relational Legacy Databases

by

G.N. Wikramanayake

Department of Computer Science,
University of Wales Cardiff,
Cardiff

September 1996
Assisting Migration and Evolution of Relational Legacy Databases
Abstract

The research work reported here is concerned with enhancing and preparing databases with limited
DBMS capability for migration to keep up with current database technology. In particular, we have
addressed the problem of re-engineering heterogeneous relational legacy databases to assist them in
a migration process. Special attention has been paid to the case where the legacy database service
lacks the specification, representation and enforcement of integrity constraints. We have shown how
constraint knowledge reflecting modern DBMS capabilities can be incorporated into these systems to
ensure that, when migrated, they can benefit from current database technology.

To this end, we have developed a prototype conceptual constraint visualisation and enhancement
system (CCVES) to automate as efficiently as possible the process of re-engineering for a
heterogeneous distributed database environment, thereby assisting the global system user in
preparing their heterogeneous database systems for a graceful migration. Our prototype system has
been developed using a knowledge based approach to support the representation and manipulation of
structural and semantic information about schemas that the re-engineering and migration process
requires. It has a graphical user interface, including graphical visualisation of schemas with
constraints using user preferred modelling techniques for the convenience of the user. The system
has been implemented using meta-programming technology because of the proven power and
flexibility that this technology offers to this type of research application.

The important contributions resulting from our research include extending the benefits of meta-
programming technology to the very important application area of evolution and migration of
heterogeneous legacy databases. In addition, we have provided an extension to various relational
database systems to enable them to overcome their limitations in the representation of meta-data.
These extensions contribute towards the automation of the reverse-engineering process of legacy
databases, while allowing the user to analyse them using extended database modelling concepts.




CHAPTER 1

                                          Introduction

This chapter introduces the thesis. Section 1.1 is devoted to the background and motivations of the
research undertaken. Section 1.2 presents the broad goals of the research. The original achievements
which have resulted from the research are summarised in Section 1.3. Finally, the overall
organisation of the thesis is described in Section 1.4.

1.1 Background and Motivations of the Research

         Over the years rapid technological changes have taken place in all fields of computing. Most
of these changes have been due to the advances in data communications, computer hardware and
software [CAM89] which together have provided a reliable and powerful networking environment
(i.e. standard local and wide area networks) that allows the management of data stored in computing
facilities at many nodes of the network [BLI92]. These changes have transformed the hardware
technology from centralised mainframes to networked file-server and client-server architectures
[KHO92] which support various ways to use and share data. Modern computers are much more
powerful than the previous generations and perform business tasks at a much faster rate by using
their increased processing power [CAM88, CAM89]. Simultaneous developments in the software
industry have produced techniques (e.g. for system design and development) and products capable of
utilising the new hardware resources (e.g. multi-user environments with GUIs). These new
developments are being used for a wide variety of applications, including modern distributed
information processing applications, such as office automation where users can create and use
databases with forms and reports with minimal effort, compared to the development efforts using
3GLs [HIR85, WOJ94]. Such applications are being developed with the aid of database technology
[ELM94, DAT95] as this field too has advanced by allowing users to represent and manipulate
advanced forms of data and their functionalities. Due to the program-data independence feature of
DBMSs, the maintenance of database application programs has become easier, as functionalities that
were traditionally performed by procedural application routines are now supported declaratively
using database concepts such as constraints and rules.

       In the field of databases, the recent advances resulting from technological transformation
include many areas such as the use of distributed database technology [OZS91, BEL92], object-
oriented technology [ATK89, ZDO90], constraints [DAT83, GRE93], knowledge-based systems
[MYL89, GUI94], 4GLs and CASE tools [COMP90, SCH95, SHA95]. Meanwhile, the older
technology was dealing with files and primitive database systems which now appear inflexible, as
the technology itself limits them from being adapted to meet the current changing business needs
catalysed by newer technologies. The older systems, which were developed using 3GLs and have
been in operation for many years, often suffer from failures, inappropriate functionality, lack of
documentation and poor performance, and are referred to as legacy information systems [BRO93,
COMS94, IEE94, BRO95, IEEE95].

       The current technology is much more flexible as it supports methods to evolve (e.g. 4GLs,
CASE tools, GUI toolkits and reusable software libraries [HAR90, MEY94]), and can share
resources through software that allows interoperability (e.g. ODBC [RIC94, GEI95]). This evolution
reflects the changing business needs. However, modern systems need to be properly designed and
implemented to benefit from this technology, and even then it may not prevent them from being
considered legacy information systems in the near future, due to the advent of the next generation of
technology with its own special features. The only salvation would appear to be to build evolution
paths into current systems.

         The increasing power of computers and their software has meant that they have already taken
over many day-to-day functions and are taking over more of these tasks as time passes. Thus
computers are managing a larger volume of information in a more efficient manner. Over the years
most enterprises have adopted the computerisation option to enable them to efficiently perform their
business tasks and to be able to compete with their counterparts. As the performance ability of
computers has increased, the enterprises still using early computer technology face serious problems
due to the difficulties that are inherent in their legacy systems.

        This means that new enterprises using systems purely based on the latest technology have an
advantage over those which need to continue to use legacy information systems (ISs), as modern ISs
have been developed using current technology which provides not only better performance, but also
utilises the benefits of improved functionality. Hence, managers of legacy IS enterprises want to
retire their legacy code and use modern database management systems (DBMSs) in the latest
environment to gain the full benefits from this newer technology. However they want to use this
technology on the information and data they already hold as well as on data yet to be captured. They
also want to ensure that any attempts to incorporate the modern technology will not adversely affect
the ongoing functionality of their existing systems. This means legacy ISs need to be evolved and
migrated to a modern environment in such a way that the migration is transparent to the current
users. The theme of this thesis is how we can support this form of system evolution.

         1.1.1 The Barriers to Legacy Information System Migration

       Legacy ISs are usually those systems that have stood the test of time and have become a core
service component for a business’s information needs. These systems are a mix of hardware and
software, sometimes proprietary, often out of date, and built to earlier styles of design,
implementation and operation. Although they were productive and fulfilled their original
performance criteria and their requirements, these systems lack the ability to change and evolve. The
following can be seen as barriers to evolution in legacy IS [IEE94].

    • The technology used to build and maintain the legacy IS is obsolete,
    • The system is unable to reflect changes in the business world and to support new needs,
    • The system cannot integrate with other sub-systems,
    • The cost, time and risk involved in producing new alternative systems to the legacy IS.

        The risk factor is that a new system may not provide the full functionality of the current
system for a period because of teething problems. Due to these barriers, large organisations [PHI94]
prefer to write independent sub-systems to perform new tasks using modern technology which will
run alongside the existing systems, rather than attempt to achieve this by adapting existing code or
by writing a new system that replaces the old and has new facilities as well. We see the following
immediate advantages of this low risk approach.



    • The performance, reliability and functionality of the existing system are not affected,
    • New applications can take advantage of the latest technology,
    • There is no need to retrain those staff who only need the facilities of the old system.

       However with this approach, as business requirements evolve with time, more and more new
needs arise, resulting in the development and regular use of many diverse systems within the same
organisation. Hence, in the long term the above advantages are overshadowed by the more serious
disadvantages of this approach, such as:

    • The existing systems continue to exist and are legacy IS running on older and older
      technology,
    • The need to maintain many different systems to perform similar tasks increases the
      maintenance and support costs of the organisation,
    • Data becomes duplicated in different systems which implies the maintenance of redundant data
      with its associated increased risk of inconsistency between the data copies if updating occurs,
    • The overall maintenance cost for hardware, software and support personnel increases as many
      platforms are being supported,
    • The performance of the integrated information functions of the organisation decreases due to
      the need to interface many disparate systems.

        To address the above issues, legacy ISs need to be evolved and migrated to new computing
environments, when their owning organisation upgrades. This migration should occur within a
reasonable time after the upgrade occurs. This means that it is necessary to migrate legacy ISs to
new target environments in order to allow the organisation to dispose of the technology which is
becoming obsolete. Managers of some enterprises have chosen an easy way to overcome this
problem, by emulating [CAM89, PHI94] the current environment on the new platforms (e.g. AS/400
emulators for IBM S/360 and ICL’s DME emulators for 1900 and System 4 users). An alternative
strategy is achieved by translating [SHA93, PHI94, SHE94, BRO95] the software to run in new
environments (i.e. code-to-code level translation). The emulator approach perpetuates all the
software deficiencies of the legacy ISs, although it successfully removes the old-fashioned hardware
technology and so does benefit from the increased processing power of the new hardware. The translation
approach takes advantage of some of the modern technological benefits in the target environment as
the conversions - such as IBM’s JCL and ICL’s SCL code to Unix shell scripts, Assembler to
COBOL, COBOL to COBOL embedded with SQL, and COBOL data structures to relational DBMS
tables - are also done as part of the translation process. This approach, although a step forward, still
carries over most of the legacy code as legacy systems are not evolved by this process. For example,
the basic design is not changed. Hence the barrier to change and/or integration to a common sub-
system still remains, and the translated systems were not designed for the environment they are now
running in, so they may not be compatible with it.

       There are other approaches to overcoming this problem which have been used by enterprises
[SHA93, BRO95]. These include re-implementing systems under the new environment and/or
upgrading existing systems to achieve performance improvements. As computer technology
continues to evolve at an ever quicker pace, the need to migrate arises more rapidly. This means that
most small organisations and individuals are left behind and forced to work in a technologically
obsolete environment, mainly due to the high cost of frequently migrating to newer systems and/or
upgrading existing software, as this process involves time and manpower which cost money. The
gap between the older and newer system users will very soon create a barrier to information sharing
unless some tools are developed to assist the older technology users’ migration to new technology
environments. This assistance for the older technology users may take many forms, including tools
for: analysing and understanding existing systems; enhancing and modifying existing systems;
migrating legacy ISs to newer platforms. The complete migration process for a legacy IS needs to
consider these requirements and many other aspects, as recently identified by Brodie and
Stonebraker in [BRO95]. Our work was primarily motivated by these business oriented legacy
database issues and by work in the area of extending relational database technology to enable it to
represent more knowledge about its stored data [COD79, STO86a, STO86b, WIK90]. This second
consideration is an important aspect of legacy system migration, since if a graceful migration is to be
achieved we must be able to enhance a legacy relational database with such knowledge to take full
advantage of the new system environment.

         1.1.2 Heterogeneous Distributed Environments

        As well as the problem of having to use legacy ISs, most large enterprises are faced with the
problem of heterogeneity and the need for interoperability between existing ISs [IMS91]. This arises
due to the increased use of different computer systems and software tools for information processing
within an organisation as time passes. The development of networking capabilities to manage and
share information stored over a network has made interoperability a requirement, and the broad
acceptance of local area networks in business enterprises has increased the need to perform this task
within organisations. Network file servers, client-server technology and the use of distributed
databases [OZS91, BEL92, KHO92] are results of these challenging innovations. This technology is
currently being used to create and process information held in heterogeneous databases, which
involves linking different databases in an interoperable environment. An aspect of this work is
legacy database interoperation, since as time passes these databases will have been built using
different generations of software.

        In recent years, the demand for distributed database capabilities has been fuelled mostly by
the decentralisation of business functions in large organisations to address customer needs, and by
mergers and acquisitions that have taken place in the corporate world. As a consequence, there is a
strong requirement among enterprises for the ability to cross-correlate data stored in different
existing heterogeneous databases. This has led to the development of products referred to as
gateways, to enable users to link different databases together, e.g. Microsoft’s Open Database
Connectivity (ODBC) drivers can link Access, FoxPro, Btrieve, dBASE and Paradox databases
together [COL94, RIC94]. There are similar products for other database vendors, such as Oracle¹
[HOL93] and others² [PUR93, SME93, RIC94, BRO95]. Database vendors have targeted cross-
platform compatibility via SQL access protocols to support interoperability in a heterogeneous
environment. As heterogeneity in distributed systems may occur in various forms, ranging from
different hardware platforms, operating systems, networking protocols and local database systems,
cross-platform compatibility via SQL provides only a simple form of heterogeneous distributed
database access. The biggest challenge comes in addressing heterogeneity due to differences in local
databases [OZS91, BEL92]. This challenge is also addressed in the design and development of our
system.

   ¹ For IBM's DB2, UNISYS's DMS, DEC RMS.
   ² For INGRES, SYBASE, Informix and other popular SQL DBMSs.
   ³ During the life-time of this project the SQL-3 standards moved from a preliminary draft, through
several modifications, before being finalised in 1995.

        Distributed DBMSs have become increasingly popular in organisations as they offer the
ability to interconnect existing databases, as well as having many other advantages [OZS91,
BEL92]. The interconnection of existing databases leads to two types of distributed DBMS, namely:
homogeneous and heterogeneous distributed DBMSs. In homogeneous systems all of the constituent
nodes run the same DBMS and the databases can be designed in harmony with each other. This
simplifies both the processing of queries at different nodes and the passing of data between nodes. In
heterogeneous systems the situation is more complex, as each node can be running a different
DBMS and the constituent databases can be designed independently. This is the normal situation
when we are linking legacy databases, as the DBMS and the databases used are more likely to be
heterogeneous since they are usually implemented for different platforms during different
technological eras. In such a distributed database environment, heterogeneity may occur in various
forms, at different levels [OZS91, BEL92], namely:

    • The logical level (i.e. involving different database designs),
    • The data management level (i.e. involving different data models),
    • The physical level (i.e. involving different hardware, operating systems and network
      protocols), and
    • At all three or any pair of these levels.

         1.1.3 The Problems and Search for a Solution

        The concept of heterogeneity itself is valuable as it allows designers a freedom of choice
between different systems and design approaches, thus enabling them to identify those most suitable
for different applications. The exploitation of this freedom over the years in many organisations has
resulted in the creation of multiple local and remote information systems which now need to be
made interoperable to provide an efficient and effective information service to the enterprise
managers. Open database connectivity (ODBC) [RIC94, GEI95] and its standards have been proposed
to support interoperability among databases managed by different DBMSs. Database vendors such
as Oracle, INGRES, INFORMIX and Microsoft have already produced tools, engines and
connectivity products to fulfil this task [HOL93, PUR93, SME93, COL94, RIC94, BRO95]. These
products allow limited data transfer and query facilities among databases to support interoperability
among heterogeneous DBMSs. These features, although they permit easy, transparent heterogeneous
database access, still do not provide a solution for legacy ISs, where a primary concern is to evolve and
migrate the system to a target environment so that obsolete support systems can be retired.
Furthermore, the ODBC facilities are developed for current DBMSs and hence may not be capable
of accessing older generation DBMSs, and, if they are, are unlikely to be able to enhance them to
take advantage of the newer technologies. Hence there is a need to create tools that will allow ODBC
equivalent functionality for older generation DBMSs. Our work provides such functionality for all
the DBMSs we have chosen for this research. It also provides the ability to enhance and evolve
legacy databases.



In order to evolve an information system, one needs to understand the existing system’s
structure and code. Most legacy information systems are not properly documented and hence
understanding such systems is a complex process. This means that changing any legacy code
involves a high risk as it could result in unexpected system behaviour. Therefore one needs to
analyse and understand existing system code before performing any changes to the system.

        Database system design and implementation tools have appeared recently which have the
aim of helping new information system development. Reverse and re-engineering tools are also
appearing in an attempt to address issues concerned with existing databases [SHA93, SCH95]. Some
of these tools allow the examination of databases built using certain types of DBMSs; however, the
enhancements they allow are made within the limitations of those systems. Due to continuous ongoing
technology changes, most current commercial DBMSs do not support the most recent software
modelling techniques and features (e.g. Oracle version 7 does not support Object-Oriented features).
Hence a system built using current software tools is guaranteed to become a legacy system in the
near future (i.e. when new products with newer techniques and features begin to appear in the
commercial market place).

        Reverse engineering tools [SHA93] are capable of recreating the conceptual model of an
existing database and hence they are an ideal starting point when trying to gain a comprehensive
understanding of the information held in the database and its current state, as they create a visual
picture of that state. However, in legacy systems the schemas are basic, since most of the
information used to compose a conceptual model is not available in these databases. Information
such as constraints that show links between entities is usually embedded in the legacy application
code and users find it difficult to reverse engineer these legacy ISs. Our work addresses these issues
while assisting in overcoming this barrier within the knowledge representation limitations of existing
DBMSs.

         1.1.4 Primary and Secondary Motivations

        The research reported in this thesis was therefore primarily prompted by the need to provide,
for a logically heterogeneous distributed database environment, a design tool that allows users not
only to understand their existing systems but also to enhance and visualise an existing database’s
structure using new techniques that are either not yet present in existing systems or not supported by
the existing software environment. It was also motivated by:

a) Its direct applicability in the business world, as the new technique can be applied to incrementally
    enhance existing systems and prepare them to be easily migrated to new target environments,
    hence avoiding continued use of legacy information systems in the organisation.

        Although previous work and some design tools address the issue of legacy information
system analysis, evolution and migration, these are mainly concerned with 3GL languages such as
COBOL and C [COMS94, BRO95, IEEE95]. Little work has been reported which addresses the new
issues that arise due to the Object-Oriented (O-O) data model or the extended relational data model
[CAT94]. There are no reports yet of enhancing legacy systems so that they can migrate to O-O or
extended relational environments in a graceful migration from a relational system. There has been
some work in the related areas of identifying extended entity relationship structures in relational
schemas, and some attempts at reverse-engineering relational databases [MAR90, CHI94, PRE94].

b) The lack of previous research in visualising pre-existing heterogeneous database schemas and
   evolving them by enhancing them with modern concepts supported in more recent releases of
   software.

         Most design tools [COMP90, SHA93] which have been developed to assist in Entity-
Relationship (E-R) modelling [ELM94] and Object Modelling Technique (OMT) modelling
[RUM91] are used in a top-down database design approach (i.e. forward engineering) to assist in
developing new systems. However, relatively few tools attempt to support a bottom-up approach
(i.e. reverse engineering) to allow visualisation of pre-existing database schemas as E-R or OMT
diagrams. Among these tools only a very few allow enhancement of the pre-existing database
schemas, i.e. they apply forward engineering to enhance a reverse-engineered schema. Even those
which do permit this action to some extent always operate on a single database management system
and work mostly with schemas originally designed using such systems (e.g. CASE tools). The tools
that permit only the bottom-up approach are referred to as reverse-engineering tools and those which
support both (i.e. bottom-up and top-down) are called re-engineering tools [SHA93]. This thesis is
primarily concerned with creating re-engineering tools that assist legacy database migration.

        The commercially available re-engineering tools are customised for particular DBMSs and
are not easily usable in a heterogeneous environment. This barrier against widespread usability of re-
engineering tools means that a substantial adaptation and reprogramming effort (costing time and
money) is involved every time a new DBMS appears in a heterogeneous environment. An obvious
example that reflects this limitation arises in a heterogeneous distributed database environment
where there may be a need to visualise each participant database’s schema. In such an environment
if the heterogeneity occurs at the database management level (where each node uses a different
DBMS, for example, one node uses INGRES [DAT87] and another uses Oracle [ROL92]), then we
have to use two different re-engineering tools to display these schemas. This situation is exacerbated
for each additional DBMS that is incorporated into the given heterogeneous context. Also, legacy
databases are migrated to different DBMS environments as newer versions and better database
products have appeared since the original release of their DBMS. This means that a re-engineering
tool that assists legacy database migration must work in a heterogeneous environment so that its
use will not be restricted to particular types of ISs.

        Existing re-engineering tools provide a single target graphical data model (usually the E-R
model or a variant of it), which may differ in presentation style between tools and therefore inhibits
the uniformity of visualisation that is highly desirable in an interoperable heterogeneous distributed
database environment. This limitation means that users may need to use different tools to provide the
required uniformity of display in such an environment. The ability to visualise the conceptual model
of an information system using a user-preferred graphical data model is important as it ensures that
no inaccurate enhancements are made to the system due to any misinterpretation of graphical
notations used.

c) The need to apply rules and constraints to pre-existing databases to identify and clean inconsistent
    legacy data, as preparation for migration or as an enhancement of the database’s quality.



The inability to define and apply rules and constraints in early database systems, owing to
system limitations, meant that these systems did not use constraints to increase the accuracy and
consistency of the data they held. This limitation is now a barrier to information system migration as a
new target DBMS is unable to enforce constraints on a migrated database until all violations are
investigated and resolved either by omitting the violating data or by cleaning it. This investigation
may also show that a constraint has to be adjusted as the violating data is needed by the organisation.
The enhancement of such a system by rules and constraints provides knowledge that is usable to
determine possible data violations. The process of detecting constraint violations may be done by
applying queries that are generated from these enhanced constraints. Similar methods have been
used to implement integrity constraints [STO75], optimise queries [OZS91] and obtain intensional
answers [FON92, MOT89]. Such checking is essential because constraints may have been implemented
at the application coding level, which can lead to their inconsistent application.
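
As a concrete illustration, the sketch below shows, in the Prolog style of our meta-programming
approach, how a violation-detection query might be generated from an enhanced constraint. The
predicate, table and column names (foreign_key/4, violation_query/5, enrolment, student) are
hypothetical and are not the actual CCVES code.

    % Hypothetical foreign key recorded as enhanced constraint knowledge:
    % foreign_key(ChildTable, ChildColumn, ParentTable, ParentColumn).
    foreign_key(enrolment, student_id, student, student_id).

    % Build the text of an SQL query that retrieves the child-table rows
    % violating the referential constraint (SWI-Prolog syntax assumed).
    violation_query(Child, ChildCol, Parent, ParentCol, SQL) :-
        foreign_key(Child, ChildCol, Parent, ParentCol),
        format(atom(SQL),
               'SELECT * FROM ~w c WHERE c.~w IS NOT NULL AND NOT EXISTS ( SELECT 1 FROM ~w p WHERE p.~w = c.~w )',
               [Child, ChildCol, Parent, ParentCol, ChildCol]).

Posing the goal violation_query(enrolment, student_id, student, student_id, SQL) binds SQL to a
query that can be run against the legacy database to list the offending rows.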

d) An awareness of the potential contribution that knowledge-based systems and meta-programming
   technologies, in association with extended relational database technology, have to offer in coping
   with semantic heterogeneity.

        The successful production of a conceptual model is highly dependent on the semantic
information available, and on the ability to reason about these semantics. A knowledge-based system
can be used to assist in this task, as the effective exploitation of semantic information in
pre-existing heterogeneous databases involves three sub-processes, namely:
knowledge acquisition, representation and manipulation. The knowledge acquisition process extracts
the existing knowledge from a database’s data dictionaries. This knowledge may include subsequent
enhancements made by the user, as the use of a database to store such knowledge will provide easy
access to this information along with its original knowledge. The knowledge representation process
represents existing and enhanced knowledge. The knowledge manipulation process is concerned
with deriving new knowledge and ensuring consistency of existing knowledge. These stages are
addressable using specific processes. For instance, the reverse-engineering process used to produce a
conceptual model can be used to perform the knowledge acquisition task. Then the derived and
enhanced knowledge can be stored in the same database by adopting a process that will allow us to
distinguish this knowledge from its original meta-data. Finally, knowledge manipulation can be done
with the assistance of a Prolog based system [GRA88], while data and knowledge consistency can be
verified using the query language of the database.
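
As an illustration of these three sub-processes, the sketch below holds acquired meta-data and
user-supplied enhancements as Prolog facts tagged with their origin, and derives new knowledge from
them by a rule; the predicate, relation and attribute names are hypothetical, not the actual CCVES
representation.

    % Knowledge acquisition: facts extracted from the data dictionary.
    attribute(student,   student_id, integer, original).
    attribute(enrolment, student_id, integer, original).

    % Knowledge representation: enhancements supplied later by the user are
    % kept distinguishable from the original meta-data by their origin tag.
    primary_key(student, student_id, enhanced).
    foreign_key(enrolment, student_id, student, student_id, enhanced).

    % Knowledge manipulation: derive a relationship between two entity types
    % wherever a foreign key links them, whatever its origin.
    relationship(Child, Parent) :-
        foreign_key(Child, _ChildCol, Parent, _ParentCol, _Origin).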

1.2 Goals of the Research

         The broad goals of the research reported in this thesis are highlighted here, with detailed aims
and objectives presented in section 2.4. These goals are to investigate interoperability problems,
schema enhancement and migration in a heterogeneous distributed database environment, with
particular emphasis on extended relational systems. This should provide a basis for the design and
implementation of a prototype software system that brings together new techniques from the areas of
knowledge-based systems, meta-programming and O-O conceptual data modelling with the aim of
facilitating schema enhancement, by means of generalising the efficient representation of constraints
using the current standards. Such a system is a tool that would be a valuable asset in a logically
heterogeneous distributed extended relational database environment as it would make it possible for



Page 10
global users to incrementally enhance legacy information systems. This offers the potential for users
in this type of environment to work in terms of such a global schema, through which they can
prepare their legacy systems to easily migrate to target environments and so gain the benefits of
modern computer technology.

1.3 Original Achievements of the Research

        The importance of this research lies in establishing the feasibility of enhancing, cleaning and
migrating heterogeneous legacy databases using meta-programming technology, knowledge-based
system technology, database system technology and O-O conceptual data modelling concepts, to
create a comprehensive set of techniques and methods that form an efficient and useful generalised
database re-engineering tool for heterogeneous sets of databases. The benefits such a tool can bring
are also demonstrated and assessed.

       A prototype Conceptual Constraint Visualisation and Enhancement System (CCVES)
[WIK95a] has been developed as a result of the research. To be more specific, our work has made
four important contributions to progress in the database topic area of Computer Science:

1) CCVES is the first system to bring the benefits of meta-programming technology to the very
   important application area of enhancing and evolving heterogeneous distributed legacy databases
   to assist the legacy database migration process [GRA94, WIK95c].

2) CCVES is also the first system to enhance existing databases with constraints to improve their
   visual presentation and hence provide a better understanding of existing applications [WIK95b].
   This process is applicable to any relational database application, including those which are unable
   to naturally support the specification and enforcement of constraints. More importantly, this
   process does not affect the performance of an existing application.

3) As will be seen later, we have chosen the current SQL-3 standards [ISO94] as the basis for
   knowledge representation in our research. This project provides an extension to the
   representation of the relational data model to cope with automated reuse of knowledge in the re-
   engineering process. In order to cope with technological changes that result from the emergence
   of new systems or new versions of existing DBMSs, we also propose a series of extended
   relational system tables conforming to SQL-3 standards to enhance existing relational DBMSs
   [WIK95b].

4) The generation of queries using the constraint specifications of the enhanced legacy systems is an
   easy and convenient method of detecting any constraint violating data in existing systems. The
   application of this technique in the context of a heterogeneous environment for legacy
   information systems is a significant step towards detecting and cleaning inconsistent data in
   legacy systems prior to their migration. This is essential if a graceful migration is to be effected
   [WIK95c].

1.4 Organisation of the Thesis




The thesis is organised into 8 chapters. This first chapter has given an introduction to the
research done, covering background and motivations, and outlining original achievements. The rest
of the thesis is organised as follows:

Chapter 2 is devoted to presenting an overview of the research together with detailed aims and
objectives for the work undertaken. It begins by identifying the scope of the work in terms of
research constraints and development technologies. This is followed by an overview of the research
undertaken, where a step by step discussion of the approach adopted and its role in a heterogeneous
distributed database environment is given. Finally, detailed aims and objectives are drawn together
to conclude the chapter.

Chapter 3 identifies the relational data model as the current dominant database model and presents
its development along with its terminology, features and query languages. This is followed by a
discussion of conceptual data models with special emphasis on the data models and symbols used in
our project. Finally, we pay attention to key concepts related to our project, mainly the notion of
semantic integrity constraints and extensions to the relational model. Here, we present important
integrity constraint extensions to the relational model and its support using different SQL standards.

Chapter 4 addresses the issue of legacy information system migration. The discussion commences
with an introduction to legacy and our target information systems. This is followed by migration
strategies and methods for such ISs. Finally, we conclude by referring to current techniques and
identify the trends and existing tools applicable to database migration.

Chapter 5 addresses the re-engineering process for relational databases. Techniques currently used
for this purpose are identified first. Our approach, which uses constraints to re-engineer a relational
legacy database is described next. This is followed by a process for detecting possible keys and
structures of legacy databases. Our schema enhancement and knowledge representation techniques
are then introduced. Finally, we present a process to detect and resolve conflicts that may occur due
to schema enhancement.

Chapter 6 introduces some example test databases which were chosen to represent a legacy
heterogeneous distributed database environment and its access processes. Initially, we present the
design of our test databases, the selection of our test DBMSs and the prototype system environment.
This is followed by the application of our re-engineering approach to our test databases. Finally, the
organisation of relational meta-data and its access is described using our test DBMSs.

Chapter 7 presents the internal and external architecture and operation of our conceptual constraint
visualisation and enhancement system (CCVES) in terms of the design, structure and operation of its
interfaces, and its intermediate modelling system. The internal schema mappings, e.g. mapping from
INGRES QUEL to SQL and vice-versa, and internal database migration processes are presented in
detail here.

Chapter 8 provides an evaluation of CCVES, identifying its limitations and improvements that could
be made to the system. A discussion of potential applications is presented. Finally we conclude the
chapter by drawing conclusions about the research project as a whole.




CHAPTER 2

                   Research Scope, Approach, Aims and Objectives

This chapter describes, in some detail, the aims and objectives of the research that has been
undertaken. Firstly, the boundaries of the research are defined in section 2.1, which considers the
scope of the project. Secondly, an overview of the research approach we have adopted in dealing
with heterogeneous distributed legacy database evolution and migration is given in section 2.2.
Next, in section 2.3, the discussion is extended to the wider aspects of applying our approach in a
heterogeneous distributed database environment using the existing meta-programming technology
developed at Cardiff in other projects. Finally, the research aims and objectives are detailed in
section 2.4, illustrating what we intend to achieve, and the benefits expected from achieving the
stated aims.

2.1 Scope of the Project

       We identify the scope of the work in terms of research constraints and the limitations of
current development technologies. An overview of the problem is presented along with the
drawbacks and limitations of database software development technology in addressing the
problem. This will assist in identifying our interests and focussing the issues to be addressed.

       2.1.1 Overview of the Problem

        In most database designs, a conceptual design and modelling technique is used in
developing the specifications at the user requirements and analysis stage of the design. This stage
usually describes the real world in terms of object/entity types that are related to one another in
various ways [BAT92, ELM94]. Such a technique is also used in reverse-engineering to portray
the current information content of existing databases, as the original designs are usually either
lost, or inappropriate because the database has evolved from its original design. The resulting
pictorial representation of a database can be used for database maintenance, for database re-
design, for database enhancement, for database integration or for database migration, as it gives its
users a sound understanding of an existing database’s architecture and contents.

        Only a few current database tools [COMP90, BAT92, SHA93, SCH95] allow the capture
and presentation of database definitions from an existing database, and the analysis and display of
this information at a higher level of abstraction. Furthermore, these tools are either restricted to
accessing a specific database management system’s databases or permit modelling with only a
single given display formalism, usually a variant of the EER [COMP90]. Consequently there is a
need to cater for multiple database platforms with different user needs to allow access to a set of
databases comprising a heterogeneous database, by providing a facility to visualise databases
using a preferred conceptual modelling technique which is familiar to the different user
communities of the heterogeneous system.

        The fundamental modelling constructs of current reverse and re-engineering tools are
entities, relationships and associated attributes. These constructs are useful for database design at
a high level of abstraction. However, the semantic information now available in the form of rules
and constraints in modern DBMSs provides their users with a better understanding of the
underlying database as its data conforms to these constraints. This may not necessarily be true for
legacy systems, which may have constraints defined that were not enforced. The ability to
visualise rules and constraints as part of the conceptual model increases user understanding of a
database. Users could also exploit this information to formulate queries that more effectively
utilise the information held in a database. Having these features in mind, we concentrated on
providing a tool that permits specification and visualisation of constraints as part of the graphical
display of the conceptual model of a database. With modern technology increasing the number of
legacy systems and with increasing awareness of the need to use legacy data [BRO95, IEEE95],
the availability of such a visualisation tool will be more important in future as it will let users see
the full definition of the contents of their databases in a familiar format.

         Three types of abstraction mechanism, namely: classification, aggregation and
generalisation, are used in conceptual design [ELM94]. However, most existing DBMSs do not
maintain sufficient meta-data information to assist in identifying all these abstraction mechanisms
within their data models. This means that reverse and re-engineering tools are semi-automated, in
that they extract information, but users have to guide them and decide what information to look
for [WAT94]. This requires interactions with the database designer in order to obtain missing
information and to resolve possible conflicts. Such additional information is supplied by the tool
users when performing the reverse-engineering process. As this additional information is not
retained in the database, it must be re-entered every time a reverse engineering process is
undertaken if the full representation is to be achieved. To overcome this problem, knowledge
bases are being used to retain this information when it is supplied. However, this approach
restricts the use of this knowledge by other tools which may exist in the database’s environment.
The ability to hold this knowledge in the database itself would enhance an existing database with
information that can be widely used. This would be particularly useful in the context of legacy
databases as it would enrich their semantics. One of the issues considered in this thesis is how this
can be achieved.
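
One possible realisation, sketched below purely as an illustration (the table name x_constraints and
the predicate enhancement_table_ddl/1 are hypothetical), is to generate the DDL for an auxiliary
user-level table in which such enhancement knowledge is stored inside the database itself, where any
tool able to query the database can reach it.

    % Emit the DDL for a hypothetical auxiliary table holding user-supplied
    % enhancement knowledge, e.g. key and inheritance definitions
    % (SWI-Prolog syntax assumed).
    enhancement_table_ddl(DDL) :-
        atomic_list_concat(
            ['CREATE TABLE x_constraints (',
             ' table_name CHAR(32) NOT NULL,',
             ' constr_name CHAR(32) NOT NULL,',
             ' constr_type CHAR(12) NOT NULL,',
             ' definition CHAR(240) )'],
            DDL).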

        Most existing relational database applications record only entities and their properties (i.e.
attribute names and data types) as system meta-data. This is because these systems conformed to
early database standards (e.g. the SQL/86 standard [ANSI86], supported by INGRES version 5
and Oracle version 5). However, more recent relational systems record additional information
such as constraint and rule definitions, as they conform to the SQL/92 standards [ANSI92] (e.g.
Oracle version 7). This additional information includes, for example, primary and foreign key
specifications, and can be used to identify classification and aggregation abstractions used in a
conceptual model [CHI94, PRE94, WIK95b]. However, the SQL/92 standard does not capture the
full range of modelling abstractions, e.g. inheritance representing generalisation hierarchies. This
means that early relational database applications are now legacy systems as they fail to naturally
represent additional information such as constraint and rule definitions. Such legacy database
systems are being migrated to modern database systems not only to gain the benefits of the current
technology but also to be compatible with new applications built with the modern technology. The
SQL standards are currently subject to review to permit the representation of extra knowledge
(e.g. object-oriented features), and we have anticipated some of these proposals in our work - i.e.
SQL-3³ [ISO94] will be adopted by commercial systems and thus the current modern DBMSs
will become legacy databases in the near future or may already be considered to be legacy
databases in that their data model type will have to be mapped onto the newer version. Having
experienced the development process of recent DBMSs it is inevitable that most current databases
will have to be migrated, either to a newer version of the existing DBMS or to a completely
different newer technology DBMS for a variety of reasons. Thus the migration of legacy
databases is perceived to be a continuing requirement, in any organisation, as technology
advances continue to be made.

        Most migrations currently being undertaken are based on code-to-code level translations of
the applications and associated databases to enable the older system to be functional in the target
environment. Minimal structural changes are made to the original system and database, thus the
design structures of these systems are still old-fashioned, although they are running in a modern
computing environment. This means that such systems are inflexible and cannot be easily
enhanced with new functions or integrated with other applications in their new environment. We
have also observed that more recent database systems have often failed to benefit from modern
database technology due to inherent design faults that have resulted in the use of unnormalised
structures, which lead to integrity constraint enforcement features being omitted even when they are
available. The ability to create and use databases without the benefit of a database design course is
one reason for such design faults. Hence there is a need to assist existing systems to be evolved,
not only to perform new tasks but also to improve their structure so that these systems can
maximise the gains they receive from their current technology environment and any environment
they migrate to in the future.


       2.1.2 Narrowing Down the Problem

        Technological advances in both hardware and software have improved the performance
and maintenance functionality of all information systems (ISs), and as a result, older ISs suffer
from comparatively poor performance and inappropriate functionality when compared with more
modern systems. Most of these legacy systems are written in a 3GL such as COBOL, have been
around for many years, and run on old-fashioned mainframes. Problems associated with legacy
systems are being identified and various solutions are being developed [BRO93, SHE94, BRO95].
These systems basically have three functional components, namely: interface, application and a
database service, which are sometimes inter-related to each other, depending on how they were
used during the design and implementation stages of the IS development. This means that the
complexity of a legacy IS depends on what occurred during the design and implementation of the
system. These systems may range from a simple single user database application using separate
interfaces and applications, to a complex multi-purpose unstructured application. Due to the
complex nature of the problem area we do not address this issue as a whole, but focus only on
problems associated with one sub-component of such legacy information systems, namely the
database service. This in itself is a wide field, and we have further restricted ourselves to legacy
ISs using a specific DBMS for their database service. We considered data models ranging from
original flat file and relational systems, to modern relational DBMSs and object-oriented DBMSs.
From these data models we have chosen the traditional relational model for the following reasons.

      • The relational model is currently the most widely used database model.
      • During the last two decades the relational model has been the most popular model;
      therefore it has been used to develop many database applications and most of these are now
      legacy systems.
      • There have been many extensions and variations of the relational model, which has
      resulted in many heterogeneous relational database systems being used in organisations.
      • The relational model can be enhanced to represent additional semantics currently
      supported only by modern DBMSs (e.g. extended relational systems [ZDO90, CAT94]).

        As most business requirements change with time, the need to enhance and migrate legacy
information systems exists for almost every organisation. We address problems faced by these
users while seeking a solution that prevents new systems from becoming legacy systems in the near
future. The selection of the relational model as our database service to demonstrate how one could
achieve these needs means that we shall be addressing only relational legacy database systems and
not looking at any other type of legacy information systems.

         This decision means we are not considering the many common legacy IS migration
problems identified by Brodie [BRO95] (e.g. migration of legacy database services such as flat-
file structures or hierarchical databases into modern extended relational databases; migration of
legacy applications with millions of lines of code written in some COBOL-like language into a
modern 4GL/GUI environment). However, as shown later, addressing the problems associated
with relational legacy databases has enabled us to identify and solve problems associated with
more recent DBMSs, and it also assists in identifying precautions which if implemented by
designers of new systems will minimise the chance of similar problems being faced by these
systems as IS developments occur in the future.

2.2 Overview of the Research Approach

        Having presented an overview of our problem and narrowed down its scope, we identify the
following as the main functionalities that should be provided to fulfil our research goal:

     • Reverse-engineering of a relational legacy database to fully portray its current information
     content.
     • Enhancing a legacy database with new knowledge to identify modelling concepts that
     should be available to the database concerned or to applications using that database.
     • Determining the extent to which the legacy database conforms to its existing and enhanced
     descriptions.
     • Ensuring that the migrated IS will not become a legacy IS in the future.

        We need to consider the heterogeneity issue in order to be able to reverse-engineer any
given relational legacy database. Three levels of heterogeneity are present for a particular data
model, namely: at a physical, logical and data management level. The physical level of
heterogeneity usually arises due to different data model implementation techniques, use of
different computer platforms and use of different DBMSs. The physical / logical data
independence of DBMSs hides implementation differences from users, hence we need only
address how to access databases that are built using different DBMSs, running on different
computer platforms.


Differences in DBMS characteristics lead to heterogeneity at the logical level. Here, the
different DBMSs conform to a particular standard (e.g. SQL/86 or SQL/92), which supports a
particular database query language (e.g. SQL or QUEL) and different relational data model
features (e.g. handling of integrity constraints and availability of object-oriented features). To
tackle heterogeneity at the logical level, we need to be aware of different standards, and to model
ISs supporting different features and query languages.

        Heterogeneity at the data management level arises due to the physical limitations of a
DBMS, differences in the logical design and inconsistencies that occurred when populating the
database. Logical differences in different database schemas have to be resolved only if we are
going to integrate them. The schema integration process is concerned with merging different
related database applications. Such a facility can assist the migration of heterogeneous database
systems. However any attempt to integrate legacy database schemas prior to the migration process
complicates the entire process as it is similar to attempting to provide new functionalities within
the system which is being migrated. Such attempts increase the chance of failure of the overall
migration process. Hence we consider any integration or enhancements in the form of new
functionalities only after successfully migrating the original legacy IS. However, the physical
limitations of a DBMS and data inconsistencies in the database need to be addressed beforehand
to ensure a successful migration.

       Our work addresses the heterogeneity issues associated with database migration by
adopting an approach that allows users to incrementally increase the number of DBMSs the system
can handle without having to reprogram its main application modules. Here, the user needs to
supply specific knowledge about DBMS schema and query language constructs. This is held
together with the knowledge of the DBMSs already supported and has no effect on the
application’s main processing modules.

       2.2.1 Meta-Programming

        Meta-programming technology allows the meta-data (schema information) of a database to
be held and processed independently of its source specification language. This allows us to work
on a database language independent environment and hence overcome many logical heterogeneity
issues. Prolog based meta-programming technology has been used in previous research at Cardiff
in the area of logical heterogeneity [FID92, QUT94]. Using this technology the meta-translation
of database query languages [HOW87] and database schemas [RAM91] has been performed. This
work has shown how the heterogeneity issues of different DBMSs can be addressed without
having to reprogram the same functionality for each and every DBMS. We use meta-programming
technology for our legacy database migration approach as we need to be able to start with a legacy
source database and end with a modern target database where the respective database schema and
query languages may be different from each other. In this approach the source database schema or
query language is mapped on input into an internal canonical form. All the required processing is
then done using the information held in this internal form. This information is finally mapped to
the target schema or query language to produce the desired output. The advantage of this approach
is that processing is not affected by heterogeneity as it is always performed on data held in the
canonical form. This canonical form is an enriched collection of semantic data modelling features.
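
A minimal sketch of this pipeline is given below, assuming hypothetical predicate names
(to_canonical/2, from_canonical/2) and a much-simplified canonical form; it illustrates the
meta-translation idea rather than the actual SMTS code.

    % Canonical internal form: relation(Name, [attr(AttrName, Type), ...]),
    % independent of any particular source or target DBMS.

    % Inbound mapping: an INGRES-style relation definition to canonical form.
    to_canonical(ingres_relation(Name, Attrs), relation(Name, Attrs)).

    % Outbound mapping: canonical form to an SQL CREATE TABLE statement
    % (SWI-Prolog syntax assumed).
    from_canonical(relation(Name, Attrs), SQL) :-
        findall(Col,
                ( member(attr(AttrName, Type), Attrs),
                  format(atom(Col), '~w ~w', [AttrName, Type]) ),
                Cols),
        atomic_list_concat(Cols, ', ', ColList),
        format(atom(SQL), 'CREATE TABLE ~w (~w)', [Name, ColList]).

All internal processing then operates on the relation/2 terms, so supporting a further DBMS only
requires new inbound and outbound mapping clauses rather than changes to the processing modules.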


2.2.2 Application

        We view our migration approach as consisting of a series of stages, with the final stage
being the actual migration and earlier stages being preparatory. At stage 1, the data definition of
the selected database is reverse-engineered to produce a graphical display (cf. paths A-1 and A-2
of figure 2.1). However, in legacy systems much of the information needed to present the database
schema in this way is not available as part of the database meta-data and hence these links which
are present in the database cannot be shown in this conceptual model. In modern systems such
links can be identified using constraint specifications. Thus, if the database does not have any
explicit constraints, or it does but these are incomplete, new knowledge about the database needs
to be entered at stage 2 (cf. path B-1 of figure 2.1), which will then be reflected in the enhanced
schema appearing in the graphical display (cf. path B-2 of figure 2.1). This enhancement will
identify new links that should be present for the database concerned. These new database
constraints can next be applied experimentally to the legacy database to determine the extent to
which it conforms to them. This process is done at stage 3 (cf. paths C-1 and C-2 of figure 2.1).
The user can then decide whether these constraints should be enforced to improve the quality of
the legacy database prior to its migration. At this point the three preparatory stages in the
application of our approach are complete. The actual migration process is then performed. All
stages are further described below to enable us to identify the main processing components of our
proposed system as well as to explain how we deal with different levels of heterogeneity.

       Stage 1: Reverse Engineering

        In stage 1, the data definition of the selected database is reverse-engineered to produce a
graphical display of the database. To perform this task, the database’s meta-data must be extracted
(cf. path A-1 of figure 2.1). This is achieved by connecting directly to the heterogeneous database.
The accessed meta-data needs to be represented using our internal form. This is achieved through
a schema mapping process as used in the SMTS (Schema Meta-Translation System) of Ramfos
[RAM91]. The meta-data in our internal formalism then needs to be processed to derive the
graphical constructs present for the database concerned (cf. path A-2 of figure 2.1). These
constructs are in the form of entity types and the relationships and their derivation process is the
main processing component in stage 1. The identified graphical constructs are mapped to a display
description language to produce a graphical display of the database.




[Figure 2.1 is a diagram. It shows the Heterogeneous Databases at the bottom, an Internal
Processing component in the middle, and three upper components: Enhanced Constraints, Schema
Visualisation (EER or OMT) with Constraints, and Enforced Constraints. The labelled information
flows are: A-1 and A-2 for Stage 1 (Reverse Engineering); B-1, B-2 and B-3 for Stage 2
(Knowledge Augmentation); and C-1 and C-2 for Stage 3 (Constraint Enforcement).]

                   Figure 2.1: Information flow in the 3 stages of our approach prior to migration


       a) Database connectivity for heterogeneous database access

        Unlike the previous Cardiff meta-translation systems [HOW87, RAM91, QUT92], which
addressed heterogeneity at the logical and data management levels, our system looks at the
physical level as well. While these previous systems processed schemas in textual form and did
not access actual databases to extract their DDL specification, our system addresses physical
heterogeneity by accessing databases running on different hardware / software platforms (e.g.
computer systems, operating systems, DBMSs and network protocols). Our aim is to directly
access the meta-data of a given database application by specifying its name, the name and version
of the host DBMS, and the address of the host machine4. If this database access process can
produce a description of the database in DDL formalism, then this textual file is used as the
starting point for the meta-translation process as in previous Cardiff systems [RAM91, QUT92].
We found that it is not essential to produce such a textual file, as the required intermediate
representation can be directly produced by the database access process. This means that we could
also by-pass the meta-translation process that performs the analysis of the DDL text to translate it
into the intermediate representation5. However the DDL formalism of the schema can be used for
optional textual viewing and could also serve as the starting point for other tools6 developed at
Cardiff for meta-programming database applications.

       The initial functionality of the Stage 1 database connectivity process is to access a
heterogeneous database and supply the accessed meta-data as input to our schema meta-translator

   4
     We assume that access privileges for this host machine and DBMS have been granted.
   5
     A list of tokens ready for syntactic analysis in the parsing phase is produced and processed
based on the BNF syntax specification of the DDL [QUT92].
   6
     e.g. The Schema Meta-Integration System (SMIS) of Qutaishat [QUT92].


(SMTS). This module needs to deal with heterogeneity at the physical and data management
levels. We achieve this by using DML commands of the specific DBMS to extract the required
meta-data held in the database's data dictionary, which is queried like a set of user-defined tables.
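
       As an illustration, if the host DBMS were Oracle, the DBC process might issue catalog
queries of the following kind. This is only a sketch: the catalog view and column names shown are
Oracle-specific assumptions, and each supported DBMS needs its own equivalent queries.

    -- Column definitions of a given application table (names are illustrative).
    SELECT table_name, column_name, data_type, nullable
    FROM   user_tab_columns
    WHERE  table_name = 'EMPLOYEE';

    -- Any declared constraints, where the host DBMS records them at all.
    SELECT c.constraint_name, c.constraint_type, cc.column_name
    FROM   user_constraints c, user_cons_columns cc
    WHERE  c.constraint_name = cc.constraint_name
    AND    c.table_name      = 'EMPLOYEE';

The rows returned by such queries constitute the meta-data that is then supplied to the schema
meta-translator.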

       Relatively recently, the functionalities of a heterogeneous database access process have
been provided by means of drivers such as ODBC [RIC94]. Use of such drivers will allow access
to any database supported by them and hence obviate the need to develop specialised tools for
each database type as happened in our case. These driver products were not available when we
undertook this stage of our work.

       b) Schema meta-translation

        The schema meta-translation process [RAM91] accepts input of any database schema
irrespective of its DDL and features. The information captured during this process is represented
internally to enable it to be mapped from one database schema to another or to further process and
supply information to other modules such as the schema meta-visualisation system (SMVS)
[QUT93] and the query meta-translation system (QMTS) [HOW87]. Thus, the use of an internal
canonical form for meta representation has successfully accommodated heterogeneity at the data
management and logical levels.

       c) Schema meta-visualisation

        Schema visualisation using graphical notation and diagrams has proved to be an important
step in a number of applications, e.g. during the initial stages of the database design process; for
database maintenance; for database re-design; for database enhancement; for database integration;
or for database migration; as it gives users a sound understanding of an existing database’s
structure in an easily assimilated format [BAT92, ELM94]. Database users need to see a visual
picture of their database structure instead of textual descriptions of the defining schema as it is
easier for them to comprehend a picture. This has led to the production of graphical
representations of schema information, effected by a reverse engineering process. Graphical data
models of schemas employ a set of data modelling concepts and a language-independent graphical
notation (e.g. the Entity Relationship (E-R) model [CHE76], Extended/Enhanced Entity
Relationship (EER) model [ELM94] or the Object Modelling Technique (OMT) [RUM91]). In a
heterogeneous environment different users may prefer different graphical models, and an
understanding of the database structure and architecture beyond that given by the traditional
entities and their properties. Therefore, there is a need to produce graphical models of a database’s
schema using different graphical notations such as either E-R/EER or OMT, and to accompany
them with additional information such as a display of the integrity constraints in force in the
database [WIK95b]. The display of integrity constraints allows users to look at intra- and inter-
object constraints and gain a better understanding of domain restrictions applicable to particular
entities. Current reverse engineering tools do not support this type of display.

        The generated graphical constructs are held internally in a similar form to the meta-data of
the database schema. Hence using a schema meta visualisation process (SMVS) it is possible to
map the internally held graphical constructs into appropriate graphical symbols and coordinates
for the graphical display of the schema. This approach has a similarity to the SMTS, the main
difference being that the output is graphical rather than textual.

       Stage 2: Knowledge Augmentation

        In a heterogeneous distributed database environment, evolution is expected, especially in
legacy databases. This evolution can affect the schema description and in particular schema
constraints that are not reflected in the stage 1 (path A-2) graphical display as they may be
implicit in applications. Thus our system is designed to accept new constraint specifications (cf.
path B-1 of figure 2.1) and add them to the graphical display (cf. path B-2 of figure 2.1) so that
these hidden constraints become explicit.

        The new knowledge accepted at this point is used to enhance the schema and is retained in
the database using a database augmentation process (cf. path B-3 of figure 2.1). The new
information is stored in a form that conforms with the enhanced target DBMS’s methods of
storing such information. This assists the subsequent migration stage.

       a) Schema enhancement

        Our system needs to permit a database schema to be enhanced by specifying new
constraints applicable to the database. This process is performed via the graphical display. These
constraints, which are in the form of integrity constraints (e.g. primary key, foreign key, check
constraints) and structural components (e.g. inheritance hierarchies, entity modifications) are
specified using a GUI. When they are entered they will appear in the graphical display.
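
       To illustrate the kind of enhancement involved, the constraints captured through the GUI
correspond to declarations such as the following SQL statements. This is a sketch only: all table
and column names (Employee, EmpNo, WorksFor, Department, DeptNo) are illustrative, and in our
system these declarations are entered graphically rather than typed as DDL.

    -- Primary key, foreign key and check constraints on an illustrative Employee table.
    ALTER TABLE Employee
      ADD CONSTRAINT pk_employee PRIMARY KEY (EmpNo);

    ALTER TABLE Employee
      ADD CONSTRAINT fk_worksfor FOREIGN KEY (WorksFor)
          REFERENCES Department (DeptNo);          -- DeptNo is an assumed key column

    ALTER TABLE Employee
      ADD CONSTRAINT chk_empno CHECK (EmpNo LIKE 'E%');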

       b) Database augmentation

        The input data to enhance a schema provides new knowledge about a database. It is
essential to retain this knowledge within the database itself, if it is to be readily available for any
further processing. Typically, this information is retained in the knowledge base of the tool used
to capture the input data, so that it can be reused by the same tool. This approach restricts the use
of this knowledge by other tools and hence it must be re-entered every time the re-engineering
process is applied to that database. This makes it harder for the user to gain a consistent
understanding of an application, as different constraints may be specified during two separate re-
engineering processes. To overcome this problem, we augment the database itself using the
techniques proposed in SQL-3 [ISO94], wherever possible. When it is not possible to use SQL-3
structures we store the information in our own augmented table format which is a natural
extension of the SQL-3 approach.
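
       A minimal sketch of one such augmented table is given below. The table name
ccves_constraints and its columns are hypothetical and serve only to illustrate the idea of
recording constraint definitions alongside ordinary user-defined tables.

    -- Hypothetical augmented table holding constraint definitions that the host
    -- DBMS cannot record in its own system tables.
    CREATE TABLE ccves_constraints (
        table_name      CHAR(32)  NOT NULL,   -- table the constraint applies to
        constraint_name CHAR(32)  NOT NULL,   -- identifier given by the user
        constraint_type CHAR(1)   NOT NULL,   -- 'P' primary key, 'R' foreign key, 'C' check
        definition      CHAR(240)             -- the constraint as a logical expression
    );

    -- Example entry for a check constraint captured at stage 2.
    INSERT INTO ccves_constraints
    VALUES ('EMPLOYEE', 'CHK_EMPNO', 'C', 'EmpNo LIKE ''E%''');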

        When a database is augmented using this method, the new knowledge is available in the
database itself. Hence, any further re-engineering processes need not make requests for the same
additional knowledge. The augmented tables are created and maintained in a similar way to user-
defined tables, but have a special identification to distinguish them. Their structure is in line with
the international standards and the newer versions of commercial DBMSs, so that the enhanced
database can be easily migrated to either a newer version of the host DBMS or to a different
DBMS supporting the latest SQL standards. Migration should then mean that the newer system
can enforce the constraints. Our approach should also mean that it is easy to map our tables for
holding this information into the representation used by the target DBMS even if it is different, as
we are mapping from a well defined structure.

       Legacy databases that do not support explicit constraints can be enhanced by using the
above knowledge augmentation method. This requirement is less likely to occur for databases
managed by more recent DBMSs as they already hold some constraint specification information
in their system tables. The direction taken by Oracle version 6 was a step towards our
augmentation approach, as it allowed the database administrator to specify integrity constraints
such as primary and foreign keys, but did not yet enforce them [ROL92]. The next release of
Oracle, i.e. version 7, implemented this constraint enforcement process.


       Stage 3: Constraint Enforcement

        The enhanced schema can be held in the database, but the DBMS can only enforce these
constraints if it has the capability to do so. This will not normally be the case in legacy systems. In
this situation, the new constraints may be enforced via a newer version of the DBMS or by
migrating the database to another DBMS supporting constraint enforcement. However, the data
being held in the database may not conform to the new constraints, and hence existing data may
be rejected by the target DBMS in the migration, thus losing data and / or delaying the migration
process. To address this problem and to assist the migration process, we provide an optional
constraint enforcement process module which can be applied to a database before it is migrated.
The objective of this process is to give users the facility to ensure that the database conforms to all
the enhanced constraints before migration occurs. This process is optional so that the user can
decide whether these constraints should be enforced to improve the quality of the legacy data prior
to its migration, whether it is best left as it stands, or whether the new constraints are too severe.

       The constraint definitions in the augmented schema are employed to perform this task. As
all constraints held have already been internally represented in the form of logical expressions,
these can be used to produce data manipulation statements suitable for the host DBMS. Once
these statements are produced, they are executed against the current database to identify the
existence of data violating a constraint.
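
       For instance, the two enhanced constraints sketched in the earlier examples might give rise
to checking statements like the following, where the table and column names remain assumptions;
any rows returned are the violating data.

    -- Rows violating the assumed foreign key from Employee.WorksFor to Department.
    SELECT e.EmpNo, e.WorksFor
    FROM   Employee e
    WHERE  e.WorksFor IS NOT NULL
    AND    NOT EXISTS (SELECT 1
                       FROM   Department d
                       WHERE  d.DeptNo = e.WorksFor);

    -- Rows violating the assumed check constraint on EmpNo.
    SELECT EmpNo
    FROM   Employee
    WHERE  EmpNo NOT LIKE 'E%';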

       Stage 4: Migration Process

        The migration process itself is incrementally performed by initially creating the target
database and then copying the legacy data over to it. The schema meta-translation (SMTS)
technique of Ramfos [RAM91] is used to produce the target database schema. The legacy data can
be copied using the import / export tools of the source and target DBMSs, or DML statements of the
respective DBMSs. During this process, the legacy applications must continue to function until
they too are migrated. To achieve this an interface can be used to capture and process all database
queries of the legacy applications during migration. This interface can decide how to process
database queries against the current state of the migration and re-direct those that now relate to the
target database. The query meta-translation (QMTS) technique of Howells [HOW87] can be used
to convert these queries to the target DML. This approach will facilitate transparent migration for
legacy databases. Our work does not involve the development of an interface to capture and
process all database queries, as interaction with the query interface of the legacy IS is embedded
in the legacy application code. However, we demonstrate how to create and populate a legacy
database schema in the desired target environment while showing the role of SMTS and QMTS in
such a process.
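
       A hedged sketch of one such incremental step is given below: SMTS generates the target DDL
and QMTS the copy statement. All table and column names are illustrative, and the mechanism used to
reach the source data (here the notional name source_employee) depends on the facilities of the
DBMSs involved, e.g. an export / import utility.

    -- Target schema construct produced via SMTS, carrying the enhanced constraints.
    CREATE TABLE Employee (
        EmpNo    CHAR(6)     NOT NULL PRIMARY KEY,
        Name     VARCHAR(40),
        Address  VARCHAR(60),
        WorksFor CHAR(4)     REFERENCES Department (DeptNo)   -- assumed key column
    );

    -- Data copy produced via QMTS; 'source_employee' stands for the legacy table
    -- as made visible to the target DBMS.
    INSERT INTO Employee (EmpNo, Name, Address, WorksFor)
    SELECT EmpNo, Name, Address, WorksFor
    FROM   source_employee;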

2.3 The Role of CCVES in Context of Heterogeneous Distributed Databases

       Our approach described in section 2.2 is based on preparing a legacy database schema for
graceful migration. This involves visualising database schemas together with their constraints and
enhancing them with further constraints to capture more knowledge. Hence we call our system the
Conceptualised Constraint Visualisation and Enhancement System (CCVES).

        CCVES has been developed to fit in with the previously developed schema (SMTS)
[RAM91] and query (QMTS) [HOW87] meta-translation systems, and the schema meta-
visualisation system (SMVS) [QUT93]. This allows us to consider the complementary roles of
CCVES, SMTS, QMTS and SMVS during Heterogeneous Distributed Database access in a
uniform way [FID92, QUT94]. The combined set of tools achieves semantic coordination and
promotes interoperability in a heterogeneous environment at logical, physical and data
management levels.

        Figure 2.2 illustrates the architecture of CCVES in the context of heterogeneous
distributed databases. It outlines in general terms the process of accessing a remote (legacy)
database to perform various database tasks, such as querying, visualisation, enhancement,
migration and integration.

        There are seven sub-processes: the schema mapping process [RAM91], query mapping
process [HOW87], schema integration process [QUT92], schema visualisation process [QUT93],
database connectivity process, database enhancement process and database migration process. The
first two processes together have been called the Integrated Translation Support Environment
[FID92], and the first four processes together have been called the Meta-Integration/Translation
Support Environment [QUT92]. The last three processes were introduced as CCVES to perform
database enhancement and migration in such an environment.

        The schema mapping process, referred to as SMTS, translates the definition of a source
schema to a target schema definition (e.g. an INGRES schema to a POSTGRES schema). The
query mapping process, referred to as QMTS, translates a source query to a target query (e.g. an
SQL query to a QUEL query). The meta-integration process, referred to as SMIS, tackles
heterogeneity at the logical level in a distributed environment containing multiple database
schemas (e.g. Ontos and Exodus local schemas with a POSTGRES global schema) - it integrates
the local schemas to create the global schema. The meta-visualisation process, referred to as
SMVS, generates a graphical representation of a schema. The remaining three processes, namely:
database connectivity, enhancement and migration with their associated processes, namely:
SMVS, SMTS and QMTS, are the subject of the present thesis, as they together form CCVES
(centre section of figure 2.2).

       The database connectivity process (DBC) queries meta-data from a remote database (route
A-1 in figure 2.2) to supply meta-knowledge (route A-2 in figure 2.2) to the schema mapping
process referred to as SMTS. SMTS translates this meta-knowledge to an internal representation
which is based on SQL schema constructs. These SQL constructs are supplied to SMVS for
further processing (route A-3 in figure 2.2) which results in the production of a graphical view of
the schema (route A-4 in figure 2.2). Our reverse-engineering techniques [WIK95b] are applied to
identify entity and relationship types to be used in the graphical model. Meta-knowledge
enhancements are solicited at this point by the database enhancement process (DBE) (route B-1 in
figure 2.2), which allows the definition of new constraints and changes to the existing schema.
These enhancements are reflected in the graphical view (route B-2 and B-3 in figure 2.2) and may
be used to augment the database (route B-4 to B-8 in figure 2.2). This approach to augmentation
makes use of the query mapping process, referred to as QMTS, to generate the required queries to
update the database via the DBC process. At this stage any existing or enhanced constraints may
be applied to the database to determine the extent to which it conforms to the new enhancements.
Carrying out this process will also ensure that legacy data will not be rejected by the target DBMS
due to possible violations. Finally, the database migration process, referred to as DBMI, assists
migration by incrementally migrating the database to the target environment (route C-1 to C-6 in
figure 2.2). Target schema constructs for each migratable component are produced via SMTS, and
DDL statements are issued to the target DBMS to create the new database schema. The data for
these migrated tables are extracted by instructing the source DBMS to export the source data to
the target database via QMTS. Here too, the queries which implement this export are issued to the
DBMS via the DBC process.

2.4 Research Aims and Objectives

       Our relational database enhancement and augmentation approach is important in three
respects, namely:

    1) by holding the additional defining information in the database itself, this information is
      usable by any design tool in addition to assisting the full automation of any future re-
      engineering of the same database;
    2) it allows better user understanding of database applications, as the associated constraints
      are shown in addition to the traditional entities and attributes at the conceptual level;




    3) the process which assists a database administrator to clean inconsistent legacy data ensures a
      safe migration; performing this latter task in a real-world situation without an automated
      support tool is very difficult, tedious, time-consuming and error-prone.

        Therefore the main aim of this project has been the design and development of a tool to
assist database enhancement and migration in a heterogeneous distributed relational database
environment. Such a system is concerned with enhancing the constituent databases in this type of
environment to exploit potential knowledge both to automate the re-engineering process and to
assist in evolving and cleaning the legacy data to prevent data rejection, possible losses of data
and/or delays in the migration process. To this end, the following detailed aims and objectives
have been pursued in our research:

1. Investigation of the problems inherent in schema enhancement and migration for a
heterogeneous distributed relational legacy database environment, in order to fully understand
these processes.

2. Identification of the conceptual foundation on which to successfully base the design and
development of a tool for this purpose. This foundation includes:

    • A framework to establish meta-data representation and manipulation.
    • A real world data modelling framework that facilitates the enhancement of existing working
      systems and which supports applications during migration.
    • A framework to retain the enhanced knowledge for future use which is in line with current
      international standards and techniques used in newer versions of relational DBMSs.
    • Exploiting existing databases in new ways, particularly linking them with data held in other
      legacy systems or more modern systems.
    • Displaying the structure of databases in a graphical form to make it easy for users to
      comprehend their contents.
    • The provision of an interactive graphical response when enhancements are made to a
      database.
    • A higher level of data abstraction for tasks associated with visualising the contents,
      relationships and behavioural properties of entities and constraints.
    • Determining the constraints on the information held and the extent to which the data
      conforms to these constraints.
    • Integrating with other tools to maximise the benefits of the new tool to the user community.

3. Development of a prototype tool to automate the re-engineering process and the migration
assisting tasks as far as possible. The following development aims have been chosen for this
system:

    • It should provide a realistic solution to the schema enhancement and migration assistance
      process.
    • It should be able to access and perform this task for legacy database systems.
    • It should be suitable for the data model at which it is targeted.
    • It should be as generic as possible so that it can be easily customised for other data models.
    • It should be able to retain the enhanced knowledge for future analysis by itself and other
tools.
    • It should logically support a model using modern data modelling techniques irrespective of
      whether it is supported by the DBMS in use.
    • It should make extensive use of modern graphical user interface facilities for all graphical
      displays of the database schema.
    • Graphical displays should also be as generic as possible so that they can be easily enhanced or
      customised for other display methods.




CHAPTER 3
                        Database Technology, Relational Model,
                     Conceptual Modelling and Integrity Constraints

The origins and historical development of database technology are presented first, to trace the
evolution of ISs and the emergence of database models. The relational data model is identified as
currently the most commonly used database model, and some terminology for this data model, along
with its features, including query languages, is then presented. A discussion of conceptual data
models, with special emphasis on EER and OMT, is provided to introduce these data models and the
symbols used in our project. Finally, we pay attention to crucial concepts relating to our work,
namely the notion of semantic integrity constraints, with special emphasis on those used in semantic
extensions to the relational model. The relational database language SQL is also discussed,
identifying how and when it supports the implementation of these semantic integrity constraints.

3.1 Origins and Historical Developments

        The origin of data management goes back to the 1950s and hence this section is sub-divided
into two parts: the first describes database technology prior to the relational data model, and the
second describes developments since. This division was chosen because the relational model is
currently the dominant database model for information management [DAT90].

       3.1.1 Database Technology Prior to the Relational Data Model

        Database technology emerged from the need to manipulate large collections of data for
frequently used data queries and reports. The first major step in mechanisation of information
systems came with the advent of punched card machines which worked sequentially on fixed-length
fields [SEN73, SEN77]. With the appearance of stored program computers, tape-oriented systems
were used to perform these tasks with an increase in user efficiency. These systems used sequential
processing of files in batch mode, which was adequate until peripheral storage with random access
capabilities (e.g. DASD) and time sharing operating systems with interactive processing appeared to
support real-time processing in computer systems.

        Access methods such as direct and indexed sequential access methods (e.g. ISAM, VSAM)
[BRA82, MCF91] were used to assist with the storage and location of physical records in stored
files. Enhancements were made to procedural languages (e.g. COBOL) to define and manage
application files, making the application program dependent on the organisation of the file. This
technique caused data redundancy as several files were used in systems to hold the same data (e.g.
emp_name and address in a payroll file; insured_name and address in an insurance file; and
depositors_name and address in a bank file). These stored data files used in the applications of the
1960's are now referred to as conventional file systems, and they were maintained using third
generation programming languages such as COBOL and PL/1. This evolution of mechanised
information systems was influenced by the hardware and software developments which occurred in
the 1950’s and early 1960’s. Most long existing legacy ISs are based on this technology. Our work
does not address this type of IS as they do not use a DBMS for their data management.

       The evolution of databases and database management systems [CHA76, FRY76, SIB76,
SEN77, KIM79, MCG81, SEL87, DAT90, ELM94] was to a large extent the result of addressing the
main deficiencies in the use of files, i.e. by reducing data redundancy and making application
programs less dependent on file organisation. An important factor in this evolution was the
development of data definition languages which allowed the description of a database to be
separated from its application programs. This facility allowed the data definition (often called a
schema) to be shared and integrated to provide a wide variety of information to the users. The
repository of all data definitions (meta-data) is called a data dictionary, and its use allows data
definitions to be shared and made widely available to the user community.

        In the late 1960's applications began to share their data files using an integrated layer of
stored data descriptions, creating the first true databases, e.g. the IMS hierarchical database [MCG77,
DAT90]. This type of database was navigational in nature and applications explicitly followed the
physical organisation of records in files to locate data using commands such as GNP - get next under
parent. These databases provided centralised storage management, transaction management,
recovery facilities in the event of failure and system maintained access paths. These were the typical
characteristics of early DBMSs.

        Work on extending COBOL to handle databases was carried out in the late 60s and 70s. This
resulted in the establishment of the DBTG (i.e. DataBase Task Group) of CODASYL and the formal
introduction of the network model along with its data manipulation commands [DBTG71]. The
relational model was proposed during the same period [COD70], followed by the 3 level
ANSI/SPARC architecture [ANSI75] which made databases more independent of applications, and
became a standard for the organisation of DBMSs. Three popular types of commercial database
systems7 classified by their underlying data model emerged during the 70s [DAT90, ELM94],
namely:

         • hierarchical
         • network
         • relational

and these have been the dominant types of DBMS from the late 60s on into the 80s and 90s.

         3.1.2 Database Technology Since the Relational Data Model

        At the same time as the relational data model appeared, database systems introduced another
layer of data description on top of the navigational functionality of the early hierarchical and
network models to bring extra logical data independence8. The relational model also introduced the
use of non-procedural (i.e. declarative) languages such as SQL [CHA74]. By the early 1980's many
relational database products, e.g. System R [AST76], DB2 [HAD84], INGRES [STO76] and Oracle,
were in use. Owing to their growing maturity in the mid 80s, and to the complexity of programming,
navigating and changing data structures in the older DBMS data models, the relational data model
was able to take over the commercial database market, with the result that it is now dominant.



   7
       Other types such as flat file, inverted file systems were also used.
   8
       This allows changes to the logical structure of data without changing the application programs.


       The advent of inexpensive and reliable communication between computer systems, through
the development of national and international networks, has brought further changes in the design of
these systems. These developments led to the introduction of distributed databases, where a
processor uses data at several locations and links it as though it were at a single site. This technology
has led to distributed DBMSs and the need for interoperability among different database systems
[OZS91, BEL92].

        Several shortcomings of the relational model have been identified, including its inability to
support efficiently compute-intensive applications such as simulation, to cope with computer-aided
design (CAD) and programming language environments, and to represent and manipulate effectively
concepts such as the following [KIM90]:

       • Complex nested entities (e.g. design and engineering objects),
       • Unstructured data (e.g. images, textual documents),
       • Generalisation and aggregation within a data structure,
       • The notion of time and versioning of objects and schemas,
       • Long duration transactions.

        The notion of a conceptual schema for application-independent modelling introduced by the
ANSI/SPARC architecture led to another data model, namely: the semantic model. One of the most
successful semantic models is the entity-relationship (E-R) model [CHE76]. Its concepts include
entities, relationships, value sets and attributes. These concepts are used in traditional database
design as they are application-independent. Many modelling concepts based on variants/extensions
to the E-R model have appeared since Chen’s paper. The enhanced/extended entity-relationship
model (EER) [TEO86, ELM94], the entity-category-relationship model (ECR) [ELM85], and the
Object Modelling Technique (OMT) [RUM91] are the most popular of these.

        The DAPLEX functional model [SHI81] and the Semantic Data Model [HAM81] are also
semantic models. They capture a richer set of semantic relationships among real-world entities in a
database than the E-R based models. Semantic relationships such as generalisation / specialisation
between a superclass and its subclass, the aggregation relationship between a class and its attributes,
the instance-of relationship between an instance and its class, the part-of relationship between
objects forming a composite object, and the version-of relationship between abstracted versioned
objects are semantic extensions supported in these models. The object-oriented data model with its
notions of class hierarchy, class-composition hierarchy (for nested objects) and methods could be
regarded as a subset of this type of semantic data model in terms of its modelling power, except for
the fact that the semantic data model lacks the notion of methods [KIM90] which is an important
aspect of the object-oriented model.

       The relational model of data and the relational query language have been extended [ROW87]
to allow modelling and manipulation of additional semantic relationships and database facilities.
These extensions include data abstraction, encapsulation, object identity, composite objects, class
hierarchies, rules and procedures. However, these extended relational systems are still being
evolved to fully incorporate features such as implementation of domain and extended data types,
enforcement of primary and foreign key and referential integrity checking, prohibition of duplicate
rows in tables and views, handling missing information by supporting four-valued predicate logic
(i.e. true, false, unknown, not applicable) and view updatability [KIV92], and they are not yet
available as commercial products.

        The early 1990's saw the emergence of new database systems by a natural evolution of
database technology, with many relational database systems being extended and other data models
(e.g. the object-oriented model) appearing to satisfy more diverse application needs. This opened
opportunities to use databases for a greater diversity of applications which had not been previously
exploited as they were not perceived as tractable by a database approach (e.g. Image, medical,
document management, engineering design and multi-media information, used in complex
information processing applications such as office automation (OA), computer-aided design (CAD),
computer-aided manufacturing (CAM) and hyper media [KIM90, ZDO90, CAT94]). The object-
oriented (O-O) paradigm represents a sound basis for making progress in these areas and as a result
two types of DBMS are beginning to dominate in the mid 90s [ZDO90], namely: the object-oriented
DBMS, and the extended relational DBMS.

        There are two styles of O-O DBMS, depending on whether they have evolved from
extensions to an O-O programming language or from extensions to an existing database model. Extensions have been
created for two database models, namely: the relational and the functional models. The extensions to
existing relational DBMSs have resulted in the so-called Extended Relational DBMSs which have
O-O features (e.g. POSTGRES and Starburst), while extensions to the functional model have
produced PROBE and OODAPLEX. The approach of extending O-O programming language
systems with database management features has resulted in many systems (e.g. Smalltalk into
GemStone and ALLTALK, and C++ into many DBMSs including VBase / ONTOS, IRIS and O2).
References to these systems with additional information and references can be found in [CAT94].

       Research is currently taking place into other kinds of database such as active, deductive and
expert database systems [DAT90]. This thesis focuses on the relational model and possible
extensions to it which can represent semantics in existing relational database information systems in
such a way that these systems can be viewed in new ways and easily prepared for migration to more
modern database environments.

3.2 Relational Data Model

        In this section we introduce some of the commonly used terminology of the relational model.
This is followed by a selective description of the features and query languages of this model. Further
details of this data model can be found in most introductory database text books, e.g. [MCF91,
ROB93, ELM94, DAT95].

        A relation is represented as a table (entity) in which each row represents a tuple (record), the
number of columns being the degree of the relation and the number of rows being its cardinality. An
example of this representation is shown in figure 3.1, which shows a relation holding Student details,
with degree 3 and cardinality 5. This table and each of its columns are named, so that a unique
identity for a table column of a given schema is achieved via its table name and column name. The
columns of a table are called attributes (fields) each having its own domain (data type) representing
its pool of legal data. Basic types of domains are used (e.g. integer, real, character, text, date) to
define the domains of attributes. Constraints may be enforced to further restrict the pool of legal
values for an attribute. Tables which actually hold data are called base tables to distinguish them
from view tables which can be used for viewing data associated with one or more base tables. A
view table can also be an abstraction from a single base table which is used to control access to parts
of the data.

        A column or set of columns whose values uniquely identify a row of a relation is called a
candidate key (key) of the relation. It is customary to designate one candidate key of a relation as a
primary key (e.g. SNO in figure 3.1). The specification of keys restricts the possible values the key
attribute(s) may hold (e.g. no duplicate values), and is a type of constraint enforceable on a relation.
Additional constraints may be imposed on an attribute to further restrict its legal values. In such
cases, there should be a common set of legal values satisfying all the constraints of that attribute,
ensuring its ability to accept some data. For example, a pattern constraint which ensures that the first
character of SNO is ‘S’ further restricts the possible values of SNO - see figure 3.1. Many other
concepts and constraints are associated with the relational model although most of them are not
supported by early relational systems nor, indeed, by some of the more recent relational systems (e.g. a
value set constraint for the Address field as shown in figure 3.1).

[Figure 3.1 is a diagram showing the Student relation as a table. Annotations mark: the domain of
SNO (type character); a pattern constraint on SNO (all values begin with 'S'); SNO as the primary
key (unique values); a value set constraint on Address; the rows as tuples (the cardinality) and the
columns as attributes (the degree).]

                     Student    SNO     Name      Address
                                S1      Jones     Cardiff
                                S2      Smith     Bristol
                                S3      Gray      Swansea
                                S4      Brown     Cardiff
                                S5      Jones     Newport

                                                    Figure 3.1: The Student relation
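
        The annotations of figure 3.1 can be expressed in SQL-92 style DDL as shown below. The
declaration is illustrative (the particular value set for Address is invented for the example), and,
as discussed above, early relational systems could create the table but not declare the key, pattern
or value set constraints.

    CREATE TABLE Student (
        SNO     CHAR(4)     NOT NULL,
        Name    VARCHAR(30),
        Address VARCHAR(30),
        CONSTRAINT pk_student  PRIMARY KEY (SNO),             -- primary key: unique values
        CONSTRAINT chk_sno     CHECK (SNO LIKE 'S%'),         -- pattern constraint
        CONSTRAINT chk_address CHECK (Address IN
            ('Cardiff', 'Bristol', 'Swansea', 'Newport'))     -- value set constraint (invented)
    );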


       3.2.1 Requisite Features of the Relational Model

        During the early stages of the development of relational database systems, many requisite
features were identified which a comprehensive relational system should have [KIM79, DAT90].
We now examine these to illustrate the kind of capabilities expected from early relational
database management systems. They included support for:

      • Recovery from both soft and hard crashes,
      • A report generator for formatted display of the results of queries,
      • An efficient optimiser to meet the response-time requirements of users,
      • User views of the stored database,
     • A non-procedural language for query, data manipulation / definition / control,
     • Concurrency control to allow sharing of a database by multiple users and applications,
     • Transaction management,
     • Integrity control during data manipulation,
     • Selective access control to prevent one user’s database being accessed by unauthorised users,
     • Efficient file structures to store the database, and
     • Efficient access paths to the stored data.

        Many early relational DBMSs originated at universities and research institutes, and none of
them were able to provide all the above features. These systems mainly focussed on optimising
techniques for query processing and recovery from soft and hard crashes, and did not pay much
attention to the other features. Few of these database systems were commercially available, and for
those that were the marketing was based on specific features such as report generation (e.g.
MAGNUM), and views with selective access control (e.g. QBE). The early commercial systems did
not support the full range of features either.

        Since the mid 1980’s many database products have appeared which aim to provide most of
the above features. The enforcement of features such as concurrency control was embodied in these
systems, while features such as views, access and integrity control were provided via non-procedural
language commands. Systems which were unable to provide these features via a non-procedural
language offered procedural extensions (e.g. C with embedded SQL) to perform such tasks. This
resulted in the use of two types of data manipulation languages, i.e. procedural and non-procedural,
to perform database system functions. In procedural languages a sequence of statements is issued to
specify the navigation path needed in the database to retrieve the required data, thus they are
navigational languages. This approach was used by all hierarchical and network database systems
and by some relational systems. However, most relational systems offer a non-procedural (i.e. non-
navigational) language. This allows retrieval of the required data by using a single retrieval
expression, which in general has a degree of complexity corresponding to the complexity of the
retrieval (e.g. SQL).
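
        As a simple illustration of the difference, the single declarative SQL expression below
retrieves data that a navigational DML would have to locate record by record; the table and column
names (Employee, Department, WorksFor, DeptNo, DName) are assumed for the example.

    -- Names of all employees working for a given department.
    SELECT e.Name
    FROM   Employee e, Department d
    WHERE  e.WorksFor = d.DeptNo
    AND    d.DName    = 'Computer Science';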

         3.2.2 Query Language for the Relational Model

       Querying or the retrieval of information from a database is perhaps the aspect of relational
languages which has received the most attention. A variety of approaches to querying has emerged,
based on relational calculus, relational algebra, mapping-oriented languages and graphic-oriented
languages [CHA76, DAT90]. During the first decade of relational DBMSs, there were many
experimental implementations of relational systems in universities and industry, particularly at IBM.
The initial projects were aimed at proving the feasibility of relational database systems supporting
high-level non-procedural retrieval languages. The Structured Query Language (SQL9) [AST75]
emerged from an IBM research project. Later projects created more comprehensive relational
DBMSs and among the more important of these systems were probably the System R project at IBM
[AST76] and the INGRES project (with its QUEL query language) at the University of California at
Berkeley [STO76].



   9
     Initially called SEQUEL; the name was later shortened to SQL, though it is still often pronounced ‘sequel’.


        Standards for relational query languages were introduced [ANSI86] so that a common
language could be used to retrieve information from a database. SQL became the standard query
language for relational databases. These standards were reviewed regularly [ANSI89a, ANSI92] and
are being reviewed [ISO94] to incorporate technological changes that meet modern database
requirements. Hence, the standard query language SQL is evolving, and although some of the recent
database systems conform to [ANSI92] standards they will have to be upgraded to incorporate even
more recent advances such as the object-oriented paradigm additions to the language [ISO94]. This
means that different database system query languages conform to different standards, and provide
different features and facilities to their users even though they are of the same type. Hence,
information systems developed during different eras will have used different techniques to perform
the same task, with early systems being more procedural in their approach than more recent ones.

        Query languages, including SQL, have three categories of statements, i.e. the data
manipulation language (DML) statements to perform all retrievals, updates, insertions and deletions,
the data definition language (DDL) statements to define the schema and its behavioural functions
such as rules and constraints, and the data control language (DCL) statements to specify access
control which is concerned with the privileges to be given to database users.
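
        The following SQL statements, one or two from each category, illustrate the distinction;
the object and user names are assumptions made for the example.

    -- DDL: define part of the schema.
    CREATE TABLE Student (SNO CHAR(4) PRIMARY KEY, Name VARCHAR(30));

    -- DML: insert and retrieve data.
    INSERT INTO Student VALUES ('S1', 'Jones');
    SELECT Name FROM Student WHERE SNO = 'S1';

    -- DCL: grant access privileges to a (hypothetical) user.
    GRANT SELECT ON Student TO registry_clerk;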

3.3 Conceptual Modelling

        The conceptual model is a high level representation of a data model, providing an
identification and description of the main data objects (avoiding details of their implementation).
This model is hardware and software independent, and is represented using a set of graphical
symbols in a diagrammatic form. As noted in part ‘c’ of stage 1 of section 2.2.2, different users may
prefer different graphical models and hence it is useful to provide them with a choice of models. We
consider two types of conceptual model in this thesis, namely: the enhanced entity-relationship
model (EER), which is based on the popular entity-relationship model, and the object-modelling
technique (OMT), which uses the more recent concepts of object-oriented modelling as opposed to
the entities of the E-R model. These were chosen as they are among the currently most widely used
conceptual modelling approaches and they allow representation of modelling concepts such as
generalisation hierarchies.

       3.3.1 Enhanced Entity-Relationship Model (EER)

        The entity-relationship approach is considered to be the first useful proposal [CHE76] on the
subject of conceptual modelling. It is concerned with creating an entity model which represents a
high-level conceptual data model of the proposed database, i.e. it is an abstract description of the
structure of the entities in the application domain, including their identity, relationship to other
entities and attributes, without regard for eventual implementation details.

        Thus an E-R diagram describes entities and their relationships using distinctive symbols, e.g.
an entity is a rectangle and a relationship is a diamond. Distinctive symbols for recent modelling
concepts such as generalisation, aggregation and complex structures have been introduced into these
models by practitioners. Despite its popularity, no standard has emerged or been defined for this
model. Hence different authors use different notations to represent the same concept. Therefore we
have to define our symbols for these concepts: we have based our definitions on [ROB93] and
[ELM94].

        a) Entity

       An entity in the E-R model corresponds to a table in the relational environment and is
represented by a rectangle containing the entity name, e.g. the entity Employee of figure 3.2.

        b) Attributes

        Attributes are represented by a circle that is connected to the corresponding entity by a line.
Each attribute has a name located near the circle10, e.g. attributes EmpNo, Name and Address of the
Employee entity in figure 3.2. Key attributes of a relation are indicated using a colour to fill in the
circle (red on the computer screen or shaded dark in this thesis) (e.g. the attribute EmpNo of
Employee in figure 3.2). Attributes usually have a single value in an entity occurrence although
multivalued attributes can occur and other types such as derived attributes can be represented in the
conceptual model (see appendix B for a comprehensive list of the symbols used in EER models in
this thesis).

        c) Relationships

        A relationship is an association between entities. Each relationship is named and represented
by a diamond-shaped symbol. Three types of relationships (one-to-many or 1:M, many-to-many or
M:N, and one-to-one or 1:1) are used to describe the association between entities. Here 1 means that
an instance of this entity relates to only one instance of the other entity (e.g. an employee works for
only one department), and M or N means that an instance of an entity may relate to more than one
instance of the other entity (e.g. a department can have many employees working for it - see figure
3.2), through this relationship (the same entities can be linked in more than one relationship). The
relationship type is determined by the participating entities and their associated properties. In the
relational model a separate entity is used for M:N relationship types (e.g. a composite entity as in the
case of the entity ComMem of figure 3.2), and the other relationship types (i.e. 1:1 and 1:M) are
represented by repeated attributes (e.g. relationship WorksFor of figure 3.2 is established from the
attribute WorksFor of the entity Employee).




   10
      We do not place the attribute name inside the circle to avoid the use of large circles or ovals in
our diagrams.


[Figure 3.2 is a diagram. Committee (a weak entity, with attribute Title) is linked to Faculty
through the weak relationship Fcom. ComMem (a composite entity, with attribute YearJoined) relates
Employee and Committee. Employee (an entity with key EmpNo and attributes Name and Address) is
linked to Department through the relationships WorksFor, with cardinalities (1,1) and (4,N), and
Head, with cardinalities (0,1) and (1,1). Office is a generalised entity with Department and
Faculty as its disjoint specialised entities.]

                                      Figure 3.2: EER diagram for part of the University Database


        A relationship’s degree indicates the number of associated entities (or participants)
in the relationship. Relationships with 1, 2 and 3 participants are called unary, binary and ternary
relationships, respectively. In practice most relationships are binary (e.g. relationship WorksFor in
figure 3.2) and relationships of higher order (e.g. four) occur very rarely as they are usually
simplified to a series of binary relationships.

        The term connectivity is used to describe the relationship classification and it is represented
in the diagram by using 1 or N near the related entity (see for example, the WorksFor relationship in
figure 3.2). Alternatively, a more detailed description of the relationship is specified using
cardinality, which expresses the specific number of entity occurrences associated with one
occurrence of the related entity. The actual number of occurrences depends on the organisation’s
policy and hence, can differ from that of another organisation, although both may model the same
information. The cardinality has upper and lower limits indicating a range and is represented in the
diagram within brackets near the related entity (see the WorksFor relationship in figure 3.211).
Cardinality is a type of constraint and in appendix B.2 we provide more details about the symbols
and notations used to represent these types of constraints. Thus in the WorksFor relationship:

        (1,1)   indicates              an employee must belong to a department
        (4,N)   indicates              a department must have at least 4 employees
        N       indicates              a department has many employees
        1       indicates              an employee may work for only one department

        d) Other Relationship and Entity Types

       The original E-R model of Chen did not contain relationship attributes and did not use the
concept of a composite entity. We use this concept as in [ROB93], because the relational model
requires the use of an entity composed of the primary keys of other entities to connect and represent
M:N relationships. Hence, a composite entity (also called a link [RUM91] or regular [CHI94] entity)
   11
      In practice, in a diagram only one of these types is shown, depending on availability of
information on cardinality limits.


representing an M:N relationship is represented using a diamond inside the rectangle, indicating that
it is an entity as well as a relationship (e.g. ComMem of figure 3.2). In this type of relationship, the
primary key of the composite entity is created by combining the keys of the entities which it connects.
This is usually a binary (2-ary) relationship involving two referenced entities, and is a special case
of an n-ary relationship, which connects n entities.
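
        In SQL the composite entity ComMem of figure 3.2 could be declared as follows. This is a
sketch in which Committee is assumed, for simplicity, to be identified by Title alone, and the
column types are otherwise invented.

    CREATE TABLE ComMem (
        EmpNo      CHAR(6)      NOT NULL REFERENCES Employee (EmpNo),
        Title      VARCHAR(30)  NOT NULL REFERENCES Committee (Title),
        YearJoined INTEGER,
        PRIMARY KEY (EmpNo, Title)    -- key formed from the keys of the connected entities
    );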

        Some entity occurrences cannot exist without an entity occurrence with which they have a
relationship. Such entities are called weak entities and are represented by a double rectangle (e.g.
Committee in figure 3.2). The relationship formed with this entity type is called a weak relationship
and is represented by a double diamond (e.g. Fcom relationship of figure 3.2). In this type of
relationship, the key of the entity on which the weak entity depends is a proper subset of the weak
entity’s primary key, and the remaining attributes of that primary key (called dangling attributes) do
not contain a key of any other entity.

        When a relationship exists between occurrences of the same entity set (i.e. a unary
relationship) it forms a recursive relationship (e.g. a course may have prerequisite courses).

       e) Generalisation / Specialisation / Inheritance

        Most organisations employ people with a wide range of skills and special qualifications (e.g.
a university employs academics, secretaries, porters, research associates, etc.) and it may be
necessary to record additional information for certain types of employee (e.g. qualifications of
academics). Representing such additional information in the employee table results in null values in
these attributes for all other employees, since the information is not applicable to them. To
overcome this, the characteristics common to all employees are chosen to define the
employee entity as a generalised entity, and the additional information is put in a separate entity,
called a specialised entity, which inherits all the properties of its parent entity (i.e. the generalised
entity), creating a parent-child or is-a relationship (also called a generalised hierarchy). The higher
level of this relationship is a supertype entity (i.e. generalised entity) and the lower-level is a subtype
entity (i.e. specialised entity). A supertype entity set is usually composed of several unique and
disjoint (non-overlapping) subtype entity sets. However some supertypes contain overlapping
subtypes (e.g. an employee may also be a student and hence we get two subtypes of person in an
overlapping relationship). There are constraints applicable for generalised hierarchies and special
symbols / notations are used in these cases (see appendices B.1 figure ‘e’ and B.2 figure ‘b’). In
figure 3.2, the entities Office, Department and Faculty form a generalised hierarchy, with Office
being the supertype entity and Department and Faculty being the subtype entities. Subtype and
supertype entities have a 1:1 relationship although we view it differently, i.e. as a hierarchy.

       The subtypes described above inherit from a single supertype entity. However, there may be
cases where a subtype inherits from multiple supertypes (e.g. an empstudent entity representing
employees who are also students may inherit from employee and student entities). This is known as
multiple inheritance. In such cases the subtype may represent either an intersection or a union. The
concept of inheritance was taken from the O-O paradigm and hence it does not occur in the original
E-R model, but is included in the EER model.

       3.3.2 Object Modelling Technique (OMT)




         The Object Modelling Technique (OMT) is an O-O development methodology. It creates a
high-level conceptual data model of a proposed system without regard for eventual implementation
details. This model is based on objects.

         The notations of OMT used here are taken from [RUM91] and those used in our work are
described in appendix B, where they are compared with their EER equivalents. Hence we do not
describe this model in depth here. The diagrams produced by this method are known as object
diagrams. They combine O-O concepts (i.e. classes and inheritance) with information modelling
concepts (i.e. entities and relationships). Although the terminology differs from that used in the
EER model, both produce similar conceptual models, albeit with different graphical notations. The
main notations used in OMT are rectangles with text inside (e.g. for classes and their properties, as
opposed to the EER where attributes appear outside the entity). This makes OMT easier to
implement than EER in a graphical computing environment. OMT is used for most O-O modelling
(e.g. in [COO93, IDS94]), and so it is a widely known technique.

3.4 Semantic Integrity Constraints

        A real world application is always governed by many rules which define the application
domain and are referred to as integrity constraints [DAT83, ELM94]. An important activity when
designing a database application is to identify and specify these integrity constraints for that database
and if possible to enforce them using the DBMS constraint facilities.

        The term integrity refers to the accuracy, correctness or validity of a database. The role of the
integrity constraint enforcer is to ensure that the data in the database is accurate by guarding it
against invalid updates, which may be caused by errors in data entry, mistakes on the part of the
operator or the application programmer, system failures, and even due to deliberate falsification by
users. This latter case is the concern of the security system which protects the database from
unauthorised access (i.e. it implements authorisation constraints). The integrity system uses integrity
rules (integrity constraints) to protect the database from invalid updates supplied by authorised users
and to maintain the logical consistency of the database.

        Integrity is sometimes used to cover both semantic and transaction integrity. The latter case
deals with concurrency control (i.e. the prevention of inconsistencies caused by concurrent access by
multiple users or applications to a database), and recovery techniques which prevent errors due to
malfunctioning of system hardware and software. Protection against this type of integrity-violation is
dealt with by most commercially available systems and is not an issue of this thesis. Here we shall
use the terms integrity and constraints to refer only to semantic integrity constraints.

        Integrity rules cannot detect all types of error. For instance, when dealing with percentage
marks, there is no way the computer can detect that an input value of 45 for a student mark should
really have been 54. A value of 455, on the other hand, could be detected and corrected. Consistency
is another term used for integrity; however, it is normally used in cases
where two or more values in the database are required to be in agreement with each other in some
way. For example, the DeptNo in an Employee record should tally with the DeptNo appearing in
some Department record (referential integrity in relational systems), or the Age of a Person must be



equal to the difference in years between today’s date and their date of birth (a property of a derived
attribute).

        In order to check for invalid data, DBMSs use an integrity subsystem to monitor transactions
and detect integrity violations. In the event of a violation the DBMS takes appropriate actions such
as rejecting the operation, reporting the violation, or assisting in correcting the error. To perform
such a task, the integrity subsystem must be provided with a set of rules that define what errors to
check for, when to do the checking, and what to do if an error is detected. Most early DBMSs did
not have an integrity subsystem (mainly due to unacceptable database system performance when
integrity checking was performed in older technological environments), and hence such checking was
not provided by the DBMS itself. Instead, these information systems performed integrity
checking using procedural language extensions of the database to check for invalid entries during the
capture of data via their user interface (e.g. data entry forms). Here too, due to technological
limitations and poor database performance, only specific types of constraints (e.g. range check,
pattern matching), and a limited number of checks were allowed for an attribute. As these rules were
coded in application programs they violated program / data (rule) independence for constraint
specification. However, most recent DBMSs attempt to support such specifications using their DDL
and hence they achieve program / rule independence.

        The original SQL standard specifications [ANSI86] were subsequently enhanced so that
constraints could be specified using SQL [ANSI89a]. Current commercial DBMSs are seeking to
meet these standards by targeting the implementation of the SQL-2 standards [ANSI92] in their
latest releases. Systems such as Oracle now conform to these standards, while others such as
INGRES and POSTGRES have taken a different path by extending their systems with a rule
subsystem, which performs similar tasks but using a procedural style approach where the rules and
procedures are retained in data dictionaries.

       Integrity constraints can be identified for the properties of a data model and for the values of
a database application. We examine both to present a detailed description of the types of constraint
associated with databases and in particular those used for our work.

       3.4.1 Integrity Constraints of a Data Model

        Some constraints are automatically supported by the data model itself. These constraints are
assumed to hold by the definition of that data model (i.e. they are built into the system and not
specified by a user). They are called the inherent constraints of the data model. There are also
constraints that can be specified and represented in a data model. These are called the implicit
constraints of the model and they are specified using DDL statements in a relational schema, or
graphical constructs in an E-R model. Table 3.1 gives some examples of implicit and inherent
constraints for relational and EER data models. The constraint types used in this table are described
in detail in section 3.5.

        The structure of a data model represents inherent constraints implicitly and is also capable of
representing implicit constraints. Hence, constraints represented in these two ways are referred to as
structural constraints. Data models differ in the way constraints are handled. Hierarchical and
network database constraints are handled by being tightly linked to structural concepts (records, sets,



segment definitions), of which the parent-child and owner-member relationships are logical
examples. The classical relational model, on the other hand, has two constraints represented
structurally by its relations or tables (namely, relations consist of a certain number of simple
attributes and have no duplicate rows). Hence only specific types of structural constraint are defined
for a particular data model (e.g. parent-child relationships are not defined for the relational model).

                     Data Model     Implicit constraints                        Inherent constraints

                                    • Primary key attributes,                   • Every relationship instance of an n-ary
                                    • Attribute structural constraints,           relationship type R relates exactly one
                        EER         • Relationship structural constraints,        entity from each entity type
                                    • Superclass / subclass relationship,         participating in R in a specific role,
                                    • Disjointness / totality constraints       • Every entity instance in a subclass
                                      on specialisation / generalisation.         must also exist in its superclass.

                                    • Domain constraints,                       • A relation consists of a certain
                      Relational    • Key constraints,                            number of simple attributes,
                                    • Relationship structural constraints.      • An attribute value is atomic,
                                                                                • No duplicate tuples are allowed.

                                   Table 3.1: Structural constraints of selected data models


        Every data model has a set of concepts and rules (or assertions) that specify the structure and
the implicit constraints of a database describing a miniworld. A given implementation of a data
model by a particular DBMS will usually support only some of the structural (inherent and implicit)
constraints of the data model automatically and the rest must be defined explicitly. These additional
constraints of a data model are called explicit or behavioural constraints. They are defined using
either a procedural or a declarative (non-procedural) approach, which is not part of the data
model per se.

        3.4.2 Integrity Constraints of a Database Application

        In database applications, integrity constraints are used to ensure the correctness of a
database. A change to the database takes place during an update transaction, and
constraints are used at this stage to ensure that the database is in a consistent state before and after
that transaction. This type of constraint is called a state (static) constraint as it applies to a particular
state of the database and should hold for every state where the database is not in transition, i.e. not in
the process of being updated. Constraints that apply to the change of a database from one state to
another are called transition (dynamic) constraints (e.g. the age of a person can only be
increased, meaning that the new value of age must be greater than the old value). In general, transition
constraints occur less frequently than state constraints and are usually specified explicitly.

        The discussion above classifies the types of semantic integrity constraints used in data
models and database applications. We summarise them in figure 3.3 to highlight the basic
classification of integrity constraints. We separate the two approaches using a dotted line as they are
independent of each other. However, most constraints are common to both categories as they are
implemented using a particular data model for a database application. Data models used for
conceptual modelling are more concerned with structural constraints as opposed to the value
constraints of database applications.




                     [Diagram: integrity constraints are classified into data model constraints and database
                     application constraints (the two categories are separated by a dotted line). Data model
                     constraints divide into structural constraints (inherent constraints and implicit
                     constraints) and explicit (behavioural) constraints; database application constraints
                     divide into state (static) constraints and transition (dynamic) constraints.]

                                              Figure 3.3: Classification of integrity constraints


3.5 Constraint Types

       We consider constraint types in more detail here so that we can later relate them to data
models and database applications. Initially, we describe value constraints (i.e. domain and key
constraints) which are applicable to database values (i.e. attributes). Then, we describe structural
constraints, namely: attribute structural, relationship structural and superclass/subclass structural.
These constraints are often associated with data models and some of them have been mentioned in
section 3.4.1. In this section, we look at them with respect to their structural properties and are
concerned with identifying differences within a structure, in addition to the relationships (e.g.
between entities) formed by them. Finally, constraints that do not fall into either of these categories
are described. As most of these constraints are state constraints we shall refer to the constraint type
only when type distinction is necessary.

        All structural constraints are shown in a conceptual model as this model is used to describe
the structure of a database. Not all value constraints (e.g. check constraints) are shown as they are
not associated with the structure of a database and are described using a DML. However, our work
includes presenting them at optional lower levels of abstraction which involves software dependent
code. This code is based on current SQL standards and may be replaced using equivalent graphical
constructs if necessary12. Here for each type of an SQL statement, we could introduce a suitable
graphical representation and hence increase its readability. All value constraints are implicitly or
explicitly defined when implementing an application. Most constraints considered here are implicit
constraints as they may be specified using the DDL of a modern DBMS. In such cases the DBMS
will monitor all changes to the data in the database to ensure that no constraint has been violated by
these changes.

         3.5.1 Domain Constraints

         Domain constraints are specified by associating every simple attribute type with a value set.

   12
        This idea is beyond the scope of this thesis.


The value set of a simple attribute is initially defined using its data type (integer, real, char, boolean)
and length, and later is further restricted using appropriate constraints such as range (minimum,
maximum) and pattern (letters and digits). For example, the value set for the Deptno attribute of the
entity Department could be specified as data type character of length 5, and the Salary attribute of the
entity Staff as data type decimal of length 8 with 2 decimal places, in the range 3000 to 20000.
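
        As a concrete illustration, the Salary value set above could be declared as follows (a
minimal sketch only, using the SQL notation of section 3.7; the StaffNo key column is an assumption):

            CREATE TABLE Staff (
                StaffNo CHAR(5) PRIMARY KEY,                  -- assumed key attribute
                Salary  DECIMAL(8,2)                          -- data type and length define the value set
                        CHECK (Salary BETWEEN 3000 AND 20000) -- range constraint restricts it further
            );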

      Nonnull constraints can be seen as a special case of domain constraints, since they too restrict
the domain of attributes. These constraints are used to eliminate the possibility of missing or
unknown values of an attribute occurring in the database.

        A domain constraint is usually used to restrict the value of an attribute, e.g. an employee’s
age is ≥ 18 (i.e. a state constraint); however, it may also be used to compare values of two states,
e.g. an employee’s new salary is ≥ their current salary (i.e. a transition constraint).

       3.5.2 Key Constraints

        Key constraints specify key attribute(s) that can uniquely identify an instance of an entity.
These constraints are also called candidate key or uniqueness constraints. For example, stating
Deptno is a key of Department will ensure that no two departments will have the same Deptno.
When a set of attributes forms a key, that key is called a composite key, as we are dealing with a
composite attribute. When a nonnull constraint is added to a key uniqueness constraint, such keys
are referred to as primary keys. An entity may have several candidate keys and in such cases
one is called the primary key and the others alternate keys.
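
        For example, in SQL such key constraints could be sketched as follows (the DeptName attribute
is an assumption, included only to show an alternate key):

            CREATE TABLE Department (
                DeptCode CHAR(5)  PRIMARY KEY,    -- primary key: unique and non-null
                DeptName CHAR(30) NOT NULL,       -- assumed attribute
                UNIQUE (DeptName)                 -- alternate (candidate) key
            );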

       Primary key attributes are shown in the EER model (see appendix B.2, figure ‘b’). The OMT
model uses object identities (oids) to uniquely identify objects and as they are usually system
generated they are not shown in this model. However, when modelling relational databases we do
not use the concept of oid, instead we have primary keys which are shown in our diagrams (see
appendix B.2, figure ‘b’) as they carry important information about a relational database.

       3.5.3 Structural Constraints on Attributes

        Attribute structural constraints specify whether an attribute is single valued or multivalued.
Multivalued attributes with a fixed number of possible values are sometimes defined as composite
attributes. For example, name can be a composite attribute composed of first name, middle initial
and last name. However composite attributes cannot be constructed for multivalued attributes like a
student’s course set, where the student can attend several courses. In such a case one would have to
use an alternative solution, such as recording all possibilities in one long string or using a separate
data type like sets. This type of constraint is not generally supported by most traditional DBMSs. In
the relational model we use a separate entity to hold multiple values and these are related to the
correct entity through an identical primary key [ELM94].
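
        A minimal sketch of this solution for the student’s course set (the table and column names
are assumptions) is:

            CREATE TABLE Student (
                StudentNo CHAR(7) PRIMARY KEY,
                Name      CHAR(20)
            );

            CREATE TABLE StudentCourse (            -- one row per course attended by a student
                StudentNo CHAR(7),
                CourseNo  CHAR(7),
                PRIMARY KEY (StudentNo, CourseNo),  -- the student's key forms part of this key
                FOREIGN KEY (StudentNo) REFERENCES Student (StudentNo)
            );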

       3.5.4 Structural Constraints on Relationships

        Structural constraints on relationships specify limitations on the participation of entities in
relationship instances. Two types of relationship constraints occur frequently. They are called



cardinality ratio constraints and participation constraints. The cardinality ratio constraint specifies
the number of relationship instances that an entity can participate in using 1 and N (many). For
example, the constraints every employee works for exactly one department and a department can
have many employees working for it have a cardinality ratio of 1:N, meaning that each department
entity can be related to numerous employee entities. A participation constraint specifies whether
the existence of an entity depends on its being related to another entity via a certain relationship
type. If all the instances of an entity participate in a relationship of this type then the entity has total
participation (existence dependency) in the relationship. Otherwise the participation is partial,
meaning only some of the instances of that entity participate in a relationship of this type. For
example, the constraint every employee works for exactly one department means that an Employee
entity has a total participation in the relationship WorksFor (see figure 3.2), and the constraint an
employee can be the head of a department, means that the Employee entity has a partial participation
in the relationship Head (see figure 3.2) (i.e. not all employees are head of a department).
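
        In a relational schema, the total participation of Employee in WorksFor is commonly
approximated by declaring the corresponding foreign key column (introduced below) as NOT NULL, while
partial participation simply allows nulls. A minimal sketch, assuming the Department table of
figure 3.4 already exists:

            CREATE TABLE Employee (
                EmpNo    CHAR(5) PRIMARY KEY,
                WorksFor CHAR(5) NOT NULL                  -- total participation: every employee
                         REFERENCES Department (DeptCode)  -- works for exactly one department
            );

        Note that the lower bound of the (4,N) cardinality (at least 4 employees per department)
cannot be captured in this way and would need an explicit assertion or trigger of the kind discussed
in section 3.6.2.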

        Referential integrity constraints are used to specify a type of structural relationship
constraint. In relational databases, foreign keys are used to define referential integrity constraints. A
foreign key is defined on attributes of a relation. This relation is known as the referencing table. The
foreign key attribute of the referencing table (e.g. WorksFor of Employee in figure 3.4) will always
refer to an attribute(s) of another relation, where the attribute(s) are the primary or alternate key (e.g.
DeptCode of Department in figure 3.4). We refer to this relation as the referenced table. The referenced
attribute(s) of the referenced table have a uniqueness property, being the primary key or an
alternate key of that relation. This means that references from one relation to another are achieved by
using foreign keys, which indicate a relationship between two entities. Also this establishes an
inclusion dependency between the two entities. Here the values of the attribute of the referencing
entity (e.g. Employee.WorksFor) are a subset of the values of the attribute of the referenced entity
(e.g. Department.DeptCode). Only recent DBMSs such as Oracle version 7 support the specification of
foreign keys using DDL statements.

                            Employee                                      Department
                            ...Attributes...   WorksFor                   DeptCode   ...Attributes...
                                               COMMA                      COMMA
                                               MATHS                      ELSYM
                                               COMMA                      MATHS
                                               COMMA
                                               MATHS
                            (5 records)                                   (3 records)

                            WorksFor is a foreign key attribute of the referencing entity Employee. This attribute
                            refers to the referenced entity Department, whose attribute DeptCode is its primary key.


                                                 Figure 3.4: A foreign key example


       3.5.5 Structural Constraints on Specialisation/Generalisation

        Disjointness and completeness constraints are defined on specialisation/generalisation
structures. The disjointness constraint specifies that the subclasses of the
specialisation (generalisation) must be disjoint. This means that an entity can be a member of at most
one of the subclasses of the specialisation. When an entity is a member of more than one of the
subclasses, then we get an overlapping situation. The completeness constraint on specialisation
(generalisation) defines total/partial specialisation (generalisation). A total specialisation specifies


the constraint that every entity in the superclass must be a member of some subclass in the
specialisation. For example: Lecturer, Secretary and Technician are some of the job types of an
Employee. They describe disjoint subclasses of the entity Employee, having a partial specialisation
as there could be an employee with another job type.

        Generalisation is a feature supported by many object-oriented (O-O) systems, but it has yet to
be adopted by commercial relational DBMSs. However, with O-O features being incorporated into
the relational model (e.g. SQL-3) we can expect to see this feature in many RDBMSs in the near
future.
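
        In the meantime, a sub/supertype hierarchy is commonly emulated in a relational schema by
giving each subtype table the same primary key as its supertype and declaring it as a foreign key.
A minimal sketch (the Qualification attribute is an assumption):

            CREATE TABLE Employee (
                EmpNo CHAR(5)  PRIMARY KEY,
                Name  CHAR(20) NOT NULL
            );

            CREATE TABLE Lecturer (                        -- subtype of Employee
                EmpNo         CHAR(5) PRIMARY KEY,         -- same key as the supertype
                Qualification CHAR(30),                    -- subtype-specific attribute (assumed)
                FOREIGN KEY (EmpNo) REFERENCES Employee (EmpNo)
            );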

       3.5.6 General Semantic Constraints

        There are general semantic integrity constraints that do not fall into one of the above
categories (e.g. the constraint the salary of an employee must not be greater than the salary of the
head of the department that the employee works for, or the salary attribute of an employee can only
be increased). These constraints can be either state or transition constraints, and are generally
specified as explicit constraints.

        The transition constraint mentioned above is a single-step transition constraint. Here, a
constraint is evaluated on a pair of pre-transaction and post-transaction states of a database, e.g. in
the employee’s salary example, the current salary will be the pre-transaction state and the new salary
will become the post-transaction state. However, there are transition constraints that are not limited
to a single-step, e.g. temporal constraints specified using the temporal qualifiers always and
sometimes [CHO92]. Other forms of constraint exist, including those defined for incomplete data
(e.g. employees having similar jobs and experience must have almost equal salary) [RAJ88]. These
can also be considered as a type of semantic constraint, mainly as they are not implicitly supported
by the most frequently used (i.e. relational) data model. They may need a special constraint
specification language to support them.

3.6 Specifying Explicit Constraints

       Explicit constraints are generally defined using either a procedural or a declarative approach.

       3.6.1 Procedural Approach

        In the procedural approach (or the coded constraints technique), constraint checking
statements are coded into appropriate update transactions of the application by the programmer. For
example, to enforce the constraint the salary of an employee must not be greater than the salary of
the head of the department that the employee works for, one has to check every update transaction
that may violate this constraint. This includes any transaction that modifies the salary of an
employee, any transaction that links an employee to a department, and any transaction that assigns a
new employee or manager to a department. Thus in all such transactions appropriate code has to be
included that will check for possible violations of this constraint. When a violation occurs the
transaction has to be aborted, and this is also done by including appropriate code in the application.

       The above technique for handling explicit constraints is used by many existing DBMSs. This



technique is general because the checks are typically programmed in a general-purpose
programming language. It also allows the programmer to code in an effective way. However, it is
not very flexible and places an extra burden on the programmer, who must know where the
constraints may be violated and include checks at each and every place that a violation may occur.
Hence, a misunderstanding, omission or error by the programmer may allow the database to get
into an inconsistent state.

        Another drawback of specifying constraints procedurally arises because constraints can change
with time as the rules of the real world situation change. If a constraint changes, it is the responsibility of the DBA
to instruct appropriate programmers to recode all the transactions that are affected by the change.
This again opens the possibility of overlooking some transactions, and hence the chance that errors
in constraint representation will render the database inconsistent.

        Another source of error is that it is possible to include conflicting constraints in procedural
specifications that will cause incorrect abortion of correct transactions. This error may occur in other
types of constraint specification, e.g. whenever we attempt to define multiple constraints on the same
entity. However, such errors can be detected more easily in a declarative approach, as it is possible
to evaluate constraints defined in that form to identify conflicts between them.

        The procedural approach is usually adopted only when the DBMS cannot declaratively
support the same constraint. In all early DBMSs the procedural code was part of the application code
and was not retained in the database’s system catalog. However, some recent DBMSs (e.g. INGRES)
provide a rule subsystem where all defined procedures are stored in system catalogs and are fired
using rules which detect update transactions associated with a particular constraint. This approach is
a step towards the declarative approach as it overcomes some of the deficiencies of the procedural
approach described above (e.g. the maintenance of constraints), although the code is still
procedural, which, for example, prevents the detection of conflicting constraints.

        Some DBMSs (e.g. INGRES) do not support the specification of foreign key constraints
through their DDL. Hence, for these systems such constraints have to be explicitly defined using a
procedural approach. In section ‘a’ of table 3.2, we show how a procedure is used in INGRES to
implement a foreign key constraint. Here the existence of a value in the referenced table is checked
and the statement is rejected if it does not exist. For comparison purposes, we include the definition
of the same constraint using an SQL-3 DDL specification (implicit) in section ‘b’ of table 3.2, and
the declarative approach (explicit) in section ‘c’ of table 3.2. When comparing these three
approaches, it is clear that the procedural one is the most unfriendly and the most error-prone. The DDL
approach is the best and most efficient, as the constraint is specified and managed implicitly by the
DBMS.




                      Approach                                      SQL Statements

                (a)   Procedural       CREATE PROCEDURE Employee_FK_Dept (WorksFor CHAR(5)) AS
                      Approach         DECLARE
                      (Explicit)            msg = VARCHAR(70) NOT NULL;
                                            check_value = INTEGER;
                                       BEGIN
                                            IF WorksFor IS NOT NULL THEN
                                               SELECT COUNT(*) INTO :check_value FROM Department
                                                      WHERE DeptCode = :WorksFor;
                                               IF check_value = 0 THEN
                                                    msg = 'Error 1: Invalid Department Code: ' + :WorksFor;
                                                   RAISE ERROR 1 :msg;
                                                   RETURN
                                               ENDIF
                                            ENDIF
                                       END;

                (b)   DDL              CONSTRAINT Employee_FK_Dept
                      Approach         FOREIGN KEY (WorksFor) REFERENCES Department (DeptCode);
                      (Implicit)       Note: This constraint is defined in the Employee table.

                 (c)   Declarative      CREATE ASSERTION Employee_FK_Dept
                       Approach         CHECK (NOT EXISTS (SELECT * FROM Employee
                       (Explicit)            WHERE WorksFor NOT IN (SELECT DeptCode FROM Department)));

                      Note: The schema on which these constraints are defined is in figure 6.2.
                                          Table 3.2: Different Approaches to defining a Constraint


       3.6.2 Declarative Approach

        A more formal technique for representing explicit constraints is to use a constraint
specification language, usually based on some variation of relational calculus. This is used to specify
or declare all the explicit constraints. In this declarative approach there is a clean separation between
the constraint base in which the constraints are stored, in a suitable encoded form, and the integrity
control subsystem of the DBMS, which accesses the constraints to apply them to transactions.

        When using this technique, constraints are often called integrity assertions, or simply
assertions, and the specification language is called an assertion specification language. The term
assertion is used in place of explicit constraints, and the assertions are specified as declarative
statements. These constraints are not attached to a specific table as in the case of the implicit
constraint types introduced in section 3.5. This approach is supported by SQL-3. For example, the
constraint a department head’s salary must be greater than that of the department’s employees can be
specified as:

          CREATE ASSERTION Salary_Constraint
             CHECK (NOT EXISTS (SELECT * FROM Employee E, Department D, Employee H
                    WHERE E.WorksFor = D.DeptCode AND D.Head = H.EmpNo
                      AND E.EmpNo <> H.EmpNo AND E.Salary >= H.Salary));

        Assertions can also be used to define implicit constraints, like examination mark is between 0
and 100, or referential integrity constraints, as shown in table 3.2 part ‘c’. However, it is easier and
more efficient (i.e. consumes less computer resources) to monitor and enforce implicit constraints
using the DDL approach as such constraints are attached to an entity and used only when an update
transaction is applied to that entity, as opposed to checking whenever an update transaction is
performed on the database in general.



        In many cases it is convenient to specify the type of action to be taken when a constraint is
violated or satisfied, rather than just having the options of aborting or silently performing the
transaction. In such cases, it is useful to include the option of informing a responsible person
regarding the need to take action or notifying them of the occurrence of that transaction (e.g. in
referential constraints, it is sometimes necessary to perform an action like update or delete on a table
to amend its contents, instead of aborting the transaction). This facility is supported either by an
optional trigger option on an existing DDL statement or by defining triggers [DAT93]. Triggers can
combine the declarative and procedural approaches, as the action part can be procedural, while the
condition part is always declarative (INGRES rules are a form of trigger). A trigger can be used to
activate a chain of associated updates, that will ensure database integrity (e.g. update total quantity
when new stock arrives or when stock is dispatched). An alerter, which is a variant of the trigger
mechanism, is used to notify users of important events in the database. For example, we could send a
message to the head calling his attention to a purchase transaction for £1,000 or more made from
department funds.
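
        For instance, the stock example above might be sketched as a trigger in the style of the
draft SQL-3 proposal as follows (the StockDelivery and Stock tables and their columns are
assumptions made only for illustration):

            CREATE TRIGGER UpdateStockTotal
            AFTER INSERT ON StockDelivery
            REFERENCING NEW ROW AS NewRow
            FOR EACH ROW
                UPDATE Stock                                             -- maintain the derived total
                SET   TotalQuantity = TotalQuantity + NewRow.Quantity
                WHERE Stock.ItemNo  = NewRow.ItemNo;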

        In this section we have introduced concepts from INGRES which also appear in other
DBMSs, namely triggers and alerters. These constructs provide further information about database
contents, but are beyond the scope of this project. They are related to constraints, so could be utilised
in a similar fashion.

3.7 SQL Approach to Implementing Constraints

         In SQL-3, a constraint is either a domain constraint, a table constraint or a general constraint
[ISO94]. It is described by a constraint descriptor, which is either a domain constraint descriptor (cf.
sections 3.7.1 and A.11), a table descriptor (cf. sections 3.7.2 and A.4) or a general descriptor (cf.
sections 3.7.3 and A.12). Every constraint descriptor includes: the name of the constraint, an
indication of whether or not the constraint is deferrable, and an indication of whether or not the
initial constraint mode is deferred or immediate (see section A.3). Constraint names are optional,
in that system generated names can be assigned (except for the general constraint case,
where a name must be given). A constraint has an action which is either deferrable or non-
deferrable. This can be set using the constraint mode option (see section A.13). Usually, most
constraints are immediate as the default constraint mode is immediate, and in these cases they are
checked at the end of an SQL transaction. To deal with deferred constraints, all constraints are
effectively checked at the end of an SQL session or when an SQL commit statement is executed.
Whenever a constraint is detected as being violated, an exception condition is raised and the
transaction concerned is terminated by an implicit SQL rollback statement to ensure the consistency
of the database system.
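
        As an illustration of constraint modes, the foreign key of table 3.2 could be declared
deferrable and then deferred within a transaction, as sketched below (syntax as in SQL-2; the exact
form may vary between products):

            ALTER TABLE Employee
                ADD CONSTRAINT Employee_FK_Dept
                    FOREIGN KEY (WorksFor) REFERENCES Department (DeptCode)
                    DEFERRABLE INITIALLY IMMEDIATE;

            SET CONSTRAINTS Employee_FK_Dept DEFERRED;
            -- ... updates that may temporarily violate the constraint ...
            COMMIT;                        -- the deferred constraint is checked here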

       3.7.1 SQL Domain Constraints

        In SQL, domain constraints are specified by means of the CREATE DOMAIN statement (see
section A.11) and can be added to or dropped from an existing domain by means of the ALTER
DOMAIN statement [DAT93]. These constraints are associated with a specific domain and apply to
every column that is defined using that domain. They allow users to define new data types, which in
turn are used to define the structure of a table. For example, a domain Marks may be defined as
shown in figure 3.5. This means SQL will recognise the data type Marks which permits integers



between 0 and 100, thus giving a natural look to that data type when it is used.

                                        CREATE DOMAIN Marks INTEGER
                                         CONSTRAINT icmarks
                                         CHECK (VALUE BETWEEN 0 AND 100);

                                       Figure 3.5: An SQL domain constraint


       3.7.2 SQL Table Constraints

       In SQL, table constraints (i.e. constraints on base tables) are initially specified by means of
the CREATE TABLE statement (see section A.4) and can be added to or dropped from an existing
base table by means of the ALTER TABLE statement [DAT93]. These constraints are associated with
a specific table, as they cannot exist without a base table. However, this does not mean that such
constraints cannot span multiple tables as in the case of foreign keys. Constraints defined on specific
columns of a base table are a type of table constraint, but are usually referred to as column
constraints.

        Three types of table constraints are defined here, namely: candidate key constraints, foreign
key constraints and check constraints. Their definitions may appear next to their respective column
definitions or at the end (i.e. after all column definitions have been defined). We now describe an
example that uses all three types of constraints, using figure 3.6. The PRIMARY KEY (only one per
table) (see section A.6) and UNIQUE (the value in a row position is unique) (see section A.5) are
used to define candidate keys. A FOREIGN KEY definition (see section A.8) defines a referential
integrity constraint and may also include an action part (which is not used in figure 3.6 for
simplicity). Check constraints are defined using a CHECK clause (see section A.9) and may contain
any logical expression. The check constraint CHECK(Name IS NOT NULL) is usually defined using a
shorthand form NOT NULL next to the column Name, as shown in figure 3.6. We have also included
a check constraint spanning multiple tables in figure 3.6. Such table constraints can be included only
after the tables have been created, and hence in practice they are added using ALTER TABLE
statements.

                       CREATE TABLE Employee(
                         EmpNo     CHAR(5)         PRIMARY KEY,
                         Name      CHAR(20)        NOT NULL,
                         Address   CHAR(80),
                         Age       INTEGER         CHECK (Age BETWEEN 21 AND 65),
                         WorksFor  CHAR(5),
                         Salary    DECIMAL,
                              FOREIGN KEY (WorksFor) REFERENCES Department (DeptCode),
                              CHECK (Salary <= (SELECT H.Salary
                                   FROM Department D, Employee H
                                   WHERE D.Head = H.EmpNo AND D.DeptCode = Employee.WorksFor)),
                              UNIQUE (Name, Address) );

                                          Figure 3.6: SQL table constraints
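
        As noted above, the multi-table check constraint of figure 3.6 would in practice be attached
once both tables exist; a sketch of the ALTER TABLE form is given below (the constraint name is an
assumption, and subqueries in check constraints are only available at the Full SQL-2 level):

            ALTER TABLE Employee
                ADD CONSTRAINT Emp_Salary_Check
                    CHECK (Salary <= (SELECT H.Salary
                                      FROM Department D, Employee H
                                      WHERE D.Head = H.EmpNo
                                        AND D.DeptCode = Employee.WorksFor));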


       3.7.3 SQL General Constraints

        In SQL, general constraints are specified by means of the CREATE ASSERTION statement
(see section A.12) and are dropped by means of the DROP ASSERTION statement. These constraints
must be associated with a user defined constraint name as they are not attached to a specific table


and are used to constrain arbitrary combinations of columns in arbitrary combinations of base tables
in a database. The constraint defined in section ‘c’ of table 3.2 belongs to this type.

       Domain and table constraints are implicit constraints of a database, while assertions used to
define general constraints are explicit constraints (using a declarative approach). SQL data types
have their own constraint checking, which rejects for example string values being entered into a
numeric column definition. This type of constraint checking can be considered as inherent as it is
supported by the SQL language itself.

        All integrity constraints discussed above are deterministic and independent of the application
and system environments. Hence, no parameters, host variables or built-in system functions (e.g.
CURRENT_DATE) are allowed in these definitions, as they could make the database inconsistent. For
example CURRENT_DATE will give different values on different days. This means the validity of a
data entry may not hold during its lifetime despite no changes being made to its original entry. This
makes the task of maintaining the consistency of the database more difficult. Also it makes it
difficult to distinguish these errors from the traditional errors discussed in section 3.5. Hence,
attributes such as age, which involve the use of CURRENT_DATE, should be derived attributes whose
values are computed during retrieval.
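
        For example, instead of storing Age it could be computed at retrieval time from a stored
date of birth, e.g. through a view such as the following sketch (the BirthDate column and the view
name are assumptions, and the simple year subtraction ignores whether the birthday has yet occurred
in the current year):

            CREATE VIEW EmployeeAge (EmpNo, Age) AS
                SELECT EmpNo,
                       EXTRACT(YEAR FROM CURRENT_DATE) - EXTRACT(YEAR FROM BirthDate)
                FROM   Employee;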

3.8 SQL Standards

         To conclude this chapter, we compare different SQL standards to chronicle when respective
constraint specification statements were introduced to the language. A standard for the database
language SQL was first introduced in 1986 [ANSI86], and this is now called SQL/86. The SQL/86
standard specified two levels, namely: level 1 and level 2 (referred to as level 1 and 2 respectively in
table 3.3, column 2); where level 2 defined the complete SQL database language, while level 1 was a
subset of level 2. In 1989, the original standard was extended to include the integrity enhancement
feature [ANSI89a]. This standard is called SQL/89 and is referred to as level Int. in table 3.3,
column 2. The current standard, SQL/92 [ANSI92], is also referred to as SQL-2. This standard
defines three levels, namely: Entry, Intermediate and Full SQL (referred to as level E, I and F,
respectively, in table 3.3, column 4); where Full SQL is the complete SQL database language,
Intermediate SQL is a proper subset of Full SQL, and Entry SQL is a proper subset of Intermediate
SQL. The purpose of introducing 3 levels was to enable database vendors who had incorporated the
SQL/89 extensions into their original SQL/86 implementations to claim SQL/92 Entry level status.
As Intermediate extensions were more straightforward additions than the rest, they were separated
from the Full SQL/92 extensions. However, SQL/92 is also constantly being reviewed [ISO94],
mainly to incorporate O-O features into the language, and this latest release is called SQL-3 (referred
to as level O-O in table 3.3, column 5). Until recently, relational DBMSs supported only the SQL/86
standard and even now most support only up to the Entry level of SQL/92. Hence ISs developed
using these relational DBMSs are not capable of supporting modern features introduced in the
newest standards. Our work focuses on providing these newer features for existing relational legacy
database systems so that features such as primary / foreign key specification can be supported for
relational databases conforming to SQL/86 standards; and sub / super type features can be specified
for all relational products.




                           SQL Version         SQL/86    SQL/89         SQL/92    SQL-3
                           Level               1    2      Int.     E     I   F    O-O
                           Data Type           x     +      =       =     +   +      +
                           Identifier length   x     +      =       =     +   =      =
                           Not Null            x     +      =       =     =   =      =
                           Unique Key          -     x      =       =     +   =      =
                           Primary Key         -     -      x       =     +   =      =
                           Foreign Key         -     -      x       =     =   +      +
                           Check Constraint    -     -      x       =     =   +      =
                           Default Value       -     -      x       =     =   =      =
                           Domain              -     -      -       -     x   +      =
                           Assertion           -     -      -       -     -   x      +
                           Trigger             -     -      -       -     -   -      x
                           Sub/SuperType       -     -      -       -     -   -      x
                               Table 3.3: SQL Standards and introduction of constraints


         The integrity features discussed in previous sections were thus gradually introduced into the
SQL language standards as we can see from table 3.3. In this table ‘x’ indicates when the feature was
first introduced. The ‘+’ sign indicates that some enhancements were made to a previous version, the
‘=’ sign indicates that the same feature was used in a later version, and the ‘-’ sign indicates that the
feature was not provided in that version. For example, the Primary Key constraint for a table was
first introduced in SQL/89 (cf. appendix A.6) and later enhanced (i.e. in SQL/92 Intermediate) by
merging it with the Unique constraint (cf. appendix A.5) to introduce a candidate key constraint (cf.
appendix A.7). Thus, SQL standards for Primary Key are shown in table 3.3 as: ‘-’ for SQL/86; ‘x’
for SQL/89; ‘=’ for SQL/92 Entry level; ‘+’ for SQL/92 Intermediate level; and ‘=’ for all
subsequent versions.

        Simple attributes are defined using their data type and length (cf. section 3.5.1). These
specifications are used as inherent domain constraints. The first two rows of table 3.3 show that they
were among the first constraint features introduced via SQL standards (i.e. SQL/86). The Not Null
constraint, which is a special domain constraint, was also introduced in the initial SQL standard. The
key constraints (cf. section 3.5.2), which specify unique and primary keys, were introduced in a
subsequent standard (i.e. SQL/89) as shown in table 3.3. The referential constraint, which specifies a
type of structural relationship constraint using foreign keys, was also introduced
in SQL/89, along with default values for attributes and check constraints. Later, more complex forms
of constraints were introduced in SQL/92. These include defining new domains for an attribute (e.g.
child as a domain having an inherent constraint of age being less than 18 years), and specifying
domain constraints spanning multiple tables (i.e. assertions). Constraints which are activated when
an event occurs (i.e. triggers), and structural constraints on specialisation / generalisation (i.e.
sub/super type, cf. section 3.5.5) are among other enhancements proposed in the draft SQL-3
standards. A detailed description of the syntax of statements used to provide the features identified in
table 3.3 is given in appendix A. For further details we refer the reader to the standards themselves
[ANSI86, ANSI89a, ANSI92, ISO94].




CHAPTER 4

                        Migration of Legacy Information Systems

This chapter addresses in detail the migration process and issues concerned with legacy information
systems (ISs). We identify the characteristics and components of a typical legacy IS and present the
expected features and functions of a migration target IS. An overview of some of the strategies and
methods used for migration of a legacy IS to a target IS is presented along with a detailed study of
migration support tools. Finally, we introduce our tool to assist the enhancement and migration of
relational legacy databases.

4.1 Introduction

        Rapid technological advances in recent years have changed the standard computer hardware
technology from centralised mainframes to network file-server and client/server architectures, and
software data management technology from files and primitive database systems to powerful
extended relational distributed DBMSs, 4GLs, CASE tools, non-procedural application generators
and end-user computing facilities. Information systems (ISs) built using old-fashioned technology
are inflexible, as that technology prevents them from being changed and evolved to meet changing
business needs, which adjust rapidly to the potential of technological advances. It also means they
are expensive to maintain, as older systems suffer from failures, inappropriate functionality, lack of
documentation, poor performance and problems in training support staff. Such older systems are
called legacy ISs [BRO93, PHI94, BEN95, BRO95], and they need to be evolved and migrated to a
modern technological environment so that their existence remains beneficial to their user community
and organisation, and their full potential to the organisation can be realised.

4.2 Legacy Information Systems

        Technological advances in hardware and software have improved the performance and
maintainability of new information systems. Equipment and techniques used by older ISs are
obsolete and prone to suffer from frequent breakdowns along with ever increasing maintenance
costs. In addition, older ISs may have other deficiencies depending on the type of system. Common
characteristics of these systems are [BRO93, PHI94, BEN95, BRO95] that they are:

        • scientifically old, as they were built using older technology,
        • written in a 3GL,
        • reliant on an old fashioned database service (e.g. a hierarchical DBMS),
        • equipped with a dated style of user interface (e.g. command driven).

         Furthermore, in very large organisations additional negative characteristics may be present
making the intended migration process even more complex and difficult. These include [BRO93,
AIK94, BEN95, BRO95]: being very large (e.g. having millions of lines of code); being mission
critical (e.g. an on-line monitoring system like customer billing); and being operational all the time
(i.e. 24 hours a day and 7 days a week). These characteristics are not present in most smaller scale
legacy information systems, and hence the latter are less complex to maintain. Our work may not
assist all types of legacy IS as it deals with one particular component of a legacy IS only (i.e. the
database service).

        Information systems consist of three major functional components, namely: interfaces,
applications and a database service. In the context of a legacy IS these components are, accordingly,
referred to as [BRO93, BRO95] the:

       • legacy interface,
       • legacy application,
       • legacy database service.

        These functional components are sometimes inter-related depending on how they were
designed and implemented in the IS’s development. They may exist independently of each other,
having no functional dependencies (i.e. all three components are decomposable - see section ‘a’ of
figure 4.1); they may be semi-decomposable (e.g. the interface may be separate from the rest of the
system - see section ‘b’ of figure 4.1); or they may be totally non-decomposable (i.e. the functional
components are not discrete but are intertwined and used as a single unit - section ‘c’ of figure 4.1).
This variety makes the legacy IS environment complex to deal with. Due to the potential complexity
of entire legacy ISs, we have focussed on one particular functional component only, namely: the
legacy database service.

        In order to restrict our attention to the database service component, we have to treat the other
components, namely the interface and application, as black boxes. This can be done successfully
when a legacy IS has decomposable modules as in section ‘a’ of figure 4.1. However, when the
legacy database service is not fully decomposable from both the legacy interface and the legacy
application, treating them as black boxes may result in loss of information since relevant database
code may also appear in other components. In such cases, attempts must be made by the designer to
decompose or restructure the legacy code. The designer needs to investigate the legacy code of the
interface and application modules to detect any database service code, then move it to the database
service module (e.g. legacy code used to specify and enforce integrity constraints in the interface or
application components is identified and transferred to this module). This will assist in the
conversion of legacy ISs of the types shown in sections ‘b’ and ‘c’ of figure 4.1 to conform to the
structure of the IS type in section ‘a’ of figure 4.1. The identification and transfer of any legacy
database code left in the legacy application or interface modules can be done at any stage (e.g. even
after the initial migration) as the migration process can be repeated iteratively. Also, the existence of
legacy database code in the application does not affect the operation of the IS as we are not going to
change any existing functionalities during the migration process. Hence, treating a legacy interface
or a legacy application having legacy database code as a black box does not harm migration.






                                Figure 4.1 : Functional Components of a Legacy IS
                                (a) decomposable: interface (I), application (A) and database service (D) are separate modules;
                                (b) semi-decomposable: the interface is separate from a combined application / database service (A/D);
                                (c) non-decomposable: interface, application and database service form a single unit (I/A/D).


       4.2.1 Legacy Interfaces

        Early information systems were developed for system users who were computer literate.
These systems did not have system or user level interfaces because they were not regarded as
essential since the system users did these tasks themselves. However, when the business community
and others wanted to use these systems, the need for user interfaces was identified and they have
been incorporated into more recent ISs.

        The introduction of DBMSs in the late 1960’s facilitated easy access to computer held data.
However, in the early DBMSs, the end user had no direct access to their database and their
interactions with the database were done through verbal communication with a skilled database
programmer [ELM94]. All user requests were made via the programmer, who then coded the task as
a batch program using a procedural language. Since the introduction of query languages such as SQL
[CHA74, CHA76], QBE [ZLO77] and QUEL [HEL75], provision of interfaces for database access
has rapidly improved. These interfaces are provided not only to encourage laymen to use the system
but also to hide technical details from users. A presentation layer consisting of forms [ZLO77] was
the initial approach to supporting interaction with the end user. Modern user interfaces rely on
multiple screen windows and iconic (pictorial) representations of entities manipulated by pull-down
or pop-up menus and pointing devices such as cursor mice [SHN86, NYE93, HEL94, QUE93]. The
current trend is towards using separate interfaces for all decomposable application modules of an IS.
Some Graphical User Interface (GUI) development tools (e.g. Vision for graphics and user interfaces
[MEY94]) allow the construction of advanced GUIs supporting portability to various toolkits. This is
an important step towards building the next generation of ISs. Changes in the user interface and
operating environment result in the need for user training, an additional factor in system evolution
costs.

        As defined by Brodie [BRO93, BRO95], we shall use the term legacy interfaces in the
context of all ISs whose applications have no interfaces or use old fashioned user / system interfaces.
In figures 4.1a and 4.1b, these interfaces are distinct and separable from the rest of the system as
they are decomposable modules. However, interfaces can be non-decomposable as shown in figure
4.1c. Migration issues concerning user interfaces have been addressed in [BRO93, BRO95,
MER95], and as mentioned in section 4.2, our work does not address problems associated with such
interface migration.

       4.2.2 Legacy Applications




       Originally, ISs were written using 3GLs, usually the COBOL or PL/1 programming
languages. These languages had many software engineering deficiencies due to the state of the
technology at the time of their development. Techniques such as structured programming,
modularity, flexibility, reusability, portability, extensibility and tailorability [YOU79, SOM89,
BOO94, ELM94] were not readily available until subsequent advances in software engineering
occurred. The lack of these has made 3GL based ISs appear to be inflexible and, hence, difficult and
expensive to maintain and evolve to meet changing business needs.

        The unstructured and non-modular nature of 3GLs resulted in the use of non-decomposable
application modules13 in the development of early ISs. However, with the introduction of software
engineering techniques such as structured modular programming these 3GLs were enhanced, and
new languages, such as Pascal [WIR71], Simula [BIR73], and more recently C++ [STR91] and
Eiffel [MEY92], gradually emerged to support these changing software engineering requirements.

        The emergence of query languages in the 1970’s as standard interfaces to databases saw the
development and use of embedded query languages in host programming languages for large
software application program development. Embedded QUEL for INGRES [RTI90a] and Embedded
SQL for many relational DBMSs [ANSI89b] are examples of this genre. The emergence of 4GLs is
an evolution which allows users to give a high-level specification of an application expressed
entirely in 4GL code. A tool then automatically generates the corresponding application code from
the 4GL code. For example, in the INGRES Application-by-Forms interface [RTI90b], the
application designer develops a system by using forms displayed on the screen, instead of writing a
program. Similar software products are offered by other vendors, such as Oracle [KRO93].

        Information systems developed recently have partially or totally adopted modern software
engineering practices. As a result, decomposable modules exist in some recent ISs, i.e. their
architecture is as in section ‘a’ of figure 4.1. Applications which do not use the concept of
modularity are non-decomposable (e.g. section ‘c’ of figure 4.1), while those partially using it are
semi-decomposable (section ‘b’ of figure 4.1). Semi- and non- decomposable applications are
referred to as legacy applications and need to be converted into fully-decomposable systems to
simplify maintenance and make it easier for them to evolve and support future business needs.

        Some aspects of legacy application migration need tools to analyse code. These are discussed
in [BIG94, NIN94, BRA95, WON95]. They are beyond the scope of this thesis, except insofar as we
are interested in any legacy code that is relevant to the provisions of a modern database service (e.g.
integrity constraints).

        4.2.3 Legacy Database Service

       Originally, many ISs were developed on centralised mainframes using COBOL and PL/1
based file systems [FRY76, HAN85]. These ISs had no DBMS and their data was managed by the
system using separate files and programs for each file handling task [HAN85]. Later ISs were based

   13
      often containing calculated or parameter-driven GOTO statements preventing a reasonable
analysis of their structure.



on using database technology with limited DBMS capabilities. These systems typically used
hierarchical or network DBMSs for their data management [ELM94, DAT95], such as IMS
[MCG77] and IDMS [DAT86, ELM94].

        The introduction and rapid acceptance of the relational model for DBMSs in recent years has
resulted in most applications now being developed with original relational DBMSs (e.g. System R
[AST76], DB2 [HAD84], SQL/DS [DAT88], INGRES [STO76, DAT87]). The steady evolution of
the relational data model has resulted in the emergence of extended relational DBMSs (e.g.
POSTGRES [STO91]) and newer versions of existing products (e.g. Oracle [ROL92], INGRES
[DAT87] and SYBASE [DAT90]) which have been used for most recent database applications. This
relational data model has been widely accepted as the dominant current generation standard for
supporting ISs. The rapidly expanding horizon of applications means that DBMSs are now expected
to cater for diverse data handling needs such as management of image, spatial, statistical or temporal
databases [ELM94, DAT95], and it is in supporting these that they are often weak. This highlights
the differing ranges of functionality supported by various DBMSs. Applications built on older
database services therefore have to provide modern database functionality within their own
application modules. This is a typical characteristic of a legacy IS. Hence, the structure of such ISs is
more complex and poorly understood, as it is not adequately engineered in accordance with current
technology.

        The database services offered by most hierarchical, network and original relational DBMSs
are now considered to be primitive, as they fail to support many functions (e.g. backup, recovery,
transaction support, increased data independence, security, performance improvements and views
[DAT77, DAT81, DAT86, DAT90, ELM94]) found in modern DBMSs. These functions facilitate
the system maintenance of databases developed using modern DBMSs. Hence, the database services
provided by early DBMSs and file based systems are now referred to as legacy database services,
since they do not fulfil many current requirements and expectations of such services.

        The specifications of a database service are described by a database schema which is held in
the database using data dictionaries. Analysis of the contents of these data dictionaries will provide
information that is useful in constructing a conceptual model for a legacy system. Our approach
focuses on using the data dictionaries of a relational legacy database to extract as much information
as possible about the specifications of that database.

       4.2.4 Other Characteristics

        The programming constructs of 3GL programs are less powerful than the data manipulation
features offered by 4GLs. As 4GL code uses the higher level DML code of modern DBMSs, it uses
less code (e.g. about 20% less) than its predecessors to accomplish even more powerful
applications. A typical program of a 3GL based information system is large and may consist of over
a hundred thousand lines of 3GL code. This means that a 20% reduction is a considerable saving in
quantity of code to be maintained [BRO93, BRO95]. Code translation tools [SHA93, SHE94] are
being built to automate as far as possible the conversion between 3GL and 4GL. These translations
sometimes optimise the translated code. Some of these techniques were mentioned in section 1.1.1.

       The long lifetime of some ISs also leads to deficiencies in documentation. This may be due
to non-existent, out of date or lost documentary materials. The extent of this deficiency was only



realised recently when people tried to transform ISs. To address this problem in the future, CASE
tools are being developed to automatically produce suitable documentation for current ISs developed
using them [COMP90]. However, this solution does not apply to legacy ISs, as they were not built
using such tools and the tools cannot be applied to them retrospectively. Thus we must solve this
problem in another way.

        Sometimes, certain critical functions of an IS are written for high performance, often using a
specific, machine dependent set of instructions on some obsolete computer. This results in the use of
mysterious and complex machine code constructs which need to be deciphered to enable the code to
be ported to other computer systems. Such code is usually not convertible using generalised
translation tools. In general, the performance of legacy ISs is poor as most of their functions are not
optimised. This is inevitable, due to the state of the technology at the time of their original
development. Thus problems arise when we try to translate 3GL code into 4GL equivalent code in a
straightforward manner.

       Solving the problems identified above is the overall concern when assisting the migration
and evolution of legacy ISs. However, our aim is to address only some of the problems concerning
legacy ISs, as the complete task is beyond the scope of our project.

       4.2.5 Legacy Information System Architecture

        Having considered the characteristics of the components of legacy ISs, we can conclude that
a typical IS consists of many application modules, which may or may not use an interface for user /
system interactions, and may use a legacy database service to manage legacy data. This database
service may use a DBMS to manage its database.

        Hence, in general, the architecture of most legacy ISs is not strictly decomposable, semi-
decomposable or non-decomposable, as they may have evolved several times during their lifetime.
As a result, parts of the system may belong to any of the three categories shown in figure 4.2. This
means that the general architecture of a legacy IS is a hybrid one, as defined in [BRO93, BRO95,
KAR95]. Figure 4.2 suggests that some interfaces and application modules are inseparable from the
legacy database service while others are modular and independent of each other. This legacy IS
architecture highlights the database service component, as our interactions are with this layer to
extract the legacy database schema and any other database services required.

4.3 Target Information System

        A legacy IS can be migrated to a target IS with an associated computing environment. This
target environment is intended to take maximum advantage of the benefits of rightsized computers,
client/server network architecture, and modern software including relational DBMSs, 4GLs and
CASE tools. In this section we present the hardware and software environments needed for the target
ISs.

       4.3.1 Hardware Environment

       The target environment must be equipped with modern technology supporting current



business needs and be flexible enough to evolve to fulfil future requirements. The
fundamental goal of a legacy IS migration is that the target IS must not itself become a legacy IS in
the near future. Thus, the target hardware environment needs to be flexibly networked (e.g. client-
server architecture) to support a distributed multi-user community. This type of environment
includes a desk top computer for each target IS user with an appropriate server machine(s)
controlling and resourcing the network provision. A PC (e.g. IBM PC or compatible) or a
workstation (e.g. Sun SPARC) may be used as the desk top computer (i.e. client / local machine),
each being connected using a local area network (LAN) (e.g. Ethernet) to the server(s).



                                Figure 4.2 : Legacy IS Architecture
                                Non-decomposable, semi-decomposable and decomposable interface / application modules
                                all rely on the legacy database service, which manages the legacy databases and data.
                                (I - Interface module; A - Application module with legacy database services;
                                 M - Decomposed application module)


       4.3.2 Software Environment

       The target database software is typically based on a relational DBMS with a 4GL, SQL,
report writers and CASE tools (e.g. Oracle v7 with Oracle CASE). Use of such software provides
many benefits to its users, such as an increase in program / data independence, introduction of
modularised software components, graphical user interfaces, reduction in code, ease of maintenance,
support for future evolution and integration of heterogeneous systems.

        The target database can be centralised on a single server machine or distributed over multiple
servers in a networked environment. The target system may consist of application modules
representing the decomposed system components, each having its corresponding graphical user
interface (see figure 4.3). A typical architecture for a modern IS consists of layers for each of the
system functions (e.g. interface, application, database, network) as identified in [BRO93, BRO95].
In figure 4.3 we introduce such an architecture with special emphasis on the database service, which
will be a modern DBMS.








                                Figure 4.3 : Target IS Architecture
                                Decomposed application modules M1..Mn, each with its own graphical user interface
                                GUI1..GUIn, access the target databases through the target DBMS.
                                (GUI - graphical user interface module; M - Decomposed application module)


       The complete migration process involves significant changes, not only in hardware and
software of the applications but also in the skills required by users and management. Thus they will
have to be trained or replaced to operate the target IS. These changes must be done in some
organised manner as the complete migration process itself is complex, and may take months or even
years depending on the size and complexity of the legacy IS. The number of persons involved in the
migration process and the resources available also contribute towards determining the ultimate
duration and cost of the migration.

4.4 Migration Strategies

        The migration process for legacy ISs may take one of two main forms [BRO93, BRO95].
The first approach involves rewriting a legacy IS from scratch to produce the target IS using modern
software techniques (i.e. a complete migration). The other approach involves gradually migrating
the legacy IS in small steps until the desired long term objective is reached (i.e. incremental
migration). The approach of complete rewriting carries substantial risks and has failed many times in
large organisations, as it is not an easy task to perform, especially when dealing with large ISs or
with systems that must remain operational throughout the process [BRO93, BEN95, BRO95]. However, if
the incremental approach fails, then only the failed step must be repeated rather than redoing the
entire migration. Hence, it is argued [BRO95] that the latter approach involves a lower risk and is
more appropriate in most situations. These approaches are further described in the next two
subsections. Our work is directed towards assisting this incremental migration approach.

       4.4.1 Complete Migration

        The process of complete migration involves rewriting a legacy IS from scratch to produce the
intended target IS. This approach carries a substantial risk. We discuss some of the reasons for this
risk to explain why this approach is not considered to be suitable by us. These are:

       a) A better system is expected.




        A 1-1 rewrite of a complex IS is nearly impossible to undertake, as additional functions not
present in the original system are expected to be provided by the target IS. Besides, it is a significant
problem to evolve a developing replacement IS in step with an evolving legacy IS and to incorporate
ongoing changes in business requirements into both. Changes to and requests to evolve ISs may occur
at any time, without warning, and hence, it is difficult to incorporate any minor / major ad hoc
changes to the new system as they may not fit into its design. Also, an attempt to change this design
may violate its original goals and contribute towards a never ending cycle of development changes.

        b) Specifications rarely exist for the current system.

        Documentation for the current system is often non-existent and typically available only in the
form of the code14 itself. Due to the evolution of the IS many undocumented dependencies will have
been added to the system without the knowledge of the legacy IS owners (i.e. uncontrolled
enhancements have occurred). These additions and dependencies must be identified and
accommodated when rewriting from scratch. This adds to the complexity of a complete rewriting
process and raises the risk of unpredicted failure of dependent ISs when we rewrite a legacy system
as they are dependent on undocumented features of that system.

        c) Information system is too critical to the organisation.

        Many legacy ISs must be operational almost all the time and cannot be dormant during
evolution. This means that migrating live data from a legacy IS to the target IS may take more time
than the business can afford to be without its mission critical information. Such situations often
prohibit complete rewriting altogether and make this approach a non-starter. It also means that a
carefully thought out staged migration plan must be followed in this situation.

        d) Management of large projects is hard.

        The management of large projects involves managing more and more people. This normally
results in less and less productive work because of the effort required to manage organisational
complexity. As a result the timing of most large projects is seriously under-estimated. Frequently,
this results in partial or complete abortion of the project, as the inability to keep up with original
targets due to time lost is not always tolerated by an impatient company management.

        4.4.2 Incremental Migration

       An incremental migration process involves a series of steps, each requiring a relatively small
resource allocation (e.g. a few person weeks or months in the case of small or medium scale
systems), and a short time to produce a specific small result towards the desired goal. This is in sharp
contrast to the complete rewrite approach which may involve a large resource allocation (e.g. several
person months or years), and a development project spanning several years to produce a single
massive result. To perform a migration involving a series of steps, it is important to identify

   14
      This code is sometimes provided only in the form of executable code, as ISs are often in-house
developments.


independent increments (i.e. portions of the legacy interfaces, applications and databases that can be
migrated independently of each other), and sequence them to achieve the desired goal. However, as
legacy ISs have a wide spectrum of forms from well-structured to unstructured, this process is
complex and usually has to deal with unavoidable problems due to dependencies between migration
steps. The following are the most important steps to apply in an incremental migration approach:

       a) Iteratively migrate the computing environment.

         The target environment must be selected, tested and established based on the total target IS
requirements. To determine the target IS requirements, it may be necessary to partially or totally
analyse and decompose the legacy IS. The installation of the target environment typically involves
installing a desk top computer for each target IS user and appropriate server machines, as identified
in section 4.3.1. The process of replacing dumb terminals with a PC or a workstation and connecting
them with a local area network can be done incrementally. This process allows the development of
the application modules and GUIs on an inexpensive local machine by downloading the relevant
code from a server machine, rather than by working on the server itself to develop this software.

        Software and hardware changes are gradually introduced in this approach along with the
necessary user and management training. Hence, although we explicitly refer to a particular process
there are many processes that take place simultaneously. This is due to the involvement of many
people in the overall migration activity, with each person contributing towards the desired migration
goal in a controlled and measurable way.

       Our work is concerned with iteratively migrating part of the legacy software (i.e. the database
service) and not the computing environment. Therefore we worked on a preinstalled target software
and hardware environment.

          b) Iteratively analyse the legacy information system.

       The purpose of this process is to understand the legacy IS in detail so that ultimately the
system can be modified to consist of decomposable modules. Any existing documentation, along
with the system code are used for this analysis. Knowledge and experience from people who support
and manage the legacy IS is also used to document the existing and the target IS requirements. This
knowledge has played a key role in other migration projects [DED95].

        Some existing code analysis tools such as Reasoning Systems' Software Refinery and
Bachman Information Systems' Product Set [COMP90, CLA92, BRO93, MAR94] can be used to
assist in this process. It may be useful to conduct experiments to examine the current system using
its known interfaces and available tools (e.g. CASE tools), so that the information gathered with one
tool can be reused by other tools. Here, functions and the data content of the current system are
analysed to extract as much information as possible about the legacy IS. Other available information
for this type of analysis includes: documentation, discussions with users, dumps (system backups),
the history of system operation and the services it provides.

      We do not perform any code analysis as part of our work. However, the analysis we do by
automated conceptual modelling identifies the usage of the data structures of the legacy IS. Our



analysis assists in identifying the structural components of the legacy IS and their functional
dependencies. This information may then be used to restructure the legacy code.

          c) Iteratively decompose the legacy information system structure.

        The objective of this process is to construct well-defined interfaces between the modules and
the database services of the legacy IS. The process may involve restructuring the legacy IS and
removing dependencies between modules. It will thereby simplify the migration, that otherwise must
support all these dependencies. This step may be too complex in the worst case, when the legacy IS
will have to remain in its original form. Such a situation will complicate the migration process and
may result in increased cost, reduced performance and additional risk. However, in such cases an
attempt to perform even limited restructuring could facilitate the migration, and is preferable to
avoiding the decomposition step altogether.

         We investigate supporting some structural changes in order to improve the existing structures
of the legacy database (e.g. introduction of inheritance to represent hierarchical structures and
specification of relationship structures). These changes eventually affect the application modules and
the interfaces of the legacy IS. Hence there is a direct dependency with respect to decomposing the
legacy database service and an indirect dependency with respect to decomposing the other
components of the legacy IS. The actual testing of this indirect dependency was not undertaken,
because it involves the application module. However, the ability to define referential integrity
constraints and assertions spanning multiple tables allows us to redefine functional dependencies in
the form of constraints or rules. When these constraints are stored in the database, it is possible to
remove such dependencies from the legacy applications. This assists decomposition of some
functional components of a legacy IS.

       d) Iteratively migrate the identified portions.

        An identified portion of the legacy IS may be an interface, application or a database service.
These components are individually migrated to the target environment. When this is done the
migrated portion will then run in the target environment with the remaining parts of the legacy
system continuing to operate. A gateway is used to encapsulate system components undergoing
changes. The objective of this gateway is to hide the ongoing changes in the application and the
database service from the system users. Obviously any change made to the appearance of any user
interface components will be noticeable, along with any significant performance improvements in
application modules processing large volumes of data.

       Our work is applicable only to a legacy database service and hence any incremental
migration of interfaces or application modules is not considered at this stage. The complete
migration of legacy data takes a significant amount of time from hours to days depending on the
volume of data held. During this process no updates or additions can be made to the data as they
cause inconsistency problems. This means all functions of the database application have to be
stopped to perform a complete data migration in one go. For large organisations this type of action is
not appropriate. Hence iterative migration of selected data portions is desirable. To ensure a
successful migration, each chosen portion needs to be validated for consistency and guarded against
being rejected by the target database. When migrating data in stages it is necessary to be aware of



the two sources of data as queries involving a migrated portion need to be re-directed to the target
system while other queries must continue to access the legacy database. This process may cause a
delay when a response for the query involves both the legacy and target databases. Hence it is
important to minimise this delay by choosing independent portions wherever possible for the
migration process.

4.5 Migration Method

        A formal approach to migrating legacy ISs has been proposed by Brodie [BRO93, BRO95]
based on his experience in the field of telecommunication and other related projects. These methods,
referred to as forward, backward/reverse and general (a combination of forward and backward)
migration, are based on his “chicken little” incremental migration process. A forward migration
incrementally analyses the legacy IS and then decomposes its structure, while incrementally
designing the target IS or installing the target environment. In this approach the database is migrated
prior to the other IS components and hence unchanged legacy applications are migrated forward onto
a modern DBMS before they are migrated to new target applications. Conversely, backward
migration creates the target applications and allows them to access the legacy data as the database
migration is postponed to the end. General migration is more complex as it uses a combination of
both these methods based on the characteristics of the legacy application and databases. However,
this is more suitable for most ISs as the approach can be tailored at will.

       The incremental migration process consists of a number of migration steps that together
achieve the desired migration. Each step is responsible for a specific aspect of the migration (i.e.
computer environment, legacy application, legacy data, system and user interfaces). The selection
and ordering of each aspect of the migration may differ as it depends on the application, as well as
the money and effort allocated for each process. Independent components can be migrated
sequentially or in parallel.

        As we see here, the migration methods of Brodie deal with all components of a legacy IS.
Our interest in this project is to focus on a particular component, namely the database service, and as
a result a detailed review of Brodie’s migration methods is not relevant here. However, our
approach has taken account of his forward migration approach as it first deals with the migration of
the legacy database service and then allows the legacy applications to access the post-migrated data
management environment through a forward database gateway.

4.6 Migration Support Tools

        There is no tool that is capable of performing a complete automatic migration for a given IS.
However, there are tools that can assist at various stages of this process. Hence, categorising tools by
their functions according to the stages of a migration process can help in identifying and selecting
those most appropriate. There are three main types of tools, namely: gateways, analysers and
migration tools, which can be of use at different stages of migration [BRO95]. For large migration
projects, testing and configuration management tools are also of use.

       a) Gateways





        The function of a gateway is to coordinate between different components of an IS and hide
ongoing changes (i.e. to interfaces, data, applications and other system components being migrated)
from users. One of the main functions of these tools is to intercept calls on an application or database
service and direct them to the appropriate part of the legacy or target IS.

        To incrementally migrate a legacy IS to a target IS, we need to select independent
manageable portions, replicate them in the target environment and give control to the new target
modules while the legacy system is still operational. To perform such a transition in a fashion
transparent to users, we need a software module (i.e. a gateway) which encapsulates system
components that are undergoing change behind an unchanged interface. Such a software module
manages information flow between two different environments, the legacy and target environments.
Functions such as retrieval, processing, management and representation of data from various systems
are expected from gateways. These expectations from a gateway managing a migration process are
similar to those we have of DBMSs managing data. DBMSs were designed to provide general
purpose data management and similarly the gateway needs to manage the migration process in a
generalised form. Development of such a gateway is beyond the scope of this project as it may take
several person years of effort. Hence our work will focus on some selected functionalities of a
gateway, which are sufficient to produce a realistic prototype.

       We aim to provide a simplified form of the functionality of a gateway, which permits the
evolution of an existing IS at the logical level, by creating a target database and managing an
incremental migration of the existing database service in a way transparent to its users. This facility
should be provided not only for centralised database systems, but also for heterogeneous distributed
databases. This means our gateway functionality should support databases built using different types
of DBMS. We expect some of this functionality to be incorporated in future DBMSs as part of their
system functionality. Functions such as schema evolution, management of heterogeneous distributed
databases and schema integration are expected capabilities of modern database products.

       b) Analysers

        These tools employ a wide variety of techniques in exploring the technical, functional and
architectural aspects of an application program or database service to provide graphical and textual
information about an IS. The functions of reverse and forward engineering are provided by these
tools.

        Many tools are used in this way to analyse the different components of an IS. Most of these
tools are semi-automatic as some form of user interaction is required to successfully complete their
task. For example, an application or database translation process is automatic if the source program
and data conform to all the standards supported by the tool. Otherwise, the translation process will
be terminated with the unconvertible portions indicated, leaving the database administrator to
complete the job manually by either correcting or re-programming those unconvertible portions of
the source program into target language code. We experienced this situation when attempting to
migrate an Oracle version 6 database to Oracle version 7, using the Oracle tools. In this case, Oracle
failed to successfully convert date functions used to check the constraints of its version 6 databases
to the equivalent coding in version 7 (Note: Oracle version 6’s use of non-standard date functions
was the cause of this problem).





       c) Migration tools

        These tools are responsible for creating the components of the target IS, including interfaces,
data, data definitions and applications.

       d) Testing

        An important task is to ensure that the migrated target IS performs in the same way as its
legacy original, with possibly better performance. For this task we need test beds that exercise as
much of the system's logic as possible using as little data as possible. There are tools that allow for easy manipulation of
testing functions like break points and data values. However, they do not help with the generation of
test beds or validation of the adequacy of the testing process. Comparing the results generated by
both systems provides a reasonable form of testing. This may
not be sufficient to test new features such as the introduction of distributed computing functionality
to our systems. It is up to the person involved to ensure that a reasonable amount of testing has been
done to ensure the functionality and the accuracy of the new IS.

       e) Configuration management

       This type of tool is needed for large migration projects involving many people, to coordinate
functions such as documentation, synchronisation, keeping track of changes made (auditing),
management of revisions to system elements (version control), and automatic building of a particular
version of a system component.

        Our work focuses on bringing these tools together into a single environment. We wish to
analyse a legacy database service, hence the functions of reverse and forward engineering are of
particular interest. We integrate these functions with some forward gateway and migration functions
as they are the relevant components for us to address the enhancement and migration of a database
service. Thus, we are not interested in all the features associated with migration support tools.

         The classification of reverse and re-engineering tools given in [SHA93, BRO95] provides a
broad description of the functions of existing CASE tools. These include maintaining, understanding
and enhancing existing systems; converting / migrating existing systems to new technology; and
building new replacement systems. There are many tools which perform some of these functions.
However, none of them is capable of performing the integrated task of all the above functions. This
is one of the important requirements for future CASE tools. As it is practically impossible to produce
a single tool to perform all these tasks, the way to overcome this deficiency is to provide a gateway
that permits multiple tools to exchange information and hence provide the required integrated
facility.

        The need to integrate different software components (i.e. database, spreadsheet, word
processing, e-mail and graphic applications) has resulted in the production of some integration tools,
such as DEC’s Link Works and Dictionary Software’s InCASE [MAY94]. However, what we need
is to integrate data extraction and downloading tools with schema enhancement and evolution
functions as they are together vital in the context of enhancing and migrating legacy databases.





        Support for interoperability among various DBMSs and the ability to re-engineer a DBMS
are important functions for a successful migration process. Of these two, the former has not been
given any attention until very recently, and there has been some progress relating to the latter in the
form of CASE tools. However, among the many CASE tools available only a handful support the re-
engineering process. The reason for this is that most CASE tools focus on forward-engineering, in
which new or replacement software systems are designed and the appropriate code is generated. The
re-engineering process is a combination of both forward-engineering and reverse-
engineering. The reverse-engineering process analyses the data structures of the databases of
existing systems to produce a higher level representation of these systems. This higher level
representation is usually in diagrammatic form and may be an entity-relationship diagram, data-flow
diagram, cross-reference diagram or structure chart.

       We came across some tools that are commercially available for performing various tasks of
the migration process. These include data access and / or extraction tools for Oracle [BAR90,
HOL93, KRO93] and INGRES [RTI92] - two of our test DBMSs. Some other tools, mainly those
capable of performing the initial reverse engineering task, are also identified here. These tools are
not suitable for legacy ISs in general, as they fail to support a variety of DBMSs or the re-
engineering of most pre-existing databases. Among the different tools available, tools such as
gateways play a more significant role than others. When different database products are used in an
organisation, there may be a need to use multiple tools for a single step of a migration process;
conversely, some tools may be of use for multiple steps. The process of using multiple tools for a
migration is complex and difficult as most vendors have not yet addressed the need for tool
interoperability.

        The survey carried out in [COMP90] identifies many reverse-engineering products. Among
the 40 vendor products listed there, only three claimed to be able to reverse engineer Oracle,
INGRES or POSTGRES databases (our test databases - see section 6.1) or any SQL based database
products. These three products were: Deft CASE System, Ultrix Programming Environment and
Foundation Vista. Of these products only Deft and Vista produced E-R diagrams. None of the
products in the complete list supported POSTGRES, which was then a research prototype. Of the
two products identified above, only Deft was able to read both Oracle and INGRES databases, while
Vista could read only INGRES databases. This analysis indicated that interoperability among our
preferred databases was rare and that it is not easy to find a tool that will perform the re-engineering
process and support interoperability among existing DBMSs. Although the information published in
[COMP90] may now be outdated, the literature published since then [SHA93, MAY94, SCH95]
does not show that modern CASE tools have addressed the task of re-engineering existing ISs along
with interoperability, both of which are essential for a successful migration process. However, the
functionality of accessing and sharing information from various DBMSs via gateways like ODBC is
a step towards achieving this task. One reason for this limited progress is the inability to customise
existing tools, which in turn prevents them being used in an integrated environment. This
is confirmed to some extent by the failure of the leading Unix database vendor - Oracle - to provide
such tools.

       Brodie and Stonebraker, in their book [BRO95], present a study of the migration of large
legacy systems. It identifies an approach (chicken-little) and the commercial tools required to



support this approach for legacy ISs in general. In this project we have developed a set of tools to
support an alternative approach for migrating legacy database services in particular. Thus Brodie and
Stonebraker take account of the need to migrate the application processes with a database, using
commercial tools, while in this thesis we concentrate on the development of integrated tools for
enhancing and migrating a database service.

4.7 The Migration Process

        Having identified the migration strategies and methods applicable to our work, we can
review our migration process. This process must start with a legacy IS as in figure 4.2 and end with a
target IS as shown in figure 4.3. However, as we are not addressing the application and interface
components of a legacy IS, their conversion is not part of this project.

Our conceptual constraint visualisation and enhancement system (CCVES) described in
section 2.2 was designed to assist in preparing legacy databases for such a migration. Hence our
migration process can be performed by connecting the legacy and target ISs using CCVES. This is
shown in figure 4.4. The three essential steps performed by CCVES before the actual migration
process occurs are shown using the paths highlighted in this figure as A, B and C, respectively.
These are the same paths that were described in section 2.2.

        The identification of all legacy databases used by an application is made prior to the
commencement of path A of figure 4.4. The reverse engineering process is then performed on any
selected database. This process commences when the database schema and its initial constraints are
extracted from the selected database and is completed when the database schema is graphically
displayed in a chosen format. Any missing or new information is supplied via path B in the form of
enhanced constraints, to allow further relationships and constraints to appear in the conceptual
model. The constraint enforcement process of path C is responsible for issuing queries and applying
these constraints to the legacy data and taking necessary actions whenever a violation is detected,
before any migration occurs. This ensures that the legacy data is consistent with its enhanced
constraints before migration. Once these steps are completed, a graceful transparent migration
process can be undertaken via path D. Our work focuses only on evolving and migrating database
services, hence path X representing the application migration is not done via CCVES. The evolution
of database services includes increasing IS program / data independence by identifying and
transferring legacy application services which are concerned with data management functions, like
enforcement of referential constraints, integrity constraints, rules, triggers, etc., to the database
service from the application. Our migration process performs the transformation of the legacy
database to the target environment and passes responsibility for enforcing the newly identified
constraints to this system.

        Figure 4.4 indicates that our approach commences with a reverse engineering process. This is
followed by a knowledge augmentation process which itself is a function of a forward engineering
process. These two stages together are referred to as re-engineering (see section 5.1). The constraint
enforcement process is the next stage of our approach. This is associated with the enhanced
constraints of the previous stage as it is necessary to validate the existing and enhanced constraint
specifications against the data held. These three preparatory stages are described in chapter 5. The
final stage of our approach is the database migration process. This is described later after we have



fully discussed the application of the earlier stages in relation to our test databases.




CHAPTER 5

                       Re-engineering Relational Legacy Databases

This chapter addresses the re-engineering process and issues concerned with relational legacy
DBMSs. Initially, the reverse-engineering process for relational databases is overviewed. Next, we
introduce our re-engineering approach, highlighting its important stages and the role of constraints in
performing these stages. We then present how existing legacy databases can be enhanced with
modern concepts and introduce our knowledge representation techniques, which allow the enhanced
knowledge to be held in the legacy database. Finally, we describe the optional constraint
enforcement process which allows validation of existing and enhanced constraint specifications
against the data held.

5.1 Re-engineering Relational Databases

        Software such as programming code and databases is re-engineered for a number of reasons:
for example, to allow reuse of past development efforts, reduce maintenance expense and improve
software flexibility [PRE94]. This re-engineering process consists of two stages, namely: a reverse-
engineering and a forward-engineering process. In database migration the reverse-engineering
process may be applied to help migrate databases between different vendor implementations of a
particular database paradigm (e.g. from INGRES to Oracle), between different versions of a
particular DBMS (e.g. Oracle version 3 to Oracle version 7) and between database types (e.g.
hierarchical to modern relational database systems). The forward-engineering process, which is the
second stage of re-engineering, is performed on the conceptual model derived from the original
reverse-engineering process. At this stage, the objective is to redesign and / or enhance an existing
database system with missing and / or new information.

        The application of reverse-engineering to relational databases has been widely described and
applied [DUM81, NAV87, DAV87, JOH89, MAR90, CHI94, PRE94, WIK95b]. The latest
approaches have been extended to construct a higher level of abstraction than the original E-R
model. This includes the representation of object-oriented concepts such as generalisation /
specialisation hierarchies in a reversed-engineered conceptual model.

        Due to parallel work in this area in recent years, there are some similarities and differences
between our reverse-engineering approach [WIK95b] and other recent approaches [CHI94, PRE94].
We shall refer to these in the next sub-sections.

       The techniques used in the reverse-engineering process address the following common tasks:

       • Identify the database’s contents such as relations and attributes of relations.
       • Determine keys, e.g. primary keys, candidate keys and foreign keys.
       • Determine entity and relationship types.
       • Construct suitable data abstractions, such as generalisation and aggregation structures.
       5.1.1 Contents of a relational database

       Diverse sources provide information that leads to the identification of a database’s contents.
These include the database’s schema, observed patterns of data, semantic understanding of
application and user manuals. Among these the most informative source is the database’s schema,
which can be extracted from the data dictionary of a DBMS. The observed patterns of data usually
provide information such as possible key fields, domain ranges and the related data elements. This
source of information is usually not reliable as invalid, inconsistent, and incomplete data exists in
most legacy applications. The reliability can be increased by using the semantics of an application.
The availability of user manuals for a legacy IS is rare and they are usually out of date, which means
they provide little or no useful information to this search.

        Data dictionaries of relational databases store information about relations, attributes of
relations, and rapid data access paths of an application. Modern relational databases record
additional information, such as primary and foreign keys (e.g. Oracle), rules / constraints on relations
(e.g. INGRES, POSTGRES, Oracle) and generalisation hierarchies (e.g. POSTGRES). Hence,
analysis of the data dictionaries of relational databases provides the basic elements of a database
schema, i.e. entities, their attributes, and sometimes the keys and constraints, which are then used to
discover the entity and relationship types that represent the basic components of a conceptual model
for the application. The trend is for each new product release to support more sophisticated facilities
for representing knowledge about the data.

       5.1.2 Keys of a relational data model

        Theoretically, three types of key are specified in a relational data model. They are primary,
candidate and foreign keys. Early relational DBMSs were not capable of explicitly representing
these. However, the indexes which are used for rapid data access can sometimes be used as a clue to
determine some keys of an application database. For instance, the analysis of the unique index keys
of a relational database provides sufficient information to determine possible primary or candidate
keys of an application. The observed attribute names and data patterns may also be used to assist this
process. This includes attribute names ending with ‘#’ or ‘no’ as possible candidate keys, and
attributes in different relations having the same name for possible foreign key attributes. In the latter
case, we need to consider homonyms to eliminate incorrect detections and synonyms to prevent any
omissions due to the use of different names for the same purpose. Such attributes may need to be
further verified using the data elements of the database. This includes explicit checks on data for
validity of uniqueness and referential integrity properties. However the reverse of this process, i.e.
determining a uniqueness property from the data values in the extensional database is not a reliable
source of information, as the data itself is usually not complete (i.e. it may not contain all possible
values) and may not be fully accurate. Hence we do not use this process although it has been used in
[CHI94, PRE94].

        The lack of information on keys in some existing database specifications has led to the use of
data instances to derive possible keys. However it is not practicable to automate this process as some
entities have keys consisting of multiple attributes. This means many permutations would have to be
considered to test for all possibilities. This is an expensive operation when the volume of data and /
or the number of attributes is large.




        In [CHI94], a consistent naming convention is applied to key attributes. Here attributes used
to represent the same information must have the same name, and as a result referencing and
referenced attributes of a binary relationship between two entities will have the same attribute names
in the entities involved. This naming convention was used in [CHI94] to determine relationship
types, as foreign key specifications are not supported by all databases. An important contribution of
our work is to support the identification of foreign key specifications for any database and hence the
detection of relationships, without performing any name conversions. We note that some reverse-
engineering methods rely on candidate keys (e.g. [NAV87, JOH89]), while others rely on primary
keys (e.g. [CHI94]). These approaches require their users to meet certain pre-requisites (e.g. the
specification of missing keys) before their reverse-engineering process can be applied successfully.
This means it is not possible to produce a suitable conceptual model until the pre-requisites are
supplied. For a large legacy database application the number of such pre-requisites could exceed a
hundred, and hence it is not appropriate to rely on them being met before deriving an initial
conceptual model. Therefore, we concentrate on providing an initial conceptual model using only the
available information. This will ensure that the reverse-engineering process will not fail due to the
absence of any vital information (e.g. the key specification for an entity).

         5.1.3 Entity and Relationship Types of a data model

         In the context of an E-R model an entity is classified as strong15 or weak depending on an
existence-dependent property of the entity. A weak entity cannot exist without the entity it is
dependent on. The enhanced E-R model (EER) [ELM94] identifies more entity types, namely:
composite, generalised and specialised entities. In section 3.3.1 we described these entity types and
the relationships formed among them. Different classifications of entities are due to their associative
properties with other entities. The identification of an appropriate entity type for each entity will
assist in constructing a graphically informative conceptual model for its users. The extraction of
information from legacy systems to classify the appropriate entity type is a difficult task as such
information is usually lost during an implementation. This is because implementations take different
forms even within a particular data model [ELM94]. Hence, an information extraction process may
need to interact with a user to determine some of the entity and relationship types. The type of
interaction required depends on the information available for processing and will take different
forms. For this reason we focus only on our approach, i.e. determining entity and relationship types
using enhanced knowledge such as primary and foreign key information. This is described in section
5.4.

         5.1.4 Suitable Data Abstractions for a data model

       Entities and relationships form the basic components of a conceptual data model. These
components describe specific structures of a data model. A collection of entities may be used to
represent more than one data structure. For example, entities Person and Student may be represented
as a 1:1 relationship or as an is-a relationship. Each representation has its own view, and hence the
user understanding of the data model will differ with the choice of data structure. Hence it is
important to be able to introduce any data structure for a conceptual model and to view it using the
most suitable data abstraction.

   15
        In some literature this type of entity is referred to as a regular entity, e.g. [DAT95].

        Data structures such as generalisation and aggregation have inherent behavioural properties
which give additional information about their participating entities (e.g. an instance of a specialised
entity of a generalisation hierarchy is made up from an instance of its generalised entity). These
structures are specialised relationships and representation of them in a conceptual model provides a
higher level of data abstraction and a better user understanding than the basic E-R data model gives.
These data abstractions originated in the object-oriented data model and they are not implicitly
represented in existing relational DBMSs. Extended-relational DBMSs support the O-O paradigm
(e.g. POSTGRES) with generalisation structures being created using inheritance definitions on
entities. However in the context of legacy DBMSs such information is not normally available, and as
a result such data abstractions can only be introduced either by introducing them without affecting
the existing data structures or by transforming existing entities and relationships to support their
representation. For example, entities Staff and Student may be transformed to represent a
generalisation structure by introducing a Person entity.

        Other forms of transformation can also be performed. These include decomposing all n-ary
relationships for n > 3 into their constituent relationships of order 2 to remove such relationships and
hence simplify the association among their entities. At this stage double buried relationships are
identified and merged, and relationships formed with subclasses are eliminated. Transitive closure
relationships are also identified and changed to form simplified hierarchies. We use constraints to
determine relationships and hierarchies. By controlling these constraints (i.e. modifying or deleting
them) it is possible to transform or eliminate relationships and hierarchies as necessary.

5.2 Our Re-engineering Process

        Our re-engineering process has two stages. Firstly, the relational legacy database is accessed
to extract the meta-data of the application. This extracted meta-data is translated into an internal
representation which is independent of the vendor database language. This information is next
analysed to determine the entity and relationship types, their attributes, generalisation / specialisation
hierarchies and application constraints. The conceptual model of the database is then derived using
this information and is presented graphically for the user. This completes the first stage which is a
reverse-engineering process for a relational database.

       To complete the re-engineering process, any changes to the existing design and any new
enhancements are done at the second stage. This is a forward-engineering process that is applied to
the reverse-engineered model of the previous stage. We call this process constraint enhancement as
we use constraints to enhance the stored knowledge of a database and hence perform our forward-
engineering process. These constraint enhancements are done with the assistance of the DBA.

       5.2.1 Our Reverse-Engineering Process

       Our reverse-engineering process concentrates on producing an initial conceptual model
without any user intervention. This is a step towards automating the reverse-engineering process.
However the resultant conceptual model is usually incomplete due to the limited meta-knowledge
available in most legacy databases. Also, as a result of incomplete information and unseen inclusion
dependencies we may represent redundant relationships as well as fail to identify some of the entity
and / or relationship types. We depend on constraint enhancement (i.e. the forward-engineering
process) to supply this missing knowledge so that subsequent conceptual models will be more
complete. The DBA can investigate the reverse-engineered model to detect and resolve such cases
with the assistance of the initial display of that model. The system will need to guide the DBA by
identifying missing keys and assisting in specifying keys and other relevant information. It also
assists in examining the extent to which the new specifications conform to the existing data.

        Our reverse-engineering process does not depend on information about specialised
constraints. When no information about these is available, we treat all entities of a database as being of
the same type (i.e. strong entities) and any links present in the database will not be identified. In such
a situation the conceptual model will display only the entities and attributes of the database schema
without any links. For example, a relational database schema for a university college database
system with no constraint-specific information will initially be viewed as shown in figure 5.1. This is
the usual case for most legacy databases as they lack constraint-specific information. However, the
DBA will be able to provide any missing information at the next stage so that any intended data
structures can be reconstructed. Obviously if some constraints are available our reverse-engineering
process will try to derive possible entity types and links during its initial application.

                    [Figure: each entity displayed as a box listing its attributes; no links are shown]

                    University  (office)
                    College     (code, building, name, address, principal, phone)
                    Faculty     (code, building, name, address, secretary, phone, dean)
                    Department  (deptCode, building, name, address, head, phone, faculty)
                    Employee    (name, address, birthDate, gender, empNo, designation, worksFor,
                                 yearJoined, room, phone, salary)
                    Student     (name, address, birthDate, gender, collegeNo, course, department,
                                 tutor, regYear)
                    EmpStudent  (collegeNo, empNo, remarks)


                    Figure 5.1 : A relational legacy database schema for a university college database


        Our reverse-engineering process first identifies all the available information by accessing the
legacy database service (cf. section 5.3). The accessed information is processed to derive the
relationship and entity types for our database schema (cf. section 5.4). These are then presented to
the user using our graphical display function.

       5.2.2 Our Forward-Engineering Process

        The forward-engineering process is provided to allow the designer (i.e. DBA) to interact with
a conceptual model. The designer is responsible for verifying the displayed model and can supply
any additional information to the reverse-engineered model at this stage. The aim of this process is
to allow the designer to define and add any of the constraint types we identified in section 3.5 (i.e.
primary key constraints, foreign key constraints, uniqueness constraints, check constraints,
generalisation / specialisation structures, cardinality constraints and other constraints) which are not
present. Such additions will enhance the knowledge held about a legacy database. As a result, new
links and data abstractions that should have been in the conceptual model can be derived using our
reverse-engineering techniques and presented in the graphical display. This means that the legacy
database schema originally viewed as in figure 5.1 can be enhanced with constraints and presented
as in figure 5.2, which is a vast improvement on the original display. Such an enhanced display
demonstrates the extent to which a user’s understanding of a legacy database schema can be
improved by providing some additional knowledge about the database. In sections 6.3.4 and 6.4 we
introduce the enhanced constraints of our test databases including those used to improve the legacy
database schema of figure 5.1 to figure 5.2.

                 [Figure: the enhanced EER schema, showing
                  - a generalisation hierarchy in which Person (name, address, birthDate, gender) is the
                    supertype of Employee (empNo, designation, yearJoined, room, phone, salary) and
                    Student (collegeNo, course, department, regYear), with EmpStudent (remarks)
                    inheriting from both;
                  - a generalisation hierarchy in which Office (code, siteName, unitName, address, phone)
                    is the supertype of College-Office (siteName AS building, unitName AS name,
                    inCharge AS principal), Faculty-Office (siteName AS building, unitName AS name,
                    inCharge AS secretary) and Dept-Office (code AS deptCode, siteName AS building,
                    unitName AS name, inCharge AS head);
                  - an aggregation (office) between University and Office;
                  - relationships inCharge, worksFor, dean, tutor and faculty, with cardinality
                    annotations such as 4+ and 2-12;
                  - a constraint list attached to each entity.]


                                  Figure 5.2 : The enhanced university college database schema


        We support the examination of existing specifications and identification of possible new
specifications (cf. section 5.5) for legacy databases. Once these are identified, the designer defines
new constraints using a graphical interface (cf. section 5.6). The new constraint specifications are
stored in the legacy database using a knowledge augmentation process (cf. section 5.7). We also
supply a constraint verification module to give users the facility to verify and ensure that the data
conforms to all the enhanced constraints (cf. section 5.8) being introduced.

5.3 Identifying Information of a Legacy Database Service

       Schema information about a database (i.e. meta-data) is stored in the data dictionaries of that
database. The representation of information in these data dictionaries is dependent on the type of the
DBMS. Hence initially the relational DBMS and the databases used by the application are identified.
The name and the version of the DBMS (e.g. INGRES version 6), the names of all the databases in
use (e.g. faculty / department), and the name of the host machine (e.g. llyr.athro.cf.ac.uk) are
identified at this stage. These are the input data that allows us to access the required meta-data. As
the access process is dependent on the type of the DBMS, we describe this process in section 6.5
after specifying our test DBMSs. This process is responsible for identifying all existing entities, keys
and other available information in a legacy database schema.
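
        As an indication of what this access involves, a DBMS such as Oracle version 7 exposes its
constraint meta-data through catalog views, and a query of roughly the following form lists the
declared primary and foreign keys. This is a sketch only: the view and column names shown are
Oracle-specific, and INGRES and POSTGRES hold the corresponding information in their own system
catalogs (cf. section 6.5).

        -- primary ('P') and foreign ('R') key constraints recorded in the Oracle data dictionary
        SELECT c.table_name, c.constraint_name, c.constraint_type, cc.column_name
        FROM   user_constraints c, user_cons_columns cc
        WHERE  c.constraint_name = cc.constraint_name
        AND    c.constraint_type IN ('P', 'R')
        ORDER  BY c.table_name, c.constraint_name, cc.position;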

5.4 Identification of Relationship and Entity Types

        Once the entities and their attributes along with primary and candidate keys have been
provided, we are ready to classify relationships and entity types. Three types of binary relationships
(i.e. 1:1, 1:N and N:M) and five types of entities (i.e. strong, weak, composite, generalised and
specialised) are identified at this stage.

        Initially we assume that all entities are strong and look for certain properties associated with
them (mainly primary and foreign key), so that they can be reclassified into any of the other four
types. Weak and composite entities are identified using relationship properties and generalised /
specialised entities are determined using generalisation hierarchies.

  5.4.1 Relationship Types

        (a) An M:N relationship

        If the primary key of an entity is formed from two foreign keys then their referenced entities
participate in an M:N relationship. This is a special case of n-ary relationship involving two
referenced entities (see section ‘a’ of figure 5.3). This entity becomes a composite entity having a
composite key. For example, entity Option with primary key (course,subject) participates in an M:N
relationship as the primary key attributes are foreign keys - see tables 6.2, 6.4 and 6.6 (later).

        In an n-ary relationship (e.g. 3-ary or ternary if the number of foreign keys is 3, see section ‘b’
of figure 5.3) the primary key of an entity is formed from a set of n foreign keys. As stated in section
5.1.4, n-ary relationships for n > 3 are usually decomposed into their constituent relationships of
order 2 to simplify their association. Hence we do not specifically describe this case. For example,
entity Teach with primary key (lecturer, course, subject) participates in a 3-ary relationship when
lecturer, course and subject are foreign keys referencing entities Employee, Course and Subject,
respectively. However, as Option is made up using Course and Subject entities we could decompose
this 3-ary relationship into a binary relationship by defining course and subject of Teach to be a
foreign key referencing entity Option - see tables 6.2, 6.4 and 6.6.
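
        To make these two patterns concrete, the Option and Teach examples correspond roughly to the
following definitions. This is a sketch in SQL-2 style: the attribute types, and the key attributes of
Course, Subject and Employee, are assumed here, as the full definitions appear only in tables 6.2 to
6.6.

        CREATE TABLE Option (
            course   char(8) NOT NULL REFERENCES Course (course),
            subject  char(8) NOT NULL REFERENCES Subject (subject),
            PRIMARY KEY (course, subject) );   -- PK = FK + FK, so Course and Subject are in an M:N relationship

        CREATE TABLE Teach (
            lecturer char(8) NOT NULL REFERENCES Employee (empNo),
            course   char(8) NOT NULL,
            subject  char(8) NOT NULL,
            PRIMARY KEY (lecturer, course, subject),                               -- three foreign keys: a 3-ary relationship
            FOREIGN KEY (course, subject) REFERENCES Option (course, subject) );   -- or a binary link to Option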




                   Relational Model pattern                                     ER Model concept

                   (a) PK = FK1 + FK2 (i.e. n = 2)                              M:N (binary) relationship
                   (b) PK = FK1 + FK2 + .. + FKn (n > 2)                        n-ary relationship (e.g. ternary / 3-ary)
                   (c) FK attribute is part of the PK and the other part
                       does not contain a key of any other relation            1:N relationship (weak)
                   (d) A non-key FK attribute that is non-unique               1:N relationship
                   (e) A non-key FK attribute that is unique                   1:1 relationship

                                       PK - Primary Key     FK - Foreign Key
                   (in the graphical notation each pattern links the referencing entity to its referenced entity)

                              Figure 5.3: Mapping foreign key references to an ER relationship type


        Sometimes a foreign key refers to the same entity, forming a unary relationship, as in the
case where some courses may have pre-requisites. In this case the attribute pre-requisites of entity
Course is a foreign key referencing the same entity.

       (b) A 1:N relationship

        There are two types of 1:N relationships. One is formed with a weak entity and the other with
a strong entity.

        If part of the primary key of an entity is a foreign key and the other part does not contain a
key of any other relation, then the entity concerned is a weak entity and will participate in a weak
1:N relationship (see section ‘c’ of figure 5.3) with its referenced entity. For example, entity
Committee with primary key (name, faculty) is a weak entity as only a part of its primary key
attributes (i.e. faculty) is a foreign key.

        A non-key foreign key attribute (i.e. an attribute that is not part of a primary key) that may
have multiple values will participate in a strong 1:N relationship (see section ‘d’ of figure 5.3) if it
does not satisfy the uniqueness property. For example, attribute tutor of entity Student is a non-key,
non-unique foreign key referencing the entity Employee (cf. tables 6.2 to 6.4). Here tutor participates
in a 1:N relationship with Employee - see table 6.6.
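
        The distinction between cases ‘d’ and ‘e’ of figure 5.3 thus rests on whether the foreign key
attribute is unique. Where no unique index or constraint exists, a data-level check such as the
following sketch (using the Student.tutor example) can give supporting evidence, although, as noted in
section 5.1.2, data values alone are not a reliable source:

        -- any tutor value occurring more than once rules out a 1:1 interpretation
        SELECT tutor, COUNT(*)
        FROM   Student
        WHERE  tutor IS NOT NULL
        GROUP  BY tutor
        HAVING COUNT(*) > 1;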

       (c) A 1:1 relationship

        A non-key foreign key attribute will participate in a 1:1 relationship if a uniqueness
constraint is defined for that attribute (see section ‘e’ of figure 5.3). For example, attribute head of
entity Department participates in a 1:1 relationship with entity Employee as it is a non-key foreign
key with the uniqueness property, referencing Employee - see tables 6.2 to 6.4 and 6.6.



        The specialised and generalised entity pair of a generalisation hierarchy has a 1:1 is-a
relationship. Hence it is possible to define a binary relationship in place of a generalisation
hierarchy. For example, it is possible to define a foreign key (empNo) on entity EmpStudent,
referencing entity Employee to form a 1:1 relationship instead of representing it as a generalisation
hierarchy. Such cases must be detected and corrected by the database designer. We introduce
inheritance constraints involving these entities to resolve such cases.

  5.4.2 Entity Types

       (a) A strong entity

        This is the default entity type, as any entity that cannot be classified as one of the other types
will be a strong (regular) entity.

       (b) A composite entity

        An entity that is used to represent an M:N relationship is referred to as a composite (link)
entity (cf. section 5.4.1 (a)). The identification of M:N relationships will result in the identification
of composite entities.



       (c) A weak entity

       An entity that participates in a weak 1:N relationship is referred to as a weak entity (cf.
section 5.4.1 (b)). The identification of weak 1:N relationships will result in the identification of
weak entities.

       (d) A generalised / specialised entity

        An entity defined to contain an inheritance structure (i.e. inheriting properties from others) is
a specialised entity. Entities whose properties are used for inheritance are generalised entities. The
identification of inheritance structures will result in the identification of specialised and generalised
entities. A single inheritance structure involves one supertype (e.g. entities X1 to Xj inherit
from entity A in figure 5.4). However, a set of inheritance structures may combine to form a multiple
inheritance structure (e.g. entity Xj inherits from entities A and B in figure 5.4). To determine the existence of
multiple inheritance structures we analyse all subtype entities of the database (e.g. entities X1 to Xn
in figure 5.4) and derive their supertypes (e.g. entity A or B or both in figure 5.4). For example,
entity EmpStudent inherits from Employee and Student entities forming a multiple inheritance, while
entity Employee inherits from Person to form a single inheritance.




                    [Figure: supertype entities A and B with subtype entities X1, .., Xi, .., Xj, .., Xn beneath them.
                     Entities X1, .., Xi, .., Xj inherit from entity A and entities Xj, .., Xn inherit from entity B.]

                          Figure 5.4: Single and multiple inheritance structures using EER notations


5.5 Examining and Identifying Information

       Our forward-engineering process allows the designer to specify new information. To
successfully perform this task the designer needs to be able to examine the current contents of the
database and identify possible missing information from it.

       5.5.1 Examining the contents of a database

        At this stage the user needs to be able to browse through all features of the database. Overall,
this includes viewing existing primary keys, foreign keys, uniqueness constraints and other
constraint types defined for the database. When inheritance is involved the user may need to
investigate the participating entities at each level of inheritance. For more specific viewing, the user
may want to investigate the behaviour of individual entities. This includes identifying constraints
associated with a particular entity (i.e. intra-object properties) and those involving other entities (i.e.
inter-object properties). Our system provides for this via its graphical interface. We describe viewing
of these properties in section 7.5.1, as it is directly associated with this interface. Here, global
information is tabulated and presented for each constraint type, while specific information (i.e. inter-
and intra- object) presents constraints associated with an individual entity.

       5.5.2 Identifying possible missing, hidden and redundant information

       This process allows the designer to search for specific types of information, including
information about the type of entities that do not contain primary keys, possible attributes for such
keys, buried foreign key definitions and buried inheritance structures. In this section we describe
how we identify this type of information.

       i) Possible primary key constraints

        Entities that do not contain primary keys are identified by comparing the list of entities
having primary keys with the list of all entities of the database. When such entities are identified the
user can view the attributes of these and decide on a possible primary key constraint. Sometimes, an
entity may have several attributes and hence the user may find it difficult to decide on suitable
primary key attributes. In such a situation the user may need to examine existing properties of that
entity (cf. section 5.5.1) to identify attributes with uniqueness properties and not null values.
Sometimes, attribute names such as those ending with ‘no’ or ‘#’ may give a clue in selecting the
appropriate attributes. Once the primary key attributes have been decided the user may want to
verify this choice against the data of the database (cf. section 5.8).
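
        For example, a chosen combination of attributes can be screened against the data with queries
of the following form before it is accepted as a primary key; the Employee / empNo names are purely
illustrative:

        -- the candidate attribute(s) must not contain nulls ...
        SELECT COUNT(*) FROM Employee WHERE empNo IS NULL;

        -- ... and must not contain duplicate values
        SELECT empNo, COUNT(*)
        FROM   Employee
        GROUP  BY empNo
        HAVING COUNT(*) > 1;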

       ii) Possible foreign key constraints

        Existence of either an inclusion dependency between a non-key attribute of one table and a
key attribute of another (e.g. deptno of Employee and deptno of Department), or a weak or n-ary
relationship between a key attribute and part of a key attribute (e.g. cno of strong entity Course and
cno of link entity Teach) implies the possible existence of a foreign key definition. Such possibilities
are detected by matching attribute names satisfying the required condition. Otherwise, the user needs
to inspect the attributes and detect their possible occurrence (e.g. if attribute name worksfor instead
of deptno was used in Employee).
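
        A suspected inclusion dependency can then be confirmed or refuted against the data by looking
for referencing values with no referenced counterpart, along the lines of the following sketch (the
deptno example above is assumed):

        -- values of Employee.deptno that do not appear in Department.deptno indicate that
        -- a foreign key definition would not currently hold for the existing data
        SELECT DISTINCT e.deptno
        FROM   Employee e
        WHERE  e.deptno IS NOT NULL
        AND    e.deptno NOT IN (SELECT d.deptno FROM Department d);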

       iii) Possible uniqueness constraints

        Detection of a unique index gives a clue to a possible uniqueness constraint. All other
indications of this type of constraint have to be identified by the user.

       iv) Possible inheritance structures

        Existence of an inclusion dependency between two strong entities having the same key
implies a subtype / supertype relationship between the two entities. Such possible relationships are
detected by matching identical key attribute names of strong entities (e.g. empno of Person and
empno of Employee). Otherwise, the user needs to inspect the table and 1:1 relationships to detect
these structures (e.g. if personid instead of empno was used in Person then the link between empno
and personid would have to be identified by the user).

        In distributed database design some entities are partitioned using either horizontal or vertical
fragmentation. In this situation strong entities having the same key will exist with a possible
inclusion dependency between vertically fragmented tables. Such cases need to be identified by the
designer to avoid incorrect classifications occurring. For example, employee records can be
horizontally fragmented and distributed over each department as opposed to storing at one site (e.g.
College). Also, employee records in a department may be vertically fragmented at the College site as
the college is interested in a subset of information recorded for a department.

       v) Possible unnormalised structures

        All entities of a relational model are at least in 1NF, as this model does not allow multivalued
attributes. When entities are not in 3NF (i.e. a non-key attribute is dependent on part of a key or
another non-key attribute: violating 2NF or 3NF, respectively), there are hidden functional
dependencies. These entities need to be identified and transformed into 3NF to show their
dependencies. New entities in the form of views are used to construct this transformation. For
example, entity Teach can be defined to contain attributes lecturer, course, subNo, subName and
room. Here we see that subName is fully dependent on subNo, which is only part of the key, and
hence Teach is not in 2NF. Using a view we separate subName from Teach and use it as a separate
entity with primary key subNo. This allows us to transform the original Teach to 3NF and to view
Subject and Teach as a binary, instead of a unary, relationship. This will assist in improving the
readability of the conceptual model.
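
        Such a decomposition can be expressed with view definitions of roughly the following form. This
is a sketch only: the stored Teach table itself is left unchanged, and the name Teach3NF is introduced
here purely for illustration.

        CREATE VIEW Subject (subNo, subName) AS
            SELECT DISTINCT subNo, subName FROM Teach;

        CREATE VIEW Teach3NF (lecturer, course, subNo, room) AS
            SELECT lecturer, course, subNo, room FROM Teach;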

       vi) Possible redundant constraints

        Redundant inclusion dependencies representing projection or transitivity must be removed,
otherwise incorrect entity or relationship types may be represented. For instance, if there is an
inclusion dependency between entities A and B, and between B and C, then the transitive inclusion
dependency between A and C is redundant. Such relationships should be detected and removed. For example,
EmpStudent is an Employee and Employee is a Person, thus EmpStudent is a Person is a redundant
relationship. Redundant constraints are often most obvious when viewing the graphical display of a
conceptual model with its inter- and intra- object constraints.

5.6 Specifying New Information

        We can specify new information using constraints. In a modern DBMS which supports
constraints we can use its query language to specify these. However this approach will fail for legacy
databases as they do not normally support the specification of constraints. To deal with both cases
we have designed our system to externally accept constraints of any type, but represent them
internally by adopting the appropriate approach depending on the capabilities of the DBMS in use.
Thus if constraint specification is supported by the DBMS in use we will issue a DDL statement (cf.
figure 5.5 which is based on SQL-3 syntax) to create the constraint. If constraint specification is not
supported by the DBMS in use we will store the constraint in the database using techniques
described in section 5.7. These constraints are not enforced by the system but they may be used to
verify the extent to which the data conforms with the constraints (cf. section 5.8). In both cases this
enhanced knowledge is used by our conceptual model wherever it is applicable. The following sub-
sections describe the specification process for each constraint type. We cover all types of constraints
that may not be supported by a legacy system, including primary key. We use the SQL syntax to
introduce them. In SQL, constraints are specified as column/table constraint definitions and can
optionally contain a constraint name definition and constraint attributes (see sections A.3 and A.4)
which are not included here.

       i) Specifying Primary Key Constraints

        Only one primary key is allowed for an entity. Hence our system will not allow any input that
may violate this status. Once an entity is specified the primary key attributes are chosen. Modern
SQL DBMSs will use the DDL statement ‘a’ of figure 5.5 to create a new primary key constraint;
older systems do not have this capability in their syntax.

       ii) Foreign Key Constraints

       A foreign key establishes a relationship between two entities. Hence, when the enhancing
constraint type is chosen as a foreign key, our system requests two entity names. The first is the
referencing entity and the second the referenced entity. Once the entity names are identified the
system automatically shows the referenced attributes. These attributes are those having the
uniqueness property. When these attributes are chosen a new foreign key is established. This
constraint will only be valid if there is an inclusion dependency between the referencing and
referenced attributes. Modern SQL DBMSs will use the DDL statement ‘b’ of figure 5.5 to create a
new foreign key constraint in this situation. This statement can optionally contain a match type and
referential triggered action (see section A.8) which are not shown here.
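
        For instance, using the university college schema of figure 5.1, the tutor relationship identified
in section 5.4.1 could be declared through statement ‘b’ as follows; the constraint name is arbitrary
and we assume empNo is the primary key of Employee:

        ALTER TABLE Student ADD CONSTRAINT student_tutor_fk
              FOREIGN KEY (tutor)
              REFERENCES Employee (empNo)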

       iii) Uniqueness Constraints

        A uniqueness constraint may be defined on any combination of attributes. However such
constraints should be meaningful (e.g. there is no point in defining a uniqueness constraint for a set
of attributes when a subset of it already holds the uniqueness status), and should not violate any
existing data. Modern SQL DBMSs will use the DDL statement ‘c’ of figure 5.5 to create a new
uniqueness constraint.

                          (a)      ALTER TABLE Entity_Name ADD CONSTRAINT Constraint_Name
                                         PRIMARY KEY (Primary_Key_Attributes)
                          (b)      ALTER TABLE Entity_Name ADD CONSTRAINT Constraint_Name
                                         FOREIGN KEY (Foreign_Key_Attributes)
                                         REFERENCES Referenced_Entity_Name (Referenced_Attributes)
                          (c)      ALTER TABLE Entity_Name ADD CONSTRAINT Constraint_Name
                                         UNIQUE (Uniqueness_Attributes)
                          (d)      ALTER TABLE Entity_Name ADD CONSTRAINT Constraint_Name
                                         CHECK (Check_Constraint_Expression)
                          (e)      ALTER TABLE Entity_Name ADD
                                         UNDER Inherited_Entities [ WITH (Renamed_Attributes) ]
                          (f)      ALTER TABLE Entity_Name ADD CONSTRAINT Constraint_Name
                                         FOREIGN KEY (Foreign_Key_Attributes)
                                         [ CARDINALITY (Referencing_Cardinality_Value) ]
                                          REFERENCES Referenced_Entity_Name (Referenced_Attributes)

                          Our optional extensions to the SQL-3 syntax are highlighted using bold letters here.

                                Figure 5.5 : Constraints expressed in extended SQL-3 syntax


       iv) Check Constraints

        A check constraint may be defined to represent a complex expression involving any
combination of attributes and system functions. However such constraints should not be redundant
(i.e. not a subset of an existing check constraint) and should not violate any existing data. Modern
SQL DBMSs will use the DDL statement ‘d’ of figure 5.5 to create a new check constraint.

          v) Generalisation / Specialisation Structures

        An inheritance hierarchy may be defined without performing any structural changes if its
existence can be detected by our process described in part ‘d’ of section 5.4.2. In this case we need
to specify the entities being inherited (cf. statement ‘e’ of figure 5.5). If an inherited attribute’s name
differs from the target attribute name it is necessary to rename them. For example, attributes
siteName, unitName and inCharge of Office are renamed to building, name and principal when it is
inherited by College - see figures 6.2 and 6.3 (later).

        It is also possible to make some structural changes in order to introduce new generalisation /
specialisation structures. In such situations new entities are created to represent the
specialisation/generalisation. Appropriate data for these entities are copied to them during this
process. For instance, in our university college example of figure 5.1, the entities College, Faculty and
Department can be restructured to represent a generalisation hierarchy, by introducing a generalised
entity called Office and transforming the entities College, Faculty and Department to College-Office,
Faculty-Office and Dept-Office, respectively (cf. figure 5.2). Once this transformation is done the
entities Office, College-Office, Faculty-Office and Dept-Office will represent a generalisation hierarchy
as shown in figure 5.2. Any change to existing structures and table names will affect the application
programs which use them. To overcome this we introduce view tables in the legacy database to
represent new structures. These tables are defined using the syntax of figure 5.6. For example, the
generalised entity will be Office and the specialised entities will be College-Office, Faculty-Office and
Dept-Office. The introduction of view tables means that legacy application code using the original
tables will not be affected by the change. However, appropriate changes must be introduced in the
target application code and database if we are going to introduce these features permanently. We
introduced the concept of defining view tables in the legacy database to assist the gateway service in
managing these structural changes.

                              CREATE VIEW GeneralisedEntity (GeneralisedAttributes) AS
                                  SELECT Attributes FROM SpecialisedEntity
                                  [ [UNION SELECT Attributes FROM SpecialisedEntity] ..]
                              CREATE VIEW SpecialisedEntity (SpecialisedAttributes) AS
                                  SELECT g1.Attributes [ [, g2.Attributes] ..]
                                  FROM GeneralisedEntity g1 [ [, GeneralisedEntity g2] ..]
                                  [ WHERE specialised-conditions ]
                             Figure 5.6 : Creation of view table to represent a hierarchy
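
        Applied to the Office hierarchy of figure 5.2, the first template of figure 5.6 would give a
generalised view along the following lines; the attribute correspondences follow the renamings shown
in figure 5.2, and the specialised views (College-Office, Faculty-Office and Dept-Office) are defined
analogously using the second template:

        CREATE VIEW Office (code, siteName, unitName, address, phone, inCharge) AS
            SELECT code,     building, name, address, phone, principal FROM College
            UNION
            SELECT code,     building, name, address, phone, secretary FROM Faculty
            UNION
            SELECT deptCode, building, name, address, phone, head      FROM Department;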


          vi) Cardinality Constraints

        Cardinality constraints specify the minimum / maximum number of instances associated with
a relationship. In a 1:1 relationship type the number of instances associated with a relationship will
be 0 or 1, and in an M:N relationship type it can take any value from 0 upwards. The ability to define
more specific limits allows users not only to gain a better understanding about a relationship, but
also to be able to verify its conformance by using its instances. We suggest creating such
specifications using an extended syntax of the current SQL foreign key definition (cf. statement ‘f’
of figure 5.5) as this is the key which initially establishes this relationship. The minimum / maximum
instance occurrences for a particular relationship of a referential value (i.e. cardinality values) can be
specified using a keyword CARDINALITY as shown in figure 5.5. Here the
Referencing_Cardinality_Value corresponds to the many side of a relationship. Hence the value of
this indicates the minimum instances. When the referencing attribute is not null then the minimum
cardinal value is 1, else it is 0. In our examples introduced in part ‘b’ of section 6.2.3, we have used
‘+’ to represent the many symbol (e.g. 0+ represents zero or many) and ‘-’ to represent up to (e.g. -1
represents 0 to 1).
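
        As an illustration (the entity, attribute and cardinality values are taken loosely from figure 5.2
and are not definitive), the worksFor relationship between Employee and Office could be declared with
its cardinality as follows, using statement ‘f’ of figure 5.5:

        ALTER TABLE Employee ADD CONSTRAINT employee_worksfor_fk
              FOREIGN KEY (worksFor)
              CARDINALITY (4+)
              REFERENCES Office (code)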

          vii) Other Constraints

        In the example in figure 5.2, we have also shown an aggregation relationship between the
entities University and Office. Here we have assumed that a reference to a set of instances can be
defined. In such a situation, as with the other constraint types, an appropriate SQL statement should
be used to describe the constraint and an appropriate augmented table such as those used in figure
5.7 must be used to record this information in the database itself. We discuss this case here for the
purpose of highlighting that other constraint types can be introduced and incorporated into our
system using the same general approach. However our implementations have concentrated only on
the constraints discussed above.

        The enhanced constraints, once they are absorbed into the system, will be stored internally in
the same way as any existing constraints. Hence the reconstruction process to produce an enhanced
conceptual model can utilise this information directly as it is fully automated. To hold the
enhancements in the database itself we need to issue appropriate query statements. The
enhancements can be effected using the SQL statements shown in figure 5.5 if the database is SQL
based and such changes are implicitly supported by it. In section 5.7 we describe how this is done
when the database supports such specifications (e.g. Oracle version 7) and when it does not (e.g.
INGRES version 6). When the DBMS does not support SQL, the query statement to be issued is
translated using QMTS [HOW87] to a form appropriate to the target DBMS. As there are variants of
SQL16 we send all queries via QMTS so that the necessary query statements will automatically get
translated to the target language before entering the target DBMS environment.



5.7 The Knowledge Augmentation Approach

        In this section we describe how the enhanced constraints are retained in a database. Our aim
has been to make these enhancements compatible with the newer versions of commercial DBMSs,
so that the migration process is facilitated as fully as possible.

        Many types of constraint are defined in a conceptual model during database design. These
include relationship, generalisation, existence condition, identity and dependency constraints. In
most current DBMSs these constraints are not represented as part of the database meta-data.
Therefore, to represent and enforce such constraints in these systems, one needs to adopt a
procedural approach which makes use of some embedded programming language code to perform
the task. Our system uses a declarative approach (cf. section 3.6) for constraint manipulation, as it is
easier to process constraints in this form than when they are represented in the traditional form of
procedural code.




   16
        The date functions of most SQL databases (e.g. INGRES and Oracle) are different.


                  CREATE TABLE Table_Constraints (
                      Constraint_Id       char(32) NOT NULL,
                      Constraint_Name     char(32) NOT NULL,
                      Table_Id            char(32) NOT NULL,
                      Table_Name          char(32) NOT NULL,
                      Constraint_Type     char(32) NOT NULL,
                      Is_Deferrable       char(3)  NOT NULL,
                      Initially_Deferred  char(3)  NOT NULL );

                  CREATE TABLE Key_Column_Usage (
                      Constraint_Id       char(32) NOT NULL,
                      Constraint_Name     char(32) NOT NULL,
                      Table_Id            char(32) NOT NULL,
                      Table_Name          char(32) NOT NULL,
                      Column_Name         char(32) NOT NULL,
                      Ordinal_Position    integer(2) );

                  CREATE TABLE Referential_Constraints (
                      Constraint_Id           char(32) NOT NULL,
                      Constraint_Name         char(32) NOT NULL,
                      Unique_Constraint_Id    char(32) NOT NULL,
                      Unique_Constraint_Name  char(32) NOT NULL,
                      Match_Option            char(32) NOT NULL,
                      Update_Rule             char(32) NOT NULL,
                      Delete_Rule             char(32) NOT NULL );

                  CREATE TABLE Check_Constraints (
                      Constraint_Id       char(32) NOT NULL,
                      Constraint_Name     char(32) NOT NULL,
                      Check_Clause        char(240) NOT NULL );

                  CREATE TABLE Sub_tables (
                      Table_Id            char(32)   NOT NULL,
                      Sub_Table_Name      char(32)   NOT NULL,
                      Super_Table_Name    char(32)   NOT NULL,
                      Super_Table_Column  integer(4) NOT NULL );

                  CREATE TABLE Altered_Sub_Table_Columns (
                      Table_Id            char(32) NOT NULL,
                      Sub_Table_Name      char(32) NOT NULL,
                      Sub_Table_Column    char(32) NOT NULL,
                      Super_Table_Name    char(32) NOT NULL,
                      Super_Table_Column  char(32) NOT NULL );

                  CREATE TABLE Cardinality_Constraints (
                      Constraint_Id        char(32) NOT NULL,
                      Constraint_Name      char(32) NOT NULL,
                      Referencing_Cardinal char(32) );

                                      Figure 5.7: Knowledge-based table descriptions


        The constraint enhancement module of our system (CCVES) accepts new constraints (cf.
figure 5.5) irrespective of whether they are supported by the selected DBMS. These new constraints
are the enhanced knowledge which is stored in the current database, using a set of user defined
knowledge-based tables, each of which represents a particular type of constraint. These tables
provide general structures for all constraint types of interest. In figure 5.7 we introduce our table
structures which are used to hold constraint-based information in a database. We have followed the
current SQL-3 approach to representing constraint types supported by the standards. In areas which
the current standards have yet to address (e.g. representation of cardinality constraints) we have
proposed our own table structures. Thus, all general constraints associated with a table (i.e. an entity)
are recorded in Table_Constraints. The constraint description for each type is recorded elsewhere in
other tables, namely, Key_Column_Usage for attribute identifications, Referential_Constraints for
foreign key definitions, Check_Constraints to hold constraint expressions, Sub_Tables to hold
generalisation / specialisation structures (i.e. inherited tables), Altered_Sub_Table_Columns to hold
any attributes renamed during inheritance, and Cardinality_Constraints to hold cardinal values
associated with relationship structures.
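
        For example, when the DBMS itself cannot record a foreign key, an enhancement such as the
Student-Employee tutor relationship might be retained as rows of these tables roughly as follows; the
identifier and constraint name values are illustrative only (cf. figure 5.9):

        INSERT INTO Table_Constraints
            VALUES ('facultyDb', 'student_tutor_fk', 'studentId', 'Student', 'FOREIGN KEY', 'NO', 'NO');
        INSERT INTO Key_Column_Usage
            VALUES ('facultyDb', 'student_tutor_fk', 'studentId', 'Student', 'tutor', 1);
        INSERT INTO Referential_Constraints
            VALUES ('facultyDb', 'student_tutor_fk', 'employeeId', 'employee_empno_uni',
                    'NONE', 'NO ACTION', 'NO ACTION');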

        The use of these table structures to represent constraint-based information in a database
depends on the type of DBMS in use and the features it supports. The features supported by a DBMS
may differ from the standards to which it claims to conform, as database vendors do not always
follow the standards fully when they develop their systems. However, DBMSs supporting the
representation of constraints need not have identical table structures to our approach as they may
have used an alternative way of dealing with constraints. In such situations it is not necessary to
insist on the use of our table structures for constraint representation, as the database is capable of
managing them itself if we follow its approach. Therefore we need to identify which SQL standards
have been used and in which DBMSs we should introduce our own tables to hold enhanced
constraints. In figure 5.8 we identify the tables required for the three SQL standards and for three
selected DBMSs. The selected DBMSs were used as our test DBMSs, as we shall see in section 6.1.

        The CCVES determines for the DBMS being used the required knowledge-based tables and
creates and maintains them automatically. The creation of these tables and the input of data to them
are ideally done at the database application implementation stage, by extracting data from the
conceptual model used originally to design a database. However, as current tools do not offer this
type of facility, one may have to externally define and manage these tables in order to hold this
knowledge in a database. Our system has been primed with the knowledge of the tables required for
each DBMS it supports, and so it automatically creates these tables and stores information in them if
the database is enhanced with new constraints. Here, Table_Constraints, Referential_Constraints,
Key_Column_Usage, Check_Constraints and Sub_Tables are those used by SQL-3 to represent
constraint specifications. SQL-2 has the same tables, except for Sub_Tables. Hence, as shown in
figure 5.8, these tables are not required as augmented tables when a DBMS conforms to SQL-3 or
SQL-2 standards, respectively. Adopting the same names and structures as used in the SQL
standards makes our approach compatible with most database products. We have introduced two
more tables (namely: Cardinality_Constraints and Altered_Sub_Table_Columns) to enable us to
represent cardinality constraints and to record any synonyms involved in generalisation /
specialisation structures. The representation of this type of information is not yet addressed by the
SQL standards.

        CCVES utilises the above mentioned user defined knowledge-based tables not only to
automatically reproduce a conceptual model, but also to enhance existing databases by detecting and
cleaning inconsistent data. To determine the presence of these tables, CCVES looks for user defined
tables such as Table_Constraints, Referential_Constraints, etc., which can appear in known existing
legacy databases only if the DBMS maintains our proposed knowledge-base. For example, in
INGRES version 6 we know that such tables are not maintained as part of its system provision,
hence the presence of tables with these names in this context confirms the existence of our
knowledge-base. Use of our knowledge-based tables is database system specific as they are used
only to represent knowledge that is not supported by that DBMS's meta-data facility. Hence, the
components of two distinct knowledge-bases, e.g. for INGRES version 6 and Oracle version 7, are
different from each other (see figure 5.8).

                                 Table Name                      S1   S2   S3 I     O     P
                                                                 -    -    D V6    V7    V4

                                 Table_Constraints               Y    N    N   Y    N    Y
                                 Referential_Constraints         Y    N    N   Y    N    Y
                                 Key_Column_Usage                Y    N    N   Y    N    Y
                                 Check_Constraints               Y    N    N   N    N    N
                                 Sub_Tables                      Y    Y    N   Y    Y    N
                                 Altered_Sub_Table_Columns       Y    Y    Y   Y    Y    Y
                                 Cardinality_Constraints         Y    Y    Y   Y    Y    Y

                          S1 - SQL/86, S2 - SQL-2, S3 - SQL-3                  Y - Yes, required
                          I - INGRES, O - Oracle, P - POSTGRES                 N - No, not required
                          D - Draft, V - Version

                   Figure 5.8 : Requirement of augmented tables for SQL standards and some current DBMSs


        The different types of constraints are identified using the attribute Constraint_Type of
Table_Constraints, which must have one of the values PRIMARY KEY, UNIQUE, FOREIGN KEY or
CHECK. A set of example instances is given in figure 5.9 to show the types of information held in
our knowledge-based tables. The constraint type NOT NULL may also appear in Table_Constraints
when dealing with a DBMS that does not support NULL value specifications. We have not included
it in our sample data as it is supported by our test DBMSs and all the SQL standards. The constraint
and table identifications in our knowledge-based tables (i.e. Constraint_Id and Table_Id of figure
5.9) may be of composite type, as they need to identify not only the name, but also the schema,
catalog and location of the database.

        Foreign key constraints are associated with their referenced table through a unique constraint.
Hence, the ‘Const_Name_Key’ instance of attribute Unique_Constraint_Name of table
Referential_Constraints should also appear in Key_Column_Usage as a unique constraint. This
means that each of the knowledge-based tables has its own set of properties to ensure the accuracy
and consistency of the information retained in these tables. For instance, Constraint_Type of
Table_Constraints must be one of {'PRIMARY KEY', 'UNIQUE', 'FOREIGN KEY', 'CHECK'} if
these are the only types of constraint that are to be represented. Also, within a particular schema a
constraint name is unique; hence Constraint_Name of Table_Constraints must be unique for a
particular Constraint_Id. In figure 5.10 we present the set of constraints associated with our
knowledge-based tables. Besides these there are a few others which are associated with other system
tables, such as Tables and Columns which are used to represent all entity and attribute names
respectively. Such constraints are used in systems supporting the above constraint types. This allows
us to maintain consistency and accuracy within the constraint definitions.

                  Table_Constraints { Constraint_Id, Constraint_Name, Table_Id, Table_Name, Constraint_Type,
                                    Is_Deferrable, Initially_Deferred }
                      ('dbId', 'Const_Name_PK', 'TableId', 'Entity_Name_PK', 'PRIMARY KEY', 'NO', 'NO')
                      ('dbId', 'Const_Name_UNI', 'TableId', 'Entity_Name_UNI', 'UNIQUE', 'NO', 'NO')
                      ('dbId', 'Const_Name_FK', 'TableId', 'Entity_Name_FK', 'FOREIGN KEY', 'NO', 'NO')
                      ('dbId', 'Const_Name_CHK', 'TableId', 'Entity_Name_CHK', 'CHECK', 'NO', 'NO')

                  Key_Column_Usage { Constraint_Id, Constraint_Name, Table_Id, Table_Name, Column_Name,
                                    Ordinal_Position }
                     ('dbId', 'Const_Name_PK', 'TableId','Entity_Name_PK', 'Attribute_Name_PK', i)
                     ('dbId', 'Const_Name_UNI', 'TableId','Entity_Name_UNI', 'Attribute_Name_UNI', i)
                     ('dbId', 'Const_Name_FK', 'TableId','Entity_Name_FK', 'Attribute_Name_FK', i)

                  Referential_Constraints { Constraint_Id, Constraint_Name,Unique_Constraint_Id,
                                     Unique_Constraint_Name, Match_Option, Update_Rule, Delete_Rule }
                      ('dbId', 'Entity_Name_FK', 'TableId', 'Const_Name_Key', 'NONE', 'NO ACTION', 'NO ACTION')

                  Check_Constraints { Constraint_Id, Constraint_Name, Check_Clause }
                     ('dbId', 'Const_Name_CHK', 'Const_Expression')

                  Sub_Tables { Table_Id, Sub_Table_Name, Super_Table_Name }
                     ('dbId', 'Entity_Name_Sub', 'Entity_Name_Super')

                  Altered_Sub_Table_Columns { Table_Id, Sub_Table_Name, Sub_Table_Column, Super_Table_Name,
                                      Super_Table_Column }
                      ('dbId', 'Entity_Name_Sub', 'newAttribute_Name', 'Entity_Name_Super', 'oldAttribute_Name')

                  Cardinality_Constraints { Constraint_Id, Constraint_Name, Referencing_Cardinal }
                      ('dbId', 'Entity_Name_FK', 'Const_Value_Ref')

                                Figure 5.9 : Augmented tables with different instance occurrences


        Some attributes of these knowledge-based tables are used to indicate when to execute a
constraint and what action is to be taken. Such actions are either application dependent or have no
effect on the approach proposed here, and hence we have used the default values proposed in the standards.
However, it is possible to specify trigger actions like ON DELETE CASCADE so that when a value of
the referenced table is deleted the corresponding values in the referencing table will automatically
get deleted. These features were initially introduced in the form of rule based constraints to allow
triggers and alerters to be specified in databases and make them active [ESW76, STO88]. Such
actions may also have been implemented in legacy ISs as in the case of general constraints. The
constraints used in our constraint enforcement process (cf. section 5.8) are alerters as they draw
attention to constraints that do not conform to the existing legacy data.
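
        For example, such a referential action could be declared using standard SQL syntax of the
form shown below. The entity and attribute names are placeholders in the style of figure 5.9, and
the example is purely illustrative, since our augmented tables record the default NO ACTION rules:

                ALTER TABLE Referencing_Entity_Name
                   ADD CONSTRAINT Const_Name_FK FOREIGN KEY (Attribute_Name_FK)
                       REFERENCES Referenced_Entity_Name
                       ON DELETE CASCADE;   -- recorded via the Delete_Rule attribute of Referential_Constraints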

                 Table_Constraints
                   PRIMARY KEY (Constraint_Id, Constraint_Name)
                   CHECK (Constraint_Type IN ('UNIQUE','PRIMARY KEY','FOREIGN KEY','CHECK') )
                   CHECK ( (Is_Deferrable, Initially_Deferred) IN ( values ('NO','NO'), ('YES','NO'), ('YES','YES') ) )
                   CHECK ( UNIQUE ( SELECT Table_Id, Table_Name FROM Table_Constraints
                       WHERE Constraint_Type = 'PRIMARY KEY' ) )
                 Key_Column_Usage
                   PRIMARY KEY (Constraint_Id, Constraint_Name, Column_Name)
                   UNIQUE (Constraint_Id, Constraint_Name, Ordinal_Position)
                   CHECK ( (Constraint_Id, Constraint_Name) IN (SELECT Constraint_Id, Constraint_Name
                      FROM Table_Constraints WHERE Constraint_Type
                      IN ('UNIQUE', 'PRIMARY KEY','FOREIGN KEY' ) ) )
                 Referential_Constraints
                   PRIMARY KEY (Constraint_Id, Constraint_Name)
                   CHECK ( Match_Option IN ('NONE','PARTIAL','FULL') )
                   CHECK ( Update_Rule IN ('CASCADE','SET NULL','SET DEFAULT','RESTRICT','NO ACTION') )
                   CHECK ( (Constraint_Id, Constraint_Name) IN (SELECT Constraint_Id, Constraint_Name
                      FROM Table_Constraints WHERE Constraint_Type = 'FOREIGN KEY' ) )
                   CHECK ( (Unique_Constraint_Id, Unique_Constraint_Name) IN (
                      SELECT Constraint_Id, Constraint_Name FROM Table_Constraints
                      WHERE Constraint_Type IN ('UNIQUE','PRIMARY KEY') ) )
                 Check_Constraints
                   PRIMARY KEY (Constraint_Id, Constraint_Name)
                 Sub_Tables
                   PRIMARY KEY (Table_Id, Sub_Table_Name, Super_Table_Name)
                 Altered_Sub_Table_Columns
                    PRIMARY KEY (Table_Id, Sub_Table_Name, Super_Table_Name, Sub_Table_Column)
                   FOREIGN KEY (Table_Id, Sub_Table_Name, Super_Table_Name) REFERENCES Sub_Tables
                 Cardinality_Constraints
                   PRIMARY KEY (Constraint_Id, Constraint_Name)
                   FOREIGN KEY (Constraint_Id, Constraint_Name) REFERENCES Referential_Constraints

                                Figure 5.10: Consistency constraints of our knowledge-based tables


        Many other types of constraint are possible in theory [GRE93]. We shall not deal with all of
them as our work is concerned only with constraints applicable at the conceptual modelling stage.
These applicable constraints take the form of logical expressions and are stored in the database using
the knowledge-based table Check_Constraints. They are identified by the keyword 'CHECK' in
Table_Constraints in figure 5.9. Similarly, other constraint types (e.g. rules and procedures) are
represented by means of distinct keywords and tables. Figure 5.9 also includes generalisation and
cardinality constraints. A generalisation hierarchy is defined using the SQL-3 syntax (i.e. UNDER,
see figure 5.5), while a cardinality constraint is defined using an extended foreign key definition (see
figure 5.5). These specifications are also held in the database, using the tables Sub_Tables and
Altered_Sub_Table_Columns for the former and Cardinality_Constraints for the latter (see figure 5.9).

5.8 The Constraint Enforcement Process

         This is an optional process provided by our system, as the third stage of its application to a
database. The objective is to give users the facility to verify / ensure that the data conforms to all the
enhanced constraints. This process is optional so that the user can decide whether these constraints
should be enforced to improve the quality of the legacy database prior to its migration or whether it
is best left as it stands.



        During the constraint enforcement process any violations of the enhanced constraints are
identified. In some cases this may result in removing the violated constraint, as it may be an incorrect
constraint specification. However, the DBA may decide to keep such a constraint, as the violation
may be the result of incorrect data instances or of a change in a business rule that has
occurred during the lifetime of the database. Such a rule may be redefined with a temporal
component to reflect this change. Such data are manageable using versions of data entities as in
object-oriented DBMSs [KIM90].

        We use the enhanced constraint definitions to identify constraints that do not conform to the
existing legacy data. Here each constraint is used to produce a query statement. This query statement
depends on the type of constraint, as shown in figure 5.11. CCVES uses constraint definitions to
produce data manipulation language statements suitable for the host DBMS. Once such statements
are produced, CCVES will execute them against the current database to identify any violated data for
each of these constraints. When such violated data are found for an enhanced constraint it is up to
the user to take appropriate action. Enforcement of such constraints can prevent data rejection by the
target DBMS, possible losses of data and/or delays in the migration process, as the migrating data’s
quality will have been ensured by prior enforcement of the constraints. However, as the enforcement
process is optional, the user need not take immediate action: he can take his own time to determine
the exact reasons for each violation and act at his convenience prior to migration.

5.9 The Migration Process

        The migration process is the fourth and final stage in the application of our approach. This is
incrementally performed by initially creating the meta-data in the target DBMS, using the schema
meta-translation technique of Ramfos [RAM91], and then copying the legacy data to the target
system, using the import/export tools of source and target DBMSs. During this activity, legacy
applications must continue to function until they too are migrated. To support this process we need
to use an interface (i.e. a forward gateway) that can capture and process all database queries of the
legacy application and then re-direct those related to the target system via CCVES. The functionality
that is required here is a distributed query processing facility which is supported by current
distributed DBMSs. However, in our case the source and target databases are not necessarily of the
same type as in the case of distributed DBMSs, so we need to perform a query translation in
preparation for the target environment. Such a facility can be provided using the query meta-
translation technique of Howells [HOW87]. This approach will facilitate transparent migration for
legacy databases as it will allow the legacy IS users to continue working while the legacy data is
being migrated incrementally.
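
        As a simple illustration of the first increment of this process, the sketch below shows the
kind of DDL that would be issued against the target DBMS for one translated legacy entity before its
data is copied across. The entity and attribute names are placeholders in the style of figure 5.9,
and the statement is not the actual output of the schema meta-translator:

                -- Create the translated definition of one legacy entity in the target DBMS,
                -- assuming the target conforms to the SQL-2 constraint syntax.
                CREATE TABLE Entity_Name (
                   Attribute_Name_PK   char(9)   NOT NULL,
                   Attribute_Name_FK   char(5),
                   PRIMARY KEY (Attribute_Name_PK),
                   FOREIGN KEY (Attribute_Name_FK) REFERENCES Referenced_Entity_Name );

                -- The legacy data for Entity_Name is then copied across incrementally using the
                -- import/export tools of the source and target DBMSs.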




 Constraint     Queries to detect Constraint Violation Instances

 Primary Key   SELECT Attribute_Names, COUNT(*) FROM Entity_Name
                GROUP BY Attribute_Names HAVING COUNT(*) > 1
               UNION
               SELECT Attribute_Names, 1 FROM Entity_Name
                WHERE Attribute_Names IS NULL
 Unique        SELECT Attribute_Names, COUNT(*) FROM Entity_Name
                GROUP BY Attribute_Names HAVING COUNT(*) > 1
 Referential   SELECT * FROM Referencing_Entity_Name WHERE NOT
                (Referencing_Attributes IS NULL OR Referencing_Attributes IN
                (SELECT Referenced_Attributes FROM Referenced_Entity_Name))
 Check         SELECT * FROM Entity_Name
                WHERE NOT (Check_Constraint_Expression)
 Cardinality   SELECT Attribute_Names, COUNT(*) FROM Entity_Name
                GROUP BY Attribute_Names HAVING COUNT(*) < Min_Cardinality_Value
               UNION
               SELECT Attribute_Names, COUNT(*) FROM Entity_Name
                GROUP BY Attribute_Names HAVING COUNT(*) > Max_Cardinality_Value

                 Figure 5.11: Detection of violated constraints in SQL
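
        To make these templates concrete, consider a foreign key from the course attribute of a
Student entity to the courseNo key of a Course entity (anticipating the test databases of chapter 6).
The referential template of figure 5.11 then generates the following query, whose result set contains
exactly the Student rows that violate the constraint:

                SELECT * FROM Student
                 WHERE NOT (course IS NULL OR
                            course IN (SELECT courseNo FROM Course));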




CHAPTER 6

                        Test Databases and their Access Process

In this chapter we introduce our example databases, by describing their physical and logical
components. The selection criteria for these test databases, and the associated constraints in
accessing and using them are discussed here. We investigate the tools available for our test
DBMSs. We then apply our re-engineering process to our test databases to show its applicability.
Lastly, we refer to the organisation of system information in a relational DBMS and describe how
we identify and access information about entities, attributes, relationships and constraints in our
test DBMSs.

6.1 Introduction to our Test Databases

        In the following sub-sections we introduce our test databases. We first identify the main
requirements for these databases. This is followed by a description of associated constraints and
their role in database access and use. Finally, we identify how we established our test databases
and the DBMSs we have used for this purpose.

       6.1.1 Main Requirements

        The design of our test databases was based on two important requirements. Firstly, to
establish a suitable legacy test database environment to enable us to demonstrate the practicability
of our re-engineering and migration techniques. Secondly, to establish a heterogeneous database
environment for the test databases to enable us to test the generality of our approach.

        As described in section 2.1.2.1, the problems of legacy databases apply mostly to long
existing database systems. Most of these systems use traditional file-based methods or an old
version of the hierarchical, network or relational database models for their database management.
Due to the complexity and the availability of resources, which are discussed in section 6.1.2, we decided
to focus on a particular type of database model to apply our legacy database enhancement and
migration techniques. Test databases were developed for the chosen database model, while
establishing the required levels of heterogeneity in the context of that model.

       6.1.2 Availability and Choice of DBMSs

        At University of Wales College of Cardiff, where our research was conducted, there were
only a few application databases. These included systems used to process student and staff
information for personnel and payroll applications. This information was centrally processed and
managed using third party software. Due to the licence conditions on this software, the university
did not have the authority to modify or improve it on its own. Also, most of this software was
developed with 3GL technology using files to manipulate information. There were recent
enhancements which had been developed using 4GL tools. However, no proper DBMS had been
used to build any of these applications, although future plans included using Oracle for new
database applications. These databases were therefore not well suited to our work.

        Other than the personnel and payroll applications there were a few departmental and
specific project based applications. Some of these were based on DBMSs (such as Oracle),
although their application details were not readily available. Information gathered from these
sources revealed that not many database applications existed in our university environment and
gaining permission to access them for research purposes was practically impossible. Also, until
we obtained access and investigated each application we would not be able to fully justify its
usefulness as a test database, as it might not fulfil all our requirements. Therefore, it was decided
to initially design and develop our own database applications to suit our requirements and then if
possible to test our system on any other available real world databases.

       Access to DBMSs was restricted to products running on Personal Computers (PCs) and
some Unix systems. Most of these products were based on the relational data model and some on
the object-oriented data model. The older database models - hierarchical and network - were no
longer being used or available as DBMSs. Also, the available DBMSs were in their latest
versions, making the task of building a proper legacy database environment more difficult.

        The relational model has been in use for database applications for over 20 years and is
currently the most widely used data model. During this time many database products and
versions have been used to manage these database applications. As a result, many of them are now
legacy systems and their users need assistance to enhance and migrate them to modern
environments. Thus the choice of the relational data model for our tests is reasonable, although
one may argue that similar requirements exist for database applications which have been in use
prior to this data model gaining its pre-eminent position.

        Due to the superior power of workstations as compared to PCs, it was decided to work on
these Unix platforms and to build test databases using the available relational DBMSs, as our
main aim was simply to demonstrate the applicability of our approach. Two popular commercial
relational DBMSs, namely: INGRES and Oracle, were accessible via the local campus network.
We selected these two products to implement our test databases as they are leading,
commercially-established products which have been in existence since the early days of relational
databases. The differences between these two database products made them ideal for representing
heterogeneity within our test environment. Both products supported the standard database query
language, SQL. However, only one of them (Oracle) conforms to the current SQL-2 standard.
Oracle is also a leading relational database product, along with SYBASE and INFORMIX, on
Unix platforms [ROS94]. As described in section 3.8, SQL standards have been regularly
reviewed and hence it is also important to choose a database environment that will support at least
some of the modern concepts, such as object-oriented features. In recent database products these
features have been introduced either via extended relational or via object-oriented database
technology. Obviously the choice of an extended relational data model is the most suitable for our
purposes as it incorporates natural extensions to the relational data model. Hence we selected
POSTGRES, which is a research DBMS providing modern object-oriented features in an extended
relational model, as our third test DBMS.

       6.1.3 Choice of database applications and levels of heterogeneity


        Designing a single large database application as our test database would result in one very
complex database application. To overcome the need to devise and manage a single complex
application to demonstrate all of our tasks, we decided to build a series of simple applications and
later to provide a single integrated application derived from these simple database applications.

        Our own university environment was chosen to construct these test database systems as we
were able to perform a detailed system study in this context and collect sufficient information to
create appropriate test applications. Typical textbook examples [MCF91, ROB93, ELM94,
DAT95] were also used to verify the contents chosen for our test databases. Three databases
representing college, faculty and department information were chosen for our simple test
databases. To ensure simplicity, no more than ten entities were included for each of these
databases. However, each was carefully designed to enable us to thoroughly test our ideas, as well
as to represent three levels of heterogeneity within our test systems.

        These systems were implemented on different computer systems using different DBMSs
so that they represented heterogeneity at the physical level. INGRES, POSTGRES and Oracle
running on DEC station, SUN Sparc and DEC Alpha, respectively, were chosen. The differences
in characteristics among these three DBMSs introduced heterogeneity at the logical level. Here,
Oracle conforms to the current SQL/92 standard and supports most modern relational data model
requirements. INGRES and POSTGRES, although they are based on the same data model, have
some basic differences in handling certain database functions such as integrity constraints. These
two DBMSs use a rule subsystem to handle constraints, which is a different approach from that
proposed by the SQL standards. POSTGRES, an extended relational DBMS with many object-oriented
features, is also sometimes regarded as an object-oriented DBMS. These
inherent differences ensure the initial heterogeneity of our environment at the logical level. Our
test databases were designed to highlight these logical differences, as we shall see.

6.2 Migration Support Tools for our Test DBMSs

         Prior to creating and applying our approach it was useful to investigate the availability of
tools for our test DBMSs to assist the migration of databases. As indicated in the following sub-
sections, only a few tools are provided to assist this process and they have limited functionality
that is inadequate to assist all the stages of enhancing and migrating a legacy database service.

       6.2.1 INGRES

       INGRES permits manipulation of data in non-INGRES databases [RTI92] and the
development of applications that are portable across all INGRES servers. This type of data
manipulation is done through an INGRES gateway. INGRES/Open SQL, a subset of INGRES
SQL, is used for this purpose. The type of services provided by this gateway include [RTI92]:

    • Translation between Open SQL and non-INGRES SQL DBMS query interfaces such as
      Rdb/VMS (for DEC) or DB2 (for IBM).
    • Conversion between INGRES data types and non-INGRES data types.
    • Translation of non-INGRES DBMS error messages to INGRES generic error types.


        This functionality is useful in creating a target database service. However, as the target
databases supported by INGRES/Open SQL do not include Oracle and POSTGRES, this tool was
not helpful to us. The PRODBI interface for INGRES [LUC93], however, allows access to INGRES
databases from Prolog code. As our main processing is done using Prolog, this tool is useful in our
work, and we have used it to implement our constraint enforcement process.

       Meta-data access from INGRES databases could also have been performed using PRODBI.
However, as it was unavailable at the start of our project, we implemented this using C
programs with embedded SQL. INGRES does not provide any CASE tools that assist in
reverse-engineering or analysing INGRES applications. Its only support was in the form of a 4GL
environment [RTI90b] which is useful for INGRES application development, but not for any
INGRES based legacy ISs and their reverse engineering.

       6.2.2 Oracle

        The latest version of Oracle (i.e. version 7) is an RDBMS that conforms to the SQL-2
standard. Hence, this DBMS supports most modern database functions, including the
specification, representation and enforcement of integrity constraints. Oracle has provided
migration tools to convert databases from either of its two most recent versions (i.e. versions 5 or
6) to version 7.

        Oracle, a leading database product on the Unix platform [ROS94], has its own tool set to
assist in developing Oracle based application systems [KRO93]. This includes screen-based
application development tools SQL*Forms and SQL*Menu, and the report-writing product
SQL*Report. These tools assist in implementing Oracle applications but do not provide any form
of support to analyse the system being developed. To overcome this, a series of CASE products
are provided by Oracle (i.e. CASE*Bridge, CASE*Designer, CASE*Dictionary,
CASE*Generator, CASE*Method and CASE*Project) [BAR90]. The objective of these tools is to
assist users by supporting a structured approach to the design, analysis and implementation of an
Oracle application.

        CASE*Designer provides different views of the application using Entity Relationship
Diagrams, Function Hierarchies, Dataflow Diagrams and matrix handlers to show the inter-
relationship between different objects held in an Oracle dictionary. CASE*Dictionary maintains
complete definitions of the requirements and the detailed design of the application.
CASE*Generator uses these definitions to generate the code for the target environment, and
CASE*Bridge is used to extract information from other Oracle CASE tools or vice versa.
However, such functions can be performed only on applications developed using these tools and
not on an Oracle legacy database developed in any other way, which means they are no help with
the current legacy problem. Hence, Oracle CASE tools are useful when developing new
applications but cannot be used to re-engineer a pre-existing Oracle application, unless that
original application was developed in an Oracle CASE environment. This limitation is shared by
most CASE tools [COMP90, SHA93].

       Currently, Oracle and other vendors are working on overcoming this limitation, and
Oracle’s open systems architecture for heterogeneous data access [HOL93] is a step towards this.
ANSI standard embedded SQL [ANSI89b] is used for application portability along with a set of
function calls. In Oracle’s open systems architecture, standard call level interfaces are used to
dynamically link and run applications on different vendor engines without having to recompile
the application programs. This functionality is a subset of Microsoft’s ODBC [RIC94, GEI95] and
the aim is to provide a transparent gateway to access non-Oracle SQL database products (e.g.
IMS, DB2, SQL/DS and VSAM for IBM machines, or RMS and Rdb for DEC) via Oracle’s
SQL*Connect. Transparent gateway products are machine and DBMS dependent in that they need
to be recompiled or modified to run on different computers and support access to a variety of
DBMSs.

       In the past, developers had to create special code for each type of database their users
wanted to access. This limitation can now be overcome using a tool like ODBC to permit access
to multiple heterogeneous databases. Most database vendors have development strategies which
include plans to interoperate with open systems vendors as well as proprietary database vendors.
This facility is being implemented using the SQL Access Group's17 RDA (Remote Database
Access) standard. As a result, products such as Microsoft’s Open Database Connectivity (ODBC),
INFORMIX-Gateway [PUR93] and Oracle Transparent Gateway [HOL93] support some form of
connectivity between their own and other products.

       For our work with Oracle, we developed our own C programs with embedded SQL to access
and update our prototype Oracle database. A version of PRODBI for Oracle that allows access to
Oracle databases from Prolog code was also used in this project.

        6.2.3 POSTGRES

        POSTGRES was developed at the University of California at Berkeley as a research
oriented relational database extended with object-oriented features. Since 1994 a commercial
version called ILLUSTRA [JAE95] has been available. However, POSTGRES has yet to address
the inter-operability and other issues associated with our migration approach.

6.3 The Design of our Test University Databases

        6.3.1 Background

        In our university system, we assume that departments and faculties have common user
requirements and ideally could share a common database. Based on this assumption we have
developed our test database schema to contain shared information. Hence, our three simple test
databases, known as: College, Faculty and Department, can be easily integrated. A complete
integration of these three databases will result in the generation of a global University database
schema. However, in practice, schemas used by different departments and faculties may differ,

   17
       SQL Access Group (SAG) is a non-profit corporation open to vendors and users that
develops technical specifications to enable multiple SQL-based RDBMS’s and application tools to
interoperate. The specifications defined by the SAG consist of a combination of current and
evolving standards that include ANSI SQL, ISO RDA and X/Open SQL.


making the task of integration more difficult and bringing up more issues of heterogeneity. As our
work is concerned with legacy database issues in a heterogeneous environment and not with
integrating or resolving conflicts that arise in these environments, the differences that exist within
this type of environment were not considered. Hence, we shall be looking at each of these
databases independently. The main advantage of being able to easily integrate our test databases
was the ability, thereby, to readily generate a complex database schema which could also be used
to test our ideas.

         Each test database was designed to represent a specific kind of information, for example
the Faculty and Department databases represent all kinds of structural relationships (e.g. 1:1, 1:M,
and M:N; strong and weak relationships and entity types). The College database represents
specialisation / generalisation structures, while the University database acts as a global system
consisting of all the sub-database systems. This allows all sub-database systems, i.e. College,
Faculty and Department, to act as a distributed system - the University database system. This is
illustrated in figure 6.1 and is further described in section 6.3.2. We also need to be able to specify
and represent all the constraint types discussed in section 3.5, as our re-engineering techniques are
based on constraints. These were chosen to reflect actual database systems as closely as possible.
We introduce these constraints in section 6.4 after identifying the entities of each of our test
databases.




                                              College Database




                             FPS Database                             A Faculty Database




                COMMA Database      MATHS Database               Departmental Databases

                                      Figure 6.1: The UWCC Database
       6.3.2 The UWCC Database

        We shall use the term UWCC database to refer to our example university database, as the
data of our system is based on that used at University of Wales College of Cardiff (UWCC).

        The UWCC database consists of many distributed database sites, each used to perform the
functions either of a particular department or school, or of a faculty, or of the college. The
functions of the college are performed using the database located at the main college, which we
shall refer to as the College database. The College consists of five faculties, each having its own
local database located at the administrative section of the faculty. Our test database has been
populated for one faculty, namely: The Faculty of Physical Science (FPS), and we shall refer to
this database as the Faculty database. The College has 28 departments or schools, with five of
them belonging to FPS [UWC94a, UWC94b]. Our test databases were populated using data from
two departments of FPS, namely: The Department of Computing Mathematics (COMMA), which
is now called the Department of Computer Science, and The Department of Mathematics
(MATHS). These are referred to as Department databases.

       The component databases of our UWCC database form a hierarchy as shown in figure 6.1.
This will let us demonstrate how the global University database formed by integrating these
components incorporates all the functions present in the individual databases. In the next section
we identify our test databases by summarising their entities and specific features.

                  Entity                                   Database
                  Name              (Meaning)             College     Faculty   Department   University
                  University (university data)               x           -            -          x
                  Employee (university employees)            x           x            x          x
                  Student (university students)              x           -            x          x
                  EmpStudent (employees as students)         x           -            -          x
                  College (college data)                     x           -            -          x
                  Faculty (faculty data)                     x           x            -          x
                  Department (department data)               x           x            x          x
                  Committee (faculty committees)             -           x            -          x
                  ComMember (committee members)              -           x            -          x
                  Teach (subjects taught by staff)           -           -            x          x
                  Course (offered by the department)         -           -            x          x
                  Subject (subject details)                  -           -            x          x
                  Option (subjects for each course)          -           -            x          x
                  Take (subjects taken by each student)      -           -            x          x

                                            Table 6.1: Entities used in our test databases


       6.3.3 The Test Database schemas

        The fourteen entities shown in table 6.1 were represented in our test database schemas. As we
are not concerned with heterogeneity issues associated with schema integration, we have
simplified our local schemas by using the same attribute definitions in schemas having the same
entity name. The attribute definitions of all our entities are given in figure 6.2. Each test database
schema is defined using the data definition language (DDL) of the chosen DBMS, and is governed
by a set of rules to establish integrity within the database. In the context of a legacy system these
rules may not appear as part of the database schema. In this situation our approach is to supply
them externally via our constraint enhancement process. Therefore we present the set of
constraints defined on our test databases separately, so that the initial state of these databases
conforms to the database structure of a typical legacy system.

       6.3.4 Features of our Test Database schemas

        Among the specific features represented in our test databases are relationship types which
form weak and link entities, cardinality constraints which highlight the behaviour of entities, and
inheritance and aggregation which form specialised relationships among entities. These features
(if not present) are introduced to our test database schemas by enhancing them with new
constraints.



       a) Relationship types

       Our reverse-engineering process uses the knowledge of constraint definitions to construct
a conceptual model for a legacy database system. The foreign key definitions of table 6.4 along
with their associated primary (cf. table 6.2) and uniqueness constraints (cf. table 6.3) are used to
determine the relationship structures of a conceptual model. In this section we look at our foreign
key constraint definitions to identify the types of relationship formed in our test database schemas.
The check constraints of table 6.5 are used purely to restrict the domain values of our test
databases.

        The foreign keys of table 6.4 are processed to find relationships according to our approach
described in section 5.4.1. Here we identify keys defined on primary key attributes to determine
M:N and 1:N weak relationships. The remaining keys will form 1:N or 1:1 relationships
depending on the uniqueness property of the attributes of these keys. Table 6.6 shows all the
relationships found in our test databases. We have also identified the criteria used to determine
each relationship type according to section 5.4.1.




CREATE TABLE University (                      CREATE TABLE Department (
                Office          char(50)      NOT NULL );       DeptCode       char(5)        NOT NULL,
                                                                Building       char(20)       NOT NULL,
                CREATE TABLE Employee (                         Name           char(50)       NOT NULL,
                 Name          char(25)     NOT NULL,           Address        char(30),
                 Address       char(30)     NOT NULL,           Head           char(9),
                 BirthDate     date(7)      NOT NULL,           Phone          char(13),
                 Gender        char(1)      NOT NULL,           Faculty        char(5)        NOT NULL );
                 EmpNo         char(9)      NOT NULL,
                 Designation   char(30)     NOT NULL,          CREATE TABLE Committee (
                 WorksFor      char(5)      NOT NULL,           Name          char(15)        NOT NULL,
                 YearJoined    integer(2) NOT NULL,             Faculty       char(5)         NOT NULL,
                 Room          char(9),                         Chairperson   char(9) );
                 Phone         char(13),
                 Salary        decimal(8,2) );                 CREATE TABLE ComMember (
                                                                ComName       char(15)        NOT NULL,
                CREATE TABLE Student (                          MemName       char(9)         NOT NULL,
                 Name           char(20)      NOT NULL,         Faculty       char(5)         NOT NULL,
                 Address        char(30)      NOT NULL,         YearJoined    integer(2)      NOT NULL );
                 BirthDate      date(7)       NOT NULL,
                 Gender         char(1)       NOT NULL,        CREATE TABLE Teach (
                 CollegeNo      char(9)       NOT NULL,         Lecturer       char(9)        NOT NULL,
                 Course         char(5)       NOT NULL,         Course         char(5)        NOT NULL,
                 Department     char(5)       NOT NULL,         Subject        char(5)        NOT NULL,
                 Tutor          char(9),                        Room           char(9) );
                 RegYear        integer(2)    NOT NULL );
                                                               CREATE TABLE Course (
                CREATE TABLE EmpStudent (                       CourseNo      char(5)         NOT NULL,
                 CollegeNo     char(9)        NOT NULL,         Name          char(35)        NOT NULL,
                 EmpNo         char(9)        NOT NULL,         Coordinator   char(9),
                 Remark        char(10) );                      Offeredby     char(5)         NOT NULL,
                                                                Type          char(1)         NOT NULL,
                CREATE TABLE College (                          Length        char(10),
                 Code           char(5)       NOT NULL,         Options       integer(2) );
                 Building       char(20)      NOT NULL,
                 Name           char(40)      NOT NULL,        CREATE TABLE Subject (
                 Address        char(30),                       SubNo          char(5)        NOT NULL,
                 Principal      char(9),                        Name           char(40)       NOT NULL );
                 Phone          char(13) );
                                                               CREATE TABLE Option (
                CREATE TABLE Faculty (                          Course         char(5)        NOT NULL,
                 Code           char(5)       NOT NULL,         Subject        char(5)        NOT NULL,
                 Building       char(20)      NOT NULL,         Year           integer(2)     NOT NULL );
                 Name           char(40)      NOT NULL,
                 Address        char(30),                      CREATE TABLE Take (
                 Secretary      char(9),                        CollegeNo      char(9)        NOT NULL,
                 Phone          char(13),                       Subject        char(5)        NOT NULL,
                 Dean           char(9) );                      Year           integer(2)     NOT NULL,
                                                                Grade          char(1) );


                       Figure 6.2: Test database schema entities and their attribute descriptions


        We can see that the selected constraints cover four of the five relationship identification
categories of figure 5.3. The remaining category (i.e. ‘b’) is a special case of category ‘a’ which
could be represented in the entity Take by introducing two separate foreign keys to link entities
Course and Subject, instead of linking with the entity Option. However, as stated in section 5.4.1,
n-ary relationships are simplified whenever possible. Hence, in the test examples presented here
we do not show this type to reduce the complexity of our diagrams. In appendix C we present the
graphical view of all our test databases. The figures there show the graphical representation of all
the relationships identified in table 6.6.

       b) Inheritance

       We have introduced two inheritance structures, one representing a single inheritance and
the other a multiple inheritance (see figure 5.2 and table 6.7). To do so, two generalised entities,
namely: Office and Person, have been introduced (see figure 6.3). Entities College, Faculty and
Department now inherit from Office, while entities Employee and Student inherit from Person.
Entity EmpStudent has been modified to become a specialised combination of Student and
Employee. Figure 6.3 also contains all the constraints associated with these entities.

                           Constraint                                      Entity(s)

                           PRIMARY KEY (office)                            University
                           PRIMARY KEY (empNo)                             Employee
                           PRIMARY KEY (collegeNo)                         Student, EmpStudent
                           PRIMARY KEY (code)                              College, Faculty
                           PRIMARY KEY (deptCode)                          Department
                           PRIMARY KEY (name,faculty)                      Committee
                           PRIMARY KEY (comName,memName,faculty)           ComMember
                           PRIMARY KEY (lecturer,course,subject)            Teach
                           PRIMARY KEY (courseNo)                          Course
                           PRIMARY KEY (subNo)                             Subject
                           PRIMARY KEY (course,subject)                    Option
                           PRIMARY KEY (collegeNo,subject,year)            Take

                              Table 6.2: Primary Key constraints of our test databases


                                   Constraint                Entity(s)

                                   UNIQUE (empNo)            EmpStudent
                                   UNIQUE (name)             College, Department, Faculty
                                   UNIQUE (principal)        College
                                   UNIQUE (dean)             Faculty
                                   UNIQUE (head)             Department
                                   UNIQUE (name,offeredBy)   Course

                            Table 6.3: Uniqueness Key constraints of our test databases



       c) Cardinality constraints

       We have introduced some cardinality constraints on our test databases to show how these
can be specified for a legacy database. In table 6.8 we show those used in the College database.
Here the cardinality constraints for worksFor and faculty have been explicitly specified (see figure
6.3), while the others (inCharge, tutor and dean) have been derived using their relationship types.
For example inCharge and tutor are 1:N relationships while dean is a 1:1 relationship. Our
conceptual diagrams incorporate these constraint values (cf. appendix C and figure 5.2).

                      Constraint                                                       Entity(s)

                      FOREIGN KEY (course) REFERENCES Course                           Student, Option
                      FOREIGN KEY (department) REFERENCES Department                   Student
                      FOREIGN KEY (tutor) REFERENCES Employee                          Student
                      FOREIGN KEY (dean) REFERENCES Employee                           Faculty
                      FOREIGN KEY (faculty) REFERENCES Faculty                         Committee
                      FOREIGN KEY (chairPerson) REFERENCES Employee                    Committee
                      FOREIGN KEY (comName,faculty) REFERENCES Committee               ComMember
                      FOREIGN KEY (memName) REFERENCES Employee                        ComMember
                      FOREIGN KEY (lecturer) REFERENCES Employee                       Teach
                      FOREIGN KEY (course,subject) REFERENCES Option                   Teach
                      FOREIGN KEY (coordinator) REFERENCES Employee                    Course
                      FOREIGN KEY (offeredBy) REFERENCES Department                    Course
                      FOREIGN KEY (subject) REFERENCES Subject                         Option, Take
                      FOREIGN KEY (collegeNo) REFERENCES Student                       Take

                              Table 6.4: Foreign Key constraints of our test databases




Constraint                                                                  Entity(s)

                CHECK (yearJoined >= 21 + birthDate INTERVAL YEAR)                          Employee
                CHECK (salary BETWEEN 200 AND 3000 OR salary IS NULL)                       Employee
                CHECK (regYear >= 18 + birthDate INTERVAL YEAR)                             Student
                CHECK (phone IS NOT NULL)                                                   College, Department, Faculty
                CHECK (type IN ('U','P','E','O'))                                           Course
                CHECK (options >= 0 OR options IS NULL)                                     Course
                CHECK (year BETWEEN 1 AND 7)                                                Option

                                    Table 6.5: Check constraints of our test databases


       d) Aggregation

        A university has many offices (e.g. faculties, departments etc.) and an office belongs to a
university. Also, attribute office is the key of entity University. Hence, entities University and
Office participate in a 1:1 relationship. However, it is natural to represent this as a specialised
relationship by considering office of University to be of type set. Then University and Office
participate in an aggregation relationship which is a special form of a binary relationship. We
introduce this type to show how specialised constraints could be introduced into a legacy database
system. As shown in figure 6.3 we have used the keyword REF SET to specify this type of
relationship. In this case, as office is the key of University, a foreign key definition on office (see
figure 6.3) will treat University as a link entity and hence can be classified as a special
relationship.

                             Attribute(s)            Entity       Relationship            Entity(s)            Criteria

                  course                           Student           1     :N      Course                        (d)
                  department                       Student           1     :N      Department                    (d)
                  tutor                            Student           1     :N      Employee                      (d)
                  dean                             Faculty           1     :1      Employee                      (e)
                  faculty                          Committee         1     :N      Faculty                       (c)
                  chairPerson                      Committee         1     :N      Employee                      (d)
                  comName, faculty, memName        ComMember         M     :N      Committee, Employee           (a)
                  lecturer, course, subject        Teach             M     :N      Employee, Option              (a)
                  coordinator                      Course            1     :N      Employee                      (d)
                  offeredBy                        Course            1     :N      Department                    (d)
                  course, subject                  Option            M     :N      Course, Subject               (a)
                  collegeNo, subject               Take              M     :N      Student, Subject              (a)

                                        Table 6.6: Relationship types of our test databases


                                                 Entity           Inherited Entities
                                                 Employee         Person
                                                 Student          Person
                                                 EmpStudent       Student, Employee
                                                 College          Office
                                                 Faculty          Office
                                                 Department       Office

                                                   Table 6.7: Inherited Entities


                        Participating        Referencing      Referenced        Referencing      Referenced
                        Attribute            Entity           Entity            Cardinality      Cardinality
                        inCharge             Office           Employee              0+                  -1
                        worksFor             Employee         Office                4+                   1
                        tutor                Student          Employee              0+                  -1
                        dean                 Faculty          Employee             -1                   -1
                        faculty              Department       Faculty              2-12                  1

                                       Table 6.8: Cardinality constraints of College database




6.4 Constraints Specification, Enhancement and Enforcement

        In the context of legacy systems, our test database schemas (cf. figure 6.2) will not
explicitly contain most of the constraints introduced in tables 6.2 to 6.5, 6.7 and 6.8. Thus we need
to specify them using the approach described in section 5.6. In figure 6.3 we present these
constraints for the College database.

                CREATE TABLE Office
                   (code, siteName, unitName, address, inCharge, phone) AS
                   SELECT code, building, name, address, principal, phone FROM College
                   UNION SELECT code, building, name, address, secretary, phone FROM Faculty
                   UNION SELECT deptCode, building, name, address, head, phone FROM Department;
                ALTER TABLE Office
                   ADD CONSTRAINT Office_PK PRIMARY KEY (code)
                   ADD CONSTRAINT Office_Unique_name UNIQUE (siteName, unitName)
                   ADD CONSTRAINT Office_FK_Staff FOREIGN KEY (inCharge) REFERENCES Employee
                   ADD UNIQUE (phone);

                ALTER TABLE College ADD UNDER Office
                   WITH (siteName AS building, unitName AS name, inCharge AS principal);
                ALTER TABLE Faculty ADD UNDER Office
                   WITH (siteName AS building, unitName AS name, inCharge AS secretary);
                ALTER TABLE Department ADD UNDER Office
                   WITH (code AS deptCode, siteName AS building, unitName AS name, inCharge AS head)
                   ADD FOREIGN KEY (faculty) CARDINALITY (2-12) REFERENCES Faculty;

                CREATE VIEW College_Office AS SELECT * FROM Office
                   WHERE code in (SELECT code FROM College);
                CREATE VIEW Faculty_Office AS
                   SELECT o.code, o.siteName, o.unitName, o.address, o.inCharge, o.phone, f.dean
                   FROM Office o, Faculty f WHERE o.code = f.code;
                CREATE VIEW Dept_Office AS
                   SELECT o.code, o.siteName, o.unitName, o.address, o.inCharge, o.phone, d.faculty
                   FROM Office o, Department d WHERE o.code = d.deptCode;

                ALTER TABLE University
                   ALTER COLUMN office REF SET(Office) NOT NULL
                   | ADD FOREIGN KEY (office) REFERENCES Office ;

                CREATE TABLE Person AS
                   SELECT name, address, birthDate, gender FROM Employee
                   UNION SELECT name, address, birthDate, gender FROM Student;
                ALTER TABLE Person
                   ADD PRIMARY KEY (name, address, birthDate)
                   ADD CHECK (gender IN ('M', 'F'));

                ALTER TABLE Employee ADD UNDER Person
                   ADD CONSTRAINT Employee_FK_Office FOREIGN KEY (worksFor)
                        CARDINALITY (4+) REFERENCES Office;
                ALTER TABLE Student ADD UNDER Person;
                ALTER TABLE EmpStudent ADD UNDER Student, Employee
                   ADD CHECK (tutor <> empNo OR tutor IS NULL);

                    Figure 6.3 : Enhanced constraints of college database in extended SQL-3 syntax
        When the above constraints are not all supported by a legacy database management
system, we need to be able to store them in the database using our knowledge
augmentation techniques (cf. section 5.7). In figure 6.4 we present selected instances used in our
knowledge-based tables to represent the enhanced constraints for the College database. The
selected instances cover all the possible constraint types, so we have not reproduced all the
enhanced constraints of figure 6.3.



        Our constraint enforcement process (cf. section 5.8) allows users to verify the extent to
which the data in a database conforms to its enhanced constraints. The different types of queries
used for this process in the College database are given in figure 6.5.

                  Table_Constraints { Constraint_Id, Constraint_Name, Table_Id, Table_Name, Constraint_Type,
                                  Is_Deferrable, Initially_Deferred }
                     ('Uni_db', 'Office_PK', 'Col', 'Office', 'PRIMARY KEY', 'NO', 'NO')
                     ('Uni_db', 'Office_Unique_name', 'Col', 'Office', 'UNIQUE', 'NO', 'NO')
                     ('Uni_db', 'Office_FK_Staff', 'Col', 'Office', 'FOREIGN KEY', 'NO', 'NO')
                     ('Uni_db', 'Employee_PK', 'Col', 'Employee', 'PRIMARY KEY', 'NO', 'NO')
                      ('Uni_db', 'Employee_FK_Office', 'Col', 'Employee', 'FOREIGN KEY', 'NO', 'NO')
                     ('Uni_db', 'College_phone', 'Col', 'College', 'CHECK', 'NO', 'NO')
                 Referential_Constraint { Constraint_Id, Constraint_Name,Unique_Constraint_Id,
                                  Unique_Constraint_Name, Match_Option, Update_Rule, Delete_Rule }
                     ('Uni_db', 'Office_FK_Employee', 'Col', 'Employee_PK', 'NONE', 'NO ACTION', 'NO ACTION')
                     ('Uni_db', 'Employee_FK_Office', 'Uni', 'Office_PK', 'NONE', 'NO ACTION', 'NO ACTION')
                 Key_Column_Usage { Constraint_Id, Constraint_Name, Table_Id, Table_Name, Column_Name,
                                  Ordinal_Position }
                    ('Uni_db', 'Office_PK', 'Col', 'Office', 'Code', 1)
                    ('Uni_db', 'Office_Unique_name', 'Col', 'Office', 'siteName', 1)
                    ('Uni_db', 'Office_Unique_name', 'Col', 'Office', 'unitName', 2)
                    ('Uni_db', 'Office_FK_Staff', 'Col', 'Office', 'inCharge', 1)
                    ('Uni_db', 'Employee_PK', 'Col', 'Employee', 'empNo', 1)
                    ('Uni_db', 'Employee_FK_Office', 'Col', 'Employee', 'worksFor', 1)
                 Check_Constraint { Constraint_Id, Constraint_Name, Check_Clause }
                    ('Uni_db', 'College_phone', 'phone is NOT NULL')
                 Sub_Tables { Table_Id, Sub_Table_Name, Super_Table_Name }
                    ('Uni_db', 'College', 'Office')
                 Altered_Sub_Table_Columns { Table_Id, Sub_Table_Name, Sub_Table_Column, Super_Table_Name,
                                   Super_Table_Column }
                     ('Uni_db', 'College', 'building', 'Office', 'siteName')
                     ('Uni_db', 'College', 'name', 'Office', 'unitName')
                     ('Uni_db', 'College', 'principal', 'Office', 'inCharge')
                 Cardinality_Constraint { Constraint_Id, Constraint_Name, Referencing_Cardinal }
                     ('Uni_db', 'Office_FK_Employee', '0+')
                     ('Uni_db', 'Employee_FK_Office', '4+')

                       Figure 6.4 : Augmented tables with selected sample data for the college database


6.5 Database Access Process

        Having described the application of our re-engineering processes using our test databases,
we identify the tools developed and used to access those databases. The database access process is
the initial stage of our application. This process extracts meta-data from legacy databases and
represents it internally so that it can be used by other stages of our application.

       During re-engineering we need to access a database at three different stages: to extract
meta-data and any existing constraint knowledge specifications to commence our reverse-
engineering process; to add enhanced knowledge to the database; and to verify the extent to which
the data conforms to the existing and enhanced constraints. We also need to access the database
during the migration process. In all these cases, the information we require is held in either system
or user-defined tables. Extraction of information from these tables can be done using the query
language of the database, thus what we need for this stage is a mechanism that will allow us to
issue queries and capture their responses.
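
        For illustration, the kind of mechanism we have in mind can be pictured as a single Prolog
predicate that sends a query string to a connected database and returns the matching rows as a list
of Prolog terms. The predicate name, its arguments and the example data below are assumptions of
this sketch rather than the actual PRODBI or CCVES interface, which is described in the remainder
of this section.

                 % Sketch of the assumed query interface (predicate and data are
                 % illustrative only, not the actual PRODBI or CCVES interface):
                 %
                 %    db_query(+Connection, +QueryString, -Rows)
                 %
                 % issues QueryString against the database identified by Connection and
                 % returns the answer as a list of rows, each row a list of column values.

                 % A stub clause standing in for a live connection (data invented):
                 db_query(uni_db,
                          'SELECT empNo, worksFor FROM Employee',
                          [['E01', 'CS'], ['E02', 'CS']]).

                 % Example use: count the rows returned by a query.
                 row_count(Connection, Query, N) :-
                     db_query(Connection, Query, Rows),
                     length(Rows, N).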




                    Constraint Type    Constraint Violation Instances

                    Primary Key       SELECT code, COUNT(*) FROM Office
                                       GROUP BY code HAVING COUNT(*) > 1
                                      UNION SELECT code, 1 FROM Office WHERE code IS NULL
                    Unique            SELECT dean, COUNT(*) FROM Faculty
                                       GROUP BY dean HAVING COUNT(*) > 1
                    Referential       SELECT * FROM Office WHERE NOT (inCharge IS NULL OR
                                       inCharge IN (SELECT empNo FROM Employee))
                    Check             SELECT * FROM College WHERE NOT (phone IS NOT NULL)
                    Cardinality       SELECT worksFor, COUNT(*) FROM Employee
                                       GROUP BY worksFor HAVING COUNT(*) < 4

                   Figure 6.5: Selected constraints to be enforced for the college database in SQL


        As our system implementation is in Prolog, the necessary query statements are generated
from Prolog rules. The PRODBI interface allows access to several relational DBMSs, namely:
Oracle, INGRES, INFORMIX and SYBASE [LUC93], from Prolog as if their relational tables were
part of the Prolog environment. The availability of the INGRES PRODBI interface enabled us to use
it to communicate with our INGRES test databases in the latter stages of our project. Its performance
is comparable to that of INGRES/SQL, so database interaction is fully transparent to the user. Such
Prolog database interface tools are currently commercially available only for relational database
products, so we could not use this approach to access our POSTGRES test databases. Tools such as
ODBC allow access to heterogeneous databases; this option would have been ideal for our
application, but was not pursued because it was unavailable within our development time scale.
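
        As a concrete illustration of this query generation, the sketch below shows how the primary
key violation query of figure 6.5 could be composed from the name of a table and its (single) key
column. The predicate name is illustrative and atomic_list_concat/2 is an SWI-Prolog built-in used
here for brevity; the rules actually used by CCVES are more general.

                 % Sketch: compose the primary key violation query of figure 6.5
                 % (assumes a single-column key; predicate name is illustrative).
                 pk_violation_query(Table, KeyCol, Query) :-
                     atomic_list_concat(['SELECT ', KeyCol, ', COUNT(*) FROM ', Table,
                                         ' GROUP BY ', KeyCol, ' HAVING COUNT(*) > 1',
                                         ' UNION SELECT ', KeyCol, ', 1 FROM ', Table,
                                         ' WHERE ', KeyCol, ' IS NULL'], Query).

                 % ?- pk_violation_query('Office', code, Q).
                 % binds Q to the first query shown in figure 6.5.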

        As far as our work is concerned, we needed a facility to issue specific types of query and
obtain the response in such a way that Prolog could process responses without having to download
the entire database. The PRODBI interfaces for relational databases perform this task efficiently,
and also have many other useful data manipulation features. Due to the absence of any PRODBI
equivalent tools to access non-relational or extended-relational DBMSs, we decided to develop
our own version for POSTGRES. Here the functionality of our POSTGRES tool is to accept a
POSTGRES DML statement (i.e. a POSTQUEL query statement) and produce the results for that query
in a form that is usable by our (Prolog-based) system.
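
        A minimal sketch of such a wrapper is given below, assuming that the query is handed to a
command-line client and that its output is read back line by line. The client name (pg_client), its
options and the SWI-Prolog library predicates used here are assumptions made for illustration; the
actual CCVES tool need not work this way.

                 :- use_module(library(process)).
                 :- use_module(library(readutil)).

                 % Sketch only: send one POSTQUEL statement to a hypothetical
                 % command-line client for database Db and collect its output lines.
                 postgres_query(Db, Statement, Lines) :-
                     process_create(path(pg_client), ['-d', Db, '-c', Statement],
                                    [stdout(pipe(Out))]),
                     read_result_lines(Out, Lines),
                     close(Out).

                 read_result_lines(Stream, Lines) :-
                     read_line_to_string(Stream, Line),
                     (   Line == end_of_file
                     ->  Lines = []
                     ;   Lines = [Line|Rest],
                         read_result_lines(Stream, Rest)
                     ).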

        For Oracle, a PRODBI interface is available commercially, and to use it with our system
the only change we would have to make is to load the Oracle library. No other commands in our
code would change, since the Oracle interface supports the same Prolog rules as the INGRES one.
However, at Cardiff only the PRODBI interface for INGRES was available, and even that only
became available in the latter stages of our project. We therefore developed our own tool to provide
this functionality for INGRES and Oracle databases. The implementation of this tool was not fully
generalised, given that such tools were already available commercially. When developing it we were
not too concerned about performance degradation, as our aim was to test functionality, not
performance. In the case of INGRES we subsequently confirmed performance by using a
commercially developed PRODBI tool with an equivalent SQL query facility.

       6.5.1 Connecting to a database

       To establish a connection with a database the user needs to specify the site name (i.e. the
location of the database), the DBMS name (e.g. Oracle v7) and the database name (e.g.
department) to ensure a unique identification of a database located over a network. The site name
is the address of the host machine (e.g. thor.cf.ac.uk) and is used to gain access to that machine
via the network. The type of the named DBMS identifies the kind of data to be accessed, and the
name18 of the database tells us which database is to be used in the extraction process.

        In our system (CCVES), we provide a pop-up window (cf. left part of figure 6.6) to select
and specify these requirements. Here, a set of commonly used site names and the DBMSs
currently supported at a site are embedded in the menu to make this task easy. The specification of
new site and database names can also be done via this pop-up menu (cf. right part of figure 6.6).
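
        Once these three items have been chosen they form the environmental data of the session and
are recorded internally as the ccves_data facts introduced later in section 7.2.2. For the example
values mentioned above, the recorded facts might look as follows (a sketch; the values are taken
from the text and the exact capture code belongs to the connection dialogue):

                 % Environmental data recorded for the example connection above
                 % (format as in figure 7.3; values taken from the text).
                 ccves_data(dbname, department).
                 ccves_data(dbms,   'Oracle v7').
                 ccves_data(host,   'thor.cf.ac.uk').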

         6.5.2 Meta-data extraction

        Once a physical connection to a database is achieved it is possible to commence the meta-
data extraction process. This process is DBMS dependent as the kind of meta-data represented in a
database and the methods of retrieving it vary between DBMSs. The information to be extracted
is recorded in the system catalogues (i.e. data dictionaries) of respective databases. The most basic
type of information is entity and attribute names, which are common to all DBMSs. However,
information about different types of constraints is specific to DBMSs and may not be present in
legacy database system catalogues.




                         Figure 6.6: Database connection process of CCVES

        The organisation of meta-data in databases differs with DBMSs, although all relational
database systems use some table structure to represent this information. For example, the table
structure for Oracle user tables is straightforward, as they are kept separate from the system
tables, whereas in INGRES it is more complex because all tables are held in a single catalogue,
with attribute values used to differentiate user-defined tables from system and view tables. Hence the extraction query
statements to retrieve entity names of a database schema differ for each system, as shown in table
6.9. These query statements indicate that the meta-data extraction process is done using the query
language of that DBMS (e.g. SQL for Oracle and POSTQUEL for POSTGRES) and that the
query table names and conditions vary with the type of the DBMS. This clearly demonstrates the
DBMS dependency of the extraction process. Once the meta-data is obtained from system
catalogues we can process it to produce the database schema in the DDL formalism of the source
database and to represent this in our internal representation (see section 7.2). The extraction
process for entity names (cf. table 6.9) covers only one type of information. A similar process is
used to extract all the other types of information, including our enhanced knowledge-based tables.
Here, the main difference is in the queries used to extract meta-data and any processing required
to map the extracted information into our internal structures, which are introduced in section 7.2
(see also appendix D).

   18
        For simplicity, identification details like the owner of the database are not included here.
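
        The DBMS dependency of this step can be pictured as a small dispatch table in Prolog, with
one clause per supported DBMS returning the appropriate catalogue query of table 6.9. This is only
a sketch with an assumed predicate name; the real extraction code must also post-process the
answers into the internal structures of section 7.2.

                 % Sketch: entity-name extraction query per DBMS (cf. table 6.9;
                 % the predicate name is illustrative).
                 entity_name_query(oracle_v7,
                     'SELECT table_name FROM user_tables').
                 entity_name_query(ingres_v6,
                     'SELECT table_name FROM iitables WHERE table_type=''T'' AND system_use=''U''').
                 entity_name_query(postgres_v4,
                     'RETRIEVE pg_class.relname WHERE pg_class.relowner!=''6''').
                 entity_name_query(sql3,
                     'SELECT table_name FROM tables WHERE table_type=''BASE TABLE''').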

       6.5.3 Meta-data storage

        The generated internal structures are stored in text files for further use as input data for our
system. These text files are stored locally using distinct directories for each database. The system
uses the database connection specifications to construct a unique directory name for each
database (e.g. department-Oracle7-thor.cf.ac.uk). We have given public access to these files so
that the stored data and knowledge is not only reusable locally, but also usable from other sites.
This directory structure concept provides a logically coherent database environment for users. It
means that any future re-engineering processes may be done without physically connecting to the
database (i.e. by selecting a database logically from one of the public directories instead).
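
        The directory name itself is simply a combination of the three connection items. A one-line
sketch, using an SWI-Prolog built-in for illustration, is:

                 % Sketch: build the per-database directory name used above,
                 % e.g. department-Oracle7-thor.cf.ac.uk
                 db_directory(DbName, Dbms, Host, Dir) :-
                     atomic_list_concat([DbName, Dbms, Host], '-', Dir).

                 % ?- db_directory(department, 'Oracle7', 'thor.cf.ac.uk', D).
                 % D = 'department-Oracle7-thor.cf.ac.uk'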

                  DBMS           Query

                  Oracle V7      SELECT table_name FROM user_tables;
                  INGRES V6      SELECT table_name FROM iitables WHERE table_type='T' and system_use='U';
                  POSTGRES V4    RETRIEVE pg_class.relname WHERE pg_class.relowner!='6';
                  SQL-3          SELECT table_name FROM tables WHERE table_type='BASE TABLE';

                         Table 6.9: Query statements to extract entity names of a database schema


       The process of connecting to a database and accessing its meta-data usually does not take
much time (e.g. at most 2 minutes). However, trying to access an active database whenever a user
wants to view its structure slows down the regular activities of this database. Also, local working
is more cost effective than regularly performing remote accesses. This alternative also guarantees
access to the database service as it is not affected by network traffic and breakdowns. We
experienced such breakdowns during our system development, especially when accessing
INGRES databases. A database schema can be considered to be static, whereas its instances are
not. Hence, the decision to simulate a logical database environment after the first physical remote
database access is justifiable because it allows us to work on meta-data held locally.



       6.5.4 Schema viewing

        As meta-data is stored in text files at the end of each database access session, it is possible to skip
the stages described in sections 6.5.1 to 6.5.3 when viewing a database schema which has been
accessed recently. During a database connection session, our system will only extract and store the
meta-data of a database. Once the database connection process is completed the user needs to
invoke a schema viewing session. Here, the user is prompted with a list of the current logically
connected databases, as shown on the left of figure 6.7. When a database is selected from this list,
its name descriptions (i.e. database name and associated schema names) are placed in the main
window of CCVES (cf. right of figure 6.7). The user selects schemas to view them. Our reverse-
engineering process is applied at this point. Here, meta-data extracted from the database schema is
processed further to derive the necessary constructs to produce the conceptual model as an E-R or
OMT diagram.

       CCVES allows multiple selections of the same database schema (i.e. by selecting the same
schema from the main window; cf. right of figure 6.7). As a result, multiple schema visualisation
windows can be produced for the same database. The advantage of this is that it allows a user to
simultaneously view and operate on different sections of the same schema, which otherwise would
not be visible simultaneously due to the size of the overall schema (i.e. we would have to scroll
the window to make other parts of the schema visible). Also, the facility to visualise schemas
using a user preferred display model means that the user could now view the same schema
simultaneously using different display models.




                 Figure 6.7: Database selection and selected databases of CCVES
        To produce a graphical view of a schema, we apply our reverse-engineering process. This
process uses the meta-data which we extracted and represented internally. In chapter 7 we
introduce the representation of our internal structures and describe the internal and external
architecture and operation of our system, CCVES.




CHAPTER 7

                         Architecture and Operation of CCVES

In this chapter the Conceptualised Constraint Visualisation and Enhancement System (CCVES) is
defined by describing its internal architecture and operation - i.e. the way in which different legacy
database schemas are processed within CCVES in the course of enhancing and migrating them to a
target DBMS’s schema - and its external architecture and operation - i.e. CCVES as seen and operated
by its users. Finally, we look at the possible migrations that can be performed using CCVES.

7.1 Internal Architecture of CCVES

       In previous chapters, we discussed overall information flow (section 2.2), our re-
engineering process (section 5.2) and the database access process (section 6.5). Here we describe
how the meta-data accessed from a database is stored and manipulated by CCVES in order to
successfully perform its various tasks.

        There are two sources of input information available to CCVES (cf. figure 7.1): initially,
by accessing a legacy database service via the database connectivity (DBC) process, and later by
using the database enhancement (DBE) process. This information is converted into our internal
representation (see section 7.2) and held in this form for use by other modules of CCVES. For
example, the Schema Meta-Visualisation System (SMVS) uses it to display a conceptual model of
a legacy database, the Query Meta-Translation System (QMTS) uses it to construct queries that
verify the extent to which the data conforms to existing and enhanced constraints, and the Schema
Meta-Translation system (SMTS) uses it to generate and create target databases for migration.

7.2 Internal Representation

        To address heterogeneity issues, meta-representation and translation techniques have been
successfully used in several recent research projects at Cardiff [HOW87, RAM91, QUT93,
IDR94]. A key to this approach is the transformation of the source meta-data or query into a
common internal representation which is then separately transformed into a chosen target
representation. Thus components of a schema, referred to as meta-data, are classified as entity
(class) and attribute (property) on input, and are stored in a database language independent fashion
in the internal representation. This meta-data is then processed to derive the appropriate schema
information of a particular DBMS. In this way it is possible to use a single representation and yet
deal with issues related to most types of DBMSs. A similar approach is used for query
transformation between source and target representations.
                 [Figure 7.1 is a block diagram: the DBC and DBE processes feed the common
                  Internal Representation, which is in turn used by the SMVS, QMTS and SMTS modules.]

                                  Figure 7.1: Internal Architecture of CCVES


         The meta-data we deal with has been classified into two types. The first category
represents essential meta-data and the other represents derived meta-data. Information that
describes an entity and its attributes, and constraints that identify relationships/hierarchies among
entities are the essential meta-data (see section 7.2.1), as they can be used to build a conceptual
model. Information that is derived for use in the conceptual model from the essential meta-data
constitutes the other type of meta-data. When performing our reverse-engineering process we look
only at the essential meta-data. This information is extracted from the respective databases during
the initial database access process (i.e. DBC in figure 7.1).

       7.2.1 Essential Meta-data

        Our essential (basic) meta-data internal representation captures sufficient information to
allow us to reproduce a database schema using the DDL syntax of any DBMS. This representation
covers entity and view definitions and their associated constraints. The following 5 Prolog style
constructs were chosen to represent this basic meta-data, see figure 7.2. The first two constructs,
namely: class and class property, are fundamental to any database schema as they describe the
schema entities and their attributes, respectively. The third construct represents constraints
associated with entities. This information is only partially represented by most DBMSs. The next
two constructs are relevant only to some recent object-oriented DBMSs and are not supported by
most DBMSs. We have included them mainly to demonstrate how modern abstraction
mechanisms such as inheritance hierarchies could be incorporated into legacy DBMSs. By a
similar approach, it is possible to add any other appropriate essential meta-data constructs. For
conceptual modelling, and for the type of testing we perform for the chosen DBMSs, namely:
Oracle, INGRES and POSTGRES, we found that the 5 constructs described here are sufficient.
However, some additional environmental data (see section 7.2.2) which allows identification of
the name and the type of the current database is also essential.




                     1. class(SchemaId, CLASS_NAME).
                    2. class_property(SchemaId, CLASS_NAME, PROPERTY_NAME, PROPERTY_TYPE).
                    3. constraint(SchemaId, CLASS_NAME, PROPERTY_list, CONST_TYPE, CONST_NAME,
                                      CONST_EXPR).
                    4. class_inherit(SchemaId, CLASS_NAME, SUPER_list).
                    5. renamed_attr(SchemaId, SUPER_NAME, SUPER_PROP_NAME, CLASS_NAME,
                                      PROPERTY_NAME).


                          Figure 7.2: Our Essential Meta-data Representation Constructs


        We now provide a detailed description of our meta-representation constructs. This
representation is based on the concepts of the Object Abstract Conceptual Schema (OACS)
[RAM91] used by Ramfos in his SMTS and other meta-processing systems. Hence we have used the same
name to refer to our own internal representation. Ramfos’s OACS internal representation provides
a natural abstraction of a particular structure based on the notion of objects. For example, when an
object is described, its attributes, constraints and other related properties are treated as a single
construct although only part of it may be used at a time. Our OACS representation directly
resembles the internal representation structure of most relational DBMSs (e.g. class represents an
entity, class_property represents the list of attributes of an entity). This is the main difference
between the two representations. However it is possible to map the OACS constructs of Ramfos to
our internal representation and vice-versa, hence our decision to use a variation of the original
OACS does not affect the meta-representation and processing principles in general.

• Meta-data Representation of class

    The names of all the entities for a particular schema are recorded using class. This
    information is processed to identify all the entities of a database schema.



• Meta-data Representation of class_property

    The names of all attributes and their data types for a particular schema are recorded using
    class_property. This information is processed to identify all attributes of an entity.

• Meta-data Representation of constraint

    All types of constraints associated with entities are recorded using constraint. This
    information has been organised to represent constraints as logical expressions, along with an
    identification name and participating attributes. Different types of constraint, i.e. primary key,
    foreign key, unique, not null, check constraints, etc., are each processed and stored in this
    form. Usually a certain amount of preprocessing is required for the construction of our
    generalised representation of a constraint. For example, some check constraints extracted
    from the INGRES DBMS need to be preprocessed to allow them to be classified as check
    constraints by our system.

• Meta-data Representation of class_inherit



    Entities that participate in inheritance hierarchies are recorded using class_inherit. The names
    of all super-entities for a particular entity are recorded here. This information is processed to
    identify all sub-entities of an entity and the inheritance hierarchies of a database schema.

• Meta-data Representation of renamed_attr

    During an inheritance process, some attribute names may be changed to give more meaningful
    names for the inherited attributes. Once the inherited names are changed it becomes
    impossible to automatically reverse-engineer these entities, as their attribute names no longer
    match. To overcome this problem we have introduced an additional meta-data representation
    construct, namely: renamed_attr, which keeps track of all attributes whose names have
    changed due to inheritance. This is a representation of synonyms for attribute names of an
    inheritance hierarchy.
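
        To make these constructs concrete, the sketch below shows how a small fragment of the
College database of chapter 6 (the Office entity, its primary key, and the College sub-table with its
renamed attributes, cf. figures 6.3 and 6.4) might be recorded. The attribute types, and the exact
encodings of the constraint type and constraint expression, are assumptions made for illustration.

                 % Illustrative instances of the five constructs for part of the
                 % College database (types and constraint encodings are assumed).
                 class('Uni_db', 'Office').
                 class('Uni_db', 'College').

                 class_property('Uni_db', 'Office', code,     char(4)).
                 class_property('Uni_db', 'Office', siteName, char(20)).
                 class_property('Uni_db', 'Office', unitName, char(20)).
                 class_property('Uni_db', 'Office', inCharge, char(6)).

                 constraint('Uni_db', 'Office', [code], primary_key, 'Office_PK',
                            'PRIMARY KEY (code)').

                 class_inherit('Uni_db', 'College', ['Office']).

                 renamed_attr('Uni_db', 'Office', siteName, 'College', building).
                 renamed_attr('Uni_db', 'Office', unitName, 'College', name).
                 renamed_attr('Uni_db', 'Office', inCharge, 'College', principal).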

       7.2.2 Environmental Data

       This is recorded using ccves_data, which is used to represent three types of information,
namely: the database name, the DBMS name and the name of the host machine (see figure 7.3).
These are captured at the database connection stage.

       7.2.3 Processed Meta-data

        The essential meta-data described in section 7.2.1 is processed to derive additional
information required for conceptual modelling. This additional information is schema_data,
class_data and relationship. Here, schema_data (cf. figure 7.4, section 1) identifies all entities
(all_classes, using class of figure 7.2 section 1) and entity types (link_classes and weak_classes,
by the process described in section 5.4, using constraint types such as primary and foreign key
which are recorded in constraint of figure 7.2 section 3). Class_data (cf. figure 7.4 section 2)
identifies all class properties (property_list, using class_property of figure 7.2 section 2), inherited
properties (using class_property, class_inherit and renamed_attr of figure 7.2 sections 2, 4 and 5,
respectively), sub- and super- classes (subclass_list and superclass_list, using class_inherit of
figure 7.2) and referencing and referenced classes (ref and refed, using foreign key constraints
recorded in constraint of figure 7.2). Relationship (cf. figure 7.4 section 3) records the relationship
types (derived using the process described in section 5.4) and cardinality information (using
derived relationship types and available cardinality values).

                                    ccves_data(dbname, DATABASE_NAME).
                                    ccves_data(dbms, DBMS_NAME).
                                    ccves_data(host, HOST_MACHINE_NAME).


                             Figure 7.3: OACS Constructs used as environmental data




                       1. schema_data(SchemaId, [
                                 all_classes(ALL_CLASS_list),
                                 link_classes(LINK_CLASS_list),
                                 weak_classes(WEAK_CLASS_list) ]).
                       2. class_data(SchemaId, CLASS_NAME, [
                                 property_list(OWN_PROPERTY_list, INHERIT_PROPERTY_list),
                                 subclass_list(SUBCLASS_list),
                                 superclass_list(SUPERCLASS_list),
                                 ref(REFERENCING_CLASS_list),
                                 refed(REFERENCED_CLASS_list) ]).
                       3. relationship(SchemaId, REFERENCING_CLASS_NAME, RELATIONSHIP_TYPE,
                                 CARDINALITY, REFERENCED_CLASS_NAME).

                                         Figure 7.4: Derived OACS Constructs
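
        To illustrate how the derived constructs can be computed from the essential meta-data, the
sketch below derives the list of all classes and the sub-class list of a given class. The predicate
names and the use of findall/3 are illustrative of the approach rather than the actual CCVES code.

                 :- use_module(library(lists)).    % for member/2

                 % Sketch: deriving parts of schema_data and class_data (figure 7.4)
                 % from the essential meta-data of figure 7.2.
                 all_classes(SchemaId, Classes) :-
                     findall(C, class(SchemaId, C), Classes).

                 subclass_list(SchemaId, Class, Subs) :-
                     findall(Sub,
                             ( class_inherit(SchemaId, Sub, Supers),
                               member(Class, Supers) ),
                             Subs).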


       7.2.4 Graphical Constructs

       Besides the above OACS representations it is necessary to support additional constructs to
produce a graphical display of a conceptual model. For this we produce graphical constructs using
our derived OACS constructs (cf. figure 7.4) and apply a display layout algorithm (see section
7.3). We call these graphical object abstract conceptual schema (GOACS) constructs, as they are
graphical extensions of our OACS constructs.

        The graphical display represents entities, their attributes (optional), relationships, etc.,
using graphical symbols which consist of strings, lines and widgets (basic toolkit objects which,
unlike strings and lines, retain their data after being written to the screen [NYE93]). To produce
this display, coordinates of the positions of all entities,
relationships etc., are derived and recorded in our graphical constructs. The coordinates of each
entity are recorded using class_info as shown in section 1 of figure 7.5. This information identifies
the top left coordinates of an entity.

                  1. class_info(SchemaId, CLASS_NAME, [ x(X0), y(Y0) ] ).

                  2. box(SchemaId, X0, Y0, W, H, REGULAR_CLASS_NAME).
                     box_box(SchemaId, X0, Y0, W, H, Gap, WEAK_CLASS_NAME).
                     diamond_box(SchemaId, X0, Y0, W, H, LINK_CLASS_NAME).

                  3. ref_info(Schema_Id, REFERENCING_CLASS_NAME,
                           REFERENCING_CLASS_CONNECTING_SIDE,
                           REFERENCING_CLASS_CONNECTING_SIDE_COORDINATE,
                           REFERENCED_CLASS_NAME, REFERENCED_CLASS_CONNECTING_SIDE,
                           REFERENCED_CLASS_CONNECTING_SIDE_COORDINATE).

                  4. line(SchemaId, X1, Y1, X2, Y2).
                     string(SchemaId, X0, Y0, STRING_NAME).
                     diamond(SchemaId, X0, Y0, W, H, ASSOCIATION_NAME).

                  5. property_line(SchemaId, CLASS_NAME, X1, Y1, X2, Y2).
                     property_string(SchemaId, CLASS_NAME, PROPERTY_NAME, DISPLAY_COLOUR, X0, Y0).


                                        Figure 7.5: Graphical Constructs (GOACS)


        The graphical symbol for an entity depends on the entity type. Thus, further processing is
required to graphically categorise entity types. For the EER model, we categorise entities: regular
as box, weak as box_box and link as diamond_box (cf. section 2 of figure 7.5, and figure 7.6). We
use an intermediate representation construct, namely: ref_info, to assist in the derivation of
appropriate coordinates for all associations (cf. section 3 of figure 7.5). With the assistance of
ref_info, coordinates to represent relationships are derived and recorded using line, string and
diamond (cf. section 4 of figure 7.5, and figure 7.7).

        Users of our schema displays are allowed to interact with schema entities. During this
process, optional information such as properties (i.e. attributes) of selected entities can be added to
the display. This feature is the result of providing the entities and their attributes at different levels
of abstraction. The added information is recorded separately using property_line and
property_string (cf. section 5 of figure 7.5, and figure 7.8).
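
        For example, a single regular entity placed by the layout algorithm, together with one of its
selected attributes, could give rise to the following GOACS facts (all coordinates and sizes here are
invented for illustration):

                 % Illustrative GOACS facts for one regular entity and one displayed
                 % attribute (all numeric values are assumed).
                 class_info('Uni_db', 'Office', [x(40), y(60)]).
                 box('Uni_db', 40, 60, 120, 50, 'Office').
                 property_string('Uni_db', 'Office', code, black, 48, 118).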


                 [Figure 7.6 is a diagram: a REGULAR CLASS is drawn as a box, a WEAK CLASS as a
                  box_box (a box within a box, offset by Gap) and a LINK CLASS as a diamond_box; each
                  symbol is placed at (X0,Y0) with width W and height H.]

                          Figure 7.6: Graphical representation of entity types in EER notations



                 [Figure 7.7 is a diagram: a line is drawn between (X1,Y1) and (X2,Y2), a string places
                  STRING_NAME at (X0,Y0), and a diamond of width W and height H at (X0,Y0) carries an
                  association name.]

                       Figure 7.7: Graphical representation of connections, labels and associations
                                                    in EER notations




                 [Figure 7.8 is a diagram: a property_line is drawn between (X1,Y1) and (X2,Y2), and a
                  property_string places PROPERTY_NAME at (X0,Y0).]

                             Figure 7.8: Graphical representation of selected attributes of a class
                                                      in EER notations


7.3 Display Layout Algorithm

        To produce a suitable display of a database schema it was necessary to adopt an intelligent
algorithm which determines the positioning of objects in the display. Such algorithms have been
used by many researchers and also commercially for similar purposes [CHU89]. We studied these
ideas and implemented our own layout algorithm which proved to be effective for small,
manageable database schemas. However, to allow displays to be altered to a user preferred style
and for our method to be effective for large schemas we decided to incorporate an editing facility.
This feature allows users to move entities and change their original positions in a conceptual
schema. Internally, this is done by changing the coordinates recorded in class_info for a
repositioned entity and recomputing all its associated graphical constructs.



        When the location of an entity is changed the connection side of that entity may also need
to be changed. To deal with this case, appropriate sides for all entities can be derived at any stage
of our editing process. When appropriate sides are derived, the ref_info construct (cf. section 3 of
figure 7.5) is appropriately updated to enable us to reproduce the revised coordinates of line,
string and diamond constructs (cf. section 4 of figure 7.5).

       Our layout algorithm does the following things:

    1. Entities connected to each other are identified (i.e. grouped) using their referenced entity
       information. This process highlights unconnected entities as well as entities forming
       hierarchies or graph structures.
    2. Within a group, entities are rearranged according to the number of connections associated
       with them. This arrangement puts entities with most connections in the centre of the
       display structure and entities with the least connections at the periphery.
    3. A tree representation is then constructed starting from the entity having the most
       connections. During the construction of subsequent trees, entities which have already been
       used are not considered, to prevent their original position being changed. This makes it
       easy to visualise relationships/aggregations present in a conceptual model. The
        identification of such properties allows us to gain a better understanding of the application
       being modelled. Similarly, attempts are made to highlight inheritance hierarchies whenever
       they are present. However, when too many inter-related entities are involved, it is
       sometimes necessary to use the move editing facility to relocate some entities so that their
       properties (e.g. relationships) are highlighted in the diagram. The existence of such hidden
       structures is due to the cross connection of some entities. To prevent overlapping of
       entities, relationships, etc., initial placement is done using an internal layout grid. However,
       the user is permitted to overlap or place entities close to each other during schema editing.

       The coordinate information of a final diagram is saved in disk files, so that these
coordinates are automatically available to all subsequent re-engineering processes. Hence our
system first checks for the existence of a file containing these coordinates and only in its absence
would it use the above layout algorithm.
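
        A sketch of step 2 of this algorithm - ordering the entities of a group so that the most
connected ones come first and are therefore placed centrally - is given below. It counts connections
using the relationship construct of figure 7.4; the predicate names and the SWI-Prolog library
predicates are used for illustration only.

                 :- use_module(library(lists)).
                 :- use_module(library(pairs)).

                 % Sketch of step 2 of the layout algorithm (names are illustrative).
                 connection_count(SchemaId, Class, N) :-
                     findall(x,
                             ( relationship(SchemaId, Class, _, _, _)
                             ; relationship(SchemaId, _, _, _, Class) ),
                             Links),
                     length(Links, N).

                 order_by_connections(SchemaId, Group, Ordered) :-
                     findall(N-C,
                             ( member(C, Group), connection_count(SchemaId, C, N) ),
                             Pairs),
                     sort(1, @>=, Pairs, Sorted),     % most connections first
                     pairs_values(Sorted, Ordered).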

7.4 External Architecture and Operation of CCVES

       We characterise CCVES by first considering the type of people who may use this system.
This is followed by an overview of the external system components. Finally the external
operations performed by the system are described.

       7.4.1 CCVES Operation

         The three main operations of CCVES, i.e. analysis, enhancement and incremental migration,
need to be performed by a suitably qualified person. This person must have a good knowledge of
the current database application to ensure that only appropriate enhancements are made to it. Also,
this person must be able to interpret and understand conceptual modelling and the data
manipulation language SQL, as we have used the SQL syntax to specify the contents of databases.


This person must have the skills to design and operate a database application. Thus they are a
more specialised user than the traditional ad hoc user. We shall therefore refer to this person as a
DataBase Administrator (DBA), although they need not be a professional DBA. It is this person
who will be in charge of migrating the current database application.

        To this DBA the process of accessing meta-data from a legacy database service in a
heterogeneous distributed database environment is fully automated once the connection to the
database of interest is made. The production of a graphical display representation for the relevant
database schema is also fully automated. This representation shows all available meta-data, links
and constraints in the existing database schema. All links and constraints defined by hand coding
in the legacy application (i.e. not in the database schema but appearing in the application in the
form of 3GL or equivalent code) will not be shown until such links and constraints are supplied to
CCVES during the process of enhancing the legacy database. Such enhancements are represented
in the database itself to allow automatic reuse of these additions, not only by our system users but
also by others (i.e. users of other database support tools).

        The enhancement process will assist the DBA in incrementally building the database
structure for the target database service. Possible decomposable modules for the legacy system are
identified during this stage. Finally, when the incremental migration process has been performed,
the DBA may need to review its success by viewing both the source and the target database
schemas. This is achieved using the facility to visualise multiple heterogeneous databases.

       We have sought to meet our objectives by developing an interactive schema visualisation
and knowledge acquisition tool which is directed by an inference engine using a real world data
modelling framework based on the EER and OMT conceptual models and extended relational
database modelling concepts. This tool has been implemented in prototype form mostly in Prolog,
supported by some C language routines embedded with SQL to access and use databases built
with the INGRES DBMS (version 6), Oracle DBMS (versions 6 and 7) or POSTGRES O-O data
model (version 4). The Prolog code which does the main processing and uses X window and
Motif widgets exceeds 13,000 lines, while the C language embedded with SQL code is from 100
to 1,000 lines depending on the DBMS.

       7.4.2 System Overview

        This section defines the external architecture and operation of CCVES. It covers the design
and structure of its main interfaces, namely: database connection (access), database selection
(processing) and user interaction (see figure 7.9). The heart of the system consists of a meta-
management module (MMM) (see figure 7.10), which processes and manages meta-data using a
common internal intermediate schema representation (cf. section 7.2). A presentation layer which
offers display and dialog windows has been provided for user interaction. The schema
visualisation, schema enhancement, constraint visualisation and database migration modules (cf.
figure 7.9) communicate with the user.




                 [Figure 7.9 is a flow diagram: starting from the GUI, the Database Access stage
                  (Connect Database) leads to Database Processing (Select Database), which leads to
                  User Interaction; from there the user may invoke Schema Visualisation, Constraint
                  Visualisation, Schema Enhancement, Database Migration, a Query Tool or other
                  Database Tools.]

                                  Figure 7.9: Principal processes and control flow of CCVES


        The meta-data and knowledge for this system is extracted from respective database system
tables and stored using a common internal representation (OACS). This knowledge is further
processed to derive the graphical constructs (GOACS) of a visualised conceptual model.
Information is represented in Prolog as dynamic predicates, describing facts about graphical and
textual schema components and the semantic relationships that hold between those facts. The meta-
management module has access to the selected database to store any changes (e.g. schema
enhancements) made by the user. The input / output interfaces of MMM manage the presentation
layer of CCVES. This consists of X window and Motif widgets used to create an interactive
graphical environment for users.

        In section 2.2 we introduced the functionality of CCVES in terms of information flow with
special emphasis on its external components (cf. figure 2.1). Later, in sections 2.3 and 7.1, we
described the main internal processes of CCVES (cf. figures 2.2 and 7.1). Here, in figure 7.10, we
show both internal and external components of CCVES together with special emphasis on the
external aspect.

        7.4.3 System Operation

        The system has three distinct operational phases: meta-data access, meta-data processing
and user interaction. In the first phase, the system communicates with the source legacy database
service to extract meta-data19 and knowledge20 specifications from the database concerned. This
is achieved when connection to the database (connect database of figure 7.10) is made by the
system, and is the meta-data access phase. In the second phase, the source specifications extracted
from the database system tables are analysed, along with any graphical constructs we may have
subsequently derived, to form the meta-data and meta-knowledge base of MMM. This information
is used to produce a visual representation in the form of a conceptual model. This phase is known
as meta-data processing and is activated when select database (cf. figure 7.10) is chosen by the
user. The final phase is interaction with the user. Here, the user may supply the system with
semantic information to enrich the schema; visualise the schema using a preferred modelling
technique (EER and OMT are currently available); select graphical objects (i.e. classes) and
visualise their properties and intra- and inter- object constraints using the constraint window; and
modify the graphical view of the displayed conceptual model. He may also incrementally migrate
selected schema constructs; transfer selected meta-data to other tools (e.g. MITRA, a Query Tool
[MAD95]); accept meta-data from other tools (e.g. REVEERD, a reverse-engineering tool
[ASH95]); and examine the same database using another window of the CCVES or other database
design tools (e.g. Oracle*Design). The objective of providing the user with a wide range of design
tools is to optimise the process of analysing the source legacy database. The enhancement of the
legacy database with constraints is an attempt to collect, in the legacy database, the kind of
information that is managed by modern DBMSs, without affecting its operation and in preparation
for its migration.

   19
      meta-data represents the original database schema specifications of the database.
   20
      knowledge represents subsequent knowledge we may have already added to augment this
database schema.


                 [Figure 7.10 is a block diagram. The Designer interacts with the presentation layer
                  of display and dialog windows: a constraint window and constraint visualiser, schema
                  display windows (OMT/EER) with select, move, define and transfer operations, and the
                  connect database and select database dialogs (e.g. Dept-Oracle, College-POSTGRES,
                  Faculty-INGRES). Beneath this layer, the Meta-Management Module comprises the
                  meta-knowledge base (GOACS), the meta-processor with its meta-transformation and
                  meta-translation (input/output) components, and the meta-storage system (OACS). It
                  connects downwards to the heterogeneous distributed databases (to connect to a
                  database and to store and enforce external constraints) and exchanges database
                  schemas, SQL-3 constraints and external constraints with text files and other database
                  tools (e.g. GQL, Oracle*Design).]

                                            Figure 7.10: External Architecture of CCVES


       For successful system operation, users need not be aware of the internal schema
representation or any other non-SQL database specific syntax of the source or target database.
This is because all source schemas are mapped into our internal representation and are always
presented to the user using the standard SQL language syntax (unless specifically requested
otherwise). This enables the user to deal with the problem of heterogeneity, since at the global
level local databases are viewed as if they come from the same DBMS. The SQL syntax is used
by default to express the associated constraints of a database. If specifically requested, the SQL
syntax can be translated and viewed using the DDL of the legacy DBMS; as far as CCVES is
concerned this is just performing another meta-translation process. A textual version of the
original legacy database definition is also created by CCVES when connection to the legacy
database is established. This definition may be viewed by the user for better understanding of the
database being modelled.

       The ultimate migration process allows the user to employ a single target database
environment for all legacy databases. This will assist in removing the physical heterogeneity
between those databases. The complete migration process may take days for large information
systems as they already hold a large volume of data. Hence the ability to enhance and migrate
while legacy databases continue to function is an important feature. Our enhancement process
does not affect existing operations as it involves adding new knowledge and validating existing
data. Whenever structural changes are introduced (e.g. an inheritance hierarchy), we have
proposed the use of view tables (cf. section 5.6) to ensure that normal database operations will not
be affected until the actual migration is commenced. This is because some data items may
continue to change while the migration is in preparation, and indeed during migration itself. We
have proposed an incremental migration process to minimise this effect, and the use of a forward
gateway to deal with such situations.

7.5 External Interfaces of CCVES

       CCVES is seen by users as consisting of four processes, namely: a database access
process, a schema and constraint visualisation process, a schema enhancement process, and a
schema migration process. The database access process was described in section 6.5. In the next
subsections we describe the other three processes of CCVES to complete the picture.

       7.5.1 Schema and Constraint Visualisation

        The input / output interfaces of MMM manage the presentation layers of CCVES. These
layers consist of display and dialog windows used to provide an interactive graphical environment
for users. The user is presented with a visual display of the conceptual model for a selected
database, and may perform many operations on this schema display window (SDW) to analyse,
enhance, evolve, visualise and migrate any portion of that database. Most of these operations are
done via SDW as they make use of the conceptual model.

        The traditional conceptual model is an E-R diagram which displays only entities, their
attributes and relationships. This level of abstraction gives the user a basic idea of the structure of
a database. However this information is not sufficient to gain a more detailed understanding of the
components of a conceptual model, including identification of intra- and/or inter- object
constraints. Intra-object constraints for an entity provide extra information that allows the user to
identify behavioural properties of the entity. For instance, the attributes of an entity do not provide
sufficient information to determine the type of data that may be held by an attribute and any
restrictions that may apply to it. Hence providing a higher level of abstraction by displaying
constraints along with their associated entities and attributes gives the user a better understanding
of the conceptual model. The result is much more than a static entity and attribute description of a
data model as it describes how the model behaves for dynamic data (i.e. a constraint implies that
any data item which violates it cannot be held by the database).

        The schema visualisation module allows users to view the conceptual schema and
constraints defined for a database. This visualisation process is done using three levels of
abstraction. The top level describes all the entity types along with any hierarchies and
relationships. The properties of each entity are viewed at the next level of abstraction to increase
the readability of the schema. Finally, all constraints associated with the selected entities and their
properties are viewed to gain a deeper and better understanding of the actual behaviour of selected
database components. The conceptual diagrams of our test databases are given in appendix C.
These diagrams are at the top level of abstraction. Figures 7.11 and 7.12 show the other two levels
of abstraction.

        The graphical schema displayed in the SDW for a selected database uses by default the
OMT notation, which can be changed to EER from a menu. Users can produce any number of
schema displays for the same schema, and thus can visualise a database schema using both OMT
and EER diagrams at the same time (a picture of our system while viewing the same schema using
both forms is provided in figure 7.11). The display size of the schema may also be changed from a
menu. A description that identifies the source of each display is provided as we are dealing with
many databases in a heterogeneous environment. The diagrams produced by CCVES can be
edited to alter the location of their displayed entities and hence to permit visualisation of a schema
in the personally preferred style and format of an individual user. This is done by selecting and
moving a set of entities within the scrolling window, thus altering the relative positions of entities
within the diagram produced. These changes can be saved and automatically restored for the next
session by users.

         The system allows interactive selection of entities and attributes from the SDW. We
initially do not include any attributes as part of the displayed diagram, because we provide them
as a separate level of abstraction. A list of attributes associated with an entity can be viewed by
first selecting the entity from the display window (abstraction at level 2), and then browsing
through its attributes in the attribute window, which is placed just below the display window (see
figure 7.12). Any attribute selected from this window will automatically be transferred to the
main display window, so that only attributes of interest are displayed there. This technique
increases the readability of our display window. At each stage, appropriate messages produced by
the system display unit are shown at the bottom of this window. For successful interactions,
graphical responses are used whenever applicable. For example, when an entity is selected by
clicking on it, the selected entity will be highlighted. In this thesis we do not provide interaction
details as these are provided at the user manual level.

        The 'browser' menu option for SDW will invoke the browser window. This is done only
when a user wishes to visit the constraint visualisation abstraction level, our third level of
abstraction. Here we see all entities and their properties of interest as the default option, but we
can expand this to display other entities and properties by choosing appropriate menu options
from the browsing window. We can also filter the displayed constraints to include those of interest
(e.g. show only domain constraints). In cases where inherited attributes are involved the system
will initially show only those attributes owned by an entity (the default option); others can be
viewed by selecting the required level of abstraction (available in the menu) for the entity
concerned.

        The ability to select an entity from the display window and display its properties in the
browsing window satisfies our requirement of visualising intra-object constraints. The reverse of
this process, i.e. selecting a constraint from the browsing window and displaying all its associated
entities in the display window, satisfies our inter-object constraint visualisation requirement. Both
of these facilities are provided by our system (see figures 7.11 and 7.12, respectively).

        All operations with the mouse are performed using the left button, except when altering
the location of an entity in a displayed conceptual model. The middle button of the mouse is used
to select, drag and drop [21] such an entity. This process alters the position of the entity and
redraws the conceptual model. By this means, object hierarchies, relationships, etc. can be made
prominent by placing related objects close to each other. This feature was introduced firstly to
allow users to visualise a conceptual model in their preferred manner, and secondly because our
display algorithm cannot automatically produce such layouts when constructing complex schemas
with many entities, hierarchies and relationships (cf. section 7.3).




   [21] CCVES changes the cursor symbol to confirm the activation of this mode.


7.5.2 Schema Enhancement

        Schema enhancements are also done via the schema display window. This module is
mainly used to specify dynamic constraints. These constraints are usually extracted from the
legacy code of a database application, as in older systems they were specified in the application
itself. Constraint extraction from legacy applications is not supported by CCVES. Hence, this
information must be obtained by other means. We assume that such constraints can be identified
by examining the legacy code, consulting any existing documentation or drawing on users'
knowledge of the application, and this module has been introduced to capture them. We have also
provided an option to detect possible missing, hidden and redundant information (cf. section 5.5.2)
to assist users in formulating new constraints. The user specifies constraints via the constraint
enhancement interface by choosing the constraint type and associated attributes. In the case of a
check constraint, the condition itself is specified using SQL syntax.
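
        For illustration, a check constraint entered through this interface would take a form such
as the following, where the table and column names are purely hypothetical and not drawn from
our test databases:

        -- illustrative only: salaries must lie within a plausible range
        ALTER TABLE employee
            ADD CONSTRAINT chk_emp_salary
            CHECK (salary BETWEEN 0 AND 100000);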

        The constraint enhancement process allows further constraints to appear in the graphical
model. Each addition is reflected in the graphical display, so that the user remains aware of the
links and constraints present in the schema. For instance, when a foreign key is
specified, CCVES will try to derive a relationship from this information. If this process is
successful a new relationship will appear in the conceptual model.
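
        For example, a foreign key specified through this interface corresponds to a declaration of
the following form (again with purely illustrative names), from which CCVES derives a
relationship between the two entities concerned:

        -- illustrative only: each employee row must reference a department row
        ALTER TABLE employee
            ADD CONSTRAINT fk_emp_dept
            FOREIGN KEY (dept_no) REFERENCES department (dept_no);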

        A graphical user interface in the form of a pop-up sub-menu is used to specify constraints,
which take the form of integrity constraints (e.g. primary key, foreign key, check constraints) and
structural components (e.g. inheritance hierarchies, entity modifications). In figure 7.13 we
present some pop-up menus of CCVES which assist users in specifying various types of
constraints. Here, names of entities and their attributes are automatically supplied via pull-down
menus to ensure the validity of certain input components of user-specified constraints. For all
constraints, information about the type of constraint and the class involved is initially specified.
When the type of constraint is known, prior existence of such a constraint is checked in the case of
primary key and foreign key constraints. For primary keys, the process will not proceed if a key
specification already exists; for foreign keys, if the attributes already participate in such a
relationship they will not appear in the referencing attribute specification list. In the case of
referenced attributes, only attributes with at least the uniqueness property will appear in order to
prevent specification of any invalid foreign keys. All enhanced constraints are stored internally
until they are added to the database using another menu option. Prior to this augmentation process
the user should verify the validity of the constraints. In the case of recent DBMSs like Oracle,
invalid constraints will be rejected automatically and the user will be requested to amend them or
discard them. In such situations the incorrect constraints are reported to the user and are stored on
disk in a log file. Also, changes made during a session are not saved until the user specifically
instructs this, which gives the opportunity to roll back (in the event of an incorrect addition)
and resume from the previous state.




Figure 7.13: Two stages of a Foreign Key constraint specification

       Input data in the form of constraints to enhance the schema provides new knowledge about
a database. It is essential to retain this knowledge within the database itself, if it is to be used for
any future processing. CCVES achieves this task using its knowledge augmentation process as
described in section 5.7. From a user’s point of view this process is fully automated and hence no
intermediate interactions are involved. The enhanced knowledge is stored in this augmented form
only if the database is unable to represent the new knowledge naturally. Such knowledge cannot be
enforced automatically except by migrating the database to a newer version of its DBMS, or to
another DBMS that supports it. However, the data held in the database may not already conform to
the new constraints, and hence existing data may be rejected by the target DBMS, resulting in loss
of data and/or migration delays. To address this problem, we provide an optional constraint
enforcement process which checks the conformance of the data to the new constraints prior to
migration.
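
        The sketch below is intended only to indicate the idea behind this augmentation; it assumes
a simple holding table for constraint definitions and is not the actual layout used by CCVES (cf.
section 5.7):

        -- assumed layout, for illustration only: constraints that the legacy DBMS
        -- cannot itself represent are recorded as data within the database, so that
        -- they remain available to CCVES, to other tools and to a target DBMS
        CREATE TABLE ccves_constraint (
            table_name       CHAR(32),
            constraint_name  CHAR(32),
            constraint_type  CHAR(12),      -- e.g. 'CHECK', 'FOREIGN KEY'
            definition       VARCHAR(240)
        );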

       7.5.3 Constraint Enforcement and Verification

        Constraint enforcement is automatically managed only by relatively recent DBMSs. If
CCVES is used to enhance a recent DBMS such as Oracle then verification and enforcement will
be handled by the DBMS, as CCVES will just create constraints using the DDL commands of that
DBMS. However, when such features are not supported by the underlying DBMS, CCVES has to
provide such a service itself. Our objective in this process is to give users the facility to ensure
that the database conforms to all the enhanced constraints. Constraints are chosen from the
browser window to verify their validity. Once selected, the constraint verification process will
issue each constraint to the database using the technique described in section 5.8 and report any
violations to the user. When a violated constraint is detected, the user can decide whether to keep
or discard it. The user could decide to retain the constraint in the knowledge-base for various
reasons. These include: ensuring that future data conforms to the constraint; providing users with
a guide to the expected data contents, irrespective of occasional violations; and assisting the user
in improving either the data or the constraint. To enable the enforcement of such
constraints for future data instances, it is necessary to either use a trigger (e.g. on append check
constraint) or add a temporal component to the constraint (e.g. system date > constraint input
date). This will ensure that the constraint will not be enforced on existing data.
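
        As a sketch of the second option (with hypothetical table, column and date, and a date
literal written in whatever form the DBMS concerned accepts), a check constraint added on 1
September 1996 could carry a temporal guard so that only rows entered after that date are required
to satisfy it, assuming the table records an entry date for each row:

        -- rows entered before the constraint was added are exempted by the second
        -- disjunct; rows entered afterwards must satisfy the check proper
        ALTER TABLE employee
            ADD CONSTRAINT chk_salary_new_rows
            CHECK (salary > 0 OR entry_date < '1996-09-01');

The trigger alternative achieves the same effect procedurally, at the cost of DBMS-specific syntax.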



        When queries are used to verify enhanced constraints, the rows retrieved are the instances
that violate a constraint. Retrieving a large number of such instances for a given query is of little
value, as this is more likely to indicate an incorrect constraint specification than a problem with
the data itself. Therefore, if the output exceeds 20 instances, we terminate the query and instruct
the user to inspect this constraint as a separate action.
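
        Such a verification query simply retrieves the rows that fail the constraint; for a
hypothetical check that salary > 0 it would take the form:

        -- the rows returned are the violations of the enhanced constraint
        SELECT *
        FROM   employee
        WHERE  NOT (salary > 0);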

       7.5.4 Schema Migration

       The migration process is provided to allow an enhanced legacy system to be ported to a
new environment. This is performed incrementally, by first creating the schema in the target
DBMS and then copying the legacy data to the target system. To create the schema in the target
system, DDL statements are generated by CCVES. An appropriate schema meta-translation
process is performed if required (e.g. if the target DBMS has a non-SQL query language). The
legacy data is migrated using the import/export tools of the source and target DBMSs.
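
        As an indication of what the schema creation step produces (the table definition is
illustrative only and is not one of our test schemas), the DDL generated for an Oracle target might
take the form:

        CREATE TABLE employee (
            emp_no     NUMBER(6)      NOT NULL,
            name       VARCHAR2(30),
            dept_no    NUMBER(4),
            CONSTRAINT pk_employee PRIMARY KEY (emp_no),
            CONSTRAINT fk_emp_dept FOREIGN KEY (dept_no)
                       REFERENCES department (dept_no)
        );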

       The migration process is not fully automated as certain conflicts cannot be resolved
without user intervention. For example, if the target DBMS only accepts names of up to 16
characters (as in Oracle) whereas the source database (as in INGRES) allows 32, then a name
resolution process must be performed by the user. Also, names used in one DBMS may be
keywords in another. Our system resolves these problems by adding a tag to such names or by
truncating a name. This approach is not fully general, as the uniqueness of an attribute name
cannot be guaranteed when it is truncated. In these situations user involvement is unavoidable.

        Although CCVES has been tested with only three types of DBMS, namely INGRES,
POSTGRES and Oracle, it could easily be adapted for other relational DBMSs, as they represent
their meta-data similarly, i.e. in the form of system tables, with minor differences such as table
and attribute names and some table structures. Non-relational databases accessible via
ODBC or other tools (e.g. Data Extract for DB2, which permits movement of data from IMS/VS,
DL/1, VSAM and SAM to SQL/DS or DB2) could also be accommodated, as the meta-data required
by CCVES could be extracted from them. Previous work related to meta-translation [HOW87] has
investigated the translation of dBase code to INGRES/QUEL, demonstrating the applicability of
this technique in general, not only to the relational data model but also to others such as
CODASYL and hierarchical data models. This means CCVES is capable in principle of being
extended to cope with other data models.




CHAPTER 8

                       Evaluation, Future Work and Conclusion

In this chapter the Conceptualised Constraint Visualisation and Enhancement System (CCVES)
described in Chapters 5, 6 and 7 is evaluated with respect to our hypotheses and objectives listed
in Chapter 1. We describe the functionality of different components of CCVES to identify their
strengths and summarise their limitations. Potential extensions and improvements are considered
as part of future work. Finally, conclusions about the work are drawn by reviewing the objectives
and the evaluation.

8.1 Evaluation

       8.1.1 System Objectives and Achievements

        The major technical challenge in designing CCVES was to provide an interactive graphical
environment to access and manipulate legacy databases within an evolving heterogeneous
distributed database environment for the purpose of analysing, enhancing and incrementally
migrating legacy database schemas to modern representations. The objective of this exercise was
to enrich a legacy database with valuable additional knowledge that has many uses, without being
restricted by the existing legacy database service and without affecting the operation of the legacy
IS. This knowledge is in the form of constraints that can be used to understand and improve the
data of the legacy IS.

         Here, we assess the main external and internal aspects of our system, CCVES, based on
the objectives laid out in sections 1.2 and 2.4. Externally, CCVES performs three important tasks:
initially, a reverse-engineering process; then, a knowledge augmentation process, which is a re-
engineering process on the original system; and finally, an incremental migration process. The
reverse-engineering process is fully automated and is generalised to deal with the problems caused
by heterogeneity.

       a) A framework to address the problem of heterogeneity

        The problems of heterogeneity have been addressed by many researchers, and at Cardiff
the meta-translation technique has been successfully used to demonstrate its wide-ranging
applicability to heterogeneous systems. This previous work, which includes query meta-
translation [HOW87], schema meta-translation [RAM91] and schema meta-integration [QUT93],
was developed using Prolog - emphasising its suitability for meta-data representation and
processing. Hence Prolog was chosen as the main programming language for the development of
our system. Among the many Prolog versions around, we found that Quintus Prolog was well
suited to supporting an interactive graphical environment as it provided access to X window and
Motif widget routines. Also, the PRODBI tools [LUC93] were available with Quintus Prolog, and
these enabled us to directly access a number of relational DBMSs, like INGRES, Oracle and
SYBASE.

        Our framework for meta-data representation and manipulation has been described in
section 7.2. The meta-programming approach enabled us to implement many other features, such
as the ability to easily customise our system for different data models, e.g. relational and object-
oriented (cf. section 7.2.1), the ability to easily enhance or customise for different display models,
e.g. E-R, EER and OMT (cf. section 7.2.4), and the ability to deal with heterogeneity due to
differences in local databases (e.g. at the global level the user views all local databases as if they
come from the same DBMS, and is also able to view databases using a preferred DDL syntax).

       b) An interactive graphical environment

        An interactive graphical environment which makes extensive use of modern graphical user
interface facilities was required to provide graphical displays of conceptual models and allow
subsequent interaction with them. To fulfil these requirements the CCVES software development
environment had to be based on a GUI sub-system consisting of pop-up windows, pull-down
menus, push buttons, icons etc. We selected X window and Motif widgets to build such an
environment on a UNIX platform. SunSparc workstations were used for this purpose. Provision of
interactive graphical responses when working via this interface was also included to ensure user
friendliness (cf. section 7.5).

       c) Ability to access and work on legacy database systems

        An initial, basic facility of our system was the ability to access legacy database systems.
This process, which is described in section 6.5, enables users to specify and access any database
system over a network. Here, as the schema information is usually static, CCVES has been
designed to provide the user with the option of by-passing the physical database access process
and using instead an existing (already accessed) logical schema. This saves time once the initial
access to a schema has been made. Also, it guarantees access to meta-data of previously accessed
databases during server and network breakdowns, which were not uncommon during the
development of our system.


       d) A framework to perform the reverse-engineering process

       A framework to perform the reverse-engineering process for legacy database systems has
been provided. This process is based on applying a set procedure which produces an appropriate
conceptual model (cf. section 5.2). It is performed automatically even if there is very limited
meta-knowledge. In such a situation, links that should be present in the conceptual model will not
appear in the corresponding graphical display. Hence, the full success of this process depends on
the availability of adequate meta-knowledge. This means that a real world data modelling
framework that facilitates the enhancement of legacy systems must be provided, as described next.

       e) A framework to enhance existing systems

        A comprehensive data modelling framework that facilitates the enhancement of
established database systems has been provided (cf. section 5.6). A method of retaining the
enhanced knowledge for future use which is in line with current international standards is
employed. Techniques that are used in recent versions of commercial DBMSs are supported to
enable legacy databases to logically incorporate modern data modelling techniques irrespective of
whether these are supported by their legacy DBMSs or not (cf. section 5.7). This enhancement
facility gives users the ability to exploit existing databases in new ways (i.e. restructuring and
viewing them using modern features even when these are not supported by the existing system).
The enhanced knowledge is retained in the database itself so that it is readily available for future
exploitation by CCVES or other tools, or by the target system in a migration.

       f) Ability to view a schema using preferred display models

       The original objective of producing a conceptual model as a result of our reverse-
engineering process was to display the structure of a database in graphical form and so make it
easier for users to comprehend its contents. As not all users are familiar with the same display
model, the facility to visualise a schema using a preferred display model (e.g. EER or OMT) has
been provided. This is more flexible than our original aim.

       g) High level of data abstraction for better understanding

        A high level of data abstraction for most components of a conceptual model (i.e.
visualising the contents, relationships and behavioural properties of entities and constraints;
including identification of intra- and inter-object constraints) has been provided (cf. section 7.5.1).
Such features are not usually incorporated in visualisation tools. These features and various other
forms of interaction with conceptual models are provided via the user interface of CCVES.

       h) Ability to enhance schema and to verify the database

       The schema enhancement process was provided originally to enrich a legacy database
schema and its resultant conceptual model. A facility to determine the constraints on the
information held and the extent to which the legacy data conforms to these constraints is also
provided to enable the user to verify their applicability (section 5.7). The graphical user interface
components used for this purpose are described in section 7.5.2.

       i) Ability to migrate while the system continues to function

        The ability to enhance and migrate while a legacy database continues to function normally
was considered necessary as it ensures that this process will not affect the ongoing operation of
the legacy system (section 5.8). The ability to migrate to a single target database environment for
all legacy databases assists in removing the physical heterogeneity between these databases.
Finally, the ability to integrate CCVES with other tools to maximise the benefits to the user
community was also provided (section 7.4.3).

       8.1.2 System Development and Performance

       A working prototype CCVES system that enabled us to test all the contributions of this
research was implemented using Quintus Prolog with X window and Motif libraries; INGRES,
Oracle and POSTGRES DBMSs; the C programming language embedded with SQL and
POSTQUEL; and the PRODBI interface to INGRES. This system can be split into four parts,
namely: the database access process to capture meta-data from legacy databases; the mapping of
the meta-data of a legacy database to a conceptual model to present the semantics of the database
using a graphical environment; the enhancement of a legacy database schema with constraint
based knowledge to improve its semantics and functionality; and the incremental migration of the
legacy database to a target database environment.

         Our initial development commenced using POPLOG, which was at that time the only
Prolog version with any graphical capabilities available on UNIX workstations at Cardiff. Our
initial exposure to X window library routines occurred at this stage. Later, with the availability of
Quintus Prolog, which had a more powerful graphical capability due to its support of X windows
and Motif widgets, it was decided to transfer our work to this superior environment. To achieve
this we had to make two significant changes, namely: converting all POPLOG graphic routines to
Quintus equivalents and modifying a particular implementation approach adopted by us when
working with POPLOG. The latter took advantage of POPLOG’s support for passing unevaluated
expressions as arguments of Prolog clauses. In Quintus Prolog we had to evaluate all expressions
before passing them as arguments.

        Due to the use of slow workstations (i.e. SPARC1s) and running Prolog interactively, there
was a delay in most interactions with our original system. This delay was significant (e.g. nearly a
minute) when having to redraw a conceptual model. It was necessary to redraw this model when
we moved an object of the display in order to change its location, and whenever the drawing
window was exposed. This exposure occurred when the window's position changed, when it was
overlapped by another window or a menu, or when someone clicked on this window. In such
situations it was necessary to refresh the drawing window by redrawing the model. Redrawing
was required as our initial attempt at producing a conceptual model was based solely on drawing
routines. This method was inefficient as such drawings had to be redone every time the drawing
window became exposed.

        Our second attempt was to draw conceptual models in the background using a pixmap.
This process allocates part of the memory of the computer to enable us to directly draw and retain
an image. A pixmap can be copied to any drawing window without having to reconstruct its
graphical components. This means that when the drawing window becomes exposed it is possible
to copy this pixmap to that window without redrawing the conceptual model. The process of
copying a pixmap to the drawing window took only a few seconds and so there was a significant
improvement over our original method. However, with this new approach, whenever a move
operation is performed it is still necessary to recompute all graphical settings and redraw, which
took about as long as in the original method.

       The use of a pixmap took up a significant part of the computer’s memory and as a result
Quintus was unable to cope if there was a need to simultaneously view multiple conceptual
models. We also experienced several instances of unusual system behaviour such as failure to
execute routines that had been tested previously. This was due to the full utilisation by Prolog of
run time memory because of the existence of this pixmap. We noticed that Quintus Prolog had a
bug of not being able to release the memory used by a pixmap. In order to regain this memory we
had to logout (exit) from the workstation, as the xnew process which was collecting garbage was
unable to deal with this case. Hence, we decided to use widgets instead of drawing rectangles for
entities, as widgets are managed automatically by X windows and Motif routines. This allowed us
to reduce the drawing components in our conceptual model and hence to minimise redrawing time
when the drawing window became exposed. We discarded the pixmap approach as it gave us
many problems. However, as widgets themselves take up memory, their behaviour with highly
complex conceptual models remains questionable. We decided not to test this in depth as we had already
spent too much time on this module, and its feasibility had been demonstrated satisfactorily.

        During the course of CCVES development, Quintus Prolog was upgraded from release 3.1
to 3.1.4. Due to incompatibilities between the two versions, certain routines of our system had to
be modified to suit the new version. This meant that a full test of the entire system was required.
Also, since three versions of INGRES, two versions of Oracle and one version of POSTGRES were
used during our project, still more system testing was required. Thus, we have
experienced several changes to our system due to technological changes in its development
environment. Comparing the lifespan and scale of our project with those of a legacy IS, we could
more clearly appreciate the amount of change that is required for such systems to keep up with
technological progress and business needs. Hence, the migration of any IS is usually a complex
process. However, the ability to enhance and evolve such a system without affecting its normal
operation is a significant step towards assisting this process.

       Our final task was to produce a compiled version of our system. This is still being
undertaken: although we have been able to produce executable code, some user interface options
are not activated, for reasons as yet unknown (we suspect insufficient memory), even though the
individual modules work correctly.

       8.1.3 System Appraisal

        The approach presented in this thesis for mapping a relational database schema to a
conceptual schema is in many ways simpler and easier to apply than any previous attempts as it
has eliminated the need for any initial user interaction to provide constraint based knowledge for
this process. Constraint information such as primary and foreign keys are used to automatically
derive the entity and relationship types. Use of foreign key information was not considered in
previous approaches as most database systems did not support such facilities at that time.

       One major contribution of our work is providing the facility for specifying and using
constraint-based information in any type of DBMS. This means that once a database is enhanced
with constraints, it is semantically richer. If the source DBMS does not support constraints then
the conceptual model will still be enhanced, and our tool will augment the database with these
constraints in an appropriate form.

       Another innovative feature of our system is the automated use of the DML of a database to
determine the extent to which its data conforms to the enhanced constraints. This enables users to
take appropriate compensatory actions prior to migrating legacy databases.

        We provided an additional level of schema abstraction for our conceptual models. This is
in the form of viewing the constraints associated with a schema. This feature allows users to gain
a better understanding of databases.

        The facility to view multiple schemas allows users to compare different components of a
global system if it comprises several databases. This feature is very useful when dealing with
heterogeneous databases. We also deal with heterogeneity at the conceptual viewing stage by
providing users with the facility to view a schema using their preferred modelling notation. For
example, in our system the user can choose either an EER or an OMT display to view a schema.
This ensures greater accuracy in understanding, as the user can select a familiar modelling
notation to view database schemas. The display of the same schema in multiple windows using
different scales allows the user to focus on a small section of the schema in one window while
retaining a larger view in another. The ability to view multiple schemas also means that it is
possible to jointly monitor the progress or status of the source and target databases during an
incremental migration process. The introduction of both EER and OMT as modelling options
means that recent modelling advances, which were not present in the original E-R model and some
of its subsequent variants, can be represented using our system.

        Our approach of augmenting a database itself with new semantic knowledge rather than
using separate specialised knowledge-bases means that our enhanced knowledge is accessible by
any user or tool via the DML of the database. This knowledge is represented in the database
using an extended version of the SQL-3 standards for constraint representation. Thus this
knowledge will be compatible with future database products, which should conform to the new
SQL-3 standards. Also, no semantics are lost due to the mapping from a conceptual model to a
database schema. Oracle version 6 provided similar functionality by allowing constraints to be
specified even though they could not be applied until the introduction of version 7.

       8.1.4 Useful real-life Applications

        We were able to successfully reverse-engineer a leading telecommunication database
extract consisting of over 50 entities. This enabled us to test our tool on a scale greater than that
of our test databases. The successful use of all or parts of our system for other research work,
namely: accessing POSTGRES databases for semantic object-oriented multi-database access
[ALZ96], viewing heterogeneous conceptual schemas when dealing with graphical query
interfaces [MAD95], and viewing heterogeneous conceptual schemas via the world wide web
(WWW) [KAR96] indicates its general usefulness and applicability.

        The display of conceptual models can be of use in many areas such as database design,
database integration and database migration. We could identify similar areas of use for CCVES.
These include training new users by allowing them to understand an existing system, and enabling
users to experiment with possible enhancements to existing systems.

8.2 Limitations and possible future Extensions

        There are a number of possible extensions that could be incorporated to improve the
current functionalities of our system. Some of these are associated with improving run time
efficiency, accommodating a wider range of users and extending graphical user interaction
capabilities. Such extensions would not have great significance with respect to demonstrating the
applicability of our fundamental ideas. Examples of improvements are: engineer the system to the
level of a commercial product so that it could be used by a wide range of users with minimal user
training; improve the run time efficiency of the system by producing a compiled version; test it in
a proper distributed database environment, as our test databases did not emphasise distribution;
extend the graphical display options to offer other conceptual models, such as ECR; extend the
system to enable us to test migrations to a proper object-oriented DBMS (i.e. not only to an
extended relational DBMS with O-O features, like POSTGRES); and improve the display layout
algorithm (cf. section 7.3) to efficiently manage large database schemas. The time scale for such
improvements would vary from a few weeks to many months each, depending on the work
involved.

        Our system is designed to cope with two important extensions. They are: extend the
graphical display option to offer other forms of conceptual models, and extend the number of
DBMSs and their versions it can support. Of these two extensions, supporting a new graphical
display involves the least work. Here, the user needs to identify the notations used by the new
display and write the necessary Prolog rules to generate strings and lines used for the drawings.
This process will take at most one week, as we do not change graphical constructs such as
class_info and ref_info (cf. section 7.2.4) to support different display models. On the other hand,
inclusion of a new relational DBMS or version can take a few months as it affects three stages of
our system, namely: meta-data access, constraint enforcement and database migration. All three
stages use the query language (SQL) of the DBMS and hence, if its dialect differs from standard
SQL, we will need to expand our QMTS. The time required for such an extension will depend on
how closely the dialect resembles standard SQL and may take 2-4 person-weeks. Next, we need to
assess the constraint handling features supported by the new DBMS so that we can use our
knowledge-based tables to overcome any constraint representation limitations. This process may
take 1-2 person-weeks. To access the meta-data from a database it is necessary to know the
structures of its system tables. Also, we need a mechanism to access this information externally
(i.e. use an ODBC driver or write our own). This stage can take 1-6 person-weeks as in many cases
system documentation
will be inadequate. Inclusion of a different data model would be a major extension as it affects all
stages of our system. It would require provision of new abstraction mechanisms such as parent-
child relationships for a hierarchical model and owner-member relationships for a network model.

       Other possible extensions are concerned with incorporating software modules that would
expand our approach. These include a forward gateway for use at the incremental migration stage;
an integration module for merging related database applications; and analysers for extracting
constraint-based information from legacy IS code. These are important and major areas of
research, hence the development of such modules could take from many months to years in each
case.

8.3 Conclusion

8.3.1 Overall Summary

      This thesis has reported the results of a research investigation aimed at the design and
implementation of a tool for enhancing and migrating heterogeneous legacy databases.



         In the first two chapters we introduced our research and its aims and objectives. Then in
chapter 3, we presented some preliminary database concepts and standards relevant to our work.
In chapters 4 and 5, we introduced wider aspects of our problem and studied alternative ways
proposed to solve major parts of this problem. Many important points emerged from this study.
These include: application of meta-translation techniques to deal with legacy database system
heterogeneity; application of migration techniques to specific components of a database
application (i.e. the database service) as opposed to an IS as a whole; extending the application of
database migration beyond the traditional COBOL oriented and IBM database products;
application of a migration approach to distributed database systems; enhancing previous re-
engineering approaches to incorporate modern object-oriented concepts and multi-database
capabilities; introducing semantic integrity constraints into legacy database systems and hence
exploring them beyond their structural semantics. In chapter 5, we described our re-engineering
approach and explained how we accomplished our goals in enhancing and preparing legacy
databases for migration, while chapter 6 was concerned with testing our ideas using carefully
designed test databases. Also in chapter 6, we provided illustrative examples of our system
working. In chapter 7, we described the overall architecture and operation of the system together
with related implementation considerations. Here we also gave some examples of our system
interfaces. In chapter 8, we carried out a detailed evaluation which included research
achievements, limitations and suggestions for possible future extensions. We also looked at some
real-life areas of application in which our prototype system has been tested and/or could be used.
Finally, some major conclusions that can be drawn from this research are presented below.

8.3.2 Conclusions

       The important conclusions that can be drawn from the work described in this thesis are as
follows:

    • Although many approaches have been proposed for mapping relational schemas to a form
      where their semantics can be more easily understood by users, they either lack the
      application of modern modelling concepts or have been applied to logically centralised or
      decentralised database schemas, not physically heterogeneous databases.
    • Previously proposed approaches for mapping relational schemas to conceptual models involve
      user interactions and pre-requisites. This is confusing for first-time users of a system as they
      do not have any prior direct experience or knowledge of the underlying schema. We
      produce an initial conceptual model automatically, prior to any user interaction, to
      overcome this problem. Our user interaction commences only after the production of the
      initial conceptual model. This gives users the opportunity to gain some vital basic
      understanding of a system prior to any serious interaction with it.
    • Most previous reverse-engineering tools have ignored an important source of database
      semantics, namely semantic integrity constraints such as foreign key definitions. One
      obvious reason for this is that many existing database systems do not support the
      representation of such semantics. We have identified and shown the important contribution
      that semantic integrity constraints can make by presenting them and applying them to the
      conceptual and physical models. We have also successfully incorporated them into legacy
      database systems which do not directly support such semantics.
    • The problem of legacy IS migration has not been studied for multi-database systems in
      general. This appears to present many difficulties to users. We have tested and demonstrated
      the use of our tools with a wide range of relational and extended relational database
      systems.
    • The problem of legacy IS migration has not been studied for more recent and modern
      systems; as a result, ways of eliminating the need for migration have not yet been addressed.
      Our approach of enhancing legacy ISs irrespective of their DBMS type will assist in
      redesigning modern database applications and hence overcome the need to migrate such
      applications in many cases.
    • Our evaluation has concluded that most of the goals and objectives of our system, presented in
      sections 1.2 and 2.4, have been successfully met or exceeded.




Page 132

More Related Content

PDF
Re-Engineering Databases using Meta-Programming Technology
PDF
A Centralized Network Management Application for Academia and Small Business ...
PDF
Getting relational database from legacy data mdre approach
PPT
Data & database administration hoffer
PDF
Towards a low cost etl system
PDF
Pitfalls & Challenges Faced During a Microservices Architecture Implementation
PDF
Model-Driven Architecture for Cloud Applications Development, A survey
PPTX
Re-Engineering Databases using Meta-Programming Technology
A Centralized Network Management Application for Academia and Small Business ...
Getting relational database from legacy data mdre approach
Data & database administration hoffer
Towards a low cost etl system
Pitfalls & Challenges Faced During a Microservices Architecture Implementation
Model-Driven Architecture for Cloud Applications Development, A survey

What's hot (15)

PPT
Topic1 Understanding Distributed Information Systems
DOCX
16 & 2 marks in i unit for PG PAWSN
PPT
Case Study: Synchroniztion Issues in Mobile Databases
PDF
DWDM-RAM: a data intensive Grid service architecture enabled by dynamic optic...
DOCX
Case4 lego embracing change by combining bi with flexible information system 2
DOCX
LEGO EMBRACING CHANGE BY COMBINING BI WITH FLEXIBLE INFORMATION SYSTEM
PPTX
Pmit 6102-14-lec1-intro
PPT
0321210255 ch01
PDF
Toward Cloud Computing: Security and Performance
DOCX
A database management system
PDF
Fs2510501055
PDF
Lesson - 02 Network Design and Management
PPTX
Distributed Systems - Information Technology
PPTX
Case study 9
PDF
A HEALTH RESEARCH COLLABORATION CLOUD ARCHITECTURE
Topic1 Understanding Distributed Information Systems
16 & 2 marks in i unit for PG PAWSN
Case Study: Synchroniztion Issues in Mobile Databases
DWDM-RAM: a data intensive Grid service architecture enabled by dynamic optic...
Case4 lego embracing change by combining bi with flexible information system 2
LEGO EMBRACING CHANGE BY COMBINING BI WITH FLEXIBLE INFORMATION SYSTEM
Pmit 6102-14-lec1-intro
0321210255 ch01
Toward Cloud Computing: Security and Performance
A database management system
Fs2510501055
Lesson - 02 Network Design and Management
Distributed Systems - Information Technology
Case study 9
A HEALTH RESEARCH COLLABORATION CLOUD ARCHITECTURE
Ad

Similar to Assisting Migration and Evolution of Relational Legacy Databases (20)

PDF
HOW-CLOUD-IMPLEMENTATION-CAN-ENSURE-MAXIMUM-ROI.pdf
PDF
A Reconfigurable Component-Based Problem Solving Environment
PDF
AtomicDBCoreTech_White Papaer
PDF
01_Program
PDF
Conspectus data warehousing appliances – fad or future
PDF
Research Inventy : International Journal of Engineering and Science
PDF
Adm Workshop Program
PPTX
Opportunities and Challenges for Running Scientific Workflows on the Cloud
PPT
CSC UNIT1 CONTENT IN THE SUBJECT CLIENT SERVER COMPUTING
PDF
Microservices for Application Modernisation
PPTX
Transform Legacy Systems with Modern Development Expertise
PPTX
Mykhailo Hryhorash: Архітектура IT-рішень (Частина 1) (UA)
PDF
Transform Legacy Systems with Modern Development Expertise
PDF
data-mesh_whitepaper_dec2021.pdf
PDF
Query Evaluation Techniques for Large Databases.pdf
PDF
Cloud Computing: A Perspective on Next Basic Utility in IT World
PDF
Development of a Suitable Load Balancing Strategy In Case Of a Cloud Computi...
PDF
Embracing Containers and Microservices for Future Proof Application Moderniza...
PDF
CC LECTURE NOTES (1).pdf
PDF
publishable paper
HOW-CLOUD-IMPLEMENTATION-CAN-ENSURE-MAXIMUM-ROI.pdf
A Reconfigurable Component-Based Problem Solving Environment
AtomicDBCoreTech_White Papaer
01_Program
Conspectus data warehousing appliances – fad or future
Research Inventy : International Journal of Engineering and Science
Adm Workshop Program
Opportunities and Challenges for Running Scientific Workflows on the Cloud
CSC UNIT1 CONTENT IN THE SUBJECT CLIENT SERVER COMPUTING
Microservices for Application Modernisation
Transform Legacy Systems with Modern Development Expertise
Mykhailo Hryhorash: Архітектура IT-рішень (Частина 1) (UA)
Transform Legacy Systems with Modern Development Expertise
data-mesh_whitepaper_dec2021.pdf
Query Evaluation Techniques for Large Databases.pdf
Cloud Computing: A Perspective on Next Basic Utility in IT World
Development of a Suitable Load Balancing Strategy In Case Of a Cloud Computi...
Embracing Containers and Microservices for Future Proof Application Moderniza...
CC LECTURE NOTES (1).pdf
publishable paper
Ad

More from Gihan Wikramanayake (20)

PPTX
Using ICT to Promote Learning in a Medical Faculty
PPTX
Evaluation of English and IT skills of new entrants to Sri Lankan universities
PPT
Learning beyond the classroom
PPT
Broadcasting Technology: Overview
PPT
Importance of Information Technology for Sports
PDF
Improving student learning through assessment for learning using social media...
PDF
Exploiting Tourism through Data Warehousing
PDF
Speaker Search and Indexing for Multimedia Databases
PDF
Authropometry of Sri Lankan Sportsmen and Sportswomen, with Special Reference...
PDF
Analysis of Multiple Choice Question Papers with Special Reference to those s...
PDF
ICT ප්‍රාරම්භක ඩිප්ලෝමා පාඨමාලාව දිනමිණ, පරිගණක දැනුම
PDF
වෘත්තීය අවස්ථා වැඩි පරිගණක ක්ෂේත‍්‍රය දිනමිණ, පරිගණක දැනුම
PDF
පරිගණක ක්ෂේත‍්‍රයේ වෘත්තීය අවස්ථා දිනමිණ, පරිගණක දැනුම
PDF
Producing Employable Graduates
PDF
Balanced Scorecard and its relationship to UMM
PDF
An SMS-Email Reader
PDF
Web Usage Mining based on Heuristics: Drawbacks
PDF
Evolving and Migrating Relational Legacy Databases
PDF
Development of a Web site with Dynamic Data
PDF
Web Based Agriculture Information System
Using ICT to Promote Learning in a Medical Faculty
Evaluation of English and IT skills of new entrants to Sri Lankan universities
Learning beyond the classroom
Broadcasting Technology: Overview
Importance of Information Technology for Sports
Improving student learning through assessment for learning using social media...
Exploiting Tourism through Data Warehousing
Speaker Search and Indexing for Multimedia Databases
Authropometry of Sri Lankan Sportsmen and Sportswomen, with Special Reference...
Analysis of Multiple Choice Question Papers with Special Reference to those s...
ICT ප්‍රාරම්භක ඩිප්ලෝමා පාඨමාලාව දිනමිණ, පරිගණක දැනුම
වෘත්තීය අවස්ථා වැඩි පරිගණක ක්ෂේත‍්‍රය දිනමිණ, පරිගණක දැනුම
පරිගණක ක්ෂේත‍්‍රයේ වෘත්තීය අවස්ථා දිනමිණ, පරිගණක දැනුම
Producing Employable Graduates
Balanced Scorecard and its relationship to UMM
An SMS-Email Reader
Web Usage Mining based on Heuristics: Drawbacks
Evolving and Migrating Relational Legacy Databases
Development of a Web site with Dynamic Data
Web Based Agriculture Information System

Recently uploaded (20)

PDF
Basic Mud Logging Guide for educational purpose
PDF
grade 11-chemistry_fetena_net_5883.pdf teacher guide for all student
PPTX
The Healthy Child – Unit II | Child Health Nursing I | B.Sc Nursing 5th Semester
PPTX
Institutional Correction lecture only . . .
PPTX
Cell Structure & Organelles in detailed.
PPTX
Pharma ospi slides which help in ospi learning
PPTX
school management -TNTEU- B.Ed., Semester II Unit 1.pptx
PPTX
BOWEL ELIMINATION FACTORS AFFECTING AND TYPES
PDF
Supply Chain Operations Speaking Notes -ICLT Program
PDF
FourierSeries-QuestionsWithAnswers(Part-A).pdf
PDF
Abdominal Access Techniques with Prof. Dr. R K Mishra
PDF
VCE English Exam - Section C Student Revision Booklet
PPTX
IMMUNITY IMMUNITY refers to protection against infection, and the immune syst...
PDF
01-Introduction-to-Information-Management.pdf
PPTX
Microbial diseases, their pathogenesis and prophylaxis
PDF
Complications of Minimal Access Surgery at WLH
PDF
Anesthesia in Laparoscopic Surgery in India
PDF
Mark Klimek Lecture Notes_240423 revision books _173037.pdf
PPTX
Introduction to Child Health Nursing – Unit I | Child Health Nursing I | B.Sc...
PPTX
Final Presentation General Medicine 03-08-2024.pptx
Basic Mud Logging Guide for educational purpose
grade 11-chemistry_fetena_net_5883.pdf teacher guide for all student
The Healthy Child – Unit II | Child Health Nursing I | B.Sc Nursing 5th Semester
Institutional Correction lecture only . . .
Cell Structure & Organelles in detailed.
Pharma ospi slides which help in ospi learning
school management -TNTEU- B.Ed., Semester II Unit 1.pptx
BOWEL ELIMINATION FACTORS AFFECTING AND TYPES
Supply Chain Operations Speaking Notes -ICLT Program
FourierSeries-QuestionsWithAnswers(Part-A).pdf
Abdominal Access Techniques with Prof. Dr. R K Mishra
VCE English Exam - Section C Student Revision Booklet
IMMUNITY IMMUNITY refers to protection against infection, and the immune syst...
01-Introduction-to-Information-Management.pdf
Microbial diseases, their pathogenesis and prophylaxis
Complications of Minimal Access Surgery at WLH
Anesthesia in Laparoscopic Surgery in India
Mark Klimek Lecture Notes_240423 revision books _173037.pdf
Introduction to Child Health Nursing – Unit I | Child Health Nursing I | B.Sc...
Final Presentation General Medicine 03-08-2024.pptx

Assisting Migration and Evolution of Relational Legacy Databases

  • 1. Assisting Migration and Evolution of Relational Legacy Databases by G.N. Wikramanayake Department of Computer Science, University of Wales Cardiff, Cardiff September 1996
  • 3. Abstract The research work reported here is concerned with enhancing and preparing databases with limited DBMS capability for migration to keep up with current database technology. In particular, we have addressed the problem of re-engineering heterogeneous relational legacy databases to assist them in a migration process. Special attention has been paid to the case where the legacy database service lacks the specification, representation and enforcement of integrity constraints. We have shown how knowledge constraints of modern DBMS capabilities can be incorporated into these systems to ensure that when migrated they can benefit from the current database technology. To this end, we have developed a prototype conceptual constraint visualisation and enhancement system (CCVES) to automate as efficiently as possible the process of re-engineering for a heterogeneous distributed database environment, thereby assisting the global system user in preparing their heterogeneous database systems for a graceful migration. Our prototype system has been developed using a knowledge based approach to support the representation and manipulation of structural and semantic information about schemas that the re-engineering and migration process requires. It has a graphical user interface, including graphical visualisation of schemas with constraints using user preferred modelling techniques for the convenience of the user. The system has been implemented using meta-programming technology because of the proven power and flexibility that this technology offers to this type of research applications. The important contributions resulting from our research includes extending the benefits of meta- programming technology to the very important application area of evolution and migration of heterogeneous legacy databases. In addition, we have provided an extension to various relational database systems to enable them to overcome their limitations in the representation of meta-data. These extensions contribute towards the automation of the reverse-engineering process of legacy databases, while allowing the user to analyse them using extended database modelling concepts. Page v
  • 4. CHAPTER 1 Introduction This chapter introduces the thesis. Section 1.1 is devoted to the background and motivations of the research undertaken. Section 1.2 presents the broad goals of the research. The original achievements which have resulted from the research are summarised in Section 1.3. Finally, the overall organisation of the thesis is described in Section 1.4. 1.1 Background and Motivations of the Research Over the years rapid technological changes have taken place in all fields of computing. Most of these changes have been due to the advances in data communications, computer hardware and software [CAM89] which together have provided a reliable and powerful networking environment (i.e. standard local and wide area networks) that allow the management of data stored in computing facilities at many nodes of the network [BLI92]. These changes have turned round the hardware technology from centralised mainframes to networked file-server and client-server architectures [KHO92] which support various ways to use and share data. Modern computers are much more powerful than the previous generations and perform business tasks at a much faster rate by using their increased processing power [CAM88, CAM89]. Simultaneous developments in the software industry have produced techniques (e.g. for system design and development) and products capable of utilising the new hardware resources (e.g. multi-user environments with GUIs). These new developments are being used for a wide variety of applications, including modern distributed information processing applications, such as office automation where users can create and use databases with forms and reports with minimal effort, compared to the development efforts using 3GLs [HIR85, WOJ94]. Such applications are being developed with the aid of database technology [ELM94, DAT95] as this field too has advanced by allowing users to represent and manipulate advanced forms of data and their functionalities. Due to the program data independence feature of DBMSs the maintenance of database application programs has become easier as functionalities that were traditionally performed by procedural application routines are now supported declaratively using database concepts such as constraints and rules. In the field of databases, the recent advances resulting from technological transformation include many areas such as the use of distributed database technology [OZS91, BEL92], object- oriented technology [ATK89, ZDO90], constraints [DAT83, GRE93], knowledge-based systems [MYL89, GUI94], 4GLs and CASE tools [COMP90, SCH95, SHA95]. Meanwhile, the older technology was dealing with files and primitive database systems which now appear inflexible, as the technology itself limits them from being adapted to meet the current changing business needs catalysed by newer technologies. The older systems which have been developed using 3GLs and in operation for many years, often suffer from failures, inappropriate functionality, lack of documentation, poor performance and are referred to as legacy information systems [BRO93, COMS94, IEE94, BRO95, IEEE95]. The current technology is much more flexible as it supports methods to evolve (e.g. 4GLs, CASE tools, GUI toolkits and reusable software libraries [HAR90, MEY94]), and can share resources through software that allows interoperability (e.g. ODBC [RIC94, GEI95]). This evolution
  • 5. reflects the changing business needs. However, modern systems need to be properly designed and implemented to benefit from this technology, which may still be unable to prevent such systems themselves being considered to be legacy information systems in the near future due to the advent of the next generation of technology with its own special features. The only salvation would appear to be building in evolution paths in the current systems. The increasing power of computers and their software has meant they have already taken over many day to day functions and are taking over more of these tasks as time passes. Thus computers are managing a larger volume of information in a more efficient manner. Over the years most enterprises have adopted the computerisation option to enable them to efficiently perform their business tasks and to be able to compete with their counterparts. As the performance ability of computers has increased, the enterprises still using early computer technology face serious problems due to the difficulties that are inherent in their legacy systems. This means that new enterprises using systems purely based on the latest technology have an advantage over those which need to continue to use legacy information systems (ISs), as modern ISs have been developed using current technology which provides not only better performance, but also utilises the benefits of improved functionality. Hence, managers of legacy IS enterprises want to retire their legacy code and use modern database management systems (DBMSs) in the latest environment to gain the full benefits from this newer technology. However they want to use this technology on the information and data they already hold as well as on data yet to be captured. They also want to ensure that any attempts to incorporate the modern technology will not adversely affect the ongoing functionality of their existing systems. This means legacy ISs need to be evolved and migrated to a modern environment in such a way that the migration is transparent to the current users. The theme of this thesis is how we can support this form of system evolution. 1.1.1 The Barriers to Legacy Information System Migration Legacy ISs are usually those systems that have stood the test of time and have become a core service component for a business’s information needs. These systems are a mix of hardware and software, sometimes proprietary, often out of date, and built to earlier styles of design, implementation and operation. Although they were productive and fulfilled their original performance criteria and their requirements, these systems lack the ability to change and evolve. The following can be seen as barriers to evolution in legacy IS [IEE94]. • The technology used to build and maintain the legacy IS is obsolete, • The system is unable to reflect changes in the business world and to support new needs, • The system cannot integrate with other sub-systems, • The cost, time and risk involved in producing new alternative systems to the legacy IS. The risk factor is that a new system may not provide the full functionality of the current system for a period because of teething problems. Due to these barriers, large organisations [PHI94] prefer to write independent sub-systems to perform new tasks using modern technology which will run alongside the existing systems, rather than attempt to achieve this by adapting existing code or by writing a new system that replaces the old and has new facilities as well. 
We see the following immediate advantages of this low risk approach. Page 4
  • 6. • The performance, reliability and functionality of the existing system is not affected, • New applications can take advantage of the latest technology, • There is no need to retrain those staff who only need the facilities of the old system. However with this approach, as business requirements evolve with time, more and more new needs arise, resulting in the development and regular use of many diverse systems within the same organisation. Hence, in the long term the above advantages are overshadowed by the more serious disadvantages of this approach, such as: • The existing systems continue to exist and are legacy IS running on older and older technology, • The need to maintain many different systems to perform similar tasks increases the maintenance and support costs of the organisation, • Data becomes duplicated in different systems which implies the maintenance of redundant data with its associated increased risk of inconsistency between the data copies if updating occurs, • The overall maintenance cost for hardware, software and support personnel increases as many platforms are being supported, • The performance of the integrated information functions of the organisation decreases due to the need to interface many disparate systems. To address the above issues, legacy ISs need to be evolved and migrated to new computing environments, when their owning organisation upgrades. This migration should occur within a reasonable time after the upgrade occurs. This means that it is necessary to migrate legacy ISs to new target environments in order to allow the organisation to dispose of the technology which is becoming obsolete. Managers of some enterprises have chosen an easy way to overcome this problem, by emulating [CAM89, PHI94] the current environment on the new platforms (e.g. AS/400 emulators for IBM S/360 and ICL’s DME emulators for 1900 and System 4 users). An alternative strategy is achieved by translating [SHA93, PHI94, SHE94, BRO95] the software to run in new environments (i.e. code-to-code level translation). The emulator approach perpetuates all the software deficiencies of the legacy ISs although successfully removing the old-fashioned hardware technology and so it does enjoy the increased processing power of the new hardware. The translation approach takes advantage of some of the modern technological benefits in the target environment as the conversions - such as IBM’s JCL and ICL’s SCL code to Unix shell scripts, Assembler to COBOL, COBOL to COBOL embedded with SQL, and COBOL data structures to relational DBMS tables - are also done as part of the translation process. This approach, although a step forward, still carries over most of the legacy code as legacy systems are not evolved by this process. For example, the basic design is not changed. Hence the barrier to change and/or integration to a common sub- system still remains, and the translated systems were not designed for the environment they are now running in, so they may not be compatible with it. There are other approaches to overcoming this problem which have been used by enterprises [SHA93, BRO95]. These include re-implementing systems under the new environment and/or upgrading existing systems to achieve performance improvements. As computer technology continues to evolve at an ever quicker pace the need to migrate arises more rapidly. This means, most small organisations and individuals are left behind and are forced to work in a technologically Page 5
obsolete environment, mainly due to the high cost of frequently migrating to newer systems and/or upgrading existing software, as this process involves time and manpower which cost money. The gap between the older and newer system users will very soon create a barrier to information sharing unless some tools are developed to assist the older technology users' migration to new technology environments. This assistance may take many forms, including tools for: analysing and understanding existing systems; enhancing and modifying existing systems; and migrating legacy ISs to newer platforms. The complete migration process for a legacy IS needs to consider these requirements and many other aspects, as recently identified by Brodie and Stonebraker in [BRO95].

Our work was primarily motivated by these business oriented legacy database issues and by work in the area of extending relational database technology to enable it to represent more knowledge about its stored data [COD79, STO86a, STO86b, WIK90]. This second consideration is an important aspect of legacy system migration, since if a graceful migration is to be achieved we must be able to enhance a legacy relational database with such knowledge to take full advantage of the new system environment.

1.1.2 Heterogeneous Distributed Environments

As well as the problem of having to use legacy ISs, most large enterprises are faced with the problem of heterogeneity and the need for interoperability between existing ISs [IMS91]. This arises due to the increased use of different computer systems and software tools for information processing within an organisation as time passes. The development of networking capabilities to manage and share information stored over a network has made interoperability a requirement, and the broad acceptance of local area networks in business enterprises has increased the need to perform this task within organisations. Network file servers, client-server technology and the use of distributed databases [OZS91, BEL92, KHO92] are results of these challenging innovations. This technology is currently being used to create and process information held in heterogeneous databases, which involves linking different databases in an interoperable environment. An aspect of this work is legacy database interoperation, since as time passes these databases will have been built using different generations of software.

In recent years, the demand for distributed database capabilities has been fuelled mostly by the decentralisation of business functions in large organisations to address customer needs, and by mergers and acquisitions that have taken place in the corporate world. As a consequence, there is a strong requirement among enterprises for the ability to cross-correlate data stored in different existing heterogeneous databases. This has led to the development of products referred to as gateways, to enable users to link different databases together, e.g. Microsoft's Open Database Connectivity (ODBC) drivers can link Access, FoxPro, Btrieve, dBASE and Paradox databases together [COL94, RIC94]. There are similar products from other database vendors, such as Oracle (with gateways for IBM's DB2, UNISYS's DMS and DEC RMS) [HOL93] and others (for INGRES, SYBASE, Informix and other popular SQL DBMSs) [PUR93, SME93, RIC94, BRO95]. Database vendors have targeted cross-platform compatibility via SQL access protocols to support interoperability in a heterogeneous environment. As heterogeneity in distributed systems may occur in various forms, ranging from
different hardware platforms, operating systems, networking protocols and local database systems, cross-platform compatibility via SQL provides only a simple form of heterogeneous distributed database access. The biggest challenge comes in addressing heterogeneity due to differences in the local databases themselves [OZS91, BEL92]. This challenge is also addressed in the design and development of our system.

Distributed DBMSs have become increasingly popular in organisations as they offer the ability to interconnect existing databases, as well as having many other advantages [OZS91, BEL92]. The interconnection of existing databases leads to two types of distributed DBMS, namely: homogeneous and heterogeneous distributed DBMSs. In homogeneous systems all of the constituent nodes run the same DBMS and the databases can be designed in harmony with each other. This simplifies both the processing of queries at different nodes and the passing of data between nodes. In heterogeneous systems the situation is more complex, as each node can be running a different DBMS and the constituent databases can be designed independently. This is the normal situation when we are linking legacy databases, as the DBMSs and the databases used are more likely to be heterogeneous, since they are usually implemented for different platforms during different technological eras. In such a distributed database environment, heterogeneity may occur in various forms, at different levels [OZS91, BEL92], namely:

• The logical level (i.e. involving different database designs),
• The data management level (i.e. involving different data models),
• The physical level (i.e. involving different hardware, operating systems and network protocols), and
• All three or any pair of these levels.

1.1.3 The Problems and Search for a Solution

The concept of heterogeneity itself is valuable as it allows designers a freedom of choice between different systems and design approaches, thus enabling them to identify those most suitable for different applications. The exploitation of this freedom over the years in many organisations has resulted in the creation of multiple local and remote information systems which now need to be made interoperable to provide an efficient and effective information service to the enterprise managers. Open Database Connectivity (ODBC) [RIC94, GEI95] and its standards have been proposed to support interoperability among databases managed by different DBMSs. Database vendors such as Oracle, INGRES, INFORMIX and Microsoft have already produced tools, engines and connectivity products to fulfil this task [HOL93, PUR93, SME93, COL94, RIC94, BRO95]. These products allow limited data transfer and query facilities among databases to support interoperability among heterogeneous DBMSs. These features, although they permit easy, transparent heterogeneous database access, still do not provide a solution for legacy ISs, where a primary concern is to evolve and migrate the system to a target environment so that obsolete support systems can be retired. Furthermore, the ODBC facilities are developed for current DBMSs and hence may not be capable of accessing older generation DBMSs, and, if they are, are unlikely to be able to enhance them to take advantage of the newer technologies. Hence there is a need to create tools that provide ODBC-equivalent functionality for older generation DBMSs. Our work provides such functionality for all the DBMSs we have chosen for this research.
It also provides the ability to enhance and evolve legacy databases.
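By way of illustration only, much of the meta-data needed for such access can be read from a relational DBMS with ordinary queries against its system catalogue. The dictionary view named below is Oracle's and is used purely as an example of the general idea; catalogue names and structures differ between products and versions, and this is a sketch rather than the method used by any particular tool.

    -- Illustration only: reading table and column meta-data from an
    -- Oracle-style data dictionary view; other DBMSs expose equivalent
    -- catalogues under different names.
    SELECT table_name, column_name, data_type, data_length, nullable
    FROM   user_tab_columns
    ORDER  BY table_name, column_id;

Queries of this kind return the catalogue as if it were an ordinary user table, which is the basis on which a tool can offer ODBC-equivalent access to DBMSs that predate such drivers.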
In order to evolve an information system, one needs to understand the existing system's structure and code. Most legacy information systems are not properly documented, and hence understanding such systems is a complex process. This means that changing any legacy code involves a high risk, as it could result in unexpected system behaviour. Therefore one needs to analyse and understand existing system code before performing any changes to the system.

Database system design and implementation tools have appeared recently which have the aim of helping new information system development. Reverse and re-engineering tools are also appearing in an attempt to address issues concerned with existing databases [SHA93, SCH95]. Some of these tools allow the examination of databases built using certain types of DBMSs; however, the enhancements they allow are confined to the limitations of that system. Due to continuous ongoing technology changes, most current commercial DBMSs do not support the most recent software modelling techniques and features (e.g. Oracle version 7 does not support Object-Oriented features). Hence a system built using current software tools is guaranteed to become a legacy system in the near future (i.e. when new products with newer techniques and features begin to appear in the commercial market place).

Reverse engineering tools [SHA93] are capable of recreating the conceptual model of an existing database and hence they are an ideal starting point when trying to gain a comprehensive understanding of the information held in the database and its current state, as they create a visual picture of that state. However, in legacy systems the schemas are basic, since most of the information used to compose a conceptual model is not available in these databases. Information such as constraints that show links between entities is usually embedded in the legacy application code, and users find it difficult to reverse engineer these legacy ISs. Our work addresses these issues while assisting in overcoming this barrier within the knowledge representation limitations of existing DBMSs.

1.1.4 Primary and Secondary Motivations

The research reported in this thesis was therefore primarily prompted by the need to provide, for a logically heterogeneous distributed database environment, a design tool that allows users not only to understand their existing systems but also to enhance and visualise an existing database's structure using new techniques that are either not yet present in existing systems or not supported by the existing software environment. It was also motivated by:

a) Its direct applicability in the business world, as the new technique can be applied to incrementally enhance existing systems and prepare them to be easily migrated to new target environments, hence avoiding continued use of legacy information systems in the organisation.

Although previous work and some design tools address the issue of legacy information system analysis, evolution and migration, these are mainly concerned with 3GL languages such as COBOL and C [COMS94, BRO95, IEEE95]. Little work has been reported which addresses the new issues that arise due to the Object-Oriented (O-O) data model or the extended relational data model [CAT94]. There are no reports yet of enhancing legacy systems so that they can migrate to O-O or extended relational environments in a graceful migration from a relational system. There has been
some work in the related areas of identifying extended entity relationship structures in relational schemas, and some attempts at reverse-engineering relational databases [MAR90, CHI94, PRE94].

b) The lack of previous research in visualising pre-existing heterogeneous database schemas and evolving them by enhancing them with modern concepts supported in more recent releases of software.

Most design tools [COMP90, SHA93] which have been developed to assist in Entity-Relationship (E-R) modelling [ELM94] and Object Modelling Technique (OMT) modelling [RUM91] are used in a top-down database design approach (i.e. forward engineering) to assist in developing new systems. However, relatively few tools attempt to support a bottom-up approach (i.e. reverse engineering) to allow visualisation of pre-existing database schemas as E-R or OMT diagrams. Among these tools only a very few allow enhancement of the pre-existing database schemas, i.e. they apply forward engineering to enhance a reverse-engineered schema. Even those which do permit this action to some extent always operate on a single database management system and work mostly with schemas originally designed using such systems (e.g. CASE tools). The tools that permit only the bottom-up approach are referred to as reverse-engineering tools, and those which support both (i.e. bottom-up and top-down) are called re-engineering tools [SHA93]. This thesis is primarily concerned with creating re-engineering tools that assist legacy database migration.

The commercially available re-engineering tools are customised for particular DBMSs and are not easily usable in a heterogeneous environment. This barrier against widespread usability of re-engineering tools means that a substantial adaptation and reprogramming effort (costing time and money) is involved every time a new DBMS appears in a heterogeneous environment. An obvious example that reflects this limitation arises in a heterogeneous distributed database environment where there may be a need to visualise each participant database's schema. In such an environment, if the heterogeneity occurs at the database management level (where each node uses a different DBMS, for example, one node uses INGRES [DAT87] and another uses Oracle [ROL92]), then we have to use two different re-engineering tools to display these schemas. This situation is exacerbated for each additional DBMS that is incorporated into the given heterogeneous context. Also, legacy databases are migrated to different DBMS environments as newer versions and better database products have appeared since the original release of their DBMS. This means that a re-engineering tool that assists legacy database migration must work in a heterogeneous environment so that its use will not be restricted to particular types of ISs.

Existing re-engineering tools provide a single target graphical data model (usually the E-R model or a variant of it), which may differ in presentation style between tools and therefore inhibits the uniformity of visualisation that is highly desirable in an interoperable heterogeneous distributed database environment. This limitation means that users may need to use different tools to achieve the required uniformity of display in such an environment. The ability to visualise the conceptual model of an information system using a user-preferred graphical data model is important as it ensures that no inaccurate enhancements are made to the system due to any misinterpretation of the graphical notations used.
c) The need to apply rules and constraints to pre-existing databases to identify and clean inconsistent legacy data, as preparation for migration or as an enhancement of the database's quality.
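As a hypothetical illustration of such a check (the table and column names below are invented for the example), a foreign key constraint proposed for a legacy database can be tested against the existing data before any attempt is made to enforce it:

    -- Hypothetical example: list employee rows whose department
    -- reference has no matching department row, i.e. data that would
    -- violate a proposed foreign key from employee to department.
    SELECT e.emp_no, e.dept_no
    FROM   employee e
    WHERE  e.dept_no IS NOT NULL
      AND  NOT EXISTS (SELECT 1
                       FROM   department d
                       WHERE  d.dept_no = e.dept_no);

Rows returned by such a query can then be cleaned, removed, or taken as evidence that the proposed constraint itself needs to be relaxed.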
The inability to define and apply rules and constraints in early database systems, due to system limitations, meant that constraints were not used to increase the accuracy and consistency of the data held by these systems. This limitation is now a barrier to information system migration, as a new target DBMS is unable to enforce constraints on a migrated database until all violations are investigated and resolved, either by omitting the violating data or by cleaning it. This investigation may also show that a constraint has to be adjusted because the violating data is needed by the organisation. The enhancement of such a system with rules and constraints provides knowledge that can be used to determine possible data violations. The process of detecting constraint violations may be carried out by applying queries that are generated from these enhanced constraints. Similar methods have been used to implement integrity constraints [STO75], optimise queries [OZS91] and obtain intensional answers [FON92, MOT89]. This is essential, as constraints may have been implemented at the application coding level, which can lead to their inconsistent application.

d) An awareness of the potential contribution that knowledge-based systems and meta-programming technologies, in association with extended relational database technology, have to offer in coping with semantic heterogeneity.

The successful production of a conceptual model is highly dependent on the semantic information available, and on the ability to reason about these semantics. A knowledge-based system can be used to assist in this task, as the process of generalising the effective exploitation of semantic information for pre-existing heterogeneous databases needs to undergo three sub-processes, namely: knowledge acquisition, representation and manipulation. The knowledge acquisition process extracts the existing knowledge from a database's data dictionaries. This knowledge may include subsequent enhancements made by the user, as the use of a database to store such knowledge will provide easy access to this information along with its original knowledge. The knowledge representation process represents existing and enhanced knowledge. The knowledge manipulation process is concerned with deriving new knowledge and ensuring consistency of existing knowledge. These stages are addressable using specific processes. For instance, the reverse-engineering process used to produce a conceptual model can be used to perform the knowledge acquisition task. Then the derived and enhanced knowledge can be stored in the same database by adopting a process that allows us to distinguish this knowledge from the original meta-data. Finally, knowledge manipulation can be done with the assistance of a Prolog-based system [GRA88], while data and knowledge consistency can be verified using the query language of the database.

1.2 Goals of the Research

The broad goals of the research reported in this thesis are highlighted here, with detailed aims and objectives presented in section 2.4. These goals are to investigate interoperability problems, schema enhancement and migration in a heterogeneous distributed database environment, with particular emphasis on extended relational systems.
This should provide a basis for the design and implementation of a prototype software system that brings together new techniques from the areas of knowledge-based systems, meta-programming and O-O conceptual data modelling with the aim of facilitating schema enhancement, by means of generalising the efficient representation of constraints using the current standards. Such a system is a tool that would be a valuable asset in a logically heterogeneous distributed extended relational database environment, as it would make it possible for
global users to incrementally enhance legacy information systems. This offers the potential for users in this type of environment to work in terms of such a global schema, through which they can prepare their legacy systems to easily migrate to target environments and so gain the benefits of modern computer technology.

1.3 Original Achievements of the Research

The importance of this research lies in establishing the feasibility of enhancing, cleaning and migrating heterogeneous legacy databases using meta-programming technology, knowledge-based system technology, database system technology and O-O conceptual data modelling concepts, to create a comprehensive set of techniques and methods that form an efficient and useful generalised database re-engineering tool for heterogeneous sets of databases. The benefits such a tool can bring are also demonstrated and assessed. A prototype Conceptual Constraint Visualisation and Enhancement System (CCVES) [WIK95a] has been developed as a result of the research. To be more specific, our work has made four important contributions to progress in the database topic area of Computer Science:

1) CCVES is the first system to bring the benefits of meta-programming technology to the very important application area of enhancing and evolving heterogeneous distributed legacy databases to assist the legacy database migration process [GRA94, WIK95c].

2) CCVES is also the first system to enhance existing databases with constraints to improve their visual presentation and hence provide a better understanding of existing applications [WIK95b]. This process is applicable to any relational database application, including those which are unable to naturally support the specification and enforcement of constraints. More importantly, this process does not affect the performance of an existing application.

3) As will be seen later, we have chosen the current SQL-3 standards [ISO94] as the basis for knowledge representation in our research. This project provides an extension to the representation of the relational data model to cope with automated reuse of knowledge in the re-engineering process. In order to cope with technological changes that result from the emergence of new systems or new versions of existing DBMSs, we also propose a series of extended relational system tables conforming to SQL-3 standards to enhance existing relational DBMSs [WIK95b].

4) The generation of queries using the constraint specifications of the enhanced legacy systems is an easy and convenient method of detecting any constraint-violating data in existing systems. The application of this technique in the context of a heterogeneous environment for legacy information systems is a significant step towards detecting and cleaning inconsistent data in legacy systems prior to their migration. This is essential if a graceful migration is to be effected [WIK95c].

1.4 Organisation of the Thesis
The thesis is organised into 8 chapters. This first chapter has given an introduction to the research done, covering background and motivations, and outlining original achievements. The rest of the thesis is organised as follows:

Chapter 2 is devoted to presenting an overview of the research together with detailed aims and objectives for the work undertaken. It begins by identifying the scope of the work in terms of research constraints and development technologies. This is followed by an overview of the research undertaken, where a step by step discussion of the approach adopted and its role in a heterogeneous distributed database environment is given. Finally, detailed aims and objectives are drawn together to conclude the chapter.

Chapter 3 identifies the relational data model as the current dominant database model and presents its development along with its terminology, features and query languages. This is followed by a discussion of conceptual data models with special emphasis on the data models and symbols used in our project. Finally, we pay attention to key concepts related to our project, mainly the notion of semantic integrity constraints and extensions to the relational model. Here, we present important integrity constraint extensions to the relational model and its support in different SQL standards.

Chapter 4 addresses the issue of legacy information system migration. The discussion commences with an introduction to legacy and our target information systems. This is followed by migration strategies and methods for such ISs. Finally, we conclude by referring to current techniques and identify the trends and existing tools applicable to database migration.

Chapter 5 addresses the re-engineering process for relational databases. Techniques currently used for this purpose are identified first. Our approach, which uses constraints to re-engineer a relational legacy database, is described next. This is followed by a process for detecting possible keys and structures of legacy databases. Our schema enhancement and knowledge representation techniques are then introduced. Finally, we present a process to detect and resolve conflicts that may occur due to schema enhancement.

Chapter 6 introduces some example test databases which were chosen to represent a legacy heterogeneous distributed database environment and its access processes. Initially, we present the design of our test databases, the selection of our test DBMSs and the prototype system environment. This is followed by the application of our re-engineering approach to our test databases. Finally, the organisation of relational meta-data and its access is described using our test DBMSs.

Chapter 7 presents the internal and external architecture and operation of our Conceptual Constraint Visualisation and Enhancement System (CCVES) in terms of the design, structure and operation of its interfaces, and its intermediate modelling system. The internal schema mappings, e.g. mapping from INGRES QUEL to SQL and vice-versa, and internal database migration processes are presented in detail here.

Chapter 8 provides an evaluation of CCVES, identifying its limitations and improvements that could be made to the system. A discussion of potential applications is presented. Finally, we conclude the
chapter by drawing conclusions about the research project as a whole.
CHAPTER 2

Research Scope, Approach, Aims and Objectives

This chapter describes, in some detail, the aims and objectives of the research that has been undertaken. Firstly, the boundaries of the research are defined in section 2.1, which considers the scope of the project. Secondly, an overview of the research approach we have adopted in dealing with heterogeneous distributed legacy database evolution and migration is given in section 2.2. Next, in section 2.3, the discussion is extended to the wider aspects of applying our approach in a heterogeneous distributed database environment using the existing meta-programming technology developed at Cardiff in other projects. Finally, the research aims and objectives are detailed in section 2.4, illustrating what we intend to achieve, and the benefits expected from achieving the stated aims.

2.1 Scope of the Project

We identify the scope of the work in terms of research constraints and the limitations of current development technologies. An overview of the problem is presented along with the drawbacks and limitations of database software development technology in addressing the problem. This will assist in identifying our interests and focussing the issues to be addressed.

2.1.1 Overview of the Problem

In most database designs, a conceptual design and modelling technique is used in developing the specifications at the user requirements and analysis stage of the design. This stage usually describes the real world in terms of object/entity types that are related to one another in various ways [BAT92, ELM94]. Such a technique is also used in reverse-engineering to portray the current information content of existing databases, as the original designs are usually either lost, or inappropriate because the database has evolved from its original design. The resulting pictorial representation of a database can be used for database maintenance, re-design, enhancement, integration or migration, as it gives its users a sound understanding of an existing database's architecture and contents.

Only a few current database tools [COMP90, BAT92, SHA93, SCH95] allow the capture and presentation of database definitions from an existing database, and the analysis and display of this information at a higher level of abstraction. Furthermore, these tools are either restricted to accessing a specific database management system's databases or permit modelling with only a single given display formalism, usually a variant of the EER [COMP90]. Consequently there is a need to cater for multiple database platforms with different user needs, to allow access to a set of databases comprising a heterogeneous database, by providing a facility to visualise databases using a preferred conceptual modelling technique which is familiar to the different user communities of the heterogeneous system.

The fundamental modelling constructs of current reverse and re-engineering tools are entities, relationships and associated attributes. These constructs are useful for database design at
a high level of abstraction. However, the semantic information now available in the form of rules and constraints in modern DBMSs provides their users with a better understanding of the underlying database, as its data conforms to these constraints. This may not necessarily be true for legacy systems, which may have constraints defined that were not enforced. The ability to visualise rules and constraints as part of the conceptual model increases user understanding of a database. Users could also exploit this information to formulate queries that more effectively utilise the information held in a database. With these features in mind, we concentrated on providing a tool that permits specification and visualisation of constraints as part of the graphical display of the conceptual model of a database. With modern technology increasing the number of legacy systems and with increasing awareness of the need to use legacy data [BRO95, IEEE95], the availability of such a visualisation tool will be more important in future, as it will let users see the full definition of the contents of their databases in a familiar format.

Three types of abstraction mechanism, namely classification, aggregation and generalisation, are used in conceptual design [ELM94]. However, most existing DBMSs do not maintain sufficient meta-data to assist in identifying all these abstraction mechanisms within their data models. This means that reverse and re-engineering tools are semi-automated, in that they extract information, but users have to guide them and decide what information to look for [WAT94]. This requires interaction with the database designer in order to obtain missing information and to resolve possible conflicts. Such additional information is supplied by the tool users when performing the reverse-engineering process. As this additional information is not retained in the database, it must be re-entered every time a reverse-engineering process is undertaken if the full representation is to be achieved. To overcome this problem, knowledge bases are being used to retain this information when it is supplied. However, this approach restricts the use of this knowledge by other tools which may exist in the database's environment. The ability to hold this knowledge in the database itself would enhance an existing database with information that can be widely used. This would be particularly useful in the context of legacy databases as it would enrich their semantics. One of the issues considered in this thesis is how this can be achieved.

Most existing relational database applications record only entities and their properties (i.e. attribute names and data types) as system meta-data. This is because these systems conformed to early database standards (e.g. the SQL/86 standard [ANSI86], supported by INGRES version 5 and Oracle version 5). However, more recent relational systems record additional information such as constraint and rule definitions, as they conform to the SQL/92 standard [ANSI92] (e.g. Oracle version 7). This additional information includes, for example, primary and foreign key specifications, and can be used to identify classification and aggregation abstractions used in a conceptual model [CHI94, PRE94, WIK95b]. However, the SQL/92 standard does not capture the full range of modelling abstractions, e.g. inheritance representing generalisation hierarchies.
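The difference can be sketched with a small, hypothetical example. A DBMS conforming to SQL/92 records declarations such as the following in its catalogue, whereas an SQL/86-era system would typically hold only the bare table and column definitions; the final table is not part of either standard and merely indicates the kind of information that has to be recorded separately:

    -- SQL/92-style declarations: the key, referential and check clauses
    -- become catalogue meta-data that a re-engineering tool can read
    -- back to recover classification and aggregation abstractions.
    CREATE TABLE department (
        dept_no INTEGER      NOT NULL PRIMARY KEY,
        dname   VARCHAR(30)  NOT NULL
    );

    CREATE TABLE employee (
        emp_no  INTEGER      NOT NULL PRIMARY KEY,
        name    VARCHAR(40)  NOT NULL,
        dept_no INTEGER      REFERENCES department (dept_no),
        salary  DECIMAL(8,2) CHECK (salary >= 0)
    );

    -- Hypothetical auxiliary table for knowledge that SQL/92 cannot
    -- express, e.g. that MANAGER is a subtype of EMPLOYEE.
    CREATE TABLE subtype_link (
        subtype   VARCHAR(32) NOT NULL,
        supertype VARCHAR(32) NOT NULL
    );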
This means that early relational database applications are now legacy systems, as they fail to naturally represent additional information such as constraint and rule definitions. Such legacy database systems are being migrated to modern database systems not only to gain the benefits of the current technology but also to be compatible with new applications built with the modern technology. The SQL standards are currently subject to review to permit the representation of extra knowledge (e.g. object-oriented features), and we have anticipated some of these proposals in our work - i.e. SQL-3 [ISO94] (which, during the life-time of this project, moved from a preliminary draft through several modifications before being finalised in 1995) will be adopted by commercial systems, and thus the current modern DBMSs
will become legacy databases in the near future, or may already be considered to be legacy databases in that their data model type will have to be mapped onto the newer version. Having experienced the development process of recent DBMSs, it is inevitable that most current databases will have to be migrated, either to a newer version of the existing DBMS or to a completely different newer technology DBMS, for a variety of reasons. Thus the migration of legacy databases is perceived to be a continuing requirement in any organisation as technology advances continue to be made.

Most migrations currently being undertaken are based on code-to-code level translations of the applications and associated databases to enable the older system to be functional in the target environment. Minimal structural changes are made to the original system and database, thus the design structures of these systems are still old-fashioned, although they are running in a modern computing environment. This means that such systems are inflexible and cannot be easily enhanced with new functions or integrated with other applications in their new environment. We have also observed that more recent database systems have often failed to benefit from modern database technology due to inherent design faults that have resulted in the use of unnormalised structures, which cause omission of the features enforcing integrity constraints even when this is possible. The ability to create and use databases without the benefit of a database design course is one reason for such design faults. Hence there is a need to assist existing systems to be evolved, not only to perform new tasks but also to improve their structure, so that these systems can maximise the gains they receive from their current technology environment and any environment they migrate to in the future.

2.1.2 Narrowing Down the Problem

Technological advances in both hardware and software have improved the performance and maintenance functionality of all information systems (ISs), and as a result older ISs suffer from comparatively poor performance and inappropriate functionality when compared with more modern systems. Most of these legacy systems are written in a 3GL such as COBOL, have been around for many years, and run on old-fashioned mainframes. Problems associated with legacy systems are being identified and various solutions are being developed [BRO93, SHE94, BRO95]. These systems basically have three functional components, namely: interface, application and database service, which are sometimes inter-related, depending on how they were used during the design and implementation stages of the IS development. This means that the complexity of a legacy IS depends on what occurred during the design and implementation of the system. These systems may range from a simple single-user database application using separate interfaces and applications, to a complex multi-purpose unstructured application.

Due to the complex nature of the problem area we do not address this issue as a whole, but focus only on problems associated with one sub-component of such legacy information systems, namely the database service. This in itself is a wide field, and we have further restricted ourselves to legacy ISs using a specific DBMS for their database service. We considered data models ranging from original flat file and relational systems, to modern relational DBMSs and object-oriented DBMSs.
From these data models we have chosen the traditional relational model for the following reasons:

• The relational model is currently the most widely used database model.
• During the last two decades the relational model has been the most popular model; therefore it has been used to develop many database applications and most of these are now legacy systems.
• There have been many extensions and variations of the relational model, which has resulted in many heterogeneous relational database systems being used in organisations.
• The relational model can be enhanced to represent additional semantics currently supported only by modern DBMSs (e.g. extended relational systems [ZDO90, CAT94]).

As most business requirements change with time, the need to enhance and migrate legacy information systems exists for almost every organisation. We address problems faced by these users while seeking a solution that prevents new systems becoming legacy systems in the near future. The selection of the relational model as our database service to demonstrate how one could achieve these needs means that we shall be addressing only relational legacy database systems and not looking at any other type of legacy information system. This decision means we are not considering the many common legacy IS migration problems identified by Brodie [BRO95] (e.g. migration of legacy database services such as flat-file structures or hierarchical databases into modern extended relational databases; migration of legacy applications with millions of lines of code written in some COBOL-like language into a modern 4GL/GUI environment). However, as shown later, addressing the problems associated with relational legacy databases has enabled us to identify and solve problems associated with more recent DBMSs, and it also assists in identifying precautions which, if implemented by designers of new systems, will minimise the chance of similar problems being faced by these systems as IS developments occur in the future.

2.2 Overview of the Research Approach

Having presented an overview of our problem and narrowed it down, we identify the following as the main functionalities that should be provided to fulfil our research goal:

• Reverse-engineering of a relational legacy database to fully portray its current information content.
• Enhancing a legacy database with new knowledge to identify modelling concepts that should be available to the database concerned or to applications using that database.
• Determining the extent to which the legacy database conforms to its existing and enhanced descriptions.
• Ensuring that the migrated IS will not become a legacy IS in the future.

We need to consider the heterogeneity issue in order to be able to reverse-engineer any given relational legacy database. Three levels of heterogeneity are present for a particular data model, namely: the physical, logical and data management levels. The physical level of heterogeneity usually arises due to different data model implementation techniques, use of different computer platforms and use of different DBMSs. The physical/logical data independence of DBMSs hides implementation differences from users, hence we need only address how to access databases that are built using different DBMSs, running on different computer platforms.
Differences in DBMS characteristics lead to heterogeneity at the logical level. Here, the different DBMSs conform to a particular standard (e.g. SQL/86 or SQL/92), which supports a particular database query language (e.g. SQL or QUEL) and different relational data model features (e.g. handling of integrity constraints and availability of object-oriented features). To tackle heterogeneity at the logical level, we need to be aware of different standards, and to model ISs supporting different features and query languages.

Heterogeneity at the data management level arises due to the physical limitations of a DBMS, differences in the logical design, and inconsistencies that occurred when populating the database. Logical differences in different database schemas have to be resolved only if we are going to integrate them. The schema integration process is concerned with merging different related database applications. Such a facility can assist the migration of heterogeneous database systems. However, any attempt to integrate legacy database schemas prior to the migration process complicates the entire process, as it is similar to attempting to provide new functionalities within the system which is being migrated. Such attempts increase the chance of failure of the overall migration process. Hence we consider any integration or enhancements in the form of new functionalities only after successfully migrating the original legacy IS. However, the physical limitations of a DBMS and data inconsistencies in the database need to be addressed beforehand to ensure a successful migration.

Our work addresses the heterogeneity issues associated with database migration by adopting an approach that allows its users to incrementally increase the number of DBMSs it can handle without having to reprogram its main application modules. Here, the user needs to supply specific knowledge about DBMS schema and query language constructs. This is held together with the knowledge of the DBMSs already supported and has no effect on the application's main processing modules.

2.2.1 Meta-Programming

Meta-programming technology allows the meta-data (schema information) of a database to be held and processed independently of its source specification language. This allows us to work in a database-language-independent environment and hence overcome many logical heterogeneity issues. Prolog-based meta-programming technology has been used in previous research at Cardiff in the area of logical heterogeneity [FID92, QUT94]. Using this technology the meta-translation of database query languages [HOW87] and database schemas [RAM91] has been performed. This work has shown how the heterogeneity issues of different DBMSs can be addressed without having to reprogram the same functionality for each and every DBMS.

We use meta-programming technology for our legacy database migration approach as we need to be able to start with a legacy source database and end with a modern target database, where the respective database schema and query languages may be different from each other. In this approach the source database schema or query language is mapped on input into an internal canonical form. All the required processing is then done using the information held in this internal form. This information is finally mapped to the target schema or query language to produce the desired output. The advantage of this approach is that processing is not affected by heterogeneity, as it is always performed on data held in the canonical form. This canonical form is an enriched collection of semantic data modelling features.
2.2.2 Application

We view our migration approach as consisting of a series of stages, with the final stage being the actual migration and earlier stages being preparatory. At stage 1, the data definition of the selected database is reverse-engineered to produce a graphical display (cf. paths A-1 and A-2 of figure 2.1). However, in legacy systems much of the information needed to present the database schema in this way is not available as part of the database meta-data, and hence these links, although present in the database, cannot be shown in this conceptual model. In modern systems such links can be identified using constraint specifications. Thus, if the database does not have any explicit constraints, or it does but these are incomplete, new knowledge about the database needs to be entered at stage 2 (cf. path B-1 of figure 2.1), which will then be reflected in the enhanced schema appearing in the graphical display (cf. path B-2 of figure 2.1). This enhancement will identify new links that should be present for the database concerned. These new database constraints can next be applied experimentally to the legacy database to determine the extent to which it conforms to them. This process is done at stage 3 (cf. paths C-1 and C-2 of figure 2.1). The user can then decide whether these constraints should be enforced to improve the quality of the legacy database prior to its migration. At this point the three preparatory stages in the application of our approach are complete. The actual migration process is then performed. All stages are further described below to enable us to identify the main processing components of our proposed system, as well as to explain how we deal with different levels of heterogeneity.

Stage 1: Reverse Engineering

In stage 1, the data definition of the selected database is reverse-engineered to produce a graphical display of the database. To perform this task, the database's meta-data must be extracted (cf. path A-1 of figure 2.1). This is achieved by connecting directly to the heterogeneous database. The accessed meta-data needs to be represented using our internal form. This is achieved through a schema mapping process as used in the SMTS (Schema Meta-Translation System) of Ramfos [RAM91]. The meta-data in our internal formalism then needs to be processed to derive the graphical constructs present for the database concerned (cf. path A-2 of figure 2.1). These constructs are in the form of entity types and relationships, and their derivation is the main processing component of stage 1. The identified graphical constructs are mapped to a display description language to produce a graphical display of the database.
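Where the source DBMS (or the augmented database) already records referential constraints, the relationship constructs can be derived by catalogue queries. The sketch below assumes an SQL/92-style information schema, which not all of the legacy DBMSs considered in this work provide; view and column names vary between real products:

    -- Sketch: derive one relationship per foreign key by pairing each
    -- referencing table with the table that owns the referenced key
    -- (SQL/92-style information schema assumed).
    SELECT tc.table_name AS referencing_table,
           uc.table_name AS referenced_table,
           rc.constraint_name
    FROM   information_schema.referential_constraints rc
           JOIN information_schema.table_constraints tc
             ON tc.constraint_name = rc.constraint_name
           JOIN information_schema.table_constraints uc
             ON uc.constraint_name = rc.unique_constraint_name;

Each pair returned becomes a relationship between the corresponding entity types in the graphical model; for legacy databases that lack this meta-data, the same information is obtained instead from the enhancements entered at stage 2.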
[Figure 2.1: Information flow in the 3 stages of our approach prior to migration - heterogeneous databases feed internal processing, which supports schema visualisation (EER or OMT) with constraints via paths A-1/A-2 (Stage 1: Reverse Engineering), B-1 to B-3 (Stage 2: Knowledge Augmentation) and C-1/C-2 (Stage 3: Constraint Enforcement).]

a) Database connectivity for heterogeneous database access

Unlike the previous Cardiff meta-translation systems [HOW87, RAM91, QUT92], which addressed heterogeneity at the logical and data management levels, our system looks at the physical level as well. While these previous systems processed schemas in textual form and did not access actual databases to extract their DDL specification, our system addresses physical heterogeneity by accessing databases running on different hardware/software platforms (e.g. computer systems, operating systems, DBMSs and network protocols). Our aim is to directly access the meta-data of a given database application by specifying its name, the name and version of the host DBMS, and the address of the host machine (we assume that access privileges for this host machine and DBMS have been granted). If this database access process can produce a description of the database in DDL formalism, then this textual file is used as the starting point for the meta-translation process, as in previous Cardiff systems [RAM91, QUT92]. We found that it is not essential to produce such a textual file, as the required intermediate representation can be directly produced by the database access process. This means that we could also by-pass the meta-translation step that analyses the DDL text to translate it into the intermediate representation (i.e. producing a list of tokens ready for syntactic analysis in the parsing phase, processed according to the BNF syntax specification of the DDL [QUT92]). However, the DDL formalism of the schema can be used for optional textual viewing and could also serve as the starting point for other tools developed at Cardiff for meta-programming database applications (e.g. the Schema Meta-Integration System (SMIS) of Qutaishat [QUT92]).

The initial functionality of the Stage 1 database connectivity process is to access a heterogeneous database and supply the accessed meta-data as input to our schema meta-translator
(SMTS). This module needs to deal with heterogeneity at the physical and data management levels. We achieve this by using DML commands of the specific DBMS to extract the required meta-data held in its data dictionaries, which are treated like user-defined tables. Relatively recently, the functionalities of a heterogeneous database access process have been provided by means of drivers such as ODBC [RIC94]. Use of such drivers allows access to any database supported by them and hence obviates the need to develop specialised tools for each database type, as happened in our case. These driver products were not available when we undertook this stage of our work.

b) Schema meta-translation

The schema meta-translation process [RAM91] accepts as input any database schema irrespective of its DDL and features. The information captured during this process is represented internally to enable it to be mapped from one database schema to another, or to be further processed and supplied to other modules such as the schema meta-visualisation system (SMVS) [QUT93] and the query meta-translation system (QMTS) [HOW87]. Thus, the use of an internal canonical form for meta representation has successfully accommodated heterogeneity at the data management and logical levels.

c) Schema meta-visualisation

Schema visualisation using graphical notation and diagrams has proved to be an important step in a number of applications, e.g. during the initial stages of the database design process, for database maintenance, re-design, enhancement, integration or migration, as it gives users a sound understanding of an existing database's structure in an easily assimilated format [BAT92, ELM94]. Database users need to see a visual picture of their database structure instead of textual descriptions of the defining schema, as it is easier for them to comprehend a picture. This has led to the production of graphical representations of schema information, effected by a reverse-engineering process. Graphical data models of schemas employ a set of data modelling concepts and a language-independent graphical notation (e.g. the Entity-Relationship (E-R) model [CHE76], the Extended/Enhanced Entity Relationship (EER) model [ELM94] or the Object Modelling Technique (OMT) [RUM91]). In a heterogeneous environment different users may prefer different graphical models, and may require an understanding of the database structure and architecture beyond that given by the traditional entities and their properties. Therefore, there is a need to produce graphical models of a database's schema using different graphical notations, such as E-R/EER or OMT, and to accompany them with additional information such as a display of the integrity constraints in force in the database [WIK95b]. The display of integrity constraints allows users to look at intra- and inter-object constraints and gain a better understanding of the domain restrictions applicable to particular entities. Current reverse-engineering tools do not support this type of display.

The generated graphical constructs are held internally in a similar form to the meta-data of the database schema. Hence, using a schema meta-visualisation process (SMVS), it is possible to map the internally held graphical constructs into appropriate graphical symbols and coordinates for the graphical display of the schema. This approach has a similarity to the SMTS, the main
difference being that the output is graphical rather than textual.

Stage 2: Knowledge Augmentation

In a heterogeneous distributed database environment, evolution is expected, especially in legacy databases. This evolution can affect the schema description, and in particular schema constraints that are not reflected in the stage 1 (path A-2) graphical display, as they may be implicit in applications. Thus our system is designed to accept new constraint specifications (cf. path B-1 of figure 2.1) and add them to the graphical display (cf. path B-2 of figure 2.1) so that these hidden constraints become explicit. The new knowledge accepted at this point is used to enhance the schema and is retained in the database using a database augmentation process (cf. path B-3 of figure 2.1). The new information is stored in a form that conforms with the enhanced target DBMS's methods of storing such information. This assists the subsequent migration stage.

a) Schema enhancement

Our system needs to permit a database schema to be enhanced by specifying new constraints applicable to the database. This process is performed via the graphical display. These constraints, which are in the form of integrity constraints (e.g. primary key, foreign key, check constraints) and structural components (e.g. inheritance hierarchies, entity modifications), are specified using a GUI. When they are entered they will appear in the graphical display.

b) Database augmentation

The input data to enhance a schema provides new knowledge about a database. It is essential to retain this knowledge within the database itself if it is to be readily available for any further processing. Typically, this information is retained in the knowledge base of the tool used to capture the input data, so that it can be reused by the same tool. This approach restricts the use of this knowledge by other tools, and hence it must be re-entered every time the re-engineering process is applied to that database. This makes it harder for the user to gain a consistent understanding of an application, as different constraints may be specified during two separate re-engineering processes. To overcome this problem, we augment the database itself using the techniques proposed in SQL-3 [ISO94], wherever possible. When it is not possible to use SQL-3 structures we store the information in our own augmented table format, which is a natural extension of the SQL-3 approach. When a database is augmented using this method, the new knowledge is available in the database itself. Hence, any further re-engineering processes need not make requests for the same additional knowledge. The augmented tables are created and maintained in a similar way to user-defined tables, but have a special identification to distinguish them. Their structure is in line with the international standards and the newer versions of commercial DBMSs, so that the enhanced database can be easily migrated either to a newer version of the host DBMS or to a different DBMS supporting the latest SQL standards. Migration should then mean that the newer system can enforce the constraints. Our approach should also mean that it is easy to map our tables for
holding this information into the representation used by the target DBMS even if it is different, as we are mapping from a well-defined structure.

Legacy databases that do not support explicit constraints can be enhanced by using the above knowledge augmentation method. This requirement is less likely to occur for databases managed by more recent DBMSs, as they already hold some constraint specification information in their system tables. The direction taken by Oracle version 6 was a step towards our augmentation approach, as it allowed the database administrator to specify integrity constraints such as primary and foreign keys, but did not yet enforce them [ROL92]. The next release of Oracle, i.e. version 7, implemented this constraint enforcement process.

Stage 3: Constraint Enforcement

The enhanced schema can be held in the database, but the DBMS can only enforce these constraints if it has the capability to do so. This will not normally be the case in legacy systems. In this situation, the new constraints may be enforced via a newer version of the DBMS or by migrating the database to another DBMS supporting constraint enforcement. However, the data being held in the database may not conform to the new constraints, and hence existing data may be rejected by the target DBMS in the migration, thus losing data and/or delaying the migration process. To address this problem and to assist the migration process, we provide an optional constraint enforcement process module which can be applied to a database before it is migrated. The objective of this process is to give users the facility to ensure that the database conforms to all the enhanced constraints before migration occurs. This process is optional so that the user can decide whether these constraints should be enforced to improve the quality of the legacy data prior to its migration, whether it is best left as it stands, or whether the new constraints are too severe. The constraint definitions in the augmented schema are employed to perform this task. As all constraints held have already been internally represented in the form of logical expressions, these can be used to produce data manipulation statements suitable for the host DBMS. Once these statements are produced, they are executed against the current database to identify the existence of data violating a constraint.

Stage 4: Migration Process

The migration process itself is performed incrementally, by initially creating the target database and then copying the legacy data over to it. The schema meta-translation (SMTS) technique of Ramfos [RAM91] is used to produce the target database schema. The legacy data can be copied using the import/export tools of the source and target DBMSs or DML statements of the respective DBMSs. During this process, the legacy applications must continue to function until they too are migrated. To achieve this, an interface can be used to capture and process all database queries of the legacy applications during migration. This interface can decide how to process database queries against the current state of the migration and re-direct those that now relate to the target database. The query meta-translation (QMTS) technique of Howells [HOW87] can be used to convert these queries to the target DML. This approach will facilitate transparent migration for legacy databases. Our work does not involve the development of an interface to capture and
  • 25. process all database queries, as interaction with the query interface of the legacy IS is embedded in the legacy application code. However, we demonstrate how to create and populate a legacy database schema in the desired target environment while showing the role of SMTS and QMTS in such a process. 2.3 The Role of CCVES in Context of Heterogeneous Distributed Databases Our approach described in section 2.2 is based on preparing a legacy database schema for graceful migration. This involves visualisation of database schemas with constraints and enhancing them with constraints to capture more knowledge. Hence we call our system the Conceptualised Constraint Visualisation and Enhancement System (CCVES). CCVES has been developed to fit in with the previously developed schema (SMTS) [RAM91] and query (QMTS) [HOW87] meta-translation systems, and the schema meta- visualisation system (SMVS) [QUT93]. This allows us to consider the complementary roles of CCVES, SMTS, QMTS and SMVS during Heterogeneous Distributed Database access in a uniform way [FID92, QUT94]. The combined set of tools achieves semantic coordination and promotes interoperability in a heterogeneous environment at logical, physical and data management levels. Figure 2.2 illustrates the architecture of CCVES in the context of heterogeneous distributed databases. It outlines in general terms the process of accessing a remote (legacy) database to perform various database tasks, such as querying, visualisation, enhancement, migration and integration. There are seven sub-processes: the schema mapping process [RAM91], query mapping process [HOW87], schema integration process [QUT92], schema visualisation process [QUT93], database connectivity process, database enhancement process and database migration process. The first two processes together have been called the Integrated Translation Support Environment [FID92], and the first four processes together have been called the Meta-Integration/Translation Support Environment [QUT92]. The last three processes were introduced as CCVES to perform database enhancement and migration in such an environment. The schema mapping process, referred to as SMTS, translates the definition of a source schema to a target schema definition (e.g. an INGRES schema to a POSTGRES schema). The query mapping process, referred to as QMTS, translates a source query to a target query (e.g. an SQL query to a QUEL query). The meta-integration process, referred to as SMIS, tackles heterogeneity at the logical level in a distributed environment containing multiple database schemas (e.g. Ontos and Exodus local schemas with a POSTGRES global schema) - it integrates the local schemas to create the global schema. The meta-visualisation process, referred to as SMVS, generates a graphical representation of a schema. The remaining three processes, namely: database connectivity, enhancement and migration with their associated processes, namely: SMVS, SMTS and QMTS, are the subject of the present thesis, as they together form CCVES (centre section of figure 2.2). The database connectivity process (DBC), queries meta-data from a remote database (route Page 24
  • 26. A-1 in figure 2.2) to supply meta-knowledge (route A-2 in figure 2.2) to the schema mapping process referred to as SMTS. SMTS translates this meta-knowledge to an internal representation which is based on SQL schema constructs. These SQL constructs are supplied to SMVS for further processing (route A-3 in figure 2.2) which results in the production of a graphical view of the schema (route A-4 in figure 2.2). Our reverse-engineering techniques [WIK95b] are applied to identify entity and relationship types to be used in the graphical model. Meta-knowledge enhancements are solicited at this point by the database enhancement process (DBE) (route B-1 in figure 2.2), which allows the definition of new constraints and changes to the existing schema. These enhancements are reflected in the graphical view (route B-2 and B-3 in figure 2.2) and may be used to augment the database (route B-4 to B-8 in figure 2.2). This approach to augmentation makes use of the query mapping process, referred to as QMTS, to generate the required queries to update the database via the DBC process. At this stage any existing or enhanced constraints may be applied to the database to determine the extent to which it conforms to the new enhancements. Carrying out this process will also ensure that legacy data will not be rejected by the target DBMS due to possible violations. Finally, the database migration process, referred to as DBMI, assists migration by incrementally migrating the database to the target environment (route C-1 to C-6 in figure 2.2). Target schema constructs for each migratable component are produced via SMTS, and DDL statements are issued to the target DBMS to create the new database schema. The data for these migrated tables are extracted by instructing the source DBMS to export the source data to the target database via QMTS. Here too, the queries which implement this export are issued to the DBMS via the DBC process. 2.4 Research Aims and Objectives Our relational database enhancement and augmentation approach is important in three respects, namely: 1) by holding the additional defining information in the database itself, this information is usable by any design tool in addition to assisting the full automation of any future re- engineering of the same database; 2) it allows better user understanding of database applications, as the associated constraints are shown in addition to the traditional entities and attributes at the conceptual level; Page 25
  • 27. 3) the process which assists a database administrator to clean inconsistent legacy data ensures a safe migration. To perform this latter task in a real world situation without an automated support tool is a very difficult, tedious, time consuming and error prone task. Therefore the main aim of this project has been the design and development of a tool to assist database enhancement and migration in a heterogeneous distributed relational database environment. Such a system is concerned with enhancing the constituent databases in this type of environment to exploit potential knowledge both to automate the re-engineering process and to assist in evolving and cleaning the legacy data to prevent data rejection, possible losses of data and/or delays in the migration process. To this end, the following detailed aims and objectives have been pursued in our research: 1. Investigation of the problems inherent in schema enhancement and migration for a heterogeneous distributed relational legacy database environment, in order to fully understand these processes. 2. Identification of the conceptual foundation on which to successfully base the design and development of a tool for this purpose. This foundation includes: • A framework to establish meta-data representation and manipulation. • A real world data modelling framework that facilitates the enhancement of existing working systems and which supports applications during migration. • A framework to retain the enhanced knowledge for future use which is in line with current international standards and techniques used in newer versions of relational DBMSs. • Exploiting existing databases in new ways, particularly linking them with data held in other legacy systems or more modern systems. • Displaying the structure of databases in a graphical form to make it easy for users to comprehend their contents. • The provision of an interactive graphical response when enhancements are made to a database. • A higher level of data abstraction for tasks associated with visualising the contents, relationships and behavioural properties of entities and constraints. • Determining the constraints on the information held and the extent to which the data conforms to these constraints. • Integrating with other tools to maximise the benefits of the new tool to the user community. 3. Development of a prototype tool to automate the re-engineering process and the migration assisting tasks as far as possible. The following development aims have been chosen for this system: • It should provide a realistic solution to the schema enhancement and migration assistance process. • It should be able to access and perform this task for legacy database systems. • It should be suitable for the data model at which it is targeted. • It should be as generic as possible so that it can be easily customised for other data models. • It should be able to retain the enhanced knowledge for future analysis by itself and other Page 26
  • 28. tools. • It should logically support a model using modern data modelling techniques irrespective of whether it is supported by the DBMS in use. • It should make extensive use of modern graphical user interface facilities for all graphical displays of the database schema. • Graphical displays should also be as generic as possible so that they can be easily enhanced or customised for other display methods. Page 27
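To make the augmentation and constraint enforcement stages of section 2.2 more concrete, the following sketch shows, in SQL, (a) one possible shape for an augmented table recording an enhanced constraint and (b) the kind of query the enforcement module could generate from it to locate violating legacy data. The table layout, names and columns are illustrative assumptions introduced only for exposition; they are not the exact augmented-table format used by CCVES.

    -- (a) A hypothetical augmented table recording constraints that the legacy
    --     DBMS cannot itself represent. The name prefix stands in for the
    --     "special identification" mentioned above (an assumption).
    CREATE TABLE ccves_constraints (
        constraint_name  CHAR(30)  NOT NULL,
        table_name       CHAR(30)  NOT NULL,
        constraint_type  CHAR(12)  NOT NULL,   -- e.g. 'PRIMARY KEY', 'FOREIGN KEY', 'CHECK'
        definition       CHAR(240)             -- SQL-style text of the constraint
    );

    INSERT INTO ccves_constraints
    VALUES ('emp_worksfor_fk', 'Employee', 'FOREIGN KEY',
            'FOREIGN KEY (WorksFor) REFERENCES Department (DeptCode)');

    -- (b) A violation-detection query that the enforcement module might generate
    --     from the stored definition before migration: it lists employees whose
    --     WorksFor value has no matching department.
    SELECT e.*
    FROM   Employee e
    WHERE  e.WorksFor IS NOT NULL
    AND    NOT EXISTS (SELECT 1
                       FROM   Department d
                       WHERE  d.DeptCode = e.WorksFor);

Rows returned by such a query represent legacy data that would be rejected by a target DBMS enforcing the constraint, and so can be cleaned, or the constraint relaxed, before migration proceeds.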
  • 29. CHAPTER 3 Database Technology, Relational Model, Conceptual Modelling and Integrity Constraints The origins and historical development of database technology are initially presented here to focus the evolution of ISs and the emergence of database models. The relational data model is identified as currently the most commonly used database model and some terminology for this data model, along with its features including query languages is then presented. A discussion of conceptual data models with special emphasis on EER and OMT is provided to introduce these data models and the symbols used in our project. Finally, we pay attention to crucial concepts relating to our work, namely the notion of semantic integrity constraints, with special emphasis on those used in semantic extensions to the relational model. The relational database language SQL is also discussed, identifying how and when it supports the implementation of these semantic integrity constraints. 3.1 Origins and Historical Developments The origin of data management goes back to the 1950’s and hence, this section is sub divided into two parts: the first part describes database technology prior to the relational data model, and the second part describes developments since. This division was chosen as the relational model is currently the most dominant database model for information management [DAT90]. 3.1.1 Database Technology Prior to the Relational Data Model Database technology emerged from the need to manipulate large collections of data for frequently used data queries and reports. The first major step in mechanisation of information systems came with the advent of punched card machines which worked sequentially on fixed-length fields [SEN73, SEN77]. With the appearance of stored program computers, tape-oriented systems were used to perform these tasks with an increase in user efficiency. These systems used sequential processing of files in batch mode, which was adequate until peripheral storage with random access capabilities (e.g. DASD) and time sharing operating systems with interactive processing appeared to support real-time processing in computer systems. Access methods such as direct and indexed sequential access methods (e.g. ISAM, VSAM) [BRA82, MCF91] were used to assist with the storage and location of physical records in stored files. Enhancements were made to procedural languages (e.g. COBOL) to define and manage application files, making the application program dependent on the organisation of the file. This technique caused data redundancy as several files were used in systems to hold the same data (e.g. emp_name and address in a payroll file; insured_name and address in an insurance file; and depositors_name and address in a bank file). These stored data files used in the applications of the 1960's are now referred to as conventional file systems, and they were maintained using third generation programming languages such as COBOL and PL/1. This evolution of mechanised information systems was influenced by the hardware and software developments which occurred in the 1950’s and early 1960’s. Most long existing legacy ISs are based on this technology. Our work does not address this type of IS as they do not use a DBMS for their data management. The evolution of databases and database management systems [CHA76, FRY76, SIB76,
  • 30. Chapter 3 Database Technology, Relational Model, Conceptual Modelling and Integrity Constraints SEN77, KIM79, MCG81, SEL87, DAT90, ELM94] was to a large extent the result of addressing the main deficiencies in the use of files, i.e. by reducing data redundancy and making application programs less dependent on file organisation. An important factor in this evolution was the development of data definition languages which allowed the description of a database to be separated from its application programs. This facility allowed the data definition (often called a schema) to be shared and integrated to provide a wide variety of information to the users. The repository of all data definitions (meta data) is called data dictionaries and their use allows data definitions to be shared and widely available to the user community. In the late 1960's applications began to share their data files using an integrated layer of stored data descriptions, making the first true database, e.g. the IMS hierarchical database [MCG77, DAT90]. This type of database was navigational in nature and applications explicitly followed the physical organisation of records in files to locate data using commands such as GNP - get next under parent. These databases provided centralised storage management, transaction management, recovery facilities in the event of failure and system maintained access paths. These were the typical characteristics of early DBMSs. Work on extending COBOL to handle databases was carried out in the late 60s and 70s. This resulted in the establishment of the DBTG (i.e. DataBase Task Group) of CODASYL and the formal introduction of the network model along with its data manipulation commands [DBTG71]. The relational model was proposed during the same period [COD70], followed by the 3 level ANSI/SPARC architecture [ANSI75] which made databases more independent of applications, and became a standard for the organisation of DBMSs. Three popular types of commercial database systems7 classified by their underlying data model emerged during the 70s [DAT90, ELM94], namely: • hierarchical • network • relational and these have been the dominant types of DBMS from the late 60s on into the 80s and 90s. 3.1.2 Database Technology Since the Relational Data Model At the same time as the relational data model appeared, database systems introduced another layer of data description on top of the navigational functionality of the early hierarchical and network models to bring extra logical data independence8. The relational model also introduced the use of non-procedural (i.e. declarative) languages such as SQL [CHA74]. By the early 1980's many relational database products, e.g. System R [AST76], DB2 [HAD84], INGRES [STO76] and Oracle were in use and due to their growing maturity in the mid 80s and the complexity of programming, navigating, and changing data structures in the older DBMS data models, the relational data model was able to take over the commercial database market with the result that it is now dominant. 7 Other types such as flat file, inverted file systems were also used. 8 This allows changes to the logical structure of data without changing the application programs. Page 29
  • 31. Chapter 3 Database Technology, Relational Model, Conceptual Modelling and Integrity Constraints The advent of inexpensive and reliable communication between computer systems, through the development of national and international networks, has brought further changes in the design of these systems. These developments led to the introduction of distributed databases, where a processor uses data at several locations and links it as though it were at a single site. This technology has led to distributed DBMSs and the need for interoperability among different database systems [OZS91, BEL92]. Several shortcomings of the relational model have been identified, including its inability to perform efficiently compute-intensive applications such as simulation, to cope with computer-aided design (CAD) and programming language environments, and to represent and manipulate effectively concepts such as [KIM90]: • Complex nested entities (e.g. design and engineering objects), • Unstructured data (e.g. images, textual documents), • Generalisation and aggregation within a data structure, • The notion of time and versioning of objects and schemas, • Long duration transactions. The notion of a conceptual schema for application-independent modelling introduced by the ANSI/SPARC architecture led to another data model, namely: the semantic model. One of the most successful semantic models is the entity-relationship (E-R) model [CHE76]. Its concepts include entities, relationships, value sets and attributes. These concepts are used in traditional database design as they are application-independent. Many modelling concepts based on variants/extensions to the E-R model have appeared since Chen’s paper. The enhanced/extended entity-relationship model (EER) [TEO86, ELM94], the entity-category-relationship model (ECR) [ELM85], and the Object Modelling Technique (OMT) [RUM91] are the most popular of these. The DAPLEX functional model [SHI81] and the Semantic Data Model [HAM81] are also semantic models. They capture a richer set of semantic relationships among real-world entities in a database than the E-R based models. Semantic relationships such as generalisation / specialisation between a superclass and its subclass, the aggregation relationship between a class and its attributes, the instance-of relationship between an instance and its class, the part-of relationship between objects forming a composite object, and the version-of relationship between abstracted versioned objects are semantic extensions supported in these models. The object-oriented data model with its notions of class hierarchy, class-composition hierarchy (for nested objects) and methods could be regarded as a subset of this type of semantic data model in terms of its modelling power, except for the fact that the semantic data model lacks the notion of methods [KIM90] which is an important aspect of the object-oriented model. The relational model of data and the relational query language have been extended [ROW87] to allow modelling and manipulation of additional semantic relationships and database facilities. These extensions include data abstraction, encapsulation, object identity, composite objects, class hierarchies, rules and procedures. 
However, these extended relational systems are still being evolved to fully incorporate features such as implementation of domain and extended data types, enforcement of primary and foreign key and referential integrity checking, prohibition of duplicate rows in tables and views, handling missing information by supporting four-valued predicate logic
  • 32. Chapter 3 Database Technology, Relational Model, Conceptual Modelling and Integrity Constraints (i.e. true, false, unknown, not applicable) and view updatability [KIV92], and they are not yet available as commercial products. The early 1990's saw the emergence of new database systems by a natural evolution of database technology, with many relational database systems being extended and other data models (e.g. the object-oriented model) appearing to satisfy more diverse application needs. This opened opportunities to use databases for a greater diversity of applications which had not been previously exploited as they were not perceived as tractable by a database approach (e.g. Image, medical, document management, engineering design and multi-media information, used in complex information processing applications such as office automation (OA), computer-aided design (CAD), computer-aided manufacturing (CAM) and hyper media [KIM90, ZDO90, CAT94]). The object- oriented (O-O) paradigm represents a sound basis for making progress in these areas and as a result two types of DBMS are beginning to dominate in the mid 90s [ZDO90], namely: the object-oriented DBMS, and the extended relational DBMS. There are two styles of O-O DBMS, depending on whether they have evolved from extensions to an O-O programming language or by evolving a database model. Extensions have been created for two database models, namely: the relational and the functional models. The extensions to existing relational DBMSs have resulted in the so-called Extended Relational DBMSs which have O-O features (e.g. POSTGRES and Starburst), while extensions to the functional model have produced PROBE and OODAPLEX. The approach of extending O-O programming language systems with database management features has resulted in many systems (e.g. Smalltalk into GemStone and ALLTALK, and C++ into many DBMSs including VBase / ONTOS, IRIS and O2). References to these systems with additional information and references can be found in [CAT94]. Research is currently taking place into other kinds of database such as active, deductive and expert database systems [DAT90]. This thesis focuses on the relational model and possible extensions to it which can represent semantics in existing relational database information systems in such a way that these systems can be viewed in new ways and easily prepared for migration to more modern database environments. 3.2 Relational Data Model In this section we introduce some of the commonly used terminology of the relational model. This is followed by a selective description of the features and query languages of this model. Further details of this data model can be found in most introductory database text books, e.g. [MCF91, ROB93, ELM94, DAT95]. A relation is represented as a table (entity) in which each row represents a tuple (record), the number of columns being the degree of the relation and the number of rows being its cardinality. An example of this representation is shown in figure 3.1, which shows a relation holding Student details, with degree 3 and cardinality 5. This table and each of its columns are named, so that a unique identity for a table column of a given schema is achieved via its table name and column name. The columns of a table are called attributes (fields) each having its own domain (data type) representing its pool of legal data. Basic types of domains are used (e.g. integer, real, character, text, date) to define the domains of attributes. 
Constraints may be enforced to further restrict the pool of legal
  • 33. Chapter 3 Database Technology, Relational Model, Conceptual Modelling and Integrity Constraints values for an attribute. Tables which actually hold data are called base tables to distinguish them from view tables which can be used for viewing data associated with one or more base tables. A view table can also be an abstraction from a single base table which is used to control access to parts of the data. A column or set of columns whose values uniquely identify a row of a relation is called a candidate key (key) of the relation. It is customary to designate one candidate key of a relation as a primary key (e.g. SNO in figure 3.1). The specification of keys restricts the possible values the key attribute(s) may hold (e.g. no duplicate values), and is a type of constraint enforceable on a relation. Additional constraints may be imposed on an attribute to further restrict its legal values. In such cases, there should be a common set of legal values satisfying all the constraints of that attribute, ensuring its ability to accept some data. For example, a pattern constraint which ensures that the first character of SNO is ‘S’ further restricts the possible values of SNO - see figure 3.1. Many other concepts and constraints are associated with the relational model although most of them are not supported by early relational systems as, indeed, some of the more recent relational systems (e.g. a value set constraint for the Address field as shown in figure 3.1). Domain (type character) Value Set Constraint Pattern Constraint (all values begin with 'S') Primary Key (unique values) Student SNO Name Address Cardinality S1 Jones Cardiff S2 Smith Bristol : Relation Tuples S3 Gray Swansea S4 Brown Cardiff : S5 Jones Newport Attributes Degree Figure 3.1: The Student relation 3.2.1 Requisite Features of the Relational Model During the early stages of the development of relational database systems there were many requisite features identified which a comprehensive relational system should have [KIM79, DAT90]. We shall now examine these features to illustrate the kind of features expected from early relational database management systems. They included support for: • Recovery from both soft and hard crashes, • A report generator for formatted display of the results of queries, • An efficient optimiser to meet the response-time requirements of users, • User views of the stored database, Page 32
  • 34. Chapter 3 Database Technology, Relational Model, Conceptual Modelling and Integrity Constraints • A non-procedural language for query, data manipulation / definition / control, • Concurrency control to allow sharing of a database by multiple users and applications, • Transaction management, • Integrity control during data manipulation, • Selective access control to prevent one user’s database being accessed by unauthorised users, • Efficient file structures to store the database, and • Efficient access paths to the stored data. Many early relational DBMSs originated at universities and research institutes, and none of them were able to provide all the above features. These systems mainly focussed on optimising techniques for query processing and recovery from soft and hard crashes, and did not pay much attention to the other features. Few of these database systems were commercially available, and for those that were the marketing was based on specific features such as report generation (e.g. MAGNUM), and views with selective access control (e.g. QBE). The early commercial systems did not support the full range of features either. Since the mid 1980’s many database products have appeared which aim to provide most of the above features. The enforcement of features such as concurrency control was embodied in these systems, while features such as views, access and integrity control were provided via non-procedural language commands. Systems which were unable to provide these features via a non-procedural language offered procedural extensions (e.g. C with embedded SQL) to perform such tasks. This resulted in the use of two types of data manipulation languages, i.e. procedural and non-procedural, to perform database system functions. In procedural languages a sequence of statements is issued to specify the navigation path needed in the database to retrieve the required data, thus they are navigational languages. This approach was used by all hierarchical and network database systems and by some relational systems. However, most relational systems offer a non-procedural (i.e. non- navigational) language. This allows retrieval of the required data by using a single retrieval expression, which in general has a degree of complexity corresponding to the complexity of the retrieval (e.g. SQL). 3.2.2 Query Language for the Relational Model Querying or the retrieval of information from a database is perhaps the aspect of relational languages which has received the most attention. A variety of approaches to querying has emerged, based on relational calculus, relational algebra, mapping-oriented languages and graphic-oriented languages [CHA76, DAT90]. During the first decade of relational DBMSs, there were many experimental implementations of relational systems in universities and industry, particularly at IBM. The initial projects were aimed at proving the feasibility of relational database systems supporting high-level non-procedural retrieval languages. The Structured Query Language (SQL9) [AST75] emerged from an IBM research project. Later projects created more comprehensive relational DBMSs and among the more important of these systems were probably the System R project at IBM [AST76] and the INGRES project (with its QUEL query language) at the University of California at Berkeley [STO76]. 9 Initially called SEQUEL, and later pronounced as SEQUEL. Page 33
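As an illustration of the non-procedural style just described, the Student relation of figure 3.1 and a typical single-expression retrieval could be written in SQL roughly as follows. The data types and the membership list of the value set constraint are assumptions made for illustration; the pattern and value-set restrictions use SQL-92 style CHECK clauses, which early relational systems could not express declaratively.

    CREATE TABLE Student (
        SNO     CHAR(3)  NOT NULL PRIMARY KEY,   -- key constraint: unique, non-null values
        Name    CHAR(20),
        Address CHAR(20) CHECK (Address IN ('Cardiff', 'Bristol', 'Swansea', 'Newport')),  -- value set constraint
        CHECK (SNO LIKE 'S%')                    -- pattern constraint: values begin with 'S'
    );

    -- A single declarative retrieval expression: no navigation path is specified.
    SELECT Name
    FROM   Student
    WHERE  Address = 'Cardiff';

In a navigational system the same retrieval would require an explicit loop over stored records following predefined access paths; here the DBMS is left to choose the access strategy.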
  • 35. Chapter 3 Database Technology, Relational Model, Conceptual Modelling and Integrity Constraints Standards for relational query languages were introduced [ANSI86] so that a common language could be used to retrieve information from a database. SQL became the standard query language for relational databases. These standards were reviewed regularly [ANSI89a, ANSI92] and are being reviewed [ISO94] to incorporate technological changes that meet modern database requirements. Hence, the standard query language SQL is evolving, and although some of the recent database systems conform to [ANSI92] standards they will have to be upgraded to incorporate even more recent advances such as the object-oriented paradigm additions to the language [ISO94]. This means that different database system query languages conform to different standards, and provide different features and facilities to their users even though they are of the same type. Hence, information systems developed during different eras will have used different techniques to perform the same task, with early systems being more procedural in their approach than more recent ones. Query languages, including SQL, have three categories of statements, i.e. the data manipulation language (DML) statements to perform all retrievals, updates, insertions and deletions, the data definition language (DDL) statements to define the schema and its behavioural functions such as rules and constraints, and the data control language (DCL) statements to specify access control which is concerned with the privileges to be given to database users. 3.3 Conceptual Modelling The conceptual model is a high level representation of a data model, providing an identification and description of the main data objects (avoiding details of their implementation). This model is hardware and software independent, and is represented using a set of graphical symbols in a diagrammatic form. As noted in part ‘c’ of stage 1 of section 2.2.2, different users may prefer different graphical models and hence it is useful to provide them with a choice of models. We consider two types of conceptual model in this thesis, namely: the enhanced entity-relationship model (EER), which is based on the popular entity-relationship model, and the object-modelling technique (OMT), which uses the more recent concepts of object-oriented modelling as opposed to the entities of the E-R model. These were chosen as they are among the currently most widely used conceptual modelling approaches and they allow representation of modelling concepts such as generalisation hierarchies. 3.3.1 Enhanced Entity-Relationship Model (EER) The entity-relationship approach is considered to be the first useful proposal [CHE76] on the subject of conceptual modelling. It is concerned with creating an entity model which represents a high-level conceptual data model of the proposed database, i.e. it is an abstract description of the structure of the entities in the application domain, including their identity, relationship to other entities and attributes, without regard for eventual implementation details. Thus an E-R diagram describes entities and their relationships using distinctive symbols, e.g. an entity is a rectangle and a relationship is a diamond. Distinctive symbols for recent modelling concepts such as generalisation, aggregation and complex structures have been introduced into these models by practitioners. Despite its popularity, no standard has emerged or been defined for this model. 
Hence different authors use different notations to represent the same concept. Therefore we have to define our symbols for these concepts: we have based our definitions on [ROB93] and Page 34
  • 36. Chapter 3 Database Technology, Relational Model, Conceptual Modelling and Integrity Constraints [ELM94]. a) Entity An entity in the E-R model corresponds to a table in the relational environment and is represented by a rectangle containing the entity name, e.g. the entity Employee of figure 3.2. b) Attributes Attributes are represented by a circle that is connected to the corresponding entity by a line. Each attribute has a name located near the circle10, e.g. attributes EmpNo, Name and Address of the Employee entity in figure 3.2. Key attributes of a relation are indicated using a colour to fill in the circle (red on the computer screen or shaded dark in this thesis) (e.g. the attribute EmpNo of Employee in figure 3.2). Attributes usually have a single value in an entity occurrence although multivalued attributes can occur and other types such as derived attributes can be represented in the conceptual model (see appendix B for a comprehensive list of the symbols used in EER models in this thesis). c) Relationships A relationship is an association between entities. Each relationship is named and represented by a diamond-shaped symbol. Three types of relationships (one-to-many or 1:M, many-to-many or M:N, and one-to-one or 1:1) are used to describe the association between entities. Here 1 means that an instance of this entity relates to only one instance of the other entity (e.g. an employee works for only one department), and M or N means that an instance of an entity may relate to more than one instance of the other entity (e.g. a department can have many employees working for it - see figure 3.2), through this relationship (the same entities can be linked in more than one relationship). The relationship type is determined by the participating entities and their associated properties. In the relational model a separate entity is used for M:N relationship types (e.g. a composite entity as in the case of the entity ComMem of figure 3.2), and the other relationship types (i.e. 1:1 and 1:M) are represented by repeated attributes (e.g. relationship WorksFor of figure 3.2 is established from the attribute WorksFor of the entity Employee). 10 We do not place the attribute name inside the circle to avoid the use of large circles or ovals in our diagrams. Page 35
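In relational terms, the mapping just described might look as follows: the 1:M WorksFor relationship becomes a repeated attribute of Employee, while the M:N membership of employees on committees becomes the composite entity ComMem, whose primary key combines the keys of the entities it connects. The column types are assumptions, and for illustration a committee is assumed to be identified by its Title; figure 3.2, shown next, gives the corresponding EER diagram.

    CREATE TABLE Employee (
        EmpNo    CHAR(5)  NOT NULL PRIMARY KEY,
        Name     CHAR(20),
        Address  CHAR(30),
        WorksFor CHAR(5)               -- repeated attribute implementing the 1:M WorksFor relationship
    );

    CREATE TABLE ComMem (              -- composite entity for the M:N relationship
        EmpNo      CHAR(5)  NOT NULL,
        Title      CHAR(20) NOT NULL,  -- key of Committee (an assumption for this sketch)
        YearJoined INTEGER,
        PRIMARY KEY (EmpNo, Title)     -- key built from the keys of the connected entities
    );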
  • 37. Chapter 3 Database Technology, Relational Model, Conceptual Modelling and Integrity Constraints (Weak Entity) (Weak Relationship) (1,1) (1,N) Title Committee Fcom Faculty (1,N) (Composite Entity) ComMem YearJoined Office d (Generalised Entity) (1,N) (1,1) WorksFor (4,N) (Entity) Employee N (Relationships) 1 Department (0,1) (1,1) (Key) Head (Specialised Entity) EmpNo Address of Office Name (Attributes) Figure 3.2: EER diagram for part of the University Database A relationship’s degree indicates the number of associated entities (or participants) there are in the relationship. Relationships with 1, 2 and 3 participants are called unary, binary and ternary relationships, respectively. In practice most relationships are binary (e.g. relationship WorksFor in figure 3.2) and relationships of higher order (e.g. four) occur very rarely as they are usually simplified to a series of binary relationships. The term connectivity is used to describe the relationship classification and it is represented in the diagram by using 1 or N near the related entity (see for example, the WorksFor relationship in figure 3.2). Alternatively, a more detailed description of the relationship is specified using cardinality, which expresses the specific number of entity occurrences associated with one occurrence of the related entity. The actual number of occurrences depends on the organisation’s policy and hence, can differ from that of another organisation, although both may model the same information. The cardinality has upper and lower limits indicating a range and is represented in the diagram within brackets near the related entity (see the WorksFor relationship in figure 3.211). Cardinality is a type of constraint and in appendix B.2 we provide more details about the symbols and notations used to represent these types of constraints. Thus in the WorksFor relationship: (1,1) indicates an employee must belong to a department (4,N) indicates a department must have at least 4 employees N indicates a department has many employees 1 indicates an employee may work for only one department d) Other Relationship and Entity Types The original E-R model of Chen did not contain relationship attributes and did not use the concept of a composite entity. We use this concept as in [ROB93], because the relational model requires the use of an entity composed of the primary keys of other entities to connect and represent M:N relationships. Hence, a composite entity (also called a link [RUM91] or regular [CHI94] entity) 11 In practise in a diagram only one of these types is shown depending on availability of information on cardinality limits. Page 36
  • 38. Chapter 3 Database Technology, Relational Model, Conceptual Modelling and Integrity Constraints representing an M:N relationship is represented using a diamond inside the rectangle, indicating that it as an entity as well as a relationship (e.g. ComMem of figure 3.2). In this type of relationship, the primary key of the composite entity is created by using the keys of the entities which it connects. This is usually a binary or 2-ary relationship involving two referenced entities and is a special case of n-ary relationship which connects with n entities. Some entity occurrences cannot exist without an entity occurrence with which they have a relationship. Such entities are called weak entities and are represented by a double rectangle (e.g. Committee in figure 3.2). The relationship formed with this entity type is called a weak relationship and is represented by a double diamond (e.g. Fcom relationship of figure 3.2). In this type of relationship, the primary key of the weak entity is a proper subset of the key of the entity which it depends on and the remaining attributes (called dangling attributes) of the primary key do not contain a key of any other entity. When a relationship exists between occurrences of the same entity set (e.g. a unary relationship) it forms a recursive relationship (e.g. a course may have pre-requisites courses). e) Generalisation / Specialisation / Inheritance Most organisations employ people with a wide range of skills and special qualifications (e.g. a university employs academics, secretaries, porters, research associates, etc.) and it may be necessary to record additional information for certain types of employee (e.g. qualifications of academics). Representing such additional information in the employee table results in the use of null values in this attribute for other employees as this additional information is not applicable for these employees. To overcome this, common characteristics for all employees are chosen to define the employee entity as a generalised entity, and the additional information is put in a separate entity, called a specialised entity, which inherits all the properties of its parent entity (i.e. the generalised entity), creating a parent-child or is-a relationship (also called a generalised hierarchy). The higher level of this relationship is a supertype entity (i.e. generalised entity) and the lower-level is a subtype entity (i.e. specialised entity). A supertype entity set is usually composed of several unique and disjoint (non-overlapping) subtype entity sets. However some supertypes contain overlapping subtypes (e.g. an employee may also be a student and hence we get two subtypes of person in an overlapping relationship). There are constraints applicable for generalised hierarchies and special symbols / notations are used in these cases (see appendixes B.1 figure ‘e’ and B.2 figure ‘b’). In figure 3.2, the entities Office, Department and Faculty form a generalised hierarchy with Office being the Supertype entity and Department and Faculty being the subtype entities. Subtype and supertype entities have a 1:1 relationship although we view it differently, i.e. as a hierarchy. The subtypes described above inherit from a single supertype entity. However, there may be cases where a subtype inherits from multiple supertypes (e.g. an empstudent entity representing employees who are also students may inherit from employee and student entities). This is known as multiple inheritance. 
In such cases the subtype may represent either an intersection or a union. The concept of inheritance was taken from the O-O paradigm and hence it does not occur in the original E-R model, but is included in the EER model. 3.3.2 Object Modelling Technique (OMT)
  • 39. Chapter 3 Database Technology, Relational Model, Conceptual Modelling and Integrity Constraints The Object Modelling Technique (OMT) is an O-O development methodology. It creates a high-level conceptual data model of a proposed system without regard for eventual implementation details. This model is based on objects. The notations of OMT used here are taken from [RUM91] and those used in our work are described in appendix B, where they are compared with their EER equivalents. Hence we do not describe this model in depth here. The diagrams produced by this method are known as object diagrams. They combine O-O concepts (i.e. classes and inheritance) with information modelling concepts (i.e. entities and relationships). Although the terminology used differs from that used in the EER model, both create similar conceptual models, although using different graphical notations. The main notations used in OMT are rectangles with text inside (e.g. for classes and their properties, as opposed to the EER where attributes appear outside the entity). This makes OMT easier to implement than EER in a graphical computing environment. OMT is used for most O-O modelling (e.g. in [COO93, IDS94]), and so it is a widely known technique. 3.4 Semantic Integrity Constraints A real world application is always governed by many rules which define the application domain and are referred to as integrity constraints [DAT83, ELM94]. An important activity when designing a database application is to identify and specify these integrity constraints for that database and if possible to enforce them using the DBMS constraint facilities. The term integrity refers to the accuracy, correctness or validity of a database. The role of the integrity constraint enforcer is to ensure that the data in the database is accurate by guarding it against invalid updates, which may be caused by errors in data entry, mistakes on the part of the operator or the application programmer, system failures, and even due to deliberate falsification by users. This latter case is the concern of the security system which protects the database from unauthorised access (i.e. it implements authorisation constraints). The integrity system uses integrity rules (integrity constraints) to protect the database from invalid updates supplied by authorised users and to maintain the logical consistency of the database. Integrity is sometimes used to cover both semantic and transaction integrity. The latter case deals with concurrency control (i.e. the prevention of inconsistencies caused by concurrent access by multiple users or applications to a database), and recovery techniques which prevent errors due to malfunctioning of system hardware and software. Protection against this type of integrity-violation is dealt with by most commercially available systems and is not an issue of this thesis. Here we shall use the terms integrity and constraints to refer only to semantic integrity constraints. Integrity rules cannot detect all types of errors, for instance when dealing with percentage marks, there is no way that the computer can detect the fact that an input value of 45 for a student mark should really be 54. However, on the other hand, a value of 455 could be detected and corrected. Consistency is another term used for integrity. However, this is normally used in cases where two or more values in the database are required to be in agreement with each other in some way. 
For example, the DeptNo in an Employee record should tally with the DeptNo appearing in some Department record (referential integrity in relational systems), or the Age of a Person must be
  • 40. Chapter 3 Database Technology, Relational Model, Conceptual Modelling and Integrity Constraints equal to the difference in years between today’s date and their date of birth (a property of a derived attribute). In order to check for invalid data, DBMSs use an integrity subsystem to monitor transactions and detect integrity violations. In the event of a violation the DBMS takes appropriate actions such as rejecting the operation, reporting the violation, or assisting in correcting the error. To perform such a task, the integrity subsystem must be provided with a set of rules that define what errors to check for, when to do the checking, and what to do if an error is detected. Most early DBMSs did not have an integrity subsystem (mainly due to unacceptable database system performance when integrity checking was performed in older technological environments) and hence such checking was not implemented in their information systems. These information systems performed integrity checking using procedural language extensions of the database to check for invalid entries during the capture of data via their user interface (e.g. data entry forms). Here too, due to technological limitations and poor database performance, only specific types of constraints (e.g. range check, pattern matching), and a limited number of checks were allowed for an attribute. As these rules were coded in application programs they violated program / data (rule) independence for constraint specification. However, most recent DBMSs attempt to support such specifications using their DDL and hence they achieve program / rule independence. The original SQL standard specifications [ANSI86] were subsequently enhanced so that constraints could be specified using SQL [ANSI89a]. Current commercial DBMSs are seeking to meet these standards by targeting the implementation of the SQL-2 standards [ANSI92] in their latest releases. Systems such as Oracle now conform to these standards, while others such as INGRES and POSTGRES have taken a different path by extending their systems with a rule subsystem, which performs similar tasks but using a procedural style approach where the rules and procedures are retained in data dictionaries. Integrity constraints can be identified for the properties of a data model and for the values of a database application. We examine both to present a detailed description of the types of constraint associated with databases and in particular those used for our work. 3.4.1 Integrity Constraints of a Data Model Some constraints are automatically supported by the data model itself. These constraints are assumed to hold by the definition of that data model (i.e. they are built into the system and not specified by a user). They are called the inherent constraints of the data model. There are also constraints that can be specified and represented in a data model. These are called the implicit constraints of the model and they are specified using DDL statements in a relational schema, or graphical constructs in an E-R model. Table 3.1 gives some examples of implicit and inherent constraints for relational and EER data models. The constraint types used in this table are described in detail in section 3.5. The structure of a data model represents inherent constraints implicitly and is also capable of representing implicit constraints. Hence, constraints represented in these two ways are referred to as structural constraints. Data models differ in the way constraints are handled. 
Hierarchical and network database constraints are handled by being tightly linked to structural concepts (records, sets,
segment definitions), of which the parent-child and owner-member relationships are logical examples. The classical relational model, on the other hand, has two constraints represented structurally by its relations or tables (namely, relations consist of a certain number of simple attributes and have no duplicate rows). Hence only specific types of structural constraint are defined for a particular data model (e.g. parent-child relationships are not defined for the relational model).

EER data model
  Implicit constraints: primary key attributes; attribute structural constraints; relationship structural constraints; superclass/subclass relationship; disjointness/totality constraints on specialisation/generalisation.
  Inherent constraints: every relationship instance of an n-ary relationship type R relates exactly one entity from each entity type participating in R in a specific role; every entity instance in a subclass must also exist in its superclass.

Relational data model
  Implicit constraints: domain constraints; key constraints; relationship structural constraints.
  Inherent constraints: a relation consists of a certain number of simple attributes; an attribute value is atomic; no duplicate tuples are allowed.

Table 3.1: Structural constraints of selected data models

Every data model has a set of concepts and rules (or assertions) that specify the structure and the implicit constraints of a database describing a miniworld. A given implementation of a data model by a particular DBMS will usually support only some of the structural (inherent and implicit) constraints of the data model automatically, and the rest must be defined explicitly. These additional constraints of a data model are called explicit or behavioural constraints. They are defined using either a procedural or a declarative (non-procedural) approach, which is basically not part of the data model per se.

3.4.2 Integrity Constraints of a Database Application

In database applications, integrity constraints are used to ensure the correctness of a database. A change to a database application takes place during an update transaction, and constraints are used at this stage to ensure that the database is in a consistent state before and after that transaction. This type of constraint is called a state (static) constraint, as it applies to a particular state of the database and should hold for every state in which the database is not in transition, i.e. not in the process of being updated. Constraints that apply to the change of a database from one state to another are called transition (dynamic) constraints (e.g. the age of a person can only be increased, meaning that the new value of age is greater than the old value). In general, transition constraints occur less frequently than state constraints and are usually specified explicitly. The discussion above classifies the types of semantic integrity constraints used in data models and database applications. We summarise them in figure 3.3 to highlight the basic classification of integrity constraints. We separate the two approaches using a dotted line as they are independent of each other. However, most constraints are common to both categories as they are implemented using a particular data model for a database application. Data models used for conceptual modelling are more concerned with structural constraints, as opposed to the value constraints of database applications.
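As a concrete illustration of the state/transition distinction just drawn, the following sketch shows, in standard SQL, a state constraint declared on the database and one simple way a transition constraint might be enforced explicitly. The table and column names (Person, Age, PersonNo) are illustrative assumptions; the age-only-increases rule is the transition constraint mentioned above.

    -- State (static) constraint: must hold in every non-transitional state.
    -- Declared once, checked by the DBMS on every update (SQL-92 style).
    ALTER TABLE Person
        ADD CONSTRAINT adult_age CHECK (Age >= 18);

    -- Transition (dynamic) constraint: a person's age may only increase.
    -- Few DBMSs can declare this, so it is usually enforced explicitly,
    -- e.g. by guarding the update so it has no effect if it would lower Age.
    UPDATE Person
    SET    Age = 47          -- 47 stands for the new value being supplied
    WHERE  PersonNo = 'P1'
    AND    47 > Age;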
Figure 3.3: Classification of integrity constraints. (The figure divides integrity constraints into data model constraints, comprising structural constraints (inherent and implicit) and explicit (behavioural) constraints, and database application constraints, comprising state (static) and transition (dynamic) constraints.)

3.5 Constraint Types

We consider constraint types in more detail here so that we can later relate them to data models and database applications. Initially, we describe value constraints (i.e. domain and key constraints), which are applicable to database values (i.e. attributes). Then, we describe structural constraints, namely: attribute structural, relationship structural and superclass/subclass structural constraints. These constraints are often associated with data models and some of them have been mentioned in section 3.4.1. In this section, we look at them with respect to their structural properties and are concerned with identifying differences within a structure, in addition to the relationships (e.g. between entities) formed by them. Finally, constraints that do not fall into either of these categories are described. As most of these constraints are state constraints, we shall refer to the constraint type only when type distinction is necessary. All structural constraints are shown in a conceptual model, as this model is used to describe the structure of a database. Not all value constraints (e.g. check constraints) are shown, as they are not associated with the structure of a database and are described using a DML. However, our work includes presenting them at optional lower levels of abstraction which involves software dependent code. This code is based on current SQL standards and may be replaced using equivalent graphical constructs if necessary12. Here, for each type of SQL statement, we could introduce a suitable graphical representation and hence increase its readability. All value constraints are implicitly or explicitly defined when implementing an application. Most constraints considered here are implicit constraints, as they may be specified using the DDL of a modern DBMS. In such cases the DBMS will monitor all changes to the data in the database to ensure that no constraint has been violated by these changes.

3.5.1 Domain Constraints

Domain constraints are specified by associating every simple attribute type with a value set.

12 This idea is beyond the scope of this thesis.
  • 43. Chapter 3 Database Technology, Relational Model, Conceptual Modelling and Integrity Constraints The value set of a simple attribute is initially defined using its data type (integer, real, char, boolean) and length, and later is further restricted using appropriate constraints such as range (minimum, maximum) and pattern (letters and digits). For example, the value set for the Deptno attribute of the entity Department could be specified as data type character of length 5, and the Salary attribute of the entity Staff as data type decimal of length 8 with 2 decimal places, in the range 3000 to 20000. Nonnull constraints can be seen as a special case of domain constraints, since they too restrict the domain of attributes. These constraints are used to eliminate the possibility of missing, or unknown values of an attribute occurring in the database. A domain constraint is usually used to restrict the value of an attribute, e.g. an employee’s age is ≥ 18 (i.e. a state constraint), however they may also be used to compare values of two states, e.g. an employee’s new salary is ≥ to their current salary (i.e. a transition constraint). 3.5.2 Key Constraints Key constraints specify key attribute(s) that can uniquely identify an instance of an entity. These constraints are also called candidate key or uniqueness constraints. For example, stating Deptno is a key of Department will ensure that no two departments will have the same Deptno. When a set of attributes form a key, then that key is called a composite key, as we are dealing with a composite attribute. When a nonnull constraint is added to a key uniqueness constraint then such keys are referred to as primary keys. An entity may have several candidate keys and in such cases one is called the primary key and the others alternate keys. Primary key attributes are shown in the EER model (see appendix B.2, figure ‘b’). The OMT model uses object identities (oids) to uniquely identify objects and as they are usually system generated they are not shown in this model. However, when modelling relational databases we do not use the concept of oid, instead we have primary keys which are shown in our diagrams (see appendix B.2, figure ‘b’) as they carry important information about a relational database. 3.5.3 Structural Constraints on Attributes Attribute structural constraints specify whether an attribute is single valued or multivalued. Multivalued attributes with a fixed number of possible values are sometimes defined as composite attributes. For example, name can be a composite attribute composed of first name, middle initial and last name. However composite attributes cannot be constructed for multivalued attributes like a student’s course set, where the student can attend several courses. In such a case one would have to use an alternative solution, such as recording all possibilities in one long string or using a separate data type like sets. This type of constraint is not generally supported by most traditional DBMSs. In the relational model we use a separate entity to hold multiple values and these are related to the correct entity through an identical primary key [ELM94]. 3.5.4 Structural Constraints on Relationships Structural constraints on relationships specify limitations on the participation of entities in relationship instances. Two types of relationship constraints occur frequently. They are called Page 42
• 44. Chapter 3 Database Technology, Relational Model, Conceptual Modelling and Integrity Constraints cardinality ratio constraints and participation constraints. The cardinality ratio constraint specifies, using 1 and N (many), the number of relationship instances that an entity can participate in. For example, the constraints every employee works for exactly one department and a department can have many employees working for it together express a cardinality ratio of 1:N, meaning that each department entity can be related to numerous employee entities. A participation constraint specifies whether the existence of an entity depends on its being related to another entity via a certain relationship type. If all the instances of an entity participate in a relationship of this type then the entity has total participation (existence dependency) in the relationship. Otherwise the participation is partial, meaning only some of the instances of that entity participate in a relationship of this type. For example, the constraint every employee works for exactly one department means that an Employee entity has total participation in the relationship WorksFor (see figure 3.2), and the constraint an employee can be the head of a department means that the Employee entity has partial participation in the relationship Head (see figure 3.2) (i.e. not all employees are head of a department).

Referential integrity constraints are used to specify a type of structural relationship constraint. In relational databases, foreign keys are used to define referential integrity constraints. A foreign key is defined on attributes of a relation. This relation is known as the referencing table. The foreign key attribute of the referencing table (e.g. WorksFor of Employee in figure 3.4) always refers to attribute(s) of another relation, where those attribute(s) are the primary or alternate key (e.g. DeptCode of Department in figure 3.4). We refer to this relation as the referenced table. The referenced attribute(s) of the referenced table have a uniqueness property, being the primary or an alternate key of that relation. This means that references from one relation to another are achieved by using foreign keys, which indicate a relationship between two entities. This also establishes an inclusion dependency between the two entities: the values of the attribute of the referencing entity (e.g. Employee.WorksFor) are a subset of the values of the attribute of the referenced entity (e.g. Department.DeptCode). Only recent DBMSs such as Oracle version 7 support the specification of foreign keys using DDL statements.

[Figure 3.4: A foreign key example — sample Employee (5 records) and Department (3 records) tables containing department codes such as COMMA, MATHS and ELSYM. WorksFor is a foreign key attribute of the referencing entity Employee; it refers to the referenced entity Department, whose attribute DeptCode is its primary key.]

3.5.5 Structural Constraints on Specialisation/Generalisation

Disjointness and completeness constraints are defined on specialisation/generalisation structures. The disjointness constraint specifies that the subclasses (superclass) of the specialisation (generalisation) must be disjoint. This means that an entity can be a member of at most one of the subclasses of the specialisation. When an entity is a member of more than one of the subclasses, then we get an overlapping situation.
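A hedged sketch of how disjointness is commonly approximated in a relational schema, anticipating the job-type example discussed later in this section (the JobType column name is illustrative): recording subclass membership in a single-valued discriminator column means each entity can belong to at most one subclass.

    CREATE TABLE Employee (
        EmpNo   CHAR(5)  PRIMARY KEY,
        JobType CHAR(10)     -- single-valued discriminator: an employee has at most one job type,
                             -- so the subclasses identified by JobType are disjoint; an overlapping
                             -- specialisation would instead need one membership flag per subclass
    );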
The completeness constraint on specialisation (generalisation) defines total/partial specialisation (generalisation). A total specialisation specifies Page 43
  • 45. Chapter 3 Database Technology, Relational Model, Conceptual Modelling and Integrity Constraints the constraint that every entity in the superclass must be a member of some subclass in the specialisation. For example: Lecturer, Secretary and Technician are some of the job types of an Employee. They describe disjoint subclasses of the entity Employee, having a partial specialisation as there could be an employee with another job type. Generalisation is a feature supported by many object-oriented (O-O) systems, but it has yet to be adopted by commercial relational DBMSs. However, with O-O features being incorporated into the relational model (e.g. SQL-3) we can expect to see this feature in many RDBMSs in the near future. 3.5.6 General Semantic Constraints There are general semantic integrity constraints that do not fall into one of the above categories (e.g. the constraint the salary of an employee must not be greater than the salary of the head of the department that the employee works for, or the salary attribute of an employee can only be increased). These constraints can be either state or transition constraints, and are generally specified as explicit constraints. The transition constraint mentioned above is a single-step transition constraint. Here, a constraint is evaluated on a pair of pre-transaction and post-transaction states of a database, e.g. in the employee’s salary example, the current salary will be the pre-transaction state and the new salary will become the post-transaction state. However, there are transition constraints that are not limited to a single-step, e.g. temporal constraints specified using the temporal qualifiers always and sometimes [CHO92]. Other forms of constraint exist, including those defined for incomplete data (e.g. employees having similar jobs and experience must have almost equal salary) [RAJ88]. These can also be considered as a type of semantic constraint, mainly as they are not implicitly supported by the most frequently used (i.e. relational) data model. They may need a special constraint specification language to support them. 3.6 Specifying Explicit Constraints Explicit constraints are generally defined using either a procedural or a declarative approach. 3.6.1 Procedural Approach In the procedural approach (or the coded constraints technique), constraint checking statements are coded into appropriate update transactions of the application by the programmer. For example to enforce the constraint, the salary of an employee must not be greater than the salary of the head of the department that the employee works for, one has to check every update transaction that may violate this constraint. This includes any transaction that modifies the salary of an employee, any transaction that links an employee to a department, and any transaction that assigns a new employee or manager to a department. Thus in all such transactions appropriate code has to be included that will check for possible violations of this constraint. When a violation occurs the transaction has to be aborted, and this is also done by including appropriate code in the application. The above technique for handling explicit constraints is used by many existing DBMSs. This Page 44
  • 46. Chapter 3 Database Technology, Relational Model, Conceptual Modelling and Integrity Constraints technique is general because the checks are typically programmed in a general-purpose programming language. It also allows the programmer to code in an effective way. However, it is not very flexible and places an extra burden on the programmer, who must know where the constraints may be violated and include checks at each and every place that a violation may occur. Hence, a misunderstanding, omission or error by the programmer may leave the database able to get into an inconsistent state. Another drawback of specifying constraints procedurally is that they can change with time as the rules of a real world situation change. If a constraint changes, it is the responsibility of the DBA to instruct appropriate programmers to recode all the transactions that are affected by the change. This again opens the possibility of overlooking some transactions, and hence the chance that errors in constraint representation will render the database inconsistent. Another source of error is that it is possible to include conflicting constraints in procedural specifications that will cause incorrect abortion of correct transactions. This error may occur in other types of constraint specification, e.g. whenever we attempt to define multiple constraints on the same entity. However, such errors can be detected more easily in a declarative approach, as it is possible to evaluate constraints defined in that form to identify conflicts between them. The procedural approach is usually adopted only when the DBMS cannot declaratively support the same constraint. In all early DBMSs the procedural code was part of the application code and was not retained in the database’s system catalog. However, some recent DBMSs (e.g. INGRES) provide a rule subsystem where all defined procedures are stored in system catalogs and are fired using rules which detect update transactions associated with a particular constraint. This approach is a step towards the declarative approach as it overcomes some of the deficiencies of the procedural approach described above (e.g. the maintenance of constraints), although the code is still of procedural type which for example, prevents the detection of conflicting constraints. Some DBMSs (e.g. INGRES) do not support the specification of foreign key constraints through their DDL. Hence, for these systems such constraints have to be explicitly defined using a procedural approach. In section ‘a’ of table 3.2, we show how a procedure is used in INGRES to implement a foreign key constraint. Here the existence of a value in the referenced table is checked and the statement is rejected if it does not exist. For comparison purposes, we include the definition of the same constraint using an SQL-3 DDL specification (implicit) in section ‘b’ of table 3.2, and the declarative approach (explicit) in section ‘c’ of table 3.2. When comparing these three approaches, it is clear that the procedural one is most unfriendly and more error-prone. The DDL approach is the best and most efficient approach as it is specified and managed implicitly by the DBMS. Page 45
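Returning to the INGRES rule subsystem mentioned above, the sketch below indicates roughly how the stored procedure of table 3.2(a) on the next page might be fired by a rule whenever the relevant columns of Employee change. The exact CREATE RULE clause syntax differs between INGRES releases, so this is an indicative approximation rather than a verified statement:

    CREATE RULE Employee_FK_Rule
        AFTER INSERT, UPDATE(WorksFor) OF Employee
        EXECUTE PROCEDURE Employee_FK_Dept (WorksFor = new.WorksFor);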
• 47. Chapter 3 Database Technology, Relational Model, Conceptual Modelling and Integrity Constraints

(a) Procedural Approach (Explicit):

    CREATE PROCEDURE Employee_FK_Dept (WorksFor CHAR(5)) AS
    DECLARE
        msg = VARCHAR(70) NOT NULL;
        check_value = INTEGER;
    BEGIN
        IF WorksFor IS NOT NULL THEN
            SELECT COUNT(*) INTO :check_value
                FROM Department
                WHERE DeptCode = :WorksFor;
            IF check_value = 0 THEN
                msg = ‘Error 1: Invalid Department Code: ‘ + :WorksFor;
                RAISE ERROR 1 :msg;
                RETURN
            ENDIF
        ENDIF
    END;

(b) DDL Approach (Implicit):

    CONSTRAINT Employee_FK_Dept
        FOREIGN KEY (WorksFor) REFERENCES Department (DeptCode);

    Note: This constraint is defined in the Employee table.

(c) Declarative Approach (Explicit):

    CREATE ASSERTION Employee_FK_Dept
        CHECK (NOT EXISTS (SELECT * FROM Employee
                           WHERE WorksFor NOT IN (SELECT DeptCode FROM Department)));

Note: The schema on which these constraints are defined is in figure 6.2.

Table 3.2: Different Approaches to defining a Constraint

3.6.2 Declarative Approach

A more formal technique for representing explicit constraints is to use a constraint specification language, usually based on some variation of relational calculus. This is used to specify or declare all the explicit constraints. In this declarative approach there is a clean separation between the constraint base, in which the constraints are stored in a suitably encoded form, and the integrity control subsystem of the DBMS, which accesses the constraints to apply them to transactions. When using this technique, constraints are often called integrity assertions, or simply assertions, and the specification language is called an assertion specification language. The term assertion is used in place of explicit constraints, and the assertions are specified as declarative statements. These constraints are not attached to a specific table as in the case of the implicit constraint types introduced in section 3.5. This approach is supported by SQL-3. For example, the constraint the salary of an employee must not be greater than that of the head of the department that the employee works for (cf. section 3.6.1) can be specified as:

    CREATE ASSERTION Salary_Constraint
        CHECK (NOT EXISTS (SELECT * FROM Employee E, Employee H, Department D
                           WHERE E.WorksFor = D.DeptCode
                             AND D.Head = H.EmpNo
                             AND E.Salary > H.Salary));

Assertions can also be used to define implicit constraints, like examination mark is between 0 and 100, or referential integrity constraints, such as the one of table 3.2 part ‘b’ (expressed as an assertion in part ‘c’). However, it is easier and more efficient (i.e. consumes less computer resources) to monitor and enforce implicit constraints using the DDL approach, as such constraints are attached to an entity and checked only when an update transaction is applied to that entity, as opposed to being checked whenever an update transaction is performed on the database in general. Page 46
  • 48. Chapter 3 Database Technology, Relational Model, Conceptual Modelling and Integrity Constraints In many cases it is convenient to specify the type of action to be taken when a constraint is violated or satisfied, rather than just having the options of aborting or silently performing the transaction. In such cases, it is useful to include the option of informing a responsible person regarding the need to take action or notifying them of the occurrence of that transaction (e.g. in referential constraints, it is sometimes necessary to perform an action like update or delete on a table to amend its contents, instead of aborting the transaction). This facility is supported either by an optional trigger option on an existing DDL statement or by defining triggers [DAT93]. Triggers can combine the declarative and procedural approaches, as the action part can be procedural, while the condition part is always declarative (INGRES rules are a form of trigger). A trigger can be used to activate a chain of associated updates, that will ensure database integrity (e.g. update total quantity when new stock arrives or when stock is dispatched). An alerter, which is a variant of the trigger mechanism, is used to notify users of important events in the database. For example, we could send a message to the head calling to his attention a purchase transaction for £1,000 or more made from department funds. In this section we have introduced concepts from INGRES which also appear in other DBMSs, namely triggers and alerters. These constructs provide further information about database contents, but are beyond the scope of this project. They are related to constraints, so could be utilised in a similar fashion. 3.7 SQL Approach to Implementing Constraints In SQL-3, a constraint is either a domain constraint, a table constraint or a general constraint [ISO94]. It is described by a constraint descriptor, which is either a domain constraint descriptor (cf. sections 3.7.1 and A.11), a table descriptor (cf. sections 3.7.2 and A.4) or a general descriptor (cf. sections 3.7.3 and A.12). Every constraint descriptor includes: the name of the constraint, an indication of whether or not the constraint is deferrable, and an indication of whether or not the initial constraint mode is deferred or immediate (see section A.3). Constraint descriptors are optional in that they can be assigned with system generated names (except for the general constraint case, where a name must be given). A constraint has an action which is either deferrable or non- deferrable. This can be set using the constraint mode option (see section A.13). Usually, most constraints are immediate as the default constraint mode is immediate, and in these cases they are checked at the end of an SQL transaction. To deal with deferred constraints, all constraints are effectively checked at the end of an SQL session or when an SQL commit statement is executed. Whenever a constraint is detected as being violated, an exception condition is raised and the transaction concerned is terminated by an implicit SQL rollback statement to ensure the consistency of the database system. 3.7.1 SQL Domain Constraints In SQL, domain constraints are specified by means of the CREATE DOMAIN statement (see section A.11) and can be added to or dropped from an existing domain by means of the ALTER DOMAIN statement [DAT93]. These constraints are associated with a specific domain and apply to every column that is defined using that domain. 
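As a hedged sketch of this (using the Marks domain defined in the example below, with the Exam table and its columns invented purely for illustration), a constraint can later be attached to or removed from a domain, and every column declared over that domain is then checked automatically:

    ALTER DOMAIN Marks ADD CONSTRAINT icmarks_low CHECK (VALUE >= 0);   -- attach a further constraint to the domain
    ALTER DOMAIN Marks DROP CONSTRAINT icmarks_low;                     -- remove it again

    CREATE TABLE Exam (
        StudentNo CHAR(5),
        Result    Marks      -- every value stored here must satisfy the constraints of the Marks domain
    );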
They allow users to define new data types, which in turn are used to define the structure of a table. For example, a domain Marks may be defined as shown in figure 3.5. This means SQL will recognise the data type Marks which permits integers Page 47
• 49. Chapter 3 Database Technology, Relational Model, Conceptual Modelling and Integrity Constraints between 0 and 100, thus giving a natural look to that data type when it is used.

    CREATE DOMAIN Marks INTEGER
        CONSTRAINT icmarks CHECK (VALUE BETWEEN 0 AND 100);

Figure 3.5: An SQL domain constraint

3.7.2 SQL Table Constraints

In SQL, table constraints (i.e. constraints on base tables) are initially specified by means of the CREATE TABLE statement (see section A.4) and can be added to or dropped from an existing base table by means of the ALTER TABLE statement [DAT93]. These constraints are associated with a specific table, as they cannot exist without a base table. However, this does not mean that such constraints cannot span multiple tables, as in the case of foreign keys. Constraints defined on specific columns of a base table are a type of table constraint, but are usually referred to as column constraints. Three types of table constraints are defined here, namely: candidate key constraints, foreign key constraints and check constraints. Their definitions may appear next to their respective column definitions or at the end (i.e. after all column definitions have been defined). We now describe an example that uses all three types of constraints, using figure 3.6. The PRIMARY KEY clause (only one per table; see section A.6) and the UNIQUE clause (the values of the named column(s) must be unique across rows; see section A.5) are used to define candidate keys. A FOREIGN KEY definition (see section A.8) defines a referential integrity constraint and may also include an action part (which is not used in figure 3.6 for simplicity). Check constraints are defined using a CHECK clause (see section A.9) and may contain any logical expression. The check constraint CHECK (Name IS NOT NULL) is usually defined using the shorthand form NOT NULL next to the column Name, as shown in figure 3.6. We have also included a check constraint spanning multiple tables in figure 3.6. Such table constraints can be included only after the tables have been created, and hence in practice they are added using ALTER TABLE statements.

    CREATE TABLE Employee(
        EmpNo    CHAR(5)  PRIMARY KEY,
        Name     CHAR(20) NOT NULL,
        Address  CHAR(80),
        Age      INTEGER  CHECK (Age BETWEEN 21 AND 65),
        WorksFor CHAR(5)  REFERENCES Department (DeptCode),
        Salary   DECIMAL,
        CHECK (Employee.Salary <= (SELECT H.Salary
                                   FROM Department D, Employee H
                                   WHERE D.Head = H.EmpNo
                                     AND D.DeptCode = Employee.WorksFor)),
        UNIQUE (Name, Address) );

Figure 3.6: SQL table constraints

3.7.3 SQL General Constraints

In SQL, general constraints are specified by means of the CREATE ASSERTION statement (see section A.12) and are dropped by means of the DROP ASSERTION statement. These constraints must be associated with a user defined constraint name as they are not attached to a specific table Page 48
  • 50. Chapter 3 Database Technology, Relational Model, Conceptual Modelling and Integrity Constraints and are used to constrain arbitrary combinations of columns in arbitrary combinations of base tables in a database. The constraint defined in section ‘c’ of table 3.2 belongs to this type. Domain and table constraints are implicit constraints of a database, while assertions used to define general constraints are explicit constraints (using a declarative approach). SQL data types have their own constraint checking, which rejects for example string values being entered into a numeric column definition. This type of constraint checking can be considered as inherent as it is supported by the SQL language itself. All integrity constraints discussed above are deterministic and independent of the application and system environments. Hence, no parameters, host variables and built in system functions (e.g. CURRENT_DATE) are allowed in these definitions as they make the database inconsistent. For example CURRENT_DATE will give different values on different days. This means the validity of a data entry may not hold during its lifetime despite no changes being made to its original entry. This makes the task of maintaining the consistency of the database more difficult. Also it makes it difficult to distinguish these errors from the traditional errors discussed in section 3.5. Hence attributes such as age, which involves use of CURRENT_DATE should be derived attributes whose value is computed during retrieval. 3.8 SQL Standards To conclude this chapter, we compare different SQL standards to chronicle when respective constraint specification statements were introduced to the language. A standard for the database language SQL was first introduced in 1986 [ANSI86], and this is now called SQL/86. The SQL/86 standard specified two levels, namely: level 1 and level 2 (referred to as level 1 and 2 respectively in table 3.3, column 2); where level 2 defined the complete SQL database language, while level 1 was a subset of level 2. In 1989, the original standard was extended to include the integrity enhancement feature [ANSI89a]. This standard is called SQL/89 and is referred to as level Int. in table 3.3, column 2. The current standard, SQL/92 [ANSI92], is also referred to as SQL-2. This standard defines three levels, namely: Entry, Intermediate and Full SQL (referred to as level E, I and F, respectively, in table 3.3, column 4); where Full SQL is the complete SQL database language, Intermediate SQL is a proper subset of Full SQL, and Entry SQL is a proper subset of Intermediate SQL. The purpose of introducing 3 levels was to enable database vendors who had incorporated the SQL/89 extensions into their original SQL/86 implementations to claim SQL/92 Entry level status. As Intermediate extensions were more straightforward additions than the rest, they were separated from the Full SQL/92 extensions. However, SQL/92 is also constantly being reviewed [ISO94], mainly to incorporate O-O features into the language, and this latest release is called SQL-3 (referred to as level O-O in table 3.3, column 5). Until recently, relational DBMSs supported only the SQL/86 standard and even now most support only up to the Entry level of SQL/92. Hence ISs developed using these relational DBMSs are not capable of supporting modern features introduced in the newest standards. 
Our work focuses on providing these newer features for existing relational legacy database systems so that features such as primary / foreign key specification can be supported for relational databases conforming to SQL/86 standards; and sub / super type features can be specified for all relational products. Page 49
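For instance, until sub/super types become available in relational products, a specialisation such as Lecturer as a subclass of Employee can be represented structurally in any relational system by a subclass table that shares the superclass key; where the DBMS supports at least the SQL/89 integrity enhancement feature, the sub/super type link can also be recorded declaratively. A hedged sketch (the Specialty column is invented for illustration):

    CREATE TABLE Lecturer (
        EmpNo     CHAR(5) PRIMARY KEY,                       -- same key as the Employee superclass
        Specialty CHAR(20),
        FOREIGN KEY (EmpNo) REFERENCES Employee (EmpNo)      -- every Lecturer is an Employee
    );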
• 51. Chapter 3 Database Technology, Relational Model, Conceptual Modelling and Integrity Constraints

    SQL Version         SQL/86     SQL/89    SQL/92         SQL-3
    Level               1     2    Int.      E    I    F    O-O
    Data Type           x     +    =         =    +    +    +
    Identifier length   x     +    =         =    +    =    =
    Not Null            x     +    =         =    =    =    =
    Unique Key          -     x    =         =    +    =    =
    Primary Key         -     -    x         =    +    =    =
    Foreign Key         -     -    x         =    =    +    +
    Check Constraint    -     -    x         =    =    +    =
    Default Value       -     -    x         =    =    =    =
    Domain              -     -    -         -    x    +    =
    Assertion           -     -    -         -    -    x    +
    Trigger             -     -    -         -    -    -    x
    Sub/SuperType       -     -    -         -    -    -    x

Table 3.3: SQL Standards and introduction of constraints

The integrity features discussed in previous sections were thus gradually introduced into the SQL language standards as we can see from table 3.3. In this table ‘x’ indicates when the feature was first introduced. The ‘+’ sign indicates that some enhancements were made to a previous version, the ‘=’ sign indicates that the same feature was used in a later version, and the ‘-’ sign indicates that the feature was not provided in that version. For example, the Primary Key constraint for a table was first introduced in SQL/89 (cf. appendix A.6) and later enhanced (i.e. in SQL/92 Intermediate) by merging it with the Unique constraint (cf. appendix A.5) to introduce a candidate key constraint (cf. appendix A.7). Thus, SQL standards for Primary Key are shown in table 3.3 as: ‘-’ for SQL/86; ‘x’ for SQL/89; ‘=’ for SQL/92 Entry level; ‘+’ for SQL/92 Intermediate level; and ‘=’ for all subsequent versions.

Simple attributes are defined using their data type and length (cf. section 3.5.1). These specifications are used as inherent domain constraints. The first two rows of table 3.3 show that they were among the first constraint features introduced via SQL standards (i.e. SQL/86). The Not Null constraint, which is a special domain constraint, was also introduced in the initial SQL standard. The key constraints (cf. section 3.5.2), which specify unique and primary keys, were introduced in a subsequent standard (i.e. SQL/89) as shown in table 3.3. The referential constraint which specifies a type of a structural relationship constraint uses foreign keys, and this constraint was also introduced in SQL/89, along with default values for attributes and check constraints. Later, more complex forms of constraints were introduced in SQL/92. These include defining new domains for an attribute (e.g. child as a domain having an inherent constraint of age being less than 18 years), and specifying domain constraints spanning multiple tables (i.e. assertions). Constraints which are activated when an event occurs (i.e. triggers), and structural constraints on specialisation / generalisation (i.e. sub/super type, cf. section 3.5.5) are among other enhancements proposed in the draft SQL-3 standards. A detailed description of the syntax of statements used to provide the features identified in table 3.3 is given in appendix A. For further details we refer the reader to the standards themselves [ANSI86, ANSI89a, ANSI92, ISO94]. Page 50
  • 52. CHAPTER 4 Migration of Legacy Information Systems This chapter addresses in detail the migration process and issues concerned with legacy information systems (ISs). We identify the characteristics and components of a typical legacy IS and present the expected features and functions of a migration target IS. An overview of some of the strategies and methods used for migration of a legacy IS to a target IS is presented along with a detailed study of migration support tools. Finally, we introduce our tool to assist the enhancement and migration of relational legacy databases. 4.1 Introduction Rapid technological advances in recent years have changed the standard computer hardware technology from centralised mainframes to network file-server and client/server architectures, and software data management technology from files and primitive database systems to powerful extended relational distributed DBMSs, 4GLs, CASE tools, non-procedural application generators and end-user computing facilities. Information systems (ISs) built using old-fashioned technology are inflexible, as that technology limits them from being changed and evolving to meet changing business needs, which adjust rapidly to the potential of technological advances. It also means they are expensive to maintain, as older systems suffer from failures, inappropriate functionality, lack of documentation, poor performance and problems in training support staff. Such older systems are called legacy ISs [BRO93, PHI94, BEN95, BRO95], and they need to be evolved and migrated to a modern technological environment so that their existence remains beneficial to their user community and organisation, and their full potential to the organisation can be realised. 4.2 Legacy Information Systems Technological advances in hardware and software have improved the performance and maintainability of new information systems. Equipment and techniques used by older ISs are obsolete and prone to suffer from frequent breakdowns along with ever increasing maintenance costs. In addition, older ISs may have other deficiencies depending on the type of system. Common characteristics of these systems are [BRO93, PHI94, BEN95, BRO95] that they are: • scientifically old, as they were built using older technology, • written in a 3GL, • use an old fashioned database service (e.g. a hierarchical DBMS), • have a dated style of user interface (e.g. command driven). Furthermore, in very large organisations additional negative characteristics may be present making the intended migration process even more complex and difficult. These include [BRO93, AIK94, BEN95, BRO95]: being very large (e.g. having millions of lines of code); being mission critical (e.g. an on-line monitoring system like customer billing); and being operational all the time (i.e. 24 hours a day and 7 days a week). These characteristics are not present in most smaller scale legacy information systems, and hence the latter are less complex to maintain. Our work may not
  • 53. Chapter 4 Migration of legacy ISs assist all types of legacy IS as it deals with one particular component of a legacy IS only (i.e. the database service). Information systems consist of three major functional components, namely: interfaces, applications and a database service. In the context of a legacy IS these components are, accordingly, referred to as [BRO93, BRO95] the: • legacy interface, • legacy application, • legacy database service. These functional components are sometimes inter-related depending on how they were designed and implemented in the IS’s development. They may exist independently of each other, having no functional dependencies (i.e. all three components are decomposable - see section ‘a’ of figure 4.1); they may be semi-decomposable (e.g. the interface may be separate from the rest of the system - see section ‘b’ of figure 4.1); or they may be totally non-decomposable (i.e. the functional components are not discrete but are intertwined and used as a single unit - section ‘c’ of figure 4.1). This variety makes the legacy IS environment complex to deal with. Due to the potential complexity of entire legacy ISs, we have focussed on one particular functional component only, namely: the legacy database service. In order to restrict our attention to the database service component, we have to treat the other components, namely the interface and application, as black boxes. This can be done successfully when a legacy IS has decomposable modules as in section ‘a’ of figure 4.1. However, when the legacy database service is not fully decomposable from both the legacy interface and the legacy application, treating them as black boxes may result in loss of information since relevant database code may also appear in other components. In such cases, attempts must be made by the designer to decompose or restructure the legacy code. The designer needs to investigate the legacy code of the interface and application modules to detect any database service code, then move it to the database service module (e.g. legacy code used to specify and enforce integrity constraints in the interface or application components is identified and transferred to this module). This will assist in the conversion of legacy ISs of the types shown in sections ‘b’ and ‘c’ of figure 4.1 to conform to the structure of the IS type in section ‘a’ of figure 4.1. The identification and transfer of any legacy database code left in the legacy application or interface modules can be done at any stage (e.g. even after the initial migration) as the migration process can be repeated iteratively. Also, the existence of legacy database code in the application does not affect the operation of the IS as we are not going to change any existing functionalities during the migration process. Hence, treating a legacy interface or a legacy application having legacy database code as a black box does not harm migration. Page 52
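As a hedged illustration of transferring such code (reusing the Employee/Department example of chapter 3; the constraint name is invented), a referential check that a legacy application performs procedurally before each insert can, once identified, be recorded declaratively in the database service of a modern target DBMS:

    -- Instead of application code that checks WorksFor against Department before every insert,
    -- the same rule is attached to the schema once and enforced by the DBMS for all transactions:
    ALTER TABLE Employee
        ADD CONSTRAINT Employee_FK_Dept
            FOREIGN KEY (WorksFor) REFERENCES Department (DeptCode);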
• 54. Chapter 4 Migration of legacy ISs

[Figure 4.1: Functional Components of a Legacy IS — (a) decomposable: separate interface (I), application (A) and database service (D) components; (b) semi-decomposable: a separate interface (I) with a combined application/database service (A/D); (c) non-decomposable: a single intertwined unit (I/A/D).]

4.2.1 Legacy Interfaces

Early information systems were developed for system users who were computer literate. These systems did not have system or user level interfaces because they were not regarded as essential since the system users did these tasks themselves. However, when the business community and others wanted to use these systems, the need for user interfaces was identified and they have been incorporated into more recent ISs. The introduction of DBMSs in the late 1960’s facilitated easy access to computer held data. However, in the early DBMSs, the end user had no direct access to their database and their interactions with the database were done through verbal communication with a skilled database programmer [ELM94]. All user requests were made via the programmer, who then coded the task as a batch program using a procedural language. Since the introduction of query languages such as SQL [CHA74, CHA76], QBE [ZLO77] and QUEL [HEL75], provision of interfaces for database access has rapidly improved. These interfaces are provided not only to encourage laymen to use the system but also to hide technical details from users. A presentation layer consisting of forms [ZLO77] was the initial approach to supporting interaction with the end user. Modern user interfaces rely on multiple screen windows and iconic (pictorial) representations of entities manipulated by pull-down or pop-up menus and pointing devices such as cursor mice [SHN86, NYE93, HEL94, QUE93]. The current trend is towards using separate interfaces for all decomposable application modules of an IS. Some Graphical User Interface (GUI) development tools (e.g. Vision for graphics and user interfaces [MEY94]) allow the construction of advanced GUIs supporting portability to various toolkits. This is an important step towards building the next generation of ISs. Changes in the user interface and operating environment result in the need for user training, an additional factor in system evolution costs. As defined by Brodie [BRO93, BRO95], we shall use the term legacy interfaces in the context of all ISs whose applications have no interfaces or use old fashioned user / system interfaces. In figures 4.1a and 4.1b, these interfaces are distinct and separable from the rest of the system as they are decomposable modules. However, interfaces can be non-decomposable as shown in figure 4.1c. Migration issues concerning user interfaces have been addressed in [BRO93, BRO95, MER95], and as mentioned in section 4.2, our work does not address problems associated with such interface migration.

4.2.2 Legacy Applications Page 53
  • 55. Chapter 4 Migration of legacy ISs Originally, ISs were written using 3GLs, usually the COBOL or PL/1 programming languages. These languages had many software engineering deficiencies due to the state of the technology at the time of their development. Techniques such as structured programming, modularity, flexibility, reusability, portability, extensibility and tailorability [YOU79, SOM89, BOO94, ELM94] were not readily available until subsequent advances in software engineering occurred. The lack of these has made 3GL based ISs appear to be inflexible and, hence, difficult and expensive to maintain and evolve to meet changing business needs. The unstructured and non-modular nature of 3GLs resulted in the use of non-decomposable application modules13 in the development of early ISs. However, with the introduction of software engineering techniques such as structured modular programming these 3GLs were enhanced, and new languages, such as Pascal [WIR71], Simula [BIR73], and more recently C++ [STR91] and Eiffel [MEY92], gradually emerged to support these changing software engineering requirements. The emergence of query languages in the 1970’s as standard interfaces to databases saw the development and use of embedded query languages in host programming languages for large software application program development. Embedded QUEL for INGRES [RTI90a] and Embedded SQL for many relational DBMSs [ANSI89b] are examples of this gendre. The emergence of 4GLs is an evolution which allows users to give a high-level specification of an application expressed entirely in 4GL code. A tool then automatically generates the corresponding application code from the 4GL code. For example, in the INGRES Application-by-Forms interface [RTI90b], the application designer develops a system by using forms displayed on the screen, instead of writing a program. Similar software products are offered by other vendors, such as Oracle [KRO93]. Information systems developed recently have partially or totally adopted modern software engineering practices. As a result, decomposable modules exist in some recent ISs, i.e. their architecture is as in section ‘a’ of figure 4.1. Applications which do not use the concept of modularity are non-decomposable (e.g. section ‘c’ of figure 4.1), while those partially using it are semi-decomposable (section ‘b’ of figure 4.1). Semi- and non- decomposable applications are referred to as legacy applications and need to be converted into fully-decomposable systems to simplify maintenance and make it easier for them to evolve and support future business needs. Some aspects of legacy application migration need tools to analyse code. These are discussed in [BIG94, NIN94, BRA95, WON95]. They are beyond the scope of this thesis, except insofar as we are interested in any legacy code that is relevant to the provisions of a modern database service (e.g. integrity constraints). 4.2.3 Legacy Database Service Originally, many ISs were developed on centralised mainframes using COBOL and PL/1 based file systems [FRY76, HAN85]. These ISs had no DBMS and their data was managed by the system using separate files and programs for each file handling task [HAN85]. Later ISs were based 13 often containing calculated or parameter-driven GOTO statements preventing a reasonable analysis of their structure. Page 54
  • 56. Chapter 4 Migration of legacy ISs on using database technology with limited DBMS capabilities. These systems typically used hierarchical or network DBMSs for their data management [ELM94, DAT95], such as IMS [MCG77] and IDMS [DAT86, ELM94]. The introduction and rapid acceptance of the relational model for DBMSs in recent years has resulted in most applications now being developed with original relational DBMSs (e.g. System R [AST76], DB2 [HAD84], SQL/DS [DAT88], INGRES [STO76, DAT87]). The steady evolution of the relational data model has resulted in the emergence of extended relational DBMSs (e.g. POSTGRES [STO91]) and newer versions of existing products (e.g. Oracle [ROL92], INGRES [DAT87] and SYBASE [DAT90]) which have been used for most recent database applications. This relational data model has been widely accepted as the dominant current generation standard for supporting ISs. The rapidly expanding horizon of applications means that DBMSs are now expected to cater for diverse data handling needs such as management of image, spatial, statistical or temporal databases [ELM94, DAT95] and it is in its support of these that they are often weak. This highlights the different range of functionality that is supported by various DBMSs. Thus applications using older database services support modern database functionalities by means of application modules. This is a typical characteristic of a legacy IS. Hence, the structure of such ISs is more complex and is poorly understood as it is not adequately engineered in accordance with current technology. The database services offered by most hierarchical, network and original relational DBMSs are now considered to be primitive, as they fail to support many functions (e.g. backup, recovery, transaction support, increased data independence, security, performance improvements and views [DAT77, DAT81, DAT86, DAT90, ELM94]) found in modern DBMSs. These functions facilitate the system maintenance of databases developed using modern DBMSs. Hence, the database services provided by early DBMSs and file based systems are now referred to as legacy database services, since they do not fulfil many current requirements and expectations of such services. The specifications of a database service are described by a database schema which is held in the database using data dictionaries. Analysis of the contents of these data dictionaries will provide information that is useful in constructing a conceptual model for a legacy system. Our approach focuses on using the data dictionaries of a relational legacy database to extract as much information as possible about the specifications of that database. 4.2.4 Other Characteristics The programming constructs of 3GL programs are less powerful than the data manipulation features offered by 4GLs. As 4GL code uses the higher level DML code of modern DBMSs, it uses less code (e.g. about 20% less) than its predecessors to accomplish even more powerful applications. A typical program of a 3GL based information system is large and may consist of over a hundred thousand lines of 3GL code. This means that a 20% reduction is a considerable saving in quantity of code to be maintained [BRO93, BRO95]. Code translation tools [SHA93, SHE94] are being built to automate as far as possible the conversion between 3GL and 4GL. These translations sometimes optimise the translated code. Some of these techniques were mentioned in section 1.1.1. The long lifetime of some ISs also leads to deficiencies in documentation. 
This may be due to non-existent, out of date or lost documentary materials. The extent of this deficiency was only Page 55
  • 57. Chapter 4 Migration of legacy ISs realised recently when people tried to transform ISs. To address this problem in the future, CASE tools are being developed to automatically produce suitable documentation for current ISs developed using them [COMP90]. However, this solution does not apply to legacy ISs as they were not built using such tools and it is impossible to use these tools on the legacy systems. Thus we must solve this problem in another way. Sometimes, certain critical functions of an IS are written for high performance, often using a specific, machine dependent set of instructions on some obsolete computer. This results in the use of mysterious and complex machine code constructs which need to be deciphered to enable the code to be ported to other computer systems. Such code is usually not convertible using generalised translation tools. In general, the performance of legacy ISs is poor as most of their functions are not optimised. This is inevitable, due to the state of the technology at the time of their original development. Thus problems arise when we try to translate 3GL code into 4GL equivalent code in a straightforward manner. Solving the problems identified above is the overall concern when assisting the migration and evolution of legacy ISs. However, our aim is to address only some of the problems concerning legacy ISs, as the complete task is beyond the scope of our project. 4.2.5 Legacy Information System Architecture Having considered the characteristics of the components of legacy ISs, we can conclude that a typical IS consists of many application modules, which may or may not use an interface for user / system interactions, and may use a legacy database service to manage legacy data. This database service may use a DBMS to manage its database. Hence, in general, the architecture of most legacy ISs is not strictly decomposable, semi- decomposable or non-decomposable, as they may have evolved several times during their lifetime. As a result, parts of the system may belong to any of the three categories shown in figure 4.2. This means that the general architecture of a legacy IS is a hybrid one, as defined in [BRO93, BRO95, KAR95]. Figure 4.2 suggests that some interfaces and application modules are inseparable from the legacy database service while others are modular and independent of each other. This legacy IS architecture highlights the database service component, as our interactions are with this layer to extract the legacy database schema and any other database services required. 4.3 Target Information System A legacy IS can be migrated to a target IS with an associated computing environment. This target environment is intended to take maximum advantage of the benefits of rightsized computers, client/server network architecture, and modern software including relational DBMSs, 4GLs and CASE tools. In this section we present the hardware and software environments needed for the target ISs. 4.3.1 Hardware Environment The target environment must be equipped with modern technology supporting current Page 56
  • 58. Chapter 4 Migration of legacy ISs business needs which should be flexible enough to evolve and fulfil future requirements. The fundamental goal of a legacy IS migration is that the target IS must not itself become a legacy IS in the near future. Thus, the target hardware environment needs to be flexibly networked (e.g. client- server architecture) to support a distributed multi-user community. This type of environment includes a desk top computer for each target IS user with an appropriate server machine(s) controlling and resourcing the network provision. A PC (e.g. IBM PC or compatible) or a workstation (e.g. Sun SPARC) may be used as the desk top computer (i.e. client / local machine), each being connected using a local area network (LAN) (e.g. Ethernet) to the server(s). I 1..Il I l+1 I m+1 In • • Im • • A1 ..A l Al+1..Am Mm+1 Mn non-decomposable semi-decomposable decomposable Legacy Database Service Legacy Legacy • • • • Database Database Data I - Interface module A - Application module with legacy database services M - Decomposed application module Figure 4.2 : Legacy IS Architecture 4.3.2 Software Environment The target database software is typically based on a relational DBMS with a 4GL, SQL, report writers and CASE tools (e.g. Oracle v7 with Oracle CASE). Use of such software provides many benefits to its users, such as an increase in program / data independence, introduction of modularised software components, graphical user interfaces, reduction in code, ease of maintenance, support for future evolution and integration of heterogeneous systems. The target database can be centralised on a single server machine or distributed over multiple servers in a networked environment. The target system may consist of application modules representing the decomposed system components, each having its corresponding graphical user interface (see figure 4.3). A typical architecture for a modern IS consists of layers for each of the system functions (e.g. interface, application, database, network) as identified in [BRO93, BRO95]. In figure 4.3 we introduce such an architecture with special emphasis on the database service, which will be a modern DBMS. Page 57
  • 59. Chapter 4 Migration of legacy ISs GUI1 GUIi GUIn • • • • M1 Mi Mn Target DBMS Target Target Target Database • • Database • • Database GUI - graphical user interface module M - Decomposed application module Figure 4.3 : Target IS Architecture The complete migration process involves significant changes, not only in hardware and software of the applications but also in the skills required by users and management. Thus they will have to be trained or replaced to operate the target IS. These changes must be done in some organised manner as the complete migration process itself is complex, and may take months or even years depending on the size and complexity of the legacy IS. The number of persons involved in the migration process and the resources available also contribute towards determining the ultimate duration and cost of the migration. 4.4 Migration Strategies The migration process for legacy ISs may take one of two main forms [BRO93, BRO95]. The first approach involves rewriting a legacy IS from scratch to produce the target IS using modern software techniques (i.e. a complete migration). The other approach involves gradually migrating the legacy IS in small steps until the desired long term objective is reached (i.e. incremental migration). The approach of complete rewriting carries substantial risks and has failed many times in large organisations as it is not an easy task to perform, especially when dealing with systems that must remain operational throughout the process, or large ISs [BRO93, BEN95, BRO95]. However, if the incremental approach fails, then only the failed step must be repeated rather than redoing the entire migration. Hence, it is argued [BRO95] that the latter approach involves a lower risk and is more appropriate in most situations. These approaches are further described in the next two subsections. Our work is directed towards assisting this incremental migration approach. 4.4.1 Complete Migration The process of complete migration involves rewriting a legacy IS from scratch to produce the intended target IS. This approach carries a substantial risk. We discuss some of the reasons for this risk to explain why this approach is not considered to be suitable by us. These are: a) A better system is expected. Page 58
  • 60. Chapter 4 Migration of legacy ISs A 1-1 rewrite of a complex IS is nearly impossible to undertake, as additional functions not present in the original system are expected to be provided by the target IS. Besides, it is a significant problem to evolve a developing replacement IS in step with an evolving legacy IS and to incorporate in both ongoing changes in business requirements. Changes to and requests to evolve ISs may occur at any time, without warning, and hence, it is difficult to incorporate any minor / major ad hoc changes to the new system as they may not fit into its design. Also, an attempt to change this design may violate its original goals and contribute towards a never ending cycle of development changes. b) Specifications rarely exist for the current system. Documentation for the current system is often non-existent and typically available only in the form of the code14 itself. Due to the evolution of the IS many undocumented dependencies will have been added to the system without the knowledge of the legacy IS owners (i.e. uncontrolled enhancements have occurred). These additions and dependencies must be identified and accommodated when rewriting from scratch. This adds to the complexity of a complete rewriting process and raises the risk of unpredicted failure of dependent ISs when we rewrite a legacy system as they are dependent on undocumented features of that system. c) Information system is too critical to the organisation. Many legacy ISs must be operational almost all the time and cannot be dormant during evolution. This means that migrating live data from a legacy IS to the target IS may take more time than the business can afford to be without its mission critical information. Such situations often prohibit complete rewriting altogether and make this approach a non-starter. It also means that a carefully thought out staged migration plan must be followed in this situation. d) Management of large projects is hard. The management of large projects involves managing more and more people. This normally results in less and less productive work because of the effort required to manage organisational complexity. As a result the timing of most large projects is seriously under-estimated. Frequently, this results in partial or complete abortion of the project, as the inability to keep up with original targets due to time lost is not always tolerated by an impatient company management. 4.4.2 Incremental Migration An incremental migration process involves a series of steps, each requiring a relatively small resource allocation (e.g. a few person weeks or months in the case of small or medium scale systems), and a short time to produce a specific small result towards the desired goal. This is in sharp contrast to the complete rewrite approach which may involve a large resource allocation (e.g. several person months or years), and a development project spanning several years to produce a single massive result. To perform a migration involving a series of steps, it is important to identify 14 This code is sometimes provided only in the form of executable code, as ISs are often in-house developments. Page 59
  • 61. Chapter 4 Migration of legacy ISs independent increments (i.e. portions of the legacy interfaces, applications and databases that can be migrated independently of each other), and sequence them to achieve the desired goal. However, as legacy ISs have a wide spectrum of forms from well-structured to unstructured, this process is complex and usually has to deal with unavoidable problems due to dependencies between migration steps. The following are the most important steps to apply in an incremental migration approach: a) Iteratively migrate the computing environment. The target environment must be selected, tested and established based on the total target IS requirements. To determine the target IS requirements, it may be necessary to partially or totally analyse and decompose the legacy IS. The installation of the target environment typically involves installing a desk top computer for each target IS user and appropriate server machines, as identified in section 4.3.1. The process of replacing dumb terminals with a PC or a workstation and connecting them with a local area network can be done incrementally. This process allows the development of the application modules and GUIs on an inexpensive local machine by downloading the relevant code from a server machine, rather than by working on the server itself to develop this software. Software and hardware changes are gradually introduced in this approach along with the necessary user and management training. Hence, although we explicitly refer to a particular process there are many processes that take place simultaneously. This is due to the involvement of many people in the overall migration activity, with each person contributing towards the desired migration goal in a controlled and measurable way. Our work is concerned with iteratively migrating part of the legacy software (i.e. the database service) and not the computing environment. Therefore we worked on a preinstalled target software and hardware environment. b) Iteratively analyse the legacy information system. The purpose of this process is to understand the legacy IS in detail so that ultimately the system can be modified to consist of decomposable modules. Any existing documentation, along with the system code are used for this analysis. Knowledge and experience from people who support and manage the legacy IS is also used to document the existing and the target IS requirements. This knowledge has played a key role in other migration projects [DED95]. Some existing code analysis tools such as Reasoning Systems' Software Refinery and Bachman Information Systems' Product Set [COMP90, CLA92, BRO93, MAR94] can be used to assist in this process. It may be useful to conduct experiments to examine the current system using its known interfaces and available tools (e.g. CASE tools), so that the information gathered with one tool can be reused by other tools. Here, functions and the data content of the current system are analysed to extract as much information as possible about the legacy IS. Other available information for this type of analysis includes: documentation, discussions with users, dumps (system backups), the history of system operation and the services it provides. We do not perform any code analysis as part of our work. However, the analysis we do by automated conceptual modelling identifies the usage of the data structures of the legacy IS. Our Page 60
  • 62. Chapter 4 Migration of legacy ISs analysis assists in identifying the structural components of the legacy IS and their functional dependencies. This information may then be used to restructure the legacy code. c) Iteratively decompose the legacy information system structure. The objective of this process is to construct well-defined interfaces between the modules and the database services of the legacy IS. The process may involve restructuring the legacy IS and removing dependencies between modules. It will thereby simplify the migration, that otherwise must support all these dependencies. This step may be too complex in the worst case, when the legacy IS will have to remain in its original form. Such a situation will complicate the migration process and may result in increased cost, reduced performance and additional risk. However, in such cases an attempt to perform even limited restructuring could facilitate the migration, and is preferable to totally avoiding the decomposition step altogether. We investigate supporting some structural changes in order to improve the existing structures of the legacy database (e.g. introduction of inheritance to represent hierarchical structures and specification of relationship structures). These changes eventually affect the application modules and the interfaces of the legacy IS. Hence there is a direct dependency with respect to decomposing the legacy database service and an indirect dependency with respect to decomposing the other components of the legacy IS. The actual testing of this indirect dependency was not considered due to its involvement with the application module. However, the ability to define referential integrity constraints and assertions spanning multiple tables allows us to redefine functional dependencies in the form of constraints or rules. When these constraints are stored in the database, it is possible to remove such dependencies from the legacy applications. This assists decomposition of some functional components of a legacy IS. d) Iteratively migrate the identified portions. An identified portion of the legacy IS may be an interface, application or a database service. These components are individually migrated to the target environment. When this is done the migrated portion will then run in the target environment with the remaining parts of the legacy system continuing to operate. A gateway is used to encapsulate system components undergoing changes. The objective of this gateway is to hide the ongoing changes in the application and the database service from the system users. Obviously any change made to the appearance of any user interface components will be noticeable, along with any significant performance improvements in application modules processing large volumes of data. Our work is applicable only to a legacy database service and hence any incremental migration of interfaces or application modules is not considered at this stage. The complete migration of legacy data takes a significant amount of time from hours to days depending on the volume of data held. During this process no updates or additions can be made to the data as they cause inconsistency problems. This means all functions of the database application have to be stopped to perform a complete data migration in one go. For large organisations this type of action is not appropriate. Hence iterative migration of selected data portions is desirable. 
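Before a selected portion is copied, it can be checked against the constraints that the target schema will enforce, so that offending rows are found while the legacy source is still authoritative. A hedged sketch, reusing the Employee/Department example and assuming the target carries the constraints of chapter 3:

    -- Rows that the target's referential constraint would reject:
    SELECT EmpNo, WorksFor
    FROM   Employee
    WHERE  WorksFor IS NOT NULL
      AND  WorksFor NOT IN (SELECT DeptCode FROM Department);

    -- Rows that would violate the target's check constraint on Age (cf. figure 3.6):
    SELECT EmpNo, Age
    FROM   Employee
    WHERE  Age NOT BETWEEN 21 AND 65;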
To ensure a successful migration, each chosen portion needs to be validated for consistency and guarded against being rejected by the target database. When migrating data in stages it is necessary to be aware of
  • 63. Chapter 4 Migration of legacy ISs the two sources of data as queries involving a migrated portion need to be re-directed to the target system while other queries must continue to access the legacy database. This process may cause a delay when a response for the query involves both the legacy and target databases. Hence it is important to minimise this delay by choosing independent portions wherever possible for the migration process. 4.5 Migration Method A formal approach to migrating legacy ISs has been proposed by Brodie [BRO93, BRO95] based on his experience in the field of telecommunication and other related projects. These methods, referred to as forward, backward/reverse and general (a combination of forward and backward) migration, are based on his “chicken little” incremental migration process. A forward migration incrementally analyses the legacy IS and then decomposes its structure, while incrementally designing the target IS or installing the target environment. In this approach the database is migrated prior to the other IS components and hence unchanged legacy applications are migrated forward onto a modern DBMS before they are migrated to new target applications. Conversely, backward migration creates the target applications and allows them to access the legacy data as the database migration is postponed to the end. General migration is more complex as it uses a combination of both these methods based on the characteristics of the legacy application and databases. However, this is more suitable for most ISs as the approach can be tailored at will. The incremental migration process consists of a number of migration steps that together achieve the desired migration. Each step is responsible for a specific aspect of the migration (i.e. computer environment, legacy application, legacy data, system and user interfaces). The selection and ordering of each aspect of the migration may differ as it depends on the application, as well as the money and effort allocated for each process. Independent components can be migrated sequentially or in parallel. As we see here, the migration methods of Brodie deal with all components of a legacy IS. Our interest in this project is to focus on a particular component, namely the database service, and as a result a detailed review of Brodie’s migration methods is not relevant here. However, our approach has taken account of his forward migration approach as it first deals with the migration of the legacy database service and then allows the legacy applications to access the post-migrated data management environment through a forward database gateway. 4.6 Migration Support Tools There is no tool that is capable of performing a complete automatic migration for a given IS. However, there are tools that can assist at various stages of this process. Hence, categorising tools by their functions according to the stages of a migration process can help in identifying and selecting those most appropriate. There are three main types of tools, namely: gateways, analysers and migration tools, which can be of use at different stages of migration [BRO95]. For large migration projects, testing and configuration management tools are also of use. a) Gateways Page 62
  • 64. Chapter 4 Migration of legacy ISs The function of a gateway is to coordinate between different components of an IS and hide ongoing changes (i.e. to interfaces, data, applications and other system components being migrated) from users. One of the main functions of these tools is to intercept calls on an application or database service and direct them to the appropriate part of the legacy or target IS. To incrementally migrate a legacy IS to a target IS, we need to select independent manageable portions, replicate them in the target environment and give control to the new target modules while the legacy system is still operational. To perform such a transition in a fashion transparent to users, we need a software module (i.e. a gateway) which encapsulates system components that are undergoing change behind an unchanged interface. Such a software module manages information flow between two different environments, the legacy and target environments. Functions such as retrieval, processing, management and representation of data from various systems are expected from gateways. These expectations from a gateway managing a migration process are similar to those we have of DBMS’s to manage data. DBMSs were designed to provide general purpose data management and similarly the gateway needs to manage the migration process in a generalised form. Development of such a gateway is beyond the scope of this project as it may take several man years of effort. Hence our work will focus on some selected functionalities of a gateway, which are sufficient to produce a realistic prototype. We aim to provide a simplified form of the functionality of a gateway, which permits the evolution of an existing IS at the logical level, by creating a target database and managing an incremental migration of the existing database service in a way transparent to its users. This facility should be provided not only for centralised database systems, but also for heterogeneous distributed databases. This means our gateway functionality should support databases built using different types of DBMS. We expect some of this functionality to be incorporated in future DBMSs as part of their system functionality. Functions such as schema evolution, management of heterogeneous distributed databases and schema integration are expected capabilities of modern database products. b) Analysers These tools employ a wide variety of techniques in exploring the technical, functional and architectural aspects of an application program or database service to provide graphical and textual information about an IS. The functions of reverse and forward engineering are provided by these tools. Many tools are used in this way to analyse the different components of an IS. Most of these tools are semi-automatic as some form of user interaction is required to successfully complete their task. For example, an application or database translation process is automatic if the source program and data conform to all the standards supported by the tool. Otherwise, the translation process will be terminated with the unconvertible portions indicated, leaving the database administrator to complete the job manually by either correcting or re-programming those unconvertible portions of the source program into target language code. We experienced this situation when attempting to migrate an Oracle version 6 database to Oracle version 7, using the Oracle tools. 
In this case, Oracle failed to convert the date functions used to check constraints in its version 6 databases into the equivalent code in version 7 (the cause of this problem was Oracle version 6's use of non-standard date functions).
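The exact definitions involved are not shown here, but a hypothetical Oracle-style check of the following kind conveys the flavour of the problem: a translator expecting standard SQL date literals has no automatic equivalent for the vendor-specific conversion function and must leave such fragments for the DBA to recode manually.

   -- Hypothetical example only: a check built on a vendor-specific date function.
   ALTER TABLE employee
     ADD CONSTRAINT chk_birthdate
     CHECK (birthDate >= TO_DATE('1900-01-01', 'YYYY-MM-DD'));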
  • 65. Chapter 4 Migration of legacy ISs c) Migration tools These tools are responsible for creating the components of the target IS, including interfaces, data, data definitions and applications. d) Testing An important task is to ensure that the migrated target IS performs in the same way as its legacy original, with possibly better performance. For this task we need test beds to check the most amount of logic using the least amount of data. There are tools that allow for easy manipulation of testing functions like break points and data values. However, they do not help with the generation of test beds or validation of the adequacy of the testing process. Comparing the results that are generated using both systems will assist the achievement of a reasonable form of testing. This may not be sufficient to test new features such as the introduction of distributed computing functionality to our systems. It is up to the person involved to ensure that a reasonable amount of testing has been done to ensure the functionality and the accuracy of the new IS. e) Configuration management This type of tool is needed for large migration projects involving many people, to coordinate functions such as documentation, synchronisation, keeping track of changes made (auditing), management of revisions to system elements (version control), and automatic building of a particular version of a system component. Our work focuses on bringing these tools together into a single environment. We wish to analyse a legacy database service, hence the functions of reverse and forward engineering are of particular interest. We integrate these functions with some forward gateway and migration functions as they are the relevant components for us to address the enhancement and migration of a database service. Thus, we are not interested in all the features associated with migration support tools. The classification of reverse and re-engineering tools given in [SHA93, BRO95] provides a broad description of the functions of existing CASE tools. These include maintaining, understanding and enhancing existing systems; converting / migrating existing systems to new technology; and building new replacement systems. There are many tools which perform some of these functions. However, none of them is capable of performing the integrated task of all the above functions. This is one of the important requirements for future CASE tools. As it is practically impossible to produce a single tool to perform all these tasks, the way to overcome this deficiency is to provide a gateway that permits multiple tools to exchange information and hence provide the required integrated facility. The need to integrate different software components (i.e. database, spreadsheet, word processing, e-mail and graphic applications) has resulted in the production of some integration tools, such as DEC’s Link Works and Dictionary Software’s InCASE [MAY94]. However, what we need is to integrate data extraction and downloading tools with schema enhancement and evolution functions as they are together vital in the context of enhancing and migrating legacy databases. Page 64
  • 66. Chapter 4 Migration of legacy ISs Support for interoperability among various DBMSs and the ability to re-engineer a DBMS are important functions for a successful migration process. Of these two, the former has not been given any attention until very recently, and there has been some progress relating to the latter in the form of CASE tools. However, among the many CASE tools available only a handful support the re- engineering process. The reason for this is that most CASE tools focus on forward-engineering. In this situation, new or replacement software systems are being designed and appropriate code generated. The re-engineering process is a combination of both forward-engineering and reverse- engineering. The reverse-engineering process analyses the data structures of the databases of existing systems to produce a higher level representation of these systems. This higher level representation is usually in diagrammatic form and may be an entity-relationship diagram, data-flow diagram, cross-reference diagram or structure chart. We came across some tools that are commercially available for performing various tasks of the migration process. These include data access and / or extraction tools for Oracle [BAR90, HOL93, KRO93] and INGRES [RTI92] - two of our test DBMSs. Some other tools, mainly those capable of performing the initial reverse engineering task, are also identified here. These tools are not suitable for legacy ISs in general, as they fail to support a variety of DBMSs or the re- engineering of most pre-existing databases. Among the different tools available, tools such as gateways play a more significant role than others. When different database products are used in an organisation, there may be a need to use multiple tools for a single step of a migration process, conversely some tools may be of use for multiple steps. The process of using multiple tools for a migration is complex and difficult as most vendors have not yet addressed the need for tool interoperability. The survey carried out in [COMP90] identifies many reverse-engineering products. Among the 40 vendor products listed there, only three claimed to be able to reverse engineer Oracle, INGRES or POSTGRES databases (our test databases - see section 6.1) or any SQL based database products. These three products were: Deft CASE System, Ultrix Programming Environment and Foundation Vista. Of these products only Deft and Vista produced E-R diagrams. None of the products in the complete list supported POSTGRES, which was then a research prototype. Of the two products identified above, only Deft was able to read both Oracle and INGRES databases, while Vista could read only INGRES databases. This analysis indicated that interoperability among our preferred databases was rare and that it is not easy to find a tool that will perform the re-engineering process and support interoperability among existing DBMSs. Although the information published in [COMP90] may be now outdated, the literature published since then [SHA93, MAY94, SCH95] does not show that modern CASE tools have addressed the task of re-engineering existing ISs along with interoperability, both of which are essential for a successful migration process. However, the functionality of accessing and sharing information from various DBMSs via gateways like ODBC is a step towards achieving this task. One of the reasons for progress limitation is the inability to customise existing tools, which in turn prevents them being used in an integrated environment. 
This is confirmed to some extent by the failure of the leading Unix database vendor - Oracle - to provide such tools. Brodie and Stonebraker, in their book [BRO95], present a study of the migration of large legacy systems. It identifies an approach (chicken-little) and the commercial tools required to
  • 67. Chapter 4 Migration of legacy ISs support this approach for legacy ISs in general. In this project we have developed a set of tools to support an alternative approach for migrating legacy database services in particular. Thus Brodie and Stonebraker take account of the need to migrate the application processes with a database, using commercial tools, while in this thesis we concentrate on the development of integrated tools for enhancing and migrating a database service. 4.7 The Migration Process Having identified the migration strategies and methods applicable to our work, we can review our migration process. This process must start with a legacy IS as in figure 4.2 and end with a target IS as shown in figure 4.3. However, as we are not addressing the application and interface components of a legacy IS, their conversion is not part of this project. Our conceptualised constraint visualisation and enhancement system (CCVES) described in section 2.2 was designed to assist in preparing legacy databases for such a migration. Hence our migration process can be performed by connecting the legacy and target ISs using CCVES. This is shown in figure 4.4. The three essential steps performed by CCVES before the actual migration process occurs are shown using the paths highlighted in this figure as A, B and C, respectively. These are the same paths that were described in section 2.2. The identification of all legacy databases used by an application is made prior to the commencement of path A of figure 4.4. The reverse engineering process is then performed on any selected database. This process commences when the database schema and its initial constraints are extracted from the selected database and is completed when the database schema is graphically displayed in a chosen format. Any missing or new information is supplied via path B in the form of enhanced constraints, to allow further relationships and constraints to appear in the conceptual model. The constraint enforcement process of path C is responsible for issuing queries and applying these constraints to the legacy data and taking necessary actions whenever a violation is detected, before any migration occurs. This ensures that the legacy data is consistent with its enhanced constraints before migration. Once these steps are completed, a graceful transparent migration process can be undertaken via path D. Our work focuses only on evolving and migrating database services, hence path X representing the application migration is not done via CCVES. The evolution of database services includes increasing IS program / data independence by identifying and transferring legacy application services which are concerned with data management functions, like enforcement of referential constraints, integrity constraints, rules, triggers, etc., to the database service from the application. Our migration process performs the transformation of the legacy database to the target environment and passes responsibility for enforcing the newly identified constraints to this system. Figure 4.4 indicates that our approach commences with a reverse engineering process. This is followed by a knowledge augmentation process which itself is a function of a forward engineering process. These two stages together are referred to as re-engineering (see section 5.1). The constraint enforcement process is the next stage of our approach. 
This is associated with the enhanced constraints of the previous stage as it is necessary to validate the existing and enhanced constraint specifications against the data held. These three preparatory stages are described in chapter 5. The final stage of our approach is the database migration process. This is described later after we have
fully discussed the application of the earlier stages in relation to our test databases.
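One simple way to picture the transparency aimed at in path D is a view, maintained by the gateway, that hides whether a given portion of the data has been migrated yet. This is only a sketch of the idea: the table names are illustrative, the two portions are assumed to be disjoint, and reaching both databases from one point is assumed to be possible (for example through a database link).

   -- Applications continue to query 'student'; the view decides where the rows
   -- actually live during the incremental migration (illustrative names).
   CREATE VIEW student AS
     SELECT student_no, name, dept FROM legacy_student
     UNION ALL
     SELECT student_no, name, dept FROM target_student;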
  • 70. CHAPTER 5 Re-engineering Relational Legacy Databases This chapter addresses the re-engineering process and issues concerned with relational legacy DBMSs. Initially, the reverse-engineering process for relational databases is overviewed. Next, we introduce our re-engineering approach, highlighting its important stages and the role of constraints in performing these stages. We then present how existing legacy databases can be enhanced with modern concepts and introduce our knowledge representation techniques which allow the holding of the enhanced knowledge in the legacy database. Finally, we describe the optional constraint enforcement process which allows validation of existing and enhanced constraint specifications against the data held. 5.1 Re-engineering Relational Databases Software such as programming code and databases is re-engineered for a number of reasons: for example, to allow reuse of past development efforts, reduce maintenance expense and improve software flexibility [PRE94]. This re-engineering process consists of two stages, namely: a reverse- engineering and a forward-engineering process. In database migration the reverse-engineering process may be applied to help migrate databases between different vendor implementations of a particular database paradigm (e.g. from INGRES to Oracle), between different versions of a particular DBMS (e.g. Oracle version 3 to Oracle version 7) and between database types (e.g. hierarchical to modern relational database systems). The forward-engineering process, which is the second stage of re-engineering, is performed on the conceptual model derived from the original reverse-engineering process. At this stage, the objective is to redesign and / or enhance an existing database system with missing and / or new information. The application of reverse-engineering to relational databases has been widely described and applied [DUM81, NAV87, DAV87, JOH89, MAR90, CHI94, PRE94, WIK95b]. The latest approaches have been extended to construct a higher level of abstraction than the original E-R model. This includes the representation of object-oriented concepts such as generalisation / specialisation hierarchies in a reversed-engineered conceptual model. Due to parallel work that had occurred in this area in the recent years, there are some similarities and differences in our reverse-engineering approach [WIK95b] when compared with other recent approaches [CHI94, PRE94]. In the next sub-sections we shall refer to them. The techniques used in the reverse-engineering process consist of identifying common characteristics as identified below: • Identify the database’s contents such as relations and attributes of relations. • Determine keys, e.g. primary keys, candidate keys and foreign keys. • Determine entity and relationship types. • Construct suitable data abstractions, such as generalisation and aggregation structures.
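The first two of these steps can often be carried out directly against the DBMS's own catalogue. For example, on an Oracle system queries of roughly the following form list the relations with their attributes and any declared primary, foreign or uniqueness constraints; other DBMSs expose equivalent information under different catalogue names, and the access routines actually used for our test DBMSs are described in section 6.5.

   -- Relations and their attributes, from the Oracle data dictionary.
   SELECT table_name, column_name, data_type
   FROM   user_tab_columns
   ORDER  BY table_name, column_id;

   -- Declared primary key ('P'), foreign key ('R') and uniqueness ('U') constraints.
   SELECT table_name, constraint_name, constraint_type
   FROM   user_constraints
   WHERE  constraint_type IN ('P', 'R', 'U');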
  • 71. Chapter 5 Re-engineering relational legacy DBs 5.1.1 Contents of a relational database Diverse sources provide information that leads to the identification of a database’s contents. These include the database’s schema, observed patterns of data, semantic understanding of application and user manuals. Among these the most informative source is the database’s schema, which can be extracted from the data dictionary of a DBMS. The observed patterns of data usually provide information such as possible key fields, domain ranges and the related data elements. This source of information is usually not reliable as invalid, inconsistent, and incomplete data exists in most legacy applications. The reliability can be increased by using the semantics of an application. The availability of user manuals for a legacy IS is rare and they are usually out of date, which means they provide little or no useful information to this search. Data dictionaries of relational databases store information about relations, attributes of relations, and rapid data access paths of an application. Modern relational databases record additional information, such as primary and foreign keys (e.g. Oracle), rules / constraints on relations (e.g. INGRES, POSTGRES, Oracle) and generalisation hierarchies (e.g. POSTGRES). Hence, analysis of the data dictionaries of relational databases provides the basic elements of a database schema, i.e. entities, their attributes, and sometimes the keys and constraints, which are then used to discover the entity and relationship types that represent the basic components of a conceptual model for the application. The trend is for each new product release to support more sophisticated facilities for representing knowledge about the data. 5.1.2 Keys of a relational data model Theoretically, three types of key are specified in a relational data model. They are primary, candidate and foreign keys. Early relational DBMSs were not capable of implicitly representing these. However, sometimes indexes which are used for rapid data access can be used as a clue to determine some keys of an application database. For instance, the analysis of the unique index keys of a relational database provides sufficient information to determine possible primary or candidate keys of an application. The observed attribute names and data patterns may also be used to assist this process. This includes attribute names ending with ‘#’ or ‘no’ as possible candidate keys, and attributes in different relations having the same name for possible foreign key attributes. In the latter case, we need to consider homonyms to eliminate incorrect detections and synonyms to prevent any omissions due to the use of different names for the same purpose. Such attributes may need to be further verified using the data elements of the database. This includes explicit checks on data for validity of uniqueness and referential integrity properties. However the reverse of this process, i.e. determining a uniqueness property from the data values in the extensional database is not a reliable source of information, as the data itself is usually not complete (i.e. it may not contain all possible values) and may not be fully accurate. Hence we do not use this process although it has been used in [CHI94, PRE94]. The lack of information on keys in some existing database specifications has led to the use of data instances to derive possible keys. 
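Such explicit checks can be phrased as ordinary queries over the data. The sketch below tests a suspected foreign key, deptno of Employee referencing Department, by looking for values with no counterpart in the referenced table; an empty answer is consistent with the referential property, although, as noted above, the data alone cannot prove it. A similar query using GROUP BY and HAVING COUNT(*) > 1 tests a suggested uniqueness property.

   -- Values of the suspected foreign key employee.deptno that have no matching
   -- department row; any row returned shows the referential property is violated.
   SELECT DISTINCT e.deptno
   FROM   employee e
   WHERE  e.deptno IS NOT NULL
   AND    NOT EXISTS (SELECT 1 FROM department d WHERE d.deptno = e.deptno);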
However it is not practicable to automate this process as some entities have keys consisting of multiple attributes. This means many permutations would have to be considered to test for all possibilities. This is an expensive operation when the volume of data and / or the number of attributes is large.
In [CHI94], a consistent naming convention is applied to key attributes. Here attributes used to represent the same information must have the same name, and as a result referencing and referenced attributes of a binary relationship between two entities will have the same attribute names in the entities involved. This naming convention was used in [CHI94] to determine relationship types, as foreign key specifications are not supported by all databases. An important contribution of our work is to support the identification of foreign key specifications for any database and hence the detection of relationships, without performing any name conversions. We note that some reverse-engineering methods rely on candidate keys (e.g. [NAV87, JOH89]), while others rely on primary keys (e.g. [CHI94]). These approaches insist on their users meeting their pre-requisites (e.g. specification of missing keys) to enable the user to successfully apply their reverse-engineering process. This means it is not possible to produce a suitable conceptual model until the pre-requisites are supplied. For a large legacy database application the number of these could exceed a hundred and hence it is not appropriate to rely on such pre-requisites being met to derive an initial conceptual model. Therefore, we concentrate on providing an initial conceptual model using only the available information. This will ensure that the reverse-engineering process will not fail due to the absence of any vital information (e.g. the key specification for an entity).

5.1.3 Entity and Relationship Types of a data model

In the context of an E-R model an entity is classified as strong (in some literature this type of entity is referred to as a regular entity, e.g. [DAT95]) or weak depending on an existence-dependent property of the entity. A weak entity cannot exist without the entity it is dependent on. The enhanced E-R model (EER) [ELM94] identifies more entity types, namely: composite, generalised and specialised entities. In section 3.3.1 we described these entity types and the relationships formed among them. Different classifications of entities are due to their associative properties with other entities. The identification of an appropriate entity type for each entity will assist in constructing a graphically informative conceptual model for its users. The extraction of information from legacy systems to classify the appropriate entity type is a difficult task as such information is usually lost during an implementation. This is because implementations take different forms even within a particular data model [ELM94]. Hence, an information extraction process may need to interact with a user to determine some of the entity and relationship types. The type of interaction required depends on the information available for processing and will take different forms. For this reason we focus only on our approach, i.e. determining entity and relationship types using enhanced knowledge such as primary and foreign key information. This is described in section 5.4.

5.1.4 Suitable Data Abstractions for a data model

Entities and relationships form the basic components of a conceptual data model. These components describe specific structures of a data model. A collection of entities may be used to represent more than one data structure. For example, entities Person and Student may be represented as a 1:1 relationship or as an is-a relationship. Each representation has its own view and hence the user understanding of the data model will differ with the choice of data structure. Hence it is important to be able to introduce any data structure for a conceptual model and view it using the most
  • 73. Chapter 5 Re-engineering relational legacy DBs suitable data abstraction. Data structures such as generalisation and aggregation have inherent behavioural properties which give additional information about their participating entities (e.g. an instance of a specialised entity of a generalisation hierarchy is made up from an instance of its generalised entity). These structures are specialised relationships and representation of them in a conceptual model provides a higher level of data abstraction and a better user understanding than the basic E-R data model gives. These data abstractions originated in the object-oriented data model and they are not implicitly represented in existing relational DBMSs. Extended-relational DBMSs support the O-O paradigm (e.g. POSTGRES) with generalisation structures being created using inheritance definitions on entities. However in the context of legacy DBMSs such information is not normally available, and as a result such data abstractions can only be introduced either by introducing them without affecting the existing data structures or by transforming existing entities and relationships to support their representation. For example, entities Staff and Student may be transformed to represent a generalisation structure by introducing a Person entity. Other forms of transformation can also be performed. These include decomposing all n-ary relationships for n > 3 into their constituent relationships of order 2 to remove such relationships and hence simplify the association among their entities. At this stage double buried relationships are identified and merged and relationships formed with subclasses are eliminated. Transitive closure relationships are also identified and changed to form simplified hierarchies. We use constraints to determine relationships and hierarchies. By controlling these constraints (i.e. modifying or deleting them) it is possible to transform or eliminate necessary relationships and hierarchies. 5.2 Our Re-engineering Process Our re-engineering process has two stages. Firstly, the relational legacy database is accessed to extract the meta-data of the application. This extracted meta-data is translated into an internal representation which is independent of the vendor database language. This information is next analysed to determine the entity and relationship types, their attributes, generalisation / specialisation hierarchies and application constraints. The conceptual model of the database is then derived using this information and is presented graphically for the user. This completes the first stage which is a reverse-engineering process for a relational database. To complete the re-engineering process, any changes to the existing design and any new enhancements are done at the second stage. This is a forward-engineering process that is applied to the reverse-engineered model of the previous stage. We call this process constraint enhancement as we use constraints to enhance the stored knowledge of a database and hence perform our forward- engineering process. These constraint enhancements are done with the assistance of the DBA. 5.2.1 Our Reverse-Engineering Process Our reverse-engineering process concentrates on producing an initial conceptual model without any user intervention. This is a step towards automating the reverse-engineering process. However the resultant conceptual model is usually incomplete due to the limited meta-knowledge available in most legacy databases. 
Also, as a result of incomplete information and unseen inclusion
dependencies we may represent redundant relationships as well as fail to identify some of the entity and / or relationship types. We depend on constraint enhancement (i.e. the forward-engineering process) to supply this missing knowledge so that subsequent conceptual models will be more complete. The DBA can investigate the reverse-engineered model to detect and resolve such cases with the assistance of the initial display of that model. The system will need to guide the DBA by identifying missing keys and assisting in specifying keys and other relevant information. It also assists in examining the extent to which the new specifications conform to the existing data.

Our reverse-engineering process does not depend on information about specialised constraints. When no information about these is available, we treat all entities of a database to be of the same type (i.e. strong entities) and any links present in the database will not be identified. In such a situation the conceptual model will display only the entities and attributes of the database schema without any links. For example, a relational database schema for a university college database system with no constraint-specific information will initially be viewed as shown in figure 5.1. This is the usual case for most legacy databases as they lack constraint-specific information. However, the DBA will be able to provide any missing information at the next stage so that any intended data structures can be reconstructed. Obviously if some constraints are available our reverse-engineering process will try to derive possible entity types and links during its initial application.

Figure 5.1 : A relational legacy database schema for a university college database (the figure shows the entities University, College, Faculty, Department, Employee, Student and EmpStudent with their attributes and no connecting links)

Our reverse-engineering process first identifies all the available information by accessing the legacy database service (cf. section 5.3). The accessed information is processed to derive the relationship and entity types for our database schema (cf. section 5.4). These are then presented to the user using our graphical display function.

5.2.2 Our Forward-Engineering Process

The forward-engineering process is provided to allow the designer (i.e. DBA) to interact with a conceptual model. The designer is responsible for verifying the displayed model and can supply any additional information to the reverse-engineered model at this stage. The aim of this process is to allow the designer to define and add any of the constraint types we identified in section 3.5 (i.e. primary key constraints, foreign key constraints, uniqueness constraints, check constraints, generalisation / specialisation structures, cardinality constraints and other constraints) which are not present. Such additions will enhance the knowledge held about a legacy database. As a result, new links and data abstractions that should have been in the conceptual model can be derived using our reverse-engineering techniques and presented in the graphical display. This means that the legacy database schema originally viewed as in figure 5.1 can be enhanced with constraints and presented as in figure 5.2, which is a vast improvement on the original display. Such an enhanced display demonstrates the extent to which a user's understanding of a legacy database schema can be improved by providing some additional knowledge about the database. In sections 6.3.4 and 6.4 we introduce the enhanced constraints of our test databases including those used to improve the legacy database schema of figure 5.1 to figure 5.2.

Figure 5.2 : The enhanced university college database schema (the figure shows the same application with a generalised Person entity above Employee and Student, EmpStudent as a subtype of both, a generalised Office entity with specialisations College-Office, Faculty-Office and Dept-Office that rename siteName, unitName and inCharge, an aggregation between University and Office, and cardinality constraints such as 4+ and 2-12 attached to relationships)

We support the examination of existing specifications and identification of possible new specifications (cf. section 5.5) for legacy databases. Once these are identified, the designer defines new constraints using a graphical interface (cf. section 5.6). The new constraint specifications are stored in the legacy database using a knowledge augmentation process (cf. section 5.7). We also supply a constraint verification module to give users the facility to verify and ensure that the data conforms to all the enhanced constraints (cf. section 5.8) being introduced.

5.3 Identifying Information of a Legacy Database Service

Schema information about a database (i.e. meta-data) is stored in the data dictionaries of that database. The representation of information in these data dictionaries is dependent on the type of the DBMS. Hence initially the relational DBMS and the databases used by the application are identified. The name and the version of the DBMS (e.g. INGRES version 6), the names of all the databases in
  • 76. Chapter 5 Re-engineering relational legacy DBs use (e.g. faculty / department), and the name of the host machine (e.g. llyr.athro.cf.ac.uk) are identified at this stage. These are the input data that allows us to access the required meta-data. As the access process is dependent on the type of the DBMS, we describe this process in section 6.5 after specifying our test DBMSs. This process is responsible for identifying all existing entities, keys and other available information in a legacy database schema. 5.4 Identification of Relationship and Entity Types Once the entities and their attributes along with primary and candidate keys have been provided, we are ready to classify relationships and entity types. Three types of binary relationships (i.e. 1:1, 1:N and N:M) and five types of entities (i.e. strong, weak, composite, generalised and specialised) are identified at this stage. Initially we assume that all entities are strong and look for certain properties associated with them (mainly primary and foreign key), so that they can be reclassified into any of the other four types. Weak and composite entities are identified using relationship properties and generalised / specialised entities are determined using generalisation hierarchies. 5.4.1 Relationship Types (a) A M:N relationship If the primary key of an entity is formed from two foreign keys then their referenced entities participate in an M:N relationship. This is a special case of n-ary relationship involving two referenced entities (see section ‘a’ of figure 5.3). This entity becomes a composite entity having a composite key. For example, entity Option with primary key (course,subject) participates in an M:N relationship as the primary key attributes are foreign keys - see tables 6.2, 6.4 and 6.6 (later). In a n-ary relationship (e.g. 3-ary or ternary if the number of foreign keys is 3, see section ‘b’ of figure 5.3) the primary key of an entity is formed from a set of n foreign keys. As stated in section 5.1.4, n-ary relationships for n > 3 are usually decomposed into their constituent relationships of order 2 to simplify their association. Hence we do not specifically describe this case. For example, entity Teach with primary key (lecturer, course, subject) participates in a 3-ary relationship when lecturer, course and subject are foreign keys referencing entities Employee, Course and Subject, respectively. However, as Option is made up using Course and Subject entities we could decompose this 3-ary relationship into a binary relationship by defining course and subject of Teach to be a foreign key referencing entity Option - see tables 6.2, 6.4 and 6.6. Page 75
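Expressed in SQL, the pattern of case (a) for the Option example looks roughly as follows. The column types and constraint layout are illustrative only (the actual test tables appear in chapter 6), and OPTION is a reserved word in some SQL dialects, so a real definition may need a quoted or different identifier; the point is simply that the primary key is made up entirely of two foreign keys.

   CREATE TABLE option (
     course   CHAR(6) NOT NULL,          -- foreign key to Course
     subject  CHAR(6) NOT NULL,          -- foreign key to Subject
     PRIMARY KEY (course, subject),      -- primary key formed from the two foreign keys
     FOREIGN KEY (course)  REFERENCES course,   -- references the primary key of Course
     FOREIGN KEY (subject) REFERENCES subject   -- references the primary key of Subject
   );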
Figure 5.3: Mapping foreign key references to an ER relationship type. (The figure tabulates the following mappings: (a) a primary key formed from two foreign keys maps to a binary M:N relationship; (b) a primary key formed from n foreign keys, n > 2, maps to an n-ary relationship, e.g. a ternary relationship for n = 3; (c) a foreign key attribute that is part of the primary key, where the remaining part does not contain a key of any other relation, maps to a weak 1:N relationship; (d) a non-key, non-unique foreign key attribute maps to a 1:N relationship; (e) a non-key foreign key attribute with a uniqueness property maps to a 1:1 relationship. PK - primary key, FK - foreign key, e - referencing entity, re - referenced entity.)

Sometimes a foreign key refers to the same entity, forming a unary relationship, like in the case where some courses may have pre-requisites. In this case the attribute pre-requisites of entity Course is a foreign key referencing the same entity.

(b) A 1:N relationship

There are two types of 1:N relationships. One is formed with a weak entity and the other with a strong entity. If part of the primary key of an entity is a foreign key and the other part does not contain a key of any other relation, then the entity concerned is a weak entity and will participate in a weak 1:N relationship (see section 'c' of figure 5.3) with its referenced entity. For example, entity Committee with primary key (name, faculty) is a weak entity as only a part of its primary key attributes (i.e. faculty) is a foreign key. A non-key foreign key attribute (i.e. an attribute that is not part of a primary key) that may have multiple values will participate in a strong 1:N relationship (see section 'd' of figure 5.3) if it does not satisfy the uniqueness property. For example, attribute tutor of entity Student is a non-key, non-unique foreign key referencing the entity Employee (cf. tables 6.2 to 6.4). Here tutor participates in a 1:N relationship with Employee - see table 6.6.

(c) A 1:1 relationship

A non-key foreign key attribute will participate in a 1:1 relationship if a uniqueness constraint is defined for that attribute (see section 'e' of figure 5.3). For example, attribute head of entity Department participates in a 1:1 relationship with entity Employee as it is a non-key foreign key with the uniqueness property, referencing Employee - see tables 6.2 to 6.4 and 6.6.
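The tutor and head examples just given can be written as explicit SQL constraints; the constraint names and the referenced column empNo are illustrative assumptions rather than the definitions used in our test databases.

   -- Case (d): tutor is a non-key foreign key with no uniqueness constraint,
   -- so Student participates in a 1:N relationship with Employee.
   ALTER TABLE student ADD CONSTRAINT fk_student_tutor
     FOREIGN KEY (tutor) REFERENCES employee (empNo);

   -- Case (e): head is a non-key foreign key that is also unique,
   -- so Department participates in a 1:1 relationship with Employee.
   ALTER TABLE department ADD CONSTRAINT uq_department_head UNIQUE (head);
   ALTER TABLE department ADD CONSTRAINT fk_department_head
     FOREIGN KEY (head) REFERENCES employee (empNo);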
  • 78. Chapter 5 Re-engineering relational legacy DBs The specialised and generalised entity pair of a generalisation hierarchy has a 1:1 is-a relationship. Hence it is possible to define a binary relationship in place of a generalisation hierarchy. For example, it is possible to define a foreign key (empNo) on entity EmpStudent, referencing entity Employee to form a 1:1 relationship instead of representing it as a generalisation hierarchy. Such cases must be detected and corrected by the database designer. We introduce inheritance constraints involving these entities to resolve such cases. 5.4.2 Entity Types (a) A strong entity This is the default entity type, as any entity that cannot be classified as one of the other types will be a strong (regular) entity. (b) A composite entity An entity that is used to represent an M:N relationship is referred to as a composite (link) entity (cf. section 5.4.1 (a)). The identification of M:N relationships will result in the identification of composite entities. (c) A weak entity An entity that participates in a weak 1:N relationship is referred to as a weak entity (cf. section 5.4.1 (b)). The identification of weak 1:N relationships will result in the identification of weak entities. (d) A generalised / specialised entity An entity defined to contain an inheritance structure (i.e. inheriting properties from others) is a specialised entity. Entities whose properties are used for inheritance are generalised entities. The identification of inheritance structures will result in the identification of specialised and generalised entities. An inheritance structure defines a single inheritance structure (e.g. entities X1 to Xj inherit from entity A in figure 5.4). However, a set of inheritance structures may form a multiple inheritance structure (e.g. entity Xj inherits from entities A and B in figure 5.4). To determine the existence of multiple inheritance structures we analyse all subtype entities of the database (e.g. entities X1 to Xn in figure 5.4) and derive their supertypes (e.g. entity A or B or both in figure 5.4). For example, entity EmpStudent inherits from Employee and Student entities forming a multiple inheritance, while entity Employee inherits from Person to form a single inheritance. Page 77
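Where the DBMS itself supports inheritance, such structures can be declared directly. The sketch below uses PostgreSQL-style SQL purely as an illustration (the POSTGRES research prototype used in this work provided a comparable inheritance facility through its own query language), and the column lists are abridged, illustrative versions of the attributes shown in figure 5.2.

   -- Person is the generalised entity; Employee and Student specialise it,
   -- and EmpStudent inherits from both, giving a multiple inheritance structure.
   CREATE TABLE person     (name VARCHAR(40), address VARCHAR(80), birthdate DATE);
   CREATE TABLE employee   (empno INTEGER, designation VARCHAR(20)) INHERITS (person);
   CREATE TABLE student    (collegeno INTEGER, course VARCHAR(10)) INHERITS (person);
   CREATE TABLE empstudent (remarks VARCHAR(80)) INHERITS (employee, student);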
Figure 5.4: Single and multiple inheritance structures using EER notations (entities X1, .., Xi, .., Xj inherit from entity A and entities Xj, .., Xn inherit from entity B)

5.5 Examining and Identifying Information

Our forward-engineering process allows the designer to specify new information. To successfully perform this task the designer needs to be able to examine the current contents of the database and identify possible missing information from it.

5.5.1 Examining the contents of a database

At this stage the user needs to be able to browse through all features of the database. Overall, this includes viewing existing primary keys, foreign keys, uniqueness constraints and other constraint types defined for the database. When inheritance is involved the user may need to investigate the participating entities at each level of inheritance. For more specific viewing, the user may want to investigate the behaviour of individual entities. This includes identifying constraints associated with a particular entity (i.e. intra-object properties) and those involving other entities (i.e. inter-object properties). Our system provides for this via its graphical interface. We describe viewing of these properties in section 7.5.1, as it is directly associated with this interface. Here, global information is tabulated and presented for each constraint type, while specific information (i.e. inter- and intra-object) presents constraints associated with an individual entity.

5.5.2 Identifying possible missing, hidden and redundant information

This process allows the designer to search for specific types of information, including information about the type of entities that do not contain primary keys, possible attributes for such keys, buried foreign key definitions and buried inheritance structures. In this section we describe how we identify this type of information.

i) Possible primary key constraints

Entities that do not contain primary keys are identified by comparing the list of entities having primary keys with the list of all entities of the database. When such entities are identified the user can view the attributes of these and decide on a possible primary key constraint. Sometimes, an entity may have several attributes and hence the user may find it difficult to decide on suitable primary key attributes. In such a situation the user may need to examine existing properties of that entity (cf. section 5.5.1) to identify attributes with uniqueness properties and no null values.
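A candidate chosen in this way can be screened against the data itself with a query of the following kind, anticipating the fuller verification of section 5.8; empNo of Employee is used here purely as an illustrative choice of key attribute.

   -- Any row returned reports a null or duplicated value of the proposed key
   -- attribute, which would rule out empNo as a primary key for Employee.
   SELECT empno, COUNT(*)
   FROM   employee
   GROUP  BY empno
   HAVING COUNT(*) > 1 OR empno IS NULL;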
  • 80. Chapter 5 Re-engineering relational legacy DBs Sometimes, attribute names such as those ending with ‘no’ or ‘#’ may give a clue in selecting the appropriate attributes. Once the primary key attributes have been decided the user may want to verify this choice against the data of the database (cf. section 5.8). ii) Possible foreign key constraints Existence of either an inclusion dependency between a non-key attribute of one table and a key attribute of another (e.g. deptno of Employee and deptno of Department), or a weak or n-ary relationship between a key attribute and part of a key attribute (e.g. cno of strong entity Course and cno of link entity Teach) implies the possible existence of a foreign key definition. Such possibilities are detected by matching attribute names satisfying the required condition. Otherwise, the user needs to inspect the attributes and detect their possible occurrence (e.g. if attribute name worksfor instead of deptno was used in Employee). iii) Possible uniqueness constraints Detection of a uniqueness index gives a clue to a possible uniqueness constraint. All other indications of this type of constraint have to be identified by the user. iv) Possible inheritance structures Existence of an inclusion dependency between two strong entities having the same key implies a subtype / supertype relationship between the two entities. Such possible relationships are detected by matching identical key attribute names of strong entities (e.g. empno of Person and empno of Employee). Otherwise, the user needs to inspect the table and 1:1 relationships to detect these structures (e.g. if personid instead of empno was used in Person then the link between empno and personid would have to be identified by the user). In distributed database design some entities are partitioned using either horizontal or vertical fragmentation. In this situation strong entities having the same key will exist with a possible inclusion dependency between vertically fragmented tables. Such cases need to be identified by the designer to avoid incorrect classifications occurring. For example, employee records can be horizontally fragmented and distributed over each department as opposed to storing at one site (e.g. College). Also, employee records in a department may be vertically fragmented at the College site as the college is interested in a subset of information recorded for a department. v) Possible unnormalised structures All entities of a relational model are at least in 1NF, as this model does not allow multivalued attributes. When entities are not in 3NF (i.e. a non-key attribute is dependent on part of a key or another non-key attribute: violating 2NF or 3NF, respectively), there are hidden functional dependencies. These entities need to be identified and transformed into 3NF to show their dependencies. New entities in the form of views are used to construct this transformation. For example, entity Teach can be defined to contain attributes lecturer, course, subNo, subName and room. Here we see that subName is fully dependent on subNo and hence Teach is in 2NF. Using a view we separate subName from Teach and use it as a separate entity with primary key subNo. This Page 79
  • 81. Chapter 5 Re-engineering relational legacy DBs allows us to transform the original Teach to 3NF and view Subject and Teach as a binary, instead of an unary relationship. This will assist in improving conceptual model readability. vi) Possible redundant constraints Redundant inclusion dependencies representing projection or transitivity must be removed, otherwise incorrect entity or relationship types may be represented. For instance, if there is an inclusion dependency between entities A, B and B, C then the transitivity inclusion dependency between A, C is redundant. Such relationships should be detected and removed. For example, EmpStudent is an Employee and Employee is a Person, thus EmpStudent is a Person is a redundant relationship. Redundant constraints are often most obvious when viewing the graphical display of a conceptual model with its inter- and intra- object constraints. 5.6 Specifying New Information We can specify new information using constraints. In a modern DBMS which supports constraints we can use its query language to specify these. However this approach will fail for legacy databases as they do not normally support the specification of constraints. To deal with both cases we have designed our system to externally accept constraints of any type, but represent them internally by adopting the appropriate approach depending on the capabilities of the DBMS in use. Thus if constraint specification is supported by the DBMS in use we will issue a DDL statement (cf. figure 5.5 which is based on SQL-3 syntax) to create the constraint. If constraint specification is not supported by the DBMS in use we will store the constraint in the database using techniques described in section 5.7. These constraints are not enforced by the system but they may be used to verify the extent to which the data conforms with the constraints (cf. section 5.8). In both cases this enhanced knowledge is used by our conceptual model wherever it is applicable. The following sub- sections describe the specification process for each constraint type. We cover all types of constraints that may not be supported by a legacy system, including primary key. We use the SQL syntax to introduce them. In SQL, constraints are specified as column/table constraint definitions and can optionally contain a constraint name definition and constraint attributes (see sections A.3 and A.4) which are not included here. i) Specifying Primary Key Constraints Only one primary key is allowed for an entity. Hence our system will not allow any input that may violate this status. Once an entity is specified the primary key attributes are chosen. Modern SQL DBMSs will use the DDL statement ‘a’ of figure 5.5 to create a new primary key constraint, older systems do not have this capability in their syntax. ii) Foreign Key Constraints A foreign key establishes a relationship between two entities. Hence, when the enhancing constraint type is chosen as a foreign key, our system requests two entity names. The first is the referencing entity and the second the referenced entity. Once the entity names are identified the system automatically shows the referenced attributes. These attributes are those having the uniqueness property. When these attributes are chosen a new foreign key is established. This Page 80
constraint will only be valid if there is an inclusion dependency between the referencing and referenced attributes. Modern SQL DBMSs will use the DDL statement 'b' of figure 5.5 to create a new foreign key constraint in this situation. This statement can optionally contain a match type and referential triggered action (see section A.8) which are not shown here.

iii) Uniqueness Constraints

A uniqueness constraint may be defined on any combination of attributes. However such constraints should be meaningful (e.g. there is no point in defining a uniqueness constraint for a set of attributes when a subset of it already holds the uniqueness status), and should not violate any existing data. Modern SQL DBMSs will use the DDL statement 'c' of figure 5.5 to create a new uniqueness constraint.

(a) ALTER TABLE Entity_Name ADD CONSTRAINT Constraint_Name
        PRIMARY KEY (Primary_Key_Attributes)
(b) ALTER TABLE Entity_Name ADD CONSTRAINT Constraint_Name
        FOREIGN KEY (Foreign_Key_Attributes)
        REFERENCES Referenced_Entity_Name (Referenced_Attributes)
(c) ALTER TABLE Entity_Name ADD CONSTRAINT Constraint_Name
        UNIQUE (Uniqueness_Attributes)
(d) ALTER TABLE Entity_Name ADD CONSTRAINT Constraint_Name
        CHECK (Check_Constraint_Expression)
(e) ALTER TABLE Entity_Name ADD UNDER Inherited_Entities [ WITH (Renamed_Attributes) ]
(f) ALTER TABLE Entity_Name ADD CONSTRAINT Constraint_Name
        FOREIGN KEY (Foreign_Key_Attributes)
        [ CARDINALITY (Referencing_Cardinality_Value) ]
        REFERENCES Referenced_Entity_Name (Reference_Attributes)

Our optional extensions to the SQL-3 syntax are highlighted using bold letters here.
Figure 5.5 : Constraints expressed in extended SQL-3 syntax

iv) Check Constraints

A check constraint may be defined to represent a complex expression involving any combination of attributes and system functions. However such constraints should not be redundant (i.e. not a subset of an existing check constraint) and should not violate any existing data. Modern SQL DBMSs will use the DDL statement 'd' of figure 5.5 to create a new check constraint.

v) Generalisation / Specialisation Structures

An inheritance hierarchy may be defined without performing any structural changes if its existence can be detected by our process described in part 'd' of section 5.4.2. In this case we need to specify the entities being inherited (cf. statement 'e' of figure 5.5). If an inherited attribute's name differs from the target attribute name it is necessary to rename them. For example, attributes siteName, unitName and inCharge of Office are renamed to building, name and principal when it is inherited by College - see figures 6.2 and 6.3 (later). It is also possible to make some structural changes in order to introduce new generalisation / specialisation structures. In such situations new entities are created to represent the specialisation / generalisation. Appropriate data for these entities are copied to them during this process. For instance, in our university college example of figure 5.1, the entities College, Faculty and
  • 83. Chapter 5 Re-engineering relational legacy DBs Department can be restructured to represent a generalisation hierarchy, by introducing a generalised entity called Office and transforming the entities College, Faculty and Department to College-Office, Faculty-Office and Dept-Office, respectively (cf. figure 5.2). Once this transformation is done the entities Office, College-Office, Faculty-Office and Dept-Office will represent a generalisation hierarchy as shown in figure 5.2. Any change to existing structures and table names will affect the application programs which use them. To overcome this we introduce view tables in the legacy database to represent new structures. These tables are defined using the syntax of figure 5.6. For example, the generalised entity will be Office and the specialised entities will be College-Office, Faculty-Office and Dept-Office. The introduction of view tables means that legacy application code using the original tables will not be affected by the change. However, appropriate changes must be introduced in the target application code and database if we are going to introduce these features permanently. We introduced the concept of defining view tables in the legacy database to assist the gateway service in managing these structural changes. CREATE VIEW GeneralisedEntity (GeneralisedAttributes) AS SELECT Attributes FROM SpecialisedEntity [ [UNION SELECT Attributes FROM SpecialisedEntity] ..] CREATE VIEW SpecialisedEntity (SpecialisedAttributes) AS SELECT g1.Attributes [ [, g2.Attributes] ..] FROM GeneralisedEntity g1 [ [, GeneralisedEntity g2] ..] [ WHERE specialised-conditions ] Figure 5.6 : Creation of view table to represent a hierarchy vi) Cardinality Constraints Cardinality constraints specify the minimum / maximum number of instances associated with a relationship. In a 1:1 relationship type the number of instances associated with a relationship will be 0 or 1, and in a M:N relationship type it can take any value from 0 upwards. The ability to define more specific limits allows users not only to gain a better understanding about a relationship, but also to be able to verify its conformance by using its instances. We suggest creating such specifications using an extended syntax of the current SQL foreign key definition (cf. statement ‘f’ of figure 5.5) as this is the key which initially establishes this relationship. The minimum / maximum instance occurrences for a particular relationship of a referential value (i.e. cardinality values) can be specified using a keyword CARDINALITY as shown in figure 5.5. Here the Referencing_Cardinality_Value corresponds to the many side of a relationship. Hence the value of this indicates the minimum instances. When the referencing attribute is not null then the minimum cardinal value is 1, else it is 0. In our examples introduced in part ‘b’ of section 6.2.3, we have used ‘+’ to represent the many symbol (e.g. 0+ represents zero or many) and ‘-’ to represent up to (e.g. -1 represents 0 to 1). vii) Other Constraints In the example in figure 5.2, we have also shown an aggregation relationship between the entities University and Office. Here we have assumed that a reference to a set of instances can be defined. In such a situation, as with the other constraint types, an appropriate SQL statement should be used to describe the constraint and an appropriate augmented table such as those used in figure 5.7 must be used to record this information in the database itself. 
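As a minimal sketch, anticipating the extended syntax applied to the College database in figure 6.3 (later), such an aggregation could be declared by treating the office attribute of University as a reference to a set of Office instances; the REF SET keyword is our own extension rather than standard SQL, and the constraint name below is illustrative only:

ALTER TABLE University
      ALTER COLUMN office REF SET(Office) NOT NULL;
-- Alternatively, where REF SET is unavailable, the same link can be held
-- as an ordinary foreign key and recorded in the augmented tables of
-- figure 5.7 like any other referential constraint:
ALTER TABLE University
      ADD CONSTRAINT University_FK_Office
      FOREIGN KEY (office) REFERENCES Office;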
We discuss this case here to highlight that other constraint types can be introduced and incorporated into our
  • 84. Chapter 5 Re-engineering relational legacy DBs system using the same general approach. However our implementations have concentrated only on the constraints discussed above. The enhanced constraints, once they are absorbed into the system, will be stored internally in the same way as any existing constraints. Hence the reconstruction process to produce an enhanced conceptual model can utilise this information directly as it is fully automated. To hold the enhancements in the database itself we need to issue appropriate query statements. The enhancements can be effected using the SQL statements shown in figure 5.5 if the database is SQL based and such changes are implicitly supported by it. In section 5.7 we describe how this is done when the database supports such specifications (e.g. Oracle version 7) and when it does not (e.g. INGRES version 6). When the DBMS does not support SQL, the query statement to be issued is translated using QMTS [HOW87] to a form appropriate to the target DBMS. As there are variants of SQL16 we send all queries via QMTS so that the necessary query statements will automatically get translated to the target language before entering the target DBMS environment. 5.7 The Knowledge Augmentation Approach In this section we describe how the enhanced constraints are retained in a database. Our aim has been to make these enhancements compatible with the newer versions of commercial DBMSs, so that the migration process is facilitated as fully as possible. Many types of constraint are defined in a conceptual model during database design. These include relationship, generalisation, existence condition, identity and dependency constraints. In most current DBMSs these constraints are not represented as part of the database meta-data. Therefore, to represent and enforce such constraints in these systems, one needs to adopt a procedural approach which makes use of some embedded programming language code to perform the task. Our system uses a declarative approach (cf. section 3.6) for constraint manipulation, as it is easier to process constraints in this form than when they are represented in the traditional form of procedural code. 16 The date functions of most SQL databases (e.g. INGRES and Oracle) are different. Page 83
  • 85. Chapter 5 Re-engineering relational legacy DBs CREATE TABLE Table_Constraints ( CREATE TABLE Check_Constraints ( Constraint_Id char(32) NOT NULL, Constraint_Id char(32) NOT NULL, Constraint_Name char(32) NOT NULL, Constraint_Name char(32) NOT NULL, Table_Id char(32) NOT NULL, Check_Clause char(240) NOT NULL ); Table_Name char(32) NOT NULL, Constraint_Type char(32) NOT NULL, Is_Deferrable char(3) NOT NULL, CREATE TABLE Sub_tables ( Initially_Deferred char(3) NOT NULL ); Table_Id char(32) NOT NULL, Sub_Table_Name char(32) NOT NULL, CREATE TABLE Key_Column_Usage ( Super_Table_Name char(32) NOT NULL, Constraint_Id char(32) NOT NULL, Super_Table_Column integer(4) NOT NULL ); Constraint_Name char(32) NOT NULL, Table_Id char(32) NOT NULL, Table_Name char(32) NOT NULL, CREATE TABLE Altered_Sub_Table_Columns ( Column_Name char(32) NOT NULL, Table_Id char(32) NOT NULL, Ordinal_Position integer(2) ); Sub_Table_Name char(32) NOT NULL, Sub_Table_Column char(32) NOT NULL, CREATE TABLE Referential_Constraints ( Super_Table_Name char(32) NOT NULL, Constraint_Id char(32) NOT NULL, Super_Table_Column char(32) NOT NULL ); Constraint_Name char(32) NOT NULL, Unique_Constraint_Id char(32) NOT NULL, Unique_Constraint_Name char(32) NOT NULL, CREATE TABLE Cardinality_Constraints ( Match_Option char(32) NOT NULL, Constraint_Id char(32) NOT NULL, Update_Rule char(32) NOT NULL, Constraint_Name char(32) NOT NULL, Delete_Rule char(32) NOT NULL ); Referencing_Cardinal char(32) ); Figure 5.7: Knowledge-based table descriptions The constraint enhancement module of our system (CCVES) accepts new constraints (cf. figure 5.5) irrespective of whether they are supported by the selected DBMS. These new constraints are the enhanced knowledge which is stored in the current database, using a set of user defined knowledge-based tables, each of which represents a particular type of constraint. These tables provide general structures for all constraint types of interest. In figure 5.7 we introduce our table structures which are used to hold constraint-based information in a database. We have followed the current SQL-3 approach to representing constraint types supported by the standards. In areas which the current standards have yet to address (e.g. representation of cardinality constraints) we have proposed our own table structures. Thus, all general constraints associated with a table (i.e. an entity) are recorded in Table_Constraints. The constraint description for each type is recorded elsewhere in other tables, namely, Key_Column_Usage for attribute identifications, Referential_Constraints for foreign key definitions, Check_Constraints to hold constraint expressions, Sub_Tables to hold generalisation / specialisation structures (i.e. inherited tables), Altered_Sub_Table_Columns to hold any attributes renamed during inheritance, and Cardinality_Constraints to hold cardinal values associated with relationship structures. The use of these table structures to represent constraint-based information in a database depends on the type of DBMS in use and the features it supports. The features supported by a DBMS may differ from the standards to which it claims to conform, as database vendors do not always follow the standards fully when they develop their systems. However, DBMSs supporting the representation of constraints need not have identical table structures to our approach as they may have used an alternative way of dealing with constraints. 
In such situations it is not necessary to insist on the use of our table structures for constraint representation, as the database is capable of managing the constraints itself if we follow its approach. Therefore we need to identify which SQL standard each DBMS follows, and hence in which DBMSs we should introduce our own tables to hold enhanced constraints. In figure 5.8 we identify the tables required for the three SQL standards and for three selected DBMSs. The selected DBMSs were used as our test DBMSs, as we shall see in section 6.1. CCVES determines the knowledge-based tables required for the DBMS being used and
  • 86. Chapter 5 Re-engineering relational legacy DBs creates and maintains them automatically. The creation of these tables and the input of data to them are ideally done at the database application implementation stage, by extracting data from the conceptual model used originally to design a database. However, as current tools do not offer this type of facility, one may have to externally define and manage these tables in order to hold this knowledge in a database. Our system has been primed with the knowledge of the tables required for each DBMS it supports, and so it automatically creates these tables and stores information in them if the database is enhanced with new constraints. Here, Table_Constraints, Referential_Constraints, Key_Column_Usage, Check_Constraints and Sub_Tables are those used by SQL-3 to represent constraint specifications. SQL-2 has the same tables, except for Sub_Tables, Hence, as shown in figure 5.8, these tables are not required as augmented tables when a DBMS conforms to SQL-3 or SQL-2 standards, respectively. Adopting the same names and structures as used in the SQL standards makes our approach compatible with most database products. We have introduced two more tables (namely: Cardinality_Constraints and Altered_Sub_Table_Columns) to enable us to represent cardinality constraints and to record any synonyms involved in generalisation / specialisation structures. The representation of this type of information is not yet addressed by the SQL standards. CCVES utilises the above mentioned user defined knowledge-based tables not only to automatically reproduce a conceptual model, but also to enhance existing databases by detecting and cleaning inconsistent data. To determine the presence of these tables, CCVES looks for user defined tables such as Table_Constraints, Referential_Constraints, etc., which can appear in known existing legacy databases only if the DBMS maintains our proposed knowledge-base. For example, in INGRES version 6 we know that such tables are not maintained as part of its system provision, hence the presence of tables with these names in this context confirms the existence of our knowledge-base. Use of our knowledge-based tables is database system specific as they are used only to represent knowledge that is not supported by that DBMS's meta-data facility. Hence, the components of two distinct knowledge-bases, e.g. for INGRES version 6 and Oracle version 7, are different from each other (see figure 5.8). Table Name S1 S2 S3 I O P - - D V6 V7 V4 Table_Constraints Y N N Y N Y Referential_Constraints Y N N Y N Y Key_Column_Usage Y N N Y N Y Check_Constraints Y N N N N N Sub_Tables Y Y N Y Y N Altered_Sub_Table_Columns Y Y Y Y Y Y Cardinality_Constraints Y Y Y Y Y Y S1 - SQL/86, S2 - SQL-2, S3 - SQL-3 Y - Yes, required I - INGRES, O - Oracle, P - POSTGRES N - No, not required D - Draft, V - Version Figure 5.8 : Requirement of augmented tables for SQL standards and some current DBMSs The different types of constraints are identified using the attribute Constraint_Type of Table_Constraints, which must have one of the values PRIMARY KEY, UNIQUE, FOREIGN KEY or CHECK. A set of example instances are give in figure 5.9 to show the types of information held in our knowledge-based tables. The constraint type NOT NULL may also appear in Table_Constraints when dealing with a DBMS that does not support NULL value specifications. We have not included it in our sample data as it is supported by our test DBMSs and all the SQL standards. The constraint Page 85
  • 87. Chapter 5 Re-engineering relational legacy DBs and table identifications in our knowledge-based tables (i.e. Constraint_Id and Table_Id of figure 5.9), may be of composite type as they need to identify not only the name, but also the schema, catalog and location of the database. Foreign key constraints are associated with their referenced table through a unique constraint. Hence, the ‘Const_Name_Key’ instance of attribute Unique_Constraint_Name of table Referential_Constraints should also appear in Key_Column_Usage as a unique constraint. This means that each of the knowledge-based tables has its own set of properties to ensure the accuracy and consistency of the information retained in these tables. For instance Constraint_Type of Table_Constraints must be one of {‘PRIMARY KEY’, ‘UNIQUE’, ‘FOREIGN KEY’, ‘CHECK’} if these are the only type of constraints that are to be represented. Also, within a particular schema a constraint name is unique. Hence Constraint_Name of Table_Constraints must be unique for a particular type of Constraint_Id. In figure 5.10 we present the set of constraints associated with our knowledge-based tables. Besides these there are a few others which are associated with other system tables, such as Tables and Columns which are used to represent all entity and attribute names respectively. Such constraints are used in systems supporting the above constraint types. This allows us to maintain consistency and accuracy within the constraint definitions. Table_Constraints { Constraint_Id, Constraint_Name, Table_Id, Table_Name, Constraint_Type, Is_Deferrable, Initially_Deferred } ('dbId', 'Const_Name_PK', 'TableId', 'Entity_Name_PK', 'PRIMARY KEY', 'NO', 'NO') ('dbId', 'Const_Name_UNI', 'TableId', 'Entity_Name_UNI', 'UNIQUE', 'NO', 'NO') ('dbId', 'Const_Name_FK', 'TableId', 'Entity_Name_FK', 'FOREIGN KEY', 'NO', 'NO') ('dbId', 'Const_Name_CHK', 'TableId', 'Entity_Name_CHK', 'CHECK', 'NO', 'NO') Key_Column_Usage { Constraint_Id, Constraint_Name, Table_Id, Table_Name, Column_Name, Ordinal_Position } ('dbId', 'Const_Name_PK', 'TableId','Entity_Name_PK', 'Attribute_Name_PK', i) ('dbId', 'Const_Name_UNI', 'TableId','Entity_Name_UNI', 'Attribute_Name_UNI', i) ('dbId', 'Const_Name_FK', 'TableId','Entity_Name_FK', 'Attribute_Name_FK', i) Referential_Constraints { Constraint_Id, Constraint_Name,Unique_Constraint_Id, Unique_Constraint_Name, Match_Option, Update_Rule, Delete_Rule } ('dbId', 'Entity_Name_FK', 'TableId', 'Const_Name_Key', 'NONE', 'NO ACTION', 'NO ACTION') Check_Constraints { Constraint_Id, Constraint_Name, Check_Clause } ('dbId', 'Const_Name_CHK', 'Const_Expression') Sub_Tables { Table_Id, Sub_Table_Name, Super_Table_Name } ('dbId', 'Entity_Name_Sub', 'Entity_Name_Super') Altered_Sub_Table_Columns { Table_Id, Sub_Table_Name, Sub_Table_Column, Super_Table_Name, Super_Table_Column } ('dbId', 'Entity_Name_Sub', 'newAttribute_Name', 'Entity_Name_Super', 'oldAttribute_Name') Cardinality_Constraints { Constraint_Id, Constraint_Name, Referencing_Cardinal } ('dbId', 'Entity_Name_FK', 'Const_Value_Ref') Figure 5.9 : Augmented tables with different instance occurrences Some attributes of these knowledge-based tables are used to indicate when to execute a constraint and what action is to be taken. The actions are application dependent or have no effect on the approach proposed here, and hence we have used a default value as proposed in the standards. 
However, it is possible to specify trigger actions such as ON DELETE CASCADE, so that when a row of the referenced table is deleted the corresponding rows in the referencing table are automatically deleted. These features were initially introduced in the form of rule-based constraints to allow triggers and alerters to be specified in databases and make them active [ESW76, STO88]. Such actions may also have been implemented in legacy ISs, as in the case of general constraints. The
  • 88. Chapter 5 Re-engineering relational legacy DBs constraints used in our constraint enforcement process (cf. section 5.8) are alerters as they draw attention to constraints that do not conform to the existing legacy data. Table_Constraints PRIMARY KEY (Constraint_Id, Constraint_Name) CHECK (Constraint_Type IN ('UNIQUE','PRIMARY KEY','FOREIGN KEY','CHECK') ) CHECK ( (Is_Deferrable, Initially_Deferred) IN ( values ('NO','NO'), ('YES','NO'), ('YES','YES') ) ) CHECK ( UNIQUE ( SELECT Table_Id, Table_Name FROM Table_Constraints WHERE Constraint_Type = 'PRIMARY KEY' ) ) Key_Column_Usage PRIMARY KEY (Constraint_Id, Constraint_Name, Column_Name) UNIQUE (Constraint_Id, Constraint_Name, Ordinal_Position) CHECK ( (Constraint_Id, Constraint_Name) IN (SELECT Constraint_Id, Constraint_Name FROM Table_Constraints WHERE Constraint_Type IN ('UNIQUE', 'PRIMARY KEY','FOREIGN KEY' ) ) ) Referential_Constraints PRIMARY KEY (Constraint_Id, Constraint_Name) CHECK ( Match_Option IN ('NONE','PARTIAL','FULL') ) CHECK ( Update_Rule IN ('CASCADE','SET NULL','SET DEFAULT','RESTRICT','NO ACTION') ) CHECK ( (Constraint_Id, Constraint_Name) IN (SELECT Constraint_Id, Constraint_Name FROM Table_Constraints WHERE Constraint_Type = 'FOREIGN KEY' ) ) CHECK ( (Unique_Constraint_Id, Unique_Constraint_Name) IN ( SELECT Constraint_Id, Constraint_Name FROM Table_Constraints WHERE Constraint_Type IN ('UNIQUE','PRIMARY KEY') ) ) Check_Constraints PRIMARY KEY (Constraint_Id, Constraint_Name) Sub_Tables PRIMARY KEY (Table_Id, Sub_Table_Name, Super_Table_Name) Altered_Sub_Table_Columns PRIMARY KEY (Table_Id, Sub_Table_Name, Super_Table_Name, Column_Name) FOREIGN KEY (Table_Id, Sub_Table_Name, Super_Table_Name) REFERENCES Sub_Tables Cardinality_Constraints PRIMARY KEY (Constraint_Id, Constraint_Name) FOREIGN KEY (Constraint_Id, Constraint_Name) REFERENCES Referential_Constraints Figure 5.10: Consistency constraints of our knowledge-based tables Many other types of constraint are possible in theory [GRE93]. We shall not deal with all of them as our work is concerned only with constraints applicable at the conceptual modelling stage. These applicable constraints take the form of logical expressions and are stored in the database using the knowledge-based table Check_Constraints. They are identified by the keyword 'CHECK' in Table_Constraints in figure 5.9. Similarly, other constraint types (e.g. rules and procedures) are represented by means of distinct keywords and tables. Figure 5.9 also includes generalisation and cardinality constraints. A generalisation hierarchy is defined using the SQL-3 syntax (i.e. UNDER, see figure 5.5), while a cardinality constraint is defined using an extended foreign key definition (see figure 5.5). These specifications are also held in the database, using the tables Sub_Tables, Altered_Sub_Table_Columns and Cardinality_Constraints, respectively (see figure 5.9). 5.8 The Constraint Enforcement Process This is an optional process provided by our system, as the third stage of its application to a database. The objective is to give users the facility to verify / ensure that the data conforms to all the enhanced constraints. This process is optional so that the user can decide whether these constraints should be enforced to improve the quality of the legacy database prior to its migration or whether it is best left as it stands. Page 87
  • 89. Chapter 5 Re-engineering relational legacy DBs During the constraint enforcement process any violations of the enhanced constraints are identified. In some cases this may result in removing the violated constraint as it may be an incorrect constraint specification. However, the DBA may decide to keep such constraints as the constraint violation may be as a result of incorrect data instances or due to a change in a business rule that has occurred during the lifetime of the database. Such a rule may be redefined with a temporal component to reflect this change. Such data are manageable using versions of data entities as in object-oriented DBMSs [KIM90]. We use the enhanced constraint definitions to identify constraints that do not conform to the existing legacy data. Here each constraint is used to produce a query statement. This query statement depends on the type of constraint, as shown in figure 5.11. CCVES uses constraint definitions to produce data manipulation language statements suitable for the host DBMS. Once such statements are produced, CCVES will execute them against the current database to identify any violated data for each of these constraints. When such violated data are found for an enhanced constraint it is up to the user to take appropriate action. Enforcement of such constraints can prevent data rejection by the target DBMS, possible losses of data and/or delays in the migration process, as the migrating data’s quality will have been ensured by prior enforcement of the constraints. However as the enforcement process is optional, the user need not take immediate action. He can take his own time to determine the exact reasons for each violation and take action at his convenience prior to migration. 5.9 The Migration Process The migration process is the fourth and final stage in the application of our approach. This is incrementally performed by initially creating the meta-data in the target DBMS, using the schema meta-translation technique of Ramfos [RAM91], and then copying the legacy data to the target system, using the import/export tools of source and target DBMSs. During this activity, legacy applications must continue to function until they too are migrated. To support this process we need to use an interface (i.e. a forward gateway) that can capture and process all database queries of the legacy application and then re-direct those related to the target system via CCVES. The functionality that is required here is a distributed query processing facility which is supported by current distributed DBMSs. However, in our case the source and target databases are not necessarily of the same type as in the case of distributed DBMSs, so we need to perform a query translation in preparation for the target environment. Such a facility can be provided using the query meta- translation technique of Howells [HOW87]. This approach will facilitate transparent migration for legacy databases as it will allow the legacy IS users to continue working while the legacy data is being migrated incrementally. Page 88
  • 90. Chapter 5 Re-engineering relational legacy DBs Constraint Queries to detect Constraint Violation Instances Primary Key SELECT Attribute_Names, COUNT(*) FROM Entity_Name GROUP BY Attribute_Names HAVING COUNT(*) > 1 UNION SELECT Attribute_Names, 1 FROM Entity_Name WHERE Attribute_Names IS NULL Unique SELECT Attribute_Names, COUNT(*) FROM Entity_Name GROUP BY Attribute_Names HAVING COUNT(*) > 1 Referential SELECT * FROM Referencing_Entity_Name WHERE NOT (Referencing_Attributes IS NULL OR Referencing_Attributes IN (SELECT Referenced_Attributes FROM Referenced_Entity_Name)) Check SELECT * FROM Entity_Name WHERE NOT (Check_Constraint_Expression) Cardinality SELECT Attribute_Names, COUNT(*) FROM Entity_Name GROUP BY Attribute_Names HAVING COUNT(*) < Min_Cardinality_Value UNION SELECT Attribute_Names, COUNT(*) FROM Entity_Name GROUP BY Attribute_Names HAVING COUNT(*) > Max_Cardinality_Value Figure 5.11: Detection of violated constraints in SQL Page 89
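The referential template in figure 5.11 is written as though a foreign key involved a single attribute. For a composite foreign key the same check can be expressed with a correlated subquery. The following is a sketch only; it assumes the Teach and Option tables introduced in chapter 6, where Teach(course, subject) references Option:

-- Rows of Teach whose (course, subject) pair has no matching Option row;
-- NULL components are excluded, mirroring the single-attribute template.
SELECT *
FROM   Teach t
WHERE  NOT (t.course IS NULL OR t.subject IS NULL)
AND    NOT EXISTS ( SELECT 1
                    FROM   Option o
                    WHERE  o.course  = t.course
                    AND    o.subject = t.subject );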
  • 91. CHAPTER 6 Test Databases and their Access Process In this chapter we introduce our example databases, by describing their physical and logical components. The selection criteria for these test databases, and the associated constraints in accessing and using them are discussed here. We investigate the tools available for our test DBMSs. We then apply our re-engineering process to our test databases to show its applicability. Lastly, we refer to the organisation of system information in a relational DBMS and describe how we identify and access information about entities, attributes, relationships and constraints in our test DBMSs. 6.1 Introduction to our Test Databases In the following sub-sections we introduce our test databases. We first identify the main requirements for these databases. This is followed by a description of associated constraints and their role in database access and use. Finally, we identify how we established our test databases and the DBMSs we have used for this purpose. 6.1.1 Main Requirements The design of our test databases was based on two important requirements. Firstly, to establish a suitable legacy test database environment to enable us to demonstrate the practicability of our re-engineering and migration techniques. Secondly, to establish a heterogeneous database environment for the test databases to enable us to test the generality of our approach. As described in section 2.1.2.1, the problems of legacy databases apply mostly to long existing database systems. Most of these systems use traditional file-based methods or an old version of the hierarchical, network or relational database models for their database management. Due to complexity and availability of resources, which are discussed in section 6.1.2, we decided to focus on a particular type of database model to apply our legacy database enhancement and migration techniques. Test databases were developed for the chosen database model, while establishing the required levels of heterogeneity in the context of that model. 6.1.2 Availability and Choice of DBMSs At University of Wales College of Cardiff, where our research was conducted, there were only a few application databases. These included systems used to process student and staff information for personnel and payroll applications. This information was centrally processed and managed using third party software. Due to the licence conditions on this software, the university did not have the authority to modify and improve it on their own. Also, most of this software was developed with 3GL technology using files to manipulate information. There were recent enhancements which had been developed using 4GL tools. However, no proper DBMS had been used to build any of these applications, although future plans included using Oracle for new
  • 92. database applications. These databases were therefore not well suited to our work. Other than the personnel and payroll applications there were a few departmental and specific project based applications. Some of these were based on DBMSs (such as Oracle), although their application details were not readily available. Information gathered from these sources revealed that not many database applications existed in our university environment and gaining permission to access them for research purposes was practically impossible. Also, until we obtained access and investigated each application we would not be able to fully justify its usefulness as a test database, as it might not fulfil all our requirements. Therefore, it was decided to initially design and develop our own database applications to suit our requirements and then if possible to test our system on any other available real world databases. Access to DBMSs was restricted to products running on Personal Computers (PCs) and some Unix systems. Most of these products were based on the relational data model and some on the object-oriented data model. The older database models - hierarchical and network - were no longer being used or available as DBMSs. Also, the available DBMSs were in their latest versions, making the task of building a proper legacy database environment more difficult. The relational model has been in use for database applications over the last 20 years and currently is the most widely used data model. During this time many database products and versions have been used to manage these database applications. As a result, many of them are now legacy systems and their users need assistance to enhance and migrate them to modern environments. Thus the choice of the relational data model for our tests is reasonable, although one may argue that similar requirements exist for database applications which have been in use prior to this data model gaining its pre-eminent position. Due to the superior power of workstations as compared to PC’s it was decided to work on these Unix platforms and to build test databases using the available relational DBMSs, as our main aim was simply to demonstrate the applicability of our approach. Two popular commercial relational DBMSs, namely: INGRES and Oracle, were accessible via the local campus network. We selected these two products to implement our test databases as they are leading, commercially-established products which have been in existence since the early days of relational databases. The differences between these two database products made them ideal for representing heterogeneity within our test environment. Both products supported the standard database query language, SQL. However, only one of them (Oracle) conforms to the current SQL-2 standard. Oracle is also a leading relational database product, along with SYBASE and INFORMIX, on Unix platforms [ROS94]. As described in section 3.8, SQL standards have been regularly reviewed and hence it is also important to choose a database environment that will support at least some of the modern concepts, such as object-oriented features. In recent database products these features have been introduced either via extended relational or via object-oriented database technology. Obviously the choice of an extended relational data model is the most suitable for our purposes as it incorporates natural extensions to the relational data model. 
Hence we selected POSTGRES, which is a research DBMS providing modern object-oriented features in an extended relational model, as our third test DBMS. 6.1.3 Choice of database applications and levels of heterogeneity
  • 93. Designing a single large database application as our test database would result in one very complex database application. To overcome the need to devise and manage a single complex application to demonstrate all of our tasks, we decided to build a series of simple applications and later to provide a single integrated application derived from these simple database applications. Our own university environment was chosen to construct these test database systems as we were able to perform a detailed system study in this context and collect sufficient information to create appropriate test applications. Typical text book examples [MCF91, ROB93, ELM94, DAT95] were also used to verify the contents chosen for our test databases. Three databases representing college, faculty and department information were chosen for our simple test databases. To ensure simplicity, no more than ten entities were included for each of these databases. However, each was carefully designed to enable us to thoroughly test our ideas, as well as to represent three levels of heterogeneity within our test systems. These systems were implemented on different computer systems using different DBMSs so that they represented heterogeneity at the physical level. INGRES, POSTGRES and Oracle running on DEC station, SUN Sparc and DEC Alpha, respectively, were chosen. The differences in characteristics among these three DBMSs introduced heterogeneity at the logical level. Here, Oracle conforms to the current SQL/92 standard and supports most modern relational data model requirements. INGRES and POSTGRES, although they are based on the same data model, have some basic differences in handling certain database functions such as integrity constraints. These two DBMSs use a rule subsystem to handle constraints, which is a different approach from that proposed by the SQL standards. POSTGRES, which is regarded as an extended relational DBMS having many object-oriented features, is also regarded as an object-oriented DBMS. These inherent differences ensure the initial heterogeneity of our environment at the logical level. Our test databases were designed to highlight these logical differences, as we shall see. 6.2 Migration Support Tools for our Test DBMSs Prior to creating and applying our approach it was useful to investigate the availability of tools for our test DBMSs to assist the migration of databases. As indicated in the following sub- sections, only a few tools are provided to assist this process and they have limited functionality that is inadequate to assist all the stages of enhancing and migrating a legacy database service. 6.2.1 INGRES INGRES permits manipulation of data in non-INGRES databases [RTI92] and the development of applications that are portable across all INGRES servers. This type of data manipulation is done through an INGRES gateway. INGRES/Open SQL, a subset of INGRES SQL, is used for this purpose. The type of services provided by this gateway include [RTI92]: • Translation between Open SQL and non-INGRES SQL DBMS query interfaces such as Rdb/VMS (for DEC) or DB2 (for IBM). • Conversion between INGRES data types and non-INGRES data types. • Translation of non-INGRES DBMS error messages to INGRES generic error types. Page 92
  • 94. This functionality is useful in creating a target database service. However, as the target databases supported by INGRES/Open SQL do not include Oracle and POSTGRES, this tool was not helpful to us. The PRODBI interface for INGRES [LUC93] allows access to INGRES databases from Prolog code. This tool is useful in our work as our main processing is done using Prolog. Hence we have used this tool to implement our constraint enforcement process. Meta-data access from INGRES databases could have been done using PRODBI. However, due to its unavailability at the start of our project we implemented this using C programs embedded with SQL code. INGRES does not support any CASE tools that assist in reverse-engineering or analysing INGRES applications. Its only support was in the form of a 4GL environment [RTI90b] which is useful for INGRES application development, but not for any INGRES based legacy ISs and their reverse engineering. 6.2.2 Oracle The latest version of Oracle (i.e. version 7) is a RDBMS that conforms to the SQL-2 standards. Hence, this DBMS supports most modern database functions, including the specification, representation and enforcement of integrity constraints. Oracle has provided migration tools to convert databases from either of its two most recent versions (i.e. versions 5 or 6) to version 7. Oracle, a leading database product on the Unix platform [ROS94], has its own tool set to assist in developing Oracle based application systems [KRO93]. This includes screen-based application development tools SQL*Forms and SQL*Menu, and the report-writing product SQL*Report. These tools assist in implementing Oracle applications but do not provide any form of support to analyse the system being developed. To overcome this, a series of CASE products are provided by Oracle (i.e. CASE*Bridge, CASE*Designer, CASE*Dictionary, CASE*Generator, CASE*Method and CASE*Project) [BAR90]. The objective of these tools is to assist users by supporting a structured approach to the design, analysis and implementation of an Oracle application. CASE*Designer provides different views of the application using Entity Relationship Diagrams, Function Hierarchies, Dataflow Diagrams and matrix handlers to show the inter- relationship between different objects held in an Oracle dictionary. Oracle*Dictionary maintains complete definitions of the requirements and the detailed design of the application. Oracle*Generator uses these definitions to generate the code for the target environment and CASE*Bridge is used to extract information from other Oracle CASE tools or vice versa. However, such functions can be performed only on applications developed using these tools and not on an Oracle legacy database developed in any other way, which means they are no help with the current legacy problem. Hence, Oracle CASE tools are useful when developing new applications but cannot be used to re-engineer a pre-existing Oracle application, unless that original application was developed in an Oracle CASE environment. This limitation is shared by most CASE tools [COMP90, SHA93]. Currently, Oracle and other vendors are working on overcoming this limitation, and Page 93
  • 95. Oracle’s open systems architecture for heterogeneous data access [HOL93] is a step towards this. ANSI standard embedded SQL [ANSI89b] is used for application portability along with a set of function calls. In Oracle’s open systems architecture, standard call level interfaces are used to dynamically link and run applications on different vendor engines without having to recompile the application programs. This functionality is a subset of Microsoft’s ODBC [RIC94, GEI95] and the aim is to provide a transparent gateway to access non-Oracle SQL database products (e.g. IMS, DB2, SQL/DS and VSAM for IBM machines, or RMS and Rdb for DEC) via Oracle’s SQL*Connect. Transparent gateway products are machine and DBMS dependent in that they need to be recompiled or modified to run on different computers and support access to a variety of DBMSs. In the past, developers had to create special code for each type of database their users wanted to access. This limitation can now be overcome using a tool like ODBC to permit access to multiple heterogeneous databases. Most database vendors have development strategies which include plans to interoperate with open systems vendors as well as proprietary database vendors. This facility is being implemented using the 17SQL Access Group’s RDA (Remote Database Access) standard. As a result, products such as Microsoft’s Open Database Connectivity (ODBC), INFORMIX-Gateway [PUR93] and Oracle Transparent Gateway [HOL93] support some form of connectivity between their own and other products. For our work with Oracle, we developed our own C programs embedded with the query language SQL to access and update our prototype Oracle database. There is a version of PRODBI for Oracle that allows access to Oracle databases from Prolog code which was used in this project. 6.2.3 POSTGRES POSTGRES was developed at the University of California at Berkeley as a research oriented relational database extended with object-oriented features. Since 1994 a commercial version called ILLUSTRA [JAE95] has been available. However, POSTGRES has yet to address the inter-operability and other issues associated with our migration approach. 6.3 The Design of our Test University Databases 6.3.1 Background In our university system, we assume that departments and faculties have common user requirements and ideally could share a common database. Based on this assumption we have developed our test database schema to contain shared information. Hence, our three simple test databases, known as: College, Faculty and Department, can be easily integrated. A complete integration of these three databases will result in the generation of a global University database schema. However, in practice, schemas used by different departments and faculties may differ, 17 SQL Access Group (SAG) is a non-profit corporation open to vendors and users that develops technical specifications to enable multiple SQL-based RDBMS’s and application tools to interoperate. The specifications defined by the SAG consist of a combination of current and evolving standards that include ANSI SQL, ISO RDA and X/Open SQL. Page 94
  • 96. making the task of integration more difficult and bringing up more issues of heterogeneity. As our work is concerned with legacy database issues in a heterogeneous environment and not with integrating or resolving conflicts that arise in these environments, the differences that exist within this type of environment were not considered. Hence, we shall be looking at each of these databases independently. The main advantage of being able to easily integrate our test databases was the ability, thereby, to readily generate a complex database schema which could also be used to test our ideas. Each test database was designed to represent a specific kind of information, for example the Faculty and Department databases represent all kinds of structural relationships (e.g. 1:1, 1:M, and M:N; strong and weak relationships and entity types). The College database represents specialisation / generalisation structures, while the University database acts as a global system consisting of all the sub-database systems. This allows all sub-database systems, i.e. College, Faculty and Department, to act as a distributed system - the University database system. This is illustrated in figure 6.1 and is further described in section 6.3.2. We also need to be able to specify and represent all the constraint types discussed in section 3.5, as our re-engineering techniques are based on constraints. These were chosen to reflect actual database systems as closely as possible. We introduce these constraints in section 6.4 after identifying the entities of each of our test databases. College Database FPS Database A Faculty Database COMMA Database MATHS Database Departmental Databases Figure 6.1: The UWCC Database 6.3.2 The UWCC Database We shall use the term UWCC database to refer to our example university database, as the data of our system is based on that used at University of Wales College of Cardiff (UWCC). The UWCC database consists of many distributed database sites each used to perform the functions either of a particular department or school, or of a faculty, or of the college. The functions of the college are performed using the database located at the main college, which we shall refer to as the College database. The College consists of five faculties, each having its own local database located at the administrative section of the faculty. Our test database has been populated for one faculty, namely: The Faculty of Physical Science (FPS), and we shall refer to Page 95
  • 97. this database as the Faculty database. The College has 28 departments or schools, with five of them belonging to FPS [UWC94a, UWC94b]. Our test databases were populated using data from two departments of FPS, namely: The Department of Computing Mathematics (COMMA), which is now called the Department of Computer Science, and The Department of Mathematics (MATHS). These are referred to as Department databases. The component databases of our UWCC database form a hierarchy as shown in figure 6.1. This will let us demonstrate how the global University database formed by integrating these components incorporates all the functions present in the individual databases. In the next section we identify our test databases by summarising their entities and specific features. Entity Database Name (Meaning) College Faculty Department University University (university data) x - - x Employee (university employees) x x x x Student (university students) x - x x EmpStudent (employees as students) x - - x College (college data) x - - x Faculty (faculty data) x x - x Department (department data) x x x x Committee (faculty committees) - x - x ComMember (committee members) - x - x Teach (subjects taught by staff) - - x x Course (offered by the department) - - x x Subject (subject details) - - x x Option (subjects for each course) - - x x Take (subjects taken by each student) - - x x Table 6.1: Entities used in our test databases 6.3.3 The Test Database schemas Fourteen entities shown in table 6.1 were represented in our test database schemas. As we are not concerned with heterogeneity issues associated with schema integration, we have simplified our local schemas by using the same attribute definitions in schemas having the same entity name. The attribute definitions of all our entities are given in figure 6.2. Each test database schema is defined using the data definition language (DDL) of the chosen DBMS, and is governed by a set of rules to establish integrity within the database. In the context of a legacy system these rules may not appear as part of the database schema. In this situation our approach is to supply them externally via our constraint enhancement process. Therefore we present the set of constraints defined on our test databases separately, so that the initial state of these databases conforms to the database structure of a typical legacy system. 6.3.4 Features of our Test Database schemas Among the specific features represented in our test databases are relationship types which form weak and link entities, cardinality constraints which highlight the behaviour of entities, and inheritance and aggregation which form specialised relationships among entities. These features (if not present) are introduced to our test database schemas by enhancing them with new constraints. Page 96
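To make this legacy-style starting point concrete, the following sketch contrasts a bare declaration of the Course entity (cf. figure 6.2, later) with the constraint specifications supplied afterwards through the enhancement process of section 5.6; the constraint names are illustrative only:

-- Legacy-style declaration: column definitions only, no declarative constraints.
CREATE TABLE Course (
      CourseNo    char(5)    NOT NULL,
      Name        char(35)   NOT NULL,
      Coordinator char(9),
      Offeredby   char(5)    NOT NULL,
      Type        char(1)    NOT NULL,
      Length      char(10),
      Options     integer(2) );

-- Constraints supplied externally, in the style of figure 5.5
-- (cf. tables 6.2 to 6.5, later).
ALTER TABLE Course ADD CONSTRAINT Course_PK PRIMARY KEY (CourseNo);
ALTER TABLE Course ADD CONSTRAINT Course_UNI UNIQUE (Name, Offeredby);
ALTER TABLE Course ADD CONSTRAINT Course_FK_Dept
      FOREIGN KEY (Offeredby) REFERENCES Department (DeptCode);
ALTER TABLE Course ADD CONSTRAINT Course_Type_CHK
      CHECK (Type IN ('U','P','E','O'));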
  • 98. a) Relationship types Our reverse-engineering process uses the knowledge of constraint definitions to construct a conceptual model for a legacy database system. The foreign key definitions of table 6.4 along with their associated primary (cf. table 6.2) and uniqueness constraints (cf. table 6.3) are used to determine the relationship structures of a conceptual model. In this section we look at our foreign key constraint definitions to identify the types of relationship formed in our test database schemas. The check constraints of table 6.5 are used purely to restrict the domain values of our test databases. The foreign keys of table 6.4 are processed to find relationships according to our approach described in section 5.4.1. Here we identify keys defined on primary key attributes to determine M:N and 1:N weak relationships. The remaining keys will form 1:N or 1:1 relationships depending on the uniqueness property of the attributes of these keys. Table 6.6 shows all the relationships found in our test databases. We have also identified the criteria used to determine each relationship type according to section 5.4.1. Page 97
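As an illustration of these criteria (a sketch only, using constraints listed in tables 6.2 to 6.4 below), a foreign key whose attributes are contained in the primary key marks a link entity and hence an M:N relationship, whereas a foreign key on an attribute that is itself unique marks a 1:1 relationship:

-- ComMember: both foreign keys lie inside the primary key
-- (comName, memName, faculty), so ComMember is a link entity representing
-- an M:N relationship between Committee and Employee (criterion 'a').
ALTER TABLE ComMember ADD PRIMARY KEY (comName, memName, faculty);
ALTER TABLE ComMember ADD FOREIGN KEY (comName, faculty) REFERENCES Committee;
ALTER TABLE ComMember ADD FOREIGN KEY (memName) REFERENCES Employee;

-- Faculty.dean: the referencing attribute is unique, so an Employee can be
-- dean of at most one Faculty, giving a 1:1 relationship (criterion 'e').
ALTER TABLE Faculty ADD UNIQUE (dean);
ALTER TABLE Faculty ADD FOREIGN KEY (dean) REFERENCES Employee;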
  • 99. CREATE TABLE University ( CREATE TABLE Department ( Office char(50) NOT NULL ); DeptCode char(5) NOT NULL, Building char(20) NOT NULL, CREATE TABLE Employee ( Name char(50) NOT NULL, Name char(25) NOT NULL, Address char(30), Address char(30) NOT NULL, Head char(9), BirthDate date(7) NOT NULL, Phone char(13), Gender char(1) NOT NULL, Faculty char(5) NOT NULL ); EmpNo char(9) NOT NULL, Designation char(30) NOT NULL, CREATE TABLE Committee ( WorksFor char(5) NOT NULL, Name char(15) NOT NULL, YearJoined integer(2) NOT NULL, Faculty char(5) NOT NULL, Room char(9), Chairperson char(9) ); Phone char(13), Salary decimal(8,2) ); CREATE TABLE ComMember ( ComName char(15) NOT NULL, CREATE TABLE Student ( MemName char(9) NOT NULL, Name char(20) NOT NULL, Faculty char(5) NOT NULL, Address char(30) NOT NULL, YearJoined integer(2) NOT NULL ); BirthDate date(7) NOT NULL, Gender char(1) NOT NULL, CREATE TABLE Teach ( CollegeNo char(9) NOT NULL, Lecturer char(9) NOT NULL, Course char(5) NOT NULL, Course char(5) NOT NULL, Department char(5) NOT NULL, Subject char(5) NOT NULL, Tutor char(9), Room char(9) ); RegYear integer(2) NOT NULL ); CREATE TABLE Course ( CREATE TABLE EmpStudent ( CourseNo char(5) NOT NULL, CollegeNo char(9) NOT NULL, Name char(35) NOT NULL, EmpNo char(9) NOT NULL, Coordinator char(9), Remark char(10) ); Offeredby char(5) NOT NULL, Type char(1) NOT NULL, CREATE TABLE College ( Length char(10), Code char(5) NOT NULL, Options integer(2) ); Building char(20) NOT NULL, Name char(40) NOT NULL, CREATE TABLE Subject ( Address char(30), SubNo char(5) NOT NULL, Principal char(9), Name char(40) NOT NULL ); Phone char(13) ); CREATE TABLE Option ( CREATE TABLE Faculty ( Course char(5) NOT NULL, Code char(5) NOT NULL, Subject char(5) NOT NULL, Building char(20) NOT NULL, Year integer(2) NOT NULL ); Name char(40) NOT NULL, Address char(30), CREATE TABLE Take ( Secretary char(9), CollegeNo char(9) NOT NULL, Phone char(13), Subject char(5) NOT NULL, Dean char(9) ); Year integer(2) NOT NULL, Grade char(1) ); Figure 6.2: Test database schema entities and their attribute descriptions We can see that the selected constraints cover four of the five relationship identification categories of figure 5.3. The remaining category (i.e. ‘b’) is a special case of category ‘a’ which could be represented in the entity Take by introducing two separate foreign keys to link entities Course and Subject, instead of linking with the entity Option. However, as stated in section 5.4.1, n-ary relationships are simplified whenever possible. Hence, in the test examples presented here we do not show this type to reduce the complexity of our diagrams. In appendix C we present the graphical view of all our test databases. The figures there show the graphical representation of all the relationships identified in table 6.6. b) Inheritance We have introduced two inheritance structures, one representing a single inheritance and the other a multiple inheritance (see figure 5.2 and table 6.7). To do so, two generalised entities, Page 98
  • 100. namely: Office and Person, have been introduced (see figure 6.3). Entities College, Faculty and Department now inherit from Office, while entities Employee and Student inherit from Person. Entity EmpStudent has been modified to become a specialised combination of Student and Employee. Figure 6.3 also contain all constraints associated with these entities. Constraint Entity(s) PRIMARY KEY (office) University PRIMARY KEY (empNo) Employee PRIMARY KEY (collegeNo) Student, EmpStudent PRIMARY KEY (code) College, Faculty PRIMARY KEY (deptCode) Department PRIMARY KEY (name,faculty) Committee PRIMARY KEY (comName,memName,faculty) ComMember PRIMARY KEY (lecturer,cource,subject) Teach PRIMARY KEY (courseNo) Course PRIMARY KEY (subNo) Subject PRIMARY KEY (course,subject) Option PRIMARY KEY (collegeNo,subject,year) Take Table 6.2: Primary Key constraints of our test databases Constraint Entity(s) UNIQUE (empNo) EmpStudent UNIQUE (name) College, Department, Faculty UNIQUE (principal) College UNIQUE (dean) Faculty UNIQUE (head) Department UNIQUE (name,offeredBy) Course Table 6.3: Uniqueness Key constraints of our test databases c) Cardinality constraints We have introduced some cardinality constraints on our test databases to show how these can be specified for a legacy database. In table 6.8 we show those used in the College database. Here the cardinality constraints for worksFor and faculty have been explicitly specified (see figure 6.3), while the others (inCharge, tutor and dean) have been derived using their relationship types. For example inCharge and tutor are 1:N relationships while dean is a 1:1 relationship. Our conceptual diagrams incorporate these constraint values (cf. appendix C and figure 5.2). Constraint Entity(s) FOREIGN KEY (course) REFERENCES Course Student, Option FOREIGN KEY (department) REFERENCES Department Student FOREIGN KEY (tutor) REFERENCES Employee Student FOREIGN KEY (dean) REFERENCES Employee Faculty FOREIGN KEY (faculty) REFERENCES Faculty Committee FOREIGN KEY (chairPerson) REFERENCES Employee Committee FOREIGN KEY (comName,faculty) REFERENCES Committee ComMember FOREIGN KEY (memName) REFERENCES Employee ComMember FOREIGN KEY (lecturer) REFERENCES Employee Teach FOREIGN KEY (course,subject) REFERENCES Option Teach FOREIGN KEY (coordinator) REFERENCES Employee Course FOREIGN KEY (offeredBy) REFERENCES Department Course FOREIGN KEY (subject) REFERENCES Subject Option, Take FOREIGN KEY (collegeNo) REFERENCES Student Take Table 6.4: Foreign Key constraints of our test databases Page 99
  • 101. Constraint Entity(s) CHECK (yearJoined >= 21 + birthDate INTERVAL YEAR) Employee CHECK (salary BETWEEN 200 AND 3000 OR salary IS NULL) Employee CHECK (regYear >= 18 + birthDate INTERVAL YEAR) Student CHECK (phone IS NOT NULL) College, Department, Faculty CHECK (type IN ('U','P','E','O')) Course CHECK (options >= 0 OR options IS NULL) Course CHECK (year BETWEEN 1 AND 7) Option Table 6.5: Check constraints of our test databases d) Aggregation A university has many offices (e.g. faculties, departments etc.) and an office belongs to a university. Also, attribute office is the key of entity University. Hence, entities University and Office participate in a 1:1 relationship. However, it is natural to represent this as a specialised relationship by considering office of University to be of type set. Then University and Office participate in an aggregation relationship which is a special form of a binary relationship. We introduce this type to show how specialised constraints could be introduced into a legacy database system. As shown in figure 6.3 we have used the key word REF SET to specify this type of relationship. In this case, as office is the key of University, a foreign key definition on office (see figure 6.3) will treat University as a link entity and hence can be classified as a special relationship. Attribute(s) Entity Relationship Entity(s) Criteria course Student 1 :N Course (d) department Student 1 :N Department (d) tutor Student 1 :N Employee (d) dean Faculty 1 :1 Employee (e) faculty Committee 1 :N Faculty (c) chairPerson Committee 1 :N Employee (d) comName, faculty, memName ComMember M :N Committee, Employee (a) lecturer, course, subject Teach M :N Employee, Option (a) coordinator Course 1 :N Employee (d) offeredBy Course 1 :N Department (d) course, subject Option M :N Course, Subject (a) collegeNo, subject Take M :N Student, Subject (a) Table 6.6: Relationship types of our test databases Entity Inherited Entities Employee Person Student Person EmpStudent Student, Employee College Office Faculty Office Department Office Table 6.7: Inherited Entities Participating Referencing Referenced Referencing Referenced Attribute Entity Entity Cardinality Cardinality inCharge Office Employee 0+ -1 worksFor Employee Office 4+ 1 tutor Student Employee 0+ -1 dean Faculty Employee -1 -1 faculty Department Faculty 2-12 1 Table 6.8: Cardinality constraints of College database Page 100
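As a sketch of how such a derived cardinality constraint is later checked, applying the cardinality template of figure 5.11 to the faculty relationship of table 6.8 (each Faculty should be referenced by 2 to 12 Department rows) gives roughly:

-- Faculties referenced by fewer than 2 or more than 12 departments.
SELECT Faculty, COUNT(*) FROM Department
GROUP BY Faculty HAVING COUNT(*) < 2
UNION
SELECT Faculty, COUNT(*) FROM Department
GROUP BY Faculty HAVING COUNT(*) > 12;
-- A Faculty referenced by no Department row at all does not appear in the
-- grouping, so the zero case needs a separate check against Faculty itself.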
  • 102. 6.4 Constraints Specification, Enhancement and Enforcement In the context of legacy systems, our test database schemas (cf. figure 6.2) will not explicitly contain most of the constraints introduced in tables 6.2 to 6.5, 6.7 and 6.8. Thus we need to specify them using the approach described in section 5.6. In figure 6.3 we present these constraints for the College database. CREATE TABLE Office (code, siteName, unitName, address, inCharge, phone) AS SELECT code, building, name, address, principal, phone FROM College UNION SELECT code, building, name, address, secretary, phone FROM Faculty UNION SELECT deptCode, building, name, address, head, phone FROM Department; ALTER TABLE Office ADD CONSTRAINT Office_PK PRIMARY KEY (code) ADD CONSTRAINT Office_Unique_name UNIQUE (siteName, unitName) ADD CONSTRAINT Office_FK_Staff FOREIGN KEY (inCharge) REFERENCES Employee ADD UNIQUE (phone); ALTER TABLE College ADD UNDER Office WITH (siteName AS building, unitName AS name, inCharge AS principal); ALTER TABLE Faculty ADD UNDER Office WITH (siteName AS building, unitName AS name, inCharge AS secretary) ADD FOREIGN KEY (faculty) CARDINALITY (2-12) REFERENCES Faculty ; ALTER TABLE Department ADD UNDER Office WITH (code AS deptCode, siteName AS building, unitName AS name, inCharge AS head); CREATE VIEW College_Office AS SELECT * FROM Office WHERE code in (SELECT code FROM College); CREATE VIEW Faculty_Office AS SELECT o.code, o.siteName, o.unitName, o.address, o.inCharge, o.phone, f.dean FROM Office o, Faculty f WHERE o.code = f.code; CREATE VIEW Dept_Office AS SELECT o.code, o.siteName, o.unitName, o.address, o.inCharge, o.phone, d.faculty FROM Office o, Department d WHERE o.code = d.deptCode; ALTER TABLE University ALTER COLUMN office REF SET(Office) NOT NULL | ADD FOREIGN KEY (office) REFERENCES Office ; CREATE TABLE Person AS SELECT name, address, birthDate, gender FROM Employee UNION SELECT name, address, birthDate, gender FROM Student; ALTER TABLE Person ADD PRIMARY KEY (name, address, birthDate) ADD CHECK (gender IN ('M', 'F')); ALTER TABLE Employee ADD UNDER Person ADD CONSTRAINT Employee_FK_Office FOREIGN KEY (worksFor) CARDINALITY (4) REFERENCES Office; ALTER TABLE Student ADD UNDER Person; ALTER TABLE EmpStudent ADD UNDER Student, Employee ADD CHECK (tutor <> empNo OR tutor IS NULL); Figure 6.3 : Enhanced constraints of college database in extended SQL-3 syntax When all the above constraints are not supported by a legacy database management system, we need to be able to store the constraints in the database using our knowledge augmentation techniques (cf. section 5.7). In figure 6.4 we present selected instances used in our knowledge-based tables to represent the enhanced constraints for the College database. The selected instances represent all the possible constraint types so we have not represented all the enhanced constraints of figure 6.3. Page 101
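Where the host DBMS cannot hold these definitions itself (e.g. INGRES version 6), the same knowledge is written into the augmented tables of figure 5.7 as ordinary rows. As a minimal sketch, anticipating the instances of figure 6.4 (below), the Employee_FK_Office constraint with cardinality 4+ might be recorded as follows:

-- Recording the foreign key Employee(worksFor) -> Office, cardinality 4+.
INSERT INTO Table_Constraints
VALUES ('Uni_db', 'Employee_FK_Office', 'Col', 'Employee',
        'FOREIGN KEY', 'NO', 'NO');
INSERT INTO Key_Column_Usage
VALUES ('Uni_db', 'Employee_FK_Office', 'Col', 'Employee', 'worksFor', 1);
INSERT INTO Referential_Constraints
VALUES ('Uni_db', 'Employee_FK_Office', 'Uni', 'Office_PK',
        'NONE', 'NO ACTION', 'NO ACTION');
INSERT INTO Cardinality_Constraints
VALUES ('Uni_db', 'Employee_FK_Office', '4+');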
Our constraint enforcement process (cf. section 5.8) allows users to verify the extent to which the
data in a database conforms to its enhanced constraints. The different types of queries used for this
process in the College database are given in figure 6.5.

Table_Constraint { Constraint_Id, Constraint_Name, Table_Id, Table_Name, Constraint_Type,
                   Is_Deferrable, Initially_Deferred }
    ('Uni_db', 'Office_PK', 'Col', 'Office', 'PRIMARY KEY', 'NO', 'NO')
    ('Uni_db', 'Office_Unique_name', 'Col', 'Office', 'UNIQUE', 'NO', 'NO')
    ('Uni_db', 'Office_FK_Staff', 'Col', 'Office', 'FOREIGN KEY', 'NO', 'NO')
    ('Uni_db', 'Employee_PK', 'Col', 'Employee', 'PRIMARY KEY', 'NO', 'NO')
    ('Uni_db', 'Employee_FK_Office', 'Col', 'Employee', 'FOREIGN KEY', 'NO', 'NO')
    ('Uni_db', 'College_phone', 'Col', 'College', 'CHECK', 'NO', 'NO')

Referential_Constraint { Constraint_Id, Constraint_Name, Unique_Constraint_Id,
                         Unique_Constraint_Name, Match_Option, Update_Rule, Delete_Rule }
    ('Uni_db', 'Office_FK_Employee', 'Col', 'Employee_PK', 'NONE', 'NO ACTION', 'NO ACTION')
    ('Uni_db', 'Employee_FK_Office', 'Uni', 'Office_PK', 'NONE', 'NO ACTION', 'NO ACTION')

Key_Column_Usage { Constraint_Id, Constraint_Name, Table_Id, Table_Name, Column_Name,
                   Ordinal_Position }
    ('Uni_db', 'Office_PK', 'Col', 'Office', 'Code', 1)
    ('Uni_db', 'Office_Unique_name', 'Col', 'Office', 'siteName', 1)
    ('Uni_db', 'Office_Unique_name', 'Col', 'Office', 'unitName', 2)
    ('Uni_db', 'Office_FK_Staff', 'Col', 'Office', 'inCharge', 1)
    ('Uni_db', 'Employee_PK', 'Col', 'Employee', 'empNo', 1)
    ('Uni_db', 'Employee_FK_Office', 'Col', 'Employee', 'worksFor', 1)

Check_Constraint { Constraint_Id, Constraint_Name, Check_Clause }
    ('Uni_db', 'College_phone', 'phone IS NOT NULL')

Sub_Tables { Table_Id, Sub_Table_Name, Super_Table_Name }
    ('Uni_db', 'College', 'Office')

Altered_Sub_Table_Columns { Table_Id, Sub_Table_Name, Sub_Table_Column, Super_Table_Name,
                            Super_Table_Column }
    ('Uni_db', 'College', 'building', 'Office', 'siteName')
    ('Uni_db', 'College', 'name', 'Office', 'unitName')
    ('Uni_db', 'College', 'principal', 'Office', 'inCharge')

Cardinality_Constraint { Constraint_Id, Constraint_Name, Referencing_Cardinal }
    ('Uni_db', 'Office_FK_Employee', '0+')
    ('Uni_db', 'Employee_FK_Office', '4+')

        Figure 6.4: Augmented tables with selected sample data for the College database

6.5 Database Access Process

Having described the application of our re-engineering processes using our test databases, we
identify the tools developed and used to access those databases. The database access process is the
initial stage of our application. This process extracts meta-data from legacy databases and represents
it internally so that it can be used by other stages of our application. During re-engineering we need
to access a database at three different stages: to extract meta-data and any existing constraint
knowledge specifications to commence our reverse-engineering process; to add enhanced knowledge
to the database; and to verify the extent to which the data conforms to the existing and enhanced
constraints. We also need to access the database during the migration process. In all these cases, the
information we require is held in either system or user-defined tables. Extraction of information
from these tables can be done using the query language of the database, thus what we need for this
stage is a mechanism that will allow us to issue queries and capture their responses.
Constraint Type   Constraint Violation Instances

Primary Key       SELECT code, COUNT(*) FROM Office GROUP BY code HAVING COUNT(*) > 1
                  UNION SELECT code, 1 FROM Office WHERE code IS NULL

Unique            SELECT dean, COUNT(*) FROM Faculty GROUP BY dean HAVING COUNT(*) > 1

Referential       SELECT * FROM Office
                  WHERE NOT (inCharge IS NULL OR inCharge IN (SELECT empNo FROM Employee))

Check             SELECT * FROM College WHERE NOT (phone IS NOT NULL)

Cardinality       SELECT worksFor, COUNT(*) FROM Employee GROUP BY worksFor HAVING COUNT(*) < 4

        Figure 6.5: Selected constraints to be enforced for the College database in SQL

As our system implementation is in Prolog, the necessary query statements are generated from
Prolog rules. The PRODBI interface allows access to several relational DBMSs, namely Oracle,
INGRES, INFORMIX and SYBASE [LUC93], from Prolog as if their relational tables were in the
Prolog environment. The availability of PRODBI for INGRES enabled us to use this tool to
communicate with our INGRES test databases in the latter stages of our project. This interface
performs as well as INGRES/SQL and hence, to the user, database interaction is fully transparent.
Such Prolog database interface tools are currently commercially available only for relational
database products, which means that we were not in a position to use this approach to perform
database interactions for our POSTGRES test databases. Tools such as ODBC allow access to
heterogeneous databases. This option would have been ideal for our application, but was not
considered due to its unavailability within our development time scale.

As far as our work is concerned, we needed a facility to issue specific types of query and obtain the
response in such a way that Prolog could process the responses without having to download the
entire database. The PRODBI interfaces for relational databases perform this task efficiently, and
also have many other useful data manipulation features. Due to the absence of any PRODBI-
equivalent tools for accessing non-relational or extended-relational DBMSs, we decided to develop
our own version for POSTGRES. The functionality of our POSTGRES tool is to accept a
POSTGRES DML statement (i.e. a POSTQUEL query statement) and produce the results for that
query in a form that is usable by our (Prolog based) system. For Oracle, a PRODBI interface is
available commercially, and to use it with our system the only change we would have to make is to
load the Oracle library; as far as our code is concerned no other commands change, since they follow
the same rules as for INGRES. However, at Cardiff only the PRODBI interface for INGRES was
available, and even this only in the latter stages of our project. Therefore we developed our own tool
to perform this functionality for INGRES and Oracle databases. The implementation of this tool was
not fully generalised, given that such tools were commercially available, and we were not too
concerned by performance degradation as our aim was to test functionality, not performance. In the
case of INGRES we have since confirmed performance by using a commercially developed PRODBI
tool with an SQL-equivalent query facility.
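To illustrate how such violation queries can be generated from Prolog rules, the following minimal
sketch builds and issues the check-constraint query of figure 6.5. The fact enhanced_check/3 and the
predicate db_query/3 are illustrative assumptions, standing in for our internal constraint
representation (described in chapter 7) and for whichever call the underlying database interface
actually provides.

    % Hedged sketch: generating and issuing a violation query for a check constraint.
    % enhanced_check/3 and db_query/3 are assumptions made for this example.

    enhanced_check('College', 'College_phone', 'phone IS NOT NULL').

    % Build a query that retrieves the instances violating a check constraint.
    violation_query(Table, ConstraintName, Query) :-
        enhanced_check(Table, ConstraintName, Clause),
        atom_concat('SELECT * FROM ', Table, Q1),
        atom_concat(Q1, ' WHERE NOT (', Q2),
        atom_concat(Q2, Clause, Q3),
        atom_concat(Q3, ')', Query).

    % Issue the query through the database interface and report any violating rows.
    enforce_check(Database, Table, ConstraintName) :-
        violation_query(Table, ConstraintName, Query),
        db_query(Database, Query, Rows),        % stand-in for the actual interface call
        report_violations(ConstraintName, Rows).

    report_violations(_, []).
    report_violations(Name, [Row|Rows]) :-
        length([Row|Rows], N),
        format('Constraint ~w violated by ~d row(s), e.g. ~w~n', [Name, N, Row]).

For primary key, unique, referential and cardinality constraints the same pattern applies, with only
the query template varying, as in figure 6.5.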
6.5.1 Connecting to a database

To establish a connection with a database the user needs to specify the site name (i.e. the location of
the database), the DBMS name (e.g. Oracle v7) and the database name (e.g. department), to ensure a
unique identification of a database located over a network. The site name is the address of the host
machine (e.g. thor.cf.ac.uk) and is used to gain access to that machine via the network. The type of
the named DBMS identifies the kind of data to be accessed, and the name[18] of the database tells us
which database is to be used in the extraction process. In our system (CCVES), we provide a pop-up
window (cf. left part of figure 6.6) to select and specify these requirements. Here, a set of commonly
used site names and the DBMSs currently supported at a site are embedded in the menu to make this
task easy. The specification of new site and database names can also be done via this pop-up menu
(cf. right part of figure 6.6).

Figure 6.6: Database connection process of CCVES

6.5.2 Meta-data extraction

Once a physical connection to a database is achieved it is possible to commence the meta-data
extraction process. This process is DBMS dependent, as the kind of meta-data represented in a
database and the methods of retrieving it vary between DBMSs. The information to be extracted is
recorded in the system catalogues (i.e. data dictionaries) of the respective databases. The most basic
type of information is entity and attribute names, which are common to all DBMSs. However,
information about the different types of constraints is specific to each DBMS and may not be present
in legacy database system catalogues.

The organisation of meta-data in databases differs with DBMSs, although all relational database
systems use some table structure to represent this information. For example, the table structure for
Oracle user tables is straightforward as they are separated from the system tables, while it is more
complex in INGRES as all tables are held in a single form, using attribute values to differentiate user
defined tables from system and view tables. Hence the extraction query statements used to retrieve
the entity names of a database schema differ for each system, as shown in table 6.9. These query
statements indicate that the meta-data extraction process is done using the query language of the
DBMS concerned (e.g. SQL for Oracle and POSTQUEL for POSTGRES) and that the query table
names and conditions vary with the type of the DBMS. This clearly demonstrates the DBMS
dependency of the extraction process.

[18] For simplicity, identification details like the owner of the database are not included here.
Once the meta-data is obtained from the system catalogues we can process it to produce the database
schema in the DDL formalism of the source database and to represent this in our internal
representation (see section 7.2). The extraction process for entity names (cf. table 6.9) covers only
one type of information. A similar process is used to extract all the other types of information,
including our enhanced knowledge-based tables. Here, the main difference is in the queries used to
extract the meta-data and in any processing required to map the extracted information into our
internal structures, which are introduced in section 7.2 (see also appendix D).

DBMS         Query
Oracle V7    SELECT table_name FROM user_table;
INGRES V6    SELECT table_name FROM iitables WHERE table_type='T' AND system_use='U';
POSTGRES V4  RETRIEVE pg_class.relname WHERE pg_class.relowner != '6';
SQL-3        SELECT table_name FROM tables WHERE table_type='BASE TABLE';

        Table 6.9: Query statements to extract the entity names of a database schema

6.5.3 Meta-data storage

The generated internal structures are stored in text files for further use as input data for our system.
These text files are stored locally using distinct directories for each database. The system uses the
database connection specifications to construct a unique directory name for each database (e.g.
department-Oracle7-thor.cf.ac.uk). We have given public access to these files so that the stored data
and knowledge is not only reusable locally, but also usable from other sites. This directory structure
provides a logically coherent database environment for users. It means that any future re-engineering
processes may be done without physically connecting to the database (i.e. by selecting a database
logically from one of the public directories instead).

The process of connecting to a database and accessing its meta-data usually does not take much time
(e.g. at most 2 minutes). However, accessing an active database whenever a user wants to view its
structure slows down the regular activities of that database. Also, local working is more cost
effective than regularly performing remote accesses, and it guarantees access to the database service
as it is not affected by network traffic and breakdowns. We experienced such breakdowns during our
system development, especially when accessing INGRES databases. A database schema can be
considered to be static, whereas its instances are not. Hence, the decision to simulate a logical
database environment after the first physical remote database access is justifiable, because it allows
us to work on meta-data held locally.
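The following minimal sketch illustrates the DBMS dependency of the extraction step and the local
storage of its results, using the class construct introduced in section 7.2. The predicates
entity_query/2 and db_query/3 and the file naming are assumptions made for this example; the actual
queries are those of table 6.9 and the actual interface call depends on the DBMS concerned.

    % Hedged sketch: DBMS-specific extraction of entity names and their storage
    % as internal facts in the per-database directory.

    :- dynamic class/2.

    entity_query(oracle_v7,   'SELECT table_name FROM user_table').
    entity_query(ingres_v6,   'SELECT table_name FROM iitables WHERE table_type=''T'' AND system_use=''U''').
    entity_query(postgres_v4, 'RETRIEVE pg_class.relname WHERE pg_class.relowner != ''6''').

    % Extract the entity names of a schema and record them as class/2 facts.
    extract_entities(SchemaId, Dbms) :-
        entity_query(Dbms, Query),
        db_query(SchemaId, Query, Names),       % stand-in for the DBMS-specific interface call
        assert_entities(SchemaId, Names).

    assert_entities(_, []).
    assert_entities(SchemaId, [Name|Names]) :-
        assertz(class(SchemaId, Name)),
        assert_entities(SchemaId, Names).

    % Save the extracted facts in the database's local directory for later sessions.
    store_entities(SchemaId, Directory) :-
        atom_concat(Directory, '/class.pl', File),
        tell(File),
        ( class(SchemaId, Name),
          writeq(class(SchemaId, Name)), write('.'), nl,
          fail
        ; true
        ),
        told.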
6.5.4 Schema viewing

Because meta-data is stored in text files, it is possible to skip the stages described in sections 6.5.1 to
6.5.3 when viewing a database schema which has been accessed recently. During a database
connection session, our system will only extract and store the meta-data of a database. Once the
database connection process is completed the user needs to invoke a schema viewing session. Here,
the user is prompted with a list of the currently logically connected databases, as shown on the left of
figure 6.7. When a database is selected from this list, its name descriptions (i.e. database name and
associated schema names) are placed in the main window of CCVES (cf. right of figure 6.7). The
user selects schemas to view them. Our reverse-engineering process is applied at this point: meta-
data extracted from the database schema is processed further to derive the constructs necessary to
produce the conceptual model as an E-R or OMT diagram.

CCVES allows multiple selections of the same database schema (i.e. by selecting the same schema
from the main window; cf. right of figure 6.7). As a result, multiple schema visualisation windows
can be produced for the same database. The advantage of this is that it allows a user to
simultaneously view and operate on different sections of the same schema, which otherwise would
not be visible at the same time due to the size of the overall schema (i.e. we would have to scroll the
window to make other parts of the schema visible). Also, the facility to visualise schemas using a
user-preferred display model means that the user can view the same schema simultaneously using
different display models.

Figure 6.7: Database selection and selected databases of CCVES

To produce a graphical view of a schema, we apply our reverse-engineering process. This process
uses the meta-data which we extracted and represented internally. In chapter 7 we introduce our
internal structures and describe the internal and external architecture and operation of our system,
CCVES.
CHAPTER 7

                       Architecture and Operation of CCVES

The Conceptualised Constraint Visualisation and Enhancement System (CCVES) is defined by
describing its internal architecture and operation - i.e. the way in which different legacy database
schemas are processed within CCVES in the course of enhancing and migrating them into a target
DBMS's schema - and its external architecture and operation - i.e. CCVES as seen and operated by
its users. Finally, we look into the possible migrations that can be performed using CCVES.

7.1 Internal Architecture of CCVES

In previous chapters, we discussed the overall information flow (section 2.2), our re-engineering
process (section 5.2) and the database access process (section 6.5). Here we describe how the meta-
data accessed from a database is stored and manipulated by CCVES in order to successfully perform
its various tasks. There are two sources of input information available to CCVES (cf. figure 7.1):
initially, by accessing a legacy database service via the database connectivity (DBC) process, and
later by using the database enhancement (DBE) process. This information is converted into our
internal representation (see section 7.2) and held in this form for use by the other modules of
CCVES. For example, the Schema Meta-Visualisation System (SMVS) uses it to display a
conceptual model of a legacy database, the Query Meta-Translation System (QMTS) uses it to
construct queries that verify the extent to which the data conforms to existing and enhanced
constraints, and the Schema Meta-Translation System (SMTS) uses it to generate and create target
databases for migration.

7.2 Internal Representation

To address heterogeneity issues, meta-representation and translation techniques have been
successfully used in several recent research projects at Cardiff [HOW87, RAM91, QUT93, IDR94].
A key to this approach is the transformation of the source meta-data or query into a common internal
representation, which is then separately transformed into a chosen target representation. Thus
components of a schema, referred to as meta-data, are classified as entity (class) and attribute
(property) on input, and are stored in a database-language-independent fashion in the internal
representation. This meta-data is then processed to derive the appropriate schema information of a
particular DBMS. In this way it is possible to use a single representation and yet deal with issues
related to most types of DBMSs. A similar approach is used for query transformation between
source and target representations.
Figure 7.1: Internal Architecture of CCVES (the DBC and DBE processes feed the internal
representation, which is used by the SMVS, QMTS and SMTS modules)

The meta-data we deal with has been classified into two types. The first category represents essential
meta-data and the other represents derived meta-data. Information that describes an entity and its
attributes, and constraints that identify relationships/hierarchies among entities, are the essential
meta-data (see section 7.2.1), as they can be used to build a conceptual model. Information that is
derived from the essential meta-data for use in the conceptual model constitutes the other type of
meta-data. When performing our reverse-engineering process we look only at the essential meta-
data. This information is extracted from the respective databases during the initial database access
process (i.e. DBC in figure 7.1).

7.2.1 Essential Meta-data

Our essential (basic) meta-data internal representation captures sufficient information to allow us to
reproduce a database schema using the DDL syntax of any DBMS. This representation covers entity
and view definitions and their associated constraints. The following 5 Prolog-style constructs were
chosen to represent this basic meta-data (see figure 7.2). The first two constructs, namely class and
class_property, are fundamental to any database schema as they describe the schema entities and
their attributes, respectively. The third construct represents constraints associated with entities; this
information is only partially represented by most DBMSs. The next two constructs are relevant only
to some recent object-oriented DBMSs and are not supported by most DBMSs. We have included
them mainly to demonstrate how modern abstraction mechanisms such as inheritance hierarchies
could be incorporated into legacy DBMSs. By a similar approach, it is possible to add any other
appropriate essential meta-data constructs. For conceptual modelling, and for the type of testing we
perform for the chosen DBMSs, namely Oracle, INGRES and POSTGRES, we found that the 5
constructs described here are sufficient. However, some additional environmental data (see section
7.2.2), which allows identification of the name and the type of the current database, is also essential.
1. class(SchemaId, CLASS_NAME).
2. class_property(SchemaId, CLASS_NAME, PROPERTY_NAME, PROPERTY_TYPE).
3. constraint(SchemaId, CLASS_NAME, PROPERTY_list, CONST_TYPE, CONST_NAME, CONST_EXPR).
4. class_inherit(SchemaId, CLASS_NAME, SUPER_list).
5. renamed_attr(SchemaId, SUPER_NAME, SUPER_PROP_NAME, CLASS_NAME, PROPERTY_NAME).

        Figure 7.2: Our Essential Meta-data Representation Constructs

We now provide a detailed description of our meta-representation constructs. This representation is
based on the concepts of the Object Abstract Conceptual Schema (OACS) [RAM91] used in
Ramfos's SMTS and other meta-processing systems; hence we have used the same name to refer to
our own internal representation. Ramfos's OACS internal representation provides a natural
abstraction of a particular structure based on the notion of objects. For example, when an object is
described, its attributes, constraints and other related properties are treated as a single construct,
although only part of it may be used at a time. Our OACS representation, in contrast, directly
resembles the internal representation structure of most relational DBMSs (e.g. class represents an
entity and class_property represents the attributes of an entity). This is the main difference between
the two representations. However, it is possible to map the OACS constructs of Ramfos to our
internal representation and vice-versa, so our decision to use a variation of the original OACS does
not affect the meta-representation and processing principles in general.

• Meta-data Representation of class
The names of all the entities of a particular schema are recorded using class. This information is
processed to identify all the entities of a database schema.

• Meta-data Representation of class_property
The names of all attributes and their data types for a particular schema are recorded using
class_property. This information is processed to identify all the attributes of an entity.

• Meta-data Representation of constraint
All types of constraints associated with entities are recorded using constraint. This information has
been organised to represent constraints as logical expressions, along with an identification name and
the participating attributes. The different types of constraint, i.e. primary key, foreign key, unique,
not null, check constraints, etc., are each processed and stored in this form. Usually a certain amount
of preprocessing is required to construct our generalised representation of a constraint. For example,
some check constraints extracted from the INGRES DBMS need to be preprocessed to allow them to
be classified as check constraints by our system.

• Meta-data Representation of class_inherit
Entities that participate in inheritance hierarchies are recorded using class_inherit. The names of all
super-entities for a particular entity are recorded here. This information is processed to identify all
sub-entities of an entity and the inheritance hierarchies of a database schema.

• Meta-data Representation of renamed_attr
During an inheritance process, some attribute names may be changed to give more meaningful
names to the inherited attributes. Once the inherited names are changed it becomes impossible to
automatically reverse-engineer these entities, as their attribute names no longer match. To overcome
this problem we have introduced an additional meta-data representation construct, renamed_attr,
which keeps track of all attributes whose names have changed due to inheritance. It is a
representation of synonyms for the attribute names of an inheritance hierarchy.
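As an illustration (it does not appear among the thesis figures), the following hedged sketch shows
how a small fragment of the College database of chapter 6 might be recorded in these constructs. The
schema identifier, the property types and the encoding of constraint types and expressions are
assumptions made for the example.

    % Hedged sketch: OACS facts for part of the College database (cf. figures 6.3 and 6.4).

    class('Uni_db', 'Office').
    class('Uni_db', 'Employee').

    class_property('Uni_db', 'Office',   'code',     char(4)).
    class_property('Uni_db', 'Office',   'inCharge', char(8)).
    class_property('Uni_db', 'Employee', 'empNo',    char(8)).

    constraint('Uni_db', 'Office', ['code'], primary_key, 'Office_PK',
               'PRIMARY KEY (code)').
    constraint('Uni_db', 'Office', ['inCharge'], foreign_key, 'Office_FK_Staff',
               'FOREIGN KEY (inCharge) REFERENCES Employee').
    constraint('Uni_db', 'College', ['phone'], check, 'College_phone',
               'phone IS NOT NULL').

    % College inherits from Office, with principal renamed from inCharge (cf. figure 6.3).
    class_inherit('Uni_db', 'College', ['Office']).
    renamed_attr('Uni_db', 'Office', 'inCharge', 'College', 'principal').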
7.2.2 Environmental Data

Environmental data is recorded using ccves_data, which represents three types of information,
namely the database name, the DBMS name and the name of the host machine (see figure 7.3).
These are captured at the database connection stage.

ccves_data(dbname, DATABASE_NAME).
ccves_data(dbms, DBMS_NAME).
ccves_data(host, HOST_MACHINE_NAME).

        Figure 7.3: OACS Constructs used as environmental data

7.2.3 Processed Meta-data

The essential meta-data described in section 7.2.1 is processed to derive additional information
required for conceptual modelling. This additional information comprises schema_data, class_data
and relationship. Here, schema_data (cf. figure 7.4 section 1) identifies all entities (all_classes, using
class of figure 7.2 section 1) and the entity types (link_classes and weak_classes, by the process
described in section 5.4, using constraint types such as primary and foreign key which are recorded
in constraint of figure 7.2 section 3). Class_data (cf. figure 7.4 section 2) identifies all class
properties (property_list, using class_property of figure 7.2 section 2), inherited properties (using
class_property, class_inherit and renamed_attr of figure 7.2 sections 2, 4 and 5, respectively), sub-
and super-classes (subclass_list and superclass_list, using class_inherit of figure 7.2), and
referencing and referenced classes (ref and refed, using the foreign key constraints recorded in
constraint of figure 7.2). Relationship (cf. figure 7.4 section 3) records the relationship types (derived
using the process described in section 5.4) and cardinality information (using the derived
relationship types and the available cardinality values).
1. schema_data(SchemaId, [ all_classes(ALL_CLASS_list),
                           link_classes(LINK_CLASS_list),
                           weak_classes(WEAK_CLASS_list) ]).

2. class_data(SchemaId, CLASS_NAME, [ property_list(OWN_PROPERTY_list, INHERIT_PROPERTY_list),
                                      subclass_list(SUBCLASS_list),
                                      superclass_list(SUPERCLASS_list),
                                      ref(REFERENCING_CLASS_list),
                                      refed(REFERENCED_CLASS_list) ]).

3. relationship(SchemaId, REFERENCING_CLASS_NAME, RELATIONSHIP_TYPE, CARDINALITY,
                REFERENCED_CLASS_NAME).

        Figure 7.4: Derived OACS Constructs

7.2.4 Graphical Constructs

Besides the above OACS representations it is necessary to support additional constructs to produce a
graphical display of a conceptual model. For this we produce graphical constructs from our derived
OACS constructs (cf. figure 7.4) and apply a display layout algorithm (see section 7.3). We call these
graphical object abstract conceptual schema (GOACS) constructs, as they are graphical extensions of
our OACS constructs. The graphical display represents entities, their attributes (optional),
relationships, etc., using graphical symbols which consist of strings, lines and widgets (basic toolkit
objects which, unlike strings and lines, retain their data after being written to the screen [NYE93]).
To produce this display, the coordinates of the positions of all entities, relationships, etc., are derived
and recorded in our graphical constructs. The coordinates of each entity are recorded using
class_info, as shown in section 1 of figure 7.5; this information identifies the top-left coordinates of
an entity.

1. class_info(SchemaId, CLASS_NAME, [ x(X0), y(Y0) ]).

2. box(SchemaId, X0, Y0, W, H, REGULAR_CLASS_NAME).
   box_box(SchemaId, X0, Y0, W, H, Gap, WEAK_CLASS_NAME).
   diamond_box(SchemaId, X0, Y0, W, H, LINK_CLASS_NAME).

3. ref_info(SchemaId, REFERENCING_CLASS_NAME, REFERENCING_CLASS_CONNECTING_SIDE,
            REFERENCING_CLASS_CONNECTING_SIDE_COORDINATE, REFERENCED_CLASS_NAME,
            REFERENCED_CLASS_CONNECTING_SIDE, REFERENCED_CLASS_CONNECTING_SIDE_COORDINATE).

4. line(SchemaId, X1, Y1, X2, Y2).
   string(SchemaId, X0, Y0, STRING_NAME).
   diamond(SchemaId, X0, Y0, W, H, ASSOCIATION_NAME).

5. property_line(SchemaId, CLASS_NAME, X1, Y1, X2, Y2).
   property_string(SchemaId, CLASS_NAME, PROPERTY_NAME, DISPLAY_COLOUR, X0, Y0).

        Figure 7.5: Graphical Constructs (GOACS)

The graphical symbol for an entity depends on the entity type, so further processing is required to
graphically categorise entity types. For the EER model, we categorise entities as follows: regular
entities as box, weak entities as box_box and link entities as diamond_box (cf. section 2 of figure 7.5,
and figure 7.6).
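For illustration, the sketch below shows GOACS facts that such processing might produce for the
Employee and Office entities of the College database; all coordinates and sizes are invented for the
example and would in practice be produced by the layout algorithm of section 7.3.

    % Hedged sketch: graphical facts for two regular entities and the connection
    % arising from the worksFor reference between them.

    class_info('Uni_db', 'Employee', [ x(40),  y(120) ]).
    class_info('Uni_db', 'Office',   [ x(260), y(120) ]).

    box('Uni_db', 40,  120, 100, 40, 'Employee').   % regular entities are drawn as boxes
    box('Uni_db', 260, 120, 100, 40, 'Office').

    line('Uni_db', 140, 140, 260, 140).             % connection between the two boxes
    string('Uni_db', 185, 132, 'worksFor').         % label for the connection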
We use an intermediate representation construct, ref_info, to assist in the derivation of appropriate
coordinates for all associations (cf. section 3 of figure 7.5). With the assistance of ref_info, the
coordinates used to represent relationships are derived and recorded using line, string and diamond
(cf. section 4 of figure 7.5, and figure 7.7).

Users of our schema displays are allowed to interact with schema entities. During this process,
optional information such as the properties (i.e. attributes) of selected entities can be added to the
display. This feature is the result of providing the entities and their attributes at different levels of
abstraction. The added information is recorded separately using property_line and property_string
(cf. section 5 of figure 7.5, and figure 7.8).

Figure 7.6: Graphical representation of entity types (box, box_box, diamond_box) in EER notation
Figure 7.7: Graphical representation of connections, labels and associations (line, string, diamond)
            in EER notation
Figure 7.8: Graphical representation of selected attributes of a class (property_line, property_string)
            in EER notation

7.3 Display Layout Algorithm

To produce a suitable display of a database schema it was necessary to adopt an intelligent algorithm
which determines the positioning of objects in the display. Such algorithms have been used by many
researchers, and also commercially, for similar purposes [CHU89]. We studied these ideas and
implemented our own layout algorithm, which proved to be effective for small, manageable database
schemas. However, to allow displays to be altered to a user-preferred style, and to make our method
effective for large schemas, we decided to incorporate an editing facility. This feature allows users to
move entities and so change their original positions in a conceptual schema. Internally, this is done
by changing the coordinates recorded in class_info for a repositioned entity and recomputing all its
associated graphical constructs.
When the location of an entity is changed, the connection side of that entity may also need to be
changed. To deal with this case, appropriate sides for all entities can be derived at any stage of our
editing process. When the appropriate sides are derived, the ref_info construct (cf. section 3 of figure
7.5) is updated accordingly, to enable us to reproduce the revised coordinates of the line, string and
diamond constructs (cf. section 4 of figure 7.5).

Our layout algorithm does the following:

1. Entities connected to each other are identified (i.e. grouped) using their referenced-entity
   information. This process highlights unconnected entities as well as entities forming hierarchies
   or graph structures.

2. Within a group, entities are rearranged according to the number of connections associated with
   them. This arrangement puts the entities with the most connections at the centre of the display
   structure and the entities with the fewest connections at the periphery.

3. A tree representation is then constructed, starting from the entity having the most connections.
   During the construction of subsequent trees, entities which have already been used are not
   considered, to prevent their original position being changed.

This makes it easy to visualise the relationships/aggregations present in a conceptual model. The
identification of such properties allows us to gain a better understanding of the application being
modelled. Similarly, attempts are made to highlight inheritance hierarchies whenever they are
present. However, when too many inter-related entities are involved, it is sometimes necessary to use
the move editing facility to relocate some entities so that their properties (e.g. relationships) are
highlighted in the diagram. The existence of such hidden structures is due to the cross connection of
some entities. To prevent overlapping of entities, relationships, etc., initial placement is done using
an internal layout grid; however, the user is permitted to overlap or place entities close to each other
during schema editing. The coordinate information of a final diagram is saved in disk files, so that
these coordinates are automatically available to all subsequent re-engineering processes. Hence our
system first checks for the existence of a file containing these coordinates, and only in its absence
does it use the above layout algorithm.
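As an illustration of step 2 above, the following sketch orders the entities of a group by the number
of connections they participate in, using the derived relationship construct of figure 7.4. The
predicate names other than relationship/5 are our own inventions, and the real implementation
handles ties, hierarchies and grouping differently.

    % Hedged sketch: ordering a group of entities by decreasing connection count,
    % so that the most highly connected entity can be placed at the centre.

    :- use_module(library(lists)).   % member/2, reverse/2

    connection_count(SchemaId, Class, Count) :-
        findall(Other,
                ( relationship(SchemaId, Class, _, _, Other)
                ; relationship(SchemaId, Other, _, _, Class) ),
                Others),
        length(Others, Count).

    order_by_connections(SchemaId, Group, Ordered) :-
        findall(Count-Class,
                ( member(Class, Group),
                  connection_count(SchemaId, Class, Count) ),
                Pairs),
        keysort(Pairs, Ascending),
        reverse(Ascending, Descending),
        findall(Class, member(_-Class, Descending), Ordered).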
7.4 External Architecture and Operation of CCVES

We characterise CCVES by first considering the type of people who may use the system. This is
followed by an overview of the external system components. Finally, the external operations
performed by the system are described.

7.4.1 CCVES Operation

The three main operations of CCVES, i.e. analysing, enhancing and incremental migration, need to
be performed by a suitably qualified person. This person must have a good knowledge of the current
database application to ensure that only appropriate enhancements are made to it. Also, this person
must be able to interpret and understand conceptual modelling and the data manipulation language
SQL, as we have used the SQL syntax to specify the contents of databases. This person must have
the skills to design and operate a database application, and is thus a more specialised user than the
traditional ad hoc user. We shall therefore refer to this person as a DataBase Administrator (DBA),
although they need not be a professional DBA. It is this person who will be in charge of migrating
the current database application.

To this DBA, the process of accessing meta-data from a legacy database service in a heterogeneous
distributed database environment is fully automated once the connection to the database of interest is
made. The production of a graphical display representation for the relevant database schema is also
fully automated. This representation shows all available meta-data, links and constraints in the
existing database schema. Links and constraints defined by hand coding in the legacy application
(i.e. not in the database schema but appearing in the application in the form of 3GL or equivalent
code) will not be shown until they are supplied to CCVES during the process of enhancing the
legacy database. Such enhancements are represented in the database itself to allow automatic reuse
of these additions, not only by our system users but also by others (i.e. users of other database
support tools).

The enhancement process will assist the DBA in incrementally building the database structure for the
target database service. Possible decomposable modules of the legacy system are identified during
this stage. Finally, when the incremental migration process has been performed, the DBA may need
to review its success by viewing both the source and the target database schemas. This is achieved
using the facility to visualise multiple heterogeneous databases.

We have sought to meet our objectives by developing an interactive schema visualisation and
knowledge acquisition tool which is directed by an inference engine using a real world data
modelling framework based on the EER and OMT conceptual models and extended relational
database modelling concepts. This tool has been implemented in prototype form mostly in Prolog,
supported by some C language routines embedded with SQL to access and use databases built with
the INGRES DBMS (version 6), the Oracle DBMS (versions 6 and 7) or the POSTGRES O-O data
model (version 4). The Prolog code, which does the main processing and uses X window and Motif
widgets, exceeds 13,000 lines, while the C code embedded with SQL ranges from 100 to 1,000 lines
depending on the DBMS.

7.4.2 System Overview

This section defines the external architecture and operation of CCVES. It covers the design and
structure of its main interfaces, namely database connection (access), database selection (processing)
and user interaction (see figure 7.9). The heart of the system is a meta-management module (MMM)
(see figure 7.10), which processes and manages meta-data using a common internal intermediate
schema representation (cf. section 7.2). A presentation layer which offers display and dialog
windows has been provided for user interaction. The schema visualisation, schema enhancement,
constraint visualisation and database migration modules (cf. figure 7.9) communicate with the user.
Figure 7.9: Principal processes and control flow of CCVES (database access via Connect Database,
database processing via Select Database, and user interaction through the schema enhancement,
schema visualisation, constraint visualisation and database migration modules, with links to external
GUI database tools and a query tool)

The meta-data and knowledge for this system is extracted from the respective database system tables
and stored using a common internal representation (OACS). This knowledge is further processed to
derive the graphical constructs (GOACS) of a visualised conceptual model. Information is
represented in Prolog, as dynamic predicates, to describe facts, and the semantic relationships that
hold between facts, about graphical and textual schema components. The meta-management module
has access to the selected database to store any changes (e.g. schema enhancements) made by the
user. The input/output interfaces of the MMM manage the presentation layer of CCVES, which
consists of X window and Motif widgets used to create an interactive graphical environment for
users.

In section 2.2 we introduced the functionality of CCVES in terms of information flow, with special
emphasis on its external components (cf. figure 2.1). Later, in sections 2.3 and 7.1, we described the
main internal processes of CCVES (cf. figures 2.2 and 7.1). Here, in figure 7.10, we show both the
internal and the external components of CCVES together, with special emphasis on the external
aspect.

7.4.3 System Operation

The system has three distinct operational phases: meta-data access, meta-data processing and user
interaction. In the first phase, the system communicates with the source legacy database service to
extract meta-data[19] and knowledge[20] specifications from the database concerned. This is
achieved when connection to the database (connect database of figure 7.10) is made by the system,
and is the meta-data access phase. In the second phase, the source specifications extracted from the
database system tables are analysed, along with any graphical constructs we may have subsequently
derived, to form the meta-data and meta-knowledge base of the MMM.

[19] Meta-data represents the original database schema specifications of the database.
[20] Knowledge represents subsequent knowledge we may have already added to augment this
database schema.
This information is used to produce a visual representation in the form of a conceptual model. This
phase is known as meta-data processing and is activated when select database (cf. figure 7.10) is
chosen by the user. The final phase is interaction with the user. Here, the user may supply the system
with semantic information to enrich the schema; visualise the schema using a preferred modelling
technique (EER and OMT are currently available); select graphical objects (i.e. classes) and visualise
their properties and intra- and inter-object constraints using the constraint window; and modify the
graphical view of the displayed conceptual model. The user may also incrementally migrate selected
schema constructs; transfer selected meta-data to other tools (e.g. MITRA, a query tool [MAD95]);
accept meta-data from other tools (e.g. REVEERD, a reverse-engineering tool [ASH95]); and
examine the same database using another window of CCVES or other database design tools (e.g.
Oracle*Design). The objective of providing the user with a wide range of design tools is to optimise
the process of analysing the source legacy database. The enhancement of the legacy database with
constraints is an attempt to collect, within the legacy database, information that is managed by
modern DBMSs, without affecting its operation and in preparation for its migration.

Figure 7.10: External Architecture of CCVES (the designer interacts through display and dialog
windows, i.e. the schema displays (OMT/EER), the constraint window and external tools such as
GQL and Oracle*Design; these are served by the meta-management module, whose input/output
interface, meta-knowledge base (OACS and GOACS), meta-processor, meta-transformation,
meta-translation and meta-storage components connect to the underlying heterogeneous distributed
databases)

For successful system operation, users need not be aware of the internal schema representation or of
any non-SQL database-specific syntax of the source or target database. This is because all source
schemas are mapped into our internal representation and are always presented to the user using the
standard SQL language syntax (unless specifically requested otherwise).
This enables the user to deal with the problem of heterogeneity, since at the global level local
databases are viewed as if they come from the same DBMS. The SQL syntax is used by default to
express the associated constraints of a database. If specifically requested, the SQL syntax can be
translated and viewed using the DDL of the legacy DBMS; as far as CCVES is concerned this is just
another meta-translation process. A textual version of the original legacy database definition is also
created by CCVES when connection to the legacy database is established. This definition may be
viewed by the user for a better understanding of the database being modelled.

The ultimate migration process allows the user to employ a single target database environment for
all legacy databases. This will assist in removing the physical heterogeneity between those databases.
The complete migration process may take days for large information systems, as they already hold a
large volume of data. Hence the ability to enhance and migrate while legacy databases continue to
function is an important feature. Our enhancement process does not affect existing operations, as it
involves adding new knowledge and validating existing data. Whenever structural changes are
introduced (e.g. an inheritance hierarchy), we have proposed the use of view tables (cf. section 5.6)
to ensure that normal database operations will not be affected until the actual migration is
commenced. This is because some data items may continue to change while the migration is in
preparation, and indeed during the migration itself. We have proposed an incremental migration
process to minimise this effect, and the use of a forward gateway to deal with such situations.

7.5 External Interfaces of CCVES

CCVES is seen by its users as consisting of four processes, namely a database access process, a
schema and constraint visualisation process, a schema enhancement process, and a schema migration
process. The database access process was described in section 6.5. In the next subsections we
describe the other three processes of CCVES to complete the picture.

7.5.1 Schema and Constraint Visualisation

The input/output interfaces of the MMM manage the presentation layers of CCVES. These layers
consist of display and dialog windows used to provide an interactive graphical environment for
users. The user is presented with a visual display of the conceptual model for a selected database,
and may perform many operations on this schema display window (SDW) to analyse, enhance,
evolve, visualise and migrate any portion of that database. Most of these operations are done via the
SDW as they make use of the conceptual model. The traditional conceptual model is an E-R diagram,
which displays only entities, their attributes and relationships. This level of abstraction gives the user
a basic idea of the structure of a database; however, it is not sufficient to gain a more detailed
understanding of the components of a conceptual model, including the identification of intra- and/or
inter-object constraints. Intra-object constraints for an entity provide extra information that allows
the user to identify behavioural properties of the entity. For instance, the attributes of an entity do not
provide sufficient information to determine the type of data that may be held by an attribute and any
restrictions that may apply to it.
Hence, providing a higher level of abstraction by displaying constraints along with their associated
entities and attributes gives the user a better understanding of the conceptual model. The result is
much more than a static entity and attribute description of a data model, as it describes how the
model behaves for dynamic data (i.e. a constraint implies that any data item which violates it cannot
be held by the database).

The schema visualisation module allows users to view the conceptual schema and the constraints
defined for a database. This visualisation process uses three levels of abstraction. The top level
describes all the entity types along with any hierarchies and relationships. The properties of each
entity are viewed at the next level of abstraction, to increase the readability of the schema. Finally,
all constraints associated with the selected entities and their properties are viewed, to gain a deeper
and better understanding of the actual behaviour of the selected database components. The
conceptual diagrams of our test databases are given in appendix C; these diagrams are at the top level
of abstraction. Figures 7.11 and 7.12 show the other two levels of abstraction.

The graphical schema displayed in the SDW for a selected database uses the OMT notation by
default, which can be changed to EER from a menu. Users can produce any number of schema
displays for the same schema, and thus can visualise a database schema using both OMT and EER
diagrams at the same time (a picture of our system while viewing the same schema using both forms
is provided in figure 7.11). The display size of the schema may also be changed from a menu. A
description that identifies the source of each display is provided, as we are dealing with many
databases in a heterogeneous environment. The diagrams produced by CCVES can be edited to alter
the location of their displayed entities and hence to permit visualisation of a schema in the personally
preferred style and format of an individual user. This is done by selecting and moving a set of
entities within the scrolling window, thus altering the relative positions of entities within the diagram
produced. These changes can be saved by users and are automatically restored for the next session.

The system allows interactive selection of entities and attributes from the SDW. We initially do not
include any attributes as part of the displayed diagram, because we provide them as a separate level
of abstraction. A list of the attributes associated with an entity can be viewed by first selecting the
entity from the display window (abstraction at level 2), and then browsing through its attributes in
the attribute window, which is placed just below the display window (see figure 7.12). Any attribute
selected from this window will automatically be transferred to the main display window, so that only
attributes of interest are displayed there. This technique increases the readability of our display
window. At each stage, appropriate messages produced by the system display unit are shown at the
bottom of this window. For successful interactions, graphical responses are used whenever
applicable; for example, when an entity is selected by clicking on it, the selected entity is
highlighted. In this thesis we do not provide interaction details, as these are provided at the user
manual level. The 'browser' menu option of the SDW invokes the browser window.
This is done only when a user wishes to visit the constraint visualisation abstraction level, our third
level of abstraction. Here we see all entities and their properties of interest as the default option, but
we can expand this to display other entities and properties by choosing appropriate menu options
from the browsing window.
We can also filter the displayed constraints to include only those of interest (e.g. show only domain
constraints). In cases where inherited attributes are involved, the system will initially show only
those attributes owned by an entity (the default option); others can be viewed by selecting the
required level of abstraction (available in the menu) for the entity concerned. The ability to select an
entity from the display window and display its properties in the browsing window satisfies our
requirement of visualising intra-object constraints. The reverse of this process, i.e. selecting a
constraint from the browsing window and displaying all its associated entities in the display window,
satisfies our inter-object constraint visualisation requirement. Both of these facilities are provided by
our system (see figures 7.11 and 7.12, respectively).

All operations with the mouse device are done using the left button, except when altering the
location of an entity of a displayed conceptual model. We have allowed the use of the middle button
of the mouse to select, drag and drop[21] such an entity. This process alters the position of the entity
and redraws the conceptual model. By this means, object hierarchies, relationships, etc., can be made
prominent by placing the objects concerned in hierarchies close to each other. This feature was
introduced firstly to allow users to visualise a conceptual model in a preferred manner, and secondly
because our display algorithm was not capable of automatically providing such features when
constructing complex schemas having many entities, hierarchies and relationships (cf. section 7.3).

[21] CCVES changes the cursor symbol to confirm the activation of this mode.
7.5.2 Schema Enhancement

Schema enhancements are also done via the schema display window. This module is mainly used to
specify dynamic constraints. Such constraints usually have to be extracted from the legacy code of a
database application, as in older systems they were specified in the application itself. Constraint
extraction from legacy applications is not supported by CCVES, so this information must be
obtained by other means. We assume that such constraints can be identified by examining the legacy
code, by using any existing documentation or by drawing on user knowledge of the application, and
we have introduced this module to capture them. We have also provided an option to detect possible
missing, hidden and redundant information (cf. section 5.5.2) to assist users in formulating new
constraints.

The user specifies constraints via the constraint enhancement interface by choosing the constraint
type and the associated attributes. In the case of a check constraint, the user needs to specify it using
SQL syntax. The constraint enhancement process allows further constraints to appear in the
graphical model. This development is presented via the graphical display, so that the user is aware of
the existing links and constraints present for the schema. For instance, when a foreign key is
specified, CCVES will try to derive a relationship using this information; if this process is
successful, a new relationship will appear in the conceptual model.

A graphical user interface in the form of a pop-up sub-menu is used to specify constraints, which
take the form of integrity constraints (e.g. primary key, foreign key, check constraints) and structural
components (e.g. inheritance hierarchies, entity modifications). In figure 7.13 we present some pop-
up menus of CCVES which assist users in specifying various types of constraints. Here, the names of
entities and their attributes are automatically supplied via pull-down menus to ensure the validity of
certain input components of user-specified constraints. For all constraints, information about the type
of constraint and the class involved is specified first. When the type of constraint is known, the prior
existence of such a constraint is checked in the case of primary key and foreign key constraints. For
primary keys, the process will not proceed if a key specification already exists; for foreign keys, if
the attributes already participate in such a relationship they will not appear in the referencing
attribute specification list. In the case of referenced attributes, only attributes with at least the
uniqueness property will appear, in order to prevent the specification of invalid foreign keys.

All enhanced constraints are stored internally until they are added to the database using another
menu option. Prior to this augmentation process the user should verify the validity of the constraints.
In the case of recent DBMSs like Oracle, invalid constraints will be rejected automatically and the
user will be requested to amend or discard them. In such situations the incorrect constraints are
reported to the user and are stored on disk as a log file. Also, no changes made during a session are
saved until specifically instructed by the user. This gives the opportunity to rollback (in the event of
an incorrect addition) and resume from the previous state.
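To illustrate the relationship derivation step mentioned above, the following hedged sketch derives a
new relationship fact from a user-specified foreign key, using the constraint and relationship
constructs of figures 7.2 and 7.4. For the example we assume that the foreign key expression is held
as a structured term references(Attributes, Entity) and that an unclassified relationship defaults to
1:N; the actual encoding and the classification rules of section 5.4 differ.

    % Hedged sketch: deriving a relationship from a newly specified foreign key.

    :- dynamic constraint/6, relationship/5.

    derive_relationship(SchemaId, Referencing) :-
        constraint(SchemaId, Referencing, Attrs, foreign_key, _Name,
                   references(Attrs, Referenced)),
        \+ relationship(SchemaId, Referencing, _, _, Referenced),   % not already derived
        assertz(relationship(SchemaId, Referencing, '1:N', unspecified, Referenced)).

    % e.g. after the DBA adds a foreign key on Office(inCharge):
    % ?- derive_relationship('Uni_db', 'Office').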
Figure 7.13: Two stages of a Foreign Key constraint specification

Input data in the form of constraints to enhance the schema provides new knowledge about a
database. It is essential to retain this knowledge within the database itself if it is to be used for any
future processing. CCVES achieves this using its knowledge augmentation process, as described in
section 5.7. From a user's point of view this process is fully automated and hence no intermediate
interactions are involved. The enhanced knowledge is augmented only if the database is unable to
represent the new knowledge naturally. Such knowledge cannot be automatically enforced except via
a newer version of the DBMS or another DBMS (if supported), by migrating the database. However,
the data held by the database may already fail to conform to the new constraints, and hence existing
data may be rejected by the target DBMS. This would result in loss of data and/or migration delays.
To address this problem, we provide an optional constraint enforcement process which checks the
conformance of the data to the new constraints prior to migration.

7.5.3 Constraint Enforcement and Verification

Constraint enforcement is automatically managed only by relatively recent DBMSs. If CCVES is
used to enhance a recent DBMS such as Oracle, then verification and enforcement will be handled
by the DBMS, as CCVES will simply create the constraints using the DDL commands of that
DBMS. However, when such features are not supported by the underlying DBMS, CCVES has to
provide such a service itself. Our objective in this process is to give users the facility to ensure that
the database conforms to all the enhanced constraints.

Constraints are chosen from the browser window to verify their validity. Once selected, the
constraint verification process will issue each constraint to the database using the technique
described in section 5.8 and report any violations to the user. When a violated constraint is detected,
the user can decide whether to keep or discard it. The user could decide to retain the constraint in the
knowledge-base for various reasons, including: ensuring that future data conforms to the constraint;
providing users with a guideline to the system's data contents irrespective of violations that may
occur occasionally; and assisting the user in improving the data or the constraint. To enable the
enforcement of such constraints for future data instances, it is necessary either to use a trigger (e.g.
an on-append check constraint) or to add a temporal component to the constraint (e.g. system date >
constraint input date). This ensures that the constraint is not enforced on existing data.
When using queries to verify enhanced constraints, the retrieved data are the instances that violate a
constraint. In such a situation, retrieving a large number of instances for a given query does not make
much sense, as it may be due to an incorrect constraint specification rather than to the data itself.
Therefore, in the event of an output exceeding 20 instances, we terminate the query and instruct the
user to inspect this constraint as a separate action.

7.5.4 Schema Migration

The migration process is provided to allow an enhanced legacy system to be ported to a new
environment. This is performed incrementally, by initially creating the schema in the target DBMS
and then copying the legacy data to the target system. To create the schema in the target system,
DDL statements are generated by CCVES. An appropriate schema meta-translation process is
performed if required (e.g. if the target DBMS has a non-SQL query language). The legacy data is
migrated using the import/export tools of the source and target DBMSs.

The migration process is not fully automated, as certain conflicts cannot be resolved without user
intervention. For example, if the target database only accepts names of length 16 (as in Oracle)
instead of 32 (as in INGRES) in the source database, then a name resolution process must be
performed by the user. Also, names used in one DBMS may be keywords in another. Our system
resolves these problems by adding a tag to such names or by truncating the length of a name (a
sketch of this step is given at the end of this section). This approach is not generic, as the uniqueness
property of an attribute cannot be maintained by truncating its original name; in these situations user
involvement is unavoidable.

Although CCVES has been tested for only three types of DBMS, namely INGRES, POSTGRES and
Oracle, it could easily be adapted for other relational DBMSs, as they represent their meta-data
similarly - i.e. in the form of system tables, with minor differences such as table and attribute names
and some table structures. Non-relational database models accessible via ODBC or other tools (e.g.
Data Extract for DB2, which permits movement of data from IMS/VS, DL/1, VSAM and SAM to
SQL/DS or DB2) could also be accommodated, as the meta-data required by CCVES could be
extracted from them. Previous work related to meta-translation [HOW87] investigated the translation
of dBase code to INGRES/QUEL, demonstrating the applicability of this technique in general, not
only to the relational data model but also to others such as the CODASYL and hierarchical data
models. This means CCVES is capable in principle of being extended to cope with other data
models.
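The following hedged sketch illustrates the name-resolution step described above for the schema
migration process. The predicate names, the keyword list and the tagging scheme are our own
illustrative assumptions; the actual implementation covers further conflicts.

    % Hedged sketch: resolving identifier names for a target DBMS by tagging
    % keyword clashes and truncating names that exceed the target's length limit.

    target_keyword(oracle_v7, level).
    target_keyword(oracle_v7, size).

    resolve_name(TargetDbms, MaxLen, Name, Resolved) :-
        ( target_keyword(TargetDbms, Name) ->
            atom_concat(Name, '_t', Tagged)            % tag a name that clashes with a keyword
        ;   Tagged = Name
        ),
        atom_length(Tagged, Len),
        ( Len > MaxLen ->
            sub_atom(Tagged, 0, MaxLen, _, Resolved)   % truncate to the target's limit
        ;   Resolved = Tagged
        ).

    % e.g. ?- resolve_name(oracle_v7, 16, undergraduateRegistrationNo, N).
    % N = undergraduateReg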
CHAPTER 8

Evaluation, Future Work and Conclusion

In this chapter the Conceptualised Constraint Visualisation and Enhancement System (CCVES) described in Chapters 5, 6 and 7 is evaluated with respect to our hypotheses and the objectives listed in Chapter 1. We describe the functionality of the different components of CCVES to identify their strengths and summarise their limitations. Potential extensions and improvements are considered as part of future work. Finally, conclusions about the work are drawn by reviewing the objectives and the evaluation.

8.1 Evaluation

8.1.1 System Objectives and Achievements

The major technical challenge in designing CCVES was to provide an interactive graphical environment for accessing and manipulating legacy databases within an evolving heterogeneous distributed database environment, for the purpose of analysing, enhancing and incrementally migrating legacy database schemas to modern representations. The objective of this exercise was to enrich a legacy database with valuable additional knowledge that has many uses, without being restricted by the existing legacy database service and without affecting the operation of the legacy IS. This knowledge is in the form of constraints that can be used to understand and improve the data of the legacy IS. Here, we assess the main external and internal aspects of our system, CCVES, based on the objectives laid out in sections 1.2 and 2.4. Externally, CCVES performs three important tasks: initially, a reverse-engineering process; then, a knowledge augmentation process, which is a re-engineering process applied to the original system; and finally, an incremental migration process. The reverse-engineering process is fully automated and is generalised to deal with the problems caused by heterogeneity.

a) A framework to address the problem of heterogeneity

The problems of heterogeneity have been addressed by many researchers, and at Cardiff the meta-translation technique has been used successfully to demonstrate its wide-ranging applicability to heterogeneous systems. This previous work, which includes query meta-translation [HOW87], schema meta-translation [RAM91] and schema meta-integration [QUT93], was developed using Prolog, emphasising its suitability for meta-data representation and processing. Hence Prolog was chosen as the main programming language for the development of our system. Among the available Prolog implementations, we found that Quintus Prolog was well suited to supporting an interactive graphical environment, as it provided access to X window and Motif widget routines. Also, the PRODBI tools [LUC93] were available with Quintus Prolog, and these enabled us to access a number of relational DBMSs directly, such as INGRES, Oracle and SYBASE.
Our framework for meta-data representation and manipulation has been described in section 7.2. The meta-programming approach enabled us to implement many other features, such as the ability to easily customise our system for different data models, e.g. relational and object-oriented (cf. section 7.2.1), the ability to easily enhance or customise it for different display models, e.g. E-R, EER and OMT (cf. section 7.2.4), and the ability to deal with heterogeneity due to differences in local databases (e.g. at the global level the user views all local databases as if they come from the same DBMS, and is also able to view databases using a preferred DDL syntax).

b) An interactive graphical environment

An interactive graphical environment which makes extensive use of modern graphical user interface facilities was required to provide graphical displays of conceptual models and allow subsequent interaction with them. To fulfil these requirements the CCVES software development environment had to be based on a GUI sub-system consisting of pop-up windows, pull-down menus, push buttons, icons, etc. We selected X window and Motif widgets to build such an environment on a UNIX platform, using SunSparc workstations. Provision of interactive graphical responses when working via this interface was also included to ensure user friendliness (cf. section 7.5).

c) Ability to access and work on legacy database systems

An initial, basic facility of our system was the ability to access legacy database systems. This process, which is described in section 6.5, enables users to specify and access any database system over a network. Here, as the schema information is usually static, CCVES has been designed to give the user the option of by-passing the physical database access process and using instead an existing (already accessed) logical schema. This saves time once the initial access to a schema has been made. It also guarantees access to the meta-data of previously accessed databases during server and network breakdowns, which were not uncommon during the development of our system.

d) A framework to perform the reverse-engineering process

A framework to perform the reverse-engineering process for legacy database systems has been provided. This process is based on applying a set procedure which produces an appropriate conceptual model (cf. section 5.2). It is performed automatically even if there is very limited meta-knowledge. In such a situation, links that should be present in the conceptual model will not appear in the corresponding graphical display. Hence, the full success of this process depends on the availability of adequate meta-knowledge. This means that a real world data modelling framework that facilitates the enhancement of legacy systems must be provided, as described next.

e) A framework to enhance existing systems

A comprehensive data modelling framework that facilitates the enhancement of established database systems has been provided (cf. section 5.6). A method of retaining the
enhanced knowledge for future use, in line with current international standards, is employed. Techniques used in recent versions of commercial DBMSs are supported, to enable legacy databases to logically incorporate modern data modelling techniques irrespective of whether these are supported by their legacy DBMSs (cf. section 5.7). This enhancement facility gives users the ability to exploit existing databases in new ways (i.e. restructuring and viewing them using modern features even when these are not supported by the existing system). The enhanced knowledge is retained in the database itself so that it is readily available for future exploitation by CCVES or other tools, or by the target system in a migration.

f) Ability to view a schema using preferred display models

The original objective of producing a conceptual model as a result of our reverse-engineering process was to display the structure of databases in a graphical form and so make it easier for users to comprehend their contents. As not all users are necessarily familiar with the same display model, the facility to visualise a schema using a user-preferred display model (e.g. EER or OMT) has been provided. This is more flexible than our original aim.

g) High level of data abstraction for better understanding

A high level of data abstraction for most components of a conceptual model (i.e. visualising the contents, relationships and behavioural properties of entities and constraints, including identification of intra- and inter-object constraints) has been provided (cf. section 7.5.1). Such features are not usually incorporated in visualisation tools. These features and various other forms of interaction with conceptual models are provided via the user interface of CCVES.

h) Ability to enhance the schema and to verify the database

The schema enhancement process was provided originally to enrich a legacy database schema and its resultant conceptual model. A facility to determine the constraints on the information held, and the extent to which the legacy data conforms to these constraints, is also provided to enable the user to verify their applicability (section 5.7). The graphical user interface components used for this purpose are described in section 7.5.2.

i) Ability to migrate while the system continues to function

The ability to enhance and migrate while a legacy database continues to function normally was considered necessary, as it ensures that this process will not affect the ongoing operation of the legacy system (section 5.8). The ability to migrate to a single target database environment for all legacy databases assists in removing the physical heterogeneity between these databases. Finally, the ability to integrate CCVES with other tools, to maximise the benefits to the user community, was also provided (section 7.4.3).

8.1.2 System Development and Performance

A working prototype CCVES system that enabled us to test all the contributions of this research was implemented using Quintus Prolog with the X window and Motif libraries; the INGRES,
Oracle and POSTGRES DBMSs; the C programming language with embedded SQL and POSTQUEL; and the PRODBI interface to INGRES. This system can be split into four parts, namely: the database access process, to capture meta-data from legacy databases; the mapping of the meta-data of a legacy database to a conceptual model, to present the semantics of the database in a graphical environment; the enhancement of a legacy database schema with constraint-based knowledge, to improve its semantics and functionality; and the incremental migration of the legacy database to a target database environment.

Our initial development commenced using POPLOG, which was at that time the only Prolog version with any graphical capabilities available on UNIX workstations at Cardiff. Our initial exposure to X window library routines occurred at this stage. Later, with the availability of Quintus Prolog, which had more powerful graphical capabilities due to its support of X windows and Motif widgets, it was decided to transfer our work to this superior environment. To achieve this we had to make two significant changes, namely: converting all POPLOG graphic routines to Quintus equivalents, and modifying a particular implementation approach we had adopted when working with POPLOG. The latter took advantage of POPLOG's support for passing unevaluated expressions as arguments of Prolog clauses; in Quintus Prolog we had to evaluate all expressions before passing them as arguments.

Due to the use of slow workstations (i.e. SPARC1s) and to running Prolog interactively, there was a delay in most interactions with our original system. This delay was significant (e.g. nearly a minute) when the conceptual model had to be redrawn. It was necessary to redraw this model whenever an object of the display was moved to change its location, and whenever the drawing window was exposed. Such exposure occurred when the window's position changed, when it was overlapped by another window or a menu, or when the user clicked on the window. In such situations it was necessary to refresh the drawing window by redrawing the model. Redrawing was required because our initial attempt at producing a conceptual model was based solely on drawing routines. This method was inefficient, as the drawings had to be redone every time the drawing window became exposed.

Our second attempt was to draw conceptual models in the background using a pixmap. This approach allocates part of the memory of the computer so that an image can be drawn and retained directly. A pixmap can be copied to any drawing window without having to reconstruct its graphical components. This means that when the drawing window becomes exposed it is possible to copy this pixmap to that window without redrawing the conceptual model. Copying a pixmap to the drawing window took only a few seconds, a significant improvement over our original method. However, with this new approach, whenever a move operation was performed it was still necessary to recompute all graphical settings and redraw, which took about as long as the original method. The use of a pixmap also took up a significant part of the computer's memory, and as a result Quintus was unable to cope if there was a need to view multiple conceptual models simultaneously. We also experienced several instances of unusual system behaviour, such as failure to execute routines that had been tested previously.
This was due to Prolog fully utilising its run-time memory because of the existence of this pixmap. We noticed that Quintus Prolog had a bug whereby it could not release the memory used by a pixmap. In order to regain this memory we
had to log out (exit) from the workstation, as the xnew process responsible for garbage collection was unable to deal with this case. Hence, we decided to use widgets instead of drawn rectangles for entities, as widgets are managed automatically by the X window and Motif routines. This allowed us to reduce the number of drawing components in our conceptual model and hence to minimise redrawing time when the drawing window became exposed. We discarded the pixmap approach as it gave us many problems. However, as widgets themselves take up memory, their behaviour under some complex conceptual models is questionable. We decided not to test this in depth, as we had already spent too much time on this module and its feasibility had been demonstrated satisfactorily.

During the course of CCVES development, Quintus Prolog was upgraded from release 3.1 to 3.1.4. Due to incompatibilities between the two versions, certain routines of our system had to be modified to suit the new version, which meant that a full test of the entire system was required. Also, since three versions of INGRES, two versions of Oracle and one of POSTGRES were used during our project, more and more system testing was required. Thus, we experienced several changes to our system due to technological changes in its development environment. Comparing the lifespan and scale of our project with those of a legacy IS, we can more clearly appreciate the amount of change that is required for such systems to keep up with technological progress and business needs. Hence, the migration of any IS is usually a complex process. However, the ability to enhance and evolve such a system without affecting its normal operation is a significant step towards assisting this process.

Our final task was to produce a compiled version of our system. This is still being undertaken: although we have been able to produce executable code, some user interface options are not being activated for unknown reasons (we suspect insufficient memory), even though the individual modules work correctly.

8.1.3 System Appraisal

The approach presented in this thesis for mapping a relational database schema to a conceptual schema is in many ways simpler and easier to apply than previous attempts, as it has eliminated the need for any initial user interaction to provide constraint-based knowledge for this process. Constraint information such as primary and foreign keys is used to derive the entity and relationship types automatically. Use of foreign key information was not considered in previous approaches, as most database systems did not support such facilities at that time. One major contribution of our work is providing the facility for specifying and using constraint-based information in any type of DBMS. This means that once a database is enhanced with constraints, it is semantically richer. If the source DBMS does not support constraints, the conceptual model will still be enhanced, and our tool will augment the database with these constraints in an appropriate form. Another innovative feature of our system is the automated use of the DML of a database to determine the extent to which its data conforms to the enhanced constraints. This enables users to take appropriate compensatory actions prior to migrating legacy databases.
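To illustrate the kind of meta-data this derivation relies on, the following sketch queries Oracle's data dictionary for foreign key constraints and the keys they reference. It is illustrative only: the catalogue views and column names differ between DBMSs and versions, which is precisely the heterogeneity that CCVES's meta-data access layer is designed to hide.

    -- Illustrative only (Oracle data dictionary): each foreign key ('R')
    -- is paired with the table whose primary or unique key it references.
    -- In conceptual modelling terms, each such pair gives rise to a
    -- relationship between two entity types.
    SELECT fk.table_name      AS child_table,
           fk.constraint_name AS foreign_key,
           pk.table_name      AS parent_table
    FROM   user_constraints fk,
           user_constraints pk
    WHERE  fk.constraint_type   = 'R'
    AND    fk.r_constraint_name = pk.constraint_name;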
We provided an additional level of schema abstraction for our conceptual models, in the form of viewing the constraints associated with a schema. This feature allows users to gain a better understanding of databases. The facility to view multiple schemas allows users to compare different components of a global system if it comprises several databases, which is very useful when dealing with heterogeneous databases. We also deal with heterogeneity at the conceptual viewing stage by providing users with the facility to view a schema using their preferred modelling notation. For example, in our system the user can choose either an EER or an OMT display to view a schema. This ensures greater accuracy of understanding, as the user can select a familiar modelling notation when viewing database schemas. The display of the same schema in multiple windows at different scales allows the user to focus on a small section of the schema in one window while retaining a larger view in another. The ability to view multiple schemas also means that it is possible to jointly monitor the progress or status of the source and target databases during an incremental migration.

The introduction of both EER and OMT as modelling options means that recent advances which were not present in the original E-R model, and in some subsequent variants, can be represented using our system. Our approach of augmenting the database itself with new semantic knowledge, rather than using separate specialised knowledge-bases, means that the enhanced knowledge is accessible to any user or tool using the DML of the database. This knowledge is represented in the database using an extended version of the SQL-3 standards for constraint representation, so it will be compatible with future database products, which should conform to the new SQL-3 standards. Also, no semantics are lost in the mapping from a conceptual model to a database schema. Oracle version 6 provided similar functionality by allowing constraints to be specified even though they could not be applied until the introduction of version 7.

8.1.4 Useful real-life Applications

We were able to successfully reverse-engineer a leading telecommunications database extract consisting of over 50 entities. This enabled us to test our tool on a scale greater than that of our test databases. The successful use of all or parts of our system in other research work, namely: accessing POSTGRES databases for semantic object-oriented multi-database access [ALZ96], viewing heterogeneous conceptual schemas for graphical query interfaces [MAD95], and viewing heterogeneous conceptual schemas via the world wide web (WWW) [KAR96], indicates its general usefulness and applicability. The display of conceptual models can be of use in many areas, such as database design, database integration and database migration. We can identify similar areas of use for CCVES. These include training new users by allowing them to understand an existing system, and enabling users to experiment with possible enhancements to existing systems.

8.2 Limitations and possible future Extensions

There are a number of possible extensions that could be incorporated to improve the current functionality of our system. Some of these are associated with improving run-time efficiency, accommodating a wider range of users and extending graphical user interaction
capabilities. Such extensions would not have great significance with respect to demonstrating the applicability of our fundamental ideas. Examples of improvements are: engineering the system to the level of a commercial product so that it could be used by a wide range of users with minimal training; improving the run-time efficiency of the system by producing a compiled version; testing it in a proper distributed database environment, as our test databases did not emphasise distribution; extending the graphical display options to offer other conceptual models, such as ECR; extending the system to enable us to test migrations to a proper object-oriented DBMS (i.e. not only to an extended relational DBMS with O-O features, like POSTGRES); and improving the display layout algorithm (cf. section 7.3) to manage large database schemas efficiently. The time scale for such improvements would vary from a few weeks to many months each, depending on the work involved.

Our system is designed to cope with two important extensions: extending the graphical display option to offer other forms of conceptual model, and extending the number of DBMSs and DBMS versions it can support. Of these two, the least work is involved in supporting a new graphical display. Here, the user needs to identify the notations used by the new display and write the necessary Prolog rules to generate the strings and lines used for the drawings. This process should take at most one week, as the graphical constructs such as class_info and ref_info (cf. section 7.2.4) do not change to support different display models. On the other hand, inclusion of a new relational DBMS or version can take a few months, as it affects three stages of our system: meta-data access, constraint enforcement and database migration. All three stages use the query language (SQL) of the DBMS and hence, if this deviates from standard SQL, we will need to expand our QMTS. The time required for such an extension will depend on how similar the language is to standard SQL, and may take 2-4 person weeks. Next, we need to assess the constraint handling features supported by the new DBMS so that we can use our knowledge-based tables to overcome any constraint representation limitations; this may take 1-2 person weeks. To access the meta-data of a database it is necessary to know the structures of its system tables, and we need a mechanism to access this information externally (i.e. use an ODBC driver or write our own). This stage can take 1-6 person weeks, as in many cases system documentation will be inadequate. Inclusion of a different data model would be a major extension, as it affects all stages of our system. It would require provision of new abstraction mechanisms, such as parent-child relationships for a hierarchical model and owner-member relationships for a network model.

Other possible extensions are concerned with incorporating software modules that would expand our approach. These include a forward gateway for use at the incremental migration stage; an integration module for merging related database applications; and analysers for extracting constraint-based information from legacy IS code. These are important and major areas of research, hence the development of such modules could take from many months to years in each case.
8.3 Conclusion

8.3.1 Overall Summary

This thesis has reported the results of a research investigation aimed at the design and implementation of a tool for enhancing and migrating heterogeneous legacy databases.
In the first two chapters we introduced our research and its aims and objectives. Then in chapter 3, we presented some preliminary database concepts and standards relevant to our work. In chapters 4 and 5, we introduced wider aspects of our problem and studied alternative ways that have been proposed to solve major parts of it. Many important points emerged from this study. These include: the application of meta-translation techniques to deal with legacy database system heterogeneity; the application of migration techniques to specific components of a database application (i.e. the database service) as opposed to an IS as a whole; extending the application of database migration beyond the traditional COBOL-oriented and IBM database products; the application of a migration approach to distributed database systems; enhancing previous re-engineering approaches to incorporate modern object-oriented concepts and multi-database capabilities; and introducing semantic integrity constraints into legacy database systems and hence exploring them beyond their structural semantics.

In chapter 5, we described our re-engineering approach and explained how we accomplished our goals in enhancing and preparing legacy databases for migration, while chapter 6 was concerned with testing our ideas using carefully designed test databases; chapter 6 also provided illustrative examples of our system at work. In chapter 7, we described the overall architecture and operation of the system, together with related implementation considerations, and gave some examples of our system interfaces. In chapter 8, we have carried out a detailed evaluation covering research achievements, limitations and suggestions for possible future extensions. We have also looked at some real-life areas of application in which our prototype system has been tested and/or could be used. Finally, the major conclusions that can be drawn from this research are presented below.

8.3.2 Conclusions

The important conclusions that can be drawn from the work described in this thesis are as follows:

• Although many approaches have been proposed for mapping relational schemas to a form in which their semantics can be more easily understood by users, they either lack the application of modern modelling concepts or have been applied to logically centralised or decentralised database schemas, not to physically heterogeneous databases.

• Previously proposed approaches for mapping relational schemas to conceptual models involve user interactions and pre-requisites. This is confusing for first-time users of a system, as they have no prior direct experience or knowledge of the underlying schema. We produce an initial conceptual model automatically, prior to any user interaction, to overcome this problem. Our user interaction commences only after the production of the initial conceptual model, which gives users the opportunity to gain some vital basic understanding of a system before any serious interaction with it.

• Most previous reverse-engineering tools have ignored an important source of database semantics, namely semantic integrity constraints such as foreign key definitions. One obvious reason for this is that many existing database systems do not support the representation of such semantics. We have identified and shown the important contribution that semantic integrity constraints can make, by presenting them and applying them to the conceptual and physical models.
We have also successfully incorporated them into legacy database systems which do not directly support such semantics.
• The problem of legacy IS migration has not been studied for multi-database systems in general, and this appears to present many difficulties to users. We have tested and demonstrated the use of our tools with a wide range of relational and extended relational database systems.

• The problem of legacy IS migration has not been studied for more recent and modern systems; as a result, ways of eliminating the need for migration have not yet been addressed. Our approach of enhancing legacy ISs irrespective of their DBMS type will assist in redesigning modern database applications, and hence will overcome the need to migrate such applications in many cases.

• Our evaluation has concluded that most of the goals and objectives of our system, presented in sections 1.2 and 2.4, have been successfully met or exceeded.