Assisting Migration and Evolution of Relational Legacy Databases

by

G.N. Wikramanayake

Department of Computer Science,
University of Wales Cardiff,
Cardiff

September 1996
Assisting Migration and Evolution of Relational Legacy Databases
Abstract

The research work reported here is concerned with enhancing and preparing databases with limited
DBMS capability for migration to keep up with current database technology. In particular, we have
addressed the problem of re-engineering heterogeneous relational legacy databases to assist them in
a migration process. Special attention has been paid to the case where the legacy database service
lacks the specification, representation and enforcement of integrity constraints. We have shown how
constraint knowledge reflecting modern DBMS capabilities can be incorporated into these systems to
ensure that, when migrated, they can benefit from current database technology.

To this end, we have developed a prototype conceptual constraint visualisation and enhancement
system (CCVES) to automate as efficiently as possible the process of re-engineering for a
heterogeneous distributed database environment, thereby assisting the global system user in
preparing their heterogeneous database systems for a graceful migration. Our prototype system has
been developed using a knowledge based approach to support the representation and manipulation of
structural and semantic information about schemas that the re-engineering and migration process
requires. It has a graphical user interface, including graphical visualisation of schemas with
constraints using user preferred modelling techniques for the convenience of the user. The system
has been implemented using meta-programming technology because of the proven power and
flexibility that this technology offers to this type of research application.

The important contributions resulting from our research include extending the benefits of meta-
programming technology to the very important application area of evolution and migration of
heterogeneous legacy databases. In addition, we have provided an extension to various relational
database systems to enable them to overcome their limitations in the representation of meta-data.
These extensions contribute towards the automation of the reverse-engineering process of legacy
databases, while allowing the user to analyse them using extended database modelling concepts.




CHAPTER 1

                                          Introduction

This chapter introduces the thesis. Section 1.1 is devoted to the background and motivations of the
research undertaken. Section 1.2 presents the broad goals of the research. The original achievements
which have resulted from the research are summarised in Section 1.3. Finally, the overall
organisation of the thesis is described in Section 1.4.

1.1 Background and Motivations of the Research

         Over the years rapid technological changes have taken place in all fields of computing. Most
of these changes have been due to the advances in data communications, computer hardware and
software [CAM89] which together have provided a reliable and powerful networking environment
(i.e. standard local and wide area networks) that allows the management of data stored in computing
facilities at many nodes of the network [BLI92]. These changes have transformed the hardware
technology from centralised mainframes to networked file-server and client-server architectures
[KHO92] which support various ways to use and share data. Modern computers are much more
powerful than the previous generations and perform business tasks at a much faster rate by using
their increased processing power [CAM88, CAM89]. Simultaneous developments in the software
industry have produced techniques (e.g. for system design and development) and products capable of
utilising the new hardware resources (e.g. multi-user environments with GUIs). These new
developments are being used for a wide variety of applications, including modern distributed
information processing applications, such as office automation where users can create and use
databases with forms and reports with minimal effort, compared to the development efforts using
3GLs [HIR85, WOJ94]. Such applications are being developed with the aid of database technology
[ELM94, DAT95] as this field too has advanced by allowing users to represent and manipulate
advanced forms of data and their functionalities. Due to the program-data independence feature of
DBMSs, the maintenance of database application programs has become easier, as functionalities that
were traditionally performed by procedural application routines are now supported declaratively
using database concepts such as constraints and rules.

       In the field of databases, the recent advances resulting from technological transformation
include many areas such as the use of distributed database technology [OZS91, BEL92], object-
oriented technology [ATK89, ZDO90], constraints [DAT83, GRE93], knowledge-based systems
[MYL89, GUI94], 4GLs and CASE tools [COMP90, SCH95, SHA95]. Meanwhile, the older
technology was dealing with files and primitive database systems which now appear inflexible, as
the technology itself limits them from being adapted to meet the current changing business needs
catalysed by newer technologies. The older systems, which were developed using 3GLs and have
been in operation for many years, often suffer from failures, inappropriate functionality, lack of
documentation and poor performance, and are referred to as legacy information systems [BRO93,
COMS94, IEE94, BRO95, IEEE95].

       The current technology is much more flexible as it supports methods to evolve (e.g. 4GLs,
CASE tools, GUI toolkits and reusable software libraries [HAR90, MEY94]), and can share
resources through software that allows interoperability (e.g. ODBC [RIC94, GEI95]). This evolution
reflects the changing business needs. However, modern systems need to be properly designed and
implemented to benefit from this technology, and even then it may not prevent them from being
considered legacy information systems in the near future, due to the advent of the next generation of
technology with its own special features. The only salvation would appear to be to build evolution
paths into current systems.

         The increasing power of computers and their software has meant that they have already taken
over many day-to-day functions and are taking over more of these tasks as time passes. Thus
computers are managing a larger volume of information in a more efficient manner. Over the years
most enterprises have adopted the computerisation option to enable them to efficiently perform their
business tasks and to be able to compete with their counterparts. As the performance ability of
computers has increased, the enterprises still using early computer technology face serious problems
due to the difficulties that are inherent in their legacy systems.

        This means that new enterprises using systems purely based on the latest technology have an
advantage over those which need to continue to use legacy information systems (ISs), as modern ISs
have been developed using current technology which provides not only better performance, but also
utilises the benefits of improved functionality. Hence, managers of legacy IS enterprises want to
retire their legacy code and use modern database management systems (DBMSs) in the latest
environment to gain the full benefits from this newer technology. However they want to use this
technology on the information and data they already hold as well as on data yet to be captured. They
also want to ensure that any attempts to incorporate the modern technology will not adversely affect
the ongoing functionality of their existing systems. This means legacy ISs need to be evolved and
migrated to a modern environment in such a way that the migration is transparent to the current
users. The theme of this thesis is how we can support this form of system evolution.

         1.1.1 The Barriers to Legacy Information System Migration

       Legacy ISs are usually those systems that have stood the test of time and have become a core
service component for a business’s information needs. These systems are a mix of hardware and
software, sometimes proprietary, often out of date, and built to earlier styles of design,
implementation and operation. Although they were productive and fulfilled their original
performance criteria and their requirements, these systems lack the ability to change and evolve. The
following can be seen as barriers to evolution in legacy IS [IEE94].

    • The technology used to build and maintain the legacy IS is obsolete,
    • The system is unable to reflect changes in the business world and to support new needs,
    • The system cannot integrate with other sub-systems,
    • The cost, time and risk involved in producing new alternative systems to the legacy IS.

        The risk factor is that a new system may not provide the full functionality of the current
system for a period because of teething problems. Due to these barriers, large organisations [PHI94]
prefer to write independent sub-systems to perform new tasks using modern technology which will
run alongside the existing systems, rather than attempt to achieve this by adapting existing code or
by writing a new system that replaces the old and has new facilities as well. We see the following
immediate advantages of this low risk approach.



    • The performance, reliability and functionality of the existing system are not affected,
    • New applications can take advantage of the latest technology,
    • There is no need to retrain those staff who only need the facilities of the old system.

       However with this approach, as business requirements evolve with time, more and more new
needs arise, resulting in the development and regular use of many diverse systems within the same
organisation. Hence, in the long term the above advantages are overshadowed by the more serious
disadvantages of this approach, such as:

    • The existing systems continue to exist and are legacy IS running on older and older
      technology,
    • The need to maintain many different systems to perform similar tasks increases the
      maintenance and support costs of the organisation,
    • Data becomes duplicated in different systems which implies the maintenance of redundant data
      with its associated increased risk of inconsistency between the data copies if updating occurs,
    • The overall maintenance cost for hardware, software and support personnel increases as many
      platforms are being supported,
    • The performance of the integrated information functions of the organisation decreases due to
      the need to interface many disparate systems.

        To address the above issues, legacy ISs need to be evolved and migrated to new computing
environments, when their owning organisation upgrades. This migration should occur within a
reasonable time after the upgrade occurs. This means that it is necessary to migrate legacy ISs to
new target environments in order to allow the organisation to dispose of the technology which is
becoming obsolete. Managers of some enterprises have chosen an easy way to overcome this
problem, by emulating [CAM89, PHI94] the current environment on the new platforms (e.g. AS/400
emulators for IBM S/360 and ICL’s DME emulators for 1900 and System 4 users). An alternative
strategy is achieved by translating [SHA93, PHI94, SHE94, BRO95] the software to run in new
environments (i.e. code-to-code level translation). The emulator approach perpetuates all the
software deficiencies of the legacy ISs, although it successfully removes the old-fashioned hardware
technology and so does benefit from the increased processing power of the new hardware. The translation
approach takes advantage of some of the modern technological benefits in the target environment as
the conversions - such as IBM’s JCL and ICL’s SCL code to Unix shell scripts, Assembler to
COBOL, COBOL to COBOL embedded with SQL, and COBOL data structures to relational DBMS
tables - are also done as part of the translation process. This approach, although a step forward, still
carries over most of the legacy code as legacy systems are not evolved by this process. For example,
the basic design is not changed. Hence the barrier to change and/or integration to a common sub-
system still remains, and the translated systems were not designed for the environment they are now
running in, so they may not be compatible with it.

       There are other approaches to overcoming this problem which have been used by enterprises
[SHA93, BRO95]. These include re-implementing systems under the new environment and/or
upgrading existing systems to achieve performance improvements. As computer technology
continues to evolve at an ever quicker pace, the need to migrate arises more rapidly. This means that
most small organisations and individuals are left behind and forced to work in a technologically
obsolete environment, mainly due to the high cost of frequently migrating to newer systems and/or
upgrading existing software, as this process involves time and manpower which cost money. The
gap between the older and newer system users will very soon create a barrier to information sharing
unless some tools are developed to assist the older technology users’ migration to new technology
environments. This assistance for the older technology users may take many forms, including tools
for: analysing and understanding existing systems; enhancing and modifying existing systems;
migrating legacy ISs to newer platforms. The complete migration process for a legacy IS needs to
consider these requirements and many other aspects, as recently identified by Brodie and
Stonebraker in [BRO95]. Our work was primarily motivated by these business oriented legacy
database issues and by work in the area of extending relational database technology to enable it to
represent more knowledge about its stored data [COD79, STO86a, STO86b, WIK90]. This second
consideration is an important aspect of legacy system migration, since if a graceful migration is to be
achieved we must be able to enhance a legacy relational database with such knowledge to take full
advantage of the new system environment.

         1.1.2 Heterogeneous Distributed Environments

        As well as the problem of having to use legacy ISs, most large enterprises are faced with the
problem of heterogeneity and the need for interoperability between existing ISs [IMS91]. This arises
due to the increased use of different computer systems and software tools for information processing
within an organisation as time passes. The development of networking capabilities to manage and
share information stored over a network has made interoperability a requirement, and the broad
acceptance of local area networks in business enterprises has increased the need to perform this task
within organisations. Network file servers, client-server technology and the use of distributed
databases [OZS91, BEL92, KHO92] are results of these challenging innovations. This technology is
currently being used to create and process information held in heterogeneous databases, which
involves linking different databases in an interoperable environment. An aspect of this work is
legacy database interoperation, since as time passes these databases will have been built using
different generations of software.

        In recent years, the demand for distributed database capabilities has been fuelled mostly by
the decentralisation of business functions in large organisations to address customer needs, and by
mergers and acquisitions that have taken place in the corporate world. As a consequence, there is a
strong requirement among enterprises for the ability to cross-correlate data stored in different
existing heterogeneous databases. This has led to the development of products referred to as
gateways, to enable users to link different databases together, e.g. Microsoft’s Open Database
Connectivity (ODBC) drivers can link Access, FoxPro, Btrieve, dBASE and Paradox databases
together [COL94, RIC94]. There are similar products for other database vendors, such as Oracle¹
[HOL93] and others² [PUR93, SME93, RIC94, BRO95]. Database vendors have targeted cross-
platform compatibility via SQL access protocols to support interoperability in a heterogeneous
environment. As heterogeneity in distributed systems may occur in various forms, ranging from
different hardware platforms, operating systems, networking protocols and local database systems,
cross-platform compatibility via SQL provides only a simple form of heterogeneous distributed
database access. The biggest challenge comes in addressing heterogeneity due to differences in local
databases [OZS91, BEL92]. This challenge is also addressed in the design and development of our
system.

   ¹ For IBM's DB2, UNISYS's DMS, DEC RMS.
   ² For INGRES, SYBASE, Informix and other popular SQL DBMSs.
   ³ During the life-time of this project the SQL-3 standards moved from a preliminary draft, through
several modifications, before being finalised in 1995.

        Distributed DBMSs have become increasingly popular in organisations as they offer the
ability to interconnect existing databases, as well as having many other advantages [OZS91,
BEL92]. The interconnection of existing databases leads to two types of distributed DBMS, namely:
homogeneous and heterogeneous distributed DBMSs. In homogeneous systems all of the constituent
nodes run the same DBMS and the databases can be designed in harmony with each other. This
simplifies both the processing of queries at different nodes and the passing of data between nodes. In
heterogeneous systems the situation is more complex, as each node can be running a different
DBMS and the constituent databases can be designed independently. This is the normal situation
when we are linking legacy databases, as the DBMS and the databases used are more likely to be
heterogeneous since they are usually implemented for different platforms during different
technological eras. In such a distributed database environment, heterogeneity may occur in various
forms, at different levels [OZS91, BEL92], namely:

    • The logical level (i.e. involving different database designs),
    • The data management level (i.e. involving different data models),
    • The physical level (i.e. involving different hardware, operating systems and network
      protocols), and
    • At all three or any pair of these levels.

         1.1.3 The Problems and Search for a Solution

        The concept of heterogeneity itself is valuable as it allows designers a freedom of choice
between different systems and design approaches, thus enabling them to identify those most suitable
for different applications. The exploitation of this freedom over the years in many organisations has
resulted in the creation of multiple local and remote information systems which now need to be
made interoperable to provide an efficient and effective information service to the enterprise
managers. Open database connectivity (ODBC) [RIC94, GEI95] and its standards have been proposed
to support interoperability among databases managed by different DBMSs. Database vendors such
as Oracle, INGRES, INFORMIX and Microsoft have already produced tools, engines and
connectivity products to fulfil this task [HOL93, PUR93, SME93, COL94, RIC94, BRO95]. These
products allow limited data transfer and query facilities among databases to support interoperability
among heterogeneous DBMSs. These features, although they permit easy, transparent heterogeneous
database access, still do not provide a solution for legacy ISs, where a primary concern is to evolve and
migrate the system to a target environment so that obsolete support systems can be retired.
Furthermore, the ODBC facilities are developed for current DBMSs and hence may not be capable
of accessing older generation DBMSs, and, if they are, are unlikely to be able to enhance them to
take advantage of the newer technologies. Hence there is a need to create tools that will allow ODBC
equivalent functionality for older generation DBMSs. Our work provides such functionality for all
the DBMSs we have chosen for this research. It also provides the ability to enhance and evolve
legacy databases.



In order to evolve an information system, one needs to understand the existing system’s
structure and code. Most legacy information systems are not properly documented and hence
understanding such systems is a complex process. This means that changing any legacy code
involves a high risk as it could result in unexpected system behaviour. Therefore one needs to
analyse and understand existing system code before performing any changes to the system.

        Database system design and implementation tools have appeared recently which have the
aim of helping new information system development. Reverse and re-engineering tools are also
appearing in an attempt to address issues concerned with existing databases [SHA93, SCH95]. Some
of these tools allow the examination of databases built using certain types of DBMSs; however, the
enhancements they allow are made within the limitations of those systems. Due to continuous ongoing
technology changes, most current commercial DBMSs do not support the most recent software
modelling techniques and features (e.g. Oracle version 7 does not support Object-Oriented features).
Hence a system built using current software tools is guaranteed to become a legacy system in the
near future (i.e. when new products with newer techniques and features begin to appear in the
commercial market place).

        Reverse engineering tools [SHA93] are capable of recreating the conceptual model of an
existing database and hence they are an ideal starting point when trying to gain a comprehensive
understanding of the information held in the database and its current state, as they create a visual
picture of that state. However, in legacy systems the schemas are basic, since most of the
information used to compose a conceptual model is not available in these databases. Information
such as constraints that show links between entities is usually embedded in the legacy application
code and users find it difficult to reverse engineer these legacy ISs. Our work addresses these issues
while assisting in overcoming this barrier within the knowledge representation limitations of existing
DBMSs.

         1.1.4 Primary and Secondary Motivations

        The research reported in this thesis was therefore primarily prompted by the need to provide,
for a logically heterogeneous distributed database environment, a design tool that allows users not
only to understand their existing systems but also to enhance and visualise an existing database’s
structure using new techniques that are either not yet present in existing systems or not supported by
the existing software environment. It was also motivated by:

a) Its direct applicability in the business world, as the new technique can be applied to incrementally
    enhance existing systems and prepare them to be easily migrated to new target environments,
    hence avoiding continued use of legacy information systems in the organisation.

        Although previous work and some design tools address the issue of legacy information
system analysis, evolution and migration, these are mainly concerned with 3GL languages such as
COBOL and C [COMS94, BRO95, IEEE95]. Little work has been reported which addresses the new
issues that arise due to the Object-Oriented (O-O) data model or the extended relational data model
[CAT94]. There are no reports yet of enhancing legacy systems so that they can migrate to O-O or
extended relational environments in a graceful migration from a relational system. There has been
some work in the related areas of identifying extended entity relationship structures in relational
schemas, and some attempts at reverse-engineering relational databases [MAR90, CHI94, PRE94].

b) The lack of previous research in visualising pre-existing heterogeneous database schemas and
   evolving them by enhancing them with modern concepts supported in more recent releases of
   software.

         Most design tools [COMP90, SHA93] which have been developed to assist in Entity-
Relationship (E-R) modelling [ELM94] and Object Modelling Technique (OMT) modelling
[RUM91] are used in a top-down database design approach (i.e. forward engineering) to assist in
developing new systems. However, relatively few tools attempt to support a bottom-up approach
(i.e. reverse engineering) to allow visualisation of pre-existing database schemas as E-R or OMT
diagrams. Among these tools only a very few allow enhancement of the pre-existing database
schemas, i.e. they apply forward engineering to enhance a reverse-engineered schema. Even those
which do permit this action to some extent always operate on a single database management system
and work mostly with schemas originally designed using such systems (e.g. CASE tools). The tools
that permit only the bottom-up approach are referred to as reverse-engineering tools and those which
support both (i.e. bottom-up and top-down) are called re-engineering tools [SHA93]. This thesis is
primarily concerned with creating re-engineering tools that assist legacy database migration.

        The commercially available re-engineering tools are customised for particular DBMSs and
are not easily usable in a heterogeneous environment. This barrier against widespread usability of re-
engineering tools means that a substantial adaptation and reprogramming effort (costing time and
money) is involved every time a new DBMS appears in a heterogeneous environment. An obvious
example that reflects this limitation arises in a heterogeneous distributed database environment
where there may be a need to visualise each participant database’s schema. In such an environment
if the heterogeneity occurs at the database management level (where each node uses a different
DBMS, for example, one node uses INGRES [DAT87] and another uses Oracle [ROL92]), then we
have to use two different re-engineering tools to display these schemas. This situation is exacerbated
for each additional DBMS that is incorporated into the given heterogeneous context. Also, legacy
databases are migrated to different DBMS environments as newer versions and better database
products have appeared since the original release of their DBMS. This means that a re-engineering
tool that assists legacy database migration must work in a heterogeneous environment so that its
use will not be restricted to particular types of ISs.

        Existing re-engineering tools provide a single target graphical data model (usually the E-R
model or a variant of it), which may differ in presentation style between tools and therefore inhibits
the uniformity of visualisation that is highly desirable in an interoperable heterogeneous distributed
database environment. This limitation means that users may need to use different tools to provide the
required uniformity of display in such an environment. The ability to visualise the conceptual model
of an information system using a user-preferred graphical data model is important as it ensures that
no inaccurate enhancements are made to the system due to any misinterpretation of graphical
notations used.

c) The need to apply rules and constraints to pre-existing databases to identify and clean inconsistent
    legacy data, as preparation for migration or as an enhancement of the database’s quality.



The inability to define and apply rules and constraints in early database systems, owing to
system limitations, meant that these systems did not use constraints to increase the accuracy and
consistency of the data they held. This limitation is now a barrier to information system migration as a
new target DBMS is unable to enforce constraints on a migrated database until all violations are
investigated and resolved either by omitting the violating data or by cleaning it. This investigation
may also show that a constraint has to be adjusted as the violating data is needed by the organisation.
The enhancement of such a system by rules and constraints provides knowledge that is usable to
determine possible data violations. The process of detecting constraint violations may be done by
applying queries that are generated from these enhanced constraints. Similar methods have been
used to implement integrity constraints [STO75], optimise queries [OZS91] and obtain intensional
answers [FON92, MOT89]. Such checking is essential because constraints may have been implemented
at the application coding level, which can lead to their inconsistent application.
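
As a concrete illustration, the sketch below shows, in the Prolog style of our meta-programming
approach, how a violation-detection query might be generated from an enhanced constraint. The
predicate, table and column names (foreign_key/4, violation_query/5, enrolment, student) are
hypothetical and are not the actual CCVES code.

    % Hypothetical foreign key recorded as enhanced constraint knowledge:
    % foreign_key(ChildTable, ChildColumn, ParentTable, ParentColumn).
    foreign_key(enrolment, student_id, student, student_id).

    % Build the text of an SQL query that retrieves the child-table rows
    % violating the referential constraint (SWI-Prolog syntax assumed).
    violation_query(Child, ChildCol, Parent, ParentCol, SQL) :-
        foreign_key(Child, ChildCol, Parent, ParentCol),
        format(atom(SQL),
               'SELECT * FROM ~w c WHERE c.~w IS NOT NULL AND NOT EXISTS ( SELECT 1 FROM ~w p WHERE p.~w = c.~w )',
               [Child, ChildCol, Parent, ParentCol, ChildCol]).

Posing the goal violation_query(enrolment, student_id, student, student_id, SQL) binds SQL to a
query that can be run against the legacy database to list the offending rows.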

d) An awareness of the potential contribution that knowledge-based systems and meta-programming
   technologies, in association with extended relational database technology, have to offer in coping
   with semantic heterogeneity.

        The successful production of a conceptual model is highly dependent on the semantic
information available, and on the ability to reason about these semantics. A knowledge-based system
can be used to assist in this task, as the effective exploitation of semantic information in
pre-existing heterogeneous databases involves three sub-processes, namely:
knowledge acquisition, representation and manipulation. The knowledge acquisition process extracts
the existing knowledge from a database’s data dictionaries. This knowledge may include subsequent
enhancements made by the user, as the use of a database to store such knowledge will provide easy
access to this information along with its original knowledge. The knowledge representation process
represents existing and enhanced knowledge. The knowledge manipulation process is concerned
with deriving new knowledge and ensuring consistency of existing knowledge. These stages are
addressable using specific processes. For instance, the reverse-engineering process used to produce a
conceptual model can be used to perform the knowledge acquisition task. Then the derived and
enhanced knowledge can be stored in the same database by adopting a process that will allow us to
distinguish this knowledge from its original meta-data. Finally, knowledge manipulation can be done
with the assistance of a Prolog based system [GRA88], while data and knowledge consistency can be
verified using the query language of the database.
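
As an illustration of these three sub-processes, the sketch below holds acquired meta-data and
user-supplied enhancements as Prolog facts tagged with their origin, and derives new knowledge from
them by a rule; the predicate, relation and attribute names are hypothetical, not the actual CCVES
representation.

    % Knowledge acquisition: facts extracted from the data dictionary.
    attribute(student,   student_id, integer, original).
    attribute(enrolment, student_id, integer, original).

    % Knowledge representation: enhancements supplied later by the user are
    % kept distinguishable from the original meta-data by their origin tag.
    primary_key(student, student_id, enhanced).
    foreign_key(enrolment, student_id, student, student_id, enhanced).

    % Knowledge manipulation: derive a relationship between two entity types
    % wherever a foreign key links them, whatever its origin.
    relationship(Child, Parent) :-
        foreign_key(Child, _ChildCol, Parent, _ParentCol, _Origin).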

1.2 Goals of the Research

         The broad goals of the research reported in this thesis are highlighted here, with detailed aims
and objectives presented in section 2.4. These goals are to investigate interoperability problems,
schema enhancement and migration in a heterogeneous distributed database environment, with
particular emphasis on extended relational systems. This should provide a basis for the design and
implementation of a prototype software system that brings together new techniques from the areas of
knowledge-based systems, meta-programming and O-O conceptual data modelling with the aim of
facilitating schema enhancement, by means of generalising the efficient representation of constraints
using the current standards. Such a system is a tool that would be a valuable asset in a logically
heterogeneous distributed extended relational database environment as it would make it possible for



Page 10
global users to incrementally enhance legacy information systems. This offers the potential for users
in this type of environment to work in terms of such a global schema, through which they can
prepare their legacy systems to easily migrate to target environments and so gain the benefits of
modern computer technology.

1.3 Original Achievements of the Research

        The importance of this research lies in establishing the feasibility of enhancing, cleaning and
migrating heterogeneous legacy databases using meta-programming technology, knowledge-based
system technology, database system technology and O-O conceptual data modelling concepts, to
create a comprehensive set of techniques and methods that form an efficient and useful generalised
database re-engineering tool for heterogeneous sets of databases. The benefits such a tool can bring
are also demonstrated and assessed.

       A prototype Conceptual Constraint Visualisation and Enhancement System (CCVES)
[WIK95a] has been developed as a result of the research. To be more specific, our work has made
four important contributions to progress in the database topic area of Computer Science:

1) CCVES is the first system to bring the benefits of meta-programming technology to the very
   important application area of enhancing and evolving heterogeneous distributed legacy databases
   to assist the legacy database migration process [GRA94, WIK95c].

2) CCVES is also the first system to enhance existing databases with constraints to improve their
   visual presentation and hence provide a better understanding of existing applications [WIK95b].
   This process is applicable to any relational database application, including those which are unable
   to naturally support the specification and enforcement of constraints. More importantly, this
   process does not affect the performance of an existing application.

3) As will be seen later, we have chosen the current SQL-3 standards [ISO94] as the basis for
   knowledge representation in our research. This project provides an extension to the
   representation of the relational data model to cope with automated reuse of knowledge in the re-
   engineering process. In order to cope with technological changes that result from the emergence
   of new systems or new versions of existing DBMSs, we also propose a series of extended
   relational system tables conforming to SQL-3 standards to enhance existing relational DBMSs
   [WIK95b].

4) The generation of queries using the constraint specifications of the enhanced legacy systems is an
   easy and convenient method of detecting any constraint violating data in existing systems. The
   application of this technique in the context of a heterogeneous environment for legacy
   information systems is a significant step towards detecting and cleaning inconsistent data in
   legacy systems prior to their migration. This is essential if a graceful migration is to be effected
   [WIK95c].

1.4 Organisation of the Thesis




The thesis is organised into 8 chapters. This first chapter has given an introduction to the
research done, covering background and motivations, and outlining original achievements. The rest
of the thesis is organised as follows:

Chapter 2 is devoted to presenting an overview of the research together with detailed aims and
objectives for the work undertaken. It begins by identifying the scope of the work in terms of
research constraints and development technologies. This is followed by an overview of the research
undertaken, where a step by step discussion of the approach adopted and its role in a heterogeneous
distributed database environment is given. Finally, detailed aims and objectives are drawn together
to conclude the chapter.

Chapter 3 identifies the relational data model as the current dominant database model and presents
its development along with its terminology, features and query languages. This is followed by a
discussion of conceptual data models with special emphasis on the data models and symbols used in
our project. Finally, we pay attention to key concepts related to our project, mainly the notion of
semantic integrity constraints and extensions to the relational model. Here, we present important
integrity constraint extensions to the relational model and its support using different SQL standards.

Chapter 4 addresses the issue of legacy information system migration. The discussion commences
with an introduction to legacy and our target information systems. This is followed by migration
strategies and methods for such ISs. Finally, we conclude by referring to current techniques and
identify the trends and existing tools applicable to database migration.

Chapter 5 addresses the re-engineering process for relational databases. Techniques currently used
for this purpose are identified first. Our approach, which uses constraints to re-engineer a relational
legacy database is described next. This is followed by a process for detecting possible keys and
structures of legacy databases. Our schema enhancement and knowledge representation techniques
are then introduced. Finally, we present a process to detect and resolve conflicts that may occur due
to schema enhancement.

Chapter 6 introduces some example test databases which were chosen to represent a legacy
heterogeneous distributed database environment and its access processes. Initially, we present the
design of our test databases, the selection of our test DBMSs and the prototype system environment.
This is followed by the application of our re-engineering approach to our test databases. Finally, the
organisation of relational meta-data and its access is described using our test DBMSs.

Chapter 7 presents the internal and external architecture and operation of our conceptual constraint
visualisation and enhancement system (CCVES) in terms of the design, structure and operation of its
interfaces, and its intermediate modelling system. The internal schema mappings, e.g. mapping from
INGRES QUEL to SQL and vice-versa, and internal database migration processes are presented in
detail here.

Chapter 8 provides an evaluation of CCVES, identifying its limitations and improvements that could
be made to the system. A discussion of potential applications is presented. Finally we conclude the
chapter by drawing conclusions about the research project as a whole.




CHAPTER 2

                   Research Scope, Approach, Aims and Objectives

This chapter describes, in some detail, the aims and objectives of the research that has been
undertaken. Firstly, the boundaries of the research are defined in section 2.1, which considers the
scope of the project. Secondly, an overview of the research approach we have adopted in dealing
with heterogeneous distributed legacy database evolution and migration is given in section 2.2.
Next, in section 2.3, the discussion is extended to the wider aspects of applying our approach in a
heterogeneous distributed database environment using the existing meta-programming technology
developed at Cardiff in other projects. Finally, the research aims and objectives are detailed in
section 2.4, illustrating what we intend to achieve, and the benefits expected from achieving the
stated aims.

2.1 Scope of the Project

       We identify the scope of the work in terms of research constraints and the limitations of
current development technologies. An overview of the problem is presented along with the
drawbacks and limitations of database software development technology in addressing the
problem. This will assist in identifying our interests and focussing the issues to be addressed.

       2.1.1 Overview of the Problem

        In most database designs, a conceptual design and modelling technique is used in
developing the specifications at the user requirements and analysis stage of the design. This stage
usually describes the real world in terms of object/entity types that are related to one another in
various ways [BAT92, ELM94]. Such a technique is also used in reverse-engineering to portray
the current information content of existing databases, as the original designs are usually either
lost, or inappropriate because the database has evolved from its original design. The resulting
pictorial representation of a database can be used for database maintenance, for database re-
design, for database enhancement, for database integration or for database migration, as it gives its
users a sound understanding of an existing database’s architecture and contents.

        Only a few current database tools [COMP90, BAT92, SHA93, SCH95] allow the capture
and presentation of database definitions from an existing database, and the analysis and display of
this information at a higher level of abstraction. Furthermore, these tools are either restricted to
accessing a specific database management system’s databases or permit modelling with only a
single given display formalism, usually a variant of the EER [COMP90]. Consequently there is a
need to cater for multiple database platforms with different user needs to allow access to a set of
databases comprising a heterogeneous database, by providing a facility to visualise databases
using a preferred conceptual modelling technique which is familiar to the different user
communities of the heterogeneous system.

        The fundamental modelling constructs of current reverse and re-engineering tools are
entities, relationships and associated attributes. These constructs are useful for database design at
a high level of abstraction. However, the semantic information now available in the form of rules
and constraints in modern DBMSs provides their users with a better understanding of the
underlying database as its data conforms to these constraints. This may not necessarily be true for
legacy systems, which may have constraints defined that were not enforced. The ability to
visualise rules and constraints as part of the conceptual model increases user understanding of a
database. Users could also exploit this information to formulate queries that more effectively
utilise the information held in a database. Having these features in mind, we concentrated on
providing a tool that permits specification and visualisation of constraints as part of the graphical
display of the conceptual model of a database. With modern technology increasing the number of
legacy systems and with increasing awareness of the need to use legacy data [BRO95, IEEE95],
the availability of such a visualisation tool will be more important in future as it will let users see
the full definition of the contents of their databases in a familiar format.

         Three types of abstraction mechanism, namely: classification, aggregation and
generalisation, are used in conceptual design [ELM94]. However, most existing DBMSs do not
maintain sufficient meta-data information to assist in identifying all these abstraction mechanisms
within their data models. This means that reverse and re-engineering tools are semi-automated, in
that they extract information, but users have to guide them and decide what information to look
for [WAT94]. This requires interactions with the database designer in order to obtain missing
information and to resolve possible conflicts. Such additional information is supplied by the tool
users when performing the reverse-engineering process. As this additional information is not
retained in the database, it must be re-entered every time a reverse engineering process is
undertaken if the full representation is to be achieved. To overcome this problem, knowledge
bases are being used to retain this information when it is supplied. However, this approach
restricts the use of this knowledge by other tools which may exist in the database’s environment.
The ability to hold this knowledge in the database itself would enhance an existing database with
information that can be widely used. This would be particularly useful in the context of legacy
databases as it would enrich their semantics. One of the issues considered in this thesis is how this
can be achieved.
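
One possible realisation, sketched below purely as an illustration (the table name x_constraints and
the predicate enhancement_table_ddl/1 are hypothetical), is to generate the DDL for an auxiliary
user-level table in which such enhancement knowledge is stored inside the database itself, where any
tool able to query the database can reach it.

    % Emit the DDL for a hypothetical auxiliary table holding user-supplied
    % enhancement knowledge, e.g. key and inheritance definitions
    % (SWI-Prolog syntax assumed).
    enhancement_table_ddl(DDL) :-
        atomic_list_concat(
            ['CREATE TABLE x_constraints (',
             ' table_name CHAR(32) NOT NULL,',
             ' constr_name CHAR(32) NOT NULL,',
             ' constr_type CHAR(12) NOT NULL,',
             ' definition CHAR(240) )'],
            DDL).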

        Most existing relational database applications record only entities and their properties (i.e.
attribute names and data types) as system meta-data. This is because these systems conformed to
early database standards (e.g. the SQL/86 standard [ANSI86], supported by INGRES version 5
and Oracle version 5). However, more recent relational systems record additional information
such as constraint and rule definitions, as they conform to the SQL/92 standards [ANSI92] (e.g.
Oracle version 7). This additional information includes, for example, primary and foreign key
specifications, and can be used to identify classification and aggregation abstractions used in a
conceptual model [CHI94, PRE94, WIK95b]. However, the SQL/92 standard does not capture the
full range of modelling abstractions, e.g. inheritance representing generalisation hierarchies. This
means that early relational database applications are now legacy systems as they fail to naturally
represent additional information such as constraint and rule definitions. Such legacy database
systems are being migrated to modern database systems not only to gain the benefits of the current
technology but also to be compatible with new applications built with the modern technology. The
SQL standards are currently subject to review to permit the representation of extra knowledge
(e.g. object-oriented features), and we have anticipated some of these proposals in our work - i.e.
SQL-3³ [ISO94] will be adopted by commercial systems and thus the current modern DBMSs
will become legacy databases in the near future or may already be considered to be legacy
databases in that their data model type will have to be mapped onto the newer version. Having
experienced the development process of recent DBMSs it is inevitable that most current databases
will have to be migrated, either to a newer version of the existing DBMS or to a completely
different newer technology DBMS for a variety of reasons. Thus the migration of legacy
databases is perceived to be a continuing requirement, in any organisation, as technology
advances continue to be made.

        Most migrations currently being undertaken are based on code-to-code level translations of
the applications and associated databases to enable the older system to be functional in the target
environment. Minimal structural changes are made to the original system and database, thus the
design structures of these systems are still old-fashioned, although they are running in a modern
computing environment. This means that such systems are inflexible and cannot be easily
enhanced with new functions or integrated with other applications in their new environment. We
have also observed that more recent database systems have often failed to benefit from modern
database technology due to inherent design faults that have resulted in the use of unnormalised
structures, which lead to integrity constraint enforcement features being omitted even when they are
available. The ability to create and use databases without the benefit of a database design course is
one reason for such design faults. Hence there is a need to assist existing systems to be evolved,
not only to perform new tasks but also to improve their structure so that these systems can
maximise the gains they receive from their current technology environment and any environment
they migrate to in the future.


       2.1.2 Narrowing Down the Problem

        Technological advances in both hardware and software have improved the performance
and maintenance functionality of all information systems (ISs), and as a result, older ISs suffer
from comparatively poor performance and inappropriate functionality when compared with more
modern systems. Most of these legacy systems are written in a 3GL such as COBOL, have been
around for many years, and run on old-fashioned mainframes. Problems associated with legacy
systems are being identified and various solutions are being developed [BRO93, SHE94, BRO95].
These systems basically have three functional components, namely: interface, application and a
database service, which are sometimes inter-related to each other, depending on how they were
used during the design and implementation stages of the IS development. This means that the
complexity of a legacy IS depends on what occurred during the design and implementation of the
system. These systems may range from a simple single user database application using separate
interfaces and applications, to a complex multi-purpose unstructured application. Due to the
complex nature of the problem area we do not address this issue as a whole, but focus only on
problems associated with one sub-component of such legacy information systems, namely the
database service. This in itself is a wide field, and we have further restricted ourselves to legacy
ISs using a specific DBMS for their database service. We considered data models ranging from
original flat file and relational systems, to modern relational DBMSs and object-oriented DBMSs.
From these data models we have chosen the traditional relational model for the following reasons.

      • The relational model is currently the most widely used database model.
      • During the last two decades the relational model has been the most popular model;
      therefore it has been used to develop many database applications and most of these are now
      legacy systems.
      • There have been many extensions and variations of the relational model, which has
      resulted in many heterogeneous relational database systems being used in organisations.
      • The relational model can be enhanced to represent additional semantics currently
      supported only by modern DBMSs (e.g. extended relational systems [ZDO90, CAT94]).

        As most business requirements change with time, the need to enhance and migrate legacy
information systems exists for almost every organisation. We address problems faced by these
users while seeking a solution that prevents new systems from becoming legacy systems in the near
future. The selection of the relational model as our database service to demonstrate how one could
achieve these needs means that we shall be addressing only relational legacy database systems and
not looking at any other type of legacy information systems.

         This decision means we are not considering the many common legacy IS migration
problems identified by Brodie [BRO95] (e.g. migration of legacy database services such as flat-
file structures or hierarchical databases into modern extended relational databases; migration of
legacy applications with millions of lines of code written in some COBOL-like language into a
modern 4GL/GUI environment). However, as shown later, addressing the problems associated
with relational legacy databases has enabled us to identify and solve problems associated with
more recent DBMSs, and it also assists in identifying precautions which if implemented by
designers of new systems will minimise the chance of similar problems being faced by these
systems as IS developments occur in the future.

2.2 Overview of the Research Approach

        Having presented an overview of our problem and narrowed down its scope, we identify the
following as the main functionalities that should be provided to fulfil our research goal:

     • Reverse-engineering of a relational legacy database to fully portray its current information
     content.
     • Enhancing a legacy database with new knowledge to identify modelling concepts that
     should be available to the database concerned or to applications using that database.
     • Determining the extent to which the legacy database conforms to its existing and enhanced
     descriptions.
     • Ensuring that the migrated IS will not become a legacy IS in the future.

        We need to consider the heterogeneity issue in order to be able to reverse-engineer any
given relational legacy database. Three levels of heterogeneity are present for a particular data
model, namely: at a physical, logical and data management level. The physical level of
heterogeneity usually arises due to different data model implementation techniques, use of
different computer platforms and use of different DBMSs. The physical / logical data
independence of DBMSs hides implementation differences from users, hence we need only
address how to access databases that are built using different DBMSs, running on different
computer platforms.


Differences in DBMS characteristics lead to heterogeneity at the logical level. Here, the
different DBMSs conform to a particular standard (e.g. SQL/86 or SQL/92), which supports a
particular database query language (e.g. SQL or QUEL) and different relational data model
features (e.g. handling of integrity constraints and availability of object-oriented features). To
tackle heterogeneity at the logical level, we need to be aware of different standards, and to model
ISs supporting different features and query languages.

        Heterogeneity at the data management level arises due to the physical limitations of a
DBMS, differences in the logical design and inconsistencies that occurred when populating the
database. Logical differences in different database schemas have to be resolved only if we are
going to integrate them. The schema integration process is concerned with merging different
related database applications. Such a facility can assist the migration of heterogeneous database
systems. However any attempt to integrate legacy database schemas prior to the migration process
complicates the entire process as it is similar to attempting to provide new functionalities within
the system which is being migrated. Such attempts increase the chance of failure of the overall
migration process. Hence we consider any integration or enhancements in the form of new
functionalities only after successfully migrating the original legacy IS. However, the physical
limitations of a DBMS and data inconsistencies in the database need to be addressed beforehand
to ensure a successful migration.

       Our work addresses the heterogeneity issues associated with database migration by
adopting an approach that allows users to incrementally increase the number of DBMSs the system
can handle without having to reprogram its main application modules. Here, the user needs to
supply specific knowledge about DBMS schema and query language constructs. This is held
together with the knowledge of the DBMSs already supported and has no effect on the
application’s main processing modules.

       2.2.1 Meta-Programming

        Meta-programming technology allows the meta-data (schema information) of a database to
be held and processed independently of its source specification language. This allows us to work
on a database language independent environment and hence overcome many logical heterogeneity
issues. Prolog based meta-programming technology has been used in previous research at Cardiff
in the area of logical heterogeneity [FID92, QUT94]. Using this technology the meta-translation
of database query languages [HOW87] and database schemas [RAM91] has been performed. This
work has shown how the heterogeneity issues of different DBMSs can be addressed without
having to reprogram the same functionality for each and every DBMS. We use meta-programming
technology for our legacy database migration approach as we need to be able to start with a legacy
source database and end with a modern target database where the respective database schema and
query languages may be different from each other. In this approach the source database schema or
query language is mapped on input into an internal canonical form. All the required processing is
then done using the information held in this internal form. This information is finally mapped to
the target schema or query language to produce the desired output. The advantage of this approach
is that processing is not affected by heterogeneity as it is always performed on data held in the
canonical form. This canonical form is an enriched collection of semantic data modelling features.
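
A minimal sketch of this pipeline is given below, assuming hypothetical predicate names
(to_canonical/2, from_canonical/2) and a much-simplified canonical form; it illustrates the
meta-translation idea rather than the actual SMTS code.

    % Canonical internal form: relation(Name, [attr(AttrName, Type), ...]),
    % independent of any particular source or target DBMS.

    % Inbound mapping: an INGRES-style relation definition to canonical form.
    to_canonical(ingres_relation(Name, Attrs), relation(Name, Attrs)).

    % Outbound mapping: canonical form to an SQL CREATE TABLE statement
    % (SWI-Prolog syntax assumed).
    from_canonical(relation(Name, Attrs), SQL) :-
        findall(Col,
                ( member(attr(AttrName, Type), Attrs),
                  format(atom(Col), '~w ~w', [AttrName, Type]) ),
                Cols),
        atomic_list_concat(Cols, ', ', ColList),
        format(atom(SQL), 'CREATE TABLE ~w (~w)', [Name, ColList]).

All internal processing then operates on the relation/2 terms, so supporting a further DBMS only
requires new inbound and outbound mapping clauses rather than changes to the processing modules.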


2.2.2 Application

        We view our migration approach as consisting of a series of stages, with the final stage
being the actual migration and earlier stages being preparatory. At stage 1, the data definition of
the selected database is reverse-engineered to produce a graphical display (cf. paths A-1 and A-2
of figure 2.1). However, in legacy systems much of the information needed to present the database
schema in this way is not available as part of the database meta-data and hence these links which
are present in the database cannot be shown in this conceptual model. In modern systems such
links can be identified using constraint specifications. Thus, if the database does not have any
explicit constraints, or it does but these are incomplete, new knowledge about the database needs
to be entered at stage 2 (cf. path B-1 of figure 2.1), which will then be reflected in the enhanced
schema appearing in the graphical display (cf. path B-2 of figure 2.1). This enhancement will
identify new links that should be present for the database concerned. These new database
constraints can next be applied experimentally to the legacy database to determine the extent to
which it conforms to them. This process is done at stage 3 (cf. paths C-1 and C-2 of figure 2.1).
The user can then decide whether these constraints should be enforced to improve the quality of
the legacy database prior to its migration. At this point the three preparatory stages in the
application of our approach are complete. The actual migration process is then performed. All
stages are further described below to enable us to identify the main processing components of our
proposed system as well as to explain how we deal with different levels of heterogeneity.

       Stage 1: Reverse Engineering

        In stage 1, the data definition of the selected database is reverse-engineered to produce a
graphical display of the database. To perform this task, the database’s meta-data must be extracted
(cf. path A-1 of figure 2.1). This is achieved by connecting directly to the heterogeneous database.
The accessed meta-data needs to be represented using our internal form. This is achieved through
a schema mapping process as used in the SMTS (Schema Meta-Translation System) of Ramfos
[RAM91]. The meta-data in our internal formalism then needs to be processed to derive the
graphical constructs present for the database concerned (cf. path A-2 of figure 2.1). These
constructs are in the form of entity types and the relationships and their derivation process is the
main processing component in stage 1. The identified graphical constructs are mapped to a display
description language to produce a graphical display of the database.




[Figure 2.1 is a diagram. It shows the Heterogeneous Databases at the bottom, an Internal
Processing component in the middle, and three upper components: Enhanced Constraints, Schema
Visualisation (EER or OMT) with Constraints, and Enforced Constraints. The labelled information
flows are: A-1 and A-2 for Stage 1 (Reverse Engineering); B-1, B-2 and B-3 for Stage 2
(Knowledge Augmentation); and C-1 and C-2 for Stage 3 (Constraint Enforcement).]

                   Figure 2.1: Information flow in the 3 stages of our approach prior to migration


       a) Database connectivity for heterogeneous database access

        Unlike the previous Cardiff meta-translation systems [HOW87, RAM91, QUT92], which
addressed heterogeneity at the logical and data management levels, our system looks at the
physical level as well. While these previous systems processed schemas in textual form and did
not access actual databases to extract their DDL specification, our system addresses physical
heterogeneity by accessing databases running on different hardware / software platforms (e.g.
computer systems, operating systems, DBMSs and network protocols). Our aim is to directly
access the meta-data of a given database application by specifying its name, the name and version
of the host DBMS, and the address of the host machine4. If this database access process can
produce a description of the database in DDL formalism, then this textual file is used as the
starting point for the meta-translation process as in previous Cardiff systems [RAM91, QUT92].
We found that it is not essential to produce such a textual file, as the required intermediate
representation can be directly produced by the database access process. This means that we could
also by-pass the meta-translation process that performs the analysis of the DDL text to translate it
into the intermediate representation5. However the DDL formalism of the schema can be used for
optional textual viewing and could also serve as the starting point for other tools6 developed at
Cardiff for meta-programming database applications.

       The initial functionality of the Stage 1 database connectivity process is to access a
heterogeneous database and supply the accessed meta-data as input to our schema meta-translator

   4
     We assume that access privileges for this host machine and DBMS have been granted.
   5
     A list of tokens ready for syntactic analysis in the parsing phase is produced and processed
based on the BNF syntax specification of the DDL [QUT92].
   6
     e.g. The Schema Meta-Integration System (SMIS) of Qutaishat [QUT92].


(SMTS). This module needs to deal with heterogeneity at the physical and data management
levels. We achieve this by using DML commands of the specific DBMS to extract the required
meta-data held in the database's data dictionary, which is queried like a set of user-defined tables.
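
       As an illustration, if the host DBMS were Oracle, the DBC process might issue catalog
queries of the following kind. This is only a sketch: the catalog view and column names shown are
Oracle-specific assumptions, and each supported DBMS needs its own equivalent queries.

    -- Column definitions of a given application table (names are illustrative).
    SELECT table_name, column_name, data_type, nullable
    FROM   user_tab_columns
    WHERE  table_name = 'EMPLOYEE';

    -- Any declared constraints, where the host DBMS records them at all.
    SELECT c.constraint_name, c.constraint_type, cc.column_name
    FROM   user_constraints c, user_cons_columns cc
    WHERE  c.constraint_name = cc.constraint_name
    AND    c.table_name      = 'EMPLOYEE';

The rows returned by such queries constitute the meta-data that is then supplied to the schema
meta-translator.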

       Relatively recently, the functionalities of a heterogeneous database access process have
been provided by means of drivers such as ODBC [RIC94]. Use of such drivers will allow access
to any database supported by them and hence obviate the need to develop specialised tools for
each database type as happened in our case. These driver products were not available when we
undertook this stage of our work.

       b) Schema meta-translation

        The schema meta-translation process [RAM91] accepts input of any database schema
irrespective of its DDL and features. The information captured during this process is represented
internally to enable it to be mapped from one database schema to another or to further process and
supply information to other modules such as the schema meta-visualisation system (SMVS)
[QUT93] and the query meta-translation system (QMTS) [HOW87]. Thus, the use of an internal
canonical form for meta representation has successfully accommodated heterogeneity at the data
management and logical levels.

       c) Schema meta-visualisation

        Schema visualisation using graphical notation and diagrams has proved to be an important
step in a number of applications, e.g. during the initial stages of the database design process; for
database maintenance; for database re-design; for database enhancement; for database integration;
or for database migration; as it gives users a sound understanding of an existing database’s
structure in an easily assimilated format [BAT92, ELM94]. Database users need to see a visual
picture of their database structure instead of textual descriptions of the defining schema as it is
easier for them to comprehend a picture. This has led to the production of graphical
representations of schema information, effected by a reverse engineering process. Graphical data
models of schemas employ a set of data modelling concepts and a language-independent graphical
notation (e.g. the Entity Relationship (E-R) model [CHE76], Extended/Enhanced Entity
Relationship (EER) model [ELM94] or the Object Modelling Technique (OMT) [RUM91]). In a
heterogeneous environment different users may prefer different graphical models, and an
understanding of the database structure and architecture beyond that given by the traditional
entities and their properties. Therefore, there is a need to produce graphical models of a database’s
schema using different graphical notations such as either E-R/EER or OMT, and to accompany
them with additional information such as a display of the integrity constraints in force in the
database [WIK95b]. The display of integrity constraints allows users to look at intra- and inter-
object constraints and gain a better understanding of domain restrictions applicable to particular
entities. Current reverse engineering tools do not support this type of display.

        The generated graphical constructs are held internally in a similar form to the meta-data of
the database schema. Hence using a schema meta visualisation process (SMVS) it is possible to
map the internally held graphical constructs into appropriate graphical symbols and coordinates
for the graphical display of the schema. This approach has a similarity to the SMTS, the main
difference being that the output is graphical rather than textual.

       Stage 2: Knowledge Augmentation

        In a heterogeneous distributed database environment, evolution is expected, especially in
legacy databases. This evolution can affect the schema description and in particular schema
constraints that are not reflected in the stage 1 (path A-2) graphical display as they may be
implicit in applications. Thus our system is designed to accept new constraint specifications (cf.
path B-1 of figure 2.1) and add them to the graphical display (cf. path B-2 of figure 2.1) so that
these hidden constraints become explicit.

        The new knowledge accepted at this point is used to enhance the schema and is retained in
the database using a database augmentation process (cf. path B-3 of figure 2.1). The new
information is stored in a form that conforms with the enhanced target DBMS’s methods of
storing such information. This assists the subsequent migration stage.

       a) Schema enhancement

        Our system needs to permit a database schema to be enhanced by specifying new
constraints applicable to the database. This process is performed via the graphical display. These
constraints, which are in the form of integrity constraints (e.g. primary key, foreign key, check
constraints) and structural components (e.g. inheritance hierarchies, entity modifications) are
specified using a GUI. When they are entered they will appear in the graphical display.
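
       To illustrate the kind of enhancement involved, the constraints captured through the GUI
correspond to declarations such as the following SQL statements. This is a sketch only: all table
and column names (Employee, EmpNo, WorksFor, Department, DeptNo) are illustrative, and in our
system these declarations are entered graphically rather than typed as DDL.

    -- Primary key, foreign key and check constraints on an illustrative Employee table.
    ALTER TABLE Employee
      ADD CONSTRAINT pk_employee PRIMARY KEY (EmpNo);

    ALTER TABLE Employee
      ADD CONSTRAINT fk_worksfor FOREIGN KEY (WorksFor)
          REFERENCES Department (DeptNo);          -- DeptNo is an assumed key column

    ALTER TABLE Employee
      ADD CONSTRAINT chk_empno CHECK (EmpNo LIKE 'E%');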

       b) Database augmentation

        The input data to enhance a schema provides new knowledge about a database. It is
essential to retain this knowledge within the database itself, if it is to be readily available for any
further processing. Typically, this information is retained in the knowledge base of the tool used
to capture the input data, so that it can be reused by the same tool. This approach restricts the use
of this knowledge by other tools and hence it must be re-entered every time the re-engineering
process is applied to that database. This makes it harder for the user to gain a consistent
understanding of an application, as different constraints may be specified during two separate re-
engineering processes. To overcome this problem, we augment the database itself using the
techniques proposed in SQL-3 [ISO94], wherever possible. When it is not possible to use SQL-3
structures we store the information in our own augmented table format which is a natural
extension of the SQL-3 approach.
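
       A minimal sketch of one such augmented table is given below. The table name
ccves_constraints and its columns are hypothetical and serve only to illustrate the idea of
recording constraint definitions alongside ordinary user-defined tables.

    -- Hypothetical augmented table holding constraint definitions that the host
    -- DBMS cannot record in its own system tables.
    CREATE TABLE ccves_constraints (
        table_name      CHAR(32)  NOT NULL,   -- table the constraint applies to
        constraint_name CHAR(32)  NOT NULL,   -- identifier given by the user
        constraint_type CHAR(1)   NOT NULL,   -- 'P' primary key, 'R' foreign key, 'C' check
        definition      CHAR(240)             -- the constraint as a logical expression
    );

    -- Example entry for a check constraint captured at stage 2.
    INSERT INTO ccves_constraints
    VALUES ('EMPLOYEE', 'CHK_EMPNO', 'C', 'EmpNo LIKE ''E%''');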

        When a database is augmented using this method, the new knowledge is available in the
database itself. Hence, any further re-engineering processes need not make requests for the same
additional knowledge. The augmented tables are created and maintained in a similar way to user-
defined tables, but have a special identification to distinguish them. Their structure is in line with
the international standards and the newer versions of commercial DBMSs, so that the enhanced
database can be easily migrated to either a newer version of the host DBMS or to a different
DBMS supporting the latest SQL standards. Migration should then mean that the newer system
can enforce the constraints. Our approach should also mean that it is easy to map our tables for
holding this information into the representation used by the target DBMS even if it is different, as
we are mapping from a well defined structure.

       Legacy databases that do not support explicit constraints can be enhanced by using the
above knowledge augmentation method. This requirement is less likely to occur for databases
managed by more recent DBMSs as they already hold some constraint specification information
in their system tables. The direction taken by Oracle version 6 was a step towards our
augmentation approach, as it allowed the database administrator to specify integrity constraints
such as primary and foreign keys, but did not yet enforce them [ROL92]. The next release of
Oracle, i.e. version 7, implemented this constraint enforcement process.


       Stage 3: Constraint Enforcement

        The enhanced schema can be held in the database, but the DBMS can only enforce these
constraints if it has the capability to do so. This will not normally be the case in legacy systems. In
this situation, the new constraints may be enforced via a newer version of the DBMS or by
migrating the database to another DBMS supporting constraint enforcement. However, the data
being held in the database may not conform to the new constraints, and hence existing data may
be rejected by the target DBMS in the migration, thus losing data and / or delaying the migration
process. To address this problem and to assist the migration process, we provide an optional
constraint enforcement process module which can be applied to a database before it is migrated.
The objective of this process is to give users the facility to ensure that the database conforms to all
the enhanced constraints before migration occurs. This process is optional so that the user can
decide whether these constraints should be enforced to improve the quality of the legacy data prior
to its migration, whether it is best left as it stands, or whether the new constraints are too severe.

       The constraint definitions in the augmented schema are employed to perform this task. As
all constraints held have already been internally represented in the form of logical expressions,
these can be used to produce data manipulation statements suitable for the host DBMS. Once
these statements are produced, they are executed against the current database to identify the
existence of data violating a constraint.
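
       For instance, the two enhanced constraints sketched in the earlier examples might give rise
to checking statements like the following, where the table and column names remain assumptions;
any rows returned are the violating data.

    -- Rows violating the assumed foreign key from Employee.WorksFor to Department.
    SELECT e.EmpNo, e.WorksFor
    FROM   Employee e
    WHERE  e.WorksFor IS NOT NULL
    AND    NOT EXISTS (SELECT 1
                       FROM   Department d
                       WHERE  d.DeptNo = e.WorksFor);

    -- Rows violating the assumed check constraint on EmpNo.
    SELECT EmpNo
    FROM   Employee
    WHERE  EmpNo NOT LIKE 'E%';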

       Stage 4: Migration Process

        The migration process itself is incrementally performed by initially creating the target
database and then copying the legacy data over to it. The schema meta-translation (SMTS)
technique of Ramfos [RAM91] is used to produce the target database schema. The legacy data can
be copied using the import / export tools of the source and target DBMSs, or DML statements of the
respective DBMSs. During this process, the legacy applications must continue to function until
they too are migrated. To achieve this an interface can be used to capture and process all database
queries of the legacy applications during migration. This interface can decide how to process
database queries against the current state of the migration and re-direct those that now relate to the
target database. The query meta-translation (QMTS) technique of Howells [HOW87] can be used
to convert these queries to the target DML. This approach will facilitate transparent migration for
legacy databases. Our work does not involve the development of an interface to capture and
process all database queries, as interaction with the query interface of the legacy IS is embedded
in the legacy application code. However, we demonstrate how to create and populate a legacy
database schema in the desired target environment while showing the role of SMTS and QMTS in
such a process.
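
       A hedged sketch of one such incremental step is given below: SMTS generates the target DDL
and QMTS the copy statement. All table and column names are illustrative, and the mechanism used to
reach the source data (here the notional name source_employee) depends on the facilities of the
DBMSs involved, e.g. an export / import utility.

    -- Target schema construct produced via SMTS, carrying the enhanced constraints.
    CREATE TABLE Employee (
        EmpNo    CHAR(6)     NOT NULL PRIMARY KEY,
        Name     VARCHAR(40),
        Address  VARCHAR(60),
        WorksFor CHAR(4)     REFERENCES Department (DeptNo)   -- assumed key column
    );

    -- Data copy produced via QMTS; 'source_employee' stands for the legacy table
    -- as made visible to the target DBMS.
    INSERT INTO Employee (EmpNo, Name, Address, WorksFor)
    SELECT EmpNo, Name, Address, WorksFor
    FROM   source_employee;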

2.3 The Role of CCVES in Context of Heterogeneous Distributed Databases

       Our approach described in section 2.2 is based on preparing a legacy database schema for
graceful migration. This involves visualising database schemas together with their constraints and
enhancing them with further constraints to capture more knowledge. Hence we call our system the
Conceptualised Constraint Visualisation and Enhancement System (CCVES).

        CCVES has been developed to fit in with the previously developed schema (SMTS)
[RAM91] and query (QMTS) [HOW87] meta-translation systems, and the schema meta-
visualisation system (SMVS) [QUT93]. This allows us to consider the complementary roles of
CCVES, SMTS, QMTS and SMVS during Heterogeneous Distributed Database access in a
uniform way [FID92, QUT94]. The combined set of tools achieves semantic coordination and
promotes interoperability in a heterogeneous environment at logical, physical and data
management levels.

        Figure 2.2 illustrates the architecture of CCVES in the context of heterogeneous
distributed databases. It outlines in general terms the process of accessing a remote (legacy)
database to perform various database tasks, such as querying, visualisation, enhancement,
migration and integration.

        There are seven sub-processes: the schema mapping process [RAM91], query mapping
process [HOW87], schema integration process [QUT92], schema visualisation process [QUT93],
database connectivity process, database enhancement process and database migration process. The
first two processes together have been called the Integrated Translation Support Environment
[FID92], and the first four processes together have been called the Meta-Integration/Translation
Support Environment [QUT92]. The last three processes were introduced as CCVES to perform
database enhancement and migration in such an environment.

        The schema mapping process, referred to as SMTS, translates the definition of a source
schema to a target schema definition (e.g. an INGRES schema to a POSTGRES schema). The
query mapping process, referred to as QMTS, translates a source query to a target query (e.g. an
SQL query to a QUEL query). The meta-integration process, referred to as SMIS, tackles
heterogeneity at the logical level in a distributed environment containing multiple database
schemas (e.g. Ontos and Exodus local schemas with a POSTGRES global schema) - it integrates
the local schemas to create the global schema. The meta-visualisation process, referred to as
SMVS, generates a graphical representation of a schema. The remaining three processes, namely:
database connectivity, enhancement and migration with their associated processes, namely:
SMVS, SMTS and QMTS, are the subject of the present thesis, as they together form CCVES
(centre section of figure 2.2).

       The database connectivity process (DBC) queries meta-data from a remote database (route
A-1 in figure 2.2) to supply meta-knowledge (route A-2 in figure 2.2) to the schema mapping
process referred to as SMTS. SMTS translates this meta-knowledge to an internal representation
which is based on SQL schema constructs. These SQL constructs are supplied to SMVS for
further processing (route A-3 in figure 2.2) which results in the production of a graphical view of
the schema (route A-4 in figure 2.2). Our reverse-engineering techniques [WIK95b] are applied to
identify entity and relationship types to be used in the graphical model. Meta-knowledge
enhancements are solicited at this point by the database enhancement process (DBE) (route B-1 in
figure 2.2), which allows the definition of new constraints and changes to the existing schema.
These enhancements are reflected in the graphical view (route B-2 and B-3 in figure 2.2) and may
be used to augment the database (route B-4 to B-8 in figure 2.2). This approach to augmentation
makes use of the query mapping process, referred to as QMTS, to generate the required queries to
update the database via the DBC process. At this stage any existing or enhanced constraints may
be applied to the database to determine the extent to which it conforms to the new enhancements.
Carrying out this process will also ensure that legacy data will not be rejected by the target DBMS
due to possible violations. Finally, the database migration process, referred to as DBMI, assists
migration by incrementally migrating the database to the target environment (route C-1 to C-6 in
figure 2.2). Target schema constructs for each migratable component are produced via SMTS, and
DDL statements are issued to the target DBMS to create the new database schema. The data for
these migrated tables are extracted by instructing the source DBMS to export the source data to
the target database via QMTS. Here too, the queries which implement this export are issued to the
DBMS via the DBC process.

2.4 Research Aims and Objectives

       Our relational database enhancement and augmentation approach is important in three
respects, namely:

    1) by holding the additional defining information in the database itself, this information is
      usable by any design tool in addition to assisting the full automation of any future re-
      engineering of the same database;
    2) it allows better user understanding of database applications, as the associated constraints
      are shown in addition to the traditional entities and attributes at the conceptual level;




    3) the process which assists a database administrator to clean inconsistent legacy data ensures a
      safe migration; performing this latter task in a real-world situation without an automated
      support tool is very difficult, tedious, time-consuming and error-prone.

        Therefore the main aim of this project has been the design and development of a tool to
assist database enhancement and migration in a heterogeneous distributed relational database
environment. Such a system is concerned with enhancing the constituent databases in this type of
environment to exploit potential knowledge both to automate the re-engineering process and to
assist in evolving and cleaning the legacy data to prevent data rejection, possible losses of data
and/or delays in the migration process. To this end, the following detailed aims and objectives
have been pursued in our research:

1. Investigation of the problems inherent in schema enhancement and migration for a
heterogeneous distributed relational legacy database environment, in order to fully understand
these processes.

2. Identification of the conceptual foundation on which to successfully base the design and
development of a tool for this purpose. This foundation includes:

    • A framework to establish meta-data representation and manipulation.
    • A real world data modelling framework that facilitates the enhancement of existing working
      systems and which supports applications during migration.
    • A framework to retain the enhanced knowledge for future use which is in line with current
      international standards and techniques used in newer versions of relational DBMSs.
    • Exploiting existing databases in new ways, particularly linking them with data held in other
      legacy systems or more modern systems.
    • Displaying the structure of databases in a graphical form to make it easy for users to
      comprehend their contents.
    • The provision of an interactive graphical response when enhancements are made to a
      database.
    • A higher level of data abstraction for tasks associated with visualising the contents,
      relationships and behavioural properties of entities and constraints.
    • Determining the constraints on the information held and the extent to which the data
      conforms to these constraints.
    • Integrating with other tools to maximise the benefits of the new tool to the user community.

3. Development of a prototype tool to automate the re-engineering process and the migration
assisting tasks as far as possible. The following development aims have been chosen for this
system:

    • It should provide a realistic solution to the schema enhancement and migration assistance
      process.
    • It should be able to access and perform this task for legacy database systems.
    • It should be suitable for the data model at which it is targeted.
    • It should be as generic as possible so that it can be easily customised for other data models.
    • It should be able to retain the enhanced knowledge for future analysis by itself and other
tools.
    • It should logically support a model using modern data modelling techniques irrespective of
      whether it is supported by the DBMS in use.
    • It should make extensive use of modern graphical user interface facilities for all graphical
      displays of the database schema.
    • Graphical displays should also be as generic as possible so that they can be easily enhanced or
      customised for other display methods.




CHAPTER 3
                        Database Technology, Relational Model,
                     Conceptual Modelling and Integrity Constraints

The origins and historical development of database technology are presented first, to trace the
evolution of ISs and the emergence of database models. The relational data model is identified as
currently the most commonly used database model, and some terminology for this data model, along
with its features, including query languages, is then presented. A discussion of conceptual data
models, with special emphasis on EER and OMT, is provided to introduce these data models and the
symbols used in our project. Finally, we pay attention to crucial concepts relating to our work,
namely the notion of semantic integrity constraints, with special emphasis on those used in semantic
extensions to the relational model. The relational database language SQL is also discussed,
identifying how and when it supports the implementation of these semantic integrity constraints.

3.1 Origins and Historical Developments

        The origin of data management goes back to the 1950s and hence this section is sub-divided
into two parts: the first describes database technology prior to the relational data model, and the
second describes developments since. This division was chosen because the relational model is
currently the dominant database model for information management [DAT90].

       3.1.1 Database Technology Prior to the Relational Data Model

        Database technology emerged from the need to manipulate large collections of data for
frequently used data queries and reports. The first major step in mechanisation of information
systems came with the advent of punched card machines which worked sequentially on fixed-length
fields [SEN73, SEN77]. With the appearance of stored program computers, tape-oriented systems
were used to perform these tasks with an increase in user efficiency. These systems used sequential
processing of files in batch mode, which was adequate until peripheral storage with random access
capabilities (e.g. DASD) and time sharing operating systems with interactive processing appeared to
support real-time processing in computer systems.

        Access methods such as direct and indexed sequential access methods (e.g. ISAM, VSAM)
[BRA82, MCF91] were used to assist with the storage and location of physical records in stored
files. Enhancements were made to procedural languages (e.g. COBOL) to define and manage
application files, making the application program dependent on the organisation of the file. This
technique caused data redundancy as several files were used in systems to hold the same data (e.g.
emp_name and address in a payroll file; insured_name and address in an insurance file; and
depositors_name and address in a bank file). These stored data files used in the applications of the
1960's are now referred to as conventional file systems, and they were maintained using third
generation programming languages such as COBOL and PL/1. This evolution of mechanised
information systems was influenced by the hardware and software developments which occurred in
the 1950’s and early 1960’s. Most long existing legacy ISs are based on this technology. Our work
does not address this type of IS as they do not use a DBMS for their data management.

       The evolution of databases and database management systems [CHA76, FRY76, SIB76,
SEN77, KIM79, MCG81, SEL87, DAT90, ELM94] was to a large extent the result of addressing the
main deficiencies in the use of files, i.e. by reducing data redundancy and making application
programs less dependent on file organisation. An important factor in this evolution was the
development of data definition languages which allowed the description of a database to be
separated from its application programs. This facility allowed the data definition (often called a
schema) to be shared and integrated to provide a wide variety of information to the users. The
repository of all data definitions (meta-data) is called a data dictionary, and its use allows data
definitions to be shared and made widely available to the user community.

        In the late 1960's applications began to share their data files using an integrated layer of
stored data descriptions, creating the first true databases, e.g. the IMS hierarchical database [MCG77,
DAT90]. This type of database was navigational in nature and applications explicitly followed the
physical organisation of records in files to locate data using commands such as GNP - get next under
parent. These databases provided centralised storage management, transaction management,
recovery facilities in the event of failure and system maintained access paths. These were the typical
characteristics of early DBMSs.

        Work on extending COBOL to handle databases was carried out in the late 60s and 70s. This
resulted in the establishment of the DBTG (i.e. DataBase Task Group) of CODASYL and the formal
introduction of the network model along with its data manipulation commands [DBTG71]. The
relational model was proposed during the same period [COD70], followed by the 3 level
ANSI/SPARC architecture [ANSI75] which made databases more independent of applications, and
became a standard for the organisation of DBMSs. Three popular types of commercial database
systems7 classified by their underlying data model emerged during the 70s [DAT90, ELM94],
namely:

         • hierarchical
         • network
         • relational

and these have been the dominant types of DBMS from the late 60s on into the 80s and 90s.

         3.1.2 Database Technology Since the Relational Data Model

        At the same time as the relational data model appeared, database systems introduced another
layer of data description on top of the navigational functionality of the early hierarchical and
network models to bring extra logical data independence8. The relational model also introduced the
use of non-procedural (i.e. declarative) languages such as SQL [CHA74]. By the early 1980's many
relational database products, e.g. System R [AST76], DB2 [HAD84], INGRES [STO76] and Oracle,
were in use. Owing to their growing maturity in the mid 80s, and to the complexity of programming,
navigating and changing data structures in the older DBMS data models, the relational data model
was able to take over the commercial database market, with the result that it is now dominant.



   7
       Other types such as flat file, inverted file systems were also used.
   8
       This allows changes to the logical structure of data without changing the application programs.


       The advent of inexpensive and reliable communication between computer systems, through
the development of national and international networks, has brought further changes in the design of
these systems. These developments led to the introduction of distributed databases, where a
processor uses data at several locations and links it as though it were at a single site. This technology
has led to distributed DBMSs and the need for interoperability among different database systems
[OZS91, BEL92].

        Several shortcomings of the relational model have been identified, including its inability to
support efficiently compute-intensive applications such as simulation, to cope with computer-aided
design (CAD) and programming language environments, and to represent and manipulate effectively
concepts such as the following [KIM90]:

       • Complex nested entities (e.g. design and engineering objects),
       • Unstructured data (e.g. images, textual documents),
       • Generalisation and aggregation within a data structure,
       • The notion of time and versioning of objects and schemas,
       • Long duration transactions.

        The notion of a conceptual schema for application-independent modelling introduced by the
ANSI/SPARC architecture led to another data model, namely: the semantic model. One of the most
successful semantic models is the entity-relationship (E-R) model [CHE76]. Its concepts include
entities, relationships, value sets and attributes. These concepts are used in traditional database
design as they are application-independent. Many modelling concepts based on variants/extensions
to the E-R model have appeared since Chen’s paper. The enhanced/extended entity-relationship
model (EER) [TEO86, ELM94], the entity-category-relationship model (ECR) [ELM85], and the
Object Modelling Technique (OMT) [RUM91] are the most popular of these.

        The DAPLEX functional model [SHI81] and the Semantic Data Model [HAM81] are also
semantic models. They capture a richer set of semantic relationships among real-world entities in a
database than the E-R based models. Semantic relationships such as generalisation / specialisation
between a superclass and its subclass, the aggregation relationship between a class and its attributes,
the instance-of relationship between an instance and its class, the part-of relationship between
objects forming a composite object, and the version-of relationship between abstracted versioned
objects are semantic extensions supported in these models. The object-oriented data model with its
notions of class hierarchy, class-composition hierarchy (for nested objects) and methods could be
regarded as a subset of this type of semantic data model in terms of its modelling power, except for
the fact that the semantic data model lacks the notion of methods [KIM90] which is an important
aspect of the object-oriented model.

       The relational model of data and the relational query language have been extended [ROW87]
to allow modelling and manipulation of additional semantic relationships and database facilities.
These extensions include data abstraction, encapsulation, object identity, composite objects, class
hierarchies, rules and procedures. However, these extended relational systems are still being
evolved to fully incorporate features such as implementation of domain and extended data types,
enforcement of primary and foreign key and referential integrity checking, prohibition of duplicate
rows in tables and views, handling missing information by supporting four-valued predicate logic
(i.e. true, false, unknown, not applicable) and view updatability [KIV92], and they are not yet
available as commercial products.

        The early 1990's saw the emergence of new database systems by a natural evolution of
database technology, with many relational database systems being extended and other data models
(e.g. the object-oriented model) appearing to satisfy more diverse application needs. This opened
opportunities to use databases for a greater diversity of applications which had not been previously
exploited as they were not perceived as tractable by a database approach (e.g. Image, medical,
document management, engineering design and multi-media information, used in complex
information processing applications such as office automation (OA), computer-aided design (CAD),
computer-aided manufacturing (CAM) and hyper media [KIM90, ZDO90, CAT94]). The object-
oriented (O-O) paradigm represents a sound basis for making progress in these areas and as a result
two types of DBMS are beginning to dominate in the mid 90s [ZDO90], namely: the object-oriented
DBMS, and the extended relational DBMS.

        There are two styles of O-O DBMS, depending on whether they have evolved from
extensions to an O-O programming language or from extensions to an existing database model. Extensions have been
created for two database models, namely: the relational and the functional models. The extensions to
existing relational DBMSs have resulted in the so-called Extended Relational DBMSs which have
O-O features (e.g. POSTGRES and Starburst), while extensions to the functional model have
produced PROBE and OODAPLEX. The approach of extending O-O programming language
systems with database management features has resulted in many systems (e.g. Smalltalk into
GemStone and ALLTALK, and C++ into many DBMSs including VBase / ONTOS, IRIS and O2).
References to these systems with additional information and references can be found in [CAT94].

       Research is currently taking place into other kinds of database such as active, deductive and
expert database systems [DAT90]. This thesis focuses on the relational model and possible
extensions to it which can represent semantics in existing relational database information systems in
such a way that these systems can be viewed in new ways and easily prepared for migration to more
modern database environments.

3.2 Relational Data Model

        In this section we introduce some of the commonly used terminology of the relational model.
This is followed by a selective description of the features and query languages of this model. Further
details of this data model can be found in most introductory database text books, e.g. [MCF91,
ROB93, ELM94, DAT95].

        A relation is represented as a table (entity) in which each row represents a tuple (record), the
number of columns being the degree of the relation and the number of rows being its cardinality. An
example of this representation is shown in figure 3.1, which shows a relation holding Student details,
with degree 3 and cardinality 5. This table and each of its columns are named, so that a unique
identity for a table column of a given schema is achieved via its table name and column name. The
columns of a table are called attributes (fields) each having its own domain (data type) representing
its pool of legal data. Basic types of domains are used (e.g. integer, real, character, text, date) to
define the domains of attributes. Constraints may be enforced to further restrict the pool of legal
values for an attribute. Tables which actually hold data are called base tables to distinguish them
from view tables which can be used for viewing data associated with one or more base tables. A
view table can also be an abstraction from a single base table which is used to control access to parts
of the data.

        A column or set of columns whose values uniquely identify a row of a relation is called a
candidate key (key) of the relation. It is customary to designate one candidate key of a relation as a
primary key (e.g. SNO in figure 3.1). The specification of keys restricts the possible values the key
attribute(s) may hold (e.g. no duplicate values), and is a type of constraint enforceable on a relation.
Additional constraints may be imposed on an attribute to further restrict its legal values. In such
cases, there should be a common set of legal values satisfying all the constraints of that attribute,
ensuring its ability to accept some data. For example, a pattern constraint which ensures that the first
character of SNO is ‘S’ further restricts the possible values of SNO - see figure 3.1. Many other
concepts and constraints are associated with the relational model although most of them are not
supported by early relational systems nor, indeed, by some of the more recent relational systems (e.g. a
value set constraint for the Address field as shown in figure 3.1).

[Figure 3.1 is a diagram showing the Student relation as a table. Annotations mark: the domain of
SNO (type character); a pattern constraint on SNO (all values begin with 'S'); SNO as the primary
key (unique values); a value set constraint on Address; the rows as tuples (the cardinality) and the
columns as attributes (the degree).]

                     Student    SNO     Name      Address
                                S1      Jones     Cardiff
                                S2      Smith     Bristol
                                S3      Gray      Swansea
                                S4      Brown     Cardiff
                                S5      Jones     Newport

                                                    Figure 3.1: The Student relation
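
        The annotations of figure 3.1 can be expressed in SQL-92 style DDL as shown below. The
declaration is illustrative (the particular value set for Address is invented for the example), and,
as discussed above, early relational systems could create the table but not declare the key, pattern
or value set constraints.

    CREATE TABLE Student (
        SNO     CHAR(4)     NOT NULL,
        Name    VARCHAR(30),
        Address VARCHAR(30),
        CONSTRAINT pk_student  PRIMARY KEY (SNO),             -- primary key: unique values
        CONSTRAINT chk_sno     CHECK (SNO LIKE 'S%'),         -- pattern constraint
        CONSTRAINT chk_address CHECK (Address IN
            ('Cardiff', 'Bristol', 'Swansea', 'Newport'))     -- value set constraint (invented)
    );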


       3.2.1 Requisite Features of the Relational Model

        During the early stages of the development of relational database systems, many requisite
features were identified which a comprehensive relational system should have [KIM79, DAT90].
We now examine these to illustrate the kind of capabilities expected from early relational
database management systems. They included support for:

      • Recovery from both soft and hard crashes,
      • A report generator for formatted display of the results of queries,
      • An efficient optimiser to meet the response-time requirements of users,
      • User views of the stored database,
     • A non-procedural language for query, data manipulation / definition / control,
     • Concurrency control to allow sharing of a database by multiple users and applications,
     • Transaction management,
     • Integrity control during data manipulation,
     • Selective access control to prevent one user’s database being accessed by unauthorised users,
     • Efficient file structures to store the database, and
     • Efficient access paths to the stored data.

        Many early relational DBMSs originated at universities and research institutes, and none of
them were able to provide all the above features. These systems mainly focussed on optimising
techniques for query processing and recovery from soft and hard crashes, and did not pay much
attention to the other features. Few of these database systems were commercially available, and for
those that were the marketing was based on specific features such as report generation (e.g.
MAGNUM), and views with selective access control (e.g. QBE). The early commercial systems did
not support the full range of features either.

        Since the mid 1980’s many database products have appeared which aim to provide most of
the above features. The enforcement of features such as concurrency control was embodied in these
systems, while features such as views, access and integrity control were provided via non-procedural
language commands. Systems which were unable to provide these features via a non-procedural
language offered procedural extensions (e.g. C with embedded SQL) to perform such tasks. This
resulted in the use of two types of data manipulation languages, i.e. procedural and non-procedural,
to perform database system functions. In procedural languages a sequence of statements is issued to
specify the navigation path needed in the database to retrieve the required data, thus they are
navigational languages. This approach was used by all hierarchical and network database systems
and by some relational systems. However, most relational systems offer a non-procedural (i.e. non-
navigational) language. This allows retrieval of the required data by using a single retrieval
expression, which in general has a degree of complexity corresponding to the complexity of the
retrieval (e.g. SQL).
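
        As a simple illustration of the difference, the single declarative SQL expression below
retrieves data that a navigational DML would have to locate record by record; the table and column
names (Employee, Department, WorksFor, DeptNo, DName) are assumed for the example.

    -- Names of all employees working for a given department.
    SELECT e.Name
    FROM   Employee e, Department d
    WHERE  e.WorksFor = d.DeptNo
    AND    d.DName    = 'Computer Science';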

         3.2.2 Query Language for the Relational Model

       Querying or the retrieval of information from a database is perhaps the aspect of relational
languages which has received the most attention. A variety of approaches to querying has emerged,
based on relational calculus, relational algebra, mapping-oriented languages and graphic-oriented
languages [CHA76, DAT90]. During the first decade of relational DBMSs, there were many
experimental implementations of relational systems in universities and industry, particularly at IBM.
The initial projects were aimed at proving the feasibility of relational database systems supporting
high-level non-procedural retrieval languages. The Structured Query Language (SQL9) [AST75]
emerged from an IBM research project. Later projects created more comprehensive relational
DBMSs and among the more important of these systems were probably the System R project at IBM
[AST76] and the INGRES project (with its QUEL query language) at the University of California at
Berkeley [STO76].



   9
     Initially called SEQUEL; the name was later shortened to SQL, though it is still often pronounced ‘sequel’.


        Standards for relational query languages were introduced [ANSI86] so that a common
language could be used to retrieve information from a database. SQL became the standard query
language for relational databases. These standards were reviewed regularly [ANSI89a, ANSI92] and
are being reviewed [ISO94] to incorporate technological changes that meet modern database
requirements. Hence, the standard query language SQL is evolving, and although some of the recent
database systems conform to [ANSI92] standards they will have to be upgraded to incorporate even
more recent advances such as the object-oriented paradigm additions to the language [ISO94]. This
means that different database system query languages conform to different standards, and provide
different features and facilities to their users even though they are of the same type. Hence,
information systems developed during different eras will have used different techniques to perform
the same task, with early systems being more procedural in their approach than more recent ones.

        Query languages, including SQL, have three categories of statements, i.e. the data
manipulation language (DML) statements to perform all retrievals, updates, insertions and deletions,
the data definition language (DDL) statements to define the schema and its behavioural functions
such as rules and constraints, and the data control language (DCL) statements to specify access
control which is concerned with the privileges to be given to database users.
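
        The following SQL statements, one or two from each category, illustrate the distinction;
the object and user names are assumptions made for the example.

    -- DDL: define part of the schema.
    CREATE TABLE Student (SNO CHAR(4) PRIMARY KEY, Name VARCHAR(30));

    -- DML: insert and retrieve data.
    INSERT INTO Student VALUES ('S1', 'Jones');
    SELECT Name FROM Student WHERE SNO = 'S1';

    -- DCL: grant access privileges to a (hypothetical) user.
    GRANT SELECT ON Student TO registry_clerk;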

3.3 Conceptual Modelling

        The conceptual model is a high level representation of a data model, providing an
identification and description of the main data objects (avoiding details of their implementation).
This model is hardware and software independent, and is represented using a set of graphical
symbols in a diagrammatic form. As noted in part ‘c’ of stage 1 of section 2.2.2, different users may
prefer different graphical models and hence it is useful to provide them with a choice of models. We
consider two types of conceptual model in this thesis, namely: the enhanced entity-relationship
model (EER), which is based on the popular entity-relationship model, and the object-modelling
technique (OMT), which uses the more recent concepts of object-oriented modelling as opposed to
the entities of the E-R model. These were chosen as they are among the currently most widely used
conceptual modelling approaches and they allow representation of modelling concepts such as
generalisation hierarchies.

       3.3.1 Enhanced Entity-Relationship Model (EER)

        The entity-relationship approach is considered to be the first useful proposal [CHE76] on the
subject of conceptual modelling. It is concerned with creating an entity model which represents a
high-level conceptual data model of the proposed database, i.e. it is an abstract description of the
structure of the entities in the application domain, including their identity, relationship to other
entities and attributes, without regard for eventual implementation details.

        Thus an E-R diagram describes entities and their relationships using distinctive symbols, e.g.
an entity is a rectangle and a relationship is a diamond. Distinctive symbols for recent modelling
concepts such as generalisation, aggregation and complex structures have been introduced into these
models by practitioners. Despite its popularity, no standard has emerged or been defined for this
model. Hence different authors use different notations to represent the same concept. Therefore we
have to define our symbols for these concepts: we have based our definitions on [ROB93] and
[ELM94].

        a) Entity

       An entity in the E-R model corresponds to a table in the relational environment and is
represented by a rectangle containing the entity name, e.g. the entity Employee of figure 3.2.

        b) Attributes

        Attributes are represented by a circle that is connected to the corresponding entity by a line.
Each attribute has a name located near the circle10, e.g. attributes EmpNo, Name and Address of the
Employee entity in figure 3.2. Key attributes of a relation are indicated using a colour to fill in the
circle (red on the computer screen or shaded dark in this thesis) (e.g. the attribute EmpNo of
Employee in figure 3.2). Attributes usually have a single value in an entity occurrence although
multivalued attributes can occur and other types such as derived attributes can be represented in the
conceptual model (see appendix B for a comprehensive list of the symbols used in EER models in
this thesis).

        c) Relationships

        A relationship is an association between entities. Each relationship is named and represented
by a diamond-shaped symbol. Three types of relationships (one-to-many or 1:M, many-to-many or
M:N, and one-to-one or 1:1) are used to describe the association between entities. Here 1 means that
an instance of this entity relates to only one instance of the other entity (e.g. an employee works for
only one department), and M or N means that an instance of an entity may relate to more than one
instance of the other entity (e.g. a department can have many employees working for it - see figure
3.2), through this relationship (the same entities can be linked in more than one relationship). The
relationship type is determined by the participating entities and their associated properties. In the
relational model a separate entity is used for M:N relationship types (e.g. a composite entity as in the
case of the entity ComMem of figure 3.2), and the other relationship types (i.e. 1:1 and 1:M) are
represented by repeated attributes (e.g. relationship WorksFor of figure 3.2 is established from the
attribute WorksFor of the entity Employee).




   10
      We do not place the attribute name inside the circle to avoid the use of large circles or ovals in
our diagrams.


[Figure 3.2 is a diagram. Committee (a weak entity, with attribute Title) is linked to Faculty
through the weak relationship Fcom. ComMem (a composite entity, with attribute YearJoined) relates
Employee and Committee. Employee (an entity with key EmpNo and attributes Name and Address) is
linked to Department through the relationships WorksFor, with cardinalities (1,1) and (4,N), and
Head, with cardinalities (0,1) and (1,1). Office is a generalised entity with Department and
Faculty as its disjoint specialised entities.]

                                      Figure 3.2: EER diagram for part of the University Database


        A relationship’s degree indicates the number of associated entities (or participants)
in the relationship. Relationships with 1, 2 and 3 participants are called unary, binary and ternary
relationships, respectively. In practice most relationships are binary (e.g. relationship WorksFor in
figure 3.2) and relationships of higher order (e.g. four) occur very rarely as they are usually
simplified to a series of binary relationships.

        The term connectivity is used to describe the relationship classification and it is represented
in the diagram by using 1 or N near the related entity (see for example, the WorksFor relationship in
figure 3.2). Alternatively, a more detailed description of the relationship is specified using
cardinality, which expresses the specific number of entity occurrences associated with one
occurrence of the related entity. The actual number of occurrences depends on the organisation’s
policy and hence, can differ from that of another organisation, although both may model the same
information. The cardinality has upper and lower limits indicating a range and is represented in the
diagram within brackets near the related entity (see the WorksFor relationship in figure 3.211).
Cardinality is a type of constraint and in appendix B.2 we provide more details about the symbols
and notations used to represent these types of constraints. Thus in the WorksFor relationship:

        (1,1)   indicates              an employee must belong to a department
        (4,N)   indicates              a department must have at least 4 employees
        N       indicates              a department has many employees
        1       indicates              an employee may work for only one department

        d) Other Relationship and Entity Types

       The original E-R model of Chen did not contain relationship attributes and did not use the
concept of a composite entity. We use this concept as in [ROB93], because the relational model
requires the use of an entity composed of the primary keys of other entities to connect and represent
M:N relationships. Hence, a composite entity (also called a link [RUM91] or regular [CHI94] entity)
   11
      In practice, in a diagram only one of these types is shown, depending on availability of
information on cardinality limits.


representing an M:N relationship is represented using a diamond inside the rectangle, indicating that
it is an entity as well as a relationship (e.g. ComMem of figure 3.2). In this type of relationship, the
primary key of the composite entity is created by combining the keys of the entities which it connects.
This is usually a binary (2-ary) relationship involving two referenced entities, and is a special case
of an n-ary relationship, which connects n entities.
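
        In SQL the composite entity ComMem of figure 3.2 could be declared as follows. This is a
sketch in which Committee is assumed, for simplicity, to be identified by Title alone, and the
column types are otherwise invented.

    CREATE TABLE ComMem (
        EmpNo      CHAR(6)      NOT NULL REFERENCES Employee (EmpNo),
        Title      VARCHAR(30)  NOT NULL REFERENCES Committee (Title),
        YearJoined INTEGER,
        PRIMARY KEY (EmpNo, Title)    -- key formed from the keys of the connected entities
    );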

        Some entity occurrences cannot exist without an entity occurrence with which they have a
relationship. Such entities are called weak entities and are represented by a double rectangle (e.g.
Committee in figure 3.2). The relationship formed with this entity type is called a weak relationship
and is represented by a double diamond (e.g. Fcom relationship of figure 3.2). In this type of
relationship, the key of the entity on which the weak entity depends is a proper subset of the weak
entity’s primary key, and the remaining attributes of that primary key (called dangling attributes) do
not contain a key of any other entity.

        When a relationship exists between occurrences of the same entity set (i.e. a unary
relationship) it forms a recursive relationship (e.g. a course may have prerequisite courses).

       e) Generalisation / Specialisation / Inheritance

        Most organisations employ people with a wide range of skills and special qualifications (e.g.
a university employs academics, secretaries, porters, research associates, etc.) and it may be
necessary to record additional information for certain types of employee (e.g. qualifications of
academics). Representing such additional information in the employee table results in null values in
these attributes for all other employees, since the information is not applicable to them. To
overcome this, the characteristics common to all employees are chosen to define the
employee entity as a generalised entity, and the additional information is put in a separate entity,
called a specialised entity, which inherits all the properties of its parent entity (i.e. the generalised
entity), creating a parent-child or is-a relationship (also called a generalised hierarchy). The higher
level of this relationship is a supertype entity (i.e. generalised entity) and the lower-level is a subtype
entity (i.e. specialised entity). A supertype entity set is usually composed of several unique and
disjoint (non-overlapping) subtype entity sets. However some supertypes contain overlapping
subtypes (e.g. an employee may also be a student and hence we get two subtypes of person in an
overlapping relationship). There are constraints applicable for generalised hierarchies and special
symbols / notations are used in these cases (see appendices B.1 figure ‘e’ and B.2 figure ‘b’). In
figure 3.2, the entities Office, Department and Faculty form a generalised hierarchy, with Office
being the supertype entity and Department and Faculty being the subtype entities. Subtype and
supertype entities have a 1:1 relationship although we view it differently, i.e. as a hierarchy.

       The subtypes described above inherit from a single supertype entity. However, there may be
cases where a subtype inherits from multiple supertypes (e.g. an empstudent entity representing
employees who are also students may inherit from employee and student entities). This is known as
multiple inheritance. In such cases the subtype may represent either an intersection or a union. The
concept of inheritance was taken from the O-O paradigm and hence it does not occur in the original
E-R model, but is included in the EER model.

       3.3.2 Object Modelling Technique (OMT)




         The Object Modelling Technique (OMT) is an O-O development methodology. It creates a
high-level conceptual data model of a proposed system without regard for eventual implementation
details. This model is based on objects.

         The notations of OMT used here are taken from [RUM91] and those used in our work are
described in appendix B, where they are compared with their EER equivalents. Hence we do not
describe this model in depth here. The diagrams produced by this method are known as object
diagrams. They combine O-O concepts (i.e. classes and inheritance) with information modelling
concepts (i.e. entities and relationships). Although the terminology differs from that used in the
EER model, both produce similar conceptual models, albeit with different graphical notations. The
main notations used in OMT are rectangles with text inside (e.g. for classes and their properties, as
opposed to the EER where attributes appear outside the entity). This makes OMT easier to
implement than EER in a graphical computing environment. OMT is used for most O-O modelling
(e.g. in [COO93, IDS94]), and so it is a widely known technique.

3.4 Semantic Integrity Constraints

        A real world application is always governed by many rules which define the application
domain and are referred to as integrity constraints [DAT83, ELM94]. An important activity when
designing a database application is to identify and specify these integrity constraints for that database
and if possible to enforce them using the DBMS constraint facilities.

        The term integrity refers to the accuracy, correctness or validity of a database. The role of the
integrity constraint enforcer is to ensure that the data in the database is accurate by guarding it
against invalid updates, which may be caused by errors in data entry, mistakes on the part of the
operator or the application programmer, system failures, and even due to deliberate falsification by
users. This latter case is the concern of the security system which protects the database from
unauthorised access (i.e. it implements authorisation constraints). The integrity system uses integrity
rules (integrity constraints) to protect the database from invalid updates supplied by authorised users
and to maintain the logical consistency of the database.

        Integrity is sometimes used to cover both semantic and transaction integrity. The latter case
deals with concurrency control (i.e. the prevention of inconsistencies caused by concurrent access by
multiple users or applications to a database), and recovery techniques which prevent errors due to
malfunctioning of system hardware and software. Protection against this type of integrity-violation is
dealt with by most commercially available systems and is not an issue of this thesis. Here we shall
use the terms integrity and constraints to refer only to semantic integrity constraints.

        Integrity rules cannot detect all types of error. For instance, when dealing with percentage
marks, there is no way the computer can detect that an input value of 45 for a student mark should
really have been 54. A value of 455, on the other hand, could be detected and corrected. Consistency
is another term used for integrity; however, it is normally used in cases
where two or more values in the database are required to be in agreement with each other in some
way. For example, the DeptNo in an Employee record should tally with the DeptNo appearing in
some Department record (referential integrity in relational systems), or the Age of a Person must be



equal to the difference in years between today’s date and their date of birth (a property of a derived
attribute).

        In order to check for invalid data, DBMSs use an integrity subsystem to monitor transactions
and detect integrity violations. In the event of a violation the DBMS takes appropriate actions such
as rejecting the operation, reporting the violation, or assisting in correcting the error. To perform
such a task, the integrity subsystem must be provided with a set of rules that define what errors to
check for, when to do the checking, and what to do if an error is detected. Most early DBMSs did
not have an integrity subsystem (mainly due to unacceptable database system performance when
integrity checking was performed in older technological environments), and hence such checking was
not provided by the DBMS itself. Instead, these information systems performed integrity
checking using procedural language extensions of the database to check for invalid entries during the
capture of data via their user interface (e.g. data entry forms). Here too, due to technological
limitations and poor database performance, only specific types of constraints (e.g. range check,
pattern matching), and a limited number of checks were allowed for an attribute. As these rules were
coded in application programs they violated program / data (rule) independence for constraint
specification. However, most recent DBMSs attempt to support such specifications using their DDL
and hence they achieve program / rule independence.

        The original SQL standard specifications [ANSI86] were subsequently enhanced so that
constraints could be specified using SQL [ANSI89a]. Current commercial DBMSs are seeking to
meet these standards by targeting the implementation of the SQL-2 standards [ANSI92] in their
latest releases. Systems such as Oracle now conform to these standards, while others such as
INGRES and POSTGRES have taken a different path by extending their systems with a rule
subsystem, which performs similar tasks but using a procedural style approach where the rules and
procedures are retained in data dictionaries.

       Integrity constraints can be identified for the properties of a data model and for the values of
a database application. We examine both to present a detailed description of the types of constraint
associated with databases and in particular those used for our work.

       3.4.1 Integrity Constraints of a Data Model

        Some constraints are automatically supported by the data model itself. These constraints are
assumed to hold by the definition of that data model (i.e. they are built into the system and not
specified by a user). They are called the inherent constraints of the data model. There are also
constraints that can be specified and represented in a data model. These are called the implicit
constraints of the model and they are specified using DDL statements in a relational schema, or
graphical constructs in an E-R model. Table 3.1 gives some examples of implicit and inherent
constraints for relational and EER data models. The constraint types used in this table are described
in detail in section 3.5.

        The structure of a data model represents inherent constraints implicitly and is also capable of
representing implicit constraints. Hence, constraints represented in these two ways are referred to as
structural constraints. Data models differ in the way constraints are handled. Hierarchical and
network database constraints are handled by being tightly linked to structural concepts (records, sets,



segment definitions), of which the parent-child and owner-member relationships are logical
examples. The classical relational model, on the other hand, has two constraints represented
structurally by its relations or tables (namely, relations consist of a certain number of simple
attributes and have no duplicate rows). Hence only specific types of structural constraint are defined
for a particular data model (e.g. parent-child relationships are not defined for the relational model).

                     Data Model     Implicit constraints                        Inherent constraints

                                    • Primary key attributes,                   • Every relationship instance of an n-ary
                                    • Attribute structural constraints,           relationship type R relates exactly one
                        EER         • Relationship structural constraints,        entity from each entity type
                                    • Superclass / subclass relationship,         participating in R in a specific role,
                                    • Disjointness / totality constraints       • Every entity instance in a subclass
                                      on specialisation / generalisation.         must also exist in its superclass.

                                    • Domain constraints,                       • A relation consists of a certain
                      Relational    • Key constraints,                            number of simple attributes,
                                    • Relationship structural constraints.      • An attribute value is atomic,
                                                                                • No duplicate tuples are allowed.

                                   Table 3.1: Structural constraints of selected data models


        Every data model has a set of concepts and rules (or assertions) that specify the structure and
the implicit constraints of a database describing a miniworld. A given implementation of a data
model by a particular DBMS will usually support only some of the structural (inherent and implicit)
constraints of the data model automatically and the rest must be defined explicitly. These additional
constraints of a data model are called explicit or behavioural constraints. They are defined using
either a procedural or a declarative (non-procedural) approach, which is not part of the data
model per se.

        3.4.2 Integrity Constraints of a Database Application

        In database applications, integrity constraints are used to ensure the correctness of a
database. A change to the database takes place during an update transaction, and
constraints are used at this stage to ensure that the database is in a consistent state before and after
that transaction. This type of constraint is called a state (static) constraint as it applies to a particular
state of the database and should hold for every state where the database is not in transition, i.e. not in
the process of being updated. Constraints that apply to the change of a database from one state to
another are called transition (dynamic) constraints (e.g. the age of a person can only be
increased, meaning that the new value of age must be greater than the old value). In general, transition
constraints occur less frequently than state constraints and are usually specified explicitly.

        The discussion above classifies the types of semantic integrity constraints used in data
models and database applications. We summarise them in figure 3.3 to highlight the basic
classification of integrity constraints. We separate the two approaches using a dotted line as they are
independent of each other. However, most constraints are common to both categories as they are
implemented using a particular data model for a database application. Data models used for
conceptual modelling are more concerned with structural constraints as opposed to the value
constraints of database applications.




                     [Diagram: integrity constraints are classified into data model constraints and database
                     application constraints (the two categories are separated by a dotted line). Data model
                     constraints divide into structural constraints (inherent constraints and implicit
                     constraints) and explicit (behavioural) constraints; database application constraints
                     divide into state (static) constraints and transition (dynamic) constraints.]

                                              Figure 3.3: Classification of integrity constraints


3.5 Constraint Types

       We consider constraint types in more detail here so that we can later relate them to data
models and database applications. Initially, we describe value constraints (i.e. domain and key
constraints) which are applicable to database values (i.e. attributes). Then, we describe structural
constraints, namely: attribute structural, relationship structural and superclass/subclass structural.
These constraints are often associated with data models and some of them have been mentioned in
section 3.4.1. In this section, we look at them with respect to their structural properties and are
concerned with identifying differences within a structure, in addition to the relationships (e.g.
between entities) formed by them. Finally, constraints that do not fall into either of these categories
are described. As most of these constraints are state constraints we shall refer to the constraint type
only when type distinction is necessary.

        All structural constraints are shown in a conceptual model as this model is used to describe
the structure of a database. Not all value constraints (e.g. check constraints) are shown as they are
not associated with the structure of a database and are described using a DML. However, our work
includes presenting them at optional lower levels of abstraction which involves software dependent
code. This code is based on current SQL standards and may be replaced using equivalent graphical
constructs if necessary12. Here for each type of an SQL statement, we could introduce a suitable
graphical representation and hence increase its readability. All value constraints are implicitly or
explicitly defined when implementing an application. Most constraints considered here are implicit
constraints as they may be specified using the DDL of a modern DBMS. In such cases the DBMS
will monitor all changes to the data in the database to ensure that no constraint has been violated by
these changes.

         3.5.1 Domain Constraints

         Domain constraints are specified by associating every simple attribute type with a value set.

   12
        This idea is beyond the scope of this thesis.


The value set of a simple attribute is initially defined using its data type (integer, real, char, boolean)
and length, and later is further restricted using appropriate constraints such as range (minimum,
maximum) and pattern (letters and digits). For example, the value set for the Deptno attribute of the
entity Department could be specified as data type character of length 5, and the Salary attribute of the
entity Staff as data type decimal of length 8 with 2 decimal places, in the range 3000 to 20000.
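
        As a concrete illustration, the Salary value set above could be declared as follows (a
minimal sketch only, using the SQL notation of section 3.7; the StaffNo key column is an assumption):

            CREATE TABLE Staff (
                StaffNo CHAR(5) PRIMARY KEY,                  -- assumed key attribute
                Salary  DECIMAL(8,2)                          -- data type and length define the value set
                        CHECK (Salary BETWEEN 3000 AND 20000) -- range constraint restricts it further
            );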

      Nonnull constraints can be seen as a special case of domain constraints, since they too restrict
the domain of attributes. These constraints are used to eliminate the possibility of missing or
unknown values of an attribute occurring in the database.

        A domain constraint is usually used to restrict the value of an attribute, e.g. an employee’s
age is ≥ 18 (i.e. a state constraint); however, it may also be used to compare values of two states,
e.g. an employee’s new salary is ≥ their current salary (i.e. a transition constraint).

       3.5.2 Key Constraints

        Key constraints specify key attribute(s) that can uniquely identify an instance of an entity.
These constraints are also called candidate key or uniqueness constraints. For example, stating
Deptno is a key of Department will ensure that no two departments will have the same Deptno.
When a set of attributes forms a key, that key is called a composite key, as we are dealing with a
composite attribute. When a nonnull constraint is added to a key uniqueness constraint, such keys
are referred to as primary keys. An entity may have several candidate keys and in such cases
one is called the primary key and the others alternate keys.
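
        For example, in SQL such key constraints could be sketched as follows (the DeptName attribute
is an assumption, included only to show an alternate key):

            CREATE TABLE Department (
                DeptCode CHAR(5)  PRIMARY KEY,    -- primary key: unique and non-null
                DeptName CHAR(30) NOT NULL,       -- assumed attribute
                UNIQUE (DeptName)                 -- alternate (candidate) key
            );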

       Primary key attributes are shown in the EER model (see appendix B.2, figure ‘b’). The OMT
model uses object identities (oids) to uniquely identify objects and as they are usually system
generated they are not shown in this model. However, when modelling relational databases we do
not use the concept of oid, instead we have primary keys which are shown in our diagrams (see
appendix B.2, figure ‘b’) as they carry important information about a relational database.

       3.5.3 Structural Constraints on Attributes

        Attribute structural constraints specify whether an attribute is single valued or multivalued.
Multivalued attributes with a fixed number of possible values are sometimes defined as composite
attributes. For example, name can be a composite attribute composed of first name, middle initial
and last name. However composite attributes cannot be constructed for multivalued attributes like a
student’s course set, where the student can attend several courses. In such a case one would have to
use an alternative solution, such as recording all possibilities in one long string or using a separate
data type like sets. This type of constraint is not generally supported by most traditional DBMSs. In
the relational model we use a separate entity to hold multiple values and these are related to the
correct entity through an identical primary key [ELM94].
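
        A minimal sketch of this solution for the student’s course set (the table and column names
are assumptions) is:

            CREATE TABLE Student (
                StudentNo CHAR(7) PRIMARY KEY,
                Name      CHAR(20)
            );

            CREATE TABLE StudentCourse (            -- one row per course attended by a student
                StudentNo CHAR(7),
                CourseNo  CHAR(7),
                PRIMARY KEY (StudentNo, CourseNo),  -- the student's key forms part of this key
                FOREIGN KEY (StudentNo) REFERENCES Student (StudentNo)
            );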

       3.5.4 Structural Constraints on Relationships

        Structural constraints on relationships specify limitations on the participation of entities in
relationship instances. Two types of relationship constraints occur frequently. They are called



cardinality ratio constraints and participation constraints. The cardinality ratio constraint specifies
the number of relationship instances that an entity can participate in using 1 and N (many). For
example, the constraints every employee works for exactly one department and a department can
have many employees working for it have a cardinality ratio of 1:N, meaning that each department
entity can be related to numerous employee entities. A participation constraint specifies whether
the existence of an entity depends on its being related to another entity via a certain relationship
type. If all the instances of an entity participate in a relationship of this type then the entity has total
participation (existence dependency) in the relationship. Otherwise the participation is partial,
meaning only some of the instances of that entity participate in a relationship of this type. For
example, the constraint every employee works for exactly one department means that an Employee
entity has a total participation in the relationship WorksFor (see figure 3.2), and the constraint an
employee can be the head of a department, means that the Employee entity has a partial participation
in the relationship Head (see figure 3.2) (i.e. not all employees are head of a department).
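
        In a relational schema, the total participation of Employee in WorksFor is commonly
approximated by declaring the corresponding foreign key column (introduced below) as NOT NULL, while
partial participation simply allows nulls. A minimal sketch, assuming the Department table of
figure 3.4 already exists:

            CREATE TABLE Employee (
                EmpNo    CHAR(5) PRIMARY KEY,
                WorksFor CHAR(5) NOT NULL                  -- total participation: every employee
                         REFERENCES Department (DeptCode)  -- works for exactly one department
            );

        Note that the lower bound of the (4,N) cardinality (at least 4 employees per department)
cannot be captured in this way and would need an explicit assertion or trigger of the kind discussed
in section 3.6.2.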

        Referential integrity constraints are used to specify a type of structural relationship
constraint. In relational databases, foreign keys are used to define referential integrity constraints. A
foreign key is defined on attributes of a relation. This relation is known as the referencing table. The
foreign key attribute of the referencing table (e.g. WorksFor of Employee in figure 3.4) will always
refer to an attribute(s) of another relation, where the attribute(s) are the primary or alternate key (e.g.
DeptCode of Department in figure 3.4). We refer to this relation as the referenced table. The referenced
attribute(s) of the referenced table have a uniqueness property, being the primary key or an
alternate key of that relation. This means that references from one relation to another are achieved by
using foreign keys, which indicate a relationship between two entities. Also this establishes an
inclusion dependency between the two entities. Here the values of the attribute of the referencing
entity (e.g. Employee.WorksFor) are a subset of the values of the attribute of the referenced entity
(e.g. Department.DeptCode). Only recent DBMSs such as Oracle version 7 support the specification of
foreign keys using DDL statements.

                            Employee                                      Department
                            ...Attributes...   WorksFor                   DeptCode   ...Attributes...
                                               COMMA                      COMMA
                                               MATHS                      ELSYM
                                               COMMA                      MATHS
                                               COMMA
                                               MATHS
                            (5 records)                                   (3 records)

                            WorksFor is a foreign key attribute of the referencing entity Employee. This attribute
                            refers to the referenced entity Department, whose attribute DeptCode is its primary key.


                                                 Figure 3.4: A foreign key example


       3.5.5 Structural Constraints on Specialisation/Generalisation

        Disjointness and completeness constraints are defined on specialisation/generalisation
structures. The disjointness constraint specifies that the subclasses of the
specialisation (generalisation) must be disjoint. This means that an entity can be a member of at most
one of the subclasses of the specialisation. When an entity is a member of more than one of the
subclasses, then we get an overlapping situation. The completeness constraint on specialisation
(generalisation) defines total/partial specialisation (generalisation). A total specialisation specifies


the constraint that every entity in the superclass must be a member of some subclass in the
specialisation. For example: Lecturer, Secretary and Technician are some of the job types of an
Employee. They describe disjoint subclasses of the entity Employee, having a partial specialisation
as there could be an employee with another job type.

        Generalisation is a feature supported by many object-oriented (O-O) systems, but it has yet to
be adopted by commercial relational DBMSs. However, with O-O features being incorporated into
the relational model (e.g. SQL-3) we can expect to see this feature in many RDBMSs in the near
future.
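
        In the meantime, a sub/supertype hierarchy is commonly emulated in a relational schema by
giving each subtype table the same primary key as its supertype and declaring it as a foreign key.
A minimal sketch (the Qualification attribute is an assumption):

            CREATE TABLE Employee (
                EmpNo CHAR(5)  PRIMARY KEY,
                Name  CHAR(20) NOT NULL
            );

            CREATE TABLE Lecturer (                        -- subtype of Employee
                EmpNo         CHAR(5) PRIMARY KEY,         -- same key as the supertype
                Qualification CHAR(30),                    -- subtype-specific attribute (assumed)
                FOREIGN KEY (EmpNo) REFERENCES Employee (EmpNo)
            );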

       3.5.6 General Semantic Constraints

        There are general semantic integrity constraints that do not fall into one of the above
categories (e.g. the constraint the salary of an employee must not be greater than the salary of the
head of the department that the employee works for, or the salary attribute of an employee can only
be increased). These constraints can be either state or transition constraints, and are generally
specified as explicit constraints.

        The transition constraint mentioned above is a single-step transition constraint. Here, a
constraint is evaluated on a pair of pre-transaction and post-transaction states of a database, e.g. in
the employee’s salary example, the current salary will be the pre-transaction state and the new salary
will become the post-transaction state. However, there are transition constraints that are not limited
to a single-step, e.g. temporal constraints specified using the temporal qualifiers always and
sometimes [CHO92]. Other forms of constraint exist, including those defined for incomplete data
(e.g. employees having similar jobs and experience must have almost equal salary) [RAJ88]. These
can also be considered as a type of semantic constraint, mainly as they are not implicitly supported
by the most frequently used (i.e. relational) data model. They may need a special constraint
specification language to support them.

3.6 Specifying Explicit Constraints

       Explicit constraints are generally defined using either a procedural or a declarative approach.

       3.6.1 Procedural Approach

        In the procedural approach (or the coded constraints technique), constraint checking
statements are coded into appropriate update transactions of the application by the programmer. For
example, to enforce the constraint the salary of an employee must not be greater than the salary of
the head of the department that the employee works for, one has to check every update transaction
that may violate this constraint. This includes any transaction that modifies the salary of an
employee, any transaction that links an employee to a department, and any transaction that assigns a
new employee or manager to a department. Thus in all such transactions appropriate code has to be
included that will check for possible violations of this constraint. When a violation occurs the
transaction has to be aborted, and this is also done by including appropriate code in the application.

       The above technique for handling explicit constraints is used by many existing DBMSs. This



technique is general because the checks are typically programmed in a general-purpose
programming language. It also allows the programmer to code in an effective way. However, it is
not very flexible and places an extra burden on the programmer, who must know where the
constraints may be violated and include checks at each and every place that a violation may occur.
Hence, a misunderstanding, omission or error by the programmer may allow the database to get
into an inconsistent state.

        Another drawback of specifying constraints procedurally arises because constraints can change
with time as the rules of the real world situation change. If a constraint changes, it is the responsibility of the DBA
to instruct appropriate programmers to recode all the transactions that are affected by the change.
This again opens the possibility of overlooking some transactions, and hence the chance that errors
in constraint representation will render the database inconsistent.

        Another source of error is that it is possible to include conflicting constraints in procedural
specifications that will cause incorrect abortion of correct transactions. This error may occur in other
types of constraint specification, e.g. whenever we attempt to define multiple constraints on the same
entity. However, such errors can be detected more easily in a declarative approach, as it is possible
to evaluate constraints defined in that form to identify conflicts between them.

        The procedural approach is usually adopted only when the DBMS cannot declaratively
support the same constraint. In all early DBMSs the procedural code was part of the application code
and was not retained in the database’s system catalog. However, some recent DBMSs (e.g. INGRES)
provide a rule subsystem where all defined procedures are stored in system catalogs and are fired
using rules which detect update transactions associated with a particular constraint. This approach is
a step towards the declarative approach as it overcomes some of the deficiencies of the procedural
approach described above (e.g. the maintenance of constraints), although the code is still
procedural, which, for example, prevents the detection of conflicting constraints.

        Some DBMSs (e.g. INGRES) do not support the specification of foreign key constraints
through their DDL. Hence, for these systems such constraints have to be explicitly defined using a
procedural approach. In section ‘a’ of table 3.2, we show how a procedure is used in INGRES to
implement a foreign key constraint. Here the existence of a value in the referenced table is checked
and the statement is rejected if it does not exist. For comparison purposes, we include the definition
of the same constraint using an SQL-3 DDL specification (implicit) in section ‘b’ of table 3.2, and
the declarative approach (explicit) in section ‘c’ of table 3.2. When comparing these three
approaches, it is clear that the procedural one is the most unfriendly and the most error-prone. The DDL
approach is the best and most efficient, as the constraint is specified and managed implicitly by the
DBMS.




                      Approach                                      SQL Statements

                (a)   Procedural       CREATE PROCEDURE Employee_FK_Dept (WorksFor CHAR(5)) AS
                      Approach         DECLARE
                      (Explicit)            msg = VARCHAR(70) NOT NULL;
                                            check_value = INTEGER;
                                       BEGIN
                                            IF WorksFor IS NOT NULL THEN
                                               SELECT COUNT(*) INTO :check_value FROM Department
                                                      WHERE DeptCode = :WorksFor;
                                               IF check_value = 0 THEN
                                                    msg = 'Error 1: Invalid Department Code: ' + :WorksFor;
                                                   RAISE ERROR 1 :msg;
                                                   RETURN
                                               ENDIF
                                            ENDIF
                                       END;

                (b)   DDL              CONSTRAINT Employee_FK_Dept
                      Approach         FOREIGN KEY (WorksFor) REFERENCES Department (DeptCode);
                      (Implicit)       Note: This constraint is defined in the Employee table.

                 (c)   Declarative      CREATE ASSERTION Employee_FK_Dept
                       Approach         CHECK (NOT EXISTS (SELECT * FROM Employee
                       (Explicit)            WHERE WorksFor NOT IN (SELECT DeptCode FROM Department)));

                      Note: The schema on which these constraints are defined is in figure 6.2.
                                          Table 3.2: Different Approaches to defining a Constraint


       3.6.2 Declarative Approach

        A more formal technique for representing explicit constraints is to use a constraint
specification language, usually based on some variation of relational calculus. This is used to specify
or declare all the explicit constraints. In this declarative approach there is a clean separation between
the constraint base in which the constraints are stored, in a suitable encoded form, and the integrity
control subsystem of the DBMS, which accesses the constraints to apply them to transactions.

        When using this technique, constraints are often called integrity assertions, or simply
assertions, and the specification language is called an assertion specification language. The term
assertion is used in place of explicit constraints, and the assertions are specified as declarative
statements. These constraints are not attached to a specific table as in the case of the implicit
constraint types introduced in section 3.5. This approach is supported by SQL-3. For example, the
constraint a department head’s salary must be greater than that of the department’s employees can be
specified as:

          CREATE ASSERTION Salary_Constraint
             CHECK (NOT EXISTS (SELECT * FROM Employee E, Department D, Employee H
                    WHERE E.WorksFor = D.DeptCode AND D.Head = H.EmpNo
                      AND E.EmpNo <> H.EmpNo AND E.Salary >= H.Salary));

        Assertions can also be used to define implicit constraints, like examination mark is between 0
and 100, or referential integrity constraints, as shown in table 3.2 part ‘c’. However, it is easier and
more efficient (i.e. consumes less computer resources) to monitor and enforce implicit constraints
using the DDL approach as such constraints are attached to an entity and used only when an update
transaction is applied to that entity, as opposed to checking whenever an update transaction is
performed on the database in general.



        In many cases it is convenient to specify the type of action to be taken when a constraint is
violated or satisfied, rather than just having the options of aborting or silently performing the
transaction. In such cases, it is useful to include the option of informing a responsible person
regarding the need to take action or notifying them of the occurrence of that transaction (e.g. in
referential constraints, it is sometimes necessary to perform an action like update or delete on a table
to amend its contents, instead of aborting the transaction). This facility is supported either by an
optional trigger option on an existing DDL statement or by defining triggers [DAT93]. Triggers can
combine the declarative and procedural approaches, as the action part can be procedural, while the
condition part is always declarative (INGRES rules are a form of trigger). A trigger can be used to
activate a chain of associated updates, that will ensure database integrity (e.g. update total quantity
when new stock arrives or when stock is dispatched). An alerter, which is a variant of the trigger
mechanism, is used to notify users of important events in the database. For example, we could send a
message to the head calling his attention to a purchase transaction for £1,000 or more made from
department funds.
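
        For instance, the stock example above might be sketched as a trigger in the style of the
draft SQL-3 proposal as follows (the StockDelivery and Stock tables and their columns are
assumptions made only for illustration):

            CREATE TRIGGER UpdateStockTotal
            AFTER INSERT ON StockDelivery
            REFERENCING NEW ROW AS NewRow
            FOR EACH ROW
                UPDATE Stock                                             -- maintain the derived total
                SET   TotalQuantity = TotalQuantity + NewRow.Quantity
                WHERE Stock.ItemNo  = NewRow.ItemNo;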

        In this section we have introduced concepts from INGRES which also appear in other
DBMSs, namely triggers and alerters. These constructs provide further information about database
contents, but are beyond the scope of this project. They are related to constraints, so could be utilised
in a similar fashion.

3.7 SQL Approach to Implementing Constraints

         In SQL-3, a constraint is either a domain constraint, a table constraint or a general constraint
[ISO94]. It is described by a constraint descriptor, which is either a domain constraint descriptor (cf.
sections 3.7.1 and A.11), a table descriptor (cf. sections 3.7.2 and A.4) or a general descriptor (cf.
sections 3.7.3 and A.12). Every constraint descriptor includes: the name of the constraint, an
indication of whether or not the constraint is deferrable, and an indication of whether or not the
initial constraint mode is deferred or immediate (see section A.3). Constraint names are optional,
in that system generated names can be assigned (except for the general constraint case,
where a name must be given). A constraint has an action which is either deferrable or non-
deferrable. This can be set using the constraint mode option (see section A.13). Usually, most
constraints are immediate as the default constraint mode is immediate, and in these cases they are
checked at the end of an SQL transaction. To deal with deferred constraints, all constraints are
effectively checked at the end of an SQL session or when an SQL commit statement is executed.
Whenever a constraint is detected as being violated, an exception condition is raised and the
transaction concerned is terminated by an implicit SQL rollback statement to ensure the consistency
of the database system.
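
        As an illustration of constraint modes, the foreign key of table 3.2 could be declared
deferrable and then deferred within a transaction, as sketched below (syntax as in SQL-2; the exact
form may vary between products):

            ALTER TABLE Employee
                ADD CONSTRAINT Employee_FK_Dept
                    FOREIGN KEY (WorksFor) REFERENCES Department (DeptCode)
                    DEFERRABLE INITIALLY IMMEDIATE;

            SET CONSTRAINTS Employee_FK_Dept DEFERRED;
            -- ... updates that may temporarily violate the constraint ...
            COMMIT;                        -- the deferred constraint is checked here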

       3.7.1 SQL Domain Constraints

        In SQL, domain constraints are specified by means of the CREATE DOMAIN statement (see
section A.11) and can be added to or dropped from an existing domain by means of the ALTER
DOMAIN statement [DAT93]. These constraints are associated with a specific domain and apply to
every column that is defined using that domain. They allow users to define new data types, which in
turn are used to define the structure of a table. For example, a domain Marks may be defined as
shown in figure 3.5. This means SQL will recognise the data type Marks which permits integers



between 0 and 100, thus giving a natural look to that data type when it is used.

                                        CREATE DOMAIN Marks INTEGER
                                         CONSTRAINT icmarks
                                         CHECK (VALUE BETWEEN 0 AND 100);

                                       Figure 3.5: An SQL domain constraint


       3.7.2 SQL Table Constraints

       In SQL, table constraints (i.e. constraints on base tables) are initially specified by means of
the CREATE TABLE statement (see section A.4) and can be added to or dropped from an existing
base table by means of the ALTER TABLE statement [DAT93]. These constraints are associated with
a specific table, as they cannot exist without a base table. However, this does not mean that such
constraints cannot span multiple tables as in the case of foreign keys. Constraints defined on specific
columns of a base table are a type of table constraint, but are usually referred to as column
constraints.

        Three types of table constraints are defined here, namely: candidate key constraints, foreign
key constraints and check constraints. Their definitions may appear next to their respective column
definitions or at the end (i.e. after all column definitions have been defined). We now describe an
example that uses all three types of constraints, using figure 3.6. The PRIMARY KEY (only one per
table) (see section A.6) and UNIQUE (the value in a row position is unique) (see section A.5) are
used to define candidate keys. A FOREIGN KEY definition (see section A.8) defines a referential
integrity constraint and may also include an action part (which is not used in figure 3.6 for
simplicity). Check constraints are defined using a CHECK clause (see section A.9) and may contain
any logical expression. The check constraint CHECK(Name IS NOT NULL) is usually defined using a
shorthand form NOT NULL next to the column Name, as shown in figure 3.6. We have also included
a check constraint spanning multiple tables in figure 3.6. Such table constraints can be included only
after the tables have been created, and hence in practice they are added using ALTER TABLE
statements.

                       CREATE TABLE Employee(
                         EmpNo     CHAR(5)         PRIMARY KEY,
                         Name      CHAR(20)        NOT NULL,
                         Address   CHAR(80),
                         Age       INTEGER         CHECK (Age BETWEEN 21 AND 65),
                         WorksFor  CHAR(5),
                         Salary    DECIMAL,
                              FOREIGN KEY (WorksFor) REFERENCES Department (DeptCode),
                              CHECK (Salary <= (SELECT H.Salary
                                   FROM Department D, Employee H
                                   WHERE D.Head = H.EmpNo AND D.DeptCode = Employee.WorksFor)),
                              UNIQUE (Name, Address) );

                                          Figure 3.6: SQL table constraints
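
        As noted above, the multi-table check constraint of figure 3.6 would in practice be attached
once both tables exist; a sketch of the ALTER TABLE form is given below (the constraint name is an
assumption, and subqueries in check constraints are only available at the Full SQL-2 level):

            ALTER TABLE Employee
                ADD CONSTRAINT Emp_Salary_Check
                    CHECK (Salary <= (SELECT H.Salary
                                      FROM Department D, Employee H
                                      WHERE D.Head = H.EmpNo
                                        AND D.DeptCode = Employee.WorksFor));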


       3.7.3 SQL General Constraints

        In SQL, general constraints are specified by means of the CREATE ASSERTION statement
(see section A.12) and are dropped by means of the DROP ASSERTION statement. These constraints
must be associated with a user defined constraint name as they are not attached to a specific table


and are used to constrain arbitrary combinations of columns in arbitrary combinations of base tables
in a database. The constraint defined in section ‘c’ of table 3.2 belongs to this type.

       Domain and table constraints are implicit constraints of a database, while assertions used to
define general constraints are explicit constraints (using a declarative approach). SQL data types
have their own constraint checking, which rejects for example string values being entered into a
numeric column definition. This type of constraint checking can be considered as inherent as it is
supported by the SQL language itself.

        All integrity constraints discussed above are deterministic and independent of the application
and system environments. Hence, no parameters, host variables or built-in system functions (e.g.
CURRENT_DATE) are allowed in these definitions, as they could make the database inconsistent. For
example CURRENT_DATE will give different values on different days. This means the validity of a
data entry may not hold during its lifetime despite no changes being made to its original entry. This
makes the task of maintaining the consistency of the database more difficult. Also it makes it
difficult to distinguish these errors from the traditional errors discussed in section 3.5. Hence,
attributes such as age, which involve the use of CURRENT_DATE, should be derived attributes whose
values are computed during retrieval.
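
        For example, instead of storing Age it could be computed at retrieval time from a stored
date of birth, e.g. through a view such as the following sketch (the BirthDate column and the view
name are assumptions, and the simple year subtraction ignores whether the birthday has yet occurred
in the current year):

            CREATE VIEW EmployeeAge (EmpNo, Age) AS
                SELECT EmpNo,
                       EXTRACT(YEAR FROM CURRENT_DATE) - EXTRACT(YEAR FROM BirthDate)
                FROM   Employee;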

3.8 SQL Standards

         To conclude this chapter, we compare different SQL standards to chronicle when respective
constraint specification statements were introduced to the language. A standard for the database
language SQL was first introduced in 1986 [ANSI86], and this is now called SQL/86. The SQL/86
standard specified two levels, namely: level 1 and level 2 (referred to as level 1 and 2 respectively in
table 3.3, column 2); where level 2 defined the complete SQL database language, while level 1 was a
subset of level 2. In 1989, the original standard was extended to include the integrity enhancement
feature [ANSI89a]. This standard is called SQL/89 and is referred to as level Int. in table 3.3,
column 2. The current standard, SQL/92 [ANSI92], is also referred to as SQL-2. This standard
defines three levels, namely: Entry, Intermediate and Full SQL (referred to as level E, I and F,
respectively, in table 3.3, column 4); where Full SQL is the complete SQL database language,
Intermediate SQL is a proper subset of Full SQL, and Entry SQL is a proper subset of Intermediate
SQL. The purpose of introducing 3 levels was to enable database vendors who had incorporated the
SQL/89 extensions into their original SQL/86 implementations to claim SQL/92 Entry level status.
As Intermediate extensions were more straightforward additions than the rest, they were separated
from the Full SQL/92 extensions. However, SQL/92 is also constantly being reviewed [ISO94],
mainly to incorporate O-O features into the language, and this latest release is called SQL-3 (referred
to as level O-O in table 3.3, column 5). Until recently, relational DBMSs supported only the SQL/86
standard and even now most support only up to the Entry level of SQL/92. Hence ISs developed
using these relational DBMSs are not capable of supporting modern features introduced in the
newest standards. Our work focuses on providing these newer features for existing relational legacy
database systems so that features such as primary / foreign key specification can be supported for
relational databases conforming to SQL/86 standards; and sub / super type features can be specified
for all relational products.




                           SQL Version         SQL/86    SQL/89         SQL/92    SQL-3
                           Level               1    2      Int.     E     I   F    O-O
                           Data Type           x     +      =       =     +   +      +
                           Identifier length   x     +      =       =     +   =      =
                           Not Null            x     +      =       =     =   =      =
                           Unique Key          -     x      =       =     +   =      =
                           Primary Key         -     -      x       =     +   =      =
                           Foreign Key         -     -      x       =     =   +      +
                           Check Constraint    -     -      x       =     =   +      =
                           Default Value       -     -      x       =     =   =      =
                           Domain              -     -      -       -     x   +      =
                           Assertion           -     -      -       -     -   x      +
                           Trigger             -     -      -       -     -   -      x
                           Sub/SuperType       -     -      -       -     -   -      x
                               Table 3.3: SQL Standards and introduction of constraints


         The integrity features discussed in previous sections were thus gradually introduced into the
SQL language standards as we can see from table 3.3. In this table ‘x’ indicates when the feature was
first introduced. The ‘+’ sign indicates that some enhancements were made to a previous version, the
‘=’ sign indicates that the same feature was used in a later version, and the ‘-’ sign indicates that the
feature was not provided in that version. For example, the Primary Key constraint for a table was
first introduced in SQL/89 (cf. appendix A.6) and later enhanced (i.e. in SQL/92 Intermediate) by
merging it with the Unique constraint (cf. appendix A.5) to introduce a candidate key constraint (cf.
appendix A.7). Thus, SQL standards for Primary Key are shown in table 3.3 as: ‘-’ for SQL/86; ‘x’
for SQL/89; ‘=’ for SQL/92 Entry level; ‘+’ for SQL/92 Intermediate level; and ‘=’ for all
subsequent versions.

        Simple attributes are defined using their data type and length (cf. section 3.5.1). These
specifications are used as inherent domain constraints. The first two rows of table 3.3 show that they
were among the first constraint features introduced via SQL standards (i.e. SQL/86). The Not Null
constraint, which is a special domain constraint, was also introduced in the initial SQL standard. The
key constraints (cf. section 3.5.2), which specify unique and primary keys, were introduced in a
subsequent standard (i.e. SQL/89) as shown in table 3.3. The referential constraint, which specifies a
type of structural relationship constraint using foreign keys, was also introduced
in SQL/89, along with default values for attributes and check constraints. Later, more complex forms
of constraints were introduced in SQL/92. These include defining new domains for an attribute (e.g.
child as a domain having an inherent constraint of age being less than 18 years), and specifying
domain constraints spanning multiple tables (i.e. assertions). Constraints which are activated when
an event occurs (i.e. triggers), and structural constraints on specialisation / generalisation (i.e.
sub/super type, cf. section 3.5.5) are among other enhancements proposed in the draft SQL-3
standards. A detailed description of the syntax of statements used to provide the features identified in
table 3.3 is given in appendix A. For further details we refer the reader to the standards themselves
[ANSI86, ANSI89a, ANSI92, ISO94].




CHAPTER 4

                        Migration of Legacy Information Systems

This chapter addresses in detail the migration process and issues concerned with legacy information
systems (ISs). We identify the characteristics and components of a typical legacy IS and present the
expected features and functions of a migration target IS. An overview of some of the strategies and
methods used for migration of a legacy IS to a target IS is presented along with a detailed study of
migration support tools. Finally, we introduce our tool to assist the enhancement and migration of
relational legacy databases.

4.1 Introduction

        Rapid technological advances in recent years have changed the standard computer hardware
technology from centralised mainframes to network file-server and client/server architectures, and
software data management technology from files and primitive database systems to powerful
extended relational distributed DBMSs, 4GLs, CASE tools, non-procedural application generators
and end-user computing facilities. Information systems (ISs) built using old-fashioned technology
are inflexible, as that technology prevents them from being changed and evolved to meet changing
business needs, which adjust rapidly to the potential of technological advances. It also means they
are expensive to maintain, as older systems suffer from failures, inappropriate functionality, lack of
documentation, poor performance and problems in training support staff. Such older systems are
called legacy ISs [BRO93, PHI94, BEN95, BRO95], and they need to be evolved and migrated to a
modern technological environment so that their existence remains beneficial to their user community
and organisation, and their full potential to the organisation can be realised.

4.2 Legacy Information Systems

        Technological advances in hardware and software have improved the performance and
maintainability of new information systems. Equipment and techniques used by older ISs are
obsolete and prone to suffer from frequent breakdowns along with ever increasing maintenance
costs. In addition, older ISs may have other deficiencies depending on the type of system. Common
characteristics of these systems are [BRO93, PHI94, BEN95, BRO95] that they are:

        • scientifically old, as they were built using older technology,
        • written in a 3GL,
        • reliant on an old fashioned database service (e.g. a hierarchical DBMS),
        • equipped with a dated style of user interface (e.g. command driven).

         Furthermore, in very large organisations additional negative characteristics may be present
making the intended migration process even more complex and difficult. These include [BRO93,
AIK94, BEN95, BRO95]: being very large (e.g. having millions of lines of code); being mission
critical (e.g. an on-line monitoring system like customer billing); and being operational all the time
(i.e. 24 hours a day and 7 days a week). These characteristics are not present in most smaller scale
legacy information systems, and hence the latter are less complex to maintain. Our work may not
assist all types of legacy IS as it deals with one particular component of a legacy IS only (i.e. the
database service).

        Information systems consist of three major functional components, namely: interfaces,
applications and a database service. In the context of a legacy IS these components are, accordingly,
referred to as [BRO93, BRO95] the:

       • legacy interface,
       • legacy application,
       • legacy database service.

        These functional components are sometimes inter-related depending on how they were
designed and implemented in the IS’s development. They may exist independently of each other,
having no functional dependencies (i.e. all three components are decomposable - see section ‘a’ of
figure 4.1); they may be semi-decomposable (e.g. the interface may be separate from the rest of the
system - see section ‘b’ of figure 4.1); or they may be totally non-decomposable (i.e. the functional
components are not discrete but are intertwined and used as a single unit - section ‘c’ of figure 4.1).
This variety makes the legacy IS environment complex to deal with. Due to the potential complexity
of entire legacy ISs, we have focussed on one particular functional component only, namely: the
legacy database service.

        In order to restrict our attention to the database service component, we have to treat the other
components, namely the interface and application, as black boxes. This can be done successfully
when a legacy IS has decomposable modules as in section ‘a’ of figure 4.1. However, when the
legacy database service is not fully decomposable from both the legacy interface and the legacy
application, treating them as black boxes may result in loss of information since relevant database
code may also appear in other components. In such cases, attempts must be made by the designer to
decompose or restructure the legacy code. The designer needs to investigate the legacy code of the
interface and application modules to detect any database service code, then move it to the database
service module (e.g. legacy code used to specify and enforce integrity constraints in the interface or
application components is identified and transferred to this module). This will assist in the
conversion of legacy ISs of the types shown in sections ‘b’ and ‘c’ of figure 4.1 to conform to the
structure of the IS type in section ‘a’ of figure 4.1. The identification and transfer of any legacy
database code left in the legacy application or interface modules can be done at any stage (e.g. even
after the initial migration) as the migration process can be repeated iteratively. Also, the existence of
legacy database code in the application does not affect the operation of the IS as we are not going to
change any existing functionalities during the migration process. Hence, treating a legacy interface
or a legacy application having legacy database code as a black box does not harm migration.






                                Figure 4.1 : Functional Components of a Legacy IS
                                (a) decomposable: interface (I), application (A) and database service (D) are separate modules;
                                (b) semi-decomposable: the interface is separate from a combined application / database service (A/D);
                                (c) non-decomposable: interface, application and database service form a single unit (I/A/D).


       4.2.1 Legacy Interfaces

        Early information systems were developed for system users who were computer literate.
These systems did not have system or user level interfaces because they were not regarded as
essential since the system users did these tasks themselves. However, when the business community
and others wanted to use these systems, the need for user interfaces was identified and they have
been incorporated into more recent ISs.

        The introduction of DBMSs in the late 1960’s facilitated easy access to computer held data.
However, in the early DBMSs, the end user had no direct access to their database and their
interactions with the database were done through verbal communication with a skilled database
programmer [ELM94]. All user requests were made via the programmer, who then coded the task as
a batch program using a procedural language. Since the introduction of query languages such as SQL
[CHA74, CHA76], QBE [ZLO77] and QUEL [HEL75], provision of interfaces for database access
has rapidly improved. These interfaces are provided not only to encourage laymen to use the system
but also to hide technical details from users. A presentation layer consisting of forms [ZLO77] was
the initial approach to supporting interaction with the end user. Modern user interfaces rely on
multiple screen windows and iconic (pictorial) representations of entities manipulated by pull-down
or pop-up menus and pointing devices such as cursor mice [SHN86, NYE93, HEL94, QUE93]. The
current trend is towards using separate interfaces for all decomposable application modules of an IS.
Some Graphical User Interface (GUI) development tools (e.g. Vision for graphics and user interfaces
[MEY94]) allow the construction of advanced GUIs supporting portability to various toolkits. This is
an important step towards building the next generation of ISs. Changes in the user interface and
operating environment result in the need for user training, an additional factor in system evolution
costs.

        As defined by Brodie [BRO93, BRO95], we shall use the term legacy interfaces in the
context of all ISs whose applications have no interfaces or use old fashioned user / system interfaces.
In figures 4.1a and 4.1b, these interfaces are distinct and separable from the rest of the system as
they are decomposable modules. However, interfaces can be non-decomposable as shown in figure
4.1c. Migration issues concerning user interfaces have been addressed in [BRO93, BRO95,
MER95], and as mentioned in section 4.2, our work does not address problems associated with such
interface migration.

       4.2.2 Legacy Applications




       Originally, ISs were written using 3GLs, usually the COBOL or PL/1 programming
languages. These languages had many software engineering deficiencies due to the state of the
technology at the time of their development. Techniques such as structured programming,
modularity, flexibility, reusability, portability, extensibility and tailorability [YOU79, SOM89,
BOO94, ELM94] were not readily available until subsequent advances in software engineering
occurred. The lack of these has made 3GL based ISs appear to be inflexible and, hence, difficult and
expensive to maintain and evolve to meet changing business needs.

        The unstructured and non-modular nature of 3GLs resulted in the use of non-decomposable
application modules13 in the development of early ISs. However, with the introduction of software
engineering techniques such as structured modular programming these 3GLs were enhanced, and
new languages, such as Pascal [WIR71], Simula [BIR73], and more recently C++ [STR91] and
Eiffel [MEY92], gradually emerged to support these changing software engineering requirements.

        The emergence of query languages in the 1970’s as standard interfaces to databases saw the
development and use of embedded query languages in host programming languages for large
software application program development. Embedded QUEL for INGRES [RTI90a] and Embedded
SQL for many relational DBMSs [ANSI89b] are examples of this genre. The emergence of 4GLs is
an evolution which allows users to give a high-level specification of an application expressed
entirely in 4GL code. A tool then automatically generates the corresponding application code from
the 4GL code. For example, in the INGRES Application-by-Forms interface [RTI90b], the
application designer develops a system by using forms displayed on the screen, instead of writing a
program. Similar software products are offered by other vendors, such as Oracle [KRO93].

        Information systems developed recently have partially or totally adopted modern software
engineering practices. As a result, decomposable modules exist in some recent ISs, i.e. their
architecture is as in section ‘a’ of figure 4.1. Applications which do not use the concept of
modularity are non-decomposable (e.g. section ‘c’ of figure 4.1), while those partially using it are
semi-decomposable (section ‘b’ of figure 4.1). Semi- and non- decomposable applications are
referred to as legacy applications and need to be converted into fully-decomposable systems to
simplify maintenance and make it easier for them to evolve and support future business needs.

        Some aspects of legacy application migration need tools to analyse code. These are discussed
in [BIG94, NIN94, BRA95, WON95]. They are beyond the scope of this thesis, except insofar as we
are interested in any legacy code that is relevant to the provisions of a modern database service (e.g.
integrity constraints).

        4.2.3 Legacy Database Service

       Originally, many ISs were developed on centralised mainframes using COBOL and PL/1
based file systems [FRY76, HAN85]. These ISs had no DBMS and their data was managed by the
system using separate files and programs for each file handling task [HAN85]. Later ISs were based

   13
      often containing calculated or parameter-driven GOTO statements preventing a reasonable
analysis of their structure.



on using database technology with limited DBMS capabilities. These systems typically used
hierarchical or network DBMSs for their data management [ELM94, DAT95], such as IMS
[MCG77] and IDMS [DAT86, ELM94].

        The introduction and rapid acceptance of the relational model for DBMSs in recent years has
resulted in most applications now being developed with original relational DBMSs (e.g. System R
[AST76], DB2 [HAD84], SQL/DS [DAT88], INGRES [STO76, DAT87]). The steady evolution of
the relational data model has resulted in the emergence of extended relational DBMSs (e.g.
POSTGRES [STO91]) and newer versions of existing products (e.g. Oracle [ROL92], INGRES
[DAT87] and SYBASE [DAT90]) which have been used for most recent database applications. This
relational data model has been widely accepted as the dominant current generation standard for
supporting ISs. The rapidly expanding horizon of applications means that DBMSs are now expected
to cater for diverse data handling needs such as management of image, spatial, statistical or temporal
databases [ELM94, DAT95], and it is in supporting these that they are often weak. This highlights
the differing ranges of functionality supported by various DBMSs. Applications built on older
database services therefore have to provide modern database functionality within their own
application modules. This is a typical characteristic of a legacy IS. Hence, the structure of such ISs is
more complex and poorly understood, as it is not adequately engineered in accordance with current
technology.

        The database services offered by most hierarchical, network and original relational DBMSs
are now considered to be primitive, as they fail to support many functions (e.g. backup, recovery,
transaction support, increased data independence, security, performance improvements and views
[DAT77, DAT81, DAT86, DAT90, ELM94]) found in modern DBMSs. These functions facilitate
the system maintenance of databases developed using modern DBMSs. Hence, the database services
provided by early DBMSs and file based systems are now referred to as legacy database services,
since they do not fulfil many current requirements and expectations of such services.

        The specifications of a database service are described by a database schema which is held in
the database using data dictionaries. Analysis of the contents of these data dictionaries will provide
information that is useful in constructing a conceptual model for a legacy system. Our approach
focuses on using the data dictionaries of a relational legacy database to extract as much information
as possible about the specifications of that database.

       4.2.4 Other Characteristics

        The programming constructs of 3GL programs are less powerful than the data manipulation
features offered by 4GLs. As 4GL code uses the higher level DML code of modern DBMSs, it uses
less code (e.g. about 20% less) than its predecessors to accomplish even more powerful
applications. A typical program of a 3GL based information system is large and may consist of over
a hundred thousand lines of 3GL code. This means that a 20% reduction is a considerable saving in
quantity of code to be maintained [BRO93, BRO95]. Code translation tools [SHA93, SHE94] are
being built to automate as far as possible the conversion between 3GL and 4GL. These translations
sometimes optimise the translated code. Some of these techniques were mentioned in section 1.1.1.

       The long lifetime of some ISs also leads to deficiencies in documentation. This may be due
to non-existent, out of date or lost documentary materials. The extent of this deficiency was only



realised recently when people tried to transform ISs. To address this problem in the future, CASE
tools are being developed to automatically produce suitable documentation for current ISs developed
using them [COMP90]. However, this solution does not apply to legacy ISs, as they were not built
using such tools and the tools cannot be applied to them retrospectively. Thus we must solve this
problem in another way.

        Sometimes, certain critical functions of an IS are written for high performance, often using a
specific, machine dependent set of instructions on some obsolete computer. This results in the use of
mysterious and complex machine code constructs which need to be deciphered to enable the code to
be ported to other computer systems. Such code is usually not convertible using generalised
translation tools. In general, the performance of legacy ISs is poor as most of their functions are not
optimised. This is inevitable, due to the state of the technology at the time of their original
development. Thus problems arise when we try to translate 3GL code into 4GL equivalent code in a
straightforward manner.

       Solving the problems identified above is the overall concern when assisting the migration
and evolution of legacy ISs. However, our aim is to address only some of the problems concerning
legacy ISs, as the complete task is beyond the scope of our project.

       4.2.5 Legacy Information System Architecture

        Having considered the characteristics of the components of legacy ISs, we can conclude that
a typical IS consists of many application modules, which may or may not use an interface for user /
system interactions, and may use a legacy database service to manage legacy data. This database
service may use a DBMS to manage its database.

        Hence, in general, the architecture of most legacy ISs is not strictly decomposable, semi-
decomposable or non-decomposable, as they may have evolved several times during their lifetime.
As a result, parts of the system may belong to any of the three categories shown in figure 4.2. This
means that the general architecture of a legacy IS is a hybrid one, as defined in [BRO93, BRO95,
KAR95]. Figure 4.2 suggests that some interfaces and application modules are inseparable from the
legacy database service while others are modular and independent of each other. This legacy IS
architecture highlights the database service component, as our interactions are with this layer to
extract the legacy database schema and any other database services required.

4.3 Target Information System

        A legacy IS can be migrated to a target IS with an associated computing environment. This
target environment is intended to take maximum advantage of the benefits of rightsized computers,
client/server network architecture, and modern software including relational DBMSs, 4GLs and
CASE tools. In this section we present the hardware and software environments needed for the target
ISs.

       4.3.1 Hardware Environment

       The target environment must be equipped with modern technology supporting current



business needs and be flexible enough to evolve to fulfil future requirements. The
fundamental goal of a legacy IS migration is that the target IS must not itself become a legacy IS in
the near future. Thus, the target hardware environment needs to be flexibly networked (e.g. client-
server architecture) to support a distributed multi-user community. This type of environment
includes a desk top computer for each target IS user with an appropriate server machine(s)
controlling and resourcing the network provision. A PC (e.g. IBM PC or compatible) or a
workstation (e.g. Sun SPARC) may be used as the desk top computer (i.e. client / local machine),
each being connected using a local area network (LAN) (e.g. Ethernet) to the server(s).



                                Figure 4.2 : Legacy IS Architecture
                                Non-decomposable, semi-decomposable and decomposable interface / application modules
                                all rely on the legacy database service, which manages the legacy databases and data.
                                (I - Interface module; A - Application module with legacy database services;
                                 M - Decomposed application module)


       4.3.2 Software Environment

       The target database software is typically based on a relational DBMS with a 4GL, SQL,
report writers and CASE tools (e.g. Oracle v7 with Oracle CASE). Use of such software provides
many benefits to its users, such as an increase in program / data independence, introduction of
modularised software components, graphical user interfaces, reduction in code, ease of maintenance,
support for future evolution and integration of heterogeneous systems.

        The target database can be centralised on a single server machine or distributed over multiple
servers in a networked environment. The target system may consist of application modules
representing the decomposed system components, each having its corresponding graphical user
interface (see figure 4.3). A typical architecture for a modern IS consists of layers for each of the
system functions (e.g. interface, application, database, network) as identified in [BRO93, BRO95].
In figure 4.3 we introduce such an architecture with special emphasis on the database service, which
will be a modern DBMS.








                                Figure 4.3 : Target IS Architecture
                                Decomposed application modules M1..Mn, each with its own graphical user interface
                                GUI1..GUIn, access the target databases through the target DBMS.
                                (GUI - graphical user interface module; M - Decomposed application module)


       The complete migration process involves significant changes, not only in hardware and
software of the applications but also in the skills required by users and management. Thus they will
have to be trained or replaced to operate the target IS. These changes must be done in some
organised manner as the complete migration process itself is complex, and may take months or even
years depending on the size and complexity of the legacy IS. The number of persons involved in the
migration process and the resources available also contribute towards determining the ultimate
duration and cost of the migration.

4.4 Migration Strategies

        The migration process for legacy ISs may take one of two main forms [BRO93, BRO95].
The first approach involves rewriting a legacy IS from scratch to produce the target IS using modern
software techniques (i.e. a complete migration). The other approach involves gradually migrating
the legacy IS in small steps until the desired long term objective is reached (i.e. incremental
migration). The approach of complete rewriting carries substantial risks and has failed many times in
large organisations, as it is not an easy task to perform, especially when dealing with large ISs or
with systems that must remain operational throughout the process [BRO93, BEN95, BRO95]. However, if
the incremental approach fails, then only the failed step must be repeated rather than redoing the
entire migration. Hence, it is argued [BRO95] that the latter approach involves a lower risk and is
more appropriate in most situations. These approaches are further described in the next two
subsections. Our work is directed towards assisting this incremental migration approach.

       4.4.1 Complete Migration

        The process of complete migration involves rewriting a legacy IS from scratch to produce the
intended target IS. This approach carries a substantial risk. We discuss some of the reasons for this
risk to explain why this approach is not considered to be suitable by us. These are:

       a) A better system is expected.




        A 1-1 rewrite of a complex IS is nearly impossible to undertake, as additional functions not
present in the original system are expected to be provided by the target IS. Besides, it is a significant
problem to evolve a developing replacement IS in step with an evolving legacy IS and to incorporate
ongoing changes in business requirements into both. Changes to and requests to evolve ISs may occur
at any time, without warning, and hence, it is difficult to incorporate any minor / major ad hoc
changes to the new system as they may not fit into its design. Also, an attempt to change this design
may violate its original goals and contribute towards a never ending cycle of development changes.

        b) Specifications rarely exist for the current system.

        Documentation for the current system is often non-existent and typically available only in the
form of the code14 itself. Due to the evolution of the IS many undocumented dependencies will have
been added to the system without the knowledge of the legacy IS owners (i.e. uncontrolled
enhancements have occurred). These additions and dependencies must be identified and
accommodated when rewriting from scratch. This adds to the complexity of a complete rewriting
process and raises the risk of unpredicted failure of dependent ISs when we rewrite a legacy system
as they are dependent on undocumented features of that system.

        c) Information system is too critical to the organisation.

        Many legacy ISs must be operational almost all the time and cannot be dormant during
evolution. This means that migrating live data from a legacy IS to the target IS may take more time
than the business can afford to be without its mission critical information. Such situations often
prohibit complete rewriting altogether and make this approach a non-starter. It also means that a
carefully thought out staged migration plan must be followed in this situation.

        d) Management of large projects is hard.

        The management of large projects involves managing more and more people. This normally
results in less and less productive work because of the effort required to manage organisational
complexity. As a result the timing of most large projects is seriously under-estimated. Frequently,
this results in partial or complete abortion of the project, as the inability to keep up with original
targets due to time lost is not always tolerated by an impatient company management.

        4.4.2 Incremental Migration

       An incremental migration process involves a series of steps, each requiring a relatively small
resource allocation (e.g. a few person weeks or months in the case of small or medium scale
systems), and a short time to produce a specific small result towards the desired goal. This is in sharp
contrast to the complete rewrite approach which may involve a large resource allocation (e.g. several
person months or years), and a development project spanning several years to produce a single
massive result. To perform a migration involving a series of steps, it is important to identify

   14
      This code is sometimes provided only in the form of executable code, as ISs are often in-house
developments.


independent increments (i.e. portions of the legacy interfaces, applications and databases that can be
migrated independently of each other), and sequence them to achieve the desired goal. However, as
legacy ISs have a wide spectrum of forms from well-structured to unstructured, this process is
complex and usually has to deal with unavoidable problems due to dependencies between migration
steps. The following are the most important steps to apply in an incremental migration approach:

       a) Iteratively migrate the computing environment.

         The target environment must be selected, tested and established based on the total target IS
requirements. To determine the target IS requirements, it may be necessary to partially or totally
analyse and decompose the legacy IS. The installation of the target environment typically involves
installing a desk top computer for each target IS user and appropriate server machines, as identified
in section 4.3.1. The process of replacing dumb terminals with a PC or a workstation and connecting
them with a local area network can be done incrementally. This process allows the development of
the application modules and GUIs on an inexpensive local machine by downloading the relevant
code from a server machine, rather than by working on the server itself to develop this software.

        Software and hardware changes are gradually introduced in this approach along with the
necessary user and management training. Hence, although we explicitly refer to a particular process
there are many processes that take place simultaneously. This is due to the involvement of many
people in the overall migration activity, with each person contributing towards the desired migration
goal in a controlled and measurable way.

       Our work is concerned with iteratively migrating part of the legacy software (i.e. the database
service) and not the computing environment. Therefore we worked on a preinstalled target software
and hardware environment.

          b) Iteratively analyse the legacy information system.

       The purpose of this process is to understand the legacy IS in detail so that ultimately the
system can be modified to consist of decomposable modules. Any existing documentation, along
with the system code are used for this analysis. Knowledge and experience from people who support
and manage the legacy IS is also used to document the existing and the target IS requirements. This
knowledge has played a key role in other migration projects [DED95].

        Some existing code analysis tools such as Reasoning Systems' Software Refinery and
Bachman Information Systems' Product Set [COMP90, CLA92, BRO93, MAR94] can be used to
assist in this process. It may be useful to conduct experiments to examine the current system using
its known interfaces and available tools (e.g. CASE tools), so that the information gathered with one
tool can be reused by other tools. Here, functions and the data content of the current system are
analysed to extract as much information as possible about the legacy IS. Other available information
for this type of analysis includes: documentation, discussions with users, dumps (system backups),
the history of system operation and the services it provides.

      We do not perform any code analysis as part of our work. However, the analysis we do by
automated conceptual modelling identifies the usage of the data structures of the legacy IS. Our



analysis assists in identifying the structural components of the legacy IS and their functional
dependencies. This information may then be used to restructure the legacy code.

          c) Iteratively decompose the legacy information system structure.

        The objective of this process is to construct well-defined interfaces between the modules and
the database services of the legacy IS. The process may involve restructuring the legacy IS and
removing dependencies between modules. It will thereby simplify the migration, that otherwise must
support all these dependencies. This step may be too complex in the worst case, when the legacy IS
will have to remain in its original form. Such a situation will complicate the migration process and
may result in increased cost, reduced performance and additional risk. However, in such cases an
attempt to perform even limited restructuring could facilitate the migration, and is preferable to
avoiding the decomposition step altogether.

         We investigate supporting some structural changes in order to improve the existing structures
of the legacy database (e.g. introduction of inheritance to represent hierarchical structures and
specification of relationship structures). These changes eventually affect the application modules and
the interfaces of the legacy IS. Hence there is a direct dependency with respect to decomposing the
legacy database service and an indirect dependency with respect to decomposing the other
components of the legacy IS. The actual testing of this indirect dependency was not undertaken,
because it involves the application module. However, the ability to define referential integrity
constraints and assertions spanning multiple tables allows us to redefine functional dependencies in
the form of constraints or rules. When these constraints are stored in the database, it is possible to
remove such dependencies from the legacy applications. This assists decomposition of some
functional components of a legacy IS.

       d) Iteratively migrate the identified portions.

        An identified portion of the legacy IS may be an interface, application or a database service.
These components are individually migrated to the target environment. When this is done the
migrated portion will then run in the target environment with the remaining parts of the legacy
system continuing to operate. A gateway is used to encapsulate system components undergoing
changes. The objective of this gateway is to hide the ongoing changes in the application and the
database service from the system users. Obviously any change made to the appearance of any user
interface components will be noticeable, along with any significant performance improvements in
application modules processing large volumes of data.

       Our work is applicable only to a legacy database service and hence any incremental
migration of interfaces or application modules is not considered at this stage. The complete
migration of legacy data takes a significant amount of time from hours to days depending on the
volume of data held. During this process no updates or additions can be made to the data as they
cause inconsistency problems. This means all functions of the database application have to be
stopped to perform a complete data migration in one go. For large organisations this type of action is
not appropriate. Hence iterative migration of selected data portions is desirable. To ensure a
successful migration, each chosen portion needs to be validated for consistency and guarded against
being rejected by the target database. When migrating data in stages it is necessary to be aware of



the two sources of data as queries involving a migrated portion need to be re-directed to the target
system while other queries must continue to access the legacy database. This process may cause a
delay when a response for the query involves both the legacy and target databases. Hence it is
important to minimise this delay by choosing independent portions wherever possible for the
migration process.

4.5 Migration Method

        A formal approach to migrating legacy ISs has been proposed by Brodie [BRO93, BRO95]
based on his experience in the field of telecommunication and other related projects. These methods,
referred to as forward, backward/reverse and general (a combination of forward and backward)
migration, are based on his “chicken little” incremental migration process. A forward migration
incrementally analyses the legacy IS and then decomposes its structure, while incrementally
designing the target IS or installing the target environment. In this approach the database is migrated
prior to the other IS components and hence unchanged legacy applications are migrated forward onto
a modern DBMS before they are migrated to new target applications. Conversely, backward
migration creates the target applications and allows them to access the legacy data as the database
migration is postponed to the end. General migration is more complex as it uses a combination of
both these methods based on the characteristics of the legacy application and databases. However,
this is more suitable for most ISs as the approach can be tailored at will.

       The incremental migration process consists of a number of migration steps that together
achieve the desired migration. Each step is responsible for a specific aspect of the migration (i.e.
computer environment, legacy application, legacy data, system and user interfaces). The selection
and ordering of each aspect of the migration may differ as it depends on the application, as well as
the money and effort allocated for each process. Independent components can be migrated
sequentially or in parallel.

        As we see here, the migration methods of Brodie deal with all components of a legacy IS.
Our interest in this project is to focus on a particular component, namely the database service, and as
a result a detailed review of Brodie’s migration methods is not relevant here. However, our
approach has taken account of his forward migration approach as it first deals with the migration of
the legacy database service and then allows the legacy applications to access the post-migrated data
management environment through a forward database gateway.

4.6 Migration Support Tools

        There is no tool that is capable of performing a complete automatic migration for a given IS.
However, there are tools that can assist at various stages of this process. Hence, categorising tools by
their functions according to the stages of a migration process can help in identifying and selecting
those most appropriate. There are three main types of tools, namely: gateways, analysers and
migration tools, which can be of use at different stages of migration [BRO95]. For large migration
projects, testing and configuration management tools are also of use.

       a) Gateways





        The function of a gateway is to coordinate between different components of an IS and hide
ongoing changes (i.e. to interfaces, data, applications and other system components being migrated)
from users. One of the main functions of these tools is to intercept calls on an application or database
service and direct them to the appropriate part of the legacy or target IS.

        To incrementally migrate a legacy IS to a target IS, we need to select independent
manageable portions, replicate them in the target environment and give control to the new target
modules while the legacy system is still operational. To perform such a transition in a fashion
transparent to users, we need a software module (i.e. a gateway) which encapsulates system
components that are undergoing change behind an unchanged interface. Such a software module
manages information flow between two different environments, the legacy and target environments.
Functions such as retrieval, processing, management and representation of data from various systems
are expected from gateways. These expectations from a gateway managing a migration process are
similar to those we have of DBMSs managing data. DBMSs were designed to provide general
purpose data management and similarly the gateway needs to manage the migration process in a
generalised form. Development of such a gateway is beyond the scope of this project as it may take
several person years of effort. Hence our work will focus on some selected functionalities of a
gateway, which are sufficient to produce a realistic prototype.

       We aim to provide a simplified form of the functionality of a gateway, which permits the
evolution of an existing IS at the logical level, by creating a target database and managing an
incremental migration of the existing database service in a way transparent to its users. This facility
should be provided not only for centralised database systems, but also for heterogeneous distributed
databases. This means our gateway functionality should support databases built using different types
of DBMS. We expect some of this functionality to be incorporated in future DBMSs as part of their
system functionality. Functions such as schema evolution, management of heterogeneous distributed
databases and schema integration are expected capabilities of modern database products.

       b) Analysers

        These tools employ a wide variety of techniques in exploring the technical, functional and
architectural aspects of an application program or database service to provide graphical and textual
information about an IS. The functions of reverse and forward engineering are provided by these
tools.

        Many tools are used in this way to analyse the different components of an IS. Most of these
tools are semi-automatic as some form of user interaction is required to successfully complete their
task. For example, an application or database translation process is automatic if the source program
and data conform to all the standards supported by the tool. Otherwise, the translation process will
be terminated with the unconvertible portions indicated, leaving the database administrator to
complete the job manually by either correcting or re-programming those unconvertible portions of
the source program into target language code. We experienced this situation when attempting to
migrate an Oracle version 6 database to Oracle version 7, using the Oracle tools. In this case, Oracle
failed to successfully convert date functions used to check the constraints of its version 6 databases
to the equivalent coding in version 7 (Note: Oracle version 6’s use of non-standard date functions
was the cause of this problem).





       c) Migration tools

        These tools are responsible for creating the components of the target IS, including interfaces,
data, data definitions and applications.

       d) Testing

        An important task is to ensure that the migrated target IS performs in the same way as its
legacy original, with possibly better performance. For this task we need test beds that exercise as
much of the system's logic as possible using as little data as possible. There are tools that allow for easy manipulation of
testing functions like break points and data values. However, they do not help with the generation of
test beds or validation of the adequacy of the testing process. Comparing the results generated by
both systems provides a reasonable form of testing. This may
not be sufficient to test new features such as the introduction of distributed computing functionality
to our systems. It is up to the person involved to ensure that a reasonable amount of testing has been
done to ensure the functionality and the accuracy of the new IS.

       e) Configuration management

       This type of tool is needed for large migration projects involving many people, to coordinate
functions such as documentation, synchronisation, keeping track of changes made (auditing),
management of revisions to system elements (version control), and automatic building of a particular
version of a system component.

        Our work focuses on bringing these tools together into a single environment. We wish to
analyse a legacy database service, hence the functions of reverse and forward engineering are of
particular interest. We integrate these functions with some forward gateway and migration functions
as they are the relevant components for us to address the enhancement and migration of a database
service. Thus, we are not interested in all the features associated with migration support tools.

         The classification of reverse and re-engineering tools given in [SHA93, BRO95] provides a
broad description of the functions of existing CASE tools. These include maintaining, understanding
and enhancing existing systems; converting / migrating existing systems to new technology; and
building new replacement systems. There are many tools which perform some of these functions.
However, none of them is capable of performing the integrated task of all the above functions. This
is one of the important requirements for future CASE tools. As it is practically impossible to produce
a single tool to perform all these tasks, the way to overcome this deficiency is to provide a gateway
that permits multiple tools to exchange information and hence provide the required integrated
facility.

        The need to integrate different software components (i.e. database, spreadsheet, word
processing, e-mail and graphic applications) has resulted in the production of some integration tools,
such as DEC’s Link Works and Dictionary Software’s InCASE [MAY94]. However, what we need
is to integrate data extraction and downloading tools with schema enhancement and evolution
functions as they are together vital in the context of enhancing and migrating legacy databases.





        Support for interoperability among various DBMSs and the ability to re-engineer a DBMS
are important functions for a successful migration process. Of these two, the former has not been
given any attention until very recently, and there has been some progress relating to the latter in the
form of CASE tools. However, among the many CASE tools available only a handful support the re-
engineering process. The reason for this is that most CASE tools focus on forward-engineering, in
which new or replacement software systems are designed and the appropriate code is generated. The
re-engineering process is a combination of both forward-engineering and reverse-
engineering. The reverse-engineering process analyses the data structures of the databases of
existing systems to produce a higher level representation of these systems. This higher level
representation is usually in diagrammatic form and may be an entity-relationship diagram, data-flow
diagram, cross-reference diagram or structure chart.

       We came across some tools that are commercially available for performing various tasks of
the migration process. These include data access and / or extraction tools for Oracle [BAR90,
HOL93, KRO93] and INGRES [RTI92] - two of our test DBMSs. Some other tools, mainly those
capable of performing the initial reverse engineering task, are also identified here. These tools are
not suitable for legacy ISs in general, as they fail to support a variety of DBMSs or the re-
engineering of most pre-existing databases. Among the different tools available, tools such as
gateways play a more significant role than others. When different database products are used in an
organisation, there may be a need to use multiple tools for a single step of a migration process;
conversely, some tools may be of use for multiple steps. The process of using multiple tools for a
migration is complex and difficult as most vendors have not yet addressed the need for tool
interoperability.

        The survey carried out in [COMP90] identifies many reverse-engineering products. Among
the 40 vendor products listed there, only three claimed to be able to reverse engineer Oracle,
INGRES or POSTGRES databases (our test databases - see section 6.1) or any SQL based database
products. These three products were: Deft CASE System, Ultrix Programming Environment and
Foundation Vista. Of these products only Deft and Vista produced E-R diagrams. None of the
products in the complete list supported POSTGRES, which was then a research prototype. Of the
two products identified above, only Deft was able to read both Oracle and INGRES databases, while
Vista could read only INGRES databases. This analysis indicated that interoperability among our
preferred databases was rare and that it is not easy to find a tool that will perform the re-engineering
process and support interoperability among existing DBMSs. Although the information published in
[COMP90] may now be outdated, the literature published since then [SHA93, MAY94, SCH95]
does not show that modern CASE tools have addressed the task of re-engineering existing ISs along
with interoperability, both of which are essential for a successful migration process. However, the
functionality of accessing and sharing information from various DBMSs via gateways like ODBC is
a step towards achieving this task. One reason for this limited progress is the inability to customise
existing tools, which in turn prevents them being used in an integrated environment. This
is confirmed to some extent by the failure of the leading Unix database vendor - Oracle - to provide
such tools.

       Brodie and Stonebraker, in their book [BRO95], present a study of the migration of large
legacy systems. It identifies an approach (chicken-little) and the commercial tools required to



support this approach for legacy ISs in general. In this project we have developed a set of tools to
support an alternative approach for migrating legacy database services in particular. Thus Brodie and
Stonebraker take account of the need to migrate the application processes with a database, using
commercial tools, while in this thesis we concentrate on the development of integrated tools for
enhancing and migrating a database service.

4.7 The Migration Process

        Having identified the migration strategies and methods applicable to our work, we can
review our migration process. This process must start with a legacy IS as in figure 4.2 and end with a
target IS as shown in figure 4.3. However, as we are not addressing the application and interface
components of a legacy IS, their conversion is not part of this project.

Our conceptual constraint visualisation and enhancement system (CCVES) described in
section 2.2 was designed to assist in preparing legacy databases for such a migration. Hence our
migration process can be performed by connecting the legacy and target ISs using CCVES. This is
shown in figure 4.4. The three essential steps performed by CCVES before the actual migration
process occurs are shown using the paths highlighted in this figure as A, B and C, respectively.
These are the same paths that were described in section 2.2.

        The identification of all legacy databases used by an application is made prior to the
commencement of path A of figure 4.4. The reverse engineering process is then performed on any
selected database. This process commences when the database schema and its initial constraints are
extracted from the selected database and is completed when the database schema is graphically
displayed in a chosen format. Any missing or new information is supplied via path B in the form of
enhanced constraints, to allow further relationships and constraints to appear in the conceptual
model. The constraint enforcement process of path C is responsible for issuing queries and applying
these constraints to the legacy data and taking necessary actions whenever a violation is detected,
before any migration occurs. This ensures that the legacy data is consistent with its enhanced
constraints before migration. Once these steps are completed, a graceful transparent migration
process can be undertaken via path D. Our work focuses only on evolving and migrating database
services, hence path X representing the application migration is not done via CCVES. The evolution
of database services includes increasing IS program / data independence by identifying and
transferring legacy application services which are concerned with data management functions, like
enforcement of referential constraints, integrity constraints, rules, triggers, etc., to the database
service from the application. Our migration process performs the transformation of the legacy
database to the target environment and passes responsibility for enforcing the newly identified
constraints to this system.

        Figure 4.4 indicates that our approach commences with a reverse engineering process. This is
followed by a knowledge augmentation process which itself is a function of a forward engineering
process. These two stages together are referred to as re-engineering (see section 5.1). The constraint
enforcement process is the next stage of our approach. This is associated with the enhanced
constraints of the previous stage as it is necessary to validate the existing and enhanced constraint
specifications against the data held. These three preparatory stages are described in chapter 5. The
final stage of our approach is the database migration process. This is described later after we have



fully discussed the application of the earlier stages in relation to our test databases.




CHAPTER 5

                       Re-engineering Relational Legacy Databases

This chapter addresses the re-engineering process and issues concerned with relational legacy
DBMSs. Initially, the reverse-engineering process for relational databases is overviewed. Next, we
introduce our re-engineering approach, highlighting its important stages and the role of constraints in
performing these stages. We then present how existing legacy databases can be enhanced with
modern concepts and introduce our knowledge representation techniques, which allow the enhanced
knowledge to be held in the legacy database. Finally, we describe the optional constraint
enforcement process which allows validation of existing and enhanced constraint specifications
against the data held.

5.1 Re-engineering Relational Databases

        Software such as programming code and databases is re-engineered for a number of reasons:
for example, to allow reuse of past development efforts, reduce maintenance expense and improve
software flexibility [PRE94]. This re-engineering process consists of two stages, namely: a reverse-
engineering and a forward-engineering process. In database migration the reverse-engineering
process may be applied to help migrate databases between different vendor implementations of a
particular database paradigm (e.g. from INGRES to Oracle), between different versions of a
particular DBMS (e.g. Oracle version 3 to Oracle version 7) and between database types (e.g.
hierarchical to modern relational database systems). The forward-engineering process, which is the
second stage of re-engineering, is performed on the conceptual model derived from the original
reverse-engineering process. At this stage, the objective is to redesign and / or enhance an existing
database system with missing and / or new information.

        The application of reverse-engineering to relational databases has been widely described and
applied [DUM81, NAV87, DAV87, JOH89, MAR90, CHI94, PRE94, WIK95b]. The latest
approaches have been extended to construct a higher level of abstraction than the original E-R
model. This includes the representation of object-oriented concepts such as generalisation /
specialisation hierarchies in a reversed-engineered conceptual model.

        Due to parallel work in this area in recent years, there are some similarities and differences
between our reverse-engineering approach [WIK95b] and other recent approaches [CHI94, PRE94].
We shall refer to these in the next sub-sections.

       The techniques used in the reverse-engineering process address the following common tasks:

       • Identify the database’s contents such as relations and attributes of relations.
       • Determine keys, e.g. primary keys, candidate keys and foreign keys.
       • Determine entity and relationship types.
       • Construct suitable data abstractions, such as generalisation and aggregation structures.
       5.1.1 Contents of a relational database

       Diverse sources provide information that leads to the identification of a database’s contents.
These include the database’s schema, observed patterns of data, semantic understanding of
application and user manuals. Among these the most informative source is the database’s schema,
which can be extracted from the data dictionary of a DBMS. The observed patterns of data usually
provide information such as possible key fields, domain ranges and the related data elements. This
source of information is usually not reliable as invalid, inconsistent, and incomplete data exists in
most legacy applications. The reliability can be increased by using the semantics of an application.
The availability of user manuals for a legacy IS is rare and they are usually out of date, which means
they provide little or no useful information to this search.

        Data dictionaries of relational databases store information about relations, attributes of
relations, and rapid data access paths of an application. Modern relational databases record
additional information, such as primary and foreign keys (e.g. Oracle), rules / constraints on relations
(e.g. INGRES, POSTGRES, Oracle) and generalisation hierarchies (e.g. POSTGRES). Hence,
analysis of the data dictionaries of relational databases provides the basic elements of a database
schema, i.e. entities, their attributes, and sometimes the keys and constraints, which are then used to
discover the entity and relationship types that represent the basic components of a conceptual model
for the application. The trend is for each new product release to support more sophisticated facilities
for representing knowledge about the data.

       5.1.2 Keys of a relational data model

        Theoretically, three types of key are specified in a relational data model. They are primary,
candidate and foreign keys. Early relational DBMSs were not capable of explicitly representing
these. However, the indexes which are used for rapid data access can sometimes be used as a clue to
determine some keys of an application database. For instance, the analysis of the unique index keys
of a relational database provides sufficient information to determine possible primary or candidate
keys of an application. The observed attribute names and data patterns may also be used to assist this
process. This includes attribute names ending with ‘#’ or ‘no’ as possible candidate keys, and
attributes in different relations having the same name for possible foreign key attributes. In the latter
case, we need to consider homonyms to eliminate incorrect detections and synonyms to prevent any
omissions due to the use of different names for the same purpose. Such attributes may need to be
further verified using the data elements of the database. This includes explicit checks on data for
validity of uniqueness and referential integrity properties. However the reverse of this process, i.e.
determining a uniqueness property from the data values in the extensional database is not a reliable
source of information, as the data itself is usually not complete (i.e. it may not contain all possible
values) and may not be fully accurate. Hence we do not use this process although it has been used in
[CHI94, PRE94].

        The lack of information on keys in some existing database specifications has led to the use of
data instances to derive possible keys. However it is not practicable to automate this process as some
entities have keys consisting of multiple attributes. This means many permutations would have to be
considered to test for all possibilities. This is an expensive operation when the volume of data and /
or the number of attributes is large.




        In [CHI94], a consistent naming convention is applied to key attributes. Here attributes used
to represent the same information must have the same name, and as a result referencing and
referenced attributes of a binary relationship between two entities will have the same attribute names
in the entities involved. This naming convention was used in [CHI94] to determine relationship
types, as foreign key specifications are not supported by all databases. An important contribution of
our work is to support the identification of foreign key specifications for any database and hence the
detection of relationships, without performing any name conversions. We note that some reverse-
engineering methods rely on candidate keys (e.g. [NAV87, JOH89]), while others rely on primary
keys (e.g. [CHI94]). These approaches require their users to meet certain pre-requisites (e.g. the
specification of missing keys) before their reverse-engineering process can be applied successfully.
This means it is not possible to produce a suitable conceptual model until the pre-requisites are
supplied. For a large legacy database application the number of such pre-requisites could exceed a
hundred, and hence it is not appropriate to rely on them being met before deriving an initial
conceptual model. Therefore, we concentrate on providing an initial conceptual model using only the
available information. This will ensure that the reverse-engineering process will not fail due to the
absence of any vital information (e.g. the key specification for an entity).

         5.1.3 Entity and Relationship Types of a data model

         In the context of an E-R model an entity is classified as strong15 or weak depending on an
existence-dependent property of the entity. A weak entity cannot exist without the entity it is
dependent on. The enhanced E-R model (EER) [ELM94] identifies more entity types, namely:
composite, generalised and specialised entities. In section 3.3.1 we described these entity types and
the relationships formed among them. Different classifications of entities are due to their associative
properties with other entities. The identification of an appropriate entity type for each entity will
assist in constructing a graphically informative conceptual model for its users. The extraction of
information from legacy systems to classify the appropriate entity type is a difficult task as such
information is usually lost during an implementation. This is because implementations take different
forms even within a particular data model [ELM94]. Hence, an information extraction process may
need to interact with a user to determine some of the entity and relationship types. The type of
interaction required depends on the information available for processing and will take different
forms. For this reason we focus only on our approach, i.e. determining entity and relationship types
using enhanced knowledge such as primary and foreign key information. This is described in section
5.4.

         5.1.4 Suitable Data Abstractions for a data model

       Entities and relationships form the basic components of a conceptual data model. These
components describe specific structures of a data model. A collection of entities may be used to
represent more than one data structure. For example, entities Person and Student may be represented
as a 1:1 relationship or as an is-a relationship. Each representation has its own view, and hence the
user understanding of the data model will differ with the choice of data structure. Hence it is
important to be able to introduce any data structure for a conceptual model and to view it using the
most suitable data abstraction.

   15
        In some literature this type of entity is referred to as a regular entity, e.g. [DAT95].

        Data structures such as generalisation and aggregation have inherent behavioural properties
which give additional information about their participating entities (e.g. an instance of a specialised
entity of a generalisation hierarchy is made up from an instance of its generalised entity). These
structures are specialised relationships and representation of them in a conceptual model provides a
higher level of data abstraction and a better user understanding than the basic E-R data model gives.
These data abstractions originated in the object-oriented data model and they are not implicitly
represented in existing relational DBMSs. Extended-relational DBMSs support the O-O paradigm
(e.g. POSTGRES) with generalisation structures being created using inheritance definitions on
entities. However in the context of legacy DBMSs such information is not normally available, and as
a result such data abstractions can only be introduced either by introducing them without affecting
the existing data structures or by transforming existing entities and relationships to support their
representation. For example, entities Staff and Student may be transformed to represent a
generalisation structure by introducing a Person entity.

        Other forms of transformation can also be performed. These include decomposing all n-ary
relationships for n > 3 into their constituent relationships of order 2 to remove such relationships and
hence simplify the association among their entities. At this stage double buried relationships are
identified and merged, and relationships formed with subclasses are eliminated. Transitive closure
relationships are also identified and changed to form simplified hierarchies. We use constraints to
determine relationships and hierarchies. By controlling these constraints (i.e. modifying or deleting
them) it is possible to transform or eliminate relationships and hierarchies as necessary.

5.2 Our Re-engineering Process

        Our re-engineering process has two stages. Firstly, the relational legacy database is accessed
to extract the meta-data of the application. This extracted meta-data is translated into an internal
representation which is independent of the vendor database language. This information is next
analysed to determine the entity and relationship types, their attributes, generalisation / specialisation
hierarchies and application constraints. The conceptual model of the database is then derived using
this information and is presented graphically for the user. This completes the first stage which is a
reverse-engineering process for a relational database.

       To complete the re-engineering process, any changes to the existing design and any new
enhancements are done at the second stage. This is a forward-engineering process that is applied to
the reverse-engineered model of the previous stage. We call this process constraint enhancement as
we use constraints to enhance the stored knowledge of a database and hence perform our forward-
engineering process. These constraint enhancements are done with the assistance of the DBA.

       5.2.1 Our Reverse-Engineering Process

       Our reverse-engineering process concentrates on producing an initial conceptual model
without any user intervention. This is a step towards automating the reverse-engineering process.
However the resultant conceptual model is usually incomplete due to the limited meta-knowledge
available in most legacy databases. Also, as a result of incomplete information and unseen inclusion
dependencies we may represent redundant relationships as well as fail to identify some of the entity
and / or relationship types. We depend on constraint enhancement (i.e. the forward-engineering
process) to supply this missing knowledge so that subsequent conceptual models will be more
complete. The DBA can investigate the reverse-engineered model to detect and resolve such cases
with the assistance of the initial display of that model. The system will need to guide the DBA by
identifying missing keys and assisting in specifying keys and other relevant information. It also
assists in examining the extent to which the new specifications conform to the existing data.

        Our reverse-engineering process does not depend on information about specialised
constraints. When no information about these is available, we treat all entities of a database as being of
the same type (i.e. strong entities) and any links present in the database will not be identified. In such
a situation the conceptual model will display only the entities and attributes of the database schema
without any links. For example, a relational database schema for a university college database
system with no constraint-specific information will initially be viewed as shown in figure 5.1. This is
the usual case for most legacy databases as they lack constraint-specific information. However, the
DBA will be able to provide any missing information at the next stage so that any intended data
structures can be reconstructed. Obviously if some constraints are available our reverse-engineering
process will try to derive possible entity types and links during its initial application.

                    [Figure: each entity displayed as a box listing its attributes; no links are shown]

                    University  (office)
                    College     (code, building, name, address, principal, phone)
                    Faculty     (code, building, name, address, secretary, phone, dean)
                    Department  (deptCode, building, name, address, head, phone, faculty)
                    Employee    (name, address, birthDate, gender, empNo, designation, worksFor,
                                 yearJoined, room, phone, salary)
                    Student     (name, address, birthDate, gender, collegeNo, course, department,
                                 tutor, regYear)
                    EmpStudent  (collegeNo, empNo, remarks)


                    Figure 5.1 : A relational legacy database schema for a university college database


        Our reverse-engineering process first identifies all the available information by accessing the
legacy database service (cf. section 5.3). The accessed information is processed to derive the
relationship and entity types for our database schema (cf. section 5.4). These are then presented to
the user using our graphical display function.

       5.2.2 Our Forward-Engineering Process

        The forward-engineering process is provided to allow the designer (i.e. DBA) to interact with
a conceptual model. The designer is responsible for verifying the displayed model and can supply
any additional information to the reverse-engineered model at this stage. The aim of this process is
to allow the designer to define and add any of the constraint types we identified in section 3.5 (i.e.
primary key constraints, foreign key constraints, uniqueness constraints, check constraints,
generalisation / specialisation structures, cardinality constraints and other constraints) which are not
present. Such additions will enhance the knowledge held about a legacy database. As a result, new
links and data abstractions that should have been in the conceptual model can be derived using our
reverse-engineering techniques and presented in the graphical display. This means that the legacy
database schema originally viewed as in figure 5.1 can be enhanced with constraints and presented
as in figure 5.2, which is a vast improvement on the original display. Such an enhanced display
demonstrates the extent to which a user’s understanding of a legacy database schema can be
improved by providing some additional knowledge about the database. In sections 6.3.4 and 6.4 we
introduce the enhanced constraints of our test databases including those used to improve the legacy
database schema of figure 5.1 to figure 5.2.

                 [Figure: the enhanced EER schema, showing
                  - a generalisation hierarchy in which Person (name, address, birthDate, gender) is the
                    supertype of Employee (empNo, designation, yearJoined, room, phone, salary) and
                    Student (collegeNo, course, department, regYear), with EmpStudent (remarks)
                    inheriting from both;
                  - a generalisation hierarchy in which Office (code, siteName, unitName, address, phone)
                    is the supertype of College-Office (siteName AS building, unitName AS name,
                    inCharge AS principal), Faculty-Office (siteName AS building, unitName AS name,
                    inCharge AS secretary) and Dept-Office (code AS deptCode, siteName AS building,
                    unitName AS name, inCharge AS head);
                  - an aggregation (office) between University and Office;
                  - relationships inCharge, worksFor, dean, tutor and faculty, with cardinality
                    annotations such as 4+ and 2-12;
                  - a constraint list attached to each entity.]


                                  Figure 5.2 : The enhanced university college database schema


        We support the examination of existing specifications and identification of possible new
specifications (cf. section 5.5) for legacy databases. Once these are identified, the designer defines
new constraints using a graphical interface (cf. section 5.6). The new constraint specifications are
stored in the legacy database using a knowledge augmentation process (cf. section 5.7). We also
supply a constraint verification module to give users the facility to verify and ensure that the data
conforms to all the enhanced constraints (cf. section 5.8) being introduced.

5.3 Identifying Information of a Legacy Database Service

       Schema information about a database (i.e. meta-data) is stored in the data dictionaries of that
database. The representation of information in these data dictionaries is dependent on the type of the
DBMS. Hence initially the relational DBMS and the databases used by the application are identified.
The name and the version of the DBMS (e.g. INGRES version 6), the names of all the databases in
use (e.g. faculty / department), and the name of the host machine (e.g. llyr.athro.cf.ac.uk) are
identified at this stage. These are the input data that allows us to access the required meta-data. As
the access process is dependent on the type of the DBMS, we describe this process in section 6.5
after specifying our test DBMSs. This process is responsible for identifying all existing entities, keys
and other available information in a legacy database schema.
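
        As an indication of what this access involves, a DBMS such as Oracle version 7 exposes its
constraint meta-data through catalog views, and a query of roughly the following form lists the
declared primary and foreign keys. This is a sketch only: the view and column names shown are
Oracle-specific, and INGRES and POSTGRES hold the corresponding information in their own system
catalogs (cf. section 6.5).

        -- primary ('P') and foreign ('R') key constraints recorded in the Oracle data dictionary
        SELECT c.table_name, c.constraint_name, c.constraint_type, cc.column_name
        FROM   user_constraints c, user_cons_columns cc
        WHERE  c.constraint_name = cc.constraint_name
        AND    c.constraint_type IN ('P', 'R')
        ORDER  BY c.table_name, c.constraint_name, cc.position;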

5.4 Identification of Relationship and Entity Types

        Once the entities and their attributes along with primary and candidate keys have been
provided, we are ready to classify relationships and entity types. Three types of binary relationships
(i.e. 1:1, 1:N and N:M) and five types of entities (i.e. strong, weak, composite, generalised and
specialised) are identified at this stage.

        Initially we assume that all entities are strong and look for certain properties associated with
them (mainly primary and foreign key), so that they can be reclassified into any of the other four
types. Weak and composite entities are identified using relationship properties and generalised /
specialised entities are determined using generalisation hierarchies.

  5.4.1 Relationship Types

        (a) An M:N relationship

        If the primary key of an entity is formed from two foreign keys then their referenced entities
participate in an M:N relationship. This is a special case of n-ary relationship involving two
referenced entities (see section ‘a’ of figure 5.3). This entity becomes a composite entity having a
composite key. For example, entity Option with primary key (course,subject) participates in an M:N
relationship as the primary key attributes are foreign keys - see tables 6.2, 6.4 and 6.6 (later).

        In an n-ary relationship (e.g. 3-ary or ternary if the number of foreign keys is 3, see section ‘b’
of figure 5.3) the primary key of an entity is formed from a set of n foreign keys. As stated in section
5.1.4, n-ary relationships for n > 3 are usually decomposed into their constituent relationships of
order 2 to simplify their association. Hence we do not specifically describe this case. For example,
entity Teach with primary key (lecturer, course, subject) participates in a 3-ary relationship when
lecturer, course and subject are foreign keys referencing entities Employee, Course and Subject,
respectively. However, as Option is made up using Course and Subject entities we could decompose
this 3-ary relationship into a binary relationship by defining course and subject of Teach to be a
foreign key referencing entity Option - see tables 6.2, 6.4 and 6.6.
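
        To make these two patterns concrete, the Option and Teach examples correspond roughly to the
following definitions. This is a sketch in SQL-2 style: the attribute types, and the key attributes of
Course, Subject and Employee, are assumed here, as the full definitions appear only in tables 6.2 to
6.6.

        CREATE TABLE Option (
            course   char(8) NOT NULL REFERENCES Course (course),
            subject  char(8) NOT NULL REFERENCES Subject (subject),
            PRIMARY KEY (course, subject) );   -- PK = FK + FK, so Course and Subject are in an M:N relationship

        CREATE TABLE Teach (
            lecturer char(8) NOT NULL REFERENCES Employee (empNo),
            course   char(8) NOT NULL,
            subject  char(8) NOT NULL,
            PRIMARY KEY (lecturer, course, subject),                               -- three foreign keys: a 3-ary relationship
            FOREIGN KEY (course, subject) REFERENCES Option (course, subject) );   -- or a binary link to Option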




                   Relational Model pattern                                     ER Model concept

                   (a) PK = FK1 + FK2 (i.e. n = 2)                              M:N (binary) relationship
                   (b) PK = FK1 + FK2 + .. + FKn (n > 2)                        n-ary relationship (e.g. ternary / 3-ary)
                   (c) FK attribute is part of the PK and the other part
                       does not contain a key of any other relation            1:N relationship (weak)
                   (d) A non-key FK attribute that is non-unique               1:N relationship
                   (e) A non-key FK attribute that is unique                   1:1 relationship

                                       PK - Primary Key     FK - Foreign Key
                   (in the graphical notation each pattern links the referencing entity to its referenced entity)

                              Figure 5.3: Mapping foreign key references to an ER relationship type


        Sometimes a foreign key refers to the same entity, forming a unary relationship, as in the
case where some courses may have pre-requisites. In this case the attribute pre-requisites of entity
Course is a foreign key referencing the same entity.

       (b) A 1:N relationship

        There are two types of 1:N relationships. One is formed with a weak entity and the other with
a strong entity.

        If part of the primary key of an entity is a foreign key and the other part does not contain a
key of any other relation, then the entity concerned is a weak entity and will participate in a weak
1:N relationship (see section ‘c’ of figure 5.3) with its referenced entity. For example, entity
Committee with primary key (name, faculty) is a weak entity as only a part of its primary key
attributes (i.e. faculty) is a foreign key.

        A non-key foreign key attribute (i.e. an attribute that is not part of a primary key) that may
have multiple values will participate in a strong 1:N relationship (see section ‘d’ of figure 5.3) if it
does not satisfy the uniqueness property. For example, attribute tutor of entity Student is a non-key,
non-unique foreign key referencing the entity Employee (cf. tables 6.2 to 6.4). Here tutor participates
in a 1:N relationship with Employee - see table 6.6.
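
        The distinction between cases ‘d’ and ‘e’ of figure 5.3 thus rests on whether the foreign key
attribute is unique. Where no unique index or constraint exists, a data-level check such as the
following sketch (using the Student.tutor example) can give supporting evidence, although, as noted in
section 5.1.2, data values alone are not a reliable source:

        -- any tutor value occurring more than once rules out a 1:1 interpretation
        SELECT tutor, COUNT(*)
        FROM   Student
        WHERE  tutor IS NOT NULL
        GROUP  BY tutor
        HAVING COUNT(*) > 1;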

       (c) A 1:1 relationship

        A non-key foreign key attribute will participate in a 1:1 relationship if a uniqueness
constraint is defined for that attribute (see section ‘e’ of figure 5.3). For example, attribute head of
entity Department participates in a 1:1 relationship with entity Employee as it is a non-key foreign
key with the uniqueness property, referencing Employee - see tables 6.2 to 6.4 and 6.6.



        The specialised and generalised entity pair of a generalisation hierarchy has a 1:1 is-a
relationship. Hence it is possible to define a binary relationship in place of a generalisation
hierarchy. For example, it is possible to define a foreign key (empNo) on entity EmpStudent,
referencing entity Employee to form a 1:1 relationship instead of representing it as a generalisation
hierarchy. Such cases must be detected and corrected by the database designer. We introduce
inheritance constraints involving these entities to resolve such cases.

  5.4.2 Entity Types

       (a) A strong entity

        This is the default entity type, as any entity that cannot be classified as one of the other types
will be a strong (regular) entity.

       (b) A composite entity

        An entity that is used to represent an M:N relationship is referred to as a composite (link)
entity (cf. section 5.4.1 (a)). The identification of M:N relationships will result in the identification
of composite entities.



       (c) A weak entity

       An entity that participates in a weak 1:N relationship is referred to as a weak entity (cf.
section 5.4.1 (b)). The identification of weak 1:N relationships will result in the identification of
weak entities.

       (d) A generalised / specialised entity

        An entity defined to contain an inheritance structure (i.e. inheriting properties from others) is
a specialised entity. Entities whose properties are used for inheritance are generalised entities. The
identification of inheritance structures will result in the identification of specialised and generalised
entities. A single inheritance structure involves one supertype (e.g. entities X1 to Xj inherit
from entity A in figure 5.4). However, a set of inheritance structures may combine to form a multiple
inheritance structure (e.g. entity Xj inherits from entities A and B in figure 5.4). To determine the existence of
multiple inheritance structures we analyse all subtype entities of the database (e.g. entities X1 to Xn
in figure 5.4) and derive their supertypes (e.g. entity A or B or both in figure 5.4). For example,
entity EmpStudent inherits from Employee and Student entities forming a multiple inheritance, while
entity Employee inherits from Person to form a single inheritance.




                    [Figure: supertype entities A and B with subtype entities X1, .., Xi, .., Xj, .., Xn beneath them.
                     Entities X1, .., Xi, .., Xj inherit from entity A and entities Xj, .., Xn inherit from entity B.]

                          Figure 5.4: Single and multiple inheritance structures using EER notations


5.5 Examining and Identifying Information

       Our forward-engineering process allows the designer to specify new information. To
successfully perform this task the designer needs to be able to examine the current contents of the
database and identify possible missing information from it.

       5.5.1 Examining the contents of a database

        At this stage the user needs to be able to browse through all features of the database. Overall,
this includes viewing existing primary keys, foreign keys, uniqueness constraints and other
constraint types defined for the database. When inheritance is involved the user may need to
investigate the participating entities at each level of inheritance. For more specific viewing, the user
may want to investigate the behaviour of individual entities. This includes identifying constraints
associated with a particular entity (i.e. intra-object properties) and those involving other entities (i.e.
inter-object properties). Our system provides for this via its graphical interface. We describe viewing
of these properties in section 7.5.1, as it is directly associated with this interface. Here, global
information is tabulated and presented for each constraint type, while specific information (i.e. inter-
and intra- object) presents constraints associated with an individual entity.

       5.5.2 Identifying possible missing, hidden and redundant information

       This process allows the designer to search for specific types of information, including
information about the type of entities that do not contain primary keys, possible attributes for such
keys, buried foreign key definitions and buried inheritance structures. In this section we describe
how we identify this type of information.

       i) Possible primary key constraints

        Entities that do not contain primary keys are identified by comparing the list of entities
having primary keys with the list of all entities of the database. When such entities are identified the
user can view the attributes of these and decide on a possible primary key constraint. Sometimes, an
entity may have several attributes and hence the user may find it difficult to decide on suitable
primary key attributes. In such a situation the user may need to examine existing properties of that
entity (cf. section 5.5.1) to identify attributes with uniqueness properties and not null values.
Sometimes, attribute names such as those ending with ‘no’ or ‘#’ may give a clue in selecting the
appropriate attributes. Once the primary key attributes have been decided the user may want to
verify this choice against the data of the database (cf. section 5.8).
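
        For example, a chosen combination of attributes can be screened against the data with queries
of the following form before it is accepted as a primary key; the Employee / empNo names are purely
illustrative:

        -- the candidate attribute(s) must not contain nulls ...
        SELECT COUNT(*) FROM Employee WHERE empNo IS NULL;

        -- ... and must not contain duplicate values
        SELECT empNo, COUNT(*)
        FROM   Employee
        GROUP  BY empNo
        HAVING COUNT(*) > 1;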

       ii) Possible foreign key constraints

        Existence of either an inclusion dependency between a non-key attribute of one table and a
key attribute of another (e.g. deptno of Employee and deptno of Department), or a weak or n-ary
relationship between a key attribute and part of a key attribute (e.g. cno of strong entity Course and
cno of link entity Teach) implies the possible existence of a foreign key definition. Such possibilities
are detected by matching attribute names satisfying the required condition. Otherwise, the user needs
to inspect the attributes and detect their possible occurrence (e.g. if attribute name worksfor instead
of deptno was used in Employee).
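
        A suspected inclusion dependency can then be confirmed or refuted against the data by looking
for referencing values with no referenced counterpart, along the lines of the following sketch (the
deptno example above is assumed):

        -- values of Employee.deptno that do not appear in Department.deptno indicate that
        -- a foreign key definition would not currently hold for the existing data
        SELECT DISTINCT e.deptno
        FROM   Employee e
        WHERE  e.deptno IS NOT NULL
        AND    e.deptno NOT IN (SELECT d.deptno FROM Department d);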

       iii) Possible uniqueness constraints

        Detection of a unique index gives a clue to a possible uniqueness constraint. All other
indications of this type of constraint have to be identified by the user.

       iv) Possible inheritance structures

        Existence of an inclusion dependency between two strong entities having the same key
implies a subtype / supertype relationship between the two entities. Such possible relationships are
detected by matching identical key attribute names of strong entities (e.g. empno of Person and
empno of Employee). Otherwise, the user needs to inspect the table and 1:1 relationships to detect
these structures (e.g. if personid instead of empno was used in Person then the link between empno
and personid would have to be identified by the user).

        In distributed database design some entities are partitioned using either horizontal or vertical
fragmentation. In this situation strong entities having the same key will exist with a possible
inclusion dependency between vertically fragmented tables. Such cases need to be identified by the
designer to avoid incorrect classifications occurring. For example, employee records can be
horizontally fragmented and distributed over each department as opposed to storing at one site (e.g.
College). Also, employee records in a department may be vertically fragmented at the College site as
the college is interested in a subset of information recorded for a department.

       v) Possible unnormalised structures

        All entities of a relational model are at least in 1NF, as this model does not allow multivalued
attributes. When entities are not in 3NF (i.e. a non-key attribute is dependent on part of a key or
another non-key attribute: violating 2NF or 3NF, respectively), there are hidden functional
dependencies. These entities need to be identified and transformed into 3NF to show their
dependencies. New entities in the form of views are used to construct this transformation. For
example, entity Teach can be defined to contain attributes lecturer, course, subNo, subName and
room. Here we see that subName is fully dependent on subNo, which is only part of the key, and
hence Teach is not in 2NF. Using a view we separate subName from Teach and use it as a separate
entity with primary key subNo. This allows us to transform the original Teach to 3NF and to view
Subject and Teach as a binary, instead of a unary, relationship. This will assist in improving the
readability of the conceptual model.
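
        Such a decomposition can be expressed with view definitions of roughly the following form. This
is a sketch only: the stored Teach table itself is left unchanged, and the name Teach3NF is introduced
here purely for illustration.

        CREATE VIEW Subject (subNo, subName) AS
            SELECT DISTINCT subNo, subName FROM Teach;

        CREATE VIEW Teach3NF (lecturer, course, subNo, room) AS
            SELECT lecturer, course, subNo, room FROM Teach;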

       vi) Possible redundant constraints

        Redundant inclusion dependencies representing projection or transitivity must be removed,
otherwise incorrect entity or relationship types may be represented. For instance, if there is an
inclusion dependency between entities A and B, and between B and C, then the transitive inclusion
dependency between A and C is redundant. Such relationships should be detected and removed. For example,
EmpStudent is an Employee and Employee is a Person, thus EmpStudent is a Person is a redundant
relationship. Redundant constraints are often most obvious when viewing the graphical display of a
conceptual model with its inter- and intra- object constraints.

5.6 Specifying New Information

        We can specify new information using constraints. In a modern DBMS which supports
constraints we can use its query language to specify these. However this approach will fail for legacy
databases as they do not normally support the specification of constraints. To deal with both cases
we have designed our system to externally accept constraints of any type, but represent them
internally by adopting the appropriate approach depending on the capabilities of the DBMS in use.
Thus if constraint specification is supported by the DBMS in use we will issue a DDL statement (cf.
figure 5.5 which is based on SQL-3 syntax) to create the constraint. If constraint specification is not
supported by the DBMS in use we will store the constraint in the database using techniques
described in section 5.7. These constraints are not enforced by the system but they may be used to
verify the extent to which the data conforms with the constraints (cf. section 5.8). In both cases this
enhanced knowledge is used by our conceptual model wherever it is applicable. The following sub-
sections describe the specification process for each constraint type. We cover all types of constraints
that may not be supported by a legacy system, including primary key. We use the SQL syntax to
introduce them. In SQL, constraints are specified as column/table constraint definitions and can
optionally contain a constraint name definition and constraint attributes (see sections A.3 and A.4)
which are not included here.

       i) Specifying Primary Key Constraints

        Only one primary key is allowed for an entity. Hence our system will not allow any input that
may violate this status. Once an entity is specified the primary key attributes are chosen. Modern
SQL DBMSs will use the DDL statement ‘a’ of figure 5.5 to create a new primary key constraint;
older systems do not have this capability in their syntax.

       ii) Foreign Key Constraints

       A foreign key establishes a relationship between two entities. Hence, when the enhancing
constraint type is chosen as a foreign key, our system requests two entity names. The first is the
referencing entity and the second the referenced entity. Once the entity names are identified the
system automatically shows the referenced attributes. These attributes are those having the
uniqueness property. When these attributes are chosen a new foreign key is established. This
constraint will only be valid if there is an inclusion dependency between the referencing and
referenced attributes. Modern SQL DBMSs will use the DDL statement ‘b’ of figure 5.5 to create a
new foreign key constraint in this situation. This statement can optionally contain a match type and
referential triggered action (see section A.8) which are not shown here.
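
        For instance, using the university college schema of figure 5.1, the tutor relationship identified
in section 5.4.1 could be declared through statement ‘b’ as follows; the constraint name is arbitrary
and we assume empNo is the primary key of Employee:

        ALTER TABLE Student ADD CONSTRAINT student_tutor_fk
              FOREIGN KEY (tutor)
              REFERENCES Employee (empNo)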

       iii) Uniqueness Constraints

        A uniqueness constraint may be defined on any combination of attributes. However such
constraints should be meaningful (e.g. there is no point in defining a uniqueness constraint for a set
of attributes when a subset of it already holds the uniqueness status), and should not violate any
existing data. Modern SQL DBMSs will use the DDL statement ‘c’ of figure 5.5 to create a new
uniqueness constraint.

                          (a)      ALTER TABLE Entity_Name ADD CONSTRAINT Constraint_Name
                                         PRIMARY KEY (Primary_Key_Attributes)
                          (b)      ALTER TABLE Entity_Name ADD CONSTRAINT Constraint_Name
                                         FOREIGN KEY (Foreign_Key_Attributes)
                                         REFERENCES Referenced_Entity_Name (Referenced_Attributes)
                          (c)      ALTER TABLE Entity_Name ADD CONSTRAINT Constraint_Name
                                         UNIQUE (Uniqueness_Attributes)
                          (d)      ALTER TABLE Entity_Name ADD CONSTRAINT Constraint_Name
                                         CHECK (Check_Constraint_Expression)
                          (e)      ALTER TABLE Entity_Name ADD
                                         UNDER Inherited_Entities [ WITH (Renamed_Attributes) ]
                          (f)      ALTER TABLE Entity_Name ADD CONSTRAINT Constraint_Name
                                         FOREIGN KEY (Foreign_Key_Attributes)
                                         [ CARDINALITY (Referencing_Cardinality_Value) ]
                                          REFERENCES Referenced_Entity_Name (Referenced_Attributes)

                          Our optional extensions to the SQL-3 syntax are highlighted using bold letters here.

                                Figure 5.5 : Constraints expressed in extended SQL-3 syntax


       iv) Check Constraints

        A check constraint may be defined to represent a complex expression involving any
combination of attributes and system functions. However such constraints should not be redundant
(i.e. not a subset of an existing check constraint) and should not violate any existing data. Modern
SQL DBMSs will use the DDL statement ‘d’ of figure 5.5 to create a new check constraint.

          v) Generalisation / Specialisation Structures

        An inheritance hierarchy may be defined without performing any structural changes if its
existence can be detected by our process described in part ‘d’ of section 5.4.2. In this case we need
to specify the entities being inherited (cf. statement ‘e’ of figure 5.5). If an inherited attribute’s name
differs from the target attribute name it is necessary to rename them. For example, attributes
siteName, unitName and inCharge of Office are renamed to building, name and principal when it is
inherited by College - see figures 6.2 and 6.3 (later).

        It is also possible to make some structural changes in order to introduce new generalisation /
specialisation structures. In such situations new entities are created to represent the
specialisation/generalisation. Appropriate data for these entities are copied to them during this
process. For instance, in our university college example of figure 5.1, the entities College, Faculty and
Department can be restructured to represent a generalisation hierarchy, by introducing a generalised
entity called Office and transforming the entities College, Faculty and Department to College-Office,
Faculty-Office and Dept-Office, respectively (cf. figure 5.2). Once this transformation is done the
entities Office, College-Office, Faculty-Office and Dept-Office will represent a generalisation hierarchy
as shown in figure 5.2. Any change to existing structures and table names will affect the application
programs which use them. To overcome this we introduce view tables in the legacy database to
represent new structures. These tables are defined using the syntax of figure 5.6. For example, the
generalised entity will be Office and the specialised entities will be College-Office, Faculty-Office and
Dept-Office. The introduction of view tables means that legacy application code using the original
tables will not be affected by the change. However, appropriate changes must be introduced in the
target application code and database if we are going to introduce these features permanently. We
introduced the concept of defining view tables in the legacy database to assist the gateway service in
managing these structural changes.

                              CREATE VIEW GeneralisedEntity (GeneralisedAttributes) AS
                                  SELECT Attributes FROM SpecialisedEntity
                                  [ [UNION SELECT Attributes FROM SpecialisedEntity] ..]
                              CREATE VIEW SpecialisedEntity (SpecialisedAttributes) AS
                                  SELECT g1.Attributes [ [, g2.Attributes] ..]
                                  FROM GeneralisedEntity g1 [ [, GeneralisedEntity g2] ..]
                                  [ WHERE specialised-conditions ]
                             Figure 5.6 : Creation of view table to represent a hierarchy
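
        Applied to the Office hierarchy of figure 5.2, the first template of figure 5.6 would give a
generalised view along the following lines; the attribute correspondences follow the renamings shown
in figure 5.2, and the specialised views (College-Office, Faculty-Office and Dept-Office) are defined
analogously using the second template:

        CREATE VIEW Office (code, siteName, unitName, address, phone, inCharge) AS
            SELECT code,     building, name, address, phone, principal FROM College
            UNION
            SELECT code,     building, name, address, phone, secretary FROM Faculty
            UNION
            SELECT deptCode, building, name, address, phone, head      FROM Department;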


          vi) Cardinality Constraints

        Cardinality constraints specify the minimum / maximum number of instances associated with
a relationship. In a 1:1 relationship type the number of instances associated with a relationship will
be 0 or 1, and in an M:N relationship type it can take any value from 0 upwards. The ability to define
more specific limits allows users not only to gain a better understanding about a relationship, but
also to be able to verify its conformance by using its instances. We suggest creating such
specifications using an extended syntax of the current SQL foreign key definition (cf. statement ‘f’
of figure 5.5) as this is the key which initially establishes this relationship. The minimum / maximum
instance occurrences for a particular relationship of a referential value (i.e. cardinality values) can be
specified using a keyword CARDINALITY as shown in figure 5.5. Here the
Referencing_Cardinality_Value corresponds to the many side of a relationship. Hence the value of
this indicates the minimum instances. When the referencing attribute is not null then the minimum
cardinal value is 1, else it is 0. In our examples introduced in part ‘b’ of section 6.2.3, we have used
‘+’ to represent the many symbol (e.g. 0+ represents zero or many) and ‘-’ to represent up to (e.g. -1
represents 0 to 1).
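
        As an illustration (the entity, attribute and cardinality values are taken loosely from figure 5.2
and are not definitive), the worksFor relationship between Employee and Office could be declared with
its cardinality as follows, using statement ‘f’ of figure 5.5:

        ALTER TABLE Employee ADD CONSTRAINT employee_worksfor_fk
              FOREIGN KEY (worksFor)
              CARDINALITY (4+)
              REFERENCES Office (code)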

          vii) Other Constraints

        In the example in figure 5.2, we have also shown an aggregation relationship between the
entities University and Office. Here we have assumed that a reference to a set of instances can be
defined. In such a situation, as with the other constraint types, an appropriate SQL statement should
be used to describe the constraint and an appropriate augmented table such as those used in figure
5.7 must be used to record this information in the database itself. We discuss this case here for the
purpose of highlighting that other constraint types can be introduced and incorporated into our
system using the same general approach. However our implementations have concentrated only on
the constraints discussed above.

        The enhanced constraints, once they are absorbed into the system, will be stored internally in
the same way as any existing constraints. Hence the reconstruction process to produce an enhanced
conceptual model can utilise this information directly as it is fully automated. To hold the
enhancements in the database itself we need to issue appropriate query statements. The
enhancements can be effected using the SQL statements shown in figure 5.5 if the database is SQL
based and such changes are implicitly supported by it. In section 5.7 we describe how this is done
when the database supports such specifications (e.g. Oracle version 7) and when it does not (e.g.
INGRES version 6). When the DBMS does not support SQL, the query statement to be issued is
translated using QMTS [HOW87] to a form appropriate to the target DBMS. As there are variants of
SQL16 we send all queries via QMTS so that the necessary query statements will automatically get
translated to the target language before entering the target DBMS environment.



5.7 The Knowledge Augmentation Approach

        In this section we describe how the enhanced constraints are retained in a database. Our aim
has been to make these enhancements compatible with the newer versions of commercial DBMSs,
so that the migration process is facilitated as fully as possible.

        Many types of constraint are defined in a conceptual model during database design. These
include relationship, generalisation, existence condition, identity and dependency constraints. In
most current DBMSs these constraints are not represented as part of the database meta-data.
Therefore, to represent and enforce such constraints in these systems, one needs to adopt a
procedural approach which makes use of some embedded programming language code to perform
the task. Our system uses a declarative approach (cf. section 3.6) for constraint manipulation, as it is
easier to process constraints in this form than when they are represented in the traditional form of
procedural code.




   16
        The date functions of most SQL databases (e.g. INGRES and Oracle) are different.


                  CREATE TABLE Table_Constraints (
                      Constraint_Id       char(32) NOT NULL,
                      Constraint_Name     char(32) NOT NULL,
                      Table_Id            char(32) NOT NULL,
                      Table_Name          char(32) NOT NULL,
                      Constraint_Type     char(32) NOT NULL,
                      Is_Deferrable       char(3)  NOT NULL,
                      Initially_Deferred  char(3)  NOT NULL );

                  CREATE TABLE Key_Column_Usage (
                      Constraint_Id       char(32) NOT NULL,
                      Constraint_Name     char(32) NOT NULL,
                      Table_Id            char(32) NOT NULL,
                      Table_Name          char(32) NOT NULL,
                      Column_Name         char(32) NOT NULL,
                      Ordinal_Position    integer(2) );

                  CREATE TABLE Referential_Constraints (
                      Constraint_Id           char(32) NOT NULL,
                      Constraint_Name         char(32) NOT NULL,
                      Unique_Constraint_Id    char(32) NOT NULL,
                      Unique_Constraint_Name  char(32) NOT NULL,
                      Match_Option            char(32) NOT NULL,
                      Update_Rule             char(32) NOT NULL,
                      Delete_Rule             char(32) NOT NULL );

                  CREATE TABLE Check_Constraints (
                      Constraint_Id       char(32) NOT NULL,
                      Constraint_Name     char(32) NOT NULL,
                      Check_Clause        char(240) NOT NULL );

                  CREATE TABLE Sub_tables (
                      Table_Id            char(32)   NOT NULL,
                      Sub_Table_Name      char(32)   NOT NULL,
                      Super_Table_Name    char(32)   NOT NULL,
                      Super_Table_Column  integer(4) NOT NULL );

                  CREATE TABLE Altered_Sub_Table_Columns (
                      Table_Id            char(32) NOT NULL,
                      Sub_Table_Name      char(32) NOT NULL,
                      Sub_Table_Column    char(32) NOT NULL,
                      Super_Table_Name    char(32) NOT NULL,
                      Super_Table_Column  char(32) NOT NULL );

                  CREATE TABLE Cardinality_Constraints (
                      Constraint_Id        char(32) NOT NULL,
                      Constraint_Name      char(32) NOT NULL,
                      Referencing_Cardinal char(32) );

                                      Figure 5.7: Knowledge-based table descriptions


        The constraint enhancement module of our system (CCVES) accepts new constraints (cf.
figure 5.5) irrespective of whether they are supported by the selected DBMS. These new constraints
are the enhanced knowledge which is stored in the current database, using a set of user defined
knowledge-based tables, each of which represents a particular type of constraint. These tables
provide general structures for all constraint types of interest. In figure 5.7 we introduce our table
structures which are used to hold constraint-based information in a database. We have followed the
current SQL-3 approach to representing constraint types supported by the standards. In areas which
the current standards have yet to address (e.g. representation of cardinality constraints) we have
proposed our own table structures. Thus, all general constraints associated with a table (i.e. an entity)
are recorded in Table_Constraints. The constraint description for each type is recorded elsewhere in
other tables, namely, Key_Column_Usage for attribute identifications, Referential_Constraints for
foreign key definitions, Check_Constraints to hold constraint expressions, Sub_Tables to hold
generalisation / specialisation structures (i.e. inherited tables), Altered_Sub_Table_Columns to hold
any attributes renamed during inheritance, and Cardinality_Constraints to hold cardinal values
associated with relationship structures.
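
        For example, when the DBMS itself cannot record a foreign key, an enhancement such as the
Student-Employee tutor relationship might be retained as rows of these tables roughly as follows; the
identifier and constraint name values are illustrative only (cf. figure 5.9):

        INSERT INTO Table_Constraints
            VALUES ('facultyDb', 'student_tutor_fk', 'studentId', 'Student', 'FOREIGN KEY', 'NO', 'NO');
        INSERT INTO Key_Column_Usage
            VALUES ('facultyDb', 'student_tutor_fk', 'studentId', 'Student', 'tutor', 1);
        INSERT INTO Referential_Constraints
            VALUES ('facultyDb', 'student_tutor_fk', 'employeeId', 'employee_empno_uni',
                    'NONE', 'NO ACTION', 'NO ACTION');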

        The use of these table structures to represent constraint-based information in a database
depends on the type of DBMS in use and the features it supports. The features supported by a DBMS
may differ from the standards to which it claims to conform, as database vendors do not always
follow the standards fully when they develop their systems. However, DBMSs supporting the
representation of constraints need not have identical table structures to our approach as they may
have used an alternative way of dealing with constraints. In such situations it is not necessary to
insist on the use of our table structures for constraint representation, as the database is capable of
managing them itself if we follow its approach. Therefore we need to identify which SQL standards
have been used and in which DBMSs we should introduce our own tables to hold enhanced
constraints. In figure 5.8 we identify the tables required for the three SQL standards and for three
selected DBMSs. The selected DBMSs were used as our test DBMSs, as we shall see in section 6.1.

        The CCVES determines for the DBMS being used the required knowledge-based tables and
creates and maintains them automatically. The creation of these tables and the input of data to them
are ideally done at the database application implementation stage, by extracting data from the
conceptual model used originally to design a database. However, as current tools do not offer this
type of facility, one may have to externally define and manage these tables in order to hold this
knowledge in a database. Our system has been primed with the knowledge of the tables required for
each DBMS it supports, and so it automatically creates these tables and stores information in them if
the database is enhanced with new constraints. Here, Table_Constraints, Referential_Constraints,
Key_Column_Usage, Check_Constraints and Sub_Tables are those used by SQL-3 to represent
constraint specifications. SQL-2 has the same tables, except for Sub_Tables. Hence, as shown in
figure 5.8, these tables are not required as augmented tables when a DBMS conforms to SQL-3 or
SQL-2 standards, respectively. Adopting the same names and structures as used in the SQL
standards makes our approach compatible with most database products. We have introduced two
more tables (namely: Cardinality_Constraints and Altered_Sub_Table_Columns) to enable us to
represent cardinality constraints and to record any synonyms involved in generalisation /
specialisation structures. The representation of this type of information is not yet addressed by the
SQL standards.

        CCVES utilises the above mentioned user defined knowledge-based tables not only to
automatically reproduce a conceptual model, but also to enhance existing databases by detecting and
cleaning inconsistent data. To determine the presence of these tables, CCVES looks for user defined
tables such as Table_Constraints, Referential_Constraints, etc., which can appear in known existing
legacy databases only if the DBMS maintains our proposed knowledge-base. For example, in
INGRES version 6 we know that such tables are not maintained as part of its system provision,
hence the presence of tables with these names in this context confirms the existence of our
knowledge-base. Use of our knowledge-based tables is database system specific as they are used
only to represent knowledge that is not supported by that DBMS's meta-data facility. Hence, the
components of two distinct knowledge-bases, e.g. for INGRES version 6 and Oracle version 7, are
different from each other (see figure 5.8).

                                 Table Name                      S1   S2   S3 I     O     P
                                                                 -    -    D V6    V7    V4

                                 Table_Constraints               Y    N    N   Y    N    Y
                                 Referential_Constraints         Y    N    N   Y    N    Y
                                 Key_Column_Usage                Y    N    N   Y    N    Y
                                 Check_Constraints               Y    N    N   N    N    N
                                 Sub_Tables                      Y    Y    N   Y    Y    N
                                 Altered_Sub_Table_Columns       Y    Y    Y   Y    Y    Y
                                 Cardinality_Constraints         Y    Y    Y   Y    Y    Y

                          S1 - SQL/86, S2 - SQL-2, S3 - SQL-3                  Y - Yes, required
                          I - INGRES, O - Oracle, P - POSTGRES                 N - No, not required
                          D - Draft, V - Version

                   Figure 5.8 : Requirement of augmented tables for SQL standards and some current DBMSs


        The different types of constraints are identified using the attribute Constraint_Type of
Table_Constraints, which must have one of the values PRIMARY KEY, UNIQUE, FOREIGN KEY or
CHECK. A set of example instances is given in figure 5.9 to show the types of information held in
our knowledge-based tables. The constraint type NOT NULL may also appear in Table_Constraints
when dealing with a DBMS that does not support NULL value specifications. We have not included
it in our sample data as it is supported by our test DBMSs and all the SQL standards. The constraint
and table identifications in our knowledge-based tables (i.e. Constraint_Id and Table_Id of figure
5.9) may be of composite type, as they need to identify not only the name, but also the schema,
catalog and location of the database.

        Foreign key constraints are associated with their referenced table through a unique constraint.
Hence, the ‘Const_Name_Key’ instance of attribute Unique_Constraint_Name of table
Referential_Constraints should also appear in Key_Column_Usage as a unique constraint. This
means that each of the knowledge-based tables has its own set of properties to ensure the accuracy
and consistency of the information retained in these tables. For instance, Constraint_Type of
Table_Constraints must be one of {'PRIMARY KEY', 'UNIQUE', 'FOREIGN KEY', 'CHECK'} if
these are the only types of constraint that are to be represented. Also, within a particular schema a
constraint name is unique; hence Constraint_Name of Table_Constraints must be unique for a
particular Constraint_Id. In figure 5.10 we present the set of constraints associated with our
knowledge-based tables. Besides these there are a few others which are associated with other system
tables, such as Tables and Columns which are used to represent all entity and attribute names
respectively. Such constraints are used in systems supporting the above constraint types. This allows
us to maintain consistency and accuracy within the constraint definitions.

                  Table_Constraints { Constraint_Id, Constraint_Name, Table_Id, Table_Name, Constraint_Type,
                                    Is_Deferrable, Initially_Deferred }
                      ('dbId', 'Const_Name_PK', 'TableId', 'Entity_Name_PK', 'PRIMARY KEY', 'NO', 'NO')
                      ('dbId', 'Const_Name_UNI', 'TableId', 'Entity_Name_UNI', 'UNIQUE', 'NO', 'NO')
                      ('dbId', 'Const_Name_FK', 'TableId', 'Entity_Name_FK', 'FOREIGN KEY', 'NO', 'NO')
                      ('dbId', 'Const_Name_CHK', 'TableId', 'Entity_Name_CHK', 'CHECK', 'NO', 'NO')

                  Key_Column_Usage { Constraint_Id, Constraint_Name, Table_Id, Table_Name, Column_Name,
                                    Ordinal_Position }
                     ('dbId', 'Const_Name_PK', 'TableId','Entity_Name_PK', 'Attribute_Name_PK', i)
                     ('dbId', 'Const_Name_UNI', 'TableId','Entity_Name_UNI', 'Attribute_Name_UNI', i)
                     ('dbId', 'Const_Name_FK', 'TableId','Entity_Name_FK', 'Attribute_Name_FK', i)

                  Referential_Constraints { Constraint_Id, Constraint_Name,Unique_Constraint_Id,
                                     Unique_Constraint_Name, Match_Option, Update_Rule, Delete_Rule }
                      ('dbId', 'Entity_Name_FK', 'TableId', 'Const_Name_Key', 'NONE', 'NO ACTION', 'NO ACTION')

                  Check_Constraints { Constraint_Id, Constraint_Name, Check_Clause }
                     ('dbId', 'Const_Name_CHK', 'Const_Expression')

                  Sub_Tables { Table_Id, Sub_Table_Name, Super_Table_Name }
                     ('dbId', 'Entity_Name_Sub', 'Entity_Name_Super')

                  Altered_Sub_Table_Columns { Table_Id, Sub_Table_Name, Sub_Table_Column, Super_Table_Name,
                                      Super_Table_Column }
                      ('dbId', 'Entity_Name_Sub', 'newAttribute_Name', 'Entity_Name_Super', 'oldAttribute_Name')

                  Cardinality_Constraints { Constraint_Id, Constraint_Name, Referencing_Cardinal }
                      ('dbId', 'Entity_Name_FK', 'Const_Value_Ref')

                                Figure 5.9 : Augmented tables with different instance occurrences


        Some attributes of these knowledge-based tables are used to indicate when to execute a
constraint and what action is to be taken. Such actions are either application dependent or have no
effect on the approach proposed here, and hence we have used the default values proposed in the standards.
However, it is possible to specify trigger actions like ON DELETE CASCADE so that when a value of
the referenced table is deleted the corresponding values in the referencing table will automatically
get deleted. These features were initially introduced in the form of rule based constraints to allow
triggers and alerters to be specified in databases and make them active [ESW76, STO88]. Such
actions may also have been implemented in legacy ISs as in the case of general constraints. The
constraints used in our constraint enforcement process (cf. section 5.8) are alerters as they draw
attention to constraints that do not conform to the existing legacy data.
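
        For example, such a referential action could be declared using standard SQL syntax of the
form shown below. The entity and attribute names are placeholders in the style of figure 5.9, and
the example is purely illustrative, since our augmented tables record the default NO ACTION rules:

                ALTER TABLE Referencing_Entity_Name
                   ADD CONSTRAINT Const_Name_FK FOREIGN KEY (Attribute_Name_FK)
                       REFERENCES Referenced_Entity_Name
                       ON DELETE CASCADE;   -- recorded via the Delete_Rule attribute of Referential_Constraints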

                 Table_Constraints
                   PRIMARY KEY (Constraint_Id, Constraint_Name)
                   CHECK (Constraint_Type IN ('UNIQUE','PRIMARY KEY','FOREIGN KEY','CHECK') )
                   CHECK ( (Is_Deferrable, Initially_Deferred) IN ( values ('NO','NO'), ('YES','NO'), ('YES','YES') ) )
                   CHECK ( UNIQUE ( SELECT Table_Id, Table_Name FROM Table_Constraints
                       WHERE Constraint_Type = 'PRIMARY KEY' ) )
                 Key_Column_Usage
                   PRIMARY KEY (Constraint_Id, Constraint_Name, Column_Name)
                   UNIQUE (Constraint_Id, Constraint_Name, Ordinal_Position)
                   CHECK ( (Constraint_Id, Constraint_Name) IN (SELECT Constraint_Id, Constraint_Name
                      FROM Table_Constraints WHERE Constraint_Type
                      IN ('UNIQUE', 'PRIMARY KEY','FOREIGN KEY' ) ) )
                 Referential_Constraints
                   PRIMARY KEY (Constraint_Id, Constraint_Name)
                   CHECK ( Match_Option IN ('NONE','PARTIAL','FULL') )
                   CHECK ( Update_Rule IN ('CASCADE','SET NULL','SET DEFAULT','RESTRICT','NO ACTION') )
                   CHECK ( (Constraint_Id, Constraint_Name) IN (SELECT Constraint_Id, Constraint_Name
                      FROM Table_Constraints WHERE Constraint_Type = 'FOREIGN KEY' ) )
                   CHECK ( (Unique_Constraint_Id, Unique_Constraint_Name) IN (
                      SELECT Constraint_Id, Constraint_Name FROM Table_Constraints
                      WHERE Constraint_Type IN ('UNIQUE','PRIMARY KEY') ) )
                 Check_Constraints
                   PRIMARY KEY (Constraint_Id, Constraint_Name)
                 Sub_Tables
                   PRIMARY KEY (Table_Id, Sub_Table_Name, Super_Table_Name)
                 Altered_Sub_Table_Columns
                    PRIMARY KEY (Table_Id, Sub_Table_Name, Super_Table_Name, Sub_Table_Column)
                   FOREIGN KEY (Table_Id, Sub_Table_Name, Super_Table_Name) REFERENCES Sub_Tables
                 Cardinality_Constraints
                   PRIMARY KEY (Constraint_Id, Constraint_Name)
                   FOREIGN KEY (Constraint_Id, Constraint_Name) REFERENCES Referential_Constraints

                                Figure 5.10: Consistency constraints of our knowledge-based tables


        Many other types of constraint are possible in theory [GRE93]. We shall not deal with all of
them as our work is concerned only with constraints applicable at the conceptual modelling stage.
These applicable constraints take the form of logical expressions and are stored in the database using
the knowledge-based table Check_Constraints. They are identified by the keyword 'CHECK' in
Table_Constraints in figure 5.9. Similarly, other constraint types (e.g. rules and procedures) are
represented by means of distinct keywords and tables. Figure 5.9 also includes generalisation and
cardinality constraints. A generalisation hierarchy is defined using the SQL-3 syntax (i.e. UNDER,
see figure 5.5), while a cardinality constraint is defined using an extended foreign key definition (see
figure 5.5). These specifications are also held in the database, using the tables Sub_Tables and
Altered_Sub_Table_Columns for the former and Cardinality_Constraints for the latter (see figure 5.9).

5.8 The Constraint Enforcement Process

         This is an optional process provided by our system, as the third stage of its application to a
database. The objective is to give users the facility to verify / ensure that the data conforms to all the
enhanced constraints. This process is optional so that the user can decide whether these constraints
should be enforced to improve the quality of the legacy database prior to its migration or whether it
is best left as it stands.



        During the constraint enforcement process any violations of the enhanced constraints are
identified. In some cases this may result in removing the violated constraint, as it may be an incorrect
constraint specification. However, the DBA may decide to keep such a constraint, as the violation
may be the result of incorrect data instances or of a change in a business rule that has
occurred during the lifetime of the database. Such a rule may be redefined with a temporal
component to reflect this change. Such data are manageable using versions of data entities as in
object-oriented DBMSs [KIM90].

        We use the enhanced constraint definitions to identify constraints that do not conform to the
existing legacy data. Here each constraint is used to produce a query statement. This query statement
depends on the type of constraint, as shown in figure 5.11. CCVES uses constraint definitions to
produce data manipulation language statements suitable for the host DBMS. Once such statements
are produced, CCVES will execute them against the current database to identify any violated data for
each of these constraints. When such violated data are found for an enhanced constraint it is up to
the user to take appropriate action. Enforcement of such constraints can prevent data rejection by the
target DBMS, possible losses of data and/or delays in the migration process, as the migrating data’s
quality will have been ensured by prior enforcement of the constraints. However, as the enforcement
process is optional, the user need not take immediate action: he can take his own time to determine
the exact reasons for each violation and act at his convenience prior to migration.

5.9 The Migration Process

        The migration process is the fourth and final stage in the application of our approach. This is
incrementally performed by initially creating the meta-data in the target DBMS, using the schema
meta-translation technique of Ramfos [RAM91], and then copying the legacy data to the target
system, using the import/export tools of source and target DBMSs. During this activity, legacy
applications must continue to function until they too are migrated. To support this process we need
to use an interface (i.e. a forward gateway) that can capture and process all database queries of the
legacy application and then re-direct those related to the target system via CCVES. The functionality
that is required here is a distributed query processing facility which is supported by current
distributed DBMSs. However, in our case the source and target databases are not necessarily of the
same type as in the case of distributed DBMSs, so we need to perform a query translation in
preparation for the target environment. Such a facility can be provided using the query meta-
translation technique of Howells [HOW87]. This approach will facilitate transparent migration for
legacy databases as it will allow the legacy IS users to continue working while the legacy data is
being migrated incrementally.
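
        As a simple illustration of the first increment of this process, the sketch below shows the
kind of DDL that would be issued against the target DBMS for one translated legacy entity before its
data is copied across. The entity and attribute names are placeholders in the style of figure 5.9,
and the statement is not the actual output of the schema meta-translator:

                -- Create the translated definition of one legacy entity in the target DBMS,
                -- assuming the target conforms to the SQL-2 constraint syntax.
                CREATE TABLE Entity_Name (
                   Attribute_Name_PK   char(9)   NOT NULL,
                   Attribute_Name_FK   char(5),
                   PRIMARY KEY (Attribute_Name_PK),
                   FOREIGN KEY (Attribute_Name_FK) REFERENCES Referenced_Entity_Name );

                -- The legacy data for Entity_Name is then copied across incrementally using the
                -- import/export tools of the source and target DBMSs.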




 Constraint     Queries to detect Constraint Violation Instances

 Primary Key   SELECT Attribute_Names, COUNT(*) FROM Entity_Name
                GROUP BY Attribute_Names HAVING COUNT(*) > 1
               UNION
               SELECT Attribute_Names, 1 FROM Entity_Name
                WHERE Attribute_Names IS NULL
 Unique        SELECT Attribute_Names, COUNT(*) FROM Entity_Name
                GROUP BY Attribute_Names HAVING COUNT(*) > 1
 Referential   SELECT * FROM Referencing_Entity_Name WHERE NOT
                (Referencing_Attributes IS NULL OR Referencing_Attributes IN
                (SELECT Referenced_Attributes FROM Referenced_Entity_Name))
 Check         SELECT * FROM Entity_Name
                WHERE NOT (Check_Constraint_Expression)
 Cardinality   SELECT Attribute_Names, COUNT(*) FROM Entity_Name
                GROUP BY Attribute_Names HAVING COUNT(*) < Min_Cardinality_Value
               UNION
               SELECT Attribute_Names, COUNT(*) FROM Entity_Name
                GROUP BY Attribute_Names HAVING COUNT(*) > Max_Cardinality_Value

                 Figure 5.11: Detection of violated constraints in SQL
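
        To make these templates concrete, consider a foreign key from the course attribute of a
Student entity to the courseNo key of a Course entity (anticipating the test databases of chapter 6).
The referential template of figure 5.11 then generates the following query, whose result set contains
exactly the Student rows that violate the constraint:

                SELECT * FROM Student
                 WHERE NOT (course IS NULL OR
                            course IN (SELECT courseNo FROM Course));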




CHAPTER 6

                        Test Databases and their Access Process

In this chapter we introduce our example databases, by describing their physical and logical
components. The selection criteria for these test databases, and the associated constraints in
accessing and using them are discussed here. We investigate the tools available for our test
DBMSs. We then apply our re-engineering process to our test databases to show its applicability.
Lastly, we refer to the organisation of system information in a relational DBMS and describe how
we identify and access information about entities, attributes, relationships and constraints in our
test DBMSs.

6.1 Introduction to our Test Databases

        In the following sub-sections we introduce our test databases. We first identify the main
requirements for these databases. This is followed by a description of associated constraints and
their role in database access and use. Finally, we identify how we established our test databases
and the DBMSs we have used for this purpose.

       6.1.1 Main Requirements

        The design of our test databases was based on two important requirements. Firstly, to
establish a suitable legacy test database environment to enable us to demonstrate the practicability
of our re-engineering and migration techniques. Secondly, to establish a heterogeneous database
environment for the test databases to enable us to test the generality of our approach.

        As described in section 2.1.2.1, the problems of legacy databases apply mostly to long
existing database systems. Most of these systems use traditional file-based methods or an old
version of the hierarchical, network or relational database models for their database management.
Due to the complexity and the availability of resources, which are discussed in section 6.1.2, we decided
to focus on a particular type of database model to apply our legacy database enhancement and
migration techniques. Test databases were developed for the chosen database model, while
establishing the required levels of heterogeneity in the context of that model.

       6.1.2 Availability and Choice of DBMSs

        At University of Wales College of Cardiff, where our research was conducted, there were
only a few application databases. These included systems used to process student and staff
information for personnel and payroll applications. This information was centrally processed and
managed using third party software. Due to the licence conditions on this software, the university
did not have the authority to modify or improve it on its own. Also, most of this software was
developed with 3GL technology using files to manipulate information. There were recent
enhancements which had been developed using 4GL tools. However, no proper DBMS had been
used to build any of these applications, although future plans included using Oracle for new
database applications. These databases were therefore not well suited to our work.

        Other than the personnel and payroll applications there were a few departmental and
specific project based applications. Some of these were based on DBMSs (such as Oracle),
although their application details were not readily available. Information gathered from these
sources revealed that not many database applications existed in our university environment and
gaining permission to access them for research purposes was practically impossible. Also, until
we obtained access and investigated each application we would not be able to fully justify its
usefulness as a test database, as it might not fulfil all our requirements. Therefore, it was decided
to initially design and develop our own database applications to suit our requirements and then if
possible to test our system on any other available real world databases.

       Access to DBMSs was restricted to products running on Personal Computers (PCs) and
some Unix systems. Most of these products were based on the relational data model and some on
the object-oriented data model. The older database models - hierarchical and network - were no
longer being used or available as DBMSs. Also, the available DBMSs were in their latest
versions, making the task of building a proper legacy database environment more difficult.

        The relational model has been in use for database applications for over 20 years and is
currently the most widely used data model. During this time many database products and
versions have been used to manage these database applications. As a result, many of them are now
legacy systems and their users need assistance to enhance and migrate them to modern
environments. Thus the choice of the relational data model for our tests is reasonable, although
one may argue that similar requirements exist for database applications which have been in use
prior to this data model gaining its pre-eminent position.

        Due to the superior power of workstations as compared to PCs, it was decided to work on
these Unix platforms and to build test databases using the available relational DBMSs, as our
main aim was simply to demonstrate the applicability of our approach. Two popular commercial
relational DBMSs, namely: INGRES and Oracle, were accessible via the local campus network.
We selected these two products to implement our test databases as they are leading,
commercially-established products which have been in existence since the early days of relational
databases. The differences between these two database products made them ideal for representing
heterogeneity within our test environment. Both products supported the standard database query
language, SQL. However, only one of them (Oracle) conforms to the current SQL-2 standard.
Oracle is also a leading relational database product, along with SYBASE and INFORMIX, on
Unix platforms [ROS94]. As described in section 3.8, SQL standards have been regularly
reviewed and hence it is also important to choose a database environment that will support at least
some of the modern concepts, such as object-oriented features. In recent database products these
features have been introduced either via extended relational or via object-oriented database
technology. Obviously the choice of an extended relational data model is the most suitable for our
purposes as it incorporates natural extensions to the relational data model. Hence we selected
POSTGRES, which is a research DBMS providing modern object-oriented features in an extended
relational model, as our third test DBMS.

       6.1.3 Choice of database applications and levels of heterogeneity


        Designing a single large database application as our test database would result in one very
complex database application. To overcome the need to devise and manage a single complex
application to demonstrate all of our tasks, we decided to build a series of simple applications and
later to provide a single integrated application derived from these simple database applications.

        Our own university environment was chosen to construct these test database systems as we
were able to perform a detailed system study in this context and collect sufficient information to
create appropriate test applications. Typical textbook examples [MCF91, ROB93, ELM94,
DAT95] were also used to verify the contents chosen for our test databases. Three databases
representing college, faculty and department information were chosen for our simple test
databases. To ensure simplicity, no more than ten entities were included for each of these
databases. However, each was carefully designed to enable us to thoroughly test our ideas, as well
as to represent three levels of heterogeneity within our test systems.

        These systems were implemented on different computer systems using different DBMSs
so that they represented heterogeneity at the physical level. INGRES, POSTGRES and Oracle
running on DEC station, SUN Sparc and DEC Alpha, respectively, were chosen. The differences
in characteristics among these three DBMSs introduced heterogeneity at the logical level. Here,
Oracle conforms to the current SQL/92 standard and supports most modern relational data model
requirements. INGRES and POSTGRES, although they are based on the same data model, have
some basic differences in handling certain database functions such as integrity constraints. These
two DBMSs use a rule subsystem to handle constraints, which is a different approach from that
proposed by the SQL standards. POSTGRES, an extended relational DBMS with many object-oriented
features, is also sometimes regarded as an object-oriented DBMS. These
inherent differences ensure the initial heterogeneity of our environment at the logical level. Our
test databases were designed to highlight these logical differences, as we shall see.

6.2 Migration Support Tools for our Test DBMSs

         Prior to creating and applying our approach it was useful to investigate the availability of
tools for our test DBMSs to assist the migration of databases. As indicated in the following sub-
sections, only a few tools are provided to assist this process and they have limited functionality
that is inadequate to assist all the stages of enhancing and migrating a legacy database service.

       6.2.1 INGRES

       INGRES permits manipulation of data in non-INGRES databases [RTI92] and the
development of applications that are portable across all INGRES servers. This type of data
manipulation is done through an INGRES gateway. INGRES/Open SQL, a subset of INGRES
SQL, is used for this purpose. The type of services provided by this gateway include [RTI92]:

    • Translation between Open SQL and non-INGRES SQL DBMS query interfaces such as
      Rdb/VMS (for DEC) or DB2 (for IBM).
    • Conversion between INGRES data types and non-INGRES data types.
    • Translation of non-INGRES DBMS error messages to INGRES generic error types.


        This functionality is useful in creating a target database service. However, as the target
databases supported by INGRES/Open SQL do not include Oracle and POSTGRES, this tool was
not helpful to us. The PRODBI interface for INGRES [LUC93], however, allows access to INGRES
databases from Prolog code. As our main processing is done using Prolog, this tool is useful in our
work, and we have used it to implement our constraint enforcement process.

       Meta-data access from INGRES databases could also have been performed using PRODBI.
However, as it was unavailable at the start of our project, we implemented this using C
programs with embedded SQL. INGRES does not provide any CASE tools that assist in
reverse-engineering or analysing INGRES applications. Its only support was in the form of a 4GL
environment [RTI90b] which is useful for INGRES application development, but not for any
INGRES based legacy ISs and their reverse engineering.

       6.2.2 Oracle

        The latest version of Oracle (i.e. version 7) is an RDBMS that conforms to the SQL-2
standard. Hence, this DBMS supports most modern database functions, including the
specification, representation and enforcement of integrity constraints. Oracle has provided
migration tools to convert databases from either of its two most recent versions (i.e. versions 5 or
6) to version 7.

        Oracle, a leading database product on the Unix platform [ROS94], has its own tool set to
assist in developing Oracle based application systems [KRO93]. This includes screen-based
application development tools SQL*Forms and SQL*Menu, and the report-writing product
SQL*Report. These tools assist in implementing Oracle applications but do not provide any form
of support to analyse the system being developed. To overcome this, a series of CASE products
are provided by Oracle (i.e. CASE*Bridge, CASE*Designer, CASE*Dictionary,
CASE*Generator, CASE*Method and CASE*Project) [BAR90]. The objective of these tools is to
assist users by supporting a structured approach to the design, analysis and implementation of an
Oracle application.

        CASE*Designer provides different views of the application using Entity Relationship
Diagrams, Function Hierarchies, Dataflow Diagrams and matrix handlers to show the inter-
relationship between different objects held in an Oracle dictionary. CASE*Dictionary maintains
complete definitions of the requirements and the detailed design of the application.
CASE*Generator uses these definitions to generate the code for the target environment, and
CASE*Bridge is used to extract information from other Oracle CASE tools or vice versa.
However, such functions can be performed only on applications developed using these tools and
not on an Oracle legacy database developed in any other way, which means they are no help with
the current legacy problem. Hence, Oracle CASE tools are useful when developing new
applications but cannot be used to re-engineer a pre-existing Oracle application, unless that
original application was developed in an Oracle CASE environment. This limitation is shared by
most CASE tools [COMP90, SHA93].

       Currently, Oracle and other vendors are working on overcoming this limitation, and
Oracle’s open systems architecture for heterogeneous data access [HOL93] is a step towards this.
ANSI standard embedded SQL [ANSI89b] is used for application portability along with a set of
function calls. In Oracle’s open systems architecture, standard call level interfaces are used to
dynamically link and run applications on different vendor engines without having to recompile
the application programs. This functionality is a subset of Microsoft’s ODBC [RIC94, GEI95] and
the aim is to provide a transparent gateway to access non-Oracle SQL database products (e.g.
IMS, DB2, SQL/DS and VSAM for IBM machines, or RMS and Rdb for DEC) via Oracle’s
SQL*Connect. Transparent gateway products are machine and DBMS dependent in that they need
to be recompiled or modified to run on different computers and support access to a variety of
DBMSs.

       In the past, developers had to create special code for each type of database their users
wanted to access. This limitation can now be overcome using a tool like ODBC to permit access
to multiple heterogeneous databases. Most database vendors have development strategies which
include plans to interoperate with open systems vendors as well as proprietary database vendors.
This facility is being implemented using the SQL Access Group's17 RDA (Remote Database
Access) standard. As a result, products such as Microsoft’s Open Database Connectivity (ODBC),
INFORMIX-Gateway [PUR93] and Oracle Transparent Gateway [HOL93] support some form of
connectivity between their own and other products.

       For our work with Oracle, we developed our own C programs with embedded SQL to access
and update our prototype Oracle database. A version of PRODBI for Oracle that allows access to
Oracle databases from Prolog code was also used in this project.

        6.2.3 POSTGRES

        POSTGRES was developed at the University of California at Berkeley as a research
oriented relational database extended with object-oriented features. Since 1994 a commercial
version called ILLUSTRA [JAE95] has been available. However, POSTGRES has yet to address
the inter-operability and other issues associated with our migration approach.

6.3 The Design of our Test University Databases

        6.3.1 Background

        In our university system, we assume that departments and faculties have common user
requirements and ideally could share a common database. Based on this assumption we have
developed our test database schema to contain shared information. Hence, our three simple test
databases, known as: College, Faculty and Department, can be easily integrated. A complete
integration of these three databases will result in the generation of a global University database
schema. However, in practice, schemas used by different departments and faculties may differ,

   17
       SQL Access Group (SAG) is a non-profit corporation open to vendors and users that
develops technical specifications to enable multiple SQL-based RDBMS’s and application tools to
interoperate. The specifications defined by the SAG consist of a combination of current and
evolving standards that include ANSI SQL, ISO RDA and X/Open SQL.


making the task of integration more difficult and bringing up more issues of heterogeneity. As our
work is concerned with legacy database issues in a heterogeneous environment and not with
integrating or resolving conflicts that arise in these environments, the differences that exist within
this type of environment were not considered. Hence, we shall be looking at each of these
databases independently. The main advantage of being able to easily integrate our test databases
was the ability, thereby, to readily generate a complex database schema which could also be used
to test our ideas.

         Each test database was designed to represent a specific kind of information, for example
the Faculty and Department databases represent all kinds of structural relationships (e.g. 1:1, 1:M,
and M:N; strong and weak relationships and entity types). The College database represents
specialisation / generalisation structures, while the University database acts as a global system
consisting of all the sub-database systems. This allows all sub-database systems, i.e. College,
Faculty and Department, to act as a distributed system - the University database system. This is
illustrated in figure 6.1 and is further described in section 6.3.2. We also need to be able to specify
and represent all the constraint types discussed in section 3.5, as our re-engineering techniques are
based on constraints. These were chosen to reflect actual database systems as closely as possible.
We introduce these constraints in section 6.4 after identifying the entities of each of our test
databases.




                                              College Database




                             FPS Database                             A Faculty Database




                COMMA Database      MATHS Database               Departmental Databases

                                      Figure 6.1: The UWCC Database
       6.3.2 The UWCC Database

        We shall use the term UWCC database to refer to our example university database, as the
data of our system is based on that used at University of Wales College of Cardiff (UWCC).

        The UWCC database consists of many distributed database sites, each used to perform the
functions either of a particular department or school, or of a faculty, or of the college. The
functions of the college are performed using the database located at the main college, which we
shall refer to as the College database. The College consists of five faculties, each having its own
local database located at the administrative section of the faculty. Our test database has been
populated for one faculty, namely: The Faculty of Physical Science (FPS), and we shall refer to
this database as the Faculty database. The College has 28 departments or schools, with five of
them belonging to FPS [UWC94a, UWC94b]. Our test databases were populated using data from
two departments of FPS, namely: The Department of Computing Mathematics (COMMA), which
is now called the Department of Computer Science, and The Department of Mathematics
(MATHS). These are referred to as Department databases.

       The component databases of our UWCC database form a hierarchy as shown in figure 6.1.
This will let us demonstrate how the global University database formed by integrating these
components incorporates all the functions present in the individual databases. In the next section
we identify our test databases by summarising their entities and specific features.

                  Entity                                   Database
                  Name              (Meaning)             College     Faculty   Department   University
                  University (university data)               x           -            -          x
                  Employee (university employees)            x           x            x          x
                  Student (university students)              x           -            x          x
                  EmpStudent (employees as students)         x           -            -          x
                  College (college data)                     x           -            -          x
                  Faculty (faculty data)                     x           x            -          x
                  Department (department data)               x           x            x          x
                  Committee (faculty committees)             -           x            -          x
                  ComMember (committee members)              -           x            -          x
                  Teach (subjects taught by staff)           -           -            x          x
                  Course (offered by the department)         -           -            x          x
                  Subject (subject details)                  -           -            x          x
                  Option (subjects for each course)          -           -            x          x
                  Take (subjects taken by each student)      -           -            x          x

                                            Table 6.1: Entities used in our test databases


       6.3.3 The Test Database schemas

        The fourteen entities shown in table 6.1 were represented in our test database schemas. As we
are not concerned with heterogeneity issues associated with schema integration, we have
simplified our local schemas by using the same attribute definitions in schemas having the same
entity name. The attribute definitions of all our entities are given in figure 6.2. Each test database
schema is defined using the data definition language (DDL) of the chosen DBMS, and is governed
by a set of rules to establish integrity within the database. In the context of a legacy system these
rules may not appear as part of the database schema. In this situation our approach is to supply
them externally via our constraint enhancement process. Therefore we present the set of
constraints defined on our test databases separately, so that the initial state of these databases
conforms to the database structure of a typical legacy system.

       6.3.4 Features of our Test Database schemas

        Among the specific features represented in our test databases are relationship types which
form weak and link entities, cardinality constraints which highlight the behaviour of entities, and
inheritance and aggregation which form specialised relationships among entities. These features
(if not present) are introduced to our test database schemas by enhancing them with new
constraints.



       a) Relationship types

       Our reverse-engineering process uses the knowledge of constraint definitions to construct
a conceptual model for a legacy database system. The foreign key definitions of table 6.4 along
with their associated primary (cf. table 6.2) and uniqueness constraints (cf. table 6.3) are used to
determine the relationship structures of a conceptual model. In this section we look at our foreign
key constraint definitions to identify the types of relationship formed in our test database schemas.
The check constraints of table 6.5 are used purely to restrict the domain values of our test
databases.

        The foreign keys of table 6.4 are processed to find relationships according to our approach
described in section 5.4.1. Here we identify keys defined on primary key attributes to determine
M:N and 1:N weak relationships. The remaining keys will form 1:N or 1:1 relationships
depending on the uniqueness property of the attributes of these keys. Table 6.6 shows all the
relationships found in our test databases. We have also identified the criteria used to determine
each relationship type according to section 5.4.1.




CREATE TABLE University (                      CREATE TABLE Department (
                Office          char(50)      NOT NULL );       DeptCode       char(5)        NOT NULL,
                                                                Building       char(20)       NOT NULL,
                CREATE TABLE Employee (                         Name           char(50)       NOT NULL,
                 Name          char(25)     NOT NULL,           Address        char(30),
                 Address       char(30)     NOT NULL,           Head           char(9),
                 BirthDate     date(7)      NOT NULL,           Phone          char(13),
                 Gender        char(1)      NOT NULL,           Faculty        char(5)        NOT NULL );
                 EmpNo         char(9)      NOT NULL,
                 Designation   char(30)     NOT NULL,          CREATE TABLE Committee (
                 WorksFor      char(5)      NOT NULL,           Name          char(15)        NOT NULL,
                 YearJoined    integer(2) NOT NULL,             Faculty       char(5)         NOT NULL,
                 Room          char(9),                         Chairperson   char(9) );
                 Phone         char(13),
                 Salary        decimal(8,2) );                 CREATE TABLE ComMember (
                                                                ComName       char(15)        NOT NULL,
                CREATE TABLE Student (                          MemName       char(9)         NOT NULL,
                 Name           char(20)      NOT NULL,         Faculty       char(5)         NOT NULL,
                 Address        char(30)      NOT NULL,         YearJoined    integer(2)      NOT NULL );
                 BirthDate      date(7)       NOT NULL,
                 Gender         char(1)       NOT NULL,        CREATE TABLE Teach (
                 CollegeNo      char(9)       NOT NULL,         Lecturer       char(9)        NOT NULL,
                 Course         char(5)       NOT NULL,         Course         char(5)        NOT NULL,
                 Department     char(5)       NOT NULL,         Subject        char(5)        NOT NULL,
                 Tutor          char(9),                        Room           char(9) );
                 RegYear        integer(2)    NOT NULL );
                                                               CREATE TABLE Course (
                CREATE TABLE EmpStudent (                       CourseNo      char(5)         NOT NULL,
                 CollegeNo     char(9)        NOT NULL,         Name          char(35)        NOT NULL,
                 EmpNo         char(9)        NOT NULL,         Coordinator   char(9),
                 Remark        char(10) );                      Offeredby     char(5)         NOT NULL,
                                                                Type          char(1)         NOT NULL,
                CREATE TABLE College (                          Length        char(10),
                 Code           char(5)       NOT NULL,         Options       integer(2) );
                 Building       char(20)      NOT NULL,
                 Name           char(40)      NOT NULL,        CREATE TABLE Subject (
                 Address        char(30),                       SubNo          char(5)        NOT NULL,
                 Principal      char(9),                        Name           char(40)       NOT NULL );
                 Phone          char(13) );
                                                               CREATE TABLE Option (
                CREATE TABLE Faculty (                          Course         char(5)        NOT NULL,
                 Code           char(5)       NOT NULL,         Subject        char(5)        NOT NULL,
                 Building       char(20)      NOT NULL,         Year           integer(2)     NOT NULL );
                 Name           char(40)      NOT NULL,
                 Address        char(30),                      CREATE TABLE Take (
                 Secretary      char(9),                        CollegeNo      char(9)        NOT NULL,
                 Phone          char(13),                       Subject        char(5)        NOT NULL,
                 Dean           char(9) );                      Year           integer(2)     NOT NULL,
                                                                Grade          char(1) );


                       Figure 6.2: Test database schema entities and their attribute descriptions


        We can see that the selected constraints cover four of the five relationship identification
categories of figure 5.3. The remaining category (i.e. ‘b’) is a special case of category ‘a’ which
could be represented in the entity Take by introducing two separate foreign keys to link entities
Course and Subject, instead of linking with the entity Option. However, as stated in section 5.4.1,
n-ary relationships are simplified whenever possible. Hence, in the test examples presented here
we do not show this type to reduce the complexity of our diagrams. In appendix C we present the
graphical view of all our test databases. The figures there show the graphical representation of all
the relationships identified in table 6.6.

       b) Inheritance

       We have introduced two inheritance structures, one representing a single inheritance and
the other a multiple inheritance (see figure 5.2 and table 6.7). To do so, two generalised entities,
namely: Office and Person, have been introduced (see figure 6.3). Entities College, Faculty and
Department now inherit from Office, while entities Employee and Student inherit from Person.
Entity EmpStudent has been modified to become a specialised combination of Student and
Employee. Figure 6.3 also contains all the constraints associated with these entities.

                           Constraint                                      Entity(s)

                           PRIMARY KEY (office)                            University
                           PRIMARY KEY (empNo)                             Employee
                           PRIMARY KEY (collegeNo)                         Student, EmpStudent
                           PRIMARY KEY (code)                              College, Faculty
                           PRIMARY KEY (deptCode)                          Department
                           PRIMARY KEY (name,faculty)                      Committee
                           PRIMARY KEY (comName,memName,faculty)           ComMember
                           PRIMARY KEY (lecturer,course,subject)            Teach
                           PRIMARY KEY (courseNo)                          Course
                           PRIMARY KEY (subNo)                             Subject
                           PRIMARY KEY (course,subject)                    Option
                           PRIMARY KEY (collegeNo,subject,year)            Take

                              Table 6.2: Primary Key constraints of our test databases


                                   Constraint                Entity(s)

                                   UNIQUE (empNo)            EmpStudent
                                   UNIQUE (name)             College, Department, Faculty
                                   UNIQUE (principal)        College
                                   UNIQUE (dean)             Faculty
                                   UNIQUE (head)             Department
                                   UNIQUE (name,offeredBy)   Course

                            Table 6.3: Uniqueness Key constraints of our test databases



       c) Cardinality constraints

       We have introduced some cardinality constraints on our test databases to show how these
can be specified for a legacy database. In table 6.8 we show those used in the College database.
Here the cardinality constraints for worksFor and faculty have been explicitly specified (see figure
6.3), while the others (inCharge, tutor and dean) have been derived using their relationship types.
For example inCharge and tutor are 1:N relationships while dean is a 1:1 relationship. Our
conceptual diagrams incorporate these constraint values (cf. appendix C and figure 5.2).

                      Constraint                                                       Entity(s)

                      FOREIGN KEY (course) REFERENCES Course                           Student, Option
                      FOREIGN KEY (department) REFERENCES Department                   Student
                      FOREIGN KEY (tutor) REFERENCES Employee                          Student
                      FOREIGN KEY (dean) REFERENCES Employee                           Faculty
                      FOREIGN KEY (faculty) REFERENCES Faculty                         Committee
                      FOREIGN KEY (chairPerson) REFERENCES Employee                    Committee
                      FOREIGN KEY (comName,faculty) REFERENCES Committee               ComMember
                      FOREIGN KEY (memName) REFERENCES Employee                        ComMember
                      FOREIGN KEY (lecturer) REFERENCES Employee                       Teach
                      FOREIGN KEY (course,subject) REFERENCES Option                   Teach
                      FOREIGN KEY (coordinator) REFERENCES Employee                    Course
                      FOREIGN KEY (offeredBy) REFERENCES Department                    Course
                      FOREIGN KEY (subject) REFERENCES Subject                         Option, Take
                      FOREIGN KEY (collegeNo) REFERENCES Student                       Take

                              Table 6.4: Foreign Key constraints of our test databases




Constraint                                                                  Entity(s)

                CHECK (yearJoined >= 21 + birthDate INTERVAL YEAR)                          Employee
                CHECK (salary BETWEEN 200 AND 3000 OR salary IS NULL)                       Employee
                CHECK (regYear >= 18 + birthDate INTERVAL YEAR)                             Student
                CHECK (phone IS NOT NULL)                                                   College, Department, Faculty
                CHECK (type IN ('U','P','E','O'))                                           Course
                CHECK (options >= 0 OR options IS NULL)                                     Course
                CHECK (year BETWEEN 1 AND 7)                                                Option

                                    Table 6.5: Check constraints of our test databases


       d) Aggregation

        A university has many offices (e.g. faculties, departments etc.) and an office belongs to a
university. Also, attribute office is the key of entity University. Hence, entities University and
Office participate in a 1:1 relationship. However, it is natural to represent this as a specialised
relationship by considering office of University to be of type set. Then University and Office
participate in an aggregation relationship which is a special form of a binary relationship. We
introduce this type to show how specialised constraints could be introduced into a legacy database
system. As shown in figure 6.3 we have used the keyword REF SET to specify this type of
relationship. In this case, as office is the key of University, a foreign key definition on office (see
figure 6.3) will treat University as a link entity and hence can be classified as a special
relationship.

                             Attribute(s)            Entity       Relationship            Entity(s)            Criteria

                  course                           Student           1     :N      Course                        (d)
                  department                       Student           1     :N      Department                    (d)
                  tutor                            Student           1     :N      Employee                      (d)
                  dean                             Faculty           1     :1      Employee                      (e)
                  faculty                          Committee         1     :N      Faculty                       (c)
                  chairPerson                      Committee         1     :N      Employee                      (d)
                  comName, faculty, memName        ComMember         M     :N      Committee, Employee           (a)
                  lecturer, course, subject        Teach             M     :N      Employee, Option              (a)
                  coordinator                      Course            1     :N      Employee                      (d)
                  offeredBy                        Course            1     :N      Department                    (d)
                  course, subject                  Option            M     :N      Course, Subject               (a)
                  collegeNo, subject               Take              M     :N      Student, Subject              (a)

                                        Table 6.6: Relationship types of our test databases


                                                 Entity           Inherited Entities
                                                 Employee         Person
                                                 Student          Person
                                                 EmpStudent       Student, Employee
                                                 College          Office
                                                 Faculty          Office
                                                 Department       Office

                                                   Table 6.7: Inherited Entities


                        Participating        Referencing      Referenced        Referencing      Referenced
                        Attribute            Entity           Entity            Cardinality      Cardinality
                        inCharge             Office           Employee              0+                  -1
                        worksFor             Employee         Office                4+                   1
                        tutor                Student          Employee              0+                  -1
                        dean                 Faculty          Employee             -1                   -1
                        faculty              Department       Faculty              2-12                  1

                                       Table 6.8: Cardinality constraints of College database




6.4 Constraints Specification, Enhancement and Enforcement

        In the context of legacy systems, our test database schemas (cf. figure 6.2) will not
explicitly contain most of the constraints introduced in tables 6.2 to 6.5, 6.7 and 6.8. Thus we need
to specify them using the approach described in section 5.6. In figure 6.3 we present these
constraints for the College database.

                CREATE TABLE Office
                   (code, siteName, unitName, address, inCharge, phone) AS
                   SELECT code, building, name, address, principal, phone FROM College
                   UNION SELECT code, building, name, address, secretary, phone FROM Faculty
                   UNION SELECT deptCode, building, name, address, head, phone FROM Department;
                ALTER TABLE Office
                   ADD CONSTRAINT Office_PK PRIMARY KEY (code)
                   ADD CONSTRAINT Office_Unique_name UNIQUE (siteName, unitName)
                   ADD CONSTRAINT Office_FK_Staff FOREIGN KEY (inCharge) REFERENCES Employee
                   ADD UNIQUE (phone);

                ALTER TABLE College ADD UNDER Office
                   WITH (siteName AS building, unitName AS name, inCharge AS principal);
                ALTER TABLE Faculty ADD UNDER Office
                   WITH (siteName AS building, unitName AS name, inCharge AS secretary);
                ALTER TABLE Department ADD UNDER Office
                   WITH (code AS deptCode, siteName AS building, unitName AS name, inCharge AS head)
                   ADD FOREIGN KEY (faculty) CARDINALITY (2-12) REFERENCES Faculty;

                CREATE VIEW College_Office AS SELECT * FROM Office
                   WHERE code in (SELECT code FROM College);
                CREATE VIEW Faculty_Office AS
                   SELECT o.code, o.siteName, o.unitName, o.address, o.inCharge, o.phone, f.dean
                   FROM Office o, Faculty f WHERE o.code = f.code;
                CREATE VIEW Dept_Office AS
                   SELECT o.code, o.siteName, o.unitName, o.address, o.inCharge, o.phone, d.faculty
                   FROM Office o, Department d WHERE o.code = d.deptCode;

                ALTER TABLE University
                   ALTER COLUMN office REF SET(Office) NOT NULL
                   | ADD FOREIGN KEY (office) REFERENCES Office ;

                CREATE TABLE Person AS
                   SELECT name, address, birthDate, gender FROM Employee
                   UNION SELECT name, address, birthDate, gender FROM Student;
                ALTER TABLE Person
                   ADD PRIMARY KEY (name, address, birthDate)
                   ADD CHECK (gender IN ('M', 'F'));

                ALTER TABLE Employee ADD UNDER Person
                   ADD CONSTRAINT Employee_FK_Office FOREIGN KEY (worksFor)
                        CARDINALITY (4+) REFERENCES Office;
                ALTER TABLE Student ADD UNDER Person;
                ALTER TABLE EmpStudent ADD UNDER Student, Employee
                   ADD CHECK (tutor <> empNo OR tutor IS NULL);

                    Figure 6.3 : Enhanced constraints of college database in extended SQL-3 syntax
        When the above constraints are not all supported by a legacy database management
system, we need to be able to store them in the database using our knowledge
augmentation techniques (cf. section 5.7). In figure 6.4 we present selected instances used in our
knowledge-based tables to represent the enhanced constraints for the College database. The
selected instances cover all the possible constraint types, so we have not reproduced all the
enhanced constraints of figure 6.3.



        Our constraint enforcement process (cf. section 5.8) allows users to verify the extent to
which the data in a database conforms to its enhanced constraints. The different types of queries
used for this process in the College database are given in figure 6.5.

                  Table_Constraints { Constraint_Id, Constraint_Name, Table_Id, Table_Name, Constraint_Type,
                                  Is_Deferrable, Initially_Deferred }
                     ('Uni_db', 'Office_PK', 'Col', 'Office', 'PRIMARY KEY', 'NO', 'NO')
                     ('Uni_db', 'Office_Unique_name', 'Col', 'Office', 'UNIQUE', 'NO', 'NO')
                     ('Uni_db', 'Office_FK_Staff', 'Col', 'Office', 'FOREIGN KEY', 'NO', 'NO')
                     ('Uni_db', 'Employee_PK', 'Col', 'Employee', 'PRIMARY KEY', 'NO', 'NO')
                      ('Uni_db', 'Employee_FK_Office', 'Col', 'Employee', 'FOREIGN KEY', 'NO', 'NO')
                     ('Uni_db', 'College_phone', 'Col', 'College', 'CHECK', 'NO', 'NO')
                 Referential_Constraint { Constraint_Id, Constraint_Name,Unique_Constraint_Id,
                                  Unique_Constraint_Name, Match_Option, Update_Rule, Delete_Rule }
                     ('Uni_db', 'Office_FK_Employee', 'Col', 'Employee_PK', 'NONE', 'NO ACTION', 'NO ACTION')
                     ('Uni_db', 'Employee_FK_Office', 'Uni', 'Office_PK', 'NONE', 'NO ACTION', 'NO ACTION')
                 Key_Column_Usage { Constraint_Id, Constraint_Name, Table_Id, Table_Name, Column_Name,
                                  Ordinal_Position }
                    ('Uni_db', 'Office_PK', 'Col', 'Office', 'Code', 1)
                    ('Uni_db', 'Office_Unique_name', 'Col', 'Office', 'siteName', 1)
                    ('Uni_db', 'Office_Unique_name', 'Col', 'Office', 'unitName', 2)
                    ('Uni_db', 'Office_FK_Staff', 'Col', 'Office', 'inCharge', 1)
                    ('Uni_db', 'Employee_PK', 'Col', 'Employee', 'empNo', 1)
                    ('Uni_db', 'Employee_FK_Office', 'Col', 'Employee', 'worksFor', 1)
                 Check_Constraint { Constraint_Id, Constraint_Name, Check_Clause }
                    ('Uni_db', 'College_phone', 'phone is NOT NULL')
                 Sub_Tables { Table_Id, Sub_Table_Name, Super_Table_Name }
                    ('Uni_db', 'College', 'Office')
                 Altered_Sub_Table_Columns { Table_Id, Sub_Table_Name, Sub_Table_Column, Super_Table_Name,
                                   Super_Table_Column }
                     ('Uni_db', 'College', 'building', 'Office', 'siteName')
                     ('Uni_db', 'College', 'name', 'Office', 'unitName')
                     ('Uni_db', 'College', 'principal', 'Office', 'inCharge')
                 Cardinality_Constraint { Constraint_Id, Constraint_Name, Referencing_Cardinal }
                     ('Uni_db', 'Office_FK_Employee', '0+')
                     ('Uni_db', 'Employee_FK_Office', '4+')

                       Figure 6.4 : Augmented tables with selected sample data for the college database


6.5 Database Access Process

        Having described the application of our re-engineering processes using our test databases,
we identify the tools developed and used to access those databases. The database access process is
the initial stage of our application. This process extracts meta-data from legacy databases and
represents it internally so that it can be used by other stages of our application.

       During re-engineering we need to access a database at three different stages: to extract
meta-data and any existing constraint knowledge specifications to commence our reverse-
engineering process; to add enhanced knowledge to the database; and to verify the extent to which
the data conforms to the existing and enhanced constraints. We also need to access the database
during the migration process. In all these cases, the information we require is held in either system
or user-defined tables. Extraction of information from these tables can be done using the query
language of the database, thus what we need for this stage is a mechanism that will allow us to
issue queries and capture their responses.
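
        For illustration, the kind of mechanism we have in mind can be pictured as a single Prolog
predicate that sends a query string to a connected database and returns the matching rows as a list
of Prolog terms. The predicate name, its arguments and the example data below are assumptions of
this sketch rather than the actual PRODBI or CCVES interface, which is described in the remainder
of this section.

                 % Sketch of the assumed query interface (predicate and data are
                 % illustrative only, not the actual PRODBI or CCVES interface):
                 %
                 %    db_query(+Connection, +QueryString, -Rows)
                 %
                 % issues QueryString against the database identified by Connection and
                 % returns the answer as a list of rows, each row a list of column values.

                 % A stub clause standing in for a live connection (data invented):
                 db_query(uni_db,
                          'SELECT empNo, worksFor FROM Employee',
                          [['E01', 'CS'], ['E02', 'CS']]).

                 % Example use: count the rows returned by a query.
                 row_count(Connection, Query, N) :-
                     db_query(Connection, Query, Rows),
                     length(Rows, N).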




                    Constraint Type    Constraint Violation Instances

                    Primary Key       SELECT code, COUNT(*) FROM Office
                                       GROUP BY code HAVING COUNT(*) > 1
                                      UNION SELECT code, 1 FROM Office WHERE code IS NULL
                    Unique            SELECT dean, COUNT(*) FROM Faculty
                                       GROUP BY dean HAVING COUNT(*) > 1
                    Referential       SELECT * FROM Office WHERE NOT (inCharge IS NULL OR
                                       inCharge IN (SELECT empNo FROM Employee))
                    Check             SELECT * FROM College WHERE NOT (phone IS NOT NULL)
                    Cardinality       SELECT worksFor, COUNT(*) FROM Employee
                                       GROUP BY worksFor HAVING COUNT(*) < 4

                   Figure 6.5: Selected constraints to be enforced for the college database in SQL


        As our system implementation is in Prolog, the necessary query statements are generated
from Prolog rules. The PRODBI interface allows access to several relational DBMSs, namely:
Oracle, INGRES, INFORMIX and SYBASE [LUC93], from Prolog as if their relational tables were
part of the Prolog environment. The availability of the INGRES PRODBI interface enabled us to use
it to communicate with our INGRES test databases in the latter stages of our project. Its performance
is comparable to that of INGRES/SQL, so database interaction is fully transparent to the user. Such
Prolog database interface tools are currently commercially available only for relational database
products, so we could not use this approach to access our POSTGRES test databases. Tools such as
ODBC allow access to heterogeneous databases; this option would have been ideal for our
application, but was not pursued because it was unavailable within our development time scale.
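
        As a concrete illustration of this query generation, the sketch below shows how the primary
key violation query of figure 6.5 could be composed from the name of a table and its (single) key
column. The predicate name is illustrative and atomic_list_concat/2 is an SWI-Prolog built-in used
here for brevity; the rules actually used by CCVES are more general.

                 % Sketch: compose the primary key violation query of figure 6.5
                 % (assumes a single-column key; predicate name is illustrative).
                 pk_violation_query(Table, KeyCol, Query) :-
                     atomic_list_concat(['SELECT ', KeyCol, ', COUNT(*) FROM ', Table,
                                         ' GROUP BY ', KeyCol, ' HAVING COUNT(*) > 1',
                                         ' UNION SELECT ', KeyCol, ', 1 FROM ', Table,
                                         ' WHERE ', KeyCol, ' IS NULL'], Query).

                 % ?- pk_violation_query('Office', code, Q).
                 % binds Q to the first query shown in figure 6.5.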

        As far as our work is concerned, we needed a facility to issue specific types of query and
obtain the response in such a way that Prolog could process responses without having to download
the entire database. The PRODBI interfaces for relational databases perform this task efficiently,
and also have many other useful data manipulation features. Due to the absence of any PRODBI
equivalent tools to access non-relational or extended-relational DBMSs, we decided to develop
our own version for POSTGRES. Here the functionality of our POSTGRES tool is to accept a
POSTGRES DML statement (i.e. a POSTQUEL query statement) and produce the results for that query
in a form that is usable by our (Prolog-based) system.
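
        A minimal sketch of such a wrapper is given below, assuming that the query is handed to a
command-line client and that its output is read back line by line. The client name (pg_client), its
options and the SWI-Prolog library predicates used here are assumptions made for illustration; the
actual CCVES tool need not work this way.

                 :- use_module(library(process)).
                 :- use_module(library(readutil)).

                 % Sketch only: send one POSTQUEL statement to a hypothetical
                 % command-line client for database Db and collect its output lines.
                 postgres_query(Db, Statement, Lines) :-
                     process_create(path(pg_client), ['-d', Db, '-c', Statement],
                                    [stdout(pipe(Out))]),
                     read_result_lines(Out, Lines),
                     close(Out).

                 read_result_lines(Stream, Lines) :-
                     read_line_to_string(Stream, Line),
                     (   Line == end_of_file
                     ->  Lines = []
                     ;   Lines = [Line|Rest],
                         read_result_lines(Stream, Rest)
                     ).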

        For Oracle, a PRODBI interface is available commercially, and to use it with our system
the only change we would have to make is to load the Oracle library. No other commands in our
code would change, since the Oracle interface supports the same Prolog rules as the INGRES one.
However, at Cardiff only the PRODBI interface for INGRES was available, and even that only
became available in the latter stages of our project. We therefore developed our own tool to provide
this functionality for INGRES and Oracle databases. The implementation of this tool was not fully
generalised, given that such tools were already available commercially. When developing it we were
not too concerned about performance degradation, as our aim was to test functionality, not
performance. In the case of INGRES we subsequently confirmed performance by using a
commercially developed PRODBI tool with an equivalent SQL query facility.

       6.5.1 Connecting to a database

       To establish a connection with a database the user needs to specify the site name (i.e. the
location of the database), the DBMS name (e.g. Oracle v7) and the database name (e.g.
department) to ensure a unique identification of a database located over a network. The site name
is the address of the host machine (e.g. thor.cf.ac.uk) and is used to gain access to that machine
via the network. The type of the named DBMS identifies the kind of data to be accessed, and the
name18 of the database tells us which database is to be used in the extraction process.

        In our system (CCVES), we provide a pop-up window (cf. left part of figure 6.6) to select
and specify these requirements. Here, a set of commonly used site names and the DBMSs
currently supported at a site are embedded in the menu to make this task easy. The specification of
new site and database names can also be done via this pop-up menu (cf. right part of figure 6.6).
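
        Once these three items have been chosen they form the environmental data of the session and
are recorded internally as the ccves_data facts introduced later in section 7.2.2. For the example
values mentioned above, the recorded facts might look as follows (a sketch; the values are taken
from the text and the exact capture code belongs to the connection dialogue):

                 % Environmental data recorded for the example connection above
                 % (format as in figure 7.3; values taken from the text).
                 ccves_data(dbname, department).
                 ccves_data(dbms,   'Oracle v7').
                 ccves_data(host,   'thor.cf.ac.uk').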

         6.5.2 Meta-data extraction

        Once a physical connection to a database is achieved it is possible to commence the meta-
data extraction process. This process is DBMS dependent as the kind of meta-data represented in a
database and the methods of retrieving it vary between DBMSs. The information to be extracted
is recorded in the system catalogues (i.e. data dictionaries) of respective databases. The most basic
type of information is entity and attribute names, which are common to all DBMSs. However,
information about different types of constraints is specific to DBMSs and may not be present in
legacy database system catalogues.




                         Figure 6.6: Database connection process of CCVES

        The organisation of meta-data in databases differs with DBMSs, although all relational
database systems use some table structure to represent this information. For example, the table
structure for Oracle user tables is straightforward, as they are kept separate from the system
tables, whereas in INGRES it is more complex because all tables are held in a single catalogue,
with attribute values used to differentiate user-defined tables from system and view tables. Hence the extraction query
statements to retrieve entity names of a database schema differ for each system, as shown in table
6.9. These query statements indicate that the meta-data extraction process is done using the query
language of that DBMS (e.g. SQL for Oracle and POSTQUEL for POSTGRES) and that the
query table names and conditions vary with the type of the DBMS. This clearly demonstrates the
DBMS dependency of the extraction process. Once the meta-data is obtained from system
catalogues we can process it to produce the database schema in the DDL formalism of the source
database and to represent this in our internal representation (see section 7.2). The extraction
process for entity names (cf. table 6.9) covers only one type of information. A similar process is
used to extract all the other types of information, including our enhanced knowledge-based tables.
Here, the main difference is in the queries used to extract meta-data and any processing required
to map the extracted information into our internal structures, which are introduced in section 7.2
(see also appendix D).

   18
        For simplicity, identification details like the owner of the database are not included here.
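
        The DBMS dependency of this step can be pictured as a small dispatch table in Prolog, with
one clause per supported DBMS returning the appropriate catalogue query of table 6.9. This is only
a sketch with an assumed predicate name; the real extraction code must also post-process the
answers into the internal structures of section 7.2.

                 % Sketch: entity-name extraction query per DBMS (cf. table 6.9;
                 % the predicate name is illustrative).
                 entity_name_query(oracle_v7,
                     'SELECT table_name FROM user_tables').
                 entity_name_query(ingres_v6,
                     'SELECT table_name FROM iitables WHERE table_type=''T'' AND system_use=''U''').
                 entity_name_query(postgres_v4,
                     'RETRIEVE pg_class.relname WHERE pg_class.relowner!=''6''').
                 entity_name_query(sql3,
                     'SELECT table_name FROM tables WHERE table_type=''BASE TABLE''').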

       6.5.3 Meta-data storage

        The generated internal structures are stored in text files for further use as input data for our
system. These text files are stored locally using distinct directories for each database. The system
uses the database connection specifications to construct a unique directory name for each
database (e.g. department-Oracle7-thor.cf.ac.uk). We have given public access to these files so
that the stored data and knowledge is not only reusable locally, but also usable from other sites.
This directory structure concept provides a logically coherent database environment for users. It
means that any future re-engineering processes may be done without physically connecting to the
database (i.e. by selecting a database logically from one of the public directories instead).
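
        The directory name itself is simply a combination of the three connection items. A one-line
sketch, using an SWI-Prolog built-in for illustration, is:

                 % Sketch: build the per-database directory name used above,
                 % e.g. department-Oracle7-thor.cf.ac.uk
                 db_directory(DbName, Dbms, Host, Dir) :-
                     atomic_list_concat([DbName, Dbms, Host], '-', Dir).

                 % ?- db_directory(department, 'Oracle7', 'thor.cf.ac.uk', D).
                 % D = 'department-Oracle7-thor.cf.ac.uk'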

                  DBMS           Query

                  Oracle V7      SELECT table_name FROM user_tables;
                  INGRES V6      SELECT table_name FROM iitables WHERE table_type='T' and system_use='U';
                  POSTGRES V4    RETRIEVE pg_class.relname WHERE pg_class.relowner!='6';
                  SQL-3          SELECT table_name FROM tables WHERE table_type='BASE TABLE';

                         Table 6.9: Query statements to extract entity names of a database schema


       The process of connecting to a database and accessing its meta-data usually does not take
much time (e.g. at most 2 minutes). However, trying to access an active database whenever a user
wants to view its structure slows down the regular activities of this database. Also, local working
is more cost effective than regularly performing remote accesses. This alternative also guarantees
access to the database service as it is not affected by network traffic and breakdowns. We
experienced such breakdowns during our system development, especially when accessing
INGRES databases. A database schema can be considered to be static, whereas its instances are
not. Hence, the decision to simulate a logical database environment after the first physical remote
database access is justifiable because it allows us to work on meta-data held locally.



       6.5.4 Schema viewing

        As meta-data is stored in text files at the end of each database access session, it is possible to skip
the stages described in sections 6.5.1 to 6.5.3 when viewing a database schema which has been
accessed recently. During a database connection session, our system will only extract and store the
meta-data of a database. Once the database connection process is completed the user needs to
invoke a schema viewing session. Here, the user is prompted with a list of the current logically
connected databases, as shown on the left of figure 6.7. When a database is selected from this list,
its name descriptions (i.e. database name and associated schema names) are placed in the main
window of CCVES (cf. right of figure 6.7). The user selects schemas to view them. Our reverse-
engineering process is applied at this point. Here, meta-data extracted from the database schema is
processed further to derive the necessary constructs to produce the conceptual model as an E-R or
OMT diagram.

       CCVES allows multiple selections of the same database schema (i.e. by selecting the same
schema from the main window; cf. right of figure 6.7). As a result, multiple schema visualisation
windows can be produced for the same database. The advantage of this is that it allows a user to
simultaneously view and operate on different sections of the same schema, which otherwise would
not be visible simultaneously due to the size of the overall schema (i.e. we would have to scroll
the window to make other parts of the schema visible). Also, the facility to visualise schemas
using a user preferred display model means that the user could now view the same schema
simultaneously using different display models.




                 Figure 6.7: Database selection and selected databases of CCVES
        To produce a graphical view of a schema, we apply our reverse-engineering process. This
process uses the meta-data which we extracted and represented internally. In chapter 7 we
introduce the representation of our internal structures and describe the internal and external
architecture and operation of our system, CCVES.




CHAPTER 7

                         Architecture and Operation of CCVES

In this chapter the Conceptualised Constraint Visualisation and Enhancement System (CCVES) is
defined by describing its internal architecture and operation - i.e. the way in which different legacy
database schemas are processed within CCVES in the course of enhancing and migrating them to a
target DBMS’s schema - and its external architecture and operation - i.e. CCVES as seen and operated
by its users. Finally, we look at the possible migrations that can be performed using CCVES.

7.1 Internal Architecture of CCVES

       In previous chapters, we discussed overall information flow (section 2.2), our re-
engineering process (section 5.2) and the database access process (section 6.5). Here we describe
how the meta-data accessed from a database is stored and manipulated by CCVES in order to
successfully perform its various tasks.

        There are two sources of input information available to CCVES (cf. figure 7.1): initially,
by accessing a legacy database service via the database connectivity (DBC) process, and later by
using the database enhancement (DBE) process. This information is converted into our internal
representation (see section 7.2) and held in this form for use by other modules of CCVES. For
example, the Schema Meta-Visualisation System (SMVS) uses it to display a conceptual model of
a legacy database, the Query Meta-Translation System (QMTS) uses it to construct queries that
verify the extent to which the data conforms to existing and enhanced constraints, and the Schema
Meta-Translation system (SMTS) uses it to generate and create target databases for migration.

7.2 Internal Representation

        To address heterogeneity issues, meta-representation and translation techniques have been
successfully used in several recent research projects at Cardiff [HOW87, RAM91, QUT93,
IDR94]. A key to this approach is the transformation of the source meta-data or query into a
common internal representation which is then separately transformed into a chosen target
representation. Thus components of a schema, referred to as meta-data, are classified as entity
(class) and attribute (property) on input, and are stored in a database language independent fashion
in the internal representation. This meta-data is then processed to derive the appropriate schema
information of a particular DBMS. In this way it is possible to use a single representation and yet
deal with issues related to most types of DBMSs. A similar approach is used for query
transformation between source and target representations.
                 [Figure 7.1 is a block diagram: the DBC and DBE processes feed the common
                  Internal Representation, which is in turn used by the SMVS, QMTS and SMTS modules.]

                                  Figure 7.1: Internal Architecture of CCVES


         The meta-data we deal with has been classified into two types. The first category
represents essential meta-data and the other represents derived meta-data. Information that
describes an entity and its attributes, and constraints that identify relationships/hierarchies among
entities are the essential meta-data (see section 7.2.1), as they can be used to build a conceptual
model. Information that is derived for use in the conceptual model from the essential meta-data
constitutes the other type of meta-data. When performing our reverse-engineering process we look
only at the essential meta-data. This information is extracted from the respective databases during
the initial database access process (i.e. DBC in figure 7.1).

       7.2.1 Essential Meta-data

        Our essential (basic) meta-data internal representation captures sufficient information to
allow us to reproduce a database schema using the DDL syntax of any DBMS. This representation
covers entity and view definitions and their associated constraints. The following 5 Prolog style
constructs were chosen to represent this basic meta-data, see figure 7.2. The first two constructs,
namely: class and class property, are fundamental to any database schema as they describe the
schema entities and their attributes, respectively. The third construct represents constraints
associated with entities. This information is only partially represented by most DBMSs. The next
two constructs are relevant only to some recent object-oriented DBMSs and are not supported by
most DBMSs. We have included them mainly to demonstrate how modern abstraction
mechanisms such as inheritance hierarchies could be incorporated into legacy DBMSs. By a
similar approach, it is possible to add any other appropriate essential meta-data constructs. For
conceptual modelling, and for the type of testing we perform for the chosen DBMSs, namely:
Oracle, INGRES and POSTGRES, we found that the 5 constructs described here are sufficient.
However, some additional environmental data (see section 7.2.2) which allows identification of
the name and the type of the current database is also essential.




                     1. class(SchemaId, CLASS_NAME).
                    2. class_property(SchemaId, CLASS_NAME, PROPERTY_NAME, PROPERTY_TYPE).
                    3. constraint(SchemaId, CLASS_NAME, PROPERTY_list, CONST_TYPE, CONST_NAME,
                                      CONST_EXPR).
                    4. class_inherit(SchemaId, CLASS_NAME, SUPER_list).
                    5. renamed_attr(SchemaId, SUPER_NAME, SUPER_PROP_NAME, CLASS_NAME,
                                      PROPERTY_NAME).


                          Figure 7.2: Our Essential Meta-data Representation Constructs


        We now provide a detailed description of our meta-representation constructs. This
representation is based on the concepts of the Object Abstract Conceptual Schema (OACS)
[RAM91] used by Ramfos in his SMTS and other meta-processing systems. Hence we have used the same
name to refer to our own internal representation. Ramfos’s OACS internal representation provides
a natural abstraction of a particular structure based on the notion of objects. For example, when an
object is described, its attributes, constraints and other related properties are treated as a single
construct although only part of it may be used at a time. Our OACS representation directly
resembles the internal representation structure of most relational DBMSs (e.g. class represents an
entity, class_property represents the list of attributes of an entity). This is the main difference
between the two representations. However it is possible to map the OACS constructs of Ramfos to
our internal representation and vice-versa, hence our decision to use a variation of the original
OACS does not affect the meta-representation and processing principles in general.

• Meta-data Representation of class

    The names of all the entities for a particular schema are recorded using class. This
    information is processed to identify all the entities of a database schema.



• Meta-data Representation of class_property

    The names of all attributes and their data types for a particular schema are recorded using
    class_property. This information is processed to identify all attributes of an entity.

• Meta-data Representation of constraint

    All types of constraints associated with entities are recorded using constraint. This
    information has been organised to represent constraints as logical expressions, along with an
    identification name and participating attributes. Different types of constraint, i.e. primary key,
    foreign key, unique, not null, check constraints, etc., are each processed and stored in this
    form. Usually a certain amount of preprocessing is required for the construction of our
    generalised representation of a constraint. For example, some check constraints extracted
    from the INGRES DBMS need to be preprocessed to allow them to be classified as check
    constraints by our system.

• Meta-data Representation of class_inherit



    Entities that participate in inheritance hierarchies are recorded using class_inherit. The names
    of all super-entities for a particular entity are recorded here. This information is processed to
    identify all sub-entities of an entity and the inheritance hierarchies of a database schema.

• Meta-data Representation of renamed_attr

    During an inheritance process, some attribute names may be changed to give more meaningful
    names for the inherited attributes. Once the inherited names are changed it becomes
    impossible to automatically reverse-engineer these entities, as their attribute names no longer
    match. To overcome this problem we have introduced an additional meta-data representation
    construct, namely: renamed_attr, which keeps track of all attributes whose names have
    changed due to inheritance. This is a representation of synonyms for attribute names of an
    inheritance hierarchy.
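
        To make these constructs concrete, the sketch below shows how a small fragment of the
College database of chapter 6 (the Office entity, its primary key, and the College sub-table with its
renamed attributes, cf. figures 6.3 and 6.4) might be recorded. The attribute types, and the exact
encodings of the constraint type and constraint expression, are assumptions made for illustration.

                 % Illustrative instances of the five constructs for part of the
                 % College database (types and constraint encodings are assumed).
                 class('Uni_db', 'Office').
                 class('Uni_db', 'College').

                 class_property('Uni_db', 'Office', code,     char(4)).
                 class_property('Uni_db', 'Office', siteName, char(20)).
                 class_property('Uni_db', 'Office', unitName, char(20)).
                 class_property('Uni_db', 'Office', inCharge, char(6)).

                 constraint('Uni_db', 'Office', [code], primary_key, 'Office_PK',
                            'PRIMARY KEY (code)').

                 class_inherit('Uni_db', 'College', ['Office']).

                 renamed_attr('Uni_db', 'Office', siteName, 'College', building).
                 renamed_attr('Uni_db', 'Office', unitName, 'College', name).
                 renamed_attr('Uni_db', 'Office', inCharge, 'College', principal).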

       7.2.2 Environmental Data

       This is recorded using ccves_data, which is used to represent three types of information,
namely: the database name, the DBMS name and the name of the host machine (see figure 7.3).
These are captured at the database connection stage.

       7.2.3 Processed Meta-data

        The essential meta-data described in section 7.2.1 is processed to derive additional
information required for conceptual modelling. This additional information is schema_data,
class_data and relationship. Here, schema_data (cf. figure 7.4, section 1) identifies all entities
(all_classes, using class of figure 7.2 section 1) and entity types (link_classes and weak_classes,
by the process described in section 5.4, using constraint types such as primary and foreign key
which are recorded in constraint of figure 7.2 section 3). Class_data (cf. figure 7.4 section 2)
identifies all class properties (property_list, using class_property of figure 7.2 section 2), inherited
properties (using class_property, class_inherit and renamed_attr of figure 7.2 sections 2, 4 and 5,
respectively), sub- and super- classes (subclass_list and superclass_list, using class_inherit of
figure 7.2) and referencing and referenced classes (ref and refed, using foreign key constraints
recorded in constraint of figure 7.2). Relationship (cf. figure 7.4 section 3) records the relationship
types (derived using the process described in section 5.4) and cardinality information (using
derived relationship types and available cardinality values).

                                    ccves_data(dbname, DATABASE_NAME).
                                    ccves_data(dbms, DBMS_NAME).
                                    ccves_data(host, HOST_MACHINE_NAME).


                             Figure 7.3: OACS Constructs used as environmental data




                       1. schema_data(SchemaId, [
                                 all_classes(ALL_CLASS_list),
                                 link_classes(LINK_CLASS_list),
                                 weak_classes(WEAK_CLASS_list) ]).
                       2. class_data(SchemaId, CLASS_NAME, [
                                 property_list(OWN_PROPERTY_list, INHERIT_PROPERTY_list),
                                 subclass_list(SUBCLASS_list),
                                 superclass_list(SUPERCLASS_list),
                                 ref(REFERENCING_CLASS_list),
                                 refed(REFERENCED_CLASS_list) ]).
                       3. relationship(SchemaId, REFERENCING_CLASS_NAME, RELATIONSHIP_TYPE,
                                 CARDINALITY, REFERENCED_CLASS_NAME).

                                         Figure 7.4: Derived OACS Constructs
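
        To illustrate how the derived constructs can be computed from the essential meta-data, the
sketch below derives the list of all classes and the sub-class list of a given class. The predicate
names and the use of findall/3 are illustrative of the approach rather than the actual CCVES code.

                 :- use_module(library(lists)).    % for member/2

                 % Sketch: deriving parts of schema_data and class_data (figure 7.4)
                 % from the essential meta-data of figure 7.2.
                 all_classes(SchemaId, Classes) :-
                     findall(C, class(SchemaId, C), Classes).

                 subclass_list(SchemaId, Class, Subs) :-
                     findall(Sub,
                             ( class_inherit(SchemaId, Sub, Supers),
                               member(Class, Supers) ),
                             Subs).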


       7.2.4 Graphical Constructs

       Besides the above OACS representations it is necessary to support additional constructs to
produce a graphical display of a conceptual model. For this we produce graphical constructs using
our derived OACS constructs (cf. figure 7.4) and apply a display layout algorithm (see section
7.3). We call these graphical object abstract conceptual schema (GOACS) constructs, as they are
graphical extensions of our OACS constructs.

        The graphical display represents entities, their attributes (optional), relationships, etc.,
using graphical symbols which consist of strings, lines and widgets (basic toolkit objects which,
unlike strings and lines, retain their data after being written to the screen [NYE93]). To produce
this display, coordinates of the positions of all entities,
relationships etc., are derived and recorded in our graphical constructs. The coordinates of each
entity are recorded using class_info as shown in section 1 of figure 7.5. This information identifies
the top left coordinates of an entity.

                  1. class_info(SchemaId, CLASS_NAME, [ x(X0), y(Y0) ] ).

                  2. box(SchemaId, X0, Y0, W, H, REGULAR_CLASS_NAME).
                     box_box(SchemaId, X0, Y0, W, H, Gap, WEAK_CLASS_NAME).
                     diamond_box(SchemaId, X0, Y0, W, H, LINK_CLASS_NAME).

                  3. ref_info(Schema_Id, REFERENCING_CLASS_NAME,
                           REFERENCING_CLASS_CONNECTING_SIDE,
                           REFERENCING_CLASS_CONNECTING_SIDE_COORDINATE,
                           REFERENCED_CLASS_NAME, REFERENCED_CLASS_CONNECTING_SIDE,
                           REFERENCED_CLASS_CONNECTING_SIDE_COORDINATE).

                  4. line(SchemaId, X1, Y1, X2, Y2).
                     string(SchemaId, X0, Y0, STRING_NAME).
                     diamond(SchemaId, X0, Y0, W, H, ASSOCIATION_NAME).

                  5. property_line(SchemaId, CLASS_NAME, X1, Y1, X2, Y2).
                     property_string(SchemaId, CLASS_NAME, PROPERTY_NAME, DISPLAY_COLOUR, X0, Y0).


                                        Figure 7.5: Graphical Constructs (GOACS)


        The graphical symbol for an entity depends on the entity type. Thus, further processing is
required to graphically categorise entity types. For the EER model, we categorise entities: regular
as box, weak as box_box and link as diamond_box (cf. section 2 of figure 7.5, and figure 7.6). We
use an intermediate representation construct, namely: ref_info, to assist in the derivation of
appropriate coordinates for all associations (cf. section 3 of figure 7.5). With the assistance of
ref_info, coordinates to represent relationships are derived and recorded using line, string and
diamond (cf. section 4 of figure 7.5, and figure 7.7).

        Users of our schema displays are allowed to interact with schema entities. During this
process, optional information such as properties (i.e. attributes) of selected entities can be added to
the display. This feature is the result of providing the entities and their attributes at different levels
of abstraction. The added information is recorded separately using property_line and
property_string (cf. section 5 of figure 7.5, and figure 7.8).
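
        For example, a single regular entity placed by the layout algorithm, together with one of its
selected attributes, could give rise to the following GOACS facts (all coordinates and sizes here are
invented for illustration):

                 % Illustrative GOACS facts for one regular entity and one displayed
                 % attribute (all numeric values are assumed).
                 class_info('Uni_db', 'Office', [x(40), y(60)]).
                 box('Uni_db', 40, 60, 120, 50, 'Office').
                 property_string('Uni_db', 'Office', code, black, 48, 118).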


                 [Figure 7.6 is a diagram: a REGULAR CLASS is drawn as a box, a WEAK CLASS as a
                  box_box (a box within a box, offset by Gap) and a LINK CLASS as a diamond_box; each
                  symbol is placed at (X0,Y0) with width W and height H.]

                          Figure 7.6: Graphical representation of entity types in EER notations



                 [Figure 7.7 is a diagram: a line is drawn between (X1,Y1) and (X2,Y2), a string places
                  STRING_NAME at (X0,Y0), and a diamond of width W and height H at (X0,Y0) carries an
                  association name.]

                       Figure 7.7: Graphical representation of connections, labels and associations
                                                    in EER notations




                 [Figure 7.8 is a diagram: a property_line is drawn between (X1,Y1) and (X2,Y2), and a
                  property_string places PROPERTY_NAME at (X0,Y0).]

                             Figure 7.8: Graphical representation of selected attributes of a class
                                                      in EER notations


7.3 Display Layout Algorithm

        To produce a suitable display of a database schema it was necessary to adopt an intelligent
algorithm which determines the positioning of objects in the display. Such algorithms have been
used by many researchers and also commercially for similar purposes [CHU89]. We studied these
ideas and implemented our own layout algorithm which proved to be effective for small,
manageable database schemas. However, to allow displays to be altered to a user preferred style
and for our method to be effective for large schemas we decided to incorporate an editing facility.
This feature allows users to move entities and change their original positions in a conceptual
schema. Internally, this is done by changing the coordinates recorded in class_info for a
repositioned entity and recomputing all its associated graphical constructs.



        When the location of an entity is changed the connection side of that entity may also need
to be changed. To deal with this case, appropriate sides for all entities can be derived at any stage
of our editing process. When appropriate sides are derived, the ref_info construct (cf. section 3 of
figure 7.5) is appropriately updated to enable us to reproduce the revised coordinates of line,
string and diamond constructs (cf. section 4 of figure 7.5).

       Our layout algorithm does the following things:

    1. Entities connected to each other are identified (i.e. grouped) using their referenced entity
       information. This process highlights unconnected entities as well as entities forming
       hierarchies or graph structures.
    2. Within a group, entities are rearranged according to the number of connections associated
       with them. This arrangement puts entities with most connections in the centre of the
       display structure and entities with the least connections at the periphery.
    3. A tree representation is then constructed starting from the entity having the most
       connections. During the construction of subsequent trees, entities which have already been
       used are not considered, to prevent their original position being changed. This makes it
       easy to visualise relationships/aggregations present in a conceptual model. The
        identification of such properties allows us to gain a better understanding of the application
       being modelled. Similarly, attempts are made to highlight inheritance hierarchies whenever
       they are present. However, when too many inter-related entities are involved, it is
       sometimes necessary to use the move editing facility to relocate some entities so that their
       properties (e.g. relationships) are highlighted in the diagram. The existence of such hidden
       structures is due to the cross connection of some entities. To prevent overlapping of
       entities, relationships, etc., initial placement is done using an internal layout grid. However,
       the user is permitted to overlap or place entities close to each other during schema editing.

       The coordinate information of a final diagram is saved in disk files, so that these
coordinates are automatically available to all subsequent re-engineering processes. Hence our
system first checks for the existence of a file containing these coordinates and only in its absence
would it use the above layout algorithm.
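
        A sketch of step 2 of this algorithm - ordering the entities of a group so that the most
connected ones come first and are therefore placed centrally - is given below. It counts connections
using the relationship construct of figure 7.4; the predicate names and the SWI-Prolog library
predicates are used for illustration only.

                 :- use_module(library(lists)).
                 :- use_module(library(pairs)).

                 % Sketch of step 2 of the layout algorithm (names are illustrative).
                 connection_count(SchemaId, Class, N) :-
                     findall(x,
                             ( relationship(SchemaId, Class, _, _, _)
                             ; relationship(SchemaId, _, _, _, Class) ),
                             Links),
                     length(Links, N).

                 order_by_connections(SchemaId, Group, Ordered) :-
                     findall(N-C,
                             ( member(C, Group), connection_count(SchemaId, C, N) ),
                             Pairs),
                     sort(1, @>=, Pairs, Sorted),     % most connections first
                     pairs_values(Sorted, Ordered).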

7.4 External Architecture and Operation of CCVES

       We characterise CCVES by first considering the type of people who may use this system.
This is followed by an overview of the external system components. Finally the external
operations performed by the system are described.

       7.4.1 CCVES Operation

         The three main operations of CCVES, i.e. analysis, enhancement and incremental migration,
need to be performed by a suitably qualified person. This person must have a good knowledge of
the current database application to ensure that only appropriate enhancements are made to it. Also,
this person must be able to interpret and understand conceptual modelling and the data
manipulation language SQL, as we have used the SQL syntax to specify the contents of databases.


This person must have the skills to design and operate a database application. Thus they are a
more specialised user than the traditional ad hoc user. We shall therefore refer to this person as a
DataBase Administrator (DBA), although they need not be a professional DBA. It is this person
who will be in charge of migrating the current database application.

        To this DBA the process of accessing meta-data from a legacy database service in a
heterogeneous distributed database environment is fully automated once the connection to the
database of interest is made. The production of a graphical display representation for the relevant
database schema is also fully automated. This representation shows all available meta-data, links
and constraints in the existing database schema. All links and constraints defined by hand coding
in the legacy application (i.e. not in the database schema but appearing in the application in the
form of 3GL or equivalent code) will not be shown until such links and constraints are supplied to
CCVES during the process of enhancing the legacy database. Such enhancements are represented
in the database itself to allow automatic reuse of these additions, not only by our system users but
also by others (i.e. users of other database support tools).

        The enhancement process will assist the DBA in incrementally building the database
structure for the target database service. Possible decomposable modules for the legacy system are
identified during this stage. Finally, when the incremental migration process has been performed,
the DBA may need to review its success by viewing both the source and the target database
schemas. This is achieved using the facility to visualise multiple heterogeneous databases.

       We have sought to meet our objectives by developing an interactive schema visualisation
and knowledge acquisition tool which is directed by an inference engine using a real world data
modelling framework based on the EER and OMT conceptual models and extended relational
database modelling concepts. This tool has been implemented in prototype form mostly in Prolog,
supported by some C language routines embedded with SQL to access and use databases built
with the INGRES DBMS (version 6), Oracle DBMS (versions 6 and 7) or POSTGRES O-O data
model (version 4). The Prolog code which does the main processing and uses X window and
Motif widgets exceeds 13,000 lines, while the C language embedded with SQL code is from 100
to 1,000 lines depending on the DBMS.

       7.4.2 System Overview

        This section defines the external architecture and operation of CCVES. It covers the design
and structure of its main interfaces, namely: database connection (access), database selection
(processing) and user interaction (see figure 7.9). The heart of the system consists of a meta-
management module (MMM) (see figure 7.10), which processes and manages meta-data using a
common internal intermediate schema representation (cf. section 7.2). A presentation layer which
offers display and dialog windows has been provided for user interaction. The schema
visualisation, schema enhancement, constraint visualisation and database migration modules (cf.
figure 7.9) communicate with the user.




                 [Figure 7.9 is a flow diagram: starting from the GUI, the Database Access stage
                  (Connect Database) leads to Database Processing (Select Database), which leads to
                  User Interaction; from there the user may invoke Schema Visualisation, Constraint
                  Visualisation, Schema Enhancement, Database Migration, a Query Tool or other
                  Database Tools.]

                                  Figure 7.9: Principal processes and control flow of CCVES


        The meta-data and knowledge for this system is extracted from respective database system
tables and stored using a common internal representation (OACS). This knowledge is further
processed to derive the graphical constructs (GOACS) of a visualised conceptual model.
Information is represented in Prolog as dynamic predicates, describing facts about graphical and
textual schema components and the semantic relationships that hold between those facts. The meta-
management module has access to the selected database to store any changes (e.g. schema
enhancements) made by the user. The input / output interfaces of MMM manage the presentation
layer of CCVES. This consists of X window and Motif widgets used to create an interactive
graphical environment for users.

        In section 2.2 we introduced the functionality of CCVES in terms of information flow with
special emphasis on its external components (cf. figure 2.1). Later, in sections 2.3 and 7.1, we
described the main internal processes of CCVES (cf. figures 2.2 and 7.1). Here, in figure 7.10, we
show both internal and external components of CCVES together with special emphasis on the
external aspect.

        7.4.3 System Operation

        The system has three distinct operational phases: meta-data access, meta-data processing
and user interaction. In the first phase, the system communicates with the source legacy database
service to extract meta-data19 and knowledge20 specifications from the database concerned. This
is achieved when connection to the database (connect database of figure 7.10) is made by the
system, and is the meta-data access phase. In the second phase, the source specifications extracted
from the database system tables are analysed, along with any graphical constructs we may have
subsequently derived, to form the meta-data and meta-knowledge base of MMM. This information
is used to produce a visual representation in the form of a conceptual model. This phase is known
as meta-data processing and is activated when select database (cf. figure 7.10) is chosen by the
user. The final phase is interaction with the user. Here, the user may supply the system with
semantic information to enrich the schema; visualise the schema using a preferred modelling
technique (EER and OMT are currently available); select graphical objects (i.e. classes) and
visualise their properties and intra- and inter- object constraints using the constraint window; and
modify the graphical view of the displayed conceptual model. He may also incrementally migrate
selected schema constructs; transfer selected meta-data to other tools (e.g. MITRA, a Query Tool
[MAD95]); accept meta-data from other tools (e.g. REVEERD, a reverse-engineering tool
[ASH95]); and examine the same database using another window of the CCVES or other database
design tools (e.g. Oracle*Design). The objective of providing the user with a wide range of design
tools is to optimise the process of analysing the source legacy database. The enhancement of the
legacy database with constraints is an attempt to collect, in the legacy database, the kind of
information that is managed by modern DBMSs, without affecting its operation and in preparation
for its migration.

   19
      meta-data represents the original database schema specifications of the database.
   20
      knowledge represents subsequent knowledge we may have already added to augment this
database schema.


                 [Figure 7.10 is a block diagram. The Designer interacts with the presentation layer
                  of display and dialog windows: a constraint window and constraint visualiser, schema
                  display windows (OMT/EER) with select, move, define and transfer operations, and the
                  connect database and select database dialogs (e.g. Dept-Oracle, College-POSTGRES,
                  Faculty-INGRES). Beneath this layer, the Meta-Management Module comprises the
                  meta-knowledge base (GOACS), the meta-processor with its meta-transformation and
                  meta-translation (input/output) components, and the meta-storage system (OACS). It
                  connects downwards to the heterogeneous distributed databases (to connect to a
                  database and to store and enforce external constraints) and exchanges database
                  schemas, SQL-3 constraints and external constraints with text files and other database
                  tools (e.g. GQL, Oracle*Design).]

                                            Figure 7.10: External Architecture of CCVES


       For successful system operation, users need not be aware of the internal schema
representation or any other non-SQL database specific syntax of the source or target database.
This is because all source schemas are mapped into our internal representation and are always
presented to the user using the standard SQL language syntax (unless specifically requested
otherwise). This enables the user to deal with the problem of heterogeneity, since at the global
level local databases are viewed as if they come from the same DBMS. The SQL syntax is used
by default to express the associated constraints of a database. If specifically requested, the SQL
syntax can be translated and viewed using the DDL of the legacy DBMS; as far as CCVES is
concerned this is just performing another meta-translation process. A textual version of the
original legacy database definition is also created by CCVES when connection to the legacy
database is established. This definition may be viewed by the user for better understanding of the
database being modelled.

       The ultimate migration process allows the user to employ a single target database
environment for all legacy databases. This will assist in removing the physical heterogeneity
between those databases. The complete migration process may take days for large information
systems as they already hold a large volume of data. Hence the ability to enhance and migrate
while legacy databases continue to function is an important feature. Our enhancement process
does not affect existing operations as it involves adding new knowledge and validating existing
data. Whenever structural changes are introduced (e.g. an inheritance hierarchy), we have
proposed the use of view tables (cf. section 5.6) to ensure that normal database operations will not
be affected until the actual migration is commenced. This is because some data items may
continue to change while the migration is in preparation, and indeed during migration itself. We
have proposed an incremental migration process to minimise this effect, and the use of a forward
gateway to deal with such situations.

7.5 External Interfaces of CCVES

       CCVES is seen by users as consisting of four processes, namely: a database access
process, a schema and constraint visualisation process, a schema enhancement process, and a
schema migration process. The database access process was described in section 6.5. In the next
subsections we describe the other three processes of CCVES to complete the picture.

       7.5.1 Schema and Constraint Visualisation

        The input / output interfaces of MMM manage the presentation layers of CCVES. These
layers consist of display and dialog windows used to provide an interactive graphical environment
for users. The user is presented with a visual display of the conceptual model for a selected
database, and may perform many operations on this schema display window (SDW) to analyse,
enhance, evolve, visualise and migrate any portion of that database. Most of these operations are
done via SDW as they make use of the conceptual model.

        The traditional conceptual model is an E-R diagram which displays only entities, their
attributes and relationships. This level of abstraction gives the user a basic idea of the structure of
a database. However this information is not sufficient to gain a more detailed understanding of the
components of a conceptual model, including identification of intra- and/or inter- object
constraints. Intra-object constraints for an entity provide extra information that allows the user to
identify behavioural properties of the entity. For instance, the attributes of an entity do not provide
sufficient information to determine the type of data that may be held by an attribute and any
restrictions that may apply to it. Hence providing a higher level of abstraction by displaying
constraints along with their associated entities and attributes gives the user a better understanding
of the conceptual model. The result is much more than a static entity and attribute description of a
data model as it describes how the model behaves for dynamic data (i.e. a constraint implies that
any data item which violates it cannot be held by the database).

        The schema visualisation module allows users to view the conceptual schema and
constraints defined for a database. This visualisation process is done using three levels of
abstraction. The top level describes all the entity types along with any hierarchies and
relationships. The properties of each entity are viewed at the next level of abstraction to increase
the readability of the schema. Finally, all constraints associated with the selected entities and their
properties are viewed to gain a deeper and better understanding of the actual behaviour of selected
database components. The conceptual diagrams of our test databases are given in appendix C.
These diagrams are at the top level of abstraction. Figures 7.11 and 7.12 show the other two levels
of abstraction.

        The graphical schema displayed in the SDW for a selected database uses by default the
OMT notation, which can be changed to EER from a menu. Users can produce any number of
schema displays for the same schema, and thus can visualise a database schema using both OMT
and EER diagrams at the same time (a picture of our system while viewing the same schema using
both forms is provided in figure 7.11). The display size of the schema may also be changed from a
menu. A description that identifies the source of each display is provided as we are dealing with
many databases in a heterogeneous environment. The diagrams produced by CCVES can be
edited to alter the location of their displayed entities and hence to permit visualisation of a schema
in the personally preferred style and format of an individual user. This is done by selecting and
moving a set of entities within the scrolling window, thus altering the relative positions of entities
within the diagram produced. These changes can be saved and automatically restored for the next
session by users.

         The system allows interactive selection of entities and attributes from the SDW. We
initially do not include any attributes as part of the displayed diagram, because we provide them
as a separate level of abstraction. A list of attributes associated with an entity can be viewed by
first selecting the entity from the display window (abstraction at level 2), and then browsing
through its attributes in the attribute window, which is placed just below the display window (see
figure 7.12). Any attribute selected from this window will automatically be transferred to the
main display window, so that only attributes of interest are displayed there. This technique
increases the readability of our display window. At each stage, appropriate messages produced by
the system display unit are shown at the bottom of this window. For successful interactions,
graphical responses are used whenever applicable. For example, when an entity is selected by
clicking on it, the selected entity will be highlighted. In this thesis we do not provide interaction
details as these are provided at the user manual level.

        The 'browser' menu option for SDW will invoke the browser window. This is done only
when a user wishes to visit the constraint visualisation abstraction level, our third level of
abstraction. Here we see all entities and their properties of interest as the default option, but we
can expand this to display other entities and properties by choosing appropriate menu options
from the browsing window. We can also filter the displayed constraints to include those of interest
(e.g. show only domain constraints). In cases where inherited attributes are involved the system
will initially show only those attributes owned by an entity (the default option); others can be
viewed by selecting the required level of abstraction (available in the menu) for the entity
concerned.

        The ability to select an entity from the display window and display its properties in the
browsing window satisfies our requirement of visualising intra-object constraints. The reverse of
this process, i.e. selecting a constraint from the browsing window and displaying all its associated
entities in the display window, satisfies our inter-object constraint visualisation requirement. Both
of these facilities are provided by our system (see figures 7.11 and 7.12, respectively).

        All operations with the mouse are performed using the left button, except when altering
the location of an entity in a displayed conceptual model. The middle button of the mouse is used
to select, drag and drop [21] such an entity. This process alters the position of the entity and
redraws the conceptual model. By this means, object hierarchies, relationships, etc. can be made
prominent by placing related objects close to each other. This feature was introduced firstly to
allow users to visualise a conceptual model in their preferred manner, and secondly because our
display algorithm cannot automatically produce such layouts when constructing complex schemas
with many entities, hierarchies and relationships (cf. section 7.3).




   [21] CCVES changes the cursor symbol to confirm the activation of this mode.


7.5.2 Schema Enhancement

        Schema enhancements are also done via the schema display window. This module is
mainly used to specify dynamic constraints. These constraints are usually extracted from the
legacy code of a database application, as in older systems they were specified in the application
itself. Constraint extraction from legacy applications is not supported by CCVES. Hence, this
information must be obtained by other means. We assume that such constraints can be identified
by examining the legacy code, consulting any existing documentation or drawing on users'
knowledge of the application, and this module has been introduced to capture them. We have also
provided an option to detect possible missing, hidden and redundant information (cf. section 5.5.2)
to assist users in formulating new constraints. The user specifies constraints via the constraint
enhancement interface by choosing the constraint type and associated attributes. In the case of a
check constraint, the condition itself is specified using SQL syntax.
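
        For illustration, a check constraint entered through this interface would take a form such
as the following, where the table and column names are purely hypothetical and not drawn from
our test databases:

        -- illustrative only: salaries must lie within a plausible range
        ALTER TABLE employee
            ADD CONSTRAINT chk_emp_salary
            CHECK (salary BETWEEN 0 AND 100000);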

        The constraint enhancement process allows further constraints to appear in the graphical
model. Each addition is reflected in the graphical display, so that the user remains aware of the
links and constraints present in the schema. For instance, when a foreign key is
specified, CCVES will try to derive a relationship from this information. If this process is
successful a new relationship will appear in the conceptual model.
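
        For example, a foreign key specified through this interface corresponds to a declaration of
the following form (again with purely illustrative names), from which CCVES derives a
relationship between the two entities concerned:

        -- illustrative only: each employee row must reference a department row
        ALTER TABLE employee
            ADD CONSTRAINT fk_emp_dept
            FOREIGN KEY (dept_no) REFERENCES department (dept_no);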

        A graphical user interface in the form of a pop-up sub-menu is used to specify constraints,
which take the form of integrity constraints (e.g. primary key, foreign key, check constraints) and
structural components (e.g. inheritance hierarchies, entity modifications). In figure 7.13 we
present some pop-up menus of CCVES which assist users in specifying various types of
constraints. Here, names of entities and their attributes are automatically supplied via pull-down
menus to ensure the validity of certain input components of user-specified constraints. For all
constraints, information about the type of constraint and the class involved is initially specified.
When the type of constraint is known, prior existence of such a constraint is checked in the case of
primary key and foreign key constraints. For primary keys, the process will not proceed if a key
specification already exists; for foreign keys, if the attributes already participate in such a
relationship they will not appear in the referencing attribute specification list. In the case of
referenced attributes, only attributes with at least the uniqueness property will appear in order to
prevent specification of any invalid foreign keys. All enhanced constraints are stored internally
until they are added to the database using another menu option. Prior to this augmentation process
the user should verify the validity of the constraints. In the case of recent DBMSs like Oracle,
invalid constraints will be rejected automatically and the user will be requested to amend them or
discard them. In such situations the incorrect constraints are reported to the user and are stored on
disk in a log file. Also, changes made during a session are not saved until the user specifically
instructs this, which gives the opportunity to roll back (in the event of an incorrect addition)
and resume from the previous state.




Figure 7.13: Two stages of a Foreign Key constraint specification

       Input data in the form of constraints to enhance the schema provides new knowledge about
a database. It is essential to retain this knowledge within the database itself, if it is to be used for
any future processing. CCVES achieves this task using its knowledge augmentation process as
described in section 5.7. From a user’s point of view this process is fully automated and hence no
intermediate interactions are involved. The enhanced knowledge is stored in this augmented form
only if the database is unable to represent the new knowledge naturally. Such knowledge cannot be
enforced automatically except by migrating the database to a newer version of its DBMS, or to
another DBMS that supports it. However, the data held in the database may not already conform to
the new constraints, and hence existing data may be rejected by the target DBMS, resulting in loss
of data and/or migration delays. To address this problem, we provide an optional constraint
enforcement process which checks the conformance of the data to the new constraints prior to
migration.
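
        The sketch below is intended only to indicate the idea behind this augmentation; it assumes
a simple holding table for constraint definitions and is not the actual layout used by CCVES (cf.
section 5.7):

        -- assumed layout, for illustration only: constraints that the legacy DBMS
        -- cannot itself represent are recorded as data within the database, so that
        -- they remain available to CCVES, to other tools and to a target DBMS
        CREATE TABLE ccves_constraint (
            table_name       CHAR(32),
            constraint_name  CHAR(32),
            constraint_type  CHAR(12),      -- e.g. 'CHECK', 'FOREIGN KEY'
            definition       VARCHAR(240)
        );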

       7.5.3 Constraint Enforcement and Verification

        Constraint enforcement is automatically managed only by relatively recent DBMSs. If
CCVES is used to enhance a recent DBMS such as Oracle then verification and enforcement will
be handled by the DBMS, as CCVES will just create constraints using the DDL commands of that
DBMS. However, when such features are not supported by the underlying DBMS, CCVES has to
provide such a service itself. Our objective in this process is to give users the facility to ensure
that the database conforms to all the enhanced constraints. Constraints are chosen from the
browser window to verify their validity. Once selected, the constraint verification process will
issue each constraint to the database using the technique described in section 5.8 and report any
violations to the user. When a violated constraint is detected, the user can decide whether to keep
or discard it. The user could decide to retain the constraint in the knowledge-base for various
reasons. These include: ensuring that future data conforms to the constraint; providing users with
a guide to the expected data contents, irrespective of occasional violations; and assisting the user
in improving either the data or the constraint. To enable the enforcement of such
constraints for future data instances, it is necessary to either use a trigger (e.g. on append check
constraint) or add a temporal component to the constraint (e.g. system date > constraint input
date). This will ensure that the constraint will not be enforced on existing data.
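
        As a sketch of the second option (with hypothetical table, column and date, and a date
literal written in whatever form the DBMS concerned accepts), a check constraint added on 1
September 1996 could carry a temporal guard so that only rows entered after that date are required
to satisfy it, assuming the table records an entry date for each row:

        -- rows entered before the constraint was added are exempted by the second
        -- disjunct; rows entered afterwards must satisfy the check proper
        ALTER TABLE employee
            ADD CONSTRAINT chk_salary_new_rows
            CHECK (salary > 0 OR entry_date < '1996-09-01');

The trigger alternative achieves the same effect procedurally, at the cost of DBMS-specific syntax.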



        When queries are used to verify enhanced constraints, the rows retrieved are the instances
that violate a constraint. Retrieving a large number of such instances for a given query is of little
value, as this is more likely to indicate an incorrect constraint specification than a problem with
the data itself. Therefore, if the output exceeds 20 instances, we terminate the query and instruct
the user to inspect this constraint as a separate action.
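
        Such a verification query simply retrieves the rows that fail the constraint; for a
hypothetical check that salary > 0 it would take the form:

        -- the rows returned are the violations of the enhanced constraint
        SELECT *
        FROM   employee
        WHERE  NOT (salary > 0);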

       7.5.4 Schema Migration

       The migration process is provided to allow an enhanced legacy system to be ported to a
new environment. This is performed incrementally, by first creating the schema in the target
DBMS and then copying the legacy data to the target system. To create the schema in the target
system, DDL statements are generated by CCVES. An appropriate schema meta-translation
process is performed if required (e.g. if the target DBMS has a non-SQL query language). The
legacy data is migrated using the import/export tools of the source and target DBMSs.
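
        As an indication of what the schema creation step produces (the table definition is
illustrative only and is not one of our test schemas), the DDL generated for an Oracle target might
take the form:

        CREATE TABLE employee (
            emp_no     NUMBER(6)      NOT NULL,
            name       VARCHAR2(30),
            dept_no    NUMBER(4),
            CONSTRAINT pk_employee PRIMARY KEY (emp_no),
            CONSTRAINT fk_emp_dept FOREIGN KEY (dept_no)
                       REFERENCES department (dept_no)
        );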

       The migration process is not fully automated as certain conflicts cannot be resolved
without user intervention. For example, if the target DBMS only accepts names of up to 16
characters (as in Oracle) whereas the source database (as in INGRES) allows 32, then a name
resolution process must be performed by the user. Also, names used in one DBMS may be
keywords in another. Our system resolves these problems by adding a tag to such names or by
truncating a name. This approach is not fully general, as the uniqueness of an attribute name
cannot be guaranteed when it is truncated. In these situations user involvement is unavoidable.

        Although CCVES has been tested with only three types of DBMS, namely INGRES,
POSTGRES and Oracle, it could easily be adapted for other relational DBMSs, as they represent
their meta-data similarly, i.e. in the form of system tables, with minor differences such as table
and attribute names and some table structures. Non-relational databases accessible via
ODBC or other tools (e.g. Data Extract for DB2, which permits movement of data from IMS/VS,
DL/1, VSAM and SAM to SQL/DS or DB2) could also be accommodated, as the meta-data required
by CCVES could be extracted from them. Previous work related to meta-translation [HOW87] has
investigated the translation of dBase code to INGRES/QUEL, demonstrating the applicability of
this technique in general, not only to the relational data model but also to others such as
CODASYL and hierarchical data models. This means CCVES is capable in principle of being
extended to cope with other data models.




CHAPTER 8

                       Evaluation, Future Work and Conclusion

In this chapter the Conceptualised Constraint Visualisation and Enhancement System (CCVES)
described in Chapters 5, 6 and 7 is evaluated with respect to our hypotheses and objectives listed
in Chapter 1. We describe the functionality of different components of CCVES to identify their
strengths and summarise their limitations. Potential extensions and improvements are considered
as part of future work. Finally, conclusions about the work are drawn by reviewing the objectives
and the evaluation.

8.1 Evaluation

       8.1.1 System Objectives and Achievements

        The major technical challenge in designing CCVES was to provide an interactive graphical
environment to access and manipulate legacy databases within an evolving heterogeneous
distributed database environment for the purpose of analysing, enhancing and incrementally
migrating legacy database schemas to modern representations. The objective of this exercise was
to enrich a legacy database with valuable additional knowledge that has many uses, without being
restricted by the existing legacy database service and without affecting the operation of the legacy
IS. This knowledge is in the form of constraints that can be used to understand and improve the
data of the legacy IS.

         Here, we assess the main external and internal aspects of our system, CCVES, based on
the objectives laid out in sections 1.2 and 2.4. Externally, CCVES performs three important tasks:
initially, a reverse-engineering process; then, a knowledge augmentation process, which is a re-
engineering process on the original system; and finally, an incremental migration process. The
reverse-engineering process is fully automated and is generalised to deal with the problems caused
by heterogeneity.

       a) A framework to address the problem of heterogeneity

        The problems of heterogeneity have been addressed by many researchers, and at Cardiff
the meta-translation technique has been successfully used to demonstrate its wide-ranging
applicability to heterogeneous systems. This previous work, which includes query meta-
translation [HOW87], schema meta-translation [RAM91] and schema meta-integration [QUT93],
was developed using Prolog - emphasising its suitability for meta-data representation and
processing. Hence Prolog was chosen as the main programming language for the development of
our system. Among the many Prolog versions around, we found that Quintus Prolog was well
suited to supporting an interactive graphical environment as it provided access to X window and
Motif widget routines. Also, the PRODBI tools [LUC93] were available with Quintus Prolog, and
these enabled us to directly access a number of relational DBMSs, like INGRES, Oracle and
SYBASE.

        Our framework for meta-data representation and manipulation has been described in
section 7.2. The meta-programming approach enabled us to implement many other features, such
as the ability to easily customise our system for different data models, e.g. relational and object-
oriented (cf. section 7.2.1), the ability to easily enhance or customise for different display models,
e.g. E-R, EER and OMT (cf. section 7.2.4), and the ability to deal with heterogeneity due to
differences in local databases (e.g. at the global level the user views all local databases as if they
come from the same DBMS, and is also able to view databases using a preferred DDL syntax).

       b) An interactive graphical environment

        An interactive graphical environment which makes extensive use of modern graphical user
interface facilities was required to provide graphical displays of conceptual models and allow
subsequent interaction with them. To fulfil these requirements the CCVES software development
environment had to be based on a GUI sub-system consisting of pop-up windows, pull-down
menus, push buttons, icons etc. We selected X window and Motif widgets to build such an
environment on a UNIX platform. SunSparc workstations were used for this purpose. Provision of
interactive graphical responses when working via this interface was also included to ensure user
friendliness (cf. section 7.5).

       c) Ability to access and work on legacy database systems

        An initial, basic facility of our system was the ability to access legacy database systems.
This process, which is described in section 6.5, enables users to specify and access any database
system over a network. Here, as the schema information is usually static, CCVES has been
designed to provide the user with the option of by-passing the physical database access process
and using instead an existing (already accessed) logical schema. This saves time once the initial
access to a schema has been made. Also, it guarantees access to meta-data of previously accessed
databases during server and network breakdowns, which were not uncommon during the
development of our system.


       d) A framework to perform the reverse-engineering process

       A framework to perform the reverse-engineering process for legacy database systems has
been provided. This process is based on applying a set procedure which produces an appropriate
conceptual model (cf. section 5.2). It is performed automatically even if there is very limited
meta-knowledge. In such a situation, links that should be present in the conceptual model will not
appear in the corresponding graphical display. Hence, the full success of this process depends on
the availability of adequate meta-knowledge. This means that a real world data modelling
framework that facilitates the enhancement of legacy systems must be provided, as described next.

       e) A framework to enhance existing systems

        A comprehensive data modelling framework that facilitates the enhancement of
established database systems has been provided (cf. section 5.6). A method of retaining the
enhanced knowledge for future use which is in line with current international standards is
employed. Techniques that are used in recent versions of commercial DBMSs are supported to
enable legacy databases to logically incorporate modern data modelling techniques irrespective of
whether these are supported by their legacy DBMSs or not (cf. section 5.7). This enhancement
facility gives users the ability to exploit existing databases in new ways (i.e. restructuring and
viewing them using modern features even when these are not supported by the existing system).
The enhanced knowledge is retained in the database itself so that it is readily available for future
exploitation by CCVES or other tools, or by the target system in a migration.

       f) Ability to view a schema using preferred display models

       The original objective of producing a conceptual model as a result of our reverse-
engineering process was to display the structure of a database in graphical form and so make it
easier for users to comprehend its contents. As not all users are familiar with the same display
model, the facility to visualise a schema using a preferred display model (e.g. EER or OMT) has
been provided. This is more flexible than our original aim.

       g) High level of data abstraction for better understanding

        A high level of data abstraction for most components of a conceptual model (i.e.
visualising the contents, relationships and behavioural properties of entities and constraints;
including identification of intra- and inter-object constraints) has been provided (cf. section 7.5.1).
Such features are not usually incorporated in visualisation tools. These features and various other
forms of interaction with conceptual models are provided via the user interface of CCVES.

       h) Ability to enhance schema and to verify the database

       The schema enhancement process was provided originally to enrich a legacy database
schema and its resultant conceptual model. A facility to determine the constraints on the
information held and the extent to which the legacy data conforms to these constraints is also
provided to enable the user to verify their applicability (section 5.7). The graphical user interface
components used for this purpose are described in section 7.5.2.

       i) Ability to migrate while the system continues to function

        The ability to enhance and migrate while a legacy database continues to function normally
was considered necessary as it ensures that this process will not affect the ongoing operation of
the legacy system (section 5.8). The ability to migrate to a single target database environment for
all legacy databases assists in removing the physical heterogeneity between these databases.
Finally, the ability to integrate CCVES with other tools to maximise the benefits to the user
community was also provided (section 7.4.3).

       8.1.2 System Development and Performance

       A working prototype CCVES system that enabled us to test all the contributions of this
research was implemented using Quintus Prolog with X window and Motif libraries; INGRES,
Oracle and POSTGRES DBMSs; the C programming language embedded with SQL and
POSTQUEL; and the PRODBI interface to INGRES. This system can be split into four parts,
namely: the database access process to capture meta-data from legacy databases; the mapping of
the meta-data of a legacy database to a conceptual model to present the semantics of the database
using a graphical environment; the enhancement of a legacy database schema with constraint
based knowledge to improve its semantics and functionality; and the incremental migration of the
legacy database to a target database environment.

         Our initial development commenced using POPLOG, which was at that time the only
Prolog version with any graphical capabilities available on UNIX workstations at Cardiff. Our
initial exposure to X window library routines occurred at this stage. Later, with the availability of
Quintus Prolog, which had a more powerful graphical capability due to its support of X windows
and Motif widgets, it was decided to transfer our work to this superior environment. To achieve
this we had to make two significant changes, namely: converting all POPLOG graphic routines to
Quintus equivalents and modifying a particular implementation approach adopted by us when
working with POPLOG. The latter took advantage of POPLOG’s support for passing unevaluated
expressions as arguments of Prolog clauses. In Quintus Prolog we had to evaluate all expressions
before passing them as arguments.

        Due to the use of slow workstations (i.e. SPARC1s) and running Prolog interactively, there
was a delay in most interactions with our original system. This delay was significant (e.g. nearly a
minute) when having to redraw a conceptual model. It was necessary to redraw this model when
we moved an object of the display in order to change its location, and whenever the drawing
window was exposed. This exposure occurred when the window's position changed, when it was
overlapped by another window or a menu, or when someone clicked on this window. In such
situations it was necessary to refresh the drawing window by redrawing the model. Redrawing
was required as our initial attempt at producing a conceptual model was based solely on drawing
routines. This method was inefficient as such drawings had to be redone every time the drawing
window became exposed.

        Our second attempt was to draw conceptual models in the background using a pixmap.
This process allocates part of the memory of the computer to enable us to directly draw and retain
an image. A pixmap can be copied to any drawing window without having to reconstruct its
graphical components. This means that when the drawing window becomes exposed it is possible
to copy this pixmap to that window without redrawing the conceptual model. The process of
copying a pixmap to the drawing window took only a few seconds and so there was a significant
improvement over our original method. However, with this new approach, whenever a move
operation is performed it is still necessary to recompute all graphical settings and redraw, which
took about as long as in the original method.

       The use of a pixmap took up a significant part of the computer’s memory and as a result
Quintus was unable to cope if there was a need to simultaneously view multiple conceptual
models. We also experienced several instances of unusual system behaviour such as failure to
execute routines that had been tested previously. This was due to the full utilisation by Prolog of
run time memory because of the existence of this pixmap. We noticed that Quintus Prolog had a
bug of not being able to release the memory used by a pixmap. In order to regain this memory we
had to logout (exit) from the workstation, as the xnew process which was collecting garbage was
unable to deal with this case. Hence, we decided to use widgets instead of drawing rectangles for
entities, as widgets are managed automatically by X windows and Motif routines. This allowed us
to reduce the drawing components in our conceptual model and hence to minimise redrawing time
when the drawing window became exposed. We discarded the pixmap approach as it gave us
many problems. However, as widgets themselves take up memory, their behaviour with highly
complex conceptual models remains questionable. We decided not to test this in depth as we had already
spent too much time on this module, and its feasibility had been demonstrated satisfactorily.

        During the course of CCVES development, Quintus Prolog was upgraded from release 3.1
to 3.1.4. Due to incompatibilities between the two versions, certain routines of our system had to
be modified to suit the new version. This meant that a full test of the entire system was required.
Also, since three versions of INGRES, two versions of Oracle and one version of POSTGRES were
used during our project, still more system testing was required. Thus, we have
experienced several changes to our system due to technological changes in its development
environment. Comparing the lifespan and scale of our project with those of a legacy IS, we could
more clearly appreciate the amount of change that is required for such systems to keep up with
technological progress and business needs. Hence, the migration of any IS is usually a complex
process. However, the ability to enhance and evolve such a system without affecting its normal
operation is a significant step towards assisting this process.

       Our final task was to produce a compiled version of our system. This is still being
undertaken: although we have been able to produce executable code, some user interface options
are not activated, for reasons as yet unknown (we suspect insufficient memory), even though the
individual modules work correctly.

       8.1.3 System Appraisal

        The approach presented in this thesis for mapping a relational database schema to a
conceptual schema is in many ways simpler and easier to apply than any previous attempts as it
has eliminated the need for any initial user interaction to provide constraint based knowledge for
this process. Constraint information such as primary and foreign keys are used to automatically
derive the entity and relationship types. Use of foreign key information was not considered in
previous approaches as most database systems did not support such facilities at that time.

       One major contribution of our work is providing the facility for specifying and using
constraint-based information in any type of DBMS. This means that once a database is enhanced
with constraints, it is semantically richer. If the source DBMS does not support constraints then
the conceptual model will still be enhanced, and our tool will augment the database with these
constraints in an appropriate form.

       Another innovative feature of our system is the automated use of the DML of a database to
determine the extent to which its data conforms to the enhanced constraints. This enables users to
take appropriate compensatory actions prior to migrating legacy databases.

        We provided an additional level of schema abstraction for our conceptual models. This is
in the form of viewing the constraints associated with a schema. This feature allows users to gain
a better understanding of databases.

        The facility to view multiple schemas allows users to compare different components of a
global system if it comprises several databases. This feature is very useful when dealing with
heterogeneous databases. We also deal with heterogeneity at the conceptual viewing stage by
providing users with the facility to view a schema using their preferred modelling notation. For
example, in our system the user can choose either an EER or an OMT display to view a schema.
This ensures greater accuracy in understanding, as the user can select a familiar modelling
notation to view database schemas. The display of the same schema in multiple windows using
different scales allows the user to focus on a small section of the schema in one window while
retaining a larger view in another. The ability to view multiple schemas also means that it is
possible to jointly monitor the progress or status of the source and target databases during an
incremental migration process. The introduction of both EER and OMT as modelling options
means that recent modelling advances, which were not present in the original E-R model and some
of its subsequent variants, can be represented using our system.

        Our approach of augmenting a database itself with new semantic knowledge rather than
using separate specialised knowledge-bases means that our enhanced knowledge is accessible by
any user or tool via the DML of the database. This knowledge is represented in the database
using an extended version of the SQL-3 standards for constraint representation. Thus this
knowledge will be compatible with future database products, which should conform to the new
SQL-3 standards. Also, no semantics are lost due to the mapping from a conceptual model to a
database schema. Oracle version 6 provided similar functionality by allowing constraints to be
specified even though they could not be applied until the introduction of version 7.

       8.1.4 Useful real-life Applications

        We were able to successfully reverse-engineer a leading telecommunication database
extract consisting of over 50 entities. This enabled us to test our tool on a scale greater than that
of our test databases. The successful use of all or parts of our system for other research work,
namely: accessing POSTGRES databases for semantic object-oriented multi-database access
[ALZ96], viewing heterogeneous conceptual schemas when dealing with graphical query
interfaces [MAD95], and viewing heterogeneous conceptual schemas via the world wide web
(WWW) [KAR96] indicates its general usefulness and applicability.

        The display of conceptual models can be of use in many areas such as database design,
database integration and database migration. We could identify similar areas of use for CCVES.
These include training new users by allowing them to understand an existing system, and enabling
users to experiment with possible enhancements to existing systems.

8.2 Limitations and possible future Extensions

        There are a number of possible extensions that could be incorporated to improve the
current functionalities of our system. Some of these are associated with improving run time
efficiency, accommodating a wider range of users and extending graphical user interaction
capabilities. Such extensions would not have great significance with respect to demonstrating the
applicability of our fundamental ideas. Examples of improvements are: engineer the system to the
level of a commercial product so that it could be used by a wide range of users with minimal user
training; improve the run time efficiency of the system by producing a compiled version; test it in
a proper distributed database environment, as our test databases did not emphasise distribution;
extend the graphical display options to offer other conceptual models, such as ECR; extend the
system to enable us to test migrations to a proper object-oriented DBMS (i.e. not only to an
extended relational DBMS with O-O features, like POSTGRES); and improve the display layout
algorithm (cf. section 7.3) to efficiently manage large database schemas. The time scale for such
improvements would vary from a few weeks to many months each, depending on the work
involved.

        Our system is designed to cope with two important extensions. They are: extend the
graphical display option to offer other forms of conceptual models, and extend the number of
DBMSs and their versions it can support. Of these two extensions, supporting a new graphical
display involves the least work. Here, the user needs to identify the notations used by the new
display and write the necessary Prolog rules to generate strings and lines used for the drawings.
This process will take at most one week, as we do not change graphical constructs such as
class_info and ref_info (cf. section 7.2.4) to support different display models. On the other hand,
inclusion of a new relational DBMS or version can take a few months as it affects three stages of
our system, namely: meta-data access, constraint enforcement and database migration. All three
stages use the query language (SQL) of the DBMS and hence, if its dialect differs from standard
SQL, we will need to expand our QMTS. The time required for such an extension will depend on
how closely the dialect resembles standard SQL and may take 2-4 person-weeks. Next, we need to
assess the constraint handling features supported by the new DBMS so that we can use our
knowledge-based tables to overcome any constraint representation limitations. This process may
take 1-2 person-weeks. To access the meta-data from a database it is necessary to know the
structures of its system tables. Also, we need a mechanism to access this information externally
(i.e. use an ODBC driver or write our own). This stage can take 1-6 person-weeks as in many cases
system documentation
will be inadequate. Inclusion of a different data model would be a major extension as it affects all
stages of our system. It would require provision of new abstraction mechanisms such as parent-
child relationships for a hierarchical model and owner-member relationships for a network model.

       Other possible extensions are concerned with incorporating software modules that would
expand our approach. These include a forward gateway for use at the incremental migration stage;
an integration module for merging related database applications; and analysers for extracting
constraint-based information from legacy IS code. These are important and major areas of
research, hence the development of such modules could take from many months to years in each
case.

8.3 Conclusion

8.3.1 Overall Summary

      This thesis has reported the results of a research investigation aimed at the design and
implementation of a tool for enhancing and migrating heterogeneous legacy databases.



         In the first two chapters we introduced our research and its aims and objectives. Then in
chapter 3, we presented some preliminary database concepts and standards relevant to our work.
In chapters 4 and 5, we introduced wider aspects of our problem and studied alternative ways
proposed to solve major parts of this problem. Many important points emerged from this study.
These include: application of meta-translation techniques to deal with legacy database system
heterogeneity; application of migration techniques to specific components of a database
application (i.e. the database service) as opposed to an IS as a whole; extending the application of
database migration beyond the traditional COBOL oriented and IBM database products;
application of a migration approach to distributed database systems; enhancing previous re-
engineering approaches to incorporate modern object-oriented concepts and multi-database
capabilities; introducing semantic integrity constraints into legacy database systems and hence
exploring them beyond their structural semantics. In chapter 5, we described our re-engineering
approach and explained how we accomplished our goals in enhancing and preparing legacy
databases for migration, while chapter 6 was concerned with testing our ideas using carefully
designed test databases. Also in chapter 6, we provided illustrative examples of our system
working. In chapter 7, we described the overall architecture and operation of the system together
with related implementation considerations. Here we also gave some examples of our system
interfaces. In chapter 8, we carried out a detailed evaluation which included research
achievements, limitations and suggestions for possible future extensions. We also looked at some
real-life areas of application in which our prototype system has been tested and/or could be used.
Finally, some major conclusions that can be drawn from this research are presented below.

8.3.2 Conclusions

       The important conclusions that can be drawn from the work described in this thesis are as
follows:

    • Although many approaches have been proposed for mapping relational schemas to a form
      where their semantics can be more easily understood by users, they either lack the
      application of modern modelling concepts or have been applied to logically centralised or
      decentralised database schemas, not physically heterogeneous databases.
    • Previously proposed approaches for mapping relational schemas to conceptual models involve
      user interactions and pre-requisites. This is confusing for first-time users of a system as they
      do not have any prior direct experience or knowledge of the underlying schema. We
      produce an initial conceptual model automatically, prior to any user interaction, to
      overcome this problem. Our user interaction commences only after the production of the
      initial conceptual model. This gives users the opportunity to gain some vital basic
      understanding of a system prior to any serious interaction with it.
    • Most previous reverse-engineering tools have ignored an important source of database
      semantics, namely semantic integrity constraints such as foreign key definitions. One
      obvious reason for this is that many existing database systems do not support the
      representation of such semantics. We have identified and shown the important contribution
      that semantic integrity constraints can make by presenting them and applying them to the
      conceptual and physical models. We have also successfully incorporated them into legacy
      database systems which do not directly support such semantics.
    • The problem of legacy IS migration has not been studied for multi-database systems in
      general. This appears to present many difficulties to users. We have tested and demonstrated
      the use of our tools with a wide range of relational and extended relational database
      systems.
    • The problem of legacy IS migration has not been studied for more recent and modern
      systems; as a result, ways of eliminating the need for migration have not yet been addressed.
      Our approach of enhancing legacy ISs irrespective of their DBMS type will assist in
      redesigning modern database applications and hence overcome the need to migrate such
      applications in many cases.
    • Our evaluation has concluded that most of the goals and objectives of our system, presented in
      sections 1.2 and 2.4, have been successfully met or exceeded.




Page 132

More Related Content

PDF
Re-Engineering Databases using Meta-Programming Technology
PDF
A Centralized Network Management Application for Academia and Small Business ...
PDF
Getting relational database from legacy data mdre approach
PPT
Data & database administration hoffer
PDF
Towards a low cost etl system
PDF
Pitfalls & Challenges Faced During a Microservices Architecture Implementation
PDF
Model-Driven Architecture for Cloud Applications Development, A survey
PPTX
Re-Engineering Databases using Meta-Programming Technology
A Centralized Network Management Application for Academia and Small Business ...
Getting relational database from legacy data mdre approach
Data & database administration hoffer
Towards a low cost etl system
Pitfalls & Challenges Faced During a Microservices Architecture Implementation
Model-Driven Architecture for Cloud Applications Development, A survey

What's hot (15)

PPT
Topic1 Understanding Distributed Information Systems
DOCX
16 & 2 marks in i unit for PG PAWSN
PPT
Case Study: Synchroniztion Issues in Mobile Databases
PDF
DWDM-RAM: a data intensive Grid service architecture enabled by dynamic optic...
DOCX
Case4 lego embracing change by combining bi with flexible information system 2
DOCX
LEGO EMBRACING CHANGE BY COMBINING BI WITH FLEXIBLE INFORMATION SYSTEM
PPTX
Pmit 6102-14-lec1-intro
PPT
0321210255 ch01
PDF
Toward Cloud Computing: Security and Performance
DOCX
A database management system
PDF
Fs2510501055
PDF
Lesson - 02 Network Design and Management
PPTX
Distributed Systems - Information Technology
PPTX
Case study 9
PDF
A HEALTH RESEARCH COLLABORATION CLOUD ARCHITECTURE
Topic1 Understanding Distributed Information Systems
16 & 2 marks in i unit for PG PAWSN
Case Study: Synchroniztion Issues in Mobile Databases
DWDM-RAM: a data intensive Grid service architecture enabled by dynamic optic...
Case4 lego embracing change by combining bi with flexible information system 2
LEGO EMBRACING CHANGE BY COMBINING BI WITH FLEXIBLE INFORMATION SYSTEM
Pmit 6102-14-lec1-intro
0321210255 ch01
Toward Cloud Computing: Security and Performance
A database management system
Fs2510501055
Lesson - 02 Network Design and Management
Distributed Systems - Information Technology
Case study 9
A HEALTH RESEARCH COLLABORATION CLOUD ARCHITECTURE
Ad

Similar to Assisting Migration and Evolution of Relational Legacy Databases (20)

PDF
HOW-CLOUD-IMPLEMENTATION-CAN-ENSURE-MAXIMUM-ROI.pdf
PDF
A Reconfigurable Component-Based Problem Solving Environment
PDF
AtomicDBCoreTech_White Papaer
PDF
01_Program
PDF
Conspectus data warehousing appliances – fad or future
PDF
Research Inventy : International Journal of Engineering and Science
PDF
Adm Workshop Program
PPTX
Opportunities and Challenges for Running Scientific Workflows on the Cloud
PPT
CSC UNIT1 CONTENT IN THE SUBJECT CLIENT SERVER COMPUTING
PDF
Microservices for Application Modernisation
PPTX
Transform Legacy Systems with Modern Development Expertise
PPTX
Mykhailo Hryhorash: Архітектура IT-рішень (Частина 1) (UA)
PDF
Transform Legacy Systems with Modern Development Expertise
PDF
data-mesh_whitepaper_dec2021.pdf
PDF
Query Evaluation Techniques for Large Databases.pdf
PDF
Cloud Computing: A Perspective on Next Basic Utility in IT World
PDF
Development of a Suitable Load Balancing Strategy In Case Of a Cloud Computi...
PDF
Embracing Containers and Microservices for Future Proof Application Moderniza...
PDF
CC LECTURE NOTES (1).pdf
PDF
publishable paper
HOW-CLOUD-IMPLEMENTATION-CAN-ENSURE-MAXIMUM-ROI.pdf
A Reconfigurable Component-Based Problem Solving Environment
AtomicDBCoreTech_White Papaer
01_Program
Conspectus data warehousing appliances – fad or future
Research Inventy : International Journal of Engineering and Science
Adm Workshop Program
Opportunities and Challenges for Running Scientific Workflows on the Cloud
CSC UNIT1 CONTENT IN THE SUBJECT CLIENT SERVER COMPUTING
Microservices for Application Modernisation
Transform Legacy Systems with Modern Development Expertise
Mykhailo Hryhorash: Архітектура IT-рішень (Частина 1) (UA)
Transform Legacy Systems with Modern Development Expertise
data-mesh_whitepaper_dec2021.pdf
Query Evaluation Techniques for Large Databases.pdf
Cloud Computing: A Perspective on Next Basic Utility in IT World
Development of a Suitable Load Balancing Strategy In Case Of a Cloud Computi...
Embracing Containers and Microservices for Future Proof Application Moderniza...
CC LECTURE NOTES (1).pdf
publishable paper
Ad

More from Gihan Wikramanayake (20)

PPTX
Using ICT to Promote Learning in a Medical Faculty
PPTX
Evaluation of English and IT skills of new entrants to Sri Lankan universities
PPT
Learning beyond the classroom
PPT
Broadcasting Technology: Overview
PPT
Importance of Information Technology for Sports
PDF
Improving student learning through assessment for learning using social media...
PDF
Exploiting Tourism through Data Warehousing
PDF
Speaker Search and Indexing for Multimedia Databases
PDF
Authropometry of Sri Lankan Sportsmen and Sportswomen, with Special Reference...
PDF
Analysis of Multiple Choice Question Papers with Special Reference to those s...
PDF
ICT ප්‍රාරම්භක ඩිප්ලෝමා පාඨමාලාව දිනමිණ, පරිගණක දැනුම
PDF
වෘත්තීය අවස්ථා වැඩි පරිගණක ක්ෂේත‍්‍රය දිනමිණ, පරිගණක දැනුම
PDF
පරිගණක ක්ෂේත‍්‍රයේ වෘත්තීය අවස්ථා දිනමිණ, පරිගණක දැනුම
PDF
Producing Employable Graduates
PDF
Balanced Scorecard and its relationship to UMM
PDF
An SMS-Email Reader
PDF
Web Usage Mining based on Heuristics: Drawbacks
PDF
Evolving and Migrating Relational Legacy Databases
PDF
Development of a Web site with Dynamic Data
PDF
Web Based Agriculture Information System
Using ICT to Promote Learning in a Medical Faculty
Evaluation of English and IT skills of new entrants to Sri Lankan universities
Learning beyond the classroom
Broadcasting Technology: Overview
Importance of Information Technology for Sports
Improving student learning through assessment for learning using social media...
Exploiting Tourism through Data Warehousing
Speaker Search and Indexing for Multimedia Databases
Authropometry of Sri Lankan Sportsmen and Sportswomen, with Special Reference...
Analysis of Multiple Choice Question Papers with Special Reference to those s...
ICT ප්‍රාරම්භක ඩිප්ලෝමා පාඨමාලාව දිනමිණ, පරිගණක දැනුම
වෘත්තීය අවස්ථා වැඩි පරිගණක ක්ෂේත‍්‍රය දිනමිණ, පරිගණක දැනුම
පරිගණක ක්ෂේත‍්‍රයේ වෘත්තීය අවස්ථා දිනමිණ, පරිගණක දැනුම
Producing Employable Graduates
Balanced Scorecard and its relationship to UMM
An SMS-Email Reader
Web Usage Mining based on Heuristics: Drawbacks
Evolving and Migrating Relational Legacy Databases
Development of a Web site with Dynamic Data
Web Based Agriculture Information System

Recently uploaded (20)

PDF
Basic Mud Logging Guide for educational purpose
PDF
grade 11-chemistry_fetena_net_5883.pdf teacher guide for all student
PPTX
The Healthy Child – Unit II | Child Health Nursing I | B.Sc Nursing 5th Semester
PPTX
Institutional Correction lecture only . . .
PPTX
Cell Structure & Organelles in detailed.
PPTX
Pharma ospi slides which help in ospi learning
PPTX
school management -TNTEU- B.Ed., Semester II Unit 1.pptx
PPTX
BOWEL ELIMINATION FACTORS AFFECTING AND TYPES
PDF
Supply Chain Operations Speaking Notes -ICLT Program
PDF
FourierSeries-QuestionsWithAnswers(Part-A).pdf
PDF
Abdominal Access Techniques with Prof. Dr. R K Mishra
PDF
VCE English Exam - Section C Student Revision Booklet
PPTX
IMMUNITY IMMUNITY refers to protection against infection, and the immune syst...
PDF
01-Introduction-to-Information-Management.pdf
PPTX
Microbial diseases, their pathogenesis and prophylaxis
PDF
Complications of Minimal Access Surgery at WLH
PDF
Anesthesia in Laparoscopic Surgery in India
PDF
Mark Klimek Lecture Notes_240423 revision books _173037.pdf
PPTX
Introduction to Child Health Nursing – Unit I | Child Health Nursing I | B.Sc...
PPTX
Final Presentation General Medicine 03-08-2024.pptx
Basic Mud Logging Guide for educational purpose
grade 11-chemistry_fetena_net_5883.pdf teacher guide for all student
The Healthy Child – Unit II | Child Health Nursing I | B.Sc Nursing 5th Semester
Institutional Correction lecture only . . .
Cell Structure & Organelles in detailed.
Pharma ospi slides which help in ospi learning
school management -TNTEU- B.Ed., Semester II Unit 1.pptx
BOWEL ELIMINATION FACTORS AFFECTING AND TYPES
Supply Chain Operations Speaking Notes -ICLT Program
FourierSeries-QuestionsWithAnswers(Part-A).pdf
Abdominal Access Techniques with Prof. Dr. R K Mishra
VCE English Exam - Section C Student Revision Booklet
IMMUNITY IMMUNITY refers to protection against infection, and the immune syst...
01-Introduction-to-Information-Management.pdf
Microbial diseases, their pathogenesis and prophylaxis
Complications of Minimal Access Surgery at WLH
Anesthesia in Laparoscopic Surgery in India
Mark Klimek Lecture Notes_240423 revision books _173037.pdf
Introduction to Child Health Nursing – Unit I | Child Health Nursing I | B.Sc...
Final Presentation General Medicine 03-08-2024.pptx

Assisting Migration and Evolution of Relational Legacy Databases

  • 1. Assisting Migration and Evolution of Relational Legacy Databases by G.N. Wikramanayake Department of Computer Science, University of Wales Cardiff, Cardiff September 1996
  • 3. Abstract The research work reported here is concerned with enhancing and preparing databases with limited DBMS capability for migration to keep up with current database technology. In particular, we have addressed the problem of re-engineering heterogeneous relational legacy databases to assist them in a migration process. Special attention has been paid to the case where the legacy database service lacks the specification, representation and enforcement of integrity constraints. We have shown how knowledge constraints of modern DBMS capabilities can be incorporated into these systems to ensure that when migrated they can benefit from the current database technology. To this end, we have developed a prototype conceptual constraint visualisation and enhancement system (CCVES) to automate as efficiently as possible the process of re-engineering for a heterogeneous distributed database environment, thereby assisting the global system user in preparing their heterogeneous database systems for a graceful migration. Our prototype system has been developed using a knowledge based approach to support the representation and manipulation of structural and semantic information about schemas that the re-engineering and migration process requires. It has a graphical user interface, including graphical visualisation of schemas with constraints using user preferred modelling techniques for the convenience of the user. The system has been implemented using meta-programming technology because of the proven power and flexibility that this technology offers to this type of research applications. The important contributions resulting from our research includes extending the benefits of meta- programming technology to the very important application area of evolution and migration of heterogeneous legacy databases. In addition, we have provided an extension to various relational database systems to enable them to overcome their limitations in the representation of meta-data. These extensions contribute towards the automation of the reverse-engineering process of legacy databases, while allowing the user to analyse them using extended database modelling concepts. Page v
  • 4. CHAPTER 1 Introduction This chapter introduces the thesis. Section 1.1 is devoted to the background and motivations of the research undertaken. Section 1.2 presents the broad goals of the research. The original achievements which have resulted from the research are summarised in Section 1.3. Finally, the overall organisation of the thesis is described in Section 1.4. 1.1 Background and Motivations of the Research Over the years rapid technological changes have taken place in all fields of computing. Most of these changes have been due to the advances in data communications, computer hardware and software [CAM89] which together have provided a reliable and powerful networking environment (i.e. standard local and wide area networks) that allow the management of data stored in computing facilities at many nodes of the network [BLI92]. These changes have turned round the hardware technology from centralised mainframes to networked file-server and client-server architectures [KHO92] which support various ways to use and share data. Modern computers are much more powerful than the previous generations and perform business tasks at a much faster rate by using their increased processing power [CAM88, CAM89]. Simultaneous developments in the software industry have produced techniques (e.g. for system design and development) and products capable of utilising the new hardware resources (e.g. multi-user environments with GUIs). These new developments are being used for a wide variety of applications, including modern distributed information processing applications, such as office automation where users can create and use databases with forms and reports with minimal effort, compared to the development efforts using 3GLs [HIR85, WOJ94]. Such applications are being developed with the aid of database technology [ELM94, DAT95] as this field too has advanced by allowing users to represent and manipulate advanced forms of data and their functionalities. Due to the program data independence feature of DBMSs the maintenance of database application programs has become easier as functionalities that were traditionally performed by procedural application routines are now supported declaratively using database concepts such as constraints and rules. In the field of databases, the recent advances resulting from technological transformation include many areas such as the use of distributed database technology [OZS91, BEL92], object- oriented technology [ATK89, ZDO90], constraints [DAT83, GRE93], knowledge-based systems [MYL89, GUI94], 4GLs and CASE tools [COMP90, SCH95, SHA95]. Meanwhile, the older technology was dealing with files and primitive database systems which now appear inflexible, as the technology itself limits them from being adapted to meet the current changing business needs catalysed by newer technologies. The older systems which have been developed using 3GLs and in operation for many years, often suffer from failures, inappropriate functionality, lack of documentation, poor performance and are referred to as legacy information systems [BRO93, COMS94, IEE94, BRO95, IEEE95]. The current technology is much more flexible as it supports methods to evolve (e.g. 4GLs, CASE tools, GUI toolkits and reusable software libraries [HAR90, MEY94]), and can share resources through software that allows interoperability (e.g. ODBC [RIC94, GEI95]). This evolution
  • 5. reflects the changing business needs. However, modern systems need to be properly designed and implemented to benefit from this technology, which may still be unable to prevent such systems themselves being considered to be legacy information systems in the near future due to the advent of the next generation of technology with its own special features. The only salvation would appear to be building in evolution paths in the current systems. The increasing power of computers and their software has meant they have already taken over many day to day functions and are taking over more of these tasks as time passes. Thus computers are managing a larger volume of information in a more efficient manner. Over the years most enterprises have adopted the computerisation option to enable them to efficiently perform their business tasks and to be able to compete with their counterparts. As the performance ability of computers has increased, the enterprises still using early computer technology face serious problems due to the difficulties that are inherent in their legacy systems. This means that new enterprises using systems purely based on the latest technology have an advantage over those which need to continue to use legacy information systems (ISs), as modern ISs have been developed using current technology which provides not only better performance, but also utilises the benefits of improved functionality. Hence, managers of legacy IS enterprises want to retire their legacy code and use modern database management systems (DBMSs) in the latest environment to gain the full benefits from this newer technology. However they want to use this technology on the information and data they already hold as well as on data yet to be captured. They also want to ensure that any attempts to incorporate the modern technology will not adversely affect the ongoing functionality of their existing systems. This means legacy ISs need to be evolved and migrated to a modern environment in such a way that the migration is transparent to the current users. The theme of this thesis is how we can support this form of system evolution. 1.1.1 The Barriers to Legacy Information System Migration Legacy ISs are usually those systems that have stood the test of time and have become a core service component for a business’s information needs. These systems are a mix of hardware and software, sometimes proprietary, often out of date, and built to earlier styles of design, implementation and operation. Although they were productive and fulfilled their original performance criteria and their requirements, these systems lack the ability to change and evolve. The following can be seen as barriers to evolution in legacy IS [IEE94]. • The technology used to build and maintain the legacy IS is obsolete, • The system is unable to reflect changes in the business world and to support new needs, • The system cannot integrate with other sub-systems, • The cost, time and risk involved in producing new alternative systems to the legacy IS. The risk factor is that a new system may not provide the full functionality of the current system for a period because of teething problems. Due to these barriers, large organisations [PHI94] prefer to write independent sub-systems to perform new tasks using modern technology which will run alongside the existing systems, rather than attempt to achieve this by adapting existing code or by writing a new system that replaces the old and has new facilities as well. 
We see the following immediate advantages of this low risk approach. Page 4
  • 6. • The performance, reliability and functionality of the existing system is not affected, • New applications can take advantage of the latest technology, • There is no need to retrain those staff who only need the facilities of the old system. However with this approach, as business requirements evolve with time, more and more new needs arise, resulting in the development and regular use of many diverse systems within the same organisation. Hence, in the long term the above advantages are overshadowed by the more serious disadvantages of this approach, such as: • The existing systems continue to exist and are legacy IS running on older and older technology, • The need to maintain many different systems to perform similar tasks increases the maintenance and support costs of the organisation, • Data becomes duplicated in different systems which implies the maintenance of redundant data with its associated increased risk of inconsistency between the data copies if updating occurs, • The overall maintenance cost for hardware, software and support personnel increases as many platforms are being supported, • The performance of the integrated information functions of the organisation decreases due to the need to interface many disparate systems. To address the above issues, legacy ISs need to be evolved and migrated to new computing environments, when their owning organisation upgrades. This migration should occur within a reasonable time after the upgrade occurs. This means that it is necessary to migrate legacy ISs to new target environments in order to allow the organisation to dispose of the technology which is becoming obsolete. Managers of some enterprises have chosen an easy way to overcome this problem, by emulating [CAM89, PHI94] the current environment on the new platforms (e.g. AS/400 emulators for IBM S/360 and ICL’s DME emulators for 1900 and System 4 users). An alternative strategy is achieved by translating [SHA93, PHI94, SHE94, BRO95] the software to run in new environments (i.e. code-to-code level translation). The emulator approach perpetuates all the software deficiencies of the legacy ISs although successfully removing the old-fashioned hardware technology and so it does enjoy the increased processing power of the new hardware. The translation approach takes advantage of some of the modern technological benefits in the target environment as the conversions - such as IBM’s JCL and ICL’s SCL code to Unix shell scripts, Assembler to COBOL, COBOL to COBOL embedded with SQL, and COBOL data structures to relational DBMS tables - are also done as part of the translation process. This approach, although a step forward, still carries over most of the legacy code as legacy systems are not evolved by this process. For example, the basic design is not changed. Hence the barrier to change and/or integration to a common sub- system still remains, and the translated systems were not designed for the environment they are now running in, so they may not be compatible with it. There are other approaches to overcoming this problem which have been used by enterprises [SHA93, BRO95]. These include re-implementing systems under the new environment and/or upgrading existing systems to achieve performance improvements. As computer technology continues to evolve at an ever quicker pace the need to migrate arises more rapidly. This means, most small organisations and individuals are left behind and are forced to work in a technologically Page 5
obsolete environment, mainly due to the high cost of frequently migrating to newer systems and/or upgrading existing software, as this process involves time and manpower which cost money. The gap between the older and newer system users will very soon create a barrier to information sharing unless some tools are developed to assist the older technology users' migration to new technology environments. This assistance may take many forms, including tools for: analysing and understanding existing systems; enhancing and modifying existing systems; and migrating legacy ISs to newer platforms. The complete migration process for a legacy IS needs to consider these requirements and many other aspects, as recently identified by Brodie and Stonebraker in [BRO95].

Our work was primarily motivated by these business oriented legacy database issues and by work in the area of extending relational database technology to enable it to represent more knowledge about its stored data [COD79, STO86a, STO86b, WIK90]. This second consideration is an important aspect of legacy system migration, since if a graceful migration is to be achieved we must be able to enhance a legacy relational database with such knowledge to take full advantage of the new system environment.

1.1.2 Heterogeneous Distributed Environments

As well as the problem of having to use legacy ISs, most large enterprises are faced with the problem of heterogeneity and the need for interoperability between existing ISs [IMS91]. This arises due to the increased use of different computer systems and software tools for information processing within an organisation as time passes. The development of networking capabilities to manage and share information stored over a network has made interoperability a requirement, and the broad acceptance of local area networks in business enterprises has increased the need to perform this task within organisations. Network file servers, client-server technology and the use of distributed databases [OZS91, BEL92, KHO92] are results of these challenging innovations. This technology is currently being used to create and process information held in heterogeneous databases, which involves linking different databases in an interoperable environment. An aspect of this work is legacy database interoperation, since as time passes these databases will have been built using different generations of software.

In recent years, the demand for distributed database capabilities has been fuelled mostly by the decentralisation of business functions in large organisations to address customer needs, and by mergers and acquisitions that have taken place in the corporate world. As a consequence, there is a strong requirement among enterprises for the ability to cross-correlate data stored in different existing heterogeneous databases. This has led to the development of products referred to as gateways, to enable users to link different databases together, e.g. Microsoft's Open Database Connectivity (ODBC) drivers can link Access, FoxPro, Btrieve, dBASE and Paradox databases together [COL94, RIC94]. There are similar products from other database vendors, such as Oracle (with gateways for IBM's DB2, UNISYS's DMS and DEC RMS) [HOL93] and others (for INGRES, SYBASE, Informix and other popular SQL DBMSs) [PUR93, SME93, RIC94, BRO95]. Database vendors have targeted cross-platform compatibility via SQL access protocols to support interoperability in a heterogeneous environment. As heterogeneity in distributed systems may occur in various forms, ranging from
different hardware platforms, operating systems, networking protocols and local database systems, cross-platform compatibility via SQL provides only a simple form of heterogeneous distributed database access. The biggest challenge comes in addressing heterogeneity due to differences in the local databases themselves [OZS91, BEL92]. This challenge is also addressed in the design and development of our system.

Distributed DBMSs have become increasingly popular in organisations as they offer the ability to interconnect existing databases, as well as having many other advantages [OZS91, BEL92]. The interconnection of existing databases leads to two types of distributed DBMS, namely: homogeneous and heterogeneous distributed DBMSs. In homogeneous systems all of the constituent nodes run the same DBMS and the databases can be designed in harmony with each other. This simplifies both the processing of queries at different nodes and the passing of data between nodes. In heterogeneous systems the situation is more complex, as each node can be running a different DBMS and the constituent databases can be designed independently. This is the normal situation when we are linking legacy databases, as the DBMSs and the databases used are more likely to be heterogeneous, since they are usually implemented for different platforms during different technological eras. In such a distributed database environment, heterogeneity may occur in various forms, at different levels [OZS91, BEL92], namely:

• The logical level (i.e. involving different database designs),
• The data management level (i.e. involving different data models),
• The physical level (i.e. involving different hardware, operating systems and network protocols), and
• All three or any pair of these levels.

1.1.3 The Problems and Search for a Solution

The concept of heterogeneity itself is valuable as it allows designers a freedom of choice between different systems and design approaches, thus enabling them to identify those most suitable for different applications. The exploitation of this freedom over the years in many organisations has resulted in the creation of multiple local and remote information systems which now need to be made interoperable to provide an efficient and effective information service to the enterprise managers. Open Database Connectivity (ODBC) [RIC94, GEI95] and its standards have been proposed to support interoperability among databases managed by different DBMSs. Database vendors such as Oracle, INGRES, INFORMIX and Microsoft have already produced tools, engines and connectivity products to fulfil this task [HOL93, PUR93, SME93, COL94, RIC94, BRO95]. These products allow limited data transfer and query facilities among databases to support interoperability among heterogeneous DBMSs. These features, although they permit easy, transparent heterogeneous database access, still do not provide a solution for legacy ISs, where a primary concern is to evolve and migrate the system to a target environment so that obsolete support systems can be retired. Furthermore, the ODBC facilities are developed for current DBMSs and hence may not be capable of accessing older generation DBMSs, and, if they are, are unlikely to be able to enhance them to take advantage of the newer technologies. Hence there is a need to create tools that provide ODBC-equivalent functionality for older generation DBMSs. Our work provides such functionality for all the DBMSs we have chosen for this research.
It also provides the ability to enhance and evolve legacy databases.
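By way of illustration only, much of the meta-data needed for such access can be read from a relational DBMS with ordinary queries against its system catalogue. The dictionary view named below is Oracle's and is used purely as an example of the general idea; catalogue names and structures differ between products and versions, and this is a sketch rather than the method used by any particular tool.

    -- Illustration only: reading table and column meta-data from an
    -- Oracle-style data dictionary view; other DBMSs expose equivalent
    -- catalogues under different names.
    SELECT table_name, column_name, data_type, data_length, nullable
    FROM   user_tab_columns
    ORDER  BY table_name, column_id;

Queries of this kind return the catalogue as if it were an ordinary user table, which is the basis on which a tool can offer ODBC-equivalent access to DBMSs that predate such drivers.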
In order to evolve an information system, one needs to understand the existing system's structure and code. Most legacy information systems are not properly documented, and hence understanding such systems is a complex process. This means that changing any legacy code involves a high risk, as it could result in unexpected system behaviour. Therefore one needs to analyse and understand existing system code before performing any changes to the system.

Database system design and implementation tools have appeared recently which have the aim of helping new information system development. Reverse and re-engineering tools are also appearing in an attempt to address issues concerned with existing databases [SHA93, SCH95]. Some of these tools allow the examination of databases built using certain types of DBMSs; however, the enhancements they allow are confined to the limitations of that system. Due to continuous ongoing technology changes, most current commercial DBMSs do not support the most recent software modelling techniques and features (e.g. Oracle version 7 does not support Object-Oriented features). Hence a system built using current software tools is guaranteed to become a legacy system in the near future (i.e. when new products with newer techniques and features begin to appear in the commercial market place).

Reverse engineering tools [SHA93] are capable of recreating the conceptual model of an existing database and hence they are an ideal starting point when trying to gain a comprehensive understanding of the information held in the database and its current state, as they create a visual picture of that state. However, in legacy systems the schemas are basic, since most of the information used to compose a conceptual model is not available in these databases. Information such as constraints that show links between entities is usually embedded in the legacy application code, and users find it difficult to reverse engineer these legacy ISs. Our work addresses these issues while assisting in overcoming this barrier within the knowledge representation limitations of existing DBMSs.

1.1.4 Primary and Secondary Motivations

The research reported in this thesis was therefore primarily prompted by the need to provide, for a logically heterogeneous distributed database environment, a design tool that allows users not only to understand their existing systems but also to enhance and visualise an existing database's structure using new techniques that are either not yet present in existing systems or not supported by the existing software environment. It was also motivated by:

a) Its direct applicability in the business world, as the new technique can be applied to incrementally enhance existing systems and prepare them to be easily migrated to new target environments, hence avoiding continued use of legacy information systems in the organisation.

Although previous work and some design tools address the issue of legacy information system analysis, evolution and migration, these are mainly concerned with 3GL languages such as COBOL and C [COMS94, BRO95, IEEE95]. Little work has been reported which addresses the new issues that arise due to the Object-Oriented (O-O) data model or the extended relational data model [CAT94]. There are no reports yet of enhancing legacy systems so that they can migrate to O-O or extended relational environments in a graceful migration from a relational system. There has been
some work in the related areas of identifying extended entity relationship structures in relational schemas, and some attempts at reverse-engineering relational databases [MAR90, CHI94, PRE94].

b) The lack of previous research in visualising pre-existing heterogeneous database schemas and evolving them by enhancing them with modern concepts supported in more recent releases of software.

Most design tools [COMP90, SHA93] which have been developed to assist in Entity-Relationship (E-R) modelling [ELM94] and Object Modelling Technique (OMT) modelling [RUM91] are used in a top-down database design approach (i.e. forward engineering) to assist in developing new systems. However, relatively few tools attempt to support a bottom-up approach (i.e. reverse engineering) to allow visualisation of pre-existing database schemas as E-R or OMT diagrams. Among these tools only a very few allow enhancement of the pre-existing database schemas, i.e. they apply forward engineering to enhance a reverse-engineered schema. Even those which do permit this action to some extent always operate on a single database management system and work mostly with schemas originally designed using such systems (e.g. CASE tools). The tools that permit only the bottom-up approach are referred to as reverse-engineering tools, and those which support both (i.e. bottom-up and top-down) are called re-engineering tools [SHA93]. This thesis is primarily concerned with creating re-engineering tools that assist legacy database migration.

The commercially available re-engineering tools are customised for particular DBMSs and are not easily usable in a heterogeneous environment. This barrier against widespread usability of re-engineering tools means that a substantial adaptation and reprogramming effort (costing time and money) is involved every time a new DBMS appears in a heterogeneous environment. An obvious example that reflects this limitation arises in a heterogeneous distributed database environment where there may be a need to visualise each participant database's schema. In such an environment, if the heterogeneity occurs at the database management level (where each node uses a different DBMS, for example, one node uses INGRES [DAT87] and another uses Oracle [ROL92]), then we have to use two different re-engineering tools to display these schemas. This situation is exacerbated for each additional DBMS that is incorporated into the given heterogeneous context. Also, legacy databases are migrated to different DBMS environments as newer versions and better database products have appeared since the original release of their DBMS. This means that a re-engineering tool that assists legacy database migration must work in a heterogeneous environment so that its use will not be restricted to particular types of ISs.

Existing re-engineering tools provide a single target graphical data model (usually the E-R model or a variant of it), which may differ in presentation style between tools and therefore inhibits the uniformity of visualisation that is highly desirable in an interoperable heterogeneous distributed database environment. This limitation means that users may need to use different tools to achieve the required uniformity of display in such an environment. The ability to visualise the conceptual model of an information system using a user-preferred graphical data model is important as it ensures that no inaccurate enhancements are made to the system due to any misinterpretation of the graphical notations used.
c) The need to apply rules and constraints to pre-existing databases to identify and clean inconsistent legacy data, as preparation for migration or as an enhancement of the database's quality.
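As a hypothetical illustration of such a check (the table and column names below are invented for the example), a foreign key constraint proposed for a legacy database can be tested against the existing data before any attempt is made to enforce it:

    -- Hypothetical example: list employee rows whose department
    -- reference has no matching department row, i.e. data that would
    -- violate a proposed foreign key from employee to department.
    SELECT e.emp_no, e.dept_no
    FROM   employee e
    WHERE  e.dept_no IS NOT NULL
      AND  NOT EXISTS (SELECT 1
                       FROM   department d
                       WHERE  d.dept_no = e.dept_no);

Rows returned by such a query can then be cleaned, removed, or taken as evidence that the proposed constraint itself needs to be relaxed.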
The inability to define and apply rules and constraints in early database systems, due to system limitations, meant that constraints were not used to increase the accuracy and consistency of the data held by these systems. This limitation is now a barrier to information system migration, as a new target DBMS is unable to enforce constraints on a migrated database until all violations are investigated and resolved, either by omitting the violating data or by cleaning it. This investigation may also show that a constraint has to be adjusted because the violating data is needed by the organisation. The enhancement of such a system with rules and constraints provides knowledge that can be used to determine possible data violations. The process of detecting constraint violations may be carried out by applying queries that are generated from these enhanced constraints. Similar methods have been used to implement integrity constraints [STO75], optimise queries [OZS91] and obtain intensional answers [FON92, MOT89]. This is essential, as constraints may have been implemented at the application coding level, which can lead to their inconsistent application.

d) An awareness of the potential contribution that knowledge-based systems and meta-programming technologies, in association with extended relational database technology, have to offer in coping with semantic heterogeneity.

The successful production of a conceptual model is highly dependent on the semantic information available, and on the ability to reason about these semantics. A knowledge-based system can be used to assist in this task, as the process of generalising the effective exploitation of semantic information for pre-existing heterogeneous databases needs to undergo three sub-processes, namely: knowledge acquisition, representation and manipulation. The knowledge acquisition process extracts the existing knowledge from a database's data dictionaries. This knowledge may include subsequent enhancements made by the user, as the use of a database to store such knowledge will provide easy access to this information along with its original knowledge. The knowledge representation process represents existing and enhanced knowledge. The knowledge manipulation process is concerned with deriving new knowledge and ensuring consistency of existing knowledge. These stages are addressable using specific processes. For instance, the reverse-engineering process used to produce a conceptual model can be used to perform the knowledge acquisition task. Then the derived and enhanced knowledge can be stored in the same database by adopting a process that allows us to distinguish this knowledge from the original meta-data. Finally, knowledge manipulation can be done with the assistance of a Prolog-based system [GRA88], while data and knowledge consistency can be verified using the query language of the database.

1.2 Goals of the Research

The broad goals of the research reported in this thesis are highlighted here, with detailed aims and objectives presented in section 2.4. These goals are to investigate interoperability problems, schema enhancement and migration in a heterogeneous distributed database environment, with particular emphasis on extended relational systems.
This should provide a basis for the design and implementation of a prototype software system that brings together new techniques from the areas of knowledge-based systems, meta-programming and O-O conceptual data modelling with the aim of facilitating schema enhancement, by means of generalising the efficient representation of constraints using the current standards. Such a system is a tool that would be a valuable asset in a logically heterogeneous distributed extended relational database environment, as it would make it possible for
global users to incrementally enhance legacy information systems. This offers the potential for users in this type of environment to work in terms of such a global schema, through which they can prepare their legacy systems to easily migrate to target environments and so gain the benefits of modern computer technology.

1.3 Original Achievements of the Research

The importance of this research lies in establishing the feasibility of enhancing, cleaning and migrating heterogeneous legacy databases using meta-programming technology, knowledge-based system technology, database system technology and O-O conceptual data modelling concepts, to create a comprehensive set of techniques and methods that form an efficient and useful generalised database re-engineering tool for heterogeneous sets of databases. The benefits such a tool can bring are also demonstrated and assessed. A prototype Conceptual Constraint Visualisation and Enhancement System (CCVES) [WIK95a] has been developed as a result of the research. To be more specific, our work has made four important contributions to progress in the database topic area of Computer Science:

1) CCVES is the first system to bring the benefits of meta-programming technology to the very important application area of enhancing and evolving heterogeneous distributed legacy databases to assist the legacy database migration process [GRA94, WIK95c].

2) CCVES is also the first system to enhance existing databases with constraints to improve their visual presentation and hence provide a better understanding of existing applications [WIK95b]. This process is applicable to any relational database application, including those which are unable to naturally support the specification and enforcement of constraints. More importantly, this process does not affect the performance of an existing application.

3) As will be seen later, we have chosen the current SQL-3 standards [ISO94] as the basis for knowledge representation in our research. This project provides an extension to the representation of the relational data model to cope with automated reuse of knowledge in the re-engineering process. In order to cope with technological changes that result from the emergence of new systems or new versions of existing DBMSs, we also propose a series of extended relational system tables conforming to SQL-3 standards to enhance existing relational DBMSs [WIK95b].

4) The generation of queries using the constraint specifications of the enhanced legacy systems is an easy and convenient method of detecting any constraint-violating data in existing systems. The application of this technique in the context of a heterogeneous environment for legacy information systems is a significant step towards detecting and cleaning inconsistent data in legacy systems prior to their migration. This is essential if a graceful migration is to be effected [WIK95c].

1.4 Organisation of the Thesis
The thesis is organised into 8 chapters. This first chapter has given an introduction to the research done, covering background and motivations, and outlining original achievements. The rest of the thesis is organised as follows:

Chapter 2 is devoted to presenting an overview of the research together with detailed aims and objectives for the work undertaken. It begins by identifying the scope of the work in terms of research constraints and development technologies. This is followed by an overview of the research undertaken, where a step by step discussion of the approach adopted and its role in a heterogeneous distributed database environment is given. Finally, detailed aims and objectives are drawn together to conclude the chapter.

Chapter 3 identifies the relational data model as the current dominant database model and presents its development along with its terminology, features and query languages. This is followed by a discussion of conceptual data models with special emphasis on the data models and symbols used in our project. Finally, we pay attention to key concepts related to our project, mainly the notion of semantic integrity constraints and extensions to the relational model. Here, we present important integrity constraint extensions to the relational model and its support in different SQL standards.

Chapter 4 addresses the issue of legacy information system migration. The discussion commences with an introduction to legacy and our target information systems. This is followed by migration strategies and methods for such ISs. Finally, we conclude by referring to current techniques and identify the trends and existing tools applicable to database migration.

Chapter 5 addresses the re-engineering process for relational databases. Techniques currently used for this purpose are identified first. Our approach, which uses constraints to re-engineer a relational legacy database, is described next. This is followed by a process for detecting possible keys and structures of legacy databases. Our schema enhancement and knowledge representation techniques are then introduced. Finally, we present a process to detect and resolve conflicts that may occur due to schema enhancement.

Chapter 6 introduces some example test databases which were chosen to represent a legacy heterogeneous distributed database environment and its access processes. Initially, we present the design of our test databases, the selection of our test DBMSs and the prototype system environment. This is followed by the application of our re-engineering approach to our test databases. Finally, the organisation of relational meta-data and its access is described using our test DBMSs.

Chapter 7 presents the internal and external architecture and operation of our Conceptual Constraint Visualisation and Enhancement System (CCVES) in terms of the design, structure and operation of its interfaces, and its intermediate modelling system. The internal schema mappings, e.g. mapping from INGRES QUEL to SQL and vice-versa, and internal database migration processes are presented in detail here.

Chapter 8 provides an evaluation of CCVES, identifying its limitations and improvements that could be made to the system. A discussion of potential applications is presented. Finally, we conclude the
chapter by drawing conclusions about the research project as a whole.
CHAPTER 2

Research Scope, Approach, Aims and Objectives

This chapter describes, in some detail, the aims and objectives of the research that has been undertaken. Firstly, the boundaries of the research are defined in section 2.1, which considers the scope of the project. Secondly, an overview of the research approach we have adopted in dealing with heterogeneous distributed legacy database evolution and migration is given in section 2.2. Next, in section 2.3, the discussion is extended to the wider aspects of applying our approach in a heterogeneous distributed database environment using the existing meta-programming technology developed at Cardiff in other projects. Finally, the research aims and objectives are detailed in section 2.4, illustrating what we intend to achieve, and the benefits expected from achieving the stated aims.

2.1 Scope of the Project

We identify the scope of the work in terms of research constraints and the limitations of current development technologies. An overview of the problem is presented along with the drawbacks and limitations of database software development technology in addressing the problem. This will assist in identifying our interests and focussing the issues to be addressed.

2.1.1 Overview of the Problem

In most database designs, a conceptual design and modelling technique is used in developing the specifications at the user requirements and analysis stage of the design. This stage usually describes the real world in terms of object/entity types that are related to one another in various ways [BAT92, ELM94]. Such a technique is also used in reverse-engineering to portray the current information content of existing databases, as the original designs are usually either lost, or inappropriate because the database has evolved from its original design. The resulting pictorial representation of a database can be used for database maintenance, re-design, enhancement, integration or migration, as it gives its users a sound understanding of an existing database's architecture and contents.

Only a few current database tools [COMP90, BAT92, SHA93, SCH95] allow the capture and presentation of database definitions from an existing database, and the analysis and display of this information at a higher level of abstraction. Furthermore, these tools are either restricted to accessing a specific database management system's databases or permit modelling with only a single given display formalism, usually a variant of the EER [COMP90]. Consequently there is a need to cater for multiple database platforms with different user needs, to allow access to a set of databases comprising a heterogeneous database, by providing a facility to visualise databases using a preferred conceptual modelling technique which is familiar to the different user communities of the heterogeneous system.

The fundamental modelling constructs of current reverse and re-engineering tools are entities, relationships and associated attributes. These constructs are useful for database design at
a high level of abstraction. However, the semantic information now available in the form of rules and constraints in modern DBMSs provides their users with a better understanding of the underlying database, as its data conforms to these constraints. This may not necessarily be true for legacy systems, which may have constraints defined that were not enforced. The ability to visualise rules and constraints as part of the conceptual model increases user understanding of a database. Users could also exploit this information to formulate queries that more effectively utilise the information held in a database. With these features in mind, we concentrated on providing a tool that permits specification and visualisation of constraints as part of the graphical display of the conceptual model of a database. With modern technology increasing the number of legacy systems and with increasing awareness of the need to use legacy data [BRO95, IEEE95], the availability of such a visualisation tool will be more important in future, as it will let users see the full definition of the contents of their databases in a familiar format.

Three types of abstraction mechanism, namely classification, aggregation and generalisation, are used in conceptual design [ELM94]. However, most existing DBMSs do not maintain sufficient meta-data to assist in identifying all these abstraction mechanisms within their data models. This means that reverse and re-engineering tools are semi-automated, in that they extract information, but users have to guide them and decide what information to look for [WAT94]. This requires interaction with the database designer in order to obtain missing information and to resolve possible conflicts. Such additional information is supplied by the tool users when performing the reverse-engineering process. As this additional information is not retained in the database, it must be re-entered every time a reverse-engineering process is undertaken if the full representation is to be achieved. To overcome this problem, knowledge bases are being used to retain this information when it is supplied. However, this approach restricts the use of this knowledge by other tools which may exist in the database's environment. The ability to hold this knowledge in the database itself would enhance an existing database with information that can be widely used. This would be particularly useful in the context of legacy databases as it would enrich their semantics. One of the issues considered in this thesis is how this can be achieved.

Most existing relational database applications record only entities and their properties (i.e. attribute names and data types) as system meta-data. This is because these systems conformed to early database standards (e.g. the SQL/86 standard [ANSI86], supported by INGRES version 5 and Oracle version 5). However, more recent relational systems record additional information such as constraint and rule definitions, as they conform to the SQL/92 standard [ANSI92] (e.g. Oracle version 7). This additional information includes, for example, primary and foreign key specifications, and can be used to identify classification and aggregation abstractions used in a conceptual model [CHI94, PRE94, WIK95b]. However, the SQL/92 standard does not capture the full range of modelling abstractions, e.g. inheritance representing generalisation hierarchies.
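The difference can be sketched with a small, hypothetical example. A DBMS conforming to SQL/92 records declarations such as the following in its catalogue, whereas an SQL/86-era system would typically hold only the bare table and column definitions; the final table is not part of either standard and merely indicates the kind of information that has to be recorded separately:

    -- SQL/92-style declarations: the key, referential and check clauses
    -- become catalogue meta-data that a re-engineering tool can read
    -- back to recover classification and aggregation abstractions.
    CREATE TABLE department (
        dept_no INTEGER      NOT NULL PRIMARY KEY,
        dname   VARCHAR(30)  NOT NULL
    );

    CREATE TABLE employee (
        emp_no  INTEGER      NOT NULL PRIMARY KEY,
        name    VARCHAR(40)  NOT NULL,
        dept_no INTEGER      REFERENCES department (dept_no),
        salary  DECIMAL(8,2) CHECK (salary >= 0)
    );

    -- Hypothetical auxiliary table for knowledge that SQL/92 cannot
    -- express, e.g. that MANAGER is a subtype of EMPLOYEE.
    CREATE TABLE subtype_link (
        subtype   VARCHAR(32) NOT NULL,
        supertype VARCHAR(32) NOT NULL
    );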
This means that early relational database applications are now legacy systems, as they fail to naturally represent additional information such as constraint and rule definitions. Such legacy database systems are being migrated to modern database systems not only to gain the benefits of the current technology but also to be compatible with new applications built with the modern technology. The SQL standards are currently subject to review to permit the representation of extra knowledge (e.g. object-oriented features), and we have anticipated some of these proposals in our work - i.e. SQL-3 [ISO94] (which, during the life-time of this project, moved from a preliminary draft through several modifications before being finalised in 1995) will be adopted by commercial systems, and thus the current modern DBMSs
will become legacy databases in the near future, or may already be considered to be legacy databases in that their data model type will have to be mapped onto the newer version. Having experienced the development process of recent DBMSs, it is inevitable that most current databases will have to be migrated, either to a newer version of the existing DBMS or to a completely different newer technology DBMS, for a variety of reasons. Thus the migration of legacy databases is perceived to be a continuing requirement in any organisation as technology advances continue to be made.

Most migrations currently being undertaken are based on code-to-code level translations of the applications and associated databases to enable the older system to be functional in the target environment. Minimal structural changes are made to the original system and database, thus the design structures of these systems are still old-fashioned, although they are running in a modern computing environment. This means that such systems are inflexible and cannot be easily enhanced with new functions or integrated with other applications in their new environment. We have also observed that more recent database systems have often failed to benefit from modern database technology due to inherent design faults that have resulted in the use of unnormalised structures, which cause omission of the features enforcing integrity constraints even when this is possible. The ability to create and use databases without the benefit of a database design course is one reason for such design faults. Hence there is a need to assist existing systems to be evolved, not only to perform new tasks but also to improve their structure, so that these systems can maximise the gains they receive from their current technology environment and any environment they migrate to in the future.

2.1.2 Narrowing Down the Problem

Technological advances in both hardware and software have improved the performance and maintenance functionality of all information systems (ISs), and as a result older ISs suffer from comparatively poor performance and inappropriate functionality when compared with more modern systems. Most of these legacy systems are written in a 3GL such as COBOL, have been around for many years, and run on old-fashioned mainframes. Problems associated with legacy systems are being identified and various solutions are being developed [BRO93, SHE94, BRO95]. These systems basically have three functional components, namely: interface, application and database service, which are sometimes inter-related, depending on how they were used during the design and implementation stages of the IS development. This means that the complexity of a legacy IS depends on what occurred during the design and implementation of the system. These systems may range from a simple single-user database application using separate interfaces and applications, to a complex multi-purpose unstructured application.

Due to the complex nature of the problem area we do not address this issue as a whole, but focus only on problems associated with one sub-component of such legacy information systems, namely the database service. This in itself is a wide field, and we have further restricted ourselves to legacy ISs using a specific DBMS for their database service. We considered data models ranging from original flat file and relational systems, to modern relational DBMSs and object-oriented DBMSs.
From these data models we have chosen the traditional relational model for the following reasons:

• The relational model is currently the most widely used database model.
• During the last two decades the relational model has been the most popular model; therefore it has been used to develop many database applications and most of these are now legacy systems.
• There have been many extensions and variations of the relational model, which has resulted in many heterogeneous relational database systems being used in organisations.
• The relational model can be enhanced to represent additional semantics currently supported only by modern DBMSs (e.g. extended relational systems [ZDO90, CAT94]).

As most business requirements change with time, the need to enhance and migrate legacy information systems exists for almost every organisation. We address problems faced by these users while seeking a solution that prevents new systems becoming legacy systems in the near future. The selection of the relational model as our database service to demonstrate how one could achieve these needs means that we shall be addressing only relational legacy database systems and not looking at any other type of legacy information system. This decision means we are not considering the many common legacy IS migration problems identified by Brodie [BRO95] (e.g. migration of legacy database services such as flat-file structures or hierarchical databases into modern extended relational databases; migration of legacy applications with millions of lines of code written in some COBOL-like language into a modern 4GL/GUI environment). However, as shown later, addressing the problems associated with relational legacy databases has enabled us to identify and solve problems associated with more recent DBMSs, and it also assists in identifying precautions which, if implemented by designers of new systems, will minimise the chance of similar problems being faced by these systems as IS developments occur in the future.

2.2 Overview of the Research Approach

Having presented an overview of our problem and narrowed it down, we identify the following as the main functionalities that should be provided to fulfil our research goal:

• Reverse-engineering of a relational legacy database to fully portray its current information content.
• Enhancing a legacy database with new knowledge to identify modelling concepts that should be available to the database concerned or to applications using that database.
• Determining the extent to which the legacy database conforms to its existing and enhanced descriptions.
• Ensuring that the migrated IS will not become a legacy IS in the future.

We need to consider the heterogeneity issue in order to be able to reverse-engineer any given relational legacy database. Three levels of heterogeneity are present for a particular data model, namely: the physical, logical and data management levels. The physical level of heterogeneity usually arises due to different data model implementation techniques, use of different computer platforms and use of different DBMSs. The physical/logical data independence of DBMSs hides implementation differences from users, hence we need only address how to access databases that are built using different DBMSs, running on different computer platforms.
Differences in DBMS characteristics lead to heterogeneity at the logical level. Here, the different DBMSs conform to a particular standard (e.g. SQL/86 or SQL/92), which supports a particular database query language (e.g. SQL or QUEL) and different relational data model features (e.g. handling of integrity constraints and availability of object-oriented features). To tackle heterogeneity at the logical level, we need to be aware of different standards, and to model ISs supporting different features and query languages.

Heterogeneity at the data management level arises due to the physical limitations of a DBMS, differences in the logical design, and inconsistencies that occurred when populating the database. Logical differences in different database schemas have to be resolved only if we are going to integrate them. The schema integration process is concerned with merging different related database applications. Such a facility can assist the migration of heterogeneous database systems. However, any attempt to integrate legacy database schemas prior to the migration process complicates the entire process, as it is similar to attempting to provide new functionalities within the system which is being migrated. Such attempts increase the chance of failure of the overall migration process. Hence we consider any integration or enhancements in the form of new functionalities only after successfully migrating the original legacy IS. However, the physical limitations of a DBMS and data inconsistencies in the database need to be addressed beforehand to ensure a successful migration.

Our work addresses the heterogeneity issues associated with database migration by adopting an approach that allows its users to incrementally increase the number of DBMSs it can handle without having to reprogram its main application modules. Here, the user needs to supply specific knowledge about DBMS schema and query language constructs. This is held together with the knowledge of the DBMSs already supported and has no effect on the application's main processing modules.

2.2.1 Meta-Programming

Meta-programming technology allows the meta-data (schema information) of a database to be held and processed independently of its source specification language. This allows us to work in a database-language-independent environment and hence overcome many logical heterogeneity issues. Prolog-based meta-programming technology has been used in previous research at Cardiff in the area of logical heterogeneity [FID92, QUT94]. Using this technology the meta-translation of database query languages [HOW87] and database schemas [RAM91] has been performed. This work has shown how the heterogeneity issues of different DBMSs can be addressed without having to reprogram the same functionality for each and every DBMS.

We use meta-programming technology for our legacy database migration approach as we need to be able to start with a legacy source database and end with a modern target database, where the respective database schema and query languages may be different from each other. In this approach the source database schema or query language is mapped on input into an internal canonical form. All the required processing is then done using the information held in this internal form. This information is finally mapped to the target schema or query language to produce the desired output. The advantage of this approach is that processing is not affected by heterogeneity, as it is always performed on data held in the canonical form. This canonical form is an enriched collection of semantic data modelling features.
2.2.2 Application

We view our migration approach as consisting of a series of stages, with the final stage being the actual migration and earlier stages being preparatory. At stage 1, the data definition of the selected database is reverse-engineered to produce a graphical display (cf. paths A-1 and A-2 of figure 2.1). However, in legacy systems much of the information needed to present the database schema in this way is not available as part of the database meta-data, and hence these links, although present in the database, cannot be shown in this conceptual model. In modern systems such links can be identified using constraint specifications. Thus, if the database does not have any explicit constraints, or it does but these are incomplete, new knowledge about the database needs to be entered at stage 2 (cf. path B-1 of figure 2.1), which will then be reflected in the enhanced schema appearing in the graphical display (cf. path B-2 of figure 2.1). This enhancement will identify new links that should be present for the database concerned. These new database constraints can next be applied experimentally to the legacy database to determine the extent to which it conforms to them. This process is done at stage 3 (cf. paths C-1 and C-2 of figure 2.1). The user can then decide whether these constraints should be enforced to improve the quality of the legacy database prior to its migration. At this point the three preparatory stages in the application of our approach are complete. The actual migration process is then performed. All stages are further described below to enable us to identify the main processing components of our proposed system, as well as to explain how we deal with different levels of heterogeneity.

Stage 1: Reverse Engineering

In stage 1, the data definition of the selected database is reverse-engineered to produce a graphical display of the database. To perform this task, the database's meta-data must be extracted (cf. path A-1 of figure 2.1). This is achieved by connecting directly to the heterogeneous database. The accessed meta-data needs to be represented using our internal form. This is achieved through a schema mapping process as used in the SMTS (Schema Meta-Translation System) of Ramfos [RAM91]. The meta-data in our internal formalism then needs to be processed to derive the graphical constructs present for the database concerned (cf. path A-2 of figure 2.1). These constructs are in the form of entity types and relationships, and their derivation is the main processing component of stage 1. The identified graphical constructs are mapped to a display description language to produce a graphical display of the database.
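Where the source DBMS (or the augmented database) already records referential constraints, the relationship constructs can be derived by catalogue queries. The sketch below assumes an SQL/92-style information schema, which not all of the legacy DBMSs considered in this work provide; view and column names vary between real products:

    -- Sketch: derive one relationship per foreign key by pairing each
    -- referencing table with the table that owns the referenced key
    -- (SQL/92-style information schema assumed).
    SELECT tc.table_name AS referencing_table,
           uc.table_name AS referenced_table,
           rc.constraint_name
    FROM   information_schema.referential_constraints rc
           JOIN information_schema.table_constraints tc
             ON tc.constraint_name = rc.constraint_name
           JOIN information_schema.table_constraints uc
             ON uc.constraint_name = rc.unique_constraint_name;

Each pair returned becomes a relationship between the corresponding entity types in the graphical model; for legacy databases that lack this meta-data, the same information is obtained instead from the enhancements entered at stage 2.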
[Figure 2.1: Information flow in the 3 stages of our approach prior to migration - heterogeneous databases feed internal processing, which supports schema visualisation (EER or OMT) with constraints via paths A-1/A-2 (Stage 1: Reverse Engineering), B-1 to B-3 (Stage 2: Knowledge Augmentation) and C-1/C-2 (Stage 3: Constraint Enforcement).]

a) Database connectivity for heterogeneous database access

Unlike the previous Cardiff meta-translation systems [HOW87, RAM91, QUT92], which addressed heterogeneity at the logical and data management levels, our system looks at the physical level as well. While these previous systems processed schemas in textual form and did not access actual databases to extract their DDL specification, our system addresses physical heterogeneity by accessing databases running on different hardware/software platforms (e.g. computer systems, operating systems, DBMSs and network protocols). Our aim is to directly access the meta-data of a given database application by specifying its name, the name and version of the host DBMS, and the address of the host machine (we assume that access privileges for this host machine and DBMS have been granted). If this database access process can produce a description of the database in DDL formalism, then this textual file is used as the starting point for the meta-translation process, as in previous Cardiff systems [RAM91, QUT92]. We found that it is not essential to produce such a textual file, as the required intermediate representation can be directly produced by the database access process. This means that we could also by-pass the meta-translation step that analyses the DDL text to translate it into the intermediate representation (i.e. producing a list of tokens ready for syntactic analysis in the parsing phase, processed according to the BNF syntax specification of the DDL [QUT92]). However, the DDL formalism of the schema can be used for optional textual viewing and could also serve as the starting point for other tools developed at Cardiff for meta-programming database applications (e.g. the Schema Meta-Integration System (SMIS) of Qutaishat [QUT92]).

The initial functionality of the Stage 1 database connectivity process is to access a heterogeneous database and supply the accessed meta-data as input to our schema meta-translator
(SMTS). This module needs to deal with heterogeneity at the physical and data management levels. We achieve this by using DML commands of the specific DBMS to extract the required meta-data held in its data dictionaries, which are treated like user-defined tables. Relatively recently, the functionalities of a heterogeneous database access process have been provided by means of drivers such as ODBC [RIC94]. Use of such drivers allows access to any database supported by them and hence obviates the need to develop specialised tools for each database type, as happened in our case. These driver products were not available when we undertook this stage of our work.

b) Schema meta-translation

The schema meta-translation process [RAM91] accepts as input any database schema irrespective of its DDL and features. The information captured during this process is represented internally to enable it to be mapped from one database schema to another, or to be further processed and supplied to other modules such as the schema meta-visualisation system (SMVS) [QUT93] and the query meta-translation system (QMTS) [HOW87]. Thus, the use of an internal canonical form for meta representation has successfully accommodated heterogeneity at the data management and logical levels.

c) Schema meta-visualisation

Schema visualisation using graphical notation and diagrams has proved to be an important step in a number of applications, e.g. during the initial stages of the database design process, for database maintenance, re-design, enhancement, integration or migration, as it gives users a sound understanding of an existing database's structure in an easily assimilated format [BAT92, ELM94]. Database users need to see a visual picture of their database structure instead of textual descriptions of the defining schema, as it is easier for them to comprehend a picture. This has led to the production of graphical representations of schema information, effected by a reverse-engineering process. Graphical data models of schemas employ a set of data modelling concepts and a language-independent graphical notation (e.g. the Entity-Relationship (E-R) model [CHE76], the Extended/Enhanced Entity Relationship (EER) model [ELM94] or the Object Modelling Technique (OMT) [RUM91]). In a heterogeneous environment different users may prefer different graphical models, and may require an understanding of the database structure and architecture beyond that given by the traditional entities and their properties. Therefore, there is a need to produce graphical models of a database's schema using different graphical notations, such as E-R/EER or OMT, and to accompany them with additional information such as a display of the integrity constraints in force in the database [WIK95b]. The display of integrity constraints allows users to look at intra- and inter-object constraints and gain a better understanding of the domain restrictions applicable to particular entities. Current reverse-engineering tools do not support this type of display.

The generated graphical constructs are held internally in a similar form to the meta-data of the database schema. Hence, using a schema meta-visualisation process (SMVS), it is possible to map the internally held graphical constructs into appropriate graphical symbols and coordinates for the graphical display of the schema. This approach has a similarity to the SMTS, the main
difference being that the output is graphical rather than textual.

Stage 2: Knowledge Augmentation

In a heterogeneous distributed database environment, evolution is expected, especially in legacy databases. This evolution can affect the schema description, and in particular schema constraints that are not reflected in the stage 1 (path A-2) graphical display, as they may be implicit in applications. Thus our system is designed to accept new constraint specifications (cf. path B-1 of figure 2.1) and add them to the graphical display (cf. path B-2 of figure 2.1) so that these hidden constraints become explicit. The new knowledge accepted at this point is used to enhance the schema and is retained in the database using a database augmentation process (cf. path B-3 of figure 2.1). The new information is stored in a form that conforms with the enhanced target DBMS's methods of storing such information. This assists the subsequent migration stage.

a) Schema enhancement

Our system needs to permit a database schema to be enhanced by specifying new constraints applicable to the database. This process is performed via the graphical display. These constraints, which are in the form of integrity constraints (e.g. primary key, foreign key, check constraints) and structural components (e.g. inheritance hierarchies, entity modifications), are specified using a GUI. When they are entered they will appear in the graphical display.

b) Database augmentation

The input data to enhance a schema provides new knowledge about a database. It is essential to retain this knowledge within the database itself if it is to be readily available for any further processing. Typically, this information is retained in the knowledge base of the tool used to capture the input data, so that it can be reused by the same tool. This approach restricts the use of this knowledge by other tools, and hence it must be re-entered every time the re-engineering process is applied to that database. This makes it harder for the user to gain a consistent understanding of an application, as different constraints may be specified during two separate re-engineering processes. To overcome this problem, we augment the database itself using the techniques proposed in SQL-3 [ISO94], wherever possible. When it is not possible to use SQL-3 structures we store the information in our own augmented table format, which is a natural extension of the SQL-3 approach. When a database is augmented using this method, the new knowledge is available in the database itself. Hence, any further re-engineering processes need not make requests for the same additional knowledge. The augmented tables are created and maintained in a similar way to user-defined tables, but have a special identification to distinguish them. Their structure is in line with the international standards and the newer versions of commercial DBMSs, so that the enhanced database can be easily migrated either to a newer version of the host DBMS or to a different DBMS supporting the latest SQL standards. Migration should then mean that the newer system can enforce the constraints. Our approach should also mean that it is easy to map our tables for
holding this information into the representation used by the target DBMS even if it is different, as we are mapping from a well-defined structure.

Legacy databases that do not support explicit constraints can be enhanced by using the above knowledge augmentation method. This requirement is less likely to occur for databases managed by more recent DBMSs, as they already hold some constraint specification information in their system tables. The direction taken by Oracle version 6 was a step towards our augmentation approach, as it allowed the database administrator to specify integrity constraints such as primary and foreign keys, but did not yet enforce them [ROL92]. The next release of Oracle, i.e. version 7, implemented this constraint enforcement process.

Stage 3: Constraint Enforcement

The enhanced schema can be held in the database, but the DBMS can only enforce these constraints if it has the capability to do so. This will not normally be the case in legacy systems. In this situation, the new constraints may be enforced via a newer version of the DBMS or by migrating the database to another DBMS supporting constraint enforcement. However, the data being held in the database may not conform to the new constraints, and hence existing data may be rejected by the target DBMS in the migration, thus losing data and/or delaying the migration process. To address this problem and to assist the migration process, we provide an optional constraint enforcement process module which can be applied to a database before it is migrated. The objective of this process is to give users the facility to ensure that the database conforms to all the enhanced constraints before migration occurs. This process is optional so that the user can decide whether these constraints should be enforced to improve the quality of the legacy data prior to its migration, whether it is best left as it stands, or whether the new constraints are too severe. The constraint definitions in the augmented schema are employed to perform this task. As all constraints held have already been internally represented in the form of logical expressions, these can be used to produce data manipulation statements suitable for the host DBMS. Once these statements are produced, they are executed against the current database to identify the existence of data violating a constraint.

Stage 4: Migration Process

The migration process itself is performed incrementally, by initially creating the target database and then copying the legacy data over to it. The schema meta-translation (SMTS) technique of Ramfos [RAM91] is used to produce the target database schema. The legacy data can be copied using the import/export tools of the source and target DBMSs or DML statements of the respective DBMSs. During this process, the legacy applications must continue to function until they too are migrated. To achieve this, an interface can be used to capture and process all database queries of the legacy applications during migration. This interface can decide how to process database queries against the current state of the migration and re-direct those that now relate to the target database. The query meta-translation (QMTS) technique of Howells [HOW87] can be used to convert these queries to the target DML. This approach will facilitate transparent migration for legacy databases. Our work does not involve the development of an interface to capture and
  • 25. process all database queries, as interaction with the query interface of the legacy IS is embedded in the legacy application code. However, we demonstrate how to create and populate a legacy database schema in the desired target environment while showing the role of SMTS and QMTS in such a process. 2.3 The Role of CCVES in Context of Heterogeneous Distributed Databases Our approach described in section 2.2 is based on preparing a legacy database schema for graceful migration. This involves visualisation of database schemas with constraints and enhancing them with constraints to capture more knowledge. Hence we call our system the Conceptualised Constraint Visualisation and Enhancement System (CCVES). CCVES has been developed to fit in with the previously developed schema (SMTS) [RAM91] and query (QMTS) [HOW87] meta-translation systems, and the schema meta- visualisation system (SMVS) [QUT93]. This allows us to consider the complementary roles of CCVES, SMTS, QMTS and SMVS during Heterogeneous Distributed Database access in a uniform way [FID92, QUT94]. The combined set of tools achieves semantic coordination and promotes interoperability in a heterogeneous environment at logical, physical and data management levels. Figure 2.2 illustrates the architecture of CCVES in the context of heterogeneous distributed databases. It outlines in general terms the process of accessing a remote (legacy) database to perform various database tasks, such as querying, visualisation, enhancement, migration and integration. There are seven sub-processes: the schema mapping process [RAM91], query mapping process [HOW87], schema integration process [QUT92], schema visualisation process [QUT93], database connectivity process, database enhancement process and database migration process. The first two processes together have been called the Integrated Translation Support Environment [FID92], and the first four processes together have been called the Meta-Integration/Translation Support Environment [QUT92]. The last three processes were introduced as CCVES to perform database enhancement and migration in such an environment. The schema mapping process, referred to as SMTS, translates the definition of a source schema to a target schema definition (e.g. an INGRES schema to a POSTGRES schema). The query mapping process, referred to as QMTS, translates a source query to a target query (e.g. an SQL query to a QUEL query). The meta-integration process, referred to as SMIS, tackles heterogeneity at the logical level in a distributed environment containing multiple database schemas (e.g. Ontos and Exodus local schemas with a POSTGRES global schema) - it integrates the local schemas to create the global schema. The meta-visualisation process, referred to as SMVS, generates a graphical representation of a schema. The remaining three processes, namely: database connectivity, enhancement and migration with their associated processes, namely: SMVS, SMTS and QMTS, are the subject of the present thesis, as they together form CCVES (centre section of figure 2.2). The database connectivity process (DBC), queries meta-data from a remote database (route Page 24
  • 26. A-1 in figure 2.2) to supply meta-knowledge (route A-2 in figure 2.2) to the schema mapping process referred to as SMTS. SMTS translates this meta-knowledge to an internal representation which is based on SQL schema constructs. These SQL constructs are supplied to SMVS for further processing (route A-3 in figure 2.2) which results in the production of a graphical view of the schema (route A-4 in figure 2.2). Our reverse-engineering techniques [WIK95b] are applied to identify entity and relationship types to be used in the graphical model. Meta-knowledge enhancements are solicited at this point by the database enhancement process (DBE) (route B-1 in figure 2.2), which allows the definition of new constraints and changes to the existing schema. These enhancements are reflected in the graphical view (route B-2 and B-3 in figure 2.2) and may be used to augment the database (route B-4 to B-8 in figure 2.2). This approach to augmentation makes use of the query mapping process, referred to as QMTS, to generate the required queries to update the database via the DBC process. At this stage any existing or enhanced constraints may be applied to the database to determine the extent to which it conforms to the new enhancements. Carrying out this process will also ensure that legacy data will not be rejected by the target DBMS due to possible violations. Finally, the database migration process, referred to as DBMI, assists migration by incrementally migrating the database to the target environment (route C-1 to C-6 in figure 2.2). Target schema constructs for each migratable component are produced via SMTS, and DDL statements are issued to the target DBMS to create the new database schema. The data for these migrated tables are extracted by instructing the source DBMS to export the source data to the target database via QMTS. Here too, the queries which implement this export are issued to the DBMS via the DBC process. 2.4 Research Aims and Objectives Our relational database enhancement and augmentation approach is important in three respects, namely: 1) by holding the additional defining information in the database itself, this information is usable by any design tool in addition to assisting the full automation of any future re- engineering of the same database; 2) it allows better user understanding of database applications, as the associated constraints are shown in addition to the traditional entities and attributes at the conceptual level; Page 25
  • 27. 3) the process which assists a database administrator to clean inconsistent legacy data ensures a safe migration. To perform this latter task in a real world situation without an automated support tool is a very difficult, tedious, time consuming and error prone task. Therefore the main aim of this project has been the design and development of a tool to assist database enhancement and migration in a heterogeneous distributed relational database environment. Such a system is concerned with enhancing the constituent databases in this type of environment to exploit potential knowledge both to automate the re-engineering process and to assist in evolving and cleaning the legacy data to prevent data rejection, possible losses of data and/or delays in the migration process. To this end, the following detailed aims and objectives have been pursued in our research: 1. Investigation of the problems inherent in schema enhancement and migration for a heterogeneous distributed relational legacy database environment, in order to fully understand these processes. 2. Identification of the conceptual foundation on which to successfully base the design and development of a tool for this purpose. This foundation includes: • A framework to establish meta-data representation and manipulation. • A real world data modelling framework that facilitates the enhancement of existing working systems and which supports applications during migration. • A framework to retain the enhanced knowledge for future use which is in line with current international standards and techniques used in newer versions of relational DBMSs. • Exploiting existing databases in new ways, particularly linking them with data held in other legacy systems or more modern systems. • Displaying the structure of databases in a graphical form to make it easy for users to comprehend their contents. • The provision of an interactive graphical response when enhancements are made to a database. • A higher level of data abstraction for tasks associated with visualising the contents, relationships and behavioural properties of entities and constraints. • Determining the constraints on the information held and the extent to which the data conforms to these constraints. • Integrating with other tools to maximise the benefits of the new tool to the user community. 3. Development of a prototype tool to automate the re-engineering process and the migration assisting tasks as far as possible. The following development aims have been chosen for this system: • It should provide a realistic solution to the schema enhancement and migration assistance process. • It should be able to access and perform this task for legacy database systems. • It should be suitable for the data model at which it is targeted. • It should be as generic as possible so that it can be easily customised for other data models. • It should be able to retain the enhanced knowledge for future analysis by itself and other Page 26
  • 28. tools. • It should logically support a model using modern data modelling techniques irrespective of whether it is supported by the DBMS in use. • It should make extensive use of modern graphical user interface facilities for all graphical displays of the database schema. • Graphical displays should also be as generic as possible so that they can be easily enhanced or customised for other display methods. Page 27
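To make the augmentation and constraint enforcement stages of section 2.2 more concrete, the following sketch shows, in SQL, (a) one possible shape for an augmented table recording an enhanced constraint and (b) the kind of query the enforcement module could generate from it to locate violating legacy data. The table layout, names and columns are illustrative assumptions introduced only for exposition; they are not the exact augmented-table format used by CCVES.

    -- (a) A hypothetical augmented table recording constraints that the legacy
    --     DBMS cannot itself represent. The name prefix stands in for the
    --     "special identification" mentioned above (an assumption).
    CREATE TABLE ccves_constraints (
        constraint_name  CHAR(30)  NOT NULL,
        table_name       CHAR(30)  NOT NULL,
        constraint_type  CHAR(12)  NOT NULL,   -- e.g. 'PRIMARY KEY', 'FOREIGN KEY', 'CHECK'
        definition       CHAR(240)             -- SQL-style text of the constraint
    );

    INSERT INTO ccves_constraints
    VALUES ('emp_worksfor_fk', 'Employee', 'FOREIGN KEY',
            'FOREIGN KEY (WorksFor) REFERENCES Department (DeptCode)');

    -- (b) A violation-detection query that the enforcement module might generate
    --     from the stored definition before migration: it lists employees whose
    --     WorksFor value has no matching department.
    SELECT e.*
    FROM   Employee e
    WHERE  e.WorksFor IS NOT NULL
    AND    NOT EXISTS (SELECT 1
                       FROM   Department d
                       WHERE  d.DeptCode = e.WorksFor);

Rows returned by such a query represent legacy data that would be rejected by a target DBMS enforcing the constraint, and so can be cleaned, or the constraint relaxed, before migration proceeds.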
  • 29. CHAPTER 3 Database Technology, Relational Model, Conceptual Modelling and Integrity Constraints The origins and historical development of database technology are initially presented here to focus the evolution of ISs and the emergence of database models. The relational data model is identified as currently the most commonly used database model and some terminology for this data model, along with its features including query languages is then presented. A discussion of conceptual data models with special emphasis on EER and OMT is provided to introduce these data models and the symbols used in our project. Finally, we pay attention to crucial concepts relating to our work, namely the notion of semantic integrity constraints, with special emphasis on those used in semantic extensions to the relational model. The relational database language SQL is also discussed, identifying how and when it supports the implementation of these semantic integrity constraints. 3.1 Origins and Historical Developments The origin of data management goes back to the 1950’s and hence, this section is sub divided into two parts: the first part describes database technology prior to the relational data model, and the second part describes developments since. This division was chosen as the relational model is currently the most dominant database model for information management [DAT90]. 3.1.1 Database Technology Prior to the Relational Data Model Database technology emerged from the need to manipulate large collections of data for frequently used data queries and reports. The first major step in mechanisation of information systems came with the advent of punched card machines which worked sequentially on fixed-length fields [SEN73, SEN77]. With the appearance of stored program computers, tape-oriented systems were used to perform these tasks with an increase in user efficiency. These systems used sequential processing of files in batch mode, which was adequate until peripheral storage with random access capabilities (e.g. DASD) and time sharing operating systems with interactive processing appeared to support real-time processing in computer systems. Access methods such as direct and indexed sequential access methods (e.g. ISAM, VSAM) [BRA82, MCF91] were used to assist with the storage and location of physical records in stored files. Enhancements were made to procedural languages (e.g. COBOL) to define and manage application files, making the application program dependent on the organisation of the file. This technique caused data redundancy as several files were used in systems to hold the same data (e.g. emp_name and address in a payroll file; insured_name and address in an insurance file; and depositors_name and address in a bank file). These stored data files used in the applications of the 1960's are now referred to as conventional file systems, and they were maintained using third generation programming languages such as COBOL and PL/1. This evolution of mechanised information systems was influenced by the hardware and software developments which occurred in the 1950’s and early 1960’s. Most long existing legacy ISs are based on this technology. Our work does not address this type of IS as they do not use a DBMS for their data management. The evolution of databases and database management systems [CHA76, FRY76, SIB76,
  • 30. Chapter 3 Database Technology, Relational Model, Conceptual Modelling and Integrity Constraints SEN77, KIM79, MCG81, SEL87, DAT90, ELM94] was to a large extent the result of addressing the main deficiencies in the use of files, i.e. by reducing data redundancy and making application programs less dependent on file organisation. An important factor in this evolution was the development of data definition languages which allowed the description of a database to be separated from its application programs. This facility allowed the data definition (often called a schema) to be shared and integrated to provide a wide variety of information to the users. The repository of all data definitions (meta data) is called data dictionaries and their use allows data definitions to be shared and widely available to the user community. In the late 1960's applications began to share their data files using an integrated layer of stored data descriptions, making the first true database, e.g. the IMS hierarchical database [MCG77, DAT90]. This type of database was navigational in nature and applications explicitly followed the physical organisation of records in files to locate data using commands such as GNP - get next under parent. These databases provided centralised storage management, transaction management, recovery facilities in the event of failure and system maintained access paths. These were the typical characteristics of early DBMSs. Work on extending COBOL to handle databases was carried out in the late 60s and 70s. This resulted in the establishment of the DBTG (i.e. DataBase Task Group) of CODASYL and the formal introduction of the network model along with its data manipulation commands [DBTG71]. The relational model was proposed during the same period [COD70], followed by the 3 level ANSI/SPARC architecture [ANSI75] which made databases more independent of applications, and became a standard for the organisation of DBMSs. Three popular types of commercial database systems7 classified by their underlying data model emerged during the 70s [DAT90, ELM94], namely: • hierarchical • network • relational and these have been the dominant types of DBMS from the late 60s on into the 80s and 90s. 3.1.2 Database Technology Since the Relational Data Model At the same time as the relational data model appeared, database systems introduced another layer of data description on top of the navigational functionality of the early hierarchical and network models to bring extra logical data independence8. The relational model also introduced the use of non-procedural (i.e. declarative) languages such as SQL [CHA74]. By the early 1980's many relational database products, e.g. System R [AST76], DB2 [HAD84], INGRES [STO76] and Oracle were in use and due to their growing maturity in the mid 80s and the complexity of programming, navigating, and changing data structures in the older DBMS data models, the relational data model was able to take over the commercial database market with the result that it is now dominant. 7 Other types such as flat file, inverted file systems were also used. 8 This allows changes to the logical structure of data without changing the application programs. Page 29
  • 31. Chapter 3 Database Technology, Relational Model, Conceptual Modelling and Integrity Constraints The advent of inexpensive and reliable communication between computer systems, through the development of national and international networks, has brought further changes in the design of these systems. These developments led to the introduction of distributed databases, where a processor uses data at several locations and links it as though it were at a single site. This technology has led to distributed DBMSs and the need for interoperability among different database systems [OZS91, BEL92]. Several shortcomings of the relational model have been identified, including its inability to perform efficiently compute-intensive applications such as simulation, to cope with computer-aided design (CAD) and programming language environments, and to represent and manipulate effectively concepts such as [KIM90]: • Complex nested entities (e.g. design and engineering objects), • Unstructured data (e.g. images, textual documents), • Generalisation and aggregation within a data structure, • The notion of time and versioning of objects and schemas, • Long duration transactions. The notion of a conceptual schema for application-independent modelling introduced by the ANSI/SPARC architecture led to another data model, namely: the semantic model. One of the most successful semantic models is the entity-relationship (E-R) model [CHE76]. Its concepts include entities, relationships, value sets and attributes. These concepts are used in traditional database design as they are application-independent. Many modelling concepts based on variants/extensions to the E-R model have appeared since Chen’s paper. The enhanced/extended entity-relationship model (EER) [TEO86, ELM94], the entity-category-relationship model (ECR) [ELM85], and the Object Modelling Technique (OMT) [RUM91] are the most popular of these. The DAPLEX functional model [SHI81] and the Semantic Data Model [HAM81] are also semantic models. They capture a richer set of semantic relationships among real-world entities in a database than the E-R based models. Semantic relationships such as generalisation / specialisation between a superclass and its subclass, the aggregation relationship between a class and its attributes, the instance-of relationship between an instance and its class, the part-of relationship between objects forming a composite object, and the version-of relationship between abstracted versioned objects are semantic extensions supported in these models. The object-oriented data model with its notions of class hierarchy, class-composition hierarchy (for nested objects) and methods could be regarded as a subset of this type of semantic data model in terms of its modelling power, except for the fact that the semantic data model lacks the notion of methods [KIM90] which is an important aspect of the object-oriented model. The relational model of data and the relational query language have been extended [ROW87] to allow modelling and manipulation of additional semantic relationships and database facilities. These extensions include data abstraction, encapsulation, object identity, composite objects, class hierarchies, rules and procedures. 
However, these extended relational systems are still being evolved to fully incorporate features such as implementation of domain and extended data types, enforcement of primary and foreign key and referential integrity checking, prohibition of duplicate rows in tables and views, handling missing information by supporting four-valued predicate logic
  • 32. Chapter 3 Database Technology, Relational Model, Conceptual Modelling and Integrity Constraints (i.e. true, false, unknown, not applicable) and view updatability [KIV92], and they are not yet available as commercial products. The early 1990's saw the emergence of new database systems by a natural evolution of database technology, with many relational database systems being extended and other data models (e.g. the object-oriented model) appearing to satisfy more diverse application needs. This opened opportunities to use databases for a greater diversity of applications which had not been previously exploited as they were not perceived as tractable by a database approach (e.g. Image, medical, document management, engineering design and multi-media information, used in complex information processing applications such as office automation (OA), computer-aided design (CAD), computer-aided manufacturing (CAM) and hyper media [KIM90, ZDO90, CAT94]). The object- oriented (O-O) paradigm represents a sound basis for making progress in these areas and as a result two types of DBMS are beginning to dominate in the mid 90s [ZDO90], namely: the object-oriented DBMS, and the extended relational DBMS. There are two styles of O-O DBMS, depending on whether they have evolved from extensions to an O-O programming language or by evolving a database model. Extensions have been created for two database models, namely: the relational and the functional models. The extensions to existing relational DBMSs have resulted in the so-called Extended Relational DBMSs which have O-O features (e.g. POSTGRES and Starburst), while extensions to the functional model have produced PROBE and OODAPLEX. The approach of extending O-O programming language systems with database management features has resulted in many systems (e.g. Smalltalk into GemStone and ALLTALK, and C++ into many DBMSs including VBase / ONTOS, IRIS and O2). References to these systems with additional information and references can be found in [CAT94]. Research is currently taking place into other kinds of database such as active, deductive and expert database systems [DAT90]. This thesis focuses on the relational model and possible extensions to it which can represent semantics in existing relational database information systems in such a way that these systems can be viewed in new ways and easily prepared for migration to more modern database environments. 3.2 Relational Data Model In this section we introduce some of the commonly used terminology of the relational model. This is followed by a selective description of the features and query languages of this model. Further details of this data model can be found in most introductory database text books, e.g. [MCF91, ROB93, ELM94, DAT95]. A relation is represented as a table (entity) in which each row represents a tuple (record), the number of columns being the degree of the relation and the number of rows being its cardinality. An example of this representation is shown in figure 3.1, which shows a relation holding Student details, with degree 3 and cardinality 5. This table and each of its columns are named, so that a unique identity for a table column of a given schema is achieved via its table name and column name. The columns of a table are called attributes (fields) each having its own domain (data type) representing its pool of legal data. Basic types of domains are used (e.g. integer, real, character, text, date) to define the domains of attributes. 
Constraints may be enforced to further restrict the pool of legal
  • 33. Chapter 3 Database Technology, Relational Model, Conceptual Modelling and Integrity Constraints values for an attribute. Tables which actually hold data are called base tables to distinguish them from view tables which can be used for viewing data associated with one or more base tables. A view table can also be an abstraction from a single base table which is used to control access to parts of the data. A column or set of columns whose values uniquely identify a row of a relation is called a candidate key (key) of the relation. It is customary to designate one candidate key of a relation as a primary key (e.g. SNO in figure 3.1). The specification of keys restricts the possible values the key attribute(s) may hold (e.g. no duplicate values), and is a type of constraint enforceable on a relation. Additional constraints may be imposed on an attribute to further restrict its legal values. In such cases, there should be a common set of legal values satisfying all the constraints of that attribute, ensuring its ability to accept some data. For example, a pattern constraint which ensures that the first character of SNO is ‘S’ further restricts the possible values of SNO - see figure 3.1. Many other concepts and constraints are associated with the relational model although most of them are not supported by early relational systems as, indeed, some of the more recent relational systems (e.g. a value set constraint for the Address field as shown in figure 3.1). Domain (type character) Value Set Constraint Pattern Constraint (all values begin with 'S') Primary Key (unique values) Student SNO Name Address Cardinality S1 Jones Cardiff S2 Smith Bristol : Relation Tuples S3 Gray Swansea S4 Brown Cardiff : S5 Jones Newport Attributes Degree Figure 3.1: The Student relation 3.2.1 Requisite Features of the Relational Model During the early stages of the development of relational database systems there were many requisite features identified which a comprehensive relational system should have [KIM79, DAT90]. We shall now examine these features to illustrate the kind of features expected from early relational database management systems. They included support for: • Recovery from both soft and hard crashes, • A report generator for formatted display of the results of queries, • An efficient optimiser to meet the response-time requirements of users, • User views of the stored database, Page 32
  • 34. Chapter 3 Database Technology, Relational Model, Conceptual Modelling and Integrity Constraints • A non-procedural language for query, data manipulation / definition / control, • Concurrency control to allow sharing of a database by multiple users and applications, • Transaction management, • Integrity control during data manipulation, • Selective access control to prevent one user’s database being accessed by unauthorised users, • Efficient file structures to store the database, and • Efficient access paths to the stored data. Many early relational DBMSs originated at universities and research institutes, and none of them were able to provide all the above features. These systems mainly focussed on optimising techniques for query processing and recovery from soft and hard crashes, and did not pay much attention to the other features. Few of these database systems were commercially available, and for those that were the marketing was based on specific features such as report generation (e.g. MAGNUM), and views with selective access control (e.g. QBE). The early commercial systems did not support the full range of features either. Since the mid 1980’s many database products have appeared which aim to provide most of the above features. The enforcement of features such as concurrency control was embodied in these systems, while features such as views, access and integrity control were provided via non-procedural language commands. Systems which were unable to provide these features via a non-procedural language offered procedural extensions (e.g. C with embedded SQL) to perform such tasks. This resulted in the use of two types of data manipulation languages, i.e. procedural and non-procedural, to perform database system functions. In procedural languages a sequence of statements is issued to specify the navigation path needed in the database to retrieve the required data, thus they are navigational languages. This approach was used by all hierarchical and network database systems and by some relational systems. However, most relational systems offer a non-procedural (i.e. non- navigational) language. This allows retrieval of the required data by using a single retrieval expression, which in general has a degree of complexity corresponding to the complexity of the retrieval (e.g. SQL). 3.2.2 Query Language for the Relational Model Querying or the retrieval of information from a database is perhaps the aspect of relational languages which has received the most attention. A variety of approaches to querying has emerged, based on relational calculus, relational algebra, mapping-oriented languages and graphic-oriented languages [CHA76, DAT90]. During the first decade of relational DBMSs, there were many experimental implementations of relational systems in universities and industry, particularly at IBM. The initial projects were aimed at proving the feasibility of relational database systems supporting high-level non-procedural retrieval languages. The Structured Query Language (SQL9) [AST75] emerged from an IBM research project. Later projects created more comprehensive relational DBMSs and among the more important of these systems were probably the System R project at IBM [AST76] and the INGRES project (with its QUEL query language) at the University of California at Berkeley [STO76]. 9 Initially called SEQUEL, and later pronounced as SEQUEL. Page 33
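As an illustration of the non-procedural style just described, the Student relation of figure 3.1 and a typical single-expression retrieval could be written in SQL roughly as follows. The data types and the membership list of the value set constraint are assumptions made for illustration; the pattern and value-set restrictions use SQL-92 style CHECK clauses, which early relational systems could not express declaratively.

    CREATE TABLE Student (
        SNO     CHAR(3)  NOT NULL PRIMARY KEY,   -- key constraint: unique, non-null values
        Name    CHAR(20),
        Address CHAR(20) CHECK (Address IN ('Cardiff', 'Bristol', 'Swansea', 'Newport')),  -- value set constraint
        CHECK (SNO LIKE 'S%')                    -- pattern constraint: values begin with 'S'
    );

    -- A single declarative retrieval expression: no navigation path is specified.
    SELECT Name
    FROM   Student
    WHERE  Address = 'Cardiff';

In a navigational system the same retrieval would require an explicit loop over stored records following predefined access paths; here the DBMS is left to choose the access strategy.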
  • 35. Chapter 3 Database Technology, Relational Model, Conceptual Modelling and Integrity Constraints Standards for relational query languages were introduced [ANSI86] so that a common language could be used to retrieve information from a database. SQL became the standard query language for relational databases. These standards were reviewed regularly [ANSI89a, ANSI92] and are being reviewed [ISO94] to incorporate technological changes that meet modern database requirements. Hence, the standard query language SQL is evolving, and although some of the recent database systems conform to [ANSI92] standards they will have to be upgraded to incorporate even more recent advances such as the object-oriented paradigm additions to the language [ISO94]. This means that different database system query languages conform to different standards, and provide different features and facilities to their users even though they are of the same type. Hence, information systems developed during different eras will have used different techniques to perform the same task, with early systems being more procedural in their approach than more recent ones. Query languages, including SQL, have three categories of statements, i.e. the data manipulation language (DML) statements to perform all retrievals, updates, insertions and deletions, the data definition language (DDL) statements to define the schema and its behavioural functions such as rules and constraints, and the data control language (DCL) statements to specify access control which is concerned with the privileges to be given to database users. 3.3 Conceptual Modelling The conceptual model is a high level representation of a data model, providing an identification and description of the main data objects (avoiding details of their implementation). This model is hardware and software independent, and is represented using a set of graphical symbols in a diagrammatic form. As noted in part ‘c’ of stage 1 of section 2.2.2, different users may prefer different graphical models and hence it is useful to provide them with a choice of models. We consider two types of conceptual model in this thesis, namely: the enhanced entity-relationship model (EER), which is based on the popular entity-relationship model, and the object-modelling technique (OMT), which uses the more recent concepts of object-oriented modelling as opposed to the entities of the E-R model. These were chosen as they are among the currently most widely used conceptual modelling approaches and they allow representation of modelling concepts such as generalisation hierarchies. 3.3.1 Enhanced Entity-Relationship Model (EER) The entity-relationship approach is considered to be the first useful proposal [CHE76] on the subject of conceptual modelling. It is concerned with creating an entity model which represents a high-level conceptual data model of the proposed database, i.e. it is an abstract description of the structure of the entities in the application domain, including their identity, relationship to other entities and attributes, without regard for eventual implementation details. Thus an E-R diagram describes entities and their relationships using distinctive symbols, e.g. an entity is a rectangle and a relationship is a diamond. Distinctive symbols for recent modelling concepts such as generalisation, aggregation and complex structures have been introduced into these models by practitioners. Despite its popularity, no standard has emerged or been defined for this model. 
Hence different authors use different notations to represent the same concept. Therefore we have to define our symbols for these concepts: we have based our definitions on [ROB93] and Page 34
  • 36. Chapter 3 Database Technology, Relational Model, Conceptual Modelling and Integrity Constraints [ELM94]. a) Entity An entity in the E-R model corresponds to a table in the relational environment and is represented by a rectangle containing the entity name, e.g. the entity Employee of figure 3.2. b) Attributes Attributes are represented by a circle that is connected to the corresponding entity by a line. Each attribute has a name located near the circle10, e.g. attributes EmpNo, Name and Address of the Employee entity in figure 3.2. Key attributes of a relation are indicated using a colour to fill in the circle (red on the computer screen or shaded dark in this thesis) (e.g. the attribute EmpNo of Employee in figure 3.2). Attributes usually have a single value in an entity occurrence although multivalued attributes can occur and other types such as derived attributes can be represented in the conceptual model (see appendix B for a comprehensive list of the symbols used in EER models in this thesis). c) Relationships A relationship is an association between entities. Each relationship is named and represented by a diamond-shaped symbol. Three types of relationships (one-to-many or 1:M, many-to-many or M:N, and one-to-one or 1:1) are used to describe the association between entities. Here 1 means that an instance of this entity relates to only one instance of the other entity (e.g. an employee works for only one department), and M or N means that an instance of an entity may relate to more than one instance of the other entity (e.g. a department can have many employees working for it - see figure 3.2), through this relationship (the same entities can be linked in more than one relationship). The relationship type is determined by the participating entities and their associated properties. In the relational model a separate entity is used for M:N relationship types (e.g. a composite entity as in the case of the entity ComMem of figure 3.2), and the other relationship types (i.e. 1:1 and 1:M) are represented by repeated attributes (e.g. relationship WorksFor of figure 3.2 is established from the attribute WorksFor of the entity Employee). 10 We do not place the attribute name inside the circle to avoid the use of large circles or ovals in our diagrams. Page 35
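In relational terms, the mapping just described might look as follows: the 1:M WorksFor relationship becomes a repeated attribute of Employee, while the M:N membership of employees on committees becomes the composite entity ComMem, whose primary key combines the keys of the entities it connects. The column types are assumptions, and for illustration a committee is assumed to be identified by its Title; figure 3.2, shown next, gives the corresponding EER diagram.

    CREATE TABLE Employee (
        EmpNo    CHAR(5)  NOT NULL PRIMARY KEY,
        Name     CHAR(20),
        Address  CHAR(30),
        WorksFor CHAR(5)               -- repeated attribute implementing the 1:M WorksFor relationship
    );

    CREATE TABLE ComMem (              -- composite entity for the M:N relationship
        EmpNo      CHAR(5)  NOT NULL,
        Title      CHAR(20) NOT NULL,  -- key of Committee (an assumption for this sketch)
        YearJoined INTEGER,
        PRIMARY KEY (EmpNo, Title)     -- key built from the keys of the connected entities
    );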
  • 37. Chapter 3 Database Technology, Relational Model, Conceptual Modelling and Integrity Constraints (Weak Entity) (Weak Relationship) (1,1) (1,N) Title Committee Fcom Faculty (1,N) (Composite Entity) ComMem YearJoined Office d (Generalised Entity) (1,N) (1,1) WorksFor (4,N) (Entity) Employee N (Relationships) 1 Department (0,1) (1,1) (Key) Head (Specialised Entity) EmpNo Address of Office Name (Attributes) Figure 3.2: EER diagram for part of the University Database A relationship’s degree indicates the number of associated entities (or participants) there are in the relationship. Relationships with 1, 2 and 3 participants are called unary, binary and ternary relationships, respectively. In practice most relationships are binary (e.g. relationship WorksFor in figure 3.2) and relationships of higher order (e.g. four) occur very rarely as they are usually simplified to a series of binary relationships. The term connectivity is used to describe the relationship classification and it is represented in the diagram by using 1 or N near the related entity (see for example, the WorksFor relationship in figure 3.2). Alternatively, a more detailed description of the relationship is specified using cardinality, which expresses the specific number of entity occurrences associated with one occurrence of the related entity. The actual number of occurrences depends on the organisation’s policy and hence, can differ from that of another organisation, although both may model the same information. The cardinality has upper and lower limits indicating a range and is represented in the diagram within brackets near the related entity (see the WorksFor relationship in figure 3.211). Cardinality is a type of constraint and in appendix B.2 we provide more details about the symbols and notations used to represent these types of constraints. Thus in the WorksFor relationship: (1,1) indicates an employee must belong to a department (4,N) indicates a department must have at least 4 employees N indicates a department has many employees 1 indicates an employee may work for only one department d) Other Relationship and Entity Types The original E-R model of Chen did not contain relationship attributes and did not use the concept of a composite entity. We use this concept as in [ROB93], because the relational model requires the use of an entity composed of the primary keys of other entities to connect and represent M:N relationships. Hence, a composite entity (also called a link [RUM91] or regular [CHI94] entity) 11 In practise in a diagram only one of these types is shown depending on availability of information on cardinality limits. Page 36
  • 38. Chapter 3 Database Technology, Relational Model, Conceptual Modelling and Integrity Constraints representing an M:N relationship is represented using a diamond inside the rectangle, indicating that it as an entity as well as a relationship (e.g. ComMem of figure 3.2). In this type of relationship, the primary key of the composite entity is created by using the keys of the entities which it connects. This is usually a binary or 2-ary relationship involving two referenced entities and is a special case of n-ary relationship which connects with n entities. Some entity occurrences cannot exist without an entity occurrence with which they have a relationship. Such entities are called weak entities and are represented by a double rectangle (e.g. Committee in figure 3.2). The relationship formed with this entity type is called a weak relationship and is represented by a double diamond (e.g. Fcom relationship of figure 3.2). In this type of relationship, the primary key of the weak entity is a proper subset of the key of the entity which it depends on and the remaining attributes (called dangling attributes) of the primary key do not contain a key of any other entity. When a relationship exists between occurrences of the same entity set (e.g. a unary relationship) it forms a recursive relationship (e.g. a course may have pre-requisites courses). e) Generalisation / Specialisation / Inheritance Most organisations employ people with a wide range of skills and special qualifications (e.g. a university employs academics, secretaries, porters, research associates, etc.) and it may be necessary to record additional information for certain types of employee (e.g. qualifications of academics). Representing such additional information in the employee table results in the use of null values in this attribute for other employees as this additional information is not applicable for these employees. To overcome this, common characteristics for all employees are chosen to define the employee entity as a generalised entity, and the additional information is put in a separate entity, called a specialised entity, which inherits all the properties of its parent entity (i.e. the generalised entity), creating a parent-child or is-a relationship (also called a generalised hierarchy). The higher level of this relationship is a supertype entity (i.e. generalised entity) and the lower-level is a subtype entity (i.e. specialised entity). A supertype entity set is usually composed of several unique and disjoint (non-overlapping) subtype entity sets. However some supertypes contain overlapping subtypes (e.g. an employee may also be a student and hence we get two subtypes of person in an overlapping relationship). There are constraints applicable for generalised hierarchies and special symbols / notations are used in these cases (see appendixes B.1 figure ‘e’ and B.2 figure ‘b’). In figure 3.2, the entities Office, Department and Faculty form a generalised hierarchy with Office being the Supertype entity and Department and Faculty being the subtype entities. Subtype and supertype entities have a 1:1 relationship although we view it differently, i.e. as a hierarchy. The subtypes described above inherit from a single supertype entity. However, there may be cases where a subtype inherits from multiple supertypes (e.g. an empstudent entity representing employees who are also students may inherit from employee and student entities). This is known as multiple inheritance. 
In such cases the subtype may represent either an intersection or a union. The concept of inheritance was taken from the O-O paradigm and hence it does not occur in the original E-R model, but is included in the EER model. 3.3.2 Object Modelling Technique (OMT)
  • 39. Chapter 3 Database Technology, Relational Model, Conceptual Modelling and Integrity Constraints The Object Modelling Technique (OMT) is an O-O development methodology. It creates a high-level conceptual data model of a proposed system without regard for eventual implementation details. This model is based on objects. The notations of OMT used here are taken from [RUM91] and those used in our work are described in appendix B, where they are compared with their EER equivalents. Hence we do not describe this model in depth here. The diagrams produced by this method are known as object diagrams. They combine O-O concepts (i.e. classes and inheritance) with information modelling concepts (i.e. entities and relationships). Although the terminology used differs from that used in the EER model, both create similar conceptual models, although using different graphical notations. The main notations used in OMT are rectangles with text inside (e.g. for classes and their properties, as opposed to the EER where attributes appear outside the entity). This makes OMT easier to implement than EER in a graphical computing environment. OMT is used for most O-O modelling (e.g. in [COO93, IDS94]), and so it is a widely known technique. 3.4 Semantic Integrity Constraints A real world application is always governed by many rules which define the application domain and are referred to as integrity constraints [DAT83, ELM94]. An important activity when designing a database application is to identify and specify these integrity constraints for that database and if possible to enforce them using the DBMS constraint facilities. The term integrity refers to the accuracy, correctness or validity of a database. The role of the integrity constraint enforcer is to ensure that the data in the database is accurate by guarding it against invalid updates, which may be caused by errors in data entry, mistakes on the part of the operator or the application programmer, system failures, and even due to deliberate falsification by users. This latter case is the concern of the security system which protects the database from unauthorised access (i.e. it implements authorisation constraints). The integrity system uses integrity rules (integrity constraints) to protect the database from invalid updates supplied by authorised users and to maintain the logical consistency of the database. Integrity is sometimes used to cover both semantic and transaction integrity. The latter case deals with concurrency control (i.e. the prevention of inconsistencies caused by concurrent access by multiple users or applications to a database), and recovery techniques which prevent errors due to malfunctioning of system hardware and software. Protection against this type of integrity-violation is dealt with by most commercially available systems and is not an issue of this thesis. Here we shall use the terms integrity and constraints to refer only to semantic integrity constraints. Integrity rules cannot detect all types of errors, for instance when dealing with percentage marks, there is no way that the computer can detect the fact that an input value of 45 for a student mark should really be 54. However, on the other hand, a value of 455 could be detected and corrected. Consistency is another term used for integrity. However, this is normally used in cases where two or more values in the database are required to be in agreement with each other in some way. 
For example, the DeptNo in an Employee record should tally with the DeptNo appearing in some Department record (referential integrity in relational systems), or the Age of a Person must be
  • 40. Chapter 3 Database Technology, Relational Model, Conceptual Modelling and Integrity Constraints equal to the difference in years between today’s date and their date of birth (a property of a derived attribute). In order to check for invalid data, DBMSs use an integrity subsystem to monitor transactions and detect integrity violations. In the event of a violation the DBMS takes appropriate actions such as rejecting the operation, reporting the violation, or assisting in correcting the error. To perform such a task, the integrity subsystem must be provided with a set of rules that define what errors to check for, when to do the checking, and what to do if an error is detected. Most early DBMSs did not have an integrity subsystem (mainly due to unacceptable database system performance when integrity checking was performed in older technological environments) and hence such checking was not implemented in their information systems. These information systems performed integrity checking using procedural language extensions of the database to check for invalid entries during the capture of data via their user interface (e.g. data entry forms). Here too, due to technological limitations and poor database performance, only specific types of constraints (e.g. range check, pattern matching), and a limited number of checks were allowed for an attribute. As these rules were coded in application programs they violated program / data (rule) independence for constraint specification. However, most recent DBMSs attempt to support such specifications using their DDL and hence they achieve program / rule independence. The original SQL standard specifications [ANSI86] were subsequently enhanced so that constraints could be specified using SQL [ANSI89a]. Current commercial DBMSs are seeking to meet these standards by targeting the implementation of the SQL-2 standards [ANSI92] in their latest releases. Systems such as Oracle now conform to these standards, while others such as INGRES and POSTGRES have taken a different path by extending their systems with a rule subsystem, which performs similar tasks but using a procedural style approach where the rules and procedures are retained in data dictionaries. Integrity constraints can be identified for the properties of a data model and for the values of a database application. We examine both to present a detailed description of the types of constraint associated with databases and in particular those used for our work. 3.4.1 Integrity Constraints of a Data Model Some constraints are automatically supported by the data model itself. These constraints are assumed to hold by the definition of that data model (i.e. they are built into the system and not specified by a user). They are called the inherent constraints of the data model. There are also constraints that can be specified and represented in a data model. These are called the implicit constraints of the model and they are specified using DDL statements in a relational schema, or graphical constructs in an E-R model. Table 3.1 gives some examples of implicit and inherent constraints for relational and EER data models. The constraint types used in this table are described in detail in section 3.5. The structure of a data model represents inherent constraints implicitly and is also capable of representing implicit constraints. Hence, constraints represented in these two ways are referred to as structural constraints. Data models differ in the way constraints are handled. 
Hierarchical and network database constraints are handled by being tightly linked to structural concepts (records, sets,
segment definitions), of which the parent-child and owner-member relationships are logical examples. The classical relational model, on the other hand, has two constraints represented structurally by its relations or tables (namely, relations consist of a certain number of simple attributes and have no duplicate rows). Hence only specific types of structural constraint are defined for a particular data model (e.g. parent-child relationships are not defined for the relational model).

EER data model
  Implicit constraints: primary key attributes; attribute structural constraints; relationship structural constraints; superclass/subclass relationship; disjointness/totality constraints on specialisation/generalisation.
  Inherent constraints: every relationship instance of an n-ary relationship type R relates exactly one entity from each entity type participating in R in a specific role; every entity instance in a subclass must also exist in its superclass.

Relational data model
  Implicit constraints: domain constraints; key constraints; relationship structural constraints.
  Inherent constraints: a relation consists of a certain number of simple attributes; an attribute value is atomic; no duplicate tuples are allowed.

Table 3.1: Structural constraints of selected data models

Every data model has a set of concepts and rules (or assertions) that specify the structure and the implicit constraints of a database describing a miniworld. A given implementation of a data model by a particular DBMS will usually support only some of the structural (inherent and implicit) constraints of the data model automatically, and the rest must be defined explicitly. These additional constraints of a data model are called explicit or behavioural constraints. They are defined using either a procedural or a declarative (non-procedural) approach, which is basically not part of the data model per se.

3.4.2 Integrity Constraints of a Database Application

In database applications, integrity constraints are used to ensure the correctness of a database. A change to a database application takes place during an update transaction, and constraints are used at this stage to ensure that the database is in a consistent state before and after that transaction. This type of constraint is called a state (static) constraint, as it applies to a particular state of the database and should hold for every state in which the database is not in transition, i.e. not in the process of being updated. Constraints that apply to the change of a database from one state to another are called transition (dynamic) constraints (e.g. the age of a person can only be increased, meaning that the new value of age is greater than the old value). In general, transition constraints occur less frequently than state constraints and are usually specified explicitly. The discussion above classifies the types of semantic integrity constraints used in data models and database applications. We summarise them in figure 3.3 to highlight the basic classification of integrity constraints. We separate the two approaches using a dotted line as they are independent of each other. However, most constraints are common to both categories as they are implemented using a particular data model for a database application. Data models used for conceptual modelling are more concerned with structural constraints, as opposed to the value constraints of database applications.
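As a concrete illustration of the state/transition distinction just drawn, the following sketch shows, in standard SQL, a state constraint declared on the database and one simple way a transition constraint might be enforced explicitly. The table and column names (Person, Age, PersonNo) are illustrative assumptions; the age-only-increases rule is the transition constraint mentioned above.

    -- State (static) constraint: must hold in every non-transitional state.
    -- Declared once, checked by the DBMS on every update (SQL-92 style).
    ALTER TABLE Person
        ADD CONSTRAINT adult_age CHECK (Age >= 18);

    -- Transition (dynamic) constraint: a person's age may only increase.
    -- Few DBMSs can declare this, so it is usually enforced explicitly,
    -- e.g. by guarding the update so it has no effect if it would lower Age.
    UPDATE Person
    SET    Age = 47          -- 47 stands for the new value being supplied
    WHERE  PersonNo = 'P1'
    AND    47 > Age;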
Figure 3.3: Classification of integrity constraints. (The figure divides integrity constraints into data model constraints, comprising structural constraints (inherent and implicit) and explicit (behavioural) constraints, and database application constraints, comprising state (static) and transition (dynamic) constraints.)

3.5 Constraint Types

We consider constraint types in more detail here so that we can later relate them to data models and database applications. Initially, we describe value constraints (i.e. domain and key constraints), which are applicable to database values (i.e. attributes). Then, we describe structural constraints, namely: attribute structural, relationship structural and superclass/subclass structural constraints. These constraints are often associated with data models and some of them have been mentioned in section 3.4.1. In this section, we look at them with respect to their structural properties and are concerned with identifying differences within a structure, in addition to the relationships (e.g. between entities) formed by them. Finally, constraints that do not fall into either of these categories are described. As most of these constraints are state constraints, we shall refer to the constraint type only when type distinction is necessary. All structural constraints are shown in a conceptual model, as this model is used to describe the structure of a database. Not all value constraints (e.g. check constraints) are shown, as they are not associated with the structure of a database and are described using a DML. However, our work includes presenting them at optional lower levels of abstraction which involves software dependent code. This code is based on current SQL standards and may be replaced using equivalent graphical constructs if necessary12. Here, for each type of SQL statement, we could introduce a suitable graphical representation and hence increase its readability. All value constraints are implicitly or explicitly defined when implementing an application. Most constraints considered here are implicit constraints, as they may be specified using the DDL of a modern DBMS. In such cases the DBMS will monitor all changes to the data in the database to ensure that no constraint has been violated by these changes.

3.5.1 Domain Constraints

Domain constraints are specified by associating every simple attribute type with a value set.

12 This idea is beyond the scope of this thesis.
  • 43. Chapter 3 Database Technology, Relational Model, Conceptual Modelling and Integrity Constraints The value set of a simple attribute is initially defined using its data type (integer, real, char, boolean) and length, and later is further restricted using appropriate constraints such as range (minimum, maximum) and pattern (letters and digits). For example, the value set for the Deptno attribute of the entity Department could be specified as data type character of length 5, and the Salary attribute of the entity Staff as data type decimal of length 8 with 2 decimal places, in the range 3000 to 20000. Nonnull constraints can be seen as a special case of domain constraints, since they too restrict the domain of attributes. These constraints are used to eliminate the possibility of missing, or unknown values of an attribute occurring in the database. A domain constraint is usually used to restrict the value of an attribute, e.g. an employee’s age is ≥ 18 (i.e. a state constraint), however they may also be used to compare values of two states, e.g. an employee’s new salary is ≥ to their current salary (i.e. a transition constraint). 3.5.2 Key Constraints Key constraints specify key attribute(s) that can uniquely identify an instance of an entity. These constraints are also called candidate key or uniqueness constraints. For example, stating Deptno is a key of Department will ensure that no two departments will have the same Deptno. When a set of attributes form a key, then that key is called a composite key, as we are dealing with a composite attribute. When a nonnull constraint is added to a key uniqueness constraint then such keys are referred to as primary keys. An entity may have several candidate keys and in such cases one is called the primary key and the others alternate keys. Primary key attributes are shown in the EER model (see appendix B.2, figure ‘b’). The OMT model uses object identities (oids) to uniquely identify objects and as they are usually system generated they are not shown in this model. However, when modelling relational databases we do not use the concept of oid, instead we have primary keys which are shown in our diagrams (see appendix B.2, figure ‘b’) as they carry important information about a relational database. 3.5.3 Structural Constraints on Attributes Attribute structural constraints specify whether an attribute is single valued or multivalued. Multivalued attributes with a fixed number of possible values are sometimes defined as composite attributes. For example, name can be a composite attribute composed of first name, middle initial and last name. However composite attributes cannot be constructed for multivalued attributes like a student’s course set, where the student can attend several courses. In such a case one would have to use an alternative solution, such as recording all possibilities in one long string or using a separate data type like sets. This type of constraint is not generally supported by most traditional DBMSs. In the relational model we use a separate entity to hold multiple values and these are related to the correct entity through an identical primary key [ELM94]. 3.5.4 Structural Constraints on Relationships Structural constraints on relationships specify limitations on the participation of entities in relationship instances. Two types of relationship constraints occur frequently. They are called Page 42
• 44. Chapter 3 Database Technology, Relational Model, Conceptual Modelling and Integrity Constraints cardinality ratio constraints and participation constraints. The cardinality ratio constraint specifies, using 1 and N (many), the number of relationship instances that an entity can participate in. For example, the constraints every employee works for exactly one department and a department can have many employees working for it together express a cardinality ratio of 1:N, meaning that each department entity can be related to numerous employee entities. A participation constraint specifies whether the existence of an entity depends on its being related to another entity via a certain relationship type. If all the instances of an entity participate in a relationship of this type then the entity has total participation (existence dependency) in the relationship. Otherwise the participation is partial, meaning only some of the instances of that entity participate in a relationship of this type. For example, the constraint every employee works for exactly one department means that an Employee entity has total participation in the relationship WorksFor (see figure 3.2), and the constraint an employee can be the head of a department means that the Employee entity has partial participation in the relationship Head (see figure 3.2) (i.e. not all employees are head of a department).

Referential integrity constraints are used to specify a type of structural relationship constraint. In relational databases, foreign keys are used to define referential integrity constraints. A foreign key is defined on attributes of a relation. This relation is known as the referencing table. The foreign key attribute of the referencing table (e.g. WorksFor of Employee in figure 3.4) always refers to attribute(s) of another relation, where those attribute(s) are the primary or alternate key (e.g. DeptCode of Department in figure 3.4). We refer to this relation as the referenced table. The referenced attribute(s) of the referenced table have a uniqueness property, being the primary or an alternate key of that relation. This means that references from one relation to another are achieved by using foreign keys, which indicate a relationship between two entities. This also establishes an inclusion dependency between the two entities: the values of the attribute of the referencing entity (e.g. Employee.WorksFor) are a subset of the values of the attribute of the referenced entity (e.g. Department.DeptCode). Only recent DBMSs such as Oracle version 7 support the specification of foreign keys using DDL statements.

[Figure 3.4: A foreign key example — sample Employee (5 records) and Department (3 records) tables containing department codes such as COMMA, MATHS and ELSYM. WorksFor is a foreign key attribute of the referencing entity Employee; it refers to the referenced entity Department, whose attribute DeptCode is its primary key.]

3.5.5 Structural Constraints on Specialisation/Generalisation

Disjointness and completeness constraints are defined on specialisation/generalisation structures. The disjointness constraint specifies that the subclasses (superclass) of the specialisation (generalisation) must be disjoint. This means that an entity can be a member of at most one of the subclasses of the specialisation. When an entity is a member of more than one of the subclasses, then we get an overlapping situation.
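A hedged sketch of how disjointness is commonly approximated in a relational schema, anticipating the job-type example discussed later in this section (the JobType column name is illustrative): recording subclass membership in a single-valued discriminator column means each entity can belong to at most one subclass.

    CREATE TABLE Employee (
        EmpNo   CHAR(5)  PRIMARY KEY,
        JobType CHAR(10)     -- single-valued discriminator: an employee has at most one job type,
                             -- so the subclasses identified by JobType are disjoint; an overlapping
                             -- specialisation would instead need one membership flag per subclass
    );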
The completeness constraint on specialisation (generalisation) defines total/partial specialisation (generalisation). A total specialisation specifies Page 43
  • 45. Chapter 3 Database Technology, Relational Model, Conceptual Modelling and Integrity Constraints the constraint that every entity in the superclass must be a member of some subclass in the specialisation. For example: Lecturer, Secretary and Technician are some of the job types of an Employee. They describe disjoint subclasses of the entity Employee, having a partial specialisation as there could be an employee with another job type. Generalisation is a feature supported by many object-oriented (O-O) systems, but it has yet to be adopted by commercial relational DBMSs. However, with O-O features being incorporated into the relational model (e.g. SQL-3) we can expect to see this feature in many RDBMSs in the near future. 3.5.6 General Semantic Constraints There are general semantic integrity constraints that do not fall into one of the above categories (e.g. the constraint the salary of an employee must not be greater than the salary of the head of the department that the employee works for, or the salary attribute of an employee can only be increased). These constraints can be either state or transition constraints, and are generally specified as explicit constraints. The transition constraint mentioned above is a single-step transition constraint. Here, a constraint is evaluated on a pair of pre-transaction and post-transaction states of a database, e.g. in the employee’s salary example, the current salary will be the pre-transaction state and the new salary will become the post-transaction state. However, there are transition constraints that are not limited to a single-step, e.g. temporal constraints specified using the temporal qualifiers always and sometimes [CHO92]. Other forms of constraint exist, including those defined for incomplete data (e.g. employees having similar jobs and experience must have almost equal salary) [RAJ88]. These can also be considered as a type of semantic constraint, mainly as they are not implicitly supported by the most frequently used (i.e. relational) data model. They may need a special constraint specification language to support them. 3.6 Specifying Explicit Constraints Explicit constraints are generally defined using either a procedural or a declarative approach. 3.6.1 Procedural Approach In the procedural approach (or the coded constraints technique), constraint checking statements are coded into appropriate update transactions of the application by the programmer. For example to enforce the constraint, the salary of an employee must not be greater than the salary of the head of the department that the employee works for, one has to check every update transaction that may violate this constraint. This includes any transaction that modifies the salary of an employee, any transaction that links an employee to a department, and any transaction that assigns a new employee or manager to a department. Thus in all such transactions appropriate code has to be included that will check for possible violations of this constraint. When a violation occurs the transaction has to be aborted, and this is also done by including appropriate code in the application. The above technique for handling explicit constraints is used by many existing DBMSs. This Page 44
  • 46. Chapter 3 Database Technology, Relational Model, Conceptual Modelling and Integrity Constraints technique is general because the checks are typically programmed in a general-purpose programming language. It also allows the programmer to code in an effective way. However, it is not very flexible and places an extra burden on the programmer, who must know where the constraints may be violated and include checks at each and every place that a violation may occur. Hence, a misunderstanding, omission or error by the programmer may leave the database able to get into an inconsistent state. Another drawback of specifying constraints procedurally is that they can change with time as the rules of a real world situation change. If a constraint changes, it is the responsibility of the DBA to instruct appropriate programmers to recode all the transactions that are affected by the change. This again opens the possibility of overlooking some transactions, and hence the chance that errors in constraint representation will render the database inconsistent. Another source of error is that it is possible to include conflicting constraints in procedural specifications that will cause incorrect abortion of correct transactions. This error may occur in other types of constraint specification, e.g. whenever we attempt to define multiple constraints on the same entity. However, such errors can be detected more easily in a declarative approach, as it is possible to evaluate constraints defined in that form to identify conflicts between them. The procedural approach is usually adopted only when the DBMS cannot declaratively support the same constraint. In all early DBMSs the procedural code was part of the application code and was not retained in the database’s system catalog. However, some recent DBMSs (e.g. INGRES) provide a rule subsystem where all defined procedures are stored in system catalogs and are fired using rules which detect update transactions associated with a particular constraint. This approach is a step towards the declarative approach as it overcomes some of the deficiencies of the procedural approach described above (e.g. the maintenance of constraints), although the code is still of procedural type which for example, prevents the detection of conflicting constraints. Some DBMSs (e.g. INGRES) do not support the specification of foreign key constraints through their DDL. Hence, for these systems such constraints have to be explicitly defined using a procedural approach. In section ‘a’ of table 3.2, we show how a procedure is used in INGRES to implement a foreign key constraint. Here the existence of a value in the referenced table is checked and the statement is rejected if it does not exist. For comparison purposes, we include the definition of the same constraint using an SQL-3 DDL specification (implicit) in section ‘b’ of table 3.2, and the declarative approach (explicit) in section ‘c’ of table 3.2. When comparing these three approaches, it is clear that the procedural one is most unfriendly and more error-prone. The DDL approach is the best and most efficient approach as it is specified and managed implicitly by the DBMS. Page 45
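Returning to the INGRES rule subsystem mentioned above, the sketch below indicates roughly how the stored procedure of table 3.2(a) on the next page might be fired by a rule whenever the relevant columns of Employee change. The exact CREATE RULE clause syntax differs between INGRES releases, so this is an indicative approximation rather than a verified statement:

    CREATE RULE Employee_FK_Rule
        AFTER INSERT, UPDATE(WorksFor) OF Employee
        EXECUTE PROCEDURE Employee_FK_Dept (WorksFor = new.WorksFor);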
• 47. Chapter 3 Database Technology, Relational Model, Conceptual Modelling and Integrity Constraints

(a) Procedural Approach (Explicit):

    CREATE PROCEDURE Employee_FK_Dept (WorksFor CHAR(5)) AS
    DECLARE
        msg = VARCHAR(70) NOT NULL;
        check_value = INTEGER;
    BEGIN
        IF WorksFor IS NOT NULL THEN
            SELECT COUNT(*) INTO :check_value
                FROM Department
                WHERE DeptCode = :WorksFor;
            IF check_value = 0 THEN
                msg = ‘Error 1: Invalid Department Code: ‘ + :WorksFor;
                RAISE ERROR 1 :msg;
                RETURN
            ENDIF
        ENDIF
    END;

(b) DDL Approach (Implicit):

    CONSTRAINT Employee_FK_Dept
        FOREIGN KEY (WorksFor) REFERENCES Department (DeptCode);

    Note: This constraint is defined in the Employee table.

(c) Declarative Approach (Explicit):

    CREATE ASSERTION Employee_FK_Dept
        CHECK (NOT EXISTS (SELECT * FROM Employee
                           WHERE WorksFor NOT IN (SELECT DeptCode FROM Department)));

Note: The schema on which these constraints are defined is in figure 6.2.

Table 3.2: Different Approaches to defining a Constraint

3.6.2 Declarative Approach

A more formal technique for representing explicit constraints is to use a constraint specification language, usually based on some variation of relational calculus. This is used to specify or declare all the explicit constraints. In this declarative approach there is a clean separation between the constraint base, in which the constraints are stored in a suitably encoded form, and the integrity control subsystem of the DBMS, which accesses the constraints to apply them to transactions. When using this technique, constraints are often called integrity assertions, or simply assertions, and the specification language is called an assertion specification language. The term assertion is used in place of explicit constraints, and the assertions are specified as declarative statements. These constraints are not attached to a specific table as in the case of the implicit constraint types introduced in section 3.5. This approach is supported by SQL-3. For example, the constraint the salary of an employee must not be greater than that of the head of the department that the employee works for (cf. section 3.6.1) can be specified as:

    CREATE ASSERTION Salary_Constraint
        CHECK (NOT EXISTS (SELECT * FROM Employee E, Employee H, Department D
                           WHERE E.WorksFor = D.DeptCode
                             AND D.Head = H.EmpNo
                             AND E.Salary > H.Salary));

Assertions can also be used to define implicit constraints, like examination mark is between 0 and 100, or referential integrity constraints, such as the one of table 3.2 part ‘b’ (expressed as an assertion in part ‘c’). However, it is easier and more efficient (i.e. consumes less computer resources) to monitor and enforce implicit constraints using the DDL approach, as such constraints are attached to an entity and checked only when an update transaction is applied to that entity, as opposed to being checked whenever an update transaction is performed on the database in general. Page 46
  • 48. Chapter 3 Database Technology, Relational Model, Conceptual Modelling and Integrity Constraints In many cases it is convenient to specify the type of action to be taken when a constraint is violated or satisfied, rather than just having the options of aborting or silently performing the transaction. In such cases, it is useful to include the option of informing a responsible person regarding the need to take action or notifying them of the occurrence of that transaction (e.g. in referential constraints, it is sometimes necessary to perform an action like update or delete on a table to amend its contents, instead of aborting the transaction). This facility is supported either by an optional trigger option on an existing DDL statement or by defining triggers [DAT93]. Triggers can combine the declarative and procedural approaches, as the action part can be procedural, while the condition part is always declarative (INGRES rules are a form of trigger). A trigger can be used to activate a chain of associated updates, that will ensure database integrity (e.g. update total quantity when new stock arrives or when stock is dispatched). An alerter, which is a variant of the trigger mechanism, is used to notify users of important events in the database. For example, we could send a message to the head calling to his attention a purchase transaction for £1,000 or more made from department funds. In this section we have introduced concepts from INGRES which also appear in other DBMSs, namely triggers and alerters. These constructs provide further information about database contents, but are beyond the scope of this project. They are related to constraints, so could be utilised in a similar fashion. 3.7 SQL Approach to Implementing Constraints In SQL-3, a constraint is either a domain constraint, a table constraint or a general constraint [ISO94]. It is described by a constraint descriptor, which is either a domain constraint descriptor (cf. sections 3.7.1 and A.11), a table descriptor (cf. sections 3.7.2 and A.4) or a general descriptor (cf. sections 3.7.3 and A.12). Every constraint descriptor includes: the name of the constraint, an indication of whether or not the constraint is deferrable, and an indication of whether or not the initial constraint mode is deferred or immediate (see section A.3). Constraint descriptors are optional in that they can be assigned with system generated names (except for the general constraint case, where a name must be given). A constraint has an action which is either deferrable or non- deferrable. This can be set using the constraint mode option (see section A.13). Usually, most constraints are immediate as the default constraint mode is immediate, and in these cases they are checked at the end of an SQL transaction. To deal with deferred constraints, all constraints are effectively checked at the end of an SQL session or when an SQL commit statement is executed. Whenever a constraint is detected as being violated, an exception condition is raised and the transaction concerned is terminated by an implicit SQL rollback statement to ensure the consistency of the database system. 3.7.1 SQL Domain Constraints In SQL, domain constraints are specified by means of the CREATE DOMAIN statement (see section A.11) and can be added to or dropped from an existing domain by means of the ALTER DOMAIN statement [DAT93]. These constraints are associated with a specific domain and apply to every column that is defined using that domain. 
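As a hedged sketch of this (using the Marks domain defined in the example below, with the Exam table and its columns invented purely for illustration), a constraint can later be attached to or removed from a domain, and every column declared over that domain is then checked automatically:

    ALTER DOMAIN Marks ADD CONSTRAINT icmarks_low CHECK (VALUE >= 0);   -- attach a further constraint to the domain
    ALTER DOMAIN Marks DROP CONSTRAINT icmarks_low;                     -- remove it again

    CREATE TABLE Exam (
        StudentNo CHAR(5),
        Result    Marks      -- every value stored here must satisfy the constraints of the Marks domain
    );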
They allow users to define new data types, which in turn are used to define the structure of a table. For example, a domain Marks may be defined as shown in figure 3.5. This means SQL will recognise the data type Marks which permits integers Page 47
• 49. Chapter 3 Database Technology, Relational Model, Conceptual Modelling and Integrity Constraints between 0 and 100, thus giving a natural look to that data type when it is used.

    CREATE DOMAIN Marks INTEGER
        CONSTRAINT icmarks CHECK (VALUE BETWEEN 0 AND 100);

Figure 3.5: An SQL domain constraint

3.7.2 SQL Table Constraints

In SQL, table constraints (i.e. constraints on base tables) are initially specified by means of the CREATE TABLE statement (see section A.4) and can be added to or dropped from an existing base table by means of the ALTER TABLE statement [DAT93]. These constraints are associated with a specific table, as they cannot exist without a base table. However, this does not mean that such constraints cannot span multiple tables, as in the case of foreign keys. Constraints defined on specific columns of a base table are a type of table constraint, but are usually referred to as column constraints. Three types of table constraints are defined here, namely: candidate key constraints, foreign key constraints and check constraints. Their definitions may appear next to their respective column definitions or at the end (i.e. after all column definitions have been defined). We now describe an example that uses all three types of constraints, using figure 3.6. The PRIMARY KEY clause (only one per table; see section A.6) and the UNIQUE clause (the values of the named column(s) must be unique across rows; see section A.5) are used to define candidate keys. A FOREIGN KEY definition (see section A.8) defines a referential integrity constraint and may also include an action part (which is not used in figure 3.6 for simplicity). Check constraints are defined using a CHECK clause (see section A.9) and may contain any logical expression. The check constraint CHECK (Name IS NOT NULL) is usually defined using the shorthand form NOT NULL next to the column Name, as shown in figure 3.6. We have also included a check constraint spanning multiple tables in figure 3.6. Such table constraints can be included only after the tables have been created, and hence in practice they are added using ALTER TABLE statements.

    CREATE TABLE Employee(
        EmpNo    CHAR(5)  PRIMARY KEY,
        Name     CHAR(20) NOT NULL,
        Address  CHAR(80),
        Age      INTEGER  CHECK (Age BETWEEN 21 AND 65),
        WorksFor CHAR(5)  REFERENCES Department (DeptCode),
        Salary   DECIMAL,
        CHECK (Employee.Salary <= (SELECT H.Salary
                                   FROM Department D, Employee H
                                   WHERE D.Head = H.EmpNo
                                     AND D.DeptCode = Employee.WorksFor)),
        UNIQUE (Name, Address) );

Figure 3.6: SQL table constraints

3.7.3 SQL General Constraints

In SQL, general constraints are specified by means of the CREATE ASSERTION statement (see section A.12) and are dropped by means of the DROP ASSERTION statement. These constraints must be associated with a user defined constraint name as they are not attached to a specific table Page 48
  • 50. Chapter 3 Database Technology, Relational Model, Conceptual Modelling and Integrity Constraints and are used to constrain arbitrary combinations of columns in arbitrary combinations of base tables in a database. The constraint defined in section ‘c’ of table 3.2 belongs to this type. Domain and table constraints are implicit constraints of a database, while assertions used to define general constraints are explicit constraints (using a declarative approach). SQL data types have their own constraint checking, which rejects for example string values being entered into a numeric column definition. This type of constraint checking can be considered as inherent as it is supported by the SQL language itself. All integrity constraints discussed above are deterministic and independent of the application and system environments. Hence, no parameters, host variables and built in system functions (e.g. CURRENT_DATE) are allowed in these definitions as they make the database inconsistent. For example CURRENT_DATE will give different values on different days. This means the validity of a data entry may not hold during its lifetime despite no changes being made to its original entry. This makes the task of maintaining the consistency of the database more difficult. Also it makes it difficult to distinguish these errors from the traditional errors discussed in section 3.5. Hence attributes such as age, which involves use of CURRENT_DATE should be derived attributes whose value is computed during retrieval. 3.8 SQL Standards To conclude this chapter, we compare different SQL standards to chronicle when respective constraint specification statements were introduced to the language. A standard for the database language SQL was first introduced in 1986 [ANSI86], and this is now called SQL/86. The SQL/86 standard specified two levels, namely: level 1 and level 2 (referred to as level 1 and 2 respectively in table 3.3, column 2); where level 2 defined the complete SQL database language, while level 1 was a subset of level 2. In 1989, the original standard was extended to include the integrity enhancement feature [ANSI89a]. This standard is called SQL/89 and is referred to as level Int. in table 3.3, column 2. The current standard, SQL/92 [ANSI92], is also referred to as SQL-2. This standard defines three levels, namely: Entry, Intermediate and Full SQL (referred to as level E, I and F, respectively, in table 3.3, column 4); where Full SQL is the complete SQL database language, Intermediate SQL is a proper subset of Full SQL, and Entry SQL is a proper subset of Intermediate SQL. The purpose of introducing 3 levels was to enable database vendors who had incorporated the SQL/89 extensions into their original SQL/86 implementations to claim SQL/92 Entry level status. As Intermediate extensions were more straightforward additions than the rest, they were separated from the Full SQL/92 extensions. However, SQL/92 is also constantly being reviewed [ISO94], mainly to incorporate O-O features into the language, and this latest release is called SQL-3 (referred to as level O-O in table 3.3, column 5). Until recently, relational DBMSs supported only the SQL/86 standard and even now most support only up to the Entry level of SQL/92. Hence ISs developed using these relational DBMSs are not capable of supporting modern features introduced in the newest standards. 
Our work focuses on providing these newer features for existing relational legacy database systems so that features such as primary / foreign key specification can be supported for relational databases conforming to SQL/86 standards; and sub / super type features can be specified for all relational products. Page 49
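For instance, until sub/super types become available in relational products, a specialisation such as Lecturer as a subclass of Employee can be represented structurally in any relational system by a subclass table that shares the superclass key; where the DBMS supports at least the SQL/89 integrity enhancement feature, the sub/super type link can also be recorded declaratively. A hedged sketch (the Specialty column is invented for illustration):

    CREATE TABLE Lecturer (
        EmpNo     CHAR(5) PRIMARY KEY,                       -- same key as the Employee superclass
        Specialty CHAR(20),
        FOREIGN KEY (EmpNo) REFERENCES Employee (EmpNo)      -- every Lecturer is an Employee
    );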
• 51. Chapter 3 Database Technology, Relational Model, Conceptual Modelling and Integrity Constraints

    SQL Version         SQL/86     SQL/89    SQL/92         SQL-3
    Level               1     2    Int.      E    I    F    O-O
    Data Type           x     +    =         =    +    +    +
    Identifier length   x     +    =         =    +    =    =
    Not Null            x     +    =         =    =    =    =
    Unique Key          -     x    =         =    +    =    =
    Primary Key         -     -    x         =    +    =    =
    Foreign Key         -     -    x         =    =    +    +
    Check Constraint    -     -    x         =    =    +    =
    Default Value       -     -    x         =    =    =    =
    Domain              -     -    -         -    x    +    =
    Assertion           -     -    -         -    -    x    +
    Trigger             -     -    -         -    -    -    x
    Sub/SuperType       -     -    -         -    -    -    x

Table 3.3: SQL Standards and introduction of constraints

The integrity features discussed in previous sections were thus gradually introduced into the SQL language standards as we can see from table 3.3. In this table ‘x’ indicates when the feature was first introduced. The ‘+’ sign indicates that some enhancements were made to a previous version, the ‘=’ sign indicates that the same feature was used in a later version, and the ‘-’ sign indicates that the feature was not provided in that version. For example, the Primary Key constraint for a table was first introduced in SQL/89 (cf. appendix A.6) and later enhanced (i.e. in SQL/92 Intermediate) by merging it with the Unique constraint (cf. appendix A.5) to introduce a candidate key constraint (cf. appendix A.7). Thus, SQL standards for Primary Key are shown in table 3.3 as: ‘-’ for SQL/86; ‘x’ for SQL/89; ‘=’ for SQL/92 Entry level; ‘+’ for SQL/92 Intermediate level; and ‘=’ for all subsequent versions.

Simple attributes are defined using their data type and length (cf. section 3.5.1). These specifications are used as inherent domain constraints. The first two rows of table 3.3 show that they were among the first constraint features introduced via SQL standards (i.e. SQL/86). The Not Null constraint, which is a special domain constraint, was also introduced in the initial SQL standard. The key constraints (cf. section 3.5.2), which specify unique and primary keys, were introduced in a subsequent standard (i.e. SQL/89) as shown in table 3.3. The referential constraint which specifies a type of a structural relationship constraint uses foreign keys, and this constraint was also introduced in SQL/89, along with default values for attributes and check constraints. Later, more complex forms of constraints were introduced in SQL/92. These include defining new domains for an attribute (e.g. child as a domain having an inherent constraint of age being less than 18 years), and specifying domain constraints spanning multiple tables (i.e. assertions). Constraints which are activated when an event occurs (i.e. triggers), and structural constraints on specialisation / generalisation (i.e. sub/super type, cf. section 3.5.5) are among other enhancements proposed in the draft SQL-3 standards. A detailed description of the syntax of statements used to provide the features identified in table 3.3 is given in appendix A. For further details we refer the reader to the standards themselves [ANSI86, ANSI89a, ANSI92, ISO94]. Page 50
  • 52. CHAPTER 4 Migration of Legacy Information Systems This chapter addresses in detail the migration process and issues concerned with legacy information systems (ISs). We identify the characteristics and components of a typical legacy IS and present the expected features and functions of a migration target IS. An overview of some of the strategies and methods used for migration of a legacy IS to a target IS is presented along with a detailed study of migration support tools. Finally, we introduce our tool to assist the enhancement and migration of relational legacy databases. 4.1 Introduction Rapid technological advances in recent years have changed the standard computer hardware technology from centralised mainframes to network file-server and client/server architectures, and software data management technology from files and primitive database systems to powerful extended relational distributed DBMSs, 4GLs, CASE tools, non-procedural application generators and end-user computing facilities. Information systems (ISs) built using old-fashioned technology are inflexible, as that technology limits them from being changed and evolving to meet changing business needs, which adjust rapidly to the potential of technological advances. It also means they are expensive to maintain, as older systems suffer from failures, inappropriate functionality, lack of documentation, poor performance and problems in training support staff. Such older systems are called legacy ISs [BRO93, PHI94, BEN95, BRO95], and they need to be evolved and migrated to a modern technological environment so that their existence remains beneficial to their user community and organisation, and their full potential to the organisation can be realised. 4.2 Legacy Information Systems Technological advances in hardware and software have improved the performance and maintainability of new information systems. Equipment and techniques used by older ISs are obsolete and prone to suffer from frequent breakdowns along with ever increasing maintenance costs. In addition, older ISs may have other deficiencies depending on the type of system. Common characteristics of these systems are [BRO93, PHI94, BEN95, BRO95] that they are: • scientifically old, as they were built using older technology, • written in a 3GL, • use an old fashioned database service (e.g. a hierarchical DBMS), • have a dated style of user interface (e.g. command driven). Furthermore, in very large organisations additional negative characteristics may be present making the intended migration process even more complex and difficult. These include [BRO93, AIK94, BEN95, BRO95]: being very large (e.g. having millions of lines of code); being mission critical (e.g. an on-line monitoring system like customer billing); and being operational all the time (i.e. 24 hours a day and 7 days a week). These characteristics are not present in most smaller scale legacy information systems, and hence the latter are less complex to maintain. Our work may not
  • 53. Chapter 4 Migration of legacy ISs assist all types of legacy IS as it deals with one particular component of a legacy IS only (i.e. the database service). Information systems consist of three major functional components, namely: interfaces, applications and a database service. In the context of a legacy IS these components are, accordingly, referred to as [BRO93, BRO95] the: • legacy interface, • legacy application, • legacy database service. These functional components are sometimes inter-related depending on how they were designed and implemented in the IS’s development. They may exist independently of each other, having no functional dependencies (i.e. all three components are decomposable - see section ‘a’ of figure 4.1); they may be semi-decomposable (e.g. the interface may be separate from the rest of the system - see section ‘b’ of figure 4.1); or they may be totally non-decomposable (i.e. the functional components are not discrete but are intertwined and used as a single unit - section ‘c’ of figure 4.1). This variety makes the legacy IS environment complex to deal with. Due to the potential complexity of entire legacy ISs, we have focussed on one particular functional component only, namely: the legacy database service. In order to restrict our attention to the database service component, we have to treat the other components, namely the interface and application, as black boxes. This can be done successfully when a legacy IS has decomposable modules as in section ‘a’ of figure 4.1. However, when the legacy database service is not fully decomposable from both the legacy interface and the legacy application, treating them as black boxes may result in loss of information since relevant database code may also appear in other components. In such cases, attempts must be made by the designer to decompose or restructure the legacy code. The designer needs to investigate the legacy code of the interface and application modules to detect any database service code, then move it to the database service module (e.g. legacy code used to specify and enforce integrity constraints in the interface or application components is identified and transferred to this module). This will assist in the conversion of legacy ISs of the types shown in sections ‘b’ and ‘c’ of figure 4.1 to conform to the structure of the IS type in section ‘a’ of figure 4.1. The identification and transfer of any legacy database code left in the legacy application or interface modules can be done at any stage (e.g. even after the initial migration) as the migration process can be repeated iteratively. Also, the existence of legacy database code in the application does not affect the operation of the IS as we are not going to change any existing functionalities during the migration process. Hence, treating a legacy interface or a legacy application having legacy database code as a black box does not harm migration. Page 52
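As a hedged illustration of transferring such code (reusing the Employee/Department example of chapter 3; the constraint name is invented), a referential check that a legacy application performs procedurally before each insert can, once identified, be recorded declaratively in the database service of a modern target DBMS:

    -- Instead of application code that checks WorksFor against Department before every insert,
    -- the same rule is attached to the schema once and enforced by the DBMS for all transactions:
    ALTER TABLE Employee
        ADD CONSTRAINT Employee_FK_Dept
            FOREIGN KEY (WorksFor) REFERENCES Department (DeptCode);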
• 54. Chapter 4 Migration of legacy ISs

[Figure 4.1: Functional Components of a Legacy IS — (a) decomposable: separate interface (I), application (A) and database service (D) components; (b) semi-decomposable: a separate interface (I) with a combined application/database service (A/D); (c) non-decomposable: a single intertwined unit (I/A/D).]

4.2.1 Legacy Interfaces

Early information systems were developed for system users who were computer literate. These systems did not have system or user level interfaces because they were not regarded as essential since the system users did these tasks themselves. However, when the business community and others wanted to use these systems, the need for user interfaces was identified and they have been incorporated into more recent ISs. The introduction of DBMSs in the late 1960’s facilitated easy access to computer held data. However, in the early DBMSs, the end user had no direct access to their database and their interactions with the database were done through verbal communication with a skilled database programmer [ELM94]. All user requests were made via the programmer, who then coded the task as a batch program using a procedural language. Since the introduction of query languages such as SQL [CHA74, CHA76], QBE [ZLO77] and QUEL [HEL75], provision of interfaces for database access has rapidly improved. These interfaces are provided not only to encourage laymen to use the system but also to hide technical details from users. A presentation layer consisting of forms [ZLO77] was the initial approach to supporting interaction with the end user. Modern user interfaces rely on multiple screen windows and iconic (pictorial) representations of entities manipulated by pull-down or pop-up menus and pointing devices such as cursor mice [SHN86, NYE93, HEL94, QUE93]. The current trend is towards using separate interfaces for all decomposable application modules of an IS. Some Graphical User Interface (GUI) development tools (e.g. Vision for graphics and user interfaces [MEY94]) allow the construction of advanced GUIs supporting portability to various toolkits. This is an important step towards building the next generation of ISs. Changes in the user interface and operating environment result in the need for user training, an additional factor in system evolution costs. As defined by Brodie [BRO93, BRO95], we shall use the term legacy interfaces in the context of all ISs whose applications have no interfaces or use old fashioned user / system interfaces. In figures 4.1a and 4.1b, these interfaces are distinct and separable from the rest of the system as they are decomposable modules. However, interfaces can be non-decomposable as shown in figure 4.1c. Migration issues concerning user interfaces have been addressed in [BRO93, BRO95, MER95], and as mentioned in section 4.2, our work does not address problems associated with such interface migration.

4.2.2 Legacy Applications Page 53
  • 55. Chapter 4 Migration of legacy ISs Originally, ISs were written using 3GLs, usually the COBOL or PL/1 programming languages. These languages had many software engineering deficiencies due to the state of the technology at the time of their development. Techniques such as structured programming, modularity, flexibility, reusability, portability, extensibility and tailorability [YOU79, SOM89, BOO94, ELM94] were not readily available until subsequent advances in software engineering occurred. The lack of these has made 3GL based ISs appear to be inflexible and, hence, difficult and expensive to maintain and evolve to meet changing business needs. The unstructured and non-modular nature of 3GLs resulted in the use of non-decomposable application modules13 in the development of early ISs. However, with the introduction of software engineering techniques such as structured modular programming these 3GLs were enhanced, and new languages, such as Pascal [WIR71], Simula [BIR73], and more recently C++ [STR91] and Eiffel [MEY92], gradually emerged to support these changing software engineering requirements. The emergence of query languages in the 1970’s as standard interfaces to databases saw the development and use of embedded query languages in host programming languages for large software application program development. Embedded QUEL for INGRES [RTI90a] and Embedded SQL for many relational DBMSs [ANSI89b] are examples of this gendre. The emergence of 4GLs is an evolution which allows users to give a high-level specification of an application expressed entirely in 4GL code. A tool then automatically generates the corresponding application code from the 4GL code. For example, in the INGRES Application-by-Forms interface [RTI90b], the application designer develops a system by using forms displayed on the screen, instead of writing a program. Similar software products are offered by other vendors, such as Oracle [KRO93]. Information systems developed recently have partially or totally adopted modern software engineering practices. As a result, decomposable modules exist in some recent ISs, i.e. their architecture is as in section ‘a’ of figure 4.1. Applications which do not use the concept of modularity are non-decomposable (e.g. section ‘c’ of figure 4.1), while those partially using it are semi-decomposable (section ‘b’ of figure 4.1). Semi- and non- decomposable applications are referred to as legacy applications and need to be converted into fully-decomposable systems to simplify maintenance and make it easier for them to evolve and support future business needs. Some aspects of legacy application migration need tools to analyse code. These are discussed in [BIG94, NIN94, BRA95, WON95]. They are beyond the scope of this thesis, except insofar as we are interested in any legacy code that is relevant to the provisions of a modern database service (e.g. integrity constraints). 4.2.3 Legacy Database Service Originally, many ISs were developed on centralised mainframes using COBOL and PL/1 based file systems [FRY76, HAN85]. These ISs had no DBMS and their data was managed by the system using separate files and programs for each file handling task [HAN85]. Later ISs were based 13 often containing calculated or parameter-driven GOTO statements preventing a reasonable analysis of their structure. Page 54
  • 56. Chapter 4 Migration of legacy ISs on using database technology with limited DBMS capabilities. These systems typically used hierarchical or network DBMSs for their data management [ELM94, DAT95], such as IMS [MCG77] and IDMS [DAT86, ELM94]. The introduction and rapid acceptance of the relational model for DBMSs in recent years has resulted in most applications now being developed with original relational DBMSs (e.g. System R [AST76], DB2 [HAD84], SQL/DS [DAT88], INGRES [STO76, DAT87]). The steady evolution of the relational data model has resulted in the emergence of extended relational DBMSs (e.g. POSTGRES [STO91]) and newer versions of existing products (e.g. Oracle [ROL92], INGRES [DAT87] and SYBASE [DAT90]) which have been used for most recent database applications. This relational data model has been widely accepted as the dominant current generation standard for supporting ISs. The rapidly expanding horizon of applications means that DBMSs are now expected to cater for diverse data handling needs such as management of image, spatial, statistical or temporal databases [ELM94, DAT95] and it is in its support of these that they are often weak. This highlights the different range of functionality that is supported by various DBMSs. Thus applications using older database services support modern database functionalities by means of application modules. This is a typical characteristic of a legacy IS. Hence, the structure of such ISs is more complex and is poorly understood as it is not adequately engineered in accordance with current technology. The database services offered by most hierarchical, network and original relational DBMSs are now considered to be primitive, as they fail to support many functions (e.g. backup, recovery, transaction support, increased data independence, security, performance improvements and views [DAT77, DAT81, DAT86, DAT90, ELM94]) found in modern DBMSs. These functions facilitate the system maintenance of databases developed using modern DBMSs. Hence, the database services provided by early DBMSs and file based systems are now referred to as legacy database services, since they do not fulfil many current requirements and expectations of such services. The specifications of a database service are described by a database schema which is held in the database using data dictionaries. Analysis of the contents of these data dictionaries will provide information that is useful in constructing a conceptual model for a legacy system. Our approach focuses on using the data dictionaries of a relational legacy database to extract as much information as possible about the specifications of that database. 4.2.4 Other Characteristics The programming constructs of 3GL programs are less powerful than the data manipulation features offered by 4GLs. As 4GL code uses the higher level DML code of modern DBMSs, it uses less code (e.g. about 20% less) than its predecessors to accomplish even more powerful applications. A typical program of a 3GL based information system is large and may consist of over a hundred thousand lines of 3GL code. This means that a 20% reduction is a considerable saving in quantity of code to be maintained [BRO93, BRO95]. Code translation tools [SHA93, SHE94] are being built to automate as far as possible the conversion between 3GL and 4GL. These translations sometimes optimise the translated code. Some of these techniques were mentioned in section 1.1.1. The long lifetime of some ISs also leads to deficiencies in documentation. 
This may be due to non-existent, out of date or lost documentary materials. The extent of this deficiency was only Page 55
  • 57. Chapter 4 Migration of legacy ISs realised recently when people tried to transform ISs. To address this problem in the future, CASE tools are being developed to automatically produce suitable documentation for current ISs developed using them [COMP90]. However, this solution does not apply to legacy ISs as they were not built using such tools and it is impossible to use these tools on the legacy systems. Thus we must solve this problem in another way. Sometimes, certain critical functions of an IS are written for high performance, often using a specific, machine dependent set of instructions on some obsolete computer. This results in the use of mysterious and complex machine code constructs which need to be deciphered to enable the code to be ported to other computer systems. Such code is usually not convertible using generalised translation tools. In general, the performance of legacy ISs is poor as most of their functions are not optimised. This is inevitable, due to the state of the technology at the time of their original development. Thus problems arise when we try to translate 3GL code into 4GL equivalent code in a straightforward manner. Solving the problems identified above is the overall concern when assisting the migration and evolution of legacy ISs. However, our aim is to address only some of the problems concerning legacy ISs, as the complete task is beyond the scope of our project. 4.2.5 Legacy Information System Architecture Having considered the characteristics of the components of legacy ISs, we can conclude that a typical IS consists of many application modules, which may or may not use an interface for user / system interactions, and may use a legacy database service to manage legacy data. This database service may use a DBMS to manage its database. Hence, in general, the architecture of most legacy ISs is not strictly decomposable, semi- decomposable or non-decomposable, as they may have evolved several times during their lifetime. As a result, parts of the system may belong to any of the three categories shown in figure 4.2. This means that the general architecture of a legacy IS is a hybrid one, as defined in [BRO93, BRO95, KAR95]. Figure 4.2 suggests that some interfaces and application modules are inseparable from the legacy database service while others are modular and independent of each other. This legacy IS architecture highlights the database service component, as our interactions are with this layer to extract the legacy database schema and any other database services required. 4.3 Target Information System A legacy IS can be migrated to a target IS with an associated computing environment. This target environment is intended to take maximum advantage of the benefits of rightsized computers, client/server network architecture, and modern software including relational DBMSs, 4GLs and CASE tools. In this section we present the hardware and software environments needed for the target ISs. 4.3.1 Hardware Environment The target environment must be equipped with modern technology supporting current Page 56
  • 58. Chapter 4 Migration of legacy ISs business needs which should be flexible enough to evolve and fulfil future requirements. The fundamental goal of a legacy IS migration is that the target IS must not itself become a legacy IS in the near future. Thus, the target hardware environment needs to be flexibly networked (e.g. client- server architecture) to support a distributed multi-user community. This type of environment includes a desk top computer for each target IS user with an appropriate server machine(s) controlling and resourcing the network provision. A PC (e.g. IBM PC or compatible) or a workstation (e.g. Sun SPARC) may be used as the desk top computer (i.e. client / local machine), each being connected using a local area network (LAN) (e.g. Ethernet) to the server(s). I 1..Il I l+1 I m+1 In • • Im • • A1 ..A l Al+1..Am Mm+1 Mn non-decomposable semi-decomposable decomposable Legacy Database Service Legacy Legacy • • • • Database Database Data I - Interface module A - Application module with legacy database services M - Decomposed application module Figure 4.2 : Legacy IS Architecture 4.3.2 Software Environment The target database software is typically based on a relational DBMS with a 4GL, SQL, report writers and CASE tools (e.g. Oracle v7 with Oracle CASE). Use of such software provides many benefits to its users, such as an increase in program / data independence, introduction of modularised software components, graphical user interfaces, reduction in code, ease of maintenance, support for future evolution and integration of heterogeneous systems. The target database can be centralised on a single server machine or distributed over multiple servers in a networked environment. The target system may consist of application modules representing the decomposed system components, each having its corresponding graphical user interface (see figure 4.3). A typical architecture for a modern IS consists of layers for each of the system functions (e.g. interface, application, database, network) as identified in [BRO93, BRO95]. In figure 4.3 we introduce such an architecture with special emphasis on the database service, which will be a modern DBMS. Page 57
  • 59. Chapter 4 Migration of legacy ISs GUI1 GUIi GUIn • • • • M1 Mi Mn Target DBMS Target Target Target Database • • Database • • Database GUI - graphical user interface module M - Decomposed application module Figure 4.3 : Target IS Architecture The complete migration process involves significant changes, not only in hardware and software of the applications but also in the skills required by users and management. Thus they will have to be trained or replaced to operate the target IS. These changes must be done in some organised manner as the complete migration process itself is complex, and may take months or even years depending on the size and complexity of the legacy IS. The number of persons involved in the migration process and the resources available also contribute towards determining the ultimate duration and cost of the migration. 4.4 Migration Strategies The migration process for legacy ISs may take one of two main forms [BRO93, BRO95]. The first approach involves rewriting a legacy IS from scratch to produce the target IS using modern software techniques (i.e. a complete migration). The other approach involves gradually migrating the legacy IS in small steps until the desired long term objective is reached (i.e. incremental migration). The approach of complete rewriting carries substantial risks and has failed many times in large organisations as it is not an easy task to perform, especially when dealing with systems that must remain operational throughout the process, or large ISs [BRO93, BEN95, BRO95]. However, if the incremental approach fails, then only the failed step must be repeated rather than redoing the entire migration. Hence, it is argued [BRO95] that the latter approach involves a lower risk and is more appropriate in most situations. These approaches are further described in the next two subsections. Our work is directed towards assisting this incremental migration approach. 4.4.1 Complete Migration The process of complete migration involves rewriting a legacy IS from scratch to produce the intended target IS. This approach carries a substantial risk. We discuss some of the reasons for this risk to explain why this approach is not considered to be suitable by us. These are: a) A better system is expected. Page 58
  • 60. Chapter 4 Migration of legacy ISs A 1-1 rewrite of a complex IS is nearly impossible to undertake, as additional functions not present in the original system are expected to be provided by the target IS. Besides, it is a significant problem to evolve a developing replacement IS in step with an evolving legacy IS and to incorporate in both ongoing changes in business requirements. Changes to and requests to evolve ISs may occur at any time, without warning, and hence, it is difficult to incorporate any minor / major ad hoc changes to the new system as they may not fit into its design. Also, an attempt to change this design may violate its original goals and contribute towards a never ending cycle of development changes. b) Specifications rarely exist for the current system. Documentation for the current system is often non-existent and typically available only in the form of the code14 itself. Due to the evolution of the IS many undocumented dependencies will have been added to the system without the knowledge of the legacy IS owners (i.e. uncontrolled enhancements have occurred). These additions and dependencies must be identified and accommodated when rewriting from scratch. This adds to the complexity of a complete rewriting process and raises the risk of unpredicted failure of dependent ISs when we rewrite a legacy system as they are dependent on undocumented features of that system. c) Information system is too critical to the organisation. Many legacy ISs must be operational almost all the time and cannot be dormant during evolution. This means that migrating live data from a legacy IS to the target IS may take more time than the business can afford to be without its mission critical information. Such situations often prohibit complete rewriting altogether and make this approach a non-starter. It also means that a carefully thought out staged migration plan must be followed in this situation. d) Management of large projects is hard. The management of large projects involves managing more and more people. This normally results in less and less productive work because of the effort required to manage organisational complexity. As a result the timing of most large projects is seriously under-estimated. Frequently, this results in partial or complete abortion of the project, as the inability to keep up with original targets due to time lost is not always tolerated by an impatient company management. 4.4.2 Incremental Migration An incremental migration process involves a series of steps, each requiring a relatively small resource allocation (e.g. a few person weeks or months in the case of small or medium scale systems), and a short time to produce a specific small result towards the desired goal. This is in sharp contrast to the complete rewrite approach which may involve a large resource allocation (e.g. several person months or years), and a development project spanning several years to produce a single massive result. To perform a migration involving a series of steps, it is important to identify 14 This code is sometimes provided only in the form of executable code, as ISs are often in-house developments. Page 59
  • 61. Chapter 4 Migration of legacy ISs independent increments (i.e. portions of the legacy interfaces, applications and databases that can be migrated independently of each other), and sequence them to achieve the desired goal. However, as legacy ISs have a wide spectrum of forms from well-structured to unstructured, this process is complex and usually has to deal with unavoidable problems due to dependencies between migration steps. The following are the most important steps to apply in an incremental migration approach: a) Iteratively migrate the computing environment. The target environment must be selected, tested and established based on the total target IS requirements. To determine the target IS requirements, it may be necessary to partially or totally analyse and decompose the legacy IS. The installation of the target environment typically involves installing a desk top computer for each target IS user and appropriate server machines, as identified in section 4.3.1. The process of replacing dumb terminals with a PC or a workstation and connecting them with a local area network can be done incrementally. This process allows the development of the application modules and GUIs on an inexpensive local machine by downloading the relevant code from a server machine, rather than by working on the server itself to develop this software. Software and hardware changes are gradually introduced in this approach along with the necessary user and management training. Hence, although we explicitly refer to a particular process there are many processes that take place simultaneously. This is due to the involvement of many people in the overall migration activity, with each person contributing towards the desired migration goal in a controlled and measurable way. Our work is concerned with iteratively migrating part of the legacy software (i.e. the database service) and not the computing environment. Therefore we worked on a preinstalled target software and hardware environment. b) Iteratively analyse the legacy information system. The purpose of this process is to understand the legacy IS in detail so that ultimately the system can be modified to consist of decomposable modules. Any existing documentation, along with the system code are used for this analysis. Knowledge and experience from people who support and manage the legacy IS is also used to document the existing and the target IS requirements. This knowledge has played a key role in other migration projects [DED95]. Some existing code analysis tools such as Reasoning Systems' Software Refinery and Bachman Information Systems' Product Set [COMP90, CLA92, BRO93, MAR94] can be used to assist in this process. It may be useful to conduct experiments to examine the current system using its known interfaces and available tools (e.g. CASE tools), so that the information gathered with one tool can be reused by other tools. Here, functions and the data content of the current system are analysed to extract as much information as possible about the legacy IS. Other available information for this type of analysis includes: documentation, discussions with users, dumps (system backups), the history of system operation and the services it provides. We do not perform any code analysis as part of our work. However, the analysis we do by automated conceptual modelling identifies the usage of the data structures of the legacy IS. Our Page 60
  • 62. Chapter 4 Migration of legacy ISs analysis assists in identifying the structural components of the legacy IS and their functional dependencies. This information may then be used to restructure the legacy code. c) Iteratively decompose the legacy information system structure. The objective of this process is to construct well-defined interfaces between the modules and the database services of the legacy IS. The process may involve restructuring the legacy IS and removing dependencies between modules. It will thereby simplify the migration, that otherwise must support all these dependencies. This step may be too complex in the worst case, when the legacy IS will have to remain in its original form. Such a situation will complicate the migration process and may result in increased cost, reduced performance and additional risk. However, in such cases an attempt to perform even limited restructuring could facilitate the migration, and is preferable to totally avoiding the decomposition step altogether. We investigate supporting some structural changes in order to improve the existing structures of the legacy database (e.g. introduction of inheritance to represent hierarchical structures and specification of relationship structures). These changes eventually affect the application modules and the interfaces of the legacy IS. Hence there is a direct dependency with respect to decomposing the legacy database service and an indirect dependency with respect to decomposing the other components of the legacy IS. The actual testing of this indirect dependency was not considered due to its involvement with the application module. However, the ability to define referential integrity constraints and assertions spanning multiple tables allows us to redefine functional dependencies in the form of constraints or rules. When these constraints are stored in the database, it is possible to remove such dependencies from the legacy applications. This assists decomposition of some functional components of a legacy IS. d) Iteratively migrate the identified portions. An identified portion of the legacy IS may be an interface, application or a database service. These components are individually migrated to the target environment. When this is done the migrated portion will then run in the target environment with the remaining parts of the legacy system continuing to operate. A gateway is used to encapsulate system components undergoing changes. The objective of this gateway is to hide the ongoing changes in the application and the database service from the system users. Obviously any change made to the appearance of any user interface components will be noticeable, along with any significant performance improvements in application modules processing large volumes of data. Our work is applicable only to a legacy database service and hence any incremental migration of interfaces or application modules is not considered at this stage. The complete migration of legacy data takes a significant amount of time from hours to days depending on the volume of data held. During this process no updates or additions can be made to the data as they cause inconsistency problems. This means all functions of the database application have to be stopped to perform a complete data migration in one go. For large organisations this type of action is not appropriate. Hence iterative migration of selected data portions is desirable. 
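Before a selected portion is copied, it can be checked against the constraints that the target schema will enforce, so that offending rows are found while the legacy source is still authoritative. A hedged sketch, reusing the Employee/Department example and assuming the target carries the constraints of chapter 3:

    -- Rows that the target's referential constraint would reject:
    SELECT EmpNo, WorksFor
    FROM   Employee
    WHERE  WorksFor IS NOT NULL
      AND  WorksFor NOT IN (SELECT DeptCode FROM Department);

    -- Rows that would violate the target's check constraint on Age (cf. figure 3.6):
    SELECT EmpNo, Age
    FROM   Employee
    WHERE  Age NOT BETWEEN 21 AND 65;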
To ensure a successful migration, each chosen portion needs to be validated for consistency and guarded against being rejected by the target database. When migrating data in stages it is necessary to be aware of
  • 63. Chapter 4 Migration of legacy ISs the two sources of data as queries involving a migrated portion need to be re-directed to the target system while other queries must continue to access the legacy database. This process may cause a delay when a response for the query involves both the legacy and target databases. Hence it is important to minimise this delay by choosing independent portions wherever possible for the migration process. 4.5 Migration Method A formal approach to migrating legacy ISs has been proposed by Brodie [BRO93, BRO95] based on his experience in the field of telecommunication and other related projects. These methods, referred to as forward, backward/reverse and general (a combination of forward and backward) migration, are based on his “chicken little” incremental migration process. A forward migration incrementally analyses the legacy IS and then decomposes its structure, while incrementally designing the target IS or installing the target environment. In this approach the database is migrated prior to the other IS components and hence unchanged legacy applications are migrated forward onto a modern DBMS before they are migrated to new target applications. Conversely, backward migration creates the target applications and allows them to access the legacy data as the database migration is postponed to the end. General migration is more complex as it uses a combination of both these methods based on the characteristics of the legacy application and databases. However, this is more suitable for most ISs as the approach can be tailored at will. The incremental migration process consists of a number of migration steps that together achieve the desired migration. Each step is responsible for a specific aspect of the migration (i.e. computer environment, legacy application, legacy data, system and user interfaces). The selection and ordering of each aspect of the migration may differ as it depends on the application, as well as the money and effort allocated for each process. Independent components can be migrated sequentially or in parallel. As we see here, the migration methods of Brodie deal with all components of a legacy IS. Our interest in this project is to focus on a particular component, namely the database service, and as a result a detailed review of Brodie’s migration methods is not relevant here. However, our approach has taken account of his forward migration approach as it first deals with the migration of the legacy database service and then allows the legacy applications to access the post-migrated data management environment through a forward database gateway. 4.6 Migration Support Tools There is no tool that is capable of performing a complete automatic migration for a given IS. However, there are tools that can assist at various stages of this process. Hence, categorising tools by their functions according to the stages of a migration process can help in identifying and selecting those most appropriate. There are three main types of tools, namely: gateways, analysers and migration tools, which can be of use at different stages of migration [BRO95]. For large migration projects, testing and configuration management tools are also of use. a) Gateways Page 62
  • 64. Chapter 4 Migration of legacy ISs The function of a gateway is to coordinate between different components of an IS and hide ongoing changes (i.e. to interfaces, data, applications and other system components being migrated) from users. One of the main functions of these tools is to intercept calls on an application or database service and direct them to the appropriate part of the legacy or target IS. To incrementally migrate a legacy IS to a target IS, we need to select independent manageable portions, replicate them in the target environment and give control to the new target modules while the legacy system is still operational. To perform such a transition in a fashion transparent to users, we need a software module (i.e. a gateway) which encapsulates system components that are undergoing change behind an unchanged interface. Such a software module manages information flow between two different environments, the legacy and target environments. Functions such as retrieval, processing, management and representation of data from various systems are expected from gateways. These expectations from a gateway managing a migration process are similar to those we have of DBMS’s to manage data. DBMSs were designed to provide general purpose data management and similarly the gateway needs to manage the migration process in a generalised form. Development of such a gateway is beyond the scope of this project as it may take several man years of effort. Hence our work will focus on some selected functionalities of a gateway, which are sufficient to produce a realistic prototype. We aim to provide a simplified form of the functionality of a gateway, which permits the evolution of an existing IS at the logical level, by creating a target database and managing an incremental migration of the existing database service in a way transparent to its users. This facility should be provided not only for centralised database systems, but also for heterogeneous distributed databases. This means our gateway functionality should support databases built using different types of DBMS. We expect some of this functionality to be incorporated in future DBMSs as part of their system functionality. Functions such as schema evolution, management of heterogeneous distributed databases and schema integration are expected capabilities of modern database products. b) Analysers These tools employ a wide variety of techniques in exploring the technical, functional and architectural aspects of an application program or database service to provide graphical and textual information about an IS. The functions of reverse and forward engineering are provided by these tools. Many tools are used in this way to analyse the different components of an IS. Most of these tools are semi-automatic as some form of user interaction is required to successfully complete their task. For example, an application or database translation process is automatic if the source program and data conform to all the standards supported by the tool. Otherwise, the translation process will be terminated with the unconvertible portions indicated, leaving the database administrator to complete the job manually by either correcting or re-programming those unconvertible portions of the source program into target language code. We experienced this situation when attempting to migrate an Oracle version 6 database to Oracle version 7, using the Oracle tools. 
In this case, Oracle failed to convert the date functions used to check constraints in its version 6 databases into the equivalent code in version 7 (the cause of this problem was Oracle version 6's use of non-standard date functions).
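The exact definitions involved are not shown here, but a hypothetical Oracle-style check of the following kind conveys the flavour of the problem: a translator expecting standard SQL date literals has no automatic equivalent for the vendor-specific conversion function and must leave such fragments for the DBA to recode manually.

   -- Hypothetical example only: a check built on a vendor-specific date function.
   ALTER TABLE employee
     ADD CONSTRAINT chk_birthdate
     CHECK (birthDate >= TO_DATE('1900-01-01', 'YYYY-MM-DD'));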
  • 65. Chapter 4 Migration of legacy ISs c) Migration tools These tools are responsible for creating the components of the target IS, including interfaces, data, data definitions and applications. d) Testing An important task is to ensure that the migrated target IS performs in the same way as its legacy original, with possibly better performance. For this task we need test beds to check the most amount of logic using the least amount of data. There are tools that allow for easy manipulation of testing functions like break points and data values. However, they do not help with the generation of test beds or validation of the adequacy of the testing process. Comparing the results that are generated using both systems will assist the achievement of a reasonable form of testing. This may not be sufficient to test new features such as the introduction of distributed computing functionality to our systems. It is up to the person involved to ensure that a reasonable amount of testing has been done to ensure the functionality and the accuracy of the new IS. e) Configuration management This type of tool is needed for large migration projects involving many people, to coordinate functions such as documentation, synchronisation, keeping track of changes made (auditing), management of revisions to system elements (version control), and automatic building of a particular version of a system component. Our work focuses on bringing these tools together into a single environment. We wish to analyse a legacy database service, hence the functions of reverse and forward engineering are of particular interest. We integrate these functions with some forward gateway and migration functions as they are the relevant components for us to address the enhancement and migration of a database service. Thus, we are not interested in all the features associated with migration support tools. The classification of reverse and re-engineering tools given in [SHA93, BRO95] provides a broad description of the functions of existing CASE tools. These include maintaining, understanding and enhancing existing systems; converting / migrating existing systems to new technology; and building new replacement systems. There are many tools which perform some of these functions. However, none of them is capable of performing the integrated task of all the above functions. This is one of the important requirements for future CASE tools. As it is practically impossible to produce a single tool to perform all these tasks, the way to overcome this deficiency is to provide a gateway that permits multiple tools to exchange information and hence provide the required integrated facility. The need to integrate different software components (i.e. database, spreadsheet, word processing, e-mail and graphic applications) has resulted in the production of some integration tools, such as DEC’s Link Works and Dictionary Software’s InCASE [MAY94]. However, what we need is to integrate data extraction and downloading tools with schema enhancement and evolution functions as they are together vital in the context of enhancing and migrating legacy databases. Page 64
  • 66. Chapter 4 Migration of legacy ISs Support for interoperability among various DBMSs and the ability to re-engineer a DBMS are important functions for a successful migration process. Of these two, the former has not been given any attention until very recently, and there has been some progress relating to the latter in the form of CASE tools. However, among the many CASE tools available only a handful support the re- engineering process. The reason for this is that most CASE tools focus on forward-engineering. In this situation, new or replacement software systems are being designed and appropriate code generated. The re-engineering process is a combination of both forward-engineering and reverse- engineering. The reverse-engineering process analyses the data structures of the databases of existing systems to produce a higher level representation of these systems. This higher level representation is usually in diagrammatic form and may be an entity-relationship diagram, data-flow diagram, cross-reference diagram or structure chart. We came across some tools that are commercially available for performing various tasks of the migration process. These include data access and / or extraction tools for Oracle [BAR90, HOL93, KRO93] and INGRES [RTI92] - two of our test DBMSs. Some other tools, mainly those capable of performing the initial reverse engineering task, are also identified here. These tools are not suitable for legacy ISs in general, as they fail to support a variety of DBMSs or the re- engineering of most pre-existing databases. Among the different tools available, tools such as gateways play a more significant role than others. When different database products are used in an organisation, there may be a need to use multiple tools for a single step of a migration process, conversely some tools may be of use for multiple steps. The process of using multiple tools for a migration is complex and difficult as most vendors have not yet addressed the need for tool interoperability. The survey carried out in [COMP90] identifies many reverse-engineering products. Among the 40 vendor products listed there, only three claimed to be able to reverse engineer Oracle, INGRES or POSTGRES databases (our test databases - see section 6.1) or any SQL based database products. These three products were: Deft CASE System, Ultrix Programming Environment and Foundation Vista. Of these products only Deft and Vista produced E-R diagrams. None of the products in the complete list supported POSTGRES, which was then a research prototype. Of the two products identified above, only Deft was able to read both Oracle and INGRES databases, while Vista could read only INGRES databases. This analysis indicated that interoperability among our preferred databases was rare and that it is not easy to find a tool that will perform the re-engineering process and support interoperability among existing DBMSs. Although the information published in [COMP90] may be now outdated, the literature published since then [SHA93, MAY94, SCH95] does not show that modern CASE tools have addressed the task of re-engineering existing ISs along with interoperability, both of which are essential for a successful migration process. However, the functionality of accessing and sharing information from various DBMSs via gateways like ODBC is a step towards achieving this task. One of the reasons for progress limitation is the inability to customise existing tools, which in turn prevents them being used in an integrated environment. 
This is confirmed to some extent by the failure of the leading Unix database vendor - Oracle - to provide such tools. Brodie and Stonebraker, in their book [BRO95], present a study of the migration of large legacy systems. It identifies an approach (chicken-little) and the commercial tools required to
  • 67. Chapter 4 Migration of legacy ISs support this approach for legacy ISs in general. In this project we have developed a set of tools to support an alternative approach for migrating legacy database services in particular. Thus Brodie and Stonebraker take account of the need to migrate the application processes with a database, using commercial tools, while in this thesis we concentrate on the development of integrated tools for enhancing and migrating a database service. 4.7 The Migration Process Having identified the migration strategies and methods applicable to our work, we can review our migration process. This process must start with a legacy IS as in figure 4.2 and end with a target IS as shown in figure 4.3. However, as we are not addressing the application and interface components of a legacy IS, their conversion is not part of this project. Our conceptualised constraint visualisation and enhancement system (CCVES) described in section 2.2 was designed to assist in preparing legacy databases for such a migration. Hence our migration process can be performed by connecting the legacy and target ISs using CCVES. This is shown in figure 4.4. The three essential steps performed by CCVES before the actual migration process occurs are shown using the paths highlighted in this figure as A, B and C, respectively. These are the same paths that were described in section 2.2. The identification of all legacy databases used by an application is made prior to the commencement of path A of figure 4.4. The reverse engineering process is then performed on any selected database. This process commences when the database schema and its initial constraints are extracted from the selected database and is completed when the database schema is graphically displayed in a chosen format. Any missing or new information is supplied via path B in the form of enhanced constraints, to allow further relationships and constraints to appear in the conceptual model. The constraint enforcement process of path C is responsible for issuing queries and applying these constraints to the legacy data and taking necessary actions whenever a violation is detected, before any migration occurs. This ensures that the legacy data is consistent with its enhanced constraints before migration. Once these steps are completed, a graceful transparent migration process can be undertaken via path D. Our work focuses only on evolving and migrating database services, hence path X representing the application migration is not done via CCVES. The evolution of database services includes increasing IS program / data independence by identifying and transferring legacy application services which are concerned with data management functions, like enforcement of referential constraints, integrity constraints, rules, triggers, etc., to the database service from the application. Our migration process performs the transformation of the legacy database to the target environment and passes responsibility for enforcing the newly identified constraints to this system. Figure 4.4 indicates that our approach commences with a reverse engineering process. This is followed by a knowledge augmentation process which itself is a function of a forward engineering process. These two stages together are referred to as re-engineering (see section 5.1). The constraint enforcement process is the next stage of our approach. 
This is associated with the enhanced constraints of the previous stage as it is necessary to validate the existing and enhanced constraint specifications against the data held. These three preparatory stages are described in chapter 5. The final stage of our approach is the database migration process. This is described later after we have
fully discussed the application of the earlier stages in relation to our test databases.
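One simple way to picture the transparency aimed at in path D is a view, maintained by the gateway, that hides whether a given portion of the data has been migrated yet. This is only a sketch of the idea: the table names are illustrative, the two portions are assumed to be disjoint, and reaching both databases from one point is assumed to be possible (for example through a database link).

   -- Applications continue to query 'student'; the view decides where the rows
   -- actually live during the incremental migration (illustrative names).
   CREATE VIEW student AS
     SELECT student_no, name, dept FROM legacy_student
     UNION ALL
     SELECT student_no, name, dept FROM target_student;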
  • 70. CHAPTER 5 Re-engineering Relational Legacy Databases This chapter addresses the re-engineering process and issues concerned with relational legacy DBMSs. Initially, the reverse-engineering process for relational databases is overviewed. Next, we introduce our re-engineering approach, highlighting its important stages and the role of constraints in performing these stages. We then present how existing legacy databases can be enhanced with modern concepts and introduce our knowledge representation techniques which allow the holding of the enhanced knowledge in the legacy database. Finally, we describe the optional constraint enforcement process which allows validation of existing and enhanced constraint specifications against the data held. 5.1 Re-engineering Relational Databases Software such as programming code and databases is re-engineered for a number of reasons: for example, to allow reuse of past development efforts, reduce maintenance expense and improve software flexibility [PRE94]. This re-engineering process consists of two stages, namely: a reverse- engineering and a forward-engineering process. In database migration the reverse-engineering process may be applied to help migrate databases between different vendor implementations of a particular database paradigm (e.g. from INGRES to Oracle), between different versions of a particular DBMS (e.g. Oracle version 3 to Oracle version 7) and between database types (e.g. hierarchical to modern relational database systems). The forward-engineering process, which is the second stage of re-engineering, is performed on the conceptual model derived from the original reverse-engineering process. At this stage, the objective is to redesign and / or enhance an existing database system with missing and / or new information. The application of reverse-engineering to relational databases has been widely described and applied [DUM81, NAV87, DAV87, JOH89, MAR90, CHI94, PRE94, WIK95b]. The latest approaches have been extended to construct a higher level of abstraction than the original E-R model. This includes the representation of object-oriented concepts such as generalisation / specialisation hierarchies in a reversed-engineered conceptual model. Due to parallel work that had occurred in this area in the recent years, there are some similarities and differences in our reverse-engineering approach [WIK95b] when compared with other recent approaches [CHI94, PRE94]. In the next sub-sections we shall refer to them. The techniques used in the reverse-engineering process consist of identifying common characteristics as identified below: • Identify the database’s contents such as relations and attributes of relations. • Determine keys, e.g. primary keys, candidate keys and foreign keys. • Determine entity and relationship types. • Construct suitable data abstractions, such as generalisation and aggregation structures.
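The first two of these steps can often be carried out directly against the DBMS's own catalogue. For example, on an Oracle system queries of roughly the following form list the relations with their attributes and any declared primary, foreign or uniqueness constraints; other DBMSs expose equivalent information under different catalogue names, and the access routines actually used for our test DBMSs are described in section 6.5.

   -- Relations and their attributes, from the Oracle data dictionary.
   SELECT table_name, column_name, data_type
   FROM   user_tab_columns
   ORDER  BY table_name, column_id;

   -- Declared primary key ('P'), foreign key ('R') and uniqueness ('U') constraints.
   SELECT table_name, constraint_name, constraint_type
   FROM   user_constraints
   WHERE  constraint_type IN ('P', 'R', 'U');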
  • 71. Chapter 5 Re-engineering relational legacy DBs 5.1.1 Contents of a relational database Diverse sources provide information that leads to the identification of a database’s contents. These include the database’s schema, observed patterns of data, semantic understanding of application and user manuals. Among these the most informative source is the database’s schema, which can be extracted from the data dictionary of a DBMS. The observed patterns of data usually provide information such as possible key fields, domain ranges and the related data elements. This source of information is usually not reliable as invalid, inconsistent, and incomplete data exists in most legacy applications. The reliability can be increased by using the semantics of an application. The availability of user manuals for a legacy IS is rare and they are usually out of date, which means they provide little or no useful information to this search. Data dictionaries of relational databases store information about relations, attributes of relations, and rapid data access paths of an application. Modern relational databases record additional information, such as primary and foreign keys (e.g. Oracle), rules / constraints on relations (e.g. INGRES, POSTGRES, Oracle) and generalisation hierarchies (e.g. POSTGRES). Hence, analysis of the data dictionaries of relational databases provides the basic elements of a database schema, i.e. entities, their attributes, and sometimes the keys and constraints, which are then used to discover the entity and relationship types that represent the basic components of a conceptual model for the application. The trend is for each new product release to support more sophisticated facilities for representing knowledge about the data. 5.1.2 Keys of a relational data model Theoretically, three types of key are specified in a relational data model. They are primary, candidate and foreign keys. Early relational DBMSs were not capable of implicitly representing these. However, sometimes indexes which are used for rapid data access can be used as a clue to determine some keys of an application database. For instance, the analysis of the unique index keys of a relational database provides sufficient information to determine possible primary or candidate keys of an application. The observed attribute names and data patterns may also be used to assist this process. This includes attribute names ending with ‘#’ or ‘no’ as possible candidate keys, and attributes in different relations having the same name for possible foreign key attributes. In the latter case, we need to consider homonyms to eliminate incorrect detections and synonyms to prevent any omissions due to the use of different names for the same purpose. Such attributes may need to be further verified using the data elements of the database. This includes explicit checks on data for validity of uniqueness and referential integrity properties. However the reverse of this process, i.e. determining a uniqueness property from the data values in the extensional database is not a reliable source of information, as the data itself is usually not complete (i.e. it may not contain all possible values) and may not be fully accurate. Hence we do not use this process although it has been used in [CHI94, PRE94]. The lack of information on keys in some existing database specifications has led to the use of data instances to derive possible keys. 
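Such explicit checks can be phrased as ordinary queries over the data. The sketch below tests a suspected foreign key, deptno of Employee referencing Department, by looking for values with no counterpart in the referenced table; an empty answer is consistent with the referential property, although, as noted above, the data alone cannot prove it. A similar query using GROUP BY and HAVING COUNT(*) > 1 tests a suggested uniqueness property.

   -- Values of the suspected foreign key employee.deptno that have no matching
   -- department row; any row returned shows the referential property is violated.
   SELECT DISTINCT e.deptno
   FROM   employee e
   WHERE  e.deptno IS NOT NULL
   AND    NOT EXISTS (SELECT 1 FROM department d WHERE d.deptno = e.deptno);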
However it is not practicable to automate this process as some entities have keys consisting of multiple attributes. This means many permutations would have to be considered to test for all possibilities. This is an expensive operation when the volume of data and / or the number of attributes is large.
In [CHI94], a consistent naming convention is applied to key attributes. Here attributes used to represent the same information must have the same name, and as a result referencing and referenced attributes of a binary relationship between two entities will have the same attribute names in the entities involved. This naming convention was used in [CHI94] to determine relationship types, as foreign key specifications are not supported by all databases. An important contribution of our work is to support the identification of foreign key specifications for any database and hence the detection of relationships, without performing any name conversions. We note that some reverse-engineering methods rely on candidate keys (e.g. [NAV87, JOH89]), while others rely on primary keys (e.g. [CHI94]). These approaches insist on their users meeting their pre-requisites (e.g. specification of missing keys) to enable the user to successfully apply their reverse-engineering process. This means it is not possible to produce a suitable conceptual model until the pre-requisites are supplied. For a large legacy database application the number of these could exceed a hundred and hence it is not appropriate to rely on such pre-requisites being met to derive an initial conceptual model. Therefore, we concentrate on providing an initial conceptual model using only the available information. This will ensure that the reverse-engineering process will not fail due to the absence of any vital information (e.g. the key specification for an entity).

5.1.3 Entity and Relationship Types of a data model

In the context of an E-R model an entity is classified as strong (in some literature this type of entity is referred to as a regular entity, e.g. [DAT95]) or weak depending on an existence-dependent property of the entity. A weak entity cannot exist without the entity it is dependent on. The enhanced E-R model (EER) [ELM94] identifies more entity types, namely: composite, generalised and specialised entities. In section 3.3.1 we described these entity types and the relationships formed among them. Different classifications of entities are due to their associative properties with other entities. The identification of an appropriate entity type for each entity will assist in constructing a graphically informative conceptual model for its users. The extraction of information from legacy systems to classify the appropriate entity type is a difficult task as such information is usually lost during an implementation. This is because implementations take different forms even within a particular data model [ELM94]. Hence, an information extraction process may need to interact with a user to determine some of the entity and relationship types. The type of interaction required depends on the information available for processing and will take different forms. For this reason we focus only on our approach, i.e. determining entity and relationship types using enhanced knowledge such as primary and foreign key information. This is described in section 5.4.

5.1.4 Suitable Data Abstractions for a data model

Entities and relationships form the basic components of a conceptual data model. These components describe specific structures of a data model. A collection of entities may be used to represent more than one data structure. For example, entities Person and Student may be represented as a 1:1 relationship or as an is-a relationship. Each representation has its own view and hence the user understanding of the data model will differ with the choice of data structure. Hence it is important to be able to introduce any data structure for a conceptual model and view it using the most
  • 73. Chapter 5 Re-engineering relational legacy DBs suitable data abstraction. Data structures such as generalisation and aggregation have inherent behavioural properties which give additional information about their participating entities (e.g. an instance of a specialised entity of a generalisation hierarchy is made up from an instance of its generalised entity). These structures are specialised relationships and representation of them in a conceptual model provides a higher level of data abstraction and a better user understanding than the basic E-R data model gives. These data abstractions originated in the object-oriented data model and they are not implicitly represented in existing relational DBMSs. Extended-relational DBMSs support the O-O paradigm (e.g. POSTGRES) with generalisation structures being created using inheritance definitions on entities. However in the context of legacy DBMSs such information is not normally available, and as a result such data abstractions can only be introduced either by introducing them without affecting the existing data structures or by transforming existing entities and relationships to support their representation. For example, entities Staff and Student may be transformed to represent a generalisation structure by introducing a Person entity. Other forms of transformation can also be performed. These include decomposing all n-ary relationships for n > 3 into their constituent relationships of order 2 to remove such relationships and hence simplify the association among their entities. At this stage double buried relationships are identified and merged and relationships formed with subclasses are eliminated. Transitive closure relationships are also identified and changed to form simplified hierarchies. We use constraints to determine relationships and hierarchies. By controlling these constraints (i.e. modifying or deleting them) it is possible to transform or eliminate necessary relationships and hierarchies. 5.2 Our Re-engineering Process Our re-engineering process has two stages. Firstly, the relational legacy database is accessed to extract the meta-data of the application. This extracted meta-data is translated into an internal representation which is independent of the vendor database language. This information is next analysed to determine the entity and relationship types, their attributes, generalisation / specialisation hierarchies and application constraints. The conceptual model of the database is then derived using this information and is presented graphically for the user. This completes the first stage which is a reverse-engineering process for a relational database. To complete the re-engineering process, any changes to the existing design and any new enhancements are done at the second stage. This is a forward-engineering process that is applied to the reverse-engineered model of the previous stage. We call this process constraint enhancement as we use constraints to enhance the stored knowledge of a database and hence perform our forward- engineering process. These constraint enhancements are done with the assistance of the DBA. 5.2.1 Our Reverse-Engineering Process Our reverse-engineering process concentrates on producing an initial conceptual model without any user intervention. This is a step towards automating the reverse-engineering process. However the resultant conceptual model is usually incomplete due to the limited meta-knowledge available in most legacy databases. 
Also, as a result of incomplete information and unseen inclusion
dependencies we may represent redundant relationships as well as fail to identify some of the entity and / or relationship types. We depend on constraint enhancement (i.e. the forward-engineering process) to supply this missing knowledge so that subsequent conceptual models will be more complete. The DBA can investigate the reverse-engineered model to detect and resolve such cases with the assistance of the initial display of that model. The system will need to guide the DBA by identifying missing keys and assisting in specifying keys and other relevant information. It also assists in examining the extent to which the new specifications conform to the existing data.

Our reverse-engineering process does not depend on information about specialised constraints. When no information about these is available, we treat all entities of a database to be of the same type (i.e. strong entities) and any links present in the database will not be identified. In such a situation the conceptual model will display only the entities and attributes of the database schema without any links. For example, a relational database schema for a university college database system with no constraint-specific information will initially be viewed as shown in figure 5.1. This is the usual case for most legacy databases as they lack constraint-specific information. However, the DBA will be able to provide any missing information at the next stage so that any intended data structures can be reconstructed. Obviously if some constraints are available our reverse-engineering process will try to derive possible entity types and links during its initial application.

Figure 5.1 : A relational legacy database schema for a university college database (the figure shows the entities University, College, Faculty, Department, Employee, Student and EmpStudent with their attributes and no connecting links)

Our reverse-engineering process first identifies all the available information by accessing the legacy database service (cf. section 5.3). The accessed information is processed to derive the relationship and entity types for our database schema (cf. section 5.4). These are then presented to the user using our graphical display function.

5.2.2 Our Forward-Engineering Process

The forward-engineering process is provided to allow the designer (i.e. DBA) to interact with a conceptual model. The designer is responsible for verifying the displayed model and can supply any additional information to the reverse-engineered model at this stage. The aim of this process is to allow the designer to define and add any of the constraint types we identified in section 3.5 (i.e. primary key constraints, foreign key constraints, uniqueness constraints, check constraints, generalisation / specialisation structures, cardinality constraints and other constraints) which are not present. Such additions will enhance the knowledge held about a legacy database. As a result, new links and data abstractions that should have been in the conceptual model can be derived using our reverse-engineering techniques and presented in the graphical display. This means that the legacy database schema originally viewed as in figure 5.1 can be enhanced with constraints and presented as in figure 5.2, which is a vast improvement on the original display. Such an enhanced display demonstrates the extent to which a user's understanding of a legacy database schema can be improved by providing some additional knowledge about the database. In sections 6.3.4 and 6.4 we introduce the enhanced constraints of our test databases including those used to improve the legacy database schema of figure 5.1 to figure 5.2.

Figure 5.2 : The enhanced university college database schema (the figure shows the same application with a generalised Person entity above Employee and Student, EmpStudent as a subtype of both, a generalised Office entity with specialisations College-Office, Faculty-Office and Dept-Office that rename siteName, unitName and inCharge, an aggregation between University and Office, and cardinality constraints such as 4+ and 2-12 attached to relationships)

We support the examination of existing specifications and identification of possible new specifications (cf. section 5.5) for legacy databases. Once these are identified, the designer defines new constraints using a graphical interface (cf. section 5.6). The new constraint specifications are stored in the legacy database using a knowledge augmentation process (cf. section 5.7). We also supply a constraint verification module to give users the facility to verify and ensure that the data conforms to all the enhanced constraints (cf. section 5.8) being introduced.

5.3 Identifying Information of a Legacy Database Service

Schema information about a database (i.e. meta-data) is stored in the data dictionaries of that database. The representation of information in these data dictionaries is dependent on the type of the DBMS. Hence initially the relational DBMS and the databases used by the application are identified. The name and the version of the DBMS (e.g. INGRES version 6), the names of all the databases in
  • 76. Chapter 5 Re-engineering relational legacy DBs use (e.g. faculty / department), and the name of the host machine (e.g. llyr.athro.cf.ac.uk) are identified at this stage. These are the input data that allows us to access the required meta-data. As the access process is dependent on the type of the DBMS, we describe this process in section 6.5 after specifying our test DBMSs. This process is responsible for identifying all existing entities, keys and other available information in a legacy database schema. 5.4 Identification of Relationship and Entity Types Once the entities and their attributes along with primary and candidate keys have been provided, we are ready to classify relationships and entity types. Three types of binary relationships (i.e. 1:1, 1:N and N:M) and five types of entities (i.e. strong, weak, composite, generalised and specialised) are identified at this stage. Initially we assume that all entities are strong and look for certain properties associated with them (mainly primary and foreign key), so that they can be reclassified into any of the other four types. Weak and composite entities are identified using relationship properties and generalised / specialised entities are determined using generalisation hierarchies. 5.4.1 Relationship Types (a) A M:N relationship If the primary key of an entity is formed from two foreign keys then their referenced entities participate in an M:N relationship. This is a special case of n-ary relationship involving two referenced entities (see section ‘a’ of figure 5.3). This entity becomes a composite entity having a composite key. For example, entity Option with primary key (course,subject) participates in an M:N relationship as the primary key attributes are foreign keys - see tables 6.2, 6.4 and 6.6 (later). In a n-ary relationship (e.g. 3-ary or ternary if the number of foreign keys is 3, see section ‘b’ of figure 5.3) the primary key of an entity is formed from a set of n foreign keys. As stated in section 5.1.4, n-ary relationships for n > 3 are usually decomposed into their constituent relationships of order 2 to simplify their association. Hence we do not specifically describe this case. For example, entity Teach with primary key (lecturer, course, subject) participates in a 3-ary relationship when lecturer, course and subject are foreign keys referencing entities Employee, Course and Subject, respectively. However, as Option is made up using Course and Subject entities we could decompose this 3-ary relationship into a binary relationship by defining course and subject of Teach to be a foreign key referencing entity Option - see tables 6.2, 6.4 and 6.6. Page 75
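Expressed in SQL, the pattern of case (a) for the Option example looks roughly as follows. The column types and constraint layout are illustrative only (the actual test tables appear in chapter 6), and OPTION is a reserved word in some SQL dialects, so a real definition may need a quoted or different identifier; the point is simply that the primary key is made up entirely of two foreign keys.

   CREATE TABLE option (
     course   CHAR(6) NOT NULL,          -- foreign key to Course
     subject  CHAR(6) NOT NULL,          -- foreign key to Subject
     PRIMARY KEY (course, subject),      -- primary key formed from the two foreign keys
     FOREIGN KEY (course)  REFERENCES course,   -- references the primary key of Course
     FOREIGN KEY (subject) REFERENCES subject   -- references the primary key of Subject
   );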
Figure 5.3: Mapping foreign key references to an ER relationship type. (The figure tabulates the following mappings: (a) a primary key formed from two foreign keys maps to a binary M:N relationship; (b) a primary key formed from n foreign keys, n > 2, maps to an n-ary relationship, e.g. a ternary relationship for n = 3; (c) a foreign key attribute that is part of the primary key, where the remaining part does not contain a key of any other relation, maps to a weak 1:N relationship; (d) a non-key, non-unique foreign key attribute maps to a 1:N relationship; (e) a non-key foreign key attribute with a uniqueness property maps to a 1:1 relationship. PK - primary key, FK - foreign key, e - referencing entity, re - referenced entity.)

Sometimes a foreign key refers to the same entity, forming a unary relationship, like in the case where some courses may have pre-requisites. In this case the attribute pre-requisites of entity Course is a foreign key referencing the same entity.

(b) A 1:N relationship

There are two types of 1:N relationships. One is formed with a weak entity and the other with a strong entity. If part of the primary key of an entity is a foreign key and the other part does not contain a key of any other relation, then the entity concerned is a weak entity and will participate in a weak 1:N relationship (see section 'c' of figure 5.3) with its referenced entity. For example, entity Committee with primary key (name, faculty) is a weak entity as only a part of its primary key attributes (i.e. faculty) is a foreign key. A non-key foreign key attribute (i.e. an attribute that is not part of a primary key) that may have multiple values will participate in a strong 1:N relationship (see section 'd' of figure 5.3) if it does not satisfy the uniqueness property. For example, attribute tutor of entity Student is a non-key, non-unique foreign key referencing the entity Employee (cf. tables 6.2 to 6.4). Here tutor participates in a 1:N relationship with Employee - see table 6.6.

(c) A 1:1 relationship

A non-key foreign key attribute will participate in a 1:1 relationship if a uniqueness constraint is defined for that attribute (see section 'e' of figure 5.3). For example, attribute head of entity Department participates in a 1:1 relationship with entity Employee as it is a non-key foreign key with the uniqueness property, referencing Employee - see tables 6.2 to 6.4 and 6.6.
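The tutor and head examples just given can be written as explicit SQL constraints; the constraint names and the referenced column empNo are illustrative assumptions rather than the definitions used in our test databases.

   -- Case (d): tutor is a non-key foreign key with no uniqueness constraint,
   -- so Student participates in a 1:N relationship with Employee.
   ALTER TABLE student ADD CONSTRAINT fk_student_tutor
     FOREIGN KEY (tutor) REFERENCES employee (empNo);

   -- Case (e): head is a non-key foreign key that is also unique,
   -- so Department participates in a 1:1 relationship with Employee.
   ALTER TABLE department ADD CONSTRAINT uq_department_head UNIQUE (head);
   ALTER TABLE department ADD CONSTRAINT fk_department_head
     FOREIGN KEY (head) REFERENCES employee (empNo);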
  • 78. Chapter 5 Re-engineering relational legacy DBs The specialised and generalised entity pair of a generalisation hierarchy has a 1:1 is-a relationship. Hence it is possible to define a binary relationship in place of a generalisation hierarchy. For example, it is possible to define a foreign key (empNo) on entity EmpStudent, referencing entity Employee to form a 1:1 relationship instead of representing it as a generalisation hierarchy. Such cases must be detected and corrected by the database designer. We introduce inheritance constraints involving these entities to resolve such cases. 5.4.2 Entity Types (a) A strong entity This is the default entity type, as any entity that cannot be classified as one of the other types will be a strong (regular) entity. (b) A composite entity An entity that is used to represent an M:N relationship is referred to as a composite (link) entity (cf. section 5.4.1 (a)). The identification of M:N relationships will result in the identification of composite entities. (c) A weak entity An entity that participates in a weak 1:N relationship is referred to as a weak entity (cf. section 5.4.1 (b)). The identification of weak 1:N relationships will result in the identification of weak entities. (d) A generalised / specialised entity An entity defined to contain an inheritance structure (i.e. inheriting properties from others) is a specialised entity. Entities whose properties are used for inheritance are generalised entities. The identification of inheritance structures will result in the identification of specialised and generalised entities. An inheritance structure defines a single inheritance structure (e.g. entities X1 to Xj inherit from entity A in figure 5.4). However, a set of inheritance structures may form a multiple inheritance structure (e.g. entity Xj inherits from entities A and B in figure 5.4). To determine the existence of multiple inheritance structures we analyse all subtype entities of the database (e.g. entities X1 to Xn in figure 5.4) and derive their supertypes (e.g. entity A or B or both in figure 5.4). For example, entity EmpStudent inherits from Employee and Student entities forming a multiple inheritance, while entity Employee inherits from Person to form a single inheritance. Page 77
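Where the DBMS itself supports inheritance, such structures can be declared directly. The sketch below uses PostgreSQL-style SQL purely as an illustration (the POSTGRES research prototype used in this work provided a comparable inheritance facility through its own query language), and the column lists are abridged, illustrative versions of the attributes shown in figure 5.2.

   -- Person is the generalised entity; Employee and Student specialise it,
   -- and EmpStudent inherits from both, giving a multiple inheritance structure.
   CREATE TABLE person     (name VARCHAR(40), address VARCHAR(80), birthdate DATE);
   CREATE TABLE employee   (empno INTEGER, designation VARCHAR(20)) INHERITS (person);
   CREATE TABLE student    (collegeno INTEGER, course VARCHAR(10)) INHERITS (person);
   CREATE TABLE empstudent (remarks VARCHAR(80)) INHERITS (employee, student);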
Figure 5.4: Single and multiple inheritance structures using EER notations (entities X1, .., Xi, .., Xj inherit from entity A and entities Xj, .., Xn inherit from entity B)

5.5 Examining and Identifying Information

Our forward-engineering process allows the designer to specify new information. To successfully perform this task the designer needs to be able to examine the current contents of the database and identify possible missing information from it.

5.5.1 Examining the contents of a database

At this stage the user needs to be able to browse through all features of the database. Overall, this includes viewing existing primary keys, foreign keys, uniqueness constraints and other constraint types defined for the database. When inheritance is involved the user may need to investigate the participating entities at each level of inheritance. For more specific viewing, the user may want to investigate the behaviour of individual entities. This includes identifying constraints associated with a particular entity (i.e. intra-object properties) and those involving other entities (i.e. inter-object properties). Our system provides for this via its graphical interface. We describe viewing of these properties in section 7.5.1, as it is directly associated with this interface. Here, global information is tabulated and presented for each constraint type, while specific information (i.e. inter- and intra-object) presents constraints associated with an individual entity.

5.5.2 Identifying possible missing, hidden and redundant information

This process allows the designer to search for specific types of information, including information about the type of entities that do not contain primary keys, possible attributes for such keys, buried foreign key definitions and buried inheritance structures. In this section we describe how we identify this type of information.

i) Possible primary key constraints

Entities that do not contain primary keys are identified by comparing the list of entities having primary keys with the list of all entities of the database. When such entities are identified the user can view the attributes of these and decide on a possible primary key constraint. Sometimes, an entity may have several attributes and hence the user may find it difficult to decide on suitable primary key attributes. In such a situation the user may need to examine existing properties of that entity (cf. section 5.5.1) to identify attributes with uniqueness properties and no null values.
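A candidate chosen in this way can be screened against the data itself with a query of the following kind, anticipating the fuller verification of section 5.8; empNo of Employee is used here purely as an illustrative choice of key attribute.

   -- Any row returned reports a null or duplicated value of the proposed key
   -- attribute, which would rule out empNo as a primary key for Employee.
   SELECT empno, COUNT(*)
   FROM   employee
   GROUP  BY empno
   HAVING COUNT(*) > 1 OR empno IS NULL;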
  • 80. Chapter 5 Re-engineering relational legacy DBs Sometimes, attribute names such as those ending with ‘no’ or ‘#’ may give a clue in selecting the appropriate attributes. Once the primary key attributes have been decided the user may want to verify this choice against the data of the database (cf. section 5.8). ii) Possible foreign key constraints Existence of either an inclusion dependency between a non-key attribute of one table and a key attribute of another (e.g. deptno of Employee and deptno of Department), or a weak or n-ary relationship between a key attribute and part of a key attribute (e.g. cno of strong entity Course and cno of link entity Teach) implies the possible existence of a foreign key definition. Such possibilities are detected by matching attribute names satisfying the required condition. Otherwise, the user needs to inspect the attributes and detect their possible occurrence (e.g. if attribute name worksfor instead of deptno was used in Employee). iii) Possible uniqueness constraints Detection of a uniqueness index gives a clue to a possible uniqueness constraint. All other indications of this type of constraint have to be identified by the user. iv) Possible inheritance structures Existence of an inclusion dependency between two strong entities having the same key implies a subtype / supertype relationship between the two entities. Such possible relationships are detected by matching identical key attribute names of strong entities (e.g. empno of Person and empno of Employee). Otherwise, the user needs to inspect the table and 1:1 relationships to detect these structures (e.g. if personid instead of empno was used in Person then the link between empno and personid would have to be identified by the user). In distributed database design some entities are partitioned using either horizontal or vertical fragmentation. In this situation strong entities having the same key will exist with a possible inclusion dependency between vertically fragmented tables. Such cases need to be identified by the designer to avoid incorrect classifications occurring. For example, employee records can be horizontally fragmented and distributed over each department as opposed to storing at one site (e.g. College). Also, employee records in a department may be vertically fragmented at the College site as the college is interested in a subset of information recorded for a department. v) Possible unnormalised structures All entities of a relational model are at least in 1NF, as this model does not allow multivalued attributes. When entities are not in 3NF (i.e. a non-key attribute is dependent on part of a key or another non-key attribute: violating 2NF or 3NF, respectively), there are hidden functional dependencies. These entities need to be identified and transformed into 3NF to show their dependencies. New entities in the form of views are used to construct this transformation. For example, entity Teach can be defined to contain attributes lecturer, course, subNo, subName and room. Here we see that subName is fully dependent on subNo and hence Teach is in 2NF. Using a view we separate subName from Teach and use it as a separate entity with primary key subNo. This Page 79
  • 81. Chapter 5 Re-engineering relational legacy DBs allows us to transform the original Teach to 3NF and view Subject and Teach as a binary, instead of an unary relationship. This will assist in improving conceptual model readability. vi) Possible redundant constraints Redundant inclusion dependencies representing projection or transitivity must be removed, otherwise incorrect entity or relationship types may be represented. For instance, if there is an inclusion dependency between entities A, B and B, C then the transitivity inclusion dependency between A, C is redundant. Such relationships should be detected and removed. For example, EmpStudent is an Employee and Employee is a Person, thus EmpStudent is a Person is a redundant relationship. Redundant constraints are often most obvious when viewing the graphical display of a conceptual model with its inter- and intra- object constraints. 5.6 Specifying New Information We can specify new information using constraints. In a modern DBMS which supports constraints we can use its query language to specify these. However this approach will fail for legacy databases as they do not normally support the specification of constraints. To deal with both cases we have designed our system to externally accept constraints of any type, but represent them internally by adopting the appropriate approach depending on the capabilities of the DBMS in use. Thus if constraint specification is supported by the DBMS in use we will issue a DDL statement (cf. figure 5.5 which is based on SQL-3 syntax) to create the constraint. If constraint specification is not supported by the DBMS in use we will store the constraint in the database using techniques described in section 5.7. These constraints are not enforced by the system but they may be used to verify the extent to which the data conforms with the constraints (cf. section 5.8). In both cases this enhanced knowledge is used by our conceptual model wherever it is applicable. The following sub- sections describe the specification process for each constraint type. We cover all types of constraints that may not be supported by a legacy system, including primary key. We use the SQL syntax to introduce them. In SQL, constraints are specified as column/table constraint definitions and can optionally contain a constraint name definition and constraint attributes (see sections A.3 and A.4) which are not included here. i) Specifying Primary Key Constraints Only one primary key is allowed for an entity. Hence our system will not allow any input that may violate this status. Once an entity is specified the primary key attributes are chosen. Modern SQL DBMSs will use the DDL statement ‘a’ of figure 5.5 to create a new primary key constraint, older systems do not have this capability in their syntax. ii) Foreign Key Constraints A foreign key establishes a relationship between two entities. Hence, when the enhancing constraint type is chosen as a foreign key, our system requests two entity names. The first is the referencing entity and the second the referenced entity. Once the entity names are identified the system automatically shows the referenced attributes. These attributes are those having the uniqueness property. When these attributes are chosen a new foreign key is established. This Page 80
constraint will only be valid if there is an inclusion dependency between the referencing and referenced attributes. Modern SQL DBMSs will use the DDL statement 'b' of figure 5.5 to create a new foreign key constraint in this situation. This statement can optionally contain a match type and referential triggered action (see section A.8) which are not shown here.

iii) Uniqueness Constraints

A uniqueness constraint may be defined on any combination of attributes. However such constraints should be meaningful (e.g. there is no point in defining a uniqueness constraint for a set of attributes when a subset of it already holds the uniqueness status), and should not violate any existing data. Modern SQL DBMSs will use the DDL statement 'c' of figure 5.5 to create a new uniqueness constraint.

(a) ALTER TABLE Entity_Name ADD CONSTRAINT Constraint_Name
        PRIMARY KEY (Primary_Key_Attributes)
(b) ALTER TABLE Entity_Name ADD CONSTRAINT Constraint_Name
        FOREIGN KEY (Foreign_Key_Attributes)
        REFERENCES Referenced_Entity_Name (Referenced_Attributes)
(c) ALTER TABLE Entity_Name ADD CONSTRAINT Constraint_Name
        UNIQUE (Uniqueness_Attributes)
(d) ALTER TABLE Entity_Name ADD CONSTRAINT Constraint_Name
        CHECK (Check_Constraint_Expression)
(e) ALTER TABLE Entity_Name ADD UNDER Inherited_Entities [ WITH (Renamed_Attributes) ]
(f) ALTER TABLE Entity_Name ADD CONSTRAINT Constraint_Name
        FOREIGN KEY (Foreign_Key_Attributes)
        [ CARDINALITY (Referencing_Cardinality_Value) ]
        REFERENCES Referenced_Entity_Name (Reference_Attributes)

Our optional extensions to the SQL-3 syntax are highlighted using bold letters here.
Figure 5.5 : Constraints expressed in extended SQL-3 syntax

iv) Check Constraints

A check constraint may be defined to represent a complex expression involving any combination of attributes and system functions. However such constraints should not be redundant (i.e. not a subset of an existing check constraint) and should not violate any existing data. Modern SQL DBMSs will use the DDL statement 'd' of figure 5.5 to create a new check constraint.

v) Generalisation / Specialisation Structures

An inheritance hierarchy may be defined without performing any structural changes if its existence can be detected by our process described in part 'd' of section 5.4.2. In this case we need to specify the entities being inherited (cf. statement 'e' of figure 5.5). If an inherited attribute's name differs from the target attribute name it is necessary to rename them. For example, attributes siteName, unitName and inCharge of Office are renamed to building, name and principal when it is inherited by College - see figures 6.2 and 6.3 (later). It is also possible to make some structural changes in order to introduce new generalisation / specialisation structures. In such situations new entities are created to represent the specialisation / generalisation. Appropriate data for these entities are copied to them during this process. For instance, in our university college example of figure 5.1, the entities College, Faculty and
  • 83. Chapter 5 Re-engineering relational legacy DBs Department can be restructured to represent a generalisation hierarchy, by introducing a generalised entity called Office and transforming the entities College, Faculty and Department to College-Office, Faculty-Office and Dept-Office, respectively (cf. figure 5.2). Once this transformation is done the entities Office, College-Office, Faculty-Office and Dept-Office will represent a generalisation hierarchy as shown in figure 5.2. Any change to existing structures and table names will affect the application programs which use them. To overcome this we introduce view tables in the legacy database to represent new structures. These tables are defined using the syntax of figure 5.6. For example, the generalised entity will be Office and the specialised entities will be College-Office, Faculty-Office and Dept-Office. The introduction of view tables means that legacy application code using the original tables will not be affected by the change. However, appropriate changes must be introduced in the target application code and database if we are going to introduce these features permanently. We introduced the concept of defining view tables in the legacy database to assist the gateway service in managing these structural changes. CREATE VIEW GeneralisedEntity (GeneralisedAttributes) AS SELECT Attributes FROM SpecialisedEntity [ [UNION SELECT Attributes FROM SpecialisedEntity] ..] CREATE VIEW SpecialisedEntity (SpecialisedAttributes) AS SELECT g1.Attributes [ [, g2.Attributes] ..] FROM GeneralisedEntity g1 [ [, GeneralisedEntity g2] ..] [ WHERE specialised-conditions ] Figure 5.6 : Creation of view table to represent a hierarchy vi) Cardinality Constraints Cardinality constraints specify the minimum / maximum number of instances associated with a relationship. In a 1:1 relationship type the number of instances associated with a relationship will be 0 or 1, and in a M:N relationship type it can take any value from 0 upwards. The ability to define more specific limits allows users not only to gain a better understanding about a relationship, but also to be able to verify its conformance by using its instances. We suggest creating such specifications using an extended syntax of the current SQL foreign key definition (cf. statement ‘f’ of figure 5.5) as this is the key which initially establishes this relationship. The minimum / maximum instance occurrences for a particular relationship of a referential value (i.e. cardinality values) can be specified using a keyword CARDINALITY as shown in figure 5.5. Here the Referencing_Cardinality_Value corresponds to the many side of a relationship. Hence the value of this indicates the minimum instances. When the referencing attribute is not null then the minimum cardinal value is 1, else it is 0. In our examples introduced in part ‘b’ of section 6.2.3, we have used ‘+’ to represent the many symbol (e.g. 0+ represents zero or many) and ‘-’ to represent up to (e.g. -1 represents 0 to 1). vii) Other Constraints In the example in figure 5.2, we have also shown an aggregation relationship between the entities University and Office. Here we have assumed that a reference to a set of instances can be defined. In such a situation, as with the other constraint types, an appropriate SQL statement should be used to describe the constraint and an appropriate augmented table such as those used in figure 5.7 must be used to record this information in the database itself. 
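As a minimal sketch, anticipating the extended syntax applied to the College database in figure 6.3 (later), such an aggregation could be declared by treating the office attribute of University as a reference to a set of Office instances; the REF SET keyword is our own extension rather than standard SQL, and the constraint name below is illustrative only:

ALTER TABLE University
      ALTER COLUMN office REF SET(Office) NOT NULL;
-- Alternatively, where REF SET is unavailable, the same link can be held
-- as an ordinary foreign key and recorded in the augmented tables of
-- figure 5.7 like any other referential constraint:
ALTER TABLE University
      ADD CONSTRAINT University_FK_Office
      FOREIGN KEY (office) REFERENCES Office;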
We discuss this case here to highlight that other constraint types can be introduced and incorporated into our
  • 84. Chapter 5 Re-engineering relational legacy DBs system using the same general approach. However our implementations have concentrated only on the constraints discussed above. The enhanced constraints, once they are absorbed into the system, will be stored internally in the same way as any existing constraints. Hence the reconstruction process to produce an enhanced conceptual model can utilise this information directly as it is fully automated. To hold the enhancements in the database itself we need to issue appropriate query statements. The enhancements can be effected using the SQL statements shown in figure 5.5 if the database is SQL based and such changes are implicitly supported by it. In section 5.7 we describe how this is done when the database supports such specifications (e.g. Oracle version 7) and when it does not (e.g. INGRES version 6). When the DBMS does not support SQL, the query statement to be issued is translated using QMTS [HOW87] to a form appropriate to the target DBMS. As there are variants of SQL16 we send all queries via QMTS so that the necessary query statements will automatically get translated to the target language before entering the target DBMS environment. 5.7 The Knowledge Augmentation Approach In this section we describe how the enhanced constraints are retained in a database. Our aim has been to make these enhancements compatible with the newer versions of commercial DBMSs, so that the migration process is facilitated as fully as possible. Many types of constraint are defined in a conceptual model during database design. These include relationship, generalisation, existence condition, identity and dependency constraints. In most current DBMSs these constraints are not represented as part of the database meta-data. Therefore, to represent and enforce such constraints in these systems, one needs to adopt a procedural approach which makes use of some embedded programming language code to perform the task. Our system uses a declarative approach (cf. section 3.6) for constraint manipulation, as it is easier to process constraints in this form than when they are represented in the traditional form of procedural code. 16 The date functions of most SQL databases (e.g. INGRES and Oracle) are different. Page 83
  • 85. Chapter 5 Re-engineering relational legacy DBs CREATE TABLE Table_Constraints ( CREATE TABLE Check_Constraints ( Constraint_Id char(32) NOT NULL, Constraint_Id char(32) NOT NULL, Constraint_Name char(32) NOT NULL, Constraint_Name char(32) NOT NULL, Table_Id char(32) NOT NULL, Check_Clause char(240) NOT NULL ); Table_Name char(32) NOT NULL, Constraint_Type char(32) NOT NULL, Is_Deferrable char(3) NOT NULL, CREATE TABLE Sub_tables ( Initially_Deferred char(3) NOT NULL ); Table_Id char(32) NOT NULL, Sub_Table_Name char(32) NOT NULL, CREATE TABLE Key_Column_Usage ( Super_Table_Name char(32) NOT NULL, Constraint_Id char(32) NOT NULL, Super_Table_Column integer(4) NOT NULL ); Constraint_Name char(32) NOT NULL, Table_Id char(32) NOT NULL, Table_Name char(32) NOT NULL, CREATE TABLE Altered_Sub_Table_Columns ( Column_Name char(32) NOT NULL, Table_Id char(32) NOT NULL, Ordinal_Position integer(2) ); Sub_Table_Name char(32) NOT NULL, Sub_Table_Column char(32) NOT NULL, CREATE TABLE Referential_Constraints ( Super_Table_Name char(32) NOT NULL, Constraint_Id char(32) NOT NULL, Super_Table_Column char(32) NOT NULL ); Constraint_Name char(32) NOT NULL, Unique_Constraint_Id char(32) NOT NULL, Unique_Constraint_Name char(32) NOT NULL, CREATE TABLE Cardinality_Constraints ( Match_Option char(32) NOT NULL, Constraint_Id char(32) NOT NULL, Update_Rule char(32) NOT NULL, Constraint_Name char(32) NOT NULL, Delete_Rule char(32) NOT NULL ); Referencing_Cardinal char(32) ); Figure 5.7: Knowledge-based table descriptions The constraint enhancement module of our system (CCVES) accepts new constraints (cf. figure 5.5) irrespective of whether they are supported by the selected DBMS. These new constraints are the enhanced knowledge which is stored in the current database, using a set of user defined knowledge-based tables, each of which represents a particular type of constraint. These tables provide general structures for all constraint types of interest. In figure 5.7 we introduce our table structures which are used to hold constraint-based information in a database. We have followed the current SQL-3 approach to representing constraint types supported by the standards. In areas which the current standards have yet to address (e.g. representation of cardinality constraints) we have proposed our own table structures. Thus, all general constraints associated with a table (i.e. an entity) are recorded in Table_Constraints. The constraint description for each type is recorded elsewhere in other tables, namely, Key_Column_Usage for attribute identifications, Referential_Constraints for foreign key definitions, Check_Constraints to hold constraint expressions, Sub_Tables to hold generalisation / specialisation structures (i.e. inherited tables), Altered_Sub_Table_Columns to hold any attributes renamed during inheritance, and Cardinality_Constraints to hold cardinal values associated with relationship structures. The use of these table structures to represent constraint-based information in a database depends on the type of DBMS in use and the features it supports. The features supported by a DBMS may differ from the standards to which it claims to conform, as database vendors do not always follow the standards fully when they develop their systems. However, DBMSs supporting the representation of constraints need not have identical table structures to our approach as they may have used an alternative way of dealing with constraints. 
In such situations it is not necessary to insist on the use of our table structures for constraint representation, as the database is capable of managing the constraints itself if we follow its approach. Therefore we need to identify which SQL standard each DBMS follows, and hence in which DBMSs we should introduce our own tables to hold enhanced constraints. In figure 5.8 we identify the tables required for the three SQL standards and for three selected DBMSs. The selected DBMSs were used as our test DBMSs, as we shall see in section 6.1. CCVES determines the knowledge-based tables required for the DBMS being used and
  • 86. Chapter 5 Re-engineering relational legacy DBs creates and maintains them automatically. The creation of these tables and the input of data to them are ideally done at the database application implementation stage, by extracting data from the conceptual model used originally to design a database. However, as current tools do not offer this type of facility, one may have to externally define and manage these tables in order to hold this knowledge in a database. Our system has been primed with the knowledge of the tables required for each DBMS it supports, and so it automatically creates these tables and stores information in them if the database is enhanced with new constraints. Here, Table_Constraints, Referential_Constraints, Key_Column_Usage, Check_Constraints and Sub_Tables are those used by SQL-3 to represent constraint specifications. SQL-2 has the same tables, except for Sub_Tables, Hence, as shown in figure 5.8, these tables are not required as augmented tables when a DBMS conforms to SQL-3 or SQL-2 standards, respectively. Adopting the same names and structures as used in the SQL standards makes our approach compatible with most database products. We have introduced two more tables (namely: Cardinality_Constraints and Altered_Sub_Table_Columns) to enable us to represent cardinality constraints and to record any synonyms involved in generalisation / specialisation structures. The representation of this type of information is not yet addressed by the SQL standards. CCVES utilises the above mentioned user defined knowledge-based tables not only to automatically reproduce a conceptual model, but also to enhance existing databases by detecting and cleaning inconsistent data. To determine the presence of these tables, CCVES looks for user defined tables such as Table_Constraints, Referential_Constraints, etc., which can appear in known existing legacy databases only if the DBMS maintains our proposed knowledge-base. For example, in INGRES version 6 we know that such tables are not maintained as part of its system provision, hence the presence of tables with these names in this context confirms the existence of our knowledge-base. Use of our knowledge-based tables is database system specific as they are used only to represent knowledge that is not supported by that DBMS's meta-data facility. Hence, the components of two distinct knowledge-bases, e.g. for INGRES version 6 and Oracle version 7, are different from each other (see figure 5.8). Table Name S1 S2 S3 I O P - - D V6 V7 V4 Table_Constraints Y N N Y N Y Referential_Constraints Y N N Y N Y Key_Column_Usage Y N N Y N Y Check_Constraints Y N N N N N Sub_Tables Y Y N Y Y N Altered_Sub_Table_Columns Y Y Y Y Y Y Cardinality_Constraints Y Y Y Y Y Y S1 - SQL/86, S2 - SQL-2, S3 - SQL-3 Y - Yes, required I - INGRES, O - Oracle, P - POSTGRES N - No, not required D - Draft, V - Version Figure 5.8 : Requirement of augmented tables for SQL standards and some current DBMSs The different types of constraints are identified using the attribute Constraint_Type of Table_Constraints, which must have one of the values PRIMARY KEY, UNIQUE, FOREIGN KEY or CHECK. A set of example instances are give in figure 5.9 to show the types of information held in our knowledge-based tables. The constraint type NOT NULL may also appear in Table_Constraints when dealing with a DBMS that does not support NULL value specifications. We have not included it in our sample data as it is supported by our test DBMSs and all the SQL standards. The constraint Page 85
  • 87. Chapter 5 Re-engineering relational legacy DBs and table identifications in our knowledge-based tables (i.e. Constraint_Id and Table_Id of figure 5.9), may be of composite type as they need to identify not only the name, but also the schema, catalog and location of the database. Foreign key constraints are associated with their referenced table through a unique constraint. Hence, the ‘Const_Name_Key’ instance of attribute Unique_Constraint_Name of table Referential_Constraints should also appear in Key_Column_Usage as a unique constraint. This means that each of the knowledge-based tables has its own set of properties to ensure the accuracy and consistency of the information retained in these tables. For instance Constraint_Type of Table_Constraints must be one of {‘PRIMARY KEY’, ‘UNIQUE’, ‘FOREIGN KEY’, ‘CHECK’} if these are the only type of constraints that are to be represented. Also, within a particular schema a constraint name is unique. Hence Constraint_Name of Table_Constraints must be unique for a particular type of Constraint_Id. In figure 5.10 we present the set of constraints associated with our knowledge-based tables. Besides these there are a few others which are associated with other system tables, such as Tables and Columns which are used to represent all entity and attribute names respectively. Such constraints are used in systems supporting the above constraint types. This allows us to maintain consistency and accuracy within the constraint definitions. Table_Constraints { Constraint_Id, Constraint_Name, Table_Id, Table_Name, Constraint_Type, Is_Deferrable, Initially_Deferred } ('dbId', 'Const_Name_PK', 'TableId', 'Entity_Name_PK', 'PRIMARY KEY', 'NO', 'NO') ('dbId', 'Const_Name_UNI', 'TableId', 'Entity_Name_UNI', 'UNIQUE', 'NO', 'NO') ('dbId', 'Const_Name_FK', 'TableId', 'Entity_Name_FK', 'FOREIGN KEY', 'NO', 'NO') ('dbId', 'Const_Name_CHK', 'TableId', 'Entity_Name_CHK', 'CHECK', 'NO', 'NO') Key_Column_Usage { Constraint_Id, Constraint_Name, Table_Id, Table_Name, Column_Name, Ordinal_Position } ('dbId', 'Const_Name_PK', 'TableId','Entity_Name_PK', 'Attribute_Name_PK', i) ('dbId', 'Const_Name_UNI', 'TableId','Entity_Name_UNI', 'Attribute_Name_UNI', i) ('dbId', 'Const_Name_FK', 'TableId','Entity_Name_FK', 'Attribute_Name_FK', i) Referential_Constraints { Constraint_Id, Constraint_Name,Unique_Constraint_Id, Unique_Constraint_Name, Match_Option, Update_Rule, Delete_Rule } ('dbId', 'Entity_Name_FK', 'TableId', 'Const_Name_Key', 'NONE', 'NO ACTION', 'NO ACTION') Check_Constraints { Constraint_Id, Constraint_Name, Check_Clause } ('dbId', 'Const_Name_CHK', 'Const_Expression') Sub_Tables { Table_Id, Sub_Table_Name, Super_Table_Name } ('dbId', 'Entity_Name_Sub', 'Entity_Name_Super') Altered_Sub_Table_Columns { Table_Id, Sub_Table_Name, Sub_Table_Column, Super_Table_Name, Super_Table_Column } ('dbId', 'Entity_Name_Sub', 'newAttribute_Name', 'Entity_Name_Super', 'oldAttribute_Name') Cardinality_Constraints { Constraint_Id, Constraint_Name, Referencing_Cardinal } ('dbId', 'Entity_Name_FK', 'Const_Value_Ref') Figure 5.9 : Augmented tables with different instance occurrences Some attributes of these knowledge-based tables are used to indicate when to execute a constraint and what action is to be taken. The actions are application dependent or have no effect on the approach proposed here, and hence we have used a default value as proposed in the standards. 
However, it is possible to specify trigger actions such as ON DELETE CASCADE, so that when a row of the referenced table is deleted the corresponding rows in the referencing table are automatically deleted. These features were initially introduced in the form of rule-based constraints to allow triggers and alerters to be specified in databases and make them active [ESW76, STO88]. Such actions may also have been implemented in legacy ISs, as in the case of general constraints. The
  • 88. Chapter 5 Re-engineering relational legacy DBs constraints used in our constraint enforcement process (cf. section 5.8) are alerters as they draw attention to constraints that do not conform to the existing legacy data. Table_Constraints PRIMARY KEY (Constraint_Id, Constraint_Name) CHECK (Constraint_Type IN ('UNIQUE','PRIMARY KEY','FOREIGN KEY','CHECK') ) CHECK ( (Is_Deferrable, Initially_Deferred) IN ( values ('NO','NO'), ('YES','NO'), ('YES','YES') ) ) CHECK ( UNIQUE ( SELECT Table_Id, Table_Name FROM Table_Constraints WHERE Constraint_Type = 'PRIMARY KEY' ) ) Key_Column_Usage PRIMARY KEY (Constraint_Id, Constraint_Name, Column_Name) UNIQUE (Constraint_Id, Constraint_Name, Ordinal_Position) CHECK ( (Constraint_Id, Constraint_Name) IN (SELECT Constraint_Id, Constraint_Name FROM Table_Constraints WHERE Constraint_Type IN ('UNIQUE', 'PRIMARY KEY','FOREIGN KEY' ) ) ) Referential_Constraints PRIMARY KEY (Constraint_Id, Constraint_Name) CHECK ( Match_Option IN ('NONE','PARTIAL','FULL') ) CHECK ( Update_Rule IN ('CASCADE','SET NULL','SET DEFAULT','RESTRICT','NO ACTION') ) CHECK ( (Constraint_Id, Constraint_Name) IN (SELECT Constraint_Id, Constraint_Name FROM Table_Constraints WHERE Constraint_Type = 'FOREIGN KEY' ) ) CHECK ( (Unique_Constraint_Id, Unique_Constraint_Name) IN ( SELECT Constraint_Id, Constraint_Name FROM Table_Constraints WHERE Constraint_Type IN ('UNIQUE','PRIMARY KEY') ) ) Check_Constraints PRIMARY KEY (Constraint_Id, Constraint_Name) Sub_Tables PRIMARY KEY (Table_Id, Sub_Table_Name, Super_Table_Name) Altered_Sub_Table_Columns PRIMARY KEY (Table_Id, Sub_Table_Name, Super_Table_Name, Column_Name) FOREIGN KEY (Table_Id, Sub_Table_Name, Super_Table_Name) REFERENCES Sub_Tables Cardinality_Constraints PRIMARY KEY (Constraint_Id, Constraint_Name) FOREIGN KEY (Constraint_Id, Constraint_Name) REFERENCES Referential_Constraints Figure 5.10: Consistency constraints of our knowledge-based tables Many other types of constraint are possible in theory [GRE93]. We shall not deal with all of them as our work is concerned only with constraints applicable at the conceptual modelling stage. These applicable constraints take the form of logical expressions and are stored in the database using the knowledge-based table Check_Constraints. They are identified by the keyword 'CHECK' in Table_Constraints in figure 5.9. Similarly, other constraint types (e.g. rules and procedures) are represented by means of distinct keywords and tables. Figure 5.9 also includes generalisation and cardinality constraints. A generalisation hierarchy is defined using the SQL-3 syntax (i.e. UNDER, see figure 5.5), while a cardinality constraint is defined using an extended foreign key definition (see figure 5.5). These specifications are also held in the database, using the tables Sub_Tables, Altered_Sub_Table_Columns and Cardinality_Constraints, respectively (see figure 5.9). 5.8 The Constraint Enforcement Process This is an optional process provided by our system, as the third stage of its application to a database. The objective is to give users the facility to verify / ensure that the data conforms to all the enhanced constraints. This process is optional so that the user can decide whether these constraints should be enforced to improve the quality of the legacy database prior to its migration or whether it is best left as it stands. Page 87
  • 89. Chapter 5 Re-engineering relational legacy DBs During the constraint enforcement process any violations of the enhanced constraints are identified. In some cases this may result in removing the violated constraint as it may be an incorrect constraint specification. However, the DBA may decide to keep such constraints as the constraint violation may be as a result of incorrect data instances or due to a change in a business rule that has occurred during the lifetime of the database. Such a rule may be redefined with a temporal component to reflect this change. Such data are manageable using versions of data entities as in object-oriented DBMSs [KIM90]. We use the enhanced constraint definitions to identify constraints that do not conform to the existing legacy data. Here each constraint is used to produce a query statement. This query statement depends on the type of constraint, as shown in figure 5.11. CCVES uses constraint definitions to produce data manipulation language statements suitable for the host DBMS. Once such statements are produced, CCVES will execute them against the current database to identify any violated data for each of these constraints. When such violated data are found for an enhanced constraint it is up to the user to take appropriate action. Enforcement of such constraints can prevent data rejection by the target DBMS, possible losses of data and/or delays in the migration process, as the migrating data’s quality will have been ensured by prior enforcement of the constraints. However as the enforcement process is optional, the user need not take immediate action. He can take his own time to determine the exact reasons for each violation and take action at his convenience prior to migration. 5.9 The Migration Process The migration process is the fourth and final stage in the application of our approach. This is incrementally performed by initially creating the meta-data in the target DBMS, using the schema meta-translation technique of Ramfos [RAM91], and then copying the legacy data to the target system, using the import/export tools of source and target DBMSs. During this activity, legacy applications must continue to function until they too are migrated. To support this process we need to use an interface (i.e. a forward gateway) that can capture and process all database queries of the legacy application and then re-direct those related to the target system via CCVES. The functionality that is required here is a distributed query processing facility which is supported by current distributed DBMSs. However, in our case the source and target databases are not necessarily of the same type as in the case of distributed DBMSs, so we need to perform a query translation in preparation for the target environment. Such a facility can be provided using the query meta- translation technique of Howells [HOW87]. This approach will facilitate transparent migration for legacy databases as it will allow the legacy IS users to continue working while the legacy data is being migrated incrementally. Page 88
  • 90. Chapter 5 Re-engineering relational legacy DBs Constraint Queries to detect Constraint Violation Instances Primary Key SELECT Attribute_Names, COUNT(*) FROM Entity_Name GROUP BY Attribute_Names HAVING COUNT(*) > 1 UNION SELECT Attribute_Names, 1 FROM Entity_Name WHERE Attribute_Names IS NULL Unique SELECT Attribute_Names, COUNT(*) FROM Entity_Name GROUP BY Attribute_Names HAVING COUNT(*) > 1 Referential SELECT * FROM Referencing_Entity_Name WHERE NOT (Referencing_Attributes IS NULL OR Referencing_Attributes IN (SELECT Referenced_Attributes FROM Referenced_Entity_Name)) Check SELECT * FROM Entity_Name WHERE NOT (Check_Constraint_Expression) Cardinality SELECT Attribute_Names, COUNT(*) FROM Entity_Name GROUP BY Attribute_Names HAVING COUNT(*) < Min_Cardinality_Value UNION SELECT Attribute_Names, COUNT(*) FROM Entity_Name GROUP BY Attribute_Names HAVING COUNT(*) > Max_Cardinality_Value Figure 5.11: Detection of violated constraints in SQL Page 89
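The referential template in figure 5.11 is written as though a foreign key involved a single attribute. For a composite foreign key the same check can be expressed with a correlated subquery. The following is a sketch only; it assumes the Teach and Option tables introduced in chapter 6, where Teach(course, subject) references Option:

-- Rows of Teach whose (course, subject) pair has no matching Option row;
-- NULL components are excluded, mirroring the single-attribute template.
SELECT *
FROM   Teach t
WHERE  NOT (t.course IS NULL OR t.subject IS NULL)
AND    NOT EXISTS ( SELECT 1
                    FROM   Option o
                    WHERE  o.course  = t.course
                    AND    o.subject = t.subject );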
  • 91. CHAPTER 6 Test Databases and their Access Process In this chapter we introduce our example databases, by describing their physical and logical components. The selection criteria for these test databases, and the associated constraints in accessing and using them are discussed here. We investigate the tools available for our test DBMSs. We then apply our re-engineering process to our test databases to show its applicability. Lastly, we refer to the organisation of system information in a relational DBMS and describe how we identify and access information about entities, attributes, relationships and constraints in our test DBMSs. 6.1 Introduction to our Test Databases In the following sub-sections we introduce our test databases. We first identify the main requirements for these databases. This is followed by a description of associated constraints and their role in database access and use. Finally, we identify how we established our test databases and the DBMSs we have used for this purpose. 6.1.1 Main Requirements The design of our test databases was based on two important requirements. Firstly, to establish a suitable legacy test database environment to enable us to demonstrate the practicability of our re-engineering and migration techniques. Secondly, to establish a heterogeneous database environment for the test databases to enable us to test the generality of our approach. As described in section 2.1.2.1, the problems of legacy databases apply mostly to long existing database systems. Most of these systems use traditional file-based methods or an old version of the hierarchical, network or relational database models for their database management. Due to complexity and availability of resources, which are discussed in section 6.1.2, we decided to focus on a particular type of database model to apply our legacy database enhancement and migration techniques. Test databases were developed for the chosen database model, while establishing the required levels of heterogeneity in the context of that model. 6.1.2 Availability and Choice of DBMSs At University of Wales College of Cardiff, where our research was conducted, there were only a few application databases. These included systems used to process student and staff information for personnel and payroll applications. This information was centrally processed and managed using third party software. Due to the licence conditions on this software, the university did not have the authority to modify and improve it on their own. Also, most of this software was developed with 3GL technology using files to manipulate information. There were recent enhancements which had been developed using 4GL tools. However, no proper DBMS had been used to build any of these applications, although future plans included using Oracle for new
  • 92. database applications. These databases were therefore not well suited to our work. Other than the personnel and payroll applications there were a few departmental and specific project based applications. Some of these were based on DBMSs (such as Oracle), although their application details were not readily available. Information gathered from these sources revealed that not many database applications existed in our university environment and gaining permission to access them for research purposes was practically impossible. Also, until we obtained access and investigated each application we would not be able to fully justify its usefulness as a test database, as it might not fulfil all our requirements. Therefore, it was decided to initially design and develop our own database applications to suit our requirements and then if possible to test our system on any other available real world databases. Access to DBMSs was restricted to products running on Personal Computers (PCs) and some Unix systems. Most of these products were based on the relational data model and some on the object-oriented data model. The older database models - hierarchical and network - were no longer being used or available as DBMSs. Also, the available DBMSs were in their latest versions, making the task of building a proper legacy database environment more difficult. The relational model has been in use for database applications over the last 20 years and currently is the most widely used data model. During this time many database products and versions have been used to manage these database applications. As a result, many of them are now legacy systems and their users need assistance to enhance and migrate them to modern environments. Thus the choice of the relational data model for our tests is reasonable, although one may argue that similar requirements exist for database applications which have been in use prior to this data model gaining its pre-eminent position. Due to the superior power of workstations as compared to PC’s it was decided to work on these Unix platforms and to build test databases using the available relational DBMSs, as our main aim was simply to demonstrate the applicability of our approach. Two popular commercial relational DBMSs, namely: INGRES and Oracle, were accessible via the local campus network. We selected these two products to implement our test databases as they are leading, commercially-established products which have been in existence since the early days of relational databases. The differences between these two database products made them ideal for representing heterogeneity within our test environment. Both products supported the standard database query language, SQL. However, only one of them (Oracle) conforms to the current SQL-2 standard. Oracle is also a leading relational database product, along with SYBASE and INFORMIX, on Unix platforms [ROS94]. As described in section 3.8, SQL standards have been regularly reviewed and hence it is also important to choose a database environment that will support at least some of the modern concepts, such as object-oriented features. In recent database products these features have been introduced either via extended relational or via object-oriented database technology. Obviously the choice of an extended relational data model is the most suitable for our purposes as it incorporates natural extensions to the relational data model. 
Hence we selected POSTGRES, which is a research DBMS providing modern object-oriented features in an extended relational model, as our third test DBMS. 6.1.3 Choice of database applications and levels of heterogeneity
  • 93. Designing a single large database application as our test database would result in one very complex database application. To overcome the need to devise and manage a single complex application to demonstrate all of our tasks, we decided to build a series of simple applications and later to provide a single integrated application derived from these simple database applications. Our own university environment was chosen to construct these test database systems as we were able to perform a detailed system study in this context and collect sufficient information to create appropriate test applications. Typical text book examples [MCF91, ROB93, ELM94, DAT95] were also used to verify the contents chosen for our test databases. Three databases representing college, faculty and department information were chosen for our simple test databases. To ensure simplicity, no more than ten entities were included for each of these databases. However, each was carefully designed to enable us to thoroughly test our ideas, as well as to represent three levels of heterogeneity within our test systems. These systems were implemented on different computer systems using different DBMSs so that they represented heterogeneity at the physical level. INGRES, POSTGRES and Oracle running on DEC station, SUN Sparc and DEC Alpha, respectively, were chosen. The differences in characteristics among these three DBMSs introduced heterogeneity at the logical level. Here, Oracle conforms to the current SQL/92 standard and supports most modern relational data model requirements. INGRES and POSTGRES, although they are based on the same data model, have some basic differences in handling certain database functions such as integrity constraints. These two DBMSs use a rule subsystem to handle constraints, which is a different approach from that proposed by the SQL standards. POSTGRES, which is regarded as an extended relational DBMS having many object-oriented features, is also regarded as an object-oriented DBMS. These inherent differences ensure the initial heterogeneity of our environment at the logical level. Our test databases were designed to highlight these logical differences, as we shall see. 6.2 Migration Support Tools for our Test DBMSs Prior to creating and applying our approach it was useful to investigate the availability of tools for our test DBMSs to assist the migration of databases. As indicated in the following sub- sections, only a few tools are provided to assist this process and they have limited functionality that is inadequate to assist all the stages of enhancing and migrating a legacy database service. 6.2.1 INGRES INGRES permits manipulation of data in non-INGRES databases [RTI92] and the development of applications that are portable across all INGRES servers. This type of data manipulation is done through an INGRES gateway. INGRES/Open SQL, a subset of INGRES SQL, is used for this purpose. The type of services provided by this gateway include [RTI92]: • Translation between Open SQL and non-INGRES SQL DBMS query interfaces such as Rdb/VMS (for DEC) or DB2 (for IBM). • Conversion between INGRES data types and non-INGRES data types. • Translation of non-INGRES DBMS error messages to INGRES generic error types. Page 92
  • 94. This functionality is useful in creating a target database service. However, as the target databases supported by INGRES/Open SQL do not include Oracle and POSTGRES, this tool was not helpful to us. The PRODBI interface for INGRES [LUC93] allows access to INGRES databases from Prolog code. This tool is useful in our work as our main processing is done using Prolog. Hence we have used this tool to implement our constraint enforcement process. Meta-data access from INGRES databases could have been done using PRODBI. However, due to its unavailability at the start of our project we implemented this using C programs embedded with SQL code. INGRES does not support any CASE tools that assist in reverse-engineering or analysing INGRES applications. Its only support was in the form of a 4GL environment [RTI90b] which is useful for INGRES application development, but not for any INGRES based legacy ISs and their reverse engineering. 6.2.2 Oracle The latest version of Oracle (i.e. version 7) is a RDBMS that conforms to the SQL-2 standards. Hence, this DBMS supports most modern database functions, including the specification, representation and enforcement of integrity constraints. Oracle has provided migration tools to convert databases from either of its two most recent versions (i.e. versions 5 or 6) to version 7. Oracle, a leading database product on the Unix platform [ROS94], has its own tool set to assist in developing Oracle based application systems [KRO93]. This includes screen-based application development tools SQL*Forms and SQL*Menu, and the report-writing product SQL*Report. These tools assist in implementing Oracle applications but do not provide any form of support to analyse the system being developed. To overcome this, a series of CASE products are provided by Oracle (i.e. CASE*Bridge, CASE*Designer, CASE*Dictionary, CASE*Generator, CASE*Method and CASE*Project) [BAR90]. The objective of these tools is to assist users by supporting a structured approach to the design, analysis and implementation of an Oracle application. CASE*Designer provides different views of the application using Entity Relationship Diagrams, Function Hierarchies, Dataflow Diagrams and matrix handlers to show the inter- relationship between different objects held in an Oracle dictionary. Oracle*Dictionary maintains complete definitions of the requirements and the detailed design of the application. Oracle*Generator uses these definitions to generate the code for the target environment and CASE*Bridge is used to extract information from other Oracle CASE tools or vice versa. However, such functions can be performed only on applications developed using these tools and not on an Oracle legacy database developed in any other way, which means they are no help with the current legacy problem. Hence, Oracle CASE tools are useful when developing new applications but cannot be used to re-engineer a pre-existing Oracle application, unless that original application was developed in an Oracle CASE environment. This limitation is shared by most CASE tools [COMP90, SHA93]. Currently, Oracle and other vendors are working on overcoming this limitation, and Page 93
  • 95. Oracle’s open systems architecture for heterogeneous data access [HOL93] is a step towards this. ANSI standard embedded SQL [ANSI89b] is used for application portability along with a set of function calls. In Oracle’s open systems architecture, standard call level interfaces are used to dynamically link and run applications on different vendor engines without having to recompile the application programs. This functionality is a subset of Microsoft’s ODBC [RIC94, GEI95] and the aim is to provide a transparent gateway to access non-Oracle SQL database products (e.g. IMS, DB2, SQL/DS and VSAM for IBM machines, or RMS and Rdb for DEC) via Oracle’s SQL*Connect. Transparent gateway products are machine and DBMS dependent in that they need to be recompiled or modified to run on different computers and support access to a variety of DBMSs. In the past, developers had to create special code for each type of database their users wanted to access. This limitation can now be overcome using a tool like ODBC to permit access to multiple heterogeneous databases. Most database vendors have development strategies which include plans to interoperate with open systems vendors as well as proprietary database vendors. This facility is being implemented using the 17SQL Access Group’s RDA (Remote Database Access) standard. As a result, products such as Microsoft’s Open Database Connectivity (ODBC), INFORMIX-Gateway [PUR93] and Oracle Transparent Gateway [HOL93] support some form of connectivity between their own and other products. For our work with Oracle, we developed our own C programs embedded with the query language SQL to access and update our prototype Oracle database. There is a version of PRODBI for Oracle that allows access to Oracle databases from Prolog code which was used in this project. 6.2.3 POSTGRES POSTGRES was developed at the University of California at Berkeley as a research oriented relational database extended with object-oriented features. Since 1994 a commercial version called ILLUSTRA [JAE95] has been available. However, POSTGRES has yet to address the inter-operability and other issues associated with our migration approach. 6.3 The Design of our Test University Databases 6.3.1 Background In our university system, we assume that departments and faculties have common user requirements and ideally could share a common database. Based on this assumption we have developed our test database schema to contain shared information. Hence, our three simple test databases, known as: College, Faculty and Department, can be easily integrated. A complete integration of these three databases will result in the generation of a global University database schema. However, in practice, schemas used by different departments and faculties may differ, 17 SQL Access Group (SAG) is a non-profit corporation open to vendors and users that develops technical specifications to enable multiple SQL-based RDBMS’s and application tools to interoperate. The specifications defined by the SAG consist of a combination of current and evolving standards that include ANSI SQL, ISO RDA and X/Open SQL. Page 94
  • 96. making the task of integration more difficult and bringing up more issues of heterogeneity. As our work is concerned with legacy database issues in a heterogeneous environment and not with integrating or resolving conflicts that arise in these environments, the differences that exist within this type of environment were not considered. Hence, we shall be looking at each of these databases independently. The main advantage of being able to easily integrate our test databases was the ability, thereby, to readily generate a complex database schema which could also be used to test our ideas. Each test database was designed to represent a specific kind of information, for example the Faculty and Department databases represent all kinds of structural relationships (e.g. 1:1, 1:M, and M:N; strong and weak relationships and entity types). The College database represents specialisation / generalisation structures, while the University database acts as a global system consisting of all the sub-database systems. This allows all sub-database systems, i.e. College, Faculty and Department, to act as a distributed system - the University database system. This is illustrated in figure 6.1 and is further described in section 6.3.2. We also need to be able to specify and represent all the constraint types discussed in section 3.5, as our re-engineering techniques are based on constraints. These were chosen to reflect actual database systems as closely as possible. We introduce these constraints in section 6.4 after identifying the entities of each of our test databases. College Database FPS Database A Faculty Database COMMA Database MATHS Database Departmental Databases Figure 6.1: The UWCC Database 6.3.2 The UWCC Database We shall use the term UWCC database to refer to our example university database, as the data of our system is based on that used at University of Wales College of Cardiff (UWCC). The UWCC database consists of many distributed database sites each used to perform the functions either of a particular department or school, or of a faculty, or of the college. The functions of the college are performed using the database located at the main college, which we shall refer to as the College database. The College consists of five faculties, each having its own local database located at the administrative section of the faculty. Our test database has been populated for one faculty, namely: The Faculty of Physical Science (FPS), and we shall refer to Page 95
  • 97. this database as the Faculty database. The College has 28 departments or schools, with five of them belonging to FPS [UWC94a, UWC94b]. Our test databases were populated using data from two departments of FPS, namely: The Department of Computing Mathematics (COMMA), which is now called the Department of Computer Science, and The Department of Mathematics (MATHS). These are referred to as Department databases. The component databases of our UWCC database form a hierarchy as shown in figure 6.1. This will let us demonstrate how the global University database formed by integrating these components incorporates all the functions present in the individual databases. In the next section we identify our test databases by summarising their entities and specific features. Entity Database Name (Meaning) College Faculty Department University University (university data) x - - x Employee (university employees) x x x x Student (university students) x - x x EmpStudent (employees as students) x - - x College (college data) x - - x Faculty (faculty data) x x - x Department (department data) x x x x Committee (faculty committees) - x - x ComMember (committee members) - x - x Teach (subjects taught by staff) - - x x Course (offered by the department) - - x x Subject (subject details) - - x x Option (subjects for each course) - - x x Take (subjects taken by each student) - - x x Table 6.1: Entities used in our test databases 6.3.3 The Test Database schemas Fourteen entities shown in table 6.1 were represented in our test database schemas. As we are not concerned with heterogeneity issues associated with schema integration, we have simplified our local schemas by using the same attribute definitions in schemas having the same entity name. The attribute definitions of all our entities are given in figure 6.2. Each test database schema is defined using the data definition language (DDL) of the chosen DBMS, and is governed by a set of rules to establish integrity within the database. In the context of a legacy system these rules may not appear as part of the database schema. In this situation our approach is to supply them externally via our constraint enhancement process. Therefore we present the set of constraints defined on our test databases separately, so that the initial state of these databases conforms to the database structure of a typical legacy system. 6.3.4 Features of our Test Database schemas Among the specific features represented in our test databases are relationship types which form weak and link entities, cardinality constraints which highlight the behaviour of entities, and inheritance and aggregation which form specialised relationships among entities. These features (if not present) are introduced to our test database schemas by enhancing them with new constraints. Page 96
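To make this legacy-style starting point concrete, the following sketch contrasts a bare declaration of the Course entity (cf. figure 6.2, later) with the constraint specifications supplied afterwards through the enhancement process of section 5.6; the constraint names are illustrative only:

-- Legacy-style declaration: column definitions only, no declarative constraints.
CREATE TABLE Course (
      CourseNo    char(5)    NOT NULL,
      Name        char(35)   NOT NULL,
      Coordinator char(9),
      Offeredby   char(5)    NOT NULL,
      Type        char(1)    NOT NULL,
      Length      char(10),
      Options     integer(2) );

-- Constraints supplied externally, in the style of figure 5.5
-- (cf. tables 6.2 to 6.5, later).
ALTER TABLE Course ADD CONSTRAINT Course_PK PRIMARY KEY (CourseNo);
ALTER TABLE Course ADD CONSTRAINT Course_UNI UNIQUE (Name, Offeredby);
ALTER TABLE Course ADD CONSTRAINT Course_FK_Dept
      FOREIGN KEY (Offeredby) REFERENCES Department (DeptCode);
ALTER TABLE Course ADD CONSTRAINT Course_Type_CHK
      CHECK (Type IN ('U','P','E','O'));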
  • 98. a) Relationship types Our reverse-engineering process uses the knowledge of constraint definitions to construct a conceptual model for a legacy database system. The foreign key definitions of table 6.4 along with their associated primary (cf. table 6.2) and uniqueness constraints (cf. table 6.3) are used to determine the relationship structures of a conceptual model. In this section we look at our foreign key constraint definitions to identify the types of relationship formed in our test database schemas. The check constraints of table 6.5 are used purely to restrict the domain values of our test databases. The foreign keys of table 6.4 are processed to find relationships according to our approach described in section 5.4.1. Here we identify keys defined on primary key attributes to determine M:N and 1:N weak relationships. The remaining keys will form 1:N or 1:1 relationships depending on the uniqueness property of the attributes of these keys. Table 6.6 shows all the relationships found in our test databases. We have also identified the criteria used to determine each relationship type according to section 5.4.1. Page 97
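As an illustration of these criteria (a sketch only, using constraints listed in tables 6.2 to 6.4 below), a foreign key whose attributes are contained in the primary key marks a link entity and hence an M:N relationship, whereas a foreign key on an attribute that is itself unique marks a 1:1 relationship:

-- ComMember: both foreign keys lie inside the primary key
-- (comName, memName, faculty), so ComMember is a link entity representing
-- an M:N relationship between Committee and Employee (criterion 'a').
ALTER TABLE ComMember ADD PRIMARY KEY (comName, memName, faculty);
ALTER TABLE ComMember ADD FOREIGN KEY (comName, faculty) REFERENCES Committee;
ALTER TABLE ComMember ADD FOREIGN KEY (memName) REFERENCES Employee;

-- Faculty.dean: the referencing attribute is unique, so an Employee can be
-- dean of at most one Faculty, giving a 1:1 relationship (criterion 'e').
ALTER TABLE Faculty ADD UNIQUE (dean);
ALTER TABLE Faculty ADD FOREIGN KEY (dean) REFERENCES Employee;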
  • 99. CREATE TABLE University ( CREATE TABLE Department ( Office char(50) NOT NULL ); DeptCode char(5) NOT NULL, Building char(20) NOT NULL, CREATE TABLE Employee ( Name char(50) NOT NULL, Name char(25) NOT NULL, Address char(30), Address char(30) NOT NULL, Head char(9), BirthDate date(7) NOT NULL, Phone char(13), Gender char(1) NOT NULL, Faculty char(5) NOT NULL ); EmpNo char(9) NOT NULL, Designation char(30) NOT NULL, CREATE TABLE Committee ( WorksFor char(5) NOT NULL, Name char(15) NOT NULL, YearJoined integer(2) NOT NULL, Faculty char(5) NOT NULL, Room char(9), Chairperson char(9) ); Phone char(13), Salary decimal(8,2) ); CREATE TABLE ComMember ( ComName char(15) NOT NULL, CREATE TABLE Student ( MemName char(9) NOT NULL, Name char(20) NOT NULL, Faculty char(5) NOT NULL, Address char(30) NOT NULL, YearJoined integer(2) NOT NULL ); BirthDate date(7) NOT NULL, Gender char(1) NOT NULL, CREATE TABLE Teach ( CollegeNo char(9) NOT NULL, Lecturer char(9) NOT NULL, Course char(5) NOT NULL, Course char(5) NOT NULL, Department char(5) NOT NULL, Subject char(5) NOT NULL, Tutor char(9), Room char(9) ); RegYear integer(2) NOT NULL ); CREATE TABLE Course ( CREATE TABLE EmpStudent ( CourseNo char(5) NOT NULL, CollegeNo char(9) NOT NULL, Name char(35) NOT NULL, EmpNo char(9) NOT NULL, Coordinator char(9), Remark char(10) ); Offeredby char(5) NOT NULL, Type char(1) NOT NULL, CREATE TABLE College ( Length char(10), Code char(5) NOT NULL, Options integer(2) ); Building char(20) NOT NULL, Name char(40) NOT NULL, CREATE TABLE Subject ( Address char(30), SubNo char(5) NOT NULL, Principal char(9), Name char(40) NOT NULL ); Phone char(13) ); CREATE TABLE Option ( CREATE TABLE Faculty ( Course char(5) NOT NULL, Code char(5) NOT NULL, Subject char(5) NOT NULL, Building char(20) NOT NULL, Year integer(2) NOT NULL ); Name char(40) NOT NULL, Address char(30), CREATE TABLE Take ( Secretary char(9), CollegeNo char(9) NOT NULL, Phone char(13), Subject char(5) NOT NULL, Dean char(9) ); Year integer(2) NOT NULL, Grade char(1) ); Figure 6.2: Test database schema entities and their attribute descriptions We can see that the selected constraints cover four of the five relationship identification categories of figure 5.3. The remaining category (i.e. ‘b’) is a special case of category ‘a’ which could be represented in the entity Take by introducing two separate foreign keys to link entities Course and Subject, instead of linking with the entity Option. However, as stated in section 5.4.1, n-ary relationships are simplified whenever possible. Hence, in the test examples presented here we do not show this type to reduce the complexity of our diagrams. In appendix C we present the graphical view of all our test databases. The figures there show the graphical representation of all the relationships identified in table 6.6. b) Inheritance We have introduced two inheritance structures, one representing a single inheritance and the other a multiple inheritance (see figure 5.2 and table 6.7). To do so, two generalised entities, Page 98
  • 100. namely: Office and Person, have been introduced (see figure 6.3). Entities College, Faculty and Department now inherit from Office, while entities Employee and Student inherit from Person. Entity EmpStudent has been modified to become a specialised combination of Student and Employee. Figure 6.3 also contain all constraints associated with these entities. Constraint Entity(s) PRIMARY KEY (office) University PRIMARY KEY (empNo) Employee PRIMARY KEY (collegeNo) Student, EmpStudent PRIMARY KEY (code) College, Faculty PRIMARY KEY (deptCode) Department PRIMARY KEY (name,faculty) Committee PRIMARY KEY (comName,memName,faculty) ComMember PRIMARY KEY (lecturer,cource,subject) Teach PRIMARY KEY (courseNo) Course PRIMARY KEY (subNo) Subject PRIMARY KEY (course,subject) Option PRIMARY KEY (collegeNo,subject,year) Take Table 6.2: Primary Key constraints of our test databases Constraint Entity(s) UNIQUE (empNo) EmpStudent UNIQUE (name) College, Department, Faculty UNIQUE (principal) College UNIQUE (dean) Faculty UNIQUE (head) Department UNIQUE (name,offeredBy) Course Table 6.3: Uniqueness Key constraints of our test databases c) Cardinality constraints We have introduced some cardinality constraints on our test databases to show how these can be specified for a legacy database. In table 6.8 we show those used in the College database. Here the cardinality constraints for worksFor and faculty have been explicitly specified (see figure 6.3), while the others (inCharge, tutor and dean) have been derived using their relationship types. For example inCharge and tutor are 1:N relationships while dean is a 1:1 relationship. Our conceptual diagrams incorporate these constraint values (cf. appendix C and figure 5.2). Constraint Entity(s) FOREIGN KEY (course) REFERENCES Course Student, Option FOREIGN KEY (department) REFERENCES Department Student FOREIGN KEY (tutor) REFERENCES Employee Student FOREIGN KEY (dean) REFERENCES Employee Faculty FOREIGN KEY (faculty) REFERENCES Faculty Committee FOREIGN KEY (chairPerson) REFERENCES Employee Committee FOREIGN KEY (comName,faculty) REFERENCES Committee ComMember FOREIGN KEY (memName) REFERENCES Employee ComMember FOREIGN KEY (lecturer) REFERENCES Employee Teach FOREIGN KEY (course,subject) REFERENCES Option Teach FOREIGN KEY (coordinator) REFERENCES Employee Course FOREIGN KEY (offeredBy) REFERENCES Department Course FOREIGN KEY (subject) REFERENCES Subject Option, Take FOREIGN KEY (collegeNo) REFERENCES Student Take Table 6.4: Foreign Key constraints of our test databases Page 99
  • 101. Constraint Entity(s) CHECK (yearJoined >= 21 + birthDate INTERVAL YEAR) Employee CHECK (salary BETWEEN 200 AND 3000 OR salary IS NULL) Employee CHECK (regYear >= 18 + birthDate INTERVAL YEAR) Student CHECK (phone IS NOT NULL) College, Department, Faculty CHECK (type IN ('U','P','E','O')) Course CHECK (options >= 0 OR options IS NULL) Course CHECK (year BETWEEN 1 AND 7) Option Table 6.5: Check constraints of our test databases d) Aggregation A university has many offices (e.g. faculties, departments etc.) and an office belongs to a university. Also, attribute office is the key of entity University. Hence, entities University and Office participate in a 1:1 relationship. However, it is natural to represent this as a specialised relationship by considering office of University to be of type set. Then University and Office participate in an aggregation relationship which is a special form of a binary relationship. We introduce this type to show how specialised constraints could be introduced into a legacy database system. As shown in figure 6.3 we have used the key word REF SET to specify this type of relationship. In this case, as office is the key of University, a foreign key definition on office (see figure 6.3) will treat University as a link entity and hence can be classified as a special relationship. Attribute(s) Entity Relationship Entity(s) Criteria course Student 1 :N Course (d) department Student 1 :N Department (d) tutor Student 1 :N Employee (d) dean Faculty 1 :1 Employee (e) faculty Committee 1 :N Faculty (c) chairPerson Committee 1 :N Employee (d) comName, faculty, memName ComMember M :N Committee, Employee (a) lecturer, course, subject Teach M :N Employee, Option (a) coordinator Course 1 :N Employee (d) offeredBy Course 1 :N Department (d) course, subject Option M :N Course, Subject (a) collegeNo, subject Take M :N Student, Subject (a) Table 6.6: Relationship types of our test databases Entity Inherited Entities Employee Person Student Person EmpStudent Student, Employee College Office Faculty Office Department Office Table 6.7: Inherited Entities Participating Referencing Referenced Referencing Referenced Attribute Entity Entity Cardinality Cardinality inCharge Office Employee 0+ -1 worksFor Employee Office 4+ 1 tutor Student Employee 0+ -1 dean Faculty Employee -1 -1 faculty Department Faculty 2-12 1 Table 6.8: Cardinality constraints of College database Page 100
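As a sketch of how such a derived cardinality constraint is later checked, applying the cardinality template of figure 5.11 to the faculty relationship of table 6.8 (each Faculty should be referenced by 2 to 12 Department rows) gives roughly:

-- Faculties referenced by fewer than 2 or more than 12 departments.
SELECT Faculty, COUNT(*) FROM Department
GROUP BY Faculty HAVING COUNT(*) < 2
UNION
SELECT Faculty, COUNT(*) FROM Department
GROUP BY Faculty HAVING COUNT(*) > 12;
-- A Faculty referenced by no Department row at all does not appear in the
-- grouping, so the zero case needs a separate check against Faculty itself.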
  • 102. 6.4 Constraints Specification, Enhancement and Enforcement In the context of legacy systems, our test database schemas (cf. figure 6.2) will not explicitly contain most of the constraints introduced in tables 6.2 to 6.5, 6.7 and 6.8. Thus we need to specify them using the approach described in section 5.6. In figure 6.3 we present these constraints for the College database. CREATE TABLE Office (code, siteName, unitName, address, inCharge, phone) AS SELECT code, building, name, address, principal, phone FROM College UNION SELECT code, building, name, address, secretary, phone FROM Faculty UNION SELECT deptCode, building, name, address, head, phone FROM Department; ALTER TABLE Office ADD CONSTRAINT Office_PK PRIMARY KEY (code) ADD CONSTRAINT Office_Unique_name UNIQUE (siteName, unitName) ADD CONSTRAINT Office_FK_Staff FOREIGN KEY (inCharge) REFERENCES Employee ADD UNIQUE (phone); ALTER TABLE College ADD UNDER Office WITH (siteName AS building, unitName AS name, inCharge AS principal); ALTER TABLE Faculty ADD UNDER Office WITH (siteName AS building, unitName AS name, inCharge AS secretary) ADD FOREIGN KEY (faculty) CARDINALITY (2-12) REFERENCES Faculty ; ALTER TABLE Department ADD UNDER Office WITH (code AS deptCode, siteName AS building, unitName AS name, inCharge AS head); CREATE VIEW College_Office AS SELECT * FROM Office WHERE code in (SELECT code FROM College); CREATE VIEW Faculty_Office AS SELECT o.code, o.siteName, o.unitName, o.address, o.inCharge, o.phone, f.dean FROM Office o, Faculty f WHERE o.code = f.code; CREATE VIEW Dept_Office AS SELECT o.code, o.siteName, o.unitName, o.address, o.inCharge, o.phone, d.faculty FROM Office o, Department d WHERE o.code = d.deptCode; ALTER TABLE University ALTER COLUMN office REF SET(Office) NOT NULL | ADD FOREIGN KEY (office) REFERENCES Office ; CREATE TABLE Person AS SELECT name, address, birthDate, gender FROM Employee UNION SELECT name, address, birthDate, gender FROM Student; ALTER TABLE Person ADD PRIMARY KEY (name, address, birthDate) ADD CHECK (gender IN ('M', 'F')); ALTER TABLE Employee ADD UNDER Person ADD CONSTRAINT Employee_FK_Office FOREIGN KEY (worksFor) CARDINALITY (4) REFERENCES Office; ALTER TABLE Student ADD UNDER Person; ALTER TABLE EmpStudent ADD UNDER Student, Employee ADD CHECK (tutor <> empNo OR tutor IS NULL); Figure 6.3 : Enhanced constraints of college database in extended SQL-3 syntax When all the above constraints are not supported by a legacy database management system, we need to be able to store the constraints in the database using our knowledge augmentation techniques (cf. section 5.7). In figure 6.4 we present selected instances used in our knowledge-based tables to represent the enhanced constraints for the College database. The selected instances represent all the possible constraint types so we have not represented all the enhanced constraints of figure 6.3. Page 101
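Where the host DBMS cannot hold these definitions itself (e.g. INGRES version 6), the same knowledge is written into the augmented tables of figure 5.7 as ordinary rows. As a minimal sketch, anticipating the instances of figure 6.4 (below), the Employee_FK_Office constraint with cardinality 4+ might be recorded as follows:

-- Recording the foreign key Employee(worksFor) -> Office, cardinality 4+.
INSERT INTO Table_Constraints
VALUES ('Uni_db', 'Employee_FK_Office', 'Col', 'Employee',
        'FOREIGN KEY', 'NO', 'NO');
INSERT INTO Key_Column_Usage
VALUES ('Uni_db', 'Employee_FK_Office', 'Col', 'Employee', 'worksFor', 1);
INSERT INTO Referential_Constraints
VALUES ('Uni_db', 'Employee_FK_Office', 'Uni', 'Office_PK',
        'NONE', 'NO ACTION', 'NO ACTION');
INSERT INTO Cardinality_Constraints
VALUES ('Uni_db', 'Employee_FK_Office', '4+');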
Our constraint enforcement process (cf. section 5.8) allows users to verify the extent to which the
data in a database conforms to its enhanced constraints. The different types of queries used for this
process in the College database are given in figure 6.5.

Table_Constraint { Constraint_Id, Constraint_Name, Table_Id, Table_Name, Constraint_Type,
                   Is_Deferrable, Initially_Deferred }
    ('Uni_db', 'Office_PK', 'Col', 'Office', 'PRIMARY KEY', 'NO', 'NO')
    ('Uni_db', 'Office_Unique_name', 'Col', 'Office', 'UNIQUE', 'NO', 'NO')
    ('Uni_db', 'Office_FK_Staff', 'Col', 'Office', 'FOREIGN KEY', 'NO', 'NO')
    ('Uni_db', 'Employee_PK', 'Col', 'Employee', 'PRIMARY KEY', 'NO', 'NO')
    ('Uni_db', 'Employee_FK_Office', 'Col', 'Employee', 'FOREIGN KEY', 'NO', 'NO')
    ('Uni_db', 'College_phone', 'Col', 'College', 'CHECK', 'NO', 'NO')

Referential_Constraint { Constraint_Id, Constraint_Name, Unique_Constraint_Id,
                         Unique_Constraint_Name, Match_Option, Update_Rule, Delete_Rule }
    ('Uni_db', 'Office_FK_Employee', 'Col', 'Employee_PK', 'NONE', 'NO ACTION', 'NO ACTION')
    ('Uni_db', 'Employee_FK_Office', 'Uni', 'Office_PK', 'NONE', 'NO ACTION', 'NO ACTION')

Key_Column_Usage { Constraint_Id, Constraint_Name, Table_Id, Table_Name, Column_Name,
                   Ordinal_Position }
    ('Uni_db', 'Office_PK', 'Col', 'Office', 'Code', 1)
    ('Uni_db', 'Office_Unique_name', 'Col', 'Office', 'siteName', 1)
    ('Uni_db', 'Office_Unique_name', 'Col', 'Office', 'unitName', 2)
    ('Uni_db', 'Office_FK_Staff', 'Col', 'Office', 'inCharge', 1)
    ('Uni_db', 'Employee_PK', 'Col', 'Employee', 'empNo', 1)
    ('Uni_db', 'Employee_FK_Office', 'Col', 'Employee', 'worksFor', 1)

Check_Constraint { Constraint_Id, Constraint_Name, Check_Clause }
    ('Uni_db', 'College_phone', 'phone IS NOT NULL')

Sub_Tables { Table_Id, Sub_Table_Name, Super_Table_Name }
    ('Uni_db', 'College', 'Office')

Altered_Sub_Table_Columns { Table_Id, Sub_Table_Name, Sub_Table_Column, Super_Table_Name,
                            Super_Table_Column }
    ('Uni_db', 'College', 'building', 'Office', 'siteName')
    ('Uni_db', 'College', 'name', 'Office', 'unitName')
    ('Uni_db', 'College', 'principal', 'Office', 'inCharge')

Cardinality_Constraint { Constraint_Id, Constraint_Name, Referencing_Cardinal }
    ('Uni_db', 'Office_FK_Employee', '0+')
    ('Uni_db', 'Employee_FK_Office', '4+')

        Figure 6.4: Augmented tables with selected sample data for the College database

6.5 Database Access Process

Having described the application of our re-engineering processes using our test databases, we
identify the tools developed and used to access those databases. The database access process is the
initial stage of our application. This process extracts meta-data from legacy databases and represents
it internally so that it can be used by other stages of our application. During re-engineering we need
to access a database at three different stages: to extract meta-data and any existing constraint
knowledge specifications to commence our reverse-engineering process; to add enhanced knowledge
to the database; and to verify the extent to which the data conforms to the existing and enhanced
constraints. We also need to access the database during the migration process. In all these cases, the
information we require is held in either system or user-defined tables. Extraction of information
from these tables can be done using the query language of the database, thus what we need for this
stage is a mechanism that will allow us to issue queries and capture their responses.
Constraint Type   Constraint Violation Instances

Primary Key       SELECT code, COUNT(*) FROM Office GROUP BY code HAVING COUNT(*) > 1
                  UNION SELECT code, 1 FROM Office WHERE code IS NULL

Unique            SELECT dean, COUNT(*) FROM Faculty GROUP BY dean HAVING COUNT(*) > 1

Referential       SELECT * FROM Office
                  WHERE NOT (inCharge IS NULL OR inCharge IN (SELECT empNo FROM Employee))

Check             SELECT * FROM College WHERE NOT (phone IS NOT NULL)

Cardinality       SELECT worksFor, COUNT(*) FROM Employee GROUP BY worksFor HAVING COUNT(*) < 4

        Figure 6.5: Selected constraints to be enforced for the College database in SQL

As our system implementation is in Prolog, the necessary query statements are generated from
Prolog rules. The PRODBI interface allows access to several relational DBMSs, namely Oracle,
INGRES, INFORMIX and SYBASE [LUC93], from Prolog as if their relational tables were in the
Prolog environment. The availability of PRODBI for INGRES enabled us to use this tool to
communicate with our INGRES test databases in the latter stages of our project. This interface
performs as well as INGRES/SQL and hence, to the user, database interaction is fully transparent.
Such Prolog database interface tools are currently commercially available only for relational
database products, which means that we were not in a position to use this approach to perform
database interactions for our POSTGRES test databases. Tools such as ODBC allow access to
heterogeneous databases. This option would have been ideal for our application, but was not
considered due to its unavailability within our development time scale.

As far as our work is concerned, we needed a facility to issue specific types of query and obtain the
response in such a way that Prolog could process the responses without having to download the
entire database. The PRODBI interfaces for relational databases perform this task efficiently, and
also have many other useful data manipulation features. Due to the absence of any PRODBI-
equivalent tools for accessing non-relational or extended-relational DBMSs, we decided to develop
our own version for POSTGRES. The functionality of our POSTGRES tool is to accept a
POSTGRES DML statement (i.e. a POSTQUEL query statement) and produce the results for that
query in a form that is usable by our (Prolog based) system. For Oracle, a PRODBI interface is
available commercially, and to use it with our system the only change we would have to make is to
load the Oracle library; as far as our code is concerned no other commands change, since they follow
the same rules as for INGRES. However, at Cardiff only the PRODBI interface for INGRES was
available, and even this only in the latter stages of our project. Therefore we developed our own tool
to perform this functionality for INGRES and Oracle databases. The implementation of this tool was
not fully generalised, given that such tools were commercially available, and we were not too
concerned by performance degradation as our aim was to test functionality, not performance. In the
case of INGRES we have since confirmed performance by using a commercially developed PRODBI
tool with an SQL-equivalent query facility.
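To illustrate how such violation queries can be generated from Prolog rules, the following minimal
sketch builds and issues the check-constraint query of figure 6.5. The fact enhanced_check/3 and the
predicate db_query/3 are illustrative assumptions, standing in for our internal constraint
representation (described in chapter 7) and for whichever call the underlying database interface
actually provides.

    % Hedged sketch: generating and issuing a violation query for a check constraint.
    % enhanced_check/3 and db_query/3 are assumptions made for this example.

    enhanced_check('College', 'College_phone', 'phone IS NOT NULL').

    % Build a query that retrieves the instances violating a check constraint.
    violation_query(Table, ConstraintName, Query) :-
        enhanced_check(Table, ConstraintName, Clause),
        atom_concat('SELECT * FROM ', Table, Q1),
        atom_concat(Q1, ' WHERE NOT (', Q2),
        atom_concat(Q2, Clause, Q3),
        atom_concat(Q3, ')', Query).

    % Issue the query through the database interface and report any violating rows.
    enforce_check(Database, Table, ConstraintName) :-
        violation_query(Table, ConstraintName, Query),
        db_query(Database, Query, Rows),        % stand-in for the actual interface call
        report_violations(ConstraintName, Rows).

    report_violations(_, []).
    report_violations(Name, [Row|Rows]) :-
        length([Row|Rows], N),
        format('Constraint ~w violated by ~d row(s), e.g. ~w~n', [Name, N, Row]).

For primary key, unique, referential and cardinality constraints the same pattern applies, with only
the query template varying, as in figure 6.5.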
6.5.1 Connecting to a database

To establish a connection with a database the user needs to specify the site name (i.e. the location of
the database), the DBMS name (e.g. Oracle v7) and the database name (e.g. department), to ensure a
unique identification of a database located over a network. The site name is the address of the host
machine (e.g. thor.cf.ac.uk) and is used to gain access to that machine via the network. The type of
the named DBMS identifies the kind of data to be accessed, and the name[18] of the database tells us
which database is to be used in the extraction process. In our system (CCVES), we provide a pop-up
window (cf. left part of figure 6.6) to select and specify these requirements. Here, a set of commonly
used site names and the DBMSs currently supported at a site are embedded in the menu to make this
task easy. The specification of new site and database names can also be done via this pop-up menu
(cf. right part of figure 6.6).

Figure 6.6: Database connection process of CCVES

6.5.2 Meta-data extraction

Once a physical connection to a database is achieved it is possible to commence the meta-data
extraction process. This process is DBMS dependent, as the kind of meta-data represented in a
database and the methods of retrieving it vary between DBMSs. The information to be extracted is
recorded in the system catalogues (i.e. data dictionaries) of the respective databases. The most basic
type of information is entity and attribute names, which are common to all DBMSs. However,
information about the different types of constraints is specific to each DBMS and may not be present
in legacy database system catalogues.

The organisation of meta-data in databases differs with DBMSs, although all relational database
systems use some table structure to represent this information. For example, the table structure for
Oracle user tables is straightforward as they are separated from the system tables, while it is more
complex in INGRES as all tables are held in a single form, using attribute values to differentiate user
defined tables from system and view tables. Hence the extraction query statements used to retrieve
the entity names of a database schema differ for each system, as shown in table 6.9. These query
statements indicate that the meta-data extraction process is done using the query language of the
DBMS concerned (e.g. SQL for Oracle and POSTQUEL for POSTGRES) and that the query table
names and conditions vary with the type of the DBMS. This clearly demonstrates the DBMS
dependency of the extraction process.

[18] For simplicity, identification details like the owner of the database are not included here.
Once the meta-data is obtained from the system catalogues we can process it to produce the database
schema in the DDL formalism of the source database and to represent this in our internal
representation (see section 7.2). The extraction process for entity names (cf. table 6.9) covers only
one type of information. A similar process is used to extract all the other types of information,
including our enhanced knowledge-based tables. Here, the main difference is in the queries used to
extract the meta-data and in any processing required to map the extracted information into our
internal structures, which are introduced in section 7.2 (see also appendix D).

DBMS         Query
Oracle V7    SELECT table_name FROM user_table;
INGRES V6    SELECT table_name FROM iitables WHERE table_type='T' AND system_use='U';
POSTGRES V4  RETRIEVE pg_class.relname WHERE pg_class.relowner != '6';
SQL-3        SELECT table_name FROM tables WHERE table_type='BASE TABLE';

        Table 6.9: Query statements to extract the entity names of a database schema

6.5.3 Meta-data storage

The generated internal structures are stored in text files for further use as input data for our system.
These text files are stored locally using distinct directories for each database. The system uses the
database connection specifications to construct a unique directory name for each database (e.g.
department-Oracle7-thor.cf.ac.uk). We have given public access to these files so that the stored data
and knowledge is not only reusable locally, but also usable from other sites. This directory structure
provides a logically coherent database environment for users. It means that any future re-engineering
processes may be done without physically connecting to the database (i.e. by selecting a database
logically from one of the public directories instead).

The process of connecting to a database and accessing its meta-data usually does not take much time
(e.g. at most 2 minutes). However, accessing an active database whenever a user wants to view its
structure slows down the regular activities of that database. Also, local working is more cost
effective than regularly performing remote accesses, and it guarantees access to the database service
as it is not affected by network traffic and breakdowns. We experienced such breakdowns during our
system development, especially when accessing INGRES databases. A database schema can be
considered to be static, whereas its instances are not. Hence, the decision to simulate a logical
database environment after the first physical remote database access is justifiable, because it allows
us to work on meta-data held locally.
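The following minimal sketch illustrates the DBMS dependency of the extraction step and the local
storage of its results, using the class construct introduced in section 7.2. The predicates
entity_query/2 and db_query/3 and the file naming are assumptions made for this example; the actual
queries are those of table 6.9 and the actual interface call depends on the DBMS concerned.

    % Hedged sketch: DBMS-specific extraction of entity names and their storage
    % as internal facts in the per-database directory.

    :- dynamic class/2.

    entity_query(oracle_v7,   'SELECT table_name FROM user_table').
    entity_query(ingres_v6,   'SELECT table_name FROM iitables WHERE table_type=''T'' AND system_use=''U''').
    entity_query(postgres_v4, 'RETRIEVE pg_class.relname WHERE pg_class.relowner != ''6''').

    % Extract the entity names of a schema and record them as class/2 facts.
    extract_entities(SchemaId, Dbms) :-
        entity_query(Dbms, Query),
        db_query(SchemaId, Query, Names),       % stand-in for the DBMS-specific interface call
        assert_entities(SchemaId, Names).

    assert_entities(_, []).
    assert_entities(SchemaId, [Name|Names]) :-
        assertz(class(SchemaId, Name)),
        assert_entities(SchemaId, Names).

    % Save the extracted facts in the database's local directory for later sessions.
    store_entities(SchemaId, Directory) :-
        atom_concat(Directory, '/class.pl', File),
        tell(File),
        ( class(SchemaId, Name),
          writeq(class(SchemaId, Name)), write('.'), nl,
          fail
        ; true
        ),
        told.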
6.5.4 Schema viewing

Because meta-data is stored in text files, it is possible to skip the stages described in sections 6.5.1 to
6.5.3 when viewing a database schema which has been accessed recently. During a database
connection session, our system will only extract and store the meta-data of a database. Once the
database connection process is completed the user needs to invoke a schema viewing session. Here,
the user is prompted with a list of the currently logically connected databases, as shown on the left of
figure 6.7. When a database is selected from this list, its name descriptions (i.e. database name and
associated schema names) are placed in the main window of CCVES (cf. right of figure 6.7). The
user selects schemas to view them. Our reverse-engineering process is applied at this point: meta-
data extracted from the database schema is processed further to derive the constructs necessary to
produce the conceptual model as an E-R or OMT diagram.

CCVES allows multiple selections of the same database schema (i.e. by selecting the same schema
from the main window; cf. right of figure 6.7). As a result, multiple schema visualisation windows
can be produced for the same database. The advantage of this is that it allows a user to
simultaneously view and operate on different sections of the same schema, which otherwise would
not be visible at the same time due to the size of the overall schema (i.e. we would have to scroll the
window to make other parts of the schema visible). Also, the facility to visualise schemas using a
user-preferred display model means that the user can view the same schema simultaneously using
different display models.

Figure 6.7: Database selection and selected databases of CCVES

To produce a graphical view of a schema, we apply our reverse-engineering process. This process
uses the meta-data which we extracted and represented internally. In chapter 7 we introduce our
internal structures and describe the internal and external architecture and operation of our system,
CCVES.
CHAPTER 7

                       Architecture and Operation of CCVES

The Conceptualised Constraint Visualisation and Enhancement System (CCVES) is defined by
describing its internal architecture and operation - i.e. the way in which different legacy database
schemas are processed within CCVES in the course of enhancing and migrating them into a target
DBMS's schema - and its external architecture and operation - i.e. CCVES as seen and operated by
its users. Finally, we look into the possible migrations that can be performed using CCVES.

7.1 Internal Architecture of CCVES

In previous chapters, we discussed the overall information flow (section 2.2), our re-engineering
process (section 5.2) and the database access process (section 6.5). Here we describe how the meta-
data accessed from a database is stored and manipulated by CCVES in order to successfully perform
its various tasks. There are two sources of input information available to CCVES (cf. figure 7.1):
initially, by accessing a legacy database service via the database connectivity (DBC) process, and
later by using the database enhancement (DBE) process. This information is converted into our
internal representation (see section 7.2) and held in this form for use by the other modules of
CCVES. For example, the Schema Meta-Visualisation System (SMVS) uses it to display a
conceptual model of a legacy database, the Query Meta-Translation System (QMTS) uses it to
construct queries that verify the extent to which the data conforms to existing and enhanced
constraints, and the Schema Meta-Translation System (SMTS) uses it to generate and create target
databases for migration.

7.2 Internal Representation

To address heterogeneity issues, meta-representation and translation techniques have been
successfully used in several recent research projects at Cardiff [HOW87, RAM91, QUT93, IDR94].
A key to this approach is the transformation of the source meta-data or query into a common internal
representation, which is then separately transformed into a chosen target representation. Thus
components of a schema, referred to as meta-data, are classified as entity (class) and attribute
(property) on input, and are stored in a database-language-independent fashion in the internal
representation. This meta-data is then processed to derive the appropriate schema information of a
particular DBMS. In this way it is possible to use a single representation and yet deal with issues
related to most types of DBMSs. A similar approach is used for query transformation between
source and target representations.
Figure 7.1: Internal Architecture of CCVES (the DBC and DBE processes feed the internal
representation, which is used by the SMVS, QMTS and SMTS modules)

The meta-data we deal with has been classified into two types. The first category represents essential
meta-data and the other represents derived meta-data. Information that describes an entity and its
attributes, and constraints that identify relationships/hierarchies among entities, are the essential
meta-data (see section 7.2.1), as they can be used to build a conceptual model. Information that is
derived from the essential meta-data for use in the conceptual model constitutes the other type of
meta-data. When performing our reverse-engineering process we look only at the essential meta-
data. This information is extracted from the respective databases during the initial database access
process (i.e. DBC in figure 7.1).

7.2.1 Essential Meta-data

Our essential (basic) meta-data internal representation captures sufficient information to allow us to
reproduce a database schema using the DDL syntax of any DBMS. This representation covers entity
and view definitions and their associated constraints. The following 5 Prolog-style constructs were
chosen to represent this basic meta-data (see figure 7.2). The first two constructs, namely class and
class_property, are fundamental to any database schema as they describe the schema entities and
their attributes, respectively. The third construct represents constraints associated with entities; this
information is only partially represented by most DBMSs. The next two constructs are relevant only
to some recent object-oriented DBMSs and are not supported by most DBMSs. We have included
them mainly to demonstrate how modern abstraction mechanisms such as inheritance hierarchies
could be incorporated into legacy DBMSs. By a similar approach, it is possible to add any other
appropriate essential meta-data constructs. For conceptual modelling, and for the type of testing we
perform for the chosen DBMSs, namely Oracle, INGRES and POSTGRES, we found that the 5
constructs described here are sufficient. However, some additional environmental data (see section
7.2.2), which allows identification of the name and the type of the current database, is also essential.
1. class(SchemaId, CLASS_NAME).
2. class_property(SchemaId, CLASS_NAME, PROPERTY_NAME, PROPERTY_TYPE).
3. constraint(SchemaId, CLASS_NAME, PROPERTY_list, CONST_TYPE, CONST_NAME, CONST_EXPR).
4. class_inherit(SchemaId, CLASS_NAME, SUPER_list).
5. renamed_attr(SchemaId, SUPER_NAME, SUPER_PROP_NAME, CLASS_NAME, PROPERTY_NAME).

        Figure 7.2: Our Essential Meta-data Representation Constructs

We now provide a detailed description of our meta-representation constructs. This representation is
based on the concepts of the Object Abstract Conceptual Schema (OACS) [RAM91] used in
Ramfos's SMTS and other meta-processing systems; hence we have used the same name to refer to
our own internal representation. Ramfos's OACS internal representation provides a natural
abstraction of a particular structure based on the notion of objects. For example, when an object is
described, its attributes, constraints and other related properties are treated as a single construct,
although only part of it may be used at a time. Our OACS representation, in contrast, directly
resembles the internal representation structure of most relational DBMSs (e.g. class represents an
entity and class_property represents the attributes of an entity). This is the main difference between
the two representations. However, it is possible to map the OACS constructs of Ramfos to our
internal representation and vice-versa, so our decision to use a variation of the original OACS does
not affect the meta-representation and processing principles in general.

• Meta-data Representation of class
The names of all the entities of a particular schema are recorded using class. This information is
processed to identify all the entities of a database schema.

• Meta-data Representation of class_property
The names of all attributes and their data types for a particular schema are recorded using
class_property. This information is processed to identify all the attributes of an entity.

• Meta-data Representation of constraint
All types of constraints associated with entities are recorded using constraint. This information has
been organised to represent constraints as logical expressions, along with an identification name and
the participating attributes. The different types of constraint, i.e. primary key, foreign key, unique,
not null, check constraints, etc., are each processed and stored in this form. Usually a certain amount
of preprocessing is required to construct our generalised representation of a constraint. For example,
some check constraints extracted from the INGRES DBMS need to be preprocessed to allow them to
be classified as check constraints by our system.

• Meta-data Representation of class_inherit
Entities that participate in inheritance hierarchies are recorded using class_inherit. The names of all
super-entities for a particular entity are recorded here. This information is processed to identify all
sub-entities of an entity and the inheritance hierarchies of a database schema.

• Meta-data Representation of renamed_attr
During an inheritance process, some attribute names may be changed to give more meaningful
names to the inherited attributes. Once the inherited names are changed it becomes impossible to
automatically reverse-engineer these entities, as their attribute names no longer match. To overcome
this problem we have introduced an additional meta-data representation construct, renamed_attr,
which keeps track of all attributes whose names have changed due to inheritance. It is a
representation of synonyms for the attribute names of an inheritance hierarchy.
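As an illustration (it does not appear among the thesis figures), the following hedged sketch shows
how a small fragment of the College database of chapter 6 might be recorded in these constructs. The
schema identifier, the property types and the encoding of constraint types and expressions are
assumptions made for the example.

    % Hedged sketch: OACS facts for part of the College database (cf. figures 6.3 and 6.4).

    class('Uni_db', 'Office').
    class('Uni_db', 'Employee').

    class_property('Uni_db', 'Office',   'code',     char(4)).
    class_property('Uni_db', 'Office',   'inCharge', char(8)).
    class_property('Uni_db', 'Employee', 'empNo',    char(8)).

    constraint('Uni_db', 'Office', ['code'], primary_key, 'Office_PK',
               'PRIMARY KEY (code)').
    constraint('Uni_db', 'Office', ['inCharge'], foreign_key, 'Office_FK_Staff',
               'FOREIGN KEY (inCharge) REFERENCES Employee').
    constraint('Uni_db', 'College', ['phone'], check, 'College_phone',
               'phone IS NOT NULL').

    % College inherits from Office, with principal renamed from inCharge (cf. figure 6.3).
    class_inherit('Uni_db', 'College', ['Office']).
    renamed_attr('Uni_db', 'Office', 'inCharge', 'College', 'principal').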
7.2.2 Environmental Data

Environmental data is recorded using ccves_data, which represents three types of information,
namely the database name, the DBMS name and the name of the host machine (see figure 7.3).
These are captured at the database connection stage.

ccves_data(dbname, DATABASE_NAME).
ccves_data(dbms, DBMS_NAME).
ccves_data(host, HOST_MACHINE_NAME).

        Figure 7.3: OACS Constructs used as environmental data

7.2.3 Processed Meta-data

The essential meta-data described in section 7.2.1 is processed to derive additional information
required for conceptual modelling. This additional information comprises schema_data, class_data
and relationship. Here, schema_data (cf. figure 7.4 section 1) identifies all entities (all_classes, using
class of figure 7.2 section 1) and the entity types (link_classes and weak_classes, by the process
described in section 5.4, using constraint types such as primary and foreign key which are recorded
in constraint of figure 7.2 section 3). Class_data (cf. figure 7.4 section 2) identifies all class
properties (property_list, using class_property of figure 7.2 section 2), inherited properties (using
class_property, class_inherit and renamed_attr of figure 7.2 sections 2, 4 and 5, respectively), sub-
and super-classes (subclass_list and superclass_list, using class_inherit of figure 7.2), and
referencing and referenced classes (ref and refed, using the foreign key constraints recorded in
constraint of figure 7.2). Relationship (cf. figure 7.4 section 3) records the relationship types (derived
using the process described in section 5.4) and cardinality information (using the derived
relationship types and the available cardinality values).
1. schema_data(SchemaId, [ all_classes(ALL_CLASS_list),
                           link_classes(LINK_CLASS_list),
                           weak_classes(WEAK_CLASS_list) ]).

2. class_data(SchemaId, CLASS_NAME, [ property_list(OWN_PROPERTY_list, INHERIT_PROPERTY_list),
                                      subclass_list(SUBCLASS_list),
                                      superclass_list(SUPERCLASS_list),
                                      ref(REFERENCING_CLASS_list),
                                      refed(REFERENCED_CLASS_list) ]).

3. relationship(SchemaId, REFERENCING_CLASS_NAME, RELATIONSHIP_TYPE, CARDINALITY,
                REFERENCED_CLASS_NAME).

        Figure 7.4: Derived OACS Constructs

7.2.4 Graphical Constructs

Besides the above OACS representations it is necessary to support additional constructs to produce a
graphical display of a conceptual model. For this we produce graphical constructs from our derived
OACS constructs (cf. figure 7.4) and apply a display layout algorithm (see section 7.3). We call these
graphical object abstract conceptual schema (GOACS) constructs, as they are graphical extensions of
our OACS constructs. The graphical display represents entities, their attributes (optional),
relationships, etc., using graphical symbols which consist of strings, lines and widgets (basic toolkit
objects which, unlike strings and lines, retain their data after being written to the screen [NYE93]).
To produce this display, the coordinates of the positions of all entities, relationships, etc., are derived
and recorded in our graphical constructs. The coordinates of each entity are recorded using
class_info, as shown in section 1 of figure 7.5; this information identifies the top-left coordinates of
an entity.

1. class_info(SchemaId, CLASS_NAME, [ x(X0), y(Y0) ]).

2. box(SchemaId, X0, Y0, W, H, REGULAR_CLASS_NAME).
   box_box(SchemaId, X0, Y0, W, H, Gap, WEAK_CLASS_NAME).
   diamond_box(SchemaId, X0, Y0, W, H, LINK_CLASS_NAME).

3. ref_info(SchemaId, REFERENCING_CLASS_NAME, REFERENCING_CLASS_CONNECTING_SIDE,
            REFERENCING_CLASS_CONNECTING_SIDE_COORDINATE, REFERENCED_CLASS_NAME,
            REFERENCED_CLASS_CONNECTING_SIDE, REFERENCED_CLASS_CONNECTING_SIDE_COORDINATE).

4. line(SchemaId, X1, Y1, X2, Y2).
   string(SchemaId, X0, Y0, STRING_NAME).
   diamond(SchemaId, X0, Y0, W, H, ASSOCIATION_NAME).

5. property_line(SchemaId, CLASS_NAME, X1, Y1, X2, Y2).
   property_string(SchemaId, CLASS_NAME, PROPERTY_NAME, DISPLAY_COLOUR, X0, Y0).

        Figure 7.5: Graphical Constructs (GOACS)

The graphical symbol for an entity depends on the entity type, so further processing is required to
graphically categorise entity types. For the EER model, we categorise entities as follows: regular
entities as box, weak entities as box_box and link entities as diamond_box (cf. section 2 of figure 7.5,
and figure 7.6).
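For illustration, the sketch below shows GOACS facts that such processing might produce for the
Employee and Office entities of the College database; all coordinates and sizes are invented for the
example and would in practice be produced by the layout algorithm of section 7.3.

    % Hedged sketch: graphical facts for two regular entities and the connection
    % arising from the worksFor reference between them.

    class_info('Uni_db', 'Employee', [ x(40),  y(120) ]).
    class_info('Uni_db', 'Office',   [ x(260), y(120) ]).

    box('Uni_db', 40,  120, 100, 40, 'Employee').   % regular entities are drawn as boxes
    box('Uni_db', 260, 120, 100, 40, 'Office').

    line('Uni_db', 140, 140, 260, 140).             % connection between the two boxes
    string('Uni_db', 185, 132, 'worksFor').         % label for the connection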
We use an intermediate representation construct, ref_info, to assist in the derivation of appropriate
coordinates for all associations (cf. section 3 of figure 7.5). With the assistance of ref_info, the
coordinates used to represent relationships are derived and recorded using line, string and diamond
(cf. section 4 of figure 7.5, and figure 7.7).

Users of our schema displays are allowed to interact with schema entities. During this process,
optional information such as the properties (i.e. attributes) of selected entities can be added to the
display. This feature is the result of providing the entities and their attributes at different levels of
abstraction. The added information is recorded separately using property_line and property_string
(cf. section 5 of figure 7.5, and figure 7.8).

Figure 7.6: Graphical representation of entity types (box, box_box, diamond_box) in EER notation
Figure 7.7: Graphical representation of connections, labels and associations (line, string, diamond)
            in EER notation
Figure 7.8: Graphical representation of selected attributes of a class (property_line, property_string)
            in EER notation

7.3 Display Layout Algorithm

To produce a suitable display of a database schema it was necessary to adopt an intelligent algorithm
which determines the positioning of objects in the display. Such algorithms have been used by many
researchers, and also commercially, for similar purposes [CHU89]. We studied these ideas and
implemented our own layout algorithm, which proved to be effective for small, manageable database
schemas. However, to allow displays to be altered to a user-preferred style, and to make our method
effective for large schemas, we decided to incorporate an editing facility. This feature allows users to
move entities and so change their original positions in a conceptual schema. Internally, this is done
by changing the coordinates recorded in class_info for a repositioned entity and recomputing all its
associated graphical constructs.
When the location of an entity is changed, the connection side of that entity may also need to be
changed. To deal with this case, appropriate sides for all entities can be derived at any stage of our
editing process. When the appropriate sides are derived, the ref_info construct (cf. section 3 of figure
7.5) is updated accordingly, to enable us to reproduce the revised coordinates of the line, string and
diamond constructs (cf. section 4 of figure 7.5).

Our layout algorithm does the following:

1. Entities connected to each other are identified (i.e. grouped) using their referenced-entity
   information. This process highlights unconnected entities as well as entities forming hierarchies
   or graph structures.

2. Within a group, entities are rearranged according to the number of connections associated with
   them. This arrangement puts the entities with the most connections at the centre of the display
   structure and the entities with the fewest connections at the periphery.

3. A tree representation is then constructed, starting from the entity having the most connections.
   During the construction of subsequent trees, entities which have already been used are not
   considered, to prevent their original position being changed.

This makes it easy to visualise the relationships/aggregations present in a conceptual model. The
identification of such properties allows us to gain a better understanding of the application being
modelled. Similarly, attempts are made to highlight inheritance hierarchies whenever they are
present. However, when too many inter-related entities are involved, it is sometimes necessary to use
the move editing facility to relocate some entities so that their properties (e.g. relationships) are
highlighted in the diagram. The existence of such hidden structures is due to the cross connection of
some entities. To prevent overlapping of entities, relationships, etc., initial placement is done using
an internal layout grid; however, the user is permitted to overlap or place entities close to each other
during schema editing. The coordinate information of a final diagram is saved in disk files, so that
these coordinates are automatically available to all subsequent re-engineering processes. Hence our
system first checks for the existence of a file containing these coordinates, and only in its absence
does it use the above layout algorithm.
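As an illustration of step 2 above, the following sketch orders the entities of a group by the number
of connections they participate in, using the derived relationship construct of figure 7.4. The
predicate names other than relationship/5 are our own inventions, and the real implementation
handles ties, hierarchies and grouping differently.

    % Hedged sketch: ordering a group of entities by decreasing connection count,
    % so that the most highly connected entity can be placed at the centre.

    :- use_module(library(lists)).   % member/2, reverse/2

    connection_count(SchemaId, Class, Count) :-
        findall(Other,
                ( relationship(SchemaId, Class, _, _, Other)
                ; relationship(SchemaId, Other, _, _, Class) ),
                Others),
        length(Others, Count).

    order_by_connections(SchemaId, Group, Ordered) :-
        findall(Count-Class,
                ( member(Class, Group),
                  connection_count(SchemaId, Class, Count) ),
                Pairs),
        keysort(Pairs, Ascending),
        reverse(Ascending, Descending),
        findall(Class, member(_-Class, Descending), Ordered).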
7.4 External Architecture and Operation of CCVES

We characterise CCVES by first considering the type of people who may use the system. This is
followed by an overview of the external system components. Finally, the external operations
performed by the system are described.

7.4.1 CCVES Operation

The three main operations of CCVES, i.e. analysing, enhancing and incremental migration, need to
be performed by a suitably qualified person. This person must have a good knowledge of the current
database application to ensure that only appropriate enhancements are made to it. Also, this person
must be able to interpret and understand conceptual modelling and the data manipulation language
SQL, as we have used the SQL syntax to specify the contents of databases. This person must have
the skills to design and operate a database application, and is thus a more specialised user than the
traditional ad hoc user. We shall therefore refer to this person as a DataBase Administrator (DBA),
although they need not be a professional DBA. It is this person who will be in charge of migrating
the current database application.

To this DBA, the process of accessing meta-data from a legacy database service in a heterogeneous
distributed database environment is fully automated once the connection to the database of interest is
made. The production of a graphical display representation for the relevant database schema is also
fully automated. This representation shows all available meta-data, links and constraints in the
existing database schema. Links and constraints defined by hand coding in the legacy application
(i.e. not in the database schema but appearing in the application in the form of 3GL or equivalent
code) will not be shown until they are supplied to CCVES during the process of enhancing the
legacy database. Such enhancements are represented in the database itself to allow automatic reuse
of these additions, not only by our system users but also by others (i.e. users of other database
support tools).

The enhancement process will assist the DBA in incrementally building the database structure for the
target database service. Possible decomposable modules of the legacy system are identified during
this stage. Finally, when the incremental migration process has been performed, the DBA may need
to review its success by viewing both the source and the target database schemas. This is achieved
using the facility to visualise multiple heterogeneous databases.

We have sought to meet our objectives by developing an interactive schema visualisation and
knowledge acquisition tool which is directed by an inference engine using a real world data
modelling framework based on the EER and OMT conceptual models and extended relational
database modelling concepts. This tool has been implemented in prototype form mostly in Prolog,
supported by some C language routines embedded with SQL to access and use databases built with
the INGRES DBMS (version 6), the Oracle DBMS (versions 6 and 7) or the POSTGRES O-O data
model (version 4). The Prolog code, which does the main processing and uses X window and Motif
widgets, exceeds 13,000 lines, while the C code embedded with SQL ranges from 100 to 1,000 lines
depending on the DBMS.

7.4.2 System Overview

This section defines the external architecture and operation of CCVES. It covers the design and
structure of its main interfaces, namely database connection (access), database selection (processing)
and user interaction (see figure 7.9). The heart of the system is a meta-management module (MMM)
(see figure 7.10), which processes and manages meta-data using a common internal intermediate
schema representation (cf. section 7.2). A presentation layer which offers display and dialog
windows has been provided for user interaction. The schema visualisation, schema enhancement,
constraint visualisation and database migration modules (cf. figure 7.9) communicate with the user.
Figure 7.9: Principal processes and control flow of CCVES (database access via Connect Database,
database processing via Select Database, and user interaction through the schema enhancement,
schema visualisation, constraint visualisation and database migration modules, with links to external
GUI database tools and a query tool)

The meta-data and knowledge for this system is extracted from the respective database system tables
and stored using a common internal representation (OACS). This knowledge is further processed to
derive the graphical constructs (GOACS) of a visualised conceptual model. Information is
represented in Prolog, as dynamic predicates, to describe facts, and the semantic relationships that
hold between facts, about graphical and textual schema components. The meta-management module
has access to the selected database to store any changes (e.g. schema enhancements) made by the
user. The input/output interfaces of the MMM manage the presentation layer of CCVES, which
consists of X window and Motif widgets used to create an interactive graphical environment for
users.

In section 2.2 we introduced the functionality of CCVES in terms of information flow, with special
emphasis on its external components (cf. figure 2.1). Later, in sections 2.3 and 7.1, we described the
main internal processes of CCVES (cf. figures 2.2 and 7.1). Here, in figure 7.10, we show both the
internal and the external components of CCVES together, with special emphasis on the external
aspect.

7.4.3 System Operation

The system has three distinct operational phases: meta-data access, meta-data processing and user
interaction. In the first phase, the system communicates with the source legacy database service to
extract meta-data[19] and knowledge[20] specifications from the database concerned. This is
achieved when connection to the database (connect database of figure 7.10) is made by the system,
and is the meta-data access phase. In the second phase, the source specifications extracted from the
database system tables are analysed, along with any graphical constructs we may have subsequently
derived, to form the meta-data and meta-knowledge base of the MMM.

[19] Meta-data represents the original database schema specifications of the database.
[20] Knowledge represents subsequent knowledge we may have already added to augment this
database schema.
This information is used to produce a visual representation in the form of a conceptual model. This
phase is known as meta-data processing and is activated when select database (cf. figure 7.10) is
chosen by the user. The final phase is interaction with the user. Here, the user may supply the system
with semantic information to enrich the schema; visualise the schema using a preferred modelling
technique (EER and OMT are currently available); select graphical objects (i.e. classes) and visualise
their properties and intra- and inter-object constraints using the constraint window; and modify the
graphical view of the displayed conceptual model. The user may also incrementally migrate selected
schema constructs; transfer selected meta-data to other tools (e.g. MITRA, a query tool [MAD95]);
accept meta-data from other tools (e.g. REVEERD, a reverse-engineering tool [ASH95]); and
examine the same database using another window of CCVES or other database design tools (e.g.
Oracle*Design). The objective of providing the user with a wide range of design tools is to optimise
the process of analysing the source legacy database. The enhancement of the legacy database with
constraints is an attempt to collect, within the legacy database, information that is managed by
modern DBMSs, without affecting its operation and in preparation for its migration.

Figure 7.10: External Architecture of CCVES (the designer interacts through display and dialog
windows, i.e. the schema displays (OMT/EER), the constraint window and external tools such as
GQL and Oracle*Design; these are served by the meta-management module, whose input/output
interface, meta-knowledge base (OACS and GOACS), meta-processor, meta-transformation,
meta-translation and meta-storage components connect to the underlying heterogeneous distributed
databases)

For successful system operation, users need not be aware of the internal schema representation or of
any non-SQL database-specific syntax of the source or target database. This is because all source
schemas are mapped into our internal representation and are always presented to the user using the
standard SQL language syntax (unless specifically requested otherwise).
This enables the user to deal with the problem of heterogeneity, since at the global level local
databases are viewed as if they come from the same DBMS. The SQL syntax is used by default to
express the associated constraints of a database. If specifically requested, the SQL syntax can be
translated and viewed using the DDL of the legacy DBMS; as far as CCVES is concerned this is just
another meta-translation process. A textual version of the original legacy database definition is also
created by CCVES when connection to the legacy database is established. This definition may be
viewed by the user for a better understanding of the database being modelled.

The ultimate migration process allows the user to employ a single target database environment for
all legacy databases. This will assist in removing the physical heterogeneity between those databases.
The complete migration process may take days for large information systems, as they already hold a
large volume of data. Hence the ability to enhance and migrate while legacy databases continue to
function is an important feature. Our enhancement process does not affect existing operations, as it
involves adding new knowledge and validating existing data. Whenever structural changes are
introduced (e.g. an inheritance hierarchy), we have proposed the use of view tables (cf. section 5.6)
to ensure that normal database operations will not be affected until the actual migration is
commenced. This is because some data items may continue to change while the migration is in
preparation, and indeed during the migration itself. We have proposed an incremental migration
process to minimise this effect, and the use of a forward gateway to deal with such situations.

7.5 External Interfaces of CCVES

CCVES is seen by its users as consisting of four processes, namely a database access process, a
schema and constraint visualisation process, a schema enhancement process, and a schema migration
process. The database access process was described in section 6.5. In the next subsections we
describe the other three processes of CCVES to complete the picture.

7.5.1 Schema and Constraint Visualisation

The input/output interfaces of the MMM manage the presentation layers of CCVES. These layers
consist of display and dialog windows used to provide an interactive graphical environment for
users. The user is presented with a visual display of the conceptual model for a selected database,
and may perform many operations on this schema display window (SDW) to analyse, enhance,
evolve, visualise and migrate any portion of that database. Most of these operations are done via the
SDW as they make use of the conceptual model. The traditional conceptual model is an E-R diagram,
which displays only entities, their attributes and relationships. This level of abstraction gives the user
a basic idea of the structure of a database; however, it is not sufficient to gain a more detailed
understanding of the components of a conceptual model, including the identification of intra- and/or
inter-object constraints. Intra-object constraints for an entity provide extra information that allows
the user to identify behavioural properties of the entity. For instance, the attributes of an entity do not
provide sufficient information to determine the type of data that may be held by an attribute and any
restrictions that may apply to it.
Hence, providing a higher level of abstraction by displaying constraints along with their associated
entities and attributes gives the user a better understanding of the conceptual model. The result is
much more than a static entity and attribute description of a data model, as it describes how the
model behaves for dynamic data (i.e. a constraint implies that any data item which violates it cannot
be held by the database).

The schema visualisation module allows users to view the conceptual schema and the constraints
defined for a database. This visualisation process uses three levels of abstraction. The top level
describes all the entity types along with any hierarchies and relationships. The properties of each
entity are viewed at the next level of abstraction, to increase the readability of the schema. Finally,
all constraints associated with the selected entities and their properties are viewed, to gain a deeper
and better understanding of the actual behaviour of the selected database components. The
conceptual diagrams of our test databases are given in appendix C; these diagrams are at the top level
of abstraction. Figures 7.11 and 7.12 show the other two levels of abstraction.

The graphical schema displayed in the SDW for a selected database uses the OMT notation by
default, which can be changed to EER from a menu. Users can produce any number of schema
displays for the same schema, and thus can visualise a database schema using both OMT and EER
diagrams at the same time (a picture of our system while viewing the same schema using both forms
is provided in figure 7.11). The display size of the schema may also be changed from a menu. A
description that identifies the source of each display is provided, as we are dealing with many
databases in a heterogeneous environment. The diagrams produced by CCVES can be edited to alter
the location of their displayed entities and hence to permit visualisation of a schema in the personally
preferred style and format of an individual user. This is done by selecting and moving a set of
entities within the scrolling window, thus altering the relative positions of entities within the diagram
produced. These changes can be saved by users and are automatically restored for the next session.

The system allows interactive selection of entities and attributes from the SDW. We initially do not
include any attributes as part of the displayed diagram, because we provide them as a separate level
of abstraction. A list of the attributes associated with an entity can be viewed by first selecting the
entity from the display window (abstraction at level 2), and then browsing through its attributes in
the attribute window, which is placed just below the display window (see figure 7.12). Any attribute
selected from this window will automatically be transferred to the main display window, so that only
attributes of interest are displayed there. This technique increases the readability of our display
window. At each stage, appropriate messages produced by the system display unit are shown at the
bottom of this window. For successful interactions, graphical responses are used whenever
applicable; for example, when an entity is selected by clicking on it, the selected entity is
highlighted. In this thesis we do not provide interaction details, as these are provided at the user
manual level. The 'browser' menu option of the SDW invokes the browser window.
This is done only when a user wishes to visit the constraint visualisation abstraction level, our third
level of abstraction. Here we see all entities and their properties of interest as the default option, but
we can expand this to display other entities and properties by choosing appropriate menu options
from the browsing window.
We can also filter the displayed constraints to include only those of interest (e.g. show only domain
constraints). In cases where inherited attributes are involved, the system will initially show only
those attributes owned by an entity (the default option); others can be viewed by selecting the
required level of abstraction (available in the menu) for the entity concerned. The ability to select an
entity from the display window and display its properties in the browsing window satisfies our
requirement of visualising intra-object constraints. The reverse of this process, i.e. selecting a
constraint from the browsing window and displaying all its associated entities in the display window,
satisfies our inter-object constraint visualisation requirement. Both of these facilities are provided by
our system (see figures 7.11 and 7.12, respectively).

All operations with the mouse device are done using the left button, except when altering the
location of an entity of a displayed conceptual model. We have allowed the use of the middle button
of the mouse to select, drag and drop[21] such an entity. This process alters the position of the entity
and redraws the conceptual model. By this means, object hierarchies, relationships, etc., can be made
prominent by placing the objects concerned in hierarchies close to each other. This feature was
introduced firstly to allow users to visualise a conceptual model in a preferred manner, and secondly
because our display algorithm was not capable of automatically providing such features when
constructing complex schemas having many entities, hierarchies and relationships (cf. section 7.3).

[21] CCVES changes the cursor symbol to confirm the activation of this mode.
7.5.2 Schema Enhancement

Schema enhancements are also done via the schema display window. This module is mainly used to
specify dynamic constraints. Such constraints usually have to be extracted from the legacy code of a
database application, as in older systems they were specified in the application itself. Constraint
extraction from legacy applications is not supported by CCVES, so this information must be
obtained by other means. We assume that such constraints can be identified by examining the legacy
code, by using any existing documentation or by drawing on user knowledge of the application, and
we have introduced this module to capture them. We have also provided an option to detect possible
missing, hidden and redundant information (cf. section 5.5.2) to assist users in formulating new
constraints.

The user specifies constraints via the constraint enhancement interface by choosing the constraint
type and the associated attributes. In the case of a check constraint, the user needs to specify it using
SQL syntax. The constraint enhancement process allows further constraints to appear in the
graphical model. This development is presented via the graphical display, so that the user is aware of
the existing links and constraints present for the schema. For instance, when a foreign key is
specified, CCVES will try to derive a relationship using this information; if this process is
successful, a new relationship will appear in the conceptual model.

A graphical user interface in the form of a pop-up sub-menu is used to specify constraints, which
take the form of integrity constraints (e.g. primary key, foreign key, check constraints) and structural
components (e.g. inheritance hierarchies, entity modifications). In figure 7.13 we present some pop-
up menus of CCVES which assist users in specifying various types of constraints. Here, the names of
entities and their attributes are automatically supplied via pull-down menus to ensure the validity of
certain input components of user-specified constraints. For all constraints, information about the type
of constraint and the class involved is specified first. When the type of constraint is known, the prior
existence of such a constraint is checked in the case of primary key and foreign key constraints. For
primary keys, the process will not proceed if a key specification already exists; for foreign keys, if
the attributes already participate in such a relationship they will not appear in the referencing
attribute specification list. In the case of referenced attributes, only attributes with at least the
uniqueness property will appear, in order to prevent the specification of invalid foreign keys.

All enhanced constraints are stored internally until they are added to the database using another
menu option. Prior to this augmentation process the user should verify the validity of the constraints.
In the case of recent DBMSs like Oracle, invalid constraints will be rejected automatically and the
user will be requested to amend or discard them. In such situations the incorrect constraints are
reported to the user and are stored on disk as a log file. Also, no changes made during a session are
saved until specifically instructed by the user. This gives the opportunity to rollback (in the event of
an incorrect addition) and resume from the previous state.
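To illustrate the relationship derivation step mentioned above, the following hedged sketch derives a
new relationship fact from a user-specified foreign key, using the constraint and relationship
constructs of figures 7.2 and 7.4. For the example we assume that the foreign key expression is held
as a structured term references(Attributes, Entity) and that an unclassified relationship defaults to
1:N; the actual encoding and the classification rules of section 5.4 differ.

    % Hedged sketch: deriving a relationship from a newly specified foreign key.

    :- dynamic constraint/6, relationship/5.

    derive_relationship(SchemaId, Referencing) :-
        constraint(SchemaId, Referencing, Attrs, foreign_key, _Name,
                   references(Attrs, Referenced)),
        \+ relationship(SchemaId, Referencing, _, _, Referenced),   % not already derived
        assertz(relationship(SchemaId, Referencing, '1:N', unspecified, Referenced)).

    % e.g. after the DBA adds a foreign key on Office(inCharge):
    % ?- derive_relationship('Uni_db', 'Office').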
Figure 7.13: Two stages of a Foreign Key constraint specification

Input data in the form of constraints to enhance the schema provides new knowledge about a
database. It is essential to retain this knowledge within the database itself if it is to be used for any
future processing. CCVES achieves this using its knowledge augmentation process, as described in
section 5.7. From a user's point of view this process is fully automated and hence no intermediate
interactions are involved. The enhanced knowledge is augmented only if the database is unable to
represent the new knowledge naturally. Such knowledge cannot be automatically enforced except via
a newer version of the DBMS or another DBMS (if supported), by migrating the database. However,
the data held by the database may already fail to conform to the new constraints, and hence existing
data may be rejected by the target DBMS. This would result in loss of data and/or migration delays.
To address this problem, we provide an optional constraint enforcement process which checks the
conformance of the data to the new constraints prior to migration.

7.5.3 Constraint Enforcement and Verification

Constraint enforcement is automatically managed only by relatively recent DBMSs. If CCVES is
used to enhance a recent DBMS such as Oracle, then verification and enforcement will be handled
by the DBMS, as CCVES will simply create the constraints using the DDL commands of that
DBMS. However, when such features are not supported by the underlying DBMS, CCVES has to
provide such a service itself. Our objective in this process is to give users the facility to ensure that
the database conforms to all the enhanced constraints.

Constraints are chosen from the browser window to verify their validity. Once selected, the
constraint verification process will issue each constraint to the database using the technique
described in section 5.8 and report any violations to the user. When a violated constraint is detected,
the user can decide whether to keep or discard it. The user could decide to retain the constraint in the
knowledge-base for various reasons, including: ensuring that future data conforms to the constraint;
providing users with a guideline to the system's data contents irrespective of violations that may
occur occasionally; and assisting the user in improving the data or the constraint. To enable the
enforcement of such constraints for future data instances, it is necessary either to use a trigger (e.g.
an on-append check constraint) or to add a temporal component to the constraint (e.g. system date >
constraint input date). This ensures that the constraint is not enforced on existing data.
When using queries to verify enhanced constraints, the retrieved data are the instances that violate a
constraint. In such a situation, retrieving a large number of instances for a given query does not make
much sense, as it may be due to an incorrect constraint specification rather than to the data itself.
Therefore, in the event of an output exceeding 20 instances, we terminate the query and instruct the
user to inspect this constraint as a separate action.

7.5.4 Schema Migration

The migration process is provided to allow an enhanced legacy system to be ported to a new
environment. This is performed incrementally, by initially creating the schema in the target DBMS
and then copying the legacy data to the target system. To create the schema in the target system,
DDL statements are generated by CCVES. An appropriate schema meta-translation process is
performed if required (e.g. if the target DBMS has a non-SQL query language). The legacy data is
migrated using the import/export tools of the source and target DBMSs.

The migration process is not fully automated, as certain conflicts cannot be resolved without user
intervention. For example, if the target database only accepts names of length 16 (as in Oracle)
instead of 32 (as in INGRES) in the source database, then a name resolution process must be
performed by the user. Also, names used in one DBMS may be keywords in another. Our system
resolves these problems by adding a tag to such names or by truncating the length of a name (a
sketch of this step is given at the end of this section). This approach is not generic, as the uniqueness
property of an attribute cannot be maintained by truncating its original name; in these situations user
involvement is unavoidable.

Although CCVES has been tested for only three types of DBMS, namely INGRES, POSTGRES and
Oracle, it could easily be adapted for other relational DBMSs, as they represent their meta-data
similarly - i.e. in the form of system tables, with minor differences such as table and attribute names
and some table structures. Non-relational database models accessible via ODBC or other tools (e.g.
Data Extract for DB2, which permits movement of data from IMS/VS, DL/1, VSAM and SAM to
SQL/DS or DB2) could also be accommodated, as the meta-data required by CCVES could be
extracted from them. Previous work related to meta-translation [HOW87] investigated the translation
of dBase code to INGRES/QUEL, demonstrating the applicability of this technique in general, not
only to the relational data model but also to others such as the CODASYL and hierarchical data
models. This means CCVES is capable in principle of being extended to cope with other data
models.
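The following hedged sketch illustrates the name-resolution step described above for the schema
migration process. The predicate names, the keyword list and the tagging scheme are our own
illustrative assumptions; the actual implementation covers further conflicts.

    % Hedged sketch: resolving identifier names for a target DBMS by tagging
    % keyword clashes and truncating names that exceed the target's length limit.

    target_keyword(oracle_v7, level).
    target_keyword(oracle_v7, size).

    resolve_name(TargetDbms, MaxLen, Name, Resolved) :-
        ( target_keyword(TargetDbms, Name) ->
            atom_concat(Name, '_t', Tagged)            % tag a name that clashes with a keyword
        ;   Tagged = Name
        ),
        atom_length(Tagged, Len),
        ( Len > MaxLen ->
            sub_atom(Tagged, 0, MaxLen, _, Resolved)   % truncate to the target's limit
        ;   Resolved = Tagged
        ).

    % e.g. ?- resolve_name(oracle_v7, 16, undergraduateRegistrationNo, N).
    % N = undergraduateReg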
CHAPTER 8

Evaluation, Future Work and Conclusion

In this chapter the Conceptualised Constraint Visualisation and Enhancement System (CCVES) described in Chapters 5, 6 and 7 is evaluated with respect to our hypotheses and the objectives listed in Chapter 1. We describe the functionality of the different components of CCVES to identify their strengths and summarise their limitations. Potential extensions and improvements are considered as part of future work. Finally, conclusions about the work are drawn by reviewing the objectives and the evaluation.

8.1 Evaluation

8.1.1 System Objectives and Achievements

The major technical challenge in designing CCVES was to provide an interactive graphical environment for accessing and manipulating legacy databases within an evolving heterogeneous distributed database environment, for the purpose of analysing, enhancing and incrementally migrating legacy database schemas to modern representations. The objective of this exercise was to enrich a legacy database with valuable additional knowledge that has many uses, without being restricted by the existing legacy database service and without affecting the operation of the legacy IS. This knowledge is in the form of constraints that can be used to understand and improve the data of the legacy IS. Here, we assess the main external and internal aspects of our system, CCVES, based on the objectives laid out in sections 1.2 and 2.4. Externally, CCVES performs three important tasks: initially, a reverse-engineering process; then, a knowledge augmentation process, which is a re-engineering process applied to the original system; and finally, an incremental migration process. The reverse-engineering process is fully automated and is generalised to deal with the problems caused by heterogeneity.

a) A framework to address the problem of heterogeneity

The problems of heterogeneity have been addressed by many researchers, and at Cardiff the meta-translation technique has been used successfully to demonstrate its wide-ranging applicability to heterogeneous systems. This previous work, which includes query meta-translation [HOW87], schema meta-translation [RAM91] and schema meta-integration [QUT93], was developed using Prolog, emphasising its suitability for meta-data representation and processing. Hence Prolog was chosen as the main programming language for the development of our system. Among the available Prolog implementations, we found that Quintus Prolog was well suited to supporting an interactive graphical environment, as it provided access to X window and Motif widget routines. Also, the PRODBI tools [LUC93] were available with Quintus Prolog, and these enabled us to access a number of relational DBMSs directly, such as INGRES, Oracle and SYBASE.
Our framework for meta-data representation and manipulation has been described in section 7.2. The meta-programming approach enabled us to implement many other features, such as the ability to easily customise our system for different data models, e.g. relational and object-oriented (cf. section 7.2.1), the ability to easily enhance or customise it for different display models, e.g. E-R, EER and OMT (cf. section 7.2.4), and the ability to deal with heterogeneity due to differences in local databases (e.g. at the global level the user views all local databases as if they come from the same DBMS, and is also able to view databases using a preferred DDL syntax).

b) An interactive graphical environment

An interactive graphical environment which makes extensive use of modern graphical user interface facilities was required to provide graphical displays of conceptual models and allow subsequent interaction with them. To fulfil these requirements the CCVES software development environment had to be based on a GUI sub-system consisting of pop-up windows, pull-down menus, push buttons, icons, etc. We selected X window and Motif widgets to build such an environment on a UNIX platform, using SunSparc workstations. Provision of interactive graphical responses when working via this interface was also included to ensure user friendliness (cf. section 7.5).

c) Ability to access and work on legacy database systems

An initial, basic facility of our system was the ability to access legacy database systems. This process, which is described in section 6.5, enables users to specify and access any database system over a network. Here, as the schema information is usually static, CCVES has been designed to give the user the option of by-passing the physical database access process and using instead an existing (already accessed) logical schema. This saves time once the initial access to a schema has been made. It also guarantees access to the meta-data of previously accessed databases during server and network breakdowns, which were not uncommon during the development of our system.

d) A framework to perform the reverse-engineering process

A framework to perform the reverse-engineering process for legacy database systems has been provided. This process is based on applying a set procedure which produces an appropriate conceptual model (cf. section 5.2). It is performed automatically even if there is very limited meta-knowledge. In such a situation, links that should be present in the conceptual model will not appear in the corresponding graphical display. Hence, the full success of this process depends on the availability of adequate meta-knowledge. This means that a real world data modelling framework that facilitates the enhancement of legacy systems must be provided, as described next.

e) A framework to enhance existing systems

A comprehensive data modelling framework that facilitates the enhancement of established database systems has been provided (cf. section 5.6). A method of retaining the
enhanced knowledge for future use, in line with current international standards, is employed. Techniques used in recent versions of commercial DBMSs are supported, to enable legacy databases to logically incorporate modern data modelling techniques irrespective of whether these are supported by their legacy DBMSs (cf. section 5.7). This enhancement facility gives users the ability to exploit existing databases in new ways (i.e. restructuring and viewing them using modern features even when these are not supported by the existing system). The enhanced knowledge is retained in the database itself so that it is readily available for future exploitation by CCVES or other tools, or by the target system in a migration.

f) Ability to view a schema using preferred display models

The original objective of producing a conceptual model as a result of our reverse-engineering process was to display the structure of databases in a graphical form and so make it easier for users to comprehend their contents. As not all users are necessarily familiar with the same display model, the facility to visualise a schema using a user-preferred display model (e.g. EER or OMT) has been provided. This is more flexible than our original aim.

g) High level of data abstraction for better understanding

A high level of data abstraction for most components of a conceptual model (i.e. visualising the contents, relationships and behavioural properties of entities and constraints, including identification of intra- and inter-object constraints) has been provided (cf. section 7.5.1). Such features are not usually incorporated in visualisation tools. These features and various other forms of interaction with conceptual models are provided via the user interface of CCVES.

h) Ability to enhance the schema and to verify the database

The schema enhancement process was provided originally to enrich a legacy database schema and its resultant conceptual model. A facility to determine the constraints on the information held, and the extent to which the legacy data conforms to these constraints, is also provided to enable the user to verify their applicability (section 5.7). The graphical user interface components used for this purpose are described in section 7.5.2.

i) Ability to migrate while the system continues to function

The ability to enhance and migrate while a legacy database continues to function normally was considered necessary, as it ensures that this process will not affect the ongoing operation of the legacy system (section 5.8). The ability to migrate to a single target database environment for all legacy databases assists in removing the physical heterogeneity between these databases. Finally, the ability to integrate CCVES with other tools, to maximise the benefits to the user community, was also provided (section 7.4.3).

8.1.2 System Development and Performance

A working prototype CCVES system that enabled us to test all the contributions of this research was implemented using Quintus Prolog with the X window and Motif libraries; the INGRES,
Oracle and POSTGRES DBMSs; the C programming language with embedded SQL and POSTQUEL; and the PRODBI interface to INGRES. This system can be split into four parts, namely: the database access process, to capture meta-data from legacy databases; the mapping of the meta-data of a legacy database to a conceptual model, to present the semantics of the database in a graphical environment; the enhancement of a legacy database schema with constraint-based knowledge, to improve its semantics and functionality; and the incremental migration of the legacy database to a target database environment.

Our initial development commenced using POPLOG, which was at that time the only Prolog version with any graphical capabilities available on UNIX workstations at Cardiff. Our initial exposure to X window library routines occurred at this stage. Later, with the availability of Quintus Prolog, which had more powerful graphical capabilities due to its support of X windows and Motif widgets, it was decided to transfer our work to this superior environment. To achieve this we had to make two significant changes, namely: converting all POPLOG graphic routines to Quintus equivalents, and modifying a particular implementation approach we had adopted when working with POPLOG. The latter took advantage of POPLOG's support for passing unevaluated expressions as arguments of Prolog clauses; in Quintus Prolog we had to evaluate all expressions before passing them as arguments.

Due to the use of slow workstations (i.e. SPARC1s) and to running Prolog interactively, there was a delay in most interactions with our original system. This delay was significant (e.g. nearly a minute) when the conceptual model had to be redrawn. It was necessary to redraw this model whenever an object of the display was moved to change its location, and whenever the drawing window was exposed. Such exposure occurred when the window's position changed, when it was overlapped by another window or a menu, or when the user clicked on the window. In such situations it was necessary to refresh the drawing window by redrawing the model. Redrawing was required because our initial attempt at producing a conceptual model was based solely on drawing routines. This method was inefficient, as the drawings had to be redone every time the drawing window became exposed.

Our second attempt was to draw conceptual models in the background using a pixmap. This approach allocates part of the memory of the computer so that an image can be drawn and retained directly. A pixmap can be copied to any drawing window without having to reconstruct its graphical components. This means that when the drawing window becomes exposed it is possible to copy this pixmap to that window without redrawing the conceptual model. Copying a pixmap to the drawing window took only a few seconds, a significant improvement over our original method. However, with this new approach, whenever a move operation was performed it was still necessary to recompute all graphical settings and redraw, which took about as long as the original method. The use of a pixmap also took up a significant part of the computer's memory, and as a result Quintus was unable to cope if there was a need to view multiple conceptual models simultaneously. We also experienced several instances of unusual system behaviour, such as failure to execute routines that had been tested previously.
This was due to Prolog fully utilising its run-time memory because of the existence of this pixmap. We noticed that Quintus Prolog had a bug whereby it could not release the memory used by a pixmap. In order to regain this memory we
had to log out (exit) from the workstation, as the xnew process responsible for garbage collection was unable to deal with this case. Hence, we decided to use widgets instead of drawn rectangles for entities, as widgets are managed automatically by the X window and Motif routines. This allowed us to reduce the number of drawing components in our conceptual model and hence to minimise redrawing time when the drawing window became exposed. We discarded the pixmap approach as it gave us many problems. However, as widgets themselves take up memory, their behaviour under some complex conceptual models is questionable. We decided not to test this in depth, as we had already spent too much time on this module and its feasibility had been demonstrated satisfactorily.

During the course of CCVES development, Quintus Prolog was upgraded from release 3.1 to 3.1.4. Due to incompatibilities between the two versions, certain routines of our system had to be modified to suit the new version, which meant that a full test of the entire system was required. Also, since three versions of INGRES, two versions of Oracle and one of POSTGRES were used during our project, more and more system testing was required. Thus, we experienced several changes to our system due to technological changes in its development environment. Comparing the lifespan and scale of our project with those of a legacy IS, we can more clearly appreciate the amount of change that is required for such systems to keep up with technological progress and business needs. Hence, the migration of any IS is usually a complex process. However, the ability to enhance and evolve such a system without affecting its normal operation is a significant step towards assisting this process.

Our final task was to produce a compiled version of our system. This is still being undertaken: although we have been able to produce executable code, some user interface options are not being activated for unknown reasons (we suspect insufficient memory), even though the individual modules work correctly.

8.1.3 System Appraisal

The approach presented in this thesis for mapping a relational database schema to a conceptual schema is in many ways simpler and easier to apply than previous attempts, as it has eliminated the need for any initial user interaction to provide constraint-based knowledge for this process. Constraint information such as primary and foreign keys is used to derive the entity and relationship types automatically. Use of foreign key information was not considered in previous approaches, as most database systems did not support such facilities at that time. One major contribution of our work is providing the facility for specifying and using constraint-based information in any type of DBMS. This means that once a database is enhanced with constraints, it is semantically richer. If the source DBMS does not support constraints, the conceptual model will still be enhanced, and our tool will augment the database with these constraints in an appropriate form. Another innovative feature of our system is the automated use of the DML of a database to determine the extent to which its data conforms to the enhanced constraints. This enables users to take appropriate compensatory actions prior to migrating legacy databases.
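To illustrate the kind of meta-data this derivation relies on, the following sketch queries Oracle's data dictionary for foreign key constraints and the keys they reference. It is illustrative only: the catalogue views and column names differ between DBMSs and versions, which is precisely the heterogeneity that CCVES's meta-data access layer is designed to hide.

    -- Illustrative only (Oracle data dictionary): each foreign key ('R')
    -- is paired with the table whose primary or unique key it references.
    -- In conceptual modelling terms, each such pair gives rise to a
    -- relationship between two entity types.
    SELECT fk.table_name      AS child_table,
           fk.constraint_name AS foreign_key,
           pk.table_name      AS parent_table
    FROM   user_constraints fk,
           user_constraints pk
    WHERE  fk.constraint_type   = 'R'
    AND    fk.r_constraint_name = pk.constraint_name;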
We provided an additional level of schema abstraction for our conceptual models, in the form of viewing the constraints associated with a schema. This feature allows users to gain a better understanding of databases. The facility to view multiple schemas allows users to compare different components of a global system if it comprises several databases, which is very useful when dealing with heterogeneous databases. We also deal with heterogeneity at the conceptual viewing stage by providing users with the facility to view a schema using their preferred modelling notation. For example, in our system the user can choose either an EER or an OMT display to view a schema. This ensures greater accuracy of understanding, as the user can select a familiar modelling notation when viewing database schemas. The display of the same schema in multiple windows at different scales allows the user to focus on a small section of the schema in one window while retaining a larger view in another. The ability to view multiple schemas also means that it is possible to jointly monitor the progress or status of the source and target databases during an incremental migration.

The introduction of both EER and OMT as modelling options means that recent advances which were not present in the original E-R model, and in some subsequent variants, can be represented using our system. Our approach of augmenting the database itself with new semantic knowledge, rather than using separate specialised knowledge-bases, means that the enhanced knowledge is accessible to any user or tool using the DML of the database. This knowledge is represented in the database using an extended version of the SQL-3 standards for constraint representation, so it will be compatible with future database products, which should conform to the new SQL-3 standards. Also, no semantics are lost in the mapping from a conceptual model to a database schema. Oracle version 6 provided similar functionality by allowing constraints to be specified even though they could not be applied until the introduction of version 7.

8.1.4 Useful real-life Applications

We were able to successfully reverse-engineer a leading telecommunications database extract consisting of over 50 entities. This enabled us to test our tool on a scale greater than that of our test databases. The successful use of all or parts of our system in other research work, namely: accessing POSTGRES databases for semantic object-oriented multi-database access [ALZ96], viewing heterogeneous conceptual schemas for graphical query interfaces [MAD95], and viewing heterogeneous conceptual schemas via the world wide web (WWW) [KAR96], indicates its general usefulness and applicability. The display of conceptual models can be of use in many areas, such as database design, database integration and database migration. We can identify similar areas of use for CCVES. These include training new users by allowing them to understand an existing system, and enabling users to experiment with possible enhancements to existing systems.

8.2 Limitations and possible future Extensions

There are a number of possible extensions that could be incorporated to improve the current functionality of our system. Some of these are associated with improving run-time efficiency, accommodating a wider range of users and extending graphical user interaction
capabilities. Such extensions would not have great significance with respect to demonstrating the applicability of our fundamental ideas. Examples of improvements are: engineering the system to the level of a commercial product so that it could be used by a wide range of users with minimal training; improving the run-time efficiency of the system by producing a compiled version; testing it in a proper distributed database environment, as our test databases did not emphasise distribution; extending the graphical display options to offer other conceptual models, such as ECR; extending the system to enable us to test migrations to a proper object-oriented DBMS (i.e. not only to an extended relational DBMS with O-O features, like POSTGRES); and improving the display layout algorithm (cf. section 7.3) to manage large database schemas efficiently. The time scale for such improvements would vary from a few weeks to many months each, depending on the work involved.

Our system is designed to cope with two important extensions: extending the graphical display option to offer other forms of conceptual model, and extending the number of DBMSs and DBMS versions it can support. Of these two, the least work is involved in supporting a new graphical display. Here, the user needs to identify the notations used by the new display and write the necessary Prolog rules to generate the strings and lines used for the drawings. This process should take at most one week, as the graphical constructs such as class_info and ref_info (cf. section 7.2.4) do not change to support different display models. On the other hand, inclusion of a new relational DBMS or version can take a few months, as it affects three stages of our system: meta-data access, constraint enforcement and database migration. All three stages use the query language (SQL) of the DBMS and hence, if this deviates from standard SQL, we will need to expand our QMTS. The time required for such an extension will depend on how similar the language is to standard SQL, and may take 2-4 person weeks. Next, we need to assess the constraint handling features supported by the new DBMS so that we can use our knowledge-based tables to overcome any constraint representation limitations; this may take 1-2 person weeks. To access the meta-data of a database it is necessary to know the structures of its system tables, and we need a mechanism to access this information externally (i.e. use an ODBC driver or write our own). This stage can take 1-6 person weeks, as in many cases system documentation will be inadequate. Inclusion of a different data model would be a major extension, as it affects all stages of our system. It would require provision of new abstraction mechanisms, such as parent-child relationships for a hierarchical model and owner-member relationships for a network model.

Other possible extensions are concerned with incorporating software modules that would expand our approach. These include a forward gateway for use at the incremental migration stage; an integration module for merging related database applications; and analysers for extracting constraint-based information from legacy IS code. These are important and major areas of research, hence the development of such modules could take from many months to years in each case.
8.3 Conclusion

8.3.1 Overall Summary

This thesis has reported the results of a research investigation aimed at the design and implementation of a tool for enhancing and migrating heterogeneous legacy databases.
In the first two chapters we introduced our research and its aims and objectives. Then in chapter 3, we presented some preliminary database concepts and standards relevant to our work. In chapters 4 and 5, we introduced wider aspects of our problem and studied alternative ways that have been proposed to solve major parts of it. Many important points emerged from this study. These include: the application of meta-translation techniques to deal with legacy database system heterogeneity; the application of migration techniques to specific components of a database application (i.e. the database service) as opposed to an IS as a whole; extending the application of database migration beyond the traditional COBOL-oriented and IBM database products; the application of a migration approach to distributed database systems; enhancing previous re-engineering approaches to incorporate modern object-oriented concepts and multi-database capabilities; and introducing semantic integrity constraints into legacy database systems and hence exploring them beyond their structural semantics.

In chapter 5, we described our re-engineering approach and explained how we accomplished our goals in enhancing and preparing legacy databases for migration, while chapter 6 was concerned with testing our ideas using carefully designed test databases; chapter 6 also provided illustrative examples of our system at work. In chapter 7, we described the overall architecture and operation of the system, together with related implementation considerations, and gave some examples of our system interfaces. In chapter 8, we have carried out a detailed evaluation covering research achievements, limitations and suggestions for possible future extensions. We have also looked at some real-life areas of application in which our prototype system has been tested and/or could be used. Finally, the major conclusions that can be drawn from this research are presented below.

8.3.2 Conclusions

The important conclusions that can be drawn from the work described in this thesis are as follows:

• Although many approaches have been proposed for mapping relational schemas to a form in which their semantics can be more easily understood by users, they either lack the application of modern modelling concepts or have been applied to logically centralised or decentralised database schemas, not to physically heterogeneous databases.

• Previously proposed approaches for mapping relational schemas to conceptual models involve user interactions and pre-requisites. This is confusing for first-time users of a system, as they have no prior direct experience or knowledge of the underlying schema. We produce an initial conceptual model automatically, prior to any user interaction, to overcome this problem. Our user interaction commences only after the production of the initial conceptual model, which gives users the opportunity to gain some vital basic understanding of a system before any serious interaction with it.

• Most previous reverse-engineering tools have ignored an important source of database semantics, namely semantic integrity constraints such as foreign key definitions. One obvious reason for this is that many existing database systems do not support the representation of such semantics. We have identified and shown the important contribution that semantic integrity constraints can make, by presenting them and applying them to the conceptual and physical models.
We have also successfully incorporated them into legacy database systems which do not directly support such semantics.
• The problem of legacy IS migration has not been studied for multi-database systems in general, and this appears to present many difficulties to users. We have tested and demonstrated the use of our tools with a wide range of relational and extended relational database systems.

• The problem of legacy IS migration has not been studied for more recent and modern systems; as a result, ways of eliminating the need for migration have not yet been addressed. Our approach of enhancing legacy ISs irrespective of their DBMS type will assist in redesigning modern database applications, and hence will overcome the need to migrate such applications in many cases.

• Our evaluation has concluded that most of the goals and objectives of our system, presented in sections 1.2 and 2.4, have been successfully met or exceeded.