3620121datastructures.ppt

Data Structures, File Organization, Physical Data Design

Data structures and file organisation refer to the methods of organising the data in a database.
They primarily deal with the physical storage of data, which is significant for retrieving,
storing and re-organising data in a database.
Common data structures include linked lists, inverted lists, B-trees and hash tables, among others.
Data structures can be used to build data files (a data file, or file, is a collection of many similar
records), and file organisation determines the access methods for the file.
File organisation (or file structure) is a combination of representations for data in files and of
operations for accessing the data. A file structure allows applications to read, write, and modify
data.

Why Data Structure
The key factor in designing data structures and file organisation is the relatively slow speed of
hard disks and the large amount of time required to get information from a disk.
All the data structures and file organisation designs focus on minimising disk accesses and
maximising the likelihood that the information the user will want is already in the memory.
The constraint related to disk access is generally referred to as I/O bottleneck.
Accessing information using multiple trips to the disk greatly slows down the access time.
Ideally, we should get the information we need with one access to the disk or with as few
accesses as possible.

FILES
As files grew intolerably large for unaided sequential access, indexes were added to the files.
The INDEXES made it possible to keep a list of keys and pointers in a small file that could be
searched more quickly. With the keys and pointers the user had direct access to the large,
primary file.
However, as the indexes grew, they too became difficult to manage, especially for dynamic files
in which the set of keys changes.
Then, in the early 1960s, the idea of data structures emerged.

Memory
Primary storage: Pertains to storage media used by Central Processing Unit (CPU) i.e., the main
memory and also the cache memory.
The primary storage memory also called RAM (Random Access Memory) provides fast access to
data and is volatile i.e., loses its content in case of a power outage.
Secondary storage: Includes magnetic disks, optical disks and tapes. Secondary storage memory
provides slower access to data than RAM.

Static RAM (SRAM) which is cache memory is used by CPU to speed up execution of
programmes while Dynamic RAM (DRAM) provides the main work area for CPU.
Flash memory, which is non-volatile and is a form of EEPROM (Electrically Erasable
Programmable Read-Only-Memory), has access speed and performance between
DRAM and magnetic disks.
CD-ROM (Compact Disk Read-Only-Memory) disks store data optically and are read
by a laser.
WORM (Write-Once-Read-Many) disks are used for archiving data and allow data
to be written once and read any number of times.
DVD (Digital Video Disk) – a type of optical disk allows storage of four to fifteen
gigabytes of data per disk.
Magnetic tapes are used for archiving and back-up storage and are becoming
popular as tertiary storage to hold terabytes of data.
Juke boxes (optical and tape) are employed to use arrays of CD-ROMs and tapes.

RAID Technology (Redundant Array of Inexpensive/Independent Disks)
The main goal of RAID is to even out the widely different rates of performance
improvement of disks against those in memory and microprocessors
In addition to improving performance, RAID is also used to improve reliability by
storing redundant information on disks. One technique for introducing
redundancy is called mirroring. Data is written redundantly to two identical
physical disks that are treated as one logical disk. If a disk fails, the other is used
until the first is repaired
The problem of speed and access time is overcome by using a large array of
small independent disks acting as a single high-performance logical disk. A
concept called data striping is used, which utilises parallelism to improve disk
performance
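
A minimal sketch of the two RAID ideas above, data striping and mirroring. The function names and the round-robin block placement are illustrative assumptions, not the layout of any particular RAID level or controller.

```python
def locate_block(logical_block, num_disks):
    """Striping: map a logical block number to (disk index, block offset on that disk).

    Assumes simple round-robin placement, so consecutive logical blocks
    land on different disks and can be transferred in parallel.
    """
    return logical_block % num_disks, logical_block // num_disks


def mirrored_write(disks, block_no, data):
    """Mirroring: write the same block to every physical disk in the mirror set."""
    for disk in disks:
        disk[block_no] = data


def mirrored_read(disks, block_no, failed=frozenset()):
    """Read from the first disk that has not failed."""
    for i, disk in enumerate(disks):
        if i not in failed:
            return disk[block_no]
    raise IOError("all mirrors have failed")


# Two dictionaries stand in for two identical physical disks.
mirror = [{}, {}]
mirrored_write(mirror, 7, b"payroll")
```

With four disks, logical blocks 0, 1, 2 and 3 fall on four different disks, which is where the parallelism comes from; with mirroring, a read still succeeds when one of the two disks is marked as failed.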

DATA STRUCTURES
The term Data Structure refers to the manner in which relationships between data elements are
represented in the computer system. Organisation of indexes, representation of stored fields,
physical sequence of stored records, etc., are included in the purview of data structures. Thus,
an understanding of data structures is important in gaining an understanding of database
management systems
There are three major types of data structures : linked lists (indexes), inverted lists (indexes) and
B-trees

Indexes
An index is a file in which each entry (record) consists of a data value together with one or more
pointers (physical storage addresses). The data value is a value for some field of the indexed file
(the indexed field) and pointers identify records in the indexed file having that value for that
field.
An index can be used in two ways.
First, it can be used for sequential access to the indexed file i.e. access according to the values of
the indexed field by imposing an ordering of the indexed file.
Second, it can also be used for direct access to individual records in the indexed file, given a
value for the indexed field.
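
The two uses of an index can be sketched as follows. The sample records, the field names and the helper functions are hypothetical; record addresses are modelled as plain integers rather than physical storage addresses.

```python
import bisect

# Hypothetical indexed file: record address -> record.
records = {
    100: {"name": "Asha", "city": "Pune"},
    101: {"name": "Ben", "city": "Delhi"},
    102: {"name": "Chen", "city": "Pune"},
}


def build_index(file, field):
    """Return index entries (data value, [pointers]) sorted by data value."""
    entries = {}
    for address, record in file.items():
        entries.setdefault(record[field], []).append(address)
    return sorted(entries.items())


def sequential_access(index, file):
    """Visit the records in the order imposed by the indexed field."""
    for _value, pointers in index:
        for address in pointers:
            yield file[address]


def direct_access(index, file, value):
    """Fetch only the records whose indexed field equals the given value."""
    keys = [v for v, _ in index]
    pos = bisect.bisect_left(keys, value)
    if pos < len(keys) and keys[pos] == value:
        return [file[address] for address in index[pos][1]]
    return []


city_index = build_index(records, "city")
```

Here `city_index` holds one entry per distinct city, each with the addresses of the records having that city.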

Primary Index, Clustering Indexes, Secondary Indexes
A primary index is a file that contains a sorted sequence of records having
two columns: the ordering key field and a block address.
An entry in the primary index file contains the index value of the first record of the data block and a
pointer to that data block.
The ordering key field for this index can be the primary key of the data file.
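
A small sketch of a primary index lookup, assuming a data file of three blocks held in memory. The sample data and the `lookup` function are illustrative; each index entry holds the key of the first record in a block together with that block's number, which stands in for the block address.

```python
import bisect

# Hypothetical data file: blocks of (key, value) records, sorted on the ordering key.
blocks = [
    [(1, "a"), (4, "b"), (7, "c")],
    [(10, "d"), (12, "e"), (15, "f")],
    [(20, "g"), (23, "h")],
]

# One index entry per block: (key of the first record in the block, block number).
primary_index = [(block[0][0], block_no) for block_no, block in enumerate(blocks)]


def lookup(key):
    """Find the one block that could hold the key, then search inside it."""
    first_keys = [k for k, _ in primary_index]
    pos = bisect.bisect_right(first_keys, key) - 1  # last entry with first key <= key
    if pos < 0:
        return None  # key is smaller than every key in the file
    _, block_no = primary_index[pos]
    for k, value in blocks[block_no]:
        if k == key:
            return value
    return None
```

Because the index has one entry per block rather than one per record, it stays much smaller than the data file, and a lookup touches a single data block.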

Clustering Indexes
An index created on a file whose records are physically ordered on a non-key field (that is, a
field that does not have a distinct value for each record) is called a clustering index.

Secondary Index
A secondary index is a file whose records contain a secondary index field value,
which is not the ordering field of the data file, and a pointer to the block that
contains the data record.

A simple linked list is a chain of pointers embedded in records. It indicates either a record sequence for
an attribute other than the primary key or all the records with a common property. With a linked list,
any data element can be stored separately. A pointer is then used to link to the next data item.
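
As a rough illustration of pointers embedded in records, the sketch below chains together all records that share a department. The records, the `next_same_dept` pointer field and the `list_heads` table are hypothetical.

```python
# Hypothetical file: record address -> record. Each record carries an embedded
# pointer to the next record with the same department (None ends the chain).
records = {
    1: {"name": "Asha", "dept": "Sales", "next_same_dept": 3},
    2: {"name": "Ben", "dept": "IT", "next_same_dept": None},
    3: {"name": "Chen", "dept": "Sales", "next_same_dept": 4},
    4: {"name": "Dev", "dept": "Sales", "next_same_dept": None},
}

# Address of the first record in each chain.
list_heads = {"Sales": 1, "IT": 2}


def follow_chain(dept):
    """Collect the names of all records with a common property by following pointers."""
    names = []
    address = list_heads.get(dept)
    while address is not None:
        record = records[address]
        names.append(record["name"])
        address = record["next_same_dept"]
    return names
```

Each step of the chain needs the previous record to be read first, which is why long chains on disk can cost one access per record.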

Inverted Lists
Inverted lists may be viewed simply as index tables of pointers stored separately from the data
records rather than embedded in pointer fields in the stored records themselves
Dense: one-to-one (1:1). Non-dense: one-to-many (1:m).

In a non-dense list, only a few of the records in the file are part of the list (1:m).
A dense list is one with a pointer for most or all of the records in the file (1:1).
The above lists are dense since there is one-to-one
relationship between both company name and
primary key and company symbol and primary key
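
A sketch of inverted lists kept as separate index tables, using made-up company records. The field names, the `invert` and `is_dense` helpers and the dividend list are assumptions for illustration only.

```python
# Hypothetical data file: primary key -> company record.
companies = {
    1: {"name": "Acme", "symbol": "ACM"},
    2: {"name": "Birch", "symbol": "BRC"},
    3: {"name": "Cobalt", "symbol": "CBT"},
}


def invert(file, field):
    """Build an index table (field value -> primary keys) stored apart from the records."""
    table = {}
    for primary_key, record in file.items():
        table.setdefault(record[field], []).append(primary_key)
    return dict(sorted(table.items()))


def is_dense(table, file):
    """A list is dense here when it holds a pointer for every record in the file."""
    pointed = {pk for pointers in table.values() for pk in pointers}
    return len(pointed) == len(file)


by_name = invert(companies, "name")      # dense: one pointer per record
by_symbol = invert(companies, "symbol")  # dense: one pointer per record

# A non-dense list covers only the few records that share some property.
pays_dividend = {"yes": [2]}
```

The records themselves carry no pointer fields; adding or dropping an inverted list leaves the data file unchanged.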

Binary tree
•Binary tree is a tree which consists of a root node and two disjoint binary trees called the left
subtree and right subtree.
•Every node in a binary tree has 0, 1 or 2 children

•To search for a particular key value, start from the root and move left or right depending on
the value of the key being searched.
•An index entry is a pair: when using a BST, the value is used as the key, and an address field must
also be specified in order to locate the records in the file stored on the secondary storage device.
•A BST is a very suitable data structure for an index if the index can be held completely in
primary memory.
•In practice, indexes are quite large and require a combination of primary and secondary storage.
•A BST stored level by level on secondary storage raises the additional problem of finding the
correct sub-tree, and a search may require a number of block transfers, in the worst case one
block transfer for each level of the tree being searched. This situation can be drastically
remedied by using a B-tree as the data structure.
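
The search described above can be sketched with an in-memory BST whose nodes hold a (key, address) pair. The class and function names and the sample keys are illustrative; the count of visited nodes stands in for the worst case of one block transfer per level.

```python
class Node:
    """One index entry: a key value plus the address of the record on disk."""

    def __init__(self, key, address):
        self.key = key
        self.address = address
        self.left = None
        self.right = None


def insert(root, key, address):
    """Insert an index entry and return the (possibly new) root."""
    if root is None:
        return Node(key, address)
    if key < root.key:
        root.left = insert(root.left, key, address)
    elif key > root.key:
        root.right = insert(root.right, key, address)
    else:
        root.address = address  # same key: update the stored address
    return root


def search(root, key):
    """Return (address or None, number of nodes visited on the way down)."""
    visited = 0
    node = root
    while node is not None:
        visited += 1
        if key == node.key:
            return node.address, visited
        node = node.left if key < node.key else node.right
    return None, visited


root = None
for k in [50, 30, 70, 20, 40, 60, 80]:
    root = insert(root, k, address=k * 10)  # made-up record addresses
```

Searching for key 60 visits the nodes 50, 70 and 60. If each level were stored in a different disk block, that would be three block transfers for a seven-entry index.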

B-Trees
•B-trees are a form of data structure based on hierarchies
•The “B” is variously said to stand for Bayer, one of the originators, or for “balanced”.
•The B-tree structure was introduced by R. Bayer and E. McCreight (1970) of Boeing Scientific Research Labs
•B-Trees are balanced in the sense that all the terminal (bottom) nodes have the same path length to
the root (top).
Algorithms have been developed for efficiently searching and maintaining B-Tree indexes
An algorithm is a finite sequence of well-defined, computer-implementable instructions, typically to solve a class
of problems or to perform a computation.

B-Trees provide both sequential and indexed access and are quite flexible.
The height of a B-Tree is the number of levels in the hierarchy
Each node on the tree contains an index element which has a key value, a pointer to the rest of
the data, and two link pointers.
One link (to the left) points to the elements (nodes) that have lower values, while the other link
(to the right) points to elements that have a value greater than or equal to the value in the node.
The root is the highest node on the tree.
The bottom nodes are called leaves because they are at the end of the tree branches.
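
A hedged sketch of searching a small hand-built B-tree. It assumes the usual arrangement in which a node holds several sorted keys, one data pointer per key, and one more child link than keys; the class, the sample keys and the `leaf_depths` check are illustrative and do not cover insertion or node splitting.

```python
import bisect


class BTreeNode:
    def __init__(self, keys, pointers, children=None):
        self.keys = keys                # sorted key values
        self.pointers = pointers        # one data pointer per key
        self.children = children or []  # len(keys) + 1 links, or empty for a leaf


def search(node, key):
    """Walk from the root towards a leaf; return the data pointer or None."""
    while node is not None:
        i = bisect.bisect_left(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            return node.pointers[i]
        if not node.children:
            return None  # reached a leaf without finding the key
        node = node.children[i]
    return None


def leaf_depths(node, depth=1):
    """Set of levels at which leaves occur; a balanced tree yields a single value."""
    if not node.children:
        return {depth}
    depths = set()
    for child in node.children:
        depths |= leaf_depths(child, depth + 1)
    return depths


root = BTreeNode(
    keys=[20, 40],
    pointers=["rec20", "rec40"],
    children=[
        BTreeNode([5, 10], ["rec5", "rec10"]),
        BTreeNode([25, 30], ["rec25", "rec30"]),
        BTreeNode([45, 50], ["rec45", "rec50"]),
    ],
)
```

Every leaf of this tree sits at level 2, so the height is 2 and any search reads at most two nodes. Packing many keys into each node is what keeps the height, and hence the number of block transfers, low.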

FILES AND THEIR ORGANIZATIONS
A file is a sequence of records. File organization refers to physical layout or a structure of record occurrences in
a file. File organization determines the way records are stored and accessed.
fixed-length records & variable-length records
If every record in the file has exactly the same size (in bytes), the file is said to be made of fixed-length
records.
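
With fixed-length records, the position of the n-th record can be computed directly from the record size. A minimal sketch, assuming a made-up 24-byte layout (a 4-byte id followed by a 20-byte name) and an in-memory byte string in place of a real file:

```python
import struct

# Assumed layout: little-endian 4-byte unsigned id + 20-byte name = 24 bytes per record.
RECORD = struct.Struct("<I20s")


def pack_record(rec_id, name):
    """Encode one record; struct pads the name with zero bytes up to 20 bytes."""
    return RECORD.pack(rec_id, name.encode("ascii"))


def read_record(file_bytes, n):
    """Read the n-th record (counting from 0) at byte offset n * record size."""
    offset = n * RECORD.size
    rec_id, raw_name = RECORD.unpack_from(file_bytes, offset)
    return rec_id, raw_name.rstrip(b"\0").decode("ascii")


data = b"".join(pack_record(i, name) for i, name in [(1, "Asha"), (2, "Ben"), (3, "Chen")])
```

No scan is needed to reach record 2: its offset is simply 2 × 24 = 48 bytes. Variable-length records lose this property and need delimiters, length fields or an index instead.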
If different records in the file have different sizes, the file is said to be made up of variable-length
records. A file may have variable-length records for several reasons.

Disk Blocks
Databases are stored persistently on magnetic disks for the reasons given below:
•Databases are generally very large and may not fit completely in main memory.
•Data must be stored permanently on non-volatile storage and made accessible to users
through front-end applications.
•Primary storage is very expensive; secondary storage substantially reduces the cost of
storage per unit of data.
Each hard drive is usually composed of a set of disk platters.
Each disk platter has a layer of magnetic material deposited on its surface.
The entire disk can contain a large amount of data, which is organised into smaller packages
called BLOCKS (or pages). Block size varies between systems; 1 KB (= 1024 bytes) is used here
as a representative figure.

Disk Blocks
A block is the smallest unit of data transfer between the hard disk and the processor of the
computer.
Each block therefore has a fixed, assigned, address.
The computer processor submits a read/write request, which includes the address of the
block and the address in RAM of the memory area, called a buffer (or cache), where
the data must be stored or taken from.
The processor then reads and modifies the buffer data as required and, if required, writes the
block back to the disk.

•Disk Blocks
•The division of a track (on storage medium) into equal sized disk blocks is set by the operating
system during disk formatting
•The records of a file must be allocated to disk blocks because a block is a unit of data transfer
between disk and memory
•The hardware address of a block comprises a surface number, track number and block number
•Buffer
•A buffer is a contiguous reserved area in main storage that holds one block; it also has an address
•For a read command, the block from disk is copied into the buffer, whereas for a write command
the contents of the buffer are copied into the disk block.
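
The read, modify and write-back cycle through a buffer can be sketched as below. The `Disk` class is a stand-in that keeps its blocks in a bytearray, and the 1 KB block size is an assumption taken from the figure used in these slides.

```python
BLOCK_SIZE = 1024  # assumed block size in bytes


class Disk:
    """Simulated disk: a run of fixed-size blocks addressed by block number."""

    def __init__(self, num_blocks):
        self.storage = bytearray(num_blocks * BLOCK_SIZE)
        self.transfers = 0  # number of block transfers performed

    def read_block(self, block_no, buffer):
        """Read command: copy one block from disk into the buffer."""
        start = block_no * BLOCK_SIZE
        buffer[:] = self.storage[start:start + BLOCK_SIZE]
        self.transfers += 1

    def write_block(self, block_no, buffer):
        """Write command: copy the buffer contents into the disk block."""
        assert len(buffer) == BLOCK_SIZE
        start = block_no * BLOCK_SIZE
        self.storage[start:start + BLOCK_SIZE] = buffer
        self.transfers += 1


disk = Disk(num_blocks=4)
buffer = bytearray(BLOCK_SIZE)  # reserved area in main memory holding one block

disk.read_block(2, buffer)      # bring block 2 into memory
buffer[0:5] = b"hello"          # modify the data in the buffer
disk.write_block(2, buffer)     # write the whole block back
```

Changing five bytes still costs two whole-block transfers, one read and one write, because the block is the unit of transfer.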

File Organisation and access method
A file organization refers to the organization of the data of a file into records, blocks and access
structures; this includes the way the records and blocks are placed on the storage medium and
interlinked.
An access method, on the other hand, provides a group of operations, such as find, read,
modify and delete, that can be applied to a file.

File Organisation and access method
Sequential Access Method (SAM)
Indexed Sequential Access Method (ISAM)
Direct Access Method (DAM)

Records of the file are stored in sequence by the primary key field values.
They are accessible only in the order stored, i.e., in the primary key order
This kind of file Organisation works well for tasks which need to access nearly every record in a
file, e.g., payroll.
In a sequentially organised file, records are written consecutively when the file is created and
must be accessed consecutively when the file is later used for input.

If only sequential access is required sequential media (magnetic tapes) are suitable and
probably the most cost-effective way of processing such files
Sequential access is fast and efficient while dealing with large volumes of data that need to be
processed periodically.
However, it is required that all new transactions be sorted into the proper sequence for sequential
access processing.
Also, most of the database or file may have to be searched to locate, store, or modify even a
small number of data records. Thus, this method is too slow to handle applications requiring
immediate updating or responses.
Sequential files are generally used for backup or transporting data to a different system. A
sequential ASCII file is a popular export/import format that most database systems support.

Advantages of Sequential File Organisation
•It is fast and efficient when dealing with large volumes of data that need to be
processed periodically (batch system).
Disadvantages of Sequential File Organisation
•Requires that all new transactions be sorted into the proper sequence for
sequential access processing.
•Locating, storing, modifying, deleting, or adding records in the file requires
rearranging the file.
•This method is too slow to handle applications requiring immediate updating or
responses.
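
A sketch of a periodic batch update on a sequential file, showing why the transactions must first be sorted into key order. The `batch_update` function and the sample data are illustrative, and the sketch assumes at most one transaction per key.

```python
def batch_update(master, transactions):
    """Merge key-sorted transactions into a key-sorted master file in one pass.

    Both inputs are lists of (key, record) sorted by key. A transaction whose
    key already exists replaces the old record; otherwise it is added as new.
    The result is a new master file, also in key order.
    """
    new_master = []
    i = 0
    for key, record in master:
        # Copy across any new records that belong before the current master key.
        while i < len(transactions) and transactions[i][0] < key:
            new_master.append(transactions[i])
            i += 1
        if i < len(transactions) and transactions[i][0] == key:
            new_master.append(transactions[i])  # updated record
            i += 1
        else:
            new_master.append((key, record))    # unchanged record
    new_master.extend(transactions[i:])         # new records beyond the last key
    return new_master


master = [(1, 100), (3, 300), (5, 500)]
transactions = sorted([(6, 600), (3, 350), (2, 200)])  # must be in key order
updated = batch_update(master, transactions)
```

The whole master file is read and rewritten even though only three records change, which suits a periodic batch run but not immediate updates.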

Indexed Sequential Access Method (ISAM)
It organises the file like a large dictionary,
i.e., records are stored in order of the key but an index is kept which also permits a type of
direct access.
The records are stored sequentially by primary key values and there is an index built over
the primary key field.
This approach gives (almost) direct access to record occurrences via the index table and
sequential access via the way in which the records are laid out on the storage medium.
The physical address of a record given by the index file is also called a pointer.

To improve the query response time of a sequential file, a type of indexing technique can be
added.
Indexing associates a set of objects to a set of orderable quantities, that are usually smaller in
number or their properties
A sequential (or sorted on primary keys) file that is indexed on its primary key is called an
index sequential file.
The index allows for random access to records, while the sequential storage of the records of
the file provides easy access to the sequential records.
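
A small sketch of the two kinds of access an index sequential file offers. The records, the in-memory key list that plays the role of the index, and the function names are assumptions; list positions stand in for physical record addresses.

```python
import bisect

# Hypothetical file: records stored in primary key order.
records = [(101, "Asha"), (105, "Ben"), (109, "Chen"), (114, "Dev")]

# Index over the primary key: the i-th key points to the i-th record position.
index_keys = [key for key, _ in records]


def fetch(key):
    """Random access: use the index to jump to one record."""
    pos = bisect.bisect_left(index_keys, key)
    if pos < len(records) and index_keys[pos] == key:
        return records[pos][1]
    return None


def scan_from(key, count):
    """Sequential access: locate the start via the index, then read in stored order."""
    pos = bisect.bisect_left(index_keys, key)
    return [name for _, name in records[pos:pos + count]]
```

`fetch` answers a single-key query without reading the records before it, while `scan_from` relies on the records being laid out in key order.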

Direct Access Method (DAM)
The record occurrences in a file do not have to be arranged in any particular sequence on
storage media.
However, the computer must keep track of the storage location of each record using a variety of
direct organisation methods so that data can be retrieved when needed.
New transaction data do not have to be sorted, and processing that requires immediate
responses or updating is easily handled.
In the direct access method, an algorithm is used to compute the address of a record. The
primary key value is the input to the algorithm and the block address of the record is the output.

Hashing algorithm
To implement the approach, a portion of the storage space is reserved for the file.
This space should be large enough to hold the file plus some allowance for growth. Then the
algorithm that generates the appropriate address for a given primary key is devised. The
algorithm is commonly called a hashing algorithm. The process of converting primary key values
into addresses is called key-to-address transformation.
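
A minimal sketch of key-to-address transformation, assuming a division-remainder hashing algorithm, a reserved area of seven one-record slots, and a separate overflow area for keys that hash to an occupied address. These choices and all names are illustrative; real systems hash to blocks that hold several records.

```python
NUM_SLOTS = 7                       # reserved storage space (assumed size)
primary_area = [None] * NUM_SLOTS   # one (key, record) per relative address
overflow_area = []                  # records whose home address was already taken


def hash_address(key):
    """Hashing algorithm: primary key in, relative address out."""
    return key % NUM_SLOTS


def store(key, record):
    address = hash_address(key)
    slot = primary_area[address]
    if slot is None or slot[0] == key:
        primary_area[address] = (key, record)
        return
    # Collision: a different key already occupies the home address.
    for i, (existing_key, _) in enumerate(overflow_area):
        if existing_key == key:
            overflow_area[i] = (key, record)
            return
    overflow_area.append((key, record))


def fetch(key):
    slot = primary_area[hash_address(key)]
    if slot is not None and slot[0] == key:
        return slot[1]
    for existing_key, record in overflow_area:
        if existing_key == key:
            return record
    return None


store(10, "a")  # 10 % 7 = 3
store(17, "b")  # 17 % 7 = 3, collides with key 10 and goes to overflow
store(5, "c")   # 5 % 7 = 5
```

A record in its home address is found with a single access; only records pushed into the overflow area need extra searching, which is why the reserved space is sized with an allowance for growth.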

•Reserved storage space
•Overflow area
•relative pointers or relative addresses
•hashing algorithm
•Hashed key
