2. • Hashing in DBMS is a technique to quickly locate a data record in a
database, irrespective of the database's size. For larger databases
containing thousands or millions of records, the indexing data-structure
technique becomes inefficient, because searching for a specific record
through an index consumes more time.
3. What is Hashing?
• The hashing technique utilizes an auxiliary hash table to store the data records
using a hash function. There are three key components in hashing:
• Hash Table: A hash table is an array-like data structure whose size is determined
by the total volume of data records present in the database. Each memory location
in a hash table is called a 'bucket' or hash index; it stores a data record's exact
location and is accessed through a hash function.
• Bucket: A bucket is a memory location (index) in the hash table that stores the
data record. A bucket generally corresponds to a disk block, which in turn stores
multiple records. It is also known as the hash index.
• Hash Function: A hash function is a mathematical equation or algorithm that takes
a data record's primary key as input and computes the hash index as output.
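• As a minimal Python sketch of how these three components fit together (the table size, key, and stored value are made up for illustration):

    M = 5                                 # hash table with M buckets
    hash_table = [None] * M               # each slot ("bucket") holds a record's location
    def h(primary_key):                   # hash function: primary key -> hash index
        return primary_key % M
    hash_table[h(42)] = "disk block 7"    # the record with key 42 is reachable via its hash index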
5. Hash Function
• A hash function is a mathematical algorithm that computes the index
or the location where the current data record is to be stored in the
hash table so that it can be accessed efficiently later. This hash
function is the most crucial component that determines the speed of
fetching data.
7. Internal Hashing
• For internal files, hashing is typically implemented as a hash table
through the use of an array of records. Suppose that the array index
range is from 0 to M − 1; that is, we have M slots whose addresses
correspond to the array indexes. We choose a hash function that
transforms the hash field value into an integer between 0 and M − 1.
One common hash function is h(K) = K mod M, which returns the
remainder of an integer hash field value K after division by M; this
value is then used for the record address.
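• For instance (illustrative values), with M = 97 and hash field value K = 123456, h(K) = 123456 mod 97 = 72, so the record is stored at address 72.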
9. Two simple hashing algorithms: (a) applying the mod hash function to a
character string K; (b) collision resolution by open addressing.
(a) temp ← 1;
    for i ← 1 to 20 do temp ← temp * code(K[i]) mod M;
    hash_address ← temp mod M;
(b) i ← hash_address(K); a ← i;
    if location i is occupied
    then begin i ← (i + 1) mod M;
             while (i ≠ a) and location i is occupied
                 do i ← (i + 1) mod M;
             if (i = a) then all positions are full
             else new_hash_address ← i;
         end;
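• The same two algorithms as a runnable Python sketch. Here code() is taken to be the character's ordinal value, occupancy is tracked with a plain list, and the loop runs over the whole string rather than a fixed 20 characters; all three are assumptions for illustration:

    def mod_hash(K, M):
        # (a) Apply the mod hash function to a character string K.
        temp = 1
        for ch in K:
            temp = (temp * ord(ch)) % M   # code(K[i]) taken as ord(ch)
        return temp % M

    def open_addressing_insert(K, table):
        # (b) Resolve a collision by linear probing: scan forward from
        # the home address until a free slot is found or we wrap around.
        M = len(table)
        i = a = mod_hash(K, M)
        while table[i] is not None:
            i = (i + 1) % M
            if i == a:
                return None               # all positions are full
        table[i] = K
        return i

For example, inserting "JONES", "SMITH", and "ADAMS" into a table of size 11 places each key at its hash address, or at the next free slot after it if that address is occupied.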
10. • Other hashing functions can be used. One technique, called
folding, involves applying an arithmetic function such as
addition or a logical function such as exclusive or to different
portions of the hash field value to calculate the hash address
• For example, with an address space from 0 to 999 to store
1,000 keys, a 6-digit key 235469 may be folded and stored at
the address (235 + 964) mod 1000 = 199 (here the second part,
469, is reversed to 964 before the addition).
• Another technique involves picking some digits of the hash
field value—for instance, the third, fifth, and eighth digits—to
form the hash address (for example, storing 1,000 employees
with nine-digit Social Security numbers into a hash file with
1,000 positions would give the Social Security number 301-67-
8923 a hash value of 172 by this hash function).
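• Both techniques as a short Python sketch; the three-digit split with the second part reversed, and the fixed digit positions, mirror the two examples above:

    def fold_hash(key, M=1000):
        # Boundary folding: split the 6-digit key into two 3-digit parts,
        # reverse the second part, then add them modulo M.
        s = f"{key:06d}"
        return (int(s[:3]) + int(s[3:][::-1])) % M

    def digit_pick_hash(ssn):
        # Pick the third, fifth, and eighth digits of a 9-digit number.
        s = f"{ssn:09d}"
        return int(s[2] + s[4] + s[7])

    print(fold_hash(235469))           # 199
    print(digit_pick_hash(301678923))  # 172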
11. • A collision occurs when the hash field value of a record that
is being inserted hashes to an address that already contains
a different record. In this situation, we must insert the new
record in some other position, since its hash address is
occupied. The process of finding another position is called
collision resolution. There are numerous methods for
collision resolution, including the following:
1. Chaining
2. Open addressing
3. Multiple hashing
12. Chaining
For this method, various overflow locations are kept, usually by
extending the array with a number of overflow positions.
Additionally, a pointer field is added to each record location. A
collision is resolved by placing the new record in an unused overflow
location and setting the pointer of the occupied hash address
location to the address of that overflow location.
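A small Python sketch of this overflow-area scheme; the primary and overflow area sizes are made up, and each slot holds a [record, pointer] pair:

    M, O = 7, 5                            # M primary slots plus O overflow positions
    table = [None] * (M + O)               # each entry: [record, next_pointer] or None
    free_overflow = list(range(M, M + O))  # unused overflow locations

    def chain_insert(key):
        i = key % M
        if table[i] is None:               # home slot free: store directly
            table[i] = [key, None]
            return i
        if not free_overflow:
            raise RuntimeError("overflow area full")
        j = free_overflow.pop(0)           # take an unused overflow location
        while table[i][1] is not None:     # walk to the end of the chain
            i = table[i][1]
        table[i][1] = j                    # link the occupied location to it
        table[j] = [key, None]
        return j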
Open Addressing/Closed Hashing
Open addressing, also called closed hashing, aims to resolve a collision by
looking for the next empty slot available that can store the record. It uses
techniques like linear probing, quadratic probing, double hashing, etc.
Multiple Hashing
The program applies a second hash function if the first results in a
collision. If another collision results, the program uses open
addressing or applies a third hash function and then uses open
addressing if necessary.
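A Python sketch of multiple hashing; the two hash functions are illustrative choices, and linear probing serves as the final fallback:

    def multi_hash_insert(k, table):
        M = len(table)
        h1 = lambda k: k % M               # first hash function
        h2 = lambda k: (k // M) % M        # second hash function (illustrative)
        for h in (h1, h2):
            i = h(k)
            if table[i] is None:           # no collision at this address
                table[i] = k
                return i
        a = i                              # both collided: fall back to
        while table[i] is not None:        # open addressing from here
            i = (i + 1) % M
            if i == a:
                return None                # table is full
        table[i] = k
        return i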
13. External hashing
• Hashing for disk files is called external hashing. To suit the characteristics of disk
storage, the hash address space is made of buckets. Each bucket consists of
either one disk block or a cluster of contiguous (neighbouring) blocks, and can
accommodate a certain number of records.
• A hash function maps a key into a relative bucket number, rather than assigning
an absolute block address to the bucket. A table maintained in the file header
converts the relative bucket number into the corresponding disk block address.
• The collision problem is less severe with buckets, because as many records as
will fit in a bucket can hash to the same bucket without causing any problem. If
the collision problem does occur when a bucket is filled to its capacity, we can
use a variation of the chaining method to resolve it.
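• A minimal Python sketch of the bucket-number indirection described above; the bucket count and block addresses are made up:

    header_table = {0: 0x2F00, 1: 0x2F40, 2: 0x2F80}  # relative bucket no. -> disk block address

    def bucket_block(key, num_buckets=3):
        rel = key % num_buckets            # hash the key to a relative bucket number
        return header_table[rel]           # the file-header table converts it to a block address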
32. Dynamic Hashing
Dynamic hashing, also known as extendible hashing, is used to handle
databases whose data sets change frequently. This method offers a way
to add and remove data buckets on demand. As the number of data
records varies, the buckets grow and shrink accordingly whenever a
change is made.
33. • Properties of Dynamic Hashing
• The buckets vary in size dynamically as changes are made, offering
more flexibility.
• Dynamic hashing helps improve overall performance by minimizing or
completely preventing collisions.
• It has the following major components: data buckets, a flexible hash
function, and directories.
• A flexible hash function generates more dynamic values and keeps
changing periodically, adapting to the requirements of the database.
• Directories are containers that store pointers to buckets. If problems
such as bucket overflow or bucket skew occur, bucket splitting is done to
maintain efficient retrieval of data records. Each directory has a
directory id.
34. Suppose that a new record to be inserted causes overflow in the bucket whose hash values start with 01 (the third bucket). The records in that
bucket will have to be redistributed between two buckets: the first contains all records whose hash values start with 010, and the second contains
all those whose hash values start with 011. The two directory entries for 010 and 011 now point to the two new distinct buckets; before the split,
they pointed to the same bucket. The local depth of the two new buckets is 3, which is one more than the local depth of the old bucket.
35. If the global depth is k = 2, the keys will be mapped to hash indices
accordingly: the k bits starting from the LSB are taken to map a key to the
buckets. That leaves us with the following 4 possibilities: 00, 01, 10, 11.
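In code, this mapping is a mask of the k low-order bits of the hash value (a one-line Python sketch, taking the hash value as given):

    def directory_index(h_value, k=2):
        return h_value & ((1 << k) - 1)    # keep the k bits starting from the LSB

    directory_index(0b10110)               # -> 0b10, i.e. directory entry 2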
36. Retrieval - To find the bucket containing the search key value K:
Compute h(K).
Take the first i bits of h(K).
Look at the corresponding table entry for this i-bit string.
Follow the bucket pointer in the table entry to retrieve the block.
Insertion - To add a new record with the hash key value K:
Follow the same procedure for retrieval, ending up in some bucket.
If there is still space in that bucket, place the record in it.
If the bucket is full, we must split the bucket and redistribute the records.
If a bucket is split, we may need to increase the number of bits we use in the hash.
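These two procedures together in a compact Python sketch. It indexes the directory with the low-order bits of the hash value (as on the previous slide), and the bucket capacity of two records is an illustrative choice:

    class ExtendibleHash:
        def __init__(self, bucket_size=2):
            self.global_depth = 1
            self.bucket_size = bucket_size
            self.directory = [{"depth": 1, "keys": []}, {"depth": 1, "keys": []}]

        def _index(self, key):
            # take global_depth low-order bits of the hash value
            return hash(key) & ((1 << self.global_depth) - 1)

        def search(self, key):
            # retrieval: hash, index the directory, follow the bucket pointer
            return key in self.directory[self._index(key)]["keys"]

        def insert(self, key):
            bucket = self.directory[self._index(key)]
            if len(bucket["keys"]) < self.bucket_size:
                bucket["keys"].append(key)     # still space: place the record
                return
            # bucket full: split it, doubling the directory if the bucket's
            # local depth already equals the global depth
            if bucket["depth"] == self.global_depth:
                self.directory += self.directory
                self.global_depth += 1
            bucket["depth"] += 1
            new_bucket = {"depth": bucket["depth"], "keys": []}
            for i in range(len(self.directory)):
                # entries whose extra bit is 1 now point to the new bucket
                if self.directory[i] is bucket and (i >> (bucket["depth"] - 1)) & 1:
                    self.directory[i] = new_bucket
            old_keys, bucket["keys"] = bucket["keys"], []
            for k in old_keys + [key]:         # redistribute the records
                self.insert(k)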
37. Performance issues
• Hashing provides the fastest possible access for retrieving a record based on its hash field value.
However, searching for a record when the hash field value is not available is as expensive as in the
case of a heap file.
• Record deletion can be implemented by removing the record from its bucket. If the bucket has an
overflow chain, we can move one of the overflow records into the bucket to replace the deleted
record. If the record to be deleted is already in the overflow, we simply remove it from the linked list.
• To insert a new record, we first use the hash function to find the address of the bucket the record
should be in. Then, we insert the record into an available location in the bucket. If the bucket is full,
we place the record in one of the locations for overflow records.
• The performance of a modification operation depends on two factors: first, the search condition to
locate the record, and second, the field to be modified.
• If the search condition is an equality comparison on the hash field, we can locate the record
efficiently by using the hash function. Otherwise, we must perform a linear search.
• A non-hash field value can be changed and the modified record can be rewritten back to its original
bucket.
• Modifying the hash field value means that the record may move to another bucket, which requires
the deletion of the old record followed by the insertion of the modified one as a new record.
#2: Hashing is a process of scrambling a piece of information or data beyond recognition. We pass the input through a hash function to calculate the hash value.
#10: The problem with most hashing functions is that they do not
guarantee that distinct values will hash to distinct addresses, because the hash field
space—the number of possible values a hash field can take—is usually much larger
than the address space—the number of available addresses for records. The hashing
function maps the hash field space to the address space.
#13: In this situation, we maintain a pointer in each bucket to a linked list of overflow records for the bucket. The pointers in the linked list should be record pointers, which include both a block address and a relative record position within the block.
#14: A bucket is either one disk block or a cluster of contiguous disk blocks. The collision problem is less severe with buckets, because as many records as will fit in a bucket can hash to the same bucket without causing problems. However, we must make provisions for the case where a bucket is filled to capacity and a new record being inserted hashes to that bucket.
#34: The access structure is built on the binary representation of the hash function value, which is a string of bits.