Hashing
For Efficient Look-up Tables
Lecture Outline
 What is hashing?
 How to hash?
 What is collision?
 How to resolve collision?
 Separate chaining
 Linear probing
 Quadratic probing
 Double hashing
 Load factor
 Primary clustering and secondary clustering
What is Hashing?
 Hashing is an algorithm (via a hash function)
that maps large data sets of variable length,
called keys, to smaller data sets of a fixed length
 A hash table (or hash map) is a data structure
that uses a hash function to efficiently map keys
to values, for efficient search and retrieval
 Widely used in many kinds of computer software,
particularly for associative arrays, database
indexing, caches, and sets
The easiest form of hashing
Direct Addressing Table
Example: SBSBusServices
 Operations
 Retrieval: find(num)
 Find the bus route of bus service number num
 Insertion: insert(num)
 Introduce a new bus service number num
 Deletion: delete(num)
 Remove bus service number num
 If bus numbers are integers 0 – 999,
we can use an array with 1000 entries
(Array diagram, indices 0 to 999: slot 2 holds data_2, …, slot 998 holds data_998.)
Now there are more bus operators in SG.
Of course, for now we assume that bus numbers
don't have variants, like 96A, 96B, etc.
Example: SBSBusServices
// a[] is an array (the table)
insert(key, data)
a[key] = data
delete(key)
a[key] = NULL
find(key)
return a[key]
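As a concrete illustration, here is a minimal Python sketch of such a direct addressing table; the class and variable names are ours, not part of the slides.

# Minimal sketch: one slot per possible key (0..999), as in the bus example.
class DirectAddressTable:
    def __init__(self, size=1000):
        self.a = [None] * size        # a[] is the table

    def insert(self, key, data):      # insert(key, data): a[key] = data
        self.a[key] = data

    def delete(self, key):            # delete(key): a[key] = NULL
        self.a[key] = None

    def find(self, key):              # find(key): return a[key]
        return self.a[key]

buses = DirectAddressTable()
buses.insert(2, "data_2")
print(buses.find(2))                  # data_2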
Direct Addressing Table: Limitations
 Range of keys must be small
 Keys must be dense
 i.e. not many gaps in the key values
 How to overcome these restrictions?
Hashing: Idea
•Hashing is the process of mapping a large amount of data to a smaller table with the help of a hash function.
•Hashing is also known as a hashing algorithm or message digest function.
•Hashing allows us to update and retrieve any data entry in constant time, O(1).
•Constant time O(1) means the operation does not depend on the size of the data.
•A fixed process that converts a key to a hash key is known as a hash function.
•This function takes a key and maps it to a value of a certain length, which is called a hash value (or hash).
•A hash table is made of (hash value, item) pairs.
Hash Table: Phone Numbers Example
h is a hash function: h(x) = x % 997
66752378 → h → 237: slot 237 stores (66752378, data)
68744483 → h → 336: slot 336 stores (68744483, data)
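A quick check of the mapping above in Python (nothing beyond the % operator is involved):

h = lambda x: x % 997             # the hash function from the slide
print(h(66752378))                # 237 -> slot 237 stores (66752378, data)
print(h(68744483))                # 336 -> slot 336 stores (68744483, data)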
Hash Table: Collision
 A hash function may map different
keys to the same slot
 A many-to-one mapping and
not one-to-one
 E.g. 66754372 hashes to the same
location as 66752378
 This is called a “collision”, when
two keys have the same hash value
66754372 → h → 237: the slot already holding (66752378, data) — a collision
Hash Table: Operations
// a[] is an array (the table)
// h is a hash function
insert(key, data)
a[h(key)] = data
delete(key)
a[h(key)] = NULL
find(key)
return a[h(key)]
However, this does not work for all cases! Why?
Another Example:
Start with a hash table of 11 empty slots (indices 0 to 10).
Arithmetic modulo is used as the hash function: h(item) = item % 11.
Each item in the collection is placed in the slot given by its hash value, giving the new hash table.
Now when we want to search for an item, we simply use the hash function to
compute the slot name for the item and then check the hash table to see if it is
present. This searching operation is done in O(1), since a constant amount of time is
required to compute the hash value and then index the hash table at that location. If
everything is where it should be, we have found a constant time search algorithm.
You can probably already see that this technique is going to work only if each item
maps to a unique location in the hash table. For example, if the item 44 had been
the next item in our collection, it would have a hash value of 0 (44 % 11 == 0). Since 77
also has a hash value of 0, we would have a problem. This is known as a collision.
Two Important Issues
 How to hash?
 How to resolve collisions?
How to create a good one?
Hash Functions
Hash Functions and Hash Values
 Suppose we have a hash table of size N
 Keys are used to identify the data
 A hash function is used to compute a hash value
 A hash value (hash code) is
 Computed from the key with the use of a hash function to
get a number in the range 0 to N−1
 Used as the index (address) of the table entry for the data
 Regarded as the “home address” of a key
 Desire: The addresses are different and spread
evenly over the range
 When two keys have the same hash value — collision
Good Hash Functions
 Fast to compute, O(1)
 Scatter keys evenly throughout the hash table
 Fewer collisions
 Need fewer slots (space)
Bad Hash Functions: Example
 Select Digits
 e.g. choose the 4th and 8th digits of a phone number
 hash(67754378) = 58
 hash(63497820) = 90
 This is bad because it ignores most of the key, so many different numbers share the same hash value and the values are not spread evenly
Perfect Hash Functions
 A perfect hash function is a one-to-one mapping between
keys and hash values, so no collision occurs
 Possible only if all keys are known in advance
 Applications: compiler and interpreter search for reserved
words; shell interpreter searches for built-in commands
 GNU gperf is a freely available perfect hash function
generator, written in C++, that automatically constructs a
perfect hash function (a C++ program) from a user-supplied
list of keywords
How to Define a Hash Function?
 Division method
 Mid Square method
 Multiplication method
 Hashing of strings
 Folding method
Division Method (mod operator)
 Map into a hash table of m slots
 Use the modulo operator (%) to map an integer
to a value between 0 and m−1
 n mod m = remainder of n divided by m, where n and m
are positive integers
hash(k)  k % m
 The most popular method
 Rule of thumb: Pick a prime number, close to
a power of two, to be m
Mid Square method
 Map into a hash table of m slots
 The key is squared: H(k) = l,
where l is obtained by deleting digits from both
ends of k*k
 The same digit positions must be taken from every key
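A small Python sketch of the mid-square idea; which middle digits are kept is a convention we choose here (two digits around the centre), not something fixed by the slide.

def mid_square_hash(k, num_digits=2):
    # Square the key and keep num_digits digits from the middle of the square.
    sq = str(k * k)
    start = max(len(sq) // 2 - num_digits // 2, 0)
    return int(sq[start:start + num_digits])

print(mid_square_hash(3121))      # 3121*3121 = 9740641 -> middle digits "40" -> 40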
Folding Method
 Map into a hash table of m slots
 The key is partitioned into a number of parts k1,
k2, …, kr
 Each part, except possibly the last, has the same
number of digits as the required address
 The parts are then added together, ignoring the
last carry
 H(K) = k1 + k2 + k3 + … + kr
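A sketch of the fold-and-add computation in Python; the two-digit grouping (for a table whose addresses have two digits) is our assumption.

def folding_hash(k, part_digits=2):
    digits = str(k)
    # Partition the key's digits into parts of part_digits digits each.
    parts = [int(digits[i:i + part_digits])
             for i in range(0, len(digits), part_digits)]
    # Add the parts and ignore any carry beyond part_digits digits.
    return sum(parts) % (10 ** part_digits)

print(folding_hash(123456))       # 12 + 34 + 56 = 102 -> ignore the carry -> 2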
Multiplication Method
1) Multiply the key k by a constant fraction A, where 0 < A < 1
2) Extract the fractional part
3) Multiply by m, the hash table size
 hash(k) = floor( m * (k*A mod 1) )
 The reciprocal of the golden ratio,
A = (sqrt(5) − 1) / 2 ≈ 0.618033…,
seems to be a good choice for A
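The same steps written out in Python (a sketch; the default A is the reciprocal of the golden ratio mentioned above):

import math

def mult_hash(k, m, A=(math.sqrt(5) - 1) / 2):
    frac = (k * A) % 1            # steps 1 and 2: fractional part of k*A
    return math.floor(m * frac)   # step 3: scale by the table size

print(mult_hash(123456, 1000))    # some slot in 0..999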
Hashing of Strings: Example
// s is a string
hash1(s) {
sum = 0
for each character c in s {
sum += c
// sum up the ASCII values of all characters
}
return sum % m // m is the hash table size
}
Hashing of Strings: Example
hash1("Tan Ah Teck")
= ('T' + 'a' + 'n' + ' ' +
'A' + 'h' + ' ' +
'T' + 'e' + 'c' + 'k') % 11
// hash table size is 11
= (84 + 97 + 110 + 32 +
65 + 104 + 32 +
84 + 101 + 99 + 107) % 11
= 915 % 11
= 2
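A runnable Python version of hash1 (ord gives the ASCII value of a character) that reproduces the sum above:

def hash1(s, m=11):
    total = 0
    for c in s:
        total += ord(c)           # sum up the ASCII values of all characters
    return total % m              # m is the hash table size

print(sum(ord(c) for c in "Tan Ah Teck"))   # 915
print(hash1("Tan Ah Teck"))                 # 915 % 11 = 2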
Hashing of Strings: Example
 All 3 strings below have the same hash value.
Why?
 "Lee Chin Tan"
 "Chen Le Tian"
 "Chan Tin Lee"
 Problem: The hash value is independent of the
positions of the characters
Improved Hashing of Strings
 Better to “shift” the sum before adding the next
character, so that its position affects the hash code
 Polynomial hash code
hash2(s) {
sum = 0
for each character c in s {
sum = sum * 37 + c
}
return sum % m
}
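A runnable version of the polynomial hash above; it is renamed hash2_str here only to avoid clashing with the secondary hash function hash2 used later for double hashing. The multiplier 37 and table size m = 11 are the slide's values.

def hash2_str(s, m=11):
    total = 0
    for c in s:
        total = total * 37 + ord(c)    # shift the sum, then add the character
    return total % m

# The three anagrams from the earlier slide will generally get different
# values now, because the position of each character affects the code.
for name in ("Lee Chin Tan", "Chen Le Tian", "Chan Tin Lee"):
    print(name, hash2_str(name))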
Collision Resolution Techniques
 Separate Chaining
 Linear Probing
 Quadratic Probing
 Double Hashing
Separate Chaining
(Diagram: an array indexed 0 to m−1; each slot points to a linked list of the (key, data) pairs that hash to it, here holding k1, k2, k3 and k4 with their data.)
Load Factor
 n: number of keys in the hash table
 m: size of the hash table — number of slots
 α: load factor
 α = n / m
 Measures how full the hash table is
 In separate chaining, the table size equals the
number of linked lists, so α is the average length
of the linked lists
Separate Chaining: Performance
 Hash table operations
 insert (key, data)
 Insert data into the list a[h(key)]
 Takes O(1) time
 find (key)
 Find key from the list a[h(key)]
 Takes O(1+α) time on average
 delete (key)
 Delete data from the list a[h(key)]
 Takes O(1+α) time on average
If α is bounded by
some constant, then
all three operations
are O(1)
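A minimal separate-chaining table in Python (a sketch: ordinary Python lists stand in for the linked lists, and the division method is used for h; the class and method names are ours):

class ChainedHashTable:
    def __init__(self, m=11):
        self.m = m
        self.slots = [[] for _ in range(m)]   # one chain per slot

    def _h(self, key):
        return key % self.m                   # division-method hash

    def insert(self, key, data):
        self.slots[self._h(key)].append((key, data))        # O(1)

    def find(self, key):
        for k, d in self.slots[self._h(key)]:                # O(1 + alpha) on average
            if k == key:
                return d
        return None

    def delete(self, key):
        chain = self.slots[self._h(key)]
        self.slots[self._h(key)] = [(k, d) for (k, d) in chain if k != key]

t = ChainedHashTable()
t.insert(18, "A")
t.insert(29, "B")        # 18 and 29 both hash to slot 7 when m = 11
print(t.find(29))        # B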
Open Addressing
 Separate chaining is a closed addressing system,
as the address given to a key is fixed
 When the hash address given to a key is open
(not fixed), the hashing is an open addressing
system
 Open addressing
 Hashed items are in a single array
 Hash code gives the home address
 Collision is resolved by checking multiple positions
 Each check is called a probe into the table
Linear Probing method:
One method for resolving collisions looks into the hash table and tries to find
another open slot to hold the item that caused the collision. A simple way to do
this is to start at the original hash value position and then move in a sequential
manner through the slots until we encounter the first slot that is empty. Note that
we may need to go back to the first slot (circularly) to cover the entire hash table.
This collision resolution process is referred to as open addressing in that it tries to
find the next open slot or address in the hash table. By systematically visiting each
slot one at a time, we are performing an open addressing technique called linear
probing.
Linear Probing
(Diagram: an empty table with slots 0–6.)
hash(k) = k mod 7
Here the table size m = 7. Note: 7 is a prime number.
In linear probing, when there is a collision, we scan
forwards for the next empty slot (wrapping around
when we reach the last slot).
Linear Probing: Insert 18
hash(18) = 18 mod 7 = 4
Slot 4 is empty, so 18 goes into slot 4.
Table (slots 0–6): [–, –, –, –, 18, –, –]
Linear Probing: Insert 14
hash(14) = 14 mod 7 = 0
Slot 0 is empty, so 14 goes into slot 0.
Table (slots 0–6): [14, –, –, –, 18, –, –]
Linear Probing: Insert 21
hash(21) = 21 mod 7 = 0
Collision occurs! Slot 0 already holds 14, so we look for the next empty slot: 21 goes into slot 1.
Table (slots 0–6): [14, 21, –, –, 18, –, –]
Linear Probing: Insert 1
hash(1) = 1 mod 7 = 1
Collides with 21 (hash value 0) in slot 1. Look for the next empty slot: 1 goes into slot 2.
Table (slots 0–6): [14, 21, 1, –, 18, –, –]
Linear Probing: Insert 35
hash(k) = k mod 7
hash(35) = 35 mod 7 = 0
Collision: we need to check the next 3 slots (1, 2, 3) before finding slot 3 empty, so 35 goes into slot 3.
Table (slots 0–6): [14, 21, 1, 35, 18, –, –]
Linear Probing: Find 35
hash(k) = k mod 7
hash(35) = 35 mod 7 = 0
Probe slots 0, 1, 2, 3: found 35, after 4 probes.
Table (slots 0–6): [14, 21, 1, 35, 18, –, –]
Linear Probing: Find 8
hash(k) = k mod 7
hash(8) = 8 mod 7 = 1
Probe slots 1, 2, 3, 4, then 5, which is empty: 8 NOT found. Need 5 probes!
Table (slots 0–6): [14, 21, 1, 35, 18, –, –]
Linear Probing: Delete 21
hash(k) = k mod 7
hash(21) = 21 mod 7 = 0
Probe slot 0 (14), then slot 1: 21 is found and removed, leaving slot 1 empty.
Table (slots 0–6): [14, –, 1, 35, 18, –, –]
Linear Probing: Find 35
hash(k) = k mod 7
hash(35) = 35 mod 7 = 0
Probe slot 0 (14), then slot 1: it is now empty, so the search stops: 35 NOT found. Incorrect!
We cannot simply remove a value, because it can affect find()!
Table (slots 0–6): [14, –, 1, 35, 18, –, –]
Collision resolution with linear probing
Insert (54)
Insert (26)
Insert (17)
Insert (77)
Insert (31)
Insert (44)
Insert (55)
Insert (20)
Insert (93)
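The slide's figure is not reproduced here, but the insertions can be replayed with a short Python sketch; we assume the table of 11 slots and the mod-11 hash function from the earlier example.

def insert_linear(table, key):
    # Linear probing: start at key % m and scan forward, wrapping around.
    m = len(table)
    i = key % m
    probes = 1
    while table[i] is not None:
        i = (i + 1) % m
        probes += 1
    table[i] = key
    return probes

table = [None] * 11
total = sum(insert_linear(table, k)
            for k in (54, 26, 17, 77, 31, 44, 55, 20, 93))
print(table)   # [77, 44, 55, 20, 26, 93, 17, None, None, 31, 54]
print(total)   # total number of probes used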
Clustering:
A disadvantage of linear probing is the tendency for clustering; items become
clustered in the table. This means that if many collisions occur at the same hash
value, a number of surrounding slots will be filled by the linear probing resolution.
This will have an impact on other items that are being inserted, as we saw when we
tried to add the item 20 above. A cluster of values hashing to 0 had to be skipped to
finally find an open position.
Solution: One way to deal with clustering is to extend the linear probing
technique so that instead of looking sequentially for the next open slot, we skip
slots, thereby more evenly distributing the items that have caused collisions. This
will potentially reduce the clustering that occurs. One such scheme is a "plus 3" probe: once a
collision occurs, we will look at every third slot until we find one that is empty.
How to Delete?
 Lazy Deletion
 Use three different states at each slot
 Occupied
 Deleted
 Empty
 When a value is removed from linear probed
hash table, we just mark the status of the slot as
“deleted”, instead of emptying the slot
 Need to use a state array the same size as the
hash table
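A sketch of linear probing with lazy deletion in Python; the three states and the state array follow the slides, while the class and method names are ours.

EMPTY, OCCUPIED, DELETED = 0, 1, 2      # per-slot states for lazy deletion

class LinearProbingTable:
    def __init__(self, m=7):
        self.m = m
        self.keys = [None] * m
        self.state = [EMPTY] * m        # state array, same size as the table

    def _probe(self, key):
        # Linear probe sequence: h(key), h(key)+1, ... (mod m).
        h = key % self.m
        for j in range(self.m):
            yield (h + j) % self.m

    def insert(self, key):
        for i in self._probe(key):
            if self.state[i] != OCCUPIED:        # first EMPTY or DELETED slot
                self.keys[i] = key
                self.state[i] = OCCUPIED
                return i
        raise RuntimeError("table is full")

    def find(self, key):
        for i in self._probe(key):
            if self.state[i] == EMPTY:           # a truly empty slot ends the search
                return None
            if self.state[i] == OCCUPIED and self.keys[i] == key:
                return i
        return None

    def delete(self, key):
        i = self.find(key)
        if i is not None:
            self.state[i] = DELETED              # mark as deleted, keep probing intact

t = LinearProbingTable()
for k in (18, 14, 21, 1, 35):
    t.insert(k)
t.delete(21)
print(t.find(35) is not None)   # True: 35 is still found after deleting 21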
Linear Probing: Delete 21
hash(k) = k mod 7
hash(21) = 21 mod 7 = 0
Slot 1 still contains 21 but is now marked as deleted (X).
Table (slots 0–6): [14, 21 (X), 1, 35, 18, –, –]
Linear Probing: Find 35
hash(k) = k mod 7
hash(35) = 35 mod 7 = 0
Probe slot 0 (14), slot 1 (marked deleted, so we keep probing), slot 2 (1), slot 3: found 35.
Now we can find 35.
Table (slots 0–6): [14, 21 (X), 1, 35, 18, –, –]
Linear Probing: Insert 15 (1/2)
hash(k) = k mod 7
hash(15) = 15 mod 7 = 1
Slot 1 is marked as deleted.
So, we insert the new value 15 into the slot that has been marked as deleted (i.e. slot 1).
Table before the insertion (slots 0–6): [14, 21 (X), 1, 35, 18, –, –]
Linear Probing: Insert 15 (2/2)
hash(k) = k mod 7
hash(15) = 15 mod 7 = 1
So, 15 is inserted into slot 1, which was marked as deleted.
Note: we should insert a new value into the first available slot so that the find operation for this value will be the fastest.
Table (slots 0–6): [14, 15, 1, 35, 18, –, –]
Problem 1: Primary Clustering
 A cluster is a collection of
consecutive occupied slots
 A cluster that covers the
home address of a key is
called the primary cluster
of the key
 Linear probing can create
large primary clusters that
will increase the running
time of find/insert/delete
operations
Table (slots 0–6): [14, 15, 1, 35, 18, –, –]
Slots 0–4 are five consecutive occupied slots: a cluster.
Linear Probing: Probe Sequence
 The probe sequence of this linear probing is
hash(key)
( hash(key) + 1 ) % m
( hash(key) + 2 ) % m
( hash(key) + 3 ) % m
⁞
 If there is an empty slot, we are sure to find it
 When an empty slot is found, conflict resolved, but the
primary cluster of the key is expanded as a result
 The size of the resulting primary cluster may be very big
due to the annexation of the neighboring cluster
Modified Linear Probing
 To reduce primary clustering, we can modify the
probe sequence to
hash(key)
( hash(key) + 1 * d ) % m
( hash(key) + 2 * d) % m
( hash(key) + 3 * d) % m
⁞
where d is some constant integer >1 and is
co-prime to m
 Since d and m are co-primes, the probe sequence
covers all the slots in the hash table
Ques. Find the total number of probes required to insert the
keys given below into a hash table of size 8.
a. 11111011
b. 01101010
c. 01010010
d. 11011011
e. 10011010
Ques2. Insert keys 12, 18, 13, 2, 3, 23, 5 and 15 into a
hash table using a hash function K mod 10.
Ques3. A hash table of size 10 uses the hash function K mod 10 (with linear
probing). After inserting six keys, the table contents are:
Index: 0  1  2   3   4   5   6   7   8  9
Key:   –  –  42  23  34  52  46  33  –  –
Which one of the following choices gives a possible order in which the key
values could have been inserted in the table?
(A) 46, 42, 34, 52, 23, 33
(B) 34, 42, 23, 52, 33, 46
(C) 46, 34, 42, 23, 52, 33
(D) 42, 46, 33, 23, 34, 52
Quadratic Probing
 The probe sequence of quadratic probing is
hash(key)
( hash(key) + 1 ) % m
( hash(key) + 4 ) % m
( hash(key) + 9 ) % m
⁞
( hash(key) + k² ) % m
Quadratic Probing: Insert 18, 3
hash(k) = k mod 7
hash(18) = 4, hash(3) = 3: no collisions.
Table (slots 0–6): [–, –, –, 3, 18, –, –]
Quadratic Probing: Insert 38
hash(k) = k mod 7
hash(38) = 3. Collision!
Probe +1: slot (3 + 1) mod 7 = 4 is occupied (18).
Probe +4: slot (3 + 4) mod 7 = 0 is empty, so 38 goes into slot 0.
Table (slots 0–6): [38, –, –, 3, 18, –, –]
Theorem of Quadratic Probing
 How can we be sure that quadratic probing
always terminates?
 Insert 12 into the previous example, followed by 10.
See what happen?
 Theorem: If α < 0.5 and m is prime,
then we can always find an empty slot
 m is the table size and α is the load factor
Problem 2: Secondary Clustering
 In quadratic probing, clusters are formed along
the path of probing, instead of around the home
location
 These clusters are called secondary clusters
 Secondary clusters are formed as a result of
using the same pattern in probing by all keys
 If two keys have the same home location,
their probe sequences are going to be the same
 But it is not as bad as primary clustering in
linear probing
Double Hashing
 To reduce secondary clustering, we can use a
second hash function to generate different probe
sequences for different keys
hash(key)
( hash(key) + 1 * hash2(key) ) % m
( hash(key) + 2 * hash2(key) ) % m
( hash(key) + 3 * hash2(key) ) % m
⁞
 hash2 is called the secondary hash function
 If hash2(k) = 1, then it is the same as linear probing
 If hash2(k) = d, where d is a constant integer > 1,
then it is the same as modified linear probing
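The three open-addressing probe sequences can be written side by side as a small Python generator (a sketch; the secondary hash in the example uses the never-zero form 1 + k % 5, anticipating the next slides):

def probe_sequence(key, m, method="linear", hash2=None):
    h = key % m                               # home address
    for j in range(m):
        if method == "linear":
            yield (h + j) % m                 # h, h+1, h+2, ...
        elif method == "quadratic":
            yield (h + j * j) % m             # h, h+1, h+4, h+9, ...
        elif method == "double":
            yield (h + j * hash2(key)) % m    # step size d = hash2(key)

hash2 = lambda k: 1 + k % 5
print(list(probe_sequence(21, 7, "double", hash2)))   # [0, 2, 4, 6, 1, 3, 5]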
Double Hashing: 14, 18 in, Insert 21
hash(k) = k mod 7, hash2(k) = k mod 5
Table before (slots 0–6): [14, –, –, –, 18, –, –]
hash(21) = 0: collision with 14.
hash2(21) = 1, so the next probe is (0 + 1·1) mod 7 = 1, which is empty: 21 goes into slot 1.
Table after: [14, 21, –, –, 18, –, –]
Double Hashing: Insert 35
hash(k) = k mod 7, hash2(k) = k mod 5
hash(35) = 0, hash2(35) = 0
Table so far (slots 0–6): [14, 21, –, –, 18, 29, –]
But if we insert 35, the probe sequence is 0, 0, 0, … What is wrong?
Since hash2(35) = 0, every probe lands on the home slot. Not acceptable!
hash2(key) must not be 0
 We can redefine hash2(key) as
 hash2(key) = (key % s) + 1, or
 hash2(key) = s – (key % s)
 Note
 The size of the hash table, m, must be a prime
 When defining hash2(key) = (key % s) + 1
 s < m but s need not be a prime
 Usually s = m – 1
Good Collision Resolution Method
 Minimize clustering
 Always find an empty slot if it exists
 Give different probe sequences when 2 keys
collide (i.e. no secondary clustering)
 Fast, O(1)
Rehash
 Time to rehash
 When the table is getting full, the operations are getting slow
 For quadratic probing, insertions might fail when the table is
more than half full
 Rehash operation
 Build another table about twice as big with a new hash
function
 Scan the original table, for each key, compute the new hash
value and insert the data into the new table
 Delete the original table
 The load factor α is used to decide when to rehash
 For open addressing: 0.5
 For closed addressing: 1
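A sketch of the rehash step for the chained table sketched earlier; growing from 11 to 23 slots is our reading of "about twice as big" (23 is a prime).

def rehash(old_slots, new_m):
    # Build a bigger table and re-insert every (key, data) pair,
    # recomputing each hash value with the new table size.
    new_slots = [[] for _ in range(new_m)]
    for chain in old_slots:                  # scan the original table
        for key, data in chain:
            new_slots[key % new_m].append((key, data))
    return new_slots                         # the original table can now be dropped

# usage with the ChainedHashTable sketch from before:
# t.slots = rehash(t.slots, 23); t.m = 23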
Summary
 How to hash?
 Criteria for good hash functions
 How to resolve collision?
 Separate chaining
 Linear probing
 Quadratic probing
 Double hashing
 Problem on deletions
 Primary clustering and secondary clustering