Hashing
For Efficient Look-up Tables
Lecture Outline
 What is hashing?
 How to hash?
 What is collision?
 How to resolve collision?
 Separate chaining
 Linear probing
 Quadratic probing
 Double hashing
 Load factor
 Primary clustering and secondary clustering
What is Hashing?
 Hashing is an algorithm (via a hash function)
that maps large data sets of variable length,
called keys, to smaller data sets of a fixed length
 A hash table (or hash map) is a data structure
that uses a hash function to efficiently map keys
to values, for efficient search and retrieval
 Widely used in many kinds of computer software,
particularly for associative arrays, database
indexing, caches, and sets
The easiest form of hashing
Direct Addressing Table
Example: SBSBusServices
 Operations
 Retrieval: find(num)
 Find the bus route of bus service number num
 Insertion: insert(num)
 Introduce a new bus service number num
 Deletion: delete(num)
 Remove bus service number num
 If bus numbers are integers 0 – 999,
we can use an array with 1000 entries
(Array diagram, indices 0 to 999: slot 2 holds data_2, …, slot 998 holds data_998.)
Now there are more bus operators in SG.
Of course, for now we assume that bus numbers
don't have variants, like 96A, 96B, etc.
Example: SBSBusServices
// a[] is an array (the table)
insert(key, data)
a[key] = data
delete(key)
a[key] = NULL
find(key)
return a[key]
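As a concrete illustration, here is a minimal Python sketch of such a direct addressing table; the class and variable names are ours, not part of the slides.

# Minimal sketch: one slot per possible key (0..999), as in the bus example.
class DirectAddressTable:
    def __init__(self, size=1000):
        self.a = [None] * size        # a[] is the table

    def insert(self, key, data):      # insert(key, data): a[key] = data
        self.a[key] = data

    def delete(self, key):            # delete(key): a[key] = NULL
        self.a[key] = None

    def find(self, key):              # find(key): return a[key]
        return self.a[key]

buses = DirectAddressTable()
buses.insert(2, "data_2")
print(buses.find(2))                  # data_2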
Direct Addressing Table: Limitations
 Range of keys must be small
 Keys must be dense
 i.e. not many gaps in the key values
 How to overcome these restrictions?
Hashing: Idea
•Hashing is the process of mapping a large amount of data to a smaller table with the help of a hash function.
•Hashing is also known as a hashing algorithm or message digest function.
•Hashing allows us to update and retrieve any data entry in constant time, O(1).
•Constant time O(1) means the operation does not depend on the size of the data.
•A fixed process that converts a key to a hash key is known as a hash function.
•This function takes a key and maps it to a value of a certain length, which is called a hash value (or hash).
•A hash table is made of (hash value, item) pairs.
Hash Table: Phone Numbers Example
h is a hash function: h(x) = x % 997
66752378 → h → 237: slot 237 stores (66752378, data)
68744483 → h → 336: slot 336 stores (68744483, data)
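A quick check of the mapping above in Python (nothing beyond the % operator is involved):

h = lambda x: x % 997             # the hash function from the slide
print(h(66752378))                # 237 -> slot 237 stores (66752378, data)
print(h(68744483))                # 336 -> slot 336 stores (68744483, data)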
Hash Table: Collision
 A hash function may map different
keys to the same slot
 A many-to-one mapping and
not one-to-one
 E.g. 66754372 hashes to the same
location as 66752378
 This is called a “collision”, when
two keys have the same hash value
66754372 → h → 237: the slot already holding (66752378, data) — a collision
Hash Table: Operations
// a[] is an array (the table)
// h is a hash function
insert(key, data)
a[h(key)] = data
delete(key)
a[h(key)] = NULL
find(key)
return a[h(key)]
However, this does not work for all cases! Why?
Another Example:
Start with a hash table of 11 empty slots (indices 0 to 10).
Arithmetic modulo is used as the hash function: h(item) = item % 11.
Each item in the collection is placed in the slot given by its hash value, giving the new hash table.
Now when we want to search for an item, we simply use the hash function to
compute the slot name for the item and then check the hash table to see if it is
present. This searching operation is done in O(1), since a constant amount of time is
required to compute the hash value and then index the hash table at that location. If
everything is where it should be, we have found a constant time search algorithm.
You can probably already see that this technique is going to work only if each item
maps to a unique location in the hash table. For example, if the item 44 had been
the next item in our collection, it would have a hash value of 0 (44 % 11 == 0). Since 77
also has a hash value of 0, we would have a problem. This is known as a collision.
Two Important Issues
 How to hash?
 How to resolve collisions?
How to create a good one?
Hash Functions
Hash Functions and Hash Values
 Suppose we have a hash table of size N
 Keys are used to identify the data
 A hash function is used to compute a hash value
 A hash value (hash code) is
 Computed from the key with the use of a hash function to
get a number in the range 0 to N−1
 Used as the index (address) of the table entry for the data
 Regarded as the “home address” of a key
 Desire: The addresses are different and spread
evenly over the range
 When two keys have the same hash value — collision
Good Hash Functions
 Fast to compute, O(1)
 Scatter keys evenly throughout the hash table
 Fewer collisions
 Need fewer slots (space)
Bad Hash Functions: Example
 Select Digits
 e.g. choose the 4th and 8th digits of a phone number
 hash(67754378) = 58
 hash(63497820) = 90
 This is bad because it ignores most of the key, so many different numbers share the same hash value and the values are not spread evenly
Perfect Hash Functions
 A perfect hash function is a one-to-one mapping between
keys and hash values, so no collision occurs
 Possible only if all keys are known in advance
 Applications: compiler and interpreter search for reserved
words; shell interpreter searches for built-in commands
 GNU gperf is a freely available perfect hash function
generator, written in C++, that automatically constructs a
perfect hash function (a C++ program) from a user-supplied
list of keywords
How to Define a Hash Function?
 Division method
 Mid Square method
 Multiplication method
 Hashing of strings
 Folding method
Division Method (mod operator)
 Map into a hash table of m slots
 Use the modulo operator (%) to map an integer
to a value between 0 and m−1
 n mod m = remainder of n divided by m, where n and m
are positive integers
hash(k)  k % m
 The most popular method
 Rule of thumb: Pick a prime number, close to
a power of two, to be m
Mid Square method
 Map into a hash table of m slots
 The key is squared: H(k) = l,
where l is obtained by deleting digits from both
ends of k*k
 The same digit positions must be taken from every key
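A small Python sketch of the mid-square idea; which middle digits are kept is a convention we choose here (two digits around the centre), not something fixed by the slide.

def mid_square_hash(k, num_digits=2):
    # Square the key and keep num_digits digits from the middle of the square.
    sq = str(k * k)
    start = max(len(sq) // 2 - num_digits // 2, 0)
    return int(sq[start:start + num_digits])

print(mid_square_hash(3121))      # 3121*3121 = 9740641 -> middle digits "40" -> 40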
Folding Method
 Map into a hash table of m slots
 The key is partitioned into a number of parts k1,
k2, …, kr
 Each part, except possibly the last, has the same
number of digits as the required address
 The parts are then added together, ignoring the
last carry
 H(K) = k1 + k2 + k3 + … + kr
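A sketch of the fold-and-add computation in Python; the two-digit grouping (for a table whose addresses have two digits) is our assumption.

def folding_hash(k, part_digits=2):
    digits = str(k)
    # Partition the key's digits into parts of part_digits digits each.
    parts = [int(digits[i:i + part_digits])
             for i in range(0, len(digits), part_digits)]
    # Add the parts and ignore any carry beyond part_digits digits.
    return sum(parts) % (10 ** part_digits)

print(folding_hash(123456))       # 12 + 34 + 56 = 102 -> ignore the carry -> 2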
Multiplication Method
1) Multiply the key k by a constant fraction A, where 0 < A < 1
2) Extract the fractional part
3) Multiply by m, the hash table size
 hash(k) = floor( m * (k*A mod 1) )
 The reciprocal of the golden ratio,
A = (sqrt(5) − 1) / 2 ≈ 0.618033…,
seems to be a good choice for A
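The same steps written out in Python (a sketch; the default A is the reciprocal of the golden ratio mentioned above):

import math

def mult_hash(k, m, A=(math.sqrt(5) - 1) / 2):
    frac = (k * A) % 1            # steps 1 and 2: fractional part of k*A
    return math.floor(m * frac)   # step 3: scale by the table size

print(mult_hash(123456, 1000))    # some slot in 0..999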
Hashing of Strings: Example
// s is a string
hash1(s) {
sum = 0
for each character c in s {
sum += c
// sum up the ASCII values of all characters
}
return sum % m // m is the hash table size
}
Hashing of Strings: Example
hash1("Tan Ah Teck")
= ('T' + 'a' + 'n' + ' ' +
'A' + 'h' + ' ' +
'T' + 'e' + 'c' + 'k') % 11
// hash table size is 11
= (84 + 97 + 110 + 32 +
65 + 104 + 32 +
84 + 101 + 99 + 107) % 11
= 915 % 11
= 2
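A runnable Python version of hash1 (ord gives the ASCII value of a character) that reproduces the sum above:

def hash1(s, m=11):
    total = 0
    for c in s:
        total += ord(c)           # sum up the ASCII values of all characters
    return total % m              # m is the hash table size

print(sum(ord(c) for c in "Tan Ah Teck"))   # 915
print(hash1("Tan Ah Teck"))                 # 915 % 11 = 2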
Hashing of Strings: Example
 All 3 strings below have the same hash value.
Why?
 "Lee Chin Tan"
 "Chen Le Tian"
 "Chan Tin Lee"
 Problem: The hash value is independent of the
positions of the characters
Improved Hashing of Strings
 Better to “shift” the sum before adding the next
character, so that its position affects the hash code
 Polynomial hash code
hash2(s) {
sum = 0
for each character c in s {
sum = sum * 37 + c
}
return sum % m
}
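A runnable version of the polynomial hash above; it is renamed hash2_str here only to avoid clashing with the secondary hash function hash2 used later for double hashing. The multiplier 37 and table size m = 11 are the slide's values.

def hash2_str(s, m=11):
    total = 0
    for c in s:
        total = total * 37 + ord(c)    # shift the sum, then add the character
    return total % m

# The three anagrams from the earlier slide will generally get different
# values now, because the position of each character affects the code.
for name in ("Lee Chin Tan", "Chen Le Tian", "Chan Tin Lee"):
    print(name, hash2_str(name))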
Collision Resolution Techniques
 Separate Chaining
 Linear Probing
 Quadratic Probing
 Double Hashing
Separate Chaining
(Diagram: an array indexed 0 to m−1; each slot points to a linked list of the (key, data) pairs that hash to it, here holding k1, k2, k3 and k4 with their data.)
Load Factor
 n: number of keys in the hash table
 m: size of the hash table — number of slots
 α: load factor
 α = n / m
 Measures how full the hash table is
 In separate chaining, the table size equals the
number of linked lists, so α is the average length
of the linked lists
Separate Chaining: Performance
 Hash table operations
 insert (key, data)
 Insert data into the list a[h(key)]
 Takes O(1) time
 find (key)
 Find key from the list a[h(key)]
 Takes O(1+α) time on average
 delete (key)
 Delete data from the list a[h(key)]
 Takes O(1+α) time on average
If α is bounded by
some constant, then
all three operations
are O(1)
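A minimal separate-chaining table in Python (a sketch: ordinary Python lists stand in for the linked lists, and the division method is used for h; the class and method names are ours):

class ChainedHashTable:
    def __init__(self, m=11):
        self.m = m
        self.slots = [[] for _ in range(m)]   # one chain per slot

    def _h(self, key):
        return key % self.m                   # division-method hash

    def insert(self, key, data):
        self.slots[self._h(key)].append((key, data))        # O(1)

    def find(self, key):
        for k, d in self.slots[self._h(key)]:                # O(1 + alpha) on average
            if k == key:
                return d
        return None

    def delete(self, key):
        chain = self.slots[self._h(key)]
        self.slots[self._h(key)] = [(k, d) for (k, d) in chain if k != key]

t = ChainedHashTable()
t.insert(18, "A")
t.insert(29, "B")        # 18 and 29 both hash to slot 7 when m = 11
print(t.find(29))        # B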
Open Addressing
 Separate chaining is a closed addressing system,
as the address given to a key is fixed
 When the hash address given to a key is open
(not fixed), the hashing is an open addressing
system
 Open addressing
 Hashed items are in a single array
 Hash code gives the home address
 Collision is resolved by checking multiple positions
 Each check is called a probe into the table
Linear Probing method:
One method for resolving collisions looks into the hash table and tries to find
another open slot to hold the item that caused the collision. A simple way to do
this is to start at the original hash value position and then move in a sequential
manner through the slots until we encounter the first slot that is empty. Note that
we may need to go back to the first slot (circularly) to cover the entire hash table.
This collision resolution process is referred to as open addressing in that it tries to
find the next open slot or address in the hash table. By systematically visiting each
slot one at a time, we are performing an open addressing technique called linear
probing.
Linear Probing
(Diagram: an empty table with slots 0–6.)
hash(k) = k mod 7
Here the table size m = 7. Note: 7 is a prime number.
In linear probing, when there is a collision, we scan
forwards for the next empty slot (wrapping around
when we reach the last slot).
Linear Probing: Insert 18
hash(18) = 18 mod 7 = 4
Slot 4 is empty, so 18 goes into slot 4.
Table (slots 0–6): [–, –, –, –, 18, –, –]
Linear Probing: Insert 14
hash(14) = 14 mod 7 = 0
Slot 0 is empty, so 14 goes into slot 0.
Table (slots 0–6): [14, –, –, –, 18, –, –]
Linear Probing: Insert 21
hash(21) = 21 mod 7 = 0
Collision occurs! Slot 0 already holds 14, so we look for the next empty slot: 21 goes into slot 1.
Table (slots 0–6): [14, 21, –, –, 18, –, –]
Linear Probing: Insert 1
hash(1) = 1 mod 7 = 1
Collides with 21 (hash value 0) in slot 1. Look for the next empty slot: 1 goes into slot 2.
Table (slots 0–6): [14, 21, 1, –, 18, –, –]
Linear Probing: Insert 35
hash(k) = k mod 7
hash(35) = 35 mod 7 = 0
Collision: we need to check the next 3 slots (1, 2, 3) before finding slot 3 empty, so 35 goes into slot 3.
Table (slots 0–6): [14, 21, 1, 35, 18, –, –]
Linear Probing: Find 35
hash(k) = k mod 7
hash(35) = 35 mod 7 = 0
Probe slots 0, 1, 2, 3: found 35, after 4 probes.
Table (slots 0–6): [14, 21, 1, 35, 18, –, –]
Linear Probing: Find 8
hash(k) = k mod 7
hash(8) = 8 mod 7 = 1
Probe slots 1, 2, 3, 4, then 5, which is empty: 8 NOT found. Need 5 probes!
Table (slots 0–6): [14, 21, 1, 35, 18, –, –]
Linear Probing: Delete 21
hash(k) = k mod 7
hash(21) = 21 mod 7 = 0
Probe slot 0 (14), then slot 1: 21 is found and removed, leaving slot 1 empty.
Table (slots 0–6): [14, –, 1, 35, 18, –, –]
Linear Probing: Find 35
hash(k) = k mod 7
hash(35) = 35 mod 7 = 0
Probe slot 0 (14), then slot 1: it is now empty, so the search stops: 35 NOT found. Incorrect!
We cannot simply remove a value, because it can affect find()!
Table (slots 0–6): [14, –, 1, 35, 18, –, –]
Collision resolution with linear probing
Insert (54)
Insert (26)
Insert (17)
Insert (77)
Insert (31)
Insert (44)
Insert (55)
Insert (20)
Insert (93)
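The slide's figure is not reproduced here, but the insertions can be replayed with a short Python sketch; we assume the table of 11 slots and the mod-11 hash function from the earlier example.

def insert_linear(table, key):
    # Linear probing: start at key % m and scan forward, wrapping around.
    m = len(table)
    i = key % m
    probes = 1
    while table[i] is not None:
        i = (i + 1) % m
        probes += 1
    table[i] = key
    return probes

table = [None] * 11
total = sum(insert_linear(table, k)
            for k in (54, 26, 17, 77, 31, 44, 55, 20, 93))
print(table)   # [77, 44, 55, 20, 26, 93, 17, None, None, 31, 54]
print(total)   # total number of probes used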
Clustering:
A disadvantage of linear probing is the tendency for clustering; items become
clustered in the table. This means that if many collisions occur at the same hash
value, a number of surrounding slots will be filled by the linear probing resolution.
This will have an impact on other items that are being inserted, as we saw when we
tried to add the item 20 above. A cluster of values hashing to 0 had to be skipped to
finally find an open position.
Solution: One way to deal with clustering is to extend the linear probing
technique so that instead of looking sequentially for the next open slot, we skip
slots, thereby more evenly distributing the items that have caused collisions. This
will potentially reduce the clustering that occurs. One such scheme is a "plus 3" probe: once a
collision occurs, we will look at every third slot until we find one that is empty.
How to Delete?
 Lazy Deletion
 Use three different states at each slot
 Occupied
 Deleted
 Empty
 When a value is removed from linear probed
hash table, we just mark the status of the slot as
“deleted”, instead of emptying the slot
 Need to use a state array the same size as the
hash table
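A sketch of linear probing with lazy deletion in Python; the three states and the state array follow the slides, while the class and method names are ours.

EMPTY, OCCUPIED, DELETED = 0, 1, 2      # per-slot states for lazy deletion

class LinearProbingTable:
    def __init__(self, m=7):
        self.m = m
        self.keys = [None] * m
        self.state = [EMPTY] * m        # state array, same size as the table

    def _probe(self, key):
        # Linear probe sequence: h(key), h(key)+1, ... (mod m).
        h = key % self.m
        for j in range(self.m):
            yield (h + j) % self.m

    def insert(self, key):
        for i in self._probe(key):
            if self.state[i] != OCCUPIED:        # first EMPTY or DELETED slot
                self.keys[i] = key
                self.state[i] = OCCUPIED
                return i
        raise RuntimeError("table is full")

    def find(self, key):
        for i in self._probe(key):
            if self.state[i] == EMPTY:           # a truly empty slot ends the search
                return None
            if self.state[i] == OCCUPIED and self.keys[i] == key:
                return i
        return None

    def delete(self, key):
        i = self.find(key)
        if i is not None:
            self.state[i] = DELETED              # mark as deleted, keep probing intact

t = LinearProbingTable()
for k in (18, 14, 21, 1, 35):
    t.insert(k)
t.delete(21)
print(t.find(35) is not None)   # True: 35 is still found after deleting 21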
Linear Probing: Delete 21
hash(k) = k mod 7
hash(21) = 21 mod 7 = 0
Slot 1 still contains 21 but is now marked as deleted (X).
Table (slots 0–6): [14, 21 (X), 1, 35, 18, –, –]
Linear Probing: Find 35
hash(k) = k mod 7
hash(35) = 35 mod 7 = 0
Probe slot 0 (14), slot 1 (marked deleted, so we keep probing), slot 2 (1), slot 3: found 35.
Now we can find 35.
Table (slots 0–6): [14, 21 (X), 1, 35, 18, –, –]
Linear Probing: Insert 15 (1/2)
hash(k) = k mod 7
hash(15) = 15 mod 7 = 1
Slot 1 is marked as deleted.
So, we insert the new value 15 into the slot that has been marked as deleted (i.e. slot 1).
Table before the insertion (slots 0–6): [14, 21 (X), 1, 35, 18, –, –]
Linear Probing: Insert 15 (2/2)
hash(k) = k mod 7
hash(15) = 15 mod 7 = 1
So, 15 is inserted into slot 1, which was marked as deleted.
Note: we should insert a new value into the first available slot so that the find operation for this value will be the fastest.
Table (slots 0–6): [14, 15, 1, 35, 18, –, –]
Problem 1: Primary Clustering
 A cluster is a collection of
consecutive occupied slots
 A cluster that covers the
home address of a key is
called the primary cluster
of the key
 Linear probing can create
large primary clusters that
will increase the running
time of find/insert/delete
operations
Table (slots 0–6): [14, 15, 1, 35, 18, –, –]
Slots 0–4 are five consecutive occupied slots: a cluster.
Linear Probing: Probe Sequence
 The probe sequence of this linear probing is
hash(key)
( hash(key) + 1 ) % m
( hash(key) + 2 ) % m
( hash(key) + 3 ) % m
⁞
 If there is an empty slot, we are sure to find it
 When an empty slot is found, conflict resolved, but the
primary cluster of the key is expanded as a result
 The size of the resulting primary cluster may be very big
due to the annexation of the neighboring cluster
Modified Linear Probing
 To reduce primary clustering, we can modify the
probe sequence to
hash(key)
( hash(key) + 1 * d ) % m
( hash(key) + 2 * d) % m
( hash(key) + 3 * d) % m
⁞
where d is some constant integer >1 and is
co-prime to m
 Since d and m are co-primes, the probe sequence
covers all the slots in the hash table
Ques. Find the total number of probes required to insert the
keys given below into a hash table of size 8.
a. 11111011
b. 01101010
c. 01010010
d. 11011011
e. 10011010
Ques2. Insert keys 12, 18, 13, 2, 3, 23, 5 and 15 into a
hash table using a hash function K mod 10.
Ques3. A hash table of size 10 uses the hash function K mod 10 (with linear
probing). After inserting six keys, the table contents are:
Index: 0  1  2   3   4   5   6   7   8  9
Key:   –  –  42  23  34  52  46  33  –  –
Which one of the following choices gives a possible order in which the key
values could have been inserted in the table?
(A) 46, 42, 34, 52, 23, 33
(B) 34, 42, 23, 52, 33, 46
(C) 46, 34, 42, 23, 52, 33
(D) 42, 46, 33, 23, 34, 52
Quadratic Probing
 The probe sequence of quadratic probing is
hash(key)
( hash(key) + 1 ) % m
( hash(key) + 4 ) % m
( hash(key) + 9 ) % m
⁞
( hash(key) + k² ) % m
Quadratic Probing: Insert 18, 3
hash(k) = k mod 7
hash(18) = 4, hash(3) = 3: no collisions.
Table (slots 0–6): [–, –, –, 3, 18, –, –]
Quadratic Probing: Insert 38
hash(k) = k mod 7
hash(38) = 3. Collision!
Probe +1: slot (3 + 1) mod 7 = 4 is occupied (18).
Probe +4: slot (3 + 4) mod 7 = 0 is empty, so 38 goes into slot 0.
Table (slots 0–6): [38, –, –, 3, 18, –, –]
Theorem of Quadratic Probing
 How can we be sure that quadratic probing
always terminates?
 Insert 12 into the previous example, followed by 10.
See what happen?
 Theorem: If α < 0.5 and m is prime,
then we can always find an empty slot
 m is the table size and α is the load factor
Problem 2: Secondary Clustering
 In quadratic probing, clusters are formed along
the path of probing, instead of around the home
location
 These clusters are called secondary clusters
 Secondary clusters are formed as a result of
using the same pattern in probing by all keys
 If two keys have the same home location,
their probe sequences are going to be the same
 But it is not as bad as primary clustering in
linear probing
Double Hashing
 To reduce secondary clustering, we can use a
second hash function to generate different probe
sequences for different keys
hash(key)
( hash(key) + 1 * hash2(key) ) % m
( hash(key) + 2 * hash2(key) ) % m
( hash(key) + 3 * hash2(key) ) % m
⁞
 hash2 is called the secondary hash function
 If hash2(k) = 1, then it is the same as linear probing
 If hash2(k) = d, where d is a constant integer > 1,
then it is the same as modified linear probing
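The three open-addressing probe sequences can be written side by side as a small Python generator (a sketch; the secondary hash in the example uses the never-zero form 1 + k % 5, anticipating the next slides):

def probe_sequence(key, m, method="linear", hash2=None):
    h = key % m                               # home address
    for j in range(m):
        if method == "linear":
            yield (h + j) % m                 # h, h+1, h+2, ...
        elif method == "quadratic":
            yield (h + j * j) % m             # h, h+1, h+4, h+9, ...
        elif method == "double":
            yield (h + j * hash2(key)) % m    # step size d = hash2(key)

hash2 = lambda k: 1 + k % 5
print(list(probe_sequence(21, 7, "double", hash2)))   # [0, 2, 4, 6, 1, 3, 5]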
Double Hashing: 14, 18 in, Insert 21
hash(k) = k mod 7, hash2(k) = k mod 5
Table before (slots 0–6): [14, –, –, –, 18, –, –]
hash(21) = 0: collision with 14.
hash2(21) = 1, so the next probe is (0 + 1·1) mod 7 = 1, which is empty: 21 goes into slot 1.
Table after: [14, 21, –, –, 18, –, –]
Double Hashing: Insert 35
hash(k) = k mod 7, hash2(k) = k mod 5
hash(35) = 0, hash2(35) = 0
Table so far (slots 0–6): [14, 21, –, –, 18, 29, –]
But if we insert 35, the probe sequence is 0, 0, 0, … What is wrong?
Since hash2(35) = 0, every probe lands on the home slot. Not acceptable!
hash2(key) must not be 0
 We can redefine hash2(key) as
 hash2(key) = (key % s) + 1, or
 hash2(key) = s – (key % s)
 Note
 The size of the hash table, m, must be a prime
 When defining hash2(key) = (key % s) + 1
 s < m but s need not be a prime
 Usually s = m – 1
Good Collision Resolution Method
 Minimize clustering
 Always find an empty slot if it exists
 Give different probe sequences when 2 keys
collide (i.e. no secondary clustering)
 Fast, O(1)
Rehash
 Time to rehash
 When the table is getting full, the operations are getting slow
 For quadratic probing, insertions might fail when the table is
more than half full
 Rehash operation
 Build another table about twice as big with a new hash
function
 Scan the original table, for each key, compute the new hash
value and insert the data into the new table
 Delete the original table
 The load factor α is used to decide when to rehash
 For open addressing: 0.5
 For closed addressing: 1
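A sketch of the rehash step for the chained table sketched earlier; growing from 11 to 23 slots is our reading of "about twice as big" (23 is a prime).

def rehash(old_slots, new_m):
    # Build a bigger table and re-insert every (key, data) pair,
    # recomputing each hash value with the new table size.
    new_slots = [[] for _ in range(new_m)]
    for chain in old_slots:                  # scan the original table
        for key, data in chain:
            new_slots[key % new_m].append((key, data))
    return new_slots                         # the original table can now be dropped

# usage with the ChainedHashTable sketch from before:
# t.slots = rehash(t.slots, 23); t.m = 23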
Summary
 How to hash?
 Criteria for good hash functions
 How to resolve collision?
 Separate chaining
 Linear probing
 Quadratic probing
 Double hashing
 Problem on deletions
 Primary clustering and secondary clustering