Analysis of Algorithms
CS 477/677
Hashing
Thanks: George Bebis
(Chapter 11)
The Search Problem
• Find items with keys matching a given search key
– Given an array A containing n keys and a search key
x, find the index i such that x = A[i]
– As in the case of sorting, a key could be part of a
large record.
Applications
• Keeping track of customer account information at
a bank
– Search through records to check balances and perform
transactions
• Keep track of reservations on flights
– Search to find empty seats, cancel/modify reservations
• Search engine
– Looks for all documents containing a given word
Special Case: Dictionaries
• Dictionary = data structure that supports mainly
two basic operations: insert a new item and
return an item with a given key
• Queries: return information about the set S:
– Search (S, k)
– Minimum (S), Maximum (S)
– Successor (S, x), Predecessor (S, x)
• Modifying operations: change the set
– Insert (S, k)
– Delete (S, k) – not very often
Direct Addressing
• Assumptions:
– Key values are distinct
– Each key is drawn from a universe U = {0, 1, . . . , m - 1}
• Idea:
– Store the items in an array, indexed by keys
• Direct-address table representation:
– An array T[0 . . . m - 1]
– Each slot, or position, in T corresponds to a key in U
– For an element x with key k, a pointer to x (or x itself) will be placed
in location T[k]
– If there are no elements with key k in the set, T[k] is empty,
represented by NIL
Direct Addressing (cont’d)
(Figure: a direct-address table T[0 . . . m - 1]; each key k in the universe U indexes slot T[k].)
Operations
Alg.: DIRECT-ADDRESS-SEARCH(T, k)
return T[k]
Alg.: DIRECT-ADDRESS-INSERT(T, x)
T[key[x]] ← x
Alg.: DIRECT-ADDRESS-DELETE(T, x)
T[key[x]] ← NIL
• Running time for these operations: O(1)
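As a concrete illustration, here is a minimal Python sketch of a direct-address table; the Item record type is an assumption for illustration, since the slides only require that each element x carry key[x].

```python
from dataclasses import dataclass

# Illustrative record type (an assumption; the slides just use key[x]).
@dataclass
class Item:
    key: int
    data: str

class DirectAddressTable:
    def __init__(self, m):
        self.T = [None] * m        # one slot per key in U = {0, ..., m - 1}

    def search(self, k):           # DIRECT-ADDRESS-SEARCH: O(1)
        return self.T[k]

    def insert(self, x):           # DIRECT-ADDRESS-INSERT: O(1)
        self.T[x.key] = x

    def delete(self, x):           # DIRECT-ADDRESS-DELETE: O(1)
        self.T[x.key] = None

t = DirectAddressTable(10)
t.insert(Item(3, "account #3"))
print(t.search(3).data)            # -> account #3
```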
Comparing Different Implementations
• Implementing dictionaries using:
– Direct addressing
– Ordered/unordered arrays
– Ordered/unordered linked lists

                    Insert   Search
ordered array       O(N)     O(lgN)
ordered list        O(N)     O(N)
unordered array     O(1)     O(N)
unordered list      O(1)     O(N)
direct addressing   O(1)     O(1)
Examples Using Direct Addressing
(Examples 1 and 2 appeared as figures in the original slides.)
Hash Tables
• When K is much smaller than U, a hash table
requires much less space than a direct-address
table
– Can reduce storage requirements to |K|
– Can still get O(1) search time, but in the average
case, not the worst case
Hash Tables
Idea:
– Use a function h to compute the slot for each key
– Store the element in slot h(k)
• A hash function h transforms a key into an index in a
hash table T[0…m-1]:
h : U → {0, 1, . . . , m - 1}
• We say that k hashes to slot h(k)
• Advantages:
– Reduce the range of array indices handled: m instead of |U|
– Storage is also reduced
Example: HASH TABLES
(Figure: keys k1, . . . , k5 from the universe U; the actual keys K hash into slots of T[0 . . . m - 1], with h(k2) = h(k5).)
Revisit Example 2
(Figure in the original slides.)
Do you see any problems with this approach?
(Figure: the same table as before; k2 and k5 hash to the same slot.)
Collisions!
Collisions
• Two or more keys hash to the same slot!!
• For a given set K of keys
– If |K| ≤ m, collisions may or may not happen,
depending on the hash function
– If |K| > m, collisions will definitely happen (i.e., there
must be at least two keys that have the same hash
value)
• Avoiding collisions completely is hard, even with
a good hash function
Handling Collisions
• We will review the following methods:
– Chaining
– Open addressing
• Linear probing
• Quadratic probing
• Double hashing
• We will discuss chaining first, and ways to
build “good” functions.
Handling Collisions Using Chaining
• Idea:
– Put all elements that hash to the same slot into a
linked list
– Slot j contains a pointer to the head of the list of all
elements that hash to j
Collision with Chaining - Discussion
• Choosing the size of the table
– Small enough not to waste space
– Large enough such that lists remain short
– Typically, the number of slots is 1/5 or 1/10 of the total
number of elements (so chains average 5–10 elements)
• How should we keep the lists: ordered or not?
– Not ordered!
• Insert is fast
• Can easily remove the most recently inserted elements
Insertion in Hash Tables
Alg.: CHAINED-HASH-INSERT(T, x)
insert x at the head of list T[h(key[x])]
• Worst-case running time is O(1)
• Assumes that the element being inserted isn’t
already in the list
• It would take an additional search to check if it
was already inserted
Deletion in Hash Tables
Alg.: CHAINED-HASH-DELETE(T, x)
delete x from the list T[h(key[x])]
• Need to find the element to be deleted.
• Worst-case running time:
– Deletion depends on searching the corresponding list
Searching in Hash Tables
Alg.: CHAINED-HASH-SEARCH(T, k)
search for an element with key k in list T[h(k)]
• Running time is proportional to the length of the
list of elements in slot h(k)
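The three chained-table operations fit in a short sketch. This is a minimal illustration, assuming integer keys, the division-method hash h(k) = k mod m, and Python lists standing in for the slides’ linked lists.

```python
class ChainedHashTable:
    def __init__(self, m):
        self.m = m
        self.T = [[] for _ in range(m)]       # one chain per slot

    def _h(self, k):
        return k % self.m                      # division-method hash

    def insert(self, k, v):
        # CHAINED-HASH-INSERT: O(1) worst case (no duplicate check)
        self.T[self._h(k)].insert(0, (k, v))   # insert at head of chain

    def search(self, k):
        # CHAINED-HASH-SEARCH: time proportional to the chain length
        for key, val in self.T[self._h(k)]:
            if key == k:
                return val
        return None

    def delete(self, k):
        # CHAINED-HASH-DELETE: must first find k in its chain
        chain = self.T[self._h(k)]
        for i, (key, _) in enumerate(chain):
            if key == k:
                del chain[i]
                return

t = ChainedHashTable(11)
t.insert(14, "r1"); t.insert(25, "r2")         # collide: 14 mod 11 = 25 mod 11 = 3
print(t.search(25))                             # -> r2 (head of the chain in slot 3)
```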
Analysis of Hashing with Chaining:
Worst Case
• How long does it take to
search for an element with a
given key?
• Worst case:
– All n keys hash to the same slot
– Worst-case time to search is
Θ(n), plus time to compute the
hash function
Analysis of Hashing with Chaining:
Average Case
• Average case
– depends on how well the hash function
distributes the n keys among the m slots
• Simple uniform hashing assumption:
– Any given element is equally likely to hash
into any of the m slots (i.e., probability of
collision Pr(h(x)=h(y)), is 1/m)
• Length of list T[j]: nj, for j = 0, 1, . . . , m – 1
• Number of keys in the table:
n = n0 + n1 +· · · + nm-1
• Average value of nj:
E[nj] = α = n/m
Load Factor of a Hash Table
• Load factor of a hash table T:
α = n/m
– n = # of elements stored in the table
– m = # of slots in the table = # of linked lists
• α encodes the average number of
elements stored in a chain
• α can be <, =, or > 1
Case 1: Unsuccessful Search
(i.e., item not stored in the table)
Theorem
An unsuccessful search in a hash table takes expected time Θ(1 + α)
under the assumption of simple uniform hashing
(i.e., probability of collision Pr(h(x) = h(y)) is 1/m)
Proof
• Searching unsuccessfully for any key k
– need to search to the end of the list T[h(k)]
• Expected length of the list:
– E[nh(k)] = α = n/m
• Expected number of elements examined in an unsuccessful search is α
• Total time required is:
– O(1) (for computing the hash function) + α = Θ(1 + α)
Case 2: Successful Search
A successful search also takes expected time Θ(1 + α) under simple
uniform hashing (the derivation appeared as a figure in the original slides).
Analysis of Search in Hash Tables
• If m (# of slots) is proportional to n (# of
elements in the table):
• n = O(m)
• α = n/m = O(m)/m = O(1)
⇒ Searching takes constant time on average
Hash Functions
• A hash function transforms a key into a table
address
• What makes a good hash function?
(1) Easy to compute
(2) Approximates a random function: for every input,
every output is equally likely (simple uniform hashing)
• In practice, it is very hard to satisfy the simple
uniform hashing property
– i.e., we don’t know in advance the probability
distribution that keys are drawn from
Good Approaches for Hash Functions
• Minimize the chance that closely related keys
hash to the same slot
– Strings such as pt and pts should hash to
different slots
• Derive a hash value that is independent from
any patterns that may exist in the distribution
of the keys
The Division Method
• Idea:
– Map a key k into one of the m slots by taking
the remainder of k divided by m
h(k) = k mod m
• Advantage:
– fast, requires only one operation
• Disadvantage:
– Certain values of m are bad, e.g.,
• power of 2
• non-prime numbers
Example - The Division Method
• If m = 2^p, then h(k) is just the p least significant bits of k
– p = 1 ⇒ m = 2 ⇒ h(k) ∈ {0, 1}: the least significant bit of k
– p = 2 ⇒ m = 4 ⇒ h(k) ∈ {0, 1, 2, 3}: the 2 least significant bits of k
• Choose m to be a prime, not close to a power of 2
– m = 97: h(k) = k mod 97 spreads keys well (prime, not near a power of 2)
– m = 100: h(k) = k mod 100 depends only on the last two decimal digits of k
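A small sketch of the contrast above; the keys are made up for illustration — they all share the same last two decimal digits, so m = 100 sends them to one slot while the prime m = 97 spreads them out.

```python
def h_div(k, m):
    # Division method: h(k) = k mod m -- a single remainder operation.
    return k % m

keys = [300, 1500, 47100, 86400]       # illustrative keys, all ending in "00"
print([h_div(k, 100) for k in keys])   # m = 100: every key lands in slot 0
print([h_div(k, 97) for k in keys])    # m = 97 (prime): keys spread across slots
```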
The Multiplication Method
Idea:
• Multiply key k by a constant A, where 0 < A < 1
• Extract the fractional part of kA
• Multiply the fractional part by m
• Take the floor of the result
h(k) = ⌊m (k A mod 1)⌋
• Disadvantage: Slower than the division method
• Advantage: Value of m is not critical, e.g., typically m = 2^p
(fractional part: k A mod 1 = kA − ⌊kA⌋)
Example – Multiplication Method
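The worked example on this slide was a figure; here is a minimal sketch, assuming Knuth’s suggested constant A = (√5 − 1)/2. The values k = 123456 and m = 2^14 are CLRS’s worked example, which yields h(k) = 67.

```python
import math

# Multiplication method. A = (sqrt(5) - 1) / 2 is Knuth's suggestion
# (an assumption here; any 0 < A < 1 works, and m need not be prime).
def h_mult(k, m, A=(math.sqrt(5) - 1) / 2):
    frac = (k * A) % 1.0                 # fractional part: kA mod 1 = kA - floor(kA)
    return math.floor(m * frac)          # h(k) = floor(m * (kA mod 1))

print(h_mult(123456, 2**14))             # CLRS's worked values: h(123456) = 67
```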
Universal Hashing
• In practice, keys are not randomly distributed
• Any fixed hash function might yield Θ(n) time
• Goal: hash functions that produce random
table indices irrespective of the keys
• Idea:
– Select a hash function at random, from a designed
class of functions at the beginning of the execution
Universal Hashing (cont’d)
(Figure: a hash function is picked at random from a designed family of functions at the beginning of the execution.)
Definition of Universal Hash Functions
A family H of hash functions h : U → {0, 1, . . . , m - 1} is universal if,
for each pair of distinct keys x, y ∈ U, the number of functions h ∈ H
with h(x) = h(y) is at most |H|/m.
How is this property useful?
If h is drawn at random from a universal family H, then for any pair of
distinct keys x, y:
Pr(h(x) = h(y)) ≤ 1/m
Universal Hashing – Main Result
With universal hashing the chance of collision
between distinct keys k and l is no more than the
1/m chance of collision if locations h(k) and h(l)
were randomly and independently chosen from
the set {0, 1, …, m – 1}
Designing a Universal Class
of Hash Functions
• Choose a prime number p large enough so that every
possible key k is in the range [0 ... p – 1]
Zp = {0, 1, …, p - 1} and Zp* = {1, …, p - 1}
• Define the following hash function
ha,b(k) = ((ak + b) mod p) mod m, ∀ a ∈ Zp* and b ∈ Zp
• The family of all such hash functions is
Hp,m = {ha,b : a ∈ Zp* and b ∈ Zp}
• a, b: chosen randomly at the beginning of execution
The class Hp,m of hash functions is universal
Example: Universal Hash Functions
E.g.: p = 17, m = 6
ha,b(k) = ((ak + b) mod p) mod m
h3,4(8) = ((3⋅8 + 4) mod 17) mod 6
= (28 mod 17) mod 6
= 11 mod 6
= 5
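A sketch of drawing ha,b at random from Hp,m, together with the slide’s fixed example h3,4 (p = 17 and m = 6 are the slide’s values; random_hash is a hypothetical helper name).

```python
import random

def h_ab(a, b, k, p, m):
    # h_{a,b}(k) = ((ak + b) mod p) mod m
    return ((a * k + b) % p) % m

def random_hash(p, m):
    # Pick h_{a,b} at random from H_{p,m} at the start of execution.
    a = random.randint(1, p - 1)       # a ∈ Zp*
    b = random.randint(0, p - 1)       # b ∈ Zp
    return lambda k: h_ab(a, b, k, p, m)

print(h_ab(3, 4, 8, 17, 6))            # ((3*8 + 4) mod 17) mod 6 = 5
```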
Advantages of Universal Hashing
• Universal hashing provides good results on
average, independently of the keys to be stored
• Guarantees that no input will always elicit the
worst-case behavior
• Poor performance occurs only when the random
choice returns an inefficient hash function – this
has small probability
Open Addressing
• If we have enough contiguous memory to store all the keys
(m > N) ⇒ store the keys in the table itself
• No need to use linked lists anymore
• Basic idea:
– Insertion: if a slot is full, try another one,
until you find an empty one
– Search: follow the same sequence of probes
– Deletion: more difficult ... (we’ll see why)
• Search time depends on the length of the
probe sequence!
Generalize hash function notation:
• A hash function contains two arguments now:
(i) Key value, and (ii) Probe number
h(k,p), p=0,1,...,m-1
• Probe sequences
<h(k,0), h(k,1), ..., h(k,m-1)>
– Must be a permutation of <0,1,...,m-1>
– There are m! possible permutations
– Good hash functions should be able to
produce all m! probe sequences
Example (figure): inserting key 14 follows the probe sequence <1, 5, 9>.
Common Open Addressing Methods
• Linear probing
• Quadratic probing
• Double hashing
• Note: None of these methods can generate
more than m² different probe sequences!
Linear probing: Inserting a key
• Idea: when there is a collision, check the next available
position in the table (i.e., probing)
h(k,i) = (h1(k) + i) mod m
i=0,1,2,...
• First slot probed: h1(k)
• Second slot probed: h1(k) + 1
• Third slot probed: h1(k)+2, and so on
• Can generate at most m distinct probe sequences — why?
Because the first probe h1(k) determines the entire sequence:
probe sequence: <h1(k), h1(k)+1, h1(k)+2, ...> (wrapping around mod m)
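A minimal sketch of linear-probing insertion, assuming integer keys and h1(k) = k mod m; the keys 22, 33, 44 are made up so that they all collide at slot 0.

```python
def linear_probe_insert(T, k):
    m = len(T)
    for i in range(m):
        j = (k + i) % m              # h(k,i) = (h1(k) + i) mod m, wrapping around
        if T[j] is None:             # empty slot found
            T[j] = k
            return j
    raise RuntimeError("hash table overflow")

T = [None] * 11
for k in [22, 33, 44]:               # all hash to slot 0 under k mod 11
    print(linear_probe_insert(T, k)) # probes: 0; then 0->1; then 0->1->2
```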
Linear probing: Searching for a key
• Three cases:
(1) Position in table is occupied with an
element of equal key
(2) Position in table is empty
(3) Position in table occupied with a
different element
• Case 3: probe the next higher index
until the element is found or an
empty position is found
• The process wraps around to the
beginning of the table
Linear probing: Deleting a key
• Problems
– Cannot mark the slot as empty
– Impossible to retrieve keys inserted
after that slot was occupied
• Solution
– Mark the slot with a sentinel value
DELETED
• The deleted slot can later be used
for insertion
• Searching will be able to find all the
keys
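A sketch of the DELETED-sentinel idea, assuming the same linear-probing layout as the earlier sketch; search treats DELETED as occupied and keeps probing, while a truly empty slot ends the search.

```python
DELETED = object()                    # sentinel distinct from any key

def lp_search(T, k):
    m = len(T)
    for i in range(m):
        j = (k + i) % m
        if T[j] is None:              # truly empty: k cannot be further along
            return None
        if T[j] is not DELETED and T[j] == k:
            return j                  # found at slot j
    return None

def lp_delete(T, k):
    j = lp_search(T, k)
    if j is not None:
        T[j] = DELETED                # mark, don't empty: keeps probe chains intact
```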
Primary Clustering Problem
• Some slots become more likely than others
• Long chunks of occupied slots are created
⇒ search time increases!!
(Figure: initially every slot is probed with probability 1/m; once runs of occupied slots form, the slot just past a run of length c is filled next with probability (c + 1)/m — e.g., 2/m, 4/m, 5/m in the figure.)
Quadratic probing
h(k,i) = (h1(k) + c1·i + c2·i²) mod m, i = 0, 1, 2, ...
(c1 and c2 are fixed positive constants; probe offsets grow quadratically,
which avoids primary clustering, but keys with the same h1 value still
share a probe sequence — secondary clustering.)
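A sketch with c1 = c2 = 1/2, a common choice (not fixed by the slides) that visits every slot when m is a power of 2.

```python
def quad_probe_insert(T, k):
    m = len(T)                           # assumed to be a power of 2 here
    for i in range(m):
        j = (k + (i + i * i) // 2) % m   # h(k,i) = (h1(k) + i/2 + i²/2) mod m
        if T[j] is None:
            T[j] = k
            return j
    raise RuntimeError("hash table overflow")
```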
Double Hashing
(1) Use one hash function to determine the first slot
(2) Use a second hash function to determine the
increment for the probe sequence
h(k,i) = (h1(k) + i·h2(k)) mod m, i = 0, 1, ...
• Initial probe: h1(k)
• Each successive probe is offset by h2(k), mod m
• Advantage: avoids clustering
• Disadvantage: harder to delete an element
• Can generate at most m² probe sequences
Double Hashing: Example
h1(k) = k mod 13
h2(k) = 1+ (k mod 11)
h(k,i) = (h1(k) + i h2(k) ) mod 13
• Insert key 14: h2(14) = 1 + (14 mod 11) = 4
h(14,0) = h1(14) = 14 mod 13 = 1
h(14,1) = (h1(14) + h2(14)) mod 13
= (1 + 4) mod 13 = 5
h(14,2) = (h1(14) + 2·h2(14)) mod 13
= (1 + 8) mod 13 = 9
(Figure: a table of size 13 already holding 79, 69, 98, 72, and 50; slots 1 and 5 are occupied, so key 14 lands in slot 9.)
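A one-function sketch reproducing the probe sequence computed above, using the slide’s h1 and h2.

```python
def double_hash_probes(k, m=13):
    # Probe sequence h(k,i) = (h1(k) + i*h2(k)) mod m for i = 0..m-1,
    # with h1(k) = k mod 13 and h2(k) = 1 + (k mod 11) as on the slide.
    h1, h2 = k % m, 1 + (k % 11)
    return [(h1 + i * h2) % m for i in range(m)]

print(double_hash_probes(14)[:3])    # -> [1, 5, 9], as computed above
```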
Analysis of Open Addressing
• Expected number of probes in an unsuccessful search, assuming
uniform hashing, with load factor α = n/m < 1:
1 + α + α² + · · · = Σk=0..∞ α^k = 1/(1 − α)
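A short check of the geometric-series bound, matching the unsuccessful-retrieval numbers on the next slide (a sketch under the uniform hashing assumption).

```latex
% Each probe after the first is needed only if the previous slot was
% occupied, which under uniform hashing happens with probability at most α.
\begin{align*}
E[\#\text{probes}] \;\le\; \sum_{k=0}^{\infty} \alpha^{k} \;=\; \frac{1}{1-\alpha},
\qquad
\alpha = 0.5 \Rightarrow 2, \quad \alpha = 0.9 \Rightarrow 10.
\end{align*}
```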
Analysis of Open Addressing (cont’d)
Example (similar to Exercise 11.4-4, page 244)
Unsuccessful retrieval:
α = 0.5 ⇒ E(#steps) = 2
α = 0.9 ⇒ E(#steps) = 10
Successful retrieval:
α = 0.5 ⇒ E(#steps) = 3.387
α = 0.9 ⇒ E(#steps) = 3.670