File Structures SNU-OOPSLA Lab. 1
Chap12. Extendible Hashing
Seoul National University, School of Computer Science and Engineering
Object-Oriented Systems Laboratory
SNU-OOPSLA-LAB
Professor Hyoung-Joo Kim
File Structures by Folk, Zoellick and Riccardi
Chapter Objectives
Describe the problem solved by extendible hashing and related
approaches
Explain how extendible hashing works; show how it combines
tries with conventional, static hashing
Use the buffer, file, and index classes of previous chapters to
implement extendible hashing, including deletion
Review studies of extendible hashing performance
Examine alternative approaches to the same problem, including
dynamic hashing, linear hashing, and hashing schemes that
control splitting by allowing for overflow buckets
Contents
12.1 Introduction
12.2 How extendible hashing works
12.3 Implementation
12.4 Deletion
12.5 Extendible hashing performance
12.6 Alternative approaches
12.1 Introduction
Dynamic files
undergo a lot of growth
Static hashing
described in Chapter 11 (direct hashing)
typically worse than a B-tree for dynamic files
eventually requires file reorganization
Extendible hashing
hashing for dynamic files
Fagin, Nievergelt, Pippenger, and Strong (ACM TODS 1979)
Overview (1)
Direct access (hashing) files have a static size, so they are not
suitable for files whose size is unknown in advance
A dynamic file structure is desired that retains the feature
of fast retrieval by primary key, and that also expands
and contracts as the number of records in the file
fluctuates (without reorganizing the whole file)
Similar motivation!
Indexed-sequential file ==> B-tree
Hashing ==> Extendible hashing
Overview (2)
[Figure: extendible hashing overview. The primary key is passed through the hashing function to give H(key); the first d bits are extracted and used as an index into the directory (table look-up), which yields a file pointer]
12.2 How Extendible Hashing Works
The idea comes from tries (radix searching)
The branching factor of the tree is equal to the number of
alternative symbols in each position of the key
e.g.) Radix 26 trie - able, abrahms, adams, anderson,
andrews, baird
Use the first n characters for branching
[Figure: radix-26 trie branching on successive letters (a, b; then b, d, n; then l, r, d; then e, r) down to the leaves able, abrahms, adams, anderson, andrews, baird]
Extendible Hashing
H maps keys to a fixed address space whose size is the largest
prime less than a power of 2 (65521 < $2^{16}$)
File pointers point to blocks of records known as buckets;
an entire bucket is read by one physical data transfer, and
buckets may be added to or removed from the file dynamically
The first d bits of H(key) are used as an index into a directory array containing
$2^d$ entries, which usually resides in primary memory
The value d, the directory size ($2^d$), and the number of buckets
change automatically as the file expands and contracts
Extendible Hashing Example
[Figure: directory with d=3 and 4 buckets]
Directory entries 000, 001, 010, 011 -> bucket B0 (d'=1, H(key)=0..)
Directory entry 100 -> bucket B100 (d'=3, H(key)=100..)
Directory entry 101 -> bucket B101 (d'=3, H(key)=101..)
Directory entries 110, 111 -> bucket B11 (d'=2, H(key)=11..)
Turning the trie into a directory
Using a trie for extendible hashing
(1) Use a radix-2 trie:
Keys in A: beginning with 0
Keys in B: beginning with 10
Keys in C: beginning with 11
(2) Retrieve from secondary storage the buckets containing
keys, instead of individual keys
[Figure: radix-2 trie; branch 0 leads to bucket A, branch 1 then 0 leads to bucket B, branch 1 then 1 leads to bucket C]
Representation of Trie (1)
A tree representation is not preferable (the directory is not big)
A flattened array
1. Extend the trie into a complete binary tree
2. Collapse it into the directory structure
[Figure: complete binary tree with leaves A, A, B, C, collapsed into the directory 00 -> A, 01 -> A, 10 -> B, 11 -> C]
Representation of Trie (2)
Directory is a complete binary tree
Directory entry: a pointer to the associated bucket
Given an address beginning with the bits 10, directory entry $10_2$
holds the pointer to the associated bucket
A hash function is introduced to give a uniform distribution of addresses
Retrieve a record
Steps in retrieving a record with a given key
find H(given key)
extract first d bits of H(given key)
use this value as an index into the directory to find a pointer
use this pointer to read a bucket into primary memory
locate the desired record within the bucket (scan)
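The retrieval steps above can be sketched in C++ roughly as follows. This is a minimal in-memory sketch under stated assumptions, not the textbook's implementation: `Bucket`, `Directory`, `firstBits`, and `find` are hypothetical names, buckets live in a vector instead of being read from disk, and H(key) is assumed to be a 16-bit value whose "first d bits" are its high-order bits.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical in-memory stand-in for a disk bucket: (hash, record) pairs.
struct Bucket {
    std::vector<std::pair<std::uint16_t, std::string>> records;
};

struct Directory {
    int depth;                    // d: number of leading hash bits used as the index
    std::vector<int> cells;       // 2^d cells, each holding a bucket number
    std::vector<Bucket> buckets;  // stands in for the bucket file
};

// Step 2: extract the first d bits of a 16-bit hash value.
inline int firstBits(std::uint16_t h, int d) {
    return d == 0 ? 0 : h >> (16 - d);
}

// Steps 3-5: index the directory, "read" the bucket, then scan it.
inline std::optional<std::string> find(const Directory& dir, std::uint16_t h) {
    const Bucket& b = dir.buckets[dir.cells[firstBits(h, dir.depth)]];
    for (const auto& rec : b.records)
        if (rec.first == h) return rec.second;
    return std::nullopt;  // no record with this hash value in the bucket
}
```

With the directory held in memory, the only step that would touch the disk in a real file is the bucket read, which is why a lookup costs a single access.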
Expansion & Contraction (1)
A pair of adjacent buckets with the same value of d' that
share a common value of the first d'-1 bits of H(key) can
be combined if their average load is below 50%, so that all records
fit into one bucket
File contraction is the reverse of expansion; the directory
can be compacted and d decremented whenever every pair
of adjacent pointers has the same value
Expansion & Contraction (2)
Bucket B0 overflows, then splits into B00 and B01
[Figure: directory with d=3 after the split]
Directory entries 000, 001 -> bucket B00 (d'=2, H(key)=00..)
Directory entries 010, 011 -> bucket B01 (d'=2, H(key)=01..)
Directory entry 100 -> bucket B100 (d'=3, H(key)=100..)
Directory entry 101 -> bucket B101 (d'=3, H(key)=101..)
Directory entries 110, 111 -> bucket B11 (d'=2, H(key)=11..)
Expansion & Contraction (3)
Bucket B100 overflows, d increases to 4
[Figure: directory with d=4 after the split]
Directory entries 0000-0011 -> bucket B00 (d'=2, H(key)=00..)
Directory entries 0100-0111 -> bucket B01 (d'=2, H(key)=01..)
Directory entry 1000 -> bucket B1000 (d'=4, H(key)=1000..)
Directory entry 1001 -> bucket B1001 (d'=4, H(key)=1001..)
Directory entries 1010, 1011 -> bucket B101 (d'=3, H(key)=101..)
Directory entries 1100-1111 -> bucket B11 (d'=2, H(key)=11..)
Splitting to Handle Overflow (1)
When overflow occurs
e.g. 1) Overflow of bucket A
Split A into A and D
Make use of an additional, previously unused bit
No need to expand the directory
[Figure: directory before the split (00 -> A, 01 -> A, 10 -> B, 11 -> C) and after it (00 -> A, 01 -> D, 10 -> B, 11 -> C)]
Splitting to Handle Overflow (2)
e.g. 2) Overflow of bucket B
No additional unused bits are available
(need to expand the directory)
1. Divide B using 3 bits of the hash address
2. Extend the trie into a complete binary tree
3. Collapse it into the directory structure
[Figure: directory before the split: 00 -> A, 01 -> A, 10 -> B, 11 -> C]
[Figure (slide 19): overflow of bucket B with directory expansion]
1. Result of overflow of bucket B: trie with leaves A (0), B (100), D (101), C (11)
2. Complete binary tree: leaves A, A, A, A, B, D, C, C
3. Directory: 000, 001, 010, 011 -> A; 100 -> B; 101 -> D; 110, 111 -> C
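The two overflow cases above (split using a spare bit, or double the directory first) can be sketched as one insert routine. This is an illustrative in-memory sketch, not the textbook's `Directory`/`Bucket` classes: `ExtendibleTable` and its members are hypothetical names, hash values are assumed to be 16 bits wide with the "first d bits" taken from the high-order end, and `std::shared_ptr` stands in for bucket file addresses.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical in-memory bucket; a real file would store buckets on disk.
struct Bucket {
    int depth = 0;                    // d': number of bits this bucket uses
    std::vector<std::uint16_t> keys;  // hash values stored in the bucket
};

class ExtendibleTable {
public:
    explicit ExtendibleTable(std::size_t capacity)
        : capacity_(capacity), cells_(1, std::make_shared<Bucket>()) {}

    int depth() const { return depth_; }
    std::size_t directorySize() const { return cells_.size(); }
    int bucketDepthFor(std::uint16_t h) const { return cells_[index(h, depth_)]->depth; }

    void insert(std::uint16_t h) {
        std::shared_ptr<Bucket> b = cells_[index(h, depth_)];
        // Split until the target bucket has room (gives up once all 16 bits are used).
        while (b->keys.size() >= capacity_ && b->depth < 16) {
            if (b->depth == depth_) doubleDirectory();  // no unused bit left
            split(b);
            b = cells_[index(h, depth_)];
        }
        b->keys.push_back(h);
    }

private:
    // First `bits` bits (high-order end) of a 16-bit hash value.
    static std::size_t index(std::uint16_t h, int bits) {
        return bits == 0 ? 0 : static_cast<std::size_t>(h >> (16 - bits));
    }

    // Each cell i becomes cells 2i and 2i+1, both pointing at the same bucket.
    void doubleDirectory() {
        std::vector<std::shared_ptr<Bucket>> bigger(cells_.size() * 2);
        for (std::size_t i = 0; i < cells_.size(); ++i)
            bigger[2 * i] = bigger[2 * i + 1] = cells_[i];
        cells_.swap(bigger);
        ++depth_;
    }

    // Use one more hash bit to divide b's keys between b and a new bucket.
    void split(const std::shared_ptr<Bucket>& b) {
        auto fresh = std::make_shared<Bucket>();
        fresh->depth = ++b->depth;
        std::vector<std::uint16_t> old;
        old.swap(b->keys);
        for (std::uint16_t k : old) {
            Bucket& target = (index(k, b->depth) & 1) ? *fresh : *b;
            target.keys.push_back(k);
        }
        // Redirect the cells whose newly used bit is 1 to the new bucket.
        for (std::size_t i = 0; i < cells_.size(); ++i)
            if (cells_[i] == b && ((i >> (depth_ - b->depth)) & 1))
                cells_[i] = fresh;
    }

    std::size_t capacity_;
    int depth_ = 0;
    std::vector<std::shared_ptr<Bucket>> cells_;
};
```

The directory doubles only when the overflowing bucket already uses every directory bit (d' = d); otherwise the split just redirects half of the cells that pointed at the old bucket.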
Creating Address
Function hash(KEY)
Fold/Add hashing algorithm
Do not MOD hashing value by address space since no fixed
address space exists
Output from the hash function for a number of keys
bill 0000 0011 0110 1100
lee 0000 0100 0010 1000
pauline 0000 1111 0110 0101
alan 0100 1100 1010 0010
julie 0010 1110 0000 1001
mike 0000 0111 0100 1101
elizabeth 0010 1100 0110 1010
mark 0000 1010 0000 0111
#include <cstring>   // strlen

int Hash (char * key)
{
    int sum = 0;
    int len = strlen(key);
    if (len % 2 == 1) len ++; // make len even; the extra byte is the terminating '\0'
    for (int j = 0; j < len; j += 2)   // fold the key two characters at a time
        sum = (sum + 100 * key[j] + key[j+1]) % 19937;
    return sum;
}
Figure 12.7 Function Hash (key) returns an integer hash value for key
for a 15 bit
int MakeAddress (char * key, int depth)
{
    int retval = 0;
    int hashVal = Hash(key);
    // reverse the bits: the lowest-order bit of the hash value
    // becomes the highest-order bit of the address
    for (int j = 0; j < depth; j++)
    {
        retval = retval << 1;
        int lowbit = hashVal & 1;
        retval = retval | lowbit;
        hashVal = hashVal >> 1;
    }
    return retval;
}
Figure 12.9 Function MakeAddress(key,depth)
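As a worked example, the two functions can be traced on keys from the hash-output table. The sketch below restates them in self-contained form (taking `const char*` so that string literals can be passed, and with the loop increment written as `j += 2`); `Bits` is a hypothetical helper added only to display bit patterns.

```cpp
#include <cstring>
#include <string>

// Fold-and-add hash in the style of Figure 12.7.
int Hash(const char* key) {
    int sum = 0;
    int len = static_cast<int>(std::strlen(key));
    if (len % 2 == 1) len++;  // odd length: the last pair ends with the '\0'
    for (int j = 0; j < len; j += 2)
        sum = (sum + 100 * key[j] + key[j + 1]) % 19937;
    return sum;
}

// In the style of Figure 12.9: the low-order `depth` bits, in reversed order.
int MakeAddress(const char* key, int depth) {
    int retval = 0;
    int hashVal = Hash(key);
    for (int j = 0; j < depth; j++) {
        retval = (retval << 1) | (hashVal & 1);
        hashVal >>= 1;
    }
    return retval;
}

// Hypothetical helper: render the low `width` bits of value as a 0/1 string.
std::string Bits(int value, int width) {
    std::string s(width, '0');
    for (int i = 0; i < width; ++i)
        if ((value >> i) & 1) s[width - 1 - i] = '1';
    return s;
}
```

For "bill", `Hash` returns 876, which is the pattern 0000 0011 0110 1100 listed in the table. `MakeAddress("bill", 4)` takes the low-order bits 1100 and reverses them into 0011, giving address 3.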
class Bucket: protected TextIndex
{protected:
    Bucket (Directory & dir, int maxKeys = defaultMaxKeys);
    int Insert (char * key, int recAddr);
    int Remove (char * key);
    Bucket * Split ();
    int NewRange (int & newStart, int & newEnd);
    int Redistribute (Bucket & newBucket);
    int FindBuddy ();
    int TryCombine ();
    int Combine (Bucket * buddy, int buddyIndex);
    int Depth;
    Directory & Dir;
    int BucketAddr;
    friend class Directory;
    friend class BucketBuffer;
};
Figure 12.10 Main members of class Bucket
class Directory
{public:
    Directory (…..); ~Directory();
    int Open (..); int Create(…); int Close();
    int Insert(…); int Delete(…); int Search(…);
protected:
    int DoubleSize();
    int Collapse();
    int InsertBucket (….);
    int Find (…);
    int StoreBucket(…);
    int LoadBucket(…);
    …..
};
Figure 12.11 Definition of class Directory
12.4 Deletion
When to combine buckets
Buddy buckets: buckets that are siblings at the leaf level
of the trie ("buddy" means something like "friend")
e.g., B and D on slide 19 are buddy buckets
Examine the directory to see if we can make changes
there
Shrink the directory if none of the buckets requires the depth
of address information that is currently available in the
directory
Buddy Bucket
Given a bucket with an address uvwxy, where u,
v, w, x, and y have values of either 0 or 1, the
buddy bucket, if it exists, has the value uvwxz,
such that
z = y XOR 1
If enough keys are deleted, the contents of buddy
buckets can be combined into a single bucket
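The buddy rule z = y XOR 1 amounts to flipping the last bit of the bucket address. A minimal sketch follows; `findBuddy` and its parameters are hypothetical names, and it assumes, as a simplification, that only a bucket using the full directory depth is checked for a buddy.

```cpp
// Returns the address of the buddy bucket, or -1 when no buddy can exist.
// bucketAddr is the bucket's address written with bucketDepth bits.
int findBuddy(int bucketAddr, int bucketDepth, int directoryDepth) {
    if (directoryDepth == 0) return -1;           // single bucket: no buddy
    if (bucketDepth < directoryDepth) return -1;  // assumption: only full-depth buckets qualify
    return bucketAddr ^ 1;                        // flip the last bit: z = y XOR 1
}
```

For the five-bit address 10110 the buddy is 10111, and applying the same rule to 10111 leads back to 10110.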
Collapsing the Directory
Collapse condition
If the directory is a single cell, downsizing is impossible
If there is a pair of directory cells that do not both point to the
same bucket, collapsing is impossible
Allocating space
Allocate half the size of the original
Copy the bucket reference shared by each cell pair to a single
cell in the new directory
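The collapse check and the copy step can be sketched as a single function. This is a sketch under assumptions: the directory is modeled as a `std::vector<int>` of bucket numbers whose size is a power of two, and `collapse` is a hypothetical name rather than the textbook's `Directory::Collapse`.

```cpp
#include <cstddef>
#include <vector>

// Halves `cells` and returns true if every pair (2i, 2i+1) points at the
// same bucket; otherwise leaves `cells` unchanged and returns false.
bool collapse(std::vector<int>& cells) {
    if (cells.size() <= 1) return false;             // single cell: cannot shrink
    for (std::size_t i = 0; i < cells.size(); i += 2)
        if (cells[i] != cells[i + 1]) return false;  // some pair differs
    std::vector<int> half(cells.size() / 2);         // half the original size
    for (std::size_t i = 0; i < half.size(); ++i)
        half[i] = cells[2 * i];                      // one reference per pair
    cells.swap(half);
    return true;
}
```

Calling it repeatedly until it returns false shrinks the directory as far as the current buckets allow.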
12.5 Extendible Hashing Performance
Time: O(1)
If the directory can be kept in RAM: a single access
Otherwise: two accesses are necessary
Space utilization of the buckets
r (# of records), b (block size), N (# of blocks)
Utilization = r / bN
Average utilization ==> 0.69
Space utilization for the directory
How large a directory should we expect to have,
given an expected number of keys?
Expected value for the directory size by Flajolet (1983)
Estimated directory size = $\frac{3.92}{b} \, r^{(1 + 1/b)}$
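To get a feel for the directory-size estimate, it can be evaluated directly. The sketch below reads the formula as (3.92 / b) * r^(1 + 1/b); the function name `estimatedDirectorySize` is hypothetical.

```cpp
#include <cmath>

// Directory-size estimate read as (3.92 / b) * r^(1 + 1/b), where
// r is the number of records and b is the bucket (block) size in records.
double estimatedDirectorySize(double r, double b) {
    return 3.92 / b * std::pow(r, 1.0 + 1.0 / b);
}
```

Because the exponent 1 + 1/b is only slightly above 1 for large buckets, the estimate grows slightly faster than linearly in r, and larger buckets give a smaller directory for the same number of records.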
Space utilization for buckets
Periodic and fluctuating
With uniformly distributed addresses, all the buckets tend to fill up at the
same time -> split at the same time
As buckets fill up: 90%
After a concentrated series of splits: 50%
r: # of records, b: block size
$N \approx \frac{r}{b \ln 2}$
Utilization = $\frac{r}{bN} \approx \ln 2 = 0.69$
Average utilization of 69%
B-tree space utilization
Normal B-tree: 67%, B-tree with redistribution on insertion: 85%
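The 69% figure follows from substituting the expected number of buckets into the utilization formula. The step below assumes the bucket-count estimate has the form $N \approx r / (b \ln 2)$, which is the form consistent with the stated result:

```latex
\[
\text{Utilization} \;=\; \frac{r}{bN}
\;\approx\; \frac{r}{b \cdot \dfrac{r}{b \ln 2}}
\;=\; \ln 2 \;\approx\; 0.69
\]
```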
12.6 Alternative Approaches (1): Dynamic Hashing
Similar to extendible hashing
Use a directory to track bucket addresses
Extend the directory through the use of tries
Start with a hash function that covers an address space of
a fixed size
When overflow occurs
the bucket splits, forming the leaves of a trie that grows down from the
original address node
Alternative Approaches (2): Dynamic Hashing
Two kinds of nodes
External node: references a data bucket
Internal node: points to two child index nodes
When a node's bucket splits into children, the node changes from an external
node to an internal node
Two hash functions
Apply the first hash function to reach the original address space
if an external node is found: the search is complete
if an internal node is found: apply the second hash function
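[Figure: growth of the index in dynamic hashing, panels (a)-(c). Each panel shows the original address space (nodes 1 2 3 4) with the trie growing beneath it: (a) no splits; (b) node 4 split into 40 and 41; (c) node 2 split into 20 and 21, and node 41 split into 410 and 411]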
Dynamic Hashing vs. Extendible Hashing (1)
Overflow handling
Both schemes extend the hash function locally, as a binary search
trie
Both schemes use a directory structure
Dynamic hashing: a linked structure
Extendible hashing: a perfect tree expressible as an array
Space utilization
The same for both schemes (about 69%)
Dynamic Hashing and Extendible Hashing (2)
Growth of directory
Dynamic hashing: slower, more gradual growth
Extendible hashing: extends the directory by doubling it
Actual size of an index node
A dynamic hashing index node is larger than a directory cell in extendible
hashing (because of pointers)
Page faults
Dynamic hashing: more than one page fault (with a linked structure
for the directory)
Extendible hashing: a single page fault
Alternative Approaches (3): Linear Hashing
Unlike extendible hashing and dynamic hashing, linear hashing does
not use a directory.
The actual address space is extended one bucket at a time as
buckets overflow
Because the extension of the address space does not necessarily
correspond to the bucket that is overflowing,
linear hashing necessarily involves the use of overflow buckets, even
as the address space expands
No directory: avoids the additional seek resulting from an additional layer
Use more bits of the hashed value
$h_d(k)$: depth-d hashing function (using function make_address)
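Without a directory, the address of a key has to be computed from the hash value and the position of the next bucket to be split. The sketch below shows the usual linear hashing address rule under stated assumptions: `hd` and `linearAddress` are hypothetical names, p denotes the next bucket to split, and, for simplicity, the depth-d function is taken as the low-order d bits of the hash value rather than the reversed bits produced by make_address.

```cpp
#include <cstdint>

// Assumed depth-d hash: the low-order d bits of the hash value (d < 32).
inline std::uint32_t hd(std::uint32_t hash, int d) {
    return hash & ((1u << d) - 1u);
}

// p is the next bucket to be split. Buckets 0..p-1 have already been split
// in the current round, so keys that land there are addressed with d + 1 bits.
inline std::uint32_t linearAddress(std::uint32_t hash, int d, std::uint32_t p) {
    std::uint32_t addr = hd(hash, d);
    return addr >= p ? addr : hd(hash, d + 1);
}
```

With d = 2 and p = 1 (bucket 00 already split into 000 and 100), a hash value ending in 100 is sent to the new bucket 100, while a hash value ending in 01 still uses the two-bit address 01.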
[Figure: the growth of address space in linear hashing, panels (a)-(e)]
(a) buckets a b c d at addresses 00 01 10 11
(b) buckets a b c d A (A at address 100), with overflow bucket w
(c) buckets a b c d A B (B at address 101), with overflow bucket x
(d) buckets a b c d A B C (C at address 110), with overflow buckets x and y
(e) buckets a b c d A B C D (D at address 111), with overflow bucket x
Alternative Approaches (5): Approaches to Controlling Splitting
Postpone splitting: increase space utilization
B-tree: redistribution rather than splitting
Hashing: placing records in chains of overflow buckets to
postpone splitting
Triggering event for splitting
Linear hashing
Every time any bucket overflows
The bucket that is split is not necessarily the one that overflowed
Litwin (1980): use the overall load factor of the file as the trigger
Fewer than 2 seeks, 75%-80% storage utilization
Alternative Approaches (5): Approaches to Controlling Splitting
Postpone splitting for extendible hashing
Use chained overflow buckets
Avoid doubling the directory space
1.1 seeks, 76%-81% storage utilization
Let's Review!
12.1 Introduction
12.2 How extendible hashing works
12.3 Implementation
12.4 Deletion
12.5 Extendible hashing performance
12.6 Alternative approaches

More Related Content

DOCX
Digital Watermarking
PPTX
Graph in data structure
PPTX
Array implementation and linked list as datat structure
PPTX
Unit II - LINEAR DATA STRUCTURES
PPTX
Stack project
PDF
Secteur TIC en région wallonne - SWOT
PDF
Hashing and Hash Tables
PPTX
Graph data structure
Digital Watermarking
Graph in data structure
Array implementation and linked list as datat structure
Unit II - LINEAR DATA STRUCTURES
Stack project
Secteur TIC en région wallonne - SWOT
Hashing and Hash Tables
Graph data structure

What's hot (20)

PPT
Encryption
PPTX
dplyr Package in R
PDF
Searching and Sorting Techniques in Data Structure
PPT
Hashing PPT
PPTX
Oracle Database Security
PPTX
Data structure - Graph
PPTX
My lectures circular queue
PPTX
Queue
PPTX
Graph data structure and algorithms
PDF
C++ Arrays different operations .pdf
PPT
Data structure lecture 5
PDF
Artificial Neural Network and its Applications
PPTX
Secure hash function
PDF
What is Stack, Its Operations, Queue, Circular Queue, Priority Queue
PDF
Lecture Notes Unit4 Chapter13 users , roles and privileges
PPTX
RSA Algorithm
PPTX
Doubly linked list (animated)
PPT
C++ Data Structure PPT.ppt
PDF
Oracle e-business suite R12 step by step Installation
PPTX
Linked list
Encryption
dplyr Package in R
Searching and Sorting Techniques in Data Structure
Hashing PPT
Oracle Database Security
Data structure - Graph
My lectures circular queue
Queue
Graph data structure and algorithms
C++ Arrays different operations .pdf
Data structure lecture 5
Artificial Neural Network and its Applications
Secure hash function
What is Stack, Its Operations, Queue, Circular Queue, Priority Queue
Lecture Notes Unit4 Chapter13 users , roles and privileges
RSA Algorithm
Doubly linked list (animated)
C++ Data Structure PPT.ppt
Oracle e-business suite R12 step by step Installation
Linked list
Ad

Viewers also liked (20)

PPT
Chapter13
PPTX
Hashing Technique In Data Structures
PPT
12. Indexing and Hashing in DBMS
PPT
1.5 weka an intoduction
PPTX
Extendible hashing
PPT
2.5 graph dfs
PPT
2.3 shortest path dijkstra’s
PPT
2.5 dfs & bfs
PDF
Hashing Algorithm
PPT
5.5 back track
PPTX
On Ahimsa : Reply To Lala Lajpat Rai A literature presentation
PPT
2.4 rule based classification
PDF
DBMS topics for BCA
PDF
4 the relational data model and relational database constraints
PPT
PPT
Dijksatra
PPT
Disk scheduling algorithms
PPTX
Disk scheduling
PPT
9 cm402.18
POT
Arrays and addressing modes
Chapter13
Hashing Technique In Data Structures
12. Indexing and Hashing in DBMS
1.5 weka an intoduction
Extendible hashing
2.5 graph dfs
2.3 shortest path dijkstra’s
2.5 dfs & bfs
Hashing Algorithm
5.5 back track
On Ahimsa : Reply To Lala Lajpat Rai A literature presentation
2.4 rule based classification
DBMS topics for BCA
4 the relational data model and relational database constraints
Dijksatra
Disk scheduling algorithms
Disk scheduling
9 cm402.18
Arrays and addressing modes
Ad

Similar to 4.4 external hashing (20)

PPT
Extensible hashing
PDF
extensiblehashing-191010111114.pdf
PPTX
File System Implementation.pptx
PPTX
Lecture14-Hash-Based-Indexing-and-Sorting-MHH-18Oct-2016.pptx
PPT
PAM.ppt
PPT
Hashing
PPT
Database MGMT - Hash Index Linear Hashing only
PPTX
Hash Table.pptx
PDF
Hive Demo Paper at VLDB 2009
PPTX
SKILLWISE-DB2 DBA
PPTX
1.R_For_Libraries_Session_2_-_Data_Exploration.pptx
PDF
VTU 3RD SEM UNIX AND SHELL PROGRAMMING SOLVED PAPERS
PDF
Introduction to r studio on aws 2020 05_06
PDF
Introduction to Python
PPT
python language programming presentation
PPTX
hashing1.pptx Data Structures and Algorithms
PDF
Final exam in advance dbms
PDF
Unlock user behavior with 87 Million events using Hudi, StarRocks & MinIO
PPTX
DBMS Data Storage and Query Processing.
PDF
Algorithms notes tutorials duniya
Extensible hashing
extensiblehashing-191010111114.pdf
File System Implementation.pptx
Lecture14-Hash-Based-Indexing-and-Sorting-MHH-18Oct-2016.pptx
PAM.ppt
Hashing
Database MGMT - Hash Index Linear Hashing only
Hash Table.pptx
Hive Demo Paper at VLDB 2009
SKILLWISE-DB2 DBA
1.R_For_Libraries_Session_2_-_Data_Exploration.pptx
VTU 3RD SEM UNIX AND SHELL PROGRAMMING SOLVED PAPERS
Introduction to r studio on aws 2020 05_06
Introduction to Python
python language programming presentation
hashing1.pptx Data Structures and Algorithms
Final exam in advance dbms
Unlock user behavior with 87 Million events using Hudi, StarRocks & MinIO
DBMS Data Storage and Query Processing.
Algorithms notes tutorials duniya

More from Krish_ver2 (20)

PPT
5.5 back tracking
PPT
5.5 back tracking 02
PPT
5.4 randomized datastructures
PPT
5.4 randomized datastructures
PPT
5.4 randamized algorithm
PPT
5.3 dynamic programming 03
PPT
5.3 dynamic programming
PPT
5.3 dyn algo-i
PPT
5.2 divede and conquer 03
PPT
5.2 divide and conquer
PPT
5.2 divede and conquer 03
PPT
5.1 greedyyy 02
PPT
5.1 greedy
PPT
5.1 greedy 03
PPT
4.4 hashing02
PPT
4.4 hashing
PPT
4.4 hashing ext
PPT
4.2 bst
PPT
4.2 bst 03
PPT
4.2 bst 02
5.5 back tracking
5.5 back tracking 02
5.4 randomized datastructures
5.4 randomized datastructures
5.4 randamized algorithm
5.3 dynamic programming 03
5.3 dynamic programming
5.3 dyn algo-i
5.2 divede and conquer 03
5.2 divide and conquer
5.2 divede and conquer 03
5.1 greedyyy 02
5.1 greedy
5.1 greedy 03
4.4 hashing02
4.4 hashing
4.4 hashing ext
4.2 bst
4.2 bst 03
4.2 bst 02

Recently uploaded (20)

PDF
2.FourierTransform-ShortQuestionswithAnswers.pdf
PDF
FourierSeries-QuestionsWithAnswers(Part-A).pdf
PDF
102 student loan defaulters named and shamed – Is someone you know on the list?
PDF
Sports Quiz easy sports quiz sports quiz
PPTX
Microbial diseases, their pathogenesis and prophylaxis
PDF
O7-L3 Supply Chain Operations - ICLT Program
PDF
Complications of Minimal Access Surgery at WLH
PPTX
Cell Types and Its function , kingdom of life
PDF
Basic Mud Logging Guide for educational purpose
PPTX
Introduction_to_Human_Anatomy_and_Physiology_for_B.Pharm.pptx
PPTX
school management -TNTEU- B.Ed., Semester II Unit 1.pptx
PDF
Anesthesia in Laparoscopic Surgery in India
PPTX
Institutional Correction lecture only . . .
PDF
Microbial disease of the cardiovascular and lymphatic systems
PPTX
PPT- ENG7_QUARTER1_LESSON1_WEEK1. IMAGERY -DESCRIPTIONS pptx.pptx
PDF
3rd Neelam Sanjeevareddy Memorial Lecture.pdf
PDF
Classroom Observation Tools for Teachers
PPTX
Final Presentation General Medicine 03-08-2024.pptx
PPTX
PPH.pptx obstetrics and gynecology in nursing
PPTX
Renaissance Architecture: A Journey from Faith to Humanism
2.FourierTransform-ShortQuestionswithAnswers.pdf
FourierSeries-QuestionsWithAnswers(Part-A).pdf
102 student loan defaulters named and shamed – Is someone you know on the list?
Sports Quiz easy sports quiz sports quiz
Microbial diseases, their pathogenesis and prophylaxis
O7-L3 Supply Chain Operations - ICLT Program
Complications of Minimal Access Surgery at WLH
Cell Types and Its function , kingdom of life
Basic Mud Logging Guide for educational purpose
Introduction_to_Human_Anatomy_and_Physiology_for_B.Pharm.pptx
school management -TNTEU- B.Ed., Semester II Unit 1.pptx
Anesthesia in Laparoscopic Surgery in India
Institutional Correction lecture only . . .
Microbial disease of the cardiovascular and lymphatic systems
PPT- ENG7_QUARTER1_LESSON1_WEEK1. IMAGERY -DESCRIPTIONS pptx.pptx
3rd Neelam Sanjeevareddy Memorial Lecture.pdf
Classroom Observation Tools for Teachers
Final Presentation General Medicine 03-08-2024.pptx
PPH.pptx obstetrics and gynecology in nursing
Renaissance Architecture: A Journey from Faith to Humanism

4.4 external hashing

  • 1. File Structures SNU-OOPSLA Lab. 1 Chap12. Extendible Hashing 서울대학교 컴퓨터공학부 객체지향시스템연구실 SNU-OOPSLA-LAB 교수 김 형 주 File Structures by Folk, Zoellick and Riccardi
  • 2. File Structures SNU-OOPSLA Lab. 2 Chapter ObjectivesChapter Objectives Describe the problem solved by extendible hashing and related approaches Explain how extendible hashing works; show how it combines tries with conventional, static hashing Use the buffer, file, and index classes of previous chapters to implement extendible hashing, including deletion Review studies of extendible hashing performance Examine alternative approaches to the same problem, including dynamic hashing, linear hashing, and hashing schemes that control splitting by allowing for overflow buckets
  • 3. File Structures SNU-OOPSLA Lab. 3 ContentsContents 12.1 Introduction 12.2 How extendible hashing works 12.3 Implementation 12.4 Deletion 12.5 Extendible hashing performance 12.6 Alternative approaches
  • 4. File Structures SNU-OOPSLA Lab. 4 12.1 Introduction Dynamic files undergo a lot of growths Static hashing described in chapter 11 (direct hashing) typically worse than B-Tree for dynamic files eventually requires file reorganization Extendible hashing hashing for dynamic file Fagin, Nievergelt, Pippenger, and Strong (ACM TODS 1979)
  • 5. File Structures SNU-OOPSLA Lab. 5 Overview(1)Overview(1) Direct access (hashing) files have static size, so not suitable for files whose size is unknown in advance Dynamic file structure is desired which retains the feature of fast retrieval by primary key, and which also expands and contracts as the number of records in the file fluctuates (without reorganizing the whole file) Similar motivation! Indexed-sequential File ==> B tree Hashing ==> Extendible Hashing
  • 6. File Structures SNU-OOPSLA Lab. 6 Overview(2)Overview(2) Extendible Hashing Primary key H(key) Hashing function Directory Index Extract first d digit File pointerTable look-up
  • 7. File Structures SNU-OOPSLA Lab. 7 12.2 How Extendible Hashing works12.2 How Extendible Hashing works Idea from Tries file (radix searching) The branching factor of the tree is equal to the # of alternative symbols in each position of the key e.g.) Radix 26 trie - able, abrahms, adams, anderson, adnrews, baird Use the first n characters for branching a b b d n l r d e r able abrahms adams anderson andrews baird
  • 8. File Structures SNU-OOPSLA Lab. 8 Extendible HashingExtendible Hashing H maps keys to a fixed address space, with size the largest prime less than a power of 2 (65531 < 216 ) File pointers point to blocks of records known as buckets, where an entire bucket is read by one physical data transfer, buckets may be added to or removed from the file dynamically The d bits are used as an index in a directory array containing 2d entries, which usually resides in primary memory The value d, the directory size(2d ), and the number of buckets change automatically as the file expands and contracts
  • 9. File Structures SNU-OOPSLA Lab. 9 Extendible Hashing ExampleExtendible Hashing Example 000 001 010 011 100 101 110 111 d’=1 d’=3 d’=3 d’=2 Directory with d=3 and 4 buckets B0 B100 B101 B11 H(key)=0 H(key)=100 H(key)=101 H(key)=11 d=3
  • 10. File Structures SNU-OOPSLA Lab. 10 Turning the trie into a directoryTurning the trie into a directory Using Trie for extendible hashing (1) Use Radix 2 Trie : Keys in A : beginning with 0 Keys in B : beginning with 10 Keys in C : beginning with 11 (2) Retrieving from secondary storage the buckets containing keys, instead of individual keys A B C 0 1 0 1
  • 11. File Structures SNU-OOPSLA Lab. 11 Representation of Trie (1)Representation of Trie (1) Tree is not preferable (directory is not big) A flattened array 1. Make a complete full binary tree 2. Collapse it into the directory structure 0 1 0 1 0 1 C A B 00 01 10 11 A B C
  • 12. File Structures SNU-OOPSLA Lab. 12 Representation of Trie(2)Representation of Trie(2) Directory is a complete binary tree Directory entry : a pointer to the associated bucket Given an address beginning with the bits 10, the 210 directory entries Introduced for uniform distribution
  • 13. File Structures SNU-OOPSLA Lab. 13 Retrieve a recordRetrieve a record Steps in retrieving a record with a given key find H(given key) extract first d bits of H(given key) use this value as an index into the directory to find a pointer use this pointer to read a bucket into primary memory locate the desired record within the bucket (scan)
  • 14. File Structures SNU-OOPSLA Lab. 14 Expansion & Contraction(1)Expansion & Contraction(1) A pair of adjunct buckets with the same value of d’ which share a common value of the first d’-1 bits of H(key) can be combined if the average load < 50%, so all records would be able to fit into one bucket File contraction is the reverse of expansion; the directory can be compacted and d decremented whenever all pairs of pointers have the same values
  • 15. File Structures SNU-OOPSLA Lab. 15 Expansion & Contraction(2)Expansion & Contraction(2) 000 001 010 011 100 101 110 111 d’=2 Bucket B0 overflows, then splits into B0 and B1 B00 H(key)=00.. d’=2 B01 H(key)=01.. d’=3 B100 H(key)=100.. d’=3 B00 H(key)=101.. d’=2 B00 H(key)=11.. d=3
  • 16. File Structures SNU-OOPSLA Lab. 16 Expansion & Contraction(3)Expansion & Contraction(3) 0000 d’=2 Bucket B100 overflows, d increase to 4 B00 H(key)=00.. d’=2 B01 H(key)=01.. d’=4 B1000H(key)=1000.. d’=4 B1001H(key)=1001.. d’=3 B101 H(key)=101.. d=4 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111 d’=2 B11 H(key)=11..
  • 17. File Structures SNU-OOPSLA Lab. 17 Splitting to Handle Overflow (1)Splitting to Handle Overflow (1) When overflow occurs e.g.1) Overflowing of bucket A Split A into A and D Come to use additional unused bits No need to expand the directory 00 01 10 11 B C A D 00 01 10 11 A B C
  • 18. File Structures SNU-OOPSLA Lab. 18 Splitting to Handle Overflow(2)Splitting to Handle Overflow(2) e.g. Overflowing of bucket B Do not have additional unused bits (need to expand the directory) 1. Divide B using 3 bits of hash address 2. Make a complete full binary tree 3. Collapse it into the directory structure 00 01 10 11 A B C
  • 19. File Structures SNU-OOPSLA Lab. 19 A B C D 0 1 0 1 0 1 0 1 0 10 1 0 1 0 1 0 1 0 1 A B D C 000 001 010 011 A 100 101 110 111 C B D 1. Result of overflow of bucket B 3. Directory 2. Complete Binary Tree
  • 20. File Structures SNU-OOPSLA Lab. 20 Creating Address Function hash(KEY) Fold/Add hashing algorithm Do not MOD hashing value by address space since no fixed address space exists Output from the hash function for a number of keys bill 0000 0011 0110 1100 lee 0000 0100 0010 1000 pauline 0000 1111 0110 0101 alan 0100 1100 1010 0010 julie 0010 1110 0000 1001 mike 0000 0111 0100 1101 elizabeth 0010 1100 0110 1010 mark 0000 1010 0000 0111
  • 21. File Structures SNU-OOPSLA Lab. 21 Int Hash (char * key) { int sum = 0; int len = strlen(key); if (len % 2 == 1) len ++; // make len even for (int j = 0; j < len; j+2) sum = (sum + 100 * key[j] + key[j+1]) % 19937; return sum; } Figure 12.7 Function Hash (key) returns an integer hash value for key for a 15 bit
  • 22. File Structures SNU-OOPSLA Lab. 22 Int MakeAddress (char * key, int depth) { int retval = 0; int hashVal = Hash(key); // reverse the bits for (int j = 0; j < depth; j++) { retval = retval << 1; int lowbit = hashVal & 1; retval = retval | lowbit; hashVal = hashVal >> 1; } return retval; } Figure 12.9 Function MakeAddress(key,depth)
  • 23. File Structures SNU-OOPSLA Lab. 23 Class Bucket: protected TextIndex {protected: Bucket (Directory & dir, int maxKeys = defaultMaxKeys); int Insert (char * key, int recAddr); int Remove(char * key); Bucket * Split (); int NewRange (int & newStart, int & newEnd); int Redistribute (Bucket & newBucket); int FindBuddy (); int TryCombine (); int Combine (Bucket * buddy, int buddyIndex); int Depth; Directory & Dir; int BucketAddr; friend class Directory; friend class BucketBuffer; }; Figure 12.10 Main members of class Bucket
  • 24. File Structures SNU-OOPSLA Lab. 24 class Directory {public: Directory (…..); ~Directory(); int Open (..); int Create(…); int Close(); int Insert(…); int Delete(…); int Search(…); protected int DoubleSize(); int Collape(); int InsertBucket (….); int Find (…); int StoreBucket(…); int LoadBucket(…) ….. } Figure 12.11 Definition of class Directory
  • 25. File Structures SNU-OOPSLA Lab. 25 12.4 Deletion12.4 Deletion When to combine buckets Buddy buckets: the buckets are siblings and at the leaf level of the tree (Buddy means something like friend) e.g., B and D in page 19 are buddy buckets Examine the directory to see if we can make changes there Shrink the directory if none of the buckets requires the depth of address information that is currently available in the directory
  • 26. File Structures SNU-OOPSLA Lab. 26 Buddy BucketBuddy Bucket Given a bucket with an address uvwxy, where u, v, w, x, and y have values of either 0 or 1, the buddy bucket, if it exists, has the value uvwxz, such that z = y XOR 1 If enough keys are deleted, the contents of buddy buckets can be combined into a single bucket
  • 27. File Structures SNU-OOPSLA Lab. 27 Collapsing the DirectoryCollapsing the Directory Collapse condition If a single cell, downsizing is impossible If there is a pair of directory cells that do not both point to the same bucket, collapsing is impossible Allocating space Allocate half the size of the original Copy the bucket references shared by each cell pair to a single cell in the new directory
  • 28. File Structures SNU-OOPSLA Lab. 28 12.5 Extendible Hashing Performance12.5 Extendible Hashing Performance Time : O(1) If the directory can kept in RAM: a single access Otherwise: two accesses are necessary Space utilization of the bucket r (# of records), b (block size), N (# of Blocks) Utilization = r / bN Average utilization ==> 0.69 Space utilization for the directory How large a directory should we expect to have, given an expected number of keys? Expected value for the directory size by Flajolet(1983) Estimated directory size =3.92 / b X r(1+1/b)
  • 29. Space utilization for buckets. Utilization is periodic and fluctuating: with uniformly distributed addresses, all the buckets tend to fill up at the same time and therefore split at the same time. As buckets fill up: 90%; after a concentrated series of splits: 50%. With r the number of records and b the block size, $N \approx r / (b \ln 2)$, so Utilization $= r / (bN) \approx \ln 2 = 0.69$, an average utilization of 69%. B-tree space utilization, for comparison: normal B-tree 67%; B-tree with redistribution on insertion 85%.
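The step from the block-count estimate to the 69% figure is a one-line substitution. The derivation below assumes the estimate is read as N ≈ r / (b ln 2), with r, b, and N as defined on the slide.

```latex
\[
N \approx \frac{r}{b \ln 2}
\quad\Longrightarrow\quad
\text{Utilization} = \frac{r}{bN}
\approx \frac{r}{b \cdot \dfrac{r}{b \ln 2}}
= \ln 2 \approx 0.69
\]
```

The result does not depend on r or b, which is why the same average appears for any file size or bucket size.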
  • 30. 12.6 Alternative Approaches (1): Dynamic Hashing. Similar to extendible hashing: it uses a directory to track bucket addresses and extends the directory through the use of tries. It starts with a hash function that covers an address space of a fixed size. When overflow occurs, the bucket splits, forming the leaves of a trie that grows down from the original address node.
  • 31. Alternative Approaches (2): Dynamic Hashing. Two kinds of nodes. External node: references a data bucket. Internal node: points to two child index nodes. When a node's bucket splits, the node changes from an external node to an internal node. Two hash functions: apply the first hash function to reach a node in the original address space; if an external node is found, the search is complete; if an internal node is found, apply the second hash function.
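The search procedure with two hash functions can be sketched as below. Everything named here is an assumption for illustration: `Node`, `firstHash`, `secondHashBit`, `findBucket`, and `split` are hypothetical, the first hash is a simple modulo, and the second hash is a multiplicative scramble that yields one bit per trie level. The slides do not specify either function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// A trie node: external nodes reference a bucket, internal nodes have two children.
struct Node {
    int bucket;                      // meaningful only for external nodes
    std::unique_ptr<Node> child[2];  // both set once the node becomes internal
    explicit Node(int b = -1) : bucket(b) {}
    bool isExternal() const { return child[0] == nullptr; }
};

// Hypothetical first hash: picks a node in the fixed original address space.
std::size_t firstHash(std::uint32_t key, std::size_t addressSpace) {
    return key % addressSpace;
}

// Hypothetical second hash: yields the bit that selects a child at `level`.
int secondHashBit(std::uint32_t key, int level) {
    assert(level >= 0 && level < 32);
    std::uint32_t scrambled = key * 2654435761u;  // wraps modulo 2^32
    return static_cast<int>((scrambled >> (31 - level)) & 1u);
}

// Follow the trie from the original address node down to an external node.
int findBucket(const std::vector<std::unique_ptr<Node>>& roots, std::uint32_t key) {
    assert(!roots.empty());
    const Node* node = roots[firstHash(key, roots.size())].get();
    int level = 0;
    while (!node->isExternal()) {
        node = node->child[secondHashBit(key, level++)].get();
    }
    return node->bucket;
}

// Splitting turns an external node into an internal node with two external children.
void split(Node& node, int leftBucket, int rightBucket) {
    assert(node.isExternal());
    node.child[0].reset(new Node(leftBucket));
    node.child[1].reset(new Node(rightBucket));
    node.bucket = -1;
}
```

Because each level is a separate heap-allocated node, the directory here is a linked structure, in contrast to the flat array used by extendible hashing.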
  • 32. [Figure: growth of the index in dynamic hashing, panels (a), (b), (c). Each panel shows the original address space (nodes 1 to 4); in the later panels, split nodes grow tries beneath it, with child nodes labeled 40, 41, 410, 411, 20, and 21.]
  • 33. Dynamic Hashing vs. Extendible Hashing (1). Overflow handling: both schemes extend the hash function locally, as a binary search trie. Both schemes use a directory structure: in dynamic hashing, a linked structure; in extendible hashing, a perfect tree expressible as an array. Space utilization: the same for both schemes (69%).
  • 34. Dynamic Hashing and Extendible Hashing (2). Growth of the directory: dynamic hashing grows more slowly and gradually; extendible hashing extends the directory by doubling it. Actual size of an index node: a node in dynamic hashing is larger than a directory cell in extendible hashing (because of pointers). Page faults: dynamic hashing may incur more than one page fault (because of the linked structure of the directory); extendible hashing incurs a single page fault.
  • 35. Alternative Approaches (3): Linear Hashing. Unlike extendible hashing and dynamic hashing, linear hashing does not use a directory. The actual address space is extended one bucket at a time as buckets overflow. Because the extension of the address space does not necessarily correspond to the bucket that is overflowing, linear hashing necessarily involves the use of overflow buckets, even as the address space expands. No directory: this avoids the additional seek that the extra layer would cause. It uses more bits of the hashed value: $h_d(k)$ is the depth-d hashing function (using the function make_address).
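Address computation without a directory can be sketched as follows. This version assumes that h_d(k) is simply the low-order d bits of an already hashed key (the deck's make_address may extract bits differently) and that buckets are split in order, with `nextToSplit` marking the next bucket due to be split. The names `hd` and `linearAddress` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// h_d(k): here, the low-order d bits of the hashed key (an assumption).
std::uint32_t hd(std::uint32_t hashed, int d) {
    assert(d >= 0 && d < 32);
    return hashed & ((1u << d) - 1u);
}

// Buckets 0 .. nextToSplit-1 have already been split in this round,
// so keys that land on them are re-addressed with one more bit.
std::uint32_t linearAddress(std::uint32_t hashed, int d, std::uint32_t nextToSplit) {
    std::uint32_t address = hd(hashed, d);
    if (address < nextToSplit) {
        address = hd(hashed, d + 1);
    }
    return address;
}
```

With d = 2 and bucket 00 already split, a hashed value ending in 100 maps to the new bucket 100, one ending in 000 stays in bucket 000, and anything landing on 01, 10, or 11 still uses two bits.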
  • 36. [Figure: The growth of address space in linear hashing (1), panels (a) to (d). Buckets a, b, c, d start at addresses 00, 01, 10, 11; the address space then grows one bucket at a time with new buckets A (100), B (101), and C (110); w, x, and y are overflow buckets. (continued...)]
  • 37. [Figure: The growth of address space in linear hashing (2), panel (e). Buckets a, b, c, d, A, B, C, D at addresses 00, 01, 10, 11, 100, 101, 110, 111, with overflow bucket x.]
  • 38. Alternative Approaches (4): Approaches to Controlling Splitting. Postponing splitting increases space utilization. B-tree: redistribution rather than splitting. Hashing: placing records in chains of overflow buckets to postpone splitting. Triggering event for splitting in linear hashing: a split occurs every time any bucket overflows, but the bucket that is split is not necessarily the overflowing one. Litwin (1980): use the overall load factor of the file as the trigger instead; fewer than 2 seeks, 75% to 80% storage utilization.
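A load-factor trigger of the kind attributed to Litwin can be sketched as a single predicate. The function name, the parameters, and the choice to measure load as records divided by primary bucket capacity are assumptions of this sketch; the slides give only the idea and the resulting utilization range.

```cpp
#include <cassert>

// Hypothetical trigger: split the next bucket only when the overall load
// factor r / (b * N) of the file exceeds a chosen threshold.
bool shouldSplit(long records, long bucketCapacity, long primaryBuckets, double maxLoad) {
    assert(bucketCapacity > 0 && primaryBuckets > 0);
    double load = static_cast<double>(records) /
                  (static_cast<double>(bucketCapacity) * static_cast<double>(primaryBuckets));
    return load > maxLoad;
}
```

With a threshold of 0.8, a file of 10 buckets holding 10 records each would split once it holds 81 records, no matter which individual bucket overflowed first; records that do not fit in the meantime go to overflow chains.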
  • 39. Alternative Approaches (5): Approaches to Controlling Splitting. Postponing splitting for extendible hashing: use chained overflow buckets; this avoids doubling the directory space; 1.1 seeks, 76% to 81% storage utilization.
  • 40. Let's Review! 12.1 Introduction; 12.2 How extendible hashing works; 12.3 Implementation; 12.4 Deletion; 12.5 Extendible hashing performance; 12.6 Alternative approaches