4.4 external hashing

File Structures SNU-OOPSLA Lab.
Chap12. Extendible Hashing
Seoul National University, School of Computer Science and Engineering
Object-Oriented Systems Laboratory
SNU-OOPSLA-LAB
Professor Hyoung-Joo Kim
File Structures by Folk, Zoellick and Riccardi

Chapter Objectives
Describe the problem solved by extendible hashing and related
approaches
Explain how extendible hashing works; show how it combines
tries with conventional, static hashing
Use the buffer, file, and index classes of previous chapters to
implement extendible hashing, including deletion
Review studies of extendible hashing performance
Examine alternative approaches to the same problem, including
dynamic hashing, linear hashing, and hashing schemes that
control splitting by allowing for overflow buckets

Contents
12.1 Introduction
12.2 How extendible hashing works
12.3 Implementation
12.4 Deletion
12.5 Extendible hashing performance
12.6 Alternative approaches

12.1 Introduction
Dynamic files
undergo a lot of growth
Static hashing
described in chapter 11 (direct hashing)
typically worse than B-Tree for dynamic files
eventually requires file reorganization
Extendible hashing
hashing for dynamic files
Fagin, Nievergelt, Pippenger, and Strong (ACM TODS 1979)

Overview(1)
Direct access (hashing) files have a static size, so they are not
suitable for files whose size is unknown in advance
A dynamic file structure is desired which retains the feature
of fast retrieval by primary key, and which also expands
and contracts as the number of records in the file
fluctuates (without reorganizing the whole file)
Similar motivation!
Indexed-sequential File ==> B tree
Hashing ==> Extendible Hashing

Overview(2)
Extendible Hashing
[Figure: primary key -> hashing function H(key) -> extract first d bits -> table look-up in the directory (index) -> file pointer]

12.2 How Extendible Hashing works
Idea from tries (radix searching)
The branching factor of the tree is equal to the # of
alternative symbols in each position of the key
e.g.) Radix 26 trie - able, abrahms, adams, anderson,
andrews, baird
Use the first n characters for branching
[Figure: radix 26 trie. The root branches on a and b; under a the branches are b, d, n; under ab: l (able) and r (abrahms); under and: e (anderson) and r (andrews); the remaining leaves are adams and baird]

Extendible Hashing
H maps keys to a fixed address space, with size the largest
prime less than a power of 2 (65521 < 2^16)
File pointers point to blocks of records known as buckets,
where an entire bucket is read by one physical data transfer;
buckets may be added to or removed from the file dynamically
The first d bits of H(key) are used as an index in a directory array containing
2^d entries, which usually resides in primary memory
The value d, the directory size (2^d), and the number of buckets
change automatically as the file expands and contracts

Extendible Hashing Example
Directory with d=3 and 4 buckets

Directory entries   Bucket   d’   Contents
000, 001, 010, 011  B0       1    H(key)=0..
100                 B100     3    H(key)=100..
101                 B101     3    H(key)=101..
110, 111            B11      2    H(key)=11..

Turning the trie into a directory
Using Trie for extendible hashing
(1) Use Radix 2 Trie :
Keys in A : beginning with 0
Keys in B : beginning with 10
Keys in C : beginning with 11
(2) Retrieve from secondary storage the buckets containing
keys, instead of individual keys
[Figure: radix 2 trie with branch 0 leading to bucket A, branches 1, 0 to bucket B, and branches 1, 1 to bucket C]

Representation of Trie (1)
Tree is not preferable (directory is not big)
A flattened array
1. Make a complete full binary tree
2. Collapse it into the directory structure
[Figure: the complete binary tree (A under both 00 and 01, B under 10, C under 11) collapsed into the directory 00 -> A, 01 -> A, 10 -> B, 11 -> C]

Representation of Trie(2)
Directory is a complete binary tree collapsed into an array
Directory entry : a pointer to the associated bucket
Given an address beginning with the bits 10, directory entry 10 (binary, i.e. index 2)
gives the pointer to the associated bucket
A hash function is introduced so that addresses are uniformly distributed

Retrieve a record
Steps in retrieving a record with a given key
find H(given key)
extract first d bits of H(given key)
use this value as an index into the directory to find a pointer
use this pointer to read a bucket into primary memory
locate the desired record within the bucket (scan)
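
The retrieval steps above can be sketched as follows. This is a minimal in-memory sketch under stated assumptions, not the textbook's implementation: `Bucket`, `Directory`, `FirstBits`, and `Search` are hypothetical simplified names, buckets live in memory instead of on disk, the hash value is assumed to be 16 bits wide, and the caller supplies H(key) (step 1).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical simplified bucket: (key, record address) pairs.
struct Bucket {
    std::vector<std::pair<std::string, int>> records;
};

// Hypothetical in-memory directory with 2^depth cells.
// Several cells may point to the same bucket.
struct Directory {
    int depth;                                   // d
    std::vector<std::shared_ptr<Bucket>> cells;  // bucket pointers
};

// Extract the first d bits of a 16-bit hash value.
inline std::size_t FirstBits(std::uint16_t hashValue, int d) {
    assert(d >= 0 && d <= 16);
    if (d == 0) return 0;
    return static_cast<std::uint32_t>(hashValue) >> (16 - d);
}

// Returns the record address for key, or -1 if the key is absent.
inline int Search(const Directory& dir, std::uint16_t hashValue,
                  const std::string& key) {
    std::size_t index = FirstBits(hashValue, dir.depth);  // first d bits
    const Bucket& bucket = *dir.cells[index];             // follow the pointer
    for (const auto& rec : bucket.records)                // scan the bucket
        if (rec.first == key) return rec.second;
    return -1;
}
```

In a disk-based version the line that dereferences `dir.cells[index]` would be the single bucket read; everything before it works on the in-memory directory.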

Expansion & Contraction(1)
A pair of adjacent buckets with the same value of d’ which
share a common value of the first d’-1 bits of H(key) can
be combined if their average load is < 50%, so that all records
fit into one bucket
File contraction is the reverse of expansion; the directory
can be compacted and d decremented whenever every pair
of adjacent pointers has the same value

Bucket B0 overflows, then splits into B00 and B01 (d=3)

Directory entries   Bucket   d’   Contents
000, 001            B00      2    H(key)=00..
010, 011            B01      2    H(key)=01..
100                 B100     3    H(key)=100..
101                 B101     3    H(key)=101..
110, 111            B11      2    H(key)=11..
Bucket B100 overflows, d increases to 4 (d=4)

Directory entries        Bucket   d’   Contents
0000, 0001, 0010, 0011   B00      2    H(key)=00..
0100, 0101, 0110, 0111   B01      2    H(key)=01..
1000                     B1000    4    H(key)=1000..
1001                     B1001    4    H(key)=1001..
1010, 1011               B101     3    H(key)=101..
1100, 1101, 1110, 1111   B11      2    H(key)=11..

Splitting to Handle Overflow (1)
When overflow occurs
e.g.1) Overflowing of bucket A
Split A into A and D
Make use of an additional, previously unused bit
No need to expand the directory
Directory before the split: 00 -> A, 01 -> A, 10 -> B, 11 -> C
Directory after the split:  00 -> A, 01 -> D, 10 -> B, 11 -> C

Splitting to Handle Overflow(2)
e.g.2) Overflowing of bucket B
Do not have additional unused bits
(need to expand the directory)
1. Divide B using 3 bits of hash address
2. Make a complete full binary tree
3. Collapse it into the directory structure
Directory before the split: 00 -> A, 01 -> A, 10 -> B, 11 -> C
1. Result of overflow of bucket B: trie with leaves A (0), B (100), D (101), C (11)
2. Complete binary tree: three levels deep, with A under 000 to 011 and C under 110 and 111
3. Directory: 000, 001, 010, 011 -> A; 100 -> B; 101 -> D; 110, 111 -> C
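
The two overflow cases (split using a spare bit, or double the directory first) can be sketched in memory as follows. This is an illustrative sketch under assumptions, not the textbook's `Bucket`/`Directory` classes: the names are hypothetical, only 16-bit hash values are stored, the directory is indexed by the first d bits, and the sketch simply asserts if all 16 bits are used up.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

struct Bucket {
    int depth;                        // d': address bits this bucket uses
    std::vector<std::uint16_t> keys;  // hash values only, for brevity
};

struct Directory {
    int depth = 0;                               // d
    std::size_t capacity;                        // max keys per bucket
    std::vector<std::shared_ptr<Bucket>> cells;  // 2^depth cells

    explicit Directory(std::size_t cap) : capacity(cap) {
        assert(cap > 0);
        cells.push_back(std::make_shared<Bucket>(Bucket{0, {}}));
    }

    // First `depth` bits of a 16-bit hash value.
    std::size_t Index(std::uint16_t h) const {
        return depth == 0 ? 0 : static_cast<std::uint32_t>(h) >> (16 - depth);
    }

    // Old cell i becomes new cells 2i and 2i+1, both pointing to the same bucket.
    void DoubleSize() {
        std::vector<std::shared_ptr<Bucket>> bigger(cells.size() * 2);
        for (std::size_t i = 0; i < cells.size(); ++i)
            bigger[2 * i] = bigger[2 * i + 1] = cells[i];
        cells = std::move(bigger);
        ++depth;
    }

    void Insert(std::uint16_t h) {
        std::shared_ptr<Bucket> b = cells[Index(h)];
        if (b->keys.size() < capacity) { b->keys.push_back(h); return; }
        assert(b->depth < 16);                // sketch: no more hash bits to use
        if (b->depth == depth) DoubleSize();  // no unused bit: expand directory
        ++b->depth;                           // b now uses one more bit
        auto sibling = std::make_shared<Bucket>(Bucket{b->depth, {}});
        // The cells that pointed to b form one contiguous run; the upper
        // half of that run is redirected to the new sibling bucket.
        int shift = depth - b->depth;
        std::size_t start = (Index(h) >> (shift + 1)) << (shift + 1);
        std::size_t half = std::size_t{1} << shift;
        for (std::size_t i = start + half; i < start + 2 * half; ++i)
            cells[i] = sibling;
        // Redistribute the old keys between b and its sibling.
        std::vector<std::uint16_t> old;
        old.swap(b->keys);
        for (std::uint16_t k : old) cells[Index(k)]->keys.push_back(k);
        Insert(h);  // retry; splits again if all old keys landed on one side
    }
};
```

With a capacity of 2, inserting 0x0000, 0x8000, 0xC000, 0xA000 doubles the directory twice and leaves cells 00 and 01 sharing one bucket; a later overflow of that shared bucket splits it without growing the directory.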

Creating Address
Function hash(KEY)
Fold/Add hashing algorithm
Do not MOD hashing value by address space since no fixed
address space exists
Output from the hash function for a number of keys
bill       0000 0011 0110 1100
lee        0000 0100 0010 1000
pauline    0000 1111 0110 0101
alan       0100 1100 1010 0010
julie      0010 1110 0000 1001
mike       0000 0111 0100 1101
elizabeth  0010 1100 0110 1010
mark       0000 1010 0000 0111

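#include <cstring>   // strlen

int Hash (char * key)
{
    int sum = 0;
    int len = strlen(key);
    if (len % 2 == 1) len ++; // make len even; the extra character read is the terminating '\0'
    for (int j = 0; j < len; j += 2) // fold two characters at a time
        sum = (sum + 100 * key[j] + key[j+1]) % 19937;
    return sum;
}
Figure 12.7 Function Hash (key) returns an integer hash value for key
(the result is below 19937, so it fits in 15 bits)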

int MakeAddress (char * key, int depth)
{
    int retval = 0;
    int hashVal = Hash(key);
    // reverse the bits: the low-order depth bits of hashVal,
    // taken lowest first, become the address from left to right
    for (int j = 0; j < depth; j++)
    {
        retval = retval << 1;
        int lowbit = hashVal & 1;
        retval = retval | lowbit;
        hashVal = hashVal >> 1;
    }
    return retval;
}
Figure 12.9 Function MakeAddress(key,depth)
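
As a check on the two functions, the following self-contained sketch repeats Hash and MakeAddress in compilable form and adds a hypothetical helper, `ToBits`, that prints a value as binary digits. `ToBits` and the `const char*` parameter type are assumptions made for the example; the arithmetic is the Fold/Add scheme shown in the figures.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Fold/Add hash: combine the key two characters at a time.
int Hash(const char* key) {
    int sum = 0;
    int len = static_cast<int>(std::strlen(key));
    if (len % 2 == 1) len++;  // the extra character read is the '\0' terminator
    for (int j = 0; j < len; j += 2)
        sum = (sum + 100 * key[j] + key[j + 1]) % 19937;
    return sum;
}

// Take the low-order `depth` bits of the hash value in reversed order.
int MakeAddress(const char* key, int depth) {
    assert(depth >= 0);
    int retval = 0;
    int hashVal = Hash(key);
    for (int j = 0; j < depth; j++) {
        retval = (retval << 1) | (hashVal & 1);
        hashVal >>= 1;
    }
    return retval;
}

// Hypothetical helper: the low `width` bits of value as a string of 0s and 1s.
std::string ToBits(int value, int width) {
    assert(width >= 0);
    std::string bits(width, '0');
    for (int i = width - 1; i >= 0; --i) {
        if (value & 1) bits[i] = '1';
        value >>= 1;
    }
    return bits;
}
```

For the key bill, `ToBits(Hash("bill"), 16)` gives 0000001101101100, matching the table of hash outputs, and `MakeAddress("bill", 3)` gives 1 (binary 001), because the three low-order bits 1, 0, 0 are read lowest first.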

class Bucket: protected TextIndex
{protected:
    Bucket (Directory & dir, int maxKeys = defaultMaxKeys);
    int Insert (char * key, int recAddr);
    int Remove(char * key);
    Bucket * Split ();
    int NewRange (int & newStart, int & newEnd);
    int Redistribute (Bucket & newBucket);
    int FindBuddy ();
    int TryCombine ();
    int Combine (Bucket * buddy, int buddyIndex);
    int Depth;
    Directory & Dir;
    int BucketAddr;
    friend class Directory;
    friend class BucketBuffer;
};
Figure 12.10 Main members of class Bucket

class Directory
{public:
    Directory (…..); ~Directory();
    int Open (..); int Create(…); int Close();
    int Insert(…); int Delete(…); int Search(…);
protected:
    int DoubleSize();
    int Collapse();
    int InsertBucket (….);
    int Find (…);
    int StoreBucket(…);
    int LoadBucket(…);
    …..
};
Figure 12.11 Definition of class Directory

12.4 Deletion
When to combine buckets
Buddy buckets: the buckets are siblings and at the leaf level
of the tree (buddy means something like friend)
e.g., B and D in the directory built in "Splitting to Handle Overflow(2)" are buddy buckets
Examine the directory to see if we can make changes
there
Shrink the directory if none of the buckets requires the depth
of address information that is currently available in the
directory

Buddy Bucket
Given a bucket with an address uvwxy, where u,
v, w, x, and y have values of either 0 or 1, the
buddy bucket, if it exists, has the value uvwxz,
such that
z = y XOR 1
If enough keys are deleted, the contents of buddy
buckets can be combined into a single bucket
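
The buddy rule can be sketched as a small function. This is an illustrative sketch, not the class member FindBuddy itself: `FindBuddyIndex` and its parameters are hypothetical names, and it assumes that only a bucket using all d directory bits is a candidate, since a bucket shared by several cells has no single-cell buddy.

```cpp
#include <cassert>

// Returns the directory index of the buddy bucket, or -1 if there is none.
// bucketAddress: the bucket's address bits (uvwxy) read as an integer.
int FindBuddyIndex(int bucketAddress, int bucketDepth, int directoryDepth) {
    assert(bucketDepth >= 0 && bucketDepth <= directoryDepth);
    if (directoryDepth == 0) return -1;           // single cell: no buddy
    if (bucketDepth < directoryDepth) return -1;  // bucket spans several cells
    return bucketAddress ^ 1;                     // flip the last bit: z = y XOR 1
}
```

For example, with d = 3 the bucket at 100 (index 4) has its buddy at 101 (index 5), and vice versa.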

Collapsing the Directory
Collapse condition
If the directory has a single cell, downsizing is impossible
If there is a pair of directory cells that do not both point to the
same bucket, collapsing is impossible
Allocating space
Allocate half the size of the original
Copy the bucket references shared by each cell pair to a single
cell in the new directory
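
The collapse condition and the copy step can be sketched as follows. This is a simplified illustration, not the Directory class's own method: the directory is modeled as a vector of integer bucket ids (a hypothetical representation), its size is assumed to be a power of 2, and cell pairs are taken to be the neighbors (2i, 2i+1).

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Halves `cells` and returns true if every cell pair shares one bucket;
// otherwise leaves `cells` unchanged and returns false.
bool Collapse(std::vector<int>& cells) {
    assert(!cells.empty());
    if (cells.size() == 1) return false;  // single cell: cannot shrink
    for (std::size_t i = 0; i < cells.size(); i += 2)
        if (cells[i] != cells[i + 1]) return false;  // a pair differs
    std::vector<int> smaller(cells.size() / 2);      // half the original size
    for (std::size_t i = 0; i < smaller.size(); ++i)
        smaller[i] = cells[2 * i];                   // one reference per pair
    cells = std::move(smaller);
    return true;
}
```

A caller would also decrement the directory depth d after a successful collapse, and could call the function repeatedly until it returns false.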

12.5 Extendible Hashing Performance
Time : O(1)
If the directory can be kept in RAM: a single access
Otherwise: two accesses are necessary
Space utilization of the bucket
r (# of records), b (block size), N (# of blocks)
Utilization = r / (bN)
Average utilization ==> 0.69
Space utilization for the directory
How large a directory should we expect to have,
given an expected number of keys?
Expected value for the directory size by Flajolet (1983)
Estimated directory size = (3.92 / b) * r^(1+1/b)
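
The two formulas can be evaluated directly. The sketch below reads the directory estimate as (3.92 / b) * r^(1+1/b), with r the number of records and b the bucket size in records; the function names are hypothetical and the values it returns are only as good as that estimate.

```cpp
#include <cassert>
#include <cmath>

// Estimated number of directory cells for r records and buckets of b records.
double EstimatedDirectorySize(double r, double b) {
    assert(r > 0 && b > 0);
    return (3.92 / b) * std::pow(r, 1.0 + 1.0 / b);
}

// Bucket space utilization: r records stored in N buckets of b records each.
double Utilization(double r, double b, double N) {
    assert(b > 0 && N > 0);
    return r / (b * N);
}
```

Because the exponent 1 + 1/b approaches 1 as b grows, the estimate grows almost linearly in r for large buckets and noticeably faster than linearly for small ones.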

Space utilization for buckets
Periodic and fluctuating
With uniformly distributed addresses, all the buckets tend to fill up at the
same time -> split at the same time
As buckets fill up : utilization approaches 90%
After a concentrated series of splits : about 50%
r : # of records , b : block size , N : # of blocks
N ~= r / (b ln 2)
Utilization = r / (bN) ~= ln 2 = 0.69
Average utilization of 69%
B tree space utilization
Normal B-tree : 67%, B-tree with redistribution in insertion : 85 %

12.6 Alternative Approaches(1): Dynamic Hashing
Similar to extendible hashing
Use a directory to track bucket addresses
Extend the directory through the use of tries
Start with a hash function that covers an address space of
a fixed size
When overflow occurs
the overflowing bucket splits; the new buckets form the leaves of a trie
that grows down from the original address node

Alternative Approaches(2): Dynamic Hashing
Two kinds of nodes
External node: references a data bucket
Internal node: points to two child index nodes
When a node's bucket splits, the node changes from an external
node to an internal node with two children
Two hash functions
Apply the first hash function to reach a node in the original address space
if an external node is found : search is completed
if an internal node is found : apply the second hash function to decide which branches to follow
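
The two node kinds and the two-step search in dynamic hashing can be sketched as below. This is a rough illustration under assumptions: `IndexNode`, `SplitNode`, and `FindBucket` are hypothetical names, buckets are plain integer ids, and the output of the second hash function is modeled as an integer whose bits are consumed lowest first, one per trie level.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

struct IndexNode {
    int bucket = -1;                      // meaningful only for external nodes
    std::unique_ptr<IndexNode> child[2];  // both set for internal nodes
    bool IsExternal() const { return !child[0]; }
};

// Turn an external node into an internal node with two external children.
void SplitNode(IndexNode& node, int bucketForZero, int bucketForOne) {
    assert(node.IsExternal());
    node.child[0].reset(new IndexNode);
    node.child[1].reset(new IndexNode);
    node.child[0]->bucket = bucketForZero;
    node.child[1]->bucket = bucketForOne;
    node.bucket = -1;
}

// h1: first hash value, an index into the original address space (roots).
// bits: second hash value, steering the walk down the trie.
int FindBucket(const std::vector<IndexNode>& roots, std::size_t h1,
               std::uint32_t bits) {
    assert(h1 < roots.size());
    const IndexNode* node = &roots[h1];
    while (!node->IsExternal()) {  // internal node: follow one more bit
        node = node->child[bits & 1u].get();
        bits >>= 1;
    }
    return node->bucket;           // external node: search is completed
}
```

Unlike the extendible hashing directory, the index here is a linked structure, so only the overflowing node grows and the rest of the original address space is untouched.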

[Figure: growth of the index in dynamic hashing, panels (a) to (c). The original address space holds nodes 1 to 4; overflowing nodes are replaced by child nodes, e.g. 4 by 40 and 41, 41 by 410 and 411, and 2 by 20 and 21]

Dynamic Hashing vs. Extendible Hashing(1)
Overflow handling
Both schemes extend the hash function locally, as a binary search
trie
Both schemes use a directory structure
Dynamic hashing: a linked structure
Extendible hashing: perfect tree expressible as an array
Space Utilization
The same for both schemes (space utilization : 69%)

Dynamic Hashing vs. Extendible Hashing(2)
Growth of directory
Dynamic hashing: slower, more gradual growth
Extendible hashing: extend directory by doubling it
Actual size of an index node
An index node in dynamic hashing is larger than a directory cell in extendible
hashing (because of pointers)
Page fault
Dynamic hashing: more than one page fault (with linked structure
for the directory)
Extendible hashing: single page fault

Alternative Approaches(3): Linear Hashing
Unlike extendible hashing and dynamic hashing, linear hashing does
not use a directory.
The actual address space is extended one bucket at a time as
buckets overflow
Because the extension of the address space does not necessarily
correspond to the bucket that is overflowing,
linear hashing necessarily involves the use of overflow buckets, even
as the address space expands
No directories: avoids the additional seek resulting from the additional layer
Use more bits of hashed value
h_d(k) : depth d hashing function (using function make_address)
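
How linear hashing picks between h_d(k) and h_(d+1)(k) can be sketched as follows. This is a sketch of the usual rule under assumptions, not code from the slides: `HashDepth` takes the low-order d bits of a precomputed hash value instead of calling make_address, `LinearAddress` is a hypothetical name, and n is the address of the next bucket to be split in the current round (buckets 0 to n-1 have already been split).

```cpp
#include <cassert>
#include <cstdint>

// h_d(k): the low-order d bits of the hash value.
inline std::uint32_t HashDepth(std::uint32_t hashValue, int d) {
    assert(d >= 0 && d < 32);
    return hashValue & ((std::uint32_t{1} << d) - 1);
}

// Bucket address for a key while the file grows from 2^d to 2^(d+1) buckets.
inline std::uint32_t LinearAddress(std::uint32_t hashValue, int d,
                                   std::uint32_t n) {
    assert(d >= 0 && d < 31);
    assert(n < (std::uint32_t{1} << d));
    std::uint32_t a = HashDepth(hashValue, d);
    if (a >= n) return a;                // bucket not yet split: d bits suffice
    return HashDepth(hashValue, d + 1);  // already split: use one more bit
}
```

With d = 2 and n = 1 (bucket 00 already split into 000 and 100), a hash value ending in 100 goes to bucket 100, one ending in 000 stays in bucket 000, and one ending in 01 still uses the two-bit address 01.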

[Figure: The growth of address space in linear hashing
(a) buckets a b c d at addresses 00 01 10 11
(b) buckets a b c d A at addresses 000 01 10 11 100, with overflow bucket w
(c) buckets a b c d A B at addresses 00 01 10 11 100 101, with overflow bucket x
(d) buckets a b c d A B C at addresses 00 01 10 11 100 101 110, with overflow buckets x and y
(e) buckets a b c d A B C D at addresses 00 01 10 11 100 101 110 111, with overflow bucket x]

Alternative Approaches(4): Approaches to Controlling Splitting
Postpone splitting: increase space utilization
B-Tree: redistribution rather than splitting
Hashing: placing records in chains of overflow buckets to
postpone splitting
Triggering event for splitting
Linear hashing
A split is triggered every time any bucket overflows
The bucket that splits is not necessarily the overflowing bucket
Litwin (1980): use the overall load factor of the file as the trigger
Below 2 seeks on average, 75% ~ 80% storage utilization

Alternative Approaches(5): Approaches to Controlling Splitting
Postpone splitting for extendible hashing
Use chained overflow buckets
Avoid doubling directory space
1.1 seeks, 76% ~ 81% storage utilization

Let’s Review !!!
12.1 Introduction
12.2 How extendible hashing works
12.3 Implementation
12.4 Deletion
12.5 Extendible hashing performance
12.6 Alternative approaches
