Database MGMT - Hash Index Linear Hashing only

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
Hash-Based Indexes
Chapter 11

Introduction
• Hash-based indexes are best for equality selections; they cannot support range searches.

Static Hashing
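• The # of primary pages is fixed; they are allocated sequentially and never de-allocated. Overflow pages are added if needed.
• h(k) mod M = bucket to which the data entry with key k belongs (M = # of buckets).

[Figure: a key is passed through h, and h(key) mod M selects one of the primary bucket pages 0 ... M-1; a primary bucket page may have a chain of overflow pages.]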

Static Hashing (Contd.)
• Buckets contain data entries.
• The hash function works on the search key field of record r. It must distribute values over the range 0 ... M-1.
    • h(key) = (a * key + b) usually works well.
    • a and b are constants; a lot is known about how to tune h.
• Long overflow chains can develop and degrade performance.
    • Extendible and Linear Hashing are dynamic techniques that fix this problem.
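
The scheme above can be sketched in a few lines. This is a minimal illustration, not the textbook's implementation: the class name `StaticHashFile`, the page capacity, and the default constants `a` and `b` are all assumptions chosen for the example.

```python
class StaticHashFile:
    """Static hashing sketch: M primary pages, overflow pages chained on demand."""

    def __init__(self, num_buckets, page_capacity=4, a=31, b=7):
        self.M = num_buckets              # fixed number of primary pages
        self.capacity = page_capacity     # data entries per page (assumed)
        self.a, self.b = a, b             # illustrative constants for h
        # Each bucket is a list of pages; page 0 is the primary page.
        self.buckets = [[[]] for _ in range(num_buckets)]

    def h(self, key):
        return self.a * key + self.b

    def bucket_of(self, key):
        return self.h(key) % self.M       # h(key) mod M

    def insert(self, key):
        pages = self.buckets[self.bucket_of(key)]
        if len(pages[-1]) == self.capacity:
            pages.append([])              # last page is full: add an overflow page
        pages[-1].append(key)

    def search(self, key):
        # An equality search scans the primary page and its whole overflow chain.
        return any(key in page for page in self.buckets[self.bucket_of(key)])

    def chain_length(self, key):
        # Number of overflow pages behind the primary page of key's bucket.
        return len(self.buckets[self.bucket_of(key)]) - 1
```

Inserting many keys that map to the same bucket makes `chain_length` grow without bound, which is the performance problem the dynamic schemes address.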

Extendible Hashing
• Situation: a bucket (primary page) becomes full. Why not reorganize the file by doubling the # of buckets?
    • Reading and writing all pages is expensive!
    • Idea: use a directory of pointers to buckets; double the # of buckets by doubling the directory, splitting just the bucket that overflowed!
    • The directory is much smaller than the file, so doubling it is much cheaper. Only one page of data entries is split. No overflow page!
    • The trick lies in how the hash function is adjusted!

Example
• The directory is an array of size 4.
• To find the bucket for r, take the last `global depth' # bits of h(r); we denote r by h(r).
    • If h(r) = 5 = binary 101, it is in the bucket pointed to by 01.
• Insert: if the bucket is full, split it (allocate a new page, re-distribute).
• If necessary, double the directory. (As we will see, splitting a bucket does not always require doubling; we can tell by comparing global depth with local depth for the split bucket.)

[Figure: directory and data pages, GLOBAL DEPTH = 2]

Directory   Bucket     Local depth   Data entries
00          Bucket A   2             4* 12* 32* 16*
01          Bucket B   2             1* 5* 21* 13*
10          Bucket C   2             10*
11          Bucket D   2             15* 7* 19*
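
The lookup rule (take the last `global depth' bits of h(r)) amounts to a bit mask. A small sketch, assuming the hash value is already an integer; the function name `directory_slot` and the dictionary standing in for the directory are illustrative only:

```python
def directory_slot(hash_value, global_depth):
    """Return the directory index given by the last `global_depth` bits."""
    return hash_value & ((1 << global_depth) - 1)


# Directory of the example: global depth 2, four elements.
directory = {0b00: "Bucket A", 0b01: "Bucket B", 0b10: "Bucket C", 0b11: "Bucket D"}

# h(r) = 5 = binary 101, so the last two bits are 01.
print(directory[directory_slot(5, 2)])  # Bucket B
```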

Insert h(r)=20 (Causes Doubling)
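[Figure, before doubling: Bucket A is full, so it is split into Bucket A and Bucket A2, the `split image' of Bucket A. The directory still has 4 elements (GLOBAL DEPTH = 2).]

Bucket      Data entries
Bucket A    32* 16*
Bucket B    1* 5* 21* 13*
Bucket C    10*
Bucket D    15* 7* 19*
Bucket A2   4* 12* 20*

[Figure, after doubling: GLOBAL DEPTH = 3]

Directory   Bucket      Local depth   Data entries
000         Bucket A    3             32* 16*
001         Bucket B    2             1* 5* 21* 13*
010         Bucket C    2             10*
011         Bucket D    2             15* 7* 19*
100         Bucket A2   3             4* 12* 20*
101         Bucket B    2             1* 5* 21* 13*
110         Bucket C    2             10*
111         Bucket D    2             15* 7* 19*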

Points to Note
• 20 = binary 10100. The last 2 bits (00) tell us r belongs in A or A2. The last 3 bits are needed to tell which.
    • Global depth of directory: max # of bits needed to tell which bucket an entry belongs to.
    • Local depth of a bucket: # of bits used to determine if an entry belongs to this bucket.
• When does a bucket split cause directory doubling?
    • Before the insert, local depth of bucket = global depth. The insert causes local depth to become > global depth; the directory is doubled by copying it over and `fixing' the pointer to the split image page. (Use of least significant bits enables efficient doubling via copying of the directory!)
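
The insert and split rules can be put together in a short sketch. It is an illustration under stated assumptions, not the textbook's code: hash values are stored directly as data entries, a bucket holds 4 entries, and the names `Bucket` and `ExtendibleHash` are invented here. It does not handle the case where more than `capacity` entries share one hash value; the retry would then split forever.

```python
class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.entries = []


class ExtendibleHash:
    def __init__(self, global_depth=2, capacity=4):
        self.global_depth = global_depth
        self.capacity = capacity          # entries per bucket page (assumed)
        self.directory = [Bucket(global_depth) for _ in range(1 << global_depth)]

    def _slot(self, h):
        # Last `global_depth` bits of the hash value.
        return h & ((1 << self.global_depth) - 1)

    def insert(self, h):
        bucket = self.directory[self._slot(h)]
        if len(bucket.entries) < self.capacity:
            bucket.entries.append(h)
            return
        # Bucket is full. Doubling is needed only if local depth == global depth.
        if bucket.local_depth == self.global_depth:
            self.directory = self.directory + self.directory  # double by copying
            self.global_depth += 1
        # Split the bucket on the next more significant bit.
        bucket.local_depth += 1
        image = Bucket(bucket.local_depth)
        high_bit = 1 << (bucket.local_depth - 1)
        old_entries = bucket.entries
        bucket.entries = [e for e in old_entries if not e & high_bit]
        image.entries = [e for e in old_entries if e & high_bit]
        # `Fix' the directory pointers that should now reach the split image.
        for i, b in enumerate(self.directory):
            if b is bucket and i & high_bit:
                self.directory[i] = image
        self.insert(h)                    # retry; may split again if still full
```

Loading the example file (4, 12, 32, 16, 1, 5, 21, 13, 10, 15, 7, 19) and then inserting 20 with this sketch doubles the directory to global depth 3 and leaves 32 and 16 in the bucket at 000 and 4, 12 and 20 in its split image at 100.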

Directory Doubling
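• Why use least significant bits in directory?
    • Allows for doubling via copying!

[Figure: a directory holding the entry 6* (6 = binary 110) grows from global depth 1 to 2 to 3, shown for two conventions. Least significant bits: directory elements are labeled 0, 1, then 00, 01, 10, 11, then 000, 001, 010, 011, 100, 101, 110, 111. Most significant bits: directory elements are labeled 0, 1, then 00, 10, 01, 11, then 000, 100, 010, 110, 001, 101, 011, 111.]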
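
A short sketch of why the choice of bits matters for doubling. It models the directory as a Python list of bucket names; the function names are illustrative assumptions. With least significant bits, the new directory is the old one followed by a copy of itself. With most significant bits, each old element must be duplicated in place, so every element moves.

```python
def double_lsb(directory):
    """Directory indexed by the LAST d bits: append a copy of the old directory."""
    return directory + directory


def double_msb(directory):
    """Directory indexed by the FIRST d bits: duplicate each element in place."""
    return [bucket for bucket in directory for _ in range(2)]


old = ["A", "B", "C", "D"]          # global depth 2
print(double_lsb(old))              # ['A', 'B', 'C', 'D', 'A', 'B', 'C', 'D']
print(double_msb(old))              # ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D']
```

In both cases element i of the new directory points to the bucket the old directory gave for the same entry: old[i & 3] for least significant bits and old[i >> 1] for most significant bits.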

Comments on Extendible Hashing
• If the directory fits in memory, an equality search is answered with one disk access; else two.
    • A 100MB file with 100 bytes/rec and 4K pages contains 1,000,000 records (as data entries) and 25,000 directory elements; chances are high that the directory will fit in memory.
    • The directory grows in spurts and, if the distribution of hash values is skewed, it can grow large.
    • Multiple entries with the same hash value cause problems!
• Delete: if removal of a data entry makes a bucket empty, it can be merged with its `split image'. If each directory element points to the same bucket as its split image, the directory can be halved.
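
The figures in the sizing example can be checked with a few lines of arithmetic. The sketch assumes that 100MB means 100 * 10^6 bytes, that a 4K page is 4096 bytes, that a data entry is as large as a record, and that every bucket page is full and has exactly one directory element; none of these assumptions is spelled out on the slide.

```python
FILE_BYTES = 100 * 10**6      # assumed meaning of "100MB"
RECORD_BYTES = 100
PAGE_BYTES = 4096             # assumed meaning of "4K"

records = FILE_BYTES // RECORD_BYTES            # data entries in the file
entries_per_page = PAGE_BYTES // RECORD_BYTES   # whole entries that fit on a page
data_pages = records // entries_per_page        # full bucket pages
directory_elements = data_pages                 # one pointer per bucket page (assumed)

print(records, entries_per_page, directory_elements)  # 1000000 40 25000
```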

Linear Hashing
• This is another dynamic hashing scheme, an alternative to Extendible Hashing.
• LH handles the problem of long overflow chains without using a directory, and handles duplicates.
• Idea: use a family of hash functions h_0, h_1, h_2, ...
    • h_i(key) = h(key) mod (2^i N); N = initial # buckets
    • h is some hash function (range is not 0 to N-1)
    • If N = 2^{d_0}, for some d_0, h_i consists of applying h and looking at the last d_i bits, where d_i = d_0 + i.
    • h_{i+1} doubles the range of h_i (similar to directory doubling)
Linear Hashing (Contd.)
• The directory is avoided in LH by using overflow pages and choosing the bucket to split round-robin.
    • Splitting proceeds in `rounds'. A round ends when all N_R initial (for round R) buckets are split. Buckets 0 to Next-1 have been split; buckets Next to N_R - 1 are yet to be split.
    • The current round number is Level.
    • Search: to find the bucket for data entry r, find h_Level(r):
        • If h_Level(r) is in the range Next to N_R - 1, r belongs here.
        • Else, r could belong to bucket h_Level(r) or bucket h_Level(r) + N_R; must apply h_{Level+1}(r) to find out.
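
The search rule translates into a small address computation. A minimal sketch, assuming the hash value h(r) is an integer and that N_R = N * 2^Level; the function and parameter names are illustrative:

```python
def lh_bucket(hash_value, level, next_, n_initial):
    """Return the bucket number for a data entry in a linear hashed file."""
    n_r = n_initial * 2 ** level        # buckets at the start of this round
    bucket = hash_value % n_r           # h_Level
    if bucket < next_:                  # this bucket was already split this round
        bucket = hash_value % (2 * n_r) # h_{Level+1} picks it or its split image
    return bucket


# Level = 0, N = 4, Next = 1: only bucket 0 has been split so far.
print(lh_bucket(32, 0, 1, 4))   # 0
print(lh_bucket(44, 0, 1, 4))   # 4
print(lh_bucket(9, 0, 1, 4))    # 1
```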

Overview of LH File
• In the middle of a round.

[Figure: layout of an LH file in the middle of a round, from the first bucket to the last.
 - Buckets split in this round (0 to Next-1): if h_Level(search key value) is in this range, must use h_{Level+1}(search key value) to decide if the entry is in the `split image' bucket.
 - Next: the bucket to be split.
 - Buckets that existed at the beginning of this round: this is the range of h_Level.
 - `Split image' buckets: created (through splitting of other buckets) in this round.]

Linear Hashing (Contd.)
• Insert: find the bucket by applying h_Level / h_{Level+1}:
    • If the bucket to insert into is full:
        • Add an overflow page and insert the data entry.
        • (Maybe) split the Next bucket and increment Next.
• Any criterion can be chosen to `trigger' a split.
• Since buckets are split round-robin, long overflow chains don't develop!
• Doubling of the directory in Extendible Hashing is similar; switching of hash functions is implicit in how the # of bits examined is increased.
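
The insert rule, including the round-robin split, can be sketched as follows. The class name `LinearHashFile` and its fields are assumptions made for the example. Each bucket is modeled as one list that stands for the primary page plus its overflow chain, and the split trigger chosen here (split whenever an insert has to allocate a new overflow page) is just one of the possible criteria.

```python
class LinearHashFile:
    def __init__(self, n_initial=4, capacity=4):
        self.n0 = n_initial          # N, the initial number of buckets
        self.capacity = capacity     # entries per page (assumed)
        self.level = 0               # current round number
        self.next = 0                # next bucket to split
        self.buckets = [[] for _ in range(n_initial)]

    def _n_r(self):
        return self.n0 * 2 ** self.level      # buckets at the start of the round

    def _address(self, h):
        bucket = h % self._n_r()              # h_Level
        if bucket < self.next:                # already split: use h_{Level+1}
            bucket = h % (2 * self._n_r())
        return bucket

    def insert(self, h):
        bucket = self.buckets[self._address(h)]
        # True when every page of the bucket is full, so a new overflow page is needed.
        needs_new_page = len(bucket) >= self.capacity and len(bucket) % self.capacity == 0
        bucket.append(h)
        if needs_new_page:
            self._split()

    def _split(self):
        n_r = self._n_r()
        old_entries = self.buckets[self.next]
        # Re-distribute with h_{Level+1}; the split image becomes bucket Next + N_R.
        self.buckets[self.next] = [e for e in old_entries if e % (2 * n_r) == self.next]
        self.buckets.append([e for e in old_entries if e % (2 * n_r) != self.next])
        self.next += 1
        if self.next == n_r:                  # all N_R buckets split: round ends
            self.level += 1
            self.next = 0
```

Note that the bucket that is split is the one at Next, which is not necessarily the bucket that just overflowed.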

Example of Linear Hashing
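• On a split, h_{Level+1} is used to re-distribute entries.

[Figure, before: Level=0, N=4, Next=0. The h_1 and h_0 columns are for illustration only; the page contents are the actual contents of the linear hashed file. A data entry r with h(r)=5 appears as 5* in its primary bucket page.]

h_1   h_0   Primary page
000   00    32* 44* 36*         <- Next
001   01    9* 25* 5*
010   10    14* 18* 10* 30*
011   11    31* 35* 7* 11*

[Figure, after inserting 43*: Level=0, Next=1. Bucket 11 is full, so 43* goes to an overflow page; bucket 00 is split and 44* and 36* move to the new bucket 100.]

h_1   h_0   Primary page        Overflow page
000   00    32*
001   01    9* 25* 5*                            <- Next
010   10    14* 18* 10* 30*
011   11    31* 35* 7* 11*      43*
100   00    44* 36*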

Example: End of a Round
[Figure, before: Level=0, Next=3]

h_1   h_0   Primary page        Overflow page
000   00    32*
001   01    9* 25*
010   10    66* 18* 10* 34*
011   11    31* 35* 7* 11*      43*              <- Next
100   00    44* 36*
101   01    5* 37* 29*
110   10    14* 30* 22*

[Figure, after inserting 50*: Level=1, Next=0. Bucket 010 is full, so 50* goes to an overflow page; bucket 11 is split, which completes the round.]

h_1   h_0   Primary page        Overflow page
000   00    32*                                  <- Next
001   01    9* 25*
010   10    66* 18* 10* 34*     50*
011   11    43* 35* 11*
100   00    44* 36*
101   01    5* 37* 29*
110   10    14* 30* 22*
111   11    31* 7*

LH Described as a Variant of EH
• The two schemes are actually quite similar:
    • Begin with an EH index where the directory has N elements.
    • Use overflow pages, split buckets round-robin.
    • The first split is at bucket 0. (Imagine the directory being doubled at this point.) But elements <1,N+1>, <2,N+2>, ... are the same. So, we need only create directory element N, which now differs from element 0.
        • When bucket 1 splits, create directory element N+1, etc.
• So, the directory can double gradually. Also, primary bucket pages are created in order. If they are allocated in sequence too (so that finding the i'th is easy), we actually don't need a directory! Voila, LH.

Summary
• Hash-based indexes: best for equality searches, cannot support range searches.
• Static Hashing can lead to long overflow chains.
• Extendible Hashing avoids overflow pages by splitting a full bucket when a new data entry is to be added to it. (Duplicates may require overflow pages.)
    • A directory keeps track of buckets and doubles periodically.
    • The directory can get large with skewed data; additional I/O is needed if it does not fit in main memory.

Summary (Contd.)
• Linear Hashing avoids a directory by splitting buckets round-robin and using overflow pages.
    • Overflow chains are not likely to be long.
    • Space utilization could be lower than in Extendible Hashing, since splits are not concentrated on `dense' data areas.
        • The criterion for triggering splits can be tuned to trade off slightly longer chains for better space utilization.
• For hash-based indexes, a skewed data distribution is one in which the hash values of data entries are not uniformly distributed!
