Ch 9. Storing Data: Disks and Files
- Heap File Structure -
Sang-Won Lee
http://icc.skku.ac.kr/~swlee
SKKU VLDB Lab. & SOS
( http://vldb.skku.ac.kr/ )
2
Contents
9.0 Overview
9.1 Memory Hierarchy
9.2 RAID (Redundant Array of Independent Disks)
9.3 Disk Space Management
9.4 Buffer Manager
9.5 Files of Records
9.6 Page Format
9.7 Record Format
3
Memory Hierarchy
(Top of the hierarchy: smaller, faster, more expensive, volatile. Bottom: bigger, slower, cheaper, non-volatile.)
 Main memory (RAM) for currently used data.
 Disk for the main database (secondary storage).
 Tapes for archiving older versions of the data (tertiary storage).
 WHY A MEMORY HIERARCHY?
 What if an ideal storage medium appeared: fast, cheap, large, and non-volatile? PCM, MRAM, FeRAM?
4
Jim Gray’s Storage Latency Analogy:
How Far Away is the Data?
Relative access latency (registers = 1) and how far away the data is, in Gray's analogy:
Registers: 1 (My Head, 1 min)
On Chip Cache: 2 (This Room)
On Board Cache: 10 (This Hotel, 10 min)
Memory: 100 (Sacramento, 1.5 hr)
Disk: 10^6 (Pluto, 2 Years)
Tape / Optical Robot: 10^9 (Andromeda, 2,000 Years)
5
Disks and Files
 DBMS stores information on (hard) disks.
− Electronic (CPU, DRAM) vs. Mechanical (harddisk)
 This has major implications for DBMS design!
− READ: transfer data from disk to main memory (RAM).
− WRITE: transfer data from RAM to disk.
− Both are expensive operations, relative to in-memory operations, so
must be planned carefully!
 DRAM: ~ 10 ns
 Harddisk: ~ 10ms
 SSD: 80us ~ 10ms
6
Disks
 Secondary storage device of choice.
 Main advantage over tapes: random access vs. sequential.
− Tapes deteriorate over time
 Data is stored and retrieved in units of disk blocks (pages).
 Unlike RAM, time to retrieve a disk page varies depending upon
location on disk.
− Thus, relative placement of pages on disk has big impact on DB
performance!
 e.g. adjacent allocation of the pages from the same tables.
− We need to optimize both data placement and access
 e.g. elevator disk scheduling algorithm
7
Anatomy of a Disk
Arm assembly
 The platters spin
 e.g. 5400 / 7200 / 15K rpm
 The arm assembly is moved in or out to position a head on a desired track; the tracks under the heads form a cylinder
 Mechanical storage -> low IOPS
 Only one head reads/writes at any one
time.
 Parallelism degree: 1
 Block size is a multiple of sector size
 Update-in-place: poisoned apple
 No atomic write
 Fsync for ordering / durability
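Since the drive and the OS page cache may delay and reorder writes, a DBMS forces critical data (e.g., log records) to stable storage with fsync. A minimal sketch in Python; the file name and record are made up for illustration:

import os

def durable_append(path, record):
    """Append a record and force it to stable storage before returning."""
    with open(path, "ab") as f:
        f.write(record)
        f.flush()              # user-space buffer -> OS page cache
        os.fsync(f.fileno())   # OS page cache -> the device, before we proceed

# e.g., a commit record must be durable before the commit is acknowledged
durable_append("redo.log", b"COMMIT txn 42\n")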
8
Accessing a Disk Page
 Time to access (read/write) a disk block:
− seek time (moving arms to position disk head on track)
− rotational delay (waiting for block to rotate under head)
− transfer time (actually moving data to/from disk surface)
 Seek time and rotational delay dominate.
− Seek time: about 1 to 20msec
− Rotational delay: from 0 to 10msec
− Transfer time: about 1ms per 4KB page
 Key to lower I/O cost: reduce seek/rotation delays!
− E.g. disk scheduling algorithm in OS, Linux 4 I/O schedulers
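As a rough worked example (assuming a 7200 rpm disk with a 9 ms average seek and the ~1 ms per 4KB transfer time above):

T_access = T_seek + T_rotation + T_transfer ≈ 9 ms + (1/2) x (60 s / 7200) + 1 ms ≈ 9 + 4.17 + 1 ≈ 14 ms

So such a disk sustains only on the order of 1000 / 14 ≈ 70 random 4KB page accesses per second, which is why reducing seek and rotational delays matters so much.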
9
Arranging Pages on Disk
 `Next’ block concept:
− Blocks on same track, followed by
− Blocks on same cylinder, followed by
− Blocks on adjacent cylinder
 Blocks in a file should be arranged sequentially on disk (by `next’),
to minimize seek and rotational delay.
 Disk fragmentation problem
− Is this still problematic in flash storage?
10
Table, Insertions, Heap Files
CREATE TABLE TEST (a int, b int, c varchar2(650));
/* Insert 1M tuples into TEST table (approximately 664 bytes per
tuple) */
BEGIN
FOR i IN 1..1000000 LOOP
INSERT INTO TEST (a, b, c) values (i, i, rpad('X', 650, 'X'));
END LOOP;
END;
/*
Page = 8KB
10 tuples / page
100,000 pages in total
TEST table = 800MB
*/
[Figure: TEST SEGMENT (heap file) = Data Page 1, Data Page 2, ..., Data Page i, ..., Data Page 100,000]
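A quick check of the numbers in the comment above: 1,000,000 tuples / 10 tuples per page = 100,000 pages, and 100,000 pages x 8 KB = 800 MB for the TEST segment. (The raw fit would be 8192 / 664 ≈ 12 tuples per page; the 10 tuples/page figure also accounts for per-block overhead and reserved free space, e.g. Oracle's default PCTFREE, an assumption here.)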
11
OLAP vs. OLTP
 On-Line Analytical vs. Transactional Processing
SQL> SELECT SUM(b) FROM TEST;
SUM(B)
----------
5.0000E+11
Execution Plan
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 5 | 22053 (1)| 00:04:25 |
| 1 | SORT AGGREGATE | | 1 | 5 | | |
| 2 | TABLE ACCESS FULL| TEST | 996K| 4865K| 22053 (1)| 00:04:25 |
---------------------------------------------------------------------------
Statistics
----------------------------------------------------------
179 recursive calls
0 db block gets
100152 consistent gets
100112 physical reads
…..
1 rows processed
[Figure: TEST SEGMENT (heap file), Data Pages 1 .. 100,000]
12
OLAP vs. OLTP
SQL> SELECT B FROM TEST WHERE A = 500000;
B
----------
500000
Execution Plan
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 5 | 22053 (1)| 00:04:25 |
| 1 | SORT AGGREGATE | | 1 | 5 | | |
| 2 | TABLE ACCESS FULL| TEST | 996K| 4865K| 22053 (1)| 00:04:25 |
---------------------------------------------------------------------------
Statistics
----------------------------------------------------------
179 recursive calls
0 db block gets
100152 consistent gets
100112 physical reads
…..
1 rows processed
[Figure: TEST SEGMENT (heap file), Data Pages 1 .. 100,000]
 Point or Range Query
13
OLAP vs. OLTP
CREATE INDEX TEST_A ON TEST(A);
[Figure: TEST SEGMENT (heap file), Data Pages 1 .. 100,000]
SELECT B /* point query */
FROM TEST
WHERE A = 500000;
SELECT B /* range query */
FROM TEST
WHERE A BETWEEN 50001 and 50101;
[Figure: B-tree index on TEST(A). A search for key 500000 finds the index entry (500000, (50000, 10)), i.e. Block #: 50000, Slot #: 10, which points to the row (500000, 500000, 'XX…XX') in the heap file.]
Cost:
 Full Table Scan: 100,000 Block Accesses
 Index: (3~4) + Data Block Access
− Point query: 1
− Range queries: depending on range
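A rough cost comparison, assuming a 3~4 level B-tree and TEST rows that are not clustered on A:
Point query (A = 500000): about 3~4 index block accesses plus 1 data block access, versus 100,000 block accesses for a full table scan.
Range query (A BETWEEN 50001 AND 50101): about 3~4 index block accesses to reach the first matching leaf entry, a short leaf scan, plus up to about 101 data block accesses (one per matching row if the rows are scattered over the heap file, fewer if they happen to share blocks), still orders of magnitude below a full scan.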
14
OLAP vs. OLTP
SQL> SELECT B FROM TEST WHERE A = 500000;
B
----------
500000
Execution Plan
--------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 10 | 4 (0)| 00:00:01 |
| 1 | TABLE ACCESS BY INDEX ROWID| TEST | 1 | 10 | 4 (0)| 00:00:01 |
|* 2 | INDEX RANGE SCAN | TEST_A | 1 | | 3 (0)| 00:00:01 |
--------------------------------------------------------------------------------------
Statistics
----------------------------------------------------------
…..
5 consistent gets
4 physical reads
…..
1 rows processed
 Index-based Table Access
15
OLTP: TPC-A/B/C Benchmark
exec sql begin declare section;
long Aid, Bid, Tid, delta, Abalance;
exec sql end declare section;
DCApplication()
{ read input msg;
exec sql begin work;
exec sql update accounts set Abalance = Abalance + :delta where Aid = :Aid;
exec sql select Abalance into :Abalance from accounts where Aid = :Aid;
exec sql update tellers set Tbalance = Tbalance + :delta where Tid = :Tid;
exec sql update branches set Bbalance = Bbalance + :delta where Bid = :Bid;
exec sql insert into history(Tid, Bid, Aid, delta, time) values (:Tid, :Bid, :Aid, :delta, CURRENT);
send output msg;
exec sql commit work; }
From Gray’s Presentation
Transaction =
Sequence of Reads and Writes
16
OLTP System: Architecture
[Figure: transactions 1 … N issue SQLs (select / insert / update / delete / commit operations) that become logical read/writes against the database buffer; the buffer in turn performs physical read/writes against the database on disk.]
17
[Figure 1.3 Architecture of a DBMS: unsophisticated users (customers, travel agents, etc.) and sophisticated users, application programmers, and DB administrators issue SQL commands through Web forms, application front ends, and the SQL interface. Per connection, the query evaluation engine (parser, optimizer, plan executor, operator evaluator) runs on top of the shared components: files and access methods, the buffer manager, and the disk space manager; the transaction manager, lock manager (concurrency control), and recovery manager interact with them. The database itself consists of index files, data files, and the system catalog.]
18
9.4 Buffer Management in a DBMS
 Data must be in RAM for the DBMS to operate on it!
[Figure: page requests from higher levels are served from the BUFFER POOL in main memory; each frame holds a copy of a disk page, free frames are filled on demand, the choice of frame to reuse is dictated by the replacement policy, and the DB pages reside on disk.]
19
When a Page is Requested ...
 Buffer pool information table: <frame#, pageid, pin_cnt, dirty>
− In big systems, it is not trivial even to check whether a page is in the pool
 If requested page is not in pool:
− Choose a frame for replacement
− If frame is dirty, write it to disk
− Read requested page into chosen frame
 Pin the page and return its address.
 If requests can be predicted (e.g., sequential scans) pages can be
pre-fetched several pages at a time!
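A minimal sketch of this request path in Python; the class, frame layout, and read/write callbacks are illustrative, not any particular DBMS's API:

from dataclasses import dataclass

@dataclass
class Frame:
    page_id: int = -1      # which disk page this frame holds (-1 = free)
    pin_count: int = 0
    dirty: bool = False
    data: bytes = b""

class BufferPool:
    def __init__(self, num_frames, read_page, write_page):
        self.frames = [Frame() for _ in range(num_frames)]
        self.page_table = {}            # page_id -> frame index
        self.read_page = read_page      # callback: page_id -> bytes
        self.write_page = write_page    # callback: (page_id, bytes) -> None

    def pin_page(self, page_id):
        """Return the frame holding page_id, reading it from disk if needed."""
        if page_id in self.page_table:              # hit: page already cached
            frame = self.frames[self.page_table[page_id]]
        else:                                        # miss: choose a victim frame
            idx = self._choose_victim()
            frame = self.frames[idx]
            if frame.dirty:                          # write back before reuse
                self.write_page(frame.page_id, frame.data)
            self.page_table.pop(frame.page_id, None)
            frame.page_id = page_id
            frame.data = self.read_page(page_id)
            frame.dirty = False
            self.page_table[page_id] = idx
        frame.pin_count += 1                         # pin before returning
        return frame

    def _choose_victim(self):
        # placeholder policy: first unpinned frame (LRU/Clock would go here)
        for i, f in enumerate(self.frames):
            if f.pin_count == 0:
                return i
        raise RuntimeError("all frames are pinned")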
20
More on Buffer Management
 Requestor of page must unpin it, and indicate whether page has
been modified:
− dirty bit is used for this.
 Page in pool may be requested many times,
− a pin count is used.
− a page is a candidate for replacement iff pin count = 0.
 CC & recovery may entail additional I/O when a frame is chosen for
replacement. (e.g. Write-Ahead Log protocol)
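Continuing the BufferPool sketch above, the matching unpin would look roughly like this; the dirty flag is only ever set here, and a frame becomes a replacement candidate again once its pin count drops to 0:

    def unpin_page(self, page_id, dirty):
        """Caller is done with the page; remember whether it was modified."""
        frame = self.frames[self.page_table[page_id]]
        assert frame.pin_count > 0, "unpin without a matching pin"
        frame.pin_count -= 1
        if dirty:
            frame.dirty = True   # must be written back before the frame is reused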
21
Buffer Manager Pseudo Code
[Source: Uwe Röhm’s Slide]
22
Buffer Replacement Policy
 Hit vs. miss
 Hit ratio = # of hits / ( # of page requests to buffer cache)
− One miss incurs one (or two) physical IO. Hit saves IO.
− Rule of thumb: at least 80 ~ 90%
 Problem: for the given (future) references, which victim should be
chosen for highest hit ratio (i.e. least # of IOs)?
− Numerous policies
− Does one policy win over the others?
− One policy does not fit all reference patterns!
23
Buffer Replacement Policy
 Frame is chosen for replacement by a replacement policy:
− Random, FIFO, LRU, MRU, LFU, Clock etc.
− Replacement policy can have big impact on # of I/O’s; depends on the
access pattern
 For a given workload, one replacement policy, A, achieves 90% hit
ratio and the other, B, does 91%.
− How much improvement? 1% or 10%?
− We need to interpret its impact in terms of miss ratio, not hit ratio
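To make this concrete: going from a 90% to a 91% hit ratio cuts the miss ratio from 10% to 9%. For 1,000,000 page requests that is 100,000 versus 90,000 physical IOs, i.e. a 10% reduction in IO, even though the hit ratio improved by "only" one point.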
24
Buffer Replacement Policy (2)
 Least Recently Used (LRU)
− For each page in buffer pool, keep track of time last unpinned
− Replace the frame that has the oldest (earliest) time
− Very common policy: intuitive and simple
− Why does it work?
 ``Principle of (temporal) locality” (of references) (https://en.wikipedia.org/wiki/Locality_of_reference)
 Why temporal locality in database?
− The correct implementation is not trivial
 Especially in large scale systems: e.g. time stamp
 Variants
− Linked list of buffer frames, LRU-K, 2Q, midpoint-insertion and touch count
algorithm(Oracle), Clock, ARC …
− Implication of big memory: “random” > “LRU”??
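A minimal sketch of the linked-list variant in Python, using collections.OrderedDict to track the least recently unpinned frames (illustrative only; real systems also worry about locking and timestamp maintenance):

from collections import OrderedDict

class LRUReplacer:
    """Tracks unpinned frames in order of last unpin; evicts the oldest."""
    def __init__(self):
        self.unpinned = OrderedDict()   # frame_id -> None, oldest first

    def record_unpin(self, frame_id):
        # (re)insert at the most-recently-used end
        self.unpinned.pop(frame_id, None)
        self.unpinned[frame_id] = None

    def record_pin(self, frame_id):
        # pinned frames are not eviction candidates
        self.unpinned.pop(frame_id, None)

    def choose_victim(self):
        if not self.unpinned:
            return None                                      # everything is pinned
        frame_id, _ = self.unpinned.popitem(last=False)      # least recently unpinned
        return frame_id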
25
LRU and Sequential Flooding
 Problem of LRU - sequential flooding
− caused by LRU + repeated sequential scans.
− # buffer frames < # pages in file means each page request causes an I/O. MRU
much better in this situation (but not in all situations, of course).
[Figure: file A on disk has 5 blocks (1 2 3 4 5); the buffer pool holds only 4 blocks (1 2 3 4).]
 Assume repeated sequential scans of file A
 What happens when the 5th block is read? And when the 1st block is read again? …. With LRU, every request after the first scan is a miss.
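A small simulation of this scenario (4 frames, repeated sequential scans of a 5-block file), sketched in Python with a toy eviction helper; it reproduces the point that LRU gets no hits at all while MRU keeps most of the file resident:

def simulate(policy, num_frames=4, num_blocks=5, num_scans=10):
    """Count buffer hits for repeated sequential scans under LRU or MRU."""
    pool, hits = [], 0                      # pool[0] = least, pool[-1] = most recently used
    for _ in range(num_scans):
        for block in range(1, num_blocks + 1):
            if block in pool:
                hits += 1
                pool.remove(block)          # refresh recency on a hit
            elif len(pool) == num_frames:   # miss with a full pool: evict
                pool.pop(0 if policy == "LRU" else -1)
            pool.append(block)
    return hits

for policy in ("LRU", "MRU"):
    print(policy, simulate(policy), "hits out of", 5 * 10, "requests")  # LRU gets 0 hits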
26
“Clock” Replacement Policy
 An approximation of LRU
 Arrange frames into a cycle, store one reference bit per frame
 When pin count reduces to 0, turn on reference bit
 When replacement necessary
do for each page in cycle {
    if (pincount == 0 && ref bit is on)
        turn off ref bit;
    else if (pincount == 0 && ref bit is off)
        choose this page for replacement;
} until a page is chosen;
 Second chance algorithm / bit
 Generalized Clock (GCLOCK): ref. bit → ref. counter
[Figure: a clock of four frames, e.g. A(1), B(p), C(1), D(1); the hand sweeps around the cycle looking for an unpinned frame whose reference bit is off.]
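A runnable sketch of the clock sweep in Python (one reference bit per frame, pinned frames skipped); the bookkeeping is illustrative:

class ClockReplacer:
    def __init__(self, num_frames):
        self.ref = [0] * num_frames        # reference bit per frame
        self.pin = [0] * num_frames        # pin count per frame
        self.hand = 0

    def on_unpin(self, i):
        self.pin[i] -= 1
        if self.pin[i] == 0:
            self.ref[i] = 1                # give the frame a "second chance"

    def choose_victim(self):
        n = len(self.ref)
        for _ in range(2 * n):             # at most two sweeps are needed
            i = self.hand
            self.hand = (self.hand + 1) % n
            if self.pin[i] == 0:
                if self.ref[i]:
                    self.ref[i] = 0        # spend the second chance
                else:
                    return i               # unpinned and not recently referenced
        return None                        # everything is pinned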
27
Classification of Replacement Policies
[Source: Uwe Röhm’s Slide]
28
Why Not Store It All in Main Memory?
 Cost!: $20 / 1 GB DRAM vs. $50 / 150 GB of disk (EIDE/ATA) vs. $100 / 30 GB (SCSI).
− High-end databases today in the 10-100 TB range.
− Approx. 60% of the cost of a production system is in the disks.
 Some specialized systems (e.g. Main Memory(MM) DBMS) store
entire database in main memory.
− Vendors claim 10x speed up vs. traditional DBMS in main memory.
− Sap Hana, MS Hekaton, Oracle In-memory, Altibase ..
 Main memory is volatile: data should be saved between runs.
− Disk write is inevitable: log write for recovery and periodic checkpoint
29
[Figure: TPS (Transactions Per Second) is determined by the interplay of CPU (dual, quad, …), IOPS, data size, buffer size and hit ratio, and multi-threading (CPU-IO overlapping via context switching).]
 3 states: CPU bound, IO bound, balanced
For perfect CPU-IO overlapping, IOPS matters!
30
IOPS Crisis in OLTP
IBM for TPC-C (2008 Dec.)
 6M tpmC
 Total cost: 35M $
− Server HW: 12M $
− Server SW: 2M $
− Storage: 20M $
− Client HW/SW: 1M $
 They are buying IOPS, not capacity
31
IOPS Crisis in OLTP(2)
 For balanced systems, OLTP systems pay huge $$$ on disks for high
IOPS
[Figure: a balanced configuration pairs a 300 GIPS CPU + server (12M $) with 10,000 disks for IOPS (20M $), per Amdahl's law. 18 months later, Moore's law doubles the CPU to 600 GIPS for the same 12M $; with the same 10,000 disks the system is no longer balanced (about 50% CPU utilization, same TPS). Restoring balance for 2x TPS takes about 20,000 (short-stroked) disks, roughly 40M $.]
32
Some Techniques to Hide IO Bottlenecks
 Pre-fetching: for a sequential scan, pre-fetching several pages at a time is a big win! Even the cache / disk controller supports prefetching
 Caching: modern disk controllers do their own caching.
 IO overlapping: the CPU keeps working while IO is in progress
− Double buffering, asynchronous IO
 Multiple threads
 And, don’t do IOs, avoid IOs
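A sketch of double buffering with a background prefetch thread in Python: while the CPU processes block i, block i+1 is already being read. The read_block and process callbacks are placeholders; a real DBMS would use asynchronous IO (e.g., libaio or io_uring) rather than a thread, but the overlap pattern is the same.

import threading

def overlapped_scan(read_block, process, num_blocks):
    """Process blocks 0 .. num_blocks-1, overlapping IO with computation."""
    if num_blocks == 0:
        return
    staging = {}

    def prefetch(block_id):
        staging["data"] = read_block(block_id)

    current = read_block(0)                  # first read is synchronous
    for i in range(num_blocks):
        t = None
        if i + 1 < num_blocks:               # start reading the next block ...
            t = threading.Thread(target=prefetch, args=(i + 1,))
            t.start()
        process(current)                     # ... while the CPU works on this one
        if t is not None:
            t.join()
            current = staging["data"]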
33
MMDBMS vs. All-Flash DBMS: Personal Thoughts
 Why has MMDBMS recently become popular?
− SAP HANA, MS Hekaton, Oracle In-Memory, Altibase, ….
− Disk IOPS was too expensive
− The price of DRAM has kept dropping: DRAM_Cost << Disk_IOPS_Cost
− The overhead of disk-based DBMS is not negligible
− Applications with extreme performance requirements
 Flash storage
− Low $/IOPS
− DRAM_Cost > Flash_IOPS_Cost
 MMDBMS vs. All-Flash DBMS (with some optimizations)
− Winner? Time will tell
34
Power Consumption Issue in Big Memory
 Why exponential?
 Electricity costs roughly 15 ~ 50 cents per kWh; a sustained 1 kW draw is 8,760 kWh per year, about 1,752 $ per year at 20 cents/kWh
Source: SIGMOD '17 keynote by Anastasia Ailamaki
35
Jim Gray Flash Disk Opportunity for Server Applications
(MS-TR 2007)
 “My tests and those of several others suggest that FLASH disks can deliver about 3K random 8KB reads/second and with
some re-engineering about 1,100 random 8KB writes per second. Indeed, it appears that a single FLASH chip could deliver
nearly that performance and there are many chips inside the “box” – so the actual limit could be 4x or more. But, even the
current performance would be VERY attractive for many enterprise applications. For example, in the TPC-C benchmark, has
approximately equal reads and writes. Using the graphs above, and doing a weighted average of the 4-deep 8 KB random
read rate (2,804 IOps), and 4-deep 8 KB sequential write rate (1233 IOps) gives harmonic average of 1713 (1-deep gives
1,624 IOps). TPC-C systems are configured with ~50 disks per cpu. For example the most recent Dell TPC-C system has
ninety 15Krpm 36GB SCSI disks costing 45k$ (with 10k$ extra for maintenance that gets “discounted”). Those disks are
68% of the system cost. They deliver about 18,000 IO/s. That is comparable to the requests/second of ten FLASH disks. So
we could replace those 90 disks with ten NSSD if the data would fit on 320GB (it does not). That would save a lot of money
and a lot of power (1.3Kw of power and 1.3Kw of cooling).” (excerpts from “Flash disk opportunity for server-applications”)
36
Flash SSD vs. HDD: Non-Volatile Storage
[Figure: HDD, the 60-year champion, vs. Flash SSD, a new challenger, behind an identical interface.]
Flash SSD vs. HDD:
Flash SSD: Electronic; HDD: Mechanical
Flash SSD: Asymmetric Read/Write; HDD: Symmetric
Flash SSD: No Overwrite; HDD: Overwrite
37
Flash SSD: Characteristics
 No overwrite, addr. mapping table
 Asymmetric read/write
 No mechanics (Seq. RD ~ Rand. RD)
 Seq. WR >> Rand. WR
 Parallelism (SSD Architecture)
 New Interface & Computing (SSD Architecture)
38
Storage Device Metrics
 Capacity ($/GB) : Harddisk >> Flash SSD
 Bandwidth (MB/sec): Harddisk < Flash SSD
 Latency (IOPS): Harddisk << Flash SSD
− e.g. Harddisk
 Commodity hdd (7200rpm): 50$ / 1TB / 100MB/s / 100 IOPS
 Enterprise hdd (15Krpm): 250$ / 72GB / 200MB/s / 500 IOPS
 The price of harddisks is said to be proportional to IOPS, not capacity.
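Working out the (illustrative) numbers above: the commodity drive costs 50 $ / 100 IOPS = 0.5 $ per IOPS and 50 $ / 1 TB = 0.05 $ per GB; the enterprise drive costs 250 $ / 500 IOPS = 0.5 $ per IOPS and 250 $ / 72 GB ≈ 3.5 $ per GB. The two drives cost about the same per IOPS even though their $/GB differs by roughly 70x, which is exactly the sense in which harddisk price tracks IOPS rather than capacity.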
39
Storage Device Metrics(2): HDD vs. Flash SSDs
 Other Metrics: Weight/shock resistance/heat & cooling,
power(watt) , IOPS/watt, IOPS/$ ….
− Harddisk << Flash SSD
[Source: Rethinking Flash In the Data Center, IEEE Micro 2010]
40
Evolution of DRAM, HDD, and SSD (1987 – 2017)
 Raja et al., The Five-Minute Rule Thirty Years Later and its Impact on the Storage Hierarchy, ADMS ‘17
41
Technology RATIOS Matter
[ Source: Jim Gray’s PPT ]
 Technology ratio change: 1980s vs. 2010s
 If everything changes in the same way, then nothing really changes.[See next slide]
 If some things get much cheaper/faster than others, then that is real change.
 Some things are not changing much: e.g. cost of people, speed of light
 And some things are changing a LOT: e.g. Moore’s law, disk capacity
 Latency lags behind bandwidth
 Bandwidth lags behind capacity
 Flash memory/NVRAMs and their role in the memory hierarchy? A disruptive technology ratio change → a new disruptive solution!!
 Disk improvements [‘88 – ’04]:
− Capacity: 60%/y
− Bandwidth: 40%/y
− Access time: 16%/y
42
Latency Gap in Memory Hierarchy
Typical access latency & granularity
We need `gap filler’!
[Source: Uwe Röhm’s Slide]
 Latency lags behind bandwidth [David Patterson, CACM Oct. 2004]
 The bandwidth problem can be cured with money, but the latency problem is harder
43
Our Message in SIGMOD ’09 Paper
One FlashSSD can beat Ten Harddisks in OLTP
- Performance, Price, Capacity, Power – (in 2008)
 In 2015, one FlashSSD can beat more than several tens of harddisks in OLTP
 (See the excerpt from Jim Gray's "Flash disk opportunity for server-applications" quoted on slide 35.)
44
Flash-based TPC-C @ 2013 September
 Oracle + Sun Flash Storage
− 8.5M tpmC
 Total cost: 4.7M $
− Server HW: .6M $
− Server SW: 1.9M $
− Storage: 1.8M $
 216 400GB Flash Modules: 1.1M $
 86 3TB 7.2K HDD: 0.07M $
− Client HW/SW: 0.1M $
− Others: 0.1M$
 Implications
− More vertical stacks (by SW vendor )
− Harddisk vendors (e.g. Seagate)
45
HDD vs. SSD [Patterson 2016]
46
Page Size
 Default page size in Linux and Windows: 4KB (cf. 512B disk sector)
 Oracle
− “Oracle recommends smaller Oracle Database block sizes (2 KB or 4 KB) for online
transaction processing or mixed workload environments and larger block sizes (8 KB,16
KB, or 32 KB) for decision support system workload environments.” (see chapter “IO
Config. And Design” in “Database Performance Tuning Guide” book )
47
Why Smaller Page in SSD?
 Small page size advantages – Better throughput
48
LinkBench: Page Size
 Benefits of small page
− Better read/write IOPS
 Exploit internal parallelism
− Better buffer-pool hit ratio

  • #49: Next, we measured the impact of page size on overall performance. In this experiment, we disabled WRITE_BARRIER and the doublewrite buffer. As the left graph shows, smaller pages deliver better performance with LinkBench: with the write barrier off, reducing the page size from 16KB to 4KB gave more than a two-fold increase in transaction throughput, because a small page improves the effectiveness of data transfer and lets the flash SSD exploit its internal parallelism maximally. We also monitored the buffer pool hit ratio while varying the page size. As the right graph shows, smaller pages deliver better hit ratios, and the gap becomes wider with a larger buffer pool, since smaller pages keep less unnecessary data in the buffer pool and avoid unnecessary data transfer between host and storage device. This higher hit ratio, combined with the higher IOPS of small pages, widens the throughput gap among the three page sizes as the buffer pool grows. Due to limited time, we did not include the transaction-throughput graph varying the buffer pool size; please refer to the paper.