Ch 9. Storing Data: Disks and Files
- Heap File Structure -
Sang-Won Lee
http://icc.skku.ac.kr/~swlee
SKKU VLDB Lab. & SOS
( http://vldb.skku.ac.kr/ )
2
Contents
9.0 Overview
9.1 Memory Hierarchy
9.2 RAID (Redundant Array of Independent Disks)
9.3 Disk Space Management
9.4 Buffer Manager
9.5 Files of Records
9.6 Page Format
9.7 Record Format
3
Memory Hierarchy
(Top of the hierarchy: smaller, faster, more expensive, volatile. Bottom: bigger, slower, cheaper, non-volatile.)
 Main memory (RAM) for currently used data.
 Disk for the main database (secondary storage).
 Tapes for archiving older versions of the data (tertiary storage).
 WHY A MEMORY HIERARCHY?
 What if an ideal storage medium appeared: fast, cheap, large, and non-volatile? PCM, MRAM, FeRAM?
4
Jim Gray’s Storage Latency Analogy:
How Far Away is the Data?
Relative access latency (registers = 1) and how far away the data is, in Gray's analogy:
Registers: 1 (My Head, 1 min)
On Chip Cache: 2 (This Room)
On Board Cache: 10 (This Hotel, 10 min)
Memory: 100 (Sacramento, 1.5 hr)
Disk: 10^6 (Pluto, 2 Years)
Tape / Optical Robot: 10^9 (Andromeda, 2,000 Years)
5
Disks and Files
 DBMS stores information on (hard) disks.
− Electronic (CPU, DRAM) vs. Mechanical (harddisk)
 This has major implications for DBMS design!
− READ: transfer data from disk to main memory (RAM).
− WRITE: transfer data from RAM to disk.
− Both are expensive operations, relative to in-memory operations, so
must be planned carefully!
 DRAM: ~ 10 ns
 Harddisk: ~ 10ms
 SSD: 80us ~ 10ms
6
Disks
 Secondary storage device of choice.
 Main advantage over tapes: random access vs. sequential.
− Tapes deteriorate over time
 Data is stored and retrieved in units of disk blocks (pages).
 Unlike RAM, time to retrieve a disk page varies depending upon
location on disk.
− Thus, relative placement of pages on disk has big impact on DB
performance!
 e.g. adjacent allocation of the pages from the same tables.
− We need to optimize both data placement and access
 e.g. elevator disk scheduling algorithm
7
Anatomy of a Disk
Arm assembly
 The platters spin
 e.g. 5400 / 7200 / 15K rpm
 The arm assembly is moved in or out to position a head on a desired track; the tracks under the heads form a cylinder
 Mechanical storage -> low IOPS
 Only one head reads/writes at any one
time.
 Parallelism degree: 1
 Block size is a multiple of sector size
 Update-in-place: poisoned apple
 No atomic write
 Fsync for ordering / durability
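Since the drive and the OS page cache may delay and reorder writes, a DBMS forces critical data (e.g., log records) to stable storage with fsync. A minimal sketch in Python; the file name and record are made up for illustration:

import os

def durable_append(path, record):
    """Append a record and force it to stable storage before returning."""
    with open(path, "ab") as f:
        f.write(record)
        f.flush()              # user-space buffer -> OS page cache
        os.fsync(f.fileno())   # OS page cache -> the device, before we proceed

# e.g., a commit record must be durable before the commit is acknowledged
durable_append("redo.log", b"COMMIT txn 42\n")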
8
Accessing a Disk Page
 Time to access (read/write) a disk block:
− seek time (moving arms to position disk head on track)
− rotational delay (waiting for block to rotate under head)
− transfer time (actually moving data to/from disk surface)
 Seek time and rotational delay dominate.
− Seek time: about 1 to 20msec
− Rotational delay: from 0 to 10msec
− Transfer time: about 1ms per 4KB page
 Key to lower I/O cost: reduce seek/rotation delays!
− E.g. disk scheduling algorithm in OS, Linux 4 I/O schedulers
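As a rough worked example (assuming a 7200 rpm disk with a 9 ms average seek and the ~1 ms per 4KB transfer time above):

T_access = T_seek + T_rotation + T_transfer ≈ 9 ms + (1/2) x (60 s / 7200) + 1 ms ≈ 9 + 4.17 + 1 ≈ 14 ms

So such a disk sustains only on the order of 1000 / 14 ≈ 70 random 4KB page accesses per second, which is why reducing seek and rotational delays matters so much.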
9
Arranging Pages on Disk
 `Next’ block concept:
− Blocks on same track, followed by
− Blocks on same cylinder, followed by
− Blocks on adjacent cylinder
 Blocks in a file should be arranged sequentially on disk (by `next’),
to minimize seek and rotational delay.
 Disk fragmentation problem
− Is this still problematic in flash storage?
10
Table, Insertions, Heap Files
CREATE TABLE TEST (a int, b int, c varchar2(650));
/* Insert 1M tuples into TEST table (approximately 664 bytes per
tuple) */
BEGIN
FOR i IN 1..1000000 LOOP
INSERT INTO TEST (a, b, c) values (i, i, rpad('X', 650, 'X'));
END LOOP;
END;
/*
Page = 8KB
10 tuples / page
100,000 pages in total
TEST table = 800MB
*/
[Figure: TEST SEGMENT (heap file) = Data Page 1, Data Page 2, ..., Data Page i, ..., Data Page 100,000]
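A quick check of the numbers in the comment above: 1,000,000 tuples / 10 tuples per page = 100,000 pages, and 100,000 pages x 8 KB = 800 MB for the TEST segment. (The raw fit would be 8192 / 664 ≈ 12 tuples per page; the 10 tuples/page figure also accounts for per-block overhead and reserved free space, e.g. Oracle's default PCTFREE, an assumption here.)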
11
OLAP vs. OLTP
 On-Line Analytical vs. Transactional Processing
SQL> SELECT SUM(b) FROM TEST;
SUM(B)
----------
5.0000E+11
Execution Plan
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 5 | 22053 (1)| 00:04:25 |
| 1 | SORT AGGREGATE | | 1 | 5 | | |
| 2 | TABLE ACCESS FULL| TEST | 996K| 4865K| 22053 (1)| 00:04:25 |
---------------------------------------------------------------------------
Statistics
----------------------------------------------------------
179 recursive calls
0 db block gets
100152 consistent gets
100112 physical reads
…..
1 rows processed
[Figure: TEST SEGMENT (heap file), Data Pages 1 .. 100,000]
12
OLAP vs. OLTP
SQL> SELECT B FROM TEST WHERE A = 500000;
B
----------
500000
Execution Plan
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 5 | 22053 (1)| 00:04:25 |
| 1 | SORT AGGREGATE | | 1 | 5 | | |
| 2 | TABLE ACCESS FULL| TEST | 996K| 4865K| 22053 (1)| 00:04:25 |
---------------------------------------------------------------------------
Statistics
----------------------------------------------------------
179 recursive calls
0 db block gets
100152 consistent gets
100112 physical reads
…..
1 rows processed
[Figure: TEST SEGMENT (heap file), Data Pages 1 .. 100,000]
 Point or Range Query
13
OLAP vs. OLTP
CREATE INDEX TEST_A ON TEST(A);
[Figure: TEST SEGMENT (heap file), Data Pages 1 .. 100,000]
SELECT B /* point query */
FROM TEST
WHERE A = 500000;
SELECT B /* range query */
FROM TEST
WHERE A BETWEEN 50001 and 50101;
[Figure: B-tree index on TEST(A). A search for key 500000 finds the index entry (500000, (50000, 10)), i.e. Block #: 50000, Slot #: 10, which points to the row (500000, 500000, 'XX…XX') in the heap file.]
Cost:
 Full Table Scan: 100,000 Block Accesses
 Index: (3~4) + Data Block Access
− Point query: 1
− Range queries: depending on range
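A rough cost comparison, assuming a 3~4 level B-tree and TEST rows that are not clustered on A:
Point query (A = 500000): about 3~4 index block accesses plus 1 data block access, versus 100,000 block accesses for a full table scan.
Range query (A BETWEEN 50001 AND 50101): about 3~4 index block accesses to reach the first matching leaf entry, a short leaf scan, plus up to about 101 data block accesses (one per matching row if the rows are scattered over the heap file, fewer if they happen to share blocks), still orders of magnitude below a full scan.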
14
OLAP vs. OLTP
SQL> SELECT B FROM TEST WHERE A = 500000;
B
----------
500000
Execution Plan
--------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 10 | 4 (0)| 00:00:01 |
| 1 | TABLE ACCESS BY INDEX ROWID| TEST | 1 | 10 | 4 (0)| 00:00:01 |
|* 2 | INDEX RANGE SCAN | TEST_A | 1 | | 3 (0)| 00:00:01 |
--------------------------------------------------------------------------------------
Statistics
----------------------------------------------------------
…..
5 consistent gets
4 physical reads
…..
1 rows processed
 Index-based Table Access
15
OLTP: TPC-A/B/C Benchmark
exec sql begin declare section;
long Aid, Bid, Tid, delta, Abalance;
exec sql end declare section;
DCApplication()
{ read input msg;
exec sql begin work;
exec sql update accounts set Abalance = Abalance + :delta where Aid = :Aid;
exec sql select Abalance into :Abalance from accounts where Aid = :Aid;
exec sql update tellers set Tbalance = Tbalance + :delta where Tid = :Tid;
exec sql update branches set Bbalance = Bbalance + :delta where Bid = :Bid;
exec sql insert into history(Tid, Bid, Aid, delta, time) values (:Tid, :Bid, :Aid, :delta, CURRENT);
send output msg;
exec sql commit work; }
From Gray’s Presentation
Transaction =
Sequence of Reads and Writes
16
OLTP System: Architecture
[Figure: transactions 1 … N issue SQLs (select / insert / update / delete / commit operations) that become logical read/writes against the database buffer; the buffer in turn performs physical read/writes against the database on disk.]
17
[Figure 1.3 Architecture of a DBMS: unsophisticated users (customers, travel agents, etc.) and sophisticated users, application programmers, and DB administrators issue SQL commands through Web forms, application front ends, and the SQL interface. Per connection, the query evaluation engine (parser, optimizer, plan executor, operator evaluator) runs on top of the shared components: files and access methods, the buffer manager, and the disk space manager; the transaction manager, lock manager (concurrency control), and recovery manager interact with them. The database itself consists of index files, data files, and the system catalog.]
18
9.4 Buffer Management in a DBMS
 Data must be in RAM for the DBMS to operate on it!
[Figure: page requests from higher levels are served from the BUFFER POOL in main memory; each frame holds a copy of a disk page, free frames are filled on demand, the choice of frame to reuse is dictated by the replacement policy, and the DB pages reside on disk.]
19
When a Page is Requested ...
 Buffer pool information table: <frame#, pageid, pin_cnt, dirty>
− In big systems, it is not trivial even to check whether a page is in the pool
 If requested page is not in pool:
− Choose a frame for replacement
− If frame is dirty, write it to disk
− Read requested page into chosen frame
 Pin the page and return its address.
 If requests can be predicted (e.g., sequential scans) pages can be
pre-fetched several pages at a time!
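A minimal sketch of this request path in Python; the class, frame layout, and read/write callbacks are illustrative, not any particular DBMS's API:

from dataclasses import dataclass

@dataclass
class Frame:
    page_id: int = -1      # which disk page this frame holds (-1 = free)
    pin_count: int = 0
    dirty: bool = False
    data: bytes = b""

class BufferPool:
    def __init__(self, num_frames, read_page, write_page):
        self.frames = [Frame() for _ in range(num_frames)]
        self.page_table = {}            # page_id -> frame index
        self.read_page = read_page      # callback: page_id -> bytes
        self.write_page = write_page    # callback: (page_id, bytes) -> None

    def pin_page(self, page_id):
        """Return the frame holding page_id, reading it from disk if needed."""
        if page_id in self.page_table:              # hit: page already cached
            frame = self.frames[self.page_table[page_id]]
        else:                                        # miss: choose a victim frame
            idx = self._choose_victim()
            frame = self.frames[idx]
            if frame.dirty:                          # write back before reuse
                self.write_page(frame.page_id, frame.data)
            self.page_table.pop(frame.page_id, None)
            frame.page_id = page_id
            frame.data = self.read_page(page_id)
            frame.dirty = False
            self.page_table[page_id] = idx
        frame.pin_count += 1                         # pin before returning
        return frame

    def _choose_victim(self):
        # placeholder policy: first unpinned frame (LRU/Clock would go here)
        for i, f in enumerate(self.frames):
            if f.pin_count == 0:
                return i
        raise RuntimeError("all frames are pinned")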
20
More on Buffer Management
 Requestor of page must unpin it, and indicate whether page has
been modified:
− dirty bit is used for this.
 Page in pool may be requested many times,
− a pin count is used.
− a page is a candidate for replacement iff pin count = 0.
 CC & recovery may entail additional I/O when a frame is chosen for
replacement. (e.g. Write-Ahead Log protocol)
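Continuing the BufferPool sketch above, the matching unpin would look roughly like this; the dirty flag is only ever set here, and a frame becomes a replacement candidate again once its pin count drops to 0:

    def unpin_page(self, page_id, dirty):
        """Caller is done with the page; remember whether it was modified."""
        frame = self.frames[self.page_table[page_id]]
        assert frame.pin_count > 0, "unpin without a matching pin"
        frame.pin_count -= 1
        if dirty:
            frame.dirty = True   # must be written back before the frame is reused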
21
Buffer Manager Pseudo Code
[Source: Uwe Röhm’s Slide]
22
Buffer Replacement Policy
 Hit vs. miss
 Hit ratio = # of hits / ( # of page requests to buffer cache)
− One miss incurs one (or two) physical IO. Hit saves IO.
− Rule of thumb: at least 80 ~ 90%
 Problem: for the given (future) references, which victim should be
chosen for highest hit ratio (i.e. least # of IOs)?
− Numerous policies
− Does one policy win over the others?
− One policy does not fit all reference patterns!
23
Buffer Replacement Policy
 Frame is chosen for replacement by a replacement policy:
− Random, FIFO, LRU, MRU, LFU, Clock etc.
− Replacement policy can have big impact on # of I/O’s; depends on the
access pattern
 For a given workload, one replacement policy, A, achieves 90% hit
ratio and the other, B, does 91%.
− How much improvement? 1% or 10%?
− We need to interpret its impact in terms of miss ratio, not hit ratio
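To make this concrete: going from a 90% to a 91% hit ratio cuts the miss ratio from 10% to 9%. For 1,000,000 page requests that is 100,000 versus 90,000 physical IOs, i.e. a 10% reduction in IO, even though the hit ratio improved by "only" one point.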
24
Buffer Replacement Policy (2)
 Least Recently Used (LRU)
− For each page in buffer pool, keep track of time last unpinned
− Replace the frame that has the oldest (earliest) time
− Very common policy: intuitive and simple
− Why does it work?
 ``Principle of (temporal) locality” (of references) (https://en.wikipedia.org/wiki/Locality_of_reference)
 Why temporal locality in database?
− The correct implementation is not trivial
 Especially in large scale systems: e.g. time stamp
 Variants
− Linked list of buffer frames, LRU-K, 2Q, midpoint-insertion and touch count
algorithm(Oracle), Clock, ARC …
− Implication of big memory: “random” > “LRU”??
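A minimal sketch of the linked-list variant in Python, using collections.OrderedDict to track the least recently unpinned frames (illustrative only; real systems also worry about locking and timestamp maintenance):

from collections import OrderedDict

class LRUReplacer:
    """Tracks unpinned frames in order of last unpin; evicts the oldest."""
    def __init__(self):
        self.unpinned = OrderedDict()   # frame_id -> None, oldest first

    def record_unpin(self, frame_id):
        # (re)insert at the most-recently-used end
        self.unpinned.pop(frame_id, None)
        self.unpinned[frame_id] = None

    def record_pin(self, frame_id):
        # pinned frames are not eviction candidates
        self.unpinned.pop(frame_id, None)

    def choose_victim(self):
        if not self.unpinned:
            return None                                      # everything is pinned
        frame_id, _ = self.unpinned.popitem(last=False)      # least recently unpinned
        return frame_id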
25
LRU and Sequential Flooding
 Problem of LRU - sequential flooding
− caused by LRU + repeated sequential scans.
− # buffer frames < # pages in file means each page request causes an I/O. MRU
much better in this situation (but not in all situations, of course).
[Figure: file A on disk has 5 blocks (1 2 3 4 5); the buffer pool holds only 4 blocks (1 2 3 4).]
 Assume repeated sequential scans of file A
 What happens when the 5th block is read? And when the 1st block is read again? …. With LRU, every request after the first scan is a miss.
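A small simulation of this scenario (4 frames, repeated sequential scans of a 5-block file), sketched in Python with a toy eviction helper; it reproduces the point that LRU gets no hits at all while MRU keeps most of the file resident:

def simulate(policy, num_frames=4, num_blocks=5, num_scans=10):
    """Count buffer hits for repeated sequential scans under LRU or MRU."""
    pool, hits = [], 0                      # pool[0] = least, pool[-1] = most recently used
    for _ in range(num_scans):
        for block in range(1, num_blocks + 1):
            if block in pool:
                hits += 1
                pool.remove(block)          # refresh recency on a hit
            elif len(pool) == num_frames:   # miss with a full pool: evict
                pool.pop(0 if policy == "LRU" else -1)
            pool.append(block)
    return hits

for policy in ("LRU", "MRU"):
    print(policy, simulate(policy), "hits out of", 5 * 10, "requests")  # LRU gets 0 hits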
26
“Clock” Replacement Policy
 An approximation of LRU
 Arrange frames into a cycle, store one reference bit per frame
 When pin count reduces to 0, turn on reference bit
 When replacement necessary
do for each page in cycle {
    if (pincount == 0 && ref bit is on)
        turn off ref bit;
    else if (pincount == 0 && ref bit is off)
        choose this page for replacement;
} until a page is chosen;
 Second chance algorithm / bit
 Generalized Clock (GCLOCK): ref. bit → ref. counter
[Figure: a clock of four frames, e.g. A(1), B(p), C(1), D(1); the hand sweeps around the cycle looking for an unpinned frame whose reference bit is off.]
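A runnable sketch of the clock sweep in Python (one reference bit per frame, pinned frames skipped); the bookkeeping is illustrative:

class ClockReplacer:
    def __init__(self, num_frames):
        self.ref = [0] * num_frames        # reference bit per frame
        self.pin = [0] * num_frames        # pin count per frame
        self.hand = 0

    def on_unpin(self, i):
        self.pin[i] -= 1
        if self.pin[i] == 0:
            self.ref[i] = 1                # give the frame a "second chance"

    def choose_victim(self):
        n = len(self.ref)
        for _ in range(2 * n):             # at most two sweeps are needed
            i = self.hand
            self.hand = (self.hand + 1) % n
            if self.pin[i] == 0:
                if self.ref[i]:
                    self.ref[i] = 0        # spend the second chance
                else:
                    return i               # unpinned and not recently referenced
        return None                        # everything is pinned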
27
Classification of Replacement Policies
[Source: Uwe Röhm’s Slide]
28
Why Not Store It All in Main Memory?
 Cost!: $20 / 1 GB DRAM vs. $50 / 150 GB of disk (EIDE/ATA) vs. $100 / 30 GB (SCSI).
− High-end databases today in the 10-100 TB range.
− Approx. 60% of the cost of a production system is in the disks.
 Some specialized systems (e.g. Main Memory(MM) DBMS) store
entire database in main memory.
− Vendors claim 10x speed up vs. traditional DBMS in main memory.
− Sap Hana, MS Hekaton, Oracle In-memory, Altibase ..
 Main memory is volatile: data should be saved between runs.
− Disk write is inevitable: log write for recovery and periodic checkpoint
29
[Figure: TPS (Transactions Per Second) is determined by the interplay of CPU (dual, quad, …), IOPS, data size, buffer size and hit ratio, and multi-threading (CPU-IO overlapping via context switching).]
 3 states: CPU bound, IO bound, balanced
For perfect CPU-IO overlapping, IOPS matters!
30
IOPS Crisis in OLTP
IBM for TPC-C (2008 Dec.)
 6M tpmC
 Total cost: 35M $
− Server HW: 12M $
− Server SW: 2M $
− Storage: 20M $
− Client HW/SW: 1M $
 They are buying IOPS, not capacity
31
IOPS Crisis in OLTP(2)
 For balanced systems, OLTP systems pay huge $$$ on disks for high
IOPS
[Figure: a balanced configuration pairs a 300 GIPS CPU + server (12M $) with 10,000 disks for IOPS (20M $), per Amdahl's law. 18 months later, Moore's law doubles the CPU to 600 GIPS for the same 12M $; with the same 10,000 disks the system is no longer balanced (about 50% CPU utilization, same TPS). Restoring balance for 2x TPS takes about 20,000 (short-stroked) disks, roughly 40M $.]
32
Some Techniques to Hide IO Bottlenecks
 Pre-fetching: for a sequential scan, pre-fetching several pages at a time is a big win! Even the cache / disk controller supports prefetching
 Caching: modern disk controllers do their own caching.
 IO overlapping: the CPU keeps working while IO is in progress
− Double buffering, asynchronous IO
 Multiple threads
 And, don’t do IOs, avoid IOs
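A sketch of double buffering with a background prefetch thread in Python: while the CPU processes block i, block i+1 is already being read. The read_block and process callbacks are placeholders; a real DBMS would use asynchronous IO (e.g., libaio or io_uring) rather than a thread, but the overlap pattern is the same.

import threading

def overlapped_scan(read_block, process, num_blocks):
    """Process blocks 0 .. num_blocks-1, overlapping IO with computation."""
    if num_blocks == 0:
        return
    staging = {}

    def prefetch(block_id):
        staging["data"] = read_block(block_id)

    current = read_block(0)                  # first read is synchronous
    for i in range(num_blocks):
        t = None
        if i + 1 < num_blocks:               # start reading the next block ...
            t = threading.Thread(target=prefetch, args=(i + 1,))
            t.start()
        process(current)                     # ... while the CPU works on this one
        if t is not None:
            t.join()
            current = staging["data"]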
33
MMDBMS vs. All-Flash DBMS: Personal Thoughts
 Why has MMDBMS recently become popular?
− SAP HANA, MS Hekaton, Oracle In-Memory, Altibase, ….
− Disk IOPS was too expensive
− The price of DRAM has kept dropping: DRAM_Cost << Disk_IOPS_Cost
− The overhead of disk-based DBMS is not negligible
− Applications with extreme performance requirements
 Flash storage
− Low $/IOPS
− DRAM_Cost > Flash_IOPS_Cost
 MMDBMS vs. All-Flash DBMS (with some optimizations)
− Winner? Time will tell
34
Power Consumption Issue in Big Memory
 Why exponential?
 Electricity costs roughly 15 ~ 50 cents per kWh; a sustained 1 kW draw is 8,760 kWh per year, about 1,752 $ per year at 20 cents/kWh
Source: SIGMOD '17 keynote by Anastasia Ailamaki
35
Jim Gray Flash Disk Opportunity for Server Applications
(MS-TR 2007)
 “My tests and those of several others suggest that FLASH disks can deliver about 3K random 8KB reads/second and with
some re-engineering about 1,100 random 8KB writes per second. Indeed, it appears that a single FLASH chip could deliver
nearly that performance and there are many chips inside the “box” – so the actual limit could be 4x or more. But, even the
current performance would be VERY attractive for many enterprise applications. For example, in the TPC-C benchmark, has
approximately equal reads and writes. Using the graphs above, and doing a weighted average of the 4-deep 8 KB random
read rate (2,804 IOps), and 4-deep 8 KB sequential write rate (1233 IOps) gives harmonic average of 1713 (1-deep gives
1,624 IOps). TPC-C systems are configured with ~50 disks per cpu. For example the most recent Dell TPC-C system has
ninety 15Krpm 36GB SCSI disks costing 45k$ (with 10k$ extra for maintenance that gets “discounted”). Those disks are
68% of the system cost. They deliver about 18,000 IO/s. That is comparable to the requests/second of ten FLASH disks. So
we could replace those 90 disks with ten NSSD if the data would fit on 320GB (it does not). That would save a lot of money
and a lot of power (1.3Kw of power and 1.3Kw of cooling).” (excerpts from “Flash disk opportunity for server-applications”)
36
Flash SSD vs. HDD: Non-Volatile Storage
[Figure: HDD, the 60-year champion, vs. Flash SSD, a new challenger, behind an identical interface.]
Flash SSD vs. HDD:
Flash SSD: Electronic; HDD: Mechanical
Flash SSD: Asymmetric Read/Write; HDD: Symmetric
Flash SSD: No Overwrite; HDD: Overwrite
37
Flash SSD: Characteristics
 No overwrite, addr. mapping table
 Asymmetric read/write
 No mechanics (Seq. RD ~ Rand. RD)
 Seq. WR >> Rand. WR
 Parallelism (SSD Architecture)
 New Interface & Computing (SSD Architecture)
38
Storage Device Metrics
 Capacity ($/GB) : Harddisk >> Flash SSD
 Bandwidth (MB/sec): Harddisk < Flash SSD
 Latency (IOPS): Harddisk << Flash SSD
− e.g. Harddisk
 Commodity hdd (7200rpm): 50$ / 1TB / 100MB/s / 100 IOPS
 Enterprise hdd (15Krpm): 250$ / 72GB / 200MB/s / 500 IOPS
 The price of harddisks is said to be proportional to IOPS, not capacity.
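Working out the (illustrative) numbers above: the commodity drive costs 50 $ / 100 IOPS = 0.5 $ per IOPS and 50 $ / 1 TB = 0.05 $ per GB; the enterprise drive costs 250 $ / 500 IOPS = 0.5 $ per IOPS and 250 $ / 72 GB ≈ 3.5 $ per GB. The two drives cost about the same per IOPS even though their $/GB differs by roughly 70x, which is exactly the sense in which harddisk price tracks IOPS rather than capacity.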
39
Storage Device Metrics(2): HDD vs. Flash SSDs
 Other Metrics: Weight/shock resistance/heat & cooling,
power(watt) , IOPS/watt, IOPS/$ ….
− Harddisk << Flash SSD
[Source: Rethinking Flash In the Data Center, IEEE Micro 2010]
40
Evolution of DRAM, HDD, and SSD (1987 – 2017)
 Raja et al., The Five-Minute Rule Thirty Years Later and its Impact on the Storage Hierarchy, ADMS ‘17
41
Technology RATIOS Matter
[ Source: Jim Gray’s PPT ]
 Technology ratio change: 1980s vs. 2010s
 If everything changes in the same way, then nothing really changes.[See next slide]
 If some things get much cheaper/faster than others, then that is real change.
 Some things are not changing much: e.g. cost of people, speed of light
 And some things are changing a LOT: e.g. Moore’s law, disk capacity
 Latency lags behind bandwidth
 Bandwidth lags behind capacity
 Flash memory/NVRAMs and their role in the memory hierarchy? A disruptive technology ratio change → a new disruptive solution!!
 Disk improvements [‘88 – ’04]:
− Capacity: 60%/y
− Bandwidth: 40%/y
− Access time: 16%/y
42
Latency Gap in Memory Hierarchy
Typical access latency & granularity
We need `gap filler’!
[Source: Uwe Röhm’s Slide]
 Latency lags behind bandwidth [David Patterson, CACM Oct. 2004]
 The bandwidth problem can be cured with money, but the latency problem is harder
43
Our Message in SIGMOD ’09 Paper
One FlashSSD can beat Ten Harddisks in OLTP
- Performance, Price, Capacity, Power – (in 2008)
 In 2015, one FlashSSD can beat more than several tens of harddisks in OLTP
 (See the excerpt from Jim Gray's "Flash disk opportunity for server-applications" quoted on slide 35.)
44
Flash-based TPC-C @ 2013 September
 Oracle + Sun Flash Storage
− 8.5M tpmC
 Total cost: 4.7M $
− Server HW: .6M $
− Server SW: 1.9M $
− Storage: 1.8M $
 216 400GB Flash Modules: 1.1M $
 86 3TB 7.2K HDD: 0.07M $
− Client HW/SW: 0.1M $
− Others: 0.1M$
 Implications
− More vertical stacks (by SW vendor )
− Harddisk vendors (e.g. Seagate)
45
HDD vs. SSD [Patterson 2016]
46
Page Size
 Default page size in Linux and Windows: 4KB (cf. 512B disk sector)
 Oracle
− “Oracle recommends smaller Oracle Database block sizes (2 KB or 4 KB) for online
transaction processing or mixed workload environments and larger block sizes (8 KB,16
KB, or 32 KB) for decision support system workload environments.” (see chapter “IO
Config. And Design” in “Database Performance Tuning Guide” book )
47
Why Smaller Page in SSD?
 Small page size advantages – Better throughput
48
LinkBench: Page Size
 Benefits of small page
− Better read/write IOPS
 Exploit internal parallelism
− Better buffer-pool hit ratio

  • #49: Next, we measured the impact of page size on overall performance. In this experiment, we disabled WRITE_BARRIER and the doublewrite buffer. As the left graph shows, smaller pages deliver better performance with LinkBench: with the write barrier off, reducing the page size from 16KB to 4KB gave more than a two-fold increase in transaction throughput, because a small page improves the effectiveness of data transfer and lets the flash SSD exploit its internal parallelism maximally. We also monitored the buffer pool hit ratio while varying the page size. As the right graph shows, smaller pages deliver better hit ratios, and the gap becomes wider with a larger buffer pool, since smaller pages keep less unnecessary data in the buffer pool and avoid unnecessary data transfer between host and storage device. This higher hit ratio, combined with the higher IOPS of small pages, widens the throughput gap among the three page sizes as the buffer pool grows. Due to limited time, we did not include the transaction-throughput graph varying the buffer pool size; please refer to the paper.