1
Block I/O Layer Tracing:
blktrace
Gelato – Cupertino, CA
April 2006
Alan D. Brunelle
Hewlett-Packard Company
Open Source and Linux Organization
Scalability & Performance Group
Alan.Brunelle@hp.com
2
Introduction
● Blktrace – overview of a new Linux capability
– Ability to see what's going on inside the block I/O 
layer
● “You can't count what you can't measure”
– Kernel implementation
– Description of user applications 
● Sample Output & Analysis
3
Problem Statement
● Need to know the specific operations performed 
upon each I/O submitted into the block I/O layer
● Who?
– Kernel developers in the I/O path:
● Block I/O layer, I/O scheduler, software RAID, file system, ...
– Performance analysis engineers – HP OSLO S&P...
4
Block I/O Layer (simplified)
Applications
File Systems...
Page Cache
Block I/O Layer: Request Queues
Pseudo devices (MD/DM - optional)
Physical devices
5
iostat
● The iostat utility does provide information 
pertaining to request queues associated with 
specific devices
– Average I/O time on queue, number of merges, number of 
blocks read/written, ...
● However, it does not provide detailed information 
on a per-I/O basis
6
Blktrace – to the rescue!
● Developed and maintained by Jens Axboe (block I/O layer 
maintainer)
– My additions included threads & utility splitting, DM remap 
events, the blkrawverify utility, a binary dump feature, testing, 
kernel/utility patches, and documentation.
● Low-overhead, configurable kernel component which emits events 
for specific operations performed on each I/O entering the block 
I/O layer
● Set of tools which extract and format these events
However, blktrace is not an analysis tool!
7
Feature List
● Provides detailed block layer information concerning individual I/Os
● Low-overhead kernel tracing mechanism
– Seeing less than 2% hits to application performance in relatively stressful I/O situations
● Configurable:
– Specify 1 or more physical or logical devices 
(including MD and DM (LVM2))
– User-selectable events – can specify filter at event acquisition and/or 
when formatting output
● Supports both “live” and “playback” tracing
8
Events Captured
● Request queue entry allocated
● Sleep during request queue 
allocation
● Request queue insertion
● Front/back merge of I/O on 
request queue
● Re-queue of a request
● Request issued to underlying 
block dev
● Request queue plug/unplug op
● I/O split/bounce operation
● I/O remap
– MD or DM
● Request completed
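The events above appear in blkparse output as single-letter action codes. The lookup table below is a minimal sketch pairing each listed event with the code documented for blkparse; the names ACTION_NAMES and describe are hypothetical helpers, not part of blktrace.

```python
# Action codes as printed by blkparse, paired with the events listed above.
# ACTION_NAMES and describe() are illustrative names, not blktrace APIs.
ACTION_NAMES = {
    "Q": "queued (I/O enters the block layer)",
    "G": "request queue entry allocated (get request)",
    "S": "sleep during request queue allocation",
    "I": "request inserted onto the queue",
    "F": "front merge",
    "M": "back merge",
    "R": "requeue",
    "D": "issued to the driver",
    "P": "plug",
    "U": "unplug",
    "T": "unplug due to timer",
    "X": "split",
    "B": "bounce",
    "A": "remap (MD or DM)",
    "C": "completed",
}


def describe(action: str) -> str:
    """Return a human-readable name for a blkparse action code."""
    return ACTION_NAMES.get(action, "unknown action " + repr(action))


if __name__ == "__main__":
    for code in ("G", "P", "Q", "M", "D", "C"):
        print(code, "->", describe(code))
```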
9
blktrace: General Architecture
[Diagram: in kernel space, I/Os pass through the block I/O layer (request queue), which emits events into relay channels (per device, per CPU); in user space, blktrace reads the relay channels and passes the events to blkparse.]
10
blktrace Utilities
● blktrace: Device configuration, and event 
extraction utility
– Store events in (long) term storage
– Or, pipe to blkparse utility for live tracing
● Also: networking feature to send events to another machine 
for parsing
● blkparse: Event formatting utility 
– Supports textual or binary dump output
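The two modes above (store, then parse later; or pipe for live tracing) can be sketched as command lines. The helper names below are hypothetical, and the sketch only builds argument lists rather than running anything, since tracing normally needs root privileges. It assumes the -w (run time in seconds) option of blktrace and the -d (binary dump file) option of blkparse.

```python
from typing import List, Optional, Tuple


def capture_cmd(device: str, basename: str, seconds: int) -> List[str]:
    """blktrace invocation that stores events in files named after basename."""
    return ["blktrace", "-d", device, "-o", basename, "-w", str(seconds)]


def format_cmd(basename: str, binary_dump: Optional[str] = None) -> List[str]:
    """blkparse invocation that formats stored events ("playback").

    If binary_dump is given, blkparse also writes a binary dump file,
    which post-processing tools can read.
    """
    cmd = ["blkparse", "-i", basename]
    if binary_dump is not None:
        cmd += ["-d", binary_dump]
    return cmd


def live_pipeline(device: str) -> Tuple[List[str], List[str]]:
    """The two halves of: blktrace -d DEV -o - | blkparse -i -"""
    return (["blktrace", "-d", device, "-o", "-"], ["blkparse", "-i", "-"])


if __name__ == "__main__":
    print(" ".join(capture_cmd("/dev/sda", "trace", 30)))
    print(" ".join(format_cmd("trace", "trace.bin")))
    left, right = live_pipeline("/dev/sda")
    print(" ".join(left), "|", " ".join(right))
```

The lists can be handed to subprocess.run (or subprocess.Popen for the piped pair) on a machine where the utilities are installed.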
11
blktrace: Event Output
% blktrace -d /dev/sda -o - | blkparse -i -
8,0 3 1 0.000000000 697 G W 223490 + 8 [kjournald]
8,0 3 2 0.000001829 697 P R [kjournald]
8,0 3 3 0.000002197 697 Q W 223490 + 8 [kjournald]
8,0 3 4 0.000005533 697 M W 223498 + 8 [kjournald]
8,0 3 5 0.000008607 697 M W 223506 + 8 [kjournald]
...
8,0 3 10 0.000024062 697 D W 223490 + 56 [kjournald]
8,0 1 11 0.009507758 0 C W 223490 + 56 [0]
Fields: Dev <mjr, mnr> | CPU | Sequence Number | Time Stamp | PID | Event | Start block + number of blocks | Process
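As a starting point for post-processing, the lines shown above can be split into fields with a small parser. This is a minimal sketch that handles only the line shapes in this sample (with or without a "start block + number of blocks" part); the Event type and parse_line function are hypothetical names, and other blkparse line shapes, such as remap events, are simply skipped.

```python
import re
from typing import NamedTuple, Optional


class Event(NamedTuple):
    major: int
    minor: int
    cpu: int
    seq: int
    time: float          # seconds since the start of the trace
    pid: int
    action: str          # e.g. Q, G, P, M, D, C
    rwbs: str            # e.g. R or W
    sector: Optional[int]
    nblocks: Optional[int]
    process: str


# dev  cpu  seq  time  pid  action  rwbs  [sector + nblocks]  [process]
_LINE = re.compile(
    r"^\s*(\d+),(\d+)\s+(\d+)\s+(\d+)\s+(\d+\.\d+)\s+(\d+)\s+"
    r"([A-Z]+)\s+([A-Z]+)"
    r"(?:\s+(\d+)\s*\+\s*(\d+))?"
    r"\s*\[(.*)\]\s*$"
)


def parse_line(line: str) -> Optional[Event]:
    """Parse one formatted event line; return None if it does not match."""
    m = _LINE.match(line)
    if m is None:
        return None
    major, minor, cpu, seq, time, pid, action, rwbs, sector, nblocks, proc = m.groups()
    return Event(
        int(major), int(minor), int(cpu), int(seq), float(time), int(pid),
        action, rwbs,
        int(sector) if sector is not None else None,
        int(nblocks) if nblocks is not None else None,
        proc,
    )


if __name__ == "__main__":
    sample = "8,0 3 10 0.000024062 697 D W 223490 + 56 [kjournald]"
    print(parse_line(sample))
```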
12
blktrace: Summary Output
CPU0 (sdao):
Reads Queued: 0, 0KiB Writes Queued: 77,382, 5,865MiB
Read Dispatches: 0, 0KiB Write Dispatches: 7,329, 3,020MiB
Reads Requeued: 0 Writes Requeued: 6
Reads Completed: 0, 0KiB Writes Completed: 0, 0KiB
Read Merges: 0 Write Merges: 68,844
Read depth: 2 Write depth: 65
IO unplugs: 414 Timer unplugs: 414
...
CPU3 (sdao):
Reads Queued: 105, 18KiB Writes Queued: 14,541, 2,578MiB
Read Dispatches: 22, 60KiB Write Dispatches: 6,207, 1,964MiB
Reads Requeued: 0 Writes Requeued: 1,408
Reads Completed: 22, 60KiB Writes Completed: 12,300, 5,059MiB
Read Merges: 83 Write Merges: 10,968
Read depth: 2 Write depth: 65
IO unplugs: 287 Timer unplugs: 287
Total (sdao):
Reads Queued: 105, 18KiB Writes Queued: 92,546, 8,579MiB
Read Dispatches: 22, 60KiB Write Dispatches: 13,714, 5,059MiB
Reads Requeued: 0 Writes Requeued: 1,414
Reads Completed: 22, 60KiB Writes Completed: 12,300, 5,059MiB
Read Merges: 83 Write Merges: 80,246
IO unplugs: 718 Timer unplugs: 718
Throughput (R/W): 0KiB/s / 39,806KiB/s
Events (sdao): 324,011 entries
Skips: 0 forward (0 - 0.0%)
Callouts on the slide: “Per CPU details”, “Per device details”, “Avg throughput”, “Writes submitted on”, “Writes completed on”
13
blktrace: Event Storage Choices
● Physical disk backed file system
– Pros: large/permanent amount of storage available; supported by all kernels
– Cons: potentially higher system impact; may negatively impact devices being watched (if 
storing on the same bus that other devices are being watched on...)
● RAM disk backed file system
– Pros: predictable system impact (RAM allocated at boot); less impact to I/O subsystem
– Cons: limited/temporary storage size; removes RAM from system (even when not tracing); 
may require reboot/kernel build
● TMPFS
– Pros: less impact to I/O subsystem; included in most kernels; only utilizes system RAM while 
events are stored
– Cons: limited/temporary storage; impacts system predictability – RAM “removed” as events 
are stored – could affect application being “watched” 
14
blktrace: Analysis Aid
● As noted previously, blktrace does not analyze 
the data; it is responsible for storing and 
formatting events
● Need to develop post-processing analysis tools 
– Can work on formatted output or binary data stored 
by blktrace itself
– Example: btt – block trace timeline
15
Practical blktrace
● Here at HP OSLO S&P, we are investigating I/O 
scalability at various levels
– Including the efficiency of various hardware configurations 
and the effects on I/O performance caused by software RAID 
(MD and DM)
● blktrace enables us to determine scalability issues within 
the block I/O layer and the overhead costs induced when 
utilizing software RAID
16
Life of an I/O (simplified)
● I/O enters block layer – it can be:
– Remapped onto another device (MD, DM)
– Split into 2 separate I/Os (alignment, size, ...)
– Added to the request queue
– Merged with a previous entry on the queue
All I/Os end up on a request queue at some point
● At some later time, the I/O is issued to a device driver, 
and submitted to a device
● Later, the I/O is completed by the device, and its driver
17
btt: Life of an I/O
● Q2I – time it takes to process an I/O prior to it being 
inserted or merged onto a request queue
– Includes split, and remap time
● I2D – time the I/O is “idle” on the request queue
● D2C – time the I/O is “active” in the driver and on the 
device
● Q2I + I2D + D2C = Q2C 
– Q2C: Total processing time of the I/O
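The intervals defined above can be computed from the time stamps of a single I/O's events. The sketch below is not btt's implementation; phase_times is a hypothetical helper, the time stamps in the usage example are made up, and it assumes the events for one I/O have already been collected as (time stamp, action) pairs.

```python
from typing import Dict, Iterable, Tuple


def phase_times(events: Iterable[Tuple[float, str]]) -> Dict[str, float]:
    """Compute Q2I, I2D, D2C and Q2C (seconds) for one I/O.

    events: (time stamp, action code) pairs for a single I/O.
    An insert (I) or a merge (M, F) both count as the end of Q2I.
    """
    first: Dict[str, float] = {}
    for ts, action in events:
        key = "I" if action in ("I", "M", "F") else action
        if key in ("Q", "I", "D", "C") and key not in first:
            first[key] = ts  # keep the earliest occurrence only
    missing = [k for k in "QIDC" if k not in first]
    if missing:
        raise ValueError("missing events: " + ", ".join(missing))
    q, i, d, c = (first[k] for k in "QIDC")
    if not (q <= i <= d <= c):
        raise ValueError("events are not in Q, I, D, C order")
    return {"Q2I": i - q, "I2D": d - i, "D2C": c - d, "Q2C": c - q}


if __name__ == "__main__":
    # Hypothetical time stamps, chosen only to illustrate the arithmetic.
    io = [(0.0, "Q"), (0.25, "G"), (0.25, "I"), (0.5, "D"), (1.0, "C")]
    print(phase_times(io))
```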
18
btt: Partial Output
DEV #Q #D Ratio BLKmin BLKavg BLKmax Total
------- --- ----- ----- ------- ------ ------ --------
[ 8, 0] 92827 12401 7.5 1 109 1024 10120441
[ 8, 1] 93390 13676 6.8 1 108 1024 10150343
[ 8, 2] 92366 13052 7.1 1 109 1024 10119302
[ 8, 3] 92278 13616 6.8 1 109 1024 10119043
[ 8, 4] 92651 13736 6.7 1 109 1024 10119903
DEV Q2I I2D D2C Q2C
------- ----------- ----------- ----------- -----------
[ 8, 0] 0.049697430 0.302734498 0.074038617 0.400079555
[ 8, 1] 0.031665593 0.050032148 0.058669682 0.125934697
[ 8, 2] 0.035651772 0.031035436 0.047311659 0.096735504
[ 8, 3] 0.021047776 0.011161007 0.038519804 0.060975408
[ 8, 4] 0.028985217 0.008397228 0.034344640 0.058160497
DEV Q2I I2D D2C
------- ------ ------ ------
[ 8, 0] 11.7% 71.0% 17.4%
[ 8, 1] 22.6% 35.6% 41.8%
[ 8, 2] 31.3% 27.2% 41.5%
[ 8, 3] 29.8% 15.8% 54.5%
[ 8, 4] 40.4% 11.7% 47.9%
Callouts on the slide:
– Merge Ratio: #Q / #D
– SCSI bus, target: low- to high-priority
– “Software” time
– Excessive “idle” time on request queue
– Driver/Device time
– Avg I/O time
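The Ratio column and the percentage table can be re-derived from the other columns. The sketch below uses the [ 8, 0] row; it assumes each percentage is that phase's share of Q2I + I2D + D2C, which is an inference from the numbers rather than something the slides state. The function names are hypothetical.

```python
from typing import Tuple


def merge_ratio(queued: int, dispatched: int) -> float:
    """Merge ratio: I/Os queued (#Q) per request dispatched (#D)."""
    return queued / dispatched


def phase_percentages(q2i: float, i2d: float, d2c: float) -> Tuple[float, float, float]:
    """Share of each phase, as a percentage of Q2I + I2D + D2C."""
    total = q2i + i2d + d2c
    return (100.0 * q2i / total, 100.0 * i2d / total, 100.0 * d2c / total)


if __name__ == "__main__":
    # Values copied from the [ 8, 0] row of the btt output above.
    print("ratio: %.1f" % merge_ratio(92827, 12401))
    pct = phase_percentages(0.049697430, 0.302734498, 0.074038617)
    print("Q2I %.1f%%  I2D %.1f%%  D2C %.1f%%" % pct)
```

For device [ 8, 0] the I2D share dominates, which is the excessive idle time on the request queue that the slide calls out.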
19
btt: Q&C Activity
● btt also generates “activity” data – indicating 
ranges where processes and devices are actively 
handling various events (block I/O entered, I/O 
inserted/merged, I/O issued, I/O complete, ...)
● This data can be plotted (e.g. xmgrace) to see 
patterns and extract information concerning 
anomalous behavior 
20
btt: I/O Scheduler Example
[Figure: plot of “Q” activity and “C” activity. Callouts: mkfs & pdflush “fight” for device; I/O delayed by mkfs activity.]
21
btt: I/O Scheduler - Explained
● Characterizing I/O stack
● Noticed very long I2D times for certain processes
● Graph shows continuous stream of I/Os...
– ...at the device level
– ...for the mkfs.ext3 process
● Graph shows significant lag for pdflush daemon
– Last I/O enters block I/O layer around 19 seconds
– But: the last batch of I/Os isn't completed until 14 seconds later!
● Cause? Anticipatory scheduler – allows mkfs.ext3 to proceed, 
holding off pdflush I/Os
22
Resources
● Kernel sources:
– Patch for Linux 2.6.14-rc3 (or later, up to 2.6.17)
– Linux 2.6.17 (or later) – built in
● Utilities & documentation (& kernel patches)
– rsync://rsync.kernel.org/pub/scm/linux/kernel/git/axboe/blktrace.git
– See documentation in doc directory
● Mailing list: linux-btrace@vger.kernel.org
