CALSOFT CONFIDENTIAL
Whitepaper:
Performance Tuning for Software RAID6 driver in Linux
Key Contributors
Aayush Agrawal: aayush.agrawal@calsoftinc.com
Mandar Joshi: mandar.joshi@calsoftinc.com
Pratik Rupala: pratik.rupala@calsoftinc.com
Sachin Patil: sachinkumar.patil@calsoftinc.com
Performance Tuning for Software RAID6 driver in Linux
Abstract
There are multiple performance tuning parameters available for the software RAID6 driver in Linux. Understanding and optimizing these tunables, combined with the improved RAID6 driver architecture in Linux Kernel 3.16.4, can achieve per-data-disk performance similar to that of the RAID0 driver.
This RAID6 performance can be achieved while keeping RAID6's defining feature in place: block-level striping with two parity blocks distributed across all member disks. For the RAID6 driver in Linux Kernel 3.16.4, sequential write I/O performance after optimizing all the tunables is approximately 110% higher than with the default tunable settings. The key objective of this paper is to illustrate what these tunables are and to what extent they can raise performance compared to the default settings. Apart from RAID6, this paper also briefly explains the tuning parameters for raw disks and the RAID0 driver, and finally it compares RAID6 performance in an older Linux kernel (2.6.32) against a newer kernel (3.16.4) to explore the RAID6 driver's architectural improvements.
Introduction
RAID technology is used to increase the performance and/or reliability of data storage. It also provides the ability to combine several physical disks into one larger virtual device. Linux software RAID can work on most block devices; it doesn't matter whether you use SATA, USB, IDE or SCSI devices, or a mixture. These features of RAID have persuaded many organizations to use it on top of raw devices. For any organization, achieving optimum performance from a given hardware configuration is critical. Attaining this optimization at the raw disk and RAID driver level is possible by understanding and adjusting some configurable parameters at the software layer. Along with these tunables, tuning of I/O tool parameters also plays a vital role in performance optimization.
In this paper, we start with the system configuration used for the RAID performance tuning experiment, followed by the tuning process itself: first raw devices, then the RAID0 driver, and then the RAID6 driver. Each result is accompanied by observations that help readers understand the significance of the parameters used in the test and make the effect of the tunables evident.
For this experiment we considered the RAID0 and RAID6 drivers from Linux Kernels 2.6.32 and 3.16.4. This paper focuses on improving RAID performance only for sequential write I/O. Once the tunables are identified and their meaning is understood, performance tuning for random writes should be readily addressable as well.
System Architecture
Storage Subsystem
Data storage used for this experiment was a DELL JBOD MD3060e chassis with 60 SAS drives, each of 1 TB, model ID ST1000NM0023. Each disk was visible through 4 paths, and each path went through one SAS port. There were 2 SAS cards with 2 ports each, for a total of 4 SAS ports.
Host and interconnect
The host machine had CentOS 6.5 installed on it. Two vanilla kernels, 2.6.32 and 3.16.4, were then installed separately. The host was a dual-processor machine with 64 GB of RAM and hyper-threading enabled. The CPU model was Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz.
I/O Tools Used
For all of the I/O tests explained below, the I/O tool used is fio. fio is an I/O tool used for both
benchmarking and stress/hardware verification. It
is flexible enough to allow detailed workload
setups, and it contains the necessary reporting to
make sense of the data at completion. It has
support for 19 different types of I/O engines (a
term fio uses to signify how IO is delivered to the
kernel), I/O priorities (for newer Linux kernels),
rate I/O, forked or threaded jobs, and much
more. Though initially developed on Linux, the
areas in fio that aren't tied to OS specific features
work on any platform from Windows to HPUX to
Android. There are also native IO engines on
Linux, Windows, Solaris, etc. This is a key feature
for mixed environments where people want to be
able to run (as close to) identical workloads on
different operating environments.
Our Approach
Our approach for this experiment was:
• Start with raw disks and check whether they meet the performance specified by the hardware vendor, and identify any faulty disks. If they do not meet the vendor-specified performance, then identify, understand and tune the relevant raw-disk parameters.
• Verify whether the addition of the RAID0 layer has any negative effect on performance. If it does, work on tuning RAID0-layer tunables to minimize the performance overhead introduced by the RAID0 software layer.
• Check how much performance penalty we incur after introducing the software RAID6 layer in Linux. This performance would be with the default RAID6 tuning parameter settings.
• Examine the RAID6 tuning parameters, understand them, and see their effect first individually and then in combination with other tunables, to decide on an optimized set of tuning parameters.
• Consider Linux Kernels 2.6.32 and 3.16.4 for this exercise. Using multiple kernels shows whether the change in kernel version itself has any effect on RAID6 driver performance, with and without tuning.
Raw Disks Performance Tuning
There are a couple of reasons why we need to start with performance tuning of RAW disks, measuring the performance of individual disks and of various combinations. If a disk is faulty it hampers overall performance, so it is important to check for and eliminate faulty disks at the earliest. Apart from this, it is necessary to identify how the disks should be selected for creation of a RAID or any other disk array. For example, in our setup we had four paths to every disk, so using all these paths efficiently, and taking care that no disk was repeated within a combination, was mandatory to get a linear relationship between disk count and performance.
• We started with testing the bare-metal I/O performance of the raw hardware, while
bypassing as much of the kernel as possible. The script uses sgp_dd to carry out raw sequential disk I/O. It runs with variable numbers of sgp_dd threads to show how performance varies with different request queue depths. The data garnered by this test helped to set expectations for the performance of RAID devices created from these disks. It also verified whether the performance matches the vendor's specifications.
For individual disks, the average performance seen was ~170 MB/s for both sequential write and read. Disks are sensitive to record size, so they should be benchmarked to find the optimal record size. In our case we got optimal performance with a 4 MB record size. I/O size ranged from 1 GB to 2 GB per disk. A sketch of this bare-metal check is shown below.
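The following is a minimal sketch of one such bare-metal read check, driving sgp_dd (from sg3_utils) from Python. The sg device path, thread count and I/O size are assumptions, not values taken from the original test script; it must run as root and reads the raw device directly.

```python
import subprocess

def bare_metal_read(sg_dev: str, threads: int = 4, total_gib: int = 1) -> None:
    """Sequentially read `total_gib` GiB from `sg_dev` using 4 MB transfers."""
    blocks = total_gib * (1 << 30) // 512          # number of 512-byte logical blocks
    subprocess.run(
        ["sgp_dd",
         f"if={sg_dev}",       # e.g. /dev/sg2 (hypothetical sg node of one JBOD disk)
         "of=/dev/null",
         "bs=512",             # logical block size of the drive
         "bpt=8192",           # 8192 blocks per transfer = 4 MB record size
         f"count={blocks}",
         f"thr={threads}",     # number of sgp_dd threads, i.e. the queue depth
         "time=1"],            # have sgp_dd report elapsed time and throughput
        check=True,
    )

if __name__ == "__main__":
    bare_metal_read("/dev/sg2", threads=4, total_gib=1)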
• After this we did buffered I/O with the FIO tool on these devices, with default tunable settings. It showed a significant performance drop compared to the bare-metal I/O performance: the average dropped from 170 MB/s for both sequential write and read to around 70 MB/s for sequential write and 130 MB/s for read. To get back to the bare-metal I/O performance of the raw hardware we needed to tune the RAW disk parameters. We started by examining the effect of individual tunables first, and then combinations. We used the iostat tool to monitor average request size, writes per second and reads per second while checking the effect of the tunables. Here we got optimum performance for an 8 MB block size, and below are the tunables which had a prominent role in the performance improvement (a sketch of applying them follows the list). I/O size ranged from 8 GB to 16 GB per device.
o Scheduler: Changed from default “cfq”
to “deadline”.
o max_sectors_kb: This is the maximum
number of kilobytes that the block
layer will allow for a file system
request. Must be smaller than or
equal to the maximum size allowed by
the hardware. Changed from default
1024 to 8192.
o nr_requests: This controls how many
requests may be allocated in the block
layer for read or write requests.
Changed from default 128 to 1024.
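The sketch below applies these three block-layer settings through the standard sysfs queue attributes. The values come from the list above; the disk names are hypothetical, the script must run as root, and max_sectors_kb of 8192 assumes the HBA's max_hw_sectors_kb allows it.

```python
from pathlib import Path

TUNABLES = {
    "scheduler": "deadline",     # default "cfq"  -> "deadline"
    "max_sectors_kb": "8192",    # default 1024   -> 8192 (must not exceed max_hw_sectors_kb)
    "nr_requests": "1024",       # default 128    -> 1024
}

def tune_disk(disk: str) -> None:
    """Write the block-layer tunables for one disk via /sys/block/<disk>/queue."""
    queue = Path("/sys/block") / disk / "queue"
    for name, value in TUNABLES.items():
        (queue / name).write_text(value)

for disk in ("sdb", "sdc", "sdd"):   # hypothetical member disks
    tune_disk(disk)
```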
Linux Kernel 2.6.32
RAID0 Performance Tuning
After identifying the settings for optimized RAW disk performance, we added a software RAID0 layer above it. With the optimized RAW disk settings and without changing any tuning parameter at the RAID0 layer, the performance drop seen was around 20%: the sequential write speed per disk dropped from 170 MB/s to ~135 MB/s. There were multiple variables which helped in performance tuning of the RAID0 devices, some at the RAID0 layer and some in the FIO tool.
We used 10 data disks for each RAID0 device and created 6 such RAID0 devices for this tuning experiment. We then inspected the effect of chunk size on RAID0 performance; it had a minuscule impact. However, our application on top of the RAID devices gave optimal performance for a single I/O request size of 1 MB, and in the case of RAID6, theoretically optimum performance can be achieved if full-stripe writes are done. Moreover, with RAID6 and a 128K chunk size, the single I/O size becomes 128K * 8 data disks in an
(8+2) RAID6 configuration = 1 MB. Considering all these facts, in all our experiments with RAID0 and RAID6 we settled on a 128K chunk size (see the creation sketch below).
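As an illustration, a single (8+2) RAID6 array with a 128K chunk can be created with mdadm as sketched below, so that one full-stripe write equals 128K * 8 data disks = 1 MB. The member disk names and md device number are assumptions.

```python
import subprocess

# Ten hypothetical member disks: 8 data + 2 parity in the (8+2) layout.
member_disks = [f"/dev/sd{c}" for c in "bcdefghijk"]

subprocess.run(
    ["mdadm", "--create", "/dev/md0",
     "--level=6",            # RAID6: two parity blocks per stripe
     "--raid-devices=10",
     "--chunk=128",          # chunk size in KB, so 128K; full stripe = 128K * 8 = 1 MB
     *member_disks],
    check=True,
)
```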
Calibrating one of the RAID0 tunables, "rq_affinity", revealed a performance improvement. Aside from rq_affinity, adjusting FIO tool options such as block size and iodepth also yielded improvement. We could get back to 170 MB/s sequential read and write speed for the RAID0 devices with rq_affinity changed from 0 to 2, iodepth set to 16, and block size set to 64 MB.
rq_affinity: If this option is '1', the block layer will migrate request completions to the CPU "group" that originally submitted the request. For some workloads this provides a significant reduction in CPU cycles due to caching effects. For storage configurations that need to maximize the distribution of completion processing, setting this option to '2' forces the completion to run on the requesting CPU (bypassing the "group" aggregation logic). We changed it from 0 to 2; a sketch combining this setting with the FIO options is shown below.
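A minimal sketch of this combination follows. The md device name, job size, ioengine and direct-I/O flag are assumptions (the text gives only the block size, iodepth and rq_affinity values), and the text does not say whether rq_affinity was set on the md device or its member disks; here it is set on the md device.

```python
import subprocess
from pathlib import Path

md_dev = "md0"   # hypothetical RAID0 device

# rq_affinity=2: run request completions on the CPU that issued the request.
Path(f"/sys/block/{md_dev}/queue/rq_affinity").write_text("2")

subprocess.run(
    ["fio",
     "--name=raid0-seq-write",
     f"--filename=/dev/{md_dev}",
     "--rw=write",            # sequential write
     "--bs=64m",              # 64 MB block size, as used above
     "--iodepth=16",
     "--ioengine=libaio",     # async engine so iodepth takes effect (assumption)
     "--direct=1",            # assumption; needed for the libaio queue depth to matter
     "--size=64g"],           # per-job I/O size (assumption)
    check=True,
)
```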
In the graph referenced below, a sudden decline in per-data-disk performance is visible after 40 disks. This was not due to inadequate tuning effort; it was caused by the per-SAS-port performance limitation explained below.
The graph shows the effect of disk count on RAID0 performance.
In our setup, each port on the SAS HBA consisted of 4 x 6 Gb/s SAS lanes, so the raw I/O rate supported by one SAS port is 6 Gb/s * 4 phys = 24 Gb/s. After considering the overhead due to encoding, arbitration delays and additional framing, the per-port calculation becomes: (24 Gb/s per port) / (8b/10b encoding) = 2.4 GB/s per port, and 2.4 GB/s * 88.33% (to accommodate arbitration delays and additional framing) = 2.16 GB/s.
Now, as explained above in the Storage Subsystem section, our setup had 2 SAS cards with 2 ports each, so 4 SAS ports in total. Each path went through one SAS port, hence each disk was visible from 4 paths. So for 1 RAID0 device (10 disks), 5 disks were taken from one path and 5 from a second. For 2 RAID0 devices, 5 disks were selected from each path. Similarly, for 4 RAID0 devices, 10 disks were selected from each path, and for 5 RAID0 devices, 15 disks from each of 2 paths and 10 from each of the remaining 2 paths.
While doing I/O on a single RAID0 device we observed from iostat that the I/O rate per disk was around 170 MB/s. Up to 40 disks, the maximum I/O on any path can be (170 * 10) = 1700 MB/s, which is well within the physical limitation of a SAS port, i.e. ~2.1 GB/s. With the 50-disk configuration, the first two paths get bottlenecked whenever the RAID0 driver tries to push 170 MB/s of I/O per disk, i.e. (170 * 15 disks) = 2550 MB/s. Similarly, in the 6-RAID0-device configuration all 4 paths get bottlenecked. The small sketch below reproduces this bandwidth check.
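The sketch below reproduces the arithmetic above. The 0.90 framing/arbitration factor is an assumption chosen to match the ~2.16 GB/s figure quoted in the text (which states an 88.33% factor); the 170 MB/s per-disk rate is from the iostat observation.

```python
LANES_PER_PORT = 4
GBPS_PER_LANE = 6                                    # 6 Gb/s SAS lanes

raw_gbps = LANES_PER_PORT * GBPS_PER_LANE            # 24 Gb/s per port
gbs_after_8b10b = raw_gbps * 8 / 10 / 8              # 2.4 GB/s after 8b/10b encoding
usable_gbs = gbs_after_8b10b * 0.90                  # assumed overhead factor -> ~2.16 GB/s

def path_saturated(disks_on_path: int, mb_per_disk: float = 170.0) -> bool:
    """True if this many ~170 MB/s disks exceed one port's usable bandwidth."""
    return disks_on_path * mb_per_disk / 1000.0 > usable_gbs

print(f"usable per-port bandwidth ~= {usable_gbs:.2f} GB/s")
print(path_saturated(10))   # 1.70 GB/s -> False: within the port limit (up to 40 disks total)
print(path_saturated(15))   # 2.55 GB/s -> True: the path becomes the bottleneck (50+ disks)
```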
RAID6 Performance Tuning
After successfully tuning the RAID0 performance to achieve per-data-disk performance
the same as the RAW disks, we replaced RAID0 with RAID6. With the optimized RAW disk settings and without changing any tuning parameter at the RAID6 layer, the per-data-disk performance drop seen for RAID6 was larger than for RAID0: the sequential write speed per disk dropped from 170 MB/s to ~60 MB/s. As with RAID0, RAID6 tuning also involved parameters at both the RAID6 and FIO tool levels. But even so, we could not achieve the expected performance with the RAID6 driver: the best optimized performance we could achieve was only about 90 MB/s per data disk.
We identified five factors causing this performance drop, and only one of them could be suppressed with tuning parameters; the rest entailed code-level changes. We confirmed our understanding on the Linux RAID mailing list as well.
• The first problem is pre-reads happening even while writing full stripes. Even if we do full-stripe writes in bulk, the write request arrives at the RAID6 driver in smaller chunks, and the driver doesn't always decide correctly whether it should wait for more writes to arrive or start reading right away.
• Consider the situation where some of the blocks of a full-stripe write have been dirtied and a few are remaining. The md/raid driver may wait in the hope that subsequent I/Os will dirty the entire stripe, or it may choose to do either a read-modify-write or a reconstruct-write.
• Unfortunately the RAID6 driver often chooses the reconstruct-write option. This causes a pre-read of the remaining chunks if they are not up to date and also recalculates parity, which hampers performance. To avoid this we found the tunables "stripe_cache_size" and "preread_bypass_threshold".
• stripe_cache_size: Number of entries in the stripe cache. This is writable, but there are upper and lower limits (32768, 16). Default is 128. Changed to 32768.
• preread_bypass_threshold: Number of times a stripe requiring preread will be bypassed by a stripe that does not require preread. For fairness defaults to 1. Setting this to 0 disables bypass accounting and requires preread stripes to wait until all full-width stripe-writes are complete. Valid values are 0 to stripe_cache_size. Changed to 32768. A sketch of setting both tunables follows this list.
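Both attributes live under the md sysfs directory of the array. A minimal sketch, assuming a hypothetical /dev/md0 and root privileges:

```python
from pathlib import Path

md = Path("/sys/block/md0/md")   # md sysfs directory of the RAID6 array

md.joinpath("stripe_cache_size").write_text("32768")          # default 128, upper limit 32768
md.joinpath("preread_bypass_threshold").write_text("32768")   # default 1, valid 0..stripe_cache_size
```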
• The second reason is the breakup of stripes at the RAID6 driver level. Irrespective of the chunk size the user sets, the RAID6 driver breaks it into 4K units, so for the RAID6 driver each stripe is 4K * (number of disks). These stripes are then handled one by one, which causes severe performance damage.
• The third reason is the RAID6 layer maintaining its own device data cache. Instead of passing the bio and its pages from the upper layer to the low-level driver, the RAID6 layer first copies the data to be transferred (in either direction) into its cache and performs the read/write operation asynchronously. This caching is helpful when the data transfer size is less than the page size, but for full-page reads and writes it is just an overhead which impacts performance. Moreover, in this kernel there is no tunable available to avoid this data copying for full-page writes.
The graph below shows the effect of stripe_cache_size on RAID6 device write performance. The RAID6 device is created from 10 disks, out of which 8 were data disks and 2 were parity disks.
• The fourth reason is inefficient locking decisions. As discussed earlier, the RAID6 layer handles a single stripe as 4K * (number of disks). Even though an operation in the RAID layer needs to be done only on a single stripe, the lock is taken on the entire RAID6 device, which blocks other operations in the RAID6 layer. This type of locking in the RAID layer affects performance.
• The fifth reason is a single thread handling RAID6-layer requests. Whenever a write request lands at the RAID6 layer, the layer copies the data to be transferred into its stripe cache instead of sending it to the lower-level device, and handles that write request asynchronously at a later time. To handle requests asynchronously, the RAID6 layer has a separate thread running which processes these requests. Even if the user application is a multithreaded program sending write requests in bulk, the RAID6 layer handles these requests one by one on just a single thread. This becomes a bottleneck and causes a drop in performance.
The graph below shows the effect of iodepth on RAID6 device write performance. The RAID6 device is created from 10 disks, out of which 8 were data disks and 2 were parity disks.
Linux Kernel 3.16.4
RAID0 Performance Tuning
There was not much difference in RAID0 performance between Linux kernel 3.16.4 and Linux kernel 2.6.32. The set of optimized parameters decided while tuning RAID0 performance on kernel 2.6.32, when applied to RAID0 on kernel 3.16.4, showed the same performance, matching the RAW disk performance of 170 MB/s per data disk. Moreover, the observations on the effect of disk count on RAID0 performance also remain the same.
RAID6 Performance Tuning
After observing the drop in performance of the RAID6 driver relative to the RAID0 driver in Linux kernel 2.6.32, we tried the same performance experiment with the RAID6 driver of the newer kernel 3.16.4. With the optimized RAW disk settings and without changing any tuning parameter at the RAID6 layer, the per-data-disk performance seen with RAID6 on the 3.16.4
kernel was slightly higher than with RAID6 on the 2.6.32 kernel, ranging from 60 MB/s to 70 MB/s per data disk. But RAID6 in the 3.16.4 kernel has more tunable parameters than in 2.6.32, and optimizing these tunables in the right combination helped us bring performance back on par with RAID0's per-data-disk performance.
Write speed per data disk for RAID6 ranged from 145 MB/s to 160 MB/s for various disk counts from 10 to 40 disks. As explained in the RAID0 tuning section for Linux Kernel 2.6.32, beyond 40 disks the per-SAS-port performance limitation comes into the picture and per-data-disk performance drops to around 120 MB/s.
Of the five reasons explained above for the RAID6 driver of Linux Kernel 2.6.32, the first could be resolved with the same tuning parameters as for RAID6 in 2.6.32. The second reason, the breakup of stripes at the RAID6 driver level, persists in the newer 3.16.4 kernel as well. But the remaining three reasons could be overcome with newly added tunables and features in the RAID6 driver of the 3.16.4 kernel.
For the factor of the RAID6 layer maintaining its own device data cache, we found a tunable named skip_copy which can be set to avoid copying data into the stripe cache when the data transfer size is a full page; instead, the bio page is used directly for the data transfer. Locking decisions are also improved in this kernel compared to 2.6.32: it still takes a full device lock which can block the entire RAID6 device, but its usage has been reduced and the lock is taken only when mandatory. For the fifth reason, single-threaded handling of RAID6-layer requests, the 3.16.4 kernel has a tunable named group_thread_cnt; instead of a single thread handling requests asynchronously, the RAID layer now supports groups of multiple threads handling requests in parallel to improve performance.
All the above-mentioned tunables helped improve serial write I/O performance. For read performance, the tunable below assisted us.
read_ahead_kb: This setting controls how much extra data the operating system reads from disk when performing I/O operations. This tunable exists at both the RAW disk and RAID software layers. A sketch applying these 3.16.4-specific settings is shown below.
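A minimal sketch of the 3.16.4-era knobs follows. The device names, the group_thread_cnt value of 4 and the read_ahead_kb value of 8192 are assumptions rather than figures from the text, and group_thread_cnt is shown only for completeness: the Conclusion notes that raising it did not help in these runs. Run as root.

```python
from pathlib import Path

md_dev = "md0"                                   # hypothetical RAID6 device
md = Path(f"/sys/block/{md_dev}/md")

md.joinpath("skip_copy").write_text("1")         # skip the stripe-cache copy for full-page writes
md.joinpath("group_thread_cnt").write_text("4")  # >0 enables grouped worker threads (4 is an assumption)

# read_ahead_kb exists on both the md device queue and the raw disk queues;
# the 8192 value here is an assumption, not a figure from the text.
Path(f"/sys/block/{md_dev}/queue/read_ahead_kb").write_text("8192")
for disk in ("sdb", "sdc"):                      # hypothetical member disks
    Path(f"/sys/block/{disk}/queue/read_ahead_kb").write_text("8192")
```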
The graph below shows the effect of iodepth on RAID6 device write performance. The RAID6 device has been created from 10 disks, out of which 8 were data disks and 2 were parity disks.
The graph below shows the effect of stripe_cache_size on RAID6 device write performance. The RAID6 device is created from 10 disks, out of which 8 were data disks and 2 were parity disks.
Conclusion
Operating system default configuration settings for block devices and for software md/RAID5 or md/RAID6 devices are not enough to get the best performance out of the system. Depending upon the hardware and the type of expected I/O pattern, we need to tune the system to achieve optimum performance. In our experiments, we needed to tune the raw disks (block layer) and the md/RAID6 devices, and could achieve the theoretical optimum performance for our given hardware and expected I/O pattern.
• For a single (8+2) RAID6 configuration, better performance was seen with an I/O block size >= 4 MB and iodepth >= 64. Significant performance improvement was seen with an increase in stripe_cache_size from the kernel default to the maximum allowable value, i.e. 32768.
• For a single (8+2) RAID6 configuration, the serial write performance seen is around 155-160 MB/s per data disk for Linux Kernel 3.16.4. For the 6*(8+2) RAID6 configuration, the write performance seen is around 125 MB/s per data disk, i.e. around 6 GB/s for the entire 6*(8+2) RAID6 configuration. If the SAS limitation in our current setup, as discussed above, were resolved, the 6*(8+2) RAID6 configuration could also achieve write performance of 155-160 MB/s per data disk.
• A performance improvement was seen with the skip_copy option (the 3.16.4 kernel's built-in zero-copy feature).
• No performance improvement was seen from increasing the group_thread_cnt option, i.e. from increasing the number of threads in the 3.16.4 kernel RAID6 driver's multithreading feature. Rather, degradation in performance was seen as the number of threads was increased from the default (0) to higher values.
Future Work
In future work, RAID6 performance tuning can be done for random reads/writes. The same approach can be followed and the same tunables can be used. The only difference is that, instead of serial reads/writes, random reads/writes will be done and the tunable values will be optimized accordingly.
References
 FIO tool: http://freecode.com/projects/fio
 Linux RAID mailing list: http://marc.info/?l=linux-raid
 Kernel documentation, block queue sysfs parameters: https://www.kernel.org/doc/Documentation/block/queue-sysfs.txt
 Kernel documentation, md/raid sysfs parameters: https://www.kernel.org/doc/Documentation/md.txt
 Linux RAID wiki, Performance: https://raid.wiki.kernel.org/index.php/Performance
 http://www.cyberciti.biz/tips/linux-raid-increase-resync-rebuild-speed.html
 http://blog.jamponi.net/2013/12/sw-raid6-performance-influenced-by.html