Building Storage on the Cheap
A Journey with SIGLabs
School of Computing
National University of Singapore
• Yap Yao Jun – yjyap87@gmail.com
• Year 4 – Information Systems (InfoSec)
• Joined SIGLabs (a.k.a. Student Network
Associates) in January 2010
About
• Since Late 2010
• Really Huge Storage @ Really
Cheap Price
• $5K to build the P.O.C.
• 2 storage pods were built
(45TB and 135TB)
SIGLabs JBOD Project
• Supermicro
• Maximum of 36 Drives
• ≈ SGD$5,200 (w/o RAM, CPU, HDD)
• Backblaze 45 Drives
• Chassis Only
• USD$872 ≈ SGD$1,133
• Complete (w/o HDD)
• USD$5,395 ≈ SGD$7,013
The Chassis Hunt
• Shipping cost > USD$250
• Really huge box with lots
of bubble wrap
• Flown in over CNY2011
Flown in by SQ from the USA
• Sil3726 Chip
• From Protocase @ USD$60 each
• USD$540 ≈ SGD$700
Multiplier Card
• Sil3124
• PCI-e (1x) Interface
• 4 SATA II Ports
• era-adapter.com @ USD$59.95 each
• USD$179.85 ≈ SGD$234
SATA Expansion Card
P.O.C Build
• AMD Athlon™ II X4 640 Processor
• 2 x 4GB DDR3 RAM
• PSU
• iCute 500W
• Cooler Master 460W
• Seasonic X-760
• 500GB OS HDD
• 10 x 2TB Seagate LP Drives
Storage Pod #1
But!
• Stand-offs
• Clearance
• 30mm brass standoff is too tall!
• 20mm brass standoff is too short!
• Cheap Y-Molex power connectors are too tall!
• Power Issues
• iCute → Really Ugly AND Not Adorable at all
• Replaced by the modular Seasonic X-760
• Simultaneous PSU firing
• 2nd PSU is toggled on using an external switch
Initial Setbacks
• Improvise
• Al-cheapo Standoff
• Green Wall Plugs, 7mm
• Too Long!
• 1” → 0.75”
• 8 per Multiplier Card
• Cut 72 of them
• SGD$2 for 100 pieces. Cheap!
• Otherwise it’s USD$29 for the ones
Backblaze is using.
Stand-Offs
• Bought Ready Made
• Daisy Chaining
Right-Angle 4-Pin Molex
Working. Working. Working.
PSU Hack
Taking the pin out
Connecting a signal cable
directly to the MB socket
Connecting directly to the
PSU
Broken Pin Extractor
The plastic guides were too
small → cut
Managed to fit 3.5” HDDs into the enclosure
Can power up 10 HDDs (no dropouts)
Can power on the 2 PSUs simultaneously
Can do Linux RAID (mdadm)
Can export NFS
Project was supposed to continue but in the
later part of 2011…
Phase 1 Achievements
• HDD Prices Sky-rocketed
• Thailand Floods
• Prices only recovered after May 2012
• Project was shelved
The Great HDD Shortage of 2011
Really filling up the Storage Pod with
Hard Disks
August 2012
• 47 Seagate 1TB ES2
• 1 Hitachi 1TB
• Unable to power them…
• again…
48 x 1TB Drives from Sun x4500
• POST: 33 drives on
• Building RAID, drives get
dropped while building
• When 1 drive drops, the whole
array on that multiplier drops
• No success
• At best, 27 drives
Erm…
• Multi-meter to measure voltage at the
multiplier card
• 12V was around 11V.
• Under-volt. Tsk.
• Get “raw”-er materials
from Sim Lim Tower
• 2x3 Mini Female Molex
• Fits into Seasonic X-760
• Kudos to modular PSUs
• 4 Pin Disk Drive Molex
• 16 AWG wire
Problem. Again.
Custom Wiring
16AWG wires
Punch In
My Punch Tool
& Cable Stripper
Negative Example
• Got a Second Seasonic X-760
• Each multiplier card has its
own dedicated power wiring
• Stable 12V; Stable 5V
• No more HDD array drops
• Replicated to all 9 multiplier
cards
Problem? Solved.
Backblaze Configuration
[Diagram: 45 drives enumerated as /dev/sdb through /dev/sdat]
RAID-6 + JFS
• # fdisk /dev/sdb
• Partition ID = “fd” → Linux RAID autodetect
• # mdadm --create /dev/md0 --level=6 --raid-devices=15 /dev/sdb1 /dev/sdc1 …
• # mkfs.xfs /dev/md0
RAID-6
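To sanity-check an array like this before trusting it, the standard mdadm housekeeping (shown here as a sketch, not from the original build notes) is:
# cat /proc/mdstat                            # watch the build/resync progress of md0
# mdadm --detail /dev/md0                     # state, layout and member disks
# mdadm --detail --scan >> /etc/mdadm.conf    # persist, so the array reassembles at boot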
/sys/block || /dev – Before reboot vs. After reboot
# reboot
[Diagram: the same 45 devices (sdb–sdat) listed before and after the reboot; the kernel enumerates them in a different order each boot]
• Renaming partitions
• Linux RAID = ‘fd’
• The arrays are built on partitions, not whole drives
• Partitions end with a number
• udev rules to target the partitions
• Using the HDD serial number
Hacking udev
/etc/udev/rules.d/90-jbod.rules
KERNEL=="sd*[0-9]", ENV{ID_SERIAL_SHORT}=="5YD1GW0A", NAME="jbod1a"
…
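Writing 45 such rules by hand is tedious; a small loop can emit the skeletons. This is a hypothetical helper — the serial-to-slot mapping (which serial becomes jbod1a, jbod1b, …) still has to be assigned by hand:
#!/bin/bash
# print a rule stub for every whole disk, using udev's own serial lookup
for d in /dev/sd[a-z] /dev/sda[a-z]; do
  [ -b "$d" ] || continue
  serial=$(udevadm info --query=property --name="$d" | sed -n 's/^ID_SERIAL_SHORT=//p')
  echo "KERNEL==\"sd*[0-9]\", ENV{ID_SERIAL_SHORT}==\"$serial\", NAME=\"jbodXX\""
done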
Partitions at /dev
jbod1a … jbod1o, jbod2a … jbod2o, jbod3a … jbod3o
(45 stable names, in three groups of 15)
• It Works!
• Solved the problem
of drives being
loaded at different
times at boot
Not Elegant
FreeNAS! “Worked Out of the Box!”
Running off an $8 USB Thumb Drive!
• Zettabyte File System
• File System and RAID Engine in One
• Copy on Write
• Scrubbing prevents silent corruption
• Incremental Snapshots
• Transparent Compression
• Deduplication
• Zpools
• Grow-able
• A mirror vdev can grow
• But a RAID-Z vdev cannot grow
• http://youtu.be/CN6iDzesEs0?t=3m25s
ZFS
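A taste of what those features look like in use (dataset and snapshot names invented):
# zfs set compression=on tank/mirror      # transparent compression, per dataset
# zfs snapshot tank/mirror@2013-03-01     # instant, nearly free snapshot
# zfs snapshot tank/mirror@2013-03-08
# zfs send -i tank/mirror@2013-03-01 tank/mirror@2013-03-08 > weekly.incr   # incremental stream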
• Single Parity
• Double Parity
• Minimum 3 Drives
• RAID 6 needs a minimum of 4 Drives
• Triple Parity → Only in ZFS
RAID-Z
• Performance
• ZFS
• Ease of Administration
• Web GUI
• NFS
• iSCSI
• AFP
• rsync
FreeNAS – NAS made Easy
Uh-oh!
This shit happens every Tuesday night! → #swapoff
• FreeNAS 8.3.0 Beta!
• No more page faults on Tuesday nights
Let’s go Cutting Edge!
• What drive is what drive?
• Possible to pin-point, but very troublesome
• HDD Serial Number
• # camcontrol devlist
• Configuration
• kern.geom.label.gptid.enable="0"
• kern.geom.label.ufsid.enable="0"
• Label Partitions
• # gpart modify -i 1 -l drivename1 ada0
Problem
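A hedged sketch of how the serial-to-device mapping could be collected in one pass on this all-SATA box (loop and label names illustrative):
for d in $(sysctl -n kern.disks); do
  echo -n "$d: "
  camcontrol identify "$d" | grep -i 'serial number'
done
Once labelled with gpart, each partition also appears under a stable path, e.g. /dev/gpt/drivename1.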
• Logs
• Not retained → Hard to debug!
• Used a script to symlink logs to persistent storage
• https://raw.github.com/jag3773/FreeNAS-Change-Logging/master/FreeNAS-Change-Logging.sh
• Swap
• FreeNAS “defaults” every HDD to 2GiB of swap
• We got 45 Disks = 90GiB of Swap (Madness!)
• # swapoff
• Export via NFS as nfsnobody:nfsnogroup
• Issue with user permissions
• rsync Debian repositories
• chmod 02775
• Maproot user = root
Problem
Replacement of Hard Disks
Powering 45 Drives
Ease of use via FreeNAS
ZFS
Living in CR1 with Gigabit uplink
Serves NUS Mirror Storage Needs
Psst… It’s performing better
than the X4500…
Phase 2 Achievements
For “Production”
• Intel Core i3-3220 (Ivy Bridge)
• H77 Chipset
• 16GB DDR3 RAM
• 2x Seasonic X-760 PSU
• 45 x Seagate 3TB (4k Aligned)
Storage Pod #2
Note: SATA cables have to be nudged in the “downward” direction, else they may have connection problems
• CentOS 6.3 – 64-bit
• ZFS-on-Linux 0.6.0-rc14
• World Wide Names (WWN)
• Bash-scripted rule generator
• Elegance in the management
of naming hard drives
What’s Different?
• WWN & vdevs
• /dev/disk/by-vdev/[1-9][a-e]
• /etc/zfs/vdev_id.conf.[alias, multipath, sas_direct, sas_switch]
• Alias
• WWN – World Wide Name
• It’s like a MAC address, but for Hard Disks
• Creates symlinks to sd*
• Allows other programs to run normally
• S.M.A.R.T. → smartd → Alert via email
ZFS-on-Linux
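A minimal sketch of the alias flavour of that file (the WWNs here are invented; one line maps one slot name to one disk):
# /etc/zfs/vdev_id.conf
alias 1a /dev/disk/by-id/wwn-0x5000c50041aaaaaa
alias 1b /dev/disk/by-id/wwn-0x5000c50041aaaaab
…
alias 9e /dev/disk/by-id/wwn-0x5000c50041aaaae9
After a udevadm trigger, the stable names appear as /dev/disk/by-vdev/1a … 9e.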
• Divisors of 45
• 1x45
• 3x15
• 5x9
• 9x5
• 15x3
What can we do with 45 Drives?
[Diagram, repeated for each layout: the 45 slots labelled 1a–9e (nine columns of five), grouped into the vdevs of that layout]
With ZFS-on-Linux, Bonnie++ and fio
• Scripted testing
• “Layouts”
• RAID-Z level
• Compression
• Access time
• Synchronization
Pursuit of the “Best” performance
• Hybrid Storage
• ARC RAM
• L2ARC SSD
• ZIL SSD (SLC)
• HDD Main Storage Pool
• Better Performance
• atime=off
• sync=disabled
• compression=lz4
• Debatable: if better CPU…
ZFS – Arbudens
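Putting the pieces together, one candidate layout could be built roughly like this (pool name illustrative; vdev aliases from vdev_id.conf above; bash brace expansion builds each 9-wide group):
# the “5x9” layout: 5 raidz1 vdevs of 9 drives each
zpool create tank \
  raidz1 /dev/disk/by-vdev/{1..9}a \
  raidz1 /dev/disk/by-vdev/{1..9}b \
  raidz1 /dev/disk/by-vdev/{1..9}c \
  raidz1 /dev/disk/by-vdev/{1..9}d \
  raidz1 /dev/disk/by-vdev/{1..9}e
zfs set atime=off tank
zfs set sync=disabled tank     # fast, but can lose in-flight writes on power cut
zfs set compression=lz4 tank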
• # bonnie++
-d /mnt/zfspool/   # Directory to test
-s 32G             # Double the RAM is recommended
-n 2:16M:8k        # 2x1024 files; max 16M; min 8k
-r 8G              # Amount of RAM to use
-z 1361377183      # Random seed
-u root            # Run as root
-b                 # No write buffering: fsync()
Bonnie++
Bonnie++ Result
[Chart: Write Block K/Sec and Read Block K/Sec (0–1,500,000) for each layout: 1x45 jbod, 1x45 raidz1–3, 3x15 raidz1–3, 5x9 raidz1–3, 9x5 raidz1–3, 15x3 raidz1–2]
• Best Sequential Write
(atime=off, sync=disabled, compression=lz4)
• 5x9 raidz1
• Tolerates 1 disk
failure per vdev
Best Performing Layout
[Chart: Sequential Input Block K/Sec for the top ten configurations, roughly 970,000–1,000,000; 5x9-raidz1 with atime=off listed first]
• Best Sequential Read
(atime=off, sync=disabled, compression=lz4)
• 3x15 raidz1
• Tolerates 1 disk
failure per vdev
Best Performing Layout
[Chart: Sequential Output Block K/Sec for the top ten configurations, roughly 1,300,000–1,380,000; dominated by 3x15-raidz1 and 15x3-raidz1 variants]
• Very High CPU Utilization
• Includes latency reports too
• Complete test has 360 results
• Includes testing of various compression algorithms
• lzjb, zle, gzip[1-9]
Bonnie++ Results
• IOPS = (MBps Throughput / KB per IO) * 1024
• e.g.
598 / 4 * 1024 = 153,088 IOPS
564 / 4 * 1024 = 144,384 IOPS
IOPS
Device | Type | Random 4KB IOPS (Write) | Random 4KB IOPS (Read)
Intel 520 Series | MLC SSD | Up to 80,000 | Up to 50,000
Intel 313 Series | SLC SSD | 4,000 | Up to 36,000
Seagate 15K.7 SAS | SAS HDD | ≈ 121 | ≈ 118
WD Velociraptor 10k | SATA HDD | ≈ 111 | ≈ 82
Seagate ES2 7.2K | SAS/SATA HDD | ≈ 80 | ≈ 189
Seagate Barracuda | SATA HDD | ≈ 114 | ≈ 261
Known IOPS
*These figures are collected from various sites found by googling for IOPS;
different sites use different testing methods.
• How to test “Random”?
• Flexible IO (fio)
• But ZFS does not support Direct IO
• Testing against a hybrid storage
• Buffered IO; not Direct IO
• Quite exhaustive
FIO - IOPS 4K Random
[testname]
rw=randwrite    # or randread / randrw / read / write / rw
size=32G        # Double RAM
directory=<testdir>
numjobs=<number of cores>
group_reporting   # Combined results
bs=4k
thread
write_iops_log=testname   # Logs
FIO
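Running a job file is then a one-liner (file names assumed):
# fio --output=5x9-raidz1-randwrite.txt randwrite.fio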
Random 4K IOPS Write Test
[Chart: IOPS vs. Time (ms), 4K random write on the 5x9 raidz1 pool]
Start: 34,232 IOPS → approaches about 100 IOPS eventually
Random 4K IOPS Read Test
[Chart: IOPS vs. Time (ms), 4K random read, 1 thread, on the 3x15 raidz2 pool]
Apparently 4K random read IOPS are very good with ZFS
Random 4K IOPS Read Test
[Chart: IOPS vs. Time (ms), 4K random read, 4 threads, on the 3x15 raidz2 pool]
Apparently 4K random read IOPS are very good with ZFS
• CPU bottleneck, or is it really?
Post-FIO Testing Thoughts
• More RAM!
• Better Network Interface – “Intel NICs”
• SSD Write Cache
• ZFS Intent Log (ZIL) → SLC SSD Drives
(e.g. Intel 313 SSD Series)
• Sync Writes → Async Writes
• Mirrored
• SSD Read Cache
• L2ARC Cache → MLC SSD Drives
Areas to Improve Performance
[Diagram: the application writes through an SSD write cache (ZIL) and reads through an SSD read cache (L2ARC) in front of the disk storage]
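A sketch of how those two caches would be attached to an existing pool (device paths invented):
# zpool add tank log mirror /dev/disk/by-id/ata-SLC_SSD_A /dev/disk/by-id/ata-SLC_SSD_B   # mirrored ZIL
# zpool add tank cache /dev/disk/by-id/ata-MLC_SSD_C                                       # L2ARC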
• Android App Available!
Monitoring ZFS
• DRBD → Linux
• HAST → FreeBSD
• Lustre
• Ceph
• GlusterFS
What’s Next?
Network Replicated Storage
Scalable Network Storage
Questions?
Thank You!