Local file systems update
Red Hat
Lukáš Czerner
February 23, 2013
Copyright © 2013 Lukáš Czerner, Red Hat.
Permission is granted to copy, distribute and/or modify this
document under the terms of the GNU Free Documentation
License, Version 1.3 or any later version published by the Free
Software Foundation; with no Invariant Sections, no Front-Cover
Texts, and no Back-Cover Texts. A copy of the license is included
in the COPYING file.
Agenda



 1   Linux file systems overview
 2   Challenges we’re facing today
 3   Xfs
 4   Ext4
 5   Btrfs
 6   Questions?
Part I
Linux kernel file systems overview
File systems in Linux kernel


   Linux kernel has a number of file systems
       Cluster, network, local
       Special purpose file systems
       Virtual file systems
   Close interaction with other Linux kernel subsystems
       Memory Management
       Block layer
       VFS - virtual file system switch
   Optional stackable device drivers
       device mapper
       mdraid
The Linux I/O Stack Diagram
[Figure: The Linux I/O Stack Diagram, version 0.1, 2012-03-06, outlining the Linux I/O stack as of kernel version 3.3. Layers from top to bottom: applications (processes); VFS with block based, network, pseudo and special purpose file systems, plus the page cache and direct I/O (O_DIRECT); block I/O layer with optional stackable devices (mdraid, device mapper, drbd, lvm) working on bios; I/O scheduler (cfq, deadline, noop) mapping bios to requests; SCSI upper, mid and low layers and other device drivers; physical devices.]
http://www.thomas-krenn.com/en/oss/linux-io-stack-diagram.html
Created by Werner Fischer and Georg Schönberger
License: CC-BY-SA 3.0, see http://creativecommons.org/licenses/by-sa/3.0/
Most active local file systems




  File system   Commits   Developers   Active developers
  Ext4            648        112              13
  Ext3            105        43                2
  Xfs             650        61                8
  Btrfs           1302       114              21
Number of lines of code
[Figure: lines of code (y-axis, 10000 to 80000) over time (x-axis, 01/01/05 to 01/01/14) for Ext3, Btrfs, Xfs and Ext4.]
Part II
Challenges we’re facing today
Scalability



    Common hardware storage capacity increases
        You can buy a single 4TB drive for a reasonable price
        Bigger file systems and file sizes
    Common hardware computing power and parallelism increase
        More processes/threads accessing the file system
        Locking issues
    I/O stack designed for high-latency, low-IOPS devices
        Similar problems were already solved in the networking subsystem
Reliability



    Scalability and Reliability are closely coupled problems
    Being able to fix your file system
        In reasonable time
        With reasonable memory requirements
    Detect errors before your application does
        Metadata checksumming
        Metadata should be self describing
        Online file system scrub
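
What "self describing" metadata buys can be shown with a small sketch. The header layout below (magic, file system UUID, block number, owner) is hypothetical and invented for illustration; it is not the on-disk format of any of the file systems discussed here. The point is that a block which records what it is and where it belongs lets the reader notice a misplaced or foreign block before the application sees bad data.

```python
import struct
import uuid

# Hypothetical header layout, invented for this sketch:
# 4-byte magic, 16-byte file system UUID, 8-byte block number, 8-byte owner id.
HEADER = struct.Struct("<4s16sQQ")
MAGIC = b"MDB1"

def make_block(fs_uuid, blkno, owner, payload):
    """Prefix the payload with a header saying what the block is and where it belongs."""
    return HEADER.pack(MAGIC, fs_uuid.bytes, blkno, owner) + payload

def check_block(raw, fs_uuid, expected_blkno):
    """Compare the header with what the reader expected; return a list of problems."""
    magic, raw_uuid, blkno, _owner = HEADER.unpack_from(raw)
    problems = []
    if magic != MAGIC:
        problems.append("bad magic")
    if raw_uuid != fs_uuid.bytes:
        problems.append("wrong file system")
    if blkno != expected_blkno:
        problems.append("misplaced block")
    return problems

fs = uuid.uuid4()
block = make_block(fs, blkno=42, owner=7, payload=b"inode data")
print(check_block(block, fs, 42))   # block read from the location it was written to
print(check_block(block, fs, 43))   # same bytes found at a different location
```

A checksum over the payload alone would not catch the second case, because the misplaced block is internally consistent.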
New types of storage

   Non-volatile memory
       Wear levelling more-or-less solved in firmware
       Block layer has its IOPS limitations
       We can expect bigger erase blocks
   Thinly provisioned storage
       Lying to users to get more from expensive storage
       File systems can throw away most of their locality optimizations
       Cuts down performance
       Device mapper dm-thinp target
   Hierarchical storage
       Hide inexpensive slow storage behind expensive fast storage
       Performance depends on working set size
       Improves performance
       Device mapper dm-cache target, bcache
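
The remark that hierarchical storage performance depends on working set size can be illustrated with a toy simulation. This is a minimal sketch and not a model of dm-cache or bcache: it assumes an LRU replacement policy and uniformly random accesses over the working set, and the function name and parameters are made up for the example.

```python
import random
from collections import OrderedDict

def hit_rate(cache_blocks, working_set_blocks, accesses=20000, seed=0):
    """Simulate an LRU cache on fast storage in front of slow storage.

    Assumption of this sketch: accesses are uniformly random over the
    working set. Real workloads are usually skewed.
    """
    rng = random.Random(seed)
    cache = OrderedDict()
    hits = 0
    for _ in range(accesses):
        block = rng.randrange(working_set_blocks)
        if block in cache:
            hits += 1
            cache.move_to_end(block)        # mark as most recently used
        else:
            cache[block] = True
            if len(cache) > cache_blocks:
                cache.popitem(last=False)   # evict the least recently used block
    return hits / accesses

for ws in (500, 1000, 2000, 8000):
    print(f"working set {ws:5d} blocks: hit rate {hit_rate(1000, ws):.2f}")
```

As long as the working set fits into the fast device, nearly every access is a hit; once it grows past the cache size, the hit rate falls off and the slow device dominates.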
Maintainability issues
   More file systems with different use cases
       Multiple sets of incompatible user space applications
       Different sets of features and defaults
       Each file system has different management requirements
   Requirements from different types of storage
       SSD
       Thin provisioning
       Bigger sector sizes
   Deeper storage technology stack
       mdraid
       device mapper
       multipath
   Having a centralized management tool is incredibly useful
   Having a central source of information is a must
   System Storage Manager http://storagemanager.sf.net
Part III
What’s new in xfs
Scalability improvements

   Delayed logging
       Impressive improvements in metadata modification performance
       Single threaded workload still slower than ext4, but not much
       With more threads scales much better than ext4
       On-disk format change
   XFS scales well up to hundreds of terabytes
       Allocation scalability
       Free space indexing
   Locking optimization
   Pretty much the best choice for beefy configurations with lots of storage
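
Why delayed logging helps metadata-heavy workloads can be sketched with a rough model. It captures only the general idea (changes are collected in memory, and an object modified many times is written to the log once per checkpoint), not the actual XFS implementation. The function names, the workload and the checkpoint interval are assumptions made for the example.

```python
def log_writes_immediate(modifications):
    """Every modification is written to the log as it happens."""
    return len(modifications)

def log_writes_delayed(modifications, checkpoint_every):
    """Modifications are collected in memory; a checkpoint writes each dirty object once."""
    writes = 0
    dirty = set()
    for i, obj in enumerate(modifications, start=1):
        dirty.add(obj)                   # re-modifying a dirty object adds no log traffic
        if i % checkpoint_every == 0:
            writes += len(dirty)
            dirty.clear()
    return writes + len(dirty)           # final checkpoint flushes whatever is left

# Hypothetical workload: 1000 modifications that keep touching the same 10 objects
workload = [f"inode-{i % 10}" for i in range(1000)]
print(log_writes_immediate(workload))
print(log_writes_delayed(workload, checkpoint_every=100))
```

The more often a workload touches the same metadata, the larger the gap between the two counts.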
Reliability improvements



   Metadata checksumming
       CRC to detect errors
       Metadata verification as it is written to or read from disk
       On-disk format change
   Future work
       Reverse mapping allocation tree
       Online transparent error correction
       Online metadata scrub
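
Verifying metadata as it is written to and read from disk follows a simple pattern, sketched below. The block layout (a 4-byte checksum in front of the payload) and the function names are hypothetical. `zlib.crc32` is used as a stand-in because it ships with Python, whereas XFS uses the CRC32c polynomial.

```python
import struct
import zlib

def write_verifier(payload):
    """Compute the checksum just before the block goes to disk and store it with the payload."""
    return struct.pack("<I", zlib.crc32(payload)) + payload

def read_verifier(raw):
    """Recompute the checksum right after the block comes back from disk."""
    (stored,) = struct.unpack_from("<I", raw)
    payload = raw[4:]
    if zlib.crc32(payload) != stored:
        raise ValueError("metadata checksum mismatch")
    return payload

on_disk = write_verifier(b"directory block contents")
print(read_verifier(on_disk))

# Flip one bit to imitate silent corruption on the device
damaged = bytearray(on_disk)
damaged[10] ^= 0x01
try:
    read_verifier(bytes(damaged))
except ValueError as err:
    print("detected:", err)
```

A checksum like this only detects the damage. Repairing it needs redundancy or additional information, which is what the future work items above are about.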
Part IV
What’s new in ext4
Scalability improvements
   Based on very old architecture
       Free space tracked in bitmaps on disk
       Static metadata positions
       Limited size of allocation groups
       Limited file size (16TB)
       Advantages are a resilient on-disk format and backward and forward compatibility
   Some improvements with bigalloc feature
       Groups a number of blocks into clusters
       Cluster is now the smallest allocation unit
       Trade-off between performance and space utilization efficiency
   Extent status tree for tracking delayed extents
       No longer need to scan page cache to find delalloc blocks
   Scalability is very much limited by design, on-disk format and backwards compatibility
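
The bigalloc trade-off between performance and space utilization can be made concrete with a little arithmetic. This is a sketch under stated assumptions: a 4096-byte block size, a made-up mix of file sizes, and a cost model that only counts allocation units and the unused tail of each file's last cluster.

```python
import math

BLOCK_SIZE = 4096  # bytes; a common ext4 block size, assumed here

def allocation_cost(file_sizes, blocks_per_cluster):
    """Return (allocation units to track, bytes wasted) for the given file sizes."""
    cluster_bytes = BLOCK_SIZE * blocks_per_cluster
    units = 0
    wasted = 0
    for size in file_sizes:
        clusters = math.ceil(size / cluster_bytes)
        units += clusters
        wasted += clusters * cluster_bytes - size   # unused tail of the last cluster
    return units, wasted

# Hypothetical mix: many small files and a few large ones
files = [2_000] * 1000 + [1_000_000_000] * 4

for bpc in (1, 16):
    units, wasted = allocation_cost(files, bpc)
    print(f"{bpc:2d} blocks/cluster: {units} units, {wasted} bytes wasted")
```

Larger clusters mean far fewer units for the allocator to track, which helps with big files, but every small file now occupies at least one whole cluster.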
Reliability improvements



   Better memory utilization of user space tools
       No longer stores whole bitmaps - converted to extents
       Biggest advantage for e2fsck
   Faster file system creation
       Inode table initialization postponed to kernel
       Huge time saver when creating bigger file systems
   Metadata checksumming
       CRC to detect errors
       Not enabled by default
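
The memory saving from converting bitmaps to extents comes from run-length style compression: long runs of set bits collapse into (start, length) pairs. The function below is a minimal sketch of that conversion on a plain Python list. It does not reproduce the data structure e2fsprogs actually uses.

```python
def bitmap_to_extents(bitmap):
    """Convert a sequence of 0/1 flags into (start, length) runs of set bits."""
    extents = []
    start = None
    for i, bit in enumerate(bitmap):
        if bit and start is None:
            start = i                              # a new run begins
        elif not bit and start is not None:
            extents.append((start, i - start))     # the run ended just before i
            start = None
    if start is not None:
        extents.append((start, len(bitmap) - start))
    return extents

# 1524 bitmap entries with two long runs of allocated blocks
bitmap = [1] * 1000 + [0] * 24 + [1] * 500
print(bitmap_to_extents(bitmap))
```

On a mostly contiguous file system the extent list stays short no matter how large the bitmap is, which is where a tool such as e2fsck gains the most.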
Part V
What’s new in btrfs
Getting stabilized



   Performance improvements are not where the focus is right now
       Design specific performance problems
       Optimization needed in future
   Still under heavy development
   Not all features are yet ready or even implemented
   File system stabilization takes a long time
Reliability in btrfs



    Userspace tools not in very good shape
        Fsck utility still not fully finished
    Neither kernel nor userspace handles errors gracefully
    Very good design to build on
        Metadata and data checksumming
        Back reference
        Online filesystem scrub
Resources


   Linux Weekly News http://lwn.net
   Kernel mailing lists http://vger.kernel.org
       linux-fsdevel
       linux-ext4
       linux-btrfs
       linux-xfs
   Linux Kernel code http://kernel.org
   Linux IO stack diagram
       http://www.thomas-krenn.com/en/oss/linux-io-stack-diagram.html
The end.
Thanks for listening.
