Introduction to File System &
OCFS2
Gang He <ghe@suse.com>
Oct 14th, 2016
Basic Concepts
How to store your data
• Block device
No metadata; data is usually read/written sequentially.
• Database
Key/value-style data.
Accessed through the database's own CLI/API.
• File system
Tree directory structure.
User-defined directory/file names and file sizes.
Complies with the POSIX standard interfaces.
File system interfaces
• open
• read/pread/readv, write/pwrite/writev
• mmap/munmap
• splice/sendfile (Linux-specific)
• io_submit/io_getevents (Linux-specific)
• reflink (in the future)
• fsync/msync
• close
File access IO models
• Buffered/Direct IO
• Blocking/Nonblocking IO
• Synchronous/Asynchronous IO
• Mmap (on-demand read/write)
• IO multiplexing – select/poll/epoll
File system classification
• Pseudo file system
proc sysfs debugfs tmpfs devtmpfs
• Local file system
ext4 reiserfs xfs btrfs
• Cluster file system
OCFS2 GFS2 GPFS VxFS
• Distributed file system
GoogleFS HDFS GlusterFS ceph
Understanding VFS
Why introduce VFS
• A virtual file system (VFS) or virtual filesystem switch is an abstraction
layer on top of a more concrete file system. The purpose of a VFS is to allow
client applications to access different types of concrete file systems in a
uniform way.
Data structures (1) in VFS
• struct super_block
one per mounted file system; holds global state/settings, the super block
operations (inode alloc/free/dirty/write/evict), the root dentry, and the inode list.
• struct inode
one per file object, identified by a unique number within the file system. It holds
the file metadata (ino, mode, owner, blocks/size, a/m/c times, block pointers, etc.),
various list_heads, the inode operations (lookup, create, unlink, setattr/getattr, etc.),
and the address_space.
• struct dentry
one per file name (hard link); it holds the name string, a parent dentry pointer, an inode
pointer, various list_heads, and the dentry operations (e.g. d_hash, d_compare, d_revalidate).
• struct file
one per open() system call; it holds the open mode, the read/write position, a dentry
pointer, and the file operations (e.g. llseek, read, write, fsync, flock).
Data structures (1): relationships
Data structures (2) in VFS
• struct address_space
one per inode; it manages the file's page cache, much like the VM subsystem manages memory.
Data structures (3) in VFS
struct page/buffer_head/bio
File access work flow
• Open
SYSCALL_DEFINE3(open) → do_sys_open → do_filp_open → path_openat →
walk_component → do_last → fd_install
• Read
SYSCALL_DEFINE3(read) → vfs_read → do_sync_read → (filp→f_op→aio_read) →
generic_file_aio_read → do_generic_file_read → (mapping→a_ops→readpage) →
block_read_full_page(page, xxx_get_block) → submit_bh → submit_bio
• Write
SYSCALL_DEFINE3(write) → vfs_write → do_sync_write → (filp→f_op→aio_write) →
generic_file_buffered_write → generic_perform_write → (mapping→a_ops→write_begin)
→ iov_iter_copy_from_user_atomic → (mapping->a_ops->write_end)
• Close
SYSCALL_DEFINE1(close) → __close_fd → filp_close → (filp→f_op→flush) → fput
EXT3 File System
EXT3 file system layout
EXT3 inode block layout
EXT3 dir entry block layout
EXT3 dir index block layout (HashTree)
Journal (JBD2)
• Block is marked dirty for the journal.
• Block is written to the journal.
• Transaction is committed.
• Block is marked dirty for the file system.
• Block is written to the file system.
• The transaction is check-pointed.
File system mounting
• Check whether a file system is already using the block device.
If one matches, reuse the existing super block.
• For each possible file system, attempt to read the
super block.
• Set up file system-specific structures.
• Replay journal, if applicable.
• Locate root directory inode block.
• Instantiate dentry for root directory and pin it to super
block.
• File system returns success to VFS.
OCFS2 File System
OCFS2 overview
• Shared storage (SAN/iSCSI).
• Multiple file system nodes connect to the shared
storage (single-node operation is also supported).
• Nodes can be added/removed online.
• Complies with traditional file system semantics.
OCFS2 design
• Use system files to manage meta information, instead
of traditional fixed bitmap blocks.
• Use extents organized in a B-tree to manage file data
blocks, instead of direct/indirect block pointers.
• Each inode occupies an entire block, and the
data-in-inode feature is supported.
• Use JBD2 to handle file system journal.
• Use DLM to manage the locks across the nodes.
• Each node has its own system files to avoid contention
for global resources.
• Use two allocation units: blocks for metadata and
clusters for file data.
OCFS2 system files
OCFS2 space management
global_bitmap: one large flat bitmap, in cluster units, recording the
allocated space for the whole file system.
local_alloc: allocates data in cluster sizes for the local node.
inode_alloc: allocates inode blocks for the local node.
extent_alloc: allocates meta-data blocks for the local node.
truncate_log: records deallocated clusters before returning them to the global_bitmap.
OCFS2 inode block layout
OCFS2 inode block mapping
OCFS2 dir entry/index block layout
OCFS2 file clone (reflink)
• dd if=/dev/zero of=./test1 bs=1M count=8192
• reflink test1 test2
• dd conv=notrunc if=/dev/random of=test2 bs=4096 count=100
OCFS2 node cooperation