File-System Implementation (Galvin)
Outline
 FILE-SYSTEM STRUCTURE
 FILE-SYSTEM IMPLEMENTATION
o Overview
o Partitions and Mounting
o Virtual File Systems
 DIRECTORY IMPLEMENTATION
o Linear List
o Hash Table
 ALLOCATION METHODS
o Contiguous Allocation
o Linked Allocation
o Indexed Allocation
o Performance
 FREE-SPACE MANAGEMENT
o Bit Vector
o Linked List
o Grouping
o Counting
o Space Maps
 EFFICIENCY AND PERFORMANCE
o Efficiency
o Performance
 RECOVERY
o Consistency Checking
o Log-Structured File Systems
o Other Solutions
o Backup and Restore
 NFS (Optional)
o Overview
o The Mount Protocol
o The NFS Protocol
o Path-Name Translation
o Remote Operations
 EXAMPLE: THE WAFL FILE SYSTEM (Optional--SKIPPED)
Contents
FILE-SYSTEM STRUCTURE
 The file system resides permanently on secondary storage. This chapter is primarily concerned with issues surrounding file storage and access on
the most common secondary-storage medium, the disk.
 Hard disks have two important properties that make them suitable for secondary storage of files in file systems: (1) Blocks of data can be
rewritten in place; it is possible to read a block from the disk, modify the block, and write it back into the same place, and (2) they are direct
access, allowing any block of data to be accessed with only (relatively) minor movements of the disk heads and rotational latency. (Disks are
usually accessed in physical blocks, one or more sectors, rather than a byte at a time. Block sizes may range from 512 bytes to 4K or larger.)
 To provide efficient and convenient access to the disk, the OS imposes one or more file systems to allow the data to be stored, located, and
retrieved easily. One of the design problems a file system poses is creating algorithms and data structures to map the logical file system onto the
physical secondary-storage devices.
 The file system itself is generally composed of many different levels. The structure shown in Figure 11.1 is an example of a layered design, where
each level in the design uses the features of lower levels to create new features for use by higher levels.

 File systems organize storage on disk drives, and can be viewed as a layered design:
o At the lowest layer are the physical devices, consisting of the magnetic media, motors & controls, and the electronics connected to them and
controlling them. Modern disks put more and more of the electronic controls directly on the disk drive itself, leaving relatively little work for
the disk controller card to perform.
o I/O Control consists of device drivers, special software programs (often written in assembly) which communicate with the devices by
reading and writing special codes directly to and from memory addresses corresponding to the controller card's registers. Each controller
card (device) on a system has a different set of addresses (registers, a.k.a. ports) that it listens to, and a unique set of command codes and
result codes that it understands. (Book: The I/O control is the lowest level and consists of device drivers and interrupt handlers to transfer
information between the main memory and the disk system. A device driver can be thought of as a translator. Its input consists of high-level
commands such as "retrieve block 123". Its output consists of low-level, hardware-specific instructions that are used by the hardware
controller, which interfaces the I/O device to the rest of the system. The device driver usually writes specific bit patterns to special locations
in the I/O controller's memory to tell the controller which device location to act on and what actions to take.)
o The basic file system level works directly with the device drivers in terms of retrieving and storing raw blocks of data, without any
consideration for what is in each block. Depending on the system, blocks may be referred to with a single block number (e.g. block #
234234), or with head-sector-cylinder combinations. (Book: The basic file system needs only to issue generic commands to the appropriate
device driver to read and write physical blocks on the disk. Each physical block is identified by its numeric
disk address (for example, drive 1, cylinder 73, track 2, sector 10).)
o The file organization module knows about files and their logical blocks, and how they map to physical
blocks on the disk. In addition to translating from logical to physical blocks, the file organization module
also maintains the list of free blocks, and allocates free blocks to files as needed. (Book: The file-organization
module knows about files and their logical blocks, as well as physical blocks. By knowing the
type of file allocation used and the location of the file, the file-organization module can translate logical
block addresses to physical block addresses for the basic file system to transfer. Each file's logical blocks
are numbered from 0 (or 1) through N. Since the physical blocks containing the data usually do not match
the logical numbers, a translation is needed to locate each block. The file-organization module also
includes the free-space manager, which tracks unallocated blocks and provides these blocks to the
file-organization module when requested.)
o The logical file system deals with all of the metadata associated with a file (UID, GID, mode, dates, etc.),
i.e. everything about the file except the data itself. This level manages the directory structure and the
mapping of file names to file control blocks, FCBs, which contain all of the metadata as well as block
number information for finding the data on the disk. (IBM Knowledge Center: The logical file system is the level of the file system at which
users can request file operations by system call. This level of the file system provides the kernel with a consistent view of what might be
multiple physical file systems and multiple file system implementations. As far as the logical file system is concerned, file system types,
whether local, remote, or strictly logical, and regardless of implementation, are indistinguishable.) (Book: The logical file system manages
metadata information. Metadata includes all of the file-system structure except the actual data (or contents of the files). The logical file
system manages the directory structure to provide the file-organization module with the information the latter needs, given a symbolic file
name. It maintains file structure via FCBs. An FCB contains information about the file, including ownership, permissions, and location of the
file contents. The logical file system is also responsible for protection and security.)
 The layered approach to file systems means that much of the code can be used uniformly for a wide variety of different file systems, and only
certain layers need to be file-system specific. (Book: When a layered structure is used for file-system implementation, duplication of code is
minimized. The I/O control and sometimes the basic file-system code can be used by multiple file systems. Each file system can then have its own
logical file system and file-organization modules.)
 Most operating systems support more than one file system. In addition to removable-media file systems, each OS has one disk-based file system
(or more). UNIX uses the UNIX file system (UFS), which is based on the Berkeley Fast File System (FFS). Windows NT, 2000, and XP support disk
file-system formats of FAT, FAT32, and NTFS (or Windows NT File System), as well as CD-ROM, DVD and floppy-disk file-system formats. Although
Linux supports over 40 different file systems, the standard Linux file system is known as the extended file system, with the most common versions
being ext2 and ext3.
File System Implementation
As was described in Section 10.1.2, operating systems implement open() and close() system calls for processes to request access to file contents. In this
section, we delve into the structures and operations used to implement file-system operations.
Overview
Several on-disk and in-memory structures are used to implement a file system. These structures vary depending on the OS and the file system, but some
general principles apply.
On disk, the file system may contain information about how to boot an operating system stored there, the total number of blocks, the number and
location of free blocks, the directory structure, and individual files. Many of these structures are detailed throughout the remainder of this chapter; here
we describe them briefly.

 File systems store several important data structures on the disk (the Illinois notes are erroneous in places here; refer to the book excerpts):
o A boot-control block (per volume), a.k.a. the boot block in UNIX or the partition boot sector in Windows, contains information about
how to boot the system off of this disk. This will generally be the first sector of the volume if there is a bootable system loaded on that
volume, or the block will be left vacant otherwise. (Book: A boot control block (per volume) can contain information needed by the
system to boot an operating system from that volume. If the disk does not contain an operating system, this block can be empty. It is
typically the first block of a volume. In UFS, this is called the boot block; in NTFS, it is the partition boot sector.)
o A volume control block (per volume), a.k.a. the superblock in UNIX or the master file table in Windows, which contains information
such as the partition table, number of blocks on each filesystem, and pointers to free blocks and free FCB blocks. (Book: A volume
control block (per volume) contains volume (or partition) details, such as the number of blocks in the partition, size of the blocks,
free-block count and free-block pointers, and free FCB count and FCB pointers. In UFS, this is called a superblock; in NTFS, it is stored in the
master file table.)
o A directory structure (per file system), containing file names and pointers to corresponding FCBs. UNIX uses inode numbers, and NTFS
uses a master file table. (Book: A directory structure per file system is used to organize the files. In UFS, this includes file names and
associated inode numbers. In NTFS, it is stored in the master file table.)
o The File Control Block, FCB (per file), containing details about ownership, size, permissions, dates, etc. UNIX stores this information in
inodes, and NTFS in the master file table as a relational database structure. (Book: A per-file FCB contains many details about the file,
including file permissions, ownership, size, and location of the data blocks. In UFS, this is called the inode. In NTFS, this information is
actually stored within the master file table, which uses a relational database structure, with a row per file.)
 There are also several key data structures stored in memory. (Book: The in-memory information is used for both file-system management and
performance improvement via caching. The data are loaded at mount time and discarded at dismount. The structures may include the ones
described below.)
o An in-memory mount table contains information about each mounted volume.
o An in-memory directory-structure cache holds the directory information of recently accessed directories. (For directories at which
volumes are mounted, it can contain a pointer to the volume table.)
o The system-wide open-file table contains a copy of the FCB of each open file, as well as other information.
o A per-process open-file table, containing a pointer to the system open-file table as well as some other information. (For example, the
current file position pointer may be either here or in the system file table, depending on the implementation and whether the file is
being shared or not.) (Book: The per-process open-file table contains a pointer to the appropriate entry in the system-wide open-file
table, as well as other information.)
 Interactions of file system components when files are created and/or used:
To create a new file, an application program calls the logical file system, which
knows the format of the directory structures. To create a new file, it allocates a
new FCB. (Alternatively, if the file-system implementation creates all FCBs at
file-system creation time, an FCB is allocated from the set of free FCBs.) The system
then reads the appropriate directory into memory, updates it with the new file
name and FCB, and writes it back to the disk. A typical FCB is shown in Figure 11.2.
Some operating systems, including UNIX, treat a directory exactly the same as a
file: one with a type field indicating that it is a directory. Other operating systems,
including Windows NT, implement separate system calls for files and directories
and treat directories as entities separate from files. Whatever the larger structural
issues, the logical file system can call the file-organization module to map the directory I/O into disk-block numbers, which are passed on to the
basic file system and I/O control system.
Now that a file has been created, it can be used for I/O. First, though, it must be opened. The open() call passes a file name to the file system.
The open() system call first searches the system-wide open-file table to see if the file is already in use by another process. If it is, a per-process
open-file table entry is created pointing to the existing system-wide open-file table entry. This algorithm can save substantial overhead. Otherwise,
the directory structure is searched for the given file name. Parts of the directory structure are usually cached in memory to speed
directory operations. Once the file is found, the FCB is copied into the system-wide open-file table in memory. This table not only stores the FCB but
also tracks the number of processes that have the file open.
Next, an entry is made in the per-process open-file table, with a pointer to the entry in the system-wide open-file table and some other fields.
These other fields can include a pointer to the current location in the file (for the next read() or write() operation) and the access mode in which
the file is open. The open() call returns a pointer to the appropriate entry in the per-process file-system table. All file operations are then
performed via this pointer. The file name may not be part of the open-file table, as the system has no use for it once the appropriate FCB is
located on disk. It could be cached, though, to save time on subsequent opens of the same file. The name given to the entry varies. UNIX systems
refer to it as a file descriptor; Windows refers to it as a file handle. Consequently, as long as the file is not closed, all file operations are done on
the open-file table.
When a process closes the file, the per-process table entry is removed, and the system-wide entry's open count is decremented. When all
users that have opened the file close it, any updated metadata is copied back to the disk-based directory structure, and the system-wide
open-file table entry is removed.
Some systems complicate this scheme further by using the file system as an interface to other system aspects, such as networking. For
example, in UFS, the system-wide open-file table holds the inodes and other information for files and directories. It also holds similar information
for network connections and devices. In this way, one mechanism is used for multiple purposes.
The caching aspects of file-system structures should not be overlooked. Most systems keep all information about an open file, except for its
actual data blocks, in memory. The BSD UNIX system is typical in its use of caches wherever disk I/O can be saved. Its average cache hit rate of
85% shows that these techniques are well worth implementing.
The operating structures of a file-system implementation are summarized in Figure 11.3.
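The open()/close() bookkeeping described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the class and field names (FCB, SystemEntry, OpenFileTables) are invented for the sketch, the "disk" directory is a plain dictionary, and the system-wide table is keyed by file name rather than by inode.

```python
from dataclasses import dataclass


@dataclass
class FCB:
    """Hypothetical file control block holding only what the sketch needs."""
    name: str
    size: int = 0


@dataclass
class SystemEntry:
    """One row of the system-wide open-file table: FCB copy plus open count."""
    fcb: FCB
    open_count: int = 0


class OpenFileTables:
    def __init__(self, directory):
        self.directory = directory   # file name -> FCB (stands in for the disk)
        self.system_table = {}       # file name -> SystemEntry
        self.per_process = {}        # pid -> {fd: [SystemEntry, file offset]}

    def open(self, pid, name):
        entry = self.system_table.get(name)
        if entry is None:
            # Not open anywhere yet: search the directory, copy the FCB in.
            entry = SystemEntry(fcb=self.directory[name])
            self.system_table[name] = entry
        entry.open_count += 1
        table = self.per_process.setdefault(pid, {})
        fd = 0
        while fd in table:           # pick the lowest unused descriptor
            fd += 1
        table[fd] = [entry, 0]       # pointer to system entry + current offset
        return fd

    def close(self, pid, fd):
        entry, _offset = self.per_process[pid].pop(fd)
        entry.open_count -= 1
        if entry.open_count == 0:
            # Last close: updated metadata would be written back to disk here.
            del self.system_table[entry.fcb.name]
```

Two processes opening the same file share one system-wide entry whose open count goes to 2; the entry disappears only after both have closed it.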
 Before moving on to the next section, go to the reference material on MBT, MFT, VBR and FCB in the “Assorted Content” section.
Partitions and Mounting:
 Partitions can either be used as raw devices (with no structure imposed upon them), or they can be formatted to hold a filesystem (i.e. populated
with FCBs and initial directory structures as appropriate). Raw partitions are generally used for swap space, and may also be used for certain
programs such as databases that choose to manage their own disk storage system. Partitions containing filesystems can generally only be
accessed using the file system structure by ordinary users, but can often be accessed as a raw device also by root.
 The boot block is accessed as part of a raw partition, by the boot program prior to any operating system being loaded. Modern boot programs
understand multiple OSes and filesystem formats, and can give the user a choice of which of several available systems to boot.
 The root partition contains the OS kernel and at least the key portions of the OS needed to complete the boot process. At boot time the root
partition is mounted, and control is transferred from the boot program to the kernel found there. (Older systems required that the root partition
lie completely within the first 1024 cylinders of the disk, because that was as far as the boot program could reach. Once the kernel had control,
then it could access partitions beyond the 1024 cylinder boundary.)
 Continuing with the boot process, additional filesystems get mounted, adding their information into the appropriate mount table structure. As a
part of the mounting process the file systems may be checked for
errors or inconsistencies, either because they are flagged as not having
been closed properly the last time they were used, or just on general
principle. Filesystems may be mounted either automatically or
manually. In UNIX a mount point is indicated by setting a flag in the
in-memory copy of the inode, so all future references to that inode get
redirected to the root directory of the mounted filesystem.
Virtual File Systems: Virtual File Systems, VFS, provide a common interface to
multiple different filesystem types. In addition, VFS provides a unique
identifier (vnode) for files across the entire space, including across all
filesystems of different types. (UNIX inodes are unique only across a single
filesystem, and certainly do not carry across networked file systems.) The VFS
in Linux is based upon four key object types: (a) The inode object, representing
an individual file. (b) The file object, representing an open file. (c) The
superblock object, representing a filesystem. (d) The dentry object,
representing a directory entry.
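The "common interface" idea can be illustrated with a small Python sketch. It is not how any real kernel VFS is written: the class names (FileSystem, LocalFS, RemoteFS, VFS) and the single read() operation are assumptions made for the example. The point is only that callers talk to one interface and the VFS layer dispatches to whichever filesystem type is mounted at the path.

```python
from abc import ABC, abstractmethod


class FileSystem(ABC):
    """Hypothetical interface every filesystem type must implement."""

    @abstractmethod
    def read(self, path: str) -> bytes:
        ...


class LocalFS(FileSystem):
    """Stands in for a disk-based filesystem; files live in a dictionary."""

    def __init__(self, files):
        self.files = files

    def read(self, path):
        return self.files[path]


class RemoteFS(FileSystem):
    """Stands in for a network filesystem; `fetch` plays the remote server."""

    def __init__(self, fetch):
        self.fetch = fetch

    def read(self, path):
        return self.fetch(path)


class VFS:
    def __init__(self):
        self.mounts = {}   # mount point -> FileSystem

    def mount(self, point, fs):
        self.mounts[point.rstrip("/") or "/"] = fs

    def read(self, path):
        # The longest mount point that prefixes the path wins.
        candidates = [m for m in self.mounts
                      if path == m or path.startswith(m.rstrip("/") + "/")]
        best = max(candidates, key=len)
        relative = path[len(best):]
        if not relative.startswith("/"):
            relative = "/" + relative
        return self.mounts[best].read(relative)
```

A caller issues vfs.read("/net/x") or vfs.read("/etc/motd") in exactly the same way, without knowing which implementation serves the request.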
Directory Implementation
The selection of directory-allocation and directory-management algorithms significantly affects the efficiency, performance and reliability of the file
system. In this section, we discuss the trade-offs involved in choosing one of these algorithms. (Directories need to be fast to search, insert, and delete,
with a minimum of wasted disk space.)
 Linear List: The simplest method of implementing a directory is to use a linear list of file names with pointers to the data blocks. This method is
simple to program but time-consuming to execute. To create a new file, we must first search the directory to be sure that no existing file has
the same name. Then, we add a new entry at the end of the directory. To delete a file, we search the directory for the named file, then release
the space allocated to it. To reuse the directory entry, we can do one of several things. We can mark the entry as unused (by assigning it a
special name, such as an all-blank name, or with a used-unused bit in each entry), or we can attach it to a list of free directory entries. A third
alternative is to copy the last entry in the directory into the freed location and to decrease the length of the directory. A linked list can also be
used to decrease the time required to delete a file (there is an overhead for the links).
The real disadvantage of a linear list of directory entries is that finding a file requires a linear search. Directory information is used
frequently, and users will notice if access to it is slow.

A sorted list allows a binary search and decreases the average search time. However, the requirement that the list be kept sorted may
complicate creating and deleting files, since we may have to move substantial amounts of directory information to maintain a sorted directory.
A more sophisticated tree data structure, such as a B-tree, might help here. An advantage of
the sorted list is that a sorted directory listing can be produced without a separate sort
step.
 Hash table: Another data structure for a file directory is a hash table. With this method, a
linear list stores the directory entries, but a hash data structure is also used. The hash
table takes a value computed from the file name and returns a pointer to the file name
in the linear list. Therefore it can greatly decrease the directory search time.
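The two directory organizations can be compared with a short Python sketch. The names LinearDirectory and HashedDirectory, and the use of a small integer as the "FCB pointer", are assumptions for illustration. Deletion uses the third alternative mentioned above (copy the last entry into the freed slot and shrink the list), and the hashed variant keeps the same linear list but adds a name-to-position index (a Python dict, which is itself a hash table).

```python
class LinearDirectory:
    def __init__(self):
        self.entries = []   # each entry is [file name, FCB number]

    def lookup(self, name):
        for entry_name, fcb in self.entries:   # linear search, O(n)
            if entry_name == name:
                return fcb
        return None

    def create(self, name, fcb):
        # Must search first to be sure no existing file has the same name.
        if self.lookup(name) is not None:
            raise FileExistsError(name)
        self.entries.append([name, fcb])

    def delete(self, name):
        for i, (entry_name, _fcb) in enumerate(self.entries):
            if entry_name == name:
                # Copy the last entry into the freed slot, then shrink.
                self.entries[i] = self.entries[-1]
                self.entries.pop()
                return
        raise FileNotFoundError(name)


class HashedDirectory(LinearDirectory):
    def __init__(self):
        super().__init__()
        self.index = {}   # file name -> position in self.entries

    def lookup(self, name):
        pos = self.index.get(name)             # hashed lookup, no scan
        return None if pos is None else self.entries[pos][1]

    def create(self, name, fcb):
        super().create(name, fcb)
        self.index[name] = len(self.entries) - 1

    def delete(self, name):
        if name not in self.index:
            raise FileNotFoundError(name)
        pos = self.index.pop(name)
        last = self.entries.pop()
        if pos < len(self.entries):            # deleted entry was not the last
            self.entries[pos] = last
            self.index[last[0]] = pos
```

The hashed version pays for faster lookups with extra bookkeeping: every insert and delete must keep the index in step with the list.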
Allocation Methods
Here we discuss how to allocate space to files so that disk space is utilized effectively and files can
be accessed quickly. Three major methods of allocating disk space are in wide use: contiguous,
linked and indexed. Some systems (such as Data General's RDOS for
its Nova line of computers) support all three. More commonly, a
system uses one method for all files within a file system type.
Contiguous Allocation: It requires that all blocks of a file be kept together contiguously.
Performance is very fast, because reading successive blocks of the same file generally requires no
movement of the disk heads, or at most one small step to the next adjacent cylinder.
 Storage allocation involves the same issues discussed earlier for the allocation of
contiguous blocks of memory (first fit, best fit, fragmentation problems, etc.). The
distinction is that the high time penalty required for moving the disk heads from spot to
spot may now justify the benefits of keeping files contiguous when possible. (Even file
systems that do not by default store files contiguously can benefit from certain utilities that
compact the disk and make all files contiguous in the process.)
 Problems can arise when files grow, or if the exact size of a file is unknown at creation time:
Over-estimation of the file's final size increases external fragmentation and wastes disk space. Under-estimation may require that a file be
moved or a process aborted if the file grows beyond its originally allocated space. If a file grows slowly over a long time period and the total
final space must be allocated initially, then a lot of space becomes unusable before the file fills the space.
 To minimize these drawbacks, some operating systems use a modified contiguous-allocation scheme. Here, a contiguous chunk of space is
allocated initially; and then, if that amount proves not to be large enough, another chunk of contiguous space, known as an extent, is added.
The location of the file's blocks is then recorded as a location and a block count, plus a link to the first block of the next extent (used by the Veritas
file system).
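A minimal Python sketch of the two ideas in this subsection: first-fit allocation of a contiguous run, and logical-to-physical translation for a file stored as a list of extents. The function names and the representation of both free space and extents as (start_block, block_count) pairs are assumptions made for the example, not the on-disk format of any particular file system.

```python
def first_fit(free_runs, needed):
    """Allocate `needed` contiguous blocks from the first free run that fits.

    free_runs is a list of (start_block, block_count) pairs and is updated
    in place. Returns the starting block, or None if no run is large enough.
    """
    for i, (start, count) in enumerate(free_runs):
        if count >= needed:
            if count == needed:
                free_runs.pop(i)
            else:
                free_runs[i] = (start + needed, count - needed)
            return start
    return None


def logical_to_physical(extents, logical_block):
    """Translate a file-relative block number into a disk block number.

    extents is the file's list of (start_block, block_count) pairs in order.
    """
    remaining = logical_block
    for start, count in extents:
        if remaining < count:
            return start + remaining
        remaining -= count
    raise IndexError("logical block beyond end of file")
```

For a file with extents [(100, 4), (500, 2)], logical blocks 0 to 3 map to disk blocks 100 to 103, and logical blocks 4 and 5 map to 500 and 501. With pure contiguous allocation there is a single extent, so the translation is just start + logical_block.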
Linked Allocation: Linked allocation solves all problems of contiguous allocation. With linked allocation, each file is a linked list of disk blocks; the disk
blocks may be scattered anywhere on the disk. The directory contains a pointer to the first and last blocks of the file (each block contains a pointer to the
next block). These pointers are not made available to the user. Thus, if each block is 512 bytes in size, and a disk address (the pointer) requires 4 bytes,
then the user sees blocks of 508 bytes.
 To create a new file, we simply create a new entry in the directory. With linked allocation, each directory entry has a pointer to the first disk
block of the file. This pointer is initialized to nil (the end-of-list pointer value) to signify
an empty file. The size field is also set to 0. A write to the file causes the free-space
management system to find a free block, and this new block is written to and is linked
to the end of the file. To read a file, we simply read blocks by following the pointers
from block to block. There is no external fragmentation with linked allocation, and any
free block on the free-space list can be used to satisfy a request. The size of a file need
not be declared when that file is created. A file can continue to grow as long as free
blocks are available. Consequently, it is never necessary to compact disk space.
 Linked allocation does have disadvantages, however. The major problem is that it can
be used effectively only for sequential-access files. To find the ith block of a file, we
must start at the beginning of that file and follow the pointers till we get to the ith
block. Each access to a pointer requires a disk read, and some require a disk seek.
Consequently, it is inefficient to support a direct-access capability for linked-allocation
files. (Another disadvantage is the space required for the pointers.)
 The usual solution to the pointer-space problem is to collect blocks into multiples, called clusters, and
to allocate clusters rather than blocks. For instance, the file system may define a cluster
as four blocks and operate on the disk only in cluster units. Pointers then use a much smaller percentage of the file's disk space. The cost of this
approach is an increase in internal fragmentation, because more space is wasted when a cluster is partially full than when a block is partially
full. Clusters can be used to improve the disk-access time for many other algorithms as well, so they are used in most file systems.
 Another problem of linked allocation is reliability. The files are linked together by pointers scattered all over the disk, so consider what would
happen if a pointer were lost or damaged. One partial solution is to use doubly-linked lists, and another is to store the file name and relative
block number in each block; however, these schemes require even more overhead for each file.
 An important variation on linked allocation is the use of a file-allocation table (FAT). This simple but efficient method of disk-space allocation is
used by the MS-DOS and OS/2 operating systems. A section of disk at the beginning of each volume is set aside to contain the table. The table
has one entry for each disk block and is indexed by block number. The FAT is used in much the same way as a linked list. The directory entry
contains the block number of the first block of the file. The table
entry indexed by that block number contains the block number of
the next block in the file. This chain continues until the last block,
which has a special end-of-file value as the table entry. Unused
blocks are indicated by a 0 table value. Allocating a new block to a
file is a simple matter of finding the first 0-valued table entry and
replacing the previous end-of-file value with the address of the new
block. The 0 is then replaced with the end-of-file value. An illustrative
example is the FAT structure shown in Figure 11.7 for a file
consisting of disk blocks 217, 618, and 339.
The FAT allocation scheme can result in a significant number of
disk head seeks, unless the FAT is cached. The disk head must move
to the start of the volume to read the FAT and find the location of
the block in question, then move to the location of the block itself.
In the worst case, both moves occur for each of the blocks. A benefit
is that random-access time is improved, because the disk head can
find the location of any block by reading the information in the FAT.
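The FAT chain for the example file (blocks 217, 618, 339) can be followed with a few lines of Python. This is a sketch under assumptions: the table is a plain list, EOF is represented by -1 (real FAT variants use reserved bit patterns), and entry 0 is treated as reserved so that a stored 0 always means "unused".

```python
EOF = -1    # hypothetical end-of-file marker for this sketch
FREE = 0    # unused blocks are indicated by a 0 table value


def file_blocks(fat, first_block):
    """Follow the chain from the directory entry's first block to EOF."""
    blocks = []
    block = first_block
    while block != EOF:
        blocks.append(block)
        block = fat[block]      # table entry holds the next block number
    return blocks


def append_block(fat, first_block):
    """Allocate the first 0-valued entry and link it to the end of the file."""
    # Entry 0 is skipped: it is treated as reserved in this sketch.
    new_block = next((i for i in range(1, len(fat)) if fat[i] == FREE), None)
    if new_block is None:
        raise OSError("no free blocks")
    last = file_blocks(fat, first_block)[-1]
    fat[last] = new_block       # old end-of-file entry now points onward
    fat[new_block] = EOF        # the 0 is replaced with the end-of-file value
    return new_block


# Build the table for the example file: 217 -> 618 -> 339 -> EOF.
fat = [FREE] * 700
fat[217] = 618
fat[618] = 339
fat[339] = EOF
```

Because the whole chain lives in the table, finding the ith block needs only table lookups (cheap if the FAT is cached) rather than reading i data blocks from disk, which is why random access improves compared with plain linked allocation.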
Indexed Allocation: Linked allocation solves the external-fragmentation and size-declaration problems of contiguous allocation. However, in the
absence of a FAT, linked allocation cannot support efficient direct access, since the pointers to the blocks are scattered with the blocks themselves all
over the disk and must be retrieved in order. Indexed allocation solves this problem by bringing all the pointers together into one location: the index
block.
 Each file has its own index block, which is an array of disk-block addresses. The ith entry in the index block points to the ith block of the file.
The directory contains the address of the index block (Figure 11.8). To find and read the ith block, we use the pointer in the ith index-block
entry. This scheme is similar to the paging scheme described in Section 8.4.
 When the file is created, all pointers in the index block are set to nil. When
the ith block is first written, a block is obtained from the free-space manager,
and its address is put in the ith index-block entry.
 Indexed allocation supports direct access, without suffering from external
fragmentation, because any free block on the disk can satisfy a request for
more space. Indexed allocation does suffer from wasted space, however. The
pointer overhead of the index block is generally greater than the pointer
overhead of linked allocation. Consider a common case in which we have a
file of only one or two blocks. With linked allocation, we lose the space of
only one pointer per block. With indexed allocation, an entire index block
must be allocated, even if only one or two pointers will be non-nil.
 This point raises the question of how large the index block should be. Every
file must have an index block, so we want the index block to be as small as
possible. If the index block is too small, however, it will not be able to hold
enough pointers for a large file, and a mechanism will have to be available to
deal with the issue. Mechanisms for this purpose include the following:
o Linked scheme: An index block is normally one disk block. Thus, it can be read and written directly by itself. To allow for large files,
we can link together several index blocks. For example, an index block might contain a small header giving the name of the file and a
set of the first 100 disk-block addresses. The next address (the last word in the index block) is nil (for a small file) or is a pointer to
another index block (for a large file).
o Multilevel index: A variant of the linked representation is to use a first-level index block to point to a set of second-level index blocks, which
in turn point to the file blocks. To access a block, the OS uses the first-level index to find a second-level index block and then uses
that block to find the desired data block. This approach could be continued to a third or fourth level, depending on the desired
maximum file size. With 4096-byte blocks, we could store 1,024 4-byte pointers in an index block. Two levels of indexes allow
1,024 x 1,024 = 1,048,576 data blocks and a file size of up to 4 GB.

o Combined scheme: Another alternative, used in the UFS, is to keep the
first, say, 15 pointers of the index block in the file's inode. The first 12 of
these pointers point to direct blocks; that is, they contain addresses of
blocks that contain data of the file. Thus, the data for small files (of no
more than 12 blocks) do not need a separate index block. If the block
size is 4 KB, then up to 48 KB of data can be accessed directly. The next
three pointers point to indirect blocks. The first points to a single
indirect block, which is an index block containing not data but the
addresses of blocks that do contain data. The second points to a double
indirect block, which contains the addresses of blocks that in turn contain
the addresses of the actual data blocks. The
last pointer contains the address of a triple indirect block. Under this
method, the number of blocks that can be allocated to a file exceeds the
amount of space addressable by the 4-byte file pointers used by many
OSes. A 32-bit file pointer reaches only 2^32 bytes, or 4 GB. Many UNIX implementations, including Solaris and IBM's AIX, now
support up to 64-bit file pointers. Pointers of this size allow files and file systems to be terabytes in size. A UNIX inode is shown in
Figure 11.9.
 Indexed-allocation schemes suffer from some of the same performance problems as does linked allocation. Specifically, the index blocks can
be cached in memory, but the data blocks may be spread all over a volume.
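The arithmetic behind the multilevel and combined schemes can be checked with a short Python sketch. The constants follow the numbers used in the text (4096-byte blocks, 4-byte pointers, 12 direct pointers); the function names and the returned tuple format are assumptions for illustration and do not describe the exact UFS inode layout.

```python
BLOCK_SIZE = 4096
POINTER_SIZE = 4
POINTERS_PER_BLOCK = BLOCK_SIZE // POINTER_SIZE   # 1,024 pointers per index block
DIRECT = 12                                       # direct pointers in the inode


def two_level_max_bytes():
    """Largest file reachable through a two-level index."""
    return POINTERS_PER_BLOCK ** 2 * BLOCK_SIZE


def locate(logical_block):
    """Say which inode pointer reaches a logical block, and by what path.

    Returns (kind, indices): kind is "direct", "single", "double" or
    "triple"; indices lists the slot used at each index level, outermost
    first (for "direct", the slot in the inode itself).
    """
    n = logical_block
    if n < DIRECT:
        return ("direct", (n,))
    n -= DIRECT
    for level, kind in enumerate(["single", "double", "triple"], start=1):
        span = POINTERS_PER_BLOCK ** level        # blocks covered at this level
        if n < span:
            indices = []
            for _ in range(level):
                indices.append(n % POINTERS_PER_BLOCK)
                n //= POINTERS_PER_BLOCK
            return (kind, tuple(reversed(indices)))
        n -= span
    raise ValueError("block number too large for this inode layout")
```

For example, logical block 11 is the last direct block, block 12 is slot 0 of the single indirect block, and block 12 + 1024 + 5 is reached through slot 0 of the double indirect block and then slot 5 of the second-level block. Each extra level costs one more index-block read unless that index block is cached.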
Performance: The optimal allocation method is different for sequential-access files than for random-access files, and is also different for small files than
for large files. Some systems support more than one allocation method, which may require specifying how the file is to be used (sequential or random
access) at the time it is allocated. Such systems also provide conversion utilities. Some systems have been known to use contiguous allocation for small files,
and automatically switch to an indexed scheme when file sizes surpass a certain threshold. And of course some systems adjust their allocation schemes
(e.g. block sizes) to best match the characteristics of the hardware for optimum performance.
Free-SpaceManagement
Another important aspect of disk management is keeping track of and allocating free space.
 Bit Vector: One simple approach is to use a bit vector, in which each bit represents a disk block, set to 1 if free or 0 if allocated. Fast algorithms
exist for quickly finding contiguous blocks of a given size. The down side is the space taken by the bitmap itself: for example, a 40 GB disk with
1 KB blocks requires over 5 MB just to store the bitmap.
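A minimal sketch of the bit-vector idea follows. The class and function names are made up for illustration, and the search is a plain linear first-fit scan rather than one of the fast algorithms mentioned above. The last line reproduces the bitmap-size arithmetic, assuming 1 KB blocks and a binary 40 GB.

```python
class BitVector:
    """Free-space bit vector: bit i is 1 if block i is free, 0 if allocated."""

    def __init__(self, num_blocks: int):
        self.num_blocks = num_blocks
        self.bits = bytearray([0xFF] * ((num_blocks + 7) // 8))  # all free

    def is_free(self, block: int) -> bool:
        return bool(self.bits[block // 8] & (1 << (block % 8)))

    def set_free(self, block: int, free: bool) -> None:
        if free:
            self.bits[block // 8] |= 1 << (block % 8)
        else:
            self.bits[block // 8] &= ~(1 << (block % 8)) & 0xFF

    def allocate_contiguous(self, count: int) -> int:
        """First-fit linear scan for `count` free blocks in a row.
        Returns the starting block number, or -1 if no run is found."""
        run_start, run_len = 0, 0
        for block in range(self.num_blocks):
            if self.is_free(block):
                if run_len == 0:
                    run_start = block
                run_len += 1
                if run_len == count:
                    for b in range(run_start, run_start + count):
                        self.set_free(b, False)   # mark the run as allocated
                    return run_start
            else:
                run_len = 0
        return -1


def bitmap_bytes(disk_bytes: int, block_size: int) -> int:
    """Size of the bitmap in bytes: one bit per block."""
    return (disk_bytes // block_size + 7) // 8


bv = BitVector(16)
bv.set_free(2, False)                   # pretend block 2 is already in use
print(bv.allocate_contiguous(4))        # blocks 0-1 are too short a run, so this prints 3
print(bitmap_bytes(40 * 2**30, 1024))   # 5242880 bytes
```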
 Linked List: A linked list can also be used to keep track of all free blocks. Traversing the list and/or finding a contiguous block of a given size
are not easy, but fortunately are not frequently needed operations. Generally the system just adds and removes single blocks from the beginning
of the list. The FAT keeps track of the free list as just one more linked list in the table.
 Grouping: A variation on linked list free lists is to use links of blocks of indices of free blocks. If a block holds up to N addresses, then the
first block in the linked list contains up to N-1 addresses of free blocks and a pointer to the next block of free addresses.
 Counting: When there are multiple contiguous blocks of free space, the system can keep track of the starting address of the group and the
number of contiguous free blocks. As long as the average length of a contiguous group of free blocks is greater than two, this offers a savings
in the space needed for the free list. (This is similar to the compression techniques used for graphics images when a group of pixels all the
same color is encountered.)
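The counting technique is essentially run-length encoding of the free list. As a sketch, the hypothetical helper below collapses a list of free block numbers (made up for illustration) into (start, count) pairs.

```python
def to_extents(free_blocks):
    """Collapse free block numbers into (start, count) pairs."""
    extents = []
    for block in sorted(set(free_blocks)):
        if extents and block == extents[-1][0] + extents[-1][1]:
            start, count = extents[-1]
            extents[-1] = (start, count + 1)   # extend the current run
        else:
            extents.append((block, 1))         # start a new run
    return extents


# Illustrative free list: 15 block numbers forming 4 contiguous runs.
free = [2, 3, 4, 5, 8, 9, 10, 11, 12, 13, 17, 18, 25, 26, 27]
extents = to_extents(free)
print(extents)   # [(2, 4), (8, 6), (17, 2), (25, 3)]
```

Here 15 addresses shrink to 4 pairs (8 numbers), because the average run length of 3.75 is above two.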
 Space Maps: Sun's ZFS file system was designed for HUGE numbers and sizes of files, directories, and even file systems. The resulting data
structures could be VERY inefficient if not implemented carefully. For example, freeing up a 1 GB file on a 1 TB file system could involve
updating thousands of blocks of free list bit maps if the file was spread across the disk. ZFS uses a combination of techniques, starting with
dividing the disk up into (hundreds of) metaslabs of a manageable size, each having its own space map. Free blocks are managed using the
counting technique, but rather than write the information to a table, it is recorded in a log-structured transaction record. Adjacent free
blocks are also coalesced into a larger single free block. An in-memory space map is constructed using a balanced tree data structure,
constructed from the log data. The combination of the in-memory tree and the on-disk log provides for very fast and efficient management of
these very large files and free blocks.
Efficiency and Performance
 Efficiency: The efficient use of disk space depends heavily on the disk allocation and directory algorithms in use. For instance, UNIX
pre-allocates inodes, which occupy space even before any files are created. UNIX also distributes inodes across the disk, and tries to store data
files near their inode, to reduce the distance of disk seeks between the inodes and the data. Some systems use variable size clusters depending
on the file size. The more data that is stored in a directory (e.g., information like last access time), the more often the directory blocks have
to be re-written. As technology advances, addressing schemes have had to grow as well. Sun's ZFS file system uses 128-bit pointers, which should
theoretically never need to be expanded. (The mass required to store 2^128 bytes with atomic storage would be at least 272 trillion
kilograms!) Kernel table sizes used to be fixed, and could only be changed by rebuilding the kernels. Modern tables are dynamically allocated,
but that requires more complicated algorithms for accessing them.
 Performance: Even after the basic file-system algorithms have been selected, we can still improve performance in several ways. Disk
controllers generally include on-board caching. When a seek is requested, the heads are moved into place, and then an entire track is read,
starting from whatever sector is currently under the heads (reducing latency). The requested sector is returned and the unrequested portion of
the track is cached in the disk's electronics. Some OSes cache disk blocks they expect to need again in a buffer cache. A page cache connected
to the virtual memory system is actually more efficient, as memory addresses do not need to be converted to disk block addresses and back
again. Some systems (Solaris, Linux, Windows 2000, NT, XP) use page caching for both process pages and file data in a unified virtual memory.
Figures 11.11 and 11.12 show the advantages of the unified buffer cache found in some versions of UNIX and Linux: data does not need to be
stored twice, and problems of inconsistent buffer information are avoided. (Book: Some systems maintain a separate section of main memory for a
buffer cache, where blocks are kept under the assumption that they will be used again shortly. Other systems cache file data using a page
cache. The page cache uses virtual memory techniques to cache file data as pages rather than as file-system-oriented blocks. Caching file data
using virtual addresses is far more efficient than caching through physical blocks, as accesses interface with virtual memory rather than the
file system. Several systems, including Solaris, Linux, and Windows NT/XP, use page caching to cache both process pages and file data. This is
known as unified virtual memory.)
(Book: Some versions of UNIX and Linux provide a unified buffer cache. To illustrate the benefits of the unified buffer cache, consider the
two alternatives for opening and accessing a file. One approach is to use memory mapping (Section 9.7); the second is to use the standard
system calls read() and write(). Without a unified buffer cache, we have a situation similar to Figure 11.11. Here, the read() and write()
system calls go through the buffer cache. The memory-mapping call, however, requires using two caches: the page cache and the buffer cache. A
memory mapping proceeds by reading in disk blocks from the file system and storing them in the buffer cache. Because the virtual memory system
does not interface with the buffer cache, the contents of the file in the buffer cache must be copied into the page cache. This situation is
known as double caching and requires caching file-system data twice. Not only does it waste memory but it also wastes significant CPU and I/O
cycles due to the extra data movement within system memory. In addition, inconsistencies between the two caches can result in corrupt files. In
contrast, when a unified buffer cache is provided, both memory mapping and the read() and write() system calls use the same page cache. This
has the benefit of avoiding double caching, and it allows the virtual memory system to manage file-system data. The unified buffer cache is
shown in Figure 11.12.)
o Page replacement strategies can be complicated with a unified cache, as one needs to decide whether to replace process or file
pages, and how many pages to guarantee to each category of pages. Solaris, for example, has gone through many variations,
resulting in priority paging giving process pages priority over file I/O pages, and setting limits so that neither can knock the other
completely out of memory.
o Another issue affecting performance is the question of whether to implement synchronous writes or asynchronous writes.
Synchronous writes occur in the order in which the disk subsystem receives them, without caching; asynchronous writes are cached,
allowing the disk subsystem to schedule writes in a more efficient order. (See Chapter 12.) Metadata writes are often done
synchronously. Some systems support flags to the open call requiring that writes be synchronous, for example for the benefit of
database systems that require their writes to be performed in a required order.
o The type of file access can also have an impact on optimal page replacement policies. For example, LRU is not necessarily a good
policy for sequential access files. For these types of files progression normally goes in a forward direction only, and the most recently
used page will not be needed again until after the file has been rewound and re-read from the beginning (if it is ever needed at all).
On the other hand, we can expect to need the next page in the file fairly soon. For this reason sequential access files often take
advantage of two special policies:
 Free-behind frees up a page as soon as the next page in the file is requested, with the assumption that we are now done
with the old page and won't need it again for a long time.
 Read-ahead reads the requested page and several subsequent pages at the same time, with the assumption that those
pages will be needed in the near future. This is similar to the track caching that is already performed by the disk controller,
except it saves the future latency of transferring data from the disk controller memory into motherboard main memory.
o The caching system and asynchronous writes speed up disk writes considerably, because the disk subsystem can schedule physical
writes to the disk to minimize head movement and disk seek times. (See Chapter 12.) Reads, on the other hand, must be done more
synchronously in spite of the caching system, with the result that disk writes can counter-intuitively be much faster on average than
disk reads.
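The free-behind and read-ahead policies described above can be sketched with a toy cache for a single sequentially read file. The class name, the set-based "cache", and the read-ahead depth of 3 are all assumptions for illustration; a real page cache is far more involved.

```python
class SequentialCache:
    """Toy page cache for one sequentially read file (illustrative only)."""

    def __init__(self, read_ahead: int):
        self.read_ahead = read_ahead   # extra pages fetched with each miss
        self.pages = set()             # page numbers currently cached
        self.disk_reads = 0            # number of separate disk requests

    def read(self, page: int) -> None:
        if page not in self.pages:
            # Read-ahead: fetch the requested page and the next few in one request.
            self.pages.update(range(page, page + self.read_ahead + 1))
            self.disk_reads += 1
        # Free-behind: the previous page is unlikely to be needed again soon.
        self.pages.discard(page - 1)


cache = SequentialCache(read_ahead=3)
for page in range(12):          # read pages 0..11 in order
    cache.read(page)

print(cache.disk_reads)         # 3 disk requests instead of 12
print(sorted(cache.pages))      # [11]
```

Only the most recently read page stays cached at the end, whereas an LRU policy would keep the pages behind the read position around until memory pressure evicted them.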
Recovery
Files and directories are kept both in main memory and on disk, and care must be taken to ensure that system failure does not result in loss of
data or in data inconsistency. We deal with these issues in the following sections.
 Consistency Checking: The storing of certain data structures (e.g. directories and inodes) in memory and the caching of disk operations can
speed up performance, but what happens in the event of a system crash? All volatile memory structures are lost, and the information stored
on the hard drive may be left in an inconsistent state. A consistency checker (fsck in UNIX, chkdsk or scandisk in Windows) is often run at boot
time or mount time, particularly if a filesystem was not closed down properly. Some of the problems that these tools look for include:
o Disk blocks allocated to files and also listed on the free list.
o Disk blocks neither allocated to files nor on the free list.
o Disk blocks allocated to more than one file.
o The number of disk blocks allocated to a file inconsistent with the file's stated size.
o Properly allocated files/inodes which do not appear in any directory entry.
o Link counts for an inode not matching the number of references to that inode in the directory structure.
o Two or more identical file names in the same directory.
o Illegally linked directories, e.g. cyclical relationships where those are not allowed, or files/directories that are not accessible from the
root of the directory tree.
o Consistency checkers will often collect questionable disk blocks into new files with names such as chk00001.dat. These files may
contain valuable information that would otherwise be lost, but in most cases they can be safely deleted (returning those disk blocks
to the free list).
UNIX caches directory information for reads, but any changes that affect space allocation or metadata are written
synchronously, before any of the corresponding data blocks are written to.
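The first three checks in the list above amount to cross-referencing every file's block list against the free list. The sketch below shows that idea on made-up in-memory data. The function name and data layout are hypothetical, and it does not describe how fsck or chkdsk actually work.

```python
from collections import Counter


def check_blocks(num_blocks, files, free_list):
    """Cross-check file block lists against the free list.

    files:     dict mapping a file name to the list of blocks it claims
    free_list: iterable of block numbers recorded as free
    """
    owners = Counter(b for blocks in files.values() for b in blocks)
    free = set(free_list)
    return {
        "allocated_and_free": sorted(b for b in owners if b in free),
        "missing": sorted(b for b in range(num_blocks)
                          if b not in owners and b not in free),
        "multiply_allocated": sorted(b for b, n in owners.items() if n > 1),
    }


files = {"a.txt": [0, 1, 2], "b.txt": [2, 3]}   # block 2 is claimed twice
free_list = [3, 5, 6]                            # block 3 is also in use
report = check_blocks(8, files, free_list)
print(report["allocated_and_free"])   # [3]
print(report["missing"])              # [4, 7]
print(report["multiply_allocated"])   # [2]
```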
 Log-Structured File Systems: Log-based transaction-oriented (a.k.a. journaling) filesystems borrow techniques developed for databases,
guaranteeing that any given transaction either completes successfully or can be rolled back to a safe state before the transaction commenced:
o All metadata changes are written sequentially to a log.
o A set of changes for performing a specific task (e.g. moving a file) is a transaction.
o As changes are written to the log they are said to be committed, allowing the system to return to its work.
o In the meantime, the changes from the log are carried out on the actual filesystem, and a pointer keeps track of which changes in
the log have been completed and which have not yet been completed.
o When all changes corresponding to a particular transaction have been completed, that transaction can be safely removed from the
log.
o At any given time, the log will contain information pertaining to uncompleted transactions only, i.e. actions that were committed but
for which the entire transaction has not yet been completed.
 From the log, the remaining transactions can be completed,
 or if the transaction was aborted, then the partially completed changes can be undone.
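The steps above can be sketched as a toy in-memory journal. All names here are hypothetical, and the sketch is a redo-only simplification: changes reach the metadata only when the log is replayed, so records from an uncommitted transaction are simply discarded rather than undone.

```python
class Journal:
    """Toy metadata journal: changes are logged first, applied later."""

    def __init__(self):
        self.log = []            # (txn_id, path, inode) records, in write order
        self.committed = set()   # ids of transactions whose records are all logged

    def write(self, txn_id, path, inode):
        self.log.append((txn_id, path, inode))   # sequential append to the log

    def commit(self, txn_id):
        self.committed.add(txn_id)

    def replay(self, metadata):
        """After a crash: apply committed transactions, drop the rest."""
        for txn_id, path, inode in self.log:
            if txn_id not in self.committed:
                continue                      # incomplete transaction: ignore it
            if inode is None:
                metadata.pop(path, None)      # None means "remove this entry"
            else:
                metadata[path] = inode
        self.log.clear()                      # completed work leaves the log
        self.committed.clear()
        return metadata


journal = Journal()
journal.write(1, "/a/file", None)        # move: remove the old directory entry
journal.write(1, "/b/file", "inode 7")   # ... and add the new one
journal.commit(1)
journal.write(2, "/b/file", "inode 9")   # crash before transaction 2 commits

print(journal.replay({"/a/file": "inode 7"}))   # {'/b/file': 'inode 7'}
```

The move either happens completely (both directory entries change) or not at all, which is the guarantee the journal is there to provide.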
 Backup and Restore: A full backup copies every file on a filesystem. Incremental backups copy only files which have changed since some
previous time. A combination of full and incremental backups can offer a compromise between full recoverability, the number and size of
backup tapes needed, and the number of tapes that need to be used to do a full restore. For example, one strategy might be: At the beginning
of the month do a full backup. At the end of the first and again at the end of the second week, back up all files which have changed since the
beginning of the month. At the end of the third week, back up all files that have changed since the end of the second week. Every day of the
month not listed above, do an incremental backup of all files that have changed since the most recent of the weekly backups described above.
 Other Solutions: Sun's ZFS and Network Appliance's WAFL file systems take a different approach to filesystem consistency. No blocks of data
are ever over-written in place. Rather, the new data is written into fresh new blocks, and after the transaction is complete, the metadata (data
block pointers) is updated to point to the new blocks. The old blocks can then be freed up for future use. Alternatively, if the old blocks and
old metadata are saved, then a snapshot of the system in its original state is preserved. This approach is taken by WAFL. ZFS combines this with
check-summing of all metadata and data blocks, and RAID, to ensure that no inconsistencies are possible, and therefore ZFS does not incorporate
a consistency checker.
NFS (Optional)
The NFS protocol is implemented as a set of remote procedure calls (RPCs): searching for a file in a directory, reading a set of directory
entries, manipulating links and directories, accessing file attributes, and reading and writing files. For remote operations, buffering and
caching improve performance, but can cause a disparity in local versus remote views of the same file(s).
(In addition to Figure 12.15, you can also review the preceding figures illustrating NFS file system mounting if you have forgotten them.)
Assorted Content
 Master Boot Record (MBR: Wiki): A master boot record (MBR) is a special type of boot sector at the very beginning of partitioned computer
mass storage devices like fixed disks or removable drives intended for use with IBM PC-compatible systems and beyond. The MBR holds the
information on how the logical partitions, containing file systems, are organized on that medium. The MBR also contains executable code to
function as a loader for the installed operating system, usually by passing control over to the loader's second stage, or in conjunction with
each partition's volume boot record (VBR). This MBR code is usually referred to as a boot loader. MBRs are not present on non-partitioned
media such as floppies, super floppies or other storage devices configured to behave as such.
The MBR is not located in a partition; it is located at the first sector of the device (physical offset 0), preceding the first partition. (The boot
sector present on a non-partitioned device or within an individual partition is called a volume boot record instead.)
The organization of the partition table in the MBR limits the maximum addressable storage space of a disk to 2 TiB (2^32 × 512 bytes).
Approaches to slightly raise this limit assuming 33-bit arithmetic or 4096-byte sectors are not officially supported, as they fatally break
compatibility with existing boot loaders and most MBR-compliant operating systems and system tools, and can cause serious data corruption
when used outside of narrowly controlled system environments. Therefore, the MBR-based partitioning scheme is in the process of being
superseded by the GUID Partition Table (GPT) scheme in new computers. A GPT can coexist with an MBR in order to provide some limited form
of backward compatibility for older systems.
The MBR consists of 512 or more bytes located in the first sector of the drive. It may contain one or more of: (A) A partition table describing
the partitions of a storage device. In this context the boot sector may also be called a partition sector. (B) Bootstrap code: instructions to
identify the configured bootable partition, then load and execute its volume boot record (VBR) as a chain loader. (C) Optional 32-bit disk
timestamp. (D) Optional 32-bit disk signature.
By convention, there are exactly four primary partition table entries in the MBR partition table scheme.
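As an illustration of the four-entry table, the sketch below decodes the classic MBR layout: four 16-byte entries starting at byte 446, followed by the 0x55 0xAA signature at bytes 510-511. The function name is hypothetical and the sector is built in memory rather than read from a real disk. The last line checks the 2 TiB limit quoted above.

```python
import struct

SECTOR_SIZE = 512
TABLE_OFFSET = 446   # partition table starts at byte 446 of the MBR
ENTRY_SIZE = 16      # four 16-byte entries, then the 0x55 0xAA signature


def parse_mbr(sector: bytes):
    """Return (type, first_lba, sector_count) for each non-empty primary entry."""
    if len(sector) < SECTOR_SIZE or sector[510:512] != b"\x55\xaa":
        raise ValueError("not a valid MBR sector")
    partitions = []
    for i in range(4):
        start = TABLE_OFFSET + i * ENTRY_SIZE
        entry = sector[start:start + ENTRY_SIZE]
        # status (1 byte), CHS start (3), type (1), CHS end (3), LBA start (4), sectors (4)
        _status, _chs_first, ptype, _chs_last, first_lba, count = struct.unpack(
            "<B3sB3sII", entry)
        if ptype != 0:                       # type 0 marks an unused entry
            partitions.append((ptype, first_lba, count))
    return partitions


# Build a synthetic MBR with one bootable entry of type 0x83 for the demo.
sector = bytearray(SECTOR_SIZE)
sector[TABLE_OFFSET:TABLE_OFFSET + ENTRY_SIZE] = struct.pack(
    "<B3sB3sII", 0x80, b"\0\0\0", 0x83, b"\0\0\0", 2048, 204800)
sector[510:512] = b"\x55\xaa"

print(parse_mbr(bytes(sector)))            # [(131, 2048, 204800)]
print(2**32 * SECTOR_SIZE == 2 * 2**40)    # True: 32-bit sector counts cap a disk at 2 TiB
```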
 Second-stage boot loader: Second-stage boot loaders, such as GNU GRUB, BOOTMGR, Syslinux, NTLDR or BootX, are not themselves operating
systems, but are able to load an operating system properly and transfer execution to it; the operating system subsequently initializes itself and
may load extra device drivers. The second-stage boot loader does not need drivers for its own operation, but may instead use generic storage
access methods provided by system firmware such as the BIOS or Open Firmware, though typically with restricted hardware functionality and
lower performance.
 Volume Boot Record (VBR): A Volume Boot Record (VBR) (also known as a volume boot sector, a partition boot record or a partition boot
sector) is a type of boot sector introduced by the IBM Personal Computer. It may be found on a partitioned data storage device such as a hard
disk, or an unpartitioned device such as a floppy disk, and contains machine code for bootstrapping programs (usually, but not necessarily,
operating systems) stored in other parts of the device. On non-partitioned storage devices, it is the first sector of the device. On partitioned
devices, it is the first sector of an individual partition on the device, with the first sector of the entire device being a Master Boot Record (MBR)
containing the partition table. The code in volume boot records is invoked either directly by the machine's firmware or indirectly by code in the
master boot record or a boot manager. Code in the MBR and VBR is in essence loaded the same way. Invoking a VBR via a boot manager is
known as chain loading.
 Master File Table (MFT): The NTFS file system contains a file called the master file table, or MFT. There is at least one entry in the MFT for
every file on an NTFS file system volume, including the MFT itself. All information about a file, including its size, time and date stamps,
permissions, and data content, is stored either in MFT entries, or in space outside the MFT that is described by MFT entries. As files are added
to an NTFS file system volume, more entries are added to the MFT and the MFT increases in size. When files are deleted from an NTFS file
system volume, their MFT entries are marked as free and may be reused. However, disk space that has been allocated for these entries is not
reallocated, and the size of the MFT does not decrease. (The master file table (MFT) is a database in which information about every file and
directory on an NT File System (NTFS) volume is stored. There is at least one record for every file and directory on the NTFS logical volume.
Each record contains attributes that tell the operating system (OS) how to deal with the file or directory associated with the record.)
 File Control Block (FCB): A File Control Block (FCB) is a file system structure in which the state of an open file is maintained. An FCB is
managed by the operating system, but it resides in the memory of the program that uses the file, not in operating system memory. This allows a
process to have as many files open at one time as it wants to, provided it can spare enough memory for an FCB per file. A full FCB is 36 bytes
long; in early versions of CP/M, it was 33 bytes. This fixed size, which could not be increased without breaking application compatibility, led
to the FCB's eventual demise as the standard method of accessing files. The meanings of several of the fields in the FCB differ between CP/M
and DOS, and also depending on what operation is being performed. The following fields have consistent meanings:
