File-System Implementation (Galvin)
Outline
• FILE-SYSTEM STRUCTURE
• FILE-SYSTEM IMPLEMENTATION
o Overview
o Partitions and Mounting
o Virtual File Systems
• DIRECTORY IMPLEMENTATION
o Linear List
o Hash Table
• ALLOCATION METHODS
o Contiguous Allocation
o Linked Allocation
o Indexed Allocation
o Performance
• FREE-SPACE MANAGEMENT
o Bit Vector
o Linked List
o Grouping
o Counting
o Space Maps
• EFFICIENCY AND PERFORMANCE
o Efficiency
o Performance
• RECOVERY
o Consistency Checking
o Log-Structured File Systems
o Other Solutions
o Backup and Restore
• NFS (Optional)
o Overview
o The Mount Protocol
o The NFS Protocol
o Path-Name Translation
o Remote Operations
• EXAMPLE: THE WAFL FILE SYSTEM (Optional--SKIPPED)
Contents
FILE-SYSTEM STRUCTURE
• The file system resides permanently on secondary storage. This chapter is primarily concerned with issues surrounding file storage and access on the most common secondary-storage medium, the disk.
• Hard disks have two important properties that make them suitable for secondary storage of files in file systems: (1) blocks of data can be rewritten in place; it is possible to read a block from the disk, modify the block, and write it back into the same place, and (2) they are direct access, allowing any block of data to be accessed with only (relatively) minor movements of the disk heads and rotational latency. (Disks are usually accessed in physical blocks of one or more sectors, rather than a byte at a time. Block sizes may range from 512 bytes to 4 KB or larger.)
• To provide efficient and convenient access to the disk, the OS imposes one or more file systems to allow the data to be stored, located, and retrieved easily. One of the design problems a file system poses is creating algorithms and data structures to map the logical file system onto the physical secondary-storage devices.
• The file system itself is generally composed of many different levels. The structure shown in Figure 11.1 is an example of a layered design, where each level in the design uses the features of lower levels to create new features for use by higher levels.
• File systems organize storage on disk drives, and can be viewed as a layered design:
o At the lowest layer are the physical devices, consisting of the magnetic media, motors and controls, and the electronics connected to them and controlling them. Modern disks put more and more of the electronic controls directly on the disk drive itself, leaving relatively little work for the disk controller card to perform.
o I/O control consists of device drivers, special software programs (often written in assembly) which communicate with the devices by reading and writing special codes directly to and from memory addresses corresponding to the controller card's registers. Each controller card (device) on a system has a different set of addresses (registers, a.k.a. ports) that it listens to, and a unique set of command codes and result codes that it understands. (Book: The I/O control is the lowest level and consists of device drivers and interrupt handlers to transfer information between the main memory and the disk system. A device driver can be thought of as a translator. Its input consists of high-level commands such as "retrieve block 123". Its output consists of low-level, hardware-specific instructions that are used by the hardware controller, which interfaces the I/O device to the rest of the system. The device driver usually writes specific bit patterns to special locations in the I/O controller's memory to tell the controller which device location to act on and what actions to take.)
o The basic file system level works directly with the device drivers in terms of retrieving and storing raw blocks of data, without any consideration for what is in each block. Depending on the system, blocks may be referred to with a single block number (e.g. block # 234234), or with head-sector-cylinder combinations. (Book: The basic file system needs only to issue generic commands to the appropriate device driver to read and write physical blocks on the disk. Each physical block is identified by its numeric disk address (for example, drive 1, cylinder 73, track 2, sector 10).)
o The file organization module knows about files and their logical blocks, and how they map to physical blocks on the disk. In addition to translating from logical to physical blocks, the file organization module also maintains the list of free blocks, and allocates free blocks to files as needed. (Book: The file-organization module knows about files and their logical blocks, as well as physical blocks. By knowing the type of file allocation used and the location of the file, the file-organization module can translate logical block addresses to physical block addresses for the basic file system to transfer. Each file's logical blocks are numbered from 0 (or 1) through N. Since the physical blocks containing the data usually do not match the logical numbers, a translation is needed to locate each block. The file-organization module also includes the free-space manager, which tracks unallocated blocks and provides these blocks to the file-allocation module when requested.)
o The logical file system deals with all of the metadata associated with a file (UID, GID, mode, dates, etc.), i.e. everything about the file except the data itself. This level manages the directory structure and the mapping of file names to file control blocks (FCBs), which contain all of the metadata as well as block number information for finding the data on the disk. (IBM Knowledge Center: The logical file system is the level of the file system at which users can request file operations by system call. This level of the file system provides the kernel with a consistent view of what might be multiple physical file systems and multiple file system implementations. As far as the logical file system is concerned, file system types, whether local, remote, or strictly logical, and regardless of implementation, are indistinguishable.) (Book: The logical file system manages metadata information. Metadata includes all of the file-system structure except the actual data (or contents of the files). The logical file system manages the directory structure to provide the file-organization module with the information the latter needs, given a symbolic file name. It maintains file structure via FCBs. An FCB contains information about the file, including ownership, permissions, and location of the file contents. The logical file system is also responsible for protection and security.)
• The layered approach to file systems means that much of the code can be used uniformly for a wide variety of different file systems, and only certain layers need to be file-system specific. (Book: When a layered structure is used for file-system implementation, duplication of code is minimized. The I/O control and sometimes the basic file-system code can be used by multiple file systems. Each file system can then have its own logical file system and file-organization modules.)
• Most operating systems support more than one file system. In addition to removable-media file systems, each OS has one (or more) disk-based file systems. UNIX uses the UNIX file system (UFS), which is based on the Berkeley Fast File System (FFS). Windows NT, 2000, and XP support disk file-system formats of FAT, FAT32, and NTFS (or Windows NT File System), as well as CD-ROM, DVD, and floppy-disk file-system formats. Although Linux supports over 40 different file systems, the standard Linux file system is known as the extended file system, with the most common versions being ext2 and ext3.
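The basic file-system layer described above may name a block either by a single number or by a cylinder/head/sector triple. The usual conversion between the two is LBA = (C x heads_per_cylinder + H) x sectors_per_track + (S - 1), with sectors numbered from 1. The sketch below is a minimal illustration under an assumed geometry of 16 heads and 63 sectors per track; modern drives hide their physical geometry, so treat these numbers as hypothetical.

```python
# Sketch: translating between a (cylinder, head, sector) address and a single
# linear block number (LBA). The disk geometry below is a hypothetical example.
HEADS_PER_CYLINDER = 16   # assumed number of heads (recording surfaces)
SECTORS_PER_TRACK = 63    # assumed sectors per track; sectors are numbered from 1


def chs_to_lba(cylinder: int, head: int, sector: int) -> int:
    """Map a CHS address (1-based sector) to a 0-based linear block number."""
    if not (0 <= head < HEADS_PER_CYLINDER and 1 <= sector <= SECTORS_PER_TRACK):
        raise ValueError("address outside the assumed geometry")
    return (cylinder * HEADS_PER_CYLINDER + head) * SECTORS_PER_TRACK + (sector - 1)


def lba_to_chs(lba: int) -> tuple[int, int, int]:
    """Inverse mapping: linear block number back to (cylinder, head, sector)."""
    cylinder, rest = divmod(lba, HEADS_PER_CYLINDER * SECTORS_PER_TRACK)
    head, sector_index = divmod(rest, SECTORS_PER_TRACK)
    return cylinder, head, sector_index + 1


# The book's example address, reading "track 2" as head (surface) 2.
lba = chs_to_lba(73, 2, 10)
print(lba)              # → 73719
print(lba_to_chs(lba))  # → (73, 2, 10)
```

A single block number lets upper layers ignore the drive's shape entirely; only the lowest layers need to know (or pretend to know) the geometry.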
File System Implementation
As was described in Section 10.1.2, operating systems implement open() and close() system calls for processes to request access to file contents. In this section, we delve into the structures and operations used to implement file-system operations.
Overview
Several on-disk and in-memory structures are used to implement a file system. These structures vary depending on the OS and the file system, but some general principles apply.
On disk, the file system may contain information about how to boot an operating system stored there, the total number of blocks, the number and location of free blocks, the directory structure, and individual files. Many of these structures are detailed throughout the remainder of this chapter; here we describe them briefly.
• File systems store several important data structures on the disk (the Illinois notes are partly erroneous here; refer to the book excerpts):
o A boot-control block (per volume), a.k.a. the boot block in UNIX or the partition boot sector in Windows, contains information about how to boot the system off of this disk. This will generally be the first sector of the volume if there is a bootable system loaded on that volume, or the block will be left vacant otherwise. (Book: A boot control block (per volume) can contain information needed by the system to boot an operating system from that volume. If the disk does not contain an operating system, this block can be empty. It is typically the first block of a volume. In UFS, this is called the boot block; in NTFS, it is the partition boot sector.)
o A volume control block (per volume), a.k.a. the superblock in UNIX or the master file table in Windows (NTFS), contains information such as the number of blocks in the file system and pointers to free blocks and free FCB blocks. (Book: A volume control block (per volume) contains volume (or partition) details, such as the number of blocks in the partition, size of the blocks, free-block count and free-block pointers, and free FCB count and FCB pointers. In UFS, this is called a superblock; in NTFS, it is stored in the master file table.)
o A directory structure (per file system), containing file names and pointers to corresponding FCBs. UNIX uses inode numbers, and NTFS uses a master file table. (Book: A directory structure per file system is used to organize the files. In UFS, this includes file names and associated inode numbers. In NTFS, it is stored in the master file table.)
o The File Control Block, FCB (per file), containing details about ownership, size, permissions, dates, etc. UNIX stores this information in inodes, and NTFS in the master file table as a relational database structure. (Book: A per-file FCB contains many details about the file, including file permissions, ownership, size, and location of the data blocks. In UFS, this is called the inode. In NTFS, this information is actually stored within the master file table, which uses a relational database structure, with a row per file.)
• There are also several key data structures stored in memory. (Book: The in-memory information is used for both file-system management and performance improvement via caching. The data are loaded at mount time and discarded at dismount. The structures may include the ones described below.)
o An in-memory mount table contains information about each mounted volume.
o An in-memory directory-structure cache holds the directory information of recently accessed directories. (For directories at which volumes are mounted, it can contain a pointer to the volume table.)
o The system-wide open-file table contains a copy of the FCB of each open file, as well as other information.
o A per-process open-file table contains a pointer to the system-wide open-file table entry as well as some other information. (For example, the current file-position pointer may be either here or in the system-wide table, depending on the implementation and whether the file is being shared or not.) (Book: The per-process open-file table contains a pointer to the appropriate entry in the system-wide open-file table, as well as other information.)
• Interactions of file-system components when files are created and/or used:
To create a new file, an application program calls the logical file system, which knows the format of the directory structures. It allocates a new FCB. (Alternatively, if the file-system implementation creates all FCBs at file-system creation time, an FCB is allocated from the set of free FCBs.) The system then reads the appropriate directory into memory, updates it with the new file name and FCB, and writes it back to the disk. A typical FCB is shown in Figure 11.2. Some operating systems, including UNIX, treat a directory exactly the same as a file, one with a type field indicating that it is a directory. Other operating systems, including Windows NT, implement separate system calls for files and directories and treat directories as entities separate from files. Whatever the larger structural issues, the logical file system can call the file-organization module to map the directory I/O into disk-block numbers, which are passed on to the basic file system and I/O control system.
Now that a file has been created, it can be used for I/O. First, though, it must be opened. The open() call passes a file name to the file system. The open() system call first searches the system-wide open-file table to see if the file is already in use by another process. If it is, a per-process open-file table entry is created pointing to the existing system-wide open-file table entry. This algorithm can save substantial overhead. When a file is opened, the directory structure is searched for the given file name. Parts of the directory structure are usually cached in memory to speed directory operations. Once the file is found, the FCB is copied into a system-wide open-file table in memory. This table not only stores the FCB but also tracks the number of processes that have the file open.
Next, an entry is made in the per-process open-file table, with a pointer to the entry in the system-wide open-file table and some other fields. These other fields can include a pointer to the current location in the file (for the next read() or write() operation) and the access mode in which the file is open. The open() call returns a pointer to the appropriate entry in the per-process file-system table. All file operations are then performed via this pointer. The file name may not be part of the open-file table, as the system has no use for it once the appropriate FCB is located on disk. It could be cached, though, to save time on subsequent opens of the same file. The name given to the entry varies: UNIX systems refer to it as a file descriptor; Windows refers to it as a file handle. Consequently, as long as the file is not closed, all file operations are done on the open-file table.
When a process closes the file, the per-process table entry is removed, and the system-wide entry's open count is decremented. When all users that have opened the file close it, any updated metadata is copied back to the disk-based directory structure, and the system-wide open-file table entry is removed.
Some systems complicate this scheme further by using the file system as an interface to other system aspects, such as networking. For example, in UFS, the system-wide open-file table holds the inodes and other information for files and directories. It also holds similar information for network connections and devices. In this way, one mechanism is used for multiple purposes.
The caching aspects of file-system structures should not be overlooked. Most systems keep all information about an open file, except for its actual data blocks, in memory. The BSD UNIX system is typical in its use of caches wherever disk I/O can be saved. Its average cache hit rate of 85% shows that these techniques are well worth implementing.
The operating structures of a file-system implementation are summarized in Figure 11.3.
• Before moving on to the next section, go to the reference material on MBT, MFT, VBR and FCB in the “Assorted Content” section.
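The open-file bookkeeping described in this section (a system-wide table with an open count, per-process tables holding the access mode and file position, and a descriptor returned by open()) can be modeled in a few classes. This is a minimal sketch under stated assumptions, not how any particular kernel lays out its tables: the class names are hypothetical, and the system-wide table is keyed by file name purely for brevity.

```python
# Sketch of the two-level open-file table. The system-wide table holds one
# in-memory FCB copy per open file plus an open count; each process has its
# own table, indexed by file descriptor, holding the access mode and file
# position. All names are hypothetical, and keying the system-wide table by
# name is a simplification (a real kernel would key it by inode/FCB identity).
from dataclasses import dataclass, field


@dataclass
class SystemEntry:
    fcb: dict            # in-memory copy of the file's FCB
    open_count: int = 0  # number of per-process entries pointing here


@dataclass
class ProcessEntry:
    system_entry: SystemEntry
    mode: str            # access mode the file was opened with
    position: int = 0    # offset for the next read() or write()


class SystemOpenFileTable:
    def __init__(self, directory: dict):
        self.directory = directory  # stands in for the on-disk directory: name -> FCB
        self.entries = {}           # name -> SystemEntry

    def acquire(self, name: str) -> SystemEntry:
        entry = self.entries.get(name)
        if entry is None:                                    # not open anywhere yet:
            entry = SystemEntry(dict(self.directory[name]))  # copy FCB into memory
            self.entries[name] = entry
        entry.open_count += 1                                # one more opener
        return entry

    def release(self, name: str) -> None:
        entry = self.entries[name]
        entry.open_count -= 1
        if entry.open_count == 0:             # last close:
            self.directory[name] = entry.fcb  # write metadata back
            del self.entries[name]            # and remove the entry


@dataclass
class Process:
    system_table: SystemOpenFileTable
    files: dict = field(default_factory=dict)  # fd -> (name, ProcessEntry)

    def open(self, name: str, mode: str = "r") -> int:
        entry = ProcessEntry(self.system_table.acquire(name), mode)
        fd = min(set(range(len(self.files) + 1)) - set(self.files))  # lowest free fd
        self.files[fd] = (name, entry)
        return fd

    def close(self, fd: int) -> None:
        name, _ = self.files.pop(fd)
        self.system_table.release(name)


directory = {"notes.txt": {"inode": 7, "size": 120}}
table = SystemOpenFileTable(directory)
p1, p2 = Process(table), Process(table)
fd1, fd2 = p1.open("notes.txt"), p2.open("notes.txt", "w")
print(fd1, fd2, table.entries["notes.txt"].open_count)  # → 0 0 2
p1.close(fd1)
p2.close(fd2)
print("notes.txt" in table.entries)                     # → False
```

Both processes get descriptor 0 from their own tables, yet share one system-wide entry; only the final close removes it.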
Partitions and Mounting:
• Partitions can either be used as raw devices (with no structure imposed upon them), or they can be formatted to hold a file system (i.e. populated with FCBs and initial directory structures as appropriate). Raw partitions are generally used for swap space, and may also be used for certain programs such as databases that choose to manage their own disk storage system. Partitions containing file systems can generally only be accessed using the file-system structure by ordinary users, but can often be accessed as a raw device also by root.
• The boot block is accessed as part of a raw partition, by the boot program prior to any operating system being loaded. Modern boot programs understand multiple OSes and file-system formats, and can give the user a choice of which of several available systems to boot.
• The root partition contains the OS kernel and at least the key portions of the OS needed to complete the boot process. At boot time the root partition is mounted, and control is transferred from the boot program to the kernel found there. (Older systems required that the root partition lie completely within the first 1024 cylinders of the disk, because that was as far as the boot program could reach. Once the kernel had control, it could access partitions beyond the 1024-cylinder boundary.)
• Continuing with the boot process, additional file systems get mounted, adding their information into the appropriate mount table structure. As a part of the mounting process the file systems may be checked for errors or inconsistencies, either because they are flagged as not having been closed properly the last time they were used, or just on general principles. File systems may be mounted either automatically or manually. In UNIX a mount point is indicated by setting a flag in the in-memory copy of the inode, so all future references to that inode get redirected to the root directory of the mounted file system.
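The mount-point redirection in the last bullet can be sketched as a path lookup that checks each inode for a mount flag and, when it finds one, continues from the root of the mounted file system. The in-memory inodes, the mount table, and the file-system names below are hypothetical stand-ins, not a real kernel's structures.

```python
# Sketch of path lookup across a mount point. Each file system is modeled as
# a dict of inode number -> inode; a directory inode maps names to inode
# numbers. The mounted-on directory carries a "mounted" flag in its in-memory
# inode, and a mount table says which file system's root replaces it. The
# file-system names, inode numbers, and layout are all hypothetical.
ROOT_INO = 1

root_fs = {
    1: {"type": "dir", "entries": {"home": 2, "mnt": 3}},
    2: {"type": "dir", "entries": {}},
    3: {"type": "dir", "entries": {}, "mounted": True},  # mount-point flag
}
usb_fs = {
    1: {"type": "dir", "entries": {"photo.jpg": 2}},
    2: {"type": "file", "size": 4096},
}
filesystems = {"root": root_fs, "usb": usb_fs}
mount_table = {("root", 3): "usb"}  # (fs, mount-point inode) -> mounted fs


def lookup(path: str) -> tuple[str, int]:
    """Resolve an absolute path to (file-system name, inode number)."""
    fs, ino = "root", ROOT_INO
    for name in filter(None, path.split("/")):        # skip empty components
        ino = filesystems[fs][ino]["entries"][name]     # KeyError if not found
        if filesystems[fs][ino].get("mounted"):         # flagged inode:
            fs, ino = mount_table[(fs, ino)], ROOT_INO  # redirect to mounted root
    return fs, ino


print(lookup("/home"))           # → ('root', 2)
print(lookup("/mnt/photo.jpg"))  # → ('usb', 2)
```

Note that whatever the mounted-on directory itself contained is hidden while the mount is in place, since every lookup through it is redirected.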
Virtual File Systems: Virtual File Systems (VFS) provide a common interface to multiple different file-system types. In addition, a VFS provides a unique identifier (vnode) for files across the entire space, including across all file systems of different types. (UNIX inodes are unique only across a single file system, and certainly do not carry across networked file systems.) The VFS in Linux is based upon four key object types: (a) the inode object, representing an individual file; (b) the file object, representing an open file; (c) the superblock object, representing a file system; and (d) the dentry object, representing a directory entry.
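A rough model of the VFS idea: callers use one interface, two file-system types implement it differently, and a (file-system id, local inode) pair acts as a vnode that stays unique even when local inode numbers collide. This is an illustrative Python model with hypothetical class names; it does not mirror the Linux inode/file/superblock/dentry objects.

```python
# Sketch of a VFS layer: file-system-independent code calls one interface,
# and each file-system type supplies its own implementation. A (fs_id, local
# inode) pair plays the role of the vnode. Hypothetical model, not Linux's API.
from abc import ABC, abstractmethod


class FileSystemType(ABC):
    @abstractmethod
    def lookup(self, name: str) -> int:
        """Translate a name into this file system's local inode number."""

    @abstractmethod
    def read(self, ino: int) -> bytes:
        """Return the contents of local inode `ino`."""


class DictFS(FileSystemType):
    """One type: a name table plus a list of file contents."""

    def __init__(self, files: dict):
        self.ino_of = {name: ino for ino, name in enumerate(files)}
        self.contents = list(files.values())

    def lookup(self, name: str) -> int:
        return self.ino_of[name]

    def read(self, ino: int) -> bytes:
        return self.contents[ino]


class LogFS(FileSystemType):
    """A different type: all data appended to one log, located by extents."""

    def __init__(self, files: dict):
        self.log = bytearray()
        self.ino_of = {}
        self.extents = []  # ino -> (offset, length)
        for name, data in files.items():
            self.ino_of[name] = len(self.extents)
            self.extents.append((len(self.log), len(data)))
            self.log += data

    def lookup(self, name: str) -> int:
        return self.ino_of[name]

    def read(self, ino: int) -> bytes:
        offset, length = self.extents[ino]
        return bytes(self.log[offset:offset + length])


class VFS:
    """Routes each call to the file system mounted at the path's prefix."""

    def __init__(self):
        self.mounted = []   # fs_id -> FileSystemType
        self.prefixes = {}  # mount prefix -> fs_id

    def mount(self, prefix: str, fs: FileSystemType) -> None:
        self.prefixes[prefix] = len(self.mounted)
        self.mounted.append(fs)

    def lookup(self, path: str) -> tuple[int, int]:
        prefix, _, name = path.rpartition("/")  # "/ram/a.txt" -> "/ram", "a.txt"
        fs_id = self.prefixes[prefix]
        return fs_id, self.mounted[fs_id].lookup(name)

    def read(self, vnode: tuple[int, int]) -> bytes:
        fs_id, ino = vnode
        return self.mounted[fs_id].read(ino)  # dispatch to that type's code


vfs = VFS()
vfs.mount("/ram", DictFS({"a.txt": b"hello"}))
vfs.mount("/log", LogFS({"a.txt": b"world"}))
v1, v2 = vfs.lookup("/ram/a.txt"), vfs.lookup("/log/a.txt")
print(v1, v2)                      # → (0, 0) (1, 0)
print(vfs.read(v1), vfs.read(v2))  # → b'hello' b'world'
```

Both files have local inode 0, which is exactly the collision the text attributes to plain UNIX inodes; the fs_id half of the vnode disambiguates them.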
Directory Implementation
The selection of directory-allocation and directory-management algorithms significantly affects the efficiency, performance, and reliability of the file system. In this section, we discuss the trade-offs involved in choosing one of these algorithms. (Directories need to be fast to search, insert, and delete, with a minimum of wasted disk space.)
• Linear List: The simplest method of implementing a directory is to use a linear list of file names with pointers to the data blocks. This method is simple to program but time-consuming to execute. To create a new file, we must first search the directory to be sure that no existing file has the same name. Then, we add a new entry at the end of the directory. To delete a file, we search the directory for the named file, then release the space allocated to it. To reuse the directory entry, we can do one of several things. We can mark the entry as unused (by assigning it a special name, such as an all-blank name, or with a used/unused bit in each entry), or we can attach it to a list of free directory entries. A third alternative is to copy the last entry in the directory into the freed location and to decrease the length of the directory. A linked list can also be used to decrease the time required to delete a file (at the cost of the overhead for the links).
The real disadvantage of a linear list of directory entries is that finding a file requires a linear search. Directory information is used frequently, and users will notice if access to it is slow.
A sorted list allows a binary search and decreases the average search time. However, the requirement that the list be kept sorted may complicate creating and deleting files, since we may have to move substantial amounts of directory information to maintain a sorted directory. A more sophisticated tree data structure, such as a B-tree, might help here. An advantage of the sorted list is that a sorted directory listing can be produced without a separate sort step.
• Hash Table: Another data structure for a file directory is a hash table. With this method, a linear list stores the directory entries, but a hash data structure is also used. The hash table takes a value computed from the file name and returns a pointer to the file name in the linear list. Therefore, it can greatly decrease the directory search time.
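The linear-list and hash-table directories above differ mainly in search cost. The sketch below keeps entries in a list, optionally adds a name-to-slot hash index on top of it, reuses deleted slots by moving the last entry into the hole (the third option listed for the linear list), and counts how many entries each lookup examines. The class name and the probe counter are hypothetical, for illustration only.

```python
# Sketch of a directory stored as a linear list of (name, inode number)
# entries, optionally with a hash index (name -> slot) on top. Deletion copies
# the last entry into the freed slot. Names here are hypothetical.
from typing import Optional


class Directory:
    def __init__(self, use_hash: bool = False):
        self.entries = []                                      # list of (name, ino)
        self.index: Optional[dict] = {} if use_hash else None  # name -> slot
        self.probes = 0                                        # entries examined so far

    def _find(self, name: str) -> int:
        """Return the slot holding `name`, or -1 if absent."""
        if self.index is not None:          # hash table: one probe
            self.probes += 1
            return self.index.get(name, -1)
        for slot, (entry_name, _) in enumerate(self.entries):
            self.probes += 1                # linear search
            if entry_name == name:
                return slot
        return -1

    def create(self, name: str, ino: int) -> None:
        if self._find(name) != -1:          # names must be unique
            raise FileExistsError(name)
        self.entries.append((name, ino))    # append at the end
        if self.index is not None:
            self.index[name] = len(self.entries) - 1

    def delete(self, name: str) -> None:
        slot = self._find(name)
        if slot == -1:
            raise FileNotFoundError(name)
        last = self.entries.pop()           # directory shrinks by one
        if slot < len(self.entries):        # deleted entry was not the last:
            self.entries[slot] = last       # move last entry into the hole
            if self.index is not None:
                self.index[last[0]] = slot
        if self.index is not None:
            del self.index[name]

    def lookup(self, name: str) -> int:
        slot = self._find(name)
        if slot == -1:
            raise FileNotFoundError(name)
        return self.entries[slot][1]


for use_hash in (False, True):
    d = Directory(use_hash)
    for i in range(1000):
        d.create(f"file{i}", i)
    d.probes = 0
    d.lookup("file999")
    print(use_hash, d.probes)  # prints "False 1000", then "True 1"
```

The hash index must be updated whenever an entry moves, which is the extra bookkeeping traded for constant-time lookups.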
Allocation Methods
Here we discuss how to allocate space to files so that disk space is utilized effectively and files can be accessed quickly. Three major methods of allocating disk space are in wide use: contiguous, linked, and indexed. Some systems (such as Data General's RDOS for its Nova line of computers) support all three. More commonly, a system uses one method for all files within a file-system type.
Contiguous Allocation: Contiguous allocation requires that all blocks of a file be kept together contiguously. Performance is very fast, because reading successive blocks of the same file generally requires no movement of the disk heads, or at most one small step to the next adjacent cylinder.
• Storage allocation involves the same issues discussed earlier for the allocation of contiguous blocks of memory (first fit, best fit, fragmentation problems, etc.). The distinction is that the high time penalty required for moving the disk heads from spot to spot may now justify the benefits of keeping files contiguous when possible. (Even file systems that do not by default store files contiguously can benefit from certain utilities that compact the disk and make all files contiguous in the process.)
• Problems can arise when files grow, or if the exact size of a file is unknown at creation time: over-estimation of the file's final size increases external fragmentation and wastes disk space; under-estimation may require that a file be moved or a process aborted if the file grows beyond its originally allocated space. If a file grows slowly over a long time period and the total final space must be allocated initially, then a lot of space becomes unusable before the file fills the space.
• To minimize these drawbacks, some operating systems use a modified contiguous-allocation scheme. Here, a contiguous chunk of space is allocated initially; then, if that amount proves not to be large enough, another chunk of contiguous space, known as an extent, is added. The location of a file's blocks is then recorded as a location and a block count, plus a link to the first block of the next extent (this approach is used by the Veritas file system).
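First-fit contiguous allocation and the extent-based variant fit in a short sketch: a list of free flags stands in for the disk, each allocation returns an extent (start, count), and logical block i is found by walking the file's extents. With a single extent this reduces to plain contiguous allocation, where block i lives at start + i. The disk size and function names are assumptions for illustration.

```python
# Sketch of contiguous allocation: first fit over a list of free flags, with
# each allocation recorded as an extent (start, count). Adding extents as the
# file grows models the modified (extent-based) scheme. Hypothetical names.


def first_fit(free: list, count: int) -> int:
    """Return the start of the first run of `count` free blocks, or -1."""
    run_start, run_len = 0, 0
    for block, is_free in enumerate(free):
        if not is_free:
            run_len = 0
            continue
        if run_len == 0:
            run_start = block
        run_len += 1
        if run_len == count:
            return run_start
    return -1


def allocate(free: list, count: int) -> tuple[int, int]:
    start = first_fit(free, count)
    if start == -1:  # may fail even when enough blocks are free in total
        raise OSError("no free hole is large enough")
    for b in range(start, start + count):
        free[b] = False
    return start, count


def logical_to_physical(extents: list, i: int) -> int:
    """Map logical block i of a file to a physical block by walking its extents."""
    for start, count in extents:
        if i < count:
            return start + i
        i -= count
    raise IndexError("logical block beyond end of file")


free = [True] * 16
free[3] = free[9] = False               # two blocks already in use
extents = [allocate(free, 3)]           # initial chunk: blocks 0-2
extents.append(allocate(free, 4))       # file grew: new extent, blocks 4-7
print(extents)                          # → [(0, 3), (4, 4)]
print(logical_to_physical(extents, 5))  # → 6
```

After these allocations 7 blocks are still free (block 8 and blocks 10-15), yet a request for 7 contiguous blocks fails because no single hole is that large: external fragmentation in miniature.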
Linked Allocation: Linked allocation solves all problems of contiguous allocation. With linked allocation, each file is a linked list of disk blocks; the disk blocks may be scattered anywhere on the disk. The directory contains a pointer to the first and last blocks of the file, and each block contains a pointer to the next block. These pointers are not made available to the user. Thus, if each block is 512 bytes in size, and a disk address (the pointer) requires 4 bytes, then the user sees blocks of 508 bytes.
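The 512/508-byte arithmetic follows from storing the next-block pointer inside each block. The sketch below packs such blocks and shows why reaching logical block i means following i pointers from the start of the file. Where the pointer sits in the block, its byte order, and the nil value 0xFFFFFFFF are assumptions, and the disk is modeled as a plain dict.

```python
# Sketch of linked allocation with 512-byte blocks: the first 508 bytes hold
# file data and the last 4 bytes hold the number of the next block. Pointer
# position, little-endian encoding, and the nil value are assumptions.
import struct

BLOCK_SIZE = 512
PTR_SIZE = 4
DATA_SIZE = BLOCK_SIZE - PTR_SIZE  # 508 bytes visible to the user
NIL = 0xFFFFFFFF                   # assumed end-of-list pointer value


def pack_block(data: bytes, next_block: int) -> bytes:
    """Build one on-disk block: data padded to 508 bytes + 4-byte pointer."""
    if len(data) > DATA_SIZE:
        raise ValueError("too much data for one block")
    return data.ljust(DATA_SIZE, b"\0") + struct.pack("<I", next_block)


def next_of(raw: bytes) -> int:
    (ptr,) = struct.unpack("<I", raw[DATA_SIZE:])
    return ptr


def read_file(disk: dict, first: int, size: int) -> tuple[bytes, int]:
    """Read a whole file sequentially; return (data, blocks read)."""
    out, reads, block = bytearray(), 0, first
    while block != NIL:
        raw = disk[block]            # one disk read per block
        reads += 1
        out += raw[:DATA_SIZE]
        block = next_of(raw)
    return bytes(out[:size]), reads  # size comes from the directory entry


def find_block(disk: dict, first: int, i: int) -> tuple[int, int]:
    """Locate logical block i; return (physical block, reads spent on pointers)."""
    block, reads = first, 0
    for _ in range(i):               # no shortcut: walk i pointers from the start
        block = next_of(disk[block])
        reads += 1
        if block == NIL:
            raise IndexError("file has fewer blocks")
    return block, reads


disk = {                             # a three-block file: 9 -> 16 -> 1
    9: pack_block(b"A" * DATA_SIZE, 16),
    16: pack_block(b"B" * DATA_SIZE, 1),
    1: pack_block(b"C" * 10, NIL),
}
data, reads = read_file(disk, first=9, size=2 * DATA_SIZE + 10)
print(len(data), reads)        # → 1026 3
print(find_block(disk, 9, 2))  # → (1, 2)
```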
• To create a new file, we simply create a new entry in the directory. With linked allocation, each directory entry has a pointer to the first disk block of the file. This pointer is initialized to nil (the end-of-list pointer value) to signify an empty file. The size field is also set to 0. A write to the file causes the free-space management system to find a free block, and this new block is written to and is linked to the end of the file. To read a file, we simply read blocks by following the pointers from block to block. There is no external fragmentation with linked allocation, and any free block on the free-space list can be used to satisfy a request. The size of a file need not be declared when that file is created. A file can continue to grow as long as free blocks are available. Consequently, it is never necessary to compact disk space.
• Linked allocation does have disadvantages, however. The major problem is that it can be used effectively only for sequential-access files. To find the ith block of a file, we must start at the beginning of that file and follow the pointers until we get to the ith block. Each access to a pointer requires a disk read, and some require a disk seek. Consequently, it is inefficient to support a direct-access capability for linked-allocation files. (Another disadvantage is the space required for the pointers.)
• The usual solution to the pointer-space problem is to collect blocks into multiples, called clusters, and to allocate clusters rather than blocks. For instance, the file system may define a cluster as four blocks and operate on the disk only in cluster units. Pointers then use a much smaller percentage of the file's disk space. The cost of this approach is an increase in internal fragmentation, because more space is wasted when a cluster is partially full than when a block is partially full. Clusters can be used to improve the disk-access time for many other algorithms as well, so they are used in most file systems.
• Another problem of linked allocation is reliability. The files are linked together by pointers scattered all over the disk, so consider what would happen if a pointer were lost or damaged. One partial solution is to use doubly linked lists, and another is to store the file name and relative block number in each block; however, these schemes require even more overhead for each file.
• An important variation on linked allocation is the use of a file-allocation table (FAT). This simple but efficient method of disk-space allocation is used by the MS-DOS and OS/2 operating systems. A section of disk at the beginning of each volume is set aside to contain the table. The table has one entry for each disk block and is indexed by block number. The FAT is used in much the same way as a linked list. The directory entry contains the block number of the first block of the file. The table entry indexed by that block number contains the block number of the next block in the file. This chain continues until the last block, which has a special end-of-file value as the table entry. Unused blocks are indicated by a 0 table value. Allocating a new block to a file is a simple matter of finding the first 0-valued table entry and replacing the previous end-of-file value with the address of the new block. The 0 is then replaced with the end-of-file value. An illustrative example is the FAT structure shown in Figure 11.7 for a file consisting of disk blocks 217, 618, and 339.
The FAT allocation scheme can result in a significant number of disk head seeks, unless the FAT is cached. The disk head must move to the start of the volume to read the FAT and find the location of the block in question, then move to the location of the block itself. In the worst case, both moves occur for each of the blocks. A benefit is that random-access time is improved, because the disk head can find the location of any block by reading the information in the FAT.
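The FAT walk described above, including the 217, 618, 339 example, can be modeled with a list indexed by block number. The end-of-file value (-1), the table size, and starting the free-entry search at entry 2 (entries 0 and 1 are reserved on real FAT volumes) are assumptions of this sketch.

```python
# Sketch of a FAT held in memory as a list indexed by block number. A 0 entry
# marks a free block, as in the text; the EOF value and table size are assumed.
FREE = 0
EOF = -1


def chain(fat: list, first: int) -> list:
    """List a file's blocks by following FAT entries from its first block."""
    blocks, block = [], first
    while block != EOF:
        blocks.append(block)
        block = fat[block]    # next block number comes from the table
    return blocks


def append_block(fat: list, first: int) -> int:
    """Grow a file by one block, as the text describes."""
    new = fat.index(FREE, 2)  # first 0-valued entry (ValueError if disk full)
    last = chain(fat, first)[-1]
    fat[last] = new           # replace old end-of-file value with the new block
    fat[new] = EOF            # the 0 becomes the end-of-file value
    return new


fat = [FREE] * 1024
fat[217], fat[618], fat[339] = 618, 339, EOF  # directory entry: first block = 217
print(chain(fat, 217))        # → [217, 618, 339]
new = append_block(fat, 217)
print(new, chain(fat, 217))   # → 2 [217, 618, 339, 2]
```

Because the whole chain lives in the table, locating the ith block needs only table lookups, which are cheap once the FAT is cached; that is the random-access benefit mentioned above.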
Indexed Allocation: Linked allocation solves the external-fragmentation and size-declaration problems of contiguous allocation. However, in the absence of a FAT, linked allocation cannot support efficient direct access, since the pointers to the blocks are scattered with the blocks themselves all over the disk and must be retrieved in order. Indexed allocation solves this problem by bringing all the pointers together into one location: the index block.
• Each file has its own index block, which is an array of disk-block addresses. The ith entry in the index block points to the ith block of the file. The directory contains the address of the index block (Figure 11.8). To find and read the ith block, we use the pointer in the ith index-block entry. This scheme is similar to the paging scheme described in Section 8.4.
• When the file is created, all pointers in the index block are set to nil. When the ith block is first written, a block is obtained from the free-space manager, and its address is put in the ith index-block entry.
• Indexed allocation supports direct access, without suffering from external fragmentation, because any free block on the disk can satisfy a request for more space. Indexed allocation does suffer from wasted space, however. The pointer overhead of the index block is generally greater than the pointer overhead of linked allocation. Consider a common case in which we have a file of only one or two blocks. With linked allocation, we lose the space of only one pointer per block. With indexed allocation, an entire index block must be allocated, even if only one or two pointers will be non-nil.
• This point raises the question of how large the index block should be. Every file must have an index block, so we want the index block to be as small as possible. If the index block is too small, however, it will not be able to hold enough pointers for a large file, and a mechanism will have to be available to deal with this issue. Mechanisms for this purpose include the following:
o Linked scheme – An index block is normally one disk block. Thus, it can be read and written directly by itself. To allow for large files, we can link together several index blocks. For example, an index block might contain a small header giving the name of the file and a set of the first 100 disk-block addresses. The next address (the last word in the index block) is nil (for a small file) or is a pointer to another index block (for a large file).
o Multilevel index – A variant of the linked representation is to use a first-level index block to point to a set of second-level index blocks, which in turn point to the file blocks. To access a block, the OS uses the first-level index to find a second-level index block and then uses that block to find the desired data block. This approach could be continued to a third or fourth level, depending on the desired maximum file size. With 4,096-byte blocks, we could store 1,024 four-byte pointers in an index block. Two levels of indexes allow 1,048,576 data blocks and a file size of up to 4 GB.
o Combined scheme – Another alternative, used in UFS, is to keep the first, say, 15 pointers of the index block in the file's inode. The first 12 of these pointers point to direct blocks; that is, they contain addresses of blocks that contain data of the file. Thus, the data for small files (of no more than 12 blocks) do not need a separate index block. If the block size is 4 KB, then up to 48 KB of data can be accessed directly. The next three pointers point to indirect blocks. The first points to a single indirect block, which is an index block containing not data but the addresses of blocks that do contain data. The second points to a double indirect block, which contains the address of a block that contains the addresses of blocks that contain pointers to the actual data blocks. The last pointer contains the address of a triple indirect block. Under this method, the number of blocks that can be allocated to a file exceeds the amount of space addressable by the 4-byte file pointers used by many OSes. A 32-bit file pointer reaches only 2^32 bytes, or 4 GB. Many UNIX implementations, including Solaris and IBM's AIX, now support up to 64-bit file pointers. Pointers of this size allow files and file systems to be terabytes in size. A UNIX inode is shown in Figure 11.9.
• Indexed-allocation schemes suffer from some of the same performance problems as does linked allocation. Specifically, the index blocks can be cached in memory, but the data blocks may be spread all over a volume.
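The sizes quoted for the multilevel and combined schemes can be checked directly, and the same constants give a simple way to tell which inode pointer covers a given logical block. The sketch assumes 4 KB blocks, 4-byte pointers, and 12 direct pointers as in the text; `locate` is a hypothetical helper, not a UFS routine.

```python
# Worked sizes for the indexed schemes, assuming 4 KB blocks and 4-byte block
# pointers, plus a sketch of how a UFS-style inode with 12 direct pointers and
# single, double, and triple indirect pointers locates logical block n.
BLOCK = 4096
PTR = 4
PER_BLOCK = BLOCK // PTR  # 1,024 pointers fit in one index block
DIRECT = 12

two_level_blocks = PER_BLOCK ** 2                                  # two-level index
print(two_level_blocks, two_level_blocks * BLOCK // 2**30, "GB")  # → 1048576 4 GB

direct_bytes = DIRECT * BLOCK                                      # combined scheme
max_blocks = DIRECT + PER_BLOCK + PER_BLOCK**2 + PER_BLOCK**3
print(direct_bytes // 1024, "KB via direct pointers")  # → 48 KB via direct pointers
print(max_blocks * BLOCK > 2**32)                      # → True (beyond a 32-bit offset)


def locate(n: int) -> tuple[str, list]:
    """Return which inode pointer covers logical block n, and the slot to
    follow in each index block along the way."""
    if n < DIRECT:
        return "direct", [n]
    n -= DIRECT
    if n < PER_BLOCK:
        return "single", [n]
    n -= PER_BLOCK
    if n < PER_BLOCK**2:
        return "double", list(divmod(n, PER_BLOCK))
    n -= PER_BLOCK**2
    if n < PER_BLOCK**3:
        top, rest = divmod(n, PER_BLOCK**2)
        return "triple", [top, *divmod(rest, PER_BLOCK)]
    raise ValueError("beyond the maximum file size")


print(locate(5000))  # → ('double', [3, 892])
```

Small files never touch an index block at all, while each extra level of indirection costs one more block read (unless that index block is cached).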
Performance: The optimal allocation method is different for sequential-access files than for random-access files, and is also different for small files than for large files. Some systems support more than one allocation method, which may require specifying how the file is to be used (sequential or random access) at the time it is allocated. Such systems also provide conversion utilities. Some systems have been known to use contiguous allocation for small files, and automatically switch to an indexed scheme when file sizes surpass a certain threshold. And of course some systems adjust their allocation schemes (e.g. block sizes) to best match the characteristics of the hardware for optimum performance.
Free-Space Management
Another important aspect of disk management is keeping track of and allocating free space.
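The first technique below, the bit vector, is easy to sketch, and the counting representation discussed further down can be derived from the same map. This minimal sketch follows the text's convention (1 = free, 0 = allocated); packing bits least-significant-first and the 1 KB block size used in the bitmap-size check are assumptions.

```python
# Sketch of a free-space bit vector (1 = free, 0 = allocated), packed 8 blocks
# per byte, least-significant bit first (an assumption). free_runs() derives
# the "counting" representation: (first free block, run length) pairs.


class BitVector:
    def __init__(self, num_blocks: int):
        self.n = num_blocks
        self.bits = bytearray(b"\xff" * ((num_blocks + 7) // 8))  # all free

    def is_free(self, b: int) -> bool:
        return (self.bits[b // 8] >> (b % 8)) & 1 == 1

    def mark(self, b: int, free: bool) -> None:
        if free:
            self.bits[b // 8] |= 1 << (b % 8)
        else:
            self.bits[b // 8] &= ~(1 << (b % 8)) & 0xFF

    def first_free(self) -> int:
        """Skip whole zero bytes (8 allocated blocks), then find the lowest 1 bit."""
        for i, byte in enumerate(self.bits):
            if byte:
                b = i * 8 + (byte & -byte).bit_length() - 1
                return b if b < self.n else -1  # ignore padding bits past the end
        return -1

    def free_runs(self) -> list:
        runs, start = [], None
        for b in range(self.n):
            if self.is_free(b):
                if start is None:
                    start = b
            elif start is not None:
                runs.append((start, b - start))
                start = None
        if start is not None:
            runs.append((start, self.n - start))
        return runs


bv = BitVector(20)
for b in (0, 1, 2, 5, 6):
    bv.mark(b, free=False)
print(bv.first_free())  # → 3
print(bv.free_runs())   # → [(3, 2), (7, 13)]

blocks = 40 * 2**30 // 2**10                # 40 GB disk with assumed 1 KB blocks
print(blocks // 8 / 2**20, "MB of bitmap")  # → 5.0 MB of bitmap
```

Skipping zero bytes is what makes the scan fast in practice; the counting form shrinks the record whenever free blocks come in long runs.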
• Bit Vector: One simple approach is to use a bit vector, in which each bit represents a disk block, set to 1 if free or 0 if allocated. Fast algorithms exist for quickly finding contiguous blocks of a given size. The downside is that the bitmap can be large: for example, a 40 GB disk with 1 KB blocks requires over 5 MB just to store the bitmap.
 Linked List: A linked list can also be used to keep track of all free blocks. Traversing the list and/or finding a contiguous block of a given size are not easy, but fortunately are not frequently needed operations. Generally the system just adds and removes single blocks from the beginning of the list. The FAT table keeps track of the free list as just one more linked list on the table.
 Grouping: A variation on linked-list free lists is to use links of blocks of indices of free blocks. If a block holds up to N addresses, then the first block in the linked list contains up to N-1 addresses of free blocks and a pointer to the next block of free addresses.
 Counting: When there are multiple contiguous blocks of free space, the system can keep track of the starting address of the group and the number of contiguous free blocks. As long as the average length of a contiguous group of free blocks is greater than two, this offers a savings in the space needed for the free list. (This is similar to the compression techniques used for graphics images when a group of pixels all the same color is encountered.)
 Space Maps: Sun's ZFS file system was designed for HUGE numbers and sizes of files, directories, and even file systems. The resulting data structures could be VERY inefficient if not implemented carefully. For example, freeing up a 1 GB file on a 1 TB file system could involve updating thousands of blocks of free-list bitmaps if the file was spread across the disk. ZFS uses a combination of techniques, starting with dividing the disk up into (hundreds of) metaslabs of a manageable size, each having its own space map. Free blocks are managed using the counting technique, but rather than write the information to a table, it is recorded in a log-structured transaction record. Adjacent free blocks are also coalesced into a larger single free block. An in-memory space map is constructed from the log data using a balanced tree data structure. The combination of the in-memory tree and the on-disk log provides very fast and efficient management of these very large files and free blocks.
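A minimal Python sketch of the bit-vector scheme described above, using its convention (1 = free, 0 = allocated). The byte layout (block b in bit b % 8 of byte b // 8) and the method names are assumptions for illustration. `first_free` skips bytes that are entirely zero, a common way of scanning a bitmap quickly, and `find_run` looks for a contiguous group of free blocks. `bitmap_bytes` reproduces the 40 GB sizing example, again assuming 1 KB blocks.

```python
class BitVector:
    """Free-space bitmap: bit = 1 means the block is free, 0 means allocated.
    Assumed layout: block b lives in bit (b % 8) of byte (b // 8)."""

    def __init__(self, n_blocks: int):
        self.n = n_blocks
        self.bits = bytearray(b"\xff" * ((n_blocks + 7) // 8))  # all blocks start free

    def is_free(self, b: int) -> bool:
        return (self.bits[b // 8] >> (b % 8)) & 1 == 1

    def allocate(self, b: int) -> None:
        self.bits[b // 8] &= ~(1 << (b % 8)) & 0xFF   # clear the block's bit

    def free(self, b: int) -> None:
        self.bits[b // 8] |= 1 << (b % 8)              # set the block's bit

    def first_free(self) -> int:
        """Scan a byte at a time; a zero byte means eight allocated blocks."""
        for i, byte in enumerate(self.bits):
            if byte == 0:
                continue
            for j in range(8):
                b = i * 8 + j
                if b < self.n and (byte >> j) & 1:
                    return b
        return -1

    def find_run(self, length: int) -> int:
        """Start of the first run of `length` contiguous free blocks, or -1."""
        start, run = 0, 0
        for b in range(self.n):
            if self.is_free(b):
                if run == 0:
                    start = b
                run += 1
                if run == length:
                    return start
            else:
                run = 0
        return -1


def bitmap_bytes(disk_bytes: int, block_bytes: int) -> int:
    """Bitmap size: one bit per block, rounded up to whole bytes."""
    return (disk_bytes // block_bytes + 7) // 8
```

For instance, `bitmap_bytes(40 * 2**30, 1024)` returns 5,242,880, i.e. 5 MiB.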
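The linked free list described above can be sketched the same way. On a real disk the "next" pointer is stored inside each free block; here a dictionary stands in for those pointers, so this is a simulation under that assumption, not an on-disk format. Taking and returning blocks at the head is O(1), which is why the system rarely needs to traverse the list.

```python
END = -1  # marks the end of the chain


class LinkedFreeList:
    """Free blocks chained through a 'next' field kept inside each free block.
    A dict stands in for those on-disk pointers (simulation only)."""

    def __init__(self, free_blocks: list[int]):
        self.next: dict[int, int] = {}
        self.head = END
        for b in reversed(free_blocks):   # push in reverse to keep the given order
            self.free(b)

    def free(self, block: int) -> None:
        self.next[block] = self.head      # freed block points at the old head...
        self.head = block                 # ...and becomes the new head

    def allocate(self) -> int:
        if self.head == END:
            raise RuntimeError("no free blocks")
        block = self.head
        self.head = self.next.pop(block)  # on disk: read the pointer stored in `block`
        return block


fl = LinkedFreeList([2, 3, 5])
print(fl.allocate(), fl.allocate())   # 2 3
```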
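Grouping can be simulated as follows. The sketch assumes, as in Galvin's description, that each index block is itself a free block and that the last slot of an index block names the next index block. `build_groups` and `all_free` are hypothetical helpers; the point they illustrate is that one block read yields up to N-1 free addresses instead of one.

```python
def build_groups(free_blocks: list[int], n: int):
    """Pack free-block numbers into index blocks of n slots each: up to n-1
    free-block addresses followed by the number of the next index block."""
    if n < 2:
        raise ValueError("an index block needs room for an address and a link")
    if not free_blocks:
        return None, {}
    disk = {}                          # simulated disk: index block -> its slots
    remaining = list(free_blocks)
    head = current = remaining.pop(0)  # first free block becomes the first index block
    while True:
        batch, remaining = remaining[:n - 1], remaining[n - 1:]
        nxt = remaining.pop(0) if remaining else None   # next index block, itself free
        disk[current] = batch + [nxt]
        if nxt is None:
            return head, disk
        current = nxt


def all_free(head, disk):
    """Walk the chain; each index-block read yields up to n-1 free addresses."""
    found, current = [], head
    while current is not None:
        found.append(current)              # an index block is free space too
        *addresses, current = disk[current]
        found.extend(addresses)
    return found


head, disk = build_groups(list(range(10, 20)), n=4)
print(disk[10])   # [11, 12, 13, 14]: three free blocks, then the link to block 14
```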
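The counting representation described above is essentially run-length encoding of the free list. In this sketch (function names are illustrative) each run is stored as a `(start, count)` pair, i.e. two numbers, which is why it only saves space when the average run is longer than two blocks.

```python
def to_runs(free_blocks):
    """Counting: compress free block numbers into (start, count) pairs."""
    runs = []
    for b in sorted(set(free_blocks)):
        if runs and runs[-1][0] + runs[-1][1] == b:
            start, count = runs[-1]
            runs[-1] = (start, count + 1)    # b extends the current run
        else:
            runs.append((b, 1))              # b starts a new run
    return runs


def from_runs(runs):
    """Expand (start, count) pairs back into individual block numbers."""
    return [start + i for start, count in runs for i in range(count)]


print(to_runs([2, 3, 4, 5, 8, 9, 10, 17]))   # [(2, 4), (8, 3), (17, 1)]
```

Here eight free blocks are described by six numbers instead of eight; with an average run of two or less there would be no saving.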
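The ZFS idea of replaying an allocation log into an in-memory map of free extents can be illustrated with the sketch below. It is a simplification, not ZFS code: the record format `(op, start, length)` is assumed, extents are half-open `[start, end)` ranges, and a plain sorted list stands in for the balanced tree, so each update is linear rather than logarithmic. Freeing coalesces adjacent extents, as described for space maps above.

```python
Extent = tuple[int, int]   # half-open range [start, end) of free blocks


def add_free(extents: list[Extent], start: int, length: int) -> list[Extent]:
    """Record a free, merging it with any overlapping or adjacent extents."""
    end = start + length
    kept = []
    for s, e in extents:
        if e < start or s > end:          # neither overlaps nor touches
            kept.append((s, e))
        else:                             # overlaps or touches: coalesce
            start, end = min(start, s), max(end, e)
    kept.append((start, end))
    return sorted(kept)


def remove_free(extents: list[Extent], start: int, length: int) -> list[Extent]:
    """Record an allocation, trimming or splitting the extents it covers."""
    end = start + length
    kept = []
    for s, e in extents:
        if e <= start or s >= end:        # untouched by the allocation
            kept.append((s, e))
            continue
        if s < start:
            kept.append((s, start))       # free piece left of the allocation
        if e > end:
            kept.append((end, e))         # free piece right of the allocation
    return kept


def replay(log) -> list[Extent]:
    """Rebuild the in-memory free map from log records (op, start, length)."""
    extents: list[Extent] = []
    for op, start, length in log:
        if op == "free":
            extents = add_free(extents, start, length)
        elif op == "alloc":
            extents = remove_free(extents, start, length)
        else:
            raise ValueError(f"unknown log record {op!r}")
    return extents


log = [("free", 0, 100), ("alloc", 10, 20), ("free", 10, 5), ("alloc", 90, 10)]
print(replay(log))   # [(0, 15), (30, 90)]
```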
Efficiency and Performance
 Efficiency: The efficient use of disk space depends heavily on the disk-allocation and directory algorithms in use. For instance, UNIX pre-allocates inodes, which occupy space even before any files are created. UNIX also distributes inodes across the disk and tries to store data files near their inode, to reduce the distance of disk seeks between the inodes and the data. Some systems use variable-size clusters depending on the file size. The more data that is stored in a directory entry (e.g., information like last access time), the more often the directory blocks have to be rewritten. As technology advances, addressing schemes have had to grow as well. Sun's ZFS file system uses 128-bit pointers, which should theoretically never need to be expanded. (The mass required to store 2^128 bytes with atomic storage would be at least 272 trillion kilograms!) Kernel table sizes used to be fixed and could only be changed by rebuilding the kernel. Modern tables are dynamically allocated, but that requires more complicated algorithms for accessing them.
 Performance: Even after the basic file-system algorithms have been selected, we can still improve performance in several ways. Disk controllers generally include on-board caching. When a seek is requested, the heads are moved into place, and then an entire track is read, starting from whatever sector is currently under the heads (reducing latency). The requested sector is returned and the unrequested portion of the track is cached in the disk's electronics. Some OSes cache disk blocks they expect to need again in a buffer cache. A page cache connected to the virtual memory system is actually more efficient, as memory addresses do not need to be converted to disk block addresses and back again. Some systems (Solaris, Linux, Windows 2000, NT, XP) use page caching for both process pages and file data in a unified virtual memory. Figures 11.11 and 11.12 show the advantages of the unified buffer cache found in some versions of UNIX and Linux: data does not need to be stored twice, and problems of inconsistent buffer information are avoided. (Book: Some systems maintain a separate section of main memory for a buffer cache, where blocks are kept under the assumption that they will be used again shortly. Other systems cache file data using a page cache. The page cache uses virtual memory techniques to cache file data as pages rather than as file-system-oriented blocks. Caching file data using virtual addresses is far more efficient than caching through physical blocks, as accesses interface with virtual memory rather than the file system. Several systems, including Solaris, Linux, and Windows NT/XP, use page caching to cache both process pages and file data. This is known as unified virtual memory.)
(Book: Some versions of UNIX and Linux provide a unified buffer cache. To illustrate the benefits of the unified buffer cache, consider the two alternatives for opening and accessing a file. One approach is to use memory mapping (Section 9.7); the second is to use the standard system calls read() and write(). Without a unified buffer cache, we have a situation similar to Figure 11.11. Here, the read() and write() system calls go through the buffer cache. The memory-mapping call, however, requires using two caches: the page cache and the buffer cache. A memory mapping proceeds by reading in disk blocks from the file system and storing them in the buffer cache. Because the virtual memory system does not interface with the buffer cache, the contents of the file in the buffer cache must be copied into the page cache. This situation is known as double caching and requires caching file-system data twice. Not only does it waste memory, but it also wastes significant CPU and I/O cycles due to the extra data movement within system memory. In addition, inconsistencies between the two caches can result in corrupt files. In contrast, when a unified buffer cache is provided, both memory mapping and the read() and write() system calls use the same page cache. This has the benefit of avoiding double caching, and it allows the virtual memory system to manage file-system data. The unified buffer cache is shown in Figure 11.12.)
o Page replacement strategies can be complicated with a unified cache, as one needs to decide whether to replace process or file pages, and how many pages to guarantee to each category of pages. Solaris, for example, has gone through many variations, resulting in priority paging giving process pages priority over file I/O pages, and setting limits so that neither can knock the other completely out of memory.
o Another issue affecting performance is the question of whether to implement synchronous writes or asynchronous writes. Synchronous writes occur in the order in which the disk subsystem receives them, without caching; asynchronous writes are cached, allowing the disk subsystem to schedule writes in a more efficient order (see Chapter 12). Metadata writes are often done synchronously. Some systems support flags to the open call requiring that writes be synchronous, for example for the benefit of database systems that require their writes be performed in a required order.
o The type of file access can also have an impact on optimal page replacement policies. For example, LRU is not necessarily a good policy for sequential access files. For these types of files, progression normally goes in a forward direction only, and the most recently used page will not be needed again until after the file has been rewound and re-read from the beginning (if it is ever needed at all). On the other hand, we can expect to need the next page in the file fairly soon. For this reason sequential access files often take advantage of two special policies:
 Free-behind frees up a page as soon as the next page in the file is requested, with the assumption that we are now done with the old page and won't need it again for a long time.
 Read-ahead reads the requested page and several subsequent pages at the same time, with the assumption that those pages will be needed in the near future. This is similar to the track caching that is already performed by the disk controller, except it saves the future latency of transferring data from the disk controller memory into motherboard main memory.
o The caching system and asynchronous writes speed up disk writes considerably, because the disk subsystem can schedule physical writes to the disk to minimize head movement and disk seek times (see Chapter 12). Reads, on the other hand, must be done more synchronously in spite of the caching system, with the result that disk writes can counter-intuitively be much faster on average than disk reads.
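On POSIX systems the synchronous-write flag to the open call mentioned above is `O_SYNC`; an alternative is to write normally and then call `fsync()` on the file. The Python sketch below shows both through the standard `os` module. It assumes a POSIX platform (where `os.O_SYNC` is absent, e.g. on Windows, the code falls back to an ordinary cached write) and ignores short writes, which a robust version would loop on. The file names are hypothetical.

```python
import os
import tempfile


def write_synchronously(path: str, data: bytes) -> int:
    """Open with O_SYNC so write() completes only once the data has been
    written through to the device (POSIX 'file integrity completion')."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | getattr(os, "O_SYNC", 0)
    fd = os.open(path, flags, 0o644)
    try:
        return os.write(fd, data)   # may be a short write for large buffers
    finally:
        os.close(fd)


def write_then_fsync(path: str, data: bytes) -> None:
    """Alternative: an ordinary (asynchronous) write, forced out afterwards."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()              # move Python's user-space buffer into the kernel
        os.fsync(f.fileno())   # ask the kernel to write its cached data to the device


with tempfile.TemporaryDirectory() as d:
    p = os.path.join(d, "journal.dat")
    write_synchronously(p, b"record-1\n")
    with open(p, "rb") as f:
        print(f.read())        # b'record-1\n'
```

The `O_SYNC` route makes every write pay the device latency, whereas `fsync()` lets a program batch several cached writes and force them out once, at the point where ordering matters.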
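The free-behind and read-ahead policies for sequential files described above can be modeled with a toy page cache. Everything here is hypothetical (the class, the read-ahead depth, the page contents); it only counts how many separate disk requests a strictly sequential reader triggers, and it does not model end of file.

```python
class SequentialCache:
    """Toy page cache for one file read sequentially, using free-behind and read-ahead."""

    def __init__(self, read_ahead: int):
        self.read_ahead = read_ahead   # extra pages fetched with each miss
        self.cached = {}               # page number -> placeholder contents
        self.disk_requests = 0         # separate trips to the disk

    def read(self, page: int) -> str:
        # Free-behind: requesting page p means earlier pages are assumed finished.
        for old in [p for p in self.cached if p < page]:
            del self.cached[old]
        if page not in self.cached:
            # Read-ahead: one request brings in this page and the next few.
            self.disk_requests += 1
            for p in range(page, page + 1 + self.read_ahead):
                self.cached[p] = f"page-{p}"
        return self.cached[page]


for depth in (0, 3):
    cache = SequentialCache(read_ahead=depth)
    for p in range(8):
        cache.read(p)
    print(depth, cache.disk_requests, sorted(cache.cached))
# 0 8 [7]
# 3 2 [7]
```

With a read-ahead depth of 3, reading pages 0 to 7 costs two disk requests instead of eight, and free-behind leaves only the current page cached.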
Recovery
Files and directories are kept both in main memory and on disk, and care must be taken to ensure that system failure does not result in loss of data or in data inconsistency. We deal with these issues in the following sections.
 Consistency Checking: Storing certain data structures (e.g. directories and inodes) in memory and caching disk operations can speed up performance, but what happens in the event of a system crash? All volatile memory structures are lost, and the information stored on the hard drive may be left in an inconsistent state. A consistency checker (fsck in UNIX, chkdsk or scandisk in Windows) is often run at boot time or mount time, particularly if a file system was not closed down properly. Some of the problems that these tools look for include:
o Disk blocks allocated to files and also listed on the free list.
o Disk blocks neither allocated to files nor on the free list.
o Disk blocks allocated to more than one file.
o The number of disk blocks allocated to a file inconsistent with the file's stated size.
o Properly allocated files/inodes which do not appear in any directory entry.
o Link counts for an inode not matching the number of references to that inode in the directory structure.
o Two or more identical file names in the same directory.
o Illegally linked directories, e.g. cyclical relationships where those are not allowed, or files/directories that are not accessible from the root of the directory tree.
o Consistency checkers will often collect questionable disk blocks into new files with names such as chk00001.dat. These files may contain valuable information that would otherwise be lost, but in most cases they can be safely deleted (returning those disk blocks to the free list).
UNIX caches directory information for reads, but any changes that affect space allocation or metadata are written synchronously, before any of the corresponding data blocks are written.
 Log-Structured File Systems: Log-based transaction-oriented (a.k.a. journaling) file systems borrow techniques developed for databases, guaranteeing that any given transaction either completes successfully or can be rolled back to a safe state before the transaction commenced:
o All metadata changes are written sequentially to a log.
o A set of changes for performing a specific task (e.g. moving a file) is a transaction.
o As changes are written to the log they are said to be committed, allowing the system to return to its work.
o In the meantime, the changes from the log are carried out on the actual file system, and a pointer keeps track of which changes in the log have been completed and which have not yet been completed.
o When all changes corresponding to a particular transaction have been completed, that transaction can be safely removed from the log.
o At any given time, the log will contain information pertaining to uncompleted transactions only, i.e. actions that were committed but for which the entire transaction has not yet been completed.
 From the log, the remaining transactions can be completed,
 or if the transaction was aborted, then the partially completed changes can be undone.
 Backup and Restore: A full backup copies every file on a file system. Incremental backups copy only files which have changed since some previous time. A combination of full and incremental backups can offer a compromise between full recoverability, the number and size of backup tapes needed, and the number of tapes that need to be used to do a full restore. For example, one strategy might be: at the beginning of the month, do a full backup. At the end of the first and again at the end of the second week, back up all files which have changed since the beginning of the month. At the end of the third week, back up all files that have changed since the end of the second week. Every day of the month not listed above, do an incremental backup of all files that have changed since the most recent of the weekly backups described above.
 Other Solutions: Sun's ZFS and Network Appliance's WAFL file systems take a different approach to file system consistency. No blocks of data are ever overwritten in place. Rather, the new data is written into fresh new blocks, and after the transaction is complete, the metadata (data block pointers) is updated to point to the new blocks. The old blocks can then be freed up for future use. Alternatively, if the old blocks and old metadata are saved, then a snapshot of the system in its original state is preserved. This approach is taken by WAFL. ZFS combines this with checksumming of all metadata and data blocks, and RAID, to ensure that no inconsistencies are possible, and therefore ZFS does not incorporate a consistency checker.
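Three of the block-level checks listed under Consistency Checking (blocks both allocated and free, blocks in neither place, and blocks shared by several files) can be expressed as a short Python sketch. The in-memory inputs are assumptions: a real checker such as fsck reads inodes and the free-space structure from disk, performs many more checks, and repairs what it finds.

```python
from collections import Counter


def check_blocks(n_blocks: int, files: dict[str, list[int]], free_list: list[int]) -> dict:
    """Compare block usage recorded in files (inodes) against the free list."""
    in_use = Counter(b for blocks in files.values() for b in blocks)
    free = set(free_list)
    return {
        # allocated to a file and also listed on the free list
        "allocated_and_free": sorted(b for b in in_use if b in free),
        # on neither list: a "lost" block
        "missing": sorted(b for b in range(n_blocks) if b not in in_use and b not in free),
        # claimed more than once (by several files, or twice by one file)
        "multiply_allocated": sorted(b for b, c in in_use.items() if c > 1),
    }


files = {"a.txt": [0, 1], "b.txt": [1, 2]}   # block 1 is shared
print(check_blocks(8, files, free_list=[2, 4, 5]))
# {'allocated_and_free': [2], 'missing': [3, 6, 7], 'multiply_allocated': [1]}
```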
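The journaling steps under Log-Structured File Systems can be illustrated with a minimal redo-style replay in Python. The record format is hypothetical, and the sketch covers only one variant: changes reach the file system only after their commit record is in the log, so recovery re-applies committed transactions (replay is idempotent) and discards uncommitted ones. Undoing partially applied changes, the other case mentioned in the text, would need before-images in the log and is not shown.

```python
def recover(log: list[tuple], fs: dict) -> dict:
    """Redo-style journal replay after a crash.

    Records are ("set", txid, key, value) or ("commit", txid); a value of
    None means "delete key". Only transactions with a commit record are
    replayed; the others are dropped, since in this variant they never
    reached the file system."""
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    for rec in log:
        if rec[0] == "set" and rec[1] in committed:
            _, _, key, value = rec
            if value is None:
                fs.pop(key, None)
            else:
                fs[key] = value
    return fs


fs = {"/a/f": "inode 9", "/a/g": "inode 3"}
log = [
    ("set", 1, "/a/f", None),        # txn 1: move /a/f to /b/f ...
    ("set", 1, "/b/f", "inode 9"),
    ("commit", 1),                   # ... and it committed
    ("set", 2, "/a/g", None),        # txn 2 never committed before the crash
]
print(recover(log, fs))   # {'/a/g': 'inode 3', '/b/f': 'inode 9'}
```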
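The example backup schedule under Backup and Restore can be written down as a small function, which also shows the restore cost of the compromise: at most four backups (full, two weekly, one daily) are needed for any day. The sketch assumes weeks end on days 7, 14 and 21, and treats the monthly full backup as the baseline for days before the first weekly backup; both are assumptions, since the text does not fix them.

```python
WEEK_ENDS = (7, 14, 21)   # assumed: weeks end on days 7, 14 and 21


def backup_plan(day: int):
    """Return (kind, baseline_day) for the backup taken on this day of the month."""
    if not 1 <= day <= 31:
        raise ValueError("day must be 1..31")
    if day == 1:
        return ("full", None)
    if day in (7, 14):
        return ("weekly", 1)        # everything changed since the full backup
    if day == 21:
        return ("weekly", 14)       # changed since the end of week 2
    last_weekly = max([1] + [w for w in WEEK_ENDS if w < day])
    return ("daily", last_weekly)   # changed since the most recent weekly backup


def restore_set(day: int) -> list[int]:
    """Backups (by day) needed to restore the state as of `day`, oldest first."""
    chain = [day]
    while True:
        _kind, base = backup_plan(chain[-1])
        if base is None:
            break
        chain.append(base)
    return sorted(chain)


print(restore_set(25))   # [1, 14, 21, 25]
```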
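The no-overwrite approach of ZFS and WAFL can be sketched in a few lines of Python: data blocks are never overwritten, a write goes to a fresh block, and the new metadata replaces the old with a single pointer switch. Because old metadata is left intact, keeping a reference to it acts as a snapshot. This is a hypothetical in-memory model; it never frees old blocks and ignores checksums and RAID.

```python
class CowStore:
    """Toy copy-on-write store: physical blocks are written once, never overwritten."""

    def __init__(self):
        self.blocks: dict[int, str] = {}   # physical block number -> contents
        self.next_pbn = 0                  # next unused physical block
        self.root: dict[int, int] = {}     # metadata: logical block -> physical block

    def _write_new_block(self, data: str) -> int:
        pbn = self.next_pbn
        self.next_pbn += 1
        self.blocks[pbn] = data
        return pbn

    def write(self, lbn: int, data: str) -> None:
        new_root = dict(self.root)                   # copy metadata instead of editing it
        new_root[lbn] = self._write_new_block(data)  # new data goes to a fresh block
        self.root = new_root                         # one pointer switch commits the update

    def snapshot(self) -> dict[int, int]:
        return self.root   # safe to keep: a root map is never modified after creation

    def read(self, lbn: int, root=None) -> str:
        mapping = self.root if root is None else root
        return self.blocks[mapping[lbn]]


store = CowStore()
store.write(0, "v1")
snap = store.snapshot()
store.write(0, "v2")
print(store.read(0), store.read(0, snap))   # v2 v1
```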
NFS (Optional)
The NFS protocol is implemented as a set of remote procedure calls (RPCs): searching for a file in a directory, reading a set of directory entries, manipulating links and directories, accessing file attributes, and reading and writing files. For remote operations, buffering and caching improve performance, but can cause a disparity between local and remote views of the same file(s).
(In addition to Figure 12.15, you can also view the preceding figures illustrating NFS file system mounting if you forgot.)
Assorted Content
 Master Boot Record (MBR: Wiki): A master boot record (MBR) is a special type of boot sector at the very beginning of partitioned computer mass storage devices like fixed disks or removable drives intended for use with IBM PC-compatible systems and beyond. The MBR holds the information on how the logical partitions, containing file systems, are organized on that medium. The MBR also contains executable code to function as a loader for the installed operating system, usually by passing control over to the loader's second stage, or in conjunction with each partition's volume boot record (VBR). This MBR code is usually referred to as a boot loader. MBRs are not present on non-partitioned media such as floppies, superfloppies or other storage devices configured to behave as such.
The MBR is not located in a partition; it is located at the first sector of the device (physical offset 0), preceding the first partition. (The boot sector present on a non-partitioned device or within an individual partition is called a volume boot record instead.)
The organization of the partition table in the MBR limits the maximum addressable storage space of a disk to 2 TiB (2^32 × 512 bytes). Approaches to slightly raise this limit assuming 33-bit arithmetic or 4096-byte sectors are not officially supported, as they fatally break compatibility with existing boot loaders and most MBR-compliant operating systems and system tools, and can cause serious data corruption when used outside of narrowly controlled system environments. Therefore, the MBR-based partitioning scheme is in the process of being superseded by the GUID Partition Table (GPT) scheme in new computers. A GPT can coexist with an MBR in order to provide some limited form of backward compatibility for older systems.
The MBR consists of 512 or more bytes located in the first sector of the drive. It may contain one or more of: (A) a partition table describing the partitions of a storage device (in this context the boot sector may also be called a partition sector); (B) bootstrap code: instructions to identify the configured bootable partition, then load and execute its volume boot record (VBR) as a chain loader; (C) an optional 32-bit disk timestamp; (D) an optional 32-bit disk signature.
By convention, there are exactly four primary partition table entries in the MBR partition table scheme.
 Second-stage boot loader: Second-stage boot loaders, such as GNU GRUB, BOOTMGR, Syslinux, NTLDR or BootX, are not themselves operating systems, but are able to load an operating system properly and transfer execution to it; the operating system subsequently initializes itself and may load extra device drivers. The second-stage boot loader does not need drivers for its own operation, but may instead use generic storage access methods provided by system firmware such as the BIOS or Open Firmware, though typically with restricted hardware functionality and lower performance.
 Volume Boot Record (VBR): A Volume Boot Record (VBR) (also known as a volume boot sector, a partition boot record or a partition boot sector) is a type of boot sector introduced by the IBM Personal Computer. It may be found on a partitioned data storage device such as a hard disk, or an unpartitioned device such as a floppy disk, and contains machine code for bootstrapping programs (usually, but not necessarily, operating systems) stored in other parts of the device. On non-partitioned storage devices, it is the first sector of the device. On partitioned devices, it is the first sector of an individual partition on the device, with the first sector of the entire device being a Master Boot Record (MBR) containing the partition table. The code in volume boot records is invoked either directly by the machine's firmware or indirectly by code in the master boot record or a boot manager. Code in the MBR and VBR is in essence loaded the same way. Invoking a VBR via a boot manager is known as chain loading.
 Master File Table (MFT): The NTFS file system contains a file called the master file table, or MFT. There is at least one entry in the MFT for every file on an NTFS file system volume, including the MFT itself. All information about a file, including its size, time and date stamps, permissions, and data content, is stored either in MFT entries, or in space outside the MFT that is described by MFT entries. As files are added to an NTFS file system volume, more entries are added to the MFT and the MFT increases in size. When files are deleted from an NTFS file system volume, their MFT entries are marked as free and may be reused. However, disk space that has been allocated for these entries is not reallocated, and the size of the MFT does not decrease. (The master file table (MFT) is a database in which information about every file and directory on an NT File System (NTFS) volume is stored. There is at least one record for every file and directory on the NTFS logical volume. Each record contains attributes that tell the operating system (OS) how to deal with the file or directory associated with the record.)
 File Control Block (FCB): A File Control Block (FCB) is a file system structure in which the state of an open file is maintained. An FCB is managed by the operating system, but it resides in the memory of the program that uses the file, not in operating system memory. This allows a process to have as many files open at one time as it wants to, provided it can spare enough memory for an FCB per file. A full FCB is 36 bytes long; in early versions of CP/M, it was 33 bytes. This fixed size, which could not be increased without breaking application compatibility, led to the FCB's eventual demise as the standard method of accessing files. The meanings of several of the fields in the FCB differ between CP/M and DOS, and also depending on what operation is being performed. The following fields have consistent meanings:
More Related Content

DOCX
File systemimplementationfinal
PDF
File Systems
PDF
Operating Systems - Implementing File Systems
PPTX
Ch11 file system implementation
ODP
NTFS and Inode
PDF
File system
PDF
NTFS file system
DOCX
File system interface Pre Final
File systemimplementationfinal
File Systems
Operating Systems - Implementing File Systems
Ch11 file system implementation
NTFS and Inode
File system
NTFS file system
File system interface Pre Final

What's hot (19)

PPT
PPTX
directory structure and file system mounting
PPTX
Mass Storage Structure
PPT
11.file system implementation
PPTX
Disk and File System Management in Linux
PPT
8 1-os file system implementation
PDF
Ch11 file system implementation
PDF
Buffer cache unix ppt Mrs.Sowmya Jyothi
PDF
File System Implementation - Part1
PDF
ITFT_File system interface in Operating System
PPTX
File system.
PPT
DOCX
File system interfacefinal
PPTX
File System Implementation
PDF
Internal representation of file chapter 4 Sowmya Jyothi
PPTX
File system of windows xp
PPT
Ch12 OS
 
PPTX
file system in operating system
PPT
File system
directory structure and file system mounting
Mass Storage Structure
11.file system implementation
Disk and File System Management in Linux
8 1-os file system implementation
Ch11 file system implementation
Buffer cache unix ppt Mrs.Sowmya Jyothi
File System Implementation - Part1
ITFT_File system interface in Operating System
File system.
File system interfacefinal
File System Implementation
Internal representation of file chapter 4 Sowmya Jyothi
File system of windows xp
Ch12 OS
 
file system in operating system
File system
Ad

Viewers also liked (8)

PDF
6th Math (C1) - Lesson 43--Dec15
TXT
국민임대주택대출『LG777』.『XYZ』aig종신보험 상해골프투어여행사
PDF
7th Pre-Alg - Lessons 37 & 38
DOC
10 utilizari neobisnuite ale otetului
TXT
강원펜션 골프전문인테리어
DOCX
Menús semana del 17 al 21 de agosto
DOCX
File systeminterface-pre-final-formatting
PDF
Chowtodoprogram solutions
6th Math (C1) - Lesson 43--Dec15
국민임대주택대출『LG777』.『XYZ』aig종신보험 상해골프투어여행사
7th Pre-Alg - Lessons 37 & 38
10 utilizari neobisnuite ale otetului
강원펜션 골프전문인테리어
Menús semana del 17 al 21 de agosto
File systeminterface-pre-final-formatting
Chowtodoprogram solutions
Ad

Similar to Filesystemimplementationpre final-160919095849 (20)

PPT
file management_part2_os_notes.ppt
PPTX
FILE Implementation Introduction imp .pptx
PPT
File Management in Operating Systems
PPTX
I/O System and Case study
PPTX
Root file system
PPTX
Introduction to filesystems and computer forensics
PDF
TLPI Chapter 14 File Systems
PPTX
Ankit Bargali Ouiiihhhoojpk;oihigigiS BCA-IV.pptx
PPT
The Storage Systems
PPTX
Introduction to the Sleuth Kit and filesystem forensics
PPT
Chapter 11 - File System Implementation
DOCX
linux file sysytem& input and output
PPT
various commands in linux operating systems
PPT
various commands in linux operating systems
PPT
Windowsforensics
PPTX
File system Os
DOCX
file management
PDF
Root file system for embedded systems
file management_part2_os_notes.ppt
FILE Implementation Introduction imp .pptx
File Management in Operating Systems
I/O System and Case study
Root file system
Introduction to filesystems and computer forensics
TLPI Chapter 14 File Systems
Ankit Bargali Ouiiihhhoojpk;oihigigiS BCA-IV.pptx
The Storage Systems
Introduction to the Sleuth Kit and filesystem forensics
Chapter 11 - File System Implementation
linux file sysytem& input and output
various commands in linux operating systems
various commands in linux operating systems
Windowsforensics
File system Os
file management
Root file system for embedded systems

More from marangburu42 (20)

DOCX
PDF
Write miss
DOCX
Hennchthree 161102111515
DOCX
Hennchthree
DOCX
Hennchthree
DOCX
Sequential circuits
DOCX
Combinational circuits
DOCX
Hennchthree 160912095304
DOCX
Sequential circuits
DOCX
Combinational circuits
DOCX
Karnaugh mapping allaboutcircuits
DOCX
Aac boolean formulae
DOCX
Virtualmemoryfinal 161019175858
DOCX
Io systems final
DOCX
Mass storage structurefinal
DOCX
All aboutcircuits karnaugh maps
DOCX
Virtual memoryfinal
DOCX
Mainmemoryfinal 161019122029
DOCX
Virtualmemorypre final-formatting-161019022904
DOCX
Process synchronizationfinal
Write miss
Hennchthree 161102111515
Hennchthree
Hennchthree
Sequential circuits
Combinational circuits
Hennchthree 160912095304
Sequential circuits
Combinational circuits
Karnaugh mapping allaboutcircuits
Aac boolean formulae
Virtualmemoryfinal 161019175858
Io systems final
Mass storage structurefinal
All aboutcircuits karnaugh maps
Virtual memoryfinal
Mainmemoryfinal 161019122029
Virtualmemorypre final-formatting-161019022904
Process synchronizationfinal

Recently uploaded (20)

PPTX
slide head and neck muscel for medical students
PPTX
vsfbvefbegbefvsegbthnmthndgbdfvbrsjmrysnedgbdzndhzmsr
PPTX
current by laws xxxxxxxxxxxxxxxxxxxxxxxxxxx
PPTX
Certificados y Diplomas para Educación de Colores Candy by Slidesgo.pptx
PPTX
Art Appreciation-Lesson-1-1.pptx College
PPTX
White Green Simple and Professional Business Pitch Deck Presentation.pptx
PPTX
4277547e-f8e2-414e-8962-bf501ea91259.pptx
PPTX
E8 Q1 020ssssssssssssssssssssssssssssss2 PS.pptx
PPTX
CPAR_QR1_WEEK1_INTRODUCTION TO CPAR.pptx
PPTX
CPRC-SOCIAL-STUDIES-FINAL-COACHING-DAY-1.pptx
PPTX
Socio ch 1 characteristics characteristics
PPTX
Presentation on tradtional textiles of kutch
PPTX
SAPOTA CULTIVATION.pptxMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
PDF
Ricardo Salinas Pliego Accused of Acting as A Narcotics Kingpin
PDF
TUTI FRUTI RECETA RÁPIDA Y DIVERTIDA PARA TODOS
PPTX
65bc3704-6ed1-4724-977d-a70f145d40da.pptx
PPTX
Green and Orange Illustration Understanding Climate Change Presentation.pptx
PPTX
Green and Blue Illustrative Earth Day Presentation.pptx
PPTX
VAD - Acute and chronic disorders of mesenteric.pptx
PPTX
Military history & Evolution of Armed Forces of the Philippines
slide head and neck muscel for medical students
vsfbvefbegbefvsegbthnmthndgbdfvbrsjmrysnedgbdzndhzmsr
current by laws xxxxxxxxxxxxxxxxxxxxxxxxxxx
Certificados y Diplomas para Educación de Colores Candy by Slidesgo.pptx
Art Appreciation-Lesson-1-1.pptx College
White Green Simple and Professional Business Pitch Deck Presentation.pptx
4277547e-f8e2-414e-8962-bf501ea91259.pptx
E8 Q1 020ssssssssssssssssssssssssssssss2 PS.pptx
CPAR_QR1_WEEK1_INTRODUCTION TO CPAR.pptx
CPRC-SOCIAL-STUDIES-FINAL-COACHING-DAY-1.pptx
Socio ch 1 characteristics characteristics
Presentation on tradtional textiles of kutch
SAPOTA CULTIVATION.pptxMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
Ricardo Salinas Pliego Accused of Acting as A Narcotics Kingpin
TUTI FRUTI RECETA RÁPIDA Y DIVERTIDA PARA TODOS
65bc3704-6ed1-4724-977d-a70f145d40da.pptx
Green and Orange Illustration Understanding Climate Change Presentation.pptx
Green and Blue Illustrative Earth Day Presentation.pptx
VAD - Acute and chronic disorders of mesenteric.pptx
Military history & Evolution of Armed Forces of the Philippines

Filesystemimplementationpre final-160919095849

  • 1. 1 File-System Implementation (Galvin) Outline  FILE-SYSTEMSTRUCTURE  FILE-SYSTEMIMPLEMENTATION o Overview o Partitions and Mounting o Virtual File Systems  DIRECTORY IMPLEMENTATION o Linear List o HashTable  ALLOCATION METHODS o Contiguous Allocation o Linked Allocation o Indexed Allocation o Performance  FREE-SPACE MANAGEMENT o Bit Vector o Linked List o Grouping o Counting o Space Maps  EFFICIENCY AND PERFORMANCE o Efficiency o Performance  RECOVERY o Consistency Checking o Log-Structured File Systems o Other Solutions o Backup and Restore  NFS (Optional) o Overview o The Mount Protocol o The NFS Protocol o Path-Name Translation o Remote Operations  EXAMPLE: THE WAFL FILE SYSTEM (Optional--SKIPPED) Contents FILE-SYSTEM STRUCTURE  The file systemresidespermanentlyon secondarystorage. This chapter is primarilyconcerned withissues surrounding file storage andaccess on the most common secondary-storage medium, the disk.  Hard disks have twoimportant properties that make them suitable for secondarystorage of files infile systems:(1) Blocks of data canbe rewrittenin place;it is possible to read a block fromthe disk, modifythe block, andwrite it back intothe same place, and(2) theyare direct access, allowing anyblock ofdata to be accessedwith only(relatively) minor movements ofthe diskheads androtational latency. (Disks are usuallyaccessedinphysical blocks – one or more sectors - rather thana byte at a time. Blocksizes mayrange from512 bytes to 4Kor larger.)  To provide efficient andconvenient accessto the disk, the OS imposes one or more file systems to allow the data to be stored, located, and retrieved easily. One of the designproblems a file systemposes is creating algorithms anddata structures to map the logicalfile systemontothe physical secondary-storage devices.  The file systemitself is generallycomposed ofmanydifferent levels. 
The structure showninFigure 11.1 is anexample of a layered design, where each level inthe designusesthe features oflower levels to create newfeatures for use byhigher levels.
  • 2. 2 File-System Implementation (Galvin)  File systems organize storage ondisk drives, andcanbe viewed as a layereddesign: o At the lowest layer are the physicaldevices, consisting ofthe magnetic media, motors & controls, andthe electronics connectedto themand controllingthem. Moderndiskput more andmore of the electronic controlsdirectlyon the diskdrive itself, leavingrelativelylittle workfor the diskcontroller card to perform. o I/O Control consists ofdevice drivers, special software programs (often writteninassembly) whichcommunicate withthe devices by reading andwriting special codes directlyto andfrom memoryaddresses correspondingto the controller card's registers. Eachcontroller card (device) ona system hasa different set ofaddresses (registers, a.k.a. ports) that it listens to, anda unique set of commandcodesand results codes that it understands. (Book:The I/O control is the lowest level and consists of device drivers andinterrupt handlers to transfer informationbetweenthe mainmemoryandthe disk system. A device driver canbe thought of as a translator. Its input consists of high-level commands such as "retrieve block 123". Its output consists of low-level, hardware-specific instructions that are usedbythe hardware controller, which interfaces the I/O device to the rest of the system. The device driver usuallywrites specific bit patterns to special locations in the I/O controller's memoryto tellthe controller whichdevice locationto act onandwhat actions to take.) o The basic file system level works directlywith the device drivers interms of retrieving and storingraw blocks of data, without any consideration for what is ineach block. Dependingon the system, blocks maybe referred to witha single blocknumber, (e.g. block# 234234), or with head-sector-cylinder combinations. ((Book:The basic file systemneeds onlyto issue generic commands to the appropriate device driver to readandwrite physical blocks onthe disk. 
Each physicalblock is identified byits numeric diskaddress (for example, drive 1, cylinder 73,track 2,sector 10) o The file organization module knows about files and their logicalblocks, andhow theymap to physical blocks onthe disk. Inadditionto translatingfrom logical to physicalblocks, the file organizationmodule also maintains the list of free blocks, and allocates free blocks to files as needed. (Book:The file- organizationmodule knows about files andtheir logical blocks, as well as physical blocks. Byknowingthe type of file allocationusedandthe locationof the file, the file-organizationmodule cantranslate logical block addressesto physical block addresses for the basic file systemto transfer. Each file's logical blocks are numbered from 0(or 1) throughN. Since the physical blocks containingthe data usuallydo not match the logical numbers, a translationis neededto locate eachblock. The file-organizationmodule also includes the free-space manager, which tracks unallocated blocks andprovides these blocks to the file- allocationmodule when requested.) o The logical file system dealswith all ofthe meta data associatedwith a file (UID, GID, mode, dates, etc), i.e. everything about the file except the data itself. This level manages the directorystructure andthe mapping offile namesto file control blocks, FCBs, which containall of the meta data as wellas block number informationfor finding the data on the disk. (IBMKnowledgeCenter: The logicalfile systemis the level of the file system at which users can request file operations bysystem call. This level ofthe file systemprovides the kernel witha consistent view of what might be multiple physicalfile systems andmultiple file systemimplementations. As far as the logical file system is concerned, file systemtypes, whether local, remote, or strictlylogical, andregardless of implementation, are indistinguishable. ((Book:The logicalfile systemmanages metadata information. 
Metadata includes all of the file-system structure except the actual data (or contents of the files). The logical file system manages the directory structure to provide the file-organization module with the information the latter needs, given a symbolic file name. It maintains file structure via FCBs. An FCB contains information about the file, including ownership, permissions, and location of the file contents. The logical file system is also responsible for protection and security.)
• The layered approach to file systems means that much of the code can be used uniformly for a wide variety of different file systems, and only certain layers need to be file-system specific. (Book: When a layered structure is used for file-system implementation, duplication of code is minimized. The I/O control and sometimes the basic file-system code can be used by multiple file systems. Each file system can then have its own logical file system and file-organization modules.)
• Most operating systems support more than one file system. In addition to removable-media file systems, each OS has one or more disk-based file systems. UNIX uses the UNIX file system (UFS), which is based on the Berkeley Fast File System (FFS). Windows NT, 2000, and XP support the disk file-system formats FAT, FAT32, and NTFS (Windows NT File System), as well as CD-ROM, DVD, and floppy-disk file-system formats. Although Linux supports over 40 different file systems, the standard Linux file system is known as the extended file system, with the most common versions being ext2 and ext3.
FILE-SYSTEM IMPLEMENTATION
As was described in Section 10.1.2, operating systems implement open() and close() system calls for processes to request access to file contents. In this section, we delve into the structures and operations used to implement file-system operations.
Overview
Several on-disk and in-memory structures are used to implement a file system. These structures vary depending on the OS and the file system, but some general principles apply.
On disk, the file system may contain information about how to boot an operating system stored there, the total number of blocks, the number and location of free blocks, the directory structure, and individual files. Many of these structures are detailed throughout the remainder of this chapter; here we describe them briefly.
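The layered design described earlier (I/O control, basic file system, file-organization module, logical file system) can be illustrated with a minimal Python sketch. Everything here is hypothetical and chosen for clarity: the "disk" is an in-memory bytearray, the block size is arbitrary, and each file's logical-to-physical map is assumed to be a plain list, whereas the real mapping depends on the allocation method discussed later in this chapter.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 512  # bytes per block (hypothetical value for this sketch)


class DeviceDriver:
    """I/O control: turns 'read/write physical block N' into device accesses.
    The 'device' here is just a bytearray standing in for the disk."""

    def __init__(self, num_blocks):
        self.disk = bytearray(num_blocks * BLOCK_SIZE)

    def read_block(self, physical):
        start = physical * BLOCK_SIZE
        return bytes(self.disk[start:start + BLOCK_SIZE])

    def write_block(self, physical, data):
        start = physical * BLOCK_SIZE
        # Pad or truncate to exactly one block so the disk size never changes.
        self.disk[start:start + BLOCK_SIZE] = data.ljust(BLOCK_SIZE, b"\0")[:BLOCK_SIZE]


class BasicFileSystem:
    """Issues generic commands for raw physical blocks; knows nothing about files."""

    def __init__(self, driver):
        self.driver = driver

    def read(self, physical):
        return self.driver.read_block(physical)

    def write(self, physical, data):
        self.driver.write_block(physical, data)


@dataclass
class FCB:
    """Simplified file control block: metadata plus a logical-to-physical map."""
    owner: str
    size: int = 0
    blocks: list = field(default_factory=list)  # blocks[i] = physical block of logical block i


class FileOrganizationModule:
    """Translates logical to physical blocks and owns the free-space manager."""

    def __init__(self, basic_fs, num_blocks):
        self.basic_fs = basic_fs
        self.free_blocks = list(range(num_blocks))  # trivial free list

    def append_block(self, fcb, data):
        physical = self.free_blocks.pop(0)  # allocate a free block
        fcb.blocks.append(physical)
        fcb.size += len(data)
        self.basic_fs.write(physical, data)

    def read_logical(self, fcb, logical):
        return self.basic_fs.read(fcb.blocks[logical])  # logical -> physical


class LogicalFileSystem:
    """Maps symbolic names to FCBs (the directory) and handles metadata."""

    def __init__(self, file_org):
        self.file_org = file_org
        self.directory = {}  # file name -> FCB

    def create(self, name, owner):
        self.directory[name] = FCB(owner)

    def write_block(self, name, data):
        self.file_org.append_block(self.directory[name], data)

    def read_block(self, name, logical):
        return self.file_org.read_logical(self.directory[name], logical)


driver = DeviceDriver(num_blocks=64)
lfs = LogicalFileSystem(FileOrganizationModule(BasicFileSystem(driver), num_blocks=64))
lfs.create("notes.txt", owner="alice")
lfs.write_block("notes.txt", b"hello")
print(lfs.read_block("notes.txt", 0)[:5])  # b'hello'
```

Each layer only calls the one directly beneath it, so swapping in a different device touches only the bottom two classes, while a different on-disk format would replace the top two; this mirrors the code-sharing argument made for the layered approach.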
• File systems store several important data structures on the disk (parts of the Illinois notes are erroneous here; where they disagree with the book, follow the book excerpts):
o A boot-control block (per volume), a.k.a. the boot block in UNIX or the partition boot sector in Windows, contains information about how to boot the system off of this disk. This will generally be the first sector of the volume if there is a bootable system loaded on that volume, or the block will be left vacant otherwise. (Book: A boot control block (per volume) can contain information needed by the system to boot an operating system from that volume. If the disk does not contain an operating system, this block can be empty. It is typically the first block of a volume. In UFS, this is called the boot block; in NTFS, it is the partition boot sector.)
o A volume control block (per volume), a.k.a. the superblock in UNIX or the master file table in Windows (NTFS), contains information such as the partition table, the number of blocks on each file system, and pointers to free blocks and free FCB blocks. (Book: A volume control block (per volume) contains volume (or partition) details, such as the number of blocks in the partition, size of the blocks, free-block count and free-block pointers, and free FCB count and FCB pointers. In UFS, this is called a superblock; in NTFS, it is stored in the master file table.)
o A directory structure (per file system) contains file names and pointers to the corresponding FCBs. UNIX uses inode numbers, and NTFS uses a master file table. (Book: A directory structure per file system is used to organize the files. In UFS, this includes file names and associated inode numbers. In NTFS, it is stored in the master file table.)
o The file control block, FCB (per file), contains details about ownership, size, permissions, dates, etc. UNIX stores this information in inodes, and NTFS in the master file table as a relational database structure.
(Book: A per-file FCB contains many details about the file, including file permissions, ownership, size, and location of the data blocks. In UFS, this is called the inode. In NTFS, this information is actually stored within the master file table, which uses a relational database structure, with a row per file.)
• There are also several key data structures stored in memory. (Book: The in-memory information is used for both file-system management and performance improvement via caching. The data are loaded at mount time and discarded at dismount. The structures may include the ones described below.)
o An in-memory mount table contains information about each mounted volume.
o An in-memory directory-structure cache holds the directory information of recently accessed directories. (For directories at which volumes are mounted, it can contain a pointer to the volume table.)
o The system-wide open-file table contains a copy of the FCB of each open file, as well as other information.
o A per-process open-file table contains a pointer to the system-wide open-file table as well as some other information. (For example, the current file-position pointer may be either here or in the system-wide table, depending on the implementation and whether the file is being shared or not.) (Book: The per-process open-file table contains a pointer to the appropriate entry in the system-wide open-file table, as well as other information.)
• Interactions of file-system components when files are created and/or used: To create a new file, an application program calls the logical file system, which knows the format of the directory structures. The logical file system allocates a new FCB. (Alternatively, if the file-system implementation creates all FCBs at file-system creation time, an FCB is allocated from the set of free FCBs.) The system then reads the appropriate directory into memory, updates it with the new file name and FCB, and writes it back to the disk. A typical FCB is shown in Figure 11.2.
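The two open-file tables listed above, and the open() and close() bookkeeping walked through next, can be modelled with a small hypothetical Python sketch. It is not any real kernel's layout: an FCB is reduced to an inode number and a size, the "disk" is a pair of dictionaries, the system-wide table is keyed by inode number (so the name is resolved through the directory first), and a file descriptor is simply an index into a per-process list.

```python
from dataclasses import dataclass, replace


@dataclass
class FCB:
    """Hypothetical, heavily simplified file control block."""
    inode: int
    size: int


@dataclass
class SystemEntry:
    """System-wide open-file table entry: in-memory FCB copy plus open count."""
    fcb: FCB
    open_count: int = 0


@dataclass
class ProcessEntry:
    """Per-process entry: pointer to the system-wide entry, offset, access mode."""
    system_entry: SystemEntry
    mode: str
    offset: int = 0  # current file position for the next read()/write()


class Kernel:
    def __init__(self):
        self.directory = {}     # file name -> inode number (the "on-disk" directory)
        self.disk_fcbs = {}     # inode number -> FCB as stored "on disk"
        self.system_table = {}  # inode number -> SystemEntry

    def open(self, proc_table, name, mode="r"):
        inode = self.directory[name]              # resolve the name via the directory
        entry = self.system_table.get(inode)      # already open by some process?
        if entry is None:                         # no: copy the FCB into memory
            entry = SystemEntry(replace(self.disk_fcbs[inode]))
            self.system_table[inode] = entry
        entry.open_count += 1
        proc_table.append(ProcessEntry(entry, mode))
        return len(proc_table) - 1                # index acts as descriptor/handle

    def close(self, proc_table, fd):
        entry = proc_table[fd].system_entry
        proc_table[fd] = None                     # remove the per-process entry
        entry.open_count -= 1
        if entry.open_count == 0:                 # last close: write metadata back
            self.disk_fcbs[entry.fcb.inode] = replace(entry.fcb)
            del self.system_table[entry.fcb.inode]


k = Kernel()
k.directory["a.txt"] = 7
k.disk_fcbs[7] = FCB(inode=7, size=0)
p1, p2 = [], []                       # per-process open-file tables of two processes
fd1 = k.open(p1, "a.txt", mode="w")
fd2 = k.open(p2, "a.txt")             # second open reuses the system-wide entry
print(k.system_table[7].open_count)   # 2
p1[fd1].system_entry.fcb.size = 100   # e.g. metadata changed by a write
k.close(p1, fd1)
k.close(p2, fd2)                      # last close writes the FCB back
print(7 in k.system_table, k.disk_fcbs[7].size)  # False 100
```

Real systems reuse the lowest free descriptor slot; this sketch just appends, which is enough to show how both processes share one system-wide entry and its open count.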
Some operating systems, including UNIX, treat a directory exactly the same as a file, one with a type field indicating that it is a directory. Other operating systems, including Windows NT, implement separate system calls for files and directories and treat directories as entities separate from files. Whatever the larger structural issues, the logical file system can call the file-organization module to map the directory I/O into disk-block numbers, which are passed on to the basic file system and I/O control system. Now that a file has been created, it can be used for I/O. First, though, it must be opened. The open() call passes a file name to the file system. The open() system call first searches the system-wide open-file table to see if the file is already in use by another process. If it is, a per-process open-file table entry is created pointing to the existing system-wide open-file table entry. This algorithm can save substantial overhead. When a file is opened, the directory structure is searched for the given file name. Parts of the directory structure are usually cached in memory to speed directory operations. Once the file is found, the FCB is copied into a system-wide open-file table in memory. This table not only stores the FCB but also tracks the number of processes that have the file open. Next, an entry is made in the per-process open-file table, with a pointer to the entry in the system-wide open-file table and some other fields. These other fields can include a pointer to the current location in the file (for the next read() or write() operation) and the access mode in which the file is open. The open() call returns a pointer to the appropriate entry in the per-process file-system table. All file operations are then performed via this pointer. The file name may not be part of the open-file table, as the system has no use for it once the appropriate FCB is located on disk. It could be cached, though, to save time on subsequent opens of the same file. The name given to the entry varies.
UNIX systems refer to it as a file descriptor; Windows refers to it as a file handle. Consequently, as long as the file is not closed, all file operations are done on the open-file table. When a process closes the file, the per-process table entry is removed, and the system-wide entry's open count is decremented. When all users that have opened the file close it, any updated metadata is copied back to the disk-based directory structure, and the system-wide open-file table entry is removed. Some systems complicate this scheme further by using the file system as an interface to other system aspects, such as networking. For example, in UFS, the system-wide open-file table holds the inodes and other information for files and directories. It also holds similar information
for network connections and devices. In this way, one mechanism is used for multiple purposes. The caching aspects of file-system structures should not be overlooked. Most systems keep all information about an open file, except for its actual data blocks, in memory. The BSD UNIX system is typical in its use of caches wherever disk I/O can be saved. Its average cache hit rate of 85% shows that these techniques are well worth implementing. The operating structures of a file-system implementation are summarized in Figure 11.3.
• Before moving on to the next section, go to the reference material on MBR, MFT, VBR and FCB in the “Assorted Content” section.
Partitions and Mounting
• Partitions can either be used as raw devices (with no structure imposed upon them), or they can be formatted to hold a file system (i.e. populated with FCBs and initial directory structures as appropriate). Raw partitions are generally used for swap space, and may also be used for certain programs such as databases that choose to manage their own disk storage system. Partitions containing file systems can generally only be accessed through the file-system structure by ordinary users, but can often be accessed as a raw device by root as well.
• The boot block is accessed as part of a raw partition, by the boot program, prior to any operating system being loaded. Modern boot programs understand multiple OSes and file-system formats, and can give the user a choice of which of several available systems to boot.
• The root partition contains the OS kernel and at least the key portions of the OS needed to complete the boot process. At boot time the root partition is mounted, and control is transferred from the boot program to the kernel found there. (Older systems required that the root partition lie completely within the first 1024 cylinders of the disk, because that was as far as the boot program could reach. Once the kernel had control, it could access partitions beyond the 1024-cylinder boundary.)
• Continuing with the boot process, additional file systems get mounted, adding their information into the appropriate mount-table structure. As part of the mounting process the file systems may be checked for errors or inconsistencies, either because they are flagged as not having been closed properly the last time they were used, or just on general principle. File systems may be mounted either automatically or manually. In UNIX, a mount point is indicated by setting a flag in the in-memory copy of the inode, so all future references to that inode get redirected to the root directory of the mounted file system.
Virtual File Systems
Virtual file systems (VFS) provide a common interface to multiple different file-system types. In addition, a VFS provides a unique identifier (vnode) for files across the entire space, including across all file systems of different types. (UNIX inodes are unique only within a single file system, and certainly do not carry across networked file systems.) The VFS in Linux is based upon four key object types: (a) the inode object, representing an individual file; (b) the file object, representing an open file; (c) the superblock object, representing a file system; (d) the dentry object, representing a directory entry.
DIRECTORY IMPLEMENTATION
The selection of directory-allocation and directory-management algorithms significantly affects the efficiency, performance, and reliability of the file system. In this section, we discuss the trade-offs involved in choosing one of these algorithms. (Directories need to be fast to search, insert into, and delete from, with a minimum of wasted disk space.)
• Linear List: The simplest method of implementing a directory is to use a linear list of file names with pointers to the data blocks. This method is simple to program but time-consuming to execute. To create a new file, we must first search the directory to be sure that no existing file has the same name. Then, we add a new entry at the end of the directory.
To delete a file, we search the directory for the named file, then release the space allocated to it. To reuse the directory entry, we can do one of several things. We can mark the entry as unused (by assigning it a special name, such as an all-blank name, or with a used/unused bit in each entry), or we can attach it to a list of free directory entries. A third alternative is to copy the last entry in the directory into the freed location and to decrease the length of the directory. A linked list can also be used to decrease the time required to delete a file, at the cost of the overhead for the links. The real disadvantage of a linear list of directory entries is that finding a file requires a linear search. Directory information is used frequently, and users will notice if access to it is slow.
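The linear-list directory operations just described can be sketched in a few lines of Python, assuming each entry is a (name, pointer) pair held in a list; the class and method names are hypothetical. Create performs the duplicate-name search before appending, and delete uses the third reuse strategy from the text, copying the last entry into the freed slot.

```python
class LinearDirectory:
    """Directory stored as an unsorted linear list of (name, pointer) entries."""

    def __init__(self):
        self.entries = []  # each entry: (file name, pointer to FCB or first data block)

    def lookup(self, name):
        """Linear search: O(n) name comparisons, the scheme's main drawback."""
        for i, (entry_name, _) in enumerate(self.entries):
            if entry_name == name:
                return i
        return -1

    def create(self, name, pointer):
        if self.lookup(name) != -1:           # names must be unique within a directory
            raise FileExistsError(name)
        self.entries.append((name, pointer))  # add the new entry at the end

    def delete(self, name):
        i = self.lookup(name)
        if i == -1:
            raise FileNotFoundError(name)
        # Reuse the freed slot by copying the last entry into it,
        # then shorten the directory by one entry.
        self.entries[i] = self.entries[-1]
        self.entries.pop()


d = LinearDirectory()
for name, block in [("a.txt", 10), ("b.txt", 20), ("c.txt", 30)]:
    d.create(name, block)
d.delete("a.txt")
print(d.entries)  # [('c.txt', 30), ('b.txt', 20)]
```

The copy-last-entry trick keeps the list dense without shifting entries, but it destroys any ordering, so it does not combine with the sorted-list variant discussed next.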
A sorted list allows a binary search and decreases the average search time. However, the requirement that the list be kept sorted may complicate creating and deleting files, since we may have to move substantial amounts of directory information to maintain a sorted directory. A more sophisticated tree data structure, such as a B-tree, might help here. An advantage of the sorted list is that a sorted directory listing can be produced without a separate sort step.
• Hash Table: Another data structure for a file directory is a hash table. With this method, a linear list stores the directory entries, but a hash data structure is also used. The hash table takes a value computed from the file name and returns a pointer to the file name in the linear list. Therefore it can greatly decrease the directory search time.
ALLOCATION METHODS
Here we discuss how to allocate space to files so that disk space is utilized effectively and files can be accessed quickly. Three major methods of allocating disk space are in wide use: contiguous, linked, and indexed. Some systems (such as Data General's RDOS for its Nova line of computers) support all three. More commonly, a system uses one method for all files within a file-system type.
Contiguous Allocation
Contiguous allocation requires that all blocks of a file be kept together contiguously. Performance is very fast, because reading successive blocks of the same file generally requires no movement of the disk heads, or at most one small step to the next adjacent cylinder.
• Storage allocation involves the same issues discussed earlier for the allocation of contiguous blocks of memory (first fit, best fit, fragmentation problems, etc.). The distinction is that the high time penalty required for moving the disk heads from spot to spot may now justify the benefits of keeping files contiguous when possible.
(Even file systems that do not by default store files contiguously can benefit from certain utilities that compact the disk and make all files contiguous in the process.)
• Problems can arise when files grow, or if the exact size of a file is unknown at creation time. Overestimation of the file's final size increases external fragmentation and wastes disk space. Underestimation may require that a file be moved or a process aborted if the file grows beyond its originally allocated space. If a file grows slowly over a long time period and the total final space must be allocated initially, then a lot of space becomes unusable before the file fills the space.
• To minimize these drawbacks, some operating systems use a modified contiguous-allocation scheme. Here, a contiguous chunk of space is allocated initially; then, if that amount proves not to be large enough, another chunk of contiguous space, known as an extent, is added. The location of a file's blocks is then recorded as a location and a block count, plus a link to the first block of the next extent (this scheme is used by the Veritas file system).
Linked Allocation
Linked allocation solves all problems of contiguous allocation. With linked allocation, each file is a linked list of disk blocks; the disk blocks may be scattered anywhere on the disk. The directory contains a pointer to the first and last blocks of the file, and each block contains a pointer to the next block. These pointers are not made available to the user. Thus, if each block is 512 bytes in size, and a disk address (the pointer) requires 4 bytes, then the user sees blocks of 508 bytes.
• To create a new file, we simply create a new entry in the directory. With linked allocation, each directory entry has a pointer to the first disk block of the file. This pointer is initialized to nil (the end-of-list pointer value) to signify an empty file. The size field is also set to 0. A write to the file causes the free-space management system to find a free block, and this new block is written to and is linked to the end of the file.
To read a file, we simply read blocks by following the pointers from block to block. There is no external fragmentation with linked allocation, and any free block on the free-space list can be used to satisfy a request. The size of a file need not be declared when that file is created. A file can continue to grow as long as free blocks are available. Consequently, it is never necessary to compact disk space.
• Linked allocation does have disadvantages, however. The major problem is that it can be used effectively only for sequential-access files. To find the ith block of a file, we must start at the beginning of that file and follow the pointers until we get to the ith block. Each access to a pointer requires a disk read, and some require a disk seek. Consequently, it is inefficient to support a direct-access capability for linked-allocation files. (Another disadvantage is the space required for the pointers.)
• The usual solution to the pointer-space problem is to collect blocks into multiples, called clusters, and to allocate clusters rather than blocks. For instance, the file system may define a cluster
as four blocks and operate on the disk only in cluster units. Pointers then use a much smaller percentage of the file's disk space. The cost of this approach is an increase in internal fragmentation, because more space is wasted when a cluster is partially full than when a block is partially full. Clusters can be used to improve the disk-access time for many other algorithms as well, so they are used in most file systems.
• Another problem of linked allocation is reliability. The files are linked together by pointers scattered all over the disk, so consider what would happen if a pointer were lost or damaged. One partial solution is to use doubly linked lists, and another is to store the file name and relative block number in each block; however, these schemes require even more overhead for each file.
• An important variation on linked allocation is the use of a file-allocation table (FAT). This simple but efficient method of disk-space allocation is used by the MS-DOS and OS/2 operating systems. A section of disk at the beginning of each volume is set aside to contain the table. The table has one entry for each disk block and is indexed by block number. The FAT is used in much the same way as a linked list. The directory entry contains the block number of the first block of the file. The table entry indexed by that block number contains the block number of the next block in the file. This chain continues until the last block, which has a special end-of-file value as its table entry. Unused blocks are indicated by a 0 table value. Allocating a new block to a file is a simple matter of finding the first 0-valued table entry and replacing the previous end-of-file value with the address of the new block. The 0 is then replaced with the end-of-file value. An illustrative example is the FAT structure shown in Figure 11.7 for a file consisting of disk blocks 217, 618, and 339. The FAT allocation scheme can result in a significant number of disk head seeks, unless the FAT is cached.
The disk head must move to the start of the volume to read the FAT and find the location of the block in question, then move to the location of the block itself. In the worst case, both moves occur for each of the blocks. A benefit is that random-access time is improved, because the disk head can find the location of any block by reading the information in the FAT.
Indexed Allocation
Linked allocation solves the external-fragmentation and size-declaration problems of contiguous allocation. However, in the absence of a FAT, linked allocation cannot support efficient direct access, since the pointers to the blocks are scattered with the blocks themselves all over the disk and must be retrieved in order. Indexed allocation solves this problem by bringing all the pointers together into one location: the index block.
• Each file has its own index block, which is an array of disk-block addresses. The ith entry in the index block points to the ith block of the file. The directory contains the address of the index block (Figure 11.8). To find and read the ith block, we use the pointer in the ith index-block entry. This scheme is similar to the paging scheme described in Section 8.4.
• When the file is created, all pointers in the index block are set to nil. When the ith block is first written, a block is obtained from the free-space manager, and its address is put in the ith index-block entry.
• Indexed allocation supports direct access, without suffering from external fragmentation, because any free block on the disk can satisfy a request for more space. Indexed allocation does suffer from wasted space, however. The pointer overhead of the index block is generally greater than the pointer overhead of linked allocation. Consider a common case in which we have a file of only one or two blocks. With linked allocation, we lose the space of only one pointer per block. With indexed allocation, an entire index block must be allocated, even if only one or two pointers will be non-nil.
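The difference in direct-access cost between a FAT chain and an index block can be sketched in Python with the file of Figure 11.7 (blocks 217, 618, 339). This is an illustrative model only: the table is an in-memory list and 0 marks a free block as in the text, while the end-of-file value of -1 and the reservation of entries 0 and 1 are assumptions modelled loosely on real FAT variants.

```python
FREE = 0        # unused block, as in the text
EOF = -1        # end-of-file marker (hypothetical value; real FATs use reserved codes)
FIRST_DATA = 2  # FAT12/16/32 reserve table entries 0 and 1


class FAT:
    def __init__(self, num_blocks):
        self.table = [FREE] * num_blocks  # one entry per disk block, indexed by block number

    def allocate(self, last_block=None):
        """Take the first 0-valued entry and link it after last_block, if given."""
        for block in range(FIRST_DATA, len(self.table)):
            if self.table[block] == FREE:
                self.table[block] = EOF             # new block ends the chain
                if last_block is not None:
                    self.table[last_block] = block  # old end-of-file now points to it
                return block
        raise OSError("no free blocks")

    def nth_block(self, start, i):
        """Follow the chain from the directory's start block: i table lookups."""
        block = start
        for _ in range(i):
            block = self.table[block]
            if block == EOF:
                raise IndexError("block index past end of file")
        return block


# The file of Figure 11.7: its directory entry records start block 217.
fat = FAT(1024)
fat.table[217], fat.table[618], fat.table[339] = 618, 339, EOF
print(fat.nth_block(217, 2))   # 339, reached via 217 -> 618 -> 339

# The same file under indexed allocation: one array holds every address.
index_block = [217, 618, 339]
print(index_block[2])          # 339, a single lookup

new = fat.allocate(last_block=339)  # grow the FAT file by one block
print(new, fat.table[339])          # 2 2
```

Finding block i in the FAT costs i table lookups (cheap only if the FAT is cached), whereas the index block answers in one array access once it has been read.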
• This point raises the question of how large the index block should be. Every file must have an index block, so we want the index block to be as small as possible. If the index block is too small, however, it will not be able to hold enough pointers for a large file, and a mechanism will have to be available to deal with this issue. Mechanisms for this purpose include the following:
o Linked scheme: An index block is normally one disk block. Thus, it can be read and written directly by itself. To allow for large files, we can link together several index blocks. For example, an index block might contain a small header giving the name of the file and a set of the first 100 disk-block addresses. The next address (the last word in the index block) is nil (for a small file) or is a pointer to another index block (for a large file).
o Multilevel index: A variant of the linked representation uses a first-level index block to point to a set of second-level index blocks, which in turn point to the file blocks. To access a block, the OS uses the first-level index to find a second-level index block and then uses that block to find the desired data block. This approach could be continued to a third or fourth level, depending on the desired maximum file size. With 4,096-byte blocks, we could store 1,024 4-byte pointers in an index block. Two levels of indexes allow 1,048,576 data blocks (1,024 × 1,024) and a file size of up to 4 GB (1,048,576 blocks × 4,096 bytes = 2^32 bytes).
o Combined scheme: Another alternative, used in UFS, is to keep the first, say, 15 pointers of the index block in the file's inode. The first 12 of these pointers point to direct blocks; that is, they contain addresses of blocks that contain data of the file. Thus, the data for small files (of no more than 12 blocks) do not need a separate index block. If the block size is 4 KB, then up to 48 KB of data can be accessed directly. The next three pointers point to indirect blocks. The first points to a single indirect block, which is an index block containing not data but the addresses of blocks that do contain data. The second points to a double indirect block, which contains the addresses of single indirect blocks, which in turn contain the addresses of the actual data blocks. The last pointer contains the address of a triple indirect block. Under this method, the number of blocks that can be allocated to a file exceeds the amount of space addressable by the 4-byte file pointers used by many OSes. A 32-bit file pointer reaches only 2^32 bytes, or 4 GB. Many UNIX implementations, including Solaris and IBM's AIX, now support up to 64-bit file pointers. Pointers of this size allow files and file systems to be terabytes in size. A UNIX inode is shown in Figure 11.9.
• Indexed-allocation schemes suffer from some of the same performance problems as does linked allocation. Specifically, the index blocks can be cached in memory, but the data blocks may be spread all over a volume.
Performance
The optimal allocation method is different for sequential-access files than for random-access files, and is also different for small files than for large files. Some systems support more than one allocation method, which may require specifying how the file is to be used (sequential or random access) at the time it is allocated. Such systems also provide conversion utilities.
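Returning to the combined (UFS-style) scheme described above, its capacity arithmetic can be checked with a short Python sketch, assuming 4-KB blocks, 4-byte block addresses, and 12 direct pointers as in the text. The function names are hypothetical; locate() shows which inode pointer and which index-block slots would be followed to reach a given logical block.

```python
BLOCK_SIZE = 4096                        # bytes, as in the text's example
PTR_SIZE = 4                             # bytes per disk-block address
PTRS_PER_BLOCK = BLOCK_SIZE // PTR_SIZE  # 1,024 addresses per index block
NUM_DIRECT = 12                          # direct pointers kept in the inode


def max_file_blocks():
    """12 direct + single + double + triple indirect blocks."""
    p = PTRS_PER_BLOCK
    return NUM_DIRECT + p + p ** 2 + p ** 3


def locate(n):
    """Map logical block n to (level, indices).

    level 0 means a direct pointer; 1, 2, 3 mean single, double and triple
    indirect. indices lists the slot to follow in each index block, top first.
    """
    if n < 0:
        raise ValueError("negative block number")
    if n < NUM_DIRECT:
        return 0, [n]
    n -= NUM_DIRECT
    p = PTRS_PER_BLOCK
    for level in (1, 2, 3):
        if n < p ** level:
            indices = []
            for depth in range(level - 1, -1, -1):
                indices.append(n // p ** depth)  # slot at this level of the tree
                n %= p ** depth
            return level, indices
        n -= p ** level                          # skip past this level's range
    raise ValueError("block number beyond the triple-indirect range")


print(NUM_DIRECT * BLOCK_SIZE)               # 49152 bytes (48 KB) reachable directly
print(PTRS_PER_BLOCK ** 2 * BLOCK_SIZE)      # 4294967296: the two-level 4 GB figure
print(max_file_blocks() * BLOCK_SIZE)        # 4402345721856, far beyond 2^32
print(locate(11), locate(12), locate(1036))  # (0, [11]) (1, [0]) (2, [0, 0])
```

Since 12 + 1,024 + 1,024^2 + 1,024^3 blocks of 4 KB come to roughly 4 TB, the triple-indirect limit lies well past what a 32-bit file offset can address, which is why the text points to 64-bit file pointers.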
Some systems have been known to use contiguous allocation for small files and automatically switch to an indexed scheme when file sizes surpass a certain threshold. And of course some systems adjust their allocation schemes (e.g. block sizes) to best match the characteristics of the hardware for optimum performance.
FREE-SPACE MANAGEMENT
Another important aspect of disk management is keeping track of and allocating free space.
• Bit Vector: One simple approach is to use a bit vector, in which each bit represents a disk block, set to 1 if free or 0 if allocated. Fast algorithms exist for quickly finding contiguous blocks of a given size. The downside is the size of the bitmap itself: a 40-GB disk with 1-KB blocks, for example, has roughly 40 million blocks and therefore needs about 5 MB just to store the bitmap.
• Linked List: A linked list can also be used to keep track of all free blocks. Traversing the list and/or finding a contiguous block of a given size are not easy, but fortunately are not frequently needed operations. Generally the system just adds and removes single blocks from the beginning of the list. The FAT incorporates free-block accounting into the table itself (free blocks are simply the 0-valued entries), so no separate free list is needed.
• Grouping: A variation on linked-list free lists is to use a linked list of blocks that hold the addresses of free blocks. If a block holds up to N addresses, then the first block in the linked list contains up to N-1 addresses of free blocks and a pointer to the next block of free addresses.
• Counting: When there are multiple contiguous blocks of free space, the system can keep track of the starting address of the group and the number of contiguous free blocks. As long as the average length of a contiguous group of free blocks is greater than two, this offers a savings in the space needed for the free list, since each (address, count) pair replaces one address per free block. (This is similar to the run-length compression used for graphics images when a group of pixels all the same color is encountered.)
• Space Maps: Sun's ZFS file system was designed for HUGE numbers and sizes of files, directories, and even file systems. The resulting data structures could be VERY inefficient if not implemented carefully.
For example, freeing up a 1-GB file on a 1-TB file system could involve updating thousands of blocks of free-list bitmaps if the file was spread across the disk. ZFS uses a combination of techniques, starting with dividing the disk up into (hundreds of) metaslabs of a manageable size, each with its own space map. Free blocks are managed using the counting technique, but rather than writing the information to a table, ZFS records it in a log-structured transaction record. Adjacent free blocks are also coalesced into a larger single free block. An in-memory space map is constructed from the log data using a balanced-tree data structure. The combination of the in-memory tree and the on-disk log provides very fast and efficient management of these very large files and free blocks.
EFFICIENCY AND PERFORMANCE
• Efficiency: The efficient use of disk space depends heavily on the disk-allocation and directory algorithms in use. For instance, UNIX pre-allocates inodes, which occupy space even before any files are created. UNIX also distributes inodes across the disk, and tries to store data files near their inodes, to reduce the distance of disk seeks between the inodes and the data. Some systems use variable-size clusters depending on the file size. The more data that is stored in a directory entry (e.g., information like last access time), the more often the directory blocks have to be rewritten. As technology advances, addressing schemes have had to grow as well. Sun's ZFS file system uses 128-bit pointers, which should
theoretically never need to be expanded. (The mass required to store 2^128 bytes with atomic storage would be at least 272 trillion kilograms!) Kernel table sizes used to be fixed, and could only be changed by rebuilding the kernel. Modern tables are dynamically allocated, but that requires more complicated algorithms for accessing them.
• Performance: Even after the basic file-system algorithms have been selected, we can still improve performance in several ways. Disk controllers generally include on-board caching. When a seek is requested, the heads are moved into place, and then an entire track is read, starting from whatever sector is currently under the heads (reducing latency). The requested sector is returned and the unrequested portion of the track is cached in the disk's electronics. Some OSes cache disk blocks they expect to need again in a buffer cache. A page cache connected to the virtual memory system is actually more efficient, as memory addresses do not need to be converted to disk-block addresses and back again. Some systems (Solaris, Linux, Windows 2000, NT, XP) use page caching for both process pages and file data in a unified virtual memory. Figures 11.11 and 11.12 show the advantages of the unified buffer cache found in some versions of UNIX and Linux: data does not need to be stored twice, and problems of inconsistent buffer information are avoided. (Book: Some systems maintain a separate section of main memory for a buffer cache, where blocks are kept under the assumption that they will be used again shortly. Other systems cache file data using a page cache. The page cache uses virtual memory techniques to cache file data as pages rather than as file-system-oriented blocks. Caching file data using virtual addresses is far more efficient than caching through physical blocks, as accesses interface with virtual memory rather than the file system. Several systems, including Solaris, Linux, and Windows NT/XP, use page caching to cache both process pages and file data.
This is known as unified virtual memory.) (Book: Some versions of UNIX and Linux provide a unified buffer cache. To illustrate the benefits of the unified buffer cache, consider the two alternatives for opening and accessing a file. One approach is to use memory mapping (Section 9.7); the second is to use the standard system calls read() and write(). Without a unified buffer cache, we have a situation similar to Figure 11.11. Here, the read() and write() system calls go through the buffer cache. The memory-mapping call, however, requires using two caches: the page cache and the buffer cache. A memory mapping proceeds by reading in disk blocks from the file system and storing them in the buffer cache. Because the virtual memory system does not interface with the buffer cache, the contents of the file in the buffer cache must be copied into the page cache. This situation is known as double caching and requires caching file-system data twice. Not only does it waste memory, but it also wastes significant CPU and I/O cycles due to the extra data movement within system memory. In addition, inconsistencies between the two caches can result in corrupt files. In contrast, when a unified buffer cache is provided, both memory mapping and the read() and write() system calls use the same page cache. This has the benefit of avoiding double caching, and it allows the virtual memory system to manage file-system data. The unified buffer cache is shown in Figure 11.12.)
o Page-replacement strategies can be complicated with a unified cache, as one needs to decide whether to replace process or file pages, and how many pages to guarantee to each category of pages. Solaris, for example, has gone through many variations, resulting in priority paging, which gives process pages priority over file I/O pages and sets limits so that neither can knock the other completely out of memory.
o Another issue affecting performance is the question of whether to implement synchronous writes or asynchronous writes.
Synchronous writes occur in the order in which the disk subsystem receives them, without caching; asynchronous writes are cached, allowing the disk subsystem to schedule writes in a more efficient order (see Chapter 12). Metadata writes are often done synchronously. Some systems support flags to the open call requiring that writes be synchronous (e.g. O_SYNC on POSIX systems), for example for the benefit of database systems that require their writes be performed in a required order.
o The type of file access can also have an impact on optimal page replacement policies. For example, LRU is not necessarily a good policy for sequential access files. For these types of files, progression normally goes in a forward direction only, and the most recently used page will not be needed again until after the file has been rewound and re-read from the beginning (if it is ever needed at all). On the other hand, we can expect to need the next page in the file fairly soon. For this reason, sequential access files often take advantage of two special policies:
 Free-behind frees up a page as soon as the next page in the file is requested, with the assumption that we are now done with the old page and won't need it again for a long time.
 Read-ahead reads the requested page and several subsequent pages at the same time, with the assumption that those pages will be needed in the near future. This is similar to the track caching that is already performed by the disk controller, except it saves the future latency of transferring data from the disk controller memory into motherboard main memory.
o The caching system and asynchronous writes speed up disk writes considerably, because the disk subsystem can schedule physical writes to the disk to minimize head movement and disk seek times (see Chapter 12). Reads, on the other hand, must be done more synchronously in spite of the caching system, with the result that disk writes can, counter-intuitively, be much faster on average than disk reads.
Recovery
Files and directories are kept both in main memory and on disk, and care must be taken to ensure that system failure does not result in loss of data or in data inconsistency. We deal with these issues in the following sections.
 Consistency Checking: The storing of certain data structures (e.g. directories and inodes) in memory and the caching of disk operations can speed up performance, but what happens in the event of a system crash? All volatile memory structures are lost, and the information stored on the hard drive may be left in an inconsistent state. A consistency checker (fsck in UNIX, chkdsk or scandisk in Windows) is often run at boot time or mount time, particularly if a file system was not closed down properly. Some of the problems that these tools look for include:
o Disk blocks allocated to files and also listed on the free list.
o Disk blocks neither allocated to files nor on the free list.
o Disk blocks allocated to more than one file.
o The number of disk blocks allocated to a file inconsistent with the file's stated size.
o Properly allocated files/inodes which do not appear in any directory entry.
o Link counts for an inode not matching the number of references to that inode in the directory structure.
o Two or more identical file names in the same directory.
o Illegally linked directories, e.g. cyclical relationships where those are not allowed, or files/directories that are not accessible from the root of the directory tree.
o Consistency checkers will often collect questionable disk blocks into new files with names such as chk00001.dat. These files may contain valuable information that would otherwise be lost, but in most cases they can be safely deleted (returning those disk blocks to the free list).
UNIX caches directory information for reads, but any changes that affect space allocation or metadata are written synchronously, before any of the corresponding data blocks are written.
 Log-Structured File Systems: Log-based transaction-oriented (a.k.a.
journaling) file systems borrow techniques developed for databases, guaranteeing that any given transaction either completes successfully or can be rolled back to a safe state before the transaction commenced:
o All metadata changes are written sequentially to a log.
o A set of changes for performing a specific task (e.g. moving a file) is a transaction.
o As changes are written to the log they are said to be committed, allowing the system to return to its work.
o In the meantime, the changes from the log are carried out on the actual file system, and a pointer keeps track of which changes in the log have been completed and which have not yet been completed.
o When all changes corresponding to a particular transaction have been completed, that transaction can be safely removed from the log.
o At any given time, the log will contain information pertaining to uncompleted transactions only, e.g. actions that were committed but for which the entire transaction has not yet been completed.
 From the log, the remaining transactions can be completed;
 or, if the transaction was aborted, the partially completed changes can be undone.
 Backup and Restore: A full backup copies every file on a file system. Incremental backups copy only files which have changed since some previous time. A combination of full and incremental backups can offer a compromise between full recoverability, the number and size of backup tapes needed, and the number of tapes that need to be used to do a full restore. For example, one strategy might be:
o At the beginning of the month, do a full backup.
o At the end of the first week and again at the end of the second week, back up all files which have changed since the beginning of the month.
o At the end of the third week, back up all files that have changed since the end of the second week.
o Every day of the month not listed above, do an incremental backup of all files that have changed since the most recent of the weekly backups described above.
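Several of the checks listed under Consistency Checking above (blocks that are both allocated and free, blocks that are neither, blocks claimed by more than one file, inodes that appear in no directory, and link counts that disagree with the directory structure) can be illustrated with a toy checker. This is a minimal sketch, not how fsck is actually implemented: it assumes a hypothetical in-memory model in which each inode is a dict holding its block numbers and its stored link count, the free list is a set of block numbers, and each directory maps names to inode numbers. The function name `check_consistency` is illustrative only.

```python
from collections import Counter

def check_consistency(num_blocks, inodes, free_list, directories):
    """Toy consistency check over a hypothetical in-memory file-system model.

    num_blocks  -- total number of data blocks on the imaginary volume
    inodes      -- {inode_no: {"blocks": [block numbers], "links": stored link count}}
    free_list   -- set of block numbers believed to be free
    directories -- {dir_name: {file_name: inode_no}}
    """
    problems = []

    # How many inodes claim each block (Counter returns 0 for unclaimed blocks).
    owners = Counter(b for ino in inodes.values() for b in ino["blocks"])
    for block in range(num_blocks):
        used = owners[block]
        free = block in free_list
        if used and free:
            problems.append(("allocated_and_free", block))
        elif not used and not free:
            problems.append(("missing", block))          # neither in use nor free
        if used > 1:
            problems.append(("multiply_allocated", block))

    # Compare directory references with each inode's stored link count.
    refs = Counter(ino for entries in directories.values() for ino in entries.values())
    for ino_no, ino in inodes.items():
        if refs[ino_no] == 0:
            problems.append(("orphan_inode", ino_no))     # allocated but unreachable
        elif refs[ino_no] != ino["links"]:
            problems.append(("bad_link_count", ino_no))
    return problems


inodes = {
    1: {"blocks": [0, 1], "links": 1},
    2: {"blocks": [1, 2], "links": 2},  # shares block 1 with inode 1; only one dir entry
    3: {"blocks": [4], "links": 1},     # not listed in any directory
}
free_list = {2, 5}                       # block 2 is also in use; block 3 is lost
directories = {"/": {"a.txt": 1, "b.txt": 2}}
print(check_consistency(6, inodes, free_list, directories))
# [('multiply_allocated', 1), ('allocated_and_free', 2), ('missing', 3), ('bad_link_count', 2), ('orphan_inode', 3)]
```

A real checker also has to decide how to repair each problem (for example, returning missing blocks to the free list or moving orphaned inodes into a lost+found style directory); the sketch only reports them.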
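The journaling steps described under Log-Structured File Systems can be mimicked with a toy write-ahead log. This is a minimal sketch under simplifying assumptions: the "disk" is a Python dict mapping paths to inode numbers, a transaction is a list of (key, new value) updates, and a transaction counts as committed only once a COMMIT record follows its updates in the log. The class `Journal` and its methods are hypothetical. Because this sketch applies changes to the disk only after their COMMIT record is logged, it implements just the "complete the remaining transactions" path; the undo path for aborted transactions is not modelled. Real journaling file systems differ considerably in record formats and in what they log.

```python
class Journal:
    """Toy redo-only metadata journal; an illustrative sketch, not a real on-disk format."""

    def __init__(self):
        self.disk = {}  # hypothetical file-system metadata: {path: inode number}
        self.log = []   # sequential records: ("SET", tid, key, value) or ("COMMIT", tid)

    def write_transaction(self, tid, updates, crash_before_commit=False):
        """Append a transaction's updates to the log, followed by its COMMIT record."""
        for key, value in updates:
            self.log.append(("SET", tid, key, value))
        if not crash_before_commit:
            self.log.append(("COMMIT", tid))  # committed: the caller may carry on

    def replay(self):
        """Carry out committed transactions on the 'disk' and discard the rest.

        Each record just sets or removes a value, so replaying the same log twice
        (e.g. after a crash during recovery) gives the same result.
        """
        committed = {rec[1] for rec in self.log if rec[0] == "COMMIT"}
        for rec in self.log:
            if rec[0] != "SET" or rec[1] not in committed:
                continue                      # skip COMMIT markers and uncommitted updates
            _, _, key, value = rec
            if value is None:
                self.disk.pop(key, None)      # None means "remove this directory entry"
            else:
                self.disk[key] = value
        self.log.clear()                      # every transaction is now done or discarded


j = Journal()
j.disk["/a/report.txt"] = 7                   # inode 7 is listed in directory /a
# Transaction 1: move report.txt from /a to /b (two metadata updates).
j.write_transaction(1, [("/a/report.txt", None), ("/b/report.txt", 7)])
# Transaction 2: create a file, but the system crashes before COMMIT reaches the log.
j.write_transaction(2, [("/b/new.txt", 9)], crash_before_commit=True)
j.replay()                                    # recovery at the next mount
print(j.disk)                                 # {'/b/report.txt': 7}
```

The move either happens completely or not at all: after recovery the file is in /b and not in /a, and the half-written creation of new.txt leaves no trace.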
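The monthly backup strategy above determines which backups must be read back to do a full restore. The sketch below computes that restore chain for a given day. It assumes (the notes do not say this) that the first, second, and third weeks end on days 7, 14, and 21, and that the full backup on day 1 is the reference point for the daily backups of the first week. The names `backup_base` and `restore_chain` are hypothetical.

```python
WEEK_ENDS = (7, 14, 21)  # assumption: weeks 1-3 end on these days of the month

def backup_base(day):
    """Day of the backup that `day`'s backup is relative to, or None for the full backup."""
    if day == 1:
        return None          # full backup at the beginning of the month
    if day in (7, 14):
        return 1             # everything changed since the beginning of the month
    if day == 21:
        return 14            # everything changed since the end of the second week
    # Daily incremental: changes since the most recent weekly (or full) backup.
    return max(d for d in (1,) + WEEK_ENDS if d < day)

def restore_chain(day):
    """Backups to restore, oldest first, to recover the file system as of `day`."""
    chain = [day]
    while (base := backup_base(chain[-1])) is not None:
        chain.append(base)
    return chain[::-1]

for d in (3, 10, 16, 25):
    print(d, restore_chain(d))
# 3 [1, 3]
# 10 [1, 7, 10]
# 16 [1, 14, 16]
# 25 [1, 14, 21, 25]
```

Under these assumptions a full restore never needs more than four backup sets (the full backup, the week-2 backup, the week-3 backup, and one daily backup), because each daily backup already contains every change since the last weekly one.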
 Other Solutions: Sun's ZFS and Network Appliance's WAFL file systems take a different approach to file-system consistency. No blocks of data are ever overwritten in place. Rather, the new data is written into fresh new blocks, and after the transaction is complete, the metadata (data block pointers) is updated to point to the new blocks. The old blocks can then be freed up for future use. Alternatively, if the old blocks and old metadata are saved, then a snapshot of the system in its original state is preserved. This approach is taken by WAFL. ZFS combines this with checksumming of all metadata and data blocks, and RAID, to ensure that no inconsistencies are possible, and therefore ZFS does not incorporate a consistency checker.
NFS (Optional)
The NFS protocol is implemented as a set of remote procedure calls (RPCs) for: searching for a file in a directory, reading a set of directory entries, manipulating links and directories, accessing file attributes, and reading and writing files. For remote operations, buffering and caching improve performance, but can cause a disparity between local and remote views of the same file(s). (In addition to Figure 12.15, you can also view the preceding figures illustrating NFS file-system mounting if you forgot.)
Assorted Content
 Master Boot Record (MBR: Wiki): A master boot record (MBR) is a special type of boot sector at the very beginning of partitioned computer mass storage devices like fixed disks or removable drives intended for use with IBM PC-compatible systems and beyond. The MBR holds the information on how the logical partitions, containing file systems, are organized on that medium. The MBR also contains executable code to function as a loader for the installed operating system, usually by passing control over to the loader's second stage, or in conjunction with each partition's volume boot record (VBR). This MBR code is usually referred to as a boot loader. MBRs are not present on non-partitioned media such as floppies, superfloppies or other storage devices configured to behave as such. The MBR is not located in a partition; it is located in the first sector of the device (physical offset 0), preceding the first partition. (The boot sector present on a non-partitioned device or within an individual partition is called a volume boot record instead.) The organization of the partition table in the MBR limits the maximum addressable storage space of a disk to 2 TiB (2^32 × 512 bytes). Approaches to slightly raise this limit assuming 33-bit arithmetic or 4096-byte sectors are not officially supported, as they fatally break compatibility with existing boot loaders and most MBR-compliant operating systems and system tools, and can cause serious data corruption when used outside of narrowly controlled system environments. Therefore, the MBR-based partitioning scheme is in the process of being superseded by the GUID Partition Table (GPT) scheme in new computers. A GPT can coexist with an MBR in order to provide some limited form of backward compatibility for older systems. The MBR consists of 512 or more bytes located in the first sector of the drive. It may contain one or more of:
(A) A partition table describing the partitions of a storage device.
In this context the boot sector may also be called a partition sector.
(B) Bootstrap code: instructions to identify the configured bootable partition, then load and execute its volume boot record (VBR) as a chain loader.
(C) Optional 32-bit disk timestamp.
(D) Optional 32-bit disk signature.
By convention, there are exactly four primary partition table entries in the MBR partition table scheme.
 Second-stage boot loader: Second-stage boot loaders, such as GNU GRUB, BOOTMGR, Syslinux, NTLDR or BootX, are not themselves operating systems, but are able to load an operating system properly and transfer execution to it; the operating system subsequently initializes itself and may load extra device drivers. The second-stage boot loader does not need drivers for its own operation, but may instead use generic storage access methods provided by system firmware such as the BIOS or Open Firmware, though typically with restricted hardware functionality and lower performance.
 Volume Boot Record (VBR): A Volume Boot Record (VBR) (also known as a volume boot sector, a partition boot record or a partition boot sector) is a type of boot sector introduced by the IBM Personal Computer. It may be found on a partitioned data storage device such as a hard disk, or an unpartitioned device such as a floppy disk, and contains machine code for bootstrapping programs (usually, but not necessarily, operating systems) stored in other parts of the device. On non-partitioned storage devices, it is the first sector of the device. On partitioned devices, it is the first sector of an individual partition on the device, with the first sector of the entire device being a Master Boot Record (MBR) containing the partition table. The code in volume boot records is invoked either directly by the machine's firmware or indirectly by code in the master boot record or a boot manager. Code in the MBR and VBR is in essence loaded the same way. Invoking a VBR via a boot manager is known as chain loading.
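The MBR description above can be made concrete with a short parser. This sketch relies on the classic MBR layout, which the notes do not spell out: four 16-byte partition entries starting at byte offset 446, the optional 32-bit disk signature at offset 440, and the boot signature bytes 0x55 0xAA at offsets 510 and 511. Each entry holds a status byte (0x80 = bootable), a partition type byte (0 = unused), and a little-endian 32-bit starting LBA and sector count; the CHS fields and the optional timestamp are ignored. The 32-bit sector count is also where the 2 TiB limit comes from: 2^32 sectors × 512 bytes = 2^41 bytes. Reading a real disk device would normally need raw-device privileges, so the example builds a synthetic sector instead; the partition type 0x83 in it is an arbitrary choice, and `parse_mbr` is a hypothetical helper.

```python
import struct

SECTOR_SIZE = 512
# One 16-byte partition entry: status, CHS start, type, CHS end, LBA start, sector count.
ENTRY = struct.Struct("<B3sB3sII")

def parse_mbr(sector0):
    """Return (disk_signature, partitions) from a 512-byte MBR (classic layout assumed)."""
    if len(sector0) < SECTOR_SIZE or sector0[510:512] != b"\x55\xaa":
        raise ValueError("missing MBR boot signature 0x55 0xAA at offset 510")
    (disk_signature,) = struct.unpack_from("<I", sector0, 440)
    partitions = []
    for i in range(4):                      # the four primary partition entries
        status, _, ptype, _, lba, count = ENTRY.unpack_from(sector0, 446 + 16 * i)
        if ptype == 0x00:                   # type 0 marks an unused entry
            continue
        partitions.append({"index": i, "bootable": status == 0x80, "type": ptype,
                           "start_lba": lba, "sectors": count})
    return disk_signature, partitions

# The 32-bit sector fields cap an MBR disk at 2^32 sectors of 512 bytes.
MAX_BYTES = 2**32 * SECTOR_SIZE
print(MAX_BYTES // 2**40, "TiB")            # 2 TiB

# Build a synthetic MBR with one bootable partition instead of reading a real disk.
mbr = bytearray(SECTOR_SIZE)
struct.pack_into("<I", mbr, 440, 0xDEADBEEF)
ENTRY.pack_into(mbr, 446, 0x80, b"\x00" * 3, 0x83, b"\x00" * 3, 2048, 204800)
mbr[510:512] = b"\x55\xaa"
print(parse_mbr(bytes(mbr)))
# (3735928559, [{'index': 0, 'bootable': True, 'type': 131, 'start_lba': 2048, 'sectors': 204800}])
```

Because both the start LBA and the sector count are 32-bit fields, no partition entry can describe sectors beyond 2^32, which is exactly the limit GPT removes with 64-bit LBAs.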
 Master File Table (MFT): The NTFS file system contains a file called the master file table, or MFT. There is at least one entry in the MFT for every file on an NTFS file system volume, including the MFT itself. All information about a file, including its size, time and date stamps, permissions, and data content, is stored either in MFT entries or in space outside the MFT that is described by MFT entries. As files are added to an NTFS file system volume, more entries are added to the MFT and the MFT increases in size. When files are deleted from an NTFS file system volume, their MFT entries are marked as free and may be reused. However, disk space that has been allocated for these entries is not reallocated, and the size of the MFT does not decrease. (The master file table (MFT) is a database in which information about every file and directory on an NT File System (NTFS) volume is stored. There is at least one record for every file and directory on the NTFS logical volume. Each record contains attributes that tell the operating system (OS) how to deal with the file or directory associated with the record.)
 File Control Block (FCB): A File Control Block (FCB) is a file system structure in which the state of an open file is maintained. An FCB is managed by the operating system, but it resides in the memory of the program that uses the file, not in operating system memory. This allows a process to have as many files open at one time as it wants to, provided it can spare enough memory for an FCB per file. A full FCB is 36 bytes long; in early versions of CP/M, it was 33 bytes. This fixed size, which could not be increased without breaking application compatibility, led to the FCB's eventual demise as the standard method of accessing files. The meanings of several of the fields in the FCB differ between CP/M and DOS, and also depend on what operation is being performed. The following fields have consistent meanings:
To be cleared  I Q’s Later  XXX Glossary ReadLater Further Reading  S  Grey Areas  XXX