Contents
PART ONE OVERVIEW
Chapter 1 Introduction
1.1 What Operating Systems Do
1.2 Computer-System Organization
1.3 Computer-System Architecture
1.4 Operating-System Structure
1.5 Operating-System Operations
1.6 Process Management
1.7 Memory Management
1.8 Storage Management
1.9 Protection and Security
1.10 Kernel Data Structures
1.11 Computing Environments
1.12 Open-Source Operating Systems
1.13 Summary
Exercises
Bibliographical Notes
Chapter 2 Operating-System Structures
2.1 Operating-System Services
2.2 User and Operating-System Interface
2.3 System Calls
2.4 Types of System Calls
2.5 System Programs
2.6 Operating-System Design and Implementation
2.7 Operating-System Structure
2.8 Operating-System Debugging
2.9 Operating-System Generation
2.10 System Boot
2.11 Summary
Exercises
Bibliographical Notes
PART TWO PROCESS MANAGEMENT
Chapter 3 Processes
3.1 Process Concept
3.2 Process Scheduling
3.3 Operations on Processes
3.4 Interprocess Communication
3.5 Examples of IPC Systems
3.6 Communication in Client–Server Systems
3.7 Summary
Exercises
Bibliographical Notes
Chapter 4 Threads
4.1 Overview
4.2 Multicore Programming
4.3 Multithreading Models
4.4 Thread Libraries
4.5 Implicit Threading
4.6 Threading Issues
4.7 Operating-System Examples
4.8 Summary
Exercises
Bibliographical Notes
Chapter 5 Process Synchronization
5.1 Background
5.2 The Critical-Section Problem
5.3 Peterson’s Solution
5.4 Synchronization Hardware
5.5 Mutex Locks
5.6 Semaphores
5.7 Classic Problems of Synchronization
5.8 Monitors
5.9 Synchronization Examples
5.10 Alternative Approaches
5.11 Summary
Exercises
Bibliographical Notes
Chapter 6 CPU Scheduling
6.1 Basic Concepts
6.2 Scheduling Criteria
6.3 Scheduling Algorithms
6.4 Thread Scheduling
6.5 Multiple-Processor Scheduling
6.6 Real-Time CPU Scheduling
6.7 Operating-System Examples
6.8 Algorithm Evaluation
6.9 Summary
Exercises
Bibliographical Notes
Chapter 7 Deadlocks
7.1 System Model
7.2 Deadlock Characterization
7.3 Methods for Handling Deadlocks
7.4 Deadlock Prevention
7.5 Deadlock Avoidance
7.6 Deadlock Detection
7.7 Recovery from Deadlock
7.8 Summary
Exercises
Bibliographical Notes
PART THREE MEMORY MANAGEMENT
Chapter 8 Main Memory
8.1 Background
8.2 Swapping
8.3 Contiguous Memory Allocation
8.4 Segmentation
8.5 Paging
8.6 Structure of the Page Table
8.7 Example: Intel 32 and 64-bit Architectures
8.8 Example: ARM Architecture
8.9 Summary
Exercises
Bibliographical Notes
Chapter 9 Virtual Memory
9.1 Background
9.2 Demand Paging
9.3 Copy-on-Write
9.4 Page Replacement
9.5 Allocation of Frames
9.6 Thrashing
9.7 Memory-Mapped Files
9.8 Allocating Kernel Memory
9.9 Other Considerations
9.10 Operating-System Examples
9.11 Summary
Exercises
Bibliographical Notes
PART FOUR STORAGE MANAGEMENT
Chapter 10 Mass-Storage Structure
10.1 Overview of Mass-Storage Structure
10.2 Disk Structure
10.3 Disk Attachment
10.4 Disk Scheduling
10.5 Disk Management
10.6 Swap-Space Management
10.7 RAID Structure
10.8 Stable-Storage Implementation
10.9 Summary
Exercises
Bibliographical Notes
Chapter 11 File-System Interface
11.1 File Concept
11.2 Access Methods
11.3 Directory and Disk Structure
11.4 File-System Mounting
11.5 File Sharing
11.6 Protection
11.7 Summary
Exercises
Bibliographical Notes
Chapter 12 File-System Implementation
12.1 File-System Structure
12.2 File-System Implementation
12.3 Directory Implementation
12.4 Allocation Methods
12.5 Free-Space Management
12.6 Efficiency and Performance
12.7 Recovery
12.8 NFS
12.9 Example: The WAFL File System
12.10 Summary
Exercises
Bibliographical Notes
Chapter 13 I/O Systems
13.1 Overview
13.2 I/O Hardware
13.3 Application I/O Interface
13.4 Kernel I/O Subsystem
13.5 Transforming I/O Requests to Hardware Operations
13.6 STREAMS
13.7 Performance
13.8 Summary
Exercises
Bibliographical Notes
PART FIVE PROTECTION AND SECURITY
Chapter 14 Protection
14.1 Goals of Protection
14.2 Principles of Protection
14.3 Domain of Protection
14.4 Access Matrix
14.5 Implementation of the Access Matrix
14.6 Access Control
14.7 Revocation of Access Rights
14.8 Capability-Based Systems
14.9 Language-Based Protection
14.10 Summary
Exercises
Bibliographical Notes
Chapter 15 Security
15.1 The Security Problem
15.2 Program Threats
15.3 System and Network Threats
15.4 Cryptography as a Security Tool
15.5 User Authentication
15.6 Implementing Security Defenses
15.7 Firewalling to Protect Systems and Networks
15.8 Computer-Security Classifications
15.9 An Example: Windows 7
15.10 Summary
Exercises
Bibliographical Notes
PART SIX ADVANCED TOPICS
Chapter 16 Virtual Machines
16.1 Overview
16.2 History
16.3 Benefits and Features
16.4 Building Blocks
16.5 Types of Virtual Machines and Their Implementations
16.6 Virtualization and Operating-System Components
16.7 Examples
16.8 Summary
Exercises
Bibliographical Notes
Chapter 17 Distributed Systems
17.1 Advantages of Distributed Systems
17.2 Types of Network-based Operating Systems
17.3 Network Structure
17.4 Communication Structure
17.5 Communication Protocols
17.6 An Example: TCP/IP
17.7 Robustness
17.8 Design Issues
17.9 Distributed File Systems
17.10 Summary
Exercises
Bibliographical Notes
PART SEVEN CASE STUDIES
Chapter 18 The Linux System
18.1 Linux History
18.2 Design Principles
18.3 Kernel Modules
18.4 Process Management
18.5 Scheduling
18.6 Memory Management
18.7 File Systems
18.8 Input and Output
18.9 Interprocess Communication
18.10 Network Structure
18.11 Security
18.12 Summary
Exercises
Bibliographical Notes
Chapter 19 Windows 7
19.1 History
19.2 Design Principles
19.3 System Components
19.4 Terminal Services and Fast User Switching
19.5 File System
19.6 Networking
19.7 Programmer Interface
19.8 Summary
Exercises
Bibliographical Notes
Chapter 20 Influential Operating Systems
20.1 Feature Migration
20.2 Early Systems
20.3 Atlas
20.4 XDS-940
20.5 THE
20.6 RC 4000
20.7 CTSS
20.8 MULTICS
20.9 IBM OS/360
20.10 TOPS-20
20.11 CP/M and MS/DOS
20.12 Macintosh Operating System and Windows
20.13 Mach
20.14 Other Systems
Exercises
Bibliographical Notes
PART EIGHT APPENDICES
Appendix A BSD UNIX
A.1 UNIX History
A.2 Design Principles
A.3 Programmer Interface
A.4 User Interface
A.5 Process Management
A.6 Memory Management
A.7 File System
A.8 I/O System
A.9 Interprocess Communication
A.10 Summary
Exercises
Bibliographical Notes
Appendix B The Mach System
B.1 History of the Mach System
B.2 Design Principles
B.3 System Components
B.4 Process Management
B.5 Interprocess Communication
B.6 Memory Management
B.7 Programmer Interface
B.8 Summary
Exercises
Bibliographical Notes
Part One
Overview
An operating system acts as an intermediary between the user of a
computer and the computer hardware. The purpose of an operating
system is to provide an environment in which a user can execute
programs in a convenient and efficient manner.
An operating system is software that manages the computer hard-
ware. The hardware must provide appropriate mechanisms to ensure the
correct operation of the computer system and to prevent user programs
from interfering with the proper operation of the system.
Internally, operating systems vary greatly in their makeup, since they
are organized along many different lines. The design of a new operating
system is a major task. It is important that the goals of the system be well
defined before the design begins. These goals form the basis for choices
among various algorithms and strategies.
Because an operating system is large and complex, it must be created
piece by piece. Each of these pieces should be a well-delineated portion
of the system, with carefully defined inputs, outputs, and functions.
Chapter 1
Introduction
An operating system is a program that manages a computer’s hardware. It
also provides a basis for application programs and acts as an intermediary
between the computer user and the computer hardware. An amazing aspect of
operating systems is how they vary in accomplishing these tasks. Mainframe
operating systems are designed primarily to optimize utilization of hardware.
Personal computer (PC) operating systems support complex games, business
applications, and everything in between. Operating systems for mobile com-
puters provide an environment in which a user can easily interface with the
computer to execute programs. Thus, some operating systems are designed to
be convenient, others to be efficient, and others to be some combination of the
two.
Before we can explore the details of computer system operation, we need to
know something about system structure. We thus discuss the basic functions
of system startup, I/O, and storage early in this chapter. We also describe
the basic computer architecture that makes it possible to write a functional
operating system.
Because an operating system is large and complex, it must be created
piece by piece. Each of these pieces should be a well-delineated portion of the
system, with carefully defined inputs, outputs, and functions. In this chapter,
we provide a general overview of the major components of a contemporary
computer system as well as the functions provided by the operating system.
Additionally, we cover several other topics to help set the stage for the
remainder of this text: data structures used in operating systems, computing
environments, and open-source operating systems.
CHAPTER OBJECTIVES
• To describe the basic organization of computer systems.
• To provide a grand tour of the major components of operating systems.
• To give an overview of the many types of computing environments.
• To explore several open-source operating systems.
Figure 1.1 Abstract view of the components of a computer system. (Users 1 through n interact with system and application programs such as compilers, assemblers, text editors, and database systems; these run on the operating system, which in turn manages the computer hardware.)
1.1 What Operating Systems Do
We begin our discussion by looking at the operating system’s role in the
overall computer system. A computer system can be divided roughly into four
components: the hardware, the operating system, the application programs,
and the users (Figure 1.1).
The hardware—the central processing unit (CPU), the memory, and the
input/output (I/O) devices—provides the basic computing resources for the
system. The application programs—such as word processors, spreadsheets,
compilers, and Web browsers—define the ways in which these resources are
used to solve users’ computing problems. The operating system controls the
hardware and coordinates its use among the various application programs for
the various users.
We can also view a computer system as consisting of hardware, software,
and data. The operating system provides the means for proper use of these
resources in the operation of the computer system. An operating system is
similar to a government. Like a government, it performs no useful function by
itself. It simply provides an environment within which other programs can do
useful work.
To understand more fully the operating system’s role, we next explore
operating systems from two viewpoints: that of the user and that of the system.
1.1.1 User View
The user’s view of the computer varies according to the interface being
used. Most computer users sit in front of a PC, consisting of a monitor,
keyboard, mouse, and system unit. Such a system is designed for one user
to monopolize its resources. The goal is to maximize the work (or play) that
the user is performing. In this case, the operating system is designed mostly
for ease of use, with some attention paid to performance and none paid
to resource utilization—how various hardware and software resources are
shared. Performance is, of course, important to the user; but such systems
are optimized for the single-user experience rather than the requirements of
multiple users.
In other cases, a user sits at a terminal connected to a mainframe or a
minicomputer. Other users are accessing the same computer through other
terminals. These users share resources and may exchange information. The
operating system in such cases is designed to maximize resource utilization—
to assure that all available CPU time, memory, and I/O are used efficiently and
that no individual user takes more than her fair share.
In still other cases, users sit at workstations connected to networks of
other workstations and servers. These users have dedicated resources at
their disposal, but they also share resources such as networking and servers,
including file, compute, and print servers. Therefore, their operating system is
designed to compromise between individual usability and resource utilization.
Recently, many varieties of mobile computers, such as smartphones and
tablets, have come into fashion. Most mobile computers are standalone units for
individual users. Quite often, they are connected to networks through cellular
or other wireless technologies. Increasingly, these mobile devices are replacing
desktop and laptop computers for people who are primarily interested in
using computers for e-mail and web browsing. The user interface for mobile
computers generally features a touch screen, where the user interacts with the
system by pressing and swiping fingers across the screen rather than using a
physical keyboard and mouse.
Some computers have little or no user view. For example, embedded
computers in home devices and automobiles may have numeric keypads and
may turn indicator lights on or off to show status, but they and their operating
systems are designed primarily to run without user intervention.
1.1.2 System View
From the computer’s point of view, the operating system is the program
most intimately involved with the hardware. In this context, we can view
an operating system as a resource allocator. A computer system has many
resources that may be required to solve a problem: CPU time, memory space,
file-storage space, I/O devices, and so on. The operating system acts as the
manager of these resources. Facing numerous and possibly conflicting requests
for resources, the operating system must decide how to allocate them to specific
programs and users so that it can operate the computer system efficiently and
fairly. As we have seen, resource allocation is especially important where many
users access the same mainframe or minicomputer.
A slightly different view of an operating system emphasizes the need to
control the various I/O devices and user programs. An operating system is a
control program. A control program manages the execution of user programs
to prevent errors and improper use of the computer. It is especially concerned
with the operation and control of I/O devices.
1.1.3 Defining Operating Systems
By now, you can probably see that the term operating system covers many roles
and functions. That is the case, at least in part, because of the myriad designs
and uses of computers. Computers are present within toasters, cars, ships,
spacecraft, homes, and businesses. They are the basis for game machines, music
players, cable TV tuners, and industrial control systems. Although computers
have a relatively short history, they have evolved rapidly. Computing started
as an experiment to determine what could be done and quickly moved to
fixed-purpose systems for military uses, such as code breaking and trajectory
plotting, and governmental uses, such as census calculation. Those early
computers evolved into general-purpose, multifunction mainframes, and
that’s when operating systems were born. In the 1960s, Moore’s Law predicted
that the number of transistors on an integrated circuit would double every
eighteen months, and that prediction has held true. Computers gained in
functionality and shrunk in size, leading to a vast number of uses and a vast
number and variety of operating systems. (See Chapter 20 for more details on
the history of operating systems.)
How, then, can we define what an operating system is? In general, we have
no completely adequate definition of an operating system. Operating systems
exist because they offer a reasonable way to solve the problem of creating a
usable computing system. The fundamental goal of computer systems is to
execute user programs and to make solving user problems easier. Computer
hardware is constructed toward this goal. Since bare hardware alone is not
particularly easy to use, application programs are developed. These programs
require certain common operations, such as those controlling the I/O devices.
The common functions of controlling and allocating resources are then brought
together into one piece of software: the operating system.
In addition, we have no universally accepted definition of what is part of the
operating system. A simple viewpoint is that it includes everything a vendor
ships when you order “the operating system.” The features included, however,
vary greatly across systems. Some systems take up less than a megabyte of
space and lack even a full-screen editor, whereas others require gigabytes of
space and are based entirely on graphical windowing systems. A more common
definition, and the one that we usually follow, is that the operating system
is the one program running at all times on the computer—usually called
the kernel. (Along with the kernel, there are two other types of programs:
system programs, which are associated with the operating system but are not
necessarily part of the kernel, and application programs, which include all
programs not associated with the operation of the system.)
The matter of what constitutes an operating system became increasingly
important as personal computers became more widespread and operating
systems grew increasingly sophisticated. In 1998, the United States Department
of Justice filed suit against Microsoft, in essence claiming that Microsoft
included too much functionality in its operating systems and thus prevented
application vendors from competing. (For example, a Web browser was an
integral part of the operating systems.) As a result, Microsoft was found guilty
of using its operating-system monopoly to limit competition.
Today, however, if we look at operating systems for mobile devices, we
see that once again the number of features constituting the operating system
is increasing. Mobile operating systems often include not only a core kernel
but also middleware—a set of software frameworks that provide additional
services to application developers. For example, each of the two most promi-
nent mobile operating systems—Apple’s iOS and Google’s Android—features
a core kernel along with middleware that supports databases, multimedia, and
graphics (to name only a few).
1.2 Computer-System Organization
Before we can explore the details of how computer systems operate, we need
general knowledge of the structure of a computer system. In this section,
we look at several parts of this structure. The section is mostly concerned
with computer-system organization, so you can skim or skip it if you already
understand the concepts.
1.2.1 Computer-System Operation
A modern general-purpose computer system consists of one or more CPUs
and a number of device controllers connected through a common bus that
provides access to shared memory (Figure 1.2). Each device controller is in
charge of a specific type of device (for example, disk drives, audio devices,
or video displays). The CPU and the device controllers can execute in parallel,
competing for memory cycles. To ensure orderly access to the shared memory,
a memory controller synchronizes access to the memory.
For a computer to start running—for instance, when it is powered up or
rebooted—it needs to have an initial program to run. This initial program,
or bootstrap program, tends to be simple. Typically, it is stored within
the computer hardware in read-only memory (ROM) or electrically erasable
programmable read-only memory (EEPROM), known by the general term
firmware. It initializes all aspects of the system, from CPU registers to device
controllers to memory contents. The bootstrap program must know how to load
the operating system and how to start executing that system. To accomplish
Figure 1.2 A modern computer system. (One or more CPUs and several device controllers, such as a disk controller for the disks, a USB controller for the keyboard, mouse, and printer, and a graphics adapter for the monitor, connected through a common bus to shared memory.)
Figure 1.3 Interrupt timeline for a single process doing output. (The CPU alternates between executing the user process and processing I/O interrupts; the I/O device alternates between idle and transferring, with each I/O request followed later by a transfer-done interrupt.)
this goal, the bootstrap program must locate the operating-system kernel and
load it into memory.
Once the kernel is loaded and executing, it can start providing services to
the system and its users. Some services are provided outside of the kernel, by
system programs that are loaded into memory at boot time to become system
processes, or system daemons that run the entire time the kernel is running.
On UNIX, the first system process is “init,” and it starts many other daemons.
Once this phase is complete, the system is fully booted, and the system waits
for some event to occur.
The occurrence of an event is usually signaled by an interrupt from either
the hardware or the software. Hardware may trigger an interrupt at any time
by sending a signal to the CPU, usually by way of the system bus. Software
may trigger an interrupt by executing a special operation called a system call
(also called a monitor call).
When the CPU is interrupted, it stops what it is doing and immediately
transfers execution to a fixed location. The fixed location usually contains
the starting address where the service routine for the interrupt is located.
The interrupt service routine executes; on completion, the CPU resumes the
interrupted computation. A timeline of this operation is shown in Figure 1.3.
Interrupts are an important part of a computer architecture. Each computer
design has its own interrupt mechanism, but several functions are common.
The interrupt must transfer control to the appropriate interrupt service routine.
The straightforward method for handling this transfer would be to invoke
a generic routine to examine the interrupt information. The routine, in turn,
would call the interrupt-specific handler. However, interrupts must be handled
quickly. Since only a predefined number of interrupts is possible, a table of
pointers to interrupt routines can be used instead to provide the necessary
speed. The interrupt routine is called indirectly through the table, with no
intermediate routine needed. Generally, the table of pointers is stored in low
memory (the first hundred or so locations). These locations hold the addresses
of the interrupt service routines for the various devices. This array, or interrupt
vector, of addresses is then indexed by a unique device number, given with
the interrupt request, to provide the address of the interrupt service routine for
STORAGE DEFINITIONS AND NOTATION
The basic unit of computer storage is the bit. A bit can contain one of two
values, 0 and 1. All other storage in a computer is based on collections of bits.
Given enough bits, it is amazing how many things a computer can represent:
numbers, letters, images, movies, sounds, documents, and programs, to name
a few. A byte is 8 bits, and on most computers it is the smallest convenient
chunk of storage. For example, most computers don’t have an instruction to
move a bit but do have one to move a byte. A less common term is word,
which is a given computer architecture’s native unit of data. A word is made
up of one or more bytes. For example, a computer that has 64-bit registers and
64-bit memory addressing typically has 64-bit (8-byte) words. A computer
executes many operations in its native word size rather than a byte at a time.
Computer storage, along with most computer throughput, is generally
measured and manipulated in bytes and collections of bytes. A kilobyte, or
KB, is 1,024 bytes; a megabyte, or MB, is 1,024² bytes; a gigabyte, or GB, is
1,024³ bytes; a terabyte, or TB, is 1,024⁴ bytes; and a petabyte, or PB, is 1,024⁵
bytes. Computer manufacturers often round off these numbers and say that
a megabyte is 1 million bytes and a gigabyte is 1 billion bytes. Networking
measurements are an exception to this general rule; they are given in bits
(because networks move data a bit at a time).
the interrupting device. Operating systems as different as Windows and UNIX
dispatch interrupts in this manner.
The interrupt architecture must also save the address of the interrupted
instruction. Many old designs simply stored the interrupt address in a
fixed location or in a location indexed by the device number. More recent
architectures store the return address on the system stack. If the interrupt
routine needs to modify the processor state—for instance, by modifying
register values—it must explicitly save the current state and then restore that
state before returning. After the interrupt is serviced, the saved return address
is loaded into the program counter, and the interrupted computation resumes
as though the interrupt had not occurred.
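The table-of-pointers dispatch described above can be sketched in C. This is a minimal user-space model, not kernel code: the device numbers, handler names, and table size below are assumptions made purely for illustration.

#include <stdio.h>

#define NUM_INTERRUPTS 256            /* predefined number of possible interrupts */

typedef void (*isr_t)(void);          /* type of an interrupt service routine */

static void keyboard_isr(void) { printf("keyboard: key code read\n"); }
static void disk_isr(void)     { printf("disk: block transfer complete\n"); }
static void ignore_isr(void)   { /* spurious or unused interrupt */ }

/* The interrupt vector: indexed directly by device number, with no
   intermediate routine needed. */
static isr_t interrupt_vector[NUM_INTERRUPTS];

/* Hardware performs this dispatch; here it is modeled as a function call. */
static void dispatch(unsigned device_number) {
    interrupt_vector[device_number % NUM_INTERRUPTS]();
    /* on return, the interrupted computation would resume */
}

int main(void) {
    for (int i = 0; i < NUM_INTERRUPTS; i++)
        interrupt_vector[i] = ignore_isr;     /* default handler */
    interrupt_vector[1]  = keyboard_isr;      /* hypothetical device numbers */
    interrupt_vector[14] = disk_isr;

    dispatch(14);   /* as if the disk controller raised interrupt 14 */
    dispatch(1);
    return 0;
}

The point of the table is speed: the handler is reached with a single indexed, indirect call rather than a chain of tests inside a generic routine.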
1.2.2 Storage Structure
The CPU can load instructions only from memory, so any programs to run must
be stored there. General-purpose computers run most of their programs from
rewritable memory, called main memory (also called random-access memory,
or RAM). Main memory commonly is implemented in a semiconductor
technology called dynamic random-access memory (DRAM).
Computers use other forms of memory as well. We have already mentioned
read-only memory (ROM) and electrically erasable programmable read-only
memory (EEPROM). Because ROM cannot be changed, only static programs, such
as the bootstrap program described earlier, are stored there. The immutability
of ROM is of use in game cartridges. EEPROM can be changed but cannot
be changed frequently and so contains mostly static programs. For example,
smartphones have EEPROM to store their factory-installed programs.
All forms of memory provide an array of bytes. Each byte has its
own address. Interaction is achieved through a sequence of load or store
instructions to specific memory addresses. The load instruction moves a byte
or word from main memory to an internal register within the CPU, whereas the
store instruction moves the content of a register to main memory. Aside from
explicit loads and stores, the CPU automatically loads instructions from main
memory for execution.
A typical instruction–execution cycle, as executed on a system with a von
Neumann architecture, first fetches an instruction from memory and stores
that instruction in the instruction register. The instruction is then decoded
and may cause operands to be fetched from memory and stored in some
internal register. After the instruction on the operands has been executed, the
result may be stored back in memory. Notice that the memory unit sees only
a stream of memory addresses. It does not know how they are generated (by
the instruction counter, indexing, indirection, literal addresses, or some other
means) or what they are for (instructions or data). Accordingly, we can ignore
how a memory address is generated by a program. We are interested only in
the sequence of memory addresses generated by the running program.
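The instruction-execution cycle just described can be made concrete with a toy von Neumann machine in C. The three-instruction encoding (LOAD, ADD, STORE) is invented for this sketch; the only point it illustrates is that instructions and data share one memory and that the memory unit sees nothing but a stream of addresses.

#include <stdio.h>
#include <stdint.h>

enum { HALT = 0, LOAD = 1, ADD = 2, STORE = 3 };   /* invented opcodes */

int main(void) {
    /* each instruction occupies two cells: opcode, operand address */
    uint16_t memory[16] = {
        LOAD,  10,   /* register <- memory[10]            */
        ADD,   11,   /* register <- register + memory[11] */
        STORE, 12,   /* memory[12] <- register            */
        HALT,  0,
        0, 0, 7, 5, 0, 0, 0, 0   /* data: memory[10] = 7, memory[11] = 5 */
    };
    uint16_t pc = 0;      /* instruction counter */
    uint16_t reg = 0;     /* one internal register */

    for (;;) {
        uint16_t ir = memory[pc];            /* fetch the instruction */
        uint16_t operand = memory[pc + 1];   /* fetch the operand address */
        pc += 2;
        if (ir == HALT) break;               /* decode and execute */
        else if (ir == LOAD)  reg = memory[operand];
        else if (ir == ADD)   reg += memory[operand];
        else if (ir == STORE) memory[operand] = reg;
    }
    printf("memory[12] = %u\n", memory[12]);   /* prints 12 */
    return 0;
}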
Ideally, we want the programs and data to reside in main memory
permanently. This arrangement usually is not possible for the following two
reasons:
1. Main memory is usually too small to store all needed programs and data
permanently.
2. Main memory is a volatile storage device that loses its contents when
power is turned off or otherwise lost.
Thus, most computer systems provide secondary storage as an extension of
main memory. The main requirement for secondary storage is that it be able to
hold large quantities of data permanently.
The most common secondary-storage device is a magnetic disk, which
provides storage for both programs and data. Most programs (system and
application) are stored on a disk until they are loaded into memory. Many
programs then use the disk as both the source and the destination of their
processing. Hence, the proper management of disk storage is of central
importance to a computer system, as we discuss in Chapter 10.
In a larger sense, however, the storage structure that we have described—
consisting of registers, main memory, and magnetic disks—is only one of many
possible storage systems. Others include cache memory, CD-ROM, magnetic
tapes, and so on. Each storage system provides the basic functions of storing
a datum and holding that datum until it is retrieved at a later time. The main
differences among the various storage systems lie in speed, cost, size, and
volatility.
The wide variety of storage systems can be organized in a hierarchy (Figure
1.4) according to speed and cost. The higher levels are expensive, but they are
fast. As we move down the hierarchy, the cost per bit generally decreases,
whereas the access time generally increases. This trade-off is reasonable; if a
given storage system were both faster and less expensive than another—other
properties being the same—then there would be no reason to use the slower,
more expensive memory. In fact, many early storage devices, including paper
Figure 1.4 Storage-device hierarchy, from fastest and most expensive to slowest and cheapest: registers, cache, main memory, solid-state disk, magnetic disk, optical disk, magnetic tapes.
tape and core memories, are relegated to museums now that magnetic tape and
semiconductor memory have become faster and cheaper. The top four levels
of memory in Figure 1.4 may be constructed using semiconductor memory.
In addition to differing in speed and cost, the various storage systems are
either volatile or nonvolatile. As mentioned earlier, volatile storage loses its
contents when the power to the device is removed. In the absence of expensive
battery and generator backup systems, data must be written to nonvolatile
storage for safekeeping. In the hierarchy shown in Figure 1.4, the storage
systems above the solid-state disk are volatile, whereas those including the
solid-state disk and below are nonvolatile.
Solid-state disks have several variants but in general are faster than
magnetic disks and are nonvolatile. One type of solid-state disk stores data in a
large DRAM array during normal operation but also contains a hidden magnetic
hard disk and a battery for backup power. If external power is interrupted, this
solid-state disk’s controller copies the data from RAM to the magnetic disk.
When external power is restored, the controller copies the data back into RAM.
Another form of solid-state disk is flash memory, which is popular in cameras
and personal digital assistants (PDAs), in robots, and increasingly for storage
on general-purpose computers. Flash memory is slower than DRAM but needs
no power to retain its contents. Another form of nonvolatile storage is NVRAM,
which is DRAM with battery backup power. This memory can be as fast as
DRAM and (as long as the battery lasts) is nonvolatile.
The design of a complete memory system must balance all the factors just
discussed: it must use only as much expensive memory as necessary while
providing as much inexpensive, nonvolatile memory as possible. Caches can
be installed to improve performance where a large disparity in access time or
transfer rate exists between two components.
1.2.3 I/O Structure
Storage is only one of many types of I/O devices within a computer. A large
portion of operating system code is dedicated to managing I/O, both because
of its importance to the reliability and performance of a system and because of
the varying nature of the devices. Next, we provide an overview of I/O.
A general-purpose computer system consists of CPUs and multiple device
controllers that are connected through a common bus. Each device controller
is in charge of a specific type of device. Depending on the controller, more
than one device may be attached. For instance, seven or more devices can be
attached to the small computer-systems interface (SCSI) controller. A device
controller maintains some local buffer storage and a set of special-purpose
registers. The device controller is responsible for moving the data between
the peripheral devices that it controls and its local buffer storage. Typically,
operating systems have a device driver for each device controller. This device
driver understands the device controller and provides the rest of the operating
system with a uniform interface to the device.
To start an I/O operation, the device driver loads the appropriate registers
within the device controller. The device controller, in turn, examines the
contents of these registers to determine what action to take (such as “read
a character from the keyboard”). The controller starts the transfer of data from
the device to its local buffer. Once the transfer of data is complete, the device
controller informs the device driver via an interrupt that it has finished its
operation. The device driver then returns control to the operating system,
possibly returning the data or a pointer to the data if the operation was a read.
For other operations, the device driver returns status information.
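Loading the appropriate registers within the device controller might look roughly like the sketch below. The register layout, field meanings, and status codes are entirely hypothetical; real drivers use controller-specific register maps and kernel interfaces, and the "interrupt" here is simulated by an ordinary function call.

#include <stdio.h>
#include <stdint.h>

/* Hypothetical memory-mapped register block of a simple disk controller. */
struct disk_controller {
    volatile uint32_t command;       /* 1 = read one block into the local buffer */
    volatile uint32_t block;         /* which block to transfer */
    volatile uint32_t status;        /* 0 = idle, 1 = busy, 2 = done */
    volatile uint8_t  buffer[512];   /* the controller's local buffer storage */
};

/* In a real system this block would sit at a fixed physical address; a
   static instance lets the sketch run in user space. */
static struct disk_controller ctrl;

/* Device driver: start an I/O operation by loading the controller registers. */
static void start_read(uint32_t block) {
    ctrl.block   = block;
    ctrl.command = 1;     /* the controller examines these registers and begins */
    ctrl.status  = 1;     /* busy */
}

/* Called when the controller's completion interrupt arrives. */
static void disk_interrupt_handler(void) {
    if (ctrl.status == 2)
        printf("driver: block ready, first byte = %u\n", ctrl.buffer[0]);
}

int main(void) {
    start_read(42);
    ctrl.buffer[0] = 0xAB;        /* pretend the hardware finished the transfer */
    ctrl.status = 2;
    disk_interrupt_handler();     /* ...and interrupted the CPU */
    return 0;
}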
This form of interrupt-driven I/O is fine for moving small amounts of data
but can produce high overhead when used for bulk data movement such as disk
I/O. To solve this problem, direct memory access (DMA) is used. After setting
up buffers, pointers, and counters for the I/O device, the device controller
transfers an entire block of data directly to or from its own buffer storage to
memory, with no intervention by the CPU. Only one interrupt is generated per
block, to tell the device driver that the operation has completed, rather than
the one interrupt per byte generated for low-speed devices. While the device
controller is performing these operations, the CPU is available to accomplish
other work.
Some high-end systems use a switch rather than a bus architecture. On these
systems, multiple components can talk to other components concurrently,
rather than competing for cycles on a shared bus. In this case, DMA is even
more effective. Figure 1.5 shows the interplay of all components of a computer
system.
1.3 Computer-System Architecture
In Section 1.2, we introduced the general structure of a typical computer system.
A computer system can be organized in a number of different ways, which we
Figure 1.5 How a modern computer system works. (Threads of execution on the CPUs (*N) run instruction-execution cycles over instructions and data held in memory and the cache; devices (*M) raise interrupts and receive I/O requests, and data movement between devices and memory proceeds via DMA.)
can categorize roughly according to the number of general-purpose processors
used.
1.3.1 Single-Processor Systems
Until recently, most computer systems used a single processor. On a single-
processor system, there is one main CPU capable of executing a general-purpose
instruction set, including instructions from user processes. Almost all single-
processor systems have other special-purpose processors as well. They may
come in the form of device-specific processors, such as disk, keyboard, and
graphics controllers; or, on mainframes, they may come in the form of more
general-purpose processors, such as I/O processors that move data rapidly
among the components of the system.
All of these special-purpose processors run a limited instruction set and
do not run user processes. Sometimes, they are managed by the operating
system, in that the operating system sends them information about their next
task and monitors their status. For example, a disk-controller microprocessor
receives a sequence of requests from the main CPU and implements its own disk
queue and scheduling algorithm. This arrangement relieves the main CPU of
the overhead of disk scheduling. PCs contain a microprocessor in the keyboard
to convert the keystrokes into codes to be sent to the CPU. In other systems
or circumstances, special-purpose processors are low-level components built
into the hardware. The operating system cannot communicate with these
processors; they do their jobs autonomously. The use of special-purpose
microprocessors is common and does not turn a single-processor system into
a multiprocessor. If there is only one general-purpose CPU, then the system is
a single-processor system.
1.3.2 Multiprocessor Systems
Within the past several years, multiprocessor systems (also known as parallel
systems or multicore systems) have begun to dominate the landscape of
computing. Such systems have two or more processors in close communication,
sharing the computer bus and sometimes the clock, memory, and peripheral
devices. Multiprocessor systems first appeared prominently in
servers and have since migrated to desktop and laptop systems. Recently,
multiple processors have appeared on mobile devices such as smartphones
and tablet computers.
Multiprocessor systems have three main advantages:
1. Increased throughput. By increasing the number of processors, we expect
to get more work done in less time. The speed-up ratio with N processors
is not N, however; rather, it is less than N. When multiple processors
cooperate on a task, a certain amount of overhead is incurred in keeping
all the parts working correctly. This overhead, plus contention for shared
resources, lowers the expected gain from additional processors. Similarly,
N programmers working closely together do not produce N times the
amount of work a single programmer would produce.
2. Economy of scale. Multiprocessor systems can cost less than equivalent
multiple single-processor systems, because they can share peripherals,
mass storage, and power supplies. If several programs operate on the
same set of data, it is cheaper to store those data on one disk and to have
all the processors share them than to have many computers with local
disks and many copies of the data.
3. Increased reliability. If functions can be distributed properly among
several processors, then the failure of one processor will not halt the
system, only slow it down. If we have ten processors and one fails, then
each of the remaining nine processors can pick up a share of the work of
the failed processor. Thus, the entire system runs only 10 percent slower,
rather than failing altogether.
Increased reliability of a computer system is crucial in many applications.
The ability to continue providing service proportional to the level of surviving
hardware is called graceful degradation. Some systems go beyond graceful
degradation and are called fault tolerant, because they can suffer a failure of
any single component and still continue operation. Fault tolerance requires
a mechanism to allow the failure to be detected, diagnosed, and, if possible,
corrected. The HP NonStop (formerly Tandem) system uses both hardware and
software duplication to ensure continued operation despite faults. The system
consists of multiple pairs of CPUs, working in lockstep. Both processors in the
pair execute each instruction and compare the results. If the results differ, then
one CPU of the pair is at fault, and both are halted. The process that was being
executed is then moved to another pair of CPUs, and the instruction that failed
is restarted. This solution is expensive, since it involves special hardware and
considerable hardware duplication.
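The claim in advantage 1 that the speed-up ratio with N processors is less than N can be quantified with Amdahl's law, a standard model that the text does not name explicitly: if a fraction s of a task is inherently serial, the best possible speedup on N processors is 1 / (s + (1 - s)/N). A minimal sketch, assuming a serial fraction of 10 percent purely for illustration:

#include <stdio.h>

/* Amdahl's law: speedup on n processors when fraction s of the work is serial. */
static double speedup(double s, int n) {
    return 1.0 / (s + (1.0 - s) / n);
}

int main(void) {
    double s = 0.10;                  /* assumed serial fraction */
    int counts[] = {1, 2, 4, 8, 16};
    for (int i = 0; i < 5; i++)
        printf("%2d processors: speedup %.2f\n", counts[i], speedup(s, counts[i]));
    return 0;
}

With 16 processors and a 10 percent serial fraction, the speedup is only about 6.4, which is why doubling the processor count rarely doubles throughput.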
The multiple-processor systems in use today are of two types. Some
systems use asymmetric multiprocessing, in which each processor is assigned
a specific task. A boss processor controls the system; the other processors either
look to the boss for instruction or have predefined tasks. This scheme defines
a boss–worker relationship. The boss processor schedules and allocates work
to the worker processors.
The most common systems use symmetric multiprocessing (SMP), in
which each processor performs all tasks within the operating system. SMP
means that all processors are peers; no boss–worker relationship exists
between processors. Figure 1.6 illustrates a typical SMP architecture. Notice
that each processor has its own set of registers, as well as a private—or local
—cache. However, all processors share physical memory. An example of an
SMP system is AIX, a commercial version of UNIX designed by IBM. An AIX
system can be configured to employ dozens of processors. The benefit of this
model is that many processes can run simultaneously—N processes can run
if there are N CPUs—without causing performance to deteriorate significantly.
However, we must carefully control I/O to ensure that the data reach the
appropriate processor. Also, since the CPUs are separate, one may be sitting
idle while another is overloaded, resulting in inefficiencies. These inefficiencies
can be avoided if the processors share certain data structures. A multiprocessor
system of this form will allow processes and resources—such as memory—
to be shared dynamically among the various processors and can lower the
variance among the processors. Such a system must be written carefully, as
we shall see in Chapter 5. Virtually all modern operating systems—including
Windows, Mac OS X, and Linux—now provide support for SMP.
The difference between symmetric and asymmetric multiprocessing may
result from either hardware or software. Special hardware can differentiate the
multiple processors, or the software can be written to allow only one boss and
multiple workers. For instance, Sun Microsystems’ operating system SunOS
Version 4 provided asymmetric multiprocessing, whereas Version 5 (Solaris) is
symmetric on the same hardware.
Multiprocessing adds CPUs to increase computing power. If the CPU has an
integrated memory controller, then adding CPUs can also increase the amount
Figure 1.6 Symmetric multiprocessing architecture. (CPU0, CPU1, and CPU2 each have their own registers and cache; all share the same physical memory.)
of memory addressable in the system. Either way, multiprocessing can cause
a system to change its memory access model from uniform memory access
(UMA) to non-uniform memory access (NUMA). UMA is defined as the situation
in which access to any RAM from any CPU takes the same amount of time. With
NUMA, some parts of memory may take longer to access than other parts,
creating a performance penalty. Operating systems can minimize the NUMA
penalty through resource management, as discussed in Section 9.5.4.
A recent trend in CPU design is to include multiple computing cores
on a single chip. Such multiprocessor systems are termed multicore. They
can be more efficient than multiple chips with single cores because on-chip
communication is faster than between-chip communication. In addition, one
chip with multiple cores uses significantly less power than multiple single-core
chips.
It is important to note that while multicore systems are multiprocessor
systems, not all multiprocessor systems are multicore, as we shall see in Section
1.3.3. In our coverage of multiprocessor systems throughout this text, unless
we state otherwise, we generally use the more contemporary term multicore,
which excludes some multiprocessor systems.
In Figure 1.7, we show a dual-core design with two cores on the same
chip. In this design, each core has its own register set as well as its own local
cache. Other designs might use a shared cache or a combination of local and
shared caches. Aside from architectural considerations, such as cache, memory,
and bus contention, these multicore CPUs appear to the operating system as
N standard processors. This characteristic puts pressure on operating system
designers—and application programmers—to make use of those processing
cores.
Finally, blade servers are a relatively recent development in which multiple
processor boards, I/O boards, and networking boards are placed in the same
chassis. The difference between these and traditional multiprocessor systems
is that each blade-processor board boots independently and runs its own
operating system. Some blade-server boards are multiprocessor as well, which
blurs the lines between types of computers. In essence, these servers consist of
multiple independent multiprocessor systems.
Figure 1.7 A dual-core design with two cores placed on the same chip. (CPU core0 and CPU core1 each have their own registers and cache and share the chip's connection to memory.)
1.3.3 Clustered Systems
Another type of multiprocessor system is a clustered system, which gathers
together multiple CPUs. Clustered systems differ from the multiprocessor
systems described in Section 1.3.2 in that they are composed of two or more
individual systems—or nodes—joined together. Such systems are considered
loosely coupled. Each node may be a single processor system or a multicore
system. We should note that the definition of clustered is not concrete; many
commercial packages wrestle to define a clustered system and why one form
is better than another. The generally accepted definition is that clustered
computers share storage and are closely linked via a local-area network (LAN)
(as described in Chapter 17) or a faster interconnect, such as InfiniBand.
Clustering is usually used to provide high-availability service—that is,
service will continue even if one or more systems in the cluster fail. Generally,
we obtain high availability by adding a level of redundancy in the system.
A layer of cluster software runs on the cluster nodes. Each node can monitor
one or more of the others (over the LAN). If the monitored machine fails,
the monitoring machine can take ownership of its storage and restart the
applications that were running on the failed machine. The users and clients of
the applications see only a brief interruption of service.
Clustering can be structured asymmetrically or symmetrically. In asym-
metric clustering, one machine is in hot-standby mode while the other is
running the applications. The hot-standby host machine does nothing but
monitor the active server. If that server fails, the hot-standby host becomes
the active server. In symmetric clustering, two or more hosts are running
applications and are monitoring each other. This structure is obviously more
efficient, as it uses all of the available hardware. However, it does require that
more than one application be available to run.
Since a cluster consists of several computer systems connected via a
network, clusters can also be used to provide high-performance computing
environments. Such systems can supply significantly greater computational
power than single-processor or even SMP systems because they can run an
application concurrently on all computers in the cluster. The application must
have been written specifically to take advantage of the cluster, however. This
involves a technique known as parallelization, which divides a program into
separate components that run in parallel on individual computers in the cluster.
Typically, these applications are designed so that once each computing node in
the cluster has solved its portion of the problem, the results from all the nodes
are combined into a final solution.
Other forms of clusters include parallel clusters and clustering over a
wide-area network (WAN) (as described in Chapter 17). Parallel clusters allow
multiple hosts to access the same data on shared storage. Because most
operating systems lack support for simultaneous data access by multiple hosts,
parallel clusters usually require the use of special versions of software and
special releases of applications. For example, Oracle Real Application Cluster
is a version of Oracle’s database that has been designed to run on a parallel
cluster. Each machine runs Oracle, and a layer of software tracks access to the
shared disk. Each machine has full access to all data in the database. To provide
this shared access, the system must also supply access control and locking to
BEOWULF CLUSTERS
Beowulf clusters are designed to solve high-performance computing tasks.
A Beowulf cluster consists of commodity hardware—such as personal
computers—connected via a simple local-area network. No single specific
software package is required to construct a cluster. Rather, the nodes use a
set of open-source software libraries to communicate with one another. Thus,
there are a variety of approaches to constructing a Beowulf cluster. Typically,
though, Beowulf computing nodes run the Linux operating system. Since
Beowulf clusters require no special hardware and operate using open-source
software that is available free, they offer a low-cost strategy for building
a high-performance computing cluster. In fact, some Beowulf clusters built
from discarded personal computers are using hundreds of nodes to solve
computationally expensive scientific computing problems.
ensure that no conflicting operations occur. This function, commonly known
as a distributed lock manager (DLM), is included in some cluster technology.
Cluster technology is changing rapidly. Some cluster products support
dozens of systems in a cluster, as well as clustered nodes that are separated
by miles. Many of these improvements are made possible by storage-area
networks (SANs), as described in Section 10.3.3, which allow many systems
to attach to a pool of storage. If the applications and their data are stored on
the SAN, then the cluster software can assign the application to run on any
host that is attached to the SAN. If the host fails, then any other host can take
over. In a database cluster, dozens of hosts can share the same database, greatly
increasing performance and reliability. Figure 1.8 depicts the general structure
of a clustered system.
Figure 1.8 General structure of a clustered system. (Multiple computers, each with its own interconnect, attached to a common storage-area network.)
Figure 1.9 Memory layout for a multiprogramming system. (The operating system occupies low memory starting at address 0; jobs 1 through 4 occupy the remaining memory up to Max.)
1.4 Operating-System Structure
Now that we have discussed basic computer-system organization and archi-
tecture, we are ready to talk about operating systems. An operating system
provides the environment within which programs are executed. Internally,
operating systems vary greatly in their makeup, since they are organized
along many different lines. There are, however, many commonalities, which
we consider in this section.
One of the most important aspects of operating systems is the ability
to multiprogram. A single program cannot, in general, keep either the CPU
or the I/O devices busy at all times. Single users frequently have multiple
programs running. Multiprogramming increases CPU utilization by organizing
jobs (code and data) so that the CPU always has one to execute.
The idea is as follows: The operating system keeps several jobs in memory
simultaneously (Figure 1.9). Since, in general, main memory is too small to
accommodate all jobs, the jobs are kept initially on the disk in the job pool.
This pool consists of all processes residing on disk awaiting allocation of main
memory.
The set of jobs in memory can be a subset of the jobs kept in the job
pool. The operating system picks and begins to execute one of the jobs in
memory. Eventually, the job may have to wait for some task, such as an I/O
operation, to complete. In a non-multiprogrammed system, the CPU would sit
idle. In a multiprogrammed system, the operating system simply switches to,
and executes, another job. When that job needs to wait, the CPU switches to
another job, and so on. Eventually, the first job finishes waiting and gets the
CPU back. As long as at least one job needs to execute, the CPU is never idle.
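A minimal sketch of this switching idea follows, assuming a fixed set of jobs already in memory that alternate between a CPU burst and a wait for I/O. Real systems do this with interrupts and saved process state rather than a polling loop, so the sketch models only the scheduling decision itself.

#include <stdio.h>

enum state { READY, WAITING, DONE };

struct job {
    const char *name;
    int bursts_left;      /* CPU bursts remaining before the job finishes */
    enum state state;
};

int main(void) {
    struct job jobs[] = {                 /* jobs kept in memory simultaneously */
        {"job 1", 2, READY}, {"job 2", 3, READY}, {"job 3", 1, READY}
    };
    int n = 3, finished = 0;

    while (finished < n) {
        for (int i = 0; i < n; i++) {
            if (jobs[i].state == WAITING)   /* pretend its I/O has completed */
                jobs[i].state = READY;
            if (jobs[i].state != READY)
                continue;
            printf("CPU runs %s\n", jobs[i].name);   /* the CPU is never idle */
            if (--jobs[i].bursts_left == 0) {
                jobs[i].state = DONE;
                finished++;
            } else {
                jobs[i].state = WAITING;    /* job waits for I/O; switch to another */
            }
        }
    }
    return 0;
}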
This idea is common in other life situations. A lawyer does not work for
only one client at a time, for example. While one case is waiting to go to trial
or have papers typed, the lawyer can work on another case. If he has enough
clients, the lawyer will never be idle for lack of work. (Idle lawyers tend to
become politicians, so there is a certain social value in keeping lawyers busy.)
Multiprogrammed systems provide an environment in which the various
system resources (for example, CPU, memory, and peripheral devices) are
utilized effectively, but they do not provide for user interaction with the
computer system. Time sharing (or multitasking) is a logical extension of
multiprogramming. In time-sharing systems, the CPU executes multiple jobs
by switching among them, but the switches occur so frequently that the users
can interact with each program while it is running.
Time sharing requires an interactive computer system, which provides
direct communication between the user and the system. The user gives
instructions to the operating system or to a program directly, using an input
device such as a keyboard, mouse, touch pad, or touch screen, and waits for
immediate results on an output device. Accordingly, the response time should
be short—typically less than one second.
A time-shared operating system allows many users to share the computer
simultaneously. Since each action or command in a time-shared system tends
to be short, only a little CPU time is needed for each user. As the system switches
rapidly from one user to the next, each user is given the impression that the
entire computer system is dedicated to his use, even though it is being shared
among many users.
A time-shared operating system uses CPU scheduling and multiprogram-
ming to provide each user with a small portion of a time-shared computer.
Each user has at least one separate program in memory. A program loaded into
memory and executing is called a process. When a process executes, it typically
executes for only a short time before it either finishes or needs to perform I/O.
I/O may be interactive; that is, output goes to a display for the user, and input
comes from a user keyboard, mouse, or other device. Since interactive I/O
typically runs at “people speeds,” it may take a long time to complete. Input,
for example, may be bounded by the user’s typing speed; seven characters per
second is fast for people but incredibly slow for computers. Rather than let
the CPU sit idle as this interactive input takes place, the operating system will
rapidly switch the CPU to the program of some other user.
Time sharing and multiprogramming require that several jobs be kept
simultaneously in memory. If several jobs are ready to be brought into memory,
and if there is not enough room for all of them, then the system must choose
among them. Making this decision involves job scheduling, which we discuss
in Chapter 6. When the operating system selects a job from the job pool, it loads
that job into memory for execution. Having several programs in memory at
the same time requires some form of memory management, which we cover in
Chapters 8 and 9. In addition, if several jobs are ready to run at the same time,
the system must choose which job will run first. Making this decision is CPU
scheduling, which is also discussed in Chapter 6. Finally, running multiple
jobs concurrently requires that their ability to affect one another be limited in
all phases of the operating system, including process scheduling, disk storage,
and memory management. We discuss these considerations throughout the
text.
In a time-sharing system, the operating system must ensure reasonable
response time. This goal is sometimes accomplished through swapping,
whereby processes are swapped in and out of main memory to the disk. A more
common method for ensuring reasonable response time is virtual memory, a
technique that allows the execution of a process that is not completely in
memory (Chapter 9). The main advantage of the virtual-memory scheme is that
it enables users to run programs that are larger than actual physical memory.
Further, it abstracts main memory into a large, uniform array of storage,
separating logical memory as viewed by the user from physical memory.
This arrangement frees programmers from concern over memory-storage
limitations.
A time-sharing system must also provide a file system (Chapters 11 and
12). The file system resides on a collection of disks; hence, disk management
must be provided (Chapter 10). In addition, a time-sharing system provides
a mechanism for protecting resources from inappropriate use (Chapter 14).
To ensure orderly execution, the system must provide mechanisms for job
synchronization and communication (Chapter 5), and it may ensure that jobs
do not get stuck in a deadlock, forever waiting for one another (Chapter 7).
1.5 Operating-System Operations
As mentioned earlier, modern operating systems are interrupt driven. If there
are no processes to execute, no I/O devices to service, and no users to whom
to respond, an operating system will sit quietly, waiting for something to
happen. Events are almost always signaled by the occurrence of an interrupt
or a trap. A trap (or an exception) is a software-generated interrupt caused
either by an error (for example, division by zero or invalid memory access)
or by a specific request from a user program that an operating-system service
be performed. The interrupt-driven nature of an operating system defines
that system’s general structure. For each type of interrupt, separate segments
of code in the operating system determine what action should be taken. An
interrupt service routine is provided to deal with the interrupt.
Since the operating system and the users share the hardware and software
resources of the computer system, we need to make sure that an error in a
user program could cause problems only for the one program running. With
sharing, many processes could be adversely affected by a bug in one program.
For example, if a process gets stuck in an infinite loop, this loop could prevent
the correct operation of many other processes. More subtle errors can occur
in a multiprogramming system, where one erroneous program might modify
another program, the data of another program, or even the operating system
itself.
Without protection against these sorts of errors, either the computer must
execute only one process at a time or all output must be suspect. A properly
designed operating system must ensure that an incorrect (or malicious)
program cannot cause other programs to execute incorrectly.
1.5.1 Dual-Mode and Multimode Operation
In order to ensure the proper execution of the operating system, we must be
able to distinguish between the execution of operating-system code and user-
defined code. The approach taken by most computer systems is to provide
hardware support that allows us to differentiate among various modes of
execution.
Figure 1.10 Transition from user to kernel mode: a user process executes in user mode (mode bit = 1); a system call traps to the kernel, setting the mode bit to 0; the kernel executes the system call in kernel mode and then returns, setting the mode bit back to 1 before the user process resumes.
At the very least, we need two separate modes of operation: user mode
and kernel mode (also called supervisor mode, system mode, or privileged
mode). A bit, called the mode bit, is added to the hardware of the computer
to indicate the current mode: kernel (0) or user (1). With the mode bit, we can
distinguish between a task that is executed on behalf of the operating system
and one that is executed on behalf of the user. When the computer system is
executing on behalf of a user application, the system is in user mode. However,
when a user application requests a service from the operating system (via a
system call), the system must transition from user to kernel mode to fulfill
the request. This is shown in Figure 1.10. As we shall see, this architectural
enhancement is useful for many other aspects of system operation as well.
At system boot time, the hardware starts in kernel mode. The operating
system is then loaded and starts user applications in user mode. Whenever a
trap or interrupt occurs, the hardware switches from user mode to kernel mode
(that is, changes the state of the mode bit to 0). Thus, whenever the operating
system gains control of the computer, it is in kernel mode. The system always
switches to user mode (by setting the mode bit to 1) before passing control to
a user program.
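The transition just described can be sketched in ordinary C, purely as an illustration. In a real system the mode switch is performed by the hardware itself and by architecture-specific kernel entry code; the names below (mode_bit, trap_entry, dispatch_trap) are invented for this sketch and do not come from any particular operating system.

    /* Conceptual sketch of dual-mode operation; all names are illustrative. */
    #include <stdio.h>

    enum mode { KERNEL = 0, USER = 1 };     /* matches the mode-bit encoding */

    static enum mode mode_bit = KERNEL;     /* hardware starts in kernel mode at boot */

    static void dispatch_trap(int trap_number)   /* per-trap kernel service code */
    {
        printf("kernel handling trap %d (mode bit = %d)\n", trap_number, mode_bit);
    }

    static void trap_entry(int trap_number)      /* entered on every trap or interrupt */
    {
        mode_bit = KERNEL;                       /* mode bit set to 0 on entry */
        dispatch_trap(trap_number);
        mode_bit = USER;                         /* mode bit set to 1 before returning */
    }

    int main(void)
    {
        mode_bit = USER;        /* the operating system starts a user application */
        trap_entry(42);         /* the application makes a system call */
        printf("back in user code (mode bit = %d)\n", mode_bit);
        return 0;
    }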
The dual mode of operation provides us with the means for protecting the
operating system from errant users—and errant users from one another. We
accomplish this protection by designating some of the machine instructions that
may cause harm as privileged instructions. The hardware allows privileged
instructions to be executed only in kernel mode. If an attempt is made to
execute a privileged instruction in user mode, the hardware does not execute
the instruction but rather treats it as illegal and traps it to the operating system.
The instruction to switch to kernel mode is an example of a privileged
instruction. Some other examples include I/O control, timer management, and
interrupt management. As we shall see throughout the text, there are many
additional privileged instructions.
The concept of modes can be extended beyond two modes (in which case
the CPU uses more than one bit to set and test the mode). CPUs that support
virtualization (Section 16.1) frequently have a separate mode to indicate when
the virtual machine manager (VMM)—and the virtualization management
software—is in control of the system. In this mode, the VMM has more
privileges than user processes but fewer than the kernel. It needs that level
of privilege so it can create and manage virtual machines, changing the CPU
state to do so. Sometimes, too, different modes are used by various kernel
components. We should note that, as an alternative to modes, the CPU designer
may use other methods to differentiate operational privileges. The Intel 64
family of CPUs supports four privilege levels, for example, and supports
virtualization but does not have a separate mode for virtualization.
We can now see the life cycle of instruction execution in a computer system.
Initial control resides in the operating system, where instructions are executed
in kernel mode. When control is given to a user application, the mode is set to
user mode. Eventually, control is switched back to the operating system via an
interrupt, a trap, or a system call.
System calls provide the means for a user program to ask the operating
system to perform tasks reserved for the operating system on the user
program’s behalf. A system call is invoked in a variety of ways, depending
on the functionality provided by the underlying processor. In all forms, it is the
method used by a process to request action by the operating system. A system
call usually takes the form of a trap to a specific location in the interrupt vector.
This trap can be executed by a generic trap instruction, although some systems
(such as MIPS) have a specific syscall instruction to invoke a system call.
When a system call is executed, it is typically treated by the hardware
as a software interrupt. Control passes through the interrupt vector to a
service routine in the operating system, and the mode bit is set to kernel
mode. The system-call service routine is a part of the operating system. The
kernel examines the interrupting instruction to determine what system call
has occurred; a parameter indicates what type of service the user program is
requesting. Additional information needed for the request may be passed in
registers, on the stack, or in memory (with pointers to the memory locations
passed in registers). The kernel verifies that the parameters are correct and
legal, executes the request, and returns control to the instruction following the
system call. We describe system calls more fully in Section 2.3.
The lack of a hardware-supported dual mode can cause serious shortcom-
ings in an operating system. For instance, MS-DOS was written for the Intel
8088 architecture, which has no mode bit and therefore no dual mode. A user
program running awry can wipe out the operating system by writing over it
with data; and multiple programs are able to write to a device at the same
time, with potentially disastrous results. Modern versions of the Intel CPU
do provide dual-mode operation. Accordingly, most contemporary operating
systems—such as Microsoft Windows 7, as well as Unix and Linux—take
advantage of this dual-mode feature and provide greater protection for the
operating system.
Once hardware protection is in place, it detects errors that violate modes.
These errors are normally handled by the operating system. If a user program
fails in some way—such as by making an attempt either to execute an illegal
instruction or to access memory that is not in the user’s address space—then
the hardware traps to the operating system. The trap transfers control through
the interrupt vector to the operating system, just as an interrupt does. When
a program error occurs, the operating system must terminate the program
abnormally. This situation is handled by the same code as a user-requested
abnormal termination. An appropriate error message is given, and the memory
of the program may be dumped. The memory dump is usually written to a
file so that the user or programmer can examine it and perhaps correct it and
restart the program.
1.5.2 Timer
We must ensure that the operating system maintains control over the CPU.
We cannot allow a user program to get stuck in an infinite loop or to fail
to call system services and never return control to the operating system. To
accomplish this goal, we can use a timer. A timer can be set to interrupt
the computer after a specified period. The period may be fixed (for example,
1/60 second) or variable (for example, from 1 millisecond to 1 second). A
variable timer is generally implemented by a fixed-rate clock and a counter.
The operating system sets the counter. Every time the clock ticks, the counter
is decremented. When the counter reaches 0, an interrupt occurs. For instance,
a 10-bit counter with a 1-millisecond clock allows interrupts at intervals from
1 millisecond to 1,024 milliseconds, in steps of 1 millisecond.
Before turning over control to the user, the operating system ensures
that the timer is set to interrupt. If the timer interrupts, control transfers
automatically to the operating system, which may treat the interrupt as a fatal
error or may give the program more time. Clearly, instructions that modify the
content of the timer are privileged.
We can use the timer to prevent a user program from running too long.
A simple technique is to initialize a counter with the amount of time that a
program is allowed to run. A program with a 7-minute time limit, for example,
would have its counter initialized to 420. Every second, the timer interrupts,
and the counter is decremented by 1. As long as the counter is positive, control
is returned to the user program. When the counter becomes negative, the
operating system terminates the program for exceeding the assigned time
limit.
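As a rough illustration of this counting scheme, the following C sketch simulates a once-per-second timer interrupt enforcing the 7-minute (420-second) limit described above. The terminate_program() helper and the simulation loop are assumptions made for the example, not part of any real operating system.

    /* Sketch of enforcing a CPU-time limit with a countdown timer.
     * Assumes one timer interrupt per second; names are illustrative. */
    #include <stdio.h>
    #include <stdlib.h>

    static int remaining_seconds = 420;          /* 7-minute limit, as in the text */

    static void terminate_program(void)          /* stand-in for the OS killing the process */
    {
        printf("time limit exceeded\n");
        exit(1);
    }

    static void timer_interrupt(void)            /* called on every timer interrupt */
    {
        remaining_seconds--;
        if (remaining_seconds < 0)               /* counter went negative: limit exceeded */
            terminate_program();
        /* otherwise control returns to the user program */
    }

    int main(void)
    {
        for (;;)                                 /* simulate the passing seconds */
            timer_interrupt();
    }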
1.6 Process Management
A program does nothing unless its instructions are executed by a CPU. A
program in execution, as mentioned, is a process. A time-shared user program
such as a compiler is a process. A word-processing program being run by an
individual user on a PC is a process. A system task, such as sending output
to a printer, can also be a process (or at least part of one). For now, you can
consider a process to be a job or a time-shared program, but later you will learn
that the concept is more general. As we shall see in Chapter 3, it is possible
to provide system calls that allow processes to create subprocesses to execute
concurrently.
A process needs certain resources—including CPU time, memory, files,
and I/O devices—to accomplish its task. These resources are either given to
the process when it is created or allocated to it while it is running. In addition
to the various physical and logical resources that a process obtains when it is
created, various initialization data (input) may be passed along. For example,
consider a process whose function is to display the status of a file on the screen
of a terminal. The process will be given the name of the file as an input and will
execute the appropriate instructions and system calls to obtain and display
the desired information on the terminal. When the process terminates, the
operating system will reclaim any reusable resources.
We emphasize that a program by itself is not a process. A program is a
passive entity, like the contents of a file stored on disk, whereas a process
is an active entity. A single-threaded process has one program counter
specifying the next instruction to execute. (Threads are covered in Chapter
4.) The execution of such a process must be sequential. The CPU executes one
instruction of the process after another, until the process completes. Further,
at any time, one instruction at most is executed on behalf of the process. Thus,
although two processes may be associated with the same program, they are
nevertheless considered two separate execution sequences. A multithreaded
process has multiple program counters, each pointing to the next instruction
to execute for a given thread.
A process is the unit of work in a system. A system consists of a collection
of processes, some of which are operating-system processes (those that execute
system code) and the rest of which are user processes (those that execute
user code). All these processes can potentially execute concurrently—by
multiplexing on a single CPU, for example.
The operating system is responsible for the following activities in connec-
tion with process management:
• Scheduling processes and threads on the CPUs
• Creating and deleting both user and system processes
• Suspending and resuming processes
• Providing mechanisms for process synchronization
• Providing mechanisms for process communication
We discuss process-management techniques in Chapters 3 through 5.
1.7 Memory Management
As we discussed in Section 1.2.2, the main memory is central to the operation
of a modern computer system. Main memory is a large array of bytes, ranging
in size from hundreds of thousands to billions. Each byte has its own address.
Main memory is a repository of quickly accessible data shared by the CPU and
I/O devices. The central processor reads instructions from main memory during
the instruction-fetch cycle and both reads and writes data from main memory
during the data-fetch cycle (on a von Neumann architecture). As noted earlier,
the main memory is generally the only large storage device that the CPU is able
to address and access directly. For example, for the CPU to process data from
disk, those data must first be transferred to main memory by CPU-generated
I/O calls. In the same way, instructions must be in memory for the CPU to
execute them.
For a program to be executed, it must be mapped to absolute addresses and
loaded into memory. As the program executes, it accesses program instructions
and data from memory by generating these absolute addresses. Eventually,
the program terminates, its memory space is declared available, and the next
program can be loaded and executed.
To improve both the utilization of the CPU and the speed of the computer’s
response to its users, general-purpose computers must keep several programs
in memory, creating a need for memory management. Many different memory-
management schemes are used. These schemes reflect various approaches, and
the effectiveness of any given algorithm depends on the situation. In selecting a
memory-management scheme for a specific system, we must take into account
many factors—especially the hardware design of the system. Each algorithm
requires its own hardware support.
The operating system is responsible for the following activities in connec-
tion with memory management:
• Keeping track of which parts of memory are currently being used and who
is using them
• Deciding which processes (or parts of processes) and data to move into
and out of memory
• Allocating and deallocating memory space as needed
Memory-management techniques are discussed in Chapters 8 and 9.
1.8 Storage Management
To make the computer system convenient for users, the operating system
provides a uniform, logical view of information storage. The operating system
abstracts from the physical properties of its storage devices to define a logical
storage unit, the file. The operating system maps files onto physical media and
accesses these files via the storage devices.
1.8.1 File-System Management
File management is one of the most visible components of an operating system.
Computers can store information on several different types of physical media.
Magnetic disk, optical disk, and magnetic tape are the most common. Each
of these media has its own characteristics and physical organization. Each
medium is controlled by a device, such as a disk drive or tape drive, that
also has its own unique characteristics. These properties include access speed,
capacity, data-transfer rate, and access method (sequential or random).
A file is a collection of related information defined by its creator. Commonly,
files represent programs (both source and object forms) and data. Data files may
be numeric, alphabetic, alphanumeric, or binary. Files may be free-form (for
example, text files), or they may be formatted rigidly (for example, fixed fields).
Clearly, the concept of a file is an extremely general one.
The operating system implements the abstract concept of a file by managing
mass-storage media, such as tapes and disks, and the devices that control them.
In addition, files are normally organized into directories to make them easier
to use. Finally, when multiple users have access to files, it may be desirable
to control which user may access a file and how that user may access it (for
example, read, write, append).
The operating system is responsible for the following activities in connec-
tion with file management:
• Creating and deleting files
• Creating and deleting directories to organize files
• Supporting primitives for manipulating files and directories
• Mapping files onto secondary storage
• Backing up files on stable (nonvolatile) storage media
File-management techniques are discussed in Chapters 11 and 12.
1.8.2 Mass-Storage Management
As we have already seen, because main memory is too small to accommodate
all data and programs, and because the data that it holds are lost when power
is lost, the computer system must provide secondary storage to back up main
memory. Most modern computer systems use disks as the principal on-line
storage medium for both programs and data. Most programs—including
compilers, assemblers, word processors, editors, and formatters—are stored
on a disk until loaded into memory. They then use the disk as both the source
and destination of their processing. Hence, the proper management of disk
storage is of central importance to a computer system. The operating system is
responsible for the following activities in connection with disk management:
• Free-space management
• Storage allocation
• Disk scheduling
Because secondary storage is used frequently, it must be used efficiently. The
entire speed of operation of a computer may hinge on the speeds of the disk
subsystem and the algorithms that manipulate that subsystem.
There are, however, many uses for storage that is slower and lower in
cost (and sometimes of higher capacity) than secondary storage. Backups of
disk data, storage of seldom-used data, and long-term archival storage are
some examples. Magnetic tape drives and their tapes and CD and DVD drives
and platters are typical tertiary storage devices. The media (tapes and optical
platters) vary between WORM (write-once, read-many-times) and RW (read–
write) formats.
Tertiary storage is not crucial to system performance, but it still must
be managed. Some operating systems take on this task, while others leave
tertiary-storage management to application programs. Some of the functions
that operating systems can provide include mounting and unmounting media
in devices, allocating and freeing the devices for exclusive use by processes,
and migrating data from secondary to tertiary storage.
Techniques for secondary and tertiary storage management are discussed
in Chapter 10.
1.8.3 Caching
Caching is an important principle of computer systems. Here’s how it works.
Information is normally kept in some storage system (such as main memory).
As it is used, it is copied into a faster storage system—the cache—on a
temporary basis. When we need a particular piece of information, we first
check whether it is in the cache. If it is, we use the information directly from
the cache. If it is not, we use the information from the source, putting a copy
in the cache under the assumption that we will need it again soon.
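The check-the-cache-first policy can be expressed in a few lines of C. The sketch below uses a small direct-mapped table and a slow_read() function standing in for the slower source; both are assumptions of the example rather than a description of any particular cache. In this sketch a colliding entry simply overwrites the old one; real caches use explicit replacement policies, the cache-management problem discussed below.

    /* Minimal software cache: look in the cache first; otherwise fetch from the
     * slower source and keep a copy.  Direct-mapped table; names are illustrative. */
    #include <stdbool.h>

    #define CACHE_SLOTS 256

    struct cache_entry {
        bool valid;
        long key;                             /* e.g., a block number */
        int  value;                           /* the cached data for that key */
    };

    static struct cache_entry cache[CACHE_SLOTS];

    static int slow_read(long key)            /* stand-in for main memory or disk */
    {
        return (int)(key * 2);                /* dummy data for the sketch */
    }

    int cached_read(long key)
    {
        struct cache_entry *e = &cache[(unsigned long)key % CACHE_SLOTS];
        if (e->valid && e->key == key)        /* hit: use the copy in the cache */
            return e->value;
        int value = slow_read(key);           /* miss: go to the source ... */
        e->valid = true;                      /* ... and cache the result, assuming */
        e->key   = key;                       /* we will need it again soon */
        e->value = value;
        return value;
    }

    int main(void)
    {
        int first  = cached_read(42);         /* miss: fetched and cached */
        int second = cached_read(42);         /* hit: served from the cache */
        return (first == second) ? 0 : 1;
    }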
In addition, internal programmable registers, such as index registers,
provide a high-speed cache for main memory. The programmer (or compiler)
implements the register-allocation and register-replacement algorithms to
decide which information to keep in registers and which to keep in main
memory.
Other caches are implemented totally in hardware. For instance, most
systems have an instruction cache to hold the instructions expected to be
executed next. Without this cache, the CPU would have to wait several cycles
while an instruction was fetched from main memory. For similar reasons, most
systems have one or more high-speed data caches in the memory hierarchy.
We are not concerned with these hardware-only caches in this text, since they
are outside the control of the operating system.
Because caches have limited size, cache management is an important
design problem. Careful selection of the cache size and of a replacement policy
can result in greatly increased performance. Figure 1.11 compares storage
performance in large workstations and small servers. Various replacement
algorithms for software-controlled caches are discussed in Chapter 9.
Main memory can be viewed as a fast cache for secondary storage, since
data in secondary storage must be copied into main memory for use and
data must be in main memory before being moved to secondary storage for
safekeeping. The file-system data, which resides permanently on secondary
storage, may appear on several levels in the storage hierarchy. At the highest
level, the operating system may maintain a cache of file-system data in main
memory. In addition, solid-state disks may be used for high-speed storage that
is accessed through the file-system interface. The bulk of secondary storage
is on magnetic disks. The magnetic-disk storage, in turn, is often backed up
onto magnetic tapes or removable disks to protect against data loss in case
of a hard-disk failure. Some systems automatically archive old file data from
secondary storage to tertiary storage, such as tape jukeboxes, to lower the
storage cost (see Chapter 10).
Figure 1.11 Performance of various levels of storage.
  Level 1: registers. Typical size < 1 KB; custom memory with multiple ports, CMOS;
           access time 0.25 - 0.5 ns; bandwidth 20,000 - 100,000 MB/sec;
           managed by the compiler; backed by cache.
  Level 2: cache. Typical size < 16 MB; on-chip or off-chip CMOS SRAM;
           access time 0.5 - 25 ns; bandwidth 5,000 - 10,000 MB/sec;
           managed by hardware; backed by main memory.
  Level 3: main memory. Typical size < 64 GB; CMOS SRAM;
           access time 80 - 250 ns; bandwidth 1,000 - 5,000 MB/sec;
           managed by the operating system; backed by disk.
  Level 4: solid state disk. Typical size < 1 TB; flash memory;
           access time 25,000 - 50,000 ns; bandwidth 500 MB/sec;
           managed by the operating system; backed by disk.
  Level 5: magnetic disk. Typical size < 10 TB; magnetic disk;
           access time 5,000,000 ns; bandwidth 20 - 150 MB/sec;
           managed by the operating system; backed by disk or tape.
Figure 1.12 Migration of integer A from disk to register: the value A moves from the magnetic disk to main memory, then to the cache, and finally into a hardware register.
The movement of information between levels of a storage hierarchy may
be either explicit or implicit, depending on the hardware design and the
controlling operating-system software. For instance, data transfer from cache
to CPU and registers is usually a hardware function, with no operating-system
intervention. In contrast, transfer of data from disk to memory is usually
controlled by the operating system.
In a hierarchical storage structure, the same data may appear in different
levels of the storage system. For example, suppose that an integer A that is to
be incremented by 1 is located in file B, and file B resides on magnetic disk.
The increment operation proceeds by first issuing an I/O operation to copy the
disk block on which A resides to main memory. This operation is followed by
copying A to the cache and to an internal register. Thus, the copy of A appears
in several places: on the magnetic disk, in main memory, in the cache, and in an
internal register (see Figure 1.12). Once the increment takes place in the internal
register, the value of A differs in the various storage systems. The value of A
becomes the same only after the new value of A is written from the internal
register back to the magnetic disk.
In a computing environment where only one process executes at a time,
this arrangement poses no difficulties, since an access to integer A will always
be to the copy at the highest level of the hierarchy. However, in a multitasking
environment, where the CPU is switched back and forth among various
processes, extreme care must be taken to ensure that, if several processes wish
to access A, then each of these processes will obtain the most recently updated
value of A.
The situation becomes more complicated in a multiprocessor environment
where, in addition to maintaining internal registers, each of the CPUs also
contains a local cache (Figure 1.6). In such an environment, a copy of A may
exist simultaneously in several caches. Since the various CPUs can all execute
in parallel, we must make sure that an update to the value of A in one cache
is immediately reflected in all other caches where A resides. This situation is
called cache coherency, and it is usually a hardware issue (handled below the
operating-system level).
In a distributed environment, the situation becomes even more complex.
In this environment, several copies (or replicas) of the same file can be kept on
different computers. Since the various replicas may be accessed and updated
concurrently, some distributed systems ensure that, when a replica is updated
in one place, all other replicas are brought up to date as soon as possible. There
are various ways to achieve this guarantee, as we discuss in Chapter 17.
1.8.4 I/O Systems
One of the purposes of an operating system is to hide the peculiarities of specific
hardware devices from the user. For example, in UNIX, the peculiarities of I/O
devices are hidden from the bulk of the operating system itself by the I/O
subsystem. The I/O subsystem consists of several components:
• A memory-management component that includes buffering, caching, and
spooling
• A general device-driver interface
• Drivers for specific hardware devices
Only the device driver knows the peculiarities of the specific device to which
it is assigned.
We discussed in Section 1.2.3 how interrupt handlers and device drivers are
used in the construction of efficient I/O subsystems. In Chapter 13, we discuss
how the I/O subsystem interfaces to the other system components, manages
devices, transfers data, and detects I/O completion.
1.9 Protection and Security
If a computer system has multiple users and allows the concurrent execution
of multiple processes, then access to data must be regulated. For that purpose,
mechanisms ensure that files, memory segments, CPU, and other resources can
be operated on by only those processes that have gained proper authoriza-
tion from the operating system. For example, memory-addressing hardware
ensures that a process can execute only within its own address space. The
timer ensures that no process can gain control of the CPU without eventually
relinquishing control. Device-control registers are not accessible to users, so
the integrity of the various peripheral devices is protected.
Protection, then, is any mechanism for controlling the access of processes
or users to the resources defined by a computer system. This mechanism must
provide means to specify the controls to be imposed and to enforce the controls.
Protection can improve reliability by detecting latent errors at the interfaces
between component subsystems. Early detection of interface errors can often
prevent contamination of a healthy subsystem by another subsystem that is
malfunctioning. Furthermore, an unprotected resource cannot defend against
use (or misuse) by an unauthorized or incompetent user. A protection-oriented
system provides a means to distinguish between authorized and unauthorized
usage, as we discuss in Chapter 14.
A system can have adequate protection but still be prone to failure and
allow inappropriate access. Consider a user whose authentication information
(her means of identifying herself to the system) is stolen. Her data could be
copied or deleted, even though file and memory protection are working. It is
the job of security to defend a system from external and internal attacks. Such
attacks spread across a huge range and include viruses and worms, denial-of-
service attacks (which use all of a system’s resources and so keep legitimate
users out of the system), identity theft, and theft of service (unauthorized
use of a system). Prevention of some of these attacks is considered an
operating-system function on some systems, while other systems leave it to
policy or additional software. Due to the alarming rise in security incidents, protection and security have become an increasingly important concern of operating-system design.
Chapter 2  Operating-System Structures
An operating system provides the environment within which programs are
executed. Internally, operating systems vary greatly in their makeup, since
they are organized along many different lines. The design of a new operating
system is a major task. It is important that the goals of the system be well
defined before the design begins. These goals form the basis for choices among
various algorithms and strategies.
We can view an operating system from several vantage points. One view
focuses on the services that the system provides; another, on the interface that
it makes available to users and programmers; a third, on its components and
their interconnections. In this chapter, we explore all three aspects of operating
systems, showing the viewpoints of users, programmers, and operating system
designers. We consider what services an operating system provides, how they
are provided, how they are debugged, and what the various methodologies
are for designing such systems. Finally, we describe how operating systems
are created and how a computer starts its operating system.
CHAPTER OBJECTIVES
• To describe the services an operating system provides to users, processes,
and other systems.
• To discuss the various ways of structuring an operating system.
• To explain how operating systems are installed and customized and how
they boot.
2.1 Operating-System Services
An operating system provides an environment for the execution of programs.
It provides certain services to programs and to the users of those programs.
The specific services provided, of course, differ from one operating system to
another, but we can identify common classes. These operating system services
are provided for the convenience of the programmer, to make the programming
task easier. Figure 2.1 shows one view of the various operating-system services
and how they interrelate.
Figure 2.1 A view of operating system services: user and other system programs reach the operating system through user interfaces (GUI, batch, and command line); services such as program execution, I/O operations, file systems, communication, resource allocation, accounting, error detection, and protection and security are provided through system calls by the operating system, which runs on the hardware.
One set of operating system services provides functions that are helpful to
the user.
• User interface. Almost all operating systems have a user interface (UI).
This interface can take several forms. One is a command-line interface
(CLI), which uses text commands and a method for entering them (say,
a keyboard for typing in commands in a specific format with specific
options). Another is a batch interface, in which commands and directives
to control those commands are entered into files, and those files are
executed. Most commonly, a graphical user interface (GUI) is used. Here,
the interface is a window system with a pointing device to direct I/O,
choose from menus, and make selections and a keyboard to enter text.
Some systems provide two or all three of these variations.
• Program execution. The system must be able to load a program into
memory and to run that program. The program must be able to end its
execution, either normally or abnormally (indicating error).
• I/O operations. A running program may require I/O, which may involve a
file or an I/O device. For specific devices, special functions may be desired
(such as recording to a CD or DVD drive or blanking a display screen). For
efficiency and protection, users usually cannot control I/O devices directly.
Therefore, the operating system must provide a means to do I/O.
• File-system manipulation. The file system is of particular interest. Obvi-
ously, programs need to read and write files and directories. They also
need to create and delete them by name, search for a given file, and
list file information. Finally, some operating systems include permissions
management to allow or deny access to files or directories based on file
ownership. Many operating systems provide a variety of file systems,
sometimes to allow personal choice and sometimes to provide specific
features or performance characteristics.
• Communications. There are many circumstances in which one process
needs to exchange information with another process. Such communication
may occur between processes that are executing on the same computer or
between processes that are executing on different computer systems tied
together by a computer network. Communications may be implemented
via shared memory, in which two or more processes read and write to
a shared section of memory, or message passing, in which packets of
information in predefined formats are moved between processes by the
operating system.
• Error detection. The operating system needs to detect and correct
errors constantly. Errors may occur in the CPU and memory hardware (such
as a memory error or a power failure), in I/O devices (such as a parity error
on disk, a connection failure on a network, or lack of paper in the printer),
and in the user program (such as an arithmetic overflow, an attempt to
access an illegal memory location, or a too-great use of CPU time). For
each type of error, the operating system should take the appropriate action
to ensure correct and consistent computing. Sometimes, it has no choice
but to halt the system. At other times, it might terminate an error-causing
process or return an error code to a process for the process to detect and
possibly correct.
Another set of operating system functions exists not for helping the user
but rather for ensuring the efficient operation of the system itself. Systems with
multiple users can gain efficiency by sharing the computer resources among
the users.
• Resource allocation. When there are multiple users or multiple jobs
running at the same time, resources must be allocated to each of them. The
operating system manages many different types of resources. Some (such
as CPU cycles, main memory, and file storage) may have special allocation
code, whereas others (such as I/O devices) may have much more general
request and release code. For instance, in determining how best to use
the CPU, operating systems have CPU-scheduling routines that take into
account the speed of the CPU, the jobs that must be executed, the number of
registers available, and other factors. There may also be routines to allocate
printers, USB storage drives, and other peripheral devices.
• Accounting. We want to keep track of which users use how much and
what kinds of computer resources. This record keeping may be used for
accounting (so that users can be billed) or simply for accumulating usage
statistics. Usage statistics may be a valuable tool for researchers who wish
to reconfigure the system to improve computing services.
• Protection and security. The owners of information stored in a multiuser or
networked computer system may want to control use of that information.
When several separate processes execute concurrently, it should not be
possible for one process to interfere with the others or with the operating
system itself. Protection involves ensuring that all access to system
resources is controlled. Security of the system from outsiders is also
important. Such security starts with requiring each user to authenticate
himself or herself to the system, usually by means of a password, to gain
access to system resources. It extends to defending external I/O devices,
including network adapters, from invalid access attempts and to recording
all such connections for detection of break-ins. If a system is to be protected
and secure, precautions must be instituted throughout it. A chain is only
as strong as its weakest link.
2.2 User and Operating-System Interface
We mentioned earlier that there are several ways for users to interface with
the operating system. Here, we discuss two fundamental approaches. One
provides a command-line interface, or command interpreter, that allows users
to directly enter commands to be performed by the operating system. The
other allows users to interface with the operating system via a graphical user
interface, or GUI.
2.2.1 Command Interpreters
Some operating systems include the command interpreter in the kernel. Others,
such as Windows and UNIX, treat the command interpreter as a special program
that is running when a job is initiated or when a user first logs on (on interactive
systems). On systems with multiple command interpreters to choose from, the
interpreters are known as shells. For example, on UNIX and Linux systems, a
user may choose among several different shells, including the Bourne shell, C
shell, Bourne-Again shell, Korn shell, and others. Third-party shells and free
user-written shells are also available. Most shells provide similar functionality,
and a user’s choice of which shell to use is generally based on personal
preference. Figure 2.2 shows the Bourne shell command interpreter being used
on Solaris 10.
The main function of the command interpreter is to get and execute the next
user-specified command. Many of the commands given at this level manipulate
files: create, delete, list, print, copy, execute, and so on. The MS-DOS and UNIX
shells operate in this way. These commands can be implemented in two general
ways.
In one approach, the command interpreter itself contains the code to
execute the command. For example, a command to delete a file may cause
the command interpreter to jump to a section of its code that sets up the
parameters and makes the appropriate system call. In this case, the number of
commands that can be given determines the size of the command interpreter,
since each command requires its own implementing code.
An alternative approach—used by UNIX, among other operating systems
—implements most commands through system programs. In this case, the
command interpreter does not understand the command in any way; it merely
uses the command to identify a file to be loaded into memory and executed.
Thus, the UNIX command to delete a file
rm file.txt
would search for a file called rm, load the file into memory, and execute it with
the parameter file.txt. The function associated with the rm command would
be defined completely by the code in the file rm. In this way, programmers can
add new commands to the system easily by creating new files with the proper
names. The command-interpreter program, which can be small, does not have
to be changed for new commands to be added.
Figure 2.2 The Bourne shell command interpreter in Solaris 10.
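The approach can be made concrete with a minimal sketch of such a command interpreter, written against the standard POSIX fork(), execvp(), and waitpid() calls. The interpreter knows nothing about rm or any other command; it simply runs whatever program has the given name. Pipes, redirection, quoting, and built-in commands are all omitted.

    /* Minimal UNIX-style command interpreter: read a command line, run the named
     * program in a child process, and wait for it to finish. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        char line[256];

        for (;;) {
            printf("> ");
            fflush(stdout);
            if (fgets(line, sizeof line, stdin) == NULL)
                break;                               /* end of input: exit the shell */
            line[strcspn(line, "\n")] = '\0';        /* strip the trailing newline */

            char *args[32];
            int nargs = 0;
            for (char *tok = strtok(line, " "); tok != NULL && nargs < 31;
                 tok = strtok(NULL, " "))
                args[nargs++] = tok;                 /* split the line into words */
            args[nargs] = NULL;
            if (nargs == 0)
                continue;                            /* empty line: prompt again */

            pid_t pid = fork();                      /* create a child process */
            if (pid == 0) {
                execvp(args[0], args);               /* "rm file.txt" runs the program rm */
                perror(args[0]);                     /* reached only if exec fails */
                exit(1);
            }
            waitpid(pid, NULL, 0);                   /* the shell waits for the child */
        }
        return 0;
    }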
2.2.2 Graphical User Interfaces
A second strategy for interfacing with the operating system is through a user-
friendly graphical user interface, or GUI. Here, rather than entering commands
directly via a command-line interface, users employ a mouse-based window-
and-menu system characterized by a desktop metaphor. The user moves the
mouse to position its pointer on images, or icons, on the screen (the desktop)
that represent programs, files, directories, and system functions. Depending
on the mouse pointer’s location, clicking a button on the mouse can invoke a
program, select a file or directory—known as a folder—or pull down a menu
that contains commands.
Graphical user interfaces first appeared due in part to research taking place
in the early 1970s at Xerox PARC research facility. The first GUI appeared on
the Xerox Alto computer in 1973. However, graphical interfaces became more
widespread with the advent of Apple Macintosh computers in the 1980s. The
user interface for the Macintosh operating system (Mac OS) has undergone
various changes over the years, the most significant being the adoption of
the Aqua interface that appeared with Mac OS X. Microsoft’s first version of
Windows—Version 1.0—was based on the addition of a GUI interface to the
MS-DOS operating system. Later versions of Windows have made cosmetic
changes in the appearance of the GUI along with several enhancements in its
functionality.
Because a mouse is impractical for most mobile systems, smartphones and
handheld tablet computers typically use a touchscreen interface. Here, users
interact by making gestures on the touchscreen—for example, pressing and
swiping fingers across the screen. Figure 2.3 illustrates the touchscreen of the
Apple iPad. Whereas earlier smartphones included a physical keyboard, most
smartphones now simulate a keyboard on the touchscreen.
Traditionally, UNIX systems have been dominated by command-line inter-
faces. Various GUI interfaces are available, however. These include the Common
Desktop Environment (CDE) and X-Windows systems, which are common
on commercial versions of UNIX, such as Solaris and IBM’s AIX system. In
addition, there has been significant development in GUI designs from various
open-source projects, such as K Desktop Environment (or KDE) and the GNOME
desktop by the GNU project. Both the KDE and GNOME desktops run on Linux
and various UNIX systems and are available under open-source licenses, which
means their source code is readily available for reading and for modification
under specific license terms.
Figure 2.3 The iPad touchscreen.
2.2.3 Choice of Interface
The choice of whether to use a command-line or GUI interface is mostly
one of personal preference. System administrators who manage computers
and power users who have deep knowledge of a system frequently use the
command-line interface. For them, it is more efficient, giving them faster
access to the activities they need to perform. Indeed, on some systems, only a
subset of system functions is available via the GUI, leaving the less common
tasks to those who are command-line knowledgeable. Further, command-
line interfaces usually make repetitive tasks easier, in part because they have
their own programmability. For example, if a frequent task requires a set of
command-line steps, those steps can be recorded into a file, and that file can
be run just like a program. The program is not compiled into executable code
but rather is interpreted by the command-line interface. These shell scripts are
very common on systems that are command-line oriented, such as UNIX and
Linux.
In contrast, most Windows users are happy to use the Windows GUI
environment and almost never use the MS-DOS shell interface. The various
changes undergone by the Macintosh operating systems provide a nice study
in contrast. Historically, Mac OS has not provided a command-line interface,
always requiring its users to interface with the operating system using its GUI.
However, with the release of Mac OS X (which is in part implemented using a
UNIX kernel), the operating system now provides both an Aqua interface and a
command-line interface. Figure 2.4 is a screenshot of the Mac OS X GUI.
Figure 2.4 The Mac OS X GUI.
The user interface can vary from system to system and even from user
to user within a system. It typically is substantially removed from the actual
system structure. The design of a useful and friendly user interface is therefore
not a direct function of the operating system. In this book, we concentrate on
the fundamental problems of providing adequate service to user programs.
From the point of view of the operating system, we do not distinguish between
user programs and system programs.
2.3 System Calls
System calls provide an interface to the services made available by an operating
system. These calls are generally available as routines written in C and
C++, although certain low-level tasks (for example, tasks where hardware
must be accessed directly) may have to be written using assembly-language
instructions.
Before we discuss how an operating system makes system calls available,
let’s first use an example to illustrate how system calls are used: writing a
simple program to read data from one file and copy them to another file. The
first input that the program will need is the names of the two files: the input file
and the output file. These names can be specified in many ways, depending on
the operating-system design. One approach is for the program to ask the user
for the names. In an interactive system, this approach will require a sequence of
system calls, first to write a prompting message on the screen and then to read
from the keyboard the characters that define the two files. On mouse-based and
icon-based systems, a menu of file names is usually displayed in a window.
The user can then use the mouse to select the source name, and a window
can be opened for the destination name to be specified. This sequence requires
many I/O system calls.
Once the two file names have been obtained, the program must open the
input file and create the output file. Each of these operations requires another
system call. Possible error conditions for each operation can require additional
system calls. When the program tries to open the input file, for example, it may
find that there is no file of that name or that the file is protected against access.
In these cases, the program should print a message on the console (another
sequence of system calls) and then terminate abnormally (another system call).
If the input file exists, then we must create a new output file. We may find that
there is already an output file with the same name. This situation may cause
the program to abort (a system call), or we may delete the existing file (another
system call) and create a new one (yet another system call). Another option,
in an interactive system, is to ask the user (via a sequence of system calls to
output the prompting message and to read the response from the terminal)
whether to replace the existing file or to abort the program.
When both files are set up, we enter a loop that reads from the input file
(a system call) and writes to the output file (another system call). Each read
and write must return status information regarding various possible error
conditions. On input, the program may find that the end of the file has been
reached or that there was a hardware failure in the read (such as a parity error).
The write operation may encounter various errors, depending on the output
device (for example, no more disk space).
Finally, after the entire file is copied, the program may close both files
(another system call), write a message to the console or window (more system
calls), and finally terminate normally (the final system call). This system-call
sequence is shown in Figure 2.5.
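A bare-bones version of this copy program, written directly against the POSIX system calls open(), read(), write(), and close(), might look like the following sketch. It follows the sequence described above (open the input file, create the output file, loop until the read fails or reaches end of file, close both files, and report completion) with only minimal error handling.

    /* Copying a file using system calls directly (a POSIX sketch). */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char *argv[])
    {
        if (argc != 3) {
            fprintf(stderr, "usage: %s source destination\n", argv[0]);
            return 1;
        }

        int in = open(argv[1], O_RDONLY);              /* open the input file */
        if (in < 0) { perror(argv[1]); return 1; }     /* abort if it does not exist */

        int out = open(argv[2], O_WRONLY | O_CREAT | O_EXCL, 0644);
        if (out < 0) { perror(argv[2]); return 1; }    /* abort if the output file exists */

        char buf[4096];
        ssize_t n;
        while ((n = read(in, buf, sizeof buf)) > 0)    /* loop until read fails or hits EOF */
            if (write(out, buf, (size_t)n) != n) {
                perror("write");                       /* e.g., no more disk space */
                return 1;
            }

        close(in);                                     /* close both files */
        close(out);
        printf("copy complete\n");                     /* completion message */
        return 0;                                      /* terminate normally */
    }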
As you can see, even simple programs may make heavy use of the
operating system. Frequently, systems execute thousands of system calls
per second. Most programmers never see this level of detail, however.
Typically, application developers design programs according to an application
programming interface (API). The API specifies a set of functions that are
available to an application programmer, including the parameters that are
passed to each function and the return values the programmer can expect.
Three of the most common APIs available to application programmers are
the Windows API for Windows systems, the POSIX API for POSIX-based systems
(which include virtually all versions of UNIX, Linux, and Mac OS X), and the Java
API for programs that run on the Java virtual machine. A programmer accesses
an API via a library of code provided by the operating system. In the case of
UNIX and Linux for programs written in the C language, the library is called
libc. Note that—unless specified—the system-call names used throughout
this text are generic examples. Each operating system has its own name for
each system call.
Behind the scenes, the functions that make up an API typically invoke the
actual system calls on behalf of the application programmer. For example, the
Windows function CreateProcess() (which unsurprisingly is used to create
a new process) actually invokes the NTCreateProcess() system call in the
Windows kernel.
Why would an application programmer prefer programming according to
an API rather than invoking actual system calls? There are several reasons for
doing so. One benefit concerns program portability.
Figure 2.5 Example of how system calls are used to copy data from a source file to a destination file:
  Acquire input file name: write prompt to screen; accept input
  Acquire output file name: write prompt to screen; accept input
  Open the input file: if file doesn't exist, abort
  Create output file: if file exists, abort
  Loop: read from input file; write to output file; until read fails
  Close output file
  Write completion message to screen
  Terminate normally
EXAMPLE OF STANDARD API
As an example of a standard API, consider the read() function that is
available in UNIX and Linux systems. The API for this function is obtained
from the man page by invoking the command
man read
on the command line. A description of this API appears below:
#include <unistd.h>
ssize_t read(int fd, void *buf, size_t count)
(In the diagram, ssize_t is labeled as the return value, read as the function name, and fd, buf, and count as the parameters.)
A program that uses the read() function must include the unistd.h header
file, as this file defines the ssize_t and size_t data types (among other
things). The parameters passed to read() are as follows:
• int fd—the file descriptor to be read
• void *buf—a buffer where the data will be read into
• size_t count—the maximum number of bytes to be read into the
buffer
On a successful read, the number of bytes read is returned. A return value of
0 indicates end of file. If an error occurs, read() returns −1.
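As a small usage example of this API, the fragment below reads up to 80 bytes from standard input (file descriptor 0) and distinguishes the three possible outcomes. The buffer size is arbitrary.

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[80];
        ssize_t n = read(0, buf, sizeof buf);   /* file descriptor 0 is standard input */

        if (n > 0)
            printf("read %zd bytes\n", n);      /* success: n bytes were read */
        else if (n == 0)
            printf("end of file\n");            /* return value 0 means end of file */
        else
            perror("read");                     /* -1 means an error occurred */
        return 0;
    }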
An application programmer designing a program using an API can expect her program to compile and
run on any system that supports the same API (although, in reality, architectural
differences often make this more difficult than it may appear). Furthermore,
actual system calls can often be more detailed and difficult to work with than
the API available to an application programmer. Nevertheless, there often exists
a strong correlation between a function in the API and its associated system call
within the kernel. In fact, many of the POSIX and Windows APIs are similar to
the native system calls provided by the UNIX, Linux, and Windows operating
systems.
For most programming languages, the run-time support system (a set of
functions built into libraries included with a compiler) provides a system-
call interface that serves as the link to system calls made available by the
operating system. The system-call interface intercepts function calls in the API
and invokes the necessary system calls within the operating system. Typically,
a number is associated with each system call, and the system-call interface
maintains a table indexed according to these numbers. The system call interface
then invokes the intended system call in the operating-system kernel and
returns the status of the system call and any return values.
Figure 2.6 The handling of a user application invoking the open() system call: the application, running in user mode, calls open(); the system-call interface uses the call's index i to locate and invoke the kernel-mode implementation of open(), and the result is returned to the user application.
The caller need know nothing about how the system call is implemented
or what it does during execution. Rather, the caller need only obey the API and
understand what the operating system will do as a result of the execution of
that system call. Thus, most of the details of the operating-system interface
are hidden from the programmer by the API and are managed by the run-time
support library. The relationship between an API, the system-call interface,
and the operating system is shown in Figure 2.6, which illustrates how the
operating system handles a user application invoking the open() system call.
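The number-indexed table can be illustrated with a short C sketch. The table contents and the sys_open() and sys_read() stubs below are hypothetical; a real kernel fills such a table with its own handler routines and reaches it from architecture-specific trap code rather than from an ordinary function call.

    /* Sketch of a system-call dispatch table indexed by system-call number. */
    #include <stdio.h>

    typedef long (*syscall_fn)(long a1, long a2, long a3);

    static long sys_open(long a1, long a2, long a3) { (void)a1; (void)a2; (void)a3; return 3; }
    static long sys_read(long a1, long a2, long a3) { (void)a1; (void)a2; (void)a3; return 0; }

    static syscall_fn syscall_table[] = {
        sys_open,                               /* number 0 */
        sys_read,                               /* number 1: each entry is one service routine */
    };

    /* Invoked from the kernel's trap handler with the number and raw arguments. */
    long do_syscall(long number, long a1, long a2, long a3)
    {
        long count = (long)(sizeof syscall_table / sizeof syscall_table[0]);
        if (number < 0 || number >= count)
            return -1;                          /* unknown system call */
        return syscall_table[number](a1, a2, a3);
    }

    int main(void)
    {
        printf("system call 0 returned %ld\n", do_syscall(0, 0, 0, 0));
        return 0;
    }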
System calls occur in different ways, depending on the computer in use.
Often, more information is required than simply the identity of the desired
system call. The exact type and amount of information vary according to the
particular operating system and call. For example, to get input, we may need
to specify the file or device to use as the source, as well as the address and
length of the memory buffer into which the input should be read. Of course,
the device or file and length may be implicit in the call.
Three general methods are used to pass parameters to the operating system.
The simplest approach is to pass the parameters in registers. In some cases,
however, there may be more parameters than registers. In these cases, the
parameters are generally stored in a block, or table, in memory, and the
address of the block is passed as a parameter in a register (Figure 2.7). This
is the approach taken by Linux and Solaris. Parameters also can be placed,
or pushed, onto the stack by the program and popped off the stack by the
operating system. Some operating systems prefer the block or stack method
because those approaches do not limit the number or length of parameters
being passed.
Figure 2.7 Passing of parameters as a table: the user program stores the parameters for the call in a table in memory at address X, loads the address X into a register, and then issues system call 13; the operating system's code for system call 13 reads the parameters from the table at X.
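The block (or table) method of Figure 2.7 can be sketched as follows: the caller packs its arguments into a structure and hands over a single address, which conceptually is the value placed in the register. The open_params layout and the handle_open() routine are invented for the example.

    /* Sketch of passing system-call parameters as a block (table) in memory. */
    #include <stdio.h>

    struct open_params {              /* hypothetical parameter block for an "open" call */
        const char *path;
        int         flags;
        int         mode;
    };

    /* Kernel side: receives only the address of the block. */
    static long handle_open(const void *param_block)
    {
        const struct open_params *p = param_block;
        printf("open(\"%s\", flags=%d, mode=%d)\n", p->path, p->flags, p->mode);
        return 3;                     /* a pretend file descriptor */
    }

    int main(void)
    {
        struct open_params params = { "notes.txt", 0, 0644 };  /* user side fills the table */
        long fd = handle_open(&params);    /* conceptually, &params goes into a register */
        printf("returned %ld\n", fd);
        return 0;
    }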
2.4 Types of System Calls
System calls can be grouped roughly into six major categories: process
control, file manipulation, device manipulation, information maintenance,
communications, and protection. In Sections 2.4.1 through 2.4.6, we briefly
discuss the types of system calls that may be provided by an operating system.
Most of these system calls support, or are supported by, concepts and functions
that are discussed in later chapters. Figure 2.8 summarizes the types of system
calls normally provided by an operating system. As mentioned, in this text,
we normally refer to the system calls by generic names. Throughout the text,
however, we provide examples of the actual counterparts to the system calls
for Windows, UNIX, and Linux systems.
2.4.1 Process Control
A running program needs to be able to halt its execution either normally
(end()) or abnormally (abort()). If a system call is made to terminate the
currently running program abnormally, or if the program runs into a problem
and causes an error trap, a dump of memory is sometimes taken and an error
message generated. The dump is written to disk and may be examined by
a debugger—a system program designed to aid the programmer in finding
and correcting errors, or bugs—to determine the cause of the problem. Under
either normal or abnormal circumstances, the operating system must transfer
control to the invoking command interpreter. The command interpreter then
reads the next command. In an interactive system, the command interpreter
simply continues with the next command; it is assumed that the user will
issue an appropriate command to respond to any error. In a GUI system, a
pop-up window might alert the user to the error and ask for guidance. In a
batch system, the command interpreter usually terminates the entire job and
continues with the next job. Some systems may allow for special recovery
actions in case an error occurs. If the program discovers an error in its input
and wants to terminate abnormally, it may also want to define an error level.
More severe errors can be indicated by a higher-level error parameter.
• Process control
◦ end, abort
◦ load, execute
◦ create process, terminate process
◦ get process attributes, set process attributes
◦ wait for time
◦ wait event, signal event
◦ allocate and free memory
• File management
◦ create file, delete file
◦ open, close
◦ read, write, reposition
◦ get file attributes, set file attributes
• Device management
◦ request device, release device
◦ read, write, reposition
◦ get device attributes, set device attributes
◦ logically attach or detach devices
• Information maintenance
◦ get time or date, set time or date
◦ get system data, set system data
◦ get process, file, or device attributes
◦ set process, file, or device attributes
• Communications
◦ create, delete communication connection
◦ send, receive messages
◦ transfer status information
◦ attach or detach remote devices
Figure 2.8 Types of system calls.
It is then possible to combine normal and abnormal termination by defining a normal
termination as an error at level 0. The command interpreter or a following
program can use this error level to determine the next action automatically.
A process or job executing one program may want to load() and
execute() another program. This feature allows the command interpreter to
execute a program as directed by, for example, a user command, the click of a
mouse, or a batch command.

EXAMPLES OF WINDOWS AND UNIX SYSTEM CALLS

                          Windows                            UNIX
Process control           CreateProcess()                    fork()
                          ExitProcess()                      exit()
                          WaitForSingleObject()              wait()
File manipulation         CreateFile()                       open()
                          ReadFile()                         read()
                          WriteFile()                        write()
                          CloseHandle()                      close()
Device manipulation       SetConsoleMode()                   ioctl()
                          ReadConsole()                      read()
                          WriteConsole()                     write()
Information maintenance   GetCurrentProcessID()              getpid()
                          SetTimer()                         alarm()
                          Sleep()                            sleep()
Communication             CreatePipe()                       pipe()
                          CreateFileMapping()                shm_open()
                          MapViewOfFile()                    mmap()
Protection                SetFileSecurity()                  chmod()
                          InitializeSecurityDescriptor()     umask()
                          SetSecurityDescriptorGroup()       chown()

An interesting question is where to return control
when the loaded program terminates. This question is related to whether the
existing program is lost, saved, or allowed to continue execution concurrently
with the new program.
If control returns to the existing program when the new program termi-
nates, we must save the memory image of the existing program; thus, we have
effectively created a mechanism for one program to call another program. If
both programs continue concurrently, we have created a new job or process to
be multiprogrammed. Often, there is a system call specifically for this purpose
(create_process() or submit_job()).
If we create a new job or process, or perhaps even a set of jobs or
processes, we should be able to control its execution. This control requires
the ability to determine and reset the attributes of a job or process, includ-
ing the job’s priority, its maximum allowable execution time, and so on
(get_process_attributes() and set_process_attributes()). We may also
want to terminate a job or process that we created (terminate_process()) if
we find that it is incorrect or is no longer needed.
EXAMPLE OF STANDARD C LIBRARY
The standard C library provides a portion of the system-call interface for
many versions of UNIX and Linux. As an example, let’s assume a C program
invokes the printf() statement. The C library intercepts this call and
invokes the necessary system call (or calls) in the operating system—in this
instance, the write() system call. The C library takes the value returned by
write() and passes it back to the user program. This is shown below:
#include <stdio.h>

int main()
{
   ...
   printf("Greetings");
   ...
   return 0;
}

(In the figure, the printf() call in the user program passes through the
standard C library in user mode to the write() system call, which runs in
kernel mode; the return value travels back up the same path.)
Having created new jobs or processes, we may need to wait for them to
finish their execution. We may want to wait for a certain amount of time to
pass (wait_time()). More probably, we will want to wait for a specific event
to occur (wait_event()). The jobs or processes should then signal when that
event has occurred (signal_event()).
Quite often, two or more processes may share data. To ensure the integrity
of the data being shared, operating systems often provide system calls allowing
a process to lock shared data. Then, no other process can access the data until
the lock is released. Typically, such system calls include acquire_lock() and
release_lock(). System calls of these types, dealing with the coordination of
concurrent processes, are discussed in great detail in Chapter 5.
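The acquire_lock() and release_lock() names above are generic. As one
possible realization (a sketch using POSIX mutexes between threads rather
than separate processes, an assumption that goes beyond the text), shared
data can be protected like this:

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int shared_counter = 0;          /* data shared by both threads */

static void *worker(void *arg)
{
    (void) arg;                         /* unused */
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);      /* acquire the lock */
        shared_counter++;               /* access the shared data */
        pthread_mutex_unlock(&lock);    /* release the lock */
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;

    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    printf("counter = %d\n", shared_counter);   /* always 200000 */
    return 0;
}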
There are so many facets of and variations in process and job control that
we next use two examples—one involving a single-tasking system and the
other a multitasking system—to clarify these concepts. The MS-DOS operating
system is an example of a single-tasking system. It has a command interpreter
that is invoked when the computer is started (Figure 2.9(a)). Because MS-DOS
is single-tasking, it uses a simple method to run a program and does not create
a new process. It loads the program into memory, writing over most of itself to
give the program as much memory as possible (Figure 2.9(b)).
Figure 2.9 MS-DOS execution. (a) At system startup, memory holds the kernel,
the command interpreter, and free memory. (b) While running a program, the
process occupies most of the formerly free memory.
Next, it sets the
instruction pointer to the first instruction of the program. The program then
runs, and either an error causes a trap, or the program executes a system call
to terminate. In either case, the error code is saved in the system memory for
later use. Following this action, the small portion of the command interpreter
that was not overwritten resumes execution. Its first task is to reload the rest
of the command interpreter from disk. Then the command interpreter makes
the previous error code available to the user or to the next program.
FreeBSD (derived from Berkeley UNIX) is an example of a multitasking
system. When a user logs on to the system, the shell of the user’s choice
is run. This shell is similar to the MS-DOS shell in that it accepts commands
and executes programs that the user requests. However, since FreeBSD is a
multitasking system, the command interpreter may continue running while
another program is executed (Figure 2.10).
Figure 2.10 FreeBSD running multiple programs: memory holds the kernel, the
command interpreter, several user processes (B, C, and D), and free memory.
To start a new process, the shell
executes a fork() system call. Then, the selected program is loaded into
memory via an exec() system call, and the program is executed. Depending
on the way the command was issued, the shell then either waits for the process
to finish or runs the process “in the background.” In the latter case, the shell
immediately requests another command. When a process is running in the
background, it cannot receive input directly from the keyboard, because the
shell is using this resource. I/O is therefore done through files or through a GUI
interface. Meanwhile, the user is free to ask the shell to run other programs, to
monitor the progress of the running process, to change that program’s priority,
and so on. When the process is done, it executes an exit() system call to
terminate, returning to the invoking process a status code of 0 or a nonzero
error code. This status or error code is then available to the shell or other
programs. Processes are discussed in Chapter 3 with a program example using
the fork() and exec() system calls.
2.4.2 File Management
The file system is discussed in more detail in Chapters 11 and 12. We can,
however, identify several common system calls dealing with files.
We first need to be able to create() and delete() files. Either system call
requires the name of the file and perhaps some of the file’s attributes. Once
the file is created, we need to open() it and to use it. We may also read(),
write(), or reposition() (rewind or skip to the end of the file, for example).
Finally, we need to close() the file, indicating that we are no longer using it.
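As a brief sketch using the POSIX counterparts of these calls (the file name
example.txt is invented for illustration, and error checking is abbreviated):

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    char buf[64];

    int fd = open("example.txt", O_CREAT | O_RDWR, 0644); /* create/open */
    if (fd < 0)
        return 1;

    write(fd, "some data\n", 10);            /* write to the file */
    lseek(fd, 0, SEEK_SET);                  /* reposition to the start */
    read(fd, buf, sizeof(buf));              /* read it back */
    close(fd);                               /* we are done with the file */
    return 0;
}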
We may need these same sets of operations for directories if we have a
directory structure for organizing files in the file system. In addition, for either
files or directories, we need to be able to determine the values of various
attributes and perhaps to reset them if necessary. File attributes include the file
name, file type, protection codes, accounting information, and so on. At least
two system calls, get_file_attributes() and set_file_attributes(), are
required for this function. Some operating systems provide many more calls,
such as calls for file move() and copy(). Others might provide an API that
performs those operations using code and other system calls, and others might
provide system programs to perform those tasks. If the system programs are
callable by other programs, then each can be considered an API by other system
programs.
2.4.3 Device Management
A process may need several resources to execute—main memory, disk drives,
access to files, and so on. If the resources are available, they can be granted,
and control can be returned to the user process. Otherwise, the process will
have to wait until sufficient resources are available.
The various resources controlled by the operating system can be thought
of as devices. Some of these devices are physical devices (for example, disk
drives), while others can be thought of as abstract or virtual devices (for
example, files). A system with multiple users may require us to first request()
a device, to ensure exclusive use of it. After we are finished with the device, we
release() it. These functions are similar to the open() and close() system
calls for files. Other operating systems allow unmanaged access to devices.
The hazard then is the potential for device contention and perhaps deadlock,
which are described in Chapter 7.
Once the device has been requested (and allocated to us), we can read(),
write(), and (possibly) reposition() the device, just as we can with files. In
fact, the similarity between I/O devices and files is so great that many operating
systems, including UNIX, merge the two into a combined file–device structure.
In this case, a set of system calls is used on both files and devices. Sometimes,
I/O devices are identified by special file names, directory placement, or file
attributes.
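For instance, on a UNIX-like system a device is often accessed through the
very same calls used for files. The sketch below reads a few bytes from
/dev/urandom, a Linux device file chosen here purely for illustration:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    unsigned char bytes[8];

    int fd = open("/dev/urandom", O_RDONLY);  /* "request" the device */
    if (fd < 0)
        return 1;

    read(fd, bytes, sizeof(bytes));           /* read from it like a file */
    close(fd);                                /* "release" the device */

    for (int i = 0; i < 8; i++)
        printf("%02x", bytes[i]);
    printf("\n");
    return 0;
}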
The user interface can also make files and devices appear to be similar, even
though the underlying system calls are dissimilar. This is another example of
the many design decisions that go into building an operating system and user
interface.
2.4.4 Information Maintenance
Many system calls exist simply for the purpose of transferring information
between the user program and the operating system. For example, most
systems have a system call to return the current time() and date(). Other
system calls may return information about the system, such as the number of
current users, the version number of the operating system, the amount of free
memory or disk space, and so on.
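On a POSIX system, for example, a small sketch that obtains the current time
and date and the identifier of the calling process might look like this:

#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    time_t now = time(NULL);   /* get time and date */
    pid_t pid = getpid();      /* get the process identifier */

    printf("process %d at %s", (int)pid, ctime(&now));  /* ctime() adds '\n' */
    return 0;
}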
Another set of system calls is helpful in debugging a program. Many
systems provide system calls to dump() memory. This provision is useful for
debugging. A program trace lists each system call as it is executed. Even
microprocessors provide a CPU mode known as single step, in which a trap
is executed by the CPU after every instruction. The trap is usually caught by a
debugger.
Many operating systems provide a time profile of a program to indicate
the amount of time that the program executes at a particular location or set
of locations. A time profile requires either a tracing facility or regular timer
interrupts. At every occurrence of the timer interrupt, the value of the program
counter is recorded. With sufficiently frequent timer interrupts, a statistical
picture of the time spent on various parts of the program can be obtained.
In addition, the operating system keeps information about all its processes,
and system calls are used to access this information. Generally, calls are
also used to reset the process information (get_process_attributes() and
set_process_attributes()). In Section 3.1.3, we discuss what information is
normally kept.
2.4.5 Communication
There are two common models of interprocess communication: the message-
passing model and the shared-memory model. In the message-passing model,
the communicating processes exchange messages with one another to transfer
information. Messages can be exchanged between the processes either directly
or indirectly through a common mailbox. Before communication can take
place, a connection must be opened. The name of the other communicator
must be known, be it another process on the same system or a process on
another computer connected by a communications network. Each computer in
a network has a host name by which it is commonly known. A host also has a
network identifier, such as an IP address. Similarly, each process has a process
name, and this name is translated into an identifier by which the operating
system can refer to the process. The get_hostid() and get_processid()
system calls do this translation. The identifiers are then passed to the general-
purpose open() and close() calls provided by the file system or to specific
open_connection() and close_connection() system calls, depending on the
system’s model of communication. The recipient process usually must give its
permission for communication to take place with an accept_connection()
call. Most processes that will be receiving connections are special-purpose
daemons, which are system programs provided for that purpose. They execute
a wait_for_connection() call and are awakened when a connection is made.
The source of the communication, known as the client, and the receiving
daemon, known as a server, then exchange messages by using read_message()
and write_message() system calls. The close_connection() call terminates
the communication.
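The connection calls above are generic names. As a small sketch of message
passing on UNIX (using a pipe between a parent and child rather than a
network connection, a deliberate simplification), read() and write() below
play the roles of read_message() and write_message():

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int fd[2];
    char buf[32];

    if (pipe(fd) < 0)                    /* create the communication channel */
        return 1;

    if (fork() == 0) {                   /* child: the sender */
        close(fd[0]);
        write(fd[1], "ping", 5);         /* send a message (including '\0') */
        close(fd[1]);
        return 0;
    }

    close(fd[1]);                        /* parent: the receiver */
    read(fd[0], buf, sizeof(buf));       /* receive the message */
    printf("received: %s\n", buf);
    close(fd[0]);
    wait(NULL);
    return 0;
}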
In the shared-memory model, processes use shared_memory_create()
and shared_memory_attach() system calls to create and gain access to regions
of memory owned by other processes. Recall that, normally, the operating
system tries to prevent one process from accessing another process’s memory.
Shared memory requires that two or more processes agree to remove this
restriction. They can then exchange information by reading and writing data
in the shared areas. The form of the data is determined by the processes and is
not under the operating system’s control. The processes are also responsible for
ensuring that they are not writing to the same location simultaneously. Such
mechanisms are discussed in Chapter 5. In Chapter 4, we look at a variation of
the process scheme—threads—in which memory is shared by default.
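A minimal sketch of this model using POSIX shared memory follows (the region
name /demo_shm is arbitrary, and shm_open() and mmap() stand in for the
generic shared_memory_create() and shared_memory_attach() calls); on some
systems it must be linked with the real-time library (-lrt):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const size_t size = 4096;

    int fd = shm_open("/demo_shm", O_CREAT | O_RDWR, 0644); /* create region */
    if (fd < 0)
        return 1;
    ftruncate(fd, size);                                     /* set its size */

    char *region = mmap(NULL, size, PROT_READ | PROT_WRITE,  /* attach it */
                        MAP_SHARED, fd, 0);
    if (region == MAP_FAILED)
        return 1;

    strcpy(region, "data visible to any process that attaches");
    printf("%s\n", region);

    munmap(region, size);                                    /* detach */
    shm_unlink("/demo_shm");                                 /* remove region */
    return 0;
}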
Both of the models just discussed are common in operating systems,
and most systems implement both. Message passing is useful for exchanging
smaller amounts of data, because no conflicts need be avoided. It is also easier to
implement than is shared memory for intercomputer communication. Shared
memory allows maximum speed and convenience of communication, since it
can be done at memory transfer speeds when it takes place within a computer.
Problems exist, however, in the areas of protection and synchronization
between the processes sharing memory.
2.4.6 Protection
Protection provides a mechanism for controlling access to the resources
provided by a computer system. Historically, protection was a concern only on
multiprogrammed computer systems with several users. However, with the
advent of networking and the Internet, all computer systems, from servers to
mobile handheld devices, must be concerned with protection.
Typically, system calls providing protection include set_permission()
and get_permission(), which manipulate the permission settings of
resources such as files and disks. The allow_user() and deny_user() system
calls specify whether particular users can—or cannot—be allowed access to
certain resources.
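On a POSIX system, the counterparts include chmod() and umask(); the sketch
below assumes a file named example.txt exists:

#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
    umask(022);   /* new files: no write permission for group and others */

    /* Restrict example.txt to owner read/write only. */
    if (chmod("example.txt", S_IRUSR | S_IWUSR) < 0) {
        perror("chmod");
        return 1;
    }
    return 0;
}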
We cover protection in Chapter 14 and the much larger issue of security in
Chapter 15.
Part Two: Process Management
A process can be thought of as a program in execution. A process will
need certain resources—such as CPU time, memory, files, and I/O devices
—to accomplish its task. These resources are allocated to the process
either when it is created or while it is executing.
A process is the unit of work in most systems. Systems consist of
a collection of processes: operating-system processes execute system
code, and user processes execute user code. All these processes may
execute concurrently.
Although traditionally a process contained only a single thread of
control as it ran, most modern operating systems now support processes
that have multiple threads.
The operating system is responsible for several important aspects of
process and thread management: the creation and deletion of both user
and system processes; the scheduling of processes; and the provision of
mechanisms for synchronization, communication, and deadlock handling
for processes.
Chapter 3: Processes
Early computers allowed only one program to be executed at a time. This
program had complete control of the system and had access to all the system’s
resources. In contrast, contemporary computer systems allow multiple pro-
grams to be loaded into memory and executed concurrently. This evolution
required firmer control and more compartmentalization of the various pro-
grams; and these needs resulted in the notion of a process, which is a program
in execution. A process is the unit of work in a modern time-sharing system.
The more complex the operating system is, the more it is expected to do on
behalf of its users. Although its main concern is the execution of user programs,
it also needs to take care of various system tasks that are better left outside the
kernel itself. A system therefore consists of a collection of processes: operating-
system processes executing system code and user processes executing user
code. Potentially, all these processes can execute concurrently, with the CPU (or
CPUs) multiplexed among them. By switching the CPU between processes, the
operating system can make the computer more productive. In this chapter, you
will read about what processes are and how they work.
CHAPTER OBJECTIVES
• To introduce the notion of a process—a program in execution, which forms
the basis of all computation.
• To describe the various features of processes, including scheduling,
creation, and termination.
• To explore interprocess communication using shared memory and mes-
sage passing.
• To describe communication in client–server systems.
3.1 Process Concept
A question that arises in discussing operating systems involves what to call
all the CPU activities. A batch system executes jobs, whereas a time-shared
system has user programs, or tasks. Even on a single-user system, a user may
be able to run several programs at one time: a word processor, a Web browser,
and an e-mail package. And even if a user can execute only one program at a
time, such as on an embedded device that does not support multitasking, the
operating system may need to support its own internal programmed activities,
such as memory management. In many respects, all these activities are similar,
so we call all of them processes.
The terms job and process are used almost interchangeably in this text.
Although we personally prefer the term process, much of operating-system
theory and terminology was developed during a time when the major activity
of operating systems was job processing. It would be misleading to avoid
the use of commonly accepted terms that include the word job (such as job
scheduling) simply because process has superseded job.
3.1.1 The Process
Informally, as mentioned earlier, a process is a program in execution. A process
is more than the program code, which is sometimes known as the text section.
It also includes the current activity, as represented by the value of the program
counter and the contents of the processor’s registers. A process generally also
includes the process stack, which contains temporary data (such as function
parameters, return addresses, and local variables), and a data section, which
contains global variables. A process may also include a heap, which is memory
that is dynamically allocated during process run time. The structure of a process
in memory is shown in Figure 3.1.
We emphasize that a program by itself is not a process. A program is a
passive entity, such as a file containing a list of instructions stored on disk
(often called an executable file). In contrast, a process is an active entity,
with a program counter specifying the next instruction to execute and a set
of associated resources. A program becomes a process when an executable file
is loaded into memory.
Figure 3.1 Process in memory: from address 0 at the bottom to max at the top,
the layout consists of the text section, the data section, the heap (growing
upward), and the stack (growing downward).
Two common techniques for loading executable files
are double-clicking an icon representing the executable file and entering the
name of the executable file on the command line (as in prog.exe or a.out).
Although two processes may be associated with the same program, they
are nevertheless considered two separate execution sequences. For instance,
several users may be running different copies of the mail program, or the same
user may invoke many copies of the web browser program. Each of these is a
separate process; and although the text sections are equivalent, the data, heap,
and stack sections vary. It is also common to have a process that spawns many
processes as it runs. We discuss such matters in Section 3.4.
Note that a process itself can be an execution environment for other
code. The Java programming environment provides a good example. In most
circumstances, an executable Java program is executed within the Java virtual
machine (JVM). The JVM executes as a process that interprets the loaded Java
code and takes actions (via native machine instructions) on behalf of that code.
For example, to run the compiled Java program Program.class, we would
enter
java Program
The command java runs the JVM as an ordinary process, which in turn
executes the Java program Program in the virtual machine. The concept is the
same as simulation, except that the code, instead of being written for a different
instruction set, is written in the Java language.
3.1.2 Process State
As a process executes, it changes state. The state of a process is defined in part
by the current activity of that process. A process may be in one of the following
states:
• New. The process is being created.
• Running. Instructions are being executed.
• Waiting. The process is waiting for some event to occur (such as an I/O
completion or reception of a signal).
• Ready. The process is waiting to be assigned to a processor.
• Terminated. The process has finished execution.
These names are arbitrary, and they vary across operating systems. The states
that they represent are found on all systems, however. Certain operating
systems also more finely delineate process states. It is important to realize
that only one process can be running on any processor at any instant. Many
processes may be ready and waiting, however. The state diagram corresponding
to these states is presented in Figure 3.2.
Figure 3.2 Diagram of process state: a new process is admitted to the ready
state; the scheduler dispatches it to running; an interrupt returns it to
ready; an I/O or event wait moves it to waiting, and I/O or event completion
moves it back to ready; exit moves it to terminated.
3.1.3 Process Control Block
Each process is represented in the operating system by a process control block
(PCB)—also called a task control block. A PCB is shown in Figure 3.3. It contains
many pieces of information associated with a specific process, including these:
• Process state. The state may be new, ready, running, waiting, halted, and
so on.
• Program counter. The counter indicates the address of the next instruction
to be executed for this process.
• CPU registers. The registers vary in number and type, depending on
the computer architecture. They include accumulators, index registers,
stack pointers, and general-purpose registers, plus any condition-code
information. Along with the program counter, this state information must
be saved when an interrupt occurs, to allow the process to be continued
correctly afterward (Figure 3.4).
• CPU-scheduling information. This information includes a process priority,
pointers to scheduling queues, and any other scheduling parameters.
(Chapter 6 describes process scheduling.)
• Memory-management information. This information may include such
items as the value of the base and limit registers and the page tables, or the
segment tables, depending on the memory system used by the operating
system (Chapter 8).
Figure 3.3 Process control block (PCB): fields include the process state,
process number, program counter, registers, memory limits, and the list of
open files.
Figure 3.4 Diagram showing CPU switch from process to process: on an
interrupt or system call, the operating system saves the state of the
executing process into its PCB and reloads the saved state of the next
process from its PCB; each process is idle while the other executes.
• Accounting information. This information includes the amount of CPU
and real time used, time limits, account numbers, job or process numbers,
and so on.
• I/O status information. This information includes the list of I/O devices
allocated to the process, a list of open files, and so on.
In brief, the PCB simply serves as the repository for any information that may
vary from process to process.
3.1.4 Threads
The process model discussed so far has implied that a process is a program that
performs a single thread of execution. For example, when a process is running
a word-processor program, a single thread of instructions is being executed.
This single thread of control allows the process to perform only one task at
a time. The user cannot simultaneously type in characters and run the spell
checker within the same process, for example. Most modern operating systems
have extended the process concept to allow a process to have multiple threads
of execution and thus to perform more than one task at a time. This feature
is especially beneficial on multicore systems, where multiple threads can run
in parallel. On a system that supports threads, the PCB is expanded to include
information for each thread. Other changes throughout the system are also
needed to support threads. Chapter 4 explores threads in detail.
PROCESS REPRESENTATION IN LINUX
The process control block in the Linux operating system is represented by
the C structure task_struct, which is found in the <linux/sched.h>
include file in the kernel source-code directory. This structure contains all the
necessary information for representing a process, including the state of the
process, scheduling and memory-management information, list of open files,
and pointers to the process’s parent and a list of its children and siblings. (A
process’s parent is the process that created it; its children are any processes
that it creates. Its siblings are children with the same parent process.) Some
of these fields include:
long state; /* state of the process */
struct sched_entity se; /* scheduling information */
struct task_struct *parent; /* this process's parent */
struct list_head children; /* this process's children */
struct files_struct *files; /* list of open files */
struct mm_struct *mm; /* address space of this process */
For example, the state of a process is represented by the field long state
in this structure. Within the Linux kernel, all active processes are represented
using a doubly linked list of task_struct. The kernel maintains a pointer—
current—to the process currently executing on the system, as shown below:
(The figure shows the kernel's doubly linked list of task_struct entries,
each holding its process information; the current pointer designates the
task_struct of the currently executing process.)
As an illustration of how the kernel might manipulate one of the fields in
the task_struct for a specified process, let's assume the system would like
to change the state of the process currently running to the value new_state.
If current is a pointer to the process currently executing, its state is changed
with the following:
current->state = new_state;
3.2 Process Scheduling
The objective of multiprogramming is to have some process running at all
times, to maximize CPU utilization. The objective of time sharing is to switch the
CPU among processes so frequently that users can interact with each program
while it is running.
Figure 3.5 The ready queue and various I/O device queues: each queue (the
ready queue, disk unit 0, terminal unit 0, mag tape unit 0, and mag tape
unit 1) has a queue header with head and tail pointers to a linked list of
PCBs.
To meet these objectives, the process scheduler selects
an available process (possibly from a set of several available processes) for
program execution on the CPU. For a single-processor system, there will never
be more than one running process. If there are more processes, the rest will
have to wait until the CPU is free and can be rescheduled.
3.2.1 Scheduling Queues
As processes enter the system, they are put into a job queue, which consists
of all processes in the system. The processes that are residing in main memory
and are ready and waiting to execute are kept on a list called the ready queue.
This queue is generally stored as a linked list. A ready-queue header contains
pointers to the first and final PCBs in the list. Each PCB includes a pointer field
that points to the next PCB in the ready queue.
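As an illustrative sketch (the structure and field names here are invented
and do not reflect any particular kernel), such a ready queue might be
represented as follows:

#include <stdlib.h>

struct pcb {
    int pid;                 /* process identifier */
    int state;               /* new, ready, running, waiting, terminated */
    struct pcb *next;        /* next PCB in the ready queue */
};

struct ready_queue {
    struct pcb *head;        /* first PCB in the list */
    struct pcb *tail;        /* last PCB in the list */
};

/* Append a PCB at the tail of the ready queue. */
void enqueue(struct ready_queue *q, struct pcb *p)
{
    p->next = NULL;
    if (q->tail)
        q->tail->next = p;
    else
        q->head = p;
    q->tail = p;
}

/* Remove and return the PCB at the head (the next process to dispatch). */
struct pcb *dequeue(struct ready_queue *q)
{
    struct pcb *p = q->head;
    if (p) {
        q->head = p->next;
        if (!q->head)
            q->tail = NULL;
    }
    return p;
}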
The system also includes other queues. When a process is allocated the
CPU, it executes for a while and eventually quits, is interrupted, or waits for
the occurrence of a particular event, such as the completion of an I/O request.
Suppose the process makes an I/O request to a shared device, such as a disk.
Since there are many processes in the system, the disk may be busy with the
I/O request of some other process. The process therefore may have to wait for
the disk. The list of processes waiting for a particular I/O device is called a
device queue. Each device has its own device queue (Figure 3.5).
Figure 3.6 Queueing-diagram representation of process scheduling: processes
in the ready queue are dispatched to the CPU; a running process may issue an
I/O request (joining an I/O queue and then the I/O device), have its time
slice expire, fork a child and wait for it to execute, or wait for an
interrupt to occur, in each case eventually returning to the ready queue.
A common representation of process scheduling is a queueing diagram,
such as that in Figure 3.6. Each rectangular box represents a queue. Two types
of queues are present: the ready queue and a set of device queues. The circles
represent the resources that serve the queues, and the arrows indicate the flow
of processes in the system.
A new process is initially put in the ready queue. It waits there until it is
selected for execution, or dispatched. Once the process is allocated the CPU
and is executing, one of several events could occur:
• The process could issue an I/O request and then be placed in an I/O queue.
• The process could create a new child process and wait for the child’s
termination.
• The process could be removed forcibly from the CPU, as a result of an
interrupt, and be put back in the ready queue.
In the first two cases, the process eventually switches from the waiting state
to the ready state and is then put back in the ready queue. A process continues
this cycle until it terminates, at which time it is removed from all queues and
has its PCB and resources deallocated.
3.2.2 Schedulers
A process migrates among the various scheduling queues throughout its
lifetime. The operating system must select, for scheduling purposes, processes
from these queues in some fashion. The selection process is carried out by the
appropriate scheduler.
Often, in a batch system, more processes are submitted than can be executed
immediately. These processes are spooled to a mass-storage device (typically a
disk), where they are kept for later execution. The long-term scheduler, or job
scheduler, selects processes from this pool and loads them into memory for
execution. The short-term scheduler, or CPU scheduler, selects from among
the processes that are ready to execute and allocates the CPU to one of them.
The primary distinction between these two schedulers lies in frequency
of execution. The short-term scheduler must select a new process for the CPU
frequently. A process may execute for only a few milliseconds before waiting
for an I/O request. Often, the short-term scheduler executes at least once every
100 milliseconds. Because of the short time between executions, the short-term
scheduler must be fast. If it takes 10 milliseconds to decide to execute a process
for 100 milliseconds, then 10/(100 + 10) = 9 percent of the CPU is being used
(wasted) simply for scheduling the work.
The long-term scheduler executes much less frequently; minutes may sep-
arate the creation of one new process and the next. The long-term scheduler
controls the degree of multiprogramming (the number of processes in mem-
ory). If the degree of multiprogramming is stable, then the average rate of
process creation must be equal to the average departure rate of processes
leaving the system. Thus, the long-term scheduler may need to be invoked
only when a process leaves the system. Because of the longer interval between
executions, the long-term scheduler can afford to take more time to decide
which process should be selected for execution.
It is important that the long-term scheduler make a careful selection. In
general, most processes can be described as either I/O bound or CPU bound.
An I/O-bound process is one that spends more of its time doing I/O than
it spends doing computations. A CPU-bound process, in contrast, generates
I/O requests infrequently, using more of its time doing computations. It is
important that the long-term scheduler select a good process mix of I/O-bound
and CPU-bound processes. If all processes are I/O bound, the ready queue will
almost always be empty, and the short-term scheduler will have little to do.
If all processes are CPU bound, the I/O waiting queue will almost always be
empty, devices will go unused, and again the system will be unbalanced. The
system with the best performance will thus have a combination of CPU-bound
and I/O-bound processes.
On some systems, the long-term scheduler may be absent or minimal.
For example, time-sharing systems such as UNIX and Microsoft Windows
systems often have no long-term scheduler but simply put every new process in
memory for the short-term scheduler. The stability of these systems depends
either on a physical limitation (such as the number of available terminals)
or on the self-adjusting nature of human users. If performance declines to
unacceptable levels on a multiuser system, some users will simply quit.
Some operating systems, such as time-sharing systems, may introduce an
additional, intermediate level of scheduling. This medium-term scheduler is
diagrammed in Figure 3.7. The key idea behind a medium-term scheduler is
that sometimes it can be advantageous to remove a process from memory
(and from active contention for the CPU) and thus reduce the degree of
multiprogramming. Later, the process can be reintroduced into memory, and its
execution can be continued where it left off. This scheme is called swapping.
The process is swapped out, and is later swapped in, by the medium-term
scheduler. Swapping may be necessary to improve the process mix or because
a change in memory requirements has overcommitted available memory,
requiring memory to be freed up. Swapping is discussed in Chapter 8.
Figure 3.7 Addition of medium-term scheduling to the queueing diagram:
partially executed, swapped-out processes are swapped in to the ready queue,
run on the CPU, and may be swapped out again, wait in the I/O waiting queues,
or end.
3.2.3 Context Switch
As mentioned in Section 1.2.1, interrupts cause the operating system to change
a CPU from its current task and to run a kernel routine. Such operations happen
frequently on general-purpose systems. When an interrupt occurs, the system
needs to save the current context of the process running on the CPU so that
it can restore that context when its processing is done, essentially suspending
the process and then resuming it. The context is represented in the PCB of the
process. It includes the value of the CPU registers, the process state (see Figure
3.2), and memory-management information. Generically, we perform a state
save of the current state of the CPU, be it in kernel or user mode, and then a
state restore to resume operations.
Switching the CPU to another process requires performing a state save of
the current process and a state restore of a different process. This task is known
as a context switch. When a context switch occurs, the kernel saves the context
of the old process in its PCB and loads the saved context of the new process
scheduled to run. Context-switch time is pure overhead, because the system
does no useful work while switching. Switching speed varies from machine to
machine, depending on the memory speed, the number of registers that must
be copied, and the existence of special instructions (such as a single instruction
to load or store all registers). A typical speed is a few milliseconds.
Context-switch times are highly dependent on hardware support. For
instance, some processors (such as the Sun UltraSPARC) provide multiple sets
of registers. A context switch here simply requires changing the pointer to the
current register set. Of course, if there are more active processes than there are
register sets, the system resorts to copying register data to and from memory,
as before. Also, the more complex the operating system, the greater the amount
of work that must be done during a context switch. As we will see in Chapter
8, advanced memory-management techniques may require that extra data be
switched with each context. For instance, the address space of the current
process must be preserved as the space of the next task is prepared for use.
How the address space is preserved, and what amount of work is needed
to preserve it, depend on the memory-management method of the operating
system.
MULTITASKING IN MOBILE SYSTEMS
Because of the constraints imposed on mobile devices, early versions of iOS
did not provide user-application multitasking; only one application runs in
the foreground and all other user applications are suspended. Operating-
system tasks were multitasked because they were written by Apple and well
behaved. However, beginning with iOS 4, Apple now provides a limited
form of multitasking for user applications, thus allowing a single foreground
application to run concurrently with multiple background applications. (On
a mobile device, the foreground application is the application currently
open and appearing on the display. The background application remains
in memory, but does not occupy the display screen.) The iOS 4 programming
API provides support for multitasking, thus allowing a process to run in
the background without being suspended. However, it is limited and only
available for a limited number of application types, including applications
• running a single, finite-length task (such as completing a download of
content from a network);
• receiving notifications of an event occurring (such as a new email
message);
• with long-running background tasks (such as an audio player).
Apple probably limits multitasking due to battery life and memory use
concerns. The CPU certainly has the features to support multitasking, but
Apple chooses to not take advantage of some of them in order to better
manage resource use.
Android does not place such constraints on the types of applications that
can run in the background. If an application requires processing while in
the background, the application must use a service, a separate application
component that runs on behalf of the background process. Consider a
streaming audio application: if the application moves to the background, the
service continues to send audio files to the audio device driver on behalf of
the background application. In fact, the service will continue to run even if the
background application is suspended. Services do not have a user interface
and have a small memory footprint, thus providing an efficient technique for
multitasking in a mobile environment.
3.3 Operations on Processes
The processes in most systems can execute concurrently, and they may
be created and deleted dynamically. Thus, these systems must provide a
mechanism for process creation and termination. In this section, we explore
the mechanisms involved in creating processes and illustrate process creation
on UNIX and Windows systems.
3.3.1 Process Creation
During the course of execution, a process may create several new processes. As
mentioned earlier, the creating process is called a parent process, and the new
processes are called the children of that process. Each of these new processes
may in turn create other processes, forming a tree of processes.
Most operating systems (including UNIX, Linux, and Windows) identify
processes according to a unique process identifier (or pid), which is typically
an integer number. The pid provides a unique value for each process in the
system, and it can be used as an index to access various attributes of a process
within the kernel.
Figure 3.8 illustrates a typical process tree for the Linux operating system,
showing the name of each process and its pid. (We use the term process rather
loosely, as Linux prefers the term task instead.) The init process (which always
has a pid of 1) serves as the root parent process for all user processes. Once the
system has booted, the init process can also create various user processes, such
as a web or print server, an ssh server, and the like. In Figure 3.8, we see two
children of init—kthreadd and sshd. The kthreadd process is responsible
for creating additional processes that perform tasks on behalf of the kernel
(in this situation, khelper and pdflush). The sshd process is responsible for
managing clients that connect to the system by using ssh (which is short for
secure shell). The login process is responsible for managing clients that directly
log onto the system. In this example, a client has logged on and is using the
bash shell, which has been assigned pid 8416. Using the bash command-line
interface, this user has created the process ps as well as the emacs editor.
On UNIX and Linux systems, we can obtain a listing of processes by using
the ps command. For example, the command
ps -el
will list complete information for all processes currently active in the system.
It is easy to construct a process tree similar to the one shown in Figure 3.8 by
recursively tracing parent processes all the way to the init process.
Figure 3.8 A tree of processes on a typical Linux system. The root is init
(pid 1); among its children are kthreadd (pid 2) and sshd (pid 3028).
kthreadd's children are khelper (pid 6) and pdflush (pid 200); also shown are
a second sshd (pid 3610) and tcsch (pid 4005). The login process (pid 8415)
has the child bash (pid 8416), whose children are ps (pid 9298) and emacs
(pid 9204).
In general, when a process creates a child process, that child process will
need certain resources (CPU time, memory, files, I/O devices) to accomplish
its task. A child process may be able to obtain its resources directly from
the operating system, or it may be constrained to a subset of the resources
of the parent process. The parent may have to partition its resources among
its children, or it may be able to share some resources (such as memory or
files) among several of its children. Restricting a child process to a subset of
the parent’s resources prevents any process from overloading the system by
creating too many child processes.
In addition to supplying various physical and logical resources, the parent
process may pass along initialization data (input) to the child process. For
example, consider a process whose function is to display the contents of a file
—say, image.jpg—on the screen of a terminal. When the process is created,
it will get, as an input from its parent process, the name of the file image.jpg.
Using that file name, it will open the file and write the contents out. It may
also get the name of the output device. Alternatively, some operating systems
pass resources to child processes. On such a system, the new process may get
two open files, image.jpg and the terminal device, and may simply transfer
the datum between the two.
When a process creates a new process, two possibilities for execution exist:
1. The parent continues to execute concurrently with its children.
2. The parent waits until some or all of its children have terminated.
There are also two address-space possibilities for the new process:
1. The child process is a duplicate of the parent process (it has the same
program and data as the parent).
2. The child process has a new program loaded into it.
To illustrate these differences, let’s first consider the UNIX operating system.
In UNIX, as we’ve seen, each process is identified by its process identifier,
which is a unique integer. A new process is created by the fork() system
call. The new process consists of a copy of the address space of the original
process. This mechanism allows the parent process to communicate easily with
its child process. Both processes (the parent and the child) continue execution
at the instruction after the fork(), with one difference: the return code for
the fork() is zero for the new (child) process, whereas the (nonzero) process
identifier of the child is returned to the parent.
After a fork() system call, one of the two processes typically uses the
exec() system call to replace the process’s memory space with a new program.
The exec() system call loads a binary file into memory (destroying the
memory image of the program containing the exec() system call) and starts
its execution. In this manner, the two processes are able to communicate and
then go their separate ways. The parent can then create more children; or, if it
has nothing else to do while the child runs, it can issue a wait() system call to
move itself off the ready queue until the termination of the child.
#include <sys/types.h>
#include <sys/wait.h>
#include <stdio.h>
#include <unistd.h>
int main()
{
pid_t pid;
/* fork a child process */
pid = fork();
if (pid < 0) { /* error occurred */
fprintf(stderr, "Fork Failed");
return 1;
}
else if (pid == 0) { /* child process */
execlp("/bin/ls","ls",NULL);
}
else { /* parent process */
/* parent will wait for the child to complete */
wait(NULL);
printf("Child Complete");
}
return 0;
}
Figure 3.9 Creating a separate process using the UNIX fork() system call.
Because the call to exec() overlays the process's address space with a new program, the
call to exec() does not return control unless an error occurs.
The C program shown in Figure 3.9 illustrates the UNIX system calls
previously described. We now have two different processes running copies
of the same program. The only difference is that the value of pid (the process
identifier) for the child process is zero, while that for the parent is an integer
value greater than zero (in fact, it is the actual pid of the child process). The
child process inherits privileges and scheduling attributes from the parent,
as well as certain resources, such as open files. The child process then overlays
its address space with the UNIX command /bin/ls (used to get a directory
listing) using the execlp() system call (execlp() is a version of the exec()
system call). The parent waits for the child process to complete with the wait()
system call. When the child process completes (by either implicitly or explicitly
invoking exit()), the parent process resumes from the call to wait(), where it
completes using the exit() system call. This is also illustrated in Figure 3.10.
Of course, there is nothing to prevent the child from not invoking exec()
and instead continuing to execute as a copy of the parent process. In this
scenario, the parent and child are concurrent processes running the same code
instructions. Because the child is a copy of the parent, each process has its own
copy of any data.
Figure 3.10 Process creation using the fork() system call: the parent calls
fork(); the child (pid = 0) typically calls exec() and later exit(), while
the parent (pid > 0) calls wait() and resumes when the child terminates.
As an alternative example, we next consider process creation in Windows.
Processes are created in the Windows API using the CreateProcess() func-
tion, which is similar to fork() in that a parent creates a new child process.
However, whereas fork() has the child process inheriting the address space
of its parent, CreateProcess() requires loading a specified program into the
address space of the child process at process creation. Furthermore, whereas
fork() is passed no parameters, CreateProcess() expects no fewer than ten
parameters.
The C program shown in Figure 3.11 illustrates the CreateProcess()
function, which creates a child process that loads the application mspaint.exe.
We opt for many of the default values of the ten parameters passed to
CreateProcess(). Readers interested in pursuing the details of process
creation and management in the Windows API are encouraged to consult the
bibliographical notes at the end of this chapter.
The two parameters passed to the CreateProcess() function are instances
of the STARTUPINFO and PROCESS_INFORMATION structures. STARTUPINFO
specifies many properties of the new process, such as window size and
appearance and handles to standard input and output files. The
PROCESS_INFORMATION structure contains a handle and the identifiers to the
newly created process and its thread. We invoke the ZeroMemory() func-
tion to allocate memory for each of these structures before proceeding with
CreateProcess().
The first two parameters passed to CreateProcess() are the application
name and command-line parameters. If the application name is NULL (as it is
in this case), the command-line parameter specifies the application to load. In
this instance, we are loading the Microsoft Windows mspaint.exe application.
Beyond these two initial parameters, we use the default parameters for
inheriting process and thread handles as well as specifying that there will be no
creation flags. We also use the parent’s existing environment block and starting
directory. Last, we provide two pointers to the STARTUPINFO and
PROCESS_INFORMATION structures created at the beginning of the program. In Figure
3.9, the parent process waits for the child to complete by invoking the wait()
system call. The equivalent of this in Windows is WaitForSingleObject(),
which is passed a handle of the child process—pi.hProcess—and waits for
this process to complete. Once the child process exits, control returns from the
WaitForSingleObject() function in the parent process.
#include <stdio.h>
#include <windows.h>
int main(VOID)
{
STARTUPINFO si;
PROCESS_INFORMATION pi;
/* allocate memory */
ZeroMemory(&si, sizeof(si));
si.cb = sizeof(si);
ZeroMemory(&pi, sizeof(pi));
/* create child process */
if (!CreateProcess(NULL, /* use command line */
"C:WINDOWSsystem32mspaint.exe", /* command */
NULL, /* don’t inherit process handle */
NULL, /* don’t inherit thread handle */
FALSE, /* disable handle inheritance */
0, /* no creation flags */
NULL, /* use parent’s environment block */
NULL, /* use parent’s existing directory */
&si,
&pi))
{
fprintf(stderr, "Create Process Failed");
return -1;
}
/* parent will wait for the child to complete */
WaitForSingleObject(pi.hProcess, INFINITE);
printf("Child Complete");
/* close handles */
CloseHandle(pi.hProcess);
CloseHandle(pi.hThread);
}
Figure 3.11 Creating a separate process using the Windows API.
3.3.2 Process Termination
A process terminates when it finishes executing its final statement and asks the
operating system to delete it by using the exit() system call. At that point, the
process may return a status value (typically an integer) to its parent process
(via the wait() system call). All the resources of the process—including
physical and virtual memory, open files, and I/O buffers—are deallocated
by the operating system.
Termination can occur in other circumstances as well. A process can cause
the termination of another process via an appropriate system call (for example,
TerminateProcess() in Windows). Usually, such a system call can be invoked
only by the parent of the process that is to be terminated. Otherwise, users could
arbitrarily kill each other’s jobs. Note that a parent needs to know the identities
of its children if it is to terminate them. Thus, when one process creates a new
process, the identity of the newly created process is passed to the parent.
A parent may terminate the execution of one of its children for a variety of
reasons, such as these:
• The child has exceeded its usage of some of the resources that it has been
allocated. (To determine whether this has occurred, the parent must have
a mechanism to inspect the state of its children.)
• The task assigned to the child is no longer required.
• The parent is exiting, and the operating system does not allow a child to
continue if its parent terminates.
Some systems do not allow a child to exist if its parent has terminated. In
such systems, if a process terminates (either normally or abnormally), then
all its children must also be terminated. This phenomenon, referred to as
cascading termination, is normally initiated by the operating system.
To illustrate process execution and termination, consider that, in Linux
and UNIX systems, we can terminate a process by using the exit() system
call, providing an exit status as a parameter:
/* exit with status 1 */
exit(1);
In fact, under normal termination, exit() may be called either directly (as
shown above) or indirectly (by a return statement in main()).
A parent process may wait for the termination of a child process by using
the wait() system call. The wait() system call is passed a parameter that
allows the parent to obtain the exit status of the child. This system call also
returns the process identifier of the terminated child so that the parent can tell
which of its children has terminated:
pid_t pid;
int status;
pid = wait(&status);
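Building on this call, a small sketch using the POSIX macros WIFEXITED() and
WEXITSTATUS() to interpret the status value:

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int status;
    pid_t pid = fork();

    if (pid == 0)
        exit(1);                        /* child terminates with status 1 */

    pid = wait(&status);                /* pid of the terminated child */

    if (WIFEXITED(status))              /* normal termination? */
        printf("child %d exited with status %d\n",
               (int)pid, WEXITSTATUS(status));
    return 0;
}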
When a process terminates, its resources are deallocated by the operating
system. However, its entry in the process table must remain there until the
parent calls wait(), because the process table contains the process’s exit status.
A process that has terminated, but whose parent has not yet called wait(), is
known as a zombie process. All processes transition to this state when they
terminate, but generally they exist as zombies only briefly. Once the parent
calls wait(), the process identifier of the zombie process and its entry in the
process table are released.
Now consider what would happen if a parent did not invoke wait() and
instead terminated, thereby leaving its child processes as orphans. Linux and
UNIX address this scenario by assigning the init process as the new parent to
orphan processes. (Recall from Figure 3.8 that the init process is the root of the
process hierarchy in UNIX and Linux systems.) The init process periodically
invokes wait(), thereby allowing the exit status of any orphaned process to be
collected and releasing the orphan’s process identifier and process-table entry.
3.4 Interprocess Communication
Processes executing concurrently in the operating system may be either
independent processes or cooperating processes. A process is independent
if it cannot affect or be affected by the other processes executing in the system.
Any process that does not share data with any other process is independent. A
process is cooperating if it can affect or be affected by the other processes
executing in the system. Clearly, any process that shares data with other
processes is a cooperating process.
There are several reasons for providing an environment that allows process
cooperation:
• Information sharing. Since several users may be interested in the same
piece of information (for instance, a shared file), we must provide an
environment to allow concurrent access to such information.
• Computation speedup. If we want a particular task to run faster, we must
break it into subtasks, each of which will be executing in parallel with the
others. Notice that such a speedup can be achieved only if the computer
has multiple processing cores.
• Modularity. We may want to construct the system in a modular fashion,
dividing the system functions into separate processes or threads, as we
discussed in Chapter 2.
• Convenience. Even an individual user may work on many tasks at the
same time. For instance, a user may be editing, listening to music, and
compiling in parallel.
Cooperating processes require an interprocess communication (IPC) mech-
anism that will allow them to exchange data and information. There are two
fundamental models of interprocess communication: shared memory and mes-
sage passing. In the shared-memory model, a region of memory that is shared
by cooperating processes is established. Processes can then exchange informa-
tion by reading and writing data to the shared region. In the message-passing
model, communication takes place by means of messages exchanged between
the cooperating processes. The two communications models are contrasted in
Figure 3.12.
Both of the models just mentioned are common in operating systems,
and many systems implement both. Message passing is useful for exchanging
smaller amounts of data, because no conflicts need be avoided. Message
passing is also easier to implement in a distributed system than shared memory.
(Although there are systems that provide distributed shared memory, we do not
consider them in this text.) Shared memory can be faster than message passing,
since message-passing systems are typically implemented using system calls
MULTIPROCESS ARCHITECTURE—CHROME BROWSER
Many websites contain active content such as JavaScript, Flash, and HTML5 to
provide a rich and dynamic web-browsing experience. Unfortunately, these
web applications may also contain software bugs, which can result in sluggish
response times and can even cause the web browser to crash. This isn’t a big
problem in a web browser that displays content from only one website. But
most contemporary web browsers provide tabbed browsing, which allows a
single instance of a web browser application to open several websites at the
same time, with each site in a separate tab. To switch between the different
sites, a user need only click on the appropriate tab.
A problem with this approach is that if a web application in any tab crashes,
the entire process—including all other tabs displaying additional websites
—crashes as well.
Google’s Chrome web browser was designed to address this issue by
using a multiprocess architecture. Chrome identifies three different types of
processes: browser, renderers, and plug-ins.
• The browser process is responsible for managing the user interface as
well as disk and network I/O. A new browser process is created when
Chrome is started. Only one browser process is created.
• Renderer processes contain logic for rendering web pages. Thus, they
contain the logic for handling HTML, JavaScript, images, and so forth. As
a general rule, a new renderer process is created for each website opened
in a new tab, and so several renderer processes may be active at the same
time.
• A plug-in process is created for each type of plug-in (such as Flash or
QuickTime) in use. Plug-in processes contain the code for the plug-in as
well as additional code that enables the plug-in to communicate with
associated renderer processes and the browser process.
The advantage of the multiprocess approach is that websites run in
isolation from one another. If one website crashes, only its renderer process
is affected; all other processes remain unharmed. Furthermore, renderer
processes run in a sandbox, which means that access to disk and network
I/O is restricted, minimizing the effects of any security exploits.
and thus require the more time-consuming task of kernel intervention. In
shared-memory systems, system calls are required only to establish shared-
memory regions. Once shared memory is established, all accesses are treated
as routine memory accesses, and no assistance from the kernel is required.
Figure 3.12 Communications models. (a) Message passing. (b) Shared memory.
Recent research on systems with several processing cores indicates that
message passing provides better performance than shared memory on such
systems. Shared memory suffers from cache coherency issues, which arise
because shared data migrate among the several caches. As the number of
processing cores on systems increases, it is possible that we will see message
passing as the preferred mechanism for IPC.
In the remainder of this section, we explore shared-memory and message-
passing systems in more detail.
3.4.1 Shared-Memory Systems
Interprocess communication using shared memory requires communicating
processes to establish a region of shared memory. Typically, a shared-memory
region resides in the address space of the process creating the shared-memory
segment. Other processes that wish to communicate using this shared-memory
segment must attach it to their address space. Recall that, normally, the
operating system tries to prevent one process from accessing another process’s
memory. Shared memory requires that two or more processes agree to remove
this restriction. They can then exchange information by reading and writing
data in the shared areas. The form of the data and the location are determined by
these processes and are not under the operating system’s control. The processes
are also responsible for ensuring that they are not writing to the same location
simultaneously.
To illustrate the concept of cooperating processes, let’s consider the
producer–consumer problem, which is a common paradigm for cooperating
processes. A producer process produces information that is consumed by a
consumer process. For example, a compiler may produce assembly code that
is consumed by an assembler. The assembler, in turn, may produce object
modules that are consumed by the loader. The producer–consumer problem
item next_produced;

while (true) {
    /* produce an item in next_produced */

    while (((in + 1) % BUFFER_SIZE) == out)
        ; /* do nothing */

    buffer[in] = next_produced;
    in = (in + 1) % BUFFER_SIZE;
}
Figure 3.13 The producer process using shared memory.
also provides a useful metaphor for the client–server paradigm. We generally
think of a server as a producer and a client as a consumer. For example, a web
server produces (that is, provides) HTML files and images, which are consumed
(that is, read) by the client web browser requesting the resource.
One solution to the producer–consumer problem uses shared memory. To
allow producer and consumer processes to run concurrently, we must have
available a buffer of items that can be filled by the producer and emptied by
the consumer. This buffer will reside in a region of memory that is shared by
the producer and consumer processes. A producer can produce one item while
the consumer is consuming another item. The producer and consumer must
be synchronized, so that the consumer does not try to consume an item that
has not yet been produced.
Two types of buffers can be used. The unbounded buffer places no practical
limit on the size of the buffer. The consumer may have to wait for new items,
but the producer can always produce new items. The bounded buffer assumes
a fixed buffer size. In this case, the consumer must wait if the buffer is empty,
and the producer must wait if the buffer is full.
Let’s look more closely at how the bounded buffer illustrates interprocess
communication using shared memory. The following variables reside in a
region of memory shared by the producer and consumer processes:
#define BUFFER_SIZE 10

typedef struct {
    . . .
} item;

item buffer[BUFFER_SIZE];
int in = 0;
int out = 0;
The shared buffer is implemented as a circular array with two logical pointers:
in and out. The variable in points to the next free position in the buffer; out
points to the first full position in the buffer. The buffer is empty when in ==
out; the buffer is full when ((in + 1) % BUFFER_SIZE) == out.
The code for the producer process is shown in Figure 3.13, and the code
for the consumer process is shown in Figure 3.14. The producer process has a
item next_consumed;

while (true) {
    while (in == out)
        ; /* do nothing */

    next_consumed = buffer[out];
    out = (out + 1) % BUFFER_SIZE;

    /* consume the item in next_consumed */
}
Figure 3.14 The consumer process using shared memory.
local variable next_produced in which the new item to be produced is stored.
The consumer process has a local variable next_consumed in which the item
to be consumed is stored.
This scheme allows at most BUFFER_SIZE − 1 items in the buffer at the
same time. We leave it as an exercise for you to provide a solution in which
BUFFER_SIZE items can be in the buffer at the same time. In Section 3.5.1, we
illustrate the POSIX API for shared memory.
One issue this illustration does not address concerns the situation in which
both the producer process and the consumer process attempt to access the
shared buffer concurrently. In Chapter 5, we discuss how synchronization
among cooperating processes can be implemented effectively in a shared-
memory environment.
3.4.2 Message-Passing Systems
In Section 3.4.1, we showed how cooperating processes can communicate in a
shared-memory environment. The scheme requires that these processes share a
region of memory and that the code for accessing and manipulating the shared
memory be written explicitly by the application programmer. Another way to
achieve the same effect is for the operating system to provide the means for
cooperating processes to communicate with each other via a message-passing
facility.
Message passing provides a mechanism to allow processes to communicate
and to synchronize their actions without sharing the same address space. It is
particularly useful in a distributed environment, where the communicating
processes may reside on different computers connected by a network. For
example, an Internet chat program could be designed so that chat participants
communicate with one another by exchanging messages.
A message-passing facility provides at least two operations:
send(message)
receive(message)
Messages sent by a process can be either fixed or variable in size. If only
fixed-sized messages can be sent, the system-level implementation is straight-
forward. This restriction, however, makes the task of programming more
difficult. Conversely, variable-sized messages require a more complex system-
level implementation, but the programming task becomes simpler. This is a
common kind of tradeoff seen throughout operating-system design.
If processes P and Q want to communicate, they must send messages to and
receive messages from each other: a communication link must exist between
them. This link can be implemented in a variety of ways. We are concerned here
not with the link’s physical implementation (such as shared memory, hardware
bus, or network, which are covered in Chapter 17) but rather with its logical
implementation. Here are several methods for logically implementing a link
and the send()/receive() operations:
• Direct or indirect communication
• Synchronous or asynchronous communication
• Automatic or explicit buffering
We look at issues related to each of these features next.
3.4.2.1 Naming
Processes that want to communicate must have a way to refer to each other.
They can use either direct or indirect communication.
Under direct communication, each process that wants to communicate
must explicitly name the recipient or sender of the communication. In this
scheme, the send() and receive() primitives are defined as:
• send(P, message)—Send a message to process P.
• receive(Q, message)—Receive a message from process Q.
A communication link in this scheme has the following properties:
• A link is established automatically between every pair of processes that
want to communicate. The processes need to know only each other’s
identity to communicate.
• A link is associated with exactly two processes.
• Between each pair of processes, there exists exactly one link.
This scheme exhibits symmetry in addressing; that is, both the sender
process and the receiver process must name the other to communicate. A
variant of this scheme employs asymmetry in addressing. Here, only the sender
names the recipient; the recipient is not required to name the sender. In this
scheme, the send() and receive() primitives are defined as follows:
• send(P, message)—Send a message to process P.
• receive(id, message)—Receive a message from any process. The
variable id is set to the name of the process with which communication
has taken place.
The disadvantage in both of these schemes (symmetric and asymmetric)
is the limited modularity of the resulting process definitions. Changing the
identifier of a process may necessitate examining all other process definitions.
All references to the old identifier must be found, so that they can be modified
to the new identifier. In general, any such hard-coding techniques, where
identifiers must be explicitly stated, are less desirable than techniques involving
indirection, as described next.
With indirect communication, the messages are sent to and received from
mailboxes, or ports. A mailbox can be viewed abstractly as an object into which
messages can be placed by processes and from which messages can be removed.
Each mailbox has a unique identification. For example, POSIX message queues
use an integer value to identify a mailbox. A process can communicate with
another process via a number of different mailboxes, but two processes can
communicate only if they have a shared mailbox. The send() and receive()
primitives are defined as follows:
• send(A, message)—Send a message to mailbox A.
• receive(A, message)—Receive a message from mailbox A.
In this scheme, a communication link has the following properties:
• A link is established between a pair of processes only if both members of
the pair have a shared mailbox.
• A link may be associated with more than two processes.
• Between each pair of communicating processes, a number of different links
may exist, with each link corresponding to one mailbox.
Now suppose that processes P1, P2, and P3 all share mailbox A. Process
P1 sends a message to A, while both P2 and P3 execute a receive() from A.
Which process will receive the message sent by P1? The answer depends on
which of the following methods we choose:
• Allow a link to be associated with two processes at most.
• Allow at most one process at a time to execute a receive() operation.
• Allow the system to select arbitrarily which process will receive the
message (that is, either P2 or P3, but not both, will receive the message). The
system may define an algorithm for selecting which process will receive the
message (for example, round robin, where processes take turns receiving
messages). The system may identify the receiver to the sender.
A mailbox may be owned either by a process or by the operating system.
If the mailbox is owned by a process (that is, the mailbox is part of the address
space of the process), then we distinguish between the owner (which can
only receive messages through this mailbox) and the user (which can only
send messages to the mailbox). Since each mailbox has a unique owner, there
can be no confusion about which process should receive a message sent to
this mailbox. When a process that owns a mailbox terminates, the mailbox
disappears. Any process that subsequently sends a message to this mailbox
must be notified that the mailbox no longer exists.
In contrast, a mailbox that is owned by the operating system has an
existence of its own. It is independent and is not attached to any particular
process. The operating system then must provide a mechanism that allows a
process to do the following:
• Create a new mailbox.
• Send and receive messages through the mailbox.
• Delete a mailbox.
The process that creates a new mailbox is that mailbox’s owner by default.
Initially, the owner is the only process that can receive messages through this
mailbox. However, the ownership and receiving privilege may be passed to
other processes through appropriate system calls. Of course, this provision
could result in multiple receivers for each mailbox.
3.4.2.2 Synchronization
Communication between processes takes place through calls to send() and
receive() primitives. There are different design options for implementing
each primitive. Message passing may be either blocking or nonblocking—
also known as synchronous and asynchronous. (Throughout this text, you
will encounter the concepts of synchronous and asynchronous behavior in
relation to various operating-system algorithms.)
• Blocking send. The sending process is blocked until the message is
received by the receiving process or by the mailbox.
• Nonblocking send. The sending process sends the message and resumes
operation.
• Blocking receive. The receiver blocks until a message is available.
• Nonblocking receive. The receiver retrieves either a valid message or a
null.
Different combinations of send() and receive() are possible. When both
send() and receive() are blocking, we have a rendezvous between the
sender and the receiver. The solution to the producer–consumer problem
becomes trivial when we use blocking send() and receive() statements.
The producer merely invokes the blocking send() call and waits until the
message is delivered to either the receiver or the mailbox. Likewise, when the
consumer invokes receive(), it blocks until a message is available. This is
illustrated in Figures 3.15 and 3.16.
3.4.2.3 Buffering
Whether communication is direct or indirect, messages exchanged by commu-
nicating processes reside in a temporary queue. Basically, such queues can be
implemented in three ways:
message next_produced;

while (true) {
    /* produce an item in next_produced */

    send(next_produced);
}
Figure 3.15 The producer process using message passing.
• Zero capacity. The queue has a maximum length of zero; thus, the link
cannot have any messages waiting in it. In this case, the sender must block
until the recipient receives the message.
• Bounded capacity. The queue has finite length n; thus, at most n messages
can reside in it. If the queue is not full when a new message is sent, the
message is placed in the queue (either the message is copied or a pointer
to the message is kept), and the sender can continue execution without
waiting. The link’s capacity is finite, however. If the link is full, the sender
must block until space is available in the queue.
• Unbounded capacity. The queue’s length is potentially infinite; thus, any
number of messages can wait in it. The sender never blocks.
The zero-capacity case is sometimes referred to as a message system with no
buffering. The other cases are referred to as systems with automatic buffering.
3.5 Examples of IPC Systems
In this section, we explore three different IPC systems. We first cover the POSIX
API for shared memory and then discuss message passing in the Mach operating
system. We conclude with Windows, which interestingly uses shared memory
as a mechanism for providing certain types of message passing.
3.5.1 An Example: POSIX Shared Memory
Several IPC mechanisms are available for POSIX systems, including shared
memory and message passing. Here, we explore the POSIX API for shared
memory.
POSIX shared memory is organized using memory-mapped files, which
associate the region of shared memory with a file. A process must first create
message next_consumed;

while (true) {
    receive(next_consumed);

    /* consume the item in next_consumed */
}
Figure 3.16 The consumer process using message passing.
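Returning to the POSIX shared-memory discussion of Section 3.5.1: as a hedged sketch of how the calls being introduced fit together, a producer might create a shared-memory object with shm_open(), size it with ftruncate(), and map it with mmap() as shown below. The object name "/OS", its size, and the message written are illustrative assumptions, not taken from the text; a cooperating consumer would open and map the same name to read the data. (On some Linux systems this must be compiled with -lrt.)

#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    const char *name = "/OS";          /* name of the shared-memory object */
    const int SIZE = 4096;             /* size of the object, in bytes */

    /* create the shared-memory object and set its size */
    int fd = shm_open(name, O_CREAT | O_RDWR, 0666);
    ftruncate(fd, SIZE);

    /* map the object into this process's address space */
    char *ptr = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    /* write into the shared region */
    strcpy(ptr, "Hello from the producer");
    printf("wrote: %s\n", ptr);

    shm_unlink(name);                  /* remove the object when finished */
    return 0;
}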
Programming Projects
Part II—Creating a History Feature
The next task is to modify the shell interface program so that it provides
a history feature that allows the user to access the most recently entered
commands. The user will be able to access up to 10 commands by using the
feature. The commands will be consecutively numbered starting at 1, and
the numbering will continue past 10. For example, if the user has entered 35
commands, the 10 most recent commands will be numbered 26 to 35.
The user will be able to list the command history by entering the command
history
at the osh> prompt. As an example, assume that the history consists of the
commands (from most to least recent):
ps, ls -l, top, cal, who, date
The command history will output:
6 ps
5 ls -l
4 top
3 cal
2 who
1 date
Your program should support two techniques for retrieving commands
from the command history:
1. When the user enters !!, the most recent command in the history is
executed.
2. When the user enters a single ! followed by an integer N, the Nth
command in the history is executed.
Continuing our example from above, if the user enters !!, the ps command
will be performed; if the user enters !3, the command cal will be executed.
Any command executed in this fashion should be echoed on the user’s screen.
The command should also be placed in the history buffer as the next command.
The program should also manage basic error handling. If there are
no commands in the history, entering !! should result in a message “No
commands in history.” If there is no command corresponding to the number
entered with the single !, the program should output "No such command in
history."
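One possible (purely illustrative) way to organize the history buffer is a circular array of the ten most recent commands plus a running command count; the names add_history(), print_history(), MAX_HISTORY, and MAX_LINE below are hypothetical helpers, not part of the assignment.

#include <stdio.h>
#include <string.h>

#define MAX_HISTORY 10
#define MAX_LINE    80

char history[MAX_HISTORY][MAX_LINE];   /* the 10 most recent commands */
int  command_count = 0;                /* total commands entered so far */

void add_history(const char *cmd) {
    strncpy(history[command_count % MAX_HISTORY], cmd, MAX_LINE - 1);
    history[command_count % MAX_HISTORY][MAX_LINE - 1] = '\0';
    command_count++;
}

void print_history(void) {
    int first = (command_count > MAX_HISTORY) ? command_count - MAX_HISTORY : 0;
    for (int i = command_count - 1; i >= first; i--)   /* most recent first */
        printf("%d %s\n", i + 1, history[i % MAX_HISTORY]);
}

With this layout, command numbers continue past 10 automatically, and !N can be resolved by checking that N lies between first + 1 and command_count before indexing history[(N - 1) % MAX_HISTORY].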
Project 2—Linux Kernel Module for Listing Tasks
In this project, you will write a kernel module that lists all current tasks in a
Linux system. Be sure to review the programming project in Chapter 2, which
deals with creating Linux kernel modules, before you begin this project. The
project can be completed using the Linux virtual machine provided with this
text.
Part I—Iterating over Tasks Linearly
As illustrated in Section 3.1, the PCB in Linux is represented by the structure
task_struct, which is found in the <linux/sched.h> include file. In Linux,
the for_each_process() macro easily allows iteration over all current tasks
in the system:
#include <linux/sched.h>

struct task_struct *task;

for_each_process(task) {
    /* on each iteration, task points to the next task */
}
The various fields in task_struct can then be displayed as the program loops
through the for_each_process() macro.
Part I Assignment
Design a kernel module that iterates through all tasks in the system using the
for_each_process() macro. In particular, output the task name (known as
executable name), state, and process id of each task. (You will probably have
to read through the task_struct structure in <linux/sched.h> to obtain the
names of these fields.) Write this code in the module entry point so that its
contents will appear in the kernel log buffer, which can be viewed using the
dmesg command. To verify that your code is working correctly, compare the
contents of the kernel log buffer with the output of the following command,
which lists all tasks in the system:
ps -el
The two values should be very similar. Because tasks are dynamic, however, it
is possible that a few tasks may appear in one listing but not the other.
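A hedged sketch of how the Part I module might be assembled is shown below. The module and function names are placeholders; the field names comm, pid, and state follow the task_struct definition used with this text (on newer kernels the state field is renamed __state, and for_each_process() may also require <linux/sched/signal.h>).

#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/sched.h>

static int __init list_tasks_init(void)
{
    struct task_struct *task;

    /* print executable name, pid, and state to the kernel log buffer */
    for_each_process(task) {
        printk(KERN_INFO "%s [%d] state: %ld\n",
               task->comm, task->pid, task->state);
    }
    return 0;
}

static void __exit list_tasks_exit(void)
{
    printk(KERN_INFO "Removing list_tasks module\n");
}

module_init(list_tasks_init);
module_exit(list_tasks_exit);
MODULE_LICENSE("GPL");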
Part II—Iterating over Tasks with a Depth-First Search Tree
The second portion of this project involves iterating over all tasks in the system
using a depth-first search (DFS) tree. (As an example: the DFS iteration of the
processes in Figure 3.8 is 1, 8415, 8416, 9298, 9204, 2, 6, 200, 3028, 3610, 4005.)
Linux maintains its process tree as a series of lists. Examining the
task_struct in <linux/sched.h>, we see two struct list_head objects:
children
and
sibling
These objects are pointers to a list of the task’s children, as well as its sib-
lings. Linux also maintains references to the init task (struct task_struct
init_task). Using this information as well as macro operations on lists, we
can iterate over the children of init as follows:
struct task_struct *task;
struct list_head *list;

list_for_each(list, &init_task.children) {
    task = list_entry(list, struct task_struct, sibling);
    /* task points to the next child in the list */
}
The list_for_each() macro is passed two parameters, both of type struct
list_head:
• A pointer that serves as the iteration cursor and is set to each entry in turn
• A pointer to the head node of the list to be traversed
At each iteration of list_for_each(), the first parameter is set to the list
structure of the next child. We then use this value to obtain each structure in
the list using the list_entry() macro.
Part II Assignment
Beginning from the init task, design a kernel module that iterates over all tasks
in the system using a DFS tree. Just as in the first part of this project, output
the name, state, and pid of each task. Perform this iteration in the kernel entry
module so that its output appears in the kernel log buffer.
If you output all tasks in the system, you may see many more tasks than
appear with the ps -ael command. This is because some threads appear as
children but do not show up as ordinary processes. Therefore, to check the
output of the DFS tree, use the command
ps -eLf
This command lists all tasks—including threads—in the system. To verify
that you have indeed performed an appropriate DFS iteration, you will have to
examine the relationships among the various tasks output by the ps command.
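As a hedged sketch of Part II, the traversal can be written as a recursive helper that prints a task and then walks its children list with list_for_each() and list_entry(); the function name dfs() and the output format are placeholders, and the field names carry the same caveats as in Part I.

#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/list.h>
#include <linux/module.h>
#include <linux/sched.h>

static void dfs(struct task_struct *task)
{
    struct list_head *list;
    struct task_struct *child;

    printk(KERN_INFO "%s [%d] state: %ld\n",
           task->comm, task->pid, task->state);

    /* visit each child, then recurse into that child's own children */
    list_for_each(list, &task->children) {
        child = list_entry(list, struct task_struct, sibling);
        dfs(child);
    }
}

static int __init dfs_tasks_init(void)
{
    dfs(&init_task);    /* start the depth-first traversal at init_task */
    return 0;
}

static void __exit dfs_tasks_exit(void) { }

module_init(dfs_tasks_init);
module_exit(dfs_tasks_exit);
MODULE_LICENSE("GPL");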
Bibliographical Notes
Process creation, management, and IPC in UNIX and Windows systems,
respectively, are discussed in [Robbins and Robbins (2003)] and [Russinovich
and Solomon (2009)]. [Love (2010)] covers support for processes in the Linux
kernel, and [Hart (2005)] covers Windows systems programming in detail.
Coverage of the multiprocess model used in Google’s Chrome can be found at
http://blog.chromium.org/2008/09/multi-process-architecture.html.
Chapter 4 Threads
The process model introduced in Chapter 3 assumed that a process was
an executing program with a single thread of control. Virtually all modern
operating systems, however, provide features enabling a process to contain
multiple threads of control. In this chapter, we introduce many concepts
associated with multithreaded computer systems, including a discussion of
the APIs for the Pthreads, Windows, and Java thread libraries. We look at a
number of issues related to multithreaded programming and its effect on the
design of operating systems. Finally, we explore how the Windows and Linux
operating systems support threads at the kernel level.
CHAPTER OBJECTIVES
• To introduce the notion of a thread—a fundamental unit of CPU utilization
that forms the basis of multithreaded computer systems.
• To discuss the APIs for the Pthreads, Windows, and Java thread libraries.
• To explore several strategies that provide implicit threading.
• To examine issues related to multithreaded programming.
• To cover operating system support for threads in Windows and Linux.
4.1 Overview
A thread is a basic unit of CPU utilization; it comprises a thread ID, a program
counter, a register set, and a stack. It shares with other threads belonging
to the same process its code section, data section, and other operating-system
resources, such as open files and signals. A traditional (or heavyweight) process
has a single thread of control. If a process has multiple threads of control, it
can perform more than one task at a time. Figure 4.1 illustrates the difference
between a traditional single-threaded process and a multithreaded process.
4.1.1 Motivation
Most software applications that run on modern computers are multithreaded.
An application typically is implemented as a separate process with several
Figure 4.1 Single-threaded and multithreaded processes.
threads of control. A web browser might have one thread display images or
text while another thread retrieves data from the network, for example. A
word processor may have a thread for displaying graphics, another thread for
responding to keystrokes from the user, and a third thread for performing
spelling and grammar checking in the background. Applications can also
be designed to leverage processing capabilities on multicore systems. Such
applications can perform several CPU-intensive tasks in parallel across the
multiple computing cores.
In certain situations, a single application may be required to perform
several similar tasks. For example, a web server accepts client requests for
web pages, images, sound, and so forth. A busy web server may have several
(perhaps thousands of) clients concurrently accessing it. If the web server ran
as a traditional single-threaded process, it would be able to service only one
client at a time, and a client might have to wait a very long time for its request
to be serviced.
One solution is to have the server run as a single process that accepts
requests. When the server receives a request, it creates a separate process
to service that request. In fact, this process-creation method was in common
use before threads became popular. Process creation is time consuming and
resource intensive, however. If the new process will perform the same tasks as
the existing process, why incur all that overhead? It is generally more efficient
to use one process that contains multiple threads. If the web-server process is
multithreaded, the server will create a separate thread that listens for client
requests. When a request is made, rather than creating another process, the
server creates a new thread to service the request and resume listening for
additional requests. This is illustrated in Figure 4.2.
Threads also play a vital role in remote procedure call (RPC) systems. Recall
from Chapter 3 that RPCs allow interprocess communication by providing a
communication mechanism similar to ordinary function or procedure calls.
Typically, RPC servers are multithreaded. When a server receives a message, it
services the message using a separate thread. This allows the server to service
several concurrent requests.
Figure 4.2 Multithreaded server architecture.
Finally, most operating-system kernels are now multithreaded. Several
threads operate in the kernel, and each thread performs a specific task, such
as managing devices, managing memory, or interrupt handling. For example,
Solaris has a set of threads in the kernel specifically for interrupt handling;
Linux uses a kernel thread for managing the amount of free memory in the
system.
4.1.2 Benefits
The benefits of multithreaded programming can be broken down into four
major categories:
1. Responsiveness. Multithreading an interactive application may allow
a program to continue running even if part of it is blocked or is
performing a lengthy operation, thereby increasing responsiveness to
the user. This quality is especially useful in designing user interfaces. For
instance, consider what happens when a user clicks a button that results
in the performance of a time-consuming operation. A single-threaded
application would be unresponsive to the user until the operation had
completed. In contrast, if the time-consuming operation is performed in
a separate thread, the application remains responsive to the user.
2. Resource sharing. Processes can only share resources through techniques
such as shared memory and message passing. Such techniques must
be explicitly arranged by the programmer. However, threads share the
memory and the resources of the process to which they belong by default.
The benefit of sharing code and data is that it allows an application to
have several different threads of activity within the same address space.
3. Economy. Allocating memory and resources for process creation is costly.
Because threads share the resources of the process to which they belong,
it is more economical to create and context-switch threads. Empirically
gauging the difference in overhead can be difficult, but in general it is
significantly more time consuming to create and manage processes than
threads. In Solaris, for example, creating a process is about thirty times
slower than is creating a thread, and context switching is about five times
slower.
Figure 4.3 Concurrent execution on a single-core system.
4. Scalability. The benefits of multithreading can be even greater in a
multiprocessor architecture, where threads may be running in parallel
on different processing cores. A single-threaded process can run on only
one processor, regardless how many are available. We explore this issue
further in the following section.
4.2 Multicore Programming
Earlier in the history of computer design, in response to the need for more
computing performance, single-CPU systems evolved into multi-CPU systems.
A more recent, similar trend in system design is to place multiple computing
cores on a single chip. Each core appears as a separate processor to the
operating system (Section 1.3.2). Whether the cores appear across CPU chips or
within CPU chips, we call these systems multicore or multiprocessor systems.
Multithreaded programming provides a mechanism for more efficient use
of these multiple computing cores and improved concurrency. Consider an
application with four threads. On a system with a single computing core,
concurrency merely means that the execution of the threads will be interleaved
over time (Figure 4.3), because the processing core is capable of executing only
one thread at a time. On a system with multiple cores, however, concurrency
means that the threads can run in parallel, because the system can assign a
separate thread to each core (Figure 4.4).
Notice the distinction between parallelism and concurrency in this discus-
sion. A system is parallel if it can perform more than one task simultaneously.
In contrast, a concurrent system supports more than one task by allowing all
the tasks to make progress. Thus, it is possible to have concurrency without
parallelism. Before the advent of SMP and multicore architectures, most com-
puter systems had only a single processor. CPU schedulers were designed to
provide the illusion of parallelism by rapidly switching between processes in
the system, thereby allowing each process to make progress. Such processes
were running concurrently, but not in parallel.
Figure 4.4 Parallel execution on a multicore system.
AMDAHL’S LAW
Amdahl’s Law is a formula that identifies potential performance gains from
adding additional computing cores to an application that has both serial
(nonparallel) and parallel components. If S is the portion of the application
that must be performed serially on a system with N processing cores, the
formula appears as follows:

speedup ≤ 1 / (S + (1 − S)/N)
As an example, assume we have an application that is 75 percent parallel and
25 percent serial. If we run this application on a system with two processing
cores, we can get a speedup of 1.6 times. If we add two additional cores (for
a total of four), the speedup is 2.28 times.
One interesting fact about Amdahl’s Law is that as N approaches infinity,
the speedup converges to 1/S. For example, if 40 percent of an application
is performed serially, the maximum speedup is 2.5 times, regardless of
the number of processing cores we add. This is the fundamental principle
behind Amdahl’s Law: the serial portion of an application can have a
disproportionate effect on the performance we gain by adding additional
computing cores.
Some argue that Amdahl’s Law does not take into account the hardware
performance enhancements used in the design of contemporary multicore
systems. Such arguments suggest Amdahl’s Law may cease to be applicable
as the number of processing cores continues to increase on modern computer
systems.
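As a small worked example of the formula (not from the text), the following computes the speedup bound for a given serial fraction S and core count N and reproduces the figures quoted above.

#include <stdio.h>

static double amdahl_speedup(double S, int N) {
    return 1.0 / (S + (1.0 - S) / N);
}

int main(void) {
    printf("S = 0.25, N = 2:    %.2f\n", amdahl_speedup(0.25, 2));        /* 1.60 */
    printf("S = 0.25, N = 4:    %.2f\n", amdahl_speedup(0.25, 4));        /* about 2.3 */
    printf("S = 0.40, N = 10^6: %.2f\n", amdahl_speedup(0.40, 1000000));  /* approaches 2.50 */
    return 0;
}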
As systems have grown from tens of threads to thousands of threads, CPU
designers have improved system performance by adding hardware to improve
thread performance. Modern Intel CPUs frequently support two threads per
core, while the Oracle T4 CPU supports eight threads per core. This support
means that multiple threads can be loaded into the core for fast switching.
Multicore computers will no doubt continue to increase in core counts and
hardware thread support.
4.2.1 Programming Challenges
The trend towards multicore systems continues to place pressure on system
designers and application programmers to make better use of the multiple
computing cores. Designers of operating systems must write scheduling
algorithms that use multiple processing cores to allow the parallel execution
shown in Figure 4.4. For application programmers, the challenge is to modify
existing programs as well as design new programs that are multithreaded.
In general, five areas present challenges in programming for multicore
systems:
1. Identifying tasks. This involves examining applications to find areas
that can be divided into separate, concurrent tasks. Ideally, tasks are
independent of one another and thus can run in parallel on individual
cores.
2. Balance. While identifying tasks that can run in parallel, programmers
must also ensure that the tasks perform equal work of equal value. In
some instances, a certain task may not contribute as much value to the
overall process as other tasks. Using a separate execution core to run that
task may not be worth the cost.
3. Data splitting. Just as applications are divided into separate tasks, the
data accessed and manipulated by the tasks must be divided to run on
separate cores.
4. Data dependency. The data accessed by the tasks must be examined for
dependencies between two or more tasks. When one task depends on
data from another, programmers must ensure that the execution of the
tasks is synchronized to accommodate the data dependency. We examine
such strategies in Chapter 5.
5. Testing and debugging. When a program is running in parallel on
multiple cores, many different execution paths are possible. Testing and
debugging such concurrent programs is inherently more difficult than
testing and debugging single-threaded applications.
Because of these challenges, many software developers argue that the advent of
multicore systems will require an entirely new approach to designing software
systems in the future. (Similarly, many computer science educators believe that
software development must be taught with increased emphasis on parallel
programming.)
4.2.2 Types of Parallelism
In general, there are two types of parallelism: data parallelism and task
parallelism. Data parallelism focuses on distributing subsets of the same data
across multiple computing cores and performing the same operation on each
core. Consider, for example, summing the contents of an array of size N. On a
single-core system, one thread would simply sum the elements [0] . . . [N − 1].
On a dual-core system, however, thread A, running on core 0, could sum the
elements [0] . . . [N/2 − 1] while thread B, running on core 1, could sum the
elements [N/2] . . . [N − 1]. The two threads would be running in parallel on
separate computing cores.
Task parallelism involves distributing not data but tasks (threads) across
multiple computing cores. Each thread is performing a unique operation.
Different threads may be operating on the same data, or they may be operating
on different data. Consider again our example above. In contrast to that
situation, an example of task parallelism might involve two threads, each
performing a unique statistical operation on the array of elements. The threads
again are operating in parallel on separate computing cores, but each is
performing a unique operation.
Fundamentally, then, data parallelism involves the distribution of data
across multiple cores, while task parallelism involves the distribution of tasks across
multiple cores. In practice, however, few applications strictly follow either data
or task parallelism. In most instances, applications use a hybrid of these two
strategies.
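A hedged sketch of the data-parallel array sum described above, written with two POSIX threads, appears below; the array size and contents are illustrative assumptions.

#include <pthread.h>
#include <stdio.h>

#define N 1000

int data[N];
long partial[2];                            /* one partial sum per thread */

static void *sum_range(void *arg) {
    int id = *(int *)arg;                   /* this thread's index: 0 or 1 */
    int start = id * (N / 2);
    int end   = (id == 0) ? N / 2 : N;
    long sum = 0;
    for (int i = start; i < end; i++)       /* sum this thread's half */
        sum += data[i];
    partial[id] = sum;
    return NULL;
}

int main(void) {
    pthread_t tid[2];
    int ids[2] = {0, 1};

    for (int i = 0; i < N; i++)
        data[i] = i;                        /* fill with sample values */

    for (int i = 0; i < 2; i++)
        pthread_create(&tid[i], NULL, sum_range, &ids[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(tid[i], NULL);

    printf("total = %ld\n", partial[0] + partial[1]);   /* 499500 */
    return 0;
}

A task-parallel variant would instead give each thread a different operation over the same array, say one thread computing the sum while the other computes the maximum.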
4.3 Multithreading Models
Our discussion so far has treated threads in a generic sense. However, support
for threads may be provided either at the user level, for user threads, or by the
kernel, for kernel threads. User threads are supported above the kernel and
are managed without kernel support, whereas kernel threads are supported
and managed directly by the operating system. Virtually all contemporary
operating systems—including Windows, Linux, Mac OS X, and Solaris—
support kernel threads.
Ultimately, a relationship must exist between user threads and kernel
threads. In this section, we look at three common ways of establishing such a
relationship: the many-to-one model, the one-to-one model, and the many-to-
many model.
4.3.1 Many-to-One Model
The many-to-one model (Figure 4.5) maps many user-level threads to one
kernel thread. Thread management is done by the thread library in user space,
so it is efficient (we discuss thread libraries in Section 4.4). However, the entire
process will block if a thread makes a blocking system call. Also, because only
one thread can access the kernel at a time, multiple threads are unable to run in
parallel on multicore systems. Green threads—a thread library available for
Solaris systems and adopted in early versions of Java—used the many-to-one
model. However, very few systems continue to use the model because of its
inability to take advantage of multiple processing cores.
Figure 4.5 Many-to-one model.
Figure 4.6 One-to-one model.
4.3.2 One-to-One Model
The one-to-one model (Figure 4.6) maps each user thread to a kernel thread. It
provides more concurrency than the many-to-one model by allowing another
thread to run when a thread makes a blocking system call. It also allows
multiple threads to run in parallel on multiprocessors. The only drawback to
this model is that creating a user thread requires creating the corresponding
kernel thread. Because the overhead of creating kernel threads can burden the
performance of an application, most implementations of this model restrict the
number of threads supported by the system. Linux, along with the family of
Windows operating systems, implement the one-to-one model.
4.3.3 Many-to-Many Model
The many-to-many model (Figure 4.7) multiplexes many user-level threads to
a smaller or equal number of kernel threads. The number of kernel threads
may be specific to either a particular application or a particular machine (an
application may be allocated more kernel threads on a multiprocessor than on
a single processor).
Let’s consider the effect of this design on concurrency. Whereas the many-
to-one model allows the developer to create as many user threads as she wishes,
it does not result in true concurrency, because the kernel can schedule only
one thread at a time. The one-to-one model allows greater concurrency, but the
developer has to be careful not to create too many threads within an application
(and in some instances may be limited in the number of threads she can
Figure 4.7 Many-to-many model.
Figure 4.8 Two-level model.
create). The many-to-many model suffers from neither of these shortcomings:
developers can create as many user threads as necessary, and the corresponding
kernel threads can run in parallel on a multiprocessor. Also, when a thread
performs a blocking system call, the kernel can schedule another thread for
execution.
One variation on the many-to-many model still multiplexes many user-
level threads to a smaller or equal number of kernel threads but also allows a
user-level thread to be bound to a kernel thread. This variation is sometimes
referred to as the two-level model (Figure 4.8). The Solaris operating system
supported the two-level model in versions older than Solaris 9. However,
beginning with Solaris 9, this system uses the one-to-one model.
4.4 Thread Libraries
A thread library provides the programmer with an API for creating and
managing threads. There are two primary ways of implementing a thread
library. The first approach is to provide a library entirely in user space with no
kernel support. All code and data structures for the library exist in user space.
This means that invoking a function in the library results in a local function
call in user space and not a system call.
The second approach is to implement a kernel-level library supported
directly by the operating system. In this case, code and data structures for
the library exist in kernel space. Invoking a function in the API for the library
typically results in a system call to the kernel.
Three main thread libraries are in use today: POSIX Pthreads, Windows, and
Java. Pthreads, the threads extension of the POSIX standard, may be provided
as either a user-level or a kernel-level library. The Windows thread library
is a kernel-level library available on Windows systems. The Java thread API
allows threads to be created and managed directly in Java programs. However,
because in most instances the JVM is running on top of a host operating system,
the Java thread API is generally implemented using a thread library available
on the host system. This means that on Windows systems, Java threads are
typically implemented using the Windows API; UNIX and Linux systems often
use Pthreads.
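As a minimal sketch (not taken from the text) of the user-level view these libraries present, the following Pthreads fragment creates one thread and waits for it to terminate with pthread_join().

#include <pthread.h>
#include <stdio.h>

/* the function the new thread will run */
static void *runner(void *param) {
    printf("hello from the new thread\n");
    return NULL;
}

int main(void) {
    pthread_t tid;                                 /* thread identifier */
    pthread_attr_t attr;                           /* thread attributes */

    pthread_attr_init(&attr);                      /* use default attributes */
    pthread_create(&tid, &attr, runner, NULL);     /* create the thread */
    pthread_join(tid, NULL);                       /* wait for it to finish */

    printf("thread has terminated\n");
    return 0;
}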
Chapter 5 Process Synchronization
A cooperating process is one that can affect or be affected by other processes
executing in the system. Cooperating processes can either directly share a
logical address space (that is, both code and data) or be allowed to share data
only through files or messages. The former case is achieved through the use of
threads, discussed in Chapter 4. Concurrent access to shared data may result in
data inconsistency, however. In this chapter, we discuss various mechanisms
to ensure the orderly execution of cooperating processes that share a logical
address space, so that data consistency is maintained.
CHAPTER OBJECTIVES
• To introduce the critical-section problem, whose solutions can be used to
ensure the consistency of shared data.
• To present both software and hardware solutions of the critical-section
problem.
• To examine several classical process-synchronization problems.
• To explore several tools that are used to solve process synchronization
problems.
5.1 Background
We’ve already seen that processes can execute concurrently or in parallel.
Section 3.2.2 introduced the role of process scheduling and described how
the CPU scheduler switches rapidly between processes to provide concurrent
execution. This means that one process may only partially complete execution
before another process is scheduled. In fact, a process may be interrupted at
any point in its instruction stream, and the processing core may be assigned
to execute instructions of another process. Additionally, Section 4.2 introduced
parallel execution, in which two instruction streams (representing different
processes) execute simultaneously on separate processing cores. In this chapter,
we explain how concurrent or parallel execution can contribute to issues
involving the integrity of data shared by several processes.
Let’s consider an example of how this can happen. In Chapter 3, we devel-
oped a model of a system consisting of cooperating sequential processes or
threads, all running asynchronously and possibly sharing data. We illustrated
this model with the producer–consumer problem, which is representative of
operating systems. Specifically, in Section 3.4.1, we described how a bounded
buffer could be used to enable processes to share memory.
We now return to our consideration of the bounded buffer. As we pointed
out, our original solution allowed at most BUFFER_SIZE − 1 items in the buffer
at the same time. Suppose we want to modify the algorithm to remedy this
deficiency. One possibility is to add an integer variable counter, initialized to
0. counter is incremented every time we add a new item to the buffer and is
decremented every time we remove one item from the buffer. The code for the
producer process can be modified as follows:
while (true) {
    /* produce an item in next_produced */

    while (counter == BUFFER_SIZE)
        ; /* do nothing */

    buffer[in] = next_produced;
    in = (in + 1) % BUFFER_SIZE;
    counter++;
}
The code for the consumer process can be modified as follows:
while (true) {
    while (counter == 0)
        ; /* do nothing */

    next_consumed = buffer[out];
    out = (out + 1) % BUFFER_SIZE;
    counter--;

    /* consume the item in next_consumed */
}
Although the producer and consumer routines shown above are correct
separately, they may not function correctly when executed concurrently. As
an illustration, suppose that the value of the variable counter is currently
5 and that the producer and consumer processes concurrently execute the
statements “counter++” and “counter--”. Following the execution of these
two statements, the value of the variable counter may be 4, 5, or 6! The only
correct result, though, is counter == 5, which is generated correctly if the
producer and consumer execute separately.
We can show that the value of counter may be incorrect as follows. Note
that the statement “counter++” may be implemented in machine language (on
a typical machine) as follows:
register1 = counter
register1 = register1 + 1
counter = register1
where register1 is one of the local CPU registers. Similarly, the statement
“counter--” is implemented as follows:
register2 = counter
register2 = register2 − 1
counter = register2
where again register2 is one of the local CPU registers. Even though register1 and
register2 may be the same physical register (an accumulator, say), remember
that the contents of this register will be saved and restored by the interrupt
handler (Section 1.2.3).
The concurrent execution of “counter++” and “counter--” is equivalent
to a sequential execution in which the lower-level statements presented
previously are interleaved in some arbitrary order (but the order within each
high-level statement is preserved). One such interleaving is the following:
T0: producer execute register1 = counter {register1 = 5}
T1: producer execute register1 = register1 + 1 {register1 = 6}
T2: consumer execute register2 = counter {register2 = 5}
T3: consumer execute register2 = register2 − 1 {register2 = 4}
T4: producer execute counter = register1 {counter = 6}
T5: consumer execute counter = register2 {counter = 4}
Notice that we have arrived at the incorrect state “counter == 4”, indicating
that four buffers are full, when, in fact, five buffers are full. If we reversed the
order of the statements at T4 and T5, we would arrive at the incorrect state
“counter == 6”.
We would arrive at this incorrect state because we allowed both processes
to manipulate the variable counter concurrently. A situation like this, where
several processes access and manipulate the same data concurrently and the
outcome of the execution depends on the particular order in which the access
takes place, is called a race condition. To guard against the race condition
above, we need to ensure that only one process at a time can be manipulating
the variable counter. To make such a guarantee, we require that the processes
be synchronized in some way.
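The race is easy to reproduce. The following hedged sketch (not from the text) runs the unsynchronized increment and decrement loops in two POSIX threads; because counter++ and counter-- each compile to a load, an arithmetic operation, and a store, the final value frequently differs from the expected 0.

#include <pthread.h>
#include <stdio.h>

#define ITERATIONS 1000000

int counter = 0;                    /* shared, deliberately unsynchronized */

static void *producer(void *arg) {
    for (int i = 0; i < ITERATIONS; i++)
        counter++;                  /* not atomic: load, add, store */
    return NULL;
}

static void *consumer(void *arg) {
    for (int i = 0; i < ITERATIONS; i++)
        counter--;                  /* races with the increment above */
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    printf("counter = %d (expected 0)\n", counter);
    return 0;
}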
Situations such as the one just described occur frequently in operating
systems as different parts of the system manipulate resources. Furthermore, as
we have emphasized in earlier chapters, the growing importance of multicore
systems has brought an increased emphasis on developing multithreaded
applications. In such applications, several threads—which are quite possibly
sharing data—are running in parallel on different processing cores. Clearly,
do {
entry section
critical section
exit section
remainder section
} while (true);
Figure 5.1 General structure of a typical process Pi .
we want any changes that result from such activities not to interfere with one
another. Because of the importance of this issue, we devote a major portion of
this chapter to process synchronization and coordination among cooperating
processes.
5.2 The Critical-Section Problem
We begin our consideration of process synchronization by discussing the so-
called critical-section problem. Consider a system consisting of n processes
{P0, P1, ..., Pn−1}. Each process has a segment of code, called a critical section,
in which the process may be changing common variables, updating a table,
writing a file, and so on. The important feature of the system is that, when
one process is executing in its critical section, no other process is allowed to
execute in its critical section. That is, no two processes are executing in their
critical sections at the same time. The critical-section problem is to design a
protocol that the processes can use to cooperate. Each process must request
permission to enter its critical section. The section of code implementing this
request is the entry section. The critical section may be followed by an exit
section. The remaining code is the remainder section. The general structure of
a typical process Pi is shown in Figure 5.1. The entry section and exit section
are enclosed in boxes to highlight these important segments of code.
A solution to the critical-section problem must satisfy the following three
requirements:
1. Mutual exclusion. If process Pi is executing in its critical section, then no
other processes can be executing in their critical sections.
2. Progress. If no process is executing in its critical section and some
processes wish to enter their critical sections, then only those processes
that are not executing in their remainder sections can participate in
deciding which will enter its critical section next, and this selection cannot
be postponed indefinitely.
3. Bounded waiting. There exists a bound, or limit, on the number of times
that other processes are allowed to enter their critical sections after a
process has made a request to enter its critical section and before that
request is granted.
We assume that each process is executing at a nonzero speed. However, we can
make no assumption concerning the relative speed of the n processes.
At a given point in time, many kernel-mode processes may be active in
the operating system. As a result, the code implementing an operating system
(kernel code) is subject to several possible race conditions. Consider as an
example a kernel data structure that maintains a list of all open files in the
system. This list must be modified when a new file is opened or closed (adding
the file to the list or removing it from the list). If two processes were to open files
simultaneously, the separate updates to this list could result in a race condition.
Other kernel data structures that are prone to possible race conditions include
structures for maintaining memory allocation, for maintaining process lists,
and for interrupt handling. It is up to kernel developers to ensure that the
operating system is free from such race conditions.
Two general approaches are used to handle critical sections in operating
systems: preemptive kernels and nonpreemptive kernels. A preemptive
kernel allows a process to be preempted while it is running in kernel mode. A
nonpreemptive kernel does not allow a process running in kernel mode to be
preempted; a kernel-mode process will run until it exits kernel mode, blocks,
or voluntarily yields control of the CPU.
Obviously, a nonpreemptive kernel is essentially free from race conditions
on kernel data structures, as only one process is active in the kernel at a time.
We cannot say the same about preemptive kernels, so they must be carefully
designed to ensure that shared kernel data are free from race conditions.
Preemptive kernels are especially difficult to design for SMP architectures,
since in these environments it is possible for two kernel-mode processes to run
simultaneously on different processors.
Why, then, would anyone favor a preemptive kernel over a nonpreemptive
one? A preemptive kernel may be more responsive, since there is less risk that a
kernel-mode process will run for an arbitrarily long period before relinquishing
the processor to waiting processes. (Of course, this risk can also be minimized
by designing kernel code that does not behave in this way.) Furthermore, a
preemptive kernel is more suitable for real-time programming, as it will allow
a real-time process to preempt a process currently running in the kernel. Later
in this chapter, we explore how various operating systems manage preemption
within the kernel.
5.3 Peterson’s Solution
Next, we illustrate a classic software-based solution to the critical-section
problem known as Peterson’s solution. Because of the way modern computer
architectures perform basic machine-language instructions, such as load and
store, there are no guarantees that Peterson’s solution will work correctly on
such architectures. However, we present the solution because it provides a good
algorithmic description of solving the critical-section problem and illustrates
some of the complexities involved in designing software that addresses the
requirements of mutual exclusion, progress, and bounded waiting.
do {
    flag[i] = true;
    turn = j;
    while (flag[j] && turn == j)
        ; /* do nothing */

        /* critical section */

    flag[i] = false;

        /* remainder section */

} while (true);
Figure 5.2 The structure of process Pi in Peterson’s solution.
Peterson’s solution is restricted to two processes that alternate execution
between their critical sections and remainder sections. The processes are
numbered P0 and P1. For convenience, when presenting Pi , we use Pj to
denote the other process; that is, j equals 1 − i.
Peterson’s solution requires the two processes to share two data items:
int turn;
boolean flag[2];
The variable turn indicates whose turn it is to enter its critical section. That is,
if turn == i, then process Pi is allowed to execute in its critical section. The
flag array is used to indicate if a process is ready to enter its critical section.
For example, if flag[i] is true, this value indicates that Pi is ready to enter
its critical section. With an explanation of these data structures complete, we
are now ready to describe the algorithm shown in Figure 5.2.
To enter the critical section, process Pi first sets flag[i] to be true and
then sets turn to the value j, thereby asserting that if the other process wishes
to enter the critical section, it can do so. If both processes try to enter at the same
time, turn will be set to both i and j at roughly the same time. Only one of these
assignments will last; the other will occur but will be overwritten immediately.
The eventual value of turn determines which of the two processes is allowed
to enter its critical section first.
We now prove that this solution is correct. We need to show that:
1. Mutual exclusion is preserved.
2. The progress requirement is satisfied.
3. The bounded-waiting requirement is met.
To prove property 1, we note that each Pi enters its critical section only
if either flag[j] == false or turn == i. Also note that, if both processes
can be executing in their critical sections at the same time, then flag[0] ==
flag[1] == true. These two observations imply that P0 and P1 could not have
successfully executed their while statements at about the same time, since the
value of turn can be either 0 or 1 but cannot be both. Hence, one of the processes
—say, Pj —must have successfully executed the while statement, whereas Pi
had to execute at least one additional statement (“turn == j”). However, at
that time, flag[j] == true and turn == j, and this condition will persist as
long as Pj is in its critical section; as a result, mutual exclusion is preserved.
To prove properties 2 and 3, we note that a process Pi can be prevented from
entering the critical section only if it is stuck in the while loop with the condition
flag[j] == true and turn == j; this loop is the only one possible. If Pj is not
ready to enter the critical section, then flag[j] == false, and Pi can enter its
critical section. If Pj has set flag[j] to true and is also executing in its while
statement, then either turn == i or turn == j. If turn == i, then Pi will enter
the critical section. If turn == j, then Pj will enter the critical section. However,
once Pj exits its critical section, it will reset flag[j] to false, allowing Pi to
enter its critical section. If Pj resets flag[j] to true, it must also set turn to i.
Thus, since Pi does not change the value of the variable turn while executing
the while statement, Pi will enter the critical section (progress) after at most
one entry by Pj (bounded waiting).
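For readers who want to experiment with the algorithm, the following is a minimal sketch of Peterson's solution for two POSIX threads in C. The thread and variable names (worker, shared_counter) are illustrative, and, as noted above, plain loads and stores are not guaranteed to order correctly on modern processors; the sketch therefore uses C11 atomics with their default sequentially consistent ordering to stand in for the idealized memory model the algorithm assumes.

/* Sketch of Peterson's algorithm for two POSIX threads.
 * C11 atomics supply the ordering guarantees that ordinary
 * loads and stores do not provide on modern hardware. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_bool flag[2];
static atomic_int turn;
static long shared_counter = 0;

static void lock(int i) {
    int j = 1 - i;
    atomic_store(&flag[i], true);   /* I am ready to enter */
    atomic_store(&turn, j);         /* but let the other process go first */
    while (atomic_load(&flag[j]) && atomic_load(&turn) == j)
        ;                           /* busy wait */
}

static void unlock(int i) {
    atomic_store(&flag[i], false);
}

static void *worker(void *arg) {
    int i = *(int *)arg;
    for (int k = 0; k < 1000000; k++) {
        lock(i);
        shared_counter++;           /* critical section */
        unlock(i);
    }
    return NULL;
}

int main(void) {
    pthread_t t0, t1;
    int id0 = 0, id1 = 1;
    pthread_create(&t0, NULL, worker, &id0);
    pthread_create(&t1, NULL, worker, &id1);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("%ld\n", shared_counter); /* expect 2000000 */
    return 0;
}

Compiled with a C11 compiler, this should print 2000000; replacing the atomics with ordinary variables may or may not produce the same result, depending on the hardware.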
5.4 Synchronization Hardware
We have just described one software-based solution to the critical-section
problem. However, as mentioned, software-based solutions such as Peterson’s
are not guaranteed to work on modern computer architectures. In the following
discussions, we explore several more solutions to the critical-section problem
using techniques ranging from hardware to software-based APIs available to
both kernel developers and application programmers. All these solutions are
based on the premise of locking —that is, protecting critical regions through
the use of locks. As we shall see, the designs of such locks can be quite
sophisticated.
We start by presenting some simple hardware instructions that are available
on many systems and showing how they can be used effectively in solving the
critical-section problem. Hardware features can make any programming task
easier and improve system efficiency.
The critical-section problem could be solved simply in a single-processor
environment if we could prevent interrupts from occurring while a shared
variable was being modified. In this way, we could be sure that the current
sequence of instructions would be allowed to execute in order without pre-
emption. No other instructions would be run, so no unexpected modifications
could be made to the shared variable. This is often the approach taken by
nonpreemptive kernels.
boolean test_and_set(boolean *target) {
    boolean rv = *target;
    *target = true;

    return rv;
}
Figure 5.3 The definition of the test_and_set() instruction.
do {
    while (test_and_set(&lock))
        ; /* do nothing */

        /* critical section */

    lock = false;

        /* remainder section */

} while (true);
Figure 5.4 Mutual-exclusion implementation with test_and_set().
Unfortunately, this solution is not as feasible in a multiprocessor environ-
ment. Disabling interrupts on a multiprocessor can be time consuming, since
the message is passed to all the processors. This message passing delays entry
into each critical section, and system efficiency decreases. Also consider the
effect on a system’s clock if the clock is kept updated by interrupts.
Many modern computer systems therefore provide special hardware
instructions that allow us either to test and modify the content of a word or
to swap the contents of two words atomically—that is, as one uninterruptible
unit. We can use these special instructions to solve the critical-section problem
in a relatively simple manner. Rather than discussing one specific instruction
for one specific machine, we abstract the main concepts behind these types
of instructions by describing the test and set() and compare and swap()
instructions.
The test and set() instruction can be defined as shown in Figure 5.3.
The important characteristic of this instruction is that it is executed atomically.
Thus, if two test and set() instructions are executed simultaneously (each
on a different CPU), they will be executed sequentially in some arbitrary order. If
the machine supports the test and set() instruction, then we can implement
mutual exclusion by declaring a boolean variable lock, initialized to false.
The structure of process Pi is shown in Figure 5.4.
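As a point of reference, C11 exposes an atomic test-and-set primitive directly. The following sketch, with illustrative function names, expresses the lock of Figure 5.4 in terms of atomic_flag_test_and_set(); it is one plausible mapping of the abstract instruction onto a real API, not the definition used by any particular kernel.

/* Sketch: the lock of Figure 5.4 using C11's atomic_flag, whose
 * atomic_flag_test_and_set() behaves like the abstract test_and_set(). */
#include <stdatomic.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;   /* initially "false" */

void enter_critical(void) {
    while (atomic_flag_test_and_set(&lock))
        ;                     /* spin: the lock was already held */
}

void exit_critical(void) {
    atomic_flag_clear(&lock); /* lock = false */
}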
The compare and swap() instruction, in contrast to the test and set()
instruction, operates on three operands; it is defined in Figure 5.5. The operand
value is set to new value only if the expression (*value == expected) is
true. Regardless, compare and swap() always returns the original value of the
variable value. Like the test and set() instruction, compare and swap() is
int compare_and_swap(int *value, int expected, int new_value) {
    int temp = *value;

    if (*value == expected)
        *value = new_value;
    return temp;
}
Figure 5.5 The definition of the compare_and_swap() instruction.
do {
    while (compare_and_swap(&lock, 0, 1) != 0)
        ; /* do nothing */

        /* critical section */

    lock = 0;

        /* remainder section */

} while (true);
Figure 5.6 Mutual-exclusion implementation with the compare_and_swap() instruction.
executed atomically. Mutual exclusion can be provided as follows: a global
variable (lock) is declared and is initialized to 0. The first process that invokes
compare and swap() will set lock to 1. It will then enter its critical section,
because the original value of lock was equal to the expected value of 0.
Subsequent calls to compare and swap() will not succeed, because lock now
is not equal to the expected value of 0. When a process exits its critical section,
it sets lock back to 0, which allows another process to enter its critical section.
The structure of process Pi is shown in Figure 5.6.
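A comparable real-world counterpart is C11's atomic_compare_exchange_strong(). It differs slightly from the compare_and_swap() abstraction above: it returns a boolean and stores the value it observed back into expected, so the sketch below (with illustrative function names) resets expected before each retry.

/* Sketch: the loop of Figure 5.6 using C11's
 * atomic_compare_exchange_strong(). */
#include <stdatomic.h>

static atomic_int lock = 0;

void enter_critical(void) {
    int expected = 0;
    while (!atomic_compare_exchange_strong(&lock, &expected, 1))
        expected = 0;          /* lock was 1; reset and try again */
}

void exit_critical(void) {
    atomic_store(&lock, 0);
}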
Although these algorithms satisfy the mutual-exclusion requirement, they
do not satisfy the bounded-waiting requirement. In Figure 5.7, we present
another algorithm using the test and set() instruction that satisfies all the
critical-section requirements. The common data structures are
do {
    waiting[i] = true;
    key = true;
    while (waiting[i] && key)
        key = test_and_set(&lock);
    waiting[i] = false;

        /* critical section */

    j = (i + 1) % n;
    while ((j != i) && !waiting[j])
        j = (j + 1) % n;

    if (j == i)
        lock = false;
    else
        waiting[j] = false;

        /* remainder section */

} while (true);
Figure 5.7 Bounded-waiting mutual exclusion with test_and_set().
boolean waiting[n];
boolean lock;
These data structures are initialized to false. To prove that the mutual-
exclusion requirement is met, we note that process Pi can enter its critical
section only if either waiting[i] == false or key == false. The value
of key can become false only if the test and set() is executed. The first
process to execute the test and set() will find key == false; all others must
wait. The variable waiting[i] can become false only if another process
leaves its critical section; only one waiting[i] is set to false, maintaining the
mutual-exclusion requirement.
To prove that the progress requirement is met, we note that the arguments
presented for mutual exclusion also apply here, since a process exiting the
critical section either sets lock to false or sets waiting[j] to false. Both
allow a process that is waiting to enter its critical section to proceed.
To prove that the bounded-waiting requirement is met, we note that, when
a process leaves its critical section, it scans the array waiting in the cyclic
ordering (i + 1, i + 2, ..., n − 1, 0, ..., i − 1). It designates the first process in this
ordering that is in the entry section (waiting[j] == true) as the next one to
enter the critical section. Any process waiting to enter its critical section will
thus do so within n − 1 turns.
Details describing the implementation of the atomic test and set()
and compare and swap() instructions are discussed more fully in books on
computer architecture.
5.5 Mutex Locks
The hardware-based solutions to the critical-section problem presented in
Section 5.4 are complicated as well as generally inaccessible to application
programmers. Instead, operating-systems designers build software tools to
solve the critical-section problem. The simplest of these tools is the mutex
lock. (In fact, the term mutex is short for mutual exclusion.) We use the mutex
lock to protect critical regions and thus prevent race conditions. That is, a
process must acquire the lock before entering a critical section; it releases the
lock when it exits the critical section. The acquire() function acquires the lock,
and the release() function releases the lock, as illustrated in Figure 5.8.
A mutex lock has a boolean variable available whose value indicates if
the lock is available or not. If the lock is available, a call to acquire() succeeds,
and the lock is then considered unavailable. A process that attempts to acquire
an unavailable lock is blocked until the lock is released.
The definition of acquire() is as follows:
acquire() {
    while (!available)
        ; /* busy wait */
    available = false;
}
do {
    acquire lock

        critical section

    release lock

        remainder section

} while (true);
Figure 5.8 Solution to the critical-section problem using mutex locks.
The definition of release() is as follows:
release() {
    available = true;
}
Calls to either acquire() or release() must be performed atomically.
Thus, mutex locks are often implemented using one of the hardware mecha-
nisms described in Section 5.4, and we leave the description of this technique
as an exercise.
The main disadvantage of the implementation given here is that it requires
busy waiting. While a process is in its critical section, any other process that
tries to enter its critical section must loop continuously in the call to acquire().
In fact, this type of mutex lock is also called a spinlock because the process
“spins” while waiting for the lock to become available. (We see the same issue
with the code examples illustrating the test and set() instruction and the
compare and swap() instruction.) This continual looping is clearly a problem
in a real multiprogramming system, where a single CPU is shared among many
processes. Busy waiting wastes CPU cycles that some other process might be
able to use productively.
Spinlocks do have an advantage, however, in that no context switch is
required when a process must wait on a lock, and a context switch may
take considerable time. Thus, when locks are expected to be held for short
times, spinlocks are useful. They are often employed on multiprocessor systems
where one thread can “spin” on one processor while another thread performs
its critical section on another processor.
Later in this chapter (Section 5.7), we examine how mutex locks can be
used to solve classical synchronization problems. We also discuss how these
locks are used in several operating systems, as well as in Pthreads.
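As a preview of the Pthreads usage mentioned above, the following sketch shows the acquire/release pattern of Figure 5.8 with a pthread_mutex_t; the function and variable names are illustrative. Unlike the spinlock implementation just described, a Pthreads mutex will typically block the calling thread rather than busy-wait.

/* Sketch: the acquire/release pattern of Figure 5.8 using a Pthreads mutex. */
#include <pthread.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static int shared_data;

void update(int v) {
    pthread_mutex_lock(&m);     /* acquire lock */
    shared_data = v;            /* critical section */
    pthread_mutex_unlock(&m);   /* release lock */
}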
5.6 Semaphores
Mutex locks, as we mentioned earlier, are generally considered the simplest of
synchronization tools. In this section, we examine a more robust tool that can
behave similarly to a mutex lock but can also provide more sophisticated ways
for processes to synchronize their activities.
A semaphore S is an integer variable that, apart from initialization, is
accessed only through two standard atomic operations: wait() and signal().
The wait() operation was originally termed P (from the Dutch proberen, “to
test”); signal() was originally called V (from verhogen, “to increment”). The
definition of wait() is as follows:
wait(S) {
    while (S <= 0)
        ; // busy wait
    S--;
}
The definition of signal() is as follows:
signal(S) {
    S++;
}
All modifications to the integer value of the semaphore in the wait() and
signal() operations must be executed indivisibly. That is, when one process
modifies the semaphore value, no other process can simultaneously modify
that same semaphore value. In addition, in the case of wait(S), the testing of
the integer value of S (S ≤ 0), as well as its possible modification (S--), must
be executed without interruption. We shall see how these operations can be
implemented in Section 5.6.2. First, let’s see how semaphores can be used.
5.6.1 Semaphore Usage
Operating systems often distinguish between counting and binary semaphores.
The value of a counting semaphore can range over an unrestricted domain.
The value of a binary semaphore can range only between 0 and 1. Thus, binary
semaphores behave similarly to mutex locks. In fact, on systems that do not
provide mutex locks, binary semaphores can be used instead for providing
mutual exclusion.
Counting semaphores can be used to control access to a given resource
consisting of a finite number of instances. The semaphore is initialized to the
number of resources available. Each process that wishes to use a resource
performs a wait() operation on the semaphore (thereby decrementing the
count). When a process releases a resource, it performs a signal() operation
(incrementing the count). When the count for the semaphore goes to 0, all
resources are being used. After that, processes that wish to use a resource will
block until the count becomes greater than 0.
We can also use semaphores to solve various synchronization problems.
For example, consider two concurrently running processes: P1 with a statement
S1 and P2 with a statement S2. Suppose we require that S2 be executed only
after S1 has completed. We can implement this scheme readily by letting P1
and P2 share a common semaphore synch, initialized to 0. In process P1, we
insert the statements
S1;
signal(synch);
In process P2, we insert the statements
wait(synch);
S2;
Because synch is initialized to 0, P2 will execute S2 only after P1 has invoked
signal(synch), which is after statement S1 has been executed.
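The same S1-before-S2 scheme can be sketched with a POSIX unnamed semaphore; the thread names below are illustrative. Because synch is initialized to 0, the thread standing in for P2 blocks in sem_wait() until the thread standing in for P1 has executed S1 and posted.

/* Sketch: enforcing S1-before-S2 with a POSIX semaphore. */
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

static sem_t synch;

static void *thread_P1(void *arg) {
    printf("S1\n");        /* statement S1 */
    sem_post(&synch);      /* signal(synch) */
    return NULL;
}

static void *thread_P2(void *arg) {
    sem_wait(&synch);      /* wait(synch) */
    printf("S2\n");        /* statement S2 */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    sem_init(&synch, 0, 0);          /* initialized to 0 */
    pthread_create(&t2, NULL, thread_P2, NULL);
    pthread_create(&t1, NULL, thread_P1, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    sem_destroy(&synch);
    return 0;
}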
5.6.2 Semaphore Implementation
Recall that the implementation of mutex locks discussed in Section 5.5 suffers
from busy waiting. The definitions of the wait() and signal() semaphore
operations just described present the same problem. To overcome the need
for busy waiting, we can modify the definition of the wait() and signal()
operations as follows: When a process executes the wait() operation and finds
that the semaphore value is not positive, it must wait. However, rather than
engaging in busy waiting, the process can block itself. The block operation
places a process into a waiting queue associated with the semaphore, and the
state of the process is switched to the waiting state. Then control is transferred
to the CPU scheduler, which selects another process to execute.
A process that is blocked, waiting on a semaphore S, should be restarted
when some other process executes a signal() operation. The process is
restarted by a wakeup() operation, which changes the process from the waiting
state to the ready state. The process is then placed in the ready queue. (The
CPU may or may not be switched from the running process to the newly ready
process, depending on the CPU-scheduling algorithm.)
To implement semaphores under this definition, we define a semaphore as
follows:
typedef struct {
    int value;
    struct process *list;
} semaphore;
Each semaphore has an integer value and a list of processes list. When
a process must wait on a semaphore, it is added to the list of processes. A
signal() operation removes one process from the list of waiting processes
and awakens that process.
Now, the wait() semaphore operation can be defined as
wait(semaphore *S) {
    S->value--;
    if (S->value < 0) {
        add this process to S->list;
        block();
    }
}
and the signal() semaphore operation can be defined as
signal(semaphore *S) {
    S->value++;
    if (S->value <= 0) {
        remove a process P from S->list;
        wakeup(P);
    }
}
The block() operation suspends the process that invokes it. The wakeup(P)
operation resumes the execution of a blocked process P. These two operations
are provided by the operating system as basic system calls.
Note that in this implementation, semaphore values may be negative,
whereas semaphore values are never negative under the classical definition of
semaphores with busy waiting. If a semaphore value is negative, its magnitude
is the number of processes waiting on that semaphore. This fact results from
switching the order of the decrement and the test in the implementation of the
wait() operation.
The list of waiting processes can be easily implemented by a link field in
each process control block (PCB). Each semaphore contains an integer value and
a pointer to a list of PCBs. One way to add and remove processes from the list
so as to ensure bounded waiting is to use a FIFO queue, where the semaphore
contains both head and tail pointers to the queue. In general, however, the list
can use any queueing strategy. Correct usage of semaphores does not depend
on a particular queueing strategy for the semaphore lists.
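At user level, a blocking semaphore along these lines can be sketched with a Pthreads mutex and condition variable, where the condition variable's internal wait queue plays the role of S->list, block(), and wakeup(). Note that this sketch follows the classical definition (the count never goes negative) rather than the negative-count version given above, and the function names are illustrative.

/* Sketch: a blocking counting semaphore built from a Pthreads mutex
 * and condition variable. */
#include <pthread.h>

typedef struct {
    int value;
    pthread_mutex_t lock;
    pthread_cond_t  nonzero;
} semaphore;

void sem_wait_blocking(semaphore *S) {
    pthread_mutex_lock(&S->lock);
    while (S->value <= 0)                      /* nothing available: block */
        pthread_cond_wait(&S->nonzero, &S->lock);
    S->value--;
    pthread_mutex_unlock(&S->lock);
}

void sem_signal_blocking(semaphore *S) {
    pthread_mutex_lock(&S->lock);
    S->value++;
    pthread_cond_signal(&S->nonzero);          /* wakeup() one waiter */
    pthread_mutex_unlock(&S->lock);
}

Before use, value would be set to the desired initial count and the mutex and condition variable initialized with PTHREAD_MUTEX_INITIALIZER and PTHREAD_COND_INITIALIZER (or pthread_mutex_init() and pthread_cond_init()).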
It is critical that semaphore operations be executed atomically. We must
guarantee that no two processes can execute wait() and signal() operations
on the same semaphore at the same time. This is a critical-section problem;
and in a single-processor environment, we can solve it by simply inhibiting
interrupts during the time the wait() and signal() operations are executing.
This scheme works in a single-processor environment because, once interrupts
are inhibited, instructions from different processes cannot be interleaved. Only
the currently running process executes until interrupts are reenabled and the
scheduler can regain control.
In a multiprocessor environment, interrupts must be disabled on every pro-
cessor. Otherwise, instructions from different processes (running on different
processors) may be interleaved in some arbitrary way. Disabling interrupts on
every processor can be a difficult task and furthermore can seriously diminish
performance. Therefore, SMP systems must provide alternative locking tech-
niques—such as compare and swap() or spinlocks—to ensure that wait()
and signal() are performed atomically.
It is important to admit that we have not completely eliminated busy
waiting with this definition of the wait() and signal() operations. Rather,
we have moved busy waiting from the entry section to the critical sections
of application programs. Furthermore, we have limited busy waiting to the
critical sections of the wait() and signal() operations, and these sections are
short (if properly coded, they should be no more than about ten instructions).
Thus, the critical section is almost never occupied, and busy waiting occurs
rarely, and then for only a short time. An entirely different situation exists
with application programs whose critical sections may be long (minutes or
even hours) or may almost always be occupied. In such cases, busy waiting is
extremely inefficient.
5.6.3 Deadlocks and Starvation
The implementation of a semaphore with a waiting queue may result in a
situation where two or more processes are waiting indefinitely for an event
that can be caused only by one of the waiting processes. The event in question
is the execution of a signal() operation. When such a state is reached, these
processes are said to be deadlocked.
To illustrate this, consider a system consisting of two processes, P0 and P1,
each accessing two semaphores, S and Q, set to the value 1:
        P0                    P1
     wait(S);              wait(Q);
     wait(Q);              wait(S);
       ...                   ...
     signal(S);            signal(Q);
     signal(Q);            signal(S);
Suppose that P0 executes wait(S) and then P1 executes wait(Q). When P0
executes wait(Q), it must wait until P1 executes signal(Q). Similarly, when
P1 executes wait(S), it must wait until P0 executes signal(S). Since these
signal() operations cannot be executed, P0 and P1 are deadlocked.
We say that a set of processes is in a deadlocked state when every process
in the set is waiting for an event that can be caused only by another process
in the set. The events with which we are mainly concerned here are resource
acquisition and release. Other types of events may result in deadlocks, as we
show in Chapter 7. In that chapter, we describe various mechanisms for dealing
with the deadlock problem.
Another problem related to deadlocks is indefinite blocking or starvation,
a situation in which processes wait indefinitely within the semaphore. Indefi-
nite blocking may occur if we remove processes from the list associated with a
semaphore in LIFO (last-in, first-out) order.
5.6.4 Priority Inversion
A scheduling challenge arises when a higher-priority process needs to read
or modify kernel data that are currently being accessed by a lower-priority
process—or a chain of lower-priority processes. Since kernel data are typically
protected with a lock, the higher-priority process will have to wait for a
lower-priority one to finish with the resource. The situation becomes more
complicated if the lower-priority process is preempted in favor of another
process with a higher priority.
As an example, assume we have three processes— L, M, and H —whose
priorities follow the order L < M < H. Assume that process H requires
PRIORITY INVERSION AND THE MARS PATHFINDER
Priority inversion can be more than a scheduling inconvenience. On systems
with tight time constraints—such as real-time systems—priority inversion
can cause a process to take longer than it should to accomplish a task. When
that happens, other failures can cascade, resulting in system failure.
Consider the Mars Pathfinder, a NASA space probe that landed a robot, the
Sojourner rover, on Mars in 1997 to conduct experiments. Shortly after the
Sojourner began operating, it started to experience frequent computer resets.
Each reset reinitialized all hardware and software, including communica-
tions. If the problem had not been solved, the Sojourner would have failed in
its mission.
The problem was caused by the fact that one high-priority task, “bc dist,”
was taking longer than expected to complete its work. This task was being
forced to wait for a shared resource that was held by the lower-priority
“ASI/MET” task, which in turn was preempted by multiple medium-priority
tasks. The “bc dist” task would stall waiting for the shared resource, and
ultimately the “bc sched” task would discover the problem and perform the
reset. The Sojourner was suffering from a typical case of priority inversion.
The operating system on the Sojourner was the VxWorks real-time operat-
ing system, which had a global variable to enable priority inheritance on all
semaphores. After testing, the variable was set on the Sojourner (on Mars!),
and the problem was solved.
A full description of the problem, its detection, and its solution was written
by the software team lead and is available at
http://research.microsoft.com/en-us/um/people/mbj/mars_pathfinder/
authoritative_account.html.
resource R, which is currently being accessed by process L. Ordinarily, process
H would wait for L to finish using resource R. However, now suppose that
process M becomes runnable, thereby preempting process L. Indirectly, a
process with a lower priority—process M—has affected how long process
H must wait for L to relinquish resource R.
This problem is known as priority inversion. It occurs only in systems with
more than two priorities, so one solution is to have only two priorities. That is
insufficient for most general-purpose operating systems, however. Typically
these systems solve the problem by implementing a priority-inheritance
protocol. According to this protocol, all processes that are accessing resources
needed by a higher-priority process inherit the higher priority until they are
finished with the resources in question. When they are finished, their priorities
revert to their original values. In the example above, a priority-inheritance
protocol would allow process L to temporarily inherit the priority of process
H, thereby preventing process M from preempting its execution. When process
L had finished using resource R, it would relinquish its inherited priority from
H and assume its original priority. Because resource R would now be available,
process H —not M—would run next.
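On systems that implement the POSIX real-time options, priority inheritance can be requested per mutex. The sketch below, with an illustrative lock name, shows the relevant attribute call; whether PTHREAD_PRIO_INHERIT is available depends on the platform.

/* Sketch: requesting the priority-inheritance protocol for a Pthreads mutex. */
#include <pthread.h>

pthread_mutex_t resource_R;

void init_resource_lock(void) {
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    pthread_mutex_init(&resource_R, &attr);
    pthread_mutexattr_destroy(&attr);
}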
do {
    . . .
    /* produce an item in next_produced */
    . . .
    wait(empty);
    wait(mutex);
    . . .
    /* add next_produced to the buffer */
    . . .
    signal(mutex);
    signal(full);
} while (true);
Figure 5.9 The structure of the producer process.
5.7 Classic Problems of Synchronization
In this section, we present a number of synchronization problems as examples
of a large class of concurrency-control problems. These problems are used for
testing nearly every newly proposed synchronization scheme. In our solutions
to the problems, we use semaphores for synchronization, since that is the
traditional way to present such solutions. However, actual implementations of
these solutions could use mutex locks in place of binary semaphores.
5.7.1 The Bounded-Buffer Problem
The bounded-buffer problem was introduced in Section 5.1; it is commonly
used to illustrate the power of synchronization primitives. Here, we present a
general structure of this scheme without committing ourselves to any particular
implementation. We provide a related programming project in the exercises at
the end of the chapter.
In our problem, the producer and consumer processes share the following
data structures:
int n;
semaphore mutex = 1;
semaphore empty = n;
semaphore full = 0;
We assume that the pool consists of n buffers, each capable of holding one item.
The mutex semaphore provides mutual exclusion for accesses to the buffer pool
and is initialized to the value 1. The empty and full semaphores count the
number of empty and full buffers. The semaphore empty is initialized to the
value n; the semaphore full is initialized to the value 0.
The code for the producer process is shown in Figure 5.9, and the code
for the consumer process is shown in Figure 5.10. Note the symmetry between
the producer and the consumer. We can interpret this code as the producer
producing full buffers for the consumer or as the consumer producing empty
buffers for the producer.
do {
    wait(full);
    wait(mutex);
    . . .
    /* remove an item from buffer to next_consumed */
    . . .
    signal(mutex);
    signal(empty);
    . . .
    /* consume the item in next_consumed */
    . . .
} while (true);
Figure 5.10 The structure of the consumer process.
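One concrete way to fill in Figures 5.9 and 5.10 is sketched below using POSIX semaphores, a Pthreads mutex, and a circular array. The buffer size, item type, and function names are arbitrary choices made for the example.

/* Sketch: bounded buffer with POSIX semaphores and a circular array. */
#include <pthread.h>
#include <semaphore.h>

#define BUFFER_SIZE 10

static int buffer[BUFFER_SIZE];
static int in = 0, out = 0;
static sem_t empty, full;                       /* counting semaphores */
static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

void buffer_init(void) {
    sem_init(&empty, 0, BUFFER_SIZE);           /* n empty slots initially */
    sem_init(&full, 0, 0);                      /* no full slots initially */
}

void produce(int item) {
    sem_wait(&empty);                           /* wait(empty) */
    pthread_mutex_lock(&mutex);                 /* wait(mutex) */
    buffer[in] = item;                          /* add item to the buffer */
    in = (in + 1) % BUFFER_SIZE;
    pthread_mutex_unlock(&mutex);               /* signal(mutex) */
    sem_post(&full);                            /* signal(full) */
}

int consume(void) {
    sem_wait(&full);                            /* wait(full) */
    pthread_mutex_lock(&mutex);                 /* wait(mutex) */
    int item = buffer[out];                     /* remove item from the buffer */
    out = (out + 1) % BUFFER_SIZE;
    pthread_mutex_unlock(&mutex);               /* signal(mutex) */
    sem_post(&empty);                           /* signal(empty) */
    return item;
}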
5.7.2 The Readers–Writers Problem
Suppose that a database is to be shared among several concurrent processes.
Some of these processes may want only to read the database, whereas others
may want to update (that is, to read and write) the database. We distinguish
between these two types of processes by referring to the former as readers
and to the latter as writers. Obviously, if two readers access the shared data
simultaneously, no adverse effects will result. However, if a writer and some
other process (either a reader or a writer) access the database simultaneously,
chaos may ensue.
To ensure that these difficulties do not arise, we require that the writers
have exclusive access to the shared database while writing to the database. This
synchronization problem is referred to as the readers–writers problem. Since it
was originally stated, it has been used to test nearly every new synchronization
primitive. The readers–writers problem has several variations, all involving
priorities. The simplest one, referred to as the first readers–writers problem,
requires that no reader be kept waiting unless a writer has already obtained
permission to use the shared object. In other words, no reader should wait for
other readers to finish simply because a writer is waiting. The second readers
–writers problem requires that, once a writer is ready, that writer perform its
write as soon as possible. In other words, if a writer is waiting to access the
object, no new readers may start reading.
A solution to either problem may result in starvation. In the first case,
writers may starve; in the second case, readers may starve. For this reason,
other variants of the problem have been proposed. Next, we present a solution
to the first readers–writers problem. See the bibliographical notes at the end
of the chapter for references describing starvation-free solutions to the second
readers–writers problem.
In the solution to the first readers–writers problem, the reader processes
share the following data structures:
semaphore rw_mutex = 1;
semaphore mutex = 1;
int read_count = 0;
The semaphores mutex and rw mutex are initialized to 1; read count is
initialized to 0. The semaphore rw mutex is common to both reader and writer
do {
    wait(rw_mutex);
    . . .
    /* writing is performed */
    . . .
    signal(rw_mutex);
} while (true);
Figure 5.11 The structure of a writer process.
processes. The mutex semaphore is used to ensure mutual exclusion when the
variable read count is updated. The read count variable keeps track of how
many processes are currently reading the object. The semaphore rw mutex
functions as a mutual exclusion semaphore for the writers. It is also used by
the first or last reader that enters or exits the critical section. It is not used by
readers who enter or exit while other readers are in their critical sections.
The code for a writer process is shown in Figure 5.11; the code for a
reader process is shown in Figure 5.12. Note that, if a writer is in the critical
section and n readers are waiting, then one reader is queued on rw mutex, and
n − 1 readers are queued on mutex. Also observe that, when a writer executes
signal(rw mutex), we may resume the execution of either the waiting readers
or a single waiting writer. The selection is made by the scheduler.
The readers–writers problem and its solutions have been generalized to
provide reader–writer locks on some systems. Acquiring a reader–writer lock
requires specifying the mode of the lock: either read or write access. When a
process wishes only to read shared data, it requests the reader–writer lock
in read mode. A process wishing to modify the shared data must request the
lock in write mode. Multiple processes are permitted to concurrently acquire
a reader–writer lock in read mode, but only one process may acquire the lock
for writing, as exclusive access is required for writers.
Reader–writer locks are most useful in the following situations:
do {
    wait(mutex);
    read_count++;
    if (read_count == 1)
        wait(rw_mutex);
    signal(mutex);
    . . .
    /* reading is performed */
    . . .
    wait(mutex);
    read_count--;
    if (read_count == 0)
        signal(rw_mutex);
    signal(mutex);
} while (true);
Figure 5.12 The structure of a reader process.
[Figure 5.13 The situation of the dining philosophers: a bowl of rice at the center of a circular table, with one chopstick between each pair of philosophers.]
• In applications where it is easy to identify which processes only read shared
data and which processes only write shared data.
• In applications that have more readers than writers. This is because reader–
writer locks generally require more overhead to establish than semaphores
or mutual-exclusion locks. The increased concurrency of allowing multiple
readers compensates for the overhead involved in setting up the reader–
writer lock.
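Pthreads is one system that provides the reader–writer locks described above. A minimal sketch (with illustrative function names) follows; any number of readers may hold the lock concurrently, while a writer acquires it exclusively.

/* Sketch: a reader–writer lock as provided by Pthreads. */
#include <pthread.h>

static pthread_rwlock_t rw_lock = PTHREAD_RWLOCK_INITIALIZER;
static int shared_value;

int reader(void) {
    pthread_rwlock_rdlock(&rw_lock);   /* acquire in read mode */
    int v = shared_value;              /* reading is performed */
    pthread_rwlock_unlock(&rw_lock);
    return v;
}

void writer(int v) {
    pthread_rwlock_wrlock(&rw_lock);   /* acquire in write mode */
    shared_value = v;                  /* writing is performed */
    pthread_rwlock_unlock(&rw_lock);
}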
5.7.3 The Dining-Philosophers Problem
Consider five philosophers who spend their lives thinking and eating. The
philosophers share a circular table surrounded by five chairs, each belonging
to one philosopher. In the center of the table is a bowl of rice, and the table is laid
with five single chopsticks (Figure 5.13). When a philosopher thinks, she does
not interact with her colleagues. From time to time, a philosopher gets hungry
and tries to pick up the two chopsticks that are closest to her (the chopsticks
that are between her and her left and right neighbors). A philosopher may pick
up only one chopstick at a time. Obviously, she cannot pick up a chopstick that
is already in the hand of a neighbor. When a hungry philosopher has both her
chopsticks at the same time, she eats without releasing the chopsticks. When
she is finished eating, she puts down both chopsticks and starts thinking again.
The dining-philosophers problem is considered a classic synchronization
problem neither because of its practical importance nor because computer
scientists dislike philosophers but because it is an example of a large class
of concurrency-control problems. It is a simple representation of the need
to allocate several resources among several processes in a deadlock-free and
starvation-free manner.
One simple solution is to represent each chopstick with a semaphore. A
philosopher tries to grab a chopstick by executing a wait() operation on that
semaphore. She releases her chopsticks by executing the signal() operation
on the appropriate semaphores. Thus, the shared data are
semaphore chopstick[5];
do {
    wait(chopstick[i]);
    wait(chopstick[(i+1) % 5]);
    . . .
    /* eat for awhile */
    . . .
    signal(chopstick[i]);
    signal(chopstick[(i+1) % 5]);
    . . .
    /* think for awhile */
    . . .
} while (true);
Figure 5.14 The structure of philosopher i.
where all the elements of chopstick are initialized to 1. The structure of
philosopher i is shown in Figure 5.14.
Although this solution guarantees that no two neighbors are eating
simultaneously, it nevertheless must be rejected because it could create a
deadlock. Suppose that all five philosophers become hungry at the same time
and each grabs her left chopstick. All the elements of chopstick will now be
equal to 0. When each philosopher tries to grab her right chopstick, she will be
delayed forever.
Several possible remedies to the deadlock problem are the following:
• Allow at most four philosophers to be sitting simultaneously at the table.
• Allow a philosopher to pick up her chopsticks only if both chopsticks are
available (to do this, she must pick them up in a critical section).
• Use an asymmetric solution—that is, an odd-numbered philosopher picks
up first her left chopstick and then her right chopstick, whereas an even-
numbered philosopher picks up her right chopstick and then her left
chopstick.
In Section 5.8, we present a solution to the dining-philosophers problem
that ensures freedom from deadlocks. Note, however, that any satisfactory
solution to the dining-philosophers problem must guard against the possibility
that one of the philosophers will starve to death. A deadlock-free solution does
not necessarily eliminate the possibility of starvation.
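The asymmetric remedy listed above can be sketched with one POSIX semaphore per chopstick, each initialized to 1. Odd-numbered philosophers pick up the left chopstick first and even-numbered philosophers the right chopstick first, which breaks the circular wait that deadlocks the solution of Figure 5.14; the function names are illustrative, and the sketch does not address starvation.

/* Sketch: the asymmetric pickup order with one POSIX semaphore per chopstick. */
#include <semaphore.h>

#define N 5
static sem_t chopstick[N];    /* each initialized to 1 with sem_init() */

void pick_up(int i) {
    int left = i, right = (i + 1) % N;
    if (i % 2 == 1) {                 /* odd-numbered: left first */
        sem_wait(&chopstick[left]);
        sem_wait(&chopstick[right]);
    } else {                          /* even-numbered: right first */
        sem_wait(&chopstick[right]);
        sem_wait(&chopstick[left]);
    }
}

void put_down(int i) {
    sem_post(&chopstick[i]);
    sem_post(&chopstick[(i + 1) % N]);
}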
5.8 Monitors
Although semaphores provide a convenient and effective mechanism for
process synchronization, using them incorrectly can result in timing errors
that are difficult to detect, since these errors happen only if particular execution
sequences take place and these sequences do not always occur.
We have seen an example of such errors in the use of counters in our
solution to the producer–consumer problem (Section 5.1). In that example,
the timing problem happened only rarely, and even then the counter value appeared to be reasonable, off by only 1.
Chapter 6 CPU Scheduling
CPU scheduling is the basis of multiprogrammed operating systems. By
switching the CPU among processes, the operating system can make the
computer more productive. In this chapter, we introduce basic CPU-scheduling
concepts and present several CPU-scheduling algorithms. We also consider the
problem of selecting an algorithm for a particular system.
In Chapter 4, we introduced threads to the process model. On operating
systems that support them, it is kernel-level threads—not processes—that
are in fact being scheduled by the operating system. However, the terms
"process scheduling" and "thread scheduling" are often used interchangeably.
In this chapter, we use process scheduling when discussing general scheduling
concepts and thread scheduling to refer to thread-specific ideas.
CHAPTER OBJECTIVES
• To introduce CPU scheduling, which is the basis for multiprogrammed
operating systems.
• To describe various CPU-scheduling algorithms.
• To discuss evaluation criteria for selecting a CPU-scheduling algorithm for
a particular system.
• To examine the scheduling algorithms of several operating systems.
6.1 Basic Concepts
In a single-processor system, only one process can run at a time. Others
must wait until the CPU is free and can be rescheduled. The objective of
multiprogramming is to have some process running at all times, to maximize
CPU utilization. The idea is relatively simple. A process is executed until
it must wait, typically for the completion of some I/O request. In a simple
computer system, the CPU then just sits idle. All this waiting time is wasted;
no useful work is accomplished. With multiprogramming, we try to use this
time productively. Several processes are kept in memory at one time.

[Figure 6.1 Alternating sequence of CPU and I/O bursts: a process runs a CPU burst (for example, load, store, add, read from file), then waits for I/O, then runs another CPU burst, and so on.]

When
one process has to wait, the operating system takes the CPU away from that
process and gives the CPU to another process. This pattern continues. Every
time one process has to wait, another process can take over use of the CPU.
Scheduling of this kind is a fundamental operating-system function.
Almost all computer resources are scheduled before use. The CPU is, of course,
one of the primary computer resources. Thus, its scheduling is central to
operating-system design.
6.1.1 CPU–I/O Burst Cycle
The success of CPU scheduling depends on an observed property of processes:
process execution consists of a cycle of CPU execution and I/O wait. Processes
alternate between these two states. Process execution begins with a CPU burst.
That is followed by an I/O burst, which is followed by another CPU burst, then
another I/O burst, and so on. Eventually, the final CPU burst ends with a system
request to terminate execution (Figure 6.1).
The durations of CPU bursts have been measured extensively. Although
they vary greatly from process to process and from computer to computer,
they tend to have a frequency curve similar to that shown in Figure 6.2. The
curve is generally characterized as exponential or hyperexponential, with a
large number of short CPU bursts and a small number of long CPU bursts.
[Figure 6.2 Histogram of CPU-burst durations: frequency versus burst duration (milliseconds), with a large number of short bursts and a tail of longer bursts.]
An I/O-bound program typically has many short CPU bursts. A CPU-bound
program might have a few long CPU bursts. This distribution can be important
in the selection of an appropriate CPU-scheduling algorithm.
6.1.2 CPU Scheduler
Whenever the CPU becomes idle, the operating system must select one of the
processes in the ready queue to be executed. The selection process is carried out
by the short-term scheduler, or CPU scheduler. The scheduler selects a process
from the processes in memory that are ready to execute and allocates the CPU
to that process.
Note that the ready queue is not necessarily a first-in, first-out (FIFO) queue.
As we shall see when we consider the various scheduling algorithms, a ready
queue can be implemented as a FIFO queue, a priority queue, a tree, or simply
an unordered linked list. Conceptually, however, all the processes in the ready
queue are lined up waiting for a chance to run on the CPU. The records in the
queues are generally process control blocks (PCBs) of the processes.
6.1.3 Preemptive Scheduling
CPU-scheduling decisions may take place under the following four circum-
stances:
1. When a process switches from the running state to the waiting state (for
example, as the result of an I/O request or an invocation of wait() for
the termination of a child process)
2. When a process switches from the running state to the ready state (for
example, when an interrupt occurs)
3. When a process switches from the waiting state to the ready state (for
example, at completion of I/O)
4. When a process terminates
For situations 1 and 4, there is no choice in terms of scheduling. A new process
(if one exists in the ready queue) must be selected for execution. There is a
choice, however, for situations 2 and 3.
When scheduling takes place only under circumstances 1 and 4, we say
that the scheduling scheme is nonpreemptive or cooperative. Otherwise,
it is preemptive. Under nonpreemptive scheduling, once the CPU has been
allocated to a process, the process keeps the CPU until it releases the CPU either
by terminating or by switching to the waiting state. This scheduling method
was used by Microsoft Windows 3.x. Windows 95 introduced preemptive
scheduling, and all subsequent versions of Windows operating systems have
used preemptive scheduling. The Mac OS X operating system for the Macintosh
also uses preemptive scheduling; previous versions of the Macintosh operating
system relied on cooperative scheduling. Cooperative scheduling is the only
method that can be used on certain hardware platforms, because it does not
require the special hardware (for example, a timer) needed for preemptive
scheduling.
Unfortunately, preemptive scheduling can result in race conditions when
data are shared among several processes. Consider the case of two processes
that share data. While one process is updating the data, it is preempted so that
the second process can run. The second process then tries to read the data,
which are in an inconsistent state. This issue was explored in detail in Chapter
5.
Preemption also affects the design of the operating-system kernel. During
the processing of a system call, the kernel may be busy with an activity on behalf
of a process. Such activities may involve changing important kernel data (for
instance, I/O queues). What happens if the process is preempted in the middle
of these changes and the kernel (or the device driver) needs to read or modify
the same structure? Chaos ensues. Certain operating systems, including most
versions of UNIX, deal with this problem by waiting either for a system call
to complete or for an I/O block to take place before doing a context switch.
This scheme ensures that the kernel structure is simple, since the kernel will
not preempt a process while the kernel data structures are in an inconsistent
state. Unfortunately, this kernel-execution model is a poor one for supporting
real-time computing where tasks must complete execution within a given time
frame. In Section 6.6, we explore scheduling demands of real-time systems.
Because interrupts can, by definition, occur at any time, and because
they cannot always be ignored by the kernel, the sections of code affected
by interrupts must be guarded from simultaneous use. The operating system
needs to accept interrupts at almost all times. Otherwise, input might be lost or
output overwritten. So that these sections of code are not accessed concurrently
by several processes, they disable interrupts at entry and reenable interrupts
at exit. It is important to note that sections of code that disable interrupts do
not occur very often and typically contain few instructions.
6.1.4 Dispatcher
Another component involved in the CPU-scheduling function is the dispatcher.
The dispatcher is the module that gives control of the CPU to the process selected
by the short-term scheduler. This function involves the following:
• Switching context
• Switching to user mode
• Jumping to the proper location in the user program to restart that program
The dispatcher should be as fast as possible, since it is invoked during every
process switch. The time it takes for the dispatcher to stop one process and
start another running is known as the dispatch latency.
6.2 Scheduling Criteria
Different CPU-scheduling algorithms have different properties, and the choice
of a particular algorithm may favor one class of processes over another. In
choosing which algorithm to use in a particular situation, we must consider
the properties of the various algorithms.
Many criteria have been suggested for comparing CPU-scheduling algo-
rithms. Which characteristics are used for comparison can make a substantial
difference in which algorithm is judged to be best. The criteria include the
following:
• CPU utilization. We want to keep the CPU as busy as possible. Concep-
tually, CPU utilization can range from 0 to 100 percent. In a real system, it
should range from 40 percent (for a lightly loaded system) to 90 percent
(for a heavily loaded system).
• Throughput. If the CPU is busy executing processes, then work is being
done. One measure of work is the number of processes that are completed
per time unit, called throughput. For long processes, this rate may be one
process per hour; for short transactions, it may be ten processes per second.
• Turnaround time. From the point of view of a particular process, the
important criterion is how long it takes to execute that process. The interval
from the time of submission of a process to the time of completion is the
turnaround time. Turnaround time is the sum of the periods spent waiting
to get into memory, waiting in the ready queue, executing on the CPU, and
doing I/O.
• Waiting time. The CPU-scheduling algorithm does not affect the amount
of time during which a process executes or does I/O. It affects only the
amount of time that a process spends waiting in the ready queue. Waiting
time is the sum of the periods spent waiting in the ready queue.
• Response time. In an interactive system, turnaround time may not be
the best criterion. Often, a process can produce some output fairly early
and can continue computing new results while previous results are being
output to the user. Thus, another measure is the time from the submission
of a request until the first response is produced. This measure, called
response time, is the time it takes to start responding, not the time it takes
to output the response. The turnaround time is generally limited by the
speed of the output device.
It is desirable to maximize CPU utilization and throughput and to minimize
turnaround time, waiting time, and response time. In most cases, we optimize
the average measure. However, under some circumstances, we prefer to
optimize the minimum or maximum values rather than the average. For
example, to guarantee that all users get good service, we may want to minimize
the maximum response time.
Investigators have suggested that, for interactive systems (such as desktop
systems), it is more important to minimize the variance in the response time
than to minimize the average response time. A system with reasonable and
predictable response time may be considered more desirable than a system
that is faster on the average but is highly variable. However, little work has
been done on CPU-scheduling algorithms that minimize variance.
As we discuss various CPU-scheduling algorithms in the following section,
we illustrate their operation. An accurate illustration should involve many
processes, each a sequence of several hundred CPU bursts and I/O bursts.
For simplicity, though, we consider only one CPU burst (in milliseconds) per
process in our examples. Our measure of comparison is the average waiting
time. More elaborate evaluation mechanisms are discussed in Section 6.8.
6.3 Scheduling Algorithms
CPU scheduling deals with the problem of deciding which of the processes in the
ready queue is to be allocated the CPU. There are many different CPU-scheduling
algorithms. In this section, we describe several of them.
6.3.1 First-Come, First-Served Scheduling
By far the simplest CPU-scheduling algorithm is the first-come, first-served
(FCFS) scheduling algorithm. With this scheme, the process that requests the
CPU first is allocated the CPU first. The implementation of the FCFS policy is
easily managed with a FIFO queue. When a process enters the ready queue, its
PCB is linked onto the tail of the queue. When the CPU is free, it is allocated to
the process at the head of the queue. The running process is then removed from
the queue. The code for FCFS scheduling is simple to write and understand.
On the negative side, the average waiting time under the FCFS policy is
often quite long. Consider the following set of processes that arrive at time 0,
with the length of the CPU burst given in milliseconds:
    Process   Burst Time
      P1          24
      P2           3
      P3           3
If the processes arrive in the order P1, P2, P3, and are served in FCFS order,
we get the result shown in the following Gantt chart, which is a bar chart that
illustrates a particular schedule, including the start and finish times of each of
the participating processes:
    |            P1            | P2 | P3 |
    0                         24   27   30
The waiting time is 0 milliseconds for process P1, 24 milliseconds for process
P2, and 27 milliseconds for process P3. Thus, the average waiting time is (0
+ 24 + 27)/3 = 17 milliseconds. If the processes arrive in the order P2, P3, P1,
however, the results will be as shown in the following Gantt chart:
    | P2 | P3 |            P1            |
    0    3    6                         30
The average waiting time is now (6 + 0 + 3)/3 = 3 milliseconds. This reduction
is substantial. Thus, the average waiting time under an FCFS policy is generally
not minimal and may vary substantially if the processes’ CPU burst times vary
greatly.
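For processes that all arrive at time 0, FCFS waiting times are easy to compute directly: each process waits for the sum of the bursts of the processes ahead of it. The short sketch below reproduces the 17-millisecond average from the first example.

/* Sketch: FCFS average waiting time for processes arriving at time 0. */
#include <stdio.h>

int main(void) {
    int burst[] = {24, 3, 3};                 /* P1, P2, P3 */
    int n = 3, wait = 0, total_wait = 0;
    for (int i = 0; i < n; i++) {
        total_wait += wait;                   /* waits for everything ahead of it */
        wait += burst[i];
    }
    printf("average waiting time = %.2f ms\n", (double)total_wait / n);
    return 0;
}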
In addition, consider the performance of FCFS scheduling in a dynamic
situation. Assume we have one CPU-bound process and many I/O-bound
processes. As the processes flow around the system, the following scenario
may result. The CPU-bound process will get and hold the CPU. During this
time, all the other processes will finish their I/O and will move into the ready
queue, waiting for the CPU. While the processes wait in the ready queue, the
I/O devices are idle. Eventually, the CPU-bound process finishes its CPU burst
and moves to an I/O device. All the I/O-bound processes, which have short
CPU bursts, execute quickly and move back to the I/O queues. At this point,
the CPU sits idle. The CPU-bound process will then move back to the ready
queue and be allocated the CPU. Again, all the I/O processes end up waiting in
the ready queue until the CPU-bound process is done. There is a convoy effect
as all the other processes wait for the one big process to get off the CPU. This
effect results in lower CPU and device utilization than might be possible if the
shorter processes were allowed to go first.
Note also that the FCFS scheduling algorithm is nonpreemptive. Once the
CPU has been allocated to a process, that process keeps the CPU until it releases
the CPU, either by terminating or by requesting I/O. The FCFS algorithm is thus
particularly troublesome for time-sharing systems, where it is important that
each user get a share of the CPU at regular intervals. It would be disastrous to
allow one process to keep the CPU for an extended period.
6.3.2 Shortest-Job-First Scheduling
A different approach to CPU scheduling is the shortest-job-first (SJF) scheduling
algorithm. This algorithm associates with each process the length of the
process’s next CPU burst. When the CPU is available, it is assigned to the
process that has the smallest next CPU burst. If the next CPU bursts of two
processes are the same, FCFS scheduling is used to break the tie. Note that a
more appropriate term for this scheduling method would be the shortest-next-
CPU-burst algorithm, because scheduling depends on the length of the next
CPU burst of a process, rather than its total length. We use the term SJF because
most people and textbooks use this term to refer to this type of scheduling.
As an example of SJF scheduling, consider the following set of processes,
with the length of the CPU burst given in milliseconds:
    Process   Burst Time
      P1           6
      P2           8
      P3           7
      P4           3
Using SJF scheduling, we would schedule these processes according to the
following Gantt chart:
    | P4 |     P1     |     P3     |      P2      |
    0    3            9           16             24
The waiting time is 3 milliseconds for process P1, 16 milliseconds for process
P2, 9 milliseconds for process P3, and 0 milliseconds for process P4. Thus, the
average waiting time is (3 + 16 + 9 + 0)/4 = 7 milliseconds. By comparison, if
we were using the FCFS scheduling scheme, the average waiting time would
be 10.25 milliseconds.
The SJF scheduling algorithm is provably optimal, in that it gives the
minimum average waiting time for a given set of processes. Moving a short
process before a long one decreases the waiting time of the short process more
than it increases the waiting time of the long process. Consequently, the average
waiting time decreases.
The real difficulty with the SJF algorithm is knowing the length of the next
CPU request. For long-term (job) scheduling in a batch system, we can use
the process time limit that a user specifies when he submits the job. In this
situation, users are motivated to estimate the process time limit accurately,
since a lower value may mean faster response but too low a value will cause
a time-limit-exceeded error and require resubmission. SJF scheduling is used
frequently in long-term scheduling.
Although the SJF algorithm is optimal, it cannot be implemented at the
level of short-term CPU scheduling. With short-term scheduling, there is no
way to know the length of the next CPU burst. One approach to this problem
is to try to approximate SJF scheduling. We may not know the length of the
next CPU burst, but we may be able to predict its value. We expect that the
next CPU burst will be similar in length to the previous ones. By computing
an approximation of the length of the next CPU burst, we can pick the process
with the shortest predicted CPU burst.
The next CPU burst is generally predicted as an exponential average of
the measured lengths of previous CPU bursts. We can define the exponential
average with the following formula. Let tn be the length of the nth CPU burst,
and let τn+1 be our predicted value for the next CPU burst. Then, for α,
0 ≤ α ≤ 1, define

    τn+1 = α tn + (1 − α) τn.

The value of tn contains our most recent information, while τn stores the past
history. The parameter α controls the relative weight of recent and past history
in our prediction. If α = 0, then τn+1 = τn, and recent history has no effect
(current conditions are assumed to be transient). If α = 1, then τn+1 = tn, and
only the most recent CPU burst matters (history is assumed to be old and
irrelevant). More commonly, α = 1/2, so recent history and past history are
equally weighted. The initial τ0 can be defined as a constant or as an overall
system average. Figure 6.3 shows an exponential average with α = 1/2 and
τ0 = 10.

    CPU burst (ti):    6    4    6    4   13   13   13   …
    "guess" (τi):     10    8    6    6    5    9   11   12   …

Figure 6.3 Prediction of the length of the next CPU burst.

To understand the behavior of the exponential average, we can expand the
formula for τn+1 by substituting for τn to find

    τn+1 = α tn + (1 − α) α tn−1 + · · · + (1 − α)^j α tn−j + · · · + (1 − α)^(n+1) τ0.

Typically, α is less than 1. As a result, (1 − α) is also less than 1, and each
successive term has less weight than its predecessor.
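As a quick check of the formula, the following C sketch (not from the text) runs the exponential average with α = 1/2 and τ0 = 10 over the measured bursts of Figure 6.3, reproducing the sequence of guesses 10, 8, 6, 6, 5, 9, 11, 12.

#include <stdio.h>

int main(void) {
    double alpha = 0.5;
    double tau = 10.0;                          /* tau_0, the initial guess   */
    double bursts[] = {6, 4, 6, 4, 13, 13, 13}; /* measured t_i values        */
    int n = sizeof(bursts) / sizeof(bursts[0]);

    for (int i = 0; i < n; i++) {
        printf("guess = %4.1f, actual burst = %4.1f\n", tau, bursts[i]);
        tau = alpha * bursts[i] + (1.0 - alpha) * tau;   /* tau_{i+1} */
    }
    printf("next predicted burst = %4.1f\n", tau);
    return 0;
}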
The SJF algorithm can be either preemptive or nonpreemptive. The choice
arises when a new process arrives at the ready queue while a previous process is
still executing. The next CPU burst of the newly arrived process may be shorter
than what is left of the currently executing process. A preemptive SJF algorithm
will preempt the currently executing process, whereas a nonpreemptive SJF
algorithm will allow the currently running process to finish its CPU burst.
Preemptive SJF scheduling is sometimes called shortest-remaining-time-first
scheduling.
As an example, consider the following four processes, with the length of
the CPU burst given in milliseconds:
Process Arrival Time Burst Time
P1 0 8
P2 1 4
P3 2 9
P4 3 5
If the processes arrive at the ready queue at the times shown and need the
indicated burst times, then the resulting preemptive SJF schedule is as depicted
in the following Gantt chart:
    | P1 | P2 | P4 | P1 | P3 |
    0    1    5   10   17   26
Process P1 is started at time 0, since it is the only process in the queue. Process
P2 arrives at time 1. The remaining time for process P1 (7 milliseconds) is
larger than the time required by process P2 (4 milliseconds), so process P1 is
preempted, and process P2 is scheduled. The average waiting time for this
example is [(10 − 1) + (1 − 1) + (17 − 2) + (5 − 3)]/4 = 26/4 = 6.5 milliseconds.
Nonpreemptive SJF scheduling would result in an average waiting time of 7.75
milliseconds.
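A millisecond-by-millisecond simulation makes it easy to verify these waiting times. The following C sketch (not from the text) replays the preemptive (shortest-remaining-time-first) schedule for the four processes above, breaking ties in favor of the lower-numbered process.

#include <stdio.h>

int main(void) {
    int arrival[]   = {0, 1, 2, 3};      /* P1..P4 */
    int remaining[] = {8, 4, 9, 5};
    int waiting[]   = {0, 0, 0, 0};
    int n = 4, done = 0;

    for (int t = 0; done < n; t++) {
        int pick = -1;
        for (int i = 0; i < n; i++)      /* arrived process with the least time left */
            if (arrival[i] <= t && remaining[i] > 0 &&
                (pick < 0 || remaining[i] < remaining[pick]))
                pick = i;
        for (int i = 0; i < n; i++)      /* every other arrived, unfinished process waits */
            if (i != pick && arrival[i] <= t && remaining[i] > 0)
                waiting[i]++;
        if (pick >= 0 && --remaining[pick] == 0)
            done++;
    }
    int total = waiting[0] + waiting[1] + waiting[2] + waiting[3];
    printf("average waiting time = %.2f ms\n", (double)total / n);   /* 6.50 */
    return 0;
}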
6.3.3 Priority Scheduling
The SJF algorithm is a special case of the general priority-scheduling algorithm.
A priority is associated with each process, and the CPU is allocated to the process
with the highest priority. Equal-priority processes are scheduled in FCFS order.
An SJF algorithm is simply a priority algorithm where the priority (p) is the
inverse of the (predicted) next CPU burst. The larger the CPU burst, the lower
the priority, and vice versa.
Note that we discuss scheduling in terms of high priority and low priority.
Priorities are generally indicated by some fixed range of numbers, such as 0
to 7 or 0 to 4,095. However, there is no general agreement on whether 0 is the
highest or lowest priority. Some systems use low numbers to represent low
priority; others use low numbers for high priority. This difference can lead to
confusion. In this text, we assume that low numbers represent high priority.
As an example, consider the following set of processes, assumed to have
arrived at time 0 in the order P1, P2, · · ·, P5, with the length of the CPU burst
given in milliseconds:
Process Burst Time Priority
P1 10 3
P2 1 1
P3 2 4
P4 1 5
P5 5 2
Using priority scheduling, we would schedule these processes according to the
following Gantt chart:
    | P2 | P5 | P1 | P3 | P4 |
    0    1    6   16   18   19
The average waiting time is 8.2 milliseconds.
Priorities can be defined either internally or externally. Internally defined
priorities use some measurable quantity or quantities to compute the priority
of a process. For example, time limits, memory requirements, the number of
open files, and the ratio of average I/O burst to average CPU burst have been
used in computing priorities. External priorities are set by criteria outside the
operating system, such as the importance of the process, the type and amount
of funds being paid for computer use, the department sponsoring the work,
and other, often political, factors.
Priority scheduling can be either preemptive or nonpreemptive. When a
process arrives at the ready queue, its priority is compared with the priority
of the currently running process. A preemptive priority scheduling algorithm
will preempt the CPU if the priority of the newly arrived process is higher
than the priority of the currently running process. A nonpreemptive priority
scheduling algorithm will simply put the new process at the head of the ready
queue.
A major problem with priority scheduling algorithms is indefinite block-
ing, or starvation. A process that is ready to run but waiting for the CPU can
be considered blocked. A priority scheduling algorithm can leave some low-
priority processes waiting indefinitely. In a heavily loaded computer system, a
steady stream of higher-priority processes can prevent a low-priority process
from ever getting the CPU. Generally, one of two things will happen. Either the
process will eventually be run (at 2 A.M. Sunday, when the system is finally
lightly loaded), or the computer system will eventually crash and lose all
unfinished low-priority processes. (Rumor has it that when they shut down
the IBM 7094 at MIT in 1973, they found a low-priority process that had been
submitted in 1967 and had not yet been run.)
A solution to the problem of indefinite blockage of low-priority processes is
aging. Aging involves gradually increasing the priority of processes that wait
in the system for a long time. For example, if priorities range from 127 (low)
to 0 (high), we could increase the priority of a waiting process by 1 every 15
minutes. Eventually, even a process with an initial priority of 127 would have
the highest priority in the system and would be executed. In fact, it would take
no more than 32 hours for a priority-127 process to age to a priority-0 process.
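A routine that implements this form of aging can be very small. The sketch below is illustrative only (the struct pcb fields are assumptions, not part of any real kernel); it would be invoked periodically, for example every 15 minutes, to nudge every waiting process one step toward the highest priority.

#include <stdio.h>

struct pcb {
    int pid;
    int priority;      /* 0 (high) .. 127 (low) */
    int waiting;       /* nonzero while in the ready queue but not running */
};

/* Called once per aging period: every waiting process moves one step up. */
void age_ready_queue(struct pcb procs[], int n) {
    for (int i = 0; i < n; i++)
        if (procs[i].waiting && procs[i].priority > 0)
            procs[i].priority--;
}

int main(void) {
    struct pcb p = { .pid = 42, .priority = 127, .waiting = 1 };
    for (int tick = 0; tick < 127; tick++)      /* 127 aging periods */
        age_ready_queue(&p, 1);
    printf("pid %d aged to priority %d\n", p.pid, p.priority);   /* 0 */
    return 0;
}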
6.3.4 Round-Robin Scheduling
The round-robin (RR) scheduling algorithm is designed especially for time-
sharing systems. It is similar to FCFS scheduling, but preemption is added to
enable the system to switch between processes. A small unit of time, called a
time quantum or time slice, is defined. A time quantum is generally from 10
to 100 milliseconds in length. The ready queue is treated as a circular queue.
The CPU scheduler goes around the ready queue, allocating the CPU to each
process for a time interval of up to 1 time quantum.
To implement RR scheduling, we again treat the ready queue as a FIFO
queue of processes. New processes are added to the tail of the ready queue.
The CPU scheduler picks the first process from the ready queue, sets a timer to
interrupt after 1 time quantum, and dispatches the process.
One of two things will then happen. The process may have a CPU burst of
less than 1 time quantum. In this case, the process itself will release the CPU
voluntarily. The scheduler will then proceed to the next process in the ready
queue. If the CPU burst of the currently running process is longer than 1 time
quantum, the timer will go off and will cause an interrupt to the operating
system. A context switch will be executed, and the process will be put at the
tail of the ready queue. The CPU scheduler will then select the next process in
the ready queue.
The average waiting time under the RR policy is often long. Consider the
following set of processes that arrive at time 0, with the length of the CPU burst
given in milliseconds:
Process Burst Time
P1 24
P2 3
P3 3
If we use a time quantum of 4 milliseconds, then process P1 gets the first 4
milliseconds. Since it requires another 20 milliseconds, it is preempted after
the first time quantum, and the CPU is given to the next process in the queue,
process P2. Process P2 does not need 4 milliseconds, so it quits before its time
quantum expires. The CPU is then given to the next process, process P3. Once
each process has received 1 time quantum, the CPU is returned to process P1
for an additional time quantum. The resulting RR schedule is as follows:
    | P1 | P2 | P3 | P1 | P1 | P1 | P1 | P1 |
    0    4    7   10   14   18   22   26   30
Let’s calculate the average waiting time for this schedule. P1 waits for 6
milliseconds (10 - 4), P2 waits for 4 milliseconds, and P3 waits for 7 milliseconds.
Thus, the average waiting time is 17/3 = 5.66 milliseconds.
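The same arithmetic can be automated. The following C sketch (not from the text) replays the round-robin example with a 4 ms quantum; because all three processes arrive at time 0, each waiting time is simply the completion time minus the burst time, and the cyclic scan over the array stands in for the circular ready queue.

#include <stdio.h>

int main(void) {
    int burst[]     = {24, 3, 3};    /* P1, P2, P3 */
    int remaining[] = {24, 3, 3};
    int waiting[3];
    int n = 3, quantum = 4, clock = 0, finished = 0;

    while (finished < n) {
        for (int i = 0; i < n; i++) {            /* cyclic scan of the ready "queue" */
            if (remaining[i] == 0) continue;
            int slice = remaining[i] < quantum ? remaining[i] : quantum;
            clock += slice;
            remaining[i] -= slice;
            if (remaining[i] == 0) {
                waiting[i] = clock - burst[i];   /* completion - burst (arrival = 0) */
                finished++;
            }
        }
    }
    printf("average waiting time = %.2f ms\n",
           (waiting[0] + waiting[1] + waiting[2]) / 3.0);   /* prints 5.67 */
    return 0;
}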
In the RR scheduling algorithm, no process is allocated the CPU for more
than 1 time quantum in a row (unless it is the only runnable process). If a
process’s CPU burst exceeds 1 time quantum, that process is preempted and is
put back in the ready queue. The RR scheduling algorithm is thus preemptive.
If there are n processes in the ready queue and the time quantum is q,
then each process gets 1/n of the CPU time in chunks of at most q time units.
Each process must wait no longer than (n − 1) × q time units until its
next time quantum. For example, with five processes and a time quantum of 20
milliseconds, each process will get up to 20 milliseconds every 100 milliseconds.
The performance of the RR algorithm depends heavily on the size of the time
quantum. At one extreme, if the time quantum is extremely large, the RR policy
    process time = 10

        quantum    context switches
           12             0
            6             1
            1             9

Figure 6.4 How a smaller time quantum increases context switches.
is the same as the FCFS policy. In contrast, if the time quantum is extremely
small (say, 1 millisecond), the RR approach can result in a large number of
context switches. Assume, for example, that we have only one process of 10
time units. If the quantum is 12 time units, the process finishes in less than 1
time quantum, with no overhead. If the quantum is 6 time units, however, the
process requires 2 quanta, resulting in a context switch. If the time quantum is
1 time unit, then nine context switches will occur, slowing the execution of the
process accordingly (Figure 6.4).
Thus, we want the time quantum to be large with respect to the context-
switch time. If the context-switch time is approximately 10 percent of the
time quantum, then about 10 percent of the CPU time will be spent in context
switching. In practice, most modern systems have time quanta ranging from
10 to 100 milliseconds. The time required for a context switch is typically less
than 10 microseconds; thus, the context-switch time is a small fraction of the
time quantum.
Turnaround time also depends on the size of the time quantum. As we
can see from Figure 6.5, the average turnaround time of a set of processes
does not necessarily improve as the time-quantum size increases. In general,
the average turnaround time can be improved if most processes finish their
next CPU burst in a single time quantum. For example, given three processes
of 10 time units each and a quantum of 1 time unit, the average turnaround
time is 29. If the time quantum is 10, however, the average turnaround time
drops to 20. If context-switch time is added in, the average turnaround time
increases even more for a smaller time quantum, since more context switches
are required.
Although the time quantum should be large compared with the context-
switch time, it should not be too large. As we pointed out earlier, if the time
quantum is too large, RR scheduling degenerates to an FCFS policy. A rule of
thumb is that 80 percent of the CPU bursts should be shorter than the time
quantum.
6.3.5 Multilevel Queue Scheduling
Another class of scheduling algorithms has been created for situations in
which processes are easily classified into different groups. For example, a
    process    time
      P1          6
      P2          3
      P3          1
      P4          7

Figure 6.5 How turnaround time varies with the time quantum. (The figure plots
the average turnaround time of these four processes for time quanta from 1 to 7.)
common division is made between foreground (interactive) processes and
background (batch) processes. These two types of processes have different
response-time requirements and so may have different scheduling needs. In
addition, foreground processes may have priority (externally defined) over
background processes.
A multilevel queue scheduling algorithm partitions the ready queue into
several separate queues (Figure 6.6). The processes are permanently assigned to
one queue, generally based on some property of the process, such as memory
size, process priority, or process type. Each queue has its own scheduling
algorithm. For example, separate queues might be used for foreground and
background processes. The foreground queue might be scheduled by an RR
algorithm, while the background queue is scheduled by an FCFS algorithm.
In addition, there must be scheduling among the queues, which is com-
monly implemented as fixed-priority preemptive scheduling. For example, the
foreground queue may have absolute priority over the background queue.
Let’s look at an example of a multilevel queue scheduling algorithm with
five queues, listed below in order of priority:
1. System processes
2. Interactive processes
3. Interactive editing processes
4. Batch processes
5. Student processes
Figure 6.6 Multilevel queue scheduling. (The figure shows the five queues listed
above, with system processes at the highest priority and student processes at the
lowest.)
Each queue has absolute priority over lower-priority queues. No process in the
batch queue, for example, could run unless the queues for system processes,
interactive processes, and interactive editing processes were all empty. If an
interactive editing process entered the ready queue while a batch process was
running, the batch process would be preempted.
Another possibility is to time-slice among the queues. Here, each queue gets
a certain portion of the CPU time, which it can then schedule among its various
processes. For instance, in the foreground–background queue example, the
foreground queue can be given 80 percent of the CPU time for RR scheduling
among its processes, while the background queue receives 20 percent of the
CPU to give to its processes on an FCFS basis.
6.3.6 Multilevel Feedback Queue Scheduling
Normally, when the multilevel queue scheduling algorithm is used, processes
are permanently assigned to a queue when they enter the system. If there
are separate queues for foreground and background processes, for example,
processes do not move from one queue to the other, since processes do not
change their foreground or background nature. This setup has the advantage
of low scheduling overhead, but it is inflexible.
The multilevel feedback queue scheduling algorithm, in contrast, allows
a process to move between queues. The idea is to separate processes according
to the characteristics of their CPU bursts. If a process uses too much CPU time,
it will be moved to a lower-priority queue. This scheme leaves I/O-bound and
interactive processes in the higher-priority queues. In addition, a process that
waits too long in a lower-priority queue may be moved to a higher-priority
queue. This form of aging prevents starvation.
For example, consider a multilevel feedback queue scheduler with three
queues, numbered from 0 to 2 (Figure 6.7). The scheduler first executes all
    queue 0: RR with quantum = 8 milliseconds     (highest priority)
    queue 1: RR with quantum = 16 milliseconds
    queue 2: FCFS                                 (lowest priority)

Figure 6.7 Multilevel feedback queues.
processes in queue 0. Only when queue 0 is empty will it execute processes
in queue 1. Similarly, processes in queue 2 will be executed only if queues 0
and 1 are empty. A process that arrives for queue 1 will preempt a process in
queue 2. A process in queue 1 will in turn be preempted by a process arriving
for queue 0.
A process entering the ready queue is put in queue 0. A process in queue 0
is given a time quantum of 8 milliseconds. If it does not finish within this time,
it is moved to the tail of queue 1. If queue 0 is empty, the process at the head
of queue 1 is given a quantum of 16 milliseconds. If it does not complete, it is
preempted and is put into queue 2. Processes in queue 2 are run on an FCFS
basis but are run only when queues 0 and 1 are empty.
This scheduling algorithm gives highest priority to any process with a CPU
burst of 8 milliseconds or less. Such a process will quickly get the CPU, finish
its CPU burst, and go off to its next I/O burst. Processes that need more than
8 but less than 24 milliseconds are also served quickly, although with lower
priority than shorter processes. Long processes automatically sink to queue
2 and are served in FCFS order with any CPU cycles left over from queues 0
and 1.
In general, a multilevel feedback queue scheduler is defined by the
following parameters:
• The number of queues
• The scheduling algorithm for each queue
• The method used to determine when to upgrade a process to a higher-
priority queue
• The method used to determine when to demote a process to a lower-
priority queue
• The method used to determine which queue a process will enter when that
process needs service
The definition of a multilevel feedback queue scheduler makes it the most
general CPU-scheduling algorithm. It can be configured to match a specific
system under design. Unfortunately, it is also the most complex algorithm,
since defining the best scheduler requires some means by which to select
values for all the parameters.
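One way to make these parameters concrete is to gather them into a configuration structure. The sketch below is illustrative only (the type and field names are assumptions, not from the text); it encodes the three-queue example of Figure 6.7.

#include <stdio.h>

#define MAX_LEVELS 8

enum queue_policy { POLICY_RR, POLICY_FCFS };

struct mlfq_level {
    enum queue_policy policy;   /* scheduling algorithm for this queue */
    int quantum_ms;             /* time quantum; ignored for FCFS      */
};

struct mlfq_config {
    int nlevels;                         /* number of queues                             */
    struct mlfq_level level[MAX_LEVELS]; /* algorithm for each queue                     */
    int demote_after_quanta;             /* when to demote a process                     */
    int promote_after_wait_ms;           /* aging: when to promote it (0 = never)        */
    int entry_level;                     /* queue a process enters when it needs service */
};

int main(void) {
    /* The three-queue example of Figure 6.7: queue 0 (RR, 8 ms),
     * queue 1 (RR, 16 ms), queue 2 (FCFS). */
    struct mlfq_config cfg = {
        .nlevels = 3,
        .level = { {POLICY_RR, 8}, {POLICY_RR, 16}, {POLICY_FCFS, 0} },
        .demote_after_quanta = 1,
        .promote_after_wait_ms = 0,
        .entry_level = 0,
    };
    printf("entry queue %d uses a quantum of %d ms\n",
           cfg.entry_level, cfg.level[cfg.entry_level].quantum_ms);
    return 0;
}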
6.4 Thread Scheduling
In Chapter 4, we introduced threads to the process model, distinguishing
between user-level and kernel-level threads. On operating systems that support
them, it is kernel-level threads—not processes—that are being scheduled by
the operating system. User-level threads are managed by a thread library,
and the kernel is unaware of them. To run on a CPU, user-level threads
must ultimately be mapped to an associated kernel-level thread, although
this mapping may be indirect and may use a lightweight process (LWP). In this
section, we explore scheduling issues involving user-level and kernel-level
threads and offer specific examples of scheduling for Pthreads.
6.4.1 Contention Scope
One distinction between user-level and kernel-level threads lies in how they
are scheduled. On systems implementing the many-to-one (Section 4.3.1) and
many-to-many (Section 4.3.3) models, the thread library schedules user-level
threads to run on an available LWP. This scheme is known as process-
contention scope (PCS), since competition for the CPU takes place among
threads belonging to the same process. (When we say the thread library
schedules user threads onto available LWPs, we do not mean that the threads
are actually running on a CPU. That would require the operating system to
schedule the kernel thread onto a physical CPU.) To decide which kernel-level
thread to schedule onto a CPU, the kernel uses system-contention scope (SCS).
Competition for the CPU with SCS scheduling takes place among all threads
in the system. Systems using the one-to-one model (Section 4.3.2), such as
Windows, Linux, and Solaris, schedule threads using only SCS.
Typically, PCS is done according to priority—the scheduler selects the
runnable thread with the highest priority to run. User-level thread priorities
are set by the programmer and are not adjusted by the thread library, although
some thread libraries may allow the programmer to change the priority of
a thread. It is important to note that PCS will typically preempt the thread
currently running in favor of a higher-priority thread; however, there is no
guarantee of time slicing (Section 6.3.4) among threads of equal priority.
6.4.2 Pthread Scheduling
We provided a sample POSIX Pthread program in Section 4.4.1, along with an
introduction to thread creation with Pthreads. Now, we highlight the POSIX
Pthread API that allows specifying PCS or SCS during thread creation. Pthreads
identifies the following contention scope values:
• PTHREAD_SCOPE_PROCESS schedules threads using PCS scheduling.
• PTHREAD_SCOPE_SYSTEM schedules threads using SCS scheduling.
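In POSIX, the contention scope is carried in the thread-attributes object and is read and written with pthread_attr_getscope() and pthread_attr_setscope(). The following short sketch (not taken from the text) queries the default scope and then requests system-contention scope; note that some systems, Linux among them, allow only one of the two scopes.

#include <pthread.h>
#include <stdio.h>

int main(void) {
    pthread_attr_t attr;
    int scope;

    pthread_attr_init(&attr);
    if (pthread_attr_getscope(&attr, &scope) == 0)
        printf("default scope is %s\n",
               scope == PTHREAD_SCOPE_PROCESS ? "PCS" : "SCS");

    /* Request SCS; on Linux this is the only scope actually supported. */
    if (pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM) != 0)
        fprintf(stderr, "unable to set scheduling scope\n");

    pthread_attr_destroy(&attr);
    return 0;
}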
Chapter 7
Deadlocks
In a multiprogramming environment, several processes may compete for a
finite number of resources. A process requests resources; if the resources are
not available at that time, the process enters a waiting state. Sometimes, a
waiting process is never again able to change state, because the resources it
has requested are held by other waiting processes. This situation is called
a deadlock. We discussed this issue briefly in Chapter 5 in connection with
semaphores.
Perhaps the best illustration of a deadlock can be drawn from a law passed
by the Kansas legislature early in the 20th century. It said, in part: “When two
trains approach each other at a crossing, both shall come to a full stop and
neither shall start up again until the other has gone.”
In this chapter, we describe methods that an operating system can use
to prevent or deal with deadlocks. Although some applications can identify
programs that may deadlock, operating systems typically do not provide
deadlock-prevention facilities, and it remains the responsibility of program-
mers to ensure that they design deadlock-free programs. Deadlock problems
can only become more common, given current trends, including larger num-
bers of processes, multithreaded programs, many more resources within a
system, and an emphasis on long-lived file and database servers rather than
batch systems.
CHAPTER OBJECTIVES
• To develop a description of deadlocks, which prevent sets of concurrent
processes from completing their tasks.
• To present a number of different methods for preventing or avoiding
deadlocks in a computer system.
7.1 System Model
A system consists of a finite number of resources to be distributed among a
number of competing processes. The resources may be partitioned into several
types (or classes), each consisting of some number of identical instances. CPU
cycles, files, and I/O devices (such as printers and DVD drives) are examples of
resource types. If a system has two CPUs, then the resource type CPU has two
instances. Similarly, the resource type printer may have five instances.
If a process requests an instance of a resource type, the allocation of any
instance of the type should satisfy the request. If it does not, then the instances
are not identical, and the resource type classes have not been defined properly.
For example, a system may have two printers. These two printers may be
defined to be in the same resource class if no one cares which printer prints
which output. However, if one printer is on the ninth floor and the other is
in the basement, then people on the ninth floor may not see both printers
as equivalent, and separate resource classes may need to be defined for each
printer.
Chapter 5 discussed various synchronization tools, such as mutex locks
and semaphores. These tools are also considered system resources, and they
are a common source of deadlock. However, a lock is typically associated with
protecting a specific data structure—that is, one lock may be used to protect
access to a queue, another to protect access to a linked list, and so forth. For that
reason, each lock is typically assigned its own resource class, and definition is
not a problem.
A process must request a resource before using it and must release the
resource after using it. A process may request as many resources as it requires
to carry out its designated task. Obviously, the number of resources requested
may not exceed the total number of resources available in the system. In other
words, a process cannot request three printers if the system has only two.
Under the normal mode of operation, a process may utilize a resource in
only the following sequence:
1. Request. The process requests the resource. If the request cannot be
granted immediately (for example, if the resource is being used by another
process), then the requesting process must wait until it can acquire the
resource.
2. Use. The process can operate on the resource (for example, if the resource
is a printer, the process can print on the printer).
3. Release. The process releases the resource.
The request and release of resources may be system calls, as explained in
Chapter 2. Examples are the request() and release() device, open() and
close() file, and allocate() and free() memory system calls. Similarly,
as we saw in Chapter 5, the request and release of semaphores can be
accomplished through the wait() and signal() operations on semaphores
or through acquire() and release() of a mutex lock. For each use of a
kernel-managed resource by a process or thread, the operating system checks
to make sure that the process has requested and has been allocated the resource.
A system table records whether each resource is free or allocated. For each
resource that is allocated, the table also records the process to which it is
allocated. If a process requests a resource that is currently allocated to another
process, it can be added to a queue of processes waiting for this resource.
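A sketch of such a table entry might look as follows (the names and layout are assumptions for illustration, not an actual kernel structure): each resource records whether it is allocated, which process holds it, and a FIFO queue of waiting processes.

#include <stdio.h>

#define MAX_WAITERS 32

struct resource_entry {
    int allocated;                 /* 0 = free, 1 = allocated            */
    int holder_pid;                /* valid only when allocated          */
    int wait_queue[MAX_WAITERS];   /* pids of waiting processes, FIFO    */
    int nwaiting;
};

/* Request: grant the resource if it is free; otherwise enqueue the caller. */
int request_resource(struct resource_entry *r, int pid) {
    if (!r->allocated) {
        r->allocated = 1;
        r->holder_pid = pid;
        return 1;                  /* granted immediately */
    }
    if (r->nwaiting < MAX_WAITERS)
        r->wait_queue[r->nwaiting++] = pid;
    return 0;                      /* caller must wait */
}

int main(void) {
    struct resource_entry printer = {0};
    printf("pid 1: %s\n", request_resource(&printer, 1) ? "granted" : "must wait");
    printf("pid 2: %s\n", request_resource(&printer, 2) ? "granted" : "must wait");
    return 0;
}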
A set of processes is in a deadlocked state when every process in the set is
waiting for an event that can be caused only by another process in the set. The
events with which we are mainly concerned here are resource acquisition and
release. The resources may be either physical resources (for example, printers,
tape drives, memory space, and CPU cycles) or logical resources (for example,
semaphores, mutex locks, and files). However, other types of events may result
in deadlocks (for example, the IPC facilities discussed in Chapter 3).
To illustrate a deadlocked state, consider a system with three CD-RW drives.
Suppose each of three processes holds one of these CD-RW drives. If each process
now requests another drive, the three processes will be in a deadlocked state.
Each is waiting for the event “CD-RW is released,” which can be caused only
by one of the other waiting processes. This example illustrates a deadlock
involving the same resource type.
Deadlocks may also involve different resource types. For example, consider
a system with one printer and one DVD drive. Suppose that process Pi is holding
the DVD and process Pj is holding the printer. If Pi requests the printer and Pj
requests the DVD drive, a deadlock occurs.
Developers of multithreaded applications must remain aware of the
possibility of deadlocks. The locking tools presented in Chapter 5 are designed
to avoid race conditions. However, in using these tools, developers must pay
careful attention to how locks are acquired and released. Otherwise, deadlock
can occur, as illustrated in the dining-philosophers problem in Section 5.7.3.
7.2 Deadlock Characterization
In a deadlock, processes never finish executing, and system resources are tied
up, preventing other jobs from starting. Before we discuss the various methods
for dealing with the deadlock problem, we look more closely at features that
characterize deadlocks.
DEADLOCK WITH MUTEX LOCKS
Let’s see how deadlock can occur in a multithreaded Pthread program
using mutex locks. The pthread_mutex_init() function initializes
an unlocked mutex. Mutex locks are acquired and released using
pthread_mutex_lock() and pthread_mutex_unlock(), respectively. If a
thread attempts to acquire a locked mutex, the call to pthread_mutex_lock()
blocks the thread until the owner of the mutex lock invokes
pthread_mutex_unlock().
Two mutex locks are created in the following code example:

/* Create and initialize the mutex locks */
pthread_mutex_t first_mutex;
pthread_mutex_t second_mutex;

pthread_mutex_init(&first_mutex, NULL);
pthread_mutex_init(&second_mutex, NULL);
Next, two threads—thread_one and thread_two—are created, and both
these threads have access to both mutex locks. thread_one and thread_two
run in the functions do_work_one() and do_work_two(), respectively, as
shown below:
/* thread_one runs in this function */
void *do_work_one(void *param)
{
   pthread_mutex_lock(&first_mutex);
   pthread_mutex_lock(&second_mutex);
   /**
    * Do some work
    */
   pthread_mutex_unlock(&second_mutex);
   pthread_mutex_unlock(&first_mutex);

   pthread_exit(0);
}

/* thread_two runs in this function */
void *do_work_two(void *param)
{
   pthread_mutex_lock(&second_mutex);
   pthread_mutex_lock(&first_mutex);
   /**
    * Do some work
    */
   pthread_mutex_unlock(&first_mutex);
   pthread_mutex_unlock(&second_mutex);

   pthread_exit(0);
}
In this example, thread_one attempts to acquire the mutex locks in the order
(1) first_mutex, (2) second_mutex, while thread_two attempts to acquire
the mutex locks in the order (1) second_mutex, (2) first_mutex. Deadlock
is possible if thread_one acquires first_mutex while thread_two acquires
second_mutex.
Note that, even though deadlock is possible, it will not occur if thread_one
can acquire and release the mutex locks for first_mutex and second_mutex
before thread_two attempts to acquire the locks. And, of course, the order
in which the threads run depends on how they are scheduled by the CPU
scheduler. This example illustrates a problem with handling deadlocks: it is
difficult to identify and test for deadlocks that may occur only under certain
scheduling circumstances.
7.2.1 Necessary Conditions
A deadlock situation can arise if the following four conditions hold simultane-
ously in a system:
1. Mutual exclusion. At least one resource must be held in a nonsharable
mode; that is, only one process at a time can use the resource. If another
process requests that resource, the requesting process must be delayed
until the resource has been released.
2. Hold and wait. A process must be holding at least one resource and
waiting to acquire additional resources that are currently being held by
other processes.
3. No preemption. Resources cannot be preempted; that is, a resource can
be released only voluntarily by the process holding it, after that process
has completed its task.
4. Circular wait. A set {P0, P1, ..., Pn} of waiting processes must exist such
that P0 is waiting for a resource held by P1, P1 is waiting for a resource
held by P2, ..., Pn−1 is waiting for a resource held by Pn, and Pn is waiting
for a resource held by P0.
We emphasize that all four conditions must hold for a deadlock to
occur. The circular-wait condition implies the hold-and-wait condition, so the
four conditions are not completely independent. We shall see in Section 7.4,
however, that it is useful to consider each condition separately.
7.2.2 Resource-Allocation Graph
Deadlocks can be described more precisely in terms of a directed graph called
a system resource-allocation graph. This graph consists of a set of vertices V
and a set of edges E. The set of vertices V is partitioned into two different types
of nodes: P = {P1, P2, ..., Pn}, the set consisting of all the active processes in the
system, and R = {R1, R2, ..., Rm}, the set consisting of all resource types in the
system.
A directed edge from process Pi to resource type Rj is denoted by Pi → Rj ;
it signifies that process Pi has requested an instance of resource type Rj and
is currently waiting for that resource. A directed edge from resource type Rj
to process Pi is denoted by Rj → Pi ; it signifies that an instance of resource
type Rj has been allocated to process Pi . A directed edge Pi → Rj is called a
request edge; a directed edge Rj → Pi is called an assignment edge.
Pictorially, we represent each process Pi as a circle and each resource type
Rj as a rectangle. Since resource type Rj may have more than one instance, we
represent each such instance as a dot within the rectangle. Note that a request
edge points to only the rectangle Rj , whereas an assignment edge must also
designate one of the dots in the rectangle.
When process Pi requests an instance of resource type Rj , a request edge
is inserted in the resource-allocation graph. When this request can be fulfilled,
the request edge is instantaneously transformed to an assignment edge. When
the process no longer needs access to the resource, it releases the resource. As
a result, the assignment edge is deleted.
The resource-allocation graph shown in Figure 7.1 depicts the following
situation.
• The sets P, R, and E:
◦ P = {P1, P2, P3}
Figure 7.1 Resource-allocation graph.
◦ R = {R1, R2, R3, R4}
◦ E = {P1 → R1, P2 → R3, R1 → P2, R2 → P2, R2 → P1, R3 → P3}
• Resource instances:
◦ One instance of resource type R1
◦ Two instances of resource type R2
◦ One instance of resource type R3
◦ Three instances of resource type R4
• Process states:
◦ Process P1 is holding an instance of resource type R2 and is waiting for
an instance of resource type R1.
◦ Process P2 is holding an instance of R1 and an instance of R2 and is
waiting for an instance of R3.
◦ Process P3 is holding an instance of R3.
Given the definition of a resource-allocation graph, it can be shown that, if
the graph contains no cycles, then no process in the system is deadlocked. If
the graph does contain a cycle, then a deadlock may exist.
If each resource type has exactly one instance, then a cycle implies that a
deadlock has occurred. If the cycle involves only a set of resource types, each
of which has only a single instance, then a deadlock has occurred. Each process
involved in the cycle is deadlocked. In this case, a cycle in the graph is both a
necessary and a sufficient condition for the existence of deadlock.
If each resource type has several instances, then a cycle does not necessarily
imply that a deadlock has occurred. In this case, a cycle in the graph is a
necessary but not a sufficient condition for the existence of deadlock.
To illustrate this concept, we return to the resource-allocation graph
depicted in Figure 7.1. Suppose that process P3 requests an instance of resource
Figure 7.2 Resource-allocation graph with a deadlock.
type R2. Since no resource instance is currently available, we add a request edge
P3 → R2 to the graph (Figure 7.2). At this point, two minimal cycles exist in the
system:
P1 → R1 → P2 → R3 → P3 → R2 → P1
P2 → R3 → P3 → R2 → P2
Processes P1, P2, and P3 are deadlocked. Process P2 is waiting for the resource
R3, which is held by process P3. Process P3 is waiting for either process P1 or
process P2 to release resource R2. In addition, process P1 is waiting for process
P2 to release resource R1.
Now consider the resource-allocation graph in Figure 7.3. In this example,
we also have a cycle:
P1 → R1 → P3 → R2 → P1
Figure 7.3 Resource-allocation graph with a cycle but no deadlock.
However, there is no deadlock. Observe that process P4 may release its instance
of resource type R2. That resource can then be allocated to P3, breaking the cycle.
In summary, if a resource-allocation graph does not have a cycle, then the
system is not in a deadlocked state. If there is a cycle, then the system may or
may not be in a deadlocked state. This observation is important when we deal
with the deadlock problem.
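When every resource type has a single instance, deciding whether the system is deadlocked therefore reduces to cycle detection in the directed graph. The following C sketch (not from the text) performs a depth-first search over an adjacency matrix whose nodes stand for both processes and resources and reports whether a cycle exists; the sample edges encode the request and assignment edges of the deadlocked cycle discussed above, restricted to single-instance resources.

#include <stdio.h>

#define N 6    /* nodes 0..2 are P1..P3, nodes 3..5 are R1..R3 */

static int adj[N][N];   /* adj[u][v] = 1 if there is an edge u -> v */
static int state[N];    /* 0 = unvisited, 1 = on the DFS stack, 2 = done */

static int dfs(int u) {
    state[u] = 1;
    for (int v = 0; v < N; v++) {
        if (!adj[u][v]) continue;
        if (state[v] == 1) return 1;            /* back edge: cycle found */
        if (state[v] == 0 && dfs(v)) return 1;
    }
    state[u] = 2;
    return 0;
}

int main(void) {
    /* The cycle P1 -> R1 -> P2 -> R3 -> P3 -> R2 -> P1
     * (processes 0..2, resources 3..5). */
    adj[0][3] = adj[3][1] = adj[1][5] = adj[5][2] = adj[2][4] = adj[4][0] = 1;

    int cycle = 0;
    for (int u = 0; u < N && !cycle; u++)
        if (state[u] == 0)
            cycle = dfs(u);
    printf(cycle ? "deadlock (cycle found)\n" : "no cycle, no deadlock\n");
    return 0;
}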
7.3 Methods for Handling Deadlocks
Generally speaking, we can deal with the deadlock problem in one of three
ways:
• We can use a protocol to prevent or avoid deadlocks, ensuring that the
system will never enter a deadlocked state.
• We can allow the system to enter a deadlocked state, detect it, and recover.
• We can ignore the problem altogether and pretend that deadlocks never
occur in the system.
The third solution is the one used by most operating systems, including Linux
and Windows. It is then up to the application developer to write programs that
handle deadlocks.
Next, we elaborate briefly on each of the three methods for handling
deadlocks. Then, in Sections 7.4 through 7.7, we present detailed algorithms.
Before proceeding, we should mention that some researchers have argued that
none of the basic approaches alone is appropriate for the entire spectrum of
resource-allocation problems in operating systems. The basic approaches can
be combined, however, allowing us to select an optimal approach for each class
of resources in a system.
To ensure that deadlocks never occur, the system can use either a deadlock-
prevention or a deadlock-avoidance scheme. Deadlock prevention provides a
set of methods to ensure that at least one of the necessary conditions (Section
7.2.1) cannot hold. These methods prevent deadlocks by constraining how
requests for resources can be made. We discuss these methods in Section 7.4.
Deadlock avoidance requires that the operating system be given additional
information in advance concerning which resources a process will request
and use during its lifetime. With this additional knowledge, the operating
system can decide for each request whether or not the process should wait.
To decide whether the current request can be satisfied or must be delayed, the
system must consider the resources currently available, the resources currently
allocated to each process, and the future requests and releases of each process.
We discuss these schemes in Section 7.5.
If a system does not employ either a deadlock-prevention or a deadlock-
avoidance algorithm, then a deadlock situation may arise. In this environment,
the system can provide an algorithm that examines the state of the system to
determine whether a deadlock has occurred and an algorithm to recover from
the deadlock (if a deadlock has indeed occurred). We discuss these issues in
Section 7.6 and Section 7.7.
In the absence of algorithms to detect and recover from deadlocks, we may
arrive at a situation in which the system is in a deadlocked state yet has no
way of recognizing what has happened. In this case, the undetected deadlock
will cause the system’s performance to deteriorate, because resources are being
held by processes that cannot run and because more and more processes, as
they make requests for resources, will enter a deadlocked state. Eventually, the
system will stop functioning and will need to be restarted manually.
Although this method may not seem to be a viable approach to the deadlock
problem, it is nevertheless used in most operating systems, as mentioned
earlier. Expense is one important consideration. Ignoring the possibility of
deadlocks is cheaper than the other approaches. Since in many systems,
deadlocks occur infrequently (say, once per year), the extra expense of the
other methods may not seem worthwhile. In addition, methods used to recover
from other conditions may be put to use to recover from deadlock. In some
circumstances, a system is in a frozen state but not in a deadlocked state.
We see this situation, for example, with a real-time process running at the
highest priority (or any process running on a nonpreemptive scheduler) and
never returning control to the operating system. The system must have manual
recovery methods for such conditions and may simply use those techniques
for deadlock recovery.
7.4 Deadlock Prevention
As we noted in Section 7.2.1, for a deadlock to occur, each of the four necessary
conditions must hold. By ensuring that at least one of these conditions cannot
hold, we can prevent the occurrence of a deadlock. We elaborate on this
approach by examining each of the four necessary conditions separately.
7.4.1 Mutual Exclusion
The mutual exclusion condition must hold. That is, at least one resource must be
nonsharable. Sharable resources, in contrast, do not require mutually exclusive
access and thus cannot be involved in a deadlock. Read-only files are a good
example of a sharable resource. If several processes attempt to open a read-only
file at the same time, they can be granted simultaneous access to the file. A
process never needs to wait for a sharable resource. In general, however, we
cannot prevent deadlocks by denying the mutual-exclusion condition, because
some resources are intrinsically nonsharable. For example, a mutex lock cannot
be simultaneously shared by several processes.
7.4.2 Hold and Wait
To ensure that the hold-and-wait condition never occurs in the system, we must
guarantee that, whenever a process requests a resource, it does not hold any
other resources. One protocol that we can use requires each process to request
and be allocated all its resources before it begins execution. We can implement
this provision by requiring that system calls requesting resources for a process
precede all other system calls.
An alternative protocol allows a process to request resources only when
it has none. A process may request some resources and use them. Before it
can request any additional resources, it must release all the resources that it is
currently allocated.
To illustrate the difference between these two protocols, we consider a
process that copies data from a DVD drive to a file on disk, sorts the file, and
then prints the results to a printer. If all resources must be requested at the
beginning of the process, then the process must initially request the DVD drive,
disk file, and printer. It will hold the printer for its entire execution, even though
it needs the printer only at the end.
The second method allows the process to request initially only the DVD
drive and disk file. It copies from the DVD drive to the disk and then releases
both the DVD drive and the disk file. The process must then request the disk
file and the printer. After copying the disk file to the printer, it releases these
two resources and terminates.
Both these protocols have two main disadvantages. First, resource utiliza-
tion may be low, since resources may be allocated but unused for a long period.
In the example given, for instance, we can release the DVD drive and disk file,
and then request the disk file and printer, only if we can be sure that our data
will remain on the disk file. Otherwise, we must request all resources at the
beginning for both protocols.
Second, starvation is possible. A process that needs several popular
resources may have to wait indefinitely, because at least one of the resources
that it needs is always allocated to some other process.
7.4.3 No Preemption
The third necessary condition for deadlocks is that there be no preemption
of resources that have already been allocated. To ensure that this condition
does not hold, we can use the following protocol. If a process is holding
some resources and requests another resource that cannot be immediately
allocated to it (that is, the process must wait), then all resources the process is
currently holding are preempted. In other words, these resources are implicitly
released. The preempted resources are added to the list of resources for which
the process is waiting. The process will be restarted only when it can regain its
old resources, as well as the new ones that it is requesting.
Alternatively, if a process requests some resources, we first check whether
they are available. If they are, we allocate them. If they are not, we check
whether they are allocated to some other process that is waiting for additional
resources. If so, we preempt the desired resources from the waiting process and
allocate them to the requesting process. If the resources are neither available
nor held by a waiting process, the requesting process must wait. While it is
waiting, some of its resources may be preempted, but only if another process
requests them. A process can be restarted only when it is allocated the new
resources it is requesting and recovers any resources that were preempted
while it was waiting.
This protocol is often applied to resources whose state can be easily saved
and restored later, such as CPU registers and memory space. It cannot generally
be applied to such resources as mutex locks and semaphores.
7.4.4 Circular Wait
The fourth and final condition for deadlocks is the circular-wait condition. One
way to ensure that this condition never holds is to impose a total ordering of
all resource types and to require that each process requests resources in an
increasing order of enumeration.
To illustrate, we let R = {R1, R2, ..., Rm} be the set of resource types. We
assign to each resource type a unique integer number, which allows us to
compare two resources and to determine whether one precedes another in our
ordering. Formally, we define a one-to-one function F: R → N, where N is the
set of natural numbers. For example, if the set of resource types R includes
tape drives, disk drives, and printers, then the function F might be defined as
follows:
F(tape drive) = 1
F(disk drive) = 5
F(printer) = 12
We can now consider the following protocol to prevent deadlocks: Each
process can request resources only in an increasing order of enumeration. That
is, a process can initially request any number of instances of a resource type
—say, Ri . After that, the process can request instances of resource type Rj if
and only if F(Rj) > F(Ri). For example, using the function defined previously,
a process that wants to use the tape drive and printer at the same time must
first request the tape drive and then request the printer. Alternatively, we can
require that a process requesting an instance of resource type Rj must have
released any resources Ri such that F(Ri ) ≥ F(Rj ). Note also that if several
instances of the same resource type are needed, a single request for all of them
must be issued.
If these two protocols are used, then the circular-wait condition cannot
hold. We can demonstrate this fact by assuming that a circular wait exists
(proof by contradiction). Let the set of processes involved in the circular wait be
{P0, P1, ..., Pn}, where Pi is waiting for a resource Ri , which is held by process
Pi+1. (Modulo arithmetic is used on the indexes, so that Pn is waiting for
a resource Rn held by P0.) Then, since process Pi+1 is holding resource Ri
while requesting resource Ri+1, we must have F(Ri) < F(Ri+1) for all i. But
this condition means that F(R0) < F(R1) < ... < F(Rn) < F(R0). By transitivity,
F(R0) < F(R0), which is impossible. Therefore, there can be no circular wait.
We can accomplish this scheme in an application program by developing
an ordering among all synchronization objects in the system. All requests for
synchronization objects must be made in increasing order. For example, if the
lock ordering in the Pthread program shown in Figure 7.4 was
F(first_mutex) = 1
F(second_mutex) = 5
then thread_two could not request the locks out of order.
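In practice, the ordering F is often realized simply by always locking the "smaller" lock first. The helper below is a minimal sketch (these functions are not from the text) that uses the locks' addresses as the total order, so any pair of mutexes is always acquired in the same sequence by every thread.

#include <pthread.h>
#include <stdint.h>

void lock_pair_in_order(pthread_mutex_t *a, pthread_mutex_t *b) {
    if ((uintptr_t)a > (uintptr_t)b) {      /* impose a total order on the locks */
        pthread_mutex_t *tmp = a; a = b; b = tmp;
    }
    pthread_mutex_lock(a);
    pthread_mutex_lock(b);
}

void unlock_pair(pthread_mutex_t *a, pthread_mutex_t *b) {
    pthread_mutex_unlock(a);                /* release order does not matter */
    pthread_mutex_unlock(b);
}

/* Usage: lock_pair_in_order(&first_mutex, &second_mutex) and
 * lock_pair_in_order(&second_mutex, &first_mutex) acquire the two locks in
 * the same underlying order, so neither caller can hold one lock while
 * waiting for the other. */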
Keep in mind that developing an ordering, or hierarchy, does not in itself
prevent deadlock. It is up to application developers to write programs that
follow the ordering. Also note that the function F should be defined according
to the normal order of usage of the resources in a system. For example, because
/* thread_one runs in this function */
void *do_work_one(void *param)
{
   pthread_mutex_lock(&first_mutex);
   pthread_mutex_lock(&second_mutex);
   /**
    * Do some work
    */
   pthread_mutex_unlock(&second_mutex);
   pthread_mutex_unlock(&first_mutex);

   pthread_exit(0);
}

/* thread_two runs in this function */
void *do_work_two(void *param)
{
   pthread_mutex_lock(&second_mutex);
   pthread_mutex_lock(&first_mutex);
   /**
    * Do some work
    */
   pthread_mutex_unlock(&first_mutex);
   pthread_mutex_unlock(&second_mutex);

   pthread_exit(0);
}
Figure 7.4 Deadlock example.
the tape drive is usually needed before the printer, it would be reasonable to
define F(tape drive) < F(printer).
Although ensuring that resources are acquired in the proper order is the
responsibility of application developers, certain software can be used to verify
that locks are acquired in the proper order and to give appropriate warnings
when locks are acquired out of order and deadlock is possible. One lock-order
verifier, which works on BSD versions of UNIX such as FreeBSD, is known as
witness. Witness uses mutual-exclusion locks to protect critical sections, as
described in Chapter 5. It works by dynamically maintaining the relationship
of lock orders in a system. Let’s use the program shown in Figure 7.4 as an
example. Assume that thread_one is the first to acquire the locks and does so in
the order (1) first_mutex, (2) second_mutex. Witness records the relationship
that first_mutex must be acquired before second_mutex. If thread_two later
acquires the locks out of order, witness generates a warning message on the
system console.
It is also important to note that imposing a lock ordering does not guarantee
deadlock prevention if locks can be acquired dynamically. For example, assume
we have a function that transfers funds between two accounts. To prevent a
race condition, each account has an associated mutex lock that is obtained from
a get_lock() function such as the one shown in Figure 7.5:
void transaction(Account from, Account to, double amount)
{
   mutex lock1, lock2;

   lock1 = get_lock(from);
   lock2 = get_lock(to);

   acquire(lock1);
   acquire(lock2);

   withdraw(from, amount);
   deposit(to, amount);

   release(lock2);
   release(lock1);
}
Figure 7.5 Deadlock example with lock ordering.
Deadlock is possible if two threads simultaneously invoke the transaction()
function, transposing different accounts. That is, one thread might invoke
transaction(checking_account, savings_account, 25);
and another might invoke
transaction(savings_account, checking_account, 50);
We leave it as an exercise for students to fix this situation.
7.5 Deadlock Avoidance
Deadlock-prevention algorithms, as discussed in Section 7.4, prevent deadlocks
by limiting how requests can be made. The limits ensure that at least one of
the necessary conditions for deadlock cannot occur. Possible side effects of
preventing deadlocks by this method, however, are low device utilization and
reduced system throughput.
An alternative method for avoiding deadlocks is to require additional
information about how resources are to be requested. For example, in a system
with one tape drive and one printer, the system might need to know that
process P will request first the tape drive and then the printer before releasing
both resources, whereas process Q will request first the printer and then the
tape drive. With this knowledge of the complete sequence of requests and
releases for each process, the system can decide for each request whether or
not the process should wait in order to avoid a possible future deadlock. Each
request requires that in making this decision the system consider the resources
currently available, the resources currently allocated to each process, and the
future requests and releases of each process.
The various algorithms that use this approach differ in the amount and
type of information required. The simplest and most useful model requires
that each process declare the maximum number of resources of each type that
it may need. Given this a priori information, it is possible to construct an
algorithm that ensures that the system will never enter a deadlocked state. A
deadlock-avoidance algorithm dynamically examines the resource-allocation
state to ensure that a circular-wait condition can never exist. The resource-
allocation state is defined by the number of available and allocated resources
and the maximum demands of the processes. In the following sections, we
explore two deadlock-avoidance algorithms.
7.5.1 Safe State
A state is safe if the system can allocate resources to each process (up to its
maximum) in some order and still avoid a deadlock. More formally, a system
is in a safe state only if there exists a safe sequence. A sequence of processes
<P1, P2, ..., Pn> is a safe sequence for the current allocation state if, for each
Pi, the resource requests that Pi can still make can be satisfied by the currently
available resources plus the resources held by all Pj, with j < i. In this situation,
if the resources that Pi needs are not immediately available, then Pi can wait
until all Pj have finished. When they have finished, Pi can obtain all of its
needed resources, complete its designated task, return its allocated resources,
and terminate. When Pi terminates, Pi+1 can obtain its needed resources, and
so on. If no such sequence exists, then the system state is said to be unsafe.
A safe state is not a deadlocked state. Conversely, a deadlocked state is
an unsafe state. Not all unsafe states are deadlocks, however (Figure 7.6).
An unsafe state may lead to a deadlock. As long as the state is safe, the
operating system can avoid unsafe (and deadlocked) states. In an unsafe state,
the operating system cannot prevent processes from requesting resources in
such a way that a deadlock occurs. The behavior of the processes controls
unsafe states.
To illustrate, we consider a system with twelve magnetic tape drives and
three processes: P0, P1, and P2. Process P0 requires ten tape drives, process P1
may need as many as four tape drives, and process P2 may need up to nine tape
drives. Suppose that, at time t0, process P0 is holding five tape drives, process
P1 is holding two tape drives, and process P2 is holding two tape drives. (Thus,
there are three free tape drives.)
Figure 7.6 Safe, unsafe, and deadlocked state spaces. (The deadlocked states form
a subset of the unsafe states; all remaining states are safe.)
           Maximum Needs    Current Needs
    P0           10               5
    P1            4               2
    P2            9               2
At time t0, the system is in a safe state. The sequence <P1, P0, P2> satisfies
the safety condition. Process P1 can immediately be allocated all its tape drives
and then return them (the system will then have five available tape drives);
then process P0 can get all its tape drives and return them (the system will then
have ten available tape drives); and finally process P2 can get all its tape drives
and return them (the system will then have all twelve tape drives available).
A system can go from a safe state to an unsafe state. Suppose that, at time
t1, process P2 requests and is allocated one more tape drive. The system is no
longer in a safe state. At this point, only process P1 can be allocated all its tape
drives. When it returns them, the system will have only four available tape
drives. Since process P0 is allocated five tape drives but has a maximum of ten,
it may request five more tape drives. If it does so, it will have to wait, because
they are unavailable. Similarly, process P2 may request six additional tape
drives and have to wait, resulting in a deadlock. Our mistake was in granting
the request from process P2 for one more tape drive. If we had made P2 wait
until either of the other processes had finished and released its resources, then
we could have avoided the deadlock.
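The safety test used in this argument can be expressed directly in code for a single resource type. The following C sketch (not from the text) checks the tape-drive example: it repeatedly looks for a process whose remaining need fits in the available drives, lets it finish, and reclaims its allocation; it reports the original state as safe and the state after granting P2 one more drive as unsafe.

#include <stdio.h>

int is_safe(int total, const int max[], const int alloc[], int n) {
    int available = total, finished[16] = {0}, done = 0;
    for (int i = 0; i < n; i++) available -= alloc[i];

    while (done < n) {
        int progress = 0;
        for (int i = 0; i < n; i++) {
            if (!finished[i] && max[i] - alloc[i] <= available) {
                available += alloc[i];       /* Pi runs to completion and    */
                finished[i] = 1;             /* returns everything it holds  */
                done++;
                progress = 1;
            }
        }
        if (!progress) return 0;             /* no process can finish: unsafe */
    }
    return 1;
}

int main(void) {
    int max[]   = {10, 4, 9};                /* P0, P1, P2 */
    int alloc[] = { 5, 2, 2};
    printf("state is %s\n", is_safe(12, max, alloc, 3) ? "safe" : "unsafe");

    alloc[2] = 3;                            /* grant P2 one more tape drive */
    printf("after granting P2 one more drive: %s\n",
           is_safe(12, max, alloc, 3) ? "safe" : "unsafe");
    return 0;
}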
Given the concept of a safe state, we can define avoidance algorithms that
ensure that the system will never deadlock. The idea is simply to ensure that the
system will always remain in a safe state. Initially, the system is in a safe state.
Whenever a process requests a resource that is currently available, the system
must decide whether the resource can be allocated immediately or whether
the process must wait. The request is granted only if the allocation leaves the
system in a safe state.
In this scheme, if a process requests a resource that is currently available,
it may still have to wait. Thus, resource utilization may be lower than it would
otherwise be.
7.5.2 Resource-Allocation-Graph Algorithm
If we have a resource-allocation system with only one instance of each resource
type, we can use a variant of the resource-allocation graph defined in Section
7.2.2 for deadlock avoidance. In addition to the request and assignment edges
already described, we introduce a new type of edge, called a claim edge.
A claim edge Pi → Rj indicates that process Pi may request resource Rj at
some time in the future. This edge resembles a request edge in direction but is
represented in the graph by a dashed line. When process Pi requests resource
Rj , the claim edge Pi → Rj is converted to a request edge. Similarly, when a
resource Rj is released by Pi , the assignment edge Rj → Pi is reconverted to a
claim edge Pi → Rj .
Note that the resources must be claimed a priori in the system. That is,
before process Pi starts executing, all its claim edges must already appear in
the resource-allocation graph. We can relax this condition by allowing a claim
edge Pi → Rj to be added to the graph only if all the edges associated with
process Pi are claim edges.
Figure 7.7 Resource-allocation graph for deadlock avoidance.
Now suppose that process Pi requests resource Rj . The request can be
granted only if converting the request edge Pi → Rj to an assignment edge
Rj → Pi does not result in the formation of a cycle in the resource-allocation
graph. We check for safety by using a cycle-detection algorithm. An algorithm
for detecting a cycle in this graph requires an order of n² operations, where n
is the number of processes in the system.
If no cycle exists, then the allocation of the resource will leave the system
in a safe state. If a cycle is found, then the allocation will put the system in
an unsafe state. In that case, process Pi will have to wait for its requests to be
satisfied.
To illustrate this algorithm, we consider the resource-allocation graph of
Figure 7.7. Suppose that P2 requests R2. Although R2 is currently free, we
cannot allocate it to P2, since this action will create a cycle in the graph (Figure
7.8). A cycle, as mentioned, indicates that the system is in an unsafe state. If P1
requests R2, and P2 requests R1, then a deadlock will occur.
7.5.3 Banker’s Algorithm
The resource-allocation-graph algorithm is not applicable to a resource-
allocation system with multiple instances of each resource type. The deadlock-
avoidance algorithm that we describe next is applicable to such a system but
is less efficient than the resource-allocation graph scheme. This algorithm is
commonly known as the banker’s algorithm. The name was chosen because
the algorithm could be used in a banking system to ensure that the bank never
allocated its available cash in such a way that it could no longer satisfy the
needs of all its customers.

Figure 7.8 An unsafe state in a resource-allocation graph.
When a new process enters the system, it must declare the maximum
number of instances of each resource type that it may need. This number may
not exceed the total number of resources in the system. When a user requests
a set of resources, the system must determine whether the allocation of these
resources will leave the system in a safe state. If it will, the resources are
allocated; otherwise, the process must wait until some other process releases
enough resources.
Several data structures must be maintained to implement the banker’s
algorithm. These data structures encode the state of the resource-allocation
system. We need the following data structures, where n is the number of
processes in the system and m is the number of resource types:
• Available. A vector of length m indicates the number of available resources
of each type. If Available[j] equals k, then k instances of resource type Rj
are available.
• Max. An n × m matrix defines the maximum demand of each process.
If Max[i][j] equals k, then process Pi may request at most k instances of
resource type Rj .
• Allocation. An n × m matrix defines the number of resources of each type
currently allocated to each process. If Allocation[i][j] equals k, then process
Pi is currently allocated k instances of resource type Rj .
• Need. An n × m matrix indicates the remaining resource need of each
process. If Need[i][j] equals k, then process Pi may need k more instances
of resource type Rj to complete its task. Note that Need[i][j] equals Max[i][j]
− Allocation[i][j].
These data structures vary over time in both size and value.
To simplify the presentation of the banker’s algorithm, we next establish
some notation. Let X and Y be vectors of length n. We say that X ≤ Y if and
only if X[i] ≤ Y[i] for all i = 1, 2, ..., n. For example, if X = (1,7,3,2) and Y =
(0,3,2,1), then Y ≤ X. In addition, Y < X if Y ≤ X and Y ≠ X.
We can treat each row in the matrices Allocation and Need as vectors
and refer to them as Allocationi and Needi . The vector Allocationi specifies
the resources currently allocated to process Pi ; the vector Needi specifies the
additional resources that process Pi may still request to complete its task.
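
As a concrete illustration, this comparison is only a few lines of C. The sketch
below is ours, not part of the book's pseudocode; the name vec_leq and the
explicit length parameter are illustrative choices:

#include <stdbool.h>

/* X <= Y iff X[i] <= Y[i] for every component i (Section 7.5.3). */
bool vec_leq(const int x[], const int y[], int len)
{
    for (int i = 0; i < len; i++)
        if (x[i] > y[i])
            return false;   /* some component of X exceeds Y */
    return true;
}

With such a helper, tests like Needi ≤ Work in the algorithms that follow
become single function calls.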
7.5.3.1 Safety Algorithm
We can now present the algorithm for finding out whether or not a system is
in a safe state. This algorithm can be described as follows:
1. Let Work and Finish be vectors of length m and n, respectively. Initialize
Work = Available and Finish[i] = false for i = 0, 1, ..., n − 1.
2. Find an index i such that both
a. Finish[i] == false
b. Needi ≤ Work
If no such i exists, go to step 4.
3. Work = Work + Allocationi
Finish[i] = true
Go to step 2.
4. If Finish[i] == true for all i, then the system is in a safe state.
This algorithm may require an order of m × n² operations to determine whether
a state is safe.
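
The safety algorithm maps almost line for line onto C. The following is a
minimal illustrative sketch, not a definitive implementation: the sizes N and
M, the function name is_safe, and the parameter layout are our own choices,
and the three arrays play the roles of Available, Allocation, and Need.

#include <stdbool.h>

#define N 5   /* number of processes (illustrative) */
#define M 3   /* number of resource types (illustrative) */

/* Returns true if the state described by the three arrays is safe. */
bool is_safe(int available[M], int allocation[N][M], int need[N][M])
{
    int work[M];
    bool finish[N] = { false };          /* step 1 */

    for (int j = 0; j < M; j++)
        work[j] = available[j];          /* Work = Available */

    bool progressed = true;
    while (progressed) {
        progressed = false;
        for (int i = 0; i < N; i++) {    /* step 2: Finish[i] == false */
            if (finish[i])
                continue;
            bool fits = true;            /* ... and Need_i <= Work */
            for (int j = 0; j < M; j++)
                if (need[i][j] > work[j]) { fits = false; break; }
            if (!fits)
                continue;
            for (int j = 0; j < M; j++)  /* step 3: P_i finishes and */
                work[j] += allocation[i][j];  /* returns its resources */
            finish[i] = true;
            progressed = true;
        }
    }
    for (int i = 0; i < N; i++)          /* step 4 */
        if (!finish[i])
            return false;                /* some P_i can never finish */
    return true;
}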
7.5.3.2 Resource-Request Algorithm
Next, we describe the algorithm for determining whether requests can be safely
granted.
Let Requesti be the request vector for process Pi. If Requesti[j] == k, then
process Pi wants k instances of resource type Rj . When a request for resources
is made by process Pi , the following actions are taken:
1. If Requesti ≤ Needi , go to step 2. Otherwise, raise an error condition, since
the process has exceeded its maximum claim.
2. If Requesti ≤ Available, go to step 3. Otherwise, Pi must wait, since the
resources are not available.
3. Have the system pretend to have allocated the requested resources to
process Pi by modifying the state as follows:
Available = Available–Requesti ;
Allocationi = Allocationi + Requesti ;
Needi = Needi –Requesti ;
If the resulting resource-allocation state is safe, the transaction is com-
pleted, and process Pi is allocated its resources. However, if the new state
is unsafe, then Pi must wait for Requesti , and the old resource-allocation
state is restored.
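
Reusing the is_safe() sketch above (and its illustrative N and M), the
resource-request algorithm can be outlined as follows. This is again a sketch
under our own naming assumptions; for brevity it returns -1 both for the
error condition of step 1 and for a required wait, and 0 when the request is
granted:

/* Steps 1-3 of the resource-request algorithm for process i. */
int request_resources(int i, int request[M], int available[M],
                      int allocation[N][M], int need[N][M])
{
    for (int j = 0; j < M; j++)
        if (request[j] > need[i][j])
            return -1;                   /* step 1: exceeds maximum claim */
    for (int j = 0; j < M; j++)
        if (request[j] > available[j])
            return -1;                   /* step 2: P_i must wait */

    for (int j = 0; j < M; j++) {        /* step 3: pretend to allocate */
        available[j]     -= request[j];
        allocation[i][j] += request[j];
        need[i][j]       -= request[j];
    }
    if (is_safe(available, allocation, need))
        return 0;                        /* new state is safe: grant */

    for (int j = 0; j < M; j++) {        /* unsafe: restore the old state */
        available[j]     += request[j];
        allocation[i][j] -= request[j];
        need[i][j]       += request[j];
    }
    return -1;                           /* P_i must wait for Request_i */
}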
7.5.3.3 An Illustrative Example
To illustrate the use of the banker’s algorithm, consider a system with five
processes P0 through P4 and three resource types A, B, and C. Resource type A
has ten instances, resource type B has five instances, and resource type C has
seven instances. Suppose that, at time T0, the following snapshot of the system
has been taken:
     Allocation   Max     Available
     A B C        A B C   A B C
P0   0 1 0        7 5 3   3 3 2
P1   2 0 0        3 2 2
P2   3 0 2        9 0 2
P3   2 1 1        2 2 2
P4   0 0 2        4 3 3
The content of the matrix Need is defined to be Max − Allocation and is as
follows:
     Need
     A B C
P0   7 4 3
P1   1 2 2
P2   6 0 0
P3   0 1 1
P4   4 3 1
We claim that the system is currently in a safe state. Indeed, the sequence
P1, P3, P4, P2, P0 satisfies the safety criteria. Suppose now that process
P1 requests one additional instance of resource type A and two instances of
resource type C, so Request1 = (1,0,2). To decide whether this request can be
immediately granted, we first check that Request1 ≤ Available—that is, that
(1,0,2) ≤ (3,3,2), which is true. We then pretend that this request has been
fulfilled, and we arrive at the following new state:
     Allocation   Need    Available
     A B C        A B C   A B C
P0   0 1 0        7 4 3   2 3 0
P1   3 0 2        0 2 0
P2   3 0 2        6 0 0
P3   2 1 1        0 1 1
P4   0 0 2        4 3 1
We must determine whether this new system state is safe. To do so, we
execute our safety algorithm and find that the sequence P1, P3, P4, P0, P2
satisfies the safety requirement. Hence, we can immediately grant the request
of process P1.
You should be able to see, however, that when the system is in this state, a
request for (3,3,0) by P4 cannot be granted, since the resources are not available.
Furthermore, a request for (0,2,0) by P0 cannot be granted, even though the
resources are available, since the resulting state is unsafe.
We leave it as a programming exercise for students to implement the
banker’s algorithm.
7.6 Deadlock Detection
If a system does not employ either a deadlock-prevention or a deadlock-
avoidance algorithm, then a deadlock situation may occur. In this environment,
the system may provide:
• An algorithm that examines the state of the system to determine whether
a deadlock has occurred
• An algorithm to recover from the deadlock
Figure 7.9 (a) Resource-allocation graph. (b) Corresponding wait-for graph.
In the following discussion, we elaborate on these two requirements as they
pertain to systems with only a single instance of each resource type, as well as to
systems with several instances of each resource type. At this point, however, we
note that a detection-and-recovery scheme requires overhead that includes not
only the run-time costs of maintaining the necessary information and executing
the detection algorithm but also the potential losses inherent in recovering from
a deadlock.
7.6.1 Single Instance of Each Resource Type
If all resources have only a single instance, then we can define a deadlock-
detection algorithm that uses a variant of the resource-allocation graph, called
a wait-for graph. We obtain this graph from the resource-allocation graph by
removing the resource nodes and collapsing the appropriate edges.
More precisely, an edge from Pi to Pj in a wait-for graph implies that
process Pi is waiting for process Pj to release a resource that Pi needs. An edge
Pi → Pj exists in a wait-for graph if and only if the corresponding resource-
allocation graph contains two edges Pi → Rq and Rq → Pj for some resource
Rq . In Figure 7.9, we present a resource-allocation graph and the corresponding
wait-for graph.
As before, a deadlock exists in the system if and only if the wait-for graph
contains a cycle. To detect deadlocks, the system needs to maintain the wait-
for graph and periodically invoke an algorithm that searches for a cycle in
the graph. An algorithm to detect a cycle in a graph requires an order of n²
operations, where n is the number of vertices in the graph.
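
One standard cycle-detection technique is a depth-first search that marks each
vertex white (unvisited), gray (on the current search path), or black (fully
explored); revisiting a gray vertex means a cycle exists. The C sketch below is
illustrative only: the adjacency matrix wait_for and the constant N are our own
assumptions, with wait_for[i][j] nonzero meaning that Pi waits for Pj.

#include <stdbool.h>

#define N 5   /* number of processes (illustrative) */

enum { WHITE, GRAY, BLACK };

static bool dfs(int u, int wait_for[N][N], int color[N])
{
    color[u] = GRAY;                     /* u is on the current path */
    for (int v = 0; v < N; v++) {
        if (!wait_for[u][v])
            continue;
        if (color[v] == GRAY)            /* back edge: a cycle exists */
            return true;
        if (color[v] == WHITE && dfs(v, wait_for, color))
            return true;
    }
    color[u] = BLACK;                    /* everything reachable from u done */
    return false;
}

/* Returns true iff the wait-for graph contains a cycle (a deadlock). */
bool has_cycle(int wait_for[N][N])
{
    int color[N] = { WHITE };            /* all vertices start white (0) */
    for (int u = 0; u < N; u++)
        if (color[u] == WHITE && dfs(u, wait_for, color))
            return true;
    return false;
}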
7.6.2 Several Instances of a Resource Type
The wait-for graph scheme is not applicable to a resource-allocation system
with multiple instances of each resource type. We turn now to a deadlock-
detection algorithm that is applicable to such a system. The algorithm employs
several time-varying data structures that are similar to those used in the
banker’s algorithm (Section 7.5.3):
• Available. A vector of length m indicates the number of available resources
of each type.
• Allocation. An n × m matrix defines the number of resources of each type
currently allocated to each process.
• Request. An n × m matrix indicates the current request of each process.
If Request[i][j] equals k, then process Pi is requesting k more instances of
resource type Rj .
The ≤ relation between two vectors is defined as in Section 7.5.3. To simplify
notation, we again treat the rows in the matrices Allocation and Request as
vectors; we refer to them as Allocationi and Requesti . The detection algorithm
described here simply investigates every possible allocation sequence for the
processes that remain to be completed. Compare this algorithm with the
banker’s algorithm of Section 7.5.3.
1. Let Work and Finish be vectors of length m and n, respectively. Initialize
Work = Available. For i = 0, 1, ..., n–1, if Allocationi ≠ 0, then Finish[i] =
false. Otherwise, Finish[i] = true.
2. Find an index i such that both
a. Finish[i] == false
b. Requesti ≤ Work
If no such i exists, go to step 4.
3. Work = Work + Allocationi
Finish[i] = true
Go to step 2.
4. If Finish[i] == false for some i, 0 ≤ i < n, then the system is in a deadlocked
state. Moreover, if Finish[i] == false, then process Pi is deadlocked.
This algorithm requires an order of m × n² operations to detect whether the
system is in a deadlocked state.
You may wonder why we reclaim the resources of process Pi (in step 3) as
soon as we determine that Requesti ≤ Work (in step 2b). We know that Pi is
currently not involved in a deadlock (since Requesti ≤ Work). Thus, we take
an optimistic attitude and assume that Pi will require no more resources to
complete its task; it will thus soon return all currently allocated resources to
the system. If our assumption is incorrect, a deadlock may occur later. That
deadlock will be detected the next time the deadlock-detection algorithm is
invoked.
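
For illustration, the detection algorithm differs from the is_safe() sketch of
Section 7.5.3.1 in only two places: Finish is initialized from Allocation rather
than uniformly to false, and Requesti rather than Needi is compared with Work.
The sketch below reuses the illustrative N and M from that sketch and
additionally reports which processes are deadlocked; all names are ours.

/* Returns true if the system is deadlocked; deadlocked[i] is set to
   true for every process P_i involved in the deadlock. */
bool detect_deadlock(int available[M], int allocation[N][M],
                     int request[N][M], bool deadlocked[N])
{
    int work[M];
    bool finish[N];

    for (int j = 0; j < M; j++)          /* step 1: Work = Available */
        work[j] = available[j];
    for (int i = 0; i < N; i++) {        /* Finish[i] = true iff P_i */
        finish[i] = true;                /* holds no resources */
        for (int j = 0; j < M; j++)
            if (allocation[i][j] != 0) { finish[i] = false; break; }
    }

    bool progressed = true;
    while (progressed) {
        progressed = false;
        for (int i = 0; i < N; i++) {    /* step 2: unfinished P_i with */
            if (finish[i])               /* Request_i <= Work */
                continue;
            bool fits = true;
            for (int j = 0; j < M; j++)
                if (request[i][j] > work[j]) { fits = false; break; }
            if (!fits)
                continue;
            for (int j = 0; j < M; j++)  /* step 3: optimistically reclaim */
                work[j] += allocation[i][j];
            finish[i] = true;
            progressed = true;
        }
    }

    bool deadlock = false;               /* step 4 */
    for (int i = 0; i < N; i++) {
        deadlocked[i] = !finish[i];
        if (!finish[i])
            deadlock = true;
    }
    return deadlock;
}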
To illustrate this algorithm, we consider a system with five processes P0
through P4 and three resource types A, B, and C. Resource type A has seven
instances, resource type B has two instances, and resource type C has six
instances. Suppose that, at time T0, we have the following resource-allocation
state:
     Allocation   Request   Available
     A B C        A B C     A B C
P0   0 1 0        0 0 0     0 0 0
P1   2 0 0        2 0 2
P2   3 0 3        0 0 0
P3   2 1 1        1 0 0
P4   0 0 2        0 0 2
We claim that the system is not in a deadlocked state. Indeed, if we execute
our algorithm, we will find that the sequence P0, P2, P3, P1, P4 results in
Finish[i] == true for all i.
Suppose now that process P2 makes one additional request for an instance
of type C. The Request matrix is modified as follows:
     Request
     A B C
P0   0 0 0
P1   2 0 2
P2   0 0 1
P3   1 0 0
P4   0 0 2
We claim that the system is now deadlocked. Although we can reclaim the
resources held by process P0, the number of available resources is not sufficient
to fulfill the requests of the other processes. Thus, a deadlock exists, consisting
of processes P1, P2, P3, and P4.
7.6.3 Detection-Algorithm Usage
When should we invoke the detection algorithm? The answer depends on two
factors:
1. How often is a deadlock likely to occur?
2. How many processes will be affected by deadlock when it happens?
If deadlocks occur frequently, then the detection algorithm should be invoked
frequently. Resources allocated to deadlocked processes will be idle until the
deadlock can be broken. In addition, the number of processes involved in the
deadlock cycle may grow.
Deadlocks occur only when some process makes a request that cannot be
granted immediately. This request may be the final request that completes a
chain of waiting processes. In the extreme, then, we can invoke the deadlock-
detection algorithm every time a request for allocation cannot be granted
immediately. In this case, we can identify not only the deadlocked set of
processes but also the specific process that “caused” the deadlock. (In reality,
each of the deadlocked processes is a link in the cycle in the resource graph, so
all of them, jointly, caused the deadlock.) If there are many different resource
types, one request may create many cycles in the resource graph, each cycle
completed by the most recent request and “caused” by the one identifiable
process.
Of course, invoking the deadlock-detection algorithm for every resource
request will incur considerable overhead in computation time. A less expensive
alternative is simply to invoke the algorithm at defined intervals—for example,
once per hour or whenever CPU utilization drops below 40 percent. (A deadlock
eventually cripples system throughput and causes CPU utilization to drop.) If
the detection algorithm is invoked at arbitrary points in time, the resource
graph may contain many cycles. In this case, we generally cannot tell which of
the many deadlocked processes “caused” the deadlock.
7.7 Recovery from Deadlock
When a detection algorithm determines that a deadlock exists, several alter-
natives are available. One possibility is to inform the operator that a deadlock
has occurred and to let the operator deal with the deadlock manually. Another
possibility is to let the system recover from the deadlock automatically. There
are two options for breaking a deadlock. One is simply to abort one or more
processes to break the circular wait. The other is to preempt some resources
from one or more of the deadlocked processes.
7.7.1 Process Termination
To eliminate deadlocks by aborting a process, we use one of two methods. In
both methods, the system reclaims all resources allocated to the terminated
processes.
• Abort all deadlocked processes. This method clearly will break the
deadlock cycle, but at great expense. The deadlocked processes may have
computed for a long time, and the results of these partial computations
must be discarded and probably will have to be recomputed later.
• Abort one process at a time until the deadlock cycle is eliminated. This
method incurs considerable overhead, since after each process is aborted, a
deadlock-detection algorithm must be invoked to determine whether any
processes are still deadlocked.
Aborting a process may not be easy. If the process was in the midst of
updating a file, terminating it will leave that file in an incorrect state. Similarly,
if the process was in the midst of printing data on a printer, the system must
reset the printer to a correct state before printing the next job.
If the partial termination method is used, then we must determine which
deadlocked process (or processes) should be terminated. This determination is
a policy decision, similar to CPU-scheduling decisions. The question is basically
an economic one; we should abort those processes whose termination will incur
the minimum cost. Unfortunately, the term minimum cost is not a precise one.
Many factors may affect which process is chosen, including:
1. What the priority of the process is
2. How long the process has computed and how much longer the process
will compute before completing its designated task
3. How many and what types of resources the process has used (for example,
whether the resources are simple to preempt)
4. How many more resources the process needs in order to complete
5. How many processes will need to be terminated
6. Whether the process is interactive or batch
7.7.2 Resource Preemption
To eliminate deadlocks using resource preemption, we successively preempt
some resources from processes and give these resources to other processes until
the deadlock cycle is broken.
If preemption is required to deal with deadlocks, then three issues need to
be addressed:
1. Selecting a victim. Which resources and which processes are to be
preempted? As in process termination, we must determine the order of
preemption to minimize cost. Cost factors may include such parameters
as the number of resources a deadlocked process is holding and the
amount of time the process has thus far consumed.
2. Rollback. If we preempt a resource from a process, what should be done
with that process? Clearly, it cannot continue with its normal execution; it
is missing some needed resource. We must roll back the process to some
safe state and restart it from that state.
Since, in general, it is difficult to determine what a safe state is, the
simplest solution is a total rollback: abort the process and then restart
it. Although it is more effective to roll back the process only as far as
necessary to break the deadlock, this method requires the system to keep
more information about the state of all running processes.
3. Starvation. How do we ensure that starvation will not occur? That is,
how can we guarantee that resources will not always be preempted from
the same process?
In a system where victim selection is based primarily on cost factors,
it may happen that the same process is always picked as a victim. As
a result, this process never completes its designated task, a starvation
situation any practical system must address. Clearly, we must ensure
that a process can be picked as a victim only a (small) finite number of
times. The most common solution is to include the number of rollbacks
in the cost factor.
7.8 Summary
A deadlocked state occurs when two or more processes are waiting indefinitely
for an event that can be caused only by one of the waiting processes. There are
three principal methods for dealing with deadlocks:
• Use some protocol to prevent or avoid deadlocks, ensuring that the system
will never enter a deadlocked state.
• Allow the system to enter a deadlocked state, detect it, and then recover.
• Ignore the problem altogether and pretend that deadlocks never occur in
the system.
The third solution is the one used by most operating systems, including Linux
and Windows.
A deadlock can occur only if four necessary conditions hold simultaneously
in the system: mutual exclusion, hold and wait, no preemption, and circular
wait. To prevent deadlocks, we can ensure that at least one of the necessary
conditions never holds.
A method for avoiding deadlocks, rather than preventing them, requires
that the operating system have a priori information about how each process
will utilize system resources. The banker’s algorithm, for example, requires
a priori information about the maximum number of each resource class that
each process may request. Using this information, we can define a deadlock-
avoidance algorithm.
If a system does not employ a protocol to ensure that deadlocks will never
occur, then a detection-and-recovery scheme may be employed. A deadlock-
detection algorithm must be invoked to determine whether a deadlock
has occurred. If a deadlock is detected, the system must recover either by
terminating some of the deadlocked processes or by preempting resources
from some of the deadlocked processes.
Where preemption is used to deal with deadlocks, three issues must be
addressed: selecting a victim, rollback, and starvation. In a system that selects
victims for rollback primarily on the basis of cost factors, starvation may occur,
and the selected process can never complete its designated task.
Researchers have argued that none of the basic approaches alone is appro-
priate for the entire spectrum of resource-allocation problems in operating
systems. The basic approaches can be combined, however, allowing us to select
an optimal approach for each class of resources in a system.
Practice Exercises
7.1 List three examples of deadlocks that are not related to a computer-
system environment.
7.2 Suppose that a system is in an unsafe state. Show that it is possible for
the processes to complete their execution without entering a deadlocked
state.
7.3 Consider the following snapshot of a system:
     Allocation   Max       Available
     A B C D      A B C D   A B C D
P0   0 0 1 2      0 0 1 2   1 5 2 0
P1   1 0 0 0      1 7 5 0
P2   1 3 5 4      2 3 5 6
P3   0 6 3 2      0 6 5 2
P4   0 0 1 4      0 6 5 6
Answer the following questions using the banker’s algorithm:
a. What is the content of the matrix Need?
b. Is the system in a safe state?
c. If a request from process P1 arrives for (0,4,2,0), can the request be
granted immediately?
7.4 A possible method for preventing deadlocks is to have a single, higher-
order resource that must be requested before any other resource. For
example, if multiple threads attempt to access the synchronization
objects A· · · E, deadlock is possible. (Such synchronization objects may
include mutexes, semaphores, condition variables, and the like.) We can
prevent the deadlock by adding a sixth object F. Whenever a thread
wants to acquire the synchronization lock for any object A· · · E, it must
first acquire the lock for object F. This solution is known as containment:
the locks for objects A · · · E are contained within the lock for object F.
Compare this scheme with the circular-wait scheme of Section 7.4.4.
7.5 Prove that the safety algorithm presented in Section 7.5.3 requires an
order of m × n² operations.
7.6 Consider a computer system that runs 5,000 jobs per month and has no
deadlock-prevention or deadlock-avoidance scheme. Deadlocks occur
about twice per month, and the operator must terminate and rerun
about ten jobs per deadlock. Each job is worth about two dollars (in CPU
time), and the jobs terminated tend to be about half done when they are
aborted.
A systems programmer has estimated that a deadlock-avoidance
algorithm (like the banker’s algorithm) could be installed in the system
with an increase of about 10 percent in the average execution time per
job. Since the machine currently has 30 percent idle time, all 5,000 jobs
per month could still be run, although turnaround time would increase
by about 20 percent on average.
a. What are the arguments for installing the deadlock-avoidance
algorithm?
b. What are the arguments against installing the deadlock-avoidance
algorithm?
7.7 Can a system detect that some of its processes are starving? If you answer
“yes,” explain how it can. If you answer “no,” explain how the system
can deal with the starvation problem.
7.8 Consider the following resource-allocation policy. Requests for and
releases of resources are allowed at any time. If a request for resources
cannot be satisfied because the resources are not available, then we check
any processes that are blocked waiting for resources. If a blocked process
has the desired resources, then these resources are taken away from it
and are given to the requesting process. The vector of resources for which
the blocked process is waiting is increased to include the resources that
were taken away.
For example, a system has three resource types, and the vector
Available is initialized to (4,2,2). If process P0 asks for (2,2,1), it gets
them. If P1 asks for (1,0,1), it gets them. Then, if P0 asks for (0,0,1), it
is blocked (resource not available). If P2 now asks for (2,0,0), it gets the
available one (1,0,0), as well as one that was allocated to P0 (since P0 is
blocked). P0’s Allocation vector goes down to (1,2,1), and its Need vector
goes up to (1,0,1).
a. Can deadlock occur? If you answer “yes,” give an example. If you
answer “no,” specify which necessary condition cannot occur.
b. Can indefinite blocking occur? Explain your answer.
7.9 Suppose that you have coded the deadlock-avoidance safety algorithm
and now have been asked to implement the deadlock-detection algo-
rithm. Can you do so by simply using the safety algorithm code and
redefining Maxi = Waitingi + Allocationi , where Waitingi is a vector
specifying the resources for which process i is waiting and Allocationi
is as defined in Section 7.5? Explain your answer.
7.10 Is it possible to have a deadlock involving only one single-threaded
process? Explain your answer.
Exercises
7.11 Consider the traffic deadlock depicted in Figure 7.10.
a. Show that the four necessary conditions for deadlock hold in this
example.
b. State a simple rule for avoiding deadlocks in this system.
7.12 Assume a multithreaded application uses only reader–writer locks for
synchronization. Applying the four necessary conditions for deadlock,
is deadlock still possible if multiple reader–writer locks are used?
7.13 The program example shown in Figure 7.4 doesn’t always lead to
deadlock. Describe what role the CPU scheduler plays and how it can
contribute to deadlock in this program.
Figure 7.10 Traffic deadlock for Exercise 7.11.
7.14 In Section 7.4.4, we describe a situation in which we prevent deadlock
by ensuring that all locks are acquired in a certain order. However,
we also point out that deadlock is possible in this situation if two
threads simultaneously invoke the transaction() function. Fix the
transaction() function to prevent deadlocks.
7.15 Compare the circular-wait scheme with the various deadlock-avoidance
schemes (like the banker’s algorithm) with respect to the following
issues:
a. Runtime overheads
b. System throughput
7.16 In a real computer system, neither the resources available nor the
demands of processes for resources are consistent over long periods
(months). Resources break or are replaced, new processes come and go,
and new resources are bought and added to the system. If deadlock is
controlled by the banker’s algorithm, which of the following changes
can be made safely (without introducing the possibility of deadlock),
and under what circumstances?
a. Increase Available (new resources added).
b. Decrease Available (resource permanently removed from system).
c. Increase Max for one process (the process needs or wants more
resources than allowed).
d. Decrease Max for one process (the process decides it does not need
that many resources).
e. Increase the number of processes.
f. Decrease the number of processes.
7.17 Consider a system consisting of four resources of the same type that are
shared by three processes, each of which needs at most two resources.
Show that the system is deadlock free.
7.18 Consider a system consisting of m resources of the same type being
shared by n processes. A process can request or release only one resource
at a time. Show that the system is deadlock free if the following two
conditions hold:
a. The maximum need of each process is between one resource and
m resources.
b. The sum of all maximum needs is less than m + n.
7.19 Consider the version of the dining-philosophers problem in which the
chopsticks are placed at the center of the table and any two of them
can be used by a philosopher. Assume that requests for chopsticks are
made one at a time. Describe a simple rule for determining whether a
particular request can be satisfied without causing deadlock given the
current allocation of chopsticks to philosophers.
7.20 Consider again the setting in the preceding question. Assume now that
each philosopher requires three chopsticks to eat. Resource requests are
still issued one at a time. Describe some simple rules for determining
whether a particular request can be satisfied without causing deadlock
given the current allocation of chopsticks to philosophers.
7.21 We can obtain the banker’s algorithm for a single resource type from
the general banker’s algorithm simply by reducing the dimensionality
of the various arrays by 1. Show through an example that we cannot
implement the multiple-resource-type banker’s scheme by applying the
single-resource-type scheme to each resource type individually.
7.22 Consider the following snapshot of a system:
     Allocation   Max
     A B C D      A B C D
P0   3 0 1 4      5 1 1 7
P1   2 2 1 0      3 2 1 1
P2   3 1 2 1      3 3 2 1
P3   0 5 1 0      4 6 1 2
P4   4 2 1 2      6 3 2 5
Using the banker’s algorithm, determine whether or not each of the
following states is unsafe. If the state is safe, illustrate the order in which
the processes may complete. Otherwise, illustrate why the state is unsafe.
a. Available = (0, 3, 0, 1)
b. Available = (1, 0, 0, 2)
7.23 Consider the following snapshot of a system:
     Allocation   Max       Available
     A B C D      A B C D   A B C D
P0   2 0 0 1      4 2 1 2   3 3 2 1
P1   3 1 2 1      5 2 5 2
P2   2 1 0 3      2 3 1 6
P3   1 3 1 2      1 4 2 4
P4   1 4 3 2      3 6 6 5
Answer the following questions using the banker’s algorithm:
a. Illustrate that the system is in a safe state by demonstrating an
order in which the processes may complete.
b. If a request from process P1 arrives for (1, 1, 0, 0), can the request
be granted immediately?
c. If a request from process P4 arrives for (0, 0, 2, 0), can the request
be granted immediately?
7.24 What is the optimistic assumption made in the deadlock-detection
algorithm? How can this assumption be violated?
7.25 A single-lane bridge connects the two Vermont villages of North
Tunbridge and South Tunbridge. Farmers in the two villages use this
bridge to deliver their produce to the neighboring town. The bridge
can become deadlocked if a northbound and a southbound farmer get
on the bridge at the same time. (Vermont farmers are stubborn and are
unable to back up.) Using semaphores and/or mutex locks, design an
algorithm in pseudocode that prevents deadlock. Initially, do not be
concerned about starvation (the situation in which northbound farmers
prevent southbound farmers from using the bridge, or vice versa).
7.26 Modify your solution to Exercise 7.25 so that it is starvation-free.
Programming Problems
7.27 Implement your solution to Exercise 7.25 using POSIX synchronization.
In particular, represent northbound and southbound farmers as separate
threads. Once a farmer is on the bridge, the associated thread will sleep
for a random period of time, representing traveling across the bridge.
Design your program so that you can create several threads representing
the northbound and southbound farmers.
Programming Projects
Banker’s Algorithm
For this project, you will write a multithreaded program that implements the
banker’s algorithm discussed in Section 7.5.3. Several customers request and
release resources from the bank. The banker will grant a request only if it leaves
the system in a safe state. A request that leaves the system in an unsafe state
will be denied. This programming assignment combines three separate topics:
(1) multithreading, (2) preventing race conditions, and (3) deadlock avoidance.
The Banker
The banker will consider requests from n customers for m resource types, as
outlined in Section 7.5.3. The banker will keep track of the resources using the
following data structures:
/* these may be any values >= 0 */
#define NUMBER_OF_CUSTOMERS 5
#define NUMBER_OF_RESOURCES 3

/* the available amount of each resource */
int available[NUMBER_OF_RESOURCES];

/* the maximum demand of each customer */
int maximum[NUMBER_OF_CUSTOMERS][NUMBER_OF_RESOURCES];

/* the amount currently allocated to each customer */
int allocation[NUMBER_OF_CUSTOMERS][NUMBER_OF_RESOURCES];

/* the remaining need of each customer */
int need[NUMBER_OF_CUSTOMERS][NUMBER_OF_RESOURCES];
The Customers
Create n customer threads that request and release resources from the bank.
The customers will continually loop, requesting and then releasing random
numbers of resources. The customers’ requests for resources will be bounded
by their respective values in the need array. The banker will grant a request if
it satisfies the safety algorithm outlined in Section 7.5.3.1. If a request does not
leave the system in a safe state, the banker will deny it. Function prototypes
for requesting and releasing resources are as follows:
int request_resources(int customer_num, int request[]);
int release_resources(int customer_num, int release[]);
These two functions should return 0 if successful (the request has been
granted) and –1 if unsuccessful. Multiple threads (customers) will concurrently
access shared data through these two functions. Therefore, access must be
controlled through mutex locks to prevent race conditions. Both the Pthreads
and Windows APIs provide mutex locks. The use of Pthreads mutex locks is
covered in Section 5.9.4; mutex locks for Windows systems are described in the
project entitled “Producer–Consumer Problem” at the end of Chapter 5.
Implementation
You should invoke your program by passing the number of resources of each
type on the command line. For example, if there were three resource types,
with ten instances of the first type, five of the second type, and seven of the
third type, you would invoke your program as follows:
./a.out 10 5 7
The available array would be initialized to these values. You may initialize
the maximum array (which holds the maximum demand of each customer) using
any method you find convenient.
Bibliographical Notes
Most research involving deadlock was conducted many years ago. [Dijkstra
(1965)] was one of the first and most influential contributors in the deadlock
area. [Holt (1972)] was the first person to formalize the notion of deadlocks in
terms of an allocation-graph model similar to the one presented in this chapter.
Starvation was also covered by [Holt (1972)]. [Hyman (1985)] provided the
deadlock example from the Kansas legislature. A study of deadlock handling
is provided in [Levine (2003)].
The various prevention algorithms were suggested by [Havender (1968)],
who devised the resource-ordering scheme for the IBM OS/360 system. The
banker’s algorithm for avoiding deadlocks was developed for a single resource
type by [Dijkstra (1965)] and was extended to multiple resource types by
[Habermann (1969)].
The deadlock-detection algorithm for multiple instances of a resource type,
which is described in Section 7.6.2, was presented by [Coffman et al. (1971)].
[Bach (1987)] describes how many of the algorithms in the traditional
UNIX kernel handle deadlock. Solutions to deadlock problems in networks are
discussed in works such as [Culler et al. (1998)] and [Rodeheffer and Schroeder
(1991)].
The witness lock-order verifier is presented in [Baldwin (2002)].
Bibliography
[Bach (1987)] M. J. Bach, The Design of the UNIX Operating System, Prentice Hall
(1987).
[Baldwin (2002)] J. Baldwin, “Locking in the Multithreaded FreeBSD Kernel”,
USENIX BSD (2002).
[Coffman et al. (1971)] E. G. Coffman, M. J. Elphick, and A. Shoshani, “System
Deadlocks”, Computing Surveys, Volume 3, Number 2 (1971), pages 67–78.
[Culler et al. (1998)] D. E. Culler, J. P. Singh, and A. Gupta, Parallel Computer
Architecture: A Hardware/Software Approach, Morgan Kaufmann Publishers Inc.
(1998).
[Dijkstra (1965)] E. W. Dijkstra, “Cooperating Sequential Processes”, Technical
report, Technological University, Eindhoven, the Netherlands (1965).
[Habermann (1969)] A. N. Habermann, “Prevention of System Deadlocks”,
Communications of the ACM, Volume 12, Number 7 (1969), pages 373–377, 385.
[Havender (1968)] J. W. Havender, “Avoiding Deadlock in Multitasking Sys-
tems”, IBM Systems Journal, Volume 7, Number 2 (1968), pages 74–84.
[Holt (1972)] R. C. Holt, “Some Deadlock Properties of Computer Systems”,
Computing Surveys, Volume 4, Number 3 (1972), pages 179–196.
[Hyman (1985)] D. Hyman, The Columbus Chicken Statute and More Bonehead
Legislation, S. Greene Press (1985).
[Levine (2003)] G. Levine, “Defining Deadlock”, Operating Systems Review, Vol-
ume 37, Number 1 (2003).
[Rodeheffer and Schroeder (1991)] T. L. Rodeheffer and M. D. Schroeder,
“Automatic Reconfiguration in Autonet”, Proceedings of the ACM Symposium
on Operating Systems Principles (1991), pages 183–197.
Part Three: Memory Management
The main purpose of a computer system is to execute programs. These
programs, together with the data they access, must be at least partially
in main memory during execution.
To improve both the utilization of the CPU and the speed of its
response to users, a general-purpose computer must keep several pro-
cesses in memory. Many memory-management schemes exist, reflect-
ing various approaches, and the effectiveness of each algorithm depends
on the situation. Selection of a memory-management scheme for a sys-
tem depends on many factors, especially on the hardware design of the
system. Most algorithms require hardware support.
Chapter 8 Main Memory
In Chapter 6, we showed how the CPU can be shared by a set of processes. As
a result of CPU scheduling, we can improve both the utilization of the CPU and
the speed of the computer’s response to its users. To realize this increase in
performance, however, we must keep several processes in memory—that is,
we must share memory.
In this chapter, we discuss various ways to manage memory. The memory-
management algorithms vary from a primitive bare-machine approach to
paging and segmentation strategies. Each approach has its own advantages
and disadvantages. Selection of a memory-management method for a specific
system depends on many factors, especially on the hardware design of the
system. As we shall see, many algorithms require hardware support, leading
many systems to have closely integrated hardware and operating-system
memory management.
CHAPTER OBJECTIVES
• To provide a detailed description of various ways of organizing memory
hardware.
• To explore various techniques of allocating memory to processes.
• To discuss in detail how paging works in contemporary computer systems.
8.1 Background
As we saw in Chapter 1, memory is central to the operation of a modern
computer system. Memory consists of a large array of bytes, each with its own
address. The CPU fetches instructions from memory according to the value of
the program counter. These instructions may cause additional loading from
and storing to specific memory addresses.
A typical instruction-execution cycle, for example, first fetches an instruc-
tion from memory. The instruction is then decoded and may cause operands
to be fetched from memory. After the instruction has been executed on the
operands, results may be stored back in memory. The memory unit sees only
a stream of memory addresses; it does not know how they are generated (by
the instruction counter, indexing, indirection, literal addresses, and so on) or
what they are for (instructions or data). Accordingly, we can ignore how a
program generates a memory address. We are interested only in the sequence
of memory addresses generated by the running program.
We begin our discussion by covering several issues that are pertinent
to managing memory: basic hardware, the binding of symbolic memory
addresses to actual physical addresses, and the distinction between logical
and physical addresses. We conclude the section with a discussion of dynamic
linking and shared libraries.
8.1.1 Basic Hardware
Main memory and the registers built into the processor itself are the only
general-purpose storage that the CPU can access directly. There are machine
instructions that take memory addresses as arguments, but none that take disk
addresses. Therefore, any instructions in execution, and any data being used
by the instructions, must be in one of these direct-access storage devices. If the
data are not in memory, they must be moved there before the CPU can operate
on them.
Registers that are built into the CPU are generally accessible within one
cycle of the CPU clock. Most CPUs can decode instructions and perform simple
operations on register contents at the rate of one or more operations per
clock tick. The same cannot be said of main memory, which is accessed via
a transaction on the memory bus. Completing a memory access may take
many cycles of the CPU clock. In such cases, the processor normally needs to
stall, since it does not have the data required to complete the instruction that it
is executing. This situation is intolerable because of the frequency of memory
accesses. The remedy is to add fast memory between the CPU and main memory,
typically on the CPU chip for fast access. Such a cache was described in Section
1.8.3. To manage a cache built into the CPU, the hardware automatically speeds
up memory access without any operating-system control.
Not only are we concerned with the relative speed of accessing physical
memory, but we also must ensure correct operation. For proper system
operation we must protect the operating system from access by user processes.
On multiuser systems, we must additionally protect user processes from
one another. This protection must be provided by the hardware because the
operating system doesn’t usually intervene between the CPU and its memory
accesses (because of the resulting performance penalty). Hardware implements
this protection in several different ways, as we show throughout the chapter.
Here, we outline one possible implementation.
We first need to make sure that each process has a separate memory space.
Separate per-process memory space protects the processes from each other and
is fundamental to having multiple processes loaded in memory for concurrent
execution. To separate memory spaces, we need the ability to determine the
range of legal addresses that the process may access and to ensure that the
process can access only these legal addresses. We can provide this protection
by using two registers, usually a base and a limit, as illustrated in Figure 8.1.
The base register holds the smallest legal physical memory address; the limit
register specifies the size of the range. For example, if the base register holds
300040 and the limit register is 120900, then the program can legally access all
addresses from 300040 through 420939 (inclusive).

Figure 8.1 A base and a limit register define a logical address space.
Protection of memory space is accomplished by having the CPU hardware
compare every address generated in user mode with the registers. Any attempt
by a program executing in user mode to access operating-system memory or
other users’ memory results in a trap to the operating system, which treats the
attempt as a fatal error (Figure 8.2). This scheme prevents a user program from
(accidentally or deliberately) modifying the code or data structures of either
the operating system or other users.
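
In software terms, the hardware check of Figure 8.2 amounts to two comparisons
per user-mode reference. The C sketch below is only an analogy (the names are
ours, and real hardware performs this check on every reference without any
software involvement):

#include <stdbool.h>
#include <stdint.h>

/* Legal iff base <= addr < base + limit; otherwise the hardware traps
   to the operating system. Assumes base + limit does not overflow. */
bool address_is_legal(uint32_t addr, uint32_t base, uint32_t limit)
{
    return addr >= base && addr < base + limit;
}

With the example values above, address_is_legal(300040, 300040, 120900) is
true, while address_is_legal(420940, 300040, 120900) is false, since 420940 is
the first address beyond the 120900-byte range.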
The base and limit registers can be loaded only by the operating system,
which uses a special privileged instruction. Since privileged instructions can
be executed only in kernel mode, and since only the operating system executes
in kernel mode, only the operating system can load the base and limit registers.
Figure 8.2 Hardware address protection with base and limit registers.
This scheme allows the operating system to change the value of the registers
but prevents user programs from changing the registers’ contents.
The operating system, executing in kernel mode, is given unrestricted
access to both operating-system memory and users’ memory. This provision
allows the operating system to load users’ programs into users’ memory, to
dump out those programs in case of errors, to access and modify parameters
of system calls, to perform I/O to and from user memory, and to provide
many other services. Consider, for example, that an operating system for a
multiprocessing system must execute context switches, storing the state of one
process from the registers into main memory before loading the next process’s
context from main memory into the registers.
8.1.2 Address Binding
Usually, a program resides on a disk as a binary executable file. To be executed,
the program must be brought into memory and placed within a process.
Depending on the memory management in use, the process may be moved
between disk and memory during its execution. The processes on the disk that
are waiting to be brought into memory for execution form the input queue.
The normal single-tasking procedure is to select one of the processes
in the input queue and to load that process into memory. As the process
is executed, it accesses instructions and data from memory. Eventually, the
process terminates, and its memory space is declared available.
Most systems allow a user process to reside in any part of the physical
memory. Thus, although the address space of the computer may start at 00000,
the first address of the user process need not be 00000. You will see later how
a user program actually places a process in physical memory.
In most cases, a user program goes through several steps—some of which
may be optional—before being executed (Figure 8.3). Addresses may be
represented in different ways during these steps. Addresses in the source
program are generally symbolic (such as the variable count). A compiler
typically binds these symbolic addresses to relocatable addresses (such as
“14 bytes from the beginning of this module”). The linkage editor or loader
in turn binds the relocatable addresses to absolute addresses (such as 74014).
Each binding is a mapping from one address space to another.
Classically, the binding of instructions and data to memory addresses can
be done at any step along the way:
• Compile time. If you know at compile time where the process will reside
in memory, then absolute code can be generated. For example, if you know
that a user process will reside starting at location R, then the generated
compiler code will start at that location and extend up from there. If, at
some later time, the starting location changes, then it will be necessary
to recompile this code. The MS-DOS .COM-format programs are bound at
compile time.
• Load time. If it is not known at compile time where the process will reside
in memory, then the compiler must generate relocatable code. In this case,
final binding is delayed until load time. If the starting address changes, we
need only reload the user code to incorporate this changed value.
Figure 8.3 Multistep processing of a user program.
• Execution time. If the process can be moved during its execution from
one memory segment to another, then binding must be delayed until run
time. Special hardware must be available for this scheme to work, as will
be discussed in Section 8.1.3. Most general-purpose operating systems use
this method.
A major portion of this chapter is devoted to showing how these various bind-
ings can be implemented effectively in a computer system and to discussing
appropriate hardware support.
8.1.3 Logical Versus Physical Address Space
An address generated by the CPU is commonly referred to as a logical address,
whereas an address seen by the memory unit—that is, the one loaded into
the memory-address register of the memory—is commonly referred to as a
physical address.
The compile-time and load-time address-binding methods generate iden-
tical logical and physical addresses.

Figure 8.4 Dynamic relocation using a relocation register.

However, the execution-time address-binding scheme results in differing
logical and physical addresses. In this
case, we usually refer to the logical address as a virtual address. We use
logical address and virtual address interchangeably in this text. The set of all
logical addresses generated by a program is a logical address space. The set
of all physical addresses corresponding to these logical addresses is a physical
address space. Thus, in the execution-time address-binding scheme, the logical
and physical address spaces differ.
The run-time mapping from virtual to physical addresses is done by a
hardware device called the memory-management unit (MMU). We can choose
from many different methods to accomplish such mapping, as we discuss in
Section 8.3 through Section 8.5. For the time being, we illustrate this mapping
with a simple MMU scheme that is a generalization of the base-register scheme
described in Section 8.1.1. The base register is now called a relocation register.
The value in the relocation register is added to every address generated by a
user process at the time the address is sent to memory (see Figure 8.4). For
example, if the base is at 14000, then an attempt by the user to address location
0 is dynamically relocated to location 14000; an access to location 346 is mapped
to location 14346.
The user program never sees the real physical addresses. The program can
create a pointer to location 346, store it in memory, manipulate it, and compare it
with other addresses—all as the number 346. Only when it is used as a memory
address (in an indirect load or store, perhaps) is it relocated relative to the base
register. The user program deals with logical addresses. The memory-mapping
hardware converts logical addresses into physical addresses. This form of
execution-time binding was discussed in Section 8.1.2. The final location of
a referenced memory address is not determined until the reference is made.
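
The relocation step itself is a single addition. As a sketch (the function
name is ours; a real MMU performs this in hardware on every memory reference):

#include <stdint.h>

/* Maps a logical address to a physical address by adding the value of
   the relocation register, as in Figure 8.4. */
uint32_t mmu_translate(uint32_t logical, uint32_t relocation_register)
{
    return logical + relocation_register;
}

With a relocation register of 14000, mmu_translate(346, 14000) yields 14346,
matching the example above.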
We now have two different types of addresses: logical addresses (in the
range 0 to max) and physical addresses (in the range R + 0 to R + max for a base
value R). The user program generates only logical addresses and thinks that
the process runs in locations 0 to max. However, these logical addresses must
be mapped to physical addresses before they are used. The concept of a logical
address space that is bound to a separate physical address space is central to
proper memory management.
8.1.4 Dynamic Loading
In our discussion so far, it has been necessary for the entire program and all
data of a process to be in physical memory for the process to execute. The size
of a process has thus been limited to the size of physical memory. To obtain
better memory-space utilization, we can use dynamic loading. With dynamic
loading, a routine is not loaded until it is called. All routines are kept on disk
in a relocatable load format. The main program is loaded into memory and
is executed. When a routine needs to call another routine, the calling routine
first checks to see whether the other routine has been loaded. If it has not, the
relocatable linking loader is called to load the desired routine into memory and
to update the program’s address tables to reflect this change. Then control is
passed to the newly loaded routine.
The advantage of dynamic loading is that a routine is loaded only when it
is needed. This method is particularly useful when large amounts of code are
needed to handle infrequently occurring cases, such as error routines. In this
case, although the total program size may be large, the portion that is used
(and hence loaded) may be much smaller.
Dynamic loading does not require special support from the operating
system. It is the responsibility of the users to design their programs to take
advantage of such a method. Operating systems may help the programmer,
however, by providing library routines to implement dynamic loading.
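
On POSIX systems, one programmer-visible form of dynamic loading is the
dlopen() interface. The sketch below loads the standard math library at run
time and looks up cos(); the library name libm.so.6 is a Linux-specific
assumption, and on Linux the program must be linked with -ldl:

#include <dlfcn.h>
#include <stdio.h>

int main(void)
{
    /* Load the library only when needed, rather than at link time. */
    void *handle = dlopen("libm.so.6", RTLD_LAZY);
    if (handle == NULL) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }

    /* Look up the routine's address in the newly loaded library. */
    double (*cosine)(double) = (double (*)(double)) dlsym(handle, "cos");
    if (cosine != NULL)
        printf("cos(0.0) = %f\n", cosine(0.0));

    dlclose(handle);
    return 0;
}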
8.1.5 Dynamic Linking and Shared Libraries
Dynamically linked libraries are system libraries that are linked to user
programs when the programs are run (refer back to Figure 8.3). Some operating
systems support only static linking, in which system libraries are treated
like any other object module and are combined by the loader into the binary
program image. Dynamic linking, in contrast, is similar to dynamic loading.
Here, though, linking, rather than loading, is postponed until execution time.
This feature is usually used with system libraries, such as language subroutine
libraries. Without this facility, each program on a system must include a copy
of its language library (or at least the routines referenced by the program) in the
executable image. This requirement wastes both disk space and main memory.
With dynamic linking, a stub is included in the image for each library-
routine reference. The stub is a small piece of code that indicates how to locate
the appropriate memory-resident library routine or how to load the library if
the routine is not already present. When the stub is executed, it checks to see
whether the needed routine is already in memory. If it is not, the program loads
the routine into memory. Either way, the stub replaces itself with the address
of the routine and executes the routine. Thus, the next time that particular
code segment is reached, the library routine is executed directly, incurring no
cost for dynamic linking. Under this scheme, all processes that use a language
library execute only one copy of the library code.
This feature can be extended to library updates (such as bug fixes). A library
may be replaced by a new version, and all programs that reference the library
will automatically use the new version. Without dynamic linking, all such
programs would need to be relinked to gain access to the new library. So that
programs will not accidentally execute new, incompatible versions of libraries,
version information is included in both the program and the library. More than
one version of a library may be loaded into memory, and each program uses its
version information to decide which copy of the library to use. Versions with
minor changes retain the same version number, whereas versions with major
changes increment the number. Thus, only programs that are compiled with
the new library version are affected by any incompatible changes incorporated
in it. Other programs linked before the new library was installed will continue
using the older library. This system is also known as shared libraries.
Unlike dynamic loading, dynamic linking and shared libraries generally
require help from the operating system. If the processes in memory are
protected from one another, then the operating system is the only entity that can
check to see whether the needed routine is in another process’s memory space
or that can allow multiple processes to access the same memory addresses. We
elaborate on this concept when we discuss paging in Section 8.5.4.
8.2 Swapping
A process must be in memory to be executed. A process, however, can be
swapped temporarily out of memory to a backing store and then brought back
into memory for continued execution (Figure 8.5). Swapping makes it possible
for the total physical address space of all processes to exceed the real physical
memory of the system, thus increasing the degree of multiprogramming in a
system.
8.2.1 Standard Swapping
Standard swapping involves moving processes between main memory and
a backing store. The backing store is commonly a fast disk. It must be large
enough to accommodate copies of all memory images for all users, and it must
provide direct access to these memory images. The system maintains a ready
queue consisting of all processes whose memory images are on the backing
store or in memory and are ready to run. Whenever the CPU scheduler decides
to execute a process, it calls the dispatcher. The dispatcher checks to see whether
the next process in the queue is in memory. If it is not, and if there is no free
memory region, the dispatcher swaps out a process currently in memory and
swaps in the desired process. It then reloads registers and transfers control to
the selected process.

Figure 8.5 Swapping of two processes using a disk as a backing store.
The context-switch time in such a swapping system is fairly high. To get an
idea of the context-switch time, let’s assume that the user process is 100 MB in
size and the backing store is a standard hard disk with a transfer rate of 50 MB
per second. The actual transfer of the 100-MB process to or from main memory
takes
100 MB/50 MB per second = 2 seconds
The swap time is 2,000 milliseconds. Since we must swap both out and in, the
total swap time is about 4,000 milliseconds. (Here, we are ignoring other disk
performance aspects, which we cover in Chapter 10.)
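The arithmetic generalizes in the obvious way. A small hypothetical helper (not part of any real API) makes the dependence on process size and transfer rate explicit:

    /* Hypothetical helper: total time (in seconds) to swap a process
       out and back in, ignoring disk latency as in the text. */
    static double total_swap_time(double size_mb, double rate_mb_per_sec) {
        double one_way = size_mb / rate_mb_per_sec;
        return 2.0 * one_way;   /* must transfer both out and in */
    }

    /* total_swap_time(100, 50) == 4.0 seconds (about 4,000 milliseconds) */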
Notice that the major part of the swap time is transfer time. The total
transfer time is directly proportional to the amount of memory swapped.
If we have a computer system with 4 GB of main memory and a resident
operating system taking 1 GB, the maximum size of the user process is 3
GB. However, many user processes may be much smaller than this—say, 100
MB. A 100-MB process could be swapped out in 2 seconds, compared with
the 60 seconds required for swapping 3 GB. Clearly, it would be useful to
know exactly how much memory a user process is using, not simply how
much it might be using. Then we would need to swap only what is actually
used, reducing swap time. For this method to be effective, the user must
keep the system informed of any changes in memory requirements. Thus,
a process with dynamic memory requirements will need to issue system calls
(request_memory() and release_memory()) to inform the operating system
of its changing memory needs.
Swapping is constrained by other factors as well. If we want to swap
a process, we must be sure that it is completely idle. Of particular concern
is any pending I/O. A process may be waiting for an I/O operation when
we want to swap that process to free up memory. However, if the I/O is
asynchronously accessing the user memory for I/O buffers, then the process
cannot be swapped. Assume that the I/O operation is queued because the
device is busy. If we were to swap out process P1 and swap in process P2, the
I/O operation might then attempt to use memory that now belongs to process
P2. There are two main solutions to this problem: never swap a process with
pending I/O, or execute I/O operations only into operating-system buffers.
Transfers between operating-system buffers and process memory then occur
only when the process is swapped in. Note that this double buffering itself
adds overhead. We now need to copy the data again, from kernel memory to
user memory, before the user process can access it.
Standard swapping is not used in modern operating systems. It requires too
much swapping time and provides too little execution time to be a reasonable
memory-management solution. Modified versions of swapping, however, are
found on many systems, including UNIX, Linux, and Windows. In one common
variation, swapping is normally disabled but will start if the amount of free
memory (unused memory available for the operating system or processes to
use) falls below a threshold amount. Swapping is halted when the amount
of free memory increases. Another variation involves swapping portions of
processes—rather than entire processes—to decrease swap time. Typically,
these modified forms of swapping work in conjunction with virtual memory,
which we cover in Chapter 9.
8.2.2 Swapping on Mobile Systems
Although most operating systems for PCs and servers support some modified
version of swapping, mobile systems typically do not support swapping in any
form. Mobile devices generally use flash memory rather than more spacious
hard disks as their persistent storage. The resulting space constraint is one
reason why mobile operating-system designers avoid swapping. Other reasons
include the limited number of writes that flash memory can tolerate before it
becomes unreliable and the poor throughput between main memory and flash
memory in these devices.
Instead of using swapping, when free memory falls below a certain
threshold, Apple’s iOS asks applications to voluntarily relinquish allocated
memory. Read-only data (such as code) are removed from the system and later
reloaded from flash memory if necessary. Data that have been modified (such
as the stack) are never removed. However, any applications that fail to free up
sufficient memory may be terminated by the operating system.
Android does not support swapping and adopts a strategy similar to that
used by iOS. It may terminate a process if insufficient free memory is available.
However, before terminating a process, Android writes its application state to
flash memory so that it can be quickly restarted.
Because of these restrictions, developers for mobile systems must carefully
allocate and release memory to ensure that their applications do not use too
much memory or suffer from memory leaks. Note that both iOS and Android
support paging, so they do have memory-management abilities. We discuss
paging later in this chapter.
8.3 Contiguous Memory Allocation
The main memory must accommodate both the operating system and the
various user processes. We therefore need to allocate main memory in the most
efficient way possible. This section explains one early method, contiguous
memory allocation.
The memory is usually divided into two partitions: one for the resident
operating system and one for the user processes. We can place the operating
system in either low memory or high memory. The major factor affecting this
decision is the location of the interrupt vector. Since the interrupt vector is
often in low memory, programmers usually place the operating system in low
memory as well. Thus, in this text, we discuss only the situation in which
the operating system resides in low memory. The development of the other
situation is similar.
We usually want several user processes to reside in memory at the same
time. We therefore need to consider how to allocate available memory to the
processes that are in the input queue waiting to be brought into memory. In
contiguous memory allocation, each process is contained in a single section of
memory that is contiguous to the section containing the next process.
8.3.1 Memory Protection
Before discussing memory allocation further, we must discuss the issue of
memory protection. We can prevent a process from accessing memory it does
not own by combining two ideas previously discussed. If we have a system
with a relocation register (Section 8.1.3), together with a limit register (Section
8.1.1), we accomplish our goal. The relocation register contains the value of
the smallest physical address; the limit register contains the range of logical
addresses (for example, relocation = 100040 and limit = 74600). Each logical
address must fall within the range specified by the limit register. The MMU
maps the logical address dynamically by adding the value in the relocation
register. This mapped address is sent to memory (Figure 8.6).
When the CPU scheduler selects a process for execution, the dispatcher
loads the relocation and limit registers with the correct values as part of the
context switch. Because every address generated by a CPU is checked against
these registers, we can protect both the operating system and the other users’
programs and data from being modified by this running process.
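In outline, the hardware of Figure 8.6 behaves like the following software model, using the example values from the text. The trap_addressing_error() routine is a hypothetical stand-in for the hardware trap; a real MMU performs this check in logic, not code:

    #include <stdio.h>
    #include <stdlib.h>

    typedef unsigned long addr_t;

    static addr_t relocation_register = 100040;  /* values from the text */
    static addr_t limit_register      = 74600;

    /* Hypothetical stand-in for the trap in Figure 8.6. */
    static void trap_addressing_error(void) {
        fprintf(stderr, "trap: addressing error\n");
        exit(1);
    }

    /* Model of the check the MMU applies to every CPU-generated address. */
    static addr_t translate(addr_t logical) {
        if (logical >= limit_register)
            trap_addressing_error();
        return logical + relocation_register;    /* dynamic relocation */
    }

    int main(void) {
        printf("%lu\n", translate(0));       /* 100040 */
        printf("%lu\n", translate(74599));   /* 174639 */
        return 0;
    }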
The relocation-register scheme provides an effective way to allow the
operating system’s size to change dynamically. This flexibility is desirable in
many situations. For example, the operating system contains code and buffer
space for device drivers. If a device driver (or other operating-system service)
is not commonly used, we do not want to keep the code and data in memory, as
we might be able to use that space for other purposes. Such code is sometimes
called transient operating-system code; it comes and goes as needed. Thus,
using this code changes the size of the operating system during program
execution.
Figure 8.6 Hardware support for relocation and limit registers.
8.3.2 Memory Allocation
Now we are ready to turn to memory allocation. One of the simplest
methods for allocating memory is to divide memory into several fixed-sized
partitions. Each partition may contain exactly one process. Thus, the degree
of multiprogramming is bound by the number of partitions. In this multiple-
partition method, when a partition is free, a process is selected from the input
queue and is loaded into the free partition. When the process terminates, the
partition becomes available for another process. This method was originally
used by the IBM OS/360 operating system (called MFT) but is no longer in use.
The method described next is a generalization of the fixed-partition scheme
(called MVT); it is used primarily in batch environments. Many of the ideas
presented here are also applicable to a time-sharing environment in which
pure segmentation is used for memory management (Section 8.4).
In the variable-partition scheme, the operating system keeps a table
indicating which parts of memory are available and which are occupied.
Initially, all memory is available for user processes and is considered one
large block of available memory, a hole. Eventually, as you will see, memory
contains a set of holes of various sizes.
As processes enter the system, they are put into an input queue. The
operating system takes into account the memory requirements of each process
and the amount of available memory space in determining which processes are
allocated memory. When a process is allocated space, it is loaded into memory,
and it can then compete for CPU time. When a process terminates, it releases its
memory, which the operating system may then fill with another process from
the input queue.
At any given time, then, we have a list of available block sizes and an
input queue. The operating system can order the input queue according to
a scheduling algorithm. Memory is allocated to processes until, finally, the
memory requirements of the next process cannot be satisfied—that is, no
available block of memory (or hole) is large enough to hold that process. The
operating system can then wait until a large enough block is available, or it can
skip down the input queue to see whether the smaller memory requirements
of some other process can be met.
In general, as mentioned, the memory blocks available comprise a set of
holes of various sizes scattered throughout memory. When a process arrives
and needs memory, the system searches the set for a hole that is large enough
for this process. If the hole is too large, it is split into two parts. One part is
allocated to the arriving process; the other is returned to the set of holes. When
a process terminates, it releases its block of memory, which is then placed back
in the set of holes. If the new hole is adjacent to other holes, these adjacent holes
are merged to form one larger hole. At this point, the system may need to check
whether there are processes waiting for memory and whether this newly freed
and recombined memory could satisfy the demands of any of these waiting
processes.
This procedure is a particular instance of the general dynamic storage-
allocation problem, which concerns how to satisfy a request of size n from a
list of free holes. There are many solutions to this problem. The first-fit, best-fit,
and worst-fit strategies are the ones most commonly used to select a free hole
from the set of available holes.
• First fit. Allocate the first hole that is big enough. Searching can start either
at the beginning of the set of holes or at the location where the previous
first-fit search ended. We can stop searching as soon as we find a free hole
that is large enough.
• Best fit. Allocate the smallest hole that is big enough. We must search the
entire list, unless the list is ordered by size. This strategy produces the
smallest leftover hole.
• Worst fit. Allocate the largest hole. Again, we must search the entire list,
unless it is sorted by size. This strategy produces the largest leftover hole,
which may be more useful than the smaller leftover hole from a best-fit
approach.
Simulations have shown that both first fit and best fit are better than worst
fit in terms of decreasing time and storage utilization. Neither first fit nor best
fit is clearly better than the other in terms of storage utilization, but first fit is
generally faster.
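To make the strategies concrete, here is a minimal first-fit sketch over a singly linked list of holes. The node layout is an assumption for illustration, and alignment and coalescing are ignored:

    #include <stddef.h>

    struct hole {
        size_t start;         /* starting address of the hole */
        size_t size;          /* size of the hole in bytes */
        struct hole *next;
    };

    /* First fit: take the first hole that is big enough. The leftover
       part stays in the list as a smaller hole (a zero-size hole would
       be unlinked in a fuller implementation). */
    static size_t first_fit(struct hole *free_list, size_t request) {
        for (struct hole *h = free_list; h != NULL; h = h->next) {
            if (h->size >= request) {
                size_t addr = h->start;
                h->start += request;
                h->size  -= request;
                return addr;
            }
        }
        return (size_t) -1;   /* no hole large enough: caller must wait */
    }

Best fit differs only in that it scans the entire list (unless the list is sorted by size) and remembers the smallest qualifying hole; worst fit remembers the largest.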
8.3.3 Fragmentation
Both the first-fit and best-fit strategies for memory allocation suffer from
external fragmentation. As processes are loaded and removed from memory,
the free memory space is broken into little pieces. External fragmentation exists
when there is enough total memory space to satisfy a request but the available
spaces are not contiguous: storage is fragmented into a large number of small
holes. This fragmentation problem can be severe. In the worst case, we could
have a block of free (or wasted) memory between every two processes. If all
these small pieces of memory were in one big free block instead, we might be
able to run several more processes.
Whether we are using the first-fit or best-fit strategy can affect the amount
of fragmentation. (First fit is better for some systems, whereas best fit is better
for others.) Another factor is which end of a free block is allocated. (Which is
the leftover piece—the one on the top or the one on the bottom?) No matter
which algorithm is used, however, external fragmentation will be a problem.
Depending on the total amount of memory storage and the average process
size, external fragmentation may be a minor or a major problem. Statistical
analysis of first fit, for instance, reveals that, even with some optimization,
given N allocated blocks, another 0.5 N blocks will be lost to fragmentation.
Since 0.5 N of the resulting 1.5 N blocks are then unusable, one-third of
memory may be wasted! This property is known as the
50-percent rule.
Memory fragmentation can be internal as well as external. Consider a
multiple-partition allocation scheme with a hole of 18,464 bytes. Suppose that
the next process requests 18,462 bytes. If we allocate exactly the requested block,
we are left with a hole of 2 bytes. The overhead to keep track of this hole will be
substantially larger than the hole itself. The general approach to avoiding this
problem is to break the physical memory into fixed-sized blocks and allocate
memory in units based on block size. With this approach, the memory allocated
to a process may be slightly larger than the requested memory. The difference
between these two numbers is internal fragmentation—unused memory that
is internal to a partition.
One solution to the problem of external fragmentation is compaction. The
goal is to shuffle the memory contents so as to place all free memory together
in one large block. Compaction is not always possible, however. If relocation
is static and is done at assembly or load time, compaction cannot be done. It is
possible only if relocation is dynamic and is done at execution time. If addresses
are relocated dynamically, relocation requires only moving the program and
data and then changing the base register to reflect the new base address. When
compaction is possible, we must determine its cost. The simplest compaction
algorithm is to move all processes toward one end of memory; all holes move in
the other direction, producing one large hole of available memory. This scheme
can be expensive.
Another possible solution to the external-fragmentation problem is to
permit the logical address space of the processes to be noncontiguous, thus
allowing a process to be allocated physical memory wherever such memory is
available. Two complementary techniques achieve this solution: segmentation
(Section 8.4) and paging (Section 8.5). These techniques can also be combined.
Fragmentation is a general problem in computing that can occur wherever
we must manage blocks of data. We discuss the topic further in the storage
management chapters (Chapters 10 through 12).
8.4 Segmentation
As we’ve already seen, the user’s view of memory is not the same as the actual
physical memory. This is equally true of the programmer’s view of memory.
Indeed, dealing with memory in terms of its physical properties is inconvenient
to both the operating system and the programmer. What if the hardware could
provide a memory mechanism that mapped the programmer’s view to the
actual physical memory? The system would have more freedom to manage
memory, while the programmer would have a more natural programming
environment. Segmentation provides such a mechanism.
8.4.1 Basic Method
Do programmers think of memory as a linear array of bytes, some containing
instructions and others containing data? Most programmers would say “no.”
Rather, they prefer to view memory as a collection of variable-sized segments,
with no necessary ordering among the segments (Figure 8.7).
When writing a program, a programmer thinks of it as a main program
with a set of methods, procedures, or functions. It may also include various data
structures: objects, arrays, stacks, variables, and so on. Each of these modules or
data elements is referred to by name. The programmer talks about “the stack,”
“the math library,” and “the main program” without caring what addresses
in memory these elements occupy. She is not concerned with whether the
stack is stored before or after the Sqrt() function. Segments vary in length,
and the length of each is intrinsically defined by its purpose in the program.
Elements within a segment are identified by their offset from the beginning of
the segment: the first statement of the program, the seventh stack frame entry
in the stack, the fifth instruction of the Sqrt(), and so on.
Segmentation is a memory-management scheme that supports this pro-
grammer view of memory. A logical address space is a collection of segments.
Figure 8.7 Programmer’s view of a program.
Each segment has a name and a length. The addresses specify both the segment
name and the offset within the segment. The programmer therefore specifies
each address by two quantities: a segment name and an offset.
For simplicity of implementation, segments are numbered and are referred
to by a segment number, rather than by a segment name. Thus, a logical address
consists of a two-tuple:

<segment-number, offset>.
Normally, when a program is compiled, the compiler automatically constructs
segments reflecting the input program.
A C compiler might create separate segments for the following:
1. The code
2. Global variables
3. The heap, from which memory is allocated
4. The stacks used by each thread
5. The standard C library
Libraries that are linked in during compile time might be assigned separate
segments. The loader would take all these segments and assign them segment
numbers.
8.4.2 Segmentation Hardware
Although the programmer can now refer to objects in the program by a
two-dimensional address, the actual physical memory is still, of course, a one-
dimensional sequence of bytes. Thus, we must define an implementation to
map two-dimensional user-defined addresses into one-dimensional physical
addresses. This mapping is effected by a segment table. Each entry in the
segment table has a segment base and a segment limit. The segment base
contains the starting physical address where the segment resides in memory,
and the segment limit specifies the length of the segment.

The use of a segment table is illustrated in Figure 8.8. A logical address
consists of two parts: a segment number, s, and an offset into that segment, d.
The segment number is used as an index to the segment table. The offset d of
the logical address must be between 0 and the segment limit. If it is not, we trap
to the operating system (logical addressing attempt beyond end of segment).
When an offset is legal, it is added to the segment base to produce the address
in physical memory of the desired byte. The segment table is thus essentially
an array of base–limit register pairs.

Figure 8.8 Segmentation hardware.
As an example, consider the situation shown in Figure 8.9. We have five
segments numbered from 0 through 4. The segments are stored in physical
memory as shown. The segment table has a separate entry for each segment,
giving the beginning address of the segment in physical memory (or base) and
the length of that segment (or limit). For example, segment 2 is 400 bytes long
and begins at location 4300. Thus, a reference to byte 53 of segment 2 is mapped
onto location 4300 + 53 = 4353. A reference to segment 3, byte 852, is mapped to
3200 (the base of segment 3) + 852 = 4052. A reference to byte 1222 of segment
0 would result in a trap to the operating system, as this segment is only 1,000
bytes long.
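The translation just described can be sketched in a few lines of C using the base–limit pairs of Figure 8.9; trap_addressing_error() is again a hypothetical stand-in for the hardware trap:

    #include <stdio.h>
    #include <stdlib.h>

    struct segment_entry { unsigned limit; unsigned base; };

    /* Segment table from Figure 8.9 (limit, base). */
    static struct segment_entry seg_table[5] = {
        {1000, 1400},  /* segment 0 */
        { 400, 6300},  /* segment 1 */
        { 400, 4300},  /* segment 2 */
        {1100, 3200},  /* segment 3 */
        {1000, 4700},  /* segment 4 */
    };

    /* Hypothetical stand-in for the hardware trap. */
    static void trap_addressing_error(void) {
        fprintf(stderr, "trap: addressing error\n");
        exit(1);
    }

    static unsigned translate(unsigned s, unsigned d) {
        if (d >= seg_table[s].limit)        /* offset beyond end of segment */
            trap_addressing_error();
        return seg_table[s].base + d;
    }

    int main(void) {
        printf("%u\n", translate(2, 53));   /* 4300 + 53 = 4353 */
        printf("%u\n", translate(3, 852));  /* 3200 + 852 = 4052 */
        translate(0, 1222);                 /* traps: segment 0 is 1,000 bytes */
        return 0;
    }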
Figure 8.9 Example of segmentation.

8.5 Paging

Segmentation permits the physical address space of a process to be non-
contiguous. Paging is another memory-management scheme that offers this
advantage. However, paging avoids external fragmentation and the need for
compaction, whereas segmentation does not. It also solves the considerable
problem of fitting memory chunks of varying sizes onto the backing store.
Most memory-management schemes used before the introduction of paging
suffered from this problem. The problem arises because, when code fragments
or data residing in main memory need to be swapped out, space must be found
on the backing store. The backing store has the same fragmentation problems
discussed in connection with main memory, but access is much slower, so
compaction is impossible. Because of its advantages over earlier methods,
paging in its various forms is used in most operating systems, from those for
mainframes through those for smartphones. Paging is implemented through
cooperation between the operating system and the computer hardware.
8.5.1 Basic Method
The basic method for implementing paging involves breaking physical mem-
ory into fixed-sized blocks called frames and breaking logical memory into
blocks of the same size called pages. When a process is to be executed, its
pages are loaded into any available memory frames from their source (a file
system or the backing store). The backing store is divided into fixed-sized
blocks that are the same size as the memory frames or clusters of multiple
frames. This rather simple idea has great functionality and wide ramifications.
For example, the logical address space is now totally separate from the physical
address space, so a process can have a logical 64-bit address space even though
the system has less than 2^64 bytes of physical memory.
The hardware support for paging is illustrated in Figure 8.10. Every address
generated by the CPU is divided into two parts: a page number (p) and a page
offset (d). The page number is used as an index into a page table. The page table
contains the base address of each page in physical memory. This base address
is combined with the page offset to define the physical memory address that
is sent to the memory unit. The paging model of memory is shown in Figure
8.11.

Figure 8.10 Paging hardware.

Figure 8.11 Paging model of logical and physical memory.
The page size (like the frame size) is defined by the hardware. The size of a
page is a power of 2, varying between 512 bytes and 1 GB per page, depending
on the computer architecture. The selection of a power of 2 as a page size
makes the translation of a logical address into a page number and page offset
particularly easy. If the size of the logical address space is 2^m, and a page size is
2^n bytes, then the high-order m − n bits of a logical address designate the page
number, and the n low-order bits designate the page offset. Thus, the logical
address is as follows:

    page number    page offset
         p              d
       m − n            n

where p is an index into the page table and d is the displacement within the
page.
As a concrete (although minuscule) example, consider the memory in
Figure 8.12. Here, in the logical address, n = 2 and m = 4. Using a page size
of 4 bytes and a physical memory of 32 bytes (8 pages), we show how the
programmer’s view of memory can be mapped into physical memory. Logical
address 0 is page 0, offset 0. Indexing into the page table, we find that page 0
is in frame 5. Thus, logical address 0 maps to physical address 20 [= (5 × 4) +
0]. Logical address 3 (page 0, offset 3) maps to physical address 23 [= (5 × 4) +
3]. Logical address 4 is page 1, offset 0; according to the page table, page 1 is
mapped to frame 6. Thus, logical address 4 maps to physical address 24 [= (6
× 4) + 0]. Logical address 13 (page 3, offset 1) maps to physical address 9 [= (2
× 4) + 1].

Figure 8.12 Paging example for a 32-byte memory with 4-byte pages.

OBTAINING THE PAGE SIZE ON LINUX SYSTEMS

On a Linux system, the page size varies according to architecture, and
there are several ways of obtaining the page size. One approach is to use
the getpagesize() system call. Another strategy is to enter the following
command on the command line:

    getconf PAGESIZE

Each of these techniques returns the page size as a number of bytes.
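The translations in the example above can likewise be checked in a few lines of C. This sketch hard-codes the page table of Figure 8.12 (n = 2, so the low two bits are the offset); on a real Linux system, the actual page size could be obtained with getpagesize() or sysconf(_SC_PAGESIZE), as the box above notes:

    #include <stdio.h>

    #define PAGE_BITS 2                       /* n = 2: 4-byte pages */

    static int page_table[4] = {5, 6, 1, 2};  /* page -> frame (Figure 8.12) */

    static unsigned translate(unsigned logical) {
        unsigned p = logical >> PAGE_BITS;              /* page number */
        unsigned d = logical & ((1u << PAGE_BITS) - 1); /* page offset */
        return ((unsigned) page_table[p] << PAGE_BITS) | d;
    }

    int main(void) {
        printf("%u %u %u %u\n", translate(0), translate(3),
               translate(4), translate(13));   /* prints: 20 23 24 9 */
        return 0;
    }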
You may have noticed that paging itself is a form of dynamic relocation.
Every logical address is bound by the paging hardware to some physical
address. Using paging is similar to using a table of base (or relocation) registers,
one for each frame of memory.
When we use a paging scheme, we have no external fragmentation: any free
frame can be allocated to a process that needs it. However, we may have some
internal fragmentation. Notice that frames are allocated as units. If the memory
requirements of a process do not happen to coincide with page boundaries,
the last frame allocated may not be completely full. For example, if page size
is 2,048 bytes, a process of 72,766 bytes will need 35 pages plus 1,086 bytes. It
will be allocated 36 frames, resulting in internal fragmentation of 2,048 − 1,086
= 962 bytes. In the worst case, a process would need n pages plus 1 byte. It
would be allocated n + 1 frames, resulting in internal fragmentation of almost
an entire frame.
If process size is independent of page size, we expect internal fragmentation
to average one-half page per process. This consideration suggests that small
page sizes are desirable. However, overhead is involved in each page-table
entry, and this overhead is reduced as the size of the pages increases. Also,
disk I/O is more efficient when the amount of data being transferred is larger
(Chapter 10). Generally, page sizes have grown over time as processes, data
sets, and main memory have become larger. Today, pages typically are between
4 KB and 8 KB in size, and some systems support even larger page sizes. Some
CPUs and kernels even support multiple page sizes. For instance, Solaris uses
page sizes of 8 KB and 4 MB, depending on the data stored by the pages.
Researchers are now developing support for variable on-the-fly page size.
Frequently, on a 32-bit CPU, each page-table entry is 4 bytes long, but that
size can vary as well. A 32-bit entry can point to one of 2^32 physical page
frames. If frame size is 4 KB (2^12), then a system with 4-byte entries can
address 2^44 bytes
(or 16 TB) of physical memory. We should note here that the size of physical
memory in a paged memory system is different from the maximum logical size
of a process. As we further explore paging, we introduce other information that
must be kept in the page-table entries. That information reduces the number
of bits available to address page frames. Thus, a system with 32-bit page-table
entries may address less physical memory than the possible maximum. A 32-bit
CPU uses 32-bit addresses, meaning that a given process space can only be 2^32
bytes (4 GB). Therefore, paging lets us use physical memory that is larger than
what can be addressed by the CPU’s address pointer length.

When a process arrives in the system to be executed, its size, expressed
in pages, is examined. Each page of the process needs one frame. Thus, if the
process requires n pages, at least n frames must be available in memory. If n
frames are available, they are allocated to this arriving process. The first page
of the process is loaded into one of the allocated frames, and the frame number
is put in the page table for this process. The next page is loaded into another
frame, its frame number is put into the page table, and so on (Figure 8.13).

Figure 8.13 Free frames (a) before allocation and (b) after allocation.
An important aspect of paging is the clear separation between the program-
mer’s view of memory and the actual physical memory. The programmer views
memory as one single space, containing only this one program. In fact, the user
program is scattered throughout physical memory, which also holds other
programs. The difference between the programmer’s view of memory and
the actual physical memory is reconciled by the address-translation hardware.
The logical addresses are translated into physical addresses. This mapping is
hidden from the programmer and is controlled by the operating system. Notice
that the user process by definition is unable to access memory it does not own.
It has no way of addressing memory outside of its page table, and the table
includes only those pages that the process owns.
Since the operating system is managing physical memory, it must be aware
of the allocation details of physical memory—which frames are allocated,
which frames are available, how many total frames there are, and so on. This
information is generally kept in a data structure called a frame table. The frame
table has one entry for each physical page frame, indicating whether the latter
is free or allocated and, if it is allocated, to which page of which process or
processes.
In addition, the operating system must be aware that user processes operate
in user space, and all logical addresses must be mapped to produce physical
addresses. If a user makes a system call (to do I/O, for example) and provides
an address as a parameter (a buffer, for instance), that address must be mapped
to produce the correct physical address. The operating system maintains a copy
of the page table for each process, just as it maintains a copy of the instruction
counter and register contents. This copy is used to translate logical addresses to
physical addresses whenever the operating system must map a logical address
to a physical address manually. It is also used by the CPU dispatcher to define
the hardware page table when a process is to be allocated the CPU. Paging
therefore increases the context-switch time.
8.5.2 Hardware Support
Each operating system has its own methods for storing page tables. Some
allocate a page table for each process. A pointer to the page table is stored with
the other register values (like the instruction counter) in the process control
block. When the dispatcher is told to start a process, it must reload the user
registers and define the correct hardware page-table values from the stored user
page table. Other operating systems provide one or at most a few page tables,
which decreases the overhead involved when processes are context-switched.
The hardware implementation of the page table can be done in several
ways. In the simplest case, the page table is implemented as a set of dedicated
registers. These registers should be built with very high-speed logic to make the
paging-address translation efficient. Every access to memory must go through
the paging map, so efficiency is a major consideration. The CPU dispatcher
reloads these registers, just as it reloads the other registers. Instructions to load
or modify the page-table registers are, of course, privileged, so that only the
operating system can change the memory map. The DEC PDP-11 is an example
of such an architecture. The address consists of 16 bits, and the page size is 8
KB. The page table thus consists of eight entries that are kept in fast registers.
The use of registers for the page table is satisfactory if the page table is
reasonably small (for example, 256 entries). Most contemporary computers,
however, allow the page table to be very large (for example, 1 million entries).
For these machines, the use of fast registers to implement the page table is
not feasible. Rather, the page table is kept in main memory, and a page-table
base register (PTBR) points to the page table. Changing page tables requires
changing only this one register, substantially reducing context-switch time.
The problem with this approach is the time required to access a user
memory location. If we want to access location i, we must first index into
the page table, using the value in the PTBR offset by the page number for i. This
task requires a memory access. It provides us with the frame number, which
is combined with the page offset to produce the actual address. We can then
access the desired place in memory. With this scheme, two memory accesses
are needed to access a byte (one for the page-table entry, one for the byte). Thus,
memory access is slowed by a factor of 2. This delay would be intolerable under
most circumstances. We might as well resort to swapping!
The standard solution to this problem is to use a special, small, fast-
lookup hardware cache called a translation look-aside buffer (TLB). The TLB
is associative, high-speed memory. Each entry in the TLB consists of two parts:
a key (or tag) and a value. When the associative memory is presented with an
item, the item is compared with all keys simultaneously. If the item is found,
the corresponding value field is returned. The search is fast; a TLB lookup in
modern hardware is part of the instruction pipeline, essentially adding no
performance penalty. To be able to execute the search within a pipeline step,
however, the TLB must be kept small. It is typically between 32 and 1,024 entries
in size. Some CPUs implement separate instruction and data address TLBs. That
can double the number of TLB entries available, because those lookups occur
in different pipeline steps. We can see in this development an example of the
evolution of CPU technology: systems have evolved from having no TLBs to
having multiple levels of TLBs, just as they have multiple levels of caches.
The TLB is used with page tables in the following way. The TLB contains
only a few of the page-table entries. When a logical address is generated by the
CPU, its page number is presented to the TLB. If the page number is found, its
frame number is immediately available and is used to access memory. As just
mentioned, these steps are executed as part of the instruction pipeline within
the CPU, adding no performance penalty compared with a system that does
not implement paging.
If the page number is not in the TLB (known as a TLB miss), a memory
reference to the page table must be made. Depending on the CPU, this may be
done automatically in hardware or via an interrupt to the operating system.
When the frame number is obtained, we can use it to access memory (Figure
8.14). In addition, we add the page number and frame number to the TLB, so
that they will be found quickly on the next reference. If the TLB is already full
of entries, an existing entry must be selected for replacement. Replacement
policies range from least recently used (LRU) through round-robin to random.
Some CPUs allow the operating system to participate in LRU entry replacement,
while others handle the matter themselves. Furthermore, some TLBs allow
certain entries to be wired down, meaning that they cannot be removed from
the TLB. Typically, TLB entries for key kernel code are wired down.

Figure 8.14 Paging hardware with TLB.
Some TLBs store address-space identifiers (ASIDs) in each TLB entry. An
ASID uniquely identifies each process and is used to provide address-space
protection for that process. When the TLB attempts to resolve virtual page
numbers, it ensures that the ASID for the currently running process matches the
ASID associated with the virtual page. If the ASIDs do not match, the attempt is
treated as a TLB miss. In addition to providing address-space protection, an ASID
allows the TLB to contain entries for several different processes simultaneously.
If the TLB does not support separate ASIDs, then every time a new page table
is selected (for instance, with each context switch), the TLB must be flushed
(or erased) to ensure that the next executing process does not use the wrong
translation information. Otherwise, the TLB could include old entries that
contain valid virtual addresses but have incorrect or invalid physical addresses
left over from the previous process.
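In outline, a software model of a TLB lookup with ASIDs might look like the following sketch; the structure and names are illustrative assumptions, not a real TLB's layout:

    #include <stdbool.h>

    struct tlb_entry {
        unsigned asid;     /* address-space identifier of the owner */
        unsigned vpn;      /* virtual page number (the key) */
        unsigned frame;    /* frame number (the value) */
        bool     valid;
    };

    #define TLB_SIZE 64
    static struct tlb_entry tlb[TLB_SIZE];

    /* Returns true on a TLB hit. A real TLB compares all entries in
       parallel; the loop here only models the effect. */
    static bool tlb_lookup(unsigned asid, unsigned vpn, unsigned *frame) {
        for (int i = 0; i < TLB_SIZE; i++) {
            if (tlb[i].valid && tlb[i].asid == asid && tlb[i].vpn == vpn) {
                *frame = tlb[i].frame;
                return true;
            }
        }
        return false;   /* TLB miss: fall back to the page table */
    }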
The percentage of times that the page number of interest is found in the
TLB is called the hit ratio. An 80-percent hit ratio, for example, means that
we find the desired page number in the TLB 80 percent of the time. If it takes
100 nanoseconds to access memory, then a mapped-memory access takes 100
nanoseconds when the page number is in the TLB. If we fail to find the page
number in the TLB then we must first access memory for the page table and
frame number (100 nanoseconds) and then access the desired byte in memory
(100 nanoseconds), for a total of 200 nanoseconds. (We are assuming that a
page-table lookup takes only one memory access, but it can take more, as we
shall see.) To find the effective memory-access time, we weight the case by its
probability:
effective access time = 0.80 × 100 + 0.20 × 200
= 120 nanoseconds
In this example, we suffer a 20-percent slowdown in average memory-access
time (from 100 to 120 nanoseconds).
For a 99-percent hit ratio, which is much more realistic, we have
effective access time = 0.99 × 100 + 0.01 × 200
= 101 nanoseconds
This increased hit rate produces only a 1 percent slowdown in access time.
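The calculation generalizes to any hit ratio. A one-line helper (illustrative only, and assuming a single memory access per page-table lookup, as in the examples above):

    /* Effective access time in nanoseconds, given the TLB hit ratio
       and the time for one memory access. */
    static double effective_access_time(double hit_ratio, double mem_ns) {
        return hit_ratio * mem_ns + (1.0 - hit_ratio) * (2.0 * mem_ns);
    }

    /* effective_access_time(0.80, 100.0) == 120.0
       effective_access_time(0.99, 100.0) == 101.0 */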
As we noted earlier, CPUs today may provide multiple levels of TLBs.
Calculating memory access times in modern CPUs is therefore much more
complicated than shown in the example above. For instance, the Intel Core
i7 CPU has a 128-entry L1 instruction TLB and a 64-entry L1 data TLB. In the
case of a miss at L1, it takes the CPU six cycles to check for the entry in the L2
512-entry TLB. A miss in L2 means that the CPU must either walk through the
page-table entries in memory to find the associated frame address, which can
take hundreds of cycles, or interrupt to the operating system to have it do the
work.
A complete performance analysis of paging overhead in such a system
would require miss-rate information about each TLB tier. We can see from the
general information above, however, that hardware features can have a signif-
icant effect on memory performance and that operating-system improvements
(such as paging) can result in and, in turn, be affected by hardware changes
(such as TLBs). We will further explore the impact of the hit ratio on the TLB in
Chapter 9.
TLBs are a hardware feature and therefore would seem to be of little concern
to operating systems and their designers. But the designer needs to understand
the function and features of TLBs, which vary by hardware platform. For
optimal operation, an operating-system design for a given platform must
implement paging according to the platform’s TLB design. Likewise, a change in
the TLB design (for example, between generations of Intel CPUs) may necessitate
a change in the paging implementation of the operating systems that use it.
8.5.3 Protection
Memory protection in a paged environment is accomplished by protection bits
associated with each frame. Normally, these bits are kept in the page table.
One bit can define a page to be read–write or read-only. Every reference
to memory goes through the page table to find the correct frame number. At
the same time that the physical address is being computed, the protection bits
can be checked to verify that no writes are being made to a read-only page. An
attempt to write to a read-only page causes a hardware trap to the operating
system (or memory-protection violation).
We can easily expand this approach to provide a finer level of protection.
We can create hardware to provide read-only, read–write, or execute-only
protection; or, by providing separate protection bits for each kind of access, we
can allow any combination of these accesses. Illegal attempts will be trapped
to the operating system.
One additional bit is generally attached to each entry in the page table: a
valid–invalid bit. When this bit is set to valid, the associated page is in the
process’s logical address space and is thus a legal (or valid) page. When the
bit is set to invalid, the page is not in the process’s logical address space. Illegal
addresses are trapped by use of the valid–invalid bit. The operating system
sets this bit for each page to allow or disallow access to the page.
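One way to model a page-table entry carrying these bits is a C bit-field; this is a sketch, since real hardware fixes the exact bit positions and they vary by architecture:

    /* Illustrative page-table entry with protection and
       valid–invalid bits (bit positions vary by architecture). */
    struct pte {
        unsigned frame      : 20;  /* frame number */
        unsigned valid      : 1;   /* 1 = page is in the address space */
        unsigned read_write : 1;   /* 1 = writable, 0 = read-only */
        unsigned execute    : 1;   /* 1 = instruction fetch permitted */
        unsigned unused     : 9;
    };

    /* A write to a page whose entry has read_write == 0, or any
       reference to a page with valid == 0, traps to the operating
       system. */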
Suppose, for example, that in a system with a 14-bit address space (0 to
16383), we have a program that should use only addresses 0 to 10468. Given
a page size of 2 KB, we have the situation shown in Figure 8.15. Addresses in
pages 0, 1, 2, 3, 4, and 5 are mapped normally through the page table. Any
attempt to generate an address in pages 6 or 7, however, will find that the
valid–invalid bit is set to invalid, and the computer will trap to the operating
system (invalid page reference).
Notice that this scheme has created a problem. Because the program
extends only to address 10468, any reference beyond that address is illegal.
However, references to page 5 are classified as valid, so accesses to addresses
up to 12287 are valid. Only the addresses from 12288 to 16383 are invalid. This
problem is a result of the 2-KB page size and reflects the internal fragmentation
of paging.

Figure 8.15 Valid (v) or invalid (i) bit in a page table.
Rarely does a process use all its address range. In fact, many processes
use only a small fraction of the address space available to them. It would be
wasteful in these cases to create a page table with entries for every page in the
address range. Most of this table would be unused but would take up valuable
memory space. Some systems provide hardware, in the form of a page-table
length register (PTLR), to indicate the size of the page table. This value is
checked against every logical address to verify that the address is in the valid
range for the process. Failure of this test causes an error trap to the operating
system.
8.5.4 Shared Pages
An advantage of paging is the possibility of sharing common code. This con-
sideration is particularly important in a time-sharing environment. Consider a
system that supports 40 users, each of whom executes a text editor. If the text
editor consists of 150 KB of code and 50 KB of data space, we need 8,000 KB to
support the 40 users. If the code is reentrant code (or pure code), however, it
can be shared, as shown in Figure 8.16. Here, we see three processes sharing
a three-page editor—each page 50 KB in size (the large page size is used to
simplify the figure). Each process has its own data page.
Reentrant code is non-self-modifying code: it never changes during execu-
tion. Thus, two or more processes can execute the same code at the same time.
Figure 8.16 Sharing of code in a paging environment.
Each process has its own copy of registers and data storage to hold the data for
the process’s execution. The data for two different processes will, of course, be
different.
Only one copy of the editor need be kept in physical memory. Each user’s
page table maps onto the same physical copy of the editor, but data pages are
mapped onto different frames. Thus, to support 40 users, we need only one
copy of the editor (150 KB), plus 40 copies of the 50 KB of data space per user.
The total space required is now 2,150 KB instead of 8,000 KB—a significant
savings.
Other heavily used programs can also be shared—compilers, window
systems, run-time libraries, database systems, and so on. To be sharable, the
code must be reentrant. The read-only nature of shared code should not be
left to the correctness of the code; the operating system should enforce this
property.
The sharing of memory among processes on a system is similar to the
sharing of the address space of a task by threads, described in Chapter 4.
Furthermore, recall that in Chapter 3 we described shared memory as a method
of interprocess communication. Some operating systems implement shared
memory using shared pages.
Organizing memory according to pages provides numerous benefits in
addition to allowing several processes to share the same physical pages. We
cover several other benefits in Chapter 9.
8.6 Structure of the Page Table
In this section, we explore some of the most common techniques for structuring
the page table, including hierarchical paging, hashed page tables, and inverted
page tables.
8.6.1 Hierarchical Paging
Most modern computer systems support a large logical address space
(2^32 to 2^64). In such an environment, the page table itself becomes excessively
large. For example, consider a system with a 32-bit logical address space. If
the page size in such a system is 4 KB (2^12), then a page table may consist of
up to 1 million entries (2^32/2^12). Assuming that each entry consists of 4 bytes,
each process may need up to 4 MB of physical address space for the page table
alone. Clearly, we would not want to allocate the page table contiguously in
main memory. One simple solution to this problem is to divide the page table
into smaller pieces. We can accomplish this division in several ways.
One way is to use a two-level paging algorithm, in which the page table
itself is also paged (Figure 8.17). For example, consider again the system with
a 32-bit logical address space and a page size of 4 KB. A logical address is
divided into a page number consisting of 20 bits and a page offset consisting
of 12 bits. Because we page the page table, the page number is further divided
into a 10-bit page number and a 10-bit page offset. Thus, a logical address is as
follows:

    page number        page offset
      p1      p2            d
      10      10           12

where p1 is an index into the outer page table and p2 is the displacement
within the page of the inner page table. The address-translation method for this
architecture is shown in Figure 8.18. Because address translation works from
the outer page table inward, this scheme is also known as a forward-mapped
page table.

Figure 8.17 A two-level page-table scheme.

Figure 8.18 Address translation for a two-level 32-bit paging architecture.
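For the 32-bit layout above, extracting the three fields and walking both levels can be sketched as follows; the table layout and names are illustrative:

    /* Two-level translation for the 10/10/12 layout above. */
    static unsigned *outer_page_table[1 << 10];   /* 2^10 inner tables */

    static unsigned translate(unsigned logical) {
        unsigned p1 = logical >> 22;              /* index into outer table */
        unsigned p2 = (logical >> 12) & 0x3FF;    /* index into inner table */
        unsigned d  = logical & 0xFFF;            /* offset within the page */

        unsigned *inner = outer_page_table[p1];   /* one memory access */
        unsigned frame  = inner[p2];              /* a second memory access */
        return (frame << 12) | d;
    }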
Consider the memory management of one of the classic systems, the VAX
minicomputer from Digital Equipment Corporation (DEC). The VAX was the
most popular minicomputer of its time and was sold from 1977 through 2000.
The VAX architecture supported a variation of two-level paging. The VAX is a 32-
bit machine with a page size of 512 bytes. The logical address space of a process
is divided into four equal sections, each of which consists of 2^30 bytes. Each
section represents a different part of the logical address space of a process. The
first 2 high-order bits of the logical address designate the appropriate section.
The next 21 bits represent the logical page number of that section, and the final
9 bits represent an offset in the desired page. By partitioning the page table in
this manner, the operating system can leave partitions unused until a process
needs them. Entire sections of virtual address space are frequently unused, and
multilevel page tables have no entries for these spaces, greatly decreasing the
amount of memory needed to store virtual memory data structures.
An address on the VAX architecture is as follows:
    section    page    offset
       s         p        d
       2        21        9
where s designates the section number, p is an index into the page table, and d
is the displacement within the page. Even when this scheme is used, the size
of a one-level page table for a VAX process using one section is 2^21 entries × 4
bytes per entry = 8 MB. To further reduce main-memory use, the VAX pages the
user-process page tables.
For a system with a 64-bit logical address space, a two-level paging scheme
is no longer appropriate. To illustrate this point, let’s suppose that the page
size in such a system is 4 KB (2^12). In this case, the page table consists of up
to 2^52 entries. If we use a two-level paging scheme, then the inner page tables
can conveniently be one page long, or contain 2^10 4-byte entries. The addresses
look like this:

    outer page    inner page    offset
        p1            p2           d
        42            10          12
The outer page table consists of 2^42 entries, or 2^44 bytes. The obvious way to
avoid such a large table is to divide the outer page table into smaller pieces.
(This approach is also used on some 32-bit processors for added flexibility and
efficiency.)
We can divide the outer page table in various ways. For example, we can
page the outer page table, giving us a three-level paging scheme. Suppose that
the outer page table is made up of standard-size pages (2^10 entries, or 2^12
bytes). In this case, a 64-bit address space is still daunting:

    2nd outer page    outer page    inner page    offset
          p1              p2            p3           d
          32              10            10          12

The outer page table is still 2^34 bytes (16 GB) in size.
The next step would be a four-level paging scheme, where the second-level
outer page table itself is also paged, and so forth. The 64-bit UltraSPARC would
require seven levels of paging—a prohibitive number of memory accesses—
to translate each logical address. You can see from this example why, for 64-bit
architectures, hierarchical page tables are generally considered inappropriate.
8.6.2 Hashed Page Tables
A common approach for handling address spaces larger than 32 bits is to use
a hashed page table, with the hash value being the virtual page number. Each
entry in the hash table contains a linked list of elements that hash to the same
location (to handle collisions). Each element consists of three fields: (1) the
virtual page number, (2) the value of the mapped page frame, and (3) a pointer
to the next element in the linked list.
The algorithm works as follows: The virtual page number in the virtual
address is hashed into the hash table. The virtual page number is compared
with field 1 in the first element in the linked list. If there is a match, the
corresponding page frame (field 2) is used to form the desired physical address.
If there is no match, subsequent entries in the linked list are searched for a
matching virtual page number. This scheme is shown in Figure 8.19.
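A sketch of the lookup just described, with each hash bucket holding a linked list of (virtual page, frame) elements; the structure names are assumptions:

    #include <stddef.h>

    struct hpt_element {
        unsigned long vpn;           /* (1) virtual page number */
        unsigned long frame;         /* (2) mapped page frame */
        struct hpt_element *next;    /* (3) next element in the chain */
    };

    #define TABLE_SIZE 1024
    static struct hpt_element *hash_table[TABLE_SIZE];

    /* Returns the frame for vpn, or -1 if there is no mapping. */
    static long hpt_lookup(unsigned long vpn) {
        struct hpt_element *e = hash_table[vpn % TABLE_SIZE];
        while (e != NULL) {
            if (e->vpn == vpn)       /* compare with field (1) */
                return (long) e->frame;
            e = e->next;             /* collision: follow the chain */
        }
        return -1;
    }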
Figure 8.19 Hashed page table.

A variation of this scheme that is useful for 64-bit address spaces has
been proposed. This variation uses clustered page tables, which are similar to
hashed page tables except that each entry in the hash table refers to several
pages (such as 16) rather than a single page. Therefore, a single page-table
entry can store the mappings for multiple physical-page frames. Clustered
page tables are particularly useful for sparse address spaces, where memory
references are noncontiguous and scattered throughout the address space.
8.6.3 Inverted Page Tables
Usually, each process has an associated page table. The page table has one
entry for each page that the process is using (or one slot for each virtual
address, regardless of the latter’s validity). This table representation is a natural
one, since processes reference pages through the pages’ virtual addresses. The
operating system must then translate this reference into a physical memory
address. Since the table is sorted by virtual address, the operating system is
able to calculate where in the table the associated physical address entry is
located and to use that value directly. One of the drawbacks of this method
is that each page table may consist of millions of entries. These tables may
consume large amounts of physical memory just to keep track of how other
physical memory is being used.
To solve this problem, we can use an inverted page table. An inverted
page table has one entry for each real page (or frame) of memory. Each entry
consists of the virtual address of the page stored in that real memory location,
with information about the process that owns the page. Thus, only one page
table is in the system, and it has only one entry for each page of physical
memory. Figure 8.20 shows the operation of an inverted page table. Compare
it with Figure 8.10, which depicts a standard page table in operation. Inverted
page tables often require that an address-space identifier (Section 8.5.2) be
stored in each entry of the page table, since the table usually contains several
different address spaces mapping physical memory. Storing the address-space
identifier ensures that a logical page for a particular process is mapped to the
corresponding physical page frame. Examples of systems using inverted page
tables include the 64-bit UltraSPARC and PowerPC.
Figure 8.20 Inverted page table.
To illustrate this method, we describe a simplified version of the inverted
page table used in the IBM RT. IBM was the first major company to use inverted
page tables, starting with the IBM System 38 and continuing through the
RS/6000 and the current IBM Power CPUs. For the IBM RT, each virtual address
in the system consists of a triple:

<process-id, page-number, offset>.

Each inverted page-table entry is a pair <process-id, page-number>, where the
process-id assumes the role of the address-space identifier. When a memory
reference occurs, part of the virtual address, consisting of <process-id,
page-number>, is presented to the memory subsystem. The inverted page table
is then searched for a match. If a match is found—say, at entry i—then the
physical address <i, offset> is generated. If no match is found, then an illegal
address access has been attempted.
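A sketch of this search in C follows. The linear scan mirrors the description
above; real systems avoid it with the hash table discussed shortly. All names
and sizes are illustrative.

#include <stdint.h>

#define NFRAMES   4096            /* physical frames (illustrative) */
#define PAGE_SIZE 4096

/* One entry per physical frame: <process-id, page-number>. */
struct ipt_entry {
    int      pid;
    uint64_t page;
};

static struct ipt_entry ipt[NFRAMES];

/* Search the table; a matching entry's index i is the frame number.
   Returns 0 on success, -1 for an illegal address access. */
int ipt_translate(int pid, uint64_t page, uint64_t offset, uint64_t *phys)
{
    int i;

    for (i = 0; i < NFRAMES; i++) {
        if (ipt[i].pid == pid && ipt[i].page == page) {
            *phys = (uint64_t)i * PAGE_SIZE + offset;
            return 0;
        }
    }
    return -1;
}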
Although this scheme decreases the amount of memory needed to store
each page table, it increases the amount of time needed to search the table when
a page reference occurs. Because the inverted page table is sorted by physical
address, but lookups occur on virtual addresses, the whole table might need
to be searched before a match is found. This search would take far too long.
To alleviate this problem, we use a hash table, as described in Section 8.6.2,
to limit the search to one—or at most a few—page-table entries. Of course,
each access to the hash table adds a memory reference to the procedure, so one
virtual memory reference requires at least two real memory reads—one for the
hash-table entry and one for the page table. (Recall that the TLB is searched first,
before the hash table is consulted, offering some performance improvement.)
Systems that use inverted page tables have difficulty implementing shared
memory. Shared memory is usually implemented as multiple virtual addresses
(one for each process sharing the memory) that are mapped to one physical
address. This standard method cannot be used with inverted page tables;
because there is only one virtual page entry for every physical page, one
physical page cannot have two (or more) shared virtual addresses. A simple
technique for addressing this issue is to allow the page table to contain only
one mapping of a virtual address to the shared physical address. This means
that references to virtual addresses that are not mapped result in page faults.
8.6.4 Oracle SPARC Solaris
Consider as a final example a modern 64-bit CPU and operating system that are
tightly integrated to provide low-overhead virtual memory. Solaris running
on the SPARC CPU is a fully 64-bit operating system and as such has to solve
the problem of virtual memory without using up all of its physical memory
by keeping multiple levels of page tables. Its approach is a bit complex but
solves the problem efficiently using hashed page tables. There are two hash
tables—one for the kernel and one for all user processes. Each maps memory
addresses from virtual to physical memory. Each hash-table entry represents a
contiguous area of mapped virtual memory, which is more efficient than having
a separate hash-table entry for each page. Each entry has a base address and a
span indicating the number of pages the entry represents.
Virtual-to-physical translation would take too long if each address required
searching through a hash table, so the CPU implements a TLB that holds
translation table entries (TTEs) for fast hardware lookups. A cache of these TTEs
resides in a translation storage buffer (TSB), which includes an entry per recently
accessed page. When a virtual address reference occurs, the hardware searches
the TLB for a translation. If none is found, the hardware walks through the
in-memory TSB looking for the TTE that corresponds to the virtual address that
caused the lookup. This TLB walk functionality is found on many modern CPUs.
If a match is found in the TSB, the CPU copies the TSB entry into the TLB, and
the memory translation completes. If no match is found in the TSB, the kernel
is interrupted to search the hash table. The kernel then creates a TTE from the
appropriate hash table and stores it in the TSB for automatic loading into the TLB
by the CPU memory-management unit. Finally, the interrupt handler returns
control to the MMU, which completes the address translation and retrieves the
requested byte or word from main memory.
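The order of lookups can be outlined as follows. This is only a sketch of the
sequence described above: the helper prototypes stand in for hardware (TLB
lookup, TSB walk) and kernel (hash-table search) mechanisms and are not real
interfaces.

#include <stdint.h>
#include <stddef.h>

struct tte { uint64_t vpage, pframe; };   /* simplified TTE        */

/* Stand-ins for hardware and kernel mechanisms (not real APIs).   */
int         tlb_lookup(uint64_t vpage, uint64_t *pframe);
void        tlb_insert(const struct tte *t);
struct tte *tsb_walk(uint64_t vpage);           /* hardware walk    */
void        tsb_insert(const struct tte *t);
struct tte *hash_table_search(uint64_t vpage);  /* kernel, on trap  */

uint64_t translate(uint64_t vpage, uint64_t offset, uint64_t page_size)
{
    uint64_t pframe;

    if (!tlb_lookup(vpage, &pframe)) {    /* 1. TLB miss              */
        struct tte *t = tsb_walk(vpage);  /* 2. hardware searches TSB */
        if (t == NULL) {
            t = hash_table_search(vpage); /* 3. trap: kernel searches */
            tsb_insert(t);                /*    hash table, fills TSB */
        }
        tlb_insert(t);                    /* 4. TTE loaded into TLB   */
        pframe = t->pframe;
    }
    return pframe * page_size + offset;   /* complete translation     */
}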
8.7 Example: Intel 32 and 64-bit Architectures
The architecture of Intel chips has dominated the personal computer landscape
for several years. The 16-bit Intel 8086 appeared in the late 1970s and was soon
followed by another 16-bit chip—the Intel 8088—which was notable for being
the chip used in the original IBM PC. Both the 8086 chip and the 8088 chip were
based on a segmented architecture. Intel later produced a series of 32-bit chips
—the IA-32—which included the family of 32-bit Pentium processors. The
IA-32 architecture supported both paging and segmentation. More recently,
Intel has produced a series of 64-bit chips based on the x86-64 architecture.
Currently, all the most popular PC operating systems run on Intel chips,
including Windows, Mac OS X, and Linux (although Linux, of course, runs
on several other architectures as well). Notably, however, Intel’s dominance
has not spread to mobile systems, where the ARM architecture currently enjoys
considerable success (see Section 8.8).
CHAPTER 9
Virtual Memory
In Chapter 8, we discussed various memory-management strategies used in
computer systems. All these strategies have the same goal: to keep many
processes in memory simultaneously to allow multiprogramming. However,
they tend to require that an entire process be in memory before it can execute.
Virtual memory is a technique that allows the execution of processes
that are not completely in memory. One major advantage of this scheme is
that programs can be larger than physical memory. Further, virtual memory
abstracts main memory into an extremely large, uniform array of storage,
separating logical memory as viewed by the user from physical memory.
This technique frees programmers from the concerns of memory-storage
limitations. Virtual memory also allows processes to share files easily and
to implement shared memory. In addition, it provides an efficient mechanism
for process creation. Virtual memory is not easy to implement, however, and
may substantially decrease performance if it is used carelessly. In this chapter,
we discuss virtual memory in the form of demand paging and examine its
complexity and cost.
CHAPTER OBJECTIVES
• To describe the benefits of a virtual memory system.
• To explain the concepts of demand paging, page-replacement algorithms,
and allocation of page frames.
• To discuss the principles of the working-set model.
• To examine the relationship between shared memory and memory-mapped
files.
• To explore how kernel memory is managed.
9.1 Background
The memory-management algorithms outlined in Chapter 8 are necessary
because of one basic requirement: The instructions being executed must be
in physical memory. The first approach to meeting this requirement is to place
the entire logical address space in physical memory. Dynamic loading can help
to ease this restriction, but it generally requires special precautions and extra
work by the programmer.
The requirement that instructions must be in physical memory to be
executed seems both necessary and reasonable; but it is also unfortunate, since
it limits the size of a program to the size of physical memory. In fact, an
examination of real programs shows us that, in many cases, the entire program
is not needed. For instance, consider the following:
• Programs often have code to handle unusual error conditions. Since these
errors seldom, if ever, occur in practice, this code is almost never executed.
• Arrays, lists, and tables are often allocated more memory than they actually
need. An array may be declared 100 by 100 elements, even though it is
seldom larger than 10 by 10 elements. An assembler symbol table may
have room for 3,000 symbols, although the average program has less than
200 symbols.
• Certain options and features of a program may be used rarely. For instance,
the routines on U.S. government computers that balance the budget have
not been used in many years.
Even in those cases where the entire program is needed, it may not all be
needed at the same time.
The ability to execute a program that is only partially in memory would
confer many benefits:
• A program would no longer be constrained by the amount of physical
memory that is available. Users would be able to write programs for an
extremely large virtual address space, simplifying the programming task.
• Because each user program could take less physical memory, more
programs could be run at the same time, with a corresponding increase in
CPU utilization and throughput but with no increase in response time or
turnaround time.
• Less I/O would be needed to load or swap user programs into memory, so
each user program would run faster.
Thus, running a program that is not entirely in memory would benefit both
the system and the user.
Virtual memory involves the separation of logical memory as perceived
by users from physical memory. This separation allows an extremely large
virtual memory to be provided for programmers when only a smaller physical
memory is available (Figure 9.1). Virtual memory makes the task of program-
ming much easier, because the programmer no longer needs to worry about
the amount of physical memory available; she can concentrate instead on the
problem to be programmed.
The virtual address space of a process refers to the logical (or virtual) view
of how a process is stored in memory. Typically, this view is that a process
begins at a certain logical address—say, address 0—and exists in contiguous
memory, as shown in Figure 9.2. Recall from Chapter 8, though, that in fact
physical memory may be organized in page frames and that the physical page
frames assigned to a process may not be contiguous. It is up to the memory-
management unit (MMU) to map logical pages to physical page frames in
memory.

Figure 9.1 Diagram showing virtual memory that is larger than physical memory.

Figure 9.2 Virtual address space.

Figure 9.3 Shared library using virtual memory.

Note in Figure 9.2 that we allow the heap to grow upward in memory as
it is used for dynamic memory allocation. Similarly, we allow for the stack to
grow downward in memory through successive function calls. The large blank
space (or hole) between the heap and the stack is part of the virtual address
space but will require actual physical pages only if the heap or stack grows.
Virtual address spaces that include holes are known as sparse address spaces.
Using a sparse address space is beneficial because the holes can be filled as the
stack or heap segments grow or if we wish to dynamically link libraries (or
possibly other shared objects) during program execution.
In addition to separating logical memory from physical memory, virtual
memory allows files and memory to be shared by two or more processes
through page sharing (Section 8.5.4). This leads to the following benefits:
• System libraries can be shared by several processes through mapping of the
shared object into a virtual address space. Although each process considers
the libraries to be part of its virtual address space, the actual pages where
the libraries reside in physical memory are shared by all the processes
(Figure 9.3). Typically, a library is mapped read-only into the space of each
process that is linked with it.
• Similarly, processes can share memory. Recall from Chapter 3 that two
or more processes can communicate through the use of shared memory.
Virtual memory allows one process to create a region of memory that it can
share with another process. Processes sharing this region consider it part
of their virtual address space, yet the actual physical pages of memory are
shared, much as is illustrated in Figure 9.3.
• Pages can be shared during process creation with the fork() system call,
thus speeding up process creation.
We further explore these—and other—benefits of virtual memory later in
this chapter. First, though, we discuss implementing virtual memory through
demand paging.
9.2 Demand Paging
Consider how an executable program might be loaded from disk into memory.
One option is to load the entire program in physical memory at program
execution time. However, a problem with this approach is that we may not
initially need the entire program in memory. Suppose a program starts with
a list of available options from which the user is to select. Loading the entire
program into memory results in loading the executable code for all options,
regardless of whether or not an option is ultimately selected by the user. An
alternative strategy is to load pages only as they are needed. This technique is
known as demand paging and is commonly used in virtual memory systems.
With demand-paged virtual memory, pages are loaded only when they are
demanded during program execution. Pages that are never accessed are thus
never loaded into physical memory.
A demand-paging system is similar to a paging system with swapping
(Figure 9.4) where processes reside in secondary memory (usually a disk).
When we want to execute a process, we swap it into memory. Rather than
swapping the entire process into memory, though, we use a lazy swapper.
A lazy swapper never swaps a page into memory unless that page will be
needed. In the context of a demand-paging system, use of the term “swapper”
is technically incorrect. A swapper manipulates entire processes, whereas a
pager is concerned with the individual pages of a process. We thus use “pager,”
rather than “swapper,” in connection with demand paging.
Figure 9.4 Transfer of a paged memory to contiguous disk space.
9.2.1 Basic Concepts
When a process is to be swapped in, the pager guesses which pages will be
used before the process is swapped out again. Instead of swapping in a whole
process, the pager brings only those pages into memory. Thus, it avoids reading
into memory pages that will not be used anyway, decreasing the swap time
and the amount of physical memory needed.
With this scheme, we need some form of hardware support to distinguish
between the pages that are in memory and the pages that are on the disk.
The valid–invalid bit scheme described in Section 8.5.3 can be used for this
purpose. This time, however, when this bit is set to “valid,” the associated page
is both legal and in memory. If the bit is set to “invalid,” the page either is not
valid (that is, not in the logical address space of the process) or is valid but
is currently on the disk. The page-table entry for a page that is brought into
memory is set as usual, but the page-table entry for a page that is not currently
in memory is either simply marked invalid or contains the address of the page
on disk. This situation is depicted in Figure 9.5.
Notice that marking a page invalid will have no effect if the process never
attempts to access that page. Hence, if we guess right and page in all pages
that are actually needed and only those pages, the process will run exactly as
though we had brought in all pages. While the process executes and accesses
pages that are memory resident, execution proceeds normally.
Figure 9.5 Page table when some pages are not in main memory.

Figure 9.6 Steps in handling a page fault.
But what happens if the process tries to access a page that was not brought
into memory? Access to a page marked invalid causes a page fault. The paging
hardware, in translating the address through the page table, will notice that
the invalid bit is set, causing a trap to the operating system. This trap is the
result of the operating system’s failure to bring the desired page into memory.
The procedure for handling this page fault is straightforward (Figure 9.6):
1. We check an internal table (usually kept with the process control block)
for this process to determine whether the reference was a valid or an
invalid memory access.
2. If the reference was invalid, we terminate the process. If it was valid but
we have not yet brought in that page, we now page it in.
3. We find a free frame (by taking one from the free-frame list, for example).
4. We schedule a disk operation to read the desired page into the newly
allocated frame.
5. When the disk read is complete, we modify the internal table kept with
the process and the page table to indicate that the page is now in memory.
6. We restart the instruction that was interrupted by the trap. The process
can now access the page as though it had always been in memory.
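In code, the six steps reduce to something like the sketch below. The helper
functions and the process structure are hypothetical; they stand in for the
free-frame list, the disk driver, and the table kept with the process control
block.

#include <stdint.h>

#define NPAGES 64

struct pte { int valid; int frame; };

struct process {
    struct pte page_table[NPAGES];
    int        legal[NPAGES];     /* table kept with the PCB (step 1) */
};

/* Hypothetical helpers: frame allocator and disk I/O. */
int  find_free_frame(void);
void read_page_from_disk(struct process *p, int vpn, int frame);

/* Returns 0 if the faulting instruction can be restarted (step 6),
   -1 if the reference was invalid and the process must terminate.  */
int handle_page_fault(struct process *p, int vpn)
{
    if (!p->legal[vpn])                  /* steps 1-2: invalid access */
        return -1;

    int frame = find_free_frame();       /* step 3                    */
    read_page_from_disk(p, vpn, frame);  /* steps 4-5: read the page, */
    p->page_table[vpn].frame = frame;    /* then mark it resident     */
    p->page_table[vpn].valid = 1;
    return 0;
}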
In the extreme case, we can start executing a process with no pages in
memory. When the operating system sets the instruction pointer to the first
instruction of the process, which is on a non-memory-resident page, the process
immediately faults for the page. After this page is brought into memory, the
process continues to execute, faulting as necessary until every page that it
needs is in memory. At that point, it can execute with no more faults. This
scheme is pure demand paging: never bring a page into memory until it is
required.
Theoretically, some programs could access several new pages of memory
with each instruction execution (one page for the instruction and many for
data), possibly causing multiple page faults per instruction. This situation
would result in unacceptable system performance. Fortunately, analysis of
running processes shows that this behavior is exceedingly unlikely. Programs
tend to have locality of reference, described in Section 9.6.1, which results in
reasonable performance from demand paging.
The hardware to support demand paging is the same as the hardware for
paging and swapping:
• Page table. This table has the ability to mark an entry invalid through a
valid–invalid bit or a special value of protection bits.
• Secondary memory. This memory holds those pages that are not present
in main memory. The secondary memory is usually a high-speed disk. It is
known as the swap device, and the section of disk used for this purpose is
known as swap space. Swap-space allocation is discussed in Chapter 10.
A crucial requirement for demand paging is the ability to restart any
instruction after a page fault. Because we save the state (registers, condition
code, instruction counter) of the interrupted process when the page fault
occurs, we must be able to restart the process in exactly the same place and
state, except that the desired page is now in memory and is accessible. In most
cases, this requirement is easy to meet. A page fault may occur at any memory
reference. If the page fault occurs on the instruction fetch, we can restart by
fetching the instruction again. If a page fault occurs while we are fetching an
operand, we must fetch and decode the instruction again and then fetch the
operand.
As a worst-case example, consider a three-address instruction such as ADD
the content of A to B, placing the result in C. These are the steps to execute this
instruction:
1. Fetch and decode the instruction (ADD).
2. Fetch A.
3. Fetch B.
4. Add A and B.
5. Store the sum in C.
If we fault when we try to store in C (because C is in a page not currently
in memory), we will have to get the desired page, bring it in, correct the
page table, and restart the instruction. The restart will require fetching the
instruction again, decoding it again, fetching the two operands again, and
then adding again. However, there is not much repeated work (less than one
complete instruction), and the repetition is necessary only when a page fault
occurs.
The major difficulty arises when one instruction may modify several
different locations. For example, consider the IBM System 360/370 MVC (move
character) instruction, which can move up to 256 bytes from one location to
another (possibly overlapping) location. If either block (source or destination)
straddles a page boundary, a page fault might occur after the move is partially
done. In addition, if the source and destination blocks overlap, the source
block may have been modified, in which case we cannot simply restart the
instruction.
This problem can be solved in two different ways. In one solution, the
microcode computes and attempts to access both ends of both blocks. If a page
fault is going to occur, it will happen at this step, before anything is modified.
The move can then take place; we know that no page fault can occur, since all
the relevant pages are in memory. The other solution uses temporary registers
to hold the values of overwritten locations. If there is a page fault, all the old
values are written back into memory before the trap occurs. This action restores
memory to its state before the instruction was started, so that the instruction
can be repeated.
This is by no means the only architectural problem resulting from adding
paging to an existing architecture to allow demand paging, but it illustrates
some of the difficulties involved. Paging is added between the CPU and the
memory in a computer system. It should be entirely transparent to the user
process. Thus, people often assume that paging can be added to any system.
Although this assumption is true for a non-demand-paging environment,
where a page fault represents a fatal error, it is not true where a page fault
means only that an additional page must be brought into memory and the
process restarted.
9.2.2 Performance of Demand Paging
Demand paging can significantly affect the performance of a computer system.
To see why, let’s compute the effective access time for a demand-paged
memory. For most computer systems, the memory-access time, denoted ma,
ranges from 10 to 200 nanoseconds. As long as we have no page faults, the
effective access time is equal to the memory access time. If, however, a page
fault occurs, we must first read the relevant page from disk and then access the
desired word.
Let p be the probability of a page fault (0 ≤ p ≤ 1). We would expect p to
be close to zero—that is, we would expect to have only a few page faults. The
effective access time is then
effective access time = (1 − p) × ma + p × page fault time.
To compute the effective access time, we must know how much time is
needed to service a page fault. A page fault causes the following sequence to
occur:
1. Trap to the operating system.
2. Save the user registers and process state.
3. Determine that the interrupt was a page fault.
4. Check that the page reference was legal and determine the location of the
page on the disk.
5. Issue a read from the disk to a free frame:
a. Wait in a queue for this device until the read request is serviced.
b. Wait for the device seek and/or latency time.
c. Begin the transfer of the page to a free frame.
6. While waiting, allocate the CPU to some other user (CPU scheduling,
optional).
7. Receive an interrupt from the disk I/O subsystem (I/O completed).
8. Save the registers and process state for the other user (if step 6 is executed).
9. Determine that the interrupt was from the disk.
10. Correct the page table and other tables to show that the desired page is
now in memory.
11. Wait for the CPU to be allocated to this process again.
12. Restore the user registers, process state, and new page table, and then
resume the interrupted instruction.
Not all of these steps are necessary in every case. For example, we are assuming
that, in step 6, the CPU is allocated to another process while the I/O occurs.
This arrangement allows multiprogramming to maintain CPU utilization but
requires additional time to resume the page-fault service routine when the I/O
transfer is complete.
In any case, we are faced with three major components of the page-fault
service time:
1. Service the page-fault interrupt.
2. Read in the page.
3. Restart the process.
The first and third tasks can be reduced, with careful coding, to several
hundred instructions. These tasks may take from 1 to 100 microseconds each.
The page-switch time, however, will probably be close to 8 milliseconds.
(A typical hard disk has an average latency of 3 milliseconds, a seek of
5 milliseconds, and a transfer time of 0.05 milliseconds. Thus, the total
paging time is about 8 milliseconds, including hardware and software time.)
Remember also that we are looking at only the device-service time. If a queue
of processes is waiting for the device, we have to add device-queueing time as
we wait for the paging device to be free to service our request, increasing even
more the time to swap.
With an average page-fault service time of 8 milliseconds and a memory-
access time of 200 nanoseconds, the effective access time in nanoseconds
is
effective access time = (1 − p) × (200) + p (8 milliseconds)
= (1 − p) × 200 + p × 8,000,000
= 200 + 7,999,800 × p.
We see, then, that the effective access time is directly proportional to the
page-fault rate. If one access out of 1,000 causes a page fault, the effective access
time is 8.2 microseconds. The computer will be slowed down by a factor of 40
because of demand paging! If we want performance degradation to be less
than 10 percent, we need to keep the probability of page faults at the following
level:
220 > 200 + 7,999,800 × p,
20 > 7,999,800 × p,
p < 0.0000025.
That is, to keep the slowdown due to paging at a reasonable level, we can
allow fewer than one memory access out of 399,990 to page-fault. In sum,
it is important to keep the page-fault rate low in a demand-paging system.
Otherwise, the effective access time increases, slowing process execution
dramatically.
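The arithmetic above is easy to check with a few lines of C; the two
probabilities below are the ones used in the text.

#include <stdio.h>

/* Effective access time in nanoseconds, given fault probability p,
   memory-access time ma, and page-fault service time fs (both ns). */
static double eat(double p, double ma, double fs)
{
    return (1.0 - p) * ma + p * fs;
}

int main(void)
{
    double ma = 200.0;            /* 200 nanoseconds */
    double fs = 8000000.0;        /* 8 milliseconds  */

    printf("p = 1/1000   : %.1f ns\n", eat(0.001, ma, fs));     /* 8199.8 */
    printf("p = 0.0000025: %.1f ns\n", eat(0.0000025, ma, fs)); /* ~220   */
    return 0;
}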
An additional aspect of demand paging is the handling and overall use
of swap space. Disk I/O to swap space is generally faster than that to the file
system. It is faster because swap space is allocated in much larger
blocks, and file lookups and indirect allocation methods are not used (Chapter
10). The system can therefore gain better paging throughput by copying an
entire file image into the swap space at process startup and then performing
demand paging from the swap space. Another option is to demand pages
from the file system initially but to write the pages to swap space as they are
replaced. This approach will ensure that only needed pages are read from the
file system but that all subsequent paging is done from swap space.
Some systems attempt to limit the amount of swap space used through
demand paging of binary files. Demand pages for such files are brought directly
from the file system. However, when page replacement is called for, these
frames can simply be overwritten (because they are never modified), and the
pages can be read in from the file system again if needed. Using this approach,
the file system itself serves as the backing store. However, swap space must still
be used for pages not associated with a file (known as anonymous memory);
these pages include the stack and heap for a process. This method appears to
be a good compromise and is used in several systems, including Solaris and
BSD UNIX.
Mobile operating systems typically do not support swapping. Instead,
these systems demand-page from the file system and reclaim read-only pages
(such as code) from applications if memory becomes constrained. Such data
can be demand-paged from the file system if it is later needed. Under iOS,
anonymous memory pages are never reclaimed from an application unless the
application is terminated or explicitly releases the memory.
9.3 Copy-on-Write
In Section 9.2, we illustrated how a process can start quickly by demand-paging
in the page containing the first instruction. However, process creation using the
fork() system call may initially bypass the need for demand paging by using
a technique similar to page sharing (covered in Section 8.5.4). This technique
provides rapid process creation and minimizes the number of new pages that
must be allocated to the newly created process.
Recall that the fork() system call creates a child process that is a duplicate
of its parent. Traditionally, fork() worked by creating a copy of the parent’s
address space for the child, duplicating the pages belonging to the parent.
However, considering that many child processes invoke the exec() system
call immediately after creation, the copying of the parent’s address space may
be unnecessary. Instead, we can use a technique known as copy-on-write,
which works by allowing the parent and child processes initially to share the
same pages. These shared pages are marked as copy-on-write pages, meaning
that if either process writes to a shared page, a copy of the shared page is
created. Copy-on-write is illustrated in Figures 9.7 and 9.8, which show the
contents of the physical memory before and after process 1 modifies page C.
For example, assume that the child process attempts to modify a page
containing portions of the stack, with the pages set to be copy-on-write. The
operating system will create a copy of this page, mapping it to the address space
of the child process. The child process will then modify its copied page and not
the page belonging to the parent process. Obviously, when the copy-on-write
technique is used, only the pages that are modified by either process are copied;
all unmodified pages can be shared by the parent and child processes. Note, too,
that only pages that can be modified need be marked as copy-on-write. Pages
that cannot be modified (pages containing executable code) can be shared by
the parent and child. Copy-on-write is a common technique used by several
operating systems, including Windows XP, Linux, and Solaris.
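The effect can be observed from user code on a POSIX system: after fork(), a
write by the child leaves the parent's copy of the data untouched, whether or
not the kernel implements the sharing with copy-on-write underneath. A minimal
demonstration:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    static char buf[4096];              /* roughly one page of data   */
    strcpy(buf, "parent data");

    pid_t pid = fork();                 /* pages shared copy-on-write */
    if (pid == 0) {                     /* child                      */
        strcpy(buf, "child data");      /* write triggers a page copy */
        printf("child sees : %s\n", buf);
        exit(0);
    }
    wait(NULL);
    printf("parent sees: %s\n", buf);   /* still "parent data"        */
    return 0;
}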
When it is determined that a page is going to be duplicated using copy-
on-write, it is important to note the location from which the free page will
be allocated. Many operating systems provide a pool of free pages for such
requests. These free pages are typically allocated when the stack or heap for a
process must expand or when there are copy-on-write pages to be managed.
Figure 9.7 Before process 1 modifies page C.

Figure 9.8 After process 1 modifies page C.
Operating systems typically allocate these pages using a technique known as
zero-fill-on-demand. Zero-fill-on-demand pages have been zeroed-out before
being allocated, thus erasing the previous contents.
Several versions of UNIX (including Solaris and Linux) provide a variation
of the fork() system call—vfork() (for virtual memory fork)—that operates
differently from fork() with copy-on-write. With vfork(), the parent process
is suspended, and the child process uses the address space of the parent.
Because vfork() does not use copy-on-write, if the child process changes
any pages of the parent’s address space, the altered pages will be visible to the
parent once it resumes. Therefore, vfork() must be used with caution to ensure
that the child process does not modify the address space of the parent. vfork()
is intended to be used when the child process calls exec() immediately after
creation. Because no copying of pages takes place, vfork() is an extremely
efficient method of process creation and is sometimes used to implement UNIX
command-line shell interfaces.
9.4 Page Replacement
In our earlier discussion of the page-fault rate, we assumed that each page
faults at most once, when it is first referenced. This representation is not strictly
accurate, however. If a process of ten pages actually uses only half of them, then
demand paging saves the I/O necessary to load the five pages that are never
used. We could also increase our degree of multiprogramming by running
twice as many processes. Thus, if we had forty frames, we could run eight
processes, rather than the four that could run if each required ten frames (five
of which were never used).
If we increase our degree of multiprogramming, we are over-allocating
memory. If we run six processes, each of which is ten pages in size but actually
uses only five pages, we have higher CPU utilization and throughput, with
ten frames to spare. It is possible, however, that each of these processes, for a
particular data set, may suddenly try to use all ten of its pages, resulting in a
need for sixty frames when only forty are available.
Further, consider that system memory is not used only for holding program
pages. Buffers for I/O also consume a considerable amount of memory.

Figure 9.9 Need for page replacement.

This use
can increase the strain on memory-placement algorithms. Deciding how much
memory to allocate to I/O and how much to program pages is a significant
challenge. Some systems allocate a fixed percentage of memory for I/O buffers,
whereas others allow both user processes and the I/O subsystem to compete
for all system memory.
Over-allocation of memory manifests itself as follows. While a user process
is executing, a page fault occurs. The operating system determines where the
desired page is residing on the disk but then finds that there are no free frames
on the free-frame list; all memory is in use (Figure 9.9).
The operating system has several options at this point. It could terminate
the user process. However, demand paging is the operating system’s attempt to
improve the computer system’s utilization and throughput. Users should not
be aware that their processes are running on a paged system—paging should
be logically transparent to the user. So this option is not the best choice.
The operating system could instead swap out a process, freeing all its
frames and reducing the level of multiprogramming. This option is a good one
in certain circumstances, and we consider it further in Section 9.6. Here, we
discuss the most common solution: page replacement.
9.4.1 Basic Page Replacement
Page replacement takes the following approach. If no frame is free, we find
one that is not currently being used and free it. We can free a frame by writing
its contents to swap space and changing the page table (and all other tables) to
indicate that the page is no longer in memory (Figure 9.10). We can now use
the freed frame to hold the page for which the process faulted. We modify the
page-fault service routine to include page replacement:
Figure 9.10 Page replacement.
1. Find the location of the desired page on the disk.
2. Find a free frame:
a. If there is a free frame, use it.
b. If there is no free frame, use a page-replacement algorithm to select
a victim frame.
c. Write the victim frame to the disk; change the page and frame tables
accordingly.
3. Read the desired page into the newly freed frame; change the page and
frame tables.
4. Continue the user process from where the page fault occurred.
Notice that, if no frames are free, two page transfers (one out and one in)
are required. This situation effectively doubles the page-fault service time and
increases the effective access time accordingly.
We can reduce this overhead by using a modify bit (or dirty bit). When
this scheme is used, each page or frame has a modify bit associated with it in
the hardware. The modify bit for a page is set by the hardware whenever any
byte in the page is written into, indicating that the page has been modified.
When we select a page for replacement, we examine its modify bit. If the bit
is set, we know that the page has been modified since it was read in from the
disk. In this case, we must write the page to the disk. If the modify bit is not set,
however, the page has not been modified since it was read into memory. In this
case, we need not write the memory page to the disk: it is already there. This
technique also applies to read-only pages (for example, pages of binary code).
Such pages cannot be modified; thus, they may be discarded when desired.
This scheme can significantly reduce the time required to service a page fault,
since it reduces I/O time by one-half if the page has not been modified.
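A sketch of the modify-bit check during replacement follows; select_victim()
and the two I/O helpers are placeholders for a replacement algorithm and the
swap device, not real interfaces.

struct frame_info { int page; int dirty; };

/* Placeholders: a replacement policy and the swap device. */
int  select_victim(void);
void write_frame_to_swap(int frame);
void read_page_into_frame(int page, int frame);

/* Free a frame for new_page; the victim is written out only if its
   modify (dirty) bit is set, saving one transfer for clean pages.  */
int replace_page(int new_page, struct frame_info frames[])
{
    int victim = select_victim();

    if (frames[victim].dirty)            /* modified since read in */
        write_frame_to_swap(victim);
    read_page_into_frame(new_page, victim);
    frames[victim].page  = new_page;
    frames[victim].dirty = 0;
    return victim;
}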
Page replacement is basic to demand paging. It completes the separation
between logical memory and physical memory. With this mechanism, an
enormous virtual memory can be provided for programmers on a smaller
physical memory. With no demand paging, user addresses are mapped into
physical addresses, and the two sets of addresses can be different. All the
pages of a process still must be in physical memory, however. With demand
paging, the size of the logical address space is no longer constrained by physical
memory. If we have a user process of twenty pages, we can execute it in ten
frames simply by using demand paging and using a replacement algorithm to
find a free frame whenever necessary. If a page that has been modified is to be
replaced, its contents are copied to the disk. A later reference to that page will
cause a page fault. At that time, the page will be brought back into memory,
perhaps replacing some other page in the process.
We must solve two major problems to implement demand paging: we must
develop a frame-allocation algorithm and a page-replacement algorithm.
That is, if we have multiple processes in memory, we must decide how many
frames to allocate to each process; and when page replacement is required,
we must select the frames that are to be replaced. Designing appropriate
algorithms to solve these problems is an important task, because disk I/O
is so expensive. Even slight improvements in demand-paging methods yield
large gains in system performance.
There are many different page-replacement algorithms. Every operating
system probably has its own replacement scheme. How do we select a
particular replacement algorithm? In general, we want the one with the lowest
page-fault rate.
We evaluate an algorithm by running it on a particular string of memory
references and computing the number of page faults. The string of memory
references is called a reference string. We can generate reference strings
artificially (by using a random-number generator, for example), or we can trace
a given system and record the address of each memory reference. The latter
choice produces a large amount of data (on the order of 1 million addresses
per second). To reduce this volume of data, we use two facts.
First, for a given page size (and the page size is generally fixed by the
hardware or system), we need to consider only the page number, rather than
the entire address. Second, if we have a reference to a page p, then any references
to page p that immediately follow will never cause a page fault. Page p will
be in memory after the first reference, so the immediately following references
will not fault.
For example, if we trace a particular process, we might record the following
address sequence:
0100, 0432, 0101, 0612, 0102, 0103, 0104, 0101, 0611, 0102, 0103,
0104, 0101, 0610, 0102, 0103, 0104, 0101, 0609, 0102, 0105
At 100 bytes per page, this sequence is reduced to the following reference
string:
1, 4, 1, 6, 1, 6, 1, 6, 1, 6, 1
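The reduction is mechanical, as this small program shows: divide each address
by the page size and drop immediate repeats. The trace is the one listed
above.

#include <stdio.h>

#define PAGE_SIZE 100   /* bytes per page, as in the example above */

/* Reduce an address trace to a reference string: keep only the page
   number and drop immediately repeated references. */
int main(void)
{
    int trace[] = {100, 432, 101, 612, 102, 103, 104, 101, 611, 102, 103,
                   104, 101, 610, 102, 103, 104, 101, 609, 102, 105};
    int n = sizeof trace / sizeof trace[0];
    int prev = -1;

    for (int i = 0; i < n; i++) {
        int page = trace[i] / PAGE_SIZE;
        if (page != prev)
            printf("%d ", page);    /* prints: 1 4 1 6 1 6 1 6 1 6 1 */
        prev = page;
    }
    printf("\n");
    return 0;
}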
Figure 9.11 Graph of page faults versus number of frames.
To determine the number of page faults for a particular reference string and
page-replacement algorithm, we also need to know the number of page frames
available. Obviously, as the number of frames available increases, the number
of page faults decreases. For the reference string considered previously, for
example, if we had three or more frames, we would have only three faults—
one fault for the first reference to each page. In contrast, with only one frame
available, we would have a replacement with every reference, resulting in
eleven faults. In general, we expect a curve such as that in Figure 9.11. As the
number of frames increases, the number of page faults drops to some minimal
level. Of course, adding physical memory increases the number of frames.
We next illustrate several page-replacement algorithms. In doing so, we
use the reference string
7, 0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2, 1, 2, 0, 1, 7, 0, 1
for a memory with three frames.
9.4.2 FIFO Page Replacement
The simplest page-replacement algorithm is a first-in, first-out (FIFO) algorithm.
A FIFO replacement algorithm associates with each page the time when that
page was brought into memory. When a page must be replaced, the oldest
page is chosen. Notice that it is not strictly necessary to record the time when
a page is brought in. We can create a FIFO queue to hold all pages in memory.
We replace the page at the head of the queue. When a page is brought into
memory, we insert it at the tail of the queue.
For our example reference string, our three frames are initially empty. The
first three references (7, 0, 1) cause page faults and are brought into these empty
frames. The next reference (2) replaces page 7, because page 7 was brought in
first. Since 0 is the next reference and 0 is already in memory, we have no fault
for this reference. The first reference to 3 results in replacement of page 0, since
it is now first in line. Because of this replacement, the next reference, to 0, will
fault. Page 1 is then replaced by page 0. This process continues as shown in
Figure 9.12. Every time a fault occurs, we show which pages are in our three
frames. There are fifteen faults altogether.

Figure 9.12 FIFO page-replacement algorithm.
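A few lines of C reproduce the count of fifteen faults. The queue is kept
implicitly: because frames are filled and then replaced in rotation, the
oldest page is always at the current head index.

#include <stdio.h>
#include <string.h>

#define NFRAMES 3

/* Count page faults for FIFO replacement on a reference string. */
static int fifo_faults(const int *refs, int n)
{
    int frames[NFRAMES];
    int head = 0, used = 0, faults = 0;
    memset(frames, -1, sizeof frames);

    for (int i = 0; i < n; i++) {
        int hit = 0;
        for (int j = 0; j < used; j++)
            if (frames[j] == refs[i]) { hit = 1; break; }
        if (hit) continue;

        faults++;
        if (used < NFRAMES) {
            frames[used++] = refs[i];      /* fill an empty frame   */
        } else {
            frames[head] = refs[i];        /* evict the oldest page */
            head = (head + 1) % NFRAMES;
        }
    }
    return faults;
}

int main(void)
{
    int refs[] = {7,0,1,2,0,3,0,4,2,3,0,3,2,1,2,0,1,7,0,1};
    printf("FIFO faults: %d\n", fifo_faults(refs, 20));   /* prints 15 */
    return 0;
}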
The FIFO page-replacement algorithm is easy to understand and program.
However, its performance is not always good. On the one hand, the page
replaced may be an initialization module that was used a long time ago and is
no longer needed. On the other hand, it could contain a heavily used variable
that was initialized early and is in constant use.
Notice that, even if we select for replacement a page that is in active use,
everything still works correctly. After we replace an active page with a new
one, a fault occurs almost immediately to retrieve the active page. Some other
page must be replaced to bring the active page back into memory. Thus, a bad
replacement choice increases the page-fault rate and slows process execution.
It does not, however, cause incorrect execution.
To illustrate the problems that are possible with a FIFO page-replacement
algorithm, consider the following reference string:
1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5
Figure 9.13 shows the curve of page faults for this reference string versus the
number of available frames. Notice that the number of faults for four frames
(ten) is greater than the number of faults for three frames (nine)! This most
unexpected result is known as Belady’s anomaly: for some page-replacement
algorithms, the page-fault rate may increase as the number of allocated frames
increases. We would expect that giving more memory to a process would
improve its performance. In some early research, investigators noticed that
this assumption was not always true. Belady’s anomaly was discovered as a
result.
9.4.3 Optimal Page Replacement
One result of the discovery of Belady’s anomaly was the search for an optimal
page-replacement algorithm—the algorithm that has the lowest page-fault
rate of all algorithms and will never suffer from Belady’s anomaly. Such an
algorithm does exist and has been called OPT or MIN. It is simply this:
Replace the page that will not be used for the longest period of time.
Use of this page-replacement algorithm guarantees the lowest possible page-
fault rate for a fixed number of frames.
Figure 9.13 Page-fault curve for FIFO replacement on a reference string.
For example, on our sample reference string, the optimal page-replacement
algorithm would yield nine page faults, as shown in Figure 9.14. The first three
references cause faults that fill the three empty frames. The reference to page
2 replaces page 7, because page 7 will not be used until reference 18, whereas
page 0 will be used at 5, and page 1 at 14. The reference to page 3 replaces
page 1, as page 1 will be the last of the three pages in memory to be referenced
again. With only nine page faults, optimal replacement is much better than
a FIFO algorithm, which results in fifteen faults. (If we ignore the first three,
which all algorithms must suffer, then optimal replacement is twice as good as
FIFO replacement.) In fact, no replacement algorithm can process this reference
string in three frames with fewer than nine faults.
Unfortunately, the optimal page-replacement algorithm is difficult to
implement, because it requires future knowledge of the reference string. (We
encountered a similar situation with the SJF CPU-scheduling algorithm in
Section 6.3.2.) As a result, the optimal algorithm is used mainly for comparison
studies. For instance, it may be useful to know that, although a new algorithm
is not optimal, it is within 12.3 percent of optimal at worst and within 4.7
percent on average.
Figure 9.14 Optimal page-replacement algorithm.
9.4.4 LRU Page Replacement
If the optimal algorithm is not feasible, perhaps an approximation of the
optimal algorithm is possible. The key distinction between the FIFO and OPT
algorithms (other than looking backward versus forward in time) is that the
FIFO algorithm uses the time when a page was brought into memory, whereas
the OPT algorithm uses the time when a page is to be used. If we use the recent
past as an approximation of the near future, then we can replace the page that
has not been used for the longest period of time. This approach is the least
recently used (LRU) algorithm.
LRU replacement associates with each page the time of that page’s last use.
When a page must be replaced, LRU chooses the page that has not been used
for the longest period of time. We can think of this strategy as the optimal
page-replacement algorithm looking backward in time, rather than forward.
(Strangely, if we let S^R be the reverse of a reference string S, then the
page-fault rate for the OPT algorithm on S is the same as the page-fault rate
for the OPT algorithm on S^R. Similarly, the page-fault rate for the LRU
algorithm on S is the same as the page-fault rate for the LRU algorithm
on S^R.)
The result of applying LRU replacement to our example reference string is
shown in Figure 9.15. The LRU algorithm produces twelve faults. Notice that
the first five faults are the same as those for optimal replacement. When the
reference to page 4 occurs, however, LRU replacement sees that, of the three
frames in memory, page 2 was used least recently. Thus, the LRU algorithm
replaces page 2, not knowing that page 2 is about to be used. When it then faults
for page 2, the LRU algorithm replaces page 3, since it is now the least recently
used of the three pages in memory. Despite these problems, LRU replacement
with twelve faults is much better than FIFO replacement with fifteen.

Figure 9.15 LRU page-replacement algorithm.
The LRU policy is often used as a page-replacement algorithm and
is considered to be good. The major problem is how to implement LRU
replacement. An LRU page-replacement algorithm may require substantial
hardware assistance. The problem is to determine an order for the frames
defined by the time of last use. Two implementations are feasible:
• Counters. In the simplest case, we associate with each page-table entry a
time-of-use field and add to the CPU a logical clock or counter. The clock is
incremented for every memory reference. Whenever a reference to a page
is made, the contents of the clock register are copied to the time-of-use
field in the page-table entry for that page. In this way, we always have
the “time” of the last reference to each page. We replace the page with the
smallest time value. This scheme requires a search of the page table to find
the LRU page and a write to memory (to the time-of-use field in the page
table) for each memory access. The times must also be maintained when
page tables are changed (due to CPU scheduling). Overflow of the clock
must be considered.
• Stack. Another approach to implementing LRU replacement is to keep
a stack of page numbers. Whenever a page is referenced, it is removed
from the stack and put on the top. In this way, the most recently used
page is always at the top of the stack and the least recently used page is
always at the bottom (Figure 9.16). Because entries must be removed from
the middle of the stack, it is best to implement this approach by using a
doubly linked list with a head pointer and a tail pointer. Removing a page
and putting it on the top of the stack then requires changing six pointers
at worst. Each update is a little more expensive, but there is no search for
a replacement; the tail pointer points to the bottom of the stack, which is
the LRU page. This approach is particularly appropriate for software or
microcode implementations of LRU replacement.
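The counter implementation is simple to simulate. The sketch below runs the
example reference string and reports the twelve faults noted above; in real
hardware the clock would be a register updated on every reference, not a
software loop.

#include <stdio.h>

#define NFRAMES 3

/* Counter-style LRU: each frame records the "clock" value of its
   last use; the victim is the frame with the smallest value. */
static int lru_faults(const int *refs, int n)
{
    int page[NFRAMES], last_use[NFRAMES];
    int clock = 0, faults = 0;

    for (int j = 0; j < NFRAMES; j++) { page[j] = -1; last_use[j] = 0; }

    for (int i = 0; i < n; i++) {
        clock++;
        int hit = -1;
        for (int j = 0; j < NFRAMES; j++)
            if (page[j] == refs[i]) hit = j;

        if (hit >= 0) {                      /* resident: refresh time */
            last_use[hit] = clock;
            continue;
        }
        faults++;
        int victim = 0;                      /* empty frames have time */
        for (int j = 1; j < NFRAMES; j++)    /* 0, so they fill first  */
            if (last_use[j] < last_use[victim]) victim = j;
        page[victim] = refs[i];
        last_use[victim] = clock;
    }
    return faults;
}

int main(void)
{
    int refs[] = {7,0,1,2,0,3,0,4,2,3,0,3,2,1,2,0,1,7,0,1};
    printf("LRU faults: %d\n", lru_faults(refs, 20));   /* prints 12 */
    return 0;
}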
Like optimal replacement, LRU replacement does not suffer from Belady’s
anomaly. Both belong to a class of page-replacement algorithms, called stack
algorithms, that can never exhibit Belady’s anomaly. A stack algorithm is an
algorithm for which it can be shown that the set of pages in memory for n
frames is always a subset of the set of pages that would be in memory with n
+ 1 frames. For LRU replacement, the set of pages in memory would be the n
most recently referenced pages. If the number of frames is increased, these n
pages will still be the most recently referenced and so will still be in memory.
Note that neither implementation of LRU would be conceivable without
hardware assistance beyond the standard TLB registers. The updating of the
clock fields or stack must be done for every memory reference. If we were
to use an interrupt for every reference to allow software to update such data
structures, it would slow every memory reference by a factor of at least ten,
hence slowing every user process by a factor of ten. Few systems could tolerate
that level of overhead for memory management.

Figure 9.16 Use of a stack to record the most recent page references.
9.4.5 LRU-Approximation Page Replacement
Few computer systems provide sufficient hardware support for true LRU page
replacement. In fact, some systems provide no hardware support, and other
page-replacement algorithms (such as a FIFO algorithm) must be used. Many
systems provide some help, however, in the form of a reference bit. The
reference bit for a page is set by the hardware whenever that page is referenced
(either a read or a write to any byte in the page). Reference bits are associated
with each entry in the page table.
Initially, all bits are cleared (to 0) by the operating system. As a user process
executes, the bit associated with each page referenced is set (to 1) by the
hardware. After some time, we can determine which pages have been used and
which have not been used by examining the reference bits, although we do not
know the order of use. This information is the basis for many page-replacement
algorithms that approximate LRU replacement.
9.4.5.1 Additional-Reference-Bits Algorithm
We can gain additional ordering information by recording the reference bits at
regular intervals. We can keep an 8-bit byte for each page in a table in memory.
At regular intervals (say, every 100 milliseconds), a timer interrupt transfers
control to the operating system. The operating system shifts the reference bit
for each page into the high-order bit of its 8-bit byte, shifting the other bits right
by 1 bit and discarding the low-order bit. These 8-bit shift registers contain the
history of page use for the last eight time periods. If the shift register contains
00000000, for example, then the page has not been used for eight time periods.
A page that is used at least once in each period has a shift register value of
11111111. A page with a history register value of 11000100 has been used more
recently than one with a value of 01110111. If we interpret these 8-bit bytes
as unsigned integers, the page with the lowest number is the LRU page, and
it can be replaced. Notice that the numbers are not guaranteed to be unique,
however. We can either replace (swap out) all pages with the smallest value or
use the FIFO method to choose among them.
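The shifting is a one-line operation per page, as this sketch shows; the
timer-interrupt wiring and the number of pages are assumptions.

#include <stdint.h>

#define NPAGES 8

static uint8_t history[NPAGES];        /* one 8-bit register per page */
static uint8_t reference_bit[NPAGES];  /* set by hardware on access   */

/* Called from the (hypothetical) 100 ms timer interrupt: shift each
   page's reference bit into the high-order bit of its history byte. */
void age_pages(void)
{
    for (int p = 0; p < NPAGES; p++) {
        history[p] = (uint8_t)((history[p] >> 1) | (reference_bit[p] << 7));
        reference_bit[p] = 0;          /* clear for the next interval */
    }
}

/* The approximate-LRU victim is the page with the smallest history. */
int pick_victim(void)
{
    int victim = 0;
    for (int p = 1; p < NPAGES; p++)
        if (history[p] < history[victim]) victim = p;
    return victim;
}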
The number of bits of history included in the shift register can be varied,
of course, and is selected (depending on the hardware available) to make
the updating as fast as possible. In the extreme case, the number can be
reduced to zero, leaving only the reference bit itself. This algorithm is called
the second-chance page-replacement algorithm.
9.4.5.2 Second-Chance Algorithm
The basic algorithm of second-chance replacement is a FIFO replacement
algorithm. When a page has been selected, however, we inspect its reference
bit. If the value is 0, we proceed to replace this page; but if the reference bit
is set to 1, we give the page a second chance and move on to select the next
FIFO page. When a page gets a second chance, its reference bit is cleared, and
its arrival time is reset to the current time. Thus, a page that is given a second
chance will not be replaced until all other pages have been replaced (or given
second chances). In addition, if a page is used often enough to keep its reference
bit set, it will never be replaced.

Figure 9.17 Second-chance (clock) page-replacement algorithm.
One way to implement the second-chance algorithm (sometimes referred
to as the clock algorithm) is as a circular queue. A pointer (that is, a hand on
the clock) indicates which page is to be replaced next. When a frame is needed,
the pointer advances until it finds a page with a 0 reference bit. As it advances,
it clears the reference bits (Figure 9.17). Once a victim page is found, the page
is replaced, and the new page is inserted in the circular queue in that position.
Notice that, in the worst case, when all bits are set, the pointer cycles through
the whole queue, giving each page a second chance. It clears all the reference
bits before selecting the next page for replacement. Second-chance replacement
degenerates to FIFO replacement if all bits are set.
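The following minimal sketch (not from the text) of the clock algorithm keeps, for each frame, the page it holds and that page's reference bit, and advances a hand until it finds a frame whose bit is 0.

class Clock:
    def __init__(self, nframes):
        self.pages = [None] * nframes      # page held by each frame
        self.refbit = [0] * nframes
        self.hand = 0

    def access(self, page):
        if page in self.pages:             # hit: the hardware would set the bit
            self.refbit[self.pages.index(page)] = 1
            return
        while self.refbit[self.hand] == 1: # referenced pages get a second chance
            self.refbit[self.hand] = 0
            self.hand = (self.hand + 1) % len(self.pages)
        self.pages[self.hand] = page       # victim (or empty frame) found
        self.refbit[self.hand] = 1
        self.hand = (self.hand + 1) % len(self.pages)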
9.4.5.3 Enhanced Second-Chance Algorithm
We can enhance the second-chance algorithm by considering the reference bit
and the modify bit (described in Section 9.4.1) as an ordered pair. With these
two bits, we have the following four possible classes:
1. (0, 0) neither recently used nor modified—best page to replace
2. (0, 1) not recently used but modified—not quite as good, because the
page will need to be written out before replacement
3. (1, 0) recently used but clean—probably will be used again soon
4. (1, 1) recently used and modified—probably will be used again soon, and
the page will need to be written out to disk before it can be replaced
Each page is in one of these four classes. When page replacement is called for,
we use the same scheme as in the clock algorithm; but instead of examining
whether the page to which we are pointing has the reference bit set to 1,
we examine the class to which that page belongs. We replace the first page
encountered in the lowest nonempty class. Notice that we may have to scan
the circular queue several times before we find a page to be replaced.
The major difference between this algorithm and the simpler clock algo-
rithm is that here we give preference to those pages that have been modified
in order to reduce the number of I/Os required.
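A simplified sketch of the victim selection (scanning once per class rather than literally circling the queue, and without the bit clearing a full implementation would perform) might look like this:

def enhanced_second_chance_victim(refbit, modbit, hand):
    # refbit and modbit are per-frame lists; hand is the current clock position
    n = len(refbit)
    for target in [(0, 0), (0, 1), (1, 0), (1, 1)]:   # lowest class first
        for i in range(n):
            j = (hand + i) % n
            if (refbit[j], modbit[j]) == target:
                return j            # index of the frame to replace
    return hand                     # unreachable if every frame is classified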
9.4.6 Counting-Based Page Replacement
There are many other algorithms that can be used for page replacement. For
example, we can keep a counter of the number of references that have been
made to each page and develop the following two schemes.
• The least frequently used (LFU) page-replacement algorithm requires that
the page with the smallest count be replaced. The reason for this selection is
that an actively used page should have a large reference count. A problem
arises, however, when a page is used heavily during the initial phase of
a process but then is never used again. Since it was used heavily, it has a
large count and remains in memory even though it is no longer needed.
One solution is to shift the counts right by 1 bit at regular intervals, forming
an exponentially decaying average usage count.
• The most frequently used (MFU) page-replacement algorithm is based
on the argument that the page with the smallest count was probably just
brought in and has yet to be used.
As you might expect, neither MFU nor LFU replacement is common. The
implementation of these algorithms is expensive, and they do not approximate
OPT replacement well.
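A small sketch of LFU with the aging fix mentioned above (an illustration, not the book's code): counts are shifted right periodically, so pages that were once hot but are no longer used eventually become eligible for replacement.

count = {}                            # page -> reference count

def reference(page):
    count[page] = count.get(page, 0) + 1

def age_counts():                     # call at regular intervals
    for p in count:
        count[p] >>= 1                # exponentially decaying usage count

def lfu_victim(resident_pages):
    return min(resident_pages, key=lambda p: count.get(p, 0))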
9.4.7 Page-Buffering Algorithms
Other procedures are often used in addition to a specific page-replacement
algorithm. For example, systems commonly keep a pool of free frames. When
a page fault occurs, a victim frame is chosen as before. However, the desired
page is read into a free frame from the pool before the victim is written out. This
procedure allows the process to restart as soon as possible, without waiting
for the victim page to be written out. When the victim is later written out, its
frame is added to the free-frame pool.
An expansion of this idea is to maintain a list of modified pages. Whenever
the paging device is idle, a modified page is selected and is written to the disk.
Its modify bit is then reset. This scheme increases the probability that a page
will be clean when it is selected for replacement and will not need to be written
out.
Another modification is to keep a pool of free frames but to remember
which page was in each frame. Since the frame contents are not modified when
a frame is written to the disk, the old page can be reused directly from the
free-frame pool if it is needed before that frame is reused. No I/O is needed in
this case. When a page fault occurs, we first check whether the desired page is
in the free-frame pool. If it is not, we must select a free frame and read the desired page into it.
This technique is used in the VAX/VMS system along with a FIFO replace-
ment algorithm. When the FIFO replacement algorithm mistakenly replaces a
page that is still in active use, that page is quickly retrieved from the free-frame
pool, and no I/O is necessary. The free-frame buffer provides protection against
the relatively poor, but simple, FIFO replacement algorithm. This method is
necessary because the early versions of VAX did not implement the reference
bit correctly.
Some versions of the UNIX system use this method in conjunction with
the second-chance algorithm. It can be a useful augmentation to any page-
replacement algorithm, to reduce the penalty incurred if the wrong victim
page is selected.
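The remembered-contents pool can be sketched as follows (a simplified illustration; read_page_from_disk stands in for whatever I/O routine the system actually uses, and the pool is assumed to be nonempty):

free_pool = {}    # frame number -> page whose contents still sit in that frame

def handle_fault(page, read_page_from_disk):
    # reclaim: if the wanted page is still intact in a free frame, reuse it
    for frame, old_page in list(free_pool.items()):
        if old_page == page:
            del free_pool[frame]
            return frame                      # no I/O needed
    frame, _ = free_pool.popitem()            # otherwise take any free frame
    read_page_from_disk(page, frame)          # placeholder for the real disk read
    return frame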
9.4.8 Applications and Page Replacement
In certain cases, applications accessing data through the operating system’s
virtual memory perform worse than if the operating system provided no
buffering at all. A typical example is a database, which provides its own
memory management and I/O buffering. Applications like this understand
their memory use and disk use better than does an operating system that is
implementing algorithms for general-purpose use. If the operating system is
buffering I/O and the application is doing so as well, however, then twice the
memory is being used for a set of I/O.
In another example, data warehouses frequently perform massive sequen-
tial disk reads, followed by computations and writes. The LRU algorithm would
be removing old pages and preserving new ones, while the application would
more likely be reading older pages than newer ones (as it starts its sequential
reads again). Here, MFU would actually be more efficient than LRU.
Because of such problems, some operating systems give special programs
the ability to use a disk partition as a large sequential array of logical blocks,
without any file-system data structures. This array is sometimes called the raw
disk, and I/O to this array is termed raw I/O. Raw I/O bypasses all the file-
system services, such as file I/O demand paging, file locking, prefetching, space
allocation, file names, and directories. Note that although certain applications
are more efficient when implementing their own special-purpose storage
services on a raw partition, most applications perform better when they use
the regular file-system services.
9.5 Allocation of Frames
We turn next to the issue of allocation. How do we allocate the fixed amount
of free memory among the various processes? If we have 93 free frames and
two processes, how many frames does each process get?
The simplest case is the single-user system. Consider a single-user system
with 128 KB of memory composed of pages 1 KB in size. This system has 128
frames. The operating system may take 35 KB, leaving 93 frames for the user
process. Under pure demand paging, all 93 frames would initially be put on
the free-frame list. When a user process started execution, it would generate a
sequence of page faults. The first 93 page faults would all get free frames from
the free-frame list. When the free-frame list was exhausted, a page-replacement
algorithm would be used to select one of the 93 in-memory pages to be replaced
with the 94th, and so on. When the process terminated, the 93 frames would
once again be placed on the free-frame list.
There are many variations on this simple strategy. We can require that the
operating system allocate all its buffer and table space from the free-frame list.
When this space is not in use by the operating system, it can be used to support
user paging. We can try to keep three free frames reserved on the free-frame list
at all times. Thus, when a page fault occurs, there is a free frame available to
page into. While the page swap is taking place, a replacement can be selected,
which is then written to the disk as the user process continues to execute.
Other variants are also possible, but the basic strategy is clear: the user process
is allocated any free frame.
9.5.1 Minimum Number of Frames
Our strategies for the allocation of frames are constrained in various ways. We
cannot, for example, allocate more than the total number of available frames
(unless there is page sharing). We must also allocate at least a minimum number
of frames. Here, we look more closely at the latter requirement.
One reason for allocating at least a minimum number of frames involves
performance. Obviously, as the number of frames allocated to each process
decreases, the page-fault rate increases, slowing process execution. In addition,
remember that, when a page fault occurs before an executing instruction
is complete, the instruction must be restarted. Consequently, we must have
enough frames to hold all the different pages that any single instruction can
reference.
For example, consider a machine in which all memory-reference instruc-
tions may reference only one memory address. In this case, we need at least one
frame for the instruction and one frame for the memory reference. In addition,
if one-level indirect addressing is allowed (for example, a load instruction on
page 16 can refer to an address on page 0, which is an indirect reference to page
23), then paging requires at least three frames per process. Think about what
might happen if a process had only two frames.
The minimum number of frames is defined by the computer architecture.
For example, the move instruction for the PDP-11 includes more than one word
for some addressing modes, and thus the instruction itself may straddle two
pages. In addition, each of its two operands may be indirect references, for a
total of six frames. Another example is the IBM 370 MVC instruction. Since the
instruction is from storage location to storage location, it takes 6 bytes and can
straddle two pages. The block of characters to move and the area to which it
is to be moved can each also straddle two pages. This situation would require
six frames. The worst case occurs when the MVC instruction is the operand of
an EXECUTE instruction that straddles a page boundary; in this case, we need
eight frames.
The worst-case scenario occurs in computer architectures that allow
multiple levels of indirection (for example, each 16-bit word could contain
a 15-bit address plus a 1-bit indirect indicator). Theoretically, a simple load
instruction could reference an indirect address that could reference an indirect
address (on another page) that could also reference an indirect address (on yet
another page), and so on, until every page in virtual memory had been touched.
Thus, in the worst case, the entire virtual memory must be in physical memory.
To overcome this difficulty, we must place a limit on the levels of indirection (for
example, limit an instruction to at most 16 levels of indirection). When the first
indirection occurs, a counter is set to 16; the counter is then decremented for
each successive indirection for this instruction. If the counter is decremented to
0, a trap occurs (excessive indirection). This limitation reduces the maximum
number of memory references per instruction to 17, requiring the same number
of frames.
Whereas the minimum number of frames per process is defined by the
architecture, the maximum number is defined by the amount of available
physical memory. In between, we are still left with significant choice in frame
allocation.
9.5.2 Allocation Algorithms
The easiest way to split m frames among n processes is to give everyone an
equal share, m/n frames (ignoring frames needed by the operating system
for the moment). For instance, if there are 93 frames and five processes, each
process will get 18 frames. The three leftover frames can be used as a free-frame
buffer pool. This scheme is called equal allocation.
An alternative is to recognize that various processes will need differing
amounts of memory. Consider a system with a 1-KB frame size. If a small
student process of 10 KB and an interactive database of 127 KB are the only
two processes running in a system with 62 free frames, it does not make much
sense to give each process 31 frames. The student process does not need more
than 10 frames, so the other 21 are, strictly speaking, wasted.
To solve this problem, we can use proportional allocation, in which we
allocate available memory to each process according to its size. Let the size of
the virtual memory for process pi be si, and define
    S = Σ si.
Then, if the total number of available frames is m, we allocate ai frames to
process pi, where ai is approximately
    ai = (si / S) × m.
Of course, we must adjust each ai to be an integer that is greater than the
minimum number of frames required by the instruction set, with a sum not
exceeding m.
With proportional allocation, we would split 62 frames between two
processes, one of 10 pages and one of 127 pages, by allocating 4 frames and 57
frames, respectively, since
10/137 × 62 ≈ 4, and
127/137 × 62 ≈ 57.
In this way, both processes share the available frames according to their
“needs,” rather than equally.
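A small sketch of this computation (with an assumed per-process minimum of two frames, which is not a value from the text) hands out the floor of each share and then distributes any frames lost to rounding:

def proportional_allocation(sizes, m, minimum=2):
    S = sum(sizes)
    alloc = [max(minimum, (s * m) // S) for s in sizes]
    i = 0
    while sum(alloc) < m:          # distribute frames lost to rounding
        alloc[i % len(alloc)] += 1
        i += 1
    return alloc

# proportional_allocation([10, 127], 62) returns [5, 57]: the one frame left
# over by the 4/57 split in the text is handed back out here.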
In both equal and proportional allocation, of course, the allocation may
vary according to the multiprogramming level. If the multiprogramming level
is increased, each process will lose some frames to provide the memory needed
for the new process. Conversely, if the multiprogramming level decreases, the
frames that were allocated to the departed process can be spread over the
remaining processes.
Notice that, with either equal or proportional allocation, a high-priority
process is treated the same as a low-priority process. By its definition, however,
we may want to give the high-priority process more memory to speed its
execution, to the detriment of low-priority processes. One solution is to use
a proportional allocation scheme wherein the ratio of frames depends not on
the relative sizes of processes but rather on the priorities of processes or on a
combination of size and priority.
9.5.3 Global versus Local Allocation
Another important factor in the way frames are allocated to the various
processes is page replacement. With multiple processes competing for frames,
we can classify page-replacement algorithms into two broad categories: global
replacement and local replacement. Global replacement allows a process to
select a replacement frame from the set of all frames, even if that frame is
currently allocated to some other process; that is, one process can take a frame
from another. Local replacement requires that each process select from only its
own set of allocated frames.
For example, consider an allocation scheme wherein we allow high-priority
processes to select frames from low-priority processes for replacement. A
process can select a replacement from among its own frames or the frames
of any lower-priority process. This approach allows a high-priority process to
increase its frame allocation at the expense of a low-priority process. With a
local replacement strategy, the number of frames allocated to a process does not
change. With global replacement, a process may happen to select only frames
allocated to other processes, thus increasing the number of frames allocated to
it (assuming that other processes do not choose its frames for replacement).
One problem with a global replacement algorithm is that a process cannot
control its own page-fault rate. The set of pages in memory for a process
depends not only on the paging behavior of that process but also on the paging
behavior of other processes. Therefore, the same process may perform quite
differently (for example, taking 0.5 seconds for one execution and 10.3 seconds
for the next execution) because of totally external circumstances. Such is not
the case with a local replacement algorithm. Under local replacement, the
set of pages in memory for a process is affected by the paging behavior of
only that process. Local replacement might hinder a process, however, by
not making available to it other, less used pages of memory. Thus, global
replacement generally results in greater system throughput and is therefore
the more commonly used method.
9.5.4 Non-Uniform Memory Access
Thus far in our coverage of virtual memory, we have assumed that all main
memory is created equal—or at least that it is accessed equally. On many
computer systems, that is not the case. Often, in systems with multiple CPUs
(Section 1.3.2), a given CPU can access some sections of main memory faster
than it can access others. These performance differences are caused by how
CPUs and memory are interconnected in the system. Frequently, such a system
is made up of several system boards, each containing multiple CPUs and some
memory. The system boards are interconnected in various ways, ranging from
system buses to high-speed network connections like InfiniBand. As you might
expect, the CPUs on a particular board can access the memory on that board with
less delay than they can access memory on other boards in the system. Systems
in which memory access times vary significantly are known collectively as
non-uniform memory access (NUMA) systems, and without exception, they
are slower than systems in which memory and CPUs are located on the same
motherboard.
Managing which page frames are stored at which locations can significantly
affect performance in NUMA systems. If we treat memory as uniform in such
a system, CPUs may wait significantly longer for memory access than if we
modify memory allocation algorithms to take NUMA into account. Similar
changes must be made to the scheduling system. The goal of these changes is
to have memory frames allocated “as close as possible” to the CPU on which
the process is running. The definition of “close” is “with minimum latency,”
which typically means on the same system board as the CPU.
The algorithmic changes consist of having the scheduler track the last CPU
on which each process ran. If the scheduler tries to schedule each process onto
its previous CPU, and the memory-management system tries to allocate frames
for the process close to the CPU on which it is being scheduled, then improved
cache hits and decreased memory access times will result.
The picture is more complicated once threads are added. For example, a
process with many running threads may end up with those threads scheduled
on many different system boards. How is the memory to be allocated in this
case? Solaris solves the problem by creating lgroups (for “latency groups”) in
the kernel. Each lgroup gathers together close CPUs and memory. In fact, there
is a hierarchy of lgroups based on the amount of latency between the groups.
Solaris tries to schedule all threads of a process and allocate all memory of a
process within an lgroup. If that is not possible, it picks nearby lgroups for the
rest of the resources needed. This practice minimizes overall memory latency
and maximizes CPU cache hit rates.
9.6 Thrashing
If the number of frames allocated to a low-priority process falls below the
minimum number required by the computer architecture, we must suspend
that process’s execution. We should then page out its remaining pages, freeing
all its allocated frames. This provision introduces a swap-in, swap-out level of
intermediate CPU scheduling.
In fact, look at any process that does not have “enough” frames. If the
process does not have the number of frames it needs to support pages in
active use, it will quickly page-fault. At this point, it must replace some page.
However, since all its pages are in active use, it must replace a page that will
be needed again right away. Consequently, it quickly faults again, and again,
and again, replacing pages that it must bring back in immediately.
This high paging activity is called thrashing. A process is thrashing if it is
spending more time paging than executing.
9.6.1 Cause of Thrashing
Thrashing results in severe performance problems. Consider the following
scenario, which is based on the actual behavior of early paging systems.
The operating system monitors CPU utilization. If CPU utilization is too low,
we increase the degree of multiprogramming by introducing a new process
to the system. A global page-replacement algorithm is used; it replaces pages
without regard to the process to which they belong. Now suppose that a process
enters a new phase in its execution and needs more frames. It starts faulting and
taking frames away from other processes. These processes need those pages,
however, and so they also fault, taking frames from other processes. These
faulting processes must use the paging device to swap pages in and out. As
they queue up for the paging device, the ready queue empties. As processes
wait for the paging device, CPU utilization decreases.
The CPU scheduler sees the decreasing CPU utilization and increases the
degree of multiprogramming as a result. The new process tries to get started by
taking frames from running processes, causing more page faults and a longer
queue for the paging device. As a result, CPU utilization drops even further,
and the CPU scheduler tries to increase the degree of multiprogramming even
more. Thrashing has occurred, and system throughput plunges. The page-
fault rate increases tremendously. As a result, the effective memory-access
time increases. No work is getting done, because the processes are spending
all their time paging.
This phenomenon is illustrated in Figure 9.18, in which CPU utilization
is plotted against the degree of multiprogramming. As the degree of multi-
programming increases, CPU utilization also increases, although more slowly,
until a maximum is reached. If the degree of multiprogramming is increased
even further, thrashing sets in, and CPU utilization drops sharply. At this point,
to increase CPU utilization and stop thrashing, we must decrease the degree of
multiprogramming.
Figure 9.18 Thrashing: CPU utilization plotted against the degree of multiprogramming.
We can limit the effects of thrashing by using a local replacement algorithm
(or priority replacement algorithm). With local replacement, if one process
starts thrashing, it cannot steal frames from another process and cause the latter
to thrash as well. However, the problem is not entirely solved. If processes are
thrashing, they will be in the queue for the paging device most of the time. The
average service time for a page fault will increase because of the longer average
queue for the paging device. Thus, the effective access time will increase even
for a process that is not thrashing.
To prevent thrashing, we must provide a process with as many frames as
it needs. But how do we know how many frames it “needs”? There are several
techniques. The working-set strategy (Section 9.6.2) starts by looking at how
many frames a process is actually using. This approach defines the locality
model of process execution.
The locality model states that, as a process executes, it moves from locality
to locality. A locality is a set of pages that are actively used together (Figure
9.19). A program is generally composed of several different localities, which
may overlap.
For example, when a function is called, it defines a new locality. In this
locality, memory references are made to the instructions of the function call, its
local variables, and a subset of the global variables. When we exit the function,
the process leaves this locality, since the local variables and instructions of the
function are no longer in active use. We may return to this locality later.
Thus, we see that localities are defined by the program structure and its
data structures. The locality model states that all programs will exhibit this
basic memory reference structure. Note that the locality model is the unstated
principle behind the caching discussions so far in this book. If accesses to any
types of data were random rather than patterned, caching would be useless.
Suppose we allocate enough frames to a process to accommodate its current
locality. It will fault for the pages in its locality until all these pages are in
memory; then, it will not fault again until it changes localities. If we do not
allocate enough frames to accommodate the size of the current locality, the
process will thrash, since it cannot keep in memory all the pages that it is
actively using.
9.6.2 Working-Set Model
As mentioned, the working-set model is based on the assumption of locality.
This model uses a parameter, Δ, to define the working-set window. The idea
is to examine the most recent Δ page references. The set of pages in the most
recent Δ page references is the working set (Figure 9.20). If a page is in active
use, it will be in the working set. If it is no longer being used, it will drop from
the working set Δ time units after its last reference. Thus, the working set is an
approximation of the program’s locality.
For example, given the sequence of memory references shown in Figure
9.20, if Δ = 10 memory references, then the working set at time t1 is {1, 2, 5,
6, 7}. By time t2, the working set has changed to {3, 4}.
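Computing the working set directly from a reference string is straightforward; the short sketch below (ours, not the book's) reproduces the first value from the example.

def working_set(references, t, delta):
    # the set of pages touched in the last delta references ending at time t
    return set(references[max(0, t - delta):t])

refs = [2, 6, 1, 5, 7, 7, 7, 7, 5, 1, 6, 2, 3, 4, 1, 2, 3, 4, 4, 4,
        3, 4, 3, 4, 4, 4, 1, 3, 2, 3, 4, 4, 4, 3, 4, 4, 4]
print(working_set(refs, 10, 10))    # {1, 2, 5, 6, 7}, as in the example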
The accuracy of the working set depends on the selection of Δ. If Δ is too
small, it will not encompass the entire locality; if Δ is too large, it may overlap
several localities. In the extreme, if Δ is infinite, the working set is the set of
pages touched during the process execution.
Figure 9.19 Locality in a memory-reference pattern: page numbers of the memory addresses referenced, plotted against execution time.
The most important property of the working set, then, is its size. If we
compute the working-set size, WSSi , for each process in the system, we can
then consider that
    D = Σ WSSi,
where D is the total demand for frames. Each process is actively using the pages
in its working set. Thus, process i needs WSSi frames. If the total demand is
greater than the total number of available frames (D > m), thrashing will occur,
because some processes will not have enough frames.
Figure 9.20 Working-set model. Page reference table: . . . 2 6 1 5 7 7 7 7 5 1 6 2 3 4 1 2 3 4 4 4 3 4 3 4 4 4 1 3 2 3 4 4 4 3 4 4 4 . . . With a window Δ ending at t1, WS(t1) = {1, 2, 5, 6, 7}; with a window Δ ending at t2, WS(t2) = {3, 4}.
Once Δ has been selected, use of the working-set model is simple. The
operating system monitors the working set of each process and allocates to
that working set enough frames to provide it with its working-set size. If there
are enough extra frames, another process can be initiated. If the sum of the
working-set sizes increases, exceeding the total number of available frames,
the operating system selects a process to suspend. The process’s pages are
written out (swapped), and its frames are reallocated to other processes. The
suspended process can be restarted later.
This working-set strategy prevents thrashing while keeping the degree of
multiprogramming as high as possible. Thus, it optimizes CPU utilization. The
difficulty with the working-set model is keeping track of the working set. The
working-set window is a moving window. At each memory reference, a new
reference appears at one end, and the oldest reference drops off the other end.
A page is in the working set if it is referenced anywhere in the working-set
window.
We can approximate the working-set model with a fixed-interval timer
interrupt and a reference bit. For example, assume that Δ equals 10,000
references and that we can cause a timer interrupt every 5,000 references.
When we get a timer interrupt, we copy and clear the reference-bit values for
each page. Thus, if a page fault occurs, we can examine the current reference
bit and two in-memory bits to determine whether a page was used within the
last 10,000 to 15,000 references. If it was used, at least one of these bits will be
on. If it has not been used, these bits will be off. Pages with at least one bit on
will be considered to be in the working set.
Note that this arrangement is not entirely accurate, because we cannot
tell where, within an interval of 5,000, a reference occurred. We can reduce the
uncertainty by increasing the number of history bits and the frequency of inter-
rupts (for example, 10 bits and interrupts every 1,000 references). However, the
cost to service these more frequent interrupts will be correspondingly higher.
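In outline, and under the two-bit assumption of the example above, the interrupt handler and membership test might look like this (a sketch, not the book's code):

HISTORY_BITS = 2                      # the two in-memory bits of the example

def on_timer_interrupt(pages, refbit, history):
    for p in pages:
        value = (history.get(p, 0) << 1) | refbit.get(p, 0)
        history[p] = value & ((1 << HISTORY_BITS) - 1)
        refbit[p] = 0                 # copy, then clear, the reference bit

def in_working_set(p, refbit, history):
    # in the working set if the page was referenced in the current interval
    # or in any of the recorded history intervals
    return refbit.get(p, 0) == 1 or history.get(p, 0) != 0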
9.6.3 Page-Fault Frequency
The working-set model is successful, and knowledge of the working set can
be useful for prepaging (Section 9.9.1), but it seems a clumsy way to control
thrashing. A strategy that uses the page-fault frequency (PFF) takes a more
direct approach.
The specific problem is how to prevent thrashing. Thrashing has a high
page-fault rate. Thus, we want to control the page-fault rate. When it is too
high, we know that the process needs more frames. Conversely, if the page-fault
rate is too low, then the process may have too many frames. We can establish
upper and lower bounds on the desired page-fault rate (Figure 9.21). If the
actual page-fault rate exceeds the upper limit, we allocate the process another
frame. If the page-fault rate falls below the lower limit, we remove a frame
from the process. Thus, we can directly measure and control the page-fault
rate to prevent thrashing.
Figure 9.21 Page-fault frequency: page-fault rate versus number of frames, with an upper bound above which the process is given another frame and a lower bound below which a frame is removed from it.
As with the working-set strategy, we may have to swap out a process. If the
page-fault rate increases and no free frames are available, we must select some
process and swap it out to backing store. The freed frames are then distributed
to processes with high page-fault rates.
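A minimal sketch of such a controller (the bounds and the measurement interval are assumed tuning parameters, not values from the text):

UPPER, LOWER = 0.10, 0.01        # faults per memory reference, illustrative

def adjust_allocation(frames, faults, references):
    rate = faults / max(references, 1)
    if rate > UPPER:
        return frames + 1        # too many faults: give the process a frame
    if rate < LOWER and frames > 1:
        return frames - 1        # very few faults: take a frame away
    return frames                # within bounds: leave the allocation alone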
9.6.4 Concluding Remarks
Practically speaking, thrashing and the resulting swapping have a disagreeably
large impact on performance. The current best practice in implementing a
computer facility is to include enough physical memory, whenever possible,
to avoid thrashing and swapping. From smartphones through mainframes,
providing enough memory to keep all working sets in memory concurrently,
except under extreme conditions, gives the best user experience.
9.7 Memory-Mapped Files
Consider a sequential read of a file on disk using the standard system calls
open(), read(), and write(). Each file access requires a system call and disk
access. Alternatively, we can use the virtual memory techniques discussed
so far to treat file I/O as routine memory accesses. This approach, known as
memory mapping a file, allows a part of the virtual address space to be logically
associated with the file. As we shall see, this can lead to significant performance
increases.
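As a small illustration (not from the text, using Python's mmap module; the file name is a placeholder, and the file is assumed to already exist and to be at least a few bytes long), once a file is mapped, ordinary reads and writes of the mapped bytes are serviced by demand paging rather than by explicit read() and write() calls:

import mmap

with open("data.bin", "r+b") as f:            # "data.bin" is a placeholder
    with mmap.mmap(f.fileno(), 0) as m:       # map the entire file
        first_byte = m[0]                     # a page fault brings the page in
        m[0:4] = b"DATA"                      # write through the mapping
        m.flush()                             # push dirty pages back to the file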
9.7.1 Basic Mechanism
Memory mapping a file is accomplished by mapping a disk block to a page (or
pages) in memory. Initial access to the file proceeds through ordinary demand
paging, resulting in a page fault. However, a page-sized portion of the file is
read from the file system into a physical page (some systems may opt to read in more than one page-sized chunk of memory at a time).
Part Four: Storage Management
Since main memory is usually too small to accommodate all the data and
programs permanently, the computer system must provide secondary
storage to back up main memory. Modern computer systems use disks
as the primary on-line storage medium for information (both programs
and data). The file system provides the mechanism for on-line storage
of and access to both data and programs residing on the disks. A file
is a collection of related information defined by its creator. The files are
mapped by the operating system onto physical devices. Files are normally
organized into directories for ease of use.
The devices that attach to a computer vary in many aspects. Some
devices transfer a character or a block of characters at a time. Some
can be accessed only sequentially, others randomly. Some transfer
data synchronously, others asynchronously. Some are dedicated, some
shared. They can be read-only or read–write. They vary greatly in speed.
In many ways, they are also the slowest major component of the
computer.
Because of all this device variation, the operating system needs to
provide a wide range of functionality to applications, to allow them to
control all aspects of the devices. One key goal of an operating system’s
I/O subsystem is to provide the simplest interface possible to the rest of
the system. Because devices are a performance bottleneck, another key
is to optimize I/O for maximum concurrency.
Chapter 10: Mass-Storage Structure
The file system can be viewed logically as consisting of three parts. In Chapter
11, we examine the user and programmer interface to the file system. In
Chapter 12, we describe the internal data structures and algorithms used by
the operating system to implement this interface. In this chapter, we begin a
discussion of file systems at the lowest level: the structure of secondary storage.
We first describe the physical structure of magnetic disks and magnetic tapes.
We then describe disk-scheduling algorithms, which schedule the order of
disk I/Os to maximize performance. Next, we discuss disk formatting and
management of boot blocks, damaged blocks, and swap space. We conclude
with an examination of the structure of RAID systems.
CHAPTER OBJECTIVES
• To describe the physical structure of secondary storage devices and its
effects on the uses of the devices.
• To explain the performance characteristics of mass-storage devices.
• To evaluate disk scheduling algorithms.
• To discuss operating-system services provided for mass storage, including
RAID.
10.1 Overview of Mass-Storage Structure
In this section, we present a general overview of the physical structure of
secondary and tertiary storage devices.
10.1.1 Magnetic Disks
Magnetic disks provide the bulk of secondary storage for modern computer
systems. Conceptually, disks are relatively simple (Figure 10.1). Each disk
platter has a flat circular shape, like a CD. Common platter diameters range
from 1.8 to 3.5 inches. The two surfaces of a platter are covered with a magnetic
material. We store information by recording it magnetically on the platters.
Figure 10.1 Moving-head disk mechanism: rotating platters on a spindle, divided into tracks, sectors, and cylinders, with read-write heads mounted on a moving arm assembly.
A read–write head “flies” just above each surface of every platter. The
heads are attached to a disk arm that moves all the heads as a unit. The surface
of a platter is logically divided into circular tracks, which are subdivided into
sectors. The set of tracks that are at one arm position makes up a cylinder.
There may be thousands of concentric cylinders in a disk drive, and each track
may contain hundreds of sectors. The storage capacity of common disk drives
is measured in gigabytes.
When the disk is in use, a drive motor spins it at high speed. Most drives
rotate 60 to 250 times per second, specified in terms of rotations per minute
(RPM). Common drives spin at 5,400, 7,200, 10,000, and 15,000 RPM. Disk speed
has two parts. The transfer rate is the rate at which data flow between the drive
and the computer. The positioning time, or random-access time, consists of
two parts: the time necessary to move the disk arm to the desired cylinder,
called the seek time, and the time necessary for the desired sector to rotate to
the disk head, called the rotational latency. Typical disks can transfer several
megabytes of data per second, and they have seek times and rotational latencies
of several milliseconds.
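A back-of-the-envelope sketch of these quantities (illustrative figures, not measurements from the text): average rotational latency is half a revolution, and the time to satisfy a request is roughly seek time plus rotational latency plus transfer time.

def avg_access_time_ms(seek_ms, rpm, transfer_mb_per_s, request_kb):
    rotational_ms = 0.5 * (60_000.0 / rpm)              # half a revolution
    transfer_ms = request_kb / (transfer_mb_per_s * 1024.0) * 1000.0
    return seek_ms + rotational_ms + transfer_ms

# avg_access_time_ms(9, 7200, 100, 4) is about 13.2 ms: a 4-KB request on a
# 7,200-RPM disk with a 9-ms average seek and a 100-MB/s transfer rate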
Because the disk head flies on an extremely thin cushion of air (measured
in microns), there is a danger that the head will make contact with the disk
surface. Although the disk platters are coated with a thin protective layer, the
head will sometimes damage the magnetic surface. This accident is called a
head crash. A head crash normally cannot be repaired; the entire disk must be
replaced.
A disk can be removable, allowing different disks to be mounted as needed.
Removable magnetic disks generally consist of one platter, held in a plastic
case to prevent damage while not in the disk drive. Other forms of removable
disks include CDs, DVDs, and Blu-ray discs as well as removable flash-memory
devices known as flash drives (which are a type of solid-state drive).
A disk drive is attached to a computer by a set of wires called an I/O
bus. Several kinds of buses are available, including advanced technology
attachment (ATA), serial ATA (SATA), eSATA, universal serial bus (USB), and
fibre channel (FC). The data transfers on a bus are carried out by special
electronic processors called controllers. The host controller is the controller at
the computer end of the bus. A disk controller is built into each disk drive. To
perform a disk I/O operation, the computer places a command into the host
controller, typically using memory-mapped I/O ports, as described in Section
9.7.3. The host controller then sends the command via messages to the disk
controller, and the disk controller operates the disk-drive hardware to carry
out the command. Disk controllers usually have a built-in cache. Data transfer
at the disk drive happens between the cache and the disk surface, and data
transfer to the host, at fast electronic speeds, occurs between the cache and the
host controller.
10.1.2 Solid-State Disks
Sometimes old technologies are used in new ways as economics change or
the technologies evolve. An example is the growing importance of solid-state
disks, or SSDs. Simply described, an SSD is nonvolatile memory that is used like
a hard drive. There are many variations of this technology, from DRAM with a
battery to allow it to maintain its state in a power failure through flash-memory
technologies like single-level cell (SLC) and multilevel cell (MLC) chips.
SSDs have the same characteristics as traditional hard disks but can be more
reliable because they have no moving parts and faster because they have no
seek time or latency. In addition, they consume less power. However, they are
more expensive per megabyte than traditional hard disks, have less capacity
than the larger hard disks, and may have shorter life spans than hard disks,
so their uses are somewhat limited. One use for SSDs is in storage arrays,
where they hold file-system metadata that require high performance. SSDs are
also used in some laptop computers to make them smaller, faster, and more
energy-efficient.
Because SSDs can be much faster than magnetic disk drives, standard bus
interfaces can cause a major limit on throughput. Some SSDs are designed to
connect directly to the system bus (PCI, for example). SSDs are changing other
traditional aspects of computer design as well. Some systems use them as
a direct replacement for disk drives, while others use them as a new cache
tier, moving data between magnetic disks, SSDs, and memory to optimize
performance.
In the remainder of this chapter, some sections pertain to SSDs, while
others do not. For example, because SSDs have no disk head, disk-scheduling
algorithms largely do not apply. Throughput and formatting, however, do
apply.
10.1.3 Magnetic Tapes
Magnetic tape was used as an early secondary-storage medium. Although it
is relatively permanent and can hold large quantities of data, its access time
is slow compared with that of main memory and magnetic disk. In addition,
random access to magnetic tape is about a thousand times slower than random
access to magnetic disk, so tapes are not very useful for secondary storage.
DISK TRANSFER RATES
As with many aspects of computing, published performance numbers for
disks are not the same as real-world performance numbers. Stated transfer
rates are always lower than effective transfer rates, for example. The transfer
rate may be the rate at which bits can be read from the magnetic media by
the disk head, but that is different from the rate at which blocks are delivered
to the operating system.
Tapes are used mainly for backup, for storage of infrequently used information,
and as a medium for transferring information from one system to another.
A tape is kept in a spool and is wound or rewound past a read–write head.
Moving to the correct spot on a tape can take minutes, but once positioned, tape
drives can write data at speeds comparable to disk drives. Tape capacities vary
greatly, depending on the particular kind of tape drive, with current capacities
exceeding several terabytes. Some tapes have built-in compression that can
more than double the effective storage. Tapes and their drivers are usually
categorized by width, including 4, 8, and 19 millimeters and 1/4 and 1/2 inch.
Some are named according to technology, such as LTO-5 and SDLT.
10.2 Disk Structure
Modern magnetic disk drives are addressed as large one-dimensional arrays of
logical blocks, where the logical block is the smallest unit of transfer. The size
of a logical block is usually 512 bytes, although some disks can be low-level
formatted to have a different logical block size, such as 1,024 bytes. This option
is described in Section 10.5.1. The one-dimensional array of logical blocks is
mapped onto the sectors of the disk sequentially. Sector 0 is the first sector
of the first track on the outermost cylinder. The mapping proceeds in order
through that track, then through the rest of the tracks in that cylinder, and then
through the rest of the cylinders from outermost to innermost.
By using this mapping, we can—at least in theory—convert a logical block
number into an old-style disk address that consists of a cylinder number, a track
number within that cylinder, and a sector number within that track. In practice,
it is difficult to perform this translation, for two reasons. First, most disks have
some defective sectors, but the mapping hides this by substituting spare sectors
from elsewhere on the disk. Second, the number of sectors per track is not a
constant on some drives.
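A naive sketch of the translation, valid only under the idealized assumption of a fixed number of sectors per track and tracks per cylinder; the two complications just described (spare sectors and a varying sector count per track) are exactly what breaks it in practice.

def lba_to_chs(lba, sectors_per_track, tracks_per_cylinder):
    per_cylinder = sectors_per_track * tracks_per_cylinder
    cylinder = lba // per_cylinder
    track = (lba % per_cylinder) // sectors_per_track
    sector = lba % sectors_per_track
    return cylinder, track, sector

# e.g. lba_to_chs(5000, 63, 16) -> (4, 15, 23) under this fixed geometry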
Let’s look more closely at the second reason. On media that use constant
linear velocity (CLV), the density of bits per track is uniform. The farther a
track is from the center of the disk, the greater its length, so the more sectors it
can hold. As we move from outer zones to inner zones, the number of sectors
per track decreases. Tracks in the outermost zone typically hold 40 percent
more sectors than do tracks in the innermost zone. The drive increases its
rotation speed as the head moves from the outer to the inner tracks to keep
the same rate of data moving under the head. This method is used in CD-ROM
and DVD-ROM drives. Alternatively, the disk rotation speed can stay constant;
in this case, the density of bits decreases from inner tracks to outer tracks to
keep the data rate constant. This method is used in hard disks and is known as
constant angular velocity (CAV).
The number of sectors per track has been increasing as disk technology
improves, and the outer zone of a disk usually has several hundred sectors per
track. Similarly, the number of cylinders per disk has been increasing; large
disks have tens of thousands of cylinders.
10.3 Disk Attachment
Computers access disk storage in two ways. One way is via I/O ports (or
host-attached storage); this is common on small systems. The other way is via
a remote host in a distributed file system; this is referred to as network-attached
storage.
10.3.1 Host-Attached Storage
Host-attached storage is storage accessed through local I/O ports. These ports
use several technologies. The typical desktop PC uses an I/O bus architecture
called IDE or ATA. This architecture supports a maximum of two drives per I/O
bus. A newer, similar protocol that has simplified cabling is SATA.
High-end workstations and servers generally use more sophisticated I/O
architectures such as fibre channel (FC), a high-speed serial architecture that
can operate over optical fiber or over a four-conductor copper cable. It has
two variants. One is a large switched fabric having a 24-bit address space. This
variant is expected to dominate in the future and is the basis of storage-area
networks (SANs), discussed in Section 10.3.3. Because of the large address space
and the switched nature of the communication, multiple hosts and storage
devices can attach to the fabric, allowing great flexibility in I/O communication.
The other FC variant is an arbitrated loop (FC-AL) that can address 126 devices
(drives and controllers).
A wide variety of storage devices are suitable for use as host-attached
storage. Among these are hard disk drives, RAID arrays, and CD, DVD, and
tape drives. The I/O commands that initiate data transfers to a host-attached
storage device are reads and writes of logical data blocks directed to specifically
identified storage units (such as bus ID or target logical unit).
10.3.2 Network-Attached Storage
A network-attached storage (NAS) device is a special-purpose storage system
that is accessed remotely over a data network (Figure 10.2). Clients access
network-attached storage via a remote-procedure-call interface such as NFS
for UNIX systems or CIFS for Windows machines. The remote procedure calls
(RPCs) are carried via TCP or UDP over an IP network—usually the same local-
area network (LAN) that carries all data traffic to the clients. Thus, it may be
easiest to think of NAS as simply another storage-access protocol. The network-
attached storage unit is usually implemented as a RAID array with software
that implements the RPC interface.
Figure 10.2 Network-attached storage: NAS units and client machines attached to a common LAN/WAN.
Network-attached storage provides a convenient way for all the computers
on a LAN to share a pool of storage with the same ease of naming and access
enjoyed with local host-attached storage. However, it tends to be less efficient
and have lower performance than some direct-attached storage options.
iSCSI is the latest network-attached storage protocol. In essence, it uses the
IP network protocol to carry the SCSI protocol. Thus, networks—rather than
SCSI cables—can be used as the interconnects between hosts and their storage.
As a result, hosts can treat their storage as if it were directly attached, even if
the storage is distant from the host.
10.3.3 Storage-Area Network
One drawback of network-attached storage systems is that the storage I/O
operations consume bandwidth on the data network, thereby increasing the
latency of network communication. This problem can be particularly acute
in large client–server installations—the communication between servers and
clients competes for bandwidth with the communication among servers and
storage devices.
A storage-area network (SAN) is a private network (using storage protocols
rather than networking protocols) connecting servers and storage units, as
shown in Figure 10.3. The power of a SAN lies in its flexibility. Multiple hosts
and multiple storage arrays can attach to the same SAN, and storage can
be dynamically allocated to hosts. A SAN switch allows or prohibits access
between the hosts and the storage. As one example, if a host is running low
on disk space, the SAN can be configured to allocate more storage to that host.
SANs make it possible for clusters of servers to share the same storage and for
storage arrays to include multiple direct host connections. SANs typically have
more ports—as well as more expensive ports—than storage arrays.
FC is the most common SAN interconnect, although the simplicity of iSCSI is
increasing its use. Another SAN interconnect is InfiniBand — a special-purpose
bus architecture that provides hardware and software support for high-speed
interconnection networks for servers and storage units.
Figure 10.3 Storage-area network: multiple servers and storage arrays (plus a tape library) attached to a SAN, with clients, such as a data-processing center and a web content provider, reaching the servers over a LAN/WAN.
10.4 Disk Scheduling
One of the responsibilities of the operating system is to use the hardware
efficiently. For the disk drives, meeting this responsibility entails having fast
access time and large disk bandwidth. For magnetic disks, the access time has
two major components, as mentioned in Section 10.1.1. The seek time is the
time for the disk arm to move the heads to the cylinder containing the desired
sector. The rotational latency is the additional time for the disk to rotate the
desired sector to the disk head. The disk bandwidth is the total number of bytes
transferred, divided by the total time between the first request for service and
the completion of the last transfer. We can improve both the access time and
the bandwidth by managing the order in which disk I/O requests are serviced.
Whenever a process needs I/O to or from the disk, it issues a system call to
the operating system. The request specifies several pieces of information:
• Whether this operation is input or output
• What the disk address for the transfer is
• What the memory address for the transfer is
• What the number of sectors to be transferred is
If the desired disk drive and controller are available, the request can be
serviced immediately. If the drive or controller is busy, any new requests
for service will be placed in the queue of pending requests for that drive.
For a multiprogramming system with many processes, the disk queue may
often have several pending requests. Thus, when one request is completed, the
operating system chooses which pending request to service next. How does
the operating system make this choice? Any one of several disk-scheduling
algorithms can be used, and we discuss them next.
10.4.1 FCFS Scheduling
The simplest form of disk scheduling is, of course, the first-come, first-served
(FCFS) algorithm. This algorithm is intrinsically fair, but it generally does not
provide the fastest service. Consider, for example, a disk queue with requests
for I/O to blocks on cylinders
98, 183, 37, 122, 14, 124, 65, 67,
in that order. If the disk head is initially at cylinder 53, it will first move from
53 to 98, then to 183, 37, 122, 14, 124, 65, and finally to 67, for a total head
movement of 640 cylinders. This schedule is diagrammed in Figure 10.4.
Figure 10.4 FCFS disk scheduling: queue = 98, 183, 37, 122, 14, 124, 65, 67; head starts at cylinder 53.
The wild swing from 122 to 14 and then back to 124 illustrates the problem
with this schedule. If the requests for cylinders 37 and 14 could be serviced
together, before or after the requests for 122 and 124, the total head movement
could be decreased substantially, and performance could be thereby improved.
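The 640-cylinder figure quoted above is easy to verify with a few lines of straightforward arithmetic (a quick sketch, not from the text):

def total_head_movement(start, queue):
    pos, total = start, 0
    for cylinder in queue:
        total += abs(cylinder - pos)   # distance moved to reach this request
        pos = cylinder
    return total

print(total_head_movement(53, [98, 183, 37, 122, 14, 124, 65, 67]))   # 640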
10.4.2 SSTF Scheduling
It seems reasonable to service all the requests close to the current head position
before moving the head far away to service other requests. This assumption is
the basis for the shortest-seek-time-first (SSTF) algorithm. The SSTF algorithm
selects the request with the least seek time from the current head position.
In other words, SSTF chooses the pending request closest to the current head
position.
For our example request queue, the closest request to the initial head
position (53) is at cylinder 65. Once we are at cylinder 65, the next closest
request is at cylinder 67. From there, the request at cylinder 37 is closer than the
one at 98, so 37 is served next. Continuing, we service the request at cylinder 14,
then 98, 122, 124, and finally 183 (Figure 10.5). This scheduling method results
in a total head movement of only 236 cylinders—little more than one-third
of the distance needed for FCFS scheduling of this request queue. Clearly, this
algorithm gives a substantial improvement in performance.
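A sketch of SSTF (ours, not the book's): repeatedly pick the pending request closest to the current head position.

def sstf(start, queue):
    pending, pos, order, total = list(queue), start, [], 0
    while pending:
        nxt = min(pending, key=lambda c: abs(c - pos))   # closest pending request
        total += abs(nxt - pos)
        pending.remove(nxt)
        order.append(nxt)
        pos = nxt
    return order, total

# sstf(53, [98, 183, 37, 122, 14, 124, 65, 67])
# -> ([65, 67, 37, 14, 98, 122, 124, 183], 236), as in the text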
SSTF scheduling is essentially a form of shortest-job-first (SJF) scheduling;
and like SJF scheduling, it may cause starvation of some requests. Remember
that requests may arrive at any time. Suppose that we have two requests in
the queue, for cylinders 14 and 186, and while the request from 14 is being
serviced, a new request near 14 arrives. This new request will be serviced
next, making the request at 186 wait. While this request is being serviced,
another request close to 14 could arrive. In theory, a continual stream of requests
near one another could cause the request for cylinder 186 to wait indefinitely.
This scenario becomes increasingly likely as the pending-request queue grows
longer.
Figure 10.5 SSTF disk scheduling: queue = 98, 183, 37, 122, 14, 124, 65, 67; head starts at cylinder 53.
Although the SSTF algorithm is a substantial improvement over the FCFS
algorithm, it is not optimal. In the example, we can do better by moving the
head from 53 to 37, even though the latter is not closest, and then to 14, before
turning around to service 65, 67, 98, 122, 124, and 183. This strategy reduces
the total head movement to 208 cylinders.
10.4.3 SCAN Scheduling
In the SCAN algorithm, the disk arm starts at one end of the disk and moves
toward the other end, servicing requests as it reaches each cylinder, until it gets
to the other end of the disk. At the other end, the direction of head movement
is reversed, and servicing continues. The head continuously scans back and
forth across the disk. The SCAN algorithm is sometimes called the elevator
algorithm, since the disk arm behaves just like an elevator in a building, first
servicing all the requests going up and then reversing to service requests the
other way.
Let’s return to our example to illustrate. Before applying SCAN to schedule
the requests on cylinders 98, 183, 37, 122, 14, 124, 65, and 67, we need to know
the direction of head movement in addition to the head’s current position.
Assuming that the disk arm is moving toward 0 and that the initial head
position is again 53, the head will next service 37 and then 14. At cylinder 0,
the arm will reverse and will move toward the other end of the disk, servicing
the requests at 65, 67, 98, 122, 124, and 183 (Figure 10.6). If a request arrives in
the queue just in front of the head, it will be serviced almost immediately; a
request arriving just behind the head will have to wait until the arm moves to
the end of the disk, reverses direction, and comes back.
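The service order produced by SCAN can be sketched (our illustration) by splitting the pending requests around the head position and walking one side, then the other:

def scan_order(start, queue, direction="down"):
    lower = sorted(c for c in queue if c < start)
    upper = sorted(c for c in queue if c >= start)
    if direction == "down":
        return lower[::-1] + upper     # head runs toward cylinder 0, then reverses
    return upper + lower[::-1]         # head runs toward the far end, then reverses

# scan_order(53, [98, 183, 37, 122, 14, 124, 65, 67])
# -> [37, 14, 65, 67, 98, 122, 124, 183], as in the text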
Assuming a uniform distribution of requests for cylinders, consider the
density of requests when the head reaches one end and reverses direction. At
this point, relatively few requests are immediately in front of the head, since
these cylinders have recently been serviced. The heaviest density of requests
is at the other end of the disk. These requests have also waited the longest, so
why not go there first? That is the idea of the next algorithm.
Figure 10.6 SCAN disk scheduling: queue = 98, 183, 37, 122, 14, 124, 65, 67; head starts at cylinder 53.
10.4.4 C-SCAN Scheduling
Circular SCAN (C-SCAN) scheduling is a variant of SCAN designed to provide
a more uniform wait time. Like SCAN, C-SCAN moves the head from one end
of the disk to the other, servicing requests along the way. When the head
reaches the other end, however, it immediately returns to the beginning of
the disk without servicing any requests on the return trip (Figure 10.7). The
C-SCAN scheduling algorithm essentially treats the cylinders as a circular list
that wraps around from the final cylinder to the first one.
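The corresponding sketch for C-SCAN simply wraps around instead of reversing:

def cscan_order(start, queue):
    upper = sorted(c for c in queue if c >= start)
    lower = sorted(c for c in queue if c < start)
    return upper + lower       # wrap from the final cylinder back to the first

# cscan_order(53, [98, 183, 37, 122, 14, 124, 65, 67])
# -> [65, 67, 98, 122, 124, 183, 14, 37]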
Figure 10.7 C-SCAN disk scheduling: queue = 98, 183, 37, 122, 14, 124, 65, 67; head starts at cylinder 53.
10.4.5 LOOK Scheduling
As we described them, both SCAN and C-SCAN move the disk arm across the
full width of the disk. In practice, neither algorithm is often implemented this
way. More commonly, the arm goes only as far as the final request in each
direction. Then, it reverses direction immediately, without going all the way to
the end of the disk. Versions of SCAN and C-SCAN that follow this pattern are
called LOOK and C-LOOK scheduling, because they look for a request before
continuing to move in a given direction (Figure 10.8).
10.4.6 Selection of a Disk-Scheduling Algorithm
Given so many disk-scheduling algorithms, how do we choose the best one?
SSTF is common and has a natural appeal because it increases performance over
FCFS. SCAN and C-SCAN perform better for systems that place a heavy load on
the disk, because they are less likely to cause a starvation problem. For any
particular list of requests, we can define an optimal order of retrieval, but the
computation needed to find an optimal schedule may not justify the savings
over SSTF or SCAN. With any scheduling algorithm, however, performance
depends heavily on the number and types of requests. For instance, suppose
that the queue usually has just one outstanding request. Then, all scheduling
algorithms behave the same, because they have only one choice of where to
move the disk head: they all behave like FCFS scheduling.
Requests for disk service can be greatly influenced by the file-allocation
method. A program reading a contiguously allocated file will generate several
requests that are close together on the disk, resulting in limited head movement.
A linked or indexed file, in contrast, may include blocks that are widely
scattered on the disk, resulting in greater head movement.
The location of directories and index blocks is also important. Since every
file must be opened to be used, and opening a file requires searching the
directory structure, the directories will be accessed frequently. Suppose that a
directory entry is on the first cylinder and a file’s data are on the final cylinder. In
this case, the disk head has to move the entire width of the disk. If the directory
entry were on the middle cylinder, the head would have to move only one-half
the width. Caching the directories and index blocks in main memory can also
help to reduce disk-arm movement, particularly for read requests.
Figure 10.8 C-LOOK disk scheduling (queue = 98, 183, 37, 122, 14, 124, 65, 67; head starts at 53).
DISK SCHEDULING and SSDs
The disk-scheduling algorithms discussed in this section focus primarily on
minimizing the amount of disk head movement in magnetic disk drives.
SSDs—which do not contain moving disk heads—commonly use a simple
FCFS policy. For example, the Linux Noop scheduler uses an FCFS policy
but modifies it to merge adjacent requests. The observed behavior of SSDs
indicates that the time required to service reads is uniform but that, because
of the properties of flash memory, write service time is not uniform. Some
SSD schedulers have exploited this property and merge only adjacent write
requests, servicing all read requests in FCFS order.
Because of these complexities, the disk-scheduling algorithm should be
written as a separate module of the operating system, so that it can be replaced
with a different algorithm if necessary. Either SSTF or LOOK is a reasonable
choice for the default algorithm.
The scheduling algorithms described here consider only the seek distances.
For modern disks, the rotational latency can be nearly as large as the
average seek time. It is difficult for the operating system to schedule for
improved rotational latency, though, because modern disks do not disclose the
physical location of logical blocks. Disk manufacturers have been alleviating
this problem by implementing disk-scheduling algorithms in the controller
hardware built into the disk drive. If the operating system sends a batch of
requests to the controller, the controller can queue them and then schedule
them to improve both the seek time and the rotational latency.
If I/O performance were the only consideration, the operating system
would gladly turn over the responsibility of disk scheduling to the disk hard-
ware. In practice, however, the operating system may have other constraints on
the service order for requests. For instance, demand paging may take priority
over application I/O, and writes are more urgent than reads if the cache is
running out of free pages. Also, it may be desirable to guarantee the order
of a set of disk writes to make the file system robust in the face of system
crashes. Consider what could happen if the operating system allocated a
disk page to a file and the application wrote data into that page before the
operating system had a chance to flush the file system metadata back to disk.
To accommodate such requirements, an operating system may choose to do its
own disk scheduling and to spoon-feed the requests to the disk controller, one
by one, for some types of I/O.
10.5 Disk Management
The operating system is responsible for several other aspects of disk manage-
ment, too. Here we discuss disk initialization, booting from disk, and bad-block
recovery.
Chapter 11 File-System Interface
For most users, the file system is the most visible aspect of an operating
system. It provides the mechanism for on-line storage of and access to both
data and programs of the operating system and all the users of the computer
system. The file system consists of two distinct parts: a collection of files, each
storing related data, and a directory structure, which organizes and provides
information about all the files in the system. File systems live on devices,
which we described in the preceding chapter and will continue to discuss in
the following one. In this chapter, we consider the various aspects of files and
the major directory structures. We also discuss the semantics of sharing files
among multiple processes, users, and computers. Finally, we discuss ways to
handle file protection, necessary when we have multiple users and we want to
control who may access files and how files may be accessed.
CHAPTER OBJECTIVES
• To explain the function of file systems.
• To describe the interfaces to file systems.
• To discuss file-system design tradeoffs, including access methods, file
sharing, file locking, and directory structures.
• To explore file-system protection.
11.1 File Concept
Computers can store information on various storage media, such as magnetic
disks, magnetic tapes, and optical disks. So that the computer system will
be convenient to use, the operating system provides a uniform logical view
of stored information. The operating system abstracts from the physical
properties of its storage devices to define a logical storage unit, the file. Files are
mapped by the operating system onto physical devices. These storage devices
are usually nonvolatile, so the contents are persistent between system reboots.
A file is a named collection of related information that is recorded on
secondary storage. From a user’s perspective, a file is the smallest allotment
of logical secondary storage; that is, data cannot be written to secondary
storage unless they are within a file. Commonly, files represent programs (both
source and object forms) and data. Data files may be numeric, alphabetic,
alphanumeric, or binary. Files may be free form, such as text files, or may be
formatted rigidly. In general, a file is a sequence of bits, bytes, lines, or records,
the meaning of which is defined by the file’s creator and user. The concept of
a file is thus extremely general.
The information in a file is defined by its creator. Many different types of
information may be stored in a file—source or executable programs, numeric or
text data, photos, music, video, and so on. A file has a certain defined structure,
which depends on its type. A text file is a sequence of characters organized
into lines (and possibly pages). A source file is a sequence of functions, each of
which is further organized as declarations followed by executable statements.
An executable file is a series of code sections that the loader can bring into
memory and execute.
11.1.1 File Attributes
A file is named, for the convenience of its human users, and is referred to by
its name. A name is usually a string of characters, such as example.c. Some
systems differentiate between uppercase and lowercase characters in names,
whereas other systems do not. When a file is named, it becomes independent
of the process, the user, and even the system that created it. For instance, one
user might create the file example.c, and another user might edit that file by
specifying its name. The file’s owner might write the file to a USB disk, send it
as an e-mail attachment, or copy it across a network, and it could still be called
example.c on the destination system.
A file’s attributes vary from one operating system to another but typically
consist of these:
• Name. The symbolic file name is the only information kept in human-
readable form.
• Identifier. This unique tag, usually a number, identifies the file within the
file system; it is the non-human-readable name for the file.
• Type. This information is needed for systems that support different types
of files.
• Location. This information is a pointer to a device and to the location of
the file on that device.
• Size. The current size of the file (in bytes, words, or blocks) and possibly
the maximum allowed size are included in this attribute.
• Protection. Access-control information determines who can do reading,
writing, executing, and so on.
• Time, date, and user identification. This information may be kept for
creation, last modification, and last use. These data can be useful for
protection, security, and usage monitoring.
Figure 11.1 A file info window on Mac OS X.
Some newer file systems also support extended file attributes, including
character encoding of the file and security features such as a file checksum.
Figure 11.1 illustrates a file info window on Mac OS X, which displays a file’s
attributes.
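On a real system, most of these attributes can be read through the file-system
API. As a minimal sketch (not from the text; example.c is just the hypothetical
name used above), Java’s java.nio.file package exposes them as follows:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.BasicFileAttributes;

public class FileAttributesDemo {
    public static void main(String[] args) throws IOException {
        // example.c is the example name used above; substitute any existing file.
        Path file = Paths.get("example.c");

        // Name and location are part of the path itself.
        System.out.println("Name:     " + file.getFileName());
        System.out.println("Location: " + file.toAbsolutePath());

        // Size and the time/date attributes come from the file system.
        BasicFileAttributes attrs = Files.readAttributes(file, BasicFileAttributes.class);
        System.out.println("Size:     " + attrs.size() + " bytes");
        System.out.println("Created:  " + attrs.creationTime());
        System.out.println("Modified: " + attrs.lastModifiedTime());

        // User identification (owner), where the file system supports it.
        System.out.println("Owner:    " + Files.getOwner(file));
    }
}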
The information about all files is kept in the directory structure, which
also resides on secondary storage. Typically, a directory entry consists of the
file’s name and its unique identifier. The identifier in turn locates the other
file attributes. It may take more than a kilobyte to record this information for
each file. In a system with many files, the size of the directory itself may be
megabytes. Because directories, like files, must be nonvolatile, they must be
stored on the device and brought into memory piecemeal, as needed.
11.1.2 File Operations
A file is an abstract data type. To define a file properly, we need to consider the
operations that can be performed on files. The operating system can provide
system calls to create, write, read, reposition, delete, and truncate files. Let’s
examine what the operating system must do to perform each of these six basic
file operations. It should then be easy to see how other similar operations, such
as renaming a file, can be implemented.
• Creating a file. Two steps are necessary to create a file. First, space in the
file system must be found for the file. We discuss how to allocate space for
the file in Chapter 12. Second, an entry for the new file must be made in
the directory.
• Writing a file. To write a file, we make a system call specifying both the
name of the file and the information to be written to the file. Given the
name of the file, the system searches the directory to find the file’s location.
The system must keep a write pointer to the location in the file where the
next write is to take place. The write pointer must be updated whenever a
write occurs.
• Reading a file. To read from a file, we use a system call that specifies the
name of the file and where (in memory) the next block of the file should
be put. Again, the directory is searched for the associated entry, and the
system needs to keep a read pointer to the location in the file where the
next read is to take place. Once the read has taken place, the read pointer
is updated. Because a process is usually either reading from or writing to
a file, the current operation location can be kept as a per-process current-
file-position pointer. Both the read and write operations use this same
pointer, saving space and reducing system complexity.
• Repositioning within a file. The directory is searched for the appropriate
entry, and the current-file-position pointer is repositioned to a given value.
Repositioning within a file need not involve any actual I/O. This file
operation is also known as a file seek.
• Deleting a file. To delete a file, we search the directory for the named file.
Having found the associated directory entry, we release all file space, so
that it can be reused by other files, and erase the directory entry.
• Truncating a file. The user may want to erase the contents of a file but
keep its attributes. Rather than forcing the user to delete the file and then
recreate it, this function allows all attributes to remain unchanged—except
for file length—but lets the file be reset to length zero and its file space
released.
These six basic operations comprise the minimal set of required file
operations. Other common operations include appending new information
to the end of an existing file and renaming an existing file. These primitive
operations can then be combined to perform other file operations. For instance,
we can create a copy of a file—or copy the file to another I/O device, such as
a printer or a display—by creating a new file and then reading from the old
and writing to the new. We also want to have operations that allow a user to
get and set the various attributes of a file. For example, we may want to have
operations that allow a user to determine the status of a file, such as the file’s
length, and to set file attributes, such as the file’s owner.
Most of the file operations mentioned involve searching the directory for
the entry associated with the named file. To avoid this constant searching,
many systems require that an open() system call be made before a file is first
used. The operating system keeps a table, called the open-file table, containing
information about all open files. When a file operation is requested, the file is
specified via an index into this table, so no searching is required. When the file
is no longer being actively used, it is closed by the process, and the operating
system removes its entry from the open-file table. create() and delete() are
system calls that work with closed rather than open files.
Some systems implicitly open a file when the first reference to it is made.
The file is automatically closed when the job or program that opened the
file terminates. Most systems, however, require that the programmer open a
file explicitly with the open() system call before that file can be used. The
open() operation takes a file name and searches the directory, copying the
directory entry into the open-file table. The open() call can also accept access-
mode information—create, read-only, read–write, append-only, and so on.
This mode is checked against the file’s permissions. If the request mode is
allowed, the file is opened for the process. The open() system call typically
returns a pointer to the entry in the open-file table. This pointer, not the actual
file name, is used in all I/O operations, avoiding any further searching and
simplifying the system-call interface.
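The sequence open, write, reposition, read, truncate, close maps directly onto
most programming interfaces. A minimal Java sketch (ours; example.txt is a
hypothetical file name) uses RandomAccessFile, whose returned object plays the
role of the open-file handle described above:

import java.io.IOException;
import java.io.RandomAccessFile;

public class BasicFileOperations {
    public static void main(String[] args) throws IOException {
        // "Open" the (possibly new) file example.txt in read-write mode.
        // The returned object plays the role of the open-file-table handle:
        // later operations use it, not the file name.
        RandomAccessFile file = new RandomAccessFile("example.txt", "rw");

        // Write: the write pointer advances past the written bytes.
        file.writeBytes("hello, file system");

        // Reposition (file seek): move the current-file-position pointer.
        file.seek(0);

        // Read: reads from the current position and advances it.
        byte[] buffer = new byte[5];
        int bytesRead = file.read(buffer);
        System.out.println(new String(buffer, 0, bytesRead));   // prints "hello"

        // Truncate: keep the file but reset its length to zero.
        file.setLength(0);

        // Close: the entry can now be removed from the open-file table.
        file.close();
    }
}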
The implementation of the open() and close() operations is more
complicated in an environment where several processes may open the file
simultaneously. This may occur in a system where several different applications
open the same file at the same time. Typically, the operating system uses two
levels of internal tables: a per-process table and a system-wide table. The per-
process table tracks all files that a process has open. Stored in this table is
information regarding the process’s use of the file. For instance, the current
file pointer for each file is found here. Access rights to the file and accounting
information can also be included.
Each entry in the per-process table in turn points to a system-wide open-file
table. The system-wide table contains process-independent information, such
as the location of the file on disk, access dates, and file size. Once a file has
been opened by one process, the system-wide table includes an entry for the
file. When another process executes an open() call, a new entry is simply
added to the process’s open-file table pointing to the appropriate entry in
the system-wide table. Typically, the open-file table also has an open count
associated with each file to indicate how many processes have the file open.
Each close() decreases this open count, and when the open count reaches
zero, the file is no longer in use, and the file’s entry is removed from the
open-file table.
In summary, several pieces of information are associated with an open file.
• File pointer. On systems that do not include a file offset as part of the
read() and write() system calls, the system must track the last read–
write location as a current-file-position pointer. This pointer is unique to
each process operating on the file and therefore must be kept separate from
the on-disk file attributes.
• File-open count. As files are closed, the operating system must reuse its
open-file table entries, or it could run out of space in the table. Multiple
processes may have opened a file, and the system must wait for the last
file to close before removing the open-file table entry. The file-open count
tracks the number of opens and closes and reaches zero on the last close.
The system can then remove the entry.
• Disk location of the file. Most file operations require the system to modify
data within the file. The information needed to locate the file on disk is
kept in memory so that the system does not have to read it from disk for
each operation.
• Access rights. Each process opens a file in an access mode. This information
is stored in the per-process table so the operating system can allow or deny
subsequent I/O requests.
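A toy model (ours, not an actual kernel implementation) of the two tables and the
open count might look like this in Java:

import java.util.HashMap;
import java.util.Map;

// A toy model of the per-process and system-wide open-file tables.
public class OpenFileTables {
    // System-wide entry: process-independent information plus an open count.
    static class SystemWideEntry {
        final String fileName;      // stands in for the on-disk location/attributes
        int openCount = 0;
        SystemWideEntry(String fileName) { this.fileName = fileName; }
    }

    // Per-process entry: current position and access mode for one process.
    static class PerProcessEntry {
        final SystemWideEntry file;
        long position = 0;
        final String mode;
        PerProcessEntry(SystemWideEntry file, String mode) { this.file = file; this.mode = mode; }
    }

    private final Map<String, SystemWideEntry> systemWide = new HashMap<>();

    // open(): create or reuse the system-wide entry and bump its open count.
    PerProcessEntry open(String name, String mode) {
        SystemWideEntry entry = systemWide.computeIfAbsent(name, SystemWideEntry::new);
        entry.openCount++;
        return new PerProcessEntry(entry, mode);
    }

    // close(): decrement the count; remove the entry on the last close.
    void close(PerProcessEntry handle) {
        if (--handle.file.openCount == 0) {
            systemWide.remove(handle.file.fileName);
        }
    }

    public static void main(String[] args) {
        OpenFileTables os = new OpenFileTables();
        PerProcessEntry a = os.open("system.log", "r");    // process A opens the file
        PerProcessEntry b = os.open("system.log", "rw");   // process B opens it too
        os.close(a);                                        // count drops to 1
        os.close(b);                                        // count reaches 0; entry removed
        System.out.println("open files remaining: " + os.systemWide.size());
    }
}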
Some operating systems provide facilities for locking an open file (or
sections of a file). File locks allow one process to lock a file and prevent other
processes from gaining access to it. File locks are useful for files that are shared
by several processes—for example, a system log file that can be accessed and
modified by a number of processes in the system.
File locks provide functionality similar to reader–writer locks, covered in
Section 5.7.2. A shared lock is akin to a reader lock in that several processes
can acquire the lock concurrently. An exclusive lock behaves like a writer lock;
only one process at a time can acquire such a lock. It is important to note
that not all operating systems provide both types of locks: some systems only
provide exclusive file locking.
FILE LOCKING IN JAVA
In the Java API, acquiring a lock requires first obtaining the FileChannel
for the file to be locked. The lock() method of the FileChannel is used to
acquire the lock. The API of the lock() method is
FileLock lock(long position, long size, boolean shared)
where position is the start of the region being locked and size is the length
of that region in bytes. Setting shared to true is for shared locks; setting shared
to false acquires the lock exclusively. The lock is released by invoking the
release() method of the FileLock returned by the lock() operation.
The program in Figure 11.2 illustrates file locking in Java. This program
acquires two locks on the file file.txt. The first half of the file is acquired
as an exclusive lock; the lock for the second half is a shared lock.
import java.io.*;
import java.nio.channels.*;

public class LockingExample {
   public static final boolean EXCLUSIVE = false;
   public static final boolean SHARED = true;

   public static void main(String args[]) throws IOException {
      FileLock sharedLock = null;
      FileLock exclusiveLock = null;
      try {
         RandomAccessFile raf = new RandomAccessFile("file.txt", "rw");
         // get the channel for the file
         FileChannel ch = raf.getChannel();
         // this locks the first half of the file - exclusive
         exclusiveLock = ch.lock(0, raf.length()/2, EXCLUSIVE);
         /** Now modify the data . . . */
         // release the lock
         exclusiveLock.release();
         // this locks the second half of the file - shared
         sharedLock = ch.lock(raf.length()/2 + 1, raf.length(), SHARED);
         /** Now read the data . . . */
         // release the lock
         sharedLock.release();
      } catch (java.io.IOException ioe) {
         System.err.println(ioe);
      }
      finally {
         if (exclusiveLock != null)
            exclusiveLock.release();
         if (sharedLock != null)
            sharedLock.release();
      }
   }
}
Figure 11.2 File-locking example in Java.
Furthermore, operating systems may provide either mandatory or advi-
sory file-locking mechanisms. If a lock is mandatory, then once a process
acquires an exclusive lock, the operating system will prevent any other process
from accessing the locked file. For example, assume a process acquires an
exclusive lock on the file system.log. If we attempt to open system.log
from another process—for example, a text editor—the operating system will
prevent access until the exclusive lock is released. This occurs even if the text
editor is not written explicitly to acquire the lock. Alternatively, if the lock
is advisory, then the operating system will not prevent the text editor from
acquiring access to system.log. Rather, the text editor must be written so that
it manually acquires the lock before accessing the file. In other words, if the
locking scheme is mandatory, the operating system ensures locking integrity.
For advisory locking, it is up to software developers to ensure that locks are
appropriately acquired and released. As a general rule, Windows operating
systems adopt mandatory locking, and UNIX systems employ advisory locks.
The use of file locks requires the same precautions as ordinary process
synchronization. For example, programmers developing on systems with
mandatory locking must be careful to hold exclusive file locks only while
they are accessing the file. Otherwise, they will prevent other processes from
accessing the file as well. Furthermore, some measures must be taken to ensure
that two or more processes do not become involved in a deadlock while trying
to acquire file locks.
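One common precaution is to attempt a lock without blocking and to back off if it
is unavailable. In the Java API this is what FileChannel.tryLock() does; a minimal
sketch (ours, reusing the system.log example) follows:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;

public class TryLockExample {
    public static void main(String[] args) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile("system.log", "rw");
             FileChannel channel = raf.getChannel()) {

            // tryLock() returns immediately: null means some other program
            // already holds a conflicting lock, so we can back off instead of
            // blocking (and possibly deadlocking) in lock().
            FileLock lock = channel.tryLock(0, Long.MAX_VALUE, false);
            if (lock == null) {
                System.out.println("system.log is locked by another process; try later");
                return;
            }
            try {
                System.out.println("exclusive lock acquired; updating the log ...");
                // ... modify the file here ...
            } finally {
                lock.release();   // always release, even if the update fails
            }
        }
    }
}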
11.1.3 File Types
When we design a file system—indeed, an entire operating system—we
always consider whether the operating system should recognize and support
file types. If an operating system recognizes the type of a file, it can then operate
on the file in reasonable ways. For example, a common mistake occurs when a
user tries to output the binary-object form of a program. This attempt normally
produces garbage; however, the attempt can succeed if the operating system
has been told that the file is a binary-object program.
A common technique for implementing file types is to include the type
as part of the file name. The name is split into two parts—a name and an
extension, usually separated by a period (Figure 11.3). In this way, the user
and the operating system can tell from the name alone what the type of a file
is. Most operating systems allow users to specify a file name as a sequence
of characters followed by a period and terminated by an extension made
up of additional characters. Examples include resume.docx, server.c, and
ReaderThread.cpp.
The system uses the extension to indicate the type of the file and the type
of operations that can be done on that file. Only a file with a .com, .exe, or .sh
extension can be executed, for instance. The .com and .exe files are two forms
of binary executable files, whereas the .sh file is a shell script containing, in
ASCII format, commands to the operating system. Application programs also
use extensions to indicate file types in which they are interested. For example,
Java compilers expect source files to have a .java extension, and the Microsoft
Word word processor expects its files to end with a .doc or .docx extension.
These extensions are not always required, so a user may specify a file without
the extension (to save typing), and the application will look for a file with
the given name and the extension it expects. Because these extensions are
not supported by the operating system, they can be considered “hints” to the
applications that operate on them.
file type        usual extension             function
executable       exe, com, bin, or none      ready-to-run machine-language program
object           obj, o                      compiled, machine language, not linked
source code      c, cc, java, perl, asm      source code in various languages
batch            bat, sh                     commands to the command interpreter
markup           xml, html, tex              textual data, documents
word processor   xml, rtf, docx              various word-processor formats
library          lib, a, so, dll             libraries of routines for programmers
print or view    gif, pdf, jpg               ASCII or binary file in a format for printing or viewing
archive          rar, zip, tar               related files grouped into one file, sometimes compressed, for archiving or storage
multimedia       mpeg, mov, mp3, mp4, avi    binary file containing audio or A/V information
Figure 11.3 Common file types.
Consider, too, the Mac OS X operating system. In this system, each file has
a type, such as .app (for application). Each file also has a creator attribute
containing the name of the program that created it. This attribute is set by
the operating system during the create() call, so its use is enforced and
supported by the system. For instance, a file produced by a word processor
has the word processor’s name as its creator. When the user opens that file, by
double-clicking the mouse on the icon representing the file, the word processor
is invoked automatically and the file is loaded, ready to be edited.
The UNIX system uses a crude magic number stored at the beginning of
some files to indicate roughly the type of the file—executable program, shell
script, PDF file, and so on. Not all files have magic numbers, so system features
cannot be based solely on this information. UNIX does not record the name of
the creating program, either. UNIX does allow file-name-extension hints, but
these extensions are neither enforced nor depended on by the operating system;
they are meant mostly to aid users in determining what type of contents the
file contains. Extensions can be used or ignored by a given application, but that
is up to the application’s programmer.
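A program can imitate the UNIX approach by reading the first few bytes of a file
and comparing them against well-known signatures. A minimal sketch (ours; only a
few common magic numbers are checked):

import java.io.FileInputStream;
import java.io.IOException;

public class MagicNumber {
    // Guess a file's type from its first few bytes, the way the UNIX
    // file command inspects magic numbers.
    static String guessType(String path) throws IOException {
        byte[] magic = new byte[4];
        try (FileInputStream in = new FileInputStream(path)) {
            int n = in.read(magic);
            if (n >= 4 && magic[0] == 0x7f && magic[1] == 'E'
                       && magic[2] == 'L' && magic[3] == 'F') {
                return "ELF executable";
            }
            if (n >= 2 && magic[0] == '#' && magic[1] == '!') {
                return "script (shell, Perl, ...)";
            }
            if (n >= 4 && magic[0] == '%' && magic[1] == 'P'
                       && magic[2] == 'D' && magic[3] == 'F') {
                return "PDF document";
            }
        }
        return "unknown (no recognized magic number)";
    }

    public static void main(String[] args) throws IOException {
        System.out.println(guessType(args.length > 0 ? args[0] : "a.out"));
    }
}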
11.1.4 File Structure
File types also can be used to indicate the internal structure of the file. As
mentioned in Section 11.1.3, source and object files have structures that match
the expectations of the programs that read them. Further, certain files must
conform to a required structure that is understood by the operating system. For
example, the operating system requires that an executable file have a specific
structure so that it can determine where in memory to load the file and what
the location of the first instruction is. Some operating systems extend this idea
into a set of system-supported file structures, with sets of special operations
for manipulating files with those structures.
This point brings us to one of the disadvantages of having the operating
system support multiple file structures: the resulting size of the operating
system is cumbersome. If the operating system defines five different file
structures, it needs to contain the code to support these file structures.
In addition, it may be necessary to define every file as one of the file
types supported by the operating system. When new applications require
information structured in ways not supported by the operating system, severe
problems may result.
For example, assume that a system supports two types of files: text files
(composed of ASCII characters separated by a carriage return and line feed)
and executable binary files. Now, if we (as users) want to define an encrypted
file to protect the contents from being read by unauthorized people, we may
find neither file type to be appropriate. The encrypted file is not ASCII text lines
but rather is (apparently) random bits. Although it may appear to be a binary
file, it is not executable. As a result, we may have to circumvent or misuse the
operating system’s file-type mechanism or abandon our encryption scheme.
Some operating systems impose (and support) a minimal number of file
structures. This approach has been adopted in UNIX, Windows, and others.
UNIX considers each file to be a sequence of 8-bit bytes; no interpretation of
these bits is made by the operating system. This scheme provides maximum
flexibility but little support. Each application program must include its own
code to interpret an input file as to the appropriate structure. However, all
operating systems must support at least one structure—that of an executable
file—so that the system is able to load and run programs.
11.1.5 Internal File Structure
Internally, locating an offset within a file can be complicated for the operating
system. Disk systems typically have a well-defined block size determined by
the size of a sector. All disk I/O is performed in units of one block (physical
record), and all blocks are the same size. It is unlikely that the physical record
size will exactly match the length of the desired logical record. Logical records
may even vary in length. Packing a number of logical records into physical
blocks is a common solution to this problem.
For example, the UNIX operating system defines all files to be simply
streams of bytes. Each byte is individually addressable by its offset from the
beginning (or end) of the file. In this case, the logical record size is 1 byte. The
file system automatically packs and unpacks bytes into physical disk blocks—
say, 512 bytes per block—as necessary.
The logical record size, physical block size, and packing technique deter-
mine how many logical records are in each physical block. The packing can be
done either by the user’s application program or by the operating system. In
either case, the file may be considered a sequence of blocks. All the basic I/O
functions operate in terms of blocks. The conversion from logical records to
physical blocks is a relatively simple software problem.
Figure 11.4 Sequential-access file (a current-position pointer moves between beginning and end; operations: read or write, rewind).
Because disk space is always allocated in blocks, some portion of the last
block of each file is generally wasted. If each block were 512 bytes, for example,
then a file of 1,949 bytes would be allocated four blocks (2,048 bytes); the last
99 bytes would be wasted. The waste incurred to keep everything in units
of blocks (instead of bytes) is internal fragmentation. All file systems suffer
from internal fragmentation; the larger the block size, the greater the internal
fragmentation.
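The arithmetic is straightforward; a small sketch (ours) reproduces the
1,949-byte example:

public class InternalFragmentation {
    public static void main(String[] args) {
        long fileSize = 1949;     // bytes, the example from the text
        long blockSize = 512;     // bytes per disk block

        // Disk space is allocated in whole blocks, so round up.
        long blocks = (fileSize + blockSize - 1) / blockSize;
        long allocated = blocks * blockSize;
        long wasted = allocated - fileSize;   // internal fragmentation

        System.out.println(blocks + " blocks allocated (" + allocated + " bytes); "
                + wasted + " bytes wasted in the last block");
    }
}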
11.2 Access Methods
Files store information. When it is used, this information must be accessed
and read into computer memory. The information in the file can be accessed
in several ways. Some systems provide only one access method for files,
while others support many access methods, and choosing the right one for
a particular application is a major design problem.
11.2.1 Sequential Access
The simplest access method is sequential access. Information in the file is
processed in order, one record after the other. This mode of access is by far the
most common; for example, editors and compilers usually access files in this
fashion.
Reads and writes make up the bulk of the operations on a file. A read
operation—read next()—reads the next portion of the file and automatically
advances a file pointer, which tracks the I/O location. Similarly, the write
operation—write next()—appends to the end of the file and advances to the
end of the newly written material (the new end of file). Such a file can be reset
to the beginning, and on some systems, a program may be able to skip forward
or backward n records for some integer n—perhaps only for n = 1. Sequential
access, which is depicted in Figure 11.4, is based on a tape model of a file and
works as well on sequential-access devices as it does on random-access ones.
11.2.2 Direct Access
Another method is direct access (or relative access). Here, a file is made up
of fixed-length logical records that allow programs to read and write records
rapidly in no particular order. The direct-access method is based on a disk
model of a file, since disks allow random access to any file block. For direct
access, the file is viewed as a numbered sequence of blocks or records. Thus,
we may read block 14, then read block 53, and then write block 7. There are no
restrictions on the order of reading or writing for a direct-access file.
Direct-access files are of great use for immediate access to large amounts
of information. Databases are often of this type. When a query concerning a
particular subject arrives, we compute which block contains the answer and
then read that block directly to provide the desired information.
As a simple example, on an airline-reservation system, we might store all
the information about a particular flight (for example, flight 713) in the block
identified by the flight number. Thus, the number of available seats for flight
713 is stored in block 713 of the reservation file. To store information about a
larger set, such as people, we might compute a hash function on the people’s
names or search a small in-memory index to determine a block to read and
search.
For the direct-access method, the file operations must be modified to
include the block number as a parameter. Thus, we have read(n), where
n is the block number, rather than read next(), and write(n) rather
than write next(). An alternative approach is to retain read next() and
write next(), as with sequential access, and to add an operation posi-
tion file(n) where n is the block number. Then, to effect a read(n), we
would position file(n) and then read next().
The block number provided by the user to the operating system is normally
a relative block number. A relative block number is an index relative to the
beginning of the file. Thus, the first relative block of the file is 0, the next is
1, and so on, even though the absolute disk address may be 14703 for the
first block and 3192 for the second. The use of relative block numbers allows
the operating system to decide where the file should be placed (called the
allocation problem, as we discuss in Chapter 12) and helps to prevent the user
from accessing portions of the file system that may not be part of her file. Some
systems start their relative block numbers at 0; others start at 1.
How, then, does the system satisfy a request for record N in a file? Assuming
we have a logical record length L, the request for record N is turned into an
I/O request for L bytes starting at location L ∗ (N) within the file (assuming the
first record is N = 0). Since logical records are of a fixed size, it is also easy to
read, write, or delete a record.
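A short sketch (ours) shows both the offset computation for record N and how a
current-position variable cp turns direct access into the sequential operations of
Figure 11.5:

// A sketch of how the system turns "record N" into a byte request, and how
// sequential access can be simulated on a direct-access file by keeping a
// current-position variable cp.
public class DirectAccess {
    static final int L = 16;                 // fixed logical record length, in bytes

    // Byte offset of record N when the first record is N = 0.
    static long offsetOfRecord(long n) {
        return L * n;
    }

    // Sequential access simulated on top of direct access.
    static long cp = 0;                      // current position (record number)

    static long readNextOffset()  { return offsetOfRecord(cp++); }
    static long writeNextOffset() { return offsetOfRecord(cp++); }
    static void reset()           { cp = 0; }

    public static void main(String[] args) {
        System.out.println("record 3 starts at byte " + offsetOfRecord(3));   // 48
        reset();
        System.out.println("first read_next reads bytes starting at " + readNextOffset());
        System.out.println("second read_next reads bytes starting at " + readNextOffset());
    }
}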
Not all operating systems support both sequential and direct access for
files. Some systems allow only sequential file access; others allow only direct
access. Some systems require that a file be defined as sequential or direct when
it is created. Such a file can be accessed only in a manner consistent with its
declaration. We can easily simulate sequential access on a direct-access file by
simply keeping a variable cp that defines our current position, as shown in
Figure 11.5. Simulating a direct-access file on a sequential-access file, however,
is extremely inefficient and clumsy.
sequential access        implementation for direct access
reset                    cp = 0;
read_next                read cp;
                         cp = cp + 1;
write_next               write cp;
                         cp = cp + 1;
Figure 11.5 Simulation of sequential access on a direct-access file.
11.2.3 Other Access Methods
Other access methods can be built on top of a direct-access method. These
methods generally involve the construction of an index for the file. The index,
like an index in the back of a book, contains pointers to the various blocks. To
find a record in the file, we first search the index and then use the pointer to
access the file directly and to find the desired record.
For example, a retail-price file might list the universal product codes (UPCs)
for items, with the associated prices. Each record consists of a 10-digit UPC and
a 6-digit price, for a 16-byte record. If our disk has 1,024 bytes per block, we
can store 64 records per block. A file of 120,000 records would occupy about
2,000 blocks (2 million bytes). By keeping the file sorted by UPC, we can define
an index consisting of the first UPC in each block. This index would have 2,000
entries of 10 digits each, or 20,000 bytes, and thus could be kept in memory. To
find the price of a particular item, we can make a binary search of the index.
From this search, we learn exactly which block contains the desired record and
access that block. This structure allows us to search a large file doing little I/O.
With large files, the index file itself may become too large to be kept in
memory. One solution is to create an index for the index file. The primary
index file contains pointers to secondary index files, which point to the actual
data items.
For example, IBM’s indexed sequential-access method (ISAM) uses a small
master index that points to disk blocks of a secondary index. The secondary
index blocks point to the actual file blocks. The file is kept sorted on a defined
key. To find a particular item, we first make a binary search of the master index,
which provides the block number of the secondary index. This block is read
in, and again a binary search is used to find the block containing the desired
record. Finally, this block is searched sequentially. In this way, any record can
be located from its key by at most two direct-access reads. Figure 11.6 shows a
similar situation as implemented by VMS index and relative files.
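The essential step, finding the one block to read, is a binary search of the
in-memory index. A minimal sketch (ours; the UPC values are made up):

import java.util.Arrays;

// An in-memory index holds the first key of each sorted disk block; a binary
// search of the index tells us which single block to read for a given key.
public class IndexedLookup {
    // First UPC stored in each block (the file is kept sorted by UPC).
    static final long[] firstKeyInBlock = {
        1000000000L, 1000000640L, 1000001280L, 1000001920L   // one entry per block
    };

    // Return the number of the block that must contain the key, if it exists.
    static int blockFor(long upc) {
        int i = Arrays.binarySearch(firstKeyInBlock, upc);
        // An exact match means the key is the first record of block i; otherwise
        // binarySearch returns (-(insertion point) - 1), and the key, if present,
        // lives in the block just before the insertion point.
        return (i >= 0) ? i : -i - 2;
    }

    public static void main(String[] args) {
        long upc = 1000000700L;
        System.out.println("UPC " + upc + " is in block " + blockFor(upc)
                + "; read that block and search it for the record");
    }
}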
11.3 Directory and Disk Structure
Next, we consider how to store files. Certainly, no general-purpose computer
stores just one file. There are typically thousands, millions, even billions of
files within a computer. Files are stored on random-access storage devices,
including hard disks, optical disks, and solid-state (memory-based) disks.
A storage device can be used in its entirety for a file system. It can also be
subdivided for finer-grained control. For example, a disk can be partitioned
into quarters, and each quarter can hold a separate file system. Storage devices
can also be collected together into RAID sets that provide protection from the
failure of a single disk (as described in Section 10.7). Sometimes, disks are
subdivided and also collected into RAID sets.
Figure 11.6 Example of index and relative files (an index file, sorted by last name, maps names such as Adams, Arthur, Asher, and Smith to logical record numbers in a relative file whose records hold name, social-security, and age fields).
Partitioning is useful for limiting the sizes of individual file systems,
putting multiple file-system types on the same device, or leaving part of the
device available for other uses, such as swap space or unformatted (raw) disk
space. A file system can be created on each of these parts of the disk. Any entity
containing a file system is generally known as a volume. The volume may be
a subset of a device, a whole device, or multiple devices linked together into
a RAID set. Each volume can be thought of as a virtual disk. Volumes can also
store multiple operating systems, allowing a system to boot and run more than
one operating system.
Each volume that contains a file system must also contain information
about the files in the system. This information is kept in entries in a device
directory or volume table of contents. The device directory (more commonly
known simply as the directory) records information—such as name, location,
size, and type—for all files on that volume. Figure 11.7 shows a typical
file-system organization.
Figure 11.7 A typical file-system organization (three disks divided into partitions A, B, and C, each partition containing a directory and files).
/ ufs
/devices devfs
/dev dev
/system/contract ctfs
/proc proc
/etc/mnttab mntfs
/etc/svc/volatile tmpfs
/system/object objfs
/lib/libc.so.1 lofs
/dev/fd fd
/var ufs
/tmp tmpfs
/var/run tmpfs
/opt ufs
/zpbge zfs
/zpbge/backup zfs
/export/home zfs
/var/mail zfs
/var/spool/mqueue zfs
/zpbg zfs
/zpbg/zones zfs
Figure 11.8 Solaris file systems.
11.3.1 Storage Structure
As we have just seen, a general-purpose computer system has multiple storage
devices, and those devices can be sliced up into volumes that hold file systems.
Computer systems may have zero or more file systems, and the file systems
may be of varying types. For example, a typical Solaris system may have dozens
of file systems of a dozen different types, as shown in the file system list in
Figure 11.8.
In this book, we consider only general-purpose file systems. It is worth
noting, though, that there are many special-purpose file systems. Consider the
types of file systems in the Solaris example mentioned above:
• tmpfs—a “temporary” file system that is created in volatile main memory
and has its contents erased if the system reboots or crashes
• objfs—a “virtual” file system (essentially an interface to the kernel that
looks like a file system) that gives debuggers access to kernel symbols
• ctfs—a virtual file system that maintains “contract” information to manage
which processes start when the system boots and must continue to run
during operation
• lofs—a “loop back” file system that allows one file system to be accessed
in place of another one
• procfs—a virtual file system that presents information on all processes as
a file system
• ufs, zfs—general-purpose file systems
The file systems of computers, then, can be extensive. Even within a file
system, it is useful to segregate files into groups and manage and act on those
groups. This organization involves the use of directories. In the remainder of
this section, we explore the topic of directory structure.
11.3.2 Directory Overview
The directory can be viewed as a symbol table that translates file names into
their directory entries. If we take such a view, we see that the directory itself can
be organized in many ways. The organization must allow us to insert entries,
to delete entries, to search for a named entry, and to list all the entries in the
directory. In this section, we examine several schemes for defining the logical
structure of the directory system.
When considering a particular directory structure, we need to keep in mind
the operations that are to be performed on a directory:
• Search for a file. We need to be able to search a directory structure to find
the entry for a particular file. Since files have symbolic names, and similar
names may indicate a relationship among files, we may want to be able to
find all files whose names match a particular pattern.
• Create a file. New files need to be created and added to the directory.
• Delete a file. When a file is no longer needed, we want to be able to remove
it from the directory.
• List a directory. We need to be able to list the files in a directory and the
contents of the directory entry for each file in the list.
• Rename a file. Because the name of a file represents its contents to its users,
we must be able to change the name when the contents or use of the file
changes. Renaming a file may also allow its position within the directory
structure to be changed.
• Traverse the file system. We may wish to access every directory and every
file within a directory structure. For reliability, it is a good idea to save the
contents and structure of the entire file system at regular intervals. Often,
we do this by copying all files to magnetic tape. This technique provides a
backup copy in case of system failure. In addition, if a file is no longer in
use, the file can be copied to tape and the disk space of that file released
for reuse by another file.
In the following sections, we describe the most common schemes for defining
the logical structure of a directory.
11.3.3 Single-Level Directory
The simplest directory structure is the single-level directory. All files are
contained in the same directory, which is easy to support and understand
(Figure 11.9).
Figure 11.9 Single-level directory (a single directory listing the files cat, bo, a, test, data, mail, cont, hex, and records).
A single-level directory has significant limitations, however, when the
number of files increases or when the system has more than one user. Since all
files are in the same directory, they must have unique names. If two users call
their data file test.txt, then the unique-name rule is violated. For example,
in one programming class, 23 students called the program for their second
assignment prog2.c; another 11 called it assign2.c. Fortunately, most file
systems support file names of up to 255 characters, so it is relatively easy to
select unique file names.
Even a single user on a single-level directory may find it difficult to
remember the names of all the files as the number of files increases. It is not
uncommon for a user to have hundreds of files on one computer system and an
equal number of additional files on another system. Keeping track of so many
files is a daunting task.
11.3.4 Two-Level Directory
As we have seen, a single-level directory often leads to confusion of file names
among different users. The standard solution is to create a separate directory
for each user.
In the two-level directory structure, each user has his own user file
directory (UFD). The UFDs have similar structures, but each lists only the
files of a single user. When a user job starts or a user logs in, the system’s
master file directory (MFD) is searched. The MFD is indexed by user name or
account number, and each entry points to the UFD for that user (Figure 11.10).
When a user refers to a particular file, only his own UFD is searched. Thus,
different users may have files with the same name, as long as all the file names
within each UFD are unique. To create a file for a user, the operating system
searches only that user’s UFD to ascertain whether another file of that name
exists. To delete a file, the operating system confines its search to the local UFD;
thus, it cannot accidentally delete another user’s file that has the same name.
Figure 11.10 Two-level directory structure (a master file directory with one entry per user, user 1 through user 4, each pointing to that user's own file directory of files such as cat, bo, a, test, x, and data).
The user directories themselves must be created and deleted as necessary.
A special system program is run with the appropriate user name and account
information. The program creates a new UFD and adds an entry for it to the MFD.
The execution of this program might be restricted to system administrators. The
allocation of disk space for user directories can be handled with the techniques
discussed in Chapter 12 for files themselves.
Although the two-level directory structure solves the name-collision prob-
lem, it still has disadvantages. This structure effectively isolates one user from
another. Isolation is an advantage when the users are completely independent
but is a disadvantage when the users want to cooperate on some task and to
access one another’s files. Some systems simply do not allow local user files to
be accessed by other users.
If access is to be permitted, one user must have the ability to name a file
in another user’s directory. To name a particular file uniquely in a two-level
directory, we must give both the user name and the file name. A two-level
directory can be thought of as a tree, or an inverted tree, of height 2. The root
of the tree is the MFD. Its direct descendants are the UFDs. The descendants of
the UFDs are the files themselves. The files are the leaves of the tree. Specifying
a user name and a file name defines a path in the tree from the root (the MFD)
to a leaf (the specified file). Thus, a user name and a file name define a path
name. Every file in the system has a path name. To name a file uniquely, a user
must know the path name of the file desired.
For example, if user A wishes to access her own test file named test.txt,
she can simply refer to test.txt. To access the file named test.txt of
user B (with directory-entry name userb), however, she might have to refer
to /userb/test.txt. Every system has its own syntax for naming files in
directories other than the user’s own.
Additional syntax is needed to specify the volume of a file. For instance,
in Windows a volume is specified by a letter followed by a colon. Thus,
a file specification might be C:\userb\test. Some systems go even fur-
ther and separate the volume, directory name, and file name parts of the
specification. In VMS, for instance, the file login.com might be specified as:
u:[sst.jdeck]login.com;1, where u is the name of the volume, sst is the
name of the directory, jdeck is the name of the subdirectory, and 1 is the
version number. Other systems—such as UNIX and Linux—simply treat the
volume name as part of the directory name. The first name given is that of the
volume, and the rest is the directory and file. For instance, /u/pbg/test might
specify volume u, directory pbg, and file test.
A special instance of this situation occurs with the system files. Programs
provided as part of the system—loaders, assemblers, compilers, utility rou-
tines, libraries, and so on—are generally defined as files. When the appropriate
commands are given to the operating system, these files are read by the loader
and executed. Many command interpreters simply treat such a command as
the name of a file to load and execute. In the directory system as we defined it
above, this file name would be searched for in the current UFD. One solution
would be to copy the system files into each UFD. However, copying all the
system files would waste an enormous amount of space. (If the system files
require 5 MB, then supporting 12 users would require 5 × 12 = 60 MB just for
copies of the system files.)
The standard solution is to complicate the search procedure slightly. A
special user directory is defined to contain the system files (for example, user
0). Whenever a file name is given to be loaded, the operating system first
searches the local UFD. If the file is found, it is used. If it is not found, the system
automatically searches the special user directory that contains the system files.
The sequence of directories searched when a file is named is called the search
path. The search path can be extended to contain an unlimited list of directories
to search when a command name is given. This method is the one most used
in UNIX and Windows. Systems can also be designed so that each user has his
own search path.
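A sketch (ours; the directory names are hypothetical) of the search-path idea,
trying the user's own directory first and the system directory second:

import java.io.File;
import java.util.Arrays;
import java.util.List;

// Search-path resolution: look for a command first in the user's own
// directory, then in the special system directory.
public class SearchPath {
    static final List<String> SEARCH_PATH =
            Arrays.asList("/home/alice", "/system/bin");   // hypothetical directories

    static String resolve(String command) {
        for (String dir : SEARCH_PATH) {
            File candidate = new File(dir, command);
            if (candidate.exists()) {
                return candidate.getPath();      // first match wins
            }
        }
        return null;                             // not found anywhere on the path
    }

    public static void main(String[] args) {
        String found = resolve("ls");
        System.out.println(found != null ? "found " + found : "command not found");
    }
}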
11.3.5 Tree-Structured Directories
Once we have seen how to view a two-level directory as a two-level tree,
the natural generalization is to extend the directory structure to a tree of
arbitrary height (Figure 11.11). This generalization allows users to create their
own subdirectories and to organize their files accordingly. A tree is the most
common directory structure. The tree has a root directory, and every file in the
system has a unique path name.
A directory (or subdirectory) contains a set of files or subdirectories. A
directory is simply another file, but it is treated in a special way. All directories
have the same internal format. One bit in each directory entry defines the entry
as a file (0) or as a subdirectory (1). Special system calls are used to create and
delete directories.
Figure 11.11 Tree-structured directory structure (a root directory with subdirectories such as spell, bin, and programs, which in turn contain further subdirectories and files, for example mail, dist, prog, copy, prt, exp, count, and hex).
In normal use, each process has a current directory. The current directory
should contain most of the files that are of current interest to the process.
When reference is made to a file, the current directory is searched. If a file
is needed that is not in the current directory, then the user usually must
either specify a path name or change the current directory to be the directory
holding that file. To change directories, a system call is provided that takes a
directory name as a parameter and uses it to redefine the current directory.
Thus, the user can change her current directory whenever she wants. From one
change directory() system call to the next, all open() system calls search
the current directory for the specified file. Note that the search path may or
may not contain a special entry that stands for “the current directory.”
The initial current directory of a user’s login shell is designated when
the user job starts or the user logs in. The operating system searches the
accounting file (or some other predefined location) to find an entry for this
user (for accounting purposes). In the accounting file is a pointer to (or the
name of) the user’s initial directory. This pointer is copied to a local variable
for this user that specifies the user’s initial current directory. From that shell,
other processes can be spawned. The current directory of any subprocess is
usually the current directory of the parent when it was spawned.
Path names can be of two types: absolute and relative. An absolute path
name begins at the root and follows a path down to the specified file, giving
the directory names on the path. A relative path name defines a path from the
current directory. For example, in the tree-structured file system of Figure
11.11, if the current directory is root/spell/mail, then the relative path
name prt/first refers to the same file as does the absolute path name
root/spell/mail/prt/first.
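The same equivalence can be demonstrated with Java's path API; in this sketch
(ours), the root of Figure 11.11 is written as /root:

import java.nio.file.Path;
import java.nio.file.Paths;

// Absolute versus relative path names, using the directories from Figure 11.11.
public class PathNames {
    public static void main(String[] args) {
        Path currentDirectory = Paths.get("/root/spell/mail");

        // A relative name is interpreted against the current directory ...
        Path relative = Paths.get("prt/first");
        Path resolved = currentDirectory.resolve(relative).normalize();

        // ... and names the same file as the absolute path.
        Path absolute = Paths.get("/root/spell/mail/prt/first");

        System.out.println(resolved);                       // /root/spell/mail/prt/first
        System.out.println(resolved.equals(absolute));      // true
    }
}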
Allowing a user to define her own subdirectories permits her to impose
a structure on her files. This structure might result in separate directories for
files associated with different topics (for example, a subdirectory was created
to hold the text of this book) or different forms of information (for example,
the directory programs may contain source programs; the directory bin may
store all the binaries).
An interesting policy decision in a tree-structured directory concerns how
to handle the deletion of a directory. If a directory is empty, its entry in the
directory that contains it can simply be deleted. However, suppose the directory
to be deleted is not empty but contains several files or subdirectories. One of
two approaches can be taken. Some systems will not delete a directory unless
it is empty. Thus, to delete a directory, the user must first delete all the files
in that directory. If any subdirectories exist, this procedure must be applied
recursively to them, so that they can be deleted also. This approach can result
in a substantial amount of work. An alternative approach, such as that taken
by the UNIX rm command, is to provide an option: when a request is made
to delete a directory, all that directory’s files and subdirectories are also to be
deleted. Either approach is fairly easy to implement; the choice is one of policy.
The latter policy is more convenient, but it is also more dangerous, because an
entire directory structure can be removed with one command. If that command
is issued in error, a large number of files and directories will need to be restored
(assuming a backup exists).
With a tree-structured directory system, users can be allowed to access, in
addition to their files, the files of other users. For example, user B can access a
file of user A by specifying its path names. User B can specify either an absolute
or a relative path name. Alternatively, user B can change her current directory
to be user A’s directory and access the file by its file names.
11.3.6 Acyclic-Graph Directories
Consider two programmers who are working on a joint project. The files asso-
ciated with that project can be stored in a subdirectory, separating them from
other projects and files of the two programmers. But since both programmers
are equally responsible for the project, both want the subdirectory to be in their
own directories. In this situation, the common subdirectory should be shared.
A shared directory or file exists in the file system in two (or more) places at
once.
A tree structure prohibits the sharing of files or directories. An acyclic graph
—that is, a graph with no cycles—allows directories to share subdirectories
and files (Figure 11.12). The same file or subdirectory may be in two different
directories. The acyclic graph is a natural generalization of the tree-structured
directory scheme.
It is important to note that a shared file (or directory) is not the same as two
copies of the file. With two copies, each programmer can view the copy rather
than the original, but if one programmer changes the file, the changes will not
appear in the other’s copy. With a shared file, only one actual file exists, so any
changes made by one person are immediately visible to the other. Sharing is
particularly important for subdirectories; a new file created by one person will
automatically appear in all the shared subdirectories.
When people are working as a team, all the files they want to share can be
put into one directory. The UFD of each team member will contain this directory
of shared files as a subdirectory. Even in the case of a single user, the user’s file
organization may require that some file be placed in different subdirectories.
For example, a program written for a particular project should be both in the
directory of all programs and in the directory for that project.
Shared files and subdirectories can be implemented in several ways. A
common way, exemplified by many of the UNIX systems, is to create a new
directory entry called a link. A link is effectively a pointer to another file
Figure 11.12 Acyclic-graph directory structure.
or subdirectory. For example, a link may be implemented as an absolute or a
relative path name. When a reference to a file is made, we search the directory. If
the directory entry is marked as a link, then the name of the real file is included
in the link information. We resolve the link by using that path name to locate
the real file. Links are easily identified by their format in the directory entry
(or by having a special type on systems that support types) and are effectively
indirect pointers. The operating system ignores these links when traversing
directory trees to preserve the acyclic structure of the system.
Another common approach to implementing shared files is simply to
duplicate all information about them in both sharing directories. Thus, both
entries are identical and equal. Consider the difference between this approach
and the creation of a link. The link is clearly different from the original directory
entry; thus, the two are not equal. Duplicate directory entries, however, make
the original and the copy indistinguishable. A major problem with duplicate
directory entries is maintaining consistency when a file is modified.
An acyclic-graph directory structure is more flexible than a simple tree
structure, but it is also more complex. Several problems must be considered
carefully. A file may now have multiple absolute path names. Consequently,
distinct file names may refer to the same file. This situation is similar to the
aliasing problem for programming languages. If we are trying to traverse the
entire file system—to find a file, to accumulate statistics on all files, or to copy
all files to backup storage—this problem becomes significant, since we do not
want to traverse shared structures more than once.
Another problem involves deletion. When can the space allocated to a
shared file be deallocated and reused? One possibility is to remove the file
whenever anyone deletes it, but this action may leave dangling pointers to the
now-nonexistent file. Worse, if the remaining file pointers contain actual disk
addresses, and the space is subsequently reused for other files, these dangling
pointers may point into the middle of other files.
In a system where sharing is implemented by symbolic links, this situation
is somewhat easier to handle. The deletion of a link need not affect the original
file; only the link is removed. If the file entry itself is deleted, the space for
the file is deallocated, leaving the links dangling. We can search for these links
and remove them as well, but unless a list of the associated links is kept with
each file, this search can be expensive. Alternatively, we can leave the links
until an attempt is made to use them. At that time, we can determine that the
file of the name given by the link does not exist and can fail to resolve the
link name; the access is treated just as with any other illegal file name. (In this
case, the system designer should consider carefully what to do when a file is
deleted and another file of the same name is created, before a symbolic link to
the original file is used.) In the case of UNIX, symbolic links are left when a file
is deleted, and it is up to the user to realize that the original file is gone or has
been replaced. Microsoft Windows uses the same approach.
Another approach to deletion is to preserve the file until all references to
it are deleted. To implement this approach, we must have some mechanism
for determining that the last reference to the file has been deleted. We could
keep a list of all references to a file (directory entries or symbolic links). When
a link or a copy of the directory entry is established, a new entry is added to
the file-reference list. When a link or directory entry is deleted, we remove its
entry on the list. The file is deleted when its file-reference list is empty.
The trouble with this approach is the variable and potentially large size
of the file-reference list. However, we really do not need to keep the entire
list—we need to keep only a count of the number of references. Adding a
new link or directory entry increments the reference count. Deleting a link
or entry decrements the count. When the count is 0, the file can be deleted;
there are no remaining references to it. The UNIX operating system uses this
approach for nonsymbolic links (or hard links), keeping a reference count in the
file information block (or inode; see Section A.7.2). By effectively prohibiting
multiple references to directories, we maintain an acyclic-graph structure.
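A minimal sketch of this reference-count scheme in C follows; the structure and function names here are hypothetical, and a real inode carries far more state.

#include <stdlib.h>

/* Hypothetical inode holding only the hard-link reference count. */
struct inode {
    int ref_count;   /* number of directory entries referring to this file */
    /* ... ownership, permissions, data-block pointers, etc. ... */
};

/* Stub standing in for real reclamation of the inode and its blocks. */
static void free_inode_and_blocks(struct inode *ip) {
    free(ip);
}

/* Creating a hard link adds a directory entry and increments the count. */
void add_link(struct inode *ip) {
    ip->ref_count++;
}

/* Deleting a link decrements the count; when it reaches 0, no directory
   entry refers to the file, and its space can be reclaimed. */
void remove_link(struct inode *ip) {
    if (--ip->ref_count == 0)
        free_inode_and_blocks(ip);
}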
To avoid problems such as the ones just discussed, some systems simply
do not allow shared directories or links.
11.3.7 General Graph Directory
A serious problem with using an acyclic-graph structure is ensuring that there
are no cycles. If we start with a two-level directory and allow users to create
subdirectories, a tree-structured directory results. It should be fairly easy to see
that simply adding new files and subdirectories to an existing tree-structured
directory preserves the tree-structured nature. However, when we add links,
the tree structure is destroyed, resulting in a simple graph structure (Figure
11.13).
The primary advantage of an acyclic graph is the relative simplicity of the
algorithms to traverse the graph and to determine when there are no more
references to a file. We want to avoid traversing shared sections of an acyclic
graph twice, mainly for performance reasons. If we have just searched a major
shared subdirectory for a particular file without finding it, we want to avoid
searching that subdirectory again; the second search would be a waste of time.
If cycles are allowed to exist in the directory, we likewise want to
avoid searching any component twice, for reasons of correctness as well as
performance. A poorly designed algorithm might result in an infinite loop
continually searching through the cycle and never terminating. One solution
Figure 11.13 General graph directory.
is to limit arbitrarily the number of directories that will be accessed during a
search.
A similar problem exists when we are trying to determine when a file
can be deleted. With acyclic-graph directory structures, a value of 0 in the
reference count means that there are no more references to the file or directory,
and the file can be deleted. However, when cycles exist, the reference count
may not be 0 even when it is no longer possible to refer to a directory or file.
This anomaly results from the possibility of self-referencing (or a cycle) in the
directory structure. In this case, we generally need to use a garbage collection
scheme to determine when the last reference has been deleted and the disk
space can be reallocated. Garbage collection involves traversing the entire file
system, marking everything that can be accessed. Then, a second pass collects
everything that is not marked onto a list of free space. (A similar marking
procedure can be used to ensure that a traversal or search will cover everything
in the file system once and only once.) Garbage collection for a disk-based file
system, however, is extremely time consuming and is thus seldom attempted.
Garbage collection is necessary only because of possible cycles in the graph.
Thus, an acyclic-graph structure is much easier to work with. The difficulty
is to avoid cycles as new links are added to the structure. How do we know
when a new link will complete a cycle? There are algorithms to detect cycles
in graphs; however, they are computationally expensive, especially when the
graph is on disk storage. A simpler algorithm in the special case of directories
and links is to bypass links during directory traversal. Cycles are avoided, and
no extra overhead is incurred.
11.4 File-System Mounting
Just as a file must be opened before it is used, a file system must be mounted
before it can be available to processes on the system. More specifically, the
directory structure may be built out of multiple volumes, which must be
mounted to make them available within the file-system name space.
The mount procedure is straightforward. The operating system is given the
name of the device and the mount point—the location within the file structure
where the file system is to be attached. Some operating systems require that a
file system type be provided, while others inspect the structures of the device
and determine the type of file system. Typically, a mount point is an empty
directory. For instance, on a UNIX system, a file system containing a user’s home
directories might be mounted as /home; then, to access the directory structure
within that file system, we could precede the directory names with /home, as
in /home/jane. Mounting that file system under /users would result in the
path name /users/jane, which we could use to reach the same directory.
Next, the operating system verifies that the device contains a valid file
system. It does so by asking the device driver to read the device directory
and verifying that the directory has the expected format. Finally, the operating
system notes in its directory structure that a file system is mounted at the
specified mount point. This scheme enables the operating system to traverse
its directory structure, switching among file systems, and even file systems of
varying types, as appropriate.
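On Linux, the mount operation just described is exposed to privileged programs through the mount() system call. A minimal sketch follows; the device name /dev/sdb1 and the ext4 type are assumptions chosen for illustration.

#include <stdio.h>
#include <sys/mount.h>

int main(void) {
    /* Attach the file system on /dev/sdb1 at the (empty) directory /home.
       The file-system type is supplied explicitly; must run as root. */
    if (mount("/dev/sdb1", "/home", "ext4", 0, NULL) != 0) {
        perror("mount");
        return 1;
    }
    return 0;
}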
Chapter 12 File-System Implementation
As we saw in Chapter 11, the file system provides the mechanism for on-line
storage and access to file contents, including data and programs. The file system
resides permanently on secondary storage, which is designed to hold a large
amount of data permanently. This chapter is primarily concerned with issues
surrounding file storage and access on the most common secondary-storage
medium, the disk. We explore ways to structure file use, to allocate disk space,
to recover freed space, to track the locations of data, and to interface other
parts of the operating system to secondary storage. Performance issues are
considered throughout the chapter.
CHAPTER OBJECTIVES
• To describe the details of implementing local file systems and directory
structures.
• To describe the implementation of remote file systems.
• To discuss block allocation and free-block algorithms and trade-offs.
12.1 File-System Structure
Disks provide most of the secondary storage on which file systems are
maintained. Two characteristics make them convenient for this purpose:
1. A disk can be rewritten in place; it is possible to read a block from the
disk, modify the block, and write it back into the same place.
2. A disk can access directly any block of information it contains. Thus, it is
simple to access any file either sequentially or randomly, and switching
from one file to another requires only moving the read–write heads and
waiting for the disk to rotate.
We discuss disk structure in great detail in Chapter 10.
To improve I/O efficiency, I/O transfers between memory and disk are
performed in units of blocks. Each block has one or more sectors. Depending
on the disk drive, sector size varies from 32 bytes to 4,096 bytes; the usual size
is 512 bytes.
File systems provide efficient and convenient access to the disk by allowing
data to be stored, located, and retrieved easily. A file system poses two quite
different design problems. The first problem is defining how the file system
should look to the user. This task involves defining a file and its attributes,
the operations allowed on a file, and the directory structure for organizing
files. The second problem is creating algorithms and data structures to map the
logical file system onto the physical secondary-storage devices.
The file system itself is generally composed of many different levels. The
structure shown in Figure 12.1 is an example of a layered design. Each level in
the design uses the features of lower levels to create new features for use by
higher levels.
The I/O control level consists of device drivers and interrupt handlers
to transfer information between the main memory and the disk system. A
device driver can be thought of as a translator. Its input consists of high-
level commands such as “retrieve block 123.” Its output consists of low-level,
hardware-specific instructions that are used by the hardware controller, which
interfaces the I/O device to the rest of the system. The device driver usually
writes specific bit patterns to special locations in the I/O controller’s memory
to tell the controller which device location to act on and what actions to take.
The details of device drivers and the I/O infrastructure are covered in Chapter
13.
The basic file system needs only to issue generic commands to the
appropriate device driver to read and write physical blocks on the disk. Each
physical block is identified by its numeric disk address (for example, drive 1,
cylinder 73, track 2, sector 10). This layer also manages the memory buffers
and caches that hold various file-system, directory, and data blocks. A block
in the buffer is allocated before the transfer of a disk block can occur. When
the buffer is full, the buffer manager must find more buffer memory or free
Figure 12.1 Layered file system. From top to bottom: application programs, logical file system, file-organization module, basic file system, I/O control, devices.
up buffer space to allow a requested I/O to complete. Caches are used to hold
frequently used file-system metadata to improve performance, so managing
their contents is critical for optimum system performance.
The file-organization module knows about files and their logical blocks,
as well as physical blocks. By knowing the type of file allocation used and
the location of the file, the file-organization module can translate logical block
addresses to physical block addresses for the basic file system to transfer.
Each file’s logical blocks are numbered from 0 (or 1) through N. Since the
physical blocks containing the data usually do not match the logical numbers,
a translation is needed to locate each block. The file-organization module also
includes the free-space manager, which tracks unallocated blocks and provides
these blocks to the file-organization module when requested.
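As a hedged illustration of this translation, the sketch below assumes a hypothetical per-file index table recording one physical address per logical block; the actual mapping depends on the allocation method used (see Section 12.4).

#define MAX_BLOCKS 128

/* Hypothetical per-file map from logical block numbers to physical blocks. */
struct block_map {
    int  block_count;                 /* logical blocks 0 .. block_count-1 */
    long physical_block[MAX_BLOCKS];  /* physical address of each block    */
};

/* Translate logical block `lbn` to its physical address for the basic
   file system; returns -1 if the block lies beyond the end of the file. */
long translate(const struct block_map *m, int lbn) {
    if (lbn < 0 || lbn >= m->block_count)
        return -1;
    return m->physical_block[lbn];
}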
Finally, the logical file system manages metadata information. Metadata
includes all of the file-system structure except the actual data (or contents of
the files). The logical file system manages the directory structure to provide
the file-organization module with the information the latter needs, given a
symbolic file name. It maintains file structure via file-control blocks. A file-
control block (FCB) (an inode in UNIX file systems) contains information about
the file, including ownership, permissions, and location of the file contents. The
logical file system is also responsible for protection, as discussed in Chapters
11 and 14.
When a layered structure is used for file-system implementation, duplica-
tion of code is minimized. The I/O control and sometimes the basic file-system
code can be used by multiple file systems. Each file system can then have its
own logical file-system and file-organization modules. Unfortunately, layering
can introduce more operating-system overhead, which may result in decreased
performance. The use of layering, including the decision about how many
layers to use and what each layer should do, is a major challenge in designing
new systems.
Many file systems are in use today, and most operating systems support
more than one. For example, most CD-ROMs are written in the ISO 9660
format, a standard format agreed on by CD-ROM manufacturers. In addition
to removable-media file systems, each operating system has one or more disk-
based file systems. UNIX uses the UNIX file system (UFS), which is based on the
Berkeley Fast File System (FFS). Windows supports disk file-system formats of
FAT, FAT32, and NTFS (or Windows NT File System), as well as CD-ROM and DVD
file-system formats. Although Linux supports over forty different file systems,
the standard Linux file system is known as the extended file system, with
the most common versions being ext3 and ext4. There are also distributed file
systems in which a file system on a server is mounted by one or more client
computers across a network.
File-system research continues to be an active area of operating-system
design and implementation. Google created its own file system to meet
the company’s specific storage and retrieval needs, which include high-
performance access from many clients across a very large number of disks.
Another interesting project is the FUSE file system, which provides flexibility in
file-system development and use by implementing and executing file systems
as user-level rather than kernel-level code. Using FUSE, a user can add a new
file system to a variety of operating systems and can use that file system to
manage her files.
546 Chapter 12 File-System Implementation
12.2 File-System Implementation
As was described in Section 11.1.2, operating systems implement open()
and close() system calls for processes to request access to file contents.
In this section, we delve into the structures and operations used to implement
file-system operations.
12.2.1 Overview
Several on-disk and in-memory structures are used to implement a file system.
These structures vary depending on the operating system and the file system,
but some general principles apply.
On disk, the file system may contain information about how to boot an
operating system stored there, the total number of blocks, the number and
location of free blocks, the directory structure, and individual files. Many of
these structures are detailed throughout the remainder of this chapter. Here,
we describe them briefly:
• A boot control block (per volume) can contain information needed by the
system to boot an operating system from that volume. If the disk does not
contain an operating system, this block can be empty. It is typically the
first block of a volume. In UFS, it is called the boot block. In NTFS, it is the
partition boot sector.
• A volume control block (per volume) contains volume (or partition)
details, such as the number of blocks in the partition, the size of the blocks,
a free-block count and free-block pointers, and a free-FCB count and FCB
pointers. In UFS, this is called a superblock (a simplified sketch follows
this list). In NTFS, it is stored in the master file table.
• A directory structure (per file system) is used to organize the files. In UFS,
this includes file names and associated inode numbers. In NTFS, it is stored
in the master file table.
• A per-file FCB contains many details about the file. It has a unique
identifier number to allow association with a directory entry. In NTFS,
this information is actually stored within the master file table, which uses
a relational database structure, with a row per file.
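A simplified C sketch of the volume control block described above; the field names are hypothetical, and a real UFS superblock holds many more fields.

/* Hypothetical, abbreviated volume control block (cf. the UFS superblock). */
struct volume_control_block {
    unsigned long block_count;       /* number of blocks in the partition */
    unsigned long block_size;        /* size of each block, in bytes      */
    unsigned long free_block_count;  /* blocks currently unallocated      */
    unsigned long free_block_ptr;    /* head of the free-block list       */
    unsigned long free_fcb_count;    /* FCBs currently unallocated        */
    unsigned long free_fcb_ptr;      /* head of the free-FCB list         */
};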
The in-memory information is used for both file-system management and
performance improvement via caching. The data are loaded at mount time,
updated during file-system operations, and discarded at dismount. Several
types of structures may be included.
• An in-memory mount table contains information about each mounted
volume.
• An in-memory directory-structure cache holds the directory information
of recently accessed directories. (For directories at which volumes are
mounted, it can contain a pointer to the volume table.)
• The system-wide open-file table contains a copy of the FCB of each open
file, as well as other information.
Figure 12.2 A typical file-control block, containing: file permissions; file dates (create, access, write); file owner, group, ACL; file size; and file data blocks or pointers to file data blocks.
• The per-process open-file table contains a pointer to the appropriate entry
in the system-wide open-file table, as well as other information.
• Buffers hold file-system blocks when they are being read from disk or
written to disk.
To create a new file, an application program calls the logical file system.
The logical file system knows the format of the directory structures. To create a
new file, it allocates a new FCB. (Alternatively, if the file-system implementation
creates all FCBs at file-system creation time, an FCB is allocated from the set
of free FCBs.) The system then reads the appropriate directory into memory,
updates it with the new file name and FCB, and writes it back to the disk. A
typical FCB is shown in Figure 12.2.
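Translating Figure 12.2 into a C sketch gives something like the following; the field names and the number of direct block pointers are assumptions, and a UNIX inode is one concrete realization of this structure.

#include <sys/types.h>   /* mode_t, uid_t, gid_t, off_t */
#include <time.h>        /* time_t */

#define DIRECT_BLOCKS 12

/* Hypothetical file-control block mirroring Figure 12.2. */
struct fcb {
    mode_t permissions;                /* file permissions                 */
    time_t created, accessed, written; /* file dates                       */
    uid_t  owner;                      /* file owner                       */
    gid_t  group;                      /* group (ACLs omitted for brevity) */
    off_t  size;                       /* file size, in bytes              */
    long   data_blocks[DIRECT_BLOCKS]; /* pointers to file data blocks     */
};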
Some operating systems, including UNIX, treat a directory exactly the same
as a file—one with a “type” field indicating that it is a directory. Other operating
systems, including Windows, implement separate system calls for files and
directories and treat directories as entities separate from files. Whatever the
larger structural issues, the logical file system can call the file-organization
module to map the directory I/O into disk-block numbers, which are passed
on to the basic file system and I/O control system.
Now that a file has been created, it can be used for I/O. First, though, it
must be opened. The open() call passes a file name to the logical file system.
The open() system call first searches the system-wide open-file table to see
if the file is already in use by another process. If it is, a per-process open-file
table entry is created pointing to the existing system-wide open-file table. This
algorithm can save substantial overhead. If the file is not already open, the
directory structure is searched for the given file name. Parts of the directory
structure are usually cached in memory to speed directory operations. Once
the file is found, the FCB is copied into a system-wide open-file table in memory.
This table not only stores the FCB but also tracks the number of processes that
have the file open.
Next, an entry is made in the per-process open-file table, with a pointer
to the entry in the system-wide open-file table and some other fields. These
other fields may include a pointer to the current location in the file (for the next
read() or write() operation) and the access mode in which the file is open.
The open() call returns a pointer to the appropriate entry in the per-process
Figure 12.3 In-memory file-system structures. (a) File open. (b) File read.
file-system table. All file operations are then performed via this pointer. The
file name may not be part of the open-file table, as the system has no use for
it once the appropriate FCB is located on disk. It could be cached, though, to
save time on subsequent opens of the same file. The name given to the entry
varies. UNIX systems refer to it as a file descriptor; Windows refers to it as a
file handle.
When a process closes the file, the per-process table entry is removed, and
the system-wide entry’s open count is decremented. When all users that have
opened the file close it, any updated metadata is copied back to the disk-based
directory structure, and the system-wide open-file table entry is removed.
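The two tables and the close() bookkeeping can be sketched in C as follows; all names and sizes here are hypothetical, and a real system adds locking and error handling.

#define MAX_SYS_OPEN 64   /* system-wide table size (hypothetical) */
#define MAX_PROC_FDS 16   /* per-process table size (hypothetical) */

struct fcb;   /* the FCB copied in from disk at open() time */

struct sys_open_file {
    struct fcb *fcb;   /* in-memory copy of the file's FCB       */
    int open_count;    /* number of processes with the file open */
};

struct proc_open_file {
    struct sys_open_file *entry;  /* pointer into the system-wide table  */
    long offset;                  /* current position for read()/write() */
    int  mode;                    /* access mode the file was opened in  */
};

struct sys_open_file  system_table[MAX_SYS_OPEN];
struct proc_open_file fd_table[MAX_PROC_FDS];  /* one such table per process */

/* Close a descriptor: drop the per-process entry; the last process to
   close the file releases the system-wide entry (updated metadata would
   be written back to the on-disk directory structure at that point). */
void close_fd(int fd) {
    struct sys_open_file *s = fd_table[fd].entry;
    fd_table[fd].entry = NULL;
    if (--s->open_count == 0)
        s->fcb = NULL;
}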
Some systems complicate this scheme further by using the file system as an
interface to other system aspects, such as networking. For example, in UFS, the
system-wide open-file table holds the inodes and other information for files
and directories. It also holds similar information for network connections and
devices. In this way, one mechanism can be used for multiple purposes.
The caching aspects of file-system structures should not be overlooked.
Most systems keep all information about an open file, except for its actual data
blocks, in memory. The BSD UNIX system is typical in its use of caches wherever
disk I/O can be saved. Its average cache hit rate of 85 percent shows that these
techniques are well worth implementing. The BSD UNIX system is described
fully in Appendix A.
The operating structures of a file-system implementation are summarized
in Figure 12.3.
12.2.2 Partitions and Mounting
The layout of a disk can have many variations, depending on the operating
system. A disk can be sliced into multiple partitions, or a volume can span
multiple partitions on multiple disks. The former layout is discussed here,
while the latter, which is more appropriately considered a form of RAID, is
covered in Section 10.7.
Each partition can be either “raw,” containing no file system, or “cooked,”
containing a file system. Raw disk is used where no file system is appropriate.
UNIX swap space can use a raw partition, for example, since it uses its own
format on disk and does not use a file system. Likewise, some databases use raw
disk and format the data to suit their needs. Raw disk can also hold information
needed by disk RAID systems, such as bit maps indicating which blocks are
mirrored and which have changed and need to be mirrored. Similarly, raw
disk can contain a miniature database holding RAID configuration information,
such as which disks are members of each RAID set. Raw disk use is discussed
in Section 10.5.1.
Boot information can be stored in a separate partition, as described in
Section 10.5.2. Again, it has its own format, because at boot time the system
does not have the file-system code loaded and therefore cannot interpret the
file-system format. Rather, boot information is usually a sequential series of
blocks, loaded as an image into memory. Execution of the image starts at a
predefined location, such as the first byte. This boot loader in turn knows
enough about the file-system structure to be able to find and load the kernel
and start it executing. It can contain more than the instructions for how to boot
a specific operating system. For instance, many systems can be dual-booted,
allowing us to install multiple operating systems on a single system. How does
the system know which one to boot? A boot loader that understands multiple
file systems and multiple operating systems can occupy the boot space. Once
loaded, it can boot one of the operating systems available on the disk. The disk
can have multiple partitions, each containing a different type of file system and
a different operating system.
The root partition, which contains the operating-system kernel and some-
times other system files, is mounted at boot time. Other volumes can be
automatically mounted at boot or manually mounted later, depending on
the operating system. As part of a successful mount operation, the operating
system verifies that the device contains a valid file system. It does so by asking
the device driver to read the device directory and verifying that the directory
has the expected format. If the format is invalid, the partition must have
its consistency checked and possibly corrected, either with or without user
intervention. Finally, the operating system notes in its in-memory mount table
that a file system is mounted, along with the type of the file system. The details
of this function depend on the operating system.
Microsoft Windows–based systems mount each volume in a separate name
space, denoted by a letter and a colon. To record that a file system is mounted
at F:, for example, the operating system places a pointer to the file system in
a field of the device structure corresponding to F:. When a process specifies
the drive letter, the operating system finds the appropriate file-system pointer
and traverses the directory structures on that device to find the specified file
or directory. Later versions of Windows can mount a file system at any point
within the existing directory structure.
On UNIX, file systems can be mounted at any directory. Mounting is
implemented by setting a flag in the in-memory copy of the inode for that
directory. The flag indicates that the directory is a mount point. A field then
points to an entry in the mount table, indicating which device is mounted there.
The mount table entry contains a pointer to the superblock of the file system on
that device. This scheme enables the operating system to traverse its directory
structure, switching seamlessly among file systems of varying types.
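A sketch of the mount-point check made during path-name traversal appears below; every structure and helper here is simplified and hypothetical, not an actual UNIX kernel interface.

struct super_block;   /* per-file-system metadata (cf. the UFS superblock) */

struct mount_entry {
    struct inode *mount_point;  /* directory the file system is mounted on */
    struct super_block *sb;     /* superblock of the mounted file system   */
};

struct inode {
    int is_mount_point;         /* flag set in the in-memory inode copy */
    struct mount_entry *mnt;    /* mount-table entry, if mounted on     */
};

/* Hypothetical helper: the root directory's inode of a mounted volume. */
struct inode *root_inode_of(struct super_block *sb);

/* While resolving a path, a directory marked as a mount point is replaced
   by the root of the file system mounted there, so traversal continues
   seamlessly on the other volume. */
struct inode *cross_mount(struct inode *dir) {
    if (dir->is_mount_point)
        return root_inode_of(dir->mnt->sb);
    return dir;
}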
12.2.3 Virtual File Systems
The previous section makes it clear that modern operating systems must
concurrently support multiple types of file systems. But how does an operating
system allow multiple types of file systems to be integrated into a directory
structure? And how can users seamlessly move between file-system types
as they navigate the file-system space? We now discuss some of these
implementation details.
An obvious but suboptimal method of implementing multiple types of file
systems is to write directory and file routines for each type. Instead, however,
most operating systems, including UNIX, use object-oriented techniques to
simplify, organize, and modularize the implementation. The use of these
methods allows very dissimilar file-system types to be implemented within
the same structure, including network file systems, such as NFS. Users can
access files contained within multiple file systems on the local disk or even on
file systems available across the network.
Data structures and procedures are used to isolate the basic system-
call functionality from the implementation details. Thus, the file-system
implementation consists of three major layers, as depicted schematically in
Figure 12.4. The first layer is the file-system interface, based on the open(),
read(), write(), and close() calls and on file descriptors.
The second layer is called the virtual file system (VFS) layer. The VFS layer
serves two important functions:
1. It separates file-system-generic operations from their implementation
by defining a clean VFS interface. Several implementations for the VFS
interface may coexist on the same machine, allowing transparent access
to different types of file systems mounted locally.
2. It provides a mechanism for uniquely representing a file throughout a
network. The VFS is based on a file-representation structure, called a
vnode, that contains a numerical designator for a network-wide unique
file. (UNIX inodes are unique within only a single file system.) This
network-wide uniqueness is required for support of network file systems.
The kernel maintains one vnode structure for each active node (file or
directory).
Thus, the VFS distinguishes local files from remote ones, and local files are
further distinguished according to their file-system types.
The VFS activates file-system-specific operations to handle local requests
according to their file-system types and calls the NFS protocol procedures for
Figure 12.4 Schematic view of a virtual file system: the file-system interface sits above the VFS interface, which dispatches to local file systems of various types (each backed by a disk) and to a remote file system reached over the network.
remote requests. File handles are constructed from the relevant vnodes and
are passed as arguments to these procedures. The layer implementing the
file-system type or the remote-file-system protocol is the third layer of the
architecture.
Let’s briefly examine the VFS architecture in Linux. The four main object
types defined by the Linux VFS are:
• The inode object, which represents an individual file
• The file object, which represents an open file
• The superblock object, which represents an entire file system
• The dentry object, which represents an individual directory entry
For each of these four object types, the VFS defines a set of operations that
may be implemented. Every object of one of these types contains a pointer to
a function table. The function table lists the addresses of the actual functions
that implement the defined operations for that particular object. For example,
an abbreviated API for some of the operations for the file object includes:
• int open(...)—Open a file.
• int close(...)—Close an already-open file.
• ssize_t read(...)—Read from a file.
• ssize_t write(...)—Write to a file.
• int mmap(...)—Memory-map a file.
An implementation of the file object for a specific file type is required to imple-
ment each function specified in the definition of the file object. (The complete
definition of the file object is specified in struct file_operations, which
is located in the file /usr/include/linux/fs.h.)
Thus, the VFS software layer can perform an operation on one of these
objects by calling the appropriate function from the object’s function table,
without having to know in advance exactly what kind of object it is dealing
with. The VFS does not know, or care, whether an inode represents a disk file,
a directory file, or a remote file. The appropriate function for that file’s read()
operation will always be at the same place in its function table, and the VFS
software layer will call that function without caring how the data are actually
read.
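The function-table mechanism can be sketched in C as follows. This mirrors the idea behind Linux's struct file_operations but is deliberately simplified; it is not the kernel's actual definition.

#include <sys/types.h>   /* size_t, ssize_t */

struct file;   /* an open file; fields omitted below except the table */

/* Simplified function table: one implementation per file-system type. */
struct file_ops {
    int     (*open)(struct file *f);
    int     (*close)(struct file *f);
    ssize_t (*read)(struct file *f, void *buf, size_t count);
    ssize_t (*write)(struct file *f, const void *buf, size_t count);
};

struct file {
    const struct file_ops *f_op;   /* set when the file is opened */
};

/* The VFS dispatches through the table without knowing whether the file
   is a disk file, a directory, or a remote file. */
ssize_t vfs_read(struct file *f, void *buf, size_t count) {
    return f->f_op->read(f, buf, count);
}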
12.3 Directory Implementation
The selection of directory-allocation and directory-management algorithms
significantly affects the efficiency, performance, and reliability of the file
system. In this section, we discuss the trade-offs involved in choosing one
of these algorithms.
12.3.1 Linear List
The simplest method of implementing a directory is to use a linear list of file
names with pointers to the data blocks. This method is simple to program
but time-consuming to execute. To create a new file, we must first search the
directory to be sure that no existing file has the same name. Then, we add a
new entry at the end of the directory. To delete a file, we search the directory for
the named file and then release the space allocated to it. To reuse the directory
entry, we can do one of several things. We can mark the entry as unused (by
assigning it a special name, such as an all-blank name, or by including a used–
unused bit in each entry), or we can attach it to a list of free directory entries. A
third alternative is to copy the last entry in the directory into the freed location
and to decrease the length of the directory. A linked list can also be used to
decrease the time required to delete a file.
The real disadvantage of a linear list of directory entries is that finding a
file requires a linear search. Directory information is used frequently, and users
will notice if access to it is slow. In fact, many operating systems implement a
software cache to store the most recently used directory information. A cache
hit avoids the need to constantly reread the information from disk. A sorted
list allows a binary search and decreases the average search time. However, the
requirement that the list be kept sorted may complicate creating and deleting
files, since we may have to move substantial amounts of directory information
to maintain a sorted directory. A more sophisticated tree data structure, such
as a balanced tree, might help here. An advantage of the sorted list is that a
sorted directory listing can be produced without a separate sort step.
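A sketch of the linear-list organization and its lookup in C; the entry layout and the in_use marker are assumptions chosen for illustration.

#include <string.h>

#define NAME_LEN 255

/* Hypothetical directory entry: a name plus the file's FCB (inode) number. */
struct dir_entry {
    char name[NAME_LEN + 1];
    long fcb_number;
    int  in_use;   /* 0 marks a deleted or unused entry for later reuse */
};

/* Linear search: return the FCB number for `name`, or -1 if absent.
   Every lookup, create, and delete pays this O(n) scan. */
long dir_lookup(const struct dir_entry *dir, int nentries, const char *name) {
    for (int i = 0; i < nentries; i++)
        if (dir[i].in_use && strcmp(dir[i].name, name) == 0)
            return dir[i].fcb_number;
    return -1;
}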
12.3.2 Hash Table
Another data structure used for a file directory is a hash table. Here, a linear
list stores the directory entries, but a hash data structure is also used. The hash
table takes a value computed from the file name and returns a pointer to the file
name in the linear list, which can greatly decrease the directory search time.
More Related Content

PDF
CS311-Lec1.pdfCS311-Lec1.pdfCS311-Lec1.pdf
PDF
Operating Systems Concepts with Java 6th Edition Silberschatz
PDF
Operating System / System Operasi
PDF
Operating Systems Concepts with Java 6th Edition Silberschatz
PDF
Lecture - 1.pdf
PPT
Unit I OS CS.ppt
PPT
Os concepts
PDF
R20CSE2202-OPERATING-SYSTEMS .pdf
CS311-Lec1.pdfCS311-Lec1.pdfCS311-Lec1.pdf
Operating Systems Concepts with Java 6th Edition Silberschatz
Operating System / System Operasi
Operating Systems Concepts with Java 6th Edition Silberschatz
Lecture - 1.pdf
Unit I OS CS.ppt
Os concepts
R20CSE2202-OPERATING-SYSTEMS .pdf

Similar to Imports topics from Galvin Operating System .pdf (20)

PDF
Operating System.pdf
DOC
operating system lecture notes
PDF
operating systems classification university
PPTX
Chapter02-rev.pptx
PPT
Overview of Operating System.ppt introduction
PDF
Operating system concepts 6th Ed 6th ed Edition James Lyle Peterson
PPTX
Operating Systems R20 Unit 1.pptx
PPT
Introduction to Operating Systems - Mary Margarat
PDF
William_Stallings_Operating_SystemsInter.pdf
PDF
operating system structure
PPT
Chapter one_oS.ppt
PPT
Operating Systems with Storage and Process Management
PPT
Operating Systems _ Process & Storage Management
PPT
Operating Systems Storage & Process Management
PPT
OSLec1&2.ppt
PDF
Abraham Silberschatz-Operating System Concepts (9th,2012.12).pdf
PDF
MODERN_OPERATING_SYSTEMS_4th_EDITION.pdf
PDF
Andrew S Tanenbaum - Modern Operating Systems (4th edition).pdf
PPT
Operating System
Operating System.pdf
operating system lecture notes
operating systems classification university
Chapter02-rev.pptx
Overview of Operating System.ppt introduction
Operating system concepts 6th Ed 6th ed Edition James Lyle Peterson
Operating Systems R20 Unit 1.pptx
Introduction to Operating Systems - Mary Margarat
William_Stallings_Operating_SystemsInter.pdf
operating system structure
Chapter one_oS.ppt
Operating Systems with Storage and Process Management
Operating Systems _ Process & Storage Management
Operating Systems Storage & Process Management
OSLec1&2.ppt
Abraham Silberschatz-Operating System Concepts (9th,2012.12).pdf
MODERN_OPERATING_SYSTEMS_4th_EDITION.pdf
Andrew S Tanenbaum - Modern Operating Systems (4th edition).pdf
Operating System
Ad

Recently uploaded (20)

PPTX
additive manufacturing of ss316l using mig welding
PPTX
Strings in CPP - Strings in C++ are sequences of characters used to store and...
PPTX
Engineering Ethics, Safety and Environment [Autosaved] (1).pptx
PPTX
CYBER-CRIMES AND SECURITY A guide to understanding
PPTX
Internet of Things (IOT) - A guide to understanding
PPT
Mechanical Engineering MATERIALS Selection
PDF
PPT on Performance Review to get promotions
PPTX
Construction Project Organization Group 2.pptx
PPTX
Sustainable Sites - Green Building Construction
PPT
Project quality management in manufacturing
PPTX
bas. eng. economics group 4 presentation 1.pptx
PPTX
CARTOGRAPHY AND GEOINFORMATION VISUALIZATION chapter1 NPTE (2).pptx
PPTX
Lecture Notes Electrical Wiring System Components
PPTX
web development for engineering and engineering
PPTX
MCN 401 KTU-2019-PPE KITS-MODULE 2.pptx
PPTX
Welding lecture in detail for understanding
PDF
composite construction of structures.pdf
PPTX
Foundation to blockchain - A guide to Blockchain Tech
PDF
PRIZ Academy - 9 Windows Thinking Where to Invest Today to Win Tomorrow.pdf
PDF
SM_6th-Sem__Cse_Internet-of-Things.pdf IOT
additive manufacturing of ss316l using mig welding
Strings in CPP - Strings in C++ are sequences of characters used to store and...
Engineering Ethics, Safety and Environment [Autosaved] (1).pptx
CYBER-CRIMES AND SECURITY A guide to understanding
Internet of Things (IOT) - A guide to understanding
Mechanical Engineering MATERIALS Selection
PPT on Performance Review to get promotions
Construction Project Organization Group 2.pptx
Sustainable Sites - Green Building Construction
Project quality management in manufacturing
bas. eng. economics group 4 presentation 1.pptx
CARTOGRAPHY AND GEOINFORMATION VISUALIZATION chapter1 NPTE (2).pptx
Lecture Notes Electrical Wiring System Components
web development for engineering and engineering
MCN 401 KTU-2019-PPE KITS-MODULE 2.pptx
Welding lecture in detail for understanding
composite construction of structures.pdf
Foundation to blockchain - A guide to Blockchain Tech
PRIZ Academy - 9 Windows Thinking Where to Invest Today to Win Tomorrow.pdf
SM_6th-Sem__Cse_Internet-of-Things.pdf IOT
Ad

Imports topics from Galvin Operating System .pdf

  • 2. Contents PART ONE OVERVIEW Chapter 1 Introduction 1.1 What Operating Systems Do 4 1.2 Computer-System Organization 7 1.3 Computer-System Architecture 12 1.4 Operating-System Structure 19 1.5 Operating-System Operations 21 1.6 Process Management 24 1.7 Memory Management 25 1.8 Storage Management 26 1.9 Protection and Security 30 1.10 Kernel Data Structures 31 1.11 Computing Environments 35 1.12 Open-Source Operating Systems 43 1.13 Summary 47 Exercises 49 Bibliographical Notes 52 Chapter 2 Operating-System Structures 2.1 Operating-System Services 55 2.2 User and Operating-System Interface 58 2.3 System Calls 62 2.4 Types of System Calls 66 2.5 System Programs 74 2.6 Operating-System Design and Implementation 75 2.7 Operating-System Structure 78 2.8 Operating-System Debugging 86 2.9 Operating-System Generation 91 2.10 System Boot 92 2.11 Summary 93 Exercises 94 Bibliographical Notes 101 PART TWO PROCESS MANAGEMENT Chapter 3 Processes 3.1 Process Concept 105 3.2 Process Scheduling 110 3.3 Operations on Processes 115 3.4 Interprocess Communication 122 3.5 Examples of IPC Systems 130 3.6 Communication in Client– Server Systems 136 3.7 Summary 147 Exercises 149 Bibliographical Notes 161 xvii
  • 3. xviii Contents Chapter 4 Threads 4.1 Overview 163 4.2 Multicore Programming 166 4.3 Multithreading Models 169 4.4 Thread Libraries 171 4.5 Implicit Threading 177 4.6 Threading Issues 183 4.7 Operating-System Examples 188 4.8 Summary 191 Exercises 191 Bibliographical Notes 199 Chapter 5 Process Synchronization 5.1 Background 203 5.2 The Critical-Section Problem 206 5.3 Peterson’s Solution 207 5.4 Synchronization Hardware 209 5.5 Mutex Locks 212 5.6 Semaphores 213 5.7 Classic Problems of Synchronization 219 5.8 Monitors 223 5.9 Synchronization Examples 232 5.10 Alternative Approaches 238 5.11 Summary 242 Exercises 242 Bibliographical Notes 258 Chapter 6 CPU Scheduling 6.1 Basic Concepts 261 6.2 Scheduling Criteria 265 6.3 Scheduling Algorithms 266 6.4 Thread Scheduling 277 6.5 Multiple-Processor Scheduling 278 6.6 Real-Time CPU Scheduling 283 6.7 Operating-System Examples 290 6.8 Algorithm Evaluation 300 6.9 Summary 304 Exercises 305 Bibliographical Notes 311 Chapter 7 Deadlocks 7.1 System Model 315 7.2 Deadlock Characterization 317 7.3 Methods for Handling Deadlocks 322 7.4 Deadlock Prevention 323 7.5 Deadlock Avoidance 327 7.6 Deadlock Detection 333 7.7 Recovery from Deadlock 337 7.8 Summary 339 Exercises 339 Bibliographical Notes 346 PART THREE MEMORY MANAGEMENT Chapter 8 Main Memory 8.1 Background 351 8.2 Swapping 358 8.3 Contiguous Memory Allocation 360 8.4 Segmentation 364 8.5 Paging 366 8.6 Structure of the Page Table 378 8.7 Example: Intel 32 and 64-bit Architectures 383 8.8 Example: ARM Architecture 388 8.9 Summary 389 Exercises 390 Bibliographical Notes 394
  • 4. Contents xix Chapter 9 Virtual Memory 9.1 Background 397 9.2 Demand Paging 401 9.3 Copy-on-Write 408 9.4 Page Replacement 409 9.5 Allocation of Frames 421 9.6 Thrashing 425 9.7 Memory-Mapped Files 430 9.8 Allocating Kernel Memory 436 9.9 Other Considerations 439 9.10 Operating-System Examples 445 9.11 Summary 448 Exercises 449 Bibliographical Notes 461 PART FOUR STORAGE MANAGEMENT Chapter 10 Mass-Storage Structure 10.1 Overview of Mass-Storage Structure 467 10.2 Disk Structure 470 10.3 Disk Attachment 471 10.4 Disk Scheduling 472 10.5 Disk Management 478 10.6 Swap-Space Management 482 10.7 RAID Structure 484 10.8 Stable-Storage Implementation 494 10.9 Summary 496 Exercises 497 Bibliographical Notes 501 Chapter 11 File-System Interface 11.1 File Concept 503 11.2 Access Methods 513 11.3 Directory and Disk Structure 515 11.4 File-System Mounting 526 11.5 File Sharing 528 11.6 Protection 533 11.7 Summary 538 Exercises 539 Bibliographical Notes 541 Chapter 12 File-System Implementation 12.1 File-System Structure 543 12.2 File-System Implementation 546 12.3 Directory Implementation 552 12.4 Allocation Methods 553 12.5 Free-Space Management 561 12.6 Efficiency and Performance 564 12.7 Recovery 568 12.8 NFS 571 12.9 Example: The WAFL File System 577 12.10 Summary 580 Exercises 581 Bibliographical Notes 585 Chapter 13 I/O Systems 13.1 Overview 587 13.2 I/O Hardware 588 13.3 Application I/O Interface 597 13.4 Kernel I/O Subsystem 604 13.5 Transforming I/O Requests to Hardware Operations 611 13.6 STREAMS 613 13.7 Performance 615 13.8 Summary 618 Exercises 619 Bibliographical Notes 621
  • 5. xx Contents PART FIVE PROTECTION AND SECURITY Chapter 14 Protection 14.1 Goals of Protection 625 14.2 Principles of Protection 626 14.3 Domain of Protection 627 14.4 Access Matrix 632 14.5 Implementation of the Access Matrix 636 14.6 Access Control 639 14.7 Revocation of Access Rights 640 14.8 Capability-Based Systems 641 14.9 Language-Based Protection 644 14.10 Summary 649 Exercises 650 Bibliographical Notes 652 Chapter 15 Security 15.1 The Security Problem 657 15.2 Program Threats 661 15.3 System and Network Threats 669 15.4 Cryptography as a Security Tool 674 15.5 User Authentication 685 15.6 Implementing Security Defenses 689 15.7 Firewalling to Protect Systems and Networks 696 15.8 Computer-Security Classifications 698 15.9 An Example: Windows 7 699 15.10 Summary 701 Exercises 702 Bibliographical Notes 704 PART SIX ADVANCED TOPICS Chapter 16 Virtual Machines 16.1 Overview 711 16.2 History 713 16.3 Benefits and Features 714 16.4 Building Blocks 717 16.5 Types of Virtual Machines and Their Implementations 721 16.6 Virtualization and Operating-System Components 728 16.7 Examples 735 16.8 Summary 737 Exercises 738 Bibliographical Notes 739 Chapter 17 Distributed Systems 17.1 Advantages of Distributed Systems 741 17.2 Types of Network- based Operating Systems 743 17.3 Network Structure 747 17.4 Communication Structure 751 17.5 Communication Protocols 756 17.6 An Example: TCP/IP 760 17.7 Robustness 762 17.8 Design Issues 764 17.9 Distributed File Systems 765 17.10 Summary 773 Exercises 774 Bibliographical Notes 777
  • 6. Contents xxi PART SEVEN CASE STUDIES Chapter 18 The Linux System 18.1 Linux History 781 18.2 Design Principles 786 18.3 Kernel Modules 789 18.4 Process Management 792 18.5 Scheduling 795 18.6 Memory Management 800 18.7 File Systems 809 18.8 Input and Output 815 18.9 Interprocess Communication 818 18.10 Network Structure 819 18.11 Security 821 18.12 Summary 824 Exercises 824 Bibliographical Notes 826 Chapter 19 Windows 7 19.1 History 829 19.2 Design Principles 831 19.3 System Components 838 19.4 Terminal Services and Fast User Switching 862 19.5 File System 863 19.6 Networking 869 19.7 Programmer Interface 874 19.8 Summary 883 Exercises 883 Bibliographical Notes 885 Chapter 20 Influential Operating Systems 20.1 Feature Migration 887 20.2 Early Systems 888 20.3 Atlas 895 20.4 XDS-940 896 20.5 THE 897 20.6 RC 4000 897 20.7 CTSS 898 20.8 MULTICS 899 20.9 IBM OS/360 899 20.10 TOPS-20 901 20.11 CP/M and MS/DOS 901 20.12 Macintosh Operating System and Windows 902 20.13 Mach 902 20.14 Other Systems 904 Exercises 904 Bibliographical Notes 904 PART EIGHT APPENDICES Appendix A BSD UNIX A.1 UNIX History A1 A.2 Design Principles A6 A.3 Programmer Interface A8 A.4 User Interface A15 A.5 Process Management A18 A.6 Memory Management A22 A.7 File System A24 A.8 I/O System A32 A.9 Interprocess Communication A36 A.10 Summary A40 Exercises A41 Bibliographical Notes A42
  • 7. xxii Contents Appendix B The Mach System B.1 History of the Mach System B1 B.2 Design Principles B3 B.3 System Components B4 B.4 Process Management B7 B.5 Interprocess Communication B13 B.6 Memory Management B18 B.7 Programmer Interface B23 B.8 Summary B24 Exercises B25 Bibliographical Notes B26
  • 8. Part One Overview An operating system acts as an intermediary between the user of a computer and the computer hardware. The purpose of an operating system is to provide an environment in which a user can execute programs in a convenient and efficient manner. An operating system is software that manages the computer hard- ware. The hardware must provide appropriate mechanisms to ensure the correct operation of the computer system and to prevent user programs from interfering with the proper operation of the system. Internally, operating systems vary greatly in their makeup, since they are organized along many different lines. The design of a new operating system is a major task. It is important that the goals of the system be well defined before the design begins. These goals form the basis for choices among various algorithms and strategies. Because an operating system is large and complex, it must be created piece by piece. Each of these pieces should be a well-delineated portion of the system, with carefully defined inputs, outputs, and functions.
  • 10. 1 C H A P T E R Introduction An operating system is a program that manages a computer’s hardware. It also provides a basis for application programs and acts as an intermediary between the computer user and the computer hardware. An amazing aspect of operating systems is how they vary in accomplishing these tasks. Mainframe operating systems are designed primarily to optimize utilization of hardware. Personal computer (PC) operating systems support complex games, business applications, and everything in between. Operating systems for mobile com- puters provide an environment in which a user can easily interface with the computer to execute programs. Thus, some operating systems are designed to be convenient, others to be efficient, and others to be some combination of the two. Before we can explore the details of computer system operation, we need to know something about system structure. We thus discuss the basic functions of system startup, I/O, and storage early in this chapter. We also describe the basic computer architecture that makes it possible to write a functional operating system. Because an operating system is large and complex, it must be created piece by piece. Each of these pieces should be a well-delineated portion of the system, with carefully defined inputs, outputs, and functions. In this chapter, we provide a general overview of the major components of a contemporary computer system as well as the functions provided by the operating system. Additionally, we cover several other topics to help set the stage for the remainder of this text: data structures used in operating systems, computing environments, and open-source operating systems. CHAPTER OBJECTIVES • To describe the basic organization of computer systems. • To provide a grand tour of the major components of operating systems. • To give an overview of the many types of computing environments. • To explore several open-source operating systems. 3
  • 11. 4 Chapter 1 Introduction user 1 user 2 user 3 computer hardware operating system system and application programs compiler assembler text editor database system user n … … Figure 1.1 Abstract view of the components of a computer system. 1.1 What Operating Systems Do We begin our discussion by looking at the operating system’s role in the overall computer system. A computer system can be divided roughly into four components: the hardware, the operating system, the application programs, and the users (Figure 1.1). The hardware—the central processing unit (CPU), the memory, and the input/output (I/O) devices—provides the basic computing resources for the system. The application programs—such as word processors, spreadsheets, compilers, and Web browsers—define the ways in which these resources are used to solve users’ computing problems. The operating system controls the hardware and coordinates its use among the various application programs for the various users. We can also view a computer system as consisting of hardware, software, and data. The operating system provides the means for proper use of these resources in the operation of the computer system. An operating system is similar to a government. Like a government, it performs no useful function by itself. It simply provides an environment within which other programs can do useful work. To understand more fully the operating system’s role, we next explore operating systems from two viewpoints: that of the user and that of the system. 1.1.1 User View The user’s view of the computer varies according to the interface being used. Most computer users sit in front of a PC, consisting of a monitor, keyboard, mouse, and system unit. Such a system is designed for one user
  • 12. 1.1 What Operating Systems Do 5 to monopolize its resources. The goal is to maximize the work (or play) that the user is performing. In this case, the operating system is designed mostly for ease of use, with some attention paid to performance and none paid to resource utilization—how various hardware and software resources are shared. Performance is, of course, important to the user; but such systems are optimized for the single-user experience rather than the requirements of multiple users. In other cases, a user sits at a terminal connected to a mainframe or a minicomputer. Other users are accessing the same computer through other terminals. These users share resources and may exchange information. The operating system in such cases is designed to maximize resource utilization— to assure that all available CPU time, memory, and I/O are used efficiently and that no individual user takes more than her fair share. In still other cases, users sit at workstations connected to networks of other workstations and servers. These users have dedicated resources at their disposal, but they also share resources such as networking and servers, including file, compute, and print servers. Therefore, their operating system is designed to compromise between individual usability and resource utilization. Recently, many varieties of mobile computers, such as smartphones and tablets, have come into fashion. Most mobile computers are standalone units for individual users. Quite often, they are connected to networks through cellular or other wireless technologies. Increasingly, these mobile devices are replacing desktop and laptop computers for people who are primarily interested in using computers for e-mail and web browsing. The user interface for mobile computers generally features a touch screen, where the user interacts with the system by pressing and swiping fingers across the screen rather than using a physical keyboard and mouse. Some computers have little or no user view. For example, embedded computers in home devices and automobiles may have numeric keypads and may turn indicator lights on or off to show status, but they and their operating systems are designed primarily to run without user intervention. 1.1.2 System View From the computer’s point of view, the operating system is the program most intimately involved with the hardware. In this context, we can view an operating system as a resource allocator. A computer system has many resources that may be required to solve a problem: CPU time, memory space, file-storage space, I/O devices, and so on. The operating system acts as the manager of these resources. Facing numerous and possibly conflicting requests for resources, the operating system must decide how to allocate them to specific programs and users so that it can operate the computer system efficiently and fairly. As we have seen, resource allocation is especially important where many users access the same mainframe or minicomputer. A slightly different view of an operating system emphasizes the need to control the various I/O devices and user programs. An operating system is a control program. A control program manages the execution of user programs to prevent errors and improper use of the computer. It is especially concerned with the operation and control of I/O devices.
  • 13. 6 Chapter 1 Introduction 1.1.3 Defining Operating Systems By now, you can probably see that the term operating system covers many roles and functions. That is the case, at least in part, because of the myriad designs and uses of computers. Computers are present within toasters, cars, ships, spacecraft, homes, and businesses. They are the basis for game machines, music players, cable TV tuners, and industrial control systems. Although computers have a relatively short history, they have evolved rapidly. Computing started as an experiment to determine what could be done and quickly moved to fixed-purpose systems for military uses, such as code breaking and trajectory plotting, and governmental uses, such as census calculation. Those early computers evolved into general-purpose, multifunction mainframes, and that’s when operating systems were born. In the 1960s, Moore’s Law predicted that the number of transistors on an integrated circuit would double every eighteen months, and that prediction has held true. Computers gained in functionality and shrunk in size, leading to a vast number of uses and a vast number and variety of operating systems. (See Chapter 20 for more details on the history of operating systems.) How, then, can we define what an operating system is? In general, we have no completely adequate definition of an operating system. Operating systems exist because they offer a reasonable way to solve the problem of creating a usable computing system. The fundamental goal of computer systems is to execute user programs and to make solving user problems easier. Computer hardware is constructed toward this goal. Since bare hardware alone is not particularly easy to use, application programs are developed. These programs require certain common operations, such as those controlling the I/O devices. The common functions of controlling and allocating resources are then brought together into one piece of software: the operating system. In addition, we have no universally accepted definition of what is part of the operating system. A simple viewpoint is that it includes everything a vendor ships when you order “the operating system.” The features included, however, vary greatly across systems. Some systems take up less than a megabyte of space and lack even a full-screen editor, whereas others require gigabytes of space and are based entirely on graphical windowing systems. A more common definition, and the one that we usually follow, is that the operating system is the one program running at all times on the computer—usually called the kernel. (Along with the kernel, there are two other types of programs: system programs, which are associated with the operating system but are not necessarily part of the kernel, and application programs, which include all programs not associated with the operation of the system.) The matter of what constitutes an operating system became increasingly important as personal computers became more widespread and operating systems grew increasingly sophisticated. In 1998, the United States Department of Justice filed suit against Microsoft, in essence claiming that Microsoft included too much functionality in its operating systems and thus prevented application vendors from competing. (For example, a Web browser was an integral part of the operating systems.) As a result, Microsoft was found guilty of using its operating-system monopoly to limit competition. 
Today, however, if we look at operating systems for mobile devices, we see that once again the number of features constituting the operating system
is increasing. Mobile operating systems often include not only a core kernel but also middleware—a set of software frameworks that provide additional services to application developers. For example, each of the two most prominent mobile operating systems—Apple’s iOS and Google’s Android—features a core kernel along with middleware that supports databases, multimedia, and graphics (to name only a few).

1.2 Computer-System Organization

Before we can explore the details of how computer systems operate, we need general knowledge of the structure of a computer system. In this section, we look at several parts of this structure. The section is mostly concerned with computer-system organization, so you can skim or skip it if you already understand the concepts.

1.2.1 Computer-System Operation

A modern general-purpose computer system consists of one or more CPUs and a number of device controllers connected through a common bus that provides access to shared memory (Figure 1.2). Each device controller is in charge of a specific type of device (for example, disk drives, audio devices, or video displays). The CPU and the device controllers can execute in parallel, competing for memory cycles. To ensure orderly access to the shared memory, a memory controller synchronizes access to the memory.

Figure 1.2 A modern computer system (a CPU, memory, and device controllers, including a disk controller with its disks, a USB controller with keyboard, printer, and mouse, and a graphics adapter with a monitor, all attached to a common bus).

For a computer to start running—for instance, when it is powered up or rebooted—it needs to have an initial program to run. This initial program, or bootstrap program, tends to be simple. Typically, it is stored within the computer hardware in read-only memory (ROM) or electrically erasable programmable read-only memory (EEPROM), known by the general term firmware. It initializes all aspects of the system, from CPU registers to device controllers to memory contents. The bootstrap program must know how to load the operating system and how to start executing that system.
To accomplish this goal, the bootstrap program must locate the operating-system kernel and load it into memory. Once the kernel is loaded and executing, it can start providing services to the system and its users. Some services are provided outside of the kernel, by system programs that are loaded into memory at boot time to become system processes, or system daemons that run the entire time the kernel is running. On UNIX, the first system process is “init,” and it starts many other daemons. Once this phase is complete, the system is fully booted, and the system waits for some event to occur.

The occurrence of an event is usually signaled by an interrupt from either the hardware or the software. Hardware may trigger an interrupt at any time by sending a signal to the CPU, usually by way of the system bus. Software may trigger an interrupt by executing a special operation called a system call (also called a monitor call). When the CPU is interrupted, it stops what it is doing and immediately transfers execution to a fixed location. The fixed location usually contains the starting address where the service routine for the interrupt is located. The interrupt service routine executes; on completion, the CPU resumes the interrupted computation. A timeline of this operation is shown in Figure 1.3.

Figure 1.3 Interrupt timeline for a single process doing output (the CPU alternates between executing the user process and processing I/O interrupts, while the I/O device alternates between idle and transferring as each I/O request is issued and each transfer completes).

Interrupts are an important part of a computer architecture. Each computer design has its own interrupt mechanism, but several functions are common. The interrupt must transfer control to the appropriate interrupt service routine. The straightforward method for handling this transfer would be to invoke a generic routine to examine the interrupt information. The routine, in turn, would call the interrupt-specific handler. However, interrupts must be handled quickly. Since only a predefined number of interrupts is possible, a table of pointers to interrupt routines can be used instead to provide the necessary speed. The interrupt routine is called indirectly through the table, with no intermediate routine needed. Generally, the table of pointers is stored in low memory (the first hundred or so locations). These locations hold the addresses of the interrupt service routines for the various devices. This array, or interrupt vector, of addresses is then indexed by a unique device number, given with the interrupt request, to provide the address of the interrupt service routine for the interrupting device. Operating systems as different as Windows and UNIX dispatch interrupts in this manner.
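Dispatching through an interrupt vector amounts to indexing an array of routine addresses with the device number delivered along with the interrupt request. The C sketch below illustrates only the idea; the handler names, device numbers, and table size are hypothetical, and a real vector is installed by the kernel at a hardware-defined location.

    #include <stddef.h>

    #define NUM_VECTORS 256                 /* hypothetical number of interrupt lines */

    typedef void (*isr_t)(void);            /* an interrupt service routine */

    static void keyboard_isr(void) { /* read the scan code and queue it        */ }
    static void disk_isr(void)     { /* check transfer status, wake the waiter */ }
    static void spurious_isr(void) { /* ignore unexpected interrupts           */ }

    /* The interrupt vector: one entry per device number, kept in low memory
       on many architectures. Unused entries point at a default handler.      */
    static isr_t interrupt_vector[NUM_VECTORS];

    void init_vector(void)
    {
        for (size_t i = 0; i < NUM_VECTORS; i++)
            interrupt_vector[i] = spurious_isr;
        interrupt_vector[1]  = keyboard_isr;    /* hypothetical device numbers */
        interrupt_vector[14] = disk_isr;
    }

    /* Called by low-level trap code with the device number supplied by the
       interrupt request; no intermediate examine-and-forward routine is needed. */
    void dispatch_interrupt(unsigned dev)
    {
        if (dev < NUM_VECTORS)
            interrupt_vector[dev]();            /* indirect call through the table */
    }

The indexed table costs a single indirect call, which is why this layout is preferred over a generic routine that examines the interrupt information and then forwards it.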
STORAGE DEFINITIONS AND NOTATION

The basic unit of computer storage is the bit. A bit can contain one of two values, 0 and 1. All other storage in a computer is based on collections of bits. Given enough bits, it is amazing how many things a computer can represent: numbers, letters, images, movies, sounds, documents, and programs, to name a few. A byte is 8 bits, and on most computers it is the smallest convenient chunk of storage. For example, most computers don’t have an instruction to move a bit but do have one to move a byte. A less common term is word, which is a given computer architecture’s native unit of data. A word is made up of one or more bytes. For example, a computer that has 64-bit registers and 64-bit memory addressing typically has 64-bit (8-byte) words. A computer executes many operations in its native word size rather than a byte at a time.

Computer storage, along with most computer throughput, is generally measured and manipulated in bytes and collections of bytes. A kilobyte, or KB, is 1,024 bytes; a megabyte, or MB, is 1,024² bytes; a gigabyte, or GB, is 1,024³ bytes; a terabyte, or TB, is 1,024⁴ bytes; and a petabyte, or PB, is 1,024⁵ bytes. Computer manufacturers often round off these numbers and say that a megabyte is 1 million bytes and a gigabyte is 1 billion bytes. Networking measurements are an exception to this general rule; they are given in bits (because networks move data a bit at a time).

The interrupt architecture must also save the address of the interrupted instruction. Many old designs simply stored the interrupt address in a fixed location or in a location indexed by the device number. More recent architectures store the return address on the system stack. If the interrupt routine needs to modify the processor state—for instance, by modifying register values—it must explicitly save the current state and then restore that state before returning. After the interrupt is serviced, the saved return address is loaded into the program counter, and the interrupted computation resumes as though the interrupt had not occurred.

1.2.2 Storage Structure

The CPU can load instructions only from memory, so any programs to run must be stored there. General-purpose computers run most of their programs from rewritable memory, called main memory (also called random-access memory, or RAM). Main memory commonly is implemented in a semiconductor technology called dynamic random-access memory (DRAM). Computers use other forms of memory as well. We have already mentioned read-only memory (ROM) and electrically erasable programmable read-only memory (EEPROM). Because ROM cannot be changed, only static programs, such as the bootstrap program described earlier, are stored there. The immutability of ROM is of use in game cartridges. EEPROM can be changed but cannot be changed frequently and so contains mostly static programs. For example, smartphones have EEPROM to store their factory-installed programs.
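To make the storage-unit sidebar above concrete, the short C program below prints the binary definitions of the units and shows how far the manufacturers' rounded decimal figures drift from them; the "1 TB" drive size is only an illustrative value.

    #include <stdio.h>

    int main(void)
    {
        const unsigned long long KB = 1024ULL;      /* 1,024 bytes   */
        const unsigned long long MB = KB * 1024;    /* 1,024^2 bytes */
        const unsigned long long GB = MB * 1024;    /* 1,024^3 bytes */

        /* A drive advertised as "1 TB" usually means 10^12 bytes. */
        const unsigned long long advertised = 1000000000000ULL;

        printf("1 MB = %llu bytes, 1 GB = %llu bytes\n", MB, GB);
        printf("A \"1 TB\" (decimal) drive holds about %.0f GB in binary units\n",
               (double)advertised / GB);            /* roughly 931 GB */
        return 0;
    }

The gap between the decimal and binary figures (about 7 percent at the terabyte scale) is exactly the rounding the sidebar warns about.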
  • 17. 10 Chapter 1 Introduction All forms of memory provide an array of bytes. Each byte has its own address. Interaction is achieved through a sequence of load or store instructions to specific memory addresses. The load instruction moves a byte or word from main memory to an internal register within the CPU, whereas the store instruction moves the content of a register to main memory. Aside from explicit loads and stores, the CPU automatically loads instructions from main memory for execution. A typical instruction–execution cycle, as executed on a system with a von Neumann architecture, first fetches an instruction from memory and stores that instruction in the instruction register. The instruction is then decoded and may cause operands to be fetched from memory and stored in some internal register. After the instruction on the operands has been executed, the result may be stored back in memory. Notice that the memory unit sees only a stream of memory addresses. It does not know how they are generated (by the instruction counter, indexing, indirection, literal addresses, or some other means) or what they are for (instructions or data). Accordingly, we can ignore how a memory address is generated by a program. We are interested only in the sequence of memory addresses generated by the running program. Ideally, we want the programs and data to reside in main memory permanently. This arrangement usually is not possible for the following two reasons: 1. Main memory is usually too small to store all needed programs and data permanently. 2. Main memory is a volatile storage device that loses its contents when power is turned off or otherwise lost. Thus, most computer systems provide secondary storage as an extension of main memory. The main requirement for secondary storage is that it be able to hold large quantities of data permanently. The most common secondary-storage device is a magnetic disk, which provides storage for both programs and data. Most programs (system and application) are stored on a disk until they are loaded into memory. Many programs then use the disk as both the source and the destination of their processing. Hence, the proper management of disk storage is of central importance to a computer system, as we discuss in Chapter 10. In a larger sense, however, the storage structure that we have described— consisting of registers, main memory, and magnetic disks—is only one of many possible storage systems. Others include cache memory, CD-ROM, magnetic tapes, and so on. Each storage system provides the basic functions of storing a datum and holding that datum until it is retrieved at a later time. The main differences among the various storage systems lie in speed, cost, size, and volatility. The wide variety of storage systems can be organized in a hierarchy (Figure 1.4) according to speed and cost. The higher levels are expensive, but they are fast. As we move down the hierarchy, the cost per bit generally decreases, whereas the access time generally increases. This trade-off is reasonable; if a given storage system were both faster and less expensive than another—other properties being the same—then there would be no reason to use the slower, more expensive memory. In fact, many early storage devices, including paper
tape and core memories, are relegated to museums now that magnetic tape and semiconductor memory have become faster and cheaper.

Figure 1.4 Storage-device hierarchy (from fastest to slowest: registers, cache, main memory, solid-state disk, magnetic disk, optical disk, magnetic tapes).

The top four levels of memory in Figure 1.4 may be constructed using semiconductor memory. In addition to differing in speed and cost, the various storage systems are either volatile or nonvolatile. As mentioned earlier, volatile storage loses its contents when the power to the device is removed. In the absence of expensive battery and generator backup systems, data must be written to nonvolatile storage for safekeeping. In the hierarchy shown in Figure 1.4, the storage systems above the solid-state disk are volatile, whereas those including the solid-state disk and below are nonvolatile.

Solid-state disks have several variants but in general are faster than magnetic disks and are nonvolatile. One type of solid-state disk stores data in a large DRAM array during normal operation but also contains a hidden magnetic hard disk and a battery for backup power. If external power is interrupted, this solid-state disk’s controller copies the data from RAM to the magnetic disk. When external power is restored, the controller copies the data back into RAM. Another form of solid-state disk is flash memory, which is popular in cameras and personal digital assistants (PDAs), in robots, and increasingly for storage on general-purpose computers. Flash memory is slower than DRAM but needs no power to retain its contents. Another form of nonvolatile storage is NVRAM, which is DRAM with battery backup power. This memory can be as fast as DRAM and (as long as the battery lasts) is nonvolatile.

The design of a complete memory system must balance all the factors just discussed: it must use only as much expensive memory as necessary while providing as much inexpensive, nonvolatile memory as possible. Caches can
  • 19. 12 Chapter 1 Introduction be installed to improve performance where a large disparity in access time or transfer rate exists between two components. 1.2.3 I/O Structure Storage is only one of many types of I/O devices within a computer. A large portion of operating system code is dedicated to managing I/O, both because of its importance to the reliability and performance of a system and because of the varying nature of the devices. Next, we provide an overview of I/O. A general-purpose computer system consists of CPUs and multiple device controllers that are connected through a common bus. Each device controller is in charge of a specific type of device. Depending on the controller, more than one device may be attached. For instance, seven or more devices can be attached to the small computer-systems interface (SCSI) controller. A device controller maintains some local buffer storage and a set of special-purpose registers. The device controller is responsible for moving the data between the peripheral devices that it controls and its local buffer storage. Typically, operating systems have a device driver for each device controller. This device driver understands the device controller and provides the rest of the operating system with a uniform interface to the device. To start an I/O operation, the device driver loads the appropriate registers within the device controller. The device controller, in turn, examines the contents of these registers to determine what action to take (such as “read a character from the keyboard”). The controller starts the transfer of data from the device to its local buffer. Once the transfer of data is complete, the device controller informs the device driver via an interrupt that it has finished its operation. The device driver then returns control to the operating system, possibly returning the data or a pointer to the data if the operation was a read. For other operations, the device driver returns status information. This form of interrupt-driven I/O is fine for moving small amounts of data but can produce high overhead when used for bulk data movement such as disk I/O. To solve this problem, direct memory access (DMA) is used. After setting up buffers, pointers, and counters for the I/O device, the device controller transfers an entire block of data directly to or from its own buffer storage to memory, with no intervention by the CPU. Only one interrupt is generated per block, to tell the device driver that the operation has completed, rather than the one interrupt per byte generated for low-speed devices. While the device controller is performing these operations, the CPU is available to accomplish other work. Some high-end systems use switch rather than bus architecture. On these systems, multiple components can talk to other components concurrently, rather than competing for cycles on a shared bus. In this case, DMA is even more effective. Figure 1.5 shows the interplay of all components of a computer system. 1.3 Computer-System Architecture In Section 1.2, we introduced the general structure of a typical computer system. A computer system can be organized in a number of different ways, which we
can categorize roughly according to the number of general-purpose processors used.

Figure 1.5 How a modern computer system works (N CPUs and M devices: each CPU runs its instruction-execution cycle against the cache and memory, devices raise interrupts and I/O requests, and data moves between devices and memory by DMA).

1.3.1 Single-Processor Systems

Until recently, most computer systems used a single processor. On a single-processor system, there is one main CPU capable of executing a general-purpose instruction set, including instructions from user processes. Almost all single-processor systems have other special-purpose processors as well. They may come in the form of device-specific processors, such as disk, keyboard, and graphics controllers; or, on mainframes, they may come in the form of more general-purpose processors, such as I/O processors that move data rapidly among the components of the system.

All of these special-purpose processors run a limited instruction set and do not run user processes. Sometimes, they are managed by the operating system, in that the operating system sends them information about their next task and monitors their status. For example, a disk-controller microprocessor receives a sequence of requests from the main CPU and implements its own disk queue and scheduling algorithm. This arrangement relieves the main CPU of the overhead of disk scheduling. PCs contain a microprocessor in the keyboard to convert the keystrokes into codes to be sent to the CPU. In other systems or circumstances, special-purpose processors are low-level components built into the hardware. The operating system cannot communicate with these processors; they do their jobs autonomously. The use of special-purpose microprocessors is common and does not turn a single-processor system into
a multiprocessor. If there is only one general-purpose CPU, then the system is a single-processor system.

1.3.2 Multiprocessor Systems

Within the past several years, multiprocessor systems (also known as parallel systems or multicore systems) have begun to dominate the landscape of computing. Such systems have two or more processors in close communication, sharing the computer bus and sometimes the clock, memory, and peripheral devices. Multiprocessor systems first appeared prominently in servers and have since migrated to desktop and laptop systems. Recently, multiple processors have appeared on mobile devices such as smartphones and tablet computers. Multiprocessor systems have three main advantages:

1. Increased throughput. By increasing the number of processors, we expect to get more work done in less time. The speed-up ratio with N processors is not N, however; rather, it is less than N. When multiple processors cooperate on a task, a certain amount of overhead is incurred in keeping all the parts working correctly. This overhead, plus contention for shared resources, lowers the expected gain from additional processors. Similarly, N programmers working closely together do not produce N times the amount of work a single programmer would produce.

2. Economy of scale. Multiprocessor systems can cost less than equivalent multiple single-processor systems, because they can share peripherals, mass storage, and power supplies. If several programs operate on the same set of data, it is cheaper to store those data on one disk and to have all the processors share them than to have many computers with local disks and many copies of the data.

3. Increased reliability. If functions can be distributed properly among several processors, then the failure of one processor will not halt the system, only slow it down. If we have ten processors and one fails, then each of the remaining nine processors can pick up a share of the work of the failed processor. Thus, the entire system runs only 10 percent slower, rather than failing altogether.

Increased reliability of a computer system is crucial in many applications. The ability to continue providing service proportional to the level of surviving hardware is called graceful degradation. Some systems go beyond graceful degradation and are called fault tolerant, because they can suffer a failure of any single component and still continue operation. Fault tolerance requires a mechanism to allow the failure to be detected, diagnosed, and, if possible, corrected. The HP NonStop (formerly Tandem) system uses both hardware and software duplication to ensure continued operation despite faults. The system consists of multiple pairs of CPUs, working in lockstep. Both processors in the pair execute each instruction and compare the results. If the results differ, then one CPU of the pair is at fault, and both are halted. The process that was being executed is then moved to another pair of CPUs, and the instruction that failed
is restarted. This solution is expensive, since it involves special hardware and considerable hardware duplication.

The multiple-processor systems in use today are of two types. Some systems use asymmetric multiprocessing, in which each processor is assigned a specific task. A boss processor controls the system; the other processors either look to the boss for instruction or have predefined tasks. This scheme defines a boss–worker relationship. The boss processor schedules and allocates work to the worker processors.

The most common systems use symmetric multiprocessing (SMP), in which each processor performs all tasks within the operating system. SMP means that all processors are peers; no boss–worker relationship exists between processors. Figure 1.6 illustrates a typical SMP architecture. Notice that each processor has its own set of registers, as well as a private—or local—cache. However, all processors share physical memory.

Figure 1.6 Symmetric multiprocessing architecture (CPU0, CPU1, and CPU2, each with its own registers and cache, sharing physical memory).

An example of an SMP system is AIX, a commercial version of UNIX designed by IBM. An AIX system can be configured to employ dozens of processors. The benefit of this model is that many processes can run simultaneously—N processes can run if there are N CPUs—without causing performance to deteriorate significantly. However, we must carefully control I/O to ensure that the data reach the appropriate processor. Also, since the CPUs are separate, one may be sitting idle while another is overloaded, resulting in inefficiencies. These inefficiencies can be avoided if the processors share certain data structures. A multiprocessor system of this form will allow processes and resources—such as memory—to be shared dynamically among the various processors and can lower the variance among the processors. Such a system must be written carefully, as we shall see in Chapter 5. Virtually all modern operating systems—including Windows, Mac OS X, and Linux—now provide support for SMP.

The difference between symmetric and asymmetric multiprocessing may result from either hardware or software. Special hardware can differentiate the multiple processors, or the software can be written to allow only one boss and multiple workers. For instance, Sun Microsystems’ operating system SunOS Version 4 provided asymmetric multiprocessing, whereas Version 5 (Solaris) is symmetric on the same hardware.

Multiprocessing adds CPUs to increase computing power. If the CPU has an integrated memory controller, then adding CPUs can also increase the amount
of memory addressable in the system. Either way, multiprocessing can cause a system to change its memory access model from uniform memory access (UMA) to non-uniform memory access (NUMA). UMA is defined as the situation in which access to any RAM from any CPU takes the same amount of time. With NUMA, some parts of memory may take longer to access than other parts, creating a performance penalty. Operating systems can minimize the NUMA penalty through resource management, as discussed in Section 9.5.4.

A recent trend in CPU design is to include multiple computing cores on a single chip. Such multiprocessor systems are termed multicore. They can be more efficient than multiple chips with single cores because on-chip communication is faster than between-chip communication. In addition, one chip with multiple cores uses significantly less power than multiple single-core chips. It is important to note that while multicore systems are multiprocessor systems, not all multiprocessor systems are multicore, as we shall see in Section 1.3.3. In our coverage of multiprocessor systems throughout this text, unless we state otherwise, we generally use the more contemporary term multicore, which excludes some multiprocessor systems.

In Figure 1.7, we show a dual-core design with two cores on the same chip. In this design, each core has its own register set as well as its own local cache. Other designs might use a shared cache or a combination of local and shared caches. Aside from architectural considerations, such as cache, memory, and bus contention, these multicore CPUs appear to the operating system as N standard processors. This characteristic puts pressure on operating system designers—and application programmers—to make use of those processing cores.

Figure 1.7 A dual-core design with two cores placed on the same chip (each core has its own registers and local cache; both cores share memory).

Finally, blade servers are a relatively recent development in which multiple processor boards, I/O boards, and networking boards are placed in the same chassis. The difference between these and traditional multiprocessor systems is that each blade-processor board boots independently and runs its own operating system. Some blade-server boards are multiprocessor as well, which blurs the lines between types of computers. In essence, these servers consist of multiple independent multiprocessor systems.
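Because the cores of a multicore chip appear to the operating system as N standard processors, a program that wants to exploit them must first discover N. The minimal sketch below does this on a POSIX-style system; sysconf() is standard, but the _SC_NPROCESSORS_ONLN name is a widely supported extension (Linux, Mac OS X) rather than a guarantee of the standard.

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* Number of processors currently online, as seen by the operating
           system. Each core of a multicore chip counts as one processor.  */
        long ncpus = sysconf(_SC_NPROCESSORS_ONLN);

        if (ncpus < 1) {
            perror("sysconf");
            return 1;
        }
        printf("operating system reports %ld processors\n", ncpus);
        return 0;
    }

A common use of this value is to size a pool of worker threads at roughly one thread per reported processor.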
  • 24. 1.3 Computer-System Architecture 17 1.3.3 Clustered Systems Another type of multiprocessor system is a clustered system, which gathers together multiple CPUs. Clustered systems differ from the multiprocessor systems described in Section 1.3.2 in that they are composed of two or more individual systems—or nodes—joined together. Such systems are considered loosely coupled. Each node may be a single processor system or a multicore system. We should note that the definition of clustered is not concrete; many commercial packages wrestle to define a clustered system and why one form is better than another. The generally accepted definition is that clustered computers share storage and are closely linked via a local-area network LAN (as described in Chapter 17) or a faster interconnect, such as InfiniBand. Clustering is usually used to provide high-availability service—that is, service will continue even if one or more systems in the cluster fail. Generally, we obtain high availability by adding a level of redundancy in the system. A layer of cluster software runs on the cluster nodes. Each node can monitor one or more of the others (over the LAN). If the monitored machine fails, the monitoring machine can take ownership of its storage and restart the applications that were running on the failed machine. The users and clients of the applications see only a brief interruption of service. Clustering can be structured asymmetrically or symmetrically. In asym- metric clustering, one machine is in hot-standby mode while the other is running the applications. The hot-standby host machine does nothing but monitor the active server. If that server fails, the hot-standby host becomes the active server. In symmetric clustering, two or more hosts are running applications and are monitoring each other. This structure is obviously more efficient, as it uses all of the available hardware. However it does require that more than one application be available to run. Since a cluster consists of several computer systems connected via a network, clusters can also be used to provide high-performance computing environments. Such systems can supply significantly greater computational power than single-processor or even SMP systems because they can run an application concurrently on all computers in the cluster. The application must have been written specifically to take advantage of the cluster, however. This involves a technique known as parallelization, which divides a program into separate components that run in parallel on individual computers in the cluster. Typically, these applications are designed so that once each computing node in the cluster has solved its portion of the problem, the results from all the nodes are combined into a final solution. Other forms of clusters include parallel clusters and clustering over a wide-area network (WAN) (as described in Chapter 17). Parallel clusters allow multiple hosts to access the same data on shared storage. Because most operating systems lack support for simultaneous data access by multiple hosts, parallel clusters usually require the use of special versions of software and special releases of applications. For example, Oracle Real Application Cluster is a version of Oracle’s database that has been designed to run on a parallel cluster. Each machine runs Oracle, and a layer of software tracks access to the shared disk. Each machine has full access to all data in the database. 
To provide this shared access, the system must also supply access control and locking to
ensure that no conflicting operations occur. This function, commonly known as a distributed lock manager (DLM), is included in some cluster technology.

Cluster technology is changing rapidly. Some cluster products support dozens of systems in a cluster, as well as clustered nodes that are separated by miles. Many of these improvements are made possible by storage-area networks (SANs), as described in Section 10.3.3, which allow many systems to attach to a pool of storage. If the applications and their data are stored on the SAN, then the cluster software can assign the application to run on any host that is attached to the SAN. If the host fails, then any other host can take over. In a database cluster, dozens of hosts can share the same database, greatly increasing performance and reliability. Figure 1.8 depicts the general structure of a clustered system.

Figure 1.8 General structure of a clustered system (several computers, each with its own interconnect, attached to a shared storage-area network).

BEOWULF CLUSTERS

Beowulf clusters are designed to solve high-performance computing tasks. A Beowulf cluster consists of commodity hardware—such as personal computers—connected via a simple local-area network. No single specific software package is required to construct a cluster. Rather, the nodes use a set of open-source software libraries to communicate with one another. Thus, there are a variety of approaches to constructing a Beowulf cluster. Typically, though, Beowulf computing nodes run the Linux operating system. Since Beowulf clusters require no special hardware and operate using open-source software that is available free, they offer a low-cost strategy for building a high-performance computing cluster. In fact, some Beowulf clusters built from discarded personal computers are using hundreds of nodes to solve computationally expensive scientific computing problems.
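The parallelization described earlier in this section, where each node computes part of a result and the partial results are then combined, is typically written against a message-passing library such as MPI, one of the open-source libraries Beowulf-style clusters rely on. The sketch below sums the integers from 1 to 1,000,000 across however many processes the cluster launches; the problem and its decomposition are purely illustrative.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size;
        long long local = 0, total = 0;
        const long long N = 1000000;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I?     */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many were launched? */

        /* Each process sums its own slice of 1..N. */
        for (long long i = rank + 1; i <= N; i += size)
            local += i;

        /* Combine the partial results on process 0. */
        MPI_Reduce(&local, &total, 1, MPI_LONG_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum = %lld\n", total);      /* expect N*(N+1)/2 */

        MPI_Finalize();
        return 0;
    }

Launched with a command such as mpirun -np 8 ./sum, the same binary runs on every node, and only the root process prints the combined answer.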
1.4 Operating-System Structure

Now that we have discussed basic computer-system organization and architecture, we are ready to talk about operating systems. An operating system provides the environment within which programs are executed. Internally, operating systems vary greatly in their makeup, since they are organized along many different lines. There are, however, many commonalities, which we consider in this section.

One of the most important aspects of operating systems is the ability to multiprogram. A single program cannot, in general, keep either the CPU or the I/O devices busy at all times. Single users frequently have multiple programs running. Multiprogramming increases CPU utilization by organizing jobs (code and data) so that the CPU always has one to execute. The idea is as follows: The operating system keeps several jobs in memory simultaneously (Figure 1.9). Since, in general, main memory is too small to accommodate all jobs, the jobs are kept initially on the disk in the job pool. This pool consists of all processes residing on disk awaiting allocation of main memory.

Figure 1.9 Memory layout for a multiprogramming system (memory from address 0 to Max holds the operating system plus jobs 1 through 4).

The set of jobs in memory can be a subset of the jobs kept in the job pool. The operating system picks and begins to execute one of the jobs in memory. Eventually, the job may have to wait for some task, such as an I/O operation, to complete. In a non-multiprogrammed system, the CPU would sit idle. In a multiprogrammed system, the operating system simply switches to, and executes, another job. When that job needs to wait, the CPU switches to another job, and so on. Eventually, the first job finishes waiting and gets the CPU back. As long as at least one job needs to execute, the CPU is never idle.

This idea is common in other life situations. A lawyer does not work for only one client at a time, for example. While one case is waiting to go to trial or have papers typed, the lawyer can work on another case. If he has enough clients, the lawyer will never be idle for lack of work. (Idle lawyers tend to become politicians, so there is a certain social value in keeping lawyers busy.)
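The switching just described can be pictured as a table of in-memory jobs plus a rule for picking another job whenever the running one blocks for I/O. The C sketch below is purely conceptual; real kernels keep much richer per-process state and apply the CPU-scheduling algorithms of Chapter 6, and the structure and field names here are hypothetical.

    enum state { READY, RUNNING, WAITING_IO, DONE };

    struct job {
        int        id;
        enum state state;
    };

    #define NJOBS 4
    static struct job jobs[NJOBS] = {
        {1, READY}, {2, READY}, {3, WAITING_IO}, {4, READY}
    };

    /* Pick the next READY job after 'current', wrapping around the table.
       Returns -1 if every job is waiting or finished (the CPU would idle). */
    int next_ready_job(int current)
    {
        for (int k = 1; k <= NJOBS; k++) {
            int idx = (current + k) % NJOBS;
            if (jobs[idx].state == READY)
                return idx;
        }
        return -1;
    }

Whenever the running job issues an I/O request, the operating system marks it WAITING_IO and dispatches whatever next_ready_job() returns, so the CPU stays busy as long as any job is ready.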
  • 27. 20 Chapter 1 Introduction Multiprogrammed systems provide an environment in which the various system resources (for example, CPU, memory, and peripheral devices) are utilized effectively, but they do not provide for user interaction with the computer system. Time sharing (or multitasking) is a logical extension of multiprogramming. In time-sharing systems, the CPU executes multiple jobs by switching among them, but the switches occur so frequently that the users can interact with each program while it is running. Time sharing requires an interactive computer system, which provides direct communication between the user and the system. The user gives instructions to the operating system or to a program directly, using a input device such as a keyboard, mouse, touch pad, or touch screen, and waits for immediate results on an output device. Accordingly, the response time should be short—typically less than one second. A time-shared operating system allows many users to share the computer simultaneously. Since each action or command in a time-shared system tends to be short, only a little CPU time is needed for each user. As the system switches rapidly from one user to the next, each user is given the impression that the entire computer system is dedicated to his use, even though it is being shared among many users. A time-shared operating system uses CPU scheduling and multiprogram- ming to provide each user with a small portion of a time-shared computer. Each user has at least one separate program in memory. A program loaded into memory and executing is called a process. When a process executes, it typically executes for only a short time before it either finishes or needs to perform I/O. I/O may be interactive; that is, output goes to a display for the user, and input comes from a user keyboard, mouse, or other device. Since interactive I/O typically runs at “people speeds,” it may take a long time to complete. Input, for example, may be bounded by the user’s typing speed; seven characters per second is fast for people but incredibly slow for computers. Rather than let the CPU sit idle as this interactive input takes place, the operating system will rapidly switch the CPU to the program of some other user. Time sharing and multiprogramming require that several jobs be kept simultaneously in memory. If several jobs are ready to be brought into memory, and if there is not enough room for all of them, then the system must choose among them. Making this decision involves job scheduling, which we discuss in Chapter 6. When the operating system selects a job from the job pool, it loads that job into memory for execution. Having several programs in memory at the same time requires some form of memory management, which we cover in Chapters 8 and 9. In addition, if several jobs are ready to run at the same time, the system must choose which job will run first. Making this decision is CPU scheduling, which is also discussed in Chapter 6. Finally, running multiple jobs concurrently requires that their ability to affect one another be limited in all phases of the operating system, including process scheduling, disk storage, and memory management. We discuss these considerations throughout the text. In a time-sharing system, the operating system must ensure reasonable response time. This goal is sometimes accomplished through swapping, whereby processes are swapped in and out of main memory to the disk. 
A more common method for ensuring reasonable response time is virtual memory, a technique that allows the execution of a process that is not completely in
  • 28. 1.5 Operating-System Operations 21 memory (Chapter 9). The main advantage of the virtual-memory scheme is that it enables users to run programs that are larger than actual physical memory. Further, it abstracts main memory into a large, uniform array of storage, separating logical memory as viewed by the user from physical memory. This arrangement frees programmers from concern over memory-storage limitations. A time-sharing system must also provide a file system (Chapters 11 and 12). The file system resides on a collection of disks; hence, disk management must be provided (Chapter 10). In addition, a time-sharing system provides a mechanism for protecting resources from inappropriate use (Chapter 14). To ensure orderly execution, the system must provide mechanisms for job synchronization and communication (Chapter 5), and it may ensure that jobs do not get stuck in a deadlock, forever waiting for one another (Chapter 7). 1.5 Operating-System Operations As mentioned earlier, modern operating systems are interrupt driven. If there are no processes to execute, no I/O devices to service, and no users to whom to respond, an operating system will sit quietly, waiting for something to happen. Events are almost always signaled by the occurrence of an interrupt or a trap. A trap (or an exception) is a software-generated interrupt caused either by an error (for example, division by zero or invalid memory access) or by a specific request from a user program that an operating-system service be performed. The interrupt-driven nature of an operating system defines that system’s general structure. For each type of interrupt, separate segments of code in the operating system determine what action should be taken. An interrupt service routine is provided to deal with the interrupt. Since the operating system and the users share the hardware and software resources of the computer system, we need to make sure that an error in a user program could cause problems only for the one program running. With sharing, many processes could be adversely affected by a bug in one program. For example, if a process gets stuck in an infinite loop, this loop could prevent the correct operation of many other processes. More subtle errors can occur in a multiprogramming system, where one erroneous program might modify another program, the data of another program, or even the operating system itself. Without protection against these sorts of errors, either the computer must execute only one process at a time or all output must be suspect. A properly designed operating system must ensure that an incorrect (or malicious) program cannot cause other programs to execute incorrectly. 1.5.1 Dual-Mode and Multimode Operation In order to ensure the proper execution of the operating system, we must be able to distinguish between the execution of operating-system code and user- defined code. The approach taken by most computer systems is to provide hardware support that allows us to differentiate among various modes of execution.
  • 29. 22 Chapter 1 Introduction user process executing user process kernel calls system call return from system call user mode (mode bit = 1) trap mode bit = 0 return mode bit = 1 kernel mode (mode bit = 0) execute system call Figure 1.10 Transition from user to kernel mode. At the very least, we need two separate modes of operation: user mode and kernel mode (also called supervisor mode, system mode, or privileged mode). A bit, called the mode bit, is added to the hardware of the computer to indicate the current mode: kernel (0) or user (1). With the mode bit, we can distinguish between a task that is executed on behalf of the operating system and one that is executed on behalf of the user. When the computer system is executing on behalf of a user application, the system is in user mode. However, when a user application requests a service from the operating system (via a system call), the system must transition from user to kernel mode to fulfill the request. This is shown in Figure 1.10. As we shall see, this architectural enhancement is useful for many other aspects of system operation as well. At system boot time, the hardware starts in kernel mode. The operating system is then loaded and starts user applications in user mode. Whenever a trap or interrupt occurs, the hardware switches from user mode to kernel mode (that is, changes the state of the mode bit to 0). Thus, whenever the operating system gains control of the computer, it is in kernel mode. The system always switches to user mode (by setting the mode bit to 1) before passing control to a user program. The dual mode of operation provides us with the means for protecting the operating system from errant users—and errant users from one another. We accomplishthisprotectionbydesignatingsome ofthe machine instructionsthat may cause harm as privileged instructions. The hardware allows privileged instructions to be executed only in kernel mode. If an attempt is made to execute a privileged instruction in user mode, the hardware does not execute the instruction but rather treats it as illegal and traps it to the operating system. The instruction to switch to kernel mode is an example of a privileged instruction. Some other examples include I/O control, timer management, and interrupt management. As we shall see throughout the text, there are many additional privileged instructions. The concept of modes can be extended beyond two modes (in which case the CPU uses more than one bit to set and test the mode). CPUs that support virtualization (Section 16.1) frequently have a separate mode to indicate when the virtual machine manager (VMM)—and the virtualization management software—is in control of the system. In this mode, the VMM has more privileges than user processes but fewer than the kernel. It needs that level of privilege so it can create and manage virtual machines, changing the CPU state to do so. Sometimes, too, different modes are used by various kernel
  • 30. 1.5 Operating-System Operations 23 components. We should note that, as an alternative to modes, the CPU designer may use other methods to differentiate operational privileges. The Intel 64 family of CPUs supports four privilege levels, for example, and supports virtualization but does not have a separate mode for virtualization. We can now see the life cycle of instruction execution in a computer system. Initial control resides in the operating system, where instructions are executed in kernel mode. When control is given to a user application, the mode is set to user mode. Eventually, control is switched back to the operating system via an interrupt, a trap, or a system call. System calls provide the means for a user program to ask the operating system to perform tasks reserved for the operating system on the user program’s behalf. A system call is invoked in a variety of ways, depending on the functionality provided by the underlying processor. In all forms, it is the method used by a process to request action by the operating system. A system call usually takes the form of a trap to a specific location in the interrupt vector. This trap can be executed by a generic trap instruction, although some systems (such as MIPS) have a specific syscall instruction to invoke a system call. When a system call is executed, it is typically treated by the hardware as a software interrupt. Control passes through the interrupt vector to a service routine in the operating system, and the mode bit is set to kernel mode. The system-call service routine is a part of the operating system. The kernel examines the interrupting instruction to determine what system call has occurred; a parameter indicates what type of service the user program is requesting. Additional information needed for the request may be passed in registers, on the stack, or in memory (with pointers to the memory locations passed in registers). The kernel verifies that the parameters are correct and legal, executes the request, and returns control to the instruction following the system call. We describe system calls more fully in Section 2.3. The lack of a hardware-supported dual mode can cause serious shortcom- ings in an operating system. For instance, MS-DOS was written for the Intel 8088 architecture, which has no mode bit and therefore no dual mode. A user program running awry can wipe out the operating system by writing over it with data; and multiple programs are able to write to a device at the same time, with potentially disastrous results. Modern versions of the Intel CPU do provide dual-mode operation. Accordingly, most contemporary operating systems—such as Microsoft Windows 7, as well as Unix and Linux—take advantage of this dual-mode feature and provide greater protection for the operating system. Once hardware protection is in place, it detects errors that violate modes. These errors are normally handled by the operating system. If a user program fails in some way—such as by making an attempt either to execute an illegal instruction or to access memory that is not in the user’s address space—then the hardware traps to the operating system. The trap transfers control through the interrupt vector to the operating system, just as an interrupt does. When a program error occurs, the operating system must terminate the program abnormally. This situation is handled by the same code as a user-requested abnormal termination. An appropriate error message is given, and the memory of the program may be dumped. 
The memory dump is usually written to a file so that the user or programmer can examine it and perhaps correct it and restart the program.
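From the point of view of a user program, the trap to kernel mode described above is hidden behind an ordinary library call. The Linux example below makes the same write request twice: once through the C library wrapper and once through the generic syscall() interface, which exposes the system-call number (SYS_write); portable programs normally use only the wrappers. This is shown only as a sketch of the mechanism, not as a recommended style.

    #include <unistd.h>
    #include <sys/syscall.h>
    #include <string.h>

    int main(void)
    {
        const char *msg = "hello via a system call\n";

        /* Library wrapper: traps to the kernel, which runs in kernel mode,
           performs the write to file descriptor 1, and returns to user mode. */
        write(STDOUT_FILENO, msg, strlen(msg));

        /* The same request made through the generic system-call interface. */
        syscall(SYS_write, STDOUT_FILENO, msg, strlen(msg));

        return 0;
    }

Either form ends in the same trap: the mode bit is set to kernel, the kernel validates the parameters and performs the write, and control returns to the instruction following the call in user mode.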
  • 31. 24 Chapter 1 Introduction 1.5.2 Timer We must ensure that the operating system maintains control over the CPU. We cannot allow a user program to get stuck in an infinite loop or to fail to call system services and never return control to the operating system. To accomplish this goal, we can use a timer. A timer can be set to interrupt the computer after a specified period. The period may be fixed (for example, 1/60 second) or variable (for example, from 1 millisecond to 1 second). A variable timer is generally implemented by a fixed-rate clock and a counter. The operating system sets the counter. Every time the clock ticks, the counter is decremented. When the counter reaches 0, an interrupt occurs. For instance, a 10-bit counter with a 1-millisecond clock allows interrupts at intervals from 1 millisecond to 1,024 milliseconds, in steps of 1 millisecond. Before turning over control to the user, the operating system ensures that the timer is set to interrupt. If the timer interrupts, control transfers automatically to the operating system, which may treat the interrupt as a fatal error or may give the program more time. Clearly, instructions that modify the content of the timer are privileged. We can use the timer to prevent a user program from running too long. A simple technique is to initialize a counter with the amount of time that a program is allowed to run. A program with a 7-minute time limit, for example, would have its counter initialized to 420. Every second, the timer interrupts, and the counter is decremented by 1. As long as the counter is positive, control is returned to the user program. When the counter becomes negative, the operating system terminates the program for exceeding the assigned time limit. 1.6 Process Management A program does nothing unless its instructions are executed by a CPU. A program in execution, as mentioned, is a process. A time-shared user program such as a compiler is a process. A word-processing program being run by an individual user on a PC is a process. A system task, such as sending output to a printer, can also be a process (or at least part of one). For now, you can consider a process to be a job or a time-shared program, but later you will learn that the concept is more general. As we shall see in Chapter 3, it is possible to provide system calls that allow processes to create subprocesses to execute concurrently. A process needs certain resources—including CPU time, memory, files, and I/O devices—to accomplish its task. These resources are either given to the process when it is created or allocated to it while it is running. In addition to the various physical and logical resources that a process obtains when it is created, various initialization data (input) may be passed along. For example, consider a process whose function is to display the status of a file on the screen of a terminal. The process will be given the name of the file as an input and will execute the appropriate instructions and system calls to obtain and display the desired information on the terminal. When the process terminates, the operating system will reclaim any reusable resources. We emphasize that a program by itself is not a process. A program is a passive entity, like the contents of a file stored on disk, whereas a process
  • 32. 1.7 Memory Management 25 is an active entity. A single-threaded process has one program counter specifying the next instruction to execute. (Threads are covered in Chapter 4.) The execution of such a process must be sequential. The CPU executes one instruction of the process after another, until the process completes. Further, at any time, one instruction at most is executed on behalf of the process. Thus, although two processes may be associated with the same program, they are nevertheless considered two separate execution sequences. A multithreaded process has multiple program counters, each pointing to the next instruction to execute for a given thread. A process is the unit of work in a system. A system consists of a collection of processes, some of which are operating-system processes (those that execute system code) and the rest of which are user processes (those that execute user code). All these processes can potentially execute concurrently—by multiplexing on a single CPU, for example. The operating system is responsible for the following activities in connec- tion with process management: • Scheduling processes and threads on the CPUs • Creating and deleting both user and system processes • Suspending and resuming processes • Providing mechanisms for process synchronization • Providing mechanisms for process communication We discuss process-management techniques in Chapters 3 through 5. 1.7 Memory Management As we discussed in Section 1.2.2, the main memory is central to the operation of a modern computer system. Main memory is a large array of bytes, ranging in size from hundreds of thousands to billions. Each byte has its own address. Main memory is a repository of quickly accessible data shared by the CPU and I/O devices. The central processor reads instructions from main memory during the instruction-fetch cycle and both reads and writes data from main memory during the data-fetch cycle (on a von Neumann architecture). As noted earlier, the main memory is generally the only large storage device that the CPU is able to address and access directly. For example, for the CPU to process data from disk, those data must first be transferred to main memory by CPU-generated I/O calls. In the same way, instructions must be in memory for the CPU to execute them. For a program to be executed, it must be mapped to absolute addresses and loaded into memory. As the program executes, it accesses program instructions and data from memory by generating these absolute addresses. Eventually, the program terminates, its memory space is declared available, and the next program can be loaded and executed. To improve both the utilization of the CPU and the speed of the computer’s response to its users, general-purpose computers must keep several programs in memory, creating a need for memory management. Many different memory-
  • 33. 26 Chapter 1 Introduction management schemes are used. These schemes reflect various approaches, and the effectiveness of any given algorithm depends on the situation. In selecting a memory-management scheme for a specific system, we must take into account many factors—especially the hardware design of the system. Each algorithm requires its own hardware support. The operating system is responsible for the following activities in connec- tion with memory management: • Keeping track of which parts of memory are currently being used and who is using them • Deciding which processes (or parts of processes) and data to move into and out of memory • Allocating and deallocating memory space as needed Memory-management techniques are discussed in Chapters 8 and 9. 1.8 Storage Management To make the computer system convenient for users, the operating system provides a uniform, logical view of information storage. The operating system abstracts from the physical properties of its storage devices to define a logical storage unit, the file. The operating system maps files onto physical media and accesses these files via the storage devices. 1.8.1 File-System Management File management is one of the most visible components of an operating system. Computers can store information on several different types of physical media. Magnetic disk, optical disk, and magnetic tape are the most common. Each of these media has its own characteristics and physical organization. Each medium is controlled by a device, such as a disk drive or tape drive, that also has its own unique characteristics. These properties include access speed, capacity, data-transfer rate, and access method (sequential or random). A file is a collection of related information defined by its creator. Commonly, files represent programs (both source and object forms) and data. Data files may be numeric, alphabetic, alphanumeric, or binary. Files may be free-form (for example, text files), or they may be formatted rigidly (for example, fixed fields). Clearly, the concept of a file is an extremely general one. The operating system implements the abstract concept of a file by managing mass-storage media, such as tapes and disks, and the devices that control them. In addition, files are normally organized into directories to make them easier to use. Finally, when multiple users have access to files, it may be desirable to control which user may access a file and how that user may access it (for example, read, write, append). The operating system is responsible for the following activities in connec- tion with file management: • Creating and deleting files
  • 34. 1.8 Storage Management 27 • Creating and deleting directories to organize files • Supporting primitives for manipulating files and directories • Mapping files onto secondary storage • Backing up files on stable (nonvolatile) storage media File-management techniques are discussed in Chapters 11 and 12. 1.8.2 Mass-Storage Management As we have already seen, because main memory is too small to accommodate all data and programs, and because the data that it holds are lost when power is lost, the computer system must provide secondary storage to back up main memory. Most modern computer systems use disks as the principal on-line storage medium for both programs and data. Most programs—including compilers, assemblers, word processors, editors, and formatters—are stored on a disk until loaded into memory. They then use the disk as both the source and destination of their processing. Hence, the proper management of disk storage is of central importance to a computer system. The operating system is responsible for the following activities in connection with disk management: • Free-space management • Storage allocation • Disk scheduling Because secondary storage is used frequently, it must be used efficiently. The entire speed of operation of a computer may hinge on the speeds of the disk subsystem and the algorithms that manipulate that subsystem. There are, however, many uses for storage that is slower and lower in cost (and sometimes of higher capacity) than secondary storage. Backups of disk data, storage of seldom-used data, and long-term archival storage are some examples. Magnetic tape drives and their tapes and CD and DVD drives and platters are typical tertiary storage devices. The media (tapes and optical platters) vary between WORM (write-once, read-many-times) and RW (read– write) formats. Tertiary storage is not crucial to system performance, but it still must be managed. Some operating systems take on this task, while others leave tertiary-storage management to application programs. Some of the functions that operating systems can provide include mounting and unmounting media in devices, allocating and freeing the devices for exclusive use by processes, and migrating data from secondary to tertiary storage. Techniques for secondary and tertiary storage management are discussed in Chapter 10. 1.8.3 Caching Caching is an important principle of computer systems. Here’s how it works. Information is normally kept in some storage system (such as main memory). As it is used, it is copied into a faster storage system—the cache—on a
  • 35. 28 Chapter 1 Introduction temporary basis. When we need a particular piece of information, we first check whether it is in the cache. If it is, we use the information directly from the cache. If it is not, we use the information from the source, putting a copy in the cache under the assumption that we will need it again soon. In addition, internal programmable registers, such as index registers, provide a high-speed cache for main memory. The programmer (or compiler) implements the register-allocation and register-replacement algorithms to decide which information to keep in registers and which to keep in main memory. Other caches are implemented totally in hardware. For instance, most systems have an instruction cache to hold the instructions expected to be executed next. Without this cache, the CPU would have to wait several cycles while an instruction was fetched from main memory. For similar reasons, most systems have one or more high-speed data caches in the memory hierarchy. We are not concerned with these hardware-only caches in this text, since they are outside the control of the operating system. Because caches have limited size, cache management is an important design problem. Careful selection of the cache size and of a replacement policy can result in greatly increased performance. Figure 1.11 compares storage performance in large workstations and small servers. Various replacement algorithms for software-controlled caches are discussed in Chapter 9. Main memory can be viewed as a fast cache for secondary storage, since data in secondary storage must be copied into main memory for use and data must be in main memory before being moved to secondary storage for safekeeping. The file-system data, which resides permanently on secondary storage, may appear on several levels in the storage hierarchy. At the highest level, the operating system may maintain a cache of file-system data in main memory. In addition, solid-state disks may be used for high-speed storage that is accessed through the file-system interface. The bulk of secondary storage is on magnetic disks. The magnetic-disk storage, in turn, is often backed up onto magnetic tapes or removable disks to protect against data loss in case of a hard-disk failure. Some systems automatically archive old file data from secondary storage to tertiary storage, such as tape jukeboxes, to lower the storage cost (see Chapter 10).

Level | Name | Typical size | Implementation technology | Access time (ns) | Bandwidth (MB/sec) | Managed by | Backed by
1 | registers | < 1 KB | custom memory with multiple ports, CMOS | 0.25 - 0.5 | 20,000 - 100,000 | compiler | cache
2 | cache | < 16 MB | on-chip or off-chip CMOS SRAM | 0.5 - 25 | 5,000 - 10,000 | hardware | main memory
3 | main memory | < 64 GB | CMOS SRAM | 80 - 250 | 1,000 - 5,000 | operating system | disk
4 | solid state disk | < 1 TB | flash memory | 25,000 - 50,000 | 500 | operating system | disk
5 | magnetic disk | < 10 TB | magnetic disk | 5,000,000 | 20 - 150 | operating system | disk or tape

Figure 1.11 Performance of various levels of storage.
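To make the check-the-cache-first pattern concrete, here is a minimal C sketch of a software-managed cache. It is illustrative only: the slot count, block size, and read_from_source() function are invented, and the "replacement policy" is simply whatever the direct mapping overwrites.

#include <string.h>

#define CACHE_SLOTS 64
#define BLOCK_SIZE  512

struct cache_slot {
    int valid;                 /* does this slot currently hold a block? */
    unsigned long block_no;    /* which block it holds */
    char data[BLOCK_SIZE];
};

static struct cache_slot cache[CACHE_SLOTS];

/* Stand-in for the slow source (disk, network, ...). */
static void read_from_source(unsigned long block_no, char *buf)
{
    (void)block_no;
    memset(buf, 0, BLOCK_SIZE);    /* a real system would issue an I/O here */
}

/* Return the requested block, filling the cache on a miss. */
char *cache_read(unsigned long block_no)
{
    struct cache_slot *slot = &cache[block_no % CACHE_SLOTS];

    if (!slot->valid || slot->block_no != block_no) {  /* miss */
        read_from_source(block_no, slot->data);        /* copy from the source... */
        slot->block_no = block_no;                     /* ...replacing whatever was here */
        slot->valid = 1;
    }
    return slot->data;                                 /* hit: use the cached copy */
}

A fancier design would track usage and evict the least recently used block; the direct-mapped version above keeps the lookup to a single index computation.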
  • 36. 1.8 Storage Management 29 A A A magnetic disk main memory hardware register cache Figure 1.12 Migration of integer A from disk to register. The movement of information between levels of a storage hierarchy may be either explicit or implicit, depending on the hardware design and the controlling operating-system software. For instance, data transfer from cache to CPU and registers is usually a hardware function, with no operating-system intervention. In contrast, transfer of data from disk to memory is usually controlled by the operating system. In a hierarchical storage structure, the same data may appear in different levels of the storage system. For example, suppose that an integer A that is to be incremented by 1 is located in file B, and file B resides on magnetic disk. The increment operation proceeds by first issuing an I/O operation to copy the disk block on which A resides to main memory. This operation is followed by copying A to the cache and to an internal register. Thus, the copy of A appears in several places: on the magnetic disk, in main memory, in the cache, and in an internal register (see Figure 1.12). Once the increment takes place in the internal register, the value of A differs in the various storage systems. The value of A becomes the same only after the new value of A is written from the internal register back to the magnetic disk. In a computing environment where only one process executes at a time, this arrangement poses no difficulties, since an access to integer A will always be to the copy at the highest level of the hierarchy. However, in a multitasking environment, where the CPU is switched back and forth among various processes, extreme care must be taken to ensure that, if several processes wish to access A, then each of these processes will obtain the most recently updated value of A. The situation becomes more complicated in a multiprocessor environment where, in addition to maintaining internal registers, each of the CPUs also contains a local cache (Figure 1.6). In such an environment, a copy of A may exist simultaneously in several caches. Since the various CPUs can all execute in parallel, we must make sure that an update to the value of A in one cache is immediately reflected in all other caches where A resides. This situation is called cache coherency, and it is usually a hardware issue (handled below the operating-system level). In a distributed environment, the situation becomes even more complex. In this environment, several copies (or replicas) of the same file can be kept on different computers. Since the various replicas may be accessed and updated concurrently, some distributed systems ensure that, when a replica is updated in one place, all other replicas are brought up to date as soon as possible. There are various ways to achieve this guarantee, as we discuss in Chapter 17. 1.8.4 I/O Systems One of the purposes of an operating system is to hide the peculiarities of specific hardware devices from the user. For example, in UNIX, the peculiarities of I/O
  • 37. 30 Chapter 1 Introduction devices are hidden from the bulk of the operating system itself by the I/O subsystem. The I/O subsystem consists of several components: • A memory-management component that includes buffering, caching, and spooling • A general device-driver interface • Drivers for specific hardware devices Only the device driver knows the peculiarities of the specific device to which it is assigned. We discussed in Section 1.2.3 how interrupt handlers and device drivers are used in the construction of efficient I/O subsystems. In Chapter 13, we discuss how the I/O subsystem interfaces to the other system components, manages devices, transfers data, and detects I/O completion. 1.9 Protection and Security If a computer system has multiple users and allows the concurrent execution of multiple processes, then access to data must be regulated. For that purpose, mechanisms ensure that files, memory segments, CPU, and other resources can be operated on by only those processes that have gained proper authoriza- tion from the operating system. For example, memory-addressing hardware ensures that a process can execute only within its own address space. The timer ensures that no process can gain control of the CPU without eventually relinquishing control. Device-control registers are not accessible to users, so the integrity of the various peripheral devices is protected. Protection, then, is any mechanism for controlling the access of processes or users to the resources defined by a computer system. This mechanism must provide means to specify the controls to be imposed and to enforce the controls. Protection can improve reliability by detecting latent errors at the interfaces between component subsystems. Early detection of interface errors can often prevent contamination of a healthy subsystem by another subsystem that is malfunctioning. Furthermore, an unprotected resource cannot defend against use (or misuse) by an unauthorized or incompetent user. A protection-oriented system provides a means to distinguish between authorized and unauthorized usage, as we discuss in Chapter 14. A system can have adequate protection but still be prone to failure and allow inappropriate access. Consider a user whose authentication information (her means of identifying herself to the system) is stolen. Her data could be copied or deleted, even though file and memory protection are working. It is the job of security to defend a system from external and internal attacks. Such attacks spread across a huge range and include viruses and worms, denial-of- service attacks (which use all of a system’s resources and so keep legitimate users out of the system), identity theft, and theft of service (unauthorized use of a system). Prevention of some of these attacks is considered an operating-system function on some systems, while other systems leave it to policy or additional software. Due to the alarming rise in security incidents,
  • 38. 2 C H A P T E R Operating- System Structures An operating system provides the environment within which programs are executed. Internally, operating systems vary greatly in their makeup, since they are organized along many different lines. The design of a new operating system is a major task. It is important that the goals of the system be well defined before the design begins. These goals form the basis for choices among various algorithms and strategies. We can view an operating system from several vantage points. One view focuses on the services that the system provides; another, on the interface that it makes available to users and programmers; a third, on its components and their interconnections. In this chapter, we explore all three aspects of operating systems, showing the viewpoints of users, programmers, and operating system designers. We consider what services an operating system provides, how they are provided, how they are debugged, and what the various methodologies are for designing such systems. Finally, we describe how operating systems are created and how a computer starts its operating system. CHAPTER OBJECTIVES • To describe the services an operating system provides to users, processes, and other systems. • To discuss the various ways of structuring an operating system. • To explain how operating systems are installed and customized and how they boot. 2.1 Operating-System Services An operating system provides an environment for the execution of programs. It provides certain services to programs and to the users of those programs. The specific services provided, of course, differ from one operating system to another, but we can identify common classes. These operating system services are provided for the convenience of the programmer, to make the programming 55
  • 39. 56 Chapter 2 Operating-System Structures user and other system programs services operating system hardware system calls GUI batch user interfaces command line program execution I/O operations file systems communication resource allocation accounting protection and security error detection Figure 2.1 A view of operating system services. task easier. Figure 2.1 shows one view of the various operating-system services and how they interrelate. One set of operating system services provides functions that are helpful to the user. • User interface. Almost all operating systems have a user interface (UI). This interface can take several forms. One is a command-line interface (CLI), which uses text commands and a method for entering them (say, a keyboard for typing in commands in a specific format with specific options). Another is a batch interface, in which commands and directives to control those commands are entered into files, and those files are executed. Most commonly, a graphical user interface (GUI) is used. Here, the interface is a window system with a pointing device to direct I/O, choose from menus, and make selections and a keyboard to enter text. Some systems provide two or all three of these variations. • Program execution. The system must be able to load a program into memory and to run that program. The program must be able to end its execution, either normally or abnormally (indicating error). • I/O operations. A running program may require I/O, which may involve a file or an I/O device. For specific devices, special functions may be desired (such as recording to a CD or DVD drive or blanking a display screen). For efficiency and protection, users usually cannot control I/O devices directly. Therefore, the operating system must provide a means to do I/O. • File-system manipulation. The file system is of particular interest. Obvi- ously, programs need to read and write files and directories. They also need to create and delete them by name, search for a given file, and list file information. Finally, some operating systems include permissions management to allow or deny access to files or directories based on file ownership. Many operating systems provide a variety of file systems, sometimes to allow personal choice and sometimes to provide specific features or performance characteristics.
  • 40. 2.1 Operating-System Services 57 • Communications. There are many circumstances in which one process needs to exchange information with another process. Such communication may occur between processes that are executing on the same computer or between processes that are executing on different computer systems tied together by a computer network. Communications may be implemented via shared memory, in which two or more processes read and write to a shared section of memory, or message passing, in which packets of information in predefined formats are moved between processes by the operating system. • Error detection. The operating system needs to be detecting and correcting errors constantly. Errors may occur in the CPU and memory hardware (such as a memory error or a power failure), in I/O devices (such as a parity error on disk, a connection failure on a network, or lack of paper in the printer), and in the user program (such as an arithmetic overflow, an attempt to access an illegal memory location, or a too-great use of CPU time). For each type of error, the operating system should take the appropriate action to ensure correct and consistent computing. Sometimes, it has no choice but to halt the system. At other times, it might terminate an error-causing process or return an error code to a process for the process to detect and possibly correct. Another set of operating system functions exists not for helping the user but rather for ensuring the efficient operation of the system itself. Systems with multiple users can gain efficiency by sharing the computer resources among the users. • Resource allocation. When there are multiple users or multiple jobs running at the same time, resources must be allocated to each of them. The operating system manages many different types of resources. Some (such as CPU cycles, main memory, and file storage) may have special allocation code, whereas others (such as I/O devices) may have much more general request and release code. For instance, in determining how best to use the CPU, operating systems have CPU-scheduling routines that take into account the speed of the CPU, the jobs that must be executed, the number of registers available, and other factors. There may also be routines to allocate printers, USB storage drives, and other peripheral devices. • Accounting. We want to keep track of which users use how much and what kinds of computer resources. This record keeping may be used for accounting (so that users can be billed) or simply for accumulating usage statistics. Usage statistics may be a valuable tool for researchers who wish to reconfigure the system to improve computing services. • Protection and security. The owners of information stored in a multiuser or networked computer system may want to control use of that information. When several separate processes execute concurrently, it should not be possible for one process to interfere with the others or with the operating system itself. Protection involves ensuring that all access to system resources is controlled. Security of the system from outsiders is also important. Such security starts with requiring each user to authenticate
  • 41. 58 Chapter 2 Operating-System Structures himself or herself to the system, usually by means of a password, to gain access to system resources. It extends to defending external I/O devices, including network adapters, from invalid access attempts and to recording all such connections for detection of break-ins. If a system is to be protected and secure, precautions must be instituted throughout it. A chain is only as strong as its weakest link. 2.2 User and Operating-System Interface We mentioned earlier that there are several ways for users to interface with the operating system. Here, we discuss two fundamental approaches. One provides a command-line interface, or command interpreter, that allows users to directly enter commands to be performed by the operating system. The other allows users to interface with the operating system via a graphical user interface, or GUI. 2.2.1 Command Interpreters Some operating systems include the command interpreter in the kernel. Others, such as Windows and UNIX, treat the command interpreter as a special program that is running when a job is initiated or when a user first logs on (on interactive systems). On systems with multiple command interpreters to choose from, the interpreters are known as shells. For example, on UNIX and Linux systems, a user may choose among several different shells, including the Bourne shell, C shell, Bourne-Again shell, Korn shell, and others. Third-party shells and free user-written shells are also available. Most shells provide similar functionality, and a user’s choice of which shell to use is generally based on personal preference. Figure 2.2 shows the Bourne shell command interpreter being used on Solaris 10. The main function of the command interpreter is to get and execute the next user-specified command. Many of the commands given at this level manipulate files: create, delete, list, print, copy, execute, and so on. The MS-DOS and UNIX shells operate in this way. These commands can be implemented in two general ways. In one approach, the command interpreter itself contains the code to execute the command. For example, a command to delete a file may cause the command interpreter to jump to a section of its code that sets up the parameters and makes the appropriate system call. In this case, the number of commands that can be given determines the size of the command interpreter, since each command requires its own implementing code. An alternative approach—used by UNIX, among other operating systems —implements most commands through system programs. In this case, the command interpreter does not understand the command in any way; it merely uses the command to identify a file to be loaded into memory and executed. Thus, the UNIX command to delete a file rm file.txt would search for a file called rm, load the file into memory, and execute it with the parameter file.txt. The function associated with the rm command would
  • 42. 2.2 User and Operating-System Interface 59 Figure 2.2 The Bourne shell command interpreter in Solaris 10. be defined completely by the code in the file rm. In this way, programmers can add new commands to the system easily by creating new files with the proper names. The command-interpreter program, which can be small, does not have to be changed for new commands to be added. 2.2.2 Graphical User Interfaces A second strategy for interfacing with the operating system is through a user-friendly graphical user interface, or GUI. Here, rather than entering commands directly via a command-line interface, users employ a mouse-based window-and-menu system characterized by a desktop metaphor. The user moves the mouse to position its pointer on images, or icons, on the screen (the desktop) that represent programs, files, directories, and system functions. Depending on the mouse pointer’s location, clicking a button on the mouse can invoke a program, select a file or directory—known as a folder—or pull down a menu that contains commands. Graphical user interfaces first appeared due in part to research taking place in the early 1970s at the Xerox PARC research facility. The first GUI appeared on the Xerox Alto computer in 1973. However, graphical interfaces became more widespread with the advent of Apple Macintosh computers in the 1980s. The user interface for the Macintosh operating system (Mac OS) has undergone various changes over the years, the most significant being the adoption of the Aqua interface that appeared with Mac OS X. Microsoft’s first version of Windows—Version 1.0—was based on the addition of a GUI interface to the MS-DOS operating system. Later versions of Windows have made cosmetic
  • 43. 60 Chapter 2 Operating-System Structures changes in the appearance of the GUI along with several enhancements in its functionality. Because a mouse is impractical for most mobile systems, smartphones and handheld tablet computers typically use a touchscreen interface. Here, users interact by making gestures on the touchscreen—for example, pressing and swiping fingers across the screen. Figure 2.3 illustrates the touchscreen of the Apple iPad. Whereas earlier smartphones included a physical keyboard, most smartphones now simulate a keyboard on the touchscreen. Traditionally, UNIX systems have been dominated by command-line inter- faces. Various GUI interfaces are available, however. These include the Common Desktop Environment (CDE) and X-Windows systems, which are common on commercial versions of UNIX, such as Solaris and IBM’s AIX system. In addition, there has been significant development in GUI designs from various open-source projects, such as K Desktop Environment (or KDE) and the GNOME desktop by the GNU project. Both the KDE and GNOME desktops run on Linux and various UNIX systems and are available under open-source licenses, which means their source code is readily available for reading and for modification under specific license terms. Figure 2.3 The iPad touchscreen.
  • 44. 2.2 User and Operating-System Interface 61 2.2.3 Choice of Interface The choice of whether to use a command-line or GUI interface is mostly one of personal preference. System administrators who manage computers and power users who have deep knowledge of a system frequently use the command-line interface. For them, it is more efficient, giving them faster access to the activities they need to perform. Indeed, on some systems, only a subset of system functions is available via the GUI, leaving the less common tasks to those who are command-line knowledgeable. Further, command- line interfaces usually make repetitive tasks easier, in part because they have their own programmability. For example, if a frequent task requires a set of command-line steps, those steps can be recorded into a file, and that file can be run just like a program. The program is not compiled into executable code but rather is interpreted by the command-line interface. These shell scripts are very common on systems that are command-line oriented, such as UNIX and Linux. In contrast, most Windows users are happy to use the Windows GUI environment and almost never use the MS-DOS shell interface. The various changes undergone by the Macintosh operating systems provide a nice study in contrast. Historically, Mac OS has not provided a command-line interface, always requiring its users to interface with the operating system using its GUI. However, with the release of Mac OS X (which is in part implemented using a UNIX kernel), the operating system now provides both a Aqua interface and a command-line interface. Figure 2.4 is a screenshot of the Mac OS X GUI. Figure 2.4 The Mac OS X GUI.
  • 45. 62 Chapter 2 Operating-System Structures The user interface can vary from system to system and even from user to user within a system. It typically is substantially removed from the actual system structure. The design of a useful and friendly user interface is therefore not a direct function of the operating system. In this book, we concentrate on the fundamental problems of providing adequate service to user programs. From the point of view of the operating system, we do not distinguish between user programs and system programs. 2.3 System Calls System calls provide an interface to the services made available by an operating system. These calls are generally available as routines written in C and C++, although certain low-level tasks (for example, tasks where hardware must be accessed directly) may have to be written using assembly-language instructions. Before we discuss how an operating system makes system calls available, let’s first use an example to illustrate how system calls are used: writing a simple program to read data from one file and copy them to another file. The first input that the program will need is the names of the two files: the input file and the output file. These names can be specified in many ways, depending on the operating-system design. One approach is for the program to ask the user for the names. In an interactive system, this approach will require a sequence of system calls, first to write a prompting message on the screen and then to read from the keyboard the characters that define the two files. On mouse-based and icon-based systems, a menu of file names is usually displayed in a window. The user can then use the mouse to select the source name, and a window can be opened for the destination name to be specified. This sequence requires many I/O system calls. Once the two file names have been obtained, the program must open the input file and create the output file. Each of these operations requires another system call. Possible error conditions for each operation can require additional system calls. When the program tries to open the input file, for example, it may find that there is no file of that name or that the file is protected against access. In these cases, the program should print a message on the console (another sequence of system calls) and then terminate abnormally (another system call). If the input file exists, then we must create a new output file. We may find that there is already an output file with the same name. This situation may cause the program to abort (a system call), or we may delete the existing file (another system call) and create a new one (yet another system call). Another option, in an interactive system, is to ask the user (via a sequence of system calls to output the prompting message and to read the response from the terminal) whether to replace the existing file or to abort the program. When both files are set up, we enter a loop that reads from the input file (a system call) and writes to the output file (another system call). Each read and write must return status information regarding various possible error conditions. On input, the program may find that the end of the file has been reached or that there was a hardware failure in the read (such as a parity error). The write operation may encounter various errors, depending on the output device (for example, no more disk space).
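Concretely, this copy program can be written in C with POSIX system calls. The sketch below simplifies the scenario just described: the file names arrive as command-line arguments rather than interactive prompts, and every error simply aborts instead of consulting the user.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    char buf[4096];
    ssize_t nread;

    if (argc != 3) {
        fprintf(stderr, "usage: %s <input> <output>\n", argv[0]);
        exit(1);
    }

    int in = open(argv[1], O_RDONLY);              /* system call: open the input file */
    if (in < 0) { perror(argv[1]); exit(1); }      /* e.g., no file of that name */

    /* Create the output file; O_EXCL makes the call fail if it already exists. */
    int out = open(argv[2], O_WRONLY | O_CREAT | O_EXCL, 0644);
    if (out < 0) { perror(argv[2]); exit(1); }

    while ((nread = read(in, buf, sizeof buf)) > 0)    /* loop: read from input... */
        if (write(out, buf, nread) != nread) {         /* ...write to output */
            perror("write");                           /* e.g., no more disk space */
            exit(1);
        }
    if (nread < 0) perror("read");                     /* hardware failure, etc. */

    close(in);                                         /* more system calls */
    close(out);
    return 0;                                          /* terminate normally */
}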
  • 46. 2.3 System Calls 63 Finally, after the entire file is copied, the program may close both files (another system call), write a message to the console or window (more system calls), and finally terminate normally (the final system call). This system-call sequence is shown in Figure 2.5. As you can see, even simple programs may make heavy use of the operating system. Frequently, systems execute thousands of system calls per second. Most programmers never see this level of detail, however. Typically, application developers design programs according to an application programming interface (API). The API specifies a set of functions that are available to an application programmer, including the parameters that are passed to each function and the return values the programmer can expect. Three of the most common APIs available to application programmers are the Windows API for Windows systems, the POSIX API for POSIX-based systems (which include virtually all versions of UNIX, Linux, and Mac OS X), and the Java API for programs that run on the Java virtual machine. A programmer accesses an API via a library of code provided by the operating system. In the case of UNIX and Linux for programs written in the C language, the library is called libc. Note that—unless specified—the system-call names used throughout this text are generic examples. Each operating system has its own name for each system call. Behind the scenes, the functions that make up an API typically invoke the actual system calls on behalf of the application programmer. For example, the Windows function CreateProcess() (which unsurprisingly is used to create a new process) actually invokes the NTCreateProcess() system call in the Windows kernel. Why would an application programmer prefer programming according to an API rather than invoking actual system calls? There are several reasons for doing so. One benefit concerns program portability. An application program- source file destination file Example System Call Sequence Acquire input file name Write prompt to screen Accept input Acquire output file name Write prompt to screen Accept input Open the input file if file doesn't exist, abort Create output file if file exists, abort Loop Read from input file Write to output file Until read fails Close output file Write completion message to screen Terminate normally Figure 2.5 Example of how system calls are used.
  • 47. 64 Chapter 2 Operating-System Structures EXAMPLE OF STANDARD API As an example of a standard API, consider the read() function that is available in UNIX and Linux systems. The API for this function is obtained from the man page by invoking the command man read on the command line. A description of this API appears below: #include <unistd.h> ssize_t read(int fd, void *buf, size_t count) return value function name parameters A program that uses the read() function must include the unistd.h header file, as this file defines the ssize t and size t data types (among other things). The parameters passed to read() are as follows: • int fd—the file descriptor to be read • void *buf—a buffer where the data will be read into • size t count—the maximum number of bytes to be read into the buffer On a successful read, the number of bytes read is returned. A return value of 0 indicates end of file. If an error occurs, read() returns −1. mer designing a program using an API can expect her program to compile and run on any system that supports the same API (although, in reality, architectural differences often make this more difficult than it may appear). Furthermore, actual system calls can often be more detailed and difficult to work with than the API available to an application programmer. Nevertheless, there often exists a strong correlation between a function in the API and its associated system call within the kernel. In fact, many of the POSIX and Windows APIs are similar to the native system calls provided by the UNIX, Linux, and Windows operating systems. For most programming languages, the run-time support system (a set of functions built into libraries included with a compiler) provides a system- call interface that serves as the link to system calls made available by the operating system. The system-call interface intercepts function calls in the API and invokes the necessary system calls within the operating system. Typically, a number is associated with each system call, and the system-call interface maintains a table indexed according to these numbers. The system call interface
  • 48. 2.3 System Calls 65 Figure 2.6 The handling of a user application invoking the open() system call. then invokes the intended system call in the operating-system kernel and returns the status of the system call and any return values. The caller need know nothing about how the system call is implemented or what it does during execution. Rather, the caller need only obey the API and understand what the operating system will do as a result of the execution of that system call. Thus, most of the details of the operating-system interface are hidden from the programmer by the API and are managed by the run-time support library. The relationship between an API, the system-call interface, and the operating system is shown in Figure 2.6, which illustrates how the operating system handles a user application invoking the open() system call. System calls occur in different ways, depending on the computer in use. Often, more information is required than simply the identity of the desired system call. The exact type and amount of information vary according to the particular operating system and call. For example, to get input, we may need to specify the file or device to use as the source, as well as the address and length of the memory buffer into which the input should be read. Of course, the device or file and length may be implicit in the call. Three general methods are used to pass parameters to the operating system. The simplest approach is to pass the parameters in registers. In some cases, however, there may be more parameters than registers. In these cases, the parameters are generally stored in a block, or table, in memory, and the address of the block is passed as a parameter in a register (Figure 2.7). This is the approach taken by Linux and Solaris. Parameters also can be placed, or pushed, onto the stack by the program and popped off the stack by the operating system. Some operating systems prefer the block or stack method because those approaches do not limit the number or length of parameters being passed.
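The following C sketch mimics the block method, together with the number-indexed dispatch table described earlier, entirely in user space. The call number, structure layout, and trap() stand-in are all invented for illustration; a real implementation would place the block's address in a register and execute a trap instruction.

#include <stddef.h>
#include <stdio.h>

/* Hypothetical parameter block for a read-like call (names invented). */
struct read_params {
    int    fd;
    void  *buf;
    size_t count;
};

/* "Kernel" side: use the parameters from the block. */
static long sys_read(void *param_block)
{
    struct read_params *p = param_block;
    /* ... a real kernel would copy data into p->buf here ... */
    return (long)p->count;    /* pretend every requested byte was read */
}

/* Stand-in for the trap instruction: the system-call interface keeps a
   table of handlers indexed by the call number. */
static long trap(unsigned number, void *param_block)
{
    static long (*table[])(void *) = { sys_read };
    if (number >= sizeof(table) / sizeof(table[0]))
        return -1;                       /* unknown call: return an error */
    return table[number](param_block);
}

int main(void)
{
    char buf[128];
    struct read_params p = { 0, buf, sizeof buf };   /* pack the block in memory */
    long n = trap(0, &p);                            /* pass its address */
    printf("read returned %ld\n", n);
    return 0;
}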
  • 49. 66 Chapter 2 Operating-System Structures code for system call 13 operating system user program use parameters from table X register X X: parameters for call load address X system call 13 Figure 2.7 Passing of parameters as a table. 2.4 Types of System Calls System calls can be grouped roughly into six major categories: process control, file manipulation, device manipulation, information maintenance, communications, and protection. In Sections 2.4.1 through 2.4.6, we briefly discuss the types of system calls that may be provided by an operating system. Most of these system calls support, or are supported by, concepts and functions that are discussed in later chapters. Figure 2.8 summarizes the types of system calls normally provided by an operating system. As mentioned, in this text, we normally refer to the system calls by generic names. Throughout the text, however, we provide examples of the actual counterparts to the system calls for Windows, UNIX, and Linux systems. 2.4.1 Process Control A running program needs to be able to halt its execution either normally (end()) or abnormally (abort()). If a system call is made to terminate the currently running program abnormally, or if the program runs into a problem and causes an error trap, a dump of memory is sometimes taken and an error message generated. The dump is written to disk and may be examined by a debugger—a system program designed to aid the programmer in finding and correcting errors, or bugs—to determine the cause of the problem. Under either normal or abnormal circumstances, the operating system must transfer control to the invoking command interpreter. The command interpreter then reads the next command. In an interactive system, the command interpreter simply continues with the next command; it is assumed that the user will issue an appropriate command to respond to any error. In a GUI system, a pop-up window might alert the user to the error and ask for guidance. In a batch system, the command interpreter usually terminates the entire job and continues with the next job. Some systems may allow for special recovery actions in case an error occurs. If the program discovers an error in its input and wants to terminate abnormally, it may also want to define an error level. More severe errors can be indicated by a higher-level error parameter. It is then
  • 50. 2.4 Types of System Calls 67 • Process control ◦ end, abort ◦ load, execute ◦ create process, terminate process ◦ get process attributes, set process attributes ◦ wait for time ◦ wait event, signal event ◦ allocate and free memory • File management ◦ create file, delete file ◦ open, close ◦ read, write, reposition ◦ get file attributes, set file attributes • Device management ◦ request device, release device ◦ read, write, reposition ◦ get device attributes, set device attributes ◦ logically attach or detach devices • Information maintenance ◦ get time or date, set time or date ◦ get system data, set system data ◦ get process, file, or device attributes ◦ set process, file, or device attributes • Communications ◦ create, delete communication connection ◦ send, receive messages ◦ transfer status information ◦ attach or detach remote devices Figure 2.8 Types of system calls. possible to combine normal and abnormal termination by defining a normal termination as an error at level 0. The command interpreter or a following program can use this error level to determine the next action automatically. A process or job executing one program may want to load() and execute() another program. This feature allows the command interpreter to execute a program as directed by, for example, a user command, the click of a
  • 51. 68 Chapter 2 Operating-System Structures EXAMPLES OF WINDOWS AND UNIX SYSTEM CALLS

Category | Windows | Unix
Process control | CreateProcess() | fork()
 | ExitProcess() | exit()
 | WaitForSingleObject() | wait()
File manipulation | CreateFile() | open()
 | ReadFile() | read()
 | WriteFile() | write()
 | CloseHandle() | close()
Device manipulation | SetConsoleMode() | ioctl()
 | ReadConsole() | read()
 | WriteConsole() | write()
Information maintenance | GetCurrentProcessID() | getpid()
 | SetTimer() | alarm()
 | Sleep() | sleep()
Communication | CreatePipe() | pipe()
 | CreateFileMapping() | shm_open()
 | MapViewOfFile() | mmap()
Protection | SetFileSecurity() | chmod()
 | InitializeSecurityDescriptor() | umask()
 | SetSecurityDescriptorGroup() | chown()

mouse, or a batch command. An interesting question is where to return control when the loaded program terminates. This question is related to whether the existing program is lost, saved, or allowed to continue execution concurrently with the new program. If control returns to the existing program when the new program terminates, we must save the memory image of the existing program; thus, we have effectively created a mechanism for one program to call another program. If both programs continue concurrently, we have created a new job or process to be multiprogrammed. Often, there is a system call specifically for this purpose (create_process() or submit_job()). If we create a new job or process, or perhaps even a set of jobs or processes, we should be able to control its execution. This control requires the ability to determine and reset the attributes of a job or process, including the job’s priority, its maximum allowable execution time, and so on (get_process_attributes() and set_process_attributes()). We may also want to terminate a job or process that we created (terminate_process()) if we find that it is incorrect or is no longer needed.
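Using the UNIX column of the table above, the create-wait-terminate cycle looks like this in C. This is a sketch; the program launched, ls, is an arbitrary example.

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();                 /* create a new process */

    if (pid < 0) {                      /* creation failed */
        perror("fork");
        exit(1);
    } else if (pid == 0) {              /* child: load and execute a program */
        execlp("ls", "ls", (char *)NULL);   /* arbitrary example program */
        perror("execlp");               /* reached only if the exec fails */
        exit(1);
    } else {                            /* parent: wait for the child to end */
        int status;
        waitpid(pid, &status, 0);
        if (WIFEXITED(status))
            printf("child exited with status %d\n", WEXITSTATUS(status));
    }
    return 0;
}

Omitting the waitpid() call is what running a process "in the background" amounts to: the parent simply continues without collecting the child's status immediately.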
  • 52. 2.4 Types of System Calls 69 EXAMPLE OF STANDARD C LIBRARY The standard C library provides a portion of the system-call interface for many versions of UNIX and Linux. As an example, let’s assume a C program invokes the printf() statement. The C library intercepts this call and invokes the necessary system call (or calls) in the operating system—in this instance, the write() system call. The C library takes the value returned by write() and passes it back to the user program. This is shown below: write ( ) system call user mode kernel mode #include <stdio.h> int main ( ) { • • • printf ("Greetings"); • • • return 0; } standard C library write ( ) Having created new jobs or processes, we may need to wait for them to finish their execution. We may want to wait for a certain amount of time to pass (wait time()). More probably, we will want to wait for a specific event to occur (wait event()). The jobs or processes should then signal when that event has occurred (signal event()). Quite often, two or more processes may share data. To ensure the integrity of the data being shared, operating systems often provide system calls allowing a process to lock shared data. Then, no other process can access the data until the lock is released. Typically, such system calls include acquire lock() and release lock(). System calls of these types, dealing with the coordination of concurrent processes, are discussed in great detail in Chapter 5. There are so many facets of and variations in process and job control that we next use two examples—one involving a single-tasking system and the other a multitasking system—to clarify these concepts. The MS-DOS operating system is an example of a single-tasking system. It has a command interpreter that is invoked when the computer is started (Figure 2.9(a)). Because MS-DOS is single-tasking, it uses a simple method to run a program and does not create a new process. It loads the program into memory, writing over most of itself to
  • 53. 70 Chapter 2 Operating-System Structures (a) (b) free memory command interpreter kernel process free memory command interpreter kernel Figure 2.9 MS-DOS execution. (a) At system startup. (b) Running a program. give the program as much memory as possible (Figure 2.9(b)). Next, it sets the instruction pointer to the first instruction of the program. The program then runs, and either an error causes a trap, or the program executes a system call to terminate. In either case, the error code is saved in the system memory for later use. Following this action, the small portion of the command interpreter that was not overwritten resumes execution. Its first task is to reload the rest of the command interpreter from disk. Then the command interpreter makes the previous error code available to the user or to the next program. FreeBSD (derived from Berkeley UNIX) is an example of a multitasking system. When a user logs on to the system, the shell of the user’s choice is run. This shell is similar to the MS-DOS shell in that it accepts commands and executes programs that the user requests. However, since FreeBSD is a multitasking system, the command interpreter may continue running while another program is executed (Figure 2.10). To start a new process, the shell free memory interpreter kernel process D process C process B Figure 2.10 FreeBSD running multiple programs.
  • 54. 2.4 Types of System Calls 71 executes a fork() system call. Then, the selected program is loaded into memory via an exec() system call, and the program is executed. Depending on the way the command was issued, the shell then either waits for the process to finish or runs the process “in the background.” In the latter case, the shell immediately requests another command. When a process is running in the background, it cannot receive input directly from the keyboard, because the shell is using this resource. I/O is therefore done through files or through a GUI interface. Meanwhile, the user is free to ask the shell to run other programs, to monitor the progress of the running process, to change that program’s priority, and so on. When the process is done, it executes an exit() system call to terminate, returning to the invoking process a status code of 0 or a nonzero error code. This status or error code is then available to the shell or other programs. Processes are discussed in Chapter 3 with a program example using the fork() and exec() system calls. 2.4.2 File Management The file system is discussed in more detail in Chapters 11 and 12. We can, however, identify several common system calls dealing with files. We first need to be able to create() and delete() files. Either system call requires the name of the file and perhaps some of the file’s attributes. Once the file is created, we need to open() it and to use it. We may also read(), write(), or reposition() (rewind or skip to the end of the file, for example). Finally, we need to close() the file, indicating that we are no longer using it. We may need these same sets of operations for directories if we have a directory structure for organizing files in the file system. In addition, for either files or directories, we need to be able to determine the values of various attributes and perhaps to reset them if necessary. File attributes include the file name, file type, protection codes, accounting information, and so on. At least two system calls, get file attributes() and set file attributes(), are required for this function. Some operating systems provide many more calls, such as calls for file move() and copy(). Others might provide an API that performs those operations using code and other system calls, and others might provide system programs to perform those tasks. If the system programs are callable by other programs, then each can be considered an API by other system programs. 2.4.3 Device Management A process may need several resources to execute—main memory, disk drives, access to files, and so on. If the resources are available, they can be granted, and control can be returned to the user process. Otherwise, the process will have to wait until sufficient resources are available. The various resources controlled by the operating system can be thought of as devices. Some of these devices are physical devices (for example, disk drives), while others can be thought of as abstract or virtual devices (for example, files). A system with multiple users may require us to first request() a device, to ensure exclusive use of it. After we are finished with the device, we release() it. These functions are similar to the open() and close() system calls for files. Other operating systems allow unmanaged access to devices.
  • 55. 72 Chapter 2 Operating-System Structures The hazard then is the potential for device contention and perhaps deadlock, which are described in Chapter 7. Once the device has been requested (and allocated to us), we can read(), write(), and (possibly) reposition() the device, just as we can with files. In fact, the similarity between I/O devices and files is so great that many operating systems, including UNIX, merge the two into a combined file–device structure. In this case, a set of system calls is used on both files and devices. Sometimes, I/O devices are identified by special file names, directory placement, or file attributes. The user interface can also make files and devices appear to be similar, even though the underlying system calls are dissimilar. This is another example of the many design decisions that go into building an operating system and user interface. 2.4.4 Information Maintenance Many system calls exist simply for the purpose of transferring information between the user program and the operating system. For example, most systems have a system call to return the current time() and date(). Other system calls may return information about the system, such as the number of current users, the version number of the operating system, the amount of free memory or disk space, and so on. Another set of system calls is helpful in debugging a program. Many systems provide system calls to dump() memory. This provision is useful for debugging. A program trace lists each system call as it is executed. Even microprocessors provide a CPU mode known as single step, in which a trap is executed by the CPU after every instruction. The trap is usually caught by a debugger. Many operating systems provide a time profile of a program to indicate the amount of time that the program executes at a particular location or set of locations. A time profile requires either a tracing facility or regular timer interrupts. At every occurrence of the timer interrupt, the value of the program counter is recorded. With sufficiently frequent timer interrupts, a statistical picture of the time spent on various parts of the program can be obtained. In addition, the operating system keeps information about all its processes, and system calls are used to access this information. Generally, calls are also used to reset the process information (get process attributes() and set process attributes()). In Section 3.1.3, we discuss what information is normally kept. 2.4.5 Communication There are two common models of interprocess communication: the message- passing model and the shared-memory model. In the message-passing model, the communicating processes exchange messages with one another to transfer information. Messages can be exchanged between the processes either directly or indirectly through a common mailbox. Before communication can take place, a connection must be opened. The name of the other communicator must be known, be it another process on the same system or a process on another computer connected by a communications network. Each computer in a network has a host name by which it is commonly known. A host also has a
  • 56. 2.4 Types of System Calls 73 network identifier, such as an IP address. Similarly, each process has a process name, and this name is translated into an identifier by which the operating system can refer to the process. The get_hostid() and get_processid() system calls do this translation. The identifiers are then passed to the general-purpose open() and close() calls provided by the file system or to specific open_connection() and close_connection() system calls, depending on the system’s model of communication. The recipient process usually must give its permission for communication to take place with an accept_connection() call. Most processes that will be receiving connections are special-purpose daemons, which are system programs provided for that purpose. They execute a wait_for_connection() call and are awakened when a connection is made. The source of the communication, known as the client, and the receiving daemon, known as a server, then exchange messages by using read_message() and write_message() system calls. The close_connection() call terminates the communication. In the shared-memory model, processes use shared_memory_create() and shared_memory_attach() system calls to create and gain access to regions of memory owned by other processes. Recall that, normally, the operating system tries to prevent one process from accessing another process’s memory. Shared memory requires that two or more processes agree to remove this restriction. They can then exchange information by reading and writing data in the shared areas. The form of the data is determined by the processes and is not under the operating system’s control. The processes are also responsible for ensuring that they are not writing to the same location simultaneously. Such mechanisms are discussed in Chapter 5. In Chapter 4, we look at a variation of the process scheme—threads—in which memory is shared by default. Both of the models just discussed are common in operating systems, and most systems implement both. Message passing is useful for exchanging smaller amounts of data, because no conflicts need be avoided. It is also easier to implement than is shared memory for intercomputer communication. Shared memory allows maximum speed and convenience of communication, since it can be done at memory transfer speeds when it takes place within a computer. Problems exist, however, in the areas of protection and synchronization between the processes sharing memory. 2.4.6 Protection Protection provides a mechanism for controlling access to the resources provided by a computer system. Historically, protection was a concern only on multiprogrammed computer systems with several users. However, with the advent of networking and the Internet, all computer systems, from servers to mobile handheld devices, must be concerned with protection. Typically, system calls providing protection include set_permission() and get_permission(), which manipulate the permission settings of resources such as files and disks. The allow_user() and deny_user() system calls specify whether particular users can—or cannot—be allowed access to certain resources. We cover protection in Chapter 14 and the much larger issue of security in Chapter 15.
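Returning briefly to the shared-memory model described above: POSIX expresses shared_memory_create() and shared_memory_attach() roughly as shm_open() plus mmap(), the same pair listed in the earlier Windows/UNIX table. Below is a sketch of the creating side; the object name and size are arbitrary, and some systems require linking with -lrt.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const char *name = "/demo_shm";     /* arbitrary object name */
    const size_t size = 4096;           /* arbitrary region size */

    /* Create (or open) the shared-memory object and set its size. */
    int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("shm_open"); return 1; }
    if (ftruncate(fd, size) < 0) { perror("ftruncate"); return 1; }

    /* Attach: map the object into this process's address space. */
    char *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* Any process that maps the same object sees these bytes; the format
       of the data is up to the processes, not the operating system. */
    strcpy(p, "hello from the producer");

    munmap(p, size);
    close(fd);
    shm_unlink(name);                   /* remove the name when done */
    return 0;
}

A second process would call shm_open() with the same name and mmap() the object to read what the first wrote; coordinating who writes when is the synchronization problem taken up in Chapter 5.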
  • 57. Part Two Process Management A process can be thought of as a program in execution. A process will need certain resources—such as CPU time, memory, files, and I/O devices —to accomplish its task. These resources are allocated to the process either when it is created or while it is executing. A process is the unit of work in most systems. Systems consist of a collection of processes: operating-system processes execute system code, and user processes execute user code. All these processes may execute concurrently. Although traditionally a process contained only a single thread of control as it ran, most modern operating systems now support processes that have multiple threads. The operating system is responsible for several important aspects of process and thread management: the creation and deletion of both user and system processes; the scheduling of processes; and the provision of mechanisms for synchronization, communication, and deadlock handling for processes.
  • 59. 3 C H A P T E R Processes Early computers allowed only one program to be executed at a time. This program had complete control of the system and had access to all the system’s resources. In contrast, contemporary computer systems allow multiple pro- grams to be loaded into memory and executed concurrently. This evolution required firmer control and more compartmentalization of the various pro- grams; and these needs resulted in the notion of a process, which is a program in execution. A process is the unit of work in a modern time-sharing system. The more complex the operating system is, the more it is expected to do on behalf of its users. Although its main concern is the execution of user programs, it also needs to take care of various system tasks that are better left outside the kernel itself. A system therefore consists of a collection of processes: operating- system processes executing system code and user processes executing user code. Potentially, all these processes can execute concurrently, with the CPU (or CPUs) multiplexed among them. By switching the CPU between processes, the operating system can make the computer more productive. In this chapter, you will read about what processes are and how they work. CHAPTER OBJECTIVES • To introduce the notion of a process—a program in execution, which forms the basis of all computation. • To describe the various features of processes, including scheduling, creation, and termination. • To explore interprocess communication using shared memory and mes- sage passing. • To describe communication in client–server systems. 3.1 Process Concept A question that arises in discussing operating systems involves what to call all the CPU activities. A batch system executes jobs, whereas a time-shared 105
  • 60. 106 Chapter 3 Processes system has user programs, or tasks. Even on a single-user system, a user may be able to run several programs at one time: a word processor, a Web browser, and an e-mail package. And even if a user can execute only one program at a time, such as on an embedded device that does not support multitasking, the operating system may need to support its own internal programmed activities, such as memory management. In many respects, all these activities are similar, so we call all of them processes. The terms job and process are used almost interchangeably in this text. Although we personally prefer the term process, much of operating-system theory and terminology was developed during a time when the major activity of operating systems was job processing. It would be misleading to avoid the use of commonly accepted terms that include the word job (such as job scheduling) simply because process has superseded job. 3.1.1 The Process Informally, as mentioned earlier, a process is a program in execution. A process is more than the program code, which is sometimes known as the text section. It also includes the current activity, as represented by the value of the program counter and the contents of the processor’s registers. A process generally also includes the process stack, which contains temporary data (such as function parameters, return addresses, and local variables), and a data section, which contains global variables. A process may also include a heap, which is memory that is dynamically allocated during process run time. The structure of a process in memory is shown in Figure 3.1. We emphasize that a program by itself is not a process. A program is a passive entity, such as a file containing a list of instructions stored on disk (often called an executable file). In contrast, a process is an active entity, with a program counter specifying the next instruction to execute and a set of associated resources. A program becomes a process when an executable file is loaded into memory. Two common techniques for loading executable files text 0 max data heap stack Figure 3.1 Process in memory.
Although two processes may be associated with the same program, they are nevertheless considered two separate execution sequences. For instance, several users may be running different copies of the mail program, or the same user may invoke many copies of the web browser program. Each of these is a separate process; and although the text sections are equivalent, the data, heap, and stack sections vary. It is also common to have a process that spawns many processes as it runs. We discuss such matters in Section 3.4.

Note that a process itself can be an execution environment for other code. The Java programming environment provides a good example. In most circumstances, an executable Java program is executed within the Java virtual machine (JVM). The JVM executes as a process that interprets the loaded Java code and takes actions (via native machine instructions) on behalf of that code. For example, to run the compiled Java program Program.class, we would enter

    java Program

The command java runs the JVM as an ordinary process, which in turn executes the Java program Program in the virtual machine. The concept is the same as simulation, except that the code, instead of being written for a different instruction set, is written in the Java language.

3.1.2 Process State

As a process executes, it changes state. The state of a process is defined in part by the current activity of that process. A process may be in one of the following states:

• New. The process is being created.
• Running. Instructions are being executed.
• Waiting. The process is waiting for some event to occur (such as an I/O completion or reception of a signal).
• Ready. The process is waiting to be assigned to a processor.
• Terminated. The process has finished execution.

These names are arbitrary, and they vary across operating systems. The states that they represent are found on all systems, however. Certain operating systems also more finely delineate process states. It is important to realize that only one process can be running on any processor at any instant. Many processes may be ready and waiting, however. The state diagram corresponding to these states is presented in Figure 3.2.

[Figure 3.2: Diagram of process state. A process moves from new to ready (admitted), from ready to running (scheduler dispatch), from running back to ready (interrupt), from running to waiting (I/O or event wait), from waiting to ready (I/O or event completion), and from running to terminated (exit).]

3.1.3 Process Control Block

Each process is represented in the operating system by a process control block (PCB)—also called a task control block. A PCB is shown in Figure 3.3. It contains many pieces of information associated with a specific process, including these:
• Process state. The state may be new, ready, running, waiting, halted, and so on.
• Program counter. The counter indicates the address of the next instruction to be executed for this process.
• CPU registers. The registers vary in number and type, depending on the computer architecture. They include accumulators, index registers, stack pointers, and general-purpose registers, plus any condition-code information. Along with the program counter, this state information must be saved when an interrupt occurs, to allow the process to be continued correctly afterward (Figure 3.4).
• CPU-scheduling information. This information includes a process priority, pointers to scheduling queues, and any other scheduling parameters. (Chapter 6 describes process scheduling.)
• Memory-management information. This information may include such items as the value of the base and limit registers and the page tables, or the segment tables, depending on the memory system used by the operating system (Chapter 8).
• Accounting information. This information includes the amount of CPU and real time used, time limits, account numbers, job or process numbers, and so on.
• I/O status information. This information includes the list of I/O devices allocated to the process, a list of open files, and so on.

In brief, the PCB simply serves as the repository for any information that may vary from process to process.

[Figure 3.3: Process control block (PCB), holding the process state, process number, program counter, registers, memory limits, list of open files, and so on.]

[Figure 3.4: Diagram showing CPU switch from process to process. While process P0 executes, an interrupt or system call causes the operating system to save state into PCB0 and reload state from PCB1; P1 then executes until the next interrupt or system call reverses the exchange.]

3.1.4 Threads

The process model discussed so far has implied that a process is a program that performs a single thread of execution. For example, when a process is running a word-processor program, a single thread of instructions is being executed. This single thread of control allows the process to perform only one task at a time. The user cannot simultaneously type in characters and run the spell checker within the same process, for example. Most modern operating systems have extended the process concept to allow a process to have multiple threads of execution and thus to perform more than one task at a time. This feature is especially beneficial on multicore systems, where multiple threads can run in parallel. On a system that supports threads, the PCB is expanded to include information for each thread. Other changes throughout the system are also needed to support threads. Chapter 4 explores threads in detail.
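Pulled together in C, the fields just listed suggest a structure along the following lines. This is a minimal sketch with invented names, not the layout of any real operating system; the Linux task_struct described next is a real-world counterpart.

    /* Sketch of a process control block; all names are illustrative. */
    typedef enum { NEW, READY, RUNNING, WAITING, TERMINATED } proc_state;

    struct pcb {
        int pid;                  /* process number                       */
        proc_state state;         /* process state                        */
        unsigned long pc;         /* saved program counter                */
        unsigned long regs[16];   /* saved CPU registers                  */
        int priority;             /* CPU-scheduling information           */
        void *page_table;         /* memory-management information        */
        unsigned long cpu_time;   /* accounting information               */
        int open_files[16];       /* I/O status: open-file descriptors    */
        struct pcb *next;         /* link field for the scheduling queues */
    };

On a system that supports threads, the register and stack state would move into a per-thread structure, with the pcb keeping one entry per thread.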
PROCESS REPRESENTATION IN LINUX

The process control block in the Linux operating system is represented by the C structure task_struct, which is found in the <linux/sched.h> include file in the kernel source-code directory. This structure contains all the necessary information for representing a process, including the state of the process, scheduling and memory-management information, list of open files, and pointers to the process's parent and a list of its children and siblings. (A process's parent is the process that created it; its children are any processes that it creates. Its siblings are children with the same parent process.) Some of these fields include:

    long state;                   /* state of the process */
    struct sched_entity se;       /* scheduling information */
    struct task_struct *parent;   /* this process's parent */
    struct list_head children;    /* this process's children */
    struct files_struct *files;   /* list of open files */
    struct mm_struct *mm;         /* address space of this process */

For example, the state of a process is represented by the field long state in this structure. Within the Linux kernel, all active processes are represented using a doubly linked list of task_struct. The kernel maintains a pointer—current—to the process currently executing on the system.

[Figure: a doubly linked list of task_struct structures, each holding process information, with current pointing at the currently executing process.]

As an illustration of how the kernel might manipulate one of the fields in the task_struct for a specified process, let's assume the system would like to change the state of the process currently running to the value new_state. If current is a pointer to the process currently executing, its state is changed with the following:

    current->state = new_state;
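In production kernel code, by the way, a task's state is rarely assigned directly in this way; Linux provides the set_current_state() helper, which pairs the assignment with the memory barrier that concurrent code needs. A brief sketch of the common idiom (kernel-internal code, not a standalone program):

    /* Typical kernel sleep/wake idiom, using the helper rather than a
       bare assignment to current->state. */
    set_current_state(TASK_INTERRUPTIBLE);  /* mark ourselves as sleeping */
    /* ... enqueue on a wait queue and call schedule() ... */
    __set_current_state(TASK_RUNNING);      /* mark ourselves runnable    */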
3.2 Process Scheduling

The objective of multiprogramming is to have some process running at all times, to maximize CPU utilization. The objective of time sharing is to switch the CPU among processes so frequently that users can interact with each program while it is running. To meet these objectives, the process scheduler selects an available process (possibly from a set of several available processes) for program execution on the CPU. For a single-processor system, there will never be more than one running process. If there are more processes, the rest will have to wait until the CPU is free and can be rescheduled.

3.2.1 Scheduling Queues

As processes enter the system, they are put into a job queue, which consists of all processes in the system. The processes that are residing in main memory and are ready and waiting to execute are kept on a list called the ready queue. This queue is generally stored as a linked list. A ready-queue header contains pointers to the first and final PCBs in the list. Each PCB includes a pointer field that points to the next PCB in the ready queue.

The system also includes other queues. When a process is allocated the CPU, it executes for a while and eventually quits, is interrupted, or waits for the occurrence of a particular event, such as the completion of an I/O request. Suppose the process makes an I/O request to a shared device, such as a disk. Since there are many processes in the system, the disk may be busy with the I/O request of some other process. The process therefore may have to wait for the disk. The list of processes waiting for a particular I/O device is called a device queue. Each device has its own device queue (Figure 3.5).

[Figure 3.5: The ready queue and various I/O device queues. Each queue header holds head and tail pointers; the ready queue links a chain of PCBs (PCB7, PCB2), while the queues for disk unit 0, terminal unit 0, and tape units 0 and 1 link their own chains of waiting PCBs.]
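The layout in Figure 3.5 maps directly onto C. Below is a minimal sketch of such a queue, with head and tail pointers and a link field in each PCB; the names are illustrative, not taken from any particular kernel.

    #include <stddef.h>

    struct pcb {
        int pid;
        struct pcb *next;   /* link to the next PCB in the same queue */
        /* registers, state, and other fields omitted */
    };

    struct queue {
        struct pcb *head;   /* first PCB in the queue */
        struct pcb *tail;   /* final PCB, so appending is O(1) */
    };

    /* Append a PCB to a queue (the ready queue or a device queue). */
    void enqueue(struct queue *q, struct pcb *p) {
        p->next = NULL;
        if (q->tail != NULL)
            q->tail->next = p;
        else
            q->head = p;
        q->tail = p;
    }

    /* Remove and return the PCB at the head; NULL if the queue is empty. */
    struct pcb *dequeue(struct queue *q) {
        struct pcb *p = q->head;
        if (p != NULL) {
            q->head = p->next;
            if (q->head == NULL)
                q->tail = NULL;
        }
        return p;
    }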
A common representation of process scheduling is a queueing diagram, such as that in Figure 3.6. Each rectangular box represents a queue. Two types of queues are present: the ready queue and a set of device queues. The circles represent the resources that serve the queues, and the arrows indicate the flow of processes in the system.

[Figure 3.6: Queueing-diagram representation of process scheduling. Processes flow from the ready queue to the CPU and back again by way of an I/O request (through an I/O queue), an expired time slice, forking a child (the child executes), or waiting for an interrupt (the interrupt occurs).]

A new process is initially put in the ready queue. It waits there until it is selected for execution, or dispatched. Once the process is allocated the CPU and is executing, one of several events could occur:

• The process could issue an I/O request and then be placed in an I/O queue.
• The process could create a new child process and wait for the child's termination.
• The process could be removed forcibly from the CPU, as a result of an interrupt, and be put back in the ready queue.

In the first two cases, the process eventually switches from the waiting state to the ready state and is then put back in the ready queue. A process continues this cycle until it terminates, at which time it is removed from all queues and has its PCB and resources deallocated.

3.2.2 Schedulers

A process migrates among the various scheduling queues throughout its lifetime. The operating system must select, for scheduling purposes, processes from these queues in some fashion. The selection process is carried out by the appropriate scheduler. Often, in a batch system, more processes are submitted than can be executed immediately. These processes are spooled to a mass-storage device (typically a disk), where they are kept for later execution. The long-term scheduler, or job scheduler, selects processes from this pool and loads them into memory for execution.
The short-term scheduler, or CPU scheduler, selects from among the processes that are ready to execute and allocates the CPU to one of them.

The primary distinction between these two schedulers lies in frequency of execution. The short-term scheduler must select a new process for the CPU frequently. A process may execute for only a few milliseconds before waiting for an I/O request. Often, the short-term scheduler executes at least once every 100 milliseconds. Because of the short time between executions, the short-term scheduler must be fast. If it takes 10 milliseconds to decide to execute a process for 100 milliseconds, then 10/(100 + 10) = 9 percent of the CPU is being used (wasted) simply for scheduling the work.

The long-term scheduler executes much less frequently; minutes may separate the creation of one new process and the next. The long-term scheduler controls the degree of multiprogramming (the number of processes in memory). If the degree of multiprogramming is stable, then the average rate of process creation must be equal to the average departure rate of processes leaving the system. Thus, the long-term scheduler may need to be invoked only when a process leaves the system. Because of the longer interval between executions, the long-term scheduler can afford to take more time to decide which process should be selected for execution.

It is important that the long-term scheduler make a careful selection. In general, most processes can be described as either I/O bound or CPU bound. An I/O-bound process is one that spends more of its time doing I/O than it spends doing computations. A CPU-bound process, in contrast, generates I/O requests infrequently, using more of its time doing computations. It is important that the long-term scheduler select a good process mix of I/O-bound and CPU-bound processes. If all processes are I/O bound, the ready queue will almost always be empty, and the short-term scheduler will have little to do. If all processes are CPU bound, the I/O waiting queue will almost always be empty, devices will go unused, and again the system will be unbalanced. The system with the best performance will thus have a combination of CPU-bound and I/O-bound processes.

On some systems, the long-term scheduler may be absent or minimal. For example, time-sharing systems such as UNIX and Microsoft Windows systems often have no long-term scheduler but simply put every new process in memory for the short-term scheduler. The stability of these systems depends either on a physical limitation (such as the number of available terminals) or on the self-adjusting nature of human users. If performance declines to unacceptable levels on a multiuser system, some users will simply quit.

Some operating systems, such as time-sharing systems, may introduce an additional, intermediate level of scheduling. This medium-term scheduler is diagrammed in Figure 3.7. The key idea behind a medium-term scheduler is that sometimes it can be advantageous to remove a process from memory (and from active contention for the CPU) and thus reduce the degree of multiprogramming. Later, the process can be reintroduced into memory, and its execution can be continued where it left off. This scheme is called swapping. The process is swapped out, and is later swapped in, by the medium-term scheduler. Swapping may be necessary to improve the process mix or because a change in memory requirements has overcommitted available memory, requiring memory to be freed up.
Swapping is discussed in Chapter 8.
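The 9 percent figure quoted above is just the ratio of dispatch time to total time; a toy computation (not operating-system code) makes the relationship explicit:

    #include <stdio.h>

    int main(void) {
        double dispatch_ms = 10.0;   /* time spent choosing/dispatching    */
        double quantum_ms = 100.0;   /* time the chosen process then runs  */

        /* fraction of CPU time consumed by scheduling itself */
        double overhead = dispatch_ms / (quantum_ms + dispatch_ms);
        printf("scheduling overhead: %.1f%%\n", overhead * 100.0); /* ~9.1% */
        return 0;
    }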
[Figure 3.7: Addition of medium-term scheduling to the queueing diagram. Partially executed, swapped-out processes leave the ready queue for backing store (swap out) and later rejoin it (swap in), alongside the CPU and the I/O waiting queues.]

3.2.3 Context Switch

As mentioned in Section 1.2.1, interrupts cause the operating system to change a CPU from its current task and to run a kernel routine. Such operations happen frequently on general-purpose systems. When an interrupt occurs, the system needs to save the current context of the process running on the CPU so that it can restore that context when its processing is done, essentially suspending the process and then resuming it. The context is represented in the PCB of the process. It includes the value of the CPU registers, the process state (see Figure 3.2), and memory-management information. Generically, we perform a state save of the current state of the CPU, be it in kernel or user mode, and then a state restore to resume operations.

Switching the CPU to another process requires performing a state save of the current process and a state restore of a different process. This task is known as a context switch. When a context switch occurs, the kernel saves the context of the old process in its PCB and loads the saved context of the new process scheduled to run. Context-switch time is pure overhead, because the system does no useful work while switching. Switching speed varies from machine to machine, depending on the memory speed, the number of registers that must be copied, and the existence of special instructions (such as a single instruction to load or store all registers). A typical speed is a few milliseconds.

Context-switch times are highly dependent on hardware support. For instance, some processors (such as the Sun UltraSPARC) provide multiple sets of registers. A context switch here simply requires changing the pointer to the current register set. Of course, if there are more active processes than there are register sets, the system resorts to copying register data to and from memory, as before. Also, the more complex the operating system, the greater the amount of work that must be done during a context switch. As we will see in Chapter 8, advanced memory-management techniques may require that extra data be switched with each context. For instance, the address space of the current process must be preserved as the space of the next task is prepared for use. How the address space is preserved, and what amount of work is needed to preserve it, depend on the memory-management method of the operating system.
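In outline, the switch path saves one PCB and restores another. The schematic below is a sketch only: the two register routines stand in for architecture-specific assembly, and every name is invented for illustration.

    #include <stdint.h>

    enum pstate { READY, RUNNING, WAITING };

    struct cpu_context {          /* stand-in for the saved register file */
        uint64_t pc, sp, regs[16];
    };

    struct pcb {
        int pid;
        enum pstate state;
        struct cpu_context ctx;
    };

    /* Placeholders for what would be assembly in a real kernel. */
    static void save_registers(struct cpu_context *c)    { (void)c; }
    static void restore_registers(struct cpu_context *c) { (void)c; }

    void context_switch(struct pcb *old, struct pcb *next) {
        save_registers(&old->ctx);       /* state save of the old process */
        old->state = READY;
        next->state = RUNNING;
        restore_registers(&next->ctx);   /* state restore; next resumes   */
    }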
MULTITASKING IN MOBILE SYSTEMS

Because of the constraints imposed on mobile devices, early versions of iOS did not provide user-application multitasking; only one application ran in the foreground, and all other user applications were suspended. Operating-system tasks were multitasked because they were written by Apple and well behaved. However, beginning with iOS 4, Apple now provides a limited form of multitasking for user applications, thus allowing a single foreground application to run concurrently with multiple background applications. (On a mobile device, the foreground application is the application currently open and appearing on the display. The background application remains in memory, but does not occupy the display screen.) The iOS 4 programming API provides support for multitasking, thus allowing a process to run in the background without being suspended. However, it is limited and only available for a limited number of application types, including applications

• running a single, finite-length task (such as completing a download of content from a network);
• receiving notifications of an event occurring (such as a new email message);
• with long-running background tasks (such as an audio player).

Apple probably limits multitasking due to battery life and memory use concerns. The CPU certainly has the features to support multitasking, but Apple chooses to not take advantage of some of them in order to better manage resource use.

Android does not place such constraints on the types of applications that can run in the background. If an application requires processing while in the background, the application must use a service, a separate application component that runs on behalf of the background process. Consider a streaming audio application: if the application moves to the background, the service continues to send audio files to the audio device driver on behalf of the background application. In fact, the service will continue to run even if the background application is suspended. Services do not have a user interface and have a small memory footprint, thus providing an efficient technique for multitasking in a mobile environment.

3.3 Operations on Processes

The processes in most systems can execute concurrently, and they may be created and deleted dynamically. Thus, these systems must provide a mechanism for process creation and termination. In this section, we explore the mechanisms involved in creating processes and illustrate process creation on UNIX and Windows systems.
3.3.1 Process Creation

During the course of execution, a process may create several new processes. As mentioned earlier, the creating process is called a parent process, and the new processes are called the children of that process. Each of these new processes may in turn create other processes, forming a tree of processes.

Most operating systems (including UNIX, Linux, and Windows) identify processes according to a unique process identifier (or pid), which is typically an integer number. The pid provides a unique value for each process in the system, and it can be used as an index to access various attributes of a process within the kernel.

Figure 3.8 illustrates a typical process tree for the Linux operating system, showing the name of each process and its pid. (We use the term process rather loosely, as Linux prefers the term task instead.) The init process (which always has a pid of 1) serves as the root parent process for all user processes. Once the system has booted, the init process can also create various user processes, such as a web or print server, an ssh server, and the like. In Figure 3.8, we see two children of init—kthreadd and sshd. The kthreadd process is responsible for creating additional processes that perform tasks on behalf of the kernel (in this situation, khelper and pdflush). The sshd process is responsible for managing clients that connect to the system by using ssh (which is short for secure shell). The login process is responsible for managing clients that directly log onto the system. In this example, a client has logged on and is using the bash shell, which has been assigned pid 8416. Using the bash command-line interface, this user has created the process ps as well as the emacs editor.

On UNIX and Linux systems, we can obtain a listing of processes by using the ps command. For example, the command

    ps -el

will list complete information for all processes currently active in the system. It is easy to construct a process tree similar to the one shown in Figure 3.8 by recursively tracing parent processes all the way to the init process.

Figure 3.8 A tree of processes on a typical Linux system:

    init (pid = 1)
    ├── login (pid = 8415)
    │   └── bash (pid = 8416)
    │       ├── ps (pid = 9298)
    │       └── emacs (pid = 9204)
    ├── kthreadd (pid = 2)
    │   ├── khelper (pid = 6)
    │   └── pdflush (pid = 200)
    └── sshd (pid = 3028)
        └── sshd (pid = 3610)
            └── tcsch (pid = 4005)
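A program can inspect its own position in this tree: getpid() returns the caller's pid, and getppid() the parent's. A small sketch that prints one parent–child edge:

    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        pid_t pid = fork();

        if (pid < 0) {              /* fork failed */
            perror("fork");
            return 1;
        }
        else if (pid == 0) {        /* child: report both ends of the edge */
            printf("child: pid = %d, parent = %d\n", getpid(), getppid());
        }
        else {                      /* parent: pid holds the child's id */
            printf("parent: pid = %d, child = %d\n", getpid(), pid);
            wait(NULL);             /* reap the child */
        }
        return 0;
    }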
In general, when a process creates a child process, that child process will need certain resources (CPU time, memory, files, I/O devices) to accomplish its task. A child process may be able to obtain its resources directly from the operating system, or it may be constrained to a subset of the resources of the parent process. The parent may have to partition its resources among its children, or it may be able to share some resources (such as memory or files) among several of its children. Restricting a child process to a subset of the parent's resources prevents any process from overloading the system by creating too many child processes.

In addition to supplying various physical and logical resources, the parent process may pass along initialization data (input) to the child process. For example, consider a process whose function is to display the contents of a file—say, image.jpg—on the screen of a terminal. When the process is created, it will get, as an input from its parent process, the name of the file image.jpg. Using that file name, it will open the file and write the contents out. It may also get the name of the output device. Alternatively, some operating systems pass resources to child processes. On such a system, the new process may get two open files, image.jpg and the terminal device, and may simply transfer the datum between the two.

When a process creates a new process, two possibilities for execution exist:

1. The parent continues to execute concurrently with its children.
2. The parent waits until some or all of its children have terminated.

There are also two address-space possibilities for the new process:

1. The child process is a duplicate of the parent process (it has the same program and data as the parent).
2. The child process has a new program loaded into it.

To illustrate these differences, let's first consider the UNIX operating system. In UNIX, as we've seen, each process is identified by its process identifier, which is a unique integer. A new process is created by the fork() system call. The new process consists of a copy of the address space of the original process. This mechanism allows the parent process to communicate easily with its child process. Both processes (the parent and the child) continue execution at the instruction after the fork(), with one difference: the return code for the fork() is zero for the new (child) process, whereas the (nonzero) process identifier of the child is returned to the parent.

After a fork() system call, one of the two processes typically uses the exec() system call to replace the process's memory space with a new program. The exec() system call loads a binary file into memory (destroying the memory image of the program containing the exec() system call) and starts its execution. In this manner, the two processes are able to communicate and then go their separate ways. The parent can then create more children; or, if it has nothing else to do while the child runs, it can issue a wait() system call to move itself off the ready queue until the termination of the child.
Because the call to exec() overlays the process's address space with a new program, the call to exec() does not return control unless an error occurs.

The C program shown in Figure 3.9 illustrates the UNIX system calls previously described.

    #include <sys/types.h>
    #include <sys/wait.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        pid_t pid;

        /* fork a child process */
        pid = fork();

        if (pid < 0) {          /* error occurred */
            fprintf(stderr, "Fork Failed");
            return 1;
        }
        else if (pid == 0) {    /* child process */
            execlp("/bin/ls", "ls", NULL);
        }
        else {                  /* parent process */
            /* parent will wait for the child to complete */
            wait(NULL);
            printf("Child Complete");
        }

        return 0;
    }

    Figure 3.9 Creating a separate process using the UNIX fork() system call.

We now have two different processes running copies of the same program. The only difference is that the value of pid (the process identifier) for the child process is zero, while that for the parent is an integer value greater than zero (in fact, it is the actual pid of the child process). The child process inherits privileges and scheduling attributes from the parent, as well as certain resources, such as open files. The child process then overlays its address space with the UNIX command /bin/ls (used to get a directory listing) using the execlp() system call (execlp() is a version of the exec() system call). The parent waits for the child process to complete with the wait() system call. When the child process completes (by either implicitly or explicitly invoking exit()), the parent process resumes from the call to wait(), where it completes using the exit() system call. This is also illustrated in Figure 3.10.

Of course, there is nothing to prevent the child from not invoking exec() and instead continuing to execute as a copy of the parent process. In this scenario, the parent and child are concurrent processes running the same code instructions. Because the child is a copy of the parent, each process has its own copy of any data.
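That last point is easy to demonstrate. In the sketch below, the child's update to a global variable is invisible to the parent, because fork() gave each process its own copy of the data section:

    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int value = 5;   /* copied into the child's address space at fork() */

    int main(void) {
        pid_t pid = fork();

        if (pid == 0) {             /* child: modifies its own copy */
            value += 15;
            printf("CHILD: value = %d\n", value);    /* prints 20 */
        }
        else if (pid > 0) {         /* parent: its copy is untouched */
            wait(NULL);
            printf("PARENT: value = %d\n", value);   /* prints 5 */
        }
        return 0;
    }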
[Figure 3.10: Process creation using the fork() system call. The parent calls fork(); the child (pid = 0) calls exec() and then exit(), while the parent (pid > 0) calls wait() and resumes when the child terminates.]

As an alternative example, we next consider process creation in Windows. Processes are created in the Windows API using the CreateProcess() function, which is similar to fork() in that a parent creates a new child process. However, whereas fork() has the child process inheriting the address space of its parent, CreateProcess() requires loading a specified program into the address space of the child process at process creation. Furthermore, whereas fork() is passed no parameters, CreateProcess() expects no fewer than ten parameters.

The C program shown in Figure 3.11 illustrates the CreateProcess() function, which creates a child process that loads the application mspaint.exe. We opt for many of the default values of the ten parameters passed to CreateProcess(). Readers interested in pursuing the details of process creation and management in the Windows API are encouraged to consult the bibliographical notes at the end of this chapter.

The two parameters passed to the CreateProcess() function are instances of the STARTUPINFO and PROCESS_INFORMATION structures. STARTUPINFO specifies many properties of the new process, such as window size and appearance and handles to standard input and output files. The PROCESS_INFORMATION structure contains a handle and the identifiers to the newly created process and its thread. We invoke the ZeroMemory() function to allocate memory for each of these structures before proceeding with CreateProcess().

The first two parameters passed to CreateProcess() are the application name and command-line parameters. If the application name is NULL (as it is in this case), the command-line parameter specifies the application to load. In this instance, we are loading the Microsoft Windows mspaint.exe application. Beyond these two initial parameters, we use the default parameters for inheriting process and thread handles as well as specifying that there will be no creation flags. We also use the parent's existing environment block and starting directory. Last, we provide two pointers to the STARTUPINFO and PROCESS_INFORMATION structures created at the beginning of the program. In Figure 3.9, the parent process waits for the child to complete by invoking the wait() system call. The equivalent of this in Windows is WaitForSingleObject(), which is passed a handle of the child process—pi.hProcess—and waits for this process to complete. Once the child process exits, control returns from the WaitForSingleObject() function in the parent process.
    #include <stdio.h>
    #include <windows.h>

    int main(VOID)
    {
        STARTUPINFO si;
        PROCESS_INFORMATION pi;

        /* allocate memory */
        ZeroMemory(&si, sizeof(si));
        si.cb = sizeof(si);
        ZeroMemory(&pi, sizeof(pi));

        /* create child process */
        if (!CreateProcess(NULL,  /* use command line */
            "C:\\WINDOWS\\system32\\mspaint.exe", /* command */
            NULL,   /* don't inherit process handle */
            NULL,   /* don't inherit thread handle */
            FALSE,  /* disable handle inheritance */
            0,      /* no creation flags */
            NULL,   /* use parent's environment block */
            NULL,   /* use parent's existing directory */
            &si,
            &pi))
        {
            fprintf(stderr, "Create Process Failed");
            return -1;
        }

        /* parent will wait for the child to complete */
        WaitForSingleObject(pi.hProcess, INFINITE);
        printf("Child Complete");

        /* close handles */
        CloseHandle(pi.hProcess);
        CloseHandle(pi.hThread);
    }

    Figure 3.11 Creating a separate process using the Windows API.

3.3.2 Process Termination

A process terminates when it finishes executing its final statement and asks the operating system to delete it by using the exit() system call. At that point, the process may return a status value (typically an integer) to its parent process (via the wait() system call). All the resources of the process—including physical and virtual memory, open files, and I/O buffers—are deallocated by the operating system.

Termination can occur in other circumstances as well. A process can cause the termination of another process via an appropriate system call (for example, TerminateProcess() in Windows).
Usually, such a system call can be invoked only by the parent of the process that is to be terminated. Otherwise, users could arbitrarily kill each other's jobs. Note that a parent needs to know the identities of its children if it is to terminate them. Thus, when one process creates a new process, the identity of the newly created process is passed to the parent.

A parent may terminate the execution of one of its children for a variety of reasons, such as these:

• The child has exceeded its usage of some of the resources that it has been allocated. (To determine whether this has occurred, the parent must have a mechanism to inspect the state of its children.)
• The task assigned to the child is no longer required.
• The parent is exiting, and the operating system does not allow a child to continue if its parent terminates.

Some systems do not allow a child to exist if its parent has terminated. In such systems, if a process terminates (either normally or abnormally), then all its children must also be terminated. This phenomenon, referred to as cascading termination, is normally initiated by the operating system.

To illustrate process execution and termination, consider that, in Linux and UNIX systems, we can terminate a process by using the exit() system call, providing an exit status as a parameter:

    /* exit with status 1 */
    exit(1);

In fact, under normal termination, exit() may be called either directly (as shown above) or indirectly (by a return statement in main()).

A parent process may wait for the termination of a child process by using the wait() system call. The wait() system call is passed a parameter that allows the parent to obtain the exit status of the child. This system call also returns the process identifier of the terminated child so that the parent can tell which of its children has terminated:

    pid_t pid;
    int status;

    pid = wait(&status);

When a process terminates, its resources are deallocated by the operating system. However, its entry in the process table must remain there until the parent calls wait(), because the process table contains the process's exit status. A process that has terminated, but whose parent has not yet called wait(), is known as a zombie process. All processes transition to this state when they terminate, but generally they exist as zombies only briefly. Once the parent calls wait(), the process identifier of the zombie process and its entry in the process table are released.

Now consider what would happen if a parent did not invoke wait() and instead terminated, thereby leaving its child processes as orphans. Linux and UNIX address this scenario by assigning the init process as the new parent to orphan processes.
(Recall from Figure 3.8 that the init process is the root of the process hierarchy in UNIX and Linux systems.) The init process periodically invokes wait(), thereby allowing the exit status of any orphaned process to be collected and releasing the orphan's process identifier and process-table entry.

3.4 Interprocess Communication

Processes executing concurrently in the operating system may be either independent processes or cooperating processes. A process is independent if it cannot affect or be affected by the other processes executing in the system. Any process that does not share data with any other process is independent. A process is cooperating if it can affect or be affected by the other processes executing in the system. Clearly, any process that shares data with other processes is a cooperating process.

There are several reasons for providing an environment that allows process cooperation:

• Information sharing. Since several users may be interested in the same piece of information (for instance, a shared file), we must provide an environment to allow concurrent access to such information.
• Computation speedup. If we want a particular task to run faster, we must break it into subtasks, each of which will be executing in parallel with the others. Notice that such a speedup can be achieved only if the computer has multiple processing cores.
• Modularity. We may want to construct the system in a modular fashion, dividing the system functions into separate processes or threads, as we discussed in Chapter 2.
• Convenience. Even an individual user may work on many tasks at the same time. For instance, a user may be editing, listening to music, and compiling in parallel.

Cooperating processes require an interprocess communication (IPC) mechanism that will allow them to exchange data and information. There are two fundamental models of interprocess communication: shared memory and message passing. In the shared-memory model, a region of memory that is shared by cooperating processes is established. Processes can then exchange information by reading and writing data to the shared region. In the message-passing model, communication takes place by means of messages exchanged between the cooperating processes. The two communications models are contrasted in Figure 3.12.

Both of the models just mentioned are common in operating systems, and many systems implement both. Message passing is useful for exchanging smaller amounts of data, because no conflicts need be avoided. Message passing is also easier to implement in a distributed system than shared memory. (Although there are systems that provide distributed shared memory, we do not consider them in this text.) Shared memory can be faster than message passing, since message-passing systems are typically implemented using system calls and thus require the more time-consuming task of kernel intervention. In shared-memory systems, system calls are required only to establish shared-memory regions.
MULTIPROCESS ARCHITECTURE—CHROME BROWSER

Many websites contain active content such as JavaScript, Flash, and HTML5 to provide a rich and dynamic web-browsing experience. Unfortunately, these web applications may also contain software bugs, which can result in sluggish response times and can even cause the web browser to crash. This isn't a big problem in a web browser that displays content from only one website. But most contemporary web browsers provide tabbed browsing, which allows a single instance of a web browser application to open several websites at the same time, with each site in a separate tab. To switch between the different sites, a user need only click on the appropriate tab.

A problem with this approach is that if a web application in any tab crashes, the entire process—including all other tabs displaying additional websites—crashes as well. Google's Chrome web browser was designed to address this issue by using a multiprocess architecture. Chrome identifies three different types of processes: browser, renderers, and plug-ins.

• The browser process is responsible for managing the user interface as well as disk and network I/O. A new browser process is created when Chrome is started. Only one browser process is created.
• Renderer processes contain logic for rendering web pages. Thus, they contain the logic for handling HTML, JavaScript, images, and so forth. As a general rule, a new renderer process is created for each website opened in a new tab, and so several renderer processes may be active at the same time.
• A plug-in process is created for each type of plug-in (such as Flash or QuickTime) in use. Plug-in processes contain the code for the plug-in as well as additional code that enables the plug-in to communicate with associated renderer processes and the browser process.

The advantage of the multiprocess approach is that websites run in isolation from one another. If one website crashes, only its renderer process is affected; all other processes remain unharmed. Furthermore, renderer processes run in a sandbox, which means that access to disk and network I/O is restricted, minimizing the effects of any security exploits.
Once shared memory is established, all accesses are treated as routine memory accesses, and no assistance from the kernel is required.

[Figure 3.12: Communications models. (a) Message passing: processes A and B exchange messages m0, m1, ..., mn through a message queue in the kernel. (b) Shared memory: processes A and B both map a shared region, and the kernel is not involved in each access.]

Recent research on systems with several processing cores indicates that message passing provides better performance than shared memory on such systems. Shared memory suffers from cache coherency issues, which arise because shared data migrate among the several caches. As the number of processing cores on systems increases, it is possible that we will see message passing as the preferred mechanism for IPC.

In the remainder of this section, we explore shared-memory and message-passing systems in more detail.

3.4.1 Shared-Memory Systems

Interprocess communication using shared memory requires communicating processes to establish a region of shared memory. Typically, a shared-memory region resides in the address space of the process creating the shared-memory segment. Other processes that wish to communicate using this shared-memory segment must attach it to their address space. Recall that, normally, the operating system tries to prevent one process from accessing another process's memory. Shared memory requires that two or more processes agree to remove this restriction. They can then exchange information by reading and writing data in the shared areas. The form of the data and the location are determined by these processes and are not under the operating system's control. The processes are also responsible for ensuring that they are not writing to the same location simultaneously.

To illustrate the concept of cooperating processes, let's consider the producer–consumer problem, which is a common paradigm for cooperating processes. A producer process produces information that is consumed by a consumer process. For example, a compiler may produce assembly code that is consumed by an assembler. The assembler, in turn, may produce object modules that are consumed by the loader.
The producer–consumer problem also provides a useful metaphor for the client–server paradigm. We generally think of a server as a producer and a client as a consumer. For example, a web server produces (that is, provides) HTML files and images, which are consumed (that is, read) by the client web browser requesting the resource.

One solution to the producer–consumer problem uses shared memory. To allow producer and consumer processes to run concurrently, we must have available a buffer of items that can be filled by the producer and emptied by the consumer. This buffer will reside in a region of memory that is shared by the producer and consumer processes. A producer can produce one item while the consumer is consuming another item. The producer and consumer must be synchronized, so that the consumer does not try to consume an item that has not yet been produced.

Two types of buffers can be used. The unbounded buffer places no practical limit on the size of the buffer. The consumer may have to wait for new items, but the producer can always produce new items. The bounded buffer assumes a fixed buffer size. In this case, the consumer must wait if the buffer is empty, and the producer must wait if the buffer is full.

Let's look more closely at how the bounded buffer illustrates interprocess communication using shared memory. The following variables reside in a region of memory shared by the producer and consumer processes:

    #define BUFFER_SIZE 10

    typedef struct {
        . . .
    } item;

    item buffer[BUFFER_SIZE];
    int in = 0;
    int out = 0;

The shared buffer is implemented as a circular array with two logical pointers: in and out. The variable in points to the next free position in the buffer; out points to the first full position in the buffer. The buffer is empty when in == out; the buffer is full when ((in + 1) % BUFFER_SIZE) == out.

The code for the producer process is shown in Figure 3.13, and the code for the consumer process is shown in Figure 3.14.

    item next_produced;

    while (true) {
        /* produce an item in next_produced */

        while (((in + 1) % BUFFER_SIZE) == out)
            ; /* do nothing */

        buffer[in] = next_produced;
        in = (in + 1) % BUFFER_SIZE;
    }

    Figure 3.13 The producer process using shared memory.
    item next_consumed;

    while (true) {
        while (in == out)
            ; /* do nothing */

        next_consumed = buffer[out];
        out = (out + 1) % BUFFER_SIZE;

        /* consume the item in next_consumed */
    }

    Figure 3.14 The consumer process using shared memory.

The producer process has a local variable next_produced in which the new item to be produced is stored. The consumer process has a local variable next_consumed in which the item to be consumed is stored. This scheme allows at most BUFFER_SIZE − 1 items in the buffer at the same time. We leave it as an exercise for you to provide a solution in which BUFFER_SIZE items can be in the buffer at the same time. In Section 3.5.1, we illustrate the POSIX API for shared memory.

One issue this illustration does not address concerns the situation in which both the producer process and the consumer process attempt to access the shared buffer concurrently. In Chapter 5, we discuss how synchronization among cooperating processes can be implemented effectively in a shared-memory environment.

3.4.2 Message-Passing Systems

In Section 3.4.1, we showed how cooperating processes can communicate in a shared-memory environment. The scheme requires that these processes share a region of memory and that the code for accessing and manipulating the shared memory be written explicitly by the application programmer. Another way to achieve the same effect is for the operating system to provide the means for cooperating processes to communicate with each other via a message-passing facility.

Message passing provides a mechanism to allow processes to communicate and to synchronize their actions without sharing the same address space. It is particularly useful in a distributed environment, where the communicating processes may reside on different computers connected by a network. For example, an Internet chat program could be designed so that chat participants communicate with one another by exchanging messages.

A message-passing facility provides at least two operations:

    send(message)
    receive(message)

Messages sent by a process can be either fixed or variable in size. If only fixed-sized messages can be sent, the system-level implementation is straightforward. This restriction, however, makes the task of programming more difficult. Conversely, variable-sized messages require a more complex system-level implementation, but the programming task becomes simpler. This is a common kind of tradeoff seen throughout operating-system design.
If processes P and Q want to communicate, they must send messages to and receive messages from each other: a communication link must exist between them. This link can be implemented in a variety of ways. We are concerned here not with the link's physical implementation (such as shared memory, hardware bus, or network, which are covered in Chapter 17) but rather with its logical implementation. Here are several methods for logically implementing a link and the send()/receive() operations:

• Direct or indirect communication
• Synchronous or asynchronous communication
• Automatic or explicit buffering

We look at issues related to each of these features next.

3.4.2.1 Naming

Processes that want to communicate must have a way to refer to each other. They can use either direct or indirect communication.

Under direct communication, each process that wants to communicate must explicitly name the recipient or sender of the communication. In this scheme, the send() and receive() primitives are defined as:

• send(P, message)—Send a message to process P.
• receive(Q, message)—Receive a message from process Q.

A communication link in this scheme has the following properties:

• A link is established automatically between every pair of processes that want to communicate. The processes need to know only each other's identity to communicate.
• A link is associated with exactly two processes.
• Between each pair of processes, there exists exactly one link.

This scheme exhibits symmetry in addressing; that is, both the sender process and the receiver process must name the other to communicate. A variant of this scheme employs asymmetry in addressing. Here, only the sender names the recipient; the recipient is not required to name the sender. In this scheme, the send() and receive() primitives are defined as follows:

• send(P, message)—Send a message to process P.
• receive(id, message)—Receive a message from any process. The variable id is set to the name of the process with which communication has taken place.
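These primitives are abstractions, not POSIX calls. An ordinary UNIX pipe, though, gives a concrete feel for a link between one known pair of processes; in this loose analogue (not an implementation of the naming scheme above), the parent plays the sender and the child the receiver:

    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        int fd[2];      /* fd[0] is the read end, fd[1] the write end */
        char buf[64];

        if (pipe(fd) == -1)
            return 1;

        if (fork() == 0) {               /* child: the receiver */
            close(fd[1]);
            ssize_t n = read(fd[0], buf, sizeof(buf) - 1); /* blocks */
            if (n > 0) {
                buf[n] = '\0';
                printf("received: %s\n", buf);
            }
        } else {                         /* parent: the sender */
            close(fd[0]);
            const char *msg = "hello";
            write(fd[1], msg, strlen(msg));
            close(fd[1]);
            wait(NULL);
        }
        return 0;
    }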
The disadvantage in both of these schemes (symmetric and asymmetric) is the limited modularity of the resulting process definitions. Changing the identifier of a process may necessitate examining all other process definitions. All references to the old identifier must be found, so that they can be modified to the new identifier. In general, any such hard-coding techniques, where identifiers must be explicitly stated, are less desirable than techniques involving indirection, as described next.

With indirect communication, the messages are sent to and received from mailboxes, or ports. A mailbox can be viewed abstractly as an object into which messages can be placed by processes and from which messages can be removed. Each mailbox has a unique identification. For example, POSIX message queues use an integer value to identify a mailbox. A process can communicate with another process via a number of different mailboxes, but two processes can communicate only if they have a shared mailbox. The send() and receive() primitives are defined as follows:

• send(A, message)—Send a message to mailbox A.
• receive(A, message)—Receive a message from mailbox A.

In this scheme, a communication link has the following properties:

• A link is established between a pair of processes only if both members of the pair have a shared mailbox.
• A link may be associated with more than two processes.
• Between each pair of communicating processes, a number of different links may exist, with each link corresponding to one mailbox.

Now suppose that processes P1, P2, and P3 all share mailbox A. Process P1 sends a message to A, while both P2 and P3 execute a receive() from A. Which process will receive the message sent by P1? The answer depends on which of the following methods we choose:

• Allow a link to be associated with two processes at most.
• Allow at most one process at a time to execute a receive() operation.
• Allow the system to select arbitrarily which process will receive the message (that is, either P2 or P3, but not both, will receive the message). The system may define an algorithm for selecting which process will receive the message (for example, round robin, where processes take turns receiving messages). The system may identify the receiver to the sender.

A mailbox may be owned either by a process or by the operating system. If the mailbox is owned by a process (that is, the mailbox is part of the address space of the process), then we distinguish between the owner (which can only receive messages through this mailbox) and the user (which can only send messages to the mailbox). Since each mailbox has a unique owner, there can be no confusion about which process should receive a message sent to this mailbox. When a process that owns a mailbox terminates, the mailbox disappears. Any process that subsequently sends a message to this mailbox must be notified that the mailbox no longer exists.
In contrast, a mailbox that is owned by the operating system has an existence of its own. It is independent and is not attached to any particular process. The operating system then must provide a mechanism that allows a process to do the following:

• Create a new mailbox.
• Send and receive messages through the mailbox.
• Delete a mailbox.

The process that creates a new mailbox is that mailbox's owner by default. Initially, the owner is the only process that can receive messages through this mailbox. However, the ownership and receiving privilege may be passed to other processes through appropriate system calls. Of course, this provision could result in multiple receivers for each mailbox.

3.4.2.2 Synchronization

Communication between processes takes place through calls to send() and receive() primitives. There are different design options for implementing each primitive. Message passing may be either blocking or nonblocking—also known as synchronous and asynchronous. (Throughout this text, you will encounter the concepts of synchronous and asynchronous behavior in relation to various operating-system algorithms.)

• Blocking send. The sending process is blocked until the message is received by the receiving process or by the mailbox.
• Nonblocking send. The sending process sends the message and resumes operation.
• Blocking receive. The receiver blocks until a message is available.
• Nonblocking receive. The receiver retrieves either a valid message or a null.

Different combinations of send() and receive() are possible. When both send() and receive() are blocking, we have a rendezvous between the sender and the receiver. The solution to the producer–consumer problem becomes trivial when we use blocking send() and receive() statements. The producer merely invokes the blocking send() call and waits until the message is delivered to either the receiver or the mailbox. Likewise, when the consumer invokes receive(), it blocks until a message is available. This is illustrated in Figures 3.15 and 3.16.

    message next_produced;

    while (true) {
        /* produce an item in next_produced */

        send(next_produced);
    }

    Figure 3.15 The producer process using message passing.

    message next_consumed;

    while (true) {
        receive(next_consumed);

        /* consume the item in next_consumed */
    }

    Figure 3.16 The consumer process using message passing.

3.4.2.3 Buffering

Whether communication is direct or indirect, messages exchanged by communicating processes reside in a temporary queue. Basically, such queues can be implemented in three ways:
• Zero capacity. The queue has a maximum length of zero; thus, the link cannot have any messages waiting in it. In this case, the sender must block until the recipient receives the message.
• Bounded capacity. The queue has finite length n; thus, at most n messages can reside in it. If the queue is not full when a new message is sent, the message is placed in the queue (either the message is copied or a pointer to the message is kept), and the sender can continue execution without waiting. The link's capacity is finite, however. If the link is full, the sender must block until space is available in the queue.
• Unbounded capacity. The queue's length is potentially infinite; thus, any number of messages can wait in it. The sender never blocks.

The zero-capacity case is sometimes referred to as a message system with no buffering. The other cases are referred to as systems with automatic buffering.

3.5 Examples of IPC Systems

In this section, we explore three different IPC systems. We first cover the POSIX API for shared memory and then discuss message passing in the Mach operating system. We conclude with Windows, which interestingly uses shared memory as a mechanism for providing certain types of message passing.

3.5.1 An Example: POSIX Shared Memory

Several IPC mechanisms are available for POSIX systems, including shared memory and message passing. Here, we explore the POSIX API for shared memory. POSIX shared memory is organized using memory-mapped files, which associate the region of shared memory with a file. A process must first create a shared-memory object.
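The rest of the chapter's POSIX walkthrough does not appear in this excerpt, but a hedged sketch of the standard calls shows the shape of the API: shm_open() creates or opens the object, ftruncate() sizes it, and mmap() maps it into the address space, after which ordinary memory writes are visible to every process that maps the same name. The object name below is illustrative, and error handling is trimmed.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void) {
        const char *name = "/shm-example";  /* illustrative object name */
        const int SIZE = 4096;

        /* create the shared-memory object and set its size */
        int fd = shm_open(name, O_CREAT | O_RDWR, 0666);
        ftruncate(fd, SIZE);

        /* map the object into this process's address space */
        char *ptr = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);

        /* an ordinary memory write, visible to other processes that
           map the same object */
        snprintf(ptr, SIZE, "Hello from pid %d", getpid());

        /* a reader would shm_open() the same name and mmap() it;
           shm_unlink(name) removes the object when all are done */
        return 0;
    }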
Programming Projects

Part II—Creating a History Feature

The next task is to modify the shell interface program so that it provides a history feature that allows the user to access the most recently entered commands. The user will be able to access up to 10 commands by using the feature. The commands will be consecutively numbered starting at 1, and the numbering will continue past 10. For example, if the user has entered 35 commands, the 10 most recent commands will be numbered 26 to 35.

The user will be able to list the command history by entering the command

    history

at the osh> prompt. As an example, assume that the history consists of the commands (from most to least recent):

    ps, ls -l, top, cal, who, date

The command history will output:

    6 ps
    5 ls -l
    4 top
    3 cal
    2 who
    1 date

Your program should support two techniques for retrieving commands from the command history:

1. When the user enters !!, the most recent command in the history is executed.
2. When the user enters a single ! followed by an integer N, the Nth command in the history is executed.

Continuing our example from above, if the user enters !!, the ps command will be performed; if the user enters !3, the command cal will be executed. Any command executed in this fashion should be echoed on the user's screen. The command should also be placed in the history buffer as the next command.

The program should also manage basic error handling. If there are no commands in the history, entering !! should result in a message "No commands in history." If there is no command corresponding to the number entered with the single !, the program should output "No such command in history."

Project 2—Linux Kernel Module for Listing Tasks

In this project, you will write a kernel module that lists all current tasks in a Linux system. Be sure to review the programming project in Chapter 2, which deals with creating Linux kernel modules, before you begin this project. The project can be completed using the Linux virtual machine provided with this text.
Part I—Iterating over Tasks Linearly

As illustrated in Section 3.1, the PCB in Linux is represented by the structure task_struct, which is found in the <linux/sched.h> include file. In Linux, the for_each_process() macro easily allows iteration over all current tasks in the system:

   #include <linux/sched.h>

   struct task_struct *task;

   for_each_process(task) {
      /* on each iteration task points to the next task */
   }

The various fields in task_struct can then be displayed as the program loops through the for_each_process() macro.

Part I Assignment

Design a kernel module that iterates through all tasks in the system using the for_each_process() macro. In particular, output the task name (known as executable name), state, and process id of each task. (You will probably have to read through the task_struct structure in <linux/sched.h> to obtain the names of these fields.) Write this code in the module entry point so that its contents will appear in the kernel log buffer, which can be viewed using the dmesg command. To verify that your code is working correctly, compare the contents of the kernel log buffer with the output of the following command, which lists all tasks in the system:

   ps -el

The two values should be very similar. Because tasks are dynamic, however, it is possible that a few tasks may appear in one listing but not the other.

Part II—Iterating over Tasks with a Depth-First Search Tree

The second portion of this project involves iterating over all tasks in the system using a depth-first search (DFS) tree. (As an example: the DFS iteration of the processes in Figure 3.8 is 1, 8415, 8416, 9298, 9204, 2, 6, 200, 3028, 3610, 4005.) Linux maintains its process tree as a series of lists. Examining the task_struct in <linux/sched.h>, we see two struct list_head objects: children and sibling.
These objects are pointers to a list of the task’s children, as well as its siblings. Linux also maintains references to the init task (struct task_struct init_task). Using this information as well as macro operations on lists, we can iterate over the children of init as follows:

   struct task_struct *task;
   struct list_head *list;

   list_for_each(list, &init_task->children) {
      task = list_entry(list, struct task_struct, sibling);
      /* task points to the next child in the list */
   }

The list_for_each() macro is passed two parameters, both of type struct list_head:

• A pointer to the head of the list to be traversed
• A pointer to the head node of the list to be traversed

At each iteration of list_for_each(), the first parameter is set to the list structure of the next child. We then use this value to obtain each structure in the list using the list_entry() macro.

Part II Assignment

Beginning from the init task, design a kernel module that iterates over all tasks in the system using a DFS tree. Just as in the first part of this project, output the name, state, and pid of each task. Perform this iteration in the kernel entry module so that its output appears in the kernel log buffer. If you output all tasks in the system, you may see many more tasks than appear with the ps -ael command. This is because some threads appear as children but do not show up as ordinary processes. Therefore, to check the output of the DFS tree, use the command

   ps -eLf

This command lists all tasks—including threads—in the system. To verify that you have indeed performed an appropriate DFS iteration, you will have to examine the relationships among the various tasks output by the ps command. (An illustrative sketch of the kind of module described in Part I appears after the bibliographical notes below.)

Bibliographical Notes

Process creation, management, and IPC in UNIX and Windows systems, respectively, are discussed in [Robbins and Robbins (2003)] and [Russinovich and Solomon (2009)]. [Love (2010)] covers support for processes in the Linux kernel, and [Hart (2005)] covers Windows systems programming in detail. Coverage of the multiprocess model used in Google’s Chrome can be found at http://blog.chromium.org/2008/09/multi-process-architecture.html.
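For reference, here is a minimal, hedged sketch of the kind of module Part I of the programming project asks for. It assumes the kernel version used with this text's virtual machine (on newer kernels, for_each_process() lives in <linux/sched/signal.h> and the state field has been renamed), it omits locking around the task list, and the module name task_lister is a hypothetical choice for this sketch.

   #include <linux/init.h>
   #include <linux/kernel.h>
   #include <linux/module.h>
   #include <linux/sched.h>

   /* module entry point: print each task's name, state, and pid */
   static int __init task_lister_init(void) {
      struct task_struct *task;

      for_each_process(task) {
         printk(KERN_INFO "%s [%ld] pid=%d\n",
                task->comm, task->state, task->pid);
      }
      return 0;
   }

   static void __exit task_lister_exit(void) {
      printk(KERN_INFO "task lister removed\n");
   }

   module_init(task_lister_init);
   module_exit(task_lister_exit);
   MODULE_LICENSE("GPL");

After inserting the module with insmod, the listing can be compared against ps -el as the assignment describes.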
  • 88. 4 C H A P T E R Threads The process model introduced in Chapter 3 assumed that a process was an executing program with a single thread of control. Virtually all modern operating systems, however, provide features enabling a process to contain multiple threads of control. In this chapter, we introduce many concepts associated with multithreaded computer systems, including a discussion of the APIs for the Pthreads, Windows, and Java thread libraries. We look at a number of issues related to multithreaded programming and its effect on the design of operating systems. Finally, we explore how the Windows and Linux operating systems support threads at the kernel level. CHAPTER OBJECTIVES • To introduce the notion of a thread—a fundamental unit of CPU utilization that forms the basis of multithreaded computer systems. • To discuss the APIs for the Pthreads, Windows, and Java thread libraries. • To explore several strategies that provide implicit threading. • To examine issues related to multithreaded programming. • To cover operating system support for threads in Windows and Linux. 4.1 Overview A thread is a basic unit of CPU utilization; it comprises a thread ID, a program counter, a register set, and a stack. It shares with other threads belonging to the same process its code section, data section, and other operating-system resources, such as open files and signals. A traditional (or heavyweight) process has a single thread of control. If a process has multiple threads of control, it can perform more than one task at a time. Figure 4.1 illustrates the difference between a traditional single-threaded process and a multithreaded process. 4.1.1 Motivation Most software applications that run on modern computers are multithreaded. An application typically is implemented as a separate process with several 163
  • 89. 164 Chapter 4 Threads registers code data files stack registers registers registers code data files stack stack stack thread thread single-threaded process multithreaded process Figure 4.1 Single-threaded and multithreaded processes. threads of control. A web browser might have one thread display images or text while another thread retrieves data from the network, for example. A word processor may have a thread for displaying graphics, another thread for responding to keystrokes from the user, and a third thread for performing spelling and grammar checking in the background. Applications can also be designed to leverage processing capabilities on multicore systems. Such applications can perform several CPU-intensive tasks in parallel across the multiple computing cores. In certain situations, a single application may be required to perform several similar tasks. For example, a web server accepts client requests for web pages, images, sound, and so forth. A busy web server may have several (perhaps thousands of) clients concurrently accessing it. If the web server ran as a traditional single-threaded process, it would be able to service only one client at a time, and a client might have to wait a very long time for its request to be serviced. One solution is to have the server run as a single process that accepts requests. When the server receives a request, it creates a separate process to service that request. In fact, this process-creation method was in common use before threads became popular. Process creation is time consuming and resource intensive, however. If the new process will perform the same tasks as the existing process, why incur all that overhead? It is generally more efficient to use one process that contains multiple threads. If the web-server process is multithreaded, the server will create a separate thread that listens for client requests. When a request is made, rather than creating another process, the server creates a new thread to service the request and resume listening for additional requests. This is illustrated in Figure 4.2. Threads also play a vital role in remote procedure call (RPC) systems. Recall from Chapter 3 that RPCs allow interprocess communication by providing a communication mechanism similar to ordinary function or procedure calls. Typically, RPC servers are multithreaded. When a server receives a message, it
  • 90. 4.1 Overview 165 client (1) request (2) create new thread to service the request (3) resume listening for additional client requests server thread Figure 4.2 Multithreaded server architecture. services the message using a separate thread. This allows the server to service several concurrent requests. Finally, most operating-system kernels are now multithreaded. Several threads operate in the kernel, and each thread performs a specific task, such as managing devices, managing memory, or interrupt handling. For example, Solaris has a set of threads in the kernel specifically for interrupt handling; Linux uses a kernel thread for managing the amount of free memory in the system. 4.1.2 Benefits The benefits of multithreaded programming can be broken down into four major categories: 1. Responsiveness. Multithreading an interactive application may allow a program to continue running even if part of it is blocked or is performing a lengthy operation, thereby increasing responsiveness to the user. This quality is especially useful in designing user interfaces. For instance, consider what happens when a user clicks a button that results in the performance of a time-consuming operation. A single-threaded application would be unresponsive to the user until the operation had completed. In contrast, if the time-consuming operation is performed in a separate thread, the application remains responsive to the user. 2. Resource sharing. Processes can only share resources through techniques such as shared memory and message passing. Such techniques must be explicitly arranged by the programmer. However, threads share the memory and the resources of the process to which they belong by default. The benefit of sharing code and data is that it allows an application to have several different threads of activity within the same address space. 3. Economy. Allocating memory and resources for process creation is costly. Because threads share the resources of the process to which they belong, it is more economical to create and context-switch threads. Empirically gauging the difference in overhead can be difficult, but in general it is significantly more time consuming to create and manage processes than threads. In Solaris, for example, creating a process is about thirty times
  • 91. 166 Chapter 4 Threads T1 T2 T3 T4 T1 T2 T3 T4 T1 single core time … Figure 4.3 Concurrent execution on a single-core system. slower than is creating a thread, and context switching is about five times slower. 4. Scalability. The benefits of multithreading can be even greater in a multiprocessor architecture, where threads may be running in parallel on different processing cores. A single-threaded process can run on only one processor, regardless how many are available. We explore this issue further in the following section. 4.2 Multicore Programming Earlier in the history of computer design, in response to the need for more computing performance, single-CPU systems evolved into multi-CPU systems. A more recent, similar trend in system design is to place multiple computing cores on a single chip. Each core appears as a separate processor to the operating system (Section 1.3.2). Whether the cores appear across CPU chips or within CPU chips, we call these systems multicore or multiprocessor systems. Multithreaded programming provides a mechanism for more efficient use of these multiple computing cores and improved concurrency. Consider an application with four threads. On a system with a single computing core, concurrency merely means that the execution of the threads will be interleaved over time (Figure 4.3), because the processing core is capable of executing only one thread at a time. On a system with multiple cores, however, concurrency means that the threads can run in parallel, because the system can assign a separate thread to each core (Figure 4.4). Notice the distinction between parallelism and concurrency in this discus- sion. A system is parallel if it can perform more than one task simultaneously. In contrast, a concurrent system supports more than one task by allowing all the tasks to make progress. Thus, it is possible to have concurrency without parallelism. Before the advent of SMP and multicore architectures, most com- puter systems had only a single processor. CPU schedulers were designed to provide the illusion of parallelism by rapidly switching between processes in T1 T3 T1 T3 T1 core 1 T2 T4 T2 T4 T2 core 2 time … … Figure 4.4 Parallel execution on a multicore system.
  • 92. 4.2 Multicore Programming 167 AMDAHL’S LAW Amdahl’s Law is a formula that identifies potential performance gains from adding additional computing cores to an application that has both serial (nonparallel) and parallel components. If S is the portion of the application that must be performed serially on a system with N processing cores, the formula appears as follows: speedup ≤ 1 S + (1−S) N As an example, assume we have an application that is 75 percent parallel and 25 percent serial. If we run this application on a system with two processing cores, we can get a speedup of 1.6 times. If we add two additional cores (for a total of four), the speedup is 2.28 times. One interesting fact about Amdahl’s Law is that as N approaches infinity, the speedup converges to 1/S. For example, if 40 percent of an application is performed serially, the maximum speedup is 2.5 times, regardless of the number of processing cores we add. This is the fundamental principle behind Amdahl’s Law: the serial portion of an application can have a disproportionate effect on the performance we gain by adding additional computing cores. Some argue that Amdahl’s Law does not take into account the hardware performance enhancements used in the design of contemporary multicore systems. Such arguments suggest Amdahl’s Law may cease to be applicable as the number of processing cores continues to increase on modern computer systems. the system, thereby allowing each process to make progress. Such processes were running concurrently, but not in parallel. As systems have grown from tens of threads to thousands of threads, CPU designers have improved system performance by adding hardware to improve thread performance. Modern Intel CPUs frequently support two threads per core, while the Oracle T4 CPU supports eight threads per core. This support means that multiple threads can be loaded into the core for fast switching. Multicore computers will no doubt continue to increase in core counts and hardware thread support. 4.2.1 Programming Challenges The trend towards multicore systems continues to place pressure on system designers and application programmers to make better use of the multiple computing cores. Designers of operating systems must write scheduling algorithms that use multiple processing cores to allow the parallel execution shown in Figure 4.4. For application programmers, the challenge is to modify existing programs as well as design new programs that are multithreaded. In general, five areas present challenges in programming for multicore systems:
  • 93. 168 Chapter 4 Threads 1. Identifying tasks. This involves examining applications to find areas that can be divided into separate, concurrent tasks. Ideally, tasks are independent of one another and thus can run in parallel on individual cores. 2. Balance. While identifying tasks that can run in parallel, programmers must also ensure that the tasks perform equal work of equal value. In some instances, a certain task may not contribute as much value to the overall process as other tasks. Using a separate execution core to run that task may not be worth the cost. 3. Data splitting. Just as applications are divided into separate tasks, the data accessed and manipulated by the tasks must be divided to run on separate cores. 4. Data dependency. The data accessed by the tasks must be examined for dependencies between two or more tasks. When one task depends on data from another, programmers must ensure that the execution of the tasks is synchronized to accommodate the data dependency. We examine such strategies in Chapter 5. 5. Testing and debugging. When a program is running in parallel on multiple cores, many different execution paths are possible. Testing and debugging such concurrent programs is inherently more difficult than testing and debugging single-threaded applications. Because of these challenges, many software developers argue that the advent of multicore systems will require an entirely new approach to designing software systems in the future. (Similarly, many computer science educators believe that software development must be taught with increased emphasis on parallel programming.) 4.2.2 Types of Parallelism In general, there are two types of parallelism: data parallelism and task parallelism. Data parallelism focuses on distributing subsets of the same data across multiple computing cores and performing the same operation on each core. Consider, for example, summing the contents of an array of size N. On a single-core system, one thread would simply sum the elements [0] . . . [N − 1]. On a dual-core system, however, thread A, running on core 0, could sum the elements [0] . . . [N/2 − 1] while thread B, running on core 1, could sum the elements [N/2] . . . [N − 1]. The two threads would be running in parallel on separate computing cores. Task parallelism involves distributing not data but tasks (threads) across multiple computing cores. Each thread is performing a unique operation. Different threads may be operating on the same data, or they may be operating on different data. Consider again our example above. In contrast to that situation, an example of task parallelism might involve two threads, each performing a unique statistical operation on the array of elements. The threads again are operating in parallel on separate computing cores, but each is performing a unique operation.
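The array-summation example described above—thread A summing the first half of the array while thread B sums the second half—can be sketched with POSIX threads as follows. This is an illustrative sketch rather than code from the text; the array size N and its contents are arbitrary, and each thread writes its partial result into its own slot so the two threads never update the same variable.

   #include <pthread.h>
   #include <stdio.h>

   #define N 1000

   static int data[N];
   static long sum[2];                      /* one partial sum per thread */

   static void *partial_sum(void *arg) {
      int id = *(int *)arg;                 /* 0 or 1 */
      int begin = id * (N / 2);
      int end = begin + (N / 2);
      for (int i = begin; i < end; i++)     /* same operation, different half */
         sum[id] += data[i];
      return NULL;
   }

   int main(void) {
      pthread_t tid[2];
      int id[2] = {0, 1};

      for (int i = 0; i < N; i++)
         data[i] = 1;                       /* arbitrary test data */

      for (int i = 0; i < 2; i++)
         pthread_create(&tid[i], NULL, partial_sum, &id[i]);
      for (int i = 0; i < 2; i++)
         pthread_join(tid[i], NULL);

      printf("total = %ld\n", sum[0] + sum[1]);
      return 0;
   }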
  • 94. 4.3 Multithreading Models 169 Fundamentally, then, data parallelism involves the distribution of data across multiple cores and task parallelism on the distribution of tasks across multiple cores. In practice, however, few applications strictly follow either data or task parallelism. In most instances, applications use a hybrid of these two strategies. 4.3 Multithreading Models Our discussion so far has treated threads in a generic sense. However, support for threads may be provided either at the user level, for user threads, or by the kernel, for kernel threads. User threads are supported above the kernel and are managed without kernel support, whereas kernel threads are supported and managed directly by the operating system. Virtually all contemporary operating systems—including Windows, Linux, Mac OS X, and Solaris— support kernel threads. Ultimately, a relationship must exist between user threads and kernel threads. In this section, we look at three common ways of establishing such a relationship: the many-to-one model, the one-to-one model, and the many-to- many model. 4.3.1 Many-to-One Model The many-to-one model (Figure 4.5) maps many user-level threads to one kernel thread. Thread management is done by the thread library in user space, so it is efficient (we discuss thread libraries in Section 4.4). However, the entire process will block if a thread makes a blocking system call. Also, because only one thread can access the kernel at a time, multiple threads are unable to run in parallel on multicore systems. Green threads—a thread library available for Solaris systems and adopted in early versions of Java—used the many-to-one model. However, very few systems continue to use the model because of its inability to take advantage of multiple processing cores. user thread kernel thread k Figure 4.5 Many-to-one model.
  • 95. 170 Chapter 4 Threads user thread kernel thread k k k k Figure 4.6 One-to-one model. 4.3.2 One-to-One Model The one-to-one model (Figure 4.6) maps each user thread to a kernel thread. It provides more concurrency than the many-to-one model by allowing another thread to run when a thread makes a blocking system call. It also allows multiple threads to run in parallel on multiprocessors. The only drawback to this model is that creating a user thread requires creating the corresponding kernel thread. Because the overhead of creating kernel threads can burden the performance of an application, most implementations of this model restrict the number of threads supported by the system. Linux, along with the family of Windows operating systems, implement the one-to-one model. 4.3.3 Many-to-Many Model The many-to-many model (Figure 4.7) multiplexes many user-level threads to a smaller or equal number of kernel threads. The number of kernel threads may be specific to either a particular application or a particular machine (an application may be allocated more kernel threads on a multiprocessor than on a single processor). Let’s consider the effect of this design on concurrency. Whereas the many- to-one model allows the developer to create as many user threads as she wishes, it does not result in true concurrency, because the kernel can schedule only one thread at a time. The one-to-one model allows greater concurrency, but the developer has to be careful not to create too many threads within an application (and in some instances may be limited in the number of threads she can user thread kernel thread k k k Figure 4.7 Many-to-many model.
  • 96. 4.4 Thread Libraries 171 user thread kernel thread k k k k Figure 4.8 Two-level model. create). The many-to-many model suffers from neither of these shortcomings: developers can create as many user threads as necessary, and the corresponding kernel threads can run in parallel on a multiprocessor. Also, when a thread performs a blocking system call, the kernel can schedule another thread for execution. One variation on the many-to-many model still multiplexes many user- level threads to a smaller or equal number of kernel threads but also allows a user-level thread to be bound to a kernel thread. This variation is sometimes referred to as the two-level model (Figure 4.8). The Solaris operating system supported the two-level model in versions older than Solaris 9. However, beginning with Solaris 9, this system uses the one-to-one model. 4.4 Thread Libraries A thread library provides the programmer with an API for creating and managing threads. There are two primary ways of implementing a thread library. The first approach is to provide a library entirely in user space with no kernel support. All code and data structures for the library exist in user space. This means that invoking a function in the library results in a local function call in user space and not a system call. The second approach is to implement a kernel-level library supported directly by the operating system. In this case, code and data structures for the library exist in kernel space. Invoking a function in the API for the library typically results in a system call to the kernel. Three main thread libraries are in use today: POSIX Pthreads, Windows, and Java. Pthreads, the threads extension of the POSIX standard, may be provided as either a user-level or a kernel-level library. The Windows thread library is a kernel-level library available on Windows systems. The Java thread API allows threads to be created and managed directly in Java programs. However, because in most instances the JVM is running on top of a host operating system, the Java thread API is generally implemented using a thread library available on the host system. This means that on Windows systems, Java threads are typically implemented using the Windows API; UNIX and Linux systems often use Pthreads.
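As a brief, hedged illustration of the Pthreads API mentioned above (a minimal sketch, not the text's full example), a program creates a thread by supplying a start function and then waits for it to finish with pthread_join():

   #include <pthread.h>
   #include <stdio.h>

   /* the function the new thread will run */
   static void *runner(void *param) {
      printf("hello from thread %s\n", (char *)param);
      pthread_exit(0);
   }

   int main(void) {
      pthread_t tid;        /* thread identifier */
      pthread_attr_t attr;  /* thread attributes */

      pthread_attr_init(&attr);                        /* default attributes */
      pthread_create(&tid, &attr, runner, "worker");   /* create the thread */
      pthread_join(tid, NULL);                         /* wait for it to finish */
      return 0;
   }

On most systems such a program is compiled with the -pthread option (for example, gcc -pthread).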
  • 97. 5 C H A P T E R Process Synchronization A cooperating process is one that can affect or be affected by other processes executing in the system. Cooperating processes can either directly share a logical address space (that is, both code and data) or be allowed to share data only through files or messages. The former case is achieved through the use of threads, discussed in Chapter 4. Concurrent access to shared data may result in data inconsistency, however. In this chapter, we discuss various mechanisms to ensure the orderly execution of cooperating processes that share a logical address space, so that data consistency is maintained. CHAPTER OBJECTIVES • To introduce the critical-section problem, whose solutions can be used to ensure the consistency of shared data. • To present both software and hardware solutions of the critical-section problem. • To examine several classical process-synchronization problems. • To explore several tools that are used to solve process synchronization problems. 5.1 Background We’ve already seen that processes can execute concurrently or in parallel. Section 3.2.2 introduced the role of process scheduling and described how the CPU scheduler switches rapidly between processes to provide concurrent execution. This means that one process may only partially complete execution before another process is scheduled. In fact, a process may be interrupted at any point in its instruction stream, and the processing core may be assigned to execute instructions of another process. Additionally, Section 4.2 introduced parallel execution, in which two instruction streams (representing different processes) execute simultaneously on separate processing cores. In this chapter, 203
we explain how concurrent or parallel execution can contribute to issues involving the integrity of data shared by several processes.

Let’s consider an example of how this can happen. In Chapter 3, we developed a model of a system consisting of cooperating sequential processes or threads, all running asynchronously and possibly sharing data. We illustrated this model with the producer–consumer problem, which is representative of operating systems. Specifically, in Section 3.4.1, we described how a bounded buffer could be used to enable processes to share memory.

We now return to our consideration of the bounded buffer. As we pointed out, our original solution allowed at most BUFFER_SIZE − 1 items in the buffer at the same time. Suppose we want to modify the algorithm to remedy this deficiency. One possibility is to add an integer variable counter, initialized to 0. counter is incremented every time we add a new item to the buffer and is decremented every time we remove one item from the buffer. The code for the producer process can be modified as follows:

   while (true) {
      /* produce an item in next_produced */

      while (counter == BUFFER_SIZE)
         ; /* do nothing */

      buffer[in] = next_produced;
      in = (in + 1) % BUFFER_SIZE;
      counter++;
   }

The code for the consumer process can be modified as follows:

   while (true) {
      while (counter == 0)
         ; /* do nothing */

      next_consumed = buffer[out];
      out = (out + 1) % BUFFER_SIZE;
      counter--;

      /* consume the item in next_consumed */
   }

Although the producer and consumer routines shown above are correct separately, they may not function correctly when executed concurrently. As an illustration, suppose that the value of the variable counter is currently 5 and that the producer and consumer processes concurrently execute the statements “counter++” and “counter--”. Following the execution of these two statements, the value of the variable counter may be 4, 5, or 6! The only correct result, though, is counter == 5, which is generated correctly if the producer and consumer execute separately.
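This loss of updates can also be observed empirically. The following is a small, self-contained sketch (not from the text) in which one POSIX thread repeatedly increments a shared counter while another decrements it the same number of times; with no synchronization the final value is frequently not 0. The iteration count of 1,000,000 is arbitrary, and volatile is used only to keep the compiler from collapsing the loops—it does nothing to prevent the race itself.

   #include <pthread.h>
   #include <stdio.h>

   /* shared data: an unprotected counter, as in the bounded-buffer example */
   static volatile long counter = 0;

   static void *producer(void *arg) {
      for (long i = 0; i < 1000000; i++)
         counter++;              /* not atomic: load, add, store */
      return NULL;
   }

   static void *consumer(void *arg) {
      for (long i = 0; i < 1000000; i++)
         counter--;              /* races with the producer's increment */
      return NULL;
   }

   int main(void) {
      pthread_t p, c;
      pthread_create(&p, NULL, producer, NULL);
      pthread_create(&c, NULL, consumer, NULL);
      pthread_join(p, NULL);
      pthread_join(c, NULL);
      /* with no synchronization, this often prints a value other than 0 */
      printf("counter = %ld\n", counter);
      return 0;
   }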
We can show that the value of counter may be incorrect as follows. Note that the statement “counter++” may be implemented in machine language (on a typical machine) as follows:

   register1 = counter
   register1 = register1 + 1
   counter = register1

where register1 is one of the local CPU registers. Similarly, the statement “counter--” is implemented as follows:

   register2 = counter
   register2 = register2 − 1
   counter = register2

where again register2 is one of the local CPU registers. Even though register1 and register2 may be the same physical register (an accumulator, say), remember that the contents of this register will be saved and restored by the interrupt handler (Section 1.2.3).

The concurrent execution of “counter++” and “counter--” is equivalent to a sequential execution in which the lower-level statements presented previously are interleaved in some arbitrary order (but the order within each high-level statement is preserved). One such interleaving is the following:

   T0: producer execute register1 = counter           {register1 = 5}
   T1: producer execute register1 = register1 + 1     {register1 = 6}
   T2: consumer execute register2 = counter           {register2 = 5}
   T3: consumer execute register2 = register2 − 1     {register2 = 4}
   T4: producer execute counter = register1           {counter = 6}
   T5: consumer execute counter = register2           {counter = 4}

Notice that we have arrived at the incorrect state “counter == 4”, indicating that four buffers are full, when, in fact, five buffers are full. If we reversed the order of the statements at T4 and T5, we would arrive at the incorrect state “counter == 6”.

We would arrive at this incorrect state because we allowed both processes to manipulate the variable counter concurrently. A situation like this, where several processes access and manipulate the same data concurrently and the outcome of the execution depends on the particular order in which the access takes place, is called a race condition. To guard against the race condition above, we need to ensure that only one process at a time can be manipulating the variable counter. To make such a guarantee, we require that the processes be synchronized in some way.

Situations such as the one just described occur frequently in operating systems as different parts of the system manipulate resources. Furthermore, as we have emphasized in earlier chapters, the growing importance of multicore systems has brought an increased emphasis on developing multithreaded applications. In such applications, several threads—which are quite possibly sharing data—are running in parallel on different processing cores. Clearly,
  • 100. 206 Chapter 5 Process Synchronization do { entry section critical section exit section remainder section } while (true); Figure 5.1 General structure of a typical process Pi . we want any changes that result from such activities not to interfere with one another. Because of the importance of this issue, we devote a major portion of this chapter to process synchronization and coordination among cooperating processes. 5.2 The Critical-Section Problem We begin our consideration of process synchronization by discussing the so- called critical-section problem. Consider a system consisting of n processes {P0, P1, ..., Pn−1}. Each process has a segment of code, called a critical section, in which the process may be changing common variables, updating a table, writing a file, and so on. The important feature of the system is that, when one process is executing in its critical section, no other process is allowed to execute in its critical section. That is, no two processes are executing in their critical sections at the same time. The critical-section problem is to design a protocol that the processes can use to cooperate. Each process must request permission to enter its critical section. The section of code implementing this request is the entry section. The critical section may be followed by an exit section. The remaining code is the remainder section. The general structure of a typical process Pi is shown in Figure 5.1. The entry section and exit section are enclosed in boxes to highlight these important segments of code. A solution to the critical-section problem must satisfy the following three requirements: 1. Mutual exclusion. If process Pi is executing in its critical section, then no other processes can be executing in their critical sections. 2. Progress. If no process is executing in its critical section and some processes wish to enter their critical sections, then only those processes that are not executing in their remainder sections can participate in deciding which will enter its critical section next, and this selection cannot be postponed indefinitely. 3. Bounded waiting. There exists a bound, or limit, on the number of times that other processes are allowed to enter their critical sections after a
  • 101. 5.3 Peterson’s Solution 207 process has made a request to enter its critical section and before that request is granted. We assume that each process is executing at a nonzero speed. However, we can make no assumption concerning the relative speed of the n processes. At a given point in time, many kernel-mode processes may be active in the operating system. As a result, the code implementing an operating system (kernel code) is subject to several possible race conditions. Consider as an example a kernel data structure that maintains a list of all open files in the system. This list must be modified when a new file is opened or closed (adding the file to the list or removing it from the list). If two processes were to open files simultaneously, the separate updates to this list could result in a race condition. Other kernel data structures that are prone to possible race conditions include structures for maintaining memory allocation, for maintaining process lists, and for interrupt handling. It is up to kernel developers to ensure that the operating system is free from such race conditions. Two general approaches are used to handle critical sections in operating systems: preemptive kernels and nonpreemptive kernels. A preemptive kernel allows a process to be preempted while it is running in kernel mode. A nonpreemptive kernel does not allow a process running in kernel mode to be preempted; a kernel-mode process will run until it exits kernel mode, blocks, or voluntarily yields control of the CPU. Obviously, a nonpreemptive kernel is essentially free from race conditions on kernel data structures, as only one process is active in the kernel at a time. We cannot say the same about preemptive kernels, so they must be carefully designed to ensure that shared kernel data are free from race conditions. Preemptive kernels are especially difficult to design for SMP architectures, since in these environments it is possible for two kernel-mode processes to run simultaneously on different processors. Why, then, would anyone favor a preemptive kernel over a nonpreemptive one? A preemptive kernel may be more responsive, since there is less risk that a kernel-mode process will run for an arbitrarily long period before relinquishing the processor to waiting processes. (Of course, this risk can also be minimized by designing kernel code that does not behave in this way.) Furthermore, a preemptive kernel is more suitable for real-time programming, as it will allow a real-time process to preempt a process currently running in the kernel. Later in this chapter, we explore how various operating systems manage preemption within the kernel. 5.3 Peterson’s Solution Next, we illustrate a classic software-based solution to the critical-section problem known as Peterson’s solution. Because of the way modern computer architectures perform basic machine-language instructions, such as load and store, there are no guarantees that Peterson’s solution will work correctly on such architectures. However, we present the solution because it provides a good algorithmic description of solving the critical-section problem and illustrates some of the complexities involved in designing software that addresses the requirements of mutual exclusion, progress, and bounded waiting.
  • 102. 208 Chapter 5 Process Synchronization do { flag[i] = true; turn = j; while (flag[j] && turn == j); critical section flag[i] = false; remainder section } while (true); Figure 5.2 The structure of process Pi in Peterson’s solution. Peterson’s solution is restricted to two processes that alternate execution between their critical sections and remainder sections. The processes are numbered P0 and P1. For convenience, when presenting Pi , we use Pj to denote the other process; that is, j equals 1 − i. Peterson’s solution requires the two processes to share two data items: int turn; boolean flag[2]; The variable turn indicates whose turn it is to enter its critical section. That is, if turn == i, then process Pi is allowed to execute in its critical section. The flag array is used to indicate if a process is ready to enter its critical section. For example, if flag[i] is true, this value indicates that Pi is ready to enter its critical section. With an explanation of these data structures complete, we are now ready to describe the algorithm shown in Figure 5.2. To enter the critical section, process Pi first sets flag[i] to be true and then sets turn to the value j, thereby asserting that if the other process wishes to enter the critical section, it can do so. If both processes try to enter at the same time, turn will be set to both i and j at roughly the same time. Only one of these assignments will last; the other will occur but will be overwritten immediately. The eventual value of turn determines which of the two processes is allowed to enter its critical section first. We now prove that this solution is correct. We need to show that: 1. Mutual exclusion is preserved. 2. The progress requirement is satisfied. 3. The bounded-waiting requirement is met. To prove property 1, we note that each Pi enters its critical section only if either flag[j] == false or turn == i. Also note that, if both processes can be executing in their critical sections at the same time, then flag[0] == flag[1] == true. These two observations imply that P0 and P1 could not have successfully executed their while statements at about the same time, since the
  • 103. 5.4 Synchronization Hardware 209 value of turn can be either 0 or 1 but cannot be both. Hence, one of the processes —say, Pj —must have successfully executed the while statement, whereas Pi had to execute at least one additional statement (“turn == j”). However, at that time, flag[j] == true and turn == j, and this condition will persist as long as Pj is in its critical section; as a result, mutual exclusion is preserved. To prove properties 2 and 3, we note that a process Pi can be prevented from entering the critical section only if it is stuck in the while loop with the condition flag[j] == true and turn == j; this loop is the only one possible. If Pj is not ready to enter the critical section, then flag[j] == false, and Pi can enter its critical section. If Pj has set flag[j] to true and is also executing in its while statement, then either turn == i or turn == j. If turn == i, then Pi will enter the critical section. If turn == j, then Pj will enter the critical section. However, once Pj exits its critical section, it will reset flag[j] to false, allowing Pi to enter its critical section. If Pj resets flag[j] to true, it must also set turn to i. Thus, since Pi does not change the value of the variable turn while executing the while statement, Pi will enter the critical section (progress) after at most one entry by Pj (bounded waiting). 5.4 Synchronization Hardware We have just described one software-based solution to the critical-section problem. However, as mentioned, software-based solutions such as Peterson’s are not guaranteed to work on modern computer architectures. In the following discussions, we explore several more solutions to the critical-section problem using techniques ranging from hardware to software-based APIs available to both kernel developers and application programmers. All these solutions are based on the premise of locking —that is, protecting critical regions through the use of locks. As we shall see, the designs of such locks can be quite sophisticated. We start by presenting some simple hardware instructions that are available on many systems and showing how they can be used effectively in solving the critical-section problem. Hardware features can make any programming task easier and improve system efficiency. The critical-section problem could be solved simply in a single-processor environment if we could prevent interrupts from occurring while a shared variable was being modified. In this way, we could be sure that the current sequence of instructions would be allowed to execute in order without pre- emption. No other instructions would be run, so no unexpected modifications could be made to the shared variable. This is often the approach taken by nonpreemptive kernels. boolean test and set(boolean *target) { boolean rv = *target; *target = true; return rv; } Figure 5.3 The definition of the test and set() instruction.
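Figure 5.3 defines only the semantics of the instruction; on real hardware the same effect is obtained through atomic operations exposed by the compiler. As a hedged sketch (not from the text), C11's <stdatomic.h> provides atomic_flag_test_and_set(), which can be used to build the kind of spinlock that Figure 5.4 constructs from test_and_set():

   #include <stdatomic.h>

   static atomic_flag lock = ATOMIC_FLAG_INIT;   /* initially clear (false) */

   void enter_critical(void) {
      /* atomic_flag_test_and_set() atomically sets the flag and returns
         its previous value -- the same semantics as Figure 5.3 */
      while (atomic_flag_test_and_set(&lock))
         ; /* busy wait (spin) */
   }

   void exit_critical(void) {
      atomic_flag_clear(&lock);                  /* equivalent to lock = false */
   }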
do {
   while (test_and_set(&lock))
      ; /* do nothing */

   /* critical section */

   lock = false;

   /* remainder section */
} while (true);

Figure 5.4 Mutual-exclusion implementation with test_and_set().

Unfortunately, this solution is not as feasible in a multiprocessor environment. Disabling interrupts on a multiprocessor can be time consuming, since the message is passed to all the processors. This message passing delays entry into each critical section, and system efficiency decreases. Also consider the effect on a system’s clock if the clock is kept updated by interrupts.

Many modern computer systems therefore provide special hardware instructions that allow us either to test and modify the content of a word or to swap the contents of two words atomically—that is, as one uninterruptible unit. We can use these special instructions to solve the critical-section problem in a relatively simple manner. Rather than discussing one specific instruction for one specific machine, we abstract the main concepts behind these types of instructions by describing the test_and_set() and compare_and_swap() instructions.

The test_and_set() instruction can be defined as shown in Figure 5.3. The important characteristic of this instruction is that it is executed atomically. Thus, if two test_and_set() instructions are executed simultaneously (each on a different CPU), they will be executed sequentially in some arbitrary order. If the machine supports the test_and_set() instruction, then we can implement mutual exclusion by declaring a boolean variable lock, initialized to false. The structure of process Pi is shown in Figure 5.4.

   int compare_and_swap(int *value, int expected, int new_value) {
      int temp = *value;

      if (*value == expected)
         *value = new_value;

      return temp;
   }

Figure 5.5 The definition of the compare_and_swap() instruction.

The compare_and_swap() instruction, in contrast to the test_and_set() instruction, operates on three operands; it is defined in Figure 5.5. The operand value is set to new_value only if the expression (*value == expected) is true. Regardless, compare_and_swap() always returns the original value of the variable value. Like the test_and_set() instruction, compare_and_swap() is
executed atomically. Mutual exclusion can be provided as follows: a global variable (lock) is declared and is initialized to 0. The first process that invokes compare_and_swap() will set lock to 1. It will then enter its critical section, because the original value of lock was equal to the expected value of 0. Subsequent calls to compare_and_swap() will not succeed, because lock now is not equal to the expected value of 0. When a process exits its critical section, it sets lock back to 0, which allows another process to enter its critical section. The structure of process Pi is shown in Figure 5.6.

   do {
      while (compare_and_swap(&lock, 0, 1) != 0)
         ; /* do nothing */

      /* critical section */

      lock = 0;

      /* remainder section */
   } while (true);

Figure 5.6 Mutual-exclusion implementation with the compare_and_swap() instruction.

   do {
      waiting[i] = true;
      key = true;
      while (waiting[i] && key)
         key = test_and_set(&lock);
      waiting[i] = false;

      /* critical section */

      j = (i + 1) % n;
      while ((j != i) && !waiting[j])
         j = (j + 1) % n;

      if (j == i)
         lock = false;
      else
         waiting[j] = false;

      /* remainder section */
   } while (true);

Figure 5.7 Bounded-waiting mutual exclusion with test_and_set().

Although these algorithms satisfy the mutual-exclusion requirement, they do not satisfy the bounded-waiting requirement. In Figure 5.7, we present another algorithm using the test_and_set() instruction that satisfies all the critical-section requirements. The common data structures are
   boolean waiting[n];
   boolean lock;

These data structures are initialized to false. To prove that the mutual-exclusion requirement is met, we note that process Pi can enter its critical section only if either waiting[i] == false or key == false. The value of key can become false only if the test_and_set() is executed. The first process to execute the test_and_set() will find key == false; all others must wait. The variable waiting[i] can become false only if another process leaves its critical section; only one waiting[i] is set to false, maintaining the mutual-exclusion requirement.

To prove that the progress requirement is met, we note that the arguments presented for mutual exclusion also apply here, since a process exiting the critical section either sets lock to false or sets waiting[j] to false. Both allow a process that is waiting to enter its critical section to proceed.

To prove that the bounded-waiting requirement is met, we note that, when a process leaves its critical section, it scans the array waiting in the cyclic ordering (i + 1, i + 2, ..., n − 1, 0, ..., i − 1). It designates the first process in this ordering that is in the entry section (waiting[j] == true) as the next one to enter the critical section. Any process waiting to enter its critical section will thus do so within n − 1 turns.

Details describing the implementation of the atomic test_and_set() and compare_and_swap() instructions are discussed more fully in books on computer architecture.

5.5 Mutex Locks

The hardware-based solutions to the critical-section problem presented in Section 5.4 are complicated as well as generally inaccessible to application programmers. Instead, operating-systems designers build software tools to solve the critical-section problem. The simplest of these tools is the mutex lock. (In fact, the term mutex is short for mutual exclusion.) We use the mutex lock to protect critical regions and thus prevent race conditions. That is, a process must acquire the lock before entering a critical section; it releases the lock when it exits the critical section. The acquire() function acquires the lock, and the release() function releases the lock, as illustrated in Figure 5.8.

A mutex lock has a boolean variable available whose value indicates if the lock is available or not. If the lock is available, a call to acquire() succeeds, and the lock is then considered unavailable. A process that attempts to acquire an unavailable lock is blocked until the lock is released.

The definition of acquire() is as follows:

   acquire() {
      while (!available)
         ; /* busy wait */
      available = false;
   }
  • 107. 5.6 Semaphores 213 do { acquire lock critical section release lock remainder section } while (true); Figure 5.8 Solution to the critical-section problem using mutex locks. The definition of release() is as follows: release() { available = true; } Calls to either acquire() or release() must be performed atomically. Thus, mutex locks are often implemented using one of the hardware mecha- nisms described in Section 5.4, and we leave the description of this technique as an exercise. The main disadvantage of the implementation given here is that it requires busy waiting. While a process is in its critical section, any other process that tries to enter its critical section must loop continuously in the call to acquire(). In fact, this type of mutex lock is also called a spinlock because the process “spins” while waiting for the lock to become available. (We see the same issue with the code examples illustrating the test and set() instruction and the compare and swap() instruction.) This continual looping is clearly a problem in a real multiprogramming system, where a single CPU is shared among many processes. Busy waiting wastes CPU cycles that some other process might be able to use productively. Spinlocks do have an advantage, however, in that no context switch is required when a process must wait on a lock, and a context switch may take considerable time. Thus, when locks are expected to be held for short times, spinlocks are useful. They are often employed on multiprocessor systems where one thread can “spin” on one processor while another thread performs its critical section on another processor. Later in this chapter (Section 5.7), we examine how mutex locks can be used to solve classical synchronization problems. We also discuss how these locks are used in several operating systems, as well as in Pthreads. 5.6 Semaphores Mutex locks, as we mentioned earlier, are generally considered the simplest of synchronization tools. In this section, we examine a more robust tool that can
  • 108. 214 Chapter 5 Process Synchronization behave similarly to a mutex lock but can also provide more sophisticated ways for processes to synchronize their activities. A semaphore S is an integer variable that, apart from initialization, is accessed only through two standard atomic operations: wait() and signal(). The wait() operation was originally termed P (from the Dutch proberen, “to test”); signal() was originally called V (from verhogen, “to increment”). The definition of wait() is as follows: wait(S) { while (S <= 0) ; // busy wait S--; } The definition of signal() is as follows: signal(S) { S++; } All modifications to the integer value of the semaphore in the wait() and signal() operations must be executed indivisibly. That is, when one process modifies the semaphore value, no other process can simultaneously modify that same semaphore value. In addition, in the case of wait(S), the testing of the integer value of S (S ≤ 0), as well as its possible modification (S--), must be executed without interruption. We shall see how these operations can be implemented in Section 5.6.2. First, let’s see how semaphores can be used. 5.6.1 Semaphore Usage Operating systems often distinguish between counting and binary semaphores. The value of a counting semaphore can range over an unrestricted domain. The value of a binary semaphore can range only between 0 and 1. Thus, binary semaphores behave similarly to mutex locks. In fact, on systems that do not provide mutex locks, binary semaphores can be used instead for providing mutual exclusion. Counting semaphores can be used to control access to a given resource consisting of a finite number of instances. The semaphore is initialized to the number of resources available. Each process that wishes to use a resource performs a wait() operation on the semaphore (thereby decrementing the count). When a process releases a resource, it performs a signal() operation (incrementing the count). When the count for the semaphore goes to 0, all resources are being used. After that, processes that wish to use a resource will block until the count becomes greater than 0. We can also use semaphores to solve various synchronization problems. For example, consider two concurrently running processes: P1 with a statement S1 and P2 with a statement S2. Suppose we require that S2 be executed only after S1 has completed. We can implement this scheme readily by letting P1 and P2 share a common semaphore synch, initialized to 0. In process P1, we insert the statements
  • 109. 5.6 Semaphores 215 S1; signal(synch); In process P2, we insert the statements wait(synch); S2; Because synch is initialized to 0, P2 will execute S2 only after P1 has invoked signal(synch), which is after statement S1 has been executed. 5.6.2 Semaphore Implementation Recall that the implementation of mutex locks discussed in Section 5.5 suffers from busy waiting. The definitions of the wait() and signal() semaphore operations just described present the same problem. To overcome the need for busy waiting, we can modify the definition of the wait() and signal() operations as follows: When a process executes the wait() operation and finds that the semaphore value is not positive, it must wait. However, rather than engaging in busy waiting, the process can block itself. The block operation places a process into a waiting queue associated with the semaphore, and the state of the process is switched to the waiting state. Then control is transferred to the CPU scheduler, which selects another process to execute. A process that is blocked, waiting on a semaphore S, should be restarted when some other process executes a signal() operation. The process is restarted by a wakeup() operation, which changes the process from the waiting state to the ready state. The process is then placed in the ready queue. (The CPU may or may not be switched from the running process to the newly ready process, depending on the CPU-scheduling algorithm.) To implement semaphores under this definition, we define a semaphore as follows: typedef struct { int value; struct process *list; } semaphore; Each semaphore has an integer value and a list of processes list. When a process must wait on a semaphore, it is added to the list of processes. A signal() operation removes one process from the list of waiting processes and awakens that process. Now, the wait() semaphore operation can be defined as wait(semaphore *S) { S->value--; if (S->value < 0) { add this process to S->list; block(); } }
  • 110. 216 Chapter 5 Process Synchronization and the signal() semaphore operation can be defined as signal(semaphore *S) { S->value++; if (S->value <= 0) { remove a process P from S->list; wakeup(P); } } The block() operation suspends the process that invokes it. The wakeup(P) operation resumes the execution of a blocked process P. These two operations are provided by the operating system as basic system calls. Note that in this implementation, semaphore values may be negative, whereas semaphore values are never negative under the classical definition of semaphores with busy waiting. If a semaphore value is negative, its magnitude is the number of processes waiting on that semaphore. This fact results from switching the order of the decrement and the test in the implementation of the wait() operation. The list of waiting processes can be easily implemented by a link field in each process control block (PCB). Each semaphore contains an integer value and a pointer to a list of PCBs. One way to add and remove processes from the list so as to ensure bounded waiting is to use a FIFO queue, where the semaphore contains both head and tail pointers to the queue. In general, however, the list can use any queueing strategy. Correct usage of semaphores does not depend on a particular queueing strategy for the semaphore lists. It is critical that semaphore operations be executed atomically. We must guarantee that no two processes can execute wait() and signal() operations on the same semaphore at the same time. This is a critical-section problem; and in a single-processor environment, we can solve it by simply inhibiting interrupts during the time the wait() and signal() operations are executing. This scheme works in a single-processor environment because, once interrupts are inhibited, instructions from different processes cannot be interleaved. Only the currently running process executes until interrupts are reenabled and the scheduler can regain control. In a multiprocessor environment, interrupts must be disabled on every pro- cessor. Otherwise, instructions from different processes (running on different processors) may be interleaved in some arbitrary way. Disabling interrupts on every processor can be a difficult task and furthermore can seriously diminish performance. Therefore, SMP systems must provide alternative locking tech- niques—such as compare and swap() or spinlocks—to ensure that wait() and signal() are performed atomically. It is important to admit that we have not completely eliminated busy waiting with this definition of the wait() and signal() operations. Rather, we have moved busy waiting from the entry section to the critical sections of application programs. Furthermore, we have limited busy waiting to the critical sections of the wait() and signal() operations, and these sections are short (if properly coded, they should be no more than about ten instructions). Thus, the critical section is almost never occupied, and busy waiting occurs
  • 111. 5.6 Semaphores 217 rarely, and then for only a short time. An entirely different situation exists with application programs whose critical sections may be long (minutes or even hours) or may almost always be occupied. In such cases, busy waiting is extremely inefficient. 5.6.3 Deadlocks and Starvation The implementation of a semaphore with a waiting queue may result in a situation where two or more processes are waiting indefinitely for an event that can be caused only by one of the waiting processes. The event in question is the execution of a signal() operation. When such a state is reached, these processes are said to be deadlocked. To illustrate this, consider a system consisting of two processes, P0 and P1, each accessing two semaphores, S and Q, set to the value 1: P0 P1 wait(S); wait(Q); wait(Q); wait(S); . . . . . . signal(S); signal(Q); signal(Q); signal(S); Suppose that P0 executes wait(S) and then P1 executes wait(Q). When P0 executes wait(Q), it must wait until P1 executes signal(Q). Similarly, when P1 executes wait(S), it must wait until P0 executes signal(S). Since these signal() operations cannot be executed, P0 and P1 are deadlocked. We say that a set of processes is in a deadlocked state when every process in the set is waiting for an event that can be caused only by another process in the set. The events with which we are mainly concerned here are resource acquisition and release. Other types of events may result in deadlocks, as we show in Chapter 7. In that chapter, we describe various mechanisms for dealing with the deadlock problem. Another problem related to deadlocks is indefinite blocking or starvation, a situation in which processes wait indefinitely within the semaphore. Indefi- nite blocking may occur if we remove processes from the list associated with a semaphore in LIFO (last-in, first-out) order. 5.6.4 Priority Inversion A scheduling challenge arises when a higher-priority process needs to read or modify kernel data that are currently being accessed by a lower-priority process—or a chain of lower-priority processes. Since kernel data are typically protected with a lock, the higher-priority process will have to wait for a lower-priority one to finish with the resource. The situation becomes more complicated if the lower-priority process is preempted in favor of another process with a higher priority. As an example, assume we have three processes— L, M, and H —whose priorities follow the order L < M < H. Assume that process H requires
  • 112. 218 Chapter 5 Process Synchronization PRIORITY INVERSION AND THE MARS PATHFINDER Priority inversion can be more than a scheduling inconvenience. On systems with tight time constraints—such as real-time systems—priority inversion can cause a process to take longer than it should to accomplish a task. When that happens, other failures can cascade, resulting in system failure. Consider the Mars Pathfinder, a NASA space probe that landed a robot, the Sojourner rover, on Mars in 1997 to conduct experiments. Shortly after the Sojourner began operating, it started to experience frequent computer resets. Each reset reinitialized all hardware and software, including communica- tions. If the problem had not been solved, the Sojourner would have failed in its mission. The problem was caused by the fact that one high-priority task, “bc dist,” was taking longer than expected to complete its work. This task was being forced to wait for a shared resource that was held by the lower-priority “ASI/MET” task, which in turn was preempted by multiple medium-priority tasks. The “bc dist” task would stall waiting for the shared resource, and ultimately the “bc sched” task would discover the problem and perform the reset. The Sojourner was suffering from a typical case of priority inversion. The operating system on the Sojourner was the VxWorks real-time operat- ing system, which had a global variable to enable priority inheritance on all semaphores. After testing, the variable was set on the Sojourner (on Mars!), and the problem was solved. A full description of the problem, its detection, and its solu- tion was written by the software team lead and is available at http://research.microsoft.com/en-us/um/people/mbj/mars pathfinder/ authoritative account.html. resource R, which is currently being accessed by process L. Ordinarily, process H would wait for L to finish using resource R. However, now suppose that process M becomes runnable, thereby preempting process L. Indirectly, a process with a lower priority—process M—has affected how long process H must wait for L to relinquish resource R. This problem is known as priority inversion. It occurs only in systems with more than two priorities, so one solution is to have only two priorities. That is insufficient for most general-purpose operating systems, however. Typically these systems solve the problem by implementing a priority-inheritance protocol. According to this protocol, all processes that are accessing resources needed by a higher-priority process inherit the higher priority until they are finished with the resources in question. When they are finished, their priorities revert to their original values. In the example above, a priority-inheritance protocol would allow process L to temporarily inherit the priority of process H, thereby preventing process M from preempting its execution. When process L had finished using resource R, it would relinquish its inherited priority from H and assume its original priority. Because resource R would now be available, process H —not M—would run next.
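As a concrete illustration (not from the text), POSIX exposes the priority-inheritance protocol described above through a mutex attribute. The sketch below assumes a platform that supports _POSIX_THREAD_PRIO_INHERIT; the names resource_lock and init_resource_lock are illustrative.

    #include <pthread.h>

    pthread_mutex_t resource_lock;        /* guards the shared resource "R" */

    int init_resource_lock(void) {
        pthread_mutexattr_t attr;
        int rc = pthread_mutexattr_init(&attr);
        if (rc != 0)
            return rc;

        /* A low-priority holder of this mutex temporarily inherits the
           priority of the highest-priority thread blocked on it. */
        rc = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
        if (rc == 0)
            rc = pthread_mutex_init(&resource_lock, &attr);

        pthread_mutexattr_destroy(&attr);
        return rc;
    }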
  • 113. 5.7 Classic Problems of Synchronization 219 do { . . . /* produce an item in next produced */ . . . wait(empty); wait(mutex); . . . /* add next produced to the buffer */ . . . signal(mutex); signal(full); } while (true); Figure 5.9 The structure of the producer process. 5.7 Classic Problems of Synchronization In this section, we present a number of synchronization problems as examples of a large class of concurrency-control problems. These problems are used for testing nearly every newly proposed synchronization scheme. In our solutions to the problems, we use semaphores for synchronization, since that is the traditional way to present such solutions. However, actual implementations of these solutions could use mutex locks in place of binary semaphores. 5.7.1 The Bounded-Buffer Problem The bounded-buffer problem was introduced in Section 5.1; it is commonly used to illustrate the power of synchronization primitives. Here, we present a general structure of this scheme without committing ourselves to any particular implementation. We provide a related programming project in the exercises at the end of the chapter. In our problem, the producer and consumer processes share the following data structures: int n; semaphore mutex = 1; semaphore empty = n; semaphore full = 0 We assume that the pool consists of n buffers, each capable of holding one item. The mutex semaphore provides mutual exclusion for accesses to the buffer pool and is initialized to the value 1. The empty and full semaphores count the number of empty and full buffers. The semaphore empty is initialized to the value n; the semaphore full is initialized to the value 0. The code for the producer process is shown in Figure 5.9, and the code for the consumer process is shown in Figure 5.10. Note the symmetry between the producer and the consumer. We can interpret this code as the producer producing full buffers for the consumer or as the consumer producing empty buffers for the producer.
  • 114. 220 Chapter 5 Process Synchronization do { wait(full); wait(mutex); . . . /* remove an item from buffer to next consumed */ . . . signal(mutex); signal(empty); . . . /* consume the item in next consumed */ . . . } while (true); Figure 5.10 The structure of the consumer process. 5.7.2 The Readers–Writers Problem Suppose that a database is to be shared among several concurrent processes. Some of these processes may want only to read the database, whereas others may want to update (that is, to read and write) the database. We distinguish between these two types of processes by referring to the former as readers and to the latter as writers. Obviously, if two readers access the shared data simultaneously, no adverse effects will result. However, if a writer and some other process (either a reader or a writer) access the database simultaneously, chaos may ensue. To ensure that these difficulties do not arise, we require that the writers have exclusive access to the shared database while writing to the database. This synchronization problem is referred to as the readers–writers problem. Since it was originally stated, it has been used to test nearly every new synchronization primitive. The readers–writers problem has several variations, all involving priorities. The simplest one, referred to as the first readers–writers problem, requires that no reader be kept waiting unless a writer has already obtained permission to use the shared object. In other words, no reader should wait for other readers to finish simply because a writer is waiting. The second readers –writers problem requires that, once a writer is ready, that writer perform its write as soon as possible. In other words, if a writer is waiting to access the object, no new readers may start reading. A solution to either problem may result in starvation. In the first case, writers may starve; in the second case, readers may starve. For this reason, other variants of the problem have been proposed. Next, we present a solution to the first readers–writers problem. See the bibliographical notes at the end of the chapter for references describing starvation-free solutions to the second readers–writers problem. In the solution to the first readers–writers problem, the reader processes share the following data structures: semaphore rw mutex = 1; semaphore mutex = 1; int read count = 0; The semaphores mutex and rw mutex are initialized to 1; read count is initialized to 0. The semaphore rw mutex is common to both reader and writer
  • 115. 5.7 Classic Problems of Synchronization 221 do { wait(rw mutex); . . . /* writing is performed */ . . . signal(rw mutex); } while (true); Figure 5.11 The structure of a writer process. processes. The mutex semaphore is used to ensure mutual exclusion when the variable read count is updated. The read count variable keeps track of how many processes are currently reading the object. The semaphore rw mutex functions as a mutual exclusion semaphore for the writers. It is also used by the first or last reader that enters or exits the critical section. It is not used by readers who enter or exit while other readers are in their critical sections. The code for a writer process is shown in Figure 5.11; the code for a reader process is shown in Figure 5.12. Note that, if a writer is in the critical section and n readers are waiting, then one reader is queued on rw mutex, and n − 1 readers are queued on mutex. Also observe that, when a writer executes signal(rw mutex), we may resume the execution of either the waiting readers or a single waiting writer. The selection is made by the scheduler. The readers–writers problem and its solutions have been generalized to provide reader–writer locks on some systems. Acquiring a reader–writer lock requires specifying the mode of the lock: either read or write access. When a process wishes only to read shared data, it requests the reader–writer lock in read mode. A process wishing to modify the shared data must request the lock in write mode. Multiple processes are permitted to concurrently acquire a reader–writer lock in read mode, but only one process may acquire the lock for writing, as exclusive access is required for writers. Reader–writer locks are most useful in the following situations: do { wait(mutex); read count++; if (read count == 1) wait(rw mutex); signal(mutex); . . . /* reading is performed */ . . . wait(mutex); read count--; if (read count == 0) signal(rw mutex); signal(mutex); } while (true); Figure 5.12 The structure of a reader process.
  • 116. 222 Chapter 5 Process Synchronization RICE Figure 5.13 The situation of the dining philosophers. • In applications where it is easy to identify which processes only read shared data and which processes only write shared data. • In applications that have more readers than writers. This is because reader– writer locks generally require more overhead to establish than semaphores or mutual-exclusion locks. The increased concurrency of allowing multiple readers compensates for the overhead involved in setting up the reader– writer lock. 5.7.3 The Dining-Philosophers Problem Consider five philosophers who spend their lives thinking and eating. The philosophers share a circular table surrounded by five chairs, each belonging to one philosopher. In the center of the table is a bowl of rice, and the table is laid with five single chopsticks (Figure 5.13). When a philosopher thinks, she does not interact with her colleagues. From time to time, a philosopher gets hungry and tries to pick up the two chopsticks that are closest to her (the chopsticks that are between her and her left and right neighbors). A philosopher may pick up only one chopstick at a time. Obviously, she cannot pick up a chopstick that is already in the hand of a neighbor. When a hungry philosopher has both her chopsticks at the same time, she eats without releasing the chopsticks. When she is finished eating, she puts down both chopsticks and starts thinking again. The dining-philosophers problem is considered a classic synchronization problem neither because of its practical importance nor because computer scientists dislike philosophers but because it is an example of a large class of concurrency-control problems. It is a simple representation of the need to allocate several resources among several processes in a deadlock-free and starvation-free manner. One simple solution is to represent each chopstick with a semaphore. A philosopher tries to grab a chopstick by executing a wait() operation on that semaphore. She releases her chopsticks by executing the signal() operation on the appropriate semaphores. Thus, the shared data are semaphore chopstick[5];
  • 117. 5.8 Monitors 223

do {
    wait(chopstick[i]);
    wait(chopstick[(i+1) % 5]);
    . . .
    /* eat for awhile */
    . . .
    signal(chopstick[i]);
    signal(chopstick[(i+1) % 5]);
    . . .
    /* think for awhile */
    . . .
} while (true);

Figure 5.14 The structure of philosopher i.

where all the elements of chopstick are initialized to 1. The structure of philosopher i is shown in Figure 5.14. Although this solution guarantees that no two neighbors are eating simultaneously, it nevertheless must be rejected because it could create a deadlock. Suppose that all five philosophers become hungry at the same time and each grabs her left chopstick. All the elements of chopstick will now be equal to 0. When each philosopher tries to grab her right chopstick, she will be delayed forever.

Several possible remedies to the deadlock problem are the following:

• Allow at most four philosophers to be sitting simultaneously at the table.
• Allow a philosopher to pick up her chopsticks only if both chopsticks are available (to do this, she must pick them up in a critical section).
• Use an asymmetric solution—that is, an odd-numbered philosopher picks up first her left chopstick and then her right chopstick, whereas an even-numbered philosopher picks up her right chopstick and then her left chopstick.

In Section 5.8, we present a solution to the dining-philosophers problem that ensures freedom from deadlocks. Note, however, that any satisfactory solution to the dining-philosophers problem must guard against the possibility that one of the philosophers will starve to death. A deadlock-free solution does not necessarily eliminate the possibility of starvation.

5.8 Monitors

Although semaphores provide a convenient and effective mechanism for process synchronization, using them incorrectly can result in timing errors that are difficult to detect, since these errors happen only if particular execution sequences take place and these sequences do not always occur. We have seen an example of such errors in the use of counters in our solution to the producer–consumer problem (Section 5.1). In that example, the timing problem happened only rarely, and even then the counter value
  • 118. 6 C H A P T E R CPU Scheduling CPU scheduling is the basis of multiprogrammed operating systems. By switching the CPU among processes, the operating system can make the computer more productive. In this chapter, we introduce basic CPU-scheduling concepts and present several CPU-scheduling algorithms. We also consider the problem of selecting an algorithm for a particular system. In Chapter 4, we introduced threads to the process model. On operating systems that support them, it is kernel-level threads—not processes—that are in fact being scheduled by the operating system. However, the terms "process scheduling" and "thread scheduling" are often used interchangeably. In this chapter, we use process scheduling when discussing general scheduling concepts and thread scheduling to refer to thread-specific ideas. CHAPTER OBJECTIVES • To introduce CPU scheduling, which is the basis for multiprogrammed operating systems. • To describe various CPU-scheduling algorithms. • To discuss evaluation criteria for selecting a CPU-scheduling algorithm for a particular system. • To examine the scheduling algorithms of several operating systems. 6.1 Basic Concepts In a single-processor system, only one process can run at a time. Others must wait until the CPU is free and can be rescheduled. The objective of multiprogramming is to have some process running at all times, to maximize CPU utilization. The idea is relatively simple. A process is executed until it must wait, typically for the completion of some I/O request. In a simple computer system, the CPU then just sits idle. All this waiting time is wasted; no useful work is accomplished. With multiprogramming, we try to use this time productively. Several processes are kept in memory at one time. When 261
  • 119. 262 Chapter 6 CPU Scheduling

[Figure 6.1 Alternating sequence of CPU and I/O bursts: a process cycles between CPU bursts (load, store, add, read from file) and waits for I/O.]

one process has to wait, the operating system takes the CPU away from that process and gives the CPU to another process. This pattern continues. Every time one process has to wait, another process can take over use of the CPU.

Scheduling of this kind is a fundamental operating-system function. Almost all computer resources are scheduled before use. The CPU is, of course, one of the primary computer resources. Thus, its scheduling is central to operating-system design.

6.1.1 CPU–I/O Burst Cycle

The success of CPU scheduling depends on an observed property of processes: process execution consists of a cycle of CPU execution and I/O wait. Processes alternate between these two states. Process execution begins with a CPU burst. That is followed by an I/O burst, which is followed by another CPU burst, then another I/O burst, and so on. Eventually, the final CPU burst ends with a system request to terminate execution (Figure 6.1).

The durations of CPU bursts have been measured extensively. Although they vary greatly from process to process and from computer to computer, they tend to have a frequency curve similar to that shown in Figure 6.2. The curve is generally characterized as exponential or hyperexponential, with a large number of short CPU bursts and a small number of long CPU bursts.
  • 120. 6.1 Basic Concepts 263

[Figure 6.2 Histogram of CPU-burst durations: frequency plotted against burst duration (milliseconds).]

An I/O-bound program typically has many short CPU bursts. A CPU-bound program might have a few long CPU bursts. This distribution can be important in the selection of an appropriate CPU-scheduling algorithm.

6.1.2 CPU Scheduler

Whenever the CPU becomes idle, the operating system must select one of the processes in the ready queue to be executed. The selection process is carried out by the short-term scheduler, or CPU scheduler. The scheduler selects a process from the processes in memory that are ready to execute and allocates the CPU to that process.

Note that the ready queue is not necessarily a first-in, first-out (FIFO) queue. As we shall see when we consider the various scheduling algorithms, a ready queue can be implemented as a FIFO queue, a priority queue, a tree, or simply an unordered linked list. Conceptually, however, all the processes in the ready queue are lined up waiting for a chance to run on the CPU. The records in the queues are generally process control blocks (PCBs) of the processes.

6.1.3 Preemptive Scheduling

CPU-scheduling decisions may take place under the following four circumstances:

1. When a process switches from the running state to the waiting state (for example, as the result of an I/O request or an invocation of wait() for the termination of a child process)
  • 121. 264 Chapter 6 CPU Scheduling 2. When a process switches from the running state to the ready state (for example, when an interrupt occurs) 3. When a process switches from the waiting state to the ready state (for example, at completion of I/O) 4. When a process terminates For situations 1 and 4, there is no choice in terms of scheduling. A new process (if one exists in the ready queue) must be selected for execution. There is a choice, however, for situations 2 and 3. When scheduling takes place only under circumstances 1 and 4, we say that the scheduling scheme is nonpreemptive or cooperative. Otherwise, it is preemptive. Under nonpreemptive scheduling, once the CPU has been allocated to a process, the process keeps the CPU until it releases the CPU either by terminating or by switching to the waiting state. This scheduling method was used by Microsoft Windows 3.x. Windows 95 introduced preemptive scheduling, and all subsequent versions of Windows operating systems have used preemptive scheduling. The Mac OS X operating system for the Macintosh also uses preemptive scheduling; previous versions of the Macintosh operating system relied on cooperative scheduling. Cooperative scheduling is the only method that can be used on certain hardware platforms, because it does not require the special hardware (for example, a timer) needed for preemptive scheduling. Unfortunately, preemptive scheduling can result in race conditions when data are shared among several processes. Consider the case of two processes that share data. While one process is updating the data, it is preempted so that the second process can run. The second process then tries to read the data, which are in an inconsistent state. This issue was explored in detail in Chapter 5. Preemption also affects the design of the operating-system kernel. During the processing of a system call, the kernel may be busy with an activity on behalf of a process. Such activities may involve changing important kernel data (for instance, I/O queues). What happens if the process is preempted in the middle of these changes and the kernel (or the device driver) needs to read or modify the same structure? Chaos ensues. Certain operating systems, including most versions of UNIX, deal with this problem by waiting either for a system call to complete or for an I/O block to take place before doing a context switch. This scheme ensures that the kernel structure is simple, since the kernel will not preempt a process while the kernel data structures are in an inconsistent state. Unfortunately, this kernel-execution model is a poor one for supporting real-time computing where tasks must complete execution within a given time frame. In Section 6.6, we explore scheduling demands of real-time systems. Because interrupts can, by definition, occur at any time, and because they cannot always be ignored by the kernel, the sections of code affected by interrupts must be guarded from simultaneous use. The operating system needs to accept interrupts at almost all times. Otherwise, input might be lost or output overwritten. So that these sections of code are not accessed concurrently by several processes, they disable interrupts at entry and reenable interrupts at exit. It is important to note that sections of code that disable interrupts do not occur very often and typically contain few instructions.
  • 122. 6.2 Scheduling Criteria 265

6.1.4 Dispatcher

Another component involved in the CPU-scheduling function is the dispatcher. The dispatcher is the module that gives control of the CPU to the process selected by the short-term scheduler. This function involves the following:

• Switching context
• Switching to user mode
• Jumping to the proper location in the user program to restart that program

The dispatcher should be as fast as possible, since it is invoked during every process switch. The time it takes for the dispatcher to stop one process and start another running is known as the dispatch latency.

6.2 Scheduling Criteria

Different CPU-scheduling algorithms have different properties, and the choice of a particular algorithm may favor one class of processes over another. In choosing which algorithm to use in a particular situation, we must consider the properties of the various algorithms.

Many criteria have been suggested for comparing CPU-scheduling algorithms. Which characteristics are used for comparison can make a substantial difference in which algorithm is judged to be best. The criteria include the following:

• CPU utilization. We want to keep the CPU as busy as possible. Conceptually, CPU utilization can range from 0 to 100 percent. In a real system, it should range from 40 percent (for a lightly loaded system) to 90 percent (for a heavily loaded system).

• Throughput. If the CPU is busy executing processes, then work is being done. One measure of work is the number of processes that are completed per time unit, called throughput. For long processes, this rate may be one process per hour; for short transactions, it may be ten processes per second.

• Turnaround time. From the point of view of a particular process, the important criterion is how long it takes to execute that process. The interval from the time of submission of a process to the time of completion is the turnaround time. Turnaround time is the sum of the periods spent waiting to get into memory, waiting in the ready queue, executing on the CPU, and doing I/O.

• Waiting time. The CPU-scheduling algorithm does not affect the amount of time during which a process executes or does I/O. It affects only the amount of time that a process spends waiting in the ready queue. Waiting time is the sum of the periods spent waiting in the ready queue.

• Response time. In an interactive system, turnaround time may not be the best criterion. Often, a process can produce some output fairly early and can continue computing new results while previous results are being
  • 123. 266 Chapter 6 CPU Scheduling

output to the user. Thus, another measure is the time from the submission of a request until the first response is produced. This measure, called response time, is the time it takes to start responding, not the time it takes to output the response. The turnaround time is generally limited by the speed of the output device.

It is desirable to maximize CPU utilization and throughput and to minimize turnaround time, waiting time, and response time. In most cases, we optimize the average measure. However, under some circumstances, we prefer to optimize the minimum or maximum values rather than the average. For example, to guarantee that all users get good service, we may want to minimize the maximum response time.

Investigators have suggested that, for interactive systems (such as desktop systems), it is more important to minimize the variance in the response time than to minimize the average response time. A system with reasonable and predictable response time may be considered more desirable than a system that is faster on the average but is highly variable. However, little work has been done on CPU-scheduling algorithms that minimize variance.

As we discuss various CPU-scheduling algorithms in the following section, we illustrate their operation. An accurate illustration should involve many processes, each a sequence of several hundred CPU bursts and I/O bursts. For simplicity, though, we consider only one CPU burst (in milliseconds) per process in our examples. Our measure of comparison is the average waiting time. More elaborate evaluation mechanisms are discussed in Section 6.8.

6.3 Scheduling Algorithms

CPU scheduling deals with the problem of deciding which of the processes in the ready queue is to be allocated the CPU. There are many different CPU-scheduling algorithms. In this section, we describe several of them.

6.3.1 First-Come, First-Served Scheduling

By far the simplest CPU-scheduling algorithm is the first-come, first-served (FCFS) scheduling algorithm. With this scheme, the process that requests the CPU first is allocated the CPU first. The implementation of the FCFS policy is easily managed with a FIFO queue. When a process enters the ready queue, its PCB is linked onto the tail of the queue. When the CPU is free, it is allocated to the process at the head of the queue. The running process is then removed from the queue. The code for FCFS scheduling is simple to write and understand.

On the negative side, the average waiting time under the FCFS policy is often quite long. Consider the following set of processes that arrive at time 0, with the length of the CPU burst given in milliseconds:

Process   Burst Time
P1        24
P2        3
P3        3
  • 124. 6.3 Scheduling Algorithms 267

If the processes arrive in the order P1, P2, P3, and are served in FCFS order, we get the result shown in the following Gantt chart, which is a bar chart that illustrates a particular schedule, including the start and finish times of each of the participating processes:

| P1              | P2 | P3 |
0                24   27   30

The waiting time is 0 milliseconds for process P1, 24 milliseconds for process P2, and 27 milliseconds for process P3. Thus, the average waiting time is (0 + 24 + 27)/3 = 17 milliseconds. If the processes arrive in the order P2, P3, P1, however, the results will be as shown in the following Gantt chart:

| P2 | P3 | P1              |
0    3    6                30

The average waiting time is now (6 + 0 + 3)/3 = 3 milliseconds. This reduction is substantial. Thus, the average waiting time under an FCFS policy is generally not minimal and may vary substantially if the processes' CPU burst times vary greatly.

In addition, consider the performance of FCFS scheduling in a dynamic situation. Assume we have one CPU-bound process and many I/O-bound processes. As the processes flow around the system, the following scenario may result. The CPU-bound process will get and hold the CPU. During this time, all the other processes will finish their I/O and will move into the ready queue, waiting for the CPU. While the processes wait in the ready queue, the I/O devices are idle. Eventually, the CPU-bound process finishes its CPU burst and moves to an I/O device. All the I/O-bound processes, which have short CPU bursts, execute quickly and move back to the I/O queues. At this point, the CPU sits idle. The CPU-bound process will then move back to the ready queue and be allocated the CPU. Again, all the I/O processes end up waiting in the ready queue until the CPU-bound process is done. There is a convoy effect as all the other processes wait for the one big process to get off the CPU. This effect results in lower CPU and device utilization than might be possible if the shorter processes were allowed to go first.

Note also that the FCFS scheduling algorithm is nonpreemptive. Once the CPU has been allocated to a process, that process keeps the CPU until it releases the CPU, either by terminating or by requesting I/O. The FCFS algorithm is thus particularly troublesome for time-sharing systems, where it is important that each user get a share of the CPU at regular intervals. It would be disastrous to allow one process to keep the CPU for an extended period.

6.3.2 Shortest-Job-First Scheduling

A different approach to CPU scheduling is the shortest-job-first (SJF) scheduling algorithm. This algorithm associates with each process the length of the process's next CPU burst. When the CPU is available, it is assigned to the
  • 125. 268 Chapter 6 CPU Scheduling process that has the smallest next CPU burst. If the next CPU bursts of two processes are the same, FCFS scheduling is used to break the tie. Note that a more appropriate term for this scheduling method would be the shortest-next- CPU-burst algorithm, because scheduling depends on the length of the next CPU burst of a process, rather than its total length. We use the term SJF because most people and textbooks use this term to refer to this type of scheduling. As an example of SJF scheduling, consider the following set of processes, with the length of the CPU burst given in milliseconds: Process Burst Time P1 6 P2 8 P3 7 P4 3 Using SJF scheduling, we would schedule these processes according to the following Gantt chart: P3 P2 P4 P1 24 16 9 0 3 The waiting time is 3 milliseconds for process P1, 16 milliseconds for process P2, 9 milliseconds for process P3, and 0 milliseconds for process P4. Thus, the average waiting time is (3 + 16 + 9 + 0)/4 = 7 milliseconds. By comparison, if we were using the FCFS scheduling scheme, the average waiting time would be 10.25 milliseconds. The SJF scheduling algorithm is provably optimal, in that it gives the minimum average waiting time for a given set of processes. Moving a short process before a long one decreases the waiting time of the short process more than it increases the waiting time of the long process. Consequently, the average waiting time decreases. The real difficulty with the SJF algorithm is knowing the length of the next CPU request. For long-term (job) scheduling in a batch system, we can use the process time limit that a user specifies when he submits the job. In this situation, users are motivated to estimate the process time limit accurately, since a lower value may mean faster response but too low a value will cause a time-limit-exceeded error and require resubmission. SJF scheduling is used frequently in long-term scheduling. Although the SJF algorithm is optimal, it cannot be implemented at the level of short-term CPU scheduling. With short-term scheduling, there is no way to know the length of the next CPU burst. One approach to this problem is to try to approximate SJF scheduling. We may not know the length of the next CPU burst, but we may be able to predict its value. We expect that the next CPU burst will be similar in length to the previous ones. By computing an approximation of the length of the next CPU burst, we can pick the process with the shortest predicted CPU burst. The next CPU burst is generally predicted as an exponential average of the measured lengths of previous CPU bursts. We can define the exponential
  • 126. 6.3 Scheduling Algorithms 269

[Figure 6.3 Prediction of the length of the next CPU burst: measured bursts ti of 6, 4, 6, 4, 13, 13, 13, … produce the successive guesses τi of 10, 8, 6, 6, 5, 9, 11, 12, … over time.]

average with the following formula. Let tn be the length of the nth CPU burst, and let τn+1 be our predicted value for the next CPU burst. Then, for α, 0 ≤ α ≤ 1, define

τn+1 = α tn + (1 − α) τn.

The value of tn contains our most recent information, while τn stores the past history. The parameter α controls the relative weight of recent and past history in our prediction. If α = 0, then τn+1 = τn, and recent history has no effect (current conditions are assumed to be transient). If α = 1, then τn+1 = tn, and only the most recent CPU burst matters (history is assumed to be old and irrelevant). More commonly, α = 1/2, so recent history and past history are equally weighted. The initial τ0 can be defined as a constant or as an overall system average. Figure 6.3 shows an exponential average with α = 1/2 and τ0 = 10.

To understand the behavior of the exponential average, we can expand the formula for τn+1 by substituting for τn to find

τn+1 = α tn + (1 − α) α tn−1 + · · · + (1 − α)^j α tn−j + · · · + (1 − α)^(n+1) τ0.

Typically, α is less than 1. As a result, (1 − α) is also less than 1, and each successive term has less weight than its predecessor.

The SJF algorithm can be either preemptive or nonpreemptive. The choice arises when a new process arrives at the ready queue while a previous process is still executing. The next CPU burst of the newly arrived process may be shorter than what is left of the currently executing process. A preemptive SJF algorithm will preempt the currently executing process, whereas a nonpreemptive SJF algorithm will allow the currently running process to finish its CPU burst. Preemptive SJF scheduling is sometimes called shortest-remaining-time-first scheduling.
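A small numerical sketch (not from the text) of the exponential average above, using α = 1/2, τ0 = 10, and the burst sequence of Figure 6.3; running it reproduces the successive guesses 10, 8, 6, 6, 5, 9, 11, 12.

    #include <stdio.h>

    int main(void) {
        double alpha = 0.5;       /* weight of the most recent burst            */
        double tau = 10.0;        /* tau_0: initial guess for the first burst   */
        double bursts[] = {6, 4, 6, 4, 13, 13, 13};   /* measured t_n values    */
        int n = sizeof bursts / sizeof bursts[0];

        for (int i = 0; i < n; i++) {
            printf("predicted %.1f, observed %.1f\n", tau, bursts[i]);
            tau = alpha * bursts[i] + (1.0 - alpha) * tau;   /* tau_{n+1}       */
        }
        printf("next prediction: %.1f\n", tau);
        return 0;
    }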
  • 127. 270 Chapter 6 CPU Scheduling

As an example, consider the following four processes, with the length of the CPU burst given in milliseconds:

Process   Arrival Time   Burst Time
P1        0              8
P2        1              4
P3        2              9
P4        3              5

If the processes arrive at the ready queue at the times shown and need the indicated burst times, then the resulting preemptive SJF schedule is as depicted in the following Gantt chart:

| P1 | P2   | P4    | P1      | P3       |
0    1      5       10        17         26

Process P1 is started at time 0, since it is the only process in the queue. Process P2 arrives at time 1. The remaining time for process P1 (7 milliseconds) is larger than the time required by process P2 (4 milliseconds), so process P1 is preempted, and process P2 is scheduled. The average waiting time for this example is [(10 − 1) + (1 − 1) + (17 − 2) + (5 − 3)]/4 = 26/4 = 6.5 milliseconds. Nonpreemptive SJF scheduling would result in an average waiting time of 7.75 milliseconds.

6.3.3 Priority Scheduling

The SJF algorithm is a special case of the general priority-scheduling algorithm. A priority is associated with each process, and the CPU is allocated to the process with the highest priority. Equal-priority processes are scheduled in FCFS order. An SJF algorithm is simply a priority algorithm where the priority (p) is the inverse of the (predicted) next CPU burst. The larger the CPU burst, the lower the priority, and vice versa.

Note that we discuss scheduling in terms of high priority and low priority. Priorities are generally indicated by some fixed range of numbers, such as 0 to 7 or 0 to 4,095. However, there is no general agreement on whether 0 is the highest or lowest priority. Some systems use low numbers to represent low priority; others use low numbers for high priority. This difference can lead to confusion. In this text, we assume that low numbers represent high priority.

As an example, consider the following set of processes, assumed to have arrived at time 0 in the order P1, P2, · · ·, P5, with the length of the CPU burst given in milliseconds:

Process   Burst Time   Priority
P1        10           3
P2        1            1
P3        2            4
P4        1            5
P5        5            2
  • 128. 6.3 Scheduling Algorithms 271 Using priority scheduling, we would schedule these processes according to the following Gantt chart: P1 P4 P3 P2 P5 19 18 16 6 0 1 The average waiting time is 8.2 milliseconds. Priorities can be defined either internally or externally. Internally defined priorities use some measurable quantity or quantities to compute the priority of a process. For example, time limits, memory requirements, the number of open files, and the ratio of average I/O burst to average CPU burst have been used in computing priorities. External priorities are set by criteria outside the operating system, such as the importance of the process, the type and amount of funds being paid for computer use, the department sponsoring the work, and other, often political, factors. Priority scheduling can be either preemptive or nonpreemptive. When a process arrives at the ready queue, its priority is compared with the priority of the currently running process. A preemptive priority scheduling algorithm will preempt the CPU if the priority of the newly arrived process is higher than the priority of the currently running process. A nonpreemptive priority scheduling algorithm will simply put the new process at the head of the ready queue. A major problem with priority scheduling algorithms is indefinite block- ing, or starvation. A process that is ready to run but waiting for the CPU can be considered blocked. A priority scheduling algorithm can leave some low- priority processes waiting indefinitely. In a heavily loaded computer system, a steady stream of higher-priority processes can prevent a low-priority process from ever getting the CPU. Generally, one of two things will happen. Either the process will eventually be run (at 2 A.M. Sunday, when the system is finally lightly loaded), or the computer system will eventually crash and lose all unfinished low-priority processes. (Rumor has it that when they shut down the IBM 7094 at MIT in 1973, they found a low-priority process that had been submitted in 1967 and had not yet been run.) A solution to the problem of indefinite blockage of low-priority processes is aging. Aging involves gradually increasing the priority of processes that wait in the system for a long time. For example, if priorities range from 127 (low) to 0 (high), we could increase the priority of a waiting process by 1 every 15 minutes. Eventually, even a process with an initial priority of 127 would have the highest priority in the system and would be executed. In fact, it would take no more than 32 hours for a priority-127 process to age to a priority-0 process. 6.3.4 Round-Robin Scheduling The round-robin (RR) scheduling algorithm is designed especially for time- sharing systems. It is similar to FCFS scheduling, but preemption is added to enable the system to switch between processes. A small unit of time, called a time quantum or time slice, is defined. A time quantum is generally from 10 to 100 milliseconds in length. The ready queue is treated as a circular queue.
  • 129. 272 Chapter 6 CPU Scheduling

The CPU scheduler goes around the ready queue, allocating the CPU to each process for a time interval of up to 1 time quantum. To implement RR scheduling, we again treat the ready queue as a FIFO queue of processes. New processes are added to the tail of the ready queue. The CPU scheduler picks the first process from the ready queue, sets a timer to interrupt after 1 time quantum, and dispatches the process.

One of two things will then happen. The process may have a CPU burst of less than 1 time quantum. In this case, the process itself will release the CPU voluntarily. The scheduler will then proceed to the next process in the ready queue. If the CPU burst of the currently running process is longer than 1 time quantum, the timer will go off and will cause an interrupt to the operating system. A context switch will be executed, and the process will be put at the tail of the ready queue. The CPU scheduler will then select the next process in the ready queue.

The average waiting time under the RR policy is often long. Consider the following set of processes that arrive at time 0, with the length of the CPU burst given in milliseconds:

Process   Burst Time
P1        24
P2        3
P3        3

If we use a time quantum of 4 milliseconds, then process P1 gets the first 4 milliseconds. Since it requires another 20 milliseconds, it is preempted after the first time quantum, and the CPU is given to the next process in the queue, process P2. Process P2 does not need 4 milliseconds, so it quits before its time quantum expires. The CPU is then given to the next process, process P3. Once each process has received 1 time quantum, the CPU is returned to process P1 for an additional time quantum. The resulting RR schedule is as follows:

| P1  | P2 | P3 | P1  | P1  | P1  | P1  | P1  |
0     4    7    10    14    18    22    26    30

Let's calculate the average waiting time for this schedule. P1 waits for 6 milliseconds (10 − 4), P2 waits for 4 milliseconds, and P3 waits for 7 milliseconds. Thus, the average waiting time is 17/3 = 5.66 milliseconds.

In the RR scheduling algorithm, no process is allocated the CPU for more than 1 time quantum in a row (unless it is the only runnable process). If a process's CPU burst exceeds 1 time quantum, that process is preempted and is put back in the ready queue. The RR scheduling algorithm is thus preemptive.

If there are n processes in the ready queue and the time quantum is q, then each process gets 1/n of the CPU time in chunks of at most q time units. Each process must wait no longer than (n − 1) × q time units until its next time quantum. For example, with five processes and a time quantum of 20 milliseconds, each process will get up to 20 milliseconds every 100 milliseconds.

The performance of the RR algorithm depends heavily on the size of the time quantum. At one extreme, if the time quantum is extremely large, the RR policy
  • 130. 6.3 Scheduling Algorithms 273

[Figure 6.4 How a smaller time quantum increases context switches: a process of 10 time units incurs 0 context switches with a quantum of 12, 1 with a quantum of 6, and 9 with a quantum of 1.]

is the same as the FCFS policy. In contrast, if the time quantum is extremely small (say, 1 millisecond), the RR approach can result in a large number of context switches. Assume, for example, that we have only one process of 10 time units. If the quantum is 12 time units, the process finishes in less than 1 time quantum, with no overhead. If the quantum is 6 time units, however, the process requires 2 quanta, resulting in a context switch. If the time quantum is 1 time unit, then nine context switches will occur, slowing the execution of the process accordingly (Figure 6.4).

Thus, we want the time quantum to be large with respect to the context-switch time. If the context-switch time is approximately 10 percent of the time quantum, then about 10 percent of the CPU time will be spent in context switching. In practice, most modern systems have time quanta ranging from 10 to 100 milliseconds. The time required for a context switch is typically less than 10 microseconds; thus, the context-switch time is a small fraction of the time quantum.

Turnaround time also depends on the size of the time quantum. As we can see from Figure 6.5, the average turnaround time of a set of processes does not necessarily improve as the time-quantum size increases. In general, the average turnaround time can be improved if most processes finish their next CPU burst in a single time quantum. For example, given three processes of 10 time units each and a quantum of 1 time unit, the average turnaround time is 29. If the time quantum is 10, however, the average turnaround time drops to 20. If context-switch time is added in, the average turnaround time increases even more for a smaller time quantum, since more context switches are required.

Although the time quantum should be large compared with the context-switch time, it should not be too large. As we pointed out earlier, if the time quantum is too large, RR scheduling degenerates to an FCFS policy. A rule of thumb is that 80 percent of the CPU bursts should be shorter than the time quantum.

6.3.5 Multilevel Queue Scheduling

Another class of scheduling algorithms has been created for situations in which processes are easily classified into different groups. For example, a
  • 131. 274 Chapter 6 CPU Scheduling

[Figure 6.5 How turnaround time varies with the time quantum: average turnaround time for four processes with CPU bursts of 6, 3, 1, and 7 time units, plotted for time quanta from 1 to 7.]

common division is made between foreground (interactive) processes and background (batch) processes. These two types of processes have different response-time requirements and so may have different scheduling needs. In addition, foreground processes may have priority (externally defined) over background processes.

A multilevel queue scheduling algorithm partitions the ready queue into several separate queues (Figure 6.6). The processes are permanently assigned to one queue, generally based on some property of the process, such as memory size, process priority, or process type. Each queue has its own scheduling algorithm. For example, separate queues might be used for foreground and background processes. The foreground queue might be scheduled by an RR algorithm, while the background queue is scheduled by an FCFS algorithm.

In addition, there must be scheduling among the queues, which is commonly implemented as fixed-priority preemptive scheduling. For example, the foreground queue may have absolute priority over the background queue.

Let's look at an example of a multilevel queue scheduling algorithm with five queues, listed below in order of priority:

1. System processes
2. Interactive processes
3. Interactive editing processes
4. Batch processes
5. Student processes
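The fixed-priority rule among queues such as these can be sketched as follows (illustrative code, not from the text): the dispatcher always takes the head of the highest-priority non-empty queue, so lower classes run only when every higher class is empty.

    #include <stddef.h>

    #define NUM_QUEUES 5          /* 0 = system processes ... 4 = student processes */

    typedef struct process {
        int pid;
        struct process *next;
    } process;

    static process *ready_queue[NUM_QUEUES];   /* head of each priority class */

    /* Pick the next process to run: take the head of the
       highest-priority (lowest-numbered) non-empty queue. */
    static process *select_next(void) {
        for (int level = 0; level < NUM_QUEUES; level++) {
            if (ready_queue[level] != NULL) {
                process *p = ready_queue[level];
                ready_queue[level] = p->next;   /* dequeue from that class */
                return p;
            }
        }
        return NULL;                            /* every queue is empty: the CPU idles */
    }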
  • 132. 6.3 Scheduling Algorithms 275

[Figure 6.6 Multilevel queue scheduling: five queues ordered from highest priority (system processes) through interactive, interactive editing, and batch processes to lowest priority (student processes).]

Each queue has absolute priority over lower-priority queues. No process in the batch queue, for example, could run unless the queues for system processes, interactive processes, and interactive editing processes were all empty. If an interactive editing process entered the ready queue while a batch process was running, the batch process would be preempted.

Another possibility is to time-slice among the queues. Here, each queue gets a certain portion of the CPU time, which it can then schedule among its various processes. For instance, in the foreground–background queue example, the foreground queue can be given 80 percent of the CPU time for RR scheduling among its processes, while the background queue receives 20 percent of the CPU to give to its processes on an FCFS basis.

6.3.6 Multilevel Feedback Queue Scheduling

Normally, when the multilevel queue scheduling algorithm is used, processes are permanently assigned to a queue when they enter the system. If there are separate queues for foreground and background processes, for example, processes do not move from one queue to the other, since processes do not change their foreground or background nature. This setup has the advantage of low scheduling overhead, but it is inflexible.

The multilevel feedback queue scheduling algorithm, in contrast, allows a process to move between queues. The idea is to separate processes according to the characteristics of their CPU bursts. If a process uses too much CPU time, it will be moved to a lower-priority queue. This scheme leaves I/O-bound and interactive processes in the higher-priority queues. In addition, a process that waits too long in a lower-priority queue may be moved to a higher-priority queue. This form of aging prevents starvation.

For example, consider a multilevel feedback queue scheduler with three queues, numbered from 0 to 2 (Figure 6.7). The scheduler first executes all
  • 133. 276 Chapter 6 CPU Scheduling quantum 8 quantum 16 FCFS Figure 6.7 Multilevel feedback queues. processes in queue 0. Only when queue 0 is empty will it execute processes in queue 1. Similarly, processes in queue 2 will be executed only if queues 0 and 1 are empty. A process that arrives for queue 1 will preempt a process in queue 2. A process in queue 1 will in turn be preempted by a process arriving for queue 0. A process entering the ready queue is put in queue 0. A process in queue 0 is given a time quantum of 8 milliseconds. If it does not finish within this time, it is moved to the tail of queue 1. If queue 0 is empty, the process at the head of queue 1 is given a quantum of 16 milliseconds. If it does not complete, it is preempted and is put into queue 2. Processes in queue 2 are run on an FCFS basis but are run only when queues 0 and 1 are empty. This scheduling algorithm gives highest priority to any process with a CPU burst of 8 milliseconds or less. Such a process will quickly get the CPU, finish its CPU burst, and go off to its next I/O burst. Processes that need more than 8 but less than 24 milliseconds are also served quickly, although with lower priority than shorter processes. Long processes automatically sink to queue 2 and are served in FCFS order with any CPU cycles left over from queues 0 and 1. In general, a multilevel feedback queue scheduler is defined by the following parameters: • The number of queues • The scheduling algorithm for each queue • The method used to determine when to upgrade a process to a higher- priority queue • The method used to determine when to demote a process to a lower- priority queue • The method used to determine which queue a process will enter when that process needs service The definition of a multilevel feedback queue scheduler makes it the most general CPU-scheduling algorithm. It can be configured to match a specific system under design. Unfortunately, it is also the most complex algorithm,
  • 134. 6.4 Thread Scheduling 277 since defining the best scheduler requires some means by which to select values for all the parameters. 6.4 Thread Scheduling In Chapter 4, we introduced threads to the process model, distinguishing between user-level and kernel-level threads. On operating systems that support them, it is kernel-level threads—not processes—that are being scheduled by the operating system. User-level threads are managed by a thread library, and the kernel is unaware of them. To run on a CPU, user-level threads must ultimately be mapped to an associated kernel-level thread, although this mapping may be indirect and may use a lightweight process (LWP). In this section, we explore scheduling issues involving user-level and kernel-level threads and offer specific examples of scheduling for Pthreads. 6.4.1 Contention Scope One distinction between user-level and kernel-level threads lies in how they are scheduled. On systems implementing the many-to-one (Section 4.3.1) and many-to-many (Section 4.3.3) models, the thread library schedules user-level threads to run on an available LWP. This scheme is known as process- contention scope (PCS), since competition for the CPU takes place among threads belonging to the same process. (When we say the thread library schedules user threads onto available LWPs, we do not mean that the threads are actually running on a CPU. That would require the operating system to schedule the kernel thread onto a physical CPU.) To decide which kernel-level thread to schedule onto a CPU, the kernel uses system-contention scope (SCS). Competition for the CPU with SCS scheduling takes place among all threads in the system. Systems using the one-to-one model (Section 4.3.2), such as Windows, Linux, and Solaris, schedule threads using only SCS. Typically, PCS is done according to priority—the scheduler selects the runnable thread with the highest priority to run. User-level thread priorities are set by the programmer and are not adjusted by the thread library, although some thread libraries may allow the programmer to change the priority of a thread. It is important to note that PCS will typically preempt the thread currently running in favor of a higher-priority thread; however, there is no guarantee of time slicing (Section 6.3.4) among threads of equal priority. 6.4.2 Pthread Scheduling We provided a sample POSIX Pthread program in Section 4.4.1, along with an introduction to thread creation with Pthreads. Now, we highlight the POSIX Pthread API that allows specifying PCS or SCS during thread creation. Pthreads identifies the following contention scope values: • PTHREAD SCOPE PROCESS schedules threads using PCS scheduling. • PTHREAD SCOPE SYSTEM schedules threads using SCS scheduling.
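A sketch (not from the text) of how these contention-scope values are set through the Pthreads attribute interface before thread creation; the runner function is illustrative, and some one-to-one systems, such as Linux, accept only PTHREAD_SCOPE_SYSTEM.

    #include <pthread.h>
    #include <stdio.h>

    static void *runner(void *arg) {
        /* ... thread work ... */
        return NULL;
    }

    int main(void) {
        pthread_attr_t attr;
        pthread_t tid;
        int scope;

        pthread_attr_init(&attr);

        /* Report the default contention scope on this system. */
        if (pthread_attr_getscope(&attr, &scope) == 0)
            printf("default scope: %s\n",
                   scope == PTHREAD_SCOPE_SYSTEM ? "PTHREAD_SCOPE_SYSTEM"
                                                 : "PTHREAD_SCOPE_PROCESS");

        /* Request system contention scope (SCS) for the new thread. */
        pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);

        pthread_create(&tid, &attr, runner, NULL);
        pthread_join(tid, NULL);
        pthread_attr_destroy(&attr);
        return 0;
    }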
  • 135. 7 C H A P T E R Deadlocks In a multiprogramming environment, several processes may compete for a finite number of resources. A process requests resources; if the resources are not available at that time, the process enters a waiting state. Sometimes, a waiting process is never again able to change state, because the resources it has requested are held by other waiting processes. This situation is called a deadlock. We discussed this issue briefly in Chapter 5 in connection with semaphores. Perhaps the best illustration of a deadlock can be drawn from a law passed by the Kansas legislature early in the 20th century. It said, in part: “When two trains approach each other at a crossing, both shall come to a full stop and neither shall start up again until the other has gone.” In this chapter, we describe methods that an operating system can use to prevent or deal with deadlocks. Although some applications can identify programs that may deadlock, operating systems typically do not provide deadlock-prevention facilities, and it remains the responsibility of program- mers to ensure that they design deadlock-free programs. Deadlock problems can only become more common, given current trends, including larger num- bers of processes, multithreaded programs, many more resources within a system, and an emphasis on long-lived file and database servers rather than batch systems. CHAPTER OBJECTIVES • To develop a description of deadlocks, which prevent sets of concurrent processes from completing their tasks. • To present a number of different methods for preventing or avoiding deadlocks in a computer system. 7.1 System Model A system consists of a finite number of resources to be distributed among a number of competing processes. The resources may be partitioned into several 315
types (or classes), each consisting of some number of identical instances. CPU cycles, files, and I/O devices (such as printers and DVD drives) are examples of resource types. If a system has two CPUs, then the resource type CPU has two instances. Similarly, the resource type printer may have five instances.

If a process requests an instance of a resource type, the allocation of any instance of the type should satisfy the request. If it does not, then the instances are not identical, and the resource type classes have not been defined properly. For example, a system may have two printers. These two printers may be defined to be in the same resource class if no one cares which printer prints which output. However, if one printer is on the ninth floor and the other is in the basement, then people on the ninth floor may not see both printers as equivalent, and separate resource classes may need to be defined for each printer.

Chapter 5 discussed various synchronization tools, such as mutex locks and semaphores. These tools are also considered system resources, and they are a common source of deadlock. However, a lock is typically associated with protecting a specific data structure; that is, one lock may be used to protect access to a queue, another to protect access to a linked list, and so forth. For that reason, each lock is typically assigned its own resource class, and definition is not a problem.

A process must request a resource before using it and must release the resource after using it. A process may request as many resources as it requires to carry out its designated task. Obviously, the number of resources requested may not exceed the total number of resources available in the system. In other words, a process cannot request three printers if the system has only two.

Under the normal mode of operation, a process may utilize a resource in only the following sequence:

1. Request. The process requests the resource. If the request cannot be granted immediately (for example, if the resource is being used by another process), then the requesting process must wait until it can acquire the resource.
2. Use. The process can operate on the resource (for example, if the resource is a printer, the process can print on the printer).
3. Release. The process releases the resource.

The request and release of resources may be system calls, as explained in Chapter 2. Examples are the request() and release() device, open() and close() file, and allocate() and free() memory system calls. Similarly, as we saw in Chapter 5, the request and release of semaphores can be accomplished through the wait() and signal() operations on semaphores or through acquire() and release() of a mutex lock. For each use of a kernel-managed resource by a process or thread, the operating system checks to make sure that the process has requested and has been allocated the resource. A system table records whether each resource is free or allocated. For each resource that is allocated, the table also records the process to which it is allocated. If a process requests a resource that is currently allocated to another process, it can be added to a queue of processes waiting for this resource.

A set of processes is in a deadlocked state when every process in the set is waiting for an event that can be caused only by another process in the set. The
events with which we are mainly concerned here are resource acquisition and release. The resources may be either physical resources (for example, printers, tape drives, memory space, and CPU cycles) or logical resources (for example, semaphores, mutex locks, and files). However, other types of events may result in deadlocks (for example, the IPC facilities discussed in Chapter 3).

To illustrate a deadlocked state, consider a system with three CD-RW drives. Suppose each of three processes holds one of these CD-RW drives. If each process now requests another drive, the three processes will be in a deadlocked state. Each is waiting for the event "CD-RW is released," which can be caused only by one of the other waiting processes. This example illustrates a deadlock involving the same resource type.

Deadlocks may also involve different resource types. For example, consider a system with one printer and one DVD drive. Suppose that process Pi is holding the DVD and process Pj is holding the printer. If Pi requests the printer and Pj requests the DVD drive, a deadlock occurs.

Developers of multithreaded applications must remain aware of the possibility of deadlocks. The locking tools presented in Chapter 5 are designed to avoid race conditions. However, in using these tools, developers must pay careful attention to how locks are acquired and released. Otherwise, deadlock can occur, as illustrated in the dining-philosophers problem in Section 5.7.3.

7.2 Deadlock Characterization

In a deadlock, processes never finish executing, and system resources are tied up, preventing other jobs from starting. Before we discuss the various methods for dealing with the deadlock problem, we look more closely at features that characterize deadlocks.

DEADLOCK WITH MUTEX LOCKS

Let's see how deadlock can occur in a multithreaded Pthread program using mutex locks. The pthread_mutex_init() function initializes an unlocked mutex. Mutex locks are acquired and released using pthread_mutex_lock() and pthread_mutex_unlock(), respectively. If a thread attempts to acquire a locked mutex, the call to pthread_mutex_lock() blocks the thread until the owner of the mutex lock invokes pthread_mutex_unlock().

Two mutex locks are created in the following code example:

    /* Create and initialize the mutex locks */
    pthread_mutex_t first_mutex;
    pthread_mutex_t second_mutex;

    pthread_mutex_init(&first_mutex, NULL);
    pthread_mutex_init(&second_mutex, NULL);

Next, two threads, thread_one and thread_two, are created, and both these threads have access to both mutex locks. thread_one and thread_two
run in the functions do_work_one() and do_work_two(), respectively, as shown below:

    /* thread_one runs in this function */
    void *do_work_one(void *param)
    {
        pthread_mutex_lock(&first_mutex);
        pthread_mutex_lock(&second_mutex);
        /* Do some work */
        pthread_mutex_unlock(&second_mutex);
        pthread_mutex_unlock(&first_mutex);
        pthread_exit(0);
    }

    /* thread_two runs in this function */
    void *do_work_two(void *param)
    {
        pthread_mutex_lock(&second_mutex);
        pthread_mutex_lock(&first_mutex);
        /* Do some work */
        pthread_mutex_unlock(&first_mutex);
        pthread_mutex_unlock(&second_mutex);
        pthread_exit(0);
    }

In this example, thread_one attempts to acquire the mutex locks in the order (1) first_mutex, (2) second_mutex, while thread_two attempts to acquire the mutex locks in the order (1) second_mutex, (2) first_mutex. Deadlock is possible if thread_one acquires first_mutex while thread_two acquires second_mutex.

Note that, even though deadlock is possible, it will not occur if thread_one can acquire and release the mutex locks for first_mutex and second_mutex before thread_two attempts to acquire the locks. And, of course, the order in which the threads run depends on how they are scheduled by the CPU scheduler. This example illustrates a problem with handling deadlocks: it is difficult to identify and test for deadlocks that may occur only under certain scheduling circumstances.
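For readers who want to reproduce the race, the following self-contained driver (our addition; the text shows only the two thread functions) creates both threads and waits for them. Run it in a loop and some runs may hang, illustrating exactly the testing difficulty just mentioned.

    #include <pthread.h>
    #include <stdio.h>

    pthread_mutex_t first_mutex;
    pthread_mutex_t second_mutex;

    /* Thread functions exactly as in the example above. */
    void *do_work_one(void *param)
    {
        pthread_mutex_lock(&first_mutex);
        pthread_mutex_lock(&second_mutex);
        pthread_mutex_unlock(&second_mutex);
        pthread_mutex_unlock(&first_mutex);
        pthread_exit(0);
    }

    void *do_work_two(void *param)
    {
        pthread_mutex_lock(&second_mutex);
        pthread_mutex_lock(&first_mutex);
        pthread_mutex_unlock(&first_mutex);
        pthread_mutex_unlock(&second_mutex);
        pthread_exit(0);
    }

    int main(void)
    {
        pthread_t one, two;

        pthread_mutex_init(&first_mutex, NULL);
        pthread_mutex_init(&second_mutex, NULL);
        pthread_create(&one, NULL, do_work_one, NULL);
        pthread_create(&two, NULL, do_work_two, NULL);
        pthread_join(one, NULL);   /* blocks forever on the runs that deadlock */
        pthread_join(two, NULL);
        printf("no deadlock on this run\n");
        return 0;
    }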
7.2.1 Necessary Conditions

A deadlock situation can arise if the following four conditions hold simultaneously in a system:

1. Mutual exclusion. At least one resource must be held in a nonsharable mode; that is, only one process at a time can use the resource. If another process requests that resource, the requesting process must be delayed until the resource has been released.
2. Hold and wait. A process must be holding at least one resource and waiting to acquire additional resources that are currently being held by other processes.
3. No preemption. Resources cannot be preempted; that is, a resource can be released only voluntarily by the process holding it, after that process has completed its task.
4. Circular wait. A set {P0, P1, ..., Pn} of waiting processes must exist such that P0 is waiting for a resource held by P1, P1 is waiting for a resource held by P2, ..., Pn−1 is waiting for a resource held by Pn, and Pn is waiting for a resource held by P0.

We emphasize that all four conditions must hold for a deadlock to occur. The circular-wait condition implies the hold-and-wait condition, so the four conditions are not completely independent. We shall see in Section 7.4, however, that it is useful to consider each condition separately.

7.2.2 Resource-Allocation Graph

Deadlocks can be described more precisely in terms of a directed graph called a system resource-allocation graph. This graph consists of a set of vertices V and a set of edges E. The set of vertices V is partitioned into two different types of nodes: P = {P1, P2, ..., Pn}, the set consisting of all the active processes in the system, and R = {R1, R2, ..., Rm}, the set consisting of all resource types in the system.

A directed edge from process Pi to resource type Rj is denoted by Pi → Rj; it signifies that process Pi has requested an instance of resource type Rj and is currently waiting for that resource. A directed edge from resource type Rj to process Pi is denoted by Rj → Pi; it signifies that an instance of resource type Rj has been allocated to process Pi. A directed edge Pi → Rj is called a request edge; a directed edge Rj → Pi is called an assignment edge.

Pictorially, we represent each process Pi as a circle and each resource type Rj as a rectangle. Since resource type Rj may have more than one instance, we represent each such instance as a dot within the rectangle. Note that a request edge points to only the rectangle Rj, whereas an assignment edge must also designate one of the dots in the rectangle.

When process Pi requests an instance of resource type Rj, a request edge is inserted in the resource-allocation graph. When this request can be fulfilled, the request edge is instantaneously transformed to an assignment edge. When the process no longer needs access to the resource, it releases the resource. As a result, the assignment edge is deleted.

The resource-allocation graph shown in Figure 7.1 depicts the following situation.

• The sets P, R, and E:
  ◦ P = {P1, P2, P3}
  ◦ R = {R1, R2, R3, R4}
  ◦ E = {P1 → R1, P2 → R3, R1 → P2, R2 → P2, R2 → P1, R3 → P3}

• Resource instances:
  ◦ One instance of resource type R1
  ◦ Two instances of resource type R2
  ◦ One instance of resource type R3
  ◦ Three instances of resource type R4

• Process states:
  ◦ Process P1 is holding an instance of resource type R2 and is waiting for an instance of resource type R1.
  ◦ Process P2 is holding an instance of R1 and an instance of R2 and is waiting for an instance of R3.
  ◦ Process P3 is holding an instance of R3.

[Figure 7.1: Resource-allocation graph.]

Given the definition of a resource-allocation graph, it can be shown that, if the graph contains no cycles, then no process in the system is deadlocked. If the graph does contain a cycle, then a deadlock may exist.

If each resource type has exactly one instance, then a cycle implies that a deadlock has occurred. If the cycle involves only a set of resource types, each of which has only a single instance, then a deadlock has occurred. Each process involved in the cycle is deadlocked. In this case, a cycle in the graph is both a necessary and a sufficient condition for the existence of deadlock.

If each resource type has several instances, then a cycle does not necessarily imply that a deadlock has occurred. In this case, a cycle in the graph is a necessary but not a sufficient condition for the existence of deadlock.

To illustrate this concept, we return to the resource-allocation graph depicted in Figure 7.1. Suppose that process P3 requests an instance of resource
type R2. Since no resource instance is currently available, we add a request edge P3 → R2 to the graph (Figure 7.2). At this point, two minimal cycles exist in the system:

    P1 → R1 → P2 → R3 → P3 → R2 → P1
    P2 → R3 → P3 → R2 → P2

[Figure 7.2: Resource-allocation graph with a deadlock.]

Processes P1, P2, and P3 are deadlocked. Process P2 is waiting for the resource R3, which is held by process P3. Process P3 is waiting for either process P1 or process P2 to release resource R2. In addition, process P1 is waiting for process P2 to release resource R1.

Now consider the resource-allocation graph in Figure 7.3. In this example, we also have a cycle:

    P1 → R1 → P3 → R2 → P1

[Figure 7.3: Resource-allocation graph with a cycle but no deadlock.]
However, there is no deadlock. Observe that process P4 may release its instance of resource type R2. That resource can then be allocated to P3, breaking the cycle.

In summary, if a resource-allocation graph does not have a cycle, then the system is not in a deadlocked state. If there is a cycle, then the system may or may not be in a deadlocked state. This observation is important when we deal with the deadlock problem.

7.3 Methods for Handling Deadlocks

Generally speaking, we can deal with the deadlock problem in one of three ways:

• We can use a protocol to prevent or avoid deadlocks, ensuring that the system will never enter a deadlocked state.
• We can allow the system to enter a deadlocked state, detect it, and recover.
• We can ignore the problem altogether and pretend that deadlocks never occur in the system.

The third solution is the one used by most operating systems, including Linux and Windows. It is then up to the application developer to write programs that handle deadlocks.

Next, we elaborate briefly on each of the three methods for handling deadlocks. Then, in Sections 7.4 through 7.7, we present detailed algorithms. Before proceeding, we should mention that some researchers have argued that none of the basic approaches alone is appropriate for the entire spectrum of resource-allocation problems in operating systems. The basic approaches can be combined, however, allowing us to select an optimal approach for each class of resources in a system.

To ensure that deadlocks never occur, the system can use either a deadlock-prevention or a deadlock-avoidance scheme. Deadlock prevention provides a set of methods to ensure that at least one of the necessary conditions (Section 7.2.1) cannot hold. These methods prevent deadlocks by constraining how requests for resources can be made. We discuss these methods in Section 7.4.

Deadlock avoidance requires that the operating system be given additional information in advance concerning which resources a process will request and use during its lifetime. With this additional knowledge, the operating system can decide for each request whether or not the process should wait. To decide whether the current request can be satisfied or must be delayed, the system must consider the resources currently available, the resources currently allocated to each process, and the future requests and releases of each process. We discuss these schemes in Section 7.5.

If a system does not employ either a deadlock-prevention or a deadlock-avoidance algorithm, then a deadlock situation may arise. In this environment, the system can provide an algorithm that examines the state of the system to determine whether a deadlock has occurred and an algorithm to recover from the deadlock (if a deadlock has indeed occurred). We discuss these issues in Section 7.6 and Section 7.7.
In the absence of algorithms to detect and recover from deadlocks, we may arrive at a situation in which the system is in a deadlocked state yet has no way of recognizing what has happened. In this case, the undetected deadlock will cause the system's performance to deteriorate, because resources are being held by processes that cannot run and because more and more processes, as they make requests for resources, will enter a deadlocked state. Eventually, the system will stop functioning and will need to be restarted manually.

Although this method may not seem to be a viable approach to the deadlock problem, it is nevertheless used in most operating systems, as mentioned earlier. Expense is one important consideration. Ignoring the possibility of deadlocks is cheaper than the other approaches. Since in many systems, deadlocks occur infrequently (say, once per year), the extra expense of the other methods may not seem worthwhile. In addition, methods used to recover from other conditions may be put to use to recover from deadlock. In some circumstances, a system is in a frozen state but not in a deadlocked state. We see this situation, for example, with a real-time process running at the highest priority (or any process running on a nonpreemptive scheduler) and never returning control to the operating system. The system must have manual recovery methods for such conditions and may simply use those techniques for deadlock recovery.

7.4 Deadlock Prevention

As we noted in Section 7.2.1, for a deadlock to occur, each of the four necessary conditions must hold. By ensuring that at least one of these conditions cannot hold, we can prevent the occurrence of a deadlock. We elaborate on this approach by examining each of the four necessary conditions separately.

7.4.1 Mutual Exclusion

The mutual-exclusion condition must hold. That is, at least one resource must be nonsharable. Sharable resources, in contrast, do not require mutually exclusive access and thus cannot be involved in a deadlock. Read-only files are a good example of a sharable resource. If several processes attempt to open a read-only file at the same time, they can be granted simultaneous access to the file. A process never needs to wait for a sharable resource. In general, however, we cannot prevent deadlocks by denying the mutual-exclusion condition, because some resources are intrinsically nonsharable. For example, a mutex lock cannot be simultaneously shared by several processes.

7.4.2 Hold and Wait

To ensure that the hold-and-wait condition never occurs in the system, we must guarantee that, whenever a process requests a resource, it does not hold any other resources. One protocol that we can use requires each process to request and be allocated all its resources before it begins execution. We can implement this provision by requiring that system calls requesting resources for a process precede all other system calls.
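As a concrete illustration of this first protocol, consider the two mutexes from the Pthreads example in Section 7.2. A thread can be made to take everything it will ever need before doing any work, with a global guard mutex making that acquisition effectively atomic, so a thread is never observed holding one lock while blocked on the other. This is a sketch of ours under those assumptions, not code from the text:

    #include <pthread.h>

    pthread_mutex_t first_mutex  = PTHREAD_MUTEX_INITIALIZER;
    pthread_mutex_t second_mutex = PTHREAD_MUTEX_INITIALIZER;
    /* guard serializes the "request all resources" step */
    pthread_mutex_t guard        = PTHREAD_MUTEX_INITIALIZER;

    void *worker(void *param)
    {
        /* Acquire every resource this thread will use, up front. */
        pthread_mutex_lock(&guard);
        pthread_mutex_lock(&first_mutex);
        pthread_mutex_lock(&second_mutex);
        pthread_mutex_unlock(&guard);

        /* ... do all the work while holding both locks ... */

        pthread_mutex_unlock(&second_mutex);
        pthread_mutex_unlock(&first_mutex);
        pthread_exit(0);
    }

The cost is the disadvantage discussed below: both locks are held for the worker's entire lifetime, even if each is needed only briefly.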
An alternative protocol allows a process to request resources only when it has none. A process may request some resources and use them. Before it can request any additional resources, it must release all the resources that it is currently allocated.

To illustrate the difference between these two protocols, we consider a process that copies data from a DVD drive to a file on disk, sorts the file, and then prints the results to a printer. If all resources must be requested at the beginning of the process, then the process must initially request the DVD drive, disk file, and printer. It will hold the printer for its entire execution, even though it needs the printer only at the end.

The second method allows the process to request initially only the DVD drive and disk file. It copies from the DVD drive to the disk and then releases both the DVD drive and the disk file. The process must then request the disk file and the printer. After copying the disk file to the printer, it releases these two resources and terminates.

Both these protocols have two main disadvantages. First, resource utilization may be low, since resources may be allocated but unused for a long period. In the example given, for instance, we can release the DVD drive and disk file, and then request the disk file and printer, only if we can be sure that our data will remain on the disk file. Otherwise, we must request all resources at the beginning for both protocols. Second, starvation is possible. A process that needs several popular resources may have to wait indefinitely, because at least one of the resources that it needs is always allocated to some other process.

7.4.3 No Preemption

The third necessary condition for deadlocks is that there be no preemption of resources that have already been allocated. To ensure that this condition does not hold, we can use the following protocol. If a process is holding some resources and requests another resource that cannot be immediately allocated to it (that is, the process must wait), then all resources the process is currently holding are preempted. In other words, these resources are implicitly released. The preempted resources are added to the list of resources for which the process is waiting. The process will be restarted only when it can regain its old resources, as well as the new ones that it is requesting.

Alternatively, if a process requests some resources, we first check whether they are available. If they are, we allocate them. If they are not, we check whether they are allocated to some other process that is waiting for additional resources. If so, we preempt the desired resources from the waiting process and allocate them to the requesting process. If the resources are neither available nor held by a waiting process, the requesting process must wait. While it is waiting, some of its resources may be preempted, but only if another process requests them. A process can be restarted only when it is allocated the new resources it is requesting and recovers any resources that were preempted while it was waiting.

This protocol is often applied to resources whose state can be easily saved and restored later, such as CPU registers and memory space. It cannot generally be applied to such resources as mutex locks and semaphores.
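For locks specifically, application code often approximates preemption with pthread_mutex_trylock(): a thread that cannot obtain its second mutex releases the first (in effect preempting itself) and retries. The following sketch is our illustration, not the text's, written so it can be applied to the two mutexes of the earlier example:

    #include <pthread.h>
    #include <sched.h>

    /* Never remain holding *first while blocked on *second. */
    void acquire_both(pthread_mutex_t *first, pthread_mutex_t *second)
    {
        for (;;) {
            pthread_mutex_lock(first);
            if (pthread_mutex_trylock(second) == 0)
                return;                      /* got both locks */
            pthread_mutex_unlock(first);     /* back off: release what we hold */
            sched_yield();                   /* give the other thread a chance */
        }
    }

With this helper, thread_two could call acquire_both(&second_mutex, &first_mutex) without risking deadlock, since no thread ever blocks while holding a lock. The trade-off is possible livelock, which is why real implementations usually add a randomized delay before retrying.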
7.4.4 Circular Wait

The fourth and final condition for deadlocks is the circular-wait condition. One way to ensure that this condition never holds is to impose a total ordering of all resource types and to require that each process requests resources in an increasing order of enumeration.

To illustrate, we let R = {R1, R2, ..., Rm} be the set of resource types. We assign to each resource type a unique integer number, which allows us to compare two resources and to determine whether one precedes another in our ordering. Formally, we define a one-to-one function F: R → N, where N is the set of natural numbers. For example, if the set of resource types R includes tape drives, disk drives, and printers, then the function F might be defined as follows:

    F(tape drive) = 1
    F(disk drive) = 5
    F(printer) = 12

We can now consider the following protocol to prevent deadlocks: Each process can request resources only in an increasing order of enumeration. That is, a process can initially request any number of instances of a resource type, say Ri. After that, the process can request instances of resource type Rj if and only if F(Rj) > F(Ri). For example, using the function defined previously, a process that wants to use the tape drive and printer at the same time must first request the tape drive and then request the printer. Alternatively, we can require that a process requesting an instance of resource type Rj must have released any resources Ri such that F(Ri) ≥ F(Rj). Note also that if several instances of the same resource type are needed, a single request for all of them must be issued.

If these two protocols are used, then the circular-wait condition cannot hold. We can demonstrate this fact by assuming that a circular wait exists (proof by contradiction). Let the set of processes involved in the circular wait be {P0, P1, ..., Pn}, where Pi is waiting for a resource Ri, which is held by process Pi+1. (Modulo arithmetic is used on the indexes, so that Pn is waiting for a resource Rn held by P0.) Then, since process Pi+1 is holding resource Ri while requesting resource Ri+1, we must have F(Ri) < F(Ri+1) for all i. But this condition means that F(R0) < F(R1) < ... < F(Rn) < F(R0). By transitivity, F(R0) < F(R0), which is impossible. Therefore, there can be no circular wait.

We can accomplish this scheme in an application program by developing an ordering among all synchronization objects in the system. All requests for synchronization objects must be made in increasing order. For example, if the lock ordering in the Pthread program shown in Figure 7.4 was

    F(first_mutex) = 1
    F(second_mutex) = 5

then thread_two could not request the locks out of order.

Keep in mind that developing an ordering, or hierarchy, does not in itself prevent deadlock. It is up to application developers to write programs that follow the ordering. Also note that the function F should be defined according to the normal order of usage of the resources in a system. For example, because
the tape drive is usually needed before the printer, it would be reasonable to define F(tape drive) < F(printer).

    /* thread_one runs in this function */
    void *do_work_one(void *param)
    {
        pthread_mutex_lock(&first_mutex);
        pthread_mutex_lock(&second_mutex);
        /* Do some work */
        pthread_mutex_unlock(&second_mutex);
        pthread_mutex_unlock(&first_mutex);
        pthread_exit(0);
    }

    /* thread_two runs in this function */
    void *do_work_two(void *param)
    {
        pthread_mutex_lock(&second_mutex);
        pthread_mutex_lock(&first_mutex);
        /* Do some work */
        pthread_mutex_unlock(&first_mutex);
        pthread_mutex_unlock(&second_mutex);
        pthread_exit(0);
    }

    Figure 7.4 Deadlock example.

Although ensuring that resources are acquired in the proper order is the responsibility of application developers, certain software can be used to verify that locks are acquired in the proper order and to give appropriate warnings when locks are acquired out of order and deadlock is possible. One lock-order verifier, which works on BSD versions of UNIX such as FreeBSD, is known as witness. Witness uses mutual-exclusion locks to protect critical sections, as described in Chapter 5. It works by dynamically maintaining the relationship of lock orders in a system.

Let's use the program shown in Figure 7.4 as an example. Assume that thread_one is the first to acquire the locks and does so in the order (1) first_mutex, (2) second_mutex. Witness records the relationship that first_mutex must be acquired before second_mutex. If thread_two later acquires the locks out of order, witness generates a warning message on the system console.

It is also important to note that imposing a lock ordering does not guarantee deadlock prevention if locks can be acquired dynamically. For example, assume we have a function that transfers funds between two accounts. To prevent a race condition, each account has an associated mutex lock that is obtained from a get_lock() function such as shown in Figure 7.5:
    void transaction(Account from, Account to, double amount)
    {
        mutex lock1, lock2;

        lock1 = get_lock(from);
        lock2 = get_lock(to);

        acquire(lock1);
        acquire(lock2);
        withdraw(from, amount);
        deposit(to, amount);
        release(lock2);
        release(lock1);
    }

    Figure 7.5 Deadlock example with lock ordering.

Deadlock is possible if two threads simultaneously invoke the transaction() function, transposing different accounts. That is, one thread might invoke

    transaction(checking_account, savings_account, 25);

and another might invoke

    transaction(savings_account, checking_account, 50);

We leave it as an exercise for students to fix this situation.
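One widely used repair, offered here only as a sketch in the same pseudocode style as Figure 7.5 (the exercise itself remains worthwhile), derives the acquisition order from the accounts rather than from the argument positions, so both threads take the two locks in the same global order. We assume the two accounts are distinct and that a hypothetical account_id() helper yields a unique, comparable identifier:

    void transaction(Account from, Account to, double amount)
    {
        mutex lock1, lock2;

        lock1 = get_lock(from);
        lock2 = get_lock(to);

        /* Impose a total order on lock acquisition; assumes from != to. */
        if (account_id(from) < account_id(to)) {
            acquire(lock1);
            acquire(lock2);
        }
        else {
            acquire(lock2);
            acquire(lock1);
        }
        withdraw(from, amount);
        deposit(to, amount);
        release(lock2);
        release(lock1);
    }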
7.5 Deadlock Avoidance

Deadlock-prevention algorithms, as discussed in Section 7.4, prevent deadlocks by limiting how requests can be made. The limits ensure that at least one of the necessary conditions for deadlock cannot occur. Possible side effects of preventing deadlocks by this method, however, are low device utilization and reduced system throughput.

An alternative method for avoiding deadlocks is to require additional information about how resources are to be requested. For example, in a system with one tape drive and one printer, the system might need to know that process P will request first the tape drive and then the printer before releasing both resources, whereas process Q will request first the printer and then the tape drive. With this knowledge of the complete sequence of requests and releases for each process, the system can decide for each request whether or not the process should wait in order to avoid a possible future deadlock. Each request requires that in making this decision the system consider the resources currently available, the resources currently allocated to each process, and the future requests and releases of each process.

The various algorithms that use this approach differ in the amount and type of information required. The simplest and most useful model requires that each process declare the maximum number of resources of each type that it may need. Given this a priori information, it is possible to construct an algorithm that ensures that the system will never enter a deadlocked state. A deadlock-avoidance algorithm dynamically examines the resource-allocation state to ensure that a circular-wait condition can never exist. The resource-allocation state is defined by the number of available and allocated resources and the maximum demands of the processes. In the following sections, we explore two deadlock-avoidance algorithms.

7.5.1 Safe State

A state is safe if the system can allocate resources to each process (up to its maximum) in some order and still avoid a deadlock. More formally, a system is in a safe state only if there exists a safe sequence. A sequence of processes <P1, P2, ..., Pn> is a safe sequence for the current allocation state if, for each Pi, the resource requests that Pi can still make can be satisfied by the currently available resources plus the resources held by all Pj, with j < i. In this situation, if the resources that Pi needs are not immediately available, then Pi can wait until all Pj have finished. When they have finished, Pi can obtain all of its needed resources, complete its designated task, return its allocated resources, and terminate. When Pi terminates, Pi+1 can obtain its needed resources, and so on. If no such sequence exists, then the system state is said to be unsafe.

A safe state is not a deadlocked state. Conversely, a deadlocked state is an unsafe state. Not all unsafe states are deadlocks, however (Figure 7.6). An unsafe state may lead to a deadlock. As long as the state is safe, the operating system can avoid unsafe (and deadlocked) states. In an unsafe state, the operating system cannot prevent processes from requesting resources in such a way that a deadlock occurs. The behavior of the processes controls unsafe states.

[Figure 7.6: Safe, unsafe, and deadlocked state spaces.]

To illustrate, we consider a system with twelve magnetic tape drives and three processes: P0, P1, and P2. Process P0 requires ten tape drives, process P1 may need as many as four tape drives, and process P2 may need up to nine tape drives. Suppose that, at time t0, process P0 is holding five tape drives, process P1 is holding two tape drives, and process P2 is holding two tape drives. (Thus, there are three free tape drives.)
            Maximum Needs    Current Needs
    P0           10               5
    P1            4               2
    P2            9               2

At time t0, the system is in a safe state. The sequence <P1, P0, P2> satisfies the safety condition. Process P1 can immediately be allocated all its tape drives and then return them (the system will then have five available tape drives); then process P0 can get all its tape drives and return them (the system will then have ten available tape drives); and finally process P2 can get all its tape drives and return them (the system will then have all twelve tape drives available).

A system can go from a safe state to an unsafe state. Suppose that, at time t1, process P2 requests and is allocated one more tape drive. The system is no longer in a safe state. At this point, only process P1 can be allocated all its tape drives. When it returns them, the system will have only four available tape drives. Since process P0 is allocated five tape drives but has a maximum of ten, it may request five more tape drives. If it does so, it will have to wait, because they are unavailable. Similarly, process P2 may request six additional tape drives and have to wait, resulting in a deadlock. Our mistake was in granting the request from process P2 for one more tape drive. If we had made P2 wait until either of the other processes had finished and released its resources, then we could have avoided the deadlock.

Given the concept of a safe state, we can define avoidance algorithms that ensure that the system will never deadlock. The idea is simply to ensure that the system will always remain in a safe state. Initially, the system is in a safe state. Whenever a process requests a resource that is currently available, the system must decide whether the resource can be allocated immediately or whether the process must wait. The request is granted only if the allocation leaves the system in a safe state.

In this scheme, if a process requests a resource that is currently available, it may still have to wait. Thus, resource utilization may be lower than it would otherwise be.

7.5.2 Resource-Allocation-Graph Algorithm

If we have a resource-allocation system with only one instance of each resource type, we can use a variant of the resource-allocation graph defined in Section 7.2.2 for deadlock avoidance. In addition to the request and assignment edges already described, we introduce a new type of edge, called a claim edge. A claim edge Pi → Rj indicates that process Pi may request resource Rj at some time in the future. This edge resembles a request edge in direction but is represented in the graph by a dashed line. When process Pi requests resource Rj, the claim edge Pi → Rj is converted to a request edge. Similarly, when a resource Rj is released by Pi, the assignment edge Rj → Pi is reconverted to a claim edge Pi → Rj.

Note that the resources must be claimed a priori in the system. That is, before process Pi starts executing, all its claim edges must already appear in the resource-allocation graph. We can relax this condition by allowing a claim edge Pi → Rj to be added to the graph only if all the edges associated with process Pi are claim edges.
[Figure 7.7: Resource-allocation graph for deadlock avoidance.]

Now suppose that process Pi requests resource Rj. The request can be granted only if converting the request edge Pi → Rj to an assignment edge Rj → Pi does not result in the formation of a cycle in the resource-allocation graph. We check for safety by using a cycle-detection algorithm. An algorithm for detecting a cycle in this graph requires an order of n² operations, where n is the number of processes in the system.

If no cycle exists, then the allocation of the resource will leave the system in a safe state. If a cycle is found, then the allocation will put the system in an unsafe state. In that case, process Pi will have to wait for its requests to be satisfied.

To illustrate this algorithm, we consider the resource-allocation graph of Figure 7.7. Suppose that P2 requests R2. Although R2 is currently free, we cannot allocate it to P2, since this action will create a cycle in the graph (Figure 7.8). A cycle, as mentioned, indicates that the system is in an unsafe state. If P1 requests R2, and P2 requests R1, then a deadlock will occur.

[Figure 7.8: An unsafe state in a resource-allocation graph.]

7.5.3 Banker's Algorithm

The resource-allocation-graph algorithm is not applicable to a resource-allocation system with multiple instances of each resource type. The deadlock-avoidance algorithm that we describe next is applicable to such a system but is less efficient than the resource-allocation graph scheme. This algorithm is commonly known as the banker's algorithm. The name was chosen because the algorithm could be used in a banking system to ensure that the bank never
allocated its available cash in such a way that it could no longer satisfy the needs of all its customers.

When a new process enters the system, it must declare the maximum number of instances of each resource type that it may need. This number may not exceed the total number of resources in the system. When a user requests a set of resources, the system must determine whether the allocation of these resources will leave the system in a safe state. If it will, the resources are allocated; otherwise, the process must wait until some other process releases enough resources.

Several data structures must be maintained to implement the banker's algorithm. These data structures encode the state of the resource-allocation system. We need the following data structures, where n is the number of processes in the system and m is the number of resource types:

• Available. A vector of length m indicates the number of available resources of each type. If Available[j] equals k, then k instances of resource type Rj are available.
• Max. An n × m matrix defines the maximum demand of each process. If Max[i][j] equals k, then process Pi may request at most k instances of resource type Rj.
• Allocation. An n × m matrix defines the number of resources of each type currently allocated to each process. If Allocation[i][j] equals k, then process Pi is currently allocated k instances of resource type Rj.
• Need. An n × m matrix indicates the remaining resource need of each process. If Need[i][j] equals k, then process Pi may need k more instances of resource type Rj to complete its task. Note that Need[i][j] equals Max[i][j] − Allocation[i][j].

These data structures vary over time in both size and value.

To simplify the presentation of the banker's algorithm, we next establish some notation. Let X and Y be vectors of length n. We say that X ≤ Y if and only if X[i] ≤ Y[i] for all i = 1, 2, ..., n. For example, if X = (1,7,3,2) and Y = (0,3,2,1), then Y ≤ X. In addition, Y < X if Y ≤ X and Y ≠ X.

We can treat each row in the matrices Allocation and Need as vectors and refer to them as Allocationi and Needi. The vector Allocationi specifies the resources currently allocated to process Pi; the vector Needi specifies the additional resources that process Pi may still request to complete its task.

7.5.3.1 Safety Algorithm

We can now present the algorithm for finding out whether or not a system is in a safe state. This algorithm can be described as follows:

1. Let Work and Finish be vectors of length m and n, respectively. Initialize Work = Available and Finish[i] = false for i = 0, 1, ..., n − 1.
2. Find an index i such that both
   a. Finish[i] == false
   b. Needi ≤ Work
   If no such i exists, go to step 4.
3. Work = Work + Allocationi
   Finish[i] = true
   Go to step 2.
4. If Finish[i] == true for all i, then the system is in a safe state.

This algorithm may require an order of m × n² operations to determine whether a state is safe.

7.5.3.2 Resource-Request Algorithm

Next, we describe the algorithm for determining whether requests can be safely granted. Let Requesti be the request vector for process Pi. If Requesti[j] == k, then process Pi wants k instances of resource type Rj. When a request for resources is made by process Pi, the following actions are taken:

1. If Requesti ≤ Needi, go to step 2. Otherwise, raise an error condition, since the process has exceeded its maximum claim.
2. If Requesti ≤ Available, go to step 3. Otherwise, Pi must wait, since the resources are not available.
3. Have the system pretend to have allocated the requested resources to process Pi by modifying the state as follows:

       Available = Available − Requesti;
       Allocationi = Allocationi + Requesti;
       Needi = Needi − Requesti;

   If the resulting resource-allocation state is safe, the transaction is completed, and process Pi is allocated its resources. However, if the new state is unsafe, then Pi must wait for Requesti, and the old resource-allocation state is restored.

7.5.3.3 An Illustrative Example

To illustrate the use of the banker's algorithm, consider a system with five processes P0 through P4 and three resource types A, B, and C. Resource type A has ten instances, resource type B has five instances, and resource type C has seven instances. Suppose that, at time T0, the following snapshot of the system has been taken:

              Allocation    Max       Available
              A B C         A B C     A B C
        P0    0 1 0         7 5 3     3 3 2
        P1    2 0 0         3 2 2
        P2    3 0 2         9 0 2
        P3    2 1 1         2 2 2
        P4    0 0 2         4 3 3
The content of the matrix Need is defined to be Max − Allocation and is as follows:

              Need
              A B C
        P0    7 4 3
        P1    1 2 2
        P2    6 0 0
        P3    0 1 1
        P4    4 3 1

We claim that the system is currently in a safe state. Indeed, the sequence <P1, P3, P4, P2, P0> satisfies the safety criteria. Suppose now that process P1 requests one additional instance of resource type A and two instances of resource type C, so Request1 = (1,0,2). To decide whether this request can be immediately granted, we first check that Request1 ≤ Available, that is, that (1,0,2) ≤ (3,3,2), which is true. We then pretend that this request has been fulfilled, and we arrive at the following new state:

              Allocation    Need      Available
              A B C         A B C     A B C
        P0    0 1 0         7 4 3     2 3 0
        P1    3 0 2         0 2 0
        P2    3 0 2         6 0 0
        P3    2 1 1         0 1 1
        P4    0 0 2         4 3 1

We must determine whether this new system state is safe. To do so, we execute our safety algorithm and find that the sequence <P1, P3, P4, P0, P2> satisfies the safety requirement. Hence, we can immediately grant the request of process P1. You should be able to see, however, that when the system is in this state, a request for (3,3,0) by P4 cannot be granted, since the resources are not available. Furthermore, a request for (0,2,0) by P0 cannot be granted, even though the resources are available, since the resulting state is unsafe.

We leave it as a programming exercise for students to implement the banker's algorithm.
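As a starting point for that exercise, here is a small C sketch of just the safety algorithm from Section 7.5.3.1, hard-coded with the T0 snapshot above. It is our illustration, not a reference implementation from the text:

    #include <stdio.h>
    #include <stdbool.h>

    #define N 5   /* number of processes */
    #define M 3   /* number of resource types */

    /* The snapshot at time T0 from the example above. */
    int available[M]     = {3, 3, 2};
    int allocation[N][M] = {{0,1,0},{2,0,0},{3,0,2},{2,1,1},{0,0,2}};
    int need[N][M]       = {{7,4,3},{1,2,2},{6,0,0},{0,1,1},{4,3,1}};

    bool is_safe(void)
    {
        int work[M];
        bool finish[N] = {false};

        for (int j = 0; j < M; j++)          /* step 1: Work = Available */
            work[j] = available[j];

        for (int count = 0; count < N; ) {
            bool found = false;
            for (int i = 0; i < N; i++) {    /* step 2: find a runnable Pi */
                if (finish[i]) continue;
                int j;
                for (j = 0; j < M; j++)
                    if (need[i][j] > work[j]) break;
                if (j == M) {                /* Need_i <= Work */
                    for (j = 0; j < M; j++)  /* step 3: reclaim Pi's resources */
                        work[j] += allocation[i][j];
                    finish[i] = true;
                    printf("P%d ", i);       /* record the safe sequence */
                    found = true;
                    count++;
                }
            }
            if (!found) return false;        /* step 4: some Finish[i] false */
        }
        return true;
    }

    int main(void)
    {
        printf(is_safe() ? "-- safe\n" : "-- unsafe\n");
        return 0;
    }

Scanning the processes in index order, this program finds the sequence <P1, P3, P4, P0, P2>, one of the safe sequences noted above.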
7.6 Deadlock Detection

If a system does not employ either a deadlock-prevention or a deadlock-avoidance algorithm, then a deadlock situation may occur. In this environment, the system may provide:

• An algorithm that examines the state of the system to determine whether a deadlock has occurred
• An algorithm to recover from the deadlock

In the following discussion, we elaborate on these two requirements as they pertain to systems with only a single instance of each resource type, as well as to systems with several instances of each resource type. At this point, however, we note that a detection-and-recovery scheme requires overhead that includes not only the run-time costs of maintaining the necessary information and executing the detection algorithm but also the potential losses inherent in recovering from a deadlock.

7.6.1 Single Instance of Each Resource Type

If all resources have only a single instance, then we can define a deadlock-detection algorithm that uses a variant of the resource-allocation graph, called a wait-for graph. We obtain this graph from the resource-allocation graph by removing the resource nodes and collapsing the appropriate edges.

More precisely, an edge from Pi to Pj in a wait-for graph implies that process Pi is waiting for process Pj to release a resource that Pi needs. An edge Pi → Pj exists in a wait-for graph if and only if the corresponding resource-allocation graph contains two edges Pi → Rq and Rq → Pj for some resource Rq. In Figure 7.9, we present a resource-allocation graph and the corresponding wait-for graph.

[Figure 7.9: (a) Resource-allocation graph. (b) Corresponding wait-for graph.]

As before, a deadlock exists in the system if and only if the wait-for graph contains a cycle. To detect deadlocks, the system needs to maintain the wait-for graph and periodically invoke an algorithm that searches for a cycle in the graph. An algorithm to detect a cycle in a graph requires an order of n² operations, where n is the number of vertices in the graph.
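That search can be a simple depth-first traversal over the wait-for edges. The sketch below is ours (the text gives no code); it represents the graph as an adjacency matrix and reports whether any cycle, and hence a deadlock, exists. The edges in main() are arbitrary test data, not the edges of Figure 7.9:

    #include <stdio.h>
    #include <stdbool.h>

    #define N 5  /* number of processes */

    /* wait_for[i][j] == true means Pi is waiting for Pj. */
    bool wait_for[N][N];

    static bool dfs(int i, bool visited[], bool on_stack[])
    {
        visited[i] = on_stack[i] = true;
        for (int j = 0; j < N; j++) {
            if (!wait_for[i][j]) continue;
            if (on_stack[j]) return true;     /* back edge: cycle found */
            if (!visited[j] && dfs(j, visited, on_stack)) return true;
        }
        on_stack[i] = false;
        return false;
    }

    bool deadlocked(void)
    {
        bool visited[N] = {false}, on_stack[N] = {false};
        for (int i = 0; i < N; i++)
            if (!visited[i] && dfs(i, visited, on_stack))
                return true;
        return false;
    }

    int main(void)
    {
        /* Arbitrary test edges forming the cycle P0 -> P1 -> P2 -> P0. */
        wait_for[0][1] = wait_for[1][2] = wait_for[2][0] = true;
        printf(deadlocked() ? "deadlock\n" : "no deadlock\n");
        return 0;
    }

With the adjacency-matrix representation, the traversal examines up to n entries per vertex, matching the order-n² bound mentioned above.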
7.6.2 Several Instances of a Resource Type

The wait-for graph scheme is not applicable to a resource-allocation system with multiple instances of each resource type. We turn now to a deadlock-detection algorithm that is applicable to such a system. The algorithm employs several time-varying data structures that are similar to those used in the banker's algorithm (Section 7.5.3):

• Available. A vector of length m indicates the number of available resources of each type.
• Allocation. An n × m matrix defines the number of resources of each type currently allocated to each process.
• Request. An n × m matrix indicates the current request of each process. If Request[i][j] equals k, then process Pi is requesting k more instances of resource type Rj.

The ≤ relation between two vectors is defined as in Section 7.5.3. To simplify notation, we again treat the rows in the matrices Allocation and Request as vectors; we refer to them as Allocationi and Requesti. The detection algorithm described here simply investigates every possible allocation sequence for the processes that remain to be completed. Compare this algorithm with the banker's algorithm of Section 7.5.3.

1. Let Work and Finish be vectors of length m and n, respectively. Initialize Work = Available. For i = 0, 1, ..., n−1, if Allocationi ≠ 0, then Finish[i] = false. Otherwise, Finish[i] = true.
2. Find an index i such that both
   a. Finish[i] == false
   b. Requesti ≤ Work
   If no such i exists, go to step 4.
3. Work = Work + Allocationi
   Finish[i] = true
   Go to step 2.
4. If Finish[i] == false for some i, 0 ≤ i < n, then the system is in a deadlocked state. Moreover, if Finish[i] == false, then process Pi is deadlocked.

This algorithm requires an order of m × n² operations to detect whether the system is in a deadlocked state.

You may wonder why we reclaim the resources of process Pi (in step 3) as soon as we determine that Requesti ≤ Work (in step 2b). We know that Pi is currently not involved in a deadlock (since Requesti ≤ Work). Thus, we take an optimistic attitude and assume that Pi will require no more resources to complete its task; it will thus soon return all currently allocated resources to the system. If our assumption is incorrect, a deadlock may occur later. That deadlock will be detected the next time the deadlock-detection algorithm is invoked.

To illustrate this algorithm, we consider a system with five processes P0 through P4 and three resource types A, B, and C. Resource type A has seven instances, resource type B has two instances, and resource type C has six
instances. Suppose that, at time T0, we have the following resource-allocation state:

              Allocation    Request   Available
              A B C         A B C     A B C
        P0    0 1 0         0 0 0     0 0 0
        P1    2 0 0         2 0 2
        P2    3 0 3         0 0 0
        P3    2 1 1         1 0 0
        P4    0 0 2         0 0 2

We claim that the system is not in a deadlocked state. Indeed, if we execute our algorithm, we will find that the sequence <P0, P2, P3, P1, P4> results in Finish[i] == true for all i.

Suppose now that process P2 makes one additional request for an instance of type C. The Request matrix is modified as follows:

              Request
              A B C
        P0    0 0 0
        P1    2 0 2
        P2    0 0 1
        P3    1 0 0
        P4    0 0 2

We claim that the system is now deadlocked. Although we can reclaim the resources held by process P0, the number of available resources is not sufficient to fulfill the requests of the other processes. Thus, a deadlock exists, consisting of processes P1, P2, P3, and P4.

7.6.3 Detection-Algorithm Usage

When should we invoke the detection algorithm? The answer depends on two factors:

1. How often is a deadlock likely to occur?
2. How many processes will be affected by deadlock when it happens?

If deadlocks occur frequently, then the detection algorithm should be invoked frequently. Resources allocated to deadlocked processes will be idle until the deadlock can be broken. In addition, the number of processes involved in the deadlock cycle may grow.

Deadlocks occur only when some process makes a request that cannot be granted immediately. This request may be the final request that completes a chain of waiting processes. In the extreme, then, we can invoke the deadlock-detection algorithm every time a request for allocation cannot be granted immediately. In this case, we can identify not only the deadlocked set of
processes but also the specific process that "caused" the deadlock. (In reality, each of the deadlocked processes is a link in the cycle in the resource graph, so all of them, jointly, caused the deadlock.) If there are many different resource types, one request may create many cycles in the resource graph, each cycle completed by the most recent request and "caused" by the one identifiable process.

Of course, invoking the deadlock-detection algorithm for every resource request will incur considerable overhead in computation time. A less expensive alternative is simply to invoke the algorithm at defined intervals, for example, once per hour or whenever CPU utilization drops below 40 percent. (A deadlock eventually cripples system throughput and causes CPU utilization to drop.) If the detection algorithm is invoked at arbitrary points in time, the resource graph may contain many cycles. In this case, we generally cannot tell which of the many deadlocked processes "caused" the deadlock.

7.7 Recovery from Deadlock

When a detection algorithm determines that a deadlock exists, several alternatives are available. One possibility is to inform the operator that a deadlock has occurred and to let the operator deal with the deadlock manually. Another possibility is to let the system recover from the deadlock automatically. There are two options for breaking a deadlock. One is simply to abort one or more processes to break the circular wait. The other is to preempt some resources from one or more of the deadlocked processes.

7.7.1 Process Termination

To eliminate deadlocks by aborting a process, we use one of two methods. In both methods, the system reclaims all resources allocated to the terminated processes.

• Abort all deadlocked processes. This method clearly will break the deadlock cycle, but at great expense. The deadlocked processes may have computed for a long time, and the results of these partial computations must be discarded and probably will have to be recomputed later.
• Abort one process at a time until the deadlock cycle is eliminated. This method incurs considerable overhead, since after each process is aborted, a deadlock-detection algorithm must be invoked to determine whether any processes are still deadlocked.

Aborting a process may not be easy. If the process was in the midst of updating a file, terminating it will leave that file in an incorrect state. Similarly, if the process was in the midst of printing data on a printer, the system must reset the printer to a correct state before printing the next job.

If the partial termination method is used, then we must determine which deadlocked process (or processes) should be terminated. This determination is a policy decision, similar to CPU-scheduling decisions. The question is basically an economic one; we should abort those processes whose termination will incur
the minimum cost. Unfortunately, the term minimum cost is not a precise one. Many factors may affect which process is chosen, including:

1. What the priority of the process is
2. How long the process has computed and how much longer the process will compute before completing its designated task
3. How many and what types of resources the process has used (for example, whether the resources are simple to preempt)
4. How many more resources the process needs in order to complete
5. How many processes will need to be terminated
6. Whether the process is interactive or batch

7.7.2 Resource Preemption

To eliminate deadlocks using resource preemption, we successively preempt some resources from processes and give these resources to other processes until the deadlock cycle is broken.

If preemption is required to deal with deadlocks, then three issues need to be addressed:

1. Selecting a victim. Which resources and which processes are to be preempted? As in process termination, we must determine the order of preemption to minimize cost. Cost factors may include such parameters as the number of resources a deadlocked process is holding and the amount of time the process has thus far consumed.
2. Rollback. If we preempt a resource from a process, what should be done with that process? Clearly, it cannot continue with its normal execution; it is missing some needed resource. We must roll back the process to some safe state and restart it from that state. Since, in general, it is difficult to determine what a safe state is, the simplest solution is a total rollback: abort the process and then restart it. Although it is more effective to roll back the process only as far as necessary to break the deadlock, this method requires the system to keep more information about the state of all running processes.
3. Starvation. How do we ensure that starvation will not occur? That is, how can we guarantee that resources will not always be preempted from the same process? In a system where victim selection is based primarily on cost factors, it may happen that the same process is always picked as a victim. As a result, this process never completes its designated task, a starvation situation any practical system must address. Clearly, we must ensure that a process can be picked as a victim only a (small) finite number of times. The most common solution is to include the number of rollbacks in the cost factor.
7.8 Summary

A deadlocked state occurs when two or more processes are waiting indefinitely for an event that can be caused only by one of the waiting processes. There are three principal methods for dealing with deadlocks:

• Use some protocol to prevent or avoid deadlocks, ensuring that the system will never enter a deadlocked state.
• Allow the system to enter a deadlocked state, detect it, and then recover.
• Ignore the problem altogether and pretend that deadlocks never occur in the system.

The third solution is the one used by most operating systems, including Linux and Windows.

A deadlock can occur only if four necessary conditions hold simultaneously in the system: mutual exclusion, hold and wait, no preemption, and circular wait. To prevent deadlocks, we can ensure that at least one of the necessary conditions never holds.

A method for avoiding deadlocks, rather than preventing them, requires that the operating system have a priori information about how each process will utilize system resources. The banker's algorithm, for example, requires a priori information about the maximum number of each resource class that each process may request. Using this information, we can define a deadlock-avoidance algorithm.

If a system does not employ a protocol to ensure that deadlocks will never occur, then a detection-and-recovery scheme may be employed. A deadlock-detection algorithm must be invoked to determine whether a deadlock has occurred. If a deadlock is detected, the system must recover either by terminating some of the deadlocked processes or by preempting resources from some of the deadlocked processes.

Where preemption is used to deal with deadlocks, three issues must be addressed: selecting a victim, rollback, and starvation. In a system that selects victims for rollback primarily on the basis of cost factors, starvation may occur, and the selected process can never complete its designated task.

Researchers have argued that none of the basic approaches alone is appropriate for the entire spectrum of resource-allocation problems in operating systems. The basic approaches can be combined, however, allowing us to select an optimal approach for each class of resources in a system.

Practice Exercises

7.1 List three examples of deadlocks that are not related to a computer-system environment.

7.2 Suppose that a system is in an unsafe state. Show that it is possible for the processes to complete their execution without entering a deadlocked state.
7.3 Consider the following snapshot of a system:

              Allocation    Max           Available
              A B C D       A B C D       A B C D
        P0    0 0 1 2       0 0 1 2       1 5 2 0
        P1    1 0 0 0       1 7 5 0
        P2    1 3 5 4       2 3 5 6
        P3    0 6 3 2       0 6 5 2
        P4    0 0 1 4       0 6 5 6

     Answer the following questions using the banker's algorithm:
     a. What is the content of the matrix Need?
     b. Is the system in a safe state?
     c. If a request from process P1 arrives for (0,4,2,0), can the request be granted immediately?

7.4 A possible method for preventing deadlocks is to have a single, higher-order resource that must be requested before any other resource. For example, if multiple threads attempt to access the synchronization objects A ... E, deadlock is possible. (Such synchronization objects may include mutexes, semaphores, condition variables, and the like.) We can prevent the deadlock by adding a sixth object F. Whenever a thread wants to acquire the synchronization lock for any object A ... E, it must first acquire the lock for object F. This solution is known as containment: the locks for objects A ... E are contained within the lock for object F. Compare this scheme with the circular-wait scheme of Section 7.4.4.

7.5 Prove that the safety algorithm presented in Section 7.5.3 requires an order of m × n² operations.

7.6 Consider a computer system that runs 5,000 jobs per month and has no deadlock-prevention or deadlock-avoidance scheme. Deadlocks occur about twice per month, and the operator must terminate and rerun about ten jobs per deadlock. Each job is worth about two dollars (in CPU time), and the jobs terminated tend to be about half done when they are aborted.

     A systems programmer has estimated that a deadlock-avoidance algorithm (like the banker's algorithm) could be installed in the system with an increase of about 10 percent in the average execution time per job. Since the machine currently has 30 percent idle time, all 5,000 jobs per month could still be run, although turnaround time would increase by about 20 percent on average.

     a. What are the arguments for installing the deadlock-avoidance algorithm?
     b. What are the arguments against installing the deadlock-avoidance algorithm?
  • 161. Exercises 341 7.7 Can a system detect that some of its processes are starving? If you answer “yes,” explain how it can. If you answer “no,” explain how the system can deal with the starvation problem. 7.8 Consider the following resource-allocation policy. Requests for and releases of resources are allowed at any time. If a request for resources cannot be satisfied because the resources are not available, then we check any processes that are blocked waiting for resources. If a blocked process has the desired resources, then these resources are taken away from it and are given to the requesting process. The vector of resources for which the blocked process is waiting is increased to include the resources that were taken away. For example, a system has three resource types, and the vector Available is initialized to (4,2,2). If process P0 asks for (2,2,1), it gets them. If P1 asks for (1,0,1), it gets them. Then, if P0 asks for (0,0,1), it is blocked (resource not available). If P2 now asks for (2,0,0), it gets the available one (1,0,0), as well as one that was allocated to P0 (since P0 is blocked). P0’s Allocation vector goes down to (1,2,1), and its Need vector goes up to (1,0,1). a. Can deadlock occur? If you answer “yes,” give an example. If you answer “no,” specify which necessary condition cannot occur. b. Can indefinite blocking occur? Explain your answer. 7.9 Suppose that you have coded the deadlock-avoidance safety algorithm and now have been asked to implement the deadlock-detection algo- rithm. Can you do so by simply using the safety algorithm code and redefining Maxi = Waitingi + Allocationi , where Waitingi is a vector specifying the resources for which process i is waiting and Allocationi is as defined in Section 7.5? Explain your answer. 7.10 Is it possible to have a deadlock involving only one single-threaded process? Explain your answer. Exercises 7.11 Consider the traffic deadlock depicted in Figure 7.10. a. Show that the four necessary conditions for deadlock hold in this example. b. State a simple rule for avoiding deadlocks in this system. 7.12 Assume a multithreaded application uses only reader–writer locks for synchronization. Applying the four necessary conditions for deadlock, is deadlock still possible if multiple reader–writer locks are used? 7.13 The program example shown in Figure 7.4 doesn’t always lead to deadlock. Describe what role the CPU scheduler plays and how it can contribute to deadlock in this program.
Figure 7.10 Traffic deadlock for Exercise 7.11.

7.14 In Section 7.4.4, we describe a situation in which we prevent deadlock by ensuring that all locks are acquired in a certain order. However, we also point out that deadlock is possible in this situation if two threads simultaneously invoke the transaction() function. Fix the transaction() function to prevent deadlocks.

7.15 Compare the circular-wait scheme with the various deadlock-avoidance schemes (like the banker's algorithm) with respect to the following issues:
a. Runtime overheads
b. System throughput

7.16 In a real computer system, neither the resources available nor the demands of processes for resources are consistent over long periods (months). Resources break or are replaced, new processes come and go, and new resources are bought and added to the system. If deadlock is controlled by the banker's algorithm, which of the following changes can be made safely (without introducing the possibility of deadlock), and under what circumstances?
a. Increase Available (new resources added).
b. Decrease Available (resource permanently removed from system).
c. Increase Max for one process (the process needs or wants more resources than allowed).
d. Decrease Max for one process (the process decides it does not need that many resources).
e. Increase the number of processes.
f. Decrease the number of processes.

7.17 Consider a system consisting of four resources of the same type that are shared by three processes, each of which needs at most two resources. Show that the system is deadlock free.

7.18 Consider a system consisting of m resources of the same type being shared by n processes. A process can request or release only one resource at a time. Show that the system is deadlock free if the following two conditions hold:
a. The maximum need of each process is between one resource and m resources.
b. The sum of all maximum needs is less than m + n.

7.19 Consider the version of the dining-philosophers problem in which the chopsticks are placed at the center of the table and any two of them can be used by a philosopher. Assume that requests for chopsticks are made one at a time. Describe a simple rule for determining whether a particular request can be satisfied without causing deadlock given the current allocation of chopsticks to philosophers.

7.20 Consider again the setting in the preceding question. Assume now that each philosopher requires three chopsticks to eat. Resource requests are still issued one at a time. Describe some simple rules for determining whether a particular request can be satisfied without causing deadlock given the current allocation of chopsticks to philosophers.

7.21 We can obtain the banker's algorithm for a single resource type from the general banker's algorithm simply by reducing the dimensionality of the various arrays by 1. Show through an example that we cannot implement the multiple-resource-type banker's scheme by applying the single-resource-type scheme to each resource type individually.

7.22 Consider the following snapshot of a system:

          Allocation     Max
          A B C D        A B C D
    P0    3 0 1 4        5 1 1 7
    P1    2 2 1 0        3 2 1 1
    P2    3 1 2 1        3 3 2 1
    P3    0 5 1 0        4 6 1 2
    P4    4 2 1 2        6 3 2 5

Using the banker's algorithm, determine whether or not each of the following states is unsafe. If the state is safe, illustrate the order in which the processes may complete. Otherwise, illustrate why the state is unsafe.
a. Available = (0, 3, 0, 1)
b. Available = (1, 0, 0, 2)
7.23 Consider the following snapshot of a system:

          Allocation     Max          Available
          A B C D        A B C D      A B C D
    P0    2 0 0 1        4 2 1 2      3 3 2 1
    P1    3 1 2 1        5 2 5 2
    P2    2 1 0 3        2 3 1 6
    P3    1 3 1 2        1 4 2 4
    P4    1 4 3 2        3 6 6 5

Answer the following questions using the banker's algorithm:
a. Illustrate that the system is in a safe state by demonstrating an order in which the processes may complete.
b. If a request from process P1 arrives for (1, 1, 0, 0), can the request be granted immediately?
c. If a request from process P4 arrives for (0, 0, 2, 0), can the request be granted immediately?

7.24 What is the optimistic assumption made in the deadlock-detection algorithm? How can this assumption be violated?

7.25 A single-lane bridge connects the two Vermont villages of North Tunbridge and South Tunbridge. Farmers in the two villages use this bridge to deliver their produce to the neighboring town. The bridge can become deadlocked if a northbound and a southbound farmer get on the bridge at the same time. (Vermont farmers are stubborn and are unable to back up.) Using semaphores and/or mutex locks, design an algorithm in pseudocode that prevents deadlock. Initially, do not be concerned about starvation (the situation in which northbound farmers prevent southbound farmers from using the bridge, or vice versa).

7.26 Modify your solution to Exercise 7.25 so that it is starvation-free.

Programming Problems

7.27 Implement your solution to Exercise 7.25 using POSIX synchronization. In particular, represent northbound and southbound farmers as separate threads. Once a farmer is on the bridge, the associated thread will sleep for a random period of time, representing traveling across the bridge. Design your program so that you can create several threads representing the northbound and southbound farmers.
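One possible starting point for Exercise 7.27 (a sketch, not the only acceptable design) models the bridge as a single resource guarded by one Pthreads mutex: a farmer thread acquires the lock before crossing and releases it afterward, so no thread ever holds one lock while waiting for another, and deadlock cannot arise. The names bridge_lock and farmer(), the thread count, and the sleep bound are illustrative choices, not part of the exercise. This version serializes all crossings and ignores starvation, as Exercise 7.25 initially allows.

/* Compile with: cc -pthread bridge.c */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

pthread_mutex_t bridge_lock = PTHREAD_MUTEX_INITIALIZER;  /* guards the bridge */

void *farmer(void *arg) {
    const char *direction = arg;            /* "northbound" or "southbound" */

    pthread_mutex_lock(&bridge_lock);       /* wait until the bridge is free */
    printf("%s farmer is crossing\n", direction);
    sleep(rand() % 3);                      /* travel time across the bridge */
    pthread_mutex_unlock(&bridge_lock);     /* leave the bridge */

    return NULL;
}

int main(void) {
    pthread_t t[4];
    const char *dir[4] = { "northbound", "southbound", "northbound", "southbound" };

    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, farmer, (void *)dir[i]);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    return 0;
}

A starvation-free variant (Exercise 7.26) would replace the single mutex with a scheme that alternates turns between directions or bounds how long one direction may monopolize the bridge.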
Programming Projects

Banker's Algorithm

For this project, you will write a multithreaded program that implements the banker's algorithm discussed in Section 7.5.3. Several customers request and release resources from the bank. The banker will grant a request only if it leaves the system in a safe state. A request that leaves the system in an unsafe state will be denied. This programming assignment combines three separate topics: (1) multithreading, (2) preventing race conditions, and (3) deadlock avoidance.

The Banker

The banker will consider requests from n customers for m resource types, as outlined in Section 7.5.3. The banker will keep track of the resources using the following data structures:

/* these may be any values >= 0 */
#define NUMBER_OF_CUSTOMERS 5
#define NUMBER_OF_RESOURCES 3

/* the available amount of each resource */
int available[NUMBER_OF_RESOURCES];

/* the maximum demand of each customer */
int maximum[NUMBER_OF_CUSTOMERS][NUMBER_OF_RESOURCES];

/* the amount currently allocated to each customer */
int allocation[NUMBER_OF_CUSTOMERS][NUMBER_OF_RESOURCES];

/* the remaining need of each customer */
int need[NUMBER_OF_CUSTOMERS][NUMBER_OF_RESOURCES];

The Customers

Create n customer threads that request and release resources from the bank. The customers will continually loop, requesting and then releasing random numbers of resources. The customers' requests for resources will be bounded by their respective values in the need array. The banker will grant a request if it satisfies the safety algorithm outlined in Section 7.5.3.1. If a request does not leave the system in a safe state, the banker will deny it. Function prototypes for requesting and releasing resources are as follows:

int request_resources(int customer_num, int request[]);
int release_resources(int customer_num, int release[]);

These two functions should return 0 if successful (the request has been granted) and -1 if unsuccessful. Multiple threads (customers) will concurrently
  • 166. 346 Chapter 7 Deadlocks access shared data through these two functions. Therefore, access must be controlled through mutex locks to prevent race conditions. Both the Pthreads and Windows APIs provide mutex locks. The use of Pthreads mutex locks is covered in Section 5.9.4; mutex locks for Windows systems are described in the project entitled “Producer–Consumer Problem” at the end of Chapter 5. Implementation You should invoke your program by passing the number of resources of each type on the command line. For example, if there were three resource types, with ten instances of the first type, five of the second type, and seven of the third type, you would invoke your program follows: ./a.out 10 5 7 The available array would be initialized to these values. You may initialize the maximum array (which holds the maximum demand of each customer) using any method you find convenient. Bibliographical Notes Most research involving deadlock was conducted many years ago. [Dijkstra (1965)] was one of the first and most influential contributors in the deadlock area. [Holt (1972)] was the first person to formalize the notion of deadlocks in terms of an allocation-graph model similar to the one presented in this chapter. Starvation was also covered by [Holt (1972)]. [Hyman (1985)] provided the deadlock example from the Kansas legislature. A study of deadlock handling is provided in [Levine (2003)]. The various prevention algorithms were suggested by [Havender (1968)], who devised the resource-ordering scheme for the IBM OS/360 system. The banker’s algorithm for avoiding deadlocks was developed for a single resource type by [Dijkstra (1965)] and was extended to multiple resource types by [Habermann (1969)]. The deadlock-detection algorithm for multiple instances of a resource type, which is described in Section 7.6.2, was presented by [Coffman et al. (1971)]. [Bach (1987)] describes how many of the algorithms in the traditional UNIX kernel handle deadlock. Solutions to deadlock problems in networks are discussed in works such as [Culler et al. (1998)] and [Rodeheffer and Schroeder (1991)]. The witness lock-order verifier is presented in [Baldwin (2002)]. Bibliography [Bach (1987)] M. J. Bach, The Design of the UNIX Operating System, Prentice Hall (1987). [Baldwin (2002)] J. Baldwin, “Locking in the Multithreaded FreeBSD Kernel”, USENIX BSD (2002).
  • 167. Bibliography 347 [Coffman et al. (1971)] E. G. Coffman, M. J. Elphick, and A. Shoshani, “System Deadlocks”, Computing Surveys, Volume 3, Number 2 (1971), pages 67–78. [Culler et al. (1998)] D. E. Culler, J. P. Singh, and A. Gupta, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann Publishers Inc. (1998). [Dijkstra (1965)] E. W. Dijkstra, “Cooperating Sequential Processes”, Technical report, Technological University, Eindhoven, the Netherlands (1965). [Habermann (1969)] A. N. Habermann, “Prevention of System Deadlocks”, Communications of the ACM, Volume 12, Number 7 (1969), pages 373–377, 385. [Havender (1968)] J. W. Havender, “Avoiding Deadlock in Multitasking Sys- tems”, IBM Systems Journal, Volume 7, Number 2 (1968), pages 74–84. [Holt (1972)] R. C. Holt, “Some Deadlock Properties of Computer Systems”, Computing Surveys, Volume 4, Number 3 (1972), pages 179–196. [Hyman (1985)] D. Hyman, The Columbus Chicken Statute and More Bonehead Legislation, S. Greene Press (1985). [Levine (2003)] G. Levine, “Defining Deadlock”, Operating Systems Review, Vol- ume 37, Number 1 (2003). [Rodeheffer and Schroeder (1991)] T. L. Rodeheffer and M. D. Schroeder, “Automatic Reconfiguration in Autonet”, Proceedings of the ACM Symposium on Operating Systems Principles (1991), pages 183–97.
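Returning to the banker's-algorithm programming project above: the following is a minimal sketch (not the definitive implementation) of request_resources() together with the safety check of Section 7.5.3, assuming the global arrays declared in the project description. The helper is_safe() and the mutex name bank_mutex are names introduced here for illustration; release_resources(), the customer threads, and command-line initialization are omitted.

#include <pthread.h>
#include <stdbool.h>

#define NUMBER_OF_CUSTOMERS 5
#define NUMBER_OF_RESOURCES 3

int available[NUMBER_OF_RESOURCES];
int maximum[NUMBER_OF_CUSTOMERS][NUMBER_OF_RESOURCES];
int allocation[NUMBER_OF_CUSTOMERS][NUMBER_OF_RESOURCES];
int need[NUMBER_OF_CUSTOMERS][NUMBER_OF_RESOURCES];

static pthread_mutex_t bank_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Safety algorithm of Section 7.5.3: can every customer still finish? */
static bool is_safe(void) {
    int work[NUMBER_OF_RESOURCES];
    bool finish[NUMBER_OF_CUSTOMERS] = { false };

    for (int j = 0; j < NUMBER_OF_RESOURCES; j++)
        work[j] = available[j];

    for (int pass = 0; pass < NUMBER_OF_CUSTOMERS; pass++) {
        for (int i = 0; i < NUMBER_OF_CUSTOMERS; i++) {
            if (finish[i])
                continue;
            bool can_finish = true;
            for (int j = 0; j < NUMBER_OF_RESOURCES; j++)
                if (need[i][j] > work[j])
                    can_finish = false;
            if (can_finish) {
                /* customer i finishes and returns its resources */
                for (int j = 0; j < NUMBER_OF_RESOURCES; j++)
                    work[j] += allocation[i][j];
                finish[i] = true;
            }
        }
    }
    for (int i = 0; i < NUMBER_OF_CUSTOMERS; i++)
        if (!finish[i])
            return false;
    return true;
}

int request_resources(int customer_num, int request[]) {
    pthread_mutex_lock(&bank_mutex);

    /* Deny requests that exceed the declared need or the available resources. */
    for (int j = 0; j < NUMBER_OF_RESOURCES; j++) {
        if (request[j] > need[customer_num][j] || request[j] > available[j]) {
            pthread_mutex_unlock(&bank_mutex);
            return -1;
        }
    }

    /* Tentatively grant the request ... */
    for (int j = 0; j < NUMBER_OF_RESOURCES; j++) {
        available[j] -= request[j];
        allocation[customer_num][j] += request[j];
        need[customer_num][j] -= request[j];
    }

    /* ... and roll it back if the resulting state is not safe. */
    if (!is_safe()) {
        for (int j = 0; j < NUMBER_OF_RESOURCES; j++) {
            available[j] += request[j];
            allocation[customer_num][j] -= request[j];
            need[customer_num][j] += request[j];
        }
        pthread_mutex_unlock(&bank_mutex);
        return -1;
    }

    pthread_mutex_unlock(&bank_mutex);
    return 0;
}

A matching release_resources() would take the same lock, add the released amounts back to available, and subtract them from allocation while adjusting need; no safety check is needed when resources are returned.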
  • 169. Part Three Memory Management The main purpose of a computer system is to execute programs. These programs, together with the data they access, must be at least partially in main memory during execution. To improve both the utilization of the CPU and the speed of its response to users, a general-purpose computer must keep several pro- cesses in memory. Many memory-management schemes exist, reflect- ing various approaches, and the effectiveness of each algorithm depends on the situation. Selection of a memory-management scheme for a sys- tem depends on many factors, especially on the hardware design of the system. Most algorithms require hardware support.
  • 171. 8 C H A P T E R Main Memory In Chapter 6, we showed how the CPU can be shared by a set of processes. As a result of CPU scheduling, we can improve both the utilization of the CPU and the speed of the computer’s response to its users. To realize this increase in performance, however, we must keep several processes in memory—that is, we must share memory. In this chapter, we discuss various ways to manage memory. The memory- management algorithms vary from a primitive bare-machine approach to paging and segmentation strategies. Each approach has its own advantages and disadvantages. Selection of a memory-management method for a specific system depends on many factors, especially on the hardware design of the system. As we shall see, many algorithms require hardware support, leading many systems to have closely integrated hardware and operating-system memory management. CHAPTER OBJECTIVES • To provide a detailed description of various ways of organizing memory hardware. • To explore various techniques of allocating memory to processes. • To discuss in detail how paging works in contemporary computer systems. 8.1 Background As we saw in Chapter 1, memory is central to the operation of a modern computer system. Memory consists of a large array of bytes, each with its own address. The CPU fetches instructions from memory according to the value of the program counter. These instructions may cause additional loading from and storing to specific memory addresses. A typical instruction-execution cycle, for example, first fetches an instruc- tion from memory. The instruction is then decoded and may cause operands to be fetched from memory. After the instruction has been executed on the operands, results may be stored back in memory. The memory unit sees only 351
a stream of memory addresses; it does not know how they are generated (by the instruction counter, indexing, indirection, literal addresses, and so on) or what they are for (instructions or data). Accordingly, we can ignore how a program generates a memory address. We are interested only in the sequence of memory addresses generated by the running program.

We begin our discussion by covering several issues that are pertinent to managing memory: basic hardware, the binding of symbolic memory addresses to actual physical addresses, and the distinction between logical and physical addresses. We conclude the section with a discussion of dynamic linking and shared libraries.

8.1.1 Basic Hardware

Main memory and the registers built into the processor itself are the only general-purpose storage that the CPU can access directly. There are machine instructions that take memory addresses as arguments, but none that take disk addresses. Therefore, any instructions in execution, and any data being used by the instructions, must be in one of these direct-access storage devices. If the data are not in memory, they must be moved there before the CPU can operate on them.

Registers that are built into the CPU are generally accessible within one cycle of the CPU clock. Most CPUs can decode instructions and perform simple operations on register contents at the rate of one or more operations per clock tick. The same cannot be said of main memory, which is accessed via a transaction on the memory bus. Completing a memory access may take many cycles of the CPU clock. In such cases, the processor normally needs to stall, since it does not have the data required to complete the instruction that it is executing. This situation is intolerable because of the frequency of memory accesses. The remedy is to add fast memory between the CPU and main memory, typically on the CPU chip for fast access. Such a cache was described in Section 1.8.3. To manage a cache built into the CPU, the hardware automatically speeds up memory access without any operating-system control.

Not only are we concerned with the relative speed of accessing physical memory, but we also must ensure correct operation. For proper system operation, we must protect the operating system from access by user processes. On multiuser systems, we must additionally protect user processes from one another. This protection must be provided by the hardware, because the operating system doesn't usually intervene between the CPU and its memory accesses (because of the resulting performance penalty). Hardware implements this protection in several different ways, as we show throughout the chapter. Here, we outline one possible implementation.

We first need to make sure that each process has a separate memory space. Separate per-process memory space protects the processes from each other and is fundamental to having multiple processes loaded in memory for concurrent execution. To separate memory spaces, we need the ability to determine the range of legal addresses that the process may access and to ensure that the process can access only these legal addresses. We can provide this protection by using two registers, usually a base and a limit, as illustrated in Figure 8.1. The base register holds the smallest legal physical memory address; the limit register specifies the size of the range. For example, if the base register holds
  • 173. 8.1 Background 353 operating system 0 256000 300040 300040 base 120900 limit 420940 880000 1024000 process process process Figure 8.1 A base and a limit register define a logical address space. 300040 and the limit register is 120900, then the program can legally access all addresses from 300040 through 420939 (inclusive). Protection of memory space is accomplished by having the CPU hardware compare every address generated in user mode with the registers. Any attempt by a program executing in user mode to access operating-system memory or other users’ memory results in a trap to the operating system, which treats the attempt as a fatal error (Figure 8.2). This scheme prevents a user program from (accidentally or deliberately) modifying the code or data structures of either the operating system or other users. The base and limit registers can be loaded only by the operating system, which uses a special privileged instruction. Since privileged instructions can be executed only in kernel mode, and since only the operating system executes in kernel mode, only the operating system can load the base and limit registers. base memory trap to operating system monitor—addressing error address yes yes no no CPU base limit ≥ Figure 8.2 Hardware address protection with base and limit registers.
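To make the check in Figure 8.2 concrete, the sketch below expresses in C (purely for illustration; the real comparison is made by hardware on every user-mode memory reference) the test against the base and limit registers, using the example values from the text. The function names are illustrative.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static const uint32_t base  = 300040;   /* smallest legal physical address */
static const uint32_t limit = 120900;   /* size of the legal address range */

static void trap_addressing_error(uint32_t addr) {
    fprintf(stderr, "trap: addressing error at %u\n", addr);
    exit(EXIT_FAILURE);                  /* the OS treats the trap as a fatal error */
}

/* Every user-mode address must satisfy base <= addr < base + limit. */
uint32_t check_access(uint32_t addr) {
    if (addr < base || addr >= base + limit)
        trap_addressing_error(addr);
    return addr;                         /* legal: 300040 through 420939 */
}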
  • 174. 354 Chapter 8 Main Memory This scheme allows the operating system to change the value of the registers but prevents user programs from changing the registers’ contents. The operating system, executing in kernel mode, is given unrestricted access to both operating-system memory and users’ memory. This provision allows the operating system to load users’ programs into users’ memory, to dump out those programs in case of errors, to access and modify parameters of system calls, to perform I/O to and from user memory, and to provide many other services. Consider, for example, that an operating system for a multiprocessing system must execute context switches, storing the state of one process from the registers into main memory before loading the next process’s context from main memory into the registers. 8.1.2 Address Binding Usually, a program resides on a disk as a binary executable file. To be executed, the program must be brought into memory and placed within a process. Depending on the memory management in use, the process may be moved between disk and memory during its execution. The processes on the disk that are waiting to be brought into memory for execution form the input queue. The normal single-tasking procedure is to select one of the processes in the input queue and to load that process into memory. As the process is executed, it accesses instructions and data from memory. Eventually, the process terminates, and its memory space is declared available. Most systems allow a user process to reside in any part of the physical memory. Thus, although the address space of the computer may start at 00000, the first address of the user process need not be 00000. You will see later how a user program actually places a process in physical memory. In most cases, a user program goes through several steps—some of which may be optional—before being executed (Figure 8.3). Addresses may be represented in different ways during these steps. Addresses in the source program are generally symbolic (such as the variable count). A compiler typically binds these symbolic addresses to relocatable addresses (such as “14 bytes from the beginning of this module”). The linkage editor or loader in turn binds the relocatable addresses to absolute addresses (such as 74014). Each binding is a mapping from one address space to another. Classically, the binding of instructions and data to memory addresses can be done at any step along the way: • Compile time. If you know at compile time where the process will reside in memory, then absolute code can be generated. For example, if you know that a user process will reside starting at location R, then the generated compiler code will start at that location and extend up from there. If, at some later time, the starting location changes, then it will be necessary to recompile this code. The MS-DOS .COM-format programs are bound at compile time. • Load time. If it is not known at compile time where the process will reside in memory, then the compiler must generate relocatable code. In this case, final binding is delayed until load time. If the starting address changes, we need only reload the user code to incorporate this changed value.
  • 175. 8.1 Background 355 dynamic linking source program object module linkage editor load module loader in-memory binary memory image other object modules compile time load time execution time (run time) compiler or assembler system library dynamically loaded system library Figure 8.3 Multistep processing of a user program. • Execution time. If the process can be moved during its execution from one memory segment to another, then binding must be delayed until run time. Special hardware must be available for this scheme to work, as will be discussed in Section 8.1.3. Most general-purpose operating systems use this method. A major portion of this chapter is devoted to showing how these various bind- ings can be implemented effectively in a computer system and to discussing appropriate hardware support. 8.1.3 Logical Versus Physical Address Space An address generated by the CPU is commonly referred to as a logical address, whereas an address seen by the memory unit—that is, the one loaded into the memory-address register of the memory—is commonly referred to as a physical address. The compile-time and load-time address-binding methods generate iden- tical logical and physical addresses. However, the execution-time address-
  • 176. 356 Chapter 8 Main Memory MMU CPU memory 14346 14000 relocation register 346 logical address physical address Figure 8.4 Dynamic relocation using a relocation register. binding scheme results in differing logical and physical addresses. In this case, we usually refer to the logical address as a virtual address. We use logical address and virtual address interchangeably in this text. The set of all logical addresses generated by a program is a logical address space. The set of all physical addresses corresponding to these logical addresses is a physical address space. Thus, in the execution-time address-binding scheme, the logical and physical address spaces differ. The run-time mapping from virtual to physical addresses is done by a hardware device called the memory-management unit (MMU). We can choose from many different methods to accomplish such mapping, as we discuss in Section 8.3 through Section 8.5. For the time being, we illustrate this mapping with a simple MMU scheme that is a generalization of the base-register scheme described in Section 8.1.1. The base register is now called a relocation register. The value in the relocation register is added to every address generated by a user process at the time the address is sent to memory (see Figure 8.4). For example, if the base is at 14000, then an attempt by the user to address location 0 is dynamically relocated to location 14000; an access to location 346 is mapped to location 14346. The user program never sees the real physical addresses. The program can create a pointer to location 346, store it in memory, manipulate it, and compare it with other addresses—all as the number 346. Only when it is used as a memory address (in an indirect load or store, perhaps) is it relocated relative to the base register. The user program deals with logical addresses. The memory-mapping hardware converts logical addresses into physical addresses. This form of execution-time binding was discussed in Section 8.1.2. The final location of a referenced memory address is not determined until the reference is made. We now have two different types of addresses: logical addresses (in the range 0 to max) and physical addresses (in the range R + 0 to R + max for a base value R). The user program generates only logical addresses and thinks that the process runs in locations 0 to max. However, these logical addresses must be mapped to physical addresses before they are used. The concept of a logical
  • 177. 8.1 Background 357 address space that is bound to a separate physical address space is central to proper memory management. 8.1.4 Dynamic Loading In our discussion so far, it has been necessary for the entire program and all data of a process to be in physical memory for the process to execute. The size of a process has thus been limited to the size of physical memory. To obtain better memory-space utilization, we can use dynamic loading. With dynamic loading, a routine is not loaded until it is called. All routines are kept on disk in a relocatable load format. The main program is loaded into memory and is executed. When a routine needs to call another routine, the calling routine first checks to see whether the other routine has been loaded. If it has not, the relocatable linking loader is called to load the desired routine into memory and to update the program’s address tables to reflect this change. Then control is passed to the newly loaded routine. The advantage of dynamic loading is that a routine is loaded only when it is needed. This method is particularly useful when large amounts of code are needed to handle infrequently occurring cases, such as error routines. In this case, although the total program size may be large, the portion that is used (and hence loaded) may be much smaller. Dynamic loading does not require special support from the operating system. It is the responsibility of the users to design their programs to take advantage of such a method. Operating systems may help the programmer, however, by providing library routines to implement dynamic loading. 8.1.5 Dynamic Linking and Shared Libraries Dynamically linked libraries are system libraries that are linked to user programs when the programs are run (refer back to Figure 8.3). Some operating systems support only static linking, in which system libraries are treated like any other object module and are combined by the loader into the binary program image. Dynamic linking, in contrast, is similar to dynamic loading. Here, though, linking, rather than loading, is postponed until execution time. This feature is usually used with system libraries, such as language subroutine libraries. Without this facility, each program on a system must include a copy of its language library (or at least the routines referenced by the program) in the executable image. This requirement wastes both disk space and main memory. With dynamic linking, a stub is included in the image for each library- routine reference. The stub is a small piece of code that indicates how to locate the appropriate memory-resident library routine or how to load the library if the routine is not already present. When the stub is executed, it checks to see whether the needed routine is already in memory. If it is not, the program loads the routine into memory. Either way, the stub replaces itself with the address of the routine and executes the routine. Thus, the next time that particular code segment is reached, the library routine is executed directly, incurring no cost for dynamic linking. Under this scheme, all processes that use a language library execute only one copy of the library code. This feature can be extended to library updates (such as bug fixes). A library may be replaced by a new version, and all programs that reference the library will automatically use the new version. Without dynamic linking, all such
  • 178. 358 Chapter 8 Main Memory programs would need to be relinked to gain access to the new library. So that programs will not accidentally execute new, incompatible versions of libraries, version information is included in both the program and the library. More than one version of a library may be loaded into memory, and each program uses its version information to decide which copy of the library to use. Versions with minor changes retain the same version number, whereas versions with major changes increment the number. Thus, only programs that are compiled with the new library version are affected by any incompatible changes incorporated in it. Other programs linked before the new library was installed will continue using the older library. This system is also known as shared libraries. Unlike dynamic loading, dynamic linking and shared libraries generally require help from the operating system. If the processes in memory are protected from one another, then the operating system is the only entity that can check to see whether the needed routine is in another process’s memory space or that can allow multiple processes to access the same memory addresses. We elaborate on this concept when we discuss paging in Section 8.5.4. 8.2 Swapping A process must be in memory to be executed. A process, however, can be swapped temporarily out of memory to a backing store and then brought back into memory for continued execution (Figure 8.5). Swapping makes it possible for the total physical address space of all processes to exceed the real physical memory of the system, thus increasing the degree of multiprogramming in a system. 8.2.1 Standard Swapping Standard swapping involves moving processes between main memory and a backing store. The backing store is commonly a fast disk. It must be large operating system swap out swap in user space main memory backing store process P2 process P1 1 2 Figure 8.5 Swapping of two processes using a disk as a backing store.
enough to accommodate copies of all memory images for all users, and it must provide direct access to these memory images. The system maintains a ready queue consisting of all processes whose memory images are on the backing store or in memory and are ready to run. Whenever the CPU scheduler decides to execute a process, it calls the dispatcher. The dispatcher checks to see whether the next process in the queue is in memory. If it is not, and if there is no free memory region, the dispatcher swaps out a process currently in memory and swaps in the desired process. It then reloads registers and transfers control to the selected process.

The context-switch time in such a swapping system is fairly high. To get an idea of the context-switch time, let's assume that the user process is 100 MB in size and the backing store is a standard hard disk with a transfer rate of 50 MB per second. The actual transfer of the 100-MB process to or from main memory takes

100 MB / 50 MB per second = 2 seconds

The swap time is thus 2,000 milliseconds. Since we must swap both out and in, the total swap time is about 4,000 milliseconds. (Here, we are ignoring other disk performance aspects, which we cover in Chapter 10.)

Notice that the major part of the swap time is transfer time. The total transfer time is directly proportional to the amount of memory swapped. If we have a computer system with 4 GB of main memory and a resident operating system taking 1 GB, the maximum size of the user process is 3 GB. However, many user processes may be much smaller than this (say, 100 MB). A 100-MB process could be swapped out in 2 seconds, compared with the 60 seconds required for swapping 3 GB. Clearly, it would be useful to know exactly how much memory a user process is using, not simply how much it might be using. Then we would need to swap only what is actually used, reducing swap time. For this method to be effective, the user must keep the system informed of any changes in memory requirements. Thus, a process with dynamic memory requirements will need to issue system calls (request_memory() and release_memory()) to inform the operating system of its changing memory needs.

Swapping is constrained by other factors as well. If we want to swap a process, we must be sure that it is completely idle. Of particular concern is any pending I/O. A process may be waiting for an I/O operation when we want to swap that process to free up memory. However, if the I/O is asynchronously accessing the user memory for I/O buffers, then the process cannot be swapped. Assume that the I/O operation is queued because the device is busy. If we were to swap out process P1 and swap in process P2, the I/O operation might then attempt to use memory that now belongs to process P2. There are two main solutions to this problem: never swap a process with pending I/O, or execute I/O operations only into operating-system buffers. Transfers between operating-system buffers and process memory then occur only when the process is swapped in. Note that this double buffering itself adds overhead. We now need to copy the data again, from kernel memory to user memory, before the user process can access it.

Standard swapping is not used in modern operating systems. It requires too much swapping time and provides too little execution time to be a reasonable
  • 180. 360 Chapter 8 Main Memory memory-management solution. Modified versions of swapping, however, are found on many systems, including UNIX, Linux, and Windows. In one common variation, swapping is normally disabled but will start if the amount of free memory (unused memory available for the operating system or processes to use) falls below a threshold amount. Swapping is halted when the amount of free memory increases. Another variation involves swapping portions of processes—rather than entire processes—to decrease swap time. Typically, these modified forms of swapping work in conjunction with virtual memory, which we cover in Chapter 9. 8.2.2 Swapping on Mobile Systems Although most operating systems for PCs and servers support some modified version of swapping, mobile systems typically do not support swapping in any form. Mobile devices generally use flash memory rather than more spacious hard disks as their persistent storage. The resulting space constraint is one reason why mobile operating-system designers avoid swapping. Other reasons include the limited number of writes that flash memory can tolerate before it becomes unreliable and the poor throughput between main memory and flash memory in these devices. Instead of using swapping, when free memory falls below a certain threshold, Apple’s iOS asks applications to voluntarily relinquish allocated memory. Read-only data (such as code) are removed from the system and later reloaded from flash memory if necessary. Data that have been modified (such as the stack) are never removed. However, any applications that fail to free up sufficient memory may be terminated by the operating system. Android does not support swapping and adopts a strategy similar to that used by iOS. It may terminate a process if insufficient free memory is available. However, before terminating a process, Android writes its application state to flash memory so that it can be quickly restarted. Because of these restrictions, developers for mobile systems must carefully allocate and release memory to ensure that their applications do not use too much memory or suffer from memory leaks. Note that both iOS and Android support paging, so they do have memory-management abilities. We discuss paging later in this chapter. 8.3 Contiguous Memory Allocation The main memory must accommodate both the operating system and the various user processes. We therefore need to allocate main memory in the most efficient way possible. This section explains one early method, contiguous memory allocation. The memory is usually divided into two partitions: one for the resident operating system and one for the user processes. We can place the operating system in either low memory or high memory. The major factor affecting this decision is the location of the interrupt vector. Since the interrupt vector is often in low memory, programmers usually place the operating system in low memory as well. Thus, in this text, we discuss only the situation in which
  • 181. 8.3 Contiguous Memory Allocation 361 the operating system resides in low memory. The development of the other situation is similar. We usually want several user processes to reside in memory at the same time. We therefore need to consider how to allocate available memory to the processes that are in the input queue waiting to be brought into memory. In contiguous memory allocation, each process is contained in a single section of memory that is contiguous to the section containing the next process. 8.3.1 Memory Protection Before discussing memory allocation further, we must discuss the issue of memory protection. We can prevent a process from accessing memory it does not own by combining two ideas previously discussed. If we have a system with a relocation register (Section 8.1.3), together with a limit register (Section 8.1.1), we accomplish our goal. The relocation register contains the value of the smallest physical address; the limit register contains the range of logical addresses (for example, relocation = 100040 and limit = 74600). Each logical address must fall within the range specified by the limit register. The MMU maps the logical address dynamically by adding the value in the relocation register. This mapped address is sent to memory (Figure 8.6). When the CPU scheduler selects a process for execution, the dispatcher loads the relocation and limit registers with the correct values as part of the context switch. Because every address generated by a CPU is checked against these registers, we can protect both the operating system and the other users’ programs and data from being modified by this running process. The relocation-register scheme provides an effective way to allow the operating system’s size to change dynamically. This flexibility is desirable in many situations. For example, the operating system contains code and buffer space for device drivers. If a device driver (or other operating-system service) is not commonly used, we do not want to keep the code and data in memory, as we might be able to use that space for other purposes. Such code is sometimes called transient operating-system code; it comes and goes as needed. Thus, using this code changes the size of the operating system during program execution. CPU memory logical address trap: addressing error no yes physical address relocation register limit register Figure 8.6 Hardware support for relocation and limit registers.
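The combined relocation/limit scheme of Figure 8.6 can likewise be sketched in a few lines of C. The sketch is illustrative only (the MMU performs this mapping in hardware) and uses the example register values given above; the function name translate() is an assumption.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static const uint32_t relocation = 100040;  /* loaded by the dispatcher at context switch */
static const uint32_t limit      = 74600;   /* range of legal logical addresses           */

uint32_t translate(uint32_t logical) {
    if (logical >= limit) {                  /* outside the process's logical space */
        fprintf(stderr, "trap: addressing error\n");
        exit(EXIT_FAILURE);
    }
    return logical + relocation;             /* e.g., logical 346 maps to physical 100386 */
}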
  • 182. 362 Chapter 8 Main Memory 8.3.2 Memory Allocation Now we are ready to turn to memory allocation. One of the simplest methods for allocating memory is to divide memory into several fixed-sized partitions. Each partition may contain exactly one process. Thus, the degree of multiprogramming is bound by the number of partitions. In this multiple- partition method, when a partition is free, a process is selected from the input queue and is loaded into the free partition. When the process terminates, the partition becomes available for another process. This method was originally used by the IBM OS/360 operating system (called MFT)but is no longer in use. The method described next is a generalization of the fixed-partition scheme (called MVT); it is used primarily in batch environments. Many of the ideas presented here are also applicable to a time-sharing environment in which pure segmentation is used for memory management (Section 8.4). In the variable-partition scheme, the operating system keeps a table indicating which parts of memory are available and which are occupied. Initially, all memory is available for user processes and is considered one large block of available memory, a hole. Eventually, as you will see, memory contains a set of holes of various sizes. As processes enter the system, they are put into an input queue. The operating system takes into account the memory requirements of each process and the amount of available memory space in determining which processes are allocated memory. When a process is allocated space, it is loaded into memory, and it can then compete for CPU time. When a process terminates, it releases its memory, which the operating system may then fill with another process from the input queue. At any given time, then, we have a list of available block sizes and an input queue. The operating system can order the input queue according to a scheduling algorithm. Memory is allocated to processes until, finally, the memory requirements of the next process cannot be satisfied—that is, no available block of memory (or hole) is large enough to hold that process. The operating system can then wait until a large enough block is available, or it can skip down the input queue to see whether the smaller memory requirements of some other process can be met. In general, as mentioned, the memory blocks available comprise a set of holes of various sizes scattered throughout memory. When a process arrives and needs memory, the system searches the set for a hole that is large enough for this process. If the hole is too large, it is split into two parts. One part is allocated to the arriving process; the other is returned to the set of holes. When a process terminates, it releases its block of memory, which is then placed back in the set of holes. If the new hole is adjacent to other holes, these adjacent holes are merged to form one larger hole. At this point, the system may need to check whether there are processes waiting for memory and whether this newly freed and recombined memory could satisfy the demands of any of these waiting processes. This procedure is a particular instance of the general dynamic storage- allocation problem, which concerns how to satisfy a request of size n from a list of free holes. There are many solutions to this problem. The first-fit, best-fit, and worst-fit strategies are the ones most commonly used to select a free hole from the set of available holes.
  • 183. 8.3 Contiguous Memory Allocation 363 • First fit. Allocate the first hole that is big enough. Searching can start either at the beginning of the set of holes or at the location where the previous first-fit search ended. We can stop searching as soon as we find a free hole that is large enough. • Best fit. Allocate the smallest hole that is big enough. We must search the entire list, unless the list is ordered by size. This strategy produces the smallest leftover hole. • Worst fit. Allocate the largest hole. Again, we must search the entire list, unless it is sorted by size. This strategy produces the largest leftover hole, which may be more useful than the smaller leftover hole from a best-fit approach. Simulations have shown that both first fit and best fit are better than worst fit in terms of decreasing time and storage utilization. Neither first fit nor best fit is clearly better than the other in terms of storage utilization, but first fit is generally faster. 8.3.3 Fragmentation Both the first-fit and best-fit strategies for memory allocation suffer from external fragmentation. As processes are loaded and removed from memory, the free memory space is broken into little pieces. External fragmentation exists when there is enough total memory space to satisfy a request but the available spaces are not contiguous: storage is fragmented into a large number of small holes. This fragmentation problem can be severe. In the worst case, we could have a block of free (or wasted) memory between every two processes. If all these small pieces of memory were in one big free block instead, we might be able to run several more processes. Whether we are using the first-fit or best-fit strategy can affect the amount of fragmentation. (First fit is better for some systems, whereas best fit is better for others.) Another factor is which end of a free block is allocated. (Which is the leftover piece—the one on the top or the one on the bottom?) No matter which algorithm is used, however, external fragmentation will be a problem. Depending on the total amount of memory storage and the average process size, external fragmentation may be a minor or a major problem. Statistical analysis of first fit, for instance, reveals that, even with some optimization, given N allocated blocks, another 0.5 N blocks will be lost to fragmentation. That is, one-third of memory may be unusable! This property is known as the 50-percent rule. Memory fragmentation can be internal as well as external. Consider a multiple-partition allocation scheme with a hole of 18,464 bytes. Suppose that the next process requests 18,462 bytes. If we allocate exactly the requested block, we are left with a hole of 2 bytes. The overhead to keep track of this hole will be substantially larger than the hole itself. The general approach to avoiding this problem is to break the physical memory into fixed-sized blocks and allocate memory in units based on block size. With this approach, the memory allocated to a process may be slightly larger than the requested memory. The difference between these two numbers is internal fragmentation—unused memory that is internal to a partition.
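To make the hole-selection strategies concrete, here is an illustrative sketch of first fit and best fit over a linked list of free holes. The hole structure and the list representation are assumptions made for the example, not a description of any particular allocator.

#include <stddef.h>

typedef struct hole {
    size_t start;          /* starting address of the free block */
    size_t size;           /* size of the free block in bytes    */
    struct hole *next;
} hole_t;

/* First fit: return the first hole big enough for the request. */
hole_t *first_fit(hole_t *holes, size_t request) {
    for (hole_t *h = holes; h != NULL; h = h->next)
        if (h->size >= request)
            return h;
    return NULL;           /* no hole is large enough */
}

/* Best fit: scan the entire list and return the smallest adequate hole. */
hole_t *best_fit(hole_t *holes, size_t request) {
    hole_t *best = NULL;
    for (hole_t *h = holes; h != NULL; h = h->next)
        if (h->size >= request && (best == NULL || h->size < best->size))
            best = h;
    return best;           /* leaves the smallest leftover hole */
}

Worst fit is the same full scan with the comparison reversed, returning the largest adequate hole. Splitting the chosen hole and merging adjacent holes on release are omitted from the sketch.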
  • 184. 364 Chapter 8 Main Memory One solution to the problem of external fragmentation is compaction. The goal is to shuffle the memory contents so as to place all free memory together in one large block. Compaction is not always possible, however. If relocation is static and is done at assembly or load time, compaction cannot be done. It is possible only if relocation is dynamic and is done at execution time. If addresses are relocated dynamically, relocation requires only moving the program and data and then changing the base register to reflect the new base address. When compaction is possible, we must determine its cost. The simplest compaction algorithm is to move all processes toward one end of memory; all holes move in the other direction, producing one large hole of available memory. This scheme can be expensive. Another possible solution to the external-fragmentation problem is to permit the logical address space of the processes to be noncontiguous, thus allowing a process to be allocated physical memory wherever such memory is available. Two complementary techniques achieve this solution: segmentation (Section 8.4) and paging (Section 8.5). These techniques can also be combined. Fragmentation is a general problem in computing that can occur wherever we must manage blocks of data. We discuss the topic further in the storage management chapters (Chapters 10 through and 12). 8.4 Segmentation As we’ve already seen, the user’s view of memory is not the same as the actual physical memory. This is equally true of the programmer’s view of memory. Indeed, dealing with memory in terms of its physical properties is inconvenient to both the operating system and the programmer. What if the hardware could provide a memory mechanism that mapped the programmer’s view to the actual physical memory? The system would have more freedom to manage memory, while the programmer would have a more natural programming environment. Segmentation provides such a mechanism. 8.4.1 Basic Method Do programmers think of memory as a linear array of bytes, some containing instructions and others containing data? Most programmers would say “no.” Rather, they prefer to view memory as a collection of variable-sized segments, with no necessary ordering among the segments (Figure 8.7). When writing a program, a programmer thinks of it as a main program with a set of methods, procedures, or functions. It may also include various data structures: objects, arrays, stacks, variables, and so on. Each of these modules or data elements is referred to by name. The programmer talks about “the stack,” “the math library,” and “the main program” without caring what addresses in memory these elements occupy. She is not concerned with whether the stack is stored before or after the Sqrt() function. Segments vary in length, and the length of each is intrinsically defined by its purpose in the program. Elements within a segment are identified by their offset from the beginning of the segment: the first statement of the program, the seventh stack frame entry in the stack, the fifth instruction of the Sqrt(), and so on. Segmentation is a memory-management scheme that supports this pro- grammer view of memory. A logical address space is a collection of segments.
  • 185. 8.4 Segmentation 365 logical address subroutine stack symbol table main program Sqrt Figure 8.7 Programmer’s view of a program. Each segment has a name and a length. The addresses specify both the segment name and the offset within the segment. The programmer therefore specifies each address by two quantities: a segment name and an offset. For simplicity of implementation, segments are numbered and are referred to by a segment number, rather than by a segment name. Thus, a logical address consists of a two tuple: segment-number, offset. Normally, when a program is compiled, the compiler automatically constructs segments reflecting the input program. A C compiler might create separate segments for the following: 1. The code 2. Global variables 3. The heap, from which memory is allocated 4. The stacks used by each thread 5. The standard C library Libraries that are linked in during compile time might be assigned separate segments. The loader would take all these segments and assign them segment numbers. 8.4.2 Segmentation Hardware Although the programmer can now refer to objects in the program by a two-dimensional address, the actual physical memory is still, of course, a one- dimensional sequence of bytes. Thus, we must define an implementation to map two-dimensional user-defined addresses into one-dimensional physical
  • 186. 366 Chapter 8 Main Memory CPU physical memory s d + trap: addressing error no yes segment table limit base s Figure 8.8 Segmentation hardware. addresses. This mapping is effected by a segment table. Each entry in the segment table has a segment base and a segment limit. The segment base contains the starting physical address where the segment resides in memory, and the segment limit specifies the length of the segment. The use of a segment table is illustrated in Figure 8.8. A logical address consists of two parts: a segment number, s, and an offset into that segment, d. The segment number is used as an index to the segment table. The offset d of the logical address must be between 0 and the segment limit. If it is not, we trap to the operating system (logical addressing attempt beyond end of segment). When an offset is legal, it is added to the segment base to produce the address in physical memory of the desired byte. The segment table is thus essentially an array of base–limit register pairs. As an example, consider the situation shown in Figure 8.9. We have five segments numbered from 0 through 4. The segments are stored in physical memory as shown. The segment table has a separate entry for each segment, giving the beginning address of the segment in physical memory (or base) and the length of that segment (or limit). For example, segment 2 is 400 bytes long and begins at location 4300. Thus, a reference to byte 53 of segment 2 is mapped onto location 4300 + 53 = 4353. A reference to segment 3, byte 852, is mapped to 3200 (the base of segment 3) + 852 = 4052. A reference to byte 1222 of segment 0 would result in a trap to the operating system, as this segment is only 1,000 bytes long. 8.5 Paging Segmentation permits the physical address space of a process to be non- contiguous. Paging is another memory-management scheme that offers this advantage. However, paging avoids external fragmentation and the need for
  • 187. 8.5 Paging 367 logical address space subroutine stack symbol table main program Sqrt 1400 physical memory 2400 3200 segment 2 4300 4700 5700 6300 6700 segment table limit 0 1 2 3 4 1000 400 400 1100 1000 base 1400 6300 4300 3200 4700 segment 0 segment 3 segment 4 segment 2 segment 1 segment 0 segment 3 segment 4 segment 1 Figure 8.9 Example of segmentation. compaction, whereas segmentation does not. It also solves the considerable problem of fitting memory chunks of varying sizes onto the backing store. Most memory-management schemes used before the introduction of paging suffered from this problem. The problem arises because, when code fragments or data residing in main memory need to be swapped out, space must be found on the backing store. The backing store has the same fragmentation problems discussed in connection with main memory, but access is much slower, so compaction is impossible. Because of its advantages over earlier methods, paging in its various forms is used in most operating systems, from those for mainframes through those for smartphones. Paging is implemented through cooperation between the operating system and the computer hardware. 8.5.1 Basic Method The basic method for implementing paging involves breaking physical mem- ory into fixed-sized blocks called frames and breaking logical memory into blocks of the same size called pages. When a process is to be executed, its pages are loaded into any available memory frames from their source (a file system or the backing store). The backing store is divided into fixed-sized blocks that are the same size as the memory frames or clusters of multiple frames. This rather simple idea has great functionality and wide ramifications. For example, the logical address space is now totally separate from the physical address space, so a process can have a logical 64-bit address space even though the system has less than 264 bytes of physical memory. The hardware support for paging is illustrated in Figure 8.10. Every address generated by the CPU is divided into two parts: a page number (p) and a page
offset (d). The page number is used as an index into a page table. The page table contains the base address of each page in physical memory. This base address is combined with the page offset to define the physical memory address that is sent to the memory unit. The paging model of memory is shown in Figure 8.11.
Figure 8.10 Paging hardware.
Figure 8.11 Paging model of logical and physical memory.
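To make the translation concrete, the following is a minimal C sketch of the lookup shown in Figure 8.10, assuming a 4 KB (2^12-byte) page size and a plain array standing in for the hardware page table; the sizes, frame numbers, and sample address are illustrative only, not taken from the text.

#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE   4096u      /* assumed 4 KB pages (2^12 bytes)          */
#define OFFSET_BITS 12u
#define NUM_PAGES   16u        /* tiny illustrative logical address space  */

/* page_table[p] holds the frame number for page p (illustrative values). */
static uint32_t page_table[NUM_PAGES] = { 5, 6, 1, 2 };

/* Translate a logical address the way the paging hardware would: split it
 * into page number and offset, look the page up, then concatenate the
 * frame number with the unchanged offset. */
uint32_t translate(uint32_t logical)
{
    uint32_t p = logical >> OFFSET_BITS;      /* high-order bits: page number */
    uint32_t d = logical & (PAGE_SIZE - 1);   /* low-order bits: offset       */
    if (p >= NUM_PAGES)                       /* real hardware would trap to the OS here */
        return 0;
    uint32_t f = page_table[p];               /* frame number from the table  */
    return (f << OFFSET_BITS) | d;            /* physical address             */
}

int main(void)
{
    unsigned logical = 0x2010u;               /* page 2, offset 0x10 */
    printf("logical 0x%x -> physical 0x%x\n", logical, (unsigned)translate(logical));
    return 0;
}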
  • 189. 8.5 Paging 369 The page size (like the frame size) is defined by the hardware. The size of a page is a power of 2, varying between 512 bytes and 1 GB per page, depending on the computer architecture. The selection of a power of 2 as a page size makes the translation of a logical address into a page number and page offset particularly easy. If the size of the logical address space is 2m , and a page size is 2n bytes, then the high-order m − n bits of a logical address designate the page number, and the n low-order bits designate the page offset. Thus, the logical address is as follows: p d page number page offset m – n n where p is an index into the page table and d is the displacement within the page. As a concrete (although minuscule) example, consider the memory in Figure 8.12. Here, in the logical address, n= 2 and m = 4. Using a page size of 4 bytes and a physical memory of 32 bytes (8 pages), we show how the programmer’s view of memory can be mapped into physical memory. Logical address 0 is page 0, offset 0. Indexing into the page table, we find that page 0 logical memory physical memory page table i j k l m n o p a b c d e f g h a b c d e f g h i j k l m n o p 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 0 4 8 12 16 20 24 28 1 2 3 5 6 1 2 Figure 8.12 Paging example for a 32-byte memory with 4-byte pages.
is in frame 5. Thus, logical address 0 maps to physical address 20 [= (5 × 4) + 0]. Logical address 3 (page 0, offset 3) maps to physical address 23 [= (5 × 4) + 3]. Logical address 4 is page 1, offset 0; according to the page table, page 1 is mapped to frame 6. Thus, logical address 4 maps to physical address 24 [= (6 × 4) + 0]. Logical address 13 maps to physical address 9.
OBTAINING THE PAGE SIZE ON LINUX SYSTEMS. On a Linux system, the page size varies according to architecture, and there are several ways of obtaining the page size. One approach is to use the getpagesize() system call. Another strategy is to enter the following command on the command line: getconf PAGESIZE. Each of these techniques returns the page size as a number of bytes.
You may have noticed that paging itself is a form of dynamic relocation. Every logical address is bound by the paging hardware to some physical address. Using paging is similar to using a table of base (or relocation) registers, one for each frame of memory.
When we use a paging scheme, we have no external fragmentation: any free frame can be allocated to a process that needs it. However, we may have some internal fragmentation. Notice that frames are allocated as units. If the memory requirements of a process do not happen to coincide with page boundaries, the last frame allocated may not be completely full. For example, if page size is 2,048 bytes, a process of 72,766 bytes will need 35 pages plus 1,086 bytes. It will be allocated 36 frames, resulting in internal fragmentation of 2,048 − 1,086 = 962 bytes. In the worst case, a process would need n pages plus 1 byte. It would be allocated n + 1 frames, resulting in internal fragmentation of almost an entire frame.
If process size is independent of page size, we expect internal fragmentation to average one-half page per process. This consideration suggests that small page sizes are desirable. However, overhead is involved in each page-table entry, and this overhead is reduced as the size of the pages increases. Also, disk I/O is more efficient when the amount of data being transferred is larger (Chapter 10). Generally, page sizes have grown over time as processes, data sets, and main memory have become larger. Today, pages typically are between 4 KB and 8 KB in size, and some systems support even larger page sizes. Some CPUs and kernels even support multiple page sizes. For instance, Solaris uses page sizes of 8 KB and 4 MB, depending on the data stored by the pages. Researchers are now developing support for variable on-the-fly page size.
Frequently, on a 32-bit CPU, each page-table entry is 4 bytes long, but that size can vary as well. A 32-bit entry can point to one of 2^32 physical page frames. If frame size is 4 KB (2^12 bytes), then a system with 4-byte entries can address 2^44 bytes (or 16 TB) of physical memory. We should note here that the size of physical memory in a paged memory system is different from the maximum logical size of a process. As we further explore paging, we introduce other information that must be kept in the page-table entries. That information reduces the number
of bits available to address page frames. Thus, a system with 32-bit page-table entries may address less physical memory than the possible maximum. A 32-bit CPU uses 32-bit addresses, meaning that a given process space can only be 2^32 bytes (4 GB). Therefore, paging lets us use physical memory that is larger than what can be addressed by the CPU's address pointer length.
When a process arrives in the system to be executed, its size, expressed in pages, is examined. Each page of the process needs one frame. Thus, if the process requires n pages, at least n frames must be available in memory. If n frames are available, they are allocated to this arriving process. The first page of the process is loaded into one of the allocated frames, and the frame number is put in the page table for this process. The next page is loaded into another frame, its frame number is put into the page table, and so on (Figure 8.13).
Figure 8.13 Free frames (a) before allocation and (b) after allocation.
An important aspect of paging is the clear separation between the programmer's view of memory and the actual physical memory. The programmer views memory as one single space, containing only this one program. In fact, the user program is scattered throughout physical memory, which also holds other programs. The difference between the programmer's view of memory and the actual physical memory is reconciled by the address-translation hardware. The logical addresses are translated into physical addresses. This mapping is hidden from the programmer and is controlled by the operating system. Notice that the user process by definition is unable to access memory it does not own. It has no way of addressing memory outside of its page table, and the table includes only those pages that the process owns.
Since the operating system is managing physical memory, it must be aware of the allocation details of physical memory—which frames are allocated, which frames are available, how many total frames there are, and so on. This information is generally kept in a data structure called a frame table. The frame table has one entry for each physical page frame, indicating whether the latter
  • 192. 372 Chapter 8 Main Memory is free or allocated and, if it is allocated, to which page of which process or processes. In addition, the operating system must be aware that user processes operate in user space, and all logical addresses must be mapped to produce physical addresses. If a user makes a system call (to do I/O, for example) and provides an address as a parameter (a buffer, for instance), that address must be mapped to produce the correct physical address. The operating system maintains a copy of the page table for each process, just as it maintains a copy of the instruction counter and register contents. This copy is used to translate logical addresses to physical addresses whenever the operating system must map a logical address to a physical address manually. It is also used by the CPU dispatcher to define the hardware page table when a process is to be allocated the CPU. Paging therefore increases the context-switch time. 8.5.2 Hardware Support Each operating system has its own methods for storing page tables. Some allocate a page table for each process. A pointer to the page table is stored with the other register values (like the instruction counter) in the process control block. When the dispatcher is told to start a process, it must reload the user registers and define the correct hardware page-table values from the stored user page table. Other operating systems provide one or at most a few page tables, which decreases the overhead involved when processes are context-switched. The hardware implementation of the page table can be done in several ways. In the simplest case, the page table is implemented as a set of dedicated registers. These registers should be built with very high-speed logic to make the paging-address translation efficient. Every access to memory must go through the paging map, so efficiency is a major consideration. The CPU dispatcher reloads these registers, just as it reloads the other registers. Instructions to load or modify the page-table registers are, of course, privileged, so that only the operating system can change the memory map. The DEC PDP-11 is an example of such an architecture. The address consists of 16 bits, and the page size is 8 KB. The page table thus consists of eight entries that are kept in fast registers. The use of registers for the page table is satisfactory if the page table is reasonably small (for example, 256 entries). Most contemporary computers, however, allow the page table to be very large (for example, 1 million entries). For these machines, the use of fast registers to implement the page table is not feasible. Rather, the page table is kept in main memory, and a page-table base register (PTBR) points to the page table. Changing page tables requires changing only this one register, substantially reducing context-switch time. The problem with this approach is the time required to access a user memory location. If we want to access location i, we must first index into the page table, using the value in the PTBR offset by the page number for i. This task requires a memory access. It provides us with the frame number, which is combined with the page offset to produce the actual address. We can then access the desired place in memory. With this scheme, two memory accesses are needed to access a byte (one for the page-table entry, one for the byte). Thus, memory access is slowed by a factor of 2. This delay would be intolerable under most circumstances. We might as well resort to swapping!
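As a brief aside, the Linux page-size query from the sidebar earlier in this section, together with the internal-fragmentation arithmetic for the 72,766-byte process, can be checked with a few lines of C. This is only a sketch: sysconf(_SC_PAGESIZE) is the portable POSIX call, and getpagesize() is the older BSD-style routine mentioned in the text.

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* The system's actual page size (getpagesize() would also work). */
    long sys_page = sysconf(_SC_PAGESIZE);
    printf("system page size: %ld bytes\n", sys_page);

    /* The text's internal-fragmentation example assumes 2,048-byte pages. */
    long page = 2048, proc = 72766;
    long frames = (proc + page - 1) / page;   /* round up: 36 frames            */
    long wasted = frames * page - proc;       /* 962 bytes unused in last frame */
    printf("%ld-byte process: %ld frames, %ld bytes of internal fragmentation\n",
           proc, frames, wasted);
    return 0;
}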
  • 193. 8.5 Paging 373 The standard solution to this problem is to use a special, small, fast- lookup hardware cache called a translation look-aside buffer (TLB). The TLB is associative, high-speed memory. Each entry in the TLB consists of two parts: a key (or tag) and a value. When the associative memory is presented with an item, the item is compared with all keys simultaneously. If the item is found, the corresponding value field is returned. The search is fast; a TLB lookup in modern hardware is part of the instruction pipeline, essentially adding no performance penalty. To be able to execute the search within a pipeline step, however, the TLB must be kept small. It is typically between 32 and 1,024 entries in size. Some CPUs implement separate instruction and data address TLBs. That can double the number of TLB entries available, because those lookups occur in different pipeline steps. We can see in this development an example of the evolution of CPU technology: systems have evolved from having no TLBs to having multiple levels of TLBs, just as they have multiple levels of caches. The TLB is used with page tables in the following way. The TLB contains only a few of the page-table entries. When a logical address is generated by the CPU, its page number is presented to the TLB. If the page number is found, its frame number is immediately available and is used to access memory. As just mentioned, these steps are executed as part of the instruction pipeline within the CPU, adding no performance penalty compared with a system that does not implement paging. If the page number is not in the TLB (known as a TLB miss), a memory reference to the page table must be made. Depending on the CPU, this may be done automatically in hardware or via an interrupt to the operating system. When the frame number is obtained, we can use it to access memory (Figure 8.14). In addition, we add the page number and frame number to the TLB, so page table f CPU logical address p d f d physical address physical memory p TLB miss page number frame number TLB hit TLB Figure 8.14 Paging hardware with TLB.
that they will be found quickly on the next reference. If the TLB is already full of entries, an existing entry must be selected for replacement. Replacement policies range from least recently used (LRU) through round-robin to random. Some CPUs allow the operating system to participate in LRU entry replacement, while others handle the matter themselves. Furthermore, some TLBs allow certain entries to be wired down, meaning that they cannot be removed from the TLB. Typically, TLB entries for key kernel code are wired down.
Some TLBs store address-space identifiers (ASIDs) in each TLB entry. An ASID uniquely identifies each process and is used to provide address-space protection for that process. When the TLB attempts to resolve virtual page numbers, it ensures that the ASID for the currently running process matches the ASID associated with the virtual page. If the ASIDs do not match, the attempt is treated as a TLB miss. In addition to providing address-space protection, an ASID allows the TLB to contain entries for several different processes simultaneously. If the TLB does not support separate ASIDs, then every time a new page table is selected (for instance, with each context switch), the TLB must be flushed (or erased) to ensure that the next executing process does not use the wrong translation information. Otherwise, the TLB could include old entries that contain valid virtual addresses but have incorrect or invalid physical addresses left over from the previous process.
The percentage of times that the page number of interest is found in the TLB is called the hit ratio. An 80-percent hit ratio, for example, means that we find the desired page number in the TLB 80 percent of the time. If it takes 100 nanoseconds to access memory, then a mapped-memory access takes 100 nanoseconds when the page number is in the TLB. If we fail to find the page number in the TLB then we must first access memory for the page table and frame number (100 nanoseconds) and then access the desired byte in memory (100 nanoseconds), for a total of 200 nanoseconds. (We are assuming that a page-table lookup takes only one memory access, but it can take more, as we shall see.) To find the effective memory-access time, we weight the case by its probability:
effective access time = 0.80 × 100 + 0.20 × 200 = 120 nanoseconds
In this example, we suffer a 20-percent slowdown in average memory-access time (from 100 to 120 nanoseconds). For a 99-percent hit ratio, which is much more realistic, we have
effective access time = 0.99 × 100 + 0.01 × 200 = 101 nanoseconds
This increased hit rate produces only a 1 percent slowdown in access time.
As we noted earlier, CPUs today may provide multiple levels of TLBs. Calculating memory access times in modern CPUs is therefore much more complicated than shown in the example above. For instance, the Intel Core i7 CPU has a 128-entry L1 instruction TLB and a 64-entry L1 data TLB. In the case of a miss at L1, it takes the CPU six cycles to check for the entry in the L2 512-entry TLB. A miss in L2 means that the CPU must either walk through the
  • 195. 8.5 Paging 375 page-table entries in memory to find the associated frame address, which can take hundreds of cycles, or interrupt to the operating system to have it do the work. A complete performance analysis of paging overhead in such a system would require miss-rate information about each TLB tier. We can see from the general information above, however, that hardware features can have a signif- icant effect on memory performance and that operating-system improvements (such as paging) can result in and, in turn, be affected by hardware changes (such as TLBs). We will further explore the impact of the hit ratio on the TLB in Chapter 9. TLBs are a hardware feature and therefore would seem to be of little concern to operating systems and their designers. But the designer needs to understand the function and features of TLBs, which vary by hardware platform. For optimal operation, an operating-system design for a given platform must implement paging according to the platform’s TLB design. Likewise, a change in the TLB design (for example, between generations of Intel CPUs) may necessitate a change in the paging implementation of the operating systems that use it. 8.5.3 Protection Memory protection in a paged environment is accomplished by protection bits associated with each frame. Normally, these bits are kept in the page table. One bit can define a page to be read–write or read-only. Every reference to memory goes through the page table to find the correct frame number. At the same time that the physical address is being computed, the protection bits can be checked to verify that no writes are being made to a read-only page. An attempt to write to a read-only page causes a hardware trap to the operating system (or memory-protection violation). We can easily expand this approach to provide a finer level of protection. We can create hardware to provide read-only, read–write, or execute-only protection; or, by providing separate protection bits for each kind of access, we can allow any combination of these accesses. Illegal attempts will be trapped to the operating system. One additional bit is generally attached to each entry in the page table: a valid–invalid bit. When this bit is set to valid, the associated page is in the process’s logical address space and is thus a legal (or valid) page. When the bit is set toinvalid, the page is not in the process’s logical address space. Illegal addresses are trapped by use of the valid–invalid bit. The operating system sets this bit for each page to allow or disallow access to the page. Suppose, for example, that in a system with a 14-bit address space (0 to 16383), we have a program that should use only addresses 0 to 10468. Given a page size of 2 KB, we have the situation shown in Figure 8.15. Addresses in pages 0, 1, 2, 3, 4, and 5 are mapped normally through the page table. Any attempt to generate an address in pages 6 or 7, however, will find that the valid–invalid bit is set to invalid, and the computer will trap to the operating system (invalid page reference). Notice that this scheme has created a problem. Because the program extends only to address 10468, any reference beyond that address is illegal. However, references to page 5 are classified as valid, so accesses to addresses up to 12287 are valid. Only the addresses from 12288 to 16383 are invalid. This
  • 196. 376 Chapter 8 Main Memory page 0 page 0 page 1 page 2 page 3 page 4 page 5 page n • • • 00000 0 1 2 3 4 5 6 7 8 9 frame number 0 1 2 3 4 5 6 7 2 3 4 7 8 9 0 0 v v v v v v i i page table valid–invalid bit 10,468 12,287 page 1 page 2 page 3 page 4 page 5 Figure 8.15 Valid (v) or invalid (i) bit in a page table. problem is a result of the 2-KB page size and reflects the internal fragmentation of paging. Rarely does a process use all its address range. In fact, many processes use only a small fraction of the address space available to them. It would be wasteful in these cases to create a page table with entries for every page in the address range. Most of this table would be unused but would take up valuable memory space. Some systems provide hardware, in the form of a page-table length register (PTLR), to indicate the size of the page table. This value is checked against every logical address to verify that the address is in the valid range for the process. Failure of this test causes an error trap to the operating system. 8.5.4 Shared Pages An advantage of paging is the possibility of sharing common code. This con- sideration is particularly important in a time-sharing environment. Consider a system that supports 40 users, each of whom executes a text editor. If the text editor consists of 150 KB of code and 50 KB of data space, we need 8,000 KB to support the 40 users. If the code is reentrant code (or pure code), however, it can be shared, as shown in Figure 8.16. Here, we see three processes sharing a three-page editor—each page 50 KB in size (the large page size is used to simplify the figure). Each process has its own data page. Reentrant code is non-self-modifying code: it never changes during execu- tion. Thus, two or more processes can execute the same code at the same time.
  • 197. 8.5 Paging 377 7 6 5 ed 2 4 ed 1 3 2 data 1 1 0 3 4 6 1 page table for P1 process P1 data 1 ed 2 ed 3 ed 1 3 4 6 2 page table for P3 process P3 data 3 ed 2 ed 3 ed 1 3 4 6 7 page table for P2 process P2 data 2 ed 2 ed 3 ed 1 8 9 10 11 data 3 2 data ed 3 Figure 8.16 Sharing of code in a paging environment. Each process has its own copy of registers and data storage to hold the data for the process’s execution. The data for two different processes will, of course, be different. Only one copy of the editor need be kept in physical memory. Each user’s page table maps onto the same physical copy of the editor, but data pages are mapped onto different frames. Thus, to support 40 users, we need only one copy of the editor (150 KB), plus 40 copies of the 50 KB of data space per user. The total space required is now 2,150 KB instead of 8,000 KB—a significant savings. Other heavily used programs can also be shared—compilers, window systems, run-time libraries, database systems, and so on. To be sharable, the code must be reentrant. The read-only nature of shared code should not be left to the correctness of the code; the operating system should enforce this property. The sharing of memory among processes on a system is similar to the sharing of the address space of a task by threads, described in Chapter 4. Furthermore, recall that in Chapter 3 we described shared memory as a method of interprocess communication. Some operating systems implement shared memory using shared pages. Organizing memory according to pages provides numerous benefits in addition to allowing several processes to share the same physical pages. We cover several other benefits in Chapter 9.
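The savings from sharing the editor's reentrant code are just arithmetic, restated here as a short C snippet using the figures from the text (40 users, 150 KB of code, 50 KB of private data per user).

#include <stdio.h>

int main(void)
{
    int users   = 40;
    int code_kb = 150;   /* reentrant editor code, shared by all users */
    int data_kb = 50;    /* per-user data, private to each process     */

    int without_sharing = users * (code_kb + data_kb);   /* 8,000 KB */
    int with_sharing    = code_kb + users * data_kb;     /* 2,150 KB */

    printf("no sharing:  %d KB\n", without_sharing);
    printf("code shared: %d KB\n", with_sharing);
    return 0;
}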
  • 198. 378 Chapter 8 Main Memory 8.6 Structure of the Page Table In this section, we explore some of the most common techniques for structuring the page table, including hierarchical paging, hashed page tables, and inverted page tables. 8.6.1 Hierarchical Paging Most modern computer systems support a large logical address space (232 to 264 ). In such an environment, the page table itself becomes excessively large. For example, consider a system with a 32-bit logical address space. If the page size in such a system is 4 KB (212 ), then a page table may consist of up to 1 million entries (232 /212 ). Assuming that each entry consists of 4 bytes, each process may need up to 4 MB of physical address space for the page table alone. Clearly, we would not want to allocate the page table contiguously in main memory. One simple solution to this problem is to divide the page table into smaller pieces. We can accomplish this division in several ways. One way is to use a two-level paging algorithm, in which the page table itself is also paged (Figure 8.17). For example, consider again the system with a 32-bit logical address space and a page size of 4 KB. A logical address is divided into a page number consisting of 20 bits and a page offset consisting of 12 bits. Because we page the page table, the page number is further divided • • • • • • outer page table page of page table page table memory 929 900 929 900 708 500 100 1 0 • • • 100 708 • • • • • • • • • • • • • • • • • • • • • • • • • • • 1 500 Figure 8.17 A two-level page-table scheme.
  • 199. 8.6 Structure of the Page Table 379 logical address outer page table p1 p2 p1 page of page table p2 d d Figure 8.18 Address translation for a two-level 32-bit paging architecture. into a 10-bit page number and a 10-bit page offset. Thus, a logical address is as follows: p1 p2 d page number page offset 10 10 12 where p1 is an index into the outer page table and p2 is the displacement within the page of the inner page table. The address-translation method for this architecture is shown in Figure 8.18. Because address translation works from the outer page table inward, this scheme is also known as a forward-mapped page table. Consider the memory management of one of the classic systems, the VAX minicomputer from Digital Equipment Corporation (DEC). The VAX was the most popular minicomputer of its time and was sold from 1977 through 2000. The VAX architecture supported a variation of two-level paging. The VAX is a 32- bit machine with a page size of 512 bytes. The logical address space of a process is divided into four equal sections, each of which consists of 230 bytes. Each section represents a different part of the logical address space of a process. The first 2 high-order bits of the logical address designate the appropriate section. The next 21 bits represent the logical page number of that section, and the final 9 bits represent an offset in the desired page. By partitioning the page table in this manner, the operating system can leave partitions unused until a process needs them. Entire sections of virtual address space are frequently unused, and multilevel page tables have no entries for these spaces, greatly decreasing the amount of memory needed to store virtual memory data structures. An address on the VAX architecture is as follows: s p d section page offset 2 21 9 where s designates the section number, p is an index into the page table, and d is the displacement within the page. Even when this scheme is used, the size of a one-level page table for a VAX process using one section is 221 bits ∗ 4
  • 200. 380 Chapter 8 Main Memory bytes per entry = 8 MB. To further reduce main-memory use, the VAX pages the user-process page tables. For a system with a 64-bit logical address space, a two-level paging scheme is no longer appropriate. To illustrate this point, let’s suppose that the page size in such a system is 4 KB (212 ). In this case, the page table consists of up to 252 entries. If we use a two-level paging scheme, then the inner page tables can conveniently be one page long, or contain 210 4-byte entries. The addresses look like this: p1 p2 d outer page inner page offset 42 10 12 The outer page table consists of 242 entries, or 244 bytes. The obvious way to avoid such a large table is to divide the outer page table into smaller pieces. (This approach is also used on some 32-bit processors for added flexibility and efficiency.) We can divide the outer page table in various ways. For example, we can page the outer page table, giving us a three-level paging scheme. Suppose that the outer page table is made up of standard-size pages (210 entries, or 212 bytes). In this case, a 64-bit address space is still daunting: p1 p2 p3 2nd outer page outer page inner page 32 10 10 d offset 12 The outer page table is still 234 bytes (16 GB) in size. The next step would be a four-level paging scheme, where the second-level outer page table itself is also paged, and so forth. The 64-bit UltraSPARC would require seven levels of paging—a prohibitive number of memory accesses— to translate each logical address. You can see from this example why, for 64-bit architectures, hierarchical page tables are generally considered inappropriate. 8.6.2 Hashed Page Tables A common approach for handling address spaces larger than 32 bits is to use a hashed page table, with the hash value being the virtual page number. Each entry in the hash table contains a linked list of elements that hash to the same location (to handle collisions). Each element consists of three fields: (1) the virtual page number, (2) the value of the mapped page frame, and (3) a pointer to the next element in the linked list. The algorithm works as follows: The virtual page number in the virtual address is hashed into the hash table. The virtual page number is compared with field 1 in the first element in the linked list. If there is a match, the corresponding page frame (field 2) is used to form the desired physical address. If there is no match, subsequent entries in the linked list are searched for a matching virtual page number. This scheme is shown in Figure 8.19. A variation of this scheme that is useful for 64-bit address spaces has been proposed. This variation uses clustered page tables, which are similar to
  • 201. 8.6 Structure of the Page Table 381 hash table q s logical address physical address physical memory p d r d p r hash function • • • Figure 8.19 Hashed page table. hashed page tables except that each entry in the hash table refers to several pages (such as 16) rather than a single page. Therefore, a single page-table entry can store the mappings for multiple physical-page frames. Clustered page tables are particularly useful for sparse address spaces, where memory references are noncontiguous and scattered throughout the address space. 8.6.3 Inverted Page Tables Usually, each process has an associated page table. The page table has one entry for each page that the process is using (or one slot for each virtual address, regardless of the latter’s validity). This table representation is a natural one, since processes reference pages through the pages’ virtual addresses. The operating system must then translate this reference into a physical memory address. Since the table is sorted by virtual address, the operating system is able to calculate where in the table the associated physical address entry is located and to use that value directly. One of the drawbacks of this method is that each page table may consist of millions of entries. These tables may consume large amounts of physical memory just to keep track of how other physical memory is being used. To solve this problem, we can use an inverted page table. An inverted page table has one entry for each real page (or frame) of memory. Each entry consists of the virtual address of the page stored in that real memory location, with information about the process that owns the page. Thus, only one page table is in the system, and it has only one entry for each page of physical memory. Figure 8.20 shows the operation of an inverted page table. Compare it with Figure 8.10, which depicts a standard page table in operation. Inverted page tables often require that an address-space identifier (Section 8.5.2) be stored in each entry of the page table, since the table usually contains several different address spaces mapping physical memory. Storing the address-space identifier ensures that a logical page for a particular process is mapped to the corresponding physical page frame. Examples of systems using inverted page tables include the 64-bit UltraSPARC and PowerPC.
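A hashed page table of the kind described in Section 8.6.2 can be sketched in C as an array of chained buckets; the table size, hash function, and sample mapping below are arbitrary choices for illustration, not a description of any particular system.

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

#define TABLE_SIZE 1024u   /* illustrative number of hash buckets */

/* One element of the chained list: virtual page number, mapped frame,
 * and a pointer to the next element that hashed to the same bucket. */
struct hpt_entry {
    uint64_t vpn;
    uint64_t frame;
    struct hpt_entry *next;
};

static struct hpt_entry *hash_table[TABLE_SIZE];

static size_t hash_vpn(uint64_t vpn)
{
    return (size_t)((vpn * 2654435761u) % TABLE_SIZE);  /* any simple hash will do */
}

/* Returns 1 and stores the frame number on a hit, 0 on a miss
 * (a miss would normally fall back to a slower lookup or a fault). */
int hpt_lookup(uint64_t vpn, uint64_t *frame_out)
{
    for (struct hpt_entry *e = hash_table[hash_vpn(vpn)]; e != NULL; e = e->next) {
        if (e->vpn == vpn) {          /* compare field 1 against the VPN        */
            *frame_out = e->frame;    /* field 2 is used for the physical address */
            return 1;
        }
    }
    return 0;
}

int main(void)
{
    static struct hpt_entry e = { 0x12345, 77, NULL };   /* one sample mapping */
    hash_table[hash_vpn(e.vpn)] = &e;

    uint64_t frame;
    if (hpt_lookup(0x12345, &frame))
        printf("vpn 0x12345 -> frame %llu\n", (unsigned long long)frame);
    return 0;
}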
  • 202. 382 Chapter 8 Main Memory page table CPU logical address physical address physical memory i pid p pid search p d i d Figure 8.20 Inverted page table. To illustrate this method, we describe a simplified version of the inverted page table used in the IBM RT. IBM was the first major company to use inverted page tables, starting with the IBM System 38 and continuing through the RS/6000 and the current IBM Power CPUs. For the IBM RT, each virtual address in the system consists of a triple: process-id, page-number, offset. Each inverted page-table entry is a pair process-id, page-number where the process-id assumes the role of the address-space identifier. When a memory reference occurs, part of the virtual address, consisting of process-id, page- number, is presented to the memory subsystem. The inverted page table is then searched for a match. If a match is found—say, at entry i—then the physical address i, offset is generated. If no match is found, then an illegal address access has been attempted. Although this scheme decreases the amount of memory needed to store each page table, it increases the amount of time needed to search the table when a page reference occurs. Because the inverted page table is sorted by physical address, but lookups occur on virtual addresses, the whole table might need to be searched before a match is found. This search would take far too long. To alleviate this problem, we use a hash table, as described in Section 8.6.2, to limit the search to one—or at most a few—page-table entries. Of course, each access to the hash table adds a memory reference to the procedure, so one virtual memory reference requires at least two real memory reads—one for the hash-table entry and one for the page table. (Recall that the TLB is searched first, before the hash table is consulted, offering some performance improvement.) Systems that use inverted page tables have difficulty implementing shared memory. Shared memory is usually implemented as multiple virtual addresses (one for each process sharing the memory) that are mapped to one physical address. This standard method cannot be used with inverted page tables; because there is only one virtual page entry for every physical page, one
  • 203. 8.7 Example: Intel 32 and 64-bit Architectures 383 physical page cannot have two (or more) shared virtual addresses. A simple technique for addressing this issue is to allow the page table to contain only one mapping of a virtual address to the shared physical address. This means that references to virtual addresses that are not mapped result in page faults. 8.6.4 Oracle SPARC Solaris Consider as a final example a modern 64-bit CPU and operating system that are tightly integrated to provide low-overhead virtual memory. Solaris running on the SPARC CPU is a fully 64-bit operating system and as such has to solve the problem of virtual memory without using up all of its physical memory by keeping multiple levels of page tables. Its approach is a bit complex but solves the problem efficiently using hashed page tables. There are two hash tables—one for the kernel and one for all user processes. Each maps memory addresses from virtual to physical memory. Each hash-table entry represents a contiguous area of mapped virtual memory, which is more efficient than having a separate hash-table entry for each page. Each entry has a base address and a span indicating the number of pages the entry represents. Virtual-to-physical translation would take too long if each address required searching through a hash table, so the CPU implements a TLB that holds translation table entries (TTEs) for fast hardware lookups. A cache of these TTEs reside in a translation storage buffer (TSB), which includes an entry per recently accessed page. When a virtual address reference occurs, the hardware searches the TLB for a translation. If none is found, the hardware walks through the in-memory TSB looking for the TTE that corresponds to the virtual address that caused the lookup. This TLB walk functionality is found on many modern CPUs. If a match is found in the TSB, the CPU copies the TSB entry into the TLB, and the memory translation completes. If no match is found in the TSB, the kernel is interrupted to search the hash table. The kernel then creates a TTE from the appropriate hash table and stores it in the TSB for automatic loading into the TLB by the CPU memory-management unit. Finally, the interrupt handler returns control to the MMU, which completes the address translation and retrieves the requested byte or word from main memory. 8.7 Example: Intel 32 and 64-bit Architectures The architecture of Intel chips has dominated the personal computer landscape for several years. The 16-bit Intel 8086 appeared in the late 1970s and was soon followed by another 16-bit chip—the Intel 8088—which was notable for being the chip used in the original IBM PC. Both the 8086 chip and the 8088 chip were based on a segmented architecture. Intel later produced a series of 32-bit chips —the IA-32—which included the family of 32-bit Pentium processors. The IA-32 architecture supported both paging and segmentation. More recently, Intel has produced a series of 64-bit chips based on the x86-64 architecture. Currently, all the most popular PC operating systems run on Intel chips, including Windows, Mac OS X, and Linux (although Linux, of course, runs on several other architectures as well). Notably, however, Intel’s dominance has not spread to mobile systems, where the ARM architecture currently enjoys considerable success (see Section 8.8).
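Returning briefly to the inverted page table of Section 8.6.3, the basic search over (process-id, page-number) pairs might look like the sketch below; the linear scan it performs is exactly why real systems pair the structure with a hash table. The table size and field names are hypothetical.

#include <stdint.h>
#include <stdio.h>

#define NUM_FRAMES 4096   /* one inverted-table entry per physical frame */

struct ipt_entry {
    int      pid;    /* address-space identifier of the owning process */
    uint64_t vpn;    /* virtual page number stored in this frame       */
};

static struct ipt_entry ipt[NUM_FRAMES];

/* Search the whole table for <pid, vpn>; the matching index *is* the
 * frame number. Returns -1 to signal an illegal address access. */
long ipt_search(int pid, uint64_t vpn)
{
    for (long i = 0; i < NUM_FRAMES; i++)
        if (ipt[i].pid == pid && ipt[i].vpn == vpn)
            return i;
    return -1;
}

int main(void)
{
    ipt[7].pid = 42;                 /* pretend frame 7 holds page 0x1234 */
    ipt[7].vpn = 0x1234;
    printf("frame: %ld\n", ipt_search(42, 0x1234));   /* prints 7 */
    return 0;
}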
  • 204. 9 C H A P T E R Virtual Memory In Chapter 8, we discussed various memory-management strategies used in computer systems. All these strategies have the same goal: to keep many processes in memory simultaneously to allow multiprogramming. However, they tend to require that an entire process be in memory before it can execute. Virtual memory is a technique that allows the execution of processes that are not completely in memory. One major advantage of this scheme is that programs can be larger than physical memory. Further, virtual memory abstracts main memory into an extremely large, uniform array of storage, separating logical memory as viewed by the user from physical memory. This technique frees programmers from the concerns of memory-storage limitations. Virtual memory also allows processes to share files easily and to implement shared memory. In addition, it provides an efficient mechanism for process creation. Virtual memory is not easy to implement, however, and may substantially decrease performance if it is used carelessly. In this chapter, we discuss virtual memory in the form of demand paging and examine its complexity and cost. CHAPTER OBJECTIVES • To describe the benefits of a virtual memory system. • To explain the concepts of demand paging, page-replacement algorithms, and allocation of page frames. • To discuss the principles of the working-set model. • To examine the relationship between shared memory and memory-mapped files. • To explore how kernel memory is managed. 9.1 Background The memory-management algorithms outlined in Chapter 8 are necessary because of one basic requirement: The instructions being executed must be 397
  • 205. 398 Chapter 9 Virtual Memory in physical memory. The first approach to meeting this requirement is to place the entire logical address space in physical memory. Dynamic loading can help to ease this restriction, but it generally requires special precautions and extra work by the programmer. The requirement that instructions must be in physical memory to be executed seems both necessary and reasonable; but it is also unfortunate, since it limits the size of a program to the size of physical memory. In fact, an examination of real programs shows us that, in many cases, the entire program is not needed. For instance, consider the following: • Programs often have code to handle unusual error conditions. Since these errors seldom, if ever, occur in practice, this code is almost never executed. • Arrays, lists, and tables are often allocated more memory than they actually need. An array may be declared 100 by 100 elements, even though it is seldom larger than 10 by 10 elements. An assembler symbol table may have room for 3,000 symbols, although the average program has less than 200 symbols. • Certain options and features of a program may be used rarely. For instance, the routines on U.S. government computers that balance the budget have not been used in many years. Even in those cases where the entire program is needed, it may not all be needed at the same time. The ability to execute a program that is only partially in memory would confer many benefits: • A program would no longer be constrained by the amount of physical memory that is available. Users would be able to write programs for an extremely large virtual address space, simplifying the programming task. • Because each user program could take less physical memory, more programs could be run at the same time, with a corresponding increase in CPU utilization and throughput but with no increase in response time or turnaround time. • Less I/O would be needed to load or swap user programs into memory, so each user program would run faster. Thus, running a program that is not entirely in memory would benefit both the system and the user. Virtual memory involves the separation of logical memory as perceived by users from physical memory. This separation allows an extremely large virtual memory to be provided for programmers when only a smaller physical memory is available (Figure 9.1). Virtual memory makes the task of program- ming much easier, because the programmer no longer needs to worry about the amount of physical memory available; she can concentrate instead on the problem to be programmed. The virtual address space of a process refers to the logical (or virtual) view of how a process is stored in memory. Typically, this view is that a process begins at a certain logical address—say, address 0—and exists in contiguous memory, as shown in Figure 9.2. Recall from Chapter 8, though, that in fact
  • 206. 9.1 Background 399 virtual memory memory map physical memory • • • page 0 page 1 page 2 page v Figure 9.1 Diagram showing virtual memory that is larger than physical memory. physical memory may be organized in page frames and that the physical page frames assigned to a process may not be contiguous. It is up to the memory- management unit (MMU) to map logical pages to physical page frames in memory. Note in Figure 9.2 that we allow the heap to grow upward in memory as it is used for dynamic memory allocation. Similarly, we allow for the stack to code 0 Max data heap stack Figure 9.2 Virtual address space.
  • 207. 400 Chapter 9 Virtual Memory shared library stack shared pages code data heap code data heap shared library stack Figure 9.3 Shared library using virtual memory. grow downward in memory through successive function calls. The large blank space (or hole) between the heap and the stack is part of the virtual address space but will require actual physical pages only if the heap or stack grows. Virtual address spaces that include holes are known as sparse address spaces. Using a sparse address space is beneficial because the holes can be filled as the stack or heap segments grow or if we wish to dynamically link libraries (or possibly other shared objects) during program execution. In addition to separating logical memory from physical memory, virtual memory allows files and memory to be shared by two or more processes through page sharing (Section 8.5.4). This leads to the following benefits: • System libraries can be shared by several processes through mapping of the shared object into a virtual address space. Although each process considers the libraries to be part of its virtual address space, the actual pages where the libraries reside in physical memory are shared by all the processes (Figure 9.3). Typically, a library is mapped read-only into the space of each process that is linked with it. • Similarly, processes can share memory. Recall from Chapter 3 that two or more processes can communicate through the use of shared memory. Virtual memory allows one process to create a region of memory that it can share with another process. Processes sharing this region consider it part of their virtual address space, yet the actual physical pages of memory are shared, much as is illustrated in Figure 9.3. • Pages can be shared during process creation with the fork() system call, thus speeding up process creation. We further explore these—and other—benefits of virtual memory later in this chapter. First, though, we discuss implementing virtual memory through demand paging.
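One way to observe page sharing between processes on a POSIX system is to combine an anonymous shared mapping with fork(), as in the sketch below. mmap(), fork(), and wait() are standard calls, though MAP_ANONYMOUS is a common extension rather than strict POSIX; the value written is arbitrary.

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    /* One page whose physical frame remains shared across fork(). */
    long page = sysconf(_SC_PAGESIZE);
    int *shared = mmap(NULL, page, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (shared == MAP_FAILED) { perror("mmap"); exit(1); }

    pid_t pid = fork();
    if (pid == 0) {              /* child writes into the shared region */
        shared[0] = 42;
        _exit(0);
    }
    wait(NULL);                  /* parent sees the child's update      */
    printf("value written by child: %d\n", shared[0]);
    munmap(shared, page);
    return 0;
}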
  • 208. 9.2 Demand Paging 401 9.2 Demand Paging Consider how an executable program might be loaded from disk into memory. One option is to load the entire program in physical memory at program execution time. However, a problem with this approach is that we may not initially need the entire program in memory. Suppose a program starts with a list of available options from which the user is to select. Loading the entire program into memory results in loading the executable code for all options, regardless of whether or not an option is ultimately selected by the user. An alternative strategy is to load pages only as they are needed. This technique is known as demand paging and is commonly used in virtual memory systems. With demand-paged virtual memory, pages are loaded only when they are demanded during program execution. Pages that are never accessed are thus never loaded into physical memory. A demand-paging system is similar to a paging system with swapping (Figure 9.4) where processes reside in secondary memory (usually a disk). When we want to execute a process, we swap it into memory. Rather than swapping the entire process into memory, though, we use a lazy swapper. A lazy swapper never swaps a page into memory unless that page will be needed. In the context of a demand-paging system, use of the term “swapper” is technically incorrect. A swapper manipulates entire processes, whereas a pager is concerned with the individual pages of a process. We thus use “pager,” rather than “swapper,” in connection with demand paging. program A swap out 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 swap in program B main memory Figure 9.4 Transfer of a paged memory to contiguous disk space.
  • 209. 402 Chapter 9 Virtual Memory 9.2.1 Basic Concepts When a process is to be swapped in, the pager guesses which pages will be used before the process is swapped out again. Instead of swapping in a whole process, the pager brings only those pages into memory. Thus, it avoids reading into memory pages that will not be used anyway, decreasing the swap time and the amount of physical memory needed. With this scheme, we need some form of hardware support to distinguish between the pages that are in memory and the pages that are on the disk. The valid–invalid bit scheme described in Section 8.5.3 can be used for this purpose. This time, however, when this bit is set to “valid,” the associated page is both legal and in memory. If the bit is set to “invalid,” the page either is not valid (that is, not in the logical address space of the process) or is valid but is currently on the disk. The page-table entry for a page that is brought into memory is set as usual, but the page-table entry for a page that is not currently in memory is either simply marked invalid or contains the address of the page on disk. This situation is depicted in Figure 9.5. Notice that marking a page invalid will have no effect if the process never attempts to access that page. Hence, if we guess right and page in all pages that are actually needed and only those pages, the process will run exactly as though we had brought in all pages. While the process executes and accesses pages that are memory resident, execution proceeds normally. B D D E F H logical memory valid–invalid bit frame page table 1 0 4 6 2 3 4 5 9 6 7 1 0 2 3 4 5 6 7 i v v i i v i i physical memory A A B C C F G H F 1 0 2 3 4 5 6 7 9 8 10 11 12 13 14 15 A C E G Figure 9.5 Page table when some pages are not in main memory.
  • 210. 9.2 Demand Paging 403 load M reference trap i page is on backing store operating system restart instruction reset page table page table physical memory bring in missing page free frame 1 2 3 6 5 4 Figure 9.6 Steps in handling a page fault. But what happens if the process tries to access a page that was not brought into memory? Access to a page marked invalid causes a page fault. The paging hardware, in translating the address through the page table, will notice that the invalid bit is set, causing a trap to the operating system. This trap is the result of the operating system’s failure to bring the desired page into memory. The procedure for handling this page fault is straightforward (Figure 9.6): 1. We check an internal table (usually kept with the process control block) for this process to determine whether the reference was a valid or an invalid memory access. 2. If the reference was invalid, we terminate the process. If it was valid but we have not yet brought in that page, we now page it in. 3. We find a free frame (by taking one from the free-frame list, for example). 4. We schedule a disk operation to read the desired page into the newly allocated frame. 5. When the disk read is complete, we modify the internal table kept with the process and the page table to indicate that the page is now in memory. 6. We restart the instruction that was interrupted by the trap. The process can now access the page as though it had always been in memory. In the extreme case, we can start executing a process with no pages in memory. When the operating system sets the instruction pointer to the first
  • 211. 404 Chapter 9 Virtual Memory instruction of the process, which is on a non-memory-resident page, the process immediately faults for the page. After this page is brought into memory, the process continues to execute, faulting as necessary until every page that it needs is in memory. At that point, it can execute with no more faults. This scheme is pure demand paging: never bring a page into memory until it is required. Theoretically, some programs could access several new pages of memory with each instruction execution (one page for the instruction and many for data), possibly causing multiple page faults per instruction. This situation would result in unacceptable system performance. Fortunately, analysis of running processes shows that this behavior is exceedingly unlikely. Programs tend to have locality of reference, described in Section 9.6.1, which results in reasonable performance from demand paging. The hardware to support demand paging is the same as the hardware for paging and swapping: • Page table. This table has the ability to mark an entry invalid through a valid–invalid bit or a special value of protection bits. • Secondary memory. This memory holds those pages that are not present in main memory. The secondary memory is usually a high-speed disk. It is known as the swap device, and the section of disk used for this purpose is known as swap space. Swap-space allocation is discussed in Chapter 10. A crucial requirement for demand paging is the ability to restart any instruction after a page fault. Because we save the state (registers, condition code, instruction counter) of the interrupted process when the page fault occurs, we must be able to restart the process in exactly the same place and state, except that the desired page is now in memory and is accessible. In most cases, this requirement is easy to meet. A page fault may occur at any memory reference. If the page fault occurs on the instruction fetch, we can restart by fetching the instruction again. If a page fault occurs while we are fetching an operand, we must fetch and decode the instruction again and then fetch the operand. As a worst-case example, consider a three-address instruction such as ADD the content of A to B, placing the result in C. These are the steps to execute this instruction: 1. Fetch and decode the instruction (ADD). 2. Fetch A. 3. Fetch B. 4. Add A and B. 5. Store the sum in C. If we fault when we try to store in C (because C is in a page not currently in memory), we will have to get the desired page, bring it in, correct the page table, and restart the instruction. The restart will require fetching the instruction again, decoding it again, fetching the two operands again, and then adding again. However, there is not much repeated work (less than one
  • 212. 9.2 Demand Paging 405 complete instruction), and the repetition is necessary only when a page fault occurs. The major difficulty arises when one instruction may modify several different locations. For example, consider the IBM System 360/370 MVC (move character) instruction, which can move up to 256 bytes from one location to another (possibly overlapping) location. If either block (source or destination) straddles a page boundary, a page fault might occur after the move is partially done. In addition, if the source and destination blocks overlap, the source block may have been modified, in which case we cannot simply restart the instruction. This problem can be solved in two different ways. In one solution, the microcode computes and attempts to access both ends of both blocks. If a page fault is going to occur, it will happen at this step, before anything is modified. The move can then take place; we know that no page fault can occur, since all the relevant pages are in memory. The other solution uses temporary registers to hold the values of overwritten locations. If there is a page fault, all the old values are written back into memory before the trap occurs. This action restores memory to its state before the instruction was started, so that the instruction can be repeated. This is by no means the only architectural problem resulting from adding paging to an existing architecture to allow demand paging, but it illustrates some of the difficulties involved. Paging is added between the CPU and the memory in a computer system. It should be entirely transparent to the user process. Thus, people often assume that paging can be added to any system. Although this assumption is true for a non-demand-paging environment, where a page fault represents a fatal error, it is not true where a page fault means only that an additional page must be brought into memory and the process restarted. 9.2.2 Performance of Demand Paging Demand paging can significantly affect the performance of a computer system. To see why, let’s compute the effective access time for a demand-paged memory. For most computer systems, the memory-access time, denoted ma, ranges from 10 to 200 nanoseconds. As long as we have no page faults, the effective access time is equal to the memory access time. If, however, a page fault occurs, we must first read the relevant page from disk and then access the desired word. Let p be the probability of a page fault (0 ≤ p ≤ 1). We would expect p to be close to zero—that is, we would expect to have only a few page faults. The effective access time is then effective access time = (1 − p) × ma + p × page fault time. To compute the effective access time, we must know how much time is needed to service a page fault. A page fault causes the following sequence to occur: 1. Trap to the operating system. 2. Save the user registers and process state.
  • 213. 406 Chapter 9 Virtual Memory 3. Determine that the interrupt was a page fault. 4. Check that the page reference was legal and determine the location of the page on the disk. 5. Issue a read from the disk to a free frame: a. Wait in a queue for this device until the read request is serviced. b. Wait for the device seek and/or latency time. c. Begin the transfer of the page to a free frame. 6. While waiting, allocate the CPU to some other user (CPU scheduling, optional). 7. Receive an interrupt from the disk I/O subsystem (I/O completed). 8. Save the registers and process state for the other user (if step 6 is executed). 9. Determine that the interrupt was from the disk. 10. Correct the page table and other tables to show that the desired page is now in memory. 11. Wait for the CPU to be allocated to this process again. 12. Restore the user registers, process state, and new page table, and then resume the interrupted instruction. Not all of these steps are necessary in every case. For example, we are assuming that, in step 6, the CPU is allocated to another process while the I/O occurs. This arrangement allows multiprogramming to maintain CPU utilization but requires additional time to resume the page-fault service routine when the I/O transfer is complete. In any case, we are faced with three major components of the page-fault service time: 1. Service the page-fault interrupt. 2. Read in the page. 3. Restart the process. The first and third tasks can be reduced, with careful coding, to several hundred instructions. These tasks may take from 1 to 100 microseconds each. The page-switch time, however, will probably be close to 8 milliseconds. (A typical hard disk has an average latency of 3 milliseconds, a seek of 5 milliseconds, and a transfer time of 0.05 milliseconds. Thus, the total paging time is about 8 milliseconds, including hardware and software time.) Remember also that we are looking at only the device-service time. If a queue of processes is waiting for the device, we have to add device-queueing time as we wait for the paging device to be free to service our request, increasing even more the time to swap.
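The control flow listed above can be mimicked in user space with a toy simulation: a "disk" array holds every page, a small "memory" array holds frames, and a page table records residency. This is only an illustration of the steps, not how a kernel is written; all names, sizes, and the replacement-free frame allocator are invented for the sketch.

#include <stdio.h>
#include <string.h>

#define NUM_PAGES  8
#define NUM_FRAMES 4
#define PAGE_SIZE  16

static char disk[NUM_PAGES][PAGE_SIZE];    /* backing store               */
static char memory[NUM_FRAMES][PAGE_SIZE]; /* physical frames             */
static int  frame_of[NUM_PAGES];           /* -1 means "not resident"     */
static int  next_free_frame = 0;           /* stand-in for a free-frame list */

static int handle_page_fault(int page)
{
    if (page < 0 || page >= NUM_PAGES)     /* invalid reference: would terminate */
        return -1;
    int frame = next_free_frame++;                    /* find a free frame       */
    memcpy(memory[frame], disk[page], PAGE_SIZE);     /* "disk read" into frame  */
    frame_of[page] = frame;                           /* correct the page table  */
    return frame;                                     /* caller retries access   */
}

static char read_byte(int page, int offset)
{
    if (frame_of[page] < 0) {              /* valid-invalid bit says "invalid" */
        printf("page fault on page %d\n", page);
        handle_page_fault(page);
    }
    return memory[frame_of[page]][offset];
}

int main(void)
{
    memset(frame_of, -1, sizeof frame_of);
    strcpy(disk[3], "hello");
    printf("byte: %c\n", read_byte(3, 0));  /* faults, then returns 'h' */
    printf("byte: %c\n", read_byte(3, 1));  /* already resident         */
    return 0;
}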
  • 214. 9.2 Demand Paging 407 With an average page-fault service time of 8 milliseconds and a memory-access time of 200 nanoseconds, the effective access time in nanoseconds is effective access time = (1 − p) × (200) + p × (8 milliseconds) = (1 − p) × 200 + p × 8,000,000 = 200 + 7,999,800 × p. We see, then, that the effective access time is directly proportional to the page-fault rate. If one access out of 1,000 causes a page fault, the effective access time is 8.2 microseconds. The computer will be slowed down by a factor of 40 because of demand paging! If we want performance degradation to be less than 10 percent, we need to keep the probability of page faults at the following level: 220 > 200 + 7,999,800 × p, 20 > 7,999,800 × p, p < 0.0000025. That is, to keep the slowdown due to paging at a reasonable level, we can allow fewer than one memory access out of 399,990 to page-fault. In sum, it is important to keep the page-fault rate low in a demand-paging system. Otherwise, the effective access time increases, slowing process execution dramatically. An additional aspect of demand paging is the handling and overall use of swap space. Disk I/O to swap space is generally faster than that to the file system. It is faster because swap space is allocated in much larger blocks, and file lookups and indirect allocation methods are not used (Chapter 10). The system can therefore gain better paging throughput by copying an entire file image into the swap space at process startup and then performing demand paging from the swap space. Another option is to demand pages from the file system initially but to write the pages to swap space as they are replaced. This approach will ensure that only needed pages are read from the file system but that all subsequent paging is done from swap space. Some systems attempt to limit the amount of swap space used through demand paging of binary files. Demand pages for such files are brought directly from the file system. However, when page replacement is called for, these frames can simply be overwritten (because they are never modified), and the pages can be read in from the file system again if needed. Using this approach, the file system itself serves as the backing store. However, swap space must still be used for pages not associated with a file (known as anonymous memory); these pages include the stack and heap for a process. This method appears to be a good compromise and is used in several systems, including Solaris and BSD UNIX. Mobile operating systems typically do not support swapping. Instead, these systems demand-page from the file system and reclaim read-only pages (such as code) from applications if memory becomes constrained. Such data can be demand-paged from the file system if it is later needed. Under iOS, anonymous memory pages are never reclaimed from an application unless the application is terminated or explicitly releases the memory.
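The effective-access-time formula and the 10-percent bound derived above can be checked with a few lines of C. The sketch below uses the chapter's figures (ma = 200 ns, an 8-ms fault-service time); the variable names are ours.

/* Effective access time for demand paging, using the chapter's numbers,
 * and the largest fault probability p that keeps the slowdown under 10%. */
#include <stdio.h>

int main(void) {
    double ma = 200.0;              /* memory-access time, ns */
    double fault_time = 8000000.0;  /* page-fault service time, ns (8 ms) */

    /* EAT = (1 - p) * ma + p * fault_time */
    double p = 1.0 / 1000.0;        /* one fault per 1,000 accesses */
    double eat = (1.0 - p) * ma + p * fault_time;
    printf("EAT at p = %g: %.1f ns (slowdown factor %.1f)\n",
           p, eat, eat / ma);

    /* Degradation below 10 percent requires ma + p*(fault_time - ma) < 1.1*ma */
    double p_max = (0.10 * ma) / (fault_time - ma);
    printf("p must stay below %.10f (about 1 fault per %.0f accesses)\n",
           p_max, 1.0 / p_max);
    return 0;
}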
  • 215. 408 Chapter 9 Virtual Memory 9.3 Copy-on-Write In Section 9.2, we illustrated how a process can start quickly by demand-paging in the page containing the first instruction. However, process creation using the fork() system call may initially bypass the need for demand paging by using a technique similar to page sharing (covered in Section 8.5.4). This technique provides rapid process creation and minimizes the number of new pages that must be allocated to the newly created process. Recall that the fork() system call creates a child process that is a duplicate of its parent. Traditionally, fork() worked by creating a copy of the parent’s address space for the child, duplicating the pages belonging to the parent. However, considering that many child processes invoke the exec() system call immediately after creation, the copying of the parent’s address space may be unnecessary. Instead, we can use a technique known as copy-on-write, which works by allowing the parent and child processes initially to share the same pages. These shared pages are marked as copy-on-write pages, meaning that if either process writes to a shared page, a copy of the shared page is created. Copy-on-write is illustrated in Figures 9.7 and 9.8, which show the contents of the physical memory before and after process 1 modifies page C. For example, assume that the child process attempts to modify a page containing portions of the stack, with the pages set to be copy-on-write. The operating system will create a copy of this page, mapping it to the address space of the child process. The child process will then modify its copied page and not the page belonging to the parent process. Obviously, when the copy-on-write technique is used, only the pages that are modified by either process are copied; all unmodified pages can be shared by the parent and child processes. Note, too, that only pages that can be modified need be marked as copy-on-write. Pages that cannot be modified (pages containing executable code) can be shared by the parent and child. Copy-on-write is a common technique used by several operating systems, including Windows XP, Linux, and Solaris. When it is determined that a page is going to be duplicated using copy- on-write, it is important to note the location from which the free page will be allocated. Many operating systems provide a pool of free pages for such requests. These free pages are typically allocated when the stack or heap for a process must expand or when there are copy-on-write pages to be managed. process1 physical memory page A page B page C process2 Figure 9.7 Before process 1 modifies page C.
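The copy-on-write behavior of fork() can be observed, at least in its effect, from user code. The following C program is a minimal illustration: after fork(), the child's write to a shared data page forces the kernel to give the child a private copy, so the parent's value is unchanged. The program shows only the semantics; the page copying itself happens inside the kernel.

/* After fork(), parent and child initially share physical pages
 * copy-on-write; a write by either one triggers a private copy.
 * The visible effect is that the child's write does not change
 * the parent's value. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int value = 42;   /* lives in a data page shared copy-on-write after fork() */

int main(void) {
    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        exit(1);
    }
    if (pid == 0) {                 /* child */
        value = 100;                /* write faults; kernel copies the page */
        printf("child:  value = %d\n", value);
        exit(0);
    }
    wait(NULL);                     /* parent waits for the child */
    printf("parent: value = %d\n", value);   /* still 42 */
    return 0;
}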
  • 216. 9.4 Page Replacement 409 process1 physical memory page A page B page C Copy of page C process2 Figure 9.8 After process 1 modifies page C. Operating systems typically allocate these pages using a technique known as zero-fill-on-demand. Zero-fill-on-demand pages have been zeroed-out before being allocated, thus erasing the previous contents. Several versions of UNIX (including Solaris and Linux) provide a variation of the fork() system call—vfork() (for virtual memory fork)—that operates differently from fork() with copy-on-write. With vfork(), the parent process is suspended, and the child process uses the address space of the parent. Because vfork() does not use copy-on-write, if the child process changes any pages of the parent’s address space, the altered pages will be visible to the parent once it resumes. Therefore, vfork() must be used with caution to ensure that the child process does not modify the address space of the parent. vfork() is intended to be used when the child process calls exec() immediately after creation. Because no copying of pages takes place, vfork() is an extremely efficient method of process creation and is sometimes used to implement UNIX command-line shell interfaces. 9.4 Page Replacement In our earlier discussion of the page-fault rate, we assumed that each page faults at most once, when it is first referenced. This representation is not strictly accurate, however. If a process of ten pages actually uses only half of them, then demand paging saves the I/O necessary to load the five pages that are never used. We could also increase our degree of multiprogramming by running twice as many processes. Thus, if we had forty frames, we could run eight processes, rather than the four that could run if each required ten frames (five of which were never used). If we increase our degree of multiprogramming, we are over-allocating memory. If we run six processes, each of which is ten pages in size but actually uses only five pages, we have higher CPU utilization and throughput, with ten frames to spare. It is possible, however, that each of these processes, for a particular data set, may suddenly try to use all ten of its pages, resulting in a need for sixty frames when only forty are available. Further, consider that system memory is not used only for holding program pages. Buffers for I/O also consume a considerable amount of memory. This use
  • 217. 410 Chapter 9 Virtual Memory monitor load M physical memory 1 0 2 3 4 5 6 7 H load M J M logical memory for user 1 0 PC 1 2 3 B M valid–invalid bit frame page table for user 1 i A B D E logical memory for user 2 0 1 2 3 valid–invalid bit frame page table for user 2 i 4 3 5 v v v 7 2 v v 6 v D H J A E Figure 9.9 Need for page replacement. can increase the strain on memory-placement algorithms. Deciding how much memory to allocate to I/O and how much to program pages is a significant challenge. Some systems allocate a fixed percentage of memory for I/O buffers, whereas others allow both user processes and the I/O subsystem to compete for all system memory. Over-allocation of memory manifests itself as follows. While a user process is executing, a page fault occurs. The operating system determines where the desired page is residing on the disk but then finds that there are no free frames on the free-frame list; all memory is in use (Figure 9.9). The operating system has several options at this point. It could terminate the user process. However, demand paging is the operating system’s attempt to improve the computer system’s utilization and throughput. Users should not be aware that their processes are running on a paged system—paging should be logically transparent to the user. So this option is not the best choice. The operating system could instead swap out a process, freeing all its frames and reducing the level of multiprogramming. This option is a good one in certain circumstances, and we consider it further in Section 9.6. Here, we discuss the most common solution: page replacement. 9.4.1 Basic Page Replacement Page replacement takes the following approach. If no frame is free, we find one that is not currently being used and free it. We can free a frame by writing its contents to swap space and changing the page table (and all other tables) to indicate that the page is no longer in memory (Figure 9.10). We can now use the freed frame to hold the page for which the process faulted. We modify the page-fault service routine to include page replacement:
  • 218. 9.4 Page Replacement 411 valid–invalid bit frame f page table victim change to invalid page out victim page page in desired page reset page table for new page physical memory 2 4 1 3 f 0 i v Figure 9.10 Page replacement. 1. Find the location of the desired page on the disk. 2. Find a free frame: a. If there is a free frame, use it. b. If there is no free frame, use a page-replacement algorithm to select a victim frame. c. Write the victim frame to the disk; change the page and frame tables accordingly. 3. Read the desired page into the newly freed frame; change the page and frame tables. 4. Continue the user process from where the page fault occurred. Notice that, if no frames are free, two page transfers (one out and one in) are required. This situation effectively doubles the page-fault service time and increases the effective access time accordingly. We can reduce this overhead by using a modify bit (or dirty bit). When this scheme is used, each page or frame has a modify bit associated with it in the hardware. The modify bit for a page is set by the hardware whenever any byte in the page is written into, indicating that the page has been modified. When we select a page for replacement, we examine its modify bit. If the bit is set, we know that the page has been modified since it was read in from the disk. In this case, we must write the page to the disk. If the modify bit is not set, however, the page has not been modified since it was read into memory. In this case, we need not write the memory page to the disk: it is already there. This technique also applies to read-only pages (for example, pages of binary code).
  • 219. 412 Chapter 9 Virtual Memory Such pages cannot be modified; thus, they may be discarded when desired. This scheme can significantly reduce the time required to service a page fault, since it reduces I/O time by one-half if the page has not been modified. Page replacement is basic to demand paging. It completes the separation between logical memory and physical memory. With this mechanism, an enormous virtual memory can be provided for programmers on a smaller physical memory. With no demand paging, user addresses are mapped into physical addresses, and the two sets of addresses can be different. All the pages of a process still must be in physical memory, however. With demand paging, the size of the logical address space is no longer constrained by physical memory. If we have a user process of twenty pages, we can execute it in ten frames simply by using demand paging and using a replacement algorithm to find a free frame whenever necessary. If a page that has been modified is to be replaced, its contents are copied to the disk. A later reference to that page will cause a page fault. At that time, the page will be brought back into memory, perhaps replacing some other page in the process. We must solve two major problems to implement demand paging: we must develop a frame-allocation algorithm and a page-replacement algorithm. That is, if we have multiple processes in memory, we must decide how many frames to allocate to each process; and when page replacement is required, we must select the frames that are to be replaced. Designing appropriate algorithms to solve these problems is an important task, because disk I/O is so expensive. Even slight improvements in demand-paging methods yield large gains in system performance. There are many different page-replacement algorithms. Every operating system probably has its own replacement scheme. How do we select a particular replacement algorithm? In general, we want the one with the lowest page-fault rate. We evaluate an algorithm by running it on a particular string of memory references and computing the number of page faults. The string of memory references is called a reference string. We can generate reference strings artificially (by using a random-number generator, for example), or we can trace a given system and record the address of each memory reference. The latter choice produces a large number of data (on the order of 1 million addresses per second). To reduce the number of data, we use two facts. First, for a given page size (and the page size is generally fixed by the hardware or system), we need to consider only the page number, rather than the entire address. Second, if we have a reference to a page p, then any references to page p that immediately follow will never cause a page fault. Page p will be in memory after the first reference, so the immediately following references will not fault. For example, if we trace a particular process, we might record the following address sequence: 0100, 0432, 0101, 0612, 0102, 0103, 0104, 0101, 0611, 0102, 0103, 0104, 0101, 0610, 0102, 0103, 0104, 0101, 0609, 0102, 0105 At 100 bytes per page, this sequence is reduced to the following reference string: 1, 4, 1, 6, 1, 6, 1, 6, 1, 6, 1
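The reduction from an address trace to a reference string is easy to mechanize. The C sketch below applies the two rules just described—keep only the page number and drop immediately repeated pages—to the example trace, producing the reference string 1, 4, 1, 6, 1, 6, 1, 6, 1, 6, 1. The 100-byte page size is the one assumed in the example.

/* Reduce the example address trace to a reference string: keep only the
 * page number (address / page size) and drop consecutive repeats, since
 * back-to-back references to the same page cannot fault. */
#include <stdio.h>

int main(void) {
    int trace[] = { 100, 432, 101, 612, 102, 103, 104, 101, 611, 102, 103,
                    104, 101, 610, 102, 103, 104, 101, 609, 102, 105 };
    int n = sizeof(trace) / sizeof(trace[0]);
    int page_size = 100;
    int last = -1;

    for (int i = 0; i < n; i++) {
        int page = trace[i] / page_size;
        if (page != last) {          /* skip immediate repeats */
            printf("%d ", page);
            last = page;
        }
    }
    printf("\n");                    /* prints: 1 4 1 6 1 6 1 6 1 6 1 */
    return 0;
}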
  • 220. 9.4 Page Replacement 413 number of page faults 16 14 12 10 8 6 4 2 1 2 3 number of frames 4 5 6 Figure 9.11 Graph of page faults versus number of frames. To determine the number of page faults for a particular reference string and page-replacement algorithm, we also need to know the number of page frames available. Obviously, as the number of frames available increases, the number of page faults decreases. For the reference string considered previously, for example, if we had three or more frames, we would have only three faults— one fault for the first reference to each page. In contrast, with only one frame available, we would have a replacement with every reference, resulting in eleven faults. In general, we expect a curve such as that in Figure 9.11. As the number of frames increases, the number of page faults drops to some minimal level. Of course, adding physical memory increases the number of frames. We next illustrate several page-replacement algorithms. In doing so, we use the reference string 7, 0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2, 1, 2, 0, 1, 7, 0, 1 for a memory with three frames. 9.4.2 FIFO Page Replacement The simplest page-replacement algorithm is a first-in, first-out (FIFO) algorithm. A FIFO replacement algorithm associates with each page the time when that page was brought into memory. When a page must be replaced, the oldest page is chosen. Notice that it is not strictly necessary to record the time when a page is brought in. We can create a FIFO queue to hold all pages in memory. We replace the page at the head of the queue. When a page is brought into memory, we insert it at the tail of the queue. For our example reference string, our three frames are initially empty. The first three references (7, 0, 1) cause page faults and are brought into these empty frames. The next reference (2) replaces page 7, because page 7 was brought in first. Since 0 is the next reference and 0 is already in memory, we have no fault for this reference. The first reference to 3 results in replacement of page 0, since it is now first in line. Because of this replacement, the next reference, to 0, will
  • 221. 414 Chapter 9 Virtual Memory 7 7 0 7 0 1 page frames reference string 2 0 1 2 3 1 2 3 0 4 3 0 4 2 0 4 2 3 0 2 3 7 1 2 7 0 2 7 0 1 0 1 3 0 7 0 1 2 0 3 0 4 2 3 0 7 1 1 0 2 1 2 0 3 1 2 Figure 9.12 FIFO page-replacement algorithm. fault. Page 1 is then replaced by page 0. This process continues as shown in Figure 9.12. Every time a fault occurs, we show which pages are in our three frames. There are fifteen faults altogether. The FIFO page-replacement algorithm is easy to understand and program. However, its performance is not always good. On the one hand, the page replaced may be an initialization module that was used a long time ago and is no longer needed. On the other hand, it could contain a heavily used variable that was initialized early and is in constant use. Notice that, even if we select for replacement a page that is in active use, everything still works correctly. After we replace an active page with a new one, a fault occurs almost immediately to retrieve the active page. Some other page must be replaced to bring the active page back into memory. Thus, a bad replacement choice increases the page-fault rate and slows process execution. It does not, however, cause incorrect execution. To illustrate the problems that are possible with a FIFO page-replacement algorithm, consider the following reference string: 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5 Figure 9.13 shows the curve of page faults for this reference string versus the number of available frames. Notice that the number of faults for four frames (ten) is greater than the number of faults for three frames (nine)! This most unexpected result is known as Belady’s anomaly: for some page-replacement algorithms, the page-fault rate may increase as the number of allocated frames increases. We would expect that giving more memory to a process would improve its performance. In some early research, investigators noticed that this assumption was not always true. Belady’s anomaly was discovered as a result. 9.4.3 Optimal Page Replacement One result of the discovery of Belady’s anomaly was the search for an optimal page-replacement algorithm—the algorithm that has the lowest page-fault rate of all algorithms and will never suffer from Belady’s anomaly. Such an algorithm does exist and has been called OPT or MIN. It is simply this: Replace the page that will not be used for the longest period of time. Use of this page-replacement algorithm guarantees the lowest possible page- fault rate for a fixed number of frames.
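A FIFO simulator is only a few lines of C. The sketch below treats the frames as a circular queue, counts faults, and reproduces both the fifteen faults for the sample string with three frames and Belady's anomaly for the string 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5 (nine faults with three frames, ten with four). The helper name fifo_faults is ours.

/* FIFO page replacement: the frames form a circular queue and the oldest
 * resident page is always the next victim. */
#include <stdio.h>

int fifo_faults(const int *ref, int n, int nframes) {
    int frames[16];
    int next = 0, used = 0, faults = 0;

    for (int i = 0; i < n; i++) {
        int hit = 0;
        for (int j = 0; j < used; j++)
            if (frames[j] == ref[i]) { hit = 1; break; }
        if (!hit) {
            faults++;
            if (used < nframes) {
                frames[used++] = ref[i];      /* free frame available */
            } else {
                frames[next] = ref[i];        /* replace the oldest page */
                next = (next + 1) % nframes;
            }
        }
    }
    return faults;
}

int main(void) {
    int ref[]    = { 7,0,1,2,0,3,0,4,2,3,0,3,2,1,2,0,1,7,0,1 };
    int belady[] = { 1,2,3,4,1,2,5,1,2,3,4,5 };

    printf("FIFO, 3 frames: %d faults\n", fifo_faults(ref, 20, 3));
    printf("Belady string, 3 frames: %d faults\n", fifo_faults(belady, 12, 3));
    printf("Belady string, 4 frames: %d faults\n", fifo_faults(belady, 12, 4));
    return 0;
}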
  • 222. 9.4 Page Replacement 415 number of page faults 16 14 12 10 8 6 4 2 1 2 3 number of frames 4 5 6 7 Figure 9.13 Page-fault curve for FIFO replacement on a reference string. For example, on our sample reference string, the optimal page-replacement algorithm would yield nine page faults, as shown in Figure 9.14. The first three references cause faults that fill the three empty frames. The reference to page 2 replaces page 7, because page 7 will not be used until reference 18, whereas page 0 will be used at 5, and page 1 at 14. The reference to page 3 replaces page 1, as page 1 will be the last of the three pages in memory to be referenced again. With only nine page faults, optimal replacement is much better than a FIFO algorithm, which results in fifteen faults. (If we ignore the first three, which all algorithms must suffer, then optimal replacement is twice as good as FIFO replacement.) In fact, no replacement algorithm can process this reference string in three frames with fewer than nine faults. Unfortunately, the optimal page-replacement algorithm is difficult to implement, because it requires future knowledge of the reference string. (We encountered a similar situation with the SJF CPU-scheduling algorithm in Section 6.3.2.) As a result, the optimal algorithm is used mainly for comparison studies. For instance, it may be useful to know that, although a new algorithm is not optimal, it is within 12.3 percent of optimal at worst and within 4.7 percent on average. page frames reference string 7 7 0 7 0 1 2 0 1 2 0 3 2 4 3 2 0 3 7 0 1 2 0 1 7 0 1 2 0 3 0 4 2 3 0 7 1 1 0 2 1 2 0 3 Figure 9.14 Optimal page-replacement algorithm.
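Because OPT needs the entire future reference string, it can be simulated only offline. The C sketch below evicts the resident page whose next use lies farthest in the future and reports nine faults for the sample string with three frames, matching the count quoted above. The function names are ours.

/* Optimal (OPT/MIN) replacement: evict the resident page whose next use
 * is farthest away.  Usable only when the whole reference string is known. */
#include <stdio.h>

static int next_use(const int *ref, int n, int from, int page) {
    for (int i = from; i < n; i++)
        if (ref[i] == page) return i;
    return n;                        /* never used again */
}

int opt_faults(const int *ref, int n, int nframes) {
    int frames[16];
    int used = 0, faults = 0;

    for (int i = 0; i < n; i++) {
        int hit = 0;
        for (int j = 0; j < used; j++)
            if (frames[j] == ref[i]) { hit = 1; break; }
        if (hit) continue;

        faults++;
        if (used < nframes) {
            frames[used++] = ref[i];
        } else {
            int victim = 0, farthest = -1;
            for (int j = 0; j < used; j++) {
                int nu = next_use(ref, n, i + 1, frames[j]);
                if (nu > farthest) { farthest = nu; victim = j; }
            }
            frames[victim] = ref[i];
        }
    }
    return faults;
}

int main(void) {
    int ref[] = { 7,0,1,2,0,3,0,4,2,3,0,3,2,1,2,0,1,7,0,1 };
    printf("OPT, 3 frames: %d faults\n", opt_faults(ref, 20, 3));
    return 0;
}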
  • 223. 416 Chapter 9 Virtual Memory 9.4.4 LRU Page Replacement If the optimal algorithm is not feasible, perhaps an approximation of the optimal algorithm is possible. The key distinction between the FIFO and OPT algorithms (other than looking backward versus forward in time) is that the FIFO algorithm uses the time when a page was brought into memory, whereas the OPT algorithm uses the time when a page is to be used. If we use the recent past as an approximation of the near future, then we can replace the page that has not been used for the longest period of time. This approach is the least recently used (LRU) algorithm. LRU replacement associates with each page the time of that page’s last use. When a page must be replaced, LRU chooses the page that has not been used for the longest period of time. We can think of this strategy as the optimal page-replacement algorithm looking backward in time, rather than forward. (Strangely, if we let SR be the reverse of a reference string S, then the page-fault rate for the OPT algorithm on S is the same as the page-fault rate for the OPT algorithm on SR . Similarly, the page-fault rate for the LRU algorithm on S is the same as the page-fault rate for the LRU algorithm on SR .) The result of applying LRU replacement to our example reference string is shown in Figure 9.15. The LRU algorithm produces twelve faults. Notice that the first five faults are the same as those for optimal replacement. When the reference to page 4 occurs, however, LRU replacement sees that, of the three frames in memory, page 2 was used least recently. Thus, the LRU algorithm replaces page 2, not knowing that page 2 is about to be used. When it then faults for page 2, the LRU algorithm replaces page 3, since it is now the least recently used of the three pages in memory. Despite these problems, LRU replacement with twelve faults is much better than FIFO replacement with fifteen. The LRU policy is often used as a page-replacement algorithm and is considered to be good. The major problem is how to implement LRU replacement. An LRU page-replacement algorithm may require substantial hardware assistance. The problem is to determine an order for the frames defined by the time of last use. Two implementations are feasible: • Counters. In the simplest case, we associate with each page-table entry a time-of-use field and add to the CPU a logical clock or counter. The clock is incremented for every memory reference. Whenever a reference to a page is made, the contents of the clock register are copied to the time-of-use field in the page-table entry for that page. In this way, we always have page frames reference string 7 7 0 7 0 1 2 0 1 2 0 3 4 0 3 4 0 2 4 3 2 0 3 2 1 3 2 1 0 2 1 0 7 7 0 1 2 0 3 0 4 2 3 0 7 1 1 0 2 1 2 0 3 Figure 9.15 LRU page-replacement algorithm.
  • 224. 9.4 Page Replacement 417 the “time” of the last reference to each page. We replace the page with the smallest time value. This scheme requires a search of the page table to find the LRU page and a write to memory (to the time-of-use field in the page table) for each memory access. The times must also be maintained when page tables are changed (due to CPU scheduling). Overflow of the clock must be considered. • Stack. Another approach to implementing LRU replacement is to keep a stack of page numbers. Whenever a page is referenced, it is removed from the stack and put on the top. In this way, the most recently used page is always at the top of the stack and the least recently used page is always at the bottom (Figure 9.16). Because entries must be removed from the middle of the stack, it is best to implement this approach by using a doubly linked list with a head pointer and a tail pointer. Removing a page and putting it on the top of the stack then requires changing six pointers at worst. Each update is a little more expensive, but there is no search for a replacement; the tail pointer points to the bottom of the stack, which is the LRU page. This approach is particularly appropriate for software or microcode implementations of LRU replacement. Like optimal replacement, LRU replacement does not suffer from Belady’s anomaly. Both belong to a class of page-replacement algorithms, called stack algorithms, that can never exhibit Belady’s anomaly. A stack algorithm is an algorithm for which it can be shown that the set of pages in memory for n frames is always a subset of the set of pages that would be in memory with n + 1 frames. For LRU replacement, the set of pages in memory would be the n most recently referenced pages. If the number of frames is increased, these n pages will still be the most recently referenced and so will still be in memory. Note that neither implementation of LRU would be conceivable without hardware assistance beyond the standard TLB registers. The updating of the clock fields or stack must be done for every memory reference. If we were to use an interrupt for every reference to allow software to update such data structures, it would slow every memory reference by a factor of at least ten, 2 1 0 4 7 stack before a 7 2 1 4 0 stack after b reference string 4 7 0 7 1 0 1 2 1 2 2 7 a b 1 Figure 9.16 Use of a stack to record the most recent page references.
  • 225. 418 Chapter 9 Virtual Memory hence slowing every user process by a factor of ten. Few systems could tolerate that level of overhead for memory management. 9.4.5 LRU-Approximation Page Replacement Few computer systems provide sufficient hardware support for true LRU page replacement. In fact, some systems provide no hardware support, and other page-replacement algorithms (such as a FIFO algorithm) must be used. Many systems provide some help, however, in the form of a reference bit. The reference bit for a page is set by the hardware whenever that page is referenced (either a read or a write to any byte in the page). Reference bits are associated with each entry in the page table. Initially, all bits are cleared (to 0) by the operating system. As a user process executes, the bit associated with each page referenced is set (to 1) by the hardware. After some time, we can determine which pages have been used and which have not been used by examining the reference bits, although we do not know the order of use. This information is the basis for many page-replacement algorithms that approximate LRU replacement. 9.4.5.1 Additional-Reference-Bits Algorithm We can gain additional ordering information by recording the reference bits at regular intervals. We can keep an 8-bit byte for each page in a table in memory. At regular intervals (say, every 100 milliseconds), a timer interrupt transfers control to the operating system. The operating system shifts the reference bit for each page into the high-order bit of its 8-bit byte, shifting the other bits right by 1 bit and discarding the low-order bit. These 8-bit shift registers contain the history of page use for the last eight time periods. If the shift register contains 00000000, for example, then the page has not been used for eight time periods. A page that is used at least once in each period has a shift register value of 11111111. A page with a history register value of 11000100 has been used more recently than one with a value of 01110111. If we interpret these 8-bit bytes as unsigned integers, the page with the lowest number is the LRU page, and it can be replaced. Notice that the numbers are not guaranteed to be unique, however. We can either replace (swap out) all pages with the smallest value or use the FIFO method to choose among them. The number of bits of history included in the shift register can be varied, of course, and is selected (depending on the hardware available) to make the updating as fast as possible. In the extreme case, the number can be reduced to zero, leaving only the reference bit itself. This algorithm is called the second-chance page-replacement algorithm. 9.4.5.2 Second-Chance Algorithm The basic algorithm of second-chance replacement is a FIFO replacement algorithm. When a page has been selected, however, we inspect its reference bit. If the value is 0, we proceed to replace this page; but if the reference bit is set to 1, we give the page a second chance and move on to select the next FIFO page. When a page gets a second chance, its reference bit is cleared, and its arrival time is reset to the current time. Thus, a page that is given a second chance will not be replaced until all other pages have been replaced (or given
  • 226. 9.4 Page Replacement 419 circular queue of pages (a) next victim 0 reference bits pages 0 1 1 0 1 1 … … circular queue of pages (b) 0 reference bits pages 0 0 0 0 1 1 … … Figure 9.17 Second-chance (clock) page-replacement algorithm. second chances). In addition, if a page is used often enough to keep its reference bit set, it will never be replaced. One way to implement the second-chance algorithm (sometimes referred to as the clock algorithm) is as a circular queue. A pointer (that is, a hand on the clock) indicates which page is to be replaced next. When a frame is needed, the pointer advances until it finds a page with a 0 reference bit. As it advances, it clears the reference bits (Figure 9.17). Once a victim page is found, the page is replaced, and the new page is inserted in the circular queue in that position. Notice that, in the worst case, when all bits are set, the pointer cycles through the whole queue, giving each page a second chance. It clears all the reference bits before selecting the next page for replacement. Second-chance replacement degenerates to FIFO replacement if all bits are set. 9.4.5.3 Enhanced Second-Chance Algorithm We can enhance the second-chance algorithm by considering the reference bit and the modify bit (described in Section 9.4.1) as an ordered pair. With these two bits, we have the following four possible classes: 1. (0, 0) neither recently used nor modified—best page to replace 2. (0, 1) not recently used but modified—not quite as good, because the page will need to be written out before replacement
  • 227. 420 Chapter 9 Virtual Memory 3. (1, 0) recently used but clean—probably will be used again soon 4. (1, 1) recently used and modified—probably will be used again soon, and the page will be need to be written out to disk before it can be replaced Each page is in one of these four classes. When page replacement is called for, we use the same scheme as in the clock algorithm; but instead of examining whether the page to which we are pointing has the reference bit set to 1, we examine the class to which that page belongs. We replace the first page encountered in the lowest nonempty class. Notice that we may have to scan the circular queue several times before we find a page to be replaced. The major difference between this algorithm and the simpler clock algo- rithm is that here we give preference to those pages that have been modified in order to reduce the number of I/Os required. 9.4.6 Counting-Based Page Replacement There are many other algorithms that can be used for page replacement. For example, we can keep a counter of the number of references that have been made to each page and develop the following two schemes. • The least frequently used (LFU) page-replacement algorithm requires that the page with the smallest count be replaced. The reason for this selection is that an actively used page should have a large reference count. A problem arises, however, when a page is used heavily during the initial phase of a process but then is never used again. Since it was used heavily, it has a large count and remains in memory even though it is no longer needed. One solution is to shift the counts right by 1 bit at regular intervals, forming an exponentially decaying average usage count. • The most frequently used (MFU) page-replacement algorithm is based on the argument that the page with the smallest count was probably just brought in and has yet to be used. As you might expect, neither MFU nor LFU replacement is common. The implementation of these algorithms is expensive, and they do not approximate OPT replacement well. 9.4.7 Page-Buffering Algorithms Other procedures are often used in addition to a specific page-replacement algorithm. For example, systems commonly keep a pool of free frames. When a page fault occurs, a victim frame is chosen as before. However, the desired page is read into a free frame from the pool before the victim is written out. This procedure allows the process to restart as soon as possible, without waiting for the victim page to be written out. When the victim is later written out, its frame is added to the free-frame pool. An expansion of this idea is to maintain a list of modified pages. Whenever the paging device is idle, a modified page is selected and is written to the disk. Its modify bit is then reset. This scheme increases the probability that a page will be clean when it is selected for replacement and will not need to be written out.
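The victim-selection step of the second-chance (clock) algorithm is sketched below in C. The frame table, reference-bit values, and hand position are made-up demonstration data; in a real kernel the reference bits would be set by the hardware. The comments note where the enhanced variant would also consult the modify bit to prefer clean victims.

/* Second-chance (clock) victim selection, as a user-level sketch.
 * Each frame has a reference bit; the hand advances, clearing bits,
 * until it finds a frame whose bit is already 0.  The enhanced variant
 * described above would also check a modify bit and prefer
 * (reference = 0, modify = 0) frames to avoid a write-back. */
#include <stdio.h>

#define NFRAMES 8

struct frame {
    int page;        /* page currently held in this frame */
    int referenced;  /* reference bit, set by "hardware" on each use */
};

static struct frame frames[NFRAMES] = {
    {10,1}, {11,0}, {12,1}, {13,1}, {14,0}, {15,1}, {16,0}, {17,1}
};
static int hand = 0;                 /* the clock hand */

int select_victim(void) {
    for (;;) {
        if (frames[hand].referenced == 0) {
            int victim = hand;
            hand = (hand + 1) % NFRAMES;
            return victim;           /* replace this frame */
        }
        frames[hand].referenced = 0; /* give the page a second chance */
        hand = (hand + 1) % NFRAMES;
    }
}

int main(void) {
    int v = select_victim();
    printf("victim: frame %d (held page %d)\n", v, frames[v].page);
    return 0;
}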
  • 228. 9.5 Allocation of Frames 421 Another modification is to keep a pool of free frames but to remember which page was in each frame. Since the frame contents are not modified when a frame is written to the disk, the old page can be reused directly from the free-frame pool if it is needed before that frame is reused. No I/O is needed in this case. When a page fault occurs, we first check whether the desired page is in the free-frame pool. If it is not, we must select a free frame and read into it. This technique is used in the VAX/VMS system along with a FIFO replace- ment algorithm. When the FIFO replacement algorithm mistakenly replaces a page that is still in active use, that page is quickly retrieved from the free-frame pool, and no I/O is necessary. The free-frame buffer provides protection against the relatively poor, but simple, FIFO replacement algorithm. This method is necessary because the early versions of VAX did not implement the reference bit correctly. Some versions of the UNIX system use this method in conjunction with the second-chance algorithm. It can be a useful augmentation to any page- replacement algorithm, to reduce the penalty incurred if the wrong victim page is selected. 9.4.8 Applications and Page Replacement In certain cases, applications accessing data through the operating system’s virtual memory perform worse than if the operating system provided no buffering at all. A typical example is a database, which provides its own memory management and I/O buffering. Applications like this understand their memory use and disk use better than does an operating system that is implementing algorithms for general-purpose use. If the operating system is buffering I/O and the application is doing so as well, however, then twice the memory is being used for a set of I/O. In another example, data warehouses frequently perform massive sequen- tial disk reads, followed by computations and writes. The LRU algorithm would be removing old pages and preserving new ones, while the application would more likely be reading older pages than newer ones (as it starts its sequential reads again). Here, MFU would actually be more efficient than LRU. Because of such problems, some operating systems give special programs the ability to use a disk partition as a large sequential array of logical blocks, without any file-system data structures. This array is sometimes called the raw disk, and I/O to this array is termed raw I/O. Raw I/O bypasses all the file- system services, such as file I/O demand paging, file locking, prefetching, space allocation, file names, and directories. Note that although certain applications are more efficient when implementing their own special-purpose storage services on a raw partition, most applications perform better when they use the regular file-system services. 9.5 Allocation of Frames We turn next to the issue of allocation. How do we allocate the fixed amount of free memory among the various processes? If we have 93 free frames and two processes, how many frames does each process get? The simplest case is the single-user system. Consider a single-user system with 128 KB of memory composed of pages 1 KB in size. This system has 128
  • 229. 422 Chapter 9 Virtual Memory frames. The operating system may take 35 KB, leaving 93 frames for the user process. Under pure demand paging, all 93 frames would initially be put on the free-frame list. When a user process started execution, it would generate a sequence of page faults. The first 93 page faults would all get free frames from the free-frame list. When the free-frame list was exhausted, a page-replacement algorithm would be used to select one of the 93 in-memory pages to be replaced with the 94th, and so on. When the process terminated, the 93 frames would once again be placed on the free-frame list. There are many variations on this simple strategy. We can require that the operating system allocate all its buffer and table space from the free-frame list. When this space is not in use by the operating system, it can be used to support user paging. We can try to keep three free frames reserved on the free-frame list at all times. Thus, when a page fault occurs, there is a free frame available to page into. While the page swap is taking place, a replacement can be selected, which is then written to the disk as the user process continues to execute. Other variants are also possible, but the basic strategy is clear: the user process is allocated any free frame. 9.5.1 Minimum Number of Frames Our strategies for the allocation of frames are constrained in various ways. We cannot, for example, allocate more than the total number of available frames (unless there is page sharing). We must also allocate at least a minimum number of frames. Here, we look more closely at the latter requirement. One reason for allocating at least a minimum number of frames involves performance. Obviously, as the number of frames allocated to each process decreases, the page-fault rate increases, slowing process execution. In addition, remember that, when a page fault occurs before an executing instruction is complete, the instruction must be restarted. Consequently, we must have enough frames to hold all the different pages that any single instruction can reference. For example, consider a machine in which all memory-reference instruc- tions may reference only one memory address. In this case, we need at least one frame for the instruction and one frame for the memory reference. In addition, if one-level indirect addressing is allowed (for example, a load instruction on page 16 can refer to an address on page 0, which is an indirect reference to page 23), then paging requires at least three frames per process. Think about what might happen if a process had only two frames. The minimum number of frames is defined by the computer architecture. For example, the move instruction for the PDP-11 includes more than one word for some addressing modes, and thus the instruction itself may straddle two pages. In addition, each of its two operands may be indirect references, for a total of six frames. Another example is the IBM 370 MVC instruction. Since the instruction is from storage location to storage location, it takes 6 bytes and can straddle two pages. The block of characters to move and the area to which it is to be moved can each also straddle two pages. This situation would require six frames. The worst case occurs when the MVC instruction is the operand of an EXECUTE instruction that straddles a page boundary; in this case, we need eight frames.
  • 230. 9.5 Allocation of Frames 423 The worst-case scenario occurs in computer architectures that allow multiple levels of indirection (for example, each 16-bit word could contain a 15-bit address plus a 1-bit indirect indicator). Theoretically, a simple load instruction could reference an indirect address that could reference an indirect address (on another page) that could also reference an indirect address (on yet another page), and so on, until every page in virtual memory had been touched. Thus, in the worst case, the entire virtual memory must be in physical memory. To overcome this difficulty, we must place a limit on the levels of indirection (for example, limit an instruction to at most 16 levels of indirection). When the first indirection occurs, a counter is set to 16; the counter is then decremented for each successive indirection for this instruction. If the counter is decremented to 0, a trap occurs (excessive indirection). This limitation reduces the maximum number of memory references per instruction to 17, requiring the same number of frames. Whereas the minimum number of frames per process is defined by the architecture, the maximum number is defined by the amount of available physical memory. In between, we are still left with significant choice in frame allocation. 9.5.2 Allocation Algorithms The easiest way to split m frames among n processes is to give everyone an equal share, m/n frames (ignoring frames needed by the operating system for the moment). For instance, if there are 93 frames and five processes, each process will get 18 frames. The three leftover frames can be used as a free-frame buffer pool. This scheme is called equal allocation. An alternative is to recognize that various processes will need differing amounts of memory. Consider a system with a 1-KB frame size. If a small student process of 10 KB and an interactive database of 127 KB are the only two processes running in a system with 62 free frames, it does not make much sense to give each process 31 frames. The student process does not need more than 10 frames, so the other 21 are, strictly speaking, wasted. To solve this problem, we can use proportional allocation, in which we allocate available memory to each process according to its size. Let the size of the virtual memory for process pi be si, and define S = Σ si. Then, if the total number of available frames is m, we allocate ai frames to process pi, where ai is approximately ai = si/S × m. Of course, we must adjust each ai to be an integer that is greater than the minimum number of frames required by the instruction set, with a sum not exceeding m. With proportional allocation, we would split 62 frames between two processes, one of 10 pages and one of 127 pages, by allocating 4 frames and 57 frames, respectively, since 10/137 × 62 ≈ 4, and 127/137 × 62 ≈ 57.
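Proportional allocation is a one-line formula, ai = (si/S) × m with S = Σ si, and the C sketch below applies it to the 10-page and 127-page processes of the example, yielding 4 and 57 of the 62 free frames. Truncation toward zero matches the example; a real allocator would also enforce the per-process minimum and hand out any leftover frames.

/* Proportional frame allocation: process i with size s_i (in pages)
 * gets roughly a_i = (s_i / S) * m frames, where S is the sum of all
 * sizes and m the number of free frames. */
#include <stdio.h>

int main(void) {
    int sizes[] = { 10, 127 };        /* s_i, in pages */
    int nproc = 2;
    int m = 62;                       /* free frames to distribute */

    int S = 0;
    for (int i = 0; i < nproc; i++)
        S += sizes[i];

    for (int i = 0; i < nproc; i++) {
        int a = sizes[i] * m / S;     /* truncating, as in the example */
        printf("process %d: %d frames\n", i, a);
    }
    return 0;
}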
  • 231. 424 Chapter 9 Virtual Memory In this way, both processes share the available frames according to their “needs,” rather than equally. In both equal and proportional allocation, of course, the allocation may vary according to the multiprogramming level. If the multiprogramming level is increased, each process will lose some frames to provide the memory needed for the new process. Conversely, if the multiprogramming level decreases, the frames that were allocated to the departed process can be spread over the remaining processes. Notice that, with either equal or proportional allocation, a high-priority process is treated the same as a low-priority process. By its definition, however, we may want to give the high-priority process more memory to speed its execution, to the detriment of low-priority processes. One solution is to use a proportional allocation scheme wherein the ratio of frames depends not on the relative sizes of processes but rather on the priorities of processes or on a combination of size and priority. 9.5.3 Global versus Local Allocation Another important factor in the way frames are allocated to the various processes is page replacement. With multiple processes competing for frames, we can classify page-replacement algorithms into two broad categories: global replacement and local replacement. Global replacement allows a process to select a replacement frame from the set of all frames, even if that frame is currently allocated to some other process; that is, one process can take a frame from another. Local replacement requires that each process select from only its own set of allocated frames. For example, consider an allocation scheme wherein we allow high-priority processes to select frames from low-priority processes for replacement. A process can select a replacement from among its own frames or the frames of any lower-priority process. This approach allows a high-priority process to increase its frame allocation at the expense of a low-priority process. With a local replacement strategy, the number of frames allocated to a process does not change. With global replacement, a process may happen to select only frames allocated to other processes, thus increasing the number of frames allocated to it (assuming that other processes do not choose its frames for replacement). One problem with a global replacement algorithm is that a process cannot control its own page-fault rate. The set of pages in memory for a process depends not only on the paging behavior of that process but also on the paging behavior of other processes. Therefore, the same process may perform quite differently (for example, taking 0.5 seconds for one execution and 10.3 seconds for the next execution) because of totally external circumstances. Such is not the case with a local replacement algorithm. Under local replacement, the set of pages in memory for a process is affected by the paging behavior of only that process. Local replacement might hinder a process, however, by not making available to it other, less used pages of memory. Thus, global replacement generally results in greater system throughput and is therefore the more commonly used method. 9.5.4 Non-Uniform Memory Access Thus far in our coverage of virtual memory, we have assumed that all main memory is created equal—or at least that it is accessed equally. On many
  • 232. 9.6 Thrashing 425 computer systems, that is not the case. Often, in systems with multiple CPUs (Section 1.3.2), a given CPU can access some sections of main memory faster than it can access others. These performance differences are caused by how CPUs and memory are interconnected in the system. Frequently, such a system is made up of several system boards, each containing multiple CPUs and some memory. The system boards are interconnected in various ways, ranging from system buses to high-speed network connections like InfiniBand. As you might expect, the CPUs on a particular board can access the memory on that board with less delay than they can access memory on other boards in the system. Systems in which memory access times vary significantly are known collectively as non-uniform memory access (NUMA) systems, and without exception, they are slower than systems in which memory and CPUs are located on the same motherboard. Managing which page frames are stored at which locations can significantly affect performance in NUMA systems. If we treat memory as uniform in such a system, CPUs may wait significantly longer for memory access than if we modify memory allocation algorithms to take NUMA into account. Similar changes must be made to the scheduling system. The goal of these changes is to have memory frames allocated “as close as possible” to the CPU on which the process is running. The definition of “close” is “with minimum latency,” which typically means on the same system board as the CPU. The algorithmic changes consist of having the scheduler track the last CPU on which each process ran. If the scheduler tries to schedule each process onto its previous CPU, and the memory-management system tries to allocate frames for the process close to the CPU on which it is being scheduled, then improved cache hits and decreased memory access times will result. The picture is more complicated once threads are added. For example, a process with many running threads may end up with those threads scheduled on many different system boards. How is the memory to be allocated in this case? Solaris solves the problem by creating lgroups (for “latency groups”) in the kernel. Each lgroup gathers together close CPUs and memory. In fact, there is a hierarchy of lgroups based on the amount of latency between the groups. Solaris tries to schedule all threads of a process and allocate all memory of a process within an lgroup. If that is not possible, it picks nearby lgroups for the rest of the resources needed. This practice minimizes overall memory latency and maximizes CPU cache hit rates. 9.6 Thrashing If the number of frames allocated to a low-priority process falls below the minimum number required by the computer architecture, we must suspend that process’s execution. We should then page out its remaining pages, freeing all its allocated frames. This provision introduces a swap-in, swap-out level of intermediate CPU scheduling. In fact, look at any process that does not have “enough” frames. If the process does not have the number of frames it needs to support pages in active use, it will quickly page-fault. At this point, it must replace some page. However, since all its pages are in active use, it must replace a page that will be needed again right away. Consequently, it quickly faults again, and again, and again, replacing pages that it must bring back in immediately.
  • 233. 426 Chapter 9 Virtual Memory This high paging activity is called thrashing. A process is thrashing if it is spending more time paging than executing. 9.6.1 Cause of Thrashing Thrashing results in severe performance problems. Consider the following scenario, which is based on the actual behavior of early paging systems. The operating system monitors CPU utilization. If CPU utilization is too low, we increase the degree of multiprogramming by introducing a new process to the system. A global page-replacement algorithm is used; it replaces pages without regard to the process to which they belong. Now suppose that a process enters a new phase in its execution and needs more frames. It starts faulting and taking frames away from other processes. These processes need those pages, however, and so they also fault, taking frames from other processes. These faulting processes must use the paging device to swap pages in and out. As they queue up for the paging device, the ready queue empties. As processes wait for the paging device, CPU utilization decreases. The CPU scheduler sees the decreasing CPU utilization and increases the degree of multiprogramming as a result. The new process tries to get started by taking frames from running processes, causing more page faults and a longer queue for the paging device. As a result, CPU utilization drops even further, and the CPU scheduler tries to increase the degree of multiprogramming even more. Thrashing has occurred, and system throughput plunges. The page- fault rate increases tremendously. As a result, the effective memory-access time increases. No work is getting done, because the processes are spending all their time paging. This phenomenon is illustrated in Figure 9.18, in which CPU utilization is plotted against the degree of multiprogramming. As the degree of multi- programming increases, CPU utilization also increases, although more slowly, until a maximum is reached. If the degree of multiprogramming is increased even further, thrashing sets in, and CPU utilization drops sharply. At this point, to increase CPU utilization and stop thrashing, we must decrease the degree of multiprogramming. thrashing degree of multiprogramming CPU utilization Figure 9.18 Thrashing.
  • 234. 9.6 Thrashing 427 We can limit the effects of thrashing by using a local replacement algorithm (or priority replacement algorithm). With local replacement, if one process starts thrashing, it cannot steal frames from another process and cause the latter to thrash as well. However, the problem is not entirely solved. If processes are thrashing, they will be in the queue for the paging device most of the time. The average service time for a page fault will increase because of the longer average queue for the paging device. Thus, the effective access time will increase even for a process that is not thrashing. To prevent thrashing, we must provide a process with as many frames as it needs. But how do we know how many frames it “needs”? There are several techniques. The working-set strategy (Section 9.6.2) starts by looking at how many frames a process is actually using. This approach defines the locality model of process execution. The locality model states that, as a process executes, it moves from locality to locality. A locality is a set of pages that are actively used together (Figure 9.19). A program is generally composed of several different localities, which may overlap. For example, when a function is called, it defines a new locality. In this locality, memory references are made to the instructions of the function call, its local variables, and a subset of the global variables. When we exit the function, the process leaves this locality, since the local variables and instructions of the function are no longer in active use. We may return to this locality later. Thus, we see that localities are defined by the program structure and its data structures. The locality model states that all programs will exhibit this basic memory reference structure. Note that the locality model is the unstated principle behind the caching discussions so far in this book. If accesses to any types of data were random rather than patterned, caching would be useless. Suppose we allocate enough frames to a process to accommodate its current locality. It will fault for the pages in its locality until all these pages are in memory; then, it will not fault again until it changes localities. If we do not allocate enough frames to accommodate the size of the current locality, the process will thrash, since it cannot keep in memory all the pages that it is actively using. 9.6.2 Working-Set Model As mentioned, the working-set model is based on the assumption of locality. This model uses a parameter, Δ, to define the working-set window. The idea is to examine the most recent Δ page references. The set of pages in the most recent Δ page references is the working set (Figure 9.20). If a page is in active use, it will be in the working set. If it is no longer being used, it will drop from the working set Δ time units after its last reference. Thus, the working set is an approximation of the program’s locality. For example, given the sequence of memory references shown in Figure 9.20, if Δ = 10 memory references, then the working set at time t1 is {1, 2, 5, 6, 7}. By time t2, the working set has changed to {3, 4}. The accuracy of the working set depends on the selection of Δ. If Δ is too small, it will not encompass the entire locality; if Δ is too large, it may overlap several localities. In the extreme, if Δ is infinite, the working set is the set of pages touched during the process execution.
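A direct (if expensive) way to compute a working set is to scan the last Δ references, as the C sketch below does for the reference string of Figure 9.20 with Δ = 10; the two sample points reproduce the working sets {1, 2, 5, 6, 7} and {3, 4} given above. The practical in-kernel approximation with timer interrupts and reference bits is described next in the text.

/* Working set over a sliding window of the last DELTA references,
 * computed by brute force for illustration. */
#include <stdio.h>

#define DELTA 10
#define MAXPAGE 16

void print_working_set(const int *ref, int upto) {
    int in_ws[MAXPAGE] = { 0 };
    int start = (upto - DELTA > 0) ? upto - DELTA : 0;

    for (int i = start; i < upto; i++)   /* the last DELTA references */
        in_ws[ref[i]] = 1;

    printf("WS after %d references: {", upto);
    for (int p = 0; p < MAXPAGE; p++)
        if (in_ws[p]) printf(" %d", p);
    printf(" }\n");
}

int main(void) {
    int ref[] = { 2,6,1,5,7,7,7,7,5,1,6,2,3,4,1,2,3,4,4,4,
                  3,4,3,4,4,4,1,3,2,3,4,4,4,3,4,4,4 };
    print_working_set(ref, 10);   /* t1 in the figure: {1,2,5,6,7} */
    print_working_set(ref, 26);   /* a later point: {3,4} */
    return 0;
}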
  • 235. 428 Chapter 9 Virtual Memory Figure 9.19 Locality in a memory-reference pattern (page numbers plotted against execution time). The most important property of the working set, then, is its size. If we compute the working-set size, WSSi, for each process in the system, we can then consider that D = Σ WSSi, where D is the total demand for frames. Each process is actively using the pages in its working set. Thus, process i needs WSSi frames. If the total demand is greater than the total number of available frames (D > m), thrashing will occur, because some processes will not have enough frames. Once Δ has been selected, use of the working-set model is simple. The operating system monitors the working set of each process and allocates to
that working set enough frames to provide it with its working-set size. If there are enough extra frames, another process can be initiated. If the sum of the working-set sizes increases, exceeding the total number of available frames, the operating system selects a process to suspend. The process’s pages are written out (swapped), and its frames are reallocated to other processes. The suspended process can be restarted later.

Figure 9.20 Working-set model.

This working-set strategy prevents thrashing while keeping the degree of multiprogramming as high as possible. Thus, it optimizes CPU utilization. The difficulty with the working-set model is keeping track of the working set. The working-set window is a moving window. At each memory reference, a new reference appears at one end, and the oldest reference drops off the other end. A page is in the working set if it is referenced anywhere in the working-set window.

We can approximate the working-set model with a fixed-interval timer interrupt and a reference bit. For example, assume that Δ equals 10,000 references and that we can cause a timer interrupt every 5,000 references. When we get a timer interrupt, we copy and clear the reference-bit values for each page. Thus, if a page fault occurs, we can examine the current reference bit and two in-memory bits to determine whether a page was used within the last 10,000 to 15,000 references. If it was used, at least one of these bits will be on. If it has not been used, these bits will be off. Pages with at least one bit on will be considered to be in the working set. Note that this arrangement is not entirely accurate, because we cannot tell where, within an interval of 5,000, a reference occurred. We can reduce the uncertainty by increasing the number of history bits and the frequency of interrupts (for example, 10 bits and interrupts every 1,000 references). However, the cost to service these more frequent interrupts will be correspondingly higher.

9.6.3 Page-Fault Frequency

The working-set model is successful, and knowledge of the working set can be useful for prepaging (Section 9.9.1), but it seems a clumsy way to control thrashing. A strategy that uses the page-fault frequency (PFF) takes a more direct approach. The specific problem is how to prevent thrashing. Thrashing has a high page-fault rate. Thus, we want to control the page-fault rate. When it is too high, we know that the process needs more frames. Conversely, if the page-fault rate is too low, then the process may have too many frames. We can establish upper and lower bounds on the desired page-fault rate (Figure 9.21). If the actual page-fault rate exceeds the upper limit, we allocate the process another
frame. If the page-fault rate falls below the lower limit, we remove a frame from the process. Thus, we can directly measure and control the page-fault rate to prevent thrashing. As with the working-set strategy, we may have to swap out a process. If the page-fault rate increases and no free frames are available, we must select some process and swap it out to backing store. The freed frames are then distributed to processes with high page-fault rates.

Figure 9.21 Page-fault frequency.

9.6.4 Concluding Remarks

Practically speaking, thrashing and the resulting swapping have a disagreeably large impact on performance. The current best practice in implementing a computer facility is to include enough physical memory, whenever possible, to avoid thrashing and swapping. From smartphones through mainframes, providing enough memory to keep all working sets in memory concurrently, except under extreme conditions, gives the best user experience.

9.7 Memory-Mapped Files

Consider a sequential read of a file on disk using the standard system calls open(), read(), and write(). Each file access requires a system call and disk access. Alternatively, we can use the virtual memory techniques discussed so far to treat file I/O as routine memory accesses. This approach, known as memory mapping a file, allows a part of the virtual address space to be logically associated with the file. As we shall see, this can lead to significant performance increases.

9.7.1 Basic Mechanism

Memory mapping a file is accomplished by mapping a disk block to a page (or pages) in memory. Initial access to the file proceeds through ordinary demand paging, resulting in a page fault. However, a page-sized portion of the file is read from the file system into a physical page (some systems may opt to read in more than a page-sized chunk of memory at a time).
  • 238. Part Four Storage Management Since main memory is usually too small to accommodate all the data and programs permanently, the computer system must provide secondary storage to back up main memory. Modern computer systems use disks as the primary on-line storage medium for information (both programs and data). The file system provides the mechanism for on-line storage of and access to both data and programs residing on the disks. A file is a collection of related information defined by its creator. The files are mapped by the operating system onto physical devices. Files are normally organized into directories for ease of use. The devices that attach to a computer vary in many aspects. Some devices transfer a character or a block of characters at a time. Some can be accessed only sequentially, others randomly. Some transfer data synchronously, others asynchronously. Some are dedicated, some shared. They can be read-only or read–write. They vary greatly in speed. In many ways, they are also the slowest major component of the computer. Because of all this device variation, the operating system needs to provide a wide range of functionality to applications, to allow them to control all aspects of the devices. One key goal of an operating system’s I/O subsystem is to provide the simplest interface possible to the rest of the system. Because devices are a performance bottleneck, another key is to optimize I/O for maximum concurrency.
  • 239. 10 C H A P T E R Mass-Storage Structure The file system can be viewed logically as consisting of three parts. In Chapter 11, we examine the user and programmer interface to the file system. In Chapter 12, we describe the internal data structures and algorithms used by the operating system to implement this interface. In this chapter, we begin a discussion of file systems at the lowest level: the structure of secondary storage. We first describe the physical structure of magnetic disks and magnetic tapes. We then describe disk-scheduling algorithms, which schedule the order of disk I/Os to maximize performance. Next, we discuss disk formatting and management of boot blocks, damaged blocks, and swap space. We conclude with an examination of the structure of RAID systems. CHAPTER OBJECTIVES • To describe the physical structure of secondary storage devices and its effects on the uses of the devices. • To explain the performance characteristics of mass-storage devices. • To evaluate disk scheduling algorithms. • To discuss operating-system services provided for mass storage, including RAID. 10.1 Overview of Mass-Storage Structure In this section, we present a general overview of the physical structure of secondary and tertiary storage devices. 10.1.1 Magnetic Disks Magnetic disks provide the bulk of secondary storage for modern computer systems. Conceptually, disks are relatively simple (Figure 10.1). Each disk platter has a flat circular shape, like a CD. Common platter diameters range from 1.8 to 3.5 inches. The two surfaces of a platter are covered with a magnetic material. We store information by recording it magnetically on the platters. 467
  • 240. 468 Chapter 10 Mass-Storage Structure track t sector s spindle cylinder c platter arm read-write head arm assembly rotation Figure 10.1 Moving-head disk mechanism. A read–write head “flies” just above each surface of every platter. The heads are attached to a disk arm that moves all the heads as a unit. The surface of a platter is logically divided into circular tracks, which are subdivided into sectors. The set of tracks that are at one arm position makes up a cylinder. There may be thousands of concentric cylinders in a disk drive, and each track may contain hundreds of sectors. The storage capacity of common disk drives is measured in gigabytes. When the disk is in use, a drive motor spins it at high speed. Most drives rotate 60 to 250 times per second, specified in terms of rotations per minute (RPM). Common drives spin at 5,400, 7,200, 10,000, and 15,000 RPM. Disk speed has two parts. The transfer rate is the rate at which data flow between the drive and the computer. The positioning time, or random-access time, consists of two parts: the time necessary to move the disk arm to the desired cylinder, called the seek time, and the time necessary for the desired sector to rotate to the disk head, called the rotational latency. Typical disks can transfer several megabytes of data per second, and they have seek times and rotational latencies of several milliseconds. Because the disk head flies on an extremely thin cushion of air (measured in microns), there is a danger that the head will make contact with the disk surface. Although the disk platters are coated with a thin protective layer, the head will sometimes damage the magnetic surface. This accident is called a head crash. A head crash normally cannot be repaired; the entire disk must be replaced. A disk can be removable, allowing different disks to be mounted as needed. Removable magnetic disks generally consist of one platter, held in a plastic case to prevent damage while not in the disk drive. Other forms of removable disks include CDs, DVDs, and Blu-ray discs as well as removable flash-memory devices known as flash drives (which are a type of solid-state drive).
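To get a feel for these timing figures, the following sketch puts them together for one request. All the numbers are assumed, illustrative values (a 7,200-RPM drive, a 9-ms average seek, a 150-MB/s transfer rate), not figures from the text: one revolution at 7,200 RPM takes 60/7,200 seconds, about 8.3 ms, so the average rotational latency is roughly half a revolution.

public class DiskTiming {
    public static void main(String[] args) {
        // Illustrative figures only; real drives vary widely.
        double rpm = 7200.0;               // rotational speed
        double avgSeekMs = 9.0;            // assumed average seek time
        double transferMBperSec = 150.0;   // assumed sustained transfer rate
        int requestKB = 4;                 // size of one request

        double fullRotationMs = 60_000.0 / rpm;             // ~8.33 ms
        double avgRotationalLatencyMs = fullRotationMs / 2;  // ~4.17 ms
        double transferMs =
            requestKB / 1024.0 / transferMBperSec * 1000.0;  // ~0.03 ms

        double accessMs = avgSeekMs + avgRotationalLatencyMs + transferMs;
        System.out.printf("average access time ~= %.2f ms%n", accessMs);
    }
}

For a small 4-KB request the transfer itself is a negligible fraction of the total; positioning time dominates, which is why the scheduling algorithms of Section 10.4 concentrate on reducing head movement.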
  • 241. 10.1 Overview of Mass-Storage Structure 469 A disk drive is attached to a computer by a set of wires called an I/O bus. Several kinds of buses are available, including advanced technology attachment (ATA), serial ATA (SATA), eSATA, universal serial bus (USB), and fibre channel (FC). The data transfers on a bus are carried out by special electronic processors called controllers. The host controller is the controller at the computer end of the bus. A disk controller is built into each disk drive. To perform a disk I/O operation, the computer places a command into the host controller, typically using memory-mapped I/O ports, as described in Section 9.7.3. The host controller then sends the command via messages to the disk controller, and the disk controller operates the disk-drive hardware to carry out the command. Disk controllers usually have a built-in cache. Data transfer at the disk drive happens between the cache and the disk surface, and data transfer to the host, at fast electronic speeds, occurs between the cache and the host controller. 10.1.2 Solid-State Disks Sometimes old technologies are used in new ways as economics change or the technologies evolve. An example is the growing importance of solid-state disks, or SSDs. Simply described, an SSD is nonvolatile memory that is used like a hard drive. There are many variations of this technology, from DRAM with a battery to allow it to maintain its state in a power failure through flash-memory technologies like single-level cell (SLC) and multilevel cell (MLC) chips. SSDs have the same characteristics as traditional hard disks but can be more reliable because they have no moving parts and faster because they have no seek time or latency. In addition, they consume less power. However, they are more expensive per megabyte than traditional hard disks, have less capacity than the larger hard disks, and may have shorter life spans than hard disks, so their uses are somewhat limited. One use for SSDs is in storage arrays, where they hold file-system metadata that require high performance. SSDs are also used in some laptop computers to make them smaller, faster, and more energy-efficient. Because SSDs can be much faster than magnetic disk drives, standard bus interfaces can cause a major limit on throughput. Some SSDs are designed to connect directly to the system bus (PCI, for example). SSDs are changing other traditional aspects of computer design as well. Some systems use them as a direct replacement for disk drives, while others use them as a new cache tier, moving data between magnetic disks, SSDs, and memory to optimize performance. In the remainder of this chapter, some sections pertain to SSDs, while others do not. For example, because SSDs have no disk head, disk-scheduling algorithms largely do not apply. Throughput and formatting, however, do apply. 10.1.3 Magnetic Tapes Magnetic tape was used as an early secondary-storage medium. Although it is relatively permanent and can hold large quantities of data, its access time is slow compared with that of main memory and magnetic disk. In addition, random access to magnetic tape is about a thousand times slower than random access to magnetic disk, so tapes are not very useful for secondary storage.
  • 242. 470 Chapter 10 Mass-Storage Structure DISK TRANSFER RATES As with many aspects of computing, published performance numbers for disks are not the same as real-world performance numbers. Stated transfer rates are always lower than effective transfer rates, for example. The transfer rate may be the rate at which bits can be read from the magnetic media by the disk head, but that is different from the rate at which blocks are delivered to the operating system. Tapes are used mainly for backup, for storage of infrequently used information, and as a medium for transferring information from one system to another. A tape is kept in a spool and is wound or rewound past a read–write head. Moving to the correct spot on a tape can take minutes, but once positioned, tape drives can write data at speeds comparable to disk drives. Tape capacities vary greatly, depending on the particular kind of tape drive, with current capacities exceeding several terabytes. Some tapes have built-in compression that can more than double the effective storage. Tapes and their drivers are usually categorized by width, including 4, 8, and 19 millimeters and 1/4 and 1/2 inch. Some are named according to technology, such as LTO-5 and SDLT. 10.2 Disk Structure Modern magnetic disk drives are addressed as large one-dimensional arrays of logical blocks, where the logical block is the smallest unit of transfer. The size of a logical block is usually 512 bytes, although some disks can be low-level formatted to have a different logical block size, such as 1,024 bytes. This option is described in Section 10.5.1. The one-dimensional array of logical blocks is mapped onto the sectors of the disk sequentially. Sector 0 is the first sector of the first track on the outermost cylinder. The mapping proceeds in order through that track, then through the rest of the tracks in that cylinder, and then through the rest of the cylinders from outermost to innermost. By using this mapping, we can—at least in theory—convert a logical block number into an old-style disk address that consists of a cylinder number, a track number within that cylinder, and a sector number within that track. In practice, it is difficult to perform this translation, for two reasons. First, most disks have some defective sectors, but the mapping hides this by substituting spare sectors from elsewhere on the disk. Second, the number of sectors per track is not a constant on some drives. Let’s look more closely at the second reason. On media that use constant linear velocity (CLV), the density of bits per track is uniform. The farther a track is from the center of the disk, the greater its length, so the more sectors it can hold. As we move from outer zones to inner zones, the number of sectors per track decreases. Tracks in the outermost zone typically hold 40 percent more sectors than do tracks in the innermost zone. The drive increases its rotation speed as the head moves from the outer to the inner tracks to keep the same rate of data moving under the head. This method is used in CD-ROM
  • 243. 10.3 Disk Attachment 471 and DVD-ROM drives. Alternatively, the disk rotation speed can stay constant; in this case, the density of bits decreases from inner tracks to outer tracks to keep the data rate constant. This method is used in hard disks and is known as constant angular velocity (CAV). The number of sectors per track has been increasing as disk technology improves, and the outer zone of a disk usually has several hundred sectors per track. Similarly, the number of cylinders per disk has been increasing; large disks have tens of thousands of cylinders. 10.3 Disk Attachment Computers access disk storage in two ways. One way is via I/O ports (or host-attached storage); this is common on small systems. The other way is via a remote host in a distributed file system; this is referred to as network-attached storage. 10.3.1 Host-Attached Storage Host-attached storage is storage accessed through local I/O ports. These ports use several technologies. The typical desktop PC uses an I/O bus architecture called IDE or ATA. This architecture supports a maximum of two drives per I/O bus. A newer, similar protocol that has simplified cabling is SATA. High-end workstations and servers generally use more sophisticated I/O architectures such as fibre channel (FC), a high-speed serial architecture that can operate over optical fiber or over a four-conductor copper cable. It has two variants. One is a large switched fabric having a 24-bit address space. This variant is expected to dominate in the future and is the basis of storage-area networks (SANs), discussed in Section 10.3.3. Because of the large address space and the switched nature of the communication, multiple hosts and storage devices can attach to the fabric, allowing great flexibility in I/O communication. The other FC variant is an arbitrated loop (FC-AL) that can address 126 devices (drives and controllers). A wide variety of storage devices are suitable for use as host-attached storage. Among these are hard disk drives, RAID arrays, and CD, DVD, and tape drives. The I/O commands that initiate data transfers to a host-attached storage device are reads and writes of logical data blocks directed to specifically identified storage units (such as bus ID or target logical unit). 10.3.2 Network-Attached Storage A network-attached storage (NAS) device is a special-purpose storage system that is accessed remotely over a data network (Figure 10.2). Clients access network-attached storage via a remote-procedure-call interface such as NFS for UNIX systems or CIFS for Windows machines. The remote procedure calls (RPCs) are carried via TCP or UDP over an IP network—usually the same local- area network (LAN) that carries all data traffic to the clients. Thus, it may be easiest to think of NAS as simply another storage-access protocol. The network- attached storage unit is usually implemented as a RAID array with software that implements the RPC interface.
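As an aside on the logical-block mapping described in Section 10.2, the in-theory conversion from a logical block number to the old-style cylinder/surface/sector address can be sketched as follows. The geometry constants are assumed purely for illustration; as noted earlier, real drives remap defective sectors and vary the number of sectors per track, so this translation is not generally possible in practice.

public class DiskAddress {
    // Idealized geometry; real drives hide their true layout.
    static final int SURFACES = 4;             // tracks per cylinder
    static final int SECTORS_PER_TRACK = 64;

    // Converts a logical block number into (cylinder, surface, sector).
    static int[] toCHS(long lba) {
        long blocksPerCylinder = (long) SURFACES * SECTORS_PER_TRACK;
        int cylinder = (int) (lba / blocksPerCylinder);
        int surface  = (int) ((lba % blocksPerCylinder) / SECTORS_PER_TRACK);
        int sector   = (int) (lba % SECTORS_PER_TRACK);
        return new int[] { cylinder, surface, sector };
    }

    public static void main(String[] args) {
        int[] chs = toCHS(100_000L);
        System.out.printf("cylinder=%d surface=%d sector=%d%n",
                          chs[0], chs[1], chs[2]);
    }
}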
  • 244. 472 Chapter 10 Mass-Storage Structure NAS client NAS client client LAN/WAN Figure 10.2 Network-attached storage. Network-attached storage provides a convenient way for all the computers on a LAN to share a pool of storage with the same ease of naming and access enjoyed with local host-attached storage. However, it tends to be less efficient and have lower performance than some direct-attached storage options. iSCSI is the latest network-attached storage protocol. In essence, it uses the IP network protocol to carry the SCSI protocol. Thus, networks—rather than SCSI cables—can be used as the interconnects between hosts and their storage. As a result, hosts can treat their storage as if it were directly attached, even if the storage is distant from the host. 10.3.3 Storage-Area Network One drawback of network-attached storage systems is that the storage I/O operations consume bandwidth on the data network, thereby increasing the latency of network communication. This problem can be particularly acute in large client–server installations—the communication between servers and clients competes for bandwidth with the communication among servers and storage devices. A storage-area network (SAN) is a private network (using storage protocols rather than networking protocols) connecting servers and storage units, as shown in Figure 10.3. The power of a SAN lies in its flexibility. Multiple hosts and multiple storage arrays can attach to the same SAN, and storage can be dynamically allocated to hosts. A SAN switch allows or prohibits access between the hosts and the storage. As one example, if a host is running low on disk space, the SAN can be configured to allocate more storage to that host. SANs make it possible for clusters of servers to share the same storage and for storage arrays to include multiple direct host connections. SANs typically have more ports—as well as more expensive ports—than storage arrays. FC is the most common SAN interconnect, although the simplicity of iSCSI is increasing its use. Another SAN interconnect is InfiniBand — a special-purpose bus architecture that provides hardware and software support for high-speed interconnection networks for servers and storage units. 10.4 Disk Scheduling One of the responsibilities of the operating system is to use the hardware efficiently. For the disk drives, meeting this responsibility entails having fast
  • 245. 10.4 Disk Scheduling 473 LAN/WAN storage array storage array data-processing center web content provider server client client client server tape library SAN Figure 10.3 Storage-area network. access time and large disk bandwidth. For magnetic disks, the access time has two major components, as mentioned in Section 10.1.1. The seek time is the time for the disk arm to move the heads to the cylinder containing the desired sector. The rotational latency is the additional time for the disk to rotate the desired sector to the disk head. The disk bandwidth is the total number of bytes transferred, divided by the total time between the first request for service and the completion of the last transfer. We can improve both the access time and the bandwidth by managing the order in which disk I/O requests are serviced. Whenever a process needs I/O to or from the disk, it issues a system call to the operating system. The request specifies several pieces of information: • Whether this operation is input or output • What the disk address for the transfer is • What the memory address for the transfer is • What the number of sectors to be transferred is If the desired disk drive and controller are available, the request can be serviced immediately. If the drive or controller is busy, any new requests for service will be placed in the queue of pending requests for that drive. For a multiprogramming system with many processes, the disk queue may often have several pending requests. Thus, when one request is completed, the operating system chooses which pending request to service next. How does the operating system make this choice? Any one of several disk-scheduling algorithms can be used, and we discuss them next. 10.4.1 FCFS Scheduling The simplest form of disk scheduling is, of course, the first-come, first-served (FCFS) algorithm. This algorithm is intrinsically fair, but it generally does not provide the fastest service. Consider, for example, a disk queue with requests for I/O to blocks on cylinders 98, 183, 37, 122, 14, 124, 65, 67,
in that order. If the disk head is initially at cylinder 53, it will first move from 53 to 98, then to 183, 37, 122, 14, 124, 65, and finally to 67, for a total head movement of 640 cylinders. This schedule is diagrammed in Figure 10.4. The wild swing from 122 to 14 and then back to 124 illustrates the problem with this schedule. If the requests for cylinders 37 and 14 could be serviced together, before or after the requests for 122 and 124, the total head movement could be decreased substantially, and performance could be thereby improved.

Figure 10.4 FCFS disk scheduling (queue: 98, 183, 37, 122, 14, 124, 65, 67; head starts at 53).

10.4.2 SSTF Scheduling

It seems reasonable to service all the requests close to the current head position before moving the head far away to service other requests. This assumption is the basis for the shortest-seek-time-first (SSTF) algorithm. The SSTF algorithm selects the request with the least seek time from the current head position. In other words, SSTF chooses the pending request closest to the current head position.

For our example request queue, the closest request to the initial head position (53) is at cylinder 65. Once we are at cylinder 65, the next closest request is at cylinder 67. From there, the request at cylinder 37 is closer than the one at 98, so 37 is served next. Continuing, we service the request at cylinder 14, then 98, 122, 124, and finally 183 (Figure 10.5). This scheduling method results in a total head movement of only 236 cylinders—little more than one-third of the distance needed for FCFS scheduling of this request queue. Clearly, this algorithm gives a substantial improvement in performance.

SSTF scheduling is essentially a form of shortest-job-first (SJF) scheduling; and like SJF scheduling, it may cause starvation of some requests. Remember that requests may arrive at any time. Suppose that we have two requests in the queue, for cylinders 14 and 186, and while the request from 14 is being serviced, a new request near 14 arrives. This new request will be serviced next, making the request at 186 wait. While this request is being serviced, another request close to 14 could arrive. In theory, a continual stream of requests near one another could cause the request for cylinder 186 to wait indefinitely.
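The head-movement totals quoted above can be reproduced with a short simulation. The code is illustrative only, not from the text:

import java.util.*;

public class DiskSchedulers {
    static int fcfs(int start, int[] queue) {
        int moves = 0, head = start;
        for (int cyl : queue) {          // service requests in arrival order
            moves += Math.abs(cyl - head);
            head = cyl;
        }
        return moves;
    }

    static int sstf(int start, int[] queue) {
        List<Integer> pending = new ArrayList<>();
        for (int cyl : queue) pending.add(cyl);
        int moves = 0, head = start;
        while (!pending.isEmpty()) {
            // pick the pending request closest to the current head position
            int best = 0;
            for (int i = 1; i < pending.size(); i++)
                if (Math.abs(pending.get(i) - head) < Math.abs(pending.get(best) - head))
                    best = i;
            int next = pending.remove(best);
            moves += Math.abs(next - head);
            head = next;
        }
        return moves;
    }

    public static void main(String[] args) {
        int[] queue = {98, 183, 37, 122, 14, 124, 65, 67};
        System.out.println("FCFS: " + fcfs(53, queue));   // 640 cylinders
        System.out.println("SSTF: " + sstf(53, queue));   // 236 cylinders
    }
}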
  • 247. 10.4 Disk Scheduling 475 0 14 37 536567 98 122124 183199 queue 98, 183, 37, 122, 14, 124, 65, 67 head starts at 53 Figure 10.5 SSTF disk scheduling. This scenario becomes increasingly likely as the pending-request queue grows longer. Although the SSTF algorithm is a substantial improvement over the FCFS algorithm, it is not optimal. In the example, we can do better by moving the head from 53 to 37, even though the latter is not closest, and then to 14, before turning around to service 65, 67, 98, 122, 124, and 183. This strategy reduces the total head movement to 208 cylinders. 10.4.3 SCAN Scheduling In the SCAN algorithm, the disk arm starts at one end of the disk and moves toward the other end, servicing requests as it reaches each cylinder, until it gets to the other end of the disk. At the other end, the direction of head movement is reversed, and servicing continues. The head continuously scans back and forth across the disk. The SCAN algorithm is sometimes called the elevator algorithm, since the disk arm behaves just like an elevator in a building, first servicing all the requests going up and then reversing to service requests the other way. Let’s return to our example to illustrate. Before applying SCAN to schedule the requests on cylinders 98, 183, 37, 122, 14, 124, 65, and 67, we need to know the direction of head movement in addition to the head’s current position. Assuming that the disk arm is moving toward 0 and that the initial head position is again 53, the head will next service 37 and then 14. At cylinder 0, the arm will reverse and will move toward the other end of the disk, servicing the requests at 65, 67, 98, 122, 124, and 183 (Figure 10.6). If a request arrives in the queue just in front of the head, it will be serviced almost immediately; a request arriving just behind the head will have to wait until the arm moves to the end of the disk, reverses direction, and comes back. Assuming a uniform distribution of requests for cylinders, consider the density of requests when the head reaches one end and reverses direction. At this point, relatively few requests are immediately in front of the head, since these cylinders have recently been serviced. The heaviest density of requests
is at the other end of the disk. These requests have also waited the longest, so why not go there first? That is the idea of the next algorithm.

Figure 10.6 SCAN disk scheduling (queue: 98, 183, 37, 122, 14, 124, 65, 67; head starts at 53).

10.4.4 C-SCAN Scheduling

Circular SCAN (C-SCAN) scheduling is a variant of SCAN designed to provide a more uniform wait time. Like SCAN, C-SCAN moves the head from one end of the disk to the other, servicing requests along the way. When the head reaches the other end, however, it immediately returns to the beginning of the disk without servicing any requests on the return trip (Figure 10.7). The C-SCAN scheduling algorithm essentially treats the cylinders as a circular list that wraps around from the final cylinder to the first one.

Figure 10.7 C-SCAN disk scheduling (queue: 98, 183, 37, 122, 14, 124, 65, 67; head starts at 53).
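A sketch of SCAN for the same request queue, assuming the arm starts at cylinder 53 and is moving toward cylinder 0 (illustrative code; the count stops once the last pending request has been serviced):

import java.util.*;

public class ScanScheduler {
    // SCAN ("elevator"): service requests toward cylinder 0, then reverse.
    static int scanTowardZero(int start, int[] queue) {
        List<Integer> down = new ArrayList<>(), up = new ArrayList<>();
        for (int cyl : queue)
            (cyl <= start ? down : up).add(cyl);
        down.sort(Comparator.reverseOrder());   // 37, 14
        up.sort(Comparator.naturalOrder());     // 65, 67, 98, 122, 124, 183

        int moves = 0, head = start;
        for (int cyl : down) { moves += head - cyl; head = cyl; }
        moves += head;        // continue to cylinder 0 before reversing
        head = 0;
        for (int cyl : up)   { moves += cyl - head; head = cyl; }
        return moves;
    }

    public static void main(String[] args) {
        int[] queue = {98, 183, 37, 122, 14, 124, 65, 67};
        System.out.println("SCAN: " + scanTowardZero(53, queue));
    }
}

Here the arm sweeps from 53 down to cylinder 0 and then back up until the last pending request (183) is serviced, for 53 + 183 = 236 cylinders in total. The LOOK variant described in Section 10.4.5 would reverse at cylinder 14 rather than 0, reducing the total to 208 cylinders.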
  • 249. 10.4 Disk Scheduling 477 10.4.5 LOOK Scheduling As we described them, both SCAN and C-SCAN move the disk arm across the full width of the disk. In practice, neither algorithm is often implemented this way. More commonly, the arm goes only as far as the final request in each direction. Then, it reverses direction immediately, without going all the way to the end of the disk. Versions of SCAN and C-SCAN that follow this pattern are called LOOK and C-LOOK scheduling, because they look for a request before continuing to move in a given direction (Figure 10.8). 10.4.6 Selection of a Disk-Scheduling Algorithm Given so many disk-scheduling algorithms, how do we choose the best one? SSTF is common and has a natural appeal because it increases performance over FCFS. SCAN and C-SCAN perform better for systems that place a heavy load on the disk, because they are less likely to cause a starvation problem. For any particular list of requests, we can define an optimal order of retrieval, but the computation needed to find an optimal schedule may not justify the savings over SSTF or SCAN. With any scheduling algorithm, however, performance depends heavily on the number and types of requests. For instance, suppose that the queue usually has just one outstanding request. Then, all scheduling algorithms behave the same, because they have only one choice of where to move the disk head: they all behave like FCFS scheduling. Requests for disk service can be greatly influenced by the file-allocation method. A program reading a contiguously allocated file will generate several requests that are close together on the disk, resulting in limited head movement. A linked or indexed file, in contrast, may include blocks that are widely scattered on the disk, resulting in greater head movement. The location of directories and index blocks is also important. Since every file must be opened to be used, and opening a file requires searching the directory structure, the directories will be accessed frequently. Suppose that a directory entry is on the first cylinder and a file’s data are on the final cylinder. In this case, the disk head has to move the entire width of the disk. If the directory 0 14 37 536567 98 122124 183199 queue = 98, 183, 37, 122, 14, 124, 65, 67 head starts at 53 Figure 10.8 C-LOOK disk scheduling.
  • 250. 478 Chapter 10 Mass-Storage Structure DISK SCHEDULING and SSDs The disk-scheduling algorithms discussed in this section focus primarily on minimizing the amount of disk head movement in magnetic disk drives. SSDs—which do not contain moving disk heads—commonly use a simple FCFS policy. For example, the Linux Noop scheduler uses an FCFS policy but modifies it to merge adjacent requests. The observed behavior of SSDs indicates that the time required to service reads is uniform but that, because of the properties of flash memory, write service time is not uniform. Some SSD schedulers have exploited this property and merge only adjacent write requests, servicing all read requests in FCFS order. entry were on the middle cylinder, the head would have to move only one-half the width. Caching the directories and index blocks in main memory can also help to reduce disk-arm movement, particularly for read requests. Because of these complexities, the disk-scheduling algorithm should be written as a separate module of the operating system, so that it can be replaced with a different algorithm if necessary. Either SSTF or LOOK is a reasonable choice for the default algorithm. The scheduling algorithms described here consider only the seek distances. For modern disks, the rotational latency can be nearly as large as the average seek time. It is difficult for the operating system to schedule for improved rotational latency, though, because modern disks do not disclose the physical location of logical blocks. Disk manufacturers have been alleviating this problem by implementing disk-scheduling algorithms in the controller hardware built into the disk drive. If the operating system sends a batch of requests to the controller, the controller can queue them and then schedule them to improve both the seek time and the rotational latency. If I/O performance were the only consideration, the operating system would gladly turn over the responsibility of disk scheduling to the disk hard- ware. In practice, however, the operating system may have other constraints on the service order for requests. For instance, demand paging may take priority over application I/O, and writes are more urgent than reads if the cache is running out of free pages. Also, it may be desirable to guarantee the order of a set of disk writes to make the file system robust in the face of system crashes. Consider what could happen if the operating system allocated a disk page to a file and the application wrote data into that page before the operating system had a chance to flush the file system metadata back to disk. To accommodate such requirements, an operating system may choose to do its own disk scheduling and to spoon-feed the requests to the disk controller, one by one, for some types of I/O. 10.5 Disk Management The operating system is responsible for several other aspects of disk manage- ment, too. Here we discuss disk initialization, booting from disk, and bad-block recovery.
  • 251. 11 C H A P T E R File-System Interface For most users, the file system is the most visible aspect of an operating system. It provides the mechanism for on-line storage of and access to both data and programs of the operating system and all the users of the computer system. The file system consists of two distinct parts: a collection of files, each storing related data, and a directory structure, which organizes and provides information about all the files in the system. File systems live on devices, which we described in the preceding chapter and will continue to discuss in the following one. In this chapter, we consider the various aspects of files and the major directory structures. We also discuss the semantics of sharing files among multiple processes, users, and computers. Finally, we discuss ways to handle file protection, necessary when we have multiple users and we want to control who may access files and how files may be accessed. CHAPTER OBJECTIVES • To explain the function of file systems. • To describe the interfaces to file systems. • To discuss file-system design tradeoffs, including access methods, file sharing, file locking, and directory structures. • To explore file-system protection. 11.1 File Concept Computers can store information on various storage media, such as magnetic disks, magnetic tapes, and optical disks. So that the computer system will be convenient to use, the operating system provides a uniform logical view of stored information. The operating system abstracts from the physical properties of its storage devices to define a logical storage unit, the file. Files are mapped by the operating system onto physical devices. These storage devices are usually nonvolatile, so the contents are persistent between system reboots. 503
  • 252. 504 Chapter 11 File-System Interface A file is a named collection of related information that is recorded on secondary storage. From a user’s perspective, a file is the smallest allotment of logical secondary storage; that is, data cannot be written to secondary storage unless they are within a file. Commonly, files represent programs (both source and object forms) and data. Data files may be numeric, alphabetic, alphanumeric, or binary. Files may be free form, such as text files, or may be formatted rigidly. In general, a file is a sequence of bits, bytes, lines, or records, the meaning of which is defined by the file’s creator and user. The concept of a file is thus extremely general. The information in a file is defined by its creator. Many different types of information may be stored in a file—source or executable programs, numeric or text data, photos, music, video, and so on. A file has a certain defined structure, which depends on its type. A text file is a sequence of characters organized into lines (and possibly pages). A source file is a sequence of functions, each of which is further organized as declarations followed by executable statements. An executable file is a series of code sections that the loader can bring into memory and execute. 11.1.1 File Attributes A file is named, for the convenience of its human users, and is referred to by its name. A name is usually a string of characters, such as example.c. Some systems differentiate between uppercase and lowercase characters in names, whereas other systems do not. When a file is named, it becomes independent of the process, the user, and even the system that created it. For instance, one user might create the file example.c, and another user might edit that file by specifying its name. The file’s owner might write the file to a USB disk, send it as an e-mail attachment, or copy it across a network, and it could still be called example.c on the destination system. A file’s attributes vary from one operating system to another but typically consist of these: • Name. The symbolic file name is the only information kept in human- readable form. • Identifier. This unique tag, usually a number, identifies the file within the file system; it is the non-human-readable name for the file. • Type. This information is needed for systems that support different types of files. • Location. This information is a pointer to a device and to the location of the file on that device. • Size. The current size of the file (in bytes, words, or blocks) and possibly the maximum allowed size are included in this attribute. • Protection. Access-control information determines who can do reading, writing, executing, and so on. • Time, date, and user identification. This information may be kept for creation, last modification, and last use. These data can be useful for protection, security, and usage monitoring.
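As a concrete illustration, several of these attributes can be read programmatically with the standard java.nio.file API. The file name example.c is assumed to exist on the local system:

import java.io.IOException;
import java.nio.file.*;
import java.nio.file.attribute.BasicFileAttributes;

public class FileAttributesDemo {
    public static void main(String[] args) throws IOException {
        Path p = Paths.get("example.c");   // assumed to exist

        BasicFileAttributes attrs =
            Files.readAttributes(p, BasicFileAttributes.class);

        System.out.println("name:     " + p.getFileName());        // Name
        System.out.println("size:     " + attrs.size());           // Size (bytes)
        System.out.println("created:  " + attrs.creationTime());   // Time and date
        System.out.println("modified: " + attrs.lastModifiedTime());
        System.out.println("writable: " + Files.isWritable(p));    // Protection (in part)
    }
}

The identifier, type, and location attributes are maintained internally by the file system and are not normally exposed this directly to applications.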
  • 253. 11.1 File Concept 505 Figure 11.1 A file info window on Mac OS X. Some newer file systems also support extended file attributes, including character encoding of the file and security features such as a file checksum. Figure 11.1 illustrates a file info window on Mac OS X, which displays a file’s attributes. The information about all files is kept in the directory structure, which also resides on secondary storage. Typically, a directory entry consists of the file’s name and its unique identifier. The identifier in turn locates the other
  • 254. 506 Chapter 11 File-System Interface file attributes. It may take more than a kilobyte to record this information for each file. In a system with many files, the size of the directory itself may be megabytes. Because directories, like files, must be nonvolatile, they must be stored on the device and brought into memory piecemeal, as needed. 11.1.2 File Operations A file is an abstract data type. To define a file properly, we need to consider the operations that can be performed on files. The operating system can provide system calls to create, write, read, reposition, delete, and truncate files. Let’s examine what the operating system must do to perform each of these six basic file operations. It should then be easy to see how other similar operations, such as renaming a file, can be implemented. • Creating a file. Two steps are necessary to create a file. First, space in the file system must be found for the file. We discuss how to allocate space for the file in Chapter 12. Second, an entry for the new file must be made in the directory. • Writing a file. To write a file, we make a system call specifying both the name of the file and the information to be written to the file. Given the name of the file, the system searches the directory to find the file’s location. The system must keep a write pointer to the location in the file where the next write is to take place. The write pointer must be updated whenever a write occurs. • Reading a file. To read from a file, we use a system call that specifies the name of the file and where (in memory) the next block of the file should be put. Again, the directory is searched for the associated entry, and the system needs to keep a read pointer to the location in the file where the next read is to take place. Once the read has taken place, the read pointer is updated. Because a process is usually either reading from or writing to a file, the current operation location can be kept as a per-process current- file-position pointer. Both the read and write operations use this same pointer, saving space and reducing system complexity. • Repositioning within a file. The directory is searched for the appropriate entry, and the current-file-position pointer is repositioned to a given value. Repositioning within a file need not involve any actual I/O. This file operation is also known as a file seek. • Deleting a file. To delete a file, we search the directory for the named file. Having found the associated directory entry, we release all file space, so that it can be reused by other files, and erase the directory entry. • Truncating a file. The user may want to erase the contents of a file but keep its attributes. Rather than forcing the user to delete the file and then recreate it, this function allows all attributes to remain unchanged—except for file length—but lets the file be reset to length zero and its file space released. These six basic operations comprise the minimal set of required file operations. Other common operations include appending new information
  • 255. 11.1 File Concept 507 to the end of an existing file and renaming an existing file. These primitive operations can then be combined to perform other file operations. For instance, we can create a copy of a file—or copy the file to another I/O device, such as a printer or a display—by creating a new file and then reading from the old and writing to the new. We also want to have operations that allow a user to get and set the various attributes of a file. For example, we may want to have operations that allow a user to determine the status of a file, such as the file’s length, and to set file attributes, such as the file’s owner. Most of the file operations mentioned involve searching the directory for the entry associated with the named file. To avoid this constant searching, many systems require that an open() system call be made before a file is first used. The operating system keeps a table, called the open-file table, containing information about all open files. When a file operation is requested, the file is specified via an index into this table, so no searching is required. When the file is no longer being actively used, it is closed by the process, and the operating system removes its entry from the open-file table. create() and delete() are system calls that work with closed rather than open files. Some systems implicitly open a file when the first reference to it is made. The file is automatically closed when the job or program that opened the file terminates. Most systems, however, require that the programmer open a file explicitly with the open() system call before that file can be used. The open() operation takes a file name and searches the directory, copying the directory entry into the open-file table. The open() call can also accept access- mode information—create, read-only, read–write, append-only, and so on. This mode is checked against the file’s permissions. If the request mode is allowed, the file is opened for the process. The open() system call typically returns a pointer to the entry in the open-file table. This pointer, not the actual file name, is used in all I/O operations, avoiding any further searching and simplifying the system-call interface. The implementation of the open() and close() operations is more complicated in an environment where several processes may open the file simultaneously. This may occur in a system where several different applications open the same file at the same time. Typically, the operating system uses two levels of internal tables: a per-process table and a system-wide table. The per- process table tracks all files that a process has open. Stored in this table is information regarding the process’s use of the file. For instance, the current file pointer for each file is found here. Access rights to the file and accounting information can also be included. Each entry in the per-process table in turn points to a system-wide open-file table. The system-wide table contains process-independent information, such as the location of the file on disk, access dates, and file size. Once a file has been opened by one process, the system-wide table includes an entry for the file. When another process executes an open() call, a new entry is simply added to the process’s open-file table pointing to the appropriate entry in the system-wide table. Typically, the open-file table also has an open count associated with each file to indicate how many processes have the file open. 
Each close() decreases this open count, and when the open count reaches zero, the file is no longer in use, and the file’s entry is removed from the open-file table.
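The two-level arrangement just described might be sketched with the following illustrative data structures. The class and field names are hypothetical, not taken from any particular operating system, and a real system would keep one per-process table for each process and considerably more state per entry:

import java.util.*;

public class OpenFileTables {
    // One entry in the system-wide open-file table: process-independent state.
    static class SystemEntry {
        final long diskLocation;   // where the file lives on disk
        long fileSize;
        int openCount;             // how many processes have this file open
        SystemEntry(long diskLocation, long fileSize) {
            this.diskLocation = diskLocation;
            this.fileSize = fileSize;
        }
    }

    // One entry in a per-process table: this process's use of the file.
    static class ProcessEntry {
        final SystemEntry file;    // points into the system-wide table
        long filePosition;         // current-file-position pointer
        final String accessMode;   // e.g. "r" or "rw"
        ProcessEntry(SystemEntry file, String accessMode) {
            this.file = file;
            this.accessMode = accessMode;
        }
    }

    final Map<Long, SystemEntry> systemWide = new HashMap<>();     // keyed by file identifier
    final Map<Integer, ProcessEntry> perProcess = new HashMap<>(); // keyed by descriptor (one map per process in reality)

    // open(): create or reuse the system-wide entry, bump its open count,
    // and give the process its own entry with a fresh position pointer.
    void open(long fileId, long diskLocation, long size, String mode, int fd) {
        SystemEntry se = systemWide.computeIfAbsent(fileId,
                id -> new SystemEntry(diskLocation, size));
        se.openCount++;
        perProcess.put(fd, new ProcessEntry(se, mode));
    }

    // close(): drop the per-process entry; remove the system-wide entry on last close.
    void close(int fd) {
        ProcessEntry pe = perProcess.remove(fd);
        if (pe != null && --pe.file.openCount == 0)
            systemWide.values().remove(pe.file);
    }
}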
  • 256. 508 Chapter 11 File-System Interface In summary, several pieces of information are associated with an open file. • File pointer. On systems that do not include a file offset as part of the read() and write() system calls, the system must track the last read– write location as a current-file-position pointer. This pointer is unique to each process operating on the file and therefore must be kept separate from the on-disk file attributes. • File-open count. As files are closed, the operating system must reuse its open-file table entries, or it could run out of space in the table. Multiple processes may have opened a file, and the system must wait for the last file to close before removing the open-file table entry. The file-open count tracks the number of opens and closes and reaches zero on the last close. The system can then remove the entry. • Disk location of the file. Most file operations require the system to modify data within the file. The information needed to locate the file on disk is kept in memory so that the system does not have to read it from disk for each operation. • Access rights. Each process opens a file in an access mode. This information is stored on the per-process table so the operating system can allow or deny subsequent I/O requests. Some operating systems provide facilities for locking an open file (or sections of a file). File locks allow one process to lock a file and prevent other processes from gaining access to it. File locks are useful for files that are shared by several processes—for example, a system log file that can be accessed and modified by a number of processes in the system. File locks provide functionality similar to reader–writer locks, covered in Section 5.7.2. A shared lock is akin to a reader lock in that several processes can acquire the lock concurrently. An exclusive lock behaves like a writer lock; only one process at a time can acquire such a lock. It is important to note that not all operating systems provide both types of locks: some systems only provide exclusive file locking. FILE LOCKING IN JAVA In the Java API, acquiring a lock requires first obtaining the FileChannel for the file to be locked. The lock() method of the FileChannel is used to acquire the lock. The API of the lock() method is FileLock lock(long begin, long end, boolean shared) where begin and end are the beginning and ending positions of the region being locked. Setting shared to true is for shared locks; setting shared to false acquires the lock exclusively. The lock is released by invoking the release() of the FileLock returned by the lock() operation. The program in Figure 11.2 illustrates file locking in Java. This program acquires two locks on the file file.txt. The first half of the file is acquired as an exclusive lock; the lock for the second half is a shared lock.
FILE LOCKING IN JAVA (Continued)

import java.io.*;
import java.nio.channels.*;

public class LockingExample {
    public static final boolean EXCLUSIVE = false;
    public static final boolean SHARED = true;

    public static void main(String args[]) throws IOException {
        FileLock sharedLock = null;
        FileLock exclusiveLock = null;
        try {
            RandomAccessFile raf = new RandomAccessFile("file.txt", "rw");

            // get the channel for the file
            FileChannel ch = raf.getChannel();

            // this locks the first half of the file - exclusive
            exclusiveLock = ch.lock(0, raf.length()/2, EXCLUSIVE);
            /** Now modify the data . . . */
            // release the lock
            exclusiveLock.release();

            // this locks the second half of the file - shared
            sharedLock = ch.lock(raf.length()/2+1, raf.length(), SHARED);
            /** Now read the data . . . */
            // release the lock
            sharedLock.release();
        } catch (java.io.IOException ioe) {
            System.err.println(ioe);
        } finally {
            if (exclusiveLock != null)
                exclusiveLock.release();
            if (sharedLock != null)
                sharedLock.release();
        }
    }
}

Figure 11.2 File-locking example in Java.

Furthermore, operating systems may provide either mandatory or advisory file-locking mechanisms. If a lock is mandatory, then once a process acquires an exclusive lock, the operating system will prevent any other process
  • 258. 510 Chapter 11 File-System Interface from accessing the locked file. For example, assume a process acquires an exclusive lock on the file system.log. If we attempt to open system.log from another process—for example, a text editor—the operating system will prevent access until the exclusive lock is released. This occurs even if the text editor is not written explicitly to acquire the lock. Alternatively, if the lock is advisory, then the operating system will not prevent the text editor from acquiring access to system.log. Rather, the text editor must be written so that it manually acquires the lock before accessing the file. In other words, if the locking scheme is mandatory, the operating system ensures locking integrity. For advisory locking, it is up to software developers to ensure that locks are appropriately acquired and released. As a general rule, Windows operating systems adopt mandatory locking, and UNIX systems employ advisory locks. The use of file locks requires the same precautions as ordinary process synchronization. For example, programmers developing on systems with mandatory locking must be careful to hold exclusive file locks only while they are accessing the file. Otherwise, they will prevent other processes from accessing the file as well. Furthermore, some measures must be taken to ensure that two or more processes do not become involved in a deadlock while trying to acquire file locks. 11.1.3 File Types When we design a file system—indeed, an entire operating system—we always consider whether the operating system should recognize and support file types. If an operating system recognizes the type of a file, it can then operate on the file in reasonable ways. For example, a common mistake occurs when a user tries to output the binary-object form of a program. This attempt normally produces garbage; however, the attempt can succeed if the operating system has been told that the file is a binary-object program. A common technique for implementing file types is to include the type as part of the file name. The name is split into two parts—a name and an extension, usually separated by a period (Figure 11.3). In this way, the user and the operating system can tell from the name alone what the type of a file is. Most operating systems allow users to specify a file name as a sequence of characters followed by a period and terminated by an extension made up of additional characters. Examples include resume.docx, server.c, and ReaderThread.cpp. The system uses the extension to indicate the type of the file and the type of operations that can be done on that file. Only a file with a .com, .exe, or .sh extension can be executed, for instance. The .com and .exe files are two forms of binary executable files, whereas the .sh file is a shell script containing, in ASCII format, commands to the operating system. Application programs also use extensions to indicate file types in which they are interested. For example, Java compilers expect source files to have a .java extension, and the Microsoft Word word processor expects its files to end with a .doc or .docx extension. These extensions are not always required, so a user may specify a file without the extension (to save typing), and the application will look for a file with the given name and the extension it expects. Because these extensions are not supported by the operating system, they can be considered “hints” to the applications that operate on them.
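Under advisory locking, a cooperating process is expected to check for the lock itself before touching the file. A minimal sketch using the same FileChannel API as Figure 11.2 follows; the file name system.log is illustrative, and whether the resulting lock is advisory or mandatory depends on the platform (typically advisory on UNIX systems and mandatory on Windows):

import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;

public class AdvisoryCheck {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile("system.log", "rw");
             FileChannel ch = raf.getChannel()) {

            // tryLock() returns null immediately if another program
            // already holds an overlapping lock, instead of blocking.
            FileLock lock = ch.tryLock();
            if (lock == null) {
                System.out.println("file is locked elsewhere; backing off");
                return;
            }
            try {
                // ... modify the file while holding the exclusive lock ...
            } finally {
                lock.release();
            }
        }
    }
}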
  • 259. 11.1 File Concept 511 file type usual extension function ready-to-run machine- language program executable exe, com, bin or none compiled, machine language, not linked object obj, o binary file containing audio or A/V information multimedia mpeg, mov, mp3, mp4, avi related files grouped into one file, sometimes com- pressed, for archiving or storage archive rar, zip, tar ASCII or binary file in a format for printing or viewing print or view gif, pdf, jpg libraries of routines for programmers library lib, a, so, dll various word-processor formats word processor docx commands to the command interpreter batch bat, sh textual data, documents markup xml, html, tex source code in various languages source code c, cc, java, perl, asm xml, rtf, Figure 11.3 Common file types. Consider, too, the Mac OS X operating system. In this system, each file has a type, such as .app (for application). Each file also has a creator attribute containing the name of the program that created it. This attribute is set by the operating system during the create() call, so its use is enforced and supported by the system. For instance, a file produced by a word processor has the word processor’s name as its creator. When the user opens that file, by double-clicking the mouse on the icon representing the file, the word processor is invoked automatically and the file is loaded, ready to be edited. The UNIX system uses a crude magic number stored at the beginning of some files to indicate roughly the type of the file—executable program, shell script, PDF file, and so on. Not all files have magic numbers, so system features cannot be based solely on this information. UNIX does not record the name of the creating program, either. UNIX does allow file-name-extension hints, but these extensions are neither enforced nor depended on by the operating system; they are meant mostly to aid users in determining what type of contents the file contains. Extensions can be used or ignored by a given application, but that is up to the application’s programmer. 11.1.4 File Structure File types also can be used to indicate the internal structure of the file. As mentioned in Section 11.1.3, source and object files have structures that match the expectations of the programs that read them. Further, certain files must
  • 260. 512 Chapter 11 File-System Interface conform to a required structure that is understood by the operating system. For example, the operating system requires that an executable file have a specific structure so that it can determine where in memory to load the file and what the location of the first instruction is. Some operating systems extend this idea into a set of system-supported file structures, with sets of special operations for manipulating files with those structures. This point brings us to one of the disadvantages of having the operating system support multiple file structures: the resulting size of the operating system is cumbersome. If the operating system defines five different file structures, it needs to contain the code to support these file structures. In addition, it may be necessary to define every file as one of the file types supported by the operating system. When new applications require information structured in ways not supported by the operating system, severe problems may result. For example, assume that a system supports two types of files: text files (composed of ASCII characters separated by a carriage return and line feed) and executable binary files. Now, if we (as users) want to define an encrypted file to protect the contents from being read by unauthorized people, we may find neither file type to be appropriate. The encrypted file is not ASCII text lines but rather is (apparently) random bits. Although it may appear to be a binary file, it is not executable. As a result, we may have to circumvent or misuse the operating system’s file-type mechanism or abandon our encryption scheme. Some operating systems impose (and support) a minimal number of file structures. This approach has been adopted in UNIX, Windows, and others. UNIX considers each file to be a sequence of 8-bit bytes; no interpretation of these bits is made by the operating system. This scheme provides maximum flexibility but little support. Each application program must include its own code to interpret an input file as to the appropriate structure. However, all operating systems must support at least one structure—that of an executable file—so that the system is able to load and run programs. 11.1.5 Internal File Structure Internally, locating an offset within a file can be complicated for the operating system. Disk systems typically have a well-defined block size determined by the size of a sector. All disk I/O is performed in units of one block (physical record), and all blocks are the same size. It is unlikely that the physical record size will exactly match the length of the desired logical record. Logical records may even vary in length. Packing a number of logical records into physical blocks is a common solution to this problem. For example, the UNIX operating system defines all files to be simply streams of bytes. Each byte is individually addressable by its offset from the beginning (or end) of the file. In this case, the logical record size is 1 byte. The file system automatically packs and unpacks bytes into physical disk blocks— say, 512 bytes per block—as necessary. The logical record size, physical block size, and packing technique deter- mine how many logical records are in each physical block. The packing can be done either by the user’s application program or by the operating system. In either case, the file may be considered a sequence of blocks. All the basic I/O
• 261. 11.2 Access Methods 513
Figure 11.4 Sequential-access file (the current position advances from the beginning toward the end on each read or write; rewind returns it to the beginning).
functions operate in terms of blocks. The conversion from logical records to physical blocks is a relatively simple software problem. Because disk space is always allocated in blocks, some portion of the last block of each file is generally wasted. If each block were 512 bytes, for example, then a file of 1,949 bytes would be allocated four blocks (2,048 bytes); the last 99 bytes would be wasted. The waste incurred to keep everything in units of blocks (instead of bytes) is internal fragmentation. All file systems suffer from internal fragmentation; the larger the block size, the greater the internal fragmentation. 11.2 Access Methods Files store information. When it is used, this information must be accessed and read into computer memory. The information in the file can be accessed in several ways. Some systems provide only one access method for files, while others support many access methods, and choosing the right one for a particular application is a major design problem. 11.2.1 Sequential Access The simplest access method is sequential access. Information in the file is processed in order, one record after the other. This mode of access is by far the most common; for example, editors and compilers usually access files in this fashion. Reads and writes make up the bulk of the operations on a file. A read operation—read_next()—reads the next portion of the file and automatically advances a file pointer, which tracks the I/O location. Similarly, the write operation—write_next()—appends to the end of the file and advances to the end of the newly written material (the new end of file). Such a file can be reset to the beginning, and on some systems, a program may be able to skip forward or backward n records for some integer n—perhaps only for n = 1. Sequential access, which is depicted in Figure 11.4, is based on a tape model of a file and works as well on sequential-access devices as it does on random-access ones. 11.2.2 Direct Access Another method is direct access (or relative access). Here, a file is made up of fixed-length logical records that allow programs to read and write records rapidly in no particular order. The direct-access method is based on a disk model of a file, since disks allow random access to any file block. For direct
  • 262. 514 Chapter 11 File-System Interface access, the file is viewed as a numbered sequence of blocks or records. Thus, we may read block 14, then read block 53, and then write block 7. There are no restrictions on the order of reading or writing for a direct-access file. Direct-access files are of great use for immediate access to large amounts of information. Databases are often of this type. When a query concerning a particular subject arrives, we compute which block contains the answer and then read that block directly to provide the desired information. As a simple example, on an airline-reservation system, we might store all the information about a particular flight (for example, flight 713) in the block identified by the flight number. Thus, the number of available seats for flight 713 is stored in block 713 of the reservation file. To store information about a larger set, such as people, we might compute a hash function on the people’s names or search a small in-memory index to determine a block to read and search. For the direct-access method, the file operations must be modified to include the block number as a parameter. Thus, we have read(n), where n is the block number, rather than read next(), and write(n) rather than write next(). An alternative approach is to retain read next() and write next(), as with sequential access, and to add an operation posi- tion file(n) where n is the block number. Then, to effect a read(n), we would position file(n) and then read next(). The block number provided by the user to the operating system is normally a relative block number. A relative block number is an index relative to the beginning of the file. Thus, the first relative block of the file is 0, the next is 1, and so on, even though the absolute disk address may be 14703 for the first block and 3192 for the second. The use of relative block numbers allows the operating system to decide where the file should be placed (called the allocation problem, as we discuss in Chapter 12) and helps to prevent the user from accessing portions of the file system that may not be part of her file. Some systems start their relative block numbers at 0; others start at 1. How, then, does the system satisfy a request for record N in a file? Assuming we have a logical record length L, the request for record N is turned into an I/O request for L bytes starting at location L ∗ (N) within the file (assuming the first record is N = 0). Since logical records are of a fixed size, it is also easy to read, write, or delete a record. Not all operating systems support both sequential and direct access for files. Some systems allow only sequential file access; others allow only direct access. Some systems require that a file be defined as sequential or direct when it is created. Such a file can be accessed only in a manner consistent with its declaration. We can easily simulate sequential access on a direct-access file by simply keeping a variable cp that defines our current position, as shown in Figure 11.5. Simulating a direct-access file on a sequential-access file, however, is extremely inefficient and clumsy. 11.2.3 Other Access Methods Other access methods can be built on top of a direct-access method. These methods generally involve the construction of an index for the file. The index, like an index in the back of a book, contains pointers to the various blocks. To
• 263. 11.3 Directory and Disk Structure 515
Figure 11.5 Simulation of sequential access on a direct-access file:
  sequential access    implementation for direct access
  reset                cp = 0;
  read_next            read cp; cp = cp + 1;
  write_next           write cp; cp = cp + 1;
find a record in the file, we first search the index and then use the pointer to access the file directly and to find the desired record. For example, a retail-price file might list the universal product codes (UPCs) for items, with the associated prices. Each record consists of a 10-digit UPC and a 6-digit price, for a 16-byte record. If our disk has 1,024 bytes per block, we can store 64 records per block. A file of 120,000 records would occupy about 2,000 blocks (2 million bytes). By keeping the file sorted by UPC, we can define an index consisting of the first UPC in each block. This index would have 2,000 entries of 10 digits each, or 20,000 bytes, and thus could be kept in memory. To find the price of a particular item, we can make a binary search of the index. From this search, we learn exactly which block contains the desired record and access that block. This structure allows us to search a large file doing little I/O. With large files, the index file itself may become too large to be kept in memory. One solution is to create an index for the index file. The primary index file contains pointers to secondary index files, which point to the actual data items. For example, IBM’s indexed sequential-access method (ISAM) uses a small master index that points to disk blocks of a secondary index. The secondary index blocks point to the actual file blocks. The file is kept sorted on a defined key. To find a particular item, we first make a binary search of the master index, which provides the block number of the secondary index. This block is read in, and again a binary search is used to find the block containing the desired record. Finally, this block is searched sequentially. In this way, any record can be located from its key by at most two direct-access reads. Figure 11.6 shows a similar situation as implemented by VMS index and relative files. 11.3 Directory and Disk Structure Next, we consider how to store files. Certainly, no general-purpose computer stores just one file. There are typically thousands, millions, even billions of files within a computer. Files are stored on random-access storage devices, including hard disks, optical disks, and solid-state (memory-based) disks. A storage device can be used in its entirety for a file system. It can also be subdivided for finer-grained control. For example, a disk can be partitioned into quarters, and each quarter can hold a separate file system. Storage devices can also be collected together into RAID sets that provide protection from the failure of a single disk (as described in Section 10.7). Sometimes, disks are subdivided and also collected into RAID sets.
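As a closing illustration of the access methods above: the indexed lookup of Section 11.2.3 amounts to a binary search of the in-memory index (one first-UPC entry per block) followed by a single direct read of the chosen block. The sketch below mirrors the sizes in the example; everything else, including the file name and the UPC being looked up, is hypothetical.

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define BLOCK_SIZE 1024      /* bytes per block, as in the example */
    #define NBLOCKS    2000      /* one index entry (first UPC) per data block */
    #define UPC_LEN    11        /* 10 digits plus a terminating NUL */

    static char first_upc[NBLOCKS][UPC_LEN];   /* index, loaded into memory beforehand */

    /* Binary-search the index for the last block whose first UPC is <= upc. */
    static int find_block(const char *upc)
    {
        int lo = 0, hi = NBLOCKS - 1, ans = 0;
        while (lo <= hi) {
            int mid = (lo + hi) / 2;
            if (strcmp(first_upc[mid], upc) <= 0) { ans = mid; lo = mid + 1; }
            else                                  { hi = mid - 1; }
        }
        return ans;
    }

    int main(void)
    {
        char block[BLOCK_SIZE];
        int fd = open("prices.dat", O_RDONLY);   /* hypothetical retail-price file */
        if (fd < 0) { perror("open"); return 1; }

        int b = find_block("0012345678");        /* the UPC being looked up */
        pread(fd, block, BLOCK_SIZE, (off_t)b * BLOCK_SIZE);   /* one direct-access read */
        /* ... search the 64 records in block[] for the matching UPC ... */
        close(fd);
        return 0;
    }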
• 264. 516 Chapter 11 File-System Interface
Figure 11.6 Example of index and relative files (an index file of last names, e.g., Adams, Arthur, Asher, ..., Smith, maps each key to a logical record number in the relative file, whose records hold last name, social-security number, and age).
Partitioning is useful for limiting the sizes of individual file systems, putting multiple file-system types on the same device, or leaving part of the device available for other uses, such as swap space or unformatted (raw) disk space. A file system can be created on each of these parts of the disk. Any entity containing a file system is generally known as a volume. The volume may be a subset of a device, a whole device, or multiple devices linked together into a RAID set. Each volume can be thought of as a virtual disk. Volumes can also store multiple operating systems, allowing a system to boot and run more than one operating system. Each volume that contains a file system must also contain information about the files in the system. This information is kept in entries in a device directory or volume table of contents. The device directory (more commonly known simply as the directory) records information—such as name, location, size, and type—for all files on that volume. Figure 11.7 shows a typical file-system organization.
Figure 11.7 A typical file-system organization (directories and their files grouped into partitions A, B, and C across disks 1, 2, and 3).
  • 265. 11.3 Directory and Disk Structure 517 / ufs /devices devfs /dev dev /system/contract ctfs /proc proc /etc/mnttab mntfs /etc/svc/volatile tmpfs /system/object objfs /lib/libc.so.1 lofs /dev/fd fd /var ufs /tmp tmpfs /var/run tmpfs /opt ufs /zpbge zfs /zpbge/backup zfs /export/home zfs /var/mail zfs /var/spool/mqueue zfs /zpbg zfs /zpbg/zones zfs Figure 11.8 Solaris file systems. 11.3.1 Storage Structure As we have just seen, a general-purpose computer system has multiple storage devices, and those devices can be sliced up into volumes that hold file systems. Computer systems may have zero or more file systems, and the file systems may be of varying types. For example, a typical Solaris system may have dozens of file systems of a dozen different types, as shown in the file system list in Figure 11.8. In this book, we consider only general-purpose file systems. It is worth noting, though, that there are many special-purpose file systems. Consider the types of file systems in the Solaris example mentioned above: • tmpfs—a “temporary” file system that is created in volatile main memory and has its contents erased if the system reboots or crashes • objfs—a “virtual” file system (essentially an interface to the kernel that looks like a file system) that gives debuggers access to kernel symbols • ctfs—a virtual file system that maintains “contract” information to manage which processes start when the system boots and must continue to run during operation • lofs—a “loop back” file system that allows one file system to be accessed in place of another one • procfs—a virtual file system that presents information on all processes as a file system • ufs, zfs—general-purpose file systems
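On a POSIX system, a running program can ask any one of these mounted file systems for its basic parameters through statvfs(); the short sketch below queries whichever file system holds /tmp (the path is only an example of a directory inside the file system of interest).

    #include <stdio.h>
    #include <sys/statvfs.h>

    int main(void)
    {
        struct statvfs vs;
        if (statvfs("/tmp", &vs) != 0) {   /* any path within the file system of interest */
            perror("statvfs");
            return 1;
        }
        printf("block size:   %lu\n", (unsigned long)vs.f_bsize);
        printf("total blocks: %llu\n", (unsigned long long)vs.f_blocks);
        printf("free blocks:  %llu\n", (unsigned long long)vs.f_bfree);
        return 0;
    }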
  • 266. 518 Chapter 11 File-System Interface The file systems of computers, then, can be extensive. Even within a file system, it is useful to segregate files into groups and manage and act on those groups. This organization involves the use of directories. In the remainder of this section, we explore the topic of directory structure. 11.3.2 Directory Overview The directory can be viewed as a symbol table that translates file names into their directory entries. If we take such a view, we see that the directory itself can be organized in many ways. The organization must allow us to insert entries, to delete entries, to search for a named entry, and to list all the entries in the directory. In this section, we examine several schemes for defining the logical structure of the directory system. When considering a particular directory structure, we need to keep in mind the operations that are to be performed on a directory: • Search for a file. We need to be able to search a directory structure to find the entry for a particular file. Since files have symbolic names, and similar names may indicate a relationship among files, we may want to be able to find all files whose names match a particular pattern. • Create a file. New files need to be created and added to the directory. • Delete a file. When a file is no longer needed, we want to be able to remove it from the directory. • List a directory. We need to be able to list the files in a directory and the contents of the directory entry for each file in the list. • Rename a file. Because the name of a file represents its contents to its users, we must be able to change the name when the contents or use of the file changes. Renaming a file may also allow its position within the directory structure to be changed. • Traverse the file system. We may wish to access every directory and every file within a directory structure. For reliability, it is a good idea to save the contents and structure of the entire file system at regular intervals. Often, we do this by copying all files to magnetic tape. This technique provides a backup copy in case of system failure. In addition, if a file is no longer in use, the file can be copied to tape and the disk space of that file released for reuse by another file. In the following sections, we describe the most common schemes for defining the logical structure of a directory. 11.3.3 Single-Level Directory The simplest directory structure is the single-level directory. All files are contained in the same directory, which is easy to support and understand (Figure 11.9). A single-level directory has significant limitations, however, when the number of files increases or when the system has more than one user. Since all files are in the same directory, they must have unique names. If two users call
• 267. 11.3 Directory and Disk Structure 519
Figure 11.9 Single-level directory (a single directory listing files such as cat, bo, a, test, data, mail, cont, hex, and records).
their data file test.txt, then the unique-name rule is violated. For example, in one programming class, 23 students called the program for their second assignment prog2.c; another 11 called it assign2.c. Fortunately, most file systems support file names of up to 255 characters, so it is relatively easy to select unique file names. Even a single user on a single-level directory may find it difficult to remember the names of all the files as the number of files increases. It is not uncommon for a user to have hundreds of files on one computer system and an equal number of additional files on another system. Keeping track of so many files is a daunting task. 11.3.4 Two-Level Directory As we have seen, a single-level directory often leads to confusion of file names among different users. The standard solution is to create a separate directory for each user. In the two-level directory structure, each user has his own user file directory (UFD). The UFDs have similar structures, but each lists only the files of a single user. When a user job starts or a user logs in, the system’s master file directory (MFD) is searched. The MFD is indexed by user name or account number, and each entry points to the UFD for that user (Figure 11.10). When a user refers to a particular file, only his own UFD is searched. Thus, different users may have files with the same name, as long as all the file names within each UFD are unique. To create a file for a user, the operating system searches only that user’s UFD to ascertain whether another file of that name exists. To delete a file, the operating system confines its search to the local UFD; thus, it cannot accidentally delete another user’s file that has the same name.
Figure 11.10 Two-level directory structure (a master file directory indexed by user 1 through user 4, each entry pointing to that user’s file directory of files such as cat, bo, a, test, x, and data).
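A sketch of the two-level lookup pictured in Figure 11.10, with the MFD as a table indexed by user name and each UFD as a list of that user's file names. All structures, names, and sizes here are invented purely for illustration.

    #include <string.h>

    #define MAX_USERS 16
    #define MAX_FILES 64

    struct ufd {                         /* user file directory */
        char names[MAX_FILES][256];
        int  count;
    };

    struct mfd_entry {                   /* one master-file-directory entry per user */
        char        user[32];
        struct ufd *dir;
    };

    static struct mfd_entry mfd[MAX_USERS];
    static int nusers;

    /* Resolve (user, file): only the named user's UFD is searched,
     * so identical file names in different UFDs never collide. */
    int lookup(const char *user, const char *file)
    {
        for (int u = 0; u < nusers; u++) {
            if (strcmp(mfd[u].user, user) != 0)
                continue;
            for (int f = 0; f < mfd[u].dir->count; f++)
                if (strcmp(mfd[u].dir->names[f], file) == 0)
                    return f;            /* index of the entry in that UFD */
            return -1;                   /* user exists, file does not */
        }
        return -1;                       /* no such user */
    }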
  • 268. 520 Chapter 11 File-System Interface The user directories themselves must be created and deleted as necessary. A special system program is run with the appropriate user name and account information. The program creates a new UFD and adds an entry for it to the MFD. The execution of this program might be restricted to system administrators. The allocation of disk space for user directories can be handled with the techniques discussed in Chapter 12 for files themselves. Although the two-level directory structure solves the name-collision prob- lem, it still has disadvantages. This structure effectively isolates one user from another. Isolation is an advantage when the users are completely independent but is a disadvantage when the users want to cooperate on some task and to access one another’s files. Some systems simply do not allow local user files to be accessed by other users. If access is to be permitted, one user must have the ability to name a file in another user’s directory. To name a particular file uniquely in a two-level directory, we must give both the user name and the file name. A two-level directory can be thought of as a tree, or an inverted tree, of height 2. The root of the tree is the MFD. Its direct descendants are the UFDs. The descendants of the UFDs are the files themselves. The files are the leaves of the tree. Specifying a user name and a file name defines a path in the tree from the root (the MFD) to a leaf (the specified file). Thus, a user name and a file name define a path name. Every file in the system has a path name. To name a file uniquely, a user must know the path name of the file desired. For example, if user A wishes to access her own test file named test.txt, she can simply refer to test.txt. To access the file named test.txt of user B (with directory-entry name userb), however, she might have to refer to /userb/test.txt. Every system has its own syntax for naming files in directories other than the user’s own. Additional syntax is needed to specify the volume of a file. For instance, in Windows a volume is specified by a letter followed by a colon. Thus, a file specification might be C:userbtest. Some systems go even fur- ther and separate the volume, directory name, and file name parts of the specification. In VMS, for instance, the file login.com might be specified as: u:[sst.jdeck]login.com;1, where u is the name of the volume, sst is the name of the directory, jdeck is the name of the subdirectory, and 1 is the version number. Other systems—such as UNIX and Linux—simply treat the volume name as part of the directory name. The first name given is that of the volume, and the rest is the directory and file. For instance, /u/pbg/test might specify volume u, directory pbg, and file test. A special instance of this situation occurs with the system files. Programs provided as part of the system—loaders, assemblers, compilers, utility rou- tines, libraries, and so on—are generally defined as files. When the appropriate commands are given to the operating system, these files are read by the loader and executed. Many command interpreters simply treat such a command as the name of a file to load and execute. In the directory system as we defined it above, this file name would be searched for in the current UFD. One solution would be to copy the system files into each UFD. However, copying all the system files would waste an enormous amount of space. 
(If the system files require 5 MB, then supporting 12 users would require 5 × 12 = 60 MB just for copies of the system files.)
  • 269. 11.3 Directory and Disk Structure 521 The standard solution is to complicate the search procedure slightly. A special user directory is defined to contain the system files (for example, user 0). Whenever a file name is given to be loaded, the operating system first searches the local UFD. If the file is found, it is used. If it is not found, the system automatically searches the special user directory that contains the system files. The sequence of directories searched when a file is named is called the search path. The search path can be extended to contain an unlimited list of directories to search when a command name is given. This method is the one most used in UNIX and Windows. Systems can also be designed so that each user has his own search path. 11.3.5 Tree-Structured Directories Once we have seen how to view a two-level directory as a two-level tree, the natural generalization is to extend the directory structure to a tree of arbitrary height (Figure 11.11). This generalization allows users to create their own subdirectories and to organize their files accordingly. A tree is the most common directory structure. The tree has a root directory, and every file in the system has a unique path name. A directory (or subdirectory) contains a set of files or subdirectories. A directory is simply another file, but it is treated in a special way. All directories have the same internal format. One bit in each directory entry defines the entry as a file (0) or as a subdirectory (1). Special system calls are used to create and delete directories. In normal use, each process has a current directory. The current directory should contain most of the files that are of current interest to the process. When reference is made to a file, the current directory is searched. If a file is needed that is not in the current directory, then the user usually must list obj spell find count hex reorder stat mail dist root spell bin programs p e mail reorder list find prog copy prt exp last first hex count all Figure 11.11 Tree-structured directory structure.
  • 270. 522 Chapter 11 File-System Interface either specify a path name or change the current directory to be the directory holding that file. To change directories, a system call is provided that takes a directory name as a parameter and uses it to redefine the current directory. Thus, the user can change her current directory whenever she wants. From one change directory() system call to the next, all open() system calls search the current directory for the specified file. Note that the search path may or may not contain a special entry that stands for “the current directory.” The initial current directory of a user’s login shell is designated when the user job starts or the user logs in. The operating system searches the accounting file (or some other predefined location) to find an entry for this user (for accounting purposes). In the accounting file is a pointer to (or the name of) the user’s initial directory. This pointer is copied to a local variable for this user that specifies the user’s initial current directory. From that shell, other processes can be spawned. The current directory of any subprocess is usually the current directory of the parent when it was spawned. Path names can be of two types: absolute and relative. An absolute path name begins at the root and follows a path down to the specified file, giving the directory names on the path. A relative path name defines a path from the current directory. For example, in the tree-structured file system of Figure 11.11, if the current directory is root/spell/mail, then the relative path name prt/first refers to the same file as does the absolute path name root/spell/mail/prt/first. Allowing a user to define her own subdirectories permits her to impose a structure on her files. This structure might result in separate directories for files associated with different topics (for example, a subdirectory was created to hold the text of this book) or different forms of information (for example, the directory programs may contain source programs; the directory bin may store all the binaries). An interesting policy decision in a tree-structured directory concerns how to handle the deletion of a directory. If a directory is empty, its entry in the directory that contains it can simply be deleted. However, suppose the directory to be deleted is not empty but contains several files or subdirectories. One of two approaches can be taken. Some systems will not delete a directory unless it is empty. Thus, to delete a directory, the user must first delete all the files in that directory. If any subdirectories exist, this procedure must be applied recursively to them, so that they can be deleted also. This approach can result in a substantial amount of work. An alternative approach, such as that taken by the UNIX rm command, is to provide an option: when a request is made to delete a directory, all that directory’s files and subdirectories are also to be deleted. Either approach is fairly easy to implement; the choice is one of policy. The latter policy is more convenient, but it is also more dangerous, because an entire directory structure can be removed with one command. If that command is issued in error, a large number of files and directories will need to be restored (assuming a backup exists). With a tree-structured directory system, users can be allowed to access, in addition to their files, the files of other users. For example, user B can access a file of user A by specifying its path names. 
User B can specify either an absolute or a relative path name. Alternatively, user B can change her current directory to be user A’s directory and access the file by its file names.
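The more convenient (and more dangerous) deletion policy described above, in which a non-empty directory is removed along with everything beneath it, is naturally recursive. A minimal sketch using the POSIX directory calls; error handling is thin and the fixed-size path buffer is a simplification.

    #include <dirent.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Delete path and everything beneath it (the "rm -r" policy). */
    int remove_tree(const char *path)
    {
        struct stat st;
        if (lstat(path, &st) != 0)
            return -1;
        if (!S_ISDIR(st.st_mode))
            return unlink(path);             /* ordinary file or link: just remove it */

        DIR *d = opendir(path);
        if (d == NULL)
            return -1;
        struct dirent *e;
        while ((e = readdir(d)) != NULL) {
            if (strcmp(e->d_name, ".") == 0 || strcmp(e->d_name, "..") == 0)
                continue;
            char child[4096];
            snprintf(child, sizeof child, "%s/%s", path, e->d_name);
            remove_tree(child);              /* recurse into subdirectories */
        }
        closedir(d);
        return rmdir(path);                  /* the directory is now empty */
    }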
  • 271. 11.3 Directory and Disk Structure 523 11.3.6 Acyclic-Graph Directories Consider two programmers who are working on a joint project. The files asso- ciated with that project can be stored in a subdirectory, separating them from other projects and files of the two programmers. But since both programmers are equally responsible for the project, both want the subdirectory to be in their own directories. In this situation, the common subdirectory should be shared. A shared directory or file exists in the file system in two (or more) places at once. A tree structure prohibits the sharing of files or directories. An acyclic graph —that is, a graph with no cycles—allows directories to share subdirectories and files (Figure 11.12). The same file or subdirectory may be in two different directories. The acyclic graph is a natural generalization of the tree-structured directory scheme. It is important to note that a shared file (or directory) is not the same as two copies of the file. With two copies, each programmer can view the copy rather than the original, but if one programmer changes the file, the changes will not appear in the other’s copy. With a shared file, only one actual file exists, so any changes made by one person are immediately visible to the other. Sharing is particularly important for subdirectories; a new file created by one person will automatically appear in all the shared subdirectories. When people are working as a team, all the files they want to share can be put into one directory. The UFD of each team member will contain this directory of shared files as a subdirectory. Even in the case of a single user, the user’s file organization may require that some file be placed in different subdirectories. For example, a program written for a particular project should be both in the directory of all programs and in the directory for that project. Shared files and subdirectories can be implemented in several ways. A common way, exemplified by many of the UNIX systems, is to create a new directory entry called a link. A link is effectively a pointer to another file list all w count words list list rade w7 count root dict spell Figure 11.12 Acyclic-graph directory structure.
  • 272. 524 Chapter 11 File-System Interface or subdirectory. For example, a link may be implemented as an absolute or a relative path name. When a reference to a file is made, we search the directory. If the directory entry is marked as a link, then the name of the real file is included in the link information. We resolve the link by using that path name to locate the real file. Links are easily identified by their format in the directory entry (or by having a special type on systems that support types) and are effectively indirect pointers. The operating system ignores these links when traversing directory trees to preserve the acyclic structure of the system. Another common approach to implementing shared files is simply to duplicate all information about them in both sharing directories. Thus, both entries are identical and equal. Consider the difference between this approach and the creation of a link. The link is clearly different from the original directory entry; thus, the two are not equal. Duplicate directory entries, however, make the original and the copy indistinguishable. A major problem with duplicate directory entries is maintaining consistency when a file is modified. An acyclic-graph directory structure is more flexible than a simple tree structure, but it is also more complex. Several problems must be considered carefully. A file may now have multiple absolute path names. Consequently, distinct file names may refer to the same file. This situation is similar to the aliasing problem for programming languages. If we are trying to traverse the entire file system—to find a file, to accumulate statistics on all files, or to copy all files to backup storage—this problem becomes significant, since we do not want to traverse shared structures more than once. Another problem involves deletion. When can the space allocated to a shared file be deallocated and reused? One possibility is to remove the file whenever anyone deletes it, but this action may leave dangling pointers to the now-nonexistent file. Worse, if the remaining file pointers contain actual disk addresses, and the space is subsequently reused for other files, these dangling pointers may point into the middle of other files. In a system where sharing is implemented by symbolic links, this situation is somewhat easier to handle. The deletion of a link need not affect the original file; only the link is removed. If the file entry itself is deleted, the space for the file is deallocated, leaving the links dangling. We can search for these links and remove them as well, but unless a list of the associated links is kept with each file, this search can be expensive. Alternatively, we can leave the links until an attempt is made to use them. At that time, we can determine that the file of the name given by the link does not exist and can fail to resolve the link name; the access is treated just as with any other illegal file name. (In this case, the system designer should consider carefully what to do when a file is deleted and another file of the same name is created, before a symbolic link to the original file is used.) In the case of UNIX, symbolic links are left when a file is deleted, and it is up to the user to realize that the original file is gone or has been replaced. Microsoft Windows uses the same approach. Another approach to deletion is to preserve the file until all references to it are deleted. 
To implement this approach, we must have some mechanism for determining that the last reference to the file has been deleted. We could keep a list of all references to a file (directory entries or symbolic links). When a link or a copy of the directory entry is established, a new entry is added to the file-reference list. When a link or directory entry is deleted, we remove its entry on the list. The file is deleted when its file-reference list is empty.
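A sketch of the file-reference-list bookkeeping just described: every link or duplicate directory entry adds an entry to the list, and the file's space is reclaimed only when the list becomes empty. The structure and field names are invented for illustration.

    #include <stdlib.h>
    #include <string.h>

    struct ref {                        /* one entry per reference to the file */
        char        holder[256];        /* e.g., the path of the directory entry or link */
        struct ref *next;
    };

    struct shared_file {
        struct ref *refs;               /* the file-reference list */
        /* ... file metadata and data-block pointers ... */
    };

    /* Establishing a link or directory entry adds an entry to the list. */
    void add_reference(struct shared_file *f, const char *holder)
    {
        struct ref *r = malloc(sizeof *r);
        if (r == NULL)
            return;
        strncpy(r->holder, holder, sizeof r->holder - 1);
        r->holder[sizeof r->holder - 1] = '\0';
        r->next = f->refs;
        f->refs = r;
    }

    /* Deleting a link or entry removes it; the file itself is deleted
     * only when no references remain. */
    void remove_reference(struct shared_file *f, const char *holder)
    {
        struct ref **p = &f->refs;
        while (*p && strcmp((*p)->holder, holder) != 0)
            p = &(*p)->next;
        if (*p) {
            struct ref *gone = *p;
            *p = gone->next;
            free(gone);
        }
        if (f->refs == NULL) {
            /* the last reference is gone: reclaim the file's space */
        }
    }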
  • 273. 11.3 Directory and Disk Structure 525 The trouble with this approach is the variable and potentially large size of the file-reference list. However, we really do not need to keep the entire list—we need to keep only a count of the number of references. Adding a new link or directory entry increments the reference count. Deleting a link or entry decrements the count. When the count is 0, the file can be deleted; there are no remaining references to it. The UNIX operating system uses this approach for nonsymbolic links (or hard links), keeping a reference count in the file information block (or inode; see Section A.7.2). By effectively prohibiting multiple references to directories, we maintain an acyclic-graph structure. To avoid problems such as the ones just discussed, some systems simply do not allow shared directories or links. 11.3.7 General Graph Directory A serious problem with using an acyclic-graph structure is ensuring that there are no cycles. If we start with a two-level directory and allow users to create subdirectories, a tree-structured directory results. It should be fairly easy to see that simply adding new files and subdirectories to an existing tree-structured directory preserves the tree-structured nature. However, when we add links, the tree structure is destroyed, resulting in a simple graph structure (Figure 11.13). The primary advantage of an acyclic graph is the relative simplicity of the algorithms to traverse the graph and to determine when there are no more references to a file. We want to avoid traversing shared sections of an acyclic graph twice, mainly for performance reasons. If we have just searched a major shared subdirectory for a particular file without finding it, we want to avoid searching that subdirectory again; the second search would be a waste of time. If cycles are allowed to exist in the directory, we likewise want to avoid searching any component twice, for reasons of correctness as well as performance. A poorly designed algorithm might result in an infinite loop continually searching through the cycle and never terminating. One solution text mail avi count unhex hex count book book mail unhex hyp root avi tc jim Figure 11.13 General graph directory.
  • 274. 526 Chapter 11 File-System Interface is to limit arbitrarily the number of directories that will be accessed during a search. A similar problem exists when we are trying to determine when a file can be deleted. With acyclic-graph directory structures, a value of 0 in the reference count means that there are no more references to the file or directory, and the file can be deleted. However, when cycles exist, the reference count may not be 0 even when it is no longer possible to refer to a directory or file. This anomaly results from the possibility of self-referencing (or a cycle) in the directory structure. In this case, we generally need to use a garbage collection scheme to determine when the last reference has been deleted and the disk space can be reallocated. Garbage collection involves traversing the entire file system, marking everything that can be accessed. Then, a second pass collects everything that is not marked onto a list of free space. (A similar marking procedure can be used to ensure that a traversal or search will cover everything in the file system once and only once.) Garbage collection for a disk-based file system, however, is extremely time consuming and is thus seldom attempted. Garbage collection is necessary only because of possible cycles in the graph. Thus, an acyclic-graph structure is much easier to work with. The difficulty is to avoid cycles as new links are added to the structure. How do we know when a new link will complete a cycle? There are algorithms to detect cycles in graphs; however, they are computationally expensive, especially when the graph is on disk storage. A simpler algorithm in the special case of directories and links is to bypass links during directory traversal. Cycles are avoided, and no extra overhead is incurred. 11.4 File-System Mounting Just as a file must be opened before it is used, a file system must be mounted before it can be available to processes on the system. More specifically, the directory structure may be built out of multiple volumes, which must be mounted to make them available within the file-system name space. The mount procedure is straightforward. The operating system is given the name of the device and the mount point—the location within the file structure where the file system is to be attached. Some operating systems require that a file system type be provided, while others inspect the structures of the device and determine the type of file system. Typically, a mount point is an empty directory. For instance, on a UNIX system, a file system containing a user’s home directories might be mounted as /home; then, to access the directory structure within that file system, we could precede the directory names with /home, as in /home/jane. Mounting that file system under /users would result in the path name /users/jane, which we could use to reach the same directory. Next, the operating system verifies that the device contains a valid file system. It does so by asking the device driver to read the device directory and verifying that the directory has the expected format. Finally, the operating system notes in its directory structure that a file system is mounted at the specified mount point. This scheme enables the operating system to traverse its directory structure, switching among file systems, and even file systems of varying types, as appropriate.
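On Linux, for instance, this mount procedure is exposed to privileged programs through the mount() system call, which names the device, the mount point, and the file-system type. The device and mount point below are hypothetical, and the call will fail without the appropriate privilege.

    #include <stdio.h>
    #include <sys/mount.h>

    int main(void)
    {
        /* Attach the ext4 file system on /dev/sdb1 at the (empty) directory /home. */
        if (mount("/dev/sdb1", "/home", "ext4", 0, NULL) != 0) {
            perror("mount");
            return 1;
        }
        printf("file system mounted at /home\n");
        return 0;
    }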
  • 275. 12 C H A P T E R File-System Implementation As we saw in Chapter 11, the file system provides the mechanism for on-line storage and access to file contents, including data and programs. The file system resides permanently on secondary storage, which is designed to hold a large amount of data permanently. This chapter is primarily concerned with issues surrounding file storage and access on the most common secondary-storage medium, the disk. We explore ways to structure file use, to allocate disk space, to recover freed space, to track the locations of data, and to interface other parts of the operating system to secondary storage. Performance issues are considered throughout the chapter. CHAPTER OBJECTIVES • To describe the details of implementing local file systems and directory structures. • To describe the implementation of remote file systems. • To discuss block allocation and free-block algorithms and trade-offs. 12.1 File-System Structure Disks provide most of the secondary storage on which file systems are maintained. Two characteristics make them convenient for this purpose: 1. A disk can be rewritten in place; it is possible to read a block from the disk, modify the block, and write it back into the same place. 2. A disk can access directly any block of information it contains. Thus, it is simple to access any file either sequentially or randomly, and switching from one file to another requires only moving the read–write heads and waiting for the disk to rotate. We discuss disk structure in great detail in Chapter 10. To improve I/O efficiency, I/O transfers between memory and disk are performed in units of blocks. Each block has one or more sectors. Depending 543
  • 276. 544 Chapter 12 File-System Implementation on the disk drive, sector size varies from 32 bytes to 4,096 bytes; the usual size is 512 bytes. File systems provide efficient and convenient access to the disk by allowing data to be stored, located, and retrieved easily. A file system poses two quite different design problems. The first problem is defining how the file system should look to the user. This task involves defining a file and its attributes, the operations allowed on a file, and the directory structure for organizing files. The second problem is creating algorithms and data structures to map the logical file system onto the physical secondary-storage devices. The file system itself is generally composed of many different levels. The structure shown in Figure 12.1 is an example of a layered design. Each level in the design uses the features of lower levels to create new features for use by higher levels. The I/O control level consists of device drivers and interrupt handlers to transfer information between the main memory and the disk system. A device driver can be thought of as a translator. Its input consists of high- level commands such as “retrieve block 123.” Its output consists of low-level, hardware-specific instructions that are used by the hardware controller, which interfaces the I/O device to the rest of the system. The device driver usually writes specific bit patterns to special locations in the I/O controller’s memory to tell the controller which device location to act on and what actions to take. The details of device drivers and the I/O infrastructure are covered in Chapter 13. The basic file system needs only to issue generic commands to the appropriate device driver to read and write physical blocks on the disk. Each physical block is identified by its numeric disk address (for example, drive 1, cylinder 73, track 2, sector 10). This layer also manages the memory buffers and caches that hold various file-system, directory, and data blocks. A block in the buffer is allocated before the transfer of a disk block can occur. When the buffer is full, the buffer manager must find more buffer memory or free application programs file-organization module basic file system I/O control devices logical file system Figure 12.1 Layered file system.
  • 277. 12.1 File-System Structure 545 up buffer space to allow a requested I/O to complete. Caches are used to hold frequently used file-system metadata to improve performance, so managing their contents is critical for optimum system performance. The file-organization module knows about files and their logical blocks, as well as physical blocks. By knowing the type of file allocation used and the location of the file, the file-organization module can translate logical block addresses to physical block addresses for the basic file system to transfer. Each file’s logical blocks are numbered from 0 (or 1) through N. Since the physical blocks containing the data usually do not match the logical numbers, a translation is needed to locate each block. The file-organization module also includes the free-space manager, which tracks unallocated blocks and provides these blocks to the file-organization module when requested. Finally, the logical file system manages metadata information. Metadata includes all of the file-system structure except the actual data (or contents of the files). The logical file system manages the directory structure to provide the file-organization module with the information the latter needs, given a symbolic file name. It maintains file structure via file-control blocks. A file- control block (FCB) (an inode in UNIX file systems) contains information about the file, including ownership, permissions, and location of the file contents. The logical file system is also responsible for protection, as discussed in Chaptrers 11 and 14. When a layered structure is used for file-system implementation, duplica- tion of code is minimized. The I/O control and sometimes the basic file-system code can be used by multiple file systems. Each file system can then have its own logical file-system and file-organization modules. Unfortunately, layering can introduce more operating-system overhead, which may result in decreased performance. The use of layering, including the decision about how many layers to use and what each layer should do, is a major challenge in designing new systems. Many file systems are in use today, and most operating systems support more than one. For example, most CD-ROMs are written in the ISO 9660 format, a standard format agreed on by CD-ROM manufacturers. In addition to removable-media file systems, each operating system has one or more disk- based file systems. UNIX uses the UNIX file system (UFS), which is based on the Berkeley Fast File System (FFS). Windows supports disk file-system formats of FAT, FAT32, and NTFS (or Windows NT File System), as well as CD-ROM and DVD file-system formats. Although Linux supports over forty different file systems, the standard Linux file system is known as the extended file system, with the most common versions being ext3 and ext4. There are also distributed file systems in which a file system on a server is mounted by one or more client computers across a network. File-system research continues to be an active area of operating-system design and implementation. Google created its own file system to meet the company’s specific storage and retrieval needs, which include high- performance access from many clients across a very large number of disks. Another interesting project is the FUSE file system, which provides flexibility in file-system development and use by implementing and executing file systems as user-level rather than kernel-level code. 
Using FUSE, a user can add a new file system to a variety of operating systems and can use that file system to manage her files.
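Returning to the translation performed by the file-organization module described earlier in this section: it can be pictured as a per-file map from logical block number to physical block number. The structure below is only an illustration; the real allocation methods that populate such a mapping are covered later in this chapter.

    #define MAX_FILE_BLOCKS 1024

    /* Per-file map kept by a (hypothetical) file-organization module. */
    struct file_map {
        long physical[MAX_FILE_BLOCKS];   /* physical[logical block] = disk block number */
        long nblocks;                     /* number of logical blocks in the file */
    };

    /* Translate a logical block number into the physical block that the basic
     * file system should ask the device driver to transfer; -1 if out of range. */
    long logical_to_physical(const struct file_map *m, long logical_block)
    {
        if (logical_block < 0 || logical_block >= m->nblocks)
            return -1;
        return m->physical[logical_block];
    }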
  • 278. 546 Chapter 12 File-System Implementation 12.2 File-System Implementation As was described in Section 11.1.2, operating systems implement open() and close() systems calls for processes to request access to file contents. In this section, we delve into the structures and operations used to implement file-system operations. 12.2.1 Overview Several on-disk and in-memory structures are used to implement a file system. These structures vary depending on the operating system and the file system, but some general principles apply. On disk, the file system may contain information about how to boot an operating system stored there, the total number of blocks, the number and location of free blocks, the directory structure, and individual files. Many of these structures are detailed throughout the remainder of this chapter. Here, we describe them briefly: • A boot control block (per volume) can contain information needed by the system to boot an operating system from that volume. If the disk does not contain an operating system, this block can be empty. It is typically the first block of a volume. In UFS, it is called the boot block. In NTFS, it is the partition boot sector. • A volume control block (per volume) contains volume (or partition) details, such as the number of blocks in the partition, the size of the blocks, a free-block count and free-block pointers, and a free-FCB count and FCB pointers. In UFS, this is called a superblock. In NTFS, it is stored in the master file table. • A directory structure (per file system) is used to organize the files. In UFS, this includes file names and associated inode numbers. In NTFS, it is stored in the master file table. • A per-file FCB contains many details about the file. It has a unique identifier number to allow association with a directory entry. In NTFS, this information is actually stored within the master file table, which uses a relational database structure, with a row per file. The in-memory information is used for both file-system management and performance improvement via caching. The data are loaded at mount time, updated during file-system operations, and discarded at dismount. Several types of structures may be included. • An in-memory mount table contains information about each mounted volume. • An in-memory directory-structure cache holds the directory information of recently accessed directories. (For directories at which volumes are mounted, it can contain a pointer to the volume table.) • The system-wide open-file table contains a copy of the FCB of each open file, as well as other information.
  • 279. 12.2 File-System Implementation 547 file permissions file dates (create, access, write) file owner, group, ACL file size file data blocks or pointers to file data blocks Figure 12.2 A typical file-control block. • The per-process open-file table contains a pointer to the appropriate entry in the system-wide open-file table, as well as other information. • Buffers hold file-system blocks when they are being read from disk or written to disk. To create a new file, an application program calls the logical file system. The logical file system knows the format of the directory structures. To create a new file, it allocates a new FCB. (Alternatively, if the file-system implementation creates all FCBs at file-system creation time, an FCB is allocated from the set of free FCBs.) The system then reads the appropriate directory into memory, updates it with the new file name and FCB, and writes it back to the disk. A typical FCB is shown in Figure 12.2. Some operating systems, including UNIX, treat a directory exactly the same as a file—one with a “type” field indicating that it is a directory. Other operating systems, including Windows, implement separate system calls for files and directories and treat directories as entities separate from files. Whatever the larger structural issues, the logical file system can call the file-organization module to map the directory I/O into disk-block numbers, which are passed on to the basic file system and I/O control system. Now that a file has been created, it can be used for I/O. First, though, it must be opened. The open() call passes a file name to the logical file system. The open() system call first searches the system-wide open-file table to see if the file is already in use by another process. If it is, a per-process open-file table entry is created pointing to the existing system-wide open-file table. This algorithm can save substantial overhead. If the file is not already open, the directory structure is searched for the given file name. Parts of the directory structure are usually cached in memory to speed directory operations. Once the file is found, the FCB is copied into a system-wide open-file table in memory. This table not only stores the FCB but also tracks the number of processes that have the file open. Next, an entry is made in the per-process open-file table, with a pointer to the entry in the system-wide open-file table and some other fields. These other fields may include a pointer to the current location in the file (for the next read() or write() operation) and the access mode in which the file is open. The open() call returns a pointer to the appropriate entry in the per-process
  • 280. 548 Chapter 12 File-System Implementation directory structure directory structure open (file name) kernel memory user space index (a) file-control block secondary storage data blocks per-process open-file table system-wide open-file table read (index) kernel memory user space (b) file-control block secondary storage Figure 12.3 In-memory file-system structures. (a) File open. (b) File read. file-system table. All file operations are then performed via this pointer. The file name may not be part of the open-file table, as the system has no use for it once the appropriate FCB is located on disk. It could be cached, though, to save time on subsequent opens of the same file. The name given to the entry varies. UNIX systems refer to it as a file descriptor; Windows refers to it as a file handle. When a process closes the file, the per-process table entry is removed, and the system-wide entry’s open count is decremented. When all users that have opened the file close it, any updated metadata is copied back to the disk-based directory structure, and the system-wide open-file table entry is removed. Some systems complicate this scheme further by using the file system as an interface to other system aspects, such as networking. For example, in UFS, the system-wide open-file table holds the inodes and other information for files and directories. It also holds similar information for network connections and devices. In this way, one mechanism can be used for multiple purposes. The caching aspects of file-system structures should not be overlooked. Most systems keep all information about an open file, except for its actual data blocks, in memory. The BSD UNIX system is typical in its use of caches wherever disk I/O can be saved. Its average cache hit rate of 85 percent shows that these techniques are well worth implementing. The BSD UNIX system is described fully in Appendix A. The operating structures of a file-system implementation are summarized in Figure 12.3.
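A rough rendering of the structures behind Figures 12.2 and 12.3 as C declarations. The fields, types, and sizes are stand-ins chosen for illustration, not the layout used by any particular operating system.

    #include <stdint.h>

    struct fcb {                               /* file-control block (compare Figure 12.2) */
        uint16_t permissions;
        uint32_t owner, group;
        int64_t  created, accessed, written;   /* file dates */
        uint64_t size;
        uint64_t data_blocks[15];              /* pointers to file data blocks */
    };

    struct sys_open_file {                     /* one entry per open file, system-wide */
        struct fcb fcb;                        /* copy of the on-disk FCB */
        int        open_count;                 /* number of processes with the file open */
    };

    struct proc_open_file {                    /* one entry per open file, per process */
        struct sys_open_file *sys;             /* pointer into the system-wide table */
        uint64_t              offset;          /* current location for read()/write() */
        int                   mode;            /* access mode the file was opened with */
    };

    /* Closing a file: drop the per-process entry and decrement the system-wide
     * count; only when the last opener closes is updated metadata written back
     * to the on-disk directory structure and the system-wide entry removed. */
    void file_close(struct proc_open_file *pf)
    {
        if (--pf->sys->open_count == 0) {
            /* flush pf->sys->fcb to disk and free the system-wide entry */
        }
        pf->sys = 0;
    }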
  • 281. 12.2 File-System Implementation 549 12.2.2 Partitions and Mounting The layout of a disk can have many variations, depending on the operating system. A disk can be sliced into multiple partitions, or a volume can span multiple partitions on multiple disks. The former layout is discussed here, while the latter, which is more appropriately considered a form of RAID, is covered in Section 10.7. Each partition can be either “raw,” containing no file system, or “cooked,” containing a file system. Raw disk is used where no file system is appropriate. UNIX swap space can use a raw partition, for example, since it uses its own format on disk and does not use a file system. Likewise, some databases use raw disk and format the data to suit their needs. Raw disk can also hold information needed by disk RAID systems, such as bit maps indicating which blocks are mirrored and which have changed and need to be mirrored. Similarly, raw disk can contain a miniature database holding RAID configuration information, such as which disks are members of each RAID set. Raw disk use is discussed in Section 10.5.1. Boot information can be stored in a separate partition, as described in Section 10.5.2. Again, it has its own format, because at boot time the system does not have the file-system code loaded and therefore cannot interpret the file-system format. Rather, boot information is usually a sequential series of blocks, loaded as an image into memory. Execution of the image starts at a predefined location, such as the first byte. This boot loader in turn knows enough about the file-system structure to be able to find and load the kernel and start it executing. It can contain more than the instructions for how to boot a specific operating system. For instance, many systems can be dual-booted, allowing us to install multiple operating systems on a single system. How does the system know which one to boot? A boot loader that understands multiple file systems and multiple operating systems can occupy the boot space. Once loaded, it can boot one of the operating systems available on the disk. The disk can have multiple partitions, each containing a different type of file system and a different operating system. The root partition, which contains the operating-system kernel and some- times other system files, is mounted at boot time. Other volumes can be automatically mounted at boot or manually mounted later, depending on the operating system. As part of a successful mount operation, the operating system verifies that the device contains a valid file system. It does so by asking the device driver to read the device directory and verifying that the directory has the expected format. If the format is invalid, the partition must have its consistency checked and possibly corrected, either with or without user intervention. Finally, the operating system notes in its in-memory mount table that a file system is mounted, along with the type of the file system. The details of this function depend on the operating system. Microsoft Windows–based systems mount each volume in a separate name space, denoted by a letter and a colon. To record that a file system is mounted at F:, for example, the operating system places a pointer to the file system in a field of the device structure corresponding to F:. When a process specifies the driver letter, the operating system finds the appropriate file-system pointer and traverses the directory structures on that device to find the specified file
  • 282. 550 Chapter 12 File-System Implementation or directory. Later versions of Windows can mount a file system at any point within the existing directory structure. On UNIX, file systems can be mounted at any directory. Mounting is implemented by setting a flag in the in-memory copy of the inode for that directory. The flag indicates that the directory is a mount point. A field then points to an entry in the mount table, indicating which device is mounted there. The mount table entry contains a pointer to the superblock of the file system on that device. This scheme enables the operating system to traverse its directory structure, switching seamlessly among file systems of varying types. 12.2.3 Virtual File Systems The previous section makes it clear that modern operating systems must concurrently support multiple types of file systems. But how does an operating system allow multiple types of file systems to be integrated into a directory structure? And how can users seamlessly move between file-system types as they navigate the file-system space? We now discuss some of these implementation details. An obvious but suboptimal method of implementing multiple types of file systems is to write directory and file routines for each type. Instead, however, most operating systems, including UNIX, use object-oriented techniques to simplify, organize, and modularize the implementation. The use of these methods allows very dissimilar file-system types to be implemented within the same structure, including network file systems, such as NFS. Users can access files contained within multiple file systems on the local disk or even on file systems available across the network. Data structures and procedures are used to isolate the basic system- call functionality from the implementation details. Thus, the file-system implementation consists of three major layers, as depicted schematically in Figure 12.4. The first layer is the file-system interface, based on the open(), read(), write(), and close() calls and on file descriptors. The second layer is called the virtual file system (VFS) layer. The VFS layer serves two important functions: 1. It separates file-system-generic operations from their implementation by defining a clean VFS interface. Several implementations for the VFS interface may coexist on the same machine, allowing transparent access to different types of file systems mounted locally. 2. It provides a mechanism for uniquely representing a file throughout a network. The VFS is based on a file-representation structure, called a vnode, that contains a numerical designator for a network-wide unique file. (UNIX inodes are unique within only a single file system.) This network-wide uniqueness is required for support of network file systems. The kernel maintains one vnode structure for each active node (file or directory). Thus, the VFS distinguishes local files from remote ones, and local files are further distinguished according to their file-system types. The VFS activates file-system-specific operations to handle local requests according to their file-system types and calls the NFS protocol procedures for
The VFS activates file-system-specific operations to handle local requests according to their file-system types and calls the NFS protocol procedures for remote requests. File handles are constructed from the relevant vnodes and are passed as arguments to these procedures. The layer implementing the file-system type or the remote-file-system protocol is the third layer of the architecture.

Figure 12.4 Schematic view of a virtual file system. (The figure shows the file-system interface above the VFS interface, which connects to local file systems of two types, each with its own disk, and over the network to a remote file system.)

Let’s briefly examine the VFS architecture in Linux. The four main object types defined by the Linux VFS are:

• The inode object, which represents an individual file
• The file object, which represents an open file
• The superblock object, which represents an entire file system
• The dentry object, which represents an individual directory entry

For each of these four object types, the VFS defines a set of operations that may be implemented. Every object of one of these types contains a pointer to a function table. The function table lists the addresses of the actual functions that implement the defined operations for that particular object. For example, an abbreviated API for some of the operations for the file object includes the following (a rough sketch of the function-table idea appears after the list):

• int open(...)—Open a file.
• int close(...)—Close an already-open file.
• ssize_t read(...)—Read from a file.
• ssize_t write(...)—Write to a file.
• int mmap(...)—Memory-map a file.
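The sketch below shows the function-table idea in C. It is a simplified, hypothetical version, not the actual struct file_operations from fs.h; the names and signatures are invented for illustration.

#include <stddef.h>                      /* size_t  */
#include <sys/types.h>                   /* ssize_t */

struct vfs_file;                         /* stand-in for the VFS file object           */

struct file_ops {                        /* one table per file-system type             */
    int     (*open) (struct vfs_file *f);
    int     (*close)(struct vfs_file *f);
    ssize_t (*read) (struct vfs_file *f, void *buf, size_t len);
    ssize_t (*write)(struct vfs_file *f, const void *buf, size_t len);
};

struct vfs_file {
    const struct file_ops *ops;          /* installed by the concrete file-system type */
    void *fs_private;                    /* file-system-specific state                 */
};

/* The VFS calls through the table without knowing which file-system type
   actually implements the operation. */
ssize_t vfs_read(struct vfs_file *f, void *buf, size_t len)
{
    return f->ops->read(f, buf, len);
}

A concrete file-system type fills in a file_ops table with its own routines, and the VFS invokes them exactly as vfs_read() does here.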
An implementation of the file object for a specific file type is required to implement each function specified in the definition of the file object. (The complete definition of the file object is specified in struct file_operations, which is located in the file /usr/include/linux/fs.h.) Thus, the VFS software layer can perform an operation on one of these objects by calling the appropriate function from the object’s function table, without having to know in advance exactly what kind of object it is dealing with. The VFS does not know, or care, whether an inode represents a disk file, a directory file, or a remote file. The appropriate function for that file’s read() operation will always be at the same place in its function table, and the VFS software layer will call that function without caring how the data are actually read.

12.3 Directory Implementation

The selection of directory-allocation and directory-management algorithms significantly affects the efficiency, performance, and reliability of the file system. In this section, we discuss the trade-offs involved in choosing one of these algorithms.

12.3.1 Linear List

The simplest method of implementing a directory is to use a linear list of file names with pointers to the data blocks. This method is simple to program but time-consuming to execute. To create a new file, we must first search the directory to be sure that no existing file has the same name. Then, we add a new entry at the end of the directory. To delete a file, we search the directory for the named file and then release the space allocated to it. To reuse the directory entry, we can do one of several things. We can mark the entry as unused (by assigning it a special name, such as an all-blank name, or by including a used–unused bit in each entry), or we can attach it to a list of free directory entries. A third alternative is to copy the last entry in the directory into the freed location and to decrease the length of the directory. A linked list can also be used to decrease the time required to delete a file.

The real disadvantage of a linear list of directory entries is that finding a file requires a linear search. Directory information is used frequently, and users will notice if access to it is slow. In fact, many operating systems implement a software cache to store the most recently used directory information. A cache hit avoids the need to constantly reread the information from disk. A sorted list allows a binary search and decreases the average search time. However, the requirement that the list be kept sorted may complicate creating and deleting files, since we may have to move substantial amounts of directory information to maintain a sorted directory. A more sophisticated tree data structure, such as a balanced tree, might help here. An advantage of the sorted list is that a sorted directory listing can be produced without a separate sort step.

12.3.2 Hash Table

Another data structure used for a file directory is a hash table. Here, a linear list stores the directory entries, but a hash data structure is also used. The hash table takes a value computed from the file name and returns a pointer to the file