FUNDAMENTALS OF DISTRIBUTED SYSTEM
Chapter Three
Distributed File system
1
OUTLINE
 Distributed File Systems: Introduction
 File Service Architecture
 Case Study 1: Sun Network File System
 Case Study 2: The Andrew File System
2
INTRODUCTION
 The main goal of a distributed system is sharing resources.
 Enables programs to store and access remote files (e.g., on web
servers) exactly as they do local files
 Distributed file systems support the sharing of information in
the form of files and hardware resources in the form of
persistent storage throughout an intranet.
 File systems were originally developed for centralized
computer systems and desktop computers as an operating
system facility providing a convenient programming interface
to disk storage.
 A well designed file service provides access to files stored at a
server with performance and reliability similar to, and in some
cases better than, files stored on local disks 3
STORAGE SYSTEMS AND THEIR PROPERTIES
Storage system                Sharing  Persistence  Distributed cache/replicas  Consistency maintenance         Example
Main memory                   No       No           No                          strict one-copy                 RAM
File system                   No       Yes          No                          strict one-copy                 UNIX file system
Distributed file system       Yes      Yes          Yes                         slightly weaker guarantees      Sun NFS
Web                           Yes      Yes          Yes                         no automatic consistency        Web server
Distributed shared memory     Yes      No           Yes                         slightly weaker guarantees      Ivy
Remote objects                Yes      No           No                          strict one-copy                 CORBA
Persistent object store       Yes      Yes          No                          strict one-copy                 CORBA Persistent State Service
Peer-to-peer storage system   Yes      Yes          Yes                         considerably weaker guarantees  OceanStore
4
CHARACTERISTICS OF FILE SYSTEM
 Responsible for the organization, storage, retrieval, naming,
sharing and protection of files.
 Provides a programming interface that characterizes the file
abstraction
 Files contain both data and attributes
 File attributes:
5
File length
Creation timestamp
Read timestamp
Write time stamp
Attribute timestamp
Owner
File type
Access control list
DISTRIBUTED FILE SYSTEM REQUIREMENTS
 Transparency(access , location, mobility, performance,
scaling)
 Concurrent file updates
 File replication
 Hardware and operating system heterogeneity
 Fault tolerance
 Consistency
 Security
 Efficiency
6
FILE SERVICE ARCHITECTURE
 An architecture that offers a clear separation of the main
concerns
 Access to files is obtained by structuring the file service
 The file service has three components:
A. a flat file service
B. a directory service
C. a client module
7
File service Architecture
8
FLAT FILE SERVICE
 Concerned with implementing operations on the contents of files
 Unique file identifiers (UFIDs) are used to refer to files in all
requests
 A UFID is a long sequence of bits
 Each file has a UFID that is unique among all of the files
9
DIRECTORY SERVICE
 The directory service provides a mapping between text names
for files and their UFIDs.
 Clients may obtain the UFID of a file by quoting its text name
to the directory service.
 The directory service provides the functions needed to
generate directories, to add new file names to directories and
to obtain UFIDs from directories.
 It is a client of the flat file service
 Its directory files are stored in files of the flat file service.
 When a hierarchic file-naming scheme is adopted, as in
UNIX, directories hold references to other directories. 10
CLIENT MODULE
 A client module runs in each client computer, integrating and
extending the operations of the flat file service and the
directory service under a single application programming
interface that is available to user-level programs in client
computers.
 For example, in UNIX hosts, a client module would be provided
that emulates the full set of UNIX file operations, interpreting
UNIX multi-part file names by iterative requests to the directory
service.
 The client module also holds information about the network
locations of the flat file server and directory server processes.
 Finally, the client module can play an important role in achieving
satisfactory performance through the implementation of a cache of
recently used file blocks at the client. 11
FLAT FILE SERVICE INTERFACE
 This is the RPC interface used by client modules. It is not
normally used directly by user-level programs.
 A FileId is invalid if the file that it refers to is not present in
the server processing the request or if its access permissions
are inappropriate for the operation requested.
 All of the procedures in the interface except Create throw
exceptions if the FileId argument contains an invalid UFID or
the user doesn’t have sufficient access rights. These
exceptions are omitted from the definition for clarity.
12
FLAT FILE SERVICE OPERATIONS
 Read(FileId, i, n) → Data: Reads a sequence of up to n items from a file starting at item i and
returns it in Data.
 Write(FileId, i, Data): Writes a sequence of Data to a file, starting at item i, extending the file if
necessary.
 Create() → FileId: Creates a new file of length 0 and delivers a UFID for it.
 Delete(FileId): Removes the file from the file store.
 GetAttributes(FileId) → Attr: Returns the file attributes for the file.
 SetAttributes(FileId, Attr): Sets the file attributes.
 GetAttributes and SetAttributes enable clients to access the attribute record.
 GetAttributes is normally available to any client that is allowed to read the file.
 Access to the SetAttributes operation would normally be restricted to the directory service that
provides access to the file.
13
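The operations above can be sketched as a toy in-memory flat file service; the class name, byte-sized items, and integer UFIDs are illustrative assumptions, not part of the actual RPC interface.

```python
import itertools

class InvalidFileId(Exception):
    """Raised when a UFID does not refer to a file on this server."""

class FlatFileService:
    """Toy in-memory flat file service; items are single bytes."""
    def __init__(self):
        self._files = {}                   # UFID -> bytearray (file contents)
        self._attrs = {}                   # UFID -> attribute record
        self._ufids = itertools.count(1)   # stand-in for globally unique UFIDs

    def create(self):
        ufid = next(self._ufids)
        self._files[ufid] = bytearray()    # new file of length 0
        self._attrs[ufid] = {"length": 0}
        return ufid

    def _check(self, ufid):
        if ufid not in self._files:
            raise InvalidFileId(ufid)

    def read(self, ufid, i, n):
        self._check(ufid)
        return bytes(self._files[ufid][i:i + n])

    def write(self, ufid, i, data):
        self._check(ufid)
        f = self._files[ufid]
        if i > len(f):
            f.extend(b"\0" * (i - len(f)))   # extend the file if necessary
        f[i:i + len(data)] = data
        self._attrs[ufid]["length"] = len(f)

    def delete(self, ufid):
        self._check(ufid)
        del self._files[ufid], self._attrs[ufid]

    def get_attributes(self, ufid):
        self._check(ufid)
        return dict(self._attrs[ufid])
```

As in the interface definition, every operation except Create raises an exception for an invalid UFID.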
DIRECTORY SERVICE OPERATIONS
 Lookup(Dir, Name) → FileId (throws NotFound):
Locates the text name in the directory and returns the relevant
UFID. If Name is not in the directory, throws an exception.
 AddName(Dir, Name, FileId) (throws NameDuplicate):
If Name is not in the directory, adds (Name, FileId) to the
directory and updates the file’s attribute record. If Name is
already in the directory, throws an exception.
 UnName(Dir, Name) (throws NotFound):
If Name is in the directory, removes the entry containing Name
from the directory. If Name is not in the directory, throws an
exception.
 GetNames(Dir, Pattern) → NameSeq:
Returns all the text names in the directory that match the
regular expression Pattern.
14
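A toy Python sketch of these directory operations; glob-style matching via fnmatch stands in for the regular-expression Pattern, and in a real system the directories themselves would be stored as flat-file-service files rather than in memory.

```python
import fnmatch

class NotFound(Exception): pass
class NameDuplicate(Exception): pass

class DirectoryService:
    """Toy directory service: maps text names to UFIDs per directory."""
    def __init__(self):
        self._dirs = {}   # directory id -> {name: ufid}

    def make_dir(self, d):
        self._dirs[d] = {}

    def lookup(self, d, name):
        try:
            return self._dirs[d][name]
        except KeyError:
            raise NotFound(name)

    def add_name(self, d, name, ufid):
        if name in self._dirs[d]:
            raise NameDuplicate(name)
        self._dirs[d][name] = ufid

    def un_name(self, d, name):
        if name not in self._dirs[d]:
            raise NotFound(name)
        del self._dirs[d][name]

    def get_names(self, d, pattern):
        # glob matching used here for simplicity
        return sorted(fnmatch.filter(self._dirs[d], pattern))
```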
FILE ACCESSING MODELS
 File access models are methods used for accessing remote
files and the unit of data access.
 A distributed file system may use one of the following models
to service a client's file access request when the accessed file
is remote:
1. Remote service model
 Processing of a client's request is performed at the server's node.
 Thus, the client's request for file access is delivered across the
network as a message to the server, the server machine performs
the access request, and the result is sent to the client.
 This model needs to minimize the number of messages sent and the
overhead per message. 15
2. Data-caching model
 This model attempts to reduce the network traffic of the previous
model by caching the data obtained from the server node.
 This takes advantage of the locality of reference found in file
accesses.
 This model gives increased performance and greater system scalability.
 The unit of data transfer refers to the fraction of a file that is
transferred to clients in a single read or write operation.
3.File-level transfer model
 In this model when file data is to be transferred, the entire file is
moved.
 This reduces server load and network traffic since it accesses the
server only once. This has better scalability.
 This model requires sufficient storage space on the client machine.
FILE ACCESSING MODELS…
16
4. Block-level transfer model
 File transfer takes place in file blocks.
 A file block is a contiguous portion of a file and is of fixed
length
 This does not require client nodes to have large storage space
 It eliminates the need to copy an entire file when only a small
portion of the data is needed.
 When an entire file is to be accessed, multiple server requests
are needed, resulting in more network traffic and more
network protocol overhead.
 NFS uses block-level transfer model.
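A minimal sketch of the block-level transfer model, assuming a toy 4-byte block size and an in-memory "server": the client fetches whole fixed-length blocks and caches them, so nearby reads avoid extra server requests.

```python
BLOCK_SIZE = 4  # illustrative only; real systems use e.g. 4 KB or 8 KB blocks

class BlockClient:
    """Toy client for the block-level transfer model with a block cache."""
    def __init__(self, server_file: bytes):
        self._server = server_file   # stands in for the remote file
        self._cache = {}             # block number -> block contents
        self.requests = 0            # count of simulated server round trips

    def _block(self, bno):
        # Fetch a whole block from the "server" only on a cache miss.
        if bno not in self._cache:
            self.requests += 1
            start = bno * BLOCK_SIZE
            self._cache[bno] = self._server[start:start + BLOCK_SIZE]
        return self._cache[bno]

    def read(self, offset, n):
        # Assemble the requested byte range from the covering blocks.
        first = offset // BLOCK_SIZE
        last = (offset + n - 1) // BLOCK_SIZE
        data = b"".join(self._block(b) for b in range(first, last + 1))
        start = offset - first * BLOCK_SIZE
        return data[start:start + n]
```

Repeated reads of the same region hit the cache, illustrating why block-level transfer reduces storage demands on clients at the cost of extra requests when a whole file is read.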
FILE ACCESSING MODELS…
17
5. Byte-level transfer model
 Unit of transfer is a byte.
 Model provides maximum flexibility because it allows
storage and retrieval of an arbitrary amount of a file, specified
by an offset within a file and length.
 The drawback is that cache management is harder due to the
variable-length data for different access requests.
6.Record-level transfer model
 This model is used with structured files and the unit of
transfer is the record.
FILE ACCESSING MODELS…
18
FILE STRUCTURE ARCHITECTURE
 Hierarchical file system
 Consists of a number of directories arranged in a tree structure
 File Group
 Is a collection of files that can be located on any server or
moved between servers while maintaining the same names
 A similar construct is used in UNIX file system
 Helps with distributing the load of file serving between
several servers
 File groups have identifiers which are unique throughout the
system
19
DFS
 DFS has two methods
 Sun Network File System
 Andrew File System
 NFS has a stateless server, whereas AFS has a stateful server
 AFS provides location independence (the physical location of
file can be changed without having to change the path of the
file) as well as location transparency
 NFS provides location transparency
 AFS is more scalable
20
 A stateless server does not keep information on the state of its
clients and can change its own state without informing any client.
 A stateful server maintains persistent information on its clients
and requires explicit deletion of that information by the server.
21
DFS
 This file system was developed by Sun Microsystems, hence the name
Sun Network File System (NFS).
 NFS has been widely adopted in industry and in academic
environments since its introduction in 1985.
 NFS is a client-server application, so a user can view, store and update
files on a remote computer.
 In most cases, all the clients and servers are on the same LAN.
 Each NFS server exports one or more of its directories for access by
remote clients.
 NFS provides transparent access to remote files for client programs
running on UNIX and other systems.
 NFS allows a user or system administrator to mount (designate as
accessible) all or a portion of a file system on a server.
 An important goal of NFS is to achieve a high level of support for
hardware and operating system heterogeneity.
CASE STUDY: SUN NETWORK FILE SYSTEM
22
CASE STUDY: SUN NETWORK FILE SYSTEM…
 NFS protocol (RFC 1813 ) is designed to be independent of
the computer, OS, network architecture, and transport
protocol.
 NFS uses RPC to route requests between the client and server.
 The NFS protocol was originally developed for use in
networks of UNIX systems
 The NFS server module resides in the kernel on each
computer that acts as an NFS server.
 Requests referring to files in a remote file system are
translated by the client module to NFS protocol operations
and then passed to the NFS server module at the computer
holding the relevant file system.
23
CASE STUDY: SUN NETWORK FILE SYSTEM…
The Sun NFS architecture model 24
25
CASE STUDY: SUN NETWORK FILE SYSTEM…
26
It consists of three layers:
 System call layer: handles system calls such as OPEN, READ and CLOSE.
 Virtual File System: the task of the VFS layer is to maintain a table with one entry
for each open file, analogous to the table of i-nodes for open files in UNIX. The VFS
layer has an entry called a v-node (virtual i-node) for every open file, telling
whether the file is local or remote.
 NFS client code: creates an r-node (remote i-node) in its internal tables to hold the
file handles.
 Each v-node in the VFS layer will ultimately contain either a pointer to an r-node in
the NFS client code, or a pointer to an i-node in the local operating system.
 Thus from the v-node it is possible to see whether a file or directory is local or remote,
and, if it is remote, to find its file handle.
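This v-node bookkeeping can be rendered as a small sketch; the class and field names are illustrative, not the actual SunOS kernel structures.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class INode:
    """Index of a local file (UNIX i-node number)."""
    inumber: int

@dataclass
class RNode:
    """Remote i-node, holding the NFS file handle."""
    file_handle: bytes

@dataclass
class VNode:
    """One VFS-layer entry per open file: points at either an
    i-node (local file) or an r-node (remote file)."""
    ref: Union[INode, RNode]

    @property
    def is_remote(self):
        return isinstance(self.ref, RNode)
```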
CASE STUDY: SUN NETWORK FILE SYSTEM…
 The file identifiers used in NFS are called file handles.
 The virtual file system layer has one VFS structure for each
mounted file system and one v-node per open file.
 A VFS structure relates a remote file system to the local
directory on which it is mounted.
 The v-node contains an indicator to show whether a file is
local or remote.
 If the file is local, the v-node contains a reference to the index
of the local file (an i-node in a UNIX implementation).
 If the file is remote, it contains the file handle of the remote
file.
 Reading Assignment NFS operation
CASE STUDY: SUN NETWORK FILE SYSTEM…
27
Server caching
 Caching in both the client and the server computer are
indispensable features of NFS implementations in order to
achieve adequate performance.
 In conventional UNIX systems: file pages, directories and file
attributes that have been read from disk are retained in a main
memory buffer cache until the buffer space is required for
other pages.
CASE STUDY: SUN NETWORK FILE SYSTEM…
28
Benefits of NFS: Among many benefits for organizations using NFS are
the following:
 Mature: NFS is a mature protocol, which means most aspects of
implementing, securing and using it are well understood, as are its potential
weaknesses.
 Open: NFS is an open protocol, with its continued development documented
in internet specifications as a free and open network protocol.
 Cost-effective: NFS is a low-cost solution for network file sharing that is easy
to set up because it uses the existing network infrastructure.
 Centrally managed: NFS's centralized management decreases the need for
added software and disk space on individual user systems.
 User-friendly: The protocol is easy to use and enables users to access remote
files on remote hosts in the same way they access local ones.
 Distributed: NFS can be used as a distributed file system, reducing the need
for removable media storage devices.
 Secure: With NFS, there is less removable media like CDs, DVDs, Blu-ray
discs, diskettes and USB devices in circulation, making the system more
secure.
29
CASE STUDY: SUN NETWORK FILE SYSTEM…
Disadvantages of NFS include the following:
 Dependence on RPCs makes NFS inherently insecure; it should only be
used on a trusted network behind a firewall. Otherwise, NFS is
vulnerable to internet threats.
 Some reviews of NFSv4 and NFSv4.1 suggest that these versions have
limited bandwidth and scalability and that NFS slows down during heavy
network traffic. The bandwidth and scalability issue is reported to have
improved with NFSv4.2.
30
CASE STUDY: SUN NETWORK FILE SYSTEM…
NFS summary
 Sun NFS closely follows the abstract model.
 The resulting design provides good location and access transparency if the
NFS mount service is used properly to produce similar name spaces at all
clients.
 NFS supports heterogeneous hardware and operating systems.
 The NFS server implementation is stateless, enabling clients and servers to
resume execution after a failure without the need for any recovery
procedures.
 Migration of files or filesystems is not supported, except at the level of
manual intervention to reconfigure mount directives after the movement of
a filesystem to a new location.
 Design goals of NFS are Access transparency, Location transparency,
Mobility transparency, Scalability, File replication, Hardware and
operating system heterogeneity, Fault tolerance, Consistency, Security and
Efficiency
CASE STUDY: SUN NETWORK FILE SYSTEM…
31
CASE STUDY: THE ANDREW FILE SYSTEM
 It is a location-independent file system.
 It is a DFS that uses a set of trusted servers (Vice servers) to present a
homogeneous, location-transparent file name space to all the client
workstations.
 With AFS, people can work together on the same files, no matter
where the files are located.
 AFS users do not need to know which machine is storing a file.
 AFS makes it as easy to access files stored on a remote computer as
files stored on the local disk.
 All files you store in AFS are available online by just connecting
to your AFS server through an AFS client.
 AFS uses a set of remote servers to access a file.
 AFS uses a local cache to reduce the workload and increase the
performance of a distributed computing environment. 32
 In AFS, the server keeps track of which files are opened by which
clients (unlike in NFS)
 Like NFS, AFS provides transparent access to remote shared
files for UNIX programs running on workstations.
 Access to AFS files is via the normal UNIX file primitives,
enabling existing UNIX programs to access AFS files without
modification or recompilation.
 AFS is compatible with NFS. AFS servers hold ‘local’ UNIX
files, but the filing system in the servers is NFS-based, so files
are referenced by NFS-style file handles rather than i-node
numbers, and the files may be remotely accessed via NFS.
CASE STUDY: THE ANDREW FILE SYSTEM…
33
Implementation of AFS
 Venus: the client-side manager, which acts as an interface between the
application programs and Vice
 Vice: the server-side processes that reside on top of the UNIX kernel,
providing shared file services to each client
 All files in AFS are distributed among the servers. The set of files on one
server is referred to as a volume
 If a request cannot be satisfied from this set of files, the Vice
server informs the client where it can find the required files
 The files available to user processes running on workstations are either local or
shared
 Local files are handled as normal UNIX files
 Shared files are stored on servers, and copies of them are cached on local disks
of workstations
34
CASE STUDY: THE ANDREW FILE SYSTEM…
Features of AFS
 File backup: AFS data files are backed up nightly. Backups are kept on
site for six months
 File security: AFS data files are protected by the Kerberos
authentication system
 Physical security: AFS data files are stored on servers located in the
UCSC data center
 Reliability and availability: AFS servers and storage are maintained
on redundant hardware
 Authentication: AFS uses Kerberos for authentication. Kerberos
accounts are automatically provisioned for all UCSC faculty
and staff
 Space per user (quota): AFS provides 500 MB of space per user, and
users can request an increase up to 10 GB.
35
CASE STUDY: THE ANDREW FILE SYSTEM…
36
CASE STUDY: THE ANDREW FILE SYSTEM…
 AFS has two unusual design characteristics:
Whole-file serving:
 The entire contents of directories and files are transmitted to client
computers by AFS servers
Whole-file caching:
 Once a copy of a file or a chunk has been transferred to a client
computer it is stored in a cache on the local disk.
 The cache contains several hundred of the files most recently used
on that computer.
 The cache is permanent, surviving reboots of the client computer.
 Local copies of files are used to satisfy clients’ open requests in
preference to remote copies whenever possible.
CASE STUDY: THE ANDREW FILE SYSTEM…
37
A FIRST REQUEST FOR DATA FROM A WORKSTATION IS SATISFIED BY THE SERVER AND THE DATA IS PLACED IN THE
LOCAL CACHE
AFS is implemented as two software components that exist as UNIX processes
called Vice and Venus. Vice is the name given to the server software that runs as a
user-level UNIX process in each server computer, and Venus is a user-level
process that runs in each client computer and corresponds to the client module in
our abstract model.
The set of files in one server is referred as volume.
38
IMPLEMENTATION ANDREW FILE SYSTEM
 The files available to user processes running on workstations are either local or shared.
 Local files are handled as normal UNIX files. They are stored on a workstation’s disk and are
available only to local user processes.
 Shared files are stored on servers, and copies of them are cached on the local disks of
workstations.
 The file name space seen by workstation users is a conventional UNIX directory hierarchy, with a
specific subtree (called cmu) containing all of the shared files. This splitting of the file name
space into local and shared files leads to some loss of location transparency, but this is hardly
noticeable to users other than system administrators.
 Local files are used only for temporary files (/tmp) and processes that are essential for
workstation startup.
 Other standard UNIX files (such as those normally found in /bin, /lib and so on) are
implemented as symbolic links from local directories to files held in the shared space.
 Users’ directories are in the shared space, enabling users to access their files from any
workstation.
39
IMPLEMENTATION ANDREW FILE SYSTEM…
 A flat file service is implemented
by the Vice servers, and the
hierarchic directory structure
required by UNIX user programs is
implemented by the set of Venus
processes in the workstations.
 Each file and directory in the
shared file space is identified by a
unique, 96-bit file identifier (fid)
similar to a UFID. The Venus
processes translate the pathnames
issued by clients to fids
40
Implementation Andrew File System…
 Files are grouped into volumes for ease
of location and movement. Volumes are
generally smaller than the UNIX
filesystems, which are the unit of file
grouping in NFS. For example, each
user’s personal files are generally
located in a separate volume. Other
volumes are allocated for system
binaries, documentation and library
code.
 The representation of fids includes the
volume number for the volume
containing the file (cf. the file group
identifier in UFIDs), an NFS file handle
identifying the file within the volume
(cf. the file number in UFIDs) and a
uniquifier to ensure that file identifiers
are not reused:
32 bits 32 bits 32 bits
Volume number File handle Uniquifier 41
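As a check on the 96-bit layout above, a short Python sketch can pack and unpack the three 32-bit fields of a fid; the big-endian field order here is an assumption for illustration only.

```python
import struct

def pack_fid(volume, file_handle, uniquifier):
    """Pack the three 32-bit fid fields into 12 bytes (96 bits)."""
    return struct.pack(">III", volume, file_handle, uniquifier)

def unpack_fid(fid):
    """Recover (volume number, file handle, uniquifier) from a packed fid."""
    return struct.unpack(">III", fid)
```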
IMPLEMENTATION ANDREW FILE SYSTEM…
Cache consistency
 Stateful servers in AFS allow the server to inform all clients
with open files about any updates made to a file by another
client, through what is known as a callback
 Callbacks to all clients with a copy of a file are ensured because a
callback promise is issued by the server to a client when it
requests a copy of a file
42
IMPLEMENTATION ANDREW FILE SYSTEM…
 When Vice supplies a copy of a file to a Venus process it also
provides a callback promise – a token issued by the Vice
server guaranteeing that it will notify the Venus process
when any other client modifies the file.
 Callback promises are stored with the cached files on the
workstation disks and have two states:
valid
cancelled.
43
IMPLEMENTATION ANDREW FILE SYSTEM…
 When a server performs a request to update a file it notifies
all of the Venus processes to which it has issued callback
promises by sending a callback to each
 A callback is a remote procedure call from a server to a Venus
process.
 When the Venus process receives a callback, it sets the
callback promise token for the relevant file to cancelled.
44
IMPLEMENTATION ANDREW FILE SYSTEM…
 Whenever Venus handles an open on behalf of a client, it checks
the cache.
 If the required file is found in the cache, then its token is checked.
 If its value is cancelled, then a fresh copy of the file must be
fetched from the Vice server,
 But if the token is valid, then the cached copy can be opened and
used without reference to Vice.
45
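The open-time check described above can be sketched as follows; VenusCache and fetch_from_vice are hypothetical names, and the real Venus caches whole files on the local disk rather than in memory.

```python
VALID, CANCELLED = "valid", "cancelled"

class VenusCache:
    """Sketch of Venus's open-time check: use the cached copy only while
    its callback promise token is valid; otherwise refetch from Vice."""
    def __init__(self, fetch_from_vice):
        self._fetch = fetch_from_vice    # callable(fid) -> file data
        self._cache = {}                 # fid -> (data, token)

    def open(self, fid):
        entry = self._cache.get(fid)
        if entry and entry[1] == VALID:
            return entry[0]              # no reference to Vice needed
        data = self._fetch(fid)          # fresh copy + new callback promise
        self._cache[fid] = (data, VALID)
        return data

    def break_callback(self, fid):
        """Vice's callback RPC: cancel the promise on a cached file."""
        if fid in self._cache:
            data, _ = self._cache[fid]
            self._cache[fid] = (data, CANCELLED)
```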
Implementation Andrew File System…
 When a workstation is restarted after a failure or a shutdown
 Venus aims to retain as many as possible of the cached files
on the local disk
 But it cannot assume that the callback promise tokens are
correct, since some callbacks may have been missed.
46
IMPLEMENTATION ANDREW FILE SYSTEM…
 Before the first use of each cached file or directory after a
restart,
 Venus therefore generates a cache validation request
containing the file modification timestamp to the server that is
the custodian of the file.
 If the timestamp is current, the server responds with valid, and
the token is reinstated.
 If the timestamp shows that the file is out of date, then the
server responds with cancelled, and the token is set to
cancelled.
 Callbacks must be renewed
47
IMPLEMENTATION ANDREW FILE SYSTEM…
Implementation of file system calls in AFS
48
OPERATION OF AFS
 Fetch(fid) → attr, data: Returns the attributes (status) and, optionally,
the contents of the file identified by fid and records a
callback promise on it.
 Store(fid, attr, data): Updates the attributes and (optionally) the
contents of a specified file.
 Create() → fid: Creates a new file and records a callback promise on it.
 Remove(fid): Deletes the specified file.
 SetLock(fid, mode): Sets a lock on the specified file or directory. The
mode of the lock may be shared or exclusive. Locks that are not
removed expire after 30 minutes.
 ReleaseLock(fid): Unlocks the specified file or directory.
 RemoveCallback(fid): Informs the server that a Venus process has
flushed a file from its cache.
 BreakCallback(fid): Call made by a Vice server to a Venus process;
cancels the callback promise on the relevant file.
49
THANK YOU FOR YOUR ATTENTION
50
FUNDAMENTALS OF DISTRIBUTED SYSTEM
Chapter Four
Naming
1
 we will discuss
 some general issues in naming
 how human-friendly names are organized and implemented;
e.g., those for file systems and the WWW; classes of
naming systems
 flat naming
 structured naming, and
 attribute-based naming
2
OBJECTIVES OF THE CHAPTER
INTRODUCTION
 names play an important role to:
 share resources
 uniquely identify entities
 refer to locations
 etc.
 an important issue is that a name can be resolved to the
entity it refers to
 to resolve names, it is necessary to implement a
naming system
 in a distributed system, the implementation of a naming
system is itself often distributed, unlike in non-distributed
systems
 efficiency and scalability of the naming system are the
main issues
3
 Uniform Resource Identifiers (URIs) [Berners-Lee et al.
2005] came about from the need to identify resources
on the Web, and other Internet resources such as
electronic mailboxes. An important goal was to identify
resources in a coherent way, so that they could all be
processed by common software such as browsers.
 Uniform Resource Locator (URL) is often used for URIs
that provide location information and specify the method
for accessing the resource, including the 'http' scheme.
 Uniform Resource Names (URNs) are URIs that are
used as pure resource names rather than locators.
4
COMMON TERMS
 A name service stores information about a collection of
textual names, in the form of bindings between the names
and the attributes of the entities they denote, such as
users, computers, services and objects.
 A name space is the collection of all valid names
recognized by a particular service.
 A naming domain is a name space for which there exists a
single overall administrative authority responsible for
assigning names within it.
 The Domain Name System is a name service design
whose main naming database is used across the Internet.
 A service that stores collections of bindings between
names and attributes and that looks up entries that match
attribute-based specifications is called a directory service.
Directory services are also sometimes known as attribute-
based name services.
5
Names, Identifiers, and Addresses
a name in a distributed system is a string of bits or
characters that is used to refer to an entity
an entity is anything; e.g., resources such as hosts, printers,
disks, files, objects, processes, users, Web pages,
newsgroups, mailboxes, network connections, ...
entities can be operated on
 e.g., a resource such as a printer offers an interface
containing operations for printing a document, requesting
the status of a job, etc.
 a network connection may provide operations for sending
and receiving data, setting quality of service parameters, etc.
to operate on an entity, it is necessary to access it through its
access point, itself a (special) entity
6
 access point
 the name of an access point is called an address (such as
IP address and port number as used by the transport layer)
 the address of the access point of an entity is also referred
to as the address of the entity
 an entity can have more than one access point (similar to
accessing an individual through different telephone
numbers)
 an entity may change its access point in the course of time
(e.g., a mobile computer getting a new IP address as it
moves)
7
 an address is a special kind of name
 it refers to at most one entity
 an entity may have more than one address, e.g., when it is
replicated such as in Web pages
 an entity may change an access point, or an access point
may be reassigned to a different entity (like telephone
numbers in offices)
 separating the name of an entity from its address makes
naming easier and more flexible; such a name is called
location independent
 there are also other types of names that uniquely identify an
entity; in any case a true identifier is a name with the
following properties
 it refers to at most one entity
 each entity is referred by at most one identifier
 it always refers to the same entity (never reused)
 identifiers allow us to unambiguously refer to an entity
8
 examples
 name of an FTP server (entity)
 URL of the FTP server
 address of the FTP server
 IP number:port number
 the address of the FTP server may change
 there are three classes of naming systems: flat naming,
structured naming, and attribute-based naming
9
A.Flat Naming
a name is a sequence of characters without structure; like
human names? may be if it is not an Ethiopian name!
difficult to use in a large system since names must be
centrally controlled to avoid duplication
moreover, it does not contain any information on how to
locate the access point of its associated entity
how are flat names resolved (or how to locate an entity when
a flat name is given)
 name resolution: mapping a name to an address or an
address to a name is called name-address resolution
 possible solutions: simple solutions, home-based
approaches, and hierarchical approaches
10
1. Simple Solutions
 two solutions (for LANs only): Broadcasting and
Multicasting, and Forwarding Pointers
a. Broadcasting and Multicasting
 broadcast a message containing the identifier of an entity;
only machines that can offer an access point for the entity
send a reply
 e.g., ARP (Address Resolution Protocol) in the Internet to find
the data link address (MAC address) of a machine
 a computer that wants to access another computer for
which it knows its IP address broadcasts this address
 the owner responds by sending its Ethernet address
 broadcasting is inefficient when the network grows (wastage
of bandwidth and too much interruption to other machines)
 multicasting is better when the network grows - send only to
a restricted group of hosts 11
12
 multicasting can also be used to locate the nearest replica
- choose the one whose reply comes in first
b. Forwarding Pointers
 how to look for mobile entities
 when an entity moves from A to B, it leaves behind a
reference to its new location
 advantage
 simple: as soon as the first name is located using
traditional naming service, the chain of forwarding
pointers can be used to find the current address
 drawbacks
 the chain can be too long - locating becomes expensive
 all the intermediary locations in a chain have to maintain
their pointers
 vulnerability if links are broken
 hence, making sure that chains are short and that
forwarding pointers are robust is an important issue
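Resolution along a chain of forwarding pointers can be sketched in a few lines of Python; the dictionary of pointers is an illustrative stand-in for per-location forwarding state, and a real system would also shorten long chains once the current address is found.

```python
def resolve(start, forwarding):
    """Follow forwarding pointers from `start` until a location with no
    outgoing pointer is reached; return (current location, hops taken)."""
    loc, hops = start, 0
    while loc in forwarding:
        loc = forwarding[loc]   # each old location points to the next one
        hops += 1
    return loc, hops
```

The hop count makes the drawback visible: the longer the chain, the more expensive locating the entity becomes.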
2. HOME-BASED APPROACHES
 broadcasting and multicasting have scalability problems;
performance and broken links are problems in forwarding
pointers
 a home location keeps track of the current location of an
entity; often it is the place where an entity was created
 it is a two-tiered approach
 an example where it is used is Mobile IP
 each mobile host uses a fixed IP address
 all communication to that IP address is initially directly
sent to the host’s home agent located on the LAN
corresponding to the network address contained in the
mobile host’s IP address
 whenever the mobile host moves to another network, it
requests a temporary address in the new network
(called care-of-address) and informs the new address to
the home agent
13
 when the home agent receives a message for the mobile host
(from a correspondent agent) it forwards it to its new address
(if it has moved) and also informs the sender the host’s
current location for sending other packets
home-based approach: the principle of Mobile IP
 problems:
 creates communication latency (Triangle routing:
correspondent-home network-mobile)
 the home location must always exist; the host is
unreachable if the home does no more exist (permanently
changed); the solution is to register the home at a traditional
name service and let a client first look up the location of the
home
15
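a minimal sketch of the home-based idea, assuming a simple in-memory home agent (all names and addresses are invented):

```python
# Sketch of the home-based (Mobile IP) approach: the home agent keeps
# the mobile host's care-of address and forwards messages to it.

class HomeAgent:
    def __init__(self, home_address):
        self.home_address = home_address
        self.care_of = None                  # set when the host moves

    def register(self, care_of_address):
        """Mobile host informs its home agent of its new location."""
        self.care_of = care_of_address

    def forward(self, message):
        """Forward a message to the current location; also report that
        location to the sender so later packets can go directly
        (mitigating triangle routing)."""
        target = self.care_of or self.home_address
        return {"delivered_to": target, "current_location": target}

agent = HomeAgent("130.37.0.5")
agent.register("192.168.7.9")   # host moved to a foreign network
```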
B. Structured Naming
flat names are not convenient for humans
Name Spaces
 names are organized into a name space
 each name is made of several parts; the first may define
the nature of the organization, the second the name, the
third departments, ...
 the authority to assign and control the name spaces can be
decentralized where a central authority assigns only the
first two parts
 a name space is generally organized as a labeled, directed
graph with two types of nodes
 leaf node: represents the named entity and stores
information such as its address or the state of that entity
 directory node: a special entity that has a number of
outgoing edges, each labeled with a name
each node in a naming graph is considered as another entity
with an identifier
16
17
a general naming graph with a single root node, n0
 a directory node stores a table in which an outgoing edge is
represented as a pair (edge label, node identifier), called a
directory table
 each path in a naming graph can be referred to by the
sequence of labels corresponding to the edges of the path
and the first node in the path, such as
N:<label-1, label-2, ..., label-n>, where N refers to the first
node in the path
 such a sequence is called a path name
 if the first node is the root of the naming graph, it is called an
absolute path name; otherwise it is a relative path name
 instead of the path name n0:<home, steen, mbox>, we often use
its string representation /home/steen/mbox
 there may also be several paths leading to the same node, e.g.,
node n5 can be represented as /keys or
/home/steen/keys
 although the above naming graph is a directed acyclic graph (a
node can have more than one incoming edge but is not
permitted to have a cycle), the common way is to use a tree
(hierarchical) with a single root (as is used in file systems)
 in a tree structure, each node except the root has exactly one
incoming edge; the root has no incoming edges
 each node also has exactly one associated (absolute) path name
18
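the directory-table resolution described above can be sketched as follows; the tables loosely mirror the /home/steen example, with assumed contents:

```python
# Sketch of name resolution in a naming graph. Each directory node has
# a directory table mapping an edge label to the next node's identifier.

graph = {
    "n0": {"home": "n1", "keys": "n5"},      # root directory node
    "n1": {"steen": "n2"},
    "n2": {"mbox": "n3", "keys": "n5"},
}

def resolve(start, path):
    """Resolve the path name start:<label-1, ..., label-n> to a node id,
    performing one directory-table lookup per edge."""
    node = start
    for label in path:
        node = graph[node][label]
    return node
```

note that both /keys and /home/steen/keys resolve to the same node n5, i.e., a hard link.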
 e.g., file naming in UNIX file system
 a directory node represents a directory and a leaf node
represents a file
 there is a single root directory, represented in the naming
graph by the root node
 we have a contiguous series of blocks from a logical disk
 the boot block is used to load the operating system
 the superblock contains information on the entire file
system such as its size, etc.
 inodes are referred to by an index number, starting at
number zero, which is for the inode representing the root
directory
 given the index number of an inode, it is possible to access
its associated file
19
 Name Resolution
 given a path name, the process of looking up a name
stored in the node is referred to as name resolution; it
consists of finding the address when the name is given (by
following the path)
 knowing how and where to start name resolution is referred
to as closure mechanism; e.g., UNIX file system
 Linking and Mounting
 Linking: giving another name for the same entity (an alias)
e.g., environment variables in UNIX such as HOME that
refer to the home directory of a user
 two types of links (or two ways to implement an alias):
hard link and symbolic link
 hard link: to allow multiple absolute path names to
refer to the same node in a naming graph
e.g., in the previous graph, there are two different path names
for node n5: /keys and /home/steen/keys 20
 symbolic link: representing an entity by a leaf node and
instead of storing the address or state of the entity, the
node stores an absolute path name
21
the concept of a symbolic link explained in a naming graph
when first resolving an absolute path name stored in a
node (e.g., /home/steen/keys in node n6), name
resolution will return the path name stored in the node
(/keys), at which point it can continue with resolving that
new path name, i.e., closure mechanism
 so far name resolution was discussed as taking place
within a single name space
 name resolution can also be used to merge different name
spaces in a transparent way
 the solution is to use mounting
 as an example, consider a mounted file system, which
can be generalized to other name spaces as well
 let a directory node store the directory node from a
different (foreign) name space
 the directory node storing the node identifier is called a
mount point
 the directory node in the foreign name space is called a
mounting point, normally the root of a name space
 during name resolution, the mounting point is looked up
and resolution proceeds by accessing its directory table
 consider a collection of name spaces distributed across
different machines (each name space implemented by a
different server)
 to mount a foreign name space in a distributed system, the
following are at least required
 the name of an access protocol (for communication)
 the name of the server
 the name of the mounting point in the foreign name space
 each of these names needs to be resolved
 to the implementation of the protocol so that
communication can take place properly
 to an address where the server can be reached
 to a node identifier in the foreign name space (to be
resolved by the server of the foreign name space)
 the three names can be listed as a URL
 example: Sun’s Network File System (NFS) is a distributed file
system with a protocol that describes how a client can
access a file stored on a (remote) NFS file server
 an NFS URL may look like nfs://flits.cs.vu.nl/home/steen
- nfs is an implementation of a protocol
- flits.cs.vu.nl is a server name to be resolved using DNS
- /home/steen is resolved by the foreign server
 e.g., the subdirectory /remote includes mount points for
foreign name spaces on the client machine
 a directory node named /remote/vu is used to store
nfs://flits.cs.vu.nl/home/steen
 consider /remote/vu/mbox
 this name is resolved by starting at the root directory on the
client’s machine until node /remote/vu, which returns the
URL nfs://flits.cs.vu.nl/home/steen
 this leads the client machine to contact flits.cs.vu.nl using
the NFS protocol
 then the file mbox is read in the directory /home/steen
mounting remote name spaces through a specific process protocol
MOUNT POINT
MOUNTING POINT
 distributed systems that allow mounting a remote file system
also allow a user to execute commands on it
 example commands to access the mounted file system
cd /remote/vu   # change to the directory on the remote machine
ls -l           # list the files on the remote machine
 the user need not worry about the details of the actual
access; the name space on the local machine and that on the
remote machine appear to form a single name space
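the mount-point step can be sketched as follows: resolution proceeds locally until a node storing a URL is reached, and the rest of the path is handed to the foreign server (the table contents and return format are assumptions):

```python
# Sketch: resolving /remote/vu/mbox. When resolution reaches a mount
# point (a directory node storing a URL), the remaining path labels
# must be resolved by the foreign server named in that URL.

mount_points = {"/remote/vu": "nfs://flits.cs.vu.nl/home/steen"}

def resolve(path):
    """Return ('foreign', url, rest) if a mount point is crossed,
    otherwise ('local', path)."""
    parts = [p for p in path.split("/") if p]
    current = ""
    for i, part in enumerate(parts):
        current += "/" + part
        if current in mount_points:
            # hand the remaining labels over to the foreign name server
            return ("foreign", mount_points[current],
                    "/" + "/".join(parts[i + 1:]))
    return ("local", path)
```

e.g., resolving /remote/vu/mbox stops at the mount point and returns the NFS URL together with the remainder /mbox.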
 The Implementation of a Name Space
 a name space forms the heart of a naming service
 a naming service allows users and processes to add,
remove, and lookup names
 a naming service is implemented by name servers
 for a distributed system on a single LAN, a single server
might suffice; for a large-scale distributed system the
implementation of a name space is distributed over multiple
name servers
 Name Space Distribution
 in large scale distributed systems, it is necessary to
distribute the name service over multiple name servers,
usually organized hierarchically
 a name service can be partitioned into logical layers
 the following three layers can be distinguished (according to
Cheriton and Mann)
 global layer
 formed by highest level nodes (root node and nodes close
to it or its children)
 nodes on this layer are characterized by their stability, i.e.,
directory tables are rarely changed
 they may represent organizations, groups of
organizations, ..., where names are stored in the name
space
 administrational layer
 groups of entities that belong to the same organization or
administrational unit, e.g., departments
 relatively stable
 managerial layer
 nodes that may change regularly, e.g., nodes representing
hosts of a LAN, shared files such as libraries or binaries,
…
 nodes are managed not only by system administrators,
but also by end users
an example partitioning of the DNS name space, including Internet-
accessible files, into three layers
 the name space is divided into nonoverlapping parts, called
zones in DNS
 a zone is a part of the name space that is implemented by a
separate name server
 some requirements of servers at different layers: performance
(responsiveness to lookups), availability (failure rate), etc.
 high availability is critical for the global layer, since name
resolution cannot proceed beyond the failing server; it is
also important at the administrational layer for clients in the
same organization
 performance is very important in the lowest layer, since
results of lookups can be cached and used due to the
relative stability of the higher layers
 they may be enhanced by client side caching (for global and
administrational layers since names do not change often)
and replication; they create implementation problems since
they may introduce inconsistency (see Chapter 7)
a comparison between name servers for implementing nodes from a large-
scale name space partitioned into a global layer, an administrational
layer, and a managerial layer
Item | Global | Administrational | Managerial
Geographical scale of network | Worldwide | Organization | Department
Total number of nodes | Few | Many | Vast numbers
Responsiveness to lookups | Seconds | Milliseconds | Immediate
Update propagation | Lazy | Immediate | Immediate
Availability requirement | Very high | High | Low
Number of replicas | Many | None or few | None
Is client-side caching applied? | Yes | Yes | Sometimes
 Implementation of Name Resolution
 recall that name resolution consists of finding the address
when the name is given
 assume that name servers are not replicated and that no
client-side caches are allowed
 each client has access to a local name resolver, responsible
for ensuring that the name resolution process is carried out
 e.g., assume the path name
root:<nl, vu, cs, ftp, pub, globe, index.txt> is to be resolved
or using a URL notation, this path name would correspond to
ftp://ftp.cs.vu.nl/pub/globe/index.txt
 a host that needs to map a name to an address calls a DNS
client named a resolver (and provides it the name to be
resolved - ftp.cs.vu.nl)
 the resolver accesses the closest DNS server with a mapping
request
 if the server has the information it satisfies the resolver;
otherwise, it either refers the resolver to other servers (called
Iterative Resolution) or asks other servers to provide it with
the information (called Recursive Resolution)
 Iterative Resolution
 a name resolver hands over the complete name to the root
name server
 the root name server will resolve the name as far as it can
and return the result to the client; at the minimum it can
resolve the first level and sends the name of the first level
name server to the client
 the client calls the first level name server, then the second,
..., until it finds the address of the entity
the principle of iterative name resolution
 Recursive Resolution
 a name resolver hands over the whole name to the root name
server
 the root name server will try to resolve the name and if it
can’t, it requests the first level name server to resolve it and
to return the address
 the first level will do the same thing recursively
the principle of recursive name resolution
 Advantages and drawbacks
 recursive name resolution puts a higher performance
demand on each name server; hence name servers in the
global layer support only iterative name resolution
 caching is more effective with recursive name resolution
 each name server gradually learns the address of each
name server responsible for implementing lower-level
nodes
 eventually lookup operations can be handled efficiently
recursive name resolution of <nl, vu, cs, ftp>; name servers cache
intermediate results for subsequent lookups
Server for node | Should resolve | Looks up | Passes to child | Receives and caches | Returns to requester
cs | <ftp> | #<ftp> | -- | -- | #<ftp>
vu | <cs,ftp> | #<cs> | <ftp> | #<ftp> | #<cs>, #<cs,ftp>
nl | <vu,cs,ftp> | #<vu> | <cs,ftp> | #<cs>, #<cs,ftp> | #<vu>, #<vu,cs>, #<vu,cs,ftp>
root | <nl,vu,cs,ftp> | #<nl> | <vu,cs,ftp> | #<vu>, #<vu,cs>, #<vu,cs,ftp> | #<nl>, #<nl,vu>, #<nl,vu,cs>, #<nl,vu,cs,ftp>
the comparison between recursive and iterative name resolution with
respect to communication costs; assume the client is in Ethiopia and
the name servers in the Netherlands
 communication costs may be reduced in recursive name
resolution
 Summary
Method | Advantages
Recursive | Less communication cost; caching is more effective
Iterative | Less performance demand on name servers
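both strategies can be sketched with a toy server hierarchy for <nl, vu, cs, ftp> (server names and the final address are invented):

```python
# Toy sketch contrasting iterative and recursive name resolution.
# Each "server" resolves one label by looking up its directory table.

servers = {
    "root": {"nl": "nl-server"},
    "nl-server": {"vu": "vu-server"},
    "vu-server": {"cs": "cs-server"},
    "cs-server": {"ftp": "#ftp-addr"},
}

def iterative(name):
    """The client contacts each server in turn; every referral is one
    more client-side message."""
    server, hops = "root", 0
    for label in name:
        server = servers[server][label]      # server returns a referral
        hops += 1
    return server, hops

def recursive(name, server="root"):
    """Each server asks the next one on the client's behalf."""
    if not name:
        return server
    return recursive(name[1:], servers[server][name[0]])
```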
 Example - The Domain Name System (DNS)
 one of the largest distributed naming services is the
Internet DNS
 it is used for looking up host addresses and mail servers
 hierarchical, defined in an inverted tree structure with the
root at the top
 the tree can have at most 128 levels
 Label
 each node has a label, a string with a
maximum of 63 characters (case
insensitive)
 the root label is null (has no label)
 children of a node must have
different names (to guarantee
uniqueness)
 Domain Name
 each node has a domain name; it is
a path name to its root node
 a full domain name is a sequence of
labels separated by dots (the last
character is a dot)
 domain names are read from the
node up to the root
 full path names must not exceed
255 characters
 the contents of a node is formed by a collection of resource
records; the important ones are the following
Type of record | Associated entity | Description
SOA (start of authority) | Zone | Holds information on the represented zone, such as an e-mail address of the system administrator
A (address) | Host | Contains an IP address of the host this node represents
MX (mail exchange) | Domain | Refers to a mail server to handle mail addressed to this node; it is a symbolic link; e.g., name of a mail server
SRV | Domain | Refers to a server handling a specific service
NS (name server) | Zone | Refers to a name server that implements the represented zone
CNAME | Node | Contains the canonical name of a host; an alias
PTR (pointer) | Host | Symbolic link with the primary name of the represented node; for mapping an IP address to a name
HINFO (host info) | Host | Holds information on the host this node represents, such as machine type and OS
TXT | Any kind | Contains any entity-specific information considered useful; cannot be automatically processed
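in practice an A-record style lookup goes through the host's resolver (the closure mechanism); a minimal sketch using Python's standard library, resolving localhost so no network is needed:

```python
import socket

def lookup_a_record(hostname):
    """Ask the system's resolver (the local DNS client) for an IPv4
    address -- the equivalent of querying an A record."""
    return socket.gethostbyname(hostname)

# 'localhost' is resolvable locally, without contacting a name server
address = lookup_a_record("localhost")
```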
 cs.vu.nl represents the domain as well as the zone; it has 4
name servers (ns, star, top, solo) and 3 mail servers
 name server for this zone with 2 network addresses (star)
 mail servers; the numbers preceding the name show
priorities; first the one with the lowest number is tried
an excerpt from the DNS database for the zone cs.vu.nl
an excerpt from the DNS database for the zone cs.vu.nl, cont’d
 a Web server and an FTP server, implemented by a single
machine (soling)
 older server clusters (vucs-das1)
 two printers (inkt and pen) with a local address; i.e., they
cannot be accessed from outside
part of the description for the vu.nl domain which contains the cs.vu.nl domain
 cs.vu.nl is implemented as a single zone
 hence, the records in the previous slides do not include
references to other zones
 nodes in a subdomain that are implemented in a different
zone are specified by giving the domain name and IP
address
C. Attribute-Based Naming
flat naming: provides a unique and location-independent way
of referring to entities
structured naming: also provides a unique and location-
independent way of referring to entities as well as human-
friendly names
but both do not allow searching entities by giving a
description of an entity
in attribute-based naming, each entity is assumed to have a
collection of attributes that say something about the entity
then a user can search an entity by specifying (attribute, value)
pairs known as attribute-based naming
Directory Services
 attribute-based naming systems are also called directory
services whereas systems that support structured naming
are called naming systems
 how are resources described? one possibility is to use RDF
(Resource Description Framework) that uses triplets
consisting of a subject, a predicate, and an object
 e.g., (person, name, Alice) to describe a resource Person
whose Name is Alice
 or in e-mail systems, we can use sender, recipient, subject,
etc. for searching
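a minimal sketch of searching by (attribute, value) pairs, with an invented directory:

```python
# Sketch of attribute-based lookup: each entity carries (attribute,
# value) pairs, and a search returns every entity matching all the
# given pairs. The directory contents are invented for illustration.

directory = [
    {"type": "host", "name": "star", "os": "linux"},
    {"type": "host", "name": "zephyr", "os": "solaris"},
    {"type": "printer", "name": "inkt", "location": "room-1"},
]

def search(**attrs):
    """Return entities whose attributes match every given pair."""
    return [e for e in directory
            if all(e.get(k) == v for k, v in attrs.items())]
```

e.g., `search(type="host", os="linux")` describes the entity instead of naming it.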
 Hierarchical Implementations: LDAP
 distributed directory services are implemented by combining
structured naming with attribute-based naming
 e.g., Microsoft’s Active Directory service
 such systems rely on the Lightweight Directory Access
Protocol, or LDAP, which is derived from OSI’s X.500
directory service
 an LDAP directory service consists of a number of records
called directory entries, each made up of (attribute, value)
pairs, similar to a resource record in DNS; attributes could be
single- or multiple-valued (e.g., Mail_Servers on next slide)
a simple example of an LDAP directory entry using LDAP naming
conventions to identify the network addresses of some servers
Attribute | Abbr. | Value
Country | C | NL
Locality | L | Amsterdam
Organization | O | Vrije Universiteit
OrganizationalUnit | OU | Comp. Sc.
CommonName | CN | Main server
Mail_Servers | -- | 137.37.20.3, 130.37.24.6, 137.37.20.10
FTP_Server | -- | 130.37.20.20
WWW_Server | -- | 130.37.20.20
 the collection of all directory entries is called a Directory
Information Base (DIB)
 each record is uniquely named so that it can be looked up
 each naming attribute is called a Relative Distinguished
Name (RDN); the first 5 entries above
 a globally unique name is formed using abbreviations of
naming attributes, e.g.,
/C=NL/O=Vrije Universiteit/OU=Comp. Sc.
 this is similar to the DNS name nl.vu.cs
 listing RDNs in sequence leads to a hierarchy of the
collection of directory entries, called a Directory
Information Tree (DIT)
 a DIT forms the naming graph of an LDAP directory service
where each node represents a directory entry
part of the directory information tree
 node N corresponds to the directory entry shown earlier; it
also acts as a parent of other directory entries that have an
additional attribute, Host_Name; such entries may be used
to represent hosts
Attribute | Value
Country | NL
Locality | Amsterdam
Organization | Vrije Universiteit
OrganizationalUnit | Comp. Sc.
CommonName | Main server
Host_Name | star
Host_Address | 192.31.231.42

Attribute | Value
Country | NL
Locality | Amsterdam
Organization | Vrije Universiteit
OrganizationalUnit | Comp. Sc.
CommonName | Main server
Host_Name | zephyr
Host_Address | 137.37.20.10
two directory entries having Host_Name as RDN
Reading Assignment: case study of global Name services,
Distributed Shared Memory
THANK YOU FOR YOUR ATTENTION
51
FUNDAMENTALS OF DISTRIBUTED SYSTEM
Chapter Five
Synchronization
1
OUTLINE
 Clock synchronization, physical clocks and clock synchronization algorithms
 Logical clocks and time stamps
 Global state
 Distributed transactions and concurrency control
 Election algorithms
 Mutual exclusion and various algorithms to achieve mutual Exclusion
2
INTRODUCTION
 synchronization is the coordination of actions between processes
 in a distributed (asynchronous) system, processes run and events
occur independently of one another
 cooperation is partly supported by naming; it allows processes to at
least share resources (entities)
 synchronization deals with how to ensure that processes do not
simultaneously access a shared resource
 and with how events can be ordered, such as two processes sending
messages to each other
We will study:
 Synchronization based on “Actual Time”.
 Synchronization based on “Relative Time”.
 Synchronization based on Co-ordination (with Election Algorithms).
 Mutual Exclusion
3
 Essentials of Synchronization
4
 Issues of Synchronization
5
 Physical Clock
6
Clock Synchronization
 in centralized systems, synchronization is simple: there is one
processor and one clock (with shared memory), so event ordering is
clean because all events are timed by the same clock
 achieving agreement on time in distributed systems is difficult,
since each system has its own clock
 e.g., consider the make program on a UNIX machine; it compiles
only source files whose last update is later than that of the
existing object file; with unsynchronized clocks, a newly edited
source file may wrongly appear older than its object file
7
Physical Clocks:
clocks whose values must not deviate from the real time by more than a certain amount.
 a clock is an electronic device that counts the oscillations of a crystal
at a particular frequency; the count is stored in a computer register
 physical clocks can be used to timestamp events on that computer,
e.g., event E1 at time t1 and event E2 at time t2
 many applications are interested only in the order of events, not the exact
time of day at which they occurred
 the instantaneous difference between the readings of two computers' clocks
is the skew; the rate at which a clock diverges from perfect time is its drift
 several methods attempt to synchronize physical clocks in a distributed
system; among them:
Cristian's Algorithm
Berkeley Algorithm
Network Time Protocol (NTP)
8
9
 Clock Synchronization Algorithms
13
 Cristian's Algorithm
 the client process sends a request at time T0, receives the server's time
T_server in the response at time T1, and calculates the new synchronized
client clock time as
T_client = T_server + (T1 − T0)/2
 example:
request sent at 5:08:15:100 (T0)
response received at 5:08:15:900 (T1)
response contains 5:09:25:300 (T_server)
 the elapsed time T1 − T0 = 800 ms is the round-trip time, so the one-way
delay is estimated as 400 ms
T_client = T_server + (T1 − T0)/2 = 5:09:25:300 + 400 ms = 5:09:25:700
14
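the calculation above can be checked with a short sketch (times converted to milliseconds):

```python
# Cristian's algorithm as computed in the example above: the client
# sets its clock to the server's time plus half the round-trip time.

def cristian(t0_ms, t1_ms, t_server_ms):
    """All times in milliseconds since midnight."""
    round_trip = t1_ms - t0_ms
    return t_server_ms + round_trip // 2

def hms_ms(h, m, s, ms):
    """Convert an h:m:s:ms timestamp to milliseconds."""
    return ((h * 60 + m) * 60 + s) * 1000 + ms

t0 = hms_ms(5, 8, 15, 100)        # request sent
t1 = hms_ms(5, 8, 15, 900)        # response received
t_server = hms_ms(5, 9, 25, 300)  # time carried in the response
t_client = cristian(t0, t1, t_server)   # expected: 5:09:25:700
```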
NETWORK TIME PROTOCOL (NTP)
• a protocol that allows computer clock times to be
synchronized across a network
18
MUTUAL EXCLUSION
32
ELECTION ALGORITHM
40
SUMMARY
• Overview
• Essentiality of Synchronization
• Issues in Synchronization
• Clock Synchronization
• Physical Clocks
• Clock Synchronization
• Logical Clocks
• Happens-Before Relationship
• Lamport's Logical Clocks
• Vector Clocks
• Mutual Exclusion
• A Centralized Algorithm
• A Decentralized Algorithm
• A Distributed Algorithm
• A Token Ring Algorithm
• Election Algorithm
• Introduction
• The Bully Algorithm
• A Ring Algorithm
• Superpeer Selection
46
THANK YOU FOR YOUR ATTENTION
47
FUNDAMENTALS OF DISTRIBUTED SYSTEM
Chapter Six
Fault Tolerance
1
 a major difference between distributed systems and single
machine systems is that with the former, partial failure is
possible, i.e., when one component in a distributed system
fails
 such a failure may affect some components while others
will continue to function properly
 an important goal of distributed systems design is to
construct a system that can automatically recover from
partial failure
 it should tolerate faults and continue to operate to some
extent
2
INTRODUCTION
 we will discuss
 fault tolerance, and making distributed systems fault
tolerant
 process resilience (techniques by which one or more
processes can fail without seriously disturbing the rest of
the system)
 reliable multicasting to keep processes synchronized (by
which message transmission to a collection of processes
is guaranteed to succeed)
 distributed commit protocols for ensuring atomicity in
distributed systems
 failure recovery by saving the state of a distributed
system (when and how)
3
OBJECTIVES OF THE CHAPTER
Introduction to Fault Tolerance
 Basic Concepts
 fault tolerance is strongly related to dependable
systems
 dependability covers the following
 availability
 refers to the probability that the system is
operating correctly at any given time; defined in
terms of an instant in time
 reliability
 a property that a system can run continuously
without failure; defined in terms of a time interval
 safety
 refers to the situation that even when a system
temporarily fails to operate correctly, nothing
catastrophic happens
 maintainability
 how easily a failed system can be repaired
4
 dependable systems are also required to provide a high
degree of security
 a system is said to fail when it cannot meet its promises; for
instance failing to provide its users one or more of the
services it promises
 an error is a part of a system’s state that may lead to a failure;
e.g., damaged packets in communication
 the cause of an error is called a fault
 building dependable systems closely relates to controlling
faults
 a distinction is made between preventing, removing, and
forecasting faults
 a fault tolerant system is a system that can provide its
services even in the presence of faults
5
 faults are classified into three
 transient
 occurs once and then disappears; if the operation is
repeated, the fault goes away; e.g., a bird flying through a
beam of a microwave transmitter may cause some lost
bits
 intermittent
 it occurs, then vanishes on its own accord, then
reappears, ...; e.g., a loose connection; difficult to
diagnose; take your sick child to the nearest clinic, but
the child does not show any sickness by the time you
reach there
 permanent
 one that continues to exist until the faulty component is
repaired; e.g., disk head crash, software bug
6
 Failure Types - 5 of them
 Crash failure (also called fail-stop failure): a server halts,
but was working correctly until it stopped; e.g., the OS
halts; reboot the system
 Omission failure: a server fails to respond to incoming
requests
 Receive omission: a server fails to receive incoming
messages; e.g., maybe no thread is listening
 Send omission: a server fails to send messages
 Timing failure: a server's response lies outside the
specified time interval; e.g., maybe it is too fast, flooding
the receiver, or too slow
 Response failure: the server's response is incorrect
 Value failure: the value of the response is wrong; e.g., a
search engine returning wrong Web pages as a result of
a search 7
8
 State transition failure: the server deviates from the correct
flow of control; e.g., taking default actions when it fails to
understand the request
 Arbitrary failure (or Byzantine failure): a server may produce
arbitrary responses at arbitrary times; the most serious
 Failure Masking by Redundancy
 to be fault tolerant, the system tries to hide the occurrence of
failures from other processes - masking
 the key technique for masking faults is redundancy
 three kinds are possible
 information redundancy; add extra bits to allow recovery
from garbled bits (error correction)
 time redundancy: an action is performed more than once if
needed; e.g., redo an aborted transaction; useful for
transient and intermittent faults
 physical redundancy: add (replicate) extra equipment
(hardware) or processes (software)
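physical redundancy with majority voting (as in triple modular redundancy) can be sketched as:

```python
# Sketch of failure masking by physical redundancy: run several
# replicas and take the majority of their results, so a minority of
# faulty replicas is masked.

from collections import Counter

def vote(results):
    """Return the most common value among replica results."""
    value, _count = Counter(results).most_common(1)[0]
    return value
```

e.g., with three replicas where one returns a wrong value, the two correct results outvote it.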
Process Resilience
 how can fault tolerance be achieved in distributed
systems
 one method is protection against process failures by
replicating processes into groups
 we discuss
 what are the general design issues of process groups
 what actually is a fault tolerant group
 how to reach agreement within a process group when
one or more of its members cannot be trusted to give
correct answers
9
 Design Issues
 the key approach to tolerating a faulty process is to
organize several identical processes into a group
 all members of a group receive a message hoping that if
one process fails, another one will take over
 the purpose is to allow processes to deal with collection of
processes as a single abstraction
 process groups may be dynamic
 new groups can be created and old groups can be
destroyed
 a process can join or leave a group
 a process can be a member of several groups at the
same time
 hence group management and membership mechanisms
are required
10
11
 the internal structure of a group may be flat or hierarchical
 flat: all processes are equal and decisions are made
collectively
 hierarchical: a coordinator and several workers; the
coordinator decides which worker is best suited to carry
a task
(a) communication in a flat group
(b) communication in a simple hierarchical group
 the flat group
 has no single point of failure
 but decision making is more complicated (voting may be
required for decision making which may create a delay and
overhead)
 the hierarchical group has the opposite properties
 group membership may be handled
 through a group server where all requests (joining, leaving,
...) are sent; it has a single point of failure
 in a distributed way where membership is multicasted (if a
reliable multicasting mechanism is available)
 but what if a member crashes; other members have to
find out this by noticing that it no more responds
12
 Failure Masking and Replication
 how to replicate processes so that they can form groups and
failures can be masked?
 there are two ways for such replication:
 primary-based replication
 for fault tolerance, primary-backup protocol is used
 organize processes hierarchically and let the primary
(i.e., the coordinator) coordinate all write operations
 if the primary crashes, the backups hold an election
 replicated-write protocols
 in the form of active replication or by means of quorum-
based protocols
 that means, processes are organized as flat groups
13
 another important issue is how much replication is needed
 for simplicity consider only replicated-write systems
 a system is said to be k fault tolerant if it can survive faults
in k components and still meets its specifications
 if the processes fail silently, then having k+1 replicas is
enough; if k of them fail, the remaining one can function
 if processes exhibit Byzantine failures, 2k+1 replicas are
required: even if the k faulty processes generate the same
(wrong) reply, the remaining k+1 correct processes also
produce the same (correct) answer, and the majority can be
believed
14
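the replica counts above can be captured in a one-line helper (a sketch, not a library function):

```python
# Replicas needed for a k fault tolerant replicated-write system,
# as described above.

def replicas_needed(k, byzantine=False):
    """k+1 suffices for fail-silent faults; Byzantine faults need
    2k+1 so the k+1 correct replies outvote the k faulty ones."""
    return 2 * k + 1 if byzantine else k + 1
```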
Distributed Commit
 atomic multicasting is an example of the more generalized
problem known as distributed commit
 in atomic multicasting, the operation is delivery of a
message
 but the distributed commit problem involves having an(y)
operation being performed by each member of a process
group, or none at all
 there are three protocols: one-phase commit, two-phase
commit, and three-phase commit
 One-Phase Commit Protocol
 a coordinator tells all other processes, called
participants, whether or not to (locally) perform an
operation
 drawback: if one of the participants cannot perform the
operation, there is no way to tell the coordinator; for
example due to violation of concurrency control
constraints in distributed transactions
15
 Two-Phase Commit Protocol (2PC)
 it has two phases: voting phase and decision phase, each
involving two steps
 voting phase
 the coordinator sends a VOTE_REQUEST message to all
participants
 each participant then sends a VOTE_COMMIT or
VOTE_ABORT message depending on its local situation
 decision phase
 the coordinator collects all votes; if all vote to commit the
transaction, it sends a GLOBAL_COMMIT message; if at
least one participant sends VOTE_ABORT, it sends a
GLOBAL_ABORT message
 each participant that voted for a commit waits for the
final reaction of the coordinator and commits or aborts
16
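The voting and decision phases above can be condensed into a small in-process sketch (an illustration only; real 2PC exchanges VOTE_REQUEST / VOTE_COMMIT / VOTE_ABORT messages over the network and logs each step to persistent storage):

```python
def two_phase_commit(vote_fns):
    """vote_fns: one callable per participant, returning True for
    VOTE_COMMIT or False for VOTE_ABORT when asked to vote."""
    # Phase 1 (voting): the coordinator sends VOTE_REQUEST to every
    # participant and collects their votes.
    votes = [vote() for vote in vote_fns]
    # Phase 2 (decision): GLOBAL_COMMIT only if every participant
    # voted to commit; one VOTE_ABORT forces GLOBAL_ABORT.
    return "GLOBAL_COMMIT" if all(votes) else "GLOBAL_ABORT"
```

A single abort vote is enough to abort the whole transaction, which is exactly the all-or-nothing property distributed commit demands.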
a) the finite state machine for the coordinator in 2PC
b) the finite state machine for a participant
17
 problems may occur in the event of failures
 the coordinator and participants have states in which they
block waiting for messages: INIT, READY, WAIT
 when a process crashes, other processes may wait
indefinitely
 hence, timeout mechanisms are required
 a participant waiting in its INIT state for VOTE_REQUEST
from the coordinator aborts and sends VOTE_ABORT if it
does not receive a vote request after some time
 the coordinator blocking in state WAIT aborts and sends
GLOBAL_ABORT if all votes have not been collected on time
 a participant P blocked in its READY state waiting for the
global vote cannot simply abort; instead it must find out
which message the coordinator actually sent
 by blocking until the coordinator recovers
 or requesting another participant, say Q
18
actions taken by a participant P when residing in state READY and having
contacted another participant Q
19
State of Q: action by P (comment)
 COMMIT: make the transition to COMMIT (the coordinator sent GLOBAL_COMMIT before crashing, but P did not receive it)
 ABORT: make the transition to ABORT (the coordinator sent GLOBAL_ABORT before crashing, but P did not receive it)
 INIT: make the transition to ABORT (the coordinator sent VOTE_REQUEST before crashing; P received it but Q did not)
 READY: contact another participant (if all are in state READY, wait until the coordinator recovers)
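The decision table for a participant blocked in READY maps directly to a small function (state names as in the slides):

```python
def action_in_ready(q_state):
    """Action taken by participant P (blocked in READY) after
    learning the state of another participant Q."""
    if q_state == "COMMIT":
        return "transition to COMMIT"   # coordinator had sent GLOBAL_COMMIT
    if q_state == "ABORT":
        return "transition to ABORT"    # coordinator had sent GLOBAL_ABORT
    if q_state == "INIT":
        return "transition to ABORT"    # Q never saw VOTE_REQUEST; abort is safe
    # Q is also READY: no one knows the decision yet
    return "contact another participant"
```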
 a process (participant or coordinator) can recover from crash
if its state has been saved to persistent storage
 actions by a participant after recovery, depending on its state before the crash:
 INIT: locally abort the transaction and inform the coordinator
 COMMIT or ABORT: retransmit its decision to the coordinator
 READY: cannot decide on its own what it should do next; has to contact the other participants
20
 there are two critical states for the coordinator; its action after recovery depends on its state before the crash:
 WAIT: retransmit the VOTE_REQUEST message
 after the decision in the 2nd phase: retransmit the decision
Recovery
 fundamental to fault tolerance is recovery from an error
 recall: an error is the part of a system's state that may lead
to a failure
 error recovery means to replace an erroneous state with an
error-free state
 two forms of error recovery: backward recovery and
forward recovery
 Backward Recovery
 bring the system from its present erroneous state back
into a previously correct state
 for this, the system’s state must be recorded from time to
time; each time a state is recorded, a checkpoint is said
to be made
 e.g., retransmitting lost or damaged packets in the
implementation of reliable communication
21
 most widely used, since it is a generally applicable method
and can be integrated into the middleware layer of a
distributed system
 disadvantages:
 checkpointing and restoring a process to its previous state
are costly and performance bottlenecks
 no guarantee can be given that the error will not recur,
which may take an application into a loop of recovery
 some actions may be irreversible; e.g., deleting a file,
handing over cash to a customer
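A minimal illustration of checkpoint-based backward recovery (names are my own; a real system writes checkpoints to stable storage, not to memory):

```python
import copy

class CheckpointedProcess:
    """Record the state from time to time (a checkpoint) so an
    erroneous state can be replaced by the last correct one."""

    def __init__(self, state):
        self.state = state
        self._checkpoint = copy.deepcopy(state)

    def checkpoint(self):
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        self.state = copy.deepcopy(self._checkpoint)

p = CheckpointedProcess({"balance": 100})
p.checkpoint()
p.state["balance"] = -999   # an erroneous update
p.rollback()                # backward recovery to the checkpoint
```

Note the deep copies: the cost of copying whole states on every checkpoint is exactly the performance drawback the slide mentions.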
 Forward Recovery
 bring the system from its present erroneous state to a
correct new state from which it can continue to execute
 it has to be known in advance which errors may occur so
as to correct those errors
 e.g., erasure correction (or simply error correction) where a
lost or damaged packet is constructed from other
successfully delivered packets
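The erasure-correction example can be made concrete with XOR parity, the simplest such scheme: transmit one extra parity packet, and any single lost packet of the group can be rebuilt from the survivors (a sketch, not a production codec):

```python
def parity_packet(packets):
    """XOR of equal-length packets; doubles as the reconstruction
    of one missing packet from the survivors plus the parity."""
    out = bytes(len(packets[0]))
    for p in packets:
        out = bytes(a ^ b for a, b in zip(out, p))
    return out

data = [b"abcd", b"efgh", b"ijkl"]
parity = parity_packet(data)                      # sent alongside the data
# suppose data[1] is lost in transit: rebuild it without retransmission
rebuilt = parity_packet([data[0], data[2], parity])
```

This is forward recovery: the receiver moves to a correct state on its own, which only works because the error class (one lost packet) was anticipated in advance.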
THANK YOU FOR YOUR ATTENTION
23
Distributed systems and its power point
  • 6. DISTRIBUTED FILE SYSTEM REQUIREMENTS  Transparency (access, location, mobility, performance, scaling)  Concurrent file updates  File replication  File heterogeneity  Fault tolerance  Consistency  Security  Efficiency 6
  • 7. FILE SERVICE ARCHITECTURE  An architecture that offers a clear separation of the main concerns  Access to files is obtained by structuring the file service  The file service structure has three components A. a flat file service B. a directory service C. a client module 7
  • 9. FLAT FILE SERVICE  Concerned with implementing operations on the contents of files  Unique file identifiers (UFIDs) are used to refer to files in all requests  A UFID is a long sequence of bits  Each file has a UFID that is unique among all of the files 9
  • 10. DIRECTORY SERVICE  The directory service provides a mapping between text names for files and their UFIDs.  Clients may obtain the UFID of a file by quoting its text name to the directory service.  The directory service provides the functions needed to generate directories, to add new file names to directories and to obtain UFIDs from directories.  It is a client of the flat file service  Its directory files are stored in files of the flat file service.  When a hierarchic file-naming scheme is adopted, as in UNIX, directories hold references to other directories. 10
  • 11. CLIENT MODULE  A client module runs in each client computer, integrating and extending the operations of the flat file service  The directory service under a single application programming interface that is available to user-level programs in client computers.  For example, in UNIX hosts, a client module would be provided that emulates the full set of UNIX file operations, interpreting UNIX multi-part file names by iterative requests to the directory service.  The client module also holds information about the network locations of the flat file server and directory server processes.  Finally, the client module can play an important role in achieving satisfactory performance through the implementation of a cache of recently used file blocks at the client. 11
  • 12. FLAT FILE SERVICE INTERFACE  This is the RPC interface used by client modules. It is not normally used directly by user-level programs.  A FileId is invalid if the file that it refers to is not present in the server processing the request or if its access permissions are inappropriate for the operation requested.  All of the procedures in the interface except Create throw exceptions if the FileId argument contains an invalid UFID or the user doesn’t have sufficient access rights. These exceptions are omitted from the definition for clarity. 12
  • 13. FLAT FILE SERVICE OPERATIONS  Read(FileId, i, n) → Data: reads a sequence of up to n items from a file starting at item i and returns it in Data.  Write(FileId, i, Data): writes a sequence of Data to a file, starting at item i, extending the file if necessary.  Create() → FileId: creates a new file of length 0 and delivers a UFID for it.  Delete(FileId): removes the file from the file store.  GetAttributes(FileId) → Attr: returns the file attributes for the file.  SetAttributes(FileId, Attr): sets the file attributes.  GetAttributes and SetAttributes enable clients to access the attribute record.  GetAttributes is normally available to any client that is allowed to read the file.  Access to the SetAttributes operation would normally be restricted to the directory service that provides access to the file. 13
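As an illustration only, the flat file service operations above can be mocked as an in-memory Python class (UFIDs are plain integers here; the RPC transport, attribute records, and access checks are omitted):

```python
import itertools

class FlatFileService:
    """In-memory sketch of the flat file service interface."""

    def __init__(self):
        self._uid = itertools.count(1)
        self.files = {}            # UFID -> list of data items

    def create(self):
        ufid = next(self._uid)
        self.files[ufid] = []      # Create(): new file of length 0
        return ufid

    def write(self, ufid, i, data):
        f = self.files[ufid]
        f[i:i + len(data)] = data  # extends the file if necessary

    def read(self, ufid, i, n):
        return self.files[ufid][i:i + n]   # up to n items from item i

    def delete(self, ufid):
        del self.files[ufid]
```

In a real deployment each of these methods would be an RPC handled at the server, with FileId validity and access rights checked before the operation runs.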
  • 14. DIRECTORY SERVICE OPERATIONS  Lookup(Dir, Name) → FileId, throws NotFound: locates the text name in the directory and returns the relevant UFID; if Name is not in the directory, throws an exception.  AddName(Dir, Name, FileId), throws NameDuplicate: if Name is not in the directory, adds (Name, FileId) to the directory and updates the file’s attribute record; if Name is already in the directory, throws an exception.  UnName(Dir, Name), throws NotFound: if Name is in the directory, removes the entry containing Name from the directory; if Name is not in the directory, throws an exception.  GetNames(Dir, Pattern) → NameSeq: returns all the text names in the directory that match the regular expression Pattern. 14
  • 15. FILE ACCESSING MODELS  File access models are methods used for accessing remote files and the unit of data access.  A distributed file system may use one of the following models to service a client’s file access request when the accessed file is remote: 1. Remote service model  Processing of a client’s request is performed at the server’s node.  Thus, the client’s request for file access is delivered across the network as a message to the server, the server machine performs the access request, and the result is sent to the client.  This model needs to minimize the number of messages sent and the overhead per message. 15
  • 16. 2. Data-caching model  This model attempts to reduce the network traffic of the previous model by caching the data obtained from the server node.  This takes advantage of the locality found in file accesses.  This model gives increased performance and greater system scalability.  The unit of data transfer refers to the fraction of a file that is transferred to clients in a single read or write operation. 3. File-level transfer model  In this model, when file data is to be transferred, the entire file is moved.  This reduces server load and network traffic since it accesses the server only once. This has better scalability.  This model requires sufficient storage space on the client machine. FILE ACCESSING MODELS… 16
  • 17. 4. Block-level transfer model  File transfer takes place in file blocks.  A file block is a contiguous portion of a file and is of fixed length  This does not require client nodes to have large storage space  It eliminates the need to copy an entire file when only a small portion of the data is needed.  When an entire file is to be accessed, multiple server requests are needed, resulting in more network traffic and more network protocol overhead.  NFS uses block-level transfer model. FILE ACCESSING MODELS… 17
  • 18. 5. Byte-level transfer model  Unit of transfer is a byte.  Model provides maximum flexibility because it allows storage and retrieval of an arbitrary amount of a file, specified by an offset within a file and length.  The drawback is that cache management is harder due to the variable-length data for different access requests. 6.Record-level transfer model  This model is used with structured files and the unit of transfer is the record. FILE ACCESSING MODELS… 18
  • 19. FILE STRUCTURE ARCHITECTURE  Hierarchical file system  Consists of a number of directories arranged in a tree structure  File group  Is a collection of files that can be located on any server or moved between servers while maintaining the same names  A similar construct is used in the UNIX file system  Helps with distributing the load of file serving between several servers  File groups have identifiers which are unique throughout the system 19
  • 20. DFS  DFS has two main implementations  Sun Network File System  Andrew File System  NFS has a stateless server whereas AFS has a stateful server  AFS provides location independence (the physical location of a file can be changed without having to change the path of the file) as well as location transparency  NFS provides location transparency  AFS is more scalable 20
  • 21.  A stateless server does not keep information on the state of its clients and can change its own state without informing any client.  It does not retain any information on the state of its clients.  a stateful server maintains persistent information on its clients and requires explicit deletion of information by the server.  a stateful server retains persistent information on its clients. 21 DFS
  • 22.  This file system was developed by Sun Microsystems, hence the name Sun Network File System.  Sun’s Network File System (NFS) has been widely adopted in industry and in academic environments since its introduction in 1985.  NFS is a client-server application, so the user can view, store and update files on a remote computer.  In most cases, all the clients and servers are on the same LAN.  Each NFS server exports one or more of its directories for access by remote clients.  NFS provides transparent access to remote files for client programs running on UNIX and other systems.  NFS allows a user or system administrator to mount (designate as accessible) all or a portion of a file system on a server.  An important goal of NFS is to achieve a high level of support for hardware and operating system heterogeneity. CASE STUDY: SUN NETWORK FILE SYSTEM 22
  • 23. CASE STUDY: SUN NETWORK FILE SYSTEM…  NFS protocol (RFC 1813 ) is designed to be independent of the computer, OS, network architecture, and transport protocol.  NFS uses RPC to route requests between the client and server.  The NFS protocol was originally developed for use in networks of UNIX systems  The NFS server module resides in the kernel on each computer that acts as an NFS server.  Requests referring to files in a remote file system are translated by the client module to NFS protocol operations and then passed to the NFS server module at the computer holding the relevant file system. 23
  • 24. CASE STUDY: SUN NETWORK FILE SYSTEM… Sun network File Model 24
  • 25. 25 CASE STUDY: SUN NETWORK FILE SYSTEM…
  • 26. 26 It consists of three layers  System call layer: this handles system calls like OPEN, READ, and CLOSE.  Virtual File System: the task of the VFS layer is to maintain a table with one entry for each open file, analogous to the table of i-nodes for open files in UNIX. The VFS layer has an entry called a v-node (virtual i-node) for every open file telling whether the file is local or remote.  NFS client code: creates an r-node (remote i-node) in its internal tables to hold the file handles.  Each v-node in the VFS layer will ultimately contain either a pointer to an r-node in the NFS client code, or a pointer to an i-node in the local operating system.  Thus from the v-node it is possible to see whether a file or directory is local or remote and, if it is remote, to find its file handle CASE STUDY: SUN NETWORK FILE SYSTEM…
  • 27.  The file identifiers used in NFS are called file handles.  The virtual file system layer has one VFS structure for each mounted file system and one v-node per open file.  A VFS structure relates a remote file system to the local directory on which it is mounted.  The v-node contains an indicator to show whether a file is local or remote.  If the file is local, the v-node contains a reference to the index of the local file (an i-node in a UNIX implementation).  If the file is remote, it contains the file handle of the remote file.  Reading Assignment NFS operation CASE STUDY: SUN NETWORK FILE SYSTEM… 27
  • 28. Server caching  Caching in both the client and the server computer are indispensable features of NFS implementations in order to achieve adequate performance.  In conventional UNIX systems: file pages, directories and file attributes that have been read from disk are retained in a main memory buffer cache until the buffer space is required for other pages. CASE STUDY: SUN NETWORK FILE SYSTEM… 28
  • 29. Benefits of NFS: Among many benefits for organizations using NFS are the following:  Mature: NFS is a mature protocol, which means most aspects of implementing, securing and using it are well understood, as are its potential weaknesses.  Open: NFS is an open protocol, with its continued development documented in internet specifications as a free and open network protocol.  Cost-effective: NFS is a low-cost solution for network file sharing that is easy to set up because it uses the existing network infrastructure.  Centrally managed: NFS's centralized management decreases the need for added software and disk space on individual user systems.  User-friendly: The protocol is easy to use and enables users to access remote files on remote hosts in the same way they access local ones.  Distributed: NFS can be used as a distributed file system, reducing the need for removable media storage devices.  Secure: With NFS, there is less removeable media like CDs, DVDs, Blu-ray disks, diskettes and USB devices in circulation, making the system more secure. 29 CASE STUDY: SUN NETWORK FILE SYSTEM…
  • 30. Disadvantages of NFS: these include the following:  Dependence on RPCs makes NFS inherently insecure; it should only be used on a trusted network behind a firewall. Otherwise, NFS will be vulnerable to internet threats.  Some reviews of NFSv4 and NFSv4.1 suggest that these versions have limited bandwidth and scalability and that NFS slows down during heavy network traffic. The bandwidth and scalability issue is reported to have improved with NFSv4.2. 30 CASE STUDY: SUN NETWORK FILE SYSTEM…
  • 31. NFS summary  Sun NFS closely follows abstract model.  The resulting design provides good location and access transparency if the NFS mount service is used properly to produce similar name spaces at all clients.  NFS supports heterogeneous hardware and operating systems.  The NFS server implementation is stateless, enabling clients and servers to resume execution after a failure without the need for any recovery procedures.  Migration of files or filesystems is not supported, except at the level of manual intervention to reconfigure mount directives after the movement of a filesystem to a new location.  Design goals of NFS are Access transparency, Location transparency, Mobility transparency, Scalability, File replication, Hardware and operating system heterogeneity, Fault tolerance, Consistency, Security and Efficiency CASE STUDY: SUN NETWORK FILE SYSTEM… 31
  • 32. CASE STUDY: THE ANDREW FILE SYSTEM  It is a location-independent file system  It is a DFS that uses a set of trusted servers (vices) to present a homogeneous, location-transparent file name space to all the client workstations.  With AFS, people can work together on the same files, no matter where the files are located.  AFS users do not need to know which machine is storing a file  AFS makes it as easy to access files stored on a remote computer as files stored on the local disks  All files you store on AFS are available to use online by just connecting to your AFS server. We can connect with an AFS server through an AFS client  AFS uses a set of remote servers to access a file  AFS uses a local cache to reduce the workload and increase the performance of a distributed computing environment 32
  • 33.  In AFS, the server keeps track of which files are opened by which clients (unlike NFS)  Like NFS, AFS provides transparent access to remote shared files for UNIX programs running on workstations.  Access to AFS files is via the normal UNIX file primitives, enabling existing UNIX programs to access AFS files without modification or recompilation.  AFS is compatible with NFS. AFS servers hold ‘local’ UNIX files, but the filing system in the servers is NFS-based, so files are referenced by NFS-style file handles rather than i-node numbers, and the files may be remotely accessed via NFS. CASE STUDY: THE ANDREW FILE SYSTEM… 33
  • 34. Implementation of AFS  Venus: the client-side manager which acts as an interface between the application program and Vice  Vice: the server-side process that resides on top of the UNIX kernel, providing shared file services to each client  All files in AFS are distributed among the servers. The set of files on one server is referred to as a volume  If a request cannot be satisfied from this set of files, the Vice server informs the client where it can find the required files  The files available to user processes running on workstations are either local or shared  Local files are handled as normal UNIX files  Shared files are stored on servers, and copies of them are cached on local disks of workstations 34 CASE STUDY: THE ANDREW FILE SYSTEM…
  • 35. Features of AFS  File backup: AFS data files are backed up nightly. Backups are kept on site for six months  File security: AFS data files are protected by the Kerberos authentication system  Physical security: AFS data files are stored on servers located in the UCSC data center  Reliability and availability: AFS servers and storage are maintained on redundant hardware  Authentication: AFS uses Kerberos for authentication. Kerberos accounts are automatically provisioned for all UCSC faculty and staff  Space per user (quota): AFS provides 500 MB of space per user, and users can request an increase up to 10 GB. 35 CASE STUDY: THE ANDREW FILE SYSTEM…
  • 36. 36 CASE STUDY: THE ANDREW FILE SYSTEM…
  • 37.  AFS has two unusual design characteristics: Whole-file serving:  The entire contents of directories and files are transmitted to client computers by AFS servers Whole-file caching:  Once a copy of a file or a chunk has been transferred to a client computer it is stored in a cache on the local disk.  The cache contains several hundred of the files most recently used on that computer.  The cache is permanent, surviving reboots of the client computer.  Local copies of files are used to satisfy clients’ open requests in preference to remote copies whenever possible. CASE STUDY: THE ANDREW FILE SYSTEM… 37
  • 38. A FIRST REQUEST FOR DATA TO THE SERVER FROM A WORKSTATION IS SATISFIED BY THE SERVER AND PLACED IN THE LOCAL CACHE AFS is implemented as two software components that exist as UNIX processes called Vice and Venus. Vice is the name given to the server software that runs as a user-level UNIX process in each server computer, and Venus is a user-level process that runs in each client computer and corresponds to the client module in our abstract model. The set of files on one server is referred to as a volume. 38 IMPLEMENTATION ANDREW FILE SYSTEM
  • 39.  The files available to user processes running on workstations are either local or shared.  Local files are handled as normal UNIX files. They are stored on a workstation’s disk and are available only to local user processes.  Shared files are stored on servers, and copies of them are cached on the local disks of workstations.  It is a conventional UNIX directory hierarchy, with a specific subtree (called cmu) containing all of the shared files. This splitting of the file name space into local and shared files leads to some loss of location transparency, but this is hardly noticeable to users other than system administrators.  Local files are used only for temporary files (/tmp) and processes that are essential for workstation startup.  Other standard UNIX files (such as those normally found in /bin, /lib and so on) are implemented as symbolic links from local directories to files held in the shared space.  Users’ directories are in the shared space, enabling users to access their files from any workstation. 39 IMPLEMENTATION ANDREW FILE SYSTEM…
  • 40.  A flat file service is implemented by the Vice servers, and the hierarchic directory structure required by UNIX user programs is implemented by the set of Venus processes in the workstations.  Each file and directory in the shared file space is identified by a unique, 96-bit file identifier (fid) similar to a UFID. The Venus processes translate the pathnames issued by clients to fids 40 Implementation Andrew File System…
  • 41.  Files are grouped into volumes for ease of location and movement. Volumes are generally smaller than the UNIX filesystems, which are the unit of file grouping in NFS. For example, each user’s personal files are generally located in a separate volume. Other volumes are allocated for system binaries, documentation and library code.  The representation of fids includes the volume number for the volume containing the file (cf. the file group identifier in UFIDs), an NFS file handle identifying the file within the volume (cf. the file number in UFIDs) and a uniquifier to ensure that file identifiers are not reused: Volume number (32 bits) | File handle (32 bits) | Uniquifier (32 bits) 41 IMPLEMENTATION ANDREW FILE SYSTEM…
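The 96-bit fid layout above (32-bit volume number, 32-bit file handle, 32-bit uniquifier) can be illustrated with stdlib struct packing; the big-endian byte order here is an arbitrary choice for the sketch, not something the slide specifies:

```python
import struct

def pack_fid(volume, handle, uniquifier):
    """Pack the three 32-bit fields of an AFS-style fid into 12 bytes."""
    return struct.pack(">III", volume, handle, uniquifier)

def unpack_fid(fid):
    """Recover (volume, handle, uniquifier) from a packed fid."""
    return struct.unpack(">III", fid)

fid = pack_fid(7, 42, 1)
assert len(fid) * 8 == 96   # three 32-bit fields = 96 bits
```

The uniquifier field is what lets a server reuse a (volume, handle) slot for a new file without old cached fids silently matching it.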
  • 42. Cache consistency  Stateful servers in AFS allow the server to inform all clients with open files about any updates made to a file by another client, through what is known as a callback  Callbacks to all clients with a copy of a file are ensured, as a callback promise is issued by the server to a client when it requests a copy of a file 42 IMPLEMENTATION ANDREW FILE SYSTEM…
  • 43.  When Vice supplies a copy of a file to a Venus process it also provides a callback promise – a token issued by the Vice server guaranteeing that it will notify the Venus process when any other client modifies the file.  Callback promises are stored with the cached files on the workstation disks and have two states: valid and cancelled. 43 IMPLEMENTATION ANDREW FILE SYSTEM…
  • 44.  When a server performs a request to update a file it notifies all of the Venus processes to which it has issued callback promises by sending a callback to each  A callback is a remote procedure call from a server to a Venus process.  When the Venus process receives a callback, it sets the callback promise token for the relevant file to cancelled. 44 IMPLEMENTATION ANDREW FILE SYSTEM…
  • 45.  Whenever Venus handles an open on behalf of a client, it checks the cache.  If the required file is found in the cache, then its token is checked.  If its value is cancelled, then a fresh copy of the file must be fetched from the Vice server,  But if the token is valid, then the cached copy can be opened and used without reference to Vice. 45 Implementation Andrew File System…
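The open-time check Venus performs can be sketched as a small function (`fetch_from_vice` is a hypothetical stand-in for the RPC to the Vice server, which would also install a fresh callback promise):

```python
def open_file(fid, cache, fetch_from_vice):
    """Serve an open: use the cached copy if its callback promise
    token is valid, otherwise fetch a fresh copy from Vice."""
    entry = cache.get(fid)
    if entry is not None and entry["token"] == "valid":
        return entry["data"]            # no server contact needed
    data = fetch_from_vice(fid)         # fresh copy + new callback promise
    cache[fid] = {"data": data, "token": "valid"}
    return data
```

A callback from the server simply flips the token to "cancelled", so the next open on that file falls through to the fetch path.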
  • 46.  When a workstation is restarted after a failure or a shutdown  Venus aims to retain as many as possible of the cached files on the local disk  But it cannot assume that the callback promise tokens are correct, since some callbacks may have been missed. 46 IMPLEMENTATION ANDREW FILE SYSTEM…
  • 47.  Before the first use of each cached file or directory after a restart,  Venus therefore generates a cache validation request containing the file modification timestamp to the server that is the custodian of the file.  If the timestamp is current, the server responds with valid, and the token is reinstated.  If the timestamp shows that the file is out of date, then the server responds with cancelled, and the token is set to cancelled.  Callbacks must be renewed 47 IMPLEMENTATION ANDREW FILE SYSTEM…
  • 48. Implementation of file system calls in AFS 48
  • 49. OPERATION OF AFS  Fetch(fid) → attr, data: returns the attributes (status) and, optionally, the contents of the file identified by fid and records a callback promise on it.  Store(fid, attr, data): updates the attributes and (optionally) the contents of a specified file.  Create() → fid: creates a new file and records a callback promise on it.  Remove(fid): deletes the specified file.  SetLock(fid, mode): sets a lock on the specified file or directory; the mode of the lock may be shared or exclusive; locks that are not removed expire after 30 minutes.  ReleaseLock(fid): unlocks the specified file or directory.  RemoveCallback(fid): informs the server that a Venus process has flushed a file from its cache.  BreakCallback(fid): call made by a Vice server to a Venus process; cancels the callback promise on the relevant file. 49
  • 50. THANK YOU FOR YOUR ATTENTION 50
  • 51. FUNDAMENTALS OF DISTRIBUTED SYSTEM Chapter Four Naming 1
  • 52.  we will discuss  some general issues in naming  how human-friendly names are organized and implemented; e.g., those for file systems and the WWW  classes of naming systems:  flat naming  structured naming, and  attribute-based naming 2 OBJECTIVES OF THE CHAPTER
  • 53. INTRODUCTION  names play an important role to:  share resources  uniquely identify entities  refer to locations  etc.  an important issue is that a name can be resolved to the entity it refers to  to resolve names, it is necessary to implement a naming system  in a distributed system, the implementation of a naming system is itself often distributed, unlike in non-distributed systems  efficiency and scalability of the naming system are the main issues 3
  • 54.  Uniform Resource Identifiers (URIs) [Berners-Lee et al. 2005] came about from the need to identify resources on the Web, and other Internet resources such as electronic mailboxes. An important goal was to identify resources in a coherent way, so that they could all be processed by common software such as browsers.  Uniform Resource Locator (URL) is often used for URIs that provide location information and specify the method for accessing the resource, such as the ‘http’ scheme.  Uniform Resource Names (URNs) are URIs that are used as pure resource names rather than locators. 4 COMMON TERMS
  • 55.  A name service stores information about a collection of textual names, in the form of bindings between the names and the attributes of the entities they denote, such as users, computers, services and objects.  A name space is the collection of all valid names recognized by a particular service.  A naming domain is a name space for which there exists a single overall administrative authority responsible for assigning names within it.  The Domain Name System is a name service design whose main naming database is used across the Internet.  A service that stores collections of bindings between names and attributes and that looks up entries that match attribute-based specifications is called a directory service. Directory services are also sometimes known as attribute- based name services. 5
  • 56. Names, Identifiers, and Addresses a name in a distributed system is a string of bits or characters that is used to refer to an entity an entity is anything; e.g., resources such as hosts, printers, disks, files, objects, processes, users, Web pages, newsgroups, mailboxes, network connections, ... entities can be operated on  e.g., a resource such as a printer offers an interface containing operations for printing a document, requesting the status of a job, etc.  a network connection may provide operations for sending and receiving data, setting quality of service parameters, etc. to operate on an entity, it is necessary to access it through its access point, which is itself a (special) entity 6
  • 57.  access point  the name of an access point is called an address (such as IP address and port number as used by the transport layer)  the address of the access point of an entity is also referred to as the address of the entity  an entity can have more than one access point (similar to accessing an individual through different telephone numbers)  an entity may change its access point in the course of time (e.g., a mobile computer getting a new IP address as it moves) 7
  • 58.  an address is a special kind of name  it refers to at most one entity  each entity is referred by at most one address; even when replicated such as in Web pages  an entity may change an access point, or an access point may be reassigned to a different entity (like telephone numbers in offices)  separating the name of an entity and its address makes it easier and more flexible; such a name is called location independent  there are also other types of names that uniquely identify an entity; in any case a true identifier is a name with the following properties  it refers to at most one entity  each entity is referred by at most one identifier  it always refers to the same entity (never reused)  identifiers allow us to unambiguously refer to an entity 8
  • 59.  examples  name of an FTP server (entity)  URL of the FTP server  address of the FTP server  IP number:port number  the address of the FTP server may change  there are three classes on naming systems: flat naming, structured naming, and attribute-based naming 9
  • 60. A. Flat Naming a name is a sequence of characters without structure; like human names? may be, if it is not an Ethiopian name! difficult to use in a large system since names must be centrally controlled to avoid duplication moreover, a flat name does not contain any information on how to locate the access point of its associated entity how are flat names resolved (or how to locate an entity when a flat name is given)?  name resolution: mapping a name to an address or an address to a name is called name-address resolution  possible solutions: simple solutions, home-based approaches, and hierarchical approaches 10
  • 61. 1. Simple Solutions  two solutions (for LANs only): Broadcasting and Multicasting, and Forwarding Pointers a. Broadcasting and Multicasting  broadcast a message containing the identifier of an entity; only machines that can offer an access point for the entity send a reply  e.g., ARP (Address Resolution Protocol) in the Internet to find the data link address (MAC address) of a machine  a computer that wants to access another computer for which it knows its IP address broadcasts this address  the owner responds by sending its Ethernet address  broadcasting is inefficient when the network grows (wastage of bandwidth and too much interruption to other machines)  multicasting is better when the network grows - send only to a restricted group of hosts 11
  • 62. 12  multicasting can also be used to locate the nearest replica - choose the one whose reply comes in first b. Forwarding Pointers  how to look for mobile entities  when an entity moves from A to B, it leaves behind a reference to its new location  advantage  simple: as soon as the first name is located using traditional naming service, the chain of forwarding pointers can be used to find the current address  drawbacks  the chain can be too long - locating becomes expensive  all the intermediary locations in a chain have to maintain their pointers  vulnerability if links are broken  hence, making sure that chains are short and that forwarding pointers are robust is an important issue
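The forwarding-pointer scheme above can be sketched in a few lines: each old location keeps a pointer to the next one, and a lookup follows the chain until it reaches the current location. The tables and the `locate` helper below are invented for illustration.

```python
# chain of forwarding pointers left behind as the entity moved A -> B -> C
forward = {"A": "B", "B": "C"}
address = {"C": "192.0.2.7"}     # only the current location knows the real address

def locate(start):
    """Follow forwarding pointers from a known old location to the current one."""
    node = start
    hops = 0
    while node in forward:       # each old location forwards to the next
        node = forward[node]
        hops += 1
    return address[node], hops
```

A lookup starting at the original location `A` takes two extra hops, which illustrates the drawback noted above: the longer the chain, the more expensive locating becomes.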
  • 63. 2. HOME-BASED APPROACHES  broadcasting and multicasting have scalability problems; performance and broken links are problems in forwarding pointers  a home location keeps track of the current location of an entity; often it is the place where an entity was created  it is a two-tiered approach  an example where it is used is Mobile IP  each mobile host uses a fixed IP address  all communication to that IP address is initially directly sent to the host’s home agent located on the LAN corresponding to the network address contained in the mobile host’s IP address  whenever the mobile host moves to another network, it requests a temporary address in the new network (called care-of-address) and informs the home agent of the new address 13
  • 64. 12  when the home agent receives a message for the mobile host (from a correspondent agent) it forwards it to its new address (if it has moved) and also informs the sender of the host’s current location for sending subsequent packets home-based approach: the principle of Mobile IP
  • 65.  problems:  creates communication latency (triangle routing: correspondent - home network - mobile)  the home location must always exist; the host is unreachable if the home no longer exists (permanently changed); the solution is to register the home at a traditional name service and let a client first look up the location of the home 15
  • 66. B. Structured Naming flat names are not convenient for humans Name Spaces  names are organized into a name space  each name is made of several parts; the first may define the nature of the organization, the second the name, the third departments, ...  the authority to assign and control the name spaces can be decentralized where a central authority assigns only the first two parts  a name space is generally organized as a labeled, directed graph with two types of nodes  leaf node: represents the named entity and stores information such as its address or the state of that entity  directory node: a special entity that has a number of outgoing edges, each labeled with a name each node in a naming graph is considered as another entity with an identifier 16
  • 67. 17 a general naming graph with a single root node, n0  a directory node stores a table in which an outgoing edge is represented as a pair (edge label, node identifier), called a directory table  each path in a naming graph can be referred to by the sequence of labels corresponding to the edges of the path and the first node in the path, such as N:<label-1, label-2, ..., label-n>, where N refers to the first node in the path
  • 68.  such a sequence is called a path name  if the first node is the root of the naming graph, it is called an absolute path name; otherwise it is a relative path name  instead of the path name n0:<home, steen, mbox>, we often use its string representation /home/steen/mbox  there may also be several paths leading to the same node, e.g., node n5 can be represented as /keys or /home/steen/keys  although the above naming graph is directed acyclic graph (a node can have more than one incoming edge but is not permitted to have a cycle), the common way is to use a tree (hierarchical) with a single root (as is used in file systems)  in a tree structure, each node except the root has exactly one incoming edge; the root has no incoming edges  each node also has exactly one associated (absolute) path name 18
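Resolution of a path name against the directory tables can be sketched directly. The graph below follows the slides' example (root node n0, path /home/steen/mbox, node n5 reachable both as /keys and /home/steen/keys); the node identifiers and addresses are illustrative.

```python
# directory nodes: each maps an edge label to the identifier of the next node
graph = {
    "n0": {"home": "n1", "keys": "n5"},
    "n1": {"steen": "n2"},
    "n2": {"mbox": "n4", "keys": "n5"},
}
# leaf nodes store information on the named entity, e.g. its address
leaves = {"n4": "addr-of-mbox", "n5": "addr-of-keys"}

def resolve(start, path):
    """Resolve N:<label-1, ..., label-n> to the address stored in the leaf."""
    node = start
    for label in path:
        node = graph[node][label]   # look up the edge in the directory table
    return leaves[node]
```

Resolving `("n0", ["keys"])` and `("n0", ["home", "steen", "keys"])` returns the same address, showing two path names leading to one node.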
  • 69.  e.g., file naming in UNIX file system  a directory node represents a directory and a leaf node represents a file  there is a single root directory, represented in the naming graph by the root node  we have a contiguous series of blocks from a logical disk  the boot block is used to load the operating system  the superblock contains information on the entire file system such as its size, etc.  inodes are referred to by an index number, starting at number zero, which is for the inode representing the root directory  given the index number of an inode, it is possible to access its associated file 19
  • 70.  Name Resolution  given a path name, the process of looking up a name stored in the node is referred to as name resolution; it consists of finding the address when the name is given (by following the path)  knowing how and where to start name resolution is referred to as closure mechanism; e.g., UNIX file system  Linking and Mounting  Linking: giving another name for the same entity (an alias) e.g., environment variables in UNIX such as HOME that refer to the home directory of a user  two types of links (or two ways to implement an alias): hard link and symbolic link  hard link: to allow multiple absolute path names to refer to the same node in a naming graph e.g., in the previous graph, there are two different path names for node n5: /keys and /home/steen/keys 20
  • 71.  symbolic link: representing an entity by a leaf node and instead of storing the address or state of the entity, the node stores an absolute path name 21 the concept of a symbolic link explained in a naming graph when first resolving an absolute path name stored in a node (e.g., /home/steen/keys in node n6), name resolution will return the path name stored in the node (/keys), at which point it can continue with resolving that new path name, i.e., closure mechanism
  • 72.  so far name resolution was discussed as taking place within a single name space  name resolution can also be used to merge different name spaces in a transparent way  the solution is to use mounting  as an example, consider a mounted file system, which can be generalized to other name spaces as well  let a directory node store the identifier of a directory node from a different (foreign) name space  the directory node storing the node identifier is called a mount point  the directory node in the foreign name space is called a mounting point, normally the root of a name space  during name resolution, the mounting point is looked up and resolution proceeds by accessing its directory table
  • 73.  consider a collection of name spaces distributed across different machines (each name space implemented by a different server)  to mount a foreign name space in a distributed system, the following are at least required  the name of an access protocol (for communication)  the name of the server  the name of the mounting point in the foreign name space  each of these names needs to be resolved  to the implementation of the protocol so that communication can take place properly  to an address where the server can be reached  to a node identifier in the foreign name space (to be resolved by the server of the foreign name space)  the three names can be listed as a URL
  • 74.  example: Sun’s Network File System (NFS) is a distributed file system with a protocol that describes how a client can access a file stored on a (remote) NFS file server  an NFS URL may look like nfs://flits.cs.vu.nl/home/steen - nfs is an implementation of a protocol - flits.cs.vu.nl is a server name to be resolved using DNS - /home/steen is resolved by the foreign server  e.g., the subdirectory /remote includes mount points for foreign name spaces on the client machine  a directory node named /remote/vu is used to store nfs://flits.cs.vu.nl/home/steen  consider /remote/vu/mbox  this name is resolved by starting at the root directory on the client’s machine until node /remote/vu, which returns the URL nfs://flits.cs.vu.nl/home/steen  this leads the client machine to contact flits.cs.vu.nl using the NFS protocol  then the file mbox is read in the directory /home/steen
  • 75. mounting remote name spaces through a specific access protocol (the figure shows the mount point in the local name space and the mounting point in the foreign name space)
  • 76.  distributed systems that allow mounting a remote file system also allow to execute some commands  example commands to access the file system cd /remote/vu /*changing directory on the remote machine ls -l /*listing the files on the remote machine  by doing so the user is not supposed to worry about the details of the actual access; the name space on the local machine and that on the remote machine look to form a single name space
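The NFS example above can be sketched as two-step resolution: follow the local name space until a node storing a URL is reached, then hand the remaining labels to the foreign server named in the URL. The dictionaries and the `resolve` helper are illustrative stand-ins (the foreign name space simply simulates the server flits.cs.vu.nl), not NFS code.

```python
local_ns = {
    "/": {"remote": "/remote"},
    "/remote": {"vu": "nfs://flits.cs.vu.nl/home/steen"},  # mount point
}
# stands in for the name space exported by the foreign server
foreign_ns = {
    "nfs://flits.cs.vu.nl/home/steen": {"mbox": "contents of mbox"},
}

def resolve(path):
    """Resolve a path, crossing a mount point if a node stores a URL."""
    parts = path.strip("/").split("/")
    node = "/"
    for i, label in enumerate(parts):
        node = local_ns[node][label]
        if node.startswith("nfs://"):
            # crossed the mount point: the remaining labels are resolved
            # by the foreign server named in the URL
            entry = foreign_ns[node]
            for rest in parts[i + 1:]:
                entry = entry[rest]
            return entry
    return node
```

Resolving /remote/vu/mbox thus starts on the client's machine, hits the URL stored at /remote/vu, and finishes in the foreign name space, just as described on the previous slide.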
  • 77.  The Implementation of a Name Space  a name space forms the heart of a naming service  a naming service allows users and processes to add, remove, and lookup names  a naming service is implemented by name servers  for a distributed system on a single LAN, a single server might suffice; for a large-scale distributed system the implementation of a name space is distributed over multiple name servers  Name Space Distribution  in large scale distributed systems, it is necessary to distribute the name service over multiple name servers, usually organized hierarchically  a name service can be partitioned into logical layers  the following three layers can be distinguished (according to Cheriton and Mann)
  • 78.  global layer  formed by highest level nodes (root node and nodes close to it or its children)  nodes on this layer are characterized by their stability, i.e., directory tables are rarely changed  they may represent organizations, groups of organizations, ..., where names are stored in the name space  administrational layer  groups of entities that belong to the same organization or administrational unit, e.g., departments  relatively stable  managerial layer  nodes that may change regularly, e.g., nodes representing hosts of a LAN, shared files such as libraries or binaries, …  nodes are managed not only by system administrators, but also by end users
  • 79. an example partitioning of the DNS name space, including Internet- accessible files, into three layers
  • 80.  the name space is divided into nonoverlapping parts, called zones in DNS  a zone is a part of the name space that is implemented by a separate name server  some requirements of servers at different layers: performance (responsiveness to lookups), availability (failure rate), etc.  high availability is critical for the global layer, since name resolution cannot proceed beyond the failing server; it is also important at the administrational layer for clients in the same organization  performance is very important in the lowest layer, since results of lookups can be cached and used due to the relative stability of the higher layers  they may be enhanced by client side caching (for global and administrational layers since names do not change often) and replication; they create implementation problems since they may introduce inconsistency (see Chapter 7)
  • 81. a comparison between name servers for implementing nodes from a large- scale name space partitioned into a global layer, an administrational layer, and a managerial layer Item Global Administrational Managerial Geographical scale of network Worldwide Organization Department Total number of nodes Few Many Vast numbers Responsiveness to lookups Seconds Milliseconds Immediate Update propagation Lazy Immediate Immediate Availability requirement Very High High low Number of replicas Many None or few None Is client-side caching applied? Yes Yes Sometimes
  • 82.  Implementation of Name Resolution  recall that name resolution consists of finding the address when the name is given  assume that name servers are not replicated and that no client-side caches are allowed  each client has access to a local name resolver, responsible for ensuring that the name resolution process is carried out  e.g., assume the path name root:<nl, vu, cs, ftp, pub, globe, index.txt> is to be resolved or using a URL notation, this path name would correspond to ftp://ftp.cs.vu.nl/pub/globe/index.txt
  • 83.  a host that needs to map a name to an address calls a DNS client named a resolver (and provides it the name to be resolved - ftp.cs.vu.nl)  the resolver accesses the closest DNS server with a mapping request  if the server has the information it satisfies the resolver; otherwise, it either refers the resolver to other servers (called Iterative Resolution) or asks other servers to provide it with the information (called Recursive Resolution)  Iterative Resolution  a name resolver hands over the complete name to the root name server  the root name server will resolve the name as far as it can and return the result to the client; at the minimum it can resolve the first level and sends the name of the first level name server to the client  the client calls the first level name server, then the second, ..., until it finds the address of the entity
  • 84. the principle of iterative name resolution
  • 85.  Recursive Resolution  a name resolver hands over the whole name to the root name server  the root name server will try to resolve the name and if it can’t, it requests the first level name server to resolve it and to return the address  the first level will do the same thing recursively the principle of recursive name resolution
  • 86.  Advantages and drawbacks  recursive name resolution puts a higher performance demand on each name server; hence name servers in the global layer support only iterative name resolution  caching is more effective with recursive name resolution  each name server gradually learns the address of each name server responsible for implementing lower-level nodes  eventually lookup operations can be handled efficiently
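Both resolution styles can be sketched with a toy set of name servers. The zone data below is invented for illustration (the address is made up); in the iterative version the client's resolver loops over referrals, while in the recursive version each server asks the next one on the client's behalf (modeled here as a recursive call).

```python
# each "server" resolves the first label of what it is given and returns
# either the next server to contact or the final result
zones = {
    "root": {"nl": "nl-server"},
    "nl-server": {"vu": "vu-server"},
    "vu-server": {"cs": "cs-server"},
    "cs-server": {"ftp": "130.37.24.11"},   # address of the named entity
}

def iterative_resolve(path):
    """The client contacts each server in turn, following referrals."""
    server = "root"
    while path:
        result = zones[server][path[0]]     # server resolves one label
        path = path[1:]
        if not path:
            return result                   # fully resolved: an address
        server = result                     # referral: client contacts next server

def recursive_resolve(server, path):
    """Each server passes the remaining name to the next server itself."""
    result = zones[server][path[0]]
    if len(path) == 1:
        return result
    return recursive_resolve(result, path[1:])
```

Both calls resolve <nl, vu, cs, ftp> to the same address; the difference is who does the work, which is exactly the performance trade-off noted above.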
  • 87. recursive name resolution of <nl, vu, cs, ftp>; name servers cache intermediate results for subsequent lookups Server for node Should resolve Looks up Passes to child Receives and caches Returns to requester cs <ftp> #<ftp> -- -- #<ftp> vu <cs,ftp> #<cs> <ftp> #<ftp> #<cs> #<cs, ftp> nl <vu,cs,ftp> #<vu> <cs,ftp> #<cs> #<cs,ftp> #<vu> #<vu,cs> #<vu,cs,ftp> root <nl,vu,cs,ftp> #<nl> <vu,cs,ftp> #<vu> #<vu,cs> #<vu,cs,ftp> #<nl> #<nl,vu> #<nl,vu,cs> #<nl,vu,cs,ftp>
  • 88. the comparison between recursive and iterative name resolution with respect to communication costs; assume the client is in Ethiopia and the name servers in the Netherlands  communication costs may be reduced in recursive name resolution  Summary Method Advantages Recursive Less Communication cost; Caching is more effective Iterative Less performance demand on name servers
  • 89.  Example - The Domain Name System (DNS)  one of the largest distributed naming services is the Internet DNS  it is used for looking up host addresses and mail servers  hierarchical, defined in an inverted tree structure with the root at the top  the tree can have only 128 levels
  • 90.  Label  each node has a label, a string with a maximum of 63 characters (case insensitive)  the root label is null (has no label)  children of a node must have different names (to guarantee uniqueness)  Domain Name  each node has a domain name; it is a path name to its root node  a full domain name is a sequence of labels separated by dots (the last character is a dot)  domain names are read from the node up to the root  full path names must not exceed 255 characters
  • 91.  the contents of a node is formed by a collection of resource records; the important ones are the following Type of record Associated entity Description SOA (start of authority) Zone Holds information on the represented zone, such as an e-mail address of the system administrator A (address) Host Contains an IP address of the host this node represents MX (mail exchange) Domain Refers to a mail server to handle mail addressed to this node; it is a symbolic link; e.g. name of a mail server SRV Domain Refers to a server handling a specific service NS (name server) Zone Refers to a name server that implements the represented zone CNAME Node Contains the canonical name of a host; an alias PTR (pointer) Host Symbolic link with the primary name of the represented node; for mapping an IP address to a name HINFO (host info) Host Holds information on the host this node represents; such as machine type and OS TXT Any kind Contains any entity-specific information considered useful; cannot be automatically processed
  • 92.  cs.vu.nl represents the domain as well as the zone; it has 4 name servers (ns, star, top, solo) and 3 mail servers  name server for this zone with 2 network addresses (star)  mail servers; the numbers preceding the name show priorities; first the one with the lowest number is tried an excerpt from the DNS database for the zone cs.vu.nl
  • 93. an excerpt from the DNS database for the zone cs.vu.nl, cont’d  a Web server and an FTP server, implemented by a single machine (soling)  older server clusters (vucs-das1)  two printers (inkt and pen) with a local address; i.e., they cannot be accessed from outside
  • 94. part of the description for the vu.nl domain which contains the cs.vu.nl domain  cs.vu.nl is implemented as a single zone  hence, the records in the previous slides do not include references to other zones  nodes in a subdomain that are implemented in a different zone are specified by giving the domain name and IP address
  • 95. C. Attribute-Based Naming flat naming: provides a unique and location-independent way of referring to entities structured naming: also provides a unique and location- independent way of referring to entities as well as human- friendly names but both do not allow searching entities by giving a description of an entity in attribute-based naming, each entity is assumed to have a collection of attributes that say something about the entity then a user can search an entity by specifying (attribute, value) pairs known as attribute-based naming Directory Services  attribute-based naming systems are also called directory services whereas systems that support structured naming are called naming systems
  • 96.  how are resources described? one possibility is to use RDF (Resource Description Framework) that uses triplets consisting of a subject, a predicate, and an object  e.g., (person, name, Alice) to describe a resource Person whose Name is Alice  or in e-mail systems, we can use sender, recipient, subject, etc. for searching  Hierarchical Implementations: LDAP  distributed directory services are implemented by combining structured naming with attribute-based naming  e.g., Microsoft’s Active Directory service  such systems rely on the Lightweight Directory Access Protocol (LDAP), which is derived from OSI’s X.500 directory service  an LDAP directory service consists of a number of records called directory entries, made up of (attribute, value) pairs, similar to a resource record in DNS; attributes could be single- or multiple-valued (e.g., Mail_Servers on next slide)
  • 97. a simple example of an LDAP directory entry using LDAP naming conventions to identify the network addresses of some servers Attribute Abbr. Value Country C NL Locality L Amsterdam Organization O Vrije Universiteit OrganizationalUnit OU Comp. Sc. CommonName CN Main server Mail_Servers -- 137.37.20.3, 130.37.24.6,137.37.20.10 FTP_Server -- 130.37.20.20 WWW_Server -- 130.37.20.20
  • 98.  the collection of all directory entries is called a Directory Information Base (DIB)  each record is uniquely named so that it can be looked up  each naming attribute is called a Relative Distinguished Name (RDN); the first 5 entries above  a globally unique name is formed using abbreviations of naming attributes, e.g., /C=NL/O=Vrije Universiteit/OU=Comp. Sc.  this is similar to the DNS name nl.vu.cs  listing RDNs in sequence leads to a hierarchy of the collection of directory entries, called a Directory Information Tree (DIT)  a DIT forms the naming graph of an LDAP directory service where each node represents a directory entry
  • 99. part of the directory information tree  node N corresponds to the directory entry shown earlier; it also acts as a parent of other directory entries that have an additional attribute, Host_Name; such entries may be used to represent hosts
  • 100. Attribute Value Country NL Locality Amsterdam Organization Vrije Universiteit OrganizationalUnit Comp. Sc. CommonName Main server Host_Name star Host_Address 192.31.231.42 Attribute Value Country NL Locality Amsterdam Organization Vrije Universiteit OrganizationalUnit Comp. Sc. CommonName Main server Host_Name zephyr Host_Address 137.37.20.10 two directory entries having Host_Name as RDN Reading Assignment: case study of global Name services, Distributed Shared Memory
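Attribute-based lookup over a directory information base can be sketched as matching (attribute, value) pairs against a collection of entries. The entries below mirror the two slides above; the `search` helper is an illustrative sketch, not the LDAP protocol.

```python
# the DIB: each directory entry is a set of (attribute, value) pairs
dib = [
    {"Country": "NL", "Locality": "Amsterdam",
     "Organization": "Vrije Universiteit", "OrganizationalUnit": "Comp. Sc.",
     "CommonName": "Main server", "Host_Name": "star",
     "Host_Address": "192.31.231.42"},
    {"Country": "NL", "Locality": "Amsterdam",
     "Organization": "Vrije Universiteit", "OrganizationalUnit": "Comp. Sc.",
     "CommonName": "Main server", "Host_Name": "zephyr",
     "Host_Address": "137.37.20.10"},
]

def search(**attrs):
    """Return every entry matching all the given (attribute, value) pairs."""
    return [entry for entry in dib
            if all(entry.get(k) == v for k, v in attrs.items())]
```

A search on a shared attribute (`Country="NL"`) returns both hosts, while adding `Host_Name="star"` narrows it to one entry and its address, which is the kind of description-based lookup structured naming alone cannot offer.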
  • 101. THANK YOU FOR YOUR ATTENTION 51
  • 102. FUNDAMENTALS OF DISTRIBUTED SYSTEM Chapter Five Synchronization 1
  • 103. OUTLINE  Clock synchronization, physical clocks and clock synchronization algorithms  Logical clocks and time stamps  Global state  Distributed transactions and concurrency control  Election algorithms  Mutual exclusion and various algorithms to achieve mutual exclusion 2
  • 104. INTRODUCTION  Synchronization is the coordination of actions between processes.  Asynchronous execution consists of independent events.  cooperation is partly supported by naming; it allows processes to at least share resources (entities)  synchronization deals with how to ensure that processes do not simultaneously access a shared resource  and with how events can be ordered, such as two processes sending messages to each other We will study:  Synchronization based on “Actual Time”.  Synchronization based on “Relative Time”.  Synchronization based on Co-ordination (with Election Algorithms).  Mutual Exclusion 3
  • 105.  Essentials of Synchronization 4
  • 106.  Issues of synchronization 5
  • 108. Clock Synchronization  in centralized systems, synchronization is straightforward because processes share memory and a single clock.  The event ordering is clean because all the events are timed by the same clock.  One clock, even when several processors share the memory.  achieving agreement on time in distributed systems is difficult  Each system has its own time.  e.g., consider the make program on a UNIX machine; it compiles only source files for which the time of their last update was later than the existing object file 7
  • 109. Physical Clocks: clocks whose values must not deviate from the real time by more than a certain amount.  A clock is an electronic device that counts oscillations of a crystal at a particular frequency; the count is stored in a computer register.  Physical clocks can be used to timestamp events on that computer, e.g., event E1 at time t1 and event E2 at time t2.  Many applications are interested only in the order of events, not the exact time of day at which they occurred  The instantaneous time difference between two computers' clocks is the skew, while the rate at which a clock diverges from real time is the drift  Several methods are used to attempt the synchronization of physical clocks in a distributed system  The main methods are:  Cristian's Algorithm  Berkeley Algorithm  Network Time Protocol 8
  • 113. 12  Clock Synchronization Algorithm (T_server)
  • 115.  Cristian's Algorithm  The client process receives the response from the server at time T1 and calculates the new synchronized client clock time as T_client = T_server + (T1 − T0)/2 Send request at 5:08:15:100 (T0) Receive response at 5:08:15:900 (T1) Response contains 5:09:25:300 (T_server) The elapsed time T1 − T0 = 800 ms is the round-trip time, so the estimated one-way delay is (T1 − T0)/2 = 400 ms Set the time to T_client = T_server + (T1 − T0)/2 T_client = 5:09:25:300 + 400 T_client = 5:09:25:700 14
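The worked example above can be checked with a short sketch of Cristian's rule. Times are expressed in milliseconds within their respective seconds (T0/T1 within 5:08:15, T_server within 5:09:25); the `cristian` helper name is illustrative.

```python
def cristian(t0, t1, t_server):
    """Client sets its clock to the server time plus half the round-trip time."""
    rtt = t1 - t0                 # round-trip time as measured by the client
    return t_server + rtt / 2

# request sent at 5:08:15:100, response received at 5:08:15:900,
# response carries the server time 5:09:25:300
t_client = cristian(100, 900, 300)
# rtt = 800 ms, so the client sets 5:09:25:300 + 400 ms = 5:09:25:700
```

The estimate is only as good as the assumption that the network delay is symmetric; a skewed outbound/return delay biases the result by half the asymmetry.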
  • 116. 15
  • 117. 16
  • 118. 17 • NTP (Network Time Protocol) is a protocol that allows computer clock times to be synchronized across a network
  • 147. SUMMARY • Overview • Essentiality of Synchronization • Issues in Synchronization • Clock Synchronization • Physical Clocks • Clock Synchronization • Logical Clocks • Happens-Before Relationship • Lamport's Logical Clocks • Vector Clocks • Mutual Exclusion • A Centralized Algorithm • A Decentralized Algorithm • A Distributed Algorithm • A Token Ring Algorithm • Election Algorithm • Introduction • The Bully Algorithm • A Ring Algorithm • Superpeer Selection 46
  • 148. THANK YOU FOR YOUR ATTENTION 47
  • 149. FUNDAMENTALS OF DISTRIBUTED SYSTEM Chapter Six Fault Tolerance 1
  • 150.  a major difference between distributed systems and single machine systems is that with the former, partial failure is possible, i.e., when one component in a distributed system fails  such a failure may affect some components while others will continue to function properly  an important goal of distributed systems design is to construct a system that can automatically recover from partial failure  it should tolerate faults and continue to operate to some extent 2 INTRODUCTION
  • 151.  we will discuss  fault tolerance, and making distributed systems fault tolerant  process resilience (techniques by which one or more processes can fail without seriously disturbing the rest of the system)  reliable multicasting to keep processes synchronized (by which message transmission to a collection of processes is guaranteed to succeed)  distributed commit protocols for ensuring atomicity in distributed systems  failure recovery by saving the state of a distributed system (when and how) 3 OBJECTIVES OF THE CHAPTER
  • 152. Introduction to Fault Tolerance  Basic Concepts  fault tolerance is strongly related to dependable systems  dependability covers the following  availability  refers to the probability that the system is operating correctly at any given time; defined in terms of an instant in time  reliability  a property that a system can run continuously without failure; defined in terms of a time interval  safety  refers to the situation that even when a system temporarily fails to operate correctly, nothing catastrophic happens  maintainability  how easily a failed system can be repaired 4
  • 153.  dependable systems are also required to provide a high degree of security  a system is said to fail when it cannot meet its promises; for instance, failing to provide its users one or more of the services it promises  an error is a part of a system’s state that may lead to a failure; e.g., damaged packets in communication  the cause of an error is called a fault  building dependable systems closely relates to controlling faults  a distinction is made between preventing, removing, and forecasting faults  a fault tolerant system is a system that can provide its services even in the presence of faults 5
  • 154.  faults are classified into three  transient  occurs once and then disappears; if the operation is repeated, the fault goes away; e.g., a bird flying through the beam of a microwave transmitter may cause some lost bits  intermittent  it occurs, then vanishes of its own accord, then reappears, ...; e.g., a loose connection; difficult to diagnose; take your sick child to the nearest clinic, but the child does not show any sickness by the time you reach there  permanent  one that continues to exist until the faulty component is repaired; e.g., disk head crash, software bug 6
  • 155.  Failure Types - 5 of them  Crash failure (also called fail-stop failure): a server halts, but was working correctly until it stopped; e.g., the OS halts; reboot the system  Omission failure: a server fails to respond to incoming requests  Receive omission: a server fails to receive incoming messages; e.g., perhaps no thread is listening  Send omission: a server fails to send messages  Timing failure: a server's response lies outside the specified time interval; e.g., perhaps it is too fast, flooding the receiver, or too slow  Response failure: the server's response is incorrect  Value failure: the value of the response is wrong; e.g., a search engine returning wrong Web pages as the result of a search 7
  • 156. 8  State transition failure: the server deviates from the correct flow of control; e.g., taking default actions when it fails to understand the request  Arbitrary failure (or Byzantine failure): a server may produce arbitrary responses at arbitrary times; the most serious  Failure Masking by Redundancy  to be fault tolerant, the system tries to hide the occurrence of failures from other processes - masking  the key technique for masking faults is redundancy  three kinds are possible  information redundancy; add extra bits to allow recovery from garbled bits (error correction)  time redundancy: an action is performed more than once if needed; e.g., redo an aborted transaction; useful for transient and intermittent faults  physical redundancy: add (replicate) extra equipment (hardware) or processes (software)
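Physical redundancy is usually combined with voting, as in triple modular redundancy (TMR), where three replicas compute the same result and a voter masks a single faulty one. A minimal voter sketch (the function tmr_vote and its error message are our own illustration, not from the slides):

```python
from collections import Counter

def tmr_vote(results):
    """Majority voter over the outputs of three replicated components;
    a single faulty replica is masked by the two correct ones."""
    value, count = Counter(results).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority: more than one replica failed")
    return value

# one faulty replica (7) is outvoted by the two correct ones
print(tmr_vote([42, 42, 7]))  # 42
```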
  • 157. Process Resilience  how can fault tolerance be achieved in distributed systems  one method is protection against process failures by replicating processes into groups  we discuss  what are the general design issues of process groups  what actually is a fault tolerant group  how to reach agreement within a process group when one or more of its members cannot be trusted to give correct answers 9
  • 158.  Design Issues  the key approach to tolerating a faulty process is to organize several identical processes into a group  all members of a group receive a message hoping that if one process fails, another one will take over  the purpose is to allow processes to deal with collection of processes as a single abstraction  process groups may be dynamic  new groups can be created and old groups can be destroyed  a process can join or leave a group  a process can be a member of several groups at the same time  hence group management and membership mechanisms are required 10
  • 159. 11  the internal structure of a group may be flat or hierarchical  flat: all processes are equal and decisions are made collectively  hierarchical: a coordinator and several workers; the coordinator decides which worker is best suited to carry out a task (a) communication in a flat group (b) communication in a simple hierarchical group
  • 160.  the flat group  has no single point of failure  but decision making is more complicated (voting may be required for decision making, which may create delay and overhead)  the hierarchical group has the opposite properties  group membership may be handled  through a group server where all requests (joining, leaving, ...) are sent; it is a single point of failure  in a distributed way where membership is multicasted (if a reliable multicasting mechanism is available)  but what if a member crashes? other members have to find this out by noticing that it no longer responds 12
  • 161.  Failure Masking and Replication  how to replicate processes so that they can form groups and failures can be masked?  there are two ways for such replication:  primary-based replication  for fault tolerance, a primary-backup protocol is used  organize processes hierarchically and let the primary (i.e., the coordinator) coordinate all write operations  if the primary crashes, the backups hold an election  replicated-write protocols  in the form of active replication or by means of quorum-based protocols  that means, processes are organized as flat groups 13
  • 162.  another important issue is how much replication is needed  for simplicity, consider only replicated-write systems  a system is said to be k fault tolerant if it can survive faults in k components and still meet its specifications  if the processes fail silently, then having k+1 replicas is enough; if k of them fail, the remaining one can function  if processes exhibit Byzantine failures, 2k+1 replicas are required; even if the k faulty processes generate the same (wrong) reply, the k+1 correct processes will all produce the same (correct) answer; which of the two answers is correct cannot be ascertained directly, so we believe the majority, and the k+1 correct replies outvote the k wrong ones 14
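The replica counts above can be captured in a small helper; a sketch (the function name replicas_needed is ours, not from the slides):

```python
def replicas_needed(k, byzantine=False):
    """Minimum group size for k fault tolerance in a replicated-write system.

    Fail-silent (crash) faults: k + 1 replicas, so one survivor can answer.
    Byzantine faults: 2k + 1 replicas, so the k + 1 correct replies
    outvote the k (possibly identical) wrong ones.
    """
    return 2 * k + 1 if byzantine else k + 1

print(replicas_needed(2))                  # 3: tolerates 2 crash faults
print(replicas_needed(2, byzantine=True))  # 5: tolerates 2 Byzantine faults
```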
  • 163. Distributed Commit  atomic multicasting is an example of the more general problem known as distributed commit  in atomic multicasting, the operation is the delivery of a message  but the distributed commit problem involves having an operation performed by each member of a process group, or by none at all  there are three protocols: one-phase commit, two-phase commit, and three-phase commit  One-Phase Commit Protocol  a coordinator tells all other processes, called participants, whether or not to (locally) perform an operation  drawback: if one of the participants cannot perform the operation, there is no way to tell the coordinator; for example, due to violation of concurrency control constraints in distributed transactions 15
  • 164.  Two-Phase Commit Protocol (2PC)  it has two phases: voting phase and decision phase, each involving two steps  voting phase  the coordinator sends a VOTE_REQUEST message to all participants  each participant then sends a VOTE_COMMIT or VOTE_ABORT message depending on its local situation  decision phase  the coordinator collects all votes; if all vote to commit the transaction, it sends a GLOBAL_COMMIT message; if at least one participant sends VOTE_ABORT, it sends a GLOBAL_ABORT message  each participant that voted for a commit waits for the final reaction of the coordinator and commits or aborts 16
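The coordinator's rule in the decision phase can be sketched as follows, using the message names from the slides (the function itself is our illustration, not part of any standard library):

```python
def coordinator_decide(votes):
    """2PC decision phase: commit only if every participant voted to commit."""
    if votes and all(v == "VOTE_COMMIT" for v in votes):
        return "GLOBAL_COMMIT"
    return "GLOBAL_ABORT"  # at least one VOTE_ABORT (or a missing vote)

print(coordinator_decide(["VOTE_COMMIT", "VOTE_COMMIT"]))  # GLOBAL_COMMIT
print(coordinator_decide(["VOTE_COMMIT", "VOTE_ABORT"]))   # GLOBAL_ABORT
```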
  • 165. a) the finite state machine for the coordinator in 2PC b) the finite state machine for a participant 17
  • 166.  problems may occur in the event of failures  the coordinator and participants have states in which they block waiting for messages: INIT, READY, WAIT  when a process crashes, other processes may wait indefinitely  hence, timeout mechanisms are required  a participant waiting in its INIT state for VOTE_REQUEST from the coordinator aborts and sends VOTE_ABORT if it does not receive a vote request after some time  the coordinator blocking in state WAIT aborts and sends GLOBAL_ABORT if all votes have not been collected on time  a participant blocking in its READY state for the global vote cannot simply abort; instead it must find out which message the coordinator actually sent  by blocking until the coordinator recovers  or by requesting another participant, say Q 18
  • 167. actions taken by a participant P when residing in state READY and having contacted another participant Q 19
State of Q | Action by P | Comments
COMMIT | make transition to COMMIT | coordinator sent GLOBAL_COMMIT before crashing, but P didn’t receive it
ABORT | make transition to ABORT | coordinator sent GLOBAL_ABORT before crashing, but P didn’t receive it
INIT | make transition to ABORT | coordinator sent VOTE_REQUEST before crashing; P received it but Q did not
READY | contact another participant | if all are in state READY, wait until the coordinator recovers
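This decision table maps directly to a lookup; a sketch (the function name ready_state_action and the action strings are our own illustration):

```python
def ready_state_action(state_of_q):
    """Action taken by participant P, blocked in READY, after learning
    the state of another participant Q (2PC cooperative termination)."""
    actions = {
        "COMMIT": "transition to COMMIT",         # GLOBAL_COMMIT was lost
        "ABORT":  "transition to ABORT",          # GLOBAL_ABORT was lost
        "INIT":   "transition to ABORT",          # Q never got VOTE_REQUEST
        "READY":  "contact another participant",  # if all READY, wait for coordinator
    }
    return actions[state_of_q]

print(ready_state_action("INIT"))  # transition to ABORT
```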
  • 168.  a process (participant or coordinator) can recover from a crash if its state has been saved to persistent storage  actions by a participant after recovery: 20
State before crash | Action by the process after recovery
INIT | locally abort the transaction and inform the coordinator
COMMIT or ABORT | retransmit its decision to the coordinator
READY | cannot decide on its own what it should do next; has to contact other participants
 there are two critical states for the coordinator:
State before crash | Action by the coordinator after recovery
WAIT | retransmit the VOTE_REQUEST message
After the decision in the 2nd phase | retransmit the decision
  • 169. Recovery  fundamental to fault tolerance is recovery from an error  recall: an error is that part of a system that may lead to a failure  error recovery means to replace an erroneous state with an error-free state  two forms of error recovery: backward recovery and forward recovery  Backward Recovery  bring the system from its present erroneous state back into a previously correct state  for this, the system’s state must be recorded from time to time; each time a state is recorded, a checkpoint is said to be made  e.g., retransmitting lost or damaged packets in the implementation of reliable communication 21
  • 170.  most widely used, since it is a generally applicable method and can be integrated into the middleware layer of a distributed system  disadvantages:  checkpointing and restoring a process to its previous state are costly and a performance bottleneck  no guarantee can be given that the error will not recur, which may take an application into a loop of recovery  some actions may be irreversible; e.g., deleting a file, handing over cash to a customer  Forward Recovery  bring the system from its present erroneous state to a correct new state from which it can continue to execute  it has to be known in advance which errors may occur so as to correct those errors  e.g., erasure correction, where a lost or damaged packet is reconstructed from other successfully delivered packets
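Backward recovery via checkpointing can be sketched as below; a minimal illustration assuming the process state fits in a picklable dictionary (the helper names and the checkpoint file name are ours, not from the slides):

```python
import pickle

def save_checkpoint(state, path="checkpoint.bin"):
    """Record the current process state on stable storage (a checkpoint)."""
    with open(path, "wb") as f:
        pickle.dump(state, f)

def restore_checkpoint(path="checkpoint.bin"):
    """Backward recovery: roll back to the last recorded correct state."""
    with open(path, "rb") as f:
        return pickle.load(f)

state = {"balance": 100, "seq": 7}
save_checkpoint(state)           # checkpoint the correct state
state["balance"] = -999          # an erroneous state is reached
state = restore_checkpoint()     # roll back to the checkpoint
print(state)                     # {'balance': 100, 'seq': 7}
```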
  • 171. THANK YOU FOR YOUR ATTENTION 23