FUNDAMENTALS OF DISTRIBUTED SYSTEM
Chapter Three
Distributed File system
1
OUTLINE
 Distributed File Systems: Introduction
 File Service Architecture
 Case Study 1: Sun Network File System
 Case Study 2: The Andrew File System
2
INTRODUCTION
 The main goal of a distributed system is sharing resources.
 Enables programs to store and access remote files (e.g., on web
servers) exactly as they do local files
 Distributed file systems support the sharing of information in
the form of files and hardware resources in the form of
persistent storage throughout an intranet.
 File systems were originally developed for centralized
computer systems and desktop computers as an operating
system facility providing a convenient programming interface
to disk storage.
 A well designed file service provides access to files stored at a
server with performance and reliability similar to, and in some
cases better than, files stored on local disks 3
STORAGE SYSTEMS AND THEIR PROPERTIES
Storage system                Sharing  Persistence  Distributed cache/replicas  Consistency maintenance         Example
Main memory                   No       No           No                          strict one-copy                 RAM
File system                   No       Yes          No                          strict one-copy                 UNIX file system
Distributed file system       Yes      Yes          Yes                         slightly weaker guarantees      Sun NFS
Web                           Yes      Yes          Yes                         no automatic consistency        Web server
Distributed shared memory     Yes      No           Yes                         slightly weaker guarantees      Ivy
Remote objects                Yes      No           No                          strict one-copy                 CORBA
Persistent object store       Yes      Yes          No                          strict one-copy                 CORBA Persistent State Service
Peer-to-peer storage system   Yes      Yes          Yes                         considerably weaker guarantees  OceanStore
4
CHARACTERISTICS OF FILE SYSTEM
 Responsible for the organization, storage, retrieval, naming,
sharing and protection of files.
 Provides a programming interface that characterizes the file
abstraction
 Files contain both data and attributes
 File attributes:
5
File length
Creation timestamp
Read timestamp
Write time stamp
Attribute timestamp
Owner
File type
Access control list
DISTRIBUTED FILE SYSTEM REQUIREMENTS
 Transparency(access , location, mobility, performance,
scaling)
 Concurrent file updates
 File replication
 Hardware and operating system heterogeneity
 Fault tolerance
 Consistency
 Security
 Efficiency
6
FILE SERVICE ARCHITECTURE
 An architecture that offers a clear separation of the main
concerns
 Access to files is obtained by structuring the file service
 The file service has three components:
A. a flat file service
B. a directory service
C. a client module
7
File service Architecture
8
FLAT FILE SERVICE
 Concerned with implementing operations on the contents of files
 Unique file identifiers (UFIDs) are used to refer to files in all
requests
 A UFID is a long sequence of bits
 Each file has a UFID that is unique among all of the files
9
DIRECTORY SERVICE
 The directory service provides a mapping between text names
for files and their UFIDs.
 Clients may obtain the UFID of a file by quoting its text name
to the directory service.
 The directory service provides the functions needed to
generate directories, to add new file names to directories and
to obtain UFIDs from directories.
 It is a client of the flat file service
 Its directory files are stored in files of the flat file service.
 When a hierarchic file-naming scheme is adopted, as in
UNIX, directories hold references to other directories. 10
CLIENT MODULE
 A client module runs in each client computer, integrating and
extending the operations of the flat file service and the
directory service under a single application programming
interface that is available to user-level programs in client
computers.
 For example, in UNIX hosts, a client module would be provided
that emulates the full set of UNIX file operations, interpreting
UNIX multi-part file names by iterative requests to the directory
service.
 The client module also holds information about the network
locations of the flat file server and directory server processes.
 Finally, the client module can play an important role in achieving
satisfactory performance through the implementation of a cache of
recently used file blocks at the client. 11
FLAT FILE SERVICE INTERFACE
 This is the RPC interface used by client modules. It is not
normally used directly by user-level programs.
 A FileId is invalid if the file that it refers to is not present in
the server processing the request or if its access permissions
are inappropriate for the operation requested.
 All of the procedures in the interface except Create throw
exceptions if the FileId argument contains an invalid UFID or
the user doesn’t have sufficient access rights. These
exceptions are omitted from the definition for clarity.
12
FLAT FILE SERVICE OPERATIONS
 Read(FileId, i, n) → Data: Reads a sequence of up to n items from a file starting at item i and
returns it in Data.
 Write(FileId, i, Data): Writes a sequence of Data to a file, starting at item i, extending the file if
necessary.
 Create() → FileId: Creates a new file of length 0 and delivers a UFID for it.
 Delete(FileId): Removes the file from the file store.
 GetAttributes(FileId) → Attr: Returns the file attributes for the file.
 SetAttributes(FileId, Attr): Sets the file attributes.
 GetAttributes and SetAttributes enable clients to access the attribute record.
 GetAttributes is normally available to any client that is allowed to read the file.
 Access to the SetAttributes operation would normally be restricted to the directory service that
provides access to the file.
13
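The operations above can be sketched as a toy in-memory flat file service; the class name, byte-sized items, and integer UFIDs are illustrative assumptions, not part of the actual RPC interface.

```python
import itertools

class InvalidFileId(Exception):
    """Raised when a UFID does not refer to a file on this server."""

class FlatFileService:
    """Toy in-memory flat file service; items are single bytes."""
    def __init__(self):
        self._files = {}                   # UFID -> bytearray (file contents)
        self._attrs = {}                   # UFID -> attribute record
        self._ufids = itertools.count(1)   # stand-in for globally unique UFIDs

    def create(self):
        ufid = next(self._ufids)
        self._files[ufid] = bytearray()    # new file of length 0
        self._attrs[ufid] = {"length": 0}
        return ufid

    def _check(self, ufid):
        if ufid not in self._files:
            raise InvalidFileId(ufid)

    def read(self, ufid, i, n):
        self._check(ufid)
        return bytes(self._files[ufid][i:i + n])

    def write(self, ufid, i, data):
        self._check(ufid)
        f = self._files[ufid]
        if i > len(f):
            f.extend(b"\0" * (i - len(f)))   # extend the file if necessary
        f[i:i + len(data)] = data
        self._attrs[ufid]["length"] = len(f)

    def delete(self, ufid):
        self._check(ufid)
        del self._files[ufid], self._attrs[ufid]

    def get_attributes(self, ufid):
        self._check(ufid)
        return dict(self._attrs[ufid])
```

As in the interface definition, every operation except Create raises an exception for an invalid UFID.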
DIRECTORY SERVICE OPERATIONS
 Lookup(Dir, Name) → FileId (throws NotFound):
Locates the text name in the directory and returns the relevant
UFID. If Name is not in the directory, throws an exception.
 AddName(Dir, Name, FileId) (throws NameDuplicate):
If Name is not in the directory, adds (Name, FileId) to the
directory and updates the file’s attribute record. If Name is
already in the directory, throws an exception.
 UnName(Dir, Name) (throws NotFound):
If Name is in the directory, removes the entry containing Name
from the directory. If Name is not in the directory, throws an
exception.
 GetNames(Dir, Pattern) → NameSeq:
Returns all the text names in the directory that match the
regular expression Pattern.
14
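A toy Python sketch of these directory operations; glob-style matching via fnmatch stands in for the regular-expression Pattern, and in a real system the directories themselves would be stored as flat-file-service files rather than in memory.

```python
import fnmatch

class NotFound(Exception): pass
class NameDuplicate(Exception): pass

class DirectoryService:
    """Toy directory service: maps text names to UFIDs per directory."""
    def __init__(self):
        self._dirs = {}   # directory id -> {name: ufid}

    def make_dir(self, d):
        self._dirs[d] = {}

    def lookup(self, d, name):
        try:
            return self._dirs[d][name]
        except KeyError:
            raise NotFound(name)

    def add_name(self, d, name, ufid):
        if name in self._dirs[d]:
            raise NameDuplicate(name)
        self._dirs[d][name] = ufid

    def un_name(self, d, name):
        if name not in self._dirs[d]:
            raise NotFound(name)
        del self._dirs[d][name]

    def get_names(self, d, pattern):
        # glob matching used here for simplicity
        return sorted(fnmatch.filter(self._dirs[d], pattern))
```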
FILE ACCESSING MODELS
 File access models are methods used for accessing remote
files and the unit of data access.
 A distributed file system may use one of the following models
to service a client's file access request when the accessed file
is remote:
1. Remote service model
 Processing of a client's request is performed at the server's node.
 Thus, the client's request for file access is delivered across the
network as a message to the server, the server machine performs
the access request, and the result is sent to the client.
 This model needs to minimize the number of messages sent and the
overhead per message. 15
2. Data-caching model
 This model attempts to reduce the network traffic of the previous
model by caching the data obtained from the server node.
 This takes advantage of the locality of reference found in file
accesses.
 This model gives increased performance and greater system scalability.
 The unit of data transfer refers to the fraction of a file that is
transferred to clients in a single read or write operation.
3.File-level transfer model
 In this model when file data is to be transferred, the entire file is
moved.
 This reduces server load and network traffic since it accesses the
server only once. This has better scalability.
 This model requires sufficient storage space on the client machine.
FILE ACCESSING MODELS…
16
4. Block-level transfer model
 File transfer takes place in file blocks.
 A file block is a contiguous portion of a file and is of fixed
length
 This does not require client nodes to have large storage space
 It eliminates the need to copy an entire file when only a small
portion of the data is needed.
 When an entire file is to be accessed, multiple server requests
are needed, resulting in more network traffic and more
network protocol overhead.
 NFS uses block-level transfer model.
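A minimal sketch of the block-level transfer model, assuming a toy 4-byte block size and an in-memory "server": the client fetches whole fixed-length blocks and caches them, so nearby reads avoid extra server requests.

```python
BLOCK_SIZE = 4  # illustrative only; real systems use e.g. 4 KB or 8 KB blocks

class BlockClient:
    """Toy client for the block-level transfer model with a block cache."""
    def __init__(self, server_file: bytes):
        self._server = server_file   # stands in for the remote file
        self._cache = {}             # block number -> block contents
        self.requests = 0            # count of simulated server round trips

    def _block(self, bno):
        # Fetch a whole block from the "server" only on a cache miss.
        if bno not in self._cache:
            self.requests += 1
            start = bno * BLOCK_SIZE
            self._cache[bno] = self._server[start:start + BLOCK_SIZE]
        return self._cache[bno]

    def read(self, offset, n):
        # Assemble the requested byte range from the covering blocks.
        first = offset // BLOCK_SIZE
        last = (offset + n - 1) // BLOCK_SIZE
        data = b"".join(self._block(b) for b in range(first, last + 1))
        start = offset - first * BLOCK_SIZE
        return data[start:start + n]
```

Repeated reads of the same region hit the cache, illustrating why block-level transfer reduces storage demands on clients at the cost of extra requests when a whole file is read.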
FILE ACCESSING MODELS…
17
5. Byte-level transfer model
 Unit of transfer is a byte.
 Model provides maximum flexibility because it allows
storage and retrieval of an arbitrary amount of a file, specified
by an offset within a file and length.
 The drawback is that cache management is harder due to the
variable-length data for different access requests.
6.Record-level transfer model
 This model is used with structured files and the unit of
transfer is the record.
FILE ACCESSING MODELS…
18
FILE STRUCTURE ARCHITECTURE
 Hierarchical file system
 Consists of a number of directories arranged in a tree structure
 File Group
 Is a collection of files that can be located on any server or
moved between servers while maintaining the same names
 A similar construct is used in UNIX file system
 Helps with distributing the load of file serving between
several servers
 File groups have identifiers which are unique throughout the
system
19
DFS
 DFS has two methods
 Sun Network File System
 Andrew File System
 NFS has a stateless server, whereas AFS has a stateful server
 AFS provides location independence (the physical location of
file can be changed without having to change the path of the
file) as well as location transparency
 NFS provides location transparency
 AFS is more scalable
20
 A stateless server does not keep information on the state of its
clients and can change its own state without informing any client.
 A stateful server maintains persistent information on its clients
and requires explicit deletion of that information by the server.
21
DFS
 This file system was developed by Sun Microsystems, hence the name
Sun Network File System (NFS).
 NFS has been widely adopted in industry and in academic
environments since its introduction in 1985.
 NFS is a client-server application, so a user can view, store and update
files on a remote computer.
 In most cases, all the clients and servers are on the same LAN.
 Each NFS server exports one or more of its directories for access by
remote clients.
 NFS provides transparent access to remote files for client programs
running on UNIX and other systems.
 NFS allows a user or system administrator to mount (designate as
accessible) all or a portion of a file system on a server.
 An important goal of NFS is to achieve a high level of support for
hardware and operating system heterogeneity.
CASE STUDY: SUN NETWORK FILE SYSTEM
22
CASE STUDY: SUN NETWORK FILE SYSTEM…
 NFS protocol (RFC 1813 ) is designed to be independent of
the computer, OS, network architecture, and transport
protocol.
 NFS uses RPC to route requests between the client and server.
 The NFS protocol was originally developed for use in
networks of UNIX systems
 The NFS server module resides in the kernel on each
computer that acts as an NFS server.
 Requests referring to files in a remote file system are
translated by the client module to NFS protocol operations
and then passed to the NFS server module at the computer
holding the relevant file system.
23
CASE STUDY: SUN NETWORK FILE SYSTEM…
The Sun NFS architecture model 24
25
CASE STUDY: SUN NETWORK FILE SYSTEM…
26
It consists of three layers:
 System call layer: handles system calls such as OPEN, READ and CLOSE.
 Virtual File System: the task of the VFS layer is to maintain a table with one entry
for each open file, analogous to the table of i-nodes for open files in UNIX. The VFS
layer has an entry called a v-node (virtual i-node) for every open file, telling
whether the file is local or remote.
 NFS client code: creates an r-node (remote i-node) in its internal tables to hold the
file handles.
 Each v-node in the VFS layer will ultimately contain either a pointer to an r-node in
the NFS client code, or a pointer to an i-node in the local operating system.
 Thus from the v-node it is possible to see whether a file or directory is local or remote,
and, if it is remote, to find its file handle.
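This v-node bookkeeping can be rendered as a small sketch; the class and field names are illustrative, not the actual SunOS kernel structures.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class INode:
    """Index of a local file (UNIX i-node number)."""
    inumber: int

@dataclass
class RNode:
    """Remote i-node, holding the NFS file handle."""
    file_handle: bytes

@dataclass
class VNode:
    """One VFS-layer entry per open file: points at either an
    i-node (local file) or an r-node (remote file)."""
    ref: Union[INode, RNode]

    @property
    def is_remote(self):
        return isinstance(self.ref, RNode)
```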
CASE STUDY: SUN NETWORK FILE SYSTEM…
 The file identifiers used in NFS are called file handles.
 The virtual file system layer has one VFS structure for each
mounted file system and one v-node per open file.
 A VFS structure relates a remote file system to the local
directory on which it is mounted.
 The v-node contains an indicator to show whether a file is
local or remote.
 If the file is local, the v-node contains a reference to the index
of the local file (an i-node in a UNIX implementation).
 If the file is remote, it contains the file handle of the remote
file.
 Reading Assignment NFS operation
CASE STUDY: SUN NETWORK FILE SYSTEM…
27
Server caching
 Caching in both the client and the server computer are
indispensable features of NFS implementations in order to
achieve adequate performance.
 In conventional UNIX systems: file pages, directories and file
attributes that have been read from disk are retained in a main
memory buffer cache until the buffer space is required for
other pages.
CASE STUDY: SUN NETWORK FILE SYSTEM…
28
Benefits of NFS: Among many benefits for organizations using NFS are
the following:
 Mature: NFS is a mature protocol, which means most aspects of
implementing, securing and using it are well understood, as are its potential
weaknesses.
 Open: NFS is an open protocol, with its continued development documented
in internet specifications as a free and open network protocol.
 Cost-effective: NFS is a low-cost solution for network file sharing that is easy
to set up because it uses the existing network infrastructure.
 Centrally managed: NFS's centralized management decreases the need for
added software and disk space on individual user systems.
 User-friendly: The protocol is easy to use and enables users to access remote
files on remote hosts in the same way they access local ones.
 Distributed: NFS can be used as a distributed file system, reducing the need
for removable media storage devices.
 Secure: With NFS, there is less removable media like CDs, DVDs, Blu-ray
discs, diskettes and USB devices in circulation, making the system more
secure.
29
CASE STUDY: SUN NETWORK FILE SYSTEM…
Disadvantages of NFS include the following:
 Dependence on RPCs makes NFS inherently insecure; it should only be
used on a trusted network behind a firewall. Otherwise, NFS is
vulnerable to internet threats.
 Some reviews of NFSv4 and NFSv4.1 suggest that these versions have
limited bandwidth and scalability and that NFS slows down during heavy
network traffic. The bandwidth and scalability issue is reported to have
improved with NFSv4.2.
30
CASE STUDY: SUN NETWORK FILE SYSTEM…
NFS summary
 Sun NFS closely follows the abstract model.
 The resulting design provides good location and access transparency if the
NFS mount service is used properly to produce similar name spaces at all
clients.
 NFS supports heterogeneous hardware and operating systems.
 The NFS server implementation is stateless, enabling clients and servers to
resume execution after a failure without the need for any recovery
procedures.
 Migration of files or filesystems is not supported, except at the level of
manual intervention to reconfigure mount directives after the movement of
a filesystem to a new location.
 Design goals of NFS are Access transparency, Location transparency,
Mobility transparency, Scalability, File replication, Hardware and
operating system heterogeneity, Fault tolerance, Consistency, Security and
Efficiency
CASE STUDY: SUN NETWORK FILE SYSTEM…
31
CASE STUDY: THE ANDREW FILE SYSTEM
 It is a location-independent file system.
 It is a DFS that uses a set of trusted servers (Vice servers) to present a
homogeneous, location-transparent file name space to all the client
workstations.
 With AFS, people can work together on the same files, no matter
where the files are located.
 AFS users do not need to know which machine is storing a file.
 AFS makes it as easy to access files stored on a remote computer as
files stored on the local disk.
 All files you store in AFS are available online by just connecting
to your AFS server through an AFS client.
 AFS uses a set of remote servers to access a file.
 AFS uses a local cache to reduce the workload and increase the
performance of a distributed computing environment. 32
 In AFS, the server keeps track of which files are opened by which
clients (unlike in NFS)
 Like NFS, AFS provides transparent access to remote shared
files for UNIX programs running on workstations.
 Access to AFS files is via the normal UNIX file primitives,
enabling existing UNIX programs to access AFS files without
modification or recompilation.
 AFS is compatible with NFS. AFS servers hold ‘local’ UNIX
files, but the filing system in the servers is NFS-based, so files
are referenced by NFS-style file handles rather than i-node
numbers, and the files may be remotely accessed via NFS.
CASE STUDY: THE ANDREW FILE SYSTEM…
33
Implementation of AFS
 Venus: the client-side manager, which acts as an interface between the
application programs and Vice
 Vice: the server-side processes that reside on top of the UNIX kernel,
providing shared file services to each client
 All files in AFS are distributed among the servers. The set of files on one
server is referred to as a volume
 If a request cannot be satisfied from this set of files, the Vice
server informs the client where it can find the required files
 The files available to user processes running on workstations are either local or
shared
 Local files are handled as normal UNIX files
 Shared files are stored on servers, and copies of them are cached on local disks
of workstations
34
CASE STUDY: THE ANDREW FILE SYSTEM…
Features of AFS
 File backup: AFS data files are backed up nightly. Backups are kept on
site for six months
 File security: AFS data files are protected by the Kerberos
authentication system
 Physical security: AFS data files are stored on servers located in the
UCSC data center
 Reliability and availability: AFS servers and storage are maintained
on redundant hardware
 Authentication: AFS uses Kerberos for authentication. Kerberos
accounts are automatically provisioned for all UCSC faculty
and staff
 Space per user (quota): AFS provides 500 MB of space per user, and
users can request an increase up to 10 GB.
35
CASE STUDY: THE ANDREW FILE SYSTEM…
36
CASE STUDY: THE ANDREW FILE SYSTEM…
 AFS has two unusual design characteristics:
Whole-file serving:
 The entire contents of directories and files are transmitted to client
computers by AFS servers
Whole-file caching:
 Once a copy of a file or a chunk has been transferred to a client
computer it is stored in a cache on the local disk.
 The cache contains several hundred of the files most recently used
on that computer.
 The cache is permanent, surviving reboots of the client computer.
 Local copies of files are used to satisfy clients’ open requests in
preference to remote copies whenever possible.
CASE STUDY: THE ANDREW FILE SYSTEM…
37
A FIRST REQUEST FOR DATA FROM A WORKSTATION IS SATISFIED BY THE SERVER AND THE DATA IS PLACED IN THE
LOCAL CACHE
AFS is implemented as two software components that exist as UNIX processes
called Vice and Venus. Vice is the name given to the server software that runs as a
user-level UNIX process in each server computer, and Venus is a user-level
process that runs in each client computer and corresponds to the client module in
our abstract model.
The set of files in one server is referred as volume.
38
IMPLEMENTATION ANDREW FILE SYSTEM
 The files available to user processes running on workstations are either local or shared.
 Local files are handled as normal UNIX files. They are stored on a workstation’s disk and are
available only to local user processes.
 Shared files are stored on servers, and copies of them are cached on the local disks of
workstations.
 The file name space seen by workstation users is a conventional UNIX directory hierarchy, with a
specific subtree (called cmu) containing all of the shared files. This splitting of the file name
space into local and shared files leads to some loss of location transparency, but this is hardly
noticeable to users other than system administrators.
 Local files are used only for temporary files (/tmp) and processes that are essential for
workstation startup.
 Other standard UNIX files (such as those normally found in /bin, /lib and so on) are
implemented as symbolic links from local directories to files held in the shared space.
 Users’ directories are in the shared space, enabling users to access their files from any
workstation.
39
IMPLEMENTATION ANDREW FILE SYSTEM…
 A flat file service is implemented
by the Vice servers, and the
hierarchic directory structure
required by UNIX user programs is
implemented by the set of Venus
processes in the workstations.
 Each file and directory in the
shared file space is identified by a
unique, 96-bit file identifier (fid)
similar to a UFID. The Venus
processes translate the pathnames
issued by clients to fids
40
Implementation Andrew File System…
 Files are grouped into volumes for ease
of location and movement. Volumes are
generally smaller than the UNIX
filesystems, which are the unit of file
grouping in NFS. For example, each
user’s personal files are generally
located in a separate volume. Other
volumes are allocated for system
binaries, documentation and library
code.
 The representation of fids includes the
volume number for the volume
containing the file (cf. the file group
identifier in UFIDs), an NFS file handle
identifying the file within the volume
(cf. the file number in UFIDs) and a
uniquifier to ensure that file identifiers
are not reused:
32 bits 32 bits 32 bits
Volume number File handle Uniquifier 41
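As a check on the 96-bit layout above, a short Python sketch can pack and unpack the three 32-bit fields of a fid; the big-endian field order here is an assumption for illustration only.

```python
import struct

def pack_fid(volume, file_handle, uniquifier):
    """Pack the three 32-bit fid fields into 12 bytes (96 bits)."""
    return struct.pack(">III", volume, file_handle, uniquifier)

def unpack_fid(fid):
    """Recover (volume number, file handle, uniquifier) from a packed fid."""
    return struct.unpack(">III", fid)
```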
IMPLEMENTATION ANDREW FILE SYSTEM…
Cache consistency
 Stateful servers in AFS allow the server to inform all clients
with open files about any updates made to a file by another
client, through what is known as a callback
 Callbacks to all clients with a copy of a file are ensured because a
callback promise is issued by the server to a client when it
requests a copy of a file
42
IMPLEMENTATION ANDREW FILE SYSTEM…
 When Vice supplies a copy of a file to a Venus process it also
provides a callback promise – a token issued by the Vice
server guaranteeing that it will notify the Venus process
when any other client modifies the file.
 Callback promises are stored with the cached files on the
workstation disks and have two states:
valid
cancelled.
43
IMPLEMENTATION ANDREW FILE SYSTEM…
 When a server performs a request to update a file it notifies
all of the Venus processes to which it has issued callback
promises by sending a callback to each
 A callback is a remote procedure call from a server to a Venus
process.
 When the Venus process receives a callback, it sets the
callback promise token for the relevant file to cancelled.
44
IMPLEMENTATION ANDREW FILE SYSTEM…
 Whenever Venus handles an open on behalf of a client, it checks
the cache.
 If the required file is found in the cache, then its token is checked.
 If its value is cancelled, then a fresh copy of the file must be
fetched from the Vice server,
 But if the token is valid, then the cached copy can be opened and
used without reference to Vice.
45
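The open-time check described above can be sketched as follows; VenusCache and fetch_from_vice are hypothetical names, and the real Venus caches whole files on the local disk rather than in memory.

```python
VALID, CANCELLED = "valid", "cancelled"

class VenusCache:
    """Sketch of Venus's open-time check: use the cached copy only while
    its callback promise token is valid; otherwise refetch from Vice."""
    def __init__(self, fetch_from_vice):
        self._fetch = fetch_from_vice    # callable(fid) -> file data
        self._cache = {}                 # fid -> (data, token)

    def open(self, fid):
        entry = self._cache.get(fid)
        if entry and entry[1] == VALID:
            return entry[0]              # no reference to Vice needed
        data = self._fetch(fid)          # fresh copy + new callback promise
        self._cache[fid] = (data, VALID)
        return data

    def break_callback(self, fid):
        """Vice's callback RPC: cancel the promise on a cached file."""
        if fid in self._cache:
            data, _ = self._cache[fid]
            self._cache[fid] = (data, CANCELLED)
```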
Implementation Andrew File System…
 When a workstation is restarted after a failure or a shutdown
 Venus aims to retain as many as possible of the cached files
on the local disk
 But it cannot assume that the callback promise tokens are
correct, since some callbacks may have been missed.
46
IMPLEMENTATION ANDREW FILE SYSTEM…
 Before the first use of each cached file or directory after a
restart,
 Venus therefore generates a cache validation request
containing the file modification timestamp to the server that is
the custodian of the file.
 If the timestamp is current, the server responds with valid, and
the token is reinstated.
 If the timestamp shows that the file is out of date, then the
server responds with cancelled, and the token is set to
cancelled.
 Callbacks must be renewed
47
IMPLEMENTATION ANDREW FILE SYSTEM…
Implementation of file system calls in AFS
48
OPERATION OF AFS
 Fetch(fid) → attr, data: Returns the attributes (status) and, optionally,
the contents of the file identified by fid and records a
callback promise on it.
 Store(fid, attr, data): Updates the attributes and (optionally) the
contents of a specified file.
 Create() → fid: Creates a new file and records a callback promise on it.
 Remove(fid): Deletes the specified file.
 SetLock(fid, mode): Sets a lock on the specified file or directory. The
mode of the lock may be shared or exclusive. Locks that are not
removed expire after 30 minutes.
 ReleaseLock(fid): Unlocks the specified file or directory.
 RemoveCallback(fid): Informs the server that a Venus process has
flushed a file from its cache.
 BreakCallback(fid): Call made by a Vice server to a Venus process;
cancels the callback promise on the relevant file.
49
THANK YOU FOR YOUR ATTENTION
50
FUNDAMENTALS OF DISTRIBUTED SYSTEM
Chapter Four
Naming
1
 we will discuss
 some general issues in naming
 how human-friendly names are organized and implemented;
e.g., those for file systems and the WWW; classes of
naming systems
 flat naming
 structured naming, and
 attribute-based naming
2
OBJECTIVES OF THE CHAPTER
INTRODUCTION
 names play an important role to:
 share resources
 uniquely identify entities
 refer to locations
 etc.
 an important issue is that a name can be resolved to the
entity it refers to
 to resolve names, it is necessary to implement a
naming system
 in a distributed system, the implementation of a naming
system is itself often distributed, unlike in non-distributed
systems
 efficiency and scalability of the naming system are the
main issues
3
 Uniform Resource Identifiers (URIs) [Berners-Lee et al.
2005] came about from the need to identify resources
on the Web, and other Internet resources such as
electronic mailboxes. An important goal was to identify
resources in a coherent way, so that they could all be
processed by common software such as browsers.
 Uniform Resource Locator (URL) is often used for URIs
that provide location information and specify the method
for accessing the resource, including the 'http' scheme.
 Uniform Resource Names (URNs) are URIs that are
used as pure resource names rather than locators.
4
COMMON TERMS
 A name service stores information about a collection of
textual names, in the form of bindings between the names
and the attributes of the entities they denote, such as
users, computers, services and objects.
 A name space is the collection of all valid names
recognized by a particular service.
 A naming domain is a name space for which there exists a
single overall administrative authority responsible for
assigning names within it.
 The Domain Name System is a name service design
whose main naming database is used across the Internet.
 A service that stores collections of bindings between
names and attributes and that looks up entries that match
attribute-based specifications is called a directory service.
Directory services are also sometimes known as attribute-
based name services.
5
Names, Identifiers, and Addresses
a name in a distributed system is a string of bits or
characters that is used to refer to an entity
an entity is anything; e.g., resources such as hosts, printers,
disks, files, objects, processes, users, Web pages,
newsgroups, mailboxes, network connections, ...
entities can be operated on
 e.g., a resource such as a printer offers an interface
containing operations for printing a document, requesting
the status of a job, etc.
 a network connection may provide operations for sending
and receiving data, setting quality of service parameters, etc.
to operate on an entity, it is necessary to access it through its
access point, itself a (special) entity
6
 access point
 the name of an access point is called an address (such as
IP address and port number as used by the transport layer)
 the address of the access point of an entity is also referred
to as the address of the entity
 an entity can have more than one access point (similar to
accessing an individual through different telephone
numbers)
 an entity may change its access point in the course of time
(e.g., a mobile computer getting a new IP address as it
moves)
7
 an address is a special kind of name
 it refers to at most one entity
 an entity may have more than one address, e.g., when it is
replicated such as in Web pages
 an entity may change an access point, or an access point
may be reassigned to a different entity (like telephone
numbers in offices)
 separating the name of an entity from its address makes
naming easier and more flexible; such a name is called
location independent
 there are also other types of names that uniquely identify an
entity; in any case a true identifier is a name with the
following properties
 it refers to at most one entity
 each entity is referred by at most one identifier
 it always refers to the same entity (never reused)
 identifiers allow us to unambiguously refer to an entity
8
 examples
 name of an FTP server (entity)
 URL of the FTP server
 address of the FTP server
 IP number:port number
 the address of the FTP server may change
 there are three classes of naming systems: flat naming,
structured naming, and attribute-based naming
9
A.Flat Naming
a name is a sequence of characters without structure; like
human names? may be if it is not an Ethiopian name!
difficult to use in a large system since names must be
centrally controlled to avoid duplication
moreover, it does not contain any information on how to
locate the access point of its associated entity
how are flat names resolved (or how to locate an entity when
a flat name is given)
 name resolution: mapping a name to an address or an
address to a name is called name-address resolution
 possible solutions: simple solutions, home-based
approaches, and hierarchical approaches
10
1. Simple Solutions
 two solutions (for LANs only): Broadcasting and
Multicasting, and Forwarding Pointers
a. Broadcasting and Multicasting
 broadcast a message containing the identifier of an entity;
only machines that can offer an access point for the entity
send a reply
 e.g., ARP (Address Resolution Protocol) in the Internet to find
the data link address (MAC address) of a machine
 a computer that wants to access another computer for
which it knows its IP address broadcasts this address
 the owner responds by sending its Ethernet address
 broadcasting is inefficient when the network grows (wastage
of bandwidth and too much interruption to other machines)
 multicasting is better when the network grows - send only to
a restricted group of hosts 11
12
 multicasting can also be used to locate the nearest replica
- choose the one whose reply comes in first
b. Forwarding Pointers
 how to look for mobile entities
 when an entity moves from A to B, it leaves behind a
reference to its new location
 advantage
 simple: as soon as the first name is located using
traditional naming service, the chain of forwarding
pointers can be used to find the current address
 drawbacks
 the chain can be too long - locating becomes expensive
 all the intermediary locations in a chain have to maintain
their pointers
 vulnerability if links are broken
 hence, making sure that chains are short and that
forwarding pointers are robust is an important issue
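Resolution along a chain of forwarding pointers can be sketched in a few lines of Python; the dictionary of pointers is an illustrative stand-in for per-location forwarding state, and a real system would also shorten long chains once the current address is found.

```python
def resolve(start, forwarding):
    """Follow forwarding pointers from `start` until a location with no
    outgoing pointer is reached; return (current location, hops taken)."""
    loc, hops = start, 0
    while loc in forwarding:
        loc = forwarding[loc]   # each old location points to the next one
        hops += 1
    return loc, hops
```

The hop count makes the drawback visible: the longer the chain, the more expensive locating the entity becomes.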
2. HOME-BASED APPROACHES
 broadcasting and multicasting have scalability problems;
performance and broken links are problems in forwarding
pointers
 a home location keeps track of the current location of an
entity; often it is the place where an entity was created
 it is a two-tiered approach
 an example where it is used is Mobile IP
 each mobile host uses a fixed IP address
 all communication to that IP address is initially directly
sent to the host’s home agent located on the LAN
corresponding to the network address contained in the
mobile host’s IP address
 whenever the mobile host moves to another network, it
requests a temporary address in the new network
(called care-of-address) and informs the new address to
the home agent
13
 when the home agent receives a message for the mobile host
(from a correspondent agent) it forwards it to its new address
(if it has moved) and also informs the sender the host’s
current location for sending other packets
home-based approach: the principle of Mobile IP
 problems:
 creates communication latency (Triangle routing:
correspondent-home network-mobile)
 the home location must always exist; the host is
unreachable if the home does no more exist (permanently
changed); the solution is to register the home at a traditional
name service and let a client first look up the location of the
home
15
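a minimal sketch of the home-based idea, assuming a simple in-memory home agent (all names and addresses are invented):

```python
# Sketch of the home-based (Mobile IP) approach: the home agent keeps
# the mobile host's care-of address and forwards messages to it.

class HomeAgent:
    def __init__(self, home_address):
        self.home_address = home_address
        self.care_of = None                  # set when the host moves

    def register(self, care_of_address):
        """Mobile host informs its home agent of its new location."""
        self.care_of = care_of_address

    def forward(self, message):
        """Forward a message to the current location; also report that
        location to the sender so later packets can go directly
        (mitigating triangle routing)."""
        target = self.care_of or self.home_address
        return {"delivered_to": target, "current_location": target}

agent = HomeAgent("130.37.0.5")
agent.register("192.168.7.9")   # host moved to a foreign network
```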
B. Structured Naming
flat names are not convenient for humans
Name Spaces
 names are organized into a name space
 each name is made of several parts; the first may define
the nature of the organization, the second the name, the
third departments, ...
 the authority to assign and control the name spaces can be
decentralized where a central authority assigns only the
first two parts
 a name space is generally organized as a labeled, directed
graph with two types of nodes
 leaf node: represents the named entity and stores
information such as its address or the state of that entity
 directory node: a special entity that has a number of
outgoing edges, each labeled with a name
each node in a naming graph is considered as another entity
with an identifier
16
17
a general naming graph with a single root node, n0
 a directory node stores a table in which an outgoing edge is
represented as a pair (edge label, node identifier), called a
directory table
 each path in a naming graph can be referred to by the
sequence of labels corresponding to the edges of the path
and the first node in the path, such as
N:<label-1, label-2, ..., label-n>, where N refers to the first
node in the path
 such a sequence is called a path name
 if the first node is the root of the naming graph, it is called an
absolute path name; otherwise it is a relative path name
 instead of the path name n0:<home, steen, mbox>, we often use
its string representation /home/steen/mbox
 there may also be several paths leading to the same node, e.g.,
node n5 can be represented as /keys or
/home/steen/keys
 although the above naming graph is a directed acyclic graph (a
node can have more than one incoming edge but is not
permitted to have a cycle), the common way is to use a tree
(hierarchical) with a single root (as is used in file systems)
 in a tree structure, each node except the root has exactly one
incoming edge; the root has no incoming edges
 each node also has exactly one associated (absolute) path name
18
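the directory-table resolution described above can be sketched as follows; the tables loosely mirror the /home/steen example, with assumed contents:

```python
# Sketch of name resolution in a naming graph. Each directory node has
# a directory table mapping an edge label to the next node's identifier.

graph = {
    "n0": {"home": "n1", "keys": "n5"},      # root directory node
    "n1": {"steen": "n2"},
    "n2": {"mbox": "n3", "keys": "n5"},
}

def resolve(start, path):
    """Resolve the path name start:<label-1, ..., label-n> to a node id,
    performing one directory-table lookup per edge."""
    node = start
    for label in path:
        node = graph[node][label]
    return node
```

note that both /keys and /home/steen/keys resolve to the same node n5, i.e., a hard link.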
 e.g., file naming in UNIX file system
 a directory node represents a directory and a leaf node
represents a file
 there is a single root directory, represented in the naming
graph by the root node
 we have a contiguous series of blocks from a logical disk
 the boot block is used to load the operating system
 the superblock contains information on the entire file
system such as its size, etc.
 inodes are referred to by an index number, starting at
number zero, which is for the inode representing the root
directory
 given the index number of an inode, it is possible to access
its associated file
19
 Name Resolution
 given a path name, the process of looking up a name
stored in the node is referred to as name resolution; it
consists of finding the address when the name is given (by
following the path)
 knowing how and where to start name resolution is referred
to as closure mechanism; e.g., UNIX file system
 Linking and Mounting
 Linking: giving another name for the same entity (an alias)
e.g., environment variables in UNIX such as HOME that
refer to the home directory of a user
 two types of links (or two ways to implement an alias):
hard link and symbolic link
 hard link: to allow multiple absolute path names to
refer to the same node in a naming graph
e.g., in the previous graph, there are two different path names
for node n5: /keys and /home/steen/keys 20
 symbolic link: representing an entity by a leaf node and
instead of storing the address or state of the entity, the
node stores an absolute path name
21
the concept of a symbolic link explained in a naming graph
when first resolving an absolute path name stored in a
node (e.g., /home/steen/keys in node n6), name
resolution will return the path name stored in the node
(/keys), at which point it can continue with resolving that
new path name, i.e., closure mechanism
 so far name resolution was discussed as taking place
within a single name space
 name resolution can also be used to merge different name
spaces in a transparent way
 the solution is to use mounting
 as an example, consider a mounted file system, which
can be generalized to other name spaces as well
 let a directory node store the directory node from a
different (foreign) name space
 the directory node storing the node identifier is called a
mount point
 the directory node in the foreign name space is called a
mounting point, normally the root of a name space
 during name resolution, the mounting point is looked up
and resolution proceeds by accessing its directory table
 consider a collection of name spaces distributed across
different machines (each name space implemented by a
different server)
 to mount a foreign name space in a distributed system, the
following are at least required
 the name of an access protocol (for communication)
 the name of the server
 the name of the mounting point in the foreign name space
 each of these names needs to be resolved
 to the implementation of the protocol so that
communication can take place properly
 to an address where the server can be reached
 to a node identifier in the foreign name space (to be
resolved by the server of the foreign name space)
 the three names can be listed as a URL
 example: Sun’s Network File System (NFS) is a distributed file
system with a protocol that describes how a client can
access a file stored on a (remote) NFS file server
 an NFS URL may look like nfs://flits.cs.vu.nl/home/steen
- nfs is an implementation of a protocol
- flits.cs.vu.nl is a server name to be resolved using DNS
- /home/steen is resolved by the foreign server
 e.g., the subdirectory /remote includes mount points for
foreign name spaces on the client machine
 a directory node named /remote/vu is used to store
nfs://flits.cs.vu.nl/home/steen
 consider /remote/vu/mbox
 this name is resolved by starting at the root directory on the
client’s machine until node /remote/vu, which returns the
URL nfs://flits.cs.vu.nl/home/steen
 this leads the client machine to contact flits.cs.vu.nl using
the NFS protocol
 then the file mbox is read in the directory /home/steen
mounting remote name spaces through a specific process protocol
MOUNT POINT
MOUNTING POINT
 distributed systems that allow mounting a remote file system
also allow a user to execute commands on it
 example commands to access the mounted file system
cd /remote/vu   # change to the directory on the remote machine
ls -l           # list the files on the remote machine
 the user need not worry about the details of the actual
access; the name space on the local machine and that on the
remote machine appear to form a single name space
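the mount-point step can be sketched as follows: resolution proceeds locally until a node storing a URL is reached, and the rest of the path is handed to the foreign server (the table contents and return format are assumptions):

```python
# Sketch: resolving /remote/vu/mbox. When resolution reaches a mount
# point (a directory node storing a URL), the remaining path labels
# must be resolved by the foreign server named in that URL.

mount_points = {"/remote/vu": "nfs://flits.cs.vu.nl/home/steen"}

def resolve(path):
    """Return ('foreign', url, rest) if a mount point is crossed,
    otherwise ('local', path)."""
    parts = [p for p in path.split("/") if p]
    current = ""
    for i, part in enumerate(parts):
        current += "/" + part
        if current in mount_points:
            # hand the remaining labels over to the foreign name server
            return ("foreign", mount_points[current],
                    "/" + "/".join(parts[i + 1:]))
    return ("local", path)
```

e.g., resolving /remote/vu/mbox stops at the mount point and returns the NFS URL together with the remainder /mbox.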
 The Implementation of a Name Space
 a name space forms the heart of a naming service
 a naming service allows users and processes to add,
remove, and lookup names
 a naming service is implemented by name servers
 for a distributed system on a single LAN, a single server
might suffice; for a large-scale distributed system the
implementation of a name space is distributed over multiple
name servers
 Name Space Distribution
 in large scale distributed systems, it is necessary to
distribute the name service over multiple name servers,
usually organized hierarchically
 a name service can be partitioned into logical layers
 the following three layers can be distinguished (according to
Cheriton and Mann)
 global layer
 formed by highest level nodes (root node and nodes close
to it or its children)
 nodes on this layer are characterized by their stability, i.e.,
directory tables are rarely changed
 they may represent organizations, groups of
organizations, ..., where names are stored in the name
space
 administrational layer
 groups of entities that belong to the same organization or
administrational unit, e.g., departments
 relatively stable
 managerial layer
 nodes that may change regularly, e.g., nodes representing
hosts of a LAN, shared files such as libraries or binaries,
…
 nodes are managed not only by system administrators,
but also by end users
an example partitioning of the DNS name space, including Internet-
accessible files, into three layers
 the name space is divided into nonoverlapping parts, called
zones in DNS
 a zone is a part of the name space that is implemented by a
separate name server
 some requirements of servers at different layers: performance
(responsiveness to lookups), availability (failure rate), etc.
 high availability is critical for the global layer, since name
resolution cannot proceed beyond the failing server; it is
also important at the administrational layer for clients in the
same organization
 performance is very important in the lowest layer, since
results of lookups can be cached and used due to the
relative stability of the higher layers
 they may be enhanced by client side caching (for global and
administrational layers since names do not change often)
and replication; they create implementation problems since
they may introduce inconsistency (see Chapter 7)
a comparison between name servers for implementing nodes from a large-
scale name space partitioned into a global layer, an administrational
layer, and a managerial layer
Item | Global | Administrational | Managerial
Geographical scale of network | Worldwide | Organization | Department
Total number of nodes | Few | Many | Vast numbers
Responsiveness to lookups | Seconds | Milliseconds | Immediate
Update propagation | Lazy | Immediate | Immediate
Availability requirement | Very high | High | Low
Number of replicas | Many | None or few | None
Is client-side caching applied? | Yes | Yes | Sometimes
 Implementation of Name Resolution
 recall that name resolution consists of finding the address
when the name is given
 assume that name servers are not replicated and that no
client-side caches are allowed
 each client has access to a local name resolver, responsible
for ensuring that the name resolution process is carried out
 e.g., assume the path name
root:<nl, vu, cs, ftp, pub, globe, index.txt> is to be resolved
or using a URL notation, this path name would correspond to
ftp://ftp.cs.vu.nl/pub/globe/index.txt
 a host that needs to map a name to an address calls a DNS
client named a resolver (and provides it the name to be
resolved - ftp.cs.vu.nl)
 the resolver accesses the closest DNS server with a mapping
request
 if the server has the information it satisfies the resolver;
otherwise, it either refers the resolver to other servers (called
Iterative Resolution) or asks other servers to provide it with
the information (called Recursive Resolution)
 Iterative Resolution
 a name resolver hands over the complete name to the root
name server
 the root name server will resolve the name as far as it can
and return the result to the client; at the minimum it can
resolve the first level and sends the name of the first level
name server to the client
 the client calls the first level name server, then the second,
..., until it finds the address of the entity
the principle of iterative name resolution
 Recursive Resolution
 a name resolver hands over the whole name to the root name
server
 the root name server will try to resolve the name and if it
can’t, it requests the first level name server to resolve it and
to return the address
 the first level will do the same thing recursively
the principle of recursive name resolution
 Advantages and drawbacks
 recursive name resolution puts a higher performance
demand on each name server; hence name servers in the
global layer support only iterative name resolution
 caching is more effective with recursive name resolution
 each name server gradually learns the address of each
name server responsible for implementing lower-level
nodes
 eventually lookup operations can be handled efficiently
recursive name resolution of <nl, vu, cs, ftp>; name servers cache
intermediate results for subsequent lookups
Server for node | Should resolve | Looks up | Passes to child | Receives and caches | Returns to requester
cs | <ftp> | #<ftp> | -- | -- | #<ftp>
vu | <cs,ftp> | #<cs> | <ftp> | #<ftp> | #<cs>, #<cs,ftp>
nl | <vu,cs,ftp> | #<vu> | <cs,ftp> | #<cs>, #<cs,ftp> | #<vu>, #<vu,cs>, #<vu,cs,ftp>
root | <nl,vu,cs,ftp> | #<nl> | <vu,cs,ftp> | #<vu>, #<vu,cs>, #<vu,cs,ftp> | #<nl>, #<nl,vu>, #<nl,vu,cs>, #<nl,vu,cs,ftp>
the comparison between recursive and iterative name resolution with
respect to communication costs; assume the client is in Ethiopia and
the name servers in the Netherlands
 communication costs may be reduced in recursive name
resolution
 Summary
Method | Advantages
Recursive | Less communication cost; caching is more effective
Iterative | Less performance demand on name servers
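both strategies can be sketched with a toy server hierarchy for <nl, vu, cs, ftp> (server names and the final address are invented):

```python
# Toy sketch contrasting iterative and recursive name resolution.
# Each "server" resolves one label by looking up its directory table.

servers = {
    "root": {"nl": "nl-server"},
    "nl-server": {"vu": "vu-server"},
    "vu-server": {"cs": "cs-server"},
    "cs-server": {"ftp": "#ftp-addr"},
}

def iterative(name):
    """The client contacts each server in turn; every referral is one
    more client-side message."""
    server, hops = "root", 0
    for label in name:
        server = servers[server][label]      # server returns a referral
        hops += 1
    return server, hops

def recursive(name, server="root"):
    """Each server asks the next one on the client's behalf."""
    if not name:
        return server
    return recursive(name[1:], servers[server][name[0]])
```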
 Example - The Domain Name System (DNS)
 one of the largest distributed naming services is the
Internet DNS
 it is used for looking up host addresses and mail servers
 hierarchical, defined in an inverted tree structure with the
root at the top
 the tree can have at most 128 levels
 Label
 each node has a label, a string with a
maximum of 63 characters (case
insensitive)
 the root label is null (has no label)
 children of a node must have
different names (to guarantee
uniqueness)
 Domain Name
 each node has a domain name; it is
a path name to its root node
 a full domain name is a sequence of
labels separated by dots (the last
character is a dot)
 domain names are read from the
node up to the root
 full path names must not exceed
255 characters
 the contents of a node is formed by a collection of resource
records; the important ones are the following
Type of record | Associated entity | Description
SOA (start of authority) | Zone | Holds information on the represented zone, such as an e-mail address of the system administrator
A (address) | Host | Contains an IP address of the host this node represents
MX (mail exchange) | Domain | Refers to a mail server to handle mail addressed to this node; it is a symbolic link; e.g., name of a mail server
SRV | Domain | Refers to a server handling a specific service
NS (name server) | Zone | Refers to a name server that implements the represented zone
CNAME | Node | Contains the canonical name of a host; an alias
PTR (pointer) | Host | Symbolic link with the primary name of the represented node; for mapping an IP address to a name
HINFO (host info) | Host | Holds information on the host this node represents, such as machine type and OS
TXT | Any kind | Contains any entity-specific information considered useful; cannot be automatically processed
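in practice an A-record style lookup goes through the host's resolver (the closure mechanism); a minimal sketch using Python's standard library, resolving localhost so no network is needed:

```python
import socket

def lookup_a_record(hostname):
    """Ask the system's resolver (the local DNS client) for an IPv4
    address -- the equivalent of querying an A record."""
    return socket.gethostbyname(hostname)

# 'localhost' is resolvable locally, without contacting a name server
address = lookup_a_record("localhost")
```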
 cs.vu.nl represents the domain as well as the zone; it has 4
name servers (ns, star, top, solo) and 3 mail servers
 name server for this zone with 2 network addresses (star)
 mail servers; the numbers preceding the name show
priorities; first the one with the lowest number is tried
an excerpt from the DNS database for the zone cs.vu.nl
an excerpt from the DNS database for the zone cs.vu.nl, cont’d
 a Web server and an FTP server, implemented by a single
machine (soling)
 older server clusters (vucs-das1)
 two printers (inkt and pen) with a local address; i.e., they
cannot be accessed from outside
part of the description for the vu.nl domain which contains the cs.vu.nl domain
 cs.vu.nl is implemented as a single zone
 hence, the records in the previous slides do not include
references to other zones
 nodes in a subdomain that are implemented in a different
zone are specified by giving the domain name and IP
address
C. Attribute-Based Naming
flat naming: provides a unique and location-independent way
of referring to entities
structured naming: also provides a unique and location-
independent way of referring to entities as well as human-
friendly names
but both do not allow searching entities by giving a
description of an entity
in attribute-based naming, each entity is assumed to have a
collection of attributes that say something about the entity
then a user can search an entity by specifying (attribute, value)
pairs known as attribute-based naming
Directory Services
 attribute-based naming systems are also called directory
services whereas systems that support structured naming
are called naming systems
 how are resources described? one possibility is to use RDF
(Resource Description Framework) that uses triplets
consisting of a subject, a predicate, and an object
 e.g., (person, name, Alice) to describe a resource Person
whose Name is Alice
 or in e-mail systems, we can use sender, recipient, subject,
etc. for searching
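a minimal sketch of searching by (attribute, value) pairs, with an invented directory:

```python
# Sketch of attribute-based lookup: each entity carries (attribute,
# value) pairs, and a search returns every entity matching all the
# given pairs. The directory contents are invented for illustration.

directory = [
    {"type": "host", "name": "star", "os": "linux"},
    {"type": "host", "name": "zephyr", "os": "solaris"},
    {"type": "printer", "name": "inkt", "location": "room-1"},
]

def search(**attrs):
    """Return entities whose attributes match every given pair."""
    return [e for e in directory
            if all(e.get(k) == v for k, v in attrs.items())]
```

e.g., `search(type="host", os="linux")` describes the entity instead of naming it.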
 Hierarchical Implementations: LDAP
 distributed directory services are implemented by combining
structured naming with attribute-based naming
 e.g., Microsoft’s Active Directory service
 such systems rely on the Lightweight Directory Access
Protocol, or LDAP, which is derived from OSI’s X.500
directory service
 an LDAP directory service consists of a number of records
called directory entries, each made up of (attribute, value)
pairs, similar to a resource record in DNS; attributes could be
single- or multiple-valued (e.g., Mail_Servers on next slide)
a simple example of an LDAP directory entry using LDAP naming
conventions to identify the network addresses of some servers
Attribute | Abbr. | Value
Country | C | NL
Locality | L | Amsterdam
Organization | O | Vrije Universiteit
OrganizationalUnit | OU | Comp. Sc.
CommonName | CN | Main server
Mail_Servers | -- | 137.37.20.3, 130.37.24.6, 137.37.20.10
FTP_Server | -- | 130.37.20.20
WWW_Server | -- | 130.37.20.20
 the collection of all directory entries is called a Directory
Information Base (DIB)
 each record is uniquely named so that it can be looked up
 each naming attribute is called a Relative Distinguished
Name (RDN); the first 5 entries above
 a globally unique name is formed using abbreviations of
naming attributes, e.g.,
/C=NL/O=Vrije Universiteit/OU=Comp. Sc.
 this is similar to the DNS name nl.vu.cs
 listing RDNs in sequence leads to a hierarchy of the
collection of directory entries, called a Directory
Information Tree (DIT)
 a DIT forms the naming graph of an LDAP directory service
where each node represents a directory entry
part of the directory information tree
 node N corresponds to the directory entry shown earlier; it
also acts as a parent of other directory entries that have an
additional attribute, Host_Name; such entries may be used
to represent hosts
Attribute | Value
Country | NL
Locality | Amsterdam
Organization | Vrije Universiteit
OrganizationalUnit | Comp. Sc.
CommonName | Main server
Host_Name | star
Host_Address | 192.31.231.42

Attribute | Value
Country | NL
Locality | Amsterdam
Organization | Vrije Universiteit
OrganizationalUnit | Comp. Sc.
CommonName | Main server
Host_Name | zephyr
Host_Address | 137.37.20.10
two directory entries having Host_Name as RDN
Reading Assignment: case study of global Name services,
Distributed Shared Memory
THANK YOU FOR YOUR ATTENTION
51
FUNDAMENTALS OF DISTRIBUTED SYSTEM
Chapter Five
Synchronization
1
OUTLINE
 Clock synchronization, physical clocks and clock synchronization algorithms
 Logical clocks and time stamps
 Global state
 Distributed transactions and concurrency control
 Election algorithms
 Mutual exclusion and various algorithms to achieve mutual Exclusion
2
INTRODUCTION
 synchronization is the coordination of actions between processes
 in a distributed (asynchronous) system, processes run and events
occur independently of one another
 cooperation is partly supported by naming; it allows processes to at
least share resources (entities)
 synchronization deals with how to ensure that processes do not
simultaneously access a shared resource
 and with how events can be ordered, such as two processes sending
messages to each other
We will study:
 Synchronization based on “Actual Time”.
 Synchronization based on “Relative Time”.
 Synchronization based on Co-ordination (with Election Algorithms).
 Mutual Exclusion
3
 Essentials of Synchronization
4
 Issues of Synchronization
5
 Physical Clock
6
Clock Synchronization
 in centralized systems, synchronization is simple: there is one
processor and one clock (with shared memory), so event ordering is
clean because all events are timed by the same clock
 achieving agreement on time in distributed systems is difficult,
since each system has its own clock
 e.g., consider the make program on a UNIX machine; it compiles
only source files whose last update is later than that of the
existing object file; with unsynchronized clocks, a newly edited
source file may wrongly appear older than its object file
7
Physical Clocks:
clocks whose values must not deviate from the real time by more than a certain amount.
 a clock is an electronic device that counts the oscillations of a crystal
at a particular frequency; the count is stored in a computer register
 physical clocks can be used to timestamp events on that computer,
e.g., event E1 at time t1 and event E2 at time t2
 many applications are interested only in the order of events, not the exact
time of day at which they occurred
 the instantaneous difference between the readings of two computers' clocks
is the skew; the rate at which a clock diverges from perfect time is its drift
 several methods attempt to synchronize physical clocks in a distributed
system; among them:
Cristian's Algorithm
Berkeley Algorithm
Network Time Protocol (NTP)
8
9
 Clock Synchronization Algorithms
13
 Cristian's Algorithm
 the client process sends a request at time T0, receives the server's time
T_server in the response at time T1, and calculates the new synchronized
client clock time as
T_client = T_server + (T1 − T0)/2
 example:
request sent at 5:08:15:100 (T0)
response received at 5:08:15:900 (T1)
response contains 5:09:25:300 (T_server)
 the elapsed time T1 − T0 = 800 ms is the round-trip time, so the one-way
delay is estimated as 400 ms
T_client = T_server + (T1 − T0)/2 = 5:09:25:300 + 400 ms = 5:09:25:700
14
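the calculation above can be checked with a short sketch (times converted to milliseconds):

```python
# Cristian's algorithm as computed in the example above: the client
# sets its clock to the server's time plus half the round-trip time.

def cristian(t0_ms, t1_ms, t_server_ms):
    """All times in milliseconds since midnight."""
    round_trip = t1_ms - t0_ms
    return t_server_ms + round_trip // 2

def hms_ms(h, m, s, ms):
    """Convert an h:m:s:ms timestamp to milliseconds."""
    return ((h * 60 + m) * 60 + s) * 1000 + ms

t0 = hms_ms(5, 8, 15, 100)        # request sent
t1 = hms_ms(5, 8, 15, 900)        # response received
t_server = hms_ms(5, 9, 25, 300)  # time carried in the response
t_client = cristian(t0, t1, t_server)   # expected: 5:09:25:700
```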
NETWORK TIME PROTOCOL (NTP)
• a protocol that allows computer clock times to be
synchronized across a network
18
MUTUAL EXCLUSION
32
ELECTION ALGORITHM
40
SUMMARY
• Overview
• Essentiality of Synchronization
• Issues in Synchronization
• Clock Synchronization
• Physical Clocks
• Clock Synchronization
• Logical Clocks
• Happens-Before Relationship
• Lamport's Logical Clocks
• Vector Clocks
• Mutual Exclusion
• A Centralized Algorithm
• A Decentralized Algorithm
• A Distributed Algorithm
• A Token Ring Algorithm
• Election Algorithm
• Introduction
• The Bully Algorithm
• A Ring Algorithm
• Superpeer Selection
46
THANK YOU FOR YOUR ATTENTION
47
FUNDAMENTALS OF DISTRIBUTED SYSTEM
Chapter Six
Fault Tolerance
1
 a major difference between distributed systems and single
machine systems is that with the former, partial failure is
possible, i.e., when one component in a distributed system
fails
 such a failure may affect some components while others
will continue to function properly
 an important goal of distributed systems design is to
construct a system that can automatically recover from
partial failure
 it should tolerate faults and continue to operate to some
extent
2
INTRODUCTION
 we will discuss
 fault tolerance, and making distributed systems fault
tolerant
 process resilience (techniques by which one or more
processes can fail without seriously disturbing the rest of
the system)
 reliable multicasting to keep processes synchronized (by
which message transmission to a collection of processes
is guaranteed to succeed)
 distributed commit protocols for ensuring atomicity in
distributed systems
 failure recovery by saving the state of a distributed
system (when and how)
3
OBJECTIVES OF THE CHAPTER
Introduction to Fault Tolerance
 Basic Concepts
 fault tolerance is strongly related to dependable
systems
 dependability covers the following
 availability
 refers to the probability that the system is
operating correctly at any given time; defined in
terms of an instant in time
 reliability
 a property that a system can run continuously
without failure; defined in terms of a time interval
 safety
 refers to the situation that even when a system
temporarily fails to operate correctly, nothing
catastrophic happens
 maintainability
 how easily a failed system can be repaired
4
 dependable systems are also required to provide a high
degree of security
 a system is said to fail when it cannot meet its promises; for
instance failing to provide its users one or more of the
services it promises
 an error is a part of a system’s state that may lead to a failure;
e.g., damaged packets in communication
 the cause of an error is called a fault
 building dependable systems closely relates to controlling
faults
 a distinction is made between preventing, removing, and
forecasting faults
 a fault tolerant system is a system that can provide its
services even in the presence of faults
5
 faults are classified into three
 transient
 occurs once and then disappears; if the operation is
repeated, the fault goes away; e.g., a bird flying through a
beam of a microwave transmitter may cause some lost
bits
 intermittent
 it occurs, then vanishes on its own accord, then
reappears, ...; e.g., a loose connection; difficult to
diagnose; take your sick child to the nearest clinic, but
the child does not show any sickness by the time you
reach there
 permanent
 one that continues to exist until the faulty component is
repaired; e.g., disk head crash, software bug
6
 Failure Types - 5 of them
 Crash failure (also called fail-stop failure): a server halts,
but was working correctly until it stopped; e.g., the OS
halts; reboot the system
 Omission failure: a server fails to respond to incoming
requests
 Receive omission: a server fails to receive incoming
messages; e.g., maybe no thread is listening
 Send omission: a server fails to send messages
 Timing failure: a server's response lies outside the
specified time interval; e.g., maybe it is too fast, flooding
the receiver, or too slow
 Response failure: the server's response is incorrect
 Value failure: the value of the response is wrong; e.g., a
search engine returning wrong Web pages as a result of
a search 7
8
 State transition failure: the server deviates from the correct
flow of control; e.g., taking default actions when it fails to
understand the request
 Arbitrary failure (or Byzantine failure): a server may produce
arbitrary responses at arbitrary times; the most serious
 Failure Masking by Redundancy
 to be fault tolerant, the system tries to hide the occurrence of
failures from other processes - masking
 the key technique for masking faults is redundancy
 three kinds are possible
 information redundancy; add extra bits to allow recovery
from garbled bits (error correction)
 time redundancy: an action is performed more than once if
needed; e.g., redo an aborted transaction; useful for
transient and intermittent faults
 physical redundancy: add (replicate) extra equipment
(hardware) or processes (software)
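physical redundancy with majority voting (as in triple modular redundancy) can be sketched as:

```python
# Sketch of failure masking by physical redundancy: run several
# replicas and take the majority of their results, so a minority of
# faulty replicas is masked.

from collections import Counter

def vote(results):
    """Return the most common value among replica results."""
    value, _count = Counter(results).most_common(1)[0]
    return value
```

e.g., with three replicas where one returns a wrong value, the two correct results outvote it.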
Process Resilience
 how can fault tolerance be achieved in distributed
systems
 one method is protection against process failures by
replicating processes into groups
 we discuss
 what are the general design issues of process groups
 what actually is a fault tolerant group
 how to reach agreement within a process group when
one or more of its members cannot be trusted to give
correct answers
9
 Design Issues
 the key approach to tolerating a faulty process is to
organize several identical processes into a group
 all members of a group receive a message hoping that if
one process fails, another one will take over
 the purpose is to allow processes to deal with collection of
processes as a single abstraction
 process groups may be dynamic
 new groups can be created and old groups can be
destroyed
 a process can join or leave a group
 a process can be a member of several groups at the
same time
 hence group management and membership mechanisms
are required
10
11
 the internal structure of a group may be flat or hierarchical
 flat: all processes are equal and decisions are made
collectively
 hierarchical: a coordinator and several workers; the
coordinator decides which worker is best suited to carry
a task
(a) communication in a flat group
(b) communication in a simple hierarchical group
 the flat group
 has no single point of failure
 but decision making is more complicated (voting may be
required for decision making which may create a delay and
overhead)
 the hierarchical group has the opposite properties
 group membership may be handled
 through a group server where all requests (joining, leaving,
...) are sent; it has a single point of failure
 in a distributed way where membership is multicasted (if a
reliable multicasting mechanism is available)
 but what if a member crashes; other members have to
find out this by noticing that it no more responds
12
 Failure Masking and Replication
 how to replicate processes so that they can form groups and
failures can be masked?
 there are two ways for such replication:
 primary-based replication
 for fault tolerance, primary-backup protocol is used
 organize processes hierarchically and let the primary
(i.e., the coordinator) coordinate all write operations
 if the primary crashes, the backups hold an election
 replicated-write protocols
 in the form of active replication or by means of quorum-
based protocols
 that means, processes are organized as flat groups
13
 another important issue is how much replication is needed
 for simplicity consider only replicated-write systems
 a system is said to be k fault tolerant if it can survive faults
in k components and still meets its specifications
 if the processes fail silently, then having k+1 replicas is
enough; if k of them fail, the remaining one can function
 if processes exhibit Byzantine failures, 2k+1 replicas are
required: even if the k faulty processes generate the same
(wrong) reply, the remaining k+1 correct processes also
produce the same (correct) answer, and the majority can be
believed
14
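the replica counts above can be captured in a one-line helper (a sketch, not a library function):

```python
# Replicas needed for a k fault tolerant replicated-write system,
# as described above.

def replicas_needed(k, byzantine=False):
    """k+1 suffices for fail-silent faults; Byzantine faults need
    2k+1 so the k+1 correct replies outvote the k faulty ones."""
    return 2 * k + 1 if byzantine else k + 1
```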
Distributed Commit
 atomic multicasting is an example of the more generalized
problem known as distributed commit
 in atomic multicasting, the operation is delivery of a
message
 but the distributed commit problem involves having an(y)
operation being performed by each member of a process
group, or none at all
 there are three protocols: one-phase commit, two-phase
commit, and three-phase commit
 One-Phase Commit Protocol
 a coordinator tells all other processes, called
participants, whether or not to (locally) perform an
operation
 drawback: if one of the participants cannot perform the
operation, there is no way to tell the coordinator; for
example due to violation of concurrency control
constraints in distributed transactions
15
 Two-Phase Commit Protocol (2PC)
 it has two phases: voting phase and decision phase, each
involving two steps
 voting phase
 the coordinator sends a VOTE_REQUEST message to all
participants
 each participant then sends a VOTE_COMMIT or
VOTE_ABORT message depending on its local situation
 decision phase
 the coordinator collects all votes; if all vote to commit the
transaction, it sends a GLOBAL_COMMIT message; if at
least one participant sends VOTE_ABORT, it sends a
GLOBAL_ABORT message
 each participant that voted for a commit waits for the
final reaction of the coordinator and commits or aborts
16
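The voting and decision phases above can be condensed into a small in-process sketch (an illustration only; real 2PC exchanges VOTE_REQUEST / VOTE_COMMIT / VOTE_ABORT messages over the network and logs each step to persistent storage):

```python
def two_phase_commit(vote_fns):
    """vote_fns: one callable per participant, returning True for
    VOTE_COMMIT or False for VOTE_ABORT when asked to vote."""
    # Phase 1 (voting): the coordinator sends VOTE_REQUEST to every
    # participant and collects their votes.
    votes = [vote() for vote in vote_fns]
    # Phase 2 (decision): GLOBAL_COMMIT only if every participant
    # voted to commit; one VOTE_ABORT forces GLOBAL_ABORT.
    return "GLOBAL_COMMIT" if all(votes) else "GLOBAL_ABORT"
```

A single abort vote is enough to abort the whole transaction, which is exactly the all-or-nothing property distributed commit demands.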
a) the finite state machine for the coordinator in 2PC
b) the finite state machine for a participant
17
 problems may occur in the event of failures
 the coordinator and participants have states in which they
block waiting for messages: INIT, READY, WAIT
 when a process crashes, other processes may wait
indefinitely
 hence, timeout mechanisms are required
 a participant waiting in its INIT state for VOTE_REQUEST
from the coordinator aborts and sends VOTE_ABORT if it
does not receive a vote request after some time
 the coordinator blocking in state WAIT aborts and sends
GLOBAL_ABORT if all votes have not been collected on time
 a participant P blocked in its READY state waiting for the
global vote cannot simply abort; instead it must find out
which message the coordinator actually sent
 by blocking until the coordinator recovers
 or requesting another participant, say Q
18
actions taken by a participant P when residing in state READY and having
contacted another participant Q
19
State of Q: action by P (comment)
 COMMIT: make the transition to COMMIT (the coordinator sent GLOBAL_COMMIT before crashing, but P did not receive it)
 ABORT: make the transition to ABORT (the coordinator sent GLOBAL_ABORT before crashing, but P did not receive it)
 INIT: make the transition to ABORT (the coordinator sent VOTE_REQUEST before crashing; P received it but Q did not)
 READY: contact another participant (if all are in state READY, wait until the coordinator recovers)
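The decision table for a participant blocked in READY maps directly to a small function (state names as in the slides):

```python
def action_in_ready(q_state):
    """Action taken by participant P (blocked in READY) after
    learning the state of another participant Q."""
    if q_state == "COMMIT":
        return "transition to COMMIT"   # coordinator had sent GLOBAL_COMMIT
    if q_state == "ABORT":
        return "transition to ABORT"    # coordinator had sent GLOBAL_ABORT
    if q_state == "INIT":
        return "transition to ABORT"    # Q never saw VOTE_REQUEST; abort is safe
    # Q is also READY: no one knows the decision yet
    return "contact another participant"
```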
 a process (participant or coordinator) can recover from crash
if its state has been saved to persistent storage
 actions by a participant after recovery, depending on its state before the crash:
 INIT: locally abort the transaction and inform the coordinator
 COMMIT or ABORT: retransmit its decision to the coordinator
 READY: cannot decide on its own what it should do next; has to contact the other participants
20
 there are two critical states for the coordinator; its action after recovery depends on its state before the crash:
 WAIT: retransmit the VOTE_REQUEST message
 after the decision in the 2nd phase: retransmit the decision
Recovery
 fundamental to fault tolerance is recovery from an error
 recall: an error is the part of a system's state that may lead
to a failure
 error recovery means to replace an erroneous state with an
error-free state
 two forms of error recovery: backward recovery and
forward recovery
 Backward Recovery
 bring the system from its present erroneous state back
into a previously correct state
 for this, the system’s state must be recorded from time to
time; each time a state is recorded, a checkpoint is said
to be made
 e.g., retransmitting lost or damaged packets in the
implementation of reliable communication
21
 most widely used, since it is a generally applicable method
and can be integrated into the middleware layer of a
distributed system
 disadvantages:
 checkpointing and restoring a process to its previous state
are costly and performance bottlenecks
 no guarantee can be given that the error will not recur,
which may take an application into a loop of recovery
 some actions may be irreversible; e.g., deleting a file,
handing over cash to a customer
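A minimal illustration of checkpoint-based backward recovery (names are my own; a real system writes checkpoints to stable storage, not to memory):

```python
import copy

class CheckpointedProcess:
    """Record the state from time to time (a checkpoint) so an
    erroneous state can be replaced by the last correct one."""

    def __init__(self, state):
        self.state = state
        self._checkpoint = copy.deepcopy(state)

    def checkpoint(self):
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        self.state = copy.deepcopy(self._checkpoint)

p = CheckpointedProcess({"balance": 100})
p.checkpoint()
p.state["balance"] = -999   # an erroneous update
p.rollback()                # backward recovery to the checkpoint
```

Note the deep copies: the cost of copying whole states on every checkpoint is exactly the performance drawback the slide mentions.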
 Forward Recovery
 bring the system from its present erroneous state to a
correct new state from which it can continue to execute
 it has to be known in advance which errors may occur so
as to correct those errors
 e.g., erasure correction (or simply error correction) where a
lost or damaged packet is constructed from other
successfully delivered packets
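The erasure-correction example can be made concrete with XOR parity, the simplest such scheme: transmit one extra parity packet, and any single lost packet of the group can be rebuilt from the survivors (a sketch, not a production codec):

```python
def parity_packet(packets):
    """XOR of equal-length packets; doubles as the reconstruction
    of one missing packet from the survivors plus the parity."""
    out = bytes(len(packets[0]))
    for p in packets:
        out = bytes(a ^ b for a, b in zip(out, p))
    return out

data = [b"abcd", b"efgh", b"ijkl"]
parity = parity_packet(data)                      # sent alongside the data
# suppose data[1] is lost in transit: rebuild it without retransmission
rebuilt = parity_packet([data[0], data[2], parity])
```

This is forward recovery: the receiver moves to a correct state on its own, which only works because the error class (one lost packet) was anticipated in advance.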
THANK YOU FOR YOUR ATTENTION
23
Distributed systems and its power point
  • 6. DISTRIBUTED FILE SYSTEM REQUIREMENTS  Transparency (access, location, mobility, performance, scaling)  Concurrent file updates  File replication  File heterogeneity  Fault tolerance  Consistency  Security  Efficiency 6
  • 7. FILE SERVICE ARCHITECTURE  An architecture that offers a clear separation of the main concerns  Access to files is obtained by structuring the file service  The file service structure has three components A. a flat file service B. a directory service C. a client module 7
  • 9. FLAT FILE SERVICE  Concerned with implementing operations on the contents of files  Unique file identifiers (UFIDs) are used to refer to files in all requests  A UFID is a long sequence of bits  Each file has a UFID that is unique among all of the files 9
  • 10. DIRECTORY SERVICE  The directory service provides a mapping between text names for files and their UFIDs.  Clients may obtain the UFID of a file by quoting its text name to the directory service.  The directory service provides the functions needed to generate directories, to add new file names to directories and to obtain UFIDs from directories.  It is a client of the flat file service  Its directory files are stored in files of the flat file service.  When a hierarchic file-naming scheme is adopted, as in UNIX, directories hold references to other directories. 10
  • 11. CLIENT MODULE  A client module runs in each client computer, integrating and extending the operations of the flat file service  The directory service under a single application programming interface that is available to user-level programs in client computers.  For example, in UNIX hosts, a client module would be provided that emulates the full set of UNIX file operations, interpreting UNIX multi-part file names by iterative requests to the directory service.  The client module also holds information about the network locations of the flat file server and directory server processes.  Finally, the client module can play an important role in achieving satisfactory performance through the implementation of a cache of recently used file blocks at the client. 11
  • 12. FLAT FILE SERVICE INTERFACE  This is the RPC interface used by client modules. It is not normally used directly by user-level programs.  A FileId is invalid if the file that it refers to is not present in the server processing the request or if its access permissions are inappropriate for the operation requested.  All of the procedures in the interface except Create throw exceptions if the FileId argument contains an invalid UFID or the user doesn’t have sufficient access rights. These exceptions are omitted from the definition for clarity. 12
  • 13. FLAT FILE SERVICE OPERATIONS  Read(FileId, i, n) → Data: reads a sequence of up to n items from a file starting at item i and returns it in Data.  Write(FileId, i, Data): writes a sequence of Data to a file, starting at item i, extending the file if necessary.  Create() → FileId: creates a new file of length 0 and delivers a UFID for it.  Delete(FileId): removes the file from the file store.  GetAttributes(FileId) → Attr: returns the file attributes for the file.  SetAttributes(FileId, Attr): sets the file attributes.  GetAttributes and SetAttributes enable clients to access the attribute record.  GetAttributes is normally available to any client that is allowed to read the file.  Access to the SetAttributes operation would normally be restricted to the directory service that provides access to the file. 13
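As an illustration only, the flat file service operations above can be mocked as an in-memory Python class (UFIDs are plain integers here; the RPC transport, attribute records, and access checks are omitted):

```python
import itertools

class FlatFileService:
    """In-memory sketch of the flat file service interface."""

    def __init__(self):
        self._uid = itertools.count(1)
        self.files = {}            # UFID -> list of data items

    def create(self):
        ufid = next(self._uid)
        self.files[ufid] = []      # Create(): new file of length 0
        return ufid

    def write(self, ufid, i, data):
        f = self.files[ufid]
        f[i:i + len(data)] = data  # extends the file if necessary

    def read(self, ufid, i, n):
        return self.files[ufid][i:i + n]   # up to n items from item i

    def delete(self, ufid):
        del self.files[ufid]
```

In a real deployment each of these methods would be an RPC handled at the server, with FileId validity and access rights checked before the operation runs.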
  • 14. DIRECTORY SERVICE OPERATIONS  Lookup(Dir, Name) → FileId, throws NotFound: locates the text name in the directory and returns the relevant UFID; if Name is not in the directory, throws an exception.  AddName(Dir, Name, FileId), throws NameDuplicate: if Name is not in the directory, adds (Name, FileId) to the directory and updates the file’s attribute record; if Name is already in the directory, throws an exception.  UnName(Dir, Name), throws NotFound: if Name is in the directory, removes the entry containing Name from the directory; if Name is not in the directory, throws an exception.  GetNames(Dir, Pattern) → NameSeq: returns all the text names in the directory that match the regular expression Pattern. 14
  • 15. FILE ACCESSING MODELS  File access models are methods used for accessing remote files and the unit of data access.  A distributed file system may use one of the following models to service a client’s file access request when the accessed file is remote: 1. Remote service model  Processing of a client’s request is performed at the server’s node.  Thus, the client’s request for file access is delivered across the network as a message to the server, the server machine performs the access request, and the result is sent to the client.  This model needs to minimize the number of messages sent and the overhead per message. 15
  • 16. 2. Data-caching model  This model attempts to reduce the network traffic of the previous model by caching the data obtained from the server node.  This takes advantage of the locality found in file accesses.  This model gives increased performance and greater system scalability.  The unit of data transfer refers to the fraction of a file that is transferred to clients in a single read or write operation. 3. File-level transfer model  In this model, when file data is to be transferred, the entire file is moved.  This reduces server load and network traffic since it accesses the server only once. This has better scalability.  This model requires sufficient storage space on the client machine. FILE ACCESSING MODELS… 16
  • 17. 4. Block-level transfer model  File transfer takes place in file blocks.  A file block is a contiguous portion of a file and is of fixed length  This does not require client nodes to have large storage space  It eliminates the need to copy an entire file when only a small portion of the data is needed.  When an entire file is to be accessed, multiple server requests are needed, resulting in more network traffic and more network protocol overhead.  NFS uses block-level transfer model. FILE ACCESSING MODELS… 17
  • 18. 5. Byte-level transfer model  Unit of transfer is a byte.  Model provides maximum flexibility because it allows storage and retrieval of an arbitrary amount of a file, specified by an offset within a file and length.  The drawback is that cache management is harder due to the variable-length data for different access requests. 6.Record-level transfer model  This model is used with structured files and the unit of transfer is the record. FILE ACCESSING MODELS… 18
  • 19. FILE STRUCTURE ARCHITECTURE  Hierarchical file system  Consists of a number of directories arranged in a tree structure  File group  Is a collection of files that can be located on any server or moved between servers while maintaining the same names  A similar construct is used in the UNIX file system  Helps with distributing the load of file serving between several servers  File groups have identifiers which are unique throughout the system 19
  • 20. DFS  DFS has two main implementations  Sun Network File System  Andrew File System  NFS has a stateless server whereas AFS has a stateful server  AFS provides location independence (the physical location of a file can be changed without having to change the path of the file) as well as location transparency  NFS provides location transparency  AFS is more scalable 20
  • 21.  A stateless server does not keep information on the state of its clients and can change its own state without informing any client.  It does not retain any information on the state of its clients.  a stateful server maintains persistent information on its clients and requires explicit deletion of information by the server.  a stateful server retains persistent information on its clients. 21 DFS
  • 22.  This file system was developed by Sun Microsystems, hence the name Sun Network File System.  Sun’s Network File System (NFS) has been widely adopted in industry and in academic environments since its introduction in 1985.  NFS is a client-server application, so the user can view, store and update files on a remote computer.  In most cases, all the clients and servers are on the same LAN.  Each NFS server exports one or more of its directories for access by remote clients.  NFS provides transparent access to remote files for client programs running on UNIX and other systems.  NFS allows a user or system administrator to mount (designate as accessible) all or a portion of a file system on a server.  An important goal of NFS is to achieve a high level of support for hardware and operating system heterogeneity. CASE STUDY: SUN NETWORK FILE SYSTEM 22
  • 23. CASE STUDY: SUN NETWORK FILE SYSTEM…  NFS protocol (RFC 1813 ) is designed to be independent of the computer, OS, network architecture, and transport protocol.  NFS uses RPC to route requests between the client and server.  The NFS protocol was originally developed for use in networks of UNIX systems  The NFS server module resides in the kernel on each computer that acts as an NFS server.  Requests referring to files in a remote file system are translated by the client module to NFS protocol operations and then passed to the NFS server module at the computer holding the relevant file system. 23
  • 24. CASE STUDY: SUN NETWORK FILE SYSTEM… Sun network File Model 24
  • 25. 25 CASE STUDY: SUN NETWORK FILE SYSTEM…
  • 26. 26 It consists of three layers  System call layer: this handles system calls like OPEN, READ, and CLOSE.  Virtual File System: the task of the VFS layer is to maintain a table with one entry for each open file, analogous to the table of i-nodes for open files in UNIX. The VFS layer has an entry called a v-node (virtual i-node) for every open file telling whether the file is local or remote.  NFS client code: creates an r-node (remote i-node) in its internal tables to hold the file handles.  Each v-node in the VFS layer will ultimately contain either a pointer to an r-node in the NFS client code, or a pointer to an i-node in the local operating system.  Thus from the v-node it is possible to see whether a file or directory is local or remote and, if it is remote, to find its file handle CASE STUDY: SUN NETWORK FILE SYSTEM…
  • 27.  The file identifiers used in NFS are called file handles.  The virtual file system layer has one VFS structure for each mounted file system and one v-node per open file.  A VFS structure relates a remote file system to the local directory on which it is mounted.  The v-node contains an indicator to show whether a file is local or remote.  If the file is local, the v-node contains a reference to the index of the local file (an i-node in a UNIX implementation).  If the file is remote, it contains the file handle of the remote file.  Reading Assignment NFS operation CASE STUDY: SUN NETWORK FILE SYSTEM… 27
  • 28. Server caching  Caching in both the client and the server computer are indispensable features of NFS implementations in order to achieve adequate performance.  In conventional UNIX systems: file pages, directories and file attributes that have been read from disk are retained in a main memory buffer cache until the buffer space is required for other pages. CASE STUDY: SUN NETWORK FILE SYSTEM… 28
  • 29. Benefits of NFS: Among many benefits for organizations using NFS are the following:  Mature: NFS is a mature protocol, which means most aspects of implementing, securing and using it are well understood, as are its potential weaknesses.  Open: NFS is an open protocol, with its continued development documented in internet specifications as a free and open network protocol.  Cost-effective: NFS is a low-cost solution for network file sharing that is easy to set up because it uses the existing network infrastructure.  Centrally managed: NFS's centralized management decreases the need for added software and disk space on individual user systems.  User-friendly: The protocol is easy to use and enables users to access remote files on remote hosts in the same way they access local ones.  Distributed: NFS can be used as a distributed file system, reducing the need for removable media storage devices.  Secure: With NFS, there is less removeable media like CDs, DVDs, Blu-ray disks, diskettes and USB devices in circulation, making the system more secure. 29 CASE STUDY: SUN NETWORK FILE SYSTEM…
  • 30. Disadvantages of NFS: these include the following:  Dependence on RPCs makes NFS inherently insecure; it should only be used on a trusted network behind a firewall. Otherwise, NFS will be vulnerable to internet threats.  Some reviews of NFSv4 and NFSv4.1 suggest that these versions have limited bandwidth and scalability and that NFS slows down during heavy network traffic. The bandwidth and scalability issue is reported to have improved with NFSv4.2. 30 CASE STUDY: SUN NETWORK FILE SYSTEM…
  • 31. NFS summary  Sun NFS closely follows abstract model.  The resulting design provides good location and access transparency if the NFS mount service is used properly to produce similar name spaces at all clients.  NFS supports heterogeneous hardware and operating systems.  The NFS server implementation is stateless, enabling clients and servers to resume execution after a failure without the need for any recovery procedures.  Migration of files or filesystems is not supported, except at the level of manual intervention to reconfigure mount directives after the movement of a filesystem to a new location.  Design goals of NFS are Access transparency, Location transparency, Mobility transparency, Scalability, File replication, Hardware and operating system heterogeneity, Fault tolerance, Consistency, Security and Efficiency CASE STUDY: SUN NETWORK FILE SYSTEM… 31
  • 32. CASE STUDY: THE ANDREW FILE SYSTEM  It is a location-independent file system  It is a DFS that uses a set of trusted servers (vices) to present a homogeneous, location-transparent file name space to all the client workstations.  With AFS, people can work together on the same files, no matter where the files are located.  AFS users do not need to know which machine is storing a file  AFS makes it as easy to access files stored on a remote computer as files stored on the local disks  All files you store on AFS are available to use online by just connecting to your AFS server. We can connect with an AFS server through an AFS client  AFS uses a set of remote servers to access a file  AFS uses a local cache to reduce the workload and increase the performance of a distributed computing environment 32
  • 33.  In AFS, the server keeps track of which files are opened by which clients (unlike NFS)  Like NFS, AFS provides transparent access to remote shared files for UNIX programs running on workstations.  Access to AFS files is via the normal UNIX file primitives, enabling existing UNIX programs to access AFS files without modification or recompilation.  AFS is compatible with NFS. AFS servers hold ‘local’ UNIX files, but the filing system in the servers is NFS-based, so files are referenced by NFS-style file handles rather than i-node numbers, and the files may be remotely accessed via NFS. CASE STUDY: THE ANDREW FILE SYSTEM… 33
  • 34. Implementation of AFS  Venus: the client-side manager which acts as an interface between the application program and Vice  Vice: the server-side process that resides on top of the UNIX kernel, providing shared file services to each client  All files in AFS are distributed among the servers. The set of files on one server is referred to as a volume  If a request cannot be satisfied from this set of files, the Vice server informs the client where it can find the required files  The files available to user processes running on workstations are either local or shared  Local files are handled as normal UNIX files  Shared files are stored on servers, and copies of them are cached on local disks of workstations 34 CASE STUDY: THE ANDREW FILE SYSTEM…
  • 35. Features of AFS  File backup: AFS data files are backed up nightly. Backups are kept on site for six months  File security: AFS data files are protected by the Kerberos authentication system  Physical security: AFS data files are stored on servers located in the UCSC data center  Reliability and availability: AFS servers and storage are maintained on redundant hardware  Authentication: AFS uses Kerberos for authentication. Kerberos accounts are automatically provisioned for all UCSC faculty and staff  Space per user (quota): AFS provides 500 MB of space per user, and users can request an increase up to 10 GB. 35 CASE STUDY: THE ANDREW FILE SYSTEM…
  • 36. 36 CASE STUDY: THE ANDREW FILE SYSTEM…
  • 37.  AFS has two unusual design characteristics: Whole-file serving:  The entire contents of directories and files are transmitted to client computers by AFS servers Whole-file caching:  Once a copy of a file or a chunk has been transferred to a client computer it is stored in a cache on the local disk.  The cache contains several hundred of the files most recently used on that computer.  The cache is permanent, surviving reboots of the client computer.  Local copies of files are used to satisfy clients’ open requests in preference to remote copies whenever possible. CASE STUDY: THE ANDREW FILE SYSTEM… 37
  • 38. A FIRST REQUEST FOR DATA TO THE SERVER FROM A WORKSTATION IS SATISFIED BY THE SERVER AND PLACED IN THE LOCAL CACHE AFS is implemented as two software components that exist as UNIX processes called Vice and Venus. Vice is the name given to the server software that runs as a user-level UNIX process in each server computer, and Venus is a user-level process that runs in each client computer and corresponds to the client module in our abstract model. The set of files on one server is referred to as a volume. 38 IMPLEMENTATION ANDREW FILE SYSTEM
  • 39.  The files available to user processes running on workstations are either local or shared.  Local files are handled as normal UNIX files. They are stored on a workstation’s disk and are available only to local user processes.  Shared files are stored on servers, and copies of them are cached on the local disks of workstations.  It is a conventional UNIX directory hierarchy, with a specific subtree (called cmu) containing all of the shared files. This splitting of the file name space into local and shared files leads to some loss of location transparency, but this is hardly noticeable to users other than system administrators.  Local files are used only for temporary files (/tmp) and processes that are essential for workstation startup.  Other standard UNIX files (such as those normally found in /bin, /lib and so on) are implemented as symbolic links from local directories to files held in the shared space.  Users’ directories are in the shared space, enabling users to access their files from any workstation. 39 IMPLEMENTATION ANDREW FILE SYSTEM…
  • 40.  A flat file service is implemented by the Vice servers, and the hierarchic directory structure required by UNIX user programs is implemented by the set of Venus processes in the workstations.  Each file and directory in the shared file space is identified by a unique, 96-bit file identifier (fid) similar to a UFID. The Venus processes translate the pathnames issued by clients to fids 40 Implementation Andrew File System…
  • 41.  Files are grouped into volumes for ease of location and movement. Volumes are generally smaller than the UNIX filesystems, which are the unit of file grouping in NFS. For example, each user’s personal files are generally located in a separate volume. Other volumes are allocated for system binaries, documentation and library code.  The representation of fids includes the volume number for the volume containing the file (cf. the file group identifier in UFIDs), an NFS file handle identifying the file within the volume (cf. the file number in UFIDs) and a uniquifier to ensure that file identifiers are not reused: Volume number (32 bits) | File handle (32 bits) | Uniquifier (32 bits) 41 IMPLEMENTATION ANDREW FILE SYSTEM…
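The 96-bit fid layout above (32-bit volume number, 32-bit file handle, 32-bit uniquifier) can be illustrated with stdlib struct packing; the big-endian byte order here is an arbitrary choice for the sketch, not something the slide specifies:

```python
import struct

def pack_fid(volume, handle, uniquifier):
    """Pack the three 32-bit fields of an AFS-style fid into 12 bytes."""
    return struct.pack(">III", volume, handle, uniquifier)

def unpack_fid(fid):
    """Recover (volume, handle, uniquifier) from a packed fid."""
    return struct.unpack(">III", fid)

fid = pack_fid(7, 42, 1)
assert len(fid) * 8 == 96   # three 32-bit fields = 96 bits
```

The uniquifier field is what lets a server reuse a (volume, handle) slot for a new file without old cached fids silently matching it.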
  • 42. Cache consistency  Stateful servers in AFS allow the server to inform all clients with open files about any updates made to a file by another client, through what is known as a callback  Callbacks to all clients with a copy of a file are ensured, as a callback promise is issued by the server to a client when it requests a copy of a file 42 IMPLEMENTATION ANDREW FILE SYSTEM…
  • 43.  When Vice supplies a copy of a file to a Venus process it also provides a callback promise – a token issued by the Vice server guaranteeing that it will notify the Venus process when any other client modifies the file.  Callback promises are stored with the cached files on the workstation disks and have two states: valid and cancelled. 43 IMPLEMENTATION ANDREW FILE SYSTEM…
  • 44.  When a server performs a request to update a file it notifies all of the Venus processes to which it has issued callback promises by sending a callback to each  A callback is a remote procedure call from a server to a Venus process.  When the Venus process receives a callback, it sets the callback promise token for the relevant file to cancelled. 44 IMPLEMENTATION ANDREW FILE SYSTEM…
  • 45.  Whenever Venus handles an open on behalf of a client, it checks the cache.  If the required file is found in the cache, then its token is checked.  If its value is cancelled, then a fresh copy of the file must be fetched from the Vice server,  But if the token is valid, then the cached copy can be opened and used without reference to Vice. 45 Implementation Andrew File System…
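The open-time check Venus performs can be sketched as a small function (`fetch_from_vice` is a hypothetical stand-in for the RPC to the Vice server, which would also install a fresh callback promise):

```python
def open_file(fid, cache, fetch_from_vice):
    """Serve an open: use the cached copy if its callback promise
    token is valid, otherwise fetch a fresh copy from Vice."""
    entry = cache.get(fid)
    if entry is not None and entry["token"] == "valid":
        return entry["data"]            # no server contact needed
    data = fetch_from_vice(fid)         # fresh copy + new callback promise
    cache[fid] = {"data": data, "token": "valid"}
    return data
```

A callback from the server simply flips the token to "cancelled", so the next open on that file falls through to the fetch path.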
  • 46.  When a workstation is restarted after a failure or a shutdown  Venus aims to retain as many as possible of the cached files on the local disk  But it cannot assume that the callback promise tokens are correct, since some callbacks may have been missed. 46 IMPLEMENTATION ANDREW FILE SYSTEM…
  • 47.  Before the first use of each cached file or directory after a restart,  Venus therefore generates a cache validation request containing the file modification timestamp to the server that is the custodian of the file.  If the timestamp is current, the server responds with valid, and the token is reinstated.  If the timestamp shows that the file is out of date, then the server responds with cancelled, and the token is set to cancelled.  Callbacks must be renewed 47 IMPLEMENTATION ANDREW FILE SYSTEM…
  • 48. Implementation of file system calls in AFS 48
  • 49. OPERATION OF AFS  Fetch(fid) → attr, data: returns the attributes (status) and, optionally, the contents of the file identified by fid and records a callback promise on it.  Store(fid, attr, data): updates the attributes and (optionally) the contents of a specified file.  Create() → fid: creates a new file and records a callback promise on it.  Remove(fid): deletes the specified file.  SetLock(fid, mode): sets a lock on the specified file or directory; the mode of the lock may be shared or exclusive; locks that are not removed expire after 30 minutes.  ReleaseLock(fid): unlocks the specified file or directory.  RemoveCallback(fid): informs the server that a Venus process has flushed a file from its cache.  BreakCallback(fid): call made by a Vice server to a Venus process; cancels the callback promise on the relevant file. 49
  • 50. THANK YOU FOR YOUR ATTENTION 50
  • 51. FUNDAMENTALS OF DISTRIBUTED SYSTEM Chapter Four Naming 1
  • 52.  we will discuss  some general issues in naming  how human-friendly names are organized and implemented; e.g., those for file systems and the WWW  classes of naming systems:  flat naming  structured naming, and  attribute-based naming 2 OBJECTIVES OF THE CHAPTER
  • 53. INTRODUCTION  names play an important role to:  share resources  uniquely identify entities  refer to locations  etc.  an important issue is that a name can be resolved to the entity it refers to  to resolve names, it is necessary to implement a naming system  in a distributed system, the implementation of a naming system is itself often distributed, unlike in non-distributed systems  efficiency and scalability of the naming system are the main issues 3
  • 54.  Uniform Resource Identifiers (URIs) [Berners-Lee et al. 2005] came about from the need to identify resources on the Web, and other Internet resources such as electronic mailboxes. An important goal was to identify resources in a coherent way, so that they could all be processed by common software such as browsers.  Uniform Resource Locator (URL) is often used for URIs that provide location information and specify the method for accessing the resource, such as the ‘http’ scheme.  Uniform Resource Names (URNs) are URIs that are used as pure resource names rather than locators. 4 COMMON TERMS
  • 55.  A name service stores information about a collection of textual names, in the form of bindings between the names and the attributes of the entities they denote, such as users, computers, services and objects.  A name space is the collection of all valid names recognized by a particular service.  A naming domain is a name space for which there exists a single overall administrative authority responsible for assigning names within it.  The Domain Name System is a name service design whose main naming database is used across the Internet.  A service that stores collections of bindings between names and attributes and that looks up entries that match attribute-based specifications is called a directory service. Directory services are also sometimes known as attribute- based name services. 5
  • 56. Names, Identifiers, and Addresses a name in a distributed system is a string of bits or characters that is used to refer to an entity an entity is anything; e.g., resources such as hosts, printers, disks, files, objects, processes, users, Web pages, newsgroups, mailboxes, network connections, ... entities can be operated on  e.g., a resource such as a printer offers an interface containing operations for printing a document, requesting the status of a job, etc.  a network connection may provide operations for sending and receiving data, setting quality of service parameters, etc. to operate on an entity, it is necessary to access it through its access point, which is itself a (special) entity 6
  • 57.  access point  the name of an access point is called an address (such as IP address and port number as used by the transport layer)  the address of the access point of an entity is also referred to as the address of the entity  an entity can have more than one access point (similar to accessing an individual through different telephone numbers)  an entity may change its access point in the course of time (e.g., a mobile computer getting a new IP address as it moves) 7
  • 58.  an address is a special kind of name  it refers to at most one entity  each entity is referred by at most one address; even when replicated such as in Web pages  an entity may change an access point, or an access point may be reassigned to a different entity (like telephone numbers in offices)  separating the name of an entity and its address makes it easier and more flexible; such a name is called location independent  there are also other types of names that uniquely identify an entity; in any case a true identifier is a name with the following properties  it refers to at most one entity  each entity is referred by at most one identifier  it always refers to the same entity (never reused)  identifiers allow us to unambiguously refer to an entity 8
  • 59.  examples  name of an FTP server (entity)  URL of the FTP server  address of the FTP server  IP number:port number  the address of the FTP server may change  there are three classes on naming systems: flat naming, structured naming, and attribute-based naming 9
  • 60. A. Flat Naming a name is a sequence of characters without structure; like human names? may be, if it is not an Ethiopian name! difficult to use in a large system since names must be centrally controlled to avoid duplication moreover, a flat name does not contain any information on how to locate the access point of its associated entity how are flat names resolved (or how to locate an entity when a flat name is given)?  name resolution: mapping a name to an address or an address to a name is called name-address resolution  possible solutions: simple solutions, home-based approaches, and hierarchical approaches 10
  • 61. 1. Simple Solutions  two solutions (for LANs only): Broadcasting and Multicasting, and Forwarding Pointers a. Broadcasting and Multicasting  broadcast a message containing the identifier of an entity; only machines that can offer an access point for the entity send a reply  e.g., ARP (Address Resolution Protocol) in the Internet to find the data link address (MAC address) of a machine  a computer that wants to access another computer for which it knows its IP address broadcasts this address  the owner responds by sending its Ethernet address  broadcasting is inefficient when the network grows (wastage of bandwidth and too much interruption to other machines)  multicasting is better when the network grows - send only to a restricted group of hosts 11
  • 62. 12  multicasting can also be used to locate the nearest replica - choose the one whose reply comes in first b. Forwarding Pointers  how to look for mobile entities  when an entity moves from A to B, it leaves behind a reference to its new location  advantage  simple: as soon as the first name is located using traditional naming service, the chain of forwarding pointers can be used to find the current address  drawbacks  the chain can be too long - locating becomes expensive  all the intermediary locations in a chain have to maintain their pointers  vulnerability if links are broken  hence, making sure that chains are short and that forwarding pointers are robust is an important issue
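The forwarding-pointer scheme above can be sketched in a few lines: each old location keeps a pointer to the next one, and a lookup follows the chain until it reaches the current location. The tables and the `locate` helper below are invented for illustration.

```python
# chain of forwarding pointers left behind as the entity moved A -> B -> C
forward = {"A": "B", "B": "C"}
address = {"C": "192.0.2.7"}     # only the current location knows the real address

def locate(start):
    """Follow forwarding pointers from a known old location to the current one."""
    node = start
    hops = 0
    while node in forward:       # each old location forwards to the next
        node = forward[node]
        hops += 1
    return address[node], hops
```

A lookup starting at the original location `A` takes two extra hops, which illustrates the drawback noted above: the longer the chain, the more expensive locating becomes.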
  • 63. 2. HOME-BASED APPROACHES  broadcasting and multicasting have scalability problems; performance and broken links are problems in forwarding pointers  a home location keeps track of the current location of an entity; often it is the place where an entity was created  it is a two-tiered approach  an example where it is used is Mobile IP  each mobile host uses a fixed IP address  all communication to that IP address is initially directly sent to the host’s home agent located on the LAN corresponding to the network address contained in the mobile host’s IP address  whenever the mobile host moves to another network, it requests a temporary address in the new network (called care-of-address) and informs the home agent of the new address 13
  • 64. 12  when the home agent receives a message for the mobile host (from a correspondent agent) it forwards it to its new address (if it has moved) and also informs the sender of the host’s current location for sending subsequent packets home-based approach: the principle of Mobile IP
  • 65.  problems:  creates communication latency (triangle routing: correspondent - home network - mobile)  the home location must always exist; the host is unreachable if the home no longer exists (permanently changed); the solution is to register the home at a traditional name service and let a client first look up the location of the home 15
  • 66. B. Structured Naming flat names are not convenient for humans Name Spaces  names are organized into a name space  each name is made of several parts; the first may define the nature of the organization, the second the name, the third departments, ...  the authority to assign and control the name spaces can be decentralized where a central authority assigns only the first two parts  a name space is generally organized as a labeled, directed graph with two types of nodes  leaf node: represents the named entity and stores information such as its address or the state of that entity  directory node: a special entity that has a number of outgoing edges, each labeled with a name each node in a naming graph is considered as another entity with an identifier 16
  • 67. 17 a general naming graph with a single root node, n0  a directory node stores a table in which an outgoing edge is represented as a pair (edge label, node identifier), called a directory table  each path in a naming graph can be referred to by the sequence of labels corresponding to the edges of the path and the first node in the path, such as N:<label-1, label-2, ..., label-n>, where N refers to the first node in the path
  • 68.  such a sequence is called a path name  if the first node is the root of the naming graph, it is called an absolute path name; otherwise it is a relative path name  instead of the path name n0:<home, steen, mbox>, we often use its string representation /home/steen/mbox  there may also be several paths leading to the same node, e.g., node n5 can be represented as /keys or /home/steen/keys  although the above naming graph is directed acyclic graph (a node can have more than one incoming edge but is not permitted to have a cycle), the common way is to use a tree (hierarchical) with a single root (as is used in file systems)  in a tree structure, each node except the root has exactly one incoming edge; the root has no incoming edges  each node also has exactly one associated (absolute) path name 18
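Resolution of a path name against the directory tables can be sketched directly. The graph below follows the slides' example (root node n0, path /home/steen/mbox, node n5 reachable both as /keys and /home/steen/keys); the node identifiers and addresses are illustrative.

```python
# directory nodes: each maps an edge label to the identifier of the next node
graph = {
    "n0": {"home": "n1", "keys": "n5"},
    "n1": {"steen": "n2"},
    "n2": {"mbox": "n4", "keys": "n5"},
}
# leaf nodes store information on the named entity, e.g. its address
leaves = {"n4": "addr-of-mbox", "n5": "addr-of-keys"}

def resolve(start, path):
    """Resolve N:<label-1, ..., label-n> to the address stored in the leaf."""
    node = start
    for label in path:
        node = graph[node][label]   # look up the edge in the directory table
    return leaves[node]
```

Resolving `("n0", ["keys"])` and `("n0", ["home", "steen", "keys"])` returns the same address, showing two path names leading to one node.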
  • 69.  e.g., file naming in UNIX file system  a directory node represents a directory and a leaf node represents a file  there is a single root directory, represented in the naming graph by the root node  we have a contiguous series of blocks from a logical disk  the boot block is used to load the operating system  the superblock contains information on the entire file system such as its size, etc.  inodes are referred to by an index number, starting at number zero, which is for the inode representing the root directory  given the index number of an inode, it is possible to access its associated file 19
  • 70.  Name Resolution  given a path name, the process of looking up a name stored in the node is referred to as name resolution; it consists of finding the address when the name is given (by following the path)  knowing how and where to start name resolution is referred to as closure mechanism; e.g., UNIX file system  Linking and Mounting  Linking: giving another name for the same entity (an alias) e.g., environment variables in UNIX such as HOME that refer to the home directory of a user  two types of links (or two ways to implement an alias): hard link and symbolic link  hard link: to allow multiple absolute path names to refer to the same node in a naming graph e.g., in the previous graph, there are two different path names for node n5: /keys and /home/steen/keys 20
  • 71.  symbolic link: representing an entity by a leaf node and instead of storing the address or state of the entity, the node stores an absolute path name 21 the concept of a symbolic link explained in a naming graph when first resolving an absolute path name stored in a node (e.g., /home/steen/keys in node n6), name resolution will return the path name stored in the node (/keys), at which point it can continue with resolving that new path name, i.e., closure mechanism
  • 72.  so far name resolution was discussed as taking place within a single name space  name resolution can also be used to merge different name spaces in a transparent way  the solution is to use mounting  as an example, consider a mounted file system, which can be generalized to other name spaces as well  let a directory node store the identifier of a directory node from a different (foreign) name space  the directory node storing the node identifier is called a mount point  the directory node in the foreign name space is called a mounting point, normally the root of a name space  during name resolution, the mounting point is looked up and resolution proceeds by accessing its directory table
  • 73.  consider a collection of name spaces distributed across different machines (each name space implemented by a different server)  to mount a foreign name space in a distributed system, the following are at least required  the name of an access protocol (for communication)  the name of the server  the name of the mounting point in the foreign name space  each of these names needs to be resolved  to the implementation of the protocol so that communication can take place properly  to an address where the server can be reached  to a node identifier in the foreign name space (to be resolved by the server of the foreign name space)  the three names can be listed as a URL
  • 74.  example: Sun’s Network File System (NFS) is a distributed file system with a protocol that describes how a client can access a file stored on a (remote) NFS file server  an NFS URL may look like nfs://flits.cs.vu.nl/home/steen - nfs is an implementation of a protocol - flits.cs.vu.nl is a server name to be resolved using DNS - /home/steen is resolved by the foreign server  e.g., the subdirectory /remote includes mount points for foreign name spaces on the client machine  a directory node named /remote/vu is used to store nfs://flits.cs.vu.nl/home/steen  consider /remote/vu/mbox  this name is resolved by starting at the root directory on the client’s machine until node /remote/vu, which returns the URL nfs://flits.cs.vu.nl/home/steen  this leads the client machine to contact flits.cs.vu.nl using the NFS protocol  then the file mbox is read in the directory /home/steen
  • 75. mounting remote name spaces through a specific access protocol (the figure shows the mount point in the local name space and the mounting point in the foreign name space)
  • 76.  distributed systems that allow mounting a remote file system also allow to execute some commands  example commands to access the file system cd /remote/vu /*changing directory on the remote machine ls -l /*listing the files on the remote machine  by doing so the user is not supposed to worry about the details of the actual access; the name space on the local machine and that on the remote machine look to form a single name space
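The NFS example above can be sketched as two-step resolution: follow the local name space until a node storing a URL is reached, then hand the remaining labels to the foreign server named in the URL. The dictionaries and the `resolve` helper are illustrative stand-ins (the foreign name space simply simulates the server flits.cs.vu.nl), not NFS code.

```python
local_ns = {
    "/": {"remote": "/remote"},
    "/remote": {"vu": "nfs://flits.cs.vu.nl/home/steen"},  # mount point
}
# stands in for the name space exported by the foreign server
foreign_ns = {
    "nfs://flits.cs.vu.nl/home/steen": {"mbox": "contents of mbox"},
}

def resolve(path):
    """Resolve a path, crossing a mount point if a node stores a URL."""
    parts = path.strip("/").split("/")
    node = "/"
    for i, label in enumerate(parts):
        node = local_ns[node][label]
        if node.startswith("nfs://"):
            # crossed the mount point: the remaining labels are resolved
            # by the foreign server named in the URL
            entry = foreign_ns[node]
            for rest in parts[i + 1:]:
                entry = entry[rest]
            return entry
    return node
```

Resolving /remote/vu/mbox thus starts on the client's machine, hits the URL stored at /remote/vu, and finishes in the foreign name space, just as described on the previous slide.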
  • 77.  The Implementation of a Name Space  a name space forms the heart of a naming service  a naming service allows users and processes to add, remove, and lookup names  a naming service is implemented by name servers  for a distributed system on a single LAN, a single server might suffice; for a large-scale distributed system the implementation of a name space is distributed over multiple name servers  Name Space Distribution  in large scale distributed systems, it is necessary to distribute the name service over multiple name servers, usually organized hierarchically  a name service can be partitioned into logical layers  the following three layers can be distinguished (according to Cheriton and Mann)
  • 78.  global layer  formed by highest level nodes (root node and nodes close to it or its children)  nodes on this layer are characterized by their stability, i.e., directory tables are rarely changed  they may represent organizations, groups of organizations, ..., where names are stored in the name space  administrational layer  groups of entities that belong to the same organization or administrational unit, e.g., departments  relatively stable  managerial layer  nodes that may change regularly, e.g., nodes representing hosts of a LAN, shared files such as libraries or binaries, …  nodes are managed not only by system administrators, but also by end users
  • 79. an example partitioning of the DNS name space, including Internet- accessible files, into three layers
  • 80.  the name space is divided into nonoverlapping parts, called zones in DNS  a zone is a part of the name space that is implemented by a separate name server  some requirements of servers at different layers: performance (responsiveness to lookups), availability (failure rate), etc.  high availability is critical for the global layer, since name resolution cannot proceed beyond the failing server; it is also important at the administrational layer for clients in the same organization  performance is very important in the lowest layer, since results of lookups can be cached and used due to the relative stability of the higher layers  they may be enhanced by client side caching (for global and administrational layers since names do not change often) and replication; they create implementation problems since they may introduce inconsistency (see Chapter 7)
  • 81. a comparison between name servers for implementing nodes from a large- scale name space partitioned into a global layer, an administrational layer, and a managerial layer Item Global Administrational Managerial Geographical scale of network Worldwide Organization Department Total number of nodes Few Many Vast numbers Responsiveness to lookups Seconds Milliseconds Immediate Update propagation Lazy Immediate Immediate Availability requirement Very High High low Number of replicas Many None or few None Is client-side caching applied? Yes Yes Sometimes
  • 82.  Implementation of Name Resolution  recall that name resolution consists of finding the address when the name is given  assume that name servers are not replicated and that no client-side caches are allowed  each client has access to a local name resolver, responsible for ensuring that the name resolution process is carried out  e.g., assume the path name root:<nl, vu, cs, ftp, pub, globe, index.txt> is to be resolved or using a URL notation, this path name would correspond to ftp://ftp.cs.vu.nl/pub/globe/index.txt
  • 83.  a host that needs to map a name to an address calls a DNS client named a resolver (and provides it the name to be resolved - ftp.cs.vu.nl)  the resolver accesses the closest DNS server with a mapping request  if the server has the information it satisfies the resolver; otherwise, it either refers the resolver to other servers (called Iterative Resolution) or asks other servers to provide it with the information (called Recursive Resolution)  Iterative Resolution  a name resolver hands over the complete name to the root name server  the root name server will resolve the name as far as it can and return the result to the client; at the minimum it can resolve the first level and sends the name of the first level name server to the client  the client calls the first level name server, then the second, ..., until it finds the address of the entity
  • 84. the principle of iterative name resolution
  • 85.  Recursive Resolution  a name resolver hands over the whole name to the root name server  the root name server will try to resolve the name and if it can’t, it requests the first level name server to resolve it and to return the address  the first level will do the same thing recursively the principle of recursive name resolution
  • 86.  Advantages and drawbacks  recursive name resolution puts a higher performance demand on each name server; hence name servers in the global layer support only iterative name resolution  caching is more effective with recursive name resolution  each name server gradually learns the address of each name server responsible for implementing lower-level nodes  eventually lookup operations can be handled efficiently
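Both resolution styles can be sketched with a toy set of name servers. The zone data below is invented for illustration (the address is made up); in the iterative version the client's resolver loops over referrals, while in the recursive version each server asks the next one on the client's behalf (modeled here as a recursive call).

```python
# each "server" resolves the first label of what it is given and returns
# either the next server to contact or the final result
zones = {
    "root": {"nl": "nl-server"},
    "nl-server": {"vu": "vu-server"},
    "vu-server": {"cs": "cs-server"},
    "cs-server": {"ftp": "130.37.24.11"},   # address of the named entity
}

def iterative_resolve(path):
    """The client contacts each server in turn, following referrals."""
    server = "root"
    while path:
        result = zones[server][path[0]]     # server resolves one label
        path = path[1:]
        if not path:
            return result                   # fully resolved: an address
        server = result                     # referral: client contacts next server

def recursive_resolve(server, path):
    """Each server passes the remaining name to the next server itself."""
    result = zones[server][path[0]]
    if len(path) == 1:
        return result
    return recursive_resolve(result, path[1:])
```

Both calls resolve <nl, vu, cs, ftp> to the same address; the difference is who does the work, which is exactly the performance trade-off noted above.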
  • 87. recursive name resolution of <nl, vu, cs, ftp>; name servers cache intermediate results for subsequent lookups Server for node Should resolve Looks up Passes to child Receives and caches Returns to requester cs <ftp> #<ftp> -- -- #<ftp> vu <cs,ftp> #<cs> <ftp> #<ftp> #<cs> #<cs, ftp> nl <vu,cs,ftp> #<vu> <cs,ftp> #<cs> #<cs,ftp> #<vu> #<vu,cs> #<vu,cs,ftp> root <nl,vu,cs,ftp> #<nl> <vu,cs,ftp> #<vu> #<vu,cs> #<vu,cs,ftp> #<nl> #<nl,vu> #<nl,vu,cs> #<nl,vu,cs,ftp>
  • 88. the comparison between recursive and iterative name resolution with respect to communication costs; assume the client is in Ethiopia and the name servers in the Netherlands  communication costs may be reduced in recursive name resolution  Summary Method Advantages Recursive Less Communication cost; Caching is more effective Iterative Less performance demand on name servers
  • 89.  Example - The Domain Name System (DNS)  one of the largest distributed naming services is the Internet DNS  it is used for looking up host addresses and mail servers  hierarchical, defined in an inverted tree structure with the root at the top  the tree can have only 128 levels
  • 90.  Label  each node has a label, a string with a maximum of 63 characters (case insensitive)  the root label is null (has no label)  children of a node must have different names (to guarantee uniqueness)  Domain Name  each node has a domain name; it is a path name to its root node  a full domain name is a sequence of labels separated by dots (the last character is a dot)  domain names are read from the node up to the root  full path names must not exceed 255 characters
  • 91.  the contents of a node is formed by a collection of resource records; the important ones are the following Type of record Associated entity Description SOA (start of authority) Zone Holds information on the represented zone, such as an e-mail address of the system administrator A (address) Host Contains an IP address of the host this node represents MX (mail exchange) Domain Refers to a mail server to handle mail addressed to this node; it is a symbolic link; e.g. name of a mail server SRV Domain Refers to a server handling a specific service NS (name server) Zone Refers to a name server that implements the represented zone CNAME Node Contains the canonical name of a host; an alias PTR (pointer) Host Symbolic link with the primary name of the represented node; for mapping an IP address to a name HINFO (host info) Host Holds information on the host this node represents; such as machine type and OS TXT Any kind Contains any entity-specific information considered useful; cannot be automatically processed
  • 92.  cs.vu.nl represents the domain as well as the zone; it has 4 name servers (ns, star, top, solo) and 3 mail servers  name server for this zone with 2 network addresses (star)  mail servers; the numbers preceding the name show priorities; first the one with the lowest number is tried an excerpt from the DNS database for the zone cs.vu.nl
  • 93. an excerpt from the DNS database for the zone cs.vu.nl, cont’d  a Web server and an FTP server, implemented by a single machine (soling)  older server clusters (vucs-das1)  two printers (inkt and pen) with a local address; i.e., they cannot be accessed from outside
  • 94. part of the description for the vu.nl domain which contains the cs.vu.nl domain  cs.vu.nl is implemented as a single zone  hence, the records in the previous slides do not include references to other zones  nodes in a subdomain that are implemented in a different zone are specified by giving the domain name and IP address
  • 95. C. Attribute-Based Naming flat naming: provides a unique and location-independent way of referring to entities structured naming: also provides a unique and location- independent way of referring to entities as well as human- friendly names but both do not allow searching entities by giving a description of an entity in attribute-based naming, each entity is assumed to have a collection of attributes that say something about the entity then a user can search an entity by specifying (attribute, value) pairs known as attribute-based naming Directory Services  attribute-based naming systems are also called directory services whereas systems that support structured naming are called naming systems
  • 96.  how are resources described? one possibility is to use RDF (Resource Description Framework) that uses triplets consisting of a subject, a predicate, and an object  e.g., (person, name, Alice) to describe a resource Person whose Name is Alice  or in e-mail systems, we can use sender, recipient, subject, etc. for searching  Hierarchical Implementations: LDAP  distributed directory services are implemented by combining structured naming with attribute-based naming  e.g., Microsoft’s Active Directory service  such systems rely on the Lightweight Directory Access Protocol (LDAP), which is derived from OSI’s X.500 directory service  an LDAP directory service consists of a number of records called directory entries, made up of (attribute, value) pairs, similar to a resource record in DNS; attributes could be single- or multiple-valued (e.g., Mail_Servers on next slide)
  • 97. a simple example of an LDAP directory entry using LDAP naming conventions to identify the network addresses of some servers Attribute Abbr. Value Country C NL Locality L Amsterdam Organization O Vrije Universiteit OrganizationalUnit OU Comp. Sc. CommonName CN Main server Mail_Servers -- 137.37.20.3, 130.37.24.6,137.37.20.10 FTP_Server -- 130.37.20.20 WWW_Server -- 130.37.20.20
  • 98.  the collection of all directory entries is called a Directory Information Base (DIB)  each record is uniquely named so that it can be looked up  each naming attribute is called a Relative Distinguished Name (RDN); the first 5 entries above  a globally unique name is formed using abbreviations of naming attributes, e.g., /C=NL/O=Vrije Universiteit/OU=Comp. Sc.  this is similar to the DNS name nl.vu.cs  listing RDNs in sequence leads to a hierarchy of the collection of directory entries, called a Directory Information Tree (DIT)  a DIT forms the naming graph of an LDAP directory service where each node represents a directory entry
  • 99. part of the directory information tree  node N corresponds to the directory entry shown earlier; it also acts as a parent of other directory entries that have an additional attribute, Host_Name; such entries may be used to represent hosts
  • 100. Attribute Value Country NL Locality Amsterdam Organization Vrije Universiteit OrganizationalUnit Comp. Sc. CommonName Main server Host_Name star Host_Address 192.31.231.42 Attribute Value Country NL Locality Amsterdam Organization Vrije Universiteit OrganizationalUnit Comp. Sc. CommonName Main server Host_Name zephyr Host_Address 137.37.20.10 two directory entries having Host_Name as RDN Reading Assignment: case study of global Name services, Distributed Shared Memory
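Attribute-based lookup over a directory information base can be sketched as matching (attribute, value) pairs against a collection of entries. The entries below mirror the two slides above; the `search` helper is an illustrative sketch, not the LDAP protocol.

```python
# the DIB: each directory entry is a set of (attribute, value) pairs
dib = [
    {"Country": "NL", "Locality": "Amsterdam",
     "Organization": "Vrije Universiteit", "OrganizationalUnit": "Comp. Sc.",
     "CommonName": "Main server", "Host_Name": "star",
     "Host_Address": "192.31.231.42"},
    {"Country": "NL", "Locality": "Amsterdam",
     "Organization": "Vrije Universiteit", "OrganizationalUnit": "Comp. Sc.",
     "CommonName": "Main server", "Host_Name": "zephyr",
     "Host_Address": "137.37.20.10"},
]

def search(**attrs):
    """Return every entry matching all the given (attribute, value) pairs."""
    return [entry for entry in dib
            if all(entry.get(k) == v for k, v in attrs.items())]
```

A search on a shared attribute (`Country="NL"`) returns both hosts, while adding `Host_Name="star"` narrows it to one entry and its address, which is the kind of description-based lookup structured naming alone cannot offer.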
  • 101. THANK YOU FOR YOUR ATTENTION 51
  • 102. FUNDAMENTALS OF DISTRIBUTED SYSTEM Chapter Five Synchronization 1
  • 103. OUTLINE  Clock synchronization, physical clocks and clock synchronization algorithms  Logical clocks and time stamps  Global state  Distributed transactions and concurrency control  Election algorithms  Mutual exclusion and various algorithms to achieve mutual exclusion 2
  • 104. INTRODUCTION  Synchronization is the coordination of actions between processes.  Asynchronous execution consists of independent events.  cooperation is partly supported by naming; it allows processes to at least share resources (entities)  synchronization deals with how to ensure that processes do not simultaneously access a shared resource  and with how events can be ordered, such as two processes sending messages to each other We will study:  Synchronization based on “Actual Time”.  Synchronization based on “Relative Time”.  Synchronization based on Co-ordination (with Election Algorithms).  Mutual Exclusion 3
  • 105.  Essentials of Synchronization 4
  • 106.  Issues of synchronization 5
  • 108. Clock Synchronization  in centralized systems, synchronization is straightforward because processes share memory and a single clock.  The event ordering is clean because all the events are timed by the same clock.  One clock, even when several processors share the memory.  achieving agreement on time in distributed systems is difficult  Each system has its own time.  e.g., consider the make program on a UNIX machine; it compiles only source files for which the time of their last update was later than the existing object file 7
  • 109. Physical Clocks: clocks whose values must not deviate from the real time by more than a certain amount.  A clock is an electronic device that counts oscillations of a crystal at a particular frequency; the count is stored in a computer register.  Physical clocks can be used to timestamp events on that computer, e.g., event E1 at time t1 and event E2 at time t2.  Many applications are interested only in the order of events, not the exact time of day at which they occurred  The instantaneous time difference between two computers' clocks is the skew, while the rate at which a clock diverges from real time is the drift  Several methods are used to attempt the synchronization of physical clocks in a distributed system  The main methods are:  Cristian's Algorithm  Berkeley Algorithm  Network Time Protocol 8
  • 113. 12  Clock Synchronization Algorithm (T_server)
  • 115.  Cristian's Algorithm  The client process receives the response from the server at time T1 and calculates the new synchronized client clock time as T_client = T_server + (T1 − T0)/2 Send request at 5:08:15:100 (T0) Receive response at 5:08:15:900 (T1) Response contains 5:09:25:300 (T_server) The elapsed time T1 − T0 = 800 ms is the round-trip time, so the estimated one-way delay is (T1 − T0)/2 = 400 ms Set the time to T_client = T_server + (T1 − T0)/2 T_client = 5:09:25:300 + 400 T_client = 5:09:25:700 14
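The worked example above can be checked with a short sketch of Cristian's rule. Times are expressed in milliseconds within their respective seconds (T0/T1 within 5:08:15, T_server within 5:09:25); the `cristian` helper name is illustrative.

```python
def cristian(t0, t1, t_server):
    """Client sets its clock to the server time plus half the round-trip time."""
    rtt = t1 - t0                 # round-trip time as measured by the client
    return t_server + rtt / 2

# request sent at 5:08:15:100, response received at 5:08:15:900,
# response carries the server time 5:09:25:300
t_client = cristian(100, 900, 300)
# rtt = 800 ms, so the client sets 5:09:25:300 + 400 ms = 5:09:25:700
```

The estimate is only as good as the assumption that the network delay is symmetric; a skewed outbound/return delay biases the result by half the asymmetry.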
  • 116. 15
  • 117. 16
  • 118. 17 • NTP (Network Time Protocol) is a protocol that allows computer clock times to be synchronized across a network
  • 147. SUMMARY • Overview • Essentiality of Synchronization • Issues in Synchronization • Clock Synchronization • Physical Clocks • Clock Synchronization • Logical Clocks • Happens-Before Relationship • Lamport's Logical Clocks • Vector Clocks • Mutual Exclusion • A Centralized Algorithm • A Decentralized Algorithm • A Distributed Algorithm • A Token Ring Algorithm • Election Algorithm • Introduction • The Bully Algorithm • A Ring Algorithm • Superpeer Selection 46
  • 148. THANK YOU FOR YOUR ATTENTION 47
  • 149. FUNDAMENTALS OF DISTRIBUTED SYSTEM Chapter Six Fault Tolerance 1
  • 150.  a major difference between distributed systems and single machine systems is that with the former, partial failure is possible, i.e., when one component in a distributed system fails  such a failure may affect some components while others will continue to function properly  an important goal of distributed systems design is to construct a system that can automatically recover from partial failure  it should tolerate faults and continue to operate to some extent 2 INTRODUCTION
  • 151.  we will discuss  fault tolerance, and making distributed systems fault tolerant  process resilience (techniques by which one or more processes can fail without seriously disturbing the rest of the system)  reliable multicasting to keep processes synchronized (by which message transmission to a collection of processes is guaranteed to succeed)  distributed commit protocols for ensuring atomicity in distributed systems  failure recovery by saving the state of a distributed system (when and how) 3 OBJECTIVES OF THE CHAPTER
  • 152. Introduction to Fault Tolerance  Basic Concepts  fault tolerance is strongly related to dependable systems  dependability covers the following  availability  refers to the probability that the system is operating correctly at any given time; defined in terms of an instant in time  reliability  a property that a system can run continuously without failure; defined in terms of a time interval  safety  refers to the situation that even when a system temporarily fails to operate correctly, nothing catastrophic happens  maintainability  how easily a failed system can be repaired 4
  • 153.  dependable systems are also required to provide a high degree of security  a system is said to fail when it cannot meet its promises; for instance, failing to provide its users one or more of the services it promises  an error is a part of a system’s state that may lead to a failure; e.g., damaged packets in communication  the cause of an error is called a fault  building dependable systems closely relates to controlling faults  a distinction is made between preventing, removing, and forecasting faults  a fault tolerant system is a system that can provide its services even in the presence of faults 5
  • 154.  faults are classified into three  transient  occurs once and then disappears; if the operation is repeated, the fault goes away; e.g., a bird flying through the beam of a microwave transmitter may cause some lost bits  intermittent  it occurs, then vanishes of its own accord, then reappears, ...; e.g., a loose connection; difficult to diagnose; take your sick child to the nearest clinic, but the child does not show any sickness by the time you reach there  permanent  one that continues to exist until the faulty component is repaired; e.g., disk head crash, software bug 6
  • 155.  Failure Types - 5 of them  Crash failure (also called fail-stop failure): a server halts, but was working correctly until it stopped; e.g., the OS halts; reboot the system  Omission failure: a server fails to respond to incoming requests  Receive omission: a server fails to receive incoming messages; e.g., perhaps no thread is listening  Send omission: a server fails to send messages  Timing failure: a server's response lies outside the specified time interval; e.g., perhaps it is too fast, flooding the receiver, or too slow  Response failure: the server's response is incorrect  Value failure: the value of the response is wrong; e.g., a search engine returning wrong Web pages as the result of a search 7
  • 156. 8  State transition failure: the server deviates from the correct flow of control; e.g., taking default actions when it fails to understand the request  Arbitrary failure (or Byzantine failure): a server may produce arbitrary responses at arbitrary times; the most serious  Failure Masking by Redundancy  to be fault tolerant, the system tries to hide the occurrence of failures from other processes - masking  the key technique for masking faults is redundancy  three kinds are possible  information redundancy; add extra bits to allow recovery from garbled bits (error correction)  time redundancy: an action is performed more than once if needed; e.g., redo an aborted transaction; useful for transient and intermittent faults  physical redundancy: add (replicate) extra equipment (hardware) or processes (software)
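Physical redundancy is usually combined with voting, as in triple modular redundancy (TMR), where three replicas compute the same result and a voter masks a single faulty one. A minimal voter sketch (the function tmr_vote and its error message are our own illustration, not from the slides):

```python
from collections import Counter

def tmr_vote(results):
    """Majority voter over the outputs of three replicated components;
    a single faulty replica is masked by the two correct ones."""
    value, count = Counter(results).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority: more than one replica failed")
    return value

# one faulty replica (7) is outvoted by the two correct ones
print(tmr_vote([42, 42, 7]))  # 42
```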
  • 157. Process Resilience  how can fault tolerance be achieved in distributed systems  one method is protection against process failures by replicating processes into groups  we discuss  what are the general design issues of process groups  what actually is a fault tolerant group  how to reach agreement within a process group when one or more of its members cannot be trusted to give correct answers 9
  • 158.  Design Issues  the key approach to tolerating a faulty process is to organize several identical processes into a group  all members of a group receive a message hoping that if one process fails, another one will take over  the purpose is to allow processes to deal with collection of processes as a single abstraction  process groups may be dynamic  new groups can be created and old groups can be destroyed  a process can join or leave a group  a process can be a member of several groups at the same time  hence group management and membership mechanisms are required 10
  • 159. 11  the internal structure of a group may be flat or hierarchical  flat: all processes are equal and decisions are made collectively  hierarchical: a coordinator and several workers; the coordinator decides which worker is best suited to carry out a task (a) communication in a flat group (b) communication in a simple hierarchical group
  • 160.  the flat group  has no single point of failure  but decision making is more complicated (voting may be required for decision making, which may create delay and overhead)  the hierarchical group has the opposite properties  group membership may be handled  through a group server where all requests (joining, leaving, ...) are sent; it is a single point of failure  in a distributed way where membership is multicasted (if a reliable multicasting mechanism is available)  but what if a member crashes? other members have to find this out by noticing that it no longer responds 12
  • 161.  Failure Masking and Replication  how to replicate processes so that they can form groups and failures can be masked?  there are two ways for such replication:  primary-based replication  for fault tolerance, a primary-backup protocol is used  organize processes hierarchically and let the primary (i.e., the coordinator) coordinate all write operations  if the primary crashes, the backups hold an election  replicated-write protocols  in the form of active replication or by means of quorum-based protocols  that means, processes are organized as flat groups 13
  • 162.  another important issue is how much replication is needed  for simplicity, consider only replicated-write systems  a system is said to be k fault tolerant if it can survive faults in k components and still meet its specifications  if the processes fail silently, then having k+1 replicas is enough; if k of them fail, the remaining one can function  if processes exhibit Byzantine failures, 2k+1 replicas are required; even if the k faulty processes generate the same (wrong) reply, the k+1 correct processes will all produce the same (correct) answer; which of the two answers is correct cannot be ascertained directly, so we believe the majority, and the k+1 correct replies outvote the k wrong ones 14
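The replica counts above can be captured in a small helper; a sketch (the function name replicas_needed is ours, not from the slides):

```python
def replicas_needed(k, byzantine=False):
    """Minimum group size for k fault tolerance in a replicated-write system.

    Fail-silent (crash) faults: k + 1 replicas, so one survivor can answer.
    Byzantine faults: 2k + 1 replicas, so the k + 1 correct replies
    outvote the k (possibly identical) wrong ones.
    """
    return 2 * k + 1 if byzantine else k + 1

print(replicas_needed(2))                  # 3: tolerates 2 crash faults
print(replicas_needed(2, byzantine=True))  # 5: tolerates 2 Byzantine faults
```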
  • 163. Distributed Commit  atomic multicasting is an example of the more general problem known as distributed commit  in atomic multicasting, the operation is the delivery of a message  but the distributed commit problem involves having an operation performed by each member of a process group, or by none at all  there are three protocols: one-phase commit, two-phase commit, and three-phase commit  One-Phase Commit Protocol  a coordinator tells all other processes, called participants, whether or not to (locally) perform an operation  drawback: if one of the participants cannot perform the operation, there is no way to tell the coordinator; for example, due to violation of concurrency control constraints in distributed transactions 15
  • 164.  Two-Phase Commit Protocol (2PC)  it has two phases: voting phase and decision phase, each involving two steps  voting phase  the coordinator sends a VOTE_REQUEST message to all participants  each participant then sends a VOTE_COMMIT or VOTE_ABORT message depending on its local situation  decision phase  the coordinator collects all votes; if all vote to commit the transaction, it sends a GLOBAL_COMMIT message; if at least one participant sends VOTE_ABORT, it sends a GLOBAL_ABORT message  each participant that voted for a commit waits for the final reaction of the coordinator and commits or aborts 16
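The coordinator's rule in the decision phase can be sketched as follows, using the message names from the slides (the function itself is our illustration, not part of any standard library):

```python
def coordinator_decide(votes):
    """2PC decision phase: commit only if every participant voted to commit."""
    if votes and all(v == "VOTE_COMMIT" for v in votes):
        return "GLOBAL_COMMIT"
    return "GLOBAL_ABORT"  # at least one VOTE_ABORT (or a missing vote)

print(coordinator_decide(["VOTE_COMMIT", "VOTE_COMMIT"]))  # GLOBAL_COMMIT
print(coordinator_decide(["VOTE_COMMIT", "VOTE_ABORT"]))   # GLOBAL_ABORT
```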
  • 165. a) the finite state machine for the coordinator in 2PC b) the finite state machine for a participant 17
  • 166.  problems may occur in the event of failures  the coordinator and participants have states in which they block waiting for messages: INIT, READY, WAIT  when a process crashes, other processes may wait indefinitely  hence, timeout mechanisms are required  a participant waiting in its INIT state for VOTE_REQUEST from the coordinator aborts and sends VOTE_ABORT if it does not receive a vote request after some time  the coordinator blocking in state WAIT aborts and sends GLOBAL_ABORT if all votes have not been collected on time  a participant blocking in its READY state for the global vote cannot simply abort; instead it must find out which message the coordinator actually sent  by blocking until the coordinator recovers  or by requesting another participant, say Q 18
  • 167. actions taken by a participant P when residing in state READY and having contacted another participant Q 19
State of Q | Action by P | Comments
COMMIT | make transition to COMMIT | coordinator sent GLOBAL_COMMIT before crashing, but P didn’t receive it
ABORT | make transition to ABORT | coordinator sent GLOBAL_ABORT before crashing, but P didn’t receive it
INIT | make transition to ABORT | coordinator sent VOTE_REQUEST before crashing; P received it but Q did not
READY | contact another participant | if all are in state READY, wait until the coordinator recovers
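This decision table maps directly to a lookup; a sketch (the function name ready_state_action and the action strings are our own illustration):

```python
def ready_state_action(state_of_q):
    """Action taken by participant P, blocked in READY, after learning
    the state of another participant Q (2PC cooperative termination)."""
    actions = {
        "COMMIT": "transition to COMMIT",         # GLOBAL_COMMIT was lost
        "ABORT":  "transition to ABORT",          # GLOBAL_ABORT was lost
        "INIT":   "transition to ABORT",          # Q never got VOTE_REQUEST
        "READY":  "contact another participant",  # if all READY, wait for coordinator
    }
    return actions[state_of_q]

print(ready_state_action("INIT"))  # transition to ABORT
```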
  • 168.  a process (participant or coordinator) can recover from a crash if its state has been saved to persistent storage  actions by a participant after recovery: 20
State before crash | Action by the process after recovery
INIT | locally abort the transaction and inform the coordinator
COMMIT or ABORT | retransmit its decision to the coordinator
READY | cannot decide on its own what it should do next; has to contact other participants
 there are two critical states for the coordinator:
State before crash | Action by the coordinator after recovery
WAIT | retransmit the VOTE_REQUEST message
After the decision in the 2nd phase | retransmit the decision
  • 169. Recovery  fundamental to fault tolerance is recovery from an error  recall: an error is that part of a system that may lead to a failure  error recovery means to replace an erroneous state with an error-free state  two forms of error recovery: backward recovery and forward recovery  Backward Recovery  bring the system from its present erroneous state back into a previously correct state  for this, the system’s state must be recorded from time to time; each time a state is recorded, a checkpoint is said to be made  e.g., retransmitting lost or damaged packets in the implementation of reliable communication 21
  • 170.  most widely used, since it is a generally applicable method and can be integrated into the middleware layer of a distributed system  disadvantages:  checkpointing and restoring a process to its previous state are costly and a performance bottleneck  no guarantee can be given that the error will not recur, which may take an application into a loop of recovery  some actions may be irreversible; e.g., deleting a file, handing over cash to a customer  Forward Recovery  bring the system from its present erroneous state to a correct new state from which it can continue to execute  it has to be known in advance which errors may occur so as to correct those errors  e.g., erasure correction, where a lost or damaged packet is reconstructed from other successfully delivered packets
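Backward recovery via checkpointing can be sketched as below; a minimal illustration assuming the process state fits in a picklable dictionary (the helper names and the checkpoint file name are ours, not from the slides):

```python
import pickle

def save_checkpoint(state, path="checkpoint.bin"):
    """Record the current process state on stable storage (a checkpoint)."""
    with open(path, "wb") as f:
        pickle.dump(state, f)

def restore_checkpoint(path="checkpoint.bin"):
    """Backward recovery: roll back to the last recorded correct state."""
    with open(path, "rb") as f:
        return pickle.load(f)

state = {"balance": 100, "seq": 7}
save_checkpoint(state)           # checkpoint the correct state
state["balance"] = -999          # an erroneous state is reached
state = restore_checkpoint()     # roll back to the checkpoint
print(state)                     # {'balance': 100, 'seq': 7}
```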
  • 171. THANK YOU FOR YOUR ATTENTION 23