p2 p grid

In Grid Computing
 In a grid computing environment, resources are made available across geographically distributed locations.
 Because the resources are distributed, there is a chance of failure.
 Techniques for handling these failures are needed.

Abstract:
 To reduce the probability of failures, a checkpoint strategy is introduced.
 The checkpoint strategy works along with a timestamp.
 Delay in timestamp: a failure has occurred, and the checkpoint triggers a rollback mechanism to identify the origin and cause of the failure.
 The failure is rectified and the process is rescheduled to the correct node using SMF (Select Most Fitting Resource).

Peer-to-Peer (P2P) computing
 The robust availability of distributed resources is a very important issue.
 A P2P grid system has to guarantee correctness when faults occur.
 To improve reliability, a fault-tolerant mechanism is mandatory for such systems.
 Although extensive fault tolerance policies have been proposed, the underlying mechanisms were seldom taken into consideration.

Fault Tolerance Policy on Check
Point Monitoring (FTCPM)
 Here we propose a fault tolerance policy: FTCPM.
 To improve system reliability, FTCPM duplicates jobs onto different sites to tolerate failures.
 Checkpointing is the process of saving the state of a running application to stable storage, so that the saved state can be used to resume execution of the application after a failure.
 Resuming from a checkpoint instead of restarting the job reduces the execution time.
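
The save-and-resume idea above can be illustrated with a minimal sketch. It is not the FTCPM implementation: the function names (`save_checkpoint`, `load_checkpoint`, `run_job`), the use of a local file as "stable storage", and the simulated failure are all assumptions made for illustration.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write the job state to a temp file, then atomically replace the checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, path)  # a crash never leaves a half-written checkpoint


def load_checkpoint(path, default):
    """Return the saved state, or `default` if no checkpoint exists yet."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return default


def run_job(path, total_steps, fail_at=None):
    """Toy job: sums 0..total_steps-1, checkpointing after every step.

    `fail_at` is a hypothetical hook that simulates a node failure.
    """
    state = load_checkpoint(path, {"step": 0, "acc": 0})
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["acc"] += state["step"]
        state["step"] += 1
        save_checkpoint(path, state)
    return state["acc"]
```

If the first run fails partway, a second call to `run_job` with the same checkpoint path continues from the last saved step rather than from step 0, which is where the saving in execution time comes from.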

Previous Work:
Fault Tolerance policy on Dynamic Load Balance:
 Many approaches have been proposed to provide uninterrupted services.
 One fault tolerance mechanism is the redundant strategy, which uses redundant tasks to carry out concurrent computations on different nodes to maintain higher availability.
 When a failure occurs, the system takes a redundant copy of the failed job from another node.
 Such an approach wastes resources.

 In the P2P grid system:
 Each site has backup storage and two queues, named Ready Queue (REQ) and Running Queue (RQ).
 Each site regularly sends "heartbeat" messages to its neighbour site.
 When a neighbour site fails, the fault tolerance policy is triggered.
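
A heartbeat scheme of this kind can be sketched as follows. The slides do not say how a failed neighbour is detected, so the timeout rule here is an assumption, and `HeartbeatMonitor`, `record`, and `failed_sites` are hypothetical names.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat seen from each neighbour site."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout  # assumed: seconds of silence before a site counts as failed
        self.clock = clock      # injectable so the logic can be exercised without waiting
        self.last_seen = {}

    def record(self, site):
        """Call whenever a heartbeat message arrives from `site`."""
        self.last_seen[site] = self.clock()

    def failed_sites(self):
        """Sites whose last heartbeat is older than the timeout."""
        now = self.clock()
        return [s for s, t in self.last_seen.items() if now - t > self.timeout]
```

A site would call `failed_sites()` periodically and trigger the fault tolerance policy for each site returned.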

Proposed System:
Checkpointing Techniques:
 A new approach to the checkpoint technique is introduced.
 The checkpoint strategy helps to keep the job’s turnaround time stable.
 Timestamp delay: the rollback process takes place.
 The checkpointing mechanism reschedules the job to the correct node using the SMF (Select Most Fitting Resource) algorithm.
 This improves system reliability, achieves better job execution and a higher job completion rate, and thereby reduces the failure probability.

System Architecture:
 Organized in a layered architecture.
 The job allocation testing data
 The job allocation services are MM, EM, configure, FT, SMF, FT,
 Finding the fault-tolerant node is based on 3 mechanisms.

CP (check pointing) Algorithm:
 The algorithm schedules the checkpointing interval dynamically:
1. Schedule the initial checkpoint a short time period after the job starts.
2. The next schedule depends on the machine's mean failure interval; the interval is topped up: I_new = I_old + I_init.
3. If the condition (remaining execution time < mean failure interval) and (I_new < a * execution time) is false, the new checkpoint interval is decreased instead: I_new = I_old - I_init.
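
The interval update in steps 2 and 3 can be sketched as a small function. This is one reading of the rule, not a definitive implementation: the slides do not define the coefficient `a`, `I_init` is assumed to be the initial short period from step 1, and the lower bound of `I_init` on the result is an added assumption so that the interval never reaches zero.

```python
def next_checkpoint_interval(i_old, i_init, remaining_time,
                             mean_failure_interval, execution_time, a):
    """Return the next checkpoint interval following steps 2 and 3.

    All parameter names are illustrative; `a` is the coefficient from step 3.
    """
    i_new = i_old + i_init  # step 2: top up the interval
    condition = (remaining_time < mean_failure_interval
                 and i_new < a * execution_time)
    if condition:
        return i_new
    # step 3: condition is false, so shrink the interval instead.
    # Assumption: never go below the initial interval.
    return max(i_init, i_old - i_init)
```

For example, with `i_old=10`, `i_init=5`, a remaining time of 50, a mean failure interval of 100, an execution time of 200 and `a=0.5`, the condition holds and the interval grows to 15. With a remaining time of 150 the condition fails and the interval shrinks to 5.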

SMF Algorithm (Select Most Fitting
Resource)
1. Resources register with the GIS (Grid Information Service). Machines are grouped into L discrete levels based on node speed.
2. The user submits jobs with a resource requirement and a closeness factor.
3. Starting from the smallest level, the algorithm searches for an available node that meets the needs.
4. If a node is found, the job is assigned. Otherwise it searches upward for a free node, from level rj+1, rj+2, rj+3 to rL, and assigns the job to the first free node found.

5. If a node is found, the job is assigned. Otherwise it searches downward, from level rj-1, rj-2, rj-3 to r1, and assigns the job to the first free node found.
6. If the job is still not assigned, it waits in the queue.
7. When the job starts running, the algorithm schedules the initial checkpoint after a short time period.
8. During the first checkpointing interval, the executing job image is sent to the resource’s storage machine. Storage saves the job image and informs the resource upon completion.
9. The algorithm schedules the next checkpoint based on the remaining execution time and the machine failure interval information kept at the GIS.

12. During subsequent intervals, instead of the whole job image, the resource sends only the updated attributes/properties of the job to storage. Storage updates the job.
13. When the job is completed, it notifies the storage machine to remove the saved checkpoint to free storage space.
14. On machine failure, or when the resource is out of order, the storage machine asks the job’s corresponding user whether to load the saved checkpoint.
15. When the user receives the load checkpoint request, it uses SMF to choose a fitting free machine, in the same or another resource, to resume execution of the job.
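
The level search in steps 3 to 6 can be sketched as below. This is a minimal illustration under stated assumptions: rj is taken to be the level matching the job's requirement, the closeness factor is ignored, and `Node` and `select_most_fitting` are hypothetical names rather than part of any grid toolkit.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    level: int          # speed level, 1..L (step 1)
    free: bool = True


def select_most_fitting(nodes: List[Node], required_level: int,
                        num_levels: int) -> Optional[Node]:
    """Pick the first free node, trying rj, then rj+1..rL, then rj-1..r1.

    Returns None when no node is free, meaning the job waits in the queue (step 6).
    """
    upward = list(range(required_level, num_levels + 1))   # rj, rj+1, ..., rL
    downward = list(range(required_level - 1, 0, -1))      # rj-1, ..., r1
    for level in upward + downward:
        for node in nodes:
            if node.level == level and node.free:
                node.free = False  # mark the node as taken by this job
                return node
    return None
```

With three nodes at levels 1, 3 and 2, where the level-2 node is busy, a job requiring level 2 is first assigned to the level-3 node; a second such job falls back to the level-1 node; a third gets `None` and has to wait.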

 Gridlets creation
 Job Allocation (Testing Data)
 Check Point Controlling
 Applying Rollback Mechanism: the checkpoint triggers a rollback mechanism to identify the cause and origin of the failure.
 Finding Fault Tolerant Node in the Network

[Diagram blocks: Job Allocation (Testing Data); Time Stamp Event Tracking; Memory; Resource Allocation; Fault Tolerant File Transfer; Check Point Control; SMF (Select Most Fitting Resource); Find Fault Tolerant & Rollback; Output (Task Execution)]

Thank you!!!
 Thank you for giving us this GOLDEN opportunity!!!
