glideinWMS for users



    Matchmaking in glideinWMS
             in CMS
                     by Igor Sfiligoi (UCSD)




CERN, Dec 2012          glideinWMS matchmaking   1
Scope of this talk



This talk provides a high-level description of how glideinWMS matchmaking works in CMS.

The reader is expected to be familiar with the CMS experiment environment:
http://cms.web.cern.ch/
glideinWMS architecture
 ●   A reminder of the overall layout

[Diagram: the VO Frontend (VO FE) asks the Glidein Factories (G.F.) to provision glideins on Grid execute nodes; the Submit nodes (Schedd) and the Central manager (Negotiator) form the HTCondor pool that the glideins join.]
Two levels of matchmaking
 ●   First in the VO Frontend
      ●   To decide where to provision resources
      ●   i.e. where to send glideins
 ●   Then in the HTCondor Negotiator
      ●   To decide which Job gets the glidein Slot

The two must have compatible policies.

[Diagram: same architecture as the previous slide, with the VO FE and the Negotiator highlighted as the two matchmaking points.]
Defining the policy
 ●    The VO FE configures the glideins
       ●   So it can define the Slot Requirements
 ●    The preferred strategy is to leave all policy
      decisions in the VO FE's hands, i.e. both
       ●   VO FE matchmaking policy
       ●   HTCondor matchmaking policy
           (easier to keep them in sync this way)
 ●    This implies
       ●   Users should not define Job Requirements
       ●   Instead, publish attributes describing requirements
     http://www.slideshare.net/igor_sfiligoi/condor-week-12-attribute-matchmaking-move-req-out-of-user-hands


CMS Production @ CERN
                Policies




Description
 ●   The VO FE @ CERN serves
     the production needs
      ●   i.e. Reconstruction and MC production
 ●   Job submission is regulated by a service managed
     by a dedicated team, so jobs are
      ●   Targeted
      ●   Well behaved
          (at least by and large)
Matchmaking policy
 ●   Two dimensions
      ●   Grid Site
      ●   Single CPU vs HTPC
 ●   The actual policy is the AND of both
 ●   Both VO FE policy and HTCondor policy
     defined in the VO FE instance configuration




Matching on Grid site name
 ●   User Jobs are expected to publish the attribute
     DESIRED_Sites               (string list)
      ●   e.g. +DESIRED_Sites = "T2_DE_DESY,T2_US_UCSD"
 ●   The G.F. and the glideins advertise
     GLIDEIN_CMSSite
 ●   The matchmaking policy is
     GLIDEIN_CMSSite ∈ DESIRED_Sites
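This membership test can be sketched in Python (illustrative only: in the real system it is a ClassAd expression evaluated by the VO FE and the HTCondor Negotiator, and the function signature here is an assumption):

```python
def site_matches(glidein_cms_site, desired_sites):
    """Site policy: GLIDEIN_CMSSite must be one of the entries in the
    job's DESIRED_Sites attribute (a comma-separated string list)."""
    wanted = {s.strip() for s in desired_sites.split(",") if s.strip()}
    return glidein_cms_site in wanted
```

For example, a glidein at T2_US_UCSD matches a job declaring +DESIRED_Sites = "T2_DE_DESY,T2_US_UCSD", while one at T1_US_FNAL does not.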




Matching on Job Type
 ●   User Jobs can publish the attribute
     DESIRES_HTPC            (integer representation of a Boolean value)
      ●   e.g. +DESIRES_HTPC = 1
      ●   If not defined, defaults to 0
 ●   The G.F. and the glideins may advertise
     GLIDEIN_Is_HTPC          Boolean value

      ●   If not defined, defaults to False
 ●   The matchmaking policy is
     (GLIDEIN_Is_HTPC==True)==(DESIRES_HTPC==1)
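The equality-of-booleans policy above can be sketched as a toy Python function (the real evaluation happens in ClassAds; the defaults mirror the slide):

```python
def htpc_matches(glidein_is_htpc=False, desires_htpc=0):
    """Job-type policy: an HTPC (whole-node) job matches only an HTPC
    glidein, and a single-CPU job only a non-HTPC one. Defaults follow
    the slide: DESIRES_HTPC -> 0, GLIDEIN_Is_HTPC -> False when unset."""
    return (glidein_is_htpc is True) == (desires_htpc == 1)
```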


Example submit file


         Universe   = vanilla
         Executable = mcgen
         Arguments  = -k 1543.3
         Output     = mcgen.out
         Error      = mcgen.err
         Log        = mcgen.log
         +DESIRED_Sites = "T2_DE_DESY,T2_US_UCSD"
         +DESIRES_HTPC = 0
         Requirements = True
         Queue 1
CMS AnaOps @ UCSD
                      Policies




Description
 ●   VO FE @ UCSD serves CMS analysis users
 ●   User Jobs are much more chaotic
      ●   Most users don't really understand their needs
      ●   Must protect from accidental errors
      ●   Yet keep the system flexible
 ●   Net result
      ●   More complex policy



Two different policies
 ●   The AnaOps FE actually has two policies
      ●   The Regular policy
      ●   The Overflow policy
 ●   The Regular policy tries to match resources
      ●   Based on User desires
 ●   The Overflow policy “outsmarts” the Users
      ●   Will violate User desires without breaking the Jobs
      ●   The aim is to finish user jobs sooner
      ●   Users can opt out, if they wish
The Regular M.M. policy
 ●   Four+one dimensions
      ●   Grid Site
      ●   Single CPU vs HTPC
      ●   Memory usage
      ●   Job duration
      ●   Number of Job Starts
          (due to preemption)
 ●   The actual policy is the AND of all of them
 ●   Both VO FE policy and HTCondor policy
     defined in the VO FE instance configuration
Grid site selection
 ●   This is both similar and different compared to
     the Production FE @CERN
      ●   Serves the same purpose, but supports three
          different ways to select a site
           –     Due to historical evolution
 ●   The three options are
      ●   GLIDEIN_CMSSite ∈ DESIRED_Sites
      ●   GLIDEIN_SEs ∈ DESIRED_SEs
          (planning to extend to (GLIDEIN_SEs ∩ DESIRED_SEs) ≠ ∅)
      ●   GLIDEIN_Gatekeeper ∈ DESIRED_Gatekeepers
 ●   The actual policy is the OR of the three
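The OR of the three tests can be sketched in Python (the dict-based ads and attribute handling are a simplification of real ClassAds; attribute names follow the slides):

```python
def _csv_set(value):
    """Split a comma-separated attribute string into a set of names."""
    return {s.strip() for s in value.split(",") if s.strip()}

def regular_site_matches(glidein, job):
    """AnaOps Regular site policy: the OR of the three historical
    site-selection tests (site name, storage element, gatekeeper)."""
    return (
        glidein.get("GLIDEIN_CMSSite") in _csv_set(job.get("DESIRED_Sites", ""))
        or glidein.get("GLIDEIN_SEs") in _csv_set(job.get("DESIRED_SEs", ""))
        or glidein.get("GLIDEIN_Gatekeeper") in _csv_set(job.get("DESIRED_Gatekeepers", ""))
    )
```

A job may thus specify any one of the three attribute families and still match.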

Job type selection
 ●   Just like @ CERN




Memory Usage
●   Most Grid sites put strict limits on the amount of
    memory that can be used
    ●   Will kill glideins if they exceed the limit
●   G.F. and glideins advertise the Entry-specific limit
    GLIDEIN_MaxMemMBs
●   Jobs can explicitly declare the needed memory via
    request_memory   (native Condor attribute, no + needed)
     ●   Condor will also measure it at run time
          –   ImageSize – Virtual memory used
          –   ResidentSetSize – True memory usage
     ●   A combination of these is used to calculate the actual JobMemory
●   Policy: JobMemory <= GLIDEIN_MaxMemMBs
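A minimal Python sketch of this policy follows. Note the slide does not specify how the measurements are combined into JobMemory; taking the larger of ImageSize and ResidentSetSize (both reported by Condor in KB) is an illustrative assumption, not the production formula:

```python
def memory_matches(job, glidein_max_mem_mbs):
    """Memory policy: JobMemory <= GLIDEIN_MaxMemMBs. JobMemory comes
    from request_memory (MB) when the user set it; otherwise it is
    estimated here from Condor's run-time measurements (KB).
    The max-of-the-two combination is an assumption for this sketch."""
    if "request_memory" in job:
        job_memory_mb = job["request_memory"]
    else:
        measured_kb = max(job.get("ImageSize", 0), job.get("ResidentSetSize", 0))
        job_memory_mb = measured_kb // 1024
    return job_memory_mb <= glidein_max_mem_mbs
```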
Job Duration                  1/2




 ●   Glideins have a limited lifetime
      ●   Must fit within the limits of the Grid site's queue
      ●   Glideins publish the deadline
          GLIDEIN_ToDie
           –     Jobs must finish before reaching the deadline
 ●   Final user job lifetime is unpredictable
      ●   Depends on the type of computing done
      ●   Users should indicate the expected job lifetime
           –     Else we have to assume reasonable defaults
                 (not many users set these values right now)
Job Duration                  2/2




 ●   The same type of computation may take
     a different amount of time
      ●   e.g. Based on the type of input
 ●   Jobs can declare two attributes
      ●   NormMaxWallTimeMins – Expected limit
      ●   MaxWallTimeMins – Absolute max limit
 ●   The matchmaking logic is
      ●   Use NormMaxWallTimeMins for the first job startup
      ●   Use MaxWallTimeMins for all other startups
          (based on the simple assumption that the job was
          killed for hitting the deadline)
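Combining this with the GLIDEIN_ToDie deadline from the previous slide, the duration check can be sketched as follows (using NumJobStarts as the restart counter and Unix-timestamp arithmetic are assumptions for this sketch):

```python
import time

def duration_matches(job, glidein_to_die, now=None):
    """Duration policy: the job's declared walltime must fit before the
    glidein's GLIDEIN_ToDie deadline (a Unix timestamp). A job that has
    never started is judged by NormMaxWallTimeMins; one that already
    started (presumably killed at a deadline) by MaxWallTimeMins."""
    now = time.time() if now is None else now
    if job.get("NumJobStarts", 0) == 0:
        walltime_mins = job["NormMaxWallTimeMins"]
    else:
        walltime_mins = job["MaxWallTimeMins"]
    return now + walltime_mins * 60 <= glidein_to_die
```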

Cut on number of re-starts
 ●   Not really a user-configurable property
      ●   More of an emergency brake
 ●   In a properly configured system,
     should never be triggered
      ●   But unexpected problems happen
      ●   So better limit the damage




The Overflow Use case
 ●   User Jobs specify a list of sites,
     because the data they need is there
 ●   With recent versions of CMSSW, jobs can
     access the data remotely
      ●   With a small performance penalty
 ●   We can thus schedule jobs “anywhere”
      ●   As long as the needed data is
          at a Site that has joined the xrootd federation
      ●   But only if no CPU available “close to the data”
           –     And not too far, either
                      http://indico.cern.ch/contributionDisplay.py?contribId=381&sessionId=5&confId=149557
                      http://indico.cern.ch/contributionDisplay.py?contribId=232&sessionId=8&confId=149557

The Overflow M.M. policy
 ●   Violate only the “Site selection” rule
      ●   Keep all the others
 ●   Plus, add one+one more:
      ●   An opt-out mechanism
      ●   Delayed matching




New Site M.M. policy
 ●   The user-specified attribute is used
     to flag the job as “Overflowable”
      ●   i.e. the job will match if and only if
          (DESIRED_<site>s ∩ SUPPORTED_<site>s) ≠ ∅
          (still supporting all 3 types of site identification)
 ●   Matching jobs can then run on any glidein
      ●   Additional limits can be put in place by the FE,
          but mostly invisible to the user
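The intersection test can be sketched in Python (illustrative; the SUPPORTED_<site>s list would come from the FE configuration, and the comma-separated string format mirrors the earlier slides):

```python
def overflow_site_matches(desired_sites, supported_sites):
    """Overflow site policy: a job is 'Overflowable' iff its desired
    site list intersects the list of sites supported by the xrootd
    federation; the same test applies to the SE and gatekeeper
    variants. Both arguments are comma-separated strings."""
    desired = {s.strip() for s in desired_sites.split(",") if s.strip()}
    supported = {s.strip() for s in supported_sites.split(",") if s.strip()}
    return bool(desired & supported)
```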




The opt-out mechanism
 ●   The Overflow policy
     considers all jobs by default
      ●   But Users may want to opt out some of the Jobs
           –     Sometimes it is a genuine need
                 (e.g. to get deterministic results when testing a site)
 ●   To opt-out, the user defines
     +CMS_ALLOW_OVERFLOW = False
      ●   The FE will not consider such jobs for Overflowing




Delayed matching
 ●   As said initially,
     Jobs should preferentially run close to the data
      ●   Overflow should only consider jobs
          “that cannot find resources close to the data”
 ●   We implemented it based on time
      ●   Jobs are matched for Overflow only
          if they have been waiting in the queue for more than 6 hours
          (users cannot influence this)
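Together with the opt-out attribute from the previous slide, the eligibility gate can be sketched as (QDate is HTCondor's queue-entry timestamp; the dict-based job ad is a simplification of a real ClassAd):

```python
def overflow_eligible(job, now, min_wait_hours=6):
    """Delayed matching: consider a job for Overflow only after it has
    been idle in the queue for more than 6 hours, and only if the user
    did not opt out with +CMS_ALLOW_OVERFLOW = False."""
    if not job.get("CMS_ALLOW_OVERFLOW", True):
        return False
    return now - job["QDate"] > min_wait_hours * 3600
```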




Example submit file

 Universe   = vanilla
 Executable = myana
 Arguments  = -k 1543.3
 Output     = myana.out
 Error      = myana.err
 Log        = myana.log
 request_memory = 1500
 +DESIRED_SEs = "dc2-grid-64.brunel.ac.uk,stormfe1.pi.infn.it"
 +NormMaxWallTimeMins = 7200
 +MaxWallTimeMins = 14400
 +DESIRES_HTPC = 0
 +CMS_ALLOW_OVERFLOW = True
 Requirements = True
 Queue 1
The End




Pointers
 ●   glideinWMS Home Page
     http://tinyurl.com/glideinWMS
 ●   HTCondor Home Page
     http://research.cs.wisc.edu/htcondor/
 ●   HTCondor support
     htcondor-users@cs.wisc.edu
     htcondor-admin@cs.wisc.edu
 ●   glideinWMS support
     glideinwms-support@fnal.gov

Acknowledgments
 ●   The creation of this document was sponsored
     by grants from the US NSF and US DOE,
     and by the University of California system




