Tivoli Workload Scheduler LoadLeveler




Using and Administering
Version 3 Release 5




                                        SA22-7881-08
Note
  Before using this information and the product it supports, read the information in “Notices” on page 745.




Ninth Edition (November 2008)
This edition applies to version 3, release 5, modification 0 of IBM Tivoli Workload Scheduler LoadLeveler (product
numbers 5765-E69 and 5724-I23) and to all subsequent releases and modifications until otherwise indicated in new
editions. This edition replaces SA22-7881-07. Significant changes or additions to the text and illustrations are
indicated by a vertical line (|) to the left of the change.
IBM welcomes your comments. A form for readers’ comments may be provided at the back of this publication, or
you can send your comments to the address:
   International Business Machines Corporation
   Department 58HA, Mail Station P181
   2455 South Road
   Poughkeepsie, NY 12601-5400
   United States of America

   FAX (United States & Canada): 1+845+432-9405
   FAX (Other Countries):
     Your International Access Code +1+845+432-9405

   IBMLink™ (United States customers only): IBMUSM10(MHVRCFS)
   Internet e-mail: mhvrcfs@us.ibm.com
If you want a reply, be sure to include your name, address, and telephone or FAX number.
Make sure to include the following in your comment or note:
v Title and order number of this publication
v Page number or topic related to your comment
When you send information to IBM, you grant IBM a nonexclusive right to use or distribute the information in any
way it believes appropriate without incurring any obligation to you.
©Copyright 1986, 1987, 1988, 1989, 1990, 1991 by the Condor Design Team.
©Copyright International Business Machines Corporation 1986, 2008. All rights reserved. US Government Users
Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
Contents

Figures . . . ix
Tables . . . xi

About this information . . . xiii
  Who should use this information . . . xiii
  Conventions and terminology used in this information . . . xiii
  Prerequisite and related information . . . xiv
  How to send your comments . . . xv

Summary of changes . . . xvii

Part 1. Overview of TWS LoadLeveler concepts and operation . . . 1

Chapter 1. What is LoadLeveler? . . . 3
  LoadLeveler basics . . . 4
  LoadLeveler: A network job management and scheduling system . . . 4
    Job definition . . . 5
    Machine definition . . . 6
  How LoadLeveler schedules jobs . . . 7
  How LoadLeveler daemons process jobs . . . 8
    The master daemon . . . 9
    The Schedd daemon . . . 10
    The startd daemon . . . 11
    The negotiator daemon . . . 13
    The kbdd daemon . . . 14
    The gsmonitor daemon . . . 14
  The LoadLeveler job cycle . . . 16
    LoadLeveler job states . . . 19
  Consumable resources . . . 22
    Consumable resources and AIX Workload Manager . . . 24
  Overview of reservations . . . 25
  Fair share scheduling overview . . . 27

Chapter 2. Getting a quick start using the default configuration . . . 29
  What you need to know before you begin . . . 29
  Using the default configuration files . . . 29
  LoadLeveler for Linux quick start . . . 30
    Quick installation . . . 30
    Quick configuration . . . 30
    Quick verification . . . 30
  Post-installation considerations . . . 31
    Starting LoadLeveler . . . 31
    Location of directories following installation . . . 32

Chapter 3. What operating systems are supported by LoadLeveler? . . . 35
  LoadLeveler for AIX and LoadLeveler for Linux compatibility . . . 35
    Restrictions for LoadLeveler for Linux . . . 36
    Features not supported in LoadLeveler for Linux . . . 36
    Restrictions for LoadLeveler for AIX and LoadLeveler for Linux mixed clusters . . . 37

Part 2. Configuring and managing the TWS LoadLeveler environment . . . 39

Chapter 4. Configuring the LoadLeveler environment . . . 41
  Modifying a configuration file . . . 42
  Defining LoadLeveler administrators . . . 43
  Defining a LoadLeveler cluster . . . 44
    Choosing a scheduler . . . 44
    Setting negotiator characteristics and policies . . . 45
    Specifying alternate central managers . . . 46
    Defining network characteristics . . . 47
    Specifying file and directory locations . . . 47
    Configuring recording activity and log files . . . 48
    Setting up file system monitoring . . . 54
  Defining LoadLeveler machine characteristics . . . 54
    Defining job classes that a LoadLeveler machine will accept . . . 55
    Specifying how many jobs a machine can run . . . 55
  Defining security mechanisms . . . 56
    Configuring LoadLeveler to use cluster security services . . . 57
  Defining usage policies for consumable resources . . . 60
  Enabling support for bulk data transfer and rCxt blocks . . . 61
  Gathering job accounting data . . . 61
    Collecting job resource data on serial and parallel jobs . . . 62
|   Collecting accounting information for recurring jobs . . . 63
    Collecting accounting data for reservations . . . 63
    Collecting job resource data based on machines . . . 64
    Collecting job resource data based on events . . . 64
    Collecting job resource information based on user accounts . . . 65
    Collecting the accounting information and storing it into files . . . 66
    Producing accounting reports . . . 66
    Correlating AIX and LoadLeveler accounting records . . . 66
    64-bit support for accounting functions . . . 67
    Example: Setting up job accounting files . . . 67
  Managing job status through control expressions . . . 68
    How control expressions affect jobs . . . 69
  Tracking job processes . . . 70
  Querying multiple LoadLeveler clusters . . . 71
  Handling switch-table errors . . . 72
  Providing additional job-processing controls through installation exits . . . 72
    Controlling the central manager scheduling cycle . . . 73
    Handling DCE security credentials . . . 74
    Handling an AFS token . . . 75
    Filtering a job script . . . 76
    Writing prolog and epilog programs . . . 77
    Using your own mail program . . . 81

Chapter 5. Defining LoadLeveler resources to administer . . . 83
  Steps for modifying an administration file . . . 83
  Defining machines . . . 84
    Planning considerations for defining machines . . . 85
    Machine stanza format and keyword summary . . . 86
    Examples: Machine stanzas . . . 86
  Defining adapters . . . 86
    Configuring dynamic adapters . . . 87
    Configuring InfiniBand adapters . . . 87
    Adapter stanza format and keyword summary . . . 88
    Examples: Adapter stanzas . . . 89
  Defining classes . . . 89
    Using limit keywords . . . 89
    Allowing users to use a class . . . 92
    Class stanza format and keyword summary . . . 92
    Examples: Class stanzas . . . 93
  Defining user substanzas in class stanzas . . . 94
    Examples: Substanzas . . . 95
  Defining users . . . 97
    User stanza format and keyword summary . . . 97
    Examples: User stanzas . . . 98
  Defining groups . . . 99
    Group stanza format and keyword summary . . . 99
    Examples: Group stanzas . . . 99
  Defining clusters . . . 100
    Cluster stanza format and keyword summary . . . 100
    Examples: Cluster stanzas . . . 100

Chapter 6. Performing additional administrator tasks . . . 103
  Setting up the environment for parallel jobs . . . 104
    Scheduling considerations for parallel jobs . . . 104
    Steps for reducing job launch overhead for parallel jobs . . . 105
    Steps for allowing users to submit interactive POE jobs . . . 106
    Setting up a class for parallel jobs . . . 106
|   Striping when some networks fail . . . 107
    Setting up a parallel master node . . . 108
    Configuring LoadLeveler to support MPICH jobs . . . 108
    Configuring LoadLeveler to support MVAPICH jobs . . . 108
    Configuring LoadLeveler to support MPICH-GM jobs . . . 109
  Using the BACKFILL scheduler . . . 110
    Tips for using the BACKFILL scheduler . . . 112
    Example: BACKFILL scheduling . . . 113
| Data staging . . . 113
|   Configuring LoadLeveler to support data staging . . . 114
  Using an external scheduler . . . 115
    Replacing the default LoadLeveler scheduling algorithm with an external scheduler . . . 116
    Customizing the configuration file to define an external scheduler . . . 118
    Steps for getting information about the LoadLeveler cluster, its machines, and jobs . . . 118
    Assigning resources and dispatching jobs . . . 122
  Example: Changing scheduler types . . . 126
  Preempting and resuming jobs . . . 126
    Overview of preemption . . . 127
    Planning to preempt jobs . . . 128
    Steps for configuring a scheduler to preempt jobs . . . 130
  Configuring LoadLeveler to support reservations . . . 131
    Steps for configuring reservations in a LoadLeveler cluster . . . 132
  Steps for integrating LoadLeveler with the AIX Workload Manager . . . 137
  LoadLeveler support for checkpointing jobs . . . 139
    Checkpoint keyword summary . . . 139
    Planning considerations for checkpointing jobs . . . 140
    AIX checkpoint and restart limitations . . . 141
    Naming checkpoint files and directories . . . 145
    Removing old checkpoint files . . . 146
  LoadLeveler scheduling affinity support . . . 146
    Configuring LoadLeveler to use scheduling affinity . . . 147
  LoadLeveler multicluster support . . . 148
    Configuring a LoadLeveler multicluster . . . 150
|   Scale-across scheduling with multiclusters . . . 153
  LoadLeveler Blue Gene support . . . 155
    Configuring LoadLeveler Blue Gene support . . . 157
    Blue Gene reservation support . . . 159
    Blue Gene fair share scheduling support . . . 159
    Blue Gene heterogeneous memory support . . . 160
    Blue Gene preemption support . . . 160
    Blue Gene/L HTC partition support . . . 160
  Using fair share scheduling . . . 160
    Fair share scheduling keywords . . . 161
    Reconfiguring fair share scheduling keywords . . . 163
    Example: three groups share a LoadLeveler cluster . . . 164
    Example: two thousand students share a LoadLeveler cluster . . . 165
    Querying information about fair share scheduling . . . 166
    Resetting fair share scheduling . . . 166
    Saving historic data . . . 166
    Restoring saved historic data . . . 167
  Procedure for recovering a job spool . . . 167

Chapter 7. Using LoadLeveler’s GUI to perform administrator tasks . . . 169
  Job-related administrative actions . . . 169
  Machine-related administrative actions . . . 172

Part 3. Submitting and managing TWS LoadLeveler jobs . . . 177

Chapter 8. Building and submitting jobs . . . 179
  Building a job command file . . . 179
    Using multiple steps in a job command file . . . 180
    Examples: Job command files . . . 181
  Editing job command files . . . 185
  Defining resources for a job step . . . 185
| Submitting jobs requesting data staging . . . 186
  Working with coscheduled job steps . . . 187
    Submitting coscheduled job steps . . . 187
    Determining priority for coscheduled job steps . . . 187
    Supporting preemption of coscheduled job steps . . . 187
    Coscheduled job steps and commands and APIs . . . 188
    Termination of coscheduled steps . . . 188
  Using bulk data transfer . . . 188
  Preparing a job for checkpoint/restart . . . 190
  Preparing a job for preemption . . . 193
  Submitting a job command file . . . 193
    Submitting a job using a submit-only machine . . . 194
  Working with parallel jobs . . . 194
    Step for controlling whether LoadLeveler copies environment variables to all executing nodes . . . 195
    Ensuring that parallel jobs in a cluster run on the correct levels of PE and LoadLeveler software . . . 195
    Task-assignment considerations . . . 196
    Submitting jobs that use striping . . . 198
    Running interactive POE jobs . . . 203
    Running MPICH, MVAPICH, and MPICH-GM jobs . . . 204
    Examples: Building parallel job command files . . . 207
    Obtaining status of parallel jobs . . . 212
    Obtaining allocated host names . . . 212
  Working with reservations . . . 213
    Understanding the reservation life cycle . . . 214
    Creating new reservations . . . 216
    Submitting jobs to run under a reservation . . . 218
    Removing bound jobs from the reservation . . . 220
    Querying existing reservations . . . 221
    Modifying existing reservations . . . 221
    Canceling existing reservations . . . 222
  Submitting jobs requesting scheduling affinity . . . 222
  Submitting and monitoring jobs in a LoadLeveler multicluster . . . 223
    Steps for submitting jobs in a LoadLeveler multicluster environment . . . 224
  Submitting and monitoring Blue Gene jobs . . . 226

Chapter 9. Managing submitted jobs . . . 229
  Querying the status of a job . . . 229
  Working with machines . . . 230
  Displaying currently available resources . . . 230
  Setting and changing the priority of a job . . . 230
    Example: How does a job’s priority affect dispatching order? . . . 231
  Placing and releasing a hold on a job . . . 232
  Canceling a job . . . 232
  Checkpointing a job . . . 232

Chapter 10. Example: Using commands to build, submit, and manage jobs . . . 235

Chapter 11. Using LoadLeveler’s GUI to build, submit, and manage jobs . . . 237
  Building jobs . . . 237
  Editing the job command file . . . 249
  Submitting a job command file . . . 250
  Displaying and refreshing job status . . . 251
  Sorting the Jobs window . . . 252
  Changing the priority of your jobs . . . 253
  Placing a job on hold . . . 253
  Releasing the hold on a job . . . 253
  Canceling a job . . . 254
  Modifying consumable resources and other job attributes . . . 254
  Taking a checkpoint . . . 254
  Adding a job to a reservation . . . 255
  Removing a job from a reservation . . . 255
  Displaying and refreshing machine status . . . 255
  Sorting the Machines window . . . 257
  Finding the location of the central manager . . . 257
  Finding the location of the public scheduling machines . . . 258
  Finding the type of scheduler in use . . . 258
  Specifying which jobs appear in the Jobs window . . . 258
  Specifying which machines appear in Machines window . . . 259
  Saving LoadLeveler messages in a file . . . 259

Part 4. TWS LoadLeveler interfaces reference . . . 261

Chapter 12. Configuration file reference . . . 263
  Configuration file syntax . . . 263
    Numerical and alphabetical constants . . . 264
    Mathematical operators . . . 264
    64-bit support for configuration file keywords and expressions . . . 264
  Configuration file keyword descriptions . . . 265
  User-defined keywords . . . 313
  LoadLeveler variables . . . 314
    Variables to use for setting dates . . . 319
    Variables to use for setting times . . . 320

Chapter 13. Administration file reference . . . 321
  Administration file structure and syntax . . . 321
    Stanza characteristics . . . 323
    Syntax for limit keywords . . . 324
    64-bit support for administration file keywords . . . 325
  Administration file keyword descriptions . . . 327

Chapter 14. Job command file reference . . . 357
  Job command file syntax . . . 357
    Serial job command file . . . 357
    Parallel job command file . . . 358
    Syntax for limit keywords . . . 358
    64-bit support for job command file keywords . . . 358
  Job command file keyword descriptions . . . 359
    Job command file variables . . . 399
    Run-time environment variables . . . 400
    Job command file examples . . . 401

Chapter 15. Graphical user interface (GUI) reference . . . 403
  Starting the GUI . . . 403
    Specifying GUI options . . . 404
    The LoadLeveler main window . . . 404
    Getting help using the GUI . . . 405
    Differences between LoadLeveler’s GUI and other graphical user interfaces . . . 406
    GUI typographic conventions . . . 406
    64-bit support for the GUI . . . 407
  Customizing the GUI . . . 407
    Syntax of an Xloadl file . . . 407
    Modifying windows and buttons . . . 408
    Creating your own pull-down menus . . . 409
    Customizing fields on the Jobs window and the Machines window . . . 409
    Modifying help panels . . . 410

Chapter 16. Commands . . . 411
  llacctmrg - Collect machine history files . . . 413
  llbind - Bind job steps to a reservation . . . 415
  llcancel - Cancel a submitted job . . . 421
  llchres - Change attributes of a reservation . . . 424
  llckpt - Checkpoint a running job step . . . 430
  llclass - Query class information . . . 433
  llclusterauth - Generates public and private keys . . . 438
  llctl - Control LoadLeveler daemons . . . 439
  llextRPD - Extract data from an RSCT peer domain . . . 443
  llfavorjob - Reorder system queue by job . . . 447
  llfavoruser - Reorder system queue by user . . . 449
  llfs - Fair share scheduling queries and operations . . . 450
  llhold - Hold or release a submitted job . . . 454
  llinit - Initialize machines in the LoadLeveler cluster . . . 457
  llmkres - Make a reservation . . . 459
  llmodify - Change attributes of a submitted job step . . . 464
  llmovejob - Move a single idle job from the local cluster to another cluster . . . 470
  llmovespool - Move job records . . . 472
  llpreempt - Preempt a submitted job step . . . 474
  llprio - Change the user priority of submitted job steps . . . 477
  llq - Query job status . . . 479
  llqres - Query a reservation . . . 500
  llrmres - Cancel a reservation . . . 508
  llrunscheduler - Run the central manager’s scheduling algorithm . . . 511
  llstatus - Query machine status . . . 512
  llsubmit - Submit a job . . . 531
  llsummary - Return job resource information for accounting . . . 535

Chapter 17. Application programming interfaces (APIs) . . . 541
  64-bit support for the LoadLeveler APIs . . . 543
    LoadLeveler for AIX APIs . . . 543
    LoadLeveler for Linux APIs . . . 544
  Accounting API . . . 544
    GetHistory subroutine . . . 545
    llacctval user exit . . . 547
  Checkpointing API . . . 548
    ckpt subroutine . . . 549
    ll_ckpt subroutine . . . 550
    ll_init_ckpt subroutine . . . 553
    ll_set_ckpt_callbacks subroutine . . . 555
    ll_unset_ckpt_callbacks subroutine . . . 556
  Configuration API . . . 557
    ll_config_changed subroutine . . . 558
    ll_read_config subroutine . . . 559
  Data access API . . . 560
    Using the data access API . . . 560
    Understanding the LoadLeveler data access object model . . . 561
    Understanding the Blue Gene object model . . . 562
    Understanding the Class object model . . . 562
    Understanding the Cluster object model . . . 563
    Understanding the Fairshare object model . . . 563
    Understanding the Job object model . . . 564
    Understanding the Machine object model . . . 565
    Understanding the MCluster object model . . . 566
    Understanding the Reservations object model . . . 566
    Understanding the Wlmstat object model . . . 567
    ll_deallocate subroutine . . . 568
    ll_free_objs subroutine . . . 569
    ll_get_data subroutine . . . 570
    ll_get_objs subroutine . . . 624
    ll_next_obj subroutine . . . 627
    ll_query subroutine . . . 628
    ll_reset_request subroutine . . . 629
    ll_set_request subroutine . . . 630
    Examples of using the data access API . . . 633
  Error handling API . . . 639
    ll_error subroutine . . . 640
  Fair share scheduling API . . . 641
    ll_fair_share subroutine . . . 642
  Reservation API . . . 643
    ll_bind subroutine . . . 645
    ll_change_reservation subroutine . . . 648
    ll_init_reservation_param subroutine . . . 652
    ll_make_reservation subroutine . . . 653
    ll_remove_reservation subroutine . . . 658
|   ll_remove_reservation_xtnd subroutine . . . 660
  Submit API . . . 663
    llfree_job_info subroutine . . . 664
    llsubmit subroutine . . . 665
    monitor_program user exit . . . 667
  Workload management API . . . 668
    ll_cluster subroutine . . . 669
    ll_cluster_auth subroutine . . . 671
    ll_control subroutine . . . 673
    ll_modify subroutine . . . 677
    ll_move_job subroutine . . . 681
    ll_move_spool subroutine . . . 683
    ll_preempt subroutine . . . 686
    ll_preempt_jobs subroutine . . . 688
    ll_run_scheduler subroutine . . . 691
    ll_start_job_ext subroutine . . . 692
    ll_terminate_job subroutine . . . 696

Appendix A. Troubleshooting LoadLeveler . . . 699
  Frequently asked questions . . . 699
    Why won’t LoadLeveler start? . . . 700
    Why won’t my job run? . . . 700
    Why won’t my parallel job run? . . . 703
    Why won’t my checkpointed job restart? . . . 704
    Why won’t my submit-only job run? . . . 705
    Why won’t my job run on a cluster with both AIX and Linux machines? . . . 705
|   Why won’t my job run when scheduling affinity is enabled on x86 and x86_64 systems? . . . 705
    Why does a job stay in the Pending (or Starting) state? . . . 706
    What happens to running jobs when a machine goes down? . . . 706
    Why won’t my jobs run that were directed to an idle pool? . . . 708
    What happens if the central manager isn’t operating? . . . 708
    How do I recover resources allocated by a Schedd machine? . . . 710
    Why can’t I find a core file on Linux? . . . 710
    Why am I seeing inconsistencies in my llfs output? . . . 711
    Why don’t I see my job when I issue the llq command? . . . 711
    What happens if errors are found in my configuration or administration file? . . . 711
    Other questions . . . 712
  Troubleshooting in a multicluster environment . . . 714
    How do I determine if I am in a multicluster environment? . . . 714
    How do I determine how my multicluster environment is defined and what are the inbound and outbound hosts defined for each cluster? . . . 714
    Why is my multicluster environment not enabled? . . . 714
    How do I find log messages from my multicluster-defined installation exits? . . . 715
    Why won’t my remote job be submitted or moved? . . . 715
    Why did the CLUSTER_REMOTE_JOB_FILTER not update the job with all of the statements I defined? . . . 716
    How do I find my remote job? . . . 716
    Why won’t my remote job run? . . . 717
    Why does llq -X all show no jobs running when there are jobs running? . . . 717
  Troubleshooting in a Blue Gene environment . . . 717
    Why do all of my Blue Gene jobs fail even though llstatus shows that Blue Gene is present? . . . 718
    Why does llstatus show that Blue Gene is absent? . . . 718
    Why did my Blue Gene job fail when the job was submitted to a remote cluster? . . . 718
|   Why does llmkres or llchres return ″Insufficient resources to meet the request″ for a Blue Gene reservation when resources appear to be available? . . . 719
  Helpful hints . . . 719
    Scaling considerations . . . 719
    Hints for running jobs . . . 720
    Hints for using machines . . . 723
    History files and Schedd . . . 724
  Getting help from IBM . . . 724

Appendix B. Sample command output . . . 725
  llclass -l command output listing . . . 725
  llq -l command output listing . . . 727
  llq -l command output listing for a Blue Gene enabled system . . . 729
  llq -l -x command output listing . . . 730
  llstatus -l command output listing . . . 733
  llstatus -l -b command output listing . . . 733
  llstatus -B command output listing . . . 735
  llstatus -P command output listing . . . 736
  llsummary -l -x command output listing . . . 736
  llsummary -l -x command output listing for a Blue Gene-enabled system . . . 738

Appendix C. LoadLeveler port usage . . . 741

Accessibility features for TWS LoadLeveler . . . 743
  Accessibility features . . . 743
  Keyboard navigation . . . 743
  IBM and accessibility . . . 743

Notices . . . 745
  Trademarks . . . 746

Glossary . . . 749

Index . . . 753
Figures

 1. Example of a LoadLeveler cluster . . . 3
 2. LoadLeveler job steps . . . 5
 3. Multiple roles of machines . . . 7
 4. High-level job flow . . . 16
 5. Job is submitted to LoadLeveler . . . 17
 6. LoadLeveler authorizes the job . . . 17
 7. LoadLeveler prepares to run the job . . . 18
 8. LoadLeveler starts the job . . . 18
 9. LoadLeveler completes the job . . . 19
10. How control expressions affect jobs . . . 70
11. Format of a machine stanza . . . 86
12. Format of an adapter stanza . . . 88
13. Format of a class stanza . . . 93
14. Format of a user substanza . . . 95
15. Format of a user stanza . . . 98
16. Format of a group stanza . . . 99
17. Format of a cluster stanza . . . 100
18. Multicluster Example . . . 101
19. Job command file with multiple steps . . . 181
20. Job command file with multiple steps and one executable . . . 181
21. Job command file with varying input statements . . . 182
22. Using LoadLeveler variables in a job command file . . . 183
23. Job command file used as the executable . . . 185
24. Striping over multiple networks . . . 200
25. Striping over a single network . . . 202
26. POE job command file – multiple tasks per node . . . 207
27. POE sample job command file – invoking POE twice . . . 208
28. MPICH job command file - sample 1 . . . 208
29. MPICH job command file - sample 2 . . . 209
30. MPICH-GM job command file - sample 1 . . . 210
31. MPICH-GM job command file - sample 2 . . . 210
32. MVAPICH job command file - sample 1 . . . 211
33. MVAPICH job command file - sample 2 . . . 212
34. Using LOADL_PROCESSOR_LIST in a shell script . . . 213
35. Building a job command file . . . 235
36. LoadLeveler build a job window . . . 238
37. Format of administration file stanzas . . . 322
38. Format of administration file substanzas . . . 322
39. Sample administration file stanzas . . . 322
40. Sample administration file stanza with user substanzas . . . 323
41. Serial job command file . . . 358
42. Main window of the LoadLeveler GUI . . . 405
43. Creating a new pull-down menu . . . 409
44. TWS LoadLeveler Blue Gene object model . . . 562
45. TWS LoadLeveler Class object model . . . 563
46. TWS LoadLeveler Cluster object model . . . 563
47. TWS LoadLeveler Fairshare object model . . . 563
48. TWS LoadLeveler Job object model . . . 565
49. TWS LoadLeveler Machine object model . . . 566
50. TWS LoadLeveler MCluster object model . . . 566
51. TWS LoadLeveler Reservations object model . . . 566
52. TWS LoadLeveler Wlmstat object model . . . 567
53. When the primary central manager is unavailable . . . 709
54. Multiple central managers . . . 709
Tables

  1. Summary of typographic conventions . . . xiv
  2. Major topics in TWS LoadLeveler: Using and Administering . . . 1
  3. Topics in the TWS LoadLeveler overview . . . 3
  4. LoadLeveler daemons . . . 8
  5. startd determines whether its own state permits a new job to run . . . 12
  6. Job state descriptions and abbreviations . . . 20
  7. Location and description of product directories following installation . . . 33
  8. Location and description of directories for submit-only LoadLeveler . . . 33
  9. Roadmap of tasks for TWS LoadLeveler administrators . . . 41
 10. Roadmap of administrator tasks related to using or modifying the LoadLeveler configuration file . . . 42
 11. Roadmap for defining LoadLeveler cluster characteristics . . . 44
 12. Default locations for all of the files and directories . . . 47
 13. Log control statements . . . 49
 14. Roadmap of configuration tasks for securing LoadLeveler operations . . . 57
 15. Roadmap of tasks for gathering job accounting data . . . 62
 16. Collecting account data - modifying the configuration file . . . 67
 17. Roadmap of administrator tasks accomplished through installation exits . . . 72
 18. Roadmap of tasks for modifying the LoadLeveler administration file . . . 83
 19. Types of limit keywords . . . 90
 20. Enforcing job step limits . . . 91
 21. Setting limits . . . 92
 22. Roadmap of additional administrator tasks . . . 103
 23. Roadmap of BACKFILL scheduler tasks . . . 111
 24. Roadmap of tasks for using an external scheduler . . . 116
 25. Effect of LoadLeveler keywords under an external scheduler . . . 116
 26. Roadmap of tasks for using preemption . . . 127
 27. Preemption methods for which LoadLeveler automatically resumes preempted jobs . . . 129
 28. Preemption methods for which administrator or user intervention is required . . . 130
 29. Roadmap of reservation tasks for administrators . . . 132
 30. Roadmap of tasks for checkpointing jobs . . . 139
 31. Deciding where to define the directory for staging executables . . . 141
 32. Multicluster support subtasks and associated instructions . . . 149
 33. Multicluster support related topics . . . 149
 34. Subtasks for configuring a LoadLeveler multicluster . . . 150
| 35. Keywords for configuring scale-across scheduling . . . 154
 36. IBM System Blue Gene Solution documentation . . . 156
 37. Blue Gene subtasks and associated instructions . . . 157
 38. Blue Gene related topics and associated information . . . 157
 39. Blue Gene configuring subtasks and associated instructions . . . 157
 40. Learning about building and submitting jobs . . . 179
 41. Roadmap of user tasks for building and submitting jobs . . . 179
 42. Standard files for the five job steps . . . 182
 43. Checkpoint configurations . . . 191
| 44. Valid combinations of task assignment keywords are listed in each column . . . 196
 45. node and total_tasks . . . 196
 46. Blocking . . . 197
 47. Unlimited blocking . . . 198
 48. Roadmap of tasks for reservation owners and users . . . 213
 49. Reservation states, abbreviations, and usage notes . . . 214
 50. Instructions for submitting a job to run under a reservation . . . 219
 51. Submitting and monitoring jobs in a LoadLeveler multicluster . . . 224
 52. Roadmap of user tasks for managing submitted jobs . . . 229
 53. How LoadLeveler handles job priorities . . . 231
 54. User tasks available through the GUI . . . 237
 55. GUI fields and input . . . 239
 56. Nodes dialog box . . . 243
 57. Network dialog box fields . . . 244
 58. Build a job dialog box fields . . . 245
 59. Limits dialog box fields . . . 247
 60. Checkpointing dialog box fields . . . 248
 61. Blue Gene job fields . . . 248
 62. Modifying the job command file with the Edit pull-down menu . . . 249
 63. Modifying the job command file with the Tools pull-down menu . . . 250
 64. Saving and submitting information . . . 250
 65. Sorting the jobs window . . . 252
 66. Sorting the machines window . . . 257
 67. Specifying which jobs appear in the Jobs window . . . 258
 68. Specifying which machines appear in Machines window . . . 259
 69. Configuration subtasks . . . 263
 70. BG_MIN_PARTITION_SIZE values . . . 268
 71. Administration file subtasks . . . 321
 72. Notes on 64-bit support for administration file keywords . . . 325
 73. Summary of possible values set for the env_copy keyword in the administration file . . . 335
 74. Sample user and group settings for the max_reservations keyword . . . 345
 75. Job command file subtasks . . . 357
 76. Notes on 64-bit support for job command file keywords . . . 358
 77. mcm_affinity_options default values . . . 381
 78. Example of a selection table . . . 406
 79. Decision table . . . 407
 80. Decision table actions . . . 407
 81. Window identifiers in the Xloadl file . . . 408
 82. Resource variables for all the windows and the buttons . . . 408
 83. Modifying help panels . . . 410
 84. LoadLeveler command summary . . . 411
 85. llmodify options and keywords . . . 468
 86. LoadLeveler API summary . . . 541
 87. BLUE_GENE specifications for ll_get_data subroutine . . . 571
 88. CLASSES specifications for ll_get_data subroutine . . . 576
 89. CLUSTERS specifications for ll_get_data subroutine . . . 580
 90. FAIRSHARE specifications for ll_get_data subroutine . . . 582
 91. JOBS specifications for ll_get_data subroutine . . . 583
 92. MACHINES specifications for ll_get_data subroutine . . . 614
 93. MCLUSTERS specifications for ll_get_data subroutine . . . 619
 94. RESERVATIONS specifications for ll_get_data subroutine . . . 620
 95. WLMSTAT specifications for ll_get_data subroutine . . . 622
 96. query_daemon summary . . . 624
 97. query_flags summary . . . 630
 98. object_filter value related to the query flags value . . . 631
 99. enum LL_reservation_data type . . . 649
100. How nodes should be arranged in the node list . . . 694
101. Why your job might not be running . . . 700
102. Why your job might not be running . . . 703
103. Troubleshooting running jobs when a machine goes down . . . 706
104. LoadLeveler default port usage . . . 741
About this information
                  IBM® Tivoli® Workload Scheduler (TWS) LoadLeveler® provides various ways of
                  scheduling and managing applications for best performance and most efficient use
                  of resources. LoadLeveler manages both serial and parallel jobs over a cluster of
                  machines or servers, which may be desktop workstations, dedicated servers, or
                  parallel machines. This information describes how to configure and administer this
                  LoadLeveler cluster environment, and how to submit and manage jobs that run on
                  machines in the cluster.

    Who should use this information
                  This information is intended for two separate audiences:
                  v Personnel who are responsible for installing, configuring and managing the
                    LoadLeveler cluster environment. These people are called LoadLeveler
                    administrators. LoadLeveler administrative tasks include:
                    – Setting up configuration and administration files (a minimal sketch of
                      both files follows this list)
                    – Maintaining the LoadLeveler product
                    – Setting up the distributed environment for allocating batch jobs
                  v Users who submit and manage serial and parallel jobs to run in the LoadLeveler
                    cluster.
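
                  As a minimal illustration of the first administrative task above, the
                  following sketch suggests the general shape of the two files an
                  administrator maintains. The host names and values shown are hypothetical;
                  the actual keywords and stanza syntax are documented in Chapter 12,
                  “Configuration file reference,” and Chapter 13, “Administration file
                  reference.”

                     # LoadL_config (configuration file): hypothetical values
                     LOADL_ADMIN    = loadl         # login names of LoadLeveler administrators
                     SCHEDULER_TYPE = LL_DEFAULT    # which scheduler the cluster uses

                     # LoadL_admin (administration file): one stanza per resource
                     node01: type = machine
                             central_manager = true # this machine runs the negotiator daemon
                     node02: type = machine
                             schedd_host = true     # this machine accepts job submissions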

                  Both LoadLeveler administrators and general users should be experienced with
                  UNIX® commands. Administrators also should be familiar with:
                  v Cluster system management techniques such as SMIT, as it is used in the AIX®
                    environment
                  v Networking and NFS or AFS® protocols

    Conventions and terminology used in this information
                  Throughout the TWS LoadLeveler product information:
                  v TWS LoadLeveler for Linux® Multiplatform includes:
|                   – IBM System servers with Advanced Micro Devices (AMD) Opteron or Intel®
|                     Extended Memory 64 Technology (EM64T) processors
                    – IBM System x™ servers
                    – IBM BladeCenter® Intel processor-based servers
                    – IBM Cluster 1350™

                    Note: IBM Tivoli Workload Scheduler LoadLeveler is supported when running
                          Linux on non-IBM Intel-based and AMD hardware servers.

                          Supported hardware includes:
|                         – Servers with Intel 32-bit and Intel EM64T
|                         – Servers with AMD 64-bit technology
                  v Note that in this information:
                    – LoadLeveler is also referred to as Tivoli Workload Scheduler LoadLeveler and
                      TWS LoadLeveler.
                    – Switch_Network_Interface_For_HPS is also referred to as HPS or High
                      Performance Switch.



                                                                                                xiii
Table 1 describes the typographic conventions used in this information.
                        Table 1. Summary of typographic conventions
                        Typographic      Usage
                        Bold             v Bold words or characters represent system elements that you must use
                                           literally, such as commands, flags, and path names.
                                         v Bold words also indicate the first use of a term included in the glossary.
                        Italic           v Italic words or characters represent variable values that you must supply.
                                         v Italics are also used for book titles and for general emphasis in text.
                        Constant         Examples and information that the system displays appear in constant
                        width            width typeface.
                        []               Brackets enclose optional items in format and syntax descriptions.
                        {}               Braces enclose a list from which you must choose an item in format and
                                         syntax descriptions.
                        |                A vertical bar separates items in a list of choices. (In other words, it means
                                         “or.”)
                        <>               Angle brackets (less-than and greater-than) enclose the name of a key on
                                         the keyboard. For example, <Enter> refers to the key on your terminal or
                                         workstation that is labeled with the word Enter.
                        ...              An ellipsis indicates that you can repeat the preceding item one or more
                                         times.
                        <Ctrl-x>         The notation <Ctrl-x> indicates a control character sequence. For example,
                                         <Ctrl-c> means that you hold down the control key while pressing <c>.
                         \             The continuation character is used in coding examples in this information
                                       for formatting purposes.



Prerequisite and related information
                        The Tivoli Workload Scheduler LoadLeveler publications are:
                        v Installation Guide, GI10-0763
                        v Using and Administering, SA22-7881
                        v Diagnosis and Messages Guide, GA22-7882

                        To access all TWS LoadLeveler documentation, refer to the IBM Cluster
                        Information Center, which contains the most recent TWS LoadLeveler
                        documentation in PDF and HTML formats. This Web site is located at:
                        http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/index.jsp

                        A TWS LoadLeveler Documentation Updates file also is maintained on this Web
                        site. The TWS LoadLeveler Documentation Updates file contains updates to the
                        TWS LoadLeveler documentation. These updates include documentation
                        corrections and clarifications that were discovered after the TWS LoadLeveler
                        books were published.

                        Both the current TWS LoadLeveler books and earlier versions of the library are
                        also available in PDF format from the IBM Publications Center Web site located at:
                        http://www.elink.ibmlink.ibm.com/publications/servlet/pbi.wss

                        To easily locate a book in the IBM Publications Center, supply the book’s
                        publication number. The publication number for each of the TWS LoadLeveler
                        books is listed after the book title in the preceding list.
How to send your comments
             Your feedback is important in helping us to produce accurate, high-quality
             information. If you have any comments about this book or any other TWS
             LoadLeveler documentation:
             v Send your comments by e-mail to: mhvrcfs@us.ibm.com
                Include the book title and order number, and, if applicable, the specific location
                of the information you have comments on (for example, a page number or a
                table number).
             v Fill out one of the forms at the back of this book and return it by mail, by fax, or
                by giving it to an IBM representative.

             To contact the IBM cluster development organization, send your comments by
             e-mail to: cluster@us.ibm.com




Summary of changes
           The following sections summarize changes to the IBM Tivoli Workload Scheduler
           (TWS) LoadLeveler product and TWS LoadLeveler library for each new release or
           major service update for a given product version. Within each information unit in
           the library, a vertical line to the left of text and illustrations indicates technical
           changes or additions made to the previous edition of the information.

           Changes to TWS LoadLeveler for this release or update include:
           v New information:
             – Recurring reservation support:
               - The TWS LoadLeveler commands and APIs have been enhanced to support
                 recurring reservation.
               - Accounting records have been enhanced to have recurring reservation
                 entries.
               - The new recurring job command file keyword will allow a user to specify
                 that the job can run in every occurrence of the recurring reservation to
                 which it is bound.
             – Data staging support:
                - Jobs can request that data files be staged in from a remote storage
                  location before the job executes and staged back to remote storage
                  after the job finishes execution.
                - LoadLeveler schedules data staging either at submit time or just in
                  time for the application execution.
             – Multicluster scale-across scheduling support:
               - Allows a large job to span resources across more than one cluster
                   v Scale-across scheduling is a way to schedule jobs in the multicluster
                     environment so that they span resources across more than one cluster.
                     This feature allows a large job that requests more resources than any
                     single cluster can provide to combine resources from several clusters
                     and run on the combined resources.
                  v Allows utilization of fragmented resources from more than one cluster
                    – Fragmented resources occur when the resources available on a single
                       cluster cannot satisfy any single job on that cluster. This feature allows
                       any size job to take advantage of these resources by combining them
                       from multiple clusters.
             – Enhanced WLM support:
               - Integrates TWS LoadLeveler with AIX Workload Manager (WLM) virtual
                 memory and the large page resource limit support.
               - Enforces virtual memory and the large page limit usage of a job.
               - Reports statistics for virtual memory and the large page limit usage.
               - Dynamically changes virtual memory and the large page limit usage of a
                 job.
             – Enhanced adapter striping (sn_all) support:
               - Submits jobs to nodes that have one or more networks in the failed
                 (NOTREADY) state provided that all of the nodes assigned to the job have
                 more than half of the networks in the READY state.


- A new striping_with_minimum_networks configuration keyword has been
                               added to the class stanza to support striping with failed networks.
                          – Enhanced affinity support:
                            - Task affinity support has been enhanced on nodes that are booted in single
                               threaded (ST) mode and on nodes that do not support simultaneous
                               multithreading (SMT).
                          – NetworkID64 for Mellanox adapters on Linux systems with InfiniBand
                            support:
                            - Generates unique NetworkID64 IDs for adapter ports that are connected to
                               the same switch and have the same IP subnet address. This ensures that
                               ports that are connected to the same switch, but are configured with
                                different IP subnet addresses, will get different NetworkID64 values.
                        v Changed information:
                          – This is the last release that will provide the following functions:
                            - The Motif-based graphical user interface xloadl. The function available in
                               xloadl has been frozen since TWS LoadLeveler 3.3.2 and there are no plans
                               to update this GUI with any new function added to TWS LoadLeveler after
                               that level.
                            - The IBM BladeCenter JS21 with a BladeCenter H chassis interconnected
                               with the InfiniBand Host Channel Adapters connected to a Cisco
                               InfiniBand SDR switch.
                            - The IBM Power System 575 (Model 9118-575) and IBM Power System 550
                               (Model 9133-55A) interconnected with the InfiniBand Host Channel
                               Adapter and Cisco switch.
                            - The High Performance Switch.
                          – If you have a mixed TWS LoadLeveler cluster and need to run your job on a
                            specific operating system or architecture, you must define the requirements
                            keyword statement in your job command file specifying the desired Arch or
                            OpSys. For example:
                              Requirements: (Arch == "RS6000") && (OpSys == "AIX53")
                        v Deleted information:
                          The following function is no longer supported and the information has been
                          removed:
                          – The scheduling of parallel jobs with the default scheduler
                            (SCHEDULER_TYPE=LL_DEFAULT)
                          – The min_processors and max_processors keywords
                          – The RSET_CONSUMABLE_CPUS option for the rset_support configuration
                            keyword and the rset job command file keyword
                          – The API functions:
                            - ll_get_nodes
                            - ll_free_nodes
                            - ll_get_jobs
                            - ll_free_jobs
                            - ll_start_job
                          – Red Hat Enterprise Linux 3
                          – The llctl purgeschedd function has been replaced by the llmovespool
                            function.
                          – The lldbconvert function is no longer needed for migration and the
                            lldbconvert command is not included in TWS LoadLeveler 3.5.



Part 1. Overview of TWS LoadLeveler concepts and operation
            Setting up IBM Tivoli Workload Scheduler (TWS) LoadLeveler involves defining
            machines, users, jobs, and how they interact, in such a way that TWS LoadLeveler
            is able to run jobs quickly and efficiently.

            Once you have a basic understanding of the TWS LoadLeveler product and its
            interfaces, you can find more details in the topics listed in Table 2.
            Table 2. Major topics in TWS LoadLeveler: Using and Administering
            To learn about:                          Read the following:
            Performing administrator tasks           Part 2, “Configuring and managing the TWS
                                                     LoadLeveler environment,” on page 39
            Performing general user tasks            Part 3, “Submitting and managing TWS
                                                     LoadLeveler jobs,” on page 177
            Using TWS LoadLeveler interfaces         Part 4, “TWS LoadLeveler interfaces reference,” on
                                                     page 261




Chapter 1. What is LoadLeveler?
            LoadLeveler is a job management system that allows users to run more jobs in less
            time by matching the jobs’ processing needs with the available resources.
            LoadLeveler schedules jobs, and provides functions for building, submitting, and
            processing jobs quickly and efficiently in a dynamic environment.

            Figure 1 shows the different environments to which LoadLeveler can schedule jobs.
            Together, these environments comprise the LoadLeveler cluster.


             [Figure 1 is a diagram of a LoadLeveler cluster that contains
             submit-only workstations, IBM Power Systems servers running AIX,
             an IBM eServer Cluster 1350 running Linux, and IBM BladeCenter
             servers running Linux.]

             Figure 1. Example of a LoadLeveler cluster

            As Figure 1 also illustrates, a LoadLeveler cluster can include submit-only machines,
            which allow users to have access to a limited number of LoadLeveler features.

            Throughout all the topics, the terms workstation, machine, node, and operating system
            instance (OSI) refer to the machines in your cluster. In LoadLeveler, an OSI is
            treated as a single instance of an operating system image.

            If you are unfamiliar with the TWS LoadLeveler product, consider reading one or
            more of the introductory topics listed in Table 3:
            Table 3. Topics in the TWS LoadLeveler overview
            To learn about:                             Read the following:
            Using the default configuration for         Chapter 2, “Getting a quick start using the default
            getting a quick start                       configuration,” on page 29
            Specific products and features that are     Chapter 3, “What operating systems are supported
            required for or available through the       by LoadLeveler?,” on page 35
            TWS LoadLeveler environment




LoadLeveler basics
                         LoadLeveler has various types of interfaces that enable users to create and submit
                         jobs and allow system administrators to configure the system and control running
                         jobs.

                         These interfaces include:
                         v Control files that define the elements, characteristics, and policies of LoadLeveler
                           and the jobs it manages. These files are the configuration file, the administration
                           file, and job command file.
                         v The command line interface, which gives you access to basic job and
                           administrative functions.
                         v A graphical user interface (GUI), which provides system access similar to the
                           command line interface. Experienced users and administrators may find the
                           command line interface more efficient than the GUI for job and administrative
                           functions.
                         v An application programming interface (API), which allows application programs
                           written by users and administrators to interact with the LoadLeveler
                           environment.

                         The commands, GUI, and APIs permit different levels of access to administrators
                         and users. User access is typically restricted to submitting and managing
                         individual jobs, while administrative access allows setting up system
                         configurations, job scheduling, and accounting.

                         Using either the command line or the GUI, users create job command files that
                         instruct the system on how to process information. Each job command file consists
                         of keywords followed by the user defined association for that keyword. For
                         example, the keyword executable tells LoadLeveler that you are about to define
                         the name of a program you want to run. Therefore, executable = longjob tells
                         LoadLeveler to run the program called longjob.
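
                          For example, a complete job command file built around this keyword
                          might look like the following sketch (the output and error file
                          names shown are illustrative):

                             # @ executable = longjob
                             # @ output     = longjob.out
                             # @ error      = longjob.err
                             # @ queue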

                         After creating the job command file, you invoke LoadLeveler commands to
                         monitor and control the job as it moves through the system. LoadLeveler monitors
                         each job as it moves through the system using process control daemons. However,
                         the administrator maintains ultimate control over all LoadLeveler jobs by defining
                         job classes that control how and when LoadLeveler will run a job.

                         In addition to setting up job classes, the administrator can also control how jobs
                         move through the system by specifying the type of scheduler. LoadLeveler has
                         several different scheduler options that start jobs using specific algorithms to
                         balance job priority with available machine resources.

                         When LoadLeveler administrators are configuring clusters and when users are
                         planning jobs, they need to be aware of the machine resources available in the
                         cluster. These resources include items like the number of CPUs and the amount of
                         memory available for each job. Because resource availability will vary over time,
                         LoadLeveler defines them as consumable resources.

LoadLeveler: A network job management and scheduling system
                         A network job management and job scheduling system, such as LoadLeveler, is a
                         software program that schedules and manages jobs that you submit to one or more
                         machines under its control.


LoadLeveler accepts jobs that users submit and reviews the job requirements.
      LoadLeveler then examines the machines under its control to determine which
      machines are best suited to run each job.

Job definition
      LoadLeveler schedules your jobs on one or more machines for processing. The
      definition of a job, in this context, is a set of job steps.

       For each job step, you can specify a different executable (the executable is the part
      of the job that gets processed). You can use LoadLeveler to submit jobs which are
      made up of one or more job steps, where each job step depends upon the
      completion status of a previous job step. For example, Figure 2 illustrates a stream
      of job steps:


       [Figure 2 is a flowchart of a job command file with three job steps:
       job step 1 copies data from tape and checks the exit status; if the
       exit status is x, job step 2 processes the data and checks the exit
       status, otherwise the program ends; if that exit status is x, job
       step 3 formats and prints the results, otherwise the program ends.]

       Figure 2. LoadLeveler job steps

      Each of these job steps is defined in a single job command file. A job command
      file specifies the name of the job, as well as the job steps that you want to submit,
      and can contain other LoadLeveler statements.

      LoadLeveler tries to execute each of your job steps on a machine that has enough
      resources to support executing and checkpointing each step. If your job command
      file has multiple job steps, the job steps will not necessarily run on the same
      machine, unless you explicitly request that they do.
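
       As a sketch, the job steps in Figure 2 might be expressed with the
       step_name and dependency keywords, where each dependency expression
       makes a step run only if the previous step exited with status 0 (the
       step and executable names are illustrative):

          # @ step_name  = copy_data
          # @ executable = /u/user/bin/copy_data
          # @ queue
          # @ step_name  = process_data
          # @ dependency = (copy_data == 0)
          # @ executable = /u/user/bin/process_data
          # @ queue
          # @ step_name  = print_results
          # @ dependency = (process_data == 0)
          # @ executable = /u/user/bin/print_results
          # @ queue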

      You can submit batch jobs to LoadLeveler for scheduling. Batch jobs run in the
       background and generally do not require any input from the user. Batch jobs can
       be either serial or parallel. A serial job runs on a single machine. A parallel job is a
      program designed to execute as a number of individual, but related, processes on
      one or more of your system’s nodes. When executed, these related processes can
      communicate with each other (through message passing or shared memory) to
      exchange data or synchronize their execution.

      For parallel jobs, LoadLeveler interacts with Parallel Operating Environment (POE)
      to allocate nodes, assign tasks to nodes, and launch tasks.



Machine definition
                         For LoadLeveler to schedule a job on a machine, the machine must be a valid
                         member of the LoadLeveler cluster.

                         A cluster is the combination of all of the different types of machines that use
                         LoadLeveler.

                         To make a machine a member of the LoadLeveler cluster, the administrator has to
                         install the LoadLeveler software onto the machine and identify the central manager
                         (described in “Roles of machines”). Once a machine becomes a valid member of
                         the cluster, LoadLeveler can schedule jobs to it.

                         Roles of machines
                         Each machine in the LoadLeveler cluster performs one or more roles in scheduling
                         jobs.

                         Roles performed in scheduling jobs by each machine in the LoadLeveler cluster are
                         as follows:
                         v Scheduling Machine: When a job is submitted, it gets placed in a queue
                           managed by a scheduling machine. This machine contacts another machine that
                           serves as the central manager for the entire LoadLeveler cluster. This scheduling
                           machine asks the central manager to find a machine that can run the job, and
                           also keeps persistent information about the job. Some scheduling machines are
                           known as public scheduling machines, meaning that any LoadLeveler user can
                           access them. These machines schedule jobs submitted from submit-only
                            machines.
                         v Central Manager Machine: The role of the central manager is to examine the
                           job’s requirements and find one or more machines in the LoadLeveler cluster
                           that will run the job. Once it finds the machine(s), it notifies the scheduling
                           machine.
                         v Executing Machine: The machine that runs the job is known as the executing
                           machine.
                         v Submitting Machine: This type of machine is known as a submit-only machine.
                           It participates in the LoadLeveler cluster on a limited basis. Although the name
                           implies that users of these machines can only submit jobs, they can also query
                           and cancel jobs. Users of these machines also have their own Graphical User
                           Interface (GUI) that provides them with the submit-only subset of functions. The
                           submit-only machine feature allows workstations that are not part of the
                           LoadLeveler cluster to submit jobs to the cluster.
                         Keep in mind that one machine can assume multiple roles, as shown in Figure 3 on
                         page 7.




              [Figure 3 is a diagram of a LoadLeveler cluster in which
              submit-only machines send jobs to machines that each combine the
              scheduling and executing roles; one of these machines also acts
              as the central manager.]

              Figure 3. Multiple roles of machines



             Machine availability
              There may be times when some of the machines in the LoadLeveler cluster are
              not available to process jobs; for instance, when the owners of the machines
              have decided to make them unavailable. This ability of LoadLeveler to allow
              users to restrict the use of their machines provides flexibility and control
              over the resources.

             Machine owners can make their personal workstations available to other
             LoadLeveler users in several ways. For example, you can specify that:
             v The machine will always be available
             v The machine will be available only between certain hours
             v The machine will be available when the keyboard and mouse are not being used
               interactively.
             Owners can also specify that their personal workstations never be made available
             to other LoadLeveler users.

How LoadLeveler schedules jobs
             When a user submits a job, LoadLeveler examines the job command file to
             determine what resources the job will need. LoadLeveler determines which
             machine, or group of machines, is best suited to provide these resources, then
             LoadLeveler dispatches the job to the appropriate machines. To aid this process,
             LoadLeveler uses queues.

             A job queue is a list of jobs that are waiting to be processed. When a user submits
             a job to LoadLeveler, the job is entered into an internal database, which resides on
             one of the machines in the LoadLeveler cluster, until it is ready to be dispatched to
             run on another machine.




Once LoadLeveler examines a job to determine its required resources, the job is
                         dispatched to a machine to be processed. A job can be dispatched to either one
                         machine, or in the case of parallel jobs, to multiple machines. Once the job reaches
                         the executing machine, the job runs.

                         Jobs do not necessarily get dispatched to machines in the cluster on a first-come,
                          first-served basis. Instead, LoadLeveler examines the requirements and
                         characteristics of the job and the availability of machines, and then determines the
                         best time for the job to be dispatched.

                         LoadLeveler also uses job classes to schedule jobs to run on machines. A job class
                         is a classification to which a job can belong. For example, short running jobs may
                         belong to a job class called short_jobs. Similarly, jobs that are only allowed to run
                         on the weekends may belong to a class called weekend. The system administrator
                         can define these job classes and select the users that are authorized to submit jobs
                         of these classes.
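
                          For example, an administrator might define such classes in the
                          administration file with stanzas like the following sketch (the
                          class names and limits are illustrative):

                             short_jobs: type = class
                                         wall_clock_limit = 00:10:00

                             weekend:    type = class
                                         wall_clock_limit = 48:00:00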

                         You can specify which types of jobs will run on a machine by specifying the types
                         of job classes the machine will support. LoadLeveler also examines a job’s priority
                          to determine when to schedule the job on a machine. The priority of a job
                          determines its position in the list of all jobs waiting to be dispatched.

                         “The LoadLeveler job cycle” on page 16 describes job flow in the LoadLeveler
                         environment in more detail.

How LoadLeveler daemons process jobs
                         LoadLeveler has its own set of daemons that control the processes moving jobs
                         through the LoadLeveler cluster.

                         The LoadLeveler daemons are programs that run continuously and control the
                         processes that move jobs through the LoadLeveler cluster. A master daemon
                         (LoadL_master) runs on all machines in the LoadLeveler cluster and manages
                         other daemons.

                         Table 4 summarizes these daemons, which are described in further detail in topics
                         immediately following the table.
                         Table 4. LoadLeveler daemons
                         Daemon                         Description
                         LoadL_master                   Referred to as the master daemon. Runs on all machines in
                                                        the LoadLeveler cluster and manages other daemons.
                         LoadL_schedd                   Referred to as the Schedd daemon. Receives jobs from the
                                                        llsubmit command and manages them on machines
                                                        selected by the negotiator daemon (as defined by the
                                                        administrator).
                         LoadL_startd                   Referred to as the startd daemon. Monitors job and
                                                        machine resources on local machines and forwards
                                                        information to the negotiator daemon.

                                                        The startd daemon spawns the starter process
                                                        (LoadL_starter) which manages running jobs on the
                                                        executing machine.




Table 4. LoadLeveler daemons (continued)
     Daemon                        Description
     LoadL_negotiator              Referred to as the negotiator daemon. Monitors the status
                                   of each job and machine in the cluster. Responds to queries
                                   from llstatus and llq commands. Runs on the central
                                   manager machine.
     LoadL_kbdd                    Referred to as the keyboard daemon. Monitors keyboard
                                   and mouse activity.
     LoadL_GSmonitor               Referred to as the gsmonitor daemon. Monitors for down
                                   machines based on the heartbeat responses of the
                                   MACHINE_UPDATE_INTERVAL time period.



The master daemon
     The master daemon runs on every machine in the LoadLeveler cluster, except the
     submit-only machines. The real and effective user ID of this daemon must be root.

     The LoadL_master binary is installed as a setuid program with the owner set to
     root. The master daemon and all daemons started by the master must be able to
     run with root privileges in order to switch the identity to the owner of any job
     being processed.

     The master daemon determines whether to start any other daemons by checking
     the START_DAEMONS keyword in the global or local configuration file. If the
     keyword is set to true, the daemons are started. If the keyword is set to false, the
     master daemon terminates and generates a message.
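
      For example, the global configuration file might contain the following
      line:

         START_DAEMONS = TRUE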

     The master daemon will not start on a Linux machine if SEC_ENABLEMENT is
     set to CTSEC. If the master daemon does not start, no other daemons will start.

     On the machine designated as the central manager, the master runs the negotiator
     daemon. The master also controls the central manager backup function. The
     negotiator runs on either the primary or an alternate central manager. If a central
     manager failure is detected, one of the alternate central managers becomes the
     primary central manager by starting the negotiator.

      The master daemon starts and, if necessary, restarts all of the LoadLeveler daemons
      that the machine it resides on is configured to run. As part of its startup procedure,
     this daemon executes the .llrc file (a dummy file is provided in the bin
     subdirectory of the release directory). You can use this script to customize your
     local configuration file, specifying what particular data is stored locally. This
     daemon also runs the kbdd daemon, which monitors keyboard and mouse activity.

     When the master daemon detects a failure on one of the daemons that it is
     monitoring, it attempts to restart it. Because this daemon recognizes that certain
     situations may prevent a daemon from running, it limits its restart attempts to the
     number defined for the RESTARTS_PER_HOUR keyword in the configuration file.
     If this limit is exceeded, the master daemon forces all daemons including itself to
     exit.

     When a daemon must be restarted, the master sends mail to the administrators
     identified by the LOADL_ADMIN keyword in the configuration file. The mail
     contains the name of the failing daemon, its termination status, and a section of the
     daemon’s most recent log file. If the master aborts after exceeding
     RESTARTS_PER_HOUR, it will also send that mail before exiting.
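
      For example, these keywords might be set in the configuration file as
      follows (the administrator names and the restart limit shown are
      illustrative):

         LOADL_ADMIN = loadl admin1
         RESTARTS_PER_HOUR = 12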

The master daemon may perform the following actions in response to an llctl
                        command:
                        v Kill all daemons and exit (stop keyword)
                        v Kill all daemons and execute a new master (recycle keyword)
                        v Rerun the .llrc file, reread the configuration files, stop or start daemons as
                          appropriate for the new configuration files (reconfig keyword)
                         v Send drain request to startd and send result to caller (drain keyword)
                        v Send flush request to startd and send result to caller (flush keyword)
                        v Send suspend request to startd and send result to caller (suspend keyword)
                        v Send resume request to startd and Schedd, and send result to caller (resume
                          keyword)
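
                         For example, an administrator might issue llctl commands such as the
                         following, where hostname is a placeholder for a machine in the cluster;
                         the -h flag directs the command to a single machine, and -g directs it
                         to all machines in the cluster:

                            llctl -h hostname drain
                            llctl -g reconfig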

             The Schedd daemon
                        The Schedd daemon receives jobs sent by the llsubmit command and manages
                         those jobs on machines selected by the negotiator daemon. The Schedd daemon is
                        started, restarted, signalled, and stopped by the master daemon.
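
                         For example, a user submits a job command file to the Schedd daemon with
                         the llsubmit command (the file name is illustrative):

                            llsubmit longjob.cmd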

                        The Schedd daemon can be in any one of the following activity states:
                        Available
                               This machine is available to schedule jobs.
                        Drained
                               The Schedd machine accepts no more jobs. There are no jobs in starting or
                               running state. Jobs in the Idle state are drained, meaning they will not get
                               dispatched.
                        Draining
                               The Schedd daemon is being drained by the administrator but some jobs
                               are still running. The state of the machine remains Draining until all
                               running jobs complete. At that time, the machine status changes to
                               Drained.
                        Down The daemon is not running on this machine. The Schedd daemon enters
                             this state when it has not reported its status to the negotiator. This can
                             occur when the machine is actually down, or because there is a network
                             failure.

                        The Schedd daemon performs the following functions:
                        v Assigns new job identifiers when requested by the job submission process (for
                          example, by the llsubmit command).
                        v Receives new jobs from the llsubmit command. A new job is received as a job
                          object for each job step. A job object is the data structure in memory containing
                          all the information about a job step. The Schedd forwards the job object to the
                          negotiator daemon as soon as it is received from the submit command.
                        v Maintains on disk copies of jobs submitted locally (on this machine) that are
                          either waiting or running on a remote (different) machine. The central manager
                          can use this information to reconstruct the job information in the event of a
                          failure. This information is also used for accounting purposes.
                        v Responds to directives sent by the administrator through the negotiator daemon.
                          The directives include:
                          – Run a job.
                          – Change the priority of a job.
                          – Remove a job.
                          – Hold or release a job.
                          – Send information about all jobs.


v Sends job events to the negotiator daemon when:
        – Schedd is restarting.
         – A new series of job objects is arriving.
        – A job is started.
        – A job was rejected, completed, removed, or vacated. Schedd determines the
           status by examining the exit status returned by the startd.
      v Communicates with the Parallel Operating Environment (POE) when you run an
        interactive POE job.
      v Requests that a remote startd daemon end a job.
      v Receives accounting information from startd.
      v Receives requests for reservations.
      v Collects resource usage data when jobs terminate and stores it as historic fair
        share data in the $(SPOOL) directory.
      v Sends historic fair share data to the central manager when it is updated or when
        the Schedd daemon is restarted.
      v Maintains and stores records of historic CPU and IBM System Blue Gene®
        Solution utilization for users and groups known to the Schedd.
      v Passes the historic CPU and Blue Gene utilization data to the central manager.

The startd daemon
      The startd daemon monitors the status of each job, reservation, and machine in the
      cluster, and forwards this information to the negotiator daemon.

      The startd also receives and executes job requests originating from remote
      machines. The master daemon starts, restarts, signals, and stops the startd daemon.

      Checkpoint/restart is not supported in LoadLeveler for Linux. If a checkpointed
      job is sent to a Linux node, the Linux node will reject the job.

      The startd daemon can be in any one of the following states:
      Busy    The maximum number of jobs are running on this machine as specified by
              the MAX_STARTERS configuration keyword.
      Down The daemon is not running on this machine. The startd daemon enters this
           state when it has not reported its status to the negotiator. This can occur
           when the machine is actually down, or because there is a network failure.
      Drained
             The startd machine will not accept any new jobs. No jobs are running
             when startd is in the drained state.
      Draining
             The startd daemon is being drained by the administrator, but some jobs are
             still running. The machine remains in the draining state until all of the
             running jobs have completed, at which time the machine status changes to
             drained. The startd daemon will not accept any new jobs while in the
             draining state.
      Flush   Any running jobs have been vacated (terminated and returned to the
              queue to be redispatched). The startd daemon will not accept any new
              jobs.
      Idle    The machine is not running any jobs.
      None    LoadLeveler is running on this machine, but no jobs can run here.


                                                          Chapter 1. What is LoadLeveler?   11
Running
                              The machine is running one or more jobs and is capable of running more.
                        Suspend
                              All LoadLeveler jobs running on this machine are stopped (cease
                              processing), but remain in virtual memory. The startd daemon will not
                              accept any new jobs.

                        The startd daemon performs these functions:
                         v Runs a time-out procedure that builds a snapshot of the state of the
                           machine, including static and dynamic data. This time-out procedure is run at
                           the following times:
                          – After a job completes.
                          – According to the definition of the POLLING_FREQUENCY keyword in the
                             configuration file.
                        v Records the following information in LoadLeveler variables and sends the
                          information to the negotiator.
                          – State (of the startd daemon)
                          – EnteredCurrentState
                          – Memory
                          – Disk
                          – KeyboardIdle
                          – Cpus
                          – LoadAvg
                          – Machine
                          – Adapter
                          – AvailableClasses
                        v Calculates the SUSPEND, RESUME, CONTINUE, and VACATE expressions
                          through which you can manage job status.
                        v Receives job requests from the Schedd daemon to:
                          – Start a job
                          – Preempt or resume a job
                          – Vacate a job
                           – Cancel a job
                          When the Schedd daemon tells the startd daemon to start a job, the startd
                          determines whether its own state permits a new job to run:
                        Table 5. startd determines whether its own state permits a new job to run
                        If:                       Then this happens:
                        Yes, it can start a new   The startd forks a starter process.
                        job
                        No, it cannot start a     The startd rejects the request for one of the following reasons:
                        new job                   v Jobs have been suspended, flushed, or drained
                                                  v The job limit set for the MAX_STARTERS keyword has been
                                                    reached
                                                  v There are not enough classes available for the designated job class

                        v Receives requests from the master (through the llctl command) to do one of the
                          following:
                          – Drain (drain keyword)
                          – Flush (flush keyword)
                          – Suspend (suspend keyword)
                          – Resume (resume keyword)


v For each request, startd marks its own new state, forwards its new state to the
        negotiator daemon, and then performs the appropriate action for any jobs that
        are active.
       v Receives notification of keyboard and mouse activity from the kbdd daemon.
      v Periodically examines the process table for LoadLeveler jobs and accumulates
        resources consumed by those jobs. This resource data is used to determine if a
        job has exceeded its job limit and for recording in the history file.
       v Sends accounting information to the Schedd daemon.
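
       Keywords mentioned in this topic, such as POLLING_FREQUENCY and
       MAX_STARTERS, are set in the configuration file; for example (the
       values shown are illustrative):

          POLLING_FREQUENCY = 5
          MAX_STARTERS = 2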

      The starter process
      The startd daemon spawns a starter process after the Schedd daemon tells the
      startd daemon to start a job.

      The starter process manages all the processes associated with a job step. The starter
      process is responsible for running the job and reporting status back to the startd
      daemon.

      The starter process performs these functions:
      v Processes the prolog and epilog programs as defined by the JOB_PROLOG and
        JOB_EPILOG keywords in the configuration file. The job will not run if the
        prolog program exits with a return code other than zero.
      v Handles authentication. This includes:
        – Authenticates AFS, if necessary
        – Verifies that the submitting user is not root
        – Verifies that the submitting user has access to the appropriate directories in
           the local file system.
      v Runs the job by forking a child process that runs with the user ID and all
        groups of the submitting user. That child process creates a new process group of
        which it is the process group leader, and executes the user’s program or a shell.
        The starter process is responsible for detecting the termination of any process
        that it forks. To ensure that all processes associated with a job are terminated
        after the process forked by the starter terminates, process tracking must be
        enabled. To configure LoadLeveler for process tracking, see “Tracking job
        processes” on page 70.
      v Responds to vacate and suspend orders from the startd.
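
       For example, prolog and epilog programs might be configured as follows
       (the paths are illustrative):

          JOB_PROLOG = /u/loadl/scripts/prolog
          JOB_EPILOG = /u/loadl/scripts/epilog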

The negotiator daemon
      The negotiator daemon maintains status of each job and machine in the cluster
      and responds to queries from the llstatus and llq commands.
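
       For example, the following commands query the negotiator for the
       status of the machines and jobs in the cluster:

          llstatus
          llq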

      The negotiator daemon runs on a single machine in the cluster (the central
      manager machine). This daemon is started, restarted, signalled, and stopped by the
      master daemon.

      In a mixed cluster, the negotiator daemon must run on an AIX node.

      The negotiator daemon receives status messages from each Schedd and startd
      daemon running in the cluster. The negotiator daemon tracks:
      v Which Schedd daemons are running
      v Which startd daemons are running, and the status of each startd machine.




If the negotiator does not receive an update from any machine within the time
                        period defined by the MACHINE_UPDATE_INTERVAL keyword, then the
                        negotiator assumes that the machine is down, and therefore the Schedd and startd
                        daemons are also down.
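
                         For example, this interval might be set in the configuration file as
                         follows (the value, in seconds, is illustrative):

                            MACHINE_UPDATE_INTERVAL = 300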

                        The negotiator also maintains in its memory several queues and tables which
                        determine where the job should run.

                        The negotiator performs the following functions:
                        v Receives and records job status changes from the Schedd daemon.
                        v Schedules jobs based on a variety of scheduling criteria and policy options. Once
                          a job is selected, the negotiator contacts the Schedd that originally created the
                          job.
                        v Handles requests to:
                          – Set priorities
                          – Query about jobs, machines, classes, and reservations
                          – Change reservation attributes
                          – Bind jobs to reservations
                          – Remove a reservation
                          – Remove a job
                          – Hold or release a job
                          – Favor or unfavor a user or a job.
                        v Receives notification of Schedd resets indicating that a Schedd has restarted.

             The kbdd daemon
                        The kbdd daemon monitors keyboard and mouse activity.

                        The kbdd daemon is spawned by the master daemon if the X_RUNS_HERE
                        keyword in the configuration file is set to true.

                        The kbdd daemon notifies the startd daemon when it detects keyboard or mouse
                        activity; however, kbdd is not interrupt driven. It sleeps for the number of seconds
                        defined by the POLLING_FREQUENCY keyword in the LoadLeveler
                        configuration file, and then determines if X events, in the form of mouse or
                        keyboard activity, have occurred. For more information on the configuration file,
                        see Chapter 5, “Defining LoadLeveler resources to administer,” on page 83.
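
                         For example, a machine on which X is running might set the following in
                         its local configuration file:

                            X_RUNS_HERE = TRUE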

             The gsmonitor daemon
                        The gsmonitor daemon is not available in LoadLeveler for Linux.

                        The negotiator daemon monitors for down machines based on the heartbeat
                        responses of the MACHINE_UPDATE_INTERVAL time period. If the negotiator
                        has not received an update after two MACHINE_UPDATE_INTERVAL periods,
                        then it marks the machine as down, and notifies the Schedd to remove any jobs
                        running on that machine. The gsmonitor daemon (LoadL_GSmonitor) allows this
                        cleanup to occur more reliably. The gsmonitor daemon uses the Group Services
                        Application Programming Interface (GSAPI) to monitor machine availability on
                        peer domains and to notify the negotiator quickly when a machine is no longer
                        reachable.

If the GSMONITOR_DOMAIN keyword was not specified in the LoadLeveler
configuration file, then LoadLeveler will try to determine if the machine is
running in a peer (cluster) domain. The gsmonitor must run in a peer domain.
When the gsmonitor detects that it is running in an active peer domain, it uses
the RMC API to determine the node numbers and names of the machines running in
the cluster.

If the administrator sets up a LoadLeveler administration file that contains OSIs
spanning several peer domains then a gsmonitor daemon must be started in each
domain. A gsmonitor daemon can monitor only the OSIs contained in the domain
within which it is running. The administrator specifies which OSIs run the
gsmonitor daemon by specifying GSMONITOR_RUNS_HERE=TRUE in the local
configuration file for that OSI. The default for GSMONITOR_RUNS_HERE is
False.

The gsmonitor daemon should be run on one or two nodes in the peer domain. By
running LoadL_GSmonitor on more than one node in a domain you will have a
backup in case one of the nodes that the monitor is running on goes down.
LoadL_GSmonitor subscribes to the Group Services system-defined host
membership group, which is represented by the HA_GS_HOST_MEMBERSHIP
Group Services keyword. This group monitors every configured node in the
system partition and every node in the active peer domain.

Note:
        1. The Group Services routines need to be run as root, so the
           LoadL_GSmonitor executable must be owned by root and have the
           setuid permission bit enabled.
         2. It will not cause a problem to run more than one LoadL_GSmonitor
            daemon per peer domain; this will just cause the negotiator to be
            notified by each running daemon.
        3. For more information about the Group Services subsystem, see the RSCT
           Administration Guide, SA22-7889 for peer domains.
        4. For more information about GSAPI, see Group Services Programming Guide
           and Reference, SA22-7355.




The LoadLeveler job cycle
                        To illustrate the flow of job information through the LoadLeveler cluster, a
                        description and sequence of diagrams have been provided.



                         [Figure 4 is a diagram of a LoadLeveler cluster in which a job (arrow 1)
                         is submitted to a scheduling machine, the scheduling machine and the
                         central manager exchange information about the job (arrows 2 and 3), and
                         the scheduling machine then contacts an executing machine (arrow 4).]

                         Figure 4. High-level job flow

                        The managing machine in a LoadLeveler cluster is known as the central manager.
                        There are also machines that act as schedulers, and machines that serve as the
                        executing machines. The arrows in Figure 4 illustrate the following:
                        v Arrow 1 indicates that a job has been submitted to LoadLeveler.
                        v Arrow 2 indicates that the scheduling machine contacts the central manager to
                          inform it that a job has been submitted, and to find out if a machine exists that
                          matches the job requirements.
                        v Arrow 3 indicates that the central manager checks to determine if a machine
                          exists that is capable of running the job. Once a machine is found, the central
                          manager informs the scheduling machine which machine is available.
                        v Arrow 4 indicates that the scheduling machine contacts the executing machine
                          and provides it with information regarding the job. In this case, the scheduling
                          and executing machines are different machines in the cluster, but they do not
                          have to be different; the scheduling and executing machines may be the same
                          physical machine.

                        Figure 4 is broken down into the following more detailed diagrams illustrating
                        how LoadLeveler processes a job. The diagrams indicate specific job states for this
                        example, but do not list all of the possible states for LoadLeveler jobs. A complete
                        list of job states appears in “LoadLeveler job states” on page 19.
                        1. Submit a LoadLeveler job:




   [Diagram: within the LoadLeveler cluster, a job (1) is submitted to the
   Schedd daemon on the scheduling machine, which stores the job information
   on local disk (2) and sends it to the negotiator daemon on the central
   manager (3). The job is Idle.]

Figure 5. Job is submitted to LoadLeveler

   Figure 5 illustrates that the Schedd daemon runs on the scheduling machine.
   This machine can also have the startd daemon running on it. The negotiator
   daemon resides on the central manager machine. The arrows in Figure 5
   illustrate the following:
   v Arrow 1 indicates that a job has been submitted to the scheduling machine.
   v Arrow 2 indicates that the Schedd daemon, on the scheduling machine,
      stores all of the relevant job information on local disk.
   v Arrow 3 indicates that the Schedd daemon sends job description information
      to the negotiator daemon. At this point, the submitted job is in the Idle state.
2. Permit to run:



   [Diagram: the negotiator daemon on the central manager contacts the Schedd
   daemon on the scheduling machine (4). The job is Pending or Starting.]

Figure 6. LoadLeveler authorizes the job

In Figure 6 on page 17, arrow 4 indicates that the negotiator daemon authorizes
                           the Schedd daemon to begin taking steps to run the job. This authorization is
                           called a permit to run. Once this is done, the job is considered Pending or
                           Starting.
                        3. Prepare to run:



                           [Diagram: the Schedd daemon on the scheduling machine contacts the
                           startd daemon on the executing machine (5), which may be remote or
                           local to the scheduling machine. The job is Pending or Starting.]

                        Figure 7. LoadLeveler prepares to run the job

                           In Figure 7, arrow 5 illustrates that the Schedd daemon contacts the startd
                           daemon on the executing machine and requests that it start the job. The
                           executing machine can either be a local machine (the machine to which the job
                           was submitted) or another machine in the cluster. In this example, the local
                           machine is not the executing machine.
                        4. Initiate job:



                           [Diagram: the startd daemon on the executing machine spawns a
                           starter process (6); the Schedd daemon sends the starter the job
                           information and the executable (7), and notifies the negotiator
                           daemon (8). The job is Running.]

                        Figure 8. LoadLeveler starts the job

The arrows in Figure 8 on page 18 illustrate the following:
         v Arrow 6 indicates that the startd daemon on the executing machine spawns a
           starter process for the job.
         v Arrow 7 indicates that the Schedd daemon sends the starter process the job
           information and the executable.
         v Arrow 8 indicates that the Schedd daemon notifies the negotiator daemon
           that the job has been started and the negotiator daemon marks the job as
           Running.
         The starter forks and executes the user’s job, and the starter parent waits for
         the child to complete.
      5. Complete job:



                         [Diagram: the starter process notifies the startd daemon (9), which
                         notifies the Schedd daemon (10), which forwards the information to
                         the negotiator daemon (11). The job is Complete Pending or
                         Completed.]

      Figure 9. LoadLeveler completes the job

         The arrows in Figure 9 illustrate the following:
         v Arrow 9 indicates that when the job completes, the starter process notifies
           the startd daemon.
         v Arrow 10 indicates that the startd daemon notifies the Schedd daemon.
         v Arrow 11 indicates that the Schedd daemon examines the information it has
           received, and forwards it to the negotiator daemon. At this point, the job is
           in Completed or Complete Pending state.

LoadLeveler job states
      As LoadLeveler processes a job, the job moves through various states.

      These states are listed in Table 6 on page 20. Job states that include “Pending,”
      such as Complete Pending and Vacate Pending, are intermediate, temporary states.

      Some options on LoadLeveler interfaces are valid only for jobs in certain states. For
      example, the llmodify command has options that apply only to jobs that are in the
      Idle state, or in states that are similar to it. To determine which job states are
                         similar to the Idle state, use the “Similar to...” column in Table 6 on page 20, which
indicates whether a particular job state is similar to the Idle, Running, or
                        Terminating state. A dash (—) indicates that the state is not similar to an Idle,
                        Running, or Terminating state.
                        Table 6. Job state descriptions and abbreviations
                        Job state          Similar to      Abbreviation in   Description
                                           Idle or         displays/output
                                           Running state?
                        Canceled           Terminating    CA                The job was canceled either by a user or
                                                                            by an administrator.
                        Checkpointing      Running        CK                Indicates that a checkpoint has been
                                                                            initiated.
                        Completed          Terminating    C                 The job has completed.
                        Complete           Terminating    CP                The job is in the process of being
                        Pending                                             completed.
                        Deferred           Idle           D                 The job will not be assigned to a machine
                                                                            until a specified date. This date may have
                                                                            been specified by the user in the job
                                                                            command file, or may have been
                                                                            generated by the negotiator because a
                                                                            parallel job did not accumulate enough
                                                                            machines to run the job. Only the
                                                                            negotiator places a job in the Deferred
                                                                            state.
                        Idle               Idle           I                 The job is being considered to run on a
                                                                            machine, though no machine has been
                                                                            selected.
                        Not Queued         Idle           NQ                The job is not being considered to run on
                                                                            a machine. A job can enter this state
                                                                            because the associated Schedd is down,
                                                                            the user or group associated with the job
                                                                            is at its maximum maxqueued or maxidle
                                                                            value, or because the job has a
                                                                            dependency which cannot be determined.
                                                                            For more information on these keywords,
                                                                            see “Controlling the mix of idle and
                                                                            running jobs” on page 721. (Only the
                                                                            negotiator places a job in the NotQueued
                                                                            state.)
                        Not Run            —              NR                The job will never be run because a
                                                                            dependency associated with the job was
                                                                            found to be false.
                        Pending            Running        P                 The job is in the process of starting on one
                                                                            or more machines. (The negotiator
                                                                            indicates this state until the Schedd
                                                                            acknowledges that it has received the
                                                                            request to start the job. Then the
                                                                            negotiator changes the state of the job to
                                                                            Starting. The Schedd indicates the
                                                                            Pending state until all startd machines
                                                                            have acknowledged receipt of the start
                                                                            request. The Schedd then changes the
                                                                             state of the job to Starting.)
Preempted         Running         E               The job is preempted. This state applies
                                                  only when LoadLeveler uses the suspend
                                                  method to preempt the job.
Preempt           Running         EP              The job is in the process of being
Pending                                           preempted. This state applies only when
                                                  LoadLeveler uses the suspend method to
                                                  preempt the job.
Rejected          Idle            X               The job is rejected.
Reject Pending    Idle            XP              The job did not start. Possible reasons
                                                  why a job is rejected are: job requirements
                                                  were not met on the target machine, or
                                                  the user ID of the person running the job
                                                  is not valid on the target machine. After a
                                                  job leaves the Reject Pending state, it is
                                                  moved into one of the following states:
                                                  Idle, User Hold, or Removed.
Removed           Terminating     RM              The job was stopped by LoadLeveler.
Remove            Terminating     RP              The job is in the process of being
Pending                                           removed, but not all associated machines
                                                  have acknowledged the removal of the
                                                  job.
Resume Pending Running            MP              The job is in the process of being
                                                  resumed.
Running           Running         R               The job is running: the job was dispatched
                                                  and has started on the designated
                                                  machine.
Starting          Running         ST              The job is starting: the job was dispatched,
                                                  was received by the target machine, and
                                                  LoadLeveler is setting up the environment
                                                  in which to run the job. For a parallel job,
                                                  LoadLeveler sets up the environment on
                                                  all required nodes. See the description of
                                                  the “Pending” state for more information
                                                  on when the negotiator or the Schedd
                                                  daemon moves a job into the Starting
                                                  state.
System Hold       Idle            S                 The job has been put in system hold.
                            Terminated         Terminating    TX              If the negotiator and Schedd daemons
                                                                              experience communication problems, they
                                                                              may be temporarily unable to exchange
                                                                              information concerning the status of jobs
                                                                              in the system. During this period of time,
                                                                              some of the jobs may actually complete
                                                                              and therefore be removed from the
                                                                              Schedd’s list of active jobs. When
                                                                              communication resumes between the two
                                                                              daemons, the negotiator will move such
                                                                              jobs to the Terminated state, where they
                                                                              will remain for a set period of time
                                                                              (specified by the
                                                                              NEGOTIATOR_REMOVE_COMPLETED
                                                                              keyword in the configuration file). When
                                                                              this time has passed, the negotiator will
                                                                              remove the jobs from its active list.
                            User & System      Idle           HS              The job has been put in both system hold
                            Hold                                              and user hold.
                            User Hold          Idle           H               The job has been put in user hold.
                            Vacated            Idle           V               The job started but did not complete. The
                                                                              negotiator will reschedule the job
                                                                              (provided the job is allowed to be
                                                                              rescheduled). Possible reasons why a job
                                                                              moves to the Vacated state are: the
                                                                              machine where the job was running was
                                                                              flushed, the VACATE expression in the
                                                                              configuration file evaluated to True, or
                                                                              LoadLeveler detected a condition
                                                                              indicating the job needed to be vacated.
                                                                              For more information on the VACATE
                                                                              expression, see “Managing job status
                                                                              through control expressions” on page 68.
                            Vacate Pending     Idle           VP              The job is in the process of being vacated.



    Consumable resources
                            Consumable resources are assets available on machines in your LoadLeveler
                            cluster.

                            These assets are called “resources” because they model the commodities or services
|                           available on machines (including CPUs, real memory, virtual memory, large page
|                           memory, software licenses, disk space). They are considered “consumable” because
                            job steps use specified amounts of these commodities when the step is running.
                            Once the step finishes, the resource becomes available for another job step.

                            Consumable resources which model the characteristics of a specific machine (such
                            as the number of CPUs or the number of specific software licenses available only
                            on that machine) are called machine resources. Consumable resources which model
                            resources that are available across the LoadLeveler cluster (such as floating
                            software licenses) are called floating resources. For example, consider a
configuration with 10 licenses for a given program (which can be used on any
    machine in the cluster). If these licenses are defined as floating resources, all 10 can
    be used on one machine, or they can be spread across as many as 10 different
    machines.
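
     As a sketch of how such floating resources might be declared in the global
     configuration file (the resource name spice2g6 is illustrative; see
     Chapter 12, “Configuration file reference,” on page 263 to verify keyword
     syntax):

        FLOATING_RESOURCES    = spice2g6(10)
        SCHEDULE_BY_RESOURCES = ConsumableCpus spice2g6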

    The LoadLeveler administrator can specify:
    v Consumable resources to be considered by LoadLeveler’s scheduling algorithms
    v Quantity of resources available on specific machines
    v Quantity of floating resources available on machines in the cluster
    v Consumable resources to be considered in determining the priority of executing
      machines
    v Default amount of resources consumed by a job step of a specified job class
|   v Whether CPU, real memory, virtual memory, or large page resources should be
|     enforced using AIX Workload Manager (WLM)
    v Whether all jobs submitted need to specify resources

    Users submitting jobs can specify the resources consumed by each task of a job
    step, or the resources consumed by the job on each machine where it runs,
    regardless of the number of tasks assigned to that machine.
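
     For example, a job command file might include a line like the following to
     request one CPU and 200 MB of real memory for each task of the step (a
     hedged sketch; per-machine requests use the node_resources keyword, and the
     exact syntax is in the job command file keyword reference):

        # @ resources = ConsumableCpus(1) ConsumableMemory(200 mb)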

    If affinity scheduling support is enabled, the CPUs requested in the consumable
    resources requirement of a job will be used by the scheduler to determine the
    number of CPUs to be allocated and attached to that job’s tasks running on
    machines enabled for affinity scheduling. However, if the affinity scheduling
     request contains the processor-core affinity option, the number of CPUs will be
     determined from the value specified by the task_affinity keyword instead of the
     ConsumableCpus value in the consumable resources requirement. For more information on
    scheduling affinity, see “LoadLeveler scheduling affinity support” on page 146.

    Note:
            1. When software licenses are used as a consumable resource, LoadLeveler
               does not attempt to obtain software licenses or to verify that software
               licenses have been obtained. However, by providing a user exit that can
               be invoked as a submit filter, the LoadLeveler administrator can provide
               code to first obtain the required license and then allow the job step to
               run. For more information on filtering job scripts, see “Filtering a job
               script” on page 76.
|           2. LoadLeveler scheduling algorithms use the availability of requested
|              consumable resources to determine the machine or machines on which a
|              job will run. Consumable resources (except for CPU, real memory, virtual
|              memory and large page) are only used for scheduling purposes and are
|              not enforced. Instead, LoadLeveler’s negotiator daemon keeps track of
|              the consumable resources available by reducing them by the amount
|              requested when a job step is scheduled, and increasing them when a
|              consuming job step completes.
            3. If a job is preempted, the job continues to use all consumable resources
               except for ConsumableCpus and ConsumableMemory (real memory)
               which are made available to other jobs.
            4. When the network adapters on a machine support RDMA, the machine
               is automatically given a consumable resource called RDMA with an
               available quantity defined by the limit on the number of concurrent jobs
               that use RDMA. For machines with the “Switch Network Interface for
               HPS” network adapters, this limit is 4. Machines with InfiniBand
               adapters are given unlimited RDMA resources.
5. When steps require RDMA, either because they request bulkxfer or
                                       because they request rcxtblocks on at least one network statement, the
                                       job is automatically given a resource requirement for 1 RDMA.

                 Consumable resources and AIX Workload Manager
|                           If the administrator has indicated that resources should be enforced, LoadLeveler
|                           uses AIX Workload Manager (WLM) to give greater control over CPU, real
|                           memory, virtual memory and large page resource allocation.

                            WLM monitors system resources and regulates their allocation to processes
                            running on AIX. These actions prevent jobs from interfering with each other when
                            they have conflicting resource requirements. WLM achieves this control by creating
                            different classes of service and allowing attributes to be specified for those classes.

                            LoadLeveler dynamically generates WLM classes with specific resource
                            entitlements. A single WLM class is created for each job step and the process ID of
                            that job step is assigned to that class. This is done for each node that a job step is
                            assigned to run on. LoadLeveler then defines resource shares or limits for that class
                            depending on the LoadLeveler enforcement policy defined. These resource shares
                            or limits represent the job’s requested resource usage in relation to the amount of
                            resources available on the machine.

|                           When LoadLeveler defines multiple memory resources under one WLM class, AIX
|                           WLM uses the following order to determine if resource limits have been exceeded:
                            1. Real Memory Absolute Limit
                            2. Virtual Memory Absolute Limit
                            3. Large Page Limit
|                           4. Real Memory shares or percent limit

|                           Note: When a real memory or CPU shares or percent limit is exceeded, the
|                                 job processes in that class receive a lower scheduling priority
|                                 until their utilization drops below the hard max limit. When virtual memory
|                                 or absolute real memory limits are exceeded, the job processes are killed.
|                                 When the large page limit is exceeded, any new large page requests are
|                                 denied.

                            When the enforcement policy is shares, LoadLeveler assigns a share value to the
                            class based on the resources requested for the job step (one unit of resource equals
                            one share). When the job step process is running, AIX WLM dynamically calculates
                            an appropriate resource entitlement based on the WLM class share value of the job
                            step and the total number of shares requested by all active WLM classes. It is
                            important to note that AIX WLM will only enforce these target percentages when
                            the resource is under contention.

                            When the enforcement policy is limits (soft or hard), LoadLeveler assigns a
                            percentage value to the class based on the resources requested for the job step and
                            the total machine resources. This resource percentage is enforced regardless of any
                            other active WLM classes. A soft limit indicates the maximum amount of the
                            resource that can be made available when there is contention for the resources.
                            This maximum can be exceeded if no one else requires the resource. A hard limit
                            indicates the maximum amount of the resource that can be made available even if
                            there is no contention for the resources.
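
                            As a minimal sketch, an administrator might enable enforcement of CPU
                            and real memory with a hard limits policy using configuration lines
                            like the following (see Chapter 12, “Configuration file reference,” on
                            page 263 to verify keyword details):

                               ENFORCE_RESOURCE_USAGE  = ConsumableCpus ConsumableMemory
                               ENFORCE_RESOURCE_POLICY = hard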



|                 Note: A WLM class is active for the duration of a job step and is deleted when the
|                       job step completes. There is a limit of 64 active WLM classes per machine.
|                       Therefore, when resources are being enforced, only 64 job steps can be
|                       running on one machine.

                  For additional information about integrating LoadLeveler with AIX Workload
                  Manager, see “Steps for integrating LoadLeveler with the AIX Workload Manager”
                  on page 137.

    Overview of reservations
                  Under the BACKFILL scheduler only, LoadLeveler allows authorized users to make
                  reservations, which specify a time period during which specific node resources are
                  reserved for exclusive use by particular users or groups. This capability is known
                  in the computing industry as advance reservation.

                  Normally, jobs wait to be dispatched until the resources they require become
                  available. Through the use of reservations, wait time can be reduced because the
                  jobs have exclusive use of the node resources (CPUs, memory, disk drives,
                  communication adapters, and so on) as soon as the reservation period begins.

                  Note: Advance reservation supports Blue Gene resources including the Blue Gene
                        compute nodes. For more information, see “Blue Gene reservation support”
                        on page 159.

                  In addition to reducing wait time, reservations also are useful for:
                  v Running a workload that needs to start or finish at a particular time. The job
                    steps must be associated with, or bound to, the reservation before LoadLeveler
                    can run them during the reservation period.
|                 v Reserving resources for a workload that repeats at regular intervals. You can
|                   make a single request to create a recurring reservation, which reserves a specific
|                   set of resources for a specific time slot that repeats on a regular basis for a
|                   defined interval.
                  v Setting aside a set of nodes for maintenance purposes. In this case, job steps are
                    not bound to the reservation.
                  Only bound job steps may run on the reserved nodes, which means that a bound
                  job step competes for reserved resources only with other job steps that are bound
                  to the same reservation.

                  The following sequence of events describes, in general terms, how you can set up
                  and use reservations in the LoadLeveler environment. It also describes how
                  LoadLeveler manages activities related to the use of reservations.
                  1. Configuring LoadLeveler to support reservations
                     An administrator uses specific keywords in the configuration and
                     administration files to define general reservation policies. These keywords
                     include:
|                    v max_reservations, when used in the global configuration file, defines the
|                       maximum number of reservations for the entire cluster.
|                    v max_reservations, when used in a user or group stanza of the administration
|                       file, can also be used to define both:
                        – The users or groups that will be allowed to create reservations. To be
                            authorized to create reservations, LoadLeveler administrators also must
                            have the max_reservations keyword set in their own user or group
                            stanzas.

– How many reservations users may own.

|                                Note: With recurring advance reservations, to avoid confusion about what
|                                        counts as one reservation, LoadLeveler counts each reservation as one
|                                        instance, regardless of the number of times the reservation recurs
|                                        before it expires. This applies to the system-wide max_reservations
|                                        configuration setting as well as to the same type of configuration
|                                        settings at the user and group levels.
                               v max_reservation_duration, which defines the maximum duration for
                                 reservations.
                               v reservation_permitted, which defines the nodes that may be used for
                                 reservations.
|                              v max_reservation_expiration, which defines how long recurring reservations
|                                are permitted to last (expressed as the number of days).
                               Administrators also may configure LoadLeveler to collect accounting data
                               about reservations when the reservations complete or are canceled.
                            2. Creating reservations
                               After LoadLeveler is configured for reservations, an administrator or
                               authorized user may create specific reservations, defining reservation attributes
                               that include:
                               v The start time and the duration of the reservation. The start and end times
                                 for a reservation are based on the time-of-day (TOD) clock on the central
                                 manager machine.
|                              v Whether or not the reservation recurs and if it recurs, the interval in which it
|                                does so.
                               v The nodes to be reserved. Until the reservation period actually begins, the
                                 selected nodes are available to run any jobs; when the reservation starts, only
                                 jobs bound to the reservation may run on the reserved nodes.
                               v The users or groups that may use the reservation.
                               LoadLeveler assigns a unique ID to the reservation, and returns that ID to the
                               owner.
                               After the reservation is successfully created:
                               v Reservation owners may:
                                 – Modify, query, and cancel their reservations.
                                 – Allow other LoadLeveler users or groups to submit jobs to run during a
                                    reservation period.
                                 – Submit jobs to run during a reservation period.
                               v Users or groups that are allowed to use the reservation also may query
                                 reservations, and submit jobs to run during a reservation period. To run jobs
                                 during a reservation period, users must bind job steps to the reservation. You
                                 may bind both batch and interactive POE job steps to a reservation.
                            3. Preparing for the start of a reservation
                               During the preparation time for a reservation, LoadLeveler:
                               v Preempts any jobs that are still running on the reserved nodes.
                               v Checks the condition of reserved nodes, and notifies the reservation owner
                                 and LoadLeveler administrators by e-mail of any situations that might
                                 require the reservation owner or an administrator to take corrective action.
                                 Such conditions include:
                                 – Reserved nodes that are down, suspended, no longer in the LoadLeveler
                                    cluster, or otherwise unavailable for use.
                                 – Non-preemptable job steps that cannot finish running before the
                                    reservation start time.


During this time, reservation owners may modify, cancel, and add users or
                     groups to their reservations. Owners and users or groups that are allowed to
                     use the reservation may query the reservation or bind job steps to it.
                  4. Starting the reservation
                     When the reservation period begins, LoadLeveler dispatches job steps that are
                     bound to the reservation.
                     After the reservation period begins, reservation owners may modify, cancel,
                     and add users or groups to their reservations. Owners and users or groups that
                     are allowed to use the reservation may query the reservation or bind job steps
                     to it.
                     During the reservation period, LoadLeveler ignores system preemption rules
                     for bound job steps; however, LoadLeveler administrators may use the
                     llpreempt command to manually preempt bound job steps.

                  When the reservation ends or is canceled:
|                 v LoadLeveler unbinds all job steps from the reservation if there are no further
|                   occurrences remaining. At this point the unbound job steps compete with all
|                   other LoadLeveler jobs for available resources. If there are occurrences remaining
|                   in the reservation, job steps are automatically bound to the next occurrence.
                  v If accounting data is being collected for the reservation, LoadLeveler also
                    updates the reservation history file.
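
                   As an illustration of creating and querying reservations (step 2 above),
                   an authorized user might enter commands like the following (the flags
                   shown are a sketch; verify them in the command reference before use):

                      llmkres -t 11/20 10:00 -d 60 -n 4
                      llqres

                   Here llmkres requests a reservation of 4 nodes for 60 minutes starting
                   at 10:00 on November 20, and llqres displays existing reservations. The
                   llbind command binds submitted job steps to the reservation ID that
                   llmkres returns.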

                  For more detailed information and instructions for setting up and using
                  reservations, see:
                  v “Configuring LoadLeveler to support reservations” on page 131.
                  v “Working with reservations” on page 213.

    Fair share scheduling overview
                  Fair share scheduling in LoadLeveler provides a way to divide resources in a
                  LoadLeveler cluster among users or groups of users.

                  Historic resource usage data that is collected at the time the job ends can be used
                  to influence job priorities to achieve the resource usage proportions allocated to
                  users or groups of users in the LoadLeveler configuration files. The resource usage
                  data will decay over time so that the relatively recent historic resource usage will
                  have the most influence on job priorities. The CPU resources in the cluster and the
                  Blue Gene resources are currently supported by fair share scheduling.

                  For information about configuring fair share scheduling in LoadLeveler, see “Using
                  fair share scheduling” on page 160.
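
                   As a brief sketch (keyword names are taken from that discussion; verify
                   them there), an administrator might set, in the global configuration
                   file:

                      FAIR_SHARE_TOTAL_SHARES = 100

                   and, in a group stanza of the administration file:

                      dept_a: type = group
                      fair_shares = 60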




Chapter 2. Getting a quick start using the default
configuration
               If you are very familiar with UNIX and Linux system administration and job
               scheduling, follow these steps to get LoadLeveler up and running on your network
               quickly in a default configuration.

               This default configuration will merely enable you to submit serial jobs; for a more
               complex setup, see Chapter 4, “Configuring the LoadLeveler environment,” on
               page 41.

What you need to know before you begin
               LoadLeveler sets up default values for configuration information.
               v loadl is the recommended LoadLeveler user ID and the LoadLeveler group ID.
                 LoadLeveler daemons run under this user ID to perform file I/O, and many
                 LoadLeveler files are owned by this user ID.
               v The home directory of loadl is the configuration directory.
               v LoadL_config is the name of the configuration file.

               For information about configuration file keyword syntax and other details, see
               Chapter 12, “Configuration file reference,” on page 263.

Using the default configuration files
               Follow these steps to use the default configuration files.

               Note: You can find samples of the LoadL_admin and LoadL_config files in the
                     release directory (in the samples subdirectory).
               1. Ensure that the installation procedure has completed successfully and that the
                  configuration file, LoadL_config, exists in LoadLeveler’s home directory or in
                  the directory specified by the LoadLConfig keyword.
               2. Identify yourself as the LoadLeveler administrator in the LoadL_config file
                  using the LOADL_ADMIN keyword. The syntax of this keyword is:
                  LOADL_ADMIN = list_of_user_names (required)
                    Where list_of_user_names is a blank-delimited list of those individuals who
                    will have administrative authority.

                  Refer to “Defining LoadLeveler administrators” on page 43 for more
                  information.
               3. Define a machine to act as the LoadLeveler central manager by coding one
                  machine stanza as follows in the administration file, which is called
                  LoadL_admin. (Replace machine_name with the actual name of the machine.)
                   machine_name: type = machine
                   central_manager = true
                  Do not specify more than one machine as the central manager. Also, if during
                  installation, you ran llinit with the -cm flag, the central manager is already
                  defined in the LoadL_admin file because the llinit command takes parameters
                  that you entered and updates the administration and configuration files. See
                  “Defining machines” on page 84 for more information.
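
                   Taken together, a minimal sketch of these two edits (the machine name
                   is illustrative) looks like this:

                   In LoadL_config:
                      LOADL_ADMIN = loadl

                   In LoadL_admin:
                      mymachine.example.com: type = machine
                      central_manager = true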

LoadLeveler for Linux quick start
                            If you would like to quickly install and configure LoadLeveler for Linux and
                            submit a serial job on a single node, use these procedures.

                            Note: This setup is for a single node only and the node used for this example is:
                                  c197blade1b05.ppd.pok.ibm.com.

                 Quick installation
                            Details of this installation apply to RHEL 4 System x servers.

                            Note: This installation method is, however, applicable to all other systems. You
                                  must install the corresponding license RPM for the system you are installing
                                  on. This installation assumes that the LoadLeveler RPMs are located at:
                                  /mnt/cdrom/.
                            1. Log on to node c197blade1b05.ppd.pok.ibm.com as root, which is the node you
                               are installing on.
                            2. Add a UNIX group for LoadLeveler users (make sure the group ID is correct)
                               by entering the following command:
                                groupadd -g 1000 loadl
                            3. Add a UNIX user for LoadLeveler (make sure the user ID is correct) by
                               entering the following command:
                                useradd -c "LoadLeveler User" -d /home/loadl -s /bin/bash -u 1001 -g 1000 -m loadl
|                           4. Install the license RPM by entering the following command:
|                               rpm -ivh /mnt/cdrom/LoadL-full-license-RH4-X86-3.5.0.0-0.i386.rpm
                            5. Change to the LoadLeveler installation path by entering the following
                               command:
                                cd /opt/ibmll/LoadL/sbin
                            6. Run the LoadLeveler installation script by entering:
                                ./install_ll -y -d /mnt/cdrom
|                           7. Install the required LoadLeveler service updates for 3.5.0.1 for this RPM.
|                              Updates and installation instructions are available at:
|                              https://www14.software.ibm.com/webapp/set2/sas/f/loadleveler/download/
|                              intel.html

                 Quick configuration
                            Use this method to perform a quick configuration.
                            1. Switch to the newly created LoadLeveler user by entering the
                               following command:
                                su - loadl
                            2. Add the LoadLeveler bin directory to the search path:
                                export PATH=$PATH:/opt/ibmll/LoadL/full/bin
                            3. Run the LoadLeveler initialization script:
                                /opt/ibmll/LoadL/full/bin/llinit -local /tmp/loadl -release /opt/ibmll/LoadL/full -cm
                                c197blade1b05.ppd.pok.ibm.com


                 Quick verification
                            Use this method to perform a quick verification.
|                           1. Start LoadLeveler by entering the following command:
|                               llctl start


|                     You should receive a response similar to the following:
|                     llctl: Attempting to start LoadLeveler on host c197blade1b05.ppd.pok.ibm.com
|                     LoadL_master 3.5.0.1 rsats001a 2008/10/29 RHEL 4.0 140
|                     CentralManager = c197blade1b05.ppd.pok.ibm.com
|                     [loadl@c197blade1b05 bin]$
                   2. Check LoadLeveler status by entering the following command:
                      llstatus
                      You should receive a response similar to the following:
                      Name                      Schedd InQ Act Startd Run LdAvg Idle Arch OpSys
                      c197blade1b05.ppd.pok.ibm Avail 0     0 Idle 0 0.00 1          i386 Linux2
                      i386/Linux2                 1 machines     0 jobs       0 running task
                      Total Machines              1 machines     0 jobs       0 running task

                      The central manager is defined on c197blade1b05.ppd.pok.ibm.com

                      The BACKFILL scheduler is in use

                      All machines on the machine_list are present.
                      [loadl@c197blade1b05 bin]$
                   3. Submit a sample job, by entering the following command:
                      llsubmit /opt/ibmll/LoadL/full/samples/job1.cmd
                      You should receive a response similar to the following:
                      llsubmit: The job "c197blade1b05.ppd.pok.ibm.com.1" with 2 job steps /
                       has been submitted.
                      [loadl@c197blade1b05 samples]$
                   4. Display the LoadLeveler job queue, by entering the following command:
                      llq
                      You should receive a response similar to the following:
                      Id                       Owner      Submitted   ST PRI Class        Running On
                      ------------------------ ---------- ----------- -- --- ------------ -----------
                      c197blade1b05.1.0       loadl       8/15 17:25 R 50 No_Class       c197blade1b05
                      c197blade1b05.1.1       loadl       8/15 17:25 I 50 No_Class
                      2 job step(s) in queue, 1 waiting, 0 pending, 1 running, 0 held, 0 preempted
                      [loadl@c197blade1b05 samples]$
                    5. Check the output files in the home directory (/home/loadl) by entering the
                      following command:
                      ls -ltr job*
                      You should receive a response similar to the following:
                      -rw-rw-r-- 1 loadl loadl 1940 Aug 15 17:26 job1.c197blade1b05.1.0.out
                      -rw-rw-rw- 1 loadl loadl 1940 Aug 15 17:27 job1.c197blade1b05.1.1.out
                      [loadl@c197blade1b05 ~]$


    Post-installation considerations
                   This information explains how to start (or restart) and stop LoadLeveler. It also
                   tells you where files are located after you install LoadLeveler, and it points you to
                   troubleshooting information.

            Starting LoadLeveler
                   You can start LoadLeveler using any LoadLeveler administrator user ID as defined
                   in the configuration file.

                   To start all of the machines that are defined in machine stanzas in the
                   administration file, enter:
                   llctl -g start



The central manager machine is the first started, followed by other machines in the
                        order listed in the administration file. See “llctl - Control LoadLeveler daemons”
                        on page 439 for more information.

                        By default, llctl uses rsh to start LoadLeveler on the target machine. Other
                        mechanisms, such as ssh, can be used by setting the LL_RSH_COMMAND
                        configuration keyword in LoadL_config. However you choose to start LoadLeveler
                        on remote hosts, you must have the authority to run commands remotely on those
                        hosts.
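
                        For example, to use ssh instead of rsh, you might add a line such as
                        the following to LoadL_config (the path shown is typical, but verify
                        it on your system):

                           LL_RSH_COMMAND = /usr/bin/ssh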

                        You can verify that the machine has been properly configured by running the
                        sample jobs in the appropriate samples directory (job1.cmd, job2.cmd, and
                        job3.cmd). You must read the job2.cmd and job3.cmd files before submitting them
                        because job2 must be edited and a C program must be compiled to use job3. It is a
                        good idea to copy the sample jobs to another directory before modifying them; you
                        must have read/write permission to the directory in which they are located. You
                        can use the llsubmit command to submit the sample jobs from several different
                        machines and verify that they complete (see “llsubmit - Submit a job” on page
                        531).

                        If you are running AFS and some jobs do not complete, you might need to use the
                        AFS fs command (fs listacl) to ensure that you have write permission to the
                        spool, execute, and log directories.

                        If you are running with cluster security services enabled and some jobs do not
                        complete, ensure that you have write permission to the spool, execute, and log
                        directories. Also ensure that the user ID is authorized to run jobs on the submitting
                        machine (the identity of the user must exist in the .rhosts file of the user on the
                        machine on which the job is being run).

                        Note: LoadLeveler for Linux does not support cluster security services.

                        If you are running submit-only LoadLeveler, once the LoadLeveler pool is up and
                        running, you can use the llsubmit, llq, and llcancel commands from the
                        submit-only machines. For more information about these commands, see:
                        v “llsubmit - Submit a job” on page 531
                        v “llq - Query job status” on page 479
                        v “llcancel - Cancel a submitted job” on page 421

                        You can also invoke the LoadLeveler graphical user interface xloadl_so from the
                        submit-only machines (see Chapter 15, “Graphical user interface (GUI) reference,”
                        on page 403).

             Location of directories following installation
                         After installation, the product directories reside on disk.

                         The product directories are shown in Table 7 on page 33. The installation process
                         creates only those directories required to service the LoadLeveler options
                         specified during the installation. For AIX, release_directory indicates
                         /usr/lpp/LoadL/full; for Linux, it indicates /opt/ibmll/LoadL/full.




Table 7. Location and description of product directories following installation
Directory                                         Description
release_directory/bin                             Part of the release directory containing
                                                  daemons, commands, and other binaries
release_directory/lib                             Part of the release directory containing
                                                  product libraries and resource files
release_directory/man                             Part of the release directory containing man
                                                  pages
release_directory/samples                         Part of the release directory containing
                                                  sample administration and configuration files
                                                  and sample jobs
release_directory/include                         Part of the release directory containing
                                                  header files for the application programming
                                                  interfaces
Local directory                                   spool, execute, and log directories for each
                                                  machine in the cluster
Home directory                                    Administration and configuration files, and
                                                  symbolic links to the release directory
/usr/lpp/LoadL/codebase                           Configuration tasks for AIX


Table 8 shows the location of directories for submit-only LoadLeveler:
Table 8. Location and description of directories for submit-only LoadLeveler
Directory                                         Description
release_directory/so/bin                          Part of the release directory containing
                                                  commands
release_directory/so/man                          Part of the release directory containing man
                                                  pages
release_directory/so/samples                      Part of the release directory containing
                                                  sample administration and configuration files
release_directory/so/lib                          Contains libraries and graphical user
                                                  interface resource files
Home directory                                    Contains administration and configuration
                                                  files


If you have a mixed LoadLeveler cluster of AIX and Linux machines, you might
want to make the following symbolic links:
v On AIX, as root, enter:
  mkdir -p /opt/ibmll
  ln -s /usr/lpp/LoadL /opt/ibmll/LoadL
v On Linux, as root, enter:
  mkdir -p /usr/lpp
  ln -s /opt/ibmll/LoadL /usr/lpp/LoadL

With the addition of these symbolic links, a user application can use either
/usr/lpp/LoadL or /opt/ibmll/LoadL to refer to the location of LoadLeveler files
regardless of whether the application is running on AIX or Linux.

If LoadLeveler will not start following installation, see “Why won’t LoadLeveler
start?” on page 700 for troubleshooting information.

Chapter 3. What operating systems are supported by
    LoadLeveler?
                  LoadLeveler supports the following operating systems and platforms:
|                 v AIX 6.1 and AIX 5.3
|                   IBM’s AIX 6.1 and AIX 5.3 are open UNIX operating environments that conform
|                   to The Open Group UNIX 98 Base Brand industry standard. AIX 6.1 and AIX 5.3
|                   provide high levels of integration, flexibility, and reliability and operate on IBM
|                   Power Systems and IBM Cluster 1600 servers and workstations.
|                   AIX 6.1 and AIX 5.3 support the concurrent operation of 32- and 64-bit
|                   applications, with key internet technologies such as Java™ and XML parser for
|                   Java included as part of the base operating system.
|                   A strong affinity between AIX and Linux permits popular applications
|                   developed on Linux to run on AIX 6.1 and AIX 5.3 with a simple recompilation.
|                 v Linux
|                   LoadLeveler supports the following distributions of Linux:
|                   – Red Hat® Enterprise Linux (RHEL) 4 and RHEL 5
|                   – SUSE Linux Enterprise Server (SLES) 9 and SLES 10
                  v IBM System Blue Gene Solution
                    While no LoadLeveler processes actually run on the Blue Gene machine,
                    LoadLeveler can interact with the Blue Gene machine and supports the
                    scheduling of jobs to the machine.

                    Note: For models of the Blue Gene system such as Blue Gene/S, which can only
                          run a single job at a time, LoadLeveler does not have to be configured to
                          schedule resources for Blue Gene jobs. For such systems, serial jobs can be
                          used to submit work to the front end node for the Blue Gene system.

    LoadLeveler for AIX and LoadLeveler for Linux compatibility
                  LoadLeveler for Linux is compatible with LoadLeveler for AIX. Its command line
                  interfaces, graphical user interfaces, and application programming interfaces (APIs)
                  are the same as those of LoadLeveler for AIX. The formats of the job command file,
                  configuration file, and administration file also remain the same.

                  System administrators can set up and maintain a LoadLeveler cluster consisting of
                  some machines running LoadLeveler for AIX and some machines running
                  LoadLeveler for Linux. This is called a mixed cluster. In this mixed cluster jobs can
                  be submitted from either AIX or Linux machines. Jobs submitted to a Linux job
                  queue can be dispatched to an AIX machine for execution, and jobs submitted to
                  an AIX job queue can be dispatched to a Linux machine for execution.

                  Although the LoadLeveler products for AIX and Linux are compatible, they do
                  have some differences in the level of support for specific features. For further
                  details, see the following topics:
                  v “Restrictions for LoadLeveler for Linux” on page 36.
                  v “Features not supported in LoadLeveler for Linux” on page 36.
                  v “Restrictions for LoadLeveler for AIX and LoadLeveler for Linux mixed clusters”
                    on page 37.


Restrictions for LoadLeveler for Linux
                            LoadLeveler for Linux supports a subset of the features that are available in the
                            LoadLeveler for AIX product.

                            The following features are available, but are subject to restrictions:
                            v 32-bit applications using the LoadLeveler APIs
                              LoadLeveler for Linux supports only the 32-bit LoadLeveler API library
                              (libllapi.so) on the following platforms:
                              – RHEL 4 and RHEL 5 on IBM IA-32 xSeries® servers
                              – SLES 9 and SLES 10 on IBM IA-32 xSeries servers
                              Applications linked to the LoadLeveler APIs on these platforms must be 32-bit
                              applications.
                             v 64-bit applications using the LoadLeveler APIs
                              LoadLeveler for Linux supports only the 64-bit LoadLeveler API library
                              (libllapi.so) on the following platforms:
                              – RHEL 4 and RHEL 5 on IBM xSeries servers with AMD Opteron or Intel
                                  EM64T processors
                              – RHEL 4 and RHEL 5 on POWER™ servers
                              – SLES 9 and SLES 10 on IBM xSeries servers with AMD Opteron or Intel
                                  EM64T processors
                              – SLES 9 and SLES 10 on POWER servers
                              Applications linked to the LoadLeveler APIs on these platforms must be 64-bit
                              applications.
                            v Support for AFS file systems
                              LoadLeveler for Linux support for authenticated access to AFS file systems is
                              limited to RHEL 4 on xSeries servers and IBM xSeries servers with AMD
                              Opteron or Intel EM64T processors. It is not available on systems running SLES
                              9 or SLES 10.

                 Features not supported in LoadLeveler for Linux
                            LoadLeveler for Linux supports a subset of the features that are available in the
                            LoadLeveler for AIX product.

                            The following features are not supported:
                            v RDMA consumable resource
                              On systems with High Performance Switch adapters, RDMA consumable
                              resources are not supported on LoadLeveler for Linux.
                            v User context RDMA blocks
                              User context RDMA blocks are not supported by LoadLeveler for Linux.
                            v Checkpoint/restart
                              LoadLeveler for AIX uses a number of features that are specific to the AIX
                              kernel to provide support for checkpoint/restart of user applications running
                              under LoadLeveler. Checkpoint/restart is not available in this release of
                              LoadLeveler for Linux.
                            v AIX Workload management (WLM)
                              WLM can strictly control use of system resources. LoadLeveler for AIX uses
                              WLM to enforce the use of a number of consumable resources defined by
|                             LoadLeveler (such as ConsumableCpus, ConsumableVirtualMemory,
|                             ConsumableLargePageMemory, and ConsumableMemory). This enforcement
              of consumable resource usage through WLM is not available in this release of
            LoadLeveler for Linux.
          v CtSec security
            LoadLeveler for AIX can exploit CtSec (Cluster Security Services) security
            functions. These functions authenticate the identity of users and programs
            interacting with LoadLeveler. These features are not available in this release of
            LoadLeveler for Linux.
          v LoadL_GSmonitor daemon
            The LoadL_GSmonitor daemon in the LoadLeveler for AIX product uses the
            Group Services Application Programming Interface (GSAPI) to monitor machine
            availability and notify the LoadLeveler central manager when a machine is no
            longer reachable. This daemon is not available in the LoadLeveler for Linux
            product.
          v Task guide tool
          v System error log
             Each LoadLeveler daemon has its own log file where information relevant to its
             operation is recorded. In addition to this feature, which exists on all platforms,
             LoadLeveler for AIX also uses the errlog function to record critical LoadLeveler
             events into the AIX system log. Support for an equivalent Linux function is not
             available in this release.

    Restrictions for LoadLeveler for AIX and LoadLeveler for
    Linux mixed clusters
|         Several restrictions apply when operating a LoadLeveler cluster that contains AIX
|         6.1, AIX 5.3, and Linux machines.

|         When operating a LoadLeveler cluster that contains AIX 6.1, AIX 5.3, and Linux
          machines, the following restrictions apply:
          v The central manager node must run a version of LoadLeveler equal to or higher
            than any LoadLeveler version being run on a node in the cluster.
          v CtSec security features cannot be used.
          v AIX jobs that use checkpointing must be sent to AIX nodes for execution. You
            can do this either by defining job checkpointing for job classes that exist only on
            AIX nodes or by coding appropriate requirements expressions (see the example
            after this list). Checkpointing jobs that are sent to a Linux node will be rejected
            by the LoadL_startd daemon running on the Linux node.
          v WLM is supported in a mixed cluster. However, enforcement of the use of
            consumable resources will occur through WLM on AIX nodes only.
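
          For example, a job command file might direct a checkpointing job to AIX nodes
          with a requirements expression similar to the following sketch (the OpSys value
          shown is illustrative; use the value that your AIX nodes report):
          # @ requirements = (OpSys == "AIX53")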




Part 2. Configuring and managing the TWS LoadLeveler
environment
            After installing IBM Tivoli Workload Scheduler (TWS) LoadLeveler, you may
            customize it by modifying both the configuration file and the administration file
            (see Part 1, “Overview of TWS LoadLeveler concepts and operation,” on page 1 for
            overview information). The configuration file contains many parameters that you
            can set or modify that will control how TWS LoadLeveler operates. The
            administration file optionally lists and defines the machines in the TWS
            LoadLeveler cluster and the characteristics of classes, users, and groups.

            To easily manage TWS LoadLeveler, you should have one global configuration file
            and only one administration file, both centrally located on a machine in the TWS
            LoadLeveler cluster. Every other machine in the cluster must be able to read the
             configuration and administration files that are located on the central machine.

            You may have multiple local configuration files that specify information specific to
            individual machines.

            TWS LoadLeveler does not prevent you from having multiple copies of
            administration files, but you need to be sure to update all the copies whenever you
            make a change to one. Having only one administration file prevents any confusion.




Chapter 4. Configuring the LoadLeveler environment
            One of your main tasks as system administrator is to configure LoadLeveler.

            To configure LoadLeveler, you need to know what the configuration information is
            and where it is located. Configuration information includes the following:
            v The LoadLeveler user ID and group ID
            v The configuration directory
            v The global configuration file

            Configuring LoadLeveler involves modifying the configuration files that specify
            the terms under which LoadLeveler can use machines. There are two types of
            configuration files:
            v Global Configuration File: This file by default is called the LoadL_config file and it
              contains configuration information common to all nodes in the LoadLeveler
              cluster.
            v Local Configuration File: This file is generally called LoadL_config.local (although
              it is possible for you to rename it). This file contains specific configuration
              information for an individual node. The LoadL_config.local file is in the same
              format as LoadL_config and the information in this file overrides any
              information specified in LoadL_config. It is an optional file that you use to
              modify information on a local machine. Its full path name is specified in the
              LoadL_config file by using the LOCAL_CONFIG keyword. See “Specifying file
              and directory locations” on page 47 for more information.
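
             For example, the global configuration file might point each machine to a local file
             kept in the LoadLeveler home directory. The path shown here is illustrative (in the
             sample configuration files, $(tilde) expands to the LoadLeveler home directory):
             LOCAL_CONFIG = $(tilde)/LoadL_config.local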
            Table 9 identifies where you can find more information about using configuration
            and administration files to modify the TWS LoadLeveler environment.
            Table 9. Roadmap of tasks for TWS LoadLeveler administrators
            To learn about:                          Read the following:
            Controlling how TWS LoadLeveler          Chapter 4, “Configuring the LoadLeveler
            operates by customizing the global or    environment”
            local configuration file
            Controlling TWS LoadLeveler resources    Chapter 5, “Defining LoadLeveler resources to
            by customizing an administration file    administer,” on page 83
            Additional ways to modify TWS            Chapter 6, “Performing additional administrator
            LoadLeveler that require customization   tasks,” on page 103
            of both the configuration and
            administration files
             Ways to control or monitor TWS           v Chapter 16, “Commands,” on page 411
             LoadLeveler operations by using the      v Chapter 7, “Using LoadLeveler’s GUI to
             TWS LoadLeveler commands, GUI,             perform administrator tasks,” on page 169
             and APIs                                 v Chapter 17, “Application programming
                                                        interfaces (APIs),” on page 541


            You can run your installation with default values set by LoadLeveler, or you can
            change any or all of them. Table 10 on page 42 lists topics that discuss how you
            may configure the LoadLeveler environment by modifying the configuration file.




Table 10. Roadmap of administrator tasks related to using or modifying the LoadLeveler
                        configuration file
                        To learn about:                Read the following:
                        Using the default              Chapter 2, “Getting a quick start using the default
                        configuration files shipped    configuration,” on page 29
                        with LoadLeveler
                        Modifying the global and       “Modifying a configuration file”
                        local configuration files
                        Defining major elements of    v “Defining LoadLeveler administrators” on page 43
                        the LoadLeveler configuration
                                                      v “Defining a LoadLeveler cluster” on page 44
                                                       v “Defining LoadLeveler machine characteristics” on page
                                                         54
                                                       v “Defining security mechanisms” on page 56
                                                       v “Defining usage policies for consumable resources” on
                                                         page 60
                                                       v “Steps for configuring a LoadLeveler multicluster” on
                                                         page 151
                        Enabling optional              v “Enabling support for bulk data transfer and rCxt blocks”
                        LoadLeveler functions            on page 61
                                                       v “Gathering job accounting data” on page 61
                                                       v “Managing job status through control expressions” on
                                                         page 68
                                                       v “Tracking job processes” on page 70
                                                       v “Querying multiple LoadLeveler clusters” on page 71
                        Modifying LoadLeveler          “Providing additional job-processing controls through
                        operations through             installation exits” on page 72
                        installation exits



Modifying a configuration file
                        By taking a look at the configuration files that come with LoadLeveler, you will
                        find that there are many parameters that you can set. In most cases, you will only
                        have to modify a few of these parameters.

                        In some cases, though, depending upon the LoadLeveler nodes, network
                        connection, and hardware availability, you may need to modify additional
                        parameters.

                        All LoadLeveler commands, daemons, and processes read the administration and
                        configuration files at start up time. If you change the administration or
                        configuration files after LoadLeveler has already started, any LoadLeveler
                        command or process, such as the LoadL_starter process, will read the newer
                        version of the files while the running daemons will continue to use the data from
                        the older version. To ensure that all LoadLeveler commands, daemons, and
                        processes use the same configuration data, run the reconfiguration command on all
                        machines in the cluster each time the administration or configuration files are
                        changed.

                        To override the default user ID, group ID, or configuration file location, you must
                        update the following keywords in the /etc/LoadL.cfg file:
                        LoadLUserid
                                Specifies the LoadLeveler user ID.

LoadLGroupid
                    Specifies the LoadLeveler group ID.
              LoadLConfig
                    Specifies the full path name of the configuration file.
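
               For example, a minimal /etc/LoadL.cfg might contain the following entries (the
               values shown are illustrative; adjust them for your installation):
               LoadLUserid  = loadl
               LoadLGroupid = loadl
               LoadLConfig  = /home/loadl/LoadL_config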

              Note that if you change the LoadLeveler user ID to something other than loadl,
              you will have to make sure your configuration files are owned by this ID.

               If Cluster Security (CtSec) services are enabled, make sure you update the unix.map
               file if the LoadLUserid is specified as something other than loadl. Refer to “Steps
               for enabling CtSec services” on page 58 for more details.

              You can also override the /etc/LoadL.cfg file. For an example of when you might
              want to do this, see “Querying multiple LoadLeveler clusters” on page 71.

              Before you modify a configuration file, you need to:
              v Ensure that the installation procedure has completed successfully and that the
                configuration file, LoadL_config, exists in LoadLeveler’s home directory or in
                the directory specified in /etc/LoadL.cfg. For additional details about installation,
                see TWS LoadLeveler: Installation Guide.
              v Know how to correctly specify keywords in the configuration file. For
                information about configuration file keyword syntax and other details, see
                Chapter 12, “Configuration file reference,” on page 263.
              v Identify yourself as the LoadLeveler administrator using the LOADL_ADMIN
                keyword.

              After you finish modifying the configuration file, notify LoadLeveler daemons by
              issuing the llctl command with either the reconfig or recycle keyword. Otherwise,
              LoadLeveler will not process the modifications you made to the configuration file.
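
               For example, to have all machines in the cluster reread the modified files, you
               might issue:
               llctl -g reconfig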

Defining LoadLeveler administrators
              Specify the LOADL_ADMIN keyword with a list of user names of those
              individuals who will have administrative authority.
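
               For example, the following statement (the user names are hypothetical) grants
               administrative authority to the loadl ID and two other users:
               LOADL_ADMIN = loadl anna bob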

              These users are able to invoke the administrator-only commands such as llctl,
              llfavorjob, and llfavoruser. These administrators can also invoke the
              administrator-only GUI functions. For more information, see Chapter 7, “Using
              LoadLeveler’s GUI to perform administrator tasks,” on page 169.

              LoadLeveler administrators on this list also receive mail describing problems that
              are encountered by the master daemon. When CtSec is enabled, the
              LOADL_ADMIN list is used only as a mailing list. For more information, see
              “Defining security mechanisms” on page 56.

               An administrator on a machine is granted administrative privileges on that
               machine only; it does not grant administrative privileges on other machines. To be
              an administrator on all machines in the LoadLeveler cluster, either specify your
              user ID in the global configuration file with no entries in the local configuration
              file, or specify your user ID in every local configuration file that exists in the
              LoadLeveler cluster.

              For information about configuration file keyword syntax and other details, see
              Chapter 12, “Configuration file reference,” on page 263.


Defining a LoadLeveler cluster
                            It will be necessary to define the characteristics of the LoadLeveler cluster.

                            Table 11 lists the topics that discuss how you can define the characteristics of the
                            LoadLeveler cluster.
                            Table 11. Roadmap for defining LoadLeveler cluster characteristics
                            To learn about:                   Read the following:
                            Defining characteristics of       v “Choosing a scheduler”
                            specific LoadLeveler daemons
                                                              v “Setting negotiator characteristics and policies” on page
                                                                45
                                                              v “Specifying alternate central managers” on page 46
                            Defining other cluster            v “Defining network characteristics” on page 47
                            characteristics
                                                              v “Specifying file and directory locations” on page 47
                                                              v “Configuring recording activity and log files” on page
                                                                48
                                                              v “Setting up file system monitoring” on page 54
                            Correctly specifying              Chapter 12, “Configuration file reference,” on page 263
                            configuration file keywords
                         Working with daemons and          v “llctl - Control LoadLeveler daemons” on page 439
                         machines in a LoadLeveler         v “llinit - Initialize machines in the LoadLeveler cluster”
                         cluster                             on page 457



                 Choosing a scheduler
                            This topic discusses the types of schedulers available, which you may specify using
                            the configuration file keyword SCHEDULER_TYPE.

                            For information about the configuration file keyword syntax and other details, see
                            Chapter 12, “Configuration file reference,” on page 263.
|                           LL_DEFAULT
|                                 This scheduler runs serial jobs. It efficiently uses CPU time by scheduling
|                                 jobs on what otherwise would be idle nodes (and workstations). It does
|                                 not require that users set a wall clock limit. Also, this scheduler starts,
|                                 suspends, and resumes jobs based on workload.
|                           BACKFILL
|                                This scheduler runs both serial and parallel jobs. The objective of
|                                BACKFILL scheduling is to maximize the use of resources to achieve the
|                                highest system efficiency, while preventing potentially excessive delays in
|                                starting jobs with large resource requirements. These large jobs can run
|                                because the BACKFILL scheduler does not allow jobs with smaller resource
|                                requirements to continuously use up resources before the larger jobs can
|                                accumulate enough resources to run.
|                                    The BACKFILL scheduler supports:
|                                    v The scheduling of multiple tasks per node
|                                    v The scheduling of multiple user space tasks per adapter
|                                    v The preemption of jobs
|                                    v The use of reservations
|                                    v The scheduling of inbound and outbound data staging tasks



|                 v Scale-across scheduling that allows you to take advantage of
|                   underutilized resources in a multicluster installation
|                 These functions are not supported by the default LoadLeveler scheduler.
|                 For more information about the BACKFILL scheduler, see “Using the
|                 BACKFILL scheduler” on page 110.
          API     This keyword option allows you to enable an external scheduler, such as
                  the Extensible Argonne Scheduling sYstem (EASY). The API option is
                  intended for installations that want to create a scheduling algorithm for
                  parallel jobs based on site-specific requirements.
                  For more information about external schedulers, see “Using an external
                  scheduler” on page 115.
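
           For example, to select the BACKFILL scheduler, set the following keyword in the
           global configuration file:
           SCHEDULER_TYPE = BACKFILL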

    Setting negotiator characteristics and policies
          You may set the following negotiator characteristics and policies.

          For information about configuration file keyword syntax and other details, see
          Chapter 12, “Configuration file reference,” on page 263.
          v Prioritize the queue maintained by the negotiator
             Each job step submitted to LoadLeveler is assigned a system priority number,
             based on the evaluation of the SYSPRIO keyword expression in the
             configuration file of the central manager (a sample expression appears after this
             list). The LoadLeveler system priority number is assigned when the central
             manager adds the new job step to the queue of job steps eligible for dispatch.
             Once assigned, the system priority number for a job step is not changed, except
             under the following circumstances:
            – An administrator or user issues the llprio command to change the system
               priority of the job step.
            – The value set for the NEGOTIATOR_RECALCULATE_SYSPRIO_INTERVAL
               keyword is not zero.
            – An administrator uses the llmodify command with the -s option to alter the
               system priority of a job step.
            – A program with administrator credentials uses the ll_modify subroutine to
               alter the system priority of a job step.
            Job steps assigned higher SYSPRIO numbers are considered for dispatch before
            job steps with lower numbers.
            For related information, see the following topics:
            – “Controlling the central manager scheduling cycle” on page 73.
            – “Setting and changing the priority of a job” on page 230.
            – “llmodify - Change attributes of a submitted job step” on page 464.
            – “ll_modify subroutine” on page 677.
          v Prioritize the order of executing machines maintained by the negotiator
             Each executing machine is assigned a machine priority number, based on the
             evaluation of the MACHPRIO keyword expression in the configuration file of
             the central manager (a sample expression appears after this list). The
             LoadLeveler machine priority number is updated every
            time the central manager updates its machine data. Machines assigned higher
            MACHPRIO numbers are considered to run jobs before machines with lower
            numbers. For example, a machine with a MACHPRIO of 10 is considered to run
            a job before a machine with a MACHPRIO of 5. Similarly, a machine with a
            MACHPRIO of -2 would be considered to run a job before a machine with a
            MACHPRIO of -3.
            Note that the MACHPRIO keyword is valid only on the machine where the
            central manager is running. Using this keyword in a local configuration file has
            no effect.

When you use a MACHPRIO expression that is based on load average, the
                              machine may be temporarily ordered later in the list immediately after a job is
                              scheduled to that machine. This temporary drop in priority happens because the
                              negotiator adds a compensating factor to the startd machine’s load average
                              every time the negotiator assigns a job. For more information, see the
                              NEGOTIATOR_LOADAVG_INCREMENT keyword.
                            v Specify additional negotiator policies
                              This topic lists keywords that were not mentioned in the previous configuration
                              steps. Unless your installation has special requirements for any of these
                              keywords, you can use them with their default settings.
                              – NEGOTIATOR_INTERVAL
                              – NEGOTIATOR_CYCLE_DELAY
                              – NEGOTIATOR_CYCLE_TIME_LIMIT
                              – NEGOTIATOR_LOADAVG_INCREMENT
                              – NEGOTIATOR_PARALLEL_DEFER
                              – NEGOTIATOR_PARALLEL_HOLD
                              – NEGOTIATOR_RECALCULATE_SYSPRIO_INTERVAL
                              – NEGOTIATOR_REJECT_DEFER
                              – NEGOTIATOR_REMOVE_COMPLETED
                              – NEGOTIATOR_RESCAN_QUEUE
|                             – SCALE_ACROSS_SCHEDULING_TIMEOUT
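
           For example, the following expressions (illustrative only, not a recommended
           policy) give the highest SYSPRIO to the job steps that have been queued longest
           and the highest MACHPRIO to the machines with the lowest load average:
           SYSPRIO : 0 - (QDate)
           MACHPRIO : 0 - (LoadAvg)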

                 Specifying alternate central managers
                            In one of your machine stanzas specified in the administration file, you specified
                            that the machine would serve as the central manager.

                             Problems such as network communication failures, or software or hardware
                             failures, can make this central manager unusable. In such cases, the other
                             machines in the LoadLeveler cluster believe that the central manager machine
                             is no longer operating. To remedy this situation, you can assign one or
                             more alternate central managers in the machine stanza to take control.

                            The following machine stanza example defines the machine deep_blue as an
                            alternate central manager:
                            #
                            deep_blue: type=machine
                            central_manager = alt

                            If the primary central manager fails, the alternate central manager then becomes
                            the central manager. The alternate central manager is chosen based upon the order
                            in which its respective machine stanza appears in the administration file.

                            When an alternate becomes the central manager, jobs will not be lost, but it may
                            take a few minutes for all of the machines in the cluster to check in with the new
                            central manager. As a result, job status queries may be incorrect for a short time.

                            When you define alternate central managers, you should set the following
                            keywords in the configuration file:
                            v CENTRAL_MANAGER_HEARTBEAT_INTERVAL
                            v CENTRAL_MANAGER_TIMEOUT

                            In the following example, the alternate central manager will wait for 30 intervals,
                            where each interval is 45 seconds:




# Set a 45 second interval
      CENTRAL_MANAGER_HEARTBEAT_INTERVAL = 45
      # Set the number of intervals to wait
      CENTRAL_MANAGER_TIMEOUT = 30

      For more information on central manager backup, refer to “What happens if the
      central manager isn’t operating?” on page 708. For information about configuration
      file keyword syntax and other details, see Chapter 12, “Configuration file
      reference,” on page 263.

Defining network characteristics
      A port number is an integer that specifies the port to use to connect to the
      specified daemon.

      You can define these port numbers in the configuration file or the /etc/services file
      or you can accept the defaults. LoadLeveler first looks in the configuration file for
      these port numbers. If LoadLeveler does not find the value in the configuration
      file, it looks in the /etc/services file. If the value is not found in this file, the default
      is used.

      See Appendix C, “LoadLeveler port usage,” on page 741 for more information.

Specifying file and directory locations
      The configuration file provided with LoadLeveler specifies default locations for all
      of the files and directories.

      You can modify their locations using the keywords shown in Table 12. Keep in
      mind that the LoadLeveler installation process installs files in these directories and
      these files may be periodically cleaned up. Therefore, you should not keep any
      files that do not belong to LoadLeveler in these directories.

       Managing distributed software systems is a primary concern for all system
       administrators. Allowing users to share file systems to obtain a single,
       network-wide image is one way to make managing LoadLeveler easier.
      Table 12. Default locations for all of the files and directories
      To specify the
      location of the:   Specify this keyword:
      Administration     ADMIN_FILE
      file
      Local              LOCAL_CONFIG
      configuration
      file
      Local directory    The following subdirectories reside in the local directory. It is possible that
                         the local directory and LoadLeveler’s home directory are the same.
                         v COMM
                         v EXECUTE
                         v LOG
                         v SPOOL and HISTORY

                         Tip: To maximize performance, you should keep the log, spool, and
                         execute directories in a local file system. Also, to measure the performance
                         of your network, consider using one of the available products, such as
                         Toolbox/6000.




                        Release            RELEASEDIR
                        directory
                                           The following subdirectories are created during installation and they
                                           reside in the release directory. You can change their locations.
                                           v BIN
                                           v LIB
                        Core dump          You may specify alternate directories to hold core dumps for the daemons
                        directory          and starter process:
                                           v MASTER_COREDUMP_DIR
                                           v NEGOTIATOR_COREDUMP_DIR
                                           v SCHEDD_COREDUMP_DIR
                                           v STARTD_COREDUMP_DIR
                                           v GSMONITOR_COREDUMP_DIR
                                           v KBDD_COREDUMP_DIR
                                           v STARTER_COREDUMP_DIR

                                           When specifying core dump directories, be sure that the access
                                           permissions are set so the LoadLeveler daemon or process can write to
                                           the core dump directory. The permissions set for path names specified in
                                           the keywords just mentioned must allow writing by both root and the
                                           LoadLeveler ID. The permissions set for the path name specified for the
                                           STARTER_COREDUMP_DIR keyword must allow writing by root, the
                                           LoadLeveler ID, and any user who can submit LoadLeveler jobs.

                                           The simplest way to be sure the access permissions are set correctly is to
                                           set them to the same permissions that are set for the /tmp directory.

                                           If a problem with access permissions prevents a LoadLeveler daemon or
                                           process from writing to a core dump directory, then a message will be
                                           written to the log, and the daemon or process will continue using the
                                           default /tmp directory for core files.
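
                         For example, the following entries keep the log, spool, and execute directories in
                         a local file system, as the tip in Table 12 recommends (the paths shown are
                         illustrative):
                         LOG     = /var/loadl/log
                         SPOOL   = /var/loadl/spool
                         EXECUTE = /var/loadl/execute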


                        For information about configuration file keyword syntax and other details, see
                        Chapter 12, “Configuration file reference,” on page 263.

             Configuring recording activity and log files
                        The LoadLeveler daemons and processes keep log files according to the
                        specifications in the configuration file.

                        Administrators can also configure the LoadLeveler daemons to store additional
                        debugging messages in a circular buffer in memory. A number of keywords are
                        used to describe where LoadLeveler maintains the logs and how much information
                        is recorded in each log and buffer. These keywords, shown in Table 13 on page 49,
                        are repeated in similar form to specify the path name of the log file, its maximum
                        length, the size of the circular buffer, and the debug flags to be used for the log
                        and the buffer.

                        “Controlling the logging buffer” on page 50 describes how administrators can
                        configure LoadLeveler to buffer debugging messages.

                        “Controlling debugging output” on page 51 describes the events that can be
                        reported through logging controls.



“Saving log files” on page 53 describes the configuration keyword to use to save
                        logs for problem diagnosis.

                        For information about configuration file keyword syntax and other details, see
                        Chapter 12, “Configuration file reference,” on page 263.
Table 13. Log control statements
Daemon/      Log File (required)     Max Length (required)                       Debug Control (required)
Process      (See note 1)            (See note 2)                                (See note 4 on page 50)
Master       MASTER_LOG = path       MAX_MASTER_LOG = bytes [buffer bytes]       MASTER_DEBUG = flags [buffer flags]
Schedd       SCHEDD_LOG = path       MAX_SCHEDD_LOG = bytes [buffer bytes]       SCHEDD_DEBUG = flags [buffer flags]
Startd       STARTD_LOG = path       MAX_STARTD_LOG = bytes [buffer bytes]       STARTD_DEBUG = flags [buffer flags]
Starter      STARTER_LOG = path      MAX_STARTER_LOG = bytes [buffer bytes]      STARTER_DEBUG = flags [buffer flags]
Negotiator   NEGOTIATOR_LOG = path   MAX_NEGOTIATOR_LOG = bytes [buffer bytes]   NEGOTIATOR_DEBUG = flags [buffer flags]
Kbdd         KBDD_LOG = path         MAX_KBDD_LOG = bytes [buffer bytes]         KBDD_DEBUG = flags [buffer flags]
GSmonitor    GSMONITOR_LOG = path    MAX_GSMONITOR_LOG = bytes [buffer bytes]    GSMONITOR_DEBUG = flags [buffer flags]


                        where:
                        buffer bytes
                                 Is the size of the circular buffer. The default value is 0, which indicates that
                                 the buffer is disabled. To prevent the daemon from running out of
                                 memory, this value should not be too large. Brackets must be used to
                                 specify buffer bytes.
                        buffer flags
                                  Indicates that messages with buffer flags in addition to messages with flags
                                  will be stored in the circular buffer in memory. The default value is blank,
                                  which indicates that the logging buffer is disabled because no additional
                                  debug flags were specified for buffering. Brackets must be used to specify
                                  buffer flags.

                        Note:
                                 1. When coding the path for the log files, it is not necessary that all
                                    LoadLeveler daemons keep their log files in the same directory, however,
                                    you will probably find it a convenient arrangement.
                                 2. There is a maximum length, in bytes, beyond which the various log files
                                    cannot grow. Each file is allowed to grow to the specified length and is
                                    then saved to an .old file. The .old files are overwritten each time the log
                                    is saved, thus the maximum space devoted to logging for any one
                                    program will be twice the maximum length of its log file. The default
                                    length is 64 KB. To obtain records over a longer period of time that do
                                    not get overwritten, you can use the SAVELOGS keyword in the local or
                                    global configuration files. See “Saving log files” on page 53 for more
                                    information on extended capturing of LoadLeveler logs.


You can also specify that the log file be started anew with every
                                   invocation of the daemon by setting the TRUNC statement to true as
                                   follows:
                                   v TRUNC_MASTER_LOG_ON_OPEN = true|false
                                   v TRUNC_STARTD_LOG_ON_OPEN = true|false
                                   v TRUNC_SCHEDD_LOG_ON_OPEN = true|false
                                   v TRUNC_KBDD_LOG_ON_OPEN = true|false
                                   v TRUNC_STARTER_LOG_ON_OPEN = true|false
                                   v TRUNC_NEGOTIATOR_LOG_ON_OPEN = true|false
                                   v TRUNC_GSMONITOR_LOG_ON_OPEN = true|false
                                3. LoadLeveler creates temporary log files used by the starter daemon.
                                   These files are used for synchronization purposes. When a job starts, a
                                   StarterLog.pid file is created. When the job ends, this file is appended to
                                   the StarterLog file.
                                 4. Normally, only those who are installing or debugging LoadLeveler will
                                    need to use the debug flags, described in “Controlling debugging
                                    output” on page 51. The default error logging, obtained by leaving the
                                    right side of the debug control statement null, will be sufficient for most
                                    installations.

                        Controlling the logging buffer
                        LoadLeveler allows a LoadLeveler daemon to store log messages in a buffer in
                        memory instead of writing the messages to a log file.

                        The administrator can force the messages in this buffer to be written to the log file,
                        when necessary, to diagnose a problem. The buffer is circular and once it is full,
                        older messages are discarded as new messages are logged. The llctl dumplogs
                        command is used to write the contents of the logging buffer to the appropriate log
                        file for the Master, Negotiator, Schedd, and Startd daemons.

                        Buffering will be disabled if either the buffer length is 0 or no additional debug
                        flags are specified for buffering.

                        See “Configuring recording activity and log files” on page 48 for log control
                        statement specifications. See TWS LoadLeveler: Diagnosis and Messages Guide for
                        additional information on TWS LoadLeveler log files.

                        Logging buffer example:

                        With the following configuration, the Schedd daemon will write only D_ALWAYS
                        and D_SCHEDD messages to the ${LOG}/SchedLog log file. The following
                        messages will be stored in the buffer:
                        v D_ALWAYS
                        v D_SCHEDD
                        v D_LOCKING
                        The maximum size of the Schedd log is 64 MB and the size of the logging buffer is
                        32 MB.
                        SCHEDD_LOG = ${LOG}/SchedLog
                        MAX_SCHEDD_LOG = 64000000 [32000000]
                        SCHEDD_DEBUG = D_SCHEDD [D_LOCKING]

                        To write the contents of the logging buffer to the SchedLog file on the local
                        machine, issue:
                        llctl dumplogs




To write the contents of the logging buffer to the SchedLog file on node1 in the
LoadLeveler cluster, issue:
llctl -h node1 dumplogs

To write the contents of the logging buffers to the SchedLog files on all machines,
issue:
llctl -g dumplogs

Note that the messages written from the logging buffer are enclosed by bracketing
messages and carry a prefix so that they can be identified easily.
=======================BUFFER BEGIN========================

BUFFER: message .....
BUFFER: message .....

=======================BUFFER END==========================

Controlling debugging output
You can control the level of debugging output logged by LoadLeveler programs.

The following flags are presented here for your information, though they are used
primarily by IBM personnel for debugging purposes:
D_ACCOUNT
        Logs accounting information about processes. If used, it may slow down
        the network.
D_ACCOUNT_DETAIL
        Logs detailed accounting information about processes. If used, it may slow
        down the network and increase the size of log files.
D_ADAPTER
        Logs messages related to adapters.
D_AFS
        Logs information related to AFS credentials.
D_CKPT
        Logs information related to checkpoint and restart.
D_DAEMON
        Logs information regarding basic daemon set up and operation, including
        information on the communication between daemons.
D_DBX
        Bypasses certain signal settings to permit debugging of the processes as
        they execute in certain critical regions.
D_EXPR
        Logs steps in parsing and evaluating control expressions.
D_FAIRSHARE
        Displays messages related to fair share scheduling in the daemon logs. In
        the global configuration file, D_FAIRSHARE can be added to
        SCHEDD_DEBUG and NEGOTIATOR_DEBUG.
D_FULLDEBUG
        Logs details about most actions performed by each daemon but doesn’t log
        as much activity as setting all the flags.
D_HIERARCHICAL
        Enables messages about problems with the transmission of hierarchical
        messages. A hierarchical message is sent from an originating node to lower
        ranked receiving nodes.
D_JOB
        Logs job requirements and preferences when making decisions regarding
        whether a particular job should run on a particular machine.

D_KERNEL
                              Activates diagnostics for errors involving the process tracking kernel
                              extension.
                        D_LOAD
                              Displays the load average on the startd machine.
                        D_LOCKING
                              Logs requests to acquire and release locks.
                        D_LXCPUAFNT
                              Logs messages related to Linux CPU affinity. This flag is only valid for the
                              startd daemon.
                        D_MACHINE
                              Logs machine control functions and variables when making decisions
                              regarding starting, suspending, resuming, and aborting remote jobs.
                        D_MUSTER
                              Logs information related to multicluster processing.
                        D_NEGOTIATE
                              Displays the process of looking for a job to run in the negotiator. It only
                              pertains to this daemon.
                        D_PCRED
                              Directs that extra debug information should be written to a file if the
                              setpcred() function call fails.
                        D_PROC
                              Logs information about jobs being started remotely such as the number of
                              bytes fetched and stored for each job.
                        D_QUEUE
                              Logs changes to the job queue.
                        D_REFCOUNT
                              Logs activity associated with reference counting of internal LoadLeveler
                              objects.
                        D_RESERVATION
                              Logs reservation information in the negotiator and Schedd daemon logs.
                              D_RESERVATION can be added to SCHEDD_DEBUG and
                              NEGOTIATOR_DEBUG.
                        D_RESOURCE
                              Logs messages about the management and consumption of resources.
                              These messages are recorded in the negotiator log.
                        D_SCHEDD
                              Displays how the Schedd works internally.
                        D_SDO
                              Displays messages detailing LoadLeveler objects being transmitted between
                              daemons and commands.
                        D_SECURITY
                              Logs information related to Cluster Security (CtSec) services identities.
                        D_SPOOL
                              Logs information related to the usage of databases in the LoadLeveler
                              spool directory.
                        D_STANZAS
                              Displays internal information about the parsing of the administration file.
                        D_STARTD
                              Displays how the startd works internally.
                        D_STARTER
                              Displays how the starter works internally.
                        D_STREAM
                              Displays messages detailing socket I/O.



52   TWS LoadLeveler: Using and Administering
D_SWITCH
      Logs entries related to switch activity and LoadLeveler Switch Table Object
      data.
D_THREAD
      Displays the ID of the thread producing the log message. The thread ID is
      displayed immediately following the date and time. This flag is useful for
      debugging threaded daemons.
D_XDR
      Logs information regarding External Data Representation (XDR)
      communication protocols.
        For example:
        SCHEDD_DEBUG = D_CKPT   D_XDR

        Causes the scheduler to log information about checkpointing user jobs and
        exchange xdr messages with other LoadLeveler daemons. These flags will
        primarily be of interest to LoadLeveler implementers and debuggers.

The LL_COMMAND_DEBUG environment variable can be set to a string of
debug flags the same way as the *_DEBUG configuration keywords are set.
Normally, LoadLeveler commands and APIs do not print debug messages, but
with this environment variable set, the requested classes of debugging messages
will be logged to stderr. For example:
LL_COMMAND_DEBUG="D_ALWAYS D_STREAM" llstatus

will cause the llstatus command to print out debug messages related to I/O to
stderr.

Saving log files
By default, LoadLeveler stores only the two most recent iterations of a daemon’s
log file (<daemon name>Log, and <daemon name>Log.old).

Occasionally, for problem diagnosis, users need to capture LoadLeveler logs
over an extended period. Users can specify that all log files be saved to a
particular directory by using the SAVELOGS keyword in a local or global
configuration file. Be aware that LoadLeveler does not provide any way to manage
and clean out all of those log files, so users must be sure to specify a directory in a
file system with enough space to accommodate them. This file system should be
separate from the one used for the LoadLeveler log, spool, and execute directories.
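
For example, a minimal configuration entry might look like this (the directory
path here is hypothetical):
SAVELOGS = /scratch/loadl/savelogs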

Each log file is represented by the name of the daemon that generated it, the exact
time the file was generated, and the name of the machine on which the daemon is
running. When you list the contents of the SAVELOGS directory, the list of log file
names looks like this:
NegotiatorLogNov02.16:10:39.123456.c163n10.ppd.pok.ibm.com
NegotiatorLogNov02.16:10:42.987654.c163n10.ppd.pok.ibm.com
NegotiatorLogNov02.16:10:46.564123.c163n10.ppd.pok.ibm.com
NegotiatorLogNov02.16:10:48.234345.c163n10.ppd.pok.ibm.com
NegotiatorLogNov02.16:10:51.123456.c163n10.ppd.pok.ibm.com
NegotiatorLogNov02.16:10:53.566987.c163n10.ppd.pok.ibm.com
StarterLogNov02.16:09:19.622387.c163n10.ppd.pok.ibm.com
StarterLogNov02.16:09:51.499823.c163n10.ppd.pok.ibm.com
StarterLogNov02.16:10:30.876546.c163n10.ppd.pok.ibm.com
SchedLogNov02.16:09:05.543677.c163n10.ppd.pok.ibm.com
SchedLogNov02.16:09:26.688901.c163n10.ppd.pok.ibm.com
SchedLogNov02.16:09:47.443556.c163n10.ppd.pok.ibm.com
SchedLogNov02.16:10:12.712680.c163n10.ppd.pok.ibm.com
SchedLogNov02.16:10:37.342156.c163n10.ppd.pok.ibm.com
StartLogNov02.16:09:05.697753.c163n10.ppd.pok.ibm.com
StartLogNov02.16:09:26.881234.c163n10.ppd.pok.ibm.com
StartLogNov02.16:09:47.231234.c163n10.ppd.pok.ibm.com
StartLogNov02.16:10:12.125556.c163n10.ppd.pok.ibm.com
StartLogNov02.16:10:37.961486.c163n10.ppd.pok.ibm.com

                        For information about configuration file keyword syntax and other details, see
                        Chapter 12, “Configuration file reference,” on page 263.

             Setting up file system monitoring
                         You can use the file system keywords to monitor the file system space or inodes
                         used by LoadLeveler.

                         These keywords let you monitor the file system space or inodes that
                         LoadLeveler uses for:
                        v Logs
                        v Saving executables
                        v Spool information
                        v History files

                         You can also use the file system keywords to take preventive action and avoid
                         problems caused by running out of file system space or inodes. You do this by
                         setting how frequently LoadLeveler checks the file system free space or inodes,
                         and by setting the upper and lower thresholds that trigger system responses to
                         the free space or inodes available. By setting a realistic span between the lower
                         and upper thresholds, you will avoid excessive system actions.

                        The file system monitoring keywords are:
                        v FS_INTERVAL
                        v FS_NOTIFY
                        v FS_SUSPEND
                        v FS_TERMINATE
                        v INODE_NOTIFY
                        v INODE_SUSPEND
                        v INODE_TERMINATE

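                         For example, a sketch of a monitoring setup, assuming FS_INTERVAL is
                         expressed in minutes and FS_NOTIFY takes a lower,upper threshold pair (the
                         values are illustrative; see Chapter 12 for the exact value syntax):
                         FS_INTERVAL = 5
                         FS_NOTIFY   = 10mb,20mb
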
                        For information about configuration file keyword syntax and other details, see
                        Chapter 12, “Configuration file reference,” on page 263.

Defining LoadLeveler machine characteristics
                        You can use the following keywords to define the characteristics of machines in the
                        LoadLeveler cluster.

                        For information about configuration file keyword syntax and other details, see
                        Chapter 12, “Configuration file reference,” on page 263.
                        v ARCH
                        v CLASS
                        v CUSTOM_METRIC
                        v CUSTOM_METRIC_COMMAND
                        v FEATURE
                        v GSMONITOR_RUNS_HERE
                        v MAX_STARTERS
                        v SCHEDD_RUNS_HERE
                        v SCHEDD_SUBMIT_AFFINITY
                        v STARTD_RUNS_HERE
                         v START_DAEMONS
                         v VM_IMAGE_ALGORITHM
                         v X_RUNS_HERE
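
                         For example, a minimal sketch that runs both a Schedd and a startd on a
                         machine and caps it at four concurrently running job steps (the values shown
                         are illustrative):
                         SCHEDD_RUNS_HERE = True
                         STARTD_RUNS_HERE = True
                         MAX_STARTERS     = 4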

Defining job classes that a LoadLeveler machine will accept
      There are a number of possible ways of defining job classes.

      The following examples illustrate possible ways of defining job classes.
      v Example 1
        This example specifies multiple classes:
        Class = No_Class(2)

        or
        Class = { "No_Class" "No_Class" }

        The machine will only run jobs that have either defaulted to or explicitly
        requested class No_Class. A maximum of two LoadLeveler jobs are permitted to
        run simultaneously on the machine if the MAX_STARTERS keyword is not
        specified. See “Specifying how many jobs a machine can run” for more
        information on MAX_STARTERS.
      v Example 2
        This example specifies multiple classes:
        Class = No_Class(1) Small(1) Medium(1) Large(1)

        or
        Class = { "No_Class" "Small" "Medium" "Large" }

        The machine will only run a maximum of four LoadLeveler jobs that have either
        defaulted to, or explicitly requested No_Class, Small, Medium, or Large class. A
        LoadLeveler job with class IO_bound, for example, would not be eligible to run
        here.
      v Example 3
        This example specifies multiple classes:
        Class = B(2) D(1)

        or
        Class = { "B" "B" "D" }

        The machine will run only LoadLeveler jobs that have explicitly requested class
        B or D. Up to three LoadLeveler jobs may run simultaneously: two of class B
        and one of class D. A LoadLeveler job with class No_Class, for example, would
        not be eligible to run here.

Specifying how many jobs a machine can run
      To specify how many jobs a machine can run, you need to take into consideration
      both the MAX_STARTERS keyword and the Class statement.

      This is described in more detail in “Defining LoadLeveler machine characteristics”
      on page 54.

      For example, if the configuration file contains these statements:

Class = A(1) B(2) C(1)
MAX_STARTERS = 2

                        then the machine can run a maximum of two LoadLeveler jobs simultaneously. The
                        possible combinations of LoadLeveler jobs are:
                        v A and B
                        v A and C
                        v B and B
                        v B and C
                        v Only A, or only B, or only C

                        If this keyword is specified together with a Class statement, the maximum number
                        of jobs that can be run is equal to the lower of the two numbers. For example, if:
                        MAX_STARTERS = 2
                        Class = class_a(1)

                        then the maximum number of job steps that can be run is one (the Class statement
                        defines one class).

                         If you specify the MAX_STARTERS keyword without specifying a Class statement, by
                        default one class still exists (called No_Class). Therefore, the maximum number of
                        jobs that can be run when you do not specify a Class statement is one.

                        Note: If the MAX_STARTERS keyword is not defined in either the global
                              configuration file or the local configuration file, the maximum number of
                              jobs that the machine can run is equal to the number of classes in the Class
                              statement.

Defining security mechanisms
                         LoadLeveler can be configured to control authentication and authorization of
                         LoadLeveler functions by using Cluster Security (CtSec) services, a subcomponent
                         of Reliable Scalable Cluster Technology (RSCT), which uses host-based
                         authentication (HBA) as its underlying security mechanism.

                        LoadLeveler permits only one security service to be configured at a time. You can
                        skip this topic if you do not plan to use this security feature or if you plan to use
                        the credential forwarding provided by the llgetdce and llsetdce program pair.
                        Refer to “Using the alternative program pair: llgetdce and llsetdce” on page 75 for
                        more information.

                        LoadLeveler for Linux does not support CtSec security.

                         LoadLeveler can also be enabled to interact with OpenSSL for secure multicluster
                         communications.

                         Table 14 on page 57 lists the topics that explain how to configure LoadLeveler
                         to secure its operations.

Table 14. Roadmap of configuration tasks for securing LoadLeveler operations

To learn about:                     Read the following:
Securing LoadLeveler operations     v “Configuring LoadLeveler to use cluster
using cluster security services       security services”
                                    v “Steps for enabling CtSec services” on page 58
                                    v “Limiting which security mechanisms
                                      LoadLeveler can use” on page 60
Enabling LoadLeveler to secure      “Steps for securing communications within a
multicluster communication with     LoadLeveler multicluster” on page 153
OpenSSL
Correctly specifying                Chapter 12, “Configuration file reference,” on
configuration file keywords         page 263



Configuring LoadLeveler to use cluster security services
       Cluster Security (CtSec) services allow a software component to authenticate
       and authorize the identities of its peers or clients.

      When configured to use CtSec services, LoadLeveler will:
      v Authenticate the identity of users and programs interacting with LoadLeveler.
      v Authorize users and programs to use LoadLeveler services. It prevents
        unauthorized users and programs from misusing resources or disrupting
        services.

      To use CtSec services, all nodes running LoadLeveler must first be configured as
      part of a cluster running Reliable Scalable Cluster Technology (RSCT). For details
      on CtSec services administration, see IBM Reliable Scalable Cluster Technology:
      Administration Guide, SA22-7889.

      CtSec services are designed to use multiple security mechanisms and each security
      mechanism must be configured for LoadLeveler. At the present time, directions are
      provided only for configuring the host-based authentication (HBA) security
      mechanism for LoadLeveler’s use. If CtSec is configured to use additional security
      mechanisms that are not configured for LoadLeveler’s use, then the LoadLeveler
      configuration file keyword SEC_IMPOSED_MECHS must be specified. This
      keyword is used to limit the security mechanisms that will be used by CtSec
      services to only those that are configured for use by LoadLeveler.

      Authorization is based on user identity. When CtSec services are enabled for
      LoadLeveler, user identity will differ depending on the authentication mechanism
       in use. A user’s identity in UNIX host-based authentication is the user’s network
       identity, which consists of the user name and host name, such as
       user_name@host.

      LoadLeveler uses CtSec services to authorize owners of jobs, administrators and
      LoadLeveler daemons to perform certain actions. CtSec services uses its own
      identity mapping file to map the clients’ network identity to a local identity when
      performing authorizations. A typical local identity is the user name without a
      hostname. The local identities of the LoadLeveler administrators must be added as
      members of the group specified by the keyword SEC_ADMIN_GROUP. The local
      identity of the LoadLeveler user name must be added as the sole member of the
      group specified by the keyword SEC_SERVICES_GROUP. The LoadLeveler
      Services and Administrative groups, those identified by the keywords
                         SEC_SERVICES_GROUP and SEC_ADMIN_GROUP, must be the same across all
                        nodes in the LoadLeveler cluster. To ensure consistency in performing tasks which
                        require owner, administrative or daemon privileges across all nodes in the
                        LoadLeveler cluster, user network identities must be mapped identically across all
                        nodes in the LoadLeveler cluster. If this is not the case, LoadLeveler authorizations
                        may fail.

                        Steps for enabling CtSec services
                        It is necessary to enable LoadLeveler to use CtSec services.

                        To enable LoadLeveler to use CtSec services, perform the following steps:
       1. Include, in the Trusted Host List, the host names of all hosts with which
          communications may take place. If LoadLeveler tries to communicate with a
          host that is not on the Trusted Host List, the following message occurs: The
          host identified in the credentials is not a trusted host on this
          system. Additionally, the
                           system administrator should ensure that public keys are manually exchanged
                           between all hosts in the LoadLeveler cluster. Refer to IBM Reliable Scalable
                           Cluster Technology: Administration Guide, SA22-7889 for information on setting
                           up Trusted Host Lists and manually transferring public keys.
                        2. Create user IDs. Each LoadLeveler administrator and the LoadLeveler user ID
                           need to be created, if they don’t already exist. You can do this through SMIT or
                           the mkuser command.
                        3. Ensure that the unix.map file contains the correct value for the service name
                           ctloadl which specifies the LoadLeveler user name. If you have configured
                           LoadLeveler to use loadl as the LoadLeveler user name, either by default or by
                           specifying loadl in the LoadLUserid keyword of the /etc/LoadL.cfg file, nothing
                           needs to be done. The default map file will contain the ctloadl service name
                           already assigned to loadl. If you have configured a different user name in the
                           LoadLUserid keyword of the /etc/LoadL.cfg file, you will need to make sure
                           that the /var/ct/cfg/unix.map file exists and that it assigns the same user name
                           to the ctloadl service name. If the /var/ct/cfg/unix.map file does not exist, create
                           one by copying the default map file /usr/sbin/rsct/cfg/unix.map. Do not modify
                           the default map file.
                           If the value of the LoadLUserid and the value associated with ctloadl are not
                           the same a security services error which indicates a UNIX identity mismatch
                           will occur.
                        4. Add entries to the global mapping file of each machine in the LoadLeveler
                           cluster to map network identities to local identities. This file is located at:
                           /var/ct/cfg/ctsec_map.global. If this file doesn’t yet exist, you should copy the
                           default global mapping file to this location—don’t modify the default mapping
                           file. The default global mapping file, which is shared among all CtSec services
                           exploiters, is located at /usr/sbin/rsct/cfg/ctsec_map.global. See IBM Reliable
                            Scalable Cluster Technology for AIX: Technical Reference, SA22-7890 for more
                           information on the mapping file.
                           When adding names to the global mapping file, enter more specific entries
                           ahead of the other, less specific entries. Remember that you must update the
                           global mapping file on each machine in the LoadLeveler cluster, and each
                           mapping file has to be updated with the security services identity of each
                           member of the administrator group, the services group, and the users.
                           Therefore, you would have entries like this:
                            unix:brad@mach1.pok.ibm.com=bradleyf
                            unix:brad@mach2.pok.ibm.com=bradleyf
                            unix:brad@mach3.pok.ibm.com=bradleyf
                            unix:marsha@mach2.pok.ibm.com=marshab
   unix:marsha@mach3.pok.ibm.com=marshab
   unix:loadl@mach1.pok.ibm.com=loadl
   unix:loadl@mach2.pok.ibm.com=loadl
   unix:loadl@mach3.pok.ibm.com=loadl

   However, if you’re sure your LoadLeveler cluster is secure, you could specify
   mapping for all machines this way:
   unix:brad@*=bradleyf
   unix:marsha@*=marshab
   unix:loadl@*=loadl

   This indicates that the UNIX network identity of the users brad, marsha and
   loadl will map to their respective security services identities on every machine
   in the cluster. Refer to IBM Reliable Scalable Cluster Technology for AIX: Technical
   Reference, SA22-7890 for a description of the syntax used in the
   ctsec_map.global file.
5. Create UNIX groups. The LoadLeveler administrator group and services group
   need to be created for every machine in the cluster and should contain the local
   identities of members. This can be done either by using SMIT or the mkgroup
   command.
   For example, to create the group lladmin, which lists the LoadLeveler
   administrators:
   mkgroup "users=sam,betty,loadl" lladmin

   To create the group llsvcs, which lists the identity under which LoadLeveler
   daemons run using the default ID of loadl:
   mkgroup users=loadl llsvcs

   Both groups must be created on each machine in the LoadLeveler cluster and
   must contain the same entries.
6. Add or update these keywords in the LoadLeveler configuration file:
   SEC_ENABLEMENT=CTSEC
   SEC_ADMIN_GROUP=name of lladmin group
   SEC_SERVICES_GROUP=group name that contains identities of LoadLeveler daemons

    The SEC_ENABLEMENT=CTSEC keyword indicates that the CtSec services
    mechanism should be used. SEC_ADMIN_GROUP points to the name of the
   UNIX group which contains the local identities of the LoadLeveler
   administrators. The SEC_SERVICES_GROUP keyword points to the name of
   the UNIX group which contains the local identity of the LoadLeveler daemons.
   All LoadLeveler daemons run as the LoadLeveler user ID. Refer to step 5 for
   discussion of the administrators and services groups.
7. Update the .rhosts file in each user’s home directory. This file is used to
   identify which UNIX identities can run LoadLeveler jobs on the local host
   machine. If the file does not exist in a user’s home directory, you must create it.
   The .rhosts file must contain entries which specify all host and user
   combinations allowed to submit jobs which will run as the local user, either
   explicitly or through the use of wildcards.
   Entries in the .rhosts file are specified this way:
   HostNameField [UserNameField]

   Refer to IBM AIX Files Reference, SC23-4168 for further details about the .rhosts
   file format.
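
   For example, hypothetical entries that allow user brad to run jobs submitted
   from two cluster hosts as the local user would look like this:
   mach1.pok.ibm.com brad
   mach2.pok.ibm.com brad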

Tips for configuring LoadLeveler to use CtSec services:

                        When using CtSec services for LoadLeveler, each machine in the LoadLeveler
                        cluster must be set up properly.

                         CtSec authenticates network identities through trust established between
                         individual machines in a cluster, which depends on each host's local
                         configuration. Because of this, it is possible for most of the cluster to run
                         correctly but for transactions from certain machines to experience
                         authentication or authorization problems.

                        If unexpected authentication or authorization problems occur in a LoadLeveler
                        cluster with CtSec enabled, check that the steps in “Steps for enabling CtSec
                        services” on page 58 were correctly followed for each machine in the LoadLeveler
                        cluster.

                        If any machine in a LoadLeveler cluster is improperly configured to run CtSec you
                        may see that:
                        v Users cannot perform user tasks (such as cancel) for jobs they submitted.
                           Either the machine the job was submitted from or the machine the user
                           operation was submitted from (or both) do not contain mapping files for the
                           user that specify the same security services identity. The user should attempt the
                           operation from the same machine the job was submitted from and record the
                           results. If the user still cannot perform a user task on a job they submitted, then
                           they should contact the LoadLeveler administrator who should review the steps
                           in “Steps for enabling CtSec services” on page 58.
                        v LoadLeveler daemons fail to communicate.
                            When LoadLeveler daemons communicate, they must first authenticate each
                            other. If the daemons cannot authenticate each other, a message indicating an
                            authentication failure is written to the daemon log. Ensure that the Trusted
                            Host List on all
                           LoadLeveler nodes contains the correct entries for all of the nodes in the
                           LoadLeveler cluster. Also, make sure that the LoadLeveler Services group on all
                           nodes of the LoadLeveler cluster contains the local identity for the LoadLeveler
                           user name. The ctsec_map.global must contain mapping rules to map the
                           LoadLeveler user name from every machine in the LoadLeveler cluster to the
                           local identity for the LoadLeveler user name. An example of what may happen
                           when daemons fail to communicate is that an alternate central manager may
                           take over while the primary central manager is still active. This can occur when
                           the alternate central manager does not trust the primary central manager.

                        Limiting which security mechanisms LoadLeveler can use
                        As more security mechanisms become available, they must be configured for
                        LoadLeveler’s use.

                        If there are security mechanisms configured for CtSec that are not configured for
                        LoadLeveler’s use, then the LoadLeveler configuration file keyword
                        SEC_IMPOSED_MECHS must specify the mechanisms configured for
                        LoadLeveler.

Defining usage policies for consumable resources
                        The LoadLeveler scheduler can schedule jobs based on the availability of
                        consumable resources.

                        You can use the following keywords to configure consumable resources:
                        v ENFORCE_RESOURCE_MEMORY

                         v ENFORCE_RESOURCE_POLICY
                         v ENFORCE_RESOURCE_SUBMISSION
                         v ENFORCE_RESOURCE_USAGE
                         v FLOATING_RESOURCES
                         v RESOURCES
                         v SCHEDULE_BY_RESOURCES

                  For information about configuration file keyword syntax and other details, see
                  Chapter 12, “Configuration file reference,” on page 263.
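
                   For example, a sketch that makes a floating software license a consumable
                   resource and tells the scheduler to consider it when scheduling jobs (the
                   resource name and count are illustrative):
                   FLOATING_RESOURCES    = spice2g6(10)
                   SCHEDULE_BY_RESOURCES = spice2g6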

    Enabling support for bulk data transfer and rCxt blocks
                  On systems with device drivers and network adapters that support remote
                  direct-memory access (RDMA), LoadLeveler allows bulk data transfer for jobs that
                  use either the Internet or user space communication protocol mode.

                  For jobs using the Internet protocol (IP jobs), LoadLeveler does not monitor or
                  control the use of bulk transfer. For user space jobs that request bulk transfer,
                  however, LoadLeveler creates a consumable RDMA resource, and limits RDMA
                  resources to only four for a single machine with Switch Network Interface for HPS
                  network adapters. There is no limit on RDMA resources for machines with
                  InfiniBand network adapters.

                  You do not need to perform specific configuration or job-definition tasks to enable
                  bulk transfer for LoadLeveler jobs that use the IP network protocol. LoadLeveler
                  cannot affect whether IP communication uses bulk transfer; the implementation of
                  IP where the job runs determines whether bulk transfer is supported.

                  To enable user space jobs to use bulk data transfer, you must update the
                  LoadLeveler configuration file to include the value RDMA in the
                  SCHEDULE_BY_RESOURCES list for machines with Switch Network Interface for
                  HPS network adapters.

                  Example:
                  SCHEDULE_BY_RESOURCES = RDMA others

                  For additional information about using bulk data transfer and job-definition
                  requirements, see “Using bulk data transfer” on page 188.

    Gathering job accounting data
                  Your organization may have a policy of charging users or groups of users for the
                  amount of resources that their jobs consume.

                  You can do this using LoadLeveler’s accounting feature. Using this feature, you can
                  produce accounting reports that contain job resource information for completed
|                 serial and parallel job steps. You can also view job resource information on jobs
                  that are continuing to run.

                  The accounting record for a job step will contain separate sets of resource usage
                  data for each time a job step is dispatched to run. For example, the accounting
                  record for a job step that is vacated and then started again will contain two sets of
                   resource usage data. The first set covers the time period from when the job step
                   was initially dispatched until the job step was vacated. The second set covers
                   the time period from when the job step was dispatched after the vacate until
                   the job step completed.

The job step’s accounting data that is provided in the llsummary short listing and
                            in the user mail will contain only one set of resource usage data. That data will be
                            from the last time the job step was dispatched to run. For example, the mail
                            message for job step completion for a job step that is checkpointed with the hold
                            (-h) option and then restarted, will contain the set of resource usage data only for
                            the dispatch that restarted the job from the checkpoint. To obtain the resource
                            usage data for the entire job step, use the detailed llsummary command or
                            accounting API.

                            The following keywords allow you to control accounting functions:
                            v ACCT
                            v ACCT_VALIDATION
                            v GLOBAL_HISTORY
                            v HISTORY_PERMISSION
                            v JOB_ACCT_Q_POLICY
                            v JOB_LIMIT_POLICY
                            For example, the following section of the configuration file specifies that the
                            accounting function is turned on. It also identifies the default module used to
                            perform account validation and the directory containing the global history files:
                            ACCT                    = A_ON A_VALIDATE
                            ACCT_VALIDATION         = $(BIN)/llacctval
                            GLOBAL_HISTORY          = $(SPOOL)

                            Table 15 lists the topics related to configuring, gathering and using job accounting
                            data.
                             Table 15. Roadmap of tasks for gathering job accounting data

                             To learn about:               Read the following:
                             Configuring LoadLeveler to    v “Collecting job resource data on serial and
                             gather job accounting data      parallel jobs”
                                                           v “Collecting job resource data based on
                                                             machines” on page 64
                                                           v “Collecting job resource data based on
                                                             events” on page 64
                                                           v “Collecting job resource information based
                                                             on user accounts” on page 65
                                                           v “Collecting accounting data for reservations”
                                                             on page 63
                                                           v “Collecting the accounting information and
                                                             storing it into files” on page 66
                                                           v “64-bit support for accounting functions” on
                                                             page 67
                                                           v “Example: Setting up job accounting files” on
                                                             page 67
                             Managing accounting data      v “Producing accounting reports” on page 66
                                                           v “Correlating AIX and LoadLeveler accounting
                                                             records” on page 66
                                                           v “llacctmrg - Collect machine history files” on
                                                             page 413
                                                           v “llsummary - Return job resource information
                                                             for accounting” on page 535
                             Correctly specifying          Chapter 12, “Configuration file reference,” on
                             configuration file keywords   page 263



                 Collecting job resource data on serial and parallel jobs
|                           Information on completed serial and parallel job steps is gathered using the UNIX
|                           wait3 system call.

Information on non-completed serial and parallel jobs is gathered in a
          platform-dependent manner by examining data from the UNIX process.

|         Accounting information on a completed serial job step is determined by
|         accumulating resources consumed by that job on the machines that ran the job.
|         Similarly, accounting information on completed parallel job steps is gathered by
|         accumulating resources used on all of the nodes that ran the job step.

          You can also view resource consumption information on serial and parallel jobs
          that are still running by specifying the -x option of the llq command. To enable llq
          -x, specify the following keywords in the configuration file:
          v ACCT = A_ON A_DETAIL
          v JOB_ACCT_Q_POLICY = number
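
           With these keywords set, you can then view extended resource consumption
           data for your running jobs, for example:
           llq -x

           The output can be restricted to particular job steps; see “llq - Query job
           status” on page 479 for the full option syntax.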

|   Collecting accounting information for recurring jobs
|         For recurring jobs, accounting records are written as each occurrence of each step
|         of the job completes. The reservation ID field in the accounting record can be used
|         to distinguish one occurrence from another.

    Collecting accounting data for reservations
          LoadLeveler can collect accounting data for reservations, which are set periods of
          time during which node resources are reserved for the use of particular users or
          groups.

          To enable recording of reservation information, specify the following keywords in
          the configuration file:
          v To turn on accounting for reservations, add the A_RES flag to the ACCT
            keyword.
          v To specify a file other than the default history file to contain the data, use the
            RESERVATION_HISTORY keyword.
          See Chapter 12, “Configuration file reference,” on page 263 for details about the
          ACCT and RESERVATION_HISTORY keywords.

          When these keyword values are set and a reservation ends or is canceled,
          LoadLeveler records the following information:
          v The reservation ID
          v The time at which the reservation was created
          v The user ID of the reservation owner
          v The name of the owning group
          v Requested and actual start times
          v Requested and actual duration
          v Actual time at which the reservation ended or was canceled
          v Whether the reservation was created with the SHARED or REMOVE_ON_IDLE options
          v A list of users and a list of groups that were authorized to use the reservation
          v The number of reserved nodes
          v The names of reserved nodes

          This reservation information is appended in a single line to the reservation history
          file for the reservation. The format of reservation history data is:
          Reservation ID!Reservation Creation Time!Owner!Owning Group!Start Time! 
           Actual Start Time!Duration!Actual Duration!Actual End Time!SHARED(yes|no)! 
          REMOVE_ON_IDLE(yes|no)!Users!Groups!Number of Nodes!Nodes!BG C-nodes! 
           BG Connection!BG Shape!Number of BG BPs!BG BPs

          In reservation history data:
v The unit of measure for start times and end times is the number of seconds since
                              January 1, 1970.
                            v The unit of time for durations is seconds.

|                           Note: As each occurrence of a recurring reservation completes, an accounting
|                                 record is appended to the reservation history file. The format of the record is
|                                 identical to that of a one time reservation. In the record, the Reservation ID
|                                 includes the occurrence ID of the completed reservation.

|                                   When you cancel the entire recurring reservation (as opposed to only one
|                                   occurrence being canceled), one additional accounting record is written. This
|                                   record is based on the state of the reservation:
|                                   v If an occurrence is ACTIVE, then the end time and duration of that
|                                      occurrence is set and an accounting record written.
|                                   v If there are no ACTIVE occurrences, then an accounting record will
|                                      be written for the next scheduled occurrence. This is similar to the
|                                      accounting record that is written when you cancel a one time reservation
|                                      in the WAITING state.

                            The following is an example of a reservation history file entry:
                            bgldd1.rchland.ibm.com.68.r!1150242970!ezhong!group1!1150243200!1150243200! 
                             300!300!1150243500!no!no!yang!fvt,dev!1!bgldd1!0!!!0!
                            bgldd1.rchland.ibm.com.54.r!1150143472!ezhong!No_Group!1153612800!0!60!0! 
                             1150243839!no!no!!!0!32!MESH!0x0x0!1!R010(J115)
                            bgldd1.rchland.ibm.com.70.r!1150244654!ezhong!No_Group!1150244760!1150244760! 
                             60!60!1150244820!yes!yes!user1,user2!group1,group2!0!512!MESH!1x1x1!1!R010

|                           To collect the reservation information stored in the history file, use the llacctmrg
|                           command with the -R option. For llacctmrg command syntax, see “llacctmrg -
|                           Collect machine history files” on page 413.

                            To format reservation history data contained in a file, use the sample script
                            llreshist.pl in directory ${RELEASEDIR}/samples/llres/.

                 Collecting job resource data based on machines
                            LoadLeveler can collect job resource usage information for every machine on
                            which a job may run.

                            A job may run on more than one machine because it is a parallel job or because the
                            job is vacated from one machine and rescheduled to another machine.

                            To enable recording of resources by machine, you need to specify ACCT = A_ON
                            A_DETAIL in the configuration file.

                             The machine’s speed is part of the data collected. With this information, an
                             installation can develop a chargeback program that charges more or less for
                             resources consumed by a job on different machines. For more information on a
                            machine’s speed, refer to the machine stanza information. See “Defining machines”
                            on page 84.

                 Collecting job resource data based on events
                            In addition to collecting job resource information based upon machines used, you
                            can gather this information based upon an event or time that you specify.

For example, you may want to collect accounting information at the end of every
      work shift or at the end of every week or month. To collect accounting information
      on all machines in this manner, use the llctl command with the capture parameter:
      llctl -g capture eventname

      eventname is any string of continuous characters (no white space is allowed) that
      defines the event about which you are collecting accounting data. For example, if
      you were collecting accounting data on the graveyard work shift, your command
      could be:
      llctl -g capture graveyard

      This command allows you to obtain a snapshot of the resources consumed by
      active jobs up to and including the moment when you issued the command. If you
      want to capture this type of information on a regular basis, you can set up a
      crontab entry to invoke this command regularly. For example:
      # sample crontab for accounting
      # shift crontab 94/8/5
      #
      # Set up three shifts, first, second, and graveyard shift.
      # Crontab entries indicate the end of shift.
      #
      #M H d m day command
      #
      00 08 * * * /u/loadl/bin/llctl -g capture graveyard
      00 16 * * * /u/loadl/bin/llctl -g capture first
      00 00 * * * /u/loadl/bin/llctl -g capture second

      For more information on the llctl command, refer to “llctl - Control LoadLeveler
      daemons” on page 439. For more information on the collection of accounting
      records, see “llq - Query job status” on page 479.

Collecting job resource information based on user accounts
      If your installation is interested in keeping track of resources used on an account
      basis, you can require all users to specify an account number in their job command
      files.

      They can specify this account number with the account_no keyword which is
      explained in detail in “Job command file keyword descriptions” on page 359.
      Interactive POE jobs can specify an account number using the
      LOADL_ACCOUNT_NO environment variable.

      LoadLeveler validates this account number by comparing it against a list of
      account numbers specified for the user in the user stanza in the administration file.

      Account validation is under the control of the ACCT keyword in the configuration
      file. The routine that performs the validation is called llacctval. You can supply
      your own validation routine by specifying the ACCT_VALIDATION keyword in
      the configuration file. The following are passed as character string arguments to
      the validation routine:
      v User name
      v User’s login group name
      v Account number specified on the Job
      v Blank-separated list of account numbers obtained from the user’s stanza in the
         administration file.
      The account validation routine must exit with a return code of zero if the
      validation succeeds. If it fails, the return code is a nonzero number.
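
       As an illustration, here is a minimal sketch of a custom validation routine
       written as a shell script. It assumes the four arguments arrive in the order
       listed above, with the account number list passed as a single blank-separated
       argument; it is not the shipped llacctval module:
       #!/bin/sh
       # $1 = user name, $2 = login group name, $3 = account number from the job,
       # $4 = blank-separated list of account numbers from the user stanza
       for acct in $4; do
           if [ "$acct" = "$3" ]; then
               exit 0    # validation succeeded
           fi
       done
       exit 1            # validation failed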

Collecting the accounting information and storing it into files
                        LoadLeveler stores the accounting information that it collects in a file called history
                        in the spool directory of the machine that initially scheduled this job, the Schedd
                        machine. Data on parallel jobs are also stored in the history files.

                        Resource information collected on the LoadLeveler job is constrained by the
                        capabilities of the wait3 system call. Information for processes which fork child
                        processes will include data for those child processes as long as the parent process
                        waits for the child process to terminate. Complete data may not be collected for
                        jobs which are not composed of simple parent/child processes. For example, if you
                        have a LoadLeveler job which invokes an rsh command to execute a function on
                        another machine, the resources consumed on the other machine will not be
                        collected as part of the LoadLeveler accounting data.

                        LoadLeveler accounting uses the following types of files:
                        v The local history file which is local to each Schedd machine is where job
                          resource information is first recorded. These files are usually named history and
                          are located in the spool directory of each Schedd machine, but you may specify
                          an alternate name with the HISTORY keyword in either the global or local
                          configuration file.
                        v The global history file is a combination of the history files from some or all of
                          the machines in the LoadLeveler cluster merged together. The command
                          llacctmrg is used to collect files together into a global file. As the files are
                          collected from each machine, the local history file for that machine is reset to
                          contain no data. The file is named globalhist.YYYYMMDDHHmm. You may
                          specify the directory in which to place the file when you invoke the llacctmrg
                          command or you can specify the directory with the GLOBAL_HISTORY
                          keyword in the configuration file. The default value set up in the sample
                          configuration file is the local spool directory.

             Producing accounting reports
                        You can produce three types of reports using either the local or global history file.

                         These reports are called the short, long, and extended versions. As their names
                         imply, the short version of the report is a brief listing of the resources used by
                         LoadLeveler jobs. The long version provides comprehensive detail with
                         summarized resource usage, and the extended version provides comprehensive
                         detail with detailed resource usage.

                        If you do not specify a report type, you will receive the default short version. The
                        short report displays the number of jobs along with the total CPU usage according
                        to user, class, group, and account number. The extended version of the report
                        displays all of the data collected for every job.
                        v For examples of the short and extended versions of the report, see “llsummary -
                           Return job resource information for accounting” on page 535.
                        v For information on the accounting APIs, refer to Chapter 17, “Application
                           programming interfaces (APIs),” on page 541.

             Correlating AIX and LoadLeveler accounting records
                        For jobs running on AIX systems, you can use a job accounting key to correlate
                        AIX accounting records with LoadLeveler accounting records.

                        The job accounting key uniquely identifies each job step. LoadLeveler derives this
                        key from the job key and the date and time at which the job entered the queue
       (see the QDate variable description). The key is associated with the starter process
      for the job step and any of its child processes.

      For checkpointed jobs, LoadLeveler does not change the job accounting key,
      regardless of how it restarts the job step. Jobs restarted from a checkpoint file or
      through a new job step retain the job accounting key for the original job step.

      To access the job accounting key for a job step, you can use the following
      interfaces:
      v The llsummary command, requesting the long version of the report. For details
         about using this command, see “llsummary - Return job resource information for
         accounting” on page 535.
      v The GetHistory subroutine. For details about using this subroutine, see
         “GetHistory subroutine” on page 545.
      v The ll_get_data subroutine, through the LL_StepAcctKey specification. For
         details about using this subroutine, see “ll_get_data subroutine” on page 570.

      For information about AIX accounting records, see the system accounting topic in
      AIX System Management Guide: Operating System and Devices.

64-bit support for accounting functions
      LoadLeveler 64-bit support for accounting functions includes several features.

      LoadLeveler 64-bit support for accounting functions includes:
      v Statistics of jobs such as usage, limits, consumable resources, and other 64-bit
        integer data are preserved in the history file as rusage64, rlimit64 structures and
        as data items of type int64_t.
      v The LL_job_step structure defined in llapi.h allows access to the 64-bit data
        items either as data of type int64_t or as data of type int32_t. In the latter case,
        the returned values may be truncated.
      v The llsummary command displays 64-bit information where appropriate.
      v The data access API supports both 64-bit and 32-bit access to accounting and
        usage information in a history file. See “Examples of using the data access API”
        on page 633 for an example of how to use the ll_get_data subroutine to access
        information stored in a LoadLeveler history file.

Example: Setting up job accounting files
      You can perform all of the steps included in this sample procedure or just the ones
      that apply to your situation.

      The sample procedure shown in Table 16 walks you through the process of
      collecting account data.
      1. Edit the configuration file according to the following table:
      Table 16. Collecting account data - modifying the configuration file
      Edit this keyword:        To:
      ACCT                      Turn accounting and account validation on and off and specify
                                detailed accounting.
      ACCT_VALIDATION           Specify the account validation routine.
      GLOBAL_HISTORY            Specify a directory in which to place the global history files.


2. Specify account numbers and set up account validation by performing the
                           following steps:
                           a. Specify a list of account numbers a user may use when submitting jobs, by
                               using the account keyword in the user stanza in the administration file.
                           b. Instruct users to associate an account number with their job, by using the
                               account_no keyword in the job command file.
                           c. Specify the ACCT_VALIDATION keyword in the configuration file that
                               identifies the module that will be called to perform account validation. The
                               default module is called llacctval. You can replace this module with your
                               installation’s own accounting routine by specifying a new module with this
                               keyword.
                        3. Specify machines and their weights by using the speed keyword in a machine’s
                           machine stanza in the administration file.
                           Also, if you have in your cluster machines of differing speeds and you want
                           LoadLeveler accounting information to be normalized for these differences,
                           specify cpu_speed_scale=true in each machine’s respective machine stanza.
                           For example, suppose you have a cluster of two machines, called A and B,
                           where Machine B is three times as fast as Machine A. Machine A has
                           speed=1.0, and Machine B has speed=3.0. Suppose a job runs for 12 CPU
                           seconds on Machine A. The same job runs for 4 CPU seconds on Machine B.
                           When you specify cpu_speed_scale=true, the accounting information collected
                           on Machine B for that job shows the normalized value of 12 CPU seconds
                           rather than the actual 4 CPU seconds.
                        4. Merge multiple files collected from each machine into one file, using the
                           llacctmrg command.
                        5. Report job information on all the jobs in the history file, using the llsummary
                           command.
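
                            For example, steps 4 and 5 might look like this (the directory and file
                            names are hypothetical):
                            llacctmrg /u/loadl/acct
                            llsummary /u/loadl/acct/globalhist.200811151630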

Managing job status through control expressions
                        You can control running jobs by using five control functions as Boolean expressions
                        in the configuration file.

                        These functions are useful primarily for serial jobs. You define the expressions,
                        using normal C conventions, with the following functions:
                        v START
                        v SUSPEND
                        v CONTINUE
                        v VACATE
                        v KILL

                        The expressions are evaluated for each job running on a machine using both the
                        job and machine attributes. Some jobs running on a machine may be suspended
                        while others are allowed to continue.

                         The START expression is evaluated twice: once to determine whether the machine
                         can accept jobs at all, and a second time to determine whether the specific job
                         can run on the machine. The other expressions are evaluated after the jobs have
                         been dispatched and, in some cases, are already running.

                         When evaluating the START expression to determine whether the machine can
                         accept jobs, Class != "Z" evaluates to true only if Z is not in the class
                         definition. This means that if two different classes are defined on a machine,
                         Class != "Z" (where Z is one of the defined classes) always evaluates to false
                         when specified in the START expression, and the machine therefore will not be
                         considered to start jobs.

      Typically, machine load average, keyboard activity, time intervals, and job class are
      used within these various expressions to dynamically control job execution.
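
       For example, a workstation owner might allow jobs to start only when the
       machine is lightly loaded or outside working hours, and suspend them while the
       keyboard is in use. A minimal sketch, assuming the LoadAvg, KeyboardIdle, and
       tm_hour variables and thresholds appropriate to your site:

          START:    (LoadAvg <= 0.5) || (tm_hour >= 18) || (tm_hour < 8)
          SUSPEND:  (KeyboardIdle < 60) && (tm_hour >= 8) && (tm_hour < 18)
          CONTINUE: KeyboardIdle > 300
          VACATE:   F
          KILL:     F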

      For additional information about:
      v Time-related variables that you may use for this keyword, see “Variables to use
        for setting times” on page 320.
      v Coding these control expressions in the configuration file, see Chapter 12,
        “Configuration file reference,” on page 263.

How control expressions affect jobs
      After LoadLeveler selects a job for execution, the job can be in any of several
      states.

      Figure 10 on page 70 shows how the control expressions can affect the state a job is
      in. The rectangles represent job or daemon states (Idle, Completed, Running,
      Suspended, and Vacating) and the diamonds represent the control expressions
      (Start, Suspend, Continue, Vacate, and Kill).




                        Figure 10. How control expressions affect jobs (flowchart showing the Idle, Running,
                        Suspended, Vacating, and Completed states, with the Start, Suspend, Continue, Vacate,
                        and Kill decision points between them)

                        Criteria used to determine when a LoadLeveler job will enter Start, Suspend,
                        Continue, Vacate, and Kill states are defined in the LoadLeveler configuration files
                        and they can be different for each machine in the cluster. They can be modified to
                        meet local requirements.

Tracking job processes
                        When a job terminates, its orphaned processes may continue to consume or hold
                        resources, thereby degrading system performance, or causing jobs to hang or fail.

                         Process tracking allows LoadLeveler to cancel any processes (throughout the
                         entire cluster) left behind when a job terminates. Process tracking is required
                         for preemption by the suspend method when running either the BACKFILL or API
                         schedulers. Process tracking is optional in all other cases.




When process tracking is enabled, all child processes are terminated when the
              main process terminates. This will include any background or orphaned processes
              started in the prolog, epilog, user prolog, and user epilog.

              Process tracking on LoadLeveler for Linux is supported only on RHEL 5 and SLES
              10 systems.

              There are two keywords used in specifying process tracking:
              PROCESS_TRACKING
                 To activate process tracking, set PROCESS_TRACKING=TRUE in the
                 LoadLeveler global configuration file. By default, PROCESS_TRACKING is
                 set to FALSE.
              PROCESS_TRACKING_EXTENSION
                 On AIX, this keyword specifies the path to the loadable kernel module
                 LoadL_pt_ke in the local or global configuration file. If the
                 PROCESS_TRACKING_EXTENSION keyword is not supplied, then
                 LoadLeveler will search the $HOME/bin default directory.
                  On Linux, this keyword specifies the path to the loadable kernel module
                  proctrk.ko in the local or global configuration file. The proctrk.ko kernel
                  module is shipped as source code and must be built and installed on all
                  machines where process tracking is required. See the TWS LoadLeveler:
                  Installation Guide for additional information about which directory to specify
                  when using the PROCESS_TRACKING_EXTENSION configuration keyword.
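
               For example, on an AIX system the two keywords might be set as follows
               (a sketch; the directory containing the kernel extension is illustrative):

                  PROCESS_TRACKING           = TRUE
                  PROCESS_TRACKING_EXTENSION = /usr/lpp/LoadL/full/bin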

               The process tracking kernel extension is not unloaded when the startd daemon
               terminates. Therefore, if a mismatch between the version of the loaded kernel
               extension and the version of the installed kernel extension is found when the
               startd starts up, the daemon will exit. In this case, a reboot of the node is
               needed to unload the currently loaded kernel extension. If you install a new
               version of LoadLeveler that contains a new version of the kernel extension,
               you may need to reboot the node.

              For information about configuration file keyword syntax and other details, see
              Chapter 12, “Configuration file reference,” on page 263.

Querying multiple LoadLeveler clusters
              This topic applies only to those installations having more than one LoadLeveler
              cluster, where the separate clusters have not been organized into a multicluster
              environment.

              To organize separate LoadLeveler clusters into a multicluster environment, see
              “LoadLeveler multicluster support” on page 148.

              You can query, submit, or cancel jobs in multiple LoadLeveler clusters by setting
              up a master configuration file for each cluster and using the LOADL_CONFIG
              environment variable to specify the name of the master configuration file that the
              LoadLeveler commands must use. The master configuration file must be located in
              the /etc directory and the file name must have a format of base_name.cfg where
              base_name is a user defined identifier for the cluster.

               The default name for the master configuration file is /etc/LoadL.cfg. The format
               for the LOADL_CONFIG environment variable is LOADL_CONFIG=/etc/base_name.cfg
               or LOADL_CONFIG=base_name. When you use the form LOADL_CONFIG=base_name,
               the prefix /etc and suffix .cfg are appended to the base_name.

                        The following example explains how you can set up a machine to query multiple
                        clusters:

                         You can configure /etc/LoadL.cfg to point to the configuration files for the
                         “default” cluster, and you can configure /etc/othercluster.cfg to point to the
                         configuration files of another cluster that the user can select.

                        For example, you can enter the following query command:
                        $ llq

                         The llq command uses the configuration from /etc/LoadL.cfg and queries job
                         information from the “default” cluster.

                        To send a query to the cluster defined in the configuration file of
                        /etc/othercluster.cfg, enter:
                        $ env LOADL_CONFIG=othercluster llq

                         Note that the machine from which you issue the llq command is considered a
                         submit-only machine by the other cluster.

Handling switch-table errors
                        Configuration file keywords can be used to control how LoadLeveler responds to
                        switch-table errors.

                        You may use the following configuration file keywords to control how LoadLeveler
                        responds to switch-table errors:
                        v ACTION_ON_SWITCH_TABLE_ERROR
                        v DRAIN_ON_SWITCH_TABLE_ERROR
                        v RESUME_ON_SWITCH_TABLE_ERROR_CLEAR
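
       For example, to drain a node's startd when a switch-table error occurs and to
       resume it automatically when the error clears, you might set the following
       (a sketch; the values shown are illustrative, so check the keyword
       descriptions for the supported settings):

          DRAIN_ON_SWITCH_TABLE_ERROR        = true
          RESUME_ON_SWITCH_TABLE_ERROR_CLEAR = true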

                        For information about configuration file keyword syntax and other details, see
                        Chapter 12, “Configuration file reference,” on page 263.

Providing additional job-processing controls through installation exits
                        LoadLeveler allows administrators to further configure the environment through
                        installation exits.

                        Table 17 lists these additional job-processing controls.
                         Table 17. Roadmap of administrator tasks accomplished through installation exits
                         To learn about:                           Read the following:
                         Writing a program to control when jobs    “Controlling the central manager scheduling
                         are scheduled to run                      cycle” on page 73
                         Writing a pair of programs to override    “Handling DCE security credentials” on page 74
                         the default LoadLeveler DCE
                         authentication method
                         Writing a program to refresh an AFS       “Handling an AFS token” on page 75
                         token when a job starts
                         Writing a program to check or modify      “Filtering a job script” on page 76
                         job requests when they are submitted
                         Writing programs to run before and        “Writing prolog and epilog programs” on page 77
                         after job requests
                         Overriding the LoadLeveler default        “Using your own mail program” on page 81
                         mail notification method
                         Defining a cluster metric to determine    See the CLUSTER_METRIC configuration
                         where a remote job is distributed         keyword description in Chapter 12, “Configuration
                                                                   file reference,” on page 263.
                         Defining a cluster user mapper for a      See the CLUSTER_USER_MAPPER configuration
                         multicluster environment                  keyword description in Chapter 12, “Configuration
                                                                   file reference,” on page 263.
                         Correctly specifying configuration file   Chapter 12, “Configuration file reference,” on page
                         keywords                                  263



Controlling the central manager scheduling cycle
      To determine when to run the LoadLeveler scheduling algorithm, the central
      manager uses the values set in the configuration file for the
      NEGOTIATOR_INTERVAL and the NEGOTIATOR_CYCLE_DELAY keywords.

      The central manager will run the scheduling algorithm every
      NEGOTIATOR_INTERVAL seconds, unless some event takes place such as the
      completion of a job or the addition of a machine to the cluster. In such cases, the
      scheduling algorithm is run immediately. When NEGOTIATOR_CYCLE_DELAY is
      set, a minimum of NEGOTIATOR_CYCLE_DELAY seconds will pass between the
      central manager’s scheduling attempts, regardless of what other events might take
      place. When the NEGOTIATOR_INTERVAL is set to zero, the central manager
      will not run the scheduling algorithm until instructed to do so by an authorized
      process. This setting enables your program to control the central manager’s
      scheduling activity through one of the following:
      v The llrunscheduler command.
      v The ll_run_scheduler subroutine.
      Both the command and the subroutine instruct the central manager to run the
      scheduling algorithm.
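
       For example (values are illustrative):

          # Run the scheduling algorithm every 30 seconds, but never more often
          # than every 5 seconds:
          NEGOTIATOR_INTERVAL    = 30
          NEGOTIATOR_CYCLE_DELAY = 5

          # Alternatively, disable automatic scheduling so that an authorized
          # program drives the cycle through llrunscheduler or ll_run_scheduler:
          # NEGOTIATOR_INTERVAL = 0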

      You might choose to use this setting if, for example, you want to write a program
      that directly controls the assignment of the system priority for all LoadLeveler jobs.
      In this particular case, you would complete the following steps to control system
      priority assignment and the scheduling cycle:
      1. Decide the following:
          v Which system priority value to assign to jobs from specific sources or with
             specific resource requirements.
          v How often the central manager should run the scheduling algorithm. Your
             program has to be designed to issue the ll_run_scheduler subroutine at
             regular intervals; otherwise, LoadLeveler will not attempt to schedule any
             job steps.
          You also need to understand how changing the system priority affects the job
          queue. After you successfully use the ll_modify subroutine or the llmodify
          command to change system priority values, LoadLeveler will not readjust the
                           values for those job steps when the negotiator recalculates priorities at
                           regular intervals set through the
                              NEGOTIATOR_RECALCULATE_SYSPRIO_INTERVAL keyword. Also, you
                              can change the system priority for jobs only when those jobs are in the Idle
                              state or a state similar to it. To determine which job states are similar to the
                              Idle state or to the Running state, see the table in “LoadLeveler job states” on
                              page 19.
                        2.    Code a program to use LoadLeveler APIs to perform the following functions:
                              a. Use the Data Access APIs to obtain data about all jobs.
                              b. Determine whether jobs have been added or removed.
                              c. Use the ll_modify subroutine to set the system priority for the LoadLeveler
                                  jobs. The values you set through this subroutine will not be readjusted
                                  when the negotiator recalculates job step priorities.
                              d. Use the ll_run_scheduler subroutine to instruct the central manager to run
                                  the scheduling algorithm.
                              e. Set a timer for the scheduling interval, to repeat the scheduling instruction
                                  at regular intervals. This step is required to replace the effect of setting the
                                  configuration keyword NEGOTIATOR_CYCLE_DELAY, which LoadLeveler
                                  ignores when NEGOTIATOR_INTERVAL is set to zero.
                        3.    In the configuration file, set values for the following keywords:
                              v Set the NEGOTIATOR_INTERVAL keyword to zero to stop the central
                                 manager from automatically recalculating system priorities for jobs.
                              v (Optional) Set the SYSPRIO_THRESHOLD_TO_IGNORE_STEP keyword to
                                 specify a threshold value. If the system priority assigned to a job step is less
                                 than this threshold value, the job will remain idle.
                        4.    Issue the llctl command with either the reconfig or recycle keyword.
                              Otherwise, LoadLeveler will not process the modifications you made to the
                              configuration file.
                        5.    (Optional) To make sure that the central manager’s automatic scheduling
                              activity has been disabled (by setting the NEGOTIATOR_INTERVAL keyword
                              to zero), use the llstatus command.
                        6.    Run your program under a user ID with administrator authority.

                        Once this procedure is complete, you might want to use one or more of the
                        following commands to make sure that jobs are scheduled according to the correct
                        system priority. The value of q_sysprio in command output indicates the system
                        priority for the job step.
                        v Use the command llq -s to detect whether a job step is idle because its system
                           priority is below the value set for the
                           SYSPRIO_THRESHOLD_TO_IGNORE_STEP keyword.
                        v Use the command llq -l to display the previous system priority for a job step.
                        v When unusual circumstances require you to change system priorities manually:
                           1. Use the command llmodify -s to set the system priority for LoadLeveler jobs.
                               The values you set through this command will not be readjusted when the
                               negotiator recalculates job step priorities.
                             2. Use the llrunscheduler command to instruct the central manager to run the
                                scheduling algorithm.

             Handling DCE security credentials
                        You can write a pair of programs to override the default LoadLeveler DCE
                        authentication method.

                        To enable the programs, use the DCE_AUTHENTICATION_PAIR keyword in
                       your configuration file:
                       v To substitute your own programs for the LoadLeveler-supplied default, specify:
                            DCE_AUTHENTICATION_PAIR = program1, program2
                         where program1 and program2 are the credential-handling programs described
                         in “Forwarding DCE credentials.”
                       v As an alternative, you can specify the program pair supplied with LoadLeveler:
                            DCE_AUTHENTICATION_PAIR = $(BIN)/llgetdce, $(BIN)/llsetdce

      Specifying the DCE_AUTHENTICATION_PAIR keyword enables LoadLeveler
      support for forwarding DCE credentials to LoadLeveler jobs. You may override the
      default function provided by LoadLeveler to establish DCE credentials by
      substituting your own programs.

      Using the alternative program pair: llgetdce and llsetdce
      The program pair, llgetdce and llsetdce, forwards DCE credentials by copying
      credential cache files from the submitting machine to the executing machines.

       While this technique may require less overhead, it has been known to produce
       credentials on the executing machines that are not fully capable of being
       forwarded by rsh commands. This pair was the only one offered in earlier
       releases of LoadLeveler.

      Forwarding DCE credentials
       An example of a credentials object is a character string containing the DCE
       principal name and a password.

      program1 writes the following to standard output:
      v The length of the handle to follow
      v The handle

      If program1 encounters errors, it writes error messages to standard error.

      program2 receives the following as standard input:
      v The length of the handle to follow
      v The same handle written by program1

      program2 writes the following to standard output:
      v The length of the login context to follow
      v An exportable DCE login context, which is the idl_byte array produced from the
        sec_login_export_context DCE API call. For more information, see the DCE
        Security Services API chapter in the Distributed Computing Environment for AIX:
        Application Development Reference.
       v A character string suitable for assigning to the KRB5CCNAME environment
         variable. This string represents the location of the credentials cache
         established so that program2 can export the DCE login context.

      If program2 encounters errors, it writes error messages to standard error. The parent
      process, the LoadLeveler starter process, writes those messages to the starter log.

      For examples of programs that enable DCE security credentials, see the
      samples/lldce subdirectory in the release directory.

Handling an AFS token
      You can write a program, run by the scheduler, to refresh an AFS token when a job
      is started.

      To invoke the program, use the AFS_GETNEWTOKEN keyword in your
      configuration file.
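
       For example (the program path is illustrative):

          AFS_GETNEWTOKEN = /usr/local/loadl/exits/refresh_afs_token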


Before running the program, LoadLeveler sets up standard input and standard
                        output as pipes between the program and LoadLeveler. LoadLeveler also sets up
                        the following environment variables:
                        LOADL_STEP_OWNER
                                 The owner (UNIX user name) of the job
                        LOADL_STEP_COMMAND
                                 The name of the command the user’s job step invokes.
                        LOADL_STEP_CLASS
                                 The class this job step will run.
                        LOADL_STEP_ID
                                 The step identifier, generated by LoadLeveler.
                        LOADL_JOB_CPU_LIMIT
                                 The number of CPU seconds the job is limited to.
                        LOADL_WALL_LIMIT
                                 The number of wall clock seconds the job is limited to.

                        LoadLeveler writes the following current AFS credentials, in order, over the
                        standard input pipe:
                        v The ktc_principal structure indicating the service.
                        v The ktc_principal structure indicating the client.
                        v The ktc_token structure containing the credentials.

                        The ktc_principal structure is defined in the AFS header file afs_rxkad.h. The
                        ktc_token structure is defined in the AFS header file afs_auth.h.

                        LoadLeveler expects to read these same structures in the same order from the
                        standard output pipe, except these should be refreshed credentials produced by the
                        installation exit.

                        The installation exit can modify the passed credentials (to extend their lifetime)
                        and pass them back, or it can obtain new credentials. LoadLeveler takes whatever
                        is returned and uses it to authenticate the user prior to starting the user’s job.

             Filtering a job script
                        You can write a program to filter a job script when the job is submitted to the local
                        cluster and when the job is submitted from a remote cluster.

                         This program can, for example, modify defaults or perform site-specific verification
                        of parameters. To invoke the local job filter, specify the SUBMIT_FILTER keyword
                        in your configuration file. To invoke the remote job filter, specify the
                        CLUSTER_REMOTE_JOB_FILTER keyword in your configuration file. For more
                        information on these keywords, see the SUBMIT_FILTER or
                        CLUSTER_REMOTE_JOB_FILTER keyword in Chapter 12, “Configuration file
                        reference,” on page 263.

                        LoadLeveler sets the following environment variables when the program is
                        invoked:
                        LOADL_ACTIVE
                               LoadLeveler version
                        LOADL_STEP_COMMAND
                               Job command file name
                        LOADL_STEP_ID
                               The job identifier, generated by LoadLeveler
                        LOADL_STEP_OWNER
                               The owner (UNIX user name) of the job
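
                         As an illustration, the following Korn shell filter logs each submission and
                         passes the job command file through unchanged. It is a sketch only: it assumes
                         the filter receives the job command file on standard input and that whatever it
                         writes to standard output is submitted in its place, and the log path and filter
                         location are illustrative.

                            #!/bin/ksh
                            # Hypothetical submit filter: record who submitted what, then
                            # pass the job command file through unchanged.
                            LOG=/tmp/submit_filter.log
                            print "$(date) owner=$LOADL_STEP_OWNER file=$LOADL_STEP_COMMAND" >> $LOG
                            # Whatever is written to standard output becomes the job command file.
                            cat -

                         The corresponding configuration file entry might be:
                            SUBMIT_FILTER = /usr/local/loadl/exits/submit_filter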


For details about specific keyword syntax and use in the configuration file, see
      Chapter 12, “Configuration file reference,” on page 263.

Writing prolog and epilog programs
      An administrator can write prolog and epilog installation exits that can run before
      and after a LoadLeveler job runs, respectively.

      Prolog and epilog programs fall into two types:
      v Those that run as the LoadLeveler user ID.
      v Those that run in a user’s environment.

      Depending on the type of processing you want to perform before or after a job
      runs, specify one or more of the following configuration file keywords, in any
      combination:
      v To run a prolog or epilog program under the LoadLeveler user ID, specify
        JOB_PROLOG or JOB_EPILOG, respectively.
      v To run a prolog or epilog program under the user’s environment, specify
        JOB_USER_PROLOG or JOB_USER_EPILOG, respectively.
      You do not have to provide a prolog/epilog pair of programs. You may, for
      example, use only a prolog program that runs under the LoadLeveler user ID.
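
       For example, to run a prolog and an epilog under the LoadLeveler user ID and a
       prolog in the user's environment, you might set (paths are illustrative):

          JOB_PROLOG      = /usr/local/loadl/exits/prolog
          JOB_EPILOG      = /usr/local/loadl/exits/epilog
          JOB_USER_PROLOG = /usr/local/loadl/exits/user_prolog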

      For details about specific keyword syntax and use in the configuration file, see
      Chapter 12, “Configuration file reference,” on page 263.

      Note: If process tracking is enabled and your prolog or epilog program invokes
            the mailx command, set the sendwait variable to prevent the background
            mail process from being killed by process tracking.

      A user environment prolog or epilog runs with AFS authentication if installed and
      enabled. For security reasons, you must code these programs on the machines
      where the job runs and on the machine that schedules the job. If you do not define
      a value for these keywords, the user environment prolog and epilog settings on the
      executing machine are ignored.

      The user environment prolog and epilog can set environment variables for the job
      by sending information to standard output in the following format:
      env id = value

      Where:
      id     Is the name of the environment variable
      value Is the value (setting) of the environment variable

      For example, the user environment prolog sets the environment variable
      STAGE_HOST for the job:
      #!/bin/sh
      echo env STAGE_HOST=shd22

      Coding conventions for prolog programs
      The prolog program is invoked by the starter process.

      Once the starter process invokes the prolog program, the program obtains
      information about the job from environment variables.

      Syntax:

prolog_program

                        Where prolog_program is the name of the prolog program as defined in the
                        JOB_PROLOG keyword.

                        No arguments are passed to the program, but several environment variables are
                        set. For more information on these environment variables, see “Run-time
                        environment variables” on page 400.

                        The real and effective user ID of the prolog process is the LoadLeveler user ID. If
                        the prolog program requires root authority, the administrator must write a secure
                        C or Perl program to perform the desired actions. You should not use shell scripts
                        with set uid permissions, since these scripts may make your system susceptible to
                        security problems.

                        Return code values:
                        0      The job will begin.

                        If the prolog program is ended with a signal, the job does not begin and a message
                        is written to the starter log.

                        Sample prolog programs:
                        v Sample of a prolog program for korn shell:
                           #!/bin/ksh
                           #
                           # Set up environment
                           set -a
                           . /etc/environment
                           . /.profile
                           export PATH="$PATH:/loctools/lladmin/bin"
                           export LOG="/tmp/$LOADL_STEP_OWNER.$LOADL_STEP_ID.prolog"
                           #
                           # Do set up based upon job step class
                           #
                           case $LOADL_STEP_CLASS in
                               # An OSL job is about to run; make sure the osl file system is
                               # mounted. If the mount fails, the job step should not run.
                               "OSL")
                                 mount_osl_files >> $LOG
                                 if [ $? -ne 0 ]
                                    then EXIT_CODE=1
                                    else EXIT_CODE=0
                                 fi
                                 ;;
                               # A simulation job is about to run; simulation data has to
                               # be made available to the job. The status from the copy script
                               # must be zero or the job step cannot run.
                               "sim")
                                 copy_sim_data >> $LOG
                                 if [ $? -eq 0 ]
                                    then EXIT_CODE=0
                                    else EXIT_CODE=1
                                 fi
                                 ;;
                               # All other jobs require free space in /tmp; make sure
                               # enough space is available.
                               *)
                                 check_tmp >> $LOG
                                 EXIT_CODE=$?
                                 ;;
                           esac
                           # The job step will run only if EXIT_CODE == 0
                           exit $EXIT_CODE
v Sample of a prolog program for C shell:
  #!/bin/csh
  #
  # Set up environment
  source /u/loadl/.login
  #
  setenv PATH "${PATH}:/loctools/lladmin/bin"
  setenv LOG "/tmp/${LOADL_STEP_OWNER}.${LOADL_STEP_ID}.prolog"
  #
  # Do set up based upon job step class
  #
                           switch ($LOADL_STEP_CLASS)
                               # An OSL job is about to run; make sure the osl file system is
                               # mounted. If the mount fails, the job step should not run.
                               case "OSL":
                                 mount_osl_files >> $LOG
                                 if ($status != 0) then
                                   set EXIT_CODE = 1
                                 else
                                   set EXIT_CODE = 0
                                 endif
                                 breaksw
                               # A simulation job is about to run; simulation data has to
                               # be made available to the job. The status from the copy script
                               # must be zero or the job step cannot run.
                               case "sim":
                                 copy_sim_data >> $LOG
                                 if ($status == 0) then
                                   set EXIT_CODE = 0
                                 else
                                   set EXIT_CODE = 1
                                 endif
                                 breaksw
                               # All other jobs require free space in /tmp; make sure
                               # enough space is available.
                               default:
                                 check_tmp >> $LOG
                                 set EXIT_CODE = $status
                                 breaksw
                           endsw

  # The job step will run only if EXIT_CODE == 0
  exit $EXIT_CODE

Coding conventions for epilog programs
The installation defined epilog program is invoked after a job step has completed.

The purpose of the epilog program is to perform any required clean up such as
unmounting file systems, removing files, and copying results. The exit status of
both the prolog program and the job step is set in environment variables.

Syntax:
epilog_program

Where epilog_program is the name of the epilog program as defined in the
JOB_EPILOG keyword.


No arguments are passed to the program but several environment variables are set.
                        These environment variables are described in “Run-time environment variables” on
                        page 400. In addition, the following environment variables are set for the epilog
                        programs:
                        LOADL_PROLOG_EXIT_CODE
                             The exit code from the prolog program. This environment variable is set
                             only if a prolog program is configured to run.
                        LOADL_USER_PROLOG_EXIT_CODE
                             The exit code from the user prolog program. This environment variable is
                             set only if a user prolog program is configured to run.
                        LOADL_JOB_STEP_EXIT_CODE
                             The exit code from the job step.

                        Note: To interpret the exit status of the prolog program and the job step, convert
                              the string to an integer and use the macros found in the sys/wait.h file.
                              These macros include:
                              v WEXITSTATUS: gives you the exit code
                              v WTERMSIG: gives you the signal that terminated the program
                              v WIFEXITED: tells you if the program exited
                              v WIFSIGNALED: tells you if the program was terminated by a signal

                                 The exit codes returned by the WEXITSTATUS macro are the valid codes.
                                 However, if you look at the raw status values, the exit code may appear
                                 to be 256 times the expected return code; the raw values are in the
                                 format returned by the wait3 system call.

                                Sample epilog programs:
                                v Sample of an epilog program for korn shell:
                                  #!/bin/ksh
                                  #
                                  # Set up environment
                                  set -a
                                  . /etc/environment
                                  . /.profile
                                  export PATH="$PATH:/loctools/lladmin/bin"
                                  export LOG="/tmp/$LOADL_STEP_OWNER.$LOADL_STEP_ID.epilog"
                                  #
                                   if [[ -z $LOADL_PROLOG_EXIT_CODE ]]
                                   then
                                      echo "Prolog did not run" >> $LOG
                                   else
                                      echo "Prolog exit code = $LOADL_PROLOG_EXIT_CODE" >> $LOG
                                   fi
                                   #
                                   if [[ -z $LOADL_USER_PROLOG_EXIT_CODE ]]
                                   then
                                      echo "User environment prolog did not run" >> $LOG
                                   else
                                      echo "User environment exit code = $LOADL_USER_PROLOG_EXIT_CODE" >> $LOG
                                   fi
                                   #
                                   if [[ -z $LOADL_JOB_STEP_EXIT_CODE ]]
                                   then
                                      echo "Job step did not run" >> $LOG
                                   else
                                      echo "Job step exit code = $LOADL_JOB_STEP_EXIT_CODE" >> $LOG
                                   fi
                                   #
                                   # Do clean up based upon job step class
                                   #
                                   case $LOADL_STEP_CLASS in
                                     # An OSL job just ran; unmount the file system.
                                     "OSL")
                                       umount_osl_files >> $LOG
                                       ;;
                                     # A simulation job just ran; remove input files.
                                     # Copy results if the simulation was successful (the job
                                     # step exit status is in LOADL_JOB_STEP_EXIT_CODE).
                                     "sim")
                                       rm_sim_data >> $LOG
                                       if [ "$LOADL_JOB_STEP_EXIT_CODE" = 0 ]
                                         then copy_sim_results >> $LOG
                                       fi
                                       ;;
                                     # Clean up /tmp
                                     *)
                                       clean_tmp >> $LOG
                                       ;;
                                   esac
            v Sample of an epilog program for C shell:
              #!/bin/csh
              #
              # Set up environment
              source /u/loadl/.login
              #
              setenv PATH "${PATH}:/loctools/lladmin/bin"
               setenv LOG "/tmp/${LOADL_STEP_OWNER}.${LOADL_STEP_ID}.epilog"
              #
              if ( ${?LOADL_PROLOG_EXIT_CODE} ) then
              echo "Prolog exit code = $LOADL_PROLOG_EXIT_CODE" >> $LOG
              else
              echo "Prolog did not run" >> $LOG
              endif
              #
              if ( ${?LOADL_USER_PROLOG_EXIT_CODE} ) then
                  echo "User environment exit code = $LOADL_USER_PROLOG_EXIT_CODE" >> $LOG
                else
                  echo "User environment prolog did not run" >> $LOG
              endif
              #
              if ( ${?LOADL_JOB_STEP_EXIT_CODE} ) then
                  echo "Job step exit code = $LOADL_JOB_STEP_EXIT_CODE" >> $LOG
                else
                  echo "Job step did not run" >> $LOG
              endif
              #
              # Do clean up based upon job step class
              #
              switch ($LOADL_STEP_CLASS)
                # A OSL job just ran, unmount the filesystem.
                case "OSL":
                  umount_osl_files >> $LOG
                  breaksw
              # A simulation job just ran; remove input files.
              # Copy results if the simulation was successful (the job step
              # exit status is in LOADL_JOB_STEP_EXIT_CODE).
              case "sim":
                rm_sim_data >> $LOG
                if ($LOADL_JOB_STEP_EXIT_CODE == 0) then
                  copy_sim_results >> $LOG
                endif
                breaksw
              # Clean up /tmp
              default:
                clean_tmp >> $LOG
                breaksw
              endsw


Using your own mail program
      You can write a program to override the LoadLeveler default mail notification
      method.

      You can use this program, for example, to display your own messages to users
      when a job completes, or to automate tasks such as sending error messages to a
      network manager.




The syntax for the program is the same as it is for standard UNIX mail programs;
                        the command is called with the following arguments:
                        v -s to indicate a subject.
                        v A pointer to a string containing the subject.
                        v A pointer to a string containing a list of mail recipients.
                        The mail message is taken from standard input.

                        To enable this program to replace the default mail notification method, use the
                        MAIL keyword in the configuration file. For details about specific keyword syntax
                        and use in the configuration file, see Chapter 12, “Configuration file reference,” on
                        page 263.
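
                         The following Korn shell sketch illustrates such a replacement. It assumes
                         only the calling convention described above (-s, a subject string, then the
                         recipients) and the standard mail command; the path, banner text, and file
                         names are illustrative.

                            #!/bin/ksh
                            # Hypothetical mail exit: invoked as  program -s "subject" recipients,
                            # with the message body on standard input.
                            subject="LoadLeveler notification"
                            if [ "$1" = "-s" ]
                            then
                                subject=$2
                                shift 2
                            fi
                            # Prepend a site banner, then hand the message to the system mailer.
                            # The recipients may arrive as one blank-separated string, so let
                            # the shell split the remaining arguments.
                            {
                                print "Message from the LoadLeveler cluster:"
                                print
                                cat -
                            } | mail -s "$subject" $*

                         To enable it, the configuration file might contain:
                            MAIL = /usr/local/loadl/exits/llmail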




Chapter 5. Defining LoadLeveler resources to administer
               After installing LoadLeveler, you may customize it by modifying the
               administration file.

               The administration file optionally lists and defines the machines in the
               LoadLeveler cluster and the characteristics of classes, users, and groups.

               LoadLeveler does not prevent you from having multiple copies of the
               administration file, but you must update all of the copies whenever you change
               one. Keeping only one administration file prevents any confusion.

               Table 18 lists the LoadLeveler resources you may define by modifying the
               administration file.
               Table 18. Roadmap of tasks for modifying the LoadLeveler administration file
               To learn about:                 Read the following:
               Modifying the administration    “Steps for modifying an administration file”
               file
               Defining LoadLeveler            v “Defining machines” on page 84
               resources to administer
                                               v “Defining adapters” on page 86
                                               v “Defining classes” on page 89
                                               v “Defining users” on page 97
                                               v “Defining groups” on page 99
                                               v “Defining clusters” on page 100
               Correctly specifying            Chapter 13, “Administration file reference,” on page 321
               administration file keywords



Steps for modifying an administration file
               All LoadLeveler commands, daemons, and processes read the administration and
               configuration files at startup.

               If you change the administration or configuration files after LoadLeveler has
               already started, any LoadLeveler command or process, such as the LoadL_starter
               process, will read the newer version of the files while the running daemons will
               continue to use the data from the older version. To ensure that all LoadLeveler
               commands, daemons, and processes use the same configuration data, run the
               reconfiguration command on all machines in the cluster each time the
               administration or configuration files are changed.

               Before you begin: You need to:
               v Ensure that the installation procedure has completed successfully and that the
                 administration file, LoadL_admin, exists in LoadLeveler’s home directory. For
                 additional details about installation, see TWS LoadLeveler: Installation Guide.
               v Know how to correctly specify keywords in the administration file. For
                 information about administration file keyword syntax and other details, see
                 Chapter 13, “Administration file reference,” on page 321.



v (Optional) Know how to correctly issue the llextRPD command, if you choose to
                          use it (see “llextRPD - Extract data from an RSCT peer domain” on page 443).

                        Perform the following steps to modify the administration file, LoadL_admin:
                        1. Identify yourself as a LoadLeveler administrator using the LOADL_ADMIN
                           keyword.
                        2. Provide the following stanza types in the administration file:
                           v One machine stanza to define the central manager for the LoadLeveler
                              cluster. You also may create machine stanzas for other machines in the
                              LoadLeveler cluster.
                              You can use the llextRPD command to automatically create machine stanzas.
                           v (Optional) An adapter stanza for each type of network adapter that you want
                              LoadLeveler jobs to be able to request.
                              You can use the llextRPD command to automatically create adapter stanzas.
                        3. (Optional) Specify one or more of the following stanza types:
                           v A class stanza for each set of LoadLeveler jobs that have similar
                              characteristics or resource requirements.
                           v A user stanza for specific users, if their requirements do not match those
                              characteristics defined in the default user stanza.
                           v A group stanza for each set of LoadLeveler users that have similar
                              characteristics or resource requirements.
                        4. (Optional) You may specify values for additional administration file keywords,
                           which are listed and described in “Administration file keyword descriptions”
                           on page 327.
                        5. Notify LoadLeveler daemons by issuing the llctl command with either the
                           reconfig or recycle keyword. Otherwise, LoadLeveler will not process the
                           modifications you made to the administration file.
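
                         Putting these steps together, a minimal LoadL_admin file might look like the
                         following sketch (machine and class names are illustrative, and the class
                         stanza is optional):

                            # Central manager for the cluster:
                            machine_a: type = machine
                                       central_manager = true

                            # An ordinary executing machine:
                            machine_b: type = machine

                            # An optional class for short jobs:
                            short:     type = class
                                       wall_clock_limit = 00:30:00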

Defining machines
                        The information in a machine stanza defines the characteristics of that machine.

                        You do not have to specify a machine stanza for every machine in the LoadLeveler
                        cluster, but you must have one machine stanza for the machine that will serve as
                        the central manager.

                         If you do not specify a machine stanza for a machine in the cluster, the
                         machine and the central manager still communicate and jobs are scheduled on
                         the machine, but the machine is assigned the default values specified in the
                         default machine stanza. If there is no default stanza, the machine is assigned
                         the default values set by LoadLeveler.

                        Any machine name used in the stanza must be a name which can be resolved to
                        an IP address. This name is referred to as an interface name because the name can
                        be used for a program to interface with the machine. Generally, interface names
                        match the machine name, but they do not have to.

                         By default, LoadLeveler appends the DNS domain name to the end of any machine
                         name that has no domain name appended before resolving its address. If you
                         specify a machine name without a domain name and you do not want LoadLeveler
                         to append the DNS domain name to it, specify the name with a trailing period.
                         You may need to specify machine names in this way if you run a cluster with
                         more than one nameserving technique. For example, if you are using a DNS
                         nameserver and running NIS, you may have some machine names that are resolved
                         by NIS to which you do not want LoadLeveler to append DNS names. In situations
                         such as this, you should also specify the name_server keyword in your machine
                         stanzas.

      Under the following conditions, you must have a machine stanza for the machine
      in question:
      v If you set the MACHINE_AUTHENTICATE keyword to true in the
         configuration file, then you must create a machine stanza for each node that
         LoadLeveler includes in the cluster.
      v If the machine’s hostname (the name of the machine returned by the UNIX
         hostname command) does not match an interface name. In this case, you must
         specify the interface name as the machine stanza name and specify the
         machine’s hostname using the alias keyword.
      v If the machine’s hostname does match an interface name but not the correct
         interface name.

      For information about automatically creating machine stanzas, see “llextRPD -
      Extract data from an RSCT peer domain” on page 443.

Planning considerations for defining machines
      There are several matters to consider before customizing the administration file.

      Before customizing the administration file, consider the following:
      v Node availability
        Some workstation owners might agree to accept LoadLeveler jobs only when
        they are not using the workstation themselves. Using LoadLeveler keywords,
        these workstations can be configured to be available at designated times only.
      v Common name space
        To run jobs on any machine in the LoadLeveler cluster, a user needs the same
        uid (the user ID number for a user) and gid (the group ID number for a group)
        on every machine in the cluster.
        For example, if there are two machines in your LoadLeveler cluster, machine_1
        and machine_2, user john must have the same user ID and login group ID in the
        /etc/passwd file on both machines. If user john has user ID 1234 and login group
        ID 100 on machine_1, then user john must have the same user ID and login
        group ID in /etc/passwd on machine_2. (LoadLeveler requires a job to run with
        the same group ID and user ID of the person who submitted the job.)
        If you do not have a user ID on one machine, your jobs will not run on that
        machine. Also, many commands, such as llq, will not work correctly if a user
        does not have a user ID on the central manager machine.
        However, there are cases where you may choose to not give a user a login ID on
        a particular machine. For example, a user does not need an ID on every
        submit-only machine; the user only needs to be able to submit jobs from at least
        one such machine. Also, you may choose to restrict a user’s access to a Schedd
        machine that is not a public scheduler; again, the user only needs access to at
        least one Schedd machine.
      v Resource handling
        Some nodes in the LoadLeveler cluster might have special software installed that
        users might need to run their jobs successfully. You should configure
        LoadLeveler to distinguish those nodes from other nodes using, for example,
        machine features.


Machine stanza format and keyword summary
                        Machine stanzas take the following format.

                        Default values for keywords appear in bold:


                        label: type = machine
                        adapter_stanzas = stanza_list
                        alias = machine_name
                        central_manager = true | false | alt
                        cpu_speed_scale = true | false
                        machine_mode = batch | interactive | general
                        master_node_exclusive = true | false
                        max_jobs_scheduled = number
                        name_server = list
                        pool_list = pool_numbers
                        reservation_permitted = true | false
                        resources = name(count) name(count) ... name(count)
                        schedd_fenced = true | false
                        schedd_host = true | false
                        speed = number
                        submit_only = true | false

                        Figure 11. Format of a machine stanza

             Examples: Machine stanzas
                        These machine stanza examples may apply to your situation.
                        v Example 1
                          In this example, the machine is being defined as the central manager.
                           #
                           machine_a: type = machine
                           central_manager = true    # central manager runs here
                         v Example 2
                           This example sets up a submit-only node. Note that the submit_only keyword in
                           the example is set to true, while the schedd_host keyword is set to false. You
                           must also ensure that you set schedd_host to true on at least one other node
                           in the cluster.
                           #
                           machine_b: type = machine
                           central_manager = false     #   not the central manager
                           schedd_host = false         #   not a scheduling machine
                           submit_only = true          #   submit only machine
                           alias = machineb            #   interface name
                        v Example 3
                          In the following example, machine_c is the central manager and has an alias
                          associated with it:
                           #
                           machine_c: type = machine
                           central_manager = true    # central manager runs here
                           schedd_host = true        # defines a public scheduler
                           alias = brianne


Defining adapters
                        An adapter stanza identifies network adapters that are available on the machines
                        in the LoadLeveler cluster.



If you want LoadLeveler jobs to be able to request specific adapters, you must
          either specify adapter stanzas or configure dynamic adapters in the administration
          file.

          Note the following when using an adapter stanza:
          v An adapter stanza is required for each adapter stanza name you specify on the
            adapter_stanzas keyword of the machine stanza.
          v The adapter_name, interface_address and interface_name keywords are
            required.

          For information about creating adapter stanzas, see “llextRPD - Extract data from
          an RSCT peer domain” on page 443 for peer domains.
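
           As an illustration, a static adapter stanza might look like the following
           sketch (the label, address, and names are illustrative; adapter_name,
           interface_address, and interface_name are the required keywords, and the
           stanza label is what you list on the machine stanza's adapter_stanzas
           keyword):

              en0_node01: type = adapter
                          adapter_name      = en0
                          interface_address = 192.168.1.17
                          interface_name    = node01.example.com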

    Configuring dynamic adapters
          LoadLeveler can dynamically determine the adapters in any operating system
          instance (OSI) that has RSCT installed.

          LoadLeveler must be told on an OSI basis if it is to handle dynamic adapter
          configuration changes for that OSI. The specification of whether to use dynamic or
          static adapter configuration for an OSI is done through the presence or absence of
          the machine stanza’s adapter_stanzas keyword.

If a machine stanza in the administration file contains an adapter_stanzas
statement, LoadLeveler takes this as a directive from the administrator to use
only the specified adapters. For this OSI, LoadLeveler performs no dynamic
adapter configuration or processing. If an adapter change occurs in this OSI,
the administrator must make the corresponding change in the administration
file and then stop and restart, or reconfigure, the LoadLeveler startd daemon
to pick up the adapter changes. If an OSI (machine stanza) in the
administration file does not contain the adapter_stanzas keyword, LoadLeveler
takes this as a directive to configure the adapters for that OSI dynamically.
For that OSI, LoadLeveler determines which adapters are present at startup
through calls to the RMC API. If an adapter change occurs during execution in
the OSI, LoadLeveler automatically detects and handles the change without
requiring a restart or reconfiguration.
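
For illustration, here is a minimal sketch of the two cases; the machine
names and adapter stanza labels are hypothetical:

   # Static configuration: LoadLeveler uses only the listed adapter
   # stanzas and does no dynamic adapter processing for this OSI
   node01: type = machine
           adapter_stanzas = node01-sn0 node01-en0

   # Dynamic configuration: no adapter_stanzas keyword, so LoadLeveler
   # discovers this OSI's adapters at startup through the RMC API
   node02: type = machine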

    Configuring InfiniBand adapters
InfiniBand adapters, known as host channel adapters (HCAs), can be multiported.

Tasks can use the ports of an HCA independently, which allows the scheduling
algorithm to allocate each port separately.

|         Note: InfiniBand adapters are supported on the AIX operating system and in SUSE
|               Linux Enterprise Server (SLES) 9 and SLES 10 on TWS LoadLeveler for
|               POWER clusters.

          An InfiniBand adapter can have multiple adapter ports. Each port on the
          InfiniBand adapter can be connected to one network and will be managed by TWS
          LoadLeveler as a switch adapter. InfiniBand adapter ports derive their resources
          and usage state from the InfiniBand adapter with which they are associated, but
          are allocated to jobs separately.

If you want LoadLeveler jobs to be able to request InfiniBand adapters, you
must either specify adapter stanzas or configure dynamic adapters in the
administration file. The InfiniBand ports are identified to TWS LoadLeveler in
the same way other adapters are. Stanzas are specified in the administration
file if static adapters are used, and the ports are discovered by RSCT if
dynamic adapters are used.

                        The port_number administration keyword has been added to support an
                        InfiniBand port. The port_number keyword specifies the port number of the
                        InfiniBand adapter port. Only InfiniBand ports are managed and displayed by
                        TWS LoadLeveler; the InfiniBand adapter itself is not. The adapter stanza for
                        InfiniBand support only contains the adapter port information. There is no
                        InfiniBand adapter information in the adapter stanza (see example 2 in “Examples:
                        Adapter stanzas” on page 89).

                        Note:
                                1. TWS LoadLeveler distributes the switch adapter windows of the
                                   InfiniBand adapter equally among its ports and the allocation is not
                                   adjusted should all of the resources on one port be consumed.
                                2. The InfiniBand ports determine their usage state and availability from
                                   their InfiniBand adapter. If one port is in use exclusively, no other ports
                                   on the adapter can be used for any other job.
3. If you have a mixed cluster where some nodes use the InfiniBand
   adapter and some nodes use the HPS adapter, you must organize the
   nodes into pools so that a job is dispatched only to nodes with the
   same kind of switch adapter (see the sketch after this list).
                                4. There is no change to the way the InfiniBand adapters are requested on
                                   the job command file network statement; that is, InfiniBand adapters are
                                   requested the same way as any other adapter would be.
5. Because InfiniBand adapters do not support rCxt blocks, jobs that would
   otherwise use InfiniBand adapters, but which also request rCxt blocks
   with the rcxtblks keyword on the network statement, will remain in the
   idle state. This behavior is consistent with how other adapters (for
   example, the HPS) behave in the same situation. You can use the llstatus
   -a command to see rCxt blocks on adapters (see “llstatus - Query
   machine status” on page 512 for more information).
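
The following is a minimal sketch of the pool setup mentioned in note 3; the
machine names and pool numbers are hypothetical:

   # Keep InfiniBand nodes and HPS nodes in separate pools so that a
   # job is dispatched to only one kind of switch adapter
   ib_node01:  type = machine
               pool_list = 1        # InfiniBand nodes belong to pool 1

   hps_node01: type = machine
               pool_list = 2        # HPS nodes belong to pool 2

A job can then be restricted to one pool, for example with a requirements
expression on the Pool machine variable in its job command file.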

             Adapter stanza format and keyword summary
                        Consider this format of an adapter stanza.

                        An adapter stanza has the following format:


                        label: type = adapter
                        adapter_name = name
                        adapter_type = type
                        device_driver_name = name
                        interface_address = IP_address
                        interface_name = name
                        logical_id = id
                        multilink_address = ip_address
                        multilink_list = adapter_name <, adapter_name>*
                        network_id = id
                        network_type = type
                        port_number = number
                        switch_node_number = integer

                        Figure 12. Format of an adapter stanza




Examples: Adapter stanzas
              These adapter stanza examples may apply to your situation.
              v Example 1: Specifying an HPS adapter
                   In the following example, the adapter stanza called
                   “c121s0n10.ppd.pok.ibm.com” specifies an HPS adapter. Note that
                   c121s0n10.ppd.pok.ibm.com is also specified on the adapter_stanzas keyword of
                   the machine stanza for the “yugo” machine.
                            yugo:   type=machine
                                    adapter_stanzas = c121s0n10.ppd.pok.ibm.com
                                    ...

                   c121s0n10.ppd.pok.ibm.com: type = adapter
                                  adapter_name = sn0
                                  network_type = switch
                                  interface_address = 192.168.0.10
                                  interface_name = c121s0n10.ppd.pok.ibm.com
                                  multilink_address = 10.10.10.10
                                  logical_id = 2
                                  adapter_type = Switch_Network_Interface_For_HPS
                                  device_driver_name = sni0
                                  network_id = 1

                   c121f2rp02.ppd.pok.ibm.com: type = adapter
                                  adapter_name = en0
                                  network_type = ethernet
                                  interface_address = 9.114.66.74
                                  interface_name = c121f2rp02.ppd.pok.ibm.com
                                  device_driver_name = ent0
              v Example 2: Specifying an InfiniBand adapter
                   In the following example, the port_number specifies the port number of the
                   InfiniBand adapter port:
                   192.168.9.58: type = adapter
                           adapter_name = ib1
                           network_type = InfiniBand
                           interface_address = 192.168.9.58
                           interface_name = 192.168.9.58
                           logical_id = 23
                           adapter_type = InfiniBand
                           device_driver_name = ehca0
                           network_id = 18338657682652659714
                           port_number = 2


Defining classes
              The information in a class stanza defines characteristics for that class.

              These characteristics can include the quantities of consumable resources that may
              be used by a class per machine or cluster.

              Within a class stanza, you can have optional user substanzas that define policies
              that apply to a user’s job steps that need to use this class. For more information
              about user substanzas, see “Defining user substanzas in class stanzas” on page 94.
              For information about user stanzas, see “Defining users” on page 97.

        Using limit keywords
              A limit is the amount of a resource that a job step or a process is allowed to use.
              (A process is a dispatchable unit of work.) A job step may be made up of several
              processes.

Limits include both a hard limit and a soft limit. When a hard limit is exceeded,
                        the job is usually terminated. When a soft limit is exceeded, the job is usually
                        given a chance to perform some recovery actions. Limits are enforced either per
process or per job step, depending on the type of limit. For parallel job steps,
                        which consist of multiple tasks running on multiple machines, limits are enforced
                        on a per task basis.

                        The class stanza includes the limit keywords shown in Table 19, which allow you
                        to control the amount of resources used by a job step or a job process.
                        Table 19. Types of limit keywords
                        Limit                                       How the limit is enforced
                        as_limit                                    Per process
                        ckpt_time_limit                             Per job step
                        core_limit                                  Per process
                        cpu_limit                                   Per process
                        data_limit                                  Per process
                        default_wall_clock_limit                    Per job step
                        file_limit                                  Per process
                        job_cpu_limit                               Per job step
                        locks_limit                                 Per process
                        memlock_limit                               Per process
                        nofile_limit                                Per process
                        nproc_limit                                 Per user
                        rss_limit                                   Per process
                        stack_limit                                 Per process
                        wall_clock_limit                            Per job step


                        For example, a common limit is the cpu_limit, which limits the amount of CPU
                        time a single process can use. If you set cpu_limit to five hours and you have a job
                        step that forks five processes, each process can use up to five hours of CPU time,
                        for a total of 25 CPU hours. Another limit that controls the amount of CPU used is
                        job_cpu_limit. For a serial job step, if you impose a job_cpu_limit of five hours,
                        the entire job step (made up of all five processes) cannot consume more than five
                        CPU hours. For information on using this keyword with parallel jobs, see
                        job_cpu_limit keyword.

                        You can specify limits in either the class stanza of the administration file or in the
                        job command file. The lower of these two limits will be used to run the job even if
                        the system limit for the user is lower. For more information, see:
                        v “Enforcing limits”
                        v “Administration file keyword descriptions” on page 327 or “Job command file
                           keyword descriptions” on page 359
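
As a brief sketch of the two places a limit can be set (the class name and
values are illustrative); here, the lower job_cpu_limit from the job command
file is the one used:

   # Administration file: class stanza sets hardlimit,softlimit
   medium: type = class
           job_cpu_limit = 1800,1200     # 30 min hard, 20 min soft

   # Job command file: requests a lower limit, so this one is used
   # @ class         = medium
   # @ job_cpu_limit = 900,600
   # @ executable    = /bin/hostname
   # @ queue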

                        Enforcing limits
                        LoadLeveler depends on the underlying operating system to enforce process limits.

Users should verify that a process limit such as rss_limit is enforced by the
operating system; otherwise, setting it in LoadLeveler will have no effect.


Exceeding job step limits:

When a hard limit is exceeded LoadLeveler sends a non-trappable signal (except in
the case of a parallel job) to the process group that LoadLeveler created for the job
step.

When a soft limit is exceeded, LoadLeveler sends a trappable signal to the process
group. Any job application that intends to trap a signal sent by LoadLeveler must
ensure that all processes in the process group set up the appropriate signal
handler.

All processes in the job step normally receive the signal. The exception to this rule
is when a child process creates its own process group. That action isolates the
child’s process, and its children, from any signals that LoadLeveler sends. Any
child process creating its own process group is still known to process tracking. So,
if process tracking is enabled, all the child processes are terminated when the main
process terminates.

Table 20 summarizes the actions that the LoadL_starter daemon takes when a job
step limit is exceeded.
Table 20. Enforcing job step limits
Type of Job         When a Soft Limit is Exceeded          When a Hard Limit is Exceeded
Serial              SIGXCPU or SIGKILL issued              SIGKILL issued
Parallel            SIGXCPU issued to both the user        SIGTERM issued
                    program and to the parallel
                    daemon


On systems that do not support SIGXCPU, LoadLeveler does not distinguish
between hard and soft limits. When a soft limit is reached on these platforms,
LoadLeveler issues a SIGKILL.

Enforcing per process limits:

For per process limits, what happens when your job reaches and exceeds either the
soft limit or the hard limit depends on the operating system you are using.

When a job forks a process that exceeds a per process limit, such as the CPU limit,
the operating system (not LoadLeveler) terminates the process by issuing a
SIGXCPU. As a result, you will not see an entry in the LoadLeveler logs indicating
that the process exceeded the limit. The job will complete with a 0 return code.
LoadLeveler can only report the status of any processes it has started.

If you need more specific information, refer to your operating system
documentation.

How LoadLeveler uses hard limits:

Consider these details on how LoadLeveler uses hard limits.




See Table 21 for more information on specifying limits.
                        Table 21. Setting limits
                        If the hard limit is:                     Then LoadLeveler does the following:
                        Set in both the class stanza and the      Smaller of the two limits is taken into consideration. If
                        job command file                          the smaller limit is the job limit, the job limit is then
                                                                  compared with the user limit set on the machine that
                                                                  runs the job. The smaller of these two values is used.
                                                                  If the limit used is the class limit, the class limit is
                                                                  used without being compared to the machine limit.
                        Not set in either the class stanza or     User per process limit set on the machine that runs
                        the job command file                      the job is used.
                        Set in the job command file and is        The job is not submitted.
                        less than its respective job soft limit
                        Set in the class stanza and is less       Soft limit is adjusted downward to equal the hard
                        than its respective class stanza soft     limit.
                        limit
                        Specified in the job command file         Hard limit must be greater than or equal to the
                                                                  specified soft limit and less than or equal to the limit
                                                                  set by the administrator in the class stanza of the
                                                                  administration file.

                                                                  Note: If the per process limit is not defined in the
                                                                  administration file and the hard limit defined by the
                                                                  user in the job command file is greater than the limit
                                                                  on the executing machine, then the hard limit is set to
                                                                  the machine limit.
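
As an illustration of the first row of Table 21 (the class name and values
are hypothetical):

   # Administration file: class hard limit
   small: type = class
          cpu_limit = 10:00      # 10 minute per-process hard limit

   # Job command file: a smaller limit, so the job limit is selected and
   # then compared with the user limit on the machine that runs the job
   # @ class     = small
   # @ cpu_limit = 5:00
   # @ queue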



             Allowing users to use a class
                        In a class stanza, you may define a list of users or a list of groups to identify those
                        who may use the class.

                        To do so, use the include_users or include_groups keyword, respectively, or you
                        may use both keywords. If you specify both keywords, a particular user must
                        satisfy both the include_users and the include_groups restrictions for the class.
                        This requirement means that a particular user must be defined not only in a User
                        stanza in the administration file, but also in one of the following ways:
                        v The user’s name must appear in the include_users keyword in a Group stanza
                           whose name corresponds to a name in the include_groups keyword of the Class
                           stanza.
                        v The user’s name must appear in the include_groups keyword of the Class
                           stanza. For information about specifying a user name in a group list, see the
                           include_groups keyword description in “Administration file keyword
                           descriptions” on page 327.
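
For instance, a minimal sketch (the user, group, and class names are
hypothetical) in which user carol satisfies both restrictions for class chem:

   # Group stanza: carol is a member of group lab
   lab:  type = group
         include_users = carol dave

   # Class stanza: carol appears in include_users, and her group lab
   # appears in include_groups, so carol may use class chem
   chem: type = class
         include_users = carol
         include_groups = lab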

             Class stanza format and keyword summary
                        Class stanzas are optional.

                        Class stanzas take the following format. Default values for keywords appear in
                        bold.




label: type = class
          admin= list
          allow_scale_across_jobs = true | false
          as_limit= hardlimit,softlimit
          ckpt_dir = directory
          ckpt_time_limit = hardlimit,softlimit
          class_comment = "string"
          core_limit = hardlimit,softlimit
          cpu_limit = hardlimit,softlimit
          data_limit = hardlimit,softlimit
          default_resources = name(count) name(count)...name(count)
          default_node_resources = name(count) name(count)...name(count)
          env_copy = all | master
          exclude_bg = list
          exclude_groups = list
          exclude_users = list
          file_limit = hardlimit,softlimit
          include_bg = list
          include_groups = list
          include_users = list
          job_cpu_limit = hardlimit,softlimit
          locks_limit = hardlimit,softlimit
          master_node_requirement = true | false
          max_node = number
          max_protocol_instances = number
          max_top_dogs = number
          max_total_tasks = number
          maxjobs = number
          memlock_limit = hardlimit,softlimit
          nice = value
          nofile_limit = hardlimit,softlimit
          nproc_limit = hardlimit,softlimit
          priority = number
          rss_limit = hardlimit,softlimit
          smt = yes | no | as_is
          stack_limit = hardlimit,softlimit
|         striping_with_minimum_networks = true | false
          total_tasks = number
          wall_clock_limit = hardlimit,softlimit
          default_wall_clock_limit = hardlimit,softlimit

          Figure 13. Format of a class stanza




    Examples: Class stanzas
          Any of the following class stanza examples may apply to your situation.
          v Example 1: Creating a class that excludes certain users
            class_a: type=class                     # class that excludes users
            priority=10                             # ClassSysprio
            exclude_users=green judy                # Excluded users
          v Example 2: Creating a class for small-size jobs
            small: type=class                                   #   class for small jobs
            priority=80                                         #   ClassSysprio (max=100)
            cpu_limit=00:02:00                                  #   2 minute limit
            data_limit=30mb                                     #   max 30 MB data segment
default_resources=ConsumableVirtualMemory(10mb)     #   resources consumed by each
            ConsumableCpus(1) resA(3) floatinglicenseX(1)       #   task of a small job step if
                                                                #   resources are not explicitly
                                                                #   specified in the job command file
            ckpt_time_limit=3:00,2:00                           #   3 minute hardlimit,
                                                                #   2 minute softlimit
            core_limit=10mb                                     #   max 10 MB core file
file_limit=50mb                                     #   max file size 50 MB
stack_limit=10mb                                    #   max stack size 10 MB
                           rss_limit=35mb                                # max resident set size 35 MB
                           include_users = bob sally                     # authorized users
                        v Example 3: Creating a class for medium-size jobs
                           medium: type=class            #   class for medium jobs
                           priority=70                   #   ClassSysprio
                           cpu_limit=00:10:00            #   10 minute run time limit
                           data_limit=80mb,60mb          #   max 80 MB data segment
                                                         #   soft limit 60 MB data segment
                           ckpt_time_limit=5:00,4:30     #   5 minute hardlimit,
                                                         #   4 minute 30 second softlimit to checkpoint
                           core_limit=30mb               #   max 30 MB core file
                           file_limit=80mb               #   max file size 80 MB
                           stack_limit=30mb              #   max stack size 30 MB
                           rss_limit=100mb               #   max resident set size 100 MB
                           job_cpu_limit=1800,1200       #   hard limit is 30 minutes,
                                                         #   soft limit is 20 minutes
                        v Example 4: Creating a class for large-size jobs
                           large: type=class              # class for large jobs
                           priority=60                    # ClassSysprio
                           cpu_limit=00:10:00             # 10 minute run time limit
                           data_limit=120mb               # max 120 MB data segment
                           default_resources=ConsumableVirtualMemory(40mb)          # resources consumed
                           ConsumableCpus(2) resA(8) floatinglicenseX(1) resB(1)    # by each task of
                                                          # a large job step if resources are not
                                                          # explicitly specified in the job command file
                           ckpt_time_limit=7:00,5:00      # 7 minute hardlimit,
                                                          # 5 minute softlimit to checkpoint
                           core_limit=30mb                # max 30 MB core file
                           file_limit=120mb               # max file size 120 MB
                           stack_limit=unlimited          # unlimited stack size
                           rss_limit=150mb                # max resident set size 150 MB
                           job_cpu_limit = 3600,2700      # hard limit 60 minutes
                                                          # soft limit 45 minutes
                           wall_clock_limit=12:00:00,11:59:55 # hard limit is 12 hours
                        v Example 5: Creating a class for master node machines
                           sp-6hr-sp: type=class          #   class for master node machines
                           priority=50                    #   ClassSysprio (max=100)
                           ckpt_time_limit=25:00,20:00    #   25 minute hardlimit,
                                                          #   20 minute softlimit to checkpoint
                           cpu_limit = 06:00:00           #   6 hour limit
                           job_cpu_limit = 06:00:00       #   hard limit is 6 hours
core_limit = 1mb               #   max 1 MB core file
                           master_node_requirement = true #   master node definition
                        v Example 6: Creating a class for MPICH-GM jobs
                           MPICHGM: type=class            # class for MPICH-GM jobs
                           default_resources = gmports(1) # one gmports resource is consumed by each
                                                          # task, if resources are not explicitly
                                                          # specified in the job command file


Defining user substanzas in class stanzas
In a class stanza, you may define user substanzas using the same syntax as you
would for any stanza in the LoadLeveler administration file.

                        A user substanza within a class stanza defines policies that apply to job steps
                        submitted by that user and belonging to that class. User substanzas are optional
                        and are independent of user stanzas (for information about user stanzas, see
                        “Defining users” on page 97).




Class stanzas that contain user substanzas have the following format:

     label: {
            type = class
            label: {
                  type = user
                  maxidle = number
                  maxjobs = number
                  maxqueued = number
                  max_total_tasks = number
             }
     }

     Figure 14. Format of a user substanza

     When defining substanzas within other stanzas, you must use opening and closing
     braces ({ and }) to mark the beginning and the end of the stanza and substanza.
     The only keywords that are supported in a user substanza are type (required),
     maxidle, maxjobs, maxqueued, and max_total_tasks. For detailed descriptions of
     these keywords, see “Administration file keyword descriptions” on page 327.

Examples: Substanzas
     Any of these substanza examples may apply to your situation.

In the following example, the default machine and class stanzas do not require
braces, but the parallel class stanza does require them. Without braces to
open and close the parallel stanza, it would not be clear that the default and
dept_head user substanzas belong to the parallel class:
     default:
           type = machine
           central_manager = false
           schedd_host = true

     default:
           type = class
           wall_clock_limit = 60:00,30:00

     parallel: {
           type = class

           # Allow at most 50 running jobs for class parallel
           maxjobs = 50

           # Allow at most 10 running jobs for any single
           # user of class parallel
           default: {
                 type = user
                 maxjobs = 10

           }

           # Allow user dept_head to run as many as 20 jobs
           # of class parallel
           dept_head: {type = user
                 maxjobs = 20

           }
     }

     dept_head: type = user
           maxjobs = 30




When user substanzas are used in class stanzas, a default user substanza can be
                        defined. Each class stanza can have its own default user substanza, and even the
                        default class stanza can have a default user substanza. In this example, the default
                        user substanza in the default class indicates that for any combination of class and
                        user, the limits maxidle=20 and maxqueued=30 apply, and that maxjobs and
                        max_total_tasks are unlimited. Some of these values are overridden in the physics
                        class stanza. Here is an example of how class stanzas can be configured:
                        default: {
                              type = class
                              default: {
                                    type = user
                                    maxidle = 20
                                    maxqueued = 30
                                    maxjobs = -1
                                    max_total_tasks = -1
                              }
                        }
                        physics: {
                              type = class
                              default: {
                                    type = user
                                    maxjobs = 10
                                    max_total_tasks = 128
                              }
                              john: {
                                    type = user
                                    maxidle = 10
                                    maxjobs = 14
                              }
                              jane: {
                                    type = user
                                    max_total_tasks = 192
                              }
                        }

                        In the following example, the physics stanza shows which values are inherited
                        from which stanzas:
                        physics: {
                                 type = class
                                 default: {
                                       type = user
                                       # inherited from default class, default user
                                       # maxidle = 20

                                         # inherited from default class, default user
                                         # maxqueued = 30

                                         # overrides value of -1 in default class, default user
                                         maxjobs = 10

                                         # overrides value of -1 in default class, default user
                                         max_total_tasks = 128
                                  }
                                  john: {
                                        type = user
                                        # overrides value of 10 in default user
                                        maxidle = 10

                                         # inherited from default user, which was inherited
                                         # from default class, default user
                                         # maxqueued = 30

                                         # overrides value of 10 in default user
                                         maxjobs = 14


# inherited from default user
                                # max_total_tasks = 128
                         }

                         jane: {
                               type = user
                               # inherited from default user, which was inherited
                               # from default class, default user
                               # maxidle = 20

                               # inherited from default user, which was inherited
                               # from default class, default user
                               # maxqueued = 30

                               # inherited from default user
                               # maxjobs = 10

                               # overrides value of 128 in default user
                               max_total_tasks = 192
                         }
                 }

                 Any user other than john and jane who submits jobs of class physics is subject to
                 the constraints in the default user substanza in the physics class stanza. Should
                 john or jane submit jobs of any class other than physics, they are subject to the
                 constraints in the default user substanza in the default class stanza.

                 In addition to specifying a default user substanza within the default class stanza,
                 an administrator can specify other user substanzas in the default class stanza. It is
                 important to note that all class stanzas will inherit all user substanzas from the
                 default class stanza.

                 Note: An important rule to understand is that a user substanza within a class
                       stanza will inherit its values from the user substanza in the default class
                       stanza first, if a substanza for that user is present. The next location a user
                       substanza inherits values from is the default user substanza within the same
                       class stanza.

                 When no default stanzas or substanzas are provided, the LoadLeveler default for
                 all four keywords is -1 or unlimited.

                 If a user substanza is provided for a user on the class exclude_users list,
                 exclude_users takes precedence and the user substanza will be effectively ignored
                 because that user cannot use the class at all. On the other hand, when
                 include_users is used in a class, the presence of a user substanza implies that the
                 user is permitted to use the class (it is as if the user were present on the
                 include_users list).
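
A brief sketch of the exclude_users interaction (the names are hypothetical):

   physics: {
         type = class
         exclude_users = ted

         # Ignored: ted is on the exclude_users list, so he cannot
         # use class physics at all
         ted: {
               type = user
               maxjobs = 5
         }
   }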

Defining users
                 The information specified in a user stanza defines the characteristics of that user.
                 You can have one user stanza for each user but this is not necessary. If an
                 individual user does not have their own user stanza, that user uses the defaults
                 defined in the default user stanza.

        User stanza format and keyword summary
                 User stanzas take a particular format.


User stanzas take the following format:

                            label: type = user
                            account = list
                            default_class = list
                            default_group = group name
                            default_interactive_class = class name
                            env_copy = all | master
                            fair_shares = number
                            max_node = number
                            max_reservation_duration = number
|                           max_reservation_expiration = number
                            max_reservations = number
                            max_total_tasks = number
                            maxidle = number
                            maxjobs = number
                            maxqueued = number
                            priority = number
                            total_tasks = number

                            Figure 15. Format of a user stanza

                            For more information about the keywords listed in the user stanza format, see
                            Chapter 13, “Administration file reference,” on page 321.

                 Examples: User stanzas
                            Any of the following user stanzas may apply to your situation.
                            v Example 1
                              In this example, user fred is being provided with a user stanza. User fred’s jobs
                              will have a user priority of 100. If user fred does not specify a job class in the
                              job command file, the default job class class_a will be used. In addition, he can
                              have a maximum of 15 jobs running at the same time.
                               # Define user stanzas
                               fred: type = user
                               priority = 100
                               default_class = class_a
                               maxjobs = 15
                            v Example 2
                              This example explains how a default interactive class for a parallel job is set by
                              presenting a series of user stanzas and class stanzas. This example assumes that
                              users do not specify the LOADL_INTERACTIVE_CLASS environment variable.
default: type = user
                                        default_interactive_class = red
                                        default_class = blue

                               carol:    type = user
                                         default_class = single double
                                         default_interactive_class = ijobs

                               steve:    type = user
                                         default_class = single double

                               ijobs:    type = class
                                         wall_clock_limit = 08:00:00

                               red:      type = class
                                         wall_clock_limit = 30:00
                               If the user Carol submits an interactive job, the job is assigned to the default
interactive class called ijobs. The job is assigned a wall clock limit of 8 hours. If
the user Steve submits an interactive job, the job is assigned to the red class
                    from the default user stanza. The job is assigned a wall clock limit of 30
                    minutes.
                  v Example 3
                    In this example, Jane’s jobs have a user priority of 50, and if she does not specify
                    a job class in her job command file the default job class small_jobs is used. This
                    user stanza does not specify the maximum number of jobs that Jane can run at
                    the same time so this value defaults to the value defined in the default stanza.
                    Also, suppose Jane is a member of the primary UNIX group “staff.” Jobs
                    submitted by Jane will use the default LoadLeveler group “staff.” Lastly, Jane
                    can use three different account numbers.
                      # Define user stanzas
                      jane: type = user
                      priority = 50
                      default_class = small_jobs
                      default_group = Unix_Group
                      account = dept10 user3 user4


    Defining groups
                  LoadLeveler groups are another way of granting control to the system
                  administrator.

                  Although a LoadLeveler group is independent from a UNIX group, you can
                  configure a LoadLeveler group to have the same users as a UNIX group by using
                  the include_users keyword.

           Group stanza format and keyword summary
                  The information specified in a group stanza defines the characteristics of that
                  group.

                  Group stanzas are optional and take the following format:

                  label: type = group
                  admin = list
                  env_copy = all | master
                  fair_shares = number
                  exclude_users = list
                  include_users = list
                  max_node = number
                  max_reservation_duration = number
|                 max_reservation_expiration = number
                  max_reservations = number
                  max_total_tasks = number
                  maxidle = number
                  maxjobs = number
                  maxqueued = number
                  priority = number
                  total_tasks = number

                  Figure 16. Format of a group stanza

                  For more information about the keywords listed in the group stanza format, see
                  Chapter 13, “Administration file reference,” on page 321.

           Examples: Group stanzas
                  Any of the following group stanzas may apply to your situation.
                  v Example 1

In this example, the group name is department_a. The jobs issued by users
                               belonging to this group will have a priority of 80. There are three members in
                               this group.
                               # Define group stanzas
                               department_a: type = group
                               priority = 80
                               include_users = susann holly fran
v Example 2
  In this example, the group called great_lakes has five members, and these
  users' jobs have a priority of 100:
                               # Define group stanzas
                               great_lakes: type = group
                               priority = 100
                               include_users = huron ontario michigan erie superior


    Defining clusters
                            The cluster stanza defines the LoadLeveler multicluster environment.

                            Any cluster that wants to participate in the multicluster must have cluster stanzas
                            defined for all clusters with which the local cluster interacts. If you have a cluster
                            stanza defined, LoadLeveler is configured to be in the multicluster environment.

                 Cluster stanza format and keyword summary
                            Cluster stanzas are optional.

                            Cluster stanzas take the following format. Default values for keywords appear in
                            bold.

                            The cluster stanza label must define a unique cluster name within the multicluster
                            environment.


                            label: type = cluster
|                           allow_scale_across_jobs = true | false
                            exclude_classes = class_name[(cluster_name)] ...
                            exclude_groups = group_name[(cluster_name)] ...
                            exclude_users = user_name[(cluster_name)] ...
                            inbound_hosts = hostname[(cluster_name)] ...
                            inbound_schedd_port = port_number
                            include_classes = class_name[(cluster_name)] ...
                            include_groups = group_name[(cluster_name)] ...
                            include_users = user_name[(clustername)] ...
                            local = true | false
|                           main_scale_across_cluster = true | false
                            multicluster_security = SSL
                            outbound_hosts = hostname[(cluster_name)] ...
                            secure_schedd_port = port_number
                            ssl_cipher_list = cipher_list

                            Figure 17. Format of a cluster stanza

                 Examples: Cluster stanzas
                            Any of the following cluster stanzas may apply to your situation.




Figure 18. Multicluster Example: cluster1 (machines M1 and M2, with
SCHEDD_STREAM_PORT = 1966), cluster2 (machines M3, M4, and M5), and cluster3
(machines M6 and M7)

Figure 18 shows a simple multicluster with three clusters defined as members.
Cluster1 has defined an alternate port number for the Schedds running in its
cluster by setting the SCHEDD_STREAM_PORT = 1966. All of the other clusters need to
define what port to use when connecting to the inbound Schedds of cluster1 by
specifying the inbound_schedd_port = 1966 keyword in the cluster1 stanza.
Cluster2 has a single machine connected to cluster1 and two machines connected to
cluster3. Cluster3 has a single machine connected to both cluster2 and cluster1.
Each cluster would set the local keyword to true for their cluster stanza in the
cluster’s administration file.

The cluster stanzas for this multicluster, with three clusters defined as members, are:
cluster1: type=cluster
          outbound_hosts = M2(cluster2) M1(cluster3)
          inbound_hosts = M2(cluster2) M1(cluster3)
          inbound_schedd_port = 1966

cluster2: type=cluster
          outbound_hosts = M3(cluster1) M4(cluster3)
          inbound_hosts = M3(cluster1) M4(cluster3) M5(cluster3)


cluster3: type=cluster
          outbound_hosts = M6
          inbound_hosts = M6




Chapter 6. Performing additional administrator tasks
There are additional ways to modify the LoadLeveler environment that require
administrator intervention.

                 Table 22 lists additional ways to modify the LoadLeveler environment that either
                 require an administrator to customize both the configuration and administration
                 files, or require the use of the LoadLeveler commands or APIs.
                 Table 22. Roadmap of additional administrator tasks
                 To learn about:                       Read the following:
                 Setting up the environment for        “Setting up the environment for parallel jobs” on page
                 parallel jobs                         104
                 Configuring and using an              v “Using the BACKFILL scheduler” on page 110
                 alternative scheduler
                                                       v “Using an external scheduler” on page 115
                                                       v “Example: Changing scheduler types” on page 126
|                Using additional features available   v “Preempting and resuming jobs” on page 126
|                with the BACKFILL scheduler
                                                       v “Configuring LoadLeveler to support reservations”
                                                         on page 131
|                                                      v “Working with reservations” on page 213
|                                                      v “Data staging” on page 113
                 Working with AIX’s workload           “Steps for integrating LoadLeveler with the AIX
                 balancing component                   Workload Manager” on page 137
                 Enabling LoadLeveler’s                “LoadLeveler support for checkpointing jobs” on page
                 checkpoint/restart function           139
                 Enabling LoadLeveler’s affinity       v LoadLeveler scheduling affinity (see “LoadLeveler
                 support                                 scheduling affinity support” on page 146)
                 Enabling LoadLeveler’s                v “LoadLeveler multicluster support” on page 148
                 multicluster support
                                                       v “Configuring a LoadLeveler multicluster” on page
                                                         150
|                                                      v “Scale-across scheduling with multiclusters” on page
|                                                        153
                 Enabling LoadLeveler’s Blue Gene      v “LoadLeveler Blue Gene support” on page 155
                 support
                                                       v “Configuring LoadLeveler Blue Gene support” on
                                                         page 157
                 Enabling LoadLeveler’s fair share     v “Fair share scheduling overview” on page 27
                 scheduling support
                                                       v “Using fair share scheduling” on page 160
                 Moving job records from a down        v “Procedure for recovering a job spool” on page 167
                 Schedd to another Schedd within
                                                       v “llmovespool - Move job records” on page 472
                 the local cluster
                 Correctly specifying configuration    v Chapter 12, “Configuration file reference,” on page
                 and administration file keywords        263
                                                       v Chapter 13, “Administration file reference,” on page
                                                         321
                 Managing LoadLeveler operations




                            v Querying status                   v “llclass - Query class information” on page 433
                                                                v “llq - Query job status” on page 479
                                                                v “llqres - Query a reservation” on page 500
                                                                v “llstatus - Query machine status” on page 512

                            v Changing attributes of submitted v “llfavorjob - Reorder system queue by job” on page
                              jobs                               447
                                                                v “llfavoruser - Reorder system queue by user” on
                                                                  page 449
                                                                v “llmodify - Change attributes of a submitted job
                                                                  step” on page 464
                                                                v “llprio - Change the user priority of submitted job
                                                                  steps” on page 477

                            v Changing the state of submitted   v “llcancel - Cancel a submitted job” on page 421
                              jobs                              v “llhold - Hold or release a submitted job” on page
                                                                  454



    Setting up the environment for parallel jobs
                            Additional administration tasks apply to parallel jobs.

                            This topic describes the following administration tasks that apply to parallel jobs:
                            v Scheduling support
                            v Reducing job launch overhead
                            v Submitting interactive POE jobs
                            v Setting up a class
                            v Setting up a parallel master node
                            v Configuring MPICH jobs
                            v Configuring MVAPICH jobs
                            v Configuring MPICH-GM jobs

                            For information on submitting parallel jobs, see “Working with parallel jobs” on
                            page 194.

                 Scheduling considerations for parallel jobs
|                           For parallel jobs, LoadLeveler supports BACKFILL scheduling for efficient use of
|                           system resources.

                            This scheduler runs both serial and parallel jobs.

                            BACKFILL scheduling also supports:
                            v Multiple tasks per node
                            v Multiple user space tasks per adapter
                            v Preemption

                            Specify the LoadLeveler scheduler using the SCHEDULER_TYPE keyword. For
                            more information on this keyword and supported scheduler types, see “Choosing a
                            scheduler” on page 44.
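
For example, to select the BACKFILL scheduler in the configuration file:

   SCHEDULER_TYPE = BACKFILL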

Steps for reducing job launch overhead for parallel jobs
      Administrators may define a number of LoadLeveler starter processes to be ready
      and waiting to handle job requests.

      Having this pool of ready processes reduces the amount of time LoadLeveler needs
      to prepare jobs to run. You also may control how environment variables are copied
      for a job. Reducing the number of environment variables that LoadLeveler has to
      copy reduces the amount of time LoadLeveler needs to prepare jobs to run.

      Before you begin: You need to know:
      v How many jobs might be starting at the same time. This estimate determines
        how many starter processes to have LoadLeveler start in advance, to be ready
        and waiting for job requests.
      v The type of parallel jobs that typically are used. If IBM Parallel Environment
        (PE) is used for parallel jobs, PE copies the user’s environment to all executing
        nodes. In this case, you may configure LoadLeveler to avoid redundantly
        copying the same environment variables.
      v How to correctly specify configuration keywords. For details about specific
        keyword syntax and use:
        – In the administration file, see Chapter 13, “Administration file reference,” on
           page 321.
        – In the configuration file, see Chapter 12, “Configuration file reference,” on
           page 263.

      Perform the following steps to configure LoadLeveler to reduce job launch
      overhead for parallel jobs.
      1. In the local or global configuration file, specify the number of starter processes
         for LoadLeveler to automatically start before job requests are submitted. Use
         the PRESTARTED_STARTERS keyword to set this value.
         Tip: The default value of 1 should be sufficient for most installations.
      2. If typical parallel jobs use a facility such as Parallel Environment, which copies
         user environment variables to all executing nodes, set the env_copy keyword in
         the class, user, or group stanzas to specify that LoadLeveler only copy user
         environment variables to the master node by default.
         Rules:
         v Users also may set this keyword in the job command file. If the env_copy
            keyword is set in the job command file, that setting overrides any setting in
            the administration file. For more information, see “Step for controlling
            whether LoadLeveler copies environment variables to all executing nodes”
            on page 195.
         v If the env_copy keyword is set in more than one stanza in the administration
            file, LoadLeveler determines the setting to use by examining all values set in
         the applicable stanzas. See the table in the env_copy administration file
         keyword description to determine what value LoadLeveler will use.
      3. Notify LoadLeveler daemons by issuing the llctl command with either the
         reconfig or recycle keyword. Otherwise, LoadLeveler will not process the
         modifications you made to the configuration and administration files.

      When you are done with this procedure, you can use the POE stderr and
      LoadLeveler logs to trace actions during job launch.




Steps for allowing users to submit interactive POE jobs
                        You can set up your system so that users can submit interactive POE jobs to
                        LoadLeveler.

                        Perform the following steps to set up your system so that users can submit
                        interactive POE jobs to LoadLeveler.
                        1. Make sure that you have installed LoadLeveler and defined LoadLeveler
                            administrators. See “Defining LoadLeveler administrators” on page 43 for
                            information on defining LoadLeveler administrators.
                        2. If you are running user space jobs, LoadLeveler must be configured to use
                            switch adapters. One way to do this is to run the llextRPD command to extract node
                            and adapter information from the RSCT peer domain. See “llextRPD - Extract
                            data from an RSCT peer domain” on page 443 for additional information.
                        3. In the configuration file, define your scheduler to be the LoadLeveler
                            BACKFILL scheduler by specifying SCHEDULER_TYPE = BACKFILL. See
                            “Choosing a scheduler” on page 44 for more information.
                        4. In the administration file, specify batch, interactive, or general use for nodes.
                            You can use the machine_mode keyword in the machine stanza to specify the
                            type of jobs that can run on a node; you must specify either interactive or
                            general if you are going to run interactive jobs.
                        5. In the administration file, configure optional functions, including:
                            v Setting up pools: you can organize nodes into pools by using the pool_list
                               keyword in the machine stanza. See “Defining machines” on page 84 for
                               more information.
                            v Enabling SP™ exclusive use accounting: you can specify that the accounting
                               function on an SP system be informed that a job step has exclusive use of a
                               machine by specifying spacct_exclusive_enable = true in the machine stanza.
                               See “Defining machines” on page 84 for more information on these
                               keywords.
                        6. Consider setting up a class stanza for your interactive POE jobs. See “Setting
                            up a class for parallel jobs” for more information. Define this class to be your
                            default class for interactive jobs by specifying this class name on the
                            default_interactive_class keyword. See “Defining users” on page 97 for more
                            information.
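                        A condensed sketch of the entries these steps produce (the machine, pool, and
                        class names are hypothetical):

                        # Configuration file
                        SCHEDULER_TYPE = BACKFILL

                        # Administration file
                        node01: type = machine
                             machine_mode = general
                             pool_list = 1

                        inter_poe: type = class
                             # characteristics for interactive POE jobs

                        default: type = user
                             default_interactive_class = inter_poe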

             Setting up a class for parallel jobs
                        To define the characteristics of parallel jobs run by your installation, set up
                        a class stanza in the administration file and define a class (in the Class
                        statement in the configuration file) for each task you want to run on a node.

                        Suppose your installation plans to submit long-running parallel jobs, and you want
                        to define the following characteristics:
                        v   Only certain users can submit these jobs
                        v   Jobs have a 30 hour run time limit
                        v   A job can request a maximum of 60 nodes and 120 total tasks
                        v   Jobs will have a relatively low run priority

                        The following is a sample class stanza for long-running parallel jobs which takes
                        into account these characteristics:



long_parallel: type=class
          wall_clock_limit = 108000
          include_users = jack queen king ace
          priority = 50
          total_tasks = 120
          max_node = 60
          maxjobs = 2

          Note the following about this class stanza:
          v The wall_clock_limit keyword sets a wall clock limit of 108000 seconds (30
            hours) for jobs in this class
          v The include_users keyword allows four users to submit jobs in this class
          v The priority keyword sets a relative priority of 50 for jobs in this class
          v The total_tasks keyword specifies that a user can request up to 120 total tasks
            for a job in this class
          v The max_node keyword specifies that a user can request up to 60 nodes for a
            job in this class
          v The maxjobs keyword specifies that a maximum of two jobs in this class can run
            simultaneously

          Suppose users need to submit job command files containing the following
          statements:
          node = 30
          tasks_per_node = 4

          In your LoadL_config file, you must code the Class statement such that at least 30
          nodes have four or more long_parallel classes defined. That is, the configuration
          file for each of these nodes must include the following statement:
          Class = { "long_parallel" "long_parallel" "long_parallel" "long_parallel" }

          or
          Class = long_parallel(4)

          For more information, see “Defining LoadLeveler machine characteristics” on page
          54.

|   Striping when some networks fail
|         When multiple networks are configured in a cluster, a job can request striping over
|         the networks by setting sn_all in the network statement in the job command file.
|         The striping_with_minimum_networks administration file keyword in the class
|         stanza is used to tell LoadLeveler how to select nodes for sn_all jobs of a specific
|         class when one or more networks are unavailable. When
|         striping_with_minimum_networks is set to false for a class, LoadLeveler will only
|         select nodes for sn_all jobs of that class where all the networks are up and in the
|         READY state. When striping_with_minimum_networks is set to true, LoadLeveler
|         will select a set of nodes where at least more than half of the networks on the
|         nodes are up and in the READY state.

|         For example, if there are 8 networks connected to a node and
|         striping_with_minimum_networks is set to false, all 8 networks would have to be
|         up and in the READY state to consider that node for sn_all jobs. If
|         striping_with_minimum_networks is set to true, nodes with at least 5 of the 8
|         networks up and in the READY state would be considered for sn_all jobs.
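          For example, a class that tolerates degraded networks might be defined as
          follows (the class name is hypothetical):

          sn_all_class: type = class
               striping_with_minimum_networks = true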



Setting up a parallel master node
                        LoadLeveler allows you to define a parallel master node that LoadLeveler will use
                        as the first node for a job submitted to a particular class.

                        To set up a parallel master node, code the following keywords in the node’s class
                        and machine stanzas in the administration file:
                        # MACHINE STANZA: (optional)
                        mach1:     type = machine
                        master_node_exclusive = true


                        # CLASS STANZA: (optional)
                        pmv3:      type = class
                        master_node_requirement = true

                        Specifying master_node_requirement = true forces all parallel jobs in this class to
                        use, as their first node, a machine with the master_node_exclusive = true setting.
                        For more information on these keywords, see “Defining machines” on page 84 and
                        “Defining classes” on page 89.

             Configuring LoadLeveler to support MPICH jobs
                        The MPICH package can be configured so that LoadLeveler will be used to spawn
                        all tasks in an MPICH application.

                        Using LoadLeveler to spawn MPICH tasks allows LoadLeveler to accumulate
                        accounting data for the tasks and also allows LoadLeveler to ensure that all tasks
                        are terminated when the job completes.

                        For LoadLeveler to spawn the tasks of an MPICH job, the MPICH package must be
                        configured to use the LoadLeveler llspawn.stdio command when starting tasks. To
                        configure MPICH to use llspawn.stdio, set the environment variable
                        RSHCOMMAND to the location of the llspawn.stdio command and run the
                        configure command for the MPICH package.

                        On Linux systems, enter the following:
                        # export RSHCOMMAND=/opt/ibmll/LoadL/full/bin/llspawn.stdio
                        # ./configure

                        Note: This configuration works on MPICH-1.2.7. Additional documentation for
                              MPICH is available from the Argonne National Laboratory web site at
                              http://www-unix.mcs.anl.gov/mpi/mpich1/.

             Configuring LoadLeveler to support MVAPICH jobs
                        To run MVAPICH jobs under LoadLeveler control, you must specify the llspawn
                        command to replace the default RSHCOMMAND value during software
                        configuration.

                        The compiled MVAPICH implementation code uses the llspawn command to start
                        tasks under LoadLeveler control. This allows LoadLeveler to have total control
                        over the remote tasks for accounting and cleanup.

                        To configure the MVAPICH code to use the llspawn command as
                        RSHCOMMAND, change the mpirun_rsh.c program source code by following
                        these steps before compiling MVAPICH:
                        1. Replace:

void child_handler(int);
          with:
          void child_handler(int);
          void term_handler(int);
       2. For Linux, replace:
          #define RSH_CMD "/usr/bin/rsh"
          #define SSH_CMD "/usr/bin/ssh"
          with:
          #define RSH_CMD "/opt/ibmll/LoadL/full/bin/llspawn"
          #define SSH_CMD "/opt/ibmll/LoadL/full/bin/llspawn"
       3. Replace:
          signal(SIGCHLD, child_handler);
          with:
          signal(SIGCHLD, SIG_IGN);
          signal(SIGTERM, term_handler);
       4. Add the definition for the term_handler function at the end:
          void term_handler(int signal)
          {
            exit(0);
          }

Configuring LoadLeveler to support MPICH-GM jobs
      To run MPICH-GM jobs under LoadLeveler control, you need to configure the
      MPICH-GM implementation you are using by specifying the llspawn command as
      RSHCOMMAND.

      The compiled MPICH-GM implementation code uses the llspawn command to
      start tasks under LoadLeveler control. This allows LoadLeveler to have total
      control over the remote tasks for accounting and cleanup.

      To configure the MPICH-GM code to use the llspawn command as
      RSHCOMMAND, change the mpich.make.gcc script before compiling the
      MPICH-GM:

       Replace:
       setenv RSHCOMMAND /usr/bin/rsh

       with:
       setenv RSHCOMMAND /opt/ibmll/LoadL/full/bin/llspawn

       LoadLeveler does not manage the GM ports on the Myrinet switch. For
       LoadLeveler to keep track of the GM ports, they must be identified as
       LoadLeveler consumable resources.

      Perform the following steps to use consumable resources to manage GM ports:
      1. Pick a name for the GM port resource.
         Example: As an example, this procedure assumes the name is gmports, but you
         may use another name.
         Tip: Users who submit MPICH-GM jobs need to know the name that you
         define for the GM port resource.
      2. In the LoadLeveler configuration file, specify the GM port resource name on
         the SCHEDULE_BY_RESOURCES keyword.
         Example:

SCHEDULE_BY_RESOURCES = gmports
                           Tip: If the SCHEDULE_BY_RESOURCES keyword already is specified in the
                           configuration file, you can just add the GM port resource name to other values
                           already listed.
                        3. In the administration file, specify how many GM ports are available on each
                           machine. Use the resources keyword to specify the GM port resource name and
                           the number of GM ports.
                           Example:
                            resources=gmports(n)
                            Tips:
                            v The resources keyword also must appear in the job command file for an
                              MPICH-GM job.
                              Example:
                               resources=gmports(1)
                           v To determine the value of n use either the number specified in the GM
                              documentation or the number of GM ports you have successfully used.
                              Certain system configurations may not support all available GM ports, so
                              you might need to specify a lower value for the gmports resource than what
                              is actually available.
                        4. Issue the llctl command with either the reconfig or recycle keyword.
                           Otherwise, LoadLeveler will not process the modifications you made to the
                           configuration and administration files.
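                        Putting steps 2 and 3 together, a minimal sketch (the machine name and the
                        port count of 4 are hypothetical):

                        # Configuration file
                        SCHEDULE_BY_RESOURCES = gmports

                        # Administration file
                        node01: type = machine
                             resources = gmports(4)

                        # Job command file statement supplied by the user
                        # @ resources = gmports(1)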

                        For information about submitting MPICH-GM jobs, see “Running MPICH,
                        MVAPICH, and MPICH-GM jobs” on page 204.

Using the BACKFILL scheduler
                        The BACKFILL scheduling algorithm in LoadLeveler is designed to maximize the
                        use of resources to achieve the highest system efficiency, while preventing
                        potentially excessive delays in starting jobs with large resource requirements.

                        These large jobs can run because the BACKFILL scheduler does not allow jobs
                        with smaller resource requirements to continuously use up resources before the
                        larger jobs can accumulate enough resources to run. While BACKFILL can be used
                        for both serial and parallel jobs, the potential advantage is greater with parallel
                        jobs.

                        Job steps are arranged in a queue based on their SYSPRIO order as they arrive
                        from the Schedd nodes in the cluster. The queue can be reordered periodically,
                        depending on the value of the NEGOTIATOR_RECALCULATE_SYSPRIO_INTERVAL keyword.
                        In each dispatching cycle, as determined by the NEGOTIATOR_INTERVAL and
                        NEGOTIATOR_CYCLE_DELAY configuration keywords, the BACKFILL algorithm
                        examines these job steps sequentially in an attempt to find available resources to
                        run each job step, then dispatches those steps to run.

                        Once the BACKFILL algorithm encounters a job step for which it cannot
                        immediately find enough resources, that job step becomes known as a "top dog".
                        The BACKFILL algorithm can allocate multiple top dogs in the same dispatch
                        cycle. By using the MAX_TOP_DOGS configuration keyword (for more
                        information, see Chapter 12, “Configuration file reference,” on page 263), you can
                        define the maximum number of top dogs that the central manager will allocate.
                        For each top dog, the BACKFILL algorithm will attempt to calculate the earliest
                        time at which enough resources will become free to run the corresponding top

dog. This is based on the assumption that each currently running job step will run
    until its hard wall clock limit is reached and that when a job step terminates, the
    resources which that step has been using will become available.

     The time at which enough currently running job steps will have terminated,
     meaning enough resources have become available to run a top dog, is called the
     top dog's future start time. The future start time of each top dog is effectively
    guaranteed for the remainder of the execution of the BACKFILL algorithm. The
    resources that each top dog will use at its corresponding start time and for its
    duration, as specified by its hard wall clock limit, are reserved (not to be confused
    with the reservation feature available in LoadLeveler).

    Note: A job that is bound to a reservation is not considered for top-dog
          scheduling, so there is no top-dog scheduling performed inside reservations.

    In some cases, it may not be possible to calculate the future start time of a job step.
    Consider, for example, a case where there are 20 nodes in the cluster and a job step
    requires 24 nodes to run. Even when all nodes in the cluster are idle, it will not be
    possible for this job step to run. Only the addition of nodes to the cluster would
    allow the job step to run, and there is no way the BACKFILL algorithm can make
     any assumptions about when that could take place. In situations like this, the job
     step is not considered a "top dog", no resources are "reserved", and the BACKFILL
     algorithm goes on to the next job step in the queue.

|   The BACKFILL scheduling algorithm classifies job steps into distinct types:
|   REGULAR, TOP DOG, and BACKFILL:
|   v The REGULAR job step is a job step for which enough resources are currently
|     available and no top dogs have yet been allocated.
|   v The TOP DOG job step is a job step for which not enough resources are
|     currently available, but enough resources are available at a future time and one
|     of the following conditions is met:
|     – The TOP DOG job step is not expected to run at a time when any other top
|        dog is expected to run.
|     – If the TOP DOG is expected to run at a time when some other top dogs are
|        expected to run, then it cannot be using resources reserved by such top dogs.
|   v The BACKFILL job step is a job step for which enough resources are currently
|     available and one of the following conditions is met:
|     – The BACKFILL job step is expected to complete before the future start times
|        of all top dogs, based on the hard wall clock limit of the BACKFILL job step.
|     – If the BACKFILL job step is not expected to complete before the future start
|        time of at least one top dog, then it cannot be using resources reserved by the
|        top dogs that are expected to start before BACKFILL job step is expected to
|        complete.

    Table 23 provides a roadmap of BACKFILL scheduler tasks.
    Table 23. Roadmap of BACKFILL scheduler tasks
    Subtask                       Associated instructions (see . . . )
    Configuring the BACKFILL      v “Choosing a scheduler” on page 44
    scheduler
                                  v “Tips for using the BACKFILL scheduler” on page 112
                                  v “Example: BACKFILL scheduling” on page 113




Table 23. Roadmap of BACKFILL scheduler tasks (continued)
                            Subtask                        Associated instructions (see . . . )
                            Using additional LoadLeveler   v “Preempting and resuming jobs” on page 126
                            features available under the
                                                           v “Configuring LoadLeveler to support reservations” on
                            BACKFILL scheduler
                                                             page 131
|                                                          v “Working with reservations” on page 213
|                                                          v “Data staging” on page 113
|                                                          v “Scale-across scheduling with multiclusters” on page 153
                            Use the BACKFILL scheduler     v “llclass - Query class information” on page 433
                            to dispatch and manage jobs
                                                           v “llmodify - Change attributes of a submitted job step” on
                                                             page 464
                                                           v “llpreempt - Preempt a submitted job step” on page 474
                                                           v “llq - Query job status” on page 479
                                                           v “llsubmit - Submit a job” on page 531
                                                           v “Data access API” on page 560
                                                           v “Error handling API” on page 639
                                                           v “ll_modify subroutine” on page 677
                                                           v “ll_preempt subroutine” on page 686



                 Tips for using the BACKFILL scheduler
                            There are a number of essential considerations to make when using the BACKFILL
                            scheduler.

                            Note the following when using the BACKFILL scheduler:
                            v To use this scheduler, either users must set a wall-clock limit in their job
                              command file or the administrator must define a wall-clock limit value for the
                              class to which a job is assigned. Jobs with a wall_clock_limit of unlimited
                              cannot be used to backfill because they might not finish in time.
                            v Using wall clock limits that accurately reflect the actual running time of the job
                              steps will result in a more efficient utilization of resources. When a job step’s
                              wall clock limit is substantially longer than the amount of time the job step
                              actually needs, it results in two inefficiencies in the BACKFILL algorithm:
                              – The future start time of a "top dog" will be calculated to be much later due to
                                 the long wall clock limits of the running job steps, leaving a larger window
                                 for BACKFILL job steps to run. This causes the "top dog" to start later than it
                                 would have if more accurate wall clock limits had been given.
                              – A job step is less likely to be backfilled if its wall clock limit is longer because
                                 it is more likely to run past the future start time of a "top dog".
                            v You should use only the default settings for the START expression and the other
                              job control functions described in “Managing job status through control
                              expressions” on page 68. If you do not use these default settings, jobs will still
                              run but the scheduler will not be as efficient. For example, the scheduler will not
                              be able to guarantee a time at which the highest priority job will run.
                            v You should configure any multiprocessor (SMP) nodes such that the number of
                              jobs that can run on a node (determined by the MAX_STARTERS keyword) is
                              always less than or equal to the number of processors on the node.
                            v Due to the characteristics of the BACKFILL algorithm, in some cases this
                              scheduler may not honor the MACHPRIO statement. For more information on
                              MACHPRIO, see “Setting negotiator characteristics and policies” on page 45.

v When using PREEMPT_CLASS rules it is helpful to create a SYSPRIO
                      expression which is consistent with the preemption rules. This can be done by
                      using the ClassSysprio built-in variable with a multiplier, such as SYSPRIO:
                      (ClassSysprio * 10000) - QDate, as shown in the sketch following this list. If
                      classes which appear on the left-hand side of PREEMPT_CLASS rules are given
                      a higher priority than those which appear on the right, preemption won't be
                      required as often because the job steps which can preempt will be higher in the
                      queue than the job steps which can be preempted.
                    v Entering llq -s for a top-dog step displays that the step is a top dog.
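                    A hypothetical pairing of a preemption rule with a consistent SYSPRIO
                    expression (the class names Night and Day are illustrative):

                    PREEMPT_CLASS[Night] = ALL { Day }
                    SYSPRIO: (ClassSysprio * 10000) - QDate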

            Example: BACKFILL scheduling
                   On a rack with 10 nodes, 8 of the nodes are being used by Job A.

                   Job B has the highest priority in the queue, and requires 10 nodes. Job C has the
                   next highest priority in the queue, and requires only two nodes. Job B has to wait
                   for Job A to finish so that it can use the freed nodes. Because Job A is only using 8
                   of the 10 nodes, the BACKFILL scheduler can schedule Job C (which only needs
                   the two available nodes) to run as long as it finishes before Job A finishes (and Job
                   B starts). To determine whether or not Job C has time to run, the BACKFILL
                   scheduler uses Job C’s wall_clock_limit value to determine whether or not it will
                   finish before Job A ends. If Job C has a wall_clock_limit of unlimited, it may not
                   finish before Job B’s start time, and it won’t be dispatched.
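                    For instance, Job C might have been submitted with a job command file like the
                    following (the two-node size and 20-minute limit are hypothetical):

                    # @ job_type = parallel
                    # @ node = 2
                    # @ wall_clock_limit = 00:20:00
                    # @ queue

                    Because the limit is finite and short, the scheduler can verify that Job C will
                    complete before Job B's future start time.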

|   Data staging
|                  Data staging allows you to stage data needed by a job before the job begins
|                  execution and to move data back to archives when a job has finished execution. A
|                  job can use one inbound data staging step and one outbound data staging step.
|                  The inbound step will be the first to be executed and the outbound step, the last.

|                  LoadLeveler provides data staging for two scenarios:
|                  1. A single replica of the data files needed by a job have to be created on a
|                     common file system.
|                  2. A replica of the data files has to be created on every machine on which the job
|                     will run.

|                  LoadLeveler allows you to request the time at which data staging operations
|                  should be scheduled.
|                  1. A single replica must be created as soon as a job is submitted, regardless of
|                     when the job will be executed. This is the AT_SUBMIT configuration option.
|                  2. A single replica of the data files must be created as close as possible to
|                     execution time of the job. This is the JUST_IN_TIME configuration option.
|                  3. A replica must be created on each machine that the job runs on, as close as
|                     possible to execution time of the job. This is also the JUST_IN_TIME
|                     configuration option.

|                  The basic steps involved in data staging include:
|                  1. A job is submitted that contains data staging keywords.
|                  2. LoadLeveler generates inbound and outbound data staging steps in accordance
|                     with these keywords. All other steps of the job have an implicit dependency on
|                     the completion of the inbound data staging step.
|                  3. Scheduling methods:


|                              a. With the AT_SUBMIT configuration option, the data staging step is started
|                                  first and the application steps are scheduled when its data staging
|                                  dependency is satisfied (that is, when the inbound data staging step is
|                                  completed).
|                              b. With the JUST_IN_TIME configuration option, the first application step of
|                                  the job is scheduled in the future based on the wall clock time specified for
|                                  the inbound data staging step. The inbound data staging step is started on
|                                  the machines that will be used by the first application step.
|                           4. When the inbound data staging step completes, all of the application job steps
|                              become eligible for scheduling. The exit code from the inbound data staging
|                              program is made available to all application job steps in the
|                              LL_DSTG_IN_EXIT_CODE environment variable.
|                           5. When all the application job steps are completed, the outbound data staging
|                              step is started by LoadLeveler. Typically, the outbound data staging step would
|                              be used to move data files back to their archives.

|                           Note: You cannot preempt data staging steps using the llpreempt command or by
|                                 specifying the data_stage class in system preemption rules. Similarly, a step
|                                 belonging to the data_stage class cannot preempt any other job step.
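                            A sketch of the data staging portion of a job command file (the script paths
                            and time limits are hypothetical, and the dstg_in_script and dstg_out_script
                            keyword names are assumed; see the job command file keyword reference for the
                            full set of dstg_* keywords):

                            # @ dstg_in_script = /u/user1/stage_in.sh
                            # @ dstg_in_wall_clock_limit = 00:10:00
                            # @ dstg_out_script = /u/user1/stage_out.sh
                            # @ dstg_out_wall_clock_limit = 00:10:00
                            # @ executable = /u/user1/myjob
                            # @ queue

                            LoadLeveler generates the inbound and outbound data staging steps from these
                            keywords; the user never names the data_stage class directly.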

|                Configuring LoadLeveler to support data staging
|                           LoadLeveler allows you to specify the execution time for data staging job steps
|                           using the DSTG_TIME keyword. It defaults to the AT_SUBMIT value. To
|                           schedule data staging operation as close to the application as possible, the
|                           JUST_IN_TIME value can be used. DSTG_MIN_SCHEDULING_INTERVAL is a
|                           keyword used to optimize scheduler performance by allowing data staging jobs to
|                           be scheduled only at specific intervals.

|                           A special set of data staging step initiators, called DSTG_MAX_STARTERS, can be
|                           set up for data staging job steps. These initiators will be a distinct set of resources
|                           on the compute node, not included in the MAX_STARTERS set up for compute
|                           jobs. You cannot specify the built-in data_stage class in:
|                           v The CLASS keyword of a job command file
|                           v The default_class keyword in the administration file

|                           For more information about the data staging keywords, see “Configuration file
|                           keyword descriptions” on page 265.

|                           The LoadLeveler administration class stanza keywords can be used to specify
|                           defaults, limits, and restrictions for the built-in data_stage class. The data_stage
|                           class cannot be specified as the default class for a user. You cannot specify the
|                           data_stage class in your job command file. Steps of this class will be automatically
|                           generated by LoadLeveler based on the data staging keywords used in job
|                           command files.

|                           LoadLeveler provides a built-in class called data_stage that can be configured in
|                           the administration file using a class stanza, just as you would do for any other
|                           class. Some examples of how you might use a stanza for the data_stage class are:
|                           v Include and exclude users and groups from this class to control which users are
|                              permitted to use data staging.
|                           v Specifying defaults for resource limits such as cpu_limit or nofile_limit for data
|                              staging steps.


|                 v Specifying defaults and maximum allowed values for the dstg_resources job
|                   command file keyword using default_resources and max_resources.
|                 v Limiting the total number of data staging jobs or tasks in the cluster at any one
|                   time using maxjobs or max_total_tasks.

|                 For more information about the data staging keywords, see “Administration file
|                 keyword descriptions” on page 327.

|                 If an inbound data staging job step is soft-bound to a reservation and keyword
|                 dstg_node=any, it can be started ahead of the reservation start time, if data staging
|                 resources are available. In all other cases, data staging steps will run within the
|                 reservation itself.
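                  A configuration sketch combining these keywords (the values are illustrative):

                  # Configuration file
                  DSTG_TIME = JUST_IN_TIME
                  DSTG_MAX_STARTERS = 2
                  DSTG_MIN_SCHEDULING_INTERVAL = 300

                  # Administration file: control who may stage data and how much runs at once
                  data_stage: type = class
                       include_users = jack queen
                       maxjobs = 10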

    Using an external scheduler
                  The LoadLeveler API provides interfaces that allow an external scheduler to
                  manage the assignment of resources to jobs and the dispatching of those jobs.

                  The primary interfaces for the tasks of an external scheduler are:
                  v ll_query to obtain information about the LoadLeveler cluster, the machines of
                    the cluster, jobs and AIX Workload Manager.
                  v ll_get_data to obtain information about specific objects such as jobs, machines
                    and adapters.
|                 v ll_start_job_ext to start a LoadLeveler job.
|                   – The ll_start_job_ext subroutine supports both serial and parallel jobs. For
|                      parallel jobs, ll_start_job_ext provides the ability to specify which adapters
|                      are used by the communication protocols of each job task. This assures that
|                      each task uses the same network for communication over a given protocol.

                  The steps for dispatching jobs with an external scheduler are:
                  1. Gather information about the LoadLeveler cluster ( ll_query(CLUSTER) ).
                  2. Gather information about the machines in the LoadLeveler cluster (
                     ll_query(MACHINES) ).
                  3. Gather information about the jobs in the cluster ( ll_query(JOBS) ).
                  4. Determine the resources that are currently free. (See the note that follows.)
                  5. Determine which jobs to start. Assign resources to jobs to be started and
                     dispatch ( ll_start_job_ext(LL_start_job_info_ext*) ).
                  6. Repeat steps 1 through 5.

                  When an external scheduler is used, the LoadLeveler Negotiator does not keep
                  track of the resources used by jobs started by the external scheduler. There are two
                  ways that an external scheduler can keep track of the free resources available for
                  starting new jobs. The method that should be used depends on whether the
                  external scheduler runs continuously while all scheduling is occurring or is
                  executed to start a finite number of jobs and then terminates:
                  v If the external scheduler runs continuously, it should query the total resources
                     available in the LoadLeveler system with ll_query and ll_get_data. Then it can
                      keep track of the resources assigned to jobs it starts while they are running and
                     return the resources to the available pool when the jobs complete.
                  v If the external scheduler is executed to start a finite number of jobs and then
                     terminates, it must determine the pool of available resources when it first starts.
                     It can do this by first querying the total resources in the LoadLeveler system
                     using ll_query and ll_get_data. Then it would query the jobs in the system

(again using ll_query), looking for jobs that are running. For each running job, it
                               would remove the resources used by the job from the available pool. After all
                               the running jobs are processed, the available pool would indicate the amount of
                               free resource for starting new jobs.
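                   As a rough sketch of the continuously running case, one scheduling cycle might
                   look like the following. Error handling is omitted, the llapi.h header name is
                   assumed, and the LL_start_job_info_ext fields are only indicated in comments;
                   see "Assigning resources and dispatching jobs" for the complete structure.

                    #include <stdio.h>
                    #include "llapi.h"

                    /* One pass of the external scheduling loop (steps 1 through 5). */
                    void scheduling_cycle(void)
                    {
                      LL_element *query_elem, *job;
                      int job_count, rc;

                      /* Step 3: gather information about the jobs in the cluster */
                      query_elem = ll_query(JOBS);
                      ll_set_request(query_elem, QUERY_ALL, NULL, ALL_DATA);
                      job = ll_get_objs(query_elem, LL_CM, NULL, &job_count, &rc);

                      while (job != NULL)
                        {
                          /* Steps 4 and 5: decide whether the resources this job
                           * needs are free; if so, fill in an LL_start_job_info_ext
                           * structure (step ID, node list, adapter usage) and start
                           * the job with ll_start_job_ext(), then deduct the
                           * assigned resources from the available pool.
                           */
                          job = ll_next_obj(query_elem);
                        }

                      ll_free_objs(query_elem);
                      ll_deallocate(query_elem);
                    }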

                            To find out more about dispatching jobs with an external scheduler, use the
                            information in Table 24.
                            Table 24. Roadmap of tasks for using an external scheduler
                            Subtask                                    Associated instructions (see . . . )
                            Learn about the LoadLeveler functions      “Replacing the default LoadLeveler scheduling
                            that are limited or not available when     algorithm with an external scheduler”
                            you use an external scheduler
                            Prepare the LoadLeveler environment        “Customizing the configuration file to define an
                            for using an external scheduler            external scheduler” on page 118
                            Use an external scheduler to dispatch      v “Steps for getting information about the
                            jobs                                         LoadLeveler cluster, its machines, and jobs” on
                                                                         page 118
                                                                       v “Assigning resources and dispatching jobs” on
                                                                         page 122



                 Replacing the default LoadLeveler scheduling algorithm with
                 an external scheduler
                            It is important to know how LoadLeveler keywords and commands behave when
                            you replace the default LoadLeveler scheduling algorithm with an external
                            scheduler.

                            LoadLeveler scheduling keywords and commands fall into the following
                            categories:
                            v Keywords not involved in scheduling decisions are unchanged.
|                           v Keywords kept in the job object or in the machine which are used by the
|                             LoadLeveler default scheduler have their values maintained as before and
|                             passed to the data access API.
                            v Keywords used only by the LoadLeveler default scheduler have no effect.

                            Table 25 discusses specific keywords and commands and how they behave when
                            you disable the default LoadLeveler scheduling algorithm.
                            Table 25. Effect of LoadLeveler keywords under an external scheduler
                            Keyword type / name                      Notes
                            Job command file keywords
|                           class                                    This value is provided by the data access API.
|                                                                    Machines chosen by ll_start_job_ext must have the
|                                                                    class of the job available or the request will be
|                                                                    rejected.
|                           dependency                               Supported as before. Job objects for which
|                                                                    dependency cannot be evaluated (because a previous
|                                                                    step has not run) are maintained in the NotQueued
|                                                                    state, and attempts to start them using
|                                                                    ll_start_job_ext will result in an error. If the
|                                                                    dependency is met, ll_start_job_ext can start the
|                                                                    proc.


Table 25. Effect of LoadLeveler keywords under an external scheduler (continued)
    Keyword type / name                   Notes
|   hold                                  ll_start_job_ext cannot start a job that is in Hold
|                                         status.
|   preferences                           Passed to the data access API.
|   requirements                          ll_start_job_ext returns an error if the specified
|                                         machines do not match the requirements of the job.
|                                         This includes Disk and Virtual Memory
|                                         requirements.
|   startdate                             The job remains in the Deferred state until the
|                                         startdate specified in the job is reached.
|                                         ll_start_job_ext cannot start a job in the Deferred
|                                         state.
|   user_priority                         Used in calculating the system priority (as described
|                                         in “Setting and changing the priority of a job” on
|                                         page 230). The system priority assigned to the job is
|                                         available through the data access API. No other
|                                         control of the order in which jobs are run is
|                                         enforced.
    Administration file keywords
    master_node_exclusive                 Ignored
    master_node_requirement               Ignored
    max_jobs_scheduled                    Ignored
    max_reservations                      Ignored
    max_reservation_duration              Ignored
    max_total_tasks                       Ignored
    maxidle                               Supported
    maxjobs                               Ignored
    maxqueued                             Supported
    priority                              Used to calculate the system priority (where
                                          appropriate).
|   speed                                 Available through the data access API.
    Configuration file keywords
    MACHPRIO                              Calculated but is not used.
|   MAX_STARTERS                          Calculated, and if starting the job causes this value
|                                         to be exceeded, ll_start_job_ext returns an error.
|   SYSPRIO                               Calculated and available to the data access API.
    NEGOTIATOR_PARALLEL_DEFER             Ignored
    NEGOTIATOR_PARALLEL_HOLD              Ignored
    NEGOTIATOR_RESCAN_QUEUE               Ignored
    NEGOTIATOR_RECALCULATE_               Works as before. Set this value to 0 if you do not
    SYSPRIO_INTERVAL                      want the system priorities of job objects recalculated.




Customizing the configuration file to define an external
                 scheduler
|                           To use an external scheduler, one of the tasks you must perform is setting the
|                           configuration file keyword SCHEDULER_TYPE to the value API.

                            This keyword option provides a time-based (rather than an event-based) interface.
                            That is, your application must use the data access API to poll LoadLeveler at
                            specific times for machine and job information.

                            When you enable a scheduler type of API, you must specify
                            AGGREGATE_ADAPTERS=NO to make the individual switch adapters available
                            to the external scheduler. This means the external scheduler receives each
                            individual adapter connected to the network, instead of collectively grouping them
                            together. You’ll see each adapter listed individually in the llstatus -l command
                            output. When this keyword is set to YES, the llstatus -l command will show an
                            aggregate adapter which contains information on all switch adapters on the same
                            network. For detailed information about individual switch adapters, issue the
                            llstatus -a command.

                            You also may use the PREEMPTION_SUPPORT keyword, which specifies the
                            level of preemption support for a cluster. Preemption allows for a running job step
                            to be suspended so that another job step can run.
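                            A minimal configuration sketch for running under an external scheduler (the
                            PREEMPTION_SUPPORT value shown is given only as an illustration; see the
                            keyword reference for the allowed settings):

                            SCHEDULER_TYPE = API
                            AGGREGATE_ADAPTERS = NO
                            PREEMPTION_SUPPORT = full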

                 Steps for getting information about the LoadLeveler cluster,
                 its machines, and jobs
                            There are steps to retrieve and use information about the LoadLeveler cluster,
                            machines, jobs and AIX Workload Manager.

                            Perform the following steps to retrieve and use information about the LoadLeveler
                            cluster, machines, jobs and AIX Workload Manager:
                            1. Create a query object for the kind of information you want.
                                Example: To query machine information, code the following instruction:
                                LL_element * query_element = ll_query(MACHINES);
                            2. Customize the query to filter the specific information you want. You can filter
                               the list of objects for which you want information. For some queries, you can
                               also filter how much information you want.
                               Example: The following lines customize the query for just hosts
                               node01.ibm.com and node02.ibm.com and to return the information contained
                               in the llstatus -f command:
                                char * hostlist[] = { "node01.ibm.com","node02.ibm.com",NULL };
                                ll_set_request(query_element,QUERY_HOST,hostlist,STATUS_LINE);
                            3. Once the query has been customized:
                               a. Submit it using ll_get_objs, which returns the first object that matches the
                                  query.
                                b. Interrogate the returned object using the ll_get_data subroutine to retrieve
                                   specific attributes. Depending on the information being queried for, the
                                   query may be directed to a specific node and a specific daemon on that
                                   node.
                                Example: A JOBS query for all data may be directed to the negotiator, Schedd
                                or the history file. If it is directed to the Schedd, you must specify the host of



the Schedd you are interested in. The following demonstrates retrieving the
   name of the first machine returned by the query constructed previously:
    int machine_count;
    int rc;
     LL_element * element = ll_get_objs(query_element,LL_CM,NULL,&machine_count,&rc);
    char * mname;
    ll_get_data(element,LL_MachineName,&mname);

   Because there is only one negotiator in a LoadLeveler cluster, the host does not
   have to be specified. The third parameter is the address of an integer that will
   receive the count of objects returned and the fourth parameter is the address of
   an integer that will receive the completion code of the call. If the call fails,
   NULL is returned and the location pointed to by the fourth parameter is set to
   a reason code. If the call succeeds, the value returned is used as the first
   parameter to a call to ll_get_data. The second parameter to ll_get_data is a
   specification that indicates what attribute of the object is being interrogated.
   The third parameter to ll_get_data is the address of the location into which to
   store the result. ll_get_data returns zero if it is successful and nonzero if an
   error occurs. It is important that the specification (the second parameter to
   ll_get_data) be valid for the object passed in (the first parameter) and that the
   address passed in as the third parameter point to the correct type for the
   specification. Undefined, potentially dangerous behavior will occur if either of
   these conditions is not met.

Example: Retrieving specific information about machines
The following example demonstrates printing out the name and adapter list of all
machines in the LoadLeveler cluster.

The example could be extended to retrieve all of the information available about
the machines in the cluster such as memory, disk space, pool list, features,
supported classes, and architecture, among other things. A similar process would
be used to retrieve information about the cluster overall.
 int i, w, rc;
 int machine_count;
 LL_element * query_elem;
 LL_element * machine;
 LL_element * adapter;
 char * machine_name;
 char * adapter_name;
 int * window_list;
 int window_count;

 /* First we need to obtain a query element which is used to pass     */
 /* parameters in to the machine query                                */
 if ((query_elem = ll_query(MACHINES)) == NULL)
   {
     fprintf(stderr,"Unable to obtain query element\n");
     /* without the query object we will not be able to do anything */
     exit(-1);
   }

 /* Get information relating to machines in the LoadLeveler cluster. */

 /* QUERY_ALL: we are querying all machines                                   */
 /* NULL: since we are querying all machines we do not need to                */
 /*       specify a filter to indicate which machines                         */
 /* ALL_DATA: we want all the information available about the machine         */
 rc=ll_set_request(query_elem,QUERY_ALL,NULL,ALL_DATA);
 if(rc<0)
   {
     /* A real application would map the return code to a message */
     printf("ll_set_request() returned %d\n",rc);
     /* Without customizing the query we cannot proceed */
     exit(rc);
   }

 /* If successful, ll_get_objs() returns the first object that       */
 /* satisfies the criteria that are set in the query element and     */
 /* the parameters. In this case those criteria are:                 */
 /* A machine (from the type of query object)                        */
 /* LL_CM: that the negotiator knows about                           */
 /* NULL: since there is only one negotiator we don't have to        */
 /*       specify which host it is on                                */
 /* The number of machines is returned in machine_count and the      */
 /* return code is returned in rc                                    */
 machine = ll_get_objs(query_elem,LL_CM,NULL,&machine_count,&rc);
 if(rc<0)
   {
     /* A real application would map the return code to a message      */
     printf("ll_get_objs() returned %d\n",rc);

     /* query was not successful -- we cannot proceed but we need to */
     /* release the query element                                    */
     if(ll_deallocate(query_elem) == -1)
       {
         fprintf(stderr,"Attempt to deallocate invalid query element\n");
       }
     exit(rc);
   }

 printf("Number of Machines = %d\n",machine_count);
 i = 0;
 while(machine!=NULL)
   {
     printf("------------------------------------------------------\n");
     printf("Machine %d:\n",i);

     rc = ll_get_data(machine,LL_MachineName,&machine_name);
     if(0==rc)
       {
         printf("Machine name = %s\n",machine_name);
       }
     else
       {
         printf("Error %d occurred retrieving the machine name\n",rc);
       }

     printf("Adapters\n");
     ll_get_data(machine,LL_MachineGetFirstAdapter,&adapter);
     while(adapter != NULL)
       {
         rc = ll_get_data(adapter,LL_AdapterName,&adapter_name);
         if(0!=rc)
           {
             printf("Error %d occurred retrieving the adapter name\n",rc);
           }
         else
           {
             /* Because the list of windows on an adapter is returned */
             /* as an array of integers, we also need to know how big */
             /* the list is. First we query the window count,         */
             /* storing the result in an integer, then we query for   */
             /* the list itself, storing the result in a pointer to   */
             /* an integer. The window list is allocated for us so    */
             /* we need to free it when we are done                   */

             printf("%s windows: ",adapter_name);
             ll_get_data(adapter,LL_AdapterTotalWindowCount,&window_count);
             ll_get_data(adapter,LL_AdapterWindowList,&window_list);
             for (w = 0;w<window_count;w++)
               {
                 printf("%d ",window_list[w]);
               }
             printf("\n");
             free(window_list);
           }
         /* After the first object has been gotten, GetNext returns   */
         /* the next until the list is exhausted                      */
         ll_get_data(machine,LL_MachineGetNextAdapter,&adapter);
       }

     printf("\n");
     i++;
     machine = ll_next_obj(query_elem);
   }

 /* First we need to release the individual objects that were                    */
 /* obtained by the query                                                        */
 if(ll_free_objs(query_elem) == -1)
   {
     fprintf(stderr,"Attempt to free invalid query element\n");
   }

 /* Then we need to release the query itself                          */
 if(ll_deallocate(query_elem) == -1)
   {
     fprintf(stderr,"Attempt to deallocate invalid query element\n");
   }

Example: Retrieving information about jobs
The following example may apply to your situation.

The following example demonstrates retrieving information about jobs up to the
point of starting a job:
 int i, rc;
 int job_count;
 LL_element * query_elem;
 LL_element * job;
 LL_element * step;
 int step_state;

 /* First we need to obtain a query element which is used to pass     */
 /* parameters in to the jobs query                                   */
 if ((query_elem = ll_query(JOBS)) == NULL)
   {
     fprintf(stderr, "Unable to obtain query element\n");
     /* without the query object we will not be able to do anything */
     exit(-1);
   }

 /* Get information relating to Jobs in the LoadLeveler cluster.      */
 printf("Jobs Information ========================================\n\n");
 /* QUERY_ALL: we are querying all jobs                               */
 /* NULL: since we are querying all jobs we do not need to            */
 /*       specify a filter to indicate which jobs                     */
 /* ALL_DATA: we want all the information available about the job     */
 rc = ll_set_request(query_elem, QUERY_ALL, NULL, ALL_DATA);
 if (rc < 0)
   {
     /* A real application would map the return code to a message */
     printf("ll_set_request() returned %d\n", rc);
     /* Without customizing the query we cannot proceed */
     exit(rc);
   }

 /* If successful, ll_get_objs() returns the first object that        */
 /* satisfies the criteria that are set in the query element and      */
 /* the parameters. In this case those criteria are:                  */
 /* A job (from the type of query object)                             */
 /* LL_CM: that the negotiator knows about                            */
 /* NULL: since there is only one negotiator we don't have to         */
 /*       specify which host it is on                                 */
 /* The number of jobs is returned in job_count and the               */
 /* return code is returned in rc                                     */
 job = ll_get_objs(query_elem, LL_CM, NULL, &job_count, &rc);
 if (rc < 0)
   {
     /* A real application would map the return code to a message */
     printf("ll_get_objs() returned %d\n", rc);

     /* query was not successful -- we cannot proceed but we need to */
     /* release the query element                                    */
     if (ll_deallocate(query_elem) == -1)
       {
         fprintf(stderr, "Attempt to deallocate invalid query element\n");
       }
     exit(rc);
   }

 printf("Number of Jobs = %d\n", job_count);
 step = NULL;
 while (job != NULL)
   {
     /* Each job is composed of one or more steps which are started  */
     /* individually. We need to check the state of the job's steps  */
     ll_get_data(job, LL_JobGetFirstStep, &step);
     while (step != NULL)
       {
         ll_get_data(step, LL_StepState, &step_state);
         /* We are looking for steps that are in idle state. The     */
         /* state is returned as an int so we cast it to             */
         /* enum StepState as declared in llapi.h                    */
         if ((enum StepState)step_state == STATE_IDLE)
           break;
         /* Otherwise, advance to the next step of this job          */
         ll_get_data(job, LL_JobGetNextStep, &step);
       }
     /* If we exit the loop with a valid step, it is the one to start */
     /* otherwise we need to keep looking                             */
     if (step != NULL)
       break;

     job = ll_next_obj(query_elem);
   }

 if (step == NULL)
   {
     printf("No step to start\n");
     exit(0);
   }

                 Assigning resources and dispatching jobs
|                           After an external scheduler selects a job step to start and identifies the machines
|                           that the job step will run on, the LoadLeveler job start API is used to tell
|                           LoadLeveler the job step to start and the resources that are to be assigned to the
|                           job step.

In “Example: Retrieving information about jobs” on page 121, we reached the point
where a step to start was identified. In a real external scheduler, the decision
would be reached after considering all of the idle jobs and constructing a priority
value based on attributes such as class and submit time, all of which are accessible
through ll_get_data. Next, the list of available machines would be examined to
determine whether a set exists with sufficient resources to run the job. This process
also involves determining the size of that set of machines using attributes of the
step such as the number of nodes, instances of each node, and tasks per node. The
LoadLeveler data query API provides access to that information about each job, but
the interface for starting the job does not require that the machine and adapter
resources match the specifications given when the job was submitted. For example, a
job could be submitted specifying node=4 but could be started by an external
scheduler on a single node only. Similarly, the job could specify the LAPI protocol
with network.lapi=... but be started and told to use the MPI protocol. This is not
considered an error because it is up to the scheduler to interpret (and enforce, if
necessary) the specifications in the job command file.

In allocating adapter resources for a step, it is important that the order of the
adapter usages be consistent with the structure of the step. In some environments a
task can use multiple instances of adapter windows for a protocol. If the protocol
requests striping (sn_all), an adapter window (or set of windows if instances are
used) is allocated on each available network. If multiple protocols are used by the
task (for example, MPI and LAPI), each protocol defines its own set of windows.
The array of adapter usages passed in to ll_start_job_ext must group the windows
for all of the instances on one network for the same protocol together. If the
protocol requests striping, that grouping must be immediately followed by the
grouping for the next network. If the task uses multiple protocols, the set of
adapter usages for the first protocol must be immediately followed by the set for
the next protocol. Each task will have exactly the same pattern of adapter usage
entries. Corresponding entries across all the tasks represent a communication path
and must be able to communicate with each other. If the usages are for User Space
communication, a network table will be loaded for each set of corresponding
entries.

All of the job command file keywords for specifying job structure, such as
total_tasks, tasks_per_node, node=min,max and blocking, are supported by the
ll_start_job_ext interface, but users should make sure that they understand the
LoadLeveler model that is created for each combination when constructing the
adapter usage list for ll_start_job_ext. Jobs that are submitted with node=number
and tasks_per_node produce more regular LoadLeveler models, whose adapter usage
lists are easier to create; one such job command file fragment is sketched below.
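
For instance, a job command file fragment such as the following sketch (the values
match the two-node, three-tasks-per-node layout used in the comments of the example
below) produces one of these regular models:

   # @ node = 2
   # @ tasks_per_node = 3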

In the following example, it is assumed that the step found to be dispatched will
run on one machine with two tasks, each task using one switch adapter window
for MPI communication. The name of the machine to run on is contained in the
variable use_machine (char *), the names of the switch adapters are contained in
use_adapter_1 (char *) and use_adapter_2 (char *), and the adapter windows on
those adapters in use_window_1 (int) and use_window_2 (int), respectively.
Furthermore, each adapter will be allocated 1M of memory.

If the network adapters that the external scheduler assigns to the job allocate
communication buffers in rCxt blocks instead of bytes (the Switch Network
Interface for HPS is an example of such a network adapter), the api_rcxtblocks
field of adapterUsage should be used to specify the number of rCxt blocks to
assign instead of the mem field.
 LL_start_job_info_ext *start_info;
 char * pChar;
 LL_element * step;
 LL_element * job;
 int rc;
 char * submit_host;
 char * step_id;

 start_info = (LL_start_job_info_ext *)(malloc(sizeof(LL_start_job_info_ext)));
 if (start_info == NULL)
   {
     fprintf(stderr, "Out of memory.\n");
     return;
   }

 /* Create a NULL terminated list of target machines. Each task        */
 /* must have an entry in this list and the entries for tasks on the   */
 /* same machine must be sequential. For example, if a job is to run   */
 /* on two machines, A and B, and three tasks are to run on each       */
 /* machine, the list would be: AAABBB                                 */
 /* Any specifications on the job when it was submitted such as        */
 /* nodes, total_tasks or tasks_per_node must be explicitly queried    */
 /* and honored by the external scheduler in order to take effect.     */
 /* They are not automatically enforced by LoadLeveler when an         */
 /* external scheduler is used.                                        */
 /*                                                                    */
 /* In this example, the job will be run on one machine with two       */
 /* tasks, so the machine list consists of two entries for that        */
 /* machine (plus the terminating NULL entry)                          */
 start_info->nodeList = (char **)malloc(3*sizeof(char *));
 if (!start_info->nodeList)
   {
     fprintf(stderr, "Out of memory.\n");
     return;
   }

 start_info->nodeList[0] = strdup(use_machine);
 start_info->nodeList[1] = strdup(use_machine);
 start_info->nodeList[2] = NULL;

 /* Retrieve information from the job to populate the start_info      */
 /* structure.                                                        */
 /* In the interest of brevity, the success of the ll_get_data()      */
 /* calls is not tested. In a real application it should be           */

 /* The version number is set from the header that is included when   */
 /* the application using the API is compiled. This allows for        */
 /* checking that the application was compiled with a version of the  */
 /* API that is compatible with the version in the library when the   */
 /* application is run.                                               */
 start_info->version_num = LL_PROC_VERSION;

 /* Get the first step of the job to start                            */
 ll_get_data(job, LL_JobGetFirstStep, &step);
 if (step == NULL)
   {
     printf("No step to start\n");
     return;
   }

 /* In order to set the submitting host, cluster number and proc      */
 /* number in the start_info structure, we need to parse them out of  */
 /* the step id                                                       */

 /* First get the submitting host and save it                         */
 ll_get_data(job, LL_JobSubmitHost, &submit_host);
 start_info->StepId.from_host = strdup(submit_host);
 free(submit_host);

 rc = ll_get_data(step, LL_StepID, &step_id);

 /* The step id format is submit_host.jobno.stepno . Because the      */
 /* submit host is a dotted string of indeterminate length, the       */
 /* simplest way to detect where the job number starts is to retrieve */
 /* the submit host from the job and skip forward its length in the   */
 /* step id.                                                          */

 pChar = step_id + strlen(start_info->StepId.from_host) + 1;
 /* The next segment is the cluster or job number                     */
 pChar = strtok(pChar, ".");
 start_info->StepId.cluster = atoi(pChar);
 /* The last token is the proc or step number                         */
 pChar = strtok(NULL, ".");
 start_info->StepId.proc = atoi(pChar);
 free(step_id);

 /* For each protocol (e.g., MPI or LAPI) on each task, we need to     */
 /* specify which adapter to use, and whether a window is being used   */
 /* (subsystem = "US") or not (subsystem = "IP"). If a window is used, */
 /* the window ID and window buffer size must be specified.            */
 /*                                                                    */
 /* The adapter usage entries for the protocols of a task must be      */
 /* sequential and the set of entries for tasks on the same node must  */
 /* be sequential. For example, the twelve entries for a job where     */
 /* each task uses one window for MPI and one for LAPI with three      */
 /* tasks per node and running on two nodes would be laid out as:      */
 /* 1: MPI window for 1st task running on 1st node                     */
 /* 2: LAPI window for 1st task running on 1st node                    */
 /* 3: MPI window for 2nd task running on 1st node                     */
 /* 4: LAPI window for 2nd task running on 1st node                    */
 /* 5: MPI window for 3rd task running on 1st node                     */
 /* 6: LAPI window for 3rd task running on 1st node                    */
 /* 7: MPI window for 1st task running on 2nd node                     */
 /* 8: LAPI window for 1st task running on 2nd node                    */
 /* 9: MPI window for 2nd task running on 2nd node                     */
 /* 10: LAPI window for 2nd task running on 2nd node                   */
 /* 11: MPI window for 3rd task running on 2nd node                    */
 /* 12: LAPI window for 3rd task running on 2nd node                   */
 /* An improperly ordered adapter usage list may cause the job not to  */
 /* be started or, if started, incorrect execution of the job          */
 /*                                                                    */
 /* This example starts the job with two tasks on one machine, using   */
 /* one switch adapter window for each task. The protocol is forced    */
 /* to MPI and a fixed window size of 1M is used. An actual external   */
 /* scheduler application would check the step's requirements and its  */
 /* adapter requirements with ll_get_data                              */
 /*                                                                    */
 start_info->adapterUsageCount = 2;
 start_info->adapterUsage =
   (LL_ADAPTER_USAGE *)malloc((start_info->adapterUsageCount)
                              * sizeof(LL_ADAPTER_USAGE));

 start_info->adapterUsage[0].dev_name = use_adapter_1;
 start_info->adapterUsage[0].protocol = "MPI";
 start_info->adapterUsage[0].subsystem = "US";
 start_info->adapterUsage[0].wid = use_window_1;
 start_info->adapterUsage[0].mem = 1048576;   /* 1M window buffer */

 start_info->adapterUsage[1].dev_name = use_adapter_2;
 start_info->adapterUsage[1].protocol = "MPI";
 start_info->adapterUsage[1].subsystem = "US";
 start_info->adapterUsage[1].wid = use_window_2;
 start_info->adapterUsage[1].mem = 1048576;   /* 1M window buffer */

 if ((rc = ll_start_job_ext(start_info)) != API_OK)
   {
     printf("Error %d returned attempting to start Job Step %s.%d.%d on %s\n",
            rc,
            start_info->StepId.from_host,
            start_info->StepId.cluster,
            start_info->StepId.proc,
            start_info->nodeList[0]);
   }
 else
   {
     printf("ll_start_job_ext() invoked to start job step: "
            "%s.%d.%d on machine: %s.\n\n",
            start_info->StepId.from_host, start_info->StepId.cluster,
            start_info->StepId.proc, start_info->nodeList[0]);
   }
 free(start_info->nodeList[0]);
 free(start_info->nodeList[1]);
 free(start_info->nodeList);
 free(start_info);

                        Finally, when the step and job element are no longer in use, ll_free_objs() and
                        ll_deallocate() should be called on the query element.
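
                        As noted earlier, if the adapters assigned to the job allocate communication
                        buffers in rCxt blocks rather than bytes, the same assignments would set the
                        api_rcxtblocks field of each adapter usage entry instead of the mem field. A
                        minimal sketch, assuming an illustrative count of four rCxt blocks per window:

                           start_info->adapterUsage[0].api_rcxtblocks = 4; /* instead of .mem */
                           start_info->adapterUsage[1].api_rcxtblocks = 4;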

Example: Changing scheduler types
                        You can toggle between the default LoadLeveler scheduler and other types of
                        schedulers by using the SCHEDULER_TYPE keyword.

                        Changes to SCHEDULER_TYPE will not take effect at reconfiguration. The
                        administrator must stop and restart or recycle LoadLeveler when changing
                        SCHEDULER_TYPE. A combination of changes to SCHEDULER_TYPE and some
                        other keywords may terminate LoadLeveler.

                        The following example illustrates how you can toggle between the default
                        LoadLeveler scheduler and an external scheduler, such as the Extensible Argonne
                        Scheduling sYstem (EASY), developed by Argonne National Laboratory and
                        available as public domain code.

                        If you are running the default LoadLeveler scheduler, perform the following steps
                        to switch to an external scheduler:
                        1. In the configuration file, set SCHEDULER_TYPE = API
                        2. On the central manager machine:
                            v Issue llctl -g stop and llctl -g start, or
                            v Issue llctl -g recycle
                        If you are running an external scheduler, this is how you can re-enable the
                        LoadLeveler scheduling algorithm:
                        1. In the configuration file, set SCHEDULER_TYPE = LL_DEFAULT
                        2. On the central manager machine:
                            v Issue llctl -g stop and llctl -g start, or
                            v Issue llctl -g recycle
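
                        For example, each switch amounts to changing one keyword and recycling
                        LoadLeveler; a minimal sketch (the comment lines are illustrative):

                           # Global configuration file -- hand scheduling over to an external scheduler:
                           SCHEDULER_TYPE = API

                           # ... or return control to the default LoadLeveler scheduler:
                           SCHEDULER_TYPE = LL_DEFAULT

                        After either change, issue llctl -g recycle on the central manager machine.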

Preempting and resuming jobs
                        The BACKFILL scheduler allows LoadLeveler jobs to be preempted so that a
                        higher priority job step can run.

                        Administrators may specify not only preemption rules for job classes, but also the
                        method that LoadLeveler uses to preempt jobs. The BACKFILL scheduler supports
                        various methods of preemption.




Use Table 26 to find more information about preemption.
      Table 26. Roadmap of tasks for using preemption
      Subtask                        Associated instructions (see . . . )
      Learn about types of           “Overview of preemption”
      preemption and what it
      means for preempted jobs
      Prepare the LoadLeveler        “Planning to preempt jobs” on page 128
      environment and jobs for
      preemption
      Configure LoadLeveler to use   “Steps for configuring a scheduler to preempt jobs” on page
      preemption                     130



Overview of preemption
      LoadLeveler supports two types of preemption.

       The two types of preemption that LoadLeveler supports are:
      v System-initiated preemption
        – Automatically enforced by LoadLeveler, except for job steps running under a
           reservation.
        – Governed by the PREEMPT_CLASS rules defined in the global configuration
           file.
        – When resources required by an incoming job are in use by other job steps, all
           or some of those job steps in certain classes may be preempted according to
           the PREEMPT_CLASS rules.
        – An automatically preempted job step will be resumed by LoadLeveler when
           resources become available and conditions such as START_CLASS rules are
           satisfied.
         – An automatically preempted job step cannot be resumed using the
            llpreempt command or the ll_preempt subroutine.
       v User-initiated preemption
         – Manually initiated by LoadLeveler administrators using the llpreempt
            command or the ll_preempt subroutine.
         – A manually preempted job step cannot be resumed automatically by
            LoadLeveler.
         – A manually preempted job step can be resumed using the llpreempt
            command or the ll_preempt subroutine. Issuing this command or
            subroutine, however, does not guarantee that the job step will be
            resumed successfully. A manually preempted job step that is resumed
            through these interfaces competes for resources with system-preempted
            job steps, and will be resumed only when resources become available.
         – All steps in a set of coscheduled job steps will be preempted if one
            or more steps in the set are preempted.
         – A coscheduled step will not be resumed until all steps in the set of
            coscheduled job steps can be resumed.

      For the BACKFILL scheduler only, administrators may select which method
      LoadLeveler uses to preempt and resume jobs. The suspend method is the default
      behavior, and is the preemption method LoadLeveler uses for any external
      schedulers that support preemption. For more information about preemption
      methods, see “Planning to preempt jobs” on page 128.




For a preempted job to be resumed after system- or user-initiated preemption
                        occurs through a method other than suspend, the restart keyword in the job
                        command file must be set to yes. Otherwise, LoadLeveler vacates the job step and
                        removes it from the cluster.
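
                        In job command file terms, that requirement is a single keyword; a minimal
                        sketch:

                           # @ restart = yes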

                        In order to determine the preempt type and preempt method to use when a
                        coscheduled step preempts another step, an order of precedence for preempt types
                        and preempt methods has been defined. All steps in the preempting coscheduled
                        step will be examined and the preempt type and preempt method having the
                        highest precedence will be used. The order of precedence for preempt type will be
                        ALL, ENOUGH. The precedence order for preempt method will be remove, vacate,
                        system hold, user hold, suspend.

                        When coscheduled steps are running, if one step is preempted as a result of a
                        system-initiated preemption, then all coscheduled steps will be preempted. This
                        means that more resources than necessary might be preempted when one of the
                        steps being preempted is a coscheduled step.

             Planning to preempt jobs
                        There are points to consider when planning to use preemption.

                        Consider the following points when planning to use preemption:
                        v Avoiding circular preemption under the BACKFILL scheduler
                          BACKFILL scheduling enables job preemption using rules specified with the
                          PREEMPT_CLASS keyword. When you are setting up the preemption rules,
                          make sure that you do not create a circular preemption path. Circular
                          preemption causes a job class to preempt itself after applying the preemption
                          rules recursively. For example, the following keyword definitions set up circular
                          preemption rules on Class_A:
                           PREEMPT_CLASS[Class_A] = ALL { Class_B }
                           PREEMPT_CLASS[Class_B] = ALL { Class_C }
                           PREEMPT_CLASS[Class_C] = ENOUGH { Class_A }
                           Another example of circular preemption involves allclasses:
                           PREEMPT_CLASS[Class_A] = ENOUGH {allclasses}
                           PREEMPT_CLASS[Class_B] = ALL {Class_A}

                          In this instance, allclasses means all classes except Class_A, so any
                          additional preemption rule that preempts Class_A (such as the Class_B rule
                          shown here) causes circular preemption.
                        v Understanding implied START_CLASS values
                          Using the "ALL" value in the PREEMPT_CLASS keyword places implied
                          restrictions on when a job can start. For example,
                           PREEMPT_CLASS[Class_A] = ALL {Class_B Class_C}

                           tells LoadLeveler two things:
                           1. If a new Class_A job is about to run on a node set, then preempt all Class_B
                               and Class_C jobs on those nodes
                           2. If a Class_A job is running on a node set, then do not start any Class_B or
                               Class_C jobs on those nodes
                           This PREEMPT_CLASS statement also implies the following START_CLASS
                           expressions:
                           1. START_CLASS[Class_B] = (Class_A < 1)
                           2. START_CLASS[Class_C] = (Class_A < 1)



LoadLeveler adds the implied START_CLASS expressions to any START_CLASS
  expressions specified in the configuration file; where the two conflict, the
  implied expressions take precedence over the user-specified values.
  For example, if the configuration file contains the following statements:
  PREEMPT_CLASS[Class_A] = ALL {Class_B Class_C}
  START_CLASS[Class_B] = (Class_A < 5)
  START_CLASS[Class_C] = (Class_C < 3)

  When LoadLeveler runs through the configuration process, the
  PREEMPT_CLASS statement on the first line generates the two implied
  START_CLASS statements. When those implied statements are added in, the
  user-specified START_CLASS statements are overridden and the resulting
  START_CLASS statements are effectively equivalent to:
  START_CLASS[Class_B] = (Class_A < 1)
  START_CLASS[Class_C] = (Class_C < 3) && (Class_A < 1)

  Note: LoadLeveler’s central manager (CM) uses these effective expressions
         instead of the original statements specified in the configuration file.
         The output from llclass -l, however, displays the original user-specified
         START_CLASS expressions.
v Selecting the preemption method under the BACKFILL scheduler
  Use Table 27 and Table 28 on page 130 to determine which preemption method
  you want to use for jobs running under the BACKFILL scheduler. You may define
  one or more of the following:
  – A default preemption method to be used for all job classes, by setting the
     DEFAULT_PREEMPT_METHOD keyword in the configuration file.
  – A specific preemption method for one or more classes or job steps, by using
     an option on:
     - The PREEMPT_CLASS statement in the configuration file.
     - The llpreempt command, ll_preempt subroutine or ll_preempt_jobs
       subroutine.

  Note:
          1. Process tracking must be enabled in order to use the suspend method
             to preempt a job. To configure LoadLeveler for process tracking, see
             “Tracking job processes” on page 70.
          2. For a preempted job to be resumed after system- or user-initiated
             preemption occurs through a method other than suspend and remove,
             the restart keyword in the job command file must be set to yes.
             Otherwise, LoadLeveler vacates the job step and removes it from the
             cluster.
Table 27. Preemption methods for which LoadLeveler automatically resumes preempted jobs

Preemption method    LoadLeveler resumes the preempted job:
(abbreviation)       At this time                At this location         At this processing point
Suspend (su)         When the preempting job     On the same nodes        At the point of suspension
                     completes
Vacate (vc)          When nodes are available    Any nodes that meet      At the beginning or at the
                                                 job requirements         last successful checkpoint




Table 28. Preemption methods for which administrator or user intervention is required

For each of these methods, LoadLeveler resumes the preempted job on any nodes that
meet the job requirements, when those nodes become available, at the beginning or
at the last successful checkpoint.

Preemption method (abbreviation)   Required intervention
Remove (rm)                        Administrator or user must resubmit the preempted job
System Hold (sh)                   Administrator must release the preempted job
User Hold (uh)                     User must release the preempted job

                            v Understanding how LoadLeveler treats resources held by jobs to be
                              preempted
                              When a job step is running, it may be holding the following resources:
                              – Processors
                              – Scheduling slots
                              – Real memory
|                             – ConsumableCpus, ConsumableMemory, ConsumableVirtualMemory, and
|                                 ConsumableLargePageMemory
                              – Communication switches, if the PREEMPTION_TYPE keyword is set to FULL
                                  in the configuration file.
                              When LoadLeveler suspends preemptable jobs running under the BACKFILL
                              scheduler, certain resources held by those jobs do not become available for the
|                             preempting jobs. These resources include ConsumableVirtualMemory,
|                             ConsumableLargePageMemory, and floating resources. Under the BACKFILL
                              scheduler only, LoadLeveler releases these resources when you select a
                              preemption method other than suspend. For all preemption methods other than
                              suspend, LoadLeveler treats all job-step resources as available when it preempts
                              the job step.
                            v Understanding how LoadLeveler processes multiple entries for the same
                              keywords
                              If there are multiple entries for the same keyword in either a configuration file
                              or an administration file, the last entry wins. For example, the following
                              statements are all valid specifications for the same keyword START_CLASS:
                               START_CLASS [Class_B] = (Class_A < 1)
                               START_CLASS [Class_B] = (Class_B < 1)
                               START_CLASS [Class_B] = (Class_C < 1)

                               However, all three statements identify Class_B as the incoming class.
                               LoadLeveler resolves these statements according to the "last one wins"
                               rule. Because of that, the actual value used for the keyword is
                               (Class_C < 1).

                 Steps for configuring a scheduler to preempt jobs
                            You need to know certain details about the job characteristics and workload at
                            your installation before you begin to define rules for starting and preempting jobs.

                            Before you begin:
                            v To define rules for starting and preempting jobs, you need to know certain
                              details about the job characteristics and workload at your installation, including:
                              – Which jobs require the same resources, or must be run on the same machines,
                                 and so on. This knowledge allows you to group specific jobs into a class.
                              – Which jobs or classes have higher priority than others. This knowledge allows
                                 you to define which job classes can preempt other classes.

v To correctly configure LoadLeveler to preempt jobs, you might need to refer to
                    the following information:
                    – “Choosing a scheduler” on page 44.
                    – “Planning to preempt jobs” on page 128.
                    – Chapter 12, “Configuration file reference,” on page 263.
                    – Chapter 13, “Administration file reference,” on page 321.
                    – “llctl - Control LoadLeveler daemons” on page 439.

                  Perform the following steps to configure a scheduler to preempt jobs:
                  1. In the configuration file, use the SCHEDULER_TYPE keyword to define the
                     type of LoadLeveler or external scheduler you want to use. Of the LoadLeveler
                     schedulers, only the BACKFILL scheduler supports preemption.
                     Rule: If you select the BACKFILL or API scheduler, you must set the
                     PREEMPTION_SUPPORT configuration keyword to either full or no_adapter.
                  2. (Optional) In the configuration file, use the DEFAULT_PREEMPT_METHOD
                     to define the default method that the BACKFILL scheduler should use for
                     preempting jobs.
|                    Alternative: You also may set the preemption method through the
|                    PREEMPT_CLASS keyword or on the LoadLeveler preemption command or
|                    APIs, which override the setting for the DEFAULT_PREEMPT_METHOD
|                    keyword.
                  3. For either the BACKFILL or API scheduler, preempting jobs by the suspend
                     method requires that you set the PROCESS_TRACKING configuration
                     keyword to true.
                  4. In the configuration file, use the PREEMPT_CLASS and START_CLASS
                     keywords to define the preemption and start policies for job classes.
                  5. In the administration file, use the max_total_tasks keyword to define the
                     maximum number of tasks that may be run per user, group, or class.
                  6. On the central manager machine:
                     v Issue llctl -g stop and llctl -g start, or
                     v Issue llctl -g recycle

                  When you are done with this procedure, you can use the llq command to
                  determine whether jobs are being preempted and resumed correctly. If not, use the
                  LoadLeveler logs to trace the actions of each daemon involved in preemption to
                  determine the problem.
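
                  As a concrete illustration of steps 1 through 4, the relevant entries in the
                  global configuration file might look like the following sketch (the class
                  names High and Low are hypothetical, and the method shown is one choice
                  among those listed in Table 27 and Table 28):

                     SCHEDULER_TYPE         = BACKFILL
                     PREEMPTION_SUPPORT     = full
                     # Process tracking is required only if the suspend method is used:
                     PROCESS_TRACKING       = true
                     # Preempt by the vacate method unless a rule says otherwise:
                     DEFAULT_PREEMPT_METHOD = vc
                     PREEMPT_CLASS[High]    = ALL { Low }
                     START_CLASS[Low]       = (High < 1)

                  After saving the changes, recycle LoadLeveler as described in step 6.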

    Configuring LoadLeveler to support reservations
|                 Under the BACKFILL scheduler only, LoadLeveler allows authorized users to make
|                 reservations or recurring reservations, which specify one or more time periods
|                 during which specific node resources are reserved for use by particular users or
|                 groups.

                  Normally, jobs wait to be dispatched until the resources they require become
                  available. Through the use of reservations, wait time can be reduced because only
|                 jobs that are bound to the reservation may use the node resources as soon as the
                  reservation period begins.




Reservation tasks for administrators

                            Use Table 29 to find additional information about reservations.
                            Table 29. Roadmap of reservation tasks for administrators
                            Subtask                                      Associated instructions (see . . . )
                            Learn how reservations work in the           v “Overview of reservations” on page 25
                            LoadLeveler environment
                                                                         v “Understanding the reservation life cycle”
                                                                           on page 214
                            Configuring a LoadLeveler cluster to         v “Steps for configuring reservations in a
                            support reservations                           LoadLeveler cluster”
                                                                         v “Examples: Reservation keyword
                                                                           combinations in the administration file” on
                                                                           page 134
                                                                         v “Collecting accounting data for reservations”
                                                                           on page 63
                            Working with reservations:                   “Working with reservations” on page 213
                            v Creating reservations
                            v Submitting jobs under a reservation
                            v Managing reservations
                            Correctly coding and using administration    v Chapter 13, “Administration file reference,”
                            and configuration keywords                     on page 321
                                                                         v Chapter 12, “Configuration file reference,”
                                                                           on page 263



                 Steps for configuring reservations in a LoadLeveler cluster
                            Only the BACKFILL scheduler supports the use of reservations.

                            Before you begin:
                            v For information about configuring the BACKFILL scheduler, see “Choosing a
                              scheduler” on page 44.
                            v You need to decide:
                              – Which users will be allowed to create reservations.
                              – How many reservations users may own, and how long a duration for their
                                 reservations will be allowed.
                              – Which nodes will be used for reservations.
                              – How much setup time is required before the reservation period starts.
                              – Whether accounting data for reservations is to be saved.
|                             – The maximum lifetime for a recurring reservation before you require the user
|                                to request a new reservation for that job.
|                             – Additional system-wide limitations that you may want to implement such as
|                                maintenance time blocks for specific node sets.
                            v For examples of possible reservation keyword combinations, see “Examples:
                              Reservation keyword combinations in the administration file” on page 134.
                            v For details about specific keyword syntax and use:
                              – In the administration file, see Chapter 13, “Administration file reference,” on
                                 page 321.
                              – In the configuration file, see Chapter 12, “Configuration file reference,” on
                                 page 263.

|                           Perform the following steps to configure reservations:


1. In the administration file, modify the user or group stanzas to authorize users
       to create reservations. You may grant the ability to create reservations to an
       individual user, a group of users, or a combination of users and groups. To do
       so, define the following keywords in the appropriate user or group stanzas:
       v max_reservations, to set the maximum number of reservations that a user or
          group may have.
       v (Optional) max_reservation_duration, to set the maximum amount of time
          for the reservation period.
       Tip: To quickly set up and use reservations, use one of the following examples:
       v To allow every user to create a reservation, add max_reservations=1 to the
          default user stanza. Then every administrator or user may create a
          reservation, as long as the number of reservations has not reached the limit
          for a LoadLeveler cluster.
       v To allow a specific group of users to make 10 reservations, add
          max_reservations=10 to the group stanza for that LoadLeveler group. Then
          every user in that group may create a reservation, as long as the number of
          reservations has not reached the limit for that group or for a LoadLeveler
          cluster.
       See the max_reservations description in Chapter 13, “Administration file
       reference,” on page 321 for more information about setting this keyword in the
       user or group stanza.
    2. In the administration file, modify the machine stanza of each machine that may
       be reserved. To do so, set the reservation_permitted keyword to true.
       Tip: If you want to allow every machine to be reserved, you do not have to set
       this keyword; by default, any LoadLeveler machine may be reserved. If you
       want to prevent particular machines from being reserved, however, you must
       define a machine stanza for that machine and set the reservation_permitted
       keyword to false.
    3. In the global configuration file, set reservation policy by specifying values for
       the following keywords:
       v MAX_RESERVATIONS to specify the maximum number of reservations per
          cluster.

|        Note: A recurring reservation only counts as one reservation towards the
|               MAX_RESERVATIONS limit regardless of the number of times that
|               the reservation recurs.
       v RESERVATION_CAN_BE_EXCEEDED to specify whether LoadLeveler will
         be permitted to schedule job steps bound to a reservation when their
         expected end times exceed the reservation end time.
         The default for this keyword is TRUE, which means that LoadLeveler will
         schedule these bound job steps even when they are expected to continue
         running beyond the time at which the reservation ends. Whether these job
         steps run and successfully complete depends on resource availability, which
         is not guaranteed after the reservation ends. In addition, these job steps
         become subject to preemption rules after the reservation ends.
         Tip: You might want to set this keyword value to FALSE to prevent users
         from binding long-running jobs to run under reservations of short duration.
       v RESERVATION_MIN_ADVANCE_TIME to define the minimum time
         between the time at which a reservation is created and the time at which the
         reservation is to start.
         Tip: To reduce the impact to the currently running workload, consider
         changing the default for this keyword, which allows reservations to begin as
         soon as they are created. You may, for example, require reservations to be

made at least one day (1440 minutes) in advance, by specifying
                              RESERVATION_MIN_ADVANCE_TIME=1440 in the global configuration file.
                           v RESERVATION_PRIORITY to define whether LoadLeveler administrators
                              may reserve nodes on which running jobs are expected to end after the start
                              time for the reservation.
                              Tip: The default for this keyword is NONE, which means that LoadLeveler will
                              not reserve a node on which running jobs are expected to end after the start
                              time for the reservation. If you want to allow LoadLeveler administrators to
                              reserve specific nodes regardless of the expected end times of job steps
                              currently running on the node, set this keyword value to HIGH. Note,
                              however, that setting this keyword value to HIGH might increase the number
                              of job steps that must be preempted when LoadLeveler sets up the
                              reservation, and many jobs might remain in Preempted state. This also
                              applies to Blue Gene job steps.
                              This keyword value applies only for LoadLeveler administrators; other
                              reservation owners do not have this capability.
                           v RESERVATION_SETUP_TIME to define the amount of time LoadLeveler
                              uses to prepare for a reservation before it is to start.
                        4. (Optional) In the global configuration file, set controls for the collection of
                           accounting data for reservations:
                           v To turn on accounting for reservations, add the A_RES flag to the ACCT
                              keyword.
                           v To specify a file other than the default history file to contain the data, use the
                              RESERVATION_HISTORY keyword.
                           To learn how to collect accounting data for reservations, see “Collecting
                           accounting data for reservations” on page 63.
                        5. If LoadLeveler is already started, to process the changes you made in the
                           preceding steps, issue the command llctl -g reconfig.
                           Tip: If you have changed the value of only the RESERVATION_PRIORITY
                           keyword, issue the command llctl reconfig only on the central manager node.
                           Result: The new keyword values take effect immediately, but they do not
                           change the attributes of existing reservations.

                        When you are done with this procedure, you may perform additional tasks
                        described in “Working with reservations” on page 213.
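
                        As a minimal sketch of steps 1 through 4, the two files might contain entries
                        such as the following (the user and machine names are hypothetical, and the
                        keyword values are examples only):

                           # Administration file:
                           carol: type = user
                                  max_reservations = 4
                                  max_reservation_duration = 720

                           node13: type = machine
                                   reservation_permitted = false

                           # Global configuration file:
                           MAX_RESERVATIONS = 10
                           RESERVATION_MIN_ADVANCE_TIME = 1440
                           RESERVATION_SETUP_TIME = 60
                           ACCT = A_ON A_RES

                        Then issue llctl -g reconfig as described in step 5.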

                        Examples: Reservation keyword combinations in the
                        administration file
                        The following examples demonstrate LoadLeveler behavior when the
                        max_reservations and max_reservation_duration keywords are set.

                        The examples assume that only the user and group stanzas listed exist in the
                        LoadLeveler administration file.
                        v Example 1: Assume the administration file contains the following stanzas:
                           default: type = user
                                    maxjobs = 10

                           group2: type = group
                                   include_users = rich dave steve

                           rich: type = user
                                 default_group = group2




This example shows that, by default, no one is allowed to make any
  reservations. No one, including LoadLeveler administrators, is permitted to
  make any reservations unless the max_reservations keyword is used.
v Example 2: Assume the administration file contains the following stanzas:
  default: type = user
           maxjobs = 10

  group2: type = group
          include_users = rich dave steve

  rich: type = user
        default_group = group2
        max_reservations = 5
  This example shows how permission to make reservations can be granted to a
  specific user through the user stanza only. Because the max_reservations
  keyword is not used in any group stanza, by default, the group stanzas neither
  grant permissions nor put any restrictions on reservation permissions. User Rich
  can make reservations in any group (group2, No_Group, Group_A, and so on),
  whether or not the group stanzas exist in the LoadLeveler administration file.
  The total number of reservations user Rich can own at any given time is limited
  to five.
v Example 3: Assume the administration file contains the following stanzas:
  default: type = user
           maxjobs = 10

  group2: type = group
          include_users = rich dave steve
          max_reservations = 5

  rich: type = user
        default_group = group2
  This example shows how permission to make reservations can be granted to a
  group of users through the group stanza only. Because the max_reservations
  keyword is not used in any user stanza, by default, the user stanzas neither
  grant nor deny permission to make reservations. All users in group2 (Rich, Dave
  and Steve) can make reservations, but they must make reservations in group2
  because other groups do not grant the permission to make reservations. The
  total number of reservations the users in group2 can own at any given time is
  limited to five.
v Example 4: Assume the administration file contains the following stanzas:
  default: type = user
           maxjobs = 10

  group2: type = group
          include_users = rich dave steve
          max_reservations = 5

  rich: type = user
        default_group = group2
        max_reservations = 0
  This example shows how permission to make reservations can be granted to a
  group of users except one specific user. Because the max_reservations keyword
  is set to zero in the user stanza for Rich, he does not have permission to make
  any reservation, even though all other users in group2 (Dave and Steve) can
  make reservations.
v Example 5: Assume the administration file contains the following stanzas:



default: type = group
                                    max_reservations = 0

                           default: type = user
                                    max_reservations = 0

                           group2: type = group
                                   include_users = rich dave steve
                                   max_reservations = 5

                           rich: type = user
                                 default_group = group2
                                 max_reservations = 5

                           dave: type = user
                                 max_reservations = 2
                           This example shows how permission to make reservations can be granted to
                           specific user and group pairs. Because the max_reservations keyword is set to
                           zero in both the default user and group stanza, no one has permission to make
                           any reservation unless they are specifically granted permission through both the
                           user and group stanza. In this example:
   – User Rich can own, at any given time, up to five reservations in group2 only.
   – User Dave can own, at any given time, up to two reservations in group2 only.
  Together, the total number of reservations they can own at any given time is
  limited to five. No other user and group combination can make any reservations.
                        v Example 6: Assume the administration file contains the following stanzas:
                           default: type = user
                                    max_reservations = 1
                          This example permits any user to make one reservation in any group, until the
                          number of reservations reaches the maximum number allowed in the
                          LoadLeveler cluster.
                        v Example 7: Assume the administration file contains the following stanzas:
                           default: type = group
                                    max_reservations = 0

                           default: type = user
                                    max_reservations = 0

                           group1: type = group
                                   max_reservations = 6
                                   max_reservation_duration = 1440

                           carol: type = user
                                  default_group = group1
                                  max_reservations = 4
                                  max_reservation_duration = 720

                           dave: type = user
                                 default_group = group1
                                 max_reservations = 4
                                 max_reservation_duration = 2880
                            In this example, two users, Carol and Dave, are members of group1. Neither
                            Carol nor Dave belongs to any other group with a group stanza in the
                           LoadLeveler administration file, although they may use any string as the name
                           of a LoadLeveler group and belong to it by default.
                           Because the max_reservations keyword is set to zero in the default group stanza,
                           reservations can be made only in group1, which has an allotment of six
                           reservations. Each reservation can have a maximum duration of 1440 minutes
                           (24 hours).

Considering only the user-stanza attributes for reservations:
                    – User Carol can make up to four reservations with each having a maximum
                        duration of 720 minutes (12 hours).
                    – User Dave can make up to four reservations with each having a maximum
                        duration of 2880 minutes (48 hours).
                    If there are no reservations in the system and user Carol wants to make four
                    reservations, she may do so. Each reservation can have a maximum duration of
                    no more than 720 minutes. If Carol attempts to make a reservation with a
                    duration greater than 720 minutes, LoadLeveler will not make the reservation
                    because it exceeds the duration allowed for Carol.
                    Assume that Carol has created four reservations, and user Dave now wants to
                    create four reservations:
                    – The number of reservations Dave may make is limited by the state of Carol’s
                        reservations and the maximum limit on reservations for group1. If the four
                        reservations Carol made are still being set up, or are active, active shared or
                        waiting, LoadLeveler will restrict Dave to making only two reservations at
                        this time.
                    – Because the value of max_reservation_duration for the group is more
                        restrictive than max_reservation_duration for user Dave, LoadLeveler
                        enforces the group value, 1440 minutes.
                    If Dave belonged to another group that still had reservations available, then he
                    could make reservations under that group, assuming the maximum number of
                    reservations for the cluster had not been met. However, in this example, Dave
                    cannot make any further reservations because they are allowed in group1 only.

    Steps for integrating LoadLeveler with the AIX Workload Manager
|                 Another administrative setup task you must consider is whether you want to
|                 enforce resource usage of ConsumableCpus, ConsumableMemory,
|                 ConsumableVirtualMemory, and ConsumableLargePageMemory.

|                 If you want to control these resources, AIX Workload Manager (WLM) can be
|                 integrated with LoadLeveler to balance workloads at the machine level. When you
| are using WLM, workload balancing is done by assigning relative priorities to job
| processes. These job priorities prevent any one job from monopolizing a system
| resource when that resource is under contention.

|                 Note: WLM is not supported in LoadLeveler for Linux.

|                 To integrate LoadLeveler and WLM, perform the following steps:
|                 1. As required for your use, define the applicable options for ConsumableCpus,
|                     ConsumableMemory, ConsumableVirtualMemory, or
|                     ConsumableLargePageMemory as consumable resources in the
|                     SCHEDULE_BY_RESOURCES global configuration keyword. This enables the
|                     LoadLeveler scheduler to consider these consumable resources.
|                 2. As required for your use, define the applicable options for ConsumableCpus,
|                     ConsumableMemory, ConsumableVirtualMemory, or
|                     ConsumableLargePageMemory in the ENFORCE_RESOURCE_USAGE global
|                     configuration keyword. This enables enforcement of these consumable resources
|                     by AIX WLM.
|                 3. Define hard, soft or shares in the ENFORCE_RESOURCE_POLICY
|                     configuration keyword. This defines what policy is used by LoadLeveler for
|                     CPUs and real memory when setting WLM class resource entitlements.

4. (Optional) Set the ENFORCE_RESOURCE_MEMORY configuration keyword
                               to true. This setting allows AIX WLM to limit the real memory usage of a
                               WLM class as precisely as possible. When a class exceeds its limit, all processes
                               in the class are killed.
                               Rule: ConsumableMemory must be defined in the
                               ENFORCE_RESOURCE_USAGE keyword in the global configuration file, or
                               LoadLeveler does not consider the ENFORCE_RESOURCE_MEMORY
                               keyword to be valid.
                               Tips:
                               v When set to true, the ENFORCE_RESOURCE_MEMORY keyword overrides
                                  the policy set through the ENFORCE_RESOURCE_POLICY keyword for
                                  ConsumableMemory only. The ENFORCE_RESOURCE_POLICY keyword
                                  value still applies for ConsumableCpus.
                               v ENFORCE_RESOURCE_MEMORY may be set in either the global or the
                                  local configuration file. In the global configuration file, this keyword sets the
                                  default value for all the machines in the LoadLeveler cluster. If the keyword
                                  also is defined in a local file, the local setting overrides the global setting.
|                           5. Using the resources keyword in a machine stanza in the administration file,
|                              define the CPU, real memory, virtual memory, and large page machine
|                              resources available for user jobs.
                                v The ConsumableCpus reserved word accepts a count value of "all". This
                                  indicates that the initial resource count will be obtained from the Startd
                                  machine update value for CPUs.
                               v If no resources are defined for a machine, then no enforcement will be done
                                  on that machine.
                               v If the count specified by the administrator is greater than what the Startd
                                  update indicates, the initial count value will be reduced to match what the
                                  Startd reports.
|                              v For CPUs and real memory, if the count specified by the administrator is less
|                                 than what the Startd update indicates, the WLM resource shares assigned to
|                                 a job will be adjusted to represent that difference. In addition, a WLM
|                                 softlimit will be defined for each WLM class. For example, if the
|                                 administrator defines 8 CPUs on a 16 CPU machine, then a job requesting 4
|                                 CPUs will get a share of 4 and a softlimit of 50%.
                               v Use caution when determining the amount of real memory available for user
                                  jobs. A certain percentage of a machine’s real memory will be dedicated to
                                  the Default and System WLM classes and will not be included in the
                                  calculation of real memory available for users jobs. Start LoadLeveler with
                                  the ENFORCE_RESOURCE_USAGE keyword enabled and issue wlmstat -v
                                  -m. Look at the npg column to determine how much memory is being used
                                  by these classes.
|                              v ConsumableVirtualMemory and ConsumableLargePageMemory are hard
|                                 max limit values.
|                                 – AIX WLM considers the ConsumableVirtualMemory value to be real
|                                    memory plus large page plus swap space.
|                                 – The ConsumableLargePageMemory value should be a multiple of the large
|                                    page size. For example, 16MB (page size) * 4 pages = 64MB.
|                           6. Decide if all jobs should have their CPU, real memory, virtual memory, or large
|                              page resources enforced and then define the
                               ENFORCE_RESOURCE_SUBMISSION global configuration keyword.
                               v If the value specified is true, LoadLeveler will check all jobs at submission
                                  time for the resources and node_resources keywords. To be submitted, either
                                  the job’s resources or node_resources keyword must have the same
                                  resources specified as the ENFORCE_RESOURCE_USAGE keyword.


v If the value specified is false, no checking is performed. Jobs submitted
                   without the resources or node_resources keyword will not have their resources
                   enforced, and they might interfere with other jobs whose resources are enforced.
                 v To support existing job command files without the resources or
                   node_resources keyword, the default_resources and default_node_resources
                   keywords in the class stanza can be defined.

              For more information on the ENFORCE_RESOURCE_USAGE and the
              ENFORCE_RESOURCE_SUBMISSION keywords, see “Defining usage policies
              for consumable resources” on page 60.
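
               As an illustration of how these steps fit together, the following sketch shows one
               possible set of entries. The machine name, memory amount, and policy choice are
               hypothetical; see the keyword reference chapters for the authoritative syntax:

                 #  Global configuration file (illustrative values)
                 SCHEDULE_BY_RESOURCES       = ConsumableCpus ConsumableMemory
                 ENFORCE_RESOURCE_USAGE      = ConsumableCpus ConsumableMemory
                 ENFORCE_RESOURCE_POLICY     = shares
                 ENFORCE_RESOURCE_MEMORY     = true
                 ENFORCE_RESOURCE_SUBMISSION = true

                 #  Administration file (hypothetical machine stanza)
                 node01: type = machine
                         resources = ConsumableCpus(all) ConsumableMemory(6 gb)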

LoadLeveler support for checkpointing jobs
              Checkpointing is a method of periodically saving the state of a job step so that if
              the step does not complete it can be restarted from the saved state.

              When checkpointing is enabled, checkpoints can be initiated from within the
              application at major milestones, or by the user, administrator or LoadLeveler
              external to the application. Both serial and parallel job steps can be checkpointed.

              Once a job step has been successfully checkpointed, if that step terminates before
              completion, the checkpoint file can be used to resume the job step from its saved
              state rather than from the beginning. When a job step terminates and is removed
              from the LoadLeveler job queue, it can be restarted from the checkpoint file by
              submitting a new job and setting the restart_from_ckpt = yes job command file
              keyword. When a job is terminated and remains on the LoadLeveler job queue,
              such as when a job step is vacated, the job step will automatically be restarted
              from the latest valid checkpoint file. A job can be vacated as a result of flushing a
              node, issuing checkpoint and hold, stopping or recycling LoadLeveler or as the
              result of a node crash.

              To find out more about checkpointing jobs, use the information in Table 30.
              Table 30. Roadmap of tasks for checkpointing jobs
              Subtask                        Associated instructions (see . . . )
              Preparing the LoadLeveler      v “Checkpoint keyword summary”
              environment for                v “Planning considerations for checkpointing jobs” on page
              checkpointing and restarting     140
              jobs                           v “AIX checkpoint and restart limitations” on page 141
                                             v “Naming checkpoint files and directories” on page 145
              Checkpointing and restarting   v “Checkpointing a job” on page 232
              jobs                           v “Removing old checkpoint files” on page 146
              Correctly specifying           v Chapter 12, “Configuration file reference,” on page 263
              configuration and              v Chapter 13, “Administration file reference,” on page 321
              administration file keywords



        Checkpoint keyword summary
              There are keywords associated with the checkpoint and restart function.

              The following is a summary of keywords associated with the checkpoint and
              restart function.
              v Configuration file keywords

–   CKPT_CLEANUP_INTERVAL
                           –   CKPT_CLEANUP_PROGRAM
                           –   CKPT_EXECUTE_DIR
                           –   MAX_CKPT_INTERVAL
                           –   MIN_CKPT_INTERVAL
                          For more information about these keywords, see Chapter 12, “Configuration file
                          reference,” on page 263.
                        v Administration file keywords
                          – ckpt_dir
                          – ckpt_time_limit
                           For more information about these keywords, see Chapter 13, “Administration file
                           reference,” on page 321.
                        v Job command file keywords
                          – checkpoint
                          – ckpt_dir
                          – ckpt_execute_dir
                          – ckpt_file
                          – ckpt_time_limit
                          – restart_from_ckpt
                           For more information about these keywords, see “Job command file keyword
                           descriptions” on page 359.
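
                         For illustration only, a serial job command file that enables checkpointing might
                         combine these keywords as follows. The file names, directory, and limit values
                         are hypothetical; see the keyword reference for valid settings:

                            # @ job_name        = ckpt_demo
                            # @ executable      = /u/rich/bin/longrun
                            # @ checkpoint      = interval
                            # @ ckpt_dir        = /gpfs/ckpt
                            # @ ckpt_file       = ckpt_demo.ckpt
                            # @ ckpt_time_limit = 30:00,25:00
                            # @ queue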

             Planning considerations for checkpointing jobs
                        There are guidelines to review before you submit a checkpointing job.

                        Review the following guidelines before you submit a checkpointing job:
                        v Plan for jobs that you will restart on different nodes
                          If you plan to migrate jobs (restart jobs on a different node or set of nodes), you
                          should understand the difference between writing checkpoint files to a local file
                          system versus a global file system (such as AFS or GPFS™). The ckpt_file and
                          ckpt_dir keywords in the job command and configuration files allow you to
                          write to either type of file system. If you are using a local file system, before
                          restarting the job from checkpoint, make certain that the checkpoint files are
                          accessible from the machine on which the job will be restarted.
                        v Reserve adequate disk space
                          A checkpoint file requires a significant amount of disk space. The checkpoint
                          will fail if the directory where the checkpoint file is written does not have
                          adequate space. For serial jobs, one checkpoint file will be created. For parallel
                           jobs, one checkpoint file will be created for each task. Because the old set of
                           checkpoint files is not deleted until the new set of files is successfully created,
                          the checkpoint directory should be large enough to contain two sets of
                          checkpoint files. You can make an accurate size estimate only after you have run
                          your job and noticed the size of the checkpoint file that is created.
                        v Plan for staging executables
                          If you want to stage the executable for a job step, use the ckpt_execute_dir
                          keyword to define the directory where LoadLeveler will save the executable.
                          This directory cannot be the same as the current location of the executable file,
                          or LoadLeveler will not stage the executable.
                          You may define the ckpt_execute_dir keyword in either the configuration file or
                          the job command file. To decide where to define the keyword, use the
                          information in Table 31 on page 141.


           Table 31. Deciding where to define the directory for staging executables
           If the ckpt_execute_dir
           keyword is defined in:       Then the following information applies:
           The configuration file only  v LoadLeveler stages the executable file in a new subdirectory
                                          of the specified directory. The name of the subdirectory is
                                          the job step ID.
                                        v The user is the owner of the subdirectory and has permission
                                          700.
                                        v If the user issues the llckpt command with the -k option,
                                          LoadLeveler deletes the staged executable.
                                        v LoadLeveler will delete the subdirectory and the staged
                                          executable when the job step ends.
           The job command file only,   v LoadLeveler stages the executable file in the directory
           or both the configuration      specified in the job command file.
           and job command files        v The user is the owner of the file and has execute permission
                                          for it.
                                        v The user is responsible for deleting the staged file after the
                                          job step ends.
           Neither file (the keyword    LoadLeveler does not stage the executable file for the job step.
           is not defined)

          v Set your checkpoint file size to the maximum
            To make sure that your job can write a large checkpoint file, assign your job to a
            job class that has its file size limit set to the maximum (unlimited). In the
            administration file, set up a class stanza for checkpointing jobs with the
            following entry:
              file_limit = unlimited,unlimited

            This statement specifies that there is no limit on the maximum size of a file that
            your program can create.
          v Choose a unique checkpoint file name
            To prevent another job step from writing over your checkpoint file with another
            checkpoint file, make certain that your checkpoint file name is unique. The
            ckpt_dir and ckpt_file keywords give you control over the location and name of
            these files.
             For more information, see “Naming checkpoint files and directories” on page
            145.
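
           For example, a class stanza for checkpointing jobs might look like the following
           sketch, where the class name and directory are hypothetical; ckpt_dir and
           file_limit are documented class stanza keywords:

             ckpt_class: type = class
                         file_limit = unlimited,unlimited
                         ckpt_dir = /gpfs/ckpt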

    AIX checkpoint and restart limitations
          There are limitations associated with checkpoint and restart.
          v The following items cannot be checkpointed:
            – Programs that are being run under:
               - The dynamic probe class library (DPCL).
               - Any debugger.
            – MPI programs that are not compiled with mpcc_r, mpCC_r, mpxlf_r,
               mpxlf90_r, or mpxlf95_r.
            – Processes that use:
               - Extended shmat support
               - Pinned shared memory segments
|              - The debug malloc tool (MALLOCTYPE=debug)
            – Sets of processes in which any process is running a setuid program when a
               checkpoint occurs.
            – Sets of processes if any process is running a setgid program when a
               checkpoint occurs.

– Interactive parallel jobs for which POE input or output is a pipe.
                            – Interactive parallel jobs for which POE input or output is redirected, unless
                                the job is submitted from a shell that had the CHECKPOINT environment
                                variable set to yes before the shell was started. If POE is run from inside a
                                shell script and is run in the background, the script must be started from a
                                shell started in the same manner for the job to be checkpointable.
                            – Interactive POE jobs for which the su command was used prior to
                                checkpointing or restarting the job.
                        v   The node on which a process is restarted must have:
                            – The same operating system level (including PTFs). In addition, a restarted
                                process may not load a module that requires a system call from a kernel
                                extension that was not present at checkpoint time.
                            – The same switch type as the node where the checkpoint occurred.
                            If any threads in a process were bound to a specific processor ID at checkpoint
                            time, that processor ID must exist on the node where that process is restarted.
                        v   If the LoadLeveler cluster contains nodes running a mix of 32-bit and 64-bit
                            kernels then applications must be checkpointed and restarted on the same set of
                            nodes. For more information, see “llckpt - Checkpoint a running job step” on
                            page 430 and the restart_on_same_nodes keyword description.
                        v   For a parallel job, the number of tasks and the task geometry (the tasks that are
                            common within a node) must be the same on a restart as it was when the job
                            was checkpointed.
                        v   Any regular file open in a process when it is checkpointed must be present on
                            the node where that process is restarted, including the executable and any
                            dynamically loaded libraries or objects.
                        v   If any process uses sockets or pipes, user callbacks should be registered to save
                           data that may be "in flight" when a checkpoint occurs, and to restore the data
                            when the process is resumed after a checkpoint or restart. Similarly, any user
                            shared memory in a parallel task should be saved and restored.
                        v A checkpoint operation will not begin on a process until each user thread in that
                          process has released all pthread locks, if held. This can potentially cause a
                          significant delay from the time a checkpoint is issued until the checkpoint
                          actually occurs. Also, any thread of a process that is being checkpointed that
                          does not hold any pthread locks and tries to acquire one will be stopped
                          immediately. There are no similar actions performed for atomic locks
                          (_check_lock and _clear_lock, for example).
                        v Atomic locks must be used in such a way that they do not prevent the releasing
                          of pthread locks during a checkpoint. For example, if a checkpoint occurs and
                          thread 1 holds a pthread lock and is waiting for an atomic lock, and thread 2
                          tries to acquire a different pthread lock (and does not hold any other pthread
                          locks) before releasing the atomic lock that is being waited for in thread 1, the
                          checkpoint will hang.
                        v A process must not hold a pthread lock when creating a new process (either
                          implicitly using popen, for example, or explicitly using fork) if releasing the lock
                          is contingent on some action of the new process. Otherwise, a checkpoint could
                          occur which would cause the child process to be stopped before the parent
                          could release the pthread lock causing the checkpoint operation to hang.
                        v The checkpoint operation will hang if any user pthread locks are held across:
                          – Any collective communication calls in MPI or LAPI
                          – Calls to mpc_init_ckpt or mp_init_ckpt
                        v Processes cannot be profiled at the time a checkpoint is taken.
                        v There can be no devices other than TTYs or /dev/null open at the time a
                          checkpoint is taken.

v Open files must either have an absolute path name that is less than or equal to
      PATHMAX in length, or must have a relative path name that is less than or
      equal to PATHMAX in length from the current directory at the time they were
      opened. The current directory must have an absolute path name that is less than
      or equal to PATHMAX in length.
    v Semaphores or message queues that are used within the set of processes being
      checkpointed must only be used by processes within the set of processes being
      checkpointed. This condition is not verified when a set of processes is
      checkpointed. The checkpoint and restart operations will succeed, but
      inconsistent results can occur after the restart.
    v The processes that create shared memory must be checkpointed with the
      processes using the shared memory if the shared memory is ever detached from
      all processes being checkpointed. Otherwise, the shared memory may not be
      available after a restart operation.
    v The ability to checkpoint and restart a process is not supported for B1 and C2
      security configurations.
    v A process can only checkpoint another process if it can send a signal to the
      process. In other words, the privilege checking for checkpointing processes is
      identical to the privilege checking for sending a signal to the process. A
      privileged process (the effective user ID is 0) can checkpoint any process. A set
      of processes can only be checkpointed if each process in the set can be
      checkpointed.
    v A process can only restart another process if it can change its entire privilege
      state (real, saved, and effective versions of user ID, group ID, and group list) to
      match that of the restarted process. A set of processes can only be restarted if
      each process in the set can be restarted.
    v The only DCE function supported is DCE credential forwarding by LoadLeveler
      using the DCE_AUTHENTICATION_PAIR configuration keyword. DCE
      credential forwarding is for the sole purpose of DFS™ access by the application.
     v If a process invokes any Network Information Service (NIS) functions, from then
       on, AIX will delay the start of a checkpoint of that process until the process
       returns from any system call.
    v Jobs in which the message passing application is not a direct child of the
      Partition Manager Daemon (pmd) cannot be checkpointed.
|   v Scale-across jobs cannot be checkpointed.
    v The following functions will return ENOTSUP if called in a job that has enabled
      checkpointing:
      – clock_getcpuclockid()
      – clock_getres()
      – clock_gettime()
      – clock_nanosleep()
      – clock_settime()
      – mlock()
      – mlockall()
      – mq_close()
      – mq_getattr()
      – mq_notify()
      – mq_open()
      – mq_receive()
      – mq_send()
      – mq_setattr()
      – mq_timedreceive()
      – mq_timedsend()


–   mq_unlink()
                           –   munlock()
                           –   munlockall()
                           –   nanosleep()
                           –   pthread_barrier_destroy()
                           –   pthread_barrier_init()
                           –   pthread_barrier_wait()
                           –   pthread_barrierattr_destroy()
                           –   pthread_barrierattr_getpshared()
                           –   pthread_barrierattr_init()
                           –   pthread_barrierattr_setpshared()
                           –   pthread_condattr_getclock()
                           –   pthread_condattr_setclock()
                           –   pthread_getcpuclockid()
                           –   pthread_mutex_getprioceiling()
                           –   pthread_mutex_setprioceiling()
                           –   pthread_mutex_timedlock()
                           –   pthread_mutexattr_getprioceiling()
                           –   pthread_mutexattr_getprotocol()
                           –   pthread_mutexattr_setprioceiling()
                           –   pthread_mutexattr_setprotocol()
                           –   pthread_rwlock_timedrdlock()
                           –   pthread_rwlock_timedwrlock()
                           –   pthread_setschedprio()
                           –   pthread_spin_destroy()
                           –   pthread_spin_init()
                           –   pthread_spin_lock()
                           –   pthread_spin_trylock()
                           –   pthread_spin_unlock()
                           –   sched_get_priority_max()
                           –   sched_get_priority_min()
                           –   sched_getparam()
                           –   sched_getscheduler()
                           –   sched_rr_get_interval()
                           –   sched_setparam()
                           –   sched_setscheduler()
                           –   sem_close()
                           –   sem_destroy()
                           –   sem_getvalue()
                           –   sem_init()
                           –   sem_open()
                           –   sem_post()
                           –   sem_timedwait()
                           –   sem_trywait()
                           –   sem_unlink()
                           –   sem_wait()
                           –   shm_open()
                           –   shm_unlink()
                           –   timer_create()
                           –   timer_delete()
                           –   timer_getoverrun()
                           –   timer_gettime()
                           –   timer_settime()




Naming checkpoint files and directories
      At checkpoint time, a checkpoint file and potentially an error file will be created.

       For jobs that are enabled for checkpoint, a control file may be generated at the
       time of job submission. The directory that will contain these files must already
       exist and have sufficient space and permissions for these files to be written. The name
      and location of these files will be controlled through keywords in the job command
      file or the LoadLeveler configuration. The file name specified is used as a base
      name from which the actual checkpoint file name is constructed. To prevent
      another job step from writing over your checkpoint file, make certain that your
      checkpoint file name is unique. For serial jobs and the master task (POE) of
      parallel jobs, the checkpoint file name will be <basename>.Tag. For a parallel job, a
      checkpoint file is created for each task. The checkpoint file name will be
      <basename>.Taskid.Tag.

      The tag is used to differentiate between a current and previous checkpoint file. A
      control file may be created in the checkpoint directory. This control file contains
      information LoadLeveler uses for restarting certain jobs. An error file may also be
       created in the checkpoint directory. The data in this file is in a machine-readable
       format. The information contained in the error file is available in mail, in the
       LoadLeveler logs, or in the output of the checkpoint command. Both files are named with
      the same base name as the checkpoint file with the extensions .cntl and .err,
      respectively.

      Naming checkpoint files for serial and batch parallel jobs
      There is an order in which keywords are checked to construct the full path name
      for a serial or batch checkpoint file.

      The following describes the order in which keywords are checked to construct the
      full path name for a serial or batch checkpoint file:
      v Base name for the checkpoint file name
         1. The ckpt_file keyword in the job command file
          2. The default file name [<jobname>.]<job_step_id>.ckpt
            Where:
            jobname
                     The job_name specified in the Job Command File. If job_name is not
                     specified, it is omitted from the default file name
            job_step_id
                     Identifies the job step that is being checkpointed
      v Checkpoint Directory Name
          1. The ckpt_file keyword in the job command file, if it contains a "/" as the first
            character
         2. The ckpt_dir keyword in the job command file
         3. The ckpt_dir keyword specified in the class stanza of the LoadLeveler admin
            file
         4. The default directory is the initial working directory

       Note that two or more job steps running at the same time must not write to the
       same checkpoint file, because the file would be corrupted.
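
       For example (using hypothetical values), if a job command file specifies
       job_name = ocean but no ckpt_file keyword, and the job step ID is
       c209f1n01.42.0, the default base name is ocean.c209f1n01.42.0.ckpt. If the class
       stanza specifies ckpt_dir = /gpfs/ckpt, the serial checkpoint file is written in
       /gpfs/ckpt with that base name and the current tag appended.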

      Naming checkpointing files for interactive parallel jobs
      There is an order in which keywords and variables are checked to construct the
      full path name for the checkpoint file for an interactive parallel job.




The following describes the order in which keywords and variables are checked to
                        construct the full path name for the checkpoint file for an interactive parallel job.
                        v Checkpoint File Name
                          1. The value of the MP_CKPTFILE environment variable within the POE
                              process
                          2. The default file name, poe.ckpt.<pid>
                        v Checkpoint Directory Name
                          1. The value of the MP_CKPTFILE environment variable within the POE
                              process, if it contains a full path name.
                          2. The value of the MP_CKPTDIR environment variable within the POE
                              process.
                          3. The initial working directory.

                        Note: The keywords ckpt_dir and ckpt_file are not allowed in the command file
                              for an interactive session. If they are present, they will be ignored and the
                              job will be submitted.
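
                         For illustration, a user of a POSIX shell might set these environment variables
                         before starting an interactive parallel job; the directory, file name, and
                         program name are hypothetical:

                            $ export CHECKPOINT=yes
                            $ export MP_CKPTDIR=/gpfs/ckpt
                            $ export MP_CKPTFILE=poe_demo.ckpt
                            $ poe ./poe_demo -procs 4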

             Removing old checkpoint files
                        LoadLeveler provides two keywords to help automate the process of removing
                        checkpoint files that are no longer necessary.

                        To keep your system free of checkpoint files that are no longer necessary,
                        LoadLeveler provides two keywords to help automate the process of removing
                        these files:
                        v CKPT_CLEANUP_PROGRAM
                        v CKPT_CLEANUP_INTERVAL
                        Both keywords must contain valid values to automate this process. For information
                        about configuration file keyword syntax and other details, see Chapter 12,
                        “Configuration file reference,” on page 263.
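
                         For example, the global configuration file might contain entries such as the
                         following sketch; the program path and interval value are illustrative only:

                            CKPT_CLEANUP_PROGRAM  = /u/loadl/bin/rm_ckpt_files
                            CKPT_CLEANUP_INTERVAL = 3600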

LoadLeveler scheduling affinity support
                        LoadLeveler offers a number of scheduling affinity options.

                        LoadLeveler offers the following scheduling affinity options:
                        v Memory and adapter affinity
                        v Processor affinity

                         Enabling scheduling affinity allows LoadLeveler jobs to gain the performance
                         improvements of multiple chip module (MCM) affinity (memory and adapter) and
                         processor affinity. If affinity is enabled, LoadLeveler will schedule and attach the
                         appropriate CPUs in the cluster to the job tasks in order to maximize the
                         performance improvement for the type of affinity requested by the job.

                        Memory and adapter affinity

                        Memory affinity is a special purpose option for improving performance on IBM
                        POWER6™, POWER5™, and POWER4™ processor-based systems. These machines
                        contain MCMs, each containing multiple processors. System memory is attached to
                        these MCMs. While any processor can access all of the memory in the system, a
                        processor has faster access and higher bandwidth when addressing memory that is
                        attached to its own MCM rather than memory attached to the other MCMs in the
                        system. The concept of affinity also applies to the I/O subsystem. The processes
                        running on CPUs from an MCM have faster access to the adapters attached to the

I/O slots of that MCM. I/O affinity will be referred to as adapter affinity in this
          topic. For more information about memory and adapter affinity, see AIX
          Performance Management Guide.

|         Processor affinity

|         LoadLeveler provides processor affinity options to improve job performance on the
|         following platforms:
|         v IBM POWER6 and POWER5 processor-based systems running in simultaneous
|            multithreading (SMT) mode with AIX or Linux
|         v IBM POWER6 and POWER5 processor-based systems running in Single
|            Threaded (ST) mode with AIX or Linux
|         v IBM POWER4 processor-based systems with AIX or Linux
|         v x86 and x86_64 processor-based systems with Linux

|         On AIX, affinity support is implemented by using a Resource Set (RSet), which
|         contains bit maps for CPU and memory pool resources. The RSet APIs available in
|         AIX can be used to attach RSets to processes. Attaching an RSet to a process limits
|         the process to only using the resources contained in the RSet. One of the main uses
|         of RSets is to limit the application processes to run only on the processors
|         contained in a single MCM and hence to benefit from memory affinity. For more
|         details on RSets, refer to AIX System Management Guide: Operating System and
|         Devices.

|         On Linux on Power systems, affinity support is implemented by using "cpusets,"
|         which provide a mechanism for assigning a set of CPUs and memory nodes
|         (MCMs) to a set of tasks. The cpusets constrain the CPU and memory placement of
|         tasks to only the resources within a task’s current cpuset. The cpusets are managed
|         by the virtual file system type cpuset. Before configuring LoadLeveler to support
|         affinity, the cpuset virtual file system must be created on every machine in the
|         cluster to enable affinity support.

|         On Linux on x86 and x86_64 systems, affinity support is implemented by using the
|         sched_setaffinity Linux-specific system call to assign a set of physical or logical
|         CPUs to the job processes.

    Configuring LoadLeveler to use scheduling affinity
          On AIX and Linux on Power systems, scheduling affinity can be enabled by using
          the RSET_SUPPORT configuration file keyword. Machines that are configured
          with this keyword indicate the ability to service jobs requesting or requiring
          scheduling affinity.

|         Enable RSET_SUPPORT with one of these values:
|         v Choose RSET_MCM_AFFINITY to allow jobs specifying rset =
|           RSET_MCM_AFFINITY or the task_affinity keyword to run on a node. When
|           rset = RSET_MCM_AFFINITY, LoadLeveler will select and attach sets of CPUs
|           to task processes such that a set of CPUs will be from the same MCM. When the
|           task_affinity keyword is used, LoadLeveler will select CPUs regardless of their
|           location with respect to an MCM.
|         v Choose RSET_USER_DEFINED to allow jobs specifying a user-defined RSet
|           name for rset to run on a node. The RSET_USER_DEFINED option enables
|           scheduling affinity, allowing users more control over scheduling affinity
|           parameters by allowing the use of user-defined RSets. Through the use of
|           user-defined RSets, users can utilize new RSet features before a LoadLeveler

|                              implementation is released. This option also allows users to specify a different
|                              number of CPUs in their RSets depending on the needs of each task. This value
|                              is supported only on AIX machines.

                            Note:
|                                   1. Because LoadLeveler creates a cpuset for each task requesting affinity
|                                      under the /dev/cpuset directory on Linux on POWER machines, the
|                                      cpuset virtual file system must be created and mounted on the
|                                      /dev/cpuset directory by issuing the following commands on each node:
|                                      # mkdir /dev/cpuset
|                                      # mount -t cpuset none /dev/cpuset
|                                   2. A virtual file system of type cpuset mounted at /dev/cpuset will be
|                                      deleted when the node is rebooted. To create the /dev/cpuset directory
|                                      and have the virtual cpuset file system mounted on it automatically
|                                      when the node is rebooted, add the following commands to your
|                                      start-up script (for example, /etc/init.d/boot.local), which is run when the
|                                      node is rebooted or started:
|                                      if test -e /dev/cpuset || mkdir -p /dev/cpuset ; then
|                                       mount -t cpuset none /dev/cpuset
|                                      fi

|                           See “Configuration file keyword descriptions” on page 265 for more information
|                           on the RSET_SUPPORT keyword.

|                           On AIX and Linux on Power systems, jobs requesting processor affinity with the
|                           task_affinity keyword in the job command file will only run on machines where
|                           the resource statement in the machine stanza in the LoadLeveler administration file
|                           contains the ConsumableCpus keyword. For more information on specifying
|                           ConsumableCpus, see the resource keyword description in “Administration file
|                           keyword descriptions” on page 327.
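
                            A minimal sketch that ties these pieces together (the machine name is
                            hypothetical): enable affinity in the configuration file, and make CPUs
                            consumable in the machine stanza of the administration file:

                               #  Configuration file
                               RSET_SUPPORT = RSET_MCM_AFFINITY

                               #  Administration file
                               node01: type = machine
                                       resources = ConsumableCpus(all)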

|                           On Linux on x86 and x86_64 systems, exclusive allocation of CPUs to job steps is
|                           enabled by using the ALLOC_EXCLUSIVE_CPU_PER_JOB configuration file
|                           keyword. Enable ALLOC_EXCLUSIVE_CPU_PER_JOB with one of these values:
|                           v Choose the PHYSICAL option to allow LoadLeveler to assign tasks to physical
|                             processor packages. The PHYSICAL option allows LoadLeveler to treat
|                             hyperthreaded processors and multicore processors as a single unit so that a job
|                             has dedicated computing resources. For example, a node with two Intel x86
|                             processors with hyperthreading turned ON, will be treated as a node with two
|                             physical processors. Similarly, a node with two dual-core AMD Opteron
|                             processors will be treated as a node with two physical processors.
|                           v Choose the LOGICAL option to allow LoadLeveler to assign tasks to processor
|                             units. For example, a node with two Intel x86 processors with hyperthreading
|                             turned ON will be treated as a node with four processors. A node with two
|                             dual-core AMD Opteron processors will be treated as a node with four
|                             processors.

|                           See “Configuration file keyword descriptions” on page 265 for more information
|                           on the ALLOC_EXCLUSIVE_CPU_PER_JOB keyword.
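
                            For example, to dedicate physical processor packages to job tasks on such
                            machines, the configuration file might contain:

                               ALLOC_EXCLUSIVE_CPU_PER_JOB = PHYSICAL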

    LoadLeveler multicluster support
                            To provide a more scalable runtime environment and more efficient workload
                            balancing, you may configure a LoadLeveler multicluster environment.


A LoadLeveler multicluster environment consists of two or more LoadLeveler
    clusters, grouped together through network connections that allow the clusters to
    share resources. These clusters may be AIX, Linux, or mixed clusters.

    Within a LoadLeveler multicluster environment:
    v The local cluster is the cluster from which the user submits jobs or issues
      commands.
    v A remote cluster is a cluster that accepts job submissions and commands from
      the local cluster.
    v A local gateway Schedd is a Schedd within the local cluster serving as an
      inbound point from some remote cluster, an outbound point to some remote
      cluster, or both.
    v A remote gateway Schedd is a Schedd within a remote cluster serving as an
      inbound point from the local cluster, an outbound point to the local cluster, or
      both.
    v A local central manager is the central manager in the same cluster as the local
      gateway Schedd.
    v A remote central manager is the central manager in the same cluster as a remote
      gateway Schedd.

    A LoadLeveler multicluster environment addresses scalability and workload
    balancing issues by providing the ability to:
    v Distribute workload among LoadLeveler clusters when jobs are submitted.
    v Easily access multiple LoadLeveler cluster resources.
    v Display information about the multicluster.
    v Monitor and control operations in a multicluster.
    v Transfer idle jobs from one cluster to another.
    v Transfer user input and output files between clusters.
    v Enable LoadLeveler to operate in a secure environment where clusters are
      separated by a firewall.

    Table 32 shows the multicluster support subtasks with a pointer to the associated
    instructions:
    Table 32. Multicluster support subtasks and associated instructions
    Subtask                                          Associated instructions (see . . . )
    Configure a LoadLeveler multicluster             “Configuring a LoadLeveler multicluster” on
                                                     page 150
    Submit and monitor jobs in a LoadLeveler         “Submitting and monitoring jobs in a
    multicluster                                     LoadLeveler multicluster” on page 223
|   Scale-across scheduling                          “Scale-across scheduling with multiclusters”
                                                     on page 153


    Table 33. Multicluster support related topics
    Related topics                                   Additional information (see . . . )
    Administration file: Cluster stanzas             “Defining clusters” on page 100
    Administration file: Cluster keywords            “Administration file keyword descriptions”
                                                     on page 327
    Configuration file: Cluster keywords             “Configuration file keyword descriptions”
                                                     on page 265
    Job command file: Cluster keywords               “Job command file keyword descriptions” on
                                                     page 359



                        Commands and APIs                                 Chapter 16, “Commands,” on page 411 or
                                                                          Chapter 17, “Application programming
                                                                          interfaces (APIs),” on page 541
                        Diagnosis and messages                            TWS LoadLeveler: Diagnosis and Messages
                                                                          Guide



             Configuring a LoadLeveler multicluster
                        These are the subtasks for configuring a LoadLeveler multicluster.

                        Table 34 lists the subtasks for configuring a LoadLeveler multicluster.
                        Table 34. Subtasks for configuring a LoadLeveler multicluster
                        Subtask                  Associated instructions (see . . . )
                        Configure the            v “Steps for configuring a LoadLeveler multicluster” on page 151
                        LoadLeveler              v “Steps for securing communications within a LoadLeveler
                        multicluster               multicluster” on page 153
                        environment
                        Display information      v Use the llstatus command:
                        about the LoadLeveler      – With the -X option to display information about machines
                        multicluster                 in the multicluster.
                        environment                – With the -C option to display information defined in
                                                     cluster stanzas in the administration file.
                                                 v Use the llclass command with the -X option to display
                                                   information about classes on any cluster (local or remote).
                                                 v Use the llq command with the -X option to display information
                                                   about jobs on any cluster (local or remote).




Monitor and control     Existing LoadLeveler user commands accept the -X option for a
operations in the       multicluster environment.
LoadLeveler
multicluster            Rules:
environment             v Administrator only commands are not applicable in a multicluster
                          environment.
                        v The options -x, -W, -s, and -p cannot be specified together with
                          the -X option on the llmodify command.
                        v The options -x and -w cannot be specified together with the -X
                          option on the llq command.
                        v The -X option on the following commands is restricted to a single
                          cluster:
                          – llcancel
                          – llckpt
                          – llhold
                          – llmodify
                          – llprio
                        v The following commands are not applicable in a multicluster
                          environment:
                          – llacctmrg
                          – llchres
                          – llextRPD
                          – llinit
                          – llmkres
                          – llqres
                          – llrmres
                          – llrunscheduler
                          – llsummary


Steps for configuring a LoadLeveler multicluster
The primary task for configuring a LoadLeveler multicluster environment is to
enable communication between gateway Schedd daemons on all of the clusters in
the multicluster.

To do so requires defining each Schedd daemon as either local or remote, and
defining the inbound and outbound hosts with which the daemon will
communicate.

Before you begin: You need to know that:
v A single machine may be defined as an inbound or outbound host, or as both.
v A single cluster must belong to only one multicluster.
v A single multicluster must consist of 10 or fewer clusters.
v Clusters must have unique host names within the multicluster network domain
  space.
v The inbound Schedd becomes the schedd_host of all remote jobs it receives.

Perform the following steps to configure a LoadLeveler multicluster:
1. In the administration file, define one cluster stanza for each cluster in the
   LoadLeveler multicluster environment.
   Rules:
   v You must define one cluster as the local cluster.
   v You must code the following required cluster-stanza keywords and variable
      values:

                            cluster_name: type=cluster
                                          outbound_hosts = hostname[(cluster_name)]
                                          inbound_hosts = hostname[(cluster_name)]
                           v If you want to allow users to submit remote jobs to the local cluster, the list
                              of inbound hosts must either include the name of the inbound Schedd along
                              with each cluster you are defining as remote, or specify the name of an
                              inbound Schedd without any cluster specification so that it defaults to being
                              an inbound Schedd for all clusters. (Sample stanzas follow these steps.)
                           v If the configuration file keyword SCHEDD_STREAM_PORT for any cluster
                              is set to use a port other than the default value of 9605, you must set the
                              inbound_schedd_port keyword in the cluster stanza for that cluster.
                        2. (Optional) If the local cluster is to provide job distribution, in which users
                           allow LoadLeveler to select the appropriate cluster for job submission based on
                           administrator-defined objectives, define an installation exit to be executed
                           at submit time using the CLUSTER_METRIC configuration keyword. You can
                           use the LoadLeveler data access APIs in this exit to query other clusters for
                           information about possible metrics, such as the number of jobs in a specified
                           job class, the number of jobs in the idle queue, or the number of free nodes in
                           the cluster. For more detailed information, see CLUSTER_METRIC.
                           Tip: LoadLeveler provides a set of sample exits for you to use as models. These
                           samples are in the ${RELEASEDIR}/samples/llcluster directory.
                        3. (Optional) If the local cluster wants to perform user mapping on jobs arriving
                           from remote clusters, define the CLUSTER_USER_MAPPER configuration
                           keyword. For more information, see CLUSTER_USER_MAPPER.
                        4. (Optional) If the local cluster wants to perform job filtering on jobs received
                           from remote clusters, define the CLUSTER_REMOTE_JOB_FILTER
                           configuration keyword. For more information, see
                           CLUSTER_REMOTE_JOB_FILTER.
                        5. Notify LoadLeveler daemons by issuing the llctl command with either the
                           reconfig or recycle keyword. Otherwise, LoadLeveler will not process the
                           modifications you made to the administration file.
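
                        As an illustration of step 1, here is a minimal sketch of matching cluster
                        stanzas for a two-cluster multicluster. The cluster names and gateway host
                        names are hypothetical, and the sketch assumes that the local cluster-stanza
                        keyword marks the cluster on which the administration file resides:

                           cluster_east: type = cluster
                              local = true
                              outbound_hosts = east-gw(cluster_west)
                              inbound_hosts = east-gw(cluster_west)

                           cluster_west: type = cluster
                              outbound_hosts = west-gw(cluster_east)
                              inbound_hosts = west-gw(cluster_east)

                        On cluster_west, the same stanzas would appear with local = true coded in the
                        cluster_west stanza instead.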

                        Additional considerations:
                        v Remote jobs are subjected to the same configuration checks as locally submitted
                          jobs. Examples include account validation, class limits, include lists, and exclude
                          lists.
                        v Remote jobs will be processed by the local submit_filter prior to submission to a
                          remote cluster.
                        v Any tracker program specified in the API parameters will be invoked on the
                          nodes of the scheduling cluster.
                        v If a step is enabled for checkpoint and the ckpt_execute_dir is not specified,
                          LoadLeveler will not copy the executable to the remote cluster; the user must
                          ensure that the executable exists on the remote cluster. If the executable is not
                          in a shared file system, it can be copied to the remote cluster using the
                          cluster_input_file job command file keyword.
                        v If the job command file is also the executable and the job is submitted or moved
                          to a remote cluster, the $(executable) variable will contain the full path name of
                          the executable on the local cluster from which it came. This differs from the
                          behavior on the local cluster, where the $(executable) variable will be the
                          command line argument passed to the llsubmit command. If you only want the
                          file name, use the $(base_executable) variable.




Steps for securing communications within a LoadLeveler
          multicluster
          Configuring LoadLeveler to use the OpenSSL library enables it to operate in a
          secure environment where clusters are separated by a firewall.

          Perform the following steps to configure LoadLeveler to use OpenSSL in a
          multicluster environment:
           1. Install OpenSSL using the standard installation process for your platform.
          2. Ensure a link exists from the installed SSL library to:
             a. /usr/lib/libssl.so for 32-bit Linux platforms.
             b. /usr/lib64/libssl.so for 64-bit Linux platforms.
             c. /usr/lib/libssl.a for AIX platforms.
          3. Create the SSL authorization keys by invoking the llclusterauth command with
             the -k option on all local gateway schedds.
             Result: LoadLeveler creates a public key, a private key, and a security certificate
             for each gateway node.
          4. Distribute the public keys to remote gateway schedds on other secure clusters.
             This is done by exchanging the public keys with the other clusters you wish to
             communicate with.
              v For AIX, public keys can be found in the /var/LoadL/ssl/id_rsa.pub file.
              v For Linux, public keys can be found in the /var/opt/LoadL/ssl/id_rsa.pub
                 file.
          5. Copy the public keys of the clusters you wish to communicate with into the
             authorized_keys directory on your inbound Schedd nodes.
              v For AIX, /var/LoadL/ssl/authorized_keys
              v For Linux, /var/opt/LoadL/ssl/authorized_keys
             v The authorization key files can be named anything within the
                authorized_keys directory.
           6. Define the cluster stanzas within the LoadLeveler administration file, using the
              multicluster_security = SSL keyword. Define the ssl_cipher_list keyword if a
              specific OpenSSL cipher encryption method is desired. Use secure_schedd_port
              to define the port number to be used for secure inbound transactions to the
              cluster. (A condensed sketch of these steps follows step 8.)
          7. Notify LoadLeveler daemons by issuing the llctl -g command with the recycle
             keyword. Otherwise, LoadLeveler will not process the modifications you made
             to the administration file.
          8. Configure firewalls to accept connections to the secure_schedd_port numbers
             you defined in the administration file.
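
           The following condensed sketch pulls steps 3 through 7 together for an AIX
           gateway node. The remote host name west-gw, the cluster names, the key file
           name, and the port number are hypothetical:

              llclusterauth -k
              scp /var/LoadL/ssl/id_rsa.pub \
                  west-gw:/var/LoadL/ssl/authorized_keys/cluster_east.pub

              cluster_west: type = cluster          (administration file)
                 multicluster_security = SSL
                 secure_schedd_port = 9607

              llctl -g recycle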

|   Scale-across scheduling with multiclusters
|         In the multicluster environment, scale-across scheduling allows you to schedule
|         jobs across more than one cluster. This feature allows large jobs that request more
|         resources than a single cluster can provide to combine resources from more than
|         one cluster and run on the combined resources.

|         By effectively spanning resources across more than one cluster, scale-across
|         scheduling also allows utilization of fragmented resources from more than one
|         cluster. Fragmented resources occur when the resources available on a single
|         cluster cannot satisfy any single job on that cluster. This feature allows any size job
|         to take advantage of these resources by combining them from multiple clusters.

|                           The following are not supported with scale-across scheduling:
|                           v Checkpointing jobs
|                           v Coscheduled jobs
|                           v Data staging jobs
|                           v Hostlist jobs
|                           v IBM Blue Gene Systems resources jobs
|                           v Interactive Parallel Operating Environment (POE)
|                           v   Multistep jobs
|                           v   Preemption of scale-across jobs
|                           v   Reservations
|                           v   Secure Sockets Layer (SSL)
|                           v   Task-geometry jobs
|                           v   User space jobs

|                           Requirements for scale-across scheduling
|                           Main Cluster
|                                 In a multicluster environment that supports scale-across scheduling, one of
|                                 the clusters in the multicluster environment must be designated as the
|                                 "main cluster." The main cluster will only schedule scale-across jobs; it will
|                                 not run any jobs. Scale-across jobs will run on non-main clusters.
|                           Network Connectivity
|                                 A requirement for any cluster that will participate in scale-across
|                                 scheduling is that any node in one cluster must be able to communicate
|                                 with any other node in any other cluster that is part of the scale-across
|                                 configuration. There are two reasons for this requirement:
|                                    v Since the main cluster initiates the scale-across job, one node in the main
|                                      cluster must have connectivity to any node in any of the other clusters
|                                      where the job will run.
|                                    v Tasks of parallel applications must communicate with other tasks
|                                      running on different nodes.

|                           Configuring LoadLeveler for scale-across scheduling
|                           After you choose a set of clusters to participate in scale-across scheduling, you
|                           must designate one cluster as the main cluster. Do so by specifying a value of true
|                           in the main_scale_across_cluster keyword for that cluster’s stanza in the
|                           administration files of all scale-across clusters. The cluster that specifies this
|                           keyword as true for its own cluster stanza becomes the main cluster. Any cluster
|                           that specifies this keyword as true for another cluster stanza becomes a non-main
|                           cluster.
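|
|                           For example, the administration files of all participating clusters might contain
|                           the following sketch of cluster stanzas (cluster names hypothetical), which
|                           designates cluster_main as the main cluster. The allow_scale_across_jobs
|                           keyword shown in the other stanzas is described in Table 35:
|
|                              cluster_main: type = cluster
|                                 main_scale_across_cluster = true
|
|                              cluster_a: type = cluster
|                                 allow_scale_across_jobs = true
|
|                              cluster_b: type = cluster
|                                 allow_scale_across_jobs = true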

|                           Table 35 lists scale-across scheduling keywords:
|                           Table 35. Keywords for configuring scale-across scheduling
|                           Keyword type                    Keyword reference
|
|                           Administration file keywords       allow_scale_across_jobs cluster stanza keyword
|                                                              main_scale_across_cluster cluster stanza keyword
|                                                              allow_scale_across_jobs class stanza keyword
|
|                           Configuration file keyword         SCALE_ACROSS_SCHEDULING_TIMEOUT keyword
|



|                Tuning considerations for scale-across scheduling
|                NEGOTIATOR_CYCLE_DELAY
|                     The value on both the main and the non-main clusters should be set to
|                     similar values to minimize the wait delays on both the main and the
|                     non-main clusters that occur when the main cluster is requesting a
|                     negotiator cycle on the non-main clusters. It is reasonable to set
|                     NEGOTIATOR_CYCLE_DELAY=1 on all clusters.
|                MAX_TOP_DOGS
|                     The maximum number of top-dog scale-across jobs allowed on the main
|                     cluster should be smaller than the maximum number of top-dog jobs
|                     allowed on the non-main clusters to allow the non-main clusters to
|                     schedule both the scale-across and regular jobs as top dogs.
|                SCALE_ACROSS_SCHEDULING_TIMEOUT
|                      The default value should be overridden only if there are non-main clusters
|                      that have extremely long dispatch cycles or that have very long
|                      NEGOTIATOR_CYCLE_DELAY values. In these cases, the
|                      SCALE_ACROSS_SCHEDULING_TIMEOUT needs to be set to a value
|                      greater than those intervals.
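|
|                Following these guidelines, the relevant configuration file entries might look
|                like the following sketch; the MAX_TOP_DOGS values are illustrative only:
|
|                   NEGOTIATOR_CYCLE_DELAY = 1   # on all clusters
|                   MAX_TOP_DOGS = 5             # on the main cluster
|                   MAX_TOP_DOGS = 20            # on each non-main cluster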
|
    LoadLeveler Blue Gene support
                 Blue Gene is a massively parallel system based on a scalable cellular architecture
                 which exploits a very large number of tightly interconnected compute nodes
                 (C-nodes).

|                To take advantage of Blue Gene support, you must be using the LoadLeveler
|                BACKFILL scheduler. With the BACKFILL scheduler, LoadLeveler enables the Blue
|                Gene system to take advantage of reservations that allow you to schedule when,
|                and with which resources a job will run.

                  While LoadLeveler Blue Gene support is available on all platforms, Blue Gene®/L™
                  software is supported only on IBM POWER servers running SLES 9, and Blue
                  Gene®/P™ software is supported only on IBM POWER servers running SLES 10.
                  Mixed clusters of Blue Gene/L and Blue Gene/P systems are not supported.

                 Terms you should know:
                 v Compute nodes, also called C-nodes, are system-on-a-chip nodes that execute at
                   most a single job at a time. All the C-nodes are interconnected in a
                   three-dimensional toroidal pattern. Each C-node has a unique address and
                   location in the three-dimensional toroidal space. Compute nodes execute the
                   jobs’ tasks. Compute nodes run a minimal custom operating system called
                   BLRTS.
                 v Front End Nodes (FEN) are machines from which users and administrators
                   interact with Blue Gene. Applications are compiled on and submitted for
                   execution in the Blue Gene core from FENs. User interactions with applications,
                   including debugging, are also performed from the FENs.
                 v The Service Node is dedicated hardware that runs software to control and
                   manage the Blue Gene system.
                 v I/O nodes are special nodes that connect the compute nodes to the outside
                   world. I/O nodes allow processes that are executing in the compute nodes to
                    perform I/O operations, such as accessing files, and to communicate with the
                    job management system. Each I/O node serves anywhere from 8 to 64 C-nodes,
                            depending on the physical configuration.
                        v   mpirun is a program that is executed partly on the Front End Node, and partly
                            on the Service Node. mpirun controls and monitors the parallel Blue Gene job.
                            The mpirun program is executed by the user program that is run on the FEN by
                            LoadLeveler.
                        v   A base partition (BP) is a group of compute nodes connected in a 3D
                            rectangular pattern and their controlled I/O nodes. A base partition is one of the
                            basic allocation units for jobs. For example, an allocation for the job will require
                            at least one base partition, unless an allocation requests a small partition, in
                            which case sub base partition allocation is possible.
                        v   A small partition is a group of C-nodes which are part of one base partition.
                            Valid small partitions have a size of 32 or 128 C-nodes.
                        v   A partition is a group of base partitions, switches, and switch states allocated to
                            a job. A partition is predefined or is created on demand to execute a job.
                            Partitions are physically (electronically) isolated from each other (for example,
                            messages cannot flow outside an allocated partition). A partition can have the
                            topology of a mesh or a torus.
                        v   The Control System is a component that serves as the interface to the Blue Gene
                            system. It contains persistent storage with configuration and status information
                            on the entire system. It also provides various services to perform actions on the
                            Blue Gene system, such as launching a job.
                        v   A node card is a group of 32 compute nodes within a base partition. This is the
                            minimal allocation size for a partition.
                        v   A quarter is a group of 4 node cards. This is a logical grouping of node cards
                            within a base partition. A quarter, which is 128 compute nodes, is the next
                            smallest allowed allocation size for a partition after a node card.
                        v   A switch state is a set of internal switch connections which physically "wire" the
                            partition. A switch has a number of incoming and outgoing wires. An internal
                            switch connection physically connects one incoming wire with one outgoing
                            wire, setting up a communication path between base partitions.

                        For more information about the Blue Gene system and Blue Gene terminology,
                        refer to IBM System Blue Gene Solution documentation. Table 36 lists the IBM
                        System Blue Gene Solution publications that are available from the IBM Redbooks®
                        Web site at the following URLs:
Table 36. IBM System Blue Gene Solution documentation
Blue Gene
System           Publication Name                                 URL
Blue Gene/P      IBM System Blue Gene Solution: Blue Gene/P       http://www.redbooks.ibm.com/abstracts/
                 System Administration                            sg247417.html
                 IBM System Blue Gene Solution: Blue Gene/P       http://www.redbooks.ibm.com/abstracts/
                 Safety Considerations                            redp4257.html
                 IBM System Blue Gene Solution: Blue Gene/P       http://www.redbooks.ibm.com/abstracts/
                 Application Development                          sg247287.html
                 Evolution of the IBM System Blue Gene Solution   http://www.redbooks.ibm.com/abstracts/
                                                                  redp4247.html




Blue Gene/L      IBM System Blue Gene Solution: System             http://www.redbooks.ibm.com/abstracts/
                 Administration                                    sg247178.html
                 Blue Gene/L: Hardware Overview and Planning       http://www.redbooks.ibm.com/abstracts/
                                                                   sg246796.html
                 IBM System Blue Gene Solution: Application        http://www.redbooks.ibm.com/abstracts/
                 Development                                       sg247179.html
                 Unfolding the IBM eServer™ Blue Gene Solution     http://www.redbooks.ibm.com/abstracts/
                                                                   sg246686.html


                       Table 37 lists the Blue Gene subtasks with a pointer to the associated instructions:
                       Table 37. Blue Gene subtasks and associated instructions
                       Subtask                                            Associated instructions (see . . . )
                       Configure LoadLeveler Blue Gene support            “Configuring LoadLeveler Blue Gene
                                                                          support”
                       Submit and monitor Blue Gene jobs                  “Submitting and monitoring Blue Gene jobs”
                                                                          on page 226


                       Table 38 lists the Blue Gene related topics and associated information:
                       Table 38. Blue Gene related topics and associated information
                       Related topic                                      Associated information (see . . . )
                       Configuration file: Blue Gene keywords             “Configuration file keyword descriptions”
                                                                          on page 265
                       Job command file: Blue Gene keywords               “Job command file keyword descriptions” on
                                                                          page 359
                       Commands and APIs                                  Chapter 16, “Commands,” on page 411 or
                                                                          Chapter 17, “Application programming
                                                                          interfaces (APIs),” on page 541
                       Diagnosis and messages                             TWS LoadLeveler: Diagnosis and Messages
                                                                          Guide



              Configuring LoadLeveler Blue Gene support
                       This is a list of the subtasks for configuring LoadLeveler Blue Gene support along
                       with a pointer to the associated instructions.

                       Table 39 lists the subtasks for configuring LoadLeveler Blue Gene support along
                       with a pointer to the associated instructions:
                       Table 39. Blue Gene configuring subtasks and associated instructions
                       Subtask                 Associated instructions (see . . . )
                       Configuring             “Steps for configuring LoadLeveler Blue Gene support” on page 158
                       LoadLeveler Blue
                       Gene support




                        Display information      v Use the llstatus command with the -b option to display
                        about the Blue Gene        information about the Blue Gene system, with the -B option
                        system                     to display information about Blue Gene base partitions, or
                                                   with the -P option to display information about Blue Gene
                                                   partitions.
                        Display information      v Use the llsummary command with the -l option to display job
                        about Blue Gene jobs       resource information.
                                                 v Use the llq command with the -b option to display information
                                                   about all Blue Gene jobs.


                        Steps for configuring LoadLeveler Blue Gene support
                        The primary task for configuring LoadLeveler Blue Gene support consists of
                        setting up the environment of the LoadL_negotiator daemon, the environment of
                        any process that will run Blue Gene jobs, and the LoadLeveler configuration file.

                        Perform the following steps to configure LoadLeveler Blue Gene support:
                        1. Configure the LoadL_negotiator daemon to run on a node which has access to
                           the Blue Gene Control System.
                        2. Enable Blue Gene support by setting the BG_ENABLED configuration file
                           keyword to true.
                        3. (Optional) Set any of the following additional Blue Gene related configuration
                           file keywords which your setup requires:
                           v BG_ALLOW_LL_JOBS_ONLY
                           v BG_CACHE_PARTITIONS
                           v BG_MIN_PARTITION_SIZE
                            v CM_CHECK_USERID
                           See “Configuration file keyword descriptions” on page 265 for more
                           information on these keywords.
                        4. Set the required environment variables for the LoadL_negotiator daemon and
                           any process that will run Blue Gene jobs. You can use global profiles to set the
                           necessary environment variables for all users. Follow these steps to set
                           environment variables for a LoadLeveler daemon:
                           a. Add the required environment variable settings to a global profile.
                           b. Set the environment as the administrator before invoking llctl start on the
                               central manager node.
                           c. Build a shell script that sets the required environment variables and starts
                               LoadLeveler; the script can then be invoked remotely using rsh. (A sketch
                               of such a script follows these steps.)

                            Note: Using the llctl -h or llctl -g command to start the central manager
                                  remotely will not carry the environment variables from the login session
                                  to the LoadLeveler daemons on the remote nodes.
                            v Specify the full path name of the bridge configuration file by setting the
                              BRIDGE_CONFIG_FILE environment variable. For details on the contents of
                              the bridge configuration file, see the Blue Gene/L: System Administration or
                              Blue Gene/P: System Administration book.
                              Example:
                              For ksh:
                              export BRIDGE_CONFIG_FILE=/var/bluegene/config/bridge.cfg

For csh:
               setenv BRIDGE_CONFIG_FILE /var/bluegene/config/bridge.cfg
             v Specify the full path name of the file containing the data required to access
               the Blue Gene Control System database by setting the DB_PROPERTY
               environment variable. For details on the contents of the database property
               file, see the Blue Gene/L: System Administration or Blue Gene/P: System
               Administration book.
               Example:
               For ksh:
               export DB_PROPERTY=/var/bluegene/config/db.cfg
               For csh:
               setenv DB_PROPERTY /var/bluegene/config/db.cfg
             v Specify the host name of the machine running the Blue Gene control system
               by setting the MMCS_SERVER_IP environment variable. For details on the
               use of this environment variable, see the Blue Gene/L: System Administration or
               Blue Gene/P: System Administration book.
               Example:
               For ksh:
               export MMCS_SERVER_IP=bluegene.ibm.com
               For csh:
               setenv MMCS_SERVER_IP bluegene.ibm.com
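
              Putting step 4 together, the wrapper script described in step 4c might look
              like the following sketch (ksh), which repeats the paths and host name from
              the examples above:

                 #!/bin/ksh
                 # Set the environment required by the LoadL_negotiator daemon
                 export BRIDGE_CONFIG_FILE=/var/bluegene/config/bridge.cfg
                 export DB_PROPERTY=/var/bluegene/config/db.cfg
                 export MMCS_SERVER_IP=bluegene.ibm.com
                 # Start LoadLeveler with the environment set above
                 llctl start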

    Blue Gene reservation support
|         Reservations support Blue Gene resources, including the Blue Gene compute nodes.
|         It is important to note that when the reservation includes Blue Gene nodes, it
|         cannot include conventional nodes. A front end node (FEN), which is used to start
|         a Blue Gene job, is not part of the Blue Gene resources. A Blue Gene reservation
|         only reserves Blue Gene resources and a Blue Gene job step bound to a reservation
|         uses the reserved Blue Gene resources and shares a FEN outside the reservation.

          Jobs using Blue Gene resources can be submitted to a Blue Gene reservation to run.
           A Blue Gene job step can also be used to select which Blue Gene resources to
           reserve, making sure the reservation will have enough Blue Gene resources to run
           the Blue Gene job step.

|         For more information about reservations, see “Overview of reservations” on page
|         25.

    Blue Gene fair share scheduling support
          Fair share scheduling has been extended to Blue Gene resources as well.

          The FAIR_SHARE_TOTAL_SHARES keyword in LoadL_config and the
          fair_shares keyword for the user and group stanza in LoadL_admin apply to both
          the CPU resources and the Blue Gene resources. When a Blue Gene job step ends,
          both the CPU utilization and the Blue Gene resource utilization data will be
          collected. The elapsed job running time multiplied by the number of C-nodes
          allocated to the job step (the Size Allocated field in the llq -l output) will be
          counted as the amount of Blue Gene resource used. The used shares of the Blue
          Gene resources are independent of the used shares of the CPU resources and are
          made available through the LoadLeveler variables UserUsedBgShares and
           GroupUsedBgShares. The LoadLeveler variable JobIsBlueGene indicates whether a
           job step is a Blue Gene job step. LoadLeveler administrators have flexibility
in specifying the behavior of fair share scheduling by using these variables in the
                        SYSPRIO expression. The llfs command and the related APIs can also handle
                        requests related to the Blue Gene resources.
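
           For example, a SYSPRIO expression that favors jobs whose user still has unused
           shares of the matching resource type might look like the following sketch, which
           combines the built-in JobIsBlueGene variable with the user-defined variables
           described in “Using fair share scheduling”:

              SYSPRIO : 10000000 * ($(UserHasShares) * $(JobIsNotBlueGene) + $(UserHasBgShares) * JobIsBlueGene) - QDate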

                        For more information about fair share scheduling, see “Using fair share
                        scheduling.”

             Blue Gene heterogeneous memory support
                        The LoadLeveler job command file has a bg_requirements keyword that can be
                        used to specify the requirements that a Blue Gene base partition must meet to
                        execute the job step.

                        The Blue Gene compute nodes (C-nodes) in the same base partition have the same
                        amount of physical memory. The C-nodes in different base partitions might have
                        different amounts of physical memory. The bg_requirements job command file
                        keyword allows users to specify the memory requirement on the Blue Gene
                        C-nodes.

                        The bg_requirements keyword works like the requirements keyword, but it can
                        only support memory requirements and applies only to Blue Gene base partitions.
                        For a Blue Gene job step, the requirements keyword value applies to the front end
                        node needed by the job step and the bg_requirements keyword value applies to
                        the Blue Gene base partitions needed by the job step.

             Blue Gene preemption support
                        Preemption support for Blue Gene jobs has been enabled.

                        Blue Gene jobs have the same preemption support as non-Blue Gene jobs. In a
                        typical Blue Gene system, many Blue Gene jobs share the same front end node
                        while dedicated Blue Gene resources are used for each job. To avoid preempting
                        Blue Gene jobs that use different Blue Gene resources as requested by a
                        preempting job, ENOUGH instead of ALL must be used in the PREEMPT_CLASS
                        rules for Blue Gene job preemption.

                        For more information about preemption, see “Preempting and resuming jobs” on
                        page 126.

             Blue Gene/L HTC partition support
                        The allocation of High Throughput Computing (HTC) partitions on Blue Gene/L is
                        supported when the LoadLeveler BG_CACHE_PARTITIONS configuration
                        keyword is set to false.

                        See the following IBM System Blue Gene Solution Redbooks (dated April 27, 2007)
                        for more information about Blue Gene/L HTC support:
                        v IBM Blue Gene/L: System Administration, SG24-7178
                        v IBM Blue Gene/L: Application Development, SG24-7179

Using fair share scheduling
                        Fair share scheduling in LoadLeveler provides a way to divide resources in a
                        LoadLeveler cluster among users or groups of users.

                        To fairly share cluster resources, LoadLeveler can be configured to allocate a
       proportion of the resources to each user or group and to let job priorities be
adjusted based on how much of the resources have been used and when they were
      used. Generally speaking, LoadLeveler should be configured so that job priorities
      decrease for a user or group that has recently used more resources than the
      allocated proportion and job priorities should increase for a user or group that has
      not run any jobs recently.

      Administrators can configure the behavior of fair share scheduling through a set of
      configuration keywords. They can also query fair share information, save a
      snapshot of historic data, reset and restore fair share scheduling, and perform other
      functions by using the LoadLeveler llfs command, the GUI, and the corresponding
      APIs.

      Fair share scheduling also includes Blue Gene resources (see “Blue Gene fair share
      scheduling support” on page 159 for more information).

      Note: The time of day clocks on all of the nodes in the cluster must be
            synchronized in order for fair share scheduling to work properly.

      For more information, see the following:
      v “llfs - Fair share scheduling queries and operations” on page 450
      v Corresponding APIs:
        – “ll_fair_share subroutine” on page 642
        – “Data access API” on page 560
      v Keywords:
         – fair_shares
         – FAIR_SHARE_TOTAL_SHARES
         – FAIR_SHARE_INTERVAL
      v SYSPRIO expression

Fair share scheduling keywords
      The FAIR_SHARE_TOTAL_SHARES global configuration file keyword is used to
      specify the total number of shares that each type of resource is divided into.

      The fair_shares keyword in a user or group stanza in the administration file
      specifies how many shares the user or group is allocated. The ratio of the
      fair_shares keyword value in a user or group stanza over the
      FAIR_SHARE_TOTAL_SHARES keyword value defines the resource usage
      proportion for the user or group. For example, if a user is allocated one third of
      the cluster resources, then the ratio of the user’s fair_share value over the
      FAIR_SHARE_TOTAL_SHARES keyword value should be one third.

      The LoadLeveler SYSPRIO expression can be configured to let job priorities change
      to achieve the specified resource usage proportions. Besides changing job priorities,
      fair share scheduling does not change in any way how LoadLeveler schedules jobs.
       If a job can be scheduled to run, it will be run regardless of whether the owner
       and the LoadLeveler group of the job have any shares allocated. No matter
      how many shares are allocated to a user, if the user does not submit any jobs to
      run, then the resource usage proportion for that user cannot be achieved and other
      users might be able to use more than their allocated proportions.

      Note: The sum of all allocated shares for users or groups does not have to equal
             the value of the FAIR_SHARE_TOTAL_SHARES keyword. The share
allocation can be used as a way to prevent a single user from consuming too
                               much of the cluster resources and as a way to share the resources as fairly
                               as possible.

                        When the value of the FAIR_SHARE_TOTAL_SHARES keyword is greater than 0,
                        fair share scheduling is on, which means that resource usage data is collected
                        when every job ends, regardless of the fair_shares values for any user or group.
                        The collected usage data is converted to used shares for each user and group. The
                        llfs command can be used to display the allocated and used shares. Turning fair
                        share scheduling on does not mean that job priorities are affected by fair share
                        scheduling. You have to configure the SYSPRIO expression to let fair share
                        scheduling affect job priorities in a way that suits your needs. By default, the value
                        of the FAIR_SHARE_TOTAL_SHARES keyword is 0 and fair share scheduling is
                        disabled.

                        There is a built-in decay mechanism for the historic resource usage data that is
                        collected when jobs end, that is, the initial resource usage value becomes smaller
       and smaller as time goes by. This decay mechanism allows the most recent
                        resource usage to have more impact on fair share scheduling. The
                        FAIR_SHARE_INTERVAL global configuration file keyword is used to specify
                        how fast the decay is. The shorter the interval, the faster the historic data decays.
                        A resource usage value decays to 5% of its initial value after an elapsed time
                        period of the same length as the FAIR_SHARE_INTERVAL value. Generally, the
                        interval should be at least several times larger than the typical job running time in
                        the cluster to get stable results. A value should be chosen corresponding to how
                        long the historic resource usage data should have an impact on the current job
                        priorities.
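
       For example, FAIR_SHARE_INTERVAL = 240 (hours) corresponds to roughly 10 days:
       usage recorded now decays to 5% of its initial value 240 hours later. If the
       decay is exponential, this implies that usage recorded t hours ago carries an
       approximate weight of 0.05^(t/FAIR_SHARE_INTERVAL).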

                        The LoadLeveler SYSPRIO expression is used to calculate job priorities. A set of
                        LoadLeveler variables including some related to fair share scheduling can be used
                        in the SYSPRIO expression in the global configuration file. You can define the
                        SYSPRIO expression to let fair share scheduling influence the job priorities in a
                        way that is suitable to your needs. For more information, see the SYSPRIO
                        expression in Chapter 12, “Configuration file reference,” on page 263.

                        When the GroupTotalShares, GroupUsedShares, UserTotalShares,
                        UserUsedShares, UserUsedBgShares, GroupUsedBgShares, and JobIsBlueGene
                        and their corresponding user-defined variables are used, you must use the
                        NEGOTIATOR_RECALCULATE_SYSPRIO_INTERVAL global configuration
                        keyword to specify a time interval at which the job priorities will be recalculated
                        using the most recent share usage information.

                        You can add the following user-defined variables to the LoadL_config global
                        configuration file to make it easier to specify fair share scheduling in the SYSPRIO
                        expressions:
                        v GroupRemainingShares = (GroupTotalShares - GroupUsedShares)
                        v GroupHasShares = ($(GroupRemainingShares) > 0)
                        v GroupSharesExceeded = ($(GroupRemainingShares) <= 0)
                        v UserRemainingShares = (UserTotalShares - UserUsedShares)
                        v UserHasShares = ($(UserRemainingShares) > 0)
                        v UserSharesExceeded = ($(UserRemainingShares) <= 0)
                        v UserRemainingBgShares = ( UserTotalShares - UserUsedBgShares)
                        v UserHasBgShares = ( $(UserRemainingBgShares) > 0)
                        v UserBgSharesExceeded = ( $(UserRemainingBgShares) <= 0)
                        v GroupRemainingBgShares = ( GroupTotalShares - GroupUsedBgShares)
                        v GroupHasBgShares = ( $(GroupRemainingBgShares) > 0)
v GroupBgSharesExceeded = ( $(GroupRemainingBgShares) <= 0)
      v JobIsNotBlueGene = ! JobIsBlueGene

       If fair share scheduling is not turned on, either because the
       FAIR_SHARE_TOTAL_SHARES keyword value is not positive or because the scheduler
       type is not BACKFILL, then the variables will have the following values:
      GroupTotalShares: 0
      GroupUsedShares: 0
      $(GroupRemainingShares): 0
      $(GroupHasShares): 0
      $(GroupSharesExceeded): 1
      UserUsedBgShares: 0
      $(UserRemainingBgShares): 0
      $(UserHasBgShares): 0
      $(UserBgSharesExceeded): 1

      If a user has the fair_shares keyword set to 10 in its user stanza and the user has
      used up 8 CPU shares and 3 Blue Gene shares, then the variables will have the
      following values:
      UserTotalShares: 10
      UserUsedShares: 8
      $(UserRemainingShares): 2
      $(UserHasShares): 1
      $(UserSharesExceeded): 0
      UserUsedBgShares: 3
      $(UserRemainingBgShares): 7
      $(UserHasBgShares): 1
      $(UserBgSharesExceeded): 0

      If a group has the fair_shares keyword set to 10 in its group stanza and the group
      has used up 15 CPU shares and 0 Blue Gene shares, then the variables will have
      the following values:
      GroupTotalShares: 10
      GroupUsedShares: 15
      $(GroupRemainingShares): -5
      $(GroupHasShares): 0
      $(GroupSharesExceeded): 1
      GroupUsedBgShares: 0
      $(GroupRemainingBgShares): 10
      $(GroupHasBgShares): 1
      $(GroupBgSharesExceeded): 0

       The following variables have these values for a Blue Gene job step:
      JobIsBlueGene: 1
      $(JobIsNotBlueGene): 0

       The following variables have these values for a non-Blue Gene job step:
      JobIsBlueGene: 0
      $(JobIsNotBlueGene): 1

Reconfiguring fair share scheduling keywords
      LoadLeveler configuration and administration files can be modified to assign new
      values to various keywords.

      After files have been modified, issue the llctl -g reconfig command to read in the
      new keyword values. All new keywords introduced for fair share scheduling
      become effective right after reconfiguration.



Reconfiguring when the Schedd daemons are up
                        To avoid any inconsistency, change the value of the FAIR_SHARE_INTERVAL
                        keyword while the central manager and all Schedd daemons are up, then do the
                        reconfiguration.

                        After the reconfiguration, the following will happen:
                        v All historic fair share scheduling data will be decayed to the current time using
                          the old value.
                         v The old value is replaced with the new value.
                         v The new value will be used from then on.

                        Note:
                                1. You must have the same value for the FAIR_SHARE_INTERVAL
                                   keyword in the central manager and the Schedd daemons because the
                                   FAIR_SHARE_INTERVAL keyword determines the rate of decay for the
                                   historic fair share data and the same value on the daemons maintains the
                                   data consistency.
                                2. There are some LoadLeveler configuration parameters that require
                                   restarting LoadLeveler with llctl recycle for changes to take effect. You
                                   can use llctl recycle when changing fair share parameters also. The effect
                                   will be the same as using llctl reconfig because when the Schedd
                                   machine shuts down normally, the fair share scheduling data will be
                                   decayed to the time of the shutdown and it will be saved.

                        Reconfiguring when the Schedd daemons are down
                        The value for the FAIR_SHARE_INTERVAL keyword may need to be changed
                        while a Schedd daemon is down.

                        If the value for the FAIR_SHARE_INTERVAL keyword has to be changed while a
                        Schedd daemon is down, the following will happen when the Schedd daemon is
                        restarted:
                        v All historic fair share scheduling data will be read in from the disk files in the
                           $(SPOOL) directory with no change.
                        v When a new job ends, the historic fair share scheduling data for the owner and
                           the LoadLeveler group of the job will be updated using the new value and then
                           sent to the central manager. The new value is used effectively from the time the
                           data was last updated before the Schedd went down, not from the time of the
                           reconfiguration as it would normally be.

             Example: three groups share a LoadLeveler cluster
                        This example in which three groups share a LoadLeveler cluster may apply to your
                        situation.

                        For purposes of this example, we will assume the following:
                        v Three groups of users share a LoadLeveler cluster and each group is to have one
                          third of the resources
                        v Historic data will have significant impact for about 10 days
                        v Groups with unused shares will have much higher job priorities than the groups
                          which have used up their shares
                         To set up fair share scheduling with these assumptions, an administrator could
                         update the LoadL_config global configuration file as follows:



FAIR_SHARE_TOTAL_SHARES = 99

      FAIR_SHARE_INTERVAL = 240

      NEGOTIATOR_RECALCULATE_SYSPRIO_INTERVAL = 300

      GroupRemainingShares = ( GroupTotalShares - GroupUsedShares )

      GroupHasShares = ( $(GroupRemainingShares) > 0 )

      SYSPRIO : 10000000 * $(GroupHasShares) - QDate

      In the admin file LoadL_admin, add:
      chemistry: type = group

        include_users = harold mark kim enci george charlie

        fair_shares = 33

      physics: type = group

        include_users = cnyang gchen newton roy

        fair_shares = 33

      math: type = group

        include_users = rich dave chris popco
        fair_shares = 33

      When user rich in the math group wants to submit a job, the following keyword
      can be put into the job command file so that the job will have high priority
      through the math group:
      #@group=math

       If user rich has a job that does not need to run right away (it can run at any
       time), then he should run the job in a LoadLeveler group with no shares
       allocated (for example, the No_Group group). Because the group No_Group has no
       shares allocated to it in this example, $(GroupHasShares) has a value of 0 and the
       job's priority will be lower than that of jobs whose group has unused shares. The
       job will run when all higher priority jobs are done, or earlier if it can be used
       to backfill a higher priority job.

Example: two thousand students share a LoadLeveler cluster
      This example in which two thousand students share a LoadLeveler cluster may
      apply to your situation.

      For purposes of this example, we will assume the following:
      v A university has 2000 students who share a LoadLeveler cluster and every
        student is to have the same number of shares of the resources.
      v Historic data will have significant impact for about 7 days (because
        FAIR_SHARE_INTERVAL is not specified and the default value is 7 days).
       v A student with unused shares is to have somewhat higher job priorities, with
         priorities decreasing as the number of used shares increases.
      The LoadL_config global configuration file should contain the following:




FAIR_SHARE_TOTAL_SHARES = 10000

                        NEGOTIATOR_RECALCULATE_SYSPRIO_INTERVAL = 600

                        UserRemainingShares = ( UserTotalShares - UserUsedShares )

                        SYSPRIO : 100000 * $(UserRemainingShares) - QDate

                        In the LoadL_admin admin file, add
                        default: type = user

                          fair_shares = 5

                         Note: The value fair_shares = 5 is the result of dividing the total number of
                               shares by the number of students (10000 ÷ 2000). The number of students
                               can be more or less than 2000, but the same configuration parameters still
                               prevent a single user from using too much of the cluster resources in a
                               short time period.

                        We can see from the SYSPRIO expression that the larger the number of unused
                        shares for a student and the earlier the job is submitted, the higher the priority is
                        for the student’s job.

             Querying information about fair share scheduling
                        The llfs command, the GUI, and the data access API can be used to query
                        information about fair share scheduling.

                         The llfs command without any options displays the allocated and used shares for
                         all users and LoadLeveler groups that have run one or more jobs to completion in
                         the cluster. The -u and -g options can show the allocated and used shares for any
                        user or LoadLeveler group regardless of whether they have run any jobs in the
                        cluster. In either case, the user or group need not have any fair_shares allocated in
                        the LoadL_admin administration file for the usage to be reported by the llfs
                        command.
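
                         For example (the user and group names are hypothetical):

                            llfs             Display shares for all users and groups that have run jobs
                            llfs -u rich     Display allocated and used shares for user rich
                            llfs -g math     Display allocated and used shares for group math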

             Resetting fair share scheduling
                        The llfs -r command option (or the GUI option Reset historic data), by default,
                        will start fair share scheduling from the beginning, which means that all the
                        previous historic data will be lost.

                         This command will not run unless all Schedd daemons are up and running; if a
                         Schedd daemon is down when this command option is issued, the request will not
                         be processed. To manually reset fair share scheduling, bring down the LoadLeveler
                         cluster, remove all fair share data files (fair_share_queue.dir and
                         fair_share_queue.pag) in the $(SPOOL) directory, and then restart the LoadLeveler
                         cluster.

             Saving historic data
                        The LoadLeveler central manager holds the complete historic fair share data when
                        it is up.

                        Every Schedd holds a portion of the historic fair share data and the data is stored
                        on disk in the $(SPOOL) directory. When the central manager is restarted, it
                        receives the historic fair share data from every Schedd. If a Schedd machine is
                        down temporarily and the central manager remains up, the data in the central
               manager is not affected. In case a Schedd machine is permanently damaged and
the central manager restarts, the central manager will not be able to get all of the
              historic fair share data because the data stored on the damaged Schedd is lost. If
              the value of FAIR_SHARE_INTERVAL is very large, many days of data on the
              damaged Schedd could be lost. To reduce the loss of data, the historic fair share
              data in the central manager can be saved to disk periodically. Recovery can be
              done using the latest saved data when a Schedd machine is permanently out of
              service. The llfs -s command, the GUI, or the ll_fair_share API can be used to save
              a snapshot of the historic data in the central manager to a file.

        Restoring saved historic data
              You can use the llfs -r command option, the GUI, or the ll_fair_share API to
              restore fair share scheduling to a previously saved state.

              For the file name, specify a file you saved previously using llfs -s.

              If the central manager goes down and restarts again, the historic data stored in an
              out of service Schedd machine is not reported to the central manager. If the Schedd
              machine will not be brought back to service at all, then the administrator can
              consider restoring fair share scheduling to a state corresponding to the latest saved
              file.
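
               For example, a snapshot could be saved periodically and restored later if a
               Schedd machine is lost; the file name is hypothetical:

                  llfs -s /var/loadl/fair_share.snapshot    Save a snapshot of the historic data
                  llfs -r /var/loadl/fair_share.snapshot    Restore fair share scheduling from it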

Procedure for recovering a job spool
              The llmovespool command is intended for recovery purposes only.

              Jobs being managed by a down Schedd are unable to clean up resources or move
              to completion. These jobs need their job records transferred to another Schedd. The
              llmovespool command moves the job records from the spool of one managing
              Schedd to another managing Schedd in the local cluster. All moved jobs retain
              their original job identifiers.

              It is very important that the Schedd that created the job records to be moved is not
              running during the move operation. Jobs within the job queue database will be
              unrecoverable if the job queue is updated during the move by any process other
              than the llmovespool command.

               The llmovespool command operates on a set of job records; these records are
               updated as the command executes. When a job is successfully moved, the records
              for that job are deleted. Job records that are not moved because of a recoverable
              failure, like the original Schedd not being fenced, may have the llmovespool
              command executed against them again. It is very important that a Schedd never
              reads the job records from the spool being moved. Jobs will be unrecoverable if
              more than one Schedd is considered to be the managing Schedd.

              The procedure for recovering a job spool is:
              1. Move the files located in the spool directory to be transferred to another
                 directory before entering the llmovespool command in order to guarantee that
                 no other Schedd process is updating the job records.
              2. Add the statement schedd_fenced=true to the machine stanza of the original
                 Schedd node in order to guarantee that the central manager ignores
                 connections from the original managing Schedd, and to prevent conflicts from
                 arising if the original Schedd is restarted after the llmovespool command has
                 been run. See the schedd_fenced=true keyword in Chapter 13, “Administration
                 file reference,” on page 321 for more information.
3. Reconfigure the central manager node so that it recognizes that the original
                           Schedd is “fenced”.
                        4. Issue the llmovespool command providing the spool directory where the job
                           records are stored. The command displays a message that the transfer has
                           started and reports status for each job as it is processed. For more information
                           about the llmovespool command, see “llmovespool - Move job records” on
                           page 472. For more information about the ll_move_spool API, see
                           “ll_move_spool subroutine” on page 683.
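
                         For example, the complete recovery sequence might look like the following
                         sketch, where the host names, directory paths, and stanza placement are
                         illustrative (see the llmovespool reference for the exact invocation syntax):

                            # 1. Preserve the job records so that no Schedd process can update them
                            mv /var/loadl/spool/* /tmp/spool.save/
                            # 2. In the administration file, fence the original Schedd:
                            #       schedd1: type = machine
                            #                schedd_fenced = true
                            # 3. Reconfigure so that the central manager ignores the fenced Schedd
                            llctl -h cm_node reconfig
                            # 4. Move the preserved job records to another managing Schedd
                            llmovespool /tmp/spool.save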

Chapter 7. Using LoadLeveler’s GUI to perform administrator tasks
|                  Note: This is the last release that will provide the Motif-based graphical user
|                  interface xloadl. The function available in xloadl has been frozen since TWS
|                  LoadLeveler 3.3.2.

                    The end user can perform many tasks more efficiently using the graphical
                    user interface (GUI), but there are certain tasks that end users cannot
                    perform unless they have the proper authority.

                   If you are defined as a LoadLeveler administrator in the LoadLeveler configuration
                   file then you are immediately granted administrative authority and can perform
                   the administrative tasks discussed in this topic. To find out how to grant someone
                   administrative authority, see “Defining LoadLeveler administrators” on page 43.

                   You can access LoadLeveler administrative commands using the Admin pull-down
                   menu on both the Jobs window and the Machines window of the GUI. The Admin
                   pull-down menu on the Jobs window corresponds to the command options
                   available in the llhold, llfavoruser, and llfavorjob commands. The Admin
                   pull-down menu on the Machines window corresponds to the command options
                   available in the llctl command.
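
                    For example, the following commands perform equivalent actions from the
                    command line (the user name, job step ID, and host name are illustrative):

                       llfavoruser carol            # favor the jobs of user carol
                       llhold -s ll6.23.0           # place a system hold on a job step
                       llctl -h node01 recycle      # stop and restart the daemons on node01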

                   The main window of the GUI has three sub-windows: one for job status with
                   pull-down menus for job-related commands, one for machine status with
                   pull-down menus for machine-related commands, and one for messages and logs
                    (see “The LoadLeveler main window” on page 404 in Chapter 15, “Graphical
                    user interface (GUI) reference,” on page 403). A variety of facilities
                    allow you to sort and select the items displayed.

    Job-related administrative actions
                   You access the administrative commands that act on jobs through the Admin
                   pull-down menu in the Jobs window of the GUI.

                   You can perform the following tasks with this menu:
                   Favor Users
                      Allows you to favor users. This means that you can select one or more users
                      whose jobs you want to move up in the job queue. This corresponds to the
                      llfavoruser command.
                       Select Admin from the Jobs window
                       Select Favor User
                                  The Order by User window appears.
                       Type in
                                 The name of the user whose jobs you want to favor.
                       Press     OK
Unfavor Users
                            Allows you to unfavor users. This means that you can unfavor the jobs of
                            users that you previously favored. This corresponds to the llfavoruser
                            command.
                            Select Admin from the Jobs window
                            Select Unfavor User
                                       The Order by User window appears.
                            Type in
                                       The name of the user whose jobs you want to unfavor.
                            Press     OK
                        Favor Jobs
                           Allows you to select a job that you want to favor. This corresponds to the
                           llfavorjob command.
                            Select One or more jobs from the Jobs window
                            Select Admin from the Jobs window
                            Select Favor Job
                                       The selected jobs are favored.
                            Press     OK
                        Unfavor Jobs
                            Allows you to select a job that you want to unfavor. This corresponds to the
                           llfavorjob command.
                            Select One or more jobs from the Jobs window
                            Select Admin from the Jobs window
                            Select Unfavor Job
                                       Unfavors the jobs that you previously selected.
                        Syshold
                           Allows you to place a system hold on a job. This corresponds to the llhold
                           command.
                            Select A job from the Jobs window
                            Select Admin pull-down menu from the Jobs window
                            Select Syshold to place a system hold on the job.
                        Release From Hold
                           Allows you to release the system hold on a job. This corresponds to the llhold
                           command.
                            Select A job from the Jobs window
                            Select Admin pull-down menu from the Jobs window
                            Select Release From Hold to release the system hold on the job.
                        Preempt
                             Available only when using the BACKFILL or external schedulers. Preempt allows
                            you to place the selected jobs in preempted state. This action corresponds to
                            the llpreempt command.
                            Select One or more jobs from the Jobs window
Select Admin pull-down menu from the Jobs window
   Select Preempt
Resume Preempted Job
   Available only when using the BACKFILL or external schedulers. Resume
   Preempted Job allows you to remove user-initiated preemption (initiated using
   the Preempt menu option or the llpreempt command) from the selected jobs.
   This action corresponds to the llpreempt -r command.
   Select One or more jobs from the Jobs window
   Select Admin pull-down menu from the Jobs window
   Select Resume Preempted Job
Prevent Preempt
    Available only when using the BACKFILL or API scheduler. Prevent Preempt
    allows you to place the selected running job into a non-preemptable state.
    When the BACKFILL or API scheduler is in use, this is equivalent to the
    llmodify -p nopreempt command.
   Select One job from the Jobs window
   Select Admin pull-down menu from the Jobs window
   Select Prevent Preempt
Allow Preempt
    Available only when using the BACKFILL or API scheduler, Allow Preempt
    makes the unpreemptable job preemptable again. When the BACKFILL or API
    scheduler is in use, this is equivalent to the llmodify -p preempt command.
   Select One or more jobs from the Jobs window
   Select Admin pull-down menu from the Jobs window
   Select Allow Preempt
Extend Wallclock Limits
    Allows you to extend the wallclock limits by the number of minutes specified.
    This corresponds to the llmodify -W command.
   Select Admin pull-down window from the Jobs window
   Select Extend Wallclock Limit
              The Extend Wallclock Limits window appears.
   Type in
             The number of minutes to extend the wallclock limit.
   Press     OK
Modify Job Priority
  Allows you to modify the system priority of a job step. This corresponds to the
  llmodify -s command.
   Select Admin pull-down window from the Jobs window
   Select Modify Job Priority
              The Modify Job Priority window appears.
   Type in
             An integer value for system priority.
   Press     OK
Move to another cluster
                           Allows you to move an idle job from the local cluster to another. This menu
                           item appears only when a multicluster environment is configured. It
                           corresponds to the llmovejob command.
                             Select Admin pull-down window from the Jobs window
                             Select Move to another cluster
                                       The Move Job to Another Cluster window appears.
                            Select The name of the target cluster.
                            Press    OK

Machine-related administrative actions
                        You access the administrative commands that act on machines using the Admin
                        pull-down menu in the Machines window of the GUI.

                        Using the GUI pull-down menu, you can perform the tasks described in this topic.
                        Start All
                            Starts LoadLeveler on all machines listed in machine stanzas beginning with
                            the central manager. Submit-only machines are skipped. Use this option when
                            specifying alternate central managers in order to ensure the primary central
                            manager starts before any alternate central manager attempts to serve as
                            central manager.
                            Select Admin from the Machines window.
                            Select Start All
                        Start LoadLeveler
                            Allows you to start LoadLeveler on selected machines.
                            Select One or more machines on which you want to start LoadLeveler.
                            Select Admin from the Machines window.
                            Select Start LoadLeveler
                        Start Drained
                            Allows you to start LoadLeveler with startd drained on selected machines.
                            Select One or more machines on which you want startd drained.
                            Select Admin from the Machines window.
                            Select Start Drained
                        Stop LoadLeveler
                           Allows you to stop LoadLeveler on selected machines.
                            Select One or more machines on which you want to stop LoadLeveler.
                            Select Admin from the Machines window.
                            Select Stop LoadLeveler.
                        Stop All
                           Stops LoadLeveler on all machines listed in machine stanzas. Submit-only
                           machines are skipped.
                            Select Admin from the Machines window.
                            Select Stop All
Reconfig
    Forces all daemons to reread the configuration files.
   Select The machine on which you want to operate. To reconfigure this xloadl
          session, choose reconfig but do not select a machine.
   Select Admin from the Machines window.
   Select reconfig
Recycle
   Stops all LoadLeveler daemons and restarts them.
   Select The machine on which you want to operate.
   Select Admin from the Machines window.
   Select recycle
Configuration Tasks
   Starts Configuration Tasks wizard
   Select Admin from the Machines window.
   Select Config Tasks

   Note: Use the invoking script lltg to start the wizard outside of xloadl. This
   option will appear on the pull-down only if the LoadL.tguides fileset is
   installed.
Drain
    Prevents any new LoadLeveler jobs from starting on this machine, but allows
    running jobs to complete.
   Select The machine on which you want to operate.
   Select Admin from the Machines window.
   Select drain.
           A cascading menu allows you to select either daemons, Schedd, startd,
           or startd by class. If you select daemons, both the startd and the
           Schedd on the selected machine will be drained. If you select Schedd,
           only the Schedd on the selected machine will be drained. If you select
           startd, only the startd on the selected machine will be drained. If you
           select startd by class, a window appears which allows you to select
           classes to be drained.
Flush
    Terminates running jobs on this host and sends them back to the system queue
    to await redispatch. No new jobs are redispatched to this machine until resume
    is issued. Forces a checkpoint if jobs are enabled for checkpointing.
   Select The machine on which you want to operate.
   Select Admin from the Machines window.
   Select flush
Suspend
   Suspends all jobs on this host.
   Select The machine on which you want to operate.
   Select Admin from the Machines window.
   Select suspend
Resume
                           Resumes all jobs on this machine.
                            Select The machine on which you want to operate.
                            Select Admin from the Machines window
                            Select resume
                                      A cascading menu allows you to select either daemons, Schedd, startd,
                                      or startd by class. If you select daemons, both the startd and the
                                      Schedd on the selected machine will be resumed. If you select Schedd,
                                      only the Schedd on the selected machine will be resumed. If you select
                                      startd, only the startd on the selected machine will be resumed. If you
                                      select startd by class, a window appears which allows you to select
                                      classes to be resumed.
                        Capture Data
                           Collects information on the machines selected.
                            Select The machine on which you want to operate.
                            Select Admin from the Machines window.
                            Select Capture Data.
                        Collect Account Data
                            Collects accounting data on the machines selected.
                            Select The machine on which you want to operate.
                            Select Admin from the Machines window.
                            Select Collect Account Data.
                                     A window appears prompting you to enter the name of the directory
                                     in which you want the collected data stored.
                        Collect Reservation Data
                            Collects reservation data on the machines selected.
                            Select The machine on which you want to operate.
                            Select Admin from the Machines window.
                            Select Collect Reservation Data.
                                     A window appears prompting you to enter the name of the directory
                                     in which you want the collected data stored.
                        Create Account Report
                           Creates an accounting report for you.
                            Select Admin → Create Account Report...
                                     Note: If you want to receive an extended accounting report, select the
                                     extended cascading button.
                                     A window appears prompting you to enter the following information:
                                     v A short, long, or extended version of the output. The short version is
                                       the default.
                                     v The user ID
                                     v The class name
                                     v The LoadL (LoadLeveler) group name
                                     v The UNIX group name
                                     v The Allocated host
                                     v The job ID
                                     v The report Type
v The section
             v A start and end date for the report. If no date is specified, all
               of the available data is reported.
            v The name of the input data file.
             v The name of the output data file. If no file is specified, the report
               is written to standard output.
    Press   OK
            The window closes and you return to the main window. The report
            appears in the Messages window if no output data file was specified.
Move Spool
  Moves the job records from the spool of one managing Schedd to another
  managing Schedd in the local cluster. This is intended for recovery purposes
  only.
    Select One Schedd machine from the Machines window.
    Select Admin from the Machines window.
    Select Move Spool
            A window is displayed prompting you to enter the directory
            containing the job records to be moved.
    Press   OK
Version
    Displays version and release data for LoadLeveler on the machines selected in
    an information window.
    Select The machine on which you want to operate.
    Select Admin from the Machines window.
    Select version
Fair Share Scheduling
    Provides fair share scheduling functions (see “llfs - Fair share scheduling
    queries and operations” on page 450).
    Select Admin from the Machines window.
    Select Fair Share Scheduling
    A cascading menu allows you to select one of the following:
    v Show
      Displays fair share scheduling information for all users or for specified users
      and groups.
    v Save historic data
      Saves fair share scheduling information into the directory specified.
    v Restore historic data
      Restores fair share scheduling data to a state corresponding to a file
      previously saved by Save historic data or the llfs -s command.
    v Reset historic data
      Erases all historic CPU data to reset fair share scheduling.

Part 3. Submitting and managing TWS LoadLeveler jobs
            After an administrator installs IBM Tivoli Workload Scheduler (TWS) LoadLeveler
            and customizes the environment, general users can build and submit jobs to
            exploit the many features of the TWS LoadLeveler runtime environment.

Chapter 8. Building and submitting jobs
              Learn more about building and submitting jobs.

               The topics listed in Table 40 will help you learn about building and submitting jobs:
               Table 40. Learning about building and submitting jobs
               To learn about:                          Read the following:
               Creating and submitting serial and       Chapter 8, “Building and submitting jobs”
               parallel jobs
               Controlling and monitoring TWS           Chapter 9, “Managing submitted jobs,” on page
               LoadLeveler jobs                         229
               Ways to control or monitor TWS           v Chapter 16, “Commands,” on page 411
               LoadLeveler operations by using the      v Chapter 10, “Example: Using commands to
               TWS LoadLeveler commands, GUI,             build, submit, and manage jobs,” on page 235
               and APIs                                 v Chapter 11, “Using LoadLeveler’s GUI to build,
                                                          submit, and manage jobs,” on page 237
                                                        v Chapter 17, “Application programming
                                                          interfaces (APIs),” on page 541


              Table 41 lists the tasks that general users perform to run LoadLeveler jobs.
              Table 41. Roadmap of user tasks for building and submitting jobs
              To learn about:                   Read the following:
              Building jobs                     v “Building a job command file”
                                                v “Editing job command files” on page 185
                                                v “Defining resources for a job step” on page 185
                                                v “Working with coscheduled job steps” on page 187
                                                v “Using bulk data transfer” on page 188
                                                v “Preparing a job for checkpoint/restart” on page 190
                                                v “Preparing a job for preemption” on page 193
              Submitting jobs                   v “Submitting a job command file” on page 193
                                                v “llsubmit - Submit a job” on page 531
              Working with parallel jobs        “Working with parallel jobs” on page 194
              Working with reserved node        “Working with reservations” on page 213
              resources and the jobs that use
              them
              Correctly specifying job          Chapter 14, “Job command file reference,” on page 357
              command file keywords



Building a job command file
               Before you can submit a job or perform any other job-related tasks, you need to
              build a job command file.

              A job command file describes the job you want to submit, and can include
              LoadLeveler keyword statements. For example, to specify a binary to be executed,
you can use the executable keyword, which is described later in this topic. To
                        specify a shell script to be executed, the executable keyword can be used; if it is
                        not used, LoadLeveler assumes that the job command file itself is the executable.

                        The job command file can include the following:
                        v LoadLeveler keyword statements: A keyword is a word that can appear in job
                          command files. A keyword statement is a statement that begins with a
                          LoadLeveler keyword. These keywords are described in “Job command file
                          keyword descriptions” on page 359.
                        v Comment statements: You can use comments to document your job command
                          files. You can add comment lines to the file as you would in a shell script.
                        v Shell command statements: If you use a shell script as the executable, the job
                          command file can include shell commands.
                        v LoadLeveler variables: See “Job command file variables” on page 399 for more
                          information.

                        You can build a job command file either by using the Build a Job window on the
                        GUI or by using a text editor.
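
                         For example, a minimal job command file for a serial job might look like the
                         following sketch, where the file and program names are illustrative:

                            # @ job_name   = my_serial_job
                            # @ executable = /u/user/bin/myprog
                            # @ input      = myprog.in
                            # @ output     = myprog.out
                            # @ error      = myprog.err
                            # @ queue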

             Using multiple steps in a job command file
                        To specify a stream of job steps, you need to list each job step in the job command
                        file.

                        You must specify one queue statement for each job step. Also, the executables for
                        all job steps in the job command file must exist when you submit the job. For most
                        keywords, if you specify the keyword in a job step of a multi-step job, its value is
                         inherited by all subsequent job steps. Exceptions to this are noted in the keyword
                        description.

                        LoadLeveler treats all job steps as independent job steps unless you use the
                        dependency keyword. If you use the dependency keyword, LoadLeveler
                        determines whether a job step should run based upon the exit status of the
                        previously run job step.

                        For example, Figure 19 on page 181 contains two separate job steps. Notice that
                        step1 is the first job step to run and that step2 is a job step that runs only if step1
                        exits with the correct exit status.

#   This job command file lists two job steps called "step1"
     #   and "step2". "step2" only runs if "step1" completes
     #   with exit status = 0. Each job step requires a new
     #   queue statement.
     #
     #   @   step_name = step1
     #   @   executable = executable1
     #   @   input = step1.in1
     #   @   output = step1.out1
      #   @   error = step1.err1
     #   @   queue
     #   @   dependency = (step1 == 0)
     #   @   step_name = step2
     #   @   executable = executable2
     #   @   input = step2.in1
     #   @   output = step2.out1
     #   @   error = step2.err1
     #   @   queue

     Figure 19. Job command file with multiple steps

     In Figure 19, step1 is called the sustaining job step. step2 is called the dependent job
     step because whether or not it begins to run is dependent upon the exit status of
      step1. A single sustaining job step can have more than one dependent job step,
     and a dependent job step can also have job steps dependent upon it.

     In Figure 19, each job step has its own executable, input, output, and error
     statements. Your job steps can have their own separate statements, or they can use
     those statements defined in a previous job step. For example, in Figure 20, step2
     uses the executable statement defined in step1:

     #   This job command file uses only one executable for
     #   both job steps.
     #
     #   @   step_name = step1
     #   @   executable = executable1
     #   @   input = step1.in1
     #   @   output = step1.out1
     #   @   error = step1.err1
     #   @   queue
     #   @   dependency = (step1 == 0)
     #   @   step_name = step2
     #   @   input = step2.in1
     #   @   output = step2.out1
     #   @   error = step2.err1
     #   @   queue

     Figure 20. Job command file with multiple steps and one executable

Examples: Job command files
     These examples of job command files may apply to your situation.
     v Example 1: Generating multiple jobs with varying outputs
       To run a program several times, varying the initial conditions each time, you
        could create multiple LoadLeveler scripts, each specifying a different input and
       output file as described in Figure 22 on page 183. It would probably be more
       convenient to prepare different input files and submit the job only once, letting
       LoadLeveler generate the output files and do the multiple submissions for you.
       Figure 21 on page 182 illustrates the following:
       – You can refer to the LoadLeveler name of your job symbolically, using
          $(jobid) and $(stepid) in the LoadLeveler script file.
       – $(jobid) refers to the job identifier.
– $(stepid) refers to the job step identifier and increases after each queue
                              command. Therefore, you only need to specify input, output, and error
                              statements once to have LoadLeveler name these files correctly.
                            Assume that you created five input files and each input file has different initial
                            conditions for the program. The names of the input files are in the form
                            longjob.in.x, where x is 0–4.
                            Submitting the LoadLeveler script shown in Figure 21 results in your program
                            running five times, each time with a different input file. LoadLeveler generates
                            the output file from the LoadLeveler job step IDs. This ensures that the results
                            from the different submissions are not merged.

                        #   @   executable = longjob
                        #   @   input = longjob.in.$(stepid)
                        #   @   output = longjob.out.$(jobid).$(stepid)
                        #   @   error = longjob.err.$(jobid).$(stepid)
                        #   @   queue
                        #   @   queue
                        #   @   queue
                        #   @   queue
                        #   @   queue

                        Figure 21. Job command file with varying input statements

                            To submit the job, type the command:
                            llsubmit longjob.cmd

                            LoadLeveler responds by issuing the following:
                            submit: The job "ll6.23" with 5 job steps has been submitted.

                            Table 42 lists the standard input files, standard output files, and standard error
                            files for the five job steps:
                        Table 42. Standard files for the five job steps
                        Job Step                 Standard Input           Standard Output    Standard Error
                        ll6.23.0                 longjob.in.0             longjob.out.23.0   longjob.err.23.0
                        ll6.23.1                 longjob.in.1             longjob.out.23.1   longjob.err.23.1
                        ll6.23.2                 longjob.in.2             longjob.out.23.2   longjob.err.23.2
                        ll6.23.3                 longjob.in.3             longjob.out.23.3   longjob.err.23.3
                        ll6.23.4                 longjob.in.4             longjob.out.23.4   longjob.err.23.4

                        v Example 2: Using LoadLeveler variables in a job command file
                          Figure 22 on page 183 shows how you can use LoadLeveler variables in a job
                          command file to assign different names to input and output files. This example
                          assumes the following:
                          – The name of the machine from which the job is submitted is lltest1
                          – The user’s home directory is /u/rhclark and the current working directory is
                             /u/rhclark/OSL
                          – LoadLeveler assigns a value of 122 to $(jobid).
                          In Job Step 0:
                          – LoadLeveler creates the subdirectories oslsslv_out and oslsslv_err if they do
                             not exist at the time the job step is started.
                          In Job Step 1:
– The character string ~rhclark denotes the home directory of user rhclark in
   input, output, error, and executable statements.
    – The $(base_executable) variable is set to be the “base” portion of the
       executable, which is oslsslv.
    – The $(host) variable is equivalent to $(hostname). Similarly, $(jobid) and
       $(stepid) are equivalent to $(cluster) and $(process), respectively.
    In Job Step 2:
    – This job step is executed only if the return codes from Step 0 and Step 1 are
      both equal to zero.
    – The initial working directory for Step 2 is explicitly specified.

#   Job step 0 ============================================================
#     The names of the output and error files created by this job step are:
#
#        output: /u/rhclark/OSL/oslsslv_out/lltest1.122.0.out
#        error : /u/rhclark/OSL/oslsslv_err/lltest1_122_0_err
#
#   @   job_name = OSL
#   @   step_name = step_0
#   @   executable = oslsslv
#   @   arguments = -maxmin=min -scale=yes -alg=dual
#   @   environment = OSL_ENV1=20000; OSL_ENV2=500000
#   @   requirements = (Arch == "R6000") && (OpSys == "AIX53")
#   @   input = test01.mps.$(stepid)
#   @   output = $(executable)_out/$(host).$(jobid).$(stepid).out
#   @   error = $(executable)_err/$(host)_$(jobid)_$(stepid)_err
#   @   queue
#
#   Job step 1 ============================================================
#     The names of the output and error files created by this job step are:
#
#        output: /u/rhclark/OSL/oslsslv_out/lltest1.122.1.out
#        error : /u/rhclark/OSL/oslsslv_err/lltest1_122_1_err
#
#   @   step_name = step_1
#   @   executable = ~rhclark/$(job_name)/oslsslv
#   @   arguments = -maxmin=max -scale=no -alg=primal
#   @   environment = OSL_ENV1=60000; OSL_ENV2=500000; \
#                     OSL_ENV3=70000; OSL_ENV4=800000;
#   @   input = ~rhclark/$(job_name)/test01.mps.$(stepid)
#   @   output = ~rhclark/$(job_name)/$(base_executable)_out/$(hostname).$(cluster).$(process).out
#   @   error = ~rhclark/$(job_name)/$(base_executable)_err/$(hostname)_$(cluster)_$(process)_err
#   @   queue
#
#   Job step 2 ============================================================
#     The names of the output and error files created by this job step are:
#
#        output: /u/rhclark/OSL/oslsslv_out/lltest1.122.2.out
#        error : /u/rhclark/OSL/oslsslv_err/lltest1_122_2_err
#
#   @   step_name = OSL
#   @   dependency = (step_0 == 0) && (step_1 == 0)
#   @   comment = oslsslv
#   @   initialdir = /u/rhclark/$(step_name)
#   @   arguments = -maxmin=min -scale=yes -alg=dual
#   @   environment = OSL_ENV1=300000; OSL_ENV2=500000
#   @   input = test01.mps.$(stepid)
#   @   output = $(comment)_out/$(host).$(jobid).$(stepid).out
#   @   error = $(comment)_err/$(host)_$(jobid)_$(stepid)_err
#   @   queue

Figure 22. Using LoadLeveler variables in a job command file

v Example 3: Using the job command file as the executable
  The name of the sample script shown in Figure 23 on page 185 is run_spice_job.
  This script illustrates the following:
  – The script does not contain the executable keyword. When you do not use
    this keyword, LoadLeveler assumes that the script is the executable. (Since the
name of the script is run_spice_job, you can add the executable =
                             run_spice_job statement to the script, but it is not necessary.)
                           – The job consists of four job steps (there are 4 queue statements). The spice3f5
                             and spice2g6 programs are invoked at each job step using different input data
                             files:
                             - spice3f5: Input for this program is from the file spice3f5_input_x where x
                                has a value of 0, 1, and 2 for job steps 0, 1, and 2, respectively. The name of
                                this file is passed as the first argument to the script. Standard output and
                                standard error data generated by spice3f5 are directed to the file
                                 spice3f5_output_x. The name of this file is passed as the second argument to
                                the script. In job step 3, the names of the input and output files are
                                spice3f5_input_benchmark1 and spice3f5_output_benchmark1,
                                respectively.
                             - spice2g6: Input for this program is from the file spice2g6_input_x.
                                Standard output and standard error data generated by spice2g6 together
                                with all other standard output and standard error data generated by this
                                script are directed to the files spice_test_output_x and spice_test_error_x,
                                respectively. In job step 3, the name of the input file is
                                spice2g6_input_benchmark1. The standard output and standard error files
                                are spice_test_output_benchmark1 and spice_test_error_benchmark1.
                             All file names that are not fully qualified are relative to the initial working
                             directory /home/loadl/spice. LoadLeveler will send the job steps 0 and 1 of
                              this job to a machine that has a real memory of 64 MB or more for
                              execution. Job step 2 most likely will be sent to a machine that has more than
                             128 MB of real memory and has the ESSL library installed since these
                             preferences have been stated using the LoadLeveler preferences keyword.
                             LoadLeveler will send job step 3 to the machine ll5.pok.ibm.com for
                             execution because of the explicit requirement for this machine in the
                             requirements statement.

#!/bin/ksh
               # @ job_name = spice_test
               # @ account_no = 99999
               # @ class = small
               # @ arguments = spice3f5_input_$(stepid) spice3f5_output_$(stepid)
               # @ input = spice2g6_input_$(stepid)
               # @ output = $(job_name)_output_$(stepid)
               # @ error = $(job_name)_error_$(stepid)
               # @ initialdir = /home/loadl/spice
                # @ requirements = ((Arch == "R6000") && \
               #           (OpSys == "AIX53") && (Memory > 64))
               # @ queue
               # @ queue
               # @ preferences = ((Memory > 128) && (Feature == "ESSL"))
               # @ queue
               # @ class = large
               # @ arguments = spice3f5_input_benchmark1 spice3f5_output_benchmark1
               # @ requirements = (Machine == "ll5.pok.ibm.com")
               # @ input = spice2g6_input_benchmark1
               # @ output = $(job_name)_output_benchmark1
               # @ error = $(job_name)_error_benchmark1
               # @ queue
                OS_NAME=`uname`

               case $OS_NAME in
                  AIX)
                     echo "Running $OS_NAME version of spice3f5" > $2
                     AIX_bin/spice3f5 < $1 >> $2 2>&1
                     echo "Running $OS_NAME version of spice2g6"
                     AIX_bin/spice2g6
                     ;;
                  *)
                     echo "spice3f5 for $OS_NAME is not available" > $2
                     echo "spice2g6 for $OS_NAME is not available"
                     ;;
               esac

               Figure 23. Job command file used as the executable

Editing job command files
               After you build a job command file, you can edit it using the editor of your choice.

               You may want to change the name of the executable or add or delete some
               statements.

               When you create a job command file, it is considered the job executable unless you
               specify otherwise by using the executable keyword in the job command file.
               LoadLeveler copies the executable to the spool directory unless the checkpoint
               keyword was set to yes or interval. Jobs that are to be checkpointed cannot be
               moved to the spool directory. Do not make any changes to the executable while the
                job is still in the queue; doing so could affect the way that job runs.

Defining resources for a job step
               The LoadLeveler user may use the resources keyword in the job command file to
               specify the resources to be consumed by each task of a job step.

               If the resources keyword is specified in the job command file, it overrides any
               default_resources specified by the administrator for the job step’s class.

For example, the following job requests one CPU and one FRM license for each of
                            its tasks:
                            resources = ConsumableCpus(1) FRMlicense(1)

                            If this were specified in a serial job step, one CPU and one FRM license would be
                            consumed while the job step runs. If this were a parallel job step, then the number
                            of CPUs and FRM licenses consumed while the job step runs would depend upon
                            how many tasks were running on each machine. For more information on
                            assigning tasks to nodes, see “Task-assignment considerations” on page 196.

                            Alternatively, you can use the node_resources keyword in the job command file to
                            specify the resources to be consumed by the job step on each machine it runs on,
                            regardless of the number of tasks assigned to each machine. If the node_resources
                            keyword is specified in the job command file, it overrides the
                            default_node_resources specified by the administrator for the job step’s class.

                            For example, the following job requests 240 MB of ConsumableMemory on each
                            machine:
                            node_resources = ConsumableMemory(240 mb)

                            Even if one machine only runs one task of the job step, while other machines run
                            multiple tasks, 240 MB will be consumed on every machine.
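
                             For example, a job step might combine both keywords to consume one CPU and
                             one license per task, plus 240 MB of memory on each machine (FRMlicense is
                             the administrator-defined resource used in the example above):

                                # @ resources      = ConsumableCpus(1) FRMlicense(1)
                                # @ node_resources = ConsumableMemory(240 mb)
                                # @ queue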

|   Submitting jobs requesting data staging
|                           The dstg_in_script keyword causes LoadLeveler to generate an inbound data
|                           staging step, without requiring the #@queue specification. The value assigned to
|                           this keyword is the executable to be started for data staging, together with
|                           any arguments needed by that script or executable.

|                           The dstg_in_wall_clock_limit keyword specifies a wall clock time for the inbound
|                           data staging step. Specifying the estimated wall clock limit is mandatory when a
|                           data staging script is specified. Similarly, dstg_out_script and
|                           dstg_out_wall_clock_limit will be used for generation and execution of the
|                           outbound data staging step for the job. All data staging job steps are assigned to
|                           the predefined class called data_stage.

|                           Resources required for data staging can be specified using the dstg_resources
|                           keyword.

|                           The dstg_node keyword allows you to specify how data replicas must be created:
|                           v If the value specified is any, one data staging task is executed on any available
|                             node in the cluster with data staging resources. This value can be used with
|                             either the at_submit or the just_in_time configuration options.
|                           v If the value specified is master, one data staging task is executed on the master
|                             node. The master node is the machine that will be used to run the inbound and
|                             outbound data staging steps as well as the first application step of the job.
|                           v If the value is all, a data staging task is executed on each of the nodes that will
|                             be or were used by the first application step.

|                           Any environment variables needed by the data staging scripts can be specified
|                           using the dstg_environment keyword. The copy_all value can be assigned to this
|                           keyword to get all of the user’s environment variables.
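
                            For example, a job requesting data staging might look like the following
                            sketch, where the script names, paths, and wall clock limits are
                            illustrative (see the keyword reference for the exact formats):

                               # @ job_name                  = dstg_example
                               # @ dstg_in_script            = /u/user/bin/stage_in.sh /gpfs/input.tar
                               # @ dstg_in_wall_clock_limit  = 00:10:00
                               # @ dstg_out_script           = /u/user/bin/stage_out.sh /gpfs/results.tar
                               # @ dstg_out_wall_clock_limit = 00:10:00
                               # @ dstg_node                 = master
                               # @ executable                = /u/user/bin/compute
                               # @ queue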

|                 For detailed information about the data staging job command file keywords, see
|                 “Job command file keyword descriptions” on page 359.

    Working with coscheduled job steps
                  LoadLeveler allows you to specify that a group of two or more steps within a job
                  are to be coscheduled. Coscheduled steps are dispatched at the same time.

           Submitting coscheduled job steps
|                 The coschedule = yes keyword in the job command file is used to specify which
|                 steps within a job are to be coscheduled.

|                 All steps within a job with the coschedule keyword set to yes will be coscheduled.
|                 The coscheduled steps will continue to be stored as individual steps in both
|                 memory and in the job queue, but when performing certain operations, such as
|                 scheduling, the steps will be managed as a single entity. An operation initiated on
|                 one of the coscheduled steps will cause the operation to be performed on all other
|                 steps (unless the coscheduling dependency between steps is broken).
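
                   For example, the following sketch coschedules two steps so that they are
                   dispatched together (the step and executable names are illustrative):

                      # @ step_name  = server_step
                      # @ executable = /u/user/bin/server
                      # @ coschedule = yes
                      # @ queue
                      # @ step_name  = client_step
                      # @ executable = /u/user/bin/client
                      # @ coschedule = yes
                      # @ queue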

           Determining priority for coscheduled job steps
                  Coscheduled steps are supported only with the BACKFILL scheduler. The
                  LoadLeveler BACKFILL scheduler will only dispatch the set of coscheduled steps
                  when enough resource is available for all steps in the set to start.

                  If the set of coscheduled steps cannot be started immediately, but enough resource
                  will be available in the future, then the resource for all the steps will be reserved.
                  In this case, only one of the coscheduled steps will be designated as a top dog, but
                  enough resources will be reserved for all coscheduled steps and all the steps will
                  be dispatched when the top dog step is started. The coscheduled step with the
                  highest priority in the current job queue will be designated as the primary
                  coscheduled step and all other steps will be secondary coscheduled steps. The
                  primary coscheduled step will determine when the set of coscheduled steps will be
                  scheduled. The priority for all other coscheduled steps is ignored.

           Supporting preemption of coscheduled job steps
                  Preemption of coscheduled steps is supported.

                  Preemption of coscheduled steps is supported with the following restrictions:
                  v In order for a step S to be preemptable by a coscheduled step, all steps in the set
                    of coscheduled steps must be able to preempt step S.
                  v In order for a step S to preempt a coscheduled step, all steps in the set of
                    coscheduled steps must be preemptable by step S.
                  v The set of job steps available for preemption will be the same for all coscheduled
                    steps. Any resource made available by preemption for one coscheduled step will
                    be available to all other coscheduled steps.

                  To determine the preempt type and preempt method to use when a coscheduled
                  step preempts another step, an order of precedence for preempt types and preempt
                  methods has been defined. All steps in the preempting coscheduled step are
                  examined and the preempt type and preempt method having the highest
                  precedence are used. The order of precedence for preempt type will be ALL and
                  ENOUGH. The precedence order for preempt method is:
                  v Remove
v   Vacate
                        v   System Hold
                         v   User Hold
                        v   Suspend

                        For more information about preempt types and methods, see “Planning to preempt
                        jobs” on page 128.

                        When coscheduled steps are running, if one step is preempted as a result of a
                        system-initiated preemption, then all coscheduled steps are preempted. When
                        determining an optimal preempt set, the BACKFILL scheduler does not consider
                        coscheduled steps as a single entity. All coscheduled steps are in the initial
                        preempt set, but the final preempt set might not include all coscheduled steps, if
                        the scheduler determines the resources of some coscheduled steps are not
                        necessary to start the preempting job step. This implies that more resource than
                        necessary might be preempted when a coscheduled step is in the set of steps to be
                         preempted: regardless of whether all coscheduled steps are in the
                        preempt set, if one coscheduled step is preempted, then all coscheduled steps will
                        be preempted.

             Coscheduled job steps and commands and APIs
                        Commands and APIs that operate on job steps are impacted by coscheduled steps.

                        For the llbind, llcancel, llhold, and llpreempt commands, even if all coscheduled
                        steps are not in the list of targeted steps, the requested operation is performed on
                        all coscheduled steps.

                        For the llmkres and llchres commands, a coscheduled job step cannot be specified
                        when using the -j or -f flags. For the llckpt command, you cannot specify a
                        coscheduled job step using the -u flag.

             Termination of coscheduled steps
                        If a coscheduled step is dispatched but cannot be started and is rejected by the
                        startd daemon or the starter process, then all coscheduled steps are rejected.

                        If a running step is removed or vacated by LoadLeveler as a result of a system
                        related failure, then all coscheduled steps are removed or vacated. If a running
                        step is vacated as a result of the VACATE expression evaluating to true for the
                        step, then all coscheduled steps are vacated.

Using bulk data transfer
                        On systems with device drivers and network adapters that support remote
                        direct-memory access (RDMA), LoadLeveler supports bulk data transfer for jobs
                        that use either the Internet or user space communication protocol mode.

                        For jobs using the Internet protocol (IP jobs), LoadLeveler does not monitor or
                        control the use of bulk transfer. For user space jobs that request bulk transfer,
                        however, LoadLeveler creates a consumable RDMA resource requirement.
                        Machines with Switch Network Interface for HPS network adapters are
                        automatically given an RDMA consumable resource with an available amount of
                        four. Machines with InfiniBand switch adapters are given unlimited RDMA
                        consumable resources. Each step that requests bulk transfer consumes one RDMA
                        resource on each machine on which that step runs.

The RDMA resource is similar to user-defined consumable resources except in one
    important way: A user-specified resource requirement is consumed by every task
    of the job assigned to a machine, whereas the RDMA resource is consumed once
    on a machine no matter how many tasks of the job are running on the machine.
    Other than that exception, LoadLeveler handles the RDMA resource as it does all
    other consumable resources. LoadLeveler displays RDMA resources in the output
    of the following commands:
    v llq -l
    v llsummary -l

    LoadLeveler also displays RDMA resources in the output of the following
    commands for machines with Switch Network Interface for HPS network adapters:
    v llstatus -l
    v llstatus -R

    Bulk transfer is supported only on systems where the device driver of the network
    adapters supports RDMA. To determine which systems will support bulk transfer,
    use the llstatus command with the -l, -R, or -a flag to display machines with
    adapters that support RDMA. Machines with Switch Network Interface for HPS
    network adapters will have an RDMA resource listed in the command output of
|   llstatus -l and llstatus -R. The llstatus -a command displays the adapters list,
    which can be used to verify whether InfiniBand adapters are connected to the
    machines.

    Under certain conditions, LoadLeveler displays a total count of RDMA resources as
    less than four for machines with Switch Network Interface for HPS network
    adapters:
    v If jobs that LoadLeveler does not manage use RDMA, the amount of available
       RDMA resource reported to the Negotiator is reduced by the amount consumed
       by the unmanaged jobs.
    v In rare situations, LoadLeveler jobs can fail to release their adapter resources
       before reporting to the Negotiator that they have completed. When this occurs,
       the amount of available RDMA reported to the Negotiator is reduced by the
       amount consumed by the unreleased adapter resources. When the adapter
       resources are eventually released, the RDMA resource they consumed becomes
       available again.
    These conditions do not require corrective action.

    You do not need to perform specific job-definition tasks to enable bulk transfer for
    LoadLeveler jobs that use the IP network protocol. LoadLeveler cannot affect
    whether IP communication uses bulk transfer; the implementation of IP where the
    job runs determines whether bulk transfer is supported.

    To enable user space jobs to use bulk data transfer, however, all of the following
    tasks must be completed. If you omit one or more of these steps, the job will run
    but will not be able to use bulk transfer.
    v A LoadLeveler administrator must update the LoadLeveler configuration file to
       include the value RDMA in the SCHEDULE_BY_RESOURCES list for machines
       with Switch Network Interfaces for HPS network adapters. It is not required to
       include RDMA in the SCHEDULE_BY_RESOURCES list for machines with
       InfiniBand network adapters.
       Example:

      SCHEDULE_BY_RESOURCES = RDMA others


v Users must request bulk transfer for their LoadLeveler jobs, using one of the
                          following methods:
                          – Specifying the bulkxfer keyword in the LoadLeveler job command file.
                             Example:

                              #@ bulkxfer=yes
                             If users specify this keyword for jobs that use the IP communication protocol,
                             LoadLeveler ignores the bulkxfer keyword.
– Specifying a POE command line parameter on interactive jobs.
                             Example:
                              poe_job -use_bulk_xfer=yes
                           – Specifying an environment variable on interactive jobs.
                             Example:
   export MP_USE_BULK_XFER=yes
   poe_job
                        v Because LoadLeveler honors the bulk transfer request only for LAPI or MPI jobs,
                          users must ensure that the network keyword in the job command file specifies
                          the MPI, LAPI, or MPI_LAPI protocol for user space communication.
                          Examples:
   network.MPI = sn_single,not_shared,US,HIGH
   network.MPI_LAPI = sn_single,not_shared,US,HIGH
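
Taken together, a minimal sketch of a job command file for a user space job that
requests bulk transfer might look like the following. The output, error,
arguments, and class values here are hypothetical; the network and bulkxfer
statements are the ones described above:

   # @ job_type = parallel
   # @ output = bulk.out
   # @ error = bulk.err
   # @ node = 2
   # @ tasks_per_node = 2
   # @ network.MPI = sn_single,not_shared,US,HIGH
   # @ bulkxfer = yes
   # @ executable = /usr/bin/poe
   # @ arguments = /u/user/my_us_program
   # @ class = POE
   # @ queue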


Preparing a job for checkpoint/restart
                        You can checkpoint your entire job step, and allow a job step to restart from the
                        last checkpoint.

                        LoadLeveler has the ability to checkpoint your entire job step, and to allow a job
                        step to restart from the last checkpoint. When a job step is checkpointed, the entire
                        state of each process of that job step is saved by the operating system. On AIX, this
                        checkpoint capability is built in to the base operating system.

                        Use the information in Table 43 on page 191 to correctly configure your job for
                        checkpointing.




Table 43. Checkpoint configurations

To specify that your job is checkpointable:
v Add either one of the following two options to your job command file:
  1. checkpoint = yes
     This enables your job to checkpoint in any of the following ways:
     – The application can initiate the checkpoint. This is only available on AIX.
     – Checkpoint from a program which invokes the ll_ckpt API.
     – Checkpoint using the llckpt command.
     – As the result of a flush command.
  OR
  2. checkpoint = interval
     This enables your job to checkpoint in any of the following ways:
     – The application can initiate the checkpoint. This is only available on AIX.
     – Checkpoint from a program which invokes the ll_ckpt API.
     – Checkpoint using the llckpt command.
     – Checkpoint automatically taken by LoadLeveler.
     – As the result of a flush command.
v If you would like your job to checkpoint itself, use the API ll_init_ckpt in
  your serial application, or mpc_init_ckpt for parallel jobs, to cause the
  checkpoint to occur. This is only available on AIX.

To specify that your job step's executable is to be copied to the execute node:
v Add the ckpt_execute_dir keyword to the job command file.

To specify that LoadLeveler automatically checkpoints your job at preset
intervals:
1. Add the following option to your job command file:
   checkpoint = interval
   This enables your job to checkpoint in any of the following ways:
   v Checkpoint automatically at preset intervals
   v Checkpoint initiated from user application. This is only available on AIX.
   v Checkpoint from a program which invokes the ll_ckpt API
   v Checkpoint using the llckpt command
   v As the result of a flush command
2. The system administrators must set the following two keywords in the
   configuration file to specify how often LoadLeveler should take a checkpoint
   of the job:
   MIN_CKPT_INTERVAL = number
      Where number specifies the initial period, in seconds, between checkpoints
      taken for running jobs.
   MAX_CKPT_INTERVAL = number
      Where number specifies the maximum period, in seconds, between checkpoints
      taken for running jobs.

The time between checkpoints is increased after each checkpoint within these
limits, as follows:
v The first checkpoint is taken after a period of time equal to
  MIN_CKPT_INTERVAL has passed.
v The second checkpoint is taken after LoadLeveler waits twice as long
  (MIN_CKPT_INTERVAL X 2).
v The third checkpoint is taken after LoadLeveler waits twice as long again
  (MIN_CKPT_INTERVAL X 4).
LoadLeveler continues to double this period until the value of
MAX_CKPT_INTERVAL has been reached, where it stays for the remainder of the
job.

The defaults are a minimum value of 900 (15 minutes) and a maximum value of
7200 (2 hours).

You can set these keyword values globally in the global configuration file so
that all machines in the cluster have the same value, or you can specify a
different value for each machine by modifying the local configuration files.

To specify that your job will not be checkpointed:
v Add the following option to your job command file:
  checkpoint = no
  This disables checkpointing.

To specify that your job has successfully checkpointed and terminated, the job
has left the LoadLeveler job queue, and you want LoadLeveler to restart your
executable from an existing checkpoint file:
1. Add the following option to your job command file:
   restart_from_ckpt = yes
2. On AIX, specify the name of the checkpoint file by setting the following
   job command file keywords to specify the directory and file name of the
   checkpoint file to be used:
   v ckpt_dir
   v ckpt_file
When the job command file is submitted, a new job will be started that uses
the specified checkpoint file to restart the previously checkpointed job.

The job command file which was used to submit the original job should be used
to restart from checkpoint. The only modifications to this file should be the
addition of restart_from_ckpt = yes and ensuring that ckpt_dir and ckpt_file
point to the appropriate checkpoint file.

If your job has successfully checkpointed, and the job has been vacated but
remains on the LoadLeveler job queue, no action is required. When the job
restarts, if a checkpoint file is available, the job will be restarted from
that file. If a checkpoint file is not available upon restart, the job will be
started from the beginning.
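
For example, a minimal sketch of a job command file that enables automatic
checkpointing (the executable, directory, and file names here are
hypothetical):

   # @ job_type = serial
   # @ executable = /u/user/mysim
   # @ checkpoint = interval
   # @ ckpt_dir = /u/user/ckpt
   # @ ckpt_file = mysim.ckpt
   # @ output = mysim.out
   # @ error = mysim.err
   # @ queue

To restart from the saved checkpoint after the job has left the queue,
resubmit the same file with one line added:

   # @ restart_from_ckpt = yes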



Preparing a job for preemption
              Depending on various configuration options, LoadLeveler may preempt your job
              so that a higher priority job step can run.

              Administrators may:
              v Configure LoadLeveler or external schedulers to preempt jobs through various
                methods.
              v Specify preemption rules for job classes.
              v Manually preempt your job using LoadLeveler interfaces.

              To ensure that your job can be resumed after preemption, set the restart keyword
              in the job command file to yes.
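
For example, to ensure that a job step can be resumed after preemption,
include the following line in the job command file:

   # @ restart = yes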

Submitting a job command file
              After building a job command file, you can submit it for processing either to a
              machine in the LoadLeveler cluster or one outside of the cluster.

              See “Querying multiple LoadLeveler clusters” on page 71 for information on
              submitting a job to a machine outside the cluster. You can submit a job command
              file either by using the GUI or the llsubmit command.

              When you submit a job, LoadLeveler assigns a job identifier and one or more step
              identifiers.

              The LoadLeveler job identifier consists of the following:



machine name
                              The name of the machine which assigned the job identifier.
                        jobid    A number given to a group of job steps that were initiated from the same
                                 job command file.

                        The LoadLeveler step identifier consists of the following:
                        job identifier
                                The job identifier.
                        stepid A number that is unique for every job step in the job you submit.

                        If a job command file contains multiple job steps, every job step will have the same
                        jobid and a unique stepid.
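
For example, if a job command file containing two job steps is submitted from
a machine named node01 (a hypothetical host name) and LoadLeveler assigns
jobid 41, the resulting step identifiers would be node01.41.0 and node01.41.1.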

                        For an example of submitting a job, see Chapter 10, “Example: Using commands to
                        build, submit, and manage jobs,” on page 235.

                        In a multicluster environment, job and step identifiers are assigned by the local
                        cluster and are retained by the job regardless of what cluster the job runs in.

             Submitting a job using a submit-only machine
                        You can submit jobs from submit-only machines.

                        Submit-only machines allow machines that do not run LoadLeveler daemons to
                        submit jobs to the cluster. You can submit a job using either the submit-only
                        version of the GUI or the llsubmit command.

                        To install submit-only LoadLeveler, follow the procedure in the TWS LoadLeveler:
                        Installation Guide.

                        In addition to allowing you to submit jobs, the submit-only feature allows you to
                        cancel and query jobs from a submit-only machine.

Working with parallel jobs
                        LoadLeveler allows you to schedule parallel batch jobs.

                        LoadLeveler allows you to schedule parallel batch jobs that have been written
                        using the following:
                        v On AIX and Linux:
                          – IBM Parallel Environment (PE)
                          – MPICH, which is an open-source, portable implementation of the
                             Message-Passing Interface Standard developed by Argonne National
                             Laboratory
                          – MPICH-GM, which is a port of MPICH on top of Myrinet GM code
v On Linux:
  – MVAPICH, which is a high performance implementation of MPI-1 over
    InfiniBand based on MPICH

Support for PE is available in this release of LoadLeveler for Linux.




Step for controlling whether LoadLeveler copies environment
    variables to all executing nodes
          You may specify that LoadLeveler is to copy, either to all executing nodes or to
          only the master executing node, the environment variables that are specified in the
          environment job command file statement for a parallel job.

          Before you begin: You need to know:
          v Whether Parallel Environment (PE) will be used to run the parallel job; if so,
            then LoadLeveler does not have to copy the application environment to the
            executing nodes.
          v How to correctly specify the env_copy keyword. For information about keyword
            syntax and other details, see the env_copy keyword description.

To specify whether LoadLeveler is to copy environment variables to only the
master node, or to all executing nodes, use the #@ env_copy keyword in the job
command file.
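
For example, to have LoadLeveler copy the environment variables only to the
master executing node, you might specify the following (a sketch; see the
env_copy keyword description for the exact set of accepted values):

   # @ env_copy = master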

    Ensuring that parallel jobs in a cluster run on the correct
    levels of PE and LoadLeveler software
If support for parallel POE jobs is required, be aware that when LoadLeveler
uses Parallel Environment for parallel job submission, PE requires the same
level of PE software to be used throughout the parallel job.

|         Different levels of PE cannot be mixed. For example, PE 5.1 supports only
|         LoadLeveler 3.5, and PE 4.3 only supports LoadLeveler 3.4.3. Therefore, a POE
|         parallel job cannot run some of its tasks on LoadLeveler 3.4.3 machines and the
|         remaining tasks on LoadLeveler 3.5 machines.

          The requirements keyword of the job command file can be used to ensure that all
          the tasks of a POE job run on compatible levels of PE and LoadLeveler software in
          a cluster. Here are three examples showing different ways this can be done:
|         1. If the following requirements statement is included in the job command file,
|             LoadLeveler’s central manager will select only 3.5 or higher machines with the
|             appropriate OpSys level for this job step.
|            # @ requirements = (LL_Version >= "3.5") && (OpSys == "AIX53")
          2. If a requirements statement such as the following is specified, the tasks of a
POE job will see a consistent environment when "hostname1" and "hostname2"
             run the same levels of PE and LoadLeveler software.
             # @ requirements = (Machine == { "hostname1" "hostname2" }) && (OpSys == "AIX53")
|         3. If the mixed cluster has been partitioned into 3.4.3 and 3.5 LoadLeveler pools,
|            then you may use a requirements statement similar to one of the two following
|            statements to select machines running the same levels of software.
|            v # @ requirements = (Pool == 35) && (OpSys == "AIX53")
|            v # @ requirements = (Pool == 343) && (OpSys == "AIX53")
|            Here, it is assumed that all the 3.4.3 machines in this mixed cluster are assigned
|            to pool 343 and all 3.5 machines are assigned to pool 35. A LoadLeveler
|            administrator can use the pool_list keyword of the machine stanza of the
|            LoadLeveler administration file to assign machines to pools.

          If a statement such as # @ executable = /bin/poe is specified in a job command
          file, and if the job is intended to be run on 3.5 machines, then it is important that
the job be submitted from a 3.5 machine. When the "executable" keyword is used,
LoadLeveler will copy the associated binary from the submitting machine and send it
to a running machine for execution. In this example, the POE program will fail if
                            the submitting and the running machines are at different software levels. In a
                            mixed cluster, this problem can be circumvented by not using the executable
                            keyword in the job command file. By omitting this keyword, the job command file
                            itself is the shell script that will be executed. If this script invokes a local version of
                            the POE binary then there is no compatibility problem at run time.
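
For example, a minimal sketch of this approach, in which the job command file
itself is the script and invokes the locally installed POE binary (the node
count, class, and program path here are hypothetical):

   # @ job_type = parallel
   # @ node = 2
   # @ class = POE
   # @ queue
   /usr/bin/poe /u/user/my_parallel_program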

                 Task-assignment considerations
                            You can use the keywords to specify how LoadLeveler assigns tasks to nodes.

                            You can use the keywords listed in Table 44 to specify how LoadLeveler assigns
                            tasks to nodes. With the exception of unlimited blocking, each of these methods
prioritizes machines in an order based on their MACHPRIO expressions. Some
task assignment keywords can be used in combination; others are mutually
exclusive.
| Table 44. Valid combinations of task assignment keywords
|
| The following combinations of task assignment keywords are valid:
| v total_tasks with node = <number>
| v total_tasks with blocking
| v tasks_per_node with node = <min, max>
| v tasks_per_node with node = <number>
| v task_geometry by itself

                            The following examples show how each allocation method works. For each
example, consider a 3-node SP with machines named "N1", "N2", and "N3". The
machines’ order of priority, according to the values of their MACHPRIO
expressions, is: N1, N2, N3. N1 has 4 initiators available, N2 has 6, and N3 has 8.

                            node and total_tasks
                            When you specify the node keyword with the total_tasks keyword, the assignment
                            function will allocate all of the tasks in the job step evenly among however many
                            nodes you have specified.

                            If the number of total_tasks is not evenly divisible by the number of nodes, then
                            the assignment function will assign any larger groups to the first nodes on the list
                            that can accept them. In this example, 14 tasks must be allocated among 3 nodes:
                            # @ node=3
                            # @ total_tasks=14

                            Table 45 shows the machine, available initiators, and assigned tasks:
                            Table 45. node and total_tasks
                            Machine                          Available Initiators            Assigned Tasks
                            N1                               4                               4
                            N2                               6                               5
                            N3                               8                               5


The assignment function divides the 14 tasks into groups of 5, 5, and 4, and begins
at the top of the list to assign the first group of 5. The assignment function starts
at N1 but, because there are only 4 available initiators, cannot assign a block of 5
tasks. Instead, the function moves down the list and assigns the two groups of 5 to
N2 and N3; the assignment function then goes back and assigns the group of 4
tasks to N1.

node and tasks_per_node
When you specify the node keyword with the tasks_per_node keyword, the
assignment function will assign tasks in groups of the specified value among the
specified number of nodes.
# @ node = 3
# @ tasks_per_node = 4
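
In this example, the assignment function assigns 4 tasks to each of the 3
nodes, for a total of 12 tasks.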

blocking
When you specify blocking, tasks are allocated to machines in groups (blocks) of
the specified number (blocking factor).

The assignment function will assign one block at a time to the machine which is
next in the order of priority until all of the tasks have been assigned. If the total
number of tasks is not evenly divisible by the blocking factor, the remainder of
the tasks is allocated to a single node. The blocking keyword must be specified with
the total_tasks keyword. For example:
# @ blocking = 4
# @ total_tasks = 17

Where blocking specifies that a job’s tasks will be assigned in blocks, and 4
designates the size of the blocks. Table 46 shows how a blocking factor of 4 would
work with 17 tasks:
Table 46. Blocking
Machine                      Available Initiators            Assigned Tasks
N1                           4                               4
N2                           6                               5
N3                           8                               8

The assignment function first determines that there will be 4 blocks of 4 tasks, with
a remainder of one task. Therefore, the function will allocate the remainder with
the first block that it can. N1 gets a block of four tasks, N2 gets a block, plus the
remainder, then N3 gets a block. The assignment function begins again at the top
of the priority list, and N3 is the only node with enough initiators available, so N3
ends up with the last block.

unlimited blocking
When you specify unlimited blocking, the assignment function will allocate as
many tasks as possible to each node; the function prioritizes nodes primarily by
how many initiators each node has available, and secondarily on their MACHPRIO
expressions.

This method allows you to allocate tasks among as few nodes as possible. To
specify unlimited blocking, specify "unlimited" as the value for the blocking
keyword. The total_tasks keyword must also be specified with unlimited blocking.
For example:
# @ blocking = unlimited
# @ total_tasks = 17

Table 47 on page 198 lists the machine, available initiators, and assigned tasks for
unlimited blocking:

Table 47. Unlimited blocking
                        Machine                        Available Initiators       Assigned Tasks
                        N3                             8                          8
                        N2                             6                          6
                        N1                             4                          3

The assignment function begins with N3 (because N3 has the most initiators
available) and assigns 8 tasks; N2 takes 6, and N1 takes the remaining 3.

                        task_geometry
                        The task_geometry keyword allows you to specify which tasks run together on the
                        same machines, although you cannot specify which machines.

                        In this example, the task_geometry keyword groups 7 tasks to run on 3 nodes:
                        # @ task_geometry = {(5,2)(1,3)(4,6,0)}

                        The entire task_geometry expression must be enclosed within braces. The task IDs
for each node must be enclosed within parentheses, and must be separated by
                        commas. The entire range of task IDs that you specify must begin with zero, and
                        must end with the task ID which is one less than the total number of tasks. You
                        can specify the task IDs in any order, but you cannot skip numbers (the range of
                        task IDs must be complete). Commas may only appear between task IDs, and
                        spaces may only appear between nodes and task IDs.
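
In the example above, tasks 5 and 2 run together on one node, tasks 1 and 3 on
a second node, and tasks 4, 6, and 0 on a third node.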

             Submitting jobs that use striping
                        When communication between parallel tasks occurs only over a single device such
                        as en0, the application and the device are gated by each other.

                        The device must wait for the application to fill a communication buffer before it
                        transmits the buffer and the application must wait for the device to transmit and
                        empty the buffer before it can refill the buffer. Thus the application and the device
                        must wait for each other and this wastes time.

                        The technique of striping refers to using two or more communication paths to
                        implement a single communication path as perceived by the application. As the
                        application sends data, it fills up a buffer on one device. As that buffer is
                        transmitted over the first device, the application’s data begins filling up a second
                        buffer and the application perceives no delay in being able to write. When the
                        second buffer is full, it begins transmission over the second device and the
                        application moves on to the next device. When all devices have been used, the
                        application returns to the first device. Much, if not all of the buffer on the first
                        device has been transmitted while the application wrote to the buffers on the other
                        devices so the application waits for a minimal amount of time or possibly does not
                        wait at all.

                        LoadLeveler supports striping in two ways. When multiple switch planes or
                        networks are present, striping over them is indicated by requesting sn_all
                        (multiple networks).

                        If multiple adapters are present on the same network and the communication
                        subsystem, such as LAPI, supports striping over multiple adapters on the same
                        network, specifying the instances keyword on the network statement requests
                        striping over adapters on the same network. The instances keyword specifies the
                        number of adapters on a single network to stripe on. It is possible to stripe over
multiple networks and over multiple adapters on each network by specifying both
sn_all and a value for instances greater than one. For HPS adapters, only
machines that are connected to both networks are considered for sn_all jobs.
v User space striping: When sn_all is specified on a network statement with US
  mode, LoadLeveler commits an equivalent set of adapter resources (adapter
  windows and memory) on each of the networks present in the system to the job
  on each node where the job runs. The communication subsystem is initialized to
  indicate that it should use the user space communication protocol on all the
  available switch adapters to service communication requests on behalf of the
  application.
v IP striping: When the sn_all device is specified on a network statement with the
  IP mode, LoadLeveler attempts to locate the striped IP address associated with
  the switch adapters, known as the multi-link address. If it is successful, it passes
  the multi-link address to POE for use. If multi-link addresses are not available,
  LoadLeveler instructs POE to use the IP address of one of the switch adapters.
  The IP address that is used is different each time a choice has to be made in an
  attempt to balance the adapter use. Multi-link addresses must be configured on
  the system prior to running LoadLeveler and they are specified with the
  multilink_address keyword on the switch adapter stanza in the administration
  file. If a multi-link address is specified for a node, LoadLeveler assigns the
  multi-link address and multi-link IP name to the striping adapter on that node.
  If a multi-link address is not present on a node, the sn_all adapter associated
  with the node will not have an IP address or IP name. If not all of the nodes of
  a system have multi-link addresses but some do, LoadLeveler will only dispatch
  jobs that request IP striping to nodes that have multi-link addresses.
  Jobs that request striping (both user space and IP) can be submitted to nodes
  with only one switch adapter. In that situation, the result is the same as if the
  job requested no striping.

  Note: When configured, a multi-link address is associated with the virtual ml0
          device. The IP address of this device is the multi-link address. The
          llextRPD program will create a stanza for the ml0 device that will appear
          similar to Ethernet or token ring adapter stanzas except that it will
          include the multilink_list keyword that lists the adapters it performs
          striping over. As with any other device with an IP address, the ml0 device
          can be requested in IP mode on the network statement. Doing so would
          yield a comparable effect to requesting sn_all IP except that no checking
          would be performed by LoadLeveler to ensure the associated adapters are
          actually working. Thus it would be possible to dispatch a job that
          requested communication over ml0 only to have the job fail because the
          switch adapters that ml0 stripes over were down.
v Striping over one network: If the instances keyword is specified on a network
  statement with a value greater than one, LoadLeveler allocates multiple sets of
  resources for the protocol using as many sets as the instances keyword
  specified. For User Space jobs, these sets are adapter windows and memory. For
  IP jobs, these sets are IP addresses. If multiple adapters exist on each node on
  the same network, then these sets of adapter resources will be distributed among
  all the available adapters on the same network. Even though LoadLeveler will
  allocate resources to support striping over a single network, the communication
  subsystem must be capable of exploiting these resources in order for them to be
  used.

Understanding striping over multiple networks
Striping over multiple networks involves establishing a communication path using
one or more of the available communication networks or switch fabrics.

How those paths are established depends on the network adapter that is present.
                        For the SP Switch2 family of adapters, it is not necessary to acquire communication
                        paths among all tasks on all fabrics as long as there is at least one fabric over
                        which all tasks can communicate. However, each adapter on a machine, if it is
                        available, must use exactly the same adapter resources (window and memory
                        amount) as the other adapters on that machine. Switch Network Interface for HPS
                        adapters are not required to use exactly the same resources on each network, but
                        in order for a machine to be selected, there must be an available communication
                        path on all networks.


Figure 24. Striping over multiple networks. (The figure shows four nodes, each
with Adapter A connected to Switch Network A and Adapter B connected to Switch
Network B. The Network A adapters on Node 1 and Node 4 and the Network B
adapter on Node 3 are at fault.)

                        Consider these sample scenarios using the network configuration as shown in
                        Figure 24 where the adapters are from the SP Switch2 family:
                        v If a three node job requests striping over networks, it will be dispatched to Node
                          1, Node 2 and Node 4 where it can communicate on Network B as long as the
                          adapters on each machine have a common window free and sufficient memory
                          available. It cannot run on Node 3 because that node only has a common
                          communication path with Node 2, namely Network A.
v If a three node job does not request striping, it will not be run because there are
  not enough adapters connected to Network A to run the job. Notice that the
  adapter connected to Network A on Node 1 and the adapter connected to
  Network A on Node 4 are both at fault. SP Switch2 family adapters can only use
  the adapter connected to Network A for non-striped communication.




v If a three node job requests striped IP and some but not all of the nodes have
  multi-linked addresses, the job will only be dispatched to the nodes that have
  the multi-link addresses.

Consider these sample scenarios using the network configuration as shown in
Figure 24 on page 200 where the adapters are Switch Network Interface for HPS
adapters:
v If a three node job requests striping over networks, it will not be dispatched
  because there are not three nodes that have active connections to both networks.
v If a three node job does not request striping, it can be run on Node 1, Node 2,
  and Node 4 because they have an active connection to network B.
v If a three node job requests striped IP and some but not all of the nodes have
  multi-linked addresses, the job will only be dispatched to the nodes that have
  the multi-link addresses.

Note that for all adapter types, adapters are allocated to a step that requests
striping based on what the node knows is the available set of networks or fabrics.
LoadLeveler expects each node to have the same knowledge about available
networks. If this is not true, it is possible for tasks of a step to be assigned
adapters which cannot communicate with tasks on other nodes.

Similarly, LoadLeveler expects all adapters that are identified as being on the same
Network ID or fabric ID to be able to communicate with each other. If this is not
true, such as when LoadLeveler operates with multiple, independent sets of
networks, other attributes of the Step, such as the requirements expression, must
be used to ensure that only nodes from a single network set are considered for the
step.

As you can see from these scenarios, LoadLeveler will find enough nodes on the
same communication path to run the job. If enough nodes connected to a common
communication path cannot be found, no communication can take place and the
job will not run.

Understanding striping over a single network
Striping over a single network is only supported by Switch Network Interface for
HPS adapters.

Figure 25 on page 202 shows a network configuration where the adapters support
striping over a single network.




Figure 25. Striping over a single network. (The figure shows three nodes, each
with Adapter A and Adapter B connected to the same Switch Network 0. Concentric
ovals within the network represent separate communication paths, labeled
instance 0, instance 1, and instance 2; on Node 3, a fault keeps Adapter B from
connecting to the network.)

                        Both Adapter A and Adapter B on a node are connected to Network 0. The entire
                        oval represents the physical network and the concentric ovals (shaded differently)
                        represent the separate communication paths created for a job by the instances
                        keyword on the network statement. In this case a three node job requests two
                        instances for communication. On Node 1, adapter A is used for instance 0 and
                        adapter B is used for instance 1. There is no requirement to use the same adapter
                        for the same instance so on Node 2, adapter B was used for instance 0 and adapter
                        A for instance 1.

                        On Node 3, where a fault is keeping adapter B from connecting to the network,
                        adapter A is used for both instance 0 and instance 1 and Node 3 is available for
                        the job to use.

                        The network itself does not impose any limitation on the total number of
                        communication paths that can be active at a given time for either a single job or all
                        the jobs using the network. As long as nodes with adapter resources are available,
                        additional communication paths can be created.

                        Examples: Requesting striping in network statements
                        You request that a job be run using striping with the network statement in your
                        job command file.

The default when instances is not specified for a job in the network statement
is controlled by the class stanza keyword max_protocol_instances for sn_all.
For more information on the network statement and the max_protocol_instances
keyword, see the keyword descriptions in
                        “Job command file keyword descriptions” on page 359.

                        Shown here are examples of IP and user space network modes:
                        v Example 1: Requesting striping using IP mode
                          To submit a job using IP striping, your network statement would look like this:

network.MPI = sn_all,,IP
      v Example 2: Requesting striping using user space mode
        To submit a job using user space striping, your network statement would look
        like this:
        network.MPI = sn_all,,US
      v Example 3: Requesting striping over a single network
To request IP striping over multiple adapters on a single network, the network
        statement would look like this:
        network.MPI = sn_single,,IP,,instances=2

        If the nodes on which the job runs have two or more adapters on the same
        network, two different IP addresses will be allocated to each task for MPI
        communication. If only one adapter exists per network, the same IP address will
        be used twice for each task for MPI communication.
      v Example 4: Requesting striping over multiple networks and multiple adapters
        on the same network
        To submit a user space job that will stripe MPI communication over multiple
        adapters on all networks present in the system the network statement would
        look like this:
        network.MPI = sn_all,,US,,instances=2

        If, on a node where the job runs, there are two adapters on each of the two
        networks, one adapter window would be allocated from each adapter for MPI
        communication by the job. If only one network were present with two adapters,
        one adapter window from each of the two adapters would be used. If two
        networks were present but each only had one adapter on it, two adapter
        windows from each adapter would be used to satisfy the request for two
        instances.

Running interactive POE jobs
POE will accept LoadLeveler job command files.

      However, you can still set the following environment variables to define specific
      LoadLeveler job attributes before running an interactive POE job:
      LOADL_ACCOUNT_NO
           The account number associated with the job.
      LOADL_INTERACTIVE_CLASS
           The class to which the job is assigned.
      MP_TASK_AFFINITY
           The affinity preferences requested for the job.

      For information on other POE environment variables, see IBM Parallel Environment
      for AIX and Linux: Operation and Use, Volume 1.

For an interactive POE job, LoadLeveler does not start the POE process; therefore,
LoadLeveler has no control over the process environment or resource limits.

      You also may run interactive POE jobs under a reservation. For additional details
      about reservations and submitting jobs to run under them, see “Working with
      reservations” on page 213.

      Interactive POE jobs cannot be submitted to a remote cluster.
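
For example, you might set these variables in the shell before invoking POE
interactively (the class name, affinity value, and program name here are
illustrative only):

   export LOADL_INTERACTIVE_CLASS=inter_class
   export MP_TASK_AFFINITY=core
   poe ./my_program -procs 4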

Running MPICH, MVAPICH, and MPICH-GM jobs
| LoadLeveler for AIX and LoadLeveler for Linux support three open-source
| implementations of the Message-Passing Interface (MPI).

                            MPICH is an open-source, portable implementation of the MPI Standard
                            developed by Argonne National Laboratory. It contains a complete implementation
                            of version 1.2 of the MPI Standard and also significant parts of MPI-2, particularly
                            in the area of parallel I/O. MPICH, MVAPICH, and MPICH-GM are the three MPI
|                           implementations supported by LoadLeveler for AIX and LoadLeveler for Linux:
                            v Additional documentation for MPICH is available from the Argonne National
                               Laboratory web site at:
                               http://www-unix.mcs.anl.gov/mpi/mpich1/
                            v MVAPICH is a high performance implementation of MPI-1 over InfiniBand
                               based on MPICH. Additional documentation for MVAPICH is available at the
                               Ohio State University Web site at:
                               http://mvapich.cse.ohio-state.edu/
                            v MPICH-GM is a port of MPICH on top of GM (ch_gm). GM is a low-level
                              message-passing system for Myrinet Networks. Additional documentation for
                              MPICH-GM is available from the Myrinet web site at:
                               http://www.myri.com/scs/

For MPICH, MVAPICH, or MPICH-GM, LoadLeveler allocates the machines to run the
parallel job and starts the implementation-specific script as the master task.
Some options of the implementation-specific scripts might not be required, or
are not supported, when used with LoadLeveler.

                            The following standard mpirun script options are not supported:
                            -map <list>
                               The mpirun script can either take a machinefile or a mapping of the machines
                               in which to run the mpirun job. If both the machinefile and map are specified,
                               then the map list overrides the machinefile. Because we want LoadLeveler to
                               decide which nodes to run on, use the machinefile specified by the
                               environment variable LOADL_HOSTFILE. Specifying a mapping of the host
                               name is not supported.
                            -allcpus
                                 This option is only supported when the -machinefile option is used. The
                                 mpirun script will run the job using all machines specified in the machine file,
                                 without the need to specify the -np option. Without specifying machinefile,
                                 the mpirun script will look in the default machines <arch> file to find the
                                 machines on which to run the job. The machines defined in the default file
                                 might not match what LoadLeveler has selected, which will cause the job to be
                                 removed.
                            -exclude <list>
                                This option is not supported because if you specified a machine in the exclude
                                list that has already been scheduled by LoadLeveler to run the job, the job will
                                be removed.
                            -dbg
                               This option might be used to select a debugger. This option is used to select a
                               debugger to be used with the mpirun script. LoadLeveler currently does not
                               support running interactive MPICH jobs, so starting mpirun jobs under a
                               debugger is not supported.

-ksq
    This option keeps the send queue. This is useful if you expect later to attach
    totalview to the running (or deadlocked) job, and want to see the send queues.
    This option is used for debugging purposes when attaching the mpirun job to
    totalview. Since we do not support running debuggers under LoadLeveler
    MPICH job management, this option is not supported.
-machinedir <directory>
    This option looks for the machine files in the indicated directory. LoadLeveler
    will create a machinefile that contains the host name for each task in the
    mpirun job. The environment variable LOADL_HOSTFILE contains the full
    path to the machinefile. A different machinefile is created per job and stored
    in the LoadLeveler execute directory. Because there might be multiple jobs
    running at one time, we do not want the mpirun script to choose any file in
    the execute directory because it might not be the correct file that the central
manager has assigned to the job step. This option is therefore not supported;
use the -machinefile option instead.
v When using MPICH, the mpirun script is run on the first machine allocated to
  the job. The mpirun script starts the actual execution of the parallel tasks on the
  other nodes included in the LoadLeveler cluster using llspawn.stdio as
  RSHCOMMAND.
The following option of MPICH's mpirun script is not supported:
  -nolocal
      This option specifies not to run on the local machine. The default behavior
      of MPICH (p4) is that the first MPI process is always spawned on the
      machine which mpirun has invoked. The -nolocal option disables the
      default behavior and does not run the MPI process on the local node. Under
      LoadLeveler’s MPICH Job management, it is required that at least one task
      is run on the local node, so the -nolocal option should not be used.
v When using MVAPICH, the mpirun_rsh command is run on the first machine
  allocated to the job as master task. The mpirun_rsh command starts the actual
  execution of parallel tasks on the other nodes included in the LoadLeveler
  cluster using llspawn as RSHCOMMAND.
The following options of MVAPICH's mpirun_rsh command are not supported
when used with LoadLeveler:
  -rsh
         Specifies to use rsh for connecting.
  -ssh
      Specifies to use ssh for connecting. The -rsh and -ssh options are supported,
      but the behavior has been changed to run mpirun_rsh jobs under
      LoadLeveler MPICH job manager. Replace the -rsh and -ssh commands with
      llspawn before compiling mpirun_rsh. Even if you select -rsh and -ssh, the
      llspawn command is actually used in place of -rsh and -ssh at runtime.
  -xterm
      Runs remote processes under xterm. This option starts an xterm window for
      each task in the mpirun job and runs the remote shell with the application
      inside the xterm window. This will not work under LoadLeveler because the
      llspawn command replaces the remote shell (rsh or ssh) and llspawn is not
      kept alive to the end of the application process.
  -debug
      Runs each process under the control of gdb. This option is used to select a
      debugger to be used with mpirun jobs. LoadLeveler currently does not
      support running interactive MPICH jobs so starting mpirun jobs under a
debugger is not supported. This option also requires xterm to be working
                                properly as it opens gdb under an xterm window. Since we do not support
                                the -xterm option, the -debug option is also not supported.
                          h1 h2....
                              Specifies the names of hosts where processes should run. The mpirun_rsh
                              script can either take a host file or read in the names of the hosts, h1 h2 and
                              so on, in which to run the mpirun job. If both host file and list of machines
                              are specified in the mpirun_rsh arguments, mpirun_rsh will have an error
                              parsing the arguments. Because we want LoadLeveler to decide which nodes
                              to run on, you should use the host list specified by the environment variable
                              LOADL_HOSTFILE. Specifying the names of the hosts is not supported.
                        v When using MPICH-GM, the mpirun.ch_gm script is run on the first machine
                          allocated to the job as master task. The mpirun.ch_gm script starts the actual
                          execution of the parallel tasks on the other nodes included in the LoadLeveler
                          cluster using the llspawn command as RSHCOMMAND.
The following options of MPICH-GM's mpirun script are not supported when
used with LoadLeveler:
                           --gm-kill <n>
                               This is an option that allows you to kill all remaining processes <n> seconds
                               after the first one dies or exits. Do not specify this option when running the
                               application under LoadLeveler, because LoadLeveler will handle the cleanup
                               of the tasks.
                           --gm-tree-spawn
                               This is an option that uses a two-level spawn tree to launch the processes in
                               an effort to reduce the load on any particular host. Because LoadLeveler is
                               providing its own scalable method for spawning the application tasks from
                               the master host, using the llspawn command, spawning processes in a
                               tree-like fashion is not supported.
                           -totalview
                               This option is used to select a totalview debugging session to be used with
                               the mpirun script. LoadLeveler currently does not support running
                               interactive MPICH jobs, so starting mpirun jobs under a debugger is not
                               supported.
-r   This option is optional for MPICH-GM; it forces the removal of the
     shared memory files. Because this option is not required, it is not
     supported. If you specify this option, it will be ignored.
                           -ddt
                               This option is used to select a DDT debugging session to be used with the
                               mpirun script. LoadLeveler currently does not support running interactive
                               MPICH jobs, so starting mpirun jobs under a debugger is not supported.

                        Sample programs are available:
                        v See “MPICH sample job command file” on page 208 for a sample MPICH job
                          command file.
                        v See “MPICH-GM sample job command file” on page 209 for a sample
                          MPICH-GM job command file.
                        v See “MVAPICH sample job command file” on page 211 for a sample MVAPICH
                          job command file.
                        v The LoadLeveler samples directory also contains sample files:
                          – On AIX, use directory /usr/lpp/LoadL/full/samples/llmpich
                          – On Linux, use directory /opt/ibmll/LoadL/full/samples/llmpich


These sample files include:
          – ivp.c: A simple MPI application that you may run as an MPICH, MVAPICH,
            or MPICH-GM job.
          – Job command files to run the ivp.c program as a batch job:
            - For MPICH: mpich_ivp.cmd
            - For MPICH-GM: mpich_gm_ivp.cmd
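For example, assuming the mpicc compiler wrapper from the MPICH installation used in these samples (/opt/mpich/bin) and the AIX samples directory listed above, you might build and submit the IVP as follows; the output file name is illustrative:

  # Copy and compile the sample MPI program
  cp /usr/lpp/LoadL/full/samples/llmpich/ivp.c .
  /opt/mpich/bin/mpicc -o ivp ivp.c

  # Submit the matching job command file (first edit it to point to your ivp binary)
  llsubmit mpich_ivp.cmd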

Examples: Building parallel job command files
      This topic contains sample job command files for several parallel environments.

      This topic contains sample job command files for the following parallel
      environments:
      v IBM AIX Parallel Operating Environment (POE)
      v MPICH
      v MPICH-GM
      v MVAPICH

      POE sample job command file
      This is a sample job command file for POE.

      Figure 26 is a sample job command file for POE.


#
# @ job_type = parallel
# @ environment = COPY_ALL
# @ output = poe.out
# @ error = poe.error
# @ node = 8,10
# @ tasks_per_node = 2
# @ network.LAPI = sn_all,US,,instances=1
# @ network.MPI = sn_all,US,,instances=1
# @ wall_clock_limit = 60
# @ executable = /usr/bin/poe
# @ arguments = /u/richc/My_POE_program -euilib "us"
# @ class = POE
# @ queue

      Figure 26. POE job command file – multiple tasks per node

      Figure 26 shows the following:
      v The total number of nodes requested is a minimum of eight and a maximum of
        10 (node=8,10). Two tasks run on each node (tasks_per_node=2). Thus the total
        number of tasks can range from 16 to 20.
      v Each task of the job will run using the LAPI protocol in US mode with a switch
        adapter (network.LAPI=sn_all,US,,instances=1), and using the MPI protocol in
        US mode with a switch adapter (network.MPI=sn_all,US,,instances=1).
      v The maximum run time allowed for the job is 60 seconds (wall_clock_limit=60).
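To try this sample, save it to a file and submit it with the llsubmit command; the file name here is illustrative:

  llsubmit poe_sample.cmd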

Figure 27 on page 208 is a second sample job command file for POE.




#
                        # @ job_type = parallel
                        # @ input = poe.in.1
                        # @ output = poe.out.1
                        # @ error = poe.err
                        # @ node = 2,8
                        # @ network.MPI = sn_single,shared,IP
                        # @ wall_clock_limit = 60
                        # @ class = POE
                        # @ queue
                        /usr/bin/poe /u/richc/my_POE_setup_program -infolevel 2
                        /usr/bin/poe /u/richc/my_POE_main_program -infolevel 2

                        Figure 27. POE sample job command file – invoking POE twice

                        Figure 27 shows the following:
                        v POE is invoked twice, through my_POE_setup_program and
                          my_POE_main_program.
                        v The job requests a minimum of two nodes and a maximum of eight nodes
                          (node=2,8).
                        v The job by default runs one task per node.
                        v The job uses the MPI protocol with a switch adapter in IP mode
                          (network.MPI=sn_single,shared,IP).
                        v The maximum run time allowed for the job is 60 seconds (wall_clock_limit=60).

                        MPICH sample job command file
                        This is a sample job command file for MPICH.

                        Figure 28 is a sample job command file for MPICH.

#!/bin/ksh
                        # LoadLeveler JCF file for running an MPICH job
                        # @ job_type = MPICH
                        # @ node = 4
                        # @ tasks_per_node = 2
                        # @ output = mpich_test.$(cluster).$(process).out
                        # @ error = mpich_test.$(cluster).$(process).err
                        # @ queue
                        echo "------------------------------------------------------------"
                        echo LOADL_STEP_ID=$LOADL_STEP_ID
                        echo "------------------------------------------------------------"

/opt/mpich/bin/mpirun -np $LOADL_TOTAL_TASKS -machinefile \
 $LOADL_HOSTFILE /common/NFS/ll_bin/mpich_test

                        Figure 28. MPICH job command file - sample 1

                        Note: You can also specify the job_type=parallel keyword and invoke the mpirun
                              script to run an MPICH job. In that case, the mpirun script would use rsh
                              or ssh and not the llspawn command.

                        Figure 28 shows that in the following job command file statement:
/opt/mpich/bin/mpirun -np $LOADL_TOTAL_TASKS -machinefile $LOADL_HOSTFILE /common/NFS/ll_bin/mpich_test
                        -np
                              Specifies the number of parallel processes.
                        LOADL_TOTAL_TASKS
                          Is the environment variable set by LoadLeveler with the number of parallel
                          processes of the job step.

-machinefile
   Specifies the machine list file.
LOADL_HOSTFILE
  Is the environment variable set by LoadLeveler with the file name that contains
  host names assigned to the parallel job step.
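In Figure 28, for example, node = 4 and tasks_per_node = 2, so LoadLeveler sets LOADL_TOTAL_TASKS to 8 and mpirun starts eight parallel processes on the hosts listed in the file named by LOADL_HOSTFILE.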

The following is another example of an MPICH job command file:

#!/bin/ksh
# LoadLeveler JCF file for running an MPICH job
# @ job_type = MPICH
# @ node = 4
# @ tasks_per_node = 2
# @ output = mpich_test.$(cluster).$(process).out
# @ error = mpich_test.$(cluster).$(process).err
# @ executable = /opt/mpich/bin/mpirun
# @ arguments = -np $LOADL_TOTAL_TASKS -machinefile $LOADL_HOSTFILE /common/NFS/ll_bin/mpich_test
# @ queue

Figure 29. MPICH job command file - sample 2

Figure 29 shows the following:
v The mpirun script is specified as a value of the executable job command file
  keyword.
v The following mpirun script arguments are specified with the arguments job
  command file keyword:
  -np $LOADL_TOTAL_TASKS -machinefile $LOADL_HOSTFILE /common/NFS/ll_bin/mpich_test

  -np
        Specifies the number of parallel processes.
  LOADL_TOTAL_TASKS
    Is the environment variable set by LoadLeveler with the number of parallel
    processes of the job step.
  -machinefile
     Specifies the machine list file.
LOADL_HOSTFILE
    Is the environment variable set by LoadLeveler with the name of the file that contains the host names assigned to the parallel job step.

MPICH-GM sample job command file
This is a sample job command file for MPICH-GM.

Figure 30 on page 210 is a sample job command file for MPICH-GM.




#! /bin/ksh
                        # LoadLeveler JCF file for running an MPICH-GM job
                        # @ job_type = MPICH
                        # @ resources = gmports(1)
                        # @ node = 4
                        # @ tasks_per_node = 2
                        # @ output = mpich_gm_test.$(cluster).$(process).out
                        # @ error = mpich_gm_test.$(cluster).$(process).err
                        # @ queue
                        echo "------------------------------------------------------------"
                        echo LOADL_STEP_ID=$LOADL_STEP_ID
                        echo "------------------------------------------------------------"
/opt/mpich/bin/mpirun.ch_gm -np $LOADL_TOTAL_TASKS -machinefile \
$LOADL_HOSTFILE /common/NFS/ll_bin/mpich_gm_test

                        Figure 30. MPICH-GM job command file - sample 1

Figure 30 shows the following:
v The statement # @ resources = gmports(1) specifies that each task consumes one GM port. This is how LoadLeveler limits the number of GM ports simultaneously in use on any machine. The resource name is the name you specified with schedule_by_resources in the configuration file, and each machine stanza in the administration file must define the gmports resource and specify the quantity of GM ports available on that machine (see the sketch after this list). Use the llstatus -R command to confirm the names and values of the configured and available consumable resources.
                        v In the following job command file statement:
/opt/mpich/bin/mpirun.ch_gm -np $LOADL_TOTAL_TASKS -machinefile $LOADL_HOSTFILE /common/NFS/ll_bin/mpich_gm_test
                           /opt/mpich/bin/mpirun.ch_gm
                               Specifies the location of the mpirun.ch_gm script shipped with the
                               MPICH-GM implementation that runs the MPICH-GM application.
                           -np
                                 Specifies the number of parallel processes.
                           -machinefile
                              Specifies the machine list file.
LOADL_HOSTFILE
    Is the environment variable set by LoadLeveler with the name of the file that contains the host names assigned to the parallel job step.
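The following is a minimal sketch of how the gmports resource might be configured; the keyword names are those referred to above, while the stanza label and the number of ports are illustrative:

  # Configuration file: schedule by the gmports consumable resource
  SCHEDULE_BY_RESOURCES = gmports

  # Administration file: a machine stanza advertising two GM ports
  node01: type = machine
          resources = gmports(2)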

                        Figure 31 is another sample job command file for MPICH-GM.

                        #! /bin/ksh
                        # LoadLeveler JCF file for running an MPICH-GM job
                        # @ job_type = MPICH
                        # @ resources = gmports(1)
                        # @ node = 4
                        # @ tasks_per_node = 2
                        # @ output = mpich_gm_test.$(cluster).$(process).out
                        # @ error = mpich_gm_test.$(cluster).$(process).err
                        # @ executable = /opt/mpich/bin/mpirun.ch_gm
# @ arguments = -np $LOADL_TOTAL_TASKS -machinefile $LOADL_HOSTFILE /common/NFS/ll_bin/mpich_gm_test
                        # @ queue

                        Figure 31. MPICH-GM job command file - sample 2

Figure 31 shows the following:
v The mpirun.ch_gm script is specified as the value of the executable job command file keyword.
v The following mpirun.ch_gm script arguments are specified with the arguments job command file keyword:
  -np $LOADL_TOTAL_TASKS -machinefile $LOADL_HOSTFILE /common/NFS/ll_bin/mpich_gm_test

  -np
        Specifies the number of parallel processes.
  LOADL_TOTAL_TASKS
    Is the environment variable set by LoadLeveler with the number of parallel
    processes of the job step.
  -machinefile
     Specifies the machine list file.
LOADL_HOSTFILE
    Is the environment variable set by LoadLeveler with the name of the file that contains the host names assigned to the parallel job step.

MVAPICH sample job command file
This is a sample job command file for MVAPICH.

Figure 32 is a sample job command file for MVAPICH:

#!/bin/ksh
# LoadLeveler JCF file for running an MVAPICH job
# @ job_type = MPICH
# @ node = 4
# @ tasks_per_node = 2
# @ output = mvapich_test.$(cluster).$(process).out
# @ error = mvapich_test.$(cluster).$(process).err
# @ queue
echo "------------------------------------------------------------"
echo LOADL_STEP_ID=$LOADL_STEP_ID
echo "------------------------------------------------------------"

/opt/mpich/bin/mpirun_rsh -np $LOADL_TOTAL_TASKS -machinefile \
 $LOADL_HOSTFILE /common/NFS/ll_bin/mpich_test

Figure 32. MVAPICH job command file - sample 1

Figure 32 shows that in the following job command file statement:
/opt/mpich/bin/mpirun_rsh -np $LOADL_TOTAL_TASKS -machinefile $LOADL_HOSTFILE /common/NFS/ll_bin/mpich_test
-np
      Specifies the number of parallel processes.
LOADL_TOTAL_TASKS
  Is the environment variable set by LoadLeveler with the number of parallel
  processes of the job step.
-machinefile
   Specifies the machine list file.
LOADL_HOSTFILE
  Is the environment variable set by LoadLeveler with the name of the file that contains the host names assigned to the parallel job step.

Figure 33 is another sample job command file for MVAPICH:




#!/bin/ksh
                        # LoadLeveler JCF file for running an MVAPICH job
                        # @ job_type = MPICH
                        # @ node = 4
                        # @ tasks_per_node = 2
                        # @ output = mvapich_test.$(cluster).$(process).out
                        # @ error = mvapich_test.$(cluster).$(process).err
                        # @ executable = /opt/mpich/bin/mpirun_rsh
# @ arguments = -np $LOADL_TOTAL_TASKS -machinefile $LOADL_HOSTFILE /common/NFS/ll_bin/mpich_test
                        # @ queue

                        Figure 33. MVAPICH job command file - sample 2

Figure 33 shows the following:
v The mpirun_rsh command is specified as the value of the executable job command file keyword.
                        v The following mpirun_rsh command arguments are specified with the
                          arguments job command file keyword:
                            -np $LOADL_TOTAL_TASKS -machinefile $LOADL_HOSTFILE /common/NFS/ll_bin/mpich_test

                            -np
                                  Specifies the number of parallel processes.
                            LOADL_TOTAL_TASKS
                              Is the environment variable set by LoadLeveler with the number of parallel
                              processes of the job step.
                            -machinefile
                               Specifies the machine list file.
LOADL_HOSTFILE
  Is the environment variable set by LoadLeveler with the name of the file that contains the host names assigned to the parallel job step.

             Obtaining status of parallel jobs
                        Both end users and LoadLeveler administrators can obtain status of parallel jobs in
                        the same way as they obtain status of serial jobs – either by using the llq
                        command or by viewing the Jobs window on the graphical user interface (GUI).

By issuing llq -l, or by using the Job Actions → Details selection in xloadl, users get a list of machines allocated to the parallel job. If you also need to see task instance information, use the -x option in addition to the -l option (llq -l -x). See “llq - Query job status” on page 479 for samples of output using the -x and -l options with the llq command.

             Obtaining allocated host names
                        llq -l output includes information on allocated host names.

                        Another way to obtain the allocated host names is with the
                        LOADL_PROCESSOR_LIST environment variable, which you can use from a shell
                        script in your job command file as shown in Figure 34 on page 213.

                        This example uses LOADL_PROCESSOR_LIST to perform a remote copy of a local
                        file to all of the nodes, and then invokes POE. Note that the processor list contains
                        an entry for each task running on a node. If two tasks are running on a node,
                        LOADL_PROCESSOR_LIST will contain two instances of the host name where the
                        tasks are running. The example in Figure 34 on page 213 removes any duplicate
                        entries.

Note that LOADL_PROCESSOR_LIST is set by LoadLeveler, not by the user. This environment variable is limited to 128 host names. If the value would exceed that limit, the environment variable is not set.

#!/bin/ksh
# @ output = my_POE_program.$(cluster).$(process).out
# @ error = my_POE_program.$(cluster).$(process).err
# @ class = POE
# @ job_type = parallel
# @ node = 8,12
# @ network.MPI = sn_single,shared,US
# @ queue

              tmp_file="/tmp/node_list"
              rm -f $tmp_file

              # Copy each entry in the list to a new line in a file so
              # that duplicate entries can be removed.
              for node in $LOADL_PROCESSOR_LIST
                      do
                              echo $node >> $tmp_file
                      done

# Sort the file, removing duplicate entries, and save the list in a variable
nodelist=$(sort -u $tmp_file)

              for node in $nodelist
                      do
                              rcp localfile $node:/home/userid
                      done

              rm -f $tmp_file


              /usr/bin/poe /home/userid/my_POE_program

              Figure 34. Using LOADL_PROCESSOR_LIST in a shell script


Working with reservations
              Under the BACKFILL scheduler only, LoadLeveler allows authorized users to make
              reservations, which specify a time period during which specific node resources are
              reserved for use by particular users or groups.

              Use Table 48 to find information about working with reservations.
              Table 48. Roadmap of tasks for reservation owners and users
              Subtask                              Associated instructions (see . . . )
              Learn how reservations work in the   v “Overview of reservations” on page 25
              LoadLeveler environment
                                                   v “Understanding the reservation life cycle” on page
                                                     214
              Creating new reservations            “Creating new reservations” on page 216
              Managing jobs that run under a       v “Submitting jobs to run under a reservation” on
              reservation                            page 218
                                                   v “Removing bound jobs from the reservation” on
                                                     page 220
              Managing existing reservations       v “Querying existing reservations” on page 221
                                                   v “Modifying existing reservations” on page 221
                                                   v “Canceling existing reservations” on page 222



                            Using the LoadLeveler interfaces for   v Chapter 16, “Commands,” on page 411
                            reservations                           v “Reservation API” on page 643



                 Understanding the reservation life cycle
                            From the time at which LoadLeveler creates a reservation through the time the
                            reservation ends or is canceled, a reservation goes through various states, which
                            are indicated in command listings and other displays or output.

                            Understanding these states is important because the current state of a reservation
                            dictates what actions you can take; for example, if you want to modify the start
                            time for a reservation, you may do so only while the reservation is in Waiting
                            state. Table 49 lists the possible reservation states, their abbreviations, and usage
                            notes.
                            Table 49. Reservation states, abbreviations, and usage notes
                            Reservation     Abbreviation     Usage notes
                            state           in displays /
                                            output
|                           Waiting         W                Reservations are in the Waiting state:
|                                                            1. When LoadLeveler first creates a reservation.
|                                                            2. After one occurrence of a recurring reservation ends
|                                                               and before the next occurrence starts.

|                                                            While the reservation is in the Waiting state:
                                                             v Only administrators and reservation owners may
                                                               modify, cancel, and add users or groups to the
                                                               reservation.
                                                             v Administrators, reservation owners, and users or groups
                                                               that are allowed to use the reservation may query it, and
                                                               submit jobs to run during the reservation period.




Setup           S               LoadLeveler changes the state of a reservation from
                                Waiting to Setup just before the start time of the
                                reservation. The actual time at which LoadLeveler places
                                the reservation in Setup state depends on the value set for
                                the RESERVATION_SETUP_TIME keyword in the
                                configuration file.

                                While the reservation is in Setup state:
                                v Only administrators and reservation owners may
                                  modify, cancel, and add users or groups to the
                                  reservation.
                                v Administrators, reservation owners, and users or groups
                                  that are allowed to use the reservation may query it, and
                                  submit jobs to run during the reservation period.

                                During this setup period, LoadLeveler:
                                v Stops scheduling unbound job steps to reserved nodes.
                                v Preempts any jobs that are still running on the nodes
                                  that are reserved through this reservation. To preempt
                                  the running jobs, LoadLeveler uses the preemption
                                  method specified through the
                                  DEFAULT_PREEMPT_METHOD keyword in the
                                  configuration file.
                                  Note: The default value for
                                  DEFAULT_PREEMPT_METHOD is SU (suspend),
                                  which is not supported in all environments, and the
                                  default value for PREEMPTION_SUPPORT is NONE. If
                                  you want preemption to take place at the start of the
                                  reservation, make sure the cluster is configured for
                                  preemption (see “Steps for configuring a scheduler to
                                  preempt jobs” on page 130 for more information).
Active          A               At the reservation start time, LoadLeveler changes the
                                reservation state from Setup to Active. It also dispatches
                                only job steps that are bound to the reservation, until the
                                reservation completes or is canceled.

                                LoadLeveler does not dispatch bound job steps that:
                                v Require certain resources, such as floating consumable
                                  resources, that are not available during the reservation
                                  period.
                                v Have expected end times that exceed the end time of the
                                  reservation. By default, LoadLeveler allows such jobs to
                                  run, but their completion is subject to resource
                                  availability. (An administrator may configure
                                  LoadLeveler to prevent such jobs from running.)
                                These bound job steps remain idle unless the required
                                resources become available.

                                While the reservation is in Active state:
                                v Only administrators and reservation owners may
                                  modify, cancel, and add users or groups to the
                                  reservation.
                                v Administrators, reservation owners, and users or groups
                                  that are allowed to use the reservation may query it, and
                                  submit jobs to run during the reservation period.


                            Active_Shared AS                At the reservation start time, LoadLeveler changes the
                                                            reservation state from Setup to Active. It also dispatches
                                                            only job steps that are bound to the reservation, unless the
                                                            reservation was created with the SHARED mode. In this case,
                                                            if reserved resources are still available after LoadLeveler
                                                            dispatches any bound job steps that are eligible to run,
                                                            LoadLeveler changes the reservation state to
                                                            Active_Shared, and begins dispatching job steps that are
                                                            not bound to the reservation. Once the reservation state
                                                            changes to Active_Shared, it remains in that state until the
                                                            reservation completes or is canceled. During this time,
                                                            LoadLeveler dispatches both bound and unbound job
                                                            steps, pending resource availability; bound job steps are
                                                            considered before unbound job steps.

                                                            The conditions under which LoadLeveler will not dispatch
                                                            bound job steps are the same as those listed in the notes
                                                            for the Active state.

                                                            The actions that administrators, reservation owners, and
                                                            users may perform are the same as those listed in the
                                                            notes for the Active state.
                            Canceled        CA              When a reservation owner, administrator, or LoadLeveler
                                                            issues a request to cancel the reservation, LoadLeveler
                                                            changes the state of a reservation to Canceled and unbinds
                                                            any job steps bound to this reservation. When the
                                                            reservation is in this state, no one can modify or submit
                                                            jobs to this reservation.
                            Complete        C               When a reservation end time is reached, LoadLeveler
                                                            changes the state of a reservation to Complete. When the
                                                            reservation is in this state, no one can modify or submit
                                                            jobs to this reservation.



                 Creating new reservations
                            You must be an authorized user or member of an authorized group to successfully
                            create a reservation. LoadLeveler administrators define authorized users by adding
                            the max_reservations keyword to the user or group stanza in the administration
                            file.

                            The max_reservations keyword setting also defines how many reservations you are
                            allowed to own. Ask your administrator whether you are authorized to create
                            reservations.

                            To be authorized to create reservations, LoadLeveler administrators also must have
                            the max_reservations keyword set in their user or group stanza.

|                           To create a reservation, use the llmkres command. Specify the start time of the
|                           reservation using the -t command option and the duration of the reservation using
|                           the -d command option. If you are creating a recurring reservation, you must use
|                           the -t option to specify the schedule for that reservation.



|   In addition to the start time and duration (or reservation schedule), you must also
|   use one of the following methods to specify how you want to select nodes for the
|   reservation.

|   Note: These methods are mutually exclusive.
    v The -n option on the llmkres command instructs LoadLeveler to reserve a
      number of nodes. LoadLeveler may select any unreserved node to satisfy a
      reservation. This command option is perhaps the easiest to use, because you
      need to know only how many nodes you want, not specific node characteristics.
      The minimum number of nodes a reservation must have is 1.
    v The -h option on the llmkres command instructs LoadLeveler to reserve specific
      nodes.
    v The -f option on the llmkres command instructs LoadLeveler to submit the
      specified job command file, and reserve appropriate nodes for the first job step
      in the job command file. Through this action, all job steps for the job are bound
      to the reservation. If the reservation request fails, LoadLeveler changes the state
      for all job steps for this job to NotQueued, and will not schedule any of those
      job steps to run.
    v The -j option on the llmkres command instructs LoadLeveler to reserve
      appropriate nodes for that job step. Through this action, the job step is bound to
      the reservation. If the reservation request fails, the job step remains in the same
      state as it was before.
v The -c option on the llmkres command instructs LoadLeveler to reserve a number of Blue Gene compute nodes (C-nodes). The -j and -f options also reserve Blue Gene resources if the job type is bluegene.
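For example, the following command requests a reservation of four nodes starting at 3:00 PM on March 10 and lasting 120 minutes. The option letters are those described above, but the start time format shown is illustrative; see the llmkres reference page for the exact syntax:

  llmkres -t 03/10 15:00 -d 120 -n 4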

    You also may define other reservation attributes, including:
    v Whether additional users or groups are allowed to use the reservation. Use the
      -U or -G command options, respectively.
    v Whether the reservation will be in one or both of these optional modes:
      – SHARED mode: When you use the -s command option, LoadLeveler allows
         reserved resources to be shared by job steps that are not associated with a
         reservation. This mode enables the efficient use of reserved resources; if the
         bound job steps do not use all of the reserved resources, LoadLeveler can
         schedule unbound job steps as well so the resources do not remain idle.
         Unless you specify this mode, however, only job steps bound to the
         reservation may use the reserved resources.
      – REMOVE_ON_IDLE mode: When you use the -i command option, LoadLeveler
         automatically cancels the reservation when all bound job steps that can run
         finish running. Using this mode is efficient because it prevents LoadLeveler
         from wasting reserved resources when no jobs are available to use them.
         Selecting this mode is especially useful for workloads that will run
         unattended.
|   v The default binding method to use when jobs are bound to the reservation. Use
|     the -m option to specify whether the soft or firm binding method should be
|     used when the binding method is not specified by the llbind command.
|     – Soft binding allows the bound job to use resources outside of the reservation.
|     – Firm binding restricts the job to the reserved resources.
|   v For a recurring reservation, when the reservation will expire. Use the -e option
|     to specify the expiration date of the recurring reservation.

    Additional rules apply to the use of these options; see “llmkres - Make a
    reservation” on page 459 for details.


|                           Alternative: Use the ll_make_reservation and the ll_init_reservation_param
|                           subroutines in a program.

                            Tips:
| v If your user ID is not authorized to create any type of reservation but you are a
    member of a group with authority to create reservations, you must use the -g
                              option to specify the name of the authorized group on the llmkres command.
                            v Only reservations in waiting and in use are counted toward the limit of allowed
                              reservations set through the max_reservations keyword. LoadLeveler does not
|                             count reservations or recurring reservations that have already ended or are in
                              the process of being canceled.
|                           v For accounting purposes, although recurring reservations have multiple
|                             instances, a recurring reservation counts as one reservation no matter how many
|                             times it may recur during its reservation period.
|                           v Although you may create more than one reservation or recurring reservation for
                              a particular node or set of nodes, only one of those reservations may be active at
                              a time. If LoadLeveler determines that the reservation you are requesting will
                              overlap with another reservation, LoadLeveler fails the create request. No
                              reservation periods for the same set of machines can overlap.

                            If the create request is successful, LoadLeveler assigns and returns to the owner a
                            unique reservation identifier, in the form host.rid.r, where:
                            host     The name of the machine which assigned the reservation identifier.
                            rid      A number assigned to the reservation by LoadLeveler.
                            r        The letter r is used to distinguish a reservation identifier from a job step
                                     identifier.

                            The following are examples of reservation identifiers:
                            c94n16.80.r
                            c94n06.1.r

                            For details about the LoadLeveler interfaces for creating reservations, see:
                            v “llmkres - Make a reservation” on page 459.
                            v “ll_make_reservation subroutine” on page 653 and “ll_init_reservation_param
                              subroutine” on page 652.

                 Submitting jobs to run under a reservation
                            LoadLeveler administrators, reservation owners, and authorized users may submit
                            jobs to run under a reservation.

You may bind both batch and interactive POE job steps to a reservation, either before a reservation starts or while it is active.

                            Before you begin:
                            v If you are a reservation owner and used the -f or -j options on the llmkres
                              command when you created the reservation, you do not have to perform the
                              steps listed in Table 50 on page 219. Those command options automatically bind
                              the job steps to the reservation. To find out whether a particular job step is
                              bound to a reservation, use the command llq -l and check the listing for a
                              reservation ID.
                            v To find out which reservation IDs you may use, check with your LoadLeveler
                              administrator, or enter the command llqres -l and check the names in the Users
                              or Groups fields (under the Modification time field) in the output listing. If your


user name or a group name to which you belong appears in these output fields,
      you are authorized to use the reservation.
    v LoadLeveler cannot guarantee that certain resources will be available during a
      reservation period. If you submit job steps that require these resources,
      LoadLeveler will bind the job steps to the reservation, but will not dispatch
      them unless the resources become available during the reservation. These
      resources include:
      – Specific nodes that were not reserved under this reservation.
      – Floating consumable resources for a cluster.
      – Resources that are not released through preemption, such as virtual memory
         and adapters.
    v Whether bound job steps are successfully dispatched depends not only on
      resource availability, but also on administration file keywords that set maximum
      numbers, including:
      – max_jobs_scheduled
      – maxidle
      – maxjobs
      – maxqueued
      If LoadLeveler determines that scheduling a bound job will exceed one or more
      of these configured limits, your job will remain idle unless conditions permit
      scheduling at a later time during the reservation period.
    Table 50. Instructions for submitting a job to run under a reservation
To bind this type of job:     Use these instructions:

Already submitted jobs        Use the llbind command.

                              Alternative: Use the ll_bind_reservation subroutine in a
                              program.

                              Result: LoadLeveler either sets the reservation ID for each
                              job step that can be bound to the reservation, or sends a
                              failure notification for the bind request.

A new job that has not        1. Specify the reservation ID through the LL_RES_ID
been submitted                   environment variable or the ll_res_id command file
                                 keyword. The ll_res_id keyword takes precedence over
                                 the LL_RES_ID environment variable.
                                 Tip: You can use the ll_res_id keyword to modify the
                                 reservation to submit to in a job command file filter.
                              2. Use the llsubmit command to submit the job.
                                 Result: If the job can be bound to the requested
                                 reservation, LoadLeveler sets the reservation ID for each
                                 job step that can be bound to the reservation. Otherwise,
                                 if the job step cannot be bound to the reservation,
                                 LoadLeveler changes the job state to NotQueued. To
                                 change the job step’s state to Idle, issue the llbind -r
                                 command.


    Use the llqres command or llq command with the -l option to check the success or
    failure of the binding request for each job step.
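For example, to bind a new job to the reservation c94n16.80.r shown earlier, you can either add the keyword to the job command file or set the environment variable before submitting; the job command file name is illustrative:

  # In the job command file:
  # @ ll_res_id = c94n16.80.r

  # Or from the shell, before submitting:
  export LL_RES_ID=c94n16.80.r
  llsubmit myjob.cmd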

|   Selecting firm or soft binding: There are two methods by which a job step can be
|   bound to a reservation: firm and soft. When a job step is firm bound to a
|   reservation, the job step can only use the reserved resources. A job step that is soft
|   bound to a reservation can be started before the reservation becomes active and
|   can use nodes that are not part of the reservation. Using soft binding is a way of
|   guaranteeing that resources will be available for the job step at a given time, but
|   allowing the job step to start earlier if there are available resources.

Which method to use is specified by the -m option of the llbind command. If
                            neither is specified by llbind, the default method specified for the reservation is
                            used. Use llqres -l and review the Binding Method field to determine which
                            method is the default for a reservation.

|                           Binding a job step to a recurring reservation: When a job step is bound to a
|                           reservation, the job step can be considered for scheduling as soon as any
|                           occurrence of the reservation is active. If you do not want the job step to run right
|                           away, but instead you want it to run in a later occurrence of the reservation, you
|                           can specify which occurrence the job step will be bound to by adding the
|                           occurrence ID to the end of the reservation ID.

|                           The format of the reservation identifier is [host.]rid[.r[.oid]].

|                           where:
|                           v host is the name of the machine that assigned the reservation identifier.
|                           v rid is the number assigned to the reservation when it was created. An rid is
|                             required.
|                           v r indicates that this is a reservation ID (r is optional if oid is not specified).
|                           v oid is the occurrence ID of a recurring reservation (oid is optional).

|                           When oid is specified, the job step will not be considered for scheduling until that
|                           occurrence of the reservation becomes active. The step will remain in Idle state
|                           during all earlier occurrences.
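For example, binding a job step to c94n16.80.r.3 ties the step to occurrence 3 of reservation c94n16.80.r; the step remains in Idle state during all earlier occurrences and is considered for scheduling only when occurrence 3 becomes active.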

|                           If a job step is bound to a recurring reservation, and the reservation occurrence’s
|                           end time is reached before the job step can be scheduled to run, the job step will
|                           be automatically bound to the next occurrence of the reservation by LoadLeveler.
|                           When the next occurrence becomes active, the job step will again be considered for
|                           scheduling.

|                           A job can be submitted with the recurring keyword set to yes in the job command
|                           file to specify that all steps of the job will be run in every occurrence of the
|                           reservation to which it is bound. When all steps of the job have completed, the
|                           entire job is requeued and all steps are bound to the next occurrence of the
|                           reservation.

                            For details about the LoadLeveler interfaces for submitting jobs under reservations,
                            see:
                            v “llbind - Bind job steps to a reservation” on page 415.
                            v “ll_bind subroutine” on page 645.
                            v “llsubmit - Submit a job” on page 531.

                 Removing bound jobs from the reservation
                            LoadLeveler administrators, reservation owners, and authorized users may use the
                            llbind command to unbind one or more existing jobs from a reservation.

|                           Alternative: Use the ll_bind_reservation subroutine in a program.

                            Result: LoadLeveler either unbinds the jobs from the reservation, or sends a failure
                            notification for the unbind request. Use the llqres or llq command to check the
                            success or failure of the remove request.




For details about the LoadLeveler interfaces for removing bound jobs from the
          reservation, see:
          v “llbind - Bind job steps to a reservation” on page 415.
          v “ll_bind subroutine” on page 645.

    Querying existing reservations
|         Any LoadLeveler administrator or user can issue the llqres and llq commands to
|         query the status of an existing reservation or recurring reservation.

          Use these commands to request specific information about reservations:
          v Various options are available to filter reservations to be displayed.
          v To show details of specific reservations, use the llqres command with the -l
            option.
          v To show job steps that are bound to specific reservations, use the llq command
            with the -R option.

          For details about:
          v Reservation attributes and llqres command syntax, see “llqres - Query a
            reservation” on page 500.
          v llq command syntax, see “llq - Query job status” on page 479.

    Modifying existing reservations
          Only administrators and reservation owners can use the llchres command to
|         modify one or more attributes of a reservation or a recurring reservation.

          Certain attributes cannot be changed after a reservation has become active. Typical
          uses for the llchres command include the following:
          v Using the command llchres -U +newuser1 newuser2 to allow additional users to
            submit jobs to the reservation.
          v If a reservation was made through the command llmkres -h free but
            LoadLeveler cannot include a particular node because it is down, you can use
            the command llchres -h +node to add the node to the reserved node list when
            that node becomes available again.
          v If a reserved node is down after the reservation becomes active, a LoadLeveler
            administrator can use:
            – The command llchres -h -node to remove that node from the reservation.
            – The command llchres -h +1 to add another node to the reservation.
|         v Extending the expiration of a recurring reservation which may be about to
|           expire. You can use llchres -e to specify a new expiration date for the
|           reservation without having to create a new reservation.
|         v Making a temporary change to the next occurrence of a recurring reservation
|           without affecting any future occurrences of that reservation. For example, you
|           can use the -o option of the llchres command to temporarily add a user (-U) or
|           additional nodes (-n). Once that occurrence ends, the next occurrence will not
|           retain the change.

|         Alternative: Use the ll_change_reservation subroutine in a program.

          For details about the LoadLeveler interfaces for modifying reservations, see:
          v “llchres - Change attributes of a reservation” on page 424.
          v “ll_change_reservation subroutine” on page 648.



Canceling existing reservations
|                           Administrators and reservation owners may use the llrmres command to cancel
|                           one or more reservations or to cancel some occurrences of a recurring reservation
|                           while leaving the remaining occurrences of that reservation unchanged in the
|                           system.

|                           The options available when canceling a reservation are:
|                           v Remove the entire reservation. All occurrences are removed and any bound job
|                             steps are automatically unbound from the reservation.
|                           v Remove a specific occurrence of the reservation. All other occurrences remain in
|                             the system and all bound job steps remain bound to the reservation.
|                           v Remove all occurrences during a specified interval. For example, a reservation
|                             may recur every day for one year, but during a one-week holiday period, the
|                             reservation is not needed. The reservation owner could cancel all of the
|                             occurrences during that one week period and all other occurrences would
|                             remain in the system and all bound job steps would remain bound to the
|                             reservation.
|                           If some occurrences are canceled and the result is that no occurrences remain, then
|                           the entire reservation is removed and all jobs are unbound from the reservation.

|                           Alternative: Use the ll_remove_reservation subroutine in a program.

                            Use the llqres command to check the success or failure of the remove request.

|                           Use the llqres -l command to see a list of canceled occurrence IDs or to note
|                           individual occurrence start times which have been omitted due to cancellation.

                            For details about the LoadLeveler interfaces for canceling reservations, see:
                            v “llrmres - Cancel a reservation” on page 508.
                            v “ll_remove_reservation subroutine” on page 658.

    Submitting jobs requesting scheduling affinity
                            You can request that a job use scheduling affinity by setting the RSET and
                            TASK_AFFINITY job command file keywords.

                            Specify RSET with a value of:
                            v RSET_MCM_AFFINITY to have LoadLeveler schedule the job to machines
                              where RSET_SUPPORT is enabled with a value of RSET_MCM_AFFINITY.
                            v user_defined_rset to have LoadLeveler schedule the job to machines where
                              RSET_SUPPORT is enabled with a value of RSET_USER_DEFINED;
                              user_defined_rset is the name of a valid user-defined RSet.
                            Specifying the RSET job command file keyword defaults to requesting memory
                            affinity as a requirement and adapter affinity as a preference. Scheduling affinity
                            options can be customized by using the job command file keyword
                            MCM_AFFINITY_OPTIONS. For more information on these keywords, see “Job
                            command file keyword descriptions” on page 359.
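For example, a minimal job command file fragment that requests memory affinity through the rset keyword; all other keywords are omitted for brevity:

  # @ job_type = parallel
  # @ rset = RSET_MCM_AFFINITY
  # @ queue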

                            Note: If a job specifies memory or adapter affinity scheduling as a requirement,
                                  LoadLeveler will only consider machines where RSET_SUPPORT is set to
                                  RSET_MCM_AFFINITY. If there are not enough machines satisfying the
                                  memory affinity requirements, the job will stay in the idle state.

Specify TASK_AFFINITY with a value of:
|                 v CORE(n) to have LoadLeveler schedule the job to machines where
|                   RSET_SUPPORT is enabled with a value of RSET_MCM_AFFINITY. On SMT
|                   and ST nodes, LoadLeveler will assign n physical CPUs to each job task.
| v CPU(n) to have LoadLeveler schedule the job to machines where
|   RSET_SUPPORT is enabled with a value of RSET_MCM_AFFINITY. On SMT
|   nodes, LoadLeveler will assign n logical CPUs to each job task. On ST
|   nodes, LoadLeveler will assign n physical CPUs to each job task.

|                 Specify a requirement of SMT with a value of:
|                 v Enabled to have LoadLeveler schedule the job to machines where SMT is
|                   currently enabled.
|                   Example: #@ requirements = (SMT == "Enabled")
|                 v Disabled to have LoadLeveler schedule the job to machines where SMT is
|                   currently disabled or is not supported.
|                   Example: #@ requirements = (SMT == "Disabled")

OpenMP multithreaded jobs can be submitted requesting thread-level binding, where each individual thread of an OpenMP application is bound to a separate physical core processor or logical CPU. Use the parallel_threads job command file keyword to request OpenMP thread-level binding, optionally along with the task_affinity job command file keyword.

The CPUs for the individual OpenMP threads of a task are selected based on the number of parallel threads in each task (the parallel_threads job command file keyword) and on the set of CPUs or cores assigned to the task (the task_affinity job command file keyword). The CPUs are assigned to the threads only if at least one CPU is available for each thread from the set of CPUs or cores assigned to the task. If the number of CPUs in that set is not sufficient to bind all of the threads, the job will not run.

                  This example binds 4 OpenMP parallel threads to 4 separate cores:
                  #@ task_affinity = Core(4)
                  #@ parallel_threads = 4

| Note: If you specify cpus_per_core along with your affinity request, as in:
|       #@ task_affinity = core(n)
|       #@ cpus_per_core = 1
|
|       then LoadLeveler allocates the requested number of CPUs to each task on
|       SMT nodes only. Nodes running in ST mode are not assigned to jobs
|       requesting cpus_per_core.

    Submitting and monitoring jobs in a LoadLeveler multicluster
                  There are subtasks and associated instructions for submitting and monitoring jobs
                  in a LoadLeveler multicluster.

                  Table 51 on page 224 shows the subtasks and associated instructions for submitting
                  and monitoring jobs in a LoadLeveler multicluster:




Table 51. Submitting and monitoring jobs in a LoadLeveler multicluster
                            Subtask                      Associated instructions (see . . . )
                            Prepare and submit a job     “Steps for submitting jobs in a LoadLeveler multicluster
                            in the LoadLeveler           environment”
                            multicluster
                            Display information about    v Use the llq -X cluster_name command to display information
                            a job in the LoadLeveler       about jobs on remote clusters.
                            multicluster environment
                                                         v Use llq -x -d to display the user’s job command file keyword
                                                           statements.
                                                         v Use llq -X cluster_name -l to obtain multicluster-specific
                                                           information.
                            Transfer an idle job from    Use the llmovejob command, which is described in “llmovejob
                            one cluster to another       - Move a single idle job from the local cluster to another
                            cluster                      cluster” on page 470.



                 Steps for submitting jobs in a LoadLeveler multicluster
                 environment
                            There are steps for submitting jobs in a LoadLeveler multicluster environment.

|                           In a multicluster environment, you can specify one of the following:
                            v That a job is to run on a particular cluster.
                            v That LoadLeveler is to decide which cluster is best from a list of clusters,
                              based on an administrator-defined metric. If the reserved word any is
                              specified, the job is submitted to the best of all configured clusters, based
                              on that metric.
|                           v That a job is a scale-across job, which will run across multiple clusters.

                            The following procedure explains how to prepare your job to be submitted in the
                            multicluster environment.

                            Before you begin: You need to know that:
                            v Only batch jobs are supported in the LoadLeveler multicluster environment.
                              LoadLeveler will fail any interactive jobs that you attempt to submit in a
                              multicluster environment.
                            v LoadLeveler assigns all steps of a multistep job to the same cluster.
                            v Job identifiers are assigned by the local cluster and are retained by the job
                              regardless of what cluster the job executes in.
                            v Remote jobs are subjected to the same configuration checks as locally submitted
                              jobs. Examples include account validation, class limits, include lists, and exclude
                              lists.

                            Perform the following steps to submit jobs to run in one cluster in a LoadLeveler
                            multicluster environment.
                            1. If files used by your job need to be copied between clusters, you must specify
                               the job files to be copied from the local to the remote cluster in the job
                               command file. Use the cluster_input_file and cluster_output_file keywords to
                               specify these files.
                               Rules:
                               v Any local file specified for copy must be accessible from the local gateway
                                  Schedd machines. Input files must be readable. Directories and permissions
                                  must be in place to write output files.



v Any remote file specified for copy must be accessible from the remote
          gateway Schedd machines. Directories and permissions must be in place to
          write input files. Output files must be readable when the job terminates.
       v To copy more than one file, these keywords can be specified multiple times.
       Tip: Each instance of these keywords allows you to specify a single local file
       and a single remote file. If your job requires copying multiple files (for
       example, all files in a directory), you may want to use a procedure to
       consolidate the multiple files into a single file rather than specify multiple
       cluster_file statements in the job command file. The following is an example of
       how you could consolidate input files:
       a. Use the tar command to produce a single tar file from multiple files.
       b. On the cluster_input_file keyword, specify the file that resulted from the
           tar command processing.
       c. Modify your job command file such that it uses the tar command to restore
           the multiple files from the tar file prior to invoking your application.
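        For example, a sketch of this approach (file and path names are illustrative;
        see the cluster_input_file keyword description on page 359 for its exact
        syntax). First, combine the input files on the local cluster:
           tar -cvf inputs.tar data1.in data2.in data3.in
        In the job command file, copy the tar file to the remote cluster:
           # @ cluster_input_file = /u/sam/inputs.tar, /scratch/sam/inputs.tar
        In the job script, restore the files before invoking your application:
           tar -xvf /scratch/sam/inputs.tar
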
    2. In the job command file, specify the clusters to which LoadLeveler may submit
       the job. The cluster_list keyword is a blank-delimited list of cluster names or
       the reserved word any where:
       v A single cluster name indicates that the job is to be submitted to that cluster.
       v A list of multiple cluster names indicates that the job is to be submitted to
          one of the clusters as determined by the installation exit
          CLUSTER_METRIC.
       v The reserved word any indicates that the job is to be submitted to any
          cluster defined by the installation exit CLUSTER_METRIC.
       Alternative: You can specify the clusters to which LoadLeveler can submit your
       job on the llsubmit command using the -X option.
|   3. Use the llsubmit command to submit the job.
|      Tip: You may use the -X option on the llsubmit command to specify:
|      -X {cluster_list | any}
|               Is a blank-delimited list of cluster names or the reserved word any
|               where:
|               v A single cluster name indicates that the job is to be submitted to that
|                  cluster.
|               v A list of multiple cluster names indicates that the job is to be
|                  submitted to one of the clusters as determined by the installation exit
|                  CLUSTER_METRIC.
|               v The reserved word any indicates that the job is to be submitted to
|                  any cluster defined by the installation exit CLUSTER_METRIC.

|      Note: If a remote job is submitted with a list of clusters or the reserved word
|            any and the installation exit CLUSTER_METRIC is not specified, the
|            remote job is not submitted.
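
       For example, a sketch that combines these steps (the cluster names and the job
       command file name are illustrative):
       # @ cluster_list = cluster1 cluster2
       # @ queue

       llsubmit myjob.cmd

       LoadLeveler submits myjob.cmd to whichever of cluster1 or cluster2 the
       CLUSTER_METRIC installation exit selects; llsubmit -X any myjob.cmd would
       instead allow LoadLeveler to choose among all configured clusters.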

    Perform the following steps to submit scale-across jobs to run across multiple
    clusters in a multicluster environment:
    1. In the job command file, specify the cluster_option keyword as scale_across.
       Alternative: You can submit a scale-across job using the -S option of the
       llsubmit command.
    2. You can limit which clusters can be used to run the job by using the
       cluster_list keyword to specify the limited set of clusters. For a scale-across
       job, if the cluster_list keyword is not specified or the reserved word any is
       specified in the cluster_list, all clusters may be used to run the job.
       Alternative: You can limit which clusters can be used to run the scale-across job
       using the -X option of the llsubmit command.

3. Use the llsubmit command to submit the job from any cluster in the
                               scale-across multicluster environment.
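
                             For example, a sketch of a scale-across submission (the cluster names are
                             illustrative):
                             # @ cluster_option = scale_across
                             # @ cluster_list = clusterA clusterB
                             # @ queue

                             Omitting cluster_list, or specifying any, would allow the job to run across
                             all configured clusters.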

                            The llsubmit command displays the assigned local outbound Schedd, the assigned
                            remote inbound Schedd, the scheduling cluster and the job identifier when the
                            remote job has been successfully submitted. Use the -q flag to stop these additional
                            messages from being displayed.

                            When you are done, you can use commands to display information about the
                            submitted job; for example:
                            v Use llq -l -X cluster_name -j job_id where cluster_name and job_id were displayed
                              by the llsubmit command to display information about the remote job.
                            v Use llq -l -X cluster_list to display the long listing about jobs, including
                              scheduling cluster, submitting cluster, user-requested cluster, cluster input and
                              output files.
                            v Use llq -X all to display information about all jobs in all configured clusters.
|                           v Use llq twice to display the job status for a scale-across job on all clusters where
|                             the job has been distributed. In the first command, specify the -l option to
|                             display the set of clusters where the job has been distributed (the value from the
|                             Cluster List output line). The second time you run the command, specify the -X
|                             option with the list of clusters reported from the first command. The result from
|                             that command shows the job status on the other clusters.
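
                            For example, a sketch of this two-step query (the job identifier and cluster
                            names are illustrative, and the quoting of the cluster list is an assumption):
                            llq -l clusterX.42.0
                            llq -X "clusterA clusterB" clusterX.42.0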

    Submitting and monitoring Blue Gene jobs
                            The following procedure explains how to prepare your job to be submitted to the
                            Blue Gene system.

                            The submission of Blue Gene jobs is similar to the submission of other job types.

                            Before you begin: You need to know that checkpointing Blue Gene jobs is not
                            currently supported.

                            Tip: Use the llstatus command to check if Blue Gene support is enabled and
                            whether Blue Gene is currently present. The llstatus command will display:
                            The BACKFILL scheduler with Blue Gene support is in use

                            Blue Gene is present

                            when Blue Gene support is enabled and Blue Gene is currently present.

                            Perform the following steps to submit Blue Gene jobs:
                            1. In the job command file, set the job type to Blue Gene by specifying:
                                #@job_type = bluegene
                            2. Specify the size or shape of the Blue Gene job or the Blue Gene partition in
                               which the job will run.
                                v The size of the Blue Gene job can be specified by using the bg_size job
                                  command file keyword. For more information, see the detailed description
                                  of the bg_size keyword.
                                v The shape of the Blue Gene job can be specified by using the bg_shape job
                                  command file keyword. If you require the exact shape you specified, you
                                  may wish to set the bg_rotate keyword to false. For more information, see
                                  the detailed descriptions of the bg_shape and bg_rotate keywords.

v The partition in which the Blue Gene job is run can be specified using the
          bg_partition job command file keyword. For more information, see the
          detailed description of the bg_partition keyword.
|      v The size of a Blue Gene job refers to the number of Blue Gene compute
|         nodes instead of the number of tasks running on Startd machines. The
|         following keywords cannot be used to control the size of a Blue Gene job:
|         – node
|         – tasks_per_node
|         – total_tasks
    3. Specify any other job command file keywords you require, including the
       bg_connection and bg_requirements Blue Gene job command file keywords.
       See “Job command file keyword descriptions” on page 359 for more
       information on job command file keywords.
    4. Upon completing your job command file, submit the job using the llsubmit
       command.

    If you experience a problem submitting a Blue Gene job, see “Troubleshooting in a
    Blue Gene environment” on page 717 for common questions and answers
    pertaining to operations within a Blue Gene environment.

    When you are done, you can use the llq -b command to display information about
    Blue Gene jobs in short form. For more information, see “llq - Query job status” on
    page 479.

    Example:

    The following is a sample job command file for a Blue Gene job:
    # @ job_name            = bgsample
    # @ job_type            = bluegene
    # @ comment             = "BGL Job by Size"
    # @ error               = $(job_name).err
    # @ output              = $(job_name).out
    # @ environment         = COPY_ALL;
    # @ wall_clock_limit    = 200:00,200:00
    # @ notification        = always
    # @ notify_user         = sam
    # @ bg_size             = 1024
    # @ bg_connection       = torus
    # @ class               = 2bp
    # @ queue
    /usr/bin/mpirun -exe /bgscratch/sam/com -verbose 2 -args "-o 100 -b 64 -r"




Chapter 9. Managing submitted jobs
               This is a list of the tasks and sources of additional information for managing
               LoadLeveler jobs.

               Table 52 lists the tasks and sources of additional information for managing
               LoadLeveler jobs.
               Table 52. Roadmap of user tasks for managing submitted jobs
               To learn about:                Read the following:
               Displaying information about   v “Querying the status of a job”
               a submitted job or its
                                              v “Working with machines” on page 230
               environment
                                              v “Displaying currently available resources” on page 230
                                              v “llclass - Query class information” on page 433
                                              v “llq - Query job status” on page 479
                                              v “llstatus - Query machine status” on page 512
                                              v “llsummary - Return job resource information for
                                                accounting” on page 535
               Changing the priority of a     v “Setting and changing the priority of a job” on page 230
               submitted job
                                              v “llmodify - Change attributes of a submitted job step” on
                                                page 464
               Changing the state of a        v “Placing and releasing a hold on a job” on page 232
               submitted job
                                              v “Canceling a job” on page 232
                                              v “llhold - Hold or release a submitted job” on page 454
                                              v “llcancel - Cancel a submitted job” on page 421
               Checkpointing a submitted      v “Checkpointing a job” on page 232
               job
                                              v “llckpt - Checkpoint a running job step” on page 430



Querying the status of a job
               Once you submit a job, you can query the status of the job to determine, for
               example, if it is still in the queue or if it is running.

               You also receive other job status related information such as the job ID and the
               submitting user ID. You can query the status of a LoadLeveler job either by using
               the GUI or the llq command. For an example of querying the status of a job, see
               Chapter 10, “Example: Using commands to build, submit, and manage jobs,” on
               page 235.

               Querying the status of a job using a submit-only machine: In addition to
               allowing you to submit and cancel jobs, a submit-only machine allows you to
               query the status of jobs. You can query a job using either the submit-only version
               of the GUI or by using the llq command. For information on llq, see “llq - Query
               job status” on page 479.
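
                For example (the job identifier is illustrative):
                llq wizard.22

                Issued with no arguments, llq reports on all jobs in the queue.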




Working with machines
                         There are several types of tasks related to machines.

                        You can perform the following types of tasks related to machines:
                        v Display machine status
                          When you submit a job to a machine, the status of the machine automatically
                          appears in the Machines window on the GUI. This window displays machine
                          related information such as the names of the machines running jobs, as well as
                          the machine’s architecture and operating system. For detailed information on
                          one or more machines in the cluster, you can use the Details option on the
                          Actions pull-down menu. This will provide you with a detailed report that
                          includes information such as the machine’s state and amount of installed
                          memory.
                           For an example of displaying machine status, see Chapter 10, “Example: Using
                           commands to build, submit, and manage jobs,” on page 235.
                        v Display central manager
                          The LoadLeveler administrator designates one of the machines in the
                          LoadLeveler cluster as the central manager. When jobs are submitted to any
                          machine, the central manager is notified and decides where to schedule the jobs.
                          In addition, it keeps track of the status of machines in the cluster and jobs in the
                          system by communicating with each machine. LoadLeveler uses this information
                          to make the scheduling decisions and to respond to queries.
                           Usually, the system administrator is more concerned about the location of the
                           central manager than the typical end user is, but you may also want to
                           determine its location; for example, to browse configuration files that are
                           stored on the same machine as the central manager.
                        v Display public scheduling machines
                          Public scheduling machines are machines that participate in the scheduling of
                          LoadLeveler jobs on behalf of users at submit-only machines and users at other
                          workstations that are not running the Schedd daemon. You can find out the
                          names of all these machines in the cluster.
                          Submit-only machines allow machines that are not part of the LoadLeveler
                          cluster to submit jobs to the cluster for processing.

Displaying currently available resources
                         The LoadLeveler user can get information about currently available resources by
                         using the llstatus command with either the -F or the -R option.

                        The -F option displays a list of all of the floating resources associated with the
                        LoadLeveler cluster. The -R option lists all of the consumable resources associated
                        with all of the machines in the LoadLeveler cluster. The user can specify a hostlist
                        with the llstatus command to display only the consumable resources associated
                        with specific hosts.
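
                         For example (the host names are illustrative):
                         llstatus -F
                         llstatus -R
                         llstatus -R machine_a machine_b

                         The first command lists the floating resources of the cluster, the second lists
                         the consumable resources of every machine, and the third restricts the listing
                         to the named hosts.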

Setting and changing the priority of a job
                        LoadLeveler uses the priority of a job to determine its position among a list of all
                        jobs waiting to be dispatched.




       LoadLeveler schedules jobs based on the adjusted system priority, which takes
       into account both system priority and user priority:
      User priority
             Every job has a user priority associated with it. A job with a higher priority
             runs before a job with a lower priority (when both jobs are owned by the
             same user). You can set this priority through the user_priority keyword in
             the job command file, and modify it through the llprio command. See
             “llprio - Change the user priority of submitted job steps” on page 477 for
             more information.
      System priority
             Every job has a system priority associated with it. Administrators can set
             this priority in the configuration file using the SYSPRIO keyword
             expression. The SYSPRIO expression can contain class, group, and user
             priorities, as shown in the following example:
              SYSPRIO : (ClassSysprio * 100) + (UserSysprio * 10) + (GroupSysprio * 1) - (QDate)


              The SYSPRIO expression is evaluated by LoadLeveler to determine the
              overall system priority of a job. To determine which jobs to run first,
              LoadLeveler does the following:
              1. Assigns a system priority value when the negotiator adds the new job
                 to the queue of jobs eligible for dispatch.
              2. Orders jobs first by system priority.
              3. Assigns jobs belonging to the same user and the same class an adjusted
                 system priority, which takes all the system priorities and orders them
                 by user priority. Jobs with a higher adjusted system priority are
                 scheduled ahead of jobs with a lower adjusted system priority.
              Only administrators may modify the system priority through the llmodify
              command with the -s option. See “llmodify - Change attributes of a
              submitted job step” on page 464 for more information.
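
               For example, an administrator could change a job step’s system priority with
               a command of the form (the value and job step identifier are illustrative):
               llmodify -s 100 wizard.22.0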

Example: How does a job’s priority affect dispatching order?
      To understand how a job’s priority affects dispatching order, consider the sample
      jobs in Table 53, which lists the priorities assigned to jobs submitted by two users,
      Rich and Joe.

      Two of the jobs belong to Joe, and three belong to Rich. User Joe has two jobs (Joe1
      and Joe2) in Class A with SYSPRIOs of 9 and 8 respectively. Since Joe2 has the
      higher user priority (20), and because both of Joe’s jobs are in the same class, Joe2’s
      priority is swapped with that of Joe1 when the adjusted system priority is
      calculated. This results in Joe2 getting an adjusted system priority of 9, and Joe1
      getting an adjusted system priority of 8. Similarly, the Class A jobs belonging to
      Rich (Rich1 and Rich3) also have their priorities swapped. The priority of the job
      Rich2 does not change, since this job is in a different class (Class B).
       Table 53. How LoadLeveler handles job priorities
                                              System Priority                            Adjusted
             Job           User Priority        (SYSPRIO)              Class          System Priority
            Rich1                50                  10                  A                    6
            Joe1                 10                   9                  A                    8
            Joe2                 20                   8                  A                    9
            Rich2               100                   7                  B                    7
            Rich3                90                   6                  A                   10



Placing and releasing a hold on a job
                        You may place a hold on a job and thereby cause the job to remain in the queue
                        until you release it.

                        There are two types of holds: a user hold and a system hold. Both you and your
                        LoadLeveler administrator can place and release a user hold on a job. Only a
                        LoadLeveler administrator, however, can place and release a system hold on a job.

                        You can place a hold on a job or release the hold either by using the GUI or the
                        llhold command. For examples of holding and releasing jobs, see Chapter 10,
                        “Example: Using commands to build, submit, and manage jobs,” on page 235.

                        As a user or an administrator, you can also use the startdate keyword to place a
                        hold on a job. This keyword allows you to specify when you want to run a job.
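
                         For example, a job command file statement of the following form (the date and
                         time values are illustrative; see the startdate keyword description on page 359
                         for the exact format) keeps the job in the queue until the specified time:
                         # @ startdate = 04/25/2010 17:00:00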

Canceling a job
                        You can cancel one of your jobs that is either running or waiting to run by using
                        either the GUI or the llcancel command. You can use llcancel to cancel
                        LoadLeveler jobs, including jobs from a submit-only machine.

                        For more information about the llcancel command, see “llcancel - Cancel a
                        submitted job” on page 421.

Checkpointing a job
                         Checkpointing is a method of periodically saving the state of a job so that, if for
                         some reason the job does not complete, it can be restarted from the saved state.
                        Checkpoints can be taken either under the control of the user application or
                        external to the application.

                        On AIX only, the LoadLeveler API ll_init_ckpt is used to initiate a serial
                        checkpoint from the user application. For initiating checkpoints from within a
                        parallel application, the API mpc_init_ckpt should be used. These APIs allow the
                        writer of the application to determine at what points in the application it would be
                         appropriate to save the state of the job. To enable parallel applications to initiate
                        checkpointing, you must use the APIs provided with the Parallel Environment (PE)
                        program. For information on parallel checkpointing, see IBM Parallel Environment
                        for AIX and Linux: Operation and Use, Volume 1.

                        It is also possible to checkpoint a program running under LoadLeveler outside the
                        control of the application. There are several ways to do this:
                        v Use the llckpt command to initiate checkpoint for a specific job step. See “llckpt
                           - Checkpoint a running job step” on page 430 for more information.

v Checkpoint from a program which invokes the ll_ckpt API to initiate checkpoint
  of a specific job step. See “ll_ckpt subroutine” on page 550 for more information.
v Have LoadLeveler automatically checkpoint all running jobs that have been
  enabled for checkpoint. To enable this automatic checkpoint, specify checkpoint
  = interval in the job command file.
v As the result of an llctl flush command.
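
For example, to enable automatic checkpointing at intervals determined by
LoadLeveler, include the following statement in the job command file:
# @ checkpoint = interval

To take a checkpoint of a specific running job step on demand (the job step
identifier is illustrative):
llckpt wizard.22.0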

Note: For interactive parallel jobs, the environment variable CHECKPOINT must
      be set to yes in the environment prior to starting the parallel application or
      the job will not be enabled for checkpoint. For more information, see IBM
      Parallel Environment for AIX and Linux: MPI Programming Guide.




Chapter 10. Example: Using commands to build, submit, and
manage jobs
            The following procedure presents a series of simple tasks that a user might
            perform using commands.

            For additional information about individual commands noted in the procedure, see
            Chapter 16, “Commands,” on page 411.
            1. Build your job command file by using a text editor to create a script file. Into
               the file enter the name of the executable, other keywords designating such
               things as output locations for messages, and the necessary LoadLeveler
               statements, as shown in Figure 35:

            #   This job command file is called longjob.cmd. The
            #   executable is called longjob, the input file is longjob.in,
            #   the output file is longjob.out, and the error file is
            #   longjob.err.
            #
            # @ executable = longjob
            # @ input      = longjob.in
            # @ output     = longjob.out
            # @ error      = longjob.err

            # @ queue

            Figure 35. Building a job command file

            2. You can optionally edit the job command file you created in step 1.
            3. To submit the job command file that you created in step 1, use the llsubmit
               command:
                    llsubmit longjob.cmd
                    LoadLeveler responds by issuing a message similar to:
                    submit: The job "wizard.22" has been submitted.

               Where wizard is the name of the machine to which the job was submitted and
               22 is the job identifier (ID). You may want to record the identifier for future use
               (although you can obtain this information later if necessary).
            4. To display the status of the job you just submitted, use the llq command. This
               command returns information about all jobs in the LoadLeveler queue:
                    llq wizard.22
               Where wizard is the machine name to which you submitted the job, and 22 is
               the job ID. You can also query this job using the command llq wizard.22.0,
               where 0 is the step ID.
            5. To change the priority of a job, use the llprio command. To increase the priority
               of the job you submitted by a value of 10, enter:
                    llprio +10 wizard.22.0
                    You can change the user priority of a job that is in the queue or one that is
                    running. This only affects jobs belonging to the same user and the same class. If
                    you change the priority of a job in the queue, the job’s priority increases or
                    decreases in relation to your other jobs in the queue. If you change the priority
                    of a job that is running, it does not affect the job while it is running. It only



affects the job if the job re-enters the queue to be dispatched again. For more
                           information, see “Setting and changing the priority of a job” on page 230.
                        6. To place a temporary hold on a job in a queue, use the llhold command. This
                           command only takes effect if jobs are in the Idle or NotQueued state. To place a
                           hold on wizard.22.0, enter:
                            llhold wizard.22.0
                        7. To release the hold you placed in step 6, use the llhold command:
                            llhold -r wizard.22.0
                        8. To display the status of the machine to which you submitted a job, use the
                           llstatus command:
                            llstatus -l wizard
                        9. To cancel wizard.22.0, use the llcancel command:
                            llcancel wizard.22.0




Chapter 11. Using LoadLeveler’s GUI to build, submit, and
    manage jobs
|                   Note: This is the last release that will provide the Motif-based graphical user
|                   interface xloadl. The function available in xloadl has been frozen since TWS
|                   LoadLeveler 3.3.2.

                    You do not have to perform the tasks in the order listed. You may perform certain
                    tasks before others without any difficulty; however, some tasks must be performed
                    prior to others for succeeding tasks to work. For example, you cannot submit a job
                    if you do not have a job command file that you built using either the GUI or an
                    editor.

                    The tasks included in this topic are listed in Table 54.
                    Table 54. User tasks available through the GUI
                    Subtask                    Associated information (see...)
                    Building and submitting    v “Building jobs”
                    jobs
                                               v “Editing the job command file” on page 249
                                               v “Submitting a job command file” on page 250
                    Obtaining job status       v “Displaying and refreshing job status” on page 251
                                               v “Specifying which jobs appear in the Jobs window” on page
                                                 258
                                               v “Sorting the Jobs window” on page 252
                    Managing a submitted job v “Changing the priority of your jobs” on page 253
                                               v “Placing a job on hold” on page 253
                                               v “Releasing the hold on a job” on page 253
                                               v “Canceling a job” on page 254
                    Working with machines      v “Displaying and refreshing machine status” on page 255
                                               v “Specifying which machines appear in Machines window” on
                                                 page 259
                                               v “Sorting the Machines window” on page 257
                                               v “Finding the location of the central manager” on page 257
                                               v “Finding the location of the public scheduling machines” on
                                                 page 258
                    Saving LoadLeveler         “Saving LoadLeveler messages in a file” on page 259
                    messages in a file



    Building jobs
                    Use these instructions when building jobs.

                    From the Jobs window:
                    SELECT
                          File → Build a Job
                              The dialog box shown in Figure 36 on page 238 appears:


Figure 36. LoadLeveler build a job window

                                 Complete those fields for which you want to override what is currently
                                 specified in your skel.cmd defaults file. Sample skel.cmd and
                                 mcluster_skel.cmd files are found in the samples subdirectory of the
release directory. You can update this file to define defaults for your site,
             and then update the *skelfile resource in Xloadl to point to your new
             skel.cmd file. If you want a personal defaults file, copy skel.cmd to one of
             your directories, edit the file, and update the *skelfile resource in
             .Xdefaults. Table 55 shows the fields displayed in the Build a Job window:
Table 55. GUI fields and input
Field                  Input
Executable             Name of the program to run. It must be an executable file.

                       Optional. If omitted, the command file is executed as if it were a shell
                       script.
Arguments              Parameters to pass to the program.

                       Required only if the executable requires them.
Stdin                  Filename to use as standard input (stdin) by the program.

                       Optional. The default is /dev/null.
Stdout                 Filename to use as standard output (stdout) by the program.

                       Optional. The default is /dev/null.
Stderr                 Filename to use as standard error (stderr) by the program.

                       Optional. The default is /dev/null.
Cluster Input File A comma delimited local and remote path name pair, representing the
                   local file to copy to the remote location. If you have more than one pair
                   to enter, the More button will display a Cluster Input Files input
                   window.

                       Optional. The default is no files are copied.
Cluster Output         A comma delimited local and remote path name pair, representing the
File                   local file destination into which the remote file is copied. If you have
                       more than one pair to enter, the More button will display a Cluster
                       Output Files input window.

                       Optional. The default is no files are copied.
Initialdir             Initial directory. LoadLeveler changes to this directory before running
                       the job.

                       Optional. The default is your current working directory.
Notify User            User id of person to notify regarding status of submitted job.

                       Optional. The default is your userid.
StartDate              Month, day, and year in the format mm/dd/yyyy. The job will not start
                       before this date.

                       Optional. The default is to run the job as soon as possible.
StartTime              Hour, minute, second in the format hh:mm:ss. The job will not start
                       before this time.

                       Optional. The default is to run the job as soon as possible.

                       If you specify StartTime but not StartDate, the default StartDate is the
                       current day. If you specify StartDate but not StartTime, the default
                       StartTime is 00:00:00. This means that the job will start as soon as
                       possible on the specified date.



                        Priority            Number between 0 and 100, inclusive.

                                            Optional. The default is 50.

                                            This is the user priority. For more information on this priority, refer to
                                            “Setting and changing the priority of a job” on page 230.
                        Image size          Number in kilobytes that reflects the maximum size you expect your
                                            program to grow to as it runs.

                                            Optional.
                        Class               Class name. The job will only run on machines that support the
                                            specified class name. Your system administrator defines the class names.

                                            Optional:
                                            v Press the Choices button to get a list of available classes.
                                            v Press the Details button under the class list to obtain long listing
                                              information about classes.
                        Hold                Hold status of the submitted job. Permitted values are:
                                            user    User hold
                                            system System hold (only valid for LoadLeveler administrators)
                                            usersys User and system hold (only valid for LoadLeveler
                                                    administrators)

                                            Note: The default is a no-hold state.
                        Account Number      Number associated with the job. For use with the llacctmrg and
                                            llsummary commands for acquiring job accounting data.

                                            Optional. Required only if the ACCT keyword is set to A_VALIDATE in
                                            the configuration file.
                        Environment         Your initial environment variables when your job starts. Separate
                                            environment specifications with semicolons.

                                            Optional.
                        Copy                All or Master, to indicate whether the environment variables specified in
                        Environment         the keyword Environment are copied to all nodes or just to the master
                                            node of a parallel job.

                                            Optional.
                        Shell               The name of the shell to use for the job.

                                            Optional. If not specified, the shell used in the owner’s password file
                                            entry is used. If none is specified, /bin/sh is used.
                        Group               The LoadLeveler group name to which the job belongs.

                                            Optional.
                        Step Name           The name of this job step.

                                            Optional.




Node Usage         How the node is used. Permitted values are:
                   shared
                        The node can be shared with other tasks of other job steps. This is
                        the default.
                   not shared
                        The node cannot be shared.
                   slice not shared
                        Has the same meaning as not shared. It is provided for
                        compatibility.
Dependency         A Boolean expression defining the relationship between the job steps.

                   Optional.
Large Page         Whether or not the job step requires Large Page memory.
                   yes
                       Use Large Page memory if available, otherwise use regular memory.
                   mandatory
                       Use of Large Page memory is mandatory.
                   no Do not use Large Page memory.
Bulk Transfer      Indicates to the communication subsystem whether it should use the
                   bulk transfer mechanism to communicate between tasks.
                   yes
                       Use bulk transfer.
                   no Do not use bulk transfer.

                   Optional.
Rset               What type of RSet support is requested. Permitted values are:
                   rset_mcm_affinity
                        Requests scheduling affinity.
                        Use the MCM options button to specify task allocation method,
                        memory affinity preference or requirement, and adapter affinity
                        preference or requirement.
                   rset_name
                        Requests a user defined RSet and nodes with rset_support set to
                        rset_user_defined.

                   Optional.
Comments           Comments associated with the job. These comments help to distinguish
                   one job from another job.

                   Optional.
SMT                Indicates whether a job requires dynamic simultaneous multithreading
                   (SMT) function.
                   yes
                        The job requires SMT function.
                   no The job does not require SMT function.
                   as_is
                        The SMT state will not be changed.
Note: The fields that appear in this table are what you see when viewing the Build a Job
window. The text in these fields does not necessarily correspond with the keywords listed in
“Job command file keyword descriptions” on page 359.


        See “Job command file keyword descriptions” on page 359 for information
        on the defaults associated with these keywords.


SELECT
                              A Job Type if you want to change the job type.
                                 Your choices are:
                                 Serial Specifies a serial job. This is the default.
                                 Parallel
                                          Specifies a parallel job.
                                 Blue Gene
                                          Specifies a bluegene job.
                                 MPICH
                                          Specifies an MPICH job.
                                 Note that the job type you select affects the choices that are active on the
                                 Build A Job window.
                        SELECT
                              a Notification option.
                                 Your choices are:
                                 Always
                                        Notify you when the job starts, completes, and if it incurs errors.
                                 Complete
                                        Notify you when the job completes. This is the default option as
                                        initially defined in the skel.cmd file.
                                 Error Notify you if the job cannot run because of an error.
                                 Never Do not notify you.
                                 Start Notify you when the job starts.
                        SELECT
                              a Restart option.
                                 Your choices are:
                                 No       This job is not restartable. This is the default.
                                 Yes      Restart the job.
                        SELECT
                              To restart the job on the same nodes from which it was vacated.
                                 Your choices are:
                                 No       Restart the job on any available nodes.
                                 Yes      Restart the job on the same nodes it ran on previously. This option
                                          is valid after a job has been vacated.

                                 Note that there is no default for the selection.
                        SELECT
                              a Checkpoint option.
                                 Your choices are:
                                 No      Do not checkpoint the job. This is the default.
                                 Yes     Yes, checkpoint the job at intervals you determine. See the
                                         checkpoint keyword for more information.
                                 Interval
                                         Yes, checkpoint the job at intervals determined by LoadLeveler. See
                                         the checkpoint keyword for more information.
                        SELECT
                              To start from a checkpoint file
                                 Your choices are:

No      Do not start the job from a checkpoint file (start job from
                   beginning).
           Yes     Yes, restart the job from an existing checkpoint file when you
                   submit the job. The file name must be specified by the job
                   command file. The directory name may be specified by the job
                   command file, configuration file, or default location.
SELECT
      Coschedule if you want steps within a job to be scheduled and dispatched
      at the same time.
           Your choices are:
           No     Disables coscheduling for your job step.
           Yes    Allows coscheduling to occur for your job step.

                   Note:
                           1. This keyword is not inherited by other job steps.
                           2. The default is No.
                           3. The coscheduling function is only available with the
                              BACKFILL scheduler.
SELECT
      Nodes (available when the job type is parallel)
            The Nodes dialog box appears.
           Complete the necessary fields to specify node information for a parallel job
           (see Table 56). Depending upon which model you choose, different fields
           will be available; any unavailable fields will be desensitized. LoadLeveler
           will assign defaults for any fields that you leave blank. For more
           information, see the appropriate job command file keyword (listed in
           parentheses) in “Job command file keyword descriptions” on page 359.
Table 56. Nodes dialog box
Field               Available in:       Input
Min # of Nodes      Tasks Per Node      Minimum number of nodes required for running the
                    Model and Tasks     parallel job (node keyword).
                    with Uniform
                    Blocking Model      Optional. The default is one.
Max # of Nodes      Tasks Per Node      Maximum number of nodes required for running the
                    Model               parallel job (node keyword).

                                        Optional. The default is the minimum number of
                                        nodes.
Tasks per Node      Tasks Per Node      The number of tasks of the parallel job you want to
                    Model               run per node (tasks_per_node keyword).

                                        Optional.
Total Tasks         Tasks with          The total number of tasks of the parallel job you
                    Uniform Blocking    want to run on all available nodes (total_tasks
                    Model, and          keyword).
                    Custom Blocking
                    Model               Optional for Uniform, required for Custom Blocking.
                                        The default is one.
Blocking            Custom Blocking     The number of tasks assigned (as a block) to each
                    Model               consecutive node until all of a job’s tasks have been
                                        assigned (blocking keyword)




Task Geometry       Custom              The task ids of each task that you want to run on
                    Geometry Model      each node. You can use the “Set Geometry” button
                                        for step-by-step directions (task_geometry keyword).


                        SELECT
                              Close to return to the Build a Job dialog box.
                        SELECT
                              Network (available when the job type is parallel)
                                   The Network dialog box appears.
                                 The Network dialog box consists of two parts: The top half of the panel is
                                 for MPI, and the bottom half is for LAPI. Click on the check box to the left
                                 of MPI or LAPI to activate the part of the panel for which you want to
                                 specify network information. If you want to use MPI with LAPI, click on
                                 both:
                                 v The MPI check box.
                                 v The check box for Share windows between MPI and LAPI.
                                 Complete those fields for which you want to specify network information
                                 (see Table 57). For more information, see the network keyword description
                                 in “Job command file keyword descriptions” on page 359.
                        Table 57. Network dialog box fields
                        Field                    Input
                        MPI (MPI/LAPI)           Select:
                                                 v Only the MPI check box to use the Message Passing Interface
                                                   (MPI) protocol only.
                                                 v Both the MPI check box and the Share windows between MPI
                                                   and LAPI check box to use both MPI and the Low-level
                                                   Application Programming Interface (LAPI) protocols. This
                                                   selection corresponds to setting the network keyword in the job
                                                   command file to MPI_LAPI.

                                                 Optional.
                        LAPI                     Select the LAPI check box to use Low-level Application
                                                 Programming Interface (LAPI) protocol only.

                                                 Optional.
                        Adapter/Network          Select an adapter name or a network type from the list.

                                                 Required for each protocol you select.
                        Adapter Usage            Specifies that the adapter is either shared or not shared.

                                                 Optional. The default is shared.
                        Communication Mode Specifies the communication subsystem mode used by the
                                           communication protocol that you specify and can be either IP
                                           (Internet Protocol) or US (User Space).

                                                 Optional. The default is IP.
                        Communication Level      Specifies the amount of memory to be allocated to each window in
                                                 User Space mode. Allocation can be Low, Average, or High. It is
                                                 ignored by Switch_Network_Interface_For_HPS adapters.



Instances                Specifies the number of windows or IP addresses the
                         communication subsystem should allocate to this protocol.

                         Optional. The default is 1 unless sn_all is specified for the network
                         keyword, in which case the default is max.
rCxt Blocks              The number of user rCxt blocks requested for each window used by
                         the associated protocol. It is recognized only by
                         Switch_Network_Interface_For_HPS adapters.

                         Optional.
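
These dialog box selections correspond to the network keyword in the job
command file. The following sketch requests both MPI and LAPI with shared
adapter windows in User Space mode (the values are illustrative; see the
network keyword description on page 359 for the exact syntax):

   # @ network.MPI_LAPI = sn_all,shared,US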


SELECT
      Close to return to the Build a Job dialog box.
SELECT
      Requirements
            The Requirements dialog box appears.
           Complete those fields for which you want to specify requirements (see
           Table 58). Defaults are used for those fields that you leave blank.
           LoadLeveler dispatches your job only to machines whose resources match
           the requirements you specify.
Table 58. Requirements dialog box fields
Field               Input
Architecture        Machine type. The job will not run on any other machine type.

(see note 2)        Optional. The default is the architecture of your current machine.
Operating System Operating system. The job will not run on any other operating system.

(see note 2)        Optional. The default is the operating system of your current machine.
Disk                Amount of disk space in the execute directory. The job will only run on
                    a machine with at least this much disk space.

                    Optional. The default is defined in your local configuration file.
Memory              Amount of memory. The job will only run on a machine with at least
                    this much memory.

                    Optional. The default is defined in your local configuration file.
Large Page          Amount of Large Page memory, in megabytes. The job step requires at
Memory              least this much Large Page memory to run.

                    Optional.
Total Memory        Amount of total (regular and Large Page memory) in megabytes needed
                    to run the job step.

                    Optional.
Machines            Machine names. The job will only run on the specified machines.

                    Optional.
Features            Features. The job will only run on machines with specified features.

                    Optional.



                        Pool                Specifies the number associated with the pool you want to use. All
                                            available pools listed in the administration file appear as choices. The
                                            default is to select nodes from any pool.
                        LoadLeveler         Specifies the version of LoadLeveler, in dotted decimal format, on the
                        Version             machine where you want the job to run. For example: 3.3.0.0 specifies
                                            that your job will run on a machine running LoadLeveler Version 3.3.0.0
                                            or higher.

                                            Optional.
                        Connectivity        A number from 0.0 through 1.0, representing the average connectedness
                                            of the node’s managed adapters.
                        Requirement         Requirements. The job will only run if these requirements are met.
                        Note:
                        1. If you enter a resource that is not available, you will NOT receive a message.
                           LoadLeveler holds your job in the Idle state until the resource becomes available.
                           Therefore, make certain that the spelling of your entry is correct. You can issue llq -s
                           jobID to find out if you have a job for which requirements were not met.
                        2. If you do not specify an architecture or operating system, LoadLeveler assumes that
                           your job can run only on your machine’s architecture and operating system. If your job
                           is not a shell script that can be run successfully on any platform, you should specify a
                           required architecture and operating system.
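
                        For reference, the Requirements fields correspond to a requirements
                        statement in the job command file, such as the following sketch (the
                        architecture and operating system values are illustrative):

                           # @ requirements = (Arch == "R6000") && (OpSys == "AIX53") && (Memory >= 512)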


                        SELECT
                              Close to return to the Build a Job dialog box.
                        SELECT
                              Resources
                                   The Resources dialog box appears.
                                 This dialog box allows you to set the amount of defined consumable
                                 resources required for a job step. Resources with an asterisk (*) appended to their
                                 names are not in the SCHEDULE_BY_RESOURCES list. For more
                                 information, see the resources keyword.
                        SELECT
                              Close to return to the Build a Job dialog box.
                        SELECT
                              Preferences
                                   The Preferences dialog box appears.
                                 This dialog box is similar to the Requirements dialog box, with the
                                 exception of the Adapter choice, which is not supported as a Preference.
                                 Complete the fields for those parameters that you want to specify. These
                                 parameters are not binding. For any preferences that you specify,
                                 LoadLeveler attempts to find a machine that matches these preferences
                                 along with your requirements. If it cannot find the machine, LoadLeveler
                                 chooses the first machine that matches the requirements.
                        SELECT
                              Close to return to the Build a Job dialog box.
                        SELECT
                              Limits


The Limits dialog box appears.
         Complete the fields for those limits that you want to impose upon your job
         (see Table 59). If you type copy in any field except wall_clock_limit or
         job_cpu_limit, the limits in effect on the submit machine are used. If you
         leave any field blank, the default limits in effect for your userid on the
         machine that runs the job are used. For more information, see “Using limit
         keywords” on page 89.
Table 59. Limits dialog box fields
Field               Input
CPU Limit           Maximum amount of CPU time that the submitted job can use. Express
                    the amount as:
                    [[hours:]minutes:]seconds[.fraction]

                    For example, 12:56:21 is 12 hours, 56 minutes, and 21 seconds.

                    Optional
Data Limit          Maximum amount of the data segment that the submitted job can use.
                    Express the amount as:
                    integer[.fraction][units]

                    Optional
Core Limit          Maximum size of a core file.

                    Optional
RSS Limit           Maximum size of the resident set size. It is the largest amount of
                    physical memory a user’s process can allocate.

                    Optional
File Limit          Maximum size of a file that is created.

                    Optional
Stack Limit         Maximum size of the stack.

                    Optional
Job CPU Limit
                    Maximum total CPU time that can be used by all processes of a serial
                    job step. For a parallel job, this is the total CPU time for each
                    LoadL_starter process and its descendants, for each job step.

                    Optional
Wall Clock Limit    Maximum amount of elapsed time for which a job can run.

                    Optional
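
For reference, these fields correspond to limit keywords in the job command
file. A sketch (the values are illustrative):

   # @ cpu_limit = 12:56:21
   # @ job_cpu_limit = 10:00:00
   # @ wall_clock_limit = 24:00:00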


SELECT
      Close to return to the Build a Job dialog box.
SELECT
      Checkpointing to specify checkpoint options (available when the
      checkpoint option is set to Yes or Interval)
             The checkpointing dialog box appears.
         Complete those fields for which you want to specify checkpoint
         information (see Table 60 on page 248). For detailed information on specific
         keywords, see “Job command file keyword descriptions” on page 359.

Table 60. Checkpointing dialog box fields
                        Field               Input
                        Ckpt File           Specifies a checkpoint file. The serial default is:
                                            $(job_name).$(host).$(domain).$(jobid).$(stepid).ckpt
                        Ckpt Directory      Specifies a checkpoint directory name.
                        Ckpt Execute        Specifies a directory to use for staging the checkpoint executable file.
                        Directory
                        Ckpt Time Limits    Sets the limits for the elapsed time a job can take checkpointing.
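
                        For reference, these fields correspond to the checkpointing keywords in
                        the job command file. A sketch (the file and directory names are
                        hypothetical):

                           # @ checkpoint = interval
                           # @ ckpt_file = myjob.ckpt
                           # @ ckpt_dir = /scratch/ckpt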


                        SELECT
                              Close to return to the Build a Job dialog box.
                        SELECT
                              Blue Gene (available when the job type is bluegene)
                                    The Blue Gene window appears.
                                 Complete the necessary fields to specify information for a Blue Gene job
                                 (see Table 61). Depending upon which request type you choose, different
                                 fields will be available; unavailable fields are disabled (grayed out). For
                                 more information, see the appropriate job command file keyword (listed in
                                 parentheses) in “Job command file keyword descriptions” on page 359.
                        Table 61. Blue Gene job fields
                        Field               Available when      Input
                                            requesting by:
                        # of Compute        Size                The requested size, in number of compute nodes, of
                        Nodes                                   the partition for this Blue Gene job. (bg_size)
                        Shape               Shape               The requested shape of the partition for this Blue Gene job.
                                                                The units of each dimension of the shape are in
                                                                number of base partitions, XxYxZ, where X, Y, and Z
                                                                are the number of base partitions in the X-direction,
                                                                Y-direction, and Z-direction. (bg_shape)
                        Partition Name      Partition           The name of an existing partition in the Blue Gene
                                                                system where the requested job should run.
                                                                (bg_partition)
                        Connection Type     Size and Shape      The kinds of Blue Gene partitions that can be selected
                                                                for this job. You can select Torus, Mesh, or Prefer
                                                                Torus. (bg_connection)

                                                                Optional. The default is Mesh.
                        Rotate              Shape               Whether to consider all possible rotations of the
                        Dimensions                              specified shape (True) or only the specified shape
                                                                (False) when assigning a partition for the Blue Gene
                                                                job. (bg_rotate)

                                                                Optional. The default is True.




              Memory             Megabytes           A number (in megabytes) that represents the
                                                     minimum available virtual memory that is needed to
                                                     run the job. LoadLeveler generates a Blue Gene
                                                     requirement that specifies memory that is greater
                                                     than or equal to the amount you specify.

                                                     Optional. If you leave this field blank, this parameter
                                                     is not used when searching for machines to run your
                                                     job.
              Requirements       Expression          An expression that specifies the Blue Gene
                                                     requirements that a machine must meet in order to
                                                     run the job.

                                                     Memory is the supported keyword.
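
              For reference, these fields correspond to the Blue Gene keywords in the job
              command file. A sketch of a request by size (the values are illustrative):

                 # @ job_type = bluegene
                 # @ bg_size = 512
                 # @ bg_connection = torus
                 # @ bg_rotate = true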


              SELECT
                    Close to return to the Build a Job dialog box.

Editing the job command file
              Use these instructions to edit the job command file that you just built.

              There are several ways to do this:
              1. Using the Jobs window:
                   SELECT
                         File → Submit a Job
                             The Submit a Job dialog box appears.
                   SELECT
                         The job file you want to edit from the file column.
                   SELECT
                         Edit
                            Your job command file appears in a window. You can use any editor
                          to edit the job command file. The default editor is specified in your
                          .Xdefaults file.
                         If you have an icon manager, an icon may appear. An icon manager is a
                         program that creates a graphic symbol, displayed on a screen, that you
                         can point to with a device such as a mouse in order to select a
                         particular function or application. Select this icon to view your job
                         command file.
              2. Using the Tools Edit pull-down menus on the Build a Job window:
                 Using the Edit pull-down menu, you can modify the job command file. Your
                  choices appear in Table 62:
              Table 62. Modifying the job command file with the Edit pull-down menu
              To                                                              Select
              Add a step to the job command file                              Add a Step or Add a First
                                                                              Step
              Delete a step from the job command file                         Delete a Step


                        Clear the fields in the Build a Job window                           Clear Fields
                        Select defaults to use in the fields                                 Set Field Defaults
                        Note: Other options include Go to Next Step, Go to Previous Step, and Go to Last Step,
                        which allow you to edit various steps in the job command file.

                             Using the Tools pull-down menu, you can modify the job command file. Your
                             choices appear in Table 63:
                        Table 63. Modifying the job command file with the Tools pull-down menu
                        To                                                                   Select
                        Name the job                                                         Set Job Name
                        Specify a cluster, cluster list, or any cluster, if a multicluster   Set Cluster
                        environment is configured.
                        Open a window where you can enter a script file                      Append Script
                        Fill in the fields using another file                                Restore from File
                        View the job command file in a window                                View Entire Job
                        Determine which step you are viewing                                 What is step #
                        Start a new job command file                                         Start a new job

                             You can save and submit the information you entered by selecting the choices
                             shown in Table 64:
                        Table 64. Saving and submitting information
                        To                                 Do This
                        Save the information you entered   SELECT
                        into a file which you can submit         Save
                        later                                        A window appears prompting you to
                                                                 enter a job filename.
                                                           ENTER
                                                                 a job filename in the text entry field.
                                                           SELECT
                                                                 OK
                                                                     The window closes and the information
                                                                 you entered is saved in the file you
                                                                 specified.
                        Submit the program immediately     SELECT
                        and discard the information you          Submit
                        entered



Submitting a job command file
                        After building a job command file, you can submit it to one or more machines for
                        processing.
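
                        If you prefer the command line, the equivalent operation is the llsubmit
                        command (the file name is illustrative):

                           llsubmit myjob.cmd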

                        To submit a job, from the Jobs window:
                        SELECT
                              File → Submit a Job


The Submit a Job dialog box appears.
              SELECT
                    The job file that you want to submit from the file column.
                      You can also use the filter field and the directories column to select the file
                      or you can type in the file name in the text entry field.
              SELECT
                    Submit
                       The job is submitted for processing.
                      You can now submit another job or you can press Close to exit the
                      window.

Displaying and refreshing job status
              When you submit a job, the status of the job is automatically displayed in the Jobs
              window.

              You can update or refresh this status using the Jobs window and selecting one of
              the following:
              v Refresh → Refresh Jobs
              v Refresh → Refresh All.

               To change the amount of time that passes before the Jobs window is
               automatically refreshed, use the Jobs window.
              SELECT
                    Refresh → Set Auto Refresh
                       A window appears.
              TYPE IN
                    a value for the number of seconds to pass before the Jobs window is
                    updated.
                      Automatic refresh can be expensive in terms of network usage and CPU
                      cycles. You should specify a refresh interval of 120 seconds or more for
                      normal use.
              SELECT
                    OK
                       The window closes and the value you specified takes effect.

              To receive detailed information on a job:
              SELECT
                    Actions → Extended Status to receive additional information on the job.
                     Selecting this option is the same as typing the llq -x command.
                      You can also get information in the following way:
              SELECT
                    Actions → Extended Details
                       Selecting this option is the same as typing the llq -x -l command. You can also
                      double click on the job in the Jobs window to get details on the job.
                      Note: Obtaining extended status or details on multiple jobs can be
                      expensive in terms of network usage and CPU cycles.

SELECT
                              Actions → Job Status
                                 You can also use the llq -s command to determine why a submitted job
                                 remains in the Idle or Deferred state.
                        SELECT
                              Actions → Resource Use
                                 Allows you to display resource use for running jobs. Selecting this option
                                 is the same as entering the llq -w command.
                        SELECT
                              Actions → Blue Gene Job Status
                                 Allows you to display Blue Gene job information for jobs. Selecting this
                                 option is the same as entering the llq -b command.

                        For more information on requests for job information, see “llq - Query job status”
                        on page 479.
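
                        For example, to find out from the command line why a submitted job
                        remains in the Idle state, you can issue (the job step ID is illustrative):

                           llq -s c94n01.1234.0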

Sorting the Jobs window
                        You can specify up to two sorting options for the Jobs window.

                        The options you specify determine the order in which the jobs appear in the Jobs
                        window.

                        From the Jobs window:
                        SELECT
                              Sort → Set Sort Parameters
                                  A window appears.
                        SELECT
                              A primary and secondary sort.

                        Table 65 lists the sorting options:
                        Table 65. Sorting the jobs window
                        To:                                                             Select Sort
                        Sort jobs by the machine from which they were                   Sort by Submitting Machine
                        submitted
                        Sort by owner                                                   Sort by Owner
                        Sort by the time the jobs were submitted                        Sort by Submission Time
                        Sort by the state of the job                                    Sort by State
                        Sort jobs by their user priority (last job listed runs first)   Sort by Priority
                        Sort by the class of the job                                    Sort by Class
                        Sort by the group associated with the job                       Sort by Group
                        Sort by the machine running the job                             Sort by Running Machine
                        Sort by dispatch order                                          Sort by Dispatch Order
                        Not specify a sort                                              No Sort


                        You can select a sort type as either a Primary or Secondary sorting option. For
                        example, suppose you select Sort by Owner as the primary sorting option and Sort
                        by Class as the secondary sorting option. The Jobs window is sorted by owner
                        and, within each owner, by class.

Changing the priority of your jobs
               If your job has not yet begun to run and is still in the queue, you can change the
               priority of the job in relation to your other jobs in the queue that belong to the
               same class.

               This only affects the user priority of the job. For more information on this priority,
               refer to “Setting and changing the priority of a job” on page 230. Only the owner
               of a job or the LoadLeveler administrator can change the priority of a job.

               From the Jobs window:
               SELECT
                     a job by clicking on it with the mouse
               SELECT
                     Actions → Priority
                         A window appears.
               TYPE IN
                     a number between 0 and 100, inclusive, to indicate a new priority.
               SELECT
                     OK
                         The window closes and the priority of your job changes.
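
               From the command line, the llprio command changes the user priority of a
               job step. A sketch, assuming the increment form of the command (the step
               ID is illustrative):

                  llprio +10 c94n01.1234.0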

Placing a job on hold
               Only the owner of a job or the LoadLeveler administrator can place a hold on a
               job.

               From the Jobs window:
               SELECT
                     The job you want to hold by clicking on it with the mouse
               SELECT
                     Actions → Hold
                         The job is put on hold and its status changes in the Jobs window.

Releasing the hold on a job
               Only the owner of a job or the LoadLeveler administrator can release a hold on a
               job.

               From the Jobs window:
               SELECT
                     The job you want to release by clicking on it with the mouse
               SELECT
                     Actions → Release from Hold
                        The job is released from hold and its status is updated in the Jobs
                       window.
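
               From the command line, the llhold command places a hold on a job step,
               and llhold -r releases it (the step ID is illustrative):

                  llhold c94n01.1234.0
                  llhold -r c94n01.1234.0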




Canceling a job
                        Only the owner of a job or the LoadLeveler administrator can cancel a job.

                        From the Jobs window:
                        SELECT
                              The job you want to cancel by clicking on it with the mouse
                        SELECT
                              Actions → Cancel
                                   LoadLeveler cancels the job and the job information disappears from the
                                 Jobs window.
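
                        From the command line, the equivalent operation is the llcancel command
                        (the step ID is illustrative):

                           llcancel c94n01.1234.0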

Modifying consumable resources and other job attributes
                        Use these commands to modify the consumable CPUs or memory requirements,
                        class, account number, or Blue Gene attributes of a nonrunning job.
                        SELECT

                                 Modify    → Consumable CPUs
                                 or
                                 Modify    → Consumable Memory
                                 or
                                 Modify    → Class
                                 or
                                 Modify    → Account number
                                 or
                                 Modify    → Blue Gene → Connection
                                 or
                                 Modify    → Blue Gene → Partition
                                 or
                                 Modify    → Blue Gene → Rotate
                                 or
                                 Modify    → Blue Gene → Shape
                                 or
                                 Modify    → Blue Gene → Size
                                 or
                                 Modify    → Blue Gene → Requirement

                                   A dialog box appears prompting you to enter a new value for the
                                 selected job attribute. Blue Gene attributes are available when Blue Gene is
                                 enabled.
                        TYPE IN
                              The new value
                        SELECT
                              OK
                                   The dialog box closes and the value you specified takes effect.

Taking a checkpoint
                        Use these commands to checkpoint the selected job.




SELECT
                     One of the following actions to take when checkpoint has completed:
                     v Continue the step
                     v Terminate the step
                     v Hold the step
                         A checkpoint monitor for this step appears.
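
                      From the command line, a checkpoint of a running job step can be taken
                      with the llckpt command. A minimal sketch (the step ID is illustrative;
                      see the command reference for the options that select the action taken
                      when the checkpoint completes):

                         llckpt c94n01.1234.0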

Adding a job to a reservation
               Use these commands to bind selected job steps to a reservation so that they will
               only be scheduled to run on the nodes reserved for the reservation.
               SELECT
                     The job you want to bind by clicking on it with the mouse.
               SELECT
                     Actions → Bind to Reservation
                         A window appears.
               SELECT
                     A reservation from the list.
               SELECT
                     OK
                         The window closes and the job is bound to that reservation.
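
               From the command line, job steps can be bound to a reservation with the
               llbind command. A sketch, assuming the -R option selects the reservation
               (the reservation and step IDs are illustrative):

                  llbind -R c94n01.10.r c94n01.1234.0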

Removing a job from a reservation
               Use these commands to unbind selected job steps from reservations to which they
               currently belong.
               SELECT
                     The job you want to unbind by clicking on it with the mouse.
               SELECT
                     Actions → Unbind from Reservation

               If the job is bound to a reservation, it is removed from the reservation.

Displaying and refreshing machine status
               The status of the machines is automatically displayed in the Machines window.

               You can update or refresh this status using the Machines window and selecting
               one of the following:
               v Refresh → Refresh Machines
               v Refresh → Refresh All.

               To specify an amount of time to pass before the Machines window is automatically
               refreshed, from the Machines window:
               SELECT
                     Refresh → Set Auto Refresh
                         A window appears.




TYPE IN
                              a value for the number of seconds to pass before the Machines window is
                              updated.
                                 Automatic refresh can be expensive in terms of network usage and CPU
                                 cycles. You should specify a refresh interval of 120 seconds or more for
                                 normal use.
                        SELECT
                              OK
                                   The window closes and the value you specified takes effect.

                        To receive detailed information on a machine:
                        SELECT
                              Actions → Details
                                 This displays status information about the selected machines. Selecting this
                                  option has the same effect as typing the llstatus -l command.
                        SELECT
                              Actions → Adapter Details
                                 This displays virtual and physical adapter information for each selected
                                 machine. Selecting this option has the same effect as typing the llstatus -a
                                  command.
                        SELECT
                              Actions → Floating Resources
                                 This displays consumable resources for the LoadLeveler cluster. Selecting
                                  this option has the same effect as typing the llstatus -R command.
                        SELECT
                              Actions → Machine Resources
                                 This displays consumable resources defined for the selected machines or all
                                 machines. Selecting this option has the same effect as typing the llstatus -R
                                  command.
                        SELECT
                              Actions → Cluster Status
                                 This displays status of machines in the defined cluster or clusters. It
                                 appears only when a multicluster environment is configured and is
                                 equivalent to the llstatus -X all command.
                        SELECT
                              Actions → Cluster Config
                                 This displays cluster information from the LoadL_admin file. Only fields
                                 with data specified or which have defaults when not specified are
                                 displayed. It appears only when a multicluster environment is configured
                                 and is equivalent to the llstatus -C command.
                        SELECT
                              Actions → Blue Gene ...
                                 This displays information about the Blue Gene system. You can select the
                                 option for Status for a short listing, Details for a long listing, Base
                                 Partitions for Blue Gene base partition status, or Partitions for existing




Blue Gene partition status. It is available only when Blue Gene support is
                        enabled in LoadLeveler. This is equivalent to the llstatus command with
                        the options -b, -b -l, -B, or -P.

Sorting the Machines window
               You can specify up to two sorting options for the Machines window.

               The options you specify determine the order in which machines appear in the
               window.

               From the Machines window:
               SELECT
                     Sort → Set Sort Parameters
                         A window appears.
               SELECT
                     A primary and secondary sort.

               Table 66 lists sorting options for the Machines window:
               Table 66. Sorting the machines window
               To:                                                               Select Sort →
               Sort by machine name                                              Sort by Name
               Sort by Schedd state                                              Sort by Schedd
               Sort by total number of jobs scheduled                            Sort by InQ
               Sort by number of running jobs scheduled by this machine          Sort by Act
               Sort by startd state                                              Sort by Startd
               Sort by the number of jobs running on this machine                Sort by Run
               Sort by load average                                              Sort by LdAvg
               Sort by keyboard idle time                                        Sort by Idle
               Sort by hardware architecture                                     Sort by Arch
               Sort by operating system type                                     Sort by OpSys
               Not specify a sort                                                No Sort


               You can select a sort type as either a Primary or Secondary sorting option. For
               example, suppose you select Sort by Arch as the primary sorting option and Sort
               by Name as the secondary sorting option. The Machines window is sorted by
               hardware architecture, and within each architecture type, by machine name.

Finding the location of the central manager
               The LoadLeveler administrator designates one of the nodes in the LoadLeveler
               cluster as the central manager.

               When jobs are submitted at any node, the central manager is notified and decides
               where to schedule the jobs. In addition, it keeps track of the status of machines in
               the cluster and the jobs in the system by communicating with each node.
               LoadLeveler uses this information to make the scheduling decisions and to
               respond to queries.

               To find the location of the central manager, from the Machines window:


SELECT
                              Actions → Find Central Manager
                                   A message appears in the message window declaring on which machine
                                 the central manager is located.

Finding the location of the public scheduling machines
                        Public scheduling machines are those machines that participate in the scheduling
                        of LoadLeveler jobs on behalf of the submit-only machines.

                        To get a list of these machines in your cluster, use the Machines window:
                        SELECT
                              Actions → Find Public Scheduler
                                   A message appears displaying the names of these machines.

Finding the type of scheduler in use
                        The LoadLeveler administrator defines the scheduler used by the cluster.

                        To determine which scheduler is currently in use:
                        SELECT
                              Actions → Find Scheduler Type
                                   A message appears displaying the type:
                                 v ll_default
                                 v BACKFILL
                                 v External (API)

Specifying which jobs appear in the Jobs window
                        Normally, only your jobs appear in the Jobs window.

                        You can, however, specify which jobs you want to appear by using the Select
                        pull-down menu on the Jobs window (see Table 67).
                        Table 67. Specifying which jobs appear in the Jobs window
                        To Display                         Select Select →
                        All jobs in the queue              All
                        All jobs belonging to a specific   By User
                        user (or users)
                                                            A window appears prompting you to enter the user IDs
                                                           whose jobs you want to view.
                        All jobs submitted to a specific By Machine
                        machine (or machines)
                                                          A window appears prompting you to enter the machine
                                                         names on which the jobs you want to view are running.
                        All jobs belonging to a specific   By Group
                        group (or groups)
                                                             A window appears prompting you to enter the
                                                           LoadLeveler group names to which the jobs you want to
                                                           view belong.




              All jobs having a particular ID   By Job Id

                                                A dialog box prompts you to enter the id of the job you
                                                want to appear. This ID appears in the left column of the
                                                Jobs window. Type in the ID and press OK.
               Note: When you choose By User, By Machine, or By Group, you can use a UNIX regular
               expression enclosed in parentheses. For example, you can enter (^k10) to display all
               machines beginning with the characters “k10”.


              SELECT
                    Select → Show Selection to show the selection parameters.

Specifying which machines appear in Machines window
              You can specify which machines will appear in the Machines window.

              See Table 68. The default is to view all of the machines in the LoadLeveler pool.

              From the Machines window:
              Table 68. Specifying which machines appear in Machines window
              To                                  Select Select →
              View all of the machines            All
              View machines by operating          by OpSys
              system
                                                   A window appears prompting you to enter the
                                                  operating system of those machines you want to view.
              View machines by hardware           by Arch
              architecture
                                                    A window appears prompting you to enter the
                                                  hardware architecture of those machines you want to
                                                  view.
              View machines by state              by State

                                                    A cascading pull-down menu appears prompting you
                                                  to select the state of the machines that you want to view.


              SELECT
                    Select → Show Selection to show the selection parameters.

Saving LoadLeveler messages in a file
              Normally, all the messages that LoadLeveler generates appear in the Messages
              window.

              If you would also like to have these messages written to a file, use the Messages
              window.
              SELECT
                    Actions → Start logging to a file
                        A window appears prompting you to enter a filename in which to log
                      the messages.


TYPE IN
                              The filename in the text entry field.
                        SELECT
                              OK
                                   The window closes.




Part 4. TWS LoadLeveler interfaces reference
            The topics in the TWS LoadLeveler interfaces reference provide the details you
            need to know to correctly use the IBM Tivoli Workload Scheduler (TWS)
            LoadLeveler interfaces for the following tasks:
            v Specifying keywords in the TWS LoadLeveler control files
            v Starting and customizing the TWS LoadLeveler GUI
            v Correctly coding the TWS LoadLeveler commands and APIs




Chapter 12. Configuration file reference
               The configuration file contains many parameters that you can set or modify to
               control how LoadLeveler operates.

               You may control LoadLeveler’s operation either:
               v Across the cluster, by modifying the global configuration file, LoadL_config, or
               v Locally, by modifying the LoadL_config.local file on individual machines.

               Table 69 shows the configuration subtasks:
               Table 69. Configuration subtasks
               Subtask                              Associated information (see . . . )
               To find out what administrator tasks Chapter 4, “Configuring the LoadLeveler
               you can accomplish by using the      environment,” on page 41
               configuration file
               To learn how to correctly specify the v “Configuration file syntax”
               contents of a configuration file
                                                     v “Configuration file keyword descriptions” on page
                                                       265
                                                    v “User-defined keywords” on page 313
                                                    v “LoadLeveler variables” on page 314



Configuration file syntax
               The information in both the LoadL_config and the LoadL_config.local files is in
               the form of statements. These statements are made up of keywords and values.

               There are three types of configuration file keywords:
               v Keywords, described in “Configuration file keyword descriptions” on page 265.
               v User-defined variables, described in “User-defined keywords” on page 313.
               v LoadLeveler variables, described in “LoadLeveler variables” on page 314.

               Configuration file statements take one of the following formats:
               keyword=value
               keyword:value

               Statements in the form keyword=value are used primarily to customize an
               environment. Statements in the form keyword:value are used by LoadLeveler to
               characterize the machine and are known as part of the machine description. Every
               machine in LoadLeveler has its own machine description which is read by the
               central manager when LoadLeveler is started.
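
               For example, the first statement below uses the keyword=value form and the
               second uses the keyword:value form (the values are illustrative):

                  ADMIN_FILE = /u/loadl/admin_file
                  MACHPRIO : (Memory + FreeRealMemory)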

               Keywords are not case sensitive. This means you can enter them in lower case,
               upper case, or mixed case.

               Note: For the keyword=value form, if the keyword is of a boolean type and only
                     true and false are valid input, a value string starting with t or T is taken as
                     true; all other values are taken as false.

               To continue configuration file statements, use the backslash character (\).
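
               For example, the following two physical lines form a single statement:

                  floating_resources = spice2g6(9876543210123) \
                                       db2_license(1234567890)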

In the configuration file, comments must be on a separate line from keyword
                            statements.

                            You can use the following types of constants and operators in the configuration
                            file.

                 Numerical and alphabetical constants
                            These are the numerical and alphabetical constants.

                            Constants may be represented as:
                            v Boolean expressions
                            v Signed integers
                            v Floating point values
                            v Strings enclosed in double quotes (" ").

                 Mathematical operators
                            You can use the following C operators.

                            The operators are listed in order of precedence. All of these operators are evaluated
                            from left to right:
                            v !
                            v * /
                            v - +
                            v < <= > >=
                            v == !=
                            v &&
                            v ||
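
                            Because the comparison operators take precedence over &&, the following
                            two forms of a START expression are equivalent (the expression itself is
                            illustrative):

                               START : (LoadAvg <= 0.5) && (KeyboardIdle > 300)
                               START : LoadAvg <= 0.5 && KeyboardIdle > 300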

                 64-bit support for configuration file keywords and expressions
                            Administrators can assign 64-bit integer values to selected keywords in the
                            configuration file.
                            floating_resources
                                Consumable resources associated with the floating_resources keyword may be
                                assigned 64-bit integer values. Fractional and unit specifications are not
|                               allowed. The predefined ConsumableCpus, ConsumableMemory,
|                               ConsumableLargePageMemory, and ConsumableVirtualMemory may not be
|                               specified as floating resources.
                                Example:
                                floating_resources = spice2g6(9876543210123) db2_license(1234567890)
                            MACHPRIO expression
|                             The LoadLeveler variables: Disk, ConsumableCpus, ConsumableMemory,
|                             ConsumableVirtualMemory, ConsumableLargePageMemory, PagesScanned,
|                             Memory, VirtualMemory, FreeRealMemory, and PagesFreed may be used in a
|                             MACHPRIO expression. They are 64-bit integers and 64-bit arithmetic is used
                              to evaluate them.
                                Example:
                                MACHPRIO: (Memory + FreeRealMemory) - (LoadAvg*1000 + PagesScanned)




Configuration file keyword descriptions
              This topic provides an alphabetical list of the keywords you can use in a
              LoadLeveler configuration file.

              It also provides examples of statements that use these keywords.
              ACCT
                Turns the accounting function on or off.
                 Syntax:
                 ACCT = flag ...

                 The available flags are:
                 A_DETAIL
                      Enables extended accounting. Using this flag causes LoadLeveler to
                       record detailed resource consumption, by machine and by event, for each
                      job step. This flag also enables the -x flag of the llq command,
                      permitting users to view resource consumption for active jobs.
                 A_RES
                           Turns reservation data recording on.
                 A_OFF
                           Turns accounting data recording off.
                 A_ON Turns accounting data recording on. If specified without the
                      A_DETAIL flag, the following is recorded:
                      v The total amount of CPU time consumed by the entire job
                      v The maximum memory consumption of all tasks (or nodes).
                 A_VALIDATE
                       Turns account validation on.
                 Default value: A_OFF
                  Example: This example specifies that accounting should be turned on, that
                  extended accounting data should be collected, and that the -x flag of the llq
                  command should be enabled.
                 ACCT = A_ON A_DETAIL
              ACCT_VALIDATION
                Identifies the executable called to perform account validation.
                 Syntax:
                 ACCT_VALIDATION = program

                 Where program is a validation program.
                  Default value: $(BIN)/llacctval (the accounting validation program shipped
                  with LoadLeveler).
              ACTION_ON_MAX_REJECT
                Specifies the state in which jobs are placed when their rejection count has
                reached the value of the MAX_JOB_REJECT keyword. HOLD specifies that
                jobs are placed in User Hold status; SYSHOLD specifies that jobs are placed in
                System Hold status; CANCEL specifies that jobs are canceled. When a job is
                rejected, LoadLeveler sends a mail message stating why the job was rejected.
                 Syntax:
                 ACTION_ON_MAX_REJECT = HOLD | SYSHOLD | CANCEL

Default value: HOLD
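                  Example: The following hypothetical settings cancel any job whose rejection
                  count reaches five (MAX_JOB_REJECT is shown for context):
                  MAX_JOB_REJECT = 5
                  ACTION_ON_MAX_REJECT = CANCEL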
                            ACTION_ON_SWITCH_TABLE_ERROR
                              Points to an administrator supplied program that will be run when
                              DRAIN_ON_SWITCH_TABLE_ERROR is set to true and a switch table
                              unload error occurs.
                                Syntax:
                                ACTION_ON_SWITCH_TABLE_ERROR = program
                                Default value: The default is to not run a program.
                            ADMIN_FILE
                              Points to the administration file containing user, class, group, machine, and
                              adapter stanzas.
                                Syntax:
                                 ADMIN_FILE = filename
                                Default value: $(tilde)/admin_file
                            AFS_GETNEWTOKEN
                               Specifies a filter that, for example, can be used to refresh an AFS token.
                                Syntax:
                                AFS_GETNEWTOKEN = full_path_to_executable

                                Where full_path_to_executable is an administrator-supplied program that
                                receives the AFS authentication information on standard input and writes the
                                new information to standard output. The filter is run when the job is
                                scheduled to run and can be used to refresh a token which expired when the
                                job was queued.
                                Default value: The default is to not run a program.
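                                 Example: The following statement names a hypothetical
                                 administrator-supplied token-refresh program:
                                 AFS_GETNEWTOKEN = /usr/local/bin/refresh_afs_token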
                            AGGREGATE_ADAPTERS
                              Allows an external scheduler to specify per-window adapter usages.
                                Syntax:
                                AGGREGATE_ADAPTERS = YES | NO
                                When this keyword is set to YES, the resources from multiple switch adapters
                                on the same switch network are treated as one aggregate pool available to each
                                job. When this keyword is set to NO, the switch adapters are treated
                                individually and a job cannot use resources from multiple adapters on the
                                same network.
                                Set this keyword to NO when you are using an external scheduler; otherwise,
                                set to YES (or accept the default).
                                Default value: YES
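                                 Example: The following setting is appropriate when an external
                                 scheduler is in use:
                                 AGGREGATE_ADAPTERS = NO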
|                           ALLOC_EXCLUSIVE_CPU_PER_JOB
|                              Specifies the way CPU affinity is enforced on Linux platforms. When this
|                              keyword is not specified or when an unrecognized value is assigned to it,
|                              LoadLeveler will not attempt to set CPU affinity for any application processes
|                              spawned by it.

|                               Note: This keyword is valid only on Linux x86 and x86_64 platforms. This
|                                     keyword is ignored by LoadLeveler on all other platforms.
|                               The ALLOC_EXCLUSIVE_CPU_PER_JOB keyword can be specified in the
|                               global or local configuration files. It can also be specified in both configuration

|      files, in which case the setting in the local configuration file will override that
|      of the global configuration file. The keyword cannot be turned off in a local
|      configuration file if it has been set to any value in the global configuration file.
|      Changes to ALLOC_EXCLUSIVE_CPU_PER_JOB will not take effect at
|      reconfiguration. The administrator must stop and restart or recycle
|      LoadLeveler when changing ALLOC_EXCLUSIVE_CPU_PER_JOB.
|      Syntax:
|      ALLOC_EXCLUSIVE_CPU_PER_JOB = LOGICAL|PHYSICAL
|      Default value: By default, when this keyword is not specified, CPU affinity is
|      not set.
|      Example: When the value of this keyword is set to LOGICAL, only one
|      LoadLeveler job step will run on each of the processors available on the
|      machine:
|      ALLOC_EXCLUSIVE_CPU_PER_JOB = LOGICAL
|      Example: When the value of this keyword is set to PHYSICAL, all logical
|      processors (or physical cores) configured in one physical CPU package will be
|      allocated to one and only one LoadLeveler job step.
|      ALLOC_EXCLUSIVE_CPU_PER_JOB = PHYSICAL
    ARCH
      Indicates the standard architecture of the system. The architecture you specify
      here must be specified in the same format in the requirements and preferences
      statements in job command files. The administrator defines the character string
      for each architecture.
       Syntax:
       ARCH = string
       Default value: Use the command llstatus -l to view the default.
       Example: To define a machine as an RS/6000®, the keyword would look like:
         ARCH = R6000
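        A job command file can then request machines of this architecture with
        a matching requirements statement, for example:
          requirements = (Arch == "R6000")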
    BG_ALLOW_LL_JOBS_ONLY
       Specifies whether only jobs submitted through LoadLeveler will be
       accepted by the Blue Gene job launcher program.
       Syntax:
       BG_ALLOW_LL_JOBS_ONLY = true | false
       Default value: false
    BG_CACHE_PARTITIONS
       Specifies whether allocated partitions are to be reused for Blue Gene jobs
       whenever possible.
       Syntax:
       BG_CACHE_PARTITIONS = true | false
       Default value: true
    BG_ENABLED
       Specifies whether Blue Gene support is enabled.
       Syntax:
       BG_ENABLED = true | false




If the value of this keyword is true, the central manager will load the Blue
                            Gene control system libraries and query the state of the Blue Gene system so
                            that jobs of type bluegene can be scheduled.
                            Default value: false
                        BG_MIN_PARTITION_SIZE
                           Specifies the smallest number of compute nodes in a partition.
                            Syntax:
                            BG_MIN_PARTITION_SIZE = 32 | 128 | 512 (for Blue Gene/L)

                            BG_MIN_PARTITION_SIZE = 16 | 32 | 64 | 128 | 256 | 512 (for Blue Gene/P)

                            The value for this keyword must not be smaller than the minimum partition
                            size supported by the physical Blue Gene hardware. If the number of compute
                            nodes requested in a job is less than the minimum partition size, LoadLeveler
                            will increase the requested size to the minimum partition size.
                            If the max_psets_per_bp value is set in the DB_PROPERTY file, the value for
                            the BG_MIN_PARTITION_SIZE must be set as described in Table 70:
Table 70. BG_MIN_PARTITION_SIZE values

max_psets_per_bp value in    BG_MIN_PARTITION_SIZE    BG_MIN_PARTITION_SIZE
DB_PROPERTY file             for Blue Gene/L          for Blue Gene/P
4                            >= 128                   >= 128
8                            >= 128                   >= 64
16                           >= 32                    >= 32
32                           >= 32                    >= 16


                            Default value: 32
                        BIN
                           Defines the directory where LoadLeveler binaries are kept.
                            Syntax:
                            BIN = $(RELEASEDIR)/bin
                            Default value: $(tilde)/bin
                        CENTRAL_MANAGER_HEARTBEAT_INTERVAL
                           Specifies, in seconds, how frequently the primary and alternate central
                           managers communicate with each other.
                            Syntax:
                            CENTRAL_MANAGER_HEARTBEAT_INTERVAL = number
                            Default value: The default is 300 seconds (5 minutes).
                        CENTRAL_MANAGER_TIMEOUT
                           Specifies the number of heartbeat intervals that an alternate central manager
                           will wait before declaring that the primary central manager is not operating.
                            Syntax:
                            CENTRAL_MANAGER_TIMEOUT = number
                            Default value: The default is 6.
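                            Example: With the defaults, an alternate central manager declares the
                            primary down after CENTRAL_MANAGER_TIMEOUT *
                            CENTRAL_MANAGER_HEARTBEAT_INTERVAL = 6 * 300 = 1800 seconds (30
                            minutes). The following illustrative values shorten that to
                            4 * 60 = 240 seconds:
                              CENTRAL_MANAGER_HEARTBEAT_INTERVAL = 60
                              CENTRAL_MANAGER_TIMEOUT = 4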
                        CKPT_CLEANUP_INTERVAL
                          Specifies the interval, in seconds, at which the Schedd daemon will run the
                          program specified by the CKPT_CLEANUP_PROGRAM keyword.

Syntax:
   CKPT_CLEANUP_INTERVAL = number

   number must be a positive integer.
   Default value: -1
CKPT_CLEANUP_PROGRAM
  Identifies an administrator-provided program that is to be run at the interval
  specified by the CKPT_CLEANUP_INTERVAL keyword. The intent of this program is
  to delete old checkpoint files created by jobs running under LoadLeveler
  during the checkpoint process.
   Syntax:
   CKPT_CLEANUP_PROGRAM = program

   Where program is the fully qualified name of the program to be run. The
   program must be accessible and executable by LoadLeveler.
   A sample program to remove checkpoint files is provided in the
   /usr/lpp/LoadL/full/samples/llckpt/rmckptfiles.c file.
   Default value: No default value is set.
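    Example: The following illustrative settings (the program path is
    hypothetical) run a cleanup program once an hour:
      CKPT_CLEANUP_INTERVAL = 3600
      CKPT_CLEANUP_PROGRAM = /u/loadl/bin/rmckptfiles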
CKPT_EXECUTE_DIR
  Specifies the directory where the job step’s executable will be saved for
  checkpointable jobs. You can specify this keyword in either the configuration
  file or the job command file; different file permissions are required depending
  on where this keyword is set. For additional information, see “Planning
  considerations for checkpointing jobs” on page 140.
   Syntax:
   CKPT_EXECUTE_DIR = directory

   This directory cannot be the same as the current location of the executable file,
   or LoadLeveler will not stage the executable. In this case, the user must have
   execute permission for the current executable file.
   Default value: By default, the executable of a checkpointable job step is not
   staged.
CLASS
   Determines whether a machine will accept jobs of a certain job class. For
   parallel jobs, you must define a class instance for each task you want to run on
   a node using one of two formats:
    v The format, CLASS = class_name (count), defines the CLASS names using a
      statement that names the classes and sets the number of tasks for each class
      in parentheses.
     With this format, the following rules apply:
     – Each class can have only one entry
     – If a class has more than one entry or there is a syntax error, the entire
        CLASS statement will be ignored
     – If the CLASS statement has a blank value or is not specified, it will be
        defaulted to No_Class (1)
      – The number of instances for a class specified inside the parentheses ( )
         must be an unsigned integer. If the number specified is 0, it is correct
         syntactically, but the class will not be defined in LoadLeveler
      – If the number of instances for every class in the CLASS statement is 0,
         the default No_Class(1) will be used


v The format, CLASS = { "class1" "class2" "class2" "class2" }, defines the CLASS
                                   names using a statement that names each class and sets the number of tasks
                                   for each class based on the number of times that the class name is used
                                   inside the {} operands.

                                Note: With both formats, the class names list is blank delimited.
                                For a LoadLeveler job to run on a machine, the machine must have a vacancy
                                for the class of that job. If the machine is configured for only one No_Class job
                                and a LoadLeveler job is already running there, then no further LoadLeveler
                                jobs are started on that machine until the current job completes.
|                               You can have a maximum of 1024 characters in the class statement. You cannot
|                               use allclasses or data_stage as a class name, since these are reserved
|                               LoadLeveler keywords.
                                You can assign multiple classes to the same machine by specifying the classes
                                in the LoadLeveler configuration file (called LoadL_config) or in the local
                                configuration file (called LoadL_config.local). The classes, themselves, should
                                be defined in the administration file. See “Setting up a single machine to have
                                multiple job classes” on page 723 and “Defining classes” on page 89 for more
                                information on classes.
                                Syntax:
                                CLASS = { "class_name" ... } | {"No_Class"} | class_name (count) ...
                                Default value: {"No_Class"}
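                                Example: The following two statements are equivalent; each defines two
                                class instances of parallel and one of serial (the class names are
                                illustrative):
                                  CLASS = parallel(2) serial(1)
                                  CLASS = { "parallel" "parallel" "serial" }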
                            CLIENT_TIMEOUT
                               Specifies the maximum time, in seconds, that a daemon waits for a response
                               over TCP/IP from a process. If the waiting time exceeds the specified amount,
                               the daemon tries again to communicate with the process. In general, you
                               should use the default setting unless you are experiencing delays due to an
                               excessively loaded network. If so, you should try increasing this value.
                                Syntax:
                                CLIENT_TIMEOUT = number
                                Default value: The default is 30 seconds.
                            CLUSTER_METRIC
                               Indicates the installation exit to be run by the Schedd to determine where a
                               remote job is distributed. If a remote job is submitted with a list of clusters or
                               the reserved word any and the installation exit is not specified, the remote job
                               is not submitted.
                                Syntax:
                                CLUSTER_METRIC = full_pathname_to_executable

                                The installation exit is run with the following parameters passed as input. All
                                parameters are character strings.
                                v The job ID of the job to be distributed
                                v The number of clusters in the list of clusters
                                v A blank-delimited list of clusters to be considered
                                If the user specifies the reserved word any as the cluster_list during job
                                submission, the job is sent to the first outbound Schedd defined for the
                                first configured remote cluster; if the user specifies a list of clusters,
                                the job is sent to the first outbound Schedd defined for the first specified
                                remote cluster. In either case, the CLUSTER_METRIC is executed on this
                                machine to determine where the job will be distributed. If this machine is
                                not the outbound_hosts Schedd for the assigned cluster, the job will be
                                forwarded to the correct outbound_hosts Schedd.

   Note: The list of clusters may contain a single entry of the reserved word any,
          which indicates that the CLUSTER_METRIC installation exit must
          determine its own list of clusters to select from. This can be all of the
          clusters available using the data access API or a predetermined list set
          by the administrator. If any is specified in place of a cluster list, the
          metric will receive a count of 1 followed by the keyword any.
   The installation exit must write the remote cluster name to which the job is
   submitted as standard output and exit with a value of 0. An exit value of -1
   indicates an error in determining the cluster for distribution and the job is not
    submitted. Returned cluster names that are not valid also cause the job not
    to be submitted. STDERR from the exit is written to the Schedd log.
   LoadLeveler provides a set of sample exits for use in distributing jobs by the
   following metrics:
   v The number of jobs in the idle queue
   v The number of jobs in the specified class
   v The number of free nodes in the cluster
   The installation exit samples are available in the ${RELEASEDIR}/samples/
   llcluster directory.
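    The following minimal C sketch illustrates the exit contract; it is not one
    of the shipped samples, and it assumes that each candidate cluster name is
    passed as a separate argument. It simply selects the first candidate; a
    real exit would rank the candidates, for example with the data access API:

       #include <stdio.h>
       #include <stdlib.h>

       /* argv[1] is the job ID, argv[2] the number of clusters, and
        * argv[3] onward the candidate cluster names. A real exit must
        * also handle a count of 1 followed by the keyword "any" by
        * building its own candidate list. */
       int main(int argc, char *argv[])
       {
           if (argc < 4 || atoi(argv[2]) < 1)
               return -1;            /* error: the job is not submitted */
           printf("%s\n", argv[3]);  /* chosen cluster on standard output */
           return 0;
       }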
CLUSTER_REMOTE_JOB_FILTER
   Indicates the installation exit to be run by the inbound Schedd for each remote
   job request to filter the user’s job command file statements during submission
   or move job. If the keyword is not specified, no job filtering is done.
   Syntax:
   CLUSTER_REMOTE_JOB_FILTER = full_pathname_to_executable

   The installation exit is run with the submitting user’s ID. All parameters are
   character strings.
   This installation exit is executed on the inbound_hosts of the local cluster
   when receiving a job submission or move job request.
   The executable specified is called with the submitting user’s unfiltered job
   command file statements as the standard input. The standard output is
   submitted to LoadLeveler. If the exit returns with a nonzero exit code, the
   remote job submission or job move will fail. A submit filter can only make
   changes to LoadLeveler job command file statements.
   The data access API can be used by the remote job filter to query the Schedd
   for the job object received from the sending cluster.
   If the local submission filter on the submitting cluster has added or deleted
   steps from the original user’s job command file, the remote job filter must add
   or delete the same number of steps. The job command file statements returned
   by the remote job filter must contain the same number of steps as the job
   object received from the sending cluster.
   Changes to the following job command file keyword statements are ignored:
   v executable

v   environment
                                v   image_size
                                v   cluster_input_file
                                v   cluster_output_file
                                v   cluster_list
                                The following job command file keyword will have different behavior:
                                v initialdir – If not set by the remote job filter or the submitting user’s
                                  unfiltered job command file, the default value will remain the current
                                  working directory at the time the job was submitted. Access to the initialdir
                                  will be verified on the cluster selected to run the job. If access to initialdir
                                  fails, the submission or move job will fail.
|                               When you distribute a scale-across job to other clusters for scheduling and a
|                               remote job filter is configured, the filter will be applied to the distributed job.
|                               However, only changes to the following job command file keyword statements
|                               will be accepted. Changes to any other statement by the remote job filter will
|                               be ignored.
|                               v #@ class
|                               v #@ priority
|                               v #@ as_limit
|                               v #@ core_limit
|                               v #@ cpu_limit
|                               v #@ data_limit
|                               v #@ file_limit
|                               v #@ job_cpu_limit
|                               v #@ locks_limit
|                               v #@ memlock_limit
|                               v #@ nofile_limit
|                               v #@ nproc_limit
|                               v #@ rss_limit
|                               v #@ stack_limit
                                To maintain compatibility between the SUBMIT_FILTER and
                                CLUSTER_REMOTE_JOB_FILTER programs, the following environment
                                variables are set when either exit is invoked:
                                v LOADL_ACTIVE – the LoadLeveler version.
                                v LOADL_STEP_COMMAND – the location of the job command file passed
                                  as input to the program. This job command file only contains LoadLeveler
                                  keywords.
                                v LOADL_STEP_ID – The job identifier, generated by the submitting
                                  LoadLeveler cluster.

                                  Note: The environment variable name is LOADL_STEP_ID although the
                                        value it contains is a "job" identifier. This name is used to be
                                        compatible with the local job filter interface.
                                v LOADL_STEP_OWNER – The owner (UNIX user name) of the job.
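                                As an illustration of the contract only, the following minimal C sketch is
                                a valid pass-through remote job filter. Because it echoes standard input to
                                standard output unchanged, it adds and deletes no steps, so the step count
                                is preserved as required; a real filter would edit individual keyword
                                statements before printing them:

                                   #include <stdio.h>

                                   int main(void)
                                   {
                                       int c;
                                       /* Unfiltered job command file statements arrive on standard
                                        * input; whatever is written to standard output is submitted. */
                                       while ((c = getchar()) != EOF)
                                           putchar(c);
                                       return 0;  /* nonzero makes the remote submission fail */
                                   }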
                            CLUSTER_USER_MAPPER
                               Indicates the installation exit to be run by the inbound Schedd for each remote




job request to determine the user mapping of the cluster. This keyword implies
   that user mapping is performed. If the keyword is not specified, no user
   mapping is done.
   Syntax:
   CLUSTER_USER_MAPPER = full_pathname_to_executable

   The installation exit is run with the following parameters passed as input. All
   parameters are character strings.
   v The user name to be mapped
    v The name of the cluster from which the user originated
   This installation exit is executed on the inbound_hosts of the local cluster
   when receiving a job submission, move job request or remote command.
   The installation exit must write the new user name as standard output and exit
   with a value of 0. An exit value of -1 indicates an error and the job is not
   submitted. STDERR from the exit is written to the Schedd log. An exit value of
   1 indicates that the user name returned for this job was not mapped.
CM_CHECK_USERID
   Specifies whether the central manager will check that user IDs that send
   requests through a command or API exist on the central manager machine.
   Syntax:
   CM_CHECK_USERID = true | false
   Default value: true
CM_COLLECTOR_PORT
   Specifies the port number used when connecting to the central manager's
   collector.
   Syntax:
   CM_COLLECTOR_PORT = port number
   Default value: The default is 9612.
COMM
  Specifies a local directory where LoadLeveler keeps special files used for UNIX
  domain sockets for communicating among LoadLeveler daemons running on
   the same machine. This keyword allows the administrator to choose a file
   system other than /tmp for these files. If you change the COMM option
   you must stop and then restart LoadLeveler using the llctl command.
   Syntax:
   COMM = local directory
   Default value: The default location for the files is /tmp.
CONTINUE
  Determines whether suspended jobs should continue execution.
   Syntax:
   CONTINUE: expression that evaluates to T or F (true or false)

   When T, suspended LoadLeveler jobs resume execution on the machine.

   Default value: No default value is set.
   For information about time-related variables that you may use for this
   keyword, see “Variables to use for setting times” on page 320.
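    Example: The following illustrative expression (the hours chosen are
    arbitrary) resumes suspended jobs outside of prime shift:
      CONTINUE: (tm_hour < 8) || (tm_hour >= 17)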


CUSTOM_METRIC
                          Specifies a machine’s relative priority to run jobs.
                            Syntax:
                            CUSTOM_METRIC = number

                             This is an arbitrary number that you can use in the MACHPRIO expression.
                            Negative values are not allowed.
                            Default value: If you specify neither CUSTOM_METRIC nor
                            CUSTOM_METRIC_COMMAND, CUSTOM_METRIC = 1 is assumed. For
                            more information, see “Setting negotiator characteristics and policies” on page
                            45.
                            For more information related to using this keyword, see “Defining a
                            LoadLeveler cluster” on page 44.
                        CUSTOM_METRIC_COMMAND
                           Specifies an executable and any required arguments. The exit code of this
                           command is assigned to CUSTOM_METRIC. If this command does not exit
                           normally, CUSTOM_METRIC is assigned a value of 1. This command is
                           forked every POLLING_FREQUENCY * POLLS_PER_UPDATE seconds.
                            Syntax:
                            CUSTOM_METRIC_COMMAND = command
                            Default value: No default is set; LoadLeveler does not run any command to
                            determine CUSTOM_METRIC.
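                            Example: In the following illustrative configuration (the script path is
                            hypothetical), the exit code of a site-written script becomes the
                            machine's CUSTOM_METRIC, which the MACHPRIO expression then uses to rank
                            machines:
                              CUSTOM_METRIC_COMMAND = /u/loadl/bin/rate_machine
                              MACHPRIO = $(CUSTOM_METRIC)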
                        DCE_AUTHENTICATION_PAIR
                          Specifies a pair of installation supplied programs that are used to authenticate
                          DCE security credentials.
                            Restriction: DCE security is not supported by LoadLeveler for Linux.
                            Syntax:
                            DCE_AUTHENTICATION_PAIR = program1, program2

                            Where program1 and program2 are LoadLeveler- or installation-supplied
                            programs that are used to authenticate DCE security credentials. program1
                            obtains a handle (an opaque credentials object), at the time the job is
                            submitted, which is used to authenticate to DCE. program2 uses the handle
                            obtained by program1 to authenticate to DCE before starting the job on the
                            executing machines.
                            Default value: See “Handling DCE security credentials” on page 74 for
                            information about defaults.
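                            Example: Assuming the LoadLeveler-supplied DCE programs are used, a
                            typical pairing looks like the following:
                              DCE_AUTHENTICATION_PAIR = $(BIN)/llgetdce, $(BIN)/llsetdce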
                        DEFAULT_PREEMPT_METHOD
                           Specifies the default preemption method for LoadLeveler to use when a
                           preempt method is not specified in a PREEMPT_CLASS statement or in the
                           llpreempt command. LoadLeveler also uses this default preemption method to
                           preempt job steps that are running on reserved machines when a reservation
                           period begins.
                            Restrictions:
                            v This keyword is valid only for the BACKFILL scheduler.
                            v The suspend method of preemption (the default) might not be supported on
                              your level of Linux. If you want to preempt jobs that are running where
                              process tracking is not supported, you must use this keyword to specify a
                              method other than suspend.

Syntax:
        DEFAULT_PREEMPT_METHOD = rm | sh | su | vc | uh

       Valid values are:
       rm
           LoadLeveler preempts the jobs and removes them from the job queue. To
           rerun the job, the user must resubmit the job to LoadLeveler.
       sh LoadLeveler ends the jobs and puts them into System Hold state. They
           remain in that state on the job queue until an administrator releases them.
           After being released, the jobs go into Idle state and will be rescheduled to
           run as soon as resources for the job are available.
       su LoadLeveler suspends the jobs and puts them in Preempted state. They
           remain in that state on the job queue until the preempting job has
           terminated, and resources are available to resume the preempted job on the
           same set of nodes. To use this value, process tracking must be enabled.
       vc LoadLeveler ends the jobs and puts them in Vacate state. They remain in
           that state on the job queue and will be rescheduled to run as soon as
           resources for the job are available.
       uh LoadLeveler ends the jobs and puts them into User Hold state. They
           remain in that state on the job queue until an administrator releases them.
           After being released, the jobs go into Idle state and will be rescheduled to
           run as soon as resources for the job are available.
       Default value: su (suspend method)
       For more information related to using this keyword, see “Steps for configuring
       a scheduler to preempt jobs” on page 130.
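        Example: On a system where process tracking is not enabled, one
        illustrative alternative to the suspend default is to vacate preempted
        jobs:
          DEFAULT_PREEMPT_METHOD = vc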
    DRAIN_ON_SWITCH_TABLE_ERROR
      Specifies whether the startd should be drained when the switch table fails to
      unload. This will flag the administrator that intervention may be required to
      unload the switch table. When DRAIN_ON_SWITCH_TABLE_ERROR is set
      to true, the startd will be drained when the switch table fails to unload.
       Syntax:
       DRAIN_ON_SWITCH_TABLE_ERROR = true | false
       Default value: false
|   DSTG_MAX_STARTERS
|      Specifies a machine-specific limit on the number of data staging initiators.
|      Since each task of a data staging job step consumes one initiator from the
|      data_stage class on the specified machine, DSTG_MAX_STARTERS provides
|      the maximum number of data staging tasks that can run at the same time on
|      the machine.
|      Syntax:
|      DSTG_MAX_STARTERS = number

|      Notes:
|      1. If you have not set the DSTG_MAX_STARTERS value in either the
|         global or local configuration files, there will not be any data
|         staging initiators on the specified machine. In this configuration,
|         the compute node will not be allowed to perform data staging tasks.
|      2. The value specified for DSTG_MAX_STARTERS will be the number of
|         initiators available for the built-in data_stage class on that
|         machine.
|      3. The value specified for MAX_STARTERS will not limit the value
|         specified for DSTG_MAX_STARTERS.
|                               Default value: 0
|                           DSTG_MIN_SCHEDULING_INTERVAL
|                              Specifies a minimum interval between scheduling inbound data staging job
|                              steps when they cannot be scheduled immediately. With a workload that
|                              involves a lot of data staging jobs, this keyword can be adjusted down from
|                              the default value of 900 seconds if data staging jobs remain idle when there
|                              are data staging resources available. Setting this keyword to a smaller interval
|                              may impact scheduler performance when there is contention for data staging
|                              resources and a large number of idle jobs in the queue.
|                               Syntax:
|                               DSTG_MIN_SCHEDULING_INTERVAL = seconds

|                               Notes:
|                                         1. You can only specify this keyword in the global configuration file; it
|                                            will be ignored in local configuration files.
|                                         2. LoadLeveler ignores DSTG_MIN_SCHEDULING_INTERVAL
|                                            when DSTG_TIME=AT_SUBMIT.
|                               Default value: 900 seconds
|                           DSTG_TIME
|               Specifies when LoadLeveler schedules data staging job steps:
|                               AT_SUBMIT
|                                     LoadLeveler can schedule data staging steps any time after a job
|                                     requiring data staging has been submitted.
|                               JUST_IN_TIME
|                                     LoadLeveler must schedule data staging job steps as close as possible
|                                     to the application job steps that were submitted in the same job.
|                               Syntax:
|                               DSTG_TIME = AT_SUBMIT | JUST_IN_TIME

|                               Note: You can only specify the DSTG_TIME keyword in the global
|                                     configuration file. Any value specified for this keyword in local
|                                     configuration files will be ignored.
|                               Default value: AT_SUBMIT
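                                Example: The following illustrative combination allows two concurrent
                                data staging tasks on a machine and schedules inbound data staging as
                                close as possible to the application job steps:
                                  DSTG_MAX_STARTERS = 2
                                  DSTG_TIME = JUST_IN_TIME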
                            ENFORCE_RESOURCE_MEMORY
                               Specifies whether the AIX Workload Manager is configured to limit, as
                               precisely as possible, the real memory usage of a WLM class. For this keyword
                               to be valid, ConsumableMemory must be set through the
                               ENFORCE_RESOURCE_USAGE keyword.
                                Syntax:
                                ENFORCE_RESOURCE_MEMORY = true | false
                                Default value: false
                            ENFORCE_RESOURCE_POLICY
                               Specifies what type of resource entitlements will be assigned to the AIX
                               Workload Manager classes. Valid values are:
                               shares
                                   A share value is assigned to the class based on the job step's
                                   requested resources (one unit of resource equals one share). This is
                                   the default policy.
                               soft
                                   A percentage value is assigned to the class based on the job step's
                                   requested resources and the total machine resources. This percentage
                                   can be exceeded if there is no contention for the resource.
                               hard
                                   A percentage value is assigned to the class based on the job step's
                                   requested resources and the total machine resources. This percentage
                                   cannot be exceeded regardless of the contention for the resource.
                               This keyword is only valid for CPU and real memory with either shares
                               or percent limits. If desired, this keyword can be used in the
                               LoadL_config.local file to set up a different policy for each machine.
                               The ENFORCE_RESOURCE_USAGE keyword must be set for this keyword to be
                               valid.
                                Syntax:
                                ENFORCE_RESOURCE_POLICY = hard | soft | shares
                                Default value: shares
    ENFORCE_RESOURCE_SUBMISSION
       Indicates whether jobs submitted should be checked for the resources and
       node_resources keywords. If the value specified is true, LoadLeveler will
       check all jobs at submission time for the resources and node_resources
       keywords. The job command file resources and node_resources keywords
       combined need to have at least the resources specified in the
       ENFORCE_RESOURCE_USAGE keyword in order for the job to be submitted
       successfully. When RSET_MCM_AFFINITY is enabled, the task_affinity or
       parallel_threads keyword can be used instead of the resources and
       node_resources keywords when the resource being enforced is
       ConsumableCpus.
       If the value specified is false, no checking will be done and jobs submitted
       without the resources or node_resources keywords will not have resources
       enforced. In this instance, those jobs might interfere with other jobs whose
       resources are enforced.
       Syntax:
       ENFORCE_RESOURCE_SUBMISSION = true | false
       Default value: false
    ENFORCE_RESOURCE_USAGE
|      Specifies whether the AIX Workload Manager is used to enforce CPU and
|      memory resources. This keyword accepts either a value of deactivate or a list
|      of one or more of the following predefined resources:
       v ConsumableCpus
       v ConsumableMemory
|      v ConsumableVirtualMemory
|      v ConsumableLargePageMemory
        Either memory or CPUs or both can be enforced, but the resources must also be
       specified on the SCHEDULE_BY_RESOURCES keyword. If deactivate is
       specified, LoadLeveler will deactivate AIX Workload Manager on all the nodes
       in the LoadLeveler cluster.

       Restriction: WLM enforcement is ignored by LoadLeveler for Linux.
       Syntax:
|      ENFORCE_RESOURCE_USAGE = name name ... name | deactivate
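        Example: The following illustrative configuration enforces CPU and real
        memory through AIX Workload Manager; note that the same resources must
        also appear on the SCHEDULE_BY_RESOURCES keyword:
          SCHEDULE_BY_RESOURCES = ConsumableCpus ConsumableMemory
          ENFORCE_RESOURCE_USAGE = ConsumableCpus ConsumableMemory
          ENFORCE_RESOURCE_MEMORY = true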




EXECUTE
                           Specifies the local directory to store the executables of jobs submitted by other
                           machines.
                            Syntax:
                            EXECUTE = local directory/execute
                            Default value: $(tilde)/execute
                        FAIR_SHARE_INTERVAL
                           Specifies, in units of hours, the time interval it takes for resource usage in fair
                           share scheduling to decay to 5% of its initial value. Historic fair share data
                           collected before the most recent time interval of this length will have little
                           impact on fair share scheduling.
                            Syntax:
                            FAIR_SHARE_INTERVAL = hours
                            Default value: The default value is 168 hours (one week). If a negative value
                            or 0 is specified, the default value is used.
                        FAIR_SHARE_TOTAL_SHARES
                           Specifies the total number of shares that the cluster CPU or Blue Gene
                           resources are divided into. If this value is less than or equal to 0, fair share
                           scheduling is turned off.
                            Syntax:
                            FAIR_SHARE_TOTAL_SHARES = shares
                            Default value: The default value is 0.
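                            Example: The following illustrative values turn on fair share scheduling
                            with 100 total shares and a two-day decay interval:
                              FAIR_SHARE_TOTAL_SHARES = 100
                              FAIR_SHARE_INTERVAL = 48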
                        FEATURE
                           Specifies an optional characteristic to use to match jobs with machines. You can
                           specify unique characteristics for any machine using this keyword. When
                           evaluating job submissions, LoadLeveler compares any required features
                           specified in the job command file to those specified using this keyword. You
                           can have a maximum of 1024 characters in the feature statement.
                            Syntax:
                            Feature = {"string" ...}
                            Default value: No default value is set.
                            Example: If a machine has licenses for the installed products ABC and
                            XYZ, you can enter the following in the local configuration file:
                            Feature = {"abc" "xyz"}
                            When submitting a job that requires both of these products, you should enter
                            the following in your job command file:
                            requirements = (Feature == "abc") && (Feature == "xyz")

                            Note: You must define a feature on all machines that will be able to run
                                  dynamic simultaneous multithreading (SMT). SMT is only supported on
                                  POWER6 and POWER5 processor-based systems.
                            Example: When submitting a job that requires the SMT function, first
                            specify smt = yes in the job command file (or select a class that has
                            smt = yes defined). Next, specify node_usage = not_shared and, last,
                            enter the following in the job command file:
                            requirements = (Feature == "smt")




FLOATING_RESOURCES
       Specifies which consumable resources are available collectively on all of the
       machines in the LoadLeveler cluster. The count for each resource must be an
       integer greater than or equal to zero, and each resource can only be specified
       once in the list. Any resource specified for this keyword that is not already
       listed in the SCHEDULE_BY_RESOURCES keyword will not affect job
       scheduling. If any resource is specified incorrectly with the
       FLOATING_RESOURCES keyword, then all floating resources will be
       ignored. ConsumableCpus, ConsumableMemory,
|      ConsumableVirtualMemory, and ConsumableLargePageMemory may not be
       specified as floating resources.
        Syntax:
        FLOATING_RESOURCES = name(count) name(count) ... name(count)
        Default value: No default value is set.
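         Example: The following illustrative statement (the resource names are
         hypothetical) defines 10 floating licenses of one product and 2 of
         another; both names must also appear on the SCHEDULE_BY_RESOURCES
         keyword to affect scheduling:
           FLOATING_RESOURCES = spice2g6(10) matlab(2)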
    FS_INTERVAL
       Defines the number of minutes used as the interval for checking free file
        system space or inodes. If your file system receives many log messages or
        copies large executables to the LoadLeveler spool, the file system will fill up
        more quickly and you should perform file-system checking more frequently by
        setting the interval to a smaller value. LoadLeveler will not check the file system if the
       value of FS_INTERVAL is:
       v Set to zero
       v Set to a negative integer
        Syntax:
        FS_INTERVAL = minutes
        Default value: If FS_INTERVAL is not specified but any of the other
        file-system keywords (FS_NOTIFY, FS_SUSPEND, FS_TERMINATE,
        INODE_NOTIFY, INODE_SUSPEND, INODE_TERMINATE) are specified, the
        FS_INTERVAL value will default to 5 and the file system will be checked. If no
        file-system or inode keywords are set, LoadLeveler does not monitor file
        systems at all.
        For more information related to using this keyword, see “Setting up file system
        monitoring” on page 54.
    FS_NOTIFY
       Defines the lower and upper amounts, in bytes, of free file-system space at
       which LoadLeveler is to notify the administrator:
       v If the amount of free space becomes less than the lower threshold value,
         LoadLeveler sends a mail message to the administrator indicating that
         logging problems may occur.
        v When the amount of free space becomes greater than the upper threshold
          value, LoadLeveler sends a mail message to the administrator indicating that
          the problem has been resolved.
        Syntax:
        FS_NOTIFY = lower threshold, upper threshold

         Specify space in bytes with the unit B. A metric prefix such as K, M, or G may
         precede the B. Valid values for both the lower and upper thresholds are -1B
         and all positive integers. If the value is set to -1, the transition across the
         threshold is not checked.
        Default value: In bytes: 1KB, -1B


For more information related to using this keyword, see “Setting up file system
                            monitoring” on page 54.
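                            Example: With the following illustrative thresholds, the administrator is
                            notified when free space falls below 5 MB and again when it recovers
                            above 10 MB:
                              FS_NOTIFY = 5MB, 10MB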
                        FS_SUSPEND
                           Defines the lower and upper amounts, in bytes, of free file system space at
                           which LoadLeveler drains and resumes the Schedd and startd daemons
                           running on a node.
                           v If the amount of free space becomes less than the lower threshold value,
                             then LoadLeveler drains the Schedd and the startd daemons if they are
                             running on a node. When this happens, logging is turned off and mail
                             notification is sent to the administrator.
                           v When the amount of free space becomes greater than the upper threshold
                             value, LoadLeveler signals the Schedd and the startd daemons to resume.
                             When this happens, logging is turned on and mail notification is sent to the
                             administrator.
                            Syntax:
                            FS_SUSPEND = lower threshold, upper threshold

                            Specify space in bytes with the unit B. A metric prefix such as K, M, or G may
                            precede the B. Valid values for both the lower and upper thresholds are -1B
                            and all positive integers. If the value is set to -1, the transition across the
                            threshold is not checked.
                            Default value: In bytes: -1B, -1B
                            For more information related to using this keyword, see “Setting up file system
                            monitoring” on page 54.
                        FS_TERMINATE
                           Defines the lower and upper amounts, in bytes, of free file system space at
                           which LoadLeveler is terminated. This keyword sends the SIGTERM signal to
                           the Master daemon which then terminates all LoadLeveler daemons running
                           on the node.
                           v If the amount of free space becomes less than the lower threshold value, all
                             LoadLeveler daemons are terminated.
                           v An upper threshold value is required for this keyword. However, since
                             LoadLeveler has been terminated at the lower threshold, no action occurs.
                            Syntax:
                            FS_TERMINATE = lower threshold, upper threshold

                            Specify space in bytes with the unit B. A metric prefix such as K, M, or G may
                            precede the B. Valid values for the lower threshold are -1B and all positive
                            integers. If the value is set to -1, the transition across the threshold is not
                            checked.
                            Default value: In bytes: -1B, -1B
                            For more information related to using this keyword, see “Setting up file system
                            monitoring” on page 54.
                        GLOBAL_HISTORY
                          Identifies the directory that will contain the global history files produced
                          by the llacctmrg command when no directory is specified as a command
                          argument.
                            Syntax:
                            GLOBAL_HISTORY = directory
                            Default value: The default value is $(SPOOL) (the local spool directory).


For more information related to using this keyword, see “Collecting the
   accounting information and storing it into files” on page 66.
GSMONITOR
  Location of the gsmonitor executable (LoadL_GSmonitor).
   Restriction: This keyword is ignored by LoadLeveler for Linux.
   Syntax:
   GSMONITOR = directory
   Default value: $(BIN)/LoadL_GSmonitor
GSMONITOR_COREDUMP_DIR
  Local directory for storing LoadL_GSmonitor core dump files.
   Restriction: This keyword is ignored by LoadLeveler for Linux.
   Syntax:
   GSMONITOR_COREDUMP_DIR = directory
   Default value: The /tmp directory.
   For more information related to using this keyword, see “Specifying file and
   directory locations” on page 47.
GSMONITOR_DOMAIN
  Specifies the peer domain on which the GSMONITOR daemon will execute.
   Restriction: This keyword is ignored by LoadLeveler for Linux.
   Syntax:
   GSMONITOR_DOMAIN = PEER
   Default value: No default value is set.
   For more information related to using this keyword, see “The gsmonitor
   daemon” on page 14.
GSMONITOR_RUNS_HERE
  Specifies whether the gsmonitor daemon will run on the host.
   Restriction: This keyword is ignored by LoadLeveler for Linux.
   Syntax:
   GSMONITOR_RUNS_HERE = TRUE | FALSE
   Default value: FALSE
   For more information related to using this keyword, see “The gsmonitor
   daemon” on page 14.
HISTORY
   Defines the path name where a file containing the history of local LoadLeveler
   jobs is kept.
   Syntax:
   HISTORY = directory
   Default value: $(SPOOL)/history
   For more information related to using this keyword, see “Collecting the
   accounting information and storing it into files” on page 66.
HISTORY_PERMISSION
   Specifies the owner, group, and world permissions of the history file associated
   with a LoadL_schedd daemon.

Syntax:
                            HISTORY_PERMISSION = permissions | rw-rw----

                            permissions must be a string of nine characters, each of which is one of
                            the characters r, w, x, or -.
                            Default value: The default settings are 660 (rw-rw----). LoadL_schedd will use
                            the default setting if the specified permissions are less than rw-------.
                            Example: A specification such as HISTORY_PERMISSION = rw-rw-r-- will result
                            in permission settings of 664.
                        INODE_NOTIFY
                           Defines the lower and upper amounts, in inodes, of free file-system inodes at
                           which LoadLeveler is to notify the administrator:
                           v If the number of free inodes becomes less than the lower threshold value,
                             LoadLeveler sends a mail message to the administrator indicating that
                             logging problems may occur.
                           v When the number of free inodes becomes greater than the upper threshold
                             value, LoadLeveler sends a mail message to the administrator indicating that
                             the problem has been resolved.
                            Syntax:
                            INODE_NOTIFY = lower threshold, upper threshold

                            Valid values for both the lower and upper thresholds are -1 and all positive
                            integers. If the value is set to -1, the transition across the threshold is not
                            checked.
                            Default value: In inodes: 1000, -1
                            For more information related to using this keyword, see “Setting up file system
                            monitoring” on page 54.
                        INODE_SUSPEND
                           Defines the lower and upper amounts, in inodes, of free file system inodes at
                           which LoadLeveler drains and resumes the Schedd and startd daemons
                           running on a node.
                           v If the number of free inodes becomes less than the lower threshold value,
                             then LoadLeveler drains the Schedd and the startd daemons if they are
                             running on a node. When this happens, logging is turned off and mail
                             notification is sent to the administrator.
                           v When the number of free inodes becomes greater than the upper threshold
                             value, LoadLeveler signals the Schedd and the startd daemons to resume.
                             When this happens, logging is turned on and mail notification is sent to the
                             administrator.
                            Syntax:
                            INODE_SUSPEND = lower threshold, upper threshold

                            Valid values for both the lower and upper thresholds are -1 and all positive
                            integers. If the value is set to -1, the transition across the threshold is not
                            checked.
                            Default value: In inodes: -1, -1
                            For more information related to using this keyword, see “Setting up file system
                            monitoring” on page 54.
                        INODE_TERMINATE
                           Defines the lower and upper amounts, in inodes, of free file system inodes at

which LoadLeveler is terminated. This keyword sends the SIGTERM signal to
   the Master daemon which then terminates all LoadLeveler daemons running
   on the node.
   v If the number of free inodes becomes less than the lower threshold value, all
     LoadLeveler daemons are terminated.
   v An upper threshold value is required for this keyword. However, since
     LoadLeveler has been terminated at the lower threshold, no action occurs.
   Syntax:
   INODE_TERMINATE = lower threshold, upper threshold

    Valid values for the lower threshold are -1 and all positive integers. If the
    value is set to -1, the transition across the threshold is not checked.
   Default value: In inodes: -1, -1
   For more information related to using this keyword, see “Setting up file system
   monitoring” on page 54.
JOB_ACCT_Q_POLICY
   Specifies, in seconds, how often the startd daemon updates the Schedd
   daemon with accounting data for running jobs. This controls the accuracy
   of the llq -x command.
   Syntax:
   JOB_ACCT_Q_POLICY = number
   Default value: 300 seconds
   For more information related to using this keyword, see “Gathering job
   accounting data” on page 61.
JOB_EPILOG
   Path name of the epilog program.
   Syntax:
   JOB_EPILOG = program name
   Default value: No default value is set.
   For more information related to using this keyword, see “Writing prolog and
   epilog programs” on page 77.
JOB_LIMIT_POLICY
   Specifies the interval, in seconds, at which LoadLeveler checks whether
   job_cpu_limit has been exceeded. The smaller of JOB_LIMIT_POLICY and
   JOB_ACCT_Q_POLICY is used to control how often the startd daemon
   collects resource consumption data on running jobs, and how often the
   job_cpu_limit is checked.
   Syntax:
   JOB_LIMIT_POLICY = number
   Default value: The default for JOB_LIMIT_POLICY is
   POLLING_FREQUENCY multiplied by POLLS_PER_UPDATE.
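   Example: With the illustrative values below, the startd daemon collects
   resource consumption data and checks job_cpu_limit every 60 seconds,
   because JOB_LIMIT_POLICY is the smaller of the two values:
     JOB_ACCT_Q_POLICY = 300
     JOB_LIMIT_POLICY = 60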
JOB_PROLOG
   Path name of the prolog program.
   Syntax:
   JOB_PROLOG = program name
   Default value: No default value is set.


For more information related to using this keyword, see “Writing prolog and
                              epilog programs” on page 77.
                        JOB_USER_EPILOG
                           Path name of the user epilog program.
                              Syntax:
                              JOB_USER_EPILOG = program name
                              Default value: No default value is set.
                              For more information related to using this keyword, see “Writing prolog and
                              epilog programs” on page 77.
                        JOB_USER_PROLOG
                           Path name of the user prolog program.
                              Syntax:
                              JOB_USER_PROLOG = program name
                              Default value: No default value is set.
                              For more information related to using this keyword, see “Writing prolog and
                              epilog programs” on page 77.
                        KBDD
                          Location of kbdd executable (LoadL_kbdd).
                              Syntax:
                              KBDD = directory
                              Default value: $(BIN)/LoadL_kbdd
                        KBDD_COREDUMP_DIR
                          Local directory for storing LoadL_kbdd daemon core dump files.
                              Syntax:
                              KBDD_COREDUMP_DIR = directory
                              Default value: The /tmp directory.
                              For more information related to using this keyword, see “Specifying file and
                              directory locations” on page 47.
                        KILL
                           Determines whether vacated jobs should be sent the SIGKILL signal and
                           replaced in the queue. It is used to remove a job that is taking too long
                           to vacate.
                              Syntax:
                              KILL: expression that evaluates to T or F (true or false)

                              When T, vacated LoadLeveler jobs are removed from the machine with no
                              attempt to take checkpoints.

                              For information about time-related variables that you may use for this
                              keyword, see “Variables to use for setting times” on page 320.
                        LIB
                              Defines the directory where LoadLeveler libraries are kept.
                              Syntax:
                              LIB = directory
                              Default value: $(RELEASEDIR)/lib

LL_RSH_COMMAND
   Specifies an administrator-provided executable to be used by llctl start
   when starting LoadLeveler on remote machines listed in the administration
   file. The LL_RSH_COMMAND keyword is any executable that can be used as
   a substitute for /usr/bin/rsh. The llctl start command passes arguments to
   the executable specified by LL_RSH_COMMAND in the following format:
   LL_RSH_COMMAND hostname -n llctl start options

   Syntax:
   LL_RSH_COMMAND = full_path_to_executable
    Default value: /usr/bin/rsh. This keyword must specify the full path name of
    the executable provided. If no value is specified, LoadLeveler will use
    /usr/bin/rsh as the default when issuing a start. If an error occurs while
    locating the specified executable, an error message is displayed.
   Example: This example shows that using the secure shell (/usr/bin/ssh) is the
   preferred method for the llctl start command to communicate with remote
   nodes. Specify the following in the configuration file:
   LL_RSH_COMMAND=/usr/bin/ssh
LOADL_ADMIN
  Specifies a list of LoadLeveler administrators.
   Syntax:
   LOADL_ADMIN = list of user names

   Where list of user names is a blank-delimited list of those individuals who will
   have administrative authority. These users are able to invoke the
   administrator-only commands such as llctl, llfavorjob, and llfavoruser. These
   administrators can also invoke the administrator-only GUI functions. For more
   information, see Chapter 7, “Using LoadLeveler’s GUI to perform
   administrator tasks,” on page 169.
   Default value: No default value is set, which means no one has administrator
   authority until this keyword is defined with one or more user names.
   Example: To grant administrative authority to users bob and mary, enter the
   following in the configuration file:
   LOADL_ADMIN = bob mary
   For more information related to using this keyword, see “Defining LoadLeveler
   administrators” on page 43.
LOCAL_CONFIG
  Specifies the path name of the optional local configuration file containing
  information specific to a node in the LoadLeveler network.
   Syntax:
    LOCAL_CONFIG = file name
   Default value: No default value is set.
   Examples:
   v If you are using a distributed file system like NFS, some examples are:
      LOCAL_CONFIG = $(tilde)/$(host).LoadL_config.local
      LOCAL_CONFIG = $(tilde)/LoadL_config.$(host).$(domain)
      LOCAL_CONFIG = $(tilde)/LoadL_config.local.$(hostname)




See “LoadLeveler variables” on page 314 for information about the tilde,
                                  host, and domain variables.
                                v If you are using a local file system, an example is:
                                   LOCAL_CONFIG = /var/LoadL/LoadL_config.local
                            LOG
                              Defines the local directory to store log files. It is not necessary to keep all the
                              log files created by the various LoadLeveler daemons and programs in one
                              directory, but you will probably find it convenient to do so.
                                Syntax:
                                LOG = local directory/log
                                Default value: $(tilde)/log
                            LOG_MESSAGE_THRESHOLD
                              Specifies the maximum amount of memory, in bytes, for the message queue.
                              Messages in the queue are waiting to be written to the log file. When the
                              message logging thread cannot write messages to the log file as fast as they
                              arrive, the memory consumed by the message queue can exceed the threshold.
                               In this case, LoadLeveler curtails logging by turning off all debug flags
                               except D_ALWAYS, thereby reducing the amount of logging that takes place.
                               If the curtailed message queue still exceeds the threshold, message logging
                               is stopped. Special log messages are written to the log file to indicate that
                               some messages are missing, and mail is also sent to the administrator. A
                               value of -1 for this keyword turns off the buffer threshold, meaning that the
                               threshold is unlimited.
                                Syntax:
                                LOG_MESSAGE_THRESHOLD = bytes
                                Default value: 20*1024*1024 (bytes)
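                                 Example: As an illustration, to allow the message queue to grow to 50
                                 megabytes before logging is curtailed, you could specify:
                                 LOG_MESSAGE_THRESHOLD = 50*1024*1024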
                            MACHINE_AUTHENTICATE
                              Specifies whether machine validation is performed. When set to true,
                              LoadLeveler only accepts connections from machines specified in the
                              administration file. When set to false, LoadLeveler accepts connections from
                              any machine.
                                When set to true, every communication between LoadLeveler processes will
                                verify that the sending process is running on a machine that is identified by
                                a machine stanza in the administration file. The validation is done by
                                capturing the address of the sending machine when the accept function call is
                                issued to accept a connection. The gethostbyaddr function is called to translate
                                the address to a name, and the name is matched with the list derived from the
                                administration file.

|                               Note: You must not set the MACHINE_AUTHENTICATE keyword to true for
|                                     a cluster which is configured to be a main scale-across cluster. The main
|                                     scale-across cluster must permit communication with LoadLeveler
|                                     daemons running on any machine in any cluster participating in the
|                                     scale-across multicluster environment.
                                Syntax:
                                MACHINE_AUTHENTICATE = true | false
                                Default value: false
                                For more information related to using this keyword, see “Defining a
                                LoadLeveler cluster” on page 44.


MACHINE_UPDATE_INTERVAL
      Specifies the time, in seconds, during which machines must report to the
      central manager.
       Syntax:
       MACHINE_UPDATE_INTERVAL = number

       Where number specifies the time period, in seconds, during which machines
       must report to the central manager. Machines that do not report in this number
       of seconds are considered down. number must be a numerical value and cannot
       be an arithmetic expression.
       Default value: The default is 300 seconds.
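        Example: As an illustration, with the following setting, a machine that has
        not reported to the central manager for 10 minutes is considered down:
        MACHINE_UPDATE_INTERVAL = 600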
       For more information related to using this keyword, see “Setting negotiator
       characteristics and policies” on page 45.
    MACHPRIO
      Machine priority expression.
       Syntax:
       MACHPRIO = expression

       You can use the following LoadLeveler variables in the MACHPRIO
       expression:
       v LoadAvg
       v Connectivity
       v Cpus
       v Speed
       v Memory
       v VirtualMemory
       v Disk
       v CustomMetric
       v MasterMachPriority
       v ConsumableCpus
       v ConsumableMemory
       v ConsumableVirtualMemory
|      v ConsumableLargePageMemory
       v PagesFreed
       v PagesScanned
       v FreeRealMemory
       For detailed descriptions of these variables, see “LoadLeveler variables” on
       page 314.
       Default value: (0 - LoadAvg)
       Examples:
       v Example 1
         This example orders machines by the Berkeley one-minute load average.
          MACHPRIO : 0 - (LoadAvg)

          Therefore, if LoadAvg equals .7, this example would read:
          MACHPRIO : 0 - (.7)

          The MACHPRIO would evaluate to -.7.
       v Example 2



This example orders machines by the Berkeley one-minute load average
                               normalized for machine speed:
                               MACHPRIO : 0 - (1000 * (LoadAvg / (Cpus * Speed)))

                               Therefore, if LoadAvg equals .7, Cpus equals 1, and Speed equals 2, this
                               example would read:
                               MACHPRIO : 0 - (1000 * (.7 / (1 * 2)))

                               This example further evaluates to:
                               MACHPRIO : 0 - (350)

                               The MACHPRIO would evaluate to -350.
                               Notice that if the speed of the machine were increased to 3, the equation
                               would read:
                               MACHPRIO : 0 - (1000 * (.7 / (1 * 3)))

                              The MACHPRIO would evaluate to approximately -233. Therefore, as the
                              speed of the machine increases, the MACHPRIO also increases.
                            v Example 3
                              This example orders machines accounting for real memory and available
                              swap space (remembering that Memory is in Mbytes and VirtualMemory is
                              in Kbytes):
                               MACHPRIO : 0 - (10000 * (LoadAvg / (Cpus * Speed))) +
                               (10 * Memory) + (VirtualMemory / 1000)
                            v Example 4
                              This example sets a relative machine priority based on the value of the
                              CUSTOM_METRIC keyword.
                               MACHPRIO : CustomMetric
                               To do this, you must specify a value for the CUSTOM_METRIC keyword or
                               the CUSTOM_METRIC_COMMAND keyword in either the
                               LoadL_config.local file of a machine or in the global LoadL_config file. To
                               assign the same relative priority to all machines, specify the
                               CUSTOM_METRIC keyword in the global configuration file. For example:
                               CUSTOM_METRIC = 5
                              You can override this value for an individual machine by specifying a
                              different value in that machine’s LoadL_config.local file.
                            v Example 5
                              This example gives master nodes the highest priority:
                               MACHPRIO : (MasterMachPriority * 10000)
                            v Example 6
                               This example gives the highest priority to nodes with the highest
                               percentage of switch adapters with connectivity:
                               MACHPRIO : Connectivity
                            For more information related to using this keyword, see “Setting negotiator
                            characteristics and policies” on page 45.
                        MAIL
                          Name of a local mail program used to override default mail notification.
                            Syntax:
                            MAIL = program name
                            Default value: No default value is set.

For more information related to using this keyword, see “Using your own mail
   program” on page 81.
MASTER
  Location of the master executable (LoadL_master).
   Syntax:
   MASTER = directory
   Default value: $(BIN)/LoadL_master
   For more information related to using this keyword, see “How LoadLeveler
   daemons process jobs” on page 8.
MASTER_COREDUMP_DIR
  Local directory for storing LoadL_master core dump files.
   Syntax:
   MASTER_COREDUMP_DIR = directory
   Default value: The /tmp directory.
   For more information related to using this keyword, see “Specifying file and
   directory locations” on page 47.
MASTER_DGRAM_PORT
  The port number used when connecting to the daemon.
   Syntax:
   MASTER_DGRAM_PORT = port number
   Default value: The default is 9617.
   For more information related to using this keyword, see “Defining network
   characteristics” on page 47.
MASTER_STREAM_PORT
  Specifies the port number to be used when connecting to the daemon.
   Syntax:
   MASTER_STREAM_PORT = port number
   Default value: The default is 9616.
   For more information related to using this keyword, see “Defining network
   characteristics” on page 47.
MAX_CKPT_INTERVAL
  The maximum number of seconds between checkpoints for running jobs.
   Syntax:
   MAX_CKPT_INTERVAL = number
   Default value: 7200 (2 hours)
   For more information related to using this keyword, see “LoadLeveler support
   for checkpointing jobs” on page 139.
MAX_JOB_REJECT
  Determines the number of times a job is rejected before it is canceled or put in
  User Hold or System Hold status.
   Syntax:
   MAX_JOB_REJECT = number




number must be a numerical value and cannot be an arithmetic expression.
                                 MAX_JOB_REJECT may be set to unlimited rejects by specifying a value of -1.
                                Default value: The default value is 0, which indicates a rejected job will
                                immediately be canceled or placed on hold.
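                                 Example: As an illustration, to allow a job to be rejected up to five times
                                 before it is canceled or placed on hold, specify:
                                 MAX_JOB_REJECT = 5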
                                For related information, see the NEGOTIATOR_REJECT_DEFER keyword.
                            MAX_RESERVATIONS
                              Specifies the maximum number of reservations that this LoadLeveler cluster
                              can have. Only reservations in waiting and in use are counted toward this
                              limit; LoadLeveler does not count reservations that have already ended or are
                              in the process of being canceled.

                                Notes:
                                          1. Having too many reservations in a LoadLeveler cluster can have
                                             performance impacts. Administrators should select a suitable value
                                             for this keyword.
|                                         2. A recurring reservation only counts as one reservation towards the
|                                            MAX_RESERVATIONS limit regardless of the number of times that
|                                            the reservation recurs.
                                Syntax:
                                MAX_RESERVATIONS = number
                                The value for this keyword can be 0 or a positive integer.
                                Default value: The default is 10.
                            MAX_STARTERS
                              Specifies the maximum number of tasks that can run simultaneously on a
                              machine. In this case, a task can be a serial job step or a parallel task.
                              MAX_STARTERS defines the number of initiators on the machine (the number
                              of tasks that can be initiated from a startd).
                                Syntax:
                                MAX_STARTERS = number
                                Default value: If this keyword is not specified, the default is the number of
                                elements in the Class statement.
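                                 Example: As an illustration, to limit a machine to two simultaneous tasks
                                 regardless of its Class definitions, you could specify the following in that
                                 machine's local configuration file:
                                 MAX_STARTERS = 2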
                                For more information related to using this keyword, see “Specifying how many
                                jobs a machine can run” on page 55.
                            MAX_TOP_DOGS
                              Specifies the maximum total number of top dogs that the central manager
                              daemon will allocate. When scheduling jobs, after MAX_TOP_DOGS total top
                              dogs have been allocated, no more will be considered.
                                Syntax:
                                MAX_TOP_DOGS = k | 1

                                where: k is a non-negative integer specifying the global maximum top dogs
                                limit.
                                Default value: The default value is 1.
                                For more information related to using this keyword, see “Using the BACKFILL
                                scheduler” on page 110.
                            MIN_CKPT_INTERVAL
                              The minimum number of seconds between checkpoints for running jobs.


Syntax:
   MIN_CKPT_INTERVAL = number
   Default value: 900 (15 minutes)
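    Example: As an illustration, the following pair of settings keeps the time
    between checkpoints for running jobs between 10 minutes and one hour:
    MIN_CKPT_INTERVAL = 600
    MAX_CKPT_INTERVAL = 3600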
   For more information related to using this keyword, see “LoadLeveler support
   for checkpointing jobs” on page 139.
NEGOTIATOR
  Location of the negotiator executable (LoadL_negotiator).
   Syntax:
   NEGOTIATOR = directory
   Default value: $(BIN)/LoadL_negotiator
   For more information related to using this keyword, see “How LoadLeveler
   daemons process jobs” on page 8.
NEGOTIATOR_COREDUMP_DIR
  Local directory for storing LoadL_negotiator core dump files.
   Syntax:
   NEGOTIATOR_COREDUMP_DIR = directory
   Default value: The /tmp directory.
   For more information related to using this keyword, see “Specifying file and
   directory locations” on page 47.
NEGOTIATOR_CYCLE_DELAY
  Specifies the minimum time, in seconds, the negotiator delays between periods
  when it attempts to schedule jobs. This time is used by the negotiator daemon
  to respond to queries, reorder job queues, collect information about changes in
  the states of jobs, and so on. Delaying the scheduling of jobs might improve
  the overall performance of the negotiator by preventing it from spending
  excessive time attempting to schedule jobs.
   Syntax:
   NEGOTIATOR_CYCLE_DELAY = number

   number must be a numerical value and cannot be an arithmetic expression.
    Default value: The default is 0 seconds.
NEGOTIATOR_CYCLE_TIME_LIMIT
  Specifies the maximum amount of time, in seconds, that LoadLeveler will
  allow the negotiator to spend in one cycle trying to schedule jobs. The
  negotiator cycle will end, after the specified number of seconds, even if there
  are additional jobs waiting for dispatch. Jobs waiting for dispatch will be
  considered at the next negotiator cycle. The
  NEGOTIATOR_CYCLE_TIME_LIMIT keyword applies only to the BACKFILL
  scheduler.
   Syntax:
   NEGOTIATOR_CYCLE_TIME_LIMIT = number

   Where number must be a positive integer or zero and cannot be an arithmetic
   expression.
   Default value: If the keyword value is not specified or a value of zero is used,
   the negotiator cycle will be unlimited.


NEGOTIATOR_INTERVAL
                          The time interval, in seconds, at which the negotiator daemon updates the
                          status of jobs in the LoadLeveler cluster and negotiates with machines that are
                          available to run jobs.
                            Syntax:
                            NEGOTIATOR_INTERVAL = number

                            Where number specifies the interval, in seconds, at which the negotiator
                            daemon performs a “negotiation loop” during which it attempts to assign
                            available machines to waiting jobs. A negotiation loop also occurs whenever
                            job states or machine states change. number must be a numerical value and
                            cannot be an arithmetic expression.
                             When this keyword is set to zero, the central manager’s automatic scheduling
                             activity is disabled, and LoadLeveler will not attempt to schedule any
                            jobs unless instructed to do so through the llrunscheduler command or
                            ll_run_scheduler subroutine.
                            Default value: The default is 30 seconds.
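                             Example: As an illustration, to disable automatic scheduling so that jobs
                             are scheduled only through the llrunscheduler command or the
                             ll_run_scheduler subroutine, specify:
                             NEGOTIATOR_INTERVAL = 0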
                            For more information related to using this keyword, see “Controlling the
                            central manager scheduling cycle” on page 73.
                        NEGOTIATOR_LOADAVG_INCREMENT
                          Specifies the value the negotiator adds to the startd machine’s load average
                          whenever a job in the Pending state is queued on that machine. This value is
                          used to compensate for the increased load caused by starting another job.
                            Syntax:
                            NEGOTIATOR_LOADAVG_INCREMENT = number

                            number must be a numerical value and cannot be an arithmetic expression.
                            Default value: The default value is .5
                        NEGOTIATOR_PARALLEL_DEFER
                          Specifies the amount of time, in seconds, that defines how long a job stays out
                          of the queue after it fails to get the correct number of processors. This keyword
                          applies only to the default LoadLeveler scheduler. This keyword must be
                           greater than the NEGOTIATOR_INTERVAL value; if it is not, the default is
                          used.
                            Syntax:
                            NEGOTIATOR_PARALLEL_DEFER = number

                            number must be a numerical value and cannot be an arithmetic expression.
                            Default value: The default is NEGOTIATOR_INTERVAL multiplied by 5.
                        NEGOTIATOR_PARALLEL_HOLD
                          Specifies the amount of time, in seconds, that defines how long a job is given
                          to accumulate processors. This keyword applies only to the default
                          LoadLeveler scheduler. This keyword must be greater than the
                          NEGOTIATOR_INTERVAL value; if it is not, the default is used.
                            Syntax:
                            NEGOTIATOR_PARALLEL_HOLD = number

                            number must be a numerical value and cannot be an arithmetic expression.
                            Default value: The default is NEGOTIATOR_INTERVAL multiplied by 5.

NEGOTIATOR_RECALCULATE_SYSPRIO_INTERVAL
  Specifies the amount of time, in seconds, between calculation of the SYSPRIO
  values for waiting jobs. Recalculating the priority can be CPU-intensive;
  specifying low values for the
  NEGOTIATOR_RECALCULATE_SYSPRIO_INTERVAL keyword may lead to
  a heavy CPU load on the negotiator if a large number of jobs are running or
  waiting for resources. A value of 0 means the SYSPRIO values are not
  recalculated.
   You can use this keyword to base the order in which jobs are run on the
   current number of running, queued, or total jobs for a user or a group.
   Syntax:
   NEGOTIATOR_RECALCULATE_SYSPRIO_INTERVAL = number

   number must be a numerical value and cannot be an arithmetic expression.
   Default value: The default is 120 seconds.
NEGOTIATOR_REJECT_DEFER
  Specifies the amount of time in seconds the negotiator waits before it considers
  scheduling a job to a machine that recently rejected the job.
   Syntax:
   NEGOTIATOR_REJECT_DEFER = number

   number must be a numerical value and cannot be an arithmetic expression.
   Default value: The default is 120 seconds.
   For related information, see the MAX_JOB_REJECT keyword.
NEGOTIATOR_REMOVE_COMPLETED
  Specifies the amount of time, in seconds, that you want the negotiator to keep
  information regarding completed and removed jobs so that you can query this
  information using the llq command.
   Syntax:
   NEGOTIATOR_REMOVE_COMPLETED = number

   number must be a numerical value and cannot be an arithmetic expression.
   Default value: The default is 0 seconds.
NEGOTIATOR_RESCAN_QUEUE
   Specifies the amount of time, in seconds, that the negotiator waits before
   rescanning the job queue for bypassed jobs that could not run on certain
   machines because of conditions that may change over time. This keyword
   must be greater than the NEGOTIATOR_INTERVAL value; if it is not, the
   default is used.
   Syntax:
   NEGOTIATOR_RESCAN_QUEUE = number

   number must be a numerical value and cannot be an arithmetic expression.
   Default value: The default is 900 seconds.
NEGOTIATOR_STREAM_PORT
  Specifies the port number used when connecting to the daemon.
   Syntax:
   NEGOTIATOR_STREAM_PORT = port number

Default value: The default is 9614.
                            For more information related to using this keyword, see “Defining network
                            characteristics” on page 47.
                        OBITUARY_LOG_LENGTH
                            Specifies the number of lines from the end of a daemon’s log file that are
                            appended to the mail message. The master daemon mails this log excerpt to
                            the LoadLeveler administrators when one of the daemons dies.
                            Syntax:
                            OBITUARY_LOG_LENGTH = number

                            number must be a numerical value and cannot be an arithmetic expression.
                            Default value: The default is 25.
                        POLLING_FREQUENCY
                           Specifies the interval, in seconds, with which the startd daemon evaluates the
                           load on the local machine and decides whether to suspend, resume, or abort
                           jobs. This time is also the minimum interval at which the kbdd daemon reports
                           keyboard or mouse activity to the startd daemon.
                            Syntax:
                            POLLING_FREQUENCY = number

                            number must be a numerical value and cannot be an arithmetic expression.
                            Default value: The default is 5.
                        POLLS_PER_UPDATE
                            Specifies how often, in POLLING_FREQUENCY intervals, the startd daemon
                           updates the central manager. Due to the communication overhead, it is
                           impractical to do this with the frequency defined by the
                           POLLING_FREQUENCY keyword. Therefore, the startd daemon only updates
                           the central manager every nth (where n is the number specified for
                           POLLS_PER_UPDATE) local update. Change POLLS_PER_UPDATE when
                           changing the POLLING_FREQUENCY.
                            Syntax:
                            POLLS_PER_UPDATE = number

                            number must be a numerical value and cannot be an arithmetic expression.
                            Default value: The default is 24.
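                             Example: The effective update interval is POLLING_FREQUENCY
                             multiplied by POLLS_PER_UPDATE, so the defaults update the central
                             manager every 5 * 24 = 120 seconds. As an illustration, to update the
                             central manager every 60 seconds while still polling every 5 seconds,
                             specify:
                             POLLING_FREQUENCY = 5
                             POLLS_PER_UPDATE = 12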
                        PRESTARTED_STARTERS
                           Specifies how many prestarted starter processes LoadLeveler will maintain on
                           an execution node to manage jobs when they arrive. The startd daemon starts
                           the number of starter processes specified by this keyword. You may specify
                           this keyword in either the global or local configuration file.
                            Syntax:
                            PRESTARTED_STARTERS = number

                            number must be less than or equal to the value specified through the
                            MAX_STARTERS keyword. If the value of PRESTARTED_STARTERS specified
                            is greater than MAX_STARTERS, LoadLeveler records a warning message in
                            the startd log and assigns PRESTARTED_STARTERS the same value as
                            MAX_STARTERS.



If the value of PRESTARTED_STARTERS is zero, no starter processes will be
   started before jobs arrive on the execution node.
   Default value: The default is 1.
PREEMPT_CLASS
   Defines the preemption rule for a job class.
   Syntax: The following forms illustrate correct syntax.
   PREEMPT_CLASS[incoming_class] = ALL[:preempt_method] { outgoing_class1
   [outgoing_class2 ...] }
          Using this form, ALL indicates that job steps of incoming_class have
          priority and will not share nodes with job steps of outgoing_class1,
          outgoing_class2, or other outgoing classes. If a job step of the
          incoming_class is to be started on a set of nodes, all job steps of
          outgoing_class1, outgoing_class2, or other outgoing classes running on
          those nodes will be preempted.

           Note: The ALL preemption rule does not apply to Blue Gene jobs.
   PREEMPT_CLASS[incoming_class] = ENOUGH[:preempt_method] {
   outgoing_class1 [outgoing_class2 ...] }
          Using this form, ENOUGH indicates that job steps of incoming_class
          will share nodes with job steps of outgoing_class1, outgoing_class2, or
          other outgoing classes if there are sufficient resources. If a job step of
          the incoming_class is to be started on a set of nodes, one or more job
          steps of outgoing_class1, outgoing_class2, or other outgoing classes
          running on those nodes may be preempted to get needed resources.

   Combinations of these forms are also allowed.

   Note:
           1. The optional specification preempt_method indicates which method
              LoadLeveler is to use to preempt the jobs; this specification is valid
              only for the BACKFILL scheduler. Valid values for this specification
              in keyword syntax are the highlighted abbreviations in parentheses:
              v Remove (rm)
              v System hold (sh)
              v Suspend (su)
              v Vacate (vc)
              v User hold (uh)
                For more information about preemption methods, see “Steps for
                configuring a scheduler to preempt jobs” on page 130.
            2.   Using the "ALL" value in the PREEMPT_CLASS keyword places
                implied restrictions on when a job can start. See “Planning to
                preempt jobs” on page 128 for more information.
           3.   The incoming class is designated inside [ ] brackets.
           4.   Outgoing classes are designated inside { } curly braces.
            5.   The job classes on the right-hand (outgoing) side of the statement
                 must be different from the incoming class, or may be allclasses. If
                 the outgoing side is defined as allclasses, then all job classes are
                 preemptable with the exception of the incoming class specified
                 within brackets.




                                       6. A class name or allclasses should not be in both the ALL list and
                                          the ENOUGH list. If one is, the entire statement will be
                                          ignored. An example of this is:
                                         PREEMPT_CLASS[Class_A]=ALL{allclasses} ENOUGH {allclasses}
                                      7. If you use allclasses as an outgoing (preemptable) class, then no
                                         other class names should be listed at the right hand side as the
                                         entire statement will be ignored. An example of this is:
                                         PREEMPT_CLASS[Class_A]=ALL{Class_B} ENOUGH {allclasses}
                                      8. More than one ALL statement and more than one ENOUGH
                                         statement may appear at the right hand side. Multiple statements
                                         have a cumulative effect.
                                      9. Each ALL or ENOUGH statement can have multiple class names
                                         inside the curly braces. However, a blank space delimiter is
                                         required between each class name.
                                    10. Both the ALL and ENOUGH statements can include an optional
                                        specification indicating the method LoadLeveler will use to
                                        preempt the jobs. Valid values for this specification are listed in the
                                        description of the DEFAULT_PREEMPT_METHOD keyword. If a
                                        value is specified on the PREEMPT_CLASS ALL or ENOUGH
                                        statement, that value overrides the value set on the
                                        DEFAULT_PREEMPT_METHOD keyword, if any.
                                    11. ALL and ENOUGH may be in mixed cases.
                                    12. Spaces are allowed around the brackets and curly braces.
                                    13. PREEMPT_CLASS [allclasses] will be ignored.
                            Default value: No default value is set.
                            Examples:
                            PREEMPT_CLASS[Class_B]=ALL{Class_E Class_D} ENOUGH {Class_C}
                                   This indicates that all Class_E jobs and all Class_D jobs and enough
                                   Class_C jobs will be preempted to enable an incoming Class_B job to
                                   run.
                            PREEMPT_CLASS[Class_D]=ENOUGH:VC {Class_E}
                                   This indicates that zero, one, or more Class_E jobs will be preempted
                                   using the vacate method to enable an incoming Class_D job to run.
                        PREEMPTION_SUPPORT
                           For the BACKFILL or API schedulers only, specifies the level of preemption
                           support for a cluster.
                            Syntax:
                            PREEMPTION_SUPPORT= full | no_adapter | none
                            v When set to full, preemption is fully supported.
                            v When set to no_adapter, preemption is supported but the adapter resources
                              are not released by preemption.
                            v When set to none, preemption is not supported, and preemption requests
                              will be rejected.

                            Note:
                                    1. If the value of this keyword is set to any value other than none for
                                       the default scheduler, LoadLeveler will not start.




2. For the BACKFILL or API scheduler, when this keyword is set to full
             or no_adapter and preemption by the suspend method is required,
             the configuration keyword PROCESS_TRACKING must be set to
             true.
   Default value: The default value for all schedulers is none; if you want to
   enable preemption under these schedulers, you must set a value for this
   keyword.
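    Example: As an illustration, to enable full preemption support with the
    BACKFILL scheduler and allow preemption by the suspend method, you
    could specify:
    SCHEDULER_TYPE = BACKFILL
    PREEMPTION_SUPPORT = full
    PROCESS_TRACKING = TRUE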
PROCESS_TRACKING
    Specifies whether LoadLeveler will cancel any processes (throughout the
    entire cluster) left behind when a job terminates.
   Syntax:
   PROCESS_TRACKING = TRUE | FALSE

    When set to TRUE, this keyword ensures that when a job is terminated, no
    processes created by the job will continue running.

   Note: It is necessary to set this keyword to true to do preemption by the
         suspend method with the BACKFILL or API scheduler.
   Default value: FALSE
PROCESS_TRACKING_EXTENSION
   Specifies the directory containing the kernel module LoadL_pt_ke (AIX) or
   proctrk.ko (Linux).
   Syntax:
   PROCESS_TRACKING_EXTENSION = directory
   Default value: The directory $HOME/bin
   For more information related to using this keyword, see “Tracking job
   processes” on page 70.
PUBLISH_OBITUARIES
   Specifies whether or not the master daemon sends mail to the administrator
   when any daemon it manages ends abnormally. When set to true, this keyword
   specifies that the master daemon sends mail to the administrators identified by
   LOADL_ADMIN keyword.
   Syntax:
   PUBLISH_OBITUARIES = true | false
   Default value: true
REJECT_ON_RESTRICTED_LOGIN
   Specifies whether the user’s account status will be checked on every node
   where the job will be run by calling the AIX loginrestrictions function with the
   S_DIST_CLNT flag.
   Restriction: Login restriction checking is ignored by LoadLeveler for Linux.
   Login restriction checking includes:
   v Does the account still exist?
   v Is the account locked?
   v Has the account expired?
   v Do failed login attempts exceed the limit for this account?
   v Is login disabled via /etc/nologin?




If the AIX loginrestrictions function indicates a failure, then the user’s job will
                                 be rejected and processed according to the LoadLeveler configuration
                                 parameters MAX_JOB_REJECT and ACTION_ON_MAX_REJECT.
                                Syntax:
                                REJECT_ON_RESTRICTED_LOGIN = true | false
                                Default value: false
                            RELEASEDIR
                               Defines the directory where all the LoadLeveler software resides.
                                Syntax:
                                RELEASEDIR = release directory
                                Default value: $(RELEASEDIR)
                            RESERVATION_CAN_BE_EXCEEDED
                               Specifies whether LoadLeveler will schedule job steps that are bound to a
                               reservation when their end times (based on hard wall-clock limits) exceed the
                               reservation end time.
                                Syntax:
                                RESERVATION_CAN_BE_EXCEEDED = true | false
                                When this keyword is set to false, LoadLeveler schedules only those job steps
                                that will complete before the reservation ends. When set to true, LoadLeveler
                                schedules job steps to run under a reservation even if their end times are
                                expected to exceed the reservation end time. When the reservation ends,
                                however, the reserved nodes no longer belong to the reservation, and so these
                                nodes might not be available for the jobs to continue running. In this case,
                                LoadLeveler might preempt the running jobs.
                                Note that this keyword setting does not change the actual end time of the
                                reservation. It only affects how LoadLeveler manages job steps whose end
                                times exceed the end time of the reservation.
                                Default value: true
                            RESERVATION_HISTORY
                               Defines the name of a file that is to contain the local history of reservations.
                                Syntax:
                                RESERVATION_HISTORY = file name
|                               LoadLeveler appends a single line to the reservation history file for each
|                               completed occurrence of each reservation. For an example, see “Collecting
|                               accounting data for reservations” on page 63.
                                Default value: $(SPOOL)/reservation_history
                            RESERVATION_MIN_ADVANCE_TIME
                               Specifies the minimum time, in minutes, between the time at which a
                               reservation is created and the time at which the reservation is to start.
                                Syntax:
                                RESERVATION_MIN_ADVANCE_TIME = number of minutes

                                By default, the earliest time at which a reservation may start is the current time
                                plus the value set for the RESERVATION_SETUP_TIME keyword.
                                Default value: 0 (zero)



RESERVATION_PRIORITY
   Specifies whether LoadLeveler administrators may reserve nodes on which
   running jobs are expected to end after the reservation start time. This keyword
   value applies only for LoadLeveler administrators; other reservation owners do
   not have this capability.
   Syntax:
   RESERVATION_PRIORITY = NONE | HIGH
   When you set this keyword to HIGH, before activating the reservation,
   LoadLeveler preempts the job steps running on the reserved nodes (Blue Gene
   job steps are handled the same way). The only exceptions are non-preemptable
   jobs; LoadLeveler will not preempt those jobs because of any reservations.
   Default value: NONE
RESERVATION_SETUP_TIME
   Specifies how much time, in seconds, that LoadLeveler may use to prepare for
   a reservation before it is to start. The tasks that LoadLeveler performs during
   this time include checking and reporting node conditions, and preempting job
   steps still running on the reserved nodes.
   For a given reservation, LoadLeveler uses the RESERVATION_SETUP_TIME
   keyword value that is set at the time that the reservation is created, not
   whatever value might be set when the reservation starts. If the start time of the
   reservation is modified, however, LoadLeveler uses the
   RESERVATION_SETUP_TIME keyword value that is set at the time of the
   modification.
   Syntax:
   RESERVATION_SETUP_TIME = number of seconds
   Default value: 60
RESTARTS_PER_HOUR
   Specifies how many times the master daemon attempts to restart a daemon
   that dies abnormally. Because one or more of the daemons may be unable to
   run due to a permanent error, the master only attempts
    $(RESTARTS_PER_HOUR) restarts within a 60-minute period. Failing that, it
   sends mail to the administrators identified by the LOADL_ADMIN keyword
   and exits.
   Syntax:
   RESTARTS_PER_HOUR = number

   number must be a numerical value and cannot be an arithmetic expression.
   Default value: The default is 12.
RESUME_ON_SWITCH_TABLE_ERROR_CLEAR
    Specifies whether the startd that was drained when the switch table failed
    to unload will automatically resume once the unload errors are cleared.
   The unload error is considered cleared after LoadLeveler can successfully
   unload the switch table. For this keyword to work, the
   DRAIN_ON_SWITCH_TABLE_ERROR option in the configuration file must
   be turned on and not disabled. Flushing, suspending, or draining of a startd
   manually or automatically will disable this option until the startd is manually
   resumed.
   Syntax:
   RESUME_ON_SWITCH_TABLE_ERROR_CLEAR = true | false


Default value: false
                            RSET_SUPPORT
                               Indicates the level of RSet support present on a machine.
                                Syntax:
                                RSET_SUPPORT = option

                                The available options are:
                                RSET_MCM_AFFINITY
                                     Indicates that the machine can run jobs requesting MCM (memory or
                                     adapter) and processor (cache or SMT) affinity.
                                RSET_NONE
                                      Indicates that LoadLeveler RSet support is not available on the
                                      machine.
                                RSET_USER_DEFINED
                                      Indicates that the machine can be used for jobs with a user-created
                                      RSet in their job command file.
                                Default value: RSET_NONE
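                                 Example: As an illustration, to indicate that a machine can run jobs
                                 requesting MCM and processor affinity, specify:
                                 RSET_SUPPORT = RSET_MCM_AFFINITY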
                            SAVELOGS
                               Specifies the directory in which log files are archived.
                                Syntax:
                                SAVELOGS = directory

                                Where directory is the directory in which log files will be archived.
                                Default value: No default value is set.
                                For more information related to using this keyword, see “Configuring
                                recording activity and log files” on page 48.
                            SAVELOGS_COMPRESS_PROGRAM
                               Compresses logs after they are copied to the SAVELOGS directory. If not
                               specified, SAVELOGS are copied, but are not compressed.
                                Syntax:
                                SAVELOGS_COMPRESS_PROGRAM = program

                                Where program is a specific executable program. It can be a system-provided
                                 facility (such as /bin/gzip) or an administrator-provided executable program.
                                The value must be a full path name and can contain command-line arguments.
                                LoadLeveler will call the program as: program filename.
                                Default value: If blank, the logs are not compressed.
                                Example: In this example, LoadLeveler will run the gzip -f command. The log
                                file in SAVELOGS will be compressed after it is copied to SAVELOGS. If the
                                program cannot be found or is not executable, LoadLeveler will log the error
                                and SAVELOGS will remain uncompressed.
                                SAVELOGS_COMPRESS_PROGRAM = /bin/gzip -f
|                           SCALE_ACROSS_SCHEDULING_TIMEOUT
|                              Defines the amount of time a central manager will wait:
|                              v For the main cluster central manager, this value defines the wait time for
|                                responses from the non-main cluster central managers when it is scheduling
|                                scale-across jobs.


|      v For the non-main cluster central managers, this value limits how long the
|        central manager on each non-main cluster will hold resources for a
|        scale-across job step while waiting for an order to start the job.
|      Syntax:
|      SCALE_ACROSS_SCHEDULING_TIMEOUT = number
|      Default value: 300 seconds
    SCHEDD
       Location of the Schedd executable (LoadL_schedd).
       Syntax:
       SCHEDD = directory
       Default value: $(BIN)/LoadL_schedd
       For more information related to using this keyword, see “How LoadLeveler
       daemons process jobs” on page 8.
    SCHEDD_COREDUMP_DIR
       Specifies the local directory for storing LoadL_schedd core dump files.
       Syntax:
       SCHEDD_COREDUMP_DIR = directory
       Default value: The /tmp directory.
       For more information related to using this keyword, see “Specifying file and
       directory locations” on page 47.
    SCHEDD_INTERVAL
       Specifies the interval, in seconds, at which the Schedd daemon checks the local
       job queue and updates the negotiator daemon.
       Syntax:
       SCHEDD_INTERVAL = number

       number must be a numerical value and cannot be an arithmetic expression.
       Default value: The default is 60 seconds.
    SCHEDD_RUNS_HERE
       Specifies whether the Schedd daemon runs on the host. If you do not want to
       run the Schedd daemon, specify false.
       This keyword does not designate a machine as a public scheduling machine.
       Unless configured as a public scheduling machine, a machine configured to
       run the Schedd daemon will only accept job submissions from the same
       machine running the Schedd daemon. A public scheduling machine accepts job
       submissions from other machines in the LoadLeveler cluster. To configure a
       machine as a public scheduling machine, see the schedd_host keyword
       description in “Administration file keyword descriptions” on page 327.
       Syntax:
       SCHEDD_RUNS_HERE = true | false
       Default value: true
    SCHEDD_SUBMIT_AFFINITY
       Specifies whether job submissions are directed to a locally running Schedd
       daemon. When the keyword is set to true, job submissions are directed to a
       Schedd daemon running on the same machine where the submission takes
       place, provided there is a Schedd daemon running on that machine. In this

case the submission is said to have "affinity" for the local Schedd daemon. If
                                there is no Schedd daemon running on the machine where the submission
                                takes place, or if this keyword is set to false, the job submission will only be
                                directed to a Schedd daemon serving as a public scheduling machine. In this
                                case, if there are no public scheduling machines configured the job cannot be
                                submitted. A public scheduling machine accepts job submissions from other
                                machines in the LoadLeveler cluster. To configure a machine as a public
                                scheduling machine, see the schedd_host keyword description in
                                “Administration file keyword descriptions” on page 327.
                                Installations with a large number of nodes should consider setting this
                                keyword to false to more evenly distribute dispatching of jobs among the
                                Schedd daemons. For more information, see “Scaling considerations” on page
                                719.
                                Syntax:
                                SCHEDD_SUBMIT_AFFINITY = true | false
                                Default value: true
                            SCHEDD_STATUS_PORT
                               Specifies the port number used when connecting to the daemon.
                                Syntax:
                                SCHEDD_STATUS_PORT = port number
                                Default value: The default is 9606.
                                For more information related to using this keyword, see “Defining network
                                characteristics” on page 47.
                            SCHEDD_STREAM_PORT
                               Specifies the port number used when connecting to the daemon.
                                Syntax:
                                SCHEDD_STREAM_PORT = port number
                                Default value: The default is 9605.
                                For more information related to using this keyword, see “Defining network
                                characteristics” on page 47.
                            SCHEDULE_BY_RESOURCES
                               Specifies which consumable resources are considered by the LoadLeveler
                               schedulers. Each consumable resource name may be an administrator-defined
                               alphanumeric string, or may be one of the following predefined resources:
                               v ConsumableCpus
                               v ConsumableMemory
                               v ConsumableVirtualMemory
|                              v ConsumableLargePageMemory
                               v RDMA
                                Each string may only appear in the list once. These resources are either floating
                                resources, or machine resources. If any resource is specified incorrectly with
                                the SCHEDULE_BY_RESOURCES keyword, then all scheduling resources will
                                be ignored.

                                Syntax:
                                SCHEDULE_BY_RESOURCES = name name ... name
                                Default value: No default value is set.
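                                 Example: As an illustration, to have the schedulers consider CPUs, real
                                 memory, and an administrator-defined floating resource (here given the
                                 hypothetical name licenses), specify:
                                 SCHEDULE_BY_RESOURCES = ConsumableCpus ConsumableMemory licenses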



SCHEDULER_TYPE
   Specifies the LoadLeveler scheduling algorithm:
   LL_DEFAULT
         Specifies the default LoadLeveler scheduling algorithm. If
         SCHEDULER_TYPE has not been defined, LoadLeveler will use the
         default scheduler (LL_DEFAULT).
   BACKFILL
        Specifies the LoadLeveler BACKFILL scheduler. When you specify this
        keyword, you should use only the default settings for the START
        expression and the other job control expressions described in
        “Managing job status through control expressions” on page 68.
   API       Specifies that you will use an external scheduler. External schedulers
             communicate to LoadLeveler through the job control API. For more
             information on setting an external scheduler, see “Using an external
             scheduler” on page 115.
   Syntax:
   SCHEDULER_TYPE = LL_DEFAULT | BACKFILL | API
   Default value: LL_DEFAULT

   Note:
           1. If a scheduler type is not set, LoadLeveler will start, but it will use
              the default scheduler.
           2. If you have set SCHEDULER_TYPE with an option that is not valid,
              LoadLeveler will not start.
           3. If you change the scheduler option specified by
              SCHEDULER_TYPE, you must stop and restart LoadLeveler using
              llctl or recycle using llctl.
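    Example: To select the BACKFILL scheduler, specify:
    SCHEDULER_TYPE = BACKFILL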
   For more information related to using this keyword, see “Defining a
   LoadLeveler cluster” on page 44.
SEC_ADMIN_GROUP
   When security services are enabled, this keyword points to the name of the
   UNIX group that contains the local identities of the LoadLeveler
   administrators.
   Restriction: CtSec security is not supported on LoadLeveler for Linux.
   Syntax:
   SEC_ADMIN_GROUP = name of lladmin group
   Default value: No default value is set.
   For more information related to using this keyword, see “Configuring
   LoadLeveler to use cluster security services” on page 57.
SEC_ENABLEMENT
   Specifies the security mechanism to be used.
   Restriction: Do not set this keyword to CtSec in the configuration file for a
   Linux machine. CtSec security is not supported on LoadLeveler for Linux.
   Syntax:
   SEC_ENABLEMENT = COMPAT | CTSEC
   Default value: No default value is set.


SEC_SERVICES_GROUP
                           When security services are enabled, this keyword specifies the name of the
                           LoadLeveler services group.
                            Restriction: CtSec security is not supported on LoadLeveler for Linux.
                            Syntax:
                            SEC_SERVICES_GROUP=group name

                            Where group name defines the identities of the LoadLeveler daemons.
                            Default value: No default value is set.
                        SEC_IMPOSED_MECHS
                           Specifies a blank-delimited list of LoadLeveler’s permitted security mechanisms
                           when Cluster Security (CtSec) services are enabled.
                            Restriction: CtSec security is not supported on LoadLeveler for Linux.
                            Syntax: Specify a blank delimited list containing combinations of the following
                            values:
                            none      If this is the only value specified, then users will run unauthenticated
                                      and, if authorization is necessary, the job will fail. If this is not the only
                                      value specified, then users may run unauthenticated and, if
                                      authorization is necessary, the job will fail.
                            unix      If this is the only value specified, then UNIX host-based authentication
                                      will be used; otherwise, other mechanisms may be used.
                            Default value: No default value is set.
                            Example:
                            SEC_IMPOSED_MECHS = none unix
                        SPOOL
                            Defines the local directory where LoadLeveler keeps the local job queue and
                            checkpoint files.
                            Syntax:
                            SPOOL = local directory/spool
                            Default value: $(tilde)/spool
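                             Example: A typical setting that places the spool directory on a local (not
                             shared) file system; the path shown is only illustrative:
                             SPOOL = /var/loadl/spool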
                        START
                           Determines whether a machine can run a LoadLeveler job.
                            Syntax:
                            START: expression that evaluates to T or F (true or false)

                            When the expression evaluates to T, LoadLeveler considers dispatching a job
                            to the machine. When you use a START expression that is based on the CPU
                            load average, the negotiator may evaluate the expression as F even though the
                            load average indicates the machine is Idle. This is because the negotiator adds
                            a compensating factor to the startd machine’s load average every time the
                            negotiator assigns a job. For more information, see the
                            NEGOTIATOR_INTERVAL keyword.
                            Default value: No default value is set, which means that no jobs will be
                            started.
                            For information about time-related variables that you may use for this
                            keyword, see “Variables to use for setting times” on page 320.
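                             Example: A minimal sketch that considers starting jobs outside of business
                             hours, or at any time when the load average is low; the hours and the
                             threshold are only illustrative:
                             START : (tm_hour < 8) || (tm_hour >= 18) || (LoadAvg <= 0.5)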


START_CLASS
   Specifies the rule for starting a job of the incoming_class. The START_CLASS
   rule is applied whenever the BACKFILL scheduler decides whether a job step
   of the incoming_class should start or not.
   Syntax:
   START_CLASS[incoming_class] = (start_class_expression) [ && (start_class_expression) ...]

   Where start_class_expression takes the form:
   run_class < number_of_tasks
          Which indicates that a job step of the incoming_class is only allowed to
          run on a node when the number of tasks of run_class running on that
          node is less than number_of_tasks.

   Note:
           1. START_CLASS [allclasses] will be ignored.
           2. The job class specified by run_class may be the same as or different
              from the class specified by incoming_class.
           3. You can also define run_class as allclasses. If you do, the total
              number of all job tasks running on that node cannot exceed the
              value specified by number_of_tasks.
            4. A class name or allclasses should not appear twice on the right-hand
               side of the keyword statement. However, you can use other class
               names with allclasses on the right-hand side of the statement.
           5. If there is more than one start_class_expression, you must use &&
              between adjacent start_class_expressions.
           6. Both the START keyword and the START_CLASS keyword have to
              be true before a new job can start.
            7. Parentheses ( ) are optional around start_class_expression.
   For information related to using this keyword, see “Planning to preempt jobs”
   on page 128.
   Default value: No default value is set.
   Examples:
   START_CLASS[Class_A] = (Class_A < 1)
         This statement indicates that a Class_A job can only start on nodes that
         do not have any Class_A jobs running.
   START_CLASS[Class_B] = allclasses < 5
          This statement indicates that a Class_B job can only start on nodes
          that are running fewer than five tasks of any class.
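    The following sketch combines two start_class_expressions with && (see
    note 5); the class names and limits are only illustrative:
    START_CLASS[Class_C] = (Class_A < 2) && (allclasses < 6)
          This statement indicates that a Class_C job can only start on nodes
          that are running fewer than two Class_A tasks and fewer than six
          tasks in total.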
START_DAEMONS
   Specifies whether to start the LoadLeveler daemons on the node.
   Syntax:
   START_DAEMONS = true | false
   Default value: true
   When true, the daemons are started. In most cases, you will probably want to
   set this keyword to true. An example of why this keyword would be set to
   false is if you want to run the daemons on most of the machines in the cluster
   but some individual users with their own local configuration files do not want
    their machines to run the daemons. The individual users would modify their
    local configuration files and set this keyword to false. Because the global
    configuration file has the keyword set to true, their individual machines would
    still be able to participate in the LoadLeveler cluster.
                            Also, to define the machine as strictly a submit-only machine, set this keyword
                            to false.
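    Example: A minimal local configuration file entry for a submit-only
    machine:
    START_DAEMONS = false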
                        STARTD
                           Location of the startd executable (LoadL_startd).
                            Syntax:
                            STARTD = directory
                            Default value: $(BIN)/LoadL_startd
                            For more information related to using this keyword, see “How LoadLeveler
                            daemons process jobs” on page 8.
                        STARTD_COREDUMP_DIR
                           Local directory for storing LoadL_startd core dump files.
                            Syntax:
                            STARTD_COREDUMP_DIR = directory
                            Default value: The /tmp directory.
                            For more information related to using this keyword, see “Specifying file and
                            directory locations” on page 47.
                        STARTD_DGRAM_PORT
   Specifies the datagram port number used when connecting to the startd daemon.
                            Syntax:
                            STARTD_DGRAM_PORT = port number
                             Default value: 9615
                            For more information related to using this keyword, see “Defining network
                            characteristics” on page 47.
                        STARTD_RUNS_HERE
                           Specifies whether the startd daemon runs on the host. If you do not want to
                           run the startd daemon, specify false.
                            Syntax:
                            STARTD_RUNS_HERE = true | false
                            Default value: true
                        STARTD_STREAM_PORT
   Specifies the stream port number used when connecting to the startd daemon.
                            Syntax:
                            STARTD_STREAM_PORT = port number
                             Default value: 9611
                            For more information related to using this keyword, see “Defining network
                            characteristics” on page 47.
                        STARTER
                           Location of the starter executable (LoadL_starter).
                            Syntax:
                            STARTER = directory


Default value: $(BIN)/LoadL_starter
    For more information related to using this keyword, see “How LoadLeveler
    daemons process jobs” on page 8.
STARTER_COREDUMP_DIR
   Local directory for storing LoadL_starter core dump files.
    Syntax:
    STARTER_COREDUMP_DIR = directory
    Default value: The /tmp directory.
    For more information related to using this keyword, see “Specifying file and
    directory locations” on page 47.
SUBMIT_FILTER
   Specifies the program you want to run to filter a job script when the job is
   submitted.
    Syntax:
    SUBMIT_FILTER = full_path_to_executable

    Where full_path_to_executable is called with the job command file as the
    standard input. The standard output is submitted to LoadLeveler. If the
    program returns with a nonzero exit code, the job submission is canceled. A
    submit filter can only make changes to LoadLeveler job command file keyword
    statements.
    Default value: No default value is set.
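     Example: A minimal sketch of a filter program; the path and the policy of
     forcing a default class are only illustrative:
     SUBMIT_FILTER = /usr/local/loadl/submit_filter

     where /usr/local/loadl/submit_filter contains:
     #!/bin/sh
     # Read the job command file from standard input; if it contains no
     # class keyword, insert a default class statement before the first
     # queue statement. The unchanged or modified file is written to
     # standard output, which LoadLeveler then submits.
     awk '
       /^#[ \t]*@[ \t]*class/ { seen = 1 }
       /^#[ \t]*@[ \t]*queue/ { if (!seen) { print "# @ class = normal"; seen = 1 } }
       { print }
     '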
     Multicluster use: In a multicluster environment, if you specified a valid cluster
     list with either the llsubmit -X option or the ll_cluster API, then the
     SUBMIT_FILTER will instead be invoked with a modified job command file
     that contains a cluster_list keyword generated from that option or API call.
    The modified job command file will contain an inserted # @ cluster_list =
    cluster statement just prior to the first # @ queue statement. This cluster_list
    statement takes precedence and overrides all previous specifications of any
    cluster_list statements from the original job command file.
    Example: SUBMIT_FILTER in a multicluster environment
    The following job command file, job.cmd, requests to be run remotely on
    cluster1:
    #!/bin/sh
    # @ cluster_list = cluster1
    # @ error = job1.$(Host).$(Cluster).$(Process).err
    # @ output = job1.$(Host).$(Cluster).$(Process).out
    # @ queue

    After issuing llsubmit -X cluster2 job.cmd, the modified job command file
    statements will be run on cluster2:
    #!/bin/sh
    # @ cluster_list = cluster1
    # @ error = job1.$(Host).$(Cluster).$(Process).err
    # @ output = job1.$(Host).$(Cluster).$(Process).out
    # @ cluster_list = cluster2
    # @ queue
    For more information related to using this keyword, see “Filtering a job script”
    on page 76.

SUSPEND
                               Determines whether running jobs should be suspended.
                                Syntax:
                                SUSPEND: expression that evaluates to T or F (true or false)

                                When T, LoadLeveler temporarily suspends jobs currently running on the
                                machine. Suspended LoadLeveler jobs will either be continued or vacated. This
                                keyword is not supported for parallel jobs.

                                Default value: No default value is set.
                                For information about time-related variables that you may use for this
                                keyword, see “Variables to use for setting times” on page 320.
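                                 Example: A minimal sketch that suspends jobs when someone is using the
                                 machine interactively, assuming the KeyboardIdle machine variable (the
                                 number of seconds since local keyboard or mouse activity); the threshold
                                 is only illustrative:
                                 SUSPEND : (KeyboardIdle < 60)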
                            SYSPRIO
                               System priority expression.
                                Syntax:
                                SYSPRIO : expression

                                You can use the following LoadLeveler variables to define the SYSPRIO
                                expression:
                                v ClassSysprio
                                v GroupQueuedJobs
                                v GroupRunningJobs
                                v GroupSysprio
                                v GroupTotalJobs
                                v GroupTotalShares
                                v GroupUsedBgShares
                                v GroupUsedShares
                                v JobIsBlueGene
                                v QDate
|                               v UserHoldTime
                                v UserPrio
                                v UserQueuedJobs
                                v UserRunningJobs
                                v UserSysprio
                                v UserTotalJobs
                                v UserTotalShares
                                v UserUsedBgShares
                                v UserUsedShares
                                For detailed descriptions of these variables, see “LoadLeveler variables” on
                                page 314.
                                Default value: 0 (zero)

                                Note:
                                        1. The SYSPRIO keyword is valid only on the machine where the
                                           central manager is running. Using this keyword in a local
                                           configuration file has no effect.
                                        2. It is recommended that you do not use UserPrio in the SYSPRIO
                                           expression, since user jobs are already ordered by UserPrio.
                                        3. The string SYSPRIO can be used as both the name of an expression
                                           (SYSPRIO: value) and the name of a variable (SYSPRIO = value).
                                           To specify the expression to be used to calculate job priority you
                                            must use the syntax for the SYSPRIO expression. If the variable is
                                            mistakenly used for the SYSPRIO expression, which requires a colon
                                            (:) after the name, the job priority value will always be 0 because the
                                            SYSPRIO expression has not been defined.
      4. When the UserRunningJobs, GroupRunningJobs, UserQueuedJobs,
         GroupQueuedJobs, UserTotalJobs, GroupTotalJobs,
         GroupTotalShares, GroupUsedShares, UserTotalShares,
         UserUsedShares, GroupUsedBgShares, JobIsBlueGene, and
         UserUsedBgShares variables are used to prioritize the queue based
         on current usage, you should also set
         NEGOTIATOR_RECALCULATE_SYSPRIO_INTERVAL so that the
         priorities are adjusted according to current usage rather than usage
         only at submission time.
Examples:
v Example 1
  This example creates a FIFO job queue based on submission time:
  SYSPRIO : 0 - (QDate)
v Example 2
  This example accounts for Class, User, and Group system priorities:
  SYSPRIO : (ClassSysprio * 100) + (UserSysprio * 10) + (GroupSysprio * 1) - (QDate)
v Example 3
  This example orders the queue based on the number of jobs a user is
  currently running. The user who has the fewest jobs running is first in the
  queue. You should set
  NEGOTIATOR_RECALCULATE_SYSPRIO_INTERVAL in conjunction with
  this SYSPRIO expression.
  SYSPRIO : 0 - UserRunningJobs
v Example 4
  This example shows one possible way to set up the SYSPRIO expression for
   fair share scheduling. For those jobs whose owner has no unused shares
   ($(UserHasShares) = 0), job priority depends only on QDate, making it a
   simple FIFO queue as in Example 1.
   For those jobs whose owner has unused shares ($(UserHasShares) = 1), job
   priority depends not only on QDate, but also on a uniform boost of
   31 536 000 (equivalent to the job being submitted one year earlier).
  These jobs still have priority differences because of submit time differences.
  It is like forming two priority tiers: the higher priority tier for jobs with
  unused shares and the lower priority tier for jobs without unused shares.
  SYSPRIO: 31536000 * $(UserHasShares) - QDate
v Example 5
  This example divides the jobs into three priority tiers:
  – Those jobs whose owner and group both have unused shares are at the
    top tier
  – Those jobs whose owner or group has unused shares are at the middle
    tier
  – Those jobs whose owner and group both have no shares remaining are at
    the bottom tier
  A user can submit two jobs to two different groups, the first job to a group
  with shares remaining and the second job to a group without any unused
  shares. If the user has unused shares, the first job will belong to the top tier
  and the second job will belong to the middle tier. If the user has no shares
   remaining, the first job will belong to the middle tier and the second job will
   belong to the bottom tier.
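   An expression consistent with these three tiers, following the same pattern
   as Example 4 (and assuming a $(GroupHasShares) variable is available
   alongside $(UserHasShares)), would be:
   SYSPRIO: 31536000 * $(UserHasShares) + 31536000 * $(GroupHasShares) - QDate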

                                               Chapter 12. Configuration file reference   309
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5
Ibm tivoli workload scheduler load leveler using and administering v3.5

More Related Content

PDF
Ibm tivoli storage manager v6.1 server upgrade guide
PDF
Ibm tivoli storage manager for databases data protection for oracle for unix ...
PDF
Ibm tivoli storage manager for databases data protection for oracle for unix ...
PDF
Ibm tivoli storage manager for unix and linux backup archive client installat...
PDF
Ibm tivoli storage manager for linux administrator's reference 6.1
PDF
Ibm tivoli storage manager v6.1 technical guide sg247718
PDF
Getting started with ibm tivoli workload scheduler v8.3 sg247237
PDF
Ibm tivoli workload scheduler for z os best practices end-to-end and mainfram...
Ibm tivoli storage manager v6.1 server upgrade guide
Ibm tivoli storage manager for databases data protection for oracle for unix ...
Ibm tivoli storage manager for databases data protection for oracle for unix ...
Ibm tivoli storage manager for unix and linux backup archive client installat...
Ibm tivoli storage manager for linux administrator's reference 6.1
Ibm tivoli storage manager v6.1 technical guide sg247718
Getting started with ibm tivoli workload scheduler v8.3 sg247237
Ibm tivoli workload scheduler for z os best practices end-to-end and mainfram...

What's hot (17)

PDF
Ibm tivoli storage manager for aix server installation guide version 6.1
PDF
Ibm tivoli storage manager in a clustered environment sg246679
PDF
Certification guide series ibm tivoli workload scheduler v8.4 sg247628
PDF
Ibm tivoli storage resource manager a practical introduction sg246886
PDF
Ibm tivoli directory server installation and configuration guide - sc272747
PDF
Deployment guide series tivoli continuous data protection for files sg247235
PDF
Integrating ibm tivoli workload scheduler and content manager on demand to pr...
PDF
Implementing ibm tivoli workload scheduler v 8.2 extended agent for ibm tivol...
PDF
Ibm total storage san file system sg247057
PDF
Cesvip 2010 first_linux_module
PDF
Backing up lotus domino r5 using tivoli storage management sg245247
PDF
End to-end scheduling with ibm tivoli workload scheduler version 8.2 sg246624
PDF
Administering maximo asset management
PDF
A practical guide to tivoli sa nergy sg246146
PDF
Integrating ibm db2 with the ibm system storage n series sg247329
PDF
Deployment guide series ibm tivoli monitoring 6.1 sg247188
PDF
Disaster recovery solutions for ibm total storage san file system sg247157
Ibm tivoli storage manager for aix server installation guide version 6.1
Ibm tivoli storage manager in a clustered environment sg246679
Certification guide series ibm tivoli workload scheduler v8.4 sg247628
Ibm tivoli storage resource manager a practical introduction sg246886
Ibm tivoli directory server installation and configuration guide - sc272747
Deployment guide series tivoli continuous data protection for files sg247235
Integrating ibm tivoli workload scheduler and content manager on demand to pr...
Implementing ibm tivoli workload scheduler v 8.2 extended agent for ibm tivol...
Ibm total storage san file system sg247057
Cesvip 2010 first_linux_module
Backing up lotus domino r5 using tivoli storage management sg245247
End to-end scheduling with ibm tivoli workload scheduler version 8.2 sg246624
Administering maximo asset management
A practical guide to tivoli sa nergy sg246146
Integrating ibm db2 with the ibm system storage n series sg247329
Deployment guide series ibm tivoli monitoring 6.1 sg247188
Disaster recovery solutions for ibm total storage san file system sg247157
Ad

Similar to Ibm tivoli workload scheduler load leveler using and administering v3.5 (20)

PDF
Ibm tivoli storage manager for databases data protection for oracle for windo...
PDF
Ibm tivoli storage manager v6.1 server upgrade guide
PDF
Contents

Figures . . . ix
Tables . . . xi
About this information . . . xiii
Summary of changes . . . xvii

Part 1. Overview of TWS LoadLeveler concepts and operation . . . 1
Chapter 1. What is LoadLeveler? . . . 3
Chapter 2. Getting a quick start using the default configuration . . . 29
Chapter 3. What operating systems are supported by LoadLeveler? . . . 35

Part 2. Configuring and managing the TWS LoadLeveler environment . . . 39
Chapter 4. Configuring the LoadLeveler environment . . . 41
Chapter 5. Defining LoadLeveler resources to administer . . . 83
Chapter 6. Performing additional administrator tasks . . . 103
Chapter 7. Using LoadLeveler’s GUI to perform administrator tasks . . . 169

Part 3. Submitting and managing TWS LoadLeveler jobs . . . 177
Chapter 8. Building and submitting jobs . . . 179
Chapter 9. Managing submitted jobs . . . 229
Chapter 10. Example: Using commands to build, submit, and manage jobs . . . 235
Chapter 11. Using LoadLeveler’s GUI to build, submit, and manage jobs . . . 237

Part 4. TWS LoadLeveler interfaces reference . . . 261
Chapter 12. Configuration file reference . . . 263
Chapter 13. Administration file reference . . . 321
Chapter 14. Job command file reference . . . 357
Chapter 15. Graphical user interface (GUI) reference . . . 403
Chapter 16. Commands . . . 411
Chapter 17. Application programming interfaces (APIs) . . . 541

Appendix A. Troubleshooting LoadLeveler . . . 699
Appendix B. Sample command output . . . 725
Appendix C. LoadLeveler port usage . . . 741
Accessibility features for TWS LoadLeveler . . . 743
Notices . . . 745
Glossary . . . 749
Index . . . 753
Figures

1. Example of a LoadLeveler cluster . . . 3
2. LoadLeveler job steps . . . 5
3. Multiple roles of machines . . . 7
4. High-level job flow . . . 16
5. Job is submitted to LoadLeveler . . . 17
6. LoadLeveler authorizes the job . . . 17
7. LoadLeveler prepares to run the job . . . 18
8. LoadLeveler starts the job . . . 18
9. LoadLeveler completes the job . . . 19
10. How control expressions affect jobs . . . 70
11. Format of a machine stanza . . . 86
12. Format of an adapter stanza . . . 88
13. Format of a class stanza . . . 93
14. Format of a user substanza . . . 95
15. Format of a user stanza . . . 98
16. Format of a group stanza . . . 99
17. Format of a cluster stanza . . . 100
18. Multicluster Example . . . 101
19. Job command file with multiple steps . . . 181
20. Job command file with multiple steps and one executable . . . 181
21. Job command file with varying input statements . . . 182
22. Using LoadLeveler variables in a job command file . . . 183
23. Job command file used as the executable . . . 185
24. Striping over multiple networks . . . 200
25. Striping over a single network . . . 202
26. POE job command file – multiple tasks per node . . . 207
27. POE sample job command file – invoking POE twice . . . 208
28. MPICH job command file - sample 1 . . . 208
29. MPICH job command file - sample 2 . . . 209
30. MPICH-GM job command file - sample 1 . . . 210
31. MPICH-GM job command file - sample 2 . . . 210
32. MVAPICH job command file - sample 1 . . . 211
33. MVAPICH job command file - sample 2 . . . 212
34. Using LOADL_PROCESSOR_LIST in a shell script . . . 213
35. Building a job command file . . . 235
36. LoadLeveler build a job window . . . 238
37. Format of administration file stanzas . . . 322
38. Format of administration file substanzas . . . 322
39. Sample administration file stanzas . . . 322
40. Sample administration file stanza with user substanzas . . . 323
41. Serial job command file . . . 358
42. Main window of the LoadLeveler GUI . . . 405
43. Creating a new pull-down menu . . . 409
44. TWS LoadLeveler Blue Gene object model . . . 562
45. TWS LoadLeveler Class object model . . . 563
46. TWS LoadLeveler Cluster object model . . . 563
47. TWS LoadLeveler Fairshare object model . . . 563
48. TWS LoadLeveler Job object model . . . 565
49. TWS LoadLeveler Machine object model . . . 566
50. TWS LoadLeveler MCluster object model . . . 566
51. TWS LoadLeveler Reservations object model . . . 566
52. TWS LoadLeveler Wlmstat object model . . . 567
53. When the primary central manager is unavailable . . . 709
54. Multiple central managers . . . 709
Tables

1. Summary of typographic conventions . . . xiv
2. Major topics in TWS LoadLeveler: Using and Administering . . . 1
3. Topics in the TWS LoadLeveler overview . . . 3
4. LoadLeveler daemons . . . 8
5. startd determines whether its own state permits a new job to run . . . 12
6. Job state descriptions and abbreviations . . . 20
7. Location and description of product directories following installation . . . 33
8. Location and description of directories for submit-only LoadLeveler . . . 33
9. Roadmap of tasks for TWS LoadLeveler administrators . . . 41
10. Roadmap of administrator tasks related to using or modifying the LoadLeveler configuration file . . . 42
11. Roadmap for defining LoadLeveler cluster characteristics . . . 44
12. Default locations for all of the files and directories . . . 47
13. Log control statements . . . 49
14. Roadmap of configuration tasks for securing LoadLeveler operations . . . 57
15. Roadmap of tasks for gathering job accounting data . . . 62
16. Collecting account data - modifying the configuration file . . . 67
17. Roadmap of administrator tasks accomplished through installation exits . . . 72
18. Roadmap of tasks for modifying the LoadLeveler administration file . . . 83
19. Types of limit keywords . . . 90
20. Enforcing job step limits . . . 91
21. Setting limits . . . 92
22. Roadmap of additional administrator tasks . . . 103
23. Roadmap of BACKFILL scheduler tasks . . . 111
24. Roadmap of tasks for using an external scheduler . . . 116
25. Effect of LoadLeveler keywords under an external scheduler . . . 116
26. Roadmap of tasks for using preemption . . . 127
27. Preemption methods for which LoadLeveler automatically resumes preempted jobs . . . 129
28. Preemption methods for which administrator or user intervention is required . . . 130
29. Roadmap of reservation tasks for administrators . . . 132
30. Roadmap of tasks for checkpointing jobs . . . 139
31. Deciding where to define the directory for staging executables . . . 141
32. Multicluster support subtasks and associated instructions . . . 149
33. Multicluster support related topics . . . 149
34. Subtasks for configuring a LoadLeveler multicluster . . . 150
35. Keywords for configuring scale-across scheduling . . . 154
36. IBM System Blue Gene Solution documentation . . . 156
37. Blue Gene subtasks and associated instructions . . . 157
38. Blue Gene related topics and associated information . . . 157
39. Blue Gene configuring subtasks and associated instructions . . . 157
40. Learning about building and submitting jobs . . . 179
41. Roadmap of user tasks for building and submitting jobs . . . 179
42. Standard files for the five job steps . . . 182
43. Checkpoint configurations . . . 191
44. Valid combinations of task assignment keywords are listed in each column . . . 196
45. node and total_tasks . . . 196
46. Blocking . . . 197
47. Unlimited blocking . . . 198
48. Roadmap of tasks for reservation owners and users . . . 213
49. Reservation states, abbreviations, and usage notes . . . 214
50. Instructions for submitting a job to run under a reservation . . . 219
51. Submitting and monitoring jobs in a LoadLeveler multicluster . . . 224
52. Roadmap of user tasks for managing submitted jobs . . . 229
53. How LoadLeveler handles job priorities . . . 231
54. User tasks available through the GUI . . . 237
55. GUI fields and input . . . 239
56. Nodes dialog box . . . 243
57. Network dialog box fields . . . 244
58. Build a job dialog box fields . . . 245
59. Limits dialog box fields . . . 247
60. Checkpointing dialog box fields . . . 248
61. Blue Gene job fields . . . 248
62. Modifying the job command file with the Edit pull-down menu . . . 249
63. Modifying the job command file with the Tools pull-down menu . . . 250
64. Saving and submitting information . . . 250
65. Sorting the jobs window . . . 252
66. Sorting the machines window . . . 257
67. Specifying which jobs appear in the Jobs window . . . 258
68. Specifying which machines appear in Machines window . . . 259
69. Configuration subtasks . . . 263
70. BG_MIN_PARTITION_SIZE values . . . 268
71. Administration file subtasks . . . 321
72. Notes on 64-bit support for administration file keywords . . . 325
73. Summary of possible values set for the env_copy keyword in the administration file . . . 335
74. Sample user and group settings for the max_reservations keyword . . . 345
75. Job command file subtasks . . . 357
76. Notes on 64-bit support for job command file keywords . . . 358
77. mcm_affinity_options default values . . . 381
78. Example of a selection table . . . 406
79. Decision table . . . 407
80. Decision table actions . . . 407
81. Window identifiers in the Xloadl file . . . 408
82. Resource variables for all the windows and the buttons . . . 408
83. Modifying help panels . . . 410
84. LoadLeveler command summary . . . 411
85. llmodify options and keywords . . . 468
86. LoadLeveler API summary . . . 541
87. BLUE_GENE specifications for ll_get_data subroutine . . . 571
88. CLASSES specifications for ll_get_data subroutine . . . 576
89. CLUSTERS specifications for ll_get_data subroutine . . . 580
90. FAIRSHARE specifications for ll_get_data subroutine . . . 582
91. JOBS specifications for ll_get_data subroutine . . . 583
92. MACHINES specifications for ll_get_data subroutine . . . 614
93. MCLUSTERS specifications for ll_get_data subroutine . . . 619
94. RESERVATIONS specifications for ll_get_data subroutine . . . 620
95. WLMSTAT specifications for ll_get_data subroutine . . . 622
96. query_daemon summary . . . 624
97. query_flags summary . . . 630
98. object_filter value related to the query flags value . . . 631
99. enum LL_reservation_data type . . . 649
100. How nodes should be arranged in the node list . . . 694
101. Why your job might not be running . . . 700
102. Why your job might not be running . . . 703
103. Troubleshooting running jobs when a machine goes down . . . 706
104. LoadLeveler default port usage . . . 741
About this information

IBM® Tivoli® Workload Scheduler (TWS) LoadLeveler® provides various ways of scheduling and managing applications for best performance and most efficient use of resources. LoadLeveler manages both serial and parallel jobs over a cluster of machines or servers, which may be desktop workstations, dedicated servers, or parallel machines. This information describes how to configure and administer this LoadLeveler cluster environment, and how to submit and manage jobs that run on machines in the cluster.

Who should use this information

This information is intended for two separate audiences:
v Personnel who are responsible for installing, configuring, and managing the LoadLeveler cluster environment. These people are called LoadLeveler administrators. LoadLeveler administrative tasks include:
– Setting up configuration and administration files
– Maintaining the LoadLeveler product
– Setting up the distributed environment for allocating batch jobs
v Users who submit and manage serial and parallel jobs to run in the LoadLeveler cluster.

Both LoadLeveler administrators and general users should be experienced with UNIX® commands. Administrators also should be familiar with:
v Cluster system management techniques, such as SMIT as it is used in the AIX® environment
v Networking and NFS or AFS® protocols

Conventions and terminology used in this information

Throughout the TWS LoadLeveler product information:
v TWS LoadLeveler for Linux® Multiplatform includes:
– IBM System servers with Advanced Micro Devices (AMD) Opteron or Intel® Extended Memory 64 Technology (EM64T) processors
– IBM System x™ servers
– IBM BladeCenter® Intel processor-based servers
– IBM Cluster 1350™
Note: IBM Tivoli Workload Scheduler LoadLeveler is supported when running Linux on non-IBM Intel-based and AMD hardware servers. Supported hardware includes:
– Servers with Intel 32-bit and Intel EM64T
– Servers with AMD 64-bit technology
v Note that in this information:
– LoadLeveler is also referred to as Tivoli Workload Scheduler LoadLeveler and TWS LoadLeveler.
– Switch_Network_Interface_For_HPS is also referred to as HPS or High Performance Switch.
Table 1 describes the typographic conventions used in this information.

Table 1. Summary of typographic conventions

Bold
   Bold words or characters represent system elements that you must use literally, such as commands, flags, and path names. Bold words also indicate the first use of a term included in the glossary.
Italic
   Italic words or characters represent variable values that you must supply. Italics are also used for book titles and for general emphasis in text.
Constant width
   Examples and information that the system displays appear in constant width typeface.
[ ]
   Brackets enclose optional items in format and syntax descriptions.
{ }
   Braces enclose a list from which you must choose an item in format and syntax descriptions.
|
   A vertical bar separates items in a list of choices. (In other words, it means “or.”)
< >
   Angle brackets (less-than and greater-than) enclose the name of a key on the keyboard. For example, <Enter> refers to the key on your terminal or workstation that is labeled with the word Enter.
...
   An ellipsis indicates that you can repeat the preceding item one or more times.
<Ctrl-x>
   The notation <Ctrl-x> indicates a control character sequence. For example, <Ctrl-c> means that you hold down the control key while pressing <c>.
\
   The continuation character (\) is used in coding examples in this information for formatting purposes.

Prerequisite and related information

The Tivoli Workload Scheduler LoadLeveler publications are:
v Installation Guide, GI10-0763
v Using and Administering, SA22-7881
v Diagnosis and Messages Guide, GA22-7882

To access all TWS LoadLeveler documentation, refer to the IBM Cluster Information Center, which contains the most recent TWS LoadLeveler documentation in PDF and HTML formats. This Web site is located at:
http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/index.jsp

A TWS LoadLeveler Documentation Updates file also is maintained on this Web site. The TWS LoadLeveler Documentation Updates file contains updates to the TWS LoadLeveler documentation. These updates include documentation corrections and clarifications that were discovered after the TWS LoadLeveler books were published.

Both the current TWS LoadLeveler books and earlier versions of the library are also available in PDF format from the IBM Publications Center Web site located at:
http://www.elink.ibmlink.ibm.com/publications/servlet/pbi.wss

To easily locate a book in the IBM Publications Center, supply the book’s publication number. The publication number for each of the TWS LoadLeveler books is listed after the book title in the preceding list.
How to send your comments

Your feedback is important in helping us to produce accurate, high-quality information. If you have any comments about this book or any other TWS LoadLeveler documentation:
v Send your comments by e-mail to: mhvrcfs@us.ibm.com
Include the book title and order number, and, if applicable, the specific location of the information you have comments on (for example, a page number or a table number).
v Fill out one of the forms at the back of this book and return it by mail, by fax, or by giving it to an IBM representative.

To contact the IBM cluster development organization, send your comments by e-mail to: cluster@us.ibm.com
Summary of changes

The following sections summarize changes to the IBM Tivoli Workload Scheduler (TWS) LoadLeveler product and TWS LoadLeveler library for each new release or major service update for a given product version. Within each information unit in the library, a vertical line to the left of text and illustrations indicates technical changes or additions made to the previous edition of the information.

Changes to TWS LoadLeveler for this release or update include:

v New information:
– Recurring reservation support:
- The TWS LoadLeveler commands and APIs have been enhanced to support recurring reservations.
- Accounting records have been enhanced to include recurring reservation entries.
- The new recurring job command file keyword allows a user to specify that the job can run in every occurrence of the recurring reservation to which it is bound.
– Data staging support:
- Jobs can request that data files be staged from a remote storage location before the job executes and staged back to remote storage after it finishes execution.
- Data staging can be scheduled at submit time or just in time for the application execution.
– Multicluster scale-across scheduling support:
- Allows a large job to span resources across more than one cluster. Scale-across scheduling is a way to schedule jobs in the multicluster environment to span resources across more than one cluster. This feature allows large jobs that request more resources than any single cluster can provide to combine the resources from more than one cluster and run on the combined resources, effectively spanning resources across more than one cluster.
- Allows utilization of fragmented resources from more than one cluster. Fragmented resources occur when the resources available on a single cluster cannot satisfy any single job on that cluster. This feature allows any size job to take advantage of these resources by combining them from multiple clusters.
– Enhanced WLM support:
- Integrates TWS LoadLeveler with AIX Workload Manager (WLM) virtual memory and large page resource limit support.
- Enforces the virtual memory and large page limit usage of a job.
- Reports statistics for virtual memory and large page limit usage.
- Dynamically changes the virtual memory and large page limit usage of a job.
– Enhanced adapter striping (sn_all) support:
- Submits jobs to nodes that have one or more networks in the failed (NOTREADY) state, provided that all of the nodes assigned to the job have more than half of their networks in the READY state.
- A new striping_with_minimum_networks configuration keyword has been added to the class stanza to support striping with failed networks.
– Enhanced affinity support:
- Task affinity support has been enhanced on nodes that are booted in single threaded (ST) mode and on nodes that do not support simultaneous multithreading (SMT).
– NetworkID64 for Mellanox adapters on Linux systems with InfiniBand support:
- Generates unique NetworkID64 IDs for adapter ports that are connected to the same switch and have the same IP subnet address. This ensures that ports that are connected to the same switch, but are configured with different IP subnet addresses, will get different NetworkID64 values.

v Changed information:
– This is the last release that will provide the following functions:
- The Motif-based graphical user interface xloadl. The function available in xloadl has been frozen since TWS LoadLeveler 3.3.2 and there are no plans to update this GUI with any new function added to TWS LoadLeveler after that level.
- The IBM BladeCenter JS21 with a BladeCenter H chassis interconnected with the InfiniBand Host Channel Adapters connected to a Cisco InfiniBand SDR switch.
- The IBM Power System 575 (Model 9118-575) and IBM Power System 550 (Model 9133-55A) interconnected with the InfiniBand Host Channel Adapter and Cisco switch.
- The High Performance Switch.
– If you have a mixed TWS LoadLeveler cluster and need to run your job on a specific operating system or architecture, you must define the requirements keyword statement in your job command file, specifying the desired Arch or OpSys. For example:
   Requirements: (Arch == "RS6000") && (OpSys == "AIX53")

v Deleted information: The following functions are no longer supported and the information has been removed:
– The scheduling of parallel jobs with the default scheduler (SCHEDULER_TYPE=LL_DEFAULT)
– The min_processors and max_processors keywords
– The RSET_CONSUMABLE_CPUS option for the rset_support configuration keyword and the rset job command file keyword
– The API functions ll_get_nodes, ll_free_nodes, ll_get_jobs, ll_free_jobs, and ll_start_job
– Red Hat Enterprise Linux 3
– The llctl purgeschedd function, which has been replaced by the llmovespool function
– The lldbconvert function, which is no longer needed for migration; the lldbconvert command is not included in TWS LoadLeveler 3.5
Part 1. Overview of TWS LoadLeveler concepts and operation

Setting up IBM Tivoli Workload Scheduler (TWS) LoadLeveler involves defining machines, users, jobs, and how they interact, in such a way that TWS LoadLeveler is able to run jobs quickly and efficiently.

Once you have a basic understanding of the TWS LoadLeveler product and its interfaces, you can find more details in the topics listed in Table 2.

Table 2. Major topics in TWS LoadLeveler: Using and Administering
v Performing administrator tasks: see Part 2, “Configuring and managing the TWS LoadLeveler environment,” on page 39
v Performing general user tasks: see Part 3, “Submitting and managing TWS LoadLeveler jobs,” on page 177
v Using TWS LoadLeveler interfaces: see Part 4, “TWS LoadLeveler interfaces reference,” on page 261
Chapter 1. What is LoadLeveler?

LoadLeveler is a job management system that allows users to run more jobs in less time by matching the jobs’ processing needs with the available resources. LoadLeveler schedules jobs, and provides functions for building, submitting, and processing jobs quickly and efficiently in a dynamic environment.

Figure 1 shows the different environments to which LoadLeveler can schedule jobs. Together, these environments comprise the LoadLeveler cluster.

Figure 1. Example of a LoadLeveler cluster. The original figure shows a cluster made up of IBM Power Systems machines running AIX, an IBM eServer Cluster 1350 and IBM BladeCenter machines running Linux, and submit-only workstations.

As Figure 1 also illustrates, a LoadLeveler cluster can include submit-only machines, which allow users to have access to a limited number of LoadLeveler features.

Throughout all the topics, the terms workstation, machine, node, and operating system instance (OSI) refer to the machines in your cluster. In LoadLeveler, an OSI is treated as a single instance of an operating system image.

If you are unfamiliar with the TWS LoadLeveler product, consider reading one or more of the introductory topics listed in Table 3.

Table 3. Topics in the TWS LoadLeveler overview
v Using the default configuration for getting a quick start: see Chapter 2, “Getting a quick start using the default configuration,” on page 29
v Specific products and features that are required for or available through the TWS LoadLeveler environment: see Chapter 3, “What operating systems are supported by LoadLeveler?,” on page 35
LoadLeveler basics

LoadLeveler has various types of interfaces that enable users to create and submit jobs and allow system administrators to configure the system and control running jobs. These interfaces include:
v Control files that define the elements, characteristics, and policies of LoadLeveler and the jobs it manages. These files are the configuration file, the administration file, and the job command file.
v The command line interface, which gives you access to basic job and administrative functions.
v A graphical user interface (GUI), which provides system access similar to the command line interface. Experienced users and administrators may find the command line interface more efficient than the GUI for job and administrative functions.
v An application programming interface (API), which allows application programs written by users and administrators to interact with the LoadLeveler environment.

The commands, GUI, and APIs permit different levels of access to administrators and users. User access is typically restricted to submitting and managing individual jobs, while administrative access allows setting up system configurations, job scheduling, and accounting.

Using either the command line or the GUI, users create job command files that instruct the system on how to process information. Each job command file consists of keywords followed by the user-defined association for that keyword. For example, the keyword executable tells LoadLeveler that you are about to define the name of a program you want to run. Therefore, executable = longjob tells LoadLeveler to run the program called longjob.

After creating the job command file, you invoke LoadLeveler commands to monitor and control the job as it moves through the system. LoadLeveler monitors each job as it moves through the system using process control daemons. However, the administrator maintains ultimate control over all LoadLeveler jobs by defining job classes that control how and when LoadLeveler will run a job.

In addition to setting up job classes, the administrator can also control how jobs move through the system by specifying the type of scheduler. LoadLeveler has several different scheduler options that start jobs using specific algorithms to balance job priority with available machine resources.

When LoadLeveler administrators are configuring clusters and when users are planning jobs, they need to be aware of the machine resources available in the cluster. These resources include items like the number of CPUs and the amount of memory available for each job. Because resource availability will vary over time, LoadLeveler defines them as consumable resources.
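As an illustration of the keyword statements described earlier in this topic, a complete job command file for the longjob program can be as small as the following sketch. The output and error file names here are illustrative choices, not taken from this book; $(jobid) is a job command file variable that LoadLeveler replaces with the job identifier, and the queue statement marks the end of the step definition:

   # @ executable = longjob
   # @ output     = longjob.$(jobid).out
   # @ error      = longjob.$(jobid).err
   # @ queue

You would then submit this file with the llsubmit command, which is described in Chapter 16, “Commands.”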
LoadLeveler: A network job management and scheduling system

A network job management and job scheduling system, such as LoadLeveler, is a software program that schedules and manages jobs that you submit to one or more machines under its control. LoadLeveler accepts jobs that users submit and reviews the job requirements. LoadLeveler then examines the machines under its control to determine which machines are best suited to run each job.

Job definition

LoadLeveler schedules your jobs on one or more machines for processing. The definition of a job, in this context, is a set of job steps. For each job step, you can specify a different executable (the executable is the part of the job that gets processed). You can use LoadLeveler to submit jobs which are made up of one or more job steps, where each job step depends upon the completion status of a previous job step. For example, Figure 2 illustrates a stream of job steps.

Figure 2. LoadLeveler job steps. The original figure shows a job command file with three steps: job step 1 copies data from tape and checks its exit status; on success, job step 2 processes the data and checks its exit status; on success, job step 3 formats and prints the results. A failing exit status at either checkpoint ends the program.

Each of these job steps is defined in a single job command file. A job command file specifies the name of the job, as well as the job steps that you want to submit, and can contain other LoadLeveler statements.

LoadLeveler tries to execute each of your job steps on a machine that has enough resources to support executing and checkpointing each step. If your job command file has multiple job steps, the job steps will not necessarily run on the same machine, unless you explicitly request that they do.

You can submit batch jobs to LoadLeveler for scheduling. Batch jobs run in the background and generally do not require any input from the user. Batch jobs can either be serial or parallel. A serial job runs on a single machine. A parallel job is a program designed to execute as a number of individual, but related, processes on one or more of your system’s nodes. When executed, these related processes can communicate with each other (through message passing or shared memory) to exchange data or synchronize their execution. For parallel jobs, LoadLeveler interacts with Parallel Operating Environment (POE) to allocate nodes, assign tasks to nodes, and launch tasks.
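The job stream in Figure 2 could be written as a single job command file with three steps, using the step_name and dependency keywords so that each step runs only if the previous step exits with status 0. The step and program names in this sketch are illustrative, not taken from this book:

   # @ step_name  = copy_data
   # @ executable = copy_from_tape
   # @ queue
   # @ step_name  = process_data
   # @ dependency = (copy_data == 0)
   # @ executable = process
   # @ queue
   # @ step_name  = format_results
   # @ dependency = (process_data == 0)
   # @ executable = format_and_print
   # @ queue

If copy_data ends with a nonzero exit status, the dependency for process_data is not satisfied and the remaining steps do not run, which corresponds to the “End program” branches in the figure.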
Machine definition

For LoadLeveler to schedule a job on a machine, the machine must be a valid member of the LoadLeveler cluster. A cluster is the combination of all of the different types of machines that use LoadLeveler.

To make a machine a member of the LoadLeveler cluster, the administrator has to install the LoadLeveler software onto the machine and identify the central manager (described in "Roles of machines"). Once a machine becomes a valid member of the cluster, LoadLeveler can schedule jobs to it.

Roles of machines

Each machine in the LoadLeveler cluster performs one or more of the following roles in scheduling jobs:
v Scheduling Machine: When a job is submitted, it gets placed in a queue managed by a scheduling machine. This machine contacts another machine that serves as the central manager for the entire LoadLeveler cluster. The scheduling machine asks the central manager to find a machine that can run the job, and also keeps persistent information about the job. Some scheduling machines are known as public scheduling machines, meaning that any LoadLeveler user can access them. These machines schedule jobs submitted from submit-only machines.
v Central Manager Machine: The role of the central manager is to examine the job's requirements and find one or more machines in the LoadLeveler cluster that will run the job. Once it finds the machine(s), it notifies the scheduling machine.
v Executing Machine: The machine that runs the job is known as the executing machine.
v Submitting Machine: This type of machine is known as a submit-only machine. It participates in the LoadLeveler cluster on a limited basis. Although the name implies that users of these machines can only submit jobs, they can also query and cancel jobs. Users of these machines also have their own graphical user interface (GUI) that provides them with the submit-only subset of functions. The submit-only machine feature allows workstations that are not part of the LoadLeveler cluster to submit jobs to the cluster.

Keep in mind that one machine can assume multiple roles, as shown in Figure 3 on page 7.
Figure 3. Multiple roles of machines

Machine availability

There may be times when some of the machines in the LoadLeveler cluster are not available to process jobs; for instance, when the owners of the machines have decided to make them unavailable. This ability of LoadLeveler to allow users to restrict the use of their machines provides flexibility and control over the resources.

Machine owners can make their personal workstations available to other LoadLeveler users in several ways. For example, you can specify that:
v The machine will always be available
v The machine will be available only between certain hours
v The machine will be available when the keyboard and mouse are not being used interactively.
Owners can also specify that their personal workstations never be made available to other LoadLeveler users. A sketch of how such a policy can be expressed follows.
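Availability policies like these are expressed with control expressions in the configuration file. The following is a minimal sketch, assuming illustrative idle-time thresholds; an actual installation's expressions depend on local policy (see "Managing job status through control expressions" on page 68):

   # Local configuration sketch for a personal workstation (illustrative values):
   # run jobs only after 15 minutes of keyboard/mouse idle time,
   # and suspend them as soon as the owner returns.
   MINUTE   = 60
   START    : (KeyboardIdle > (15 * $(MINUTE)))
   SUSPEND  : (KeyboardIdle < $(MINUTE))
   CONTINUE : (KeyboardIdle > (15 * $(MINUTE)))

Here MINUTE is a user-defined macro, and KeyboardIdle is one of the variables the startd daemon reports (see "How LoadLeveler daemons process jobs" later in this chapter).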
How LoadLeveler schedules jobs

When a user submits a job, LoadLeveler examines the job command file to determine what resources the job will need. LoadLeveler determines which machine, or group of machines, is best suited to provide these resources, then LoadLeveler dispatches the job to the appropriate machines. To aid this process, LoadLeveler uses queues. A job queue is a list of jobs that are waiting to be processed. When a user submits a job to LoadLeveler, the job is entered into an internal database, which resides on one of the machines in the LoadLeveler cluster, until it is ready to be dispatched to run on another machine.

Once LoadLeveler examines a job to determine its required resources, the job is dispatched to a machine to be processed. A job can be dispatched to either one machine, or in the case of parallel jobs, to multiple machines. Once the job reaches the executing machine, the job runs.

Jobs do not necessarily get dispatched to machines in the cluster on a first-come, first-served basis. Instead, LoadLeveler examines the requirements and characteristics of the job and the availability of machines, and then determines the best time for the job to be dispatched.

LoadLeveler also uses job classes to schedule jobs to run on machines. A job class is a classification to which a job can belong. For example, short-running jobs may belong to a job class called short_jobs. Similarly, jobs that are only allowed to run on the weekends may belong to a class called weekend. The system administrator can define these job classes and select the users that are authorized to submit jobs of these classes. You can specify which types of jobs will run on a machine by specifying the types of job classes the machine will support.

LoadLeveler also examines a job's priority to determine when to schedule the job on a machine. The priority of a job is used to determine its position among a list of all jobs waiting to be dispatched. "The LoadLeveler job cycle" on page 16 describes job flow in the LoadLeveler environment in more detail.

How LoadLeveler daemons process jobs

LoadLeveler daemons are programs that run continuously and control the processes that move jobs through the LoadLeveler cluster. A master daemon (LoadL_master) runs on all machines in the LoadLeveler cluster and manages the other daemons. Table 4 summarizes these daemons, which are described in further detail in the topics immediately following the table.

Table 4. LoadLeveler daemons

LoadL_master
   Referred to as the master daemon. Runs on all machines in the LoadLeveler cluster and manages other daemons.
LoadL_schedd
   Referred to as the Schedd daemon. Receives jobs from the llsubmit command and manages them on machines selected by the negotiator daemon (as defined by the administrator).
LoadL_startd
   Referred to as the startd daemon. Monitors job and machine resources on local machines and forwards information to the negotiator daemon. The startd daemon spawns the starter process (LoadL_starter), which manages running jobs on the executing machine.
Table 4. LoadLeveler daemons (continued)

LoadL_negotiator
   Referred to as the negotiator daemon. Monitors the status of each job and machine in the cluster. Responds to queries from the llstatus and llq commands. Runs on the central manager machine.
LoadL_kbdd
   Referred to as the keyboard daemon. Monitors keyboard and mouse activity.
LoadL_GSmonitor
   Referred to as the gsmonitor daemon. Monitors machine availability and notifies the negotiator when a machine is no longer reachable, more quickly than the negotiator's own MACHINE_UPDATE_INTERVAL heartbeat checking allows.

The master daemon

The master daemon runs on every machine in the LoadLeveler cluster, except the submit-only machines. The real and effective user ID of this daemon must be root. The LoadL_master binary is installed as a setuid program with the owner set to root. The master daemon and all daemons started by the master must be able to run with root privileges in order to switch the identity to the owner of any job being processed.

The master daemon determines whether to start any other daemons by checking the START_DAEMONS keyword in the global or local configuration file. If the keyword is set to true, the daemons are started. If the keyword is set to false, the master daemon terminates and generates a message. The master daemon will not start on a Linux machine if SEC_ENABLEMENT is set to CTSEC. If the master daemon does not start, no other daemons will start.

On the machine designated as the central manager, the master runs the negotiator daemon. The master also controls the central manager backup function. The negotiator runs on either the primary or an alternate central manager. If a central manager failure is detected, one of the alternate central managers becomes the primary central manager by starting the negotiator.

The master daemon starts and, if necessary, restarts all of the LoadLeveler daemons that the machine it resides on is configured to run. As part of its startup procedure, this daemon executes the .llrc file (a dummy file is provided in the bin subdirectory of the release directory). You can use this script to customize your local configuration file, specifying what particular data is stored locally. This daemon also runs the kbdd daemon, which monitors keyboard and mouse activity.

When the master daemon detects a failure on one of the daemons that it is monitoring, it attempts to restart it. Because this daemon recognizes that certain situations may prevent a daemon from running, it limits its restart attempts to the number defined for the RESTARTS_PER_HOUR keyword in the configuration file. If this limit is exceeded, the master daemon forces all daemons, including itself, to exit.

When a daemon must be restarted, the master sends mail to the administrators identified by the LOADL_ADMIN keyword in the configuration file. The mail contains the name of the failing daemon, its termination status, and a section of the daemon's most recent log file. If the master aborts after exceeding RESTARTS_PER_HOUR, it will also send that mail before exiting.
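The keywords just described are set in the configuration file. The following is a minimal sketch; the values shown are illustrative, not recommendations:

   # Global configuration file sketch (illustrative values):
   LOADL_ADMIN       = loadl brenda   # administrators who receive failure mail
   START_DAEMONS     = TRUE           # master starts the other configured daemons
   RESTARTS_PER_HOUR = 12             # stop retrying after 12 restarts in an hour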
The master daemon may perform the following actions in response to an llctl command:
v Kill all daemons and exit (stop keyword)
v Kill all daemons and execute a new master (recycle keyword)
v Rerun the .llrc file, reread the configuration files, and stop or start daemons as appropriate for the new configuration files (reconfig keyword)
v Send a drain request to startd and Schedd, and send the result to the caller (drain keyword)
v Send a flush request to startd and send the result to the caller (flush keyword)
v Send a suspend request to startd and send the result to the caller (suspend keyword)
v Send a resume request to startd and Schedd, and send the result to the caller (resume keyword)

The Schedd daemon

The Schedd daemon receives jobs sent by the llsubmit command and manages those jobs on machines selected by the negotiator daemon. The Schedd daemon is started, restarted, signalled, and stopped by the master daemon.

The Schedd daemon can be in any one of the following activity states:

Available
   This machine is available to schedule jobs.
Drained
   The Schedd machine accepts no more jobs. There are no jobs in the Starting or Running state. Jobs in the Idle state are drained, meaning they will not get dispatched.
Draining
   The Schedd daemon is being drained by the administrator, but some jobs are still running. The state of the machine remains Draining until all running jobs complete. At that time, the machine status changes to Drained.
Down
   The daemon is not running on this machine. The Schedd daemon enters this state when it has not reported its status to the negotiator. This can occur when the machine is actually down, or because there is a network failure.

The Schedd daemon performs the following functions:
v Assigns new job identifiers when requested by the job submission process (for example, by the llsubmit command).
v Receives new jobs from the llsubmit command. A new job is received as a job object for each job step. A job object is the data structure in memory containing all the information about a job step. The Schedd forwards the job object to the negotiator daemon as soon as it is received from the submit command.
v Maintains on-disk copies of jobs submitted locally (on this machine) that are either waiting or running on a remote (different) machine. The central manager can use this information to reconstruct the job information in the event of a failure. This information is also used for accounting purposes.
v Responds to directives sent by the administrator through the negotiator daemon. The directives include:
   – Run a job.
   – Change the priority of a job.
   – Remove a job.
   – Hold or release a job.
   – Send information about all jobs.
v Sends job events to the negotiator daemon when:
   – Schedd is restarting.
   – A new series of job objects is arriving.
   – A job is started.
   – A job was rejected, completed, removed, or vacated. Schedd determines the status by examining the exit status returned by the startd.
v Communicates with the Parallel Operating Environment (POE) when you run an interactive POE job.
v Requests that a remote startd daemon end a job.
v Receives accounting information from startd.
v Receives requests for reservations.
v Collects resource usage data when jobs terminate and stores it as historic fair share data in the $(SPOOL) directory.
v Sends historic fair share data to the central manager when it is updated or when the Schedd daemon is restarted.
v Maintains and stores records of historic CPU and IBM System Blue Gene® Solution utilization for users and groups known to the Schedd.
v Passes the historic CPU and Blue Gene utilization data to the central manager.

The startd daemon

The startd daemon monitors the status of each job, reservation, and machine in the cluster, and forwards this information to the negotiator daemon. The startd also receives and executes job requests originating from remote machines. The master daemon starts, restarts, signals, and stops the startd daemon.

Checkpoint/restart is not supported in LoadLeveler for Linux. If a checkpointed job is sent to a Linux node, the Linux node will reject the job.

The startd daemon can be in any one of the following states:

Busy
   The maximum number of jobs are running on this machine, as specified by the MAX_STARTERS configuration keyword.
Down
   The daemon is not running on this machine. The startd daemon enters this state when it has not reported its status to the negotiator. This can occur when the machine is actually down, or because there is a network failure.
Drained
   The startd machine will not accept any new jobs. No jobs are running when startd is in the Drained state.
Draining
   The startd daemon is being drained by the administrator, but some jobs are still running. The machine remains in the Draining state until all of the running jobs have completed, at which time the machine status changes to Drained. The startd daemon will not accept any new jobs while in the Draining state.
Flush
   Any running jobs have been vacated (terminated and returned to the queue to be redispatched). The startd daemon will not accept any new jobs.
Idle
   The machine is not running any jobs.
None
   LoadLeveler is running on this machine, but no jobs can run here.
Running
   The machine is running one or more jobs and is capable of running more.
Suspend
   All LoadLeveler jobs running on this machine are stopped (cease processing), but remain in virtual memory. The startd daemon will not accept any new jobs.

The startd daemon performs these functions:
v Runs a time-out procedure that includes building a snapshot of the state of the machine, including static and dynamic data. This time-out procedure is run at the following times:
   – After a job completes.
   – According to the definition of the POLLING_FREQUENCY keyword in the configuration file.
v Records the following information in LoadLeveler variables and sends the information to the negotiator:
   – State (of the startd daemon)
   – EnteredCurrentState
   – Memory
   – Disk
   – KeyboardIdle
   – Cpus
   – LoadAvg
   – Machine
   – Adapter
   – AvailableClasses
v Calculates the SUSPEND, RESUME, CONTINUE, and VACATE expressions through which you can manage job status.
v Receives job requests from the Schedd daemon to:
   – Start a job
   – Preempt or resume a job
   – Vacate a job
   – Cancel a job
   When the Schedd daemon tells the startd daemon to start a job, the startd determines whether its own state permits a new job to run, as shown in Table 5.

   Table 5. startd determines whether its own state permits a new job to run

   If it can start a new job:
      The startd forks a starter process.
   If it cannot start a new job:
      The startd rejects the request for one of the following reasons:
      v Jobs have been suspended, flushed, or drained
      v The job limit set for the MAX_STARTERS keyword has been reached
      v There are not enough classes available for the designated job class
v Receives requests from the master (through the llctl command) to do one of the following:
   – Drain (drain keyword)
   – Flush (flush keyword)
   – Suspend (suspend keyword)
   – Resume (resume keyword)
v For each request, startd marks its own new state, forwards its new state to the negotiator daemon, and then performs the appropriate action for any jobs that are active.
v Receives notification of keyboard and mouse activity from the kbdd daemon.
v Periodically examines the process table for LoadLeveler jobs and accumulates resources consumed by those jobs. This resource data is used to determine if a job has exceeded its job limit and for recording in the history file.
v Sends accounting information to Schedd.

The starter process

The startd daemon spawns a starter process after the Schedd daemon tells the startd daemon to start a job. The starter process manages all the processes associated with a job step. The starter process is responsible for running the job and reporting status back to the startd daemon.

The starter process performs these functions:
v Processes the prolog and epilog programs as defined by the JOB_PROLOG and JOB_EPILOG keywords in the configuration file. The job will not run if the prolog program exits with a return code other than zero.
v Handles authentication. This includes:
   – Authenticates AFS, if necessary.
   – Verifies that the submitting user is not root.
   – Verifies that the submitting user has access to the appropriate directories in the local file system.
v Runs the job by forking a child process that runs with the user ID and all groups of the submitting user. That child process creates a new process group of which it is the process group leader, and executes the user's program or a shell. The starter process is responsible for detecting the termination of any process that it forks. To ensure that all processes associated with a job are terminated after the process forked by the starter terminates, process tracking must be enabled. To configure LoadLeveler for process tracking, see "Tracking job processes" on page 70.
v Responds to vacate and suspend orders from the startd.

The negotiator daemon

The negotiator daemon maintains the status of each job and machine in the cluster and responds to queries from the llstatus and llq commands. The negotiator daemon runs on a single machine in the cluster (the central manager machine). This daemon is started, restarted, signalled, and stopped by the master daemon.

In a mixed cluster, the negotiator daemon must run on an AIX node.

The negotiator daemon receives status messages from each Schedd and startd daemon running in the cluster. The negotiator daemon tracks:
v Which Schedd daemons are running
v Which startd daemons are running, and the status of each startd machine.
If the negotiator does not receive an update from any machine within the time period defined by the MACHINE_UPDATE_INTERVAL keyword, then the negotiator assumes that the machine is down, and therefore the Schedd and startd daemons are also down. The negotiator also maintains in its memory several queues and tables which determine where the job should run.

The negotiator performs the following functions:
v Receives and records job status changes from the Schedd daemon.
v Schedules jobs based on a variety of scheduling criteria and policy options. Once a job is selected, the negotiator contacts the Schedd that originally created the job.
v Handles requests to:
   – Set priorities
   – Query about jobs, machines, classes, and reservations
   – Change reservation attributes
   – Bind jobs to reservations
   – Remove a reservation
   – Remove a job
   – Hold or release a job
   – Favor or unfavor a user or a job.
v Receives notification of Schedd resets indicating that a Schedd has restarted.

The kbdd daemon

The kbdd daemon monitors keyboard and mouse activity. The kbdd daemon is spawned by the master daemon if the X_RUNS_HERE keyword in the configuration file is set to true.

The kbdd daemon notifies the startd daemon when it detects keyboard or mouse activity; however, kbdd is not interrupt driven. It sleeps for the number of seconds defined by the POLLING_FREQUENCY keyword in the LoadLeveler configuration file, and then determines if X events, in the form of mouse or keyboard activity, have occurred. For more information on the configuration file, see Chapter 5, "Defining LoadLeveler resources to administer," on page 83.

The gsmonitor daemon

The gsmonitor daemon is not available in LoadLeveler for Linux.

The negotiator daemon monitors for down machines based on the heartbeat responses within the MACHINE_UPDATE_INTERVAL time period. If the negotiator has not received an update after two MACHINE_UPDATE_INTERVAL periods, then it marks the machine as down, and notifies the Schedd to remove any jobs running on that machine. The gsmonitor daemon (LoadL_GSmonitor) allows this cleanup to occur more reliably. The gsmonitor daemon uses the Group Services Application Programming Interface (GSAPI) to monitor machine availability on peer domains and to notify the negotiator quickly when a machine is no longer reachable.

If the GSMONITOR_DOMAIN keyword was not specified in the LoadLeveler configuration file, then LoadLeveler will try to determine if the machine is running in a peer (cluster) domain. The gsmonitor must run in a peer domain.
The gsmonitor detects that it is running in an active peer domain and then uses the RMC API to determine the node numbers and names of machines running in the cluster.

If the administrator sets up a LoadLeveler administration file that contains OSIs spanning several peer domains, then a gsmonitor daemon must be started in each domain. A gsmonitor daemon can monitor only the OSIs contained in the domain within which it is running. The administrator specifies which OSIs run the gsmonitor daemon by specifying GSMONITOR_RUNS_HERE=TRUE in the local configuration file for that OSI. The default for GSMONITOR_RUNS_HERE is False.

The gsmonitor daemon should be run on one or two nodes in the peer domain. By running LoadL_GSmonitor on more than one node in a domain, you will have a backup in case one of the nodes that the monitor is running on goes down.

LoadL_GSmonitor subscribes to the Group Services system-defined host membership group, which is represented by the HA_GS_HOST_MEMBERSHIP Group Services keyword. This group monitors every configured node in the system partition and every node in the active peer domain.

Note:
1. The Group Services routines need to be run as root, so the LoadL_GSmonitor executable must be owned by root and have the setuid permission bit enabled.
2. Running more than one LoadL_GSmonitor daemon per peer domain will not cause a problem; it will just cause the negotiator to be notified by each running daemon.
3. For more information about the Group Services subsystem, see the RSCT Administration Guide, SA22-7889, for peer domains.
4. For more information about GSAPI, see Group Services Programming Guide and Reference, SA22-7355.
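Daemon-placement keywords such as X_RUNS_HERE and GSMONITOR_RUNS_HERE are typically set per machine in the local configuration file. A minimal sketch, with illustrative values only:

   # Local configuration file sketch for one machine (illustrative values):
   X_RUNS_HERE         = TRUE   # this machine runs X, so start the kbdd daemon
   POLLING_FREQUENCY   = 5      # seconds kbdd sleeps between activity checks
   GSMONITOR_RUNS_HERE = TRUE   # this OSI runs the gsmonitor daemon (AIX peer domains only)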
The LoadLeveler job cycle

To illustrate the flow of job information through the LoadLeveler cluster, a description and a sequence of diagrams are provided.

Figure 4. High-level job flow

The managing machine in a LoadLeveler cluster is known as the central manager. There are also machines that act as schedulers, and machines that serve as the executing machines. The arrows in Figure 4 illustrate the following:
v Arrow 1 indicates that a job has been submitted to LoadLeveler.
v Arrow 2 indicates that the scheduling machine contacts the central manager to inform it that a job has been submitted, and to find out if a machine exists that matches the job requirements.
v Arrow 3 indicates that the central manager checks to determine if a machine exists that is capable of running the job. Once a machine is found, the central manager informs the scheduling machine which machine is available.
v Arrow 4 indicates that the scheduling machine contacts the executing machine and provides it with information regarding the job. In this case, the scheduling and executing machines are different machines in the cluster, but they do not have to be different; the scheduling and executing machines may be the same physical machine.

Figure 4 is broken down into the following more detailed diagrams illustrating how LoadLeveler processes a job. The diagrams indicate specific job states for this example, but do not list all of the possible states for LoadLeveler jobs. A complete list of job states appears in "LoadLeveler job states" on page 19.

1. Submit a LoadLeveler job:
Figure 5. Job is submitted to LoadLeveler

Figure 5 illustrates that the Schedd daemon runs on the scheduling machine. This machine can also have the startd daemon running on it. The negotiator daemon resides on the central manager machine. The arrows in Figure 5 illustrate the following:
v Arrow 1 indicates that a job has been submitted to the scheduling machine.
v Arrow 2 indicates that the Schedd daemon, on the scheduling machine, stores all of the relevant job information on local disk.
v Arrow 3 indicates that the Schedd daemon sends job description information to the negotiator daemon. At this point, the submitted job is in the Idle state.

2. Permit to run:

Figure 6. LoadLeveler authorizes the job
In Figure 6 on page 17, arrow 4 indicates that the negotiator daemon authorizes the Schedd daemon to begin taking steps to run the job. This authorization is called a permit to run. Once this is done, the job is considered Pending or Starting.

3. Prepare to run:

Figure 7. LoadLeveler prepares to run the job

In Figure 7, arrow 5 illustrates that the Schedd daemon contacts the startd daemon on the executing machine and requests that it start the job. The executing machine can either be a local machine (the machine to which the job was submitted) or another machine in the cluster. In this example, the local machine is not the executing machine.

4. Initiate job:

Figure 8. LoadLeveler starts the job
The arrows in Figure 8 on page 18 illustrate the following:
v Arrow 6 indicates that the startd daemon on the executing machine spawns a starter process for the job.
v Arrow 7 indicates that the Schedd daemon sends the starter process the job information and the executable.
v Arrow 8 indicates that the Schedd daemon notifies the negotiator daemon that the job has been started and the negotiator daemon marks the job as Running.
The starter forks and executes the user's job, and the starter parent waits for the child to complete.

5. Complete job:

Figure 9. LoadLeveler completes the job

The arrows in Figure 9 illustrate the following:
v Arrow 9 indicates that when the job completes, the starter process notifies the startd daemon.
v Arrow 10 indicates that the startd daemon notifies the Schedd daemon.
v Arrow 11 indicates that the Schedd daemon examines the information it has received, and forwards it to the negotiator daemon. At this point, the job is in Completed or Complete Pending state.

LoadLeveler job states

As LoadLeveler processes a job, the job moves through various states. These states are listed in Table 6 on page 20. Job states that include "Pending," such as Complete Pending and Vacate Pending, are intermediate, temporary states.

Some options on LoadLeveler interfaces are valid only for jobs in certain states. For example, the llmodify command has options that apply only to jobs that are in the Idle state, or in states that are similar to it. To determine which job states are similar to the Idle state, use the "Similar to" information in Table 6 on page 20, which indicates whether a particular job state is similar to the Idle, Running, or Terminating state.
  • 40. indicates whether a particular job state is similar to the Idle, Running, or Terminating state. A dash (—) indicates that the state is not similar to an Idle, Running, or Terminating state. Table 6. Job state descriptions and abbreviations Job state Similar to Abbreviation Description Idle or in displays / Running output state? Canceled Terminating CA The job was canceled either by a user or by an administrator. Checkpointing Running CK Indicates that a checkpoint has been initiated. Completed Terminating C The job has completed. Complete Terminating CP The job is in the process of being Pending completed. Deferred Idle D The job will not be assigned to a machine until a specified date. This date may have been specified by the user in the job command file, or may have been generated by the negotiator because a parallel job did not accumulate enough machines to run the job. Only the negotiator places a job in the Deferred state. Idle Idle I The job is being considered to run on a machine, though no machine has been selected. Not Queued Idle NQ The job is not being considered to run on a machine. A job can enter this state because the associated Schedd is down, the user or group associated with the job is at its maximum maxqueued or maxidle value, or because the job has a dependency which cannot be determined. For more information on these keywords, see “Controlling the mix of idle and running jobs” on page 721. (Only the negotiator places a job in the NotQueued state.) Not Run — NR The job will never be run because a dependency associated with the job was found to be false. Pending Running P The job is in the process of starting on one or more machines. (The negotiator indicates this state until the Schedd acknowledges that it has received the request to start the job. Then the negotiator changes the state of the job to Starting. The Schedd indicates the Pending state until all startd machines have acknowledged receipt of the start request. The Schedd then changes the state of the job to Starting.) 20 TWS LoadLeveler: Using and Administering
Table 6. Job state descriptions and abbreviations (continued)

Preempted (E; similar to Running)
   The job is preempted. This state applies only when LoadLeveler uses the suspend method to preempt the job.
Preempt Pending (EP; similar to Running)
   The job is in the process of being preempted. This state applies only when LoadLeveler uses the suspend method to preempt the job.
Rejected (X; similar to Idle)
   The job is rejected.
Reject Pending (XP; similar to Idle)
   The job did not start. Possible reasons why a job is rejected are: job requirements were not met on the target machine, or the user ID of the person running the job is not valid on the target machine. After a job leaves the Reject Pending state, it is moved into one of the following states: Idle, User Hold, or Removed.
Removed (RM; similar to Terminating)
   The job was stopped by LoadLeveler.
Remove Pending (RP; similar to Terminating)
   The job is in the process of being removed, but not all associated machines have acknowledged the removal of the job.
Resume Pending (MP; similar to Running)
   The job is in the process of being resumed.
Running (R; similar to Running)
   The job is running: the job was dispatched and has started on the designated machine.
Starting (ST; similar to Running)
   The job is starting: the job was dispatched, was received by the target machine, and LoadLeveler is setting up the environment in which to run the job. For a parallel job, LoadLeveler sets up the environment on all required nodes. See the description of the Pending state for more information on when the negotiator or the Schedd daemon moves a job into the Starting state.
System Hold (S; similar to Idle)
   The job has been put in system hold.
Table 6. Job state descriptions and abbreviations (continued)

Terminated (TX; similar to Terminating)
   If the negotiator and Schedd daemons experience communication problems, they may be temporarily unable to exchange information concerning the status of jobs in the system. During this period of time, some of the jobs may actually complete and therefore be removed from the Schedd's list of active jobs. When communication resumes between the two daemons, the negotiator will move such jobs to the Terminated state, where they will remain for a set period of time (specified by the NEGOTIATOR_REMOVE_COMPLETED keyword in the configuration file). When this time has passed, the negotiator will remove the jobs from its active list.
User & System Hold (HS; similar to Idle)
   The job has been put in both system hold and user hold.
User Hold (H; similar to Idle)
   The job has been put in user hold.
Vacated (V; similar to Idle)
   The job started but did not complete. The negotiator will reschedule the job (provided the job is allowed to be rescheduled). Possible reasons why a job moves to the Vacated state are: the machine where the job was running was flushed, the VACATE expression in the configuration file evaluated to True, or LoadLeveler detected a condition indicating the job needed to be vacated. For more information on the VACATE expression, see "Managing job status through control expressions" on page 68.
Vacate Pending (VP; similar to Idle)
   The job is in the process of being vacated.

Consumable resources

Consumable resources are assets available on machines in your LoadLeveler cluster. These assets are called "resources" because they model the commodities or services available on machines (including CPUs, real memory, virtual memory, large page memory, software licenses, and disk space). They are considered "consumable" because job steps use specified amounts of these commodities when the step is running. Once the step finishes, the resource becomes available for another job step.

Consumable resources which model the characteristics of a specific machine (such as the number of CPUs or the number of specific software licenses available only on that machine) are called machine resources. Consumable resources which model resources that are available across the LoadLeveler cluster (such as floating software licenses) are called floating resources.
For example, consider a configuration with 10 licenses for a given program (which can be used on any machine in the cluster). If these licenses are defined as floating resources, all 10 can be used on one machine, or they can be spread across as many as 10 different machines.

The LoadLeveler administrator can specify:
v Consumable resources to be considered by LoadLeveler's scheduling algorithms
v Quantity of resources available on specific machines
v Quantity of floating resources available on machines in the cluster
v Consumable resources to be considered in determining the priority of executing machines
v Default amount of resources consumed by a job step of a specified job class
v Whether CPU, real memory, virtual memory, or large page resources should be enforced using AIX Workload Manager (WLM)
v Whether all jobs submitted need to specify resources
A sketch of these settings appears after the notes below.

Users submitting jobs can specify the resources consumed by each task of a job step, or the resources consumed by the job on each machine where it runs, regardless of the number of tasks assigned to that machine.

If affinity scheduling support is enabled, the CPUs requested in the consumable resources requirement of a job will be used by the scheduler to determine the number of CPUs to be allocated and attached to that job's tasks running on machines enabled for affinity scheduling. However, if the affinity scheduling request contains the processor-core affinity option, the number of CPUs will be determined from the value specified by the task_affinity keyword instead of the CPUs value in the consumable resources requirement. For more information on scheduling affinity, see "LoadLeveler scheduling affinity support" on page 146.

Note:
1. When software licenses are used as a consumable resource, LoadLeveler does not attempt to obtain software licenses or to verify that software licenses have been obtained. However, by providing a user exit that can be invoked as a submit filter, the LoadLeveler administrator can provide code to first obtain the required license and then allow the job step to run. For more information on filtering job scripts, see "Filtering a job script" on page 76.
2. LoadLeveler scheduling algorithms use the availability of requested consumable resources to determine the machine or machines on which a job will run. Consumable resources (except for CPU, real memory, virtual memory, and large page) are only used for scheduling purposes and are not enforced. Instead, LoadLeveler's negotiator daemon keeps track of the consumable resources available by reducing them by the amount requested when a job step is scheduled, and increasing them when a consuming job step completes.
3. If a job is preempted, the job continues to use all consumable resources except for ConsumableCpus and ConsumableMemory (real memory), which are made available to other jobs.
4. When the network adapters on a machine support RDMA, the machine is automatically given a consumable resource called RDMA with an available quantity defined by the limit on the number of concurrent jobs that use RDMA. For machines with the "Switch Network Interface for HPS" network adapters, this limit is 4. Machines with InfiniBand adapters are given unlimited RDMA resources.
5. When steps require RDMA, either because they request bulkxfer or because they request rcxtblocks on at least one network statement, the job is automatically given a resource requirement for 1 RDMA.
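As referenced above, the following minimal sketch shows one way these settings might be expressed. The resource names, quantities, and the spice2g6 license name are illustrative; an actual cluster's files will differ:

   Configuration file (illustrative):
      SCHEDULE_BY_RESOURCES   = ConsumableCpus ConsumableMemory spice2g6
      FLOATING_RESOURCES      = spice2g6(10)
      ENFORCE_RESOURCE_USAGE  = ConsumableCpus ConsumableMemory
      ENFORCE_RESOURCE_POLICY = shares

   Administration file machine stanza (illustrative):
      node01: type = machine
              resources = ConsumableCpus(8) ConsumableMemory(16 gb)

Here the 10 floating spice2g6 licenses are tracked cluster-wide, while the machine stanza declares the CPUs and memory that node01 makes available to job steps.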
Consumable resources and AIX Workload Manager

If the administrator has indicated that resources should be enforced, LoadLeveler uses AIX Workload Manager (WLM) to give greater control over CPU, real memory, virtual memory, and large page resource allocation. WLM monitors system resources and regulates their allocation to processes running on AIX. These actions prevent jobs from interfering with each other when they have conflicting resource requirements. WLM achieves this control by creating different classes of service and allowing attributes to be specified for those classes.

LoadLeveler dynamically generates WLM classes with specific resource entitlements. A single WLM class is created for each job step, and the process ID of that job step is assigned to that class. This is done for each node that a job step is assigned to run on. LoadLeveler then defines resource shares or limits for that class depending on the LoadLeveler enforcement policy defined. These resource shares or limits represent the job's requested resource usage in relation to the amount of resources available on the machine.

When LoadLeveler defines multiple memory resources under one WLM class, AIX WLM uses the following order to determine if resource limits have been exceeded:
1. Real Memory Absolute Limit
2. Virtual Memory Absolute Limit
3. Large Page Limit
4. Real Memory shares or percent limit

Note: When real memory or CPU with either shares or percent limits are exceeded, the job processes in that class receive a lower scheduling priority until their utilization drops below the hard max limit. When virtual memory or absolute real memory limits are exceeded, the job processes are killed. When the large page limit is exceeded, any new large page requests are denied.

When the enforcement policy is shares, LoadLeveler assigns a share value to the class based on the resources requested for the job step (one unit of resource equals one share). When the job step process is running, AIX WLM dynamically calculates an appropriate resource entitlement based on the WLM class share value of the job step and the total number of shares requested by all active WLM classes. It is important to note that AIX WLM will only enforce these target percentages when the resource is under contention.

When the enforcement policy is limits (soft or hard), LoadLeveler assigns a percentage value to the class based on the resources requested for the job step and the total machine resources. This resource percentage is enforced regardless of any other active WLM classes. A soft limit indicates the maximum amount of the resource that can be made available when there is contention for the resources. This maximum can be exceeded if no one else requires the resource. A hard limit indicates the maximum amount of the resource that can be made available even if there is no contention for the resources.
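To make the shares arithmetic concrete, consider a hypothetical node running two enforced job steps: step A requests ConsumableCpus(2) and step B requests ConsumableCpus(6). Under the shares policy, A's WLM class receives 2 shares and B's receives 6, so while CPU is under contention WLM targets roughly 25% of the CPU for A and 75% for B; when there is no contention, either step may use more. Under a hard limit policy on an 8-CPU machine, the same requests would instead cap the classes at 25% and 75% regardless of contention.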
Note: A WLM class is active for the duration of a job step and is deleted when the job step completes. There is a limit of 64 active WLM classes per machine. Therefore, when resources are being enforced, only 64 job steps can be running on one machine.

For additional information about integrating LoadLeveler with AIX Workload Manager, see "Steps for integrating LoadLeveler with the AIX Workload Manager" on page 137.

Overview of reservations

Under the BACKFILL scheduler only, LoadLeveler allows authorized users to make reservations, which specify a time period during which specific node resources are reserved for exclusive use by particular users or groups. This capability is known in the computing industry as advance reservation.

Normally, jobs wait to be dispatched until the resources they require become available. Through the use of reservations, wait time can be reduced because the jobs have exclusive use of the node resources (CPUs, memory, disk drives, communication adapters, and so on) as soon as the reservation period begins.

Note: Advance reservation supports Blue Gene resources, including the Blue Gene compute nodes. For more information, see "Blue Gene reservation support" on page 159.

In addition to reducing wait time, reservations also are useful for:
v Running a workload that needs to start or finish at a particular time. The job steps must be associated with, or bound to, the reservation before LoadLeveler can run them during the reservation period.
v Reserving resources for a workload that repeats at regular intervals. You can make a single request to create a recurring reservation, which reserves a specific set of resources for a specific time slot that repeats on a regular basis for a defined interval.
v Setting aside a set of nodes for maintenance purposes. In this case, job steps are not bound to the reservation.
Only bound job steps may run on the reserved nodes, which means that a bound job step competes for reserved resources only with other job steps that are bound to the same reservation.

The following sequence of events describes, in general terms, how you can set up and use reservations in the LoadLeveler environment. It also describes how LoadLeveler manages activities related to the use of reservations.

1. Configuring LoadLeveler to support reservations

An administrator uses specific keywords in the configuration and administration files to define general reservation policies. These keywords include:
v max_reservations, which, when used in the global configuration file, defines the maximum number of reservations for the entire cluster.
v max_reservations, which, when used in a user or group stanza of the administration file, can also be used to define both:
   – The users or groups that will be allowed to create reservations. To be authorized to create reservations, LoadLeveler administrators also must have the max_reservations keyword set in their own user or group stanzas.
   – How many reservations users may own.
   Note: With recurring advance reservations, to avoid confusion about what counts as one reservation, LoadLeveler uses the approach that one reservation counts as one instance regardless of the number of times the reservation recurs before it expires. This applies to the system-wide max_reservations configuration setting as well as the same type of configuration settings at the user and group levels.
v max_reservation_duration, which defines the maximum duration for reservations.
v reservation_permitted, which defines the nodes that may be used for reservations.
v max_reservation_expiration, which defines how long recurring reservations are permitted to last (expressed as the number of days).
Administrators also may configure LoadLeveler to collect accounting data about reservations when the reservations complete or are canceled.

2. Creating reservations

After LoadLeveler is configured for reservations, an administrator or authorized user may create specific reservations, defining reservation attributes that include:
v The start time and the duration of the reservation. The start and end times for a reservation are based on the time-of-day (TOD) clock on the central manager machine.
v Whether or not the reservation recurs and, if it recurs, the interval in which it does so.
v The nodes to be reserved. Until the reservation period actually begins, the selected nodes are available to run any jobs; when the reservation starts, only jobs bound to the reservation may run on the reserved nodes.
v The users or groups that may use the reservation.

LoadLeveler assigns a unique ID to the reservation, and returns that ID to the owner. After the reservation is successfully created:
v Reservation owners may:
   – Modify, query, and cancel their reservations.
   – Allow other LoadLeveler users or groups to submit jobs to run during a reservation period.
   – Submit jobs to run during a reservation period.
v Users or groups that are allowed to use the reservation also may query reservations, and submit jobs to run during a reservation period.
To run jobs during a reservation period, users must bind job steps to the reservation. You may bind both batch and interactive POE job steps to a reservation.

3. Preparing for the start of a reservation

During the preparation time for a reservation, LoadLeveler:
v Preempts any jobs that are still running on the reserved nodes.
v Checks the condition of reserved nodes, and notifies the reservation owner and LoadLeveler administrators by e-mail of any situations that might require the reservation owner or an administrator to take corrective action. Such conditions include:
   – Reserved nodes that are down, suspended, no longer in the LoadLeveler cluster, or otherwise unavailable for use.
   – Non-preemptable job steps that cannot finish running before the reservation start time.
During this time, reservation owners may modify, cancel, and add users or groups to their reservations. Owners and users or groups that are allowed to use the reservation may query the reservation or bind job steps to it.

4. Starting the reservation

When the reservation period begins, LoadLeveler dispatches job steps that are bound to the reservation. After the reservation period begins, reservation owners may modify, cancel, and add users or groups to their reservations. Owners and users or groups that are allowed to use the reservation may query the reservation or bind job steps to it. During the reservation period, LoadLeveler ignores system preemption rules for bound job steps; however, LoadLeveler administrators may use the llpreempt command to manually preempt bound job steps.

When the reservation ends or is canceled:
v LoadLeveler unbinds all job steps from the reservation if there are no further occurrences remaining. At this point the unbound job steps compete with all other LoadLeveler jobs for available resources. If there are occurrences remaining in the reservation, job steps are automatically bound to the next occurrence.
v If accounting data is being collected for the reservation, LoadLeveler also updates the reservation history file.

For more detailed information and instructions for setting up and using reservations, see:
v "Configuring LoadLeveler to support reservations" on page 131.
v "Working with reservations" on page 213.

Fair share scheduling overview

Fair share scheduling in LoadLeveler provides a way to divide resources in a LoadLeveler cluster among users or groups of users.

Historic resource usage data that is collected at the time the job ends can be used to influence job priorities to achieve the resource usage proportions allocated to users or groups of users in the LoadLeveler configuration files. The resource usage data will decay over time so that the relatively recent historic resource usage will have the most influence on job priorities. The CPU resources in the cluster and the Blue Gene resources are currently supported by fair share scheduling.

For information about configuring fair share scheduling in LoadLeveler, see "Using fair share scheduling" on page 160.
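As a rough illustration of how such proportions are expressed, the following sketch uses the fair share keywords; the share counts and user names are illustrative only (see "Using fair share scheduling" on page 160 for the authoritative description):

   Configuration file (illustrative):
      FAIR_SHARE_TOTAL_SHARES = 100   # total shares representing the cluster's resources
      FAIR_SHARE_INTERVAL     = 180   # hours over which historic usage decays

   Administration file user stanzas (illustrative):
      carol: type = user
             fair_shares = 60         # carol's jobs are entitled to 60 of the 100 shares
      dave:  type = user
             fair_shares = 40

With these proportions, recent CPU usage charged to carol or dave beyond their allocated shares lowers the priority of their subsequent jobs until the usage data decays.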
Chapter 2. Getting a quick start using the default configuration

If you are very familiar with UNIX and Linux system administration and job scheduling, follow these steps to get LoadLeveler up and running on your network quickly in a default configuration. This default configuration will merely enable you to submit serial jobs; for a more complex setup, see Chapter 4, "Configuring the LoadLeveler environment," on page 41.

What you need to know before you begin

LoadLeveler sets up default values for configuration information.
v loadl is the recommended LoadLeveler user ID and the LoadLeveler group ID. LoadLeveler daemons run under this user ID to perform file I/O, and many LoadLeveler files are owned by this user ID.
v The home directory of loadl is the configuration directory.
v LoadL_config is the name of the configuration file.
For information about configuration file keyword syntax and other details, see Chapter 12, "Configuration file reference," on page 263.

Using the default configuration files

Follow these steps to use the default configuration files. A combined example appears at the end of this topic.

Note: You can find samples of the LoadL_admin and LoadL_config files in the release directory (in the samples subdirectory).

1. Ensure that the installation procedure has completed successfully and that the configuration file, LoadL_config, exists in LoadLeveler's home directory or in the directory specified by the LoadLConfig keyword.
2. Identify yourself as the LoadLeveler administrator in the LoadL_config file using the LOADL_ADMIN keyword. The syntax of this keyword is:
      LOADL_ADMIN = list_of_user_names (required)
   where list_of_user_names is a blank-delimited list of those individuals who will have administrative authority. Refer to "Defining LoadLeveler administrators" on page 43 for more information.
3. Define a machine to act as the LoadLeveler central manager by coding one machine stanza as follows in the administration file, which is called LoadL_admin. (Replace machine_name with the actual name of the machine.)
      machine_name: type = machine
                    central_manager = true
   Do not specify more than one machine as the central manager. Also, if during installation you ran llinit with the -cm flag, the central manager is already defined in the LoadL_admin file because the llinit command takes parameters that you entered and updates the administration and configuration files. See "Defining machines" on page 84 for more information.
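Putting steps 2 and 3 together, a minimal pair of files might look like the following sketch; the administrator and host names are illustrative:

   LoadL_config (fragment):
      LOADL_ADMIN = loadl

   LoadL_admin (fragment):
      headnode.example.com: type = machine
                            central_manager = true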
LoadLeveler for Linux quick start

If you would like to quickly install and configure LoadLeveler for Linux and submit a serial job on a single node, use these procedures.

Note: This setup is for a single node only, and the node used for this example is c197blade1b05.ppd.pok.ibm.com.

Quick installation

Details of this installation apply to RHEL 4 System x servers.

Note: This installation method is, however, applicable to all other systems. You must install the corresponding license RPM for the system you are installing on.

This installation assumes that the LoadLeveler RPMs are located at /mnt/cdrom/.
1. Log on to node c197blade1b05.ppd.pok.ibm.com as root, which is the node you are installing on.
2. Add a UNIX group for LoadLeveler users (make sure the group ID is correct) by entering the following command:
      groupadd -g 1000 loadl
3. Add a UNIX user for LoadLeveler (make sure the user ID is correct) by entering the following command:
      useradd -c "LoadLeveler User" -d /home/loadl -s /bin/bash -u 1001 -g 1000 -m loadl
4. Install the license RPM by entering the following command:
      rpm -ivh /mnt/cdrom/LoadL-full-license-RH4-X86-3.5.0.0-0.i386.rpm
5. Change to the LoadLeveler installation path by entering the following command:
      cd /opt/ibmll/LoadL/sbin
6. Run the LoadLeveler installation script by entering:
      ./install_ll -y -d /mnt/cdrom
7. Install the required LoadLeveler 3.5.0.1 service updates for this RPM. Updates and installation instructions are available at:
      https://www14.software.ibm.com/webapp/set2/sas/f/loadleveler/download/intel.html

Quick configuration

Use this method to perform a quick configuration.
1. Change the login to the newly created LoadLeveler user by entering the following command:
      su - loadl
2. Add the LoadLeveler bin directory to the search path:
      export PATH=$PATH:/opt/ibmll/LoadL/full/bin
3. Run the LoadLeveler initialization script:
      /opt/ibmll/LoadL/full/bin/llinit -local /tmp/loadl -release /opt/ibmll/LoadL/full -cm c197blade1b05.ppd.pok.ibm.com

Quick verification

Use this method to perform a quick verification.
1. Start LoadLeveler by entering the following command:
      llctl start
   You should receive a response similar to the following:
      llctl: Attempting to start LoadLeveler on host c197blade1b05.ppd.pok.ibm.com
      LoadL_master 3.5.0.1 rsats001a 2008/10/29 RHEL 4.0 140
      CentralManager = c197blade1b05.ppd.pok.ibm.com
      [loadl@c197blade1b05 bin]$
2. Check LoadLeveler status by entering the following command:
      llstatus
   You should receive a response similar to the following:
      Name                      Schedd InQ Act Startd Run LdAvg Idle Arch OpSys
      c197blade1b05.ppd.pok.ibm Avail    0   0 Idle     0  0.00    1 i386 Linux2
      i386/Linux2               1 machines  0 jobs  0 running task
      Total Machines            1 machines  0 jobs  0 running task
      The central manager is defined on c197blade1b05.ppd.pok.ibm.com
      The BACKFILL scheduler is in use
      All machines on the machine_list are present.
      [loadl@c197blade1b05 bin]$
3. Submit a sample job by entering the following command:
      llsubmit /opt/ibmll/LoadL/full/samples/job1.cmd
   You should receive a response similar to the following:
      llsubmit: The job "c197blade1b05.ppd.pok.ibm.com.1" with 2 job steps has been submitted.
      [loadl@c197blade1b05 samples]$
4. Display the LoadLeveler job queue by entering the following command:
      llq
   You should receive a response similar to the following:
      Id                Owner  Submitted   ST PRI Class    Running On
      ----------------- ------ ----------- -- --- -------- -------------
      c197blade1b05.1.0 loadl  8/15 17:25  R  50  No_Class c197blade1b05
      c197blade1b05.1.1 loadl  8/15 17:25  I  50  No_Class
      2 job step(s) in queue, 1 waiting, 0 pending, 1 running, 0 held, 0 preempted
      [loadl@c197blade1b05 samples]$
5. Check the output files in the home directory (/home/loadl) by entering the following command:
      ls -ltr job*
   You should receive a response similar to the following:
      -rw-rw-r-- 1 loadl loadl 1940 Aug 15 17:26 job1.c197blade1b05.1.0.out
      -rw-rw-rw- 1 loadl loadl 1940 Aug 15 17:27 job1.c197blade1b05.1.1.out
      [loadl@c197blade1b05 ~]$

Post-installation considerations

This information explains how to start (or restart) and stop LoadLeveler. It also tells you where files are located after you install LoadLeveler, and it points you to troubleshooting information.

Starting LoadLeveler

You can start LoadLeveler using any LoadLeveler administrator user ID as defined in the configuration file. To start all of the machines that are defined in machine stanzas in the administration file, enter:
      llctl -g start
Post-installation considerations

This information explains how to start (or restart) and stop LoadLeveler. It
also tells you where files are located after you install LoadLeveler, and it
points you to troubleshooting information.

Starting LoadLeveler

You can start LoadLeveler using any LoadLeveler administrator user ID as
defined in the configuration file. To start all of the machines that are
defined in machine stanzas in the administration file, enter:

   llctl -g start

The central manager machine is started first, followed by the other machines
in the order listed in the administration file. See "llctl - Control
LoadLeveler daemons" on page 439 for more information.

By default, llctl uses rsh to start LoadLeveler on the target machine. Other
mechanisms, such as ssh, can be used by setting the LL_RSH_COMMAND
configuration keyword in LoadL_config. However you choose to start LoadLeveler
on remote hosts, you must have the authority to run commands remotely on those
hosts.
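For example, to have llctl start remote machines with ssh instead of rsh, you
might add a line like the following to LoadL_config; the path shown is an
assumption for the example and should match the location of ssh on your
systems:

   # Use ssh rather than the default rsh to start daemons on remote hosts
   LL_RSH_COMMAND = /usr/bin/ssh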
You can verify that the machine has been properly configured by running the
sample jobs in the appropriate samples directory (job1.cmd, job2.cmd, and
job3.cmd). You must read the job2.cmd and job3.cmd files before submitting
them: job2 must be edited, and a C program must be compiled to use job3. It is
a good idea to copy the sample jobs to another directory before modifying
them; you must have read/write permission to the directory in which they are
located. You can use the llsubmit command to submit the sample jobs from
several different machines and verify that they complete (see "llsubmit -
Submit a job" on page 531).

If you are running AFS and some jobs do not complete, you might need to use
the AFS fs command (fs listacl) to ensure that you have write permission to
the spool, execute, and log directories.

If you are running with cluster security services enabled and some jobs do not
complete, ensure that you have write permission to the spool, execute, and log
directories. Also ensure that the user ID is authorized to run jobs on the
submitting machine (the identity of the user must exist in the .rhosts file of
the user on the machine on which the job is being run).

Note: LoadLeveler for Linux does not support cluster security services.

If you are running submit-only LoadLeveler, once the LoadLeveler pool is up
and running, you can use the llsubmit, llq, and llcancel commands from the
submit-only machines. For more information about these commands, see:
v "llsubmit - Submit a job" on page 531
v "llq - Query job status" on page 479
v "llcancel - Cancel a submitted job" on page 421

You can also invoke the LoadLeveler graphical user interface xloadl_so from
the submit-only machines (see Chapter 15, "Graphical user interface (GUI)
reference," on page 403).

Location of directories following installation

After installation, the product directories that reside on disk are shown in
Table 7. The installation process creates only those directories required to
service the LoadLeveler options specified during the installation. For AIX,
release_directory indicates /usr/lpp/LoadL/full; for Linux, it indicates
/opt/ibmll/LoadL/full.

Table 7. Location and description of product directories following installation

   Directory                   Description
   release_directory/bin       Part of the release directory containing
                               daemons, commands, and other binaries
   release_directory/lib       Part of the release directory containing
                               product libraries and resource files
   release_directory/man       Part of the release directory containing man
                               pages
   release_directory/samples   Part of the release directory containing sample
                               administration and configuration files and
                               sample jobs
   release_directory/include   Part of the release directory containing header
                               files for the application programming
                               interfaces
   Local directory             spool, execute, and log directories for each
                               machine in the cluster
   Home directory              Administration and configuration files, and
                               symbolic links to the release directory
   /usr/lpp/LoadL/codebase     Configuration tasks for AIX

Table 8 shows the location of directories for submit-only LoadLeveler:

Table 8. Location and description of directories for submit-only LoadLeveler

   Directory                      Description
   release_directory/so/bin       Part of the release directory containing
                                  commands
   release_directory/so/man       Part of the release directory containing man
                                  pages
   release_directory/so/samples   Part of the release directory containing
                                  sample administration and configuration files
   release_directory/so/lib       Contains libraries and graphical user
                                  interface resource files
   Home directory                 Contains administration and configuration
                                  files

If you have a mixed LoadLeveler cluster of AIX and Linux machines, you might
want to make the following symbolic links:
v On AIX, as root, enter:
   mkdir -p /opt/ibmll
   ln -s /usr/lpp/LoadL /opt/ibmll/LoadL
v On Linux, as root, enter:
   mkdir -p /usr/lpp
   ln -s /opt/ibmll/LoadL /usr/lpp/LoadL

With the addition of these symbolic links, a user application can use either
/usr/lpp/LoadL or /opt/ibmll/LoadL to refer to the location of LoadLeveler
files, regardless of whether the application is running on AIX or Linux.

If LoadLeveler will not start following installation, see "Why won't
LoadLeveler start?" on page 700 for troubleshooting information.
Chapter 3. What operating systems are supported by LoadLeveler?

LoadLeveler supports three operating environments:
v AIX 6.1 and AIX 5.3
  IBM's AIX 6.1 and AIX 5.3 are open UNIX operating environments that conform
  to The Open Group UNIX 98 Base Brand industry standard. AIX 6.1 and AIX 5.3
  provide high levels of integration, flexibility, and reliability and operate
  on IBM Power Systems and IBM Cluster 1600 servers and workstations.
  AIX 6.1 and AIX 5.3 support the concurrent operation of 32- and 64-bit
  applications, with key internet technologies such as Java™ and the XML
  parser for Java included as part of the base operating system.
  A strong affinity between AIX and Linux permits popular applications
  developed on Linux to run on AIX 6.1 and AIX 5.3 with a simple
  recompilation.
v Linux
  LoadLeveler supports the following distributions of Linux:
  - Red Hat® Enterprise Linux (RHEL) 4 and RHEL 5
  - SUSE Linux Enterprise Server (SLES) 9 and SLES 10
v IBM System Blue Gene Solution
  While no LoadLeveler processes actually run on the Blue Gene machine,
  LoadLeveler can interact with the Blue Gene machine and supports the
  scheduling of jobs to the machine.
  Note: For models of the Blue Gene system such as Blue Gene/S, which can only
  run a single job at a time, LoadLeveler does not have to be configured to
  schedule resources for Blue Gene jobs. For such systems, serial jobs can be
  used to submit work to the front end node for the Blue Gene system.

LoadLeveler for AIX and LoadLeveler for Linux compatibility

LoadLeveler for Linux is compatible with LoadLeveler for AIX. Its command line
interfaces, graphical user interfaces, and application programming interfaces
(APIs) are the same as they have been for AIX. The formats of the job command
file, configuration file, and administration file also remain the same.

System administrators can set up and maintain a LoadLeveler cluster consisting
of some machines running LoadLeveler for AIX and some machines running
LoadLeveler for Linux. This is called a mixed cluster. In a mixed cluster,
jobs can be submitted from either AIX or Linux machines. Jobs submitted to a
Linux job queue can be dispatched to an AIX machine for execution, and jobs
submitted to an AIX job queue can be dispatched to a Linux machine for
execution.

Although the LoadLeveler products for AIX and Linux are compatible, they do
have some differences in the level of support for specific features. For
further details, see the following topics:
v "Restrictions for LoadLeveler for Linux" on page 36
v "Features not supported in LoadLeveler for Linux" on page 36
v "Restrictions for LoadLeveler for AIX and LoadLeveler for Linux mixed
  clusters" on page 37
Restrictions for LoadLeveler for Linux

LoadLeveler for Linux supports a subset of the features that are available in
the LoadLeveler for AIX product. The following features are available, but are
subject to restrictions:
v 32-bit applications using the LoadLeveler APIs
  LoadLeveler for Linux supports only the 32-bit LoadLeveler API library
  (libllapi.so) on the following platforms:
  - RHEL 4 and RHEL 5 on IBM IA-32 xSeries® servers
  - SLES 9 and SLES 10 on IBM IA-32 xSeries servers
  Applications linked to the LoadLeveler APIs on these platforms must be
  32-bit applications.
v 64-bit applications using the LoadLeveler APIs
  LoadLeveler for Linux supports only the 64-bit LoadLeveler API library
  (libllapi.so) on the following platforms:
  - RHEL 4 and RHEL 5 on IBM xSeries servers with AMD Opteron or Intel EM64T
    processors
  - RHEL 4 and RHEL 5 on POWER™ servers
  - SLES 9 and SLES 10 on IBM xSeries servers with AMD Opteron or Intel EM64T
    processors
  - SLES 9 and SLES 10 on POWER servers
  Applications linked to the LoadLeveler APIs on these platforms must be
  64-bit applications.
v Support for AFS file systems
  LoadLeveler for Linux support for authenticated access to AFS file systems
  is limited to RHEL 4 on xSeries servers and IBM xSeries servers with AMD
  Opteron or Intel EM64T processors. It is not available on systems running
  SLES 9 or SLES 10.

Features not supported in LoadLeveler for Linux

LoadLeveler for Linux supports a subset of the features that are available in
the LoadLeveler for AIX product. The following features are not supported:
v RDMA consumable resource
  On systems with High Performance Switch adapters, RDMA consumable resources
  are not supported on LoadLeveler for Linux.
v User context RDMA blocks
  User context RDMA blocks are not supported by LoadLeveler for Linux.
v Checkpoint/restart
  LoadLeveler for AIX uses a number of features that are specific to the AIX
  kernel to provide support for checkpoint/restart of user applications
  running under LoadLeveler. Checkpoint/restart is not available in this
  release of LoadLeveler for Linux.
v AIX Workload Manager (WLM)
  WLM can strictly control the use of system resources. LoadLeveler for AIX
  uses WLM to enforce the use of a number of consumable resources defined by
  LoadLeveler (such as ConsumableCpus, ConsumableVirtualMemory,
  ConsumableLargePageMemory, and ConsumableMemory). This enforcement of
  consumable resource usage through WLM is not available in this release of
  LoadLeveler for Linux.
v CtSec security
  LoadLeveler for AIX can exploit CtSec (Cluster Security Services) security
  functions. These functions authenticate the identity of users and programs
  interacting with LoadLeveler. These features are not available in this
  release of LoadLeveler for Linux.
v LoadL_GSmonitor daemon
  The LoadL_GSmonitor daemon in the LoadLeveler for AIX product uses the Group
  Services Application Programming Interface (GSAPI) to monitor machine
  availability and notify the LoadLeveler central manager when a machine is no
  longer reachable. This daemon is not available in the LoadLeveler for Linux
  product.
v Task guide tool
v System error log
  Each LoadLeveler daemon has its own log file where information relevant to
  its operation is recorded. In addition to this feature, which exists on all
  platforms, LoadLeveler for AIX also uses the errlog function to record
  critical LoadLeveler events into the AIX system log. Support for an
  equivalent Linux function is not available in this release.

Restrictions for LoadLeveler for AIX and LoadLeveler for Linux mixed clusters

Several restrictions apply when operating a LoadLeveler cluster that contains
AIX 6.1, AIX 5.3, and Linux machines:
v The central manager node must run a version of LoadLeveler equal to or
  higher than any LoadLeveler version being run on a node in the cluster.
v CtSec security features cannot be used.
v AIX jobs that use checkpointing must be sent to AIX nodes for execution.
  This can be done either by defining and specifying job checkpointing for job
  classes that exist only on AIX nodes or by coding appropriate requirements
  expressions. Checkpointing jobs that are sent to a Linux node will be
  rejected by the LoadL_startd daemon running on the Linux node.
v WLM is supported in a mixed cluster. However, enforcement of the use of
  consumable resources will occur through WLM on AIX nodes only.
Part 2. Configuring and managing the TWS LoadLeveler environment

After installing IBM Tivoli Workload Scheduler (TWS) LoadLeveler, you may
customize it by modifying both the configuration file and the administration
file (see Part 1, "Overview of TWS LoadLeveler concepts and operation," on
page 1 for overview information).

The configuration file contains many parameters that you can set or modify to
control how TWS LoadLeveler operates. The administration file optionally lists
and defines the machines in the TWS LoadLeveler cluster and the
characteristics of classes, users, and groups.

To manage TWS LoadLeveler easily, you should have one global configuration
file and only one administration file, both centrally located on a machine in
the TWS LoadLeveler cluster. Every other machine in the cluster must be able
to read the configuration and administration files that are located on the
central machine.

You may have multiple local configuration files that specify information
specific to individual machines. TWS LoadLeveler does not prevent you from
having multiple copies of administration files, but you need to be sure to
update all the copies whenever you make a change to one. Having only one
administration file prevents any confusion.
Chapter 4. Configuring the LoadLeveler environment

One of your main tasks as system administrator is to configure LoadLeveler. To
configure LoadLeveler, you need to know what the configuration information is
and where it is located. Configuration information includes the following:
v The LoadLeveler user ID and group ID
v The configuration directory
v The global configuration file

Configuring LoadLeveler involves modifying the configuration files that
specify the terms under which LoadLeveler can use machines. There are two
types of configuration files:
v Global configuration file: By default, this file is called LoadL_config and
  it contains configuration information common to all nodes in the LoadLeveler
  cluster.
v Local configuration file: This file is generally called LoadL_config.local
  (although it is possible for you to rename it). This file contains specific
  configuration information for an individual node. The LoadL_config.local
  file is in the same format as LoadL_config, and the information in this file
  overrides any information specified in LoadL_config. It is an optional file
  that you use to modify information on a local machine. Its full path name is
  specified in the LoadL_config file by using the LOCAL_CONFIG keyword, as
  shown in the sketch that follows this list. See "Specifying file and
  directory locations" on page 47 for more information.
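For example, the global file can point each machine at its own local file with
a line like the following sketch. Keeping the local file in the LoadLeveler
home directory is an assumption for this example; any per-machine path that
every node can resolve works as well:

   # In LoadL_config: each machine reads its own local overrides
   LOCAL_CONFIG = $(tilde)/LoadL_config.local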
Table 9 identifies where you can find more information about using
configuration and administration files to modify the TWS LoadLeveler
environment.

Table 9. Roadmap of tasks for TWS LoadLeveler administrators

   To learn about:                           Read the following:
   Controlling how TWS LoadLeveler           Chapter 4, "Configuring the
   operates by customizing the global or     LoadLeveler environment"
   local configuration file
   Controlling TWS LoadLeveler resources     Chapter 5, "Defining LoadLeveler
   by customizing an administration file     resources to administer," on page 83
   Additional ways to modify TWS             Chapter 6, "Performing additional
   LoadLeveler that require customization    administrator tasks," on page 103
   of both the configuration and
   administration files
   Ways to control or monitor TWS            v Chapter 16, "Commands," on page 411
   LoadLeveler operations by using the       v Chapter 7, "Using LoadLeveler's GUI
   TWS LoadLeveler commands, GUI, and          to perform administrator tasks," on
   APIs                                        page 169
                                             v Chapter 17, "Application programming
                                               interfaces (APIs)," on page 541

You can run your installation with default values set by LoadLeveler, or you
can change any or all of them. Table 10 lists topics that discuss how you may
configure the LoadLeveler environment by modifying the configuration file.

Table 10. Roadmap of administrator tasks related to using or modifying the
LoadLeveler configuration file

   To learn about:                Read the following:
   Using the default              Chapter 2, "Getting a quick start using the
   configuration files shipped    default configuration," on page 29
   with LoadLeveler
   Modifying the global and       "Modifying a configuration file"
   local configuration files
   Defining major elements of     v "Defining LoadLeveler administrators" on
   the LoadLeveler                  page 43
   configuration                  v "Defining a LoadLeveler cluster" on page 44
                                  v "Defining LoadLeveler machine
                                    characteristics" on page 54
                                  v "Defining security mechanisms" on page 56
                                  v "Defining usage policies for consumable
                                    resources" on page 60
                                  v "Steps for configuring a LoadLeveler
                                    multicluster" on page 151
   Enabling optional              v "Enabling support for bulk data transfer
   LoadLeveler functions            and rCxt blocks" on page 61
                                  v "Gathering job accounting data" on page 61
                                  v "Managing job status through control
                                    expressions" on page 68
                                  v "Tracking job processes" on page 70
                                  v "Querying multiple LoadLeveler clusters" on
                                    page 71
   Modifying LoadLeveler          "Providing additional job-processing controls
   operations through             through installation exits" on page 72
   installation exits

Modifying a configuration file

The configuration files that come with LoadLeveler contain many parameters
that you can set. In most cases, you will only have to modify a few of them.
In some cases, though, depending upon the LoadLeveler nodes, network
connection, and hardware availability, you may need to modify additional
parameters.

All LoadLeveler commands, daemons, and processes read the administration and
configuration files at startup time. If you change the administration or
configuration files after LoadLeveler has already started, any LoadLeveler
command or process, such as the LoadL_starter process, will read the newer
version of the files, while the running daemons will continue to use the data
from the older version. To ensure that all LoadLeveler commands, daemons, and
processes use the same configuration data, run the reconfiguration command on
all machines in the cluster each time the administration or configuration
files are changed.

To override the defaults, update the following keywords in the /etc/LoadL.cfg
file:

LoadLUserid
   Specifies the LoadLeveler user ID.
LoadLGroupid
   Specifies the LoadLeveler group ID.

LoadLConfig
   Specifies the full path name of the configuration file.

Note that if you change the LoadLeveler user ID to something other than loadl,
you will have to make sure your configuration files are owned by this ID.

If Cluster Security (CtSec) services are enabled, make sure you update the
unix.map file if the LoadLUserid is specified as something other than loadl.
Refer to "Steps for enabling CtSec services" on page 58 for more details.

You can also override the /etc/LoadL.cfg file. For an example of when you
might want to do this, see "Querying multiple LoadLeveler clusters" on page 71.

Before you modify a configuration file, you need to:
v Ensure that the installation procedure has completed successfully and that
  the configuration file, LoadL_config, exists in LoadLeveler's home directory
  or in the directory specified in /etc/LoadL.cfg. For additional details
  about installation, see TWS LoadLeveler: Installation Guide.
v Know how to correctly specify keywords in the configuration file. For
  information about configuration file keyword syntax and other details, see
  Chapter 12, "Configuration file reference," on page 263.
v Identify yourself as the LoadLeveler administrator using the LOADL_ADMIN
  keyword.

After you finish modifying the configuration file, notify the LoadLeveler
daemons by issuing the llctl command with either the reconfig or recycle
keyword. Otherwise, LoadLeveler will not process the modifications you made to
the configuration file.

Defining LoadLeveler administrators

Specify the LOADL_ADMIN keyword with a list of user names of those individuals
who will have administrative authority. These users are able to invoke the
administrator-only commands such as llctl, llfavorjob, and llfavoruser. These
administrators can also invoke the administrator-only GUI functions. For more
information, see Chapter 7, "Using LoadLeveler's GUI to perform administrator
tasks," on page 169.

LoadLeveler administrators on this list also receive mail describing problems
that are encountered by the master daemon. When CtSec is enabled, the
LOADL_ADMIN list is used only as a mailing list. For more information, see
"Defining security mechanisms" on page 56.

Administrative authority on a machine grants privileges only on that machine;
it does not grant administrative privileges on other machines. To be an
administrator on all machines in the LoadLeveler cluster, either specify your
user ID in the global configuration file with no entries in the local
configuration file, or specify your user ID in every local configuration file
that exists in the LoadLeveler cluster.

For information about configuration file keyword syntax and other details, see
Chapter 12, "Configuration file reference," on page 263.
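As an illustration, the following sketch shows an /etc/LoadL.cfg that
overrides the defaults, together with a LOADL_ADMIN list in the global
configuration file. The user names, group ID values, and path are assumptions
for the example:

   # /etc/LoadL.cfg: override the default user, group, and configuration path
   LoadLUserid  = loadl
   LoadLGroupid = loadl
   LoadLConfig  = /home/loadl/LoadL_config

   # In LoadL_config: users with administrative authority
   LOADL_ADMIN = loadl brad marsha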
Defining a LoadLeveler cluster

It will be necessary to define the characteristics of the LoadLeveler cluster.
Table 11 lists the topics that discuss how you can do so.

Table 11. Roadmap for defining LoadLeveler cluster characteristics

   To learn about:                Read the following:
   Defining characteristics of   v "Choosing a scheduler"
   specific LoadLeveler          v "Setting negotiator characteristics and
   daemons                         policies" on page 45
                                  v "Specifying alternate central managers" on
                                    page 46
   Defining other cluster        v "Defining network characteristics" on page 47
   characteristics               v "Specifying file and directory locations" on
                                    page 47
                                  v "Configuring recording activity and log
                                    files" on page 48
                                  v "Setting up file system monitoring" on
                                    page 54
   Correctly specifying           Chapter 12, "Configuration file reference," on
   configuration file keywords    page 263
   Working with daemons and      v "llctl - Control LoadLeveler daemons" on
   machines in a LoadLeveler       page 439
   cluster                       v "llinit - Initialize machines in the
                                    LoadLeveler cluster" on page 457

Choosing a scheduler

This topic discusses the types of schedulers available, which you may specify
using the configuration file keyword SCHEDULER_TYPE. For information about the
configuration file keyword syntax and other details, see Chapter 12,
"Configuration file reference," on page 263.

LL_DEFAULT
   This scheduler runs serial jobs. It uses CPU time efficiently by scheduling
   jobs on what otherwise would be idle nodes (and workstations). It does not
   require that users set a wall clock limit. Also, this scheduler starts,
   suspends, and resumes jobs based on workload.

BACKFILL
   This scheduler runs both serial and parallel jobs. The objective of
   BACKFILL scheduling is to maximize the use of resources to achieve the
   highest system efficiency, while preventing potentially excessive delays in
   starting jobs with large resource requirements. These large jobs can run
   because the BACKFILL scheduler does not allow jobs with smaller resource
   requirements to continuously use up resources before the larger jobs can
   accumulate enough resources to run.
   The BACKFILL scheduler supports:
   v The scheduling of multiple tasks per node
   v The scheduling of multiple user space tasks per adapter
   v The preemption of jobs
   v The use of reservations
   v The scheduling of inbound and outbound data staging tasks
   v Scale-across scheduling that allows you to take advantage of
     underutilized resources in a multicluster installation
   These functions are not supported by the default LoadLeveler scheduler. For
   more information about the BACKFILL scheduler, see "Using the BACKFILL
   scheduler" on page 110.

API
   This keyword option allows you to enable an external scheduler, such as the
   Extensible Argonne Scheduling sYstem (EASY). The API option is intended for
   installations that want to create a scheduling algorithm for parallel jobs
   based on site-specific requirements. For more information about external
   schedulers, see "Using an external scheduler" on page 115.
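For example, to select the BACKFILL scheduler you would set a single keyword
in the global configuration file, as in this sketch (remember to run llctl
reconfig afterward so the change takes effect):

   # Choose the scheduler type: LL_DEFAULT, BACKFILL, or API
   SCHEDULER_TYPE = BACKFILL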
Setting negotiator characteristics and policies

You may set the following negotiator characteristics and policies. For
information about configuration file keyword syntax and other details, see
Chapter 12, "Configuration file reference," on page 263.

v Prioritize the queue maintained by the negotiator
  Each job step submitted to LoadLeveler is assigned a system priority number,
  based on the evaluation of the SYSPRIO keyword expression in the
  configuration file of the central manager (a sketch of typical expressions
  appears at the end of this topic). The LoadLeveler system priority number is
  assigned when the central manager adds the new job step to the queue of job
  steps eligible for dispatch. Once assigned, the system priority number for a
  job step is not changed, except under the following circumstances:
  - An administrator or user issues the llprio command to change the system
    priority of the job step.
  - The value set for the NEGOTIATOR_RECALCULATE_SYSPRIO_INTERVAL keyword is
    not zero.
  - An administrator uses the llmodify command with the -s option to alter
    the system priority of a job step.
  - A program with administrator credentials uses the ll_modify subroutine to
    alter the system priority of a job step.
  Job steps assigned higher SYSPRIO numbers are considered for dispatch before
  job steps with lower numbers. For related information, see the following
  topics:
  - "Controlling the central manager scheduling cycle" on page 73
  - "Setting and changing the priority of a job" on page 230
  - "llmodify - Change attributes of a submitted job step" on page 464
  - "ll_modify subroutine" on page 677
v Prioritize the order of executing machines maintained by the negotiator
  Each executing machine is assigned a machine priority number, based on the
  evaluation of the MACHPRIO keyword expression in the configuration file of
  the central manager. The LoadLeveler machine priority number is updated
  every time the central manager updates its machine data. Machines assigned
  higher MACHPRIO numbers are considered to run jobs before machines with
  lower numbers. For example, a machine with a MACHPRIO of 10 is considered to
  run a job before a machine with a MACHPRIO of 5. Similarly, a machine with a
  MACHPRIO of -2 would be considered to run a job before a machine with a
  MACHPRIO of -3.
  Note that the MACHPRIO keyword is valid only on the machine where the
  central manager is running. Using this keyword in a local configuration file
  has no effect.
  When you use a MACHPRIO expression that is based on load average, the
  machine may be temporarily ordered later in the list immediately after a job
  is scheduled to that machine. This temporary drop in priority happens
  because the negotiator adds a compensating factor to the startd machine's
  load average every time the negotiator assigns a job. For more information,
  see the NEGOTIATOR_LOADAVG_INCREMENT keyword.
v Specify additional negotiator policies
  This topic lists keywords that were not mentioned in the previous
  configuration steps. Unless your installation has special requirements for
  any of these keywords, you can use them with their default settings:
  - NEGOTIATOR_INTERVAL
  - NEGOTIATOR_CYCLE_DELAY
  - NEGOTIATOR_CYCLE_TIME_LIMIT
  - NEGOTIATOR_LOADAVG_INCREMENT
  - NEGOTIATOR_PARALLEL_DEFER
  - NEGOTIATOR_PARALLEL_HOLD
  - NEGOTIATOR_RECALCULATE_SYSPRIO_INTERVAL
  - NEGOTIATOR_REJECT_DEFER
  - NEGOTIATOR_REMOVE_COMPLETED
  - NEGOTIATOR_RESCAN_QUEUE
  - SCALE_ACROSS_SCHEDULING_TIMEOUT
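The following sketch shows typical SYSPRIO and MACHPRIO expressions: they
order the job queue by submission time and prefer lightly loaded machines.
Treat these lines as an illustrative assumption rather than required settings
(QDate and LoadAvg are LoadLeveler variables, and both keywords belong in the
central manager's configuration file):

   # FIFO ordering: earlier-submitted job steps get higher system priority
   SYSPRIO : 0 - (QDate)

   # Prefer machines with the lowest load average
   MACHPRIO : 0 - (LoadAvg)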
Specifying alternate central managers

In one of the machine stanzas specified in the administration file, you
specified that the machine would serve as the central manager. It is possible
for a problem, such as a network communication, software, or hardware failure,
to make this central manager unusable. In such cases, the other machines in
the LoadLeveler cluster believe that the central manager machine is no longer
operating. To remedy this situation, you can assign one or more alternate
central managers in the machine stanza to take control.

The following machine stanza example defines the machine deep_blue as an
alternate central manager:

   deep_blue: type=machine
   central_manager = alt

If the primary central manager fails, the alternate central manager then
becomes the central manager. The alternate central manager is chosen based
upon the order in which its respective machine stanza appears in the
administration file.

When an alternate becomes the central manager, jobs will not be lost, but it
may take a few minutes for all of the machines in the cluster to check in with
the new central manager. As a result, job status queries may be incorrect for
a short time.

When you define alternate central managers, you should set the following
keywords in the configuration file:
v CENTRAL_MANAGER_HEARTBEAT_INTERVAL
v CENTRAL_MANAGER_TIMEOUT

In the following example, the alternate central manager will wait for 30
intervals, where each interval is 45 seconds:

   # Set a 45 second interval
   CENTRAL_MANAGER_HEARTBEAT_INTERVAL = 45
   # Set the number of intervals to wait
   CENTRAL_MANAGER_TIMEOUT = 30

For more information on central manager backup, refer to "What happens if the
central manager isn't operating?" on page 708. For information about
configuration file keyword syntax and other details, see Chapter 12,
"Configuration file reference," on page 263.

Defining network characteristics

A port number is an integer that specifies the port to use to connect to the
specified daemon. You can define these port numbers in the configuration file
or the /etc/services file, or you can accept the defaults. LoadLeveler first
looks in the configuration file for these port numbers. If LoadLeveler does
not find the value in the configuration file, it looks in the /etc/services
file. If the value is not found in this file, the default is used. See
Appendix C, "LoadLeveler port usage," on page 741 for more information.

Specifying file and directory locations

The configuration file provided with LoadLeveler specifies default locations
for all of the files and directories. You can modify their locations using the
keywords shown in Table 12. Keep in mind that the LoadLeveler installation
process installs files in these directories, and these files may be
periodically cleaned up. Therefore, you should not keep any files that do not
belong to LoadLeveler in these directories.

Managing distributed software systems is a primary concern for all system
administrators. Allowing users to share file systems to obtain a single,
network-wide image is one way to make managing LoadLeveler easier.

Table 12. Default locations for all of the files and directories

   To specify the
   location of the:    Specify this keyword:
   Administration      ADMIN_FILE
   file
   Local               LOCAL_CONFIG
   configuration
   file
   Local directory     The following subdirectories reside in the local
                       directory. It is possible that the local directory and
                       LoadLeveler's home directory are the same.
                       v COMM
                       v EXECUTE
                       v LOG
                       v SPOOL and HISTORY
                       Tip: To maximize performance, keep the log, spool, and
                       execute directories in a local file system. Also, to
                       measure the performance of your network, consider using
                       one of the available products, such as Toolbox/6000.
Table 12. Default locations for all of the files and directories (continued)

   To specify the
   location of the:    Specify this keyword:
   Release directory   RELEASEDIR
                       The following subdirectories are created during
                       installation and reside in the release directory. You
                       can change their locations.
                       v BIN
                       v LIB
   Core dump           You may specify alternate directories to hold core
   directory           dumps for the daemons and starter process:
                       v MASTER_COREDUMP_DIR
                       v NEGOTIATOR_COREDUMP_DIR
                       v SCHEDD_COREDUMP_DIR
                       v STARTD_COREDUMP_DIR
                       v GSMONITOR_COREDUMP_DIR
                       v KBDD_COREDUMP_DIR
                       v STARTER_COREDUMP_DIR

When specifying core dump directories, be sure that the access permissions are
set so the LoadLeveler daemon or process can write to the core dump directory.
The permissions set for path names specified in the keywords just mentioned
must allow writing by both root and the LoadLeveler ID. The permissions set
for the path name specified for the STARTER_COREDUMP_DIR keyword must allow
writing by root, the LoadLeveler ID, and any user who can submit LoadLeveler
jobs. The simplest way to be sure the access permissions are set correctly is
to set them the same as those of the /tmp directory.

If a problem with access permissions prevents a LoadLeveler daemon or process
from writing to a core dump directory, a message will be written to the log,
and the daemon or process will continue using the default /tmp directory for
core files.

For information about configuration file keyword syntax and other details, see
Chapter 12, "Configuration file reference," on page 263.
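For instance, to place starter core files somewhere other than /tmp, you might
add a line like the following to the configuration file and give the directory
/tmp-like permissions. The path is an assumption for this example:

   # In LoadL_config: alternate core dump directory for the starter process
   STARTER_COREDUMP_DIR = /var/loadl/cores

   # Shell commands to create the directory with the same mode as /tmp
   mkdir -p /var/loadl/cores
   chmod 1777 /var/loadl/cores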
  • 69. “Saving log files” on page 53 describes the configuration keyword to use to save logs for problem diagnosis. For information about configuration file keyword syntax and other details, see Chapter 12, “Configuration file reference,” on page 263. Table 13. Log control statements Daemon/ Log File (required) Max Length (required) Debug Control (required) Process (See note 1) (See note 2) (See note 4 on page 50) Master MASTER_LOG = MAX_MASTER_LOG = bytes [buffer MASTER_DEBUG = flags [buffer path bytes] flags] Schedd SCHEDD_LOG = MAX_SCHEDD_LOG = bytes [buffer SCHEDD_DEBUG = flags [buffer path bytes] flags] Startd STARTD_LOG = path MAX_STARTD_LOG = bytes [buffer STARTD_DEBUG = flags [buffer bytes] flags] Starter STARTER_LOG = MAX_STARTER_LOG = bytes [buffer STARTER_DEBUG = flags [buffer path bytes] flags] Negotiator NEGOTIATOR_LOG MAX_NEGOTIATOR_LOG = bytes NEGOTIATOR_DEBUG = flags = path [buffer bytes] [buffer flags] Kbdd KBDD_LOG = path MAX_KBDD_LOG = bytes [buffer KBDD_DEBUG = flags [buffer bytes] flags] GSmonitor GSMONITOR_LOG MAX_GSMONITOR_LOG = bytes GSMONITOR_DEBUG = flags = path [buffer bytes] [buffer flags] where: buffer bytes Is the size of the circular buffer. The default value is 0, which indicates that the buffer is disabled. To prevent the daemon from running out of memory, this value should not be too large. Brackets must be used to specify buffer bytes. buffer flags Indicates that messages with buffer flags in addition to messages with flags will be stored in the circular buffer in memory. The default value is blank, which indicates that the logging buffer is disabled because no additional debug flags were specified for buffering. Brackets must be used to specify buffer flags. Note: 1. When coding the path for the log files, it is not necessary that all LoadLeveler daemons keep their log files in the same directory, however, you will probably find it a convenient arrangement. 2. There is a maximum length, in bytes, beyond which the various log files cannot grow. Each file is allowed to grow to the specified length and is then saved to an .old file. The .old files are overwritten each time the log is saved, thus the maximum space devoted to logging for any one program will be twice the maximum length of its log file. The default length is 64 KB. To obtain records over a longer period of time, that do not get overwritten, you can use the SAVELOGS keyword in the local or global configuration files. See “Saving log files” on page 53 for more information on extended capturing of LoadLeveler logs. Chapter 4. Configuring the LoadLeveler environment 49
   You can also specify that the log file be started anew with every
   invocation of the daemon by setting the appropriate TRUNC statement to
   true:
   v TRUNC_MASTER_LOG_ON_OPEN = true|false
   v TRUNC_STARTD_LOG_ON_OPEN = true|false
   v TRUNC_SCHEDD_LOG_ON_OPEN = true|false
   v TRUNC_KBDD_LOG_ON_OPEN = true|false
   v TRUNC_STARTER_LOG_ON_OPEN = true|false
   v TRUNC_NEGOTIATOR_LOG_ON_OPEN = true|false
   v TRUNC_GSMONITOR_LOG_ON_OPEN = true|false
3. LoadLeveler creates temporary log files used by the starter daemon. These
   files are used for synchronization purposes. When a job starts, a
   StarterLog.pid file is created. When the job ends, this file is appended to
   the StarterLog file.
4. Normally, only those who are installing or debugging LoadLeveler will need
   to use the debug flags, described in "Controlling debugging output" on page
   51. The default error logging, obtained by leaving the right side of the
   debug control statement null, will be sufficient for most installations.

Controlling the logging buffer

LoadLeveler allows a LoadLeveler daemon to store log messages in a buffer in
memory instead of writing the messages to a log file. The administrator can
force the messages in this buffer to be written to the log file, when
necessary, to diagnose a problem. The buffer is circular and, once it is full,
older messages are discarded as new messages are logged.

The llctl dumplogs command is used to write the contents of the logging buffer
to the appropriate log file for the Master, Negotiator, Schedd, and Startd
daemons.

Buffering will be disabled if either the buffer length is 0 or no additional
debug flags are specified for buffering.

See "Configuring recording activity and log files" on page 48 for log control
statement specifications. See TWS LoadLeveler: Diagnosis and Messages Guide
for additional information on TWS LoadLeveler log files.

Logging buffer example

With the following configuration, the Schedd daemon will write only D_ALWAYS
and D_SCHEDD messages to the ${LOG}/SchedLog log file. The following messages
will be stored in the buffer:
v D_ALWAYS
v D_SCHEDD
v D_LOCKING

The maximum size of the Schedd log is 64 MB and the size of the logging buffer
is 32 MB:

   SCHEDD_LOG = ${LOG}/SchedLog
   MAX_SCHEDD_LOG = 64000000 [32000000]
   SCHEDD_DEBUG = D_SCHEDD [D_LOCKING]

To write the contents of the logging buffer to the SchedLog file on the local
machine, issue:

   llctl dumplogs
To write the contents of the logging buffer to the SchedLog file on node1 in
the LoadLeveler cluster, issue:

   llctl -h node1 dumplogs

To write the contents of the logging buffers to the SchedLog files on all
machines, issue:

   llctl -g dumplogs

Note that the messages written from the logging buffer include bracketing
messages and a prefix to identify them easily:

   =======================BUFFER BEGIN========================
   BUFFER: message .....
   BUFFER: message .....
   =======================BUFFER END==========================

Controlling debugging output

You can control the level of debugging output logged by LoadLeveler programs.
The following flags are presented here for your information, though they are
used primarily by IBM personnel for debugging purposes:

D_ACCOUNT
   Logs accounting information about processes. If used, it may slow down the
   network.

D_ACCOUNT_DETAIL
   Logs detailed accounting information about processes. If used, it may slow
   down the network and increase the size of log files.

D_ADAPTER
   Logs messages related to adapters.

D_AFS
   Logs information related to AFS credentials.

D_CKPT
   Logs information related to checkpoint and restart.

D_DAEMON
   Logs information regarding basic daemon setup and operation, including
   information on the communication between daemons.

D_DBX
   Bypasses certain signal settings to permit debugging of the processes as
   they execute in certain critical regions.

D_EXPR
   Logs steps in parsing and evaluating control expressions.

D_FAIRSHARE
   Displays messages related to fair share scheduling in the daemon logs. In
   the global configuration file, D_FAIRSHARE can be added to SCHEDD_DEBUG and
   NEGOTIATOR_DEBUG.

D_FULLDEBUG
   Logs details about most actions performed by each daemon, but doesn't log
   as much activity as setting all the flags.

D_HIERARCHICAL
   Enables messages relating to problems with the transmission of hierarchical
   messages. A hierarchical message is sent from an originating node to lower
   ranked receiving nodes.

D_JOB
   Logs job requirements and preferences when making decisions regarding
   whether a particular job should run on a particular machine.
D_KERNEL
   Activates diagnostics for errors involving the process tracking kernel
   extension.

D_LOAD
   Displays the load average on the startd machine.

D_LOCKING
   Logs requests to acquire and release locks.

D_LXCPUAFNT
   Logs messages related to Linux CPU affinity. This flag is only valid for
   the startd daemon.

D_MACHINE
   Logs machine control functions and variables when making decisions
   regarding starting, suspending, resuming, and aborting remote jobs.

D_MUSTER
   Logs information related to multicluster processing.

D_NEGOTIATE
   Displays the process of looking for a job to run in the negotiator. It only
   pertains to this daemon.

D_PCRED
   Directs that extra debug information should be written to a file if the
   setpcred() function call fails.

D_PROC
   Logs information about jobs being started remotely, such as the number of
   bytes fetched and stored for each job.

D_QUEUE
   Logs changes to the job queue.

D_REFCOUNT
   Logs activity associated with reference counting of internal LoadLeveler
   objects.

D_RESERVATION
   Logs reservation information in the negotiator and Schedd daemon logs.
   D_RESERVATION can be added to SCHEDD_DEBUG and NEGOTIATOR_DEBUG.

D_RESOURCE
   Logs messages about the management and consumption of resources. These
   messages are recorded in the negotiator log.

D_SCHEDD
   Displays how the Schedd works internally.

D_SDO
   Displays messages detailing LoadLeveler objects being transmitted between
   daemons and commands.

D_SECURITY
   Logs information related to Cluster Security (CtSec) services identities.

D_SPOOL
   Logs information related to the usage of databases in the LoadLeveler spool
   directory.

D_STANZAS
   Displays internal information about the parsing of the administration file.

D_STARTD
   Displays how the startd works internally.

D_STARTER
   Displays how the starter works internally.

D_STREAM
   Displays messages detailing socket I/O.
D_SWITCH
   Logs entries related to switch activity and LoadLeveler Switch Table Object
   data.

D_THREAD
   Displays the ID of the thread producing the log message. The thread ID is
   displayed immediately following the date and time. This flag is useful for
   debugging threaded daemons.

D_XDR
   Logs information regarding External Data Representation (XDR) communication
   protocols.

For example:

   SCHEDD_DEBUG = D_CKPT D_XDR

causes the Schedd daemon to log information about checkpointing user jobs and
to exchange XDR messages with other LoadLeveler daemons. These flags will
primarily be of interest to LoadLeveler implementers and debuggers.

The LL_COMMAND_DEBUG environment variable can be set to a string of debug
flags in the same way as the *_DEBUG configuration keywords are set. Normally,
LoadLeveler commands and APIs do not print debug messages, but with this
environment variable set, the requested classes of debugging messages will be
logged to stderr. For example:

   LL_COMMAND_DEBUG="D_ALWAYS D_STREAM" llstatus

causes the llstatus command to print debug messages related to I/O to stderr.

Saving log files

By default, LoadLeveler stores only the two most recent iterations of a
daemon's log file (<daemon name>Log and <daemon name>Log.old). Occasionally,
for problem diagnosis, users will need to capture LoadLeveler logs over an
extended period. Users can specify that all log files be saved to a particular
directory by using the SAVELOGS keyword in a local or global configuration
file. Be aware that LoadLeveler does not provide any way to manage and clean
out all of those log files, so users must be sure to specify a directory in a
file system with enough space to accommodate them. This file system should be
separate from the one used for the LoadLeveler log, spool, and execute
directories.

Each log file is represented by the name of the daemon that generated it, the
exact time the file was generated, and the name of the machine on which the
daemon is running. When you list the contents of the SAVELOGS directory, the
list of log file names looks like this:

   NegotiatorLogNov02.16:10:39.123456.c163n10.ppd.pok.ibm.com
   NegotiatorLogNov02.16:10:42.987654.c163n10.ppd.pok.ibm.com
   NegotiatorLogNov02.16:10:46.564123.c163n10.ppd.pok.ibm.com
   NegotiatorLogNov02.16:10:48.234345.c163n10.ppd.pok.ibm.com
   NegotiatorLogNov02.16:10:51.123456.c163n10.ppd.pok.ibm.com
   NegotiatorLogNov02.16:10:53.566987.c163n10.ppd.pok.ibm.com
   StarterLogNov02.16:09:19.622387.c163n10.ppd.pok.ibm.com
   StarterLogNov02.16:09:51.499823.c163n10.ppd.pok.ibm.com
   StarterLogNov02.16:10:30.876546.c163n10.ppd.pok.ibm.com
   SchedLogNov02.16:09:05.543677.c163n10.ppd.pok.ibm.com
   SchedLogNov02.16:09:26.688901.c163n10.ppd.pok.ibm.com
   SchedLogNov02.16:09:47.443556.c163n10.ppd.pok.ibm.com
   SchedLogNov02.16:10:12.712680.c163n10.ppd.pok.ibm.com
   SchedLogNov02.16:10:37.342156.c163n10.ppd.pok.ibm.com
   StartLogNov02.16:09:05.697753.c163n10.ppd.pok.ibm.com
   StartLogNov02.16:09:26.881234.c163n10.ppd.pok.ibm.com
   StartLogNov02.16:09:47.231234.c163n10.ppd.pok.ibm.com
   StartLogNov02.16:10:12.125556.c163n10.ppd.pok.ibm.com
   StartLogNov02.16:10:37.961486.c163n10.ppd.pok.ibm.com

For information about configuration file keyword syntax and other details, see
Chapter 12, "Configuration file reference," on page 263.

Setting up file system monitoring

You can use the file system keywords to monitor the file system space or
inodes used by LoadLeveler for:
v Logs
v Saving executables
v Spool information
v History files

You can also use the file system keywords to take preventive action and avoid
problems caused by running out of file system space or inodes. You do this by
setting the frequency with which LoadLeveler checks the file system free space
or inodes, and by setting the upper and lower thresholds that initiate system
responses to the free space or inodes available. By setting a realistic span
between the lower and upper thresholds, you will avoid excessive system
actions.

The file system monitoring keywords are:
v FS_INTERVAL
v FS_NOTIFY
v FS_SUSPEND
v FS_TERMINATE
v INODE_NOTIFY
v INODE_SUSPEND
v INODE_TERMINATE

For information about configuration file keyword syntax and other details, see
Chapter 12, "Configuration file reference," on page 263.

Defining LoadLeveler machine characteristics

You can use the following keywords to define the characteristics of machines
in the LoadLeveler cluster. For information about configuration file keyword
syntax and other details, see Chapter 12, "Configuration file reference," on
page 263.
v ARCH
v CLASS
v CUSTOM_METRIC
v CUSTOM_METRIC_COMMAND
v FEATURE
v GSMONITOR_RUNS_HERE
v MAX_STARTERS
v SCHEDD_RUNS_HERE
v SCHEDD_SUBMIT_AFFINITY
v STARTD_RUNS_HERE
v START_DAEMONS
v VM_IMAGE_ALGORITHM
v X_RUNS_HERE

Defining job classes that a LoadLeveler machine will accept

There are a number of possible ways of defining job classes. The following
examples illustrate some of them.

v Example 1
  This example specifies multiple classes:

     Class = No_Class(2)

  or

     Class = { "No_Class" "No_Class" }

  The machine will only run jobs that have either defaulted to or explicitly
  requested class No_Class. A maximum of two LoadLeveler jobs are permitted to
  run simultaneously on the machine if the MAX_STARTERS keyword is not
  specified. See "Specifying how many jobs a machine can run" for more
  information on MAX_STARTERS.

v Example 2
  This example specifies multiple classes:

     Class = No_Class(1) Small(1) Medium(1) Large(1)

  or

     Class = { "No_Class" "Small" "Medium" "Large" }

  The machine will only run a maximum of four LoadLeveler jobs that have
  either defaulted to, or explicitly requested, the No_Class, Small, Medium,
  or Large class. A LoadLeveler job with class IO_bound, for example, would
  not be eligible to run here.

v Example 3
  This example specifies multiple classes:

     Class = B(2) D(1)

  or

     Class = { "B" "B" "D" }

  The machine will run only LoadLeveler jobs that have explicitly requested
  class B or D. Up to three LoadLeveler jobs may run simultaneously: two of
  class B and one of class D. A LoadLeveler job with class No_Class, for
  example, would not be eligible to run here.

Specifying how many jobs a machine can run

To specify how many jobs a machine can run, you need to take into
consideration both the MAX_STARTERS keyword and the Class statement. This is
described in more detail in "Defining LoadLeveler machine characteristics" on
page 54.

For example, if the configuration file contains these statements:
   Class = A(1) B(2) C(1)
   MAX_STARTERS = 2

then the machine can run a maximum of two LoadLeveler jobs simultaneously. The
possible combinations of LoadLeveler jobs are:
v A and B
v A and C
v B and B
v B and C
v Only A, or only B, or only C

If this keyword is specified together with a Class statement, the maximum
number of jobs that can be run is equal to the lower of the two numbers. For
example, if:

   MAX_STARTERS = 2
   Class = class_a(1)

then the maximum number of job steps that can be run is one (the Class
statement defines one class).

If you specify the MAX_STARTERS keyword without specifying a Class statement,
by default one class still exists (called No_Class). Therefore, the maximum
number of jobs that can be run when you do not specify a Class statement is
one.

Note: If the MAX_STARTERS keyword is not defined in either the global
configuration file or the local configuration file, the maximum number of jobs
that the machine can run is equal to the number of classes in the Class
statement.

Defining security mechanisms

LoadLeveler can be configured to control the authentication and authorization
of LoadLeveler functions by using Cluster Security (CtSec) services, a
subcomponent of Reliable Scalable Cluster Technology (RSCT), which uses
host-based authentication (HBA) as an underlying security mechanism.
LoadLeveler permits only one security service to be configured at a time.

You can skip this topic if you do not plan to use this security feature or if
you plan to use the credential forwarding provided by the llgetdce and
llsetdce program pair. Refer to "Using the alternative program pair: llgetdce
and llsetdce" on page 75 for more information.

LoadLeveler for Linux does not support CtSec security.

LoadLeveler can also be enabled to interact with OpenSSL for secure
multicluster communications.

Table 14 on page 57 lists the topics that explain LoadLeveler daemons and how
you may define their characteristics and modify their behavior.
Table 14. Roadmap of configuration tasks for securing LoadLeveler operations

   To learn about:                  Read the following:
   Securing LoadLeveler             v "Configuring LoadLeveler to use cluster
   operations using cluster           security services"
   security services                v "Steps for enabling CtSec services" on
                                      page 58
                                    v "Limiting which security mechanisms
                                      LoadLeveler can use" on page 60
   Enabling LoadLeveler to secure   "Steps for securing communications within a
   multicluster communication       LoadLeveler multicluster" on page 153
   with OpenSSL
   Correctly specifying             Chapter 12, "Configuration file reference,"
   configuration file keywords      on page 263

Configuring LoadLeveler to use cluster security services

Cluster security (CtSec) services allows a software component to authenticate
and authorize the identity of one of its peers or clients.

When configured to use CtSec services, LoadLeveler will:
v Authenticate the identity of users and programs interacting with
  LoadLeveler.
v Authorize users and programs to use LoadLeveler services. It prevents
  unauthorized users and programs from misusing resources or disrupting
  services.

To use CtSec services, all nodes running LoadLeveler must first be configured
as part of a cluster running Reliable Scalable Cluster Technology (RSCT). For
details on CtSec services administration, see IBM Reliable Scalable Cluster
Technology: Administration Guide, SA22-7889.

CtSec services are designed to use multiple security mechanisms, and each
security mechanism must be configured for LoadLeveler. At the present time,
directions are provided only for configuring the host-based authentication
(HBA) security mechanism for LoadLeveler's use. If CtSec is configured to use
additional security mechanisms that are not configured for LoadLeveler's use,
then the LoadLeveler configuration file keyword SEC_IMPOSED_MECHS must be
specified. This keyword is used to limit the security mechanisms that will be
used by CtSec services to only those that are configured for use by
LoadLeveler.

Authorization is based on user identity. When CtSec services are enabled for
LoadLeveler, user identity will differ depending on the authentication
mechanism in use. A user's identity in UNIX host-based authentication is the
user's network identity, which is comprised of the user name and host name,
such as user_name@host.

LoadLeveler uses CtSec services to authorize owners of jobs, administrators,
and LoadLeveler daemons to perform certain actions. CtSec services uses its
own identity mapping file to map the clients' network identity to a local
identity when performing authorizations. A typical local identity is the user
name without a host name. The local identities of the LoadLeveler
administrators must be added as members of the group specified by the keyword
SEC_ADMIN_GROUP. The local identity of the LoadLeveler user name must be added
as the sole member of the group specified by the keyword SEC_SERVICES_GROUP.
The LoadLeveler services and administrative groups, those identified by the
keywords
SEC_SERVICES_GROUP and SEC_ADMIN_GROUP, must be the same across all nodes in
the LoadLeveler cluster.

To ensure consistency in performing tasks which require owner, administrative,
or daemon privileges across all nodes in the LoadLeveler cluster, user network
identities must be mapped identically across all nodes in the LoadLeveler
cluster. If this is not the case, LoadLeveler authorizations may fail.

Steps for enabling CtSec services

To enable LoadLeveler to use CtSec services, perform the following steps:
1. Include, in the Trusted Host List, the host names of all hosts with which
   communications may take place. If LoadLeveler tries to communicate with a
   host not on the Trusted Host List, the message:

      The host identified in the credentials is not a trusted host on this system

   will occur. Additionally, the system administrator should ensure that
   public keys are manually exchanged between all hosts in the LoadLeveler
   cluster. Refer to IBM Reliable Scalable Cluster Technology: Administration
   Guide, SA22-7889 for information on setting up Trusted Host Lists and
   manually transferring public keys.
2. Create user IDs. Each LoadLeveler administrator ID and the LoadLeveler user
   ID need to be created, if they don't already exist. You can do this through
   SMIT or the mkuser command.
3. Ensure that the unix.map file contains the correct value for the service
   name ctloadl, which specifies the LoadLeveler user name. If you have
   configured LoadLeveler to use loadl as the LoadLeveler user name, either by
   default or by specifying loadl in the LoadLUserid keyword of the
   /etc/LoadL.cfg file, nothing needs to be done; the default map file will
   already contain the ctloadl service name assigned to loadl. If you have
   configured a different user name in the LoadLUserid keyword of the
   /etc/LoadL.cfg file, you will need to make sure that the
   /var/ct/cfg/unix.map file exists and that it assigns the same user name to
   the ctloadl service name. If the /var/ct/cfg/unix.map file does not exist,
   create one by copying the default map file /usr/sbin/rsct/cfg/unix.map. Do
   not modify the default map file. If the value of the LoadLUserid and the
   value associated with ctloadl are not the same, a security services error
   indicating a UNIX identity mismatch will occur.
4. Add entries to the global mapping file of each machine in the LoadLeveler
   cluster to map network identities to local identities. This file is located
   at /var/ct/cfg/ctsec_map.global. If this file doesn't yet exist, you should
   copy the default global mapping file to this location; don't modify the
   default mapping file. The default global mapping file, which is shared
   among all CtSec services exploiters, is located at
   /usr/sbin/rsct/cfg/ctsec_map.global. See IBM Reliable Scalable Cluster
   Technology for AIX: Technical Reference, SA22-7890 for more information on
   the mapping file.
   When adding names to the global mapping file, enter more specific entries
   ahead of the other, less specific entries. Remember that you must update
   the global mapping file on each machine in the LoadLeveler cluster, and
   each mapping file has to be updated with the security services identity of
   each member of the administrator group, the services group, and the users.
   Therefore, you would have entries like this:

      unix:brad@mach1.pok.ibm.com=bradleyf
      unix:brad@mach2.pok.ibm.com=bradleyf
      unix:brad@mach3.pok.ibm.com=bradleyf
      unix:marsha@mach2.pok.ibm.com=marshab
   unix:marsha@mach3.pok.ibm.com=marshab
   unix:loadl@mach1.pok.ibm.com=loadl
   unix:loadl@mach2.pok.ibm.com=loadl
   unix:loadl@mach3.pok.ibm.com=loadl

   However, if you are sure your LoadLeveler cluster is secure, you could specify the mapping for all machines this way:

   unix:brad@*=bradleyf
   unix:marsha@*=marshab
   unix:loadl@*=loadl

   This indicates that the UNIX network identities of the users brad, marsha, and loadl map to their respective security services identities on every machine in the cluster. Refer to IBM Reliable Scalable Cluster Technology for AIX: Technical Reference, SA22-7890 for a description of the syntax used in the ctsec_map.global file.
5. Create UNIX groups. The LoadLeveler administrator group and services group must be created on every machine in the cluster and must contain the same entries, the local identities of their members. This can be done either by using SMIT or the mkgroup command. For example, to create the group lladmin, which lists the LoadLeveler administrators:

   mkgroup "users=sam,betty,loadl" lladmin

   To create the group llsvcs, which lists the identity under which LoadLeveler daemons run, using the default ID of loadl:

   mkgroup users=loadl llsvcs

6. Add or update these keywords in the LoadLeveler configuration file:

   SEC_ENABLEMENT=CTSEC
   SEC_ADMIN_GROUP=name of lladmin group
   SEC_SERVICES_GROUP=group name that contains identities of LoadLeveler daemons

   The SEC_ENABLEMENT=CTSEC keyword indicates that the CtSec services mechanism should be used. SEC_ADMIN_GROUP points to the name of the UNIX group that contains the local identities of the LoadLeveler administrators. The SEC_SERVICES_GROUP keyword points to the name of the UNIX group that contains the local identity of the LoadLeveler daemons; all LoadLeveler daemons run as the LoadLeveler user ID. Refer to step 5 for a discussion of the administrator and services groups.
7. Update the .rhosts file in each user's home directory. This file is used to identify which UNIX identities can run LoadLeveler jobs on the local host machine. If the file does not exist in a user's home directory, you must create it. The .rhosts file must contain entries that specify all host and user combinations allowed to submit jobs that will run as the local user, either explicitly or through the use of wildcards. Entries in the .rhosts file are specified this way:

   HostNameField [UserNameField]

   Refer to IBM AIX Files Reference, SC23-4168 for further details about the .rhosts file format.
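   For example, a .rhosts file in the home directory of a local user might contain entries like the following (the host and user names are illustrative only):

   mach1.pok.ibm.com brad
   mach2.pok.ibm.com brad
   mach3.pok.ibm.com

   The first two entries allow user brad on mach1 and mach2 to run jobs as this local user; the third entry allows only the like-named user from mach3 to do so.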
Tips for configuring LoadLeveler to use CtSec services:

When using CtSec services for LoadLeveler, each machine in the LoadLeveler cluster must be set up properly. CtSec authenticates network identities based on trust established between individual machines in a cluster, which in turn is based on local host configurations. Because of this, it is possible for most of the cluster to run correctly while transactions from certain machines experience authentication or authorization problems. If unexpected authentication or authorization problems occur in a LoadLeveler cluster with CtSec enabled, check that the steps in "Steps for enabling CtSec services" on page 58 were correctly followed for each machine in the LoadLeveler cluster.

If any machine in a LoadLeveler cluster is improperly configured to run CtSec, you may see that:
v Users cannot perform user tasks (such as cancel) for jobs they submitted. Either the machine the job was submitted from or the machine the user operation was submitted from (or both) does not contain mapping files for the user that specify the same security services identity. The user should attempt the operation from the same machine the job was submitted from and record the results. If the user still cannot perform a user task on a job they submitted, they should contact the LoadLeveler administrator, who should review the steps in "Steps for enabling CtSec services" on page 58.
v LoadLeveler daemons fail to communicate. When LoadLeveler daemons communicate, they must first authenticate each other. If the daemons cannot authenticate, a message is put in the daemon log indicating an authentication failure. Ensure that the Trusted Host List on all LoadLeveler nodes contains the correct entries for all of the nodes in the LoadLeveler cluster. Also, make sure that the LoadLeveler Services group on all nodes of the LoadLeveler cluster contains the local identity for the LoadLeveler user name. The ctsec_map.global file must contain mapping rules to map the LoadLeveler user name from every machine in the LoadLeveler cluster to the local identity for the LoadLeveler user name.
  An example of what may happen when daemons fail to communicate is that an alternate central manager may take over while the primary central manager is still active. This can occur when the alternate central manager does not trust the primary central manager.

Limiting which security mechanisms LoadLeveler can use

As more security mechanisms become available, they must be configured for LoadLeveler's use. If there are security mechanisms configured for CtSec that are not configured for LoadLeveler's use, the LoadLeveler configuration file keyword SEC_IMPOSED_MECHS must specify the mechanisms that are configured for LoadLeveler.
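For example, if only the host-based authentication mechanism has been configured for LoadLeveler, the configuration file could restrict CtSec to that mechanism. The mechanism name shown here assumes the HBA mechanism is registered as unix; check your CtSec configuration and Chapter 12, "Configuration file reference," on page 263 for the exact names:

SEC_IMPOSED_MECHS = unix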
Defining usage policies for consumable resources

The LoadLeveler scheduler can schedule jobs based on the availability of consumable resources. You can use the following keywords to configure consumable resources:
v ENFORCE_RESOURCE_MEMORY
v ENFORCE_RESOURCE_POLICY
v ENFORCE_RESOURCE_SUBMISSION
v ENFORCE_RESOURCE_USAGE
v FLOATING_RESOURCES
v RESOURCES
v SCHEDULE_BY_RESOURCES

For information about configuration file keyword syntax and other details, see Chapter 12, "Configuration file reference," on page 263.

Enabling support for bulk data transfer and rCxt blocks

On systems with device drivers and network adapters that support remote direct-memory access (RDMA), LoadLeveler allows bulk data transfer for jobs that use either the Internet or user space communication protocol mode. For jobs using the Internet protocol (IP jobs), LoadLeveler does not monitor or control the use of bulk transfer. For user space jobs that request bulk transfer, however, LoadLeveler creates a consumable RDMA resource, and limits RDMA resources to four for a single machine with Switch Network Interface for HPS network adapters. There is no limit on RDMA resources for machines with InfiniBand network adapters.

You do not need to perform specific configuration or job-definition tasks to enable bulk transfer for LoadLeveler jobs that use the IP network protocol. LoadLeveler cannot affect whether IP communication uses bulk transfer; the implementation of IP where the job runs determines whether bulk transfer is supported.

To enable user space jobs to use bulk data transfer, you must update the LoadLeveler configuration file to include the value RDMA in the SCHEDULE_BY_RESOURCES list for machines with Switch Network Interface for HPS network adapters. Example:

SCHEDULE_BY_RESOURCES = RDMA others

For additional information about using bulk data transfer and job-definition requirements, see "Using bulk data transfer" on page 188.

Gathering job accounting data

Your organization may have a policy of charging users or groups of users for the amount of resources that their jobs consume. You can do this using LoadLeveler's accounting feature. Using this feature, you can produce accounting reports that contain job resource information for completed serial and parallel job steps. You can also view job resource information on jobs that are still running.

The accounting record for a job step contains a separate set of resource usage data for each time the job step is dispatched to run. For example, the accounting record for a job step that is vacated and then started again will contain two sets of resource usage data. The first set covers the time period from when the job step was initially dispatched until it was vacated; the second set covers the time period from when the job step was dispatched after the vacate until the job step completed.
The job step's accounting data that is provided in the llsummary short listing and in the user mail contains only one set of resource usage data, from the last time the job step was dispatched to run. For example, the mail message for completion of a job step that is checkpointed with the hold (-h) option and then restarted will contain only the set of resource usage data for the dispatch that restarted the job from the checkpoint. To obtain the resource usage data for the entire job step, use the detailed llsummary command or the accounting API.

The following keywords allow you to control accounting functions:
v ACCT
v ACCT_VALIDATION
v GLOBAL_HISTORY
v HISTORY_PERMISSION
v JOB_ACCT_Q_POLICY
v JOB_LIMIT_POLICY

For example, the following section of the configuration file specifies that the accounting function is turned on. It also identifies the default module used to perform account validation and the directory containing the global history files:

ACCT = A_ON A_VALIDATE
ACCT_VALIDATION = $(BIN)/llacctval
GLOBAL_HISTORY = $(SPOOL)

Table 15 lists the topics related to configuring, gathering, and using job accounting data.

Table 15. Roadmap of tasks for gathering job accounting data

To learn about:  Configuring LoadLeveler to gather job accounting data
Read:            v "Collecting job resource data on serial and parallel jobs"
                 v "Collecting job resource data based on machines" on page 64
                 v "Collecting job resource data based on events" on page 64
                 v "Collecting job resource information based on user accounts" on page 65
                 v "Collecting accounting data for reservations" on page 63
                 v "Collecting the accounting information and storing it into files" on page 66
                 v "64-bit support for accounting functions" on page 67
                 v "Example: Setting up job accounting files" on page 67

To learn about:  Managing accounting data
Read:            v "Producing accounting reports" on page 66
                 v "Correlating AIX and LoadLeveler accounting records" on page 66
                 v "llacctmrg - Collect machine history files" on page 413
                 v "llsummary - Return job resource information for accounting" on page 535

To learn about:  Correctly specifying configuration file keywords
Read:            Chapter 12, "Configuration file reference," on page 263

Collecting job resource data on serial and parallel jobs

Information on completed serial and parallel job steps is gathered using the UNIX wait3 system call.
Information on serial and parallel jobs that have not completed is gathered in a platform-dependent manner by examining data from the UNIX process.

Accounting information on a completed serial job step is determined by accumulating the resources consumed by that job on the machines that ran it. Similarly, accounting information on completed parallel job steps is gathered by accumulating the resources used on all of the nodes that ran the job step.

You can also view resource consumption information on serial and parallel jobs that are still running by specifying the -x option of the llq command. To enable llq -x, specify the following keywords in the configuration file:
v ACCT = A_ON A_DETAIL
v JOB_ACCT_Q_POLICY = number
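A minimal sketch of this configuration follows; the value 300 for JOB_ACCT_Q_POLICY is only an illustrative number, not a recommended setting:

ACCT = A_ON A_DETAIL
JOB_ACCT_Q_POLICY = 300

With these keywords in effect, a user can query resource consumption for jobs that are still running:

$ llq -x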
Collecting accounting information for recurring jobs

For recurring jobs, accounting records are written as each occurrence of each step of the job completes. The reservation ID field in the accounting record can be used to distinguish one occurrence from another.

Collecting accounting data for reservations

LoadLeveler can collect accounting data for reservations, which are set periods of time during which node resources are reserved for the use of particular users or groups.

To enable recording of reservation information, specify the following keywords in the configuration file:
v To turn on accounting for reservations, add the A_RES flag to the ACCT keyword.
v To specify a file other than the default history file to contain the data, use the RESERVATION_HISTORY keyword.

See Chapter 12, "Configuration file reference," on page 263 for details about the ACCT and RESERVATION_HISTORY keywords.

When these keyword values are set and a reservation ends or is canceled, LoadLeveler records the following information:
v The reservation ID
v The time at which the reservation was created
v The user ID of the reservation owner
v The name of the owning group
v Requested and actual start times
v Requested and actual duration
v The actual time at which the reservation ended or was canceled
v Whether the reservation was created with the SHARED or REMOVE_ON_IDLE options
v A list of users and a list of groups that were authorized to use the reservation
v The number of reserved nodes
v The names of reserved nodes

This reservation information is appended in a single line to the reservation history file for the reservation. The format of reservation history data is:

Reservation ID!Reservation Creation Time!Owner!Owning Group!Start Time!
Actual Start Time!Duration!Actual Duration!Actual End Time!SHARED(yes|no)!
REMOVE_ON_IDLE(yes|no)!Users!Groups!Number of Nodes!Nodes!BG C-nodes!
BG Connection!BG Shape!Number of BG BPs!BG BPs

In reservation history data:
v The unit of measure for start times and end times is the number of seconds since January 1, 1970.
v The unit of time for durations is seconds.

Note: As each occurrence of a recurring reservation completes, an accounting record is appended to the reservation history file. The format of the record is identical to that of a one-time reservation. In the record, the Reservation ID includes the occurrence ID of the completed reservation.

When you cancel the entire recurring reservation (as opposed to canceling only one occurrence), one additional accounting record is written. This record is based on the state of the reservation:
v If an occurrence is ACTIVE, the end time and duration of that occurrence are set and an accounting record is written.
v If there are no ACTIVE occurrences, an accounting record is written for the next scheduled occurrence. This is similar to the accounting record that is written when you cancel a one-time reservation in the WAITING state.

The following is an example of a reservation history file entry:

bgldd1.rchland.ibm.com.68.r!1150242970!ezhong!group1!1150243200!1150243200!
300!300!1150243500!no!no!yang!fvt,dev!1!bgldd1!0!!!0!
bgldd1.rchland.ibm.com.54.r!1150143472!ezhong!No_Group!1153612800!0!60!0!
1150243839!no!no!!!0!32!MESH!0x0x0!1!R010(J115)
bgldd1.rchland.ibm.com.70.r!1150244654!ezhong!No_Group!1150244760!1150244760!
60!60!1150244820!yes!yes!user1,user2!group1,group2!0!512!MESH!1x1x1!1!R010

To collect the reservation information stored in the history file, use the llacctmrg command with the -R option. For llacctmrg command syntax, see "llacctmrg - Collect machine history files" on page 413.

To format reservation history data contained in a file, use the sample script llreshist.pl in directory ${RELEASEDIR}/samples/llres/.

Collecting job resource data based on machines

LoadLeveler can collect job resource usage information for every machine on which a job may run. A job may run on more than one machine because it is a parallel job or because the job is vacated from one machine and rescheduled to another machine.

To enable recording of resources by machine, you need to specify ACCT = A_ON A_DETAIL in the configuration file.

The machine's speed is part of the data collected. With this information, an installation can develop a chargeback program that charges more or less for resources consumed by a job on different machines. For more information on a machine's speed, refer to the machine stanza information; see "Defining machines" on page 84.

Collecting job resource data based on events

In addition to collecting job resource information based upon the machines used, you can gather this information based upon an event or time that you specify.
For example, you may want to collect accounting information at the end of every work shift or at the end of every week or month. To collect accounting information on all machines in this manner, use the llctl command with the capture parameter:

llctl -g capture eventname

eventname is any continuous string of characters (no white space is allowed) that defines the event about which you are collecting accounting data. For example, if you were collecting accounting data on the graveyard work shift, your command could be:

llctl -g capture graveyard

This command allows you to obtain a snapshot of the resources consumed by active jobs up to and including the moment when you issued the command. If you want to capture this type of information on a regular basis, you can set up a crontab entry to invoke this command regularly. For example:

# sample crontab for accounting
# shift crontab 94/8/5
#
# Set up three shifts, first, second, and graveyard shift.
# Crontab entries indicate the end of shift.
#
#M  H  d m day command
#
00 08  * * *  /u/loadl/bin/llctl -g capture graveyard
00 16  * * *  /u/loadl/bin/llctl -g capture first
00 00  * * *  /u/loadl/bin/llctl -g capture second

For more information on the llctl command, refer to "llctl - Control LoadLeveler daemons" on page 439. For more information on the collection of accounting records, see "llq - Query job status" on page 479.

Collecting job resource information based on user accounts

If your installation is interested in keeping track of resources used on an account basis, you can require all users to specify an account number in their job command files. They can specify this account number with the account_no keyword, which is explained in detail in "Job command file keyword descriptions" on page 359. Interactive POE jobs can specify an account number using the LOADL_ACCOUNT_NO environment variable.

LoadLeveler validates this account number by comparing it against the list of account numbers specified for the user in the user stanza in the administration file.

Account validation is under the control of the ACCT keyword in the configuration file. The routine that performs the validation is called llacctval. You can supply your own validation routine by specifying the ACCT_VALIDATION keyword in the configuration file. The following are passed as character string arguments to the validation routine:
v User name
v User's login group name
v Account number specified on the job
v Blank-separated list of account numbers obtained from the user's stanza in the administration file

The account validation routine must exit with a return code of zero if the validation succeeds. If it fails, the return code must be a nonzero number.
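The following is a minimal sketch of a replacement validation routine written in Korn shell; the script itself and its installation path are hypothetical, and it implements only the argument and return code contract described above:

#!/bin/ksh
# Sketch: succeed only if the account number given on the job ($3)
# appears in the blank-separated list from the user stanza ($4).
# Arguments: $1 user name, $2 login group name, $3 account number,
#            $4 blank-separated list of valid account numbers.
for acct in $4
do
  if [ "$acct" = "$3" ]
  then
    exit 0        # validation succeeds
  fi
done
exit 1            # nonzero return code: validation fails

To use such a routine, point the ACCT_VALIDATION keyword at it, for example:

ACCT_VALIDATION = /u/loadl/bin/myacctval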
Collecting the accounting information and storing it into files

LoadLeveler stores the accounting information that it collects in a file called history in the spool directory of the machine that initially scheduled the job, the Schedd machine. Data on parallel jobs is also stored in the history files.

Resource information collected on the LoadLeveler job is constrained by the capabilities of the wait3 system call. Information for processes that fork child processes will include data for those child processes as long as the parent process waits for the child process to terminate. Complete data may not be collected for jobs that are not composed of simple parent/child processes. For example, if you have a LoadLeveler job that invokes an rsh command to execute a function on another machine, the resources consumed on the other machine will not be collected as part of the LoadLeveler accounting data.

LoadLeveler accounting uses the following types of files:
v The local history file, which is local to each Schedd machine, is where job resource information is first recorded. These files are usually named history and are located in the spool directory of each Schedd machine, but you may specify an alternate name with the HISTORY keyword in either the global or local configuration file.
v The global history file is a combination of the history files from some or all of the machines in the LoadLeveler cluster merged together. The llacctmrg command is used to collect the files together into a global file. As the files are collected from each machine, the local history file for that machine is reset to contain no data. The file is named globalhist.YYYYMMDDHHmm. You may specify the directory in which to place the file when you invoke the llacctmrg command, or you can specify the directory with the GLOBAL_HISTORY keyword in the configuration file. The default value set up in the sample configuration file is the local spool directory.

Producing accounting reports

You can produce three types of reports using either the local or global history file: the short, long, and extended versions. As their names imply, the short version of the report is a brief listing of the resources used by LoadLeveler jobs; the long version provides more comprehensive detail with summarized resource usage; and the extended version provides the comprehensive detail with detailed resource usage. If you do not specify a report type, you will receive the default short version.

The short report displays the number of jobs along with the total CPU usage according to user, class, group, and account number. The extended version of the report displays all of the data collected for every job.
v For examples of the short and extended versions of the report, see "llsummary - Return job resource information for accounting" on page 535.
v For information on the accounting APIs, refer to Chapter 17, "Application programming interfaces (APIs)," on page 541.

Correlating AIX and LoadLeveler accounting records

For jobs running on AIX systems, you can use a job accounting key to correlate AIX accounting records with LoadLeveler accounting records.

The job accounting key uniquely identifies each job step. LoadLeveler derives this key from the job key and the date and time at which the job entered the queue
(see the QDate variable description). The key is associated with the starter process for the job step and any of its child processes.

For checkpointed jobs, LoadLeveler does not change the job accounting key, regardless of how it restarts the job step. Jobs restarted from a checkpoint file or through a new job step retain the job accounting key of the original job step.

To access the job accounting key for a job step, you can use the following interfaces:
v The llsummary command, requesting the long version of the report. For details about using this command, see "llsummary - Return job resource information for accounting" on page 535.
v The GetHistory subroutine. For details about using this subroutine, see "GetHistory subroutine" on page 545.
v The ll_get_data subroutine, through the LL_StepAcctKey specification. For details about using this subroutine, see "ll_get_data subroutine" on page 570.

For information about AIX accounting records, see the system accounting topic in AIX System Management Guide: Operating System and Devices.

64-bit support for accounting functions

LoadLeveler 64-bit support for accounting functions includes:
v Statistics of jobs such as usage, limits, consumable resources, and other 64-bit integer data are preserved in the history file as rusage64 and rlimit64 structures and as data items of type int64_t.
v The LL_job_step structure defined in llapi.h allows access to the 64-bit data items either as data of type int64_t or as data of type int32_t. In the latter case, the returned values may be truncated.
v The llsummary command displays 64-bit information where appropriate.
v The data access API supports both 64-bit and 32-bit access to accounting and usage information in a history file.

See "Examples of using the data access API" on page 633 for an example of how to use the ll_get_data subroutine to access information stored in a LoadLeveler history file.

Example: Setting up job accounting files

You can perform all of the steps included in this sample procedure, or just the ones that apply to your situation. The sample procedure shown in Table 16 walks you through the process of collecting account data.

1. Edit the configuration file according to the following table:

   Table 16. Collecting account data - modifying the configuration file

   Edit this keyword:   To:
   ACCT                 Turn accounting and account validation on and off
                        and specify detailed accounting.
   ACCT_VALIDATION      Specify the account validation routine.
   GLOBAL_HISTORY       Specify a directory in which to place the global
                        history files.
2. Specify account numbers and set up account validation by performing the following steps:
   a. Specify the list of account numbers a user may use when submitting jobs, by using the account keyword in the user stanza in the administration file.
   b. Instruct users to associate an account number with their jobs, by using the account_no keyword in the job command file.
   c. Specify the ACCT_VALIDATION keyword in the configuration file to identify the module that will be called to perform account validation. The default module is called llacctval. You can replace this module with your installation's own accounting routine by specifying a new module with this keyword.
3. Specify machines and their weights by using the speed keyword in each machine's machine stanza in the administration file. Also, if your cluster contains machines of differing speeds and you want LoadLeveler accounting information to be normalized for these differences, specify cpu_speed_scale=true in each machine's respective machine stanza. (Sample administration file stanzas for steps 2a and 3 follow this procedure.)
   For example, suppose you have a cluster of two machines, called A and B, where Machine B is three times as fast as Machine A: Machine A has speed=1.0 and Machine B has speed=3.0. Suppose a job runs for 12 CPU seconds on Machine A, while the same job runs for 4 CPU seconds on Machine B. When you specify cpu_speed_scale=true, the accounting information collected on Machine B for that job shows the normalized value of 12 CPU seconds rather than the actual 4 CPU seconds.
4. Merge the files collected from each machine into one file, using the llacctmrg command.
5. Report job information on all the jobs in the history file, using the llsummary command.
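The following administration file fragment sketches steps 2a and 3; the user name, machine names, and account numbers are hypothetical:

brad:      type = user
           account = dept17 dept42

machine_a: type = machine
           speed = 1.0
           cpu_speed_scale = true

machine_b: type = machine
           speed = 3.0
           cpu_speed_scale = true

For step 2b, a user's job command file would then carry a matching line such as:

# @ account_no = dept17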
Managing job status through control expressions

You can control running jobs by using five control functions as Boolean expressions in the configuration file. These functions are useful primarily for serial jobs. You define the expressions, using normal C conventions, with the following functions:
v START
v SUSPEND
v CONTINUE
v VACATE
v KILL

The expressions are evaluated for each job running on a machine, using both the job and machine attributes. Some jobs running on a machine may be suspended while others are allowed to continue.

The START expression is evaluated twice: once to see if the machine can accept jobs to run, and a second time to see if the specific job can be run on the machine. The other expressions are evaluated after the jobs have been dispatched and, in some cases, are already running.

When evaluating the START expression to determine whether the machine can accept jobs, Class != "Z" evaluates to true only if Z is not in the class definition. This means that if two different classes are defined on a machine, Class != "Z" (where Z is one of the defined classes) always evaluates to false when specified in the START expression, and therefore the machine will not be considered available to start jobs.

Typically, machine load average, keyboard activity, time intervals, and job class are used within these various expressions to dynamically control job execution. An example of typical expressions appears after Figure 10.

For additional information about:
v Time-related variables that you may use for this keyword, see "Variables to use for setting times" on page 320.
v Coding these control expressions in the configuration file, see Chapter 12, "Configuration file reference," on page 263.

How control expressions affect jobs

After LoadLeveler selects a job for execution, the job can be in any of several states. Figure 10 on page 70 shows how the control expressions can affect the state a job is in. The rectangles represent job or daemon states (Idle, Completed, Running, Suspended, and Vacating) and the diamonds represent the control expressions (Start, Suspend, Continue, Vacate, and Kill).
Figure 10. How control expressions affect jobs (flowchart; states: Idle, Completed, Running, Suspended, and Vacating; decision points: Start, Suspend, Continue, Vacate, and Kill)

Criteria used to determine when a LoadLeveler job will enter the Start, Suspend, Continue, Vacate, and Kill states are defined in the LoadLeveler configuration files, and they can be different for each machine in the cluster. They can be modified to meet local requirements.
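The following configuration file fragment sketches typical control expressions using the machine load average, keyboard activity, and constant variables mentioned above; the thresholds are illustrative only and should be tuned to local requirements (see Chapter 12, "Configuration file reference," on page 263 for the exact syntax and variable names):

START    : (LoadAvg <= 3.00)
SUSPEND  : (LoadAvg > 3.00) && (KeyboardIdle < 300)
CONTINUE : (LoadAvg <= 1.00) && (KeyboardIdle > 300)
VACATE   : F
KILL     : F

With these settings, a machine stops accepting jobs when its load average climbs above 3.00, suspends running jobs while an interactive user is active, and resumes them once the machine is idle again; jobs are never vacated or killed by these expressions.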
Tracking job processes

When a job terminates, its orphaned processes may continue to consume or hold resources, thereby degrading system performance or causing jobs to hang or fail.

Process tracking allows LoadLeveler to cancel any processes (throughout the entire cluster) left behind when a job terminates. Process tracking is required to perform preemption by the suspend method when running either the BACKFILL or API scheduler. Process tracking is optional in all other cases.

When process tracking is enabled, all child processes are terminated when the main process terminates. This includes any background or orphaned processes started in the prolog, epilog, user prolog, and user epilog.

Process tracking on LoadLeveler for Linux is supported only on RHEL 5 and SLES 10 systems.

Two keywords are used to specify process tracking:

PROCESS_TRACKING
  To activate process tracking, set PROCESS_TRACKING=TRUE in the LoadLeveler global configuration file. By default, PROCESS_TRACKING is set to FALSE.

PROCESS_TRACKING_EXTENSION
  On AIX, this keyword specifies the path to the loadable kernel module LoadL_pt_ke in the local or global configuration file. If the PROCESS_TRACKING_EXTENSION keyword is not supplied, LoadLeveler will search the default directory $HOME/bin.
  On Linux, this keyword specifies the path to the loadable kernel module proctrk.ko in the local or global configuration file. The proctrk.ko kernel module is shipped as source code and must be built and installed on all machines where process tracking is required.

See the TWS LoadLeveler: Installation Guide for additional information about which directory to specify when using the PROCESS_TRACKING_EXTENSION configuration keyword.

The process tracking kernel extension is not unloaded when the startd daemon terminates. Therefore, if a mismatch between the version of the loaded kernel extension and the version of the installed kernel extension is found when the startd starts up, the daemon will exit. In this case, a reboot of the node is needed to unload the currently loaded kernel extension. If you install a new version of LoadLeveler that contains a new version of the kernel extension, you may need to reboot the node.

For information about configuration file keyword syntax and other details, see Chapter 12, "Configuration file reference," on page 263.
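For example, a global configuration file on AIX might enable process tracking as follows; the directory shown is hypothetical, so use the directory given in the TWS LoadLeveler: Installation Guide for your installation:

PROCESS_TRACKING = TRUE
PROCESS_TRACKING_EXTENSION = /usr/lpp/LoadL/full/bin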
base_name.cfg or LOADL_CONFIG=base_name. When you use the form LOADL_CONFIG=base_name, the prefix /etc and suffix .cfg are appended to the base_name.

The following example explains how you can set up a machine to query multiple clusters. You can configure /etc/LoadL.cfg to point to the configuration files for the "default" cluster, and you can configure /etc/othercluster.cfg to point to the configuration files of another cluster which the user can select.

For example, you can enter the following query command:

$ llq

The llq command uses the configuration from /etc/LoadL.cfg and queries job information from the "default" cluster. To send a query to the cluster defined in the configuration file /etc/othercluster.cfg, enter:

$ env LOADL_CONFIG=othercluster llq

Note that the machine from which you issue the llq command is considered a submit-only machine by the other cluster.

Handling switch-table errors

You may use the following configuration file keywords to control how LoadLeveler responds to switch-table errors:
v ACTION_ON_SWITCH_TABLE_ERROR
v DRAIN_ON_SWITCH_TABLE_ERROR
v RESUME_ON_SWITCH_TABLE_ERROR_CLEAR

For information about configuration file keyword syntax and other details, see Chapter 12, "Configuration file reference," on page 263.

Providing additional job-processing controls through installation exits

LoadLeveler allows administrators to further configure the environment through installation exits. Table 17 lists these additional job-processing controls.

Table 17. Roadmap of administrator tasks accomplished through installation exits

To learn about:  Writing a program to control when jobs are scheduled to run
Read:            "Controlling the central manager scheduling cycle" on page 73

To learn about:  Writing a pair of programs to override the default LoadLeveler DCE authentication method
Read:            "Handling DCE security credentials" on page 74

To learn about:  Writing a program to refresh an AFS token when a job starts
Read:            "Handling an AFS token" on page 75
To learn about:  Writing a program to check or modify job requests when they are submitted
Read:            "Filtering a job script" on page 76

To learn about:  Writing programs to run before and after job requests
Read:            "Writing prolog and epilog programs" on page 77

To learn about:  Overriding the LoadLeveler default mail notification method
Read:            "Using your own mail program" on page 81

To learn about:  Defining a cluster metric to determine where a remote job is distributed
Read:            The CLUSTER_METRIC configuration keyword description in Chapter 12, "Configuration file reference," on page 263

To learn about:  Defining a cluster user mapper for a multicluster environment
Read:            The CLUSTER_USER_MAPPER configuration keyword description in Chapter 12, "Configuration file reference," on page 263

To learn about:  Correctly specifying configuration file keywords
Read:            Chapter 12, "Configuration file reference," on page 263

Controlling the central manager scheduling cycle

To determine when to run the LoadLeveler scheduling algorithm, the central manager uses the values set in the configuration file for the NEGOTIATOR_INTERVAL and NEGOTIATOR_CYCLE_DELAY keywords.

The central manager runs the scheduling algorithm every NEGOTIATOR_INTERVAL seconds, unless some event takes place, such as the completion of a job or the addition of a machine to the cluster. In such cases, the scheduling algorithm is run immediately. When NEGOTIATOR_CYCLE_DELAY is set, a minimum of NEGOTIATOR_CYCLE_DELAY seconds will pass between the central manager's scheduling attempts, regardless of what other events might take place.

When NEGOTIATOR_INTERVAL is set to zero, the central manager will not run the scheduling algorithm until instructed to do so by an authorized process. This setting enables your program to control the central manager's scheduling activity through one of the following:
v The llrunscheduler command
v The ll_run_scheduler subroutine

Both the command and the subroutine instruct the central manager to run the scheduling algorithm.

You might choose to use this setting if, for example, you want to write a program that directly controls the assignment of the system priority for all LoadLeveler jobs. In this particular case, you would complete the following steps to control system priority assignment and the scheduling cycle:
1. Decide the following:
   v Which system priority value to assign to jobs from specific sources or with specific resource requirements.
   v How often the central manager should run the scheduling algorithm. Your program has to be designed to issue the ll_run_scheduler subroutine at regular intervals; otherwise, LoadLeveler will not attempt to schedule any job steps.
   You also need to understand how changing the system priority affects the job queue. After you successfully use the ll_modify subroutine or the llmodify command to change system priority values, LoadLeveler will not readjust the values for those job steps when the negotiator recalculates priorities at regular
intervals set through the NEGOTIATOR_RECALCULATE_SYSPRIO_INTERVAL keyword. Also, you can change the system priority for jobs only when those jobs are in the Idle state or a state similar to it. To determine which job states are similar to the Idle state or to the Running state, see the table in "LoadLeveler job states" on page 19.
2. Code a program that uses the LoadLeveler APIs to perform the following functions:
   a. Use the data access API to obtain data about all jobs.
   b. Determine whether jobs have been added or removed.
   c. Use the ll_modify subroutine to set the system priority for the LoadLeveler jobs. The values you set through this subroutine will not be readjusted when the negotiator recalculates job step priorities.
   d. Use the ll_run_scheduler subroutine to instruct the central manager to run the scheduling algorithm.
   e. Set a timer for the scheduling interval, to repeat the scheduling instruction at regular intervals. This step is required to replace the effect of setting the configuration keyword NEGOTIATOR_CYCLE_DELAY, which LoadLeveler ignores when NEGOTIATOR_INTERVAL is set to zero.
3. In the configuration file, set values for the following keywords (a sketch follows this procedure):
   v Set the NEGOTIATOR_INTERVAL keyword to zero to stop the central manager from automatically recalculating system priorities for jobs.
   v (Optional) Set the SYSPRIO_THRESHOLD_TO_IGNORE_STEP keyword to specify a threshold value. If the system priority assigned to a job step is less than this threshold value, the job will remain idle.
4. Issue the llctl command with either the reconfig or recycle keyword. Otherwise, LoadLeveler will not process the modifications you made to the configuration file.
5. (Optional) To make sure that the central manager's automatic scheduling activity has been disabled (by setting the NEGOTIATOR_INTERVAL keyword to zero), use the llstatus command.
6. Run your program under a user ID with administrator authority.

Once this procedure is complete, you might want to use one or more of the following commands to make sure that jobs are scheduled according to the correct system priority. The value of q_sysprio in the command output indicates the system priority for the job step.
v Use the command llq -s to detect whether a job step is idle because its system priority is below the value set for the SYSPRIO_THRESHOLD_TO_IGNORE_STEP keyword.
v Use the command llq -l to display the previous system priority for a job step.
v When unusual circumstances require you to change system priorities manually:
  1. Use the command llmodify -s to set the system priority for LoadLeveler jobs. The values you set through this command will not be readjusted when the negotiator recalculates job step priorities.
  2. Use the llrunscheduler command to instruct the central manager to run the scheduling algorithm.
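A minimal configuration sketch for step 3 follows; the threshold value of 100 is illustrative only:

NEGOTIATOR_INTERVAL = 0
SYSPRIO_THRESHOLD_TO_IGNORE_STEP = 100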
Handling DCE security credentials

You can write a pair of programs to override the default LoadLeveler DCE authentication method. To enable the programs, use the DCE_AUTHENTICATION_PAIR keyword in your configuration file. As an alternative, you can also specify the program pair:

DCE_AUTHENTICATION_PAIR = $(BIN)/llgetdce, $(BIN)/llsetdce

Specifying the DCE_AUTHENTICATION_PAIR keyword enables LoadLeveler support for forwarding DCE credentials to LoadLeveler jobs. You may override the default function provided by LoadLeveler to establish DCE credentials by substituting your own programs.

Using the alternative program pair llgetdce and llsetdce:

The program pair llgetdce and llsetdce forwards DCE credentials by copying credential cache files from the submitting machine to the executing machines. While this technique may require less overhead, it has been known to produce credentials on the executing machines that are not fully capable of being forwarded by rsh commands. This is the only pair of programs offered in earlier releases of LoadLeveler.

Forwarding DCE credentials:

An example of a credentials object is a character string containing the DCE principal name and a password.

program1 writes the following to standard output:
v The length of the handle to follow
v The handle

If program1 encounters errors, it writes error messages to standard error.

program2 receives the following as standard input:
v The length of the handle to follow
v The same handle written by program1

program2 writes the following to standard output:
v The length of the login context to follow
v An exportable DCE login context, which is the idl_byte array produced from the sec_login_export_context DCE API call. For more information, see the DCE Security Services API chapter in the Distributed Computing Environment for AIX: Application Development Reference.
v A character string suitable for assigning to the KRB5CCNAME environment variable. This string represents the location of the credentials cache established in order for program2 to export the DCE login context.

If program2 encounters errors, it writes error messages to standard error. The parent process, the LoadLeveler starter process, writes those messages to the starter log.

For examples of programs that enable DCE security credentials, see the samples/lldce subdirectory in the release directory.

Handling an AFS token

You can write a program, run by the scheduler, to refresh an AFS token when a job is started. To invoke the program, use the AFS_GETNEWTOKEN keyword in your configuration file.
Before running the program, LoadLeveler sets up standard input and standard output as pipes between the program and LoadLeveler. LoadLeveler also sets up the following environment variables:

LOADL_STEP_OWNER
  The owner (UNIX user name) of the job.
LOADL_STEP_COMMAND
  The name of the command the user's job step invokes.
LOADL_STEP_CLASS
  The class in which this job step will run.
LOADL_STEP_ID
  The step identifier, generated by LoadLeveler.
LOADL_JOB_CPU_LIMIT
  The number of CPU seconds the job is limited to.
LOADL_WALL_LIMIT
  The number of wall clock seconds the job is limited to.

LoadLeveler writes the following current AFS credentials, in order, over the standard input pipe:
v The ktc_principal structure indicating the service
v The ktc_principal structure indicating the client
v The ktc_token structure containing the credentials

The ktc_principal structure is defined in the AFS header file afs_rxkad.h. The ktc_token structure is defined in the AFS header file afs_auth.h.

LoadLeveler expects to read these same structures, in the same order, from the standard output pipe, except that these should be the refreshed credentials produced by the installation exit. The installation exit can modify the passed credentials (to extend their lifetime) and pass them back, or it can obtain new credentials. LoadLeveler takes whatever is returned and uses it to authenticate the user prior to starting the user's job.

Filtering a job script

You can write a program to filter a job script when the job is submitted to the local cluster and when the job is submitted from a remote cluster. This program can, for example, modify defaults or perform site-specific verification of parameters.

To invoke the local job filter, specify the SUBMIT_FILTER keyword in your configuration file. To invoke the remote job filter, specify the CLUSTER_REMOTE_JOB_FILTER keyword in your configuration file. For more information on these keywords, see the SUBMIT_FILTER or CLUSTER_REMOTE_JOB_FILTER keyword in Chapter 12, "Configuration file reference," on page 263.

LoadLeveler sets the following environment variables when the program is invoked:

LOADL_ACTIVE
  The LoadLeveler version.
LOADL_STEP_COMMAND
  The job command file name.
LOADL_STEP_ID
  The job identifier, generated by LoadLeveler.
LOADL_STEP_OWNER
  The owner (UNIX user name) of the job.

For details about specific keyword syntax and use in the configuration file, see Chapter 12, "Configuration file reference," on page 263.
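The following Korn shell sketch illustrates one possible local submit filter. It assumes, as is conventional for such filters, that the job command file arrives on standard input and that the (possibly modified) file must be written to standard output; the log file path is hypothetical:

#!/bin/ksh
# Hypothetical submit filter: log each submission, then pass the
# job command file through unchanged.
echo "$(date): job file $LOADL_STEP_COMMAND submitted by $LOADL_STEP_OWNER" \
    >> /var/tmp/submit_filter.log
# Write the job command file to standard output unmodified.
cat
exit 0

To enable it, point the SUBMIT_FILTER keyword at the script, for example:

SUBMIT_FILTER = /u/loadl/bin/submit_filter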
Writing prolog and epilog programs

An administrator can write prolog and epilog installation exits that run before and after a LoadLeveler job runs, respectively.

Prolog and epilog programs fall into two types:
v Those that run under the LoadLeveler user ID.
v Those that run in a user's environment.

Depending on the type of processing you want to perform before or after a job runs, specify one or more of the following configuration file keywords, in any combination:
v To run a prolog or epilog program under the LoadLeveler user ID, specify JOB_PROLOG or JOB_EPILOG, respectively.
v To run a prolog or epilog program under the user's environment, specify JOB_USER_PROLOG or JOB_USER_EPILOG, respectively.

You do not have to provide a prolog/epilog pair of programs. You may, for example, use only a prolog program that runs under the LoadLeveler user ID.

For details about specific keyword syntax and use in the configuration file, see Chapter 12, "Configuration file reference," on page 263.

Note: If process tracking is enabled and your prolog or epilog program invokes the mailx command, set the sendwait variable to prevent the background mail process from being killed by process tracking.

A user environment prolog or epilog runs with AFS authentication if installed and enabled. For security reasons, you must code these programs on the machines where the job runs and on the machine that schedules the job. If you do not define a value for these keywords, the user environment prolog and epilog settings on the executing machine are ignored.

The user environment prolog and epilog can set environment variables for the job by sending information to standard output in the following format:

env id = value

Where:
id     Is the name of the environment variable
value  Is the value (setting) of the environment variable

For example, the following user environment prolog sets the environment variable STAGE_HOST for the job:

#!/bin/sh
echo env STAGE_HOST=shd22

Coding conventions for prolog programs

The prolog program is invoked by the starter process. Once the starter process invokes the prolog program, the program obtains information about the job from environment variables.

Syntax:
  prolog_program

Where prolog_program is the name of the prolog program as defined by the JOB_PROLOG keyword.

No arguments are passed to the program, but several environment variables are set. For more information on these environment variables, see "Run-time environment variables" on page 400.

The real and effective user ID of the prolog process is the LoadLeveler user ID. If the prolog program requires root authority, the administrator must write a secure C or Perl program to perform the desired actions. You should not use shell scripts with set-uid permissions, since these scripts may make your system susceptible to security problems.

Return code values:

0  The job will begin.

A nonzero return code prevents the job step from running, as the samples below illustrate. If the prolog program is ended with a signal, the job does not begin and a message is written to the starter log.

Sample prolog programs:

v Sample of a prolog program for Korn shell:

#!/bin/ksh
#
# Set up environment
set -a
. /etc/environment
. /.profile
export PATH="$PATH:/loctools/lladmin/bin"
export LOG="/tmp/$LOADL_STEP_OWNER.$LOADL_STEP_ID.prolog"
#
# Do set up based upon job step class
#
case $LOADL_STEP_CLASS in
  # An OSL job is about to run; make sure the osl filesystem is
  # mounted. If the status is nonzero, the filesystem cannot be
  # mounted and the job step should not run.
  "OSL") mount_osl_files >> $LOG
         if [ $? -ne 0 ]
         then EXIT_CODE=1
         else EXIT_CODE=0
         fi
         ;;
  # A simulation job is about to run; simulation data has to
  # be made available to the job. The status from the copy script
  # must be zero or the job step cannot run.
  "sim") copy_sim_data >> $LOG
         if [ $? -eq 0 ]
         then EXIT_CODE=0
         else EXIT_CODE=1
         fi
         ;;
  # All other jobs require free space in /tmp; make sure
  # enough space is available.
  *) check_tmp >> $LOG
     EXIT_CODE=$?
     ;;
esac
# The job step will run only if EXIT_CODE == 0
exit $EXIT_CODE

v Sample of a prolog program for C shell:

#!/bin/csh
#
# Set up environment
source /u/loadl/.login
#
setenv PATH "${PATH}:/loctools/lladmin/bin"
setenv LOG "/tmp/${LOADL_STEP_OWNER}.${LOADL_STEP_ID}.prolog"
#
# Do set up based upon job step class
#
switch ($LOADL_STEP_CLASS)
  # An OSL job is about to run; make sure the osl filesystem is
  # mounted. If the status is nonzero, the filesystem cannot be
  # mounted and the job step should not run.
  case "OSL":
    mount_osl_files >> $LOG
    if ($status != 0) then
      set EXIT_CODE = 1
    else
      set EXIT_CODE = 0
    endif
    breaksw
  # A simulation job is about to run; simulation data has to
  # be made available to the job. The status from the copy script
  # must be zero or the job step cannot run.
  case "sim":
    copy_sim_data >> $LOG
    if ($status == 0) then
      set EXIT_CODE = 0
    else
      set EXIT_CODE = 1
    endif
    breaksw
  # All other jobs require free space in /tmp; make sure
  # enough space is available.
  default:
    check_tmp >> $LOG
    set EXIT_CODE = $status
    breaksw
endsw
# The job step will run only if EXIT_CODE == 0
exit $EXIT_CODE

Coding conventions for epilog programs

The installation-defined epilog program is invoked after a job step has completed. The purpose of the epilog program is to perform any required cleanup, such as unmounting file systems, removing files, and copying results. The exit status of both the prolog program and the job step is set in environment variables.

Syntax:

  epilog_program

Where epilog_program is the name of the epilog program as defined by the JOB_EPILOG keyword.
No arguments are passed to the program, but several environment variables are set. These environment variables are described in "Run-time environment variables" on page 400. In addition, the following environment variables are set for the epilog programs:

LOADL_PROLOG_EXIT_CODE
  The exit code from the prolog program. This environment variable is set only if a prolog program is configured to run.
LOADL_USER_PROLOG_EXIT_CODE
  The exit code from the user prolog program. This environment variable is set only if a user prolog program is configured to run.
LOADL_JOB_STEP_EXIT_CODE
  The exit code from the job step.

Note: To interpret the exit status of the prolog program and the job step, convert the string to an integer and use the macros found in the sys/wait.h file. These macros include:
v WEXITSTATUS: gives you the exit code
v WTERMSIG: gives you the signal that terminated the program
v WIFEXITED: tells you if the program exited
v WIFSIGNALED: tells you if the program was terminated by a signal

The exit codes returned by the WEXITSTATUS macro are the valid codes. However, if you look at the raw numbers in sys/wait.h, the exit code may appear to be 256 times the expected return code. The numbers in sys/wait.h are those used by the wait3 system call.

Sample epilog programs:

v Sample of an epilog program for Korn shell:

#!/bin/ksh
#
# Set up environment
set -a
. /etc/environment
. /.profile
export PATH="$PATH:/loctools/lladmin/bin"
export LOG="/tmp/$LOADL_STEP_OWNER.$LOADL_STEP_ID.epilog"
#
if [[ -z $LOADL_PROLOG_EXIT_CODE ]]
then echo "Prolog did not run" >> $LOG
else echo "Prolog exit code = $LOADL_PROLOG_EXIT_CODE" >> $LOG
fi
#
if [[ -z $LOADL_USER_PROLOG_EXIT_CODE ]]
then echo "User environment prolog did not run" >> $LOG
else echo "User environment exit code = $LOADL_USER_PROLOG_EXIT_CODE" >> $LOG
fi
#
if [[ -z $LOADL_JOB_STEP_EXIT_CODE ]]
then echo "Job step did not run" >> $LOG
else echo "Job step exit code = $LOADL_JOB_STEP_EXIT_CODE" >> $LOG
fi
#
# Do clean up based upon job step class
#
case $LOADL_STEP_CLASS in
  # An OSL job just ran; unmount the filesystem.
  "OSL") umount_osl_files >> $LOG
         ;;
  # A simulation job just ran; remove input files.
  # Copy results if the simulation was successful (the
  # LOADL_JOB_STEP_EXIT_CODE variable contains the exit status
  # from the job step).
  "sim") rm_sim_data >> $LOG
         if [ $LOADL_JOB_STEP_EXIT_CODE = 0 ]
         then copy_sim_results >> $LOG
         fi
         ;;
  # Clean up /tmp
  *) clean_tmp >> $LOG
     ;;
esac

v Sample of an epilog program for C shell:

#!/bin/csh
#
# Set up environment
source /u/loadl/.login
#
setenv PATH "${PATH}:/loctools/lladmin/bin"
setenv LOG "/tmp/${LOADL_STEP_OWNER}.${LOADL_STEP_ID}.epilog"
#
if ( ${?LOADL_PROLOG_EXIT_CODE} ) then
  echo "Prolog exit code = $LOADL_PROLOG_EXIT_CODE" >> $LOG
else
  echo "Prolog did not run" >> $LOG
endif
#
if ( ${?LOADL_USER_PROLOG_EXIT_CODE} ) then
  echo "User environment exit code = $LOADL_USER_PROLOG_EXIT_CODE" >> $LOG
else
  echo "User environment prolog did not run" >> $LOG
endif
#
if ( ${?LOADL_JOB_STEP_EXIT_CODE} ) then
  echo "Job step exit code = $LOADL_JOB_STEP_EXIT_CODE" >> $LOG
else
  echo "Job step did not run" >> $LOG
endif
#
# Do clean up based upon job step class
#
switch ($LOADL_STEP_CLASS)
  # An OSL job just ran; unmount the filesystem.
  case "OSL":
    umount_osl_files >> $LOG
    breaksw
  # A simulation job just ran; remove input files.
  # Copy results if the simulation was successful (the
  # LOADL_JOB_STEP_EXIT_CODE variable contains the exit status
  # from the job step).
  case "sim":
    rm_sim_data >> $LOG
    if ($LOADL_JOB_STEP_EXIT_CODE == 0) then
      copy_sim_results >> $LOG
    endif
    breaksw
  # Clean up /tmp
  default:
    clean_tmp >> $LOG
    breaksw
endsw

Using your own mail program

You can write a program to override the LoadLeveler default mail notification method. You can use this program, for example, to display your own messages to users when a job completes, or to automate tasks such as sending error messages to a network manager.
The syntax for the program is the same as it is for standard UNIX mail programs; the command is called with the following arguments:
v -s to indicate a subject
v A pointer to a string containing the subject
v A pointer to a string containing a list of mail recipients

The mail message is taken from standard input.

To enable this program to replace the default mail notification method, use the MAIL keyword in the configuration file. For details about specific keyword syntax and use in the configuration file, see Chapter 12, "Configuration file reference," on page 263.
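A minimal Korn shell sketch of such a program follows; the script path is hypothetical, and the sketch assumes the arguments arrive as -s, then the subject string, then the recipient list, as described above:

#!/bin/ksh
# Hypothetical mail exit: prepend a site banner to every LoadLeveler
# notification, then hand the message to the standard mail command.
# $1 is "-s", $2 is the subject, $3 is the list of recipients.
SUBJECT="$2"
RECIPIENTS="$3"
{
  echo "*** Notification from the LoadLeveler cluster ***"
  cat -            # the message body arrives on standard input
} | mail -s "$SUBJECT" $RECIPIENTS

To enable it, set the MAIL keyword in the configuration file:

MAIL = /u/loadl/bin/llmail.sh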
Chapter 5. Defining LoadLeveler resources to administer

After installing LoadLeveler, you may customize it by modifying the administration file. The administration file optionally lists and defines the machines in the LoadLeveler cluster and the characteristics of classes, users, and groups.

LoadLeveler does not prevent you from having multiple copies of administration files, but you need to be sure to update all the copies whenever you make a change to one. Having only one administration file prevents any confusion.

Table 18 lists the LoadLeveler resources you may define by modifying the administration file.

Table 18. Roadmap of tasks for modifying the LoadLeveler administration file

To learn about:  Modifying the administration file
Read:            "Steps for modifying an administration file"

To learn about:  Defining LoadLeveler resources to administer
Read:            v "Defining machines" on page 84
                 v "Defining adapters" on page 86
                 v "Defining classes" on page 89
                 v "Defining users" on page 97
                 v "Defining groups" on page 99
                 v "Defining clusters" on page 100

To learn about:  Correctly specifying administration file keywords
Read:            Chapter 13, "Administration file reference," on page 321

Steps for modifying an administration file

All LoadLeveler commands, daemons, and processes read the administration and configuration files at startup. If you change the administration or configuration files after LoadLeveler has already started, any LoadLeveler command or process, such as the LoadL_starter process, will read the newer version of the files, while the running daemons will continue to use the data from the older version. To ensure that all LoadLeveler commands, daemons, and processes use the same configuration data, run the reconfiguration command on all machines in the cluster each time the administration or configuration files are changed.

Before you begin, you need to:
v Ensure that the installation procedure has completed successfully and that the administration file, LoadL_admin, exists in LoadLeveler's home directory. For additional details about installation, see TWS LoadLeveler: Installation Guide.
v Know how to correctly specify keywords in the administration file. For information about administration file keyword syntax and other details, see Chapter 13, "Administration file reference," on page 321.
v (Optional) Know how to correctly issue the llextRPD command, if you choose
  to use it (see “llextRPD - Extract data from an RSCT peer domain” on page
  443).

Perform the following steps to modify the administration file, LoadL_admin:
1. Identify yourself as a LoadLeveler administrator using the LOADL_ADMIN
   keyword.
2. Provide the following stanza types in the administration file:
   v One machine stanza to define the central manager for the LoadLeveler
     cluster. You also may create machine stanzas for other machines in the
     LoadLeveler cluster. You can use the llextRPD command to automatically
     create machine stanzas.
   v (Optional) An adapter stanza for each type of network adapter that you
     want LoadLeveler jobs to be able to request. You can use the llextRPD
     command to automatically create adapter stanzas.
3. (Optional) Specify one or more of the following stanza types:
   v A class stanza for each set of LoadLeveler jobs that have similar
     characteristics or resource requirements.
   v A user stanza for specific users, if their requirements do not match
     those characteristics defined in the default user stanza.
   v A group stanza for each set of LoadLeveler users that have similar
     characteristics or resource requirements.
4. (Optional) You may specify values for additional administration file
   keywords, which are listed and described in “Administration file keyword
   descriptions” on page 327.
5. Notify LoadLeveler daemons by issuing the llctl command with either the
   reconfig or recycle keyword. Otherwise, LoadLeveler will not process the
   modifications you made to the administration file.

Defining machines

The information in a machine stanza defines the characteristics of that
machine.

You do not have to specify a machine stanza for every machine in the
LoadLeveler cluster, but you must have one machine stanza for the machine that
will serve as the central manager. If you do not specify a machine stanza for
a machine in the cluster, the machine and the central manager still
communicate and jobs are scheduled on the machine, but the machine is assigned
the default values specified in the default machine stanza. If there is no
default stanza, the machine is assigned default values set by LoadLeveler.

Any machine name used in the stanza must be a name which can be resolved to an
IP address. This name is referred to as an interface name because the name can
be used for a program to interface with the machine. Generally, interface
names match the machine name, but they do not have to.

By default, LoadLeveler will append the DNS domain name to the end of any
machine name without a domain name appended before resolving its address. If
you specify a machine name without a domain name appended to it and you do not
want LoadLeveler to append the DNS domain name to it, specify the name using a
trailing period. You may have a need to specify machine names in this way if
you are running a cluster with more than one nameserving technique.
For example, if you are using a DNS nameserver and running NIS, you may have
some machine names, resolved by NIS, to which you do not want LoadLeveler to
append DNS names. In situations such as this, you also want to specify the
name_server keyword in your machine stanzas.

Under the following conditions, you must have a machine stanza for the machine
in question:
v If you set the MACHINE_AUTHENTICATE keyword to true in the configuration
  file, then you must create a machine stanza for each node that LoadLeveler
  includes in the cluster.
v If the machine’s hostname (the name of the machine returned by the UNIX
  hostname command) does not match an interface name. In this case, you must
  specify the interface name as the machine stanza name and specify the
  machine’s hostname using the alias keyword.
v If the machine’s hostname does match an interface name but not the correct
  interface name.
For information about automatically creating machine stanzas, see “llextRPD -
Extract data from an RSCT peer domain” on page 443.

Planning considerations for defining machines

There are several matters to consider before customizing the administration
file.

Before customizing the administration file, consider the following:
v Node availability
  Some workstation owners might agree to accept LoadLeveler jobs only when
  they are not using the workstation themselves. Using LoadLeveler keywords,
  these workstations can be configured to be available at designated times
  only.
v Common name space
  To run jobs on any machine in the LoadLeveler cluster, a user needs the same
  uid (the user ID number for a user) and gid (the group ID number for a
  group) on every machine in the cluster.
  For example, if there are two machines in your LoadLeveler cluster,
  machine_1 and machine_2, user john must have the same user ID and login
  group ID in the /etc/passwd file on both machines. If user john has user ID
  1234 and login group ID 100 on machine_1, then user john must have the same
  user ID and login group ID in /etc/passwd on machine_2. (LoadLeveler
  requires a job to run with the same group ID and user ID of the person who
  submitted the job.)
  If you do not have a user ID on one machine, your jobs will not run on that
  machine. Also, many commands, such as llq, will not work correctly if a user
  does not have a user ID on the central manager machine.
  However, there are cases where you may choose to not give a user a login ID
  on a particular machine. For example, a user does not need an ID on every
  submit-only machine; the user only needs to be able to submit jobs from at
  least one such machine. Also, you may choose to restrict a user’s access to
  a Schedd machine that is not a public scheduler; again, the user only needs
  access to at least one Schedd machine.
v Resource handling
  Some nodes in the LoadLeveler cluster might have special software installed
  that users might need to run their jobs successfully. You should configure
  LoadLeveler to distinguish those nodes from other nodes using, for example,
  machine features.
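Tying together the modification steps and the naming rules above, a minimal
LoadL_admin file might begin like the following sketch. All host and class
names here are invented, and the name_server value assumes NIS is one of the
accepted values for that keyword; see Chapter 13, “Administration file
reference,” on page 321 for the authoritative syntax.
# Minimal administration file sketch (hypothetical names)
node01.example.com: type = machine
                    central_manager = true   # required: the central manager

nismachine.:        type = machine           # trailing period: do not append
                    name_server = NIS        # the DNS domain; NIS resolves it

batch:              type = class             # optional class stanza (step 3)
                    wall_clock_limit = 30:00
After editing the file, issuing llctl with the reconfig keyword (step 5)
ensures the running daemons pick up the changes.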
Machine stanza format and keyword summary

Machine stanzas take the following format. Default values for keywords appear
in bold:

label: type = machine
       adapter_stanzas = stanza_list
       alias = machine_name
       central_manager = true | false | alt
       cpu_speed_scale = true | false
       machine_mode = batch | interactive | general
       master_node_exclusive = true | false
       max_jobs_scheduled = number
       name_server = list
       pool_list = pool_numbers
       reservation_permitted = true | false
       resources = name(count) name(count) ... name(count)
       schedd_fenced = true | false
       schedd_host = true | false
       speed = number
       submit_only = true | false

Figure 11. Format of a machine stanza

Examples: Machine stanzas

These machine stanza examples may apply to your situation.
v Example 1
  In this example, the machine is being defined as the central manager.
  #
  machine_a: type = machine
             central_manager = true   # central manager runs here
v Example 2
  This example sets up a submit-only node. Note that the submit-only keyword
  in the example is set to true, while the schedd_host keyword is set to
  false. You must also ensure that you set the schedd_host to true on at
  least one other node in the cluster.
  #
  machine_b: type = machine
             central_manager = false  # not the central manager
             schedd_host = false      # not a scheduling machine
             submit_only = true       # submit only machine
             alias = machineb         # interface name
v Example 3
  In the following example, machine_c is the central manager and has an alias
  associated with it:
  #
  machine_c: type = machine
             central_manager = true   # central manager runs here
             schedd_host = true       # defines a public scheduler
             alias = brianne

Defining adapters

An adapter stanza identifies network adapters that are available on the
machines in the LoadLeveler cluster.
If you want LoadLeveler jobs to be able to request specific adapters, you must
either specify adapter stanzas or configure dynamic adapters in the
administration file.

Note the following when using an adapter stanza:
v An adapter stanza is required for each adapter stanza name you specify on
  the adapter_stanzas keyword of the machine stanza.
v The adapter_name, interface_address, and interface_name keywords are
  required.
For information about creating adapter stanzas, see “llextRPD - Extract data
from an RSCT peer domain” on page 443 for peer domains.

Configuring dynamic adapters

LoadLeveler can dynamically determine the adapters in any operating system
instance (OSI) that has RSCT installed. LoadLeveler must be told on an OSI
basis if it is to handle dynamic adapter configuration changes for that OSI.
The specification of whether to use dynamic or static adapter configuration
for an OSI is done through the presence or absence of the machine stanza’s
adapter_stanzas keyword.

If a machine stanza in the administration file contains an adapter_stanzas
statement then this is taken as a directive by the LoadLeveler administrator
to use only those specified adapters. For this OSI, LoadLeveler will not
perform any dynamic adapter configuration or processing. If an adapter change
occurs in this OSI then the administrator will have to make the corresponding
change in the administration file and then stop and restart or reconfigure the
LoadLeveler startd daemon to pick up the adapter changes.

If an OSI (machine stanza) in the administration file does not contain the
adapter_stanzas keyword then this is taken as a directive by the LoadLeveler
administrator for LoadLeveler to dynamically configure the adapters for that
OSI. For that OSI, LoadLeveler will determine what adapters are present at
startup via calls to the RMC API. If an adapter change occurs during execution
in the OSI then LoadLeveler will automatically detect and handle the change
without requiring a restart or reconfiguration.
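The two directives can be contrasted in a short sketch (host and stanza names
are invented for illustration):
# Static: LoadLeveler uses only the listed adapters and performs no
# dynamic adapter processing for this OSI.
static_node:  type = machine
              adapter_stanzas = static_node_sn0 static_node_en0

# Dynamic: no adapter_stanzas keyword, so LoadLeveler discovers the
# adapters itself through the RMC API and tracks changes automatically.
dynamic_node: type = machine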
Configuring InfiniBand adapters

InfiniBand adapters, known as host channel adapters (HCAs), can be
multiported. Tasks can use ports of an HCA independently, which allows them to
be allocated by the scheduling algorithm independently.

| Note: InfiniBand adapters are supported on the AIX operating system and in
| SUSE Linux Enterprise Server (SLES) 9 and SLES 10 on TWS LoadLeveler for
| POWER clusters.

An InfiniBand adapter can have multiple adapter ports. Each port on the
InfiniBand adapter can be connected to one network and will be managed by TWS
LoadLeveler as a switch adapter. InfiniBand adapter ports derive their
resources and usage state from the InfiniBand adapter with which they are
associated, but are allocated to jobs separately.

If you want LoadLeveler jobs to be able to request InfiniBand adapters, you
must either specify adapter stanzas or configure dynamic adapters in the
administration file. The InfiniBand ports are identified to TWS LoadLeveler in
the same way other adapters are. Stanzas are specified in the administration
file if static adapters are used and the ports are discovered by RSCT if
dynamic adapters are used.

The port_number administration keyword has been added to support an InfiniBand
port. The port_number keyword specifies the port number of the InfiniBand
adapter port. Only InfiniBand ports are managed and displayed by TWS
LoadLeveler; the InfiniBand adapter itself is not. The adapter stanza for
InfiniBand support only contains the adapter port information. There is no
InfiniBand adapter information in the adapter stanza (see example 2 in
“Examples: Adapter stanzas” on page 89).

Note:
1. TWS LoadLeveler distributes the switch adapter windows of the InfiniBand
   adapter equally among its ports and the allocation is not adjusted should
   all of the resources on one port be consumed.
2. The InfiniBand ports determine their usage state and availability from
   their InfiniBand adapter. If one port is in use exclusively, no other
   ports on the adapter can be used for any other job.
3. If you have a mixed cluster where some nodes use the InfiniBand adapter
   and some nodes use the HPS adapter, you have to organize the nodes into
   pools so that the job is only dispatched to nodes with the same kind of
   switch adapter.
4. There is no change to the way the InfiniBand adapters are requested on the
   job command file network statement; that is, InfiniBand adapters are
   requested the same way as any other adapter would be.
5. Because InfiniBand adapters do not support rCxt blocks, jobs that would
   otherwise use InfiniBand adapters, but which also request rCxt blocks with
   the rcxtblks keyword on the network statement, will remain in the idle
   state. This behavior is consistent with how other adapters (for example,
   the HPS) behave in the same situation. You can use the llstatus -a command
   to see rCxt blocks on adapters (see “llstatus - Query machine status” on
   page 512 for more information).

Adapter stanza format and keyword summary

An adapter stanza has the following format:

label: type = adapter
       adapter_name = name
       adapter_type = type
       device_driver_name = name
       interface_address = IP_address
       interface_name = name
       logical_id = id
       multilink_address = ip_address
       multilink_list = adapter_name <, adapter_name>*
       network_id = id
       network_type = type
       port_number = number
       switch_node_number = integer

Figure 12. Format of an adapter stanza
Examples: Adapter stanzas

These adapter stanza examples may apply to your situation.
v Example 1: Specifying an HPS adapter
  In the following example, the adapter stanza called
  “c121s0n10.ppd.pok.ibm.com” specifies an HPS adapter. Note that
  c121s0n10.ppd.pok.ibm.com is also specified on the adapter_stanzas keyword
  of the machine stanza for the “yugo” machine.
  yugo: type=machine
        adapter_stanzas = c121s0n10.ppd.pok.ibm.com
        ...
  c121s0n10.ppd.pok.ibm.com: type = adapter
        adapter_name = sn0
        network_type = switch
        interface_address = 192.168.0.10
        interface_name = c121s0n10.ppd.pok.ibm.com
        multilink_address = 10.10.10.10
        logical_id = 2
        adapter_type = Switch_Network_Interface_For_HPS
        device_driver_name = sni0
        network_id = 1
  c121f2rp02.ppd.pok.ibm.com: type = adapter
        adapter_name = en0
        network_type = ethernet
        interface_address = 9.114.66.74
        interface_name = c121f2rp02.ppd.pok.ibm.com
        device_driver_name = ent0
v Example 2: Specifying an InfiniBand adapter
  In the following example, the port_number specifies the port number of the
  InfiniBand adapter port:
  192.168.9.58: type = adapter
        adapter_name = ib1
        network_type = InfiniBand
        interface_address = 192.168.9.58
        interface_name = 192.168.9.58
        logical_id = 23
        adapter_type = InfiniBand
        device_driver_name = ehca0
        network_id = 18338657682652659714
        port_number = 2

Defining classes

The information in a class stanza defines characteristics for that class.

These characteristics can include the quantities of consumable resources that
may be used by a class per machine or cluster. Within a class stanza, you can
have optional user substanzas that define policies that apply to a user’s job
steps that need to use this class. For more information about user substanzas,
see “Defining user substanzas in class stanzas” on page 94. For information
about user stanzas, see “Defining users” on page 97.

Using limit keywords

A limit is the amount of a resource that a job step or a process is allowed to
use. (A process is a dispatchable unit of work.) A job step may be made up of
several processes.
Limits include both a hard limit and a soft limit. When a hard limit is
exceeded, the job is usually terminated. When a soft limit is exceeded, the
job is usually given a chance to perform some recovery actions.

Limits are enforced either per process or per job step, depending on the type
of limit. For parallel job steps, which consist of multiple tasks running on
multiple machines, limits are enforced on a per task basis.

The class stanza includes the limit keywords shown in Table 19, which allow
you to control the amount of resources used by a job step or a job process.

Table 19. Types of limit keywords

Limit                       How the limit is enforced
as_limit                    Per process
ckpt_time_limit             Per job step
core_limit                  Per process
cpu_limit                   Per process
data_limit                  Per process
default_wall_clock_limit    Per job step
file_limit                  Per process
job_cpu_limit               Per job step
locks_limit                 Per process
memlock_limit               Per process
nofile_limit                Per process
nproc_limit                 Per user
rss_limit                   Per process
stack_limit                 Per process
wall_clock_limit            Per job step

For example, a common limit is the cpu_limit, which limits the amount of CPU
time a single process can use. If you set cpu_limit to five hours and you have
a job step that forks five processes, each process can use up to five hours of
CPU time, for a total of 25 CPU hours. Another limit that controls the amount
of CPU used is job_cpu_limit. For a serial job step, if you impose a
job_cpu_limit of five hours, the entire job step (made up of all five
processes) cannot consume more than five CPU hours. For information on using
this keyword with parallel jobs, see the job_cpu_limit keyword. A sketch
contrasting these two keywords appears at the end of this topic.

You can specify limits in either the class stanza of the administration file
or in the job command file. The lower of these two limits will be used to run
the job even if the system limit for the user is lower. For more information,
see:
v “Enforcing limits”
v “Administration file keyword descriptions” on page 327 or “Job command file
  keyword descriptions” on page 359
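As a concrete illustration of the per-process versus per-job-step distinction,
a class stanza might set both CPU limits. The class name and values below are
invented for this sketch:
cpu_bound: type = class
           cpu_limit = 05:00:00       # per process: each of the five
                                      # forked processes may use 5 CPU
                                      # hours (up to 25 hours in total)
           job_cpu_limit = 05:00:00   # per job step: all processes of a
                                      # serial step share 5 CPU hours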
Enforcing limits

LoadLeveler depends on the underlying operating system to enforce process
limits. Users should verify that a process limit such as rss_limit is enforced
by the operating system, otherwise setting it in LoadLeveler will have no
effect.

Exceeding job step limits: When a hard limit is exceeded LoadLeveler sends a
non-trappable signal (except in the case of a parallel job) to the process
group that LoadLeveler created for the job step. When a soft limit is
exceeded, LoadLeveler sends a trappable signal to the process group.

Any job application that intends to trap a signal sent by LoadLeveler must
ensure that all processes in the process group set up the appropriate signal
handler. All processes in the job step normally receive the signal. The
exception to this rule is when a child process creates its own process group.
That action isolates the child’s process, and its children, from any signals
that LoadLeveler sends. Any child process creating its own process group is
still known to process tracking. So, if process tracking is enabled, all the
child processes are terminated when the main process terminates.

Table 20 summarizes the actions that the LoadL_starter daemon takes when a job
step limit is exceeded.

Table 20. Enforcing job step limits

Type of Job   When a Soft Limit is Exceeded         When a Hard Limit is Exceeded
Serial        SIGXCPU or SIGKILL issued             SIGKILL issued
Parallel      SIGXCPU issued to both the user       SIGTERM issued
              program and to the parallel daemon

On systems that do not support SIGXCPU, LoadLeveler does not distinguish
between hard and soft limits. When a soft limit is reached on these platforms,
LoadLeveler issues a SIGKILL.

Enforcing per process limits: For per process limits, what happens when your
job reaches and exceeds either the soft limit or the hard limit depends on the
operating system you are using.

When a job forks a process that exceeds a per process limit, such as the CPU
limit, the operating system (not LoadLeveler) terminates the process by
issuing a SIGXCPU. As a result, you will not see an entry in the LoadLeveler
logs indicating that the process exceeded the limit. The job will complete
with a 0 return code. LoadLeveler can only report the status of any processes
it has started. If you need more specific information, refer to your operating
system documentation.

How LoadLeveler uses hard limits: Consider these details on how LoadLeveler
uses hard limits.
See Table 21 for more information on specifying limits.

Table 21. Setting limits

If the hard limit is:                 Then LoadLeveler does the following:
Set in both the class stanza and the  Smaller of the two limits is taken into
job command file                      consideration. If the smaller limit is
                                      the job limit, the job limit is then
                                      compared with the user limit set on the
                                      machine that runs the job. The smaller
                                      of these two values is used. If the
                                      limit used is the class limit, the class
                                      limit is used without being compared to
                                      the machine limit.
Not set in either the class stanza or User per process limit set on the
the job command file                  machine that runs the job is used.
Set in the job command file and is    The job is not submitted.
less than its respective job soft
limit
Set in the class stanza and is less   Soft limit is adjusted downward to equal
than its respective class stanza soft the hard limit.
limit
Specified in the job command file     Hard limit must be greater than or equal
                                      to the specified soft limit and less
                                      than or equal to the limit set by the
                                      administrator in the class stanza of the
                                      administration file.

Note: If the per process limit is not defined in the administration file and
the hard limit defined by the user in the job command file is greater than the
limit on the executing machine, then the hard limit is set to the machine
limit.

Allowing users to use a class

In a class stanza, you may define a list of users or a list of groups to
identify those who may use the class.

To do so, use the include_users or include_groups keyword, respectively, or
you may use both keywords. If you specify both keywords, a particular user
must satisfy both the include_users and the include_groups restrictions for
the class. This requirement means that a particular user must be defined not
only in a User stanza in the administration file, but also in one of the
following ways:
v The user’s name must appear in the include_users keyword in a Group stanza
  whose name corresponds to a name in the include_groups keyword of the Class
  stanza.
v The user’s name must appear in the include_groups keyword of the Class
  stanza.
For information about specifying a user name in a group list, see the
include_groups keyword description in “Administration file keyword
descriptions” on page 327.

Class stanza format and keyword summary

Class stanzas are optional.

Class stanzas take the following format. Default values for keywords appear in
bold.
label: type = class
       admin = list
       allow_scale_across_jobs = true | false
       as_limit = hardlimit,softlimit
       ckpt_dir = directory
       ckpt_time_limit = hardlimit,softlimit
       class_comment = "string"
       core_limit = hardlimit,softlimit
       cpu_limit = hardlimit,softlimit
       data_limit = hardlimit,softlimit
       default_resources = name(count) name(count)...name(count)
       default_node_resources = name(count) name(count)...name(count)
       env_copy = all | master
       exclude_bg = list
       exclude_groups = list
       exclude_users = list
       file_limit = hardlimit,softlimit
       include_bg = list
       include_groups = list
       include_users = list
       job_cpu_limit = hardlimit,softlimit
       locks_limit = hardlimit,softlimit
       master_node_requirement = true | false
       max_node = number
       max_protocol_instances = number
       max_top_dogs = number
       max_total_tasks = number
       maxjobs = number
       memlock_limit = hardlimit,softlimit
       nice = value
       nofile_limit = hardlimit,softlimit
       nproc_limit = hardlimit,softlimit
       priority = number
       rss_limit = hardlimit,softlimit
       smt = yes | no | as_is
       stack_limit = hardlimit,softlimit
|      striping_with_minimum_networks = true | false
       total_tasks = number
       wall_clock_limit = hardlimit,softlimit
       default_wall_clock_limit = hardlimit,softlimit

Figure 13. Format of a class stanza

Examples: Class stanzas

Any of the following class stanza examples may apply to your situation.
v Example 1: Creating a class that excludes certain users
  class_a: type=class             # class that excludes users
           priority=10            # ClassSysprio
           exclude_users=green judy  # Excluded users
v Example 2: Creating a class for small-size jobs
  small: type=class               # class for small jobs
         priority=80              # ClassSysprio (max=100)
         cpu_limit=00:02:00       # 2 minute limit
         data_limit=30mb          # max 30 MB data segment
         default_resources=ConsumableVirtualMemory(10mb)  # resources consumed by each
         ConsumableCpus(1) resA(3) floatinglicenseX(1)    # task of a small job step if
                                  # resources are not explicitly
                                  # specified in the job command file
         ckpt_time_limit=3:00,2:00  # 3 minute hardlimit,
                                  # 2 minute softlimit
         core_limit=10mb          # max 10 MB core file
         file_limit=50mb          # max file size 50 MB
         stack_limit=10mb         # max stack size 10 MB
         rss_limit=35mb           # max resident set size 35 MB
         include_users = bob sally  # authorized users
v Example 3: Creating a class for medium-size jobs
  medium: type=class              # class for medium jobs
          priority=70             # ClassSysprio
          cpu_limit=00:10:00      # 10 minute run time limit
          data_limit=80mb,60mb    # max 80 MB data segment
                                  # soft limit 60 MB data segment
          ckpt_time_limit=5:00,4:30  # 5 minute hardlimit,
                                  # 4 minute 30 second softlimit to checkpoint
          core_limit=30mb         # max 30 MB core file
          file_limit=80mb         # max file size 80 MB
          stack_limit=30mb        # max stack size 30 MB
          rss_limit=100mb         # max resident set size 100 MB
          job_cpu_limit=1800,1200 # hard limit is 30 minutes,
                                  # soft limit is 20 minutes
v Example 4: Creating a class for large-size jobs
  large: type=class               # class for large jobs
         priority=60              # ClassSysprio
         cpu_limit=00:10:00       # 10 minute run time limit
         data_limit=120mb         # max 120 MB data segment
         default_resources=ConsumableVirtualMemory(40mb)  # resources consumed
         ConsumableCpus(2) resA(8) floatinglicenseX(1) resB(1)  # by each task of
                                  # a large job step if resources are not
                                  # explicitly specified in the job command file
         ckpt_time_limit=7:00,5:00  # 7 minute hardlimit,
                                  # 5 minute softlimit to checkpoint
         core_limit=30mb          # max 30 MB core file
         file_limit=120mb         # max file size 120 MB
         stack_limit=unlimited    # unlimited stack size
         rss_limit=150mb          # max resident set size 150 MB
         job_cpu_limit = 3600,2700  # hard limit 60 minutes
                                  # soft limit 45 minutes
         wall_clock_limit=12:00:00,11:59:55  # hard limit is 12 hours
v Example 5: Creating a class for master node machines
  sp-6hr-sp: type=class           # class for master node machines
             priority=50          # ClassSysprio (max=100)
             ckpt_time_limit=25:00,20:00  # 25 minute hardlimit,
                                  # 20 minute softlimit to checkpoint
             cpu_limit = 06:00:00 # 6 hour limit
             job_cpu_limit = 06:00:00  # hard limit is 6 hours
             core_limit = 1mb     # max 1 MB core file
             master_node_requirement = true  # master node definition
v Example 6: Creating a class for MPICH-GM jobs
  MPICHGM: type=class             # class for MPICH-GM jobs
           default_resources = gmports(1)  # one gmports resource is consumed by each
                                  # task, if resources are not explicitly
                                  # specified in the job command file

Defining user substanzas in class stanzas

In a class stanza, you may define user substanzas using the same syntax as you
would for any stanza in the LoadLeveler administration file.

A user substanza within a class stanza defines policies that apply to job
steps submitted by that user and belonging to that class. User substanzas are
optional and are independent of user stanzas (for information about user
stanzas, see “Defining users” on page 97).
Class stanzas that contain user substanzas have the following format:

label: {
  type = class
  label: {
    type = user
    maxidle = number
    maxjobs = number
    maxqueued = number
    max_total_tasks = number
  }
}

Figure 14. Format of a user substanza

When defining substanzas within other stanzas, you must use opening and
closing braces ({ and }) to mark the beginning and the end of the stanza and
substanza.

The only keywords that are supported in a user substanza are type (required),
maxidle, maxjobs, maxqueued, and max_total_tasks. For detailed descriptions of
these keywords, see “Administration file keyword descriptions” on page 327.

Examples: Substanzas

Any of these substanza examples may apply to your situation.

In the following example, the default machine and class stanzas do not require
braces, but the parallel class stanza does require them. Without braces to
open and close the parallel stanza, it would not be clear that the default
user and dept_head user stanza belong to the parallel class:

default: type = machine
         central_manager = false
         schedd_host = true

default: type = class
         wall_clock_limit = 60:00,30:00

parallel: {
  type = class
  # Allow at most 50 running jobs for class parallel
  maxjobs = 50
  # Allow at most 10 running jobs for any single
  # user of class parallel
  default: {
    type = user
    maxjobs = 10
  }
  # Allow user dept_head to run as many as 20 jobs
  # of class parallel
  dept_head: {
    type = user
    maxjobs = 20
  }
}

dept_head: type = user
           maxjobs = 30
When user substanzas are used in class stanzas, a default user substanza can
be defined. Each class stanza can have its own default user substanza, and
even the default class stanza can have a default user substanza.

In this example, the default user substanza in the default class indicates
that for any combination of class and user, the limits maxidle=20 and
maxqueued=30 apply, and that maxjobs and max_total_tasks are unlimited. Some
of these values are overridden in the physics class stanza. Here is an example
of how class stanzas can be configured:

default: {
  type = class
  default: {
    type = user
    maxidle = 20
    maxqueued = 30
    maxjobs = -1
    max_total_tasks = -1
  }
}
physics: {
  type = class
  default: {
    type = user
    maxjobs = 10
    max_total_tasks = 128
  }
  john: {
    type = user
    maxidle = 10
    maxjobs = 14
  }
  jane: {
    type = user
    max_total_tasks = 192
  }
}

In the following example, the physics stanza shows which values are inherited
from which stanzas:

physics: {
  type = class
  default: {
    type = user
    # inherited from default class, default user
    # maxidle = 20
    # inherited from default class, default user
    # maxqueued = 30
    # overrides value of -1 in default class, default user
    maxjobs = 10
    # overrides value of -1 in default class, default user
    max_total_tasks = 128
  }
  john: {
    type = user
    # overrides value of 10 in default user
    maxidle = 10
    # inherited from default user, which was inherited
    # from default class, default user
    # maxqueued = 30
    # overrides value of 10 in default user
    maxjobs = 14
    # inherited from default user
    # max_total_tasks = 128
  }
  jane: {
    type = user
    # inherited from default user, which was inherited
    # from default class, default user
    # maxidle = 20
    # inherited from default user, which was inherited
    # from default class, default user
    # maxqueued = 30
    # inherited from default user
    # maxjobs = 10
    # overrides value of 128 in default user
    max_total_tasks = 192
  }
}

Any user other than john and jane who submits jobs of class physics is subject
to the constraints in the default user substanza in the physics class stanza.
Should john or jane submit jobs of any class other than physics, they are
subject to the constraints in the default user substanza in the default class
stanza.

In addition to specifying a default user substanza within the default class
stanza, an administrator can specify other user substanzas in the default
class stanza. It is important to note that all class stanzas will inherit all
user substanzas from the default class stanza.

Note: An important rule to understand is that a user substanza within a class
stanza will inherit its values from the user substanza in the default class
stanza first, if a substanza for that user is present. The next location a
user substanza inherits values from is the default user substanza within the
same class stanza. When no default stanzas or substanzas are provided, the
LoadLeveler default for all four keywords is -1 or unlimited.

If a user substanza is provided for a user on the class exclude_users list,
exclude_users takes precedence and the user substanza will be effectively
ignored because that user cannot use the class at all. On the other hand, when
include_users is used in a class, the presence of a user substanza implies
that the user is permitted to use the class (it is as if the user were present
on the include_users list).

Defining users

The information specified in a user stanza defines the characteristics of that
user.

You can have one user stanza for each user but this is not necessary. If an
individual user does not have their own user stanza, that user uses the
defaults defined in the default user stanza.

User stanza format and keyword summary

User stanzas take a particular format.
User stanzas take the following format:

label: type = user
       account = list
       default_class = list
       default_group = group name
       default_interactive_class = class name
       env_copy = all | master
       fair_shares = number
       max_node = number
       max_reservation_duration = number
|      max_reservation_expiration = number
       max_reservations = number
       max_total_tasks = number
       maxidle = number
       maxjobs = number
       maxqueued = number
       priority = number
       total_tasks = number

Figure 15. Format of a user stanza

For more information about the keywords listed in the user stanza format, see
Chapter 13, “Administration file reference,” on page 321.

Examples: User stanzas

Any of the following user stanzas may apply to your situation.
v Example 1
  In this example, user fred is being provided with a user stanza. User
  fred’s jobs will have a user priority of 100. If user fred does not specify
  a job class in the job command file, the default job class class_a will be
  used. In addition, he can have a maximum of 15 jobs running at the same
  time.
  # Define user stanzas
  fred: type = user
        priority = 100
        default_class = class_a
        maxjobs = 15
v Example 2
  This example explains how a default interactive class for a parallel job is
  set by presenting a series of user stanzas and class stanzas. This example
  assumes that users do not specify the LOADL_INTERACTIVE_CLASS environment
  variable.
  default: type = user
           default_interactive_class = red
           default_class = blue
  carol: type = user
         default_class = single double
         default_interactive_class = ijobs
  steve: type = user
         default_class = single double
  ijobs: type = class
         wall_clock_limit = 08:00:00
  red: type = class
       wall_clock_limit = 30:00
  If the user Carol submits an interactive job, the job is assigned to the
  default interactive class called ijobs. The job is assigned a wall clock
  limit of 8 hours.
  If the user Steve submits an interactive job, the job is assigned to the
  red class from the default user stanza. The job is assigned a wall clock
  limit of 30 minutes.
v Example 3
  In this example, Jane’s jobs have a user priority of 50, and if she does
  not specify a job class in her job command file the default job class
  small_jobs is used. This user stanza does not specify the maximum number of
  jobs that Jane can run at the same time so this value defaults to the value
  defined in the default stanza. Also, suppose Jane is a member of the
  primary UNIX group “staff.” Jobs submitted by Jane will use the default
  LoadLeveler group “staff.” Lastly, Jane can use three different account
  numbers.
  # Define user stanzas
  jane: type = user
        priority = 50
        default_class = small_jobs
        default_group = Unix_Group
        account = dept10 user3 user4

Defining groups

LoadLeveler groups are another way of granting control to the system
administrator.

Although a LoadLeveler group is independent from a UNIX group, you can
configure a LoadLeveler group to have the same users as a UNIX group by using
the include_users keyword.

Group stanza format and keyword summary

The information specified in a group stanza defines the characteristics of
that group.

Group stanzas are optional and take the following format:

label: type = group
       admin = list
       env_copy = all | master
       fair_shares = number
       exclude_users = list
       include_users = list
       max_node = number
       max_reservation_duration = number
|      max_reservation_expiration = number
       max_reservations = number
       max_total_tasks = number
       maxidle = number
       maxjobs = number
       maxqueued = number
       priority = number
       total_tasks = number

Figure 16. Format of a group stanza

For more information about the keywords listed in the group stanza format, see
Chapter 13, “Administration file reference,” on page 321.

Examples: Group stanzas

Any of the following group stanzas may apply to your situation.
v Example 1
  In this example, the group name is department_a. The jobs issued by users
  belonging to this group will have a priority of 80. There are three members
  in this group.
  # Define group stanzas
  department_a: type = group
                priority = 80
                include_users = susann holly fran
v Example 2
  In this example, the group called great_lakes has five members and these
  users’ jobs have a priority of 100:
  # Define group stanzas
  great_lakes: type = group
               priority = 100
               include_users = huron ontario michigan erie superior

Defining clusters

The cluster stanza defines the LoadLeveler multicluster environment.

Any cluster that wants to participate in the multicluster must have cluster
stanzas defined for all clusters with which the local cluster interacts. If
you have a cluster stanza defined, LoadLeveler is configured to be in the
multicluster environment.

Cluster stanza format and keyword summary

Cluster stanzas are optional.

Cluster stanzas take the following format. Default values for keywords appear
in bold. The cluster stanza label must define a unique cluster name within the
multicluster environment.

label: type = cluster
|      allow_scale_across_jobs = true | false
       exclude_classes = class_name[(cluster_name)] ...
       exclude_groups = group_name[(cluster_name)] ...
       exclude_users = user_name[(cluster_name)] ...
       inbound_hosts = hostname[(cluster_name)] ...
       inbound_schedd_port = port_number
       include_classes = class_name[(cluster_name)] ...
       include_groups = group_name[(cluster_name)] ...
       include_users = user_name[(clustername)] ...
       local = true | false
|      main_scale_across_cluster = true | false
       multicluster_security = SSL
       outbound_hosts = hostname[(cluster_name)] ...
       secure_schedd_port = port_number
       ssl_cipher_list = cipher_list

Figure 17. Format of a cluster stanza

Examples: Cluster stanzas

Any of the following cluster stanzas may apply to your situation.
[Figure 18 is a diagram of three interconnected clusters: cluster1 (machines
M1 and M2, with SCHEDD_STREAM_PORT = 1966), cluster2 (machines M3, M4, and
M5), and cluster3 (machines M6 and M7).]

Figure 18. Multicluster Example

Figure 18 shows a simple multicluster with three clusters defined as members.
Cluster1 has defined an alternate port number for the Schedds running in its
cluster by setting SCHEDD_STREAM_PORT = 1966. All of the other clusters need
to define what port to use when connecting to the inbound Schedds of cluster1
by specifying the inbound_schedd_port = 1966 keyword in the cluster1 stanza.
Cluster2 has a single machine connected to cluster1 and 2 machines connected
to cluster3. Cluster3 has a single machine connected to both cluster2 and
cluster1. Each cluster would set the local keyword to true for its own cluster
stanza in the cluster’s administration file.

# Multicluster with 3 clusters defined as members
cluster1: type=cluster
          outbound_hosts = M2(cluster2) M1(cluster3)
          inbound_hosts = M2(cluster2) M1(cluster3)
          inbound_schedd_port = 1966
cluster2: type=cluster
          outbound_hosts = M3(cluster1) M4(cluster3)
          inbound_hosts = M3(cluster1) M4(cluster3) M5(cluster3)
cluster3: type=cluster
          outbound_hosts = M6
          inbound_hosts = M6
Chapter 6. Performing additional administrator tasks

There are additional ways to modify the LoadLeveler environment that either
require an administrator to customize the configuration and administration
files, or require the use of LoadLeveler commands or APIs.

Table 22 lists these additional administrator tasks.

Table 22. Roadmap of additional administrator tasks

To learn about:                      Read the following:
Setting up the environment for       “Setting up the environment for parallel
parallel jobs                        jobs” on page 104
Configuring and using an             v “Using the BACKFILL scheduler” on page 110
alternative scheduler                v “Using an external scheduler” on page 115
                                     v “Example: Changing scheduler types” on
                                       page 126
| Using additional features          v “Preempting and resuming jobs” on page 126
| available with the BACKFILL        v “Configuring LoadLeveler to support
| scheduler                            reservations” on page 131
|                                    v “Working with reservations” on page 213
|                                    v “Data staging” on page 113
Working with AIX’s workload          “Steps for integrating LoadLeveler with the
balancing component                  AIX Workload Manager” on page 137
Enabling LoadLeveler’s               “LoadLeveler support for checkpointing
checkpoint/restart function          jobs” on page 139
Enabling LoadLeveler’s affinity      v LoadLeveler scheduling affinity (see
scheduling support                     “LoadLeveler scheduling affinity
                                       support” on page 146)
Enabling LoadLeveler’s               v “LoadLeveler multicluster support” on
multicluster support                   page 148
                                     v “Configuring a LoadLeveler multicluster”
                                       on page 150
|                                    v “Scale-across scheduling with
|                                      multiclusters” on page 153
Enabling LoadLeveler’s Blue Gene     v “LoadLeveler Blue Gene support” on page 155
support                              v “Configuring LoadLeveler Blue Gene
                                       support” on page 157
Enabling LoadLeveler’s fair share    v “Fair share scheduling overview” on page 27
scheduling support                   v “Using fair share scheduling” on page 160
Moving job records from a down       v “Procedure for recovering a job spool” on
Schedd to another Schedd within        page 167
the local cluster                    v “llmovespool - Move job records” on page 472
Correctly specifying configuration   v Chapter 12, “Configuration file reference,”
and administration file keywords       on page 263
                                     v Chapter 13, “Administration file
                                       reference,” on page 321
Managing LoadLeveler operations
Table 22. Roadmap of additional administrator tasks (continued)

To learn about:                      Read the following:
v Querying status                    v “llclass - Query class information” on
                                       page 433
                                     v “llq - Query job status” on page 479
                                     v “llqres - Query a reservation” on page 500
                                     v “llstatus - Query machine status” on
                                       page 512
v Changing attributes of submitted   v “llfavorjob - Reorder system queue by
  jobs                                 job” on page 447
                                     v “llfavoruser - Reorder system queue by
                                       user” on page 449
                                     v “llmodify - Change attributes of a
                                       submitted job step” on page 464
                                     v “llprio - Change the user priority of
                                       submitted job steps” on page 477
v Changing the state of submitted    v “llcancel - Cancel a submitted job” on
  jobs                                 page 421
                                     v “llhold - Hold or release a submitted
                                       job” on page 454

Setting up the environment for parallel jobs

Additional administration tasks apply to parallel jobs.

This topic describes the following administration tasks that apply to
parallel jobs:
v Scheduling support
v Reducing job launch overhead
v Submitting interactive POE jobs
v Setting up a class
v Setting up a parallel master node
v Configuring MPICH jobs
v Configuring MVAPICH jobs
v Configuring MPICH-GM jobs
For information on submitting parallel jobs, see “Working with parallel jobs”
on page 194.

Scheduling considerations for parallel jobs

| For parallel jobs, LoadLeveler supports BACKFILL scheduling for efficient
| use of system resources. This scheduler runs both serial and parallel jobs.

BACKFILL scheduling also supports:
v Multiple tasks per node
v Multiple user space tasks per adapter
v Preemption

Specify the LoadLeveler scheduler using the SCHEDULER_TYPE keyword. For more
information on this keyword and supported scheduler types, see “Choosing a
scheduler” on page 44.
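For example, selecting the BACKFILL scheduler takes a single entry in the
configuration file (this is the same setting used in the interactive POE setup
steps later in this chapter):
SCHEDULER_TYPE = BACKFILL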
Steps for reducing job launch overhead for parallel jobs

Administrators may define a number of LoadLeveler starter processes to be
ready and waiting to handle job requests. Having this pool of ready processes
reduces the amount of time LoadLeveler needs to prepare jobs to run.

You also may control how environment variables are copied for a job. Reducing
the number of environment variables that LoadLeveler has to copy reduces the
amount of time LoadLeveler needs to prepare jobs to run.

Before you begin: You need to know:
v How many jobs might be starting at the same time. This estimate determines
  how many starter processes to have LoadLeveler start in advance, to be
  ready and waiting for job requests.
v The type of parallel jobs that typically are used. If IBM Parallel
  Environment (PE) is used for parallel jobs, PE copies the user’s
  environment to all executing nodes. In this case, you may configure
  LoadLeveler to avoid redundantly copying the same environment variables.
v How to correctly specify configuration keywords. For details about specific
  keyword syntax and use:
  – In the administration file, see Chapter 13, “Administration file
    reference,” on page 321.
  – In the configuration file, see Chapter 12, “Configuration file
    reference,” on page 263.

Perform the following steps to configure LoadLeveler to reduce job launch
overhead for parallel jobs.
1. In the local or global configuration file, specify the number of starter
   processes for LoadLeveler to automatically start before job requests are
   submitted. Use the PRESTARTED_STARTERS keyword to set this value.
   Tip: The default value of 1 should be sufficient for most installations.
2. If typical parallel jobs use a facility such as Parallel Environment,
   which copies user environment variables to all executing nodes, set the
   env_copy keyword in the class, user, or group stanzas to specify that
   LoadLeveler only copy user environment variables to the master node by
   default.
   Rules:
   v Users also may set this keyword in the job command file. If the env_copy
     keyword is set in the job command file, that setting overrides any
     setting in the administration file. For more information, see “Step for
     controlling whether LoadLeveler copies environment variables to all
     executing nodes” on page 195.
   v If the env_copy keyword is set in more than one stanza in the
     administration file, LoadLeveler determines the setting to use by
     examining all values set in the applicable stanzas. See the table in the
     env_copy administration file keyword to determine what value LoadLeveler
     will use.
3. Notify LoadLeveler daemons by issuing the llctl command with either the
   reconfig or recycle keyword. Otherwise, LoadLeveler will not process the
   modifications you made to the configuration and administration files.

When you are done with this procedure, you can use the POE stderr and
LoadLeveler logs to trace actions during job launch.
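Steps 1 and 2 might translate into entries like the following sketch. The
starter count and class name are illustrative only:
# Global or local configuration file (step 1)
PRESTARTED_STARTERS = 2       # keep two starter processes waiting

# Administration file (step 2): copy the environment only to the
# master node for this class; PE propagates it to the other nodes.
poe_jobs: type = class
          env_copy = master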
Steps for allowing users to submit interactive POE jobs

You can set up your system so that users can submit interactive POE jobs to
LoadLeveler.

Perform the following steps to set up your system so that users can submit
interactive POE jobs to LoadLeveler.
1. Make sure that you have installed LoadLeveler and defined LoadLeveler
   administrators. See “Defining LoadLeveler administrators” on page 43 for
   information on defining LoadLeveler administrators.
2. If running user space jobs, LoadLeveler must be configured to use switch
   adapters. A way to do this is to run the llextRPD command to extract node
   and adapter information from the RSCT peer domain. See “llextRPD - Extract
   data from an RSCT peer domain” on page 443 for additional information.
3. In the configuration file, define your scheduler to be the LoadLeveler
   BACKFILL scheduler by specifying SCHEDULER_TYPE = BACKFILL. See “Choosing
   a scheduler” on page 44 for more information.
4. In the administration file, specify batch, interactive, or general use for
   nodes. You can use the machine_mode keyword in the machine stanza to
   specify the type of jobs that can run on a node; you must specify either
   interactive or general if you are going to run interactive jobs.
5. In the administration file, configure optional functions, including:
   v Setting up pools: you can organize nodes into pools by using the
     pool_list keyword in the machine stanza. See “Defining machines” on page
     84 for more information.
   v Enabling SP™ exclusive use accounting: you can specify that the
     accounting function on an SP system be informed that a job step has
     exclusive use of a machine by specifying spacct_exclusive_enable = true
     in the machine stanza. See “Defining machines” on page 84 for more
     information on these keywords.
6. Consider setting up a class stanza for your interactive POE jobs. See
   “Setting up a class for parallel jobs” for more information. Define this
   class to be your default class for interactive jobs by specifying this
   class name on the default_interactive_class keyword. See “Defining users”
   on page 97 for more information.

Setting up a class for parallel jobs

To define the characteristics of parallel jobs run by your installation, you
should set up a class stanza in the administration file and define a class
(in the Class statement in the configuration file) for each task you want to
run on a node.

Suppose your installation plans to submit long-running parallel jobs, and you
want to define the following characteristics:
v Only certain users can submit these jobs
v Jobs have a 30-hour run time limit
v A job can request a maximum of 60 nodes and 120 total tasks
v Jobs will have a relatively low run priority
The following is a sample class stanza for long-running parallel jobs which
takes into account these characteristics:
long_parallel: type=class
               wall_clock_limit = 108000
               include_users = jack queen king ace
               priority = 50
               total_tasks = 120
               max_node = 60
               maxjobs = 2

Note the following about this class stanza:
v The wall_clock_limit keyword sets a wall clock limit of 108000 seconds (30
  hours) for jobs in this class
v The include_users keyword allows four users to submit jobs in this class
v The priority keyword sets a relative priority of 50 for jobs in this class
v The total_tasks keyword specifies that a user can request up to 120 total
  tasks for a job in this class
v The max_node keyword specifies that a user can request up to 60 nodes for a
  job in this class
v The maxjobs keyword specifies that a maximum of two jobs in this class can
  run simultaneously

Suppose users need to submit job command files containing the following
statements:
node = 30
tasks_per_node = 4
In your LoadL_config file, you must code the Class statement such that at
least 30 nodes have four or more long_parallel classes defined. That is, the
configuration file for each of these nodes must include the following
statement:
Class = { "long_parallel" "long_parallel" "long_parallel" "long_parallel" }
or
Class = long_parallel(4)
For more information, see “Defining LoadLeveler machine characteristics” on
page 54.

| Striping when some networks fail

| When multiple networks are configured in a cluster, a job can request
| striping over the networks by setting sn_all in the network statement in the
| job command file. The striping_with_minimum_networks administration file
| keyword in the class stanza is used to tell LoadLeveler how to select nodes
| for sn_all jobs of a specific class when one or more networks are
| unavailable. When striping_with_minimum_networks is set to false for a
| class, LoadLeveler will only select nodes for sn_all jobs of that class
| where all the networks are up and in the READY state. When
| striping_with_minimum_networks is set to true, LoadLeveler will select a set
| of nodes where at least more than half of the networks on the nodes are up
| and in the READY state.

| For example, if there are 8 networks connected to a node and
| striping_with_minimum_networks is set to false, all 8 networks would have to
| be up and in the READY state to consider that node for sn_all jobs. If
| striping_with_minimum_networks is set to true, nodes with at least 5
| networks up and in the READY state would be considered for sn_all jobs.
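In administration file terms, the eight-network example corresponds to a
class stanza along the following lines (the class name is invented):
sn_all_jobs: type = class
             # false: nodes qualify for sn_all jobs of this class only
             #        if all 8 networks are up and READY
             # true:  nodes qualify if more than half (5 or more) of
             #        the 8 networks are up and READY
             striping_with_minimum_networks = true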
Setting up a parallel master node

LoadLeveler allows you to define a parallel master node that LoadLeveler will
use as the first node for a job submitted to a particular class.

To set up a parallel master node, code the following keywords in the node’s
class and machine stanzas in the administration file:
# MACHINE STANZA: (optional)
mach1: type = machine
       master_node_exclusive = true
# CLASS STANZA: (optional)
pmv3: type = class
      master_node_requirement = true
Specifying master_node_requirement = true forces all parallel jobs in this
class to use, as their first node, a machine with the master_node_exclusive =
true setting. For more information on these keywords, see “Defining machines”
on page 84 and “Defining classes” on page 89.

Configuring LoadLeveler to support MPICH jobs

The MPICH package can be configured so that LoadLeveler will be used to spawn
all tasks in an MPICH application.

Using LoadLeveler to spawn MPICH tasks allows LoadLeveler to accumulate
accounting data for the tasks and also allows LoadLeveler to ensure that all
tasks are terminated when the job completes.

For LoadLeveler to spawn the tasks of an MPICH job, the MPICH package must be
configured to use the LoadLeveler llspawn.stdio command when starting tasks.
To configure MPICH to use llspawn.stdio, set the environment variable
RSHCOMMAND to the location of the llspawn.stdio command and run the configure
command for the MPICH package. On Linux systems, enter the following:
# export RSHCOMMAND=/opt/ibmll/LoadL/full/bin/llspawn.stdio
# ./configure
Note: This configuration works on MPICH-1.2.7. Additional documentation for
MPICH is available from the Argonne National Laboratory web site at
http://www-unix.mcs.anl.gov/mpi/mpich1/.

Configuring LoadLeveler to support MVAPICH jobs

To run MVAPICH jobs under LoadLeveler control, you must specify the llspawn
command to replace the default RSHCOMMAND value during software configuration.

The compiled MVAPICH implementation code uses the llspawn command to start
tasks under LoadLeveler control. This allows LoadLeveler to have total control
over the remote tasks for accounting and cleanup.

To configure the MVAPICH code to use the llspawn command as RSHCOMMAND, change
the mpirun_rsh.c program source code by following these steps before compiling
MVAPICH:
1. Replace:
   void child_handler(int);
   with:
   void child_handler(int);
   void term_handler(int);
2. For Linux, replace:
   #define RSH_CMD "/usr/bin/rsh"
   #define SSH_CMD "/usr/bin/ssh"
   with:
   #define RSH_CMD "/opt/ibmll/LoadL/full/bin/llspawn"
   #define SSH_CMD "/opt/ibmll/LoadL/full/bin/llspawn"
3. Replace:
   signal(SIGCHLD, child_handler);
   with:
   signal(SIGCHLD, SIG_IGN);
   signal(SIGTERM, term_handler);
4. Add the definition for the term_handler function at the end:
   void term_handler(int signal)
   {
     exit(0);
   }

Configuring LoadLeveler to support MPICH-GM jobs

To run MPICH-GM jobs under LoadLeveler control, you need to configure the
MPICH-GM implementation you are using by specifying the llspawn command as
RSHCOMMAND.

The compiled MPICH-GM implementation code uses the llspawn command to start
tasks under LoadLeveler control. This allows LoadLeveler to have total control
over the remote tasks for accounting and cleanup.

To configure the MPICH-GM code to use the llspawn command as RSHCOMMAND,
change the mpich.make.gcc script before compiling the MPICH-GM:
Replace:
setenv RSHCOMMAND /usr/bin/rsh
with:
setenv RSHCOMMAND /opt/ibmll/LoadL/full/bin/llspawn

LoadLeveler does not manage the GM ports on the Myrinet switch. For
LoadLeveler to keep track of the GM ports they must be identified as
LoadLeveler consumable resources.

Perform the following steps to use consumable resources to manage GM ports:
1. Pick a name for the GM port resource.
   Example: As an example, this procedure assumes the name is gmports, but
   you may use another name.
   Tip: Users who submit MPICH-GM jobs need to know the name that you define
   for the GM port resource.
2. In the LoadLeveler configuration file, specify the GM port resource name
   on the SCHEDULE_BY_RESOURCES keyword.
   Example:
   SCHEDULE_BY_RESOURCES = gmports
   Tip: If the SCHEDULE_BY_RESOURCES keyword already is specified in the
   configuration file, you can just add the GM port resource name to other
   values already listed.
3. In the administration file, specify how many GM ports are available on
   each machine. Use the resources keyword to specify the GM port resource
   name and the number of GM ports.
   Example:
   resources=gmports(n)
   Tips:
   v The resources keyword also must appear in the job command file for an
     MPICH-GM job.
     Example: resources=gmports(1)
   v To determine the value of n, use either the number specified in the GM
     documentation or the number of GM ports you have successfully used.
     Certain system configurations may not support all available GM ports, so
     you might need to specify a lower value for the gmports resource than
     what is actually available.
4. Issue the llctl command with either the reconfig or recycle keyword.
   Otherwise, LoadLeveler will not process the modifications you made to the
   configuration and administration files.
For information about submitting MPICH-GM jobs, see “Running MPICH, MVAPICH,
and MPICH-GM jobs” on page 204.
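Putting the four steps together, and assuming the resource name gmports with
eight usable ports per machine (both the host name and the port count below
are illustrative), the relevant entries might look like this:
# Configuration file (step 2)
SCHEDULE_BY_RESOURCES = gmports

# Administration file, machine stanza (step 3)
node01: type = machine
        resources = gmports(8)   # count is illustrative; see the GM
                                 # documentation for the real value

# Job command file (user side)
# @ resources = gmports(1)
Step 4, llctl with reconfig or recycle, then makes the changes take effect.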
Using the BACKFILL scheduler
The BACKFILL scheduling algorithm in LoadLeveler is designed to maximize the use of resources to achieve the highest system efficiency, while preventing potentially excessive delays in starting jobs with large resource requirements. These large jobs can run because the BACKFILL scheduler does not allow jobs with smaller resource requirements to continuously use up resources before the larger jobs can accumulate enough resources to run. While BACKFILL can be used for both serial and parallel jobs, the potential advantage is greater with parallel jobs.
Job steps are arranged in a queue based on their SYSPRIO order as they arrive from the Schedd nodes in the cluster. The queue can be reordered periodically, depending on the value of the RECALCULATE_SYSPRIO_INTERVAL keyword. In each dispatching cycle, as determined by the NEGOTIATOR_INTERVAL and NEGOTIATOR_CYCLE_DELAY configuration keywords, the BACKFILL algorithm examines these job steps sequentially in an attempt to find available resources to run each job step, then dispatches those steps to run.
Once the BACKFILL algorithm encounters a job step for which it cannot immediately find enough resources, that job step becomes known as a "top dog". The BACKFILL algorithm can allocate multiple top dogs in the same dispatch cycle. By using the MAX_TOP_DOGS configuration keyword (for more information, see Chapter 12, “Configuration file reference,” on page 263), you can define the maximum number of top dogs that the central manager will allocate.
For each top dog, the BACKFILL algorithm will attempt to calculate the earliest time at which enough resources will become free to run the corresponding top dog. This is based on the assumption that each currently running job step will run until its hard wall clock limit is reached and that when a job step terminates, the resources which that step has been using will become available. The time at which enough currently running job steps will have terminated, meaning enough resources have become available to run a top dog, is called the top dog’s future start time. The future start time of each top dog is effectively guaranteed for the remainder of the execution of the BACKFILL algorithm. The resources that each top dog will use at its corresponding start time and for its duration, as specified by its hard wall clock limit, are reserved (not to be confused with the reservation feature available in LoadLeveler).
Note: A job that is bound to a reservation is not considered for top-dog scheduling, so there is no top-dog scheduling performed inside reservations.
In some cases, it may not be possible to calculate the future start time of a job step. Consider, for example, a case where there are 20 nodes in the cluster and a job step requires 24 nodes to run. Even when all nodes in the cluster are idle, it will not be possible for this job step to run. Only the addition of nodes to the cluster would allow the job step to run, and there is no way the BACKFILL algorithm can make any assumptions about when that could take place. In situations like this, the job step is not considered a "top dog", no resources are "reserved", and the BACKFILL algorithm goes on to the next job step in the queue.
The BACKFILL scheduling algorithm classifies job steps into distinct types: REGULAR, TOP DOG, and BACKFILL:
v The REGULAR job step is a job step for which enough resources are currently available and no top dogs have yet been allocated.
v The TOP DOG job step is a job step for which not enough resources are currently available, but enough resources are available at a future time and one of the following conditions is met:
  – The TOP DOG job step is not expected to run at a time when any other top dog is expected to run.
  – If the TOP DOG is expected to run at a time when some other top dogs are expected to run, then it cannot be using resources reserved by such top dogs.
v The BACKFILL job step is a job step for which enough resources are currently available and one of the following conditions is met:
  – The BACKFILL job step is expected to complete before the future start times of all top dogs, based on the hard wall clock limit of the BACKFILL job step.
  – If the BACKFILL job step is not expected to complete before the future start time of at least one top dog, then it cannot be using resources reserved by the top dogs that are expected to start before the BACKFILL job step is expected to complete.
Table 23 provides a roadmap of BACKFILL scheduler tasks.
Table 23. Roadmap of BACKFILL scheduler tasks
Subtask: Configuring the BACKFILL scheduler
Associated instructions:
v “Choosing a scheduler” on page 44
v “Tips for using the BACKFILL scheduler” on page 112
v “Example: BACKFILL scheduling” on page 113
Subtask: Using additional LoadLeveler features available under the BACKFILL scheduler
Associated instructions:
v “Preempting and resuming jobs” on page 126
v “Configuring LoadLeveler to support reservations” on page 131
v “Working with reservations” on page 213
v “Data staging” on page 113
v “Scale-across scheduling with multiclusters” on page 153
Subtask: Use the BACKFILL scheduler to dispatch and manage jobs
Associated instructions:
v “llclass - Query class information” on page 433
v “llmodify - Change attributes of a submitted job step” on page 464
v “llpreempt - Preempt a submitted job step” on page 474
v “llq - Query job status” on page 479
v “llsubmit - Submit a job” on page 531
v “Data access API” on page 560
v “Error handling API” on page 639
v “ll_modify subroutine” on page 677
v “ll_preempt subroutine” on page 686

Tips for using the BACKFILL scheduler
Note the following essential considerations when using the BACKFILL scheduler:
v To use this scheduler, either users must set a wall-clock limit in their job command file or the administrator must define a wall-clock limit value for the class to which a job is assigned. Jobs with a wall_clock_limit of unlimited cannot be used to backfill because they may not finish in time.
v Using wall clock limits that accurately reflect the actual running time of the job steps will result in more efficient utilization of resources. When a job step’s wall clock limit is substantially longer than the amount of time the job step actually needs, it results in two inefficiencies in the BACKFILL algorithm:
  – The future start time of a "top dog" will be calculated to be much later due to the long wall clock limits of the running job steps, leaving a larger window for BACKFILL job steps to run. This causes the "top dog" to start later than it would have if more accurate wall clock limits had been given.
  – A job step is less likely to be backfilled if its wall clock limit is longer, because it is more likely to run past the future start time of a "top dog".
v You should use only the default settings for the START expression and the other job control functions described in “Managing job status through control expressions” on page 68. If you do not use these default settings, jobs will still run but the scheduler will not be as efficient. For example, the scheduler will not be able to guarantee a time at which the highest priority job will run.
v You should configure any multiprocessor (SMP) nodes such that the number of jobs that can run on a node (determined by the MAX_STARTERS keyword) is always less than or equal to the number of processors on the node.
v Due to the characteristics of the BACKFILL algorithm, in some cases this scheduler may not honor the MACHPRIO statement. For more information on MACHPRIO, see “Setting negotiator characteristics and policies” on page 45.
v When using PREEMPT_CLASS rules, it is helpful to create a SYSPRIO expression that is consistent with the preemption rules. This can be done by using the ClassSysprio built-in variable with a multiplier, such as SYSPRIO: (ClassSysprio * 10000) - QDate. If classes that appear on the left-hand side of PREEMPT_CLASS rules are given a higher priority than those that appear on the right, preemption will not be required as often, because the job steps that can preempt will be higher in the queue than the job steps that can be preempted.
v Issuing llq -s against a top-dog step will show that the step is a top dog.

Example: BACKFILL scheduling
On a rack with 10 nodes, 8 of the nodes are being used by Job A. Job B has the highest priority in the queue and requires 10 nodes. Job C has the next highest priority in the queue and requires only two nodes. Job B has to wait for Job A to finish so that it can use the freed nodes. Because Job A is using only 8 of the 10 nodes, the BACKFILL scheduler can schedule Job C (which needs only the two available nodes) to run, as long as it finishes before Job A finishes (and Job B starts). To determine whether Job C has time to run, the BACKFILL scheduler uses Job C’s wall_clock_limit value to determine whether it will finish before Job A ends. If Job C has a wall_clock_limit of unlimited, it may not finish before Job B’s start time, and it will not be dispatched.
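For instance, a job command file fragment like the following gives the scheduler the finite wall clock limit it needs to consider a step such as Job C for backfill. The class name and node count are illustrative only:
# @ job_type = parallel
# @ class = short
# @ node = 2
# @ wall_clock_limit = 00:30:00
# @ queue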
Data staging
Data staging allows you to stage data needed by a job before the job begins execution and to move data back to archives when a job has finished execution. A job can use one inbound data staging step and one outbound data staging step. The inbound step is the first to be executed and the outbound step, the last.
LoadLeveler provides data staging for two scenarios:
1. A single replica of the data files needed by a job has to be created on a common file system.
2. A replica of the data files has to be created on every machine on which the job will run.
LoadLeveler also allows you to request the time at which data staging operations should be scheduled:
1. A single replica must be created as soon as a job is submitted, regardless of when the job will be executed. This is the AT_SUBMIT configuration option.
2. A single replica of the data files must be created as close as possible to the execution time of the job. This is the JUST_IN_TIME configuration option.
3. A replica must be created on each machine that the job runs on, as close as possible to the execution time of the job. This is also the JUST_IN_TIME configuration option.
The basic steps involved in data staging include:
1. A job is submitted that contains data staging keywords.
2. LoadLeveler generates inbound and outbound data staging steps in accordance with these keywords. All other steps of the job have an implicit dependency on the completion of the inbound data staging step.
3. Scheduling methods:
   a. With the AT_SUBMIT configuration option, the data staging step is started first and the application steps are scheduled when the data staging dependency is satisfied (that is, when the inbound data staging step is completed).
   b. With the JUST_IN_TIME configuration option, the first application step of the job is scheduled in the future based on the wall clock time specified for the inbound data staging step. The inbound data staging step is started on the machines that will be used by the first application step.
4. When the inbound data staging step completes, all of the application job steps become eligible for scheduling. The exit code from the inbound data staging program is made available to all application job steps in the LL_DSTG_IN_EXIT_CODE environment variable.
5. When all the application job steps are completed, the outbound data staging step is started by LoadLeveler. Typically, the outbound data staging step would be used to move data files back to their archives.
Note: You cannot preempt data staging steps using the llpreempt command or by specifying the data_stage class in system preemption rules. Similarly, a step belonging to the data_stage class cannot preempt any other job step.

Configuring LoadLeveler to support data staging
LoadLeveler allows you to specify the execution time for data staging job steps using the DSTG_TIME keyword, which defaults to the AT_SUBMIT value. To schedule data staging operations as close to the application as possible, use the JUST_IN_TIME value. The DSTG_MIN_SCHEDULING_INTERVAL keyword can be used to optimize scheduler performance by allowing data staging jobs to be scheduled only at specific intervals.
A special set of data staging step initiators, called DSTG_MAX_STARTERS, can be set up for data staging job steps. These initiators are a distinct set of resources on the compute node, not included in the MAX_STARTERS set up for compute jobs. You cannot specify the built-in data_stage class in:
v The CLASS keyword of a job command file
v The default_class keyword in the administration file
For more information about the data staging keywords, see “Configuration file keyword descriptions” on page 265.
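As an illustration, a configuration file fragment such as the following enables just-in-time data staging, reserves two initiators per node for data staging steps, and limits how often data staging steps are scheduled. The values are examples only, and the interval is assumed here to be expressed in seconds; check the keyword descriptions for the valid ranges and units:
DSTG_TIME = JUST_IN_TIME
DSTG_MAX_STARTERS = 2
DSTG_MIN_SCHEDULING_INTERVAL = 900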
The LoadLeveler administration file class stanza keywords can be used to specify defaults, limits, and restrictions for the built-in data_stage class. The data_stage class cannot be specified as the default class for a user, and you cannot specify the data_stage class in your job command file; steps of this class are generated automatically by LoadLeveler based on the data staging keywords used in job command files.
The built-in data_stage class can be configured in the administration file using a class stanza, just as you would do for any other class. Some examples of how you might use a stanza for the data_stage class are:
v Including and excluding users and groups from this class to control which users are permitted to use data staging.
v Specifying defaults for resource limits such as cpu_limit or nofile_limit for data staging steps.
v Specifying defaults and maximum allowed values for the dstg_resources job command file keyword using default_resources and max_resources.
v Limiting the total number of data staging jobs or tasks in the cluster at any one time using maxjobs or max_total_tasks.
For more information about the data staging keywords, see “Administration file keyword descriptions” on page 327.
If an inbound data staging job step is soft-bound to a reservation and the dstg_node=any keyword is set, it can be started ahead of the reservation start time, if data staging resources are available. In all other cases, data staging steps will run within the reservation itself.

Using an external scheduler
The LoadLeveler API provides interfaces that allow an external scheduler to manage the assignment of resources to jobs and the dispatching of those jobs.
The primary interfaces for the tasks of an external scheduler are:
v ll_query to obtain information about the LoadLeveler cluster, the machines of the cluster, jobs, and AIX Workload Manager.
v ll_get_data to obtain information about specific objects such as jobs, machines, and adapters.
v ll_start_job_ext to start a LoadLeveler job.
  – The ll_start_job_ext subroutine supports both serial and parallel jobs. For parallel jobs, ll_start_job_ext provides the ability to specify which adapters are used by the communication protocols of each job task. This assures that each task uses the same network for communication over a given protocol.
The steps for dispatching jobs with an external scheduler are:
1. Gather information about the LoadLeveler cluster (ll_query(CLUSTER)).
2. Gather information about the machines in the LoadLeveler cluster (ll_query(MACHINES)).
3. Gather information about the jobs in the cluster (ll_query(JOBS)).
4. Determine the resources that are currently free. (See the note that follows.)
5. Determine which jobs to start. Assign resources to the jobs to be started and dispatch them (ll_start_job_ext(LL_start_job_info_ext*)).
6. Repeat steps 1 through 5.
When an external scheduler is used, the LoadLeveler negotiator does not keep track of the resources used by jobs started by the external scheduler. There are two ways that an external scheduler can keep track of the free resources available for starting new jobs. The method that should be used depends on whether the external scheduler runs continuously while all scheduling is occurring or is executed to start a finite number of jobs and then terminates:
v If the external scheduler runs continuously, it should query the total resources available in the LoadLeveler system with ll_query and ll_get_data. It can then keep track of the resources assigned to the jobs it starts while they are running and return the resources to the available pool when the jobs complete.
v If the external scheduler is executed to start a finite number of jobs and then terminates, it must determine the pool of available resources when it first starts. It can do this by first querying the total resources in the LoadLeveler system using ll_query and ll_get_data. Then it would query the jobs in the system
(again using ll_query), looking for jobs that are running. For each running job, it would remove the resources used by the job from the available pool. After all the running jobs are processed, the available pool would indicate the amount of free resources for starting new jobs.
To find out more about dispatching jobs with an external scheduler, use the information in Table 24.
Table 24. Roadmap of tasks for using an external scheduler
Subtask: Learn about the LoadLeveler functions that are limited or not available when you use an external scheduler
Associated instructions: “Replacing the default LoadLeveler scheduling algorithm with an external scheduler”
Subtask: Prepare the LoadLeveler environment for using an external scheduler
Associated instructions: “Customizing the configuration file to define an external scheduler” on page 118
Subtask: Use an external scheduler to dispatch jobs
Associated instructions:
v “Steps for getting information about the LoadLeveler cluster, its machines, and jobs” on page 118
v “Assigning resources and dispatching jobs” on page 122

Replacing the default LoadLeveler scheduling algorithm with an external scheduler
It is important to know how LoadLeveler keywords and commands behave when you replace the default LoadLeveler scheduling algorithm with an external scheduler. LoadLeveler scheduling keywords and commands fall into the following categories:
v Keywords not involved in scheduling decisions are unchanged.
v Keywords kept in the job object or in the machine that are used by the default LoadLeveler scheduler have their values maintained as before and passed to the data access API.
v Keywords used only by the default LoadLeveler scheduler have no effect.
Table 25 discusses specific keywords and commands and how they behave when you disable the default LoadLeveler scheduling algorithm.
Table 25. Effect of LoadLeveler keywords under an external scheduler
Job command file keywords:
class - This value is provided by the data access API. Machines chosen by ll_start_job_ext must have the class of the job available or the request will be rejected.
dependency - Supported as before. Job objects for which the dependency cannot be evaluated (because a previous step has not run) are maintained in the NotQueued state, and attempts to start them using ll_start_job_ext will result in an error. If the dependency is met, ll_start_job_ext can start the step.
hold - ll_start_job_ext cannot start a job that is in Hold status.
preferences - Passed to the data access API.
requirements - ll_start_job_ext returns an error if the specified machines do not match the requirements of the job. This includes Disk and Virtual Memory requirements.
startdate - The job remains in the Deferred state until the startdate specified in the job is reached. ll_start_job_ext cannot start a job in the Deferred state.
user_priority - Used in calculating the system priority (as described in “Setting and changing the priority of a job” on page 230). The system priority assigned to the job is available through the data access API. No other control of the order in which jobs are run is enforced.
Administration file keywords:
master_node_exclusive - Ignored
master_node_requirement - Ignored
max_jobs_scheduled - Ignored
max_reservations - Ignored
max_reservation_duration - Ignored
max_total_tasks - Ignored
maxidle - Supported
maxjobs - Ignored
maxqueued - Supported
priority - Used to calculate the system priority (where appropriate).
speed - Available through the data access API.
Configuration file keywords:
MACHPRIO - Calculated but not used.
MAX_STARTERS - Calculated, and if starting the job causes this value to be exceeded, ll_start_job_ext returns an error.
SYSPRIO - Calculated and available to the data access API.
NEGOTIATOR_PARALLEL_DEFER - Ignored
NEGOTIATOR_PARALLEL_HOLD - Ignored
NEGOTIATOR_RESCAN_QUEUE - Ignored
NEGOTIATOR_RECALCULATE_SYSPRIO_INTERVAL - Works as before. Set this value to 0 if you do not want the system priorities of job objects recalculated.
Customizing the configuration file to define an external scheduler
To use an external scheduler, one of the tasks you must perform is setting the configuration file keyword SCHEDULER_TYPE to the value API. This keyword option provides a time-based (rather than an event-based) interface; that is, your application must use the data access API to poll LoadLeveler at specific times for machine and job information.
When you enable a scheduler type of API, you must specify AGGREGATE_ADAPTERS=NO to make the individual switch adapters available to the external scheduler. This means the external scheduler receives each individual adapter connected to the network, instead of having them collectively grouped together. You will see each adapter listed individually in the llstatus -l command output. When this keyword is set to YES, the llstatus -l command will show an aggregate adapter that contains information on all switch adapters on the same network. For detailed information about individual switch adapters, issue the llstatus -a command.
You also may use the PREEMPTION_SUPPORT keyword, which specifies the level of preemption support for a cluster. Preemption allows a running job step to be suspended so that another job step can run.
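Putting these keywords together, a minimal configuration file fragment for running under an external scheduler might look like the following sketch. The PREEMPTION_SUPPORT value shown assumes the external scheduler will preempt jobs; adjust it to the level of support your cluster actually needs:
SCHEDULER_TYPE = API
AGGREGATE_ADAPTERS = NO
PREEMPTION_SUPPORT = full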
Steps for getting information about the LoadLeveler cluster, its machines, and jobs
Perform the following steps to retrieve and use information about the LoadLeveler cluster, machines, jobs, and AIX Workload Manager:
1. Create a query object for the kind of information you want.
   Example: To query machine information, code the following instruction:
   LL_element * query_element = ll_query(MACHINES);
2. Customize the query to filter the specific information you want. You can filter the list of objects for which you want information. For some queries, you can also filter how much information you want.
   Example: The following lines customize the query for just the hosts node01.ibm.com and node02.ibm.com and to return the information contained in the llstatus -f command:
   char * hostlist[] = { "node01.ibm.com", "node02.ibm.com", NULL };
   ll_set_request(query_element, QUERY_HOST, hostlist, STATUS_LINE);
3. Once the query has been customized:
   a. Submit it using ll_get_objs, which returns the first object that matches the query.
   b. Interrogate the returned object using the ll_get_data subroutine to retrieve specific attributes.
   Depending on the information being queried, the query may be directed to a specific node and a specific daemon on that node.
   Example: A JOBS query for all data may be directed to the negotiator, the Schedd, or the history file. If it is directed to the Schedd, you must specify the host of the Schedd you are interested in. The following demonstrates retrieving the name of the first machine returned by the query constructed previously:
   int machine_count;
   int rc;
   LL_element * element = ll_get_objs(query_element, LL_CM, NULL, &machine_count, &rc);
   char * mname;
   ll_get_data(element, LL_MachineName, &mname);
Because there is only one negotiator in a LoadLeveler cluster, the host does not have to be specified. The third parameter is the address of an integer that will receive the count of objects returned, and the fourth parameter is the address of an integer that will receive the completion code of the call. If the call fails, NULL is returned and the location pointed to by the fourth parameter is set to a reason code. If the call succeeds, the value returned is used as the first parameter to a call to ll_get_data.
The second parameter to ll_get_data is a specification that indicates what attribute of the object is being interrogated. The third parameter to ll_get_data is the address of the location into which to store the result. ll_get_data returns zero if it is successful and nonzero if an error occurs. It is important that the specification (the second parameter to ll_get_data) be valid for the object passed in (the first parameter) and that the address passed in as the third parameter point to the correct type for the specification. Undefined, potentially dangerous behavior will occur if either of these conditions is not met.

Example: Retrieving specific information about machines
The following example demonstrates printing the name and adapter list of all machines in the LoadLeveler cluster. The example could be extended to retrieve all of the information available about the machines in the cluster, such as memory, disk space, pool list, features, supported classes, and architecture, among other things. A similar process would be used to retrieve information about the cluster overall.
int i, w, rc;
int machine_count;
LL_element * query_elem;
LL_element * machine;
LL_element * adapter;
char * machine_name;
char * adapter_name;
int * window_list;
int window_count;
/* First we need to obtain a query element which is used to pass */
/* parameters in to the machine query */
if ((query_elem = ll_query(MACHINES)) == NULL) {
   fprintf(stderr, "Unable to obtain query element\n");
   /* without the query object we will not be able to do anything */
   exit(-1);
}
/* Get information relating to machines in the LoadLeveler cluster.  */
/* QUERY_ALL: we are querying all machines                           */
/* NULL: since we are querying all machines we do not need to        */
/*       specify a filter to indicate which machines                 */
/* ALL_DATA: we want all the information available about the machine */
rc = ll_set_request(query_elem, QUERY_ALL, NULL, ALL_DATA);
if (rc < 0) {
   /* A real application would map the return code to a message */
   printf("ll_set_request failed, rc = %d\n", rc);
   /* Without customizing the query we cannot proceed */
   exit(rc);
}
/* If successful, ll_get_objs() returns the first object that    */
/* satisfies the criteria that are set in the query element and  */
/* the parameters. In this case those criteria are:              */
/*   A machine (from the type of query object)                   */
/*   LL_CM: that the negotiator knows about                      */
/*   NULL: since there is only one negotiator we don't have to   */
/*         specify which host it is on                           */
/* The number of machines is returned in machine_count and the   */
/* return code is returned in rc                                 */
machine = ll_get_objs(query_elem, LL_CM, NULL, &machine_count, &rc);
if (rc < 0) {
   /* A real application would map the return code to a message */
   printf("ll_get_objs failed, rc = %d\n", rc);
   /* query was not successful -- we cannot proceed but we need to */
   /* release the query element                                    */
   if (ll_deallocate(query_elem) == -1) {
      fprintf(stderr, "Attempt to deallocate invalid query element\n");
   }
   exit(rc);
}
printf("Number of Machines = %d\n", machine_count);
i = 0;
while (machine != NULL) {
   printf("------------------------------------------------------\n");
   printf("Machine %d:\n", i);
   rc = ll_get_data(machine, LL_MachineName, &machine_name);
   if (0 == rc) {
      printf("Machine name = %s\n", machine_name);
   } else {
      printf("Error %d getting machine name\n", rc);
   }
   printf("Adapters\n");
   ll_get_data(machine, LL_MachineGetFirstAdapter, &adapter);
   while (adapter != NULL) {
      rc = ll_get_data(adapter, LL_AdapterName, &adapter_name);
      if (0 != rc) {
         printf("Error %d getting adapter name\n", rc);
      } else {
         /* Because the list of windows on an adapter is returned */
         /* as an array of integers, we also need to know how big */
         /* the list is. First we query the window count,         */
         /* storing the result in an integer, then we query for   */
         /* the list itself, storing the result in a pointer to   */
         /* an integer. The window list is allocated for us so    */
         /* we need to free it when we are done                   */
         printf("%s windows:", adapter_name);
         ll_get_data(adapter, LL_AdapterTotalWindowCount, &window_count);
         ll_get_data(adapter, LL_AdapterWindowList, &window_list);
         for (w = 0; w < window_count; w++) {
            printf(" %d", window_list[w]);
         }
         printf("\n");
         free(window_list);
      }
      /* After the first object has been gotten, GetNext returns */
      /* the next until the list is exhausted                    */
      ll_get_data(machine, LL_MachineGetNextAdapter, &adapter);
   }
   printf("\n");
   i++;
   machine = ll_next_obj(query_elem);
}
/* First we need to release the individual objects that were */
/* obtained by the query                                     */
if (ll_free_objs(query_elem) == -1) {
   fprintf(stderr, "Attempt to free invalid query element\n");
}
/* Then we need to release the query itself */
if (ll_deallocate(query_elem) == -1) {
   fprintf(stderr, "Attempt to deallocate invalid query element\n");
}

Example: Retrieving information about jobs
The following example demonstrates retrieving information about jobs, up to the point of starting a job:
int i, rc;
int job_count;
LL_element * query_elem;
LL_element * job;
LL_element * step;
int step_state;
/* First we need to obtain a query element which is used to pass */
/* parameters in to the jobs query */
if ((query_elem = ll_query(JOBS)) == NULL) {
   fprintf(stderr, "Unable to obtain query element\n");
   /* without the query object we will not be able to do anything */
   exit(-1);
}
/* Get information relating to Jobs in the LoadLeveler cluster. */
printf("Jobs Information ========================================\n\n");
/* QUERY_ALL: we are querying all jobs                           */
/* NULL: since we are querying all jobs we do not need to        */
/*       specify a filter to indicate which jobs                 */
/* ALL_DATA: we want all the information available about the job */
rc = ll_set_request(query_elem, QUERY_ALL, NULL, ALL_DATA);
if (rc < 0) {
   /* A real application would map the return code to a message */
   printf("ll_set_request failed, rc = %d\n", rc);
   /* Without customizing the query we cannot proceed */
   exit(rc);
}
/* If successful, ll_get_objs() returns the first object that   */
/* satisfies the criteria that are set in the query element and */
/* the parameters. In this case those criteria are:             */
/*   A job (from the type of query object)                      */
/*   LL_CM: that the negotiator knows about                     */
/*   NULL: since there is only one negotiator we don't have to  */
/*         specify which host it is on                          */
/* The number of jobs is returned in job_count and the          */
/* return code is returned in rc                                */
job = ll_get_objs(query_elem, LL_CM, NULL, &job_count, &rc);
if (rc < 0) {
   /* A real application would map the return code to a message */
   printf("ll_get_objs failed, rc = %d\n", rc);
   /* query was not successful -- we cannot proceed but we need to */
   /* release the query element                                    */
   if (ll_deallocate(query_elem) == -1) {
      fprintf(stderr, "Attempt to deallocate invalid query element\n");
   }
   exit(rc);
}
printf("Number of Jobs = %d\n", job_count);
step = NULL;
while (job != NULL) {
   /* Each job is composed of one or more steps which are started */
   /* individually. We need to check the state of the job's steps */
   ll_get_data(job, LL_JobGetFirstStep, &step);
   while (step != NULL) {
      ll_get_data(step, LL_StepState, &step_state);
      /* We are looking for steps that are in idle state. The */
      /* state is returned as an int so we cast it to         */
      /* enum StepState as declared in llapi.h                */
      if ((enum StepState)step_state == STATE_IDLE) break;
      /* Otherwise, advance to the next step of this job */
      ll_get_data(job, LL_JobGetNextStep, &step);
   }
   /* If we exit the loop with a valid step, it is the one to start */
   /* otherwise we need to keep looking                             */
   if (step != NULL) break;
   job = ll_next_obj(query_elem);
}
if (step == NULL) {
   printf("No step to start\n");
   exit(0);
}

Assigning resources and dispatching jobs
After an external scheduler selects a job step to start and identifies the machines that the job step will run on, the LoadLeveler job start API is used to tell LoadLeveler the job step to start and the resources that are to be assigned to the job step.
In “Example: Retrieving information about jobs” on page 121, we reached the point where a step to start was identified. In a real external scheduler, the decision would be reached after consideration of all the idle jobs and constructing a priority
value based on attributes such as class and submit time, all of which are accessible through ll_get_data. Next, the list of available machines would be examined to determine whether a set exists with sufficient resources to run the job. This process also involves determining the size of that set of machines using attributes of the step such as the number of nodes, instances of each node, and tasks per node.
The LoadLeveler data query API allows access to that information about each job, but the interface for starting the job does not require that the machine and adapter resources match the specifications given when the job was submitted. For example, a job could be submitted specifying node=4 but could be started by an external scheduler on a single node only. Similarly, the job could specify the LAPI protocol with network.lapi=... but be started and told to use the MPI protocol. This is not considered an error, since it is up to the scheduler to interpret (and enforce, if necessary) the specifications in the job command file.
In allocating adapter resources for a step, it is important that the order of the adapter usages be consistent with the structure of the step. In some environments a task can use multiple instances of adapter windows for a protocol. If the protocol requests striping (sn_all), an adapter window (or set of windows if instances are used) is allocated on each available network. If multiple protocols are used by the task (for example, MPI and LAPI), each protocol defines its own set of windows.
The array of adapter usages passed in to ll_start_job_ext must group the windows for all of the instances on one network for the same protocol together. If the protocol requests striping, that grouping must be immediately followed by the grouping for the next network. If the task uses multiple protocols, the set of adapter usages for the first protocol must be immediately followed by the set for the next protocol. Each task will have exactly the same pattern of adapter usage entries. Corresponding entries across all the tasks represent a communication path and must be able to communicate with each other. If the usages are for User Space communication, a network table will be loaded for each set of corresponding entries.
All of the job command file keywords for specifying job structure, such as total_tasks, tasks_per_node, node=min,max and blocking, are supported by the ll_start_job_ext interface, but users should ensure that they understand the LoadLeveler model that is created for each combination when constructing the adapter usage list for ll_start_job_ext. Jobs that are submitted with node=number and tasks_per_node result in more regular LoadLeveler models and are easier to create adapter usage lists for.
In the following example, it is assumed that the step found to be dispatched will run on one machine with two tasks, each task using one switch adapter window for MPI communication. The name of the machine to run on is contained in the variable use_machine (char *), the names of the switch adapters are contained in use_adapter_1 (char *) and use_adapter_2 (char *), and the adapter windows on those adapters in use_window_1 (int) and use_window_2 (int), respectively. Furthermore, each adapter will be allocated 1 MB of memory.
If the network adapters that the external scheduler assigns to the job allocate communication buffers in rCxt blocks instead of bytes (the Switch Network Interface for HPS is an example of such a network adapter), the api_rcxtblocks field of adapterUsage should be used to specify the number of rCxt blocks to assign instead of the mem field.
LL_start_job_info_ext *start_info;
char * pChar;
LL_element * step;
LL_element * job;
int rc;
char * submit_host;
char * step_id;
start_info = (LL_start_job_info_ext *)(malloc(sizeof(LL_start_job_info_ext)));
if (start_info == NULL) {
   fprintf(stderr, "Out of memory.\n");
   return;
}
/* Create a NULL terminated list of target machines. Each task      */
/* must have an entry in this list and the entries for tasks on the */
/* same machine must be sequential. For example, if a job is to run */
/* on two machines, A and B, and three tasks are to run on each     */
/* machine, the list would be: AAABBB                               */
/* Any specifications on the job when it was submitted such as      */
/* nodes, total_tasks or tasks_per_node must be explicitly queried  */
/* and honored by the external scheduler in order to take effect.   */
/* They are not automatically enforced by LoadLeveler when an       */
/* external scheduler is used.                                      */
/*                                                                  */
/* In this example, the job will be run on only one machine, so     */
/* the machine list consists of only 1 machine                      */
/* (plus the terminating NULL entry)                                */
start_info->nodeList = (char **)malloc(2 * sizeof(char *));
if (!start_info->nodeList) {
   fprintf(stderr, "Out of memory.\n");
   return;
}
start_info->nodeList[0] = strdup(use_machine);
start_info->nodeList[1] = NULL;
/* Retrieve information from the job to populate the start_info */
/* structure                                                    */
/* In the interest of brevity, the success of the ll_get_data() */
/* calls is not tested. In a real application it should be      */
/* The version number is set from the header that is included when  */
/* the application using the API is compiled. This allows for       */
/* checking that the application was compiled with a version of the */
/* API that is compatible with the version in the library when the  */
/* application is run.                                              */
start_info->version_num = LL_PROC_VERSION;
/* Get the first step of the job to start */
ll_get_data(job, LL_JobGetFirstStep, &step);
if (step == NULL) {
   printf("No step to start\n");
   return;
}
/* In order to set the submitting host, cluster number and proc   */
/* number in the start_info structure, we need to parse it out of */
/* the step id                                                    */
/* First get the submitting host and save it */
ll_get_data(job, LL_JobSubmitHost, &submit_host);
start_info->StepId.from_host = strdup(submit_host);
free(submit_host);
rc = ll_get_data(step, LL_StepID, &step_id);
/* The step id format is submit_host.jobno.stepno . Because the */
/* submit host is a dotted string of indeterminate length, the       */
/* simplest way to detect where the job number starts is to retrieve */
/* the submit host from the job and skip forward its length in the   */
/* step id.                                                          */
pChar = step_id + strlen(start_info->StepId.from_host) + 1;
/* The next segment is the cluster or job number */
pChar = strtok(pChar, ".");
start_info->StepId.cluster = atoi(pChar);
/* The last token is the proc or step number */
pChar = strtok(NULL, ".");
start_info->StepId.proc = atoi(pChar);
free(step_id);
/* For each protocol (eg. MPI or LAPI) on each task, we need to      */
/* specify which adapter to use, whether a window is being used      */
/* (subsystem = "US") or not (subsystem = "IP"). If a window is      */
/* used, the window ID and window buffer size must be specified.     */
/*                                                                   */
/* The adapter usage entries for the protocols of a task must be     */
/* sequential and the set of entries for tasks on the same node must */
/* be sequential. For example the twelve entries for a job where     */
/* each task uses one window for MPI and one for LAPI with three     */
/* tasks per node and running on two nodes would be laid out as:     */
/*  1: MPI window for 1st task running on 1st node                   */
/*  2: LAPI window for 1st task running on 1st node                  */
/*  3: MPI window for 2nd task running on 1st node                   */
/*  4: LAPI window for 2nd task running on 1st node                  */
/*  5: MPI window for 3rd task running on 1st node                   */
/*  6: LAPI window for 3rd task running on 1st node                  */
/*  7: MPI window for 1st task running on 2nd node                   */
/*  8: LAPI window for 1st task running on 2nd node                  */
/*  9: MPI window for 2nd task running on 2nd node                   */
/* 10: LAPI window for 2nd task running on 2nd node                  */
/* 11: MPI window for 3rd task running on 2nd node                   */
/* 12: LAPI window for 3rd task running on 2nd node                  */
/* An improperly ordered adapter usage list may cause the job not to */
/* be started or, if started, incorrect execution of the job         */
/*                                                                   */
/* This example starts the job with two tasks on one machine, using  */
/* one switch adapter window for each task. The protocol is forced   */
/* to MPI and a fixed window size of 1 MB is used. An actual         */
/* external scheduler application would check the step's             */
/* requirements and its adapter requirements with ll_get_data        */
/*                                                                   */
start_info->adapterUsageCount = 2;
start_info->adapterUsage = (LL_ADAPTER_USAGE *)malloc((start_info->adapterUsageCount) * sizeof(LL_ADAPTER_USAGE));
start_info->adapterUsage[0].dev_name = use_adapter_1;
start_info->adapterUsage[0].protocol = "MPI";
start_info->adapterUsage[0].subsystem = "US";
start_info->adapterUsage[0].wid = use_window_1;
start_info->adapterUsage[0].mem = 1048576;
start_info->adapterUsage[1].dev_name = use_adapter_2;
start_info->adapterUsage[1].protocol = "MPI";
start_info->adapterUsage[1].subsystem = "US";
start_info->adapterUsage[1].wid = use_window_2;
start_info->adapterUsage[1].mem = 1048576;
if ((rc = ll_start_job_ext(start_info)) != API_OK) {
   printf("Error %d returned attempting to start Job Step %s.%d.%d on %s\n",
          rc,
          start_info->StepId.from_host,
          start_info->StepId.cluster,
          start_info->StepId.proc,
          start_info->nodeList[0]);
} else {
   printf("ll_start_job_ext() invoked to start job step: "
          "%s.%d.%d on machine: %s.\n\n",
          start_info->StepId.from_host,
          start_info->StepId.cluster,
          start_info->StepId.proc,
          start_info->nodeList[0]);
}
free(start_info->nodeList[0]);
free(start_info);
Finally, when the step and job elements are no longer in use, ll_free_objs() and ll_deallocate() should be called on the query element.

Example: Changing scheduler types
You can toggle between the default LoadLeveler scheduler and other types of schedulers by using the SCHEDULER_TYPE keyword. Changes to SCHEDULER_TYPE do not take effect at reconfiguration; the administrator must stop and restart or recycle LoadLeveler when changing SCHEDULER_TYPE. A combination of changes to SCHEDULER_TYPE and some other keywords may terminate LoadLeveler.
The following example illustrates how you can toggle between the default LoadLeveler scheduler and an external scheduler, such as the Extensible Argonne Scheduling sYstem (EASY), developed by Argonne National Laboratory and available as public domain code.
If you are running the default LoadLeveler scheduler, perform the following steps to switch to an external scheduler:
1. In the configuration file, set SCHEDULER_TYPE = API
2. On the central manager machine:
   v Issue llctl -g stop and llctl -g start, or
   v Issue llctl -g recycle
If you are running an external scheduler, this is how you can re-enable the LoadLeveler scheduling algorithm:
1. In the configuration file, set SCHEDULER_TYPE = LL_DEFAULT
2. On the central manager machine:
   v Issue llctl -g stop and llctl -g start, or
   v Issue llctl -g recycle

Preempting and resuming jobs
The BACKFILL scheduler allows LoadLeveler jobs to be preempted so that a higher priority job step can run. Administrators may specify not only preemption rules for job classes, but also the method that LoadLeveler uses to preempt jobs. The BACKFILL scheduler supports various methods of preemption.
Use Table 26 to find more information about preemption.
Table 26. Roadmap of tasks for using preemption
Subtask: Learn about types of preemption and what it means for preempted jobs
Associated instructions: “Overview of preemption”
Subtask: Prepare the LoadLeveler environment and jobs for preemption
Associated instructions: “Planning to preempt jobs” on page 128
Subtask: Configure LoadLeveler to use preemption
Associated instructions: “Steps for configuring a scheduler to preempt jobs” on page 130

Overview of preemption
LoadLeveler supports the following two types of preemption:
v System-initiated preemption
  – Automatically enforced by LoadLeveler, except for job steps running under a reservation.
  – Governed by the PREEMPT_CLASS rules defined in the global configuration file.
  – When resources required by an incoming job are in use by other job steps, all or some of those job steps in certain classes may be preempted according to the PREEMPT_CLASS rules.
  – An automatically preempted job step will be resumed by LoadLeveler when resources become available and conditions such as START_CLASS rules are satisfied.
  – An automatically preempted job step cannot be resumed using the llpreempt command or the ll_preempt subroutine.
v User-initiated preemption
  – Manually initiated by LoadLeveler administrators using the llpreempt command or the ll_preempt subroutine.
  – A manually preempted job step cannot be resumed automatically by LoadLeveler.
  – A manually preempted job step can be resumed using the llpreempt command or the ll_preempt subroutine. Issuing this command or subroutine, however, does not guarantee that the job step will successfully be resumed. A manually preempted job step that was resumed through these interfaces competes for resources with system-preempted job steps, and will be resumed only when resources become available.
  – All steps in a set of coscheduled job steps will be preempted if one or more steps in the set are preempted.
  – A coscheduled step will not be resumed until all steps in the set of coscheduled job steps can be resumed.
For the BACKFILL scheduler only, administrators may select which method LoadLeveler uses to preempt and resume jobs. The suspend method is the default behavior, and is the preemption method LoadLeveler uses for any external schedulers that support preemption. For more information about preemption methods, see “Planning to preempt jobs” on page 128.
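As a simple illustration of system-initiated preemption, the following hypothetical rules (the class names are invented for this example; the syntax is described in “Planning to preempt jobs”) let an urgent class preempt just enough background work to run, while a START_CLASS expression keeps new background steps off nodes where an urgent job is active:
PREEMPT_CLASS[Urgent] = ENOUGH { Background }
START_CLASS[Background] = (Urgent < 1)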
For a preempted job to be resumed after system- or user-initiated preemption occurs through a method other than suspend, the restart keyword in the job command file must be set to yes. Otherwise, LoadLeveler vacates the job step and removes it from the cluster.
In order to determine the preempt type and preempt method to use when a coscheduled step preempts another step, an order of precedence for preempt types and preempt methods has been defined. All steps in the preempting coscheduled step will be examined, and the preempt type and preempt method having the highest precedence will be used. The order of precedence for preempt type is ALL, ENOUGH. The order of precedence for preempt method is remove, vacate, system hold, user hold, suspend.
When coscheduled steps are running, if one step is preempted as a result of a system-initiated preemption, then all coscheduled steps will be preempted. This implies that more resources than necessary might be preempted when one of the steps being preempted is a coscheduled step.

Planning to preempt jobs
Consider the following points when planning to use preemption:
v Avoiding circular preemption under the BACKFILL scheduler
  BACKFILL scheduling enables job preemption using rules specified with the PREEMPT_CLASS keyword. When you are setting up the preemption rules, make sure that you do not create a circular preemption path. Circular preemption causes a job class to preempt itself after applying the preemption rules recursively. For example, the following keyword definitions set up circular preemption rules on Class_A:
  PREEMPT_CLASS[Class_A] = ALL { Class_B }
  PREEMPT_CLASS[Class_B] = ALL { Class_C }
  PREEMPT_CLASS[Class_C] = ENOUGH { Class_A }
  Another example of circular preemption involves allclasses:
  PREEMPT_CLASS[Class_A] = ENOUGH {allclasses}
  PREEMPT_CLASS[Class_B] = ALL {Class_A}
  In this instance, allclasses means all classes except Class_A, so any additional preemption rule preempting Class_A causes circular preemption.
v Understanding implied START_CLASS values
  Using the "ALL" value in the PREEMPT_CLASS keyword places implied restrictions on when a job can start. For example, PREEMPT_CLASS[Class_A] = ALL {Class_B Class_C} tells LoadLeveler two things:
  1. If a new Class_A job is about to run on a node set, then preempt all Class_B and Class_C jobs on those nodes
  2. If a Class_A job is running on a node set, then do not start any Class_B or Class_C jobs on those nodes
  This PREEMPT_CLASS statement also implies the following START_CLASS expressions:
  1. START_CLASS[Class_B] = (Class_A < 1)
  2. START_CLASS[Class_C] = (Class_A < 1)
  LoadLeveler adds all implied START_CLASS expressions to the START_CLASS expressions specified in the configuration file; the implied expressions override any existing values for START_CLASS. For example, suppose the configuration file contains the following statements:
  PREEMPT_CLASS[Class_A] = ALL {Class_B Class_C}
  START_CLASS[Class_B] = (Class_A < 5)
  START_CLASS[Class_C] = (Class_C < 3)
  When LoadLeveler runs through the configuration process, the PREEMPT_CLASS statement on the first line generates the two implied START_CLASS statements. When the implied START_CLASS statements are added in, the user-specified START_CLASS statements are overridden and the resulting START_CLASS statements are effectively equivalent to:
  START_CLASS[Class_B] = (Class_A < 1)
  START_CLASS[Class_C] = (Class_C < 3) && (Class_A < 1)
  Note: LoadLeveler’s central manager (CM) uses these effective expressions instead of the original statements specified in the configuration file. The output from llclass -l displays the original customer-specified START_CLASS expressions.
v Selecting the preemption method under the BACKFILL scheduler
  Use Table 27 and Table 28 on page 130 to determine which preemption method you want to use for jobs running under the BACKFILL scheduler. You may define one or more of the following:
  – A default preemption method to be used for all job classes, by setting the DEFAULT_PREEMPT_METHOD keyword in the configuration file.
  – A specific preemption method for one or more classes or job steps, by using an option on:
    - The PREEMPT_CLASS statement in the configuration file.
    - The llpreempt command, ll_preempt subroutine, or ll_preempt_jobs subroutine.
  Note:
  1. Process tracking must be enabled in order to use the suspend method to preempt a job. To configure LoadLeveler for process tracking, see “Tracking job processes” on page 70.
  2. For a preempted job to be resumed after system- or user-initiated preemption occurs through a method other than suspend and remove, the restart keyword in the job command file must be set to yes. Otherwise, LoadLeveler vacates the job step and removes it from the cluster.
  Table 27. Preemption methods for which LoadLeveler automatically resumes preempted jobs
  Suspend (su) - Resumed when the preempting job completes, on the same nodes, at the point of suspension.
  Vacate (vc) - Resumed when nodes are available, on any nodes that meet the job requirements, at the beginning or at the last successful checkpoint.
  Table 28. Preemption methods for which administrator or user intervention is required
  Remove (rm) - Administrator or user must resubmit the preempted job.
  System Hold (sh) - Administrator must release the preempted job.
  User Hold (uh) - User must release the preempted job.
  For all three methods, the job resumes on any nodes that meet the job requirements, when they are available, at the beginning or at the last successful checkpoint.
v Understanding how LoadLeveler treats resources held by jobs to be preempted
  When a job step is running, it may be holding the following resources:
  – Processors
  – Scheduling slots
  – Real memory
  – ConsumableCpus, ConsumableMemory, ConsumableVirtualMemory, and ConsumableLargePageMemory
  – Communication switches, if the PREEMPTION_TYPE keyword is set to FULL in the configuration file.
  When LoadLeveler suspends preemptable jobs running under the BACKFILL scheduler, certain resources held by those jobs do not become available for the preempting jobs. These resources include ConsumableVirtualMemory, ConsumableLargePageMemory, and floating resources. Under the BACKFILL scheduler only, LoadLeveler releases these resources when you select a preemption method other than suspend; for all preemption methods other than suspend, LoadLeveler treats all job-step resources as available when it preempts the job step.
v Understanding how LoadLeveler processes multiple entries for the same keyword
  If there are multiple entries for the same keyword in either a configuration file or an administration file, the last entry wins. For example, the following statements are all valid specifications for the same keyword START_CLASS:
  START_CLASS [Class_B] = (Class_A < 1)
  START_CLASS [Class_B] = (Class_B < 1)
  START_CLASS [Class_B] = (Class_C < 1)
  All three statements identify Class_B as the incoming class. LoadLeveler resolves these statements according to the "last one wins" rule, so the actual value used for the keyword is (Class_C < 1).

Steps for configuring a scheduler to preempt jobs
Before you begin:
v To define rules for starting and preempting jobs, you need to know certain details about the job characteristics and workload at your installation, including:
  – Which jobs require the same resources, or must be run on the same machines, and so on. This knowledge allows you to group specific jobs into a class.
  – Which jobs or classes have higher priority than others. This knowledge allows you to define which job classes can preempt other classes.
v To correctly configure LoadLeveler to preempt jobs, you might need to refer to the following information:
  – “Choosing a scheduler” on page 44.
  – “Planning to preempt jobs” on page 128.
  – Chapter 12, “Configuration file reference,” on page 263.
  – Chapter 13, “Administration file reference,” on page 321.
  – “llctl - Control LoadLeveler daemons” on page 439.
Perform the following steps to configure a scheduler to preempt jobs:
1. In the configuration file, use the SCHEDULER_TYPE keyword to define the type of LoadLeveler or external scheduler you want to use. Of the LoadLeveler schedulers, only the BACKFILL scheduler supports preemption.
   Rule: If you select the BACKFILL or API scheduler, you must set the PREEMPTION_SUPPORT configuration keyword to either full or no_adapter.
2. (Optional) In the configuration file, use the DEFAULT_PREEMPT_METHOD keyword to define the default method that the BACKFILL scheduler should use for preempting jobs.
   Alternative: You also may set the preemption method through the PREEMPT_CLASS keyword or on the LoadLeveler preemption command or APIs, which override the setting of the DEFAULT_PREEMPT_METHOD keyword.
3. For either the BACKFILL or API scheduler, preempting by the suspend method requires that you set the PROCESS_TRACKING configuration keyword to true.
4. In the configuration file, use the PREEMPT_CLASS and START_CLASS keywords to define the preemption and start policies for job classes.
5. In the administration file, use the max_total_tasks keyword to define the maximum number of tasks that may be run per user, group, or class.
6. On the central manager machine:
   v Issue llctl -g stop and llctl -g start, or
   v Issue llctl -g recycle
When you are done with this procedure, you can use the llq command to determine whether jobs are being preempted and resumed correctly. If they are not, use the LoadLeveler logs to trace the actions of each daemon involved in preemption to determine the problem.
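To pull the procedure together, a configuration file fragment like the following sketch enables preemption under the BACKFILL scheduler. The class names are invented for this example, and the vc value for DEFAULT_PREEMPT_METHOD is an assumption based on the method abbreviations shown in Table 27; verify the accepted values in Chapter 12, “Configuration file reference,” before using them:
SCHEDULER_TYPE = BACKFILL
PREEMPTION_SUPPORT = full
PROCESS_TRACKING = true
DEFAULT_PREEMPT_METHOD = vc
PREEMPT_CLASS[Urgent] = ENOUGH { Background }
START_CLASS[Background] = (Urgent < 1)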
Configuring LoadLeveler to support reservations
Under the BACKFILL scheduler only, LoadLeveler allows authorized users to make reservations or recurring reservations, which specify one or more time periods during which specific node resources are reserved for use by particular users or groups.
Normally, jobs wait to be dispatched until the resources they require become available. Through the use of reservations, wait time can be reduced because only jobs that are bound to the reservation may use the node resources as soon as the reservation period begins.

Reservation tasks for administrators
Use Table 29 to find additional information about reservations.
Table 29. Roadmap of reservation tasks for administrators
Subtask: Learn how reservations work in the LoadLeveler environment
Associated instructions:
v “Overview of reservations” on page 25
v “Understanding the reservation life cycle” on page 214
Subtask: Configuring a LoadLeveler cluster to support reservations
Associated instructions:
v “Steps for configuring reservations in a LoadLeveler cluster”
v “Examples: Reservation keyword combinations in the administration file” on page 134
v “Collecting accounting data for reservations” on page 63
Subtask: Working with reservations: creating reservations, submitting jobs under a reservation, and managing reservations
Associated instructions: “Working with reservations” on page 213
Subtask: Correctly coding and using administration and configuration keywords
Associated instructions:
v Chapter 13, “Administration file reference,” on page 321
v Chapter 12, “Configuration file reference,” on page 263

Steps for configuring reservations in a LoadLeveler cluster
Only the BACKFILL scheduler supports the use of reservations.
Before you begin:
v For information about configuring the BACKFILL scheduler, see “Choosing a scheduler” on page 44.
v You need to decide:
  – Which users will be allowed to create reservations.
  – How many reservations users may own, and how long a duration for their reservations will be allowed.
  – Which nodes will be used for reservations.
  – How much setup time is required before the reservation period starts.
  – Whether accounting data for reservations is to be saved.
  – The maximum lifetime for a recurring reservation before you require the user to request a new reservation for that job.
  – Additional system-wide limitations that you may want to implement, such as maintenance time blocks for specific node sets.
v For examples of possible reservation keyword combinations, see “Examples: Reservation keyword combinations in the administration file” on page 134.
v For details about specific keyword syntax and use:
  – In the administration file, see Chapter 13, “Administration file reference,” on page 321.
  – In the configuration file, see Chapter 12, “Configuration file reference,” on page 263.
Perform the following steps to configure reservations:
1. In the administration file, modify the user or group stanzas to authorize users to create reservations. You may grant the ability to create reservations to an individual user, a group of users, or a combination of users and groups. To do so, define the following keywords in the appropriate user or group stanzas:
   v max_reservations, to set the maximum number of reservations that a user or group may have.
   v (Optional) max_reservation_duration, to set the maximum amount of time for the reservation period.
   Tip: To quickly set up and use reservations, use one of the following examples:
   v To allow every user to create a reservation, add max_reservations=1 to the default user stanza. Then every administrator or user may create a reservation, as long as the number of reservations has not reached the limit for a LoadLeveler cluster.
   v To allow a specific group of users to make 10 reservations, add max_reservations=10 to the group stanza for that LoadLeveler group. Then every user in that group may create a reservation, as long as the number of reservations has not reached the limit for that group or for a LoadLeveler cluster.
   See the max_reservations description in Chapter 13, "Administration file reference," on page 321 for more information about setting this keyword in the user or group stanza.
2. In the administration file, modify the machine stanza of each machine that may be reserved. To do so, set the reservation_permitted keyword to true.
   Tip: If you want to allow every machine to be reserved, you do not have to set this keyword; by default, any LoadLeveler machine may be reserved. If you want to prevent particular machines from being reserved, however, you must define a machine stanza for that machine and set the reservation_permitted keyword to false.
3. In the global configuration file, set reservation policy by specifying values for the following keywords:
   v MAX_RESERVATIONS, to specify the maximum number of reservations per cluster.
     Note: A recurring reservation counts as only one reservation toward the MAX_RESERVATIONS limit, regardless of the number of times that the reservation recurs.
   v RESERVATION_CAN_BE_EXCEEDED, to specify whether LoadLeveler will be permitted to schedule job steps bound to a reservation when their expected end times exceed the reservation end time. The default for this keyword is TRUE, which means that LoadLeveler will schedule these bound job steps even when they are expected to continue running beyond the time at which the reservation ends. Whether these job steps run and successfully complete depends on resource availability, which is not guaranteed after the reservation ends. In addition, these job steps become subject to preemption rules after the reservation ends.
     Tip: You might want to set this keyword value to FALSE to prevent users from binding long-running jobs to run under reservations of short duration.
   v RESERVATION_MIN_ADVANCE_TIME, to define the minimum time between the time at which a reservation is created and the time at which the reservation is to start.
     Tip: To reduce the impact to the currently running workload, consider changing the default for this keyword, which allows reservations to begin as soon as they are created. You may, for example, require reservations to be made at least one day (1440 minutes) in advance, by specifying RESERVATION_MIN_ADVANCE_TIME=1440 in the global configuration file.
   v RESERVATION_PRIORITY, to define whether LoadLeveler administrators may reserve nodes on which running jobs are expected to end after the start time for the reservation.
     Tip: The default for this keyword is NONE, which means that LoadLeveler will not reserve a node on which running jobs are expected to end after the start time for the reservation. If you want to allow LoadLeveler administrators to reserve specific nodes regardless of the expected end times of job steps currently running on the node, set this keyword value to HIGH. Note, however, that setting this keyword value to HIGH might increase the number of job steps that must be preempted when LoadLeveler sets up the reservation, and many jobs might remain in Preempted state. This also applies to Blue Gene job steps. This keyword value applies only for LoadLeveler administrators; other reservation owners do not have this capability.
   v RESERVATION_SETUP_TIME, to define the amount of time LoadLeveler uses to prepare for a reservation before it is to start.
4. (Optional) In the global configuration file, set controls for the collection of accounting data for reservations:
   v To turn on accounting for reservations, add the A_RES flag to the ACCT keyword.
   v To specify a file other than the default history file to contain the data, use the RESERVATION_HISTORY keyword.
   To learn how to collect accounting data for reservations, see "Collecting accounting data for reservations" on page 63.
5. If LoadLeveler is already started, issue the command llctl -g reconfig to process the changes you made in the preceding steps.
   Tip: If you have changed the value of only the RESERVATION_PRIORITY keyword, issue the command llctl reconfig only on the central manager node.
   Result: The new keyword values take effect immediately, but they do not change the attributes of existing reservations.

When you are done with this procedure, you may perform additional tasks described in "Working with reservations" on page 213.
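To illustrate, the following sketch combines these steps. The user name, machine name, history file path, and numeric values are hypothetical, and the A_ON accounting flag shown alongside A_RES is an assumption about a typical ACCT setting; check each keyword in the reference chapters before using it. In the administration file:

   carol: type = user
       max_reservations = 4
       max_reservation_duration = 720

   node01: type = machine
       reservation_permitted = true

In the global configuration file:

   MAX_RESERVATIONS = 10
   RESERVATION_CAN_BE_EXCEEDED = FALSE
   RESERVATION_MIN_ADVANCE_TIME = 1440
   RESERVATION_PRIORITY = NONE
   RESERVATION_SETUP_TIME = 300
   ACCT = A_ON A_RES
   RESERVATION_HISTORY = /var/loadl/reservation_history

Issuing llctl -g reconfig then makes the policy active without changing the attributes of any existing reservations.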
Examples: Reservation keyword combinations in the administration file

The following examples demonstrate LoadLeveler behavior when the max_reservations and max_reservation_duration keywords are set. The examples assume that only the user and group stanzas listed exist in the LoadLeveler administration file.

v Example 1: Assume the administration file contains the following stanzas:

     default: type = user
         maxjobs = 10

     group2: type = group
         include_users = rich dave steve

     rich: type = user
         default_group = group2

  This example shows that, by default, no one is allowed to make any reservations. No one, including LoadLeveler administrators, is permitted to make any reservations unless the max_reservations keyword is used.

v Example 2: Assume the administration file contains the following stanzas:

     default: type = user
         maxjobs = 10

     group2: type = group
         include_users = rich dave steve

     rich: type = user
         default_group = group2
         max_reservations = 5

  This example shows how permission to make reservations can be granted to a specific user through the user stanza only. Because the max_reservations keyword is not used in any group stanza, by default, the group stanzas neither grant permissions nor put any restrictions on reservation permissions. User Rich can make reservations in any group (group2, No_Group, Group_A, and so on), whether or not the group stanzas exist in the LoadLeveler administration file. The total number of reservations user Rich can own at any given time is limited to five.

v Example 3: Assume the administration file contains the following stanzas:

     default: type = user
         maxjobs = 10

     group2: type = group
         include_users = rich dave steve
         max_reservations = 5

     rich: type = user
         default_group = group2

  This example shows how permission to make reservations can be granted to a group of users through the group stanza only. Because the max_reservations keyword is not used in any user stanza, by default, the user stanzas neither grant nor deny permission to make reservations. All users in group2 (Rich, Dave, and Steve) can make reservations, but they must make reservations in group2 because other groups do not grant the permission to make reservations. The total number of reservations the users in group2 can own at any given time is limited to five.

v Example 4: Assume the administration file contains the following stanzas:

     default: type = user
         maxjobs = 10

     group2: type = group
         include_users = rich dave steve
         max_reservations = 5

     rich: type = user
         default_group = group2
         max_reservations = 0

  This example shows how permission to make reservations can be granted to a group of users except one specific user. Because the max_reservations keyword is set to zero in the user stanza for Rich, he does not have permission to make any reservation, even though all other users in group2 (Dave and Steve) can make reservations.

v Example 5: Assume the administration file contains the following stanzas:
     default: type = group
         max_reservations = 0

     default: type = user
         max_reservations = 0

     group2: type = group
         include_users = rich dave steve
         max_reservations = 5

     rich: type = user
         default_group = group2
         max_reservations = 5

     dave: type = user
         max_reservations = 2

  This example shows how permission to make reservations can be granted to specific user and group pairs. Because the max_reservations keyword is set to zero in both the default user and group stanzas, no one has permission to make any reservation unless they are specifically granted permission through both the user and group stanza. In this example:
  – User Rich can own at any time up to five reservations in group2 only.
  – User Dave can own at any time up to two reservations in group2 only.
  The total number of reservations they can own at any given time is limited to five. No other combination of user or group pairs can make any reservations.

v Example 6: Assume the administration file contains the following stanzas:

     default: type = user
         max_reservations = 1

  This example permits any user to make one reservation in any group, until the number of reservations reaches the maximum number allowed in the LoadLeveler cluster.

v Example 7: Assume the administration file contains the following stanzas:

     default: type = group
         max_reservations = 0

     default: type = user
         max_reservations = 0

     group1: type = group
         max_reservations = 6
         max_reservation_duration = 1440

     carol: type = user
         default_group = group1
         max_reservations = 4
         max_reservation_duration = 720

     dave: type = user
         default_group = group1
         max_reservations = 4
         max_reservation_duration = 2880

  In this example, two users, Carol and Dave, are members of group1. Neither Carol nor Dave belongs to any other group with a group stanza in the LoadLeveler administration file, although they may use any string as the name of a LoadLeveler group and belong to it by default. Because the max_reservations keyword is set to zero in the default group stanza, reservations can be made only in group1, which has an allotment of six reservations. Each reservation can have a maximum duration of 1440 minutes (24 hours).
  Considering only the user-stanza attributes for reservations:
  – User Carol can make up to four reservations, each having a maximum duration of 720 minutes (12 hours).
  – User Dave can make up to four reservations, each having a maximum duration of 2880 minutes (48 hours).

  If there are no reservations in the system and user Carol wants to make four reservations, she may do so. Each reservation can have a maximum duration of no more than 720 minutes. If Carol attempts to make a reservation with a duration greater than 720 minutes, LoadLeveler will not make the reservation because it exceeds the duration allowed for Carol.

  Assume that Carol has created four reservations, and user Dave now wants to create four reservations:
  – The number of reservations Dave may make is limited by the state of Carol's reservations and the maximum limit on reservations for group1. If the four reservations Carol made are still being set up, or are active, active shared, or waiting, LoadLeveler will restrict Dave to making only two reservations at this time.
  – Because the value of max_reservation_duration for the group is more restrictive than max_reservation_duration for user Dave, LoadLeveler enforces the group value, 1440 minutes.

  If Dave belonged to another group that still had reservations available, then he could make reservations under that group, assuming the maximum number of reservations for the cluster had not been met. However, in this example, Dave cannot make any further reservations because they are allowed in group1 only.

Steps for integrating LoadLeveler with the AIX Workload Manager

Another administrative setup task you must consider is whether you want to enforce resource usage of ConsumableCpus, ConsumableMemory, ConsumableVirtualMemory, and ConsumableLargePageMemory.

If you want to control these resources, AIX Workload Manager (WLM) can be integrated with LoadLeveler to balance workloads at the machine level. When you are using WLM, workload balancing is done by assigning relative priorities to job processes. These job priorities prevent one job from monopolizing system resources when that resource is under contention.

Note: WLM is not supported in LoadLeveler for Linux.

To integrate LoadLeveler and WLM, perform the following steps:
1. As required for your use, define the applicable options for ConsumableCpus, ConsumableMemory, ConsumableVirtualMemory, or ConsumableLargePageMemory as consumable resources in the SCHEDULE_BY_RESOURCES global configuration keyword. This enables the LoadLeveler scheduler to consider these consumable resources.
2. As required for your use, define the applicable options for ConsumableCpus, ConsumableMemory, ConsumableVirtualMemory, or ConsumableLargePageMemory in the ENFORCE_RESOURCE_USAGE global configuration keyword. This enables enforcement of these consumable resources by AIX WLM.
3. Define hard, soft, or shares in the ENFORCE_RESOURCE_POLICY configuration keyword. This defines the policy that LoadLeveler uses for CPUs and real memory when setting WLM class resource entitlements.
4. (Optional) Set the ENFORCE_RESOURCE_MEMORY configuration keyword to true. This setting allows AIX WLM to limit the real memory usage of a WLM class as precisely as possible. When a class exceeds its limit, all processes in the class are killed.
   Rule: ConsumableMemory must be defined in the ENFORCE_RESOURCE_USAGE keyword in the global configuration file, or LoadLeveler does not consider the ENFORCE_RESOURCE_MEMORY keyword to be valid.
   Tips:
   v When set to true, the ENFORCE_RESOURCE_MEMORY keyword overrides the policy set through the ENFORCE_RESOURCE_POLICY keyword for ConsumableMemory only. The ENFORCE_RESOURCE_POLICY keyword value still applies for ConsumableCpus.
   v ENFORCE_RESOURCE_MEMORY may be set in either the global or the local configuration file. In the global configuration file, this keyword sets the default value for all the machines in the LoadLeveler cluster. If the keyword also is defined in a local file, the local setting overrides the global setting.
5. Using the resources keyword in a machine stanza in the administration file, define the CPU, real memory, virtual memory, and large page machine resources available for user jobs.
   v The ConsumableCpus reserved word accepts a count value of "all". This indicates that the initial resource count will be obtained from the Startd machine update value for CPUs.
   v If no resources are defined for a machine, then no enforcement will be done on that machine.
   v If the count specified by the administrator is greater than what the Startd update indicates, the initial count value will be reduced to match what the Startd reports.
   v For CPUs and real memory, if the count specified by the administrator is less than what the Startd update indicates, the WLM resource shares assigned to a job will be adjusted to represent that difference. In addition, a WLM softlimit will be defined for each WLM class. For example, if the administrator defines 8 CPUs on a 16-CPU machine, then a job requesting 4 CPUs will get a share of 4 and a softlimit of 50%.
   v Use caution when determining the amount of real memory available for user jobs. A certain percentage of a machine's real memory will be dedicated to the Default and System WLM classes and will not be included in the calculation of real memory available for user jobs. Start LoadLeveler with the ENFORCE_RESOURCE_USAGE keyword enabled and issue wlmstat -v -m. Look at the npg column to determine how much memory is being used by these classes.
   v ConsumableVirtualMemory and ConsumableLargePageMemory are hard max limit values.
     – AIX WLM considers the ConsumableVirtualMemory value to be real memory plus large page plus swap space.
     – The ConsumableLargePageMemory value should be equal to a multiple of the page size. For example, 16MB (page size) * 4 pages = 64MB.
6. Decide whether all jobs should have their CPU, real memory, virtual memory, or large page resources enforced, and then define the ENFORCE_RESOURCE_SUBMISSION global configuration keyword.
   v If the value specified is true, LoadLeveler will check all jobs at submission time for the resources and node_resources keywords. To be submitted, either the job's resources or node_resources keyword must specify the same resources as the ENFORCE_RESOURCE_USAGE keyword.
   v If the value specified is false, no checking is performed. Jobs submitted without the resources or node_resources keyword will not have their resources enforced, and they might interfere with other jobs whose resources are enforced.
   v To support existing job command files without the resources or node_resources keyword, the default_resources and default_node_resources keywords in the class stanza can be defined.

For more information on the ENFORCE_RESOURCE_USAGE and the ENFORCE_RESOURCE_SUBMISSION keywords, see "Defining usage policies for consumable resources" on page 60.
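As a sketch of how these settings fit together, the global configuration file entries below enforce CPU and real memory usage, and the machine stanza publishes the resources of one node. The machine name and memory amount are hypothetical, and the exact units and separators accepted by the resources keyword should be verified in the keyword references:

   SCHEDULE_BY_RESOURCES       = ConsumableCpus ConsumableMemory
   ENFORCE_RESOURCE_USAGE      = ConsumableCpus ConsumableMemory
   ENFORCE_RESOURCE_POLICY     = shares
   ENFORCE_RESOURCE_SUBMISSION = true

   node01: type = machine
       resources = ConsumableCpus(all) ConsumableMemory(14000 mb)

With ENFORCE_RESOURCE_SUBMISSION set to true, a job would then need, for example, resources = ConsumableCpus(4) ConsumableMemory(2000 mb) in its job command file, or a default_resources value in its class stanza, in order to be accepted.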
LoadLeveler support for checkpointing jobs

Checkpointing is a method of periodically saving the state of a job step so that if the step does not complete it can be restarted from the saved state. When checkpointing is enabled, checkpoints can be initiated from within the application at major milestones, or by the user, administrator, or LoadLeveler external to the application. Both serial and parallel job steps can be checkpointed.

Once a job step has been successfully checkpointed, if that step terminates before completion, the checkpoint file can be used to resume the job step from its saved state rather than from the beginning. When a job step terminates and is removed from the LoadLeveler job queue, it can be restarted from the checkpoint file by submitting a new job and setting the restart_from_ckpt = yes job command file keyword. When a job is terminated and remains on the LoadLeveler job queue, such as when a job step is vacated, the job step will automatically be restarted from the latest valid checkpoint file. A job can be vacated as a result of flushing a node, issuing checkpoint and hold, stopping or recycling LoadLeveler, or as the result of a node crash.

To find out more about checkpointing jobs, use the information in Table 30.

Table 30. Roadmap of tasks for checkpointing jobs

v Preparing the LoadLeveler environment for checkpointing and restarting jobs:
  – "Checkpoint keyword summary"
  – "Planning considerations for checkpointing jobs" on page 140
  – "AIX checkpoint and restart limitations" on page 141
  – "Naming checkpoint files and directories" on page 145
v Checkpointing and restarting jobs:
  – "Checkpointing a job" on page 232
  – "Removing old checkpoint files" on page 146
v Correctly specifying configuration and administration file keywords:
  – Chapter 12, "Configuration file reference," on page 263
  – Chapter 13, "Administration file reference," on page 321

Checkpoint keyword summary

The following is a summary of the keywords associated with the checkpoint and restart function:
v Configuration file keywords:
  – CKPT_CLEANUP_INTERVAL
  – CKPT_CLEANUP_PROGRAM
  – CKPT_EXECUTE_DIR
  – MAX_CKPT_INTERVAL
  – MIN_CKPT_INTERVAL
  For more information about these keywords, see Chapter 12, "Configuration file reference," on page 263.
v Administration file keywords:
  – ckpt_dir
  – ckpt_time_limit
  For more information about these keywords, see Chapter 13, "Administration file reference," on page 321.
v Job command file keywords:
  – checkpoint
  – ckpt_dir
  – ckpt_execute_dir
  – ckpt_file
  – ckpt_time_limit
  – restart_from_ckpt
  For more information about these keywords, see "Job command file keyword descriptions" on page 359.
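For orientation, a minimal job command file that enables checkpointing might look like the following sketch. The executable path, checkpoint directory, and time-limit format shown are illustrative assumptions; see "Job command file keyword descriptions" on page 359 for the values the checkpoint and ckpt_time_limit keywords actually accept:

   # @ job_type        = serial
   # @ executable      = /u/user1/myapp
   # @ checkpoint      = yes
   # @ ckpt_dir        = /scratch/ckpt
   # @ ckpt_file       = myapp.ckpt
   # @ ckpt_time_limit = 30:00
   # @ queue

To restart a job that has already left the queue from its saved state, a new job would be submitted with restart_from_ckpt = yes and the same ckpt_dir and ckpt_file values.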
Planning considerations for checkpointing jobs

Review the following guidelines before you submit a checkpointing job:
v Plan for jobs that you will restart on different nodes. If you plan to migrate jobs (restart jobs on a different node or set of nodes), you should understand the difference between writing checkpoint files to a local file system and writing them to a global file system (such as AFS or GPFS™). The ckpt_file and ckpt_dir keywords in the job command and configuration files allow you to write to either type of file system. If you are using a local file system, before restarting the job from checkpoint, make certain that the checkpoint files are accessible from the machine on which the job will be restarted.
v Reserve adequate disk space. A checkpoint file requires a significant amount of disk space. The checkpoint will fail if the directory where the checkpoint file is written does not have adequate space. For serial jobs, one checkpoint file will be created. For parallel jobs, one checkpoint file will be created for each task. Since the old set of checkpoint files is not deleted until the new set of files is successfully created, the checkpoint directory should be large enough to contain two sets of checkpoint files. You can make an accurate size estimate only after you have run your job and noticed the size of the checkpoint file that is created.
v Plan for staging executables. If you want to stage the executable for a job step, use the ckpt_execute_dir keyword to define the directory where LoadLeveler will save the executable. This directory cannot be the same as the current location of the executable file, or LoadLeveler will not stage the executable. You may define the ckpt_execute_dir keyword in either the configuration file or the job command file. To decide where to define the keyword, use the information in Table 31.

Table 31. Deciding where to define the directory for staging executables

If the ckpt_execute_dir keyword is defined in the configuration file only:
v LoadLeveler stages the executable file in a new subdirectory of the specified directory. The name of the subdirectory is the job step ID.
v The user is the owner of the subdirectory and has permission 700.
v If the user issues the llckpt command with the -k option, LoadLeveler deletes the staged executable.
v LoadLeveler will delete the subdirectory and the staged executable when the job step ends.

If the ckpt_execute_dir keyword is defined in the job command file only, or in both the configuration and job command files:
v LoadLeveler stages the executable file in the directory specified in the job command file.
v The user is the owner of the file and has execute permission for it.
v The user is responsible for deleting the staged file after the job step ends.

If the ckpt_execute_dir keyword is defined in neither file, LoadLeveler does not stage the executable file for the job step.

v Set your checkpoint file size to the maximum. To make sure that your job can write a large checkpoint file, assign your job to a job class that has its file size limit set to the maximum (unlimited). In the administration file, set up a class stanza for checkpointing jobs with the following entry:

     file_limit = unlimited,unlimited

  This statement specifies that there is no limit on the maximum size of a file that your program can create.
v Choose a unique checkpoint file name. To prevent another job step from writing over your checkpoint file with another checkpoint file, make certain that your checkpoint file name is unique. The ckpt_dir and ckpt_file keywords give you control over the location and name of these files. For more information, see "Naming checkpoint files and directories" on page 145.

AIX checkpoint and restart limitations

There are limitations associated with checkpoint and restart:
v The following items cannot be checkpointed:
  – Programs that are being run under:
    - The dynamic probe class library (DPCL).
    - Any debugger.
  – MPI programs that are not compiled with mpcc_r, mpCC_r, mpxlf_r, mpxlf90_r, or mpxlf95_r.
  – Processes that use:
    - Extended shmat support
    - Pinned shared memory segments
    - The debug malloc tool (MALLOCTYPE=debug)
  – Sets of processes in which any process is running a setuid program when a checkpoint occurs.
  – Sets of processes in which any process is running a setgid program when a checkpoint occurs.
  – Interactive parallel jobs for which POE input or output is a pipe.
  – Interactive parallel jobs for which POE input or output is redirected, unless the job is submitted from a shell that had the CHECKPOINT environment variable set to yes before the shell was started. If POE is run from inside a shell script and is run in the background, the script must be started from a shell started in the same manner for the job to be checkpointable.
  – Interactive POE jobs for which the su command was used prior to checkpointing or restarting the job.
v The node on which a process is restarted must have:
  – The same operating system level (including PTFs). In addition, a restarted process may not load a module that requires a system call from a kernel extension that was not present at checkpoint time.
  – The same switch type as the node where the checkpoint occurred.
  If any threads in a process were bound to a specific processor ID at checkpoint time, that processor ID must exist on the node where that process is restarted.
v If the LoadLeveler cluster contains nodes running a mix of 32-bit and 64-bit kernels, then applications must be checkpointed and restarted on the same set of nodes. For more information, see "llckpt - Checkpoint a running job step" on page 430 and the restart_on_same_nodes keyword description.
v For a parallel job, the number of tasks and the task geometry (the tasks that are common within a node) must be the same on a restart as it was when the job was checkpointed.
v Any regular file open in a process when it is checkpointed must be present on the node where that process is restarted, including the executable and any dynamically loaded libraries or objects.
v If any process uses sockets or pipes, user callbacks should be registered to save data that may be "in flight" when a checkpoint occurs, and to restore the data when the process is resumed after a checkpoint or restart. Similarly, any user shared memory in a parallel task should be saved and restored.
v A checkpoint operation will not begin on a process until each user thread in that process has released all pthread locks, if held. This can potentially cause a significant delay from the time a checkpoint is issued until the checkpoint actually occurs. Also, any thread of a process that is being checkpointed that does not hold any pthread locks and tries to acquire one will be stopped immediately. There are no similar actions performed for atomic locks (_check_lock and _clear_lock, for example).
v Atomic locks must be used in such a way that they do not prevent the releasing of pthread locks during a checkpoint. For example, if a checkpoint occurs and thread 1 holds a pthread lock and is waiting for an atomic lock, and thread 2 tries to acquire a different pthread lock (and does not hold any other pthread locks) before releasing the atomic lock that is being waited for in thread 1, the checkpoint will hang.
v A process must not hold a pthread lock when creating a new process (either implicitly using popen, for example, or explicitly using fork) if releasing the lock is contingent on some action of the new process. Otherwise, a checkpoint could occur that would cause the child process to be stopped before the parent could release the pthread lock, causing the checkpoint operation to hang.
v The checkpoint operation will hang if any user pthread locks are held across:
  – Any collective communication calls in MPI or LAPI
  – Calls to mpc_init_ckpt or mp_init_ckpt
v Processes cannot be profiled at the time a checkpoint is taken.
v There can be no devices other than TTYs or /dev/null open at the time a checkpoint is taken.
v Open files must either have an absolute path name that is less than or equal to PATHMAX in length, or must have a relative path name that is less than or equal to PATHMAX in length from the current directory at the time they were opened. The current directory must have an absolute path name that is less than or equal to PATHMAX in length.
v Semaphores or message queues that are used within the set of processes being checkpointed must only be used by processes within the set of processes being checkpointed. This condition is not verified when a set of processes is checkpointed. The checkpoint and restart operations will succeed, but inconsistent results can occur after the restart.
v The processes that create shared memory must be checkpointed with the processes using the shared memory if the shared memory is ever detached from all processes being checkpointed. Otherwise, the shared memory may not be available after a restart operation.
v The ability to checkpoint and restart a process is not supported for B1 and C2 security configurations.
v A process can only checkpoint another process if it can send a signal to the process. In other words, the privilege checking for checkpointing processes is identical to the privilege checking for sending a signal to the process. A privileged process (the effective user ID is 0) can checkpoint any process. A set of processes can only be checkpointed if each process in the set can be checkpointed.
v A process can only restart another process if it can change its entire privilege state (real, saved, and effective versions of user ID, group ID, and group list) to match that of the restarted process. A set of processes can only be restarted if each process in the set can be restarted.
v The only DCE function supported is DCE credential forwarding by LoadLeveler using the DCE_AUTHENTICATION_PAIR configuration keyword. DCE credential forwarding is for the sole purpose of DFS™ access by the application.
v If a process invokes any Network Information Service (NIS) functions, from then on, AIX will delay the start of a checkpoint of the process until the process returns from any system calls.
v Jobs in which the message passing application is not a direct child of the Partition Manager Daemon (pmd) cannot be checkpointed.
v Scale-across jobs cannot be checkpointed.
v The following functions will return ENOTSUP if called in a job that has enabled checkpointing:
  – clock_getcpuclockid()
  – clock_getres()
  – clock_gettime()
  – clock_nanosleep()
  – clock_settime()
  – mlock()
  – mlockall()
  – mq_close()
  – mq_getattr()
  – mq_notify()
  – mq_open()
  – mq_receive()
  – mq_send()
  – mq_setattr()
  – mq_timedreceive()
  – mq_timedsend()
  – mq_unlink()
  – munlock()
  – munlockall()
  – nanosleep()
  – pthread_barrier_destroy()
  – pthread_barrier_init()
  – pthread_barrier_wait()
  – pthread_barrierattr_destroy()
  – pthread_barrierattr_getpshared()
  – pthread_barrierattr_init()
  – pthread_barrierattr_setpshared()
  – pthread_condattr_getclock()
  – pthread_condattr_setclock()
  – pthread_getcpuclockid()
  – pthread_mutex_getprioceiling()
  – pthread_mutex_setprioceiling()
  – pthread_mutex_timedlock()
  – pthread_mutexattr_getprioceiling()
  – pthread_mutexattr_getprotocol()
  – pthread_mutexattr_setprioceiling()
  – pthread_mutexattr_setprotocol()
  – pthread_rwlock_timedrdlock()
  – pthread_rwlock_timedwrlock()
  – pthread_setschedprio()
  – pthread_spin_destroy()
  – pthread_spin_init()
  – pthread_spin_lock()
  – pthread_spin_trylock()
  – pthread_spin_unlock()
  – sched_get_priority_max()
  – sched_get_priority_min()
  – sched_getparam()
  – sched_getscheduler()
  – sched_rr_get_interval()
  – sched_setparam()
  – sched_setscheduler()
  – sem_close()
  – sem_destroy()
  – sem_getvalue()
  – sem_init()
  – sem_open()
  – sem_post()
  – sem_timedwait()
  – sem_trywait()
  – sem_unlink()
  – sem_wait()
  – shm_open()
  – shm_unlink()
  – timer_create()
  – timer_delete()
  – timer_getoverrun()
  – timer_gettime()
  – timer_settime()
Naming checkpoint files and directories

At checkpoint time, a checkpoint file and potentially an error file will be created. For jobs that are enabled for checkpoint, a control file may be generated at the time of job submission. The directory that will contain these files must already exist and have sufficient space and permissions for these files to be written.

The name and location of these files are controlled through keywords in the job command file or the LoadLeveler configuration. The file name specified is used as a base name from which the actual checkpoint file name is constructed. To prevent another job step from writing over your checkpoint file, make certain that your checkpoint file name is unique. For serial jobs and the master task (POE) of parallel jobs, the checkpoint file name will be <basename>.Tag. For a parallel job, a checkpoint file is created for each task; the checkpoint file name will be <basename>.Taskid.Tag. The tag is used to differentiate between a current and a previous checkpoint file.

A control file may be created in the checkpoint directory. This control file contains information LoadLeveler uses for restarting certain jobs. An error file may also be created in the checkpoint directory. The data in this file is in a machine-readable format; the information contained in the error file is available in mail, in the LoadLeveler logs, or in the output of the checkpoint command. Both of these files are named with the same base name as the checkpoint file, with the extensions .cntl and .err, respectively.

Naming checkpoint files for serial and batch parallel jobs

The following describes the order in which keywords are checked to construct the full path name for a serial or batch checkpoint file:
v Base name for the checkpoint file name:
  1. The ckpt_file keyword in the job command file
  2. The default file name [<jobname.>]<job_step_id>.ckpt, where:
     jobname
         The job_name specified in the job command file. If job_name is not specified, it is omitted from the default file name.
     job_step_id
         Identifies the job step that is being checkpointed.
v Checkpoint directory name:
  1. The ckpt_file keyword in the job command file, if it contains a "/" as the first character
  2. The ckpt_dir keyword in the job command file
  3. The ckpt_dir keyword specified in the class stanza of the LoadLeveler administration file
  4. The default directory, which is the initial working directory

Note that two or more job steps running at the same time cannot both write to the same checkpoint file, since the file would be corrupted.
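As a worked example of these rules, suppose a hypothetical serial job command file specifies:

   # @ job_name  = sim
   # @ ckpt_dir  = /scratch/ckpt
   # @ ckpt_file = sim.ckpt

Because sim.ckpt does not begin with a "/", it is used as the base name and the directory comes from ckpt_dir, so checkpoints are written as /scratch/ckpt/sim.ckpt.<Tag> (or /scratch/ckpt/sim.ckpt.<Taskid>.<Tag> per task for a parallel job), with any control and error files written as /scratch/ckpt/sim.ckpt.cntl and /scratch/ckpt/sim.ckpt.err. If ckpt_file were omitted, the default base name would instead be sim.<job_step_id>.ckpt. The job name and paths here are hypothetical.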
Naming checkpoint files for interactive parallel jobs

The following describes the order in which keywords and variables are checked to construct the full path name for the checkpoint file for an interactive parallel job:
v Checkpoint file name:
  1. The value of the MP_CKPTFILE environment variable within the POE process
  2. The default file name, poe.ckpt.<pid>
v Checkpoint directory name:
  1. The value of the MP_CKPTFILE environment variable within the POE process, if it contains a full path name
  2. The value of the MP_CKPTDIR environment variable within the POE process
  3. The initial working directory

Note: The keywords ckpt_dir and ckpt_file are not allowed in the command file for an interactive session. If they are present, they will be ignored and the job will be submitted.

Removing old checkpoint files

To keep your system free of checkpoint files that are no longer necessary, LoadLeveler provides two keywords to help automate the process of removing these files:
v CKPT_CLEANUP_PROGRAM
v CKPT_CLEANUP_INTERVAL

Both keywords must contain valid values to automate this process. For information about configuration file keyword syntax and other details, see Chapter 12, "Configuration file reference," on page 263.
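A minimal configuration file sketch for automated cleanup follows. The program path is hypothetical and the interval value is illustrative; see Chapter 12 for the requirements the cleanup program must satisfy and for the units of the interval value:

   CKPT_CLEANUP_PROGRAM  = /usr/local/sbin/rm_old_ckpt
   CKPT_CLEANUP_INTERVAL = 3600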
LoadLeveler scheduling affinity support

LoadLeveler offers the following scheduling affinity options:
v Memory and adapter affinity
v Processor affinity

Enabling scheduling affinity allows LoadLeveler jobs to benefit from performance improvements based on multiple chip module (MCM) affinities (memory and adapter) and processor affinities. If enabled, LoadLeveler will schedule and attach the appropriate CPUs in the cluster to the job tasks in order to maximize the performance improvement, based on the type of affinity requested by the job.

Memory and adapter affinity

Memory affinity is a special-purpose option for improving performance on IBM POWER6™, POWER5™, and POWER4™ processor-based systems. These machines contain MCMs, each containing multiple processors. System memory is attached to these MCMs. While any processor can access all of the memory in the system, a processor has faster access and higher bandwidth when addressing memory that is attached to its own MCM rather than memory attached to the other MCMs in the system.

The concept of affinity also applies to the I/O subsystem. The processes running on CPUs from an MCM have faster access to the adapters attached to the I/O slots of that MCM. I/O affinity is referred to as adapter affinity in this topic. For more information about memory and adapter affinity, see AIX Performance Management Guide.

Processor affinity

LoadLeveler provides processor affinity options to improve job performance on the following platforms:
v IBM POWER6 and POWER5 processor-based systems running in simultaneous multithreading (SMT) mode with AIX or Linux
v IBM POWER6 and POWER5 processor-based systems running in Single Threaded (ST) mode with AIX or Linux
v IBM POWER4 processor-based systems with AIX or Linux
v x86 and x86_64 processor-based systems with Linux

On AIX, affinity support is implemented by using a Resource Set (RSet), which contains bit maps for CPU and memory pool resources. The RSet APIs available in AIX can be used to attach RSets to processes. Attaching an RSet to a process limits the process to using only the resources contained in the RSet. One of the main uses of RSets is to limit the application processes to run only on the processors contained in a single MCM and hence to benefit from memory affinity. For more details on RSets, refer to AIX System Management Guide: Operating System and Devices.

On Linux on Power systems, affinity support is implemented by using "cpusets," which provide a mechanism for assigning a set of CPUs and memory nodes (MCMs) to a set of tasks. The cpusets constrain the CPU and memory placement of tasks to only the resources within a task's current cpuset. The cpusets are managed by the virtual file system type cpuset. Before configuring LoadLeveler to support affinity, the cpuset virtual file system must be created on every machine in the cluster to enable affinity support.

On Linux on x86 and x86_64 systems, affinity support is implemented by using the sched_setaffinity Linux-specific system call to assign a set of physical or logical CPUs to the job processes.

Configuring LoadLeveler to use scheduling affinity

On AIX and Linux on Power systems, scheduling affinity can be enabled by using the RSET_SUPPORT configuration file keyword. Machines that are configured with this keyword indicate the ability to service jobs requesting or requiring scheduling affinity.

Enable RSET_SUPPORT with one of these values:
v Choose RSET_MCM_AFFINITY to allow jobs specifying rset = RSET_MCM_AFFINITY or the task_affinity keyword to run on a node. When rset = RSET_MCM_AFFINITY, LoadLeveler will select and attach sets of CPUs to task processes such that a set of CPUs will be from the same MCM. When the task_affinity keyword is used, LoadLeveler will select CPUs regardless of their location with respect to an MCM.
v Choose RSET_USER_DEFINED to allow jobs specifying a user-defined RSet name for rset to run on a node. The RSET_USER_DEFINED option enables scheduling affinity, allowing users more control over scheduling affinity parameters by allowing the use of user-defined RSets. Through the use of user-defined RSets, users can utilize new RSet features before a LoadLeveler
implementation is released. This option also allows users to specify a different number of CPUs in their RSets depending on the needs of each task. This value is supported only on AIX machines.

Note:
1. Because LoadLeveler creates a cpuset for each task requesting affinity under the /dev/cpuset directory on Linux on POWER machines, the cpuset virtual file system must be created and mounted on the /dev/cpuset directory by issuing the following commands on each node:

   # mkdir /dev/cpuset
   # mount -t cpuset none /dev/cpuset

2. A virtual file system of type cpuset mounted at /dev/cpuset will be deleted when the node is rebooted. To create the /dev/cpuset directory and have the virtual cpuset file system mounted on it automatically when the node is rebooted, add the following commands to your startup script (for example, /etc/init.d/boot.local), which is run when the node is rebooted or started:

   if test -e /dev/cpuset || mkdir -p /dev/cpuset ; then
       mount -t cpuset none /dev/cpuset
   fi

See "Configuration file keyword descriptions" on page 265 for more information on the RSET_SUPPORT keyword.

On AIX and Linux on Power systems, jobs requesting processor affinity with the task_affinity keyword in the job command file will run only on machines where the resources statement in the machine stanza in the LoadLeveler administration file contains the ConsumableCpus keyword. For more information on specifying ConsumableCpus, see the resources keyword description in "Administration file keyword descriptions" on page 327.

On Linux on x86 and x86_64 systems, exclusive allocation of CPUs to job steps is enabled by using the ALLOC_EXCLUSIVE_CPU_PER_JOB configuration file keyword. Enable ALLOC_EXCLUSIVE_CPU_PER_JOB with one of these values:
v Choose the PHYSICAL option to allow LoadLeveler to assign tasks to physical processor packages. The PHYSICAL option allows LoadLeveler to treat hyperthreaded processors and multicore processors as a single unit so that a job has dedicated computing resources. For example, a node with two Intel x86 processors with hyperthreading turned ON will be treated as a node with two physical processors. Similarly, a node with two dual-core AMD Opteron processors will be treated as a node with two physical processors.
v Choose the LOGICAL option to allow LoadLeveler to assign tasks to processor units. For example, a node with two Intel x86 processors with hyperthreading turned ON will be treated as a node with four processors. A node with two dual-core AMD Opteron processors will be treated as a node with four processors.

See "Configuration file keyword descriptions" on page 265 for more information on the ALLOC_EXCLUSIVE_CPU_PER_JOB keyword.
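Pulling these pieces together, a minimal sketch might enable MCM affinity in the configuration file, publish CPUs in the machine stanza, and request affinity in the job command file. The machine name is hypothetical, and the task_affinity value shown is an assumption; verify its accepted forms in the job command file keyword descriptions. In the configuration file:

   RSET_SUPPORT = RSET_MCM_AFFINITY

In the administration file:

   node01: type = machine
       resources = ConsumableCpus(all)

In the job command file:

   # @ rset = RSET_MCM_AFFINITY

or, to request processor affinity instead of MCM placement:

   # @ task_affinity = core(1)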
LoadLeveler multicluster support

To provide a more scalable runtime environment and more efficient workload balancing, you may configure a LoadLeveler multicluster environment.

A LoadLeveler multicluster environment consists of two or more LoadLeveler clusters, grouped together through network connections that allow the clusters to share resources. These clusters may be AIX, Linux, or mixed clusters.

Within a LoadLeveler multicluster environment:
v The local cluster is the cluster from which the user submits jobs or issues commands.
v A remote cluster is a cluster that accepts job submissions and commands from the local cluster.
v A local gateway Schedd is a Schedd within the local cluster serving as an inbound point from some remote cluster, an outbound point to some remote cluster, or both.
v A remote gateway Schedd is a Schedd within a remote cluster serving as an inbound point from the local cluster, an outbound point to the local cluster, or both.
v A local central manager is the central manager in the same cluster as the local gateway Schedd.
v A remote central manager is the central manager in the same cluster as a remote gateway Schedd.

A LoadLeveler multicluster environment addresses scalability and workload balancing issues by providing the ability to:
v Distribute workload among LoadLeveler clusters when jobs are submitted.
v Easily access multiple LoadLeveler cluster resources.
v Display information about the multicluster.
v Monitor and control operations in a multicluster.
v Transfer idle jobs from one cluster to another.
v Transfer user input and output files between clusters.
v Enable LoadLeveler to operate in a secure environment where clusters are separated by a firewall.

Table 32 shows the multicluster support subtasks with a pointer to the associated instructions:

Table 32. Multicluster support subtasks and associated instructions

v Configure a LoadLeveler multicluster: "Configuring a LoadLeveler multicluster" on page 150
v Submit and monitor jobs in a LoadLeveler multicluster: "Submitting and monitoring jobs in a LoadLeveler multicluster" on page 223
v Scale-across scheduling: "Scale-across scheduling with multiclusters" on page 153

Table 33. Multicluster support related topics

v Administration file: Cluster stanzas — "Defining clusters" on page 100
v Administration file: Cluster keywords — "Administration file keyword descriptions" on page 327
v Configuration file: Cluster keywords — "Configuration file keyword descriptions" on page 265
v Job command file: Cluster keywords — "Job command file keyword descriptions" on page 359
Table 33. Multicluster support related topics (continued)

v Commands and APIs — Chapter 16, "Commands," on page 411, or Chapter 17, "Application programming interfaces (APIs)," on page 541
v Diagnosis and messages — TWS LoadLeveler: Diagnosis and Messages Guide

Configuring a LoadLeveler multicluster

Table 34 lists the subtasks for configuring a LoadLeveler multicluster.

Table 34. Subtasks for configuring a LoadLeveler multicluster

v Configure the LoadLeveler multicluster environment:
  – "Steps for configuring a LoadLeveler multicluster" on page 151
  – "Steps for securing communications within a LoadLeveler multicluster" on page 153
v Display information about the LoadLeveler multicluster environment:
  – Use the llstatus command:
    - With the -X option to display information about machines in the multicluster.
    - With the -C option to display information defined in cluster stanzas in the administration file.
  – Use the llclass command with the -X option to display information about classes on any cluster (local or remote).
  – Use the llq command with the -X option to display information about jobs on any cluster (local or remote).
Table 34. Subtasks for configuring a LoadLeveler multicluster (continued)

v Monitor and control operations in the LoadLeveler multicluster environment: Existing LoadLeveler user commands accept the -X option for a multicluster environment.
  Rules:
  – Administrator-only commands are not applicable in a multicluster environment.
  – The options -x, -W, -s, and -p cannot be specified together with the -X option on the llmodify command.
  – The options -x and -w cannot be specified together with the -X option on the llq command.
  – The -X option on the following commands is restricted to a single cluster: llcancel, llckpt, llhold, llmodify, llprio.
  – The following commands are not applicable in a multicluster environment: llacctmrg, llchres, llextRPD, llinit, llmkres, llqres, llrmres, llrunscheduler, llsummary.

Steps for configuring a LoadLeveler multicluster

The primary task for configuring a LoadLeveler multicluster environment is to enable communication between gateway Schedd daemons on all of the clusters in the multicluster. To do so requires defining each Schedd daemon as either local or remote, and defining the inbound and outbound hosts with which the daemon will communicate.

Before you begin: You need to know that:
v A single machine may be defined as an inbound or outbound host, or as both.
v A single cluster must belong to only one multicluster.
v A single multicluster must consist of 10 or fewer clusters.
v Clusters must have unique host names within the multicluster network domain space.
v The inbound Schedd becomes the schedd_host of all remote jobs it receives.

Perform the following steps to configure a LoadLeveler multicluster:
1. In the administration file, define one cluster stanza for each cluster in the LoadLeveler multicluster environment.
   Rules:
   v You must define one cluster as the local cluster.
   v You must code the following required cluster-stanza keywords and variable values:
     cluster_name: type=cluster
         outbound_hosts = hostname[(cluster_name)]
         inbound_hosts = hostname[(cluster_name)]

   v If you want to allow users to submit remote jobs to the local cluster, the list of inbound hosts must include the name of the inbound Schedd and the cluster you are defining as remote, or you must specify the name of an inbound Schedd without any cluster specification so that it defaults to being an inbound Schedd for all clusters.
   v If the configuration file keyword SCHEDD_STREAM_PORT for any cluster is set to use a port other than the default value of 9605, you must set the inbound_schedd_port keyword in the cluster stanza for that cluster.
2. (Optional) If the local cluster wants to provide job distribution, where users allow LoadLeveler to select the appropriate cluster for job submission based on administration-defined objectives, define an installation exit to be executed at submit time using the CLUSTER_METRIC configuration keyword. You can use the LoadLeveler data access APIs in this exit to query other clusters for information about possible metrics, such as the number of jobs in a specified job class, the number of jobs in the idle queue, or the number of free nodes in the cluster. For more detailed information, see CLUSTER_METRIC.
   Tip: LoadLeveler provides a set of sample exits for you to use as models. These samples are in the ${RELEASEDIR}/samples/llcluster directory.
3. (Optional) If the local cluster wants to perform user mapping on jobs arriving from remote clusters, define the CLUSTER_USER_MAPPER configuration keyword. For more information, see CLUSTER_USER_MAPPER.
4. (Optional) If the local cluster wants to perform job filtering on jobs received from remote clusters, define the CLUSTER_REMOTE_JOB_FILTER configuration keyword. For more information, see CLUSTER_REMOTE_JOB_FILTER.
5. Notify LoadLeveler daemons by issuing the llctl command with either the reconfig or recycle keyword. Otherwise, LoadLeveler will not process the modifications you made to the administration file.

Additional considerations:
v Remote jobs are subjected to the same configuration checks as locally submitted jobs. Examples include account validation, class limits, include lists, and exclude lists.
v Remote jobs will be processed by the local submit_filter prior to submission to a remote cluster.
v Any tracker program specified in the API parameters will be invoked upon the scheduling cluster nodes.
v If a step is enabled for checkpointing and the ckpt_execute_dir keyword is not specified, LoadLeveler will not copy the executable to the remote cluster; the user must ensure that the executable exists on the remote cluster. If the executable is not in a shared file system, it can be copied to the remote cluster using the cluster_input_file job command file keyword.
v If the job command file is also the executable and the job is submitted or moved to a remote cluster, the $(executable) variable will contain the full path name of the executable on the local cluster from which it came. This differs from the behavior on the local cluster, where the $(executable) variable will be the command-line argument passed to the llsubmit command. If you only want the file name, use the $(base_executable) variable.
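For example, the administration file on each cluster might contain a pair of stanzas similar to the following sketch. The cluster names and host names are hypothetical, and the use of a local keyword to mark the local cluster is an assumption to be verified against the cluster stanza keyword descriptions in Chapter 13:

   cluster_east: type = cluster
       local          = true
       inbound_hosts  = schedd1.east.example.com(cluster_west)
       outbound_hosts = schedd1.east.example.com(cluster_west)

   cluster_west: type = cluster
       inbound_hosts  = schedd9.west.example.com(cluster_east)
       outbound_hosts = schedd9.west.example.com(cluster_east)

On cluster_west, the same two stanzas would appear, with the local cluster marker moved to the cluster_west stanza. After editing the file, llctl -g reconfig (or recycle) applies the changes.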
Steps for securing communications within a LoadLeveler multicluster

Configuring LoadLeveler to use the OpenSSL library enables it to operate in a secure environment where clusters are separated by a firewall.

Perform the following steps to configure LoadLeveler to use OpenSSL in a multicluster environment:
1. Install SSL using the standard platform installation process.
2. Ensure a link exists from the installed SSL library to:
   a. /usr/lib/libssl.so for 32-bit Linux platforms
   b. /usr/lib64/libssl.so for 64-bit Linux platforms
   c. /usr/lib/libssl.a for AIX platforms
3. Create the SSL authorization keys by invoking the llclusterauth command with the -k option on all local gateway Schedds.
   Result: LoadLeveler creates a public key, a private key, and a security certificate for each gateway node.
4. Distribute the public keys to remote gateway Schedds on other secure clusters. This is done by exchanging the public keys with the other clusters you wish to communicate with:
   v For AIX, public keys can be found in the /var/LoadL/ssl/id_rsa.pub file.
   v For Linux, public keys can be found in the /var/opt/LoadL/ssl/id_rsa.pub file.
5. Copy the public keys of the clusters you wish to communicate with into the authorized_keys directory on your inbound Schedd nodes:
   v For AIX, /var/LoadL/ssl/authorized_keys
   v For Linux, /var/opt/LoadL/ssl/authorized_keys
   The authorization key files can be named anything within the authorized_keys directory.
6. Define the cluster stanzas within the LoadLeveler administration file, using the multicluster_security = SSL keyword. Define the ssl_cipher_list keyword if a specific OpenSSL cipher encryption method is desired. Use secure_schedd_port to define the port number to be used for secure inbound transactions to the cluster.
7. Notify LoadLeveler daemons by issuing the llctl -g command with the recycle keyword. Otherwise, LoadLeveler will not process the modifications you made to the administration file.
8. Configure firewalls to accept connections to the secure_schedd_port numbers you defined in the administration file.
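As an illustration, securing the link to one remote cluster might involve steps like the following on an AIX gateway node. The remote host name, destination file name, and port number are hypothetical, and scp is shown only as one possible transport for the key exchange:

   llclusterauth -k
   scp /var/LoadL/ssl/id_rsa.pub westgw:/var/LoadL/ssl/authorized_keys/east_key

with a cluster stanza along these lines in the administration file:

   cluster_west: type = cluster
       multicluster_security = SSL
       secure_schedd_port    = 9607

followed by llctl -g recycle, and a firewall rule admitting connections to port 9607.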
Scale-across scheduling with multiclusters

In the multicluster environment, scale-across scheduling allows you to schedule jobs across more than one cluster. Large jobs that request more resources than a single cluster can provide can thus combine resources from more than one cluster and run on the combined resources, effectively spanning more than one cluster.

By spanning resources across more than one cluster, scale-across scheduling also allows utilization of fragmented resources from more than one cluster. Fragmented resources occur when the resources available on a single cluster cannot satisfy any single job on that cluster. Scale-across scheduling allows a job of any size to take advantage of these resources by combining them from multiple clusters.

The following are not supported with scale-across scheduling:
v Checkpointing jobs
v Coscheduled jobs
v Data staging jobs
v Hostlist jobs
v IBM Blue Gene Systems resources jobs
v Interactive Parallel Operating Environment (POE)
v Multistep jobs
v Preemption of scale-across jobs
v Reservations
v Secure Sockets Layer (SSL)
v Task-geometry jobs
v User space jobs

Requirements for scale-across scheduling

Main cluster
    In a multicluster environment that supports scale-across scheduling, one of the clusters in the multicluster environment must be designated as the "main cluster." The main cluster will only schedule scale-across jobs; it will not run any jobs. Scale-across jobs will run on non-main clusters.

Network connectivity
    Any node in one cluster must be able to communicate with any other node in any other cluster that is part of the scale-across configuration. There are two reasons for this requirement:
    v Since the main cluster initiates the scale-across job, one node in the main cluster must have connectivity to any node in any of the other clusters where the job will run.
    v Tasks of parallel applications must communicate with other tasks running on different nodes.

Configuring LoadLeveler for scale-across scheduling

After you choose a set of clusters to participate in scale-across scheduling, you must designate one cluster as the main cluster. Do so by specifying a value of true in the main_scale_across_cluster keyword for that cluster's stanza in the administration files of all scale-across clusters. The cluster that specifies this keyword as true for its own cluster stanza becomes the main cluster. Any cluster that specifies this keyword as true for another cluster stanza becomes a non-main cluster.

Table 35 lists the scale-across scheduling keywords:

Table 35. Keywords for configuring scale-across scheduling

v Administration file keywords:
  – allow_scale_across_jobs cluster stanza keyword
  – main_scale_across_cluster cluster stanza keyword
  – allow_scale_across_jobs class stanza keyword
v Configuration file keyword:
  – SCALE_ACROSS_SCHEDULING_TIMEOUT keyword
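For example, with a main cluster named cluster_main and one non-main cluster named cluster_a, the administration files on all participating clusters might carry stanzas like the following sketch. The cluster names and the allow_scale_across_jobs setting are illustrative, and other required cluster keywords, such as the inbound and outbound hosts, are omitted here for brevity:

   cluster_main: type = cluster
       main_scale_across_cluster = true

   cluster_a: type = cluster
       allow_scale_across_jobs = true

Because the cluster_main stanza sets main_scale_across_cluster to true for itself, cluster_main becomes the main cluster and only schedules scale-across jobs; cluster_a, which sees the keyword set to true in another cluster's stanza, participates as a non-main cluster and runs them.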
Tuning considerations for scale-across scheduling

NEGOTIATOR_CYCLE_DELAY
   The value on the main and the non-main clusters should be set to similar values to minimize the wait delays that occur when the main cluster requests a negotiator cycle on the non-main clusters. It is reasonable to set NEGOTIATOR_CYCLE_DELAY=1 on all clusters.

MAX_TOP_DOGS
   The maximum number of top-dog scale-across jobs allowed on the main cluster should be smaller than the maximum number of top-dog jobs allowed on the non-main clusters, so that the non-main clusters can schedule both the scale-across and regular jobs as top dogs.

SCALE_ACROSS_SCHEDULING_TIMEOUT
   The default value should be overridden only if there are non-main clusters that have extremely long dispatch cycles or very long NEGOTIATOR_CYCLE_DELAY values. In these cases, SCALE_ACROSS_SCHEDULING_TIMEOUT needs to be set to a value greater than those intervals.

LoadLeveler Blue Gene support

Blue Gene is a massively parallel system based on a scalable cellular architecture which exploits a very large number of tightly interconnected compute nodes (C-nodes).

To take advantage of Blue Gene support, you must be using the LoadLeveler BACKFILL scheduler. With the BACKFILL scheduler, LoadLeveler enables the Blue Gene system to take advantage of reservations that allow you to schedule when, and with which resources, a job will run.

While LoadLeveler Blue Gene support is available on all platforms, Blue Gene®/L™ software is supported only on IBM POWER servers running SLES 9, and Blue Gene®/P™ software is supported only on IBM POWER servers running SLES 10. Mixed clusters of Blue Gene/L and Blue Gene/P systems are not supported.

Terms you should know:
v Compute nodes, also called C-nodes, are system-on-a-chip nodes that execute at most a single job at a time. All the C-nodes are interconnected in a three-dimensional toroidal pattern. Each C-node has a unique address and location in the three-dimensional toroidal space. Compute nodes execute the jobs' tasks and run a minimal custom operating system called BLRTS.
v Front End Nodes (FEN) are machines from which users and administrators interact with Blue Gene. Applications are compiled on and submitted for execution in the Blue Gene core from FENs. User interactions with applications, including debugging, are also performed from the FENs.
v The Service Node is dedicated hardware that runs software to control and manage the Blue Gene system.
v I/O nodes are special nodes that connect the compute nodes to the outside world. I/O nodes allow processes that are executing in the compute nodes to perform I/O operations, such as accessing files, and to communicate with the
job management system. Each I/O node serves anywhere from 8 to 64 C-nodes, depending on the physical configuration.
v mpirun is a program that is executed partly on the Front End Node and partly on the Service Node. mpirun controls and monitors the parallel Blue Gene job. The mpirun program is executed by the user program that is run on the FEN by LoadLeveler.
v A base partition (BP) is a group of compute nodes connected in a 3D rectangular pattern, together with their controlled I/O nodes. A base partition is one of the basic allocation units for jobs. For example, an allocation for a job will require at least one base partition, unless the allocation requests a small partition, in which case sub-base-partition allocation is possible.
v A small partition is a group of C-nodes which are part of one base partition. Valid small partitions have a size of 32 or 128 C-nodes.
v A partition is a group of base partitions, switches, and switch states allocated to a job. A partition is predefined or is created on demand to execute a job. Partitions are physically (electronically) isolated from each other (for example, messages cannot flow outside an allocated partition). A partition can have the topology of a mesh or a torus.
v The Control System is a component that serves as the interface to the Blue Gene system. It contains persistent storage with configuration and status information on the entire system. It also provides various services to perform actions on the Blue Gene system, such as launching a job.
v A node card is a group of 32 compute nodes within a base partition. This is the minimal allocation size for a partition.
v A quarter is a group of 4 node cards. This is a logical grouping of node cards within a base partition. A quarter, which is 128 compute nodes, is the next smallest allowed allocation size for a partition after a node card.
v A switch state is a set of internal switch connections which physically "wire" the partition. A switch has a number of incoming and outgoing wires. An internal switch connection physically connects one incoming wire with one outgoing wire, setting up a communication path between base partitions.

For more information about the Blue Gene system and Blue Gene terminology, refer to IBM System Blue Gene Solution documentation. Table 36 lists the IBM System Blue Gene Solution publications that are available from the IBM Redbooks® Web site at the following URLs:

Table 36. IBM System Blue Gene Solution documentation

Blue Gene/P:
v IBM System Blue Gene Solution: Blue Gene/P System Administration: http://www.redbooks.ibm.com/abstracts/sg247417.html
v IBM System Blue Gene Solution: Blue Gene/P Safety Considerations: http://www.redbooks.ibm.com/abstracts/redp4257.html
v IBM System Blue Gene Solution: Blue Gene/P Application Development: http://www.redbooks.ibm.com/abstracts/sg247287.html
v Evolution of the IBM System Blue Gene Solution: http://www.redbooks.ibm.com/abstracts/redp4247.html
Table 36. IBM System Blue Gene Solution documentation (continued)

Blue Gene/L:
v IBM System Blue Gene Solution: System Administration: http://www.redbooks.ibm.com/abstracts/sg247178.html
v Blue Gene/L: Hardware Overview and Planning: http://www.redbooks.ibm.com/abstracts/sg246796.html
v IBM System Blue Gene Solution: Application Development: http://www.redbooks.ibm.com/abstracts/sg247179.html
v Unfolding the IBM eServer™ Blue Gene Solution: http://www.redbooks.ibm.com/abstracts/sg246686.html

Table 37 lists the Blue Gene subtasks with a pointer to the associated instructions:

Table 37. Blue Gene subtasks and associated instructions
v Configure LoadLeveler Blue Gene support: “Configuring LoadLeveler Blue Gene support”
v Submit and monitor Blue Gene jobs: “Submitting and monitoring Blue Gene jobs” on page 226

Table 38 lists the Blue Gene related topics and associated information:

Table 38. Blue Gene related topics and associated information
v Configuration file: Blue Gene keywords: “Configuration file keyword descriptions” on page 265
v Job command file: Blue Gene keywords: “Job command file keyword descriptions” on page 359
v Commands and APIs: Chapter 16, “Commands,” on page 411, or Chapter 17, “Application programming interfaces (APIs),” on page 541
v Diagnosis and messages: TWS LoadLeveler: Diagnosis and Messages Guide

Configuring LoadLeveler Blue Gene support

Table 39 lists the subtasks for configuring LoadLeveler Blue Gene support along with a pointer to the associated instructions:

Table 39. Blue Gene configuring subtasks and associated instructions
v Configuring LoadLeveler Blue Gene support: “Steps for configuring LoadLeveler Blue Gene support” on page 158
Table 39. Blue Gene configuring subtasks and associated instructions (continued)
v Display information about the Blue Gene system: Use the llstatus command with the -b option to display information about the Blue Gene system. The llstatus command can also be used with the -B option to display information about Blue Gene base partitions, and with the -P option to display information about Blue Gene partitions.
v Display information about Blue Gene jobs:
  – Use the llsummary command with the -l option to display job resource information.
  – Use the llq command with the -b option to display information about all Blue Gene jobs.

Steps for configuring LoadLeveler Blue Gene support

The primary task for configuring LoadLeveler Blue Gene support consists of setting up the environment of the LoadL_negotiator daemon, the environment of any process that will run Blue Gene jobs, and the LoadLeveler configuration file. Perform the following steps to configure LoadLeveler Blue Gene support:
1. Configure the LoadL_negotiator daemon to run on a node which has access to the Blue Gene Control System.
2. Enable Blue Gene support by setting the BG_ENABLED configuration file keyword to true.
3. (Optional) Set any of the following additional Blue Gene related configuration file keywords which your setup requires:
   v BG_ALLOW_LL_JOBS_ONLY
   v BG_CACHE_PARTITIONS
   v BG_MIN_PARTITION_SIZE
   v CM_CHECK_USERID
   See “Configuration file keyword descriptions” on page 265 for more information on these keywords.
4. Set the required environment variables for the LoadL_negotiator daemon and any process that will run Blue Gene jobs. You can use global profiles to set the necessary environment variables for all users. Follow these steps to set environment variables for a LoadLeveler daemon:
   a. Add the required environment variable settings to a global profile.
   b. Set the environment as the administrator before invoking llctl start on the central manager node.
   c. Build a shell script which sets the required environment and starts LoadLeveler, which can be invoked remotely using rsh.
   Note: Using the llctl -h or llctl -g command to start the central manager remotely will not carry the environment variables from the login session to the LoadLeveler daemons on the remote nodes.
   v Specify the full path name of the bridge configuration file by setting the BRIDGE_CONFIG_FILE environment variable. For details on the contents of the bridge configuration file, see the Blue Gene/L: System Administration or Blue Gene/P: System Administration book.
     Example:
     For ksh:
       export BRIDGE_CONFIG_FILE=/var/bluegene/config/bridge.cfg
     For csh:
       setenv BRIDGE_CONFIG_FILE /var/bluegene/config/bridge.cfg
   v Specify the full path name of the file containing the data required to access the Blue Gene Control System database by setting the DB_PROPERTY environment variable. For details on the contents of the database property file, see the Blue Gene/L: System Administration or Blue Gene/P: System Administration book.
     Example:
     For ksh:
       export DB_PROPERTY=/var/bluegene/config/db.cfg
     For csh:
       setenv DB_PROPERTY /var/bluegene/config/db.cfg
   v Specify the host name of the machine running the Blue Gene control system by setting the MMCS_SERVER_IP environment variable. For details on the use of this environment variable, see the Blue Gene/L: System Administration or Blue Gene/P: System Administration book.
     Example:
     For ksh:
       export MMCS_SERVER_IP=bluegene.ibm.com
     For csh:
       setenv MMCS_SERVER_IP bluegene.ibm.com

(Note that csh's setenv takes the variable name and value as separate arguments; the original examples used the ksh "name=value" form for csh as well, which csh does not accept.)

Blue Gene reservation support

Reservation supports Blue Gene resources, including the Blue Gene compute nodes. It is important to note that when a reservation includes Blue Gene nodes, it cannot include conventional nodes. A front end node (FEN), which is used to start a Blue Gene job, is not part of the Blue Gene resources. A Blue Gene reservation reserves only Blue Gene resources, and a Blue Gene job step bound to a reservation uses the reserved Blue Gene resources and shares a FEN outside the reservation.

Jobs using Blue Gene resources can be submitted to a Blue Gene reservation to run. A Blue Gene job step can also be used to select what Blue Gene resources to reserve, to make sure the reservation will have enough Blue Gene resources to run the Blue Gene job step.

For more information about reservations, see “Overview of reservations” on page 25.

Blue Gene fair share scheduling support

Fair share scheduling has been extended to Blue Gene resources as well.

The FAIR_SHARE_TOTAL_SHARES keyword in LoadL_config and the fair_shares keyword for the user and group stanza in LoadL_admin apply to both the CPU resources and the Blue Gene resources. When a Blue Gene job step ends, both the CPU utilization and the Blue Gene resource utilization data will be collected. The elapsed job running time multiplied by the number of C-nodes allocated to the job step (the Size Allocated field in the llq -l output) will be counted as the amount of Blue Gene resource used.

The used shares of the Blue Gene resources are independent of the used shares of the CPU resources and are made available through the LoadLeveler variables UserUsedBgShares and GroupUsedBgShares. The LoadLeveler variable JobIsBlueGene indicates whether a job step is a Blue Gene job step or not. LoadLeveler administrators have flexibility
in specifying the behavior of fair share scheduling by using these variables in the SYSPRIO expression. The llfs command and the related APIs can also handle requests related to the Blue Gene resources.

For more information about fair share scheduling, see “Using fair share scheduling.”

Blue Gene heterogeneous memory support

The LoadLeveler job command file has a bg_requirements keyword that can be used to specify the requirements that a Blue Gene base partition must meet to execute the job step.

The Blue Gene compute nodes (C-nodes) in the same base partition have the same amount of physical memory. The C-nodes in different base partitions might have different amounts of physical memory. The bg_requirements job command file keyword allows users to specify the memory requirement on the Blue Gene C-nodes. The bg_requirements keyword works like the requirements keyword, but it can only support memory requirements and applies only to Blue Gene base partitions. For a Blue Gene job step, the requirements keyword value applies to the front end node needed by the job step, and the bg_requirements keyword value applies to the Blue Gene base partitions needed by the job step.

Blue Gene preemption support

Preemption support for Blue Gene jobs has been enabled. Blue Gene jobs have the same preemption support as non-Blue Gene jobs.

In a typical Blue Gene system, many Blue Gene jobs share the same front end node while dedicated Blue Gene resources are used for each job. To avoid preempting Blue Gene jobs that use different Blue Gene resources than those requested by a preempting job, ENOUGH instead of ALL must be used in the PREEMPT_CLASS rules for Blue Gene job preemption. For more information about preemption, see “Preempting and resuming jobs” on page 126.

Blue Gene/L HTC partition support

The allocation of High Throughput Computing (HTC) partitions on Blue Gene/L is supported when the LoadLeveler BG_CACHE_PARTITIONS configuration keyword is set to false.

See the following IBM System Blue Gene Solution Redbooks (dated April 27, 2007) for more information about Blue Gene/L HTC support:
v IBM Blue Gene/L: System Administration, SG24-7178
v IBM Blue Gene/L: Application Development, SG24-7179

Using fair share scheduling

Fair share scheduling in LoadLeveler provides a way to divide resources in a LoadLeveler cluster among users or groups of users.

To fairly share cluster resources, LoadLeveler can be configured to allocate a proportion of the resources to each user or group and to let job priorities be
adjusted based on how much of the resources have been used and when they were used. Generally speaking, LoadLeveler should be configured so that job priorities decrease for a user or group that has recently used more resources than the allocated proportion, and increase for a user or group that has not run any jobs recently.

Administrators can configure the behavior of fair share scheduling through a set of configuration keywords. They can also query fair share information, save a snapshot of historic data, reset and restore fair share scheduling, and perform other functions by using the LoadLeveler llfs command, the GUI, and the corresponding APIs.

Fair share scheduling also includes Blue Gene resources (see “Blue Gene fair share scheduling support” on page 159 for more information).

Note: The time of day clocks on all of the nodes in the cluster must be synchronized in order for fair share scheduling to work properly.

For more information, see the following:
v “llfs - Fair share scheduling queries and operations” on page 450
v Corresponding APIs:
  – “ll_fair_share subroutine” on page 642
  – “Data access API” on page 560
v Keywords:
  – fair_shares
  – FAIR_SHARE_TOTAL_SHARES
  – FAIR_SHARE_INTERVAL
v SYSPRIO expression

Fair share scheduling keywords

The FAIR_SHARE_TOTAL_SHARES global configuration file keyword is used to specify the total number of shares that each type of resource is divided into. The fair_shares keyword in a user or group stanza in the administration file specifies how many shares the user or group is allocated. The ratio of the fair_shares keyword value in a user or group stanza over the FAIR_SHARE_TOTAL_SHARES keyword value defines the resource usage proportion for the user or group. For example, if a user is allocated one third of the cluster resources, then the ratio of the user’s fair_shares value over the FAIR_SHARE_TOTAL_SHARES keyword value should be one third. The LoadLeveler SYSPRIO expression can be configured to let job priorities change to achieve the specified resource usage proportions.

Besides changing job priorities, fair share scheduling does not change in any way how LoadLeveler schedules jobs. If a job can be scheduled to run, it will be run regardless of whether the owner and the LoadLeveler group of the job have any shares allocated or not. No matter how many shares are allocated to a user, if the user does not submit any jobs to run, then the resource usage proportion for that user cannot be achieved and other users might be able to use more than their allocated proportions.

Note: The sum of all allocated shares for users or groups does not have to equal the value of the FAIR_SHARE_TOTAL_SHARES keyword. The share
allocation can be used as a way to prevent a single user from consuming too much of the cluster resources and as a way to share the resources as fairly as possible.

When the value of the FAIR_SHARE_TOTAL_SHARES keyword is greater than 0, fair share scheduling is on, which means that resource usage data is collected when every job ends, regardless of the fair_shares values for any user or group. The collected usage data is converted to used shares for each user and group. The llfs command can be used to display the allocated and used shares. Turning fair share scheduling on does not mean that job priorities are affected by fair share scheduling. You have to configure the SYSPRIO expression to let fair share scheduling affect job priorities in a way that suits your needs. By default, the value of the FAIR_SHARE_TOTAL_SHARES keyword is 0 and fair share scheduling is disabled.

There is a built-in decay mechanism for the historic resource usage data that is collected when jobs end; that is, the initial resource usage value becomes smaller and smaller as time goes by. This decay mechanism allows the most recent resource usage to have more impact on fair share scheduling. The FAIR_SHARE_INTERVAL global configuration file keyword is used to specify how fast the decay is: the shorter the interval, the faster the historic data decays. A resource usage value decays to 5% of its initial value after an elapsed time period of the same length as the FAIR_SHARE_INTERVAL value. Generally, the interval should be at least several times larger than the typical job running time in the cluster to get stable results. A value should be chosen corresponding to how long the historic resource usage data should have an impact on the current job priorities.

The LoadLeveler SYSPRIO expression is used to calculate job priorities. A set of LoadLeveler variables, including some related to fair share scheduling, can be used in the SYSPRIO expression in the global configuration file. You can define the SYSPRIO expression to let fair share scheduling influence the job priorities in a way that is suitable to your needs. For more information, see the SYSPRIO expression in Chapter 12, “Configuration file reference,” on page 263.

When GroupTotalShares, GroupUsedShares, UserTotalShares, UserUsedShares, UserUsedBgShares, GroupUsedBgShares, and JobIsBlueGene and their corresponding user-defined variables are used, you must use the NEGOTIATOR_RECALCULATE_SYSPRIO_INTERVAL global configuration keyword to specify a time interval at which the job priorities will be recalculated using the most recent share usage information.

You can add the following user-defined variables to the LoadL_config global configuration file to make it easier to specify fair share scheduling in the SYSPRIO expressions:
v GroupRemainingShares = (GroupTotalShares - GroupUsedShares)
v GroupHasShares = ($(GroupRemainingShares) > 0)
v GroupSharesExceeded = ($(GroupRemainingShares) <= 0)
v UserRemainingShares = (UserTotalShares - UserUsedShares)
v UserHasShares = ($(UserRemainingShares) > 0)
v UserSharesExceeded = ($(UserRemainingShares) <= 0)
v UserRemainingBgShares = (UserTotalShares - UserUsedBgShares)
v UserHasBgShares = ($(UserRemainingBgShares) > 0)
v UserBgSharesExceeded = ($(UserRemainingBgShares) <= 0)
v GroupRemainingBgShares = (GroupTotalShares - GroupUsedBgShares)
v GroupHasBgShares = ($(GroupRemainingBgShares) > 0)
v GroupBgSharesExceeded = ($(GroupRemainingBgShares) <= 0)
v JobIsNotBlueGene = ! JobIsBlueGene

If fair share scheduling is not turned on, either because the FAIR_SHARE_TOTAL_SHARES keyword value is not positive or because the scheduler type is not BACKFILL, then the variables will have the following values:
   GroupTotalShares: 0
   GroupUsedShares: 0
   $(GroupRemainingShares): 0
   $(GroupHasShares): 0
   $(GroupSharesExceeded): 1
   UserUsedBgShares: 0
   $(UserRemainingBgShares): 0
   $(UserHasBgShares): 0
   $(UserBgSharesExceeded): 1

If a user has the fair_shares keyword set to 10 in its user stanza and the user has used up 8 CPU shares and 3 Blue Gene shares, then the variables will have the following values:
   UserTotalShares: 10
   UserUsedShares: 8
   $(UserRemainingShares): 2
   $(UserHasShares): 1
   $(UserSharesExceeded): 0
   UserUsedBgShares: 3
   $(UserRemainingBgShares): 7
   $(UserHasBgShares): 1
   $(UserBgSharesExceeded): 0

If a group has the fair_shares keyword set to 10 in its group stanza and the group has used up 15 CPU shares and 0 Blue Gene shares, then the variables will have the following values:
   GroupTotalShares: 10
   GroupUsedShares: 15
   $(GroupRemainingShares): -5
   $(GroupHasShares): 0
   $(GroupSharesExceeded): 1
   GroupUsedBgShares: 0
   $(GroupRemainingBgShares): 10
   $(GroupHasBgShares): 1
   $(GroupBgSharesExceeded): 0

The following variables have these values for a Blue Gene job step:
   JobIsBlueGene: 1
   $(JobIsNotBlueGene): 0

The following variables have these values for a non-Blue Gene job step:
   JobIsBlueGene: 0
   $(JobIsNotBlueGene): 1

Reconfiguring fair share scheduling keywords

LoadLeveler configuration and administration files can be modified to assign new values to various keywords. After the files have been modified, issue the llctl -g reconfig command to read in the new keyword values. All new keywords introduced for fair share scheduling become effective right after reconfiguration.
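As a hedged illustration of this workflow (the keyword value shown is arbitrary, not a recommendation), an administrator might edit the global configuration file and then reconfigure every machine in the cluster:

   # In LoadL_config, change the decay interval (illustrative value):
   #    FAIR_SHARE_INTERVAL = 240
   # Then, with the central manager and all Schedd daemons still up,
   # propagate the change:
   llctl -g reconfig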
Reconfiguring when the Schedd daemons are up

To avoid any inconsistency, change the value of the FAIR_SHARE_INTERVAL keyword while the central manager and all Schedd daemons are up, then do the reconfiguration. After the reconfiguration, the following will happen:
v All historic fair share scheduling data will be decayed to the current time using the old value.
v The old value is replaced with the new value.
v The new value will be used from then on.

Note:
1. You must have the same value for the FAIR_SHARE_INTERVAL keyword in the central manager and the Schedd daemons because the FAIR_SHARE_INTERVAL keyword determines the rate of decay for the historic fair share data, and the same value on the daemons maintains data consistency.
2. Some LoadLeveler configuration parameters require restarting LoadLeveler with llctl recycle for changes to take effect. You can use llctl recycle when changing fair share parameters as well. The effect will be the same as using llctl reconfig, because when a Schedd machine shuts down normally, the fair share scheduling data is decayed to the time of the shutdown and saved.

Reconfiguring when the Schedd daemons are down

The value for the FAIR_SHARE_INTERVAL keyword may need to be changed while a Schedd daemon is down. If so, the following will happen when the Schedd daemon is restarted:
v All historic fair share scheduling data will be read in from the disk files in the $(SPOOL) directory with no change.
v When a new job ends, the historic fair share scheduling data for the owner and the LoadLeveler group of the job will be updated using the new value and then sent to the central manager. The new value is used effectively from the time the data was last updated before the Schedd went down, not from the time of the reconfiguration as it would normally be.

Example: three groups share a LoadLeveler cluster

This example, in which three groups share a LoadLeveler cluster, may apply to your situation. For purposes of this example, we will assume the following:
v Three groups of users share a LoadLeveler cluster and each group is to have one third of the resources.
v Historic data will have significant impact for about 10 days.
v Groups with unused shares will have much higher job priorities than groups which have used up their shares.

To set up fair share scheduling with these assumptions, an administrator could update the LoadL_config global configuration file as follows:
FAIR_SHARE_TOTAL_SHARES = 99
FAIR_SHARE_INTERVAL = 240
NEGOTIATOR_RECALCULATE_SYSPRIO_INTERVAL = 300
GroupRemainingShares = ( GroupTotalShares - GroupUsedShares )
GroupHasShares = ( $(GroupRemainingShares) > 0 )
SYSPRIO : 10000000 * $(GroupHasShares) - QDate

In the LoadL_admin admin file, add:

chemistry: type = group
   include_users = harold mark kim enci george charlie
   fair_shares = 33
physics: type = group
   include_users = cnyang gchen newton roy
   fair_shares = 33
math: type = group
   include_users = rich dave chris popco
   fair_shares = 33

When user rich in the math group wants to submit a job, the following keyword can be put into the job command file so that the job will have high priority through the math group:

#@group=math

If user rich has a job that does not need to be run right away or as soon as possible (that is, it can run at any time), then he should run the job in a LoadLeveler group with no shares allocated (for example, the No_Group group). Because the group No_Group has no shares allocated to it in this example, $(GroupHasShares) has a value of 0 and the job priority will be lower than that of jobs whose group has unused shares. The job will run when all higher priority jobs are done, or whenever it can be used to backfill a higher priority job (that is, whenever it can be scheduled).

Example: two thousand students share a LoadLeveler cluster

This example, in which two thousand students share a LoadLeveler cluster, may apply to your situation. For purposes of this example, we will assume the following:
v A university has 2000 students who share a LoadLeveler cluster and every student is to have the same number of shares of the resources.
v Historic data will have significant impact for about 7 days (because FAIR_SHARE_INTERVAL is not specified and the default value is 7 days).
v A student with unused shares is to have somewhat higher job priorities, with priorities decreasing as the number of used shares increases.

The LoadL_config global configuration file should contain the following:
FAIR_SHARE_TOTAL_SHARES = 10000
NEGOTIATOR_RECALCULATE_SYSPRIO_INTERVAL = 600
UserRemainingShares = ( UserTotalShares - UserUsedShares )
SYSPRIO : 100000 * $(UserRemainingShares) - QDate

In the LoadL_admin admin file, add:

default: type = user
   fair_shares = 5

Note: The value fair_shares = 5 is the result of dividing the total number of shares by the number of students (10000 ÷ 2000). The number of students can be more or less than 2000, but the same configuration parameters still prevent a single user from using too much of the cluster resources in a short time period.

We can see from the SYSPRIO expression that the larger the number of unused shares for a student, and the earlier the job is submitted, the higher the priority of the student’s job.

Querying information about fair share scheduling

The llfs command, the GUI, and the data access API can be used to query information about fair share scheduling.

The llfs command without any options displays the allocated and used shares for all users and LoadLeveler groups that have run one or more jobs in the cluster to completion. The -u and -g options can show the allocated and used shares for any user or LoadLeveler group, regardless of whether they have run any jobs in the cluster. In either case, the user or group need not have any fair_shares allocated in the LoadL_admin administration file for the usage to be reported by the llfs command.

Resetting fair share scheduling

The llfs -r command option (or the GUI option Reset historic data), by default, will start fair share scheduling from the beginning, which means that all the previous historic data will be lost. This command will not run unless all Schedd daemons are up: if a Schedd daemon is down when this command option is issued, the request will not be processed.

To manually reset fair share scheduling, bring down the LoadLeveler cluster, remove all fair share data files (fair_share_queue.dir and fair_share_queue.pag) in the $(SPOOL) directory, and then restart the LoadLeveler cluster.

Saving historic data

The LoadLeveler central manager holds the complete historic fair share data when it is up. Every Schedd holds a portion of the historic fair share data, and the data is stored on disk in the $(SPOOL) directory. When the central manager is restarted, it receives the historic fair share data from every Schedd. If a Schedd machine is down temporarily and the central manager remains up, the data in the central manager is not affected. If a Schedd machine is permanently damaged and
the central manager restarts, the central manager will not be able to get all of the historic fair share data because the data stored on the damaged Schedd is lost. If the value of FAIR_SHARE_INTERVAL is very large, many days of data on the damaged Schedd could be lost.

To reduce the loss of data, the historic fair share data in the central manager can be saved to disk periodically. Recovery can then be done using the latest saved data when a Schedd machine is permanently out of service. The llfs -s command, the GUI, or the ll_fair_share API can be used to save a snapshot of the historic data in the central manager to a file.

Restoring saved historic data

You can use the llfs -r command option, the GUI, or the ll_fair_share API to restore fair share scheduling to a previously saved state. For the file name, specify a file you saved previously using llfs -s.

If the central manager goes down and restarts again, the historic data stored in an out-of-service Schedd machine is not reported to the central manager. If the Schedd machine will not be brought back to service at all, then the administrator can consider restoring fair share scheduling to a state corresponding to the latest saved file.

Procedure for recovering a job spool

The llmovespool command is intended for recovery purposes only. Jobs being managed by a down Schedd are unable to clean up resources or move to completion. These jobs need their job records transferred to another Schedd. The llmovespool command moves the job records from the spool of one managing Schedd to another managing Schedd in the local cluster. All moved jobs retain their original job identifiers.

It is very important that the Schedd that created the job records to be moved is not running during the move operation. Jobs within the job queue database will be unrecoverable if the job queue is updated during the move by any process other than the llmovespool command. The llmovespool command operates on a set of job records; these records are updated as the command executes. When a job is successfully moved, the records for that job are deleted. Job records that are not moved because of a recoverable failure, like the original Schedd not being fenced, may have the llmovespool command executed against them again. It is very important that a Schedd never reads the job records from the spool being moved. Jobs will be unrecoverable if more than one Schedd is considered to be the managing Schedd.

The procedure for recovering a job spool is:
1. Move the files located in the spool directory to be transferred to another directory before entering the llmovespool command, in order to guarantee that no other Schedd process is updating the job records.
2. Add the statement schedd_fenced = true to the machine stanza of the original Schedd node in order to guarantee that the central manager ignores connections from the original managing Schedd, and to prevent conflicts from arising if the original Schedd is restarted after the llmovespool command has been run. See the schedd_fenced keyword in Chapter 13, “Administration file reference,” on page 321 for more information.
3. Reconfigure the central manager node so that it recognizes that the original Schedd is "fenced".
4. Issue the llmovespool command, providing the spool directory where the job records are stored. The command displays a message that the transfer has started and reports status for each job as it is processed.

For more information about the llmovespool command, see “llmovespool - Move job records” on page 472. For more information about the ll_move_spool API, see “ll_move_spool subroutine” on page 683.
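A hedged command-line sketch of the four-step procedure (the machine name node01, the central manager host cmhost, and the spool paths are hypothetical):

   # 1. Preserve the down Schedd's job records under another directory:
   mv /var/LoadL/spool /var/LoadL/spool_to_move
   # 2. In the administration file, fence the original Schedd:
   #       node01: type = machine
   #               schedd_fenced = true
   # 3. Reconfigure the central manager so it ignores the fenced Schedd:
   llctl -h cmhost reconfig
   # 4. Move the job records to the new managing Schedd:
   llmovespool /var/LoadL/spool_to_move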
Chapter 7. Using LoadLeveler’s GUI to perform administrator tasks

Note: This is the last release that will provide the Motif-based graphical user interface xloadl. The function available in xloadl has been frozen since TWS LoadLeveler 3.3.2.

The end user can perform many tasks more efficiently and faster using the graphical user interface (GUI), but there are certain tasks that end users cannot perform unless they have the proper authority. If you are defined as a LoadLeveler administrator in the LoadLeveler configuration file, then you are immediately granted administrative authority and can perform the administrative tasks discussed in this topic. To find out how to grant someone administrative authority, see “Defining LoadLeveler administrators” on page 43.

You can access LoadLeveler administrative commands using the Admin pull-down menu on both the Jobs window and the Machines window of the GUI. The Admin pull-down menu on the Jobs window corresponds to the command options available in the llhold, llfavoruser, and llfavorjob commands. The Admin pull-down menu on the Machines window corresponds to the command options available in the llctl command.

The main window of the GUI has three sub-windows: one for job status with pull-down menus for job-related commands, one for machine status with pull-down menus for machine-related commands, and one for messages and logs (see “The LoadLeveler main window” on page 404 in Chapter 15, “Graphical user interface (GUI) reference,” on page 403). There are a variety of facilities available that allow you to sort and select the items displayed.

Job-related administrative actions

You access the administrative commands that act on jobs through the Admin pull-down menu in the Jobs window of the GUI. You can perform the following tasks with this menu:

Favor Users
   Allows you to favor users. This means that you can select one or more users whose jobs you want to move up in the job queue. This corresponds to the llfavoruser command.
   Select   Admin from the Jobs window
   Select   Favor User
            The Order by User window appears.
   Type in  The name of the user whose jobs you want to favor.
   Press    OK
Unfavor Users
   Allows you to unfavor the jobs of users that you previously favored. This corresponds to the llfavoruser command.
   Select   Admin from the Jobs window
   Select   Unfavor User
            The Order by User window appears.
   Type in  The name of the user whose jobs you want to unfavor.
   Press    OK

Favor Jobs
   Allows you to select a job that you want to favor. This corresponds to the llfavorjob command.
   Select   One or more jobs from the Jobs window
   Select   Admin from the Jobs window
   Select   Favor Job
            The selected jobs are favored.
   Press    OK

Unfavor Jobs
   Allows you to select a job that you want to unfavor. This corresponds to the llfavorjob command.
   Select   One or more jobs from the Jobs window
   Select   Admin from the Jobs window
   Select   Unfavor Job
            Unfavors the jobs that you previously selected.

Syshold
   Allows you to place a system hold on a job. This corresponds to the llhold command.
   Select   A job from the Jobs window
   Select   Admin pull-down menu from the Jobs window
   Select   Syshold to place a system hold on the job.

Release From Hold
   Allows you to release the system hold on a job. This corresponds to the llhold command.
   Select   A job from the Jobs window
   Select   Admin pull-down menu from the Jobs window
   Select   Release From Hold to release the system hold on the job.

Preempt
   Available when using the BACKFILL or external schedulers. Preempt allows you to place the selected jobs in preempted state. This action corresponds to the llpreempt command.
   Select   One or more jobs from the Jobs window
   Select   Admin pull-down menu from the Jobs window
   Select   Preempt

Resume Preempted Job
   Available only when using the BACKFILL or external schedulers. Resume Preempted Job allows you to remove user-initiated preemption (initiated using the Preempt menu option or the llpreempt command) from the selected jobs. This action corresponds to the llpreempt -r command.
   Select   One or more jobs from the Jobs window
   Select   Admin pull-down menu from the Jobs window
   Select   Resume Preempted Job

Prevent Preempt
   Available only when using the BACKFILL or API scheduler. Prevent Preempt allows you to place the selected running job into a non-preemptable state. When the BACKFILL or API scheduler is in use, this is equivalent to the llmodify -p nopreempt command.
   Select   One job from the Jobs window
   Select   Admin pull-down menu from the Jobs window
   Select   Prevent Preempt

Allow Preempt
   Available only when using the BACKFILL or API scheduler. Allow Preempt makes the unpreemptable job preemptable again. When the BACKFILL or API scheduler is in use, this is equivalent to the llmodify -p preempt command.
   Select   One or more jobs from the Jobs window
   Select   Admin pull-down menu from the Jobs window
   Select   Allow Preempt

Extend Wallclock Limits
   Allows you to extend the wallclock limits by the number of minutes specified. This corresponds to the llmodify -W command.
   Select   Admin pull-down window from the Jobs window
   Select   Extend Wallclock Limit
            The Extend Wallclock Limits window appears.
   Type in  The number of minutes to extend the wallclock limit.
   Press    OK

Modify Job Priority
   Allows you to modify the system priority of a job step. This corresponds to the llmodify -s command.
   Select   Admin pull-down window from the Jobs window
   Select   Modify Job Priority
            The Modify Job Priority window appears.
   Type in  An integer value for system priority.
   Press    OK
Move to another cluster
   Allows you to move an idle job from the local cluster to another. This menu item appears only when a multicluster environment is configured. It corresponds to the llmovejob command.
   Select   Admin pull-down window from the Jobs window
   Select   Move to another cluster
            The Move Job to Another Cluster window appears.
   Select   The name of the target cluster.
   Press    OK

Machine-related administrative actions

You access the administrative commands that act on machines using the Admin pull-down menu in the Machines window of the GUI. Using the GUI pull-down menu, you can perform the tasks described in this topic.

Start All
   Starts LoadLeveler on all machines listed in machine stanzas, beginning with the central manager. Submit-only machines are skipped. Use this option when specifying alternate central managers in order to ensure the primary central manager starts before any alternate central manager attempts to serve as central manager.
   Select   Admin from the Machines window.
   Select   Start All

Start LoadLeveler
   Allows you to start LoadLeveler on selected machines.
   Select   One or more machines on which you want to start LoadLeveler.
   Select   Admin from the Machines window.
   Select   Start LoadLeveler

Start Drained
   Allows you to start LoadLeveler with startd drained on selected machines.
   Select   One or more machines on which you want startd drained.
   Select   Admin from the Machines window.
   Select   Start Drained

Stop LoadLeveler
   Allows you to stop LoadLeveler on selected machines.
   Select   One or more machines on which you want to stop LoadLeveler.
   Select   Admin from the Machines window.
   Select   Stop LoadLeveler

Stop All
   Stops LoadLeveler on all machines listed in machine stanzas. Submit-only machines are skipped.
   Select   Admin from the Machines window.
   Select   Stop All
Reconfig
   Forces all daemons to reread the configuration files.
   Select   The machine on which you want to operate. To reconfigure this xloadl session, choose reconfig but do not select a machine.
   Select   Admin from the Machines window.
   Select   reconfig

Recycle
   Stops all LoadLeveler daemons and restarts them.
   Select   The machine on which you want to operate.
   Select   Admin from the Machines window.
   Select   recycle

Configuration Tasks
   Starts the Configuration Tasks wizard.
   Select   Admin from the Machines window.
   Select   Config Tasks
   Note: Use the invoking script lltg to start the wizard outside of xloadl. This option will appear on the pull-down only if the LoadL.tguides fileset is installed.

Drain
   Allows no more LoadLeveler jobs to begin running on this machine, but does allow running jobs to complete.
   Select   The machine on which you want to operate.
   Select   Admin from the Machines window.
   Select   drain
            A cascading menu allows you to select either daemons, Schedd, startd, or startd by class. If you select daemons, both the startd and the Schedd on the selected machine will be drained. If you select Schedd, only the Schedd on the selected machine will be drained. If you select startd, only the startd on the selected machine will be drained. If you select startd by class, a window appears which allows you to select classes to be drained.

Flush
   Terminates running jobs on this host and sends them back to the system queue to await redispatch. No new jobs are redispatched to this machine until resume is issued. Forces a checkpoint if jobs are enabled for checkpointing.
   Select   The machine on which you want to operate.
   Select   Admin from the Machines window.
   Select   flush

Suspend
   Suspends all jobs on this host.
   Select   The machine on which you want to operate.
   Select   Admin from the Machines window.
   Select   suspend
Resume
   Resumes all jobs on this machine.
   Select   The machine on which you want to operate.
   Select   Admin from the Machines window
   Select   resume
            A cascading menu allows you to select either daemons, Schedd, startd, or startd by class. If you select daemons, both the startd and the Schedd on the selected machine will be resumed. If you select Schedd, only the Schedd on the selected machine will be resumed. If you select startd, only the startd on the selected machine will be resumed. If you select startd by class, a window appears which allows you to select classes to be resumed.

Capture Data
   Collects information on the machines selected.
   Select   The machine on which you want to operate.
   Select   Admin from the Machines window.
   Select   Capture Data.

Collect Account Data
   Collects accounting data on the machines selected.
   Select   The machine on which you want to operate.
   Select   Admin from the Machines window.
   Select   Collect Account Data.
            A window appears prompting you to enter the name of the directory in which you want the collected data stored.

Collect Reservation Data
   Collects reservation data on the machines selected.
   Select   The machine on which you want to operate.
   Select   Admin from the Machines window.
   Select   Collect Reservation Data.
            A window appears prompting you to enter the name of the directory in which you want the collected data stored.

Create Account Report
   Creates an accounting report for you.
   Select   Admin → Create Account Report...
   Note: If you want to receive an extended accounting report, select the extended cascading button.
   A window appears prompting you to enter the following information:
   v A short, long, or extended version of the output. The short version is the default.
   v The user ID
   v The class name
   v The LoadL (LoadLeveler) group name
   v The UNIX group name
   v The Allocated host
   v The job ID
   v The report Type
   v The section
   v A start and end date for the report. If no date is specified, the default is to report all of the data in the report.
   v The name of the input data file.
   v The name of the output data file. This is the same as stdout.
   Press    OK
            The window closes and you return to the main window. The report appears in the Messages window if no output data file was specified.

Move Spool
   Moves the job records from the spool of one managing Schedd to another managing Schedd in the local cluster. This is intended for recovery purposes only.
   Select   One Schedd machine from the Machines window.
   Select   Admin from the Machines window.
   Select   Move Spool
            A window is displayed prompting you to enter the directory containing the job records to be moved.
   Press    OK

Version
   Displays version and release data for LoadLeveler on the machines selected in an information window.
   Select   The machine on which you want to operate.
   Select   Admin from the Machines window.
   Select   version

Fair Share Scheduling
   Provides fair share scheduling functions (see “llfs - Fair share scheduling queries and operations” on page 450).
   Select   Admin from the Machines window.
   Select   Fair Share Scheduling
            A cascading menu allows you to select one of the following:
   v Show
     Displays fair share scheduling information for all users or for specified users and groups.
   v Save historic data
     Saves fair share scheduling information into the directory specified.
   v Restore historic data
     Restores fair share scheduling data to a state corresponding to a file previously saved by Save historic data or the llfs -s command.
   v Reset historic data
     Erases all historic CPU data to reset fair share scheduling.
Part 3. Submitting and managing TWS LoadLeveler jobs

After an administrator installs IBM Tivoli Workload Scheduler (TWS) LoadLeveler and customizes the environment, general users can build and submit jobs to exploit the many features of the TWS LoadLeveler runtime environment.
Chapter 8. Building and submitting jobs

Learn more about building and submitting jobs. The topics listed in Table 40 will help you learn about building and submitting jobs:

Table 40. Learning about building and submitting jobs
v Creating and submitting serial and parallel jobs: Chapter 8, “Building and submitting jobs”
v Controlling and monitoring TWS LoadLeveler jobs: Chapter 9, “Managing submitted jobs,” on page 229
v Ways to control or monitor TWS LoadLeveler operations by using the TWS LoadLeveler commands, GUI, and APIs:
  – Chapter 16, “Commands,” on page 411
  – Chapter 10, “Example: Using commands to build, submit, and manage jobs,” on page 235
  – Chapter 11, “Using LoadLeveler’s GUI to build, submit, and manage jobs,” on page 237
  – Chapter 17, “Application programming interfaces (APIs),” on page 541

Table 41 lists the tasks that general users perform to run LoadLeveler jobs.

Table 41. Roadmap of user tasks for building and submitting jobs
v Building jobs:
  – “Building a job command file”
  – “Editing job command files” on page 185
  – “Defining resources for a job step” on page 185
  – “Working with coscheduled job steps” on page 187
  – “Using bulk data transfer” on page 188
  – “Preparing a job for checkpoint/restart” on page 190
  – “Preparing a job for preemption” on page 193
v Submitting jobs:
  – “Submitting a job command file” on page 193
  – “llsubmit - Submit a job” on page 531
v Working with parallel jobs: “Working with parallel jobs” on page 194
v Working with reserved node resources and the jobs that use them: “Working with reservations” on page 213
v Correctly specifying job command file keywords: Chapter 14, “Job command file reference,” on page 357

Building a job command file

Before you can submit a job or perform any other job-related tasks, you need to build a job command file.

A job command file describes the job you want to submit, and can include LoadLeveler keyword statements. For example, to specify a binary to be executed,
you can use the executable keyword, which is described later in this topic. To specify a shell script to be executed, the executable keyword can be used; if it is not used, LoadLeveler assumes that the job command file itself is the executable.

The job command file can include the following:
v LoadLeveler keyword statements: A keyword is a word that can appear in job command files. A keyword statement is a statement that begins with a LoadLeveler keyword. These keywords are described in “Job command file keyword descriptions” on page 359.
v Comment statements: You can use comments to document your job command files. You can add comment lines to the file as you would in a shell script.
v Shell command statements: If you use a shell script as the executable, the job command file can include shell commands.
v LoadLeveler variables: See “Job command file variables” on page 399 for more information.

You can build a job command file either by using the Build a Job window on the GUI or by using a text editor.

Using multiple steps in a job command file

To specify a stream of job steps, you need to list each job step in the job command file. You must specify one queue statement for each job step. Also, the executables for all job steps in the job command file must exist when you submit the job.

For most keywords, if you specify the keyword in a job step of a multi-step job, its value is inherited by all subsequent job steps. Exceptions to this are noted in the keyword description.

LoadLeveler treats all job steps as independent job steps unless you use the dependency keyword. If you use the dependency keyword, LoadLeveler determines whether a job step should run based upon the exit status of the previously run job step. For example, Figure 19 on page 181 contains two separate job steps. Notice that step1 is the first job step to run and that step2 is a job step that runs only if step1 exits with the correct exit status.
# This job command file lists two job steps called "step1"
# and "step2". "step2" only runs if "step1" completes
# with exit status = 0. Each job step requires a new
# queue statement.
#
# @ step_name = step1
# @ executable = executable1
# @ input = step1.in1
# @ output = step1.out1
# @ error = step1.err1
# @ queue
# @ dependency = (step1 == 0)
# @ step_name = step2
# @ executable = executable2
# @ input = step2.in1
# @ output = step2.out1
# @ error = step2.err1
# @ queue

Figure 19. Job command file with multiple steps

In Figure 19, step1 is called the sustaining job step. step2 is called the dependent job step because whether or not it begins to run is dependent upon the exit status of step1. A single sustaining job step can have more than one dependent job step, and a dependent job step can also have job steps dependent upon it.

In Figure 19, each job step has its own executable, input, output, and error statements. Your job steps can have their own separate statements, or they can use those statements defined in a previous job step. For example, in Figure 20, step2 uses the executable statement defined in step1:

# This job command file uses only one executable for
# both job steps.
#
# @ step_name = step1
# @ executable = executable1
# @ input = step1.in1
# @ output = step1.out1
# @ error = step1.err1
# @ queue
# @ dependency = (step1 == 0)
# @ step_name = step2
# @ input = step2.in1
# @ output = step2.out1
# @ error = step2.err1
# @ queue

Figure 20. Job command file with multiple steps and one executable

Examples: Job command files

These examples of job command files may apply to your situation.
v Example 1: Generating multiple jobs with varying outputs
  To run a program several times, varying the initial conditions each time, you could create multiple LoadLeveler scripts, each specifying a different input and output file as described in Figure 22 on page 183. It would probably be more convenient to prepare different input files and submit the job only once, letting LoadLeveler generate the output files and do the multiple submissions for you. Figure 21 on page 182 illustrates the following:
  – You can refer to the LoadLeveler name of your job symbolically, using $(jobid) and $(stepid) in the LoadLeveler script file.
  – $(jobid) refers to the job identifier.
  – $(stepid) refers to the job step identifier and increases after each queue command.
  Therefore, you only need to specify input, output, and error statements once to have LoadLeveler name these files correctly.
  Assume that you created five input files and each input file has different initial conditions for the program. The names of the input files are in the form longjob.in.x, where x is 0–4. Submitting the LoadLeveler script shown in Figure 21 results in your program running five times, each time with a different input file. LoadLeveler generates the output file names from the LoadLeveler job step IDs. This ensures that the results from the different submissions are not merged.

# @ executable = longjob
# @ input = longjob.in.$(stepid)
# @ output = longjob.out.$(jobid).$(stepid)
# @ error = longjob.err.$(jobid).$(stepid)
# @ queue
# @ queue
# @ queue
# @ queue
# @ queue

Figure 21. Job command file with varying input statements

  To submit the job, type the command:
    llsubmit longjob.cmd
  LoadLeveler responds by issuing the following:
    submit: The job "ll6.23" with 5 job steps has been submitted.
  Table 42 lists the standard input files, standard output files, and standard error files for the five job steps:

Table 42. Standard files for the five job steps
  Job Step   Standard Input   Standard Output    Standard Error
  ll6.23.0   longjob.in.0     longjob.out.23.0   longjob.err.23.0
  ll6.23.1   longjob.in.1     longjob.out.23.1   longjob.err.23.1
  ll6.23.2   longjob.in.2     longjob.out.23.2   longjob.err.23.2
  ll6.23.3   longjob.in.3     longjob.out.23.3   longjob.err.23.3
  ll6.23.4   longjob.in.4     longjob.out.23.4   longjob.err.23.4

v Example 2: Using LoadLeveler variables in a job command file
  Figure 22 on page 183 shows how you can use LoadLeveler variables in a job command file to assign different names to input and output files. This example assumes the following:
  – The name of the machine from which the job is submitted is lltest1
  – The user’s home directory is /u/rhclark and the current working directory is /u/rhclark/OSL
  – LoadLeveler assigns a value of 122 to $(jobid).
  In Job Step 0:
  – LoadLeveler creates the subdirectories oslsslv_out and oslsslv_err if they do not exist at the time the job step is started.
  In Job Step 1:
  – The character string ~rhclark denotes the home directory of user rhclark in the input, output, error, and executable statements.
  – The $(base_executable) variable is set to be the "base" portion of the executable, which is oslsslv.
  – The $(host) variable is equivalent to $(hostname). Similarly, $(jobid) and $(stepid) are equivalent to $(cluster) and $(process), respectively.
  In Job Step 2:
  – This job step is executed only if the return codes from Step 0 and Step 1 are both equal to zero.
  – The initial working directory for Step 2 is explicitly specified.

# Job step 0 ============================================================
# The names of the output and error files created by this job step are:
#
# output: /u/rhclark/OSL/oslsslv_out/lltest1.122.0.out
# error : /u/rhclark/OSL/oslsslv_err/lltest1_122_0_err
#
# @ job_name = OSL
# @ step_name = step_0
# @ executable = oslsslv
# @ arguments = -maxmin=min -scale=yes -alg=dual
# @ environment = OSL_ENV1=20000; OSL_ENV2=500000
# @ requirements = (Arch == "R6000") && (OpSys == "AIX53")
# @ input = test01.mps.$(stepid)
# @ output = $(executable)_out/$(host).$(jobid).$(stepid).out
# @ error = $(executable)_err/$(host)_$(jobid)_$(stepid)_err
# @ queue
#
# Job step 1 ============================================================
# The names of the output and error files created by this job step are:
#
# output: /u/rhclark/OSL/oslsslv_out/lltest1.122.1.out
# error : /u/rhclark/OSL/oslsslv_err/lltest1_122_1_err
#
# @ step_name = step_1
# @ executable = ~rhclark/$(job_name)/oslsslv
# @ arguments = -maxmin=max -scale=no -alg=primal
# @ environment = OSL_ENV1=60000; OSL_ENV2=500000; OSL_ENV3=70000; OSL_ENV4=800000;
# @ input = ~rhclark/$(job_name)/test01.mps.$(stepid)
# @ output = ~rhclark/$(job_name)/$(base_executable)_out/$(hostname).$(cluster).$(process).out
# @ error = ~rhclark/$(job_name)/$(base_executable)_err/$(hostname)_$(cluster)_$(process)_err
# @ queue
#
# Job step 2 ============================================================
# The names of the output and error files created by this job step are:
#
# output: /u/rhclark/OSL/oslsslv_out/lltest1.122.2.out
# error : /u/rhclark/OSL/oslsslv_err/lltest1_122_2_err
#
# @ step_name = OSL
# @ dependency = (step_0 == 0) && (step_1 == 0)
# @ comment = oslsslv
# @ initialdir = /u/rhclark/$(step_name)
# @ arguments = -maxmin=min -scale=yes -alg=dual
# @ environment = OSL_ENV1=300000; OSL_ENV2=500000
# @ input = test01.mps.$(stepid)
# @ output = $(comment)_out/$(host).$(jobid).$(stepid).out
# @ error = $(comment)_err/$(host)_$(jobid)_$(stepid)_err
# @ queue

Figure 22. Using LoadLeveler variables in a job command file

v Example 3: Using the job command file as the executable
  The name of the sample script shown in Figure 23 on page 185 is run_spice_job. This script illustrates the following:
  – The script does not contain the executable keyword. When you do not use this keyword, LoadLeveler assumes that the script is the executable. (Since the
  name of the script is run_spice_job, you can add the executable = run_spice_job statement to the script, but it is not necessary.)
  – The job consists of four job steps (there are 4 queue statements). The spice3f5 and spice2g6 programs are invoked at each job step using different input data files:
  - spice3f5: Input for this program is from the file spice3f5_input_x, where x has a value of 0, 1, and 2 for job steps 0, 1, and 2, respectively. The name of this file is passed as the first argument to the script. Standard output and standard error data generated by spice3f5 are directed to the file spice3f5_output_x. The name of this file is passed as the second argument to the script. In job step 3, the names of the input and output files are spice3f5_input_benchmark1 and spice3f5_output_benchmark1, respectively.
  - spice2g6: Input for this program is from the file spice2g6_input_x. Standard output and standard error data generated by spice2g6, together with all other standard output and standard error data generated by this script, are directed to the files spice_test_output_x and spice_test_error_x, respectively. In job step 3, the name of the input file is spice2g6_input_benchmark1. The standard output and standard error files are spice_test_output_benchmark1 and spice_test_error_benchmark1.
  All file names that are not fully qualified are relative to the initial working directory /home/loadl/spice.
  LoadLeveler will send job steps 0 and 1 of this job to a machine that has real memory of 64 MB or more for execution. Job step 2 most likely will be sent to a machine that has more than 128 MB of real memory and has the ESSL library installed, because these preferences have been stated using the LoadLeveler preferences keyword. LoadLeveler will send job step 3 to the machine ll5.pok.ibm.com for execution because of the explicit requirement for this machine in the requirements statement.
#!/bin/ksh
# @ job_name = spice_test
# @ account_no = 99999
# @ class = small
# @ arguments = spice3f5_input_$(stepid) spice3f5_output_$(stepid)
# @ input = spice2g6_input_$(stepid)
# @ output = $(job_name)_output_$(stepid)
# @ error = $(job_name)_error_$(stepid)
# @ initialdir = /home/loadl/spice
# @ requirements = ((Arch == "R6000") && (OpSys == "AIX53") && (Memory > 64))
# @ queue
# @ queue
# @ preferences = ((Memory > 128) && (Feature == "ESSL"))
# @ queue
# @ class = large
# @ arguments = spice3f5_input_benchmark1 spice3f5_output_benchmark1
# @ requirements = (Machine == "ll5.pok.ibm.com")
# @ input = spice2g6_input_benchmark1
# @ output = $(job_name)_output_benchmark1
# @ error = $(job_name)_error_benchmark1
# @ queue

OS_NAME=`uname`
case $OS_NAME in
  AIX)
    echo "Running $OS_NAME version of spice3f5" > $2
    AIX_bin/spice3f5 < $1 >> $2 2>&1
    echo "Running $OS_NAME version of spice2g6"
    AIX_bin/spice2g6
    ;;
  *)
    echo "spice3f5 for $OS_NAME is not available" > $2
    echo "spice2g6 for $OS_NAME is not available"
    ;;
esac

Figure 23. Job command file used as the executable

Editing job command files

After you build a job command file, you can edit it using the editor of your choice. You may want to change the name of the executable or add or delete some statements.

When you create a job command file, it is considered the job executable unless you specify otherwise by using the executable keyword in the job command file. LoadLeveler copies the executable to the spool directory unless the checkpoint keyword was set to yes or interval. Jobs that are to be checkpointed cannot be moved to the spool directory. Do not make any changes to the executable while the job is still in the queue; doing so could affect the way that job runs.

Defining resources for a job step

The LoadLeveler user may use the resources keyword in the job command file to specify the resources to be consumed by each task of a job step. If the resources keyword is specified in the job command file, it overrides any default_resources specified by the administrator for the job step's class.
For example, the following job requests one CPU and one FRM license for each of its tasks:

  resources = ConsumableCpus(1) FRMlicense(1)

If this were specified in a serial job step, one CPU and one FRM license would be consumed while the job step runs. If this were a parallel job step, then the number of CPUs and FRM licenses consumed while the job step runs would depend upon how many tasks were running on each machine. For more information on assigning tasks to nodes, see "Task-assignment considerations" on page 196.

Alternatively, you can use the node_resources keyword in the job command file to specify the resources to be consumed by the job step on each machine it runs on, regardless of the number of tasks assigned to each machine. If the node_resources keyword is specified in the job command file, it overrides the default_node_resources specified by the administrator for the job step's class.

For example, the following job requests 240 MB of ConsumableMemory on each machine:

  node_resources = ConsumableMemory(240 mb)

Even if one machine only runs one task of the job step, while other machines run multiple tasks, 240 MB will be consumed on every machine.

Submitting jobs requesting data staging

The dstg_in_script keyword causes LoadLeveler to generate an inbound data staging step, without requiring the #@queue specification. The value assigned to this keyword is the executable that will be started for data staging, along with any arguments needed by this script or executable.

The dstg_in_wall_clock_limit keyword specifies a wall clock time for the inbound data staging step. Specifying the estimated wall clock limit is mandatory when a data staging script is specified. Similarly, dstg_out_script and dstg_out_wall_clock_limit will be used for generation and execution of the outbound data staging step for the job. All data staging job steps are assigned to the predefined class called data_stage.

Resources required for data staging can be specified using the dstg_resources keyword.

The dstg_node keyword allows you to specify how data replicas must be created:
v If the value specified is any, one data staging task is executed on any available node in the cluster with data staging resources. This value can be used with either the at_submit or the just_in_time configuration options.
v If the value specified is master, one data staging task is executed on the master node. The master node is the machine that will be used to run the inbound and outbound data staging steps as well as the first application step of the job.
v If the value is all, a data staging task is executed on each of the nodes that will be or were used by the first application step.

Any environment variables needed by the data staging scripts can be specified using the dstg_environment keyword. The copy_all value can be assigned to this keyword to get all of the user's environment variables.
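As an illustration only, the following sketch shows how these keywords might fit together in one job command file; the script paths, wall clock limits, and application name are hypothetical, not taken from this manual:

  # Hypothetical sketch of a job with inbound and outbound data staging.
  # The staging script paths and limits below are illustrative.
  # @ job_type = serial
  # @ executable = my_app
  # @ dstg_in_script = /u/rhclark/bin/stage_in.sh /gpfs/project/input
  # @ dstg_in_wall_clock_limit = 00:15:00
  # @ dstg_out_script = /u/rhclark/bin/stage_out.sh /gpfs/project/output
  # @ dstg_out_wall_clock_limit = 00:15:00
  # @ dstg_node = master
  # @ dstg_environment = COPY_ALL
  # @ queue

Note that only the application step has a queue statement; the inbound and outbound data staging steps are generated by LoadLeveler from the dstg_in_script and dstg_out_script keywords and are assigned to the data_stage class.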
For detailed information about the data staging job command file keywords, see "Job command file keyword descriptions" on page 359.

Working with coscheduled job steps

LoadLeveler allows you to specify that a group of two or more steps within a job are to be coscheduled. Coscheduled steps are dispatched at the same time.

Submitting coscheduled job steps

The coschedule = yes keyword in the job command file is used to specify which steps within a job are to be coscheduled.

All steps within a job with the coschedule keyword set to yes will be coscheduled. The coscheduled steps will continue to be stored as individual steps in both memory and in the job queue, but when performing certain operations, such as scheduling, the steps will be managed as a single entity. An operation initiated on one of the coscheduled steps will cause the operation to be performed on all other steps (unless the coscheduling dependency between steps is broken).

Determining priority for coscheduled job steps

Coscheduled steps are supported only with the BACKFILL scheduler. The LoadLeveler BACKFILL scheduler will only dispatch the set of coscheduled steps when enough resource is available for all steps in the set to start. If the set of coscheduled steps cannot be started immediately, but enough resource will be available in the future, then the resource for all the steps will be reserved. In this case, only one of the coscheduled steps will be designated as a top dog, but enough resources will be reserved for all coscheduled steps and all the steps will be dispatched when the top dog step is started.

The coscheduled step with the highest priority in the current job queue will be designated as the primary coscheduled step, and all other steps will be secondary coscheduled steps. The primary coscheduled step will determine when the set of coscheduled steps will be scheduled. The priority for all other coscheduled steps is ignored.

Supporting preemption of coscheduled job steps

Preemption of coscheduled steps is supported.

Preemption of coscheduled steps is supported with the following restrictions:
v In order for a step S to be preemptable by a coscheduled step, all steps in the set of coscheduled steps must be able to preempt step S.
v In order for a step S to preempt a coscheduled step, all steps in the set of coscheduled steps must be preemptable by step S.
v The set of job steps available for preemption will be the same for all coscheduled steps. Any resource made available by preemption for one coscheduled step will be available to all other coscheduled steps.

To determine the preempt type and preempt method to use when a coscheduled step preempts another step, an order of precedence for preempt types and preempt methods has been defined. All steps in the preempting coscheduled step are examined, and the preempt type and preempt method having the highest precedence are used. The order of precedence for preempt type will be ALL and ENOUGH. The precedence order for preempt method is:
v Remove
v Vacate
v System Hold
v User Hold
v Suspend

For more information about preempt types and methods, see "Planning to preempt jobs" on page 128.

When coscheduled steps are running, if one step is preempted as a result of a system-initiated preemption, then all coscheduled steps are preempted.

When determining an optimal preempt set, the BACKFILL scheduler does not consider coscheduled steps as a single entity. All coscheduled steps are in the initial preempt set, but the final preempt set might not include all coscheduled steps, if the scheduler determines the resources of some coscheduled steps are not necessary to start the preempting job step. This implies that more resource than necessary might be preempted when a coscheduled step is in the set of steps to be preempted, because regardless of whether or not all coscheduled steps are in the preempt set, if one coscheduled step is preempted, then all coscheduled steps will be preempted.

Coscheduled job steps and commands and APIs

Commands and APIs that operate on job steps are impacted by coscheduled steps. For the llbind, llcancel, llhold, and llpreempt commands, even if all coscheduled steps are not in the list of targeted steps, the requested operation is performed on all coscheduled steps. For the llmkres and llchres commands, a coscheduled job step cannot be specified when using the -j or -f flags. For the llckpt command, you cannot specify a coscheduled job step using the -u flag.

Termination of coscheduled steps

If a coscheduled step is dispatched but cannot be started and is rejected by the startd daemon or the starter process, then all coscheduled steps are rejected. If a running step is removed or vacated by LoadLeveler as a result of a system related failure, then all coscheduled steps are removed or vacated. If a running step is vacated as a result of the VACATE expression evaluating to true for the step, then all coscheduled steps are vacated.
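To put the coschedule keyword in context, here is a minimal sketch of a job whose two steps are dispatched together; the step names, executables, and node counts are hypothetical:

  # Minimal sketch: both steps set coschedule = yes, so the BACKFILL
  # scheduler starts them only when resources for both are available.
  # @ job_type = parallel
  # @ step_name = compute
  # @ executable = compute_task
  # @ node = 4
  # @ coschedule = yes
  # @ queue
  # @ step_name = monitor
  # @ executable = monitor_task
  # @ node = 1
  # @ coschedule = yes
  # @ queue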
Using bulk data transfer

On systems with device drivers and network adapters that support remote direct-memory access (RDMA), LoadLeveler supports bulk data transfer for jobs that use either the Internet or user space communication protocol mode.

For jobs using the Internet protocol (IP jobs), LoadLeveler does not monitor or control the use of bulk transfer. For user space jobs that request bulk transfer, however, LoadLeveler creates a consumable RDMA resource requirement. Machines with Switch Network Interface for HPS network adapters are automatically given an RDMA consumable resource with an available amount of four. Machines with InfiniBand switch adapters are given unlimited RDMA consumable resources. Each step that requests bulk transfer consumes one RDMA resource on each machine on which that step runs.

The RDMA resource is similar to user-defined consumable resources except in one important way: a user-specified resource requirement is consumed by every task of the job assigned to a machine, whereas the RDMA resource is consumed once on a machine no matter how many tasks of the job are running on the machine. Other than that exception, LoadLeveler handles the RDMA resource as it does all other consumable resources.

LoadLeveler displays RDMA resources in the output of the following commands:
v llq -l
v llsummary -l

LoadLeveler also displays RDMA resources in the output of the following commands for machines with Switch Network Interface for HPS network adapters:
v llstatus -l
v llstatus -R

Bulk transfer is supported only on systems where the device driver of the network adapters supports RDMA. To determine which systems will support bulk transfer, use the llstatus command with the -l, -R, or -a flag to display machines with adapters that support RDMA. Machines with Switch Network Interface for HPS network adapters will have an RDMA resource listed in the command output of llstatus -l and llstatus -R. The llstatus -a command displays the adapters list, which can be used to verify whether InfiniBand adapters are connected to the machines.

Under certain conditions, LoadLeveler displays a total count of RDMA resources as less than four for machines with Switch Network Interface for HPS network adapters:
v If jobs that LoadLeveler does not manage use RDMA, the amount of available RDMA resource reported to the Negotiator is reduced by the amount consumed by the unmanaged jobs.
v In rare situations, LoadLeveler jobs can fail to release their adapter resources before reporting to the Negotiator that they have completed. When this occurs, the amount of available RDMA reported to the Negotiator is reduced by the amount consumed by the unreleased adapter resources. When the adapter resources are eventually released, the RDMA resource they consumed becomes available again.

These conditions do not require corrective action.

You do not need to perform specific job-definition tasks to enable bulk transfer for LoadLeveler jobs that use the IP network protocol. LoadLeveler cannot affect whether IP communication uses bulk transfer; the implementation of IP where the job runs determines whether bulk transfer is supported.

To enable user space jobs to use bulk data transfer, however, all of the following tasks must be completed. If you omit one or more of these steps, the job will run but will not be able to use bulk transfer.
v A LoadLeveler administrator must update the LoadLeveler configuration file to include the value RDMA in the SCHEDULE_BY_RESOURCES list for machines with Switch Network Interfaces for HPS network adapters. It is not required to include RDMA in the SCHEDULE_BY_RESOURCES list for machines with InfiniBand network adapters. Example:
  SCHEDULE_BY_RESOURCES = RDMA others
v Users must request bulk transfer for their LoadLeveler jobs, using one of the following methods:
  – Specifying the bulkxfer keyword in the LoadLeveler job command file. Example:
    #@ bulkxfer=yes
    If users specify this keyword for jobs that use the IP communication protocol, LoadLeveler ignores the bulkxfer keyword.
  – Specifying a POE command line parameter on interactive jobs. Example:
    poe_job -use_bulk_xfer=yes
  – Specifying an environment variable on interactive jobs. Example:
    export MP_USE_BULK_XFER=yes
    poe_job
v Because LoadLeveler honors the bulk transfer request only for LAPI or MPI jobs, users must ensure that the network keyword in the job command file specifies the MPI, LAPI, or MPI_LAPI protocol for user space communication. Examples:
  network.MPI = sn_single,not_shared,US,HIGH
  network.MPI_LAPI = sn_single,not_shared,US,HIGH
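Combining these tasks, a batch user space job that requests bulk transfer might look like the following sketch; the class name, node counts, and program path are illustrative, not taken from this manual:

  # Illustrative sketch of a user space job requesting bulk data transfer.
  # Assumes the administrator has already added RDMA to the
  # SCHEDULE_BY_RESOURCES list where required.
  # @ job_type = parallel
  # @ class = POE
  # @ node = 2
  # @ tasks_per_node = 2
  # @ network.MPI = sn_single,not_shared,US,HIGH
  # @ bulkxfer = yes
  # @ executable = /usr/bin/poe
  # @ arguments = /u/rhclark/my_mpi_program
  # @ queue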
Preparing a job for checkpoint/restart

You can checkpoint your entire job step, and allow a job step to restart from the last checkpoint.

LoadLeveler has the ability to checkpoint your entire job step, and to allow a job step to restart from the last checkpoint. When a job step is checkpointed, the entire state of each process of that job step is saved by the operating system. On AIX, this checkpoint capability is built into the base operating system. Use the information in Table 43 on page 191 to correctly configure your job for checkpointing.

Table 43. Checkpoint configurations

To specify that: Your job is checkpointable
Do this:
  v Add either one of the following two options to your job command file:
    1. checkpoint = yes
       This enables your job to checkpoint in any of the following ways:
       – The application can initiate the checkpoint. This is only available on AIX.
       – Checkpoint from a program which invokes the ll_ckpt API.
       – Checkpoint using the llckpt command.
       – As the result of a flush command.
    OR
    2. checkpoint = interval
       This enables your job to checkpoint in any of the following ways:
       – The application can initiate the checkpoint. This is only available on AIX.
       – Checkpoint from a program which invokes the ll_ckpt API.
       – Checkpoint using the llckpt command.
       – Checkpoint automatically taken by LoadLeveler.
       – As the result of a flush command.
  v If you would like your job to checkpoint itself, use the API ll_init_ckpt in your serial application, or mpc_init_ckpt for parallel jobs, to cause the checkpoint to occur. This is only available on AIX.

To specify that: Your job step's executable is to be copied to the execute node
Do this:
  Add the ckpt_execute_dir keyword to the job command file.
Table 43. Checkpoint configurations (continued)

To specify that: LoadLeveler automatically checkpoints your job at preset intervals
Do this:
  1. Add the following option to your job command file:
     checkpoint = interval
     This enables your job to checkpoint in any of the following ways:
     v Checkpoint automatically at preset intervals
     v Checkpoint initiated from the user application. This is only available on AIX.
     v Checkpoint from a program which invokes the ll_ckpt API
     v Checkpoint using the llckpt command
     v As the result of a flush command
  2. The system administrators must set the following two keywords in the configuration file to specify how often LoadLeveler should take a checkpoint of the job. These two keywords are:
     MIN_CKPT_INTERVAL = number
       Where number specifies the initial period, in seconds, between checkpoints taken for running jobs.
     MAX_CKPT_INTERVAL = number
       Where number specifies the maximum period, in seconds, between checkpoints taken for running jobs.
     The time between checkpoints will be increased after each checkpoint within these limits as follows:
     v The first checkpoint is taken after a period of time equal to MIN_CKPT_INTERVAL has passed.
     v The second checkpoint is taken after LoadLeveler waits twice as long (MIN_CKPT_INTERVAL x 2).
     v The third checkpoint is taken after LoadLeveler waits twice as long again (MIN_CKPT_INTERVAL x 4).
     LoadLeveler continues to double this period until the value of MAX_CKPT_INTERVAL has been reached, where it stays for the remainder of the job.
     A minimum value of 900 (15 minutes) and a maximum value of 7200 (2 hours) are the defaults. You can set these keyword values globally in the global configuration file so that all machines in the cluster have the same value, or you can specify a different value for each machine by modifying the local configuration files.

To specify that: Your job will not be checkpointed
Do this:
  Add the following option to your job command file:
  checkpoint = no
  This will disable checkpointing.
Table 43. Checkpoint configurations (continued)

To specify that: Your job has successfully checkpointed and terminated. The job has left the LoadLeveler job queue, and you want LoadLeveler to restart your executable from an existing checkpoint file.
Do this:
  1. Add the following option to your job command file:
     restart_from_ckpt = yes
  2. On AIX, specify the name of the checkpoint file by setting the following job command file keywords to specify the directory and file name of the checkpoint file to be used:
     v ckpt_dir
     v ckpt_file
  When the job command file is submitted, a new job will be started that uses the specified checkpoint file to restart the previously checkpointed job.
  The job command file which was used to submit the original job should be used to restart from checkpoint. The only modifications to this file should be the addition of restart_from_ckpt = yes and ensuring that ckpt_dir and ckpt_file point to the appropriate checkpoint file.

To specify that: Your job has successfully checkpointed. The job has been vacated and remains on the LoadLeveler job queue.
Do this:
  When the job restarts, if a checkpoint file is available, the job will be restarted from that file.
  If a checkpoint file is not available upon restart, the job will be started from the beginning.
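For example, a resubmitted job command file for the first case might look like this sketch; the directory and file names are illustrative, not taken from this manual:

  # Illustrative sketch: restart a previously checkpointed serial job.
  # Only restart_from_ckpt was added to the original job command file,
  # and ckpt_dir/ckpt_file point at the existing checkpoint.
  # @ job_type = serial
  # @ executable = longjob
  # @ checkpoint = yes
  # @ ckpt_dir = /u/rhclark/ckpt
  # @ ckpt_file = longjob.ckpt
  # @ restart_from_ckpt = yes
  # @ queue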
Preparing a job for preemption

Depending on various configuration options, LoadLeveler may preempt your job so that a higher priority job step can run. Administrators may:
v Configure LoadLeveler or external schedulers to preempt jobs through various methods.
v Specify preemption rules for job classes.
v Manually preempt your job using LoadLeveler interfaces.

To ensure that your job can be resumed after preemption, set the restart keyword in the job command file to yes.

Submitting a job command file

After building a job command file, you can submit it for processing either to a machine in the LoadLeveler cluster or to one outside of the cluster. See "Querying multiple LoadLeveler clusters" on page 71 for information on submitting a job to a machine outside the cluster.

You can submit a job command file either by using the GUI or the llsubmit command. When you submit a job, LoadLeveler assigns a job identifier and one or more step identifiers.

The LoadLeveler job identifier consists of the following:

machine name
  The name of the machine which assigned the job identifier.
jobid
  A number given to a group of job steps that were initiated from the same job command file.

The LoadLeveler step identifier consists of the following:

job identifier
  The job identifier.
stepid
  A number that is unique for every job step in the job you submit.

If a job command file contains multiple job steps, every job step will have the same jobid and a unique stepid.

For an example of submitting a job, see Chapter 10, "Example: Using commands to build, submit, and manage jobs," on page 235.

In a multicluster environment, job and step identifiers are assigned by the local cluster and are retained by the job regardless of what cluster the job runs in.

Submitting a job using a submit-only machine

You can submit jobs from submit-only machines. Submit-only machines allow machines that do not run LoadLeveler daemons to submit jobs to the cluster. You can submit a job using either the submit-only version of the GUI or the llsubmit command. To install submit-only LoadLeveler, follow the procedure in the TWS LoadLeveler: Installation Guide.

In addition to allowing you to submit jobs, the submit-only feature allows you to cancel and query jobs from a submit-only machine.

Working with parallel jobs

LoadLeveler allows you to schedule parallel batch jobs.

LoadLeveler allows you to schedule parallel batch jobs that have been written using the following:
v On AIX and Linux:
  – IBM Parallel Environment (PE)
  – MPICH, which is an open-source, portable implementation of the Message-Passing Interface Standard developed by Argonne National Laboratory
  – MPICH-GM, which is a port of MPICH on top of Myrinet GM code
v On Linux:
  – MVAPICH, which is a high performance implementation of MPI-1 over InfiniBand, based on MPICH

Support for PE is available in this release of LoadLeveler for Linux.
Step for controlling whether LoadLeveler copies environment variables to all executing nodes

You may specify that LoadLeveler is to copy the environment variables that are specified in the environment job command file statement for a parallel job either to all executing nodes or to only the master executing node.

Before you begin: You need to know:
v Whether Parallel Environment (PE) will be used to run the parallel job; if so, then LoadLeveler does not have to copy the application environment to the executing nodes.
v How to correctly specify the env_copy keyword. For information about keyword syntax and other details, see the env_copy keyword description.

To specify whether LoadLeveler is to copy environment variables to only the master node, or to all executing nodes, use the #@ env_copy keyword in the job command file.

Ensuring that parallel jobs in a cluster run on the correct levels of PE and LoadLeveler software

If support for parallel POE jobs is required, users must be aware that when LoadLeveler uses Parallel Environment for parallel job submission, the PE software requires the same level of PE to be used throughout the parallel job. Different levels of PE cannot be mixed. For example, PE 5.1 supports only LoadLeveler 3.5, and PE 4.3 supports only LoadLeveler 3.4.3. Therefore, a POE parallel job cannot run some of its tasks on LoadLeveler 3.4.3 machines and the remaining tasks on LoadLeveler 3.5 machines.

The requirements keyword of the job command file can be used to ensure that all the tasks of a POE job run on compatible levels of PE and LoadLeveler software in a cluster. Here are three examples showing different ways this can be done:

1. If the following requirements statement is included in the job command file, LoadLeveler's central manager will select only 3.5 or higher machines with the appropriate OpSys level for this job step.
   # @ requirements = (LL_Version >= "3.5") && (OpSys == "AIX53")
2. If a requirements statement such as the following is specified, the tasks of a POE job will see a consistent environment when "hostname1" and "hostname2" run the same levels of PE and LoadLeveler software.
   # @ requirements = (Machine == { "hostname1" "hostname2" }) && (OpSys == "AIX53")
3. If the mixed cluster has been partitioned into 3.4.3 and 3.5 LoadLeveler pools, then you may use a requirements statement similar to one of the two following statements to select machines running the same levels of software.
   v # @ requirements = (Pool == 35) && (OpSys == "AIX53")
   v # @ requirements = (Pool == 343) && (OpSys == "AIX53")
   Here, it is assumed that all the 3.4.3 machines in this mixed cluster are assigned to pool 343 and all 3.5 machines are assigned to pool 35. A LoadLeveler administrator can use the pool_list keyword of the machine stanza of the LoadLeveler administration file to assign machines to pools.

If a statement such as # @ executable = /bin/poe is specified in a job command file, and if the job is intended to be run on 3.5 machines, then it is important that the job be submitted from a 3.5 machine. When the executable keyword is used, LoadLeveler will copy the associated binary on the submitting machine and send it
to a running machine for execution. In this example, the POE program will fail if the submitting and the running machines are at different software levels. In a mixed cluster, this problem can be circumvented by not using the executable keyword in the job command file. By omitting this keyword, the job command file itself is the shell script that will be executed. If this script invokes a local version of the POE binary, then there is no compatibility problem at run time.

Task-assignment considerations

You can use keywords to specify how LoadLeveler assigns tasks to nodes.

You can use the keywords listed in Table 44 to specify how LoadLeveler assigns tasks to nodes. With the exception of unlimited blocking, each of these methods prioritizes machines in an order based on their MACHPRIO expressions. Various task assignment keywords can be used in combination, while others are mutually exclusive.

Table 44. Valid combinations of task assignment keywords are listed in each column

Keyword               1    2    3    4    5
total_tasks           X    X
tasks_per_node                  X    X
node = <min, max>                    X
node = <number>       X         X
task_geometry                             X
blocking                   X

The following examples show how each allocation method works. For each example, consider a 3-node SP with machines named "N1," "N2," and "N3". The machines' order of priority, according to the values of their MACHPRIO expressions, is: N1, N2, N3. N1 has 4 initiators available, N2 has 6, and N3 has 8.

node and total_tasks

When you specify the node keyword with the total_tasks keyword, the assignment function will allocate all of the tasks in the job step evenly among however many nodes you have specified. If the number of total_tasks is not evenly divisible by the number of nodes, then the assignment function will assign any larger groups to the first nodes on the list that can accept them. In this example, 14 tasks must be allocated among 3 nodes:

  # @ node = 3
  # @ total_tasks = 14

Table 45 shows the machine, available initiators, and assigned tasks:

Table 45. node and total_tasks
Machine    Available Initiators    Assigned Tasks
N1         4                       4
N2         6                       5
N3         8                       5

The assignment function divides the 14 tasks into groups of 5, 5, and 4, and begins at the top of the list, to assign the first group of 5. The assignment function starts
at N1, but because there are only 4 available initiators, it cannot assign a block of 5 tasks. Instead, the function moves down the list and assigns the two groups of 5 to N2 and N3; the assignment function then goes back and assigns the group of 4 tasks to N1.

node and tasks_per_node

When you specify the node keyword with the tasks_per_node keyword, the assignment function will assign tasks in groups of the specified value among the specified number of nodes.

  # @ node = 3
  # @ tasks_per_node = 4

blocking

When you specify blocking, tasks are allocated to machines in groups (blocks) of the specified number (the blocking factor). The assignment function will assign one block at a time to the machine which is next in the order of priority until all of the tasks have been assigned. If the total number of tasks is not evenly divisible by the blocking factor, the remainder of tasks is allocated to a single node. The blocking keyword must be specified with the total_tasks keyword. For example:

  # @ blocking = 4
  # @ total_tasks = 17

Here, blocking specifies that a job's tasks will be assigned in blocks, and 4 designates the size of the blocks. Table 46 shows how a blocking factor of 4 would work with 17 tasks:

Table 46. Blocking
Machine    Available Initiators    Assigned Tasks
N1         4                       4
N2         6                       5
N3         8                       8

The assignment function first determines that there will be 4 blocks of 4 tasks, with a remainder of one task. Therefore, the function will allocate the remainder with the first block that it can. N1 gets a block of four tasks, N2 gets a block plus the remainder, then N3 gets a block. The assignment function begins again at the top of the priority list, and N3 is the only node with enough initiators available, so N3 ends up with the last block.

unlimited blocking

When you specify unlimited blocking, the assignment function will allocate as many tasks as possible to each node; the function prioritizes nodes primarily by how many initiators each node has available, and secondarily on their MACHPRIO expressions. This method allows you to allocate tasks among as few nodes as possible. To specify unlimited blocking, specify "unlimited" as the value for the blocking keyword. The total_tasks keyword must also be specified with unlimited blocking. For example:

  # @ blocking = unlimited
  # @ total_tasks = 17

Table 47 on page 198 lists the machine, available initiators, and assigned tasks for unlimited blocking:
Table 47. Unlimited blocking
Machine    Available Initiators    Assigned Tasks
N3         8                       8
N2         6                       6
N1         4                       3

The assignment function begins with N3 (because N3 has the most initiators available) and assigns 8 tasks; N2 takes six, and N1 takes the remaining 3.

task_geometry

The task_geometry keyword allows you to specify which tasks run together on the same machines, although you cannot specify which machines. In this example, the task_geometry keyword groups 7 tasks to run on 3 nodes:

  # @ task_geometry = {(5,2)(1,3)(4,6,0)}

The entire task_geometry expression must be enclosed within braces. The task IDs for each node must be enclosed within parentheses and must be separated by commas. The entire range of task IDs that you specify must begin with zero, and must end with the task ID which is one less than the total number of tasks. You can specify the task IDs in any order, but you cannot skip numbers (the range of task IDs must be complete). Commas may only appear between task IDs, and spaces may only appear between nodes and task IDs.

Submitting jobs that use striping

When communication between parallel tasks occurs only over a single device such as en0, the application and the device are gated by each other. The device must wait for the application to fill a communication buffer before it transmits the buffer, and the application must wait for the device to transmit and empty the buffer before it can refill the buffer. Thus the application and the device must wait for each other, and this wastes time.

The technique of striping refers to using two or more communication paths to implement a single communication path as perceived by the application. As the application sends data, it fills up a buffer on one device. As that buffer is transmitted over the first device, the application's data begins filling up a second buffer and the application perceives no delay in being able to write. When the second buffer is full, it begins transmission over the second device and the application moves on to the next device. When all devices have been used, the application returns to the first device. Much, if not all, of the buffer on the first device has been transmitted while the application wrote to the buffers on the other devices, so the application waits for a minimal amount of time or possibly does not wait at all.

LoadLeveler supports striping in two ways. When multiple switch planes or networks are present, striping over them is indicated by requesting sn_all (multiple networks). If multiple adapters are present on the same network and the communication subsystem, such as LAPI, supports striping over multiple adapters on the same network, specifying the instances keyword on the network statement requests striping over adapters on the same network. The instances keyword specifies the number of adapters on a single network to stripe on. It is possible to stripe over
multiple networks and over multiple adapters on each network by specifying both sn_all and a value for instances greater than one. For HPS adapters, only machines that are connected to both networks are considered for sn_all jobs.

v User space striping: When sn_all is specified on a network statement with US mode, LoadLeveler commits an equivalent set of adapter resources (adapter windows and memory) on each of the networks present in the system to the job on each node where the job runs. The communication subsystem is initialized to indicate that it should use the user space communication protocol on all the available switch adapters to service communication requests on behalf of the application.

v IP striping: When the sn_all device is specified on a network statement with the IP mode, LoadLeveler attempts to locate the striped IP address associated with the switch adapters, known as the multi-link address. If it is successful, it passes the multi-link address to POE for use. If multi-link addresses are not available, LoadLeveler instructs POE to use the IP address of one of the switch adapters. The IP address that is used is different each time a choice has to be made, in an attempt to balance the adapter use. Multi-link addresses must be configured on the system prior to running LoadLeveler, and they are specified with the multilink_address keyword on the switch adapter stanza in the administration file. If a multi-link address is specified for a node, LoadLeveler assigns the multi-link address and multi-link IP name to the striping adapter on that node. If a multi-link address is not present on a node, the sn_all adapter associated with the node will not have an IP address or IP name. If only some of the nodes of a system have multi-link addresses, LoadLeveler will only dispatch jobs that request IP striping to nodes that have multi-link addresses.

  Jobs that request striping (both user space and IP) can be submitted to nodes with only one switch adapter. In that situation, the result is the same as if the job requested no striping.

  Note: When configured, a multi-link address is associated with the virtual ml0 device. The IP address of this device is the multi-link address. The llextRPD program will create a stanza for the ml0 device that will appear similar to Ethernet or token ring adapter stanzas, except that it will include the multilink_list keyword that lists the adapters it performs striping over. As with any other device with an IP address, the ml0 device can be requested in IP mode on the network statement. Doing so would yield a comparable effect to requesting sn_all IP, except that no checking would be performed by LoadLeveler to ensure the associated adapters are actually working. Thus it would be possible to dispatch a job that requested communication over ml0 only to have the job fail because the switch adapters that ml0 stripes over were down.

v Striping over one network: If the instances keyword is specified on a network statement with a value greater than one, LoadLeveler allocates multiple sets of resources for the protocol, using as many sets as the instances keyword specified. For User Space jobs, these sets are adapter windows and memory. For IP jobs, these sets are IP addresses. If multiple adapters exist on each node on the same network, then these sets of adapter resources will be distributed among all the available adapters on the same network.
  Even though LoadLeveler will allocate resources to support striping over a single network, the communication subsystem must be capable of exploiting these resources in order for them to be used.

Understanding striping over multiple networks

Striping over multiple networks involves establishing a communication path using one or more of the available communication networks or switch fabrics.
How those paths are established depends on the network adapter that is present. For the SP Switch2 family of adapters, it is not necessary to acquire communication paths among all tasks on all fabrics as long as there is at least one fabric over which all tasks can communicate. However, each adapter on a machine, if it is available, must use exactly the same adapter resources (window and memory amount) as the other adapters on that machine. Switch Network Interface for HPS adapters are not required to use exactly the same resources on each network, but in order for a machine to be selected, there must be an available communication path on all networks.

Figure 24. Striping over multiple networks. (The figure shows four nodes, each with Adapter A connected to Network A and Adapter B connected to Network B. The connections to Network A on Node 1 and Node 4 are at fault, and the connection to Network B on Node 3 is at fault.)

Consider these sample scenarios using the network configuration as shown in Figure 24, where the adapters are from the SP Switch2 family:
v If a three node job requests striping over networks, it will be dispatched to Node 1, Node 2 and Node 4, where it can communicate on Network B, as long as the adapters on each machine have a common window free and sufficient memory available. It cannot run on Node 3 because that node only has a common communication path with Node 2, namely Network A.
v If a three node job does not request striping, it will not be run because there are not enough adapters connected to Network A to run the job. Notice that the adapter connected to Network A on Node 1 and the adapter connected to Network A on Node 4 are both at fault. SP Switch2 family adapters can only use the adapter connected to Network A for non-striped communication.
v If a three node job requests striped IP and some but not all of the nodes have multi-link addresses, the job will only be dispatched to the nodes that have the multi-link addresses.

Consider these sample scenarios using the network configuration as shown in Figure 24 on page 200, where the adapters are Switch Network Interface for HPS adapters:
v If a three node job requests striping over networks, it will not be dispatched because there are not three nodes that have active connections to both networks.
v If a three node job does not request striping, it can be run on Node 1, Node 2, and Node 4 because they have an active connection to Network B.
v If a three node job requests striped IP and some but not all of the nodes have multi-link addresses, the job will only be dispatched to the nodes that have the multi-link addresses.

Note that for all adapter types, adapters are allocated to a step that requests striping based on what the node knows is the available set of networks or fabrics. LoadLeveler expects each node to have the same knowledge about available networks. If this is not true, it is possible for tasks of a step to be assigned adapters which cannot communicate with tasks on other nodes. Similarly, LoadLeveler expects all adapters that are identified as being on the same network ID or fabric ID to be able to communicate with each other. If this is not true, such as when LoadLeveler operates with multiple, independent sets of networks, other attributes of the step, such as the requirements expression, must be used to ensure that only nodes from a single network set are considered for the step.

As you can see from these scenarios, LoadLeveler will find enough nodes on the same communication path to run the job. If enough nodes connected to a common communication path cannot be found, no communication can take place and the job will not run.

Understanding striping over a single network

Striping over a single network is only supported by Switch Network Interface for HPS adapters. Figure 25 on page 202 shows a network configuration where the adapters support striping over a single network.
Figure 25. Striping over a single network. (The figure shows three nodes whose Adapter A and Adapter B are both connected to Network 0. Concentric ovals labeled instance 0, instance 1, and instance 2 represent the separate communication paths created on the network; on Node 3, the connection from Adapter B is at fault.)

Both Adapter A and Adapter B on a node are connected to Network 0. The entire oval represents the physical network, and the concentric ovals (shaded differently) represent the separate communication paths created for a job by the instances keyword on the network statement.

In this case a three node job requests two instances for communication. On Node 1, Adapter A is used for instance 0 and Adapter B is used for instance 1. There is no requirement to use the same adapter for the same instance, so on Node 2, Adapter B was used for instance 0 and Adapter A for instance 1. On Node 3, where a fault is keeping Adapter B from connecting to the network, Adapter A is used for both instance 0 and instance 1, and Node 3 is available for the job to use.

The network itself does not impose any limitation on the total number of communication paths that can be active at a given time for either a single job or all the jobs using the network. As long as nodes with adapter resources are available, additional communication paths can be created.

Examples: Requesting striping in network statements

You request that a job be run using striping with the network statement in your job command file.

When instances is not specified for a job in the network statement, the default is controlled by the class stanza keyword for sn_all. For more information on the network and max_protocol_instances statements, see the keyword descriptions in "Job command file keyword descriptions" on page 359.

Shown here are examples of IP and user space network modes:

v Example 1: Requesting striping using IP mode
  To submit a job using IP striping, your network statement would look like this:
  network.MPI = sn_all,,IP

v Example 2: Requesting striping using user space mode
  To submit a job using user space striping, your network statement would look like this:
  network.MPI = sn_all,,US

v Example 3: Requesting striping over a single network
  To request IP striping over multiple adapters on a single network, the network statement would look like this:
  network.MPI = sn_single,,IP,,instances=2
  If the nodes on which the job runs have two or more adapters on the same network, two different IP addresses will be allocated to each task for MPI communication. If only one adapter exists per network, the same IP address will be used twice for each task for MPI communication.

v Example 4: Requesting striping over multiple networks and multiple adapters on the same network
  To submit a user space job that will stripe MPI communication over multiple adapters on all networks present in the system, the network statement would look like this:
  network.MPI = sn_all,,US,,instances=2
  If, on a node where the job runs, there are two adapters on each of the two networks, one adapter window would be allocated from each adapter for MPI communication by the job. If only one network were present with two adapters, one adapter window from each of the two adapters would be used. If two networks were present but each only had one adapter on it, two adapter windows from each adapter would be used to satisfy the request for two instances.

Running interactive POE jobs

POE will accept LoadLeveler job command files. However, you can still set the following environment variables to define specific LoadLeveler job attributes before running an interactive POE job:

LOADL_ACCOUNT_NO
  The account number associated with the job.
LOADL_INTERACTIVE_CLASS
  The class to which the job is assigned.
MP_TASK_AFFINITY
  The affinity preferences requested for the job.

For information on other POE environment variables, see IBM Parallel Environment for AIX and Linux: Operation and Use, Volume 1.

For an interactive POE job, LoadLeveler does not start the POE process; therefore, LoadLeveler has no control over the process environment or resource limits.

You also may run interactive POE jobs under a reservation. For additional details about reservations and submitting jobs to run under them, see "Working with reservations" on page 213.

Interactive POE jobs cannot be submitted to a remote cluster.
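For example, an interactive POE session might be prepared from the shell as in this sketch; the account number, class name, affinity value, and program invocation are all illustrative, not taken from this manual:

  # Illustrative shell session for an interactive POE job.
  export LOADL_ACCOUNT_NO=99999          # account number for the job
  export LOADL_INTERACTIVE_CLASS=inter   # class the job is assigned to
  export MP_TASK_AFFINITY=core           # affinity preference (value is illustrative)
  poe ./my_program -procs 4 -euilib us   # run the program interactively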
Running MPICH, MVAPICH, and MPICH-GM jobs

LoadLeveler for AIX and LoadLeveler for Linux support three open-source implementations of the Message-Passing Interface (MPI).

MPICH is an open-source, portable implementation of the MPI Standard developed by Argonne National Laboratory. It contains a complete implementation of version 1.2 of the MPI Standard and also significant parts of MPI-2, particularly in the area of parallel I/O. MPICH, MVAPICH, and MPICH-GM are the three MPI implementations supported by LoadLeveler for AIX and LoadLeveler for Linux:
v Additional documentation for MPICH is available from the Argonne National Laboratory web site at: http://www-unix.mcs.anl.gov/mpi/mpich1/
v MVAPICH is a high performance implementation of MPI-1 over InfiniBand based on MPICH. Additional documentation for MVAPICH is available at the Ohio State University web site at: http://mvapich.cse.ohio-state.edu/
v MPICH-GM is a port of MPICH on top of GM (ch_gm). GM is a low-level message-passing system for Myrinet networks. Additional documentation for MPICH-GM is available from the Myrinet web site at: http://www.myri.com/scs/

For MPICH, MVAPICH, or MPICH-GM, LoadLeveler allocates the machines to run the parallel job and starts the implementation-specific script as the master task. Some of the options of the implementation-specific scripts might not be required or are not supported when used with LoadLeveler.

The following standard mpirun script options are not supported:

-map <list>
  The mpirun script can take either a machinefile or a mapping of the machines on which to run the mpirun job. If both the machinefile and map are specified, then the map list overrides the machinefile. Because we want LoadLeveler to decide which nodes to run on, use the machinefile specified by the environment variable LOADL_HOSTFILE. Specifying a mapping of the host names is not supported.

-allcpus
  This option is only supported when the -machinefile option is used. The mpirun script will run the job using all machines specified in the machine file, without the need to specify the -np option. Without specifying machinefile, the mpirun script will look in the default machines.<arch> file to find the machines on which to run the job. The machines defined in the default file might not match what LoadLeveler has selected, which will cause the job to be removed.

-exclude <list>
  This option is not supported because if you specified a machine in the exclude list that has already been scheduled by LoadLeveler to run the job, the job will be removed.

-dbg
  This option is used to select a debugger to be used with the mpirun script. LoadLeveler currently does not support running interactive MPICH jobs, so starting mpirun jobs under a debugger is not supported.
-ksq
  This option keeps the send queue. It is useful if you expect later to attach TotalView to the running (or deadlocked) job and want to see the send queues. This option is used for debugging purposes when attaching the mpirun job to TotalView. Because we do not support running debuggers under LoadLeveler MPICH job management, this option is not supported.

-machinedir <directory>
  This option looks for the machine files in the indicated directory. LoadLeveler will create a machinefile that contains the host name for each task in the mpirun job. The environment variable LOADL_HOSTFILE contains the full path to the machinefile. A different machinefile is created per job and stored in the LoadLeveler execute directory. Because there might be multiple jobs running at one time, we do not want the mpirun script to choose any file in the execute directory, because it might not be the correct file that the central manager has assigned to the job step. This option is therefore not supported; use the -machinefile option instead.

v When using MPICH, the mpirun script is run on the first machine allocated to the job. The mpirun script starts the actual execution of the parallel tasks on the other nodes included in the LoadLeveler cluster using llspawn.stdio as RSHCOMMAND. The following option of MPICH's mpirun script is not supported:

  -nolocal
    This option specifies not to run on the local machine. The default behavior of MPICH (p4) is that the first MPI process is always spawned on the machine which mpirun has invoked. The -nolocal option disables the default behavior and does not run the MPI process on the local node. Under LoadLeveler's MPICH job management, it is required that at least one task run on the local node, so the -nolocal option should not be used.

v When using MVAPICH, the mpirun_rsh command is run on the first machine allocated to the job as the master task. The mpirun_rsh command starts the actual execution of parallel tasks on the other nodes included in the LoadLeveler cluster using llspawn as RSHCOMMAND. The following options of MVAPICH's mpirun_rsh command are not supported when used with LoadLeveler:

  -rsh
    Specifies to use rsh for connecting.
  -ssh
    Specifies to use ssh for connecting.
    The -rsh and -ssh options are accepted, but the behavior has been changed to run mpirun_rsh jobs under the LoadLeveler MPICH job manager. Replace the -rsh and -ssh commands with llspawn before compiling mpirun_rsh. Even if you select -rsh or -ssh, the llspawn command is actually used in place of rsh or ssh at runtime.
  -xterm
    Runs remote processes under xterm. This option starts an xterm window for each task in the mpirun job and runs the remote shell with the application inside the xterm window. This will not work under LoadLeveler because the llspawn command replaces the remote shell (rsh or ssh) and llspawn is not kept alive to the end of the application process.
  -debug
    Runs each process under the control of gdb. This option is used to select a debugger to be used with mpirun jobs. LoadLeveler currently does not support running interactive MPICH jobs, so starting mpirun jobs under a debugger is not supported. This option also requires xterm to be working properly, as it opens gdb under an xterm window. Since we do not support the -xterm option, the -debug option is also not supported.
  h1 h2 ...
    Specifies the names of hosts where processes should run. The mpirun_rsh script can take either a host file or the names of the hosts, h1 h2 and so on, on which to run the mpirun job. If both a host file and a list of machines are specified in the mpirun_rsh arguments, mpirun_rsh will have an error parsing the arguments. Because we want LoadLeveler to decide which nodes to run on, you should use the host list specified by the environment variable LOADL_HOSTFILE. Specifying the names of the hosts is not supported.

v When using MPICH-GM, the mpirun.ch_gm script is run on the first machine allocated to the job as the master task. The mpirun.ch_gm script starts the actual execution of the parallel tasks on the other nodes included in the LoadLeveler cluster using the llspawn command as RSHCOMMAND. The following options of MPICH-GM's mpirun script are not supported when used with LoadLeveler:

  --gm-kill <n>
    This option allows you to kill all remaining processes <n> seconds after the first one dies or exits. Do not specify this option when running the application under LoadLeveler, because LoadLeveler will handle the cleanup of the tasks.
  --gm-tree-spawn
    This option uses a two-level spawn tree to launch the processes, in an effort to reduce the load on any particular host. Because LoadLeveler provides its own scalable method for spawning the application tasks from the master host, using the llspawn command, spawning processes in a tree-like fashion is not supported.
  -totalview
    This option is used to select a TotalView debugging session to be used with the mpirun script. LoadLeveler currently does not support running interactive MPICH jobs, so starting mpirun jobs under a debugger is not supported.
  -r
    This is an optional flag for MPICH-GM which forces the removal of the shared memory files. Because this option is not required, it is not supported. If you specify this option, it will be ignored.
  -ddt
    This option is used to select a DDT debugging session to be used with the mpirun script. LoadLeveler currently does not support running interactive MPICH jobs, so starting mpirun jobs under a debugger is not supported.
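Tying these points together, a batch MPICH job typically lets LoadLeveler generate the machine file and passes it to mpirun through LOADL_HOSTFILE. The following is a sketch along the lines of the shipped samples; it assumes the job_type = MPICH keyword and the LOADL_TOTAL_TASKS environment variable described in the keyword and variable references, and the file names are illustrative:

  #!/bin/sh
  # Illustrative sketch of an MPICH batch job; compare the shipped
  # sample mpich_ivp.cmd. Names and task counts are illustrative.
  # @ job_type = MPICH
  # @ node = 2
  # @ total_tasks = 4
  # @ output = ivp.$(cluster).$(process).out
  # @ error = ivp.$(cluster).$(process).err
  # @ queue
  mpirun -np $LOADL_TOTAL_TASKS -machinefile $LOADL_HOSTFILE ./ivp

Because the job command file contains no executable keyword, the file itself is the script that runs as the master task, and the mpirun invocation picks up the machine file that LoadLeveler created for this job step.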
This option also requires xterm to be working properly, because it opens gdb in an xterm window. Because the -xterm option is not supported, the -debug option is also not supported.

h1 h2 ...
Specifies the names of hosts where processes should run. The mpirun_rsh script can take either a host file or the names of the hosts, h1 h2 and so on, on which to run the mpirun job. If both a host file and a list of machines are specified in the mpirun_rsh arguments, mpirun_rsh will fail to parse the arguments. Because LoadLeveler must decide which nodes to run on, use the host list specified by the environment variable LOADL_HOSTFILE. Specifying the names of the hosts is not supported.

v When using MPICH-GM, the mpirun.ch_gm script is run on the first machine allocated to the job as the master task. The mpirun.ch_gm script starts the actual execution of the parallel tasks on the other nodes included in the LoadLeveler cluster using the llspawn command as RSHCOMMAND. The following options of MPICH-GM's mpirun script are not supported when used with LoadLeveler:

--gm-kill <n>
This option kills all remaining processes <n> seconds after the first one dies or exits. Do not specify this option when running the application under LoadLeveler, because LoadLeveler handles the cleanup of the tasks.

--gm-tree-spawn
This option uses a two-level spawn tree to launch the processes, in an effort to reduce the load on any particular host. Because LoadLeveler provides its own scalable method for spawning the application tasks from the master host, using the llspawn command, spawning processes in a tree-like fashion is not supported.

-totalview
This option is used to select a TotalView debugging session to be used with the mpirun script. LoadLeveler currently does not support running interactive MPICH jobs, so starting mpirun jobs under a debugger is not supported.

-r
This optional MPICH-GM flag forces the removal of the shared memory files. Because this option is not required, it is not supported; if you specify it, it is ignored.

-ddt
This option is used to select a DDT debugging session to be used with the mpirun script. LoadLeveler currently does not support running interactive MPICH jobs, so starting mpirun jobs under a debugger is not supported.

Sample programs are available:
v See “MPICH sample job command file” on page 208 for a sample MPICH job command file.
v See “MPICH-GM sample job command file” on page 209 for a sample MPICH-GM job command file.
v See “MVAPICH sample job command file” on page 211 for a sample MVAPICH job command file.
v The LoadLeveler samples directory also contains sample files:
– On AIX, use directory /usr/lpp/LoadL/full/samples/llmpich
– On Linux, use directory /opt/ibmll/LoadL/full/samples/llmpich
These sample files include:
– ivp.c: A simple MPI application that you can run as an MPICH, MVAPICH, or MPICH-GM job.
– Job command files to run the ivp.c program as a batch job:
- For MPICH: mpich_ivp.cmd
- For MPICH-GM: mpich_gm_ivp.cmd

Examples: Building parallel job command files

This topic contains sample job command files for the following parallel environments:
v IBM AIX Parallel Operating Environment (POE)
v MPICH
v MPICH-GM
v MVAPICH

POE sample job command file

Figure 26 is a sample job command file for POE.

#
# @ job_type = parallel
# @ environment = COPY_ALL
# @ output = poe.out
# @ error = poe.error
# @ node = 8,10
# @ tasks_per_node = 2
# @ network.LAPI = sn_all,US,,instances=1
# @ network.MPI = sn_all,US,,instances=1
# @ wall_clock_limit = 60
# @ executable = /usr/bin/poe
# @ arguments = /u/richc/My_POE_program -euilib "us"
# @ class = POE
# @ queue

Figure 26. POE job command file – multiple tasks per node

Figure 26 shows the following:
v The total number of nodes requested is a minimum of eight and a maximum of 10 (node=8,10). Two tasks run on each node (tasks_per_node=2). Thus the total number of tasks can range from 16 to 20.
v Each task of the job will run using the LAPI protocol in US mode with a switch adapter (network.LAPI=sn_all,US,,instances=1), and using the MPI protocol in US mode with a switch adapter (network.MPI=sn_all,US,,instances=1).
v The maximum run time allowed for the job is 60 seconds (wall_clock_limit=60).

Figure 27 on page 208 is a second sample job command file for POE.
#
# @ job_type = parallel
# @ input = poe.in.1
# @ output = poe.out.1
# @ error = poe.err
# @ node = 2,8
# @ network.MPI = sn_single,shared,IP
# @ wall_clock_limit = 60
# @ class = POE
# @ queue
/usr/bin/poe /u/richc/my_POE_setup_program -infolevel 2
/usr/bin/poe /u/richc/my_POE_main_program -infolevel 2

Figure 27. POE sample job command file – invoking POE twice

Figure 27 shows the following:
v POE is invoked twice, through my_POE_setup_program and my_POE_main_program.
v The job requests a minimum of two nodes and a maximum of eight nodes (node=2,8).
v The job by default runs one task per node.
v The job uses the MPI protocol with a switch adapter in IP mode (network.MPI=sn_single,shared,IP).
v The maximum run time allowed for the job is 60 seconds (wall_clock_limit=60).

MPICH sample job command file

Figure 28 is a sample job command file for MPICH.

#!/bin/ksh
# LoadLeveler JCF file for running an MPICH job
# @ job_type = MPICH
# @ node = 4
# @ tasks_per_node = 2
# @ output = mpich_test.$(cluster).$(process).out
# @ error = mpich_test.$(cluster).$(process).err
# @ queue
echo "------------------------------------------------------------"
echo LOADL_STEP_ID=$LOADL_STEP_ID
echo "------------------------------------------------------------"
/opt/mpich/bin/mpirun -np $LOADL_TOTAL_TASKS -machinefile $LOADL_HOSTFILE /common/NFS/ll_bin/mpich_test

Figure 28. MPICH job command file - sample 1

Note: You can also specify the job_type=parallel keyword and invoke the mpirun script to run an MPICH job. In that case, the mpirun script would use rsh or ssh and not the llspawn command.

Figure 28 shows that in the following job command file statement:

/opt/mpich/bin/mpirun -np $LOADL_TOTAL_TASKS -machinefile $LOADL_HOSTFILE /common/NFS/ll_bin/mpich_test

-np Specifies the number of parallel processes.
LOADL_TOTAL_TASKS
Is the environment variable set by LoadLeveler with the number of parallel processes of the job step.
-machinefile Specifies the machine list file.
LOADL_HOSTFILE
Is the environment variable set by LoadLeveler with the name of the file that contains the host names assigned to the parallel job step.

The following is another example of an MPICH job command file:

#!/bin/ksh
# LoadLeveler JCF file for running an MPICH job
# @ job_type = MPICH
# @ node = 4
# @ tasks_per_node = 2
# @ output = mpich_test.$(cluster).$(process).out
# @ error = mpich_test.$(cluster).$(process).err
# @ executable = /opt/mpich/bin/mpirun
# @ arguments = -np $LOADL_TOTAL_TASKS -machinefile $LOADL_HOSTFILE /common/NFS/ll_bin/mpich_test
# @ queue

Figure 29. MPICH job command file - sample 2

Figure 29 shows the following:
v The mpirun script is specified as the value of the executable job command file keyword.
v The following mpirun script arguments are specified with the arguments job command file keyword:

-np $LOADL_TOTAL_TASKS -machinefile $LOADL_HOSTFILE /common/NFS/ll_bin/mpich_test

-np Specifies the number of parallel processes.
LOADL_TOTAL_TASKS
Is the environment variable set by LoadLeveler with the number of parallel processes of the job step.
-machinefile Specifies the machine list file.
LOADL_HOSTFILE
Is the environment variable set by LoadLeveler with the name of the file that contains the host names assigned to the parallel job step.

MPICH-GM sample job command file

Figure 30 on page 210 is a sample job command file for MPICH-GM.
#!/bin/ksh
# LoadLeveler JCF file for running an MPICH-GM job
# @ job_type = MPICH
# @ resources = gmports(1)
# @ node = 4
# @ tasks_per_node = 2
# @ output = mpich_gm_test.$(cluster).$(process).out
# @ error = mpich_gm_test.$(cluster).$(process).err
# @ queue
echo "------------------------------------------------------------"
echo LOADL_STEP_ID=$LOADL_STEP_ID
echo "------------------------------------------------------------"
/opt/mpich/bin/mpirun.ch_gm -np $LOADL_TOTAL_TASKS -machinefile $LOADL_HOSTFILE /common/NFS/ll_bin/mpich_gm_test

Figure 30. MPICH-GM job command file - sample 1

Figure 30 shows the following:
v The statement # @ resources = gmports(1) specifies that each task consumes one GM port. This is how LoadLeveler limits the number of GM ports simultaneously in use on any machine. This resource name is the name you specified in schedule_by_resources in the configuration file, and each machine stanza in the administration file must define GM ports and specify the quantity of GM ports available on each machine. Use the llstatus -R command to confirm the names and values of the configured and available consumable resources.
v In the following job command file statement:

/opt/mpich/bin/mpirun.ch_gm -np $LOADL_TOTAL_TASKS -machinefile $LOADL_HOSTFILE /common/NFS/ll_bin/mpich_gm_test

/opt/mpich/bin/mpirun.ch_gm
Specifies the location of the mpirun.ch_gm script shipped with the MPICH-GM implementation that runs the MPICH-GM application.
-np Specifies the number of parallel processes.
-machinefile Specifies the machine list file.
LOADL_HOSTFILE
Is the environment variable set by LoadLeveler with the name of the file that contains the host names assigned to the parallel job step.

Figure 31 is another sample job command file for MPICH-GM.

#!/bin/ksh
# LoadLeveler JCF file for running an MPICH-GM job
# @ job_type = MPICH
# @ resources = gmports(1)
# @ node = 4
# @ tasks_per_node = 2
# @ output = mpich_gm_test.$(cluster).$(process).out
# @ error = mpich_gm_test.$(cluster).$(process).err
# @ executable = /opt/mpich/bin/mpirun.ch_gm
# @ arguments = -np $LOADL_TOTAL_TASKS -machinefile $LOADL_HOSTFILE /common/NFS/ll_bin/mpich_gm_test
# @ queue

Figure 31. MPICH-GM job command file - sample 2

Figure 31 shows the following:
v The mpirun.ch_gm script is specified as the value of the executable job command file keyword.
v The following mpirun.ch_gm script arguments are specified with the arguments job command file keyword:

-np $LOADL_TOTAL_TASKS -machinefile $LOADL_HOSTFILE /common/NFS/ll_bin/mpich_gm_test

-np Specifies the number of parallel processes.
LOADL_TOTAL_TASKS
Is the environment variable set by LoadLeveler with the number of parallel processes of the job step.
-machinefile Specifies the machine list file.
LOADL_HOSTFILE
Is the environment variable set by LoadLeveler with the name of the file that contains the host names assigned to the parallel job step.

MVAPICH sample job command file

Figure 32 is a sample job command file for MVAPICH:

#!/bin/ksh
# LoadLeveler JCF file for running an MVAPICH job
# @ job_type = MPICH
# @ node = 4
# @ tasks_per_node = 2
# @ output = mvapich_test.$(cluster).$(process).out
# @ error = mvapich_test.$(cluster).$(process).err
# @ queue
echo "------------------------------------------------------------"
echo LOADL_STEP_ID=$LOADL_STEP_ID
echo "------------------------------------------------------------"
/opt/mpich/bin/mpirun_rsh -np $LOADL_TOTAL_TASKS -machinefile $LOADL_HOSTFILE /common/NFS/ll_bin/mpich_test

Figure 32. MVAPICH job command file - sample 1

Figure 32 shows that in the following job command file statement:

/opt/mpich/bin/mpirun_rsh -np $LOADL_TOTAL_TASKS -machinefile $LOADL_HOSTFILE /common/NFS/ll_bin/mpich_test

-np Specifies the number of parallel processes.
LOADL_TOTAL_TASKS
Is the environment variable set by LoadLeveler with the number of parallel processes of the job step.
-machinefile Specifies the machine list file.
LOADL_HOSTFILE
Is the environment variable set by LoadLeveler with the name of the file that contains the host names assigned to the parallel job step.

Figure 33 on page 212 is another sample job command file for MVAPICH:
#!/bin/ksh
# LoadLeveler JCF file for running an MVAPICH job
# @ job_type = MPICH
# @ node = 4
# @ tasks_per_node = 2
# @ output = mvapich_test.$(cluster).$(process).out
# @ error = mvapich_test.$(cluster).$(process).err
# @ executable = /opt/mpich/bin/mpirun_rsh
# @ arguments = -np $LOADL_TOTAL_TASKS -machinefile $LOADL_HOSTFILE /common/NFS/ll_bin/mpich_test
# @ queue

Figure 33. MVAPICH job command file - sample 2

Figure 33 shows the following:
v The mpirun_rsh command is specified as the value of the executable job command file keyword.
v The following mpirun_rsh command arguments are specified with the arguments job command file keyword:

-np $LOADL_TOTAL_TASKS -machinefile $LOADL_HOSTFILE /common/NFS/ll_bin/mpich_test

-np Specifies the number of parallel processes.
LOADL_TOTAL_TASKS
Is the environment variable set by LoadLeveler with the number of parallel processes of the job step.
-machinefile Specifies the machine list file.
LOADL_HOSTFILE
Is the environment variable set by LoadLeveler with the name of the file that contains the host names assigned to the parallel job step.

Obtaining status of parallel jobs

Both end users and LoadLeveler administrators can obtain status of parallel jobs in the same way as they obtain status of serial jobs – either by using the llq command or by viewing the Jobs window on the graphical user interface (GUI). By issuing llq -l, or by using the Job Actions → Details selection in xloadl, users get a list of machines allocated to the parallel job. If you also need to see task instance information, use the -x option in addition to the -l option (llq -l -x). See “llq - Query job status” on page 479 for samples of output using the -x and -l options with the llq command.

Obtaining allocated host names

llq -l output includes information on allocated host names. Another way to obtain the allocated host names is with the LOADL_PROCESSOR_LIST environment variable, which you can use from a shell script in your job command file as shown in Figure 34 on page 213. This example uses LOADL_PROCESSOR_LIST to perform a remote copy of a local file to all of the nodes, and then invokes POE. Note that the processor list contains an entry for each task running on a node. If two tasks are running on a node, LOADL_PROCESSOR_LIST will contain two instances of the host name where the tasks are running. The example in Figure 34 on page 213 removes any duplicate entries.
Note that LOADL_PROCESSOR_LIST is set by LoadLeveler, not by the user. This environment variable is limited to 128 host names. If the value is greater than the 128 limit, the environment variable is not set.

#!/bin/ksh
# @ output = my_POE_program.$(cluster).$(process).out
# @ error = my_POE_program.$(cluster).$(process).err
# @ class = POE
# @ job_type = parallel
# @ node = 8,12
# @ network.MPI = sn_single,shared,US
# @ queue
tmp_file="/tmp/node_list"
rm -f $tmp_file
# Copy each entry in the list to a new line in a file so
# that duplicate entries can be removed.
for node in $LOADL_PROCESSOR_LIST
do
echo $node >> $tmp_file
done
# Sort the file removing duplicate entries and save list in variable
nodelist=$(sort -u /tmp/node_list)
for node in $nodelist
do
rcp localfile $node:/home/userid
done
rm -f $tmp_file
/usr/bin/poe /home/userid/my_POE_program

Figure 34. Using LOADL_PROCESSOR_LIST in a shell script

Working with reservations

Under the BACKFILL scheduler only, LoadLeveler allows authorized users to make reservations, which specify a time period during which specific node resources are reserved for use by particular users or groups. Use Table 48 to find information about working with reservations.

Table 48. Roadmap of tasks for reservation owners and users

Learn how reservations work in the LoadLeveler environment:
v “Overview of reservations” on page 25
v “Understanding the reservation life cycle” on page 214

Creating new reservations:
“Creating new reservations” on page 216

Managing jobs that run under a reservation:
v “Submitting jobs to run under a reservation” on page 218
v “Removing bound jobs from the reservation” on page 220

Managing existing reservations:
v “Querying existing reservations” on page 221
v “Modifying existing reservations” on page 221
v “Canceling existing reservations” on page 222
Table 48. Roadmap of tasks for reservation owners and users (continued)

Using the LoadLeveler interfaces for reservations:
v Chapter 16, “Commands,” on page 411
v “Reservation API” on page 643

Understanding the reservation life cycle

From the time at which LoadLeveler creates a reservation through the time the reservation ends or is canceled, a reservation goes through various states, which are indicated in command listings and other displays or output. Understanding these states is important because the current state of a reservation dictates what actions you can take; for example, if you want to modify the start time for a reservation, you may do so only while the reservation is in Waiting state. Table 49 lists the possible reservation states, their abbreviations, and usage notes.

Table 49. Reservation states, abbreviations, and usage notes

Waiting (abbreviation in displays/output: W)
Reservations are in the Waiting state:
1. When LoadLeveler first creates a reservation.
2. After one occurrence of a recurring reservation ends and before the next occurrence starts.
While the reservation is in the Waiting state:
v Only administrators and reservation owners may modify, cancel, and add users or groups to the reservation.
v Administrators, reservation owners, and users or groups that are allowed to use the reservation may query it, and submit jobs to run during the reservation period.
Table 49. Reservation states, abbreviations, and usage notes (continued)

Setup (abbreviation in displays/output: S)
LoadLeveler changes the state of a reservation from Waiting to Setup just before the start time of the reservation. The actual time at which LoadLeveler places the reservation in Setup state depends on the value set for the RESERVATION_SETUP_TIME keyword in the configuration file.
While the reservation is in Setup state:
v Only administrators and reservation owners may modify, cancel, and add users or groups to the reservation.
v Administrators, reservation owners, and users or groups that are allowed to use the reservation may query it, and submit jobs to run during the reservation period.
During this setup period, LoadLeveler:
v Stops scheduling unbound job steps to reserved nodes.
v Preempts any jobs that are still running on the nodes that are reserved through this reservation. To preempt the running jobs, LoadLeveler uses the preemption method specified through the DEFAULT_PREEMPT_METHOD keyword in the configuration file.
Note: The default value for DEFAULT_PREEMPT_METHOD is SU (suspend), which is not supported in all environments, and the default value for PREEMPTION_SUPPORT is NONE. If you want preemption to take place at the start of the reservation, make sure the cluster is configured for preemption (see “Steps for configuring a scheduler to preempt jobs” on page 130 for more information).

Active (abbreviation in displays/output: A)
At the reservation start time, LoadLeveler changes the reservation state from Setup to Active. It also dispatches only job steps that are bound to the reservation, until the reservation completes or is canceled. LoadLeveler does not dispatch bound job steps that:
v Require certain resources, such as floating consumable resources, that are not available during the reservation period.
v Have expected end times that exceed the end time of the reservation. By default, LoadLeveler allows such jobs to run, but their completion is subject to resource availability. (An administrator may configure LoadLeveler to prevent such jobs from running.)
These bound job steps remain idle unless the required resources become available.
While the reservation is in Active state:
v Only administrators and reservation owners may modify, cancel, and add users or groups to the reservation.
v Administrators, reservation owners, and users or groups that are allowed to use the reservation may query it, and submit jobs to run during the reservation period.
Table 49. Reservation states, abbreviations, and usage notes (continued)

Active_Shared (abbreviation in displays/output: AS)
At the reservation start time, LoadLeveler changes the reservation state from Setup to Active. It also dispatches only job steps that are bound to the reservation, unless the reservation was created with the SHARED mode. In this case, if reserved resources are still available after LoadLeveler dispatches any bound job steps that are eligible to run, LoadLeveler changes the reservation state to Active_Shared, and begins dispatching job steps that are not bound to the reservation. Once the reservation state changes to Active_Shared, it remains in that state until the reservation completes or is canceled. During this time, LoadLeveler dispatches both bound and unbound job steps, pending resource availability; bound job steps are considered before unbound job steps. The conditions under which LoadLeveler will not dispatch bound job steps are the same as those listed in the notes for the Active state. The actions that administrators, reservation owners, and users may perform are the same as those listed in the notes for the Active state.

Canceled (abbreviation in displays/output: CA)
When a reservation owner, administrator, or LoadLeveler issues a request to cancel the reservation, LoadLeveler changes the state of a reservation to Canceled and unbinds any job steps bound to this reservation. When the reservation is in this state, no one can modify or submit jobs to this reservation.

Complete (abbreviation in displays/output: C)
When a reservation end time is reached, LoadLeveler changes the state of a reservation to Complete. When the reservation is in this state, no one can modify or submit jobs to this reservation.

Creating new reservations

You must be an authorized user or member of an authorized group to successfully create a reservation. LoadLeveler administrators define authorized users by adding the max_reservations keyword to the user or group stanza in the administration file. The max_reservations keyword setting also defines how many reservations you are allowed to own. Ask your administrator whether you are authorized to create reservations. To be authorized to create reservations, LoadLeveler administrators also must have the max_reservations keyword set in their user or group stanza.

To create a reservation, use the llmkres command. Specify the start time of the reservation using the -t command option and the duration of the reservation using the -d command option. If you are creating a recurring reservation, you must use the -t option to specify the schedule for that reservation.
In addition to the start time and duration (or reservation schedule), you must also use one of the following methods to specify how you want to select nodes for the reservation.

Note: These methods are mutually exclusive.

v The -n option on the llmkres command instructs LoadLeveler to reserve a number of nodes. LoadLeveler may select any unreserved node to satisfy a reservation. This command option is perhaps the easiest to use, because you need to know only how many nodes you want, not specific node characteristics. The minimum number of nodes a reservation must have is 1.
v The -h option on the llmkres command instructs LoadLeveler to reserve specific nodes.
v The -f option on the llmkres command instructs LoadLeveler to submit the specified job command file, and reserve appropriate nodes for the first job step in the job command file. Through this action, all job steps for the job are bound to the reservation. If the reservation request fails, LoadLeveler changes the state for all job steps for this job to NotQueued, and will not schedule any of those job steps to run.
v The -j option on the llmkres command instructs LoadLeveler to reserve appropriate nodes for that job step. Through this action, the job step is bound to the reservation. If the reservation request fails, the job step remains in the same state as it was before.
v The -c option on the llmkres command instructs LoadLeveler to reserve a number of Blue Gene compute nodes (C-nodes). The -j and -f options also reserve Blue Gene resources if the job type is bluegene.

You also may define other reservation attributes, including:
v Whether additional users or groups are allowed to use the reservation. Use the -U or -G command options, respectively.
v Whether the reservation will be in one or both of these optional modes:
– SHARED mode: When you use the -s command option, LoadLeveler allows reserved resources to be shared by job steps that are not associated with a reservation. This mode enables the efficient use of reserved resources; if the bound job steps do not use all of the reserved resources, LoadLeveler can schedule unbound job steps as well so the resources do not remain idle. Unless you specify this mode, however, only job steps bound to the reservation may use the reserved resources.
– REMOVE_ON_IDLE mode: When you use the -i command option, LoadLeveler automatically cancels the reservation when all bound job steps that can run finish running. Using this mode is efficient because it prevents LoadLeveler from wasting reserved resources when no jobs are available to use them. Selecting this mode is especially useful for workloads that will run unattended.
v The default binding method to use when jobs are bound to the reservation. Use the -m option to specify whether the soft or firm binding method should be used when the binding method is not specified by the llbind command.
– Soft binding allows the bound job to use resources outside of the reservation.
– Firm binding restricts the job to the reserved resources.
v For a recurring reservation, when the reservation will expire. Use the -e option to specify the expiration date of the recurring reservation.

Additional rules apply to the use of these options; see “llmkres - Make a reservation” on page 459 for details.
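For example, the following command sketch creates a reservation on four nodes that is shared (-s) and canceled when idle (-i). The start time shown is illustrative, and the duration is assumed here to be given in minutes; see “llmkres - Make a reservation” on page 459 for the exact date, time, and duration formats your installation accepts:

llmkres -t 11/30 10:00 -d 120 -n 4 -s -i

If the command succeeds, LoadLeveler returns a reservation identifier (for example, c94n16.80.r) that you can later pass to commands such as llqres, llchres, and llrmres.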
Alternative: Use the ll_make_reservation and the ll_init_reservation_param subroutines in a program.

Tips:
v If your user ID is not authorized to create any type of reservation but you are a member of a group with authority to create reservations, you must use the -g option to specify the name of the authorized group on the llmkres command.
v Only reservations in waiting and in use are counted toward the limit of allowed reservations set through the max_reservations keyword. LoadLeveler does not count reservations or recurring reservations that have already ended or are in the process of being canceled.
v For accounting purposes, although recurring reservations have multiple instances, a recurring reservation counts as one reservation no matter how many times it may recur during its reservation period.
v Although you may create more than one reservation or recurring reservation for a particular node or set of nodes, only one of those reservations may be active at a time. If LoadLeveler determines that the reservation you are requesting will overlap with another reservation, LoadLeveler fails the create request. No reservation periods for the same set of machines can overlap.

If the create request is successful, LoadLeveler assigns and returns to the owner a unique reservation identifier, in the form host.rid.r, where:
host The name of the machine which assigned the reservation identifier.
rid A number assigned to the reservation by LoadLeveler.
r The letter r is used to distinguish a reservation identifier from a job step identifier.

The following are examples of reservation identifiers:
c94n16.80.r
c94n06.1.r

For details about the LoadLeveler interfaces for creating reservations, see:
v “llmkres - Make a reservation” on page 459.
v “ll_make_reservation subroutine” on page 653 and “ll_init_reservation_param subroutine” on page 652.

Submitting jobs to run under a reservation

LoadLeveler administrators, reservation owners, and authorized users may submit jobs to run under a reservation. You may bind both batch and interactive POE job steps to a reservation, either before a reservation starts or while it is active.

Before you begin:
v If you are a reservation owner and used the -f or -j options on the llmkres command when you created the reservation, you do not have to perform the steps listed in Table 50 on page 219. Those command options automatically bind the job steps to the reservation. To find out whether a particular job step is bound to a reservation, use the command llq -l and check the listing for a reservation ID.
v To find out which reservation IDs you may use, check with your LoadLeveler administrator, or enter the command llqres -l and check the names in the Users or Groups fields (under the Modification time field) in the output listing.
If your user name or a group name to which you belong appears in these output fields, you are authorized to use the reservation.
v LoadLeveler cannot guarantee that certain resources will be available during a reservation period. If you submit job steps that require these resources, LoadLeveler will bind the job steps to the reservation, but will not dispatch them unless the resources become available during the reservation. These resources include:
– Specific nodes that were not reserved under this reservation.
– Floating consumable resources for a cluster.
– Resources that are not released through preemption, such as virtual memory and adapters.
v Whether bound job steps are successfully dispatched depends not only on resource availability, but also on administration file keywords that set maximum numbers, including:
– max_jobs_scheduled
– maxidle
– maxjobs
– maxqueued
If LoadLeveler determines that scheduling a bound job will exceed one or more of these configured limits, your job will remain idle unless conditions permit scheduling at a later time during the reservation period.

Table 50. Instructions for submitting a job to run under a reservation

To bind already submitted jobs:
Use the llbind command.
Alternative: Use the ll_bind_reservation subroutine in a program.
Result: LoadLeveler either sets the reservation ID for each job step that can be bound to the reservation, or sends a failure notification for the bind request.

To bind a new job that has not been submitted:
1. Specify the reservation ID through the LL_RES_ID environment variable or the ll_res_id job command file keyword. The ll_res_id keyword takes precedence over the LL_RES_ID environment variable.
Tip: You can use the ll_res_id keyword to modify the reservation to submit to in a job command file filter.
2. Use the llsubmit command to submit the job.
Result: If the job can be bound to the requested reservation, LoadLeveler sets the reservation ID for each job step that can be bound to the reservation. Otherwise, if the job step cannot be bound to the reservation, LoadLeveler changes the job state to NotQueued. To change the job step’s state to Idle, issue the llbind -r command. Use the llqres command or llq command with the -l option to check the success or failure of the binding request for each job step.

Selecting firm or soft binding: There are two methods by which a job step can be bound to a reservation: firm and soft. When a job step is firm bound to a reservation, the job step can use only the reserved resources. A job step that is soft bound to a reservation can be started before the reservation becomes active and can use nodes that are not part of the reservation. Using soft binding is a way of guaranteeing that resources will be available for the job step at a given time, while allowing the job step to start earlier if there are available resources.
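For example, a minimal job command file that binds itself to a reservation at submission time might look like the following sketch (the reservation identifier and executable path are hypothetical):

# @ job_type = serial
# @ executable = /u/userid/myjob
# @ ll_res_id = c94n16.80.r
# @ queue

Submitting this file with llsubmit binds the job step to reservation c94n16.80.r; you can verify the binding with llq -l. Whether the step then runs with firm or soft binding is determined as described next.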
Which method to use is specified by the -m option of the llbind command. If neither is specified by llbind, the default method specified for the reservation is used. Use llqres -l and review the Binding Method field to determine which method is the default for a reservation.

Binding a job step to a recurring reservation: When a job step is bound to a reservation, the job step can be considered for scheduling as soon as any occurrence of the reservation is active. If you do not want the job step to run right away, but instead you want it to run in a later occurrence of the reservation, you can specify which occurrence the job step will be bound to by adding the occurrence ID to the end of the reservation ID.

The format of the reservation identifier is [host.]rid[.r[.oid]], where:
v host is the name of the machine that assigned the reservation identifier.
v rid is the number assigned to the reservation when it was created. An rid is required.
v r indicates that this is a reservation ID (r is optional if oid is not specified).
v oid is the occurrence ID of a recurring reservation (oid is optional).

When oid is specified, the job step will not be considered for scheduling until that occurrence of the reservation becomes active. The step will remain in Idle state during all earlier occurrences.

If a job step is bound to a recurring reservation, and the reservation occurrence’s end time is reached before the job step can be scheduled to run, the job step will be automatically bound to the next occurrence of the reservation by LoadLeveler. When the next occurrence becomes active, the job step will again be considered for scheduling.

A job can be submitted with the recurring keyword set to yes in the job command file to specify that all steps of the job will be run in every occurrence of the reservation to which it is bound. When all steps of the job have completed, the entire job is requeued and all steps are bound to the next occurrence of the reservation.

For details about the LoadLeveler interfaces for submitting jobs under reservations, see:
v “llbind - Bind job steps to a reservation” on page 415.
v “ll_bind subroutine” on page 645.
v “llsubmit - Submit a job” on page 531.

Removing bound jobs from the reservation

LoadLeveler administrators, reservation owners, and authorized users may use the llbind command to unbind one or more existing jobs from a reservation.

Alternative: Use the ll_bind_reservation subroutine in a program.

Result: LoadLeveler either unbinds the jobs from the reservation, or sends a failure notification for the unbind request. Use the llqres or llq command to check the success or failure of the remove request.
For details about the LoadLeveler interfaces for removing bound jobs from the reservation, see:
v “llbind - Bind job steps to a reservation” on page 415.
v “ll_bind subroutine” on page 645.

Querying existing reservations

Any LoadLeveler administrator or user can issue the llqres and llq commands to query the status of an existing reservation or recurring reservation. Use these commands to request specific information about reservations:
v Various options are available to filter the reservations to be displayed.
v To show details of specific reservations, use the llqres command with the -l option.
v To show job steps that are bound to specific reservations, use the llq command with the -R option.

For details about:
v Reservation attributes and llqres command syntax, see “llqres - Query a reservation” on page 500.
v llq command syntax, see “llq - Query job status” on page 479.

Modifying existing reservations

Only administrators and reservation owners can use the llchres command to modify one or more attributes of a reservation or a recurring reservation. Certain attributes cannot be changed after a reservation has become active. Typical uses for the llchres command include the following:
v Using the command llchres -U +newuser1 newuser2 to allow additional users to submit jobs to the reservation.
v If a reservation was made through the command llmkres -h free but LoadLeveler cannot include a particular node because it is down, you can use the command llchres -h +node to add the node to the reserved node list when that node becomes available again.
v If a reserved node is down after the reservation becomes active, a LoadLeveler administrator can use:
– The command llchres -h -node to remove that node from the reservation.
– The command llchres -h +1 to add another node to the reservation.
v Extending the expiration of a recurring reservation that may be about to expire. You can use llchres -e to specify a new expiration date for the reservation without having to create a new reservation.
v Making a temporary change to the next occurrence of a recurring reservation without affecting any future occurrences of that reservation. For example, you can use the -o option of the llchres command to temporarily add a user (-U) or additional nodes (-n). Once that occurrence ends, the next occurrence will not retain the change.

Alternative: Use the ll_change_reservation subroutine in a program.

For details about the LoadLeveler interfaces for modifying reservations, see:
v “llchres - Change attributes of a reservation” on page 424.
v “ll_change_reservation subroutine” on page 648.
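After changing a reservation, a quick way to confirm the result is to query the reservation and the job steps bound to it, as described above under “Querying existing reservations”. A short sketch (the reservation identifier is hypothetical):

llqres -l
llq -R c94n16.80.r

The first command shows the reservation details, including the Binding Method field; the second lists the job steps bound to the given reservation.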
Canceling existing reservations

Administrators and reservation owners may use the llrmres command to cancel one or more reservations, or to cancel some occurrences of a recurring reservation while leaving the remaining occurrences of that reservation unchanged in the system.

The options available when canceling a reservation are:
v Remove the entire reservation. All occurrences are removed and any bound job steps are automatically unbound from the reservation.
v Remove a specific occurrence of the reservation. All other occurrences remain in the system and all bound job steps remain bound to the reservation.
v Remove all occurrences during a specified interval. For example, a reservation may recur every day for one year, but during a one-week holiday period, the reservation is not needed. The reservation owner could cancel all of the occurrences during that one-week period; all other occurrences would remain in the system and all bound job steps would remain bound to the reservation.

If some occurrences are canceled and the result is that no occurrences remain, then the entire reservation is removed and all jobs are unbound from the reservation.

Alternative: Use the ll_remove_reservation subroutine in a program.

Use the llqres command to check the success or failure of the remove request. Use the llqres -l command to see a list of canceled occurrence IDs or to note individual occurrence start times that have been omitted due to cancellation.

For details about the LoadLeveler interfaces for canceling reservations, see:
v “llrmres - Cancel a reservation” on page 508.
v “ll_remove_reservation subroutine” on page 658.

Submitting jobs requesting scheduling affinity

You can request that a job use scheduling affinity by setting the RSET and TASK_AFFINITY job command file keywords.

Specify RSET with a value of:
v RSET_MCM_AFFINITY to have LoadLeveler schedule the job to machines where RSET_SUPPORT is enabled with a value of RSET_MCM_AFFINITY.
v user_defined_rset to have LoadLeveler schedule the job to machines where RSET_SUPPORT is enabled with a value of RSET_USER_DEFINED; user_defined_rset is the name of a valid user-defined RSet.

Specifying the RSET job command file keyword defaults to requesting memory affinity as a requirement and adapter affinity as a preference. Scheduling affinity options can be customized by using the job command file keyword MCM_AFFINITY_OPTIONS. For more information on these keywords, see “Job command file keyword descriptions” on page 359.

Note: If a job specifies memory or adapter affinity scheduling as a requirement, LoadLeveler will only consider machines where RSET_SUPPORT is set to RSET_MCM_AFFINITY. If there are not enough machines satisfying the memory affinity requirements, the job will stay in the idle state.
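For example, a minimal sketch of a job command file that requests memory-affinity scheduling through the rset keyword, modeled on the POE samples earlier in this chapter (the program path is hypothetical):

# @ job_type = parallel
# @ node = 2
# @ rset = RSET_MCM_AFFINITY
# @ executable = /usr/bin/poe
# @ arguments = /u/userid/my_mpi_program
# @ class = POE
# @ queue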
Specify TASK_AFFINITY with a value of:
v CORE(n) to have LoadLeveler schedule the job to machines where RSET_SUPPORT is enabled with a value of RSET_MCM_AFFINITY. On SMT and ST nodes, LoadLeveler will assign n physical CPUs to each job task.
v CPU(n) to have LoadLeveler schedule the job to machines where RSET_SUPPORT is enabled with a value of RSET_MCM_AFFINITY. On SMT nodes, LoadLeveler will assign n logical CPUs to each job task. On ST nodes, LoadLeveler will assign n physical CPUs to each job task.

Specify a requirement of SMT with a value of:
v Enabled to have LoadLeveler schedule the job to machines where SMT is currently enabled.
Example: #@ requirements = (SMT == "Enabled")
v Disabled to have LoadLeveler schedule the job to machines where SMT is currently disabled or is not supported.
Example: #@ requirements = (SMT == "Disabled")

OpenMP multithreaded jobs can be submitted requesting thread-level binding, where each individual thread of an OpenMP application is bound to a separate physical core processor or logical CPU. Use the parallel_threads job command file keyword to request OpenMP thread-level binding, optionally along with the task_affinity job command file keyword. The CPUs for the individual OpenMP threads of the tasks are selected based on the number of parallel threads (the parallel_threads job command file keyword) in each task and the set of CPUs or cores assigned (the task_affinity job command file keyword) to the tasks. The CPUs are assigned to the threads only if at least one CPU is available for each thread from the set of CPUs or cores assigned to the task. If the number of CPUs in the set of CPUs or cores assigned to the tasks is not sufficient to bind all of the threads, the job will not run. This example binds 4 OpenMP parallel threads to 4 separate cores:

#@ task_affinity = Core(4)
#@ parallel_threads = 4

Note: If you specify cpus_per_core along with your affinity request as:

#@ task_affinity = core(n)
#@ cpus_per_core = 1

then LoadLeveler allocates the requested number of CPUs to each task on SMT nodes only. The nodes running in ST mode are not assigned for jobs requesting cpus_per_core.

Submitting and monitoring jobs in a LoadLeveler multicluster

Table 51 on page 224 shows the subtasks and associated instructions for submitting and monitoring jobs in a LoadLeveler multicluster:
Table 51. Submitting and monitoring jobs in a LoadLeveler multicluster

Prepare and submit a job in the LoadLeveler multicluster:
“Steps for submitting jobs in a LoadLeveler multicluster environment”

Display information about a job in the LoadLeveler multicluster environment:
v Use the llq -X cluster_name command to display information about jobs on remote clusters.
v Use llq -x -d to display the user’s job command file keyword statements.
v Use llq -X cluster_name -l to obtain multicluster-specific information.

Transfer an idle job from one cluster to another cluster:
Use the llmovejob command, which is described in “llmovejob - Move a single idle job from the local cluster to another cluster” on page 470.

Steps for submitting jobs in a LoadLeveler multicluster environment

In a multicluster environment, you can specify one of the following:
v That a job is to run on a particular cluster.
v That LoadLeveler is to decide which cluster is best from the list of clusters, based on an administrator-defined metric. If any is specified, the job is submitted to the best cluster, based on an administrator-defined metric.
v That a job is a scale-across job, which will run across multiple clusters.

The following procedure explains how to prepare your job to be submitted in the multicluster environment.

Before you begin: You need to know that:
v Only batch jobs are supported in the LoadLeveler multicluster environment. LoadLeveler will fail any interactive jobs that you attempt to submit in a multicluster environment.
v LoadLeveler assigns all steps of a multistep job to the same cluster.
v Job identifiers are assigned by the local cluster and are retained by the job regardless of the cluster in which the job executes.
v Remote jobs are subjected to the same configuration checks as locally submitted jobs. Examples include account validation, class limits, include lists, and exclude lists.

Perform the following steps to submit jobs to run in one cluster in a LoadLeveler multicluster environment.
1. If files used by your job need to be copied between clusters, you must specify the job files to be copied from the local to the remote cluster in the job command file. Use the cluster_input_file and cluster_output_file keywords to specify these files.
Rules:
v Any local file specified for copy must be accessible from the local gateway Schedd machines. Input files must be readable. Directories and permissions must be in place to write output files.
v Any remote file specified for copy must be accessible from the remote gateway Schedd machines. Directories and permissions must be in place to write input files. Output files must be readable when the job terminates.
v To copy more than one file, these keywords can be specified multiple times.
Tip: Each instance of these keywords allows you to specify a single local file and a single remote file. If your job requires copying multiple files (for example, all files in a directory), you may want to use a procedure to consolidate the multiple files into a single file rather than specify multiple cluster_file statements in the job command file. The following is an example of how you could consolidate input files:
a. Use the tar command to produce a single tar file from multiple files.
b. On the cluster_input_file keyword, specify the file that resulted from the tar command processing.
c. Modify your job command file such that it uses the tar command to restore the multiple files from the tar file prior to invoking your application.
2. In the job command file, specify the clusters to which LoadLeveler may submit the job. The cluster_list keyword is a blank-delimited list of cluster names or the reserved word any, where:
v A single cluster name indicates that the job is to be submitted to that cluster.
v A list of multiple cluster names indicates that the job is to be submitted to one of the clusters as determined by the installation exit CLUSTER_METRIC.
v The reserved word any indicates that the job is to be submitted to any cluster defined by the installation exit CLUSTER_METRIC.
Alternative: You can specify the clusters to which LoadLeveler can submit your job on the llsubmit command using the -X option.
3. Use the llsubmit command to submit the job.
Tip: You may use the -X option on the llsubmit command to specify:
-X {cluster_list | any}
Is a blank-delimited list of cluster names or the reserved word any, where:
v A single cluster name indicates that the job is to be submitted to that cluster.
v A list of multiple cluster names indicates that the job is to be submitted to one of the clusters as determined by the installation exit CLUSTER_METRIC.
v The reserved word any indicates that the job is to be submitted to any cluster defined by the installation exit CLUSTER_METRIC.
Note: If a remote job is submitted with a list of clusters or the reserved word any and the installation exit CLUSTER_METRIC is not specified, the remote job is not submitted.

Perform the following steps to submit scale-across jobs to run across multiple clusters in a multicluster environment:
1. In the job command file, specify the cluster_option keyword as scale_across.
Alternative: You can submit a scale-across job using the -S option of the llsubmit command.
2. You can limit which clusters can be used to run the job by using the cluster_list keyword to specify the limited set of clusters. For a scale-across job, if the cluster_list keyword is not specified or the reserved word any is specified in the cluster_list, all clusters may be used to run the job.
Alternative: You can limit which clusters can be used to run the scale-across job using the -X option of the llsubmit command.
3. Use the llsubmit command to submit the job from any cluster in the scale-across multicluster environment.

The llsubmit command displays the assigned local outbound Schedd, the assigned remote inbound Schedd, the scheduling cluster, and the job identifier when the remote job has been successfully submitted. Use the -q flag to stop these additional messages from being displayed.

When you are done, you can use commands to display information about the submitted job; for example:
v Use llq -l -X cluster_name -j job_id, where cluster_name and job_id were displayed by the llsubmit command, to display information about the remote job.
v Use llq -l -X cluster_list to display the long listing about jobs, including scheduling cluster, submitting cluster, user-requested cluster, and cluster input and output files.
v Use llq -X all to display information about all jobs in all configured clusters.
v Use llq twice to display the job status for a scale-across job on all clusters where the job has been distributed. In the first command, specify the -l option to display the set of clusters where the job has been distributed (the value from the Cluster List output line). The second time you run the command, specify the -X option with the list of clusters reported from the first command. The result from that command shows the job status on the other clusters.

Submitting and monitoring Blue Gene jobs

The submission of Blue Gene jobs is similar to the submission of other job types. The following procedure explains how to prepare your job to be submitted to the Blue Gene system.

Before you begin: You need to know that checkpointing Blue Gene jobs is not currently supported.

Tip: Use the llstatus command to check whether Blue Gene support is enabled and whether Blue Gene is currently present. When Blue Gene support is enabled and Blue Gene is currently present, the llstatus command will display:

The BACKFILL scheduler with Blue Gene support is in use
Blue Gene is present
  • 247. v The partition in which the Blue Gene job is run can be specified using the bg_partition job command file keyword. For more information, see the detailed description of the bg_partition keyword. | v The size of a Blue Gene job refers to the number of Blue Gene compute | nodes instead of the number of tasks running on Startd machines. The | following keywords cannot be used to control the size of a Blue Gene job: | – node | – tasks_per_node | – total_tasks 3. Specify any other job command file keywords you require, including the bg_connection and bg_requirements Blue Gene job command file keywords. See “Job command file keyword descriptions” on page 359 for more information on job command file keywords. 4. Upon completing your job command file, submit the job using the llsubmit command. If you experience a problem submitting a Blue Gene job, see “Troubleshooting in a Blue Gene environment” on page 717 for common questions and answers pertaining to operations within a Blue Gene environment. When you are done, you can use the llq -b command to display information about Blue Gene jobs in short form. For more information see “llq - Query job status” on page 479. Example: The following is a sample job command file for a Blue Gene job: # @ job_name = bgsample # @ job_type = bluegene # @ comment = "BGL Job by Size" # @ error = $(job_name).err # @ output = $(job_name).out # @ environment = COPY_ALL; # @ wall_clock_limit = 200:00,200:00 # @ notification = always # @ notify_user = sam # @ bg_size = 1024 # @ bg_connection = torus # @ class = 2bp # @ queue /usr/bin/mpirun -exe /bgscratch/sam/com -verbose 2 -args "-o 100 -b 64 -r" Chapter 8. Building and submitting jobs 227
Chapter 9. Managing submitted jobs

Table 52 lists the tasks and sources of additional information for managing LoadLeveler jobs.

Table 52. Roadmap of user tasks for managing submitted jobs

Displaying information about a submitted job or its environment:
v “Querying the status of a job”
v “Working with machines” on page 230
v “Displaying currently available resources” on page 230
v “llclass - Query class information” on page 433
v “llq - Query job status” on page 479
v “llstatus - Query machine status” on page 512
v “llsummary - Return job resource information for accounting” on page 535

Changing the priority of a submitted job:
v “Setting and changing the priority of a job” on page 230
v “llmodify - Change attributes of a submitted job step” on page 464

Changing the state of a submitted job:
v “Placing and releasing a hold on a job” on page 232
v “Canceling a job” on page 232
v “llhold - Hold or release a submitted job” on page 454
v “llcancel - Cancel a submitted job” on page 421

Checkpointing a submitted job:
v “Checkpointing a job” on page 232
v “llckpt - Checkpoint a running job step” on page 430

Querying the status of a job

Once you submit a job, you can query the status of the job to determine, for example, whether it is still in the queue or running. You also receive other job status related information, such as the job ID and the submitting user ID. You can query the status of a LoadLeveler job either by using the GUI or the llq command. For an example of querying the status of a job, see Chapter 10, “Example: Using commands to build, submit, and manage jobs,” on page 235.

Querying the status of a job using a submit-only machine: In addition to allowing you to submit and cancel jobs, a submit-only machine allows you to query the status of jobs. You can query a job using either the submit-only version of the GUI or the llq command. For information on llq, see “llq - Query job status” on page 479.
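For example, to see the long listing for a single job step, pass its identifier to llq with the -l option (the job step identifier here is hypothetical):

llq -l c94n16.12.0

Issued without arguments, llq summarizes all job steps currently known to the cluster.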
Working with machines

You can perform the following types of tasks related to machines:
v Display machine status
When you submit a job to a machine, the status of the machine automatically appears in the Machines window on the GUI. This window displays machine-related information such as the names of the machines running jobs, as well as the machine’s architecture and operating system. For detailed information on one or more machines in the cluster, you can use the Details option on the Actions pull-down menu. This provides a detailed report that includes information such as the machine’s state and the amount of installed memory. For an example of displaying machine status, see Chapter 10, “Example: Using commands to build, submit, and manage jobs,” on page 235.
v Display the central manager
The LoadLeveler administrator designates one of the machines in the LoadLeveler cluster as the central manager. When jobs are submitted to any machine, the central manager is notified and decides where to schedule the jobs. In addition, it keeps track of the status of machines in the cluster and jobs in the system by communicating with each machine. LoadLeveler uses this information to make the scheduling decisions and to respond to queries. Usually, the system administrator is more concerned about the location of the central manager than the typical end user, but you may also want to determine its location. One reason you might want to locate the central manager is if you want to browse some configuration files that are stored on the same machine as the central manager.
v Display public scheduling machines
Public scheduling machines are machines that participate in the scheduling of LoadLeveler jobs on behalf of users at submit-only machines and users at other workstations that are not running the Schedd daemon. You can find out the names of all these machines in the cluster. Submit-only machines allow machines that are not part of the LoadLeveler cluster to submit jobs to the cluster for processing.

Displaying currently available resources

The LoadLeveler user can get information about currently available resources by using the llstatus command with either the -F or -R option. The -F option displays a list of all of the floating resources associated with the LoadLeveler cluster. The -R option lists all of the consumable resources associated with all of the machines in the LoadLeveler cluster. The user can specify a hostlist with the llstatus command to display only the consumable resources associated with specific hosts.

Setting and changing the priority of a job

LoadLeveler uses the priority of a job to determine its position among a list of all jobs waiting to be dispatched.
Setting and changing the priority of a job

LoadLeveler uses the priority of a job to determine its position among a list of all jobs waiting to be dispatched.

LoadLeveler schedules jobs based on the adjusted system priority, which takes into account both system priority and user priority:

User priority
   Every job has a user priority associated with it. A job with a higher priority runs before a job with a lower priority (when both jobs are owned by the same user). You can set this priority through the user_priority keyword in the job command file, and modify it through the llprio command. See "llprio - Change the user priority of submitted job steps" on page 477 for more information.

System priority
   Every job has a system priority associated with it. Administrators can set this priority in the configuration file using the SYSPRIO keyword expression. The SYSPRIO expression can contain class, group, and user priorities, as shown in the following example:

   SYSPRIO : (ClassSysprio * 100) + (UserSysprio * 10) + (GroupSysprio * 1) - (QDate)

The SYSPRIO expression is evaluated by LoadLeveler to determine the overall system priority of a job. To determine which jobs to run first, LoadLeveler does the following:
1. Assigns a system priority value when the negotiator adds the new job to the queue of jobs eligible for dispatch.
2. Orders jobs first by system priority.
3. Assigns jobs belonging to the same user and the same class an adjusted system priority, which takes all the system priorities and orders them by user priority. Jobs with a higher adjusted system priority are scheduled ahead of jobs with a lower adjusted system priority.

Only administrators may modify the system priority, through the llmodify command with the -s option. See "llmodify - Change attributes of a submitted job step" on page 464 for more information.

Example: How does a job's priority affect dispatching order?

To understand how a job's priority affects dispatching order, consider the sample jobs in Table 53, which lists the priorities assigned to jobs submitted by two users, Rich and Joe. Two of the jobs belong to Joe, and three belong to Rich.

User Joe has two jobs (Joe1 and Joe2) in Class A with SYSPRIOs of 9 and 8, respectively. Since Joe2 has the higher user priority (20), and because both of Joe's jobs are in the same class, Joe2's priority is swapped with that of Joe1 when the adjusted system priority is calculated. This results in Joe2 getting an adjusted system priority of 9, and Joe1 getting an adjusted system priority of 8. Similarly, the Class A jobs belonging to Rich (Rich1 and Rich3) also have their priorities swapped. The priority of the job Rich2 does not change, since this job is in a different class (Class B).

Table 53. How LoadLeveler handles job priorities

Job     User Priority   System Priority (SYSPRIO)   Class   Adjusted System Priority
Rich1   50              10                          A       6

Chapter 9. Managing submitted jobs 231
Table 53. How LoadLeveler handles job priorities (continued)

Job     User Priority   System Priority (SYSPRIO)   Class   Adjusted System Priority
Joe1    10              9                           A       8
Joe2    20              8                           A       9
Rich2   100             7                           B       7
Rich3   90              6                           A       10

Placing and releasing a hold on a job

You may place a hold on a job and thereby cause the job to remain in the queue until you release it.

There are two types of holds: a user hold and a system hold. Both you and your LoadLeveler administrator can place and release a user hold on a job. Only a LoadLeveler administrator, however, can place and release a system hold on a job.

You can place a hold on a job or release the hold either by using the GUI or the llhold command. For examples of holding and releasing jobs, see Chapter 10, "Example: Using commands to build, submit, and manage jobs," on page 235.

As a user or an administrator, you can also use the startdate keyword to place a hold on a job. This keyword allows you to specify when you want the job to run.

Canceling a job

You can cancel one of your jobs that is either running or waiting to run by using either the GUI or the llcancel command. You can use llcancel to cancel LoadLeveler jobs, including jobs from a submit-only machine. For more information about the llcancel command, see "llcancel - Cancel a submitted job" on page 421.

Checkpointing a job

Checkpointing is a method of periodically saving the state of a job so that, if for some reason the job does not complete, it can be restarted from the saved state.

Checkpoints can be taken either under the control of the user application or external to the application. On AIX only, the LoadLeveler API ll_init_ckpt is used to initiate a serial checkpoint from the user application. For initiating checkpoints from within a parallel application, the API mpc_init_ckpt should be used. These APIs allow the writer of the application to determine at what points in the application it would be appropriate to save the state of the job. To enable parallel applications to initiate checkpointing, you must use the APIs provided with the Parallel Environment (PE) program. For information on parallel checkpointing, see IBM Parallel Environment for AIX and Linux: Operation and Use, Volume 1.

It is also possible to checkpoint a program running under LoadLeveler outside the control of the application. There are several ways to do this:
v Use the llckpt command to initiate a checkpoint for a specific job step. See "llckpt - Checkpoint a running job step" on page 430 for more information.

232 TWS LoadLeveler: Using and Administering
v Checkpoint from a program that invokes the ll_ckpt API to initiate a checkpoint of a specific job step. See "ll_ckpt subroutine" on page 550 for more information.
v Have LoadLeveler automatically checkpoint all running jobs that have been enabled for checkpoint. To enable this automatic checkpoint, specify checkpoint = interval in the job command file.
v As the result of an llctl flush command.

Note: For interactive parallel jobs, the environment variable CHECKPOINT must be set to yes in the environment prior to starting the parallel application, or the job will not be enabled for checkpoint. For more information, see IBM Parallel Environment for AIX and Linux: MPI Programming Guide.

Chapter 9. Managing submitted jobs 233
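As a minimal sketch of enabling automatic checkpointing from the job command file (the job and directory names are hypothetical, and the ckpt_dir keyword shown corresponds to the Ckpt Directory field described in Table 60 on page 248; see "Job command file keyword descriptions" on page 359 for the authoritative keyword names and defaults):

   # @ job_name   = longjob
   # @ executable = longjob
   # @ checkpoint = interval
   # @ ckpt_dir   = /tmp/ckpt
   # @ queue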
Chapter 10. Example: Using commands to build, submit, and manage jobs

The following procedure presents a series of simple tasks that a user might perform using commands. For additional information about individual commands noted in the procedure, see Chapter 16, "Commands," on page 411.

1. Build your job command file by using a text editor to create a script file. Into the file enter the name of the executable, other keywords designating such things as output locations for messages, and the necessary LoadLeveler statements, as shown in Figure 35:

   # This job command file is called longjob.cmd. The
   # executable is called longjob, the input file is longjob.in,
   # the output file is longjob.out, and the error file is
   # longjob.err.
   #
   # @ executable = longjob
   # @ input = longjob.in
   # @ output = longjob.out
   # @ error = longjob.err
   # @ queue

   Figure 35. Building a job command file

2. You can optionally edit the job command file you created in step 1.
3. To submit the job command file that you created in step 1, use the llsubmit command:
   llsubmit longjob.cmd
   LoadLeveler responds by issuing a message similar to:
   submit: The job "wizard.22" has been submitted.
   where wizard is the name of the machine to which the job was submitted and 22 is the job identifier (ID). You may want to record the identifier for future use (although you can obtain this information later if necessary).
4. To display the status of the job you just submitted, use the llq command. This command returns information about all jobs in the LoadLeveler queue:
   llq wizard.22
   where wizard is the machine name to which you submitted the job, and 22 is the job ID. You can also query this job using the command llq wizard.22.0, where 0 is the step ID.
5. To change the priority of a job, use the llprio command. To increase the priority of the job you submitted by a value of 10, enter:
   llprio +10 wizard.22.0
   You can change the user priority of a job that is in the queue or one that is running. This only affects jobs belonging to the same user and the same class. If you change the priority of a job in the queue, the job's priority increases or decreases in relation to your other jobs in the queue. If you change the priority of a job that is running, it does not affect the job while it is running. It only

235
affects the job if the job re-enters the queue to be dispatched again. For more information, see "Setting and changing the priority of a job" on page 230.
6. To place a temporary hold on a job in a queue, use the llhold command. This command only takes effect if jobs are in the Idle or NotQueued state. To place a hold on wizard.22.0, enter:
   llhold wizard.22.0
7. To release the hold you placed in step 6, use the llhold command:
   llhold -r wizard.22.0
8. To display the status of the machine to which you submitted a job, use the llstatus command:
   llstatus -l wizard
9. To cancel wizard.22.0, use the llcancel command:
   llcancel wizard.22.0

236 TWS LoadLeveler: Using and Administering
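Building on this example, a job command file can contain more than one step. The following hedged sketch uses the step_name and dependency keywords (described in "Job command file keyword descriptions" on page 359) so that a hypothetical postprocess step runs only if the first step exits with a return code of 0:

   # Two-step variant of longjob.cmd (illustrative sketch, not verbatim product output)
   # @ step_name  = step1
   # @ executable = longjob
   # @ queue
   # @ step_name  = step2
   # @ executable = postprocess
   # @ dependency = (step1 == 0)
   # @ queue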
  • 257. Chapter 11. Using LoadLeveler’s GUI to build, submit, and manage jobs | Note: This is the last release that will provide the Motif-based graphical user | interface xloadl. The function available in xloadl has been frozen since TWS | LoadLeveler 3.3.2. You do not have to perform the tasks in the order listed. You may perform certain tasks before others without any difficulty; however, some tasks must be performed prior to others for succeeding tasks to work. For example, you cannot submit a job if you do not have a job command file that you built using either the GUI or an editor. The tasks included in this topic are listed in Table 54. Table 54. User tasks available through the GUI Subtask Associated information (see...) Building and submitting v “Building jobs” jobs v “Editing the job command file” on page 249 v “Submitting a job command file” on page 250 Obtaining job status v “Displaying and refreshing job status” on page 251 v “Specifying which jobs appear in the Jobs window” on page 258 v “Sorting the Jobs window” on page 252 Managing a submitted job v “Changing the priority of your jobs” on page 253 v “Placing a job on hold” on page 253 v “Releasing the hold on a job” on page 253 v “Canceling a job” on page 254 Working with machines v “Displaying and refreshing machine status” on page 255 v “Specifying which machines appear in Machines window” on page 259 v “Sorting the Machines window” on page 257 v “Finding the location of the central manager” on page 257 v “Finding the location of the public scheduling machines” on page 258 Saving LoadLeveler “Saving LoadLeveler messages in a file” on page 259 messages in a file Building jobs Use these instructions when building jobs. From the Jobs window: SELECT File → Build a Job The dialog box shown in Figure 36 on page 238 appears: 237
  • 258. Figure 36. LoadLeveler build a job window Complete those fields for which you want to override what is currently specified in your skel.cmd defaults file. Sample skel.cmd and mcluster_skel.cmd files are found in the samples subdirectory of the 238 TWS LoadLeveler: Using and Administering
release directory. You can update this file to define defaults for your site, and then update the *skelfile resource in Xloadl to point to your new skel.cmd file. If you want a personal defaults file, copy skel.cmd to one of your directories, edit the file, and update the *skelfile resource in .Xdefaults.

Table 55 shows the fields displayed in the Build a Job window:

Table 55. GUI fields and input

Executable
   Name of the program to run. It must be an executable file. Optional. If omitted, the command file is executed as if it were a shell script.
Arguments
   Parameters to pass to the program. Required only if the executable requires them.
Stdin
   Filename to use as standard input (stdin) by the program. Optional. The default is /dev/null.
Stdout
   Filename to use as standard output (stdout) by the program. Optional. The default is /dev/null.
Stderr
   Filename to use as standard error (stderr) by the program. Optional. The default is /dev/null.
Cluster Input File
   A comma-delimited local and remote path name pair, representing the local file to copy to the remote location. If you have more than one pair to enter, the More button will display a Cluster Input Files input window. Optional. The default is that no files are copied.
Cluster Output File
   A comma-delimited local and remote path name pair, representing the local file destination to copy the remote file into. If you have more than one pair to enter, the More button will display a Cluster Output Files input window. Optional. The default is that no files are copied.
Initialdir
   Initial directory. LoadLeveler changes to this directory before running the job. Optional. The default is your current working directory.
Notify User
   User ID of the person to notify regarding the status of the submitted job. Optional. The default is your user ID.
StartDate
   Month, day, and year in the format mm/dd/yyyy. The job will not start before this date. Optional. The default is to run the job as soon as possible.
StartTime
   Hour, minute, second in the format hh:mm:ss. The job will not start before this time. Optional. The default is to run the job as soon as possible. If you specify StartTime but not StartDate, the default StartDate is the current day. If you specify StartDate but not StartTime, the default StartTime is 00:00:00. This means that the job will start as soon as possible on the specified date.

Chapter 11. Using LoadLeveler's GUI to build, submit, and manage jobs 239
Table 55. GUI fields and input (continued)

Priority
   Number between 0 and 100, inclusive. Optional. The default is 50. This is the user priority. For more information on this priority, refer to "Setting and changing the priority of a job" on page 230.
Image size
   Number in kilobytes that reflects the maximum size you expect your program to grow to as it runs. Optional.
Class
   Class name. The job will only run on machines that support the specified class name. Your system administrator defines the class names. Optional:
   v Press the Choices button to get a list of available classes.
   v Press the Details button under the class list to obtain long listing information about classes.
Hold
   Hold status of the submitted job. Permitted values are:
   user      User hold
   system    System hold (only valid for LoadLeveler administrators)
   usersys   User and system hold (only valid for LoadLeveler administrators)
   Note: The default is a no-hold state.
Account Number
   Number associated with the job. For use with the llacctmrg and llsummary commands for acquiring job accounting data. Optional. Required only if the ACCT keyword is set to A_VALIDATE in the configuration file.
Environment
   Your initial environment variables when your job starts. Separate environment specifications with semicolons. Optional.
Copy Environment
   All or Master, to indicate whether the environment variables specified in the keyword Environment are copied to all nodes or just to the master node of a parallel job. Optional.
Shell
   The name of the shell to use for the job. Optional. If not specified, the shell used in the owner's password file entry is used. If none is specified, /bin/sh is used.
Group
   The LoadLeveler group name to which the job belongs. Optional.
Step Name
   The name of this job step. Optional.

240 TWS LoadLeveler: Using and Administering
Table 55. GUI fields and input (continued)

Node Usage
   How the node is used. Permitted values are:
   shared            The node can be shared with other tasks of other job steps. This is the default.
   not shared        The node cannot be shared.
   slice not shared  Has the same meaning as not shared. It is provided for compatibility.
Dependency
   A Boolean expression defining the relationship between the job steps. Optional.
Large Page
   Whether or not the job step requires Large Page memory.
   yes        Use Large Page memory if available, otherwise use regular memory.
   mandatory  Use of Large Page memory is mandatory.
   no         Do not use Large Page memory.
Bulk Transfer
   Indicates to the communication subsystem whether it should use the bulk transfer mechanism to communicate between tasks.
   yes  Use bulk transfer.
   no   Do not use bulk transfer.
   Optional.
Rset
   What type of RSet support is requested. Permitted values are:
   rset_mcm_affinity  Requests scheduling affinity. Use the MCM options button to specify task allocation method, memory affinity preference or requirement, and adapter affinity preference or requirement.
   rset_name          Requests a user-defined RSet and nodes with rset_support set to rset_user_defined.
   Optional.
Comments
   Comments associated with the job. These comments help to distinguish one job from another job. Optional.
SMT
   Indicates whether a job requires the dynamic simultaneous multithreading (SMT) function.
   yes    The job requires the SMT function.
   no     The job does not require the SMT function.
   as_is  The SMT state will not be changed.

Note: The fields that appear in this table are what you see when viewing the Build a Job window. The text in these fields does not necessarily correspond with the keywords listed in "Job command file keyword descriptions" on page 359. See "Job command file keyword descriptions" on page 359 for information on the defaults associated with these keywords.

Chapter 11. Using LoadLeveler's GUI to build, submit, and manage jobs 241
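Many of these GUI fields map directly to job command file keywords. As a hedged sketch (the program, file, directory, and class names are hypothetical; see "Job command file keyword descriptions" on page 359 for the authoritative keyword names and defaults), a file equivalent to typical entries in this window might look like:

   # @ executable  = myprog
   # @ arguments   = -iterations 100
   # @ input       = myprog.in
   # @ output      = myprog.out
   # @ error       = myprog.err
   # @ initialdir  = /u/rich/work
   # @ class       = small
   # @ notify_user = rich
   # @ queue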
SELECT A Job Type if you want to change the job type. Your choices are:
   Serial     Specifies a serial job. This is the default.
   Parallel   Specifies a parallel job.
   Blue Gene  Specifies a Blue Gene job.
   MPICH      Specifies an MPICH job.
Note that the job type you select affects the choices that are active on the Build a Job window.

SELECT a Notification option. Your choices are:
   Always    Notify you when the job starts, completes, and if it incurs errors.
   Complete  Notify you when the job completes. This is the default option as initially defined in the skel.cmd file.
   Error     Notify you if the job cannot run because of an error.
   Never     Do not notify you.
   Start     Notify you when the job starts.

SELECT a Restart option. Your choices are:
   No   This job is not restartable. This is the default.
   Yes  Restart the job.

SELECT To restart the job on the same nodes from which it was vacated. Your choices are:
   No   Restart the job on any available nodes.
   Yes  Restart the job on the same nodes it ran on previously. This option is valid after a job has been vacated.
Note that there is no default for this selection.

SELECT a Checkpoint option. Your choices are:
   No        Do not checkpoint the job. This is the default.
   Yes       Checkpoint the job at intervals you determine. See the checkpoint keyword for more information.
   Interval  Checkpoint the job at intervals determined by LoadLeveler. See the checkpoint keyword for more information.

SELECT To start from a checkpoint file. Your choices are:

242 TWS LoadLeveler: Using and Administering
  • 263. No Do not start the job from a checkpoint file (start job from beginning). Yes Yes, restart the job from an existing checkpoint file when you submit the job. The file name must be specified by the job command file. The directory name may be specified by the job command file, configuration file, or default location. SELECT Coschedule if you want steps within a job to be scheduled and dispatched at the same time. Your choices are: No Disables coscheduling for your job step. Yes Allows coscheduling to occur for your job step. Note: 1. This keyword is not inherited by other job steps. 2. The default is No. 3. The coscheduling function is only available with the BACKFILL scheduler. SELECT Nodes (available when the job type is parallel) The Nodes dialog box appears. Complete the necessary fields to specify node information for a parallel job (see Table 56). Depending upon which model you choose, different fields will be available; any unavailable fields will be desensitized. LoadLeveler will assign defaults for any fields that you leave blank. For more information, see the appropriate job command file keyword (listed in parentheses) in “Job command file keyword descriptions” on page 359. Table 56. Nodes dialog box Field Available in: Input Min # of Nodes Tasks Per Node Minimum number of nodes required for running the Model and Tasks parallel job (node keyword). with Uniform Blocking Model Optional. The default is one. Max # of Nodes Tasks Per Node Maximum number of nodes required for running the Model parallel job (node keyword). Optional. The default is the minimum number of nodes. Tasks per Node Tasks Per Node The number of tasks of the parallel job you want to Model run per node (tasks_per_node keyword). Optional. Total Tasks Tasks with The total number of tasks of the parallel job you Uniform Blocking want to run on all available nodes (total_tasks Model, and keyword). Custom Blocking Model Optional for Uniform, required for Custom Blocking. The default is one. Blocking Custom Blocking The number of tasks assigned (as a block) to each Model consecutive node until all of a job’s tasks have been assigned (blocking keyword) Chapter 11. Using LoadLeveler’s GUI to build, submit, and manage jobs 243
  • 264. Table 56. Nodes dialog box (continued) Field Available in: Input Task Geometry Custom The task ids of each task that you want to run on Geometry Model each node. You can use the ″Set Geometry″ button for step-by-step directions (task_geometry keyword). SELECT Close to return to the Build a Job dialog box. SELECT Network (available when the job type is parallel) The Network dialog box appears. The Network dialog box consists of two parts: The top half of the panel is for MPI, and the bottom half is for LAPI. Click on the check box to the left of MPI or LAPI to activate the part of the panel for which you want to specify network information. If you want to use MPI with LAPI, click on both: v The MPI check box. v The check box for Share windows between MPI and LAPI. Complete those fields for which you want to specify network information (see Table 57). For more information, see the network keyword description in “Job command file keyword descriptions” on page 359. Table 57. Network dialog box fields Field Input MPI (MPI/LAPI) Select: v Only the MPI check box to use the Message Passing Interface (MPI) protocol only. v Both the MPI check box and the Share windows between MPI and LAPI check box to use both MPI and the Low-level Application Programming Interface (LAPI) protocols. This selection corresponds to setting the network keyword in the job command file to MPI_LAPI. Optional. LAPI Select the LAPI check box to use Low-level Application Programming Interface (LAPI) protocol only. Optional. Adapter/Network Select an adapter name or a network type from the list. Required for each protocol you select. Adapter Usage Specifies that the adapter is either shared or not shared. Optional. The default is shared. Communication Mode Specifies the communication subsystem mode used by the communication protocol that you specify and can be either IP (Internet Protocol) or US (User Space). Optional. The default is IP. Communication Level Implies the amount of memory to be allocated to each window for User Space mode. Allocation can be Low, Average, or High. It is ignored by Switch_Network_Interface_For_HPS adapters. 244 TWS LoadLeveler: Using and Administering
  • 265. Table 57. Network dialog box fields (continued) Field Input Instances Specifies the number of windows or IP addresses the communication subsystem should allocate to this protocol. Optional. The default is 1 unless sn_all is specified for network and then the default is max. rCxt Blocks The number of user rCxt blocks requested for each window used by the associated protocol. It is recognized only by Switch_Network_Interface_For_HPS adapters. Optional. SELECT Close to return to the Build a Job dialog box. SELECT Requirements The Requirements dialog box appears. Complete those fields for which you want to specify requirements (see Table 58). Defaults are used for those fields that you leave blank. LoadLeveler dispatches your job only to one of those machines with resources that matches the requirements you specify. Table 58. Build a job dialog box fields Field Input Architecture Machine type. The job will not run on any other machine type. (see note 2) Optional. The default is the architecture of your current machine. Operating System Operating system. The job will not run on any other operating system. (see note 2) Optional. The default is the operating system of your current machine. Disk Amount of disk space in the execute directory. The job will only run on a machine with at least this much disk space. Optional. The default is defined in your local configuration file. Memory Amount of memory. The job will only run on a machine with at least this much memory. Optional. The default is defined in your local configuration file. Large Page Amount of Large Page memory, in megabytes. The job step requires at Memory least this much Large Page memory to run. Optional. Total Memory Amount of total (regular and Large Page memory) in megabytes needed to run the job step. Optional. Machines Machine names. The job will only run on the specified machines. Optional. Features Features. The job will only run on machines with specified features. Optional. Chapter 11. Using LoadLeveler’s GUI to build, submit, and manage jobs 245
  • 266. Table 58. Build a job dialog box fields (continued) Field Input Pool Specifies the number associated with the pool you want to use. All available pools listed in the administration file appear as choices. The default is to select nodes from any pool. LoadLeveler Specifies the version of LoadLeveler, in dotted decimal format, on the Version machine where you want the job to run. For example: 3.3.0.0 specifies that your job will run on a machine running LoadLeveler Version 3.3.0.0 or higher. Optional. Connectivity A number from 0.0 through 1.0, representing the average connectedness of the node’s managed adapters. Requirement Requirements. The job will only run if these requirements are met. Note: 1. If you enter a resource that is not available, you will NOT receive a message. LoadLeveler holds your job in the Idle state until the resource becomes available. Therefore, make certain that the spelling of your entry is correct. You can issue llq -s jobID to find out if you have a job for which requirements were not met. 2. If you do not specify an architecture or operating system, LoadLeveler assumes that your job can run only on your machine’s architecture and operating system. If your job is not a shell script that can be run successfully on any platform, you should specify a required architecture and operating system. SELECT Close to return to the Build a Job dialog box. SELECT Resources The Resources dialog box appears. This dialog box allows you to set the amount of defined consumable resources required for a job step. Resources with an ″*″ appended to their names are not in the SCHEDULE_BY_RESOURCES list. For more information, see the resources keyword. SELECT Close to return to the Build a Job dialog box. SELECT Preferences The Preferences dialog box appears. This dialog box is similar to the Requirements dialog box, with the exception of the Adapter choice, which is not supported as a Preference. Complete the fields for those parameters that you want to specify. These parameters are not binding. For any preferences that you specify, LoadLeveler attempts to find a machine that matches these preferences along with your requirements. If it cannot find the machine, LoadLeveler chooses the first machine that matches the requirements. SELECT Close to return to the Build a Job dialog box. SELECT Limits 246 TWS LoadLeveler: Using and Administering
The Limits dialog box appears. Complete the fields for those limits that you want to impose upon your job (see Table 59). If you type copy in any field except wall_clock_limit or job_cpu_limit, the limits in effect on the submit machine are used. If you leave any field blank, the default limits in effect for your user ID on the machine that runs the job are used. For more information, see "Using limit keywords" on page 89.

Table 59. Limits dialog box fields

CPU Limit
   Maximum amount of CPU time that the submitted job can use. Express the amount as:
   [[hours:]minutes:]seconds[.fraction]
   For example, 12:56:21 is 12 hours, 56 minutes, and 21 seconds. Optional.
Data Limit
   Maximum amount of the data segment that the submitted job can use. Express the amount as:
   integer[.fraction][units]
   Optional.
Core Limit
   Maximum size of a core file. Optional.
RSS Limit
   Maximum size of the resident set size. It is the largest amount of physical memory a user's process can allocate. Optional.
File Limit
   Maximum size of a file that is created. Optional.
Stack Limit
   Maximum size of the stack. Optional.
Job CPU Limit
   Maximum total CPU time to be used by all processes of a serial job step. For a parallel job, this is the total CPU time for each LoadL_starter process and its descendants, for each job step. Optional.
Wall Clock Limit
   Maximum amount of elapsed time for which a job can run. Optional.

SELECT Close to return to the Build a Job dialog box.

SELECT Checkpointing to specify checkpoint options (available when the checkpoint option is set to Yes or Interval)
   The checkpointing dialog box appears. Complete those fields for which you want to specify checkpoint information (see Table 60 on page 248). For detailed information on specific keywords, see "Job command file keyword descriptions" on page 359.

Chapter 11. Using LoadLeveler's GUI to build, submit, and manage jobs 247
Table 60. Checkpointing dialog box fields

Ckpt File
   Specifies a checkpoint file. The serial default is:
   $(job_name).$(host).$(domain).$(jobid).$(stepid).ckpt
Ckpt Directory
   Specifies a checkpoint directory name.
Ckpt Execute Directory
   Specifies a directory to use for staging the checkpoint executable file.
Ckpt Time Limits
   Sets the limits for the elapsed time a job can take checkpointing.

SELECT Close to return to the Build a Job dialog box.

SELECT Blue Gene (available when the job type is bluegene)
   The Blue Gene window appears. Complete the necessary fields to specify information for a Blue Gene job (see Table 61). Depending upon which request type you choose, different fields will be available; any unavailable fields will be desensitized. For more information, see the appropriate job command file keyword (listed in parentheses) in "Job command file keyword descriptions" on page 359.

Table 61. Blue Gene job fields

# of Compute Nodes (available when requesting by Size)
   The requested size in number of compute nodes that describes the size of the partition for this Blue Gene job. (bg_size)
Shape (available when requesting by Shape)
   The requested shape of the requested Blue Gene job. The units of each dimension of the shape are in number of base partitions, XxYxZ, where X, Y, and Z are the number of base partitions in the X-direction, Y-direction, and Z-direction. (bg_shape)
Partition Name (available when requesting by Partition)
   The name of an existing partition in the Blue Gene system where the requested job should run. (bg_partition)
Connection Type (available when requesting by Size and Shape)
   The kinds of Blue Gene partitions that can be selected for this job. You can select Torus, Mesh, or Prefer Torus. (bg_connection) Optional. The default is Mesh.
Rotate (available when requesting by Shape Dimensions)
   Whether to consider all possible rotations of the specified shape (True) or only the specified shape (False) when assigning a partition for the Blue Gene job. (bg_rotate) Optional. The default is True.

248 TWS LoadLeveler: Using and Administering
  • 269. Table 61. Blue Gene job fields (continued) Field Available when Input requesting by: Memory Megabytes A number (in megabytes) that represents the minimum available virtual memory that is needed to run the job. LoadLeveler generates a Blue Gene requirement that specifies memory that is greater than or equal to the amount you specify. Optional. If you leave this field blank, this parameter is not used when searching for machines to run your job. Requirements Expression An expression that specifies the Blue Gene requirements that a machine must meet in order to run the job. Memory is the supported keyword. SELECT Close to return to the Build a Job dialog box. Editing the job command file Use these instructions to edit the job command file that you just built. There are several ways that you can edit the job command file that you just built: 1. Using the Jobs window: SELECT File → Submit a Job The Submit a Job dialog box appears. SELECT The job file you want to edit from the file column. SELECT Edit Your job command file appears in a window. You can use any editor to edit the job command file. The default editor is specified in your .Xdefaults file. If you have an icon manager, an icon may appear. An icon manager is a program that creates a graphic symbol, displayed on a screen, that you can point to with a device such as a mouse in order to select a particular function or application. Select this icon to view your job command file. 2. Using the Tools Edit pull-down menus on the Build a Job window: Using the Edit pull-down menu, you can modify the job command file. Your choices appear in the Table 62: Table 62. Modifying the job command file with the Edit pull-down menu To Select Add a step to the job command file Add a Step or Add a First Step Delete a step from the job command file Delete a Step Chapter 11. Using LoadLeveler’s GUI to build, submit, and manage jobs 249
  • 270. Table 62. Modifying the job command file with the Edit pull-down menu (continued) To Select Clear the fields in the Build a Job window Clear Fields Select defaults to use in the fields Set Field Defaults Note: Other options include Go to Next Step, Go to Previous Step, and Go to Last Step that allow you to edit various steps in the job command file. Using the Tools pull-down menu, you can modify the job command file. Your choices appear in Table 63: Table 63. Modifying the job command file with the Tools pull-down menu To Select Name the job Set Job Name Specify a cluster, cluster list, or any cluster, if a multicluster Set Cluster environment is configured. Open a window where you can enter a script file Append Script Fill in the fields using another file Restore from File View the job command file in a window View Entire Job Determine which step you are viewing What is step # Start a new job command file Start a new job You can save and submit the information you entered by selecting the choices shown in Table 64: Table 64. Saving and submitting information To Do This Save the information you SELECT entered into a file which you Save can submit later A window appears prompting you to enter a job filename. ENTER a job filename in the text entry field. SELECT OK The window closes and the information you entered is saved in the file you specified. Submit the program SELECT immediately and discard the Submit information you entered Submitting a job command file After building a job command file, you can submit it to one or more machines for processing. To submit a job, from the Jobs window: SELECT File → Submit a Job 250 TWS LoadLeveler: Using and Administering
The Submit a Job dialog box appears.

SELECT The job file that you want to submit from the file column.
   You can also use the filter field and the directories column to select the file, or you can type the file name in the text entry field.
SELECT Submit
   The job is submitted for processing. You can now submit another job, or you can press Close to exit the window.

Displaying and refreshing job status

When you submit a job, the status of the job is automatically displayed in the Jobs window.

You can update or refresh this status using the Jobs window and selecting one of the following:
v Refresh → Refresh Jobs
v Refresh → Refresh All

To change how often the Jobs window is automatically refreshed, use the Jobs window:

SELECT Refresh → Set Auto Refresh
   A window appears.
TYPE IN a value for the number of seconds to pass before the Jobs window is updated.
   Automatic refresh can be expensive in terms of network usage and CPU cycles. You should specify a refresh interval of 120 seconds or more for normal use.
SELECT OK
   The window closes and the value you specified takes effect.

To receive detailed information on a job:

SELECT Actions → Extended Status to receive additional information on the job.
   Selecting this option is the same as typing the llq -x command.
You can also get information in the following way:
SELECT Actions → Extended Details
   Selecting this option is the same as typing the llq -x -l command. You can also double-click on the job in the Jobs window to get details on the job.
   Note: Obtaining extended status or details on multiple jobs can be expensive in terms of network usage and CPU cycles.

Chapter 11. Using LoadLeveler's GUI to build, submit, and manage jobs 251
  • 272. SELECT Actions → Job Status You can also use the llq -s command to determine why a submitted job remains in the Idle or Deferred state. SELECT Actions → Resource Use Allows you to display resource use for running jobs. Selecting this option is the same as entering the llq -w command. SELECT Actions → Blue Gene Job Status Allows you to display Blue Gene job information for jobs. Selecting this option is the same as entering the llq -b command. For more information on requests for job information, see “llq - Query job status” on page 479. Sorting the Jobs window You can specify up to two sorting options for the Jobs window. The options you specify determine the order in which the jobs appear in the Jobs window. From the Jobs window: Select Sort → Set Sort Parameters A window appears Select A primary and secondary sort Table 65 lists the sorting options: Table 65. Sorting the jobs window To: Select Sort Sort jobs by the machine from which they were Sort by Submitting Machine submitted Sort by owner Sort by Owner Sort by the time the jobs were submitted Sort by Submission Time Sort by the state of the job Sort by State Sort jobs by their user priority (last job listed runs first) Sort by Priority Sort by the class of the job Sort by Class Sort by the group associated with the job Sort by Group Sort by the machine running the job Sort by Running Machine Sort by dispatch order Sort by Dispatch Order Not specify a sort No Sort You can select a sort type as either a Primary or Secondary sorting option. For example, suppose you select Sort by Owner as the primary sorting option and Sort by Class as the secondary sorting option. The Jobs window is sorted by owner and, within each owner, by class. 252 TWS LoadLeveler: Using and Administering
  • 273. Changing the priority of your jobs If your job has not yet begun to run and is still in the queue, you can change the priority of the job in relation to your other jobs in the queue that belong to the same class. This only affects the user priority of the job. For more information on this priority, refer to “Setting and changing the priority of a job” on page 230. Only the owner of a job or the LoadLeveler administrator can change the priority of a job. From the Jobs window: SELECT a job by clicking on it with the mouse SELECT Actions → Priority A window appears. TYPE IN a number between 0 and 100, inclusive, to indicate a new priority. SELECT OK The window closes and the priority of your job changes. Placing a job on hold Only the owner of a job or the LoadLeveler administrator can place a hold on a job. From the Jobs window: SELECT The job you want to hold by clicking on it with the mouse SELECT Actions → Hold The job is put on hold and its status changes in the Jobs window. Releasing the hold on a job Only the owner of a job or the LoadLeveler administrator can release a hold on a job. From the Jobs window: SELECT The job you want to release by clicking on it with the mouse SELECT Actions → Release from Hold The job is released from hold and its status is updated in the Jobs window. Chapter 11. Using LoadLeveler’s GUI to build, submit, and manage jobs 253
  • 274. Canceling a job Only the owner of a job or the LoadLeveler administrator can cancel a job. From the Jobs window: SELECT The job you want to cancel by clicking on it with the mouse SELECT Actions → Cancel LoadLeveler cancels the job and the job information disappears from the Jobs window. Modifying consumable resources and other job attributes Use these commands to modify the consumable CPUs or memory requirements of a nonrunning job. SELECT Modify → Consumable CPUs or Modify → Consumable Memory or Modify → Class or Modify → Account number or Modify → Blue Gene → Connection or Modify → Blue Gene → Partition or Modify → Blue Gene → Rotate or Modify → Blue Gene → Shape or Modify → Blue Gene → Size or Modify → Blue Gene → Requirement A dialog box appears prompting you to enter a new value for the selected job attribute. Blue Gene attributes are available when Blue Gene is enabled. TYPE IN The new value SELECT OK The dialog box closes and the value you specified takes effect. Taking a checkpoint Use these commands to checkpoint the selected job. 254 TWS LoadLeveler: Using and Administering
  • 275. SELECT One of the following actions to take when checkpoint has completed: v Continue the step v Terminate the step v Hold the step A checkpoint monitor for this step appears. Adding a job to a reservation Use these commands to bind selected job steps to a reservation so that they will only be scheduled to run on the nodes reserved for the reservation. SELECT The job you want to bind by clicking on it with the mouse. SELECT Actions → Bind to Reservation A window appears. SELECT A reservation from the list. SELECT OK The window closes and the job is bound to that reservation. Removing a job from a reservation Use these commands to unbind selected job steps from reservations to which they currently belong. SELECT The job you want to unbind by clicking on it with the mouse. SELECT Actions → Unbind from Reservation If the job is bound to a reservation, it is removed from the reservation. Displaying and refreshing machine status The status of the machines is automatically displayed in the Machines window. You can update or refresh this status using the Machines window and selecting one of the following: v Refresh → Refresh Machines v Refresh → Refresh All. To specify an amount of time to pass before the Machines window is automatically refreshed, from the Machines window: SELECT Refresh → Set Auto Refresh A window appears. Chapter 11. Using LoadLeveler’s GUI to build, submit, and manage jobs 255
  • 276. TYPE IN a value for the number of seconds to pass before the Machines window is updated. Automatic refresh can be expensive in terms of network usage and CPU cycles. You should specify a refresh interval of 120 seconds or more for normal use. SELECT OK The window closes and the value you specified takes effect. To receive detailed information on a machine: SELECT Actions → Details This displays status information about the selected machines. Selecting this option has the same effect as typing the llstatus -l command SELECT Actions → Adapter Details This displays virtual and physical adapter information for each selected machine. Selecting this option has the same effect as typing the llstatus -a command SELECT Actions → Floating Resources This displays consumable resources for the LoadLeveler cluster. Selecting this option has the same effect as typing the llstatus -R command SELECT Actions → Machine Resources This displays consumable resources defined for the selected machines or all machines. Selecting this option has the same effect as typing the llstatus -R command SELECT Actions → Cluster Status This displays status of machines in the defined cluster or clusters. It appears only when a multicluster environment is configured and is equivalent to the llstatus -X all command. SELECT Actions → Cluster Config This displays cluster information from the LoadL_admin file. Only fields with data specified or which have defaults when not specified are displayed. It appears only when a multicluster environment is configured and is equivalent to the llstatus -C command. SELECT Actions → Blue Gene ... This displays information about the Blue Gene system. You can select the option for Status for a short listing, Details for a long listing, Base Partitions for Blue Gene base partition status, or Partitions for existing 256 TWS LoadLeveler: Using and Administering
  • 277. Blue Gene partition status. It is available only when Blue Gene support is enabled in LoadLeveler. This is equivalent to the llstatus command with the options -b, -b -l, -B, or -P. Sorting the Machines window You can specify up to two sorting options for the Machines window. The options you specify determine the order in which machines appear in the window. From the Machines window: Select Sort → Set Sort Parameters A window appears Select A primary and secondary sort Table 66 lists sorting options for the Machines window: Table 66. Sorting the machines window To: Select Sort → Sort by machine name Sort by Name Sort by Schedd state Sort by Schedd Sort by total number of jobs scheduled Sort by InQ Sort by number of running jobs scheduled by this machine Sort by Act Sort by startd state Sort by Startd Sort by the number of jobs running on this machine Sort by Run Sort by load average Sort by LdAvg Sort by keyboard idle time Sort by Idle Sort by hardware architecture Sort by Arch Sort by operating system type Sort by OpSys Not specify a sort No Sort You can select a sort type as either a Primary or Secondary sorting option. For example, suppose you select Sort by Arch as the primary sorting option and Sort by Name as the secondary sorting option. The Machines window is sorted by hardware architecture, and within each architecture type, by machine name. Finding the location of the central manager The LoadLeveler administrator designates one of the nodes in the LoadLeveler cluster as the central manager. When jobs are submitted at any node, the central manager is notified and decides where to schedule the jobs. In addition, it keeps track of the status of machines in the cluster and the jobs in the system by communicating with each node. LoadLeveler uses this information to make the scheduling decisions and to respond to queries. To find the location of the central manager, from the Machines window: Chapter 11. Using LoadLeveler’s GUI to build, submit, and manage jobs 257
  • 278. SELECT Actions → Find Central Manager A message appears in the message window declaring on which machine the central manager is located. Finding the location of the public scheduling machines Public scheduling machines are those machines that participate in the scheduling of LoadLeveler jobs on behalf of the submit-only machines. To get a list of these machines in your cluster, use the Machines window: SELECT Actions → Find Public Scheduler A message appears displaying the names of these machines. Finding the type of scheduler in use The LoadLeveler administrator defines the scheduler used by the cluster. To determine which scheduler is currently in use: SELECT Actions → Find Scheduler Type A message appears displaying the type: v ll_default v BACKFILL v External (API) Specifying which jobs appear in the Jobs window Normally, only your jobs appear in the Jobs window. You can, however, specify which jobs you want to appear by using the Select pull-down menu on the Jobs window (see Table 67). Table 67. Specifying which jobs appear in the Jobs window To Display Select Select → All jobs in the queue All All jobs belonging to a specific By User user (or users) A window appears prompting you to enter the user IDs whose jobs you want to view. All jobs submitted to a specific By Machine machine (or machines) A window appears prompting you to enter the machine names on which the jobs you want to view are running. All jobs belonging to a specific By Group group (or groups) A window appears prompting you to enter the LoadLeveler group names to which the jobs you want to view belong. 258 TWS LoadLeveler: Using and Administering
Table 67. Specifying which jobs appear in the Jobs window (continued)

All jobs having a particular ID
   Select Select → By Job Id. A dialog box prompts you to enter the ID of the job you want to appear. This ID appears in the left column of the Jobs window. Type in the ID and press OK.

Note: When you choose By User, By Machines, or By Group, you can use a UNIX regular expression enclosed in parentheses. For example, you can enter (^k10) to display all machines beginning with the characters "k10".

SELECT Select → Show Selection to show the selection parameters.

Specifying which machines appear in Machines window

You can specify which machines will appear in the Machines window. See Table 68. The default is to view all of the machines in the LoadLeveler pool.

From the Machines window:

Table 68. Specifying which machines appear in Machines window

View all of the machines
   Select Select → All.
View machines by operating system
   Select Select → by OpSys. A window appears prompting you to enter the operating system of those machines you want to view.
View machines by hardware architecture
   Select Select → by Arch. A window appears prompting you to enter the hardware architecture of those machines you want to view.
View machines by state
   Select Select → by State. A cascading pull-down menu appears prompting you to select the state of the machines that you want to view.

SELECT Select → Show Selection to show the selection parameters.

Saving LoadLeveler messages in a file

Normally, all the messages that LoadLeveler generates appear in the Messages window. If you would also like to have these messages written to a file, use the Messages window.

SELECT Actions → Start logging to a file
   A window appears prompting you to enter a filename in which to log the messages.

Chapter 11. Using LoadLeveler's GUI to build, submit, and manage jobs 259
  • 280. TYPE IN The filename in the text entry field. SELECT OK The window closes. 260 TWS LoadLeveler: Using and Administering
  • 281. Part 4. TWS LoadLeveler interfaces reference The topics in the TWS LoadLeveler interfaces reference provide the details you need to know to correctly use the IBM Tivoli Workload Scheduler (TWS) LoadLeveler interfaces for the following tasks: v Specifying keywords in the TWS LoadLeveler control files v Starting and customizing the TWS LoadLeveler GUI v Correctly coding the TWS LoadLeveler commands and APIs 261
  • 282. 262 TWS LoadLeveler: Using and Administering
Chapter 12. Configuration file reference

The configuration file contains many parameters that you can set or modify to control how LoadLeveler operates. You may control LoadLeveler's operation either:
v Across the cluster, by modifying the global configuration file, LoadL_config, or
v Locally, by modifying the LoadL_config.local file on individual machines.

Table 69 shows the configuration subtasks:

Table 69. Configuration subtasks

To find out what administrator tasks you can accomplish by using the configuration file, see:
v Chapter 4, "Configuring the LoadLeveler environment," on page 41

To learn how to correctly specify the contents of a configuration file, see:
v "Configuration file syntax"
v "Configuration file keyword descriptions" on page 265
v "User-defined keywords" on page 313
v "LoadLeveler variables" on page 314

Configuration file syntax

The information in both the LoadL_config and the LoadL_config.local files is in the form of a statement. These statements are made up of keywords and values.

There are three types of configuration file keywords:
v Keywords, described in "Configuration file keyword descriptions" on page 265.
v User-defined variables, described in "User-defined keywords" on page 313.
v LoadLeveler variables, described in "LoadLeveler variables" on page 314.

Configuration file statements take one of the following formats:
   keyword=value
   keyword:value

Statements in the form keyword=value are used primarily to customize an environment. Statements in the form keyword:value are used by LoadLeveler to characterize the machine and are known as part of the machine description. Every machine in LoadLeveler has its own machine description, which is read by the central manager when LoadLeveler is started.

Keywords are not case sensitive. This means you can enter them in lowercase, uppercase, or mixed case.

Note: For the keyword=value form, if the keyword is of a boolean type and only true and false are valid input, a value string starting with t or T is taken as true; all other values are taken as false.

To continue configuration file statements, use the backslash character (\).

263
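A short hedged fragment illustrating both statement forms and line continuation (the user-defined variable MY_TMP and the threshold values are illustrative only):

   MY_TMP = /var/loadl/tmp
   LOG    = $(MY_TMP)/log
   START : (LoadAvg < 0.5) && \
           (KeyboardIdle > 300)

Here MY_TMP and LOG are keyword=value statements that customize the environment, while the START expression is a keyword:value statement forming part of the machine description; the backslash continues the statement onto a second line.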
In the configuration file, comments must be on a separate line from keyword statements.

You can use the following types of constants and operators in the configuration file.

Numerical and alphabetical constants
   Constants may be represented as:
   v Boolean expressions
   v Signed integers
   v Floating point values
   v Strings enclosed in double quotes (" ").

Mathematical operators
   You can use the following C operators. The operators are listed in order of precedence. All of these operators are evaluated from left to right:
   v !
   v * /
   v - +
   v < <= > >=
   v == !=
   v &&
   v ||

64-bit support for configuration file keywords and expressions

Administrators can assign 64-bit integer values to selected keywords in the configuration file.

floating_resources
   Consumable resources associated with the floating_resources keyword may be assigned 64-bit integer values. Fractional and unit specifications are not allowed. The predefined ConsumableCpus, ConsumableMemory, ConsumableLargePageMemory, and ConsumableVirtualMemory may not be specified as floating resources.
   Example:
   floating_resources = spice2g6(9876543210123) db2_license(1234567890)

MACHPRIO expression
   The LoadLeveler variables Disk, ConsumableCpus, ConsumableMemory, ConsumableVirtualMemory, ConsumableLargePageMemory, PagesScanned, Memory, VirtualMemory, FreeRealMemory, and PagesFreed may be used in a MACHPRIO expression. They are 64-bit integers, and 64-bit arithmetic is used to evaluate them.
   Example:
   MACHPRIO: (Memory + FreeRealMemory) - (LoadAvg*1000 + PagesScanned)

264 TWS LoadLeveler: Using and Administering
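As a further sketch of how these operators combine (the weights are arbitrary and the expression is illustrative, not a recommended policy): if the expression evaluator follows C semantics, as the operator list above suggests, a comparison evaluates to 0 or 1 and can be used arithmetically, for example in a SYSPRIO expression:

   SYSPRIO : (GroupSysprio > 50) * 100000 + (ClassSysprio * 100) - (QDate)

Jobs whose group system priority exceeds 50 would receive a large fixed boost, while earlier submission times (smaller QDate values) still break ties.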
Configuration file keyword descriptions

This topic provides an alphabetical list of the keywords you can use in a LoadLeveler configuration file. It also provides examples of statements that use these keywords.

ACCT
   Turns the accounting function on or off.
   Syntax: ACCT = flag ...
   The available flags are:
   A_DETAIL
      Enables extended accounting. Using this flag causes LoadLeveler to record detail resource consumption by machine and by events for each job step. This flag also enables the -x flag of the llq command, permitting users to view resource consumption for active jobs.
   A_RES
      Turns reservation data recording on.
   A_OFF
      Turns accounting data recording off.
   A_ON
      Turns accounting data recording on. If specified without the A_DETAIL flag, the following is recorded:
      v The total amount of CPU time consumed by the entire job
      v The maximum memory consumption of all tasks (or nodes)
   Default value: A_OFF
   Example: This example specifies that accounting should be turned on, that extended accounting data should be collected, and that the -x flag of the llq command be enabled:
   ACCT = A_ON A_DETAIL

ACCT_VALIDATION
   Identifies the executable called to perform account validation.
   Syntax: ACCT_VALIDATION = program
   Where program is a validation program.
   Default value: $(BIN)/llacctval (the accounting validation program shipped with LoadLeveler).

ACTION_ON_MAX_REJECT
   Specifies the state in which jobs are placed when their rejection count has reached the value of the MAX_JOB_REJECT keyword. HOLD specifies that jobs are placed in User Hold status; SYSHOLD specifies that jobs are placed in System Hold status; CANCEL specifies that jobs are canceled. When a job is rejected, LoadLeveler sends a mail message stating why the job was rejected.
   Syntax: ACTION_ON_MAX_REJECT = HOLD | SYSHOLD | CANCEL

Chapter 12. Configuration file reference 265
  • 286. Default value: HOLD ACTION_ON_SWITCH_TABLE_ERROR Points to an administrator supplied program that will be run when DRAIN_ON_SWITCH_TABLE_ERROR is set to true and a switch table unload error occurs. Syntax: ACTION_ON_SWITCH_TABLE_ERROR = program Default value: The default is to not run a program. ADMIN_FILE Points to the administration file containing user, class, group, machine, and adapter stanzas. Syntax: ADMIN_FILE = directory Default value: $(tilde)/admin_file AFS_GETNEWTOKEN Specifies a filter that, for example, can be used to refresh an AFS token. Syntax: AFS_GETNEWTOKEN = full_path_to_executable Where full_path_to_executable is an administrator-supplied program that receives the AFS authentication information on standard input and writes the new information to standard output. The filter is run when the job is scheduled to run and can be used to refresh a token which expired when the job was queued. Default value: The default is to not run a program. AGGREGATE_ADAPTERS Allows an external scheduler to specify per-window adapter usages. Syntax: AGGREGATE_ADAPTERS = YES | NO When this keyword is set to YES, the resources from multiple switch adapters on the same switch network are treated as one aggregate pool available to each job. When this keyword is set to NO, the switch adapters are treated individually and a job cannot use resources from multiple adapters on the same network. Set this keyword to NO when you are using an external scheduler; otherwise, set to YES (or accept the default). Default value: YES | ALLOC_EXCLUSIVE_CPU_PER_JOB | Specifies the way CPU affinity is enforced on Linux platforms. When this | keyword is not specified or when an unrecognized value is assigned to it, | LoadLeveler will not attempt to set CPU affinity for any application processes | spawned by it. | Note: This keyword is valid only on Linux x86 and x86_64 platforms. This | keyword is ignored by LoadLeveler on all other platforms. | The ALLOC_EXCLUSIVE_CPU_PER_JOB keyword can be specified in the | global or local configuration files. It can also be specified in both configuration 266 TWS LoadLeveler: Using and Administering
ALLOC_EXCLUSIVE_CPU_PER_JOB
Specifies the way CPU affinity is enforced on Linux platforms. When this keyword is not specified, or when an unrecognized value is assigned to it, LoadLeveler will not attempt to set CPU affinity for any application processes spawned by it.
Note: This keyword is valid only on Linux x86 and x86_64 platforms. This keyword is ignored by LoadLeveler on all other platforms.
The ALLOC_EXCLUSIVE_CPU_PER_JOB keyword can be specified in the global or local configuration files. It can also be specified in both configuration files, in which case the setting in the local configuration file will override that of the global configuration file. The keyword cannot be turned off in a local configuration file if it has been set to any value in the global configuration file.
Changes to ALLOC_EXCLUSIVE_CPU_PER_JOB will not take effect at reconfiguration. The administrator must stop and restart or recycle LoadLeveler when changing ALLOC_EXCLUSIVE_CPU_PER_JOB.
Syntax:
   ALLOC_EXCLUSIVE_CPU_PER_JOB = LOGICAL | PHYSICAL
Default value: By default, when this keyword is not specified, CPU affinity is not set.
Example: When the value of this keyword is set to LOGICAL, only one LoadLeveler job step will run on each of the processors available on the machine:
   ALLOC_EXCLUSIVE_CPU_PER_JOB = LOGICAL
Example: When the value of this keyword is set to PHYSICAL, all logical processors (or physical cores) configured in one physical CPU package will be allocated to one and only one LoadLeveler job step:
   ALLOC_EXCLUSIVE_CPU_PER_JOB = PHYSICAL

ARCH
Indicates the standard architecture of the system. The architecture you specify here must be specified in the same format in the requirements and preferences statements in job command files. The administrator defines the character string for each architecture.
Syntax:
   ARCH = string
Default value: Use the command llstatus -l to view the default.
Example: To define a machine as an RS/6000®, the keyword would look like:
   ARCH = R6000

BG_ALLOW_LL_JOBS_ONLY
Specifies whether only jobs submitted through LoadLeveler will be accepted by the Blue Gene job launcher program.
Syntax:
   BG_ALLOW_LL_JOBS_ONLY = true | false
Default value: false

BG_CACHE_PARTITIONS
Specifies whether allocated partitions are to be reused for Blue Gene jobs whenever possible.
Syntax:
   BG_CACHE_PARTITIONS = true | false
Default value: true

BG_ENABLED
Specifies whether Blue Gene support is enabled.
Syntax:
   BG_ENABLED = true | false
If the value of this keyword is true, the central manager will load the Blue Gene control system libraries and query the state of the Blue Gene system so that jobs of type bluegene can be scheduled.
Default value: false
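As a sketch, a site enabling Blue Gene support might set these three keywords together; whether to restrict the launcher to LoadLeveler jobs is a local policy choice, and the combination below is illustrative rather than required:
   BG_ENABLED = true
   BG_ALLOW_LL_JOBS_ONLY = true
   BG_CACHE_PARTITIONS = true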
BG_MIN_PARTITION_SIZE
Specifies the smallest number of compute nodes in a partition.
Syntax:
   BG_MIN_PARTITION_SIZE = 32 | 128 | 512 (for Blue Gene/L)
   BG_MIN_PARTITION_SIZE = 16 | 32 | 64 | 128 | 256 | 512 (for Blue Gene/P)
The value for this keyword must not be smaller than the minimum partition size supported by the physical Blue Gene hardware. If the number of compute nodes requested in a job is less than the minimum partition size, LoadLeveler will increase the requested size to the minimum partition size. If the max_psets_per_bp value is set in the DB_PROPERTY file, the value for BG_MIN_PARTITION_SIZE must be set as described in Table 70:

Table 70. BG_MIN_PARTITION_SIZE values

max_psets_per_bp value    BG_MIN_PARTITION_SIZE    BG_MIN_PARTITION_SIZE
in DB_PROPERTY file       for Blue Gene/L          for Blue Gene/P
4                         >= 128                   >= 128
8                         >= 128                   >= 64
16                        >= 32                    >= 32
32                        >= 32                    >= 16

Default value: 32

BIN
Defines the directory where LoadLeveler binaries are kept.
Syntax:
   BIN = $(RELEASEDIR)/bin
Default value: $(tilde)/bin

CENTRAL_MANAGER_HEARTBEAT_INTERVAL
Specifies how frequently, in seconds, the primary and alternate central managers communicate with each other.
Syntax:
   CENTRAL_MANAGER_HEARTBEAT_INTERVAL = number
Default value: The default is 300 seconds (5 minutes).

CENTRAL_MANAGER_TIMEOUT
Specifies the number of heartbeat intervals that an alternate central manager will wait before declaring that the primary central manager is not operating.
Syntax:
   CENTRAL_MANAGER_TIMEOUT = number
Default value: The default is 6.
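Example: With the illustrative settings below, an alternate central manager would declare the primary down after 6 missed heartbeat intervals of 60 seconds each, that is, after about 360 seconds:
   CENTRAL_MANAGER_HEARTBEAT_INTERVAL = 60
   CENTRAL_MANAGER_TIMEOUT = 6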
CKPT_CLEANUP_INTERVAL
Specifies the interval, in seconds, at which the Schedd daemon will run the program specified by the CKPT_CLEANUP_PROGRAM keyword.
Syntax:
   CKPT_CLEANUP_INTERVAL = number
number must be a positive integer.
Default value: -1

CKPT_CLEANUP_PROGRAM
Identifies an administrator-provided program which is to be run at the interval specified by the CKPT_CLEANUP_INTERVAL keyword. The intent of this program is to delete old checkpoint files created by jobs running under LoadLeveler during the checkpoint process.
Syntax:
   CKPT_CLEANUP_PROGRAM = program
Where program is the fully qualified name of the program to be run. The program must be accessible and executable by LoadLeveler. A sample program to remove checkpoint files is provided in the /usr/lpp/LoadL/full/samples/llckpt/rmckptfiles.c file.
Default value: No default value is set.
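For example, a site might run a cleanup program built from the shipped sample once an hour; the installation path below is hypothetical:
   CKPT_CLEANUP_PROGRAM = /u/loadl/bin/rmckptfiles
   CKPT_CLEANUP_INTERVAL = 3600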
CKPT_EXECUTE_DIR
Specifies the directory where the job step's executable will be saved for checkpointable jobs. You can specify this keyword in either the configuration file or the job command file; different file permissions are required depending on where this keyword is set. For additional information, see "Planning considerations for checkpointing jobs" on page 140.
Syntax:
   CKPT_EXECUTE_DIR = directory
This directory cannot be the same as the current location of the executable file, or LoadLeveler will not stage the executable. In this case, the user must have execute permission for the current executable file.
Default value: By default, the executable of a checkpointable job step is not staged.

CLASS
Determines whether a machine will accept jobs of a certain job class. For parallel jobs, you must define a class instance for each task you want to run on a node, using one of two formats:
v The format CLASS = class_name (count) names the classes and sets the number of tasks for each class in parentheses. With this format, the following rules apply:
  – Each class can have only one entry.
  – If a class has more than one entry or there is a syntax error, the entire CLASS statement will be ignored.
  – If the CLASS statement has a blank value or is not specified, it defaults to No_Class (1).
  – The number of instances for a class specified inside the parentheses must be an unsigned integer. If the number specified is 0, it is correct syntactically, but the class will not be defined in LoadLeveler.
  – If the number of instances for all classes in the CLASS statement is 0, the default No_Class (1) will be used.
v The format CLASS = { "class1" "class2" "class2" "class2" } names each class and sets the number of tasks for each class based on the number of times that the class name appears inside the {} operands.
Note: With both formats, the class names list is blank-delimited.
For a LoadLeveler job to run on a machine, the machine must have a vacancy for the class of that job. If the machine is configured for only one No_Class job and a LoadLeveler job is already running there, then no further LoadLeveler jobs are started on that machine until the current job completes.
You can have a maximum of 1024 characters in the class statement. You cannot use allclasses or data_stage as a class name, since these are reserved LoadLeveler keywords.
You can assign multiple classes to the same machine by specifying the classes in the LoadLeveler configuration file (called LoadL_config) or in the local configuration file (called LoadL_config.local). The classes themselves should be defined in the administration file. See "Setting up a single machine to have multiple job classes" on page 723 and "Defining classes" on page 89 for more information on classes.
Syntax:
   CLASS = { "class_name" ... } | {"No_Class"} | class_name (count) ...
Default value: {"No_Class"}
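Example: The two statements below are equivalent ways of allowing two tasks of class small and one task of class large on a machine; the class names are illustrative and would need to be defined in the administration file:
   CLASS = small(2) large(1)
   CLASS = { "small" "small" "large" }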
CLIENT_TIMEOUT
Specifies the maximum time, in seconds, that a daemon waits for a response over TCP/IP from a process. If the waiting time exceeds the specified amount, the daemon tries again to communicate with the process. In general, you should use the default setting unless you are experiencing delays due to an excessively loaded network. If so, you should try increasing this value.
Syntax:
   CLIENT_TIMEOUT = number
Default value: The default is 30 seconds.

CLUSTER_METRIC
Indicates the installation exit to be run by the Schedd to determine where a remote job is distributed. If a remote job is submitted with a list of clusters or the reserved word any and the installation exit is not specified, the remote job is not submitted.
Syntax:
   CLUSTER_METRIC = full_pathname_to_executable
The installation exit is run with the following parameters passed as input. All parameters are character strings.
v The job ID of the job to be distributed
v The number of clusters in the list of clusters
v A blank-delimited list of clusters to be considered
If the user specifies the reserved word any as the cluster_list during job submission, the job is sent to the first outbound Schedd defined for the first configured remote cluster. If the user specifies a list of clusters, the job is sent to the first outbound Schedd defined for the first specified remote cluster. In either case, the CLUSTER_METRIC exit is executed on this machine to determine where the job will be distributed. If this machine is not the outbound_hosts Schedd for the assigned cluster, the job will be forwarded to the correct outbound_hosts Schedd.
Note: The list of clusters may contain a single entry of the reserved word any, which indicates that the CLUSTER_METRIC installation exit must determine its own list of clusters to select from. This can be all of the clusters available using the data access API or a predetermined list set by the administrator. If any is specified in place of a cluster list, the metric will receive a count of 1 followed by the keyword any.
The installation exit must write the remote cluster name to which the job is submitted as standard output and exit with a value of 0. An exit value of -1 indicates an error in determining the cluster for distribution and the job is not submitted. Returned cluster names that are not valid also cause the job not to be submitted. STDERR from the exit is written to the Schedd log.
LoadLeveler provides a set of sample exits for use in distributing jobs by the following metrics:
v The number of jobs in the idle queue
v The number of jobs in the specified class
v The number of free nodes in the cluster
The installation exit samples are available in the ${RELEASEDIR}/samples/llcluster directory.
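For instance, to distribute remote jobs using an exit built from the idle-queue sample, the configuration might look like the following; the path is hypothetical and depends on where the sample was compiled and installed:
   CLUSTER_METRIC = /u/loadl/bin/idle_queue_metric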
CLUSTER_REMOTE_JOB_FILTER
Indicates the installation exit to be run by the inbound Schedd for each remote job request to filter the user's job command file statements during submission or move job. If the keyword is not specified, no job filtering is done.
Syntax:
   CLUSTER_REMOTE_JOB_FILTER = full_pathname_to_executable
The installation exit is run with the submitting user's ID. All parameters are character strings. This installation exit is executed on the inbound_hosts of the local cluster when receiving a job submission or move job request. The executable specified is called with the submitting user's unfiltered job command file statements as the standard input. The standard output is submitted to LoadLeveler. If the exit returns with a nonzero exit code, the remote job submission or job move will fail.
A submit filter can only make changes to LoadLeveler job command file statements. The data access API can be used by the remote job filter to query the Schedd for the job object received from the sending cluster. If the local submission filter on the submitting cluster has added or deleted steps from the original user's job command file, the remote job filter must add or delete the same number of steps. The job command file statements returned by the remote job filter must contain the same number of steps as the job object received from the sending cluster.
Changes to the following job command file keyword statements are ignored:
v executable
v environment
v image_size
v cluster_input_file
v cluster_output_file
v cluster_list
The following job command file keyword will have different behavior:
v initialdir – If not set by the remote job filter or the submitting user's unfiltered job command file, the default value will remain the current working directory at the time the job was submitted. Access to the initialdir will be verified on the cluster selected to run the job. If access to initialdir fails, the submission or move job will fail.
When you distribute a scale-across job to other clusters for scheduling and a remote job filter is configured, the filter will be applied to the distributed job. However, only changes to the following job command file keyword statements will be accepted; changes to any other statement by the remote job filter will be ignored:
v #@ class
v #@ priority
v #@ as_limit
v #@ core_limit
v #@ cpu_limit
v #@ data_limit
v #@ file_limit
v #@ job_cpu_limit
v #@ locks_limit
v #@ memlock_limit
v #@ nofile_limit
v #@ nproc_limit
v #@ rss_limit
v #@ stack_limit
To maintain compatibility between the SUBMIT_FILTER and CLUSTER_REMOTE_JOB_FILTER programs, the following environment variables are set when either exit is invoked:
v LOADL_ACTIVE – the LoadLeveler version.
v LOADL_STEP_COMMAND – the location of the job command file passed as input to the program. This job command file only contains LoadLeveler keywords.
v LOADL_STEP_ID – the job identifier, generated by the submitting LoadLeveler cluster.
  Note: The environment variable name is LOADL_STEP_ID although the value it contains is a "job" identifier. This name is used to be compatible with the local job filter interface.
v LOADL_STEP_OWNER – the owner (UNIX user name) of the job.
CLUSTER_USER_MAPPER
Indicates the installation exit to be run by the inbound Schedd for each remote job request to determine the user mapping of the cluster. This keyword implies that user mapping is performed. If the keyword is not specified, no user mapping is done.
Syntax:
   CLUSTER_USER_MAPPER = full_pathname_to_executable
The installation exit is run with the following parameters passed as input. All parameters are character strings.
v The user name to be mapped
v The name of the cluster where the user originated
This installation exit is executed on the inbound_hosts of the local cluster when receiving a job submission, move job request, or remote command.
The installation exit must write the new user name as standard output and exit with a value of 0. An exit value of -1 indicates an error and the job is not submitted. An exit value of 1 indicates that the user name returned for this job was not mapped. STDERR from the exit is written to the Schedd log.

CM_CHECK_USERID
Specifies whether the central manager will check the existence of user IDs that sent requests through a command or API on the central manager machine.
Syntax:
   CM_CHECK_USERID = true | false
Default value: true

COLLECTOR_DGRAM_PORT
Specifies the port number used when connecting to the daemon.
Syntax:
   COLLECTOR_DGRAM_PORT = port number
Default value: The default is 9612.

COMM
Specifies a local directory where LoadLeveler keeps special files used for UNIX domain sockets for communicating among LoadLeveler daemons running on the same machine. This keyword allows the administrator to choose a file system other than /tmp for these files. If you change the COMM option, you must stop and then restart LoadLeveler using the llctl command.
Syntax:
   COMM = local directory
Default value: The default location for the files is /tmp.

CONTINUE
Determines whether suspended jobs should continue execution.
Syntax:
   CONTINUE: expression that evaluates to T or F (true or false)
When T, suspended LoadLeveler jobs resume execution on the machine.
Default value: No default value is set.
For information about time-related variables that you may use for this keyword, see "Variables to use for setting times" on page 320.
CUSTOM_METRIC
Specifies a machine's relative priority to run jobs.
Syntax:
   CUSTOM_METRIC = number
This is an arbitrary number which you can use in the MACHPRIO expression. Negative values are not allowed.
Default value: If you specify neither CUSTOM_METRIC nor CUSTOM_METRIC_COMMAND, CUSTOM_METRIC = 1 is assumed. For more information, see "Setting negotiator characteristics and policies" on page 45.
For more information related to using this keyword, see "Defining a LoadLeveler cluster" on page 44.

CUSTOM_METRIC_COMMAND
Specifies an executable and any required arguments. The exit code of this command is assigned to CUSTOM_METRIC. If this command does not exit normally, CUSTOM_METRIC is assigned a value of 1. This command is forked every (POLLING_FREQUENCY * POLLS_PER_UPDATE) period.
Syntax:
   CUSTOM_METRIC_COMMAND = command
Default value: No default is set; LoadLeveler does not run any command to determine CUSTOM_METRIC.
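For example, a site could rank machines with a locally written program whose exit code reflects each machine's suitability, and then feed that value into the machine priority expression (MACHPRIO is described later in this topic); the program path is hypothetical:
   CUSTOM_METRIC_COMMAND = /u/loadl/bin/rank_machine
   MACHPRIO : CustomMetric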
DCE_AUTHENTICATION_PAIR
Specifies a pair of installation-supplied programs that are used to authenticate DCE security credentials.
Restriction: DCE security is not supported by LoadLeveler for Linux.
Syntax:
   DCE_AUTHENTICATION_PAIR = program1, program2
Where program1 and program2 are LoadLeveler- or installation-supplied programs that are used to authenticate DCE security credentials. program1 obtains a handle (an opaque credentials object), at the time the job is submitted, which is used to authenticate to DCE. program2 uses the handle obtained by program1 to authenticate to DCE before starting the job on the executing machines.
Default value: See "Handling DCE security credentials" on page 74 for information about defaults.

DEFAULT_PREEMPT_METHOD
Specifies the default preemption method for LoadLeveler to use when a preempt method is not specified in a PREEMPT_CLASS statement or in the llpreempt command. LoadLeveler also uses this default preemption method to preempt job steps that are running on reserved machines when a reservation period begins.
Restrictions:
v This keyword is valid only for the BACKFILL scheduler.
v The suspend method of preemption (the default) might not be supported on your level of Linux. If you want to preempt jobs that are running where process tracking is not supported, you must use this keyword to specify a method other than suspend.
Syntax:
   DEFAULT_PREEMPT_METHOD = rm | sh | su | vc | uh
Valid values are:
rm LoadLeveler preempts the jobs and removes them from the job queue. To rerun the job, the user must resubmit the job to LoadLeveler.
sh LoadLeveler ends the jobs and puts them into System Hold state. They remain in that state on the job queue until an administrator releases them. After being released, the jobs go into Idle state and will be rescheduled to run as soon as resources for the job are available.
su LoadLeveler suspends the jobs and puts them in Preempted state. They remain in that state on the job queue until the preempting job has terminated and resources are available to resume the preempted job on the same set of nodes. To use this value, process tracking must be enabled.
vc LoadLeveler ends the jobs and puts them in Vacate state. They remain in that state on the job queue and will be rescheduled to run as soon as resources for the job are available.
uh LoadLeveler ends the jobs and puts them into User Hold state. They remain in that state on the job queue until an administrator releases them. After being released, the jobs go into Idle state and will be rescheduled to run as soon as resources for the job are available.
Default value: su (suspend method)
For more information related to using this keyword, see "Steps for configuring a scheduler to preempt jobs" on page 130.
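On a Linux cluster where process tracking is not supported, for instance, an administrator might make vacate the default preemption method instead of suspend:
   DEFAULT_PREEMPT_METHOD = vc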
DRAIN_ON_SWITCH_TABLE_ERROR
Specifies whether the startd should be drained when the switch table fails to unload, flagging the administrator that intervention may be required to unload the switch table.
Syntax:
   DRAIN_ON_SWITCH_TABLE_ERROR = true | false
Default value: false

DSTG_MAX_STARTERS
Specifies a machine-specific limit on the number of data staging initiators. Since each task of a data staging job step consumes one initiator from the data_stage class on the specified machine, DSTG_MAX_STARTERS provides the maximum number of data staging tasks that can run at the same time on the machine.
Syntax:
   DSTG_MAX_STARTERS = number
Notes:
1. If you have not set the DSTG_MAX_STARTERS value in either the global or local configuration files, there will not be any data staging initiators on the specified machine. In this configuration, the compute node will not be allowed to perform data staging tasks.
2. The value specified for DSTG_MAX_STARTERS will be the number of initiators available for the built-in data_stage class on that machine.
3. The value specified for MAX_STARTERS will not limit the value specified for DSTG_MAX_STARTERS.
Default value: 0

DSTG_MIN_SCHEDULING_INTERVAL
Specifies a minimum interval between scheduling inbound data staging job steps when they cannot be scheduled immediately. With a workload that involves many data staging jobs, this keyword can be adjusted down from the default value of 900 seconds if data staging jobs remain idle when there are data staging resources available. Setting this keyword to a smaller interval may impact scheduler performance when there is contention for data staging resources and a large number of idle jobs in the queue.
Syntax:
   DSTG_MIN_SCHEDULING_INTERVAL = seconds
Notes:
1. You can only specify this keyword in the global configuration file; it will be ignored in local configuration files.
2. LoadLeveler ignores DSTG_MIN_SCHEDULING_INTERVAL when DSTG_TIME = AT_SUBMIT.
Default value: 900 seconds

DSTG_TIME
Specifies one of the following:
AT_SUBMIT
   LoadLeveler can schedule data staging steps any time after a job requiring data staging has been submitted.
JUST_IN_TIME
   LoadLeveler must schedule data staging job steps as close as possible to the application job steps that were submitted in the same job.
Syntax:
   DSTG_TIME = AT_SUBMIT | JUST_IN_TIME
Note: You can only specify the DSTG_TIME keyword in the global configuration file. Any value specified for this keyword in local configuration files will be ignored.
Default value: AT_SUBMIT
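A cluster that stages data immediately before the application runs might use settings like these; the values are illustrative:
   DSTG_TIME = JUST_IN_TIME
   DSTG_MAX_STARTERS = 2
   DSTG_MIN_SCHEDULING_INTERVAL = 300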
ENFORCE_RESOURCE_MEMORY
Specifies whether the AIX Workload Manager is configured to limit, as precisely as possible, the real memory usage of a WLM class. For this keyword to be valid, ConsumableMemory must be set through the ENFORCE_RESOURCE_USAGE keyword.
Syntax:
   ENFORCE_RESOURCE_MEMORY = true | false
Default value: false

ENFORCE_RESOURCE_POLICY
Specifies what type of resource entitlements will be assigned to the AIX Workload Manager classes:
v shares means a share value is assigned to the class based on the job step's requested resources (one unit of resource equals one share). This is the default policy.
v soft means a percentage value is assigned to the class based on the job step's requested resources and the total machine resources. This percentage can be exceeded if there is no contention for the resource.
v hard means a percentage value is assigned to the class based on the job step's requested resources and the total machine resources. This percentage cannot be exceeded regardless of the contention for the resource.
This keyword is only valid for CPU and real memory with either shares or percent limits. If desired, this keyword can be used in the LoadL_config.local file to set up a different policy for each machine. The ENFORCE_RESOURCE_USAGE keyword must be set for this keyword to be valid.
Syntax:
   ENFORCE_RESOURCE_POLICY = hard | soft | shares
Default value: shares

ENFORCE_RESOURCE_SUBMISSION
Indicates whether jobs submitted should be checked for the resources and node_resources keywords. If the value specified is true, LoadLeveler will check all jobs at submission time for the resources and node_resources keywords. The job command file resources and node_resources keywords combined need to specify at least the resources named in the ENFORCE_RESOURCE_USAGE keyword in order for the job to be submitted successfully. When RSET_MCM_AFFINITY is enabled, the task_affinity or parallel_threads keyword can be used instead of the resources and node_resources keywords when the resource being enforced is ConsumableCpus. If the value specified is false, no checking will be done and jobs submitted without the resources or node_resources keywords will not have resources enforced. In this instance, those jobs might interfere with other jobs whose resources are enforced.
Syntax:
   ENFORCE_RESOURCE_SUBMISSION = true | false
Default value: false

ENFORCE_RESOURCE_USAGE
Specifies whether the AIX Workload Manager is used to enforce CPU and memory resources. This keyword accepts either a value of deactivate or a list of one or more of the following predefined resources:
v ConsumableCpus
v ConsumableMemory
v ConsumableVirtualMemory
v ConsumableLargePageMemory
Either memory or CPUs or both can be enforced, but the resources must also be specified on the SCHEDULE_BY_RESOURCES keyword. If deactivate is specified, LoadLeveler will deactivate AIX Workload Manager on all the nodes in the LoadLeveler cluster.
Restriction: WLM enforcement is ignored by LoadLeveler for Linux.
Syntax:
   ENFORCE_RESOURCE_USAGE = name name ... name | deactivate
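To enforce CPU and memory with WLM, the related keywords are typically set together; the combination below is an illustrative sketch (SCHEDULE_BY_RESOURCES has its own entry later in this alphabetical list):
   SCHEDULE_BY_RESOURCES = ConsumableCpus ConsumableMemory
   ENFORCE_RESOURCE_USAGE = ConsumableCpus ConsumableMemory
   ENFORCE_RESOURCE_SUBMISSION = true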
EXECUTE
Specifies the local directory to store the executables of jobs submitted by other machines.
Syntax:
   EXECUTE = local directory/execute
Default value: $(tilde)/execute

FAIR_SHARE_INTERVAL
Specifies, in units of hours, the time interval it takes for resource usage in fair share scheduling to decay to 5% of its initial value. Historic fair share data collected before the most recent time interval of this length will have little impact on fair share scheduling.
Syntax:
   FAIR_SHARE_INTERVAL = hours
Default value: The default value is 168 hours (one week). If a negative value or 0 is specified, the default value is used.

FAIR_SHARE_TOTAL_SHARES
Specifies the total number of shares that the cluster CPU or Blue Gene resources are divided into. If this value is less than or equal to 0, fair share scheduling is turned off.
Syntax:
   FAIR_SHARE_TOTAL_SHARES = shares
Default value: The default value is 0.
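For instance, to turn fair share scheduling on with 100 total shares and a two-week usage decay interval (both values are illustrative):
   FAIR_SHARE_TOTAL_SHARES = 100
   FAIR_SHARE_INTERVAL = 336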
FEATURE
Specifies an optional characteristic to use to match jobs with machines. You can specify unique characteristics for any machine using this keyword. When evaluating job submissions, LoadLeveler compares any required features specified in the job command file to those specified using this keyword. You can have a maximum of 1024 characters in the feature statement.
Syntax:
   Feature = {"string" ...}
Default value: No default value is set.
Example: If a machine has licenses for installed products ABC and XYZ, in the local configuration file you can enter the following:
   Feature = {"abc" "xyz"}
When submitting a job that requires both of these products, you should enter the following in your job command file:
   requirements = (Feature == "abc") && (Feature == "xyz")
Note: You must define a feature on all machines that will be able to run dynamic simultaneous multithreading (SMT). SMT is only supported on POWER6 and POWER5 processor-based systems.
Example: When submitting a job that requires the SMT function, first specify smt = yes in the job command file (or select a class which has smt = yes defined). Next, specify node_usage = not_shared and, last, enter the following in the job command file:
   requirements = (Feature == "smt")

FLOATING_RESOURCES
Specifies which consumable resources are available collectively on all of the machines in the LoadLeveler cluster. The count for each resource must be an integer greater than or equal to zero, and each resource can only be specified once in the list. Any resource specified for this keyword that is not already listed in the SCHEDULE_BY_RESOURCES keyword will not affect job scheduling. If any resource is specified incorrectly with the FLOATING_RESOURCES keyword, then all floating resources will be ignored. ConsumableCpus, ConsumableMemory, ConsumableVirtualMemory, and ConsumableLargePageMemory may not be specified as floating resources.
Syntax:
   FLOATING_RESOURCES = name(count) name(count) ... name(count)
Default value: No default value is set.
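Example: A cluster-wide pool of 20 licenses for a product could be modeled as a floating resource; the resource name below is hypothetical, and the resource must also appear on the SCHEDULE_BY_RESOURCES keyword for it to affect scheduling:
   FLOATING_RESOURCES = abc_license(20)
   SCHEDULE_BY_RESOURCES = abc_license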
FS_INTERVAL
Defines the number of minutes used as the interval for checking free file system space or inodes. If your file system receives many log messages or copies large executables to the LoadLeveler spool, the file system will fill up more quickly and you should perform file size checking more frequently by setting the interval to a smaller value. LoadLeveler will not check the file system if the value of FS_INTERVAL is:
v Set to zero
v Set to a negative integer
Syntax:
   FS_INTERVAL = minutes
Default value: If FS_INTERVAL is not specified but any of the other file-system keywords (FS_NOTIFY, FS_SUSPEND, FS_TERMINATE, INODE_NOTIFY, INODE_SUSPEND, INODE_TERMINATE) are specified, the FS_INTERVAL value will default to 5 and the file system will be checked. If no file-system or inode keywords are set, LoadLeveler does not monitor file systems at all.
For more information related to using this keyword, see "Setting up file system monitoring" on page 54.

FS_NOTIFY
Defines the lower and upper amounts, in bytes, of free file-system space at which LoadLeveler is to notify the administrator:
v If the amount of free space becomes less than the lower threshold value, LoadLeveler sends a mail message to the administrator indicating that logging problems may occur.
v When the amount of free space becomes greater than the upper threshold value, LoadLeveler sends a mail message to the administrator indicating that the problem has been resolved.
Syntax:
   FS_NOTIFY = lower threshold, upper threshold
Specify space in bytes with the unit B. A metric prefix such as K, M, or G may precede the B. The valid values for both the lower and upper thresholds are -1B and all positive integers. If the value is set to -1, the transition across the threshold is not checked.
Default value: In bytes: 1KB, -1B
For more information related to using this keyword, see "Setting up file system monitoring" on page 54.

FS_SUSPEND
Defines the lower and upper amounts, in bytes, of free file system space at which LoadLeveler drains and resumes the Schedd and startd daemons running on a node.
v If the amount of free space becomes less than the lower threshold value, LoadLeveler drains the Schedd and the startd daemons if they are running on the node. When this happens, logging is turned off and mail notification is sent to the administrator.
v When the amount of free space becomes greater than the upper threshold value, LoadLeveler signals the Schedd and the startd daemons to resume. When this happens, logging is turned on and mail notification is sent to the administrator.
Syntax:
   FS_SUSPEND = lower threshold, upper threshold
Specify space in bytes with the unit B. A metric prefix such as K, M, or G may precede the B. The valid values for both the lower and upper thresholds are -1B and all positive integers. If the value is set to -1, the transition across the threshold is not checked.
Default value: In bytes: -1B, -1B
For more information related to using this keyword, see "Setting up file system monitoring" on page 54.

FS_TERMINATE
Defines the lower and upper amounts, in bytes, of free file system space at which LoadLeveler is terminated. This keyword sends the SIGTERM signal to the Master daemon, which then terminates all LoadLeveler daemons running on the node.
v If the amount of free space becomes less than the lower threshold value, all LoadLeveler daemons are terminated.
v An upper threshold value is required for this keyword. However, since LoadLeveler has been terminated at the lower threshold, no action occurs.
Syntax:
   FS_TERMINATE = lower threshold, upper threshold
Specify space in bytes with the unit B. A metric prefix such as K, M, or G may precede the B. The valid values for the lower threshold are -1B and all positive integers. If the value is set to -1, the transition across the threshold is not checked.
Default value: In bytes: -1B, -1B
For more information related to using this keyword, see "Setting up file system monitoring" on page 54.
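As a sketch, a site might check every 5 minutes, warn the administrator at 100 MB of free space, drain the daemons at 50 MB, and terminate LoadLeveler at 10 MB; all thresholds below are illustrative, and each upper value controls when the corresponding condition is considered resolved:
   FS_INTERVAL = 5
   FS_NOTIFY = 100MB, 200MB
   FS_SUSPEND = 50MB, 100MB
   FS_TERMINATE = 10MB, 20MB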
GLOBAL_HISTORY
Identifies the directory that will contain the global history files produced by the llacctmrg command when no directory is specified as a command argument.
Syntax:
   GLOBAL_HISTORY = directory
Default value: The default value is $(SPOOL) (the local spool directory).
For more information related to using this keyword, see "Collecting the accounting information and storing it into files" on page 66.

GSMONITOR
Location of the gsmonitor executable (LoadL_GSmonitor).
Restriction: This keyword is ignored by LoadLeveler for Linux.
Syntax:
   GSMONITOR = directory
Default value: $(BIN)/LoadL_GSmonitor

GSMONITOR_COREDUMP_DIR
Local directory for storing LoadL_GSmonitor core dump files.
Restriction: This keyword is ignored by LoadLeveler for Linux.
Syntax:
   GSMONITOR_COREDUMP_DIR = directory
Default value: The /tmp directory.
For more information related to using this keyword, see "Specifying file and directory locations" on page 47.

GSMONITOR_DOMAIN
Specifies the peer domain on which the GSMONITOR daemon will execute.
Restriction: This keyword is ignored by LoadLeveler for Linux.
Syntax:
   GSMONITOR_DOMAIN = PEER
Default value: No default value is set.
For more information related to using this keyword, see "The gsmonitor daemon" on page 14.

GSMONITOR_RUNS_HERE
Specifies whether the gsmonitor daemon will run on the host.
Restriction: This keyword is ignored by LoadLeveler for Linux.
Syntax:
   GSMONITOR_RUNS_HERE = TRUE | FALSE
Default value: FALSE
For more information related to using this keyword, see "The gsmonitor daemon" on page 14.

HISTORY
Defines the path name where a file containing the history of local LoadLeveler jobs is kept.
Syntax:
   HISTORY = directory
Default value: $(SPOOL)/history
For more information related to using this keyword, see "Collecting the accounting information and storing it into files" on page 66.
HISTORY_PERMISSION
Specifies the owner, group, and world permissions of the history file associated with a LoadL_schedd daemon.
Syntax:
   HISTORY_PERMISSION = permissions | rw-rw----
permissions must be a string with a length of nine characters consisting of the characters r, w, x, or -.
Default value: The default settings are 660 (rw-rw----). LoadL_schedd will use the default setting if the specified permissions are less than rw-------.
Example: A specification such as HISTORY_PERMISSION = rw-rw-r-- will result in permission settings of 664.

INODE_NOTIFY
Defines the lower and upper amounts, in inodes, of free file-system inodes at which LoadLeveler is to notify the administrator:
v If the number of free inodes becomes less than the lower threshold value, LoadLeveler sends a mail message to the administrator indicating that logging problems may occur.
v When the number of free inodes becomes greater than the upper threshold value, LoadLeveler sends a mail message to the administrator indicating that the problem has been resolved.
Syntax:
   INODE_NOTIFY = lower threshold, upper threshold
The valid values for both the lower and upper thresholds are -1 and all positive integers. If the value is set to -1, the transition across the threshold is not checked.
Default value: In inodes: 1000, -1
For more information related to using this keyword, see "Setting up file system monitoring" on page 54.

INODE_SUSPEND
Defines the lower and upper amounts, in inodes, of free file system inodes at which LoadLeveler drains and resumes the Schedd and startd daemons running on a node.
v If the number of free inodes becomes less than the lower threshold value, LoadLeveler drains the Schedd and the startd daemons if they are running on the node. When this happens, logging is turned off and mail notification is sent to the administrator.
v When the number of free inodes becomes greater than the upper threshold value, LoadLeveler signals the Schedd and the startd daemons to resume. When this happens, logging is turned on and mail notification is sent to the administrator.
Syntax:
   INODE_SUSPEND = lower threshold, upper threshold
The valid values for both the lower and upper thresholds are -1 and all positive integers. If the value is set to -1, the transition across the threshold is not checked.
Default value: In inodes: -1, -1
For more information related to using this keyword, see "Setting up file system monitoring" on page 54.
INODE_TERMINATE
Defines the lower and upper amounts, in inodes, of free file system inodes at which LoadLeveler is terminated. This keyword sends the SIGTERM signal to the Master daemon, which then terminates all LoadLeveler daemons running on the node.
v If the number of free inodes becomes less than the lower threshold value, all LoadLeveler daemons are terminated.
v An upper threshold value is required for this keyword. However, since LoadLeveler has been terminated at the lower threshold, no action occurs.
Syntax:
   INODE_TERMINATE = lower threshold, upper threshold
The valid values for the lower threshold are -1 and all positive integers. If the value is set to -1, the transition across the threshold is not checked.
Default value: In inodes: -1, -1
For more information related to using this keyword, see "Setting up file system monitoring" on page 54.

JOB_ACCT_Q_POLICY
Specifies the amount of time, in seconds, that determines how often the startd daemon updates the Schedd daemon with accounting data of running jobs. This controls the accuracy of the llq -x command.
Syntax:
   JOB_ACCT_Q_POLICY = number
Default value: 300 seconds
For more information related to using this keyword, see "Gathering job accounting data" on page 61.

JOB_EPILOG
Path name of the epilog program.
Syntax:
   JOB_EPILOG = program name
Default value: No default value is set.
For more information related to using this keyword, see "Writing prolog and epilog programs" on page 77.

JOB_LIMIT_POLICY
Specifies the interval, in seconds, at which LoadLeveler checks whether job_cpu_limit has been exceeded. The smaller of JOB_LIMIT_POLICY and JOB_ACCT_Q_POLICY is used to control how often the startd daemon collects resource consumption data on running jobs, and how often the job_cpu_limit is checked.
Syntax:
   JOB_LIMIT_POLICY = number
Default value: The default for JOB_LIMIT_POLICY is POLLING_FREQUENCY multiplied by POLLS_PER_UPDATE.

JOB_PROLOG
Path name of the prolog program.
Syntax:
   JOB_PROLOG = program name
Default value: No default value is set.
For more information related to using this keyword, see "Writing prolog and epilog programs" on page 77.
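For example, a site-written prolog and epilog pair could be configured as follows; both paths are hypothetical:
   JOB_PROLOG = /u/loadl/bin/job_prolog
   JOB_EPILOG = /u/loadl/bin/job_epilog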
JOB_USER_EPILOG
Path name of the user epilog program.
Syntax:
   JOB_USER_EPILOG = program name
Default value: No default value is set.
For more information related to using this keyword, see "Writing prolog and epilog programs" on page 77.

JOB_USER_PROLOG
Path name of the user prolog program.
Syntax:
   JOB_USER_PROLOG = program name
Default value: No default value is set.
For more information related to using this keyword, see "Writing prolog and epilog programs" on page 77.

KBDD
Location of the kbdd executable (LoadL_kbdd).
Syntax:
   KBDD = directory
Default value: $(BIN)/LoadL_kbdd

KBDD_COREDUMP_DIR
Local directory for storing LoadL_kbdd daemon core dump files.
Syntax:
   KBDD_COREDUMP_DIR = directory
Default value: The /tmp directory.
For more information related to using this keyword, see "Specifying file and directory locations" on page 47.

KILL
Determines whether or not vacated jobs should be sent the SIGKILL signal and replaced in the queue. It is used to remove a job that is taking too long to vacate.
Syntax:
   KILL: expression that evaluates to T or F (true or false)
When T, vacated LoadLeveler jobs are removed from the machine with no attempt to take checkpoints.
For information about time-related variables that you may use for this keyword, see "Variables to use for setting times" on page 320.

LIB
Defines the directory where LoadLeveler libraries are kept.
Syntax:
   LIB = directory
Default value: $(RELEASEDIR)/lib
LL_RSH_COMMAND
Specifies an administrator-provided executable to be used by llctl start when starting LoadLeveler on remote machines in the administration file. The LL_RSH_COMMAND keyword is any executable that can be used as a substitute for /usr/bin/rsh. The llctl start command passes arguments to the executable specified by LL_RSH_COMMAND in the following format:
   LL_RSH_COMMAND hostname -n llctl start options
Syntax:
   LL_RSH_COMMAND = full_path_to_executable
This keyword must specify the full path name of the executable provided. If no value is specified, LoadLeveler will use /usr/bin/rsh as the default when issuing a start. If an error occurs while locating the specified executable, an error message is displayed.
Default value: /usr/bin/rsh
Example: This example shows that using the secure shell (/usr/bin/ssh) is the preferred method for the llctl start command to communicate with remote nodes. Specify the following in the configuration file:
   LL_RSH_COMMAND=/usr/bin/ssh

LOADL_ADMIN
Specifies a list of LoadLeveler administrators.
Syntax:
   LOADL_ADMIN = list of user names
Where list of user names is a blank-delimited list of those individuals who will have administrative authority. These users are able to invoke the administrator-only commands such as llctl, llfavorjob, and llfavoruser. These administrators can also invoke the administrator-only GUI functions. For more information, see Chapter 7, "Using LoadLeveler's GUI to perform administrator tasks," on page 169.
Default value: No default value is set, which means no one has administrator authority until this keyword is defined with one or more user names.
Example: To grant administrative authority to users bob and mary, enter the following in the configuration file:
   LOADL_ADMIN = bob mary
For more information related to using this keyword, see "Defining LoadLeveler administrators" on page 43.

LOCAL_CONFIG
Specifies the path name of the optional local configuration file containing information specific to a node in the LoadLeveler network.
Syntax:
   LOCAL_CONFIG = directory
Default value: No default value is set.
Examples:
v If you are using a distributed file system like NFS, some examples are:
   LOCAL_CONFIG = $(tilde)/$(host).LoadL_config.local
   LOCAL_CONFIG = $(tilde)/LoadL_config.$(host).$(domain)
   LOCAL_CONFIG = $(tilde)/LoadL_config.local.$(hostname)
  See "LoadLeveler variables" on page 314 for information about the tilde, host, and domain variables.
v If you are using a local file system, an example is:
   LOCAL_CONFIG = /var/LoadL/LoadL_config.local

LOG
Defines the local directory to store log files. It is not necessary to keep all the log files created by the various LoadLeveler daemons and programs in one directory, but you will probably find it convenient to do so.
Syntax:
   LOG = local directory/log
Default value: $(tilde)/log

LOG_MESSAGE_THRESHOLD
Specifies the maximum amount of memory, in bytes, for the message queue. Messages in the queue are waiting to be written to the log file. When the message logging thread cannot write messages to the log file as fast as they arrive, the memory consumed by the message queue can exceed the threshold. In this case, LoadLeveler will curtail logging by turning off all debug flags except D_ALWAYS, thereby reducing the amount of logging that takes place. If the threshold is exceeded by the curtailed message queue, message logging is stopped. Special log messages are written to the log file indicating that some messages are missing. Mail is also sent to the administrator indicating that messages are missing. A value of -1 for this keyword will turn off the buffer threshold, meaning that the threshold is unlimited.
Syntax:
   LOG_MESSAGE_THRESHOLD = bytes
Default value: 20*1024*1024 (bytes)

MACHINE_AUTHENTICATE
Specifies whether machine validation is performed. When set to true, LoadLeveler only accepts connections from machines specified in the administration file. When set to false, LoadLeveler accepts connections from any machine.
When set to true, every communication between LoadLeveler processes will verify that the sending process is running on a machine which is identified via a machine stanza in the administration file. The validation is done by capturing the address of the sending machine when the accept function call is issued to accept a connection. The gethostbyaddr function is called to translate the address to a name, and the name is matched with the list derived from the administration file.
Note: You must not set the MACHINE_AUTHENTICATE keyword to true for a cluster which is configured to be a main scale-across cluster. The main scale-across cluster must permit communication with LoadLeveler daemons running on any machine in any cluster participating in the scale-across multicluster environment.
Syntax:
   MACHINE_AUTHENTICATE = true | false
Default value: false
For more information related to using this keyword, see "Defining a LoadLeveler cluster" on page 44.
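Example: To accept connections only from machines that have stanzas in the administration file, and to double the default message-queue threshold (an illustrative tuning choice, 40*1024*1024 bytes):
   MACHINE_AUTHENTICATE = true
   LOG_MESSAGE_THRESHOLD = 41943040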
MACHINE_UPDATE_INTERVAL
Specifies the time, in seconds, during which machines must report to the central manager.
Syntax:
   MACHINE_UPDATE_INTERVAL = number
Where number specifies the time period, in seconds, during which machines must report to the central manager. Machines that do not report in this number of seconds are considered down. number must be a numerical value and cannot be an arithmetic expression.
Default value: The default is 300 seconds.
For more information related to using this keyword, see "Setting negotiator characteristics and policies" on page 45.

MACHPRIO
Machine priority expression.
Syntax:
   MACHPRIO = expression
You can use the following LoadLeveler variables in the MACHPRIO expression:
v LoadAvg
v Connectivity
v Cpus
v Speed
v Memory
v VirtualMemory
v Disk
v CustomMetric
v MasterMachPriority
v ConsumableCpus
v ConsumableMemory
v ConsumableVirtualMemory
v ConsumableLargePageMemory
v PagesFreed
v PagesScanned
v FreeRealMemory
For detailed descriptions of these variables, see "LoadLeveler variables" on page 314.
Default value: (0 - LoadAvg)
Examples:
v Example 1
  This example orders machines by the Berkeley one-minute load average.
     MACHPRIO : 0 - (LoadAvg)
  Therefore, if LoadAvg equals .7, this example would read:
     MACHPRIO : 0 - (.7)
  The MACHPRIO would evaluate to -.7.
v Example 2
  This example orders machines by the Berkeley one-minute load average normalized for machine speed:
     MACHPRIO : 0 - (1000 * (LoadAvg / (Cpus * Speed)))
  Therefore, if LoadAvg equals .7, Cpus equals 1, and Speed equals 2, this example would read:
     MACHPRIO : 0 - (1000 * (.7 / (1 * 2)))
  This example further evaluates to:
     MACHPRIO : 0 - (350)
  The MACHPRIO would evaluate to -350. Notice that if the speed of the machine were increased to 3, the equation would read:
     MACHPRIO : 0 - (1000 * (.7 / (1 * 3)))
  The MACHPRIO would evaluate to approximately -233. Therefore, as the speed of the machine increases, the MACHPRIO also increases.
v Example 3
  This example orders machines accounting for real memory and available swap space (remembering that Memory is in Mbytes and VirtualMemory is in Kbytes):
     MACHPRIO : 0 - (10000 * (LoadAvg / (Cpus * Speed))) + (10 * Memory) + (VirtualMemory / 1000)
v Example 4
  This example sets a relative machine priority based on the value of the CUSTOM_METRIC keyword.
     MACHPRIO : CustomMetric
  To do this, you must specify a value for the CUSTOM_METRIC keyword or the CUSTOM_METRIC_COMMAND keyword in either the LoadL_config.local file of a machine or in the global LoadL_config file. To assign the same relative priority to all machines, specify the CUSTOM_METRIC keyword in the global configuration file. For example:
     CUSTOM_METRIC = 5
  You can override this value for an individual machine by specifying a different value in that machine's LoadL_config.local file.
v Example 5
  This example gives master nodes the highest priority:
     MACHPRIO : (MasterMachPriority * 10000)
v Example 6
  This example gives the highest priority to nodes with the highest percentage of switch adapters with connectivity:
     MACHPRIO : Connectivity
For more information related to using this keyword, see "Setting negotiator characteristics and policies" on page 45.

MAIL
Name of a local mail program used to override default mail notification.
Syntax:
   MAIL = program name
Default value: No default value is set.
For more information related to using this keyword, see "Using your own mail program" on page 81.
MASTER
Location of the master executable (LoadL_master).
Syntax:
   MASTER = directory
Default value: $(BIN)/LoadL_master
For more information related to using this keyword, see "How LoadLeveler daemons process jobs" on page 8.

MASTER_COREDUMP_DIR
Local directory for storing LoadL_master core dump files.
Syntax:
   MASTER_COREDUMP_DIR = directory
Default value: The /tmp directory.
For more information related to using this keyword, see "Specifying file and directory locations" on page 47.

MASTER_DGRAM_PORT
The port number used when connecting to the daemon.
Syntax:
   MASTER_DGRAM_PORT = port number
Default value: The default is 9617.
For more information related to using this keyword, see "Defining network characteristics" on page 47.

MASTER_STREAM_PORT
Specifies the port number to be used when connecting to the daemon.
Syntax:
   MASTER_STREAM_PORT = port number
Default value: The default is 9616.
For more information related to using this keyword, see "Defining network characteristics" on page 47.

MAX_CKPT_INTERVAL
The maximum number of seconds between checkpoints for running jobs.
Syntax:
   MAX_CKPT_INTERVAL = number
Default value: 7200 (2 hours)
For more information related to using this keyword, see "LoadLeveler support for checkpointing jobs" on page 139.
MAX_JOB_REJECT
Determines the number of times a job is rejected before it is canceled or put in User Hold or System Hold status.
Syntax:
   MAX_JOB_REJECT = number
number must be a numerical value and cannot be an arithmetic expression. MAX_JOB_REJECT may be set to unlimited rejects by specifying a value of -1.
Default value: The default value is 0, which indicates a rejected job will immediately be canceled or placed on hold.
For related information, see the NEGOTIATOR_REJECT_DEFER keyword.

MAX_RESERVATIONS
Specifies the maximum number of reservations that this LoadLeveler cluster can have. Only reservations in waiting and in use are counted toward this limit; LoadLeveler does not count reservations that have already ended or are in the process of being canceled.
Notes:
1. Having too many reservations in a LoadLeveler cluster can have performance impacts. Administrators should select a suitable value for this keyword.
2. A recurring reservation only counts as one reservation toward the MAX_RESERVATIONS limit, regardless of the number of times that the reservation recurs.
Syntax:
   MAX_RESERVATIONS = number
The value for this keyword can be 0 or a positive integer.
Default value: The default is 10.

MAX_STARTERS
Specifies the maximum number of tasks that can run simultaneously on a machine. In this case, a task can be a serial job step or a parallel task. MAX_STARTERS defines the number of initiators on the machine (the number of tasks that can be initiated from a startd).
Syntax:
   MAX_STARTERS = number
Default value: If this keyword is not specified, the default is the number of elements in the Class statement.
For more information related to using this keyword, see "Specifying how many jobs a machine can run" on page 55.
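Example: With the illustrative statements below, the machine advertises four class slots but runs at most three tasks at the same time, whatever mix of classes they come from:
   CLASS = small(3) large(1)
   MAX_STARTERS = 3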
MAX_TOP_DOGS
Specifies the maximum total number of top dogs that the central manager daemon will allocate. When scheduling jobs, after MAX_TOP_DOGS total top dogs have been allocated, no more will be considered.
Syntax:
   MAX_TOP_DOGS = k
where k is a non-negative integer specifying the global maximum top dogs limit.
Default value: The default value is 1.
For more information related to using this keyword, see "Using the BACKFILL scheduler" on page 110.

MIN_CKPT_INTERVAL
The minimum number of seconds between checkpoints for running jobs.
Syntax:
   MIN_CKPT_INTERVAL = number
Default value: 900 (15 minutes)
For more information related to using this keyword, see "LoadLeveler support for checkpointing jobs" on page 139.

NEGOTIATOR
Location of the negotiator executable (LoadL_negotiator).
Syntax:
   NEGOTIATOR = directory
Default value: $(BIN)/LoadL_negotiator
For more information related to using this keyword, see "How LoadLeveler daemons process jobs" on page 8.

NEGOTIATOR_COREDUMP_DIR
Local directory for storing LoadL_negotiator core dump files.
Syntax:
   NEGOTIATOR_COREDUMP_DIR = directory
Default value: The /tmp directory.
For more information related to using this keyword, see "Specifying file and directory locations" on page 47.

NEGOTIATOR_CYCLE_DELAY
Specifies the minimum time, in seconds, the negotiator delays between periods when it attempts to schedule jobs. This time is used by the negotiator daemon to respond to queries, reorder job queues, collect information about changes in the states of jobs, and so on. Delaying the scheduling of jobs might improve the overall performance of the negotiator by preventing it from spending excessive time attempting to schedule jobs.
Syntax:
   NEGOTIATOR_CYCLE_DELAY = number
number must be a numerical value and cannot be an arithmetic expression.
Default value: The default is 0 seconds.

NEGOTIATOR_CYCLE_TIME_LIMIT
Specifies the maximum amount of time, in seconds, that LoadLeveler will allow the negotiator to spend in one cycle trying to schedule jobs. The negotiator cycle will end after the specified number of seconds even if there are additional jobs waiting for dispatch. Jobs waiting for dispatch will be considered at the next negotiator cycle. The NEGOTIATOR_CYCLE_TIME_LIMIT keyword applies only to the BACKFILL scheduler.
Syntax:
   NEGOTIATOR_CYCLE_TIME_LIMIT = number
Where number must be a positive integer or zero and cannot be an arithmetic expression.
Default value: If the keyword value is not specified or a value of zero is used, the negotiator cycle will be unlimited.
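In a busy cluster, for example, an administrator might bound each scheduling cycle and add a small delay between cycles; the values below are illustrative tuning choices:
   NEGOTIATOR_CYCLE_DELAY = 10
   NEGOTIATOR_CYCLE_TIME_LIMIT = 120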
NEGOTIATOR_INTERVAL
The time interval, in seconds, at which the negotiator daemon updates the status of jobs in the LoadLeveler cluster and negotiates with machines that are available to run jobs.
Syntax:
   NEGOTIATOR_INTERVAL = number
Where number specifies the interval, in seconds, at which the negotiator daemon performs a "negotiation loop" during which it attempts to assign available machines to waiting jobs. A negotiation loop also occurs whenever job states or machine states change. number must be a numerical value and cannot be an arithmetic expression.
When this keyword is set to zero, the central manager's automatic scheduling activity is disabled, and LoadLeveler will not attempt to schedule any jobs unless instructed to do so through the llrunscheduler command or ll_run_scheduler subroutine.
Default value: The default is 30 seconds.
For more information related to using this keyword, see "Controlling the central manager scheduling cycle" on page 73.

NEGOTIATOR_LOADAVG_INCREMENT
Specifies the value the negotiator adds to the startd machine's load average whenever a job in the Pending state is queued on that machine. This value is used to compensate for the increased load caused by starting another job.
Syntax:
   NEGOTIATOR_LOADAVG_INCREMENT = number
number must be a numerical value and cannot be an arithmetic expression.
Default value: The default value is .5

NEGOTIATOR_PARALLEL_DEFER
Specifies the amount of time, in seconds, that defines how long a job stays out of the queue after it fails to get the correct number of processors. This keyword applies only to the default LoadLeveler scheduler. This keyword must be greater than the NEGOTIATOR_INTERVAL value; if it is not, the default is used.
Syntax:
   NEGOTIATOR_PARALLEL_DEFER = number
number must be a numerical value and cannot be an arithmetic expression.
Default value: The default is NEGOTIATOR_INTERVAL multiplied by 5.

NEGOTIATOR_PARALLEL_HOLD
Specifies the amount of time, in seconds, that defines how long a job is given to accumulate processors. This keyword applies only to the default LoadLeveler scheduler. This keyword must be greater than the NEGOTIATOR_INTERVAL value; if it is not, the default is used.
Syntax:
   NEGOTIATOR_PARALLEL_HOLD = number
number must be a numerical value and cannot be an arithmetic expression.
Default value: The default is NEGOTIATOR_INTERVAL multiplied by 5.
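Example: Doubling the negotiation interval reduces scheduling overhead on a lightly loaded cluster. With the illustrative setting below, the two parallel keywords above would default to 300 seconds (5 times the interval):
   NEGOTIATOR_INTERVAL = 60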
NEGOTIATOR_RECALCULATE_SYSPRIO_INTERVAL
Specifies the amount of time, in seconds, between calculations of the SYSPRIO values for waiting jobs. Recalculating the priority can be CPU-intensive; specifying low values for the NEGOTIATOR_RECALCULATE_SYSPRIO_INTERVAL keyword may lead to a heavy CPU load on the negotiator if a large number of jobs are running or waiting for resources. A value of 0 means the SYSPRIO values are not recalculated. You can use this keyword to base the order in which jobs are run on the current number of running, queued, or total jobs for a user or a group.
Syntax: NEGOTIATOR_RECALCULATE_SYSPRIO_INTERVAL = number
number must be a numerical value and cannot be an arithmetic expression.
Default value: 120 seconds.

NEGOTIATOR_REJECT_DEFER
Specifies the amount of time, in seconds, the negotiator waits before it considers scheduling a job to a machine that recently rejected the job.
Syntax: NEGOTIATOR_REJECT_DEFER = number
number must be a numerical value and cannot be an arithmetic expression.
Default value: 120 seconds.
For related information, see the MAX_JOB_REJECT keyword.

NEGOTIATOR_REMOVE_COMPLETED
Specifies the amount of time, in seconds, that you want the negotiator to keep information about completed and removed jobs so that you can query this information using the llq command.
Syntax: NEGOTIATOR_REMOVE_COMPLETED = number
number must be a numerical value and cannot be an arithmetic expression.
Default value: 0 seconds.

NEGOTIATOR_RESCAN_QUEUE
Specifies the amount of time, in seconds, that the negotiator waits before rescanning the job queue for machines that have bypassed jobs that could not run because of conditions that may change over time. This keyword must be greater than the NEGOTIATOR_INTERVAL value; if it is not, the default is used.
Syntax: NEGOTIATOR_RESCAN_QUEUE = number
number must be a numerical value and cannot be an arithmetic expression.
Default value: 900 seconds.
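For instance (values hypothetical), an administrator who wants completed and removed jobs to remain visible to llq for ten minutes, while recalculating SYSPRIO values less often, might set:

NEGOTIATOR_REMOVE_COMPLETED = 600
NEGOTIATOR_RECALCULATE_SYSPRIO_INTERVAL = 300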
NEGOTIATOR_STREAM_PORT
Specifies the port number used when connecting to the daemon.
Syntax: NEGOTIATOR_STREAM_PORT = port number
Default value: 9614.
For more information related to using this keyword, see “Defining network characteristics” on page 47.

OBITUARY_LOG_LENGTH
Specifies the number of lines from the end of the log file that are appended to the mail message that the master daemon sends to the LoadLeveler administrators when one of the daemons dies.
Syntax: OBITUARY_LOG_LENGTH = number
number must be a numerical value and cannot be an arithmetic expression.
Default value: 25.

POLLING_FREQUENCY
Specifies the interval, in seconds, at which the startd daemon evaluates the load on the local machine and decides whether to suspend, resume, or abort jobs. This is also the minimum interval at which the kbdd daemon reports keyboard or mouse activity to the startd daemon.
Syntax: POLLING_FREQUENCY = number
number must be a numerical value and cannot be an arithmetic expression.
Default value: 5.

POLLS_PER_UPDATE
Specifies how often, in POLLING_FREQUENCY intervals, the startd daemon updates the central manager. Because of the communication overhead, it is impractical to update the central manager at the frequency defined by the POLLING_FREQUENCY keyword. Therefore, the startd daemon updates the central manager only every nth (where n is the number specified for POLLS_PER_UPDATE) local update. Change POLLS_PER_UPDATE when changing POLLING_FREQUENCY.
Syntax: POLLS_PER_UPDATE = number
number must be a numerical value and cannot be an arithmetic expression.
Default value: 24.
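A quick worked example of how these two keywords combine (the defaults first, the alternative pair hypothetical): with POLLING_FREQUENCY = 5 and POLLS_PER_UPDATE = 24, the startd daemon evaluates the local load every 5 seconds but updates the central manager only every 5 * 24 = 120 seconds. The pair

POLLING_FREQUENCY = 10
POLLS_PER_UPDATE = 12

keeps the same 120-second update interval while evaluating the local machine half as often.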
PRESTARTED_STARTERS
Specifies how many prestarted starter processes LoadLeveler maintains on an execution node to manage jobs when they arrive. The startd daemon starts the number of starter processes specified by this keyword. You may specify this keyword in either the global or local configuration file.
Syntax: PRESTARTED_STARTERS = number
number must be less than or equal to the value specified through the MAX_STARTERS keyword. If the value of PRESTARTED_STARTERS is greater than MAX_STARTERS, LoadLeveler records a warning message in the startd log and assigns PRESTARTED_STARTERS the same value as MAX_STARTERS. If the value of PRESTARTED_STARTERS is zero, no starter processes are started before jobs arrive on the execution node.
Default value: 1.

PREEMPT_CLASS
Defines the preemption rule for a job class.
Syntax: The following forms illustrate correct syntax.
PREEMPT_CLASS[incoming_class] = ALL[:preempt_method] { outgoing_class1 [outgoing_class2 ...] }
Using this form, ALL indicates that job steps of incoming_class have priority and will not share nodes with job steps of outgoing_class1, outgoing_class2, or other outgoing classes. If a job step of the incoming_class is to be started on a set of nodes, all job steps of outgoing_class1, outgoing_class2, or other outgoing classes running on those nodes will be preempted.
Note: The ALL preemption rule does not apply to Blue Gene jobs.
PREEMPT_CLASS[incoming_class] = ENOUGH[:preempt_method] { outgoing_class1 [outgoing_class2 ...] }
Using this form, ENOUGH indicates that job steps of incoming_class will share nodes with job steps of outgoing_class1, outgoing_class2, or other outgoing classes if there are sufficient resources. If a job step of the incoming_class is to be started on a set of nodes, one or more job steps of outgoing_class1, outgoing_class2, or other outgoing classes running on those nodes may be preempted to get needed resources.
Combinations of these forms are also allowed.
Note:
1. The optional specification preempt_method indicates which method LoadLeveler is to use to preempt the jobs; this specification is valid only for the BACKFILL scheduler. Valid values for this specification in keyword syntax are the abbreviations shown in parentheses:
v Remove (rm)
v System hold (sh)
v Suspend (su)
v Vacate (vc)
v User hold (uh)
For more information about preemption methods, see “Steps for configuring a scheduler to preempt jobs” on page 130.
2. Using the ALL value in the PREEMPT_CLASS keyword places implied restrictions on when a job can start. See “Planning to preempt jobs” on page 128 for more information.
3. The incoming class is designated inside [ ] brackets.
4. Outgoing classes are designated inside { } curly braces.
5. The job classes on the right-hand (outgoing) side of the statement must be different from the incoming class, or the outgoing side may be allclasses. If the outgoing side is defined as allclasses, then all job classes are preemptable with the exception of the incoming class specified within brackets.
6. A class name or allclasses should not be in both the ALL list and the ENOUGH list. If it is, the entire statement will be ignored. An example of a statement that is ignored:
PREEMPT_CLASS[Class_A]=ALL{allclasses} ENOUGH {allclasses}
7. If you use allclasses as an outgoing (preemptable) class, then no other class names should be listed on the right-hand side, as the entire statement will be ignored. An example of a statement that is ignored:
PREEMPT_CLASS[Class_A]=ALL{Class_B} ENOUGH {allclasses}
8. More than one ALL statement and more than one ENOUGH statement may appear on the right-hand side. Multiple statements have a cumulative effect.
9. Each ALL or ENOUGH statement can have multiple class names inside the curly braces. However, a blank space delimiter is required between each class name.
10. Both the ALL and ENOUGH statements can include an optional specification indicating the method LoadLeveler will use to preempt the jobs. Valid values for this specification are listed in the description of the DEFAULT_PREEMPT_METHOD keyword. If a value is specified on the PREEMPT_CLASS ALL or ENOUGH statement, that value overrides the value set on the DEFAULT_PREEMPT_METHOD keyword, if any.
11. ALL and ENOUGH may be in mixed case.
12. Spaces are allowed around the brackets and curly braces.
13. PREEMPT_CLASS [allclasses] will be ignored.
Default value: No default value is set.
Examples:
PREEMPT_CLASS[Class_B]=ALL{Class_E Class_D} ENOUGH {Class_C}
This indicates that all Class_E jobs, all Class_D jobs, and enough Class_C jobs will be preempted to enable an incoming Class_B job to run.
PREEMPT_CLASS[Class_D]=ENOUGH:VC {Class_E}
This indicates that zero, one, or more Class_E jobs will be preempted using the vacate method to enable an incoming Class_D job to run.
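Because multiple ALL and ENOUGH statements are cumulative and each may carry its own preemption method (notes 8 and 10), a single rule can mix both forms; the class names below are hypothetical:

PREEMPT_CLASS[Class_A] = ALL:SU {Class_E} ENOUGH:VC {Class_C Class_D}

An incoming Class_A job step suspends all Class_E job steps on its nodes and vacates only as many Class_C and Class_D job steps as it needs.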
PREEMPTION_SUPPORT
For the BACKFILL or API schedulers only, specifies the level of preemption support for a cluster.
Syntax: PREEMPTION_SUPPORT = full | no_adapter | none
v When set to full, preemption is fully supported.
v When set to no_adapter, preemption is supported, but the adapter resources are not released by preemption.
v When set to none, preemption is not supported, and preemption requests will be rejected.
Note:
1. If the value of this keyword is set to any value other than none for the default scheduler, LoadLeveler will not start.
2. For the BACKFILL or API scheduler, when this keyword is set to full or no_adapter and preemption by the suspend method is required, the configuration keyword PROCESS_TRACKING must be set to true.
Default value: The default value for all schedulers is none; if you want to enable preemption under these schedulers, you must set a value for this keyword.

PROCESS_TRACKING
Specifies whether LoadLeveler cancels any processes (throughout the entire cluster) left behind when a job terminates.
Syntax: PROCESS_TRACKING = TRUE | FALSE
When set to TRUE, this keyword ensures that when a job is terminated, no processes created by the job continue running.
Note: This keyword must be set to true to allow preemption by the suspend method with the BACKFILL or API scheduler.
Default value: FALSE

PROCESS_TRACKING_EXTENSION
Specifies the directory containing the kernel module LoadL_pt_ke (AIX) or proctrk.ko (Linux).
Syntax: PROCESS_TRACKING_EXTENSION = directory
Default value: The directory $HOME/bin
For more information related to using this keyword, see “Tracking job processes” on page 70.

PUBLISH_OBITUARIES
Specifies whether the master daemon sends mail to the administrator when any daemon it manages ends abnormally. When set to true, this keyword specifies that the master daemon sends mail to the administrators identified by the LOADL_ADMIN keyword.
Syntax: PUBLISH_OBITUARIES = true | false
Default value: true
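Drawing these keywords together, a sketch of a configuration that allows suspend-method preemption under the BACKFILL scheduler (the PROCESS_TRACKING_EXTENSION directory shown is an assumed installation path; omit the line if the kernel module is in the default $HOME/bin):

SCHEDULER_TYPE = BACKFILL
PREEMPTION_SUPPORT = full
PROCESS_TRACKING = TRUE
PROCESS_TRACKING_EXTENSION = /usr/lpp/LoadL/full/bin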
REJECT_ON_RESTRICTED_LOGIN
Specifies whether the user's account status will be checked on every node where the job will run by calling the AIX loginrestrictions function with the S_DIST_CLNT flag.
Restriction: Login restriction checking is ignored by LoadLeveler for Linux.
Login restriction checking includes:
v Does the account still exist?
v Is the account locked?
v Has the account expired?
v Do failed login attempts exceed the limit for this account?
v Is login disabled via /etc/nologin?
If the AIX loginrestrictions function indicates a failure, the user's job will be rejected and processed according to the LoadLeveler configuration parameters MAX_JOB_REJECT and ACTION_ON_MAX_REJECT.
Syntax: REJECT_ON_RESTRICTED_LOGIN = true | false
Default value: false

RELEASEDIR
Defines the directory where all the LoadLeveler software resides.
Syntax: RELEASEDIR = release directory
Default value: $(RELEASEDIR)

RESERVATION_CAN_BE_EXCEEDED
Specifies whether LoadLeveler will schedule job steps that are bound to a reservation when their end times (based on hard wall-clock limits) exceed the reservation end time.
Syntax: RESERVATION_CAN_BE_EXCEEDED = true | false
When this keyword is set to false, LoadLeveler schedules only those job steps that will complete before the reservation ends. When set to true, LoadLeveler schedules job steps to run under a reservation even if their end times are expected to exceed the reservation end time. When the reservation ends, however, the reserved nodes no longer belong to the reservation, so these nodes might not be available for the jobs to continue running. In this case, LoadLeveler might preempt the running jobs.
Note that this keyword setting does not change the actual end time of the reservation. It only affects how LoadLeveler manages job steps whose end times exceed the end time of the reservation.
Default value: true

RESERVATION_HISTORY
Defines the name of a file that is to contain the local history of reservations. LoadLeveler appends a single line to the reservation history file for each completed occurrence of each reservation. For an example, see “Collecting accounting data for reservations” on page 63.
Syntax: RESERVATION_HISTORY = file name
Default value: $(SPOOL)/reservation_history

RESERVATION_MIN_ADVANCE_TIME
Specifies the minimum time, in minutes, between the time at which a reservation is created and the time at which the reservation is to start. By default, the earliest time at which a reservation may start is the current time plus the value set for the RESERVATION_SETUP_TIME keyword.
Syntax: RESERVATION_MIN_ADVANCE_TIME = number of minutes
Default value: 0 (zero)
RESERVATION_PRIORITY
Specifies whether LoadLeveler administrators may reserve nodes on which running jobs are expected to end after the reservation start time. This keyword value applies only for LoadLeveler administrators; other reservation owners do not have this capability.
Syntax: RESERVATION_PRIORITY = NONE | HIGH
When you set this keyword to HIGH, before activating the reservation, LoadLeveler preempts the job steps running on the reserved nodes (Blue Gene job steps are handled the same way). The only exceptions are non-preemptable jobs; LoadLeveler will not preempt those jobs because of any reservation.
Default value: NONE

RESERVATION_SETUP_TIME
Specifies how much time, in seconds, LoadLeveler may use to prepare for a reservation before it is to start. The tasks that LoadLeveler performs during this time include checking and reporting node conditions, and preempting job steps still running on the reserved nodes. For a given reservation, LoadLeveler uses the RESERVATION_SETUP_TIME keyword value that is set at the time the reservation is created, not whatever value might be set when the reservation starts. If the start time of the reservation is modified, however, LoadLeveler uses the RESERVATION_SETUP_TIME keyword value that is set at the time of the modification.
Syntax: RESERVATION_SETUP_TIME = number of seconds
Default value: 60

RESTARTS_PER_HOUR
Specifies how many times the master daemon attempts to restart a daemon that dies abnormally. Because one or more of the daemons may be unable to run due to a permanent error, the master only attempts $(RESTARTS_PER_HOUR) restarts within a 60-minute period. Failing that, it sends mail to the administrators identified by the LOADL_ADMIN keyword and exits.
Syntax: RESTARTS_PER_HOUR = number
number must be a numerical value and cannot be an arithmetic expression.
Default value: 12.
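For example (values hypothetical), a site that requires reservations to be made at least 15 minutes in advance, allows two minutes of setup time, and lets administrator reservations preempt running jobs might set:

RESERVATION_MIN_ADVANCE_TIME = 15
RESERVATION_SETUP_TIME = 120
RESERVATION_PRIORITY = HIGH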
RESUME_ON_SWITCH_TABLE_ERROR_CLEAR
Specifies whether a startd that was drained when the switch table failed to unload will automatically resume once the unload errors are cleared. The unload error is considered cleared after LoadLeveler can successfully unload the switch table. For this keyword to work, the DRAIN_ON_SWITCH_TABLE_ERROR option in the configuration file must be turned on and not disabled. Flushing, suspending, or draining a startd, whether manually or automatically, disables this option until the startd is manually resumed.
Syntax: RESUME_ON_SWITCH_TABLE_ERROR_CLEAR = true | false
Default value: false

RSET_SUPPORT
Indicates the level of RSet support present on a machine.
Syntax: RSET_SUPPORT = option
The available options are:
RSET_MCM_AFFINITY
Indicates that the machine can run jobs requesting MCM (memory or adapter) and processor (cache or SMT) affinity.
RSET_NONE
Indicates that LoadLeveler RSet support is not available on the machine.
RSET_USER_DEFINED
Indicates that the machine can be used for jobs with a user-created RSet in their job command file.
Default value: RSET_NONE

SAVELOGS
Specifies the directory in which log files are archived.
Syntax: SAVELOGS = directory
Where directory is the directory in which log files will be archived.
Default value: No default value is set.
For more information related to using this keyword, see “Configuring recording activity and log files” on page 48.

SAVELOGS_COMPRESS_PROGRAM
Compresses logs after they are copied to the SAVELOGS directory. If not specified, logs are copied to SAVELOGS but are not compressed.
Syntax: SAVELOGS_COMPRESS_PROGRAM = program
Where program is a specific executable program. It can be a system-provided facility (such as /bin/gzip) or an administrator-provided executable program. The value must be a full path name and can contain command-line arguments. LoadLeveler will call the program as: program filename.
Default value: If blank, the logs are not compressed.
Example: In the following example, LoadLeveler runs the gzip -f command, and each log file in SAVELOGS is compressed after it is copied there. If the program cannot be found or is not executable, LoadLeveler logs the error and SAVELOGS remains uncompressed.
SAVELOGS_COMPRESS_PROGRAM = /bin/gzip -f
SCALE_ACROSS_SCHEDULING_TIMEOUT
Defines the amount of time a central manager will wait:
v For the main cluster central manager, this value defines the wait time for responses from the non-main cluster central managers when it is scheduling scale-across jobs.
v For the non-main cluster central managers, this value limits how long the central manager on each non-main cluster will hold resources for a scale-across job step while waiting for an order to start the job.
Syntax: scale_across_scheduling_timeout = number
Default value: 300 seconds

SCHEDD
Location of the Schedd executable (LoadL_schedd).
Syntax: SCHEDD = directory
Default value: $(BIN)/LoadL_schedd
For more information related to using this keyword, see “How LoadLeveler daemons process jobs” on page 8.

SCHEDD_COREDUMP_DIR
Specifies the local directory for storing LoadL_schedd core dump files.
Syntax: SCHEDD_COREDUMP_DIR = directory
Default value: The /tmp directory.
For more information related to using this keyword, see “Specifying file and directory locations” on page 47.

SCHEDD_INTERVAL
Specifies the interval, in seconds, at which the Schedd daemon checks the local job queue and updates the negotiator daemon.
Syntax: SCHEDD_INTERVAL = number
number must be a numerical value and cannot be an arithmetic expression.
Default value: 60 seconds.

SCHEDD_RUNS_HERE
Specifies whether the Schedd daemon runs on the host. If you do not want to run the Schedd daemon, specify false.
This keyword does not designate a machine as a public scheduling machine. Unless configured as a public scheduling machine, a machine configured to run the Schedd daemon accepts job submissions only from the same machine running the Schedd daemon. A public scheduling machine accepts job submissions from other machines in the LoadLeveler cluster. To configure a machine as a public scheduling machine, see the schedd_host keyword description in “Administration file keyword descriptions” on page 327.
Syntax: SCHEDD_RUNS_HERE = true | false
Default value: true
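As an illustration of the global and local configuration interplay (a hypothetical layout, assuming local configuration files may override this keyword in the same way they can override START_DAEMONS), a site that funnels submissions through dedicated scheduling nodes could leave the global default in place and add the following to the local configuration file of each compute-only node:

SCHEDD_RUNS_HERE = false

Submissions from those nodes would then be directed to the machines that still run Schedd and are configured as public scheduling machines through the schedd_host administration file keyword.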
SCHEDD_SUBMIT_AFFINITY
Specifies whether job submissions are directed to a locally running Schedd daemon. When the keyword is set to true, job submissions are directed to a Schedd daemon running on the same machine where the submission takes place, provided there is a Schedd daemon running on that machine. In this case the submission is said to have "affinity" for the local Schedd daemon. If there is no Schedd daemon running on the machine where the submission takes place, or if this keyword is set to false, the job submission will only be directed to a Schedd daemon serving as a public scheduling machine. In this case, if there are no public scheduling machines configured, the job cannot be submitted.
A public scheduling machine accepts job submissions from other machines in the LoadLeveler cluster. To configure a machine as a public scheduling machine, see the schedd_host keyword description in “Administration file keyword descriptions” on page 327.
Installations with a large number of nodes should consider setting this keyword to false to distribute the dispatching of jobs more evenly among the Schedd daemons. For more information, see “Scaling considerations” on page 719.
Syntax: SCHEDD_SUBMIT_AFFINITY = true | false
Default value: true

SCHEDD_STATUS_PORT
Specifies the port number used when connecting to the daemon.
Syntax: SCHEDD_STATUS_PORT = port number
Default value: 9606.
For more information related to using this keyword, see “Defining network characteristics” on page 47.

SCHEDD_STREAM_PORT
Specifies the port number used when connecting to the daemon.
Syntax: SCHEDD_STREAM_PORT = port number
Default value: 9605.
For more information related to using this keyword, see “Defining network characteristics” on page 47.

SCHEDULE_BY_RESOURCES
Specifies which consumable resources are considered by the LoadLeveler schedulers. Each consumable resource name may be an administrator-defined alphanumeric string, or may be one of the following predefined resources:
v ConsumableCpus
v ConsumableMemory
v ConsumableVirtualMemory
v ConsumableLargePageMemory
v RDMA
Each string may appear in the list only once. These resources are either floating resources or machine resources. If any resource is specified incorrectly with the SCHEDULE_BY_RESOURCES keyword, then all scheduling resources will be ignored.
Syntax: SCHEDULE_BY_RESOURCES = name name ... name
Default value: No default value is set.
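For example, to have the schedulers consider CPUs, real memory, and a hypothetical administrator-defined floating resource named FloatingLicenses:

SCHEDULE_BY_RESOURCES = ConsumableCpus ConsumableMemory FloatingLicenses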
SCHEDULER_TYPE
Specifies the LoadLeveler scheduling algorithm:
LL_DEFAULT
Specifies the default LoadLeveler scheduling algorithm. If SCHEDULER_TYPE has not been defined, LoadLeveler will use the default scheduler (LL_DEFAULT).
BACKFILL
Specifies the LoadLeveler BACKFILL scheduler. When you specify this keyword, you should use only the default settings for the START expression and the other job control expressions described in “Managing job status through control expressions” on page 68.
API
Specifies that you will use an external scheduler. External schedulers communicate with LoadLeveler through the job control API. For more information on setting up an external scheduler, see “Using an external scheduler” on page 115.
Syntax: SCHEDULER_TYPE = LL_DEFAULT | BACKFILL | API
Default value: LL_DEFAULT
Note:
1. If a scheduler type is not set, LoadLeveler will start, but it will use the default scheduler.
2. If you have set SCHEDULER_TYPE to an option that is not valid, LoadLeveler will not start.
3. If you change the scheduler option specified by SCHEDULER_TYPE, you must stop and restart, or recycle, LoadLeveler using llctl.
For more information related to using this keyword, see “Defining a LoadLeveler cluster” on page 44.
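A minimal sketch of switching a cluster to the BACKFILL scheduler (the llctl invocation shown is an assumption about the command form; see the llctl command documentation for the exact syntax your installation uses): add

SCHEDULER_TYPE = BACKFILL

to the global configuration file, then stop and restart or recycle LoadLeveler, for example:

llctl -g recycle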
SEC_ADMIN_GROUP
When security services are enabled, this keyword points to the name of the UNIX group that contains the local identities of the LoadLeveler administrators.
Restriction: CtSec security is not supported on LoadLeveler for Linux.
Syntax: SEC_ADMIN_GROUP = name of lladmin group
Default value: No default value is set.
For more information related to using this keyword, see “Configuring LoadLeveler to use cluster security services” on page 57.

SEC_ENABLEMENT
Specifies the security mechanism to be used.
Restriction: Do not set this keyword to CTSEC in the configuration file for a Linux machine. CtSec security is not supported on LoadLeveler for Linux.
Syntax: SEC_ENABLEMENT = COMPAT | CTSEC
Default value: No default value is set.

SEC_SERVICES_GROUP
When security services are enabled, this keyword specifies the name of the LoadLeveler services group.
Restriction: CtSec security is not supported on LoadLeveler for Linux.
Syntax: SEC_SERVICES_GROUP = group name
Where group name defines the identities of the LoadLeveler daemons.
Default value: No default value is set.

SEC_IMPOSED_MECHS
Specifies a blank-delimited list of LoadLeveler's permitted security mechanisms when Cluster Security (CtSec) services are enabled.
Restriction: CtSec security is not supported on LoadLeveler for Linux.
Syntax: Specify a blank-delimited list containing combinations of the following values:
none
If this is the only value specified, then users will run unauthenticated and, if authorization is necessary, the job will fail. If this is not the only value specified, then users may run unauthenticated and, if authorization is necessary, the job will fail.
unix
If this is the only value specified, then UNIX host-based authentication will be used; otherwise, other mechanisms may be used.
Default value: No default value is set.
Example: SEC_IMPOSED_MECHS = none unix

SPOOL
Defines the local directory where LoadLeveler keeps the local job queue and checkpoint files.
Syntax: SPOOL = local directory/spool
Default value: $(tilde)/spool

START
Determines whether a machine can run a LoadLeveler job.
Syntax: START: expression that evaluates to T or F (true or false)
When the expression evaluates to T, LoadLeveler considers dispatching a job to the machine.
When you use a START expression that is based on the CPU load average, the negotiator may evaluate the expression as F even though the load average indicates the machine is Idle. This is because the negotiator adds a compensating factor to the startd machine's load average every time the negotiator assigns a job. For more information, see the NEGOTIATOR_INTERVAL keyword.
Default value: No default value is set, which means that no jobs will be started.
For information about time-related variables that you may use for this keyword, see “Variables to use for setting times” on page 320.
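Two illustrative START expressions; the first is the simplest possible form, and the second assumes the tm_hour and tm_wday time variables are among those described under “Variables to use for setting times” on page 320:

START : T

starts jobs unconditionally. A machine that should accept jobs only on weekends or outside business hours might instead use something like:

START : (tm_wday == 0) || (tm_wday == 6) || (tm_hour < 8) || (tm_hour >= 18)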
START_CLASS
Specifies the rule for starting a job of the incoming_class. The START_CLASS rule is applied whenever the BACKFILL scheduler decides whether a job step of the incoming_class should start.
Syntax: START_CLASS[incoming_class] = (start_class_expression) [ && (start_class_expression) ...]
Where start_class_expression takes the form:
run_class < number_of_tasks
which indicates that a job step of the incoming_class is only allowed to run on a node when the number of tasks of run_class running on that node is less than number_of_tasks.
Note:
1. START_CLASS [allclasses] will be ignored.
2. The job class specified by run_class may be the same as or different from the class specified by incoming_class.
3. You can also define run_class as allclasses. If you do, the total number of all job tasks running on that node cannot exceed the value specified by number_of_tasks.
4. A class name or allclasses should not appear twice on the right-hand side of the keyword statement. However, you can use other class names together with allclasses on the right-hand side of the statement.
5. If there is more than one start_class_expression, you must use && between adjacent start_class_expressions (see the combined example after this entry).
6. Both the START keyword and the START_CLASS keyword have to be true before a new job can start.
7. Parentheses ( ) are optional around start_class_expression.
For information related to using this keyword, see “Planning to preempt jobs” on page 128.
Default value: No default value is set.
Examples:
START_CLASS[Class_A] = (Class_A < 1)
This statement indicates that a Class_A job can only start on nodes that do not have any Class_A jobs running.
START_CLASS[Class_B] = allclasses < 5
This statement indicates that a Class_B job can only start on nodes with at most four tasks running.
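Combining notes 4 and 5, multiple expressions can be joined with && to constrain both a specific class and the overall task count; the class names here are hypothetical:

START_CLASS[Class_C] = (Class_A < 1) && (allclasses < 3)

A Class_C job step starts on a node only when no Class_A tasks are running there and fewer than three tasks of any class are running there.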
START_DAEMONS
Specifies whether to start the LoadLeveler daemons on the node.
Syntax: START_DAEMONS = true | false
Default value: true
When true, the daemons are started. In most cases, you will probably want to set this keyword to true. One reason to set it to false would be if you want to run the daemons on most of the machines in the cluster, but some individual users with their own local configuration files do not want their machines to run the daemons. Those users would set this keyword to false in their local configuration files; because the global configuration file has the keyword set to true, their individual machines would still be able to participate in the LoadLeveler cluster.
Also, to define the machine as strictly a submit-only machine, set this keyword to false.

STARTD
Location of the startd executable (LoadL_startd).
Syntax: STARTD = directory
Default value: $(BIN)/LoadL_startd
For more information related to using this keyword, see “How LoadLeveler daemons process jobs” on page 8.

STARTD_COREDUMP_DIR
Local directory for storing LoadL_startd core dump files.
Syntax: STARTD_COREDUMP_DIR = directory
Default value: The /tmp directory.
For more information related to using this keyword, see “Specifying file and directory locations” on page 47.

STARTD_DGRAM_PORT
Specifies the port number used when connecting to the daemon.
Syntax: STARTD_DGRAM_PORT = port number
Default value: 9615.
For more information related to using this keyword, see “Defining network characteristics” on page 47.

STARTD_RUNS_HERE
Specifies whether the startd daemon runs on the host. If you do not want to run the startd daemon, specify false.
Syntax: STARTD_RUNS_HERE = true | false
Default value: true

STARTD_STREAM_PORT
Specifies the port number used when connecting to the daemon.
Syntax: STARTD_STREAM_PORT = port number
Default value: 9611.
For more information related to using this keyword, see “Defining network characteristics” on page 47.
STARTER
Location of the starter executable (LoadL_starter).
Syntax: STARTER = directory
Default value: $(BIN)/LoadL_starter
For more information related to using this keyword, see “How LoadLeveler daemons process jobs” on page 8.

STARTER_COREDUMP_DIR
Local directory for storing LoadL_starter core dump files.
Syntax: STARTER_COREDUMP_DIR = directory
Default value: The /tmp directory.
For more information related to using this keyword, see “Specifying file and directory locations” on page 47.

SUBMIT_FILTER
Specifies the program you want to run to filter a job script when the job is submitted.
Syntax: SUBMIT_FILTER = full_path_to_executable
Where full_path_to_executable is called with the job command file as the standard input. The standard output is submitted to LoadLeveler. If the program returns with a nonzero exit code, the job submission is canceled. A submit filter can make changes only to LoadLeveler job command file keyword statements.
Default value: No default value is set.
Multicluster use: In a multicluster environment, if you specified a valid cluster list with either the llsubmit -X option or the ll_cluster API, then the SUBMIT_FILTER will instead be invoked with a modified job command file that contains a cluster_list keyword generated from either the llsubmit -X option or the ll_cluster API. The modified job command file will contain an inserted # @ cluster_list = cluster statement just prior to the first # @ queue statement. This cluster_list statement takes precedence and overrides all previous specifications of any cluster_list statements from the original job command file.
Example: SUBMIT_FILTER in a multicluster environment
The following job command file, job.cmd, requests to be run remotely on cluster1:
#!/bin/sh
# @ cluster_list = cluster1
# @ error = job1.$(Host).$(Cluster).$(Process).err
# @ output = job1.$(Host).$(Cluster).$(Process).out
# @ queue
After issuing llsubmit -X cluster2 job.cmd, the modified job command file statements will be run on cluster2:
#!/bin/sh
# @ cluster_list = cluster1
# @ error = job1.$(Host).$(Cluster).$(Process).err
# @ output = job1.$(Host).$(Cluster).$(Process).out
# @ cluster_list = cluster2
# @ queue
For more information related to using this keyword, see “Filtering a job script” on page 76.
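The contract described above, receiving the job command file on standard input, writing the file to be submitted on standard output, and returning a nonzero exit code to cancel the submission, can be met by a small shell script. The sketch below is illustrative only; the installation path and the rejected class name are hypothetical:

SUBMIT_FILTER = /usr/local/sbin/llsubmit_filter

#!/bin/sh
# Hypothetical submit filter (/usr/local/sbin/llsubmit_filter):
# copy the job command file through unchanged, but cancel the
# submission if any statement requests the class "restricted".
while IFS= read -r line; do
  case "$line" in
    '# @ class = restricted'*)
      echo "submit filter: class restricted is not allowed" >&2
      exit 1    # a nonzero exit code cancels the submission
      ;;
  esac
  printf '%s\n' "$line"    # standard output is what LoadLeveler submits
done
exit 0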
SUSPEND
Determines whether running jobs should be suspended.
Syntax: SUSPEND: expression that evaluates to T or F (true or false)
When T, LoadLeveler temporarily suspends jobs currently running on the machine. Suspended LoadLeveler jobs will either be continued or vacated. This keyword is not supported for parallel jobs.
Default value: No default value is set.
For information about time-related variables that you may use for this keyword, see “Variables to use for setting times” on page 320.

SYSPRIO
System priority expression.
Syntax: SYSPRIO : expression
You can use the following LoadLeveler variables to define the SYSPRIO expression:
v ClassSysprio
v GroupQueuedJobs
v GroupRunningJobs
v GroupSysprio
v GroupTotalJobs
v GroupTotalShares
v GroupUsedBgShares
v GroupUsedShares
v JobIsBlueGene
v QDate
v UserHoldTime
v UserPrio
v UserQueuedJobs
v UserRunningJobs
v UserSysprio
v UserTotalJobs
v UserTotalShares
v UserUsedBgShares
v UserUsedShares
For detailed descriptions of these variables, see “LoadLeveler variables” on page 314.
Default value: 0 (zero)
Note:
1. The SYSPRIO keyword is valid only on the machine where the central manager is running. Using this keyword in a local configuration file has no effect.
2. It is recommended that you do not use UserPrio in the SYSPRIO expression, since user jobs are already ordered by UserPrio.
3. The string SYSPRIO can be used as both the name of an expression (SYSPRIO: value) and the name of a variable (SYSPRIO = value). To specify the expression to be used to calculate job priority, you must use the syntax for the SYSPRIO expression. If the variable is mistakenly used for the SYSPRIO expression, which requires a colon (:) after the name, the job priority value will always be 0 because the SYSPRIO expression has not been defined.
4. When the UserRunningJobs, GroupRunningJobs, UserQueuedJobs, GroupQueuedJobs, UserTotalJobs, GroupTotalJobs, GroupTotalShares, GroupUsedShares, UserTotalShares, UserUsedShares, GroupUsedBgShares, JobIsBlueGene, and UserUsedBgShares variables are used to prioritize the queue based on current usage, you should also set NEGOTIATOR_RECALCULATE_SYSPRIO_INTERVAL so that the priorities are adjusted according to current usage rather than usage only at submission time.
Examples:
v Example 1
This example creates a FIFO job queue based on submission time:
SYSPRIO : 0 - (QDate)
v Example 2
This example accounts for Class, User, and Group system priorities:
SYSPRIO : (ClassSysprio * 100) + (UserSysprio * 10) + (GroupSysprio * 1) - (QDate)
v Example 3
This example orders the queue based on the number of jobs a user is currently running. The user who has the fewest jobs running is first in the queue. You should set NEGOTIATOR_RECALCULATE_SYSPRIO_INTERVAL in conjunction with this SYSPRIO expression.
SYSPRIO : 0 - UserRunningJobs
v Example 4
This example shows one possible way to set up the SYSPRIO expression for fair share scheduling. For those jobs whose owner has no unused shares ($(UserHasShares) = 0), job priority depends only on QDate, making it a simple FIFO queue as in Example 1. For those jobs whose owner has unused shares ($(UserHasShares) = 1), job priority depends not only on QDate, but also on a uniform boost of 31 536 000 (the equivalent of the job being submitted one year earlier). These jobs still have priority differences because of submit time differences. It is like forming two priority tiers: the higher priority tier for jobs with unused shares and the lower priority tier for jobs without unused shares.
SYSPRIO: 31536000 * $(UserHasShares) - QDate
v Example 5
This example divides the jobs into three priority tiers:
– Those jobs whose owner and group both have unused shares are at the top tier
– Those jobs whose owner or group has unused shares are at the middle tier
– Those jobs whose owner and group both have no shares remaining are at the bottom tier
A user can submit two jobs to two different groups, the first job to a group with shares remaining and the second job to a group without any unused shares. If the user has unused shares, the first job will belong to the top tier and the second job will belong to the middle tier. If the user has no shares remaining, the first job will belong to the middle tier and the second job will